Commit dfd82f4

Experimental fix for a worker-setup concurrency issue
If some workers crashed in startup, other workers could start processing their regular workloads instead of aborting early. I think this might be related to failing to check for interrupts before we start processing.
1 parent 1b064f4 commit dfd82f4

1 file changed: 18 additions, 6 deletions
jepsen/src/jepsen/core.clj

@@ -34,14 +34,23 @@
             [tea-time.core :as tt]
             [slingshot.slingshot :refer [try+ throw+]])
   (:import (java.util.concurrent CyclicBarrier
-                                 CountDownLatch)))
+                                 CountDownLatch
+                                 TimeUnit)))
 
 (defn synchronize
-  "A synchronization primitive for tests. When invoked, blocks until all
-  nodes have arrived at the same point."
-  [test]
-  (or (= ::no-barrier (:barrier test))
-      (.await ^CyclicBarrier (:barrier test))))
+  "A synchronization primitive for tests. When invoked, blocks until all nodes
+  have arrived at the same point.
+
+  This is often used in IO-heavy DB setup code to ensure all nodes have
+  completed some phase of execution before moving on to the next. However, if
+  an exception is thrown by one of those threads, the call to `synchronize`
+  will deadlock! To avoid this, we include a default timeout of 60 seconds,
+  which can be overridden by passing an alternate timeout in seconds."
+  ([test]
+   (synchronize test 60))
+  ([test timeout-s]
+   (or (= ::no-barrier (:barrier test))
+       (.await ^CyclicBarrier (:barrier test) timeout-s TimeUnit/SECONDS))))
 
 (defn conj-op!
   "Add an operation to a tests's history, and returns the operation."
@@ -170,6 +179,9 @@
     (with-thread-name (str "jepsen " name)
       (try (info "Starting" name)
            (setup-worker! worker)
+           (when (.interrupted (Thread/currentThread))
+             (throw InterruptedException. "Interrupted before running"))
+
            (try (.countDown run-latch)
                 (info "Running" name)
                 (run-worker! worker)
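
The two mechanisms this commit relies on, a timed barrier wait and an interrupt check before the run phase, can be sketched in isolation as follows. This is a minimal standalone illustration, not code from Jepsen: the helper names `await-with-timeout` and `check-interrupt!` are hypothetical. The sketch also assumes the intent of the added check is to call the static method `Thread/interrupted` and to throw a constructed exception, so it writes `(Thread/interrupted)` and `(throw (InterruptedException. ...))`. The diff as committed instead calls `.interrupted` on a thread instance and passes the class name and message to `throw` as separate forms, which may not behave as intended.

```clojure
(import '(java.util.concurrent CyclicBarrier TimeUnit TimeoutException))

(defn await-with-timeout
  "Hypothetical helper: waits on barrier for up to timeout-s seconds.
  Returns :ok if every party arrived, or :timed-out if the wait expired."
  [^CyclicBarrier barrier timeout-s]
  (try
    ;; CyclicBarrier.await(long, TimeUnit) throws TimeoutException when the
    ;; remaining parties fail to arrive in time, instead of blocking forever.
    (.await barrier (long timeout-s) TimeUnit/SECONDS)
    :ok
    (catch TimeoutException _
      :timed-out)))

(defn check-interrupt!
  "Hypothetical helper: throws InterruptedException if the current thread's
  interrupt flag is set. Thread/interrupted is static and clears the flag."
  []
  (when (Thread/interrupted)
    (throw (InterruptedException. "Interrupted before running"))))

;; A two-party barrier where the second party never arrives, as happens
;; when another worker crashes during setup.
(await-with-timeout (CyclicBarrier. 2) 1) ; returns :timed-out after about 1 s
```

When a timed wait expires, the barrier is left in a broken state, so any other threads waiting on it receive a `BrokenBarrierException` rather than hanging.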
