Mock connections more accurately in DisruptableMockTransport #37296

ywelsch · 2019-01-10T11:36:17Z

This PR moves DisruptableMockTransport to a more accurate representation of connection management, which lets us use the full connection manager without mocking out any of its behavior. With this, we can implement restarting nodes in CoordinatorTests.

elasticmachine · 2019-01-10T11:36:19Z

Pinging @elastic/es-distributed

DaveCTurner

I like this change, it makes a lot more sense. I pointed out a handful of places where we might want to be a bit more badly behaved and suggested a couple of naming changes.

DaveCTurner · 2019-01-10T16:13:17Z

server/src/main/java/org/elasticsearch/cluster/coordination/Join.java

        return targetNode;
    }

+    public boolean matchesTarget(DiscoveryNode matchingNode) {


Maybe targetMatches?

thanks, mine didn't feel quite right.

DaveCTurner · 2019-01-10T16:23:07Z

test/framework/src/main/java/org/elasticsearch/test/disruption/DisruptableMockTransport.java

+            });
+            return () -> {};
+        } else {
+            throw new ConnectTransportException(node, "node " + node + " does not exist");


I think this is OK for now, but in future (hoho) we will want this to be async and/or to time out on an unknown node.

this depends on the connection manager becoming async. Right now there's a Future.get() waiting for us behind this call.

DaveCTurner · 2019-01-10T16:35:38Z

server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

+            } else if (nodeExists(sender) && nodeExists(destination)) {
                connectionStatus = ConnectionStatus.CONNECTED;
+            } else {
+                connectionStatus = ConnectionStatus.DISCONNECTED;


I think it'd be good to test both DISCONNECTED and BLACK_HOLE here, perhaps using mostly the same value for the duration of a test.
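One way to read this suggestion is sketched below. It is only an illustration, not the code from this PR: the class `UnknownNodeStatusSketch`, its methods, and the 1-in-10 deviation rate are all assumptions, and the `ConnectionStatus` enum is a stand-in for the one used by the test transport. Each test run picks one preferred status for unknown nodes and only occasionally returns the other.

```java
import java.util.Random;

public class UnknownNodeStatusSketch {
    // Stand-in for the ConnectionStatus enum used by the test transport (assumption).
    enum ConnectionStatus { CONNECTED, DISCONNECTED, BLACK_HOLE }

    private final Random random;
    private final ConnectionStatus preferredStatus;

    UnknownNodeStatusSketch(long seed) {
        this.random = new Random(seed);
        // Chosen once, so a single test run mostly sees the same behaviour.
        this.preferredStatus = random.nextBoolean()
            ? ConnectionStatus.DISCONNECTED
            : ConnectionStatus.BLACK_HOLE;
    }

    ConnectionStatus preferredStatus() {
        return preferredStatus;
    }

    // Status to report when the sender or the destination does not exist.
    ConnectionStatus statusForUnknownNode() {
        // Hypothetical rate: roughly 1 call in 10 returns the non-preferred status.
        if (random.nextInt(10) == 0) {
            return preferredStatus == ConnectionStatus.DISCONNECTED
                ? ConnectionStatus.BLACK_HOLE
                : ConnectionStatus.DISCONNECTED;
        }
        return preferredStatus;
    }

    public static void main(String[] args) {
        UnknownNodeStatusSketch sketch = new UnknownNodeStatusSketch(42L);
        System.out.println("preferred: " + sketch.preferredStatus());
        for (int i = 0; i < 5; i++) {
            System.out.println(sketch.statusForUnknownNode());
        }
    }
}
```

Seeding the generator keeps a failing run reproducible, which matters for tests that are otherwise deterministic.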

DaveCTurner · 2019-01-10T16:38:58Z

test/framework/src/main/java/org/elasticsearch/test/disruption/DisruptableMockTransport.java

+    protected abstract Optional<DisruptableMockTransport> getDisruptableMockTransport(TransportAddress address);

-    protected abstract void handle(DiscoveryNode sender, DiscoveryNode destination, String action, Runnable doDelivery);
+    protected abstract void schedule(Runnable runnable);


I think the name execute would be more consistent with things like ExecutorService. schedule is largely used for delayed execution (ignoring that the DeterministicTaskQueue uses scheduleNow for this).

DaveCTurner · 2019-01-10T16:40:56Z

server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

+                        logger.debug("----> [runRandomly {}] rebooting [{}]", thisStep, clusterNode.getId());
+                        clusterNode.close();
+                        clusterNodes.forEach(cn -> cn.onNode(
+                            () -> cn.transportService.disconnectFromNode(clusterNode.getLocalNode())).run());


I think we should delay these disconnections. Maybe we should rarely delay them by a lot.

Removing this line makes the tests fail; I will need to look into why that's the case. Do you think this should possibly be delayed beyond the safety phase?

I think completely removing it is unrealistic, but we may not get a disconnection event for quite some time (up to ~15 minutes by default on Linux). I do not think it should be delayed beyond the safety phase.

OK, I've made the corresponding changes. AFAICS the reason we can't extend it beyond the safety phase is that PeerFinder will not start connecting to the new node as long as the transport claims that the old node with the same address is still connected.

Apologies, the 15 minutes example wasn't supposed to be a suggestion. I think just scheduling it as a normal delayed action is sufficient, given that EXTREME_DELAY_VARIABILITY is mostly in force. I think this also means we don't need to clean it up specially, because we reduce the delay variability down to something reasonable for the end of the safety phase, and it shouldn't matter if it occurs within the first DEFAULT_DELAY_VARIABILITY of the stabilisation phase.

I've pushed efa0728
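The idea of scheduling the disconnection as a normal delayed action can be illustrated with a small, self-contained sketch. This is not the DeterministicTaskQueue from the Elasticsearch test framework: the class `DelayedDisconnectSketch`, its methods, and the way the delay is drawn are assumptions made for illustration only.

```java
import java.util.PriorityQueue;
import java.util.Random;

public class DelayedDisconnectSketch {
    // A task with a virtual execution time; seq keeps insertion order for ties.
    private static final class Task implements Comparable<Task> {
        final long time;
        final long seq;
        final Runnable action;

        Task(long time, long seq, Runnable action) {
            this.time = time;
            this.seq = seq;
            this.action = action;
        }

        @Override
        public int compareTo(Task other) {
            int byTime = Long.compare(time, other.time);
            return byTime != 0 ? byTime : Long.compare(seq, other.seq);
        }
    }

    private final PriorityQueue<Task> queue = new PriorityQueue<>();
    private final Random random;
    private long now;
    private long nextSeq;

    DelayedDisconnectSketch(long seed) {
        this.random = new Random(seed);
    }

    // Schedules the action after a random delay in [0, delayVariabilityMillis).
    void scheduleWithRandomDelay(long delayVariabilityMillis, Runnable action) {
        long delay = (long) (random.nextDouble() * delayVariabilityMillis);
        queue.add(new Task(now + delay, nextSeq++, action));
    }

    // Advances virtual time, running every task due at or before the given time.
    void runUntil(long time) {
        while (!queue.isEmpty() && queue.peek().time <= time) {
            Task task = queue.poll();
            now = task.time;
            task.action.run();
        }
        now = Math.max(now, time);
    }

    long currentTimeMillis() {
        return now;
    }

    public static void main(String[] args) {
        DelayedDisconnectSketch tasks = new DelayedDisconnectSketch(1L);
        // Hypothetical: each remaining node notices the reboot at a different time.
        for (String node : new String[] {"node1", "node2", "node3"}) {
            tasks.scheduleWithRandomDelay(1000, () ->
                System.out.println(node + " disconnects at " + tasks.currentTimeMillis()));
        }
        tasks.runUntil(1000);
    }
}
```

Because the delay is bounded by the variability in force when the task is scheduled, no special cleanup is needed in this sketch: running the queue up to that bound executes every pending disconnection.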

DaveCTurner · 2019-01-10T16:48:15Z

Also note I haven't run a soak test, my CI machine is otherwise engaged.

original-brownbear

LGTM just one (somewhat irrelevant) suggestion and one question :)

original-brownbear · 2019-01-10T15:35:08Z

server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

                mockTransport = new DisruptableMockTransport(logger) {
                    @Override
-                    protected DiscoveryNode getLocalNode() {
+                    public DiscoveryNode getLocalNode() {


It seems that every implementation of DisruptableMockTransport implements this as a simple getter returning a constant local node. Maybe just move that getter up into DisruptableMockTransport and pass the node as a constructor parameter while we're changing this anyway? (Just to save a bit of noise in the concrete tests :))
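A rough sketch of what this refactoring might look like follows. The types here are simplified stand-ins, not the real classes: `Node` replaces DiscoveryNode, `MockTransportBase` replaces DisruptableMockTransport, and the single abstract `execute` method is only there to show that subclasses keep overriding behaviour while the getter moves into the base class.

```java
import java.util.Objects;

public class LocalNodeConstructorSketch {
    // Simplified stand-in for DiscoveryNode (assumption).
    static final class Node {
        final String id;

        Node(String id) {
            this.id = id;
        }
    }

    // Simplified stand-in for the abstract mock transport (assumption).
    abstract static class MockTransportBase {
        private final Node localNode;

        protected MockTransportBase(Node localNode) {
            this.localNode = Objects.requireNonNull(localNode);
        }

        // Implemented once here, so subclasses no longer override a getter.
        public Node getLocalNode() {
            return localNode;
        }

        protected abstract void execute(Runnable runnable);
    }

    public static void main(String[] args) {
        Node local = new Node("node0");
        MockTransportBase transport = new MockTransportBase(local) {
            @Override
            protected void execute(Runnable runnable) {
                runnable.run(); // run inline; a real test would hand this to a task queue
            }
        };
        transport.execute(() -> System.out.println("local node: " + transport.getLocalNode().id));
    }
}
```

With the node passed in at construction, each anonymous subclass in a test only has to supply the behaviour that actually differs.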

server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

DaveCTurner

LGTM, but needs a soak test to be sure.

Emulate connections more accurately

35de95f

ywelsch added the >non-issue, >test, v7.0.0, and :Distributed Coordination/Cluster Coordination labels Jan 10, 2019

ywelsch requested review from DaveCTurner and original-brownbear January 10, 2019 11:36

ywelsch mentioned this pull request Jan 10, 2019

A new cluster coordination layer #32006

Closed

61 tasks

fix assertion

7bd4f68

DaveCTurner reviewed Jan 10, 2019

original-brownbear approved these changes Jan 10, 2019

ywelsch added 6 commits January 10, 2019 23:22

join.targetMatches

df9912a

schedule -> execute

c60a8ac

move DiscoveryNode to constructor

a82bcbd

random unknown node connection status

4916f9b

randomize time of disconnect

cbd1c1a

why not 15

dff6fbb

ywelsch requested a review from DaveCTurner January 11, 2019 08:47

no cleanup for disconnect

efa0728

DaveCTurner approved these changes Jan 11, 2019

ywelsch merged commit f4abf96 into elastic:master Jan 11, 2019

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

5 participants
