-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Mock connections more accurately in DisruptableMockTransport #37296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mock connections more accurately in DisruptableMockTransport #37296
Conversation
|
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this change, it makes a lot more sense. I pointed out a handful of places where we might want to be a bit more badly behaved and suggested a couple of naming changes.
| return targetNode; | ||
| } | ||
|
|
||
| public boolean matchesTarget(DiscoveryNode matchingNode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe targetMatches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, mine didn't feel quite right.
| }); | ||
| return () -> {}; | ||
| } else { | ||
| throw new ConnectTransportException(node, "node " + node + " does not exist"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is OK for now but in future (hoho) we will want this to be async and/or to timeout on an unknown node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this depends on the connection manager becoming async. Right now there's a Future.get() waiting for us behind this call.
| } else if (nodeExists(sender) && nodeExists(destination)) { | ||
| connectionStatus = ConnectionStatus.CONNECTED; | ||
| } else { | ||
| connectionStatus = ConnectionStatus.DISCONNECTED; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be good to test both DISCONNECTED and BLACK_HOLE here, perhaps using mostly the same value for the duration of a test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
| protected abstract Optional<DisruptableMockTransport> getDisruptableMockTransport(TransportAddress address); | ||
|
|
||
| protected abstract void handle(DiscoveryNode sender, DiscoveryNode destination, String action, Runnable doDelivery); | ||
| protected abstract void schedule(Runnable runnable); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the name execute would be more consistent with things like ExecutorService. schedule is largely used for delayed execution (ignoring that the DeterministicTaskQueue uses scheduleNow for this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
| logger.debug("----> [runRandomly {}] rebooting [{}]", thisStep, clusterNode.getId()); | ||
| clusterNode.close(); | ||
| clusterNodes.forEach(cn -> cn.onNode( | ||
| () -> cn.transportService.disconnectFromNode(clusterNode.getLocalNode())).run()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should delay these disconnections. Maybe we should rarely delay them by a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing this line makes the tests fail, will need to look at why that's the case. Do you think this should be possibly delayed beyond the safety phase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think completely removing it is unrealistic, but we may not get a disconnection event for quite some time (up to ~15 minutes by default on Linux). I do not think it should be delayed beyond the safety phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I've made the according changes. AFAICS the reason why we can't extend it beyond the safety phase is that PeerFinder will not start connecting to the new node as long as the transport claims for the old node with same address to be still connected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, the 15 minutes example wasn't supposed to be a suggestion. I think just scheduling it as a normal delayed action is sufficient, given that EXTREME_DELAY_VARIABILITY is mostly in force. I think this also means we don't need to clean it up specially, because we reduce the delay variability down to something reasonable for the end of the safety phase, and it shouldn't matter if it occurs within the first DEFAULT_DELAY_VARIABILITY of the stabilisation phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed efa0728
|
Also note I haven't run a soak test, my CI machine is otherwise engaged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM just one (somewhat irrelevant) suggestion and one question :)
| mockTransport = new DisruptableMockTransport(logger) { | ||
| @Override | ||
| protected DiscoveryNode getLocalNode() { | ||
| public DiscoveryNode getLocalNode() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that all the implementations of DisruptableMockTransport simply have a getter for some constant value for the local node as their implementation. Maybe just move that getter up into DisruptableMockTransport and pass it as constructor parameter while we're changing this anyway? (just to save a bit of noise in the concrete tests :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but needs a soak test to be sure.
This PR moves DisruptableMockTransport to use a more accurate representation of connection management, which allows to use the full connection manager and does not require mocking out any behavior. With this, we can implement restarting nodes in CoordinatorTests.