RUST-2046 Fix flaky afterClusterTime test #1209
Merged
RUST-2046
Rewriting the spec test as a prose test turned out to be very useful and definitely a technique I'm keeping in my pocket for the future; once I had that done it was very quick to narrow the problem down.
Caveat: the following is my theory of what's going on. The fix works but I'm less than 100% confident in my knowledge of the behavior of mongodb deployments in complex situations like this.
The core issue is the behavior of the `snapshot` read concern: there are no particular guarantees about it being the most recent (if you want that, use `majority`) or "at least after the last transaction" (if you want that, use causally consistent sessions). In the case of this particular test, it's not even done as part of a transaction or session, it's just a `find` (which is allowed, that's one of the three read operations that support it outside of transactions).

In practice, what seems to happen is you get the timestamp of the last write acknowledged by the server you happen to be connected to. Where this goes wrong for this particular test:
- The test sets `useMultipleMongoses: false`, so it'll always be connecting to the first (of two) configured mongoses.
- The initial data is written with `majority` write concern; in this case, I think the "calculated majority" will simply be `1`.
- The timestamp for the `snapshot` read concern in the test can then be before the timestamp for the write, and the test `find` will return an empty list, causing our flake.

The fix is to use a secondary internal client that's also pinned to the first mongos for initial data population, which avoids the first domino in the chain of failure. Hypothetically I could have updated the unified runner to make the internal client always be pinned, but that seemed much more likely to cause surprising and unwanted behavior elsewhere.
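
To make the chain of failure above easier to follow, here is a toy model of it in plain Rust. It is only a sketch of the theory in this description, not of real server behavior: `Cluster`, `Mongos`, `insert_via`, `snapshot_find_via`, and `run_scenario` are all hypothetical names, no driver API is involved, and the model assumes a mongos only learns a new cluster time from writes routed through it (real deployments also gossip cluster time, which this ignores).

```rust
/// Toy cluster: every document is stamped with the cluster time of its write.
struct Cluster {
    clock: u64,
    docs: Vec<(u64, String)>, // (write timestamp, document)
}

/// Toy mongos: remembers the newest cluster time it has seen (assumption).
struct Mongos {
    last_seen: u64,
}

impl Cluster {
    fn new() -> Self {
        Cluster { clock: 0, docs: Vec::new() }
    }

    /// Insert through `router`; in this model only that router learns the new timestamp.
    fn insert_via(&mut self, router: &mut Mongos, doc: &str) -> u64 {
        self.clock += 1;
        self.docs.push((self.clock, doc.to_string()));
        router.last_seen = self.clock;
        self.clock
    }

    /// A `find` with snapshot read concern and no explicit time:
    /// the router reads at the last timestamp it knows about.
    fn snapshot_find_via(&self, router: &Mongos) -> Vec<String> {
        self.docs
            .iter()
            .filter(|(ts, _)| *ts <= router.last_seen)
            .map(|(_, doc)| doc.clone())
            .collect()
    }
}

/// Runs the test scenario. With `pin_setup_client == false` the setup write
/// takes the unlucky path through the second mongos; with `true` it is
/// pinned to the first mongos, like the test client.
fn run_scenario(pin_setup_client: bool) -> Vec<String> {
    let mut cluster = Cluster::new();
    let mut routers = [Mongos { last_seen: 0 }, Mongos { last_seen: 0 }];

    // Initial data population by the internal (setup) client.
    let setup_idx = if pin_setup_client { 0 } else { 1 };
    cluster.insert_via(&mut routers[setup_idx], "initial");

    // The test client always talks to the first mongos
    // (useMultipleMongoses: false).
    cluster.snapshot_find_via(&routers[0])
}
```

In this model, `run_scenario(false)` returns an empty list because the first mongos reads at a timestamp older than the setup write, while `run_scenario(true)` returns the inserted document. A real unpinned client would only hit the second mongos some of the time, which is what makes the failure intermittent.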
Sidebar: if my understanding of the meaning of `snapshot` is correct, this test relies on an awful lot of implicit behavior both server-side and in driver test runners.