RUST-2046 Fix flaky afterClusterTime test #1209

abr-egn · 2024-09-27T16:24:59Z

RUST-2046

Rewriting the spec test as a prose test turned out to be very useful and definitely a technique I'm keeping in my pocket for the future; once I had that done it was very quick to narrow the problem down.

Caveat: the following is my theory of what's going on. The fix works but I'm less than 100% confident in my knowledge of the behavior of mongodb deployments in complex situations like this.

The core issue is the behavior of the snapshot read concern:

A query with read concern snapshot returns majority-committed data as it appears across shards from a specific single point in time in the recent past.

There are no particular guarantees about it being the most recent (if you want that, use majority) or "at least after the last transaction" (if you want that, use causally consistent sessions). In the case of this particular test, it's not even done as part of a transaction or session, it's just a find (which is allowed, that's one of the three read operations that support it outside of transactions).

In practice, what seems to happen is you get the timestamp of the last write acknowledged by the server you happen to be connected to. Where this goes wrong for this particular test:

The test client entity has useMultipleMongoses: false, so it'll always be connecting to the first (of two) configured mongoses.
The initial data is populated using the internal client, which is just a normal client, connecting to wherever server selection happens to fall.
Initial data write uses majority write concern; in this case, I think the "calculated majority" will simply be 1.
So, if...
- the internal client picks the second mongos
- and the write isn't acknowledged by the first mongos
- and it hasn't replicated by the time the test runs
then the timestamp chosen for the snapshot read concern in the test will be before the timestamp for the write, and the test find will return an empty list, causing our flake.

The fix is to use a secondary internal client that's also pinned to the first mongos for initial data population, which avoids the first domino in the chain of failure. Hypothetically I could have updated the unified runner to make the internal client always be pinned but that seemed to be much more likely to cause surprising and unwanted behavior elsewhere.

Sidebar: if my understanding of the meaning of snapshot is correct, this test relies on an awful lot of implicit behavior both serverside and in driver test runners.

abr-egn · 2024-09-27T16:28:20Z

src/operation.rs

 pub(crate) mod list_collections;
 pub(crate) mod list_databases;
 mod list_indexes;
+#[cfg(feature = "in-use-encryption")]


This is unrelated but the lack of it was causing dead code warnings when compiling without the feature selected.

abr-egn · 2024-09-27T16:28:55Z

src/test/spec/json/sessions/README.md

I tried doing a test sync to see if it had been fixed by someone else, but this was all that had changed :)

abr-egn · 2024-09-27T18:18:48Z

I was thinking this had a chance at being the first natural green in a while but it looks like orchestration's broken on mac, ah well :)

isabelatkinson

LGTM! Nice investigation; glad this didn't turn out to be bad behavior in the driver.

Rewriting the spec test as a prose test turned out to be very useful

I would love to use the next skunkworks to investigate code-genning the unified tests into individual test functions -- I think that would make these tests much more debuggable.

RUST-2046 Fix flaky afterClusterTime test

e9b3410

abr-egn commented Sep 27, 2024

View reviewed changes

abr-egn marked this pull request as ready for review September 27, 2024 18:18

abr-egn requested a review from isabelatkinson September 27, 2024 18:19

isabelatkinson approved these changes Sep 27, 2024

View reviewed changes

abr-egn merged commit 7f455fc into mongodb:main Sep 30, 2024
13 of 15 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

RUST-2046 Fix flaky afterClusterTime test #1209

RUST-2046 Fix flaky afterClusterTime test #1209

Uh oh!

abr-egn commented Sep 27, 2024 •

edited

Loading

Uh oh!

abr-egn Sep 27, 2024

Uh oh!

abr-egn Sep 27, 2024

Uh oh!

abr-egn commented Sep 27, 2024

Uh oh!

isabelatkinson left a comment

Uh oh!

Uh oh!

Uh oh!

RUST-2046 Fix flaky afterClusterTime test #1209

RUST-2046 Fix flaky afterClusterTime test #1209

Uh oh!

Conversation

abr-egn commented Sep 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

abr-egn Sep 27, 2024

Choose a reason for hiding this comment

Uh oh!

abr-egn Sep 27, 2024

Choose a reason for hiding this comment

Uh oh!

abr-egn commented Sep 27, 2024

Uh oh!

isabelatkinson left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

abr-egn commented Sep 27, 2024 •

edited

Loading