Skip to content

Conversation

abr-egn
Copy link
Contributor

@abr-egn abr-egn commented Sep 27, 2024

RUST-2046

Rewriting the spec test as a prose test turned out to be very useful and definitely a technique I'm keeping in my pocket for the future; once I had that done it was very quick to narrow the problem down.

Caveat: the following is my theory of what's going on. The fix works but I'm less than 100% confident in my knowledge of the behavior of mongodb deployments in complex situations like this.

The core issue is the behavior of the snapshot read concern:

A query with read concern snapshot returns majority-committed data as it appears across shards from a specific single point in time in the recent past.

There are no particular guarantees about it being the most recent (if you want that, use majority) or "at least after the last transaction" (if you want that, use causally consistent sessions). In the case of this particular test, it's not even done as part of a transaction or session, it's just a find (which is allowed, that's one of the three read operations that support it outside of transactions).

In practice, what seems to happen is you get the timestamp of the last write acknowledged by the server you happen to be connected to. Where this goes wrong for this particular test:

  • The test client entity has useMultipleMongoses: false, so it'll always be connecting to the first (of two) configured mongoses.
  • The initial data is populated using the internal client, which is just a normal client, connecting to wherever server selection happens to fall.
  • Initial data write uses majority write concern; in this case, I think the "calculated majority" will simply be 1.
  • So, if...
    • the internal client picks the second mongos
    • and the write isn't acknowledged by the first mongos
    • and it hasn't replicated by the time the test runs
  • then the timestamp chosen for the snapshot read concern in the test will be before the timestamp for the write, and the test find will return an empty list, causing our flake.

The fix is to use a secondary internal client that's also pinned to the first mongos for initial data population, which avoids the first domino in the chain of failure. Hypothetically I could have updated the unified runner to make the internal client always be pinned but that seemed to be much more likely to cause surprising and unwanted behavior elsewhere.

Sidebar: if my understanding of the meaning of snapshot is correct, this test relies on an awful lot of implicit behavior both serverside and in driver test runners.

pub(crate) mod list_collections;
pub(crate) mod list_databases;
mod list_indexes;
#[cfg(feature = "in-use-encryption")]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unrelated but the lack of it was causing dead code warnings when compiling without the feature selected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried doing a test sync to see if it had been fixed by someone else, but this was all that had changed :)

@abr-egn abr-egn marked this pull request as ready for review September 27, 2024 18:18
@abr-egn
Copy link
Contributor Author

abr-egn commented Sep 27, 2024

I was thinking this had a chance at being the first natural green in a while but it looks like orchestration's broken on mac, ah well :)

Copy link
Contributor

@isabelatkinson isabelatkinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice investigation; glad this didn't turn out to be bad behavior in the driver.

Rewriting the spec test as a prose test turned out to be very useful

I would love to use the next skunkworks to investigate code-genning the unified tests into individual test functions -- I think that would make these tests much more debuggable.

@abr-egn abr-egn merged commit 7f455fc into mongodb:main Sep 30, 2024
13 of 15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants