fix: flaky TestSessionBetweenPeers with shuffle enabled #1022
TLDR: fixes a racy test. This is likely just a problem with the test's inability to fully test in isolation, rather than a bug in the code.
cc @gammazero for sanity check
Why
I'm not sure I fully understand the test/bitswap, so take this with a grain of salt, but iiuc the test was flaky (example) when run with -shuffle=on because of a race condition in message counting. After fetching the first block (cids[0]), which triggers a broadcast want-have to all peers and then a CANCEL once the block is received, the test immediately continued fetching more blocks.
During this time, some other internal timer (idle tick or periodic search?) could fire and cause a rebroadcast of wants before the test checked message counts on uninvolved nodes. With test shuffling, changes in goroutine scheduling made this timing issue more likely, so uninvolved nodes sometimes received 3 messages instead of the expected 2.
Adding a small delay after the first block fetch ensures the CANCEL is fully processed and the session stabilizes before continuing, preventing the race condition.
(Not scientific, but I ran it hundreds of times and was not able to reproduce the flaky race anymore, so at least this PR makes our CI more stable across PRs.)