Conversation

@lidel lidel commented Sep 1, 2025

TLDR: fixes a racy test. This is likely just a problem with the test's inability to fully test in isolation, rather than a bug in the code.

cc @gammazero for sanity check

Why

Unsure if I fully understand the test/bitswap, so take this with a grain of salt, but iiuc the test was flaky (example) when run with -shuffle=on because of a race condition in message counting. After fetching the first block (cids[0]), which triggers a broadcast want-have to all peers and then a CANCEL once the block is received, the test immediately continued fetching more blocks.

During this time, some other internal timers (idle tick or periodic search?) could fire and cause a rebroadcast of wants before the test checked message counts on uninvolved nodes. With test shuffling, goroutine scheduling changes made this timing issue more likely, causing uninvolved nodes to sometimes receive 3 messages instead of the expected 2.

Adding a small delay after the first block fetch ensures the CANCEL is fully processed and the session stabilizes before continuing, preventing the race condition.
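
For illustration, a minimal sketch of the shape of the fix (not the exact diff; `blockGetter`, `fetchFirstThenSettle`, and the 100ms duration are stand-ins for the real test's names and value, assuming blocks are fetched sequentially through a session-like interface):

```go
package bitswap_test

import (
	"context"
	"testing"
	"time"

	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
)

// blockGetter matches the GetBlock method exposed by bitswap
// sessions/exchanges; it stands in for whatever the test actually uses.
type blockGetter interface {
	GetBlock(ctx context.Context, c cid.Cid) (blocks.Block, error)
}

// fetchFirstThenSettle is a hypothetical helper showing the fix's shape.
func fetchFirstThenSettle(ctx context.Context, t *testing.T, session blockGetter, cids []cid.Cid) {
	t.Helper()

	// Fetching the first block triggers a broadcast want-have to all
	// peers, then a CANCEL once the block arrives.
	if _, err := session.GetBlock(ctx, cids[0]); err != nil {
		t.Fatal(err)
	}

	// Give the CANCEL time to be delivered and the session's internal
	// timers (idle tick / periodic search) time to settle, so uninvolved
	// peers end up with exactly 2 messages (want-have + CANCEL) instead
	// of a racy 3rd rebroadcast.
	time.Sleep(100 * time.Millisecond)

	// Continue fetching the remaining blocks as before.
	for _, c := range cids[1:] {
		if _, err := session.GetBlock(ctx, c); err != nil {
			t.Fatal(err)
		}
	}
}
```

A fixed sleep is the smallest possible change here; waiting on an explicit signal (e.g. polling the observed message counts until they stop changing) would be more robust, at the cost of a bigger diff.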

(Not scientific, but I ran it hundreds of times and was not able to reproduce the flaky race anymore, so at least this PR makes our CI more stable across PRs.)
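
(For anyone trying to reproduce locally, something like `go test -run '<TestName>' -count=500 -shuffle=on ./bitswap/...` exercises the same shuffle-sensitive path; the test name is a placeholder.)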

@lidel lidel requested a review from a team as a code owner September 1, 2025 14:11
@lidel lidel requested a review from gammazero September 1, 2025 14:12

codecov bot commented Sep 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.70%. Comparing base (4c0aa3a) to head (27d0588).

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1022      +/-   ##
==========================================
- Coverage   60.76%   60.70%   -0.07%     
==========================================
  Files         268      268              
  Lines       33593    33593              
==========================================
- Hits        20414    20391      -23     
- Misses      11508    11524      +16     
- Partials     1671     1678       +7     

see 7 files with indirect coverage changes

gammazero commented Sep 3, 2025

I think there may be some other issues causing test unreliability. It looks like CI is sometimes hanging forever (until timeout) or getting the wrong count with this PR. Needs further investigation.

@gammazero gammazero marked this pull request as draft September 3, 2025 01:40