Conversation

@lidel lidel commented Sep 1, 2025

TLDR: fixes a racy test. This is likely just a problem with the test's inability to fully test in isolation, rather than a bug in the code.

cc @gammazero for sanity check

Why

Unsure if I fully understand the test/bitswap, so take this with a grain of salt, but iiuc the test was flaky (example) when run with -shuffle=on because of a race condition in message counting. After fetching the first block (cids[0]), which triggers a broadcast want-have to all peers and then a CANCEL once the block is received, the test immediately continued fetching more blocks.

During this time, some other internal timers (idle tick or periodic search?) could fire and cause a rebroadcast of wants before the test checked message counts on uninvolved nodes. With test shuffling, goroutine scheduling changes made this timing issue more likely, causing uninvolved nodes to sometimes receive 3 messages instead of the expected 2.

Adding a small delay after the first block fetch ensures the CANCEL is fully processed and the session stabilizes before continuing, preventing the race condition.
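
For illustration, a minimal sketch of the shape of the fix (not the exact diff; `blockGetter`, `fetchFirstThenSettle`, and the 100ms duration are stand-ins for the real test's names and value, assuming blocks are fetched sequentially through a session-like interface):

```go
package bitswap_test

import (
	"context"
	"testing"
	"time"

	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
)

// blockGetter matches the GetBlock method exposed by bitswap
// sessions/exchanges; it stands in for whatever the test actually uses.
type blockGetter interface {
	GetBlock(ctx context.Context, c cid.Cid) (blocks.Block, error)
}

// fetchFirstThenSettle is a hypothetical helper showing the fix's shape.
func fetchFirstThenSettle(ctx context.Context, t *testing.T, session blockGetter, cids []cid.Cid) {
	t.Helper()

	// Fetching the first block triggers a broadcast want-have to all
	// peers, then a CANCEL once the block arrives.
	if _, err := session.GetBlock(ctx, cids[0]); err != nil {
		t.Fatal(err)
	}

	// Give the CANCEL time to be delivered and the session's internal
	// timers (idle tick / periodic search) time to settle, so uninvolved
	// peers end up with exactly 2 messages (want-have + CANCEL) instead
	// of a racy 3rd rebroadcast.
	time.Sleep(100 * time.Millisecond)

	// Continue fetching the remaining blocks as before.
	for _, c := range cids[1:] {
		if _, err := session.GetBlock(ctx, c); err != nil {
			t.Fatal(err)
		}
	}
}
```

A fixed sleep is the smallest possible change here; waiting on an explicit signal (e.g. polling the observed message counts until they stop changing) would be more robust, at the cost of a bigger diff.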

(Not scientific, but I ran it hundreds of times and was not able to reproduce the flaky race anymore, so at least this PR makes our CI more stable across PRs.)
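
(For anyone trying to reproduce locally, something like `go test -run '<TestName>' -count=500 -shuffle=on ./bitswap/...` exercises the same shuffle-sensitive path; the test name is a placeholder.)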

@lidel lidel requested a review from a team as a code owner September 1, 2025 14:11
@lidel lidel requested a review from gammazero September 1, 2025 14:12

codecov bot commented Sep 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.70%. Comparing base (4c0aa3a) to head (27d0588).

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1022      +/-   ##
==========================================
- Coverage   60.76%   60.70%   -0.07%     
==========================================
  Files         268      268              
  Lines       33593    33593              
==========================================
- Hits        20414    20391      -23     
- Misses      11508    11524      +16     
- Partials     1671     1678       +7     

see 7 files with indirect coverage changes

gammazero commented Sep 3, 2025

I think there may be some other issues causing test unreliability. It looks like CI is sometimes hanging forever (until timeout) or getting the wrong count with this PR. Needs further investigation.

@gammazero gammazero marked this pull request as draft September 3, 2025 01:40