bitswap/httpnet: fix sudden stop of http retrieval requests #984

hsanjuan · 2025-07-21T08:23:44Z

We observed an issue in rainbow production, which consisted in
http-block-retrieval requests to a specific endpoint dropping to 0 permanently.

Upon inspection, we stablished that the issue is that the peer's message-queue
is shutdown due to a message sender error, while the peer itself seems never
to disconnect/reconnect. The lack of queue cleanup, while being shutdown,
means no more requests are sent.

The root cause(s) seem to be race conditions that cause Disconnect/Connect
event to never be bubbled: because Connect happens very soon after Disconnect,
or because Disconnect happens on Bitswap while Connect happens on HTTP etc.

This commit addresses the issue from multiple angles:

Do not shutdown the messages queues on MessageSender errors: I have not
found a way to get past this issue otherwise. This might mean that the queue
keeps being processed while a Disconnect arrives. This is a small price to
pay as there is no way to ensure a notification arrives with the current
(efficient) implementation of the connectEventManager.
Avoid Connect/Disconnect races in httpnet: do not disconnect while
connecting, do not report connectedness while connecting disconnecting, do
not connect while disconnecting. The issue here is that races between
operations may cause things like reporting a peer is disconnected when it's
about to be connect and vice-versa.
Share ConnectEventManager between bitswap/networks: before, if one
network disconnected, the other one was not able to re-connect, as the other
one had not disconnected at all. That opened the window to not recovering
situations that were recoverable.
Improve routing logic between httpnet and bsnet: the clearest trigger of
the issue was disconnections of HTTP due to client errors. Queue processing
tried to send more wantlists to the HTTP peers, but they were disconnected,
so they were instead attempted to be sent via Bitswap. Bitswap failed after
several seconds trying to lookup addresses. In the meantime, http had
already reconnected, but the message queues were shutdown
nevertheless. Improved logic attemps to reduce this situation, and unwanted
Disconnects due to bad routing of NewMessageSender.

codecov · 2025-07-21T08:27:45Z

Codecov Report

❌ Patch coverage is 45.63107% with 56 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.55%. Comparing base (7758521) to head (d8ba4d2).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
bitswap/network/router.go	0.00%	29 Missing ⚠️
bitswap/network/httpnet/httpnet.go	78.00%	9 Missing and 2 partials ⚠️
bitswap/network/bsnet/ipfs_impl.go	50.00%	4 Missing and 1 partial ⚠️
bitswap/network/bsnet/options.go	0.00%	4 Missing ⚠️
...tswap/client/internal/messagequeue/messagequeue.go	0.00%	2 Missing ⚠️
bitswap/network/connecteventmanager.go	0.00%	2 Missing ⚠️
bitswap/testnet/virtual.go	0.00%	2 Missing ⚠️
bitswap/network/httpnet/pinger.go	75.00%	1 Missing ⚠️

@@            Coverage Diff             @@
##             main     #984      +/-   ##
==========================================
- Coverage   61.57%   61.55%   -0.02%     
==========================================
  Files         254      254              
  Lines       31504    31542      +38     
==========================================
+ Hits        19398    19416      +18     
- Misses      10520    10540      +20     
  Partials     1586     1586

Files with missing lines	Coverage Δ
bitswap/network/httpnet/pinger.go	`35.34% <75.00%> (+1.41%)`	⬆️
...tswap/client/internal/messagequeue/messagequeue.go	`84.87% <0.00%> (+0.18%)`	⬆️
bitswap/network/connecteventmanager.go	`84.96% <0.00%> (-1.30%)`	⬇️
bitswap/testnet/virtual.go	`67.08% <0.00%> (-0.58%)`	⬇️
bitswap/network/bsnet/options.go	`33.33% <0.00%> (-16.67%)`	⬇️
bitswap/network/bsnet/ipfs_impl.go	`75.29% <50.00%> (-0.84%)`	⬇️
bitswap/network/httpnet/httpnet.go	`65.64% <78.00%> (+1.04%)`	⬆️
bitswap/network/router.go	`0.00% <0.00%> (ø)`

... and 10 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

We observed an issue in rainbow production, which consisted in http-block-retrieval requests to a specific endpoint dropping to 0 permanently. Upon inspection, we stablished that the issue is that the peer's message-queue is shutdown due to a message sender error, while the peer itself seems never to disconnect/reconnect. The lack of queue cleanup, while being shutdown, means no more requests are sent. The root cause(s) seem to be race conditions that cause Disconnect/Connect event to never be bubbled: because Connect happens very soon after Disconnect, or because Disconnect happens on Bitswap while Connect happens on HTTP etc. This commit addresses the issue from multiple angles: 1. Do not shutdown the messages queues on MessageSender errors: I have not found a way to get past this issue otherwise. This might mean that the queue keeps being processed while a Disconnect arrives. This is a small price to pay as there is no way to ensure a notification arrives with the current (efficient) implementation of the connectEventManager. 2. Avoid Connect/Disconnect races in httpnet: do not disconnect while connecting, do not report connectedness while connecting disconnecting, do not connect while disconnecting. The issue here is that races between operations may cause things like reporting a peer is disconnected when it's about to be connect and vice-versa. 3. Share ConnectEventManager between bitswap/networks: before, if one network disconnected, the other one was not able to re-connect, as the other one had not disconnected at all. That opened the window to not recovering situations that were recoverable. 3. Improve routing logic between httpnet and bsnet: the clearest trigger of the issue was disconnections of HTTP due to client errors. Queue processing tried to send more wantlists to the HTTP peers, but they were disconnected, so they were instead attempted to be sent via Bitswap. Bitswap failed after several seconds trying to lookup addresses. In the meantime, http had already reconnected, but the message queues were shutdown nevertheless. Improved logic attemps to reduce this situation, and unwanted Disconnects due to bad routing of NewMessageSender.

hsanjuan requested a review from a team as a code owner July 21, 2025 08:23

hsanjuan force-pushed the httpnet-connect-disconnect branch from aa4552c to 2cfd4a8 Compare July 21, 2025 08:24

gammazero approved these changes Jul 22, 2025

View reviewed changes

gammazero approved these changes Jul 23, 2025

View reviewed changes

hsanjuan force-pushed the httpnet-connect-disconnect branch from e2d3b03 to 0e1bfaf Compare July 23, 2025 22:35

hsanjuan changed the title ~~httpnet: another attempt to fix httpnet silently stopping requests~~ bitswap/httpnet: fix sudden stop of http retrieval requests Jul 23, 2025

hsanjuan requested a review from gammazero July 23, 2025 23:10

hsanjuan force-pushed the httpnet-connect-disconnect branch from 0e1bfaf to 2eb609b Compare July 23, 2025 23:13

hsanjuan force-pushed the httpnet-connect-disconnect branch from 2eb609b to b0ed944 Compare July 24, 2025 13:41

gammazero approved these changes Jul 24, 2025

View reviewed changes

Merge branch 'main' into httpnet-connect-disconnect

d8ba4d2

gammazero merged commit 458afb1 into main Jul 25, 2025
16 checks passed

gammazero deleted the httpnet-connect-disconnect branch July 25, 2025 20:55

BrewTestBot mentioned this pull request Aug 27, 2025

ipfs 0.37.0 Homebrew/homebrew-core#235214

Merged

This was referenced Oct 13, 2025

IPFS example with Kubo paritytech/polkadot-bulletin-chain#72

Closed

bitswap: reuse inbound substream for subsequent requests paritytech/litep2p#444

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

bitswap/httpnet: fix sudden stop of http retrieval requests #984

bitswap/httpnet: fix sudden stop of http retrieval requests #984

Uh oh!

hsanjuan commented Jul 21, 2025 •

edited

Loading

Uh oh!

codecov bot commented Jul 21, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bitswap/httpnet: fix sudden stop of http retrieval requests #984

bitswap/httpnet: fix sudden stop of http retrieval requests #984

Uh oh!

Conversation

hsanjuan commented Jul 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Jul 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hsanjuan commented Jul 21, 2025 •

edited

Loading

codecov bot commented Jul 21, 2025 •

edited

Loading