Skip to content

Conversation

@hsanjuan
Copy link
Contributor

@hsanjuan hsanjuan commented Jul 21, 2025

We observed an issue in rainbow production, which consisted in
http-block-retrieval requests to a specific endpoint dropping to 0 permanently.

Upon inspection, we stablished that the issue is that the peer's message-queue
is shutdown due to a message sender error, while the peer itself seems never
to disconnect/reconnect. The lack of queue cleanup, while being shutdown,
means no more requests are sent.

The root cause(s) seem to be race conditions that cause Disconnect/Connect
event to never be bubbled: because Connect happens very soon after Disconnect,
or because Disconnect happens on Bitswap while Connect happens on HTTP etc.

This commit addresses the issue from multiple angles:

  1. Do not shutdown the messages queues on MessageSender errors: I have not
    found a way to get past this issue otherwise. This might mean that the queue
    keeps being processed while a Disconnect arrives. This is a small price to
    pay as there is no way to ensure a notification arrives with the current
    (efficient) implementation of the connectEventManager.

  2. Avoid Connect/Disconnect races in httpnet: do not disconnect while
    connecting, do not report connectedness while connecting disconnecting, do
    not connect while disconnecting. The issue here is that races between
    operations may cause things like reporting a peer is disconnected when it's
    about to be connect and vice-versa.

  3. Share ConnectEventManager between bitswap/networks: before, if one
    network disconnected, the other one was not able to re-connect, as the other
    one had not disconnected at all. That opened the window to not recovering
    situations that were recoverable.

  4. Improve routing logic between httpnet and bsnet: the clearest trigger of
    the issue was disconnections of HTTP due to client errors. Queue processing
    tried to send more wantlists to the HTTP peers, but they were disconnected,
    so they were instead attempted to be sent via Bitswap. Bitswap failed after
    several seconds trying to lookup addresses. In the meantime, http had
    already reconnected, but the message queues were shutdown
    nevertheless. Improved logic attemps to reduce this situation, and unwanted
    Disconnects due to bad routing of NewMessageSender.

@hsanjuan hsanjuan requested a review from a team as a code owner July 21, 2025 08:23
@hsanjuan hsanjuan force-pushed the httpnet-connect-disconnect branch from aa4552c to 2cfd4a8 Compare July 21, 2025 08:24
@codecov
Copy link

codecov bot commented Jul 21, 2025

Codecov Report

❌ Patch coverage is 45.63107% with 56 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.55%. Comparing base (7758521) to head (d8ba4d2).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
bitswap/network/router.go 0.00% 29 Missing ⚠️
bitswap/network/httpnet/httpnet.go 78.00% 9 Missing and 2 partials ⚠️
bitswap/network/bsnet/ipfs_impl.go 50.00% 4 Missing and 1 partial ⚠️
bitswap/network/bsnet/options.go 0.00% 4 Missing ⚠️
...tswap/client/internal/messagequeue/messagequeue.go 0.00% 2 Missing ⚠️
bitswap/network/connecteventmanager.go 0.00% 2 Missing ⚠️
bitswap/testnet/virtual.go 0.00% 2 Missing ⚠️
bitswap/network/httpnet/pinger.go 75.00% 1 Missing ⚠️

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #984      +/-   ##
==========================================
- Coverage   61.57%   61.55%   -0.02%     
==========================================
  Files         254      254              
  Lines       31504    31542      +38     
==========================================
+ Hits        19398    19416      +18     
- Misses      10520    10540      +20     
  Partials     1586     1586              
Files with missing lines Coverage Δ
bitswap/network/httpnet/pinger.go 35.34% <75.00%> (+1.41%) ⬆️
...tswap/client/internal/messagequeue/messagequeue.go 84.87% <0.00%> (+0.18%) ⬆️
bitswap/network/connecteventmanager.go 84.96% <0.00%> (-1.30%) ⬇️
bitswap/testnet/virtual.go 67.08% <0.00%> (-0.58%) ⬇️
bitswap/network/bsnet/options.go 33.33% <0.00%> (-16.67%) ⬇️
bitswap/network/bsnet/ipfs_impl.go 75.29% <50.00%> (-0.84%) ⬇️
bitswap/network/httpnet/httpnet.go 65.64% <78.00%> (+1.04%) ⬆️
bitswap/network/router.go 0.00% <0.00%> (ø)

... and 10 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hsanjuan hsanjuan force-pushed the httpnet-connect-disconnect branch from e2d3b03 to 0e1bfaf Compare July 23, 2025 22:35
@hsanjuan hsanjuan changed the title httpnet: another attempt to fix httpnet silently stopping requests bitswap/httpnet: fix sudden stop of http retrieval requests Jul 23, 2025
@hsanjuan hsanjuan requested a review from gammazero July 23, 2025 23:10
@hsanjuan hsanjuan force-pushed the httpnet-connect-disconnect branch from 0e1bfaf to 2eb609b Compare July 23, 2025 23:13
We observed an issue in rainbow production, which consisted in
http-block-retrieval requests to a specific endpoint dropping to 0 permanently.

Upon inspection, we stablished that the issue is that the peer's message-queue
is shutdown due to a message sender error, while the peer itself seems never
to disconnect/reconnect. The lack of queue cleanup, while being shutdown,
means no more requests are sent.

The root cause(s) seem to be race conditions that cause Disconnect/Connect
event to never be bubbled: because Connect happens very soon after Disconnect,
or because Disconnect happens on Bitswap while Connect happens on HTTP etc.

This commit addresses the issue from multiple angles:

  1. Do not shutdown the messages queues on MessageSender errors: I have not
  found a way to get past this issue otherwise. This might mean that the queue
  keeps being processed while a Disconnect arrives. This is a small price to
  pay as there is no way to ensure a notification arrives with the current
  (efficient) implementation of the connectEventManager.

  2. Avoid Connect/Disconnect races in httpnet: do not disconnect while
  connecting, do not report connectedness while connecting disconnecting, do
  not connect while disconnecting. The issue here is that races between
  operations may cause things like reporting a peer is disconnected when it's
  about to be connect and vice-versa.

  3. Share ConnectEventManager between bitswap/networks: before, if one
  network disconnected, the other one was not able to re-connect, as the other
  one had not disconnected at all. That opened the window to not recovering
  situations that were recoverable.

  3. Improve routing logic between httpnet and bsnet: the clearest trigger of
  the issue was disconnections of HTTP due to client errors. Queue processing
  tried to send more wantlists to the HTTP peers, but they were disconnected,
  so they were instead attempted to be sent via Bitswap. Bitswap failed after
  several seconds trying to lookup addresses. In the meantime, http had
  already reconnected, but the message queues were shutdown
  nevertheless. Improved logic attemps to reduce this situation, and unwanted
  Disconnects due to bad routing of NewMessageSender.
@hsanjuan hsanjuan force-pushed the httpnet-connect-disconnect branch from 2eb609b to b0ed944 Compare July 24, 2025 13:41
@gammazero gammazero merged commit 458afb1 into main Jul 25, 2025
16 checks passed
@gammazero gammazero deleted the httpnet-connect-disconnect branch July 25, 2025 20:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants