bitswap/httpnet: fix sudden stop of http retrieval requests #984
Merged
aa4552c to
2cfd4a8
Compare
Codecov Report
❌ Patch coverage is

Coverage Diff

| Metric | main | #984 | +/- |
|---|---|---|---|
| Coverage | 61.57% | 61.55% | -0.02% |
| Files | 254 | 254 | |
| Lines | 31504 | 31542 | +38 |
| Hits | 19398 | 19416 | +18 |
| Misses | 10520 | 10540 | +20 |
| Partials | 1586 | 1586 | |

... and 10 files with indirect coverage changes
gammazero
approved these changes
Jul 22, 2025
gammazero
approved these changes
Jul 23, 2025
e2d3b03 to
0e1bfaf
Compare
0e1bfaf to
2eb609b
Compare
We observed an issue in rainbow production in which HTTP block-retrieval requests to a specific endpoint dropped to 0 permanently.

Upon inspection, we established that the peer's message queue is shut down due to a message sender error, while the peer itself never seems to disconnect/reconnect. Because the queue is shut down but never cleaned up, no more requests are sent.

The root cause(s) seem to be race conditions that cause Disconnect/Connect events to never be bubbled up: because Connect happens very soon after Disconnect, or because Disconnect happens on Bitswap while Connect happens on HTTP, etc.

This commit addresses the issue from multiple angles:

1. Do not shut down the message queues on MessageSender errors: I have not found a way to get past this issue otherwise. This might mean that the queue keeps being processed while a Disconnect arrives. This is a small price to pay, as there is no way to ensure a notification arrives with the current (efficient) implementation of the connectEventManager.
2. Avoid Connect/Disconnect races in httpnet: do not disconnect while connecting, do not report connectedness while connecting or disconnecting, and do not connect while disconnecting. The issue here is that races between operations may cause things like reporting that a peer is disconnected when it is about to be connected, and vice versa.
3. Share ConnectEventManager between bitswap/networks: before, if one network disconnected, the other one was not able to reconnect, as the other one had not disconnected at all. That opened a window for failing to recover from situations that were recoverable.
4. Improve routing logic between httpnet and bsnet: the clearest trigger of the issue was disconnections of HTTP due to client errors. Queue processing tried to send more wantlists to the HTTP peers, but they were disconnected, so it attempted to send them via Bitswap instead. Bitswap failed after several seconds of trying to look up addresses. In the meantime, HTTP had already reconnected, but the message queues were shut down nevertheless. The improved logic attempts to reduce this situation, as well as unwanted Disconnects due to bad routing of NewMessageSender.
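
The second item above (avoiding Connect/Disconnect races in httpnet) can be pictured as a small per-peer state guard. The following is only a minimal sketch under stated assumptions, not the code from this PR: the `peerConn` type, its state constants, and all method names are hypothetical and chosen for illustration. It shows one way to refuse a disconnect while connecting, refuse a connect while disconnecting, and decline to report connectedness during either transition.

```go
package main

import "sync"

// connState is a hypothetical per-peer connection state. These names are
// illustrative assumptions and are not taken from the httpnet code.
type connState int

const (
	stateDisconnected connState = iota
	stateConnecting
	stateConnected
	stateDisconnecting
)

// peerConn guards the state transitions of a single peer with a mutex.
type peerConn struct {
	mu    sync.Mutex
	state connState
}

// BeginConnect moves disconnected -> connecting. It returns false in any
// other state, so a connect cannot start while a disconnect is in flight.
func (p *peerConn) BeginConnect() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.state != stateDisconnected {
		return false
	}
	p.state = stateConnecting
	return true
}

// EndConnect settles a connect attempt started by BeginConnect.
func (p *peerConn) EndConnect(ok bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.state != stateConnecting {
		return
	}
	if ok {
		p.state = stateConnected
	} else {
		p.state = stateDisconnected
	}
}

// BeginDisconnect moves connected -> disconnecting. It returns false in any
// other state, so a disconnect cannot start while a connect is in flight.
func (p *peerConn) BeginDisconnect() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.state != stateConnected {
		return false
	}
	p.state = stateDisconnecting
	return true
}

// EndDisconnect settles a disconnect started by BeginDisconnect.
func (p *peerConn) EndDisconnect() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.state == stateDisconnecting {
		p.state = stateDisconnected
	}
}

// Connectedness reports whether the peer is connected. The second return
// value is false while a connect or disconnect is in progress, in which
// case the first value should not be reported to anyone.
func (p *peerConn) Connectedness() (connected bool, settled bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	switch p.state {
	case stateConnected:
		return true, true
	case stateDisconnected:
		return false, true
	default:
		return false, false
	}
}
```

In this sketch a caller would wrap the actual network operation between the `Begin…` and `End…` calls and skip the operation when `Begin…` returns false. Whether a rejected operation should be dropped or retried later is a design choice the sketch leaves open.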
2eb609b to
b0ed944
Compare
gammazero
approved these changes
Jul 24, 2025
This was referenced Oct 13, 2025