-
Notifications
You must be signed in to change notification settings - Fork 38.2k
p2p: improve TxOrphanage denial of service bounds #31829
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code Coverage & BenchmarksFor details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31829. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. LLM Linter (✨ experimental)Possible typos and grammar issues: List of typos found in added lines:
drahtbot_id_4_m |
bfc78fa to
765fcdf
Compare
|
🚧 At least one of the CI tasks failed. HintsTry to run the tests locally, according to the documentation. However, a CI failure may still
Leave a comment here, if you need help tracking down a confusing failure. |
765fcdf to
d302324
Compare
d302324 to
0ccf21e
Compare
|
Rebased |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The resource bounds additions seem to make sense, still working through the workset change implications.
I've got a minimal fuzz harness checking that the "honest" peer cannot be evicted, please feel free to take it: https://github.com/instagibbs/bitcoin/tree/2025-01-orphanage-peer-dos_greg_2
0ccf21e to
7aaf390
Compare
|
Thanks @instagibbs for the testing and review, added your fuzz commits and took comments. Still need to write the p2p_orphan_handling test. |
7aaf390 to
61b40f0
Compare
|
🚧 At least one of the CI tasks failed. HintsTry to run the tests locally, according to the documentation. However, a CI failure may still
Leave a comment here, if you need help tracking down a confusing failure. |
61b40f0 to
ff82676
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Halfway through, some minor points below - my main conceptual question is why m_total_announcements is a meaningful metric in limiting the orphanage.
My understanding is that m_total_orphan_usage exists to limit memory usage, and m_total_announcements to limit CPU usage - but why the number of announcements instead of number of orphans?
Why would it make the situation any less DoSy if we remove an announcer but keep the orphan? Since we only assign the tx to one peer's workset after 7426afb, more announcers for the same number of orphans doesn't really mean any additional work.
Yep, to limit CPU usage. The complexity of eviction for example is bounded by the total number of announcements: in the worst case, each orphan has many announcers and the
Perhaps I should have stated this in the OP more explicitly, but a major motivation for this eviction strategy is to prevent any peer from being to evict any announcements of another peer, hence the per-peer limits. If we changed the eviction code to remove orphans wholesale instead of just announcements, we'd have a similar situation to today's: an attacker can cause churn of an honest orphan by announcing it along with a lot of other orphans. So evicting announcements instead of orphans isn't less DoSy, but it does make the orphanage less churnable. |
19194f2 to
3903310
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach ACK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
combing through tests a bit, think I spotted the CI failure cause
790f6e7 to
5002462
Compare
|
ReACK 5002462 A couple additional assertions, some nits addressed, and improvements in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 5002462
Wish orphan traffic was higher for more live testing on mainnet. Will test anyways and report back if I see anything odd.
I'm not convinced that we really need EraseForBlock anymore, but I don't think it's a unique danger and it's nice that we get some real benchmarks for it in master.
|
Looking at logs, was wondering if we can get some more information about which peer/ which tx is being evicted from the orphanage? I'm eyeballing some logs since I've been running variants of this for a few weeks now, and the e.g.: "2025-07-07T11:27:34.481530Z [txpackages] orphanage overflow, removed 1 tx (4 announcements)" Being able to quickly see that, f.e., the only reason we're evicting is because of a single faulty / spammy peer would be helpful to separate from "legitimate" traffic, where you'd expect to see more a round-robin eviction pattern. |
Will add to the followup. What about adding a log for each peer chosen in the loop? So for example 1 call to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code-review ACK 5002462
With two suggestions regarding sanity checks on lower_bound iterators. Probably more than just nits, but still fine to tackle in the follow-up IMHO.
| if (!Assume(it_ann->m_announcer == worst_peer)) break; | ||
| if (!Assume(it_ann != m_orphans.get<ByPeer>().end())) break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these two Assume lines should be swapped I think, to prevent potential dereference of an end() iterator (which, AFAIR, would be UB)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, added to #32941
| for (const auto& wtxid : it_by_prev->second) { | ||
| // Belt and suspenders, each entry in m_outpoint_to_orphan_it should always have at least 1 announcement. | ||
| auto it = index_by_wtxid.lower_bound(ByWtxidView{wtxid, MIN_PEER}); | ||
| if (!Assume(it != index_by_wtxid.end())) continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (!Assume(it != index_by_wtxid.end())) continue; | |
| if (!Assume(it != index_by_wtxid.end() && it->m_tx->GetWitnessHash() == wtxid)) continue; |
for a full belts and suspenders (though I guess if no m_orphan entry with this wtxid exists, it would still be caught with the next Assume below, as std::distance would return a negative(?) value 🤔 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, added to #32941
|
light ACK 5002462 |
| } else if (command-- == 0) { | ||
| // AddChildrenToWorkSet | ||
| auto tx = read_tx_fn(); | ||
| FastRandomContext rand_ctx(rng.rand256()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Alpine Linux v3.22, using GCC 14.2.0:
[ 74%] Building CXX object src/test/fuzz/CMakeFiles/fuzz.dir/txorphan.cpp.o
In file included from /bitcoin/src/script/script.h:10,
from /bitcoin/src/primitives/transaction.h:11,
from /bitcoin/src/consensus/validation.h:11,
from /bitcoin/src/test/fuzz/txorphan.cpp:6:
/bitcoin/src/crypto/common.h: In function 'void txorphanage_sim_fuzz_target(FuzzBufferType)':
/bitcoin/src/crypto/common.h:53:11: warning: writing 4 bytes into a region of size 0 [-Wstringop-overflow=]
53 | memcpy(ptr, &v, 4);
| ^
/bitcoin/src/test/fuzz/txorphan.cpp:669:55: note: at offset 32 into destination object '<anonymous>' of size 32
669 | FastRandomContext rand_ctx(rng.rand256());
| ~~~~~~~~~~~^~
|
Why not simply evict nodes that are using up a lot of data that isn't resulting in mempool entries? Perhaps even ban nodes that keep doing this? (And perhaps also the ban logic to include a probationary period causing the ban duration to be extended for repeated offenders). |
This PR is part of the orphan resolution project, see #27463.
This design came from collaboration with sipa - thanks.
We want to limit the CPU work and memory used by
TxOrphanageto avoid denial of service attacks. On master, this is achieved by limiting the number of transactions in this data structure to 100, and the weight of each transaction to 400KWu (the largest standard tx) [0]. We always allow new orphans, but if the addition causes us to exceed 100, we evict one randomly. This is dead simple, but has problems:Just jacking up the
-maxorphantxsnumber is not a good enough solution, because it doesn't solve the churnability problem, and the effective resource bounds scale poorly.This PR introduces numbers for {global, per-peer} {memory usage, announcements + number of inputs}, representing resource limits:
LimitOrphansorEraseForBlock.Eviction changes in a few ways:
This PR also:
TxOrphanageas single multi-index container.-maxorphantxsconfig option, as the orphanage no longer limits by unique orphans.This means we can receive 1p1c packages in the presence of spammy peers. It also makes the orphanage more useful and increases our download capacity without drastically increasing orphanage resource usage.
[0]: This means the effective memory limit in orphan weight is 100 * 400KWu = 40MWu
[1]: https://delvingbitcoin.org/t/stats-on-orphanage-overflows/1421
[2]: Limit is 3000, which is equivalent to one max size ancestor package (24 transactions can be missing inputs) for each peer (default max connections is 125).
[3]: Orphan weight is used in place of actual memory usage because something like "one maximally sized standard tx" is easier to reason about than "considering the bytes allocated for vin and vout vectors, it needs to be within N bytes..." etc. We can also consider a different formula to encapsulate more the memory overhead but still have an interface that is easy to reason about.
[4]: The limit is 404KWu, which is the maximum size of an ancestor package.
[5]: With 125 peers, this is 50.5MWu, which is a small increase from the existing limit of 40MWu. While the actual memory usage limit is higher (this number does not include the other memory used by
TxOrphanageto store the outpoints map, etc.), this is within the same ballpark as the old limit.