Conversation

@glozow glozow commented Feb 9, 2025

This PR is part of the orphan resolution project, see #27463.

This design came from collaboration with sipa - thanks.

We want to limit the CPU work and memory used by TxOrphanage to avoid denial-of-service attacks. On master, this is achieved by limiting the number of transactions in this data structure to 100, and the weight of each transaction to 400KWu (the largest standard tx) [0]. We always allow new orphans, but if the addition causes us to exceed 100, we evict one randomly (sketched after the list below). This is dead simple, but has problems:

  • It makes the orphanage trivially churnable: any one peer can render it useless by spamming us with lots of orphans. It's possible this is happening: "Looking at data from node alice on 2024-09-14 shows that we’re sometimes removing more than 100k orphans per minute. This feels like someone flooding us with orphans." [1]
  • Opportunistic 1p1c is effectively useless in the presence of adversaries: it relies on pairing a low-feerate tx with a child that happens to be in the orphanage. So if nothing is able to stay in orphanages, we can't expect 1p1cs to propagate.
  • This number is also often insufficient for the volume of orphans we handle: historical data show that overflows are pretty common, and there are times when "it seems like [the node] forgot about the orphans and re-requested them multiple times." [1]
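For context, here is a rough sketch of the master-era behavior described above; all names are simplified stand-ins rather than the real TxOrphanage interface:

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <random>

// Simplified stand-ins; the real logic lives in TxOrphanage on master.
using Wtxid = int;              // placeholder for the real wtxid type
std::map<Wtxid, int> g_orphans; // wtxid -> orphan tx (details elided)

void LimitOrphansOld(size_t max_orphans /* 100 on master */, std::mt19937& rng)
{
    while (g_orphans.size() > max_orphans) {
        // Evict a uniformly random orphan. Because any peer's additions can
        // trigger this, one spammy peer can churn out everyone else's entries.
        std::uniform_int_distribution<size_t> dist(0, g_orphans.size() - 1);
        g_orphans.erase(std::next(g_orphans.begin(), dist(rng)));
    }
}
```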

Just jacking up the -maxorphantxs number is not a good enough solution, because it doesn't solve the churnability problem, and the effective resource bounds scale poorly.

This PR introduces numbers for {global, per-peer} {memory usage, announcements + number of inputs}, representing resource limits:

  • The (constant) global latency score limit is the number of unique (wtxid, peer) pairs in the orphanage, plus the number of inputs spent by those (deduplicated) transactions floor-divided by 10 [2]. This caps the CPU cost, i.e. latency, of any given operation, and does not change with the number of peers we have. Evictions must happen whenever this limit is reached. The primary goal of this limit is to ensure we do not spend more than a few ms on any call to LimitOrphans or EraseForBlock.
  • The (variable) per-peer latency score limit is the global latency score limit divided by the number of peers; it therefore decreases as we gain peers. Peers are allowed to exceed this limit provided the global limit has not been reached.
  • The (constant) per-peer memory usage reservation is the amount of orphan weight [3] reserved for each peer [4]. Reservation means that a peer is effectively guaranteed this amount of space. Peers are allowed to exceed this limit provided the global usage limit has not been reached. The primary goal of this limit is to ensure we don't OOM.
  • The (variable) global memory usage limit is the number of peers multiplied by the per-peer reservation [5]. As such, the global memory usage limit scales up with the number of peers we have. Evictions must happen whenever this limit is reached.
  • We introduce a "Peer DoS Score," defined as the maximum of a "CPU Score" and a "Memory Score." The CPU score is the peer's latency score divided by the per-peer latency score limit; the memory score is the total usage of all orphans announced by the peer divided by the per-peer usage reservation. A sketch of this computation follows the list.
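A minimal sketch of how these scores could be computed; the struct and function names here are illustrative assumptions, not the PR's actual code:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative per-peer bookkeeping.
struct PeerOrphanInfo {
    int64_t m_announcements{0}; // unique (wtxid, this peer) pairs
    int64_t m_inputs{0};        // inputs spent by those (deduplicated) orphans
    int64_t m_usage{0};         // summed weight of orphans this peer announced
};

// Latency score: announcements plus inputs floor-divided by 10.
int64_t LatencyScore(const PeerOrphanInfo& peer)
{
    return peer.m_announcements + peer.m_inputs / 10;
}

// Peer DoS score: the max of the CPU ratio and the memory ratio.
double PeerDoSScore(const PeerOrphanInfo& peer,
                    int64_t per_peer_latency_limit,     // global limit / num peers
                    int64_t per_peer_usage_reservation) // e.g. 404KWu [4]
{
    const double cpu_score = double(LatencyScore(peer)) / double(per_peer_latency_limit);
    const double mem_score = double(peer.m_usage) / double(per_peer_usage_reservation);
    return std::max(cpu_score, mem_score);
}
```

A score above 1 means the peer has gone beyond one of its per-peer allowances, which the design permits only while the corresponding global limit still has headroom.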

Eviction changes in a few ways:

  • It is triggered if either limit is exceeded.
  • On each iteration of the loop, instead of selecting a random orphan, we select a peer and delete 1 of its announcements. Specifically, we select the peer with the highest DoS score, i.e. the maximum of its CPU DoS score (based on announcements) and its memory DoS score (based on tx weight). After the peer has been selected, we evict its oldest announcement (non-reconsiderable sorted before reconsiderable).
  • Instead of evicting orphans, we evict announcements. An orphan stays in the orphanage as long as it has at least 1 announcer. Over several iterations of the loop we may erase all of an orphan's announcers, at which point the orphan itself is erased. The purpose of this change is to prevent a peer from being able to trigger eviction of another peer's orphans. A sketch of the loop follows this list.
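A self-contained sketch of that loop, reusing the PeerOrphanInfo/PeerDoSScore sketch above. The evict_oldest_announcement callback is a stand-in for removing the chosen peer's oldest announcement (non-reconsiderable first) and updating its counters; it is not a real TxOrphanage function:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <map>

using NodeId = int64_t;

void LimitOrphans(std::map<NodeId, PeerOrphanInfo>& peers,
                  int64_t global_latency_limit,
                  int64_t per_peer_usage_reservation,
                  const std::function<void(NodeId)>& evict_oldest_announcement)
{
    if (peers.empty()) return;

    // The global usage limit grows with the peer count, while each peer's
    // share of the latency budget shrinks.
    const int64_t global_usage_limit = int64_t(peers.size()) * per_peer_usage_reservation;
    const int64_t per_peer_latency_limit = global_latency_limit / int64_t(peers.size());

    auto limits_exceeded = [&] {
        int64_t latency{0}, usage{0};
        for (const auto& [id, peer] : peers) {
            latency += LatencyScore(peer);
            usage += peer.m_usage;
        }
        return latency > global_latency_limit || usage > global_usage_limit;
    };

    while (limits_exceeded()) {
        // Pick the peer with the highest DoS score.
        auto worst = std::max_element(peers.begin(), peers.end(),
            [&](const auto& a, const auto& b) {
                return PeerDoSScore(a.second, per_peer_latency_limit, per_peer_usage_reservation)
                     < PeerDoSScore(b.second, per_peer_latency_limit, per_peer_usage_reservation);
            });
        // Evict only its oldest announcement. The callback must update the
        // peer's counters, or this loop would not terminate.
        evict_oldest_announcement(worst->first);
    }
}
```

Each iteration removes exactly one announcement, so the loop runs at most (total number of announcements) times, which is exactly what the global latency score limit bounds.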

This PR also:

  • Reimplements TxOrphanage as a single multi-index container (see the sketch after this list).
  • Effectively bounds the number of transactions that can be in a peer's work set by ensuring it is a subset of the peer's announcements.
  • Removes the -maxorphantxs config option, as the orphanage no longer limits by unique orphans.
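The review snippets further down reference indices like ByPeer and ByWtxid on m_orphans. Here is a simplified sketch of how such a container might be declared, assuming Boost.MultiIndex; the field set, tags, and type aliases are illustrative, not the PR's exact definition:

```cpp
#include <array>
#include <cstdint>

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>

// Stand-ins for Bitcoin Core's wtxid and peer id types.
using Wtxid = std::array<unsigned char, 32>;
using NodeId = int64_t;

namespace bmi = boost::multi_index;

// One entry per (orphan, announcer) pair; the real entries also carry the tx
// reference, entry sequence, and reconsiderable flag.
struct Announcement {
    Wtxid m_wtxid;
    NodeId m_announcer;
};

struct ByWtxid {}; // find all announcers of one orphan
struct ByPeer {};  // walk one peer's announcements

using OrphanMap = bmi::multi_index_container<
    Announcement,
    bmi::indexed_by<
        bmi::ordered_unique<bmi::tag<ByWtxid>,
            bmi::composite_key<Announcement,
                bmi::member<Announcement, Wtxid, &Announcement::m_wtxid>,
                bmi::member<Announcement, NodeId, &Announcement::m_announcer>>>,
        bmi::ordered_non_unique<bmi::tag<ByPeer>,
            bmi::member<Announcement, NodeId, &Announcement::m_announcer>>>>;
```

A single container keeps every announcement consistent across both views: erasing an announcement through one index automatically removes it from the other.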

This means we can receive 1p1c packages in the presence of spammy peers. It also makes the orphanage more useful and increases our download capacity without drastically increasing orphanage resource usage.

[0]: This means the effective memory limit in orphan weight is 100 * 400KWu = 40MWu
[1]: https://delvingbitcoin.org/t/stats-on-orphanage-overflows/1421
[2]: Limit is 3000, which is equivalent to one max size ancestor package (24 transactions can be missing inputs) for each peer (default max connections is 125).
[3]: Orphan weight is used in place of actual memory usage because something like "one maximally sized standard tx" is easier to reason about than "considering the bytes allocated for the vin and vout vectors, it needs to be within N bytes..." etc. We could also consider a different formula that captures more of the memory overhead but still has an interface that is easy to reason about.
[4]: The limit is 404KWu, which is the maximum size of an ancestor package.
[5]: With 125 peers, this is 50.5MWu, which is a small increase from the existing limit of 40MWu. While the actual memory usage limit is higher (this number does not include the other memory used by TxOrphanage to store the outpoints map, etc.), this is within the same ballpark as the old limit.

@glozow glozow added the P2P label Feb 9, 2025

DrahtBot commented Feb 9, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31829.

Reviews

See the guideline for information on the review process.

Type         | Reviewers
ACK          | marcofleon, instagibbs, theStack, achow101
Approach ACK | sipa, jsarenik
Stale ACK    | monlovesmango

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #32896 (wallet, rpc: add v3 transaction creation and wallet support by ishaanam)
  • #32827 (mempool: Avoid needless vtx iteration during IBD by l0rinc)
  • #32430 (test: Add and use ElapseTime helper by maflcko)
  • #30277 ([DO NOT MERGE] Erlay: bandwidth-efficient transaction relay protocol (Full implementation) by sr-gi)
  • #29415 (Broadcast own transactions only via short-lived Tor or I2P connections by vasild)
  • #28690 (build: Introduce internal kernel library by TheCharlatan)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental)

Possible typos and grammar issues:

List of typos found in added lines:

  • tranaction -> transaction [misspelling in comment “memory usage of the tranaction”]
  • comphehensive -> comprehensive [misspelling in comment “This is a comphehensive simulation…”]


@glozow glozow force-pushed the 2025-01-orphanage-peer-dos branch from bfc78fa to 765fcdf on February 9, 2025 21:20

DrahtBot commented Feb 9, 2025

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/36925040096

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.


glozow commented Feb 10, 2025

Rebased

@instagibbs instagibbs left a comment

The resource bounds additions seem to make sense, still working through the workset change implications.

I've got a minimal fuzz harness checking that the "honest" peer cannot be evicted, please feel free to take it: https://github.com/instagibbs/bitcoin/tree/2025-01-orphanage-peer-dos_greg_2

@glozow glozow force-pushed the 2025-01-orphanage-peer-dos branch from 0ccf21e to 7aaf390 on February 11, 2025 17:03

glozow commented Feb 11, 2025

Thanks @instagibbs for the testing and review, added your fuzz commits and took comments. Still need to write the p2p_orphan_handling test.

@glozow glozow force-pushed the 2025-01-orphanage-peer-dos branch from 7aaf390 to 61b40f0 on February 11, 2025 17:08
@DrahtBot

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/37041607307


@glozow glozow force-pushed the 2025-01-orphanage-peer-dos branch from 61b40f0 to ff82676 on February 11, 2025 17:25
@mzumsande mzumsande left a comment

Halfway through, some minor points below - my main conceptual question is why m_total_announcements is a meaningful metric in limiting the orphanage.

My understanding is that m_total_orphan_usage exists to limit memory usage, and m_total_announcements to limit CPU usage - but why the number of announcements instead of number of orphans?
Why would it make the situation any less DoSy if we remove an announcer but keep the orphan? Since we only assign the tx to one peer's workset after 7426afb, more announcers for the same number of orphans doesn't really mean any additional work.


glozow commented Feb 11, 2025

My understanding is that m_total_orphan_usage exists to limit memory usage, and m_total_announcements to limit CPU usage - but why the number of announcements instead of number of orphans?

Yep, to limit CPU usage. The complexity of eviction, for example, is bounded by the total number of announcements: in the worst case, each orphan has many announcers, and the MaybeTrimOrphans loop first removes announcements until each orphan has just 1 left, and only then can it remove transactions. See the comment above the declaration: "The loop can run a maximum of m_max_global_announcement times."

Why would it make the situation any less DoSy if we remove an announcer but keep the orphan?

Perhaps I should have stated this more explicitly in the OP, but a major motivation for this eviction strategy is to prevent any peer from being able to evict another peer's announcements, hence the per-peer limits. If we changed the eviction code to remove orphans wholesale instead of just announcements, we'd have a similar situation to today's: an attacker could churn an honest orphan out by announcing it along with a lot of other orphans.

So evicting announcements instead of orphans isn't less DoSy, but it does make the orphanage less churnable.

@glozow glozow force-pushed the 2025-01-orphanage-peer-dos branch 2 times, most recently from 19194f2 to 3903310 on February 12, 2025 04:37
@sipa sipa left a comment

Approach ACK

@instagibbs instagibbs left a comment

combing through tests a bit, think I spotted the CI failure cause

@glozow glozow added this to the 29.0 milestone Feb 12, 2025
@glozow glozow requested a review from sr-gi February 13, 2025 16:59
@glozow glozow force-pushed the 2025-01-orphanage-peer-dos branch from 790f6e7 to 5002462 on July 14, 2025 20:19
@marcofleon

ReACK 5002462

A couple additional assertions, some nits addressed, and improvements in the txorphanage_sim fuzz target since last review. Ran the fuzz tests for a bit on existing corpora to be sure.

@instagibbs instagibbs left a comment

ACK 5002462

Wish orphan traffic was higher for more live testing on mainnet. Will test anyways and report back if I see anything odd.

I'm not convinced that we really need EraseForBlock anymore, but I don't think it's a unique danger and it's nice that we get some real benchmarks for it in master.

@instagibbs

Looking at logs, I was wondering if we can get some more information about which peer / which tx is being evicted from the orphanage? I'm eyeballing some logs, since I've been running variants of this for a few weeks now, and the orphanage overflow string shows up significantly more often due to the non-timeout of announcements after this PR.

e.g.: "2025-07-07T11:27:34.481530Z [txpackages] orphanage overflow, removed 1 tx (4 announcements)"

Being able to quickly see that, e.g., the only reason we're evicting is a single faulty/spammy peer would help separate that from "legitimate" traffic, where you'd expect to see more of a round-robin eviction pattern.


glozow commented Jul 15, 2025

Looking at logs, was wondering if we can get some more information about which peer/ which tx is being evicted from the orphanage?

Will add to the follow-up. What about adding a log line for each peer chosen in the loop? So, for example, one call to LimitOrphans might produce:

[txpackages] peer=25 orphanage overflow, removed 4 announcements
[txpackages] peer=177 orphanage overflow, removed 1 announcements
[txpackages] peer=25 orphanage overflow, removed 1 announcements
[txpackages] orphanage overflow, removed 5 tx (6 announcements)

@theStack theStack left a comment

Code-review ACK 5002462

With two suggestions regarding sanity checks on lower_bound iterators. Probably more than just nits, but still fine to tackle in the follow-up IMHO.

Comment on lines +467 to +468
if (!Assume(it_ann->m_announcer == worst_peer)) break;
if (!Assume(it_ann != m_orphans.get<ByPeer>().end())) break;

these two Assume lines should be swapped I think, to prevent potential dereference of an end() iterator (which, AFAIR, would be UB)

@glozow

thanks, added to #32941

for (const auto& wtxid : it_by_prev->second) {
// Belt and suspenders, each entry in m_outpoint_to_orphan_it should always have at least 1 announcement.
auto it = index_by_wtxid.lower_bound(ByWtxidView{wtxid, MIN_PEER});
if (!Assume(it != index_by_wtxid.end())) continue;

Suggested change:
- if (!Assume(it != index_by_wtxid.end())) continue;
+ if (!Assume(it != index_by_wtxid.end() && it->m_tx->GetWitnessHash() == wtxid)) continue;

for a full belts and suspenders (though I guess if no m_orphan entry with this wtxid exists, it would still be caught with the next Assume below, as std::distance would return a negative(?) value 🤔 )

@glozow

thanks, added to #32941

@achow101

light ACK 5002462

@achow101 achow101 merged commit 80067ac into bitcoin:master Jul 18, 2025
19 checks passed
} else if (command-- == 0) {
// AddChildrenToWorkSet
auto tx = read_tx_fn();
FastRandomContext rand_ctx(rng.rand256());

b113877

On Alpine Linux v3.22, using GCC 14.2.0:

[ 74%] Building CXX object src/test/fuzz/CMakeFiles/fuzz.dir/txorphan.cpp.o
In file included from /bitcoin/src/script/script.h:10,
                 from /bitcoin/src/primitives/transaction.h:11,
                 from /bitcoin/src/consensus/validation.h:11,
                 from /bitcoin/src/test/fuzz/txorphan.cpp:6:
/bitcoin/src/crypto/common.h: In function 'void txorphanage_sim_fuzz_target(FuzzBufferType)':
/bitcoin/src/crypto/common.h:53:11: warning: writing 4 bytes into a region of size 0 [-Wstringop-overflow=]
   53 |     memcpy(ptr, &v, 4);
      |           ^
/bitcoin/src/test/fuzz/txorphan.cpp:669:55: note: at offset 32 into destination object '<anonymous>' of size 32
  669 |                 FastRandomContext rand_ctx(rng.rand256());
      |                                            ~~~~~~~~~~~^~


rebroad commented Aug 19, 2025

Why not simply evict nodes that are using up a lot of data that isn't resulting in mempool entries? Perhaps even ban nodes that keep doing this? (And perhaps the ban logic could include a probationary period, extending the ban duration for repeat offenders.)
