Skip to content

Commit eef02af

Browse files
authored
Fix data availability checker race condition causing partial data columns to be served over RPC (#7961)
Partially resolves #6439, an simpler alternative to #7931. Race condition occurs when RPC data columns arrives after a block has been imported and removed from the DA checker: 1. Block becomes available via gossip 2. RPC columns arrive and pass fork choice check (block hasn't been imported) 3. Block import completes (removing block from DA checker) 4. RPC data columns finish verification and get imported into DA checker This causes two issues: 1. **Partial data serving**: Already imported components get re-inserted, potentially causing LH to serve incomplete data 2. **State cache misses**: Leads to state reconstruction, holding the availability cache write lock longer and increasing race likelihood ### Proposed Changes 1. Never manually remove pending components from DA checker. Components are only removed via LRU eviction as finality advances. This makes sure we don't run into the issue described above. 2. Use `get` instead of `pop` when recovering the executed block, this prevents cache misses in race condition. This should reduce the likelihood of the race condition 3. Refactor DA checker to drop write lock as soon as components are added. This should also reduce the likelihood of the race condition **Trade-offs:** This solution eliminates a few nasty race conditions while allowing simplicity, with the cost of allowing block re-import (already existing). The increase in memory in DA checker can be partially offset by a reduction in block cache size if this really comes an issue (as we now serve recent blocks from DA checker).
1 parent 979ed25 commit eef02af

File tree

4 files changed

+132
-176
lines changed

4 files changed

+132
-176
lines changed

beacon_node/beacon_chain/src/beacon_chain.rs

Lines changed: 0 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -3815,19 +3815,6 @@ impl<T: BeaconChainTypes> BeaconChain<T> {
38153815
.await??
38163816
};
38173817

3818-
// Remove block components from da_checker AFTER completing block import. Then we can assert
3819-
// the following invariant:
3820-
// > A valid unfinalized block is either in fork-choice or da_checker.
3821-
//
3822-
// If we remove the block when it becomes available, there's some time window during
3823-
// `import_block` where the block is nowhere. Consumers of the da_checker can handle the
3824-
// extend time a block may exist in the da_checker.
3825-
//
3826-
// If `import_block` errors (only errors with internal errors), the pending components will
3827-
// be pruned on data_availability_checker maintenance as finality advances.
3828-
self.data_availability_checker
3829-
.remove_pending_components(block_root);
3830-
38313818
Ok(AvailabilityProcessingStatus::Imported(block_root))
38323819
}
38333820

beacon_node/beacon_chain/src/data_availability_checker.rs

Lines changed: 8 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -38,19 +38,18 @@ use crate::observed_data_sidecars::ObservationStrategy;
3838
pub use error::{Error as AvailabilityCheckError, ErrorCategory as AvailabilityCheckErrorCategory};
3939
use types::non_zero_usize::new_non_zero_usize;
4040

41-
/// The LRU Cache stores `PendingComponents`, which can store up to `MAX_BLOBS_PER_BLOCK` blobs each.
41+
/// The LRU Cache stores `PendingComponents`, which store block and its associated blob data:
4242
///
4343
/// * Deneb blobs are 128 kb each and are stored in the form of `BlobSidecar`.
4444
/// * From Fulu (PeerDAS), blobs are erasure-coded and are 256 kb each, stored in the form of 128 `DataColumnSidecar`s.
4545
///
4646
/// With `MAX_BLOBS_PER_BLOCK` = 48 (expected in the next year), the maximum size of data columns
47-
/// in `PendingComponents` is ~12.29 MB. Setting this to 64 means the maximum size of the cache is
48-
/// approximately 0.8 GB.
47+
/// in `PendingComponents` is ~12.29 MB. Setting this to 32 means the maximum size of the cache is
48+
/// approximately 0.4 GB.
4949
///
50-
/// Under normal conditions, the cache should only store the current pending block, but could
51-
/// occasionally spike to 2-4 for various reasons e.g. components arriving late, but would very
52-
/// rarely go above this, unless there are many concurrent forks.
53-
pub const OVERFLOW_LRU_CAPACITY: NonZeroUsize = new_non_zero_usize(64);
50+
/// `PendingComponents` are now never removed from the cache manually are only removed via LRU
51+
/// eviction to prevent race conditions (#7961), so we expect this cache to be full all the time.
52+
pub const OVERFLOW_LRU_CAPACITY: NonZeroUsize = new_non_zero_usize(32);
5453
pub const STATE_LRU_CAPACITY_NON_ZERO: NonZeroUsize = new_non_zero_usize(32);
5554
pub const STATE_LRU_CAPACITY: usize = STATE_LRU_CAPACITY_NON_ZERO.get();
5655

@@ -346,11 +345,6 @@ impl<T: BeaconChainTypes> DataAvailabilityChecker<T> {
346345
.put_pending_executed_block(executed_block)
347346
}
348347

349-
pub fn remove_pending_components(&self, block_root: Hash256) {
350-
self.availability_cache
351-
.remove_pending_components(block_root)
352-
}
353-
354348
/// Verifies kzg commitments for an RpcBlock, returns a `MaybeAvailableBlock` that may
355349
/// include the fully available block.
356350
///
@@ -589,8 +583,8 @@ impl<T: BeaconChainTypes> DataAvailabilityChecker<T> {
589583

590584
// Check indices from cache again to make sure we don't publish components we've already received.
591585
let Some(existing_column_indices) = self.cached_data_column_indexes(block_root) else {
592-
return Ok(DataColumnReconstructionResult::RecoveredColumnsNotImported(
593-
"block already imported",
586+
return Err(AvailabilityCheckError::Unexpected(
587+
"block no longer exists in the data availability checker".to_string(),
594588
));
595589
};
596590

0 commit comments

Comments
 (0)