@dapplion (Collaborator) commented Nov 7, 2025

Issue Addressed

Allows Lighthouse to bootstrap from a state for a checkpoint that is not finalized. This feature could save Ethereum Mainnet when shit hits the fan (i.e. a long period of non-finality).

Why can't we just checkpoint sync to a non-finalized checkpoint today?

  • You can't import any new block into fork-choice
  • You can't propose blocks
  • You can't sync

Why, and how to solve it? Keep reading :)

Proposed Changes

Let's consider a node that wants to bootstrap from Checkpoint A.

(Screenshot: diagram of the three checkpoints of interest when bootstrapping from Checkpoint A.)

The node has 3 checkpoints of interest, and we will use the following naming conventions:

  • Finalized checkpoint: The network's finalized checkpoint or finalized_checkpoint.on_chain()
  • Justified checkpoint: The network's justified checkpoint or justified_checkpoint.on_chain()
  • Checkpoint A: The checkpoint sync checkpoint, or anchor state, or local irreversible checkpoint matching both finalized_checkpoint.local() and justified_checkpoint.local()

Different parts of Lighthouse want to use either the network checkpoints (on_chain) or the node's local view (local). To force consumers to think about which one they want, the fork-choice now exposes ForkChoiceCheckpoint instead of just Checkpoint:

pub enum ForkChoiceCheckpoint {
    Local {
        local: Checkpoint,
        on_chain: Checkpoint,
    },
    OnChain(Checkpoint),
}
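
Consumers then read the view they want through explicit accessors. A minimal sketch of their assumed shape (the PR's exact bodies may differ; when there is no divergence both views return the same checkpoint):

impl ForkChoiceCheckpoint {
    /// The node's local view: the anchor / irreversible checkpoint when it is
    /// ahead of the network's view, otherwise the on-chain checkpoint.
    pub fn local(&self) -> Checkpoint {
        match self {
            Self::Local { local, .. } => *local,
            Self::OnChain(checkpoint) => *checkpoint,
        }
    }

    /// The network's view, as observed on-chain.
    pub fn on_chain(&self) -> Checkpoint {
        match self {
            Self::Local { on_chain, .. } => *on_chain,
            Self::OnChain(checkpoint) => *checkpoint,
        }
    }
}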

The most relevant places where we use these checkpoints are:

| Item | Checkpoint to use |
| --- | --- |
| Gossip verification: reject objects older than the finalized checkpoint | local |
| Fork-choice irreversible checkpoint: reject blocks that do not descend from this checkpoint | local |
| Fork-choice filter tree function: only heads that descend from this block are viable | local |
| Fork-choice filter tree function: only heads with correct finalization and justification | on_chain |
| Casper FFG votes: source checkpoint | on_chain |
| Status message | on_chain |
| Sync forward range sync start epoch | local |
| Beacon HTTP API finalized + justified tag | on_chain |

Let me justify each one. On unstable, every item would use local, and I'll explain why that breaks.

Gossip verification, reject objects older than the finalized checkpoint

We can't import blocks or objects that don't descend from our anchor state because we don't have the pre-states. We need to use local since we may not have the on_chain finalized state available.
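
A sketch of that gossip-time check, with assumed names and signature (Slot and Epoch are the types-crate types; this is illustrative, not the exact verification code):

// Reject gossip objects at or before the *local* finalized/anchor slot: we
// don't hold pre-states for anything older, regardless of network finality.
fn is_older_than_local_finality(
    object_slot: Slot,
    finalized: ForkChoiceCheckpoint,
    slots_per_epoch: u64,
) -> bool {
    object_slot <= finalized.local().epoch.start_slot(slots_per_epoch)
}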

Fork-choice irreversible checkpoint, reject blocks that do not descend from this checkpoint

Same as above: reject blocks that conflict with our local irreversible checkpoint.

Fork-choice filter tree function, only heads that descend from this block are viable

While we could use the on_chain justified checkpoint here, we don't have its ProtoNode available. To reduce the diff in the fork-choice code, we use the local one. However, it's always true that justified_checkpoint.local will equal or be a descendant of justified_checkpoint.on_chain.
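
A sketch of that ancestry check (illustrative names; is_descendant stands in for the proto-array ancestry lookup):

// Heads are viable only if they descend from the local justified checkpoint,
// whose ProtoNode is guaranteed to exist. Because justified_checkpoint.local
// equals or descends from justified_checkpoint.on_chain, this admits no extra
// heads relative to using the on_chain checkpoint.
fn head_descends_from_justified(
    fork_choice: &ProtoArrayForkChoice,
    head_root: Hash256,
    justified: ForkChoiceCheckpoint,
) -> bool {
    fork_choice.is_descendant(justified.local().root, head_root)
}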

Fork-choice filter tree function, only heads with correct finalization and justification

Our ProtoNode objects track the finalized and justified checkpoints of their states. Those are the on_chain ones, so to make those blocks viable we need to compare against the on_chain checkpoints. Otherwise we end up with a fork-choice like the dump below, where all nodes imported after the anchor block are non-viable. Note that the block at slot 643 has non-matching justified and finalized checkpoints.

dump of debug/fork-choice on unstable

{
  "justified_checkpoint": {
    "epoch": "20",
    "root": "0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2"
  },
  "finalized_checkpoint": {
    "epoch": "20",
    "root": "0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2"
  },
  "fork_choice_nodes": [
    {
      "slot": "640",
      "block_root": "0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2",
      "parent_root": null,
      "justified_epoch": "20",
      "finalized_epoch": "20",
      "weight": "1574400000000",
      "validity": "valid",
      "execution_block_hash": "0xa668211691c40797287c258d328cec77d0edd12956eebd5c64440adb489eaca0"
    },
    {
      "slot": "643",
      "block_root": "0x8b9b01aadc0f7ead90ed7043da3cedfb17a967a7552a9d45dfcd5ce4cc4ae073",
      "parent_root": "0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2",
      "justified_epoch": "16",
      "finalized_epoch": "15",
      "weight": "1574400000000",
      "validity": "optimistic",
      "execution_block_hash": "0xdc9799a99f3e07da0b45ce04586bbcb999910907bea8c86c68e3faf03b878f86"
    },
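
For reference, a sketch of that comparison (illustrative plumbing, assuming ProtoNode tracks its state's checkpoints as Option<Checkpoint> fields):

// A node is viable for head only if the checkpoints its state computed match
// the on_chain checkpoints. Comparing against the local ones would mark every
// block imported after the anchor as non-viable, as in the dump above.
fn checkpoints_match(
    node: &ProtoNode,
    justified_on_chain: Checkpoint,
    finalized_on_chain: Checkpoint,
) -> bool {
    node.justified_checkpoint == Some(justified_on_chain)
        && node.finalized_checkpoint == Some(finalized_on_chain)
}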

Casper FFG votes, source checkpoint

We must use the on_chain ones to prevent surround votes and for our votes to be includable on-chain.
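
A sketch of attestation production under this rule (the AttestationData fields are the spec ones; the surrounding glue is illustrative):

fn make_attestation_data(
    slot: Slot,
    index: u64,
    head_root: Hash256,
    target: Checkpoint,
    justified: ForkChoiceCheckpoint,
) -> AttestationData {
    AttestationData {
        slot,
        index,
        beacon_block_root: head_root,
        // Never the local anchor checkpoint: an attestation sourced from it
        // would not be includable on-chain and could create a surround vote.
        source: justified.on_chain(),
        target,
    }
}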

Status message

It's best to tell other nodes that we share the same view of finality. Otherwise we look to them like we are "behind", and they may try to fetch from us a finalized chain that doesn't finalize at that epoch.
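
A sketch of Status construction under this rule (the StatusMessage fields are the p2p-spec ones; the fork_digest/head plumbing is illustrative):

fn build_status(
    fork_digest: [u8; 4],
    finalized: ForkChoiceCheckpoint,
    head_root: Hash256,
    head_slot: Slot,
) -> StatusMessage {
    // Advertise the network's finality, not our local anchor, so peers don't
    // mistake us for a node that is behind.
    let finalized = finalized.on_chain();
    StatusMessage {
        fork_digest,
        finalized_root: finalized.root,
        finalized_epoch: finalized.epoch,
        head_root,
        head_slot,
    }
}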

Sync forward range sync start epoch

Range sync assumes that the split slot == the finalized slot, but we only need to sync blocks that descend from the split slot (= anchor slot).
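
In sketch form (illustrative; the real code threads this through sync's state):

// Forward range sync starts at the local (anchor/split) epoch: every block we
// must import descends from it, and nothing older is in our database.
fn range_sync_start_epoch(finalized: ForkChoiceCheckpoint) -> Epoch {
    finalized.local().epoch
}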

Other changes

LH currently has a manual finalization mechanism via an HTTP API call. It triggers a special store finalization routine and forces gossip to also filter by split slot. I have changed the manual finalization route to advance the fork-choice's local irreversible checkpoint and run the regular finalization routine. Overall this looks simpler and more maintainable.

@dapplion requested a review from jxs as a code owner on November 7, 2025.

@dapplion (Collaborator, Author) commented Nov 7, 2025

Test of ff431ac

I started a Kurtosis network with 6 participants on latest unstable:

participants:
  - el_type: geth
    el_image: ethereum/client-go:latest
    cl_type: lighthouse
    cl_image: sigp/lighthouse:latest-unstable
    cl_extra_params:
      - --target-peers=7
    vc_extra_params:
      - --use-long-timeouts
      - --long-timeouts-multiplier=3
    count: 6
    validator_count: 16
network_params:
  electra_fork_epoch: 0
  seconds_per_slot: 3
  genesis_delay: 400
global_log_level: debug
snooper_enabled: false
additional_services:
  - dora
  - spamoor
  - prometheus_grafana
  - tempo

I let the network run for ~15 epochs, then stopped 50% of the validators. I let the network run in non-finality for many epochs, then started a docker build of this branch:

version: "3.9"

services:
  cl-lighthouse-syncer:
    image: "sigp/lighthouse:non-fin"
    command: >
      lighthouse beacon_node
      --debug-level=debug
      --datadir=/data/lighthouse/beacon-data
      --listen-address=0.0.0.0
      --port=9000
      --http
      --http-address=0.0.0.0
      --http-port=4000
      --disable-packet-filter
      --execution-endpoints=http://172.16.0.89:8551
      --jwt-secrets=/jwt/jwtsecret
      --suggested-fee-recipient=0x8943545177806ED17B9F23F0a21ee5948eCaa776
      --disable-enr-auto-update
      --enr-address=172.16.0.21
      --enr-tcp-port=9000
      --enr-udp-port=9000
      --enr-quic-port=9001
      --quic-port=9001
      --metrics
      --metrics-address=0.0.0.0
      --metrics-allow-origin=*
      --metrics-port=5054
      --enable-private-discovery
      --testnet-dir=/network-configs
      --boot-nodes=enr:-OK4QHWvCmwiaEj8437Z6Wlk32gLVM5Hbw9n6PesII42toDOPmdquevxog8OS8SMMru3VjRvo3qOk80qCXzOUEa8ecoDh2F0dG5ldHOIAAAAAAAAAMCGY2xpZW501opMaWdodGhvdXNlijguMC4wLXJjLjKEZXRoMpASBoYoYAAAOP__________gmlkgnY0gmlwhKwQABKEcXVpY4IjKYlzZWNwMjU2azGhAmc8-hvS_9yO5fBwlBhgTYVDSdOtFJW7uVpTmYkVcZBWiHN5bmNuZXRzAIN0Y3CCIyiDdWRwgiMo
      --target-peers=3
      --execution-timeout-multiplier=3
      --checkpoint-block=/blocks/block_640.ssz
      --checkpoint-state=/blocks/state_640.ssz
      --checkpoint-blobs=/blocks/blobs_640.ssz
    environment:
      - RUST_BACKTRACE=full
    extra_hosts:
      # Allow container to reach host service (Linux-compatible)
      - "host.docker.internal:host-gateway"
    volumes:
      - configs:/network-configs
      - jwt:/jwt
      - /root/kurtosis-non-fin/blocks:/blocks
    ports:
      - "33400:4000/tcp"   # HTTP API
      - "33554:5054/tcp"   # Metrics
      - "33900:9000/tcp"   # Libp2p TCP
      - "33900:9000/udp"   # Libp2p UDP
      - "33901:9001/udp"   # QUIC
    networks:
      kt:
        ipv4_address: 172.16.0.88
    shm_size: "64m"

  lcli-mock-el:
    image: sigp/lcli
    command: >
      lcli mock-el
      --listen-address 0.0.0.0
      --listen-port 8551
      --jwt-output-path=/jwt/jwtsecret
    volumes:
      - jwt:/jwt
    ports:
      - "33851:8551"
    networks:
      kt:
        ipv4_address: 172.16.0.89

networks:
  kt:
    external: true
    name: kt-quiet-crater

volumes:
  configs:
    name: files-artifact-expansion--e56f64e9c6aa4409b27b11e37d1ab4d3--bc0964a8b6c54745ba6473aaa684a81e
    external: true
  jwt:
    name: files-artifact-expansion--870bc5edd3eb44598f50a70ada54cd31--bc0964a8b6c54745ba6473aaa684a81e
    external: true

All logs below are from the cl-lighthouse-syncer container.

The node started with checkpoint sync at a checkpoint more recent than the latest finalized one, specifically epoch 20. The node range synced to head without issues. See the Synced log with finalized_checkpoint: 0xc1edeaf0997ead34936cf20372084f0348ffb79366d453c19f2ef0a1536e766a/15/local/0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2/20

Then I triggered manual finalization into a more recent non-finalized checkpoint. It now logs finalized_checkpoint: 0xc1edeaf0997ead34936cf20372084f0348ffb79366d453c19f2ef0a1536e766a/15/local/0xe155e7846c20f1db50d53e9257c9eaa48c07dc5426312aded75be46672c3d022/727

Oct 25 13:42:24.501 INFO  Synced                                        peers: "3", exec_hash: "0x1ec25768c49daf9edf41a93a23ce3c2419c0412d016a0a32b135922947dd91a0 (unverified)", finalized_checkpoint: 0xc1edeaf0997ead34936cf20372084f0348ffb79366d453c19f2ef0a1536e766a/15/local/0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2/20, epoch: 732, block: "0x2e69210f7caba54c127b79b1959ab5fdb9fe04948365ce4a46d44e1f103fd7e1", slot: 23439
Oct 25 13:42:27.127 DEBUG Processed HTTP API request                    elapsed_ms: 63.17743300000001, status: 200 OK, path: /lighthouse/finalize, method: POST
Oct 25 13:42:27.501 INFO  Synced                                        peers: "3", exec_hash: "0x1ec25768c49daf9edf41a93a23ce3c2419c0412d016a0a32b135922947dd91a0 (unverified)", finalized_checkpoint: 0xc1edeaf0997ead34936cf20372084f0348ffb79366d453c19f2ef0a1536e766a/15/local/0xe155e7846c20f1db50d53e9257c9eaa48c07dc5426312aded75be46672c3d022/727, epoch: 732, block: "   …  empty", slot: 23440
Oct 25 13:42:27.668 DEBUG Starting database pruning                     split_prior_to_migration: Split { slot: Slot(640), state_root: 0x890ac4381ba8306416f7cc2c8af52d7af0292b5a5b78e429921c98d917256d1b, block_root: 0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2 }, new_finalized_checkpoint: Checkpoint { epoch: Epoch(727), root: 0xe155e7846c20f1db50d53e9257c9eaa48c07dc5426312aded75be46672c3d022 }, new_finalized_state_root: 0x61dd07935440fe95b50a5bad21760d8ae96cc58439ba6047b5c41caa89ee06f6
Oct 25 13:42:27.878 DEBUG Extra pruning information                     new_finalized_checkpoint: Checkpoint { epoch: Epoch(727), root: 0xe155e7846c20f1db50d53e9257c9eaa48c07dc5426312aded75be46672c3d022 }, new_finalized_state_root: 0x61dd07935440fe95b50a5bad21760d8ae96cc58439ba6047b5c41caa89ee06f6, split_prior_to_migration: Split { slot: Slot(640), state_root: 0x890ac4381ba8306416f7cc2c8af52d7af0292b5a5b78e429921c98d917256d1b, block_root: 0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2 }, newly_finalized_blocks: 22625, newly_finalized_state_roots: 22625, newly_finalized_states_min_slot: 640, required_finalized_diff_state_slots: [Slot(23264), Slot(23040), Slot(22528), Slot(16384), Slot(640)], kept_summaries_for_hdiff: [(0x890ac4381ba8306416f7cc2c8af52d7af0292b5a5b78e429921c98d917256d1b, Slot(640)), (0x77de49b559e30ac0ffd9f1fdff87c19a57803fc8036b62f387dcc55885a83f47, Slot(16384)), (0xa99ffc63658d2cd51b8d0f9e59caa755f376cd46e1d08ecc0acc9ff77eab4192, Slot(22528)), (0x5d8a9d02d3f88d1c8a51f8ed533933e593bee96f335756336bbc654c4a898542, Slot(23040))], state_summaries_count: 22989, state_summaries_dag_roots: [(0x890ac4381ba8306416f7cc2c8af52d7af0292b5a5b78e429921c98d917256d1b, DAGStateSummary { slot: Slot(640), latest_block_root: 0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2, latest_block_slot: Slot(640), previous_state_root: 0xc210447f3b63eca6d76073d7f647a22a069b6c69c0dd275a723f2ed8acf566fd })], finalized_and_descendant_state_roots_of_finalized_checkpoint: 277, blocks_to_prune: 0, states_to_prune: 22708
Oct 25 13:42:28.162 DEBUG Database pruning complete                     new_finalized_state_root: 0x61dd07935440fe95b50a5bad21760d8ae96cc58439ba6047b5c41caa89ee06f6
Oct 25 13:42:28.164 INFO  Starting database compaction                  old_finalized_epoch: 20, new_finalized_epoch: 727

Then I restarted the validators and the network finalized. See that the node pruned from the latest manual finalization. It now logs finalized_checkpoint: 0x13700c4b236867eb7bf1a6752fe6729118cfcf5d70f5bc1916e70ecea542d01a/736

Oct 25 13:51:14.585 DEBUG Starting database pruning                     split_prior_to_migration: Split { slot: Slot(23264), state_root: 0x61dd07935440fe95b50a5bad21760d8ae96cc58439ba6047b5c41caa89ee06f6, block_root: 0xe155e7846c20f1db50d53e9257c9eaa48c07dc5426312aded75be46672c3d022 }, new_finalized_checkpoint: Checkpoint { epoch: Epoch(736), root: 0x13700c4b236867eb7bf1a6752fe6729118cfcf5d70f5bc1916e70ecea542d01a }, new_finalized_state_root: 0x8de7833fb5b0a73211982b32bfa5c1e7b79154386150a04d6be60afd62b92988
Oct 25 13:53:51.500 INFO  Synced                                        peers: "3", exec_hash: "0x875c7aa7b85bd792a7842858a34c507e70f5eb05af80cfe170c891cacbe15c19 (unverified)", finalized_checkpoint: 0x13700c4b236867eb7bf1a6752fe6729118cfcf5d70f5bc1916e70ecea542d01a/736, epoch: 739, block: "0x70782ea183f9c5edfbf9709e9728b2a8c775b2f72e70b9e715f1d815cc236471", slot: 23668

Then I stopped 50% of the validators again and triggered manual finalization. See that it transitions from finalized_checkpoint: 0x7c51f4c364f60561e9b39931264b54c7656dd74fb99c5a048c09c16126bb70ce/740 to finalized_checkpoint: 0x7c51f4c364f60561e9b39931264b54c7656dd74fb99c5a048c09c16126bb70ce/740/local/0xff5d2b08ebaf5bec69c1e9251d9ad788c6350e0a6ece69a27b13fa477af76729/746

Oct 25 14:09:24.500 INFO  Synced                                        peers: "3", exec_hash: "0xa106339a1ccf80a10e1c5cdd2b50e8d5c58cb87f66c2574527d054d710886a5c (unverified)", finalized_checkpoint: 0x7c51f4c364f60561e9b39931264b54c7656dd74fb99c5a048c09c16126bb70ce/740, epoch: 749, block: "   …  empty", slot: 23979

Oct 25 14:09:26.157 DEBUG Processed HTTP API request                    elapsed_ms: 0.6780459999999999, status: 200 OK, path: /lighthouse/finalize, method: POST
Oct 25 14:09:27.111 DEBUG Starting database pruning                     split_prior_to_migration: Split { slot: Slot(23680), state_root: 0xce62a2b2dd6f00d9aacb3469050dcd1d1035e84457adba3d85f6142be3a2013c, block_root: 0x7c51f4c364f60561e9b39931264b54c7656dd74fb99c5a048c09c16126bb70ce }, new_finalized_checkpoint: Checkpoint { epoch: Epoch(746), root: 0xff5d2b08ebaf5bec69c1e9251d9ad788c6350e0a6ece69a27b13fa477af76729 }, new_finalized_state_root: 0x77c39c0504220d7a563a9403e25a05ccf46d7294c4282787f564559a191a6f2d

Oct 25 14:09:27.504 INFO  Synced                                        peers: "3", exec_hash: "0xa106339a1ccf80a10e1c5cdd2b50e8d5c58cb87f66c2574527d054d710886a5c (unverified)", finalized_checkpoint: 0x7c51f4c364f60561e9b39931264b54c7656dd74fb99c5a048c09c16126bb70ce/740/local/0xff5d2b08ebaf5bec69c1e9251d9ad788c6350e0a6ece69a27b13fa477af76729/746, epoch: 749, block: "   …  empty", slot: 23980

I restarted the validators, and the network finalized again:

Oct 25 14:21:39.500 INFO  Synced                                        peers: "3", exec_hash: "0x2ed485a642ad02a2f11208c8af1aafbc7f2e961344fb47424c61d64fee71f43e (unverified)", finalized_checkpoint: 0x67160358278de47d9d7a4c88bd07391c7470edc4a431047145d4edd183de7915/755, epoch: 757, block: "0xca984edf928634d9f7460b1cfe64ed73b80f99d8ed8cebda2a752e9a6cb3c995", slot: 24224

Note: I only tested manual finalization and checkpoint syncing into blocks that are the first in their epoch. In prior tests, using non-aligned blocks broke, and I still don't know why.

// finality is ahead of the split and the split block has been pruned, as `is_descendant` will
// return `false` in this case.
let fork_choice = chain.canonical_head.fork_choice_read_lock();
let attestation_block_root = attestation_data.beacon_block_root;
@dapplion (Collaborator, Author) commented:

Please read the PR body description before the diff :)

Comment on lines +217 to +218
justified_state_root: anchor_state_root,
finalized_checkpoint: finalized_checkpoint_on_chain,
@dapplion (Collaborator, Author) commented:

Very relevant change: fork-choice is now initialized to the network's finalized and justified checkpoints. Previously we always initialized to a "dummy" checkpoint derived from the anchor state. That "dummy" checkpoint was correct because we expected the anchor state to be exactly the finalized state.

With this change the initial checkpoints have a root for which we don't have a ProtoNode available. This is fine; see the fork-choice diff.

.canonical_head
.cached_head()
.finalized_checkpoint()
.local()
@dapplion (Collaborator, Author) commented:

When merging unstable, this line did not compile. The ForkChoiceCheckpoint type forced me to think about whether backfill wanted to read the network's finalized checkpoint or the local irreversible checkpoint here.

) -> Result<(), Error> {
self.local_irreversible_checkpoint = checkpoint;
if self.local_irreversible_checkpoint.epoch > self.justified_checkpoint.epoch {
self.update_justified_balances(checkpoint, state_root)?;
@dapplion (Collaborator, Author) commented:

@michaelsproul I think justified_balances should match the on_chain checkpoint whenever possible. Otherwise the fork-choices of different nodes won't be consistent. On initialization we can try to fetch the actual state for the justified balances.

@michaelsproul (Member) commented Nov 12, 2025

Test failure is caused by this:

    #[serde(skip)]
    /// The `genesis` field is not serialized or deserialized by `serde` to ensure it is defined
    /// via the CLI at runtime, instead of from a configuration file saved to disk.
    pub genesis: ClientGenesis,

I'm going to get rid of that skip and the comment, seeing as it isn't relevant while we don't have config files, and it just makes this feature impossible to test.

Skipping it doesn't provide any safety either: it just means it gets initialized to Default::default().
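
A self-contained demonstration of that behaviour (u32 stands in for ClientGenesis):

use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct Config {
    // With `skip`, serde never reads this field from the input; it is filled
    // with `Default::default()` and any on-disk value is silently ignored.
    #[serde(skip)]
    genesis: u32,
    name: String,
}

fn main() {
    let cfg: Config = serde_json::from_str(r#"{"name": "node", "genesis": 7}"#).unwrap();
    assert_eq!(cfg.genesis, 0); // the `7` from the "file" was ignored
    assert_eq!(cfg.name, "node");
}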
