Safe non-finalized checkpoint sync #8382
base: unstable
Conversation
Test of ff431ac

Started a kurtosis network with 6 participants on latest unstable:

```yaml
participants:
  - el_type: geth
    el_image: ethereum/client-go:latest
    cl_type: lighthouse
    cl_image: sigp/lighthouse:latest-unstable
    cl_extra_params:
      - --target-peers=7
    vc_extra_params:
      - --use-long-timeouts
      - --long-timeouts-multiplier=3
    count: 6
    validator_count: 16
network_params:
  electra_fork_epoch: 0
  seconds_per_slot: 3
  genesis_delay: 400
global_log_level: debug
snooper_enabled: false
additional_services:
  - dora
  - spamoor
  - prometheus_grafana
  - tempo
```

Let the network run for ~15 epochs, then stopped 50% of the validators. Let the network run in non-finality for many epochs, and started a docker build of this branch:

```yaml
version: "3.9"
services:
  cl-lighthouse-syncer:
    image: "sigp/lighthouse:non-fin"
    command: >
      lighthouse beacon_node
      --debug-level=debug
      --datadir=/data/lighthouse/beacon-data
      --listen-address=0.0.0.0
      --port=9000
      --http
      --http-address=0.0.0.0
      --http-port=4000
      --disable-packet-filter
      --execution-endpoints=http://172.16.0.89:8551
      --jwt-secrets=/jwt/jwtsecret
      --suggested-fee-recipient=0x8943545177806ED17B9F23F0a21ee5948eCaa776
      --disable-enr-auto-update
      --enr-address=172.16.0.21
      --enr-tcp-port=9000
      --enr-udp-port=9000
      --enr-quic-port=9001
      --quic-port=9001
      --metrics
      --metrics-address=0.0.0.0
      --metrics-allow-origin=*
      --metrics-port=5054
      --enable-private-discovery
      --testnet-dir=/network-configs
      --boot-nodes=enr:-OK4QHWvCmwiaEj8437Z6Wlk32gLVM5Hbw9n6PesII42toDOPmdquevxog8OS8SMMru3VjRvo3qOk80qCXzOUEa8ecoDh2F0dG5ldHOIAAAAAAAAAMCGY2xpZW501opMaWdodGhvdXNlijguMC4wLXJjLjKEZXRoMpASBoYoYAAAOP__________gmlkgnY0gmlwhKwQABKEcXVpY4IjKYlzZWNwMjU2azGhAmc8-hvS_9yO5fBwlBhgTYVDSdOtFJW7uVpTmYkVcZBWiHN5bmNuZXRzAIN0Y3CCIyiDdWRwgiMo
      --target-peers=3
      --execution-timeout-multiplier=3
      --checkpoint-block=/blocks/block_640.ssz
      --checkpoint-state=/blocks/state_640.ssz
      --checkpoint-blobs=/blocks/blobs_640.ssz
    environment:
      - RUST_BACKTRACE=full
    extra_hosts:
      # Allow container to reach host service (Linux-compatible)
      - "host.docker.internal:host-gateway"
    volumes:
      - configs:/network-configs
      - jwt:/jwt
      - /root/kurtosis-non-fin/blocks:/blocks
    ports:
      - "33400:4000/tcp" # HTTP API
      - "33554:5054/tcp" # Metrics
      - "33900:9000/tcp" # Libp2p TCP
      - "33900:9000/udp" # Libp2p UDP
      - "33901:9001/udp" # QUIC
    networks:
      kt:
        ipv4_address: 172.16.0.88
    shm_size: "64m"
  lcli-mock-el:
    image: sigp/lcli
    command: >
      lcli mock-el
      --listen-address 0.0.0.0
      --listen-port 8551
      --jwt-output-path=/jwt/jwtsecret
    volumes:
      - jwt:/jwt
    ports:
      - "33851:8551"
    networks:
      kt:
        ipv4_address: 172.16.0.89
networks:
  kt:
    external: true
    name: kt-quiet-crater
volumes:
  configs:
    name: files-artifact-expansion--e56f64e9c6aa4409b27b11e37d1ab4d3--bc0964a8b6c54745ba6473aaa684a81e
    external: true
  jwt:
    name: files-artifact-expansion--870bc5edd3eb44598f50a70ada54cd31--bc0964a8b6c54745ba6473aaa684a81e
    external: true
```

All logs below are from the `cl-lighthouse-syncer` node.

- The node starts with checkpoint sync at a more recent checkpoint than the latest finalized one, specifically epoch 20. The node range synced to head without issues.
- Then I triggered manual finalization into a more recent non-finalized checkpoint.
- Then I restarted the validators and the network finalized. The node pruned from the latest manual finalization.
- Then I stopped 50% of the validators again and triggered manual finalization.
- Restarted the validators, and the network finalized again.

Notes: I tested manual finalization and checkpoint syncing only into blocks that are first in their epoch. In prior tests, using non-aligned blocks broke, and I still don't know the reason.
```rust
// finality is ahead of the split and the split block has been pruned, as `is_descendant` will
// return `false` in this case.
let fork_choice = chain.canonical_head.fork_choice_read_lock();
let attestation_block_root = attestation_data.beacon_block_root;
```
Please read the PR body description before the diff :)
```rust
justified_state_root: anchor_state_root,
finalized_checkpoint: finalized_checkpoint_on_chain,
```
Very relevant change: fork-choice is now initialized to the network's finalized and justified checkpoints. Previously we always initialized to a "dummy" checkpoint derived from the anchor state. That "dummy" checkpoint was correct because we expected the anchor state to be exactly the finalized state.
With this change, the initial checkpoints have a root for which we don't have a `ProtoNode` available. This is fine; see the fork-choice diff.
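As a hedged sketch (illustrative names and types, not Lighthouse's real API), the initialization change amounts to this: before, both checkpoints were derived from the anchor; now they are read out of the anchor state, so they can point at roots older than the anchor.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Checkpoint {
    pub epoch: u64,
    pub root: u64, // stand-in for a 32-byte block root
}

pub struct ForkChoiceInit {
    pub justified_checkpoint: Checkpoint,
    pub finalized_checkpoint: Checkpoint,
}

/// Old behaviour (sketch): a "dummy" checkpoint derived from the anchor,
/// correct only when the anchor is exactly the finalized state.
pub fn init_from_anchor(anchor: Checkpoint) -> ForkChoiceInit {
    ForkChoiceInit {
        justified_checkpoint: anchor,
        finalized_checkpoint: anchor,
    }
}

/// New behaviour (sketch): take the network's checkpoints recorded in the
/// anchor state, whose roots may have no ProtoNode in the fork-choice tree.
pub fn init_from_state_checkpoints(
    justified: Checkpoint,
    finalized: Checkpoint,
) -> ForkChoiceInit {
    ForkChoiceInit {
        justified_checkpoint: justified,
        finalized_checkpoint: finalized,
    }
}
```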
```rust
.canonical_head
.cached_head()
.finalized_checkpoint()
.local()
```
When merging unstable, this line did not compile. The `ForkChoiceCheckpoint` type forced me to think about whether backfill wanted to read the network finalized checkpoint or the local irreversible checkpoint.
```rust
) -> Result<(), Error> {
    self.local_irreversible_checkpoint = checkpoint;
    if self.local_irreversible_checkpoint.epoch > self.justified_checkpoint.epoch {
        self.update_justified_balances(checkpoint, state_root)?;
```
@michaelsproul I think `justified_balances` should match the `on_chain` checkpoint whenever possible, otherwise the fork-choices of different nodes won't be consistent. On initialization we can try to fetch the actual state for the justified balances.
Test failure is caused by this:

```rust
#[serde(skip)]
/// The `genesis` field is not serialized or deserialized by `serde` to ensure it is defined
/// via the CLI at runtime, instead of from a configuration file saved to disk.
pub genesis: ClientGenesis,
```

I'm going to get rid of that `#[serde(skip)]`. Skipping it doesn't provide any safety either: it just means the field gets initialized to its default value.
Issue Addressed
Allows Lighthouse to bootstrap into a state for a checkpoint that is not finalized. This feature could save Ethereum Mainnet when shit hits the fan (= long period of non-finality).
Why can't we just checkpoint sync into a non-finalized checkpoint today? And how do we solve it? Keep reading :)
Proposed Changes
Let's consider a node that wants to bootstrap from Checkpoint A.
The node has 3 checkpoints of interest, and we will use the following naming conventions:

- `finalized_checkpoint.on_chain()` and `justified_checkpoint.on_chain()`: the network's view of the checkpoints
- `finalized_checkpoint.local()` and `justified_checkpoint.local()`: the node's local view

Different parts of Lighthouse want to use either the network checkpoints (`on_chain`) or the local view of the node (`local`). To force consumers to think about which one they want, the fork-choice now exposes `ForkChoiceCheckpoint` instead of just `Checkpoint`.

The most relevant places where we use these checkpoints are:

| Place | Checkpoint used |
| --- | --- |
| Gossip verification (reject objects older than finalized) | `local` |
| Fork-choice irreversible checkpoint | `local` |
| Fork-choice filter tree (viable heads) | `local` |
| Fork-choice filter tree (finalization and justification match) | `on_chain` |
| Casper FFG vote source | `on_chain` |
| Status message | `on_chain` |
| Sync forward range sync start epoch | `local` |
| ProtoNode `finalized` + `justified` tag | `on_chain` |

Let me justify each one. On unstable every item would use `local`, and I'll explain why that breaks.
localand I'll explain why that breaks.Gossip verification, reject objects older than the finalized checkpoint
We can't import blocks or objects that don't descend from our anchor state because we don't have the pre-states. We need to use `local` since we may not have the `on_chain` finalized state available.
Same as above, reject blocks that conflict with our local irreversible checkpoint
Fork-choice filter tree function, only heads that descend from this block are viable
While we could use the `on_chain` justified checkpoint here, we don't have its `ProtoNode` available. To reduce the diff in the fork-choice code, we use the `local` one. However, it's always true that `justified_checkpoint.local` will equal or be a descendant of `justified_checkpoint.on_chain`.
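A minimal sketch of this filtering (an illustrative parent map, not the real proto-array): a head is viable only if it descends from the locally-known justified block.

```rust
use std::collections::HashMap;

/// Walk parent links until we reach `ancestor` or run out of parents.
/// Block "roots" are plain u64 ids in this sketch.
pub fn is_descendant(parents: &HashMap<u64, u64>, ancestor: u64, mut node: u64) -> bool {
    loop {
        if node == ancestor {
            return true;
        }
        match parents.get(&node) {
            Some(&parent) => node = parent,
            None => return false,
        }
    }
}

/// Keep only the heads that descend from the local justified block.
pub fn viable_heads(
    parents: &HashMap<u64, u64>,
    local_justified_root: u64,
    heads: &[u64],
) -> Vec<u64> {
    heads
        .iter()
        .copied()
        .filter(|&head| is_descendant(parents, local_justified_root, head))
        .collect()
}
```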
Our `ProtoNode` objects track the finalized and justified checkpoints of their states. Those are the `on_chain` ones, so to make those blocks viable we need to compare against the `on_chain` checkpoints. Otherwise we end up with a fork-choice that looks like the dump below, where all nodes imported after the anchor block are not viable. Note that the block at slot 643 has non-matching justified and finalized checkpoints.

Dump of `debug/fork-choice` on unstable:

```json
{
  "justified_checkpoint": {
    "epoch": "20",
    "root": "0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2"
  },
  "finalized_checkpoint": {
    "epoch": "20",
    "root": "0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2"
  },
  "fork_choice_nodes": [
    {
      "slot": "640",
      "block_root": "0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2",
      "parent_root": null,
      "justified_epoch": "20",
      "finalized_epoch": "20",
      "weight": "1574400000000",
      "validity": "valid",
      "execution_block_hash": "0xa668211691c40797287c258d328cec77d0edd12956eebd5c64440adb489eaca0"
    },
    {
      "slot": "643",
      "block_root": "0x8b9b01aadc0f7ead90ed7043da3cedfb17a967a7552a9d45dfcd5ce4cc4ae073",
      "parent_root": "0x44a053199e37647e4dd6a21ad2def14d17e0af13ea9ba9d467a3dc99fad817a2",
      "justified_epoch": "16",
      "finalized_epoch": "15",
      "weight": "1574400000000",
      "validity": "optimistic",
      "execution_block_hash": "0xdc9799a99f3e07da0b45ce04586bbcb999910907bea8c86c68e3faf03b878f86"
    }
  ]
}
```
We must use the
on_chainones to prevent surround votes and for our votes to be includable on-chain.Status message
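To see why the source matters, here is a hedged sketch of the surround condition (illustrative types): a pair of votes is slashable when one vote's source/target span strictly contains the other's, which voting a `local` source that diverges from the network's justified checkpoint can produce.

```rust
#[derive(Clone, Copy, Debug)]
pub struct VoteData {
    pub source_epoch: u64,
    pub target_epoch: u64,
}

/// Casper FFG surround condition: `a` surrounds `b` when `a`'s
/// source..target span strictly contains `b`'s.
pub fn surrounds(a: VoteData, b: VoteData) -> bool {
    a.source_epoch < b.source_epoch && a.target_epoch > b.target_epoch
}
```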
It's best to tell other nodes that we share the same view of finality. Otherwise we look to them like we are "behind", and they may try to fetch from us a finalized chain that doesn't finalize at that epoch.
Sync forward range sync start epoch
Range sync assumes that the split slot == the finalized slot. But we only need to sync blocks descending from the split slot (= anchor slot).
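As a tiny illustration (the 32-slots-per-epoch constant is mainnet's, assumed here), the forward sync start epoch can be derived from the split/anchor slot rather than from the network's finalized epoch:

```rust
pub const SLOTS_PER_EPOCH: u64 = 32;

/// Sketch: start forward range sync at the epoch containing the split
/// (anchor) slot, since only blocks descending from it are needed.
pub fn range_sync_start_epoch(split_slot: u64) -> u64 {
    split_slot / SLOTS_PER_EPOCH
}
```

For the anchor used in the test above (`block_640.ssz`), this gives epoch 20.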
Other changes
LH currently has a manual finalization mechanism via an HTTP API call. It triggers a special store finalization routine and forces gossip to filter by split slot too. I have changed the manual finalization route to advance the fork-choice local irreversible checkpoint and run the regular finalization routine. Overall this looks simpler and more maintainable.