Commit 44fa540
Persist to DB after setting canonical head (#2547)
## Issue Addressed
NA
## Proposed Changes
Missed head votes on attestations are a well-known issue. The primary cause is a block being set as the head *after* the attestation deadline.
This PR aims to shorten the overall time between "block received" and "block set as head" by:
1. Persisting the head and fork choice *after* setting the canonical head.
    - Informal measurements show this takes ~200ms.
1. Pruning the op pool *after* setting the canonical head.
1. No longer persisting the op pool to disk during `BeaconChain::fork_choice`.
    - Informal measurements show this can take up to 1.2s.
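The reordering listed above can be sketched roughly as follows. This is a toy model, not the actual Lighthouse code: `ToyChain` and its `steps` log are hypothetical, and the method bodies only record the order in which they ran. The names `set_canonical_head` and `prune_op_pool` are illustrative and may not match the real method names.

```rust
/// Hypothetical stand-in for the beacon chain; not Lighthouse's real type.
#[derive(Default)]
struct ToyChain {
    head: u64,
    /// Records the order in which the steps ran.
    steps: Vec<&'static str>,
}

impl ToyChain {
    fn set_canonical_head(&mut self, block_root: u64) {
        self.head = block_root;
        self.steps.push("set_canonical_head");
    }

    fn persist_head_and_fork_choice(&mut self) {
        // Stands in for the disk write (informally measured at ~200ms).
        self.steps.push("persist_head_and_fork_choice");
    }

    fn prune_op_pool(&mut self) {
        self.steps.push("prune_op_pool");
    }

    /// Ordering after this change: the head is updated first, the slower
    /// bookkeeping follows, and the op pool is not persisted here at all.
    fn fork_choice(&mut self, new_head: u64) {
        self.set_canonical_head(new_head);
        self.persist_head_and_fork_choice();
        self.prune_op_pool();
    }
}
```

In this sketch, anything that reads `head` sees the new value before any of the slower steps have started, which is the effect the change is after.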
I also add some metrics to help measure the effect of these changes.
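As a rough illustration of how such a measurement can be taken, the sketch below times a single closure with the standard library. The `timed` helper is hypothetical and is not the metrics API used in this PR.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `f` and returns its result together with
/// the wall-clock time it took.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}
```

Wrapping the "block received" to "block set as head" path in a helper like this would give one duration per block, which could then be fed into whatever histogram the metrics system provides.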
Persistence changes like this run the risk of breaking assumptions downstream. However, I have considered these risks and I think we're fine here. I will describe my reasoning for each change.
## Reasoning
### Change 1: Persisting the head and fork choice *after* setting the canonical head
For (1), although the function is called `persist_head_and_fork_choice`, it only persists:
- Fork choice
- Head tracker
- Genesis block root
Since `BeaconChain::fork_choice_internal` does not modify these values between the point where we used to persist them and the point where we persist them now, I assert that the change I've made is non-substantial in terms of what ends up on-disk. There's the possibility that some *other* thread has modified fork choice in the extra time we've given it, but that's totally fine.
Since the only time we *read* those values from disk is during startup, I assert that this has no impact during runtime.
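The argument can be checked on a toy model. In the sketch below, all types and functions are hypothetical. It assumes, as stated above, that setting the head does not touch the persisted fields, and it ignores other threads.

```rust
/// The three values the function actually persists (toy representation).
#[derive(Clone, Debug, PartialEq)]
struct PersistedState {
    fork_choice: Vec<u64>,
    head_tracker: Vec<u64>,
    genesis_block_root: u64,
}

/// Hypothetical node: in-memory state plus a fake "disk" slot.
struct ToyNode {
    persisted_fields: PersistedState,
    canonical_head: u64,
    disk: Option<PersistedState>,
}

impl ToyNode {
    fn new() -> Self {
        ToyNode {
            persisted_fields: PersistedState {
                fork_choice: vec![1, 2, 3],
                head_tracker: vec![3],
                genesis_block_root: 0,
            },
            canonical_head: 2,
            disk: None,
        }
    }

    /// Writes a snapshot of the persisted fields to the fake disk.
    fn persist(&mut self) {
        self.disk = Some(self.persisted_fields.clone());
    }

    /// Assumption of the sketch: this never touches `persisted_fields`.
    fn set_canonical_head(&mut self, head: u64) {
        self.canonical_head = head;
    }
}

/// Old ordering: persist, then set the head.
fn persist_then_set_head(head: u64) -> Option<PersistedState> {
    let mut node = ToyNode::new();
    node.persist();
    node.set_canonical_head(head);
    node.disk
}

/// New ordering: set the head, then persist.
fn set_head_then_persist(head: u64) -> Option<PersistedState> {
    let mut node = ToyNode::new();
    node.set_canonical_head(head);
    node.persist();
    node.disk
}
```

Both orderings leave the same snapshot on the fake disk, so a later startup that reads it back cannot tell them apart.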
### Change 2: Pruning the op pool after setting the canonical head
Similar to the argument above, we don't modify the op pool during `BeaconChain::fork_choice_internal` so it shouldn't matter when we prune. This change should be non-substantial.
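A minimal sketch of why the timing is harmless, assuming a hypothetical pruning rule that drops everything older than a cutoff slot (the real op pool's pruning logic is more involved): the result depends only on the pool's contents and the cutoff, so running it later yields the same pool as long as nothing else modified it in between.

```rust
/// Hypothetical op pool that only tracks the slot of each attestation.
#[derive(Clone, Debug, PartialEq)]
struct ToyOpPool {
    attestation_slots: Vec<u64>,
}

impl ToyOpPool {
    /// Illustrative rule: keep only entries at or after `earliest_kept_slot`.
    fn prune(&mut self, earliest_kept_slot: u64) {
        self.attestation_slots
            .retain(|slot| *slot >= earliest_kept_slot);
    }
}
```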
### Change 3: No longer persisting the op pool to disk during `BeaconChain::fork_choice`
This change *is* substantial. With the proposed changes, we'll only be persisting the op pool to disk when we shut down cleanly (i.e., the `BeaconChain` gets dropped). This means we'll save disk IO and time during usual operation, but a `kill -9` or similar "crash" will probably result in an out-of-date op pool when we reboot. An out-of-date op pool can only have an impact when producing blocks or aggregate attestations/sync committees.
I think it's pretty reasonable that a crash might result in an out-of-date op pool, since:
- Crashes are fairly rare. Practically the only time I see LH suffer a full crash is when the OOM killer shows up, and that's a very serious event.
- It's generally quite rare to produce a block/aggregate immediately after a reboot. Just a few slots of runtime is probably enough to have a decent-enough op pool again.
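The trade-off can be illustrated with a toy `Drop` implementation. Everything here is hypothetical: `ToyBeaconChain`, the shared `Disk` handle, and the use of `std::mem::forget` to simulate a crash by skipping the destructor.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Fake "disk": a shared slot holding the last persisted op pool, if any.
type Disk = Rc<RefCell<Option<Vec<u64>>>>;

/// Hypothetical chain that only persists its op pool when dropped.
struct ToyBeaconChain {
    op_pool: Vec<u64>,
    disk: Disk,
}

impl ToyBeaconChain {
    /// On startup, restore whatever op pool was last written to disk.
    fn new(disk: Disk) -> Self {
        let op_pool = disk.borrow().clone().unwrap_or_default();
        ToyBeaconChain { op_pool, disk }
    }

    /// Adds an operation in memory only; nothing is written to disk here.
    fn add_operation(&mut self, op: u64) {
        self.op_pool.push(op);
    }
}

impl Drop for ToyBeaconChain {
    /// Clean shutdown: write the op pool to disk exactly once.
    fn drop(&mut self) {
        *self.disk.borrow_mut() = Some(self.op_pool.clone());
    }
}
```

Dropping the chain normally writes the current pool. If the destructor never runs, as with `std::mem::forget(chain)` in this model or `kill -9` in practice, the disk keeps whatever the last clean shutdown wrote.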
## Additional Info
Credits to @macladson for the timings referenced here.

Parent commit: 1031f79
2 files changed: 13 additions, 6 deletions.