Description
This issue was discovered by greg (kudos for the debugging assistance!).
Specifically, the Geth node panicked during snap sync, and a few missing nodes were detected during snapshot generation. The missing nodes are all in storage tries, specifically the topmost trie nodes on one or two trie paths.
Incomplete storage tries are quite common due to storage chunkification. The strange part is that the shortNode containing the associated account data exists in the account trie, which prevents state healing from refilling the missing storage trie nodes.
After debugging it for a while, I realized that it is caused by redoing the state sync after an unexpected termination.
Specifically, in sync cycle A, the storage trie of account X was fully synchronized and properly persisted to disk. The associated account data was also inserted into the account trie and flushed to disk, indicating that the storage trie was complete and no healing was required. However, a panic occurred, causing the process to terminate without saving the snap sync progress indicator.
In sync cycle B, after relaunching, the storage retrieval of account X was redone using the old sync progress indicator. In this new cycle, the storage was chunkified into several pieces, and several trie nodes on the boundary path were deleted from disk. Since the storage trie in this new cycle was incomplete, account X was tagged as "needHeal", and the account data itself was discarded. Theoretically, this mechanism ensures that a healing operation will be conducted, refilling all missing trie nodes within the account trie and storage trie. However, the account data had already been persisted to disk in cycle A and was not deleted in cycle B. This leftover trie node with the account data prevents state healing, because the healer treats its presence as proof that the storage trie is complete.
The leftover node of account X in cycle B breaks the state healing.
Originally bug report #30149
