Conversation

glyh commented Jul 21, 2025

Status

  • Fix bug where proofs read from the persisted disk cache can't be garbage collected;
  • Fix bug where the disk cache and the persistent snark pool get out of sync: simply stop reusing LMDB keys (see the sketch after this list);
    • Fixing the now infinitely growing disk cache indices is left to follow-up PRs;
  • Add a changelog entry;
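
A minimal sketch of the "don't reuse LMDB keys" fix referenced above, with an illustrative counter-based allocator; `next_key` is hypothetical, not the real cache API:

```ocaml
(* Sketch of "never reuse LMDB keys": hand out a fresh, monotonically
   increasing key for every proof written to the disk cache. A key freed
   by GC is never recycled, so the persisted snark pool and the cache
   can never disagree about what a key refers to. *)
let counter = ref 0L

let next_key () =
  let key = !counter in
  counter := Int64.succ !counter ;
  key
```

The trade-off is exactly the sub-item above: key indices grow without bound until a compaction pass is added in a follow-up PR.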

Context

  • Snark persistence was removed in Fix snark pool long async cycles #13409 because writing all proofs to disk took too long;
  • However, this leaves nodes having trouble catching up, producing empty blocks, blocks with zero coinbase, or blocks with high snark fees. This is because we only push-broadcast snark work for a finite duration; if a node is down during that window, it never hears about the work;
  • The long-term solution is a pull-based snark work query, but that touches the underlying protocol, is non-trivial, and might be a breaking change. Our short-term solution is to bring back this feature. We're aware that it slows down the system, so the feature should be optional;

This PR

  • Refactor the exit handler a bit so we can run specific tasks in order
  • Bring back snark persistence: the pool is synced to disk every 10 minutes and on shutdown (a sketch follows this list)
  • On startup, try to load the snark pool from the persisted copy
  • For the disk cache, store its parameters on shutdown so they can be reloaded when the node starts, and stop cleaning the disk cache on shutdown. For LMDB specifically, this makes the disk cache grow without bound, but I think we can come up with solutions for that in follow-up PRs.
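
As referenced in the list above, here is a minimal sketch of the persistence schedule under Async. `Snark_pool.write_to_disk` is a hypothetical stand-in, not the PR's API; only the "every 10 minutes and on shutdown" shape is taken from this PR.

```ocaml
open Core
open Async

(* Hypothetical stand-in for the real snark pool and its persistence
   call; only the scheduling shape below reflects the PR. *)
module Snark_pool = struct
  type t = unit

  let write_to_disk (_ : t) ~disk_location =
    printf "writing snark pool to %s\n" disk_location ;
    Deferred.unit
end

let schedule_persistence ~disk_location (pool : Snark_pool.t) =
  (* Periodic flush every 10 minutes... *)
  Clock.every (Time.Span.of_min 10.) (fun () ->
      don't_wait_for (Snark_pool.write_to_disk pool ~disk_location) ) ;
  (* ...and one final flush when the daemon shuts down. *)
  Shutdown.at_shutdown (fun () ->
      Snark_pool.write_to_disk pool ~disk_location )
```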

glyh force-pushed the lyh/snark-persistence branch from d341b28 to d35a6cf on July 22, 2025 07:04
glyh force-pushed the lyh/snark-persistence branch from d35a6cf to 9799d78 on July 22, 2025 07:21
glyh changed the title from "Snark Persistence: iteration 1" to "Bring back snark persistence: iteration 1" on Jul 22, 2025
glyh commented Jul 23, 2025

!ci-build-me

glyh commented Jul 23, 2025

I broke something; the node is decreasing trust on peers like crazy.

Actually, this is a known false positive.

glyh commented Jul 23, 2025

This is now blocked by https://github.com//issues/17501

I didn't run into this in a previous run on devnet on our server, which is weird.

I'll test with mina client close RPCs

glyh commented Jul 24, 2025

!ci-docker-me

glyh commented Jul 24, 2025

!ci-docker-me

```ocaml
      ~metadata:[ ("description", `String description) ] ;
  let logging_thunk () =
    [%log info] "Running async shutdown handler: $description"
  let register_async_shutdown_handler =
```
glyh commented Jul 24, 2025

Refactored this module because we want to enforce ordering of exit handlers between the frontier and the snark pool; a sketch of the idea follows below.
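
A minimal sketch of the ordering idea, assuming a simple queue of handlers; the PR's actual module differs, and only the intent (shutdown tasks run in a deterministic order, e.g. frontier before snark pool) is taken from the comment above.

```ocaml
open Core
open Async

(* Sketch: registration order = execution order. The queue-based
   implementation is an assumption, not the PR's code. *)
let handlers : (string * (unit -> unit Deferred.t)) Queue.t = Queue.create ()

let register_async_shutdown_handler ~description f =
  printf "Registering async shutdown handler: %s\n" description ;
  Queue.enqueue handlers (description, f)

let run_shutdown_handlers () =
  (* [`Sequential] guarantees each handler finishes before the next starts. *)
  Deferred.List.iter ~how:`Sequential (Queue.to_list handlers)
    ~f:(fun (_description, f) -> f ())
```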

```ocaml
  -> disk_location:string
  -> proof_cache_db:Proof_cache_tag.cache_db
  -> ?persistence:
       [< `Disk_location of string ] * [< `Store_every of Time.Span.t ]
```
glyh commented Jul 24, 2025

This is a bit ugly because I want these params to be present or absent at the same time. Also, the functor restricts the signature of Config to T; otherwise I would have put a module Persistence tracking the persistence config inside it. A self-contained illustration of the "both or neither" constraint follows below.
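
To make the constraint concrete, here is a minimal, self-contained sketch in plain OCaml. It uses an illustrative record instead of the polymorphic-variant pair from the signature above; the type and names are assumptions, not the PR's code.

```ocaml
open Core

(* A single optional argument bundling both values means a caller can't
   supply a disk location without a storage interval, or vice versa. *)
type persistence_config =
  { disk_location : string; store_every : Time.Span.t }

let create ?persistence () =
  match persistence with
  | None -> print_endline "snark persistence disabled"
  | Some { disk_location; store_every } ->
      printf "persisting to %s every %s\n" disk_location
        (Time.Span.to_string store_every)

let () =
  create () ;
  create
    ~persistence:
      { disk_location = "/tmp/snark-pool-cache"
      ; store_every = Time.Span.of_min 10. }
    ()
```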

glyh commented Jul 24, 2025

!ci-docker-me

glyh commented Jul 24, 2025

!ci-docker-me

glyh commented Jul 24, 2025

The verifier failed when restarting a node; I'm not sure whether that's common. I assume I introduced a bug here.

glyh commented Jul 24, 2025

Well, it seems our devnet alpha release also has this non-fatal issue:

2025-07-24 07:26:27 UTC [Error] verifier terminated unexpectedly

Repro:

  • Start a node and let it sync
  • mina client stop-daemon
  • restart the daemon

glyh marked this pull request as ready for review on July 24, 2025 07:28
glyh requested a review from a team as a code owner on July 24, 2025 07:28
glyh commented Jul 24, 2025

!ci-build-me

glyh commented Jul 24, 2025

This deserves a changelog entry; will add it.

glyh marked this pull request as draft on July 24, 2025 11:26
glyh commented Jul 24, 2025

!ci-docker-me

```ocaml
module Make (B : sig
  include Binable.S
end) =
struct
```
glyh commented Jul 24, 2025

I may consider removing the FS disk cache entirely before merging:

  • We're not using it
  • It performs worse than LMDB
  • It's a burden to implement similar logic twice here
