fix: cleaning up of prefixes under heavy concurrency #764
+1,476
−1
Prefix GC: Statement-level triggers + per-subtree advisory locking (race-free under concurrency)
This PR replaces our prefix GC logic with a transaction-safe, concurrency-robust design based on PostgreSQL statement-level triggers and transition tables. It removes the race condition that previously left dangling prefixes during concurrent deletes/moves—without introducing any external GC tables or background jobs.
Why
Current issue: under concurrent deletes/moves, row-level triggers can evaluate leaf-ness before sibling transactions commit, clear their own work, and fail to restage parents → dangling prefixes.
Goals:
What’s in this PR
New design (high level)
Statement-level triggers use REFERENCING OLD/NEW TABLE to batch the rows touched by a statement.
This avoids removing shared roots prematurely.
New functions
storage.lock_top_prefixes(bucket_ids text[], names text[])
Takes per-(bucket, top) advisory locks in stable order:
pg_advisory_xact_lock(hashtextextended(bucket || '/' || top, 0))
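As a rough illustration of what "stable order" could mean here, the sketch below computes the set of lock keys for a batch and sorts them before any lock would be taken. It is a Python model, not the PR's PL/pgSQL. The helper name `top_level_lock_keys` is hypothetical, and it assumes that `top` is the first path segment of the object name and that "stable" means sorted. Acquiring locks in one consistent order is the usual way to keep two transactions that touch the same subtrees from deadlocking on each other.

```python
from typing import Iterable, List


def top_level_lock_keys(bucket_ids: Iterable[str], names: Iterable[str]) -> List[str]:
    """Return de-duplicated lock keys of the form 'bucket/top', in sorted order.

    Assumption: 'top' is the first path segment of the object name.
    In the PR, each key would then be hashed and passed to
    pg_advisory_xact_lock; this model only shows the ordering step.
    """
    keys = set()
    for bucket, name in zip(bucket_ids, names):
        top = name.split("/", 1)[0]  # "a" for "a/b/c/file"
        keys.add(f"{bucket}/{top}")
    return sorted(keys)


if __name__ == "__main__":
    print(top_level_lock_keys(["b1", "b1", "b2"], ["z/1", "a/b/c", "a/x"]))
```

Two statements that touch the same subtrees would compute the same key list regardless of the order in which their rows arrive.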
storage.delete_leaf_prefixes(bucket_ids text[], names text[])
Builds the unique ancestor set in memory and deletes leaf prefixes (no immediate child objects or subprefixes), bottom-up until stable.
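The bottom-up pruning described above can be modelled in memory as follows. This is a Python sketch under stated assumptions, not the PR's SQL: `ancestors` and `prune_leaf_prefixes` are hypothetical names, objects and prefixes are plain sets of path strings for a single bucket, and "leaf" is checked as "no remaining object or prefix underneath", which matches the immediate-child check when every object's ancestors exist as prefixes.

```python
from typing import Set


def ancestors(name: str) -> Set[str]:
    """Proper ancestor prefixes of a path: 'a/b/c/file' -> {'a', 'a/b', 'a/b/c'}."""
    parts = name.split("/")
    return {"/".join(parts[:i]) for i in range(1, len(parts))}


def prune_leaf_prefixes(objects: Set[str], prefixes: Set[str], touched: Set[str]) -> Set[str]:
    """Delete candidate prefixes that have nothing beneath them; return the survivors.

    objects  -- object names that still exist
    prefixes -- prefixes that currently exist
    touched  -- names of deleted or moved objects whose ancestors are candidates
    """
    remaining = set(prefixes)
    candidates: Set[str] = set()
    for name in touched:
        candidates |= ancestors(name)

    changed = True
    while changed:  # repeat until a full pass deletes nothing
        changed = False
        # Deepest first, so removing a child can expose its parent as a new leaf.
        ordered = sorted(candidates & remaining, key=lambda p: p.count("/"), reverse=True)
        for p in ordered:
            below = objects | (remaining - {p})
            if not any(x.startswith(p + "/") for x in below):
                remaining.discard(p)
                changed = True
    return remaining
```

For example, with the object `a/x/y/file` still present and `a/b/c/file` deleted, the candidates `a/b/c` and `a/b` are removed, while the shared root `a` survives because `a/x` still lives under it.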
storage.objects_delete_cleanup() (AFTER DELETE on storage.objects)
Locks touched subtrees and prunes ancestors of deleted objects.
storage.objects_update_cleanup() (AFTER UPDATE on storage.objects)
Derives NEW−OLD (destinations to ensure prefixes exist) and OLD−NEW (sources to prune).
Locks sources first, then destinations; creates dest prefixes (idempotent) and then prunes source.
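The NEW−OLD / OLD−NEW derivation is a pair of set differences over the statement's transition tables. The Python sketch below models it with tuples. `split_update` is a hypothetical name, and comparing whole `(bucket_id, name)` rows is an assumption, since the description does not say whether the PR diffs rows or derived prefixes.

```python
from typing import Iterable, Set, Tuple

Row = Tuple[str, str]  # (bucket_id, name)


def split_update(old_rows: Iterable[Row], new_rows: Iterable[Row]) -> Tuple[Set[Row], Set[Row]]:
    """Return (destinations, sources) for one UPDATE statement."""
    old, new = set(old_rows), set(new_rows)
    destinations = new - old  # NEW−OLD: prefixes must exist for these
    sources = old - new       # OLD−NEW: ancestors of these are pruning candidates
    return destinations, sources
```

Rows whose bucket and name are unchanged by the UPDATE fall out of both sets, so they trigger neither prefix creation nor pruning.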
storage.prefixes_delete_cleanup() (AFTER DELETE on storage.prefixes)
Prunes ancestors when prefixes are deleted directly.
Triggers
objects_delete_cleanup – AFTER DELETE … REFERENCING OLD TABLE AS deleted FOR EACH STATEMENT
objects_update_cleanup – AFTER UPDATE … REFERENCING OLD/NEW TABLES … FOR EACH STATEMENT
prefixes_delete_cleanup – AFTER DELETE on storage.prefixes … REFERENCING OLD TABLE … FOR EACH STATEMENT
Performance notes
Advisory locks are taken per (bucket, top), so unrelated trees proceed in parallel.
Recommended indexes (unchanged schema):
Test coverage (behaviors now correct)
(a/b/c/file → a/x/y/file) → old chain pruned, new chain created.
src/ prefixes.
Operational notes
Within a single (bucket, top), GC is serialized by design, but only for that subtree; other trees proceed concurrently.