-
Notifications
You must be signed in to change notification settings - Fork 515
arenaskl: optimize TrySeekUsingNext with skiplist tower traversal #5562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Previously, TrySeekUsingNext would perform up to 5 simple Next() operations
at level 0. If the target key wasn't found, all that work was abandoned and
the iterator fell back to a full seek from the top of the skiplist. This
wasted both the forward progress made and the information gained during those
Next operations.
The new implementation leverages the skiplist tower structure to make seeking
more efficient. Instead of advancing only at level 0, it attempts to climb to
higher levels when possible, allowing it to skip over many nodes efficiently.
When a node >= target is found at a higher level, it descends through
progressively lower levels to find the exact first node >= target at level 0.
This approach better utilizes the skiplist's hierarchical structure for
workloads with sequential access patterns.
The optimization shows a modest improvement in the SeekPrefixGE benchmark,
particularly as the skip distance between seeks increases. With `use-next=true`,
performance improves by up to 5.5% for larger skips (`skip=16`), while
shorter skips show smaller gains.
```
goos: darwin
goarch: arm64
pkg: github.com/cockroachdb/pebble/internal/arenaskl
cpu: Apple M2 Pro
│ ./bench_master.txt │ ./bench_optimized.txt │
│ sec/op │ sec/op vs base │
SeekPrefixGE/skip=1/use-next=false-12 180.8n ± 2% 186.8n ± 2% +3.32% (p=0.015 n=6)
SeekPrefixGE/skip=1/use-next=true-12 88.26n ± 10% 89.05n ± 3% ~ (p=0.699 n=6)
SeekPrefixGE/skip=2/use-next=false-12 183.0n ± 5% 184.9n ± 2% ~ (p=1.000 n=6)
SeekPrefixGE/skip=2/use-next=true-12 94.73n ± 2% 93.31n ± 0% -1.49% (p=0.002 n=6)
SeekPrefixGE/skip=4/use-next=false-12 175.9n ± 4% 178.1n ± 1% ~ (p=0.065 n=6)
SeekPrefixGE/skip=4/use-next=true-12 114.5n ± 1% 111.0n ± 3% -3.01% (p=0.006 n=6)
SeekPrefixGE/skip=8/use-next-false-12 177.2n ± 5% 177.6n ± 3% ~ (p=0.615 n=6)
SeekPrefixGE/skip=8/use-next=true-12 219.1n ± 1% 216.4n ± 3% ~ (p=0.485 n=6)
SeekPrefixGE/skip=16/use-next-false-12 180.2n ± 2% 179.2n ± 3% ~ (p=0.288 n=6)
SeekPrefixGE/skip=16/use-next=true-12 208.3n ± 3% 196.8n ± 6% -5.50% (p=0.009 n=6)
geomean 155.2n 154.4n -0.53%
│ ./bench_master.txt │ ./bench_optimized.txt │
│ B/op │ B/op vs base │
SeekPrefixGE/skip=1/use-next=false-12 22.00 ± 0% 22.00 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=1/use-next=true-12 22.00 ± 0% 22.00 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=2/use-next=false-12 22.00 ± 0% 22.00 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=2/use-next=true-12 22.00 ± 0% 22.00 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=4/use-next=false-12 22.00 ± 0% 22.00 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=4/use-next=true-12 22.00 ± 0% 22.00 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=8/use-next-false-12 22.00 ± 0% 22.00 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=8/use-next=true-12 22.00 ± 0% 22.00 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=16/use-next-false-12 22.00 ± 0% 22.00 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=16/use-next=true-12 22.00 ± 0% 22.00 ± 0% ~ (p=1.000 n=6) ¹
geomean 22.00 22.00 +0.00%
¹ all samples are equal
│ ./bench_master.txt │ ./bench_optimized.txt │
│ allocs/op │ allocs/op vs base │
SeekPrefixGE/skip=1/use-next=false-12 2.000 ± 0% 2.000 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=1/use-next=true-12 2.000 ± 0% 2.000 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=2/use-next=false-12 2.000 ± 0% 2.000 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=2/use-next=true-12 2.000 ± 0% 2.000 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=4/use-next=false-12 2.000 ± 0% 2.000 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=4/use-next=true-12 2.000 ± 0% 2.000 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=8/use-next-false-12 2.000 ± 0% 2.000 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=8/use-next=true-12 2.000 ± 0% 2.000 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=16/use-next-false-12 2.000 ± 0% 2.000 ± 0% ~ (p=1.000 n=6) ¹
SeekPrefixGE/skip=16/use-next=true-12 2.000 ± 0% 2.000 ± 0% ~ (p=1.000 n=6) ¹
geomean 2.000 2.000 +0.00%
¹ all samples are equal
```
Fixes cockroachdb#5358
RaduBerinde
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 files reviewed, 2 unresolved discussions (waiting on @jbowens)
internal/arenaskl/iterator.go line 318 at r1 (raw file):
// Try to advance at the current level or higher levels. advanced := false for candidateLevel := level; candidateLevel <= maxLevel; candidateLevel++ {
I am a bit confused why we start with lower levels first, it feels like if we have to skip over a bunch of keys, those small steps will be more or less wasted.
Couldn't we binary search for the lowest level which points after key and then descend down? It feels like that would be close to optimal regardless of TrySeekUsingNext and doesn't need to have a maxSteps limit (though we could use TrySeekUsingNext to start the binary search with a lower or higher initial pivot).
internal/arenaskl/iterator.go line 365 at r1 (raw file):
// searching for the first node >= key. It starts at 'nd' which is known to be < key, // and searches forward at progressively lower levels. func (it *Iterator) searchDownToLevel0(nd *node, startLevel int, key []byte) *node {
Can this be consolidated with / used by seekForBaseSplice? They do almost the same thing AFAICT.
…Next Addresses reviewer feedback from @RaduBerinde regarding inefficient level traversal and code duplication. This PR adds a height field to the skiplist node structure to enable safe level-aware traversal from arbitrary positions, then implements binary search to find the optimal starting level as suggested in the review. Changes: 1. Added height field to node struct (4 bytes per node) - Nodes previously didn't store their tower height, making it impossible to safely access higher levels from arbitrary starting positions - Only the head node was guaranteed to have links at all levels - The height field enables checking before access 2. Implemented seekForBaseSpliceFrom with binary search - When starting from an intermediate node (not head), binary search finds the highest safe level where current < target - This addresses the review comment that starting from lower levels is inefficient - we now start from the optimal level - Falls back to standard skiplist descent after finding the starting level 3. Simplified trySeekUsingNext to use seekForBaseSpliceFrom - Removed maxSteps limitation for cleaner, always-correct implementation - Consolidated with seekForBaseSplice as suggested in review - Now simply calls seekForBaseSpliceFrom(currentNode, currentHeight-1, key) Implementation Journey: Several approaches were attempted before adding the height field: - Direct level climbing from arbitrary nodes: crashed due to accessing non-existent tower levels - Checking next pointer validity: can't reliably detect invalid levels - Starting from lower levels: inefficient as noted in review The height field was the minimal addition needed to enable proper binary search and level-aware traversal from any position. Benchmark Results (master vs optimized): Key improvement for longer seeks (target use case): skip=8/use-next=true: 219.1ns → 169.2ns (-22.75%) skip=16/use-next=true: 208.3ns → 211.6ns (+1.61%) Trade-off for very short seeks (overhead from binary search): skip=1/use-next=true: 88.3ns → 119.3ns (+35.17%) skip=2/use-next=true: 94.7ns → 132.0ns (+39.35%) skip=4/use-next=true: 114.5ns → 149.7ns (+30.74%)
|
@RaduBerinde Thank you for the feedback. Could you please confirm if my understanding is correct? Initially, I was proceeding with the understanding that I should traverse up the tower to perform the Regarding your comment:
I interpreted this as performing a binary search across the node's levels. Please let me know if that was a misunderstanding. I first attempted to implement this in commit 9b68e5c, but I found it was not possible to traverse up the tower from an arbitrary node (the iterator's current position) because I couldn't determine that node's height. To solve this, I have now added a Does this align with the implementation direction you had in mind? Below are the benchmark results. |
|
Oh, I see, thanks! I forgot that not all nodes have all the pointers - we might actually need to take some smaller steps just to get to a node that has more pointers (is "higher") before we can do a bigger jump. Let me think about this some more. @jbowens knows this code best, let's wait for him to chime in too. |
|
Going back to your first implementation (r1 in Reviewable), how does that work without knowing the height of a node? How do you know it's legal to look at the next pointer for a higher candidateLevel? An alternative to storing the height in the node is remembering the entire stack during each Seek (i.e. the previous node on each level). Then we can try them starting at L1+ to find the correct level to start our descent from. |
|
Hmm... initial idea was to traverse the tower of the node at the current iterator position. I assumed that if
Regarding your suggestion to maintain a stack: I'm concerned that this might be difficult to implement correctly because In the read patterns I anticipate, it's possible for |
Previously, TrySeekUsingNext would perform up to 5 simple Next() operations
at level 0. If the target key wasn't found, all that work was abandoned and
the iterator fell back to a full seek from the top of the skiplist. This
wasted both the forward progress made and the information gained during those
Next operations.
The new implementation leverages the skiplist tower structure to make seeking
more efficient. Instead of advancing only at level 0, it attempts to climb to
higher levels when possible, allowing it to skip over many nodes efficiently.
When a node >= target is found at a higher level, it descends through
progressively lower levels to find the exact first node >= target at level 0.
This approach better utilizes the skiplist's hierarchical structure for
workloads with sequential access patterns.
The optimization shows a modest improvement in the SeekPrefixGE benchmark,
particularly as the skip distance between seeks increases. With
use-next=true,performance improves by up to 5.5% for larger skips (
skip=16), whileshorter skips show smaller gains.
Fixes #5358