Adds some POC benchmarks for pruned_partition_list #18361
                
  
    
  
    
Which issue does this PR close?
N/A -- This PR is a POC meant for discussion to inform decisions related to #18146.
Rationale for this change
This PR shares the code for some benchmarks so that the results can be reproduced and discussed.
What changes are included in this PR?
This PR includes a set of benchmarks that exercise the `pruned_partition_list` method (both the `original` implementation on current main and the `list_all` implementation from the PR) to allow us to make a more informed decision on the potential path(s) forward related to #18146. This code is not intended for merge.
Please generally avert your eyes from the benchmark code, because it comes with an 🚨 🤖 AI generated code 🤖 🚨 warning. There are a bunch of really silly decisions the robot made, and if we actually wanted to introduce permanent benchmarks we'd likely want to pare down the cases and re-write them. At the current time I am more interested in exploring the results than nit-picking benchmark code; however, I did ensure the actual timing loops were as tight as possible, so the results are trustworthy representations of both implementations.
The benchmarks themselves include both an in-memory benchmark and an S3 benchmark that uses a local MinIO via `testcontainers`. The in-memory benches are more-or-less what I started with, and at this point they are mostly there because they're academically interesting. The S3 benchmarks are necessary to truly understand end-user performance for list operations, because list operations on commercial object stores are paged at 1000 results per page. To add an additional dose of realism, the results included with this PR were collected with a simulated latency of 120ms applied to my localhost interface using `tc` on Linux. Each underlying partition-structure benchmark is run twice for each implementation: once to collect all the results from the list operation, and again to measure the time-to-first-result (TTFR) from the file stream.

To better facilitate discussion of the results, I have included both the "raw" criterion results as text and a formatted table of the results as a markdown doc that's a bit easier to read. The "raw" criterion results are edited to remove some of the text noise (improve/regression output that was not useful/accurate, warmup text, etc.) and have had separators added to make them a bit easier to navigate and digest. I think using in-line comments on the various table entries in `s3_results_formatted.md` is probably the easiest way to thread the discussion around the results, but I'm happy to facilitate other options.
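To make the two timing modes concrete, here is a minimal sketch (not the PR's actual benchmark code) of how the full-drain and TTFR loops might look using criterion's async support; `list_pruned()` is a hypothetical helper standing in for a call through `pruned_partition_list`:

```rust
// Sketch only: `list_pruned()` is a hypothetical async helper that wraps
// `pruned_partition_list` and returns a stream of partition files.
use criterion::{criterion_group, criterion_main, Criterion};
use futures::StreamExt;

fn bench_listing(c: &mut Criterion) {
    // Requires criterion's `async_tokio` feature.
    let rt = tokio::runtime::Runtime::new().unwrap();

    // Full-drain variant: time how long it takes to collect every result.
    c.bench_function("list/collect_all", |b| {
        b.to_async(&rt)
            .iter(|| async { list_pruned().await.collect::<Vec<_>>().await });
    });

    // TTFR variant: stop polling as soon as the first item arrives.
    c.bench_function("list/ttfr", |b| {
        b.to_async(&rt)
            .iter(|| async { list_pruned().await.next().await });
    });
}

criterion_group!(benches, bench_listing);
criterion_main!(benches);
```

The key point is that the TTFR loop only polls the stream until the first item arrives, so it isolates the latency before any file is available rather than total listing throughput.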
Are these changes tested?
They are tests.
Are there any user-facing changes?
No.
cc @alamb
I was initially planning to add some comments with my own interpretations of the benchmark results right after submitting this PR to start the discussion, but in some sense I don't want to "poison the well" of additional perspectives. If you'd like me to start the discussion/interpretation I'd be happy to do so; just let me know and I can add my current thoughts.
Additional Notes:
If anyone wants to try these locally with the simulated latency, you can use this command (run as root):

```sh
tc qdisc add dev lo root handle 1:0 netem delay 60msec
```

This adds 60ms to each access across the `lo` device, resulting in a 120ms round-trip latency.
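As a quick sanity check (my suggestion, not part of the benchmark setup): `ping -c 3 127.0.0.1` should report roughly 120ms round-trip times once the rule is active, since both the request and the reply traverse `lo`.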
Since you're unlikely to want latency on localhost forever, you can reset it:

```sh
tc qdisc del dev lo root
```

I'm not sure if this functionality or something similar exists on macOS, and I don't believe there is a Windows solution.