
Add a benchmark for extracting parquet data page statistics #10934

@alamb

Description

Is your feature request related to a problem or challenge?

As we work to make extracting statistics from parquet data pages more correct and performant in #10922, it would be good to have benchmark coverage.

Describe the solution you'd like

Add a benchmark for extracting page statistics

Describe alternatives you've considered

Add a benchmark (source) for extracting data page statistics

These are run via

cargo bench --bench parquet_statistic
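For a new data page case, the scaffolding could look roughly like the sketch below. This is only a sketch, not the actual benchmark: the bench name, the /tmp/small_pages.parquet path, and the use of the parquet page index as the extraction step are illustrative stand-ins for whatever file and statistics-extraction API the real benchmark ends up timing, and it would need to be wired into the existing parquet_statistic bench target.

use std::fs::File;

use criterion::{criterion_group, criterion_main, Criterion};
use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ArrowReaderOptions};

fn criterion_benchmark(c: &mut Criterion) {
    // Illustrative input: a parquet file written with many small data pages,
    // as described in the writer configuration below.
    let path = "/tmp/small_pages.parquet";

    c.bench_function("extract data page statistics i64", |b| {
        b.iter(|| {
            // Stand-in for the real extraction code: decode the page index
            // (per-page min/max/null counts) from the file footer.
            let file = File::open(path).unwrap();
            let options = ArrowReaderOptions::new().with_page_index(true);
            let metadata = ArrowReaderMetadata::load(&file, options).unwrap();
            assert!(metadata.metadata().column_index().is_some());
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);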

In order to create a reasonable number of data page statistics, it would be good to configure the parquet writer to limit the size of the data pages, rather than using the default writer properties:

let props = WriterProperties::builder().build();

And use https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.data_page_row_count_limit to set the limit to 1, and then send the data in one row at a time, as is done in the test:

if let Some(data_page_row_count_limit) = self.data_page_row_count_limit {
    builder = builder.set_data_page_row_count_limit(data_page_row_count_limit);
}
let props = builder.build();

let batches = vec![self.make_int64_batches_with_null()];
let schema = batches[0].schema();
let mut writer =
    ArrowWriter::try_new(&mut output_file, schema, Some(props)).unwrap();

// If a data page row count limit is set, send the batches in one row at a
// time to give the writer a chance to split the data into multiple pages.
if self.data_page_row_count_limit.is_some() {
    for batch in batches {
        for i in 0..batch.num_rows() {
            writer.write(&batch.slice(i, 1)).expect("writing batch");
        }
    }
} else {
    for batch in batches {
        writer.write(&batch).expect("writing batch");
    }
}
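Once a file with one-row data pages exists, the page-level statistics live in the column index of the footer. A minimal sketch of reading them back with the parquet crate directly, assuming an illustrative /tmp/small_pages.parquet path, an Int64 first column, and the public PageIndex fields of the parquet version current at the time (newer releases may expose accessor methods instead):

use std::fs::File;

use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ArrowReaderOptions};
use parquet::file::page_index::index::Index;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Illustrative path: a file written with set_data_page_row_count_limit(1)
    // and one-row writes, as in the snippet above.
    let file = File::open("/tmp/small_pages.parquet")?;

    // Ask the reader to decode the page index (per-page statistics).
    let options = ArrowReaderOptions::new().with_page_index(true);
    let metadata = ArrowReaderMetadata::load(&file, options)?;

    // column_index[row_group][column] holds one entry per data page.
    let column_index = metadata
        .metadata()
        .column_index()
        .expect("file should have been written with a page index");

    for (rg, columns) in column_index.iter().enumerate() {
        if let Index::INT64(index) = &columns[0] {
            for (page, stats) in index.indexes.iter().enumerate() {
                println!(
                    "row group {rg} page {page}: min={:?} max={:?} nulls={:?}",
                    stats.min, stats.max, stats.null_count
                );
            }
        }
    }
    Ok(())
}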

Additional context

The need for a benchmark also came up in #10932

Labels

enhancement (New feature or request)