Is your feature request related to a problem or challenge?
As we work to make extracting statistics from parquet data pages more correct and performant in #10922, one thing that would be good is to have benchmark coverage.
Describe the solution you'd like
Add a benchmark for extracting page statistics
Describe alternatives you've considered
Add a benchmark (source) for extracting data page statistics. These are run via `cargo bench --bench parquet_statistic`. In order to create a reasonable number of data page statistics, it would be good to configure the parquet writer to limit the size of data pages.
Instead of creating the writer properties with the defaults:

```rust
let props = WriterProperties::builder().build();
```

use https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.data_page_row_count_limit to set the limit to 1, and then write the data in one row at a time, as we did in this test (a sketch of a complete benchmark follows the snippet):
`datafusion/core/tests/parquet/arrow_statistics.rs`, lines 105 to 130 at `d175163`:
```rust
if let Some(data_page_row_count_limit) = self.data_page_row_count_limit {
    builder = builder.set_data_page_row_count_limit(data_page_row_count_limit);
}
let props = builder.build();

let batches = vec![self.make_int64_batches_with_null()];
let schema = batches[0].schema();
let mut writer =
    ArrowWriter::try_new(&mut output_file, schema, Some(props)).unwrap();

// if we have a datapage limit send the batches in one at a time to give
// the writer a chance to be split into multiple pages
if self.data_page_row_count_limit.is_some() {
    for batch in batches {
        for i in 0..batch.num_rows() {
            writer.write(&batch.slice(i, 1)).expect("writing batch");
        }
    }
} else {
    for batch in batches {
        writer.write(&batch).expect("writing batch");
    }
}
```
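
Putting the pieces together, here is a minimal sketch of what such a benchmark could look like. It assumes the `criterion` harness and the `arrow`, `parquet`, and `bytes` crates already used by DataFusion's existing benchmarks; the `make_parquet_file` helper and the bench name are illustrative, and reading the page index through the parquet crate stands in for whatever extraction API #10922 ends up with:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::record_batch::RecordBatch;
use bytes::Bytes;
use criterion::{criterion_group, criterion_main, Criterion};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

/// Illustrative helper: write 1000 int64 rows, one row per data page,
/// and return the encoded parquet file as bytes
fn make_parquet_file() -> Bytes {
    let array = Int64Array::from_iter_values(0..1000);
    let batch =
        RecordBatch::try_from_iter([("i64", Arc::new(array) as ArrayRef)]).unwrap();

    // limit each data page to a single row so the file contains many
    // data pages, and thus many data page statistics to extract
    let props = WriterProperties::builder()
        .set_data_page_row_count_limit(1)
        .build();

    let mut buffer = Vec::new();
    let mut writer =
        ArrowWriter::try_new(&mut buffer, batch.schema(), Some(props)).unwrap();
    // write one row at a time to give the writer a chance to split pages
    for i in 0..batch.num_rows() {
        writer.write(&batch.slice(i, 1)).expect("writing batch");
    }
    writer.close().unwrap();
    Bytes::from(buffer)
}

fn criterion_benchmark(c: &mut Criterion) {
    let data = make_parquet_file();
    c.bench_function("extract data page statistics", |b| {
        b.iter(|| {
            // loading the page index here is a stand-in for the real
            // statistics extraction code under benchmark
            let options = ArrowReaderOptions::new().with_page_index(true);
            let builder =
                ParquetRecordBatchReaderBuilder::try_new_with_options(data.clone(), options)
                    .unwrap();
            assert!(builder.metadata().column_index().is_some());
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```

With `set_data_page_row_count_limit(1)` and row-by-row writes, each row lands in its own data page, so even this small file yields on the order of a thousand page-level statistics per iteration, enough for the timing to be meaningful.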
Additional context
The need for a benchmark also came up in #10932.