Closed
Labels
enhancement (New feature or request)
Description
Is your feature request related to a problem or challenge?
Related to #10453
There are at least two types of statistics stored in Parquet files:
- ColumnChunk level statistics (a min/max/null_count per column per row group): RowGroupMetadata --> ColumnChunkMetaData --> Option<&Statistics>
- "Page Index" statistics (stored per page, may be more than one page per column per row group): ColumnChunkMetaData --> read_columns_indexes --> Vec<Index>
As part of #10453 we have pulled conversion of the ColumnChunk level statistics into StatisticsConverter, and #10802 prunes the row groups using this API.
It would be good to apply the same treatment to the statistics in the page index.
Describe the solution you'd like
- Add a clear API to efficiently extract page statistics outside of DataFusion
- Ensure that API is well tested
- Ensure the API is fast
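To make the requested shape concrete, here is a minimal sketch of extracting one value per page from a page index. It uses simplified stand-in types, not the parquet crate's real page index types: `PageStats`, `ColumnPageIndex`, and `page_mins` are hypothetical names chosen for illustration, and only an `Int32` column is modelled.

```rust
/// Hypothetical, simplified stand-in for one page's entry in a page index.
#[derive(Debug, Clone, PartialEq)]
pub struct PageStats {
    pub min: Option<i32>,
    pub max: Option<i32>,
    pub null_count: Option<u64>,
}

/// Hypothetical stand-in for the page index of one column in one row group.
#[derive(Debug, Clone, PartialEq)]
pub enum ColumnPageIndex {
    /// No page statistics were written for this column.
    None,
    /// One entry per data page, in page order.
    Int32(Vec<PageStats>),
}

/// Extract one minimum per page. `None` marks a page whose minimum is unknown.
/// `num_pages` is only used when the column has no page index at all.
pub fn page_mins(index: &ColumnPageIndex, num_pages: usize) -> Vec<Option<i32>> {
    match index {
        ColumnPageIndex::None => vec![None; num_pages],
        ColumnPageIndex::Int32(pages) => pages.iter().map(|p| p.min).collect(),
    }
}
```

A real implementation would presumably return an Arrow `ArrayRef` instead of a `Vec<Option<i32>>` and cover every physical type; the point here is only that a missing statistic becomes a null slot rather than an error.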
Describe alternatives you've considered
- Move / refactor the code to extract ArrayRef from Index in page_filter (source link) to StatisticsConverter (source)
- Update the tests in arrow_statistics (source) to also verify that the page statistics are correct (I believe the page min/maxes should be the same as the row group min/maxes)
- Update the parquet code prune_pages_in_one_row_group (source) to use the new StatisticsExtractor code
- Update the benchmark (source) for extracting page statistics and use that to ensure the statistics extraction code is reasonably performant
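One way to read the expectation that page min/maxes match the row group min/maxes is as a consistency check: folding the per-page ranges of a column chunk should give that chunk's row group range (trivially so when the chunk has a single page). The sketch below assumes that reading; `aggregate_page_min_max` is a hypothetical helper over plain `i64` pairs, not part of DataFusion or the parquet crate.

```rust
/// Combine per-page (min, max) pairs into a single (min, max) for the column chunk.
/// Returns `None` if there are no pages or if any page lacks statistics,
/// since the overall range is then unknown.
pub fn aggregate_page_min_max(pages: &[Option<(i64, i64)>]) -> Option<(i64, i64)> {
    let mut acc: Option<(i64, i64)> = None;
    for page in pages {
        // Bail out with `None` as soon as one page has no statistics.
        let (lo, hi) = (*page)?;
        acc = Some(match acc {
            None => (lo, hi),
            Some((a, b)) => (a.min(lo), b.max(hi)),
        });
    }
    acc
}
```

A test could then compare this aggregate with the min/max reported for the row group.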
Additional context
No response