Efficiently and correctly Extract Page Index statistics into ArrayRefs #10806

@alamb

Is your feature request related to a problem or challenge?

Related to #10453

There are at least two types of statistics stored in Parquet files:

  1. ColumnChunk level statistics (a min/max/null_count per column per row group): RowGroupMetaData --> ColumnChunkMetaData --> Option<&Statistics>
  2. "Page Index" statistics (stored per page; there may be more than one page per column per row group): ColumnChunkMetaData --> read_columns_indexes --> Vec<Index>
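
To make the difference between the two levels concrete, here is a minimal, std-only sketch. The types `ChunkStats`, `PageStats`, and `ColumnChunk` and the function `page_mins` are hypothetical stand-ins invented for illustration; they are not the parquet crate's `Statistics` or `Index` types, and `Vec<Option<i64>>` only stands in for an Arrow `ArrayRef`.

```rust
/// Hypothetical stand-in for the chunk-level statistics of one column
/// in one row group (one min/max/null_count for the whole chunk).
#[derive(Debug, Clone, PartialEq)]
pub struct ChunkStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
    pub null_count: u64,
}

/// Hypothetical stand-in for one page's entry in the page index.
#[derive(Debug, Clone, PartialEq)]
pub struct PageStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
    pub null_count: u64,
}

/// One column within one row group: a single chunk-level summary
/// (which may be absent) plus one statistics entry per data page.
pub struct ColumnChunk {
    pub chunk_stats: Option<ChunkStats>,
    pub page_stats: Vec<PageStats>,
}

/// Collects the per-page minimums into one column-oriented vector,
/// one element per page. A page without a minimum (for example an
/// all-null page) yields `None`, mirroring a null slot in an array.
pub fn page_mins(chunk: &ColumnChunk) -> Vec<Option<i64>> {
    chunk.page_stats.iter().map(|p| p.min).collect()
}
```

The chunk level yields one value per column per row group, while the page level yields one value per page, so the extracted array has as many elements as the chunk has pages.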

As part of #10453 we pulled the conversion of the ColumnChunk level statistics into StatisticsConverter, and #10802 prunes the row groups using this API.

It would be good to apply the same treatment to the statistics in the page index.

Describe the solution you'd like

  1. Add a clear API to efficiently extract page statistics outside of DataFusion
  2. Ensure that API is well tested
  3. Ensure the API is fast

Describe alternatives you've considered

  1. Move / refactor the code to extract ArrayRef from Index in page_filter (source link) to StatisticsConverter (source)
  2. Update the tests in arrow_statistics (source) to also verify that the page statistics are correct (I believe the page min/maxes should be the same as the row group min/maxes)
  3. Update the parquet code prune_pages_in_one_row_group (source) to use the new StatisticsConverter code
  4. Update the benchmark (source) for extracting page statistics and use that to ensure the statistics extraction code is reasonably performant
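
The check suggested in step 2 could be sketched as follows. This is one possible reading of "the page min/maxes should be the same as the row group min/maxes": when a row group has a single page the values match directly, and when it has several pages the smallest page minimum and the largest page maximum should match the row group values. The `PageMinMax` alias and `fold_page_min_max` function are hypothetical names, and plain `i64` values stand in for typed Arrow arrays.

```rust
/// Hypothetical (min, max) pair for one page; `None` means the page
/// carries no such statistic (for example, every value is null).
pub type PageMinMax = (Option<i64>, Option<i64>);

/// Folds per-page statistics into a single (min, max) pair that a test
/// can compare against the row group level statistics.
/// Pages without a statistic are skipped; if no page has one, the
/// corresponding result is `None`.
pub fn fold_page_min_max(pages: &[PageMinMax]) -> (Option<i64>, Option<i64>) {
    let min = pages.iter().filter_map(|p| p.0).min();
    let max = pages.iter().filter_map(|p| p.1).max();
    (min, max)
}
```

A test would then assert that the folded pair equals the row group's min and max for each column, which catches both wrong values and pages that were silently dropped during extraction.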

Additional context

No response
