Incorrect statistics extracted from parquet data pages when all values are null

_Originally posted by @efredine in https://github.com/apache/datafusion/issues/10922#issuecomment-2209376864_


We always flatten the data page stats iterator, following the pattern from the initial PR: https://github.com/apache/datafusion/pull/10852/files#diff-7110f4709c105a18ef74a212396444d62052179a735d148fb62470a8b157fb40R582

But I'm wondering if flatten is the right thing to do here?

The min or max values for each page will be None if all the values on the page happen to be null: https://github.com/apache/arrow-rs/blob/master/parquet/src/file/page_index/index.rs#L37-L44

Using `flatten` in this case means the result will be shorter than the number of data pages. So, rather than `flatten`, might we instead want something like a flat map in which the `Some` values are flattened and the `None` values are mapped to a null value?
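To make the concern concrete, here is a minimal sketch using only the standard library. It is a simplified model, not DataFusion's actual iterator types: `PageMin`, `mins_flattened` and `mins_per_page` are hypothetical names, and a plain `Option<i64>` stands in for the per-page min from the page index.

```rust
/// Hypothetical stand-in for one page's min statistic.
/// `None` models a page whose values are all null.
type PageMin = Option<i64>;

/// Models the behaviour in question: `flatten` drops every `None`,
/// so all-null pages vanish from the output.
fn mins_flattened(pages: &[PageMin]) -> Vec<i64> {
    pages.iter().copied().flatten().collect()
}

/// Models the suggested alternative: keep exactly one slot per page,
/// with `None` carried through as a null entry.
fn mins_per_page(pages: &[PageMin]) -> Vec<Option<i64>> {
    pages.iter().copied().collect()
}
```

For three pages `[Some(3), None, Some(9)]`, the first function yields two entries and the second yields three, so only the second keeps entry `i` aligned with page `i`.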

# Potential user impact:

The code appended nulls for missing values. However, I think in most cases missing values are simply omitted, because all the `None` values are removed by flattening. So, in general, users of the data page statistics will need to check whether the length of the array matches the number of actual data pages. This is different from how the row group statistics are handled: they will instead have a null value for any missing statistics.
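As a rough illustration of the check this pushes onto callers, a consumer-side guard might look like the sketch below. `min_for_page` is a hypothetical helper, not part of any DataFusion API, and it assumes the statistics have already been collected into a slice of `Option<i64>`.

```rust
/// Hypothetical consumer-side guard: look up the min for one page, but only
/// if the statistics line up one-to-one with the data pages.
fn min_for_page(
    mins: &[Option<i64>],
    page_count: usize,
    page: usize,
) -> Result<Option<i64>, String> {
    // If entries were dropped, positions no longer correspond to pages,
    // so any positional lookup would be unreliable.
    if mins.len() != page_count {
        return Err(format!(
            "expected {} page statistics, found {}",
            page_count,
            mins.len()
        ));
    }
    // An out-of-range page or a null statistic both come back as `None`.
    Ok(mins.get(page).copied().flatten())
}
```

With null-preserving output the guard always passes and a null simply means "no statistic for this page", which matches how the row group statistics behave.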

Is this difference in behaviour expected, or just a side effect of the implementation?

A: I think it is a side effect of the implementation, and not a good one.

Incorrect statistics extracted from parquet data pages when all values are null #11280
