Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Jun 14, 2024

Which issue does this PR close?

Closes #10923

Rationale for this change

The code that extracted statistics for ListingTable with parquet files:

  1. Was incomplete and did not handle all types (like strings)
  2. Did the same thing (extracted Row Group statistics from parquet files) but used different code than the ParquetExec
  3. Was inefficient (accumulated by creating single row record batches for each row group)

I also personally also found the previous code somewhat hard to follow

See #10923 for more details

What changes are included in this PR?

Update Parquet listing table to use the new StatisticsConverter API to create DataFusion statistics

Are these changes tested?

Yes, they are covered by existing tests

Are there any user-facing changes?

ListingFiles will now have more accurate statistics

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jun 14, 2024
ParquetStatistics::Int64(s) if DataType::Int64 == *fields[i].data_type() => {
if let Some(max_value) = &mut max_values[i] {
max_value
.update_batch(&[Arc::new(Int64Array::from_value(*s.max(), 1))])
Copy link
Contributor Author

@alamb alamb Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, even though update_batch is called, it first creates a single row array

}
}
let file_field = file_schema.field(file_idx);
let Some(converter) = StatisticsConverter::try_new(
Copy link
Contributor Author

@alamb alamb Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code now uses the well tested StatisticsConverter to extract statistics from the parquet file with the correct type of array in a single call

// Note in ASCII lower case is greater than upper case
assert_eq!(
c1_stats.max_value,
Precision::Exact(ScalarValue::from("bar"))
Copy link
Contributor Author

@alamb alamb Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strings were previously not handled, but are now properly handled by StatisticsConverter

@alamb alamb force-pushed the alamb/unified_statistics branch from 90a5ea4 to 3e6a537 Compare June 17, 2024 10:24
}
}

/// Set the null count for the column
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions make creating ColumnStatitistic more ergonomic

@alamb alamb changed the title Update ListingTable to use StatisticsConverter Update ListingTable to use StatisticsConverter, remove redundant statistics extraction code Jun 17, 2024
@alamb alamb marked this pull request as ready for review June 17, 2024 16:25
@alamb
Copy link
Contributor Author

alamb commented Jun 24, 2024

Thank you for the review @xinlifoobar

@Ted-Jiang I wonder if you might be interested in reviewing this PR as well?

Update: let's work with #11068 (I didn't realize there was another PR)

Thanks @xinlifoobar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Update ListingTable to use StatisticsConverter

2 participants