-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge?
Thanks to #10453, we now have the very nice StatisticsConverter API (code link) that extracts statistics from parquet row groups and is well tested.
However, there is a different code path that extracts and summarizes statistics in ListingTable used for pruning files
datafusion/datafusion/core/src/datasource/file_format/parquet.rs
Lines 310 to 373 in 9ab597b
| match stat { | |
| ParquetStatistics::Boolean(s) if DataType::Boolean == *fields[i].data_type() => { | |
| if let Some(max_value) = &mut max_values[i] { | |
| max_value | |
| .update_batch(&[Arc::new(BooleanArray::from(vec![*s.max()]))]) | |
| .unwrap_or_else(|_| max_values[i] = None); | |
| } | |
| if let Some(min_value) = &mut min_values[i] { | |
| min_value | |
| .update_batch(&[Arc::new(BooleanArray::from(vec![*s.min()]))]) | |
| .unwrap_or_else(|_| min_values[i] = None); | |
| } | |
| } | |
| ParquetStatistics::Int32(s) if DataType::Int32 == *fields[i].data_type() => { | |
| if let Some(max_value) = &mut max_values[i] { | |
| max_value | |
| .update_batch(&[Arc::new(Int32Array::from_value(*s.max(), 1))]) | |
| .unwrap_or_else(|_| max_values[i] = None); | |
| } | |
| if let Some(min_value) = &mut min_values[i] { | |
| min_value | |
| .update_batch(&[Arc::new(Int32Array::from_value(*s.min(), 1))]) | |
| .unwrap_or_else(|_| min_values[i] = None); | |
| } | |
| } | |
| ParquetStatistics::Int64(s) if DataType::Int64 == *fields[i].data_type() => { | |
| if let Some(max_value) = &mut max_values[i] { | |
| max_value | |
| .update_batch(&[Arc::new(Int64Array::from_value(*s.max(), 1))]) | |
| .unwrap_or_else(|_| max_values[i] = None); | |
| } | |
| if let Some(min_value) = &mut min_values[i] { | |
| min_value | |
| .update_batch(&[Arc::new(Int64Array::from_value(*s.min(), 1))]) | |
| .unwrap_or_else(|_| min_values[i] = None); | |
| } | |
| } | |
| ParquetStatistics::Float(s) if DataType::Float32 == *fields[i].data_type() => { | |
| if let Some(max_value) = &mut max_values[i] { | |
| max_value | |
| .update_batch(&[Arc::new(Float32Array::from(vec![*s.max()]))]) | |
| .unwrap_or_else(|_| max_values[i] = None); | |
| } | |
| if let Some(min_value) = &mut min_values[i] { | |
| min_value | |
| .update_batch(&[Arc::new(Float32Array::from(vec![*s.min()]))]) | |
| .unwrap_or_else(|_| min_values[i] = None); | |
| } | |
| } | |
| ParquetStatistics::Double(s) if DataType::Float64 == *fields[i].data_type() => { | |
| if let Some(max_value) = &mut max_values[i] { | |
| max_value | |
| .update_batch(&[Arc::new(Float64Array::from(vec![*s.max()]))]) | |
| .unwrap_or_else(|_| max_values[i] = None); | |
| } | |
| if let Some(min_value) = &mut min_values[i] { | |
| min_value | |
| .update_batch(&[Arc::new(Float64Array::from(vec![*s.min()]))]) | |
| .unwrap_or_else(|_| min_values[i] = None); | |
| } | |
| } | |
| _ => { | |
| max_values[i] = None; | |
| min_values[i] = None; |
In addition to being more code, this also seems like it doesn't properly convert the data types
Which is used
datafusion/datafusion/core/src/datasource/file_format/parquet.rs
Lines 511 to 535 in ea21b08
| if has_statistics { | |
| for (table_idx, null_cnt) in null_counts.iter_mut().enumerate() { | |
| if let Some(file_idx) = | |
| schema_adapter.map_column_index(table_idx, &file_schema) | |
| { | |
| if let Some((null_count, stats)) = column_stats.get(&file_idx) { | |
| *null_cnt = null_cnt.add(&Precision::Exact(*null_count as usize)); | |
| summarize_min_max( | |
| &mut max_values, | |
| &mut min_values, | |
| fields, | |
| table_idx, | |
| stats, | |
| ) | |
| } else { | |
| // If none statistics of current column exists, set the Max/Min Accumulator to None. | |
| max_values[table_idx] = None; | |
| min_values[table_idx] = None; | |
| } | |
| } else { | |
| *null_cnt = null_cnt.add(&Precision::Exact(num_rows as usize)); | |
| } | |
| } | |
| } | |
| } |
- @NGA-TRAN recently made this API public in refactor: fetch statistics for a given ParquetMetaData #10880 but I think we can do better
Describe the solution you'd like
I would like to update the code to use StatisticsConverter and rather than converting row/column at a time
Describe alternatives you've considered
No response
Additional context
No response