Describe the bug
We push both pruning predicates and limits down to ParquetExec, but it seems like the combination could be unsafe, or perhaps I am misunderstanding the logic?
Given the predicate `x > 10` and a limit of 10, suppose we have the following 2 partitions:

- Partition 0 has 100 rows, of which 5 match `x > 10`
- Partition 1 has 100 rows, of which 5 match `x > 10`

Producing the 10 matching rows the limit asks for requires scanning all 200 rows across both partitions, so any limit applied to raw (unfiltered) row counts will stop too early.
As we iterate over row groups, we break out of the loop once the limit is hit, based on `num_rows`:

```rust
for row_group_meta in meta_data.row_groups() {
    num_rows += row_group_meta.num_rows();
    if limit.map(|x| num_rows >= x as i64).unwrap_or(false) {
        limit_exhausted = true;
        break;
    }
}
```

This stops processing the file once the limit is reached, without considering how many rows the predicate would actually match.
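To make the interaction concrete, here is a minimal, self-contained simulation of the counting logic above against the two example partitions. The `RowGroup` struct and its fields are made up for this sketch; they stand in for the real parquet row-group metadata:

```rust
/// Hypothetical stand-in for parquet row-group metadata (not a real type).
struct RowGroup {
    num_rows: i64,      // total rows in the row group
    matching_rows: i64, // rows that actually satisfy `x > 10`
}

fn main() {
    // Partition 0 and partition 1: one row group each,
    // 100 rows per group, only 5 of which match the predicate.
    let partitions = vec![
        vec![RowGroup { num_rows: 100, matching_rows: 5 }],
        vec![RowGroup { num_rows: 100, matching_rows: 5 }],
    ];

    let limit: Option<usize> = Some(10);
    let mut num_rows: i64 = 0;
    let mut matched: i64 = 0;
    let mut limit_exhausted = false;

    'outer: for row_groups in &partitions {
        for rg in row_groups {
            num_rows += rg.num_rows;
            matched += rg.matching_rows;
            // Same check as the loop above: the limit is compared against
            // the *raw* row count, not the post-predicate row count.
            if limit.map(|x| num_rows >= x as i64).unwrap_or(false) {
                limit_exhausted = true;
                break 'outer;
            }
        }
    }

    // Prints: raw rows scanned = 100, matching rows = 5, limit_exhausted = true
    println!(
        "raw rows scanned = {num_rows}, matching rows = {matched}, limit_exhausted = {limit_exhausted}"
    );
}
```

The scan stops after partition 0's first row group (100 raw rows >= the limit of 10), even though only 5 of the 10 matching rows have been read.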
Finally we stop processing partitions as well, here:

```rust
// remove files that are not needed in case of limit
filenames.truncate(total_files);
partitions.push(ParquetPartition::new(filenames, statistics));
if limit_exhausted {
    break;
}
```
To Reproduce
When I have time, I will write a test to confirm whether there is actually an issue here.
Expected behavior
Perhaps we should not apply the limit when we are pushing predicates down?
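If we take that route, a minimal sketch of the guard might look like the following, assuming we can see at planning time whether a predicate was pushed down. The names `pushed_down_predicate` and `requested_limit` are illustrative only, not actual ParquetExec fields:

```rust
// Hypothetical guard: only push the limit into the parquet scan when no
// predicate has been pushed down, since the limit counts raw rows.
fn effective_limit(
    pushed_down_predicate: Option<&str>, // stand-in for a real predicate type
    requested_limit: Option<usize>,
) -> Option<usize> {
    if pushed_down_predicate.is_some() {
        // A predicate may filter out rows, so a raw-row-count limit could
        // stop the scan before enough matching rows have been read.
        None
    } else {
        requested_limit
    }
}

fn main() {
    assert_eq!(effective_limit(Some("x > 10"), Some(10)), None);
    assert_eq!(effective_limit(None, Some(10)), Some(10));
    println!("guard behaves as expected");
}
```

A less conservative alternative would be to keep the limit but count only rows that survive the predicate, though that cannot be decided from row-group statistics alone during the metadata pass.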
Additional context
N/A