Question: is the combination of limit and predicate push-down safe in ParquetExec? #900

@andygrove

Description

Describe the bug
We push both pruning predicates and limits down to ParquetExec, but the combination seems like it could be unsafe, or perhaps I am not following the logic?

Given the predicate x > 10 and a limit of 10, suppose we have the following 2 partitions:

  • Partition 0 has 100 rows and 5 rows match x > 10
  • Partition 1 has 100 rows and 5 rows match x > 10

As we iterate over row groups we have

for row_group_meta in meta_data.row_groups() {
    num_rows += row_group_meta.num_rows();
    // ...
}

We break out of this loop once num_rows reaches the limit:

if limit.map(|x| num_rows >= x as i64).unwrap_or(false) {
    limit_exhausted = true;
    break;
}

This stops processing the file once the limit is reached, without considering how many rows the predicate would match.
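The scenario above can be sketched as follows. This is not the DataFusion code itself, just a minimal simulation of the counting logic in the loop above: each partition holds (total_rows, matching_rows), and the limit check is driven by total row counts rather than matching row counts.

```rust
/// Sketch (hypothetical helper, not the DataFusion API): counts how many
/// matching rows are actually scanned when the limit check uses *total*
/// row counts, as in the loop above.
fn matching_rows_scanned(partitions: &[(i64, i64)], limit: i64) -> i64 {
    let mut num_rows = 0;
    let mut matching = 0;
    for &(total, matches) in partitions {
        num_rows += total;
        matching += matches;
        // Mirrors `num_rows >= limit` -> limit_exhausted
        if num_rows >= limit {
            break;
        }
    }
    matching
}

fn main() {
    // Two partitions of 100 rows each, 5 rows matching `x > 10` in each
    // (the scenario above). The limit of 10 trips after partition 0, so
    // only 5 of the 10 matching rows are ever scanned.
    let scanned = matching_rows_scanned(&[(100, 5), (100, 5)], 10);
    println!("matching rows scanned: {scanned}"); // prints 5, not 10
}
```

If this sketch reflects the real behavior, the query could return fewer rows than the limit even though enough matching rows exist across the partitions.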

Finally we stop processing partitions as well, here:

// remove files that are not needed in case of limit
filenames.truncate(total_files);
partitions.push(ParquetPartition::new(filenames, statistics));
if limit_exhausted {
    break;
}

To Reproduce
When I have time I will write a test to see whether there is an issue here.

Expected behavior
Perhaps we should not apply the limit when we are pushing predicates down?
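A minimal sketch of that idea (hypothetical helper, not an existing DataFusion function): drop the pushed-down limit whenever a pruning predicate is present, since the raw row counts that drive the limit do not account for rows the predicate will filter out.

```rust
/// Hypothetical guard: only keep the pushed-down limit when there is no
/// pruning predicate, because the limit is checked against total row
/// counts, not post-filter row counts.
fn pushdown_limit(limit: Option<usize>, has_predicate: bool) -> Option<usize> {
    if has_predicate { None } else { limit }
}
```

The limit would then be applied above the scan as usual, at the cost of reading more row groups when a predicate is present.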

Additional context
N/A
