Skip to content

Optimize "per partition" top-k : ROW_NUMBER < 5 / TopK #6899

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

DataFusion optimizes queries like ... ORDER BY value LIMIT 10 by only keeping the top 10 ("limit") rows when sorting which is great!

Another common pattern (that we also have in IOx) (https://github.com/influxdata/influxdb_iox/pull/8187/files#r1257834347) is queries like the following to select the top N values "per partition"

SELECT ...
  ROW_NUMBER() OVER (PARTITION BY value1, ORDER BY value2) as rn
WHERE
  rn < 10

Currently the plan will be something like:

Filter(rn < 10)
  WindowExec(ROW_NUMBER...)
    Sort(value1, value2)

The problem with this plan is that it will sort (and copy) the ENTIRE input even when the query only needs the first 10 rows of each partition

Describe the solution you'd like

It would be awesome to optimize this case somehow so that it did not need to sort the entire input (and somehow could only keep the top N values per partition). I am not sure how easy this would be to do for sorting

Describe alternatives you've considered

Maybe we could at least teach the window operator to only emit the top N values per partition if there was a row number predicate at at least save some of that work -- the sort would still be required, but at least the window operator would do less work

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions