-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge?
DataFusion optimizes queries like ... ORDER BY value LIMIT 10 by only keeping the top 10 ("limit") rows when sorting which is great!
Another common pattern (that we also have in IOx) (https://github.com/influxdata/influxdb_iox/pull/8187/files#r1257834347) is queries like the following to select the top N values "per partition"
SELECT ...
ROW_NUMBER() OVER (PARTITION BY value1, ORDER BY value2) as rn
WHERE
rn < 10Currently the plan will be something like:
Filter(rn < 10)
WindowExec(ROW_NUMBER...)
Sort(value1, value2)
The problem with this plan is that it will sort (and copy) the ENTIRE input even when the query only needs the first 10 rows of each partition
Describe the solution you'd like
It would be awesome to optimize this case somehow so that it did not need to sort the entire input (and somehow could only keep the top N values per partition). I am not sure how easy this would be to do for sorting
Describe alternatives you've considered
Maybe we could at least teach the window operator to only emit the top N values per partition if there was a row number predicate at at least save some of that work -- the sort would still be required, but at least the window operator would do less work
Additional context
No response