-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Suppose we want to get the top 5 LO_SUPPKEY for each LO_SHIPMODE based on SUM(LO_EXTENDEDPRICE) in descending order, it can be achieved by the following SQL:
select LO_SHIPMODE,
collect(LO_SUPPKEY)
from
(select LO_SHIPMODE,
LO_SUPPKEY,
ROW_NUMBER() OVER (PARTITION BY LO_SHIPMODE
ORDER BY SUM_LO_EXTENDEDPRICE desc) as rank_num
from
(select LO_SHIPMODE,
LO_SUPPKEY,
SUM(LO_EXTENDEDPRICE) as SUM_LO_EXTENDEDPRICE
from LINEORDER
group by 1,
2) T0) T1
where rank_num <= 5
group by 1
However, if the cardinality of LO_SUPPKEY may be extremely large, like billions, it will be very resource consuming to finish the inner most subquery
select LO_SHIPMODE,
LO_SUPPKEY,
SUM(LO_EXTENDEDPRICE) as SUM_LO_EXTENDEDPRICE
from LINEORDER
group by 1,
2
and the sort operation in the window function.
Describe the solution you'd like
It's would be better to provide a way to achieve an approximate topN result for each group. Therefore, we propose a UDAF to achieve this, like following:
select LO_SHIPMODE,
APPROX_SUM_TOPN(LO_EXTENDEDPRICE, [LO_SUPPKEY], 5) as SUM_LO_EXTENDEDPRICE
from LINEORDER
group by 1
where the parameters for the UDAF APPROX_SUM_TOPN will like (column_to_be_summed, [column1_to_be_topped, column2_to_be_topped, ...], top_k).
The result of this UDAF will be a nested structure. It's an array of struct, which contains at most top_k structs of (column_to_be_summed, column1_to_be_topped, column2_to_be_topped, ...) which is calculated based on the space saving algorithm introduced in paper
Describe alternatives you've considered
Additional context