Skip to content

Add a built-in UDAF approx_sum_topn based on space saving algorithm #2365

@yahoNanJing

Description

@yahoNanJing

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Suppose we want to get the top 5 LO_SUPPKEY for each LO_SHIPMODE based on SUM(LO_EXTENDEDPRICE) in descending order, it can be achieved by the following SQL:

select LO_SHIPMODE,
       collect(LO_SUPPKEY)
from
  (select LO_SHIPMODE,
          LO_SUPPKEY,
          ROW_NUMBER() OVER (PARTITION BY LO_SHIPMODE
                             ORDER BY SUM_LO_EXTENDEDPRICE desc) as rank_num
   from
     (select LO_SHIPMODE,
             LO_SUPPKEY,
             SUM(LO_EXTENDEDPRICE) as SUM_LO_EXTENDEDPRICE
      from LINEORDER
      group by 1,
               2) T0) T1
where rank_num <= 5
group by 1

However, if the cardinality of LO_SUPPKEY may be extremely large, like billions, it will be very resource consuming to finish the inner most subquery

select LO_SHIPMODE,
       LO_SUPPKEY,
       SUM(LO_EXTENDEDPRICE) as SUM_LO_EXTENDEDPRICE
from LINEORDER
group by 1,
         2

and the sort operation in the window function.

Describe the solution you'd like

It's would be better to provide a way to achieve an approximate topN result for each group. Therefore, we propose a UDAF to achieve this, like following:

select LO_SHIPMODE,
       APPROX_SUM_TOPN(LO_EXTENDEDPRICE, [LO_SUPPKEY], 5) as SUM_LO_EXTENDEDPRICE
from LINEORDER
group by 1

where the parameters for the UDAF APPROX_SUM_TOPN will like (column_to_be_summed, [column1_to_be_topped, column2_to_be_topped, ...], top_k).

The result of this UDAF will be a nested structure. It's an array of struct, which contains at most top_k structs of (column_to_be_summed, column1_to_be_topped, column2_to_be_topped, ...) which is calculated based on the space saving algorithm introduced in paper

Describe alternatives you've considered

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions