Skip to content

better performance improvement by optimizing a single distinct aggregation scene #1282

@ic4y

Description

@ic4y

In most query engines, the execution cost of the distinct aggregation function is huge, but it can be optimized by groupBy. I want to bring this optimization to datafushion

Currently for a single distinct aggregation scenario as follows

- Aggregation
       GROUP BY (k)
       F1(DISTINCT s0, s1, ...),
       F2(DISTINCT s0, s1, ...),
    - X
into
- Aggregation
         GROUP BY (k)
         F1(x)
         F2(x)
     - Aggregation
            GROUP BY (k, s0, s1, ...)
         - X

I used a test data set of 60 million to test datafunshion before and after using the optimizer.After optimization,the performance has doubled and the execution time has been reduced from 12 seconds to 6 seconds
The test results and the logical plan before and after optimization are as follows

sql : select count(distinct LO_EXTENDEDPRICE) from lineorder_flat;

------------------original---------------------
Display: Projection: #COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE) [COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
  Aggregate: groupBy=[[]], aggr=[[COUNT(DISTINCT #lineorder_flat.LO_EXTENDEDPRICE)]] [COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
    TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64]
+-------------------------------------------------+
| COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE) |
+-------------------------------------------------+
| 1040570                                         |
+-------------------------------------------------+
usage millis: 12033

----------------after optimization-------------
Display: Projection: #COUNT(lineorder_flat.LO_EXTENDEDPRICE) [COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
  Aggregate: groupBy=[[]], aggr=[[COUNT(#lineorder_flat.LO_EXTENDEDPRICE)]] [COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
    Aggregate: groupBy=[[#lineorder_flat.LO_EXTENDEDPRICE]], aggr=[[]] [LO_EXTENDEDPRICE:Int64]
      TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64]
+----------------------------------------+
| COUNT(lineorder_flat.LO_EXTENDEDPRICE) |
+----------------------------------------+
| 1040570                                |
+----------------------------------------+
usage millis: 5817

In the case of common aggregation functions and distinct aggregation functions used together, optimization can also be done in a way similar to GROUPING SET. Although it has not been tested on datafushion, I did the above optimization in trino. In our production environment Has a very good performance

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions