better performance improvement by optimizing a single distinct aggregation scene

In most query engines, the execution cost of the distinct aggregation function is huge, but it can be optimized by groupBy. I want to bring this optimization to datafushion

Currently for a single  distinct aggregation scenario as follows
```
- Aggregation
       GROUP BY (k)
       F1(DISTINCT s0, s1, ...),
       F2(DISTINCT s0, s1, ...),
    - X
into
- Aggregation
         GROUP BY (k)
         F1(x)
         F2(x)
     - Aggregation
            GROUP BY (k, s0, s1, ...)
         - X
```

I used a test data set of 60 million to test datafunshion before and after using the optimizer.After optimization,the performance has doubled and the execution time has been reduced from 12 seconds to 6 seconds
The test results and the logical plan before and after optimization are as follows

```
sql : select count(distinct LO_EXTENDEDPRICE) from lineorder_flat;

------------------original---------------------
Display: Projection: #COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE) [COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
  Aggregate: groupBy=[[]], aggr=[[COUNT(DISTINCT #lineorder_flat.LO_EXTENDEDPRICE)]] [COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
    TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64]
+-------------------------------------------------+
| COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE) |
+-------------------------------------------------+
| 1040570                                         |
+-------------------------------------------------+
usage millis: 12033

----------------after optimization-------------
Display: Projection: #COUNT(lineorder_flat.LO_EXTENDEDPRICE) [COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
  Aggregate: groupBy=[[]], aggr=[[COUNT(#lineorder_flat.LO_EXTENDEDPRICE)]] [COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
    Aggregate: groupBy=[[#lineorder_flat.LO_EXTENDEDPRICE]], aggr=[[]] [LO_EXTENDEDPRICE:Int64]
      TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64]
+----------------------------------------+
| COUNT(lineorder_flat.LO_EXTENDEDPRICE) |
+----------------------------------------+
| 1040570                                |
+----------------------------------------+
usage millis: 5817

```

In the case of common aggregation functions and distinct aggregation functions used together, optimization can also be done in a way similar to GROUPING SET. Although it has not been tested on datafushion, I did the above optimization in trino. In our production environment Has a very good performance

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

better performance improvement by optimizing a single distinct aggregation scene #1282

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

better performance improvement by optimizing a single distinct aggregation scene #1282

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions