-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
In most query engines, the execution cost of the distinct aggregation function is huge, but it can be optimized by groupBy. I want to bring this optimization to datafushion
Currently for a single distinct aggregation scenario as follows
- Aggregation
GROUP BY (k)
F1(DISTINCT s0, s1, ...),
F2(DISTINCT s0, s1, ...),
- X
into
- Aggregation
GROUP BY (k)
F1(x)
F2(x)
- Aggregation
GROUP BY (k, s0, s1, ...)
- X
I used a test data set of 60 million to test datafunshion before and after using the optimizer.After optimization,the performance has doubled and the execution time has been reduced from 12 seconds to 6 seconds
The test results and the logical plan before and after optimization are as follows
sql : select count(distinct LO_EXTENDEDPRICE) from lineorder_flat;
------------------original---------------------
Display: Projection: #COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE) [COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
Aggregate: groupBy=[[]], aggr=[[COUNT(DISTINCT #lineorder_flat.LO_EXTENDEDPRICE)]] [COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64]
+-------------------------------------------------+
| COUNT(DISTINCT lineorder_flat.LO_EXTENDEDPRICE) |
+-------------------------------------------------+
| 1040570 |
+-------------------------------------------------+
usage millis: 12033
----------------after optimization-------------
Display: Projection: #COUNT(lineorder_flat.LO_EXTENDEDPRICE) [COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
Aggregate: groupBy=[[]], aggr=[[COUNT(#lineorder_flat.LO_EXTENDEDPRICE)]] [COUNT(lineorder_flat.LO_EXTENDEDPRICE):UInt64;N]
Aggregate: groupBy=[[#lineorder_flat.LO_EXTENDEDPRICE]], aggr=[[]] [LO_EXTENDEDPRICE:Int64]
TableScan: lineorder_flat projection=Some([9]) [LO_EXTENDEDPRICE:Int64]
+----------------------------------------+
| COUNT(lineorder_flat.LO_EXTENDEDPRICE) |
+----------------------------------------+
| 1040570 |
+----------------------------------------+
usage millis: 5817
In the case of common aggregation functions and distinct aggregation functions used together, optimization can also be done in a way similar to GROUPING SET. Although it has not been tested on datafushion, I did the above optimization in trino. In our production environment Has a very good performance