Description
Describe the bug
I'm testing DataFusion for use in a system that has several thousand columns and billions of rows.
I'm excited about the flexibility and possibilities this technology provides.
The problems we have run into:
- Logical plan optimization is slow because some rules have to copy the whole schema.
  We worked around this with prepared queries (we cache the parameterized logical plan).
- Creating the physical plan consumes up to 35% of CPU time, which is more than executing it (we use several hundred aggregation functions, and DataFusion shows pretty good execution time).
Some investigation showed that there are a lot of string comparisons (take a look at the flamegraph):
29 % datafusion_physical_expr::planner::create_physical_expr
28.5 % --> datafusion_common::dfschema::DFSchema::index_of_column
28.5 % -- --> datafusion_common::dfschema::DFSchema::index_of_column_by_name
7.4 % -- -- --> __memcmp_sse4_1
14.6 % -- -- --> datafusion_common::table_reference::TableReference::resolved_eq
6.8 % -- -- -- --> __memcmp_sse4_1
Currently the algorithm has O(N^2) complexity (N for iterating over all the columns in datafusion_common::dfschema::DFSchema::index_of_column_by_name and N in datafusion_common::table_reference::TableReference::resolved_eq).
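To make the cost concrete, here is a minimal sketch of how a linear-scan lookup adds up when every column of a wide schema is resolved once. It is a simplified model, not DataFusion's actual code: `Field`, `index_of_column_linear` and `total_inspections` are hypothetical names, and the qualifier check is only a stand-in for what `resolved_eq` does.

```rust
/// Simplified stand-in for a qualified schema field.
/// Hypothetical type for illustration; not DataFusion's actual struct.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub qualifier: Option<String>,
    pub name: String,
}

/// Linear-scan lookup: compares the qualifier and name against each field
/// in order until a match is found.
/// Returns the matching index (if any) and how many fields were inspected.
pub fn index_of_column_linear(
    fields: &[Field],
    qualifier: Option<&str>,
    name: &str,
) -> (Option<usize>, usize) {
    let mut inspected = 0;
    for (i, field) in fields.iter().enumerate() {
        inspected += 1;
        // An unqualified reference matches any qualifier (assumption of this model).
        let qualifier_matches = match qualifier {
            None => true,
            Some(q) => field.qualifier.as_deref() == Some(q),
        };
        if qualifier_matches && field.name == name {
            return (Some(i), inspected);
        }
    }
    (None, inspected)
}

/// Resolves every column of the schema once and returns the total number of
/// fields inspected across all lookups.
pub fn total_inspections(fields: &[Field]) -> usize {
    fields
        .iter()
        .map(|f| index_of_column_linear(fields, f.qualifier.as_deref(), &f.name).1)
        .sum()
}
```

In this model, resolving all N uniquely named columns inspects N(N+1)/2 fields in total, and each inspection involves at least one string comparison, which would be consistent with the `memcmp` time in the flamegraph.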
Some ideas to resolve this:
- Use a hashmap or btree in DFSchema instead of a list (to decrease the complexity of resolving a column index by its name)
- Implement parameterization of physical plans and prepared physical plans (so that they can be cached the same way as prepared logical plans)
Thank you for developing such a great tool!
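As a rough illustration of the first idea, the sketch below keeps the field list but adds a name-to-indices map next to it, so a lookup only compares qualifiers for fields that already share the requested name. Everything here is an assumption for illustration: `IndexedSchema` and its methods are hypothetical and are not part of DFSchema's API, and the matching rules are simplified.

```rust
use std::collections::HashMap;

/// Hypothetical schema wrapper that pairs the field list with a name index.
/// Each field is modelled as (optional qualifier, column name).
pub struct IndexedSchema {
    fields: Vec<(Option<String>, String)>,
    /// Column name -> positions of all fields with that name, in schema order.
    by_name: HashMap<String, Vec<usize>>,
}

impl IndexedSchema {
    /// Builds the index once, in O(N) over the fields.
    pub fn new(fields: Vec<(Option<String>, String)>) -> Self {
        let mut by_name: HashMap<String, Vec<usize>> = HashMap::new();
        for (i, (_, name)) in fields.iter().enumerate() {
            by_name.entry(name.clone()).or_default().push(i);
        }
        Self { fields, by_name }
    }

    /// Resolves a column with one hash lookup plus a scan over the
    /// (usually very short) list of fields sharing that name.
    pub fn index_of_column(&self, qualifier: Option<&str>, name: &str) -> Option<usize> {
        let candidates = self.by_name.get(name)?;
        candidates.iter().copied().find(|&i| match qualifier {
            // An unqualified reference takes the first field with this name
            // (simplification; real ambiguity handling is out of scope here).
            None => true,
            Some(q) => self.fields[i].0.as_deref() == Some(q),
        })
    }
}
```

With such an index, resolving all N columns costs roughly N hash lookups instead of N full scans, at the price of extra memory and of keeping the map in sync whenever the schema is built or merged.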
To Reproduce
It's hard to extract code from the project, but I will try to build a simple repro.
Expected behavior
Creating the physical plan should take much less CPU time than executing it.
Additional context
No response
