Skip to content

Bad performance on wide tables (1000+ columns) #7698

@karlovnv

Description

@karlovnv

Describe the bug

I'am testing DataFusion for using it in a system which has several thousand columns and billions of rows.
I'm excited about the flexibility and possibilities this technology provides.

The problems we faced with:

  1. Optimization of the logical plan works slowly because it has to copy the whole schema in some rules.
    We workarounded it with prepared queries (we cache parametrized logical plan)
  2. Creation of physical plan consume up to 35% on CPU, which is more than it's execution (we use several hundreds of aggregation functions and DF shows pretty good execution time)

Some investigation on that showed, that there a lot of string comparisons (take a look at flamegraph)

  29 %      datafusion_physical_expr::planner::create_physical_expr 
  28.5 %    --> datafusion_common::dfschema::DFSchema::index_of_column
  28.5 %    -- --> datafusion_common::dfschema::DFSchema::index_of_column_by_name
   7.4 %    -- -- --> __memcmp_sse4_1
  14.6 %    -- -- --> datafusion_common::table_reference::TableReference::resolved_eq
   6.8 %    -- -- -- --> __memcmp_sse4_1

photo_2023-09-29_14-29-16

Now algorithm has O(N^2) complexity (N in iterating all the columns in
datafusion_common::dfschema::DFSchema::index_of_column_by_name
and N in datafusion_common::table_reference::TableReference::resolved_eq).

https://github.com/apache/arrow-datafusion/blob/22d03c127e7c5e56cf97ae33eb4446d5b7022eaa/datafusion/common/src/dfschema.rs#L211

Some ideas to resolve:

  • Use hashmap or btree in DFSchema instead of list (decrease complexity of resolving column index by it's name)
  • Implement parametrization of Physical plan and prepared physical plans (in order to enable caching it the same as prepared logical plan)

Thank you for developing a such great tool!

To Reproduce

It's hard to extract some code from the project, but I will try to build simple repro

Expected behavior

Creation of physical plan spent much less time in CPU than it's execution

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions