Skip to content

Introduce a Vec<u8> based row-wise representation for DataFusion #1708

@yjshen

Description

@yjshen

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Many pipeline-breaking operators are inherently row-based:

For sort that would shuffle records around, re-order would cause random memory access patterns for each column in the current columnar organization. The performance will deteriorate as the number of columns grows. Besides, the compound sort key also requires us to access different columns.
On the other hand, row-based representation avoids this problem (performance deteriorates with payload column number growth). we can check here for more explanations.

For hashtable entries that we buffer aggregation state, we are already utilizing a row-based format indirectly -- We use Vec<ScalarValue> as a state for each key. Vector of ScalarValue is mostly stored continuously in memory but faced with two kinds of inefficiency: 1. memory overhead introduced by ScalarValue enum (16bytes per field according to @alamb); 2. string or other non-primitive values stored on the heap elsewhere and accessed through pointers.

┌───────────────────────────────────────────────────────┐
│                                                       │
│ ┌────────────────┬────────────────┬────────────────┐  │
│ │  ScalarValue   │  ScalarValue   │  ScalarValue   │  │
│ │    ::Int(5)    │   ::Int(10)    │    ::Int(3)    │  │
│ └────────────────┴────────────────┴────────────────┘  │
│   Hash Table Entry                                    │
│   Vec<ScalarValue>                                    │
└───────────────────────────────────────────────────────┘
 When the keys are primitive values, they are stored     
               contiguously in the Vec                   


                           ┌ ─ ─ ─ ─ ─ ─ ─ ─ ┐           
                                  "foo"                  
                           │(heap allocation)│           
                            ─ ─ ─ ─ ─ ─ ─ ─ ─            
                                    ▲                    
                            ┌───────┘                    
┌───────────────────────────┼───────────────────────────┐
│                           │                           │
│ ┌────────────────┬────────────────┬────────────────┐  │
│ │  ScalarValue   │  ScalarValue   │  ScalarValue   │  │
│ │    ::Int(5)    │ ::Utf8("foo")  │    ::Int(3)    │  │
│ └────────────────┴────────────────┴────────────────┘  │
│   Hash Table Entry                                    │
│   Vec<ScalarValue>                                    │
└───────────────────────────────────────────────────────┘
  When the keys have strings/binary data, the variable   
   length data is stored non contiguously in the Vec     

I quote these two great diagrams above from @alamb. Thanks again!

For join, whether hash-based or sort-based, would suffer from similar problems as above.

Describe the solution you'd like

  1. A Vec<u8> based representation for tuple, store all columns continuously in memory, for row-logic operations.
  2. Efficient coding/decoding method from/to columnar arrow data.
  3. Access cells in vec<u8> tuple efficiently.

We could refer to PostgreSQL / DuckDB / Spark for the row format design. But note Spark's UnsafeRow incurs a lot of memory overhead due to its 8-byte alignment.

Describe alternatives you've considered
Not to introduce or use Vec<ScalarValue> with overhead.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions