Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Many pipeline-breaking operators are inherently row-based:
For sort, which shuffles records around, reordering causes random memory access patterns for each column under the current columnar organization. Performance deteriorates as the number of columns grows. In addition, a compound sort key requires accessing several different columns.
A row-based representation, on the other hand, avoids this problem of performance deteriorating as the number of payload columns grows. See here for more explanation.
For hashtable entries in which we buffer aggregation state, we already use a row-based format indirectly: we use Vec<ScalarValue> as the state for each key. A vector of ScalarValue is mostly stored contiguously in memory, but it suffers from two kinds of inefficiency: 1. memory overhead introduced by the ScalarValue enum (16 bytes per field according to @alamb); 2. strings and other non-primitive values are stored elsewhere on the heap and accessed through pointers.
┌───────────────────────────────────────────────────────┐
│ │
│ ┌────────────────┬────────────────┬────────────────┐ │
│ │ ScalarValue │ ScalarValue │ ScalarValue │ │
│ │ ::Int(5) │ ::Int(10) │ ::Int(3) │ │
│ └────────────────┴────────────────┴────────────────┘ │
│ Hash Table Entry │
│ Vec<ScalarValue> │
└───────────────────────────────────────────────────────┘
When the keys are primitive values, they are stored
contiguously in the Vec
┌ ─ ─ ─ ─ ─ ─ ─ ─ ┐
"foo"
│(heap allocation)│
─ ─ ─ ─ ─ ─ ─ ─ ─
▲
┌───────┘
┌───────────────────────────┼───────────────────────────┐
│ │ │
│ ┌────────────────┬────────────────┬────────────────┐ │
│ │ ScalarValue │ ScalarValue │ ScalarValue │ │
│ │ ::Int(5) │ ::Utf8("foo") │ ::Int(3) │ │
│ └────────────────┴────────────────┴────────────────┘ │
│ Hash Table Entry │
│ Vec<ScalarValue> │
└───────────────────────────────────────────────────────┘
When the keys have strings/binary data, the variable
length data is stored non contiguously in the Vec
I quote these two great diagrams above from @alamb. Thanks again!
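The two inefficiencies shown in the diagrams can be observed with a small sketch. ToyScalar below is a hypothetical two-variant stand-in, not DataFusion's actual ScalarValue, so its exact size will differ from the real enum; the point is only that each entry is larger than its raw payload and that string bytes live outside the Vec's buffer.

```rust
use std::mem::size_of;

// Hypothetical stand-in for ScalarValue with only two variants (an assumption
// for illustration; the real enum has many more).
#[allow(dead_code)]
enum ToyScalar {
    Int64(Option<i64>),
    Utf8(Option<String>),
}

// Bytes each Vec entry uses beyond a raw 8-byte integer payload.
fn per_field_overhead() -> usize {
    size_of::<ToyScalar>() - size_of::<i64>()
}

// Returns true if the string bytes of `entry[idx]` are stored inside the
// Vec's own contiguous buffer, false if they live in a separate allocation
// (or if the entry is not a non-null string).
fn string_is_inline(entry: &[ToyScalar], idx: usize) -> bool {
    let start = entry.as_ptr() as usize;
    let end = start + entry.len() * size_of::<ToyScalar>();
    match &entry[idx] {
        ToyScalar::Utf8(Some(s)) => {
            let p = s.as_ptr() as usize;
            p >= start && p < end
        }
        _ => false,
    }
}
```

For an entry like [Int64(5), Utf8("foo"), Int64(3)], string_is_inline reports that "foo" is not inside the Vec buffer, which matches the second diagram.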
Joins, whether hash-based or sort-based, suffer from similar problems.
Describe the solution you'd like
- A Vec<u8>-based representation for tuples that stores all columns contiguously in memory, for row-logic operations.
- Efficient encoding/decoding methods from/to columnar Arrow data.
- Efficient access to cells in a Vec<u8> tuple.
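To make the three points above concrete, here is a minimal sketch of one possible layout. It is not an existing DataFusion API: RowLayout, FieldType, Value, and the layout itself (null bitmap, then one 8-byte slot per field, then variable-length bytes) are all assumptions for illustration, and it encodes from plain values rather than from Arrow arrays.

```rust
use std::convert::TryInto;

// Hypothetical row layout: [null bitmap][8-byte slot per field][var-length data].
// An Int64 slot holds the value; a Utf8 slot holds (offset: u32, len: u32)
// pointing into the variable-length region of the same row.
#[derive(Clone, Copy, Debug, PartialEq)]
enum FieldType {
    Int64,
    Utf8,
}

#[derive(Clone, Debug, PartialEq)]
enum Value {
    Null,
    Int64(i64),
    Utf8(String),
}

struct RowLayout {
    types: Vec<FieldType>,
    bitmap_len: usize,
}

impl RowLayout {
    fn new(types: Vec<FieldType>) -> Self {
        let bitmap_len = (types.len() + 7) / 8;
        RowLayout { types, bitmap_len }
    }

    // Byte offset of the fixed-width slot for field `i`.
    fn slot(&self, i: usize) -> usize {
        self.bitmap_len + i * 8
    }

    // Encode one tuple into a single contiguous byte buffer.
    fn encode(&self, values: &[Value]) -> Vec<u8> {
        assert_eq!(values.len(), self.types.len());
        let mut row = vec![0u8; self.bitmap_len + self.types.len() * 8];
        for (i, v) in values.iter().enumerate() {
            let slot = self.slot(i);
            match (self.types[i], v) {
                (_, Value::Null) => {} // validity bit stays 0, slot stays zeroed
                (FieldType::Int64, Value::Int64(x)) => {
                    row[i / 8] |= 1 << (i % 8);
                    row[slot..slot + 8].copy_from_slice(&x.to_le_bytes());
                }
                (FieldType::Utf8, Value::Utf8(s)) => {
                    row[i / 8] |= 1 << (i % 8);
                    let offset = row.len() as u32;
                    let len = s.len() as u32;
                    row[slot..slot + 4].copy_from_slice(&offset.to_le_bytes());
                    row[slot + 4..slot + 8].copy_from_slice(&len.to_le_bytes());
                    row.extend_from_slice(s.as_bytes());
                }
                _ => panic!("value does not match field type"),
            }
        }
        row
    }

    // Read field `i` directly from the encoded bytes.
    fn get(&self, row: &[u8], i: usize) -> Value {
        if (row[i / 8] & (1 << (i % 8))) == 0 {
            return Value::Null;
        }
        let slot = self.slot(i);
        match self.types[i] {
            FieldType::Int64 => {
                let bytes: [u8; 8] = row[slot..slot + 8].try_into().unwrap();
                Value::Int64(i64::from_le_bytes(bytes))
            }
            FieldType::Utf8 => {
                let off = u32::from_le_bytes(row[slot..slot + 4].try_into().unwrap()) as usize;
                let len = u32::from_le_bytes(row[slot + 4..slot + 8].try_into().unwrap()) as usize;
                Value::Utf8(String::from_utf8(row[off..off + len].to_vec()).unwrap())
            }
        }
    }
}
```

Because every slot sits at a fixed offset, reading a cell is a constant-time index into the buffer, and string bytes travel with the row instead of living in a separate heap allocation. A real design would likely use narrower slots per type and convert whole Arrow arrays at once.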
We could refer to PostgreSQL / DuckDB / Spark for the row format design. But note Spark's UnsafeRow incurs a lot of memory overhead due to its 8-byte alignment.
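As a rough illustration of the alignment overhead, the sketch below compares row sizes for fixed-width fields only. It assumes, as I understand the UnsafeRow layout, a null bitset rounded up to 8-byte words plus one 8-byte slot per field; the packed alternative (a byte-granular null bitmap plus each field at its natural width) is a hypothetical layout, and both function names are made up for this example.

```rust
// Row size under an 8-byte-word layout in the style of Spark's UnsafeRow
// (assumed: null bitset in 8-byte words, then 8 bytes per field).
fn word_aligned_row_size(num_fields: usize) -> usize {
    let bitset = ((num_fields + 63) / 64) * 8;
    bitset + num_fields * 8
}

// Row size under a hypothetical packed layout: one validity bit per field,
// rounded up to whole bytes, then each field at its natural width.
fn packed_row_size(field_widths: &[usize]) -> usize {
    let bitmap = (field_widths.len() + 7) / 8;
    bitmap + field_widths.iter().sum::<usize>()
}
```

For a tuple of (i8, i32, bool, i16), with widths 1, 4, 1, and 2 bytes, the word-aligned size is 8 + 4 * 8 = 40 bytes, while the packed size is 1 + 8 = 9 bytes.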
Describe alternatives you've considered
Do not introduce a row format, and keep using Vec<ScalarValue> with its overhead.