[Epic]: Complete ROW Format  (Missing features)

**Goal: a complete row implementation, fully used in pipeline breaker operators when possible.**

**Summary**
TLDR: The key focus of this work is to speed up fundamentally row oriented operations like hash table lookup or comparisons (e.g. [#2427](https://github.com/apache/arrow-datafusion/issues/2427))

**Background**

DataFusion, like many Arrow systems, is a classic "vectorized computation engine" which works quite well for many common operations. The following paper, gives a good treatment on the various tradeoffs between vectorized and JIT's compilation of query plans: https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de

As mentioned in the paper, there are some fundamentally "row oriented" operations in a database that are not typically amenable to vectorization. The "classics" are: Hash table updates in Joins and Hash Aggregates, as well as comparing tuples in sort.

When operating with a Row based format, the per-tuple type dispatch overhead becomes quite important, so such operations are typically implemented using just in time compilation (JIT) or other unsafe mechanims to minimize the overhead

@yjshen added initial support for JIT'ing in https://github.com/apache/arrow-datafusion/pull/1849 and it currently lives in https://github.com/apache/arrow-datafusion/tree/master/datafusion/jit. He also added partial  support for aggregates in https://github.com/apache/arrow-datafusion/pull/2375 

This ticket tracks the remaining work to fully support row formats, including JIT'ing

## Getters and setters
  - [X] #1891  
  - [ ] 1. Support all types that ScalarValue supports
      - [ ] 1.1 all basic types
          - [ ] 1.1.1 Decimal
          - [ ] 1.1.2 Timestamp
          - [ ] 1.1.3 Date
          - [ ] 1.1.4 Interval
          - [ ] 1.1.5 Null
      - [ ] 1.2 composite types: List / Struct
  - [ ] 2. Make varlena offset + length a type parameter for reader and writer, for space efficiency
  - [ ] 3. Assertion based on schema before getting. Think `date64` as an example.

## Formats
  - [ ] 1. basics: #2188 
  - [ ] 2. Compact: write once, never update, Eq comparable
     - [ ] 2.1 all type supports 
  - [ ] 3. WordAligned: update heavy on cells
     - [ ] 3.1 all basic type supports
     - [ ] 3.2 Varlena out-of-place store in memory, and inline/de-inline while serializing/deserializing
   - [ ] 4. RawComparable: best effort comparable based on raw bytes
     - [ ] 4.1 null-inline
     - [ ] 4.2 float bytes comparable
     - [ ] 4.3 comparator with best effort &[u8] comp, and interleave with varlena compare field-by-field

## Hook into execution (mainly the pipeline-breakers)
- [ ] Sort
  - [x] https://github.com/apache/arrow-datafusion/issues/2427
  - [ ] RawComparable as SortByKey
  - [ ] payload #2146 
  - [x] https://github.com/apache/arrow-datafusion/issues/2427
- [ ] Aggregate
  - [X] basics: Compact row as GroupByKey, WordAligned row as aggregation buffer #2375 
  - [x] Unify GroupByRowRow and GroupByRow: https://github.com/apache/arrow-datafusion/issues/2723
  - [ ] GroupBy: complete full row-based accumulator support.
- [ ] Join
  - [ ] Compact row as HashJoinKey
  - [ ] RawComparable row as MergeJoinKey
  - [ ] Compact row as Join payloads 

## Cleanups
  - [ ] Getter / setter / accessor consolidation, DRY

## JIT
   - [ ] basics: JIT the tuple field get/set with schema, avoid branching for each field in each row. (Try to fix in #1849 )
   - [ ] TBD





Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Epic]: Complete ROW Format (Missing features) #1861

Getters and setters

Formats

Hook into execution (mainly the pipeline-breakers)

Cleanups

JIT

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Epic]: Complete ROW Format (Missing features) #1861

Description

Getters and setters

Formats

Hook into execution (mainly the pipeline-breakers)

Cleanups

JIT

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions