Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Upstream in DataFusion, there is a common pattern where we have multiple input RecordBatches and want to produce an output RecordBatch with some subset of the rows from the input batches. This happens in:
- `FilterExec --> CoalesceBatchesExec` (when filtering)
- `RepartitionExec --> CoalesceBatchesExec`
The kernels used here are:
- `FilterExec` uses `filter`, which takes a single input `Array` and produces a single output `Array`
- `RepartitionExec` uses `take`, which also takes a single input `Array` and produces a single output `Array`
- `CoalesceBatchesExec` calls `concat`, which takes multiple `Array`s and produces a single `Array` as output
The use of these kernels and patterns has two downsides:
- Performance overhead due to a second copy: calling `filter`/`take` immediately copies the data, which is then copied again in `CoalesceBatches` (see illustration below)
- Memory overhead / performance overhead for garbage collecting StringView: buffering up several `RecordBatch`es with StringView may consume significant amounts of memory for mostly filtered rows, which requires us to run gc periodically, which actually slows some things down (see Reduce copying in `CoalesceBatchesExec` for StringViews, datafusion#11628)
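To make the "second copy" concrete, here is a minimal, Vec-based sketch of the current pattern. This is a toy model (plain `Vec<i32>` standing in for Arrow arrays; the `filter` and `concat` functions below are simplified stand-ins for the Arrow kernels, not their real implementations): each selected row is copied once by `filter` into an intermediate batch, and again by `concat` into the final output.

```rust
/// First copy: keep only the rows selected by the boolean mask,
/// materializing a new intermediate batch.
fn filter(batch: &[i32], mask: &[bool]) -> Vec<i32> {
    batch
        .iter()
        .zip(mask)
        .filter_map(|(v, keep)| keep.then_some(*v))
        .collect()
}

/// Second copy: concatenate the buffered intermediate batches
/// into one large output batch.
fn concat(batches: &[Vec<i32>]) -> Vec<i32> {
    batches.iter().flatten().copied().collect()
}

fn main() {
    let inputs = [vec![1, 2, 3, 4], vec![5, 6, 7, 8]];
    let masks = [
        vec![true, false, true, false],
        vec![false, true, false, true],
    ];

    // Every surviving row is copied twice: once by `filter`,
    // once more by `concat`.
    let intermediates: Vec<Vec<i32>> = inputs
        .iter()
        .zip(&masks)
        .map(|(batch, mask)| filter(batch, mask))
        .collect();
    let output = concat(&intermediates);
    assert_eq!(output, vec![1, 3, 6, 8]);
}
```

The intermediate `Vec<Vec<i32>>` is exactly the buffering (and extra copy) this issue proposes to eliminate.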
Here is an ASCII art picture (from apache/datafusion#7957) that shows the extra copy in action:
```text
┌────────────────────┐        Filter
│                    │      ┌────────────────────┐        Coalesce
│                    │─ ─ ─▶│    RecordBatch     │         Batches
│    RecordBatch     │      │   num_rows = 234   │─ ─ ─ ─ ┐
│  num_rows = 8000   │      └────────────────────┘
│                    │                                    │
│                    │
└────────────────────┘                                    │ ┌────────────────────┐
                                                            │                    │
┌────────────────────┐        Filter                      │ │                    │
│                    │      ┌────────────────────┐          │                    │
│                    │─ ─ ─▶│    RecordBatch     │─ ─ ─ ─ ─▶│                    │
│    RecordBatch     │      │   num_rows = 500   │─ ─ ─ ─ ┐ │                    │
│  num_rows = 8000   │      └────────────────────┘          │    RecordBatch     │
│                    │                                    └▶│  num_rows = 8000   │
│                    │                                      │                    │
└────────────────────┘             ...           ─ ─ ─ ─ ─ ▶│                    │
         ...                                                │                    │
┌────────────────────┐        Filter                        │                    │
│                    │      ┌────────────────────┐          └────────────────────┘
│                    │─ ─ ─▶│    RecordBatch     │
│    RecordBatch     │      │   num_rows = 333   │─ ─ ─ ─ ─ ┘
│  num_rows = 8000   │      └────────────────────┘
│                    │
│                    │
└────────────────────┘

 FilterExec                         RepartitionExec copies the data
 creates output batches with        *again* to form final large
 copies of the matching rows        RecordBatches
 (calls take() to make a copy)
```
Describe the solution you'd like
I would like to apply `filter`/`take` to each incoming `RecordBatch` as it arrives, copying the selected rows into an in-progress output array, in a way that is as fast as the `filter` and `take` operations. This would avoid the extra copy that is currently required.
Note this is somewhat like the `interleave` kernel, except that:
- We only need the output rows to be in the same order as the input batches (so the second `usize` batch index is not needed)
- We don't want to have to buffer all the input
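A toy, Vec-based sketch of that contrast (the `interleave` function below models only the semantics of the Arrow kernel, and `append_selected` is a hypothetical name for the streaming alternative, not an existing API): `interleave` needs all batches buffered plus explicit `(batch_index, row_index)` pairs, whereas when output order follows input order, row indices alone suffice and batches can be consumed one at a time.

```rust
/// Toy model of `interleave` semantics: select rows from fully
/// buffered input batches by (batch_index, row_index) pairs.
fn interleave(batches: &[Vec<i32>], indices: &[(usize, usize)]) -> Vec<i32> {
    indices.iter().map(|&(b, r)| batches[b][r]).collect()
}

/// Hypothetical streaming alternative: append the selected rows of one
/// batch as it arrives. No batch index, no buffering of all inputs.
fn append_selected(out: &mut Vec<i32>, batch: &[i32], rows: &[usize]) {
    out.extend(rows.iter().map(|&r| batch[r]));
}

fn main() {
    let batches = [vec![10, 20, 30], vec![40, 50, 60]];

    // interleave: all batches buffered, batch indices required.
    let a = interleave(&batches, &[(0, 0), (0, 2), (1, 1)]);

    // streaming: same result, one batch at a time, row indices only.
    let mut b = Vec::new();
    append_selected(&mut b, &batches[0], &[0, 2]);
    append_selected(&mut b, &batches[1], &[1]);

    assert_eq!(a, b);
}
```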
Describe alternatives you've considered
One thing I have thought about is extending the builders so they can append more than one row at a time. For example:
- `Builder::append_filtered`
- `Builder::append_take`
So, for example, to filter a stream of `StringViewArray`s I might do something like:
```rust
let mut builder = StringViewBuilder::new();
while let Some(input) = stream.next() {
    // compute some subset of input rows that make it to the output
    let filter: BooleanArray = compute_filter(&input, ....);
    // append all rows from input where filter[i] is true
    builder.append_filtered(&input, &filter);
}
```

And also add an equivalent for `append_take`.
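To show the intended single-copy behavior of both proposed methods, here is a self-contained sketch using a hypothetical Vec-backed `I32Builder` (the real change would live on the Arrow builders such as `StringViewBuilder`; the names and signatures here are illustrative assumptions only): selected rows go straight from the input into the in-progress output, with no intermediate filtered array.

```rust
/// Toy stand-in for an Arrow builder, backed by a plain Vec.
struct I32Builder {
    values: Vec<i32>,
}

impl I32Builder {
    fn new() -> Self {
        Self { values: Vec::new() }
    }

    /// Append all rows of `input` where `filter[i]` is true
    /// (hypothetical `append_filtered`): one copy, straight into
    /// the in-progress output.
    fn append_filtered(&mut self, input: &[i32], filter: &[bool]) -> &mut Self {
        self.values
            .extend(input.iter().zip(filter).filter_map(|(v, k)| k.then_some(*v)));
        self
    }

    /// Append the rows of `input` selected by `indices`, in index
    /// order (hypothetical `append_take`).
    fn append_take(&mut self, input: &[i32], indices: &[usize]) -> &mut Self {
        self.values.extend(indices.iter().map(|&i| input[i]));
        self
    }

    fn finish(self) -> Vec<i32> {
        self.values
    }
}

fn main() {
    let mut builder = I32Builder::new();
    builder.append_filtered(&[1, 2, 3], &[true, false, true]);
    builder.append_take(&[4, 5, 6], &[2, 0]);
    assert_eq!(builder.finish(), vec![1, 3, 6, 4]);
}
```

Returning `&mut Self` keeps the fluent style used in the `filter` refactoring sketch below; a real Arrow version would also need to handle nulls and return `Result` for length mismatches.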
I think if we did this right, it wouldn't be a lot of new code; we could just refactor the existing `filter`/`take` implementations. For example, I would expect that the `filter` kernel would then devolve into something like:
```rust
fn filter(..) {
    match data_type {
        DataType::Int8 => Int8Builder::with_capacity(...)
            .append_filter(input, filter)
            .build(),
        ...
    }
}
```

Additional context