Excessive Arc-clone in HashJoinStream with StringView on build-side #16206

@ctsk

Describe the bug

An unfortunate pattern in the hash join implementation leads to excessive Arc cloning. Assume the build side carries a string-view column as a payload, and let N be the number of batches seen on the build side:

  1. In the build phase, DataFusion concatenates the batches on the build side. The string-view column now holds references to at least N data buffers in a Vec.

  2. When constructing each output batch, the take implementation for string views clones the data buffer vector of the concatenated build-side column, thus incrementing the reference count of all N data buffers.
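
The two steps above can be modelled with a minimal, std-only sketch. Note that `ViewColumn`, `concat`, `take`, and `build_side` here are hypothetical stand-ins written for illustration; they are not the arrow-rs or DataFusion types, and the real view layout differs. The sketch only shows why cloning the buffer vector costs one atomic increment per buffer on every take.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a string-view column (NOT the arrow-rs type):
/// each view points into one of several shared data buffers.
struct ViewColumn {
    /// (buffer index, offset, length) for each value
    views: Vec<(usize, usize, usize)>,
    /// Reference-counted data buffers shared between columns
    buffers: Vec<Arc<[u8]>>,
}

impl ViewColumn {
    /// Step 1: concatenation keeps every input buffer and remaps the
    /// buffer index stored in each view.
    fn concat(cols: &[ViewColumn]) -> ViewColumn {
        let mut views = Vec::new();
        let mut buffers = Vec::new();
        for col in cols {
            let base = buffers.len();
            views.extend(col.views.iter().map(|&(b, o, l)| (base + b, o, l)));
            buffers.extend(col.buffers.iter().cloned());
        }
        ViewColumn { views, buffers }
    }

    /// Step 2: take copies the selected views but clones the whole buffer
    /// vector, which increments the reference count of every buffer.
    fn take(&self, indices: &[usize]) -> ViewColumn {
        ViewColumn {
            views: indices.iter().map(|&i| self.views[i]).collect(),
            buffers: self.buffers.clone(), // N Arc increments per call
        }
    }

    /// Resolve the i-th view to its bytes.
    fn value(&self, i: usize) -> &[u8] {
        let (b, o, l) = self.views[i];
        &self.buffers[b][o..o + l]
    }
}

/// Build `n` single-value batches (one buffer each) and concatenate them,
/// mimicking the build phase. The input batches are dropped on return, so
/// the result is the only owner of each buffer.
fn build_side(n: usize) -> ViewColumn {
    let batches: Vec<ViewColumn> = (0..n)
        .map(|i| {
            let bytes = format!("row-{i}").into_bytes();
            let len = bytes.len();
            ViewColumn {
                views: vec![(0, 0, len)],
                buffers: vec![Arc::from(bytes)],
            }
        })
        .collect();
    ViewColumn::concat(&batches)
}
```

In this model, calling `build_side(4).take(&[2, 0])` produces a two-row output that still bumps the reference count of all four buffers, even though only two of them are referenced by the selected views.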

To Reproduce

I noticed this issue when executing and profiling TPC-H query 18: roughly 3% of the runtime is spent cloning these Arcs.

Expected behavior

No response

Additional context

Labels: bug