Refactor distinct aggregate implementations to use common buffer #18348

Jefffrey · 2025-10-29T08:52:03Z

Which issue does this PR close?

Relates to Support complete distinct usage for aggregate expressions #2406

Rationale for this change

Make it easier to write distinct variations of aggregate functions by refactoring their common code together; specifically, the code that maintains the complete set of distinct primitive values, which was duplicated across several functions.

What changes are included in this PR?

Introduce a new GenericDistinctBuffer, which has methods similar to Accumulator for managing an internal HashSet of values. Implementations like percentile_cont and sum can use it internally and only need to implement their own evaluate functions.
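To illustrate the shape of such a buffer, here is a minimal std-only sketch. It is not the code in this PR: `DistinctBuffer`, its method signatures, and the use of plain slices instead of Arrow arrays are all assumptions made to keep the example self-contained.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Simplified, hypothetical stand-in for a shared distinct buffer.
/// It only tracks the set of distinct values seen so far.
#[derive(Debug)]
pub struct DistinctBuffer<T: Eq + Hash + Copy> {
    values: HashSet<T>,
}

impl<T: Eq + Hash + Copy> DistinctBuffer<T> {
    pub fn new() -> Self {
        Self { values: HashSet::new() }
    }

    /// Add a batch of raw input values; `None` models a null and is skipped.
    pub fn update_batch(&mut self, batch: &[Option<T>]) {
        self.values.extend(batch.iter().flatten().copied());
    }

    /// Export the distinct values as intermediate state (order is unspecified).
    pub fn state(&self) -> Vec<T> {
        self.values.iter().copied().collect()
    }

    /// Merge intermediate states produced by other buffers.
    pub fn merge_batch(&mut self, states: &[Vec<T>]) {
        for state in states {
            self.values.extend(state.iter().copied());
        }
    }

    /// Read access for callers that implement their own `evaluate`.
    pub fn distinct_values(&self) -> &HashSet<T> {
        &self.values
    }
}
```

Two partial buffers can each call `update_batch`, then one merges the other's `state()`; the result is the union of the distinct values from both.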

Are these changes tested?

Existing tests.

Are there any user-facing changes?

No.

Jefffrey · 2025-10-29T08:53:33Z

datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs

It would be nice to pull PrimitiveDistinctCountAccumulator into the deduplication as well. However, it is specialized for types that don't need to hash through Hashable (i.e. non-float types), and I think there might be a performance hit if I force them to use Hashable 🤔
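For context on why floats need a wrapper at all: `f64` implements neither `Eq` nor `Hash`, so it cannot go into a `HashSet` directly, while integers can. The sketch below is a hypothetical illustration only. `HashableF64` is an invented name, and comparing by bit pattern is an assumption about how such a wrapper might behave, not a description of DataFusion's `Hashable`.

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical wrapper that compares and hashes an f64 by its bit pattern.
#[derive(Debug, Clone, Copy)]
pub struct HashableF64(pub f64);

impl PartialEq for HashableF64 {
    fn eq(&self, other: &Self) -> bool {
        self.0.to_bits() == other.0.to_bits()
    }
}

impl Eq for HashableF64 {}

impl Hash for HashableF64 {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.to_bits().hash(state);
    }
}

/// Integers already implement `Eq + Hash`, so no wrapper is needed.
pub fn count_distinct_ints(values: &[i64]) -> usize {
    values.iter().collect::<HashSet<_>>().len()
}

/// Floats must be wrapped on every insert before they can be hashed.
pub fn count_distinct_floats(values: &[f64]) -> usize {
    values
        .iter()
        .map(|v| HashableF64(*v))
        .collect::<HashSet<_>>()
        .len()
}
```

Under this bit-pattern assumption, `0.0` and `-0.0` count as two distinct values and identical NaN bit patterns count as one. Whether routing integer types through such a wrapper costs anything measurable would need benchmarking; the sketch only shows the extra layer involved.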

Jefffrey · 2025-10-29T08:56:05Z

datafusion/functions-aggregate-common/src/utils.rs

+/// `merge_batch` and a `Vec` of `ArrayRef` that are converted to scalar values
+/// in the final evaluation step so that we avoid expensive conversions and
+/// allocations during `update_batch`.
+pub struct GenericDistinctBuffer<T: ArrowPrimitiveType> {


Main implementation here. I toyed with the idea of making this implement Accumulator and having the different functions (like median and percentile_cont) provide their evaluate logic as a closure, but it got a bit messy. For now, they delegate their state/update_batch/merge_batch to this inner struct and then grab the final set of distinct values to do their own evaluate.
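The delegation pattern described here can be sketched roughly as follows. This is a std-only illustration, not the PR's code: `DistinctI64Buffer`, `SimpleAccumulator`, `DistinctSum`, and `DistinctMedian` are hypothetical names, and the trait is a deliberately reduced stand-in for DataFusion's `Accumulator`.

```rust
use std::collections::HashSet;

/// Minimal, hypothetical stand-in for the shared inner buffer.
#[derive(Default)]
pub struct DistinctI64Buffer {
    values: HashSet<i64>,
}

impl DistinctI64Buffer {
    fn update_batch(&mut self, batch: &[i64]) {
        self.values.extend(batch.iter().copied());
    }
    fn state(&self) -> Vec<i64> {
        self.values.iter().copied().collect()
    }
    fn merge_batch(&mut self, state: &[i64]) {
        self.values.extend(state.iter().copied());
    }
}

/// Hypothetical, reduced accumulator interface used only for this sketch.
pub trait SimpleAccumulator {
    type Output;
    fn update_batch(&mut self, batch: &[i64]);
    fn state(&self) -> Vec<i64>;
    fn merge_batch(&mut self, state: &[i64]);
    fn evaluate(&self) -> Self::Output;
}

/// Distinct sum: everything except `evaluate` is delegated to the buffer.
#[derive(Default)]
pub struct DistinctSum {
    inner: DistinctI64Buffer,
}

impl SimpleAccumulator for DistinctSum {
    type Output = i64;
    fn update_batch(&mut self, batch: &[i64]) { self.inner.update_batch(batch) }
    fn state(&self) -> Vec<i64> { self.inner.state() }
    fn merge_batch(&mut self, state: &[i64]) { self.inner.merge_batch(state) }
    fn evaluate(&self) -> i64 {
        self.inner.values.iter().sum()
    }
}

/// Distinct median: same delegation, different `evaluate`.
#[derive(Default)]
pub struct DistinctMedian {
    inner: DistinctI64Buffer,
}

impl SimpleAccumulator for DistinctMedian {
    type Output = Option<f64>;
    fn update_batch(&mut self, batch: &[i64]) { self.inner.update_batch(batch) }
    fn state(&self) -> Vec<i64> { self.inner.state() }
    fn merge_batch(&mut self, state: &[i64]) { self.inner.merge_batch(state) }
    fn evaluate(&self) -> Option<f64> {
        let mut sorted: Vec<i64> = self.inner.values.iter().copied().collect();
        sorted.sort_unstable();
        let n = sorted.len();
        if n == 0 {
            return None; // no input values
        }
        let mid = n / 2;
        Some(if n % 2 == 1 {
            sorted[mid] as f64
        } else {
            (sorted[mid - 1] as f64 + sorted[mid] as f64) / 2.0
        })
    }
}
```

The three delegating methods are repeated in each accumulator, which is the boilerplate the closure-based alternative would have removed; the trade-off is that each function keeps full control over its own `evaluate` and output type.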

Refactor distinct aggregate implementations to use common buffer

3c389c0

github-actions bot added the functions Changes to functions implementation label Oct 29, 2025


Jefffrey marked this pull request as ready for review October 29, 2025 09:13

alamb mentioned this pull request Nov 4, 2025

Andrew Lamb Weekly-ish Open Source plan - 2025-11-03 #18486

Open

47 tasks
