Skip to content

Conversation

@Jefffrey
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Make it easier to write distinct variations of aggregate functions be refactoring some of the common code together; specifically how they handle maintaining the complete set of distinct primitive values, as this code was duplicated across different functions.

What changes are included in this PR?

Introduce new GenericDistinctBuffer which has methods similar to Accumulator to manage an internal HashSet of values, so implementations like percentile_cont and sum can use it internally and only implement their own evaluate functions.

Are these changes tested?

Existing tests.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the functions Changes to functions implementation label Oct 29, 2025
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if I can pull in PrimitiveDistinctCountAccumulator to the deduplication as well, however it is specialized for types which don't need to hash through Hashable (aka non-float types) and I think there might be a performance hit if I try force them to use Hashable 🤔

/// `merge_batch` and a `Vec` of `ArrayRef` that are converted to scalar values
/// in the final evaluation step so that we avoid expensive conversions and
/// allocations during `update_batch`.
pub struct GenericDistinctBuffer<T: ArrowPrimitiveType> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main implementation here; I toyed with the idea of making this implement Accumulator and have the different functions (like median and percentile_cont) provide their evaluate logic as a closure but it got a bit messy; so for now they delegate their state/update_batch/merge_batch to this inner struct, which allows them to grab the final set of distinct values for them to do their own evaluate

@Jefffrey Jefffrey marked this pull request as ready for review October 29, 2025 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant