Skip to content

Conversation

@maniloya
Copy link
Contributor

@maniloya maniloya commented Nov 25, 2025

Description

Introduces a new query optimization that combines multiple approx_distinct()
function calls on distinct expressions of the same type into a single efficient
operation using set_agg and array operations.

Controlled by session property optimize_multiple_approx_distinct_on_same_type
(default: false).

Motivation and Context

Multiple approx_distinct() calls on different columns of the same type in a single query currently execute as separate aggregations, leading to redundant data processing and increased resource usage. This optimization combines these calls into a single efficient operation, similar to the existing CombineApproxPercentileFunctions optimizer.

The optimization uses set_agg with array operations (array_constructor, array_transpose, transform) to consolidate multiple distinct counts, reducing aggregation overhead and improving query performance when multiple distinct counts are needed.

Impact

  • Added new session property: optimize_multiple_approx_distinct_on_same_type (default: false)
  • Users must explicitly enable this optimization via session property

Test Plan

Added comprehensive testing in TestCombineApproxDistinctFunctions.java

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

== NO RELEASE NOTE ==

@sourcery-ai
Copy link
Contributor

sourcery-ai bot commented Nov 25, 2025

Reviewer's Guide

Adds a new logical optimizer rule CombineApproxDistinctFunctions that, when enabled via a new session/config property, rewrites multiple approx_distinct calls on same-typed expressions into a single set_agg + array-based aggregation, along with corresponding wiring and tests.

Sequence diagram for approx_distinct optimization during query planning

sequenceDiagram
  actor User
  participant Session
  participant SystemSessionProperties
  participant QueryPlanner
  participant PlanOptimizers
  participant IterativeOptimizer
  participant CombineApproxDistinctFunctions

  User->>QueryPlanner: submit SQL with multiple approx_distinct
  QueryPlanner->>Session: create session
  Session-->>QueryPlanner: session with properties

  QueryPlanner->>SystemSessionProperties: isCombineApproxDistinctEnabled(session)
  SystemSessionProperties-->>QueryPlanner: enabled or disabled

  alt optimization disabled
    QueryPlanner->>PlanOptimizers: build logical plan without approx_distinct rewrite
    PlanOptimizers-->>QueryPlanner: optimizer pipeline
    QueryPlanner-->>User: plan uses separate approx_distinct aggregations
  else optimization enabled
    QueryPlanner->>PlanOptimizers: build logical plan with optimizer rules
    PlanOptimizers->>IterativeOptimizer: add CombineApproxDistinctFunctions rule
    IterativeOptimizer-->>PlanOptimizers: optimizer with rules

    QueryPlanner->>IterativeOptimizer: optimize AggregationNode
    IterativeOptimizer->>CombineApproxDistinctFunctions: match pattern on AggregationNode
    CombineApproxDistinctFunctions-->>IterativeOptimizer: PATTERN matches (multiple approx_distinct of same type)

    IterativeOptimizer->>CombineApproxDistinctFunctions: apply(aggregationNode, captures, context)
    CombineApproxDistinctFunctions->>CombineApproxDistinctFunctions: group approx_distinct by type and properties
    CombineApproxDistinctFunctions->>CombineApproxDistinctFunctions: build set_agg and array-based expressions
    CombineApproxDistinctFunctions-->>IterativeOptimizer: rewritten plan subtree

    IterativeOptimizer-->>QueryPlanner: optimized plan
    QueryPlanner-->>User: plan with combined set_agg approx_distinct
  end
Loading

Class diagram for CombineApproxDistinctFunctions optimizer and related configuration

classDiagram
class FeaturesConfig {
  - boolean pushRemoteExchangeThroughGroupId
  - boolean isOptimizeMultipleApproxPercentileOnSameFieldEnabled
  - boolean isOptimizeMultipleApproxDistinctOnSameTypeEnabled
  + boolean isOptimizeMultipleApproxDistinctOnSameTypeEnabled()
  + FeaturesConfig setOptimizeMultipleApproxDistinctOnSameTypeEnabled(boolean enabled)
}

class SystemSessionProperties {
  <<final>>
  + static String LEAF_NODE_LIMIT_ENABLED
  + static String PUSH_REMOTE_EXCHANGE_THROUGH_GROUP_ID
  + static String OPTIMIZE_MULTIPLE_APPROX_PERCENTILE_ON_SAME_FIELD
  + static String OPTIMIZE_MULTIPLE_APPROX_DISTINCT_ON_SAME_TYPE
  + static boolean isCombineApproxPercentileEnabled(Session session)
  + static boolean isCombineApproxDistinctEnabled(Session session)
}

class PlanOptimizers {
  - Metadata metadata
  - RuleStatsRoyalty ruleStats
  - StatsCalculator statsCalculator
  - CostCalculator estimatedExchangesCostCalculator
  + PlanOptimizers(Metadata metadata, RuleStatsRoyalty ruleStats, StatsCalculator statsCalculator, CostCalculator estimatedExchangesCostCalculator)
}

class IterativeOptimizer {
  - Metadata metadata
  - RuleStatsRoyalty ruleStats
  - StatsCalculator statsCalculator
  - CostCalculator costCalculator
  - Set~Rule_AggregationNode~ rules
}

class Rule_AggregationNode {
  <<interface>>
  + Pattern_AggregationNode getPattern()
  + boolean isEnabled(Session session)
  + Result apply(AggregationNode node, Captures captures, Context context)
}

class CombineApproxDistinctFunctions {
  - String APPROX_DISTINCT
  - String SET_AGG
  - String ARRAY_CONSTRUCTOR
  - String TRANSFORM
  - String ARRAY_TRANSPOSE
  - String ARRAY_DISTINCT
  - String REMOVE_NULLS
  - String CARDINALITY
  - String ELEMENT_AT
  - int ARRAY_SIZE_LIMIT
  - FunctionAndTypeManager functionAndTypeManager
  - static Pattern_AggregationNode PATTERN
  + CombineApproxDistinctFunctions(FunctionAndTypeManager functionAndTypeManager)
  + Pattern_AggregationNode getPattern()
  + boolean isEnabled(Session session)
  + Result apply(AggregationNode aggregationNode, Captures captures, Context context)
  - static boolean hasMultipleApproxDistinct(AggregationNode aggregation)
  - static boolean aggregationCanMerge(AggregationNode_Aggregation agg1, AggregationNode_Aggregation agg2)
  - static List~List~AggregationNode_Aggregation~~ createMergeableAggregations(List~AggregationNode_Aggregation~ candidateAggregations)
  - CallExpression createArrayExpression(List~AggregationNode_Aggregation~ aggregations)
  - AggregationNode_Aggregation createSetAggAggregation(List~AggregationNode_Aggregation~ candidateList, VariableReferenceExpression arrayVariableReference)
}

class AggregationNode {
  - Map~VariableReferenceExpression, AggregationNode_Aggregation~ aggregations
  + Map~VariableReferenceExpression, AggregationNode_Aggregation~ getAggregations()
}

class AggregationNode_Aggregation {
  + CallExpression getCall()
  + Optional~RowExpression~ getFilter()
  + Optional~OrderingScheme~ getOrderBy()
  + boolean isDistinct()
  + Optional~VariableReferenceExpression~ getMask()
}

class FunctionAndTypeManager {
  + FunctionAndTypeManager()
}

class Session {}
class Pattern_AggregationNode {}
class Captures {}
class Context {
  + VariableAllocator getVariableAllocator()
  + PlanNodeIdAllocator getIdAllocator()
}
class VariableAllocator {
  + VariableReferenceExpression newVariable(RowExpression expression)
  + VariableReferenceExpression newVariable(String name, Type type)
}
class PlanNodeIdAllocator {
  + PlanNodeId getNextId()
}

FeaturesConfig --> SystemSessionProperties : provides defaults
SystemSessionProperties ..> Session : uses
PlanOptimizers --> IterativeOptimizer : creates
IterativeOptimizer --> CombineApproxDistinctFunctions : manages rule
CombineApproxDistinctFunctions ..|> Rule_AggregationNode
CombineApproxDistinctFunctions --> FunctionAndTypeManager
CombineApproxDistinctFunctions --> AggregationNode
AggregationNode --> AggregationNode_Aggregation
Context --> VariableAllocator
Context --> PlanNodeIdAllocator
Loading

Flow diagram for AggregationNode rewrite by CombineApproxDistinctFunctions

flowchart LR
  A["AggregationNode<br/>approx_distinct(e1)<br/>approx_distinct(e2)<br/>..."] --> B["Collect approx_distinct aggregations<br/>with single argument"]

  B --> C["Group by argument type<br/>and aggregation properties<br/>(mask, filter, orderBy, distinct)"]

  C --> D["For each group<br/>create mergeable lists<br/>(createMergeableAggregations)"]

  D --> E["Filter candidate lists<br/>size > 1 and < ARRAY_SIZE_LIMIT"]

  E -->|no candidate lists| Z["Return original AggregationNode"]

  E -->|candidate lists exist| F["For each candidate list<br/>create array expression<br/>array_constructor(e1, e2, ...)"]

  F --> G["Add source ProjectNode<br/>assign array variable<br/>arrayVar <- array_constructor(...)"]

  G --> H["Replace approx_distinct<br/>aggregations with<br/>set_agg(arrayVar)"]

  H --> I["Create intermediate ProjectNode<br/>ads_array <- set_agg(arrayVar)<br/>transposed <- array_transpose(ads_array)<br/>resultArray <- transform(<br/> transposed,<br/> x -> cardinality(<br/>  array_distinct(<br/>   remove_nulls(x)<br/>  )<br/> )<br/>)"]

  I --> J["Create output ProjectNode<br/>for each original approx_distinct<br/>var_i <- coalesce(<br/> element_at(resultArray, i),<br/> 0<br/>)"]

  J --> K["Final plan subtree<br/>ProjectNode(coalesce element_at)<br/>  -> ProjectNode(transform)<br/>    -> AggregationNode(set_agg)<br/>      -> Original source with<br/>         extra array constructor project"]

  Z --> L["Output plan"]
  K --> L["Output plan"]
Loading

File-Level Changes

Change Details Files
Introduce session/config flag to control combining multiple approx_distinct calls on same-typed expressions.
  • Add boolean feature flag isOptimizeMultipleApproxDistinctOnSameTypeEnabled to FeaturesConfig with config key optimizer.optimize-multiple-approx-distinct-on-same-type and default false.
  • Expose new session property OPTIMIZE_MULTIPLE_APPROX_DISTINCT_ON_SAME_TYPE with description and default from FeaturesConfig, and add accessor isCombineApproxDistinctEnabled(Session).
presto-main-base/src/main/java/com/facebook/presto/sql/analyzer/FeaturesConfig.java
presto-main-base/src/main/java/com/facebook/presto/SystemSessionProperties.java
Wire new CombineApproxDistinctFunctions rule into the planner optimizer pipeline.
  • Register a new IterativeOptimizer instance in PlanOptimizers that runs the CombineApproxDistinctFunctions rule using the FunctionAndTypeManager.
  • Ensure the rule is placed alongside existing approx_percentile combination optimization rules.
presto-main-base/src/main/java/com/facebook/presto/sql/planner/PlanOptimizers.java
Implement CombineApproxDistinctFunctions rule to rewrite multiple approx_distinct aggregations on expressions of the same type into a single set_agg-based aggregation plus projection.
  • Match aggregation nodes containing more than one approx_distinct call and only run when the new session property is enabled.
  • Group approx_distinct aggregations by argument type and compatibility (same mask, filter, order-by, distinct flag), ignore duplicates, and enforce an upper bound (ARRAY_SIZE_LIMIT) on the number of merged expressions.
  • Insert a source Project to build array_constructor over the original arguments, then replace grouped approx_distinct aggregations with a single set_agg over that array.
  • Add an intermediate Project that applies array_transpose + transform with a lambda performing remove_nulls, array_distinct, and cardinality to compute per-position distinct counts.
  • Add a final Project that maps each original approx_distinct output to coalesce(element_at(transform_result, index), 0), and preserve non-merged aggregations and passthrough variables.
  • Use FunctionAndTypeManager-based call construction and proper type wiring for arrays and BIGINT counts.
presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/CombineApproxDistinctFunctions.java
Add rule tests covering when CombineApproxDistinctFunctions should and should not fire, including shape of the rewritten plan.
  • Add BaseRuleTest-based TestCombineApproxDistinctFunctions that initializes RuleTester with SqlInvokedFunctionsPlugin.
  • Test that the rule does not fire when the session property is disabled or only a single approx_distinct is present, or when argument types differ, aggregations are duplicates, exceed ARRAY_SIZE_LIMIT, or other aggregations are present in ways that prevent merging.
  • Test positive cases where two or more approx_distinct on different expressions of the same type are combined, asserting the exact Project/Aggregation/Project/Values structure and expressions (array_constructor, set_agg, array_transpose, transform, array_distinct, remove_nulls, cardinality, element_at, coalesce).
  • Include a mixed-type aggregation test that ensures no rewrite occurs when approx_distinct calls span multiple argument types.
presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestCombineApproxDistinctFunctions.java

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@maniloya maniloya changed the title Add CombineApproxDistinctFunctions optimizer to combine multiple approx_distinct calls feat: Add CombineApproxDistinctFunctions optimizer to combine multiple approx_distinct calls Nov 25, 2025
Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes and found some issues that need to be addressed.

  • In CombineApproxDistinctFunctions, avoid relying on displayName string comparisons and hardcoded function name constants (e.g., "approx_distinct", "set_agg", "array_constructor") and instead resolve and compare functions via function handles or metadata to make the rule more robust against renames and aliases.
  • The hardcoded ARRAY_SIZE_LIMIT = 254 couples this rule to an internal detail of ArrayConstructor; consider exposing or reusing a shared constant (or deriving the limit via metadata) so the rule cannot silently diverge from the actual engine limit.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In CombineApproxDistinctFunctions, avoid relying on displayName string comparisons and hardcoded function name constants (e.g., "approx_distinct", "set_agg", "array_constructor") and instead resolve and compare functions via function handles or metadata to make the rule more robust against renames and aliases.
- The hardcoded ARRAY_SIZE_LIMIT = 254 couples this rule to an internal detail of ArrayConstructor; consider exposing or reusing a shared constant (or deriving the limit via metadata) so the rule cannot silently diverge from the actual engine limit.

## Individual Comments

### Comment 1
<location> `presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/CombineApproxDistinctFunctions.java:319-320` </location>
<code_context>
+                new ProjectNode(context.getIdAllocator().getNextId(),
+                        aggregationNode.getSource(), sourceProjectAssignments.build()),
+                aggregations.build(),
+                aggregationNode.getGroupingSets(),
+                ImmutableList.of(),
+                aggregationNode.getStep(),
+                aggregationNode.getHashVariable(),
</code_context>

<issue_to_address>
**issue (bug_risk):** The new AggregationNode loses the original preGroupedVariables, which can affect correctness and performance.

Here, the new AggregationNode passes `ImmutableList.of()` for `preGroupedVariables` instead of `aggregationNode.getPreGroupedVariables()`. This drops the pre-grouping contract, which can degrade performance and, in some cases, change query semantics. Please propagate the original `preGroupedVariables` from the input node.
</issue_to_address>

### Comment 2
<location> `presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/CombineApproxDistinctFunctions.java:282-291` </location>
<code_context>
+            Map<VariableReferenceExpression, RowExpression> elementAtMap =
</code_context>

<issue_to_address>
**suggestion (bug_risk):** The output ProjectNode may change the relative ordering of output variables compared to the original AggregationNode.

Assignments for the combined approx_distinct variables are added first via `putAll(elementAtMap)` in the loop over `candidateAggregationLists`, and the remaining variables are added only afterward in a separate loop. Depending on the order of `candidateAggregationLists`, this can change the output variable order relative to `aggregationNode.getOutputVariables()`, which some consumers rely on. To avoid this, build `outputProjectAssignments` by iterating over `aggregationNode.getOutputVariables()` in order and, for each variable, either use the transformed expression (for combined aggregations) or an identity expression (for untouched ones).
</issue_to_address>

### Comment 3
<location> `presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestCombineApproxDistinctFunctions.java:194` </location>
<code_context>
+    }
+
+    @Test
+    public void testArrayLimit()
+    {
+        tester().assertThat(new CombineApproxDistinctFunctions(getMetadata().getFunctionAndTypeManager()))
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding tests for non-global aggregations and filtered/masked approx_distinct to cover aggregationCanMerge logic

Current tests cover array size limits and duplicate expressions, but not the specific conditions in `aggregationCanMerge` (mask, filter, orderBy, distinct) or non-global aggregations.

Please add:
- A case with two approx_distinct calls on compatible expressions where one differs by FILTER/mask/orderBy, asserting the rule does not fire or they are not merged.
- A case with a GROUP BY key to confirm grouping is preserved while compatible approx_distincts are combined.

These will directly validate the edge cases and grouping behavior guarded by `aggregationCanMerge`.

Suggested implementation:

```java
    @Test

                .on(p -> p.aggregation(af -> {
                    VariableReferenceExpression col = p.variable("col", VARCHAR);
                    af.globalGrouping()
                            .addAggregation(p.variable("approx_distinct_1"), p.rowExpression("approx_distinct(col)", AS_DOUBLE))
                            .addAggregation(p.variable("approx_distinct_2"), p.rowExpression("approx_distinct(col)", AS_DOUBLE))
                            .source(p.values(col));
                }))
                .doesNotFire();
    }

    @Test
    public void testApproxDistinctWithFilterAndOrderByNotMerged()
    {
        tester().assertThat(new CombineApproxDistinctFunctions(getMetadata().getFunctionAndTypeManager()))
                .on(p -> p.aggregation(af -> {
                    VariableReferenceExpression col = p.variable("col", VARCHAR);
                    VariableReferenceExpression col2 = p.variable("col2", VARCHAR);
                    af.globalGrouping()
                            .addAggregation(
                                    p.variable("approx_distinct_plain"),
                                    p.rowExpression("approx_distinct(col)", AS_DOUBLE))
                            .addAggregation(
                                    p.variable("approx_distinct_filtered"),
                                    p.rowExpression("approx_distinct(col) FILTER (WHERE col > 'a')", AS_DOUBLE))
                            .addAggregation(
                                    p.variable("approx_distinct_ordered"),
                                    p.rowExpression("approx_distinct(col ORDER BY col2)", AS_DOUBLE))
                            .source(p.values(col, col2));
                }))
                // The rule should not fire because aggregationCanMerge forbids merging when
                // FILTER or ORDER BY is present on one of the aggregations.
                .doesNotFire();
    }

    @Test
    public void testNonGlobalGroupByApproxDistinctsAreCombined()
    {
        tester().assertThat(new CombineApproxDistinctFunctions(getMetadata().getFunctionAndTypeManager()))
                .on(p -> p.aggregation(af -> {
                    VariableReferenceExpression group = p.variable("group_key", BIGINT);
                    VariableReferenceExpression col = p.variable("col", VARCHAR);
                    af.groupingKeys(group)
                            .addAggregation(
                                    p.variable("approx_distinct_1"),
                                    p.rowExpression("approx_distinct(col)", AS_DOUBLE))
                            .addAggregation(
                                    p.variable("approx_distinct_2"),
                                    p.rowExpression("approx_distinct(col)", AS_DOUBLE))
                            .source(p.values(group, col));
                }))
                // This assertion is intended to validate that:
                // 1) The GROUP BY key is preserved.
                // 2) Compatible approx_distinct aggregations are combined into a single aggregation.
                //
                // The exact shape of the PlanMatchPattern may need to be adjusted to match
                // the helpers and conventions used elsewhere in this test class.
                .matches(aggregation(
                        // grouping keys
                        ImmutableList.of(ImmutableList.of("group_key")),
                        // expected single approx_distinct aggregation on "col"
                        ImmutableMap.of("approx_distinct", functionCall("approx_distinct", ImmutableList.of("col"))),
                        // underlying source still provides group_key and col
                        values("group_key", "col")));
    }

```

1) Ensure the following static imports (or equivalent) exist at the top of TestCombineApproxDistinctFunctions.java, or adjust to match the existing assertion helpers:
   - import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.aggregation;
   - import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.functionCall;
   - import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.values;
   - import com.google.common.collect.ImmutableList;
   - import com.google.common.collect.ImmutableMap;

2) If this test class uses convenience wrappers for aggregation patterns (for example, singleGroupingSet(...) or similar), you may want to rewrite the .matches(...) call in testNonGlobalGroupByApproxDistinctsAreCombined() to use those helpers so it is consistent with other tests.

3) If the PlanMatchPattern.aggregation signature in this project differs from the one used above, adjust the argument types/order accordingly while preserving the intent:
   - one grouping set on "group_key"
   - exactly one approx_distinct aggregation on "col"
   - a source node that still has "group_key" and "col" symbols.

4) If this project represents FILTER and ORDER BY on aggregations differently in row expressions, update the expressions in testApproxDistinctWithFilterAndOrderByNotMerged() to match the existing style, keeping the semantics:
   - one plain approx_distinct(col)
   - one approx_distinct(col) with a filter
   - one approx_distinct(col ORDER BY col2).
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@maniloya maniloya force-pushed the mani-combine-approx-distinct branch 2 times, most recently from d8499d5 to e56525f Compare November 25, 2025 14:37
}
CallExpression expression1 = aggregation1.getCall();
CallExpression expression2 = aggregation2.getCall();
if (expression1.getArguments().size() != 1 || expression2.getArguments().size() != 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values of second arg - error tolenrance - should also match. Maybe best to check here itself. Just the exp1.getArg(2).equals(expr2.getArg(2))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the code to handle 2 argument approx_distinct(). Thanks

}
Type type1 = expression1.getArguments().get(0).getType();
Type type2 = expression2.getArguments().get(0).getType();
return type1.equals(type2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for single use temp vars

ImmutableMap.Builder<VariableReferenceExpression, AggregationNode.Aggregation> aggregations = ImmutableMap.builder();

List<AggregationNode.Aggregation> approxDistinct = aggregationNode.getAggregations().values().stream().filter(
x -> x.getCall().getDisplayName().equals(APPROX_DISTINCT) && x.getCall().getArguments().size() == 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should do it for the 2 arg version as well - actually more important because when the user gives smaller error tolerance it's more expensive so merging is better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 2 args version (with error tolerance), optimizer is only used when the error tolerance matches for all multiple and the type is same

approx_distinct(c1, e1)
approx_distinct(c2, e2)
approx_distinct(c3, e3)

Rule is applied only if type of c1,c2,c3 is same and e1==e2==e3 is this correct ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the code, thanks

});

List<List<AggregationNode.Aggregation>> candidateAggregationLists =
candidateLists.build().stream().filter(x -> x.size() > 1 && x.size() < ARRAY_SIZE_LIMIT).collect(Collectors.toList());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm - this is artificial - java only issue. We can remove it now. We have soe tool generated queries with lot more than 256 approx_distinct in a single query!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was trying to follow CombineApproxPercentiles optimizer, will remove this limit. Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the code thanks.

@kaikalur
Copy link
Contributor

Description

Introduces a new query optimization that combines multiple approx_distinct() function calls on distinct expressions of the same type into a single efficient operation using set_agg and array operations.

Controlled by session property optimize_multiple_approx_distinct_on_same_type (default: false).

Motivation and Context

Multiple approx_distinct() calls on different columns of the same type in a single query currently execute as separate aggregations, leading to redundant data processing and increased resource usage. This optimization combines these calls into a single efficient operation, similar to the existing CombineApproxPercentileFunctions optimizer.

The optimization uses set_agg with array operations (array_constructor, array_transpose, transform) to consolidate multiple distinct counts, reducing aggregation overhead and improving query performance when multiple distinct counts are needed.

Impact

  • Added new session property: optimize_multiple_approx_distinct_on_same_type (default: false)
  • Users must explicitly enable this optimization via session property

Test Plan

Added comprehensive testing in TestCombineApproxDistinctFunctions.java

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

== NO RELEASE NOTE ==

Mention that this is useful generally non-int type like strings and dates with generally low-cardinality distinct set

* To:
* <pre>
* - Project (coalesce(ads[1], 0), coalesce(ads[2], 0), coalesce(ads[3], 0))
* - Project (ads <- transform(transpose(ads_array), x->cardinality(array_distinct(remove_nulls(x)))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this transform. Just inline the cardinality(array_distinct(remove_nulls(x))) part like:

project (coalesce(cardinality(array_distinct(remove_nulls(a[1]))), 0), project coalesce(cardinality(array_distinct(remove_nulls(a[2]))), 0)

@kaikalur
Copy link
Contributor

#26694 26694

…ox_distinct calls, controlled by session parameter
@maniloya maniloya force-pushed the mani-combine-approx-distinct branch from e56525f to 53434d7 Compare November 25, 2025 16:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants