@mdashti mdashti commented Oct 3, 2025

Ticket(s) Closed

What

Implements filter aggregation support in Tantivy, enabling multiple filtered aggregations in a single query.

Why

Currently, there's no way to compute aggregations on different filtered subsets of documents in a single query. Users must run a separate query for each filter, which is inefficient. For example, computing "average price overall + average price for t-shirts + count of electronics" requires three separate queries.

Elasticsearch's filter aggregation solves this by creating a single bucket containing documents matching a query, with support for nested sub-aggregations. This is a common analytics pattern that Tantivy now supports!

How

Added a new FilterAggregation bucket aggregation type. The implementation:

  1. Accepts both query strings (parsed via QueryParser) and direct Query objects for custom query types
  2. Uses DocumentQueryEvaluator to evaluate filter queries per document during aggregation collection, avoiding separate query executions
  3. Extends aggregation collectors to receive SegmentReader references, so filter aggregations can create query weights and scorers per segment

Some Implementation Details:

  • FilterAggregation supports two modes:
    • FilterQuery::QueryString: Parsed using Tantivy's standard QueryParser
    • FilterQuery::Direct: Accepts Box<dyn Query> for custom query extensions
  • FilterSegmentCollector evaluates the filter query on each document collected by the main query
  • Documents matching the filter are counted and passed to sub-aggregation collectors
  • Results include doc_count and flattened sub-aggregation results
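
To make the collection flow above concrete, here is a minimal toy sketch of the idea: every document is visited once, each filter's predicate is evaluated on it, and matching documents feed that filter's count and one sub-aggregation (an average). The names `Doc`, `Filter`, `FilterBucket`, and `aggregate` are hypothetical stand-ins and not Tantivy's API; the PR itself evaluates real queries per segment rather than closures.

```rust
// Toy model only: these types are hypothetical stand-ins, not Tantivy's API.
#[derive(Debug, Clone)]
struct Doc {
    category: &'static str,
    price: f64,
}

/// Result of one filter bucket: the number of matching documents plus one
/// sub-aggregation (the average price of the matching documents).
#[derive(Debug, PartialEq)]
struct FilterBucket {
    doc_count: u64,
    avg_price: Option<f64>,
}

/// A named filter with a per-document predicate.
struct Filter {
    name: &'static str,
    matches: fn(&Doc) -> bool,
}

/// Visits every document once and feeds it to each filter whose predicate matches.
fn aggregate(docs: &[Doc], filters: &[Filter]) -> Vec<(&'static str, FilterBucket)> {
    let mut counts = vec![0u64; filters.len()];
    let mut sums = vec![0.0f64; filters.len()];
    for doc in docs {
        for (i, filter) in filters.iter().enumerate() {
            if (filter.matches)(doc) {
                counts[i] += 1;
                sums[i] += doc.price;
            }
        }
    }
    filters
        .iter()
        .enumerate()
        .map(|(i, filter)| {
            // An empty bucket has no average.
            let avg_price = if counts[i] > 0 {
                Some(sums[i] / counts[i] as f64)
            } else {
                None
            };
            (filter.name, FilterBucket { doc_count: counts[i], avg_price })
        })
        .collect()
}
```

Calling `aggregate` with an "all" filter, a "tshirts" filter, and an "electronics" filter would produce the three results from the motivating example in a single pass over the documents.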

Tests

Test suite with 20 tests covering:

  • Basic Filtering: Single filters, no matches, multiple independent filters
  • Query Types: Term queries, range queries, boolean queries, bool field queries
  • Nested Filters: 2-level nesting, deep nesting (4+ levels), multiple branches at each level
  • Sub-Aggregations: Terms aggregations, multiple metric aggregations
  • Edge Cases: Empty indexes, malformed queries, base query interaction
  • Custom Queries: Direct Query objects, serialization behavior, equality checks
  • Correctness: Validation against equivalent separate query execution

All tests use the assert_agg_results! macro for clean, consistent result validation with floating-point tolerance.
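
For readers unfamiliar with tolerance-based assertions, the following is a rough sketch of the general technique. It is not the PR's `assert_agg_results!` macro, whose definition is not shown here; `approx_eq` and `assert_approx_eq!` are hypothetical names, and the default tolerance of `1e-9` is an arbitrary assumption.

```rust
/// Hypothetical helper: true when `a` and `b` differ by at most `tol`.
fn approx_eq(a: f64, b: f64, tol: f64) -> bool {
    (a - b).abs() <= tol
}

/// Hypothetical macro (NOT the PR's `assert_agg_results!`): asserts that two
/// floats are equal within a tolerance, defaulting to 1e-9.
macro_rules! assert_approx_eq {
    ($left:expr, $right:expr) => {
        assert_approx_eq!($left, $right, 1e-9)
    };
    ($left:expr, $right:expr, $tol:expr) => {{
        let (l, r): (f64, f64) = ($left, $right);
        assert!(approx_eq(l, r, $tol), "{} is not within {} of {}", l, $tol, r);
    }};
}
```

Averages and sums computed in different orders can differ in the last few bits, which is why exact `assert_eq!` on floating-point aggregation results tends to be brittle.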

Collaborator

PSeitz commented Oct 21, 2025

The failing CI is a flaky test from me

Should be fixed now (after rebase)

Author

mdashti commented Oct 21, 2025

Thanks @PSeitz. I've also noticed and fixed it in #2723. Closed that PR.
Now, this PR is good for final review.


// Filter aggregation benchmarks

fn filter_agg_all_query(index: &Index) {
Contributor

Suggested change
- fn filter_agg_all_query(index: &Index) {
+ fn filter_agg_all_query_count_agg(index: &Index) {

Author

Fixed.

execute_agg(index, agg_req);
}

fn filter_agg_term_query(index: &Index) {
Contributor

Suggested change
- fn filter_agg_term_query(index: &Index) {
+ fn filter_agg_term_query_count_agg(index: &Index) {

Author

Fixed.

execute_agg(index, agg_req);
}

fn filter_agg_all_query_with_sub_agg(index: &Index) {
Contributor

Suggested change
- fn filter_agg_all_query_with_sub_agg(index: &Index) {
+ fn filter_agg_all_query_with_sub_aggs(index: &Index) {

Author

Fixed.

execute_agg(index, agg_req);
}

fn filter_agg_term_query_with_sub_agg(index: &Index) {
Contributor

Suggested change
- fn filter_agg_term_query_with_sub_agg(index: &Index) {
+ fn filter_agg_term_query_with_sub_aggs(index: &Index) {

Author

Fixed.

Author

@mdashti mdashti left a comment

@PSeitz-dd Thanks for the comments. I also noticed there was a bug and fixed it in 94bdd5d


@mdashti mdashti requested a review from PSeitz-dd October 22, 2025 09:33
fn parse_query(&self, schema: &Schema) -> crate::Result<Box<dyn Query>> {
    match &self.query {
        FilterQuery::QueryString(query_str) => {
            let tokenizer_manager = TokenizerManager::default();
Collaborator

@PSeitz PSeitz Oct 23, 2025

The default tokenizer manager will fail for any fields with custom tokenizers. We'll need a mechanism to pass the TokenizerManager in there.
Probably the same way we pass the aggregation limits; we could put them both in an AggContextParams struct or similar.

    fn for_segment(
        &self,
        segment_local_id: crate::SegmentOrdinal,
        reader: &crate::SegmentReader,
    ) -> crate::Result<Self::Child> {
        AggregationSegmentCollector::from_agg_req_and_reader(
            &self.agg,
            reader,
            segment_local_id,
            &self.limits,
        )
    }
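
One possible shape for the suggestion above, as a sketch only: bundle the limits and the tokenizer registry into one context value that is passed wherever the limits are passed today. Every type here is a simplified stand-in defined for illustration and is not Tantivy's real `AggregationLimits` or `TokenizerManager`; `parse_filter`, `register`, and `contains` are hypothetical helpers, and only the name `AggContextParams` comes from the review comment.

```rust
use std::collections::HashSet;

// All types below are stand-ins for illustration; they are NOT Tantivy's types.

/// Stand-in for the aggregation limits that collectors already receive.
#[derive(Clone, Default)]
struct AggregationLimits {
    max_buckets: u32,
}

/// Stand-in tokenizer registry that only tracks tokenizer names.
#[derive(Clone)]
struct TokenizerManager {
    names: HashSet<String>,
}

impl Default for TokenizerManager {
    /// The default registry knows only the built-in "default" tokenizer.
    fn default() -> Self {
        let mut names = HashSet::new();
        names.insert("default".to_string());
        TokenizerManager { names }
    }
}

impl TokenizerManager {
    fn register(&mut self, name: &str) {
        self.names.insert(name.to_string());
    }

    fn contains(&self, name: &str) -> bool {
        self.names.contains(name)
    }
}

/// Bundles everything an aggregation needs from its caller, so that adding a
/// new dependency does not change every function signature again.
#[derive(Clone, Default)]
struct AggContextParams {
    limits: AggregationLimits,
    tokenizers: TokenizerManager,
}

/// Hypothetical stand-in for parsing a filter's query string: it fails when
/// the field's tokenizer is unknown to the registry supplied by the caller.
fn parse_filter(field_tokenizer: &str, ctx: &AggContextParams) -> Result<(), String> {
    if ctx.tokenizers.contains(field_tokenizer) {
        Ok(())
    } else {
        Err(format!("tokenizer not registered: {field_tokenizer}"))
    }
}
```

In this toy model, a context built from defaults rejects a field that uses a custom tokenizer, which mirrors the failure described in the comment; registering the tokenizer on the context the caller passes in makes the same call succeed.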

Author

Thanks for catching this. Used the default and forgot to pipe it through. Fixed it.

}

#[test]
pub fn test_set_default_field_integer() {
Collaborator

I think this was removed by accident

Author

Oops. Yeah. Fixed it.

@@ -0,0 +1,1013 @@
//! Test suite for Filter Aggregation
Collaborator

can you move this to the filter aggregation implementation?

Author

Done.

@mdashti mdashti force-pushed the paradedb/filter-agg-feature branch from 6fa68d6 to 42c9935 Compare October 23, 2025 20:58
@mdashti mdashti requested a review from PSeitz October 23, 2025 21:55
Author

@mdashti mdashti left a comment

@PSeitz Thanks for the comments. Please take another look.

/// Get the fast field names used by this aggregation (none for filter aggregation)
pub fn get_fast_field_names(&self) -> Vec<&str> {
    // Filter aggregation doesn't use fast fields directly
    vec![]
Author

Added further comments. IMO, it should be fixed with a broader change in a follow-up PR.

/// - Extension query types
///
/// Note: This variant cannot be serialized to JSON (only QueryString can be serialized)
CustomQuery(Box<dyn SerializableQuery>),
Collaborator

Why do we use SerializableQuery when the query cannot be serialized?

Collaborator

I think a query constructor would be more suitable here than de/serializing runtime objects, which may carry state.
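
As a rough illustration of what a "query constructor" could look like, the sketch below keeps only plain, stateless parameters in the aggregation request and builds the runtime query object on demand. This is one interpretation of the comment, not the design the PR adopted; `Query`, `QueryBuilder`, `RangeQueryBuilder`, and `RangeQuery` are hypothetical stand-ins and not Tantivy's traits or types.

```rust
/// Stand-in for a runtime query object; hypothetical, not Tantivy's `Query`.
trait Query {
    fn matches(&self, value: u64) -> bool;
}

/// A constructor holds only plain parameters and creates a fresh query each
/// time, so the request itself never carries runtime state.
trait QueryBuilder {
    fn build(&self) -> Box<dyn Query>;
}

/// Plain data: trivially cloneable and comparable.
#[derive(Debug, Clone, PartialEq)]
struct RangeQueryBuilder {
    min: u64,
    max: u64,
}

/// The runtime object produced by the builder.
struct RangeQuery {
    min: u64,
    max: u64,
}

impl Query for RangeQuery {
    fn matches(&self, value: u64) -> bool {
        (self.min..=self.max).contains(&value)
    }
}

impl QueryBuilder for RangeQueryBuilder {
    fn build(&self) -> Box<dyn Query> {
        Box::new(RangeQuery { min: self.min, max: self.max })
    }
}
```

Because the builder is plain data, cloning or comparing an aggregation request does not require cloning or comparing a live query object.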

//
// This limitation exists because:
// - Query::weight() is called during execution, not during planning
// - The fallback decision is made per-segment based on field configuration
Collaborator

I think the decision depends on the schema, which is not segment-specific.

Development

Successfully merging this pull request may close these issues.

Add Elasticsearch Filter Aggregation Support

3 participants