[Feature Request] terms query's term lookup should be able to efficiently handle 100k+ (or 1M+) terms #12341

@msfroh

Description

Is your feature request related to a problem? Please describe

I recently read this blog post, where the author claims a 10x speedup on a large terms query by encoding the field values as a roaring bitmap.

I believe that part of the improvement comes from the use of doc values to post-filter hits that come from a lead iterator, which OpenSearch now supports (starting with 2.12) thanks to @harshavamsi's changes in #11209 to support IndexOrDocValuesQuery for all numeric query types. (Behind the scenes, Lucene implements the DV query using a LongHashSet, which I think should perform similarly to RoaringBitmap.)

The more interesting part (IMO) is that the roaring bitmap of numeric terms gets created on the client and sent as a base64-encoded bitset, where it's used as the doc value filter.
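To make the "base64-encoded bitset" idea concrete, here is a minimal sketch of the client side. It is an assumption-laden illustration, not the blog post's code: it uses a plain uncompressed bitset built from a Python integer rather than a real roaring bitmap, and the function names `encode_ids` and `decode_ids` are hypothetical.

```python
import base64


def encode_ids(ids):
    """Pack non-negative integer IDs into a little-endian bitset, base64-encoded.

    Assumption: a plain bitset stands in for a roaring bitmap here.
    """
    bits = 0
    for i in ids:
        bits |= 1 << i  # set bit i for each ID
    raw = bits.to_bytes((bits.bit_length() + 7) // 8, "little")
    return base64.b64encode(raw).decode("ascii")


def decode_ids(payload):
    """Recover the set of IDs from a payload produced by encode_ids."""
    bits = int.from_bytes(base64.b64decode(payload), "little")
    return {i for i in range(bits.bit_length()) if (bits >> i) & 1}
```

A real roaring bitmap would compress sparse ID ranges far better than this flat bitset, which costs one bit for every ID up to the largest one present.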

Similarly, we have the terms lookup feature on the terms query, but it takes a somewhat naive "fetch an array of strings" approach.

My idea is to borrow the roaring bitmap idea from the linked blog post and add that to terms query's lookup.

Describe the solution you'd like

I would like to modify the terms lookup feature to add a new (opt-in) "protocol" between the main index and the term lookup index. The term lookup index should assign consistent, increasing ordinals to the values in the terms field. When the main index queries the term lookup index, it should pass a bitset of the ordinals whose values it "knows" (cached from previous requests). After finding the matching document in the term lookup index, the response should carry a bitset of matching ordinals, along with the ordinal-to-value mapping for any unknown values. This should allow us to carry very large sets of IDs across the index boundary in a compact representation.
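The exchange described above could look roughly like the following in-memory sketch. Everything here is hypothetical (the class names `TermLookupIndex` and `MainIndexClient`, the method names, and the use of Python integers as bitsets); it only illustrates the shape of the protocol, not how OpenSearch would implement it.

```python
def iter_bits(bits):
    """Yield the positions of the set bits in a non-negative integer."""
    ordinal = 0
    while bits:
        if bits & 1:
            yield ordinal
        bits >>= 1
        ordinal += 1


class TermLookupIndex:
    """Hypothetical lookup side: assigns stable, increasing ordinals to values."""

    def __init__(self):
        self._ordinal_of = {}  # value -> ordinal
        self._values = []      # ordinal -> value
        self._docs = {}        # lookup doc id -> bitset of ordinals

    def put(self, doc_id, terms):
        bits = 0
        for term in terms:
            if term not in self._ordinal_of:
                # New values always get the next ordinal, so old ones never change.
                self._ordinal_of[term] = len(self._values)
                self._values.append(term)
            bits |= 1 << self._ordinal_of[term]
        self._docs[doc_id] = bits

    def lookup(self, doc_id, known_bits):
        """Return (matching ordinals, ordinal->value mapping for unknown ones)."""
        matching = self._docs.get(doc_id, 0)
        unknown = matching & ~known_bits
        return matching, {o: self._values[o] for o in iter_bits(unknown)}


class MainIndexClient:
    """Hypothetical main-index side: caches ordinal->value across requests."""

    def __init__(self, lookup_index):
        self._lookup = lookup_index
        self._cache = {}  # ordinal -> value
        self._known = 0   # bitset of ordinals present in the cache

    def resolve_terms(self, doc_id):
        matching, mapping = self._lookup.lookup(doc_id, self._known)
        self._cache.update(mapping)
        for ordinal in mapping:
            self._known |= 1 << ordinal
        return {self._cache[o] for o in iter_bits(matching)}
```

On a second request, only values the main index has not seen before cross the boundary as strings; everything else travels as bits.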

As a next step, the term lookup should support multiple lookup keys and Boolean/bitset operations between them. The term lookup index will return the bitset after performing the Boolean operations. (The term lookup index may cache the result of these Boolean operations.)
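As a rough illustration of that next step, the sketch below combines the bitsets stored under several lookup keys and caches the result. The `BitsetStore` class, the operation names, and the cache key are all assumptions made for this example; cache invalidation when a stored bitset changes is deliberately left out.

```python
from functools import reduce

# Hypothetical operation names mapped to bitwise operators on integer bitsets.
OPS = {
    "or": lambda a, b: a | b,
    "and": lambda a, b: a & b,
    "and_not": lambda a, b: a & ~b,
}


class BitsetStore:
    """Hypothetical lookup-side store of ordinal bitsets keyed by lookup key."""

    def __init__(self, bitsets):
        self._bitsets = dict(bitsets)
        self._cache = {}     # (op, keys) -> combined bitset
        self.cache_hits = 0  # exposed only so the caching is observable

    def combine(self, op, keys):
        if not keys:
            raise ValueError("at least one lookup key is required")
        cache_key = (op, tuple(keys))
        if cache_key in self._cache:
            self.cache_hits += 1
            return self._cache[cache_key]
        # Fold the operator left to right; unknown keys count as empty bitsets.
        result = reduce(OPS[op], (self._bitsets.get(k, 0) for k in keys))
        self._cache[cache_key] = result
        return result
```

For instance, `store.combine("or", ["store_a", "store_b"])` would return the union of the two stores' ordinals, which fits the multi-location retail use case: only the combined bitset needs to travel back to the main index.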

Related component

Search:Query Capabilities

Describe alternatives you've considered

I drafted (on my computer) a whole idea around a new query type that would work with numeric ID bitsets, either passed in the query or stored in a custom data store (probably implemented as an OpenSearch index, but could be somewhere else).

My concerns with that were:

  1. It forces folks to use numeric IDs, which may not be a viable option.
  2. It would have added lots of new APIs (to organize, create, and manage bitsets), as well as a whole new query type.

Making the existing lookup feature of terms query "smarter" feels like a lot less work for me and for users.

Additional context

As I mentioned above, I had drafted a proposal on my computer to build a whole dedicated API. While I no longer think that's the right move, my proposal did have some nice examples of possible use cases:

Example 1 - Digital entitlements

An example would be a digital content entitlement system, with each document in the search index corresponding to a digital product. End-users can purchase access to content. When an end-user searches their library, they should only see content to which they have access. Updating each piece of content whenever a user makes a purchase is not practical, since a single item of content may be owned by many users. Instead, we would like each user

Example 2 - Multi-location retail search

A retail grocery chain offers online ordering and delivery across many store locations. They have a single catalog of products, but each location may carry a different selection of products. Product selection at any store may fluctuate as inventory sells out and new inventory is delivered. When a user in a given city searches for products, they should see items currently available from their local stores. There should be an updatable collection for each store and a user should be able to search across the union of products available from their local stores.

Metadata

Type

No type

Projects

Status

✅ Done

Status

Done

Status

New

Status

2.17 (First RC 09/03, Release 09/17)
