Sparse vector indexes #770

Open
wants to merge 2 commits into
base: main

@@ -23,7 +23,7 @@ authentication.
In this example we'll use two collections: a `users` collection to store the
user objects with names and credentials, and a `sessions` collection to store
the session data. We'll also make sure usernames are unique
by adding a hash index:
by adding a `persistent` index:

```js
"use strict";
@@ -37,7 +37,7 @@ if (!db._collection(sessions)) {
db._createDocumentCollection(sessions);
}
module.context.collection("users").ensureIndex({
type: "hash",
type: "persistent",
unique: true,
fields: ["username"]
});
4 changes: 2 additions & 2 deletions site/content/3.12/develop/http-api/indexes/_index.md
@@ -221,8 +221,8 @@ paths:
insert a value into the index that already exists in the index always fails,
regardless of the value of this attribute.

The optional **estimates** attribute is supported by persistent indexes.
This attribute controls whether index selectivity estimates are
The optional **estimates** attribute is supported by `persistent`, `mdi`, and
`mdi-prefixed` indexes. This attribute controls whether index selectivity estimates are
maintained for the index. Not maintaining index selectivity estimates can have
a slightly positive impact on write performance.
The downside of turning off index selectivity estimates will be that
@@ -111,7 +111,10 @@ paths:
default: false
sparse:
description: |
If `true`, then create a sparse index.
Whether to create a sparse index that excludes documents with
at least one of the attributes for indexing missing or set to
`null`. These attributes are defined by `fields` and (for
`mdi-prefixed` indexes) by `prefixFields`.
type: boolean
default: false
estimates:
5 changes: 3 additions & 2 deletions site/content/3.12/develop/http-api/indexes/persistent.md
@@ -113,8 +113,9 @@ paths:
default: false
sparse:
description: |
Whether create a sparse index that excludes documents with at least
one of the `fields` missing or set to `null`.
Whether to create a sparse index that excludes documents with
at least one of the attributes for indexing missing or set to
`null`. These attributes are defined by `fields`.
type: boolean
default: false
deduplicate:
9 changes: 8 additions & 1 deletion site/content/3.12/develop/http-api/indexes/vector.md
@@ -57,14 +57,21 @@ paths:
A list with exactly one attribute path to specify
where the vector embedding is stored in each document. The vector data needs
to be populated before creating the index.

If you want to index another vector embedding attribute, you need to create a
separate vector index.
type: array
minItems: 1
maxItems: 1
items:
type: string
sparse:
description: |
Whether to create a sparse index that excludes documents with
the attribute for indexing missing or set to `null`. This
attribute is defined by `fields`.
type: boolean
default: false
parallelism:
description: |
The number of threads to use for indexing.
@@ -19,8 +19,9 @@ It is often beneficial to create an index on more than just one attribute. By ad
to an index, an index can become more selective and thus reduce the number of documents that
queries need to process.

ArangoDB's primary indexes, edges indexes and hash indexes will automatically provide selectivity
estimates. Index selectivity estimates are provided in the web interface, the `indexes()` return
ArangoDB's `primary` and `edge` indexes automatically provide selectivity estimates.
The `persistent`, `mdi`, and `mdi-prefixed` indexes do too, by default.
Index selectivity estimates are provided in the web interface, the `indexes()` return
value and in the `explain()` output for a given query.

The more selective an index is, the more documents it will filter on average. The index selectivity
@@ -175,11 +175,11 @@ db.collection.ensureIndex({ type: "persistent", fields: [ "attributeName1", "att
When not explicitly set, the `sparse` attribute defaults to `false` for new indexes.
Indexes other than persistent do not support the `sparse` option.

As sparse indexes may exclude some documents from the collection, they cannot be used for
all types of queries. Sparse hash indexes cannot be used to find documents for which at
least one of the indexed attributes has a value of `null`. For example, the following AQL
query cannot use a sparse index, even if one was created on attribute `attr`:
<!-- TODO Remove above statement? -->
As sparse indexes may exclude some documents from the collection, they cannot
be used for all types of queries. For example, sparse persistent indexes cannot
be used to find documents for which at least one of the indexed attributes
is missing or has a value of `null`. The following AQL
query cannot use a sparse index over the attribute `attr`:

```aql
FOR doc In collection
@@ -189,15 +189,25 @@ FOR doc In collection

If the lookup value is non-constant, a sparse index may or may not be used, depending on
the other types of conditions in the query. If the optimizer can safely determine that
the lookup value cannot be `null`, a sparse index may be used. When uncertain, the optimizer
does not make use of a sparse index in a query in order to produce correct results.
the lookup value cannot be `null`, a sparse index may be used.

```aql
FOR doc In collection
LET random = RAND() * 5
FILTER doc.attr < random // Includes numbers < random but also true, false, and null!
FILTER doc.attr != null // Explicitly exclude null to make a sparse index eligible
RETURN doc
```

When uncertain, the optimizer does not make use of a sparse index in a query in
order to produce correct results.

For example, the following queries cannot use a sparse index on `attr` because the optimizer
does not know beforehand whether the values which are compared to `doc.attr` include `null`:

```aql
FOR doc In collection
FILTER doc.attr == SOME_FUNCTION(...)
FILTER doc.attr == SOME_FUNCTION(...)
RETURN doc
```
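
To make the exclusion rule concrete, here is a minimal JavaScript sketch that models which documents a sparse index would contain. The helpers `getPath` and `isIndexedBySparse` are hypothetical and not part of ArangoDB. They only mirror the rule described above: a document is skipped if at least one indexed attribute is missing or set to `null`.

```js
"use strict";

// Hypothetical helper: resolve a dotted attribute path such as "a.b" on a document.
function getPath(doc, path) {
  return path.split(".").reduce(
    (value, key) => (value === null || value === undefined ? undefined : value[key]),
    doc
  );
}

// Hypothetical model of the sparse rule: a document is only indexed if
// every indexed attribute is present and not null.
function isIndexedBySparse(doc, fields) {
  return fields.every((field) => {
    const value = getPath(doc, field);
    return value !== undefined && value !== null;
  });
}

const docs = [
  { _key: "a", attr: 1 },
  { _key: "b", attr: null },
  { _key: "c" },
  { _key: "d", attr: false },
];

// Keys of the documents a sparse index on ["attr"] would contain in this model.
const indexedKeys = docs
  .filter((doc) => isIndexedBySparse(doc, ["attr"]))
  .map((doc) => doc._key);

console.log(indexedKeys); // [ 'a', 'd' ]
```

In this model, the documents `b` and `c` are absent from the index, so a lookup for `doc.attr == null` could not be answered from it. This is why the optimizer needs to rule out `null` before it considers a sparse index.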

@@ -65,15 +65,18 @@ centroids and the quality of vector search thus degrades.
- **fields** (array of strings): A list with a single attribute path to specify
where the vector embedding is stored in each document. The vector data needs
to be populated before creating the index.

If you want to index another vector embedding attribute, you need to create a
separate vector index.
- **sparse** (boolean): Whether to create a sparse index that excludes documents
with the attribute for indexing missing or set to `null`. This attribute is
defined by `fields`. Default: `false`.
- **parallelism** (number):
The number of threads to use for indexing. The default is `2`.
The number of threads to use for indexing. Default: `2`.
- **inBackground** (boolean):
Set this option to `true` to keep the collection/shards available for
write operations by not using an exclusive write lock for the duration
of the index creation. The default is `false`.
of the index creation. Default: `false`.
- **params**: The parameters as used by the Faiss library.
- **metric** (string): Whether to use `cosine` or `l2` (Euclidean) distance calculation.
- **dimension** (number): The vector dimension. The attribute to index needs to
@@ -89,11 +92,11 @@ centroids and the quality of vector search thus degrades.
number of documents.
- **defaultNProbe** (number, _optional_): How many neighboring centroids to
consider for the search results by default. The larger the number, the slower
the search but the better the search results. The default is `1`. You should
the search but the better the search results. Default: `1`. You should
generally use a higher value here or per query via the `nProbe` option of
the vector similarity functions.
- **trainingIterations** (number, _optional_): The number of iterations in the
training process. The default is `25`. Smaller values lead to a faster index
training process. Default: `25`. Smaller values lead to a faster index
creation but may yield worse search results.
- **factory** (string, _optional_): You can specify an index factory string that is
forwarded to the underlying Faiss library, allowing you to combine different
@@ -1443,6 +1443,13 @@ utilizing vector indexes in queries.
Furthermore, a new error code `ERROR_QUERY_VECTOR_SEARCH_NOT_APPLIED` (1554)
has been added.

---

<small>Introduced in: v3.12.6</small>

Vector indexes can now be sparse to exclude documents with the embedding attribute
for indexing missing or set to `null`.

## Server options

### Effective and available startup options
@@ -23,7 +23,7 @@ authentication.
In this example we'll use two collections: a `users` collection to store the
user objects with names and credentials, and a `sessions` collection to store
the session data. We'll also make sure usernames are unique
by adding a hash index:
by adding a `persistent` index:

```js
"use strict";
@@ -37,7 +37,7 @@ if (!db._collection(sessions)) {
db._createDocumentCollection(sessions);
}
module.context.collection("users").ensureIndex({
type: "hash",
type: "persistent",
unique: true,
fields: ["username"]
});
4 changes: 2 additions & 2 deletions site/content/3.13/develop/http-api/indexes/_index.md
@@ -221,8 +221,8 @@ paths:
insert a value into the index that already exists in the index always fails,
regardless of the value of this attribute.

The optional **estimates** attribute is supported by persistent indexes.
This attribute controls whether index selectivity estimates are
The optional **estimates** attribute is supported by `persistent`, `mdi`, and
`mdi-prefixed` indexes. This attribute controls whether index selectivity estimates are
maintained for the index. Not maintaining index selectivity estimates can have
a slightly positive impact on write performance.
The downside of turning off index selectivity estimates will be that
@@ -111,7 +111,10 @@ paths:
default: false
sparse:
description: |
If `true`, then create a sparse index.
Whether to create a sparse index that excludes documents with
at least one of the attributes for indexing missing or set to
`null`. These attributes are defined by `fields` and (for
`mdi-prefixed` indexes) by `prefixFields`.
type: boolean
default: false
estimates:
5 changes: 3 additions & 2 deletions site/content/3.13/develop/http-api/indexes/persistent.md
@@ -113,8 +113,9 @@ paths:
default: false
sparse:
description: |
Whether create a sparse index that excludes documents with at least
one of the `fields` missing or set to `null`.
Whether to create a sparse index that excludes documents with
at least one of the attributes for indexing missing or set to
`null`. These attributes are defined by `fields`.
type: boolean
default: false
deduplicate:
9 changes: 8 additions & 1 deletion site/content/3.13/develop/http-api/indexes/vector.md
@@ -57,14 +57,21 @@ paths:
A list with exactly one attribute path to specify
where the vector embedding is stored in each document. The vector data needs
to be populated before creating the index.

If you want to index another vector embedding attribute, you need to create a
separate vector index.
type: array
minItems: 1
maxItems: 1
items:
type: string
sparse:
description: |
Whether to create a sparse index that excludes documents with
the attribute for indexing missing or set to `null`. This
attribute is defined by `fields`.
type: boolean
default: false
parallelism:
description: |
The number of threads to use for indexing.
@@ -19,8 +19,9 @@ It is often beneficial to create an index on more than just one attribute. By ad
to an index, an index can become more selective and thus reduce the number of documents that
queries need to process.

ArangoDB's primary indexes, edges indexes and hash indexes will automatically provide selectivity
estimates. Index selectivity estimates are provided in the web interface, the `indexes()` return
ArangoDB's `primary` and `edge` indexes automatically provide selectivity estimates.
The `persistent`, `mdi`, and `mdi-prefixed` indexes do too, by default.
Index selectivity estimates are provided in the web interface, the `indexes()` return
value and in the `explain()` output for a given query.

The more selective an index is, the more documents it will filter on average. The index selectivity
@@ -175,11 +175,11 @@ db.collection.ensureIndex({ type: "persistent", fields: [ "attributeName1", "att
When not explicitly set, the `sparse` attribute defaults to `false` for new indexes.
Indexes other than persistent do not support the `sparse` option.

As sparse indexes may exclude some documents from the collection, they cannot be used for
all types of queries. Sparse hash indexes cannot be used to find documents for which at
least one of the indexed attributes has a value of `null`. For example, the following AQL
query cannot use a sparse index, even if one was created on attribute `attr`:
<!-- TODO Remove above statement? -->
As sparse indexes may exclude some documents from the collection, they cannot
be used for all types of queries. For example, sparse persistent indexes cannot
be used to find documents for which at least one of the indexed attributes
is missing or has a value of `null`. The following AQL
query cannot use a sparse index over the attribute `attr`:

```aql
FOR doc In collection
@@ -189,15 +189,25 @@ FOR doc In collection

If the lookup value is non-constant, a sparse index may or may not be used, depending on
the other types of conditions in the query. If the optimizer can safely determine that
the lookup value cannot be `null`, a sparse index may be used. When uncertain, the optimizer
does not make use of a sparse index in a query in order to produce correct results.
the lookup value cannot be `null`, a sparse index may be used.

```aql
FOR doc In collection
LET random = RAND() * 5
FILTER doc.attr < random // Includes numbers < random but also true, false, and null!
FILTER doc.attr != null // Explicitly exclude null to make a sparse index eligible
RETURN doc
```

When uncertain, the optimizer does not make use of a sparse index in a query in
order to produce correct results.

For example, the following queries cannot use a sparse index on `attr` because the optimizer
does not know beforehand whether the values which are compared to `doc.attr` include `null`:

```aql
FOR doc In collection
FILTER doc.attr == SOME_FUNCTION(...)
FILTER doc.attr == SOME_FUNCTION(...)
RETURN doc
```

@@ -65,15 +65,18 @@ centroids and the quality of vector search thus degrades.
- **fields** (array of strings): A list with a single attribute path to specify
where the vector embedding is stored in each document. The vector data needs
to be populated before creating the index.

If you want to index another vector embedding attribute, you need to create a
separate vector index.
- **sparse** (boolean): Whether to create a sparse index that excludes documents
with the attribute for indexing missing or set to `null`. This attribute is
defined by `fields`. Default: `false`.
- **parallelism** (number):
The number of threads to use for indexing. The default is `2`.
The number of threads to use for indexing. Default: `2`.
- **inBackground** (boolean):
Set this option to `true` to keep the collection/shards available for
write operations by not using an exclusive write lock for the duration
of the index creation. The default is `false`.
of the index creation. Default: `false`.
- **params**: The parameters as used by the Faiss library.
- **metric** (string): Whether to use `cosine` or `l2` (Euclidean) distance calculation.
- **dimension** (number): The vector dimension. The attribute to index needs to
@@ -89,11 +92,11 @@ centroids and the quality of vector search thus degrades.
number of documents.
- **defaultNProbe** (number, _optional_): How many neighboring centroids to
consider for the search results by default. The larger the number, the slower
the search but the better the search results. The default is `1`. You should
the search but the better the search results. Default: `1`. You should
generally use a higher value here or per query via the `nProbe` option of
the vector similarity functions.
- **trainingIterations** (number, _optional_): The number of iterations in the
training process. The default is `25`. Smaller values lead to a faster index
training process. Default: `25`. Smaller values lead to a faster index
creation but may yield worse search results.
- **factory** (string, _optional_): You can specify an index factory string that is
forwarded to the underlying Faiss library, allowing you to combine different
@@ -1443,6 +1443,13 @@ utilizing vector indexes in queries.
Furthermore, a new error code `ERROR_QUERY_VECTOR_SEARCH_NOT_APPLIED` (1554)
has been added.

---

<small>Introduced in: v3.12.6</small>

Vector indexes can now be sparse to exclude documents with the embedding attribute
for indexing missing or set to `null`.

## Server options

### Effective and available startup options