First pass at `ste_vec` docs for JSONB containment indexing #9

Merged

Commits (7):

- `b302b8c` First pass at `ste_vec` docs for JSONB containment indexing (freshtonic)
- `98692bb` Add doc for which JSON types are supported by ste_vec (freshtonic)
- `bb99384` Minor tweaks (freshtonic)
- `171b778` Update README.md (calvinbrewer)
- `95924bc` Update README.md (calvinbrewer)
- `ff847ac` Update README.md (calvinbrewer)
- `4a45b19` chore: merge with main (calvinbrewer)

README.md

@@ -228,6 +228,7 @@ EQL provides specialized functions to interact with encrypted data:

- **`cs_match_v1(val JSONB)`**: Enables basic full-text search.
- **`cs_unique_v1(val JSONB)`**: Retrieves the unique index for enforcing uniqueness.
- **`cs_ore_v1(val JSONB)`**: Retrieves the Order-Revealing Encryption index for range queries.
- **`cs_ste_vec_v1(val JSONB)`**: Retrieves the Structured Encryption Vector for containment queries.

### Index functions

@@ -245,7 +246,7 @@ cs_add_index(table_name text, column_name text, index_name text, cast_as text, o

| column_name | Name of target column | Required |
| index_name | The index kind | Required |
| cast_as | The PostgreSQL type decrypted data will be cast to | Optional. Defaults to `text` |
| opts | Index options | Optional for `match` indexes (see below) |
| opts | Index options | Optional for `match` indexes, required for `ste_vec` indexes (see below) |

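For illustration, a call with these parameters might look like the sketch below (the table and column names are hypothetical, and the exact invocation syntax may differ from what the rest of this README prescribes):

```sql
-- Illustrative only: add a match index on an encrypted name column,
-- with decrypted values cast to the default `text` type.
SELECT cs_add_index('users', 'encrypted_name', 'match', 'text');
```
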
##### cast_as

@@ -256,6 +257,7 @@ Supported types:

- big_int
- boolean
- date
- jsonb

##### match opts

@@ -278,6 +280,137 @@ The default Match index options are:

- `tokenFilters`: a list of filters to apply to normalise tokens before indexing.
- `tokenizer`: determines how input text is split into tokens.
- `m`: the size of the backing [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in bits. Defaults to `2048`.
- `k`: the maximum number of bits set in the bloom filter per term. Defaults to `6`.

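These options combine into a single JSON object passed as `opts`. The sketch below is illustrative only: the field names come from the list above, but the exact shape of the `tokenizer` and `tokenFilters` entries (the `kind` key in particular) is an assumption, so defer to the default Match index options defined in the README.

```json
{
  "tokenFilters": [{ "kind": "downcase" }],
  "tokenizer": { "kind": "ngram", "tokenLength": 3 },
  "m": 2048,
  "k": 6
}
```
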
**Token Filters**

There are currently only two token filters available: `downcase` and `upcase`. These are used to normalise the text before indexing and are also applied to query terms. An empty array can also be passed to `tokenFilters` if no normalisation of terms is required.

**Tokenizer**

There are two `tokenizer`s provided: `standard` and `ngram`.
The `standard` tokenizer simply splits text into tokens using the regular expression `/[ ,;:!]/`.
The `ngram` tokenizer splits the text into n-grams and accepts a configuration object that allows you to specify the `tokenLength`.

**m** and **k**

`k` and `m` are optional fields for configuring the [bloom filters](https://en.wikipedia.org/wiki/Bloom_filter) that back full-text search.

`m` is the size of the bloom filter in bits. It must be a power of 2 between `32` and `65536`, and defaults to `2048`.

`k` is the number of hash functions to use per term.
This determines the maximum number of bits that will be set in the bloom filter per term.
`k` must be an integer from `3` to `16` and defaults to `6`.

**Caveats around n-gram tokenization**

While using n-grams as a tokenization method allows greater flexibility when doing arbitrary substring matches, it is important to bear in mind the limitations of this approach.
Specifically, searching for strings _shorter_ than the `tokenLength` parameter will not _generally_ work.

If you're using the n-gram tokenizer, a token that is already shorter than the `tokenLength` parameter will be kept as-is when indexed, and so a search for that short token will match that record.
However, if that same short string only appears as a part of a larger token, then it will not match that record.
In general, therefore, you should try to ensure that the string you search for is at least as long as the `tokenLength` of the index, except in the specific case where you know that there are shorter tokens to match, _and_ you are explicitly OK with not returning records that have that short string as part of a larger token.

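As a hypothetical illustration of this caveat, with `tokenLength` set to `3`:

```
"alice" -> ["ali", "lic", "ice"]   (a search for "lic" can match)
"al"    -> ["al"]                  (kept as-is because it is shorter than tokenLength)
```

A search for `"al"` would match a record where `al` was indexed as its own short token, but not a record where `al` appears only inside a longer token such as `alice`.
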
##### ste_vec opts

An ste_vec index on an encrypted JSONB column enables the use of PostgreSQL's `@>` and `<@` [containment operators](https://www.postgresql.org/docs/16/functions-json.html#FUNCTIONS-JSONB-OP-TABLE).

An ste_vec index requires one piece of configuration: the `prefix` (a string), which is functionally similar to a salt for the hashing process.

Within a dataset, encrypted columns indexed using an ste_vec that use different prefixes can never compare as equal.
Containment queries that manage to mix index terms from multiple columns will never return a positive result.
This is by design.

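As a sketch of how the `prefix` might be supplied (the table and column names are hypothetical, and passing the options as a JSON string via the `opts` parameter is an assumption based on the parameter table above):

```sql
-- Illustrative only: register an ste_vec index on an encrypted JSONB column,
-- casting decrypted values to `jsonb` and supplying the required prefix.
SELECT cs_add_index(
  'users',
  'encrypted_account',
  'ste_vec',
  'jsonb',
  '{"prefix": "users/encrypted_account"}'
);
```
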
The index is generated from a JSONB document by first flattening the structure of the document such that a hash can be generated for each unique path prefix to a node.

The complete set of JSON types is supported by the indexer; null values are ignored.

- Object `{ ... }`
- Array `[ ... ]`
- String `"abc"`
- Boolean `true`
- Number `123.45`

For a document like this:

```json
{
  "account": {
    "email": "[email protected]",
    "name": {
      "first_name": "Alice",
      "last_name": "McCrypto"
    },
    "roles": [
      "admin",
      "owner"
    ]
  }
}
```

Hashes would be produced from the following list of entries:

```json
[
  [Obj, Key("account"), Obj, Key("email"), String("[email protected]")],
  [Obj, Key("account"), Obj, Key("name"), Obj, Key("first_name"), String("Alice")],
  [Obj, Key("account"), Obj, Key("name"), Obj, Key("last_name"), String("McCrypto")],
  [Obj, Key("account"), Obj, Key("roles"), Array, String("admin")],
  [Obj, Key("account"), Obj, Key("roles"), Array, String("owner")]
]
```

Using the first entry to illustrate how an entry is converted to hashes:

```json
[Obj, Key("account"), Obj, Key("email"), String("[email protected]")]
```

The hashes would be generated for all prefixes of the full path to the leaf node:

```json
[
  [Obj],
  [Obj, Key("account")],
  [Obj, Key("account"), Obj],
  [Obj, Key("account"), Obj, Key("email")],
  [Obj, Key("account"), Obj, Key("email"), String("[email protected]")],
  // (remaining leaf nodes omitted)
]
```

Query terms are processed in the same manner as the input document.

A query prior to encrypting & indexing looks like a structurally similar subset of the encrypted document, for example:

```json
{ "account": { "email": "[email protected]", "roles": "admin" } }
```

The expression `cs_ste_vec_v1(encrypted_account) @> cs_ste_vec_v1($query)` would match all records where the `encrypted_account` column contains a JSONB object with an "account" key containing an object with an "email" key whose value is the string "[email protected]", and a "roles" key whose value contains the string "admin".

When reduced to a prefix list, the query would look like this:

```json
[
  [Obj],
  [Obj, Key("account")],
  [Obj, Key("account"), Obj],
  [Obj, Key("account"), Obj, Key("email")],
  [Obj, Key("account"), Obj, Key("email"), String("[email protected]")],
  [Obj, Key("account"), Obj, Key("roles")],
  [Obj, Key("account"), Obj, Key("roles"), Array],
  [Obj, Key("account"), Obj, Key("roles"), Array, String("admin")]
]
```

This list is then turned into an ste_vec of hashes which can be queried directly against the index.

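Putting this together, a containment query might look like the following sketch (the table name is hypothetical, and `$1` stands for the ste_vec EQL payload produced for the query term by the proxy):

```sql
-- Illustrative only: return rows whose encrypted_account document contains the query term.
SELECT *
FROM users
WHERE cs_ste_vec_v1(encrypted_account) @> cs_ste_vec_v1($1);
```
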
#### cs_modify_index

@@ -335,6 +468,15 @@ cs_ore_v1(val jsonb)

Extracts an ore index from the `jsonb` value.
Returns `null` if no ore index is present.

#### cs_ste_vec_v1

```sql
cs_ste_vec_v1(val jsonb)
```

Extracts an ste_vec index from the `jsonb` value.
Returns `null` if no ste_vec index is present.

### Data Format

Encrypted data is stored as `jsonb` with a specific schema:

@@ -383,7 +525,8 @@ Cipherstash proxy handles the encoding, and EQL provides the functions.

| c | Ciphertext | Ciphertext value. Encrypted by proxy. Required if kind is plaintext/pt or encrypting/et. |
| m.1 | Match index | Ciphertext index value. Encrypted by proxy. |
| o.1 | ORE index | Ciphertext index value. Encrypted by proxy. |
| u.1 | Uniqueindex | Ciphertext index value. Encrypted by proxy. |
| u.1 | Unique index | Ciphertext index value. Encrypted by proxy. |
| sv.1 | STE vector index | Ciphertext index value. Encrypted by proxy. |

#### Helper packages

Nice