diff --git a/.gitignore b/.gitignore index 5ce7e7d7..00e99a38 100644 --- a/.gitignore +++ b/.gitignore @@ -185,3 +185,7 @@ cipherstash-proxy.toml release/ .mise.* + +# jupyter notebook +.ipynb_checkpoints +__pycache__ \ No newline at end of file diff --git a/README.md b/README.md index 3445bb14..55246a97 100644 --- a/README.md +++ b/README.md @@ -1,25 +1,47 @@ -# CipherStash Encrypt Query Language (EQL) +# Encrypt Query Language (EQL) -[![Why we built EQL](https://img.shields.io/badge/concept-Why%20EQL-8A2BE2)](https://github.com/cipherstash/encrypt-query-language/blob/main/WHY.md) -[![Getting started](https://img.shields.io/badge/guide-Getting%20started-008000)](https://github.com/cipherstash/encrypt-query-language/blob/main/GETTINGSTARTED.md) -[![CipherStash Proxy](https://img.shields.io/badge/guide-CipherStash%20Proxy-A48CF3)](https://github.com/cipherstash/encrypt-query-language/blob/main/PROXY.md) -[![CipherStash Migrator](https://img.shields.io/badge/guide-CipherStash%20Migrator-A48CF3)](https://github.com/cipherstash/encrypt-query-language/blob/main/MIGRATOR.md) +[![Why we built EQL](https://img.shields.io/badge/concept-Why%20EQL-8A2BE2)](https://github.com/cipherstash/encrypt-query-language/blob/main/docs/concepts/WHY.md) +[![Getting started](https://img.shields.io/badge/guide-Getting%20started-008000)](https://github.com/cipherstash/encrypt-query-language/blob/main/docs/tutorials/GETTINGSTARTED.md) +[![CipherStash Proxy](https://img.shields.io/badge/guide-CipherStash%20Proxy-A48CF3)](https://github.com/cipherstash/encrypt-query-language/blob/main/docs/tutorials/PROXY.md) +[![CipherStash Migrator](https://img.shields.io/badge/guide-CipherStash%20Migrator-A48CF3)](https://github.com/cipherstash/encrypt-query-language/blob/main/docs/reference/MIGRATOR.md) Encrypt Query Language (EQL) is a set of abstractions for transmitting, storing, and interacting with encrypted data and indexes in PostgreSQL. -EQL provides a data format for transmitting and storing encrypted data and indexes, as well as database types and functions to interact with the encrypted material. +Store encrypted data alongside your existing data. 
+ +- Encrypted data is stored using a `jsonb` column type +- Query encrypted data with specialized SQL functions +- Index encrypted columns to enable searchable encryption +- Integrate with [CipherStash Proxy](https://github.com/cipherstash/encrypt-query-language/blob/main/docs/tutorials/PROXY.md) for transparent encryption/decryption ## Table of Contents - [Installation](#installation) -- [Usage](#usage) -- [Encrypted columns](#encrypted-columns) - - [Inserting data](#inserting-data) - - [Reading data](#reading-data) -- [Querying data with EQL](#querying-data-with-eql) -- [Querying JSONB data with EQL](#querying-jsonb-data-with-eql) -- [Managing indexes with EQL](#managing-indexes-with-eql) -- [Data Format](#data-format) + - [CipherStash Proxy](#cipherstash-proxy) +- [Getting started](#getting-started) + - [Enable encrypted columns](#enable-encrypted-columns) + - [Configuring the column](#configuring-the-column) + - [Activating configuration](#activating-configuration) + - [Refreshing CipherStash Proxy Configuration](#refreshing-cipherstash-proxy-configuration) +- [Storing data](#storing-data) + - [Inserting Data](#inserting-data) + - [Reading Data](#reading-data) +- [Configuring indexes for searching data](#configuring-indexes-for-searching-data) + - [Adding an index (`cs_add_index_v1`)](#adding-an-index-cs_add_index_v1) +- [Searching data with EQL](#searching-data-with-eql) + - [Equality search (`cs_unique_v1`)](#equality-search-cs_unique_v1) + - [Full-text search (`cs_match_v1`)](#full-text-search-cs_match_v1) + - [Range queries (`cs_ore_64_8_v1`)](#range-queries-cs_ore_64_8_v1) +- [JSON and JSONB support](#json-and-jsonb-support) + - [Configuring the index](#configuring-the-index) + - [Inserting JSON data](#inserting-json-data) + - [Reading JSON data](#reading-json-data) + - [Advanced JSON queries](#advanced-json-queries) +- [EQL payload data format](#eql-payload-data-format) +- [Frequently Asked Questions](#frequently-asked-questions) + - [How do I integrate CipherStash EQL with my application?](#how-do-i-integrate-cipherstash-eql-with-my-application) + - [Can I use EQL without the CipherStash Proxy?](#can-i-use-eql-without-the-cipherstash-proxy) + - [How is data encrypted in the database?](#how-is-data-encrypted-in-the-database) - [Helper packages](#helper-packages) - [Releasing](#releasing) @@ -27,42 +49,34 @@ EQL provides a data format for transmitting and storing encrypted data and index ## Installation -The simplest and fastest way to get up and running with EQL is to execute the install SQL file directly in your database. +The simplest way to get up and running with EQL is to execute the install SQL file directly in your database. + +1. Download the latest EQL install script: -1. Get the latest EQL install script: - ```bash - curl -sLo cipherstash-encrypt.sql https://github.com/cipherstash/encrypt-query-language/releases/latest/download/cipherstash-encrypt.sql + ```sh + curl -sLo cipherstash-encrypt.sql https://github.com/cipherstash/encrypt-query-language/releases/latest/download/cipherstash-encrypt.sql ``` -1. Run this command to install the custom types and functions: - ```bash + +2. Run this command to install the custom types and functions: + + ```sh psql -f cipherstash-encrypt.sql ``` -## Usage - -Once the custom types and functions are installed, you can start using EQL in your queries. +### CipherStash Proxy -1. Create a table with a column of type `cs_encrypted_v1` which will store your encrypted data. -1. 
Use EQL functions to add indexes for the columns you want to encrypt. - - Indexes are used by CipherStash Proxy to understand what cryptography schemes are required for your use case. -1. Initialize CipherStash Proxy for cryptographic operations. - - Proxy will dynamically encrypt data on the way in and decrypt data on the way out, based on the indexes you've defined. -1. Insert data into the defined columns using a specific payload format. - - See [data format](#data-format) for the payload format. -1. Query the data using the EQL functions defined in [querying data with EQL](#querying-data-with-eql). - - No modifications are required to simply `SELECT` data from your encrypted columns. - - To perform `WHERE` and `ORDER BY` queries, wrap the queries in the EQL functions defined in [querying data with EQL](#querying-data-with-eql). -1. Integrate with your application via the [helper packages](#helper-packages) to interact with the encrypted data. +EQL relies on [CipherStash Proxy](https://github.com/cipherstash/encrypt-query-language/blob/main/PROXY.md) for low-latency encryption & decryption. +We plan to support direct language integration in the future. -Read [GETTINGSTARTED.md](GETTINGSTARTED.md) for more detail. +## Getting started -## Encrypted columns +Once the custom types and functions are installed, you can start using EQL in your queries. -EQL relies on your database schema to define encrypted columns. +### Enable encrypted columns -Encrypted columns are defined using the `cs_encrypted_v1` [domain type](https://www.postgresql.org/docs/current/domains.html), which extends the `jsonb` type with additional constraints to ensure data integrity. +Define encrypted columns using the `cs_encrypted_v1` domain type, which extends the `jsonb` type with additional constraints to ensure data integrity. -**Example table definition:** +**Example:** ```sql CREATE TABLE users ( @@ -71,542 +85,353 @@ CREATE TABLE users ( ); ``` -In some instances, especially when using langugage specific ORMs, EQL also supports `jsonb` columns rather than the `cs_encrypted_v1` domain type. - ### Configuring the column -So that CipherStash Proxy can encrypt and decrypt the data, initialize the column in the database using the `cs_add_column_v1` function. -This function takes the following parameters: - -- `table_name`: the name of the table containing the encrypted column. -- `column_name`: the name of the encrypted column. - -This function will **not** enable searchable encryption, but will allow you to encrypt and decrypt data. -See [querying data with EQL](#querying-data-with-eql) for more information on how to enable searchable encryption. +Initialize the column using the `cs_add_column_v1` function to enable encryption and decryption via CipherStash Proxy. ```sql -SELECT cs_add_column_v1('table', 'column'); +SELECT cs_add_column_v1('users', 'encrypted_email'); ``` -### Activate configuration +**Note:** This function allows you to encrypt and decrypt data but does not enable searchable encryption. See [Querying Data with EQL](#querying-data-with-eql) for enabling searchable encryption. + +### Activating configuration -By default, the state of the configuration is `pending` after any modifications. -You can activate the configuration by running the `cs_encrypt_v1` and `cs_activate_v1` function. +After modifying configurations, activate them by running: ```sql SELECT cs_encrypt_v1(); SELECT cs_activate_v1(); ``` -> **Important:** These functions must be run after any modifications to the configuration. 
+**Important:** These functions must be run after any modifications to the configuration. -#### Refresh CipherStash Proxy configuration +#### Refreshing CipherStash Proxy Configuration -CipherStash Proxy pings the database every 60 seconds to refresh the configuration. -You can force CipherStash Proxy to refresh the configuration by running the `cs_refresh_encrypt_config` function. +CipherStash Proxy refreshes the configuration every 60 seconds. To force an immediate refresh, run: ```sql SELECT cs_refresh_encrypt_config(); ``` -### Inserting data +>Note: This statement must be executed when connected to CipherStash Proxy. +When connected to the database directly, it is a no-op. -When inserting data into the encrypted column, wrap the plaintext in the appropriate EQL payload. -These statements must be run through the CipherStash Proxy in order to **encrypt** the data. +## Storing data -**Example:** +Encrypted data is stored as `jsonb` values in the database, regardless of the original data type. -```rb -# Create the EQL payload using helper functions -payload = eqlPayload("users", "encrypted_email", "test@test.com") +You can read more about the data format [here][#data-format]. -Users.create(encrypted_email: payload) -``` +### Inserting Data + +When inserting data into the encrypted column, wrap the plaintext in the appropriate EQL payload. These statements must be run through the CipherStash Proxy to **encrypt** the data. -Which will execute on the server as: +**Example:** ```sql -INSERT INTO users (encrypted_email) VALUES ('{"v":1,"k":"pt","p":"test@test.com","i":{"t":"users","c":"encrypted_email"}}'); +INSERT INTO users (encrypted_email) VALUES ( + '{"v":1,"k":"pt","p":"test@example.com","i":{"t":"users","c":"encrypted_email"}}' +); ``` -And is the EQL equivalent of the following plaintext query. +Data is stored in the database as: -```sql -INSERT INTO users (email) VALUES ('test@test.com'); +```json +{ + "c": "generated_ciphertext", + "i": { + "c": "encrypted_email", + "t": "users" + }, + "k": "ct", + "m": null, + "o": null, + "u": null, + "v": 1 +} ``` -All the data stored in the database is fully encrypted and secure. - -### Reading data +### Reading Data -When querying data, wrap the encrypted column in the appropriate EQL payload. -These statements must be run through the CipherStash Proxy in order to **decrypt** the data. +When querying data, select the encrypted column. CipherStash Proxy will **decrypt** the data automatically. **Example:** -```rb -Users.findAll(&:encrypted_email) -``` - -Which will execute on the server as: - ```sql SELECT encrypted_email FROM users; ``` -And is the EQL equivalent of the following plaintext query: +Data is returned as: -```sql -SELECT email FROM users; +```json +{ + "k": "pt", + "p": "test@example.com", + "i": { + "t": "users", + "c": "encrypted_email" + }, + "v": 1, + "q": null +} ``` -All the data returned from the database is fully decrypted. - -## Querying data with EQL +>Note: If you execute this query directly on the database, you will not see any plaintext data but rather the `jsonb` payload with the ciphertext. -EQL provides specialized functions to interact with encrypted data to support operations like equality checks, range queries, and unique constraints. +## Configuring indexes for searching data -### `cs_match_v1(val JSONB)` +In order to perform searchable operations on encrypted data, you must configure indexes for the encrypted columns. -Enables basic full-text search. 
+> **IMPORTANT:** If you have existing data that's encrypted and you add or modify an index, all the data will need to be re-encrypted. +This is due to the way CipherStash Proxy handles searchable encryption operations. -**Example** +### Adding an index (`cs_add_index_v1`) -```rb -# Create the EQL payload using helper functions -payload = EQL.for_match("users", "encrypted_field", "plaintext value") - -Users.where("cs_match_v1(field) @> cs_match_v1(?)", payload) -``` - -Which will execute on the server as: +Add an index to an encrypted column. +This function also behaves the same as `cs_add_column_v1` but with the additional index configuration. ```sql -SELECT * FROM users WHERE cs_match_v1(field) @> cs_match_v1('{"v":1,"k":"pt","p":"plaintext value","i":{"t":"users","c":"encrypted_field"},"q":"match"}'); -``` - -And is the EQL equivalent of the following plaintext query. - -```sql -SELECT * FROM users WHERE field LIKE '%plaintext value%'; +SELECT cs_add_index_v1( + 'table_name', -- Name of the table + 'column_name', -- Name of the column + 'index_name', -- Index kind ('unique', 'match', 'ore', 'ste_vec') + 'cast_as', -- PostgreSQL type to cast decrypted data ('text', 'int', etc.) + 'opts' -- Index options as JSONB (optional) +); ``` -### `cs_unique_v1(val JSONB)` - -Retrieves the unique index for enforcing uniqueness. +You can read more about the index configuration options [here][https://github.com/cipherstash/encrypt-query-language/blob/main/docs/reference/INDEX.md]. -**Example:** - -```rb -# Create the EQL payload using helper functions -payload = EQL.for_unique("users", "encrypted_field", "plaintext value") - -Users.where("cs_unique_v1(field) = cs_unique_v1(?)", payload) -``` - -Which will execute on the server as: +**Example (Unique index):** ```sql -SELECT * FROM users WHERE cs_unique_v1(field) = cs_unique_v1('{"v":1,"k":"pt","p":"plaintext value","i":{"t":"users","c":"encrypted_field"},"q":"unique"}'); +SELECT cs_add_index_v1( + 'users', + 'encrypted_email', + 'unique', + 'text' +); ``` -And is the EQL equivalent of the following plaintext query. +After adding an index, you have to activate the configuration. ```sql -SELECT * FROM users WHERE field = 'plaintext value'; +SELECT cs_encrypt_v1(); +SELECT cs_activate_v1(); ``` -### `cs_ore_64_8_v1(val JSONB)` +## Searching data with EQL -Retrieves the Order-Revealing Encryption index for range queries. +EQL provides specialized functions to interact with encrypted data, supporting operations like equality checks, range queries, and unique constraints. -**Sorting example:** +In order to use the specialized functions, you must first configure the corresponding indexes. -```rb -# Create the EQL payload using helper functions -date = EQL.for_ore("users", "encrypted_date", Time.now) +### Equality search (`cs_unique_v1`) -User.where("cs_ore_64_8_v1(encrypted_date) < cs_ore_64_8_v1(?)", date) -``` +Enable equality search on encrypted data. 
-Which will execute on the server as: +**Index configuration example:** ```sql -SELECT * FROM examples WHERE cs_ore_64_8_v1(encrypted_date) < cs_ore_64_8_v1($1) +SELECT cs_add_index_v1( + 'users', + 'encrypted_email', + 'unique', + 'text' +); ``` -And is the EQL equivalent of the following plaintext query: +**Example:** ```sql -SELECT * FROM examples WHERE date < $1; -``` - -**Ordering example:** - -```rb -User.order("cs_ore_64_8_v1(encrypted_field)").all().map(&:id) +SELECT * FROM users +WHERE cs_unique_v1(encrypted_email) = cs_unique_v1( + '{"v":1,"k":"pt","p":"test@example.com","i":{"t":"users","c":"encrypted_email"},"q":"unique"}' +); ``` -Which will execute on the server as: +Equivalent plaintext query: ```sql -SELECT id FROM examples ORDER BY cs_ore_64_8_v1(encrypted_field) DESC; +SELECT * FROM users WHERE email = 'test@example.com'; ``` -And is the EQL equivalent of the following plaintext query. +### Full-text search (`cs_match_v1`) -```sql -SELECT id FROM examples ORDER BY field DESC; -``` - -**Grouping example:** +Enables basic full-text search on encrypted data. -ORE indexes can be used along with the `cs_grouped_value_v1` aggregate function to group by an encrypted column: - -``` -SELECT cs_grouped_value_v1(encrypted_field) COUNT(*) - FROM users - GROUP BY cs_ore_64_8_v1(encrypted_field) -``` - -## Querying JSONB data with EQL - -### `cs_ste_term_v1(val JSONB, epath TEXT)` - -Retrieves the encrypted _term_ associated with the encrypted JSON path, `epath`. - -### `cs_ste_vec_v1(val JSONB)` - -Retrieves the Structured Encryption Vector for containment queries. - -**Example:** - -```rb -# Serialize a JSONB value bound to the users table column -term = EQL.for_ste_vec("users", "attrs", {field: "value"}) -User.where("cs_ste_vec_v1(attrs) @> cs_ste_vec_v1(?)", term) -``` - -Which will execute on the server as: +**Index configuration example:** ```sql -SELECT * FROM users WHERE cs_ste_vec_v1(attrs) @> '53T8dtvW4HhofDp9BJnUkw'; -``` - -And is the EQL equivalent of the following plaintext query. - -```sql -SELECT * FROM users WHERE attrs @> '{"field": "value"}`; +SELECT cs_add_index_v1( + 'users', + 'encrypted_email', + 'match', + 'text', + '{"token_filters": [{"kind": "downcase"}], "tokenizer": { "kind": "ngram", "token_length": 3 }}' +); ``` -### `cs_ste_term_v1(val JSONB, epath TEXT)` - -Retrieves the encrypted index term associated with the encrypted JSON path, `epath`. - -This is useful for sorting or filtering on integers in encrypted JSON objects. - **Example:** -```rb -# Serialize a JSONB value bound to the users table column -path = EQL.for_ejson_path("users", "attrs", "$.login_count") -term = EQL.for_ore("users", "attrs", 100) -User.where("cs_ste_term_v1(attrs, ?) > cs_ore_64_8_v1(?)", path, term) -``` - -Which will execute on the server as: - ```sql -SELECT * FROM users WHERE cs_ste_term_v1(attrs, 'DQ1rbhWJXmmqi/+niUG6qw') > 'QAJ3HezijfTHaKrhdKxUEg'; +SELECT * FROM users +WHERE cs_match_v1(encrypted_email) @> cs_match_v1( + '{"v":1,"k":"pt","p":"test","i":{"t":"users","c":"encrypted_email"},"q":"match"}' +); ``` -And is the EQL equivalent of the following plaintext query. +Equivalent plaintext query: ```sql -SELECT * FROM users WHERE attrs->'login_count' > 10; +SELECT * FROM users WHERE email LIKE '%test%'; ``` -### `cs_ste_value_v1(val JSONB, epath TEXT)` - -Retrieves the encrypted _value_ associated with the encrypted JSON path, `epath`. 
- -**Example:** - -```rb -# Serialize a JSONB value bound to the users table column -path = EQL.for_ejson_path("users", "attrs", "$.login_count") -User.find_by_sql(["SELECT cs_ste_value_v1(attrs, ?) FROM users", path]) -``` +### Range queries (`cs_ore_64_8_v1`) -Which will execute on the server as: +Enable range queries on encrypted data. Supports: -```sql -SELECT cs_ste_value_v1(attrs, 'DQ1rbhWJXmmqi/+niUG6qw') FROM users; -``` +- `ORDER BY` +- `WHERE` -And is the EQL equivalent of the following plaintext query. +**Example (Filtering):** ```sql -SELECT attrs->'login_count' FROM users; +SELECT * FROM users +WHERE cs_ore_64_8_v1(encrypted_date) < cs_ore_64_8_v1( + '{"v":1,"k":"pt","p":"2023-10-05","i":{"t":"users","c":"encrypted_date"},"q":"ore"}' +); ``` -### Field extraction - -Extract a field from a JSONB object in a `SELECT` statement: +Equivalent plaintext query: ```sql -SELECT cs_ste_value_v1(attrs, 'DQ1rbhWJXmmqi/+niUG6qw') FROM users; +SELECT * FROM users WHERE date < '2023-10-05'; ``` -Which is the equivalent to the following SQL query: +**Example (Ordering):** ```sql -SELECT attrs->'login_count' FROM users; +SELECT id FROM users +ORDER BY cs_ore_64_8_v1(encrypted_field) DESC; ``` -### Extraction (in WHERE, ORDER BY) - -Select rows that match a field in a JSONB object: +Equivalent plaintext query: ```sql -SELECT * FROM users WHERE cs_ste_term_v1(attrs, 'DQ1rbhWJXmmqi/+niUG6qw') > 'QAJ3HezijfTHaKrhdKxUEg'; +SELECT id FROM users ORDER BY field DESC; ``` -Which is the equivalent to the following SQL query: +**Example (Grouping):** ```sql -SELECT * FROM users WHERE attrs->'login_count' > 10; -``` - -### Grouping - -`cs_ste_vec_term_v1` can be used along with the `cs_grouped_value_v1` aggregate function to group by a field in an encrypted JSONB column: - -``` --- $1 here is a param that containts the EQL payload for an ejson path. --- Example EQL payload for the path `$.field_one`: --- '{"k": "pt", "p": "$.field_one", "q": "ejson_path", "i": {"t": "users", "c": "attrs"}, "v": 1}' -SELECT cs_grouped_value_v1(cs_ste_vec_value_v1(attrs), $1) COUNT(*) +SELECT cs_grouped_value_v1(encrypted_field) COUNT(*) FROM users - GROUP BY cs_ste_vec_term_v1(attrs, $1); + GROUP BY cs_ore_64_8_v1(encrypted_field) ``` -## Managing indexes with EQL - -These functions expect a `jsonb` value that conforms to the storage schema. - -### `cs_add_index` +Equivalent plaintext query: ```sql -cs_add_index(table_name text, column_name text, index_name text, cast_as text, opts jsonb) +SELECT field, COUNT(*) FROM users GROUP BY field; ``` -| Parameter | Description | Notes | -| ------------- | -------------------------------------------------- | ------------------------------------------------------------------------ | -| `table_name` | Name of target table | Required | -| `column_name` | Name of target column | Required | -| `index_name` | The index kind | Required. | -| `cast_as` | The PostgreSQL type decrypted data will be cast to | Optional. Defaults to `text` | -| `opts` | Index options | Optional for `match` indexes, required for `ste_vec` indexes (see below) | - -#### cast_as - -Supported types: +## JSON and JSONB support -- `text` -- `int` -- `small_int` -- `big_int` -- `boolean` -- `date` -- `jsonb` +EQL supports encrypting, decrypting, and searching JSON and JSONB objects. -#### match opts +### Configuring the index -A match index enables full text search across one or more text fields in queries. +Similar to how you configure indexes for text data, you can configure indexes for JSON and JSONB data. 
+The only difference is that you need to specify the `cast_as` parameter as `json` or `jsonb`. -The default Match index options are: - -```json - { - "k": 6, - "m": 2048, - "include_original": true, - "tokenizer": { - "kind": "ngram", - "token_length": 3 - } - "token_filters": { - "kind": "downcase" - } - } +```sql +SELECT cs_add_index_v1( + 'users', + 'encrypted_json', + 'ste_vec', + 'jsonb', + '{"prefix": "users/encrypted_json"}' -- The prefix is in the form of "table/column" +); ``` -- `tokenFilters`: a list of filters to apply to normalize tokens before indexing. -- `tokenizer`: determines how input text is split into tokens. -- `m`: The size of the backing [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in bits. Defaults to `2048`. -- `k`: The maximum number of bits set in the bloom filter per term. Defaults to `6`. - -**Token filters** - -There are currently only two token filters available: `downcase` and `upcase`. These are used to normalise the text before indexing and are also applied to query terms. An empty array can also be passed to `tokenFilters` if no normalisation of terms is required. - -**Tokenizer** +You can read more about the index configuration options [here](https://github.com/cipherstash/encrypt-query-language/blob/main/docs/reference/INDEX.md). -There are two `tokenizer`s provided: `standard` and `ngram`. -`standard` simply splits text into tokens using this regular expression: `/[ ,;:!]/`. -`ngram` splits the text into n-grams and accepts a configuration object that allows you to specify the `tokenLength`. +### Inserting JSON data -**m** and **k** +When inserting JSON data, this works the same as inserting text data. +You need to wrap the JSON data in the appropriate EQL payload. +CipherStash Proxy will **encrypt** the data automatically. -`k` and `m` are optional fields for configuring [bloom filters](https://en.wikipedia.org/wiki/Bloom_filter) that back full text search. - -`m` is the size of the bloom filter in bits. `filterSize` must be a power of 2 between `32` and `65536` and defaults to `2048`. - -`k` is the number of hash functions to use per term. -This determines the maximum number of bits that will be set in the bloom filter per term. -`k` must be an integer from `3` to `16` and defaults to `6`. - -**Caveats around n-gram tokenization** - -While using n-grams as a tokenization method allows greater flexibility when doing arbitrary substring matches, it is important to bear in mind the limitations of this approach. -Specifically, searching for strings _shorter_ than the `tokenLength` parameter will not _generally_ work. - -If you're using n-gram as a token filter, then a token that is already shorter than the `tokenLength` parameter will be kept as-is when indexed, and so a search for that short token will match that record. -However, if that same short string only appears as a part of a larger token, then it will not match that record. -In general, therefore, you should try to ensure that the string you search for is at least as long as the `tokenLength` of the index, except in the specific case where you know that there are shorter tokens to match, _and_ you are explicitly OK with not returning records that have that short string as part of a larger token. - -#### ste_vec opts - -An ste_vec index on a encrypted JSONB column enables the use of PostgreSQL's `@>` and `<@` [containment operators](https://www.postgresql.org/docs/16/functions-json.html#FUNCTIONS-JSONB-OP-TABLE). 
- -An ste_vec index requires one piece of configuration: the `context` (a string) which is passed as an info string to a MAC (Message Authenticated Code). -This ensures that all of the encrypted values are unique to that context. -It is generally recommended to use the table and column name as a the context (e.g. `users/name`). - -Within a dataset, encrypted columns indexed using an `ste_vec` that use different contexts cannot be compared. -Containment queries that manage to mix index terms from multiple columns will never return a positive result. -This is by design. - -The index is generated from a JSONB document by first flattening the structure of the document such that a hash can be generated for each unique path prefix to a node. - -The complete set of JSON types is supported by the indexer. -Null values are ignored by the indexer. - -- Object `{ ... }` -- Array `[ ... ]` -- String `"abc"` -- Boolean `true` -- Number `123.45` +**Example:** -For a document like this: +Assuming you want to store the following JSON data: ```json { - "account": { - "email": "alice@example.com", - "name": { - "first_name": "Alice", - "last_name": "McCrypto" - }, - "roles": ["admin", "owner"] + "name": "John Doe", + "metadata": { + "age": 42, } } ``` -Hashes would be produced from the following list of entries: - -```js -[ - [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")], - [ - Obj, - Key("account"), - Obj, - Key("name"), - Obj, - Key("first_name"), - String("Alice"), - ], - [ - Obj, - Key("account"), - Obj, - Key("name"), - Obj, - Key("last_name"), - String("McCrypto"), - ], - [Obj, Key("account"), Obj, Key("roles"), Array, String("admin")], - [Obj, Key("account"), Obj, Key("roles"), Array, String("owner")], -]; -``` - -Using the first entry to illustrate how an entry is converted to hashes: +The EQL payload would be: -```js -[Obj, Key("account"), Obj, Key("email"), String("alice@example.com")]; -``` - -The hashes would be generated for all prefixes of the full path to the leaf node. - -```js -[ - [Obj], - [Obj, Key("account")], - [Obj, Key("account"), Obj], - [Obj, Key("account"), Obj, Key("email")], - [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")], - // (remaining leaf nodes omitted) -]; +```sql +INSERT INTO users (encrypted_json) VALUES ( + '{"v":1,"k":"pt","p":"{\"name\":\"John Doe\",\"metadata\":{\"age\":42}}","i":{"t":"users","c":"encrypted_json"}}' +); ``` -Query terms are processed in the same manner as the input document. - -A query prior to encrypting & indexing looks like a structurally similar subset of the encrypted document, for example: +Data is stored in the database as: ```json -{ "account": { "email": "alice@example.com", "roles": "admin" } } +{ + "i": { + "c": "encrypted_json", + "t": "users" + }, + "k": "sv", + "v": 1, + "sv": [ + ...ciphertext... + ] +} ``` -The expression `cs_ste_vec_v1(encrypted_account) @> cs_ste_vec_v1($query)` would match all records where the `encrypted_account` column contains a JSONB object with an "account" key containing an object with an "email" key where the value is the string "alice@example.com". 
- -When reduced to a prefix list, it would look like this: - -```js -[ - [Obj], - [Obj, Key("account")], - [Obj, Key("account"), Obj], - [Obj, Key("account"), Obj, Key("email")], - [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")][ - (Obj, Key("account"), Obj, Key("roles")) - ], - [Obj, Key("account"), Obj, Key("roles"), Array], - [Obj, Key("account"), Obj, Key("roles"), Array, String("admin")], -]; -``` +### Reading JSON data -Which is then turned into an ste_vec of hashes which can be directly queries against the index. +When querying data, select the encrypted column. CipherStash Proxy will **decrypt** the data automatically. -### `cs_modify_index` +**Example:** ```sql -_cs_modify_index_v1(table_name text, column_name text, index_name text, cast_as text, opts jsonb) +SELECT encrypted_json FROM users; ``` -Modifies an existing index configuration. -Accepts the same parameters as `cs_add_index` +Data is returned as: -### `cs_remove_index` - -```sql -cs_remove_index_v1(table_name text, column_name text, index_name text) +```json +{ + "k": "pt", + "p": "{\"metadata\":{\"age\":42},\"name\":\"John Doe\"}", + "i": { + "t": "users", + "c": "encrypted_json" + }, + "v": 1, + "q": null +} ``` -Removes an index configuration from the column. +### Advanced JSON queries -## Data format +We support a wide range of JSON/JSONB functions and operators. +You can read more about the JSONB support in the [JSONB reference guide](https://github.com/cipherstash/encrypt-query-language/blob/main/docs/reference/JSON.md). + +## EQL payload data format Encrypted data is stored as `jsonb` with a specific schema: @@ -658,12 +483,30 @@ CipherStash Proxy handles the encoding, and EQL provides the functions. | u | Unique index | Ciphertext index value. Encrypted by Proxy. | | sv | STE vector index | Ciphertext index value. Encrypted by Proxy. | +## Frequently Asked Questions + +### How do I integrate CipherStash EQL with my application? + +Use CipherStash Proxy to intercept database queries and handle encryption and decryption automatically. +The proxy interacts with the database using the EQL functions and types defined in this documentation. + +Use the [helper packages](#helper-packages) to integate EQL functions into your application. + +### Can I use EQL without the CipherStash Proxy? + +No, CipherStash Proxy is required to handle the encryption and decryption operations based on the configurations and indexes defined. + +### How is data encrypted in the database? + +Data is encrypted using CipherStash's cryptographic schemes and stored in the `cs_encrypted_v1` column as a JSONB payload. +Encryption and decryption are handled by CipherStash Proxy. + ## Helper packages We've created a few langague specific packages to help you interact with the payloads: -- [@cipherstash/eql](https://github.com/cipherstash/encrypt-query-language/tree/main/languages/javascript/packages/eql): This is a TypeScript implementation of EQL. 
-- [github.com/cipherstash/goeql](https://github.com/cipherstash/goeql): This is a Go implementation of EQL +- **JavaScript/TypeScript**: [@cipherstash/eql](https://github.com/cipherstash/encrypt-query-language/tree/main/languages/javascript/packages/eql) +- **Go**: [github.com/cipherstash/goeql](https://github.com/cipherstash/goeql) ## Releasing diff --git a/cipherstash-proxy/cipherstash-proxy.toml.example b/cipherstash-proxy/cipherstash-proxy.toml.example deleted file mode 100644 index 31961d33..00000000 --- a/cipherstash-proxy/cipherstash-proxy.toml.example +++ /dev/null @@ -1,25 +0,0 @@ -## For a complete list of configuration options -## see the documentation at https://cipherstash.com/docs/reference/proxy - -## Sign up for an account to create an access key: https://dashboard.cipherstash.com -workspace_id = "..." -client_access_key = "..." - -prometheus_metrics = true -query_logging = true -unsafe_logging = true - -[encryption] -mode = "encrypted" -client_id = "..." -client_key = "..." - -[audit] -subscriber = "stdout" - -[database] -name = "..." -username = "..." -password = "..." -host = "..." -port = 5432 diff --git a/cipherstash-proxy/docker-compose.yaml b/cipherstash-proxy/docker-compose.yaml deleted file mode 100644 index 09ff9b82..00000000 --- a/cipherstash-proxy/docker-compose.yaml +++ /dev/null @@ -1,11 +0,0 @@ -name: eql -services: - cipherstash-proxy: - container_name: eql-cipherstash-proxy - ports: - - 6432:6432 - environment: - - LOG_LEVEL=debug - volumes: - - ./cipherstash-proxy.toml:/etc/cipherstash-proxy/cipherstash-proxy.toml - image: cipherstash/cipherstash-proxy:cipherstash-proxy-v0.1.1 diff --git a/WHY.md b/docs/concepts/WHY.md similarity index 100% rename from WHY.md rename to docs/concepts/WHY.md diff --git a/docs/reference/INDEX.md b/docs/reference/INDEX.md new file mode 100644 index 00000000..a80d236d --- /dev/null +++ b/docs/reference/INDEX.md @@ -0,0 +1,241 @@ +# EQL index configuration + +The following functions allow you to configure indexes for encrypted columns. +All these functions modify the `cs_configuration_v1` table in your database, and is added during the EQL installation. + +> **IMPORTANT:** When you modify or add an index, you must re-encrypt data that's already been stored in the database. +The CipherStash encryption solution will encrypt the data based on the current state of the configuration. + +### Adding an index (`cs_add_index`) + +Add an index to an encrypted column. + +```sql +SELECT cs_add_index_v1( + 'table_name', -- Name of the table + 'column_name', -- Name of the column + 'index_name', -- Index kind ('unique', 'match', 'ore', 'ste_vec') + 'cast_as', -- PostgreSQL type to cast decrypted data ('text', 'int', etc.) + 'opts' -- Index options as JSONB (optional) +); +``` + +| Parameter | Description | Notes | +| ------------- | -------------------------------------------------- | ------------------------------------------------------------------------ | +| `table_name` | Name of target table | Required | +| `column_name` | Name of target column | Required | +| `index_name` | The index kind | Required. | +| `cast_as` | The PostgreSQL type decrypted data will be cast to | Optional. 
Defaults to `text` | +| `opts` | Index options | Optional for `match` indexes, required for `ste_vec` indexes (see below) | + +#### Option (`cast_as`) + +Supported types: + +- `text` +- `int` +- `small_int` +- `big_int` +- `boolean` +- `date` +- `jsonb` + +#### Options for match indexes (`opts`) + +A match index enables full text search across one or more text fields in queries. + +The default Match index options are: + +```json + { + "k": 6, + "m": 2048, + "include_original": true, + "tokenizer": { + "kind": "ngram", + "token_length": 3 + } + "token_filters": { + "kind": "downcase" + } + } +``` + +- `tokenFilters`: a list of filters to apply to normalize tokens before indexing. +- `tokenizer`: determines how input text is split into tokens. +- `m`: The size of the backing [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in bits. Defaults to `2048`. +- `k`: The maximum number of bits set in the bloom filter per term. Defaults to `6`. + +**Token filters** + +There are currently only two token filters available: `downcase` and `upcase`. These are used to normalise the text before indexing and are also applied to query terms. An empty array can also be passed to `tokenFilters` if no normalisation of terms is required. + +**Tokenizer** + +There are two `tokenizer`s provided: `standard` and `ngram`. +`standard` simply splits text into tokens using this regular expression: `/[ ,;:!]/`. +`ngram` splits the text into n-grams and accepts a configuration object that allows you to specify the `tokenLength`. + +**m** and **k** + +`k` and `m` are optional fields for configuring [bloom filters](https://en.wikipedia.org/wiki/Bloom_filter) that back full text search. + +`m` is the size of the bloom filter in bits. `filterSize` must be a power of 2 between `32` and `65536` and defaults to `2048`. + +`k` is the number of hash functions to use per term. +This determines the maximum number of bits that will be set in the bloom filter per term. +`k` must be an integer from `3` to `16` and defaults to `6`. + +**Caveats around n-gram tokenization** + +While using n-grams as a tokenization method allows greater flexibility when doing arbitrary substring matches, it is important to bear in mind the limitations of this approach. +Specifically, searching for strings _shorter_ than the `tokenLength` parameter will not _generally_ work. + +If you're using n-gram as a token filter, then a token that is already shorter than the `tokenLength` parameter will be kept as-is when indexed, and so a search for that short token will match that record. +However, if that same short string only appears as a part of a larger token, then it will not match that record. +In general, therefore, you should try to ensure that the string you search for is at least as long as the `tokenLength` of the index, except in the specific case where you know that there are shorter tokens to match, _and_ you are explicitly OK with not returning records that have that short string as part of a larger token. + +#### Options for ste_vec indexes (`opts`) + +An ste_vec index on a encrypted JSONB column enables the use of PostgreSQL's `@>` and `<@` [containment operators](https://www.postgresql.org/docs/16/functions-json.html#FUNCTIONS-JSONB-OP-TABLE). + +An ste_vec index requires one piece of configuration: the `context` (a string) which is passed as an info string to a MAC (Message Authenticated Code). +This ensures that all of the encrypted values are unique to that context. 
+It is generally recommended to use the table and column name as a the context (e.g. `users/name`). + +Within a dataset, encrypted columns indexed using an `ste_vec` that use different contexts cannot be compared. +Containment queries that manage to mix index terms from multiple columns will never return a positive result. +This is by design. + +The index is generated from a JSONB document by first flattening the structure of the document such that a hash can be generated for each unique path prefix to a node. + +The complete set of JSON types is supported by the indexer. +Null values are ignored by the indexer. + +- Object `{ ... }` +- Array `[ ... ]` +- String `"abc"` +- Boolean `true` +- Number `123.45` + +For a document like this: + +```json +{ + "account": { + "email": "alice@example.com", + "name": { + "first_name": "Alice", + "last_name": "McCrypto" + }, + "roles": ["admin", "owner"] + } +} +``` + +Hashes would be produced from the following list of entries: + +```js +[ + [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")], + [ + Obj, + Key("account"), + Obj, + Key("name"), + Obj, + Key("first_name"), + String("Alice"), + ], + [ + Obj, + Key("account"), + Obj, + Key("name"), + Obj, + Key("last_name"), + String("McCrypto"), + ], + [Obj, Key("account"), Obj, Key("roles"), Array, String("admin")], + [Obj, Key("account"), Obj, Key("roles"), Array, String("owner")], +]; +``` + +Using the first entry to illustrate how an entry is converted to hashes: + +```js +[Obj, Key("account"), Obj, Key("email"), String("alice@example.com")]; +``` + +The hashes would be generated for all prefixes of the full path to the leaf node. + +```js +[ + [Obj], + [Obj, Key("account")], + [Obj, Key("account"), Obj], + [Obj, Key("account"), Obj, Key("email")], + [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")], + // (remaining leaf nodes omitted) +]; +``` + +Query terms are processed in the same manner as the input document. + +A query prior to encrypting & indexing looks like a structurally similar subset of the encrypted document, for example: + +```json +{ + "account": { + "email": "alice@example.com", + "roles": "admin" + } +} +``` + +The expression `cs_ste_vec_v1(encrypted_account) @> cs_ste_vec_v1($query)` would match all records where the `encrypted_account` column contains a JSONB object with an "account" key containing an object with an "email" key where the value is the string "alice@example.com". + +When reduced to a prefix list, it would look like this: + +```js +[ + [Obj], + [Obj, Key("account")], + [Obj, Key("account"), Obj], + [Obj, Key("account"), Obj, Key("email")], + [Obj, Key("account"), Obj, Key("email"), String("alice@example.com")][ + (Obj, Key("account"), Obj, Key("roles")) + ], + [Obj, Key("account"), Obj, Key("roles"), Array], + [Obj, Key("account"), Obj, Key("roles"), Array, String("admin")], +]; +``` + +Which is then turned into an ste_vec of hashes which can be directly queries against the index. + +### Modifying an index (`cs_modify_index`) + +Modifies an existing index configuration. +Accepts the same parameters as `cs_add_index` + +```sql +SELECT cs_modify_index_v1( + table_name text, + column_name text, + index_name text, + cast_as text, + opts jsonb +); +``` + +### Removing an index (`cs_remove_index`) + +Removes an index configuration from the column. 
+ +```sql +SELECT cs_remove_index_v1( + table_name text, + column_name text, + index_name text +); +``` \ No newline at end of file diff --git a/JSON.md b/docs/reference/JSON.md similarity index 100% rename from JSON.md rename to docs/reference/JSON.md diff --git a/MIGRATOR.md b/docs/reference/MIGRATOR.md similarity index 100% rename from MIGRATOR.md rename to docs/reference/MIGRATOR.md diff --git a/NATIVE_POSTGRES_JSON_COMPARED_TO_EQL.md b/docs/reference/NATIVE_POSTGRES_JSON_COMPARED_TO_EQL.md similarity index 100% rename from NATIVE_POSTGRES_JSON_COMPARED_TO_EQL.md rename to docs/reference/NATIVE_POSTGRES_JSON_COMPARED_TO_EQL.md diff --git a/GETTINGSTARTED.md b/docs/tutorials/GETTINGSTARTED.md similarity index 100% rename from GETTINGSTARTED.md rename to docs/tutorials/GETTINGSTARTED.md diff --git a/PROXY.md b/docs/tutorials/PROXY.md similarity index 100% rename from PROXY.md rename to docs/tutorials/PROXY.md diff --git a/playground/.envrc.example b/playground/.envrc.example new file mode 100644 index 00000000..ec00117a --- /dev/null +++ b/playground/.envrc.example @@ -0,0 +1,5 @@ +export CS_WORKSPACE_ID=1234 +export CS_CLIENT_ACCESS_KEY=1234 +export CS_ENCRYPTION__CLIENT_ID=1234 +export CS_ENCRYPTION__CLIENT_KEY=1234 +export CS_DATASET_ID=1234 \ No newline at end of file diff --git a/cipherstash-proxy/dataset.yml b/playground/dataset.yml similarity index 100% rename from cipherstash-proxy/dataset.yml rename to playground/dataset.yml diff --git a/playground/db/Dockerfile b/playground/db/Dockerfile new file mode 100644 index 00000000..c36f5771 --- /dev/null +++ b/playground/db/Dockerfile @@ -0,0 +1,8 @@ +FROM curlimages/curl:7.85.0 as fetch-eql +WORKDIR /out +RUN curl -sLo /out/cipherstash-encrypt.sql https://github.com/cipherstash/encrypt-query-language/releases/download/eql-0.4.2/cipherstash-encrypt.sql + +FROM postgres:16.2-bookworm as db +WORKDIR /app +COPY init.sh /docker-entrypoint-initdb.d +COPY --from=fetch-eql /out/cipherstash-encrypt.sql /app/scripts/db/cipherstash-encrypt.sql diff --git a/playground/db/init.sh b/playground/db/init.sh new file mode 100644 index 00000000..584b3f58 --- /dev/null +++ b/playground/db/init.sh @@ -0,0 +1,3 @@ +#!/bin/bash + +psql -U $POSTGRES_USER -d $POSTGRES_DB -a -f /app/scripts/db/cipherstash-encrypt.sql \ No newline at end of file diff --git a/playground/docker-compose.yml b/playground/docker-compose.yml new file mode 100644 index 00000000..5ba8a4d8 --- /dev/null +++ b/playground/docker-compose.yml @@ -0,0 +1,41 @@ +services: + postgres: + container_name: eql-playground-pg + build: + context: ./db + command: [ "postgres", "-c", "log_statement=all" ] + environment: + POSTGRES_USER: postgres + POSTGRES_PASSWORD: postgres + POSTGRES_DB: postgres + ports: + - ${PGPORT:-5432}:5432 + networks: + - eql-playground-nw + proxy: + container_name: postgres_proxy + image: cipherstash/cipherstash-proxy:cipherstash-proxy-v0.3.4 + depends_on: + - postgres + ports: + - ${CS_PORT:-6432}:${CS_PORT:-6432} + environment: + CS_WORKSPACE_ID: $CS_WORKSPACE_ID + CS_CLIENT_ACCESS_KEY: $CS_CLIENT_ACCESS_KEY + CS_ENCRYPTION__CLIENT_ID: $CS_ENCRYPTION__CLIENT_ID + CS_ENCRYPTION__CLIENT_KEY: $CS_ENCRYPTION__CLIENT_KEY + CS_ENCRYPTION__DATASET_ID: $CS_DATASET_ID + CS_TEST_ON_CHECKOUT: "true" + CS_AUDIT__ENABLED: "false" + CS_DATABASE__PORT: 5432 + CS_DATABASE__USERNAME: postgres + CS_DATABASE__PASSWORD: postgres + CS_DATABASE__NAME: postgres + CS_DATABASE__HOST: eql-playground-pg + CS_UNSAFE_LOGGING: "true" + networks: + - eql-playground-nw + +networks: + eql-playground-nw: + 
driver: bridge
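+
+# Example usage — a minimal sketch, assuming Docker Compose v2 and that the variables from
+# .envrc.example are exported in your shell (ports and credentials below come from this file):
+#
+#   docker compose up --build
+#   psql postgresql://postgres:postgres@localhost:6432/postgres   # via CipherStash Proxy (plaintext in/out)
+#   psql postgresql://postgres:postgres@localhost:5432/postgres   # direct to PostgreSQL (ciphertext payloads only)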