diff --git a/README.md b/README.md index 6af10fe..7f87690 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,21 @@ # Active Consistency Engine (ACE) [![Go Integration Tests](https://github.com/pgEdge/ace/actions/workflows/test.yml/badge.svg?branch=main)](https://github.com/pgEdge/ace/actions/workflows/test.yml) -The Active Consistency Engine (ACE) is a tool designed to ensure eventual consistency between nodes in a pgEdge cluster. For more information, please refer to the official [pgEdge docs on ACE](https://docs.pgedge.com/platform/ace). - - - +The Active Consistency Engine (ACE) is a tool designed to ensure eventual consistency between nodes in a pgEdge cluster. + +## Table of Contents +- [ACE Overview](docs/index.md) +- [Understanding ACE Use Cases](docs/ace_use_cases.md) +- [Building the ACE Extension](README.md#building) +- [Basic Configuration](README.md#configuration) +- [ACE Quickstart](README.md#quickstart) +- [Advanced ACE Configuration](docs/configuring.md) +- [Using Merkle Trees with ACE](docs/merkle.md) +- [Using ACE Functions](docs/ace_functions.md) +- [Using the ACE API](docs/ace_api.md) +- [API Reference](docs/api.md) +- [Scheduling ACE](docs/schedule_ace.md) +- [Building the ACE Documentation](README.md#building-the-documentation) ## Building @@ -24,7 +35,12 @@ To build ACE, you need to have Go (version 1.18 or higher) installed. ## Configuration -ACE requires a cluster configuration file to connect to the database nodes. Please refer to the [pgEdge docs](https://docs.pgedge.com/platform/installing_pgedge/json) on how to create this file. +ACE requires a cluster configuration file to connect to the database nodes. [Create and update a .json file](https://docs.pgedge.com/platform/installing_pgedge/json) that describes the cluster you will be managing with ACE, and place the file in `cluster/cluster_name/cluster_name.json` on the ACE host. 
For example, if your cluster name is `us_eu_backend`, the cluster definition file should be placed in `/pgedge/cluster/us_eu_backend/us_eu_backend.json`. The .json file must: + + * Contain connection information for each node in the cluster. + * Identify the user that will be invoking ACE commands in the `db_user` property; this user must also be the table owner. + +After ensuring that the .json file describes your cluster connections and identifies the ACE user, you're ready to use [ACE functions](docs/ace_functions.md). ## Quickstart @@ -121,3 +137,36 @@ If differences are found, you can repair them using the `table-repair` command, ``` The Merkle trees can be kept up-to-date automatically by running the `mtree listen` command, which uses Change Data Capture (CDC) with the `pgoutput` output plugin to track row changes. Performing the `mtree table-diff` will update the Merkle tree even if `mtree listen` is not used. + +### Building the Documentation + +The documentation uses [MkDocs](https://www.mkdocs.org) with the [Material theme](https://squidfunk.github.io/mkdocs-material/) to generate styled static HTML documentation from Markdown files in the `docs` directory. + +To build the documentation and run a development server for live previewing: + +1) Create a Python virtual environment: + ```bash + python3 -m venv spock-docs-venv + ``` + +2) Activate the virtual environment: + ```bash + source spock-docs-venv/bin/activate + ``` + +3) Install MkDocs: + ```bash + pip install mkdocs mkdocs-material + ``` + +4) Run the local MkDocs server for testing: + ```bash + mkdocs serve + INFO - Building documentation... + INFO - Multirepo plugin importing docs... 
+ INFO - Cleaning site directory + INFO - Multirepo plugin is cleaning up temp_dir/ + INFO - Documentation built in 0.18 seconds + INFO - [14:32:14] Watching paths for changes: 'docs', 'mkdocs.yml' + INFO - [14:32:14] Serving on http://127.0.0.1:8000/ + ``` \ No newline at end of file diff --git a/docs/ace_api.md b/docs/ace_api.md new file mode 100644 index 0000000..c0bc363 --- /dev/null +++ b/docs/ace_api.md @@ -0,0 +1,337 @@ + +## ACE API Endpoints + +ACE includes API endpoints for some of its most frequently used functions. + +## API Reference + +ACE provides a REST API for programmatic access. The API server runs on localhost:5000 by default. An SSH tunnel is required to access the API from outside the host machine for security purposes. + +### Configuring Authentication for ACE API Use + +You must also configure client-based certificate authentication before using the ACE API. + +You should create a client certificate separately for ACE with all necessary privileges on tables, schemas, and databases that you want to use with ACE. This user should preferably be a superuser since ACE may need elevated privileges during diffs and repairs. + +Each external user should have their own client certificate to use the API, typically with lower privileges. ACE will attempt to use `SET ROLE` to switch to the external user's role before performing any operations, thus ensuring that diffs and repairs happen with the external user's privileges when possible. + +**Creating a Certificate File** + +Please refer to [Postgres documentation](https://www.postgresql.org/docs/17/ssl-tcp.html#SSL-CERTIFICATE-CREATION) for information about setting up server and client certificates. 
Please note that after creating your certificate, you must provide details in the following [postgresql.conf parameters](https://www.postgresql.org/docs/17/ssl-tcp.html#SSL-SERVER-FILES): + +* `ssl = on` +* `ssl_ca_file` +* `ssl_cert_file` +* `ssl_key_file` + +You must also enable `cert` authentication in the [pg_hba.conf](https://www.postgresql.org/docs/17/auth-pg-hba-conf.html) file: + +`hostssl all all 192.168.0.0/16 cert` + +After configuring cert authentication for pgEdge Postgres, restart the Postgres server for the changes to be applied. + +Then, provide the following certificate information in the [ace_config.py file](../ace/installing_ace.md#the-ace-configuration-file-ace_configpy) (by default, located in `$PGEDGE_HOME/hub/scripts/`). Client-cert-based auth is a *required* option for using the ACE APIs. It can optionally be used with the CLI modules as well. Specify your preferences in the `Cert-based auth options` section: + +```bash +""" +USE_CERT_AUTH = False +ACE_USER_CERT_FILE = "" +ACE_USER_KEY_FILE = "" +CA_CERT_FILE = "" +``` + +* `USE_CERT_AUTH` (default=False) a boolean value that indicates if ACE should use client-cert based authentication when connecting to the Postgres nodes. +* `ACE_USER_CERT_FILE` (default="") is the path to the certificate file (.crt) of the certificate bundle issued to the ACE user. +* `ACE_USER_KEY_FILE` (default="") is the path to the key file (.key) of the certificate bundle issued to the ACE user. +* `CA_CERT_FILE` (default="") is the path to the certificate file (.crt) of the certificate authority that was used to issue certificates. + +After creating the certificates and providing information in the configuration files, you're ready to use the ACE API. + +!!! note + + If you're already running the ACE process, and need to modify the `ace_config.py` file, use `Ctrl+C` to stop the process before making changes. + +### The table-diff API + +Initiates a table diff operation. 
+ +**Endpoint:** `POST /ace/table-diff` + +**Request Body:** +```json +{ + "cluster_name": "my_cluster", // required + "table_name": "public.users", // required + "dbname": "mydb", // optional + "block_rows": 10000, // optional, default: 10000 + "max_cpu_ratio": 0.8, // optional, default: 0.6 + "output": "json", // optional, default: "json" + "nodes": "all", // optional, default: "all" + "batch_size": 50, // optional, default: 1 + "table_filter": "id < 1000", // optional + "quiet": false // optional, default: false +} +``` + +**Parameters:** +- `cluster_name` (required): Name of the cluster +- `table_name` (required): Fully qualified table name (schema.table) +- `dbname` (optional): Database name +- `block_rows` (optional): Number of rows per block (default: 10000) +- `max_cpu_ratio` (optional): Maximum CPU usage ratio (default: 0.6) +- `output` (optional): Output format ["json", "csv", "html"] (default: "json") +- `nodes` (optional): Nodes to include ("all" or comma-separated list; default: "all") +- `batch_size` (optional): Batch size for processing (default: 1) +- `table_filter` (optional): SQL WHERE clause to filter rows for comparison +- `quiet` (optional): Suppress output (default: false) + +**Example Request:** +```bash +curl -X POST "http://localhost:5000/ace/table-diff" \ + -H "Content-Type: application/json" \ + --cert /path/to/client.crt \ + --key /path/to/client.key \ + -d '{ + "cluster_name": "my_cluster", + "table_name": "public.users", + "output": "html" + }' +``` + +**Example Response:** +```json +{ + "task_id": "td_20240315_123456", + "submitted_at": "2024-03-15T12:34:56.789Z" +} +``` + +### The table-repair API + +Initiates a table repair operation. 
+ +**Endpoint:** `POST /ace/table-repair` + +**Request Body:** +```json +{ + "cluster_name": "my_cluster", // required + "diff_file": "/path/to/diff.json", // required + "source_of_truth": "primary", // required unless fix_nulls is true + "table_name": "public.users", // required + "dbname": "mydb", // optional + "dry_run": false, // optional, default: false + "quiet": false, // optional, default: false + "generate_report": false, // optional, default: false + "upsert_only": false, // optional, default: false + "insert_only": false, // optional, default: false + "bidirectional": false, // optional, default: false + "fix_nulls": false, // optional, default: false + "fire_triggers": false // optional, default: false +} +``` + +**Parameters:** +- `cluster_name` (required): Name of the cluster +- `diff_file` (required): Path to the diff file +- `source_of_truth` (required unless `fix_nulls` is true): Source node for repairs +- `table_name` (required): Fully qualified table name +- `dbname` (optional): Database name +- `dry_run` (optional): Simulate repairs (default: false) +- `quiet` (optional): Suppress output (default: false) +- `generate_report` (optional): Create detailed report (default: false) +- `upsert_only` (optional): Skip deletions (default: false) +- `insert_only` (optional): Perform inserts only, skipping updates and deletions (default: false) +- `bidirectional` (optional): Insert missing rows in a bidirectional manner (default: false) +- `fix_nulls` (optional): Fix NULL values by comparing across nodes (default: false) +- `fire_triggers` (optional): Fire triggers when ACE performs a repair; note that `ENABLE ALWAYS` triggers will always fire. 
(default: false) + + +**Example Request:** +```bash +curl -X POST "http://localhost:5000/ace/table-repair" \ + -H "Content-Type: application/json" \ + --cert /path/to/client.crt \ + --key /path/to/client.key \ + -d '{ + "cluster_name": "my_cluster", + "diff_file": "/path/to/diff.json", + "source_of_truth": "primary", + "table_name": "public.users" + }' +``` + +**Example Response:** +```json +{ + "task_id": "tr_20240315_123456", + "submitted_at": "2024-03-15T12:34:56.789Z" +} +``` + +### The table-rerun API + +Reruns a previous table diff operation. + +**Endpoint:** `POST /ace/table-rerun` + +**Request Body:** +```json +{ + "cluster_name": "my_cluster", // required + "diff_file": "/path/to/diff.json", // required + "table_name": "public.users", // required + "dbname": "mydb", // optional + "quiet": false, // optional, default: false + "behavior": "multiprocessing" // optional, default: "multiprocessing" +} +``` + +**Parameters:** +- `cluster_name` (required): Name of the cluster +- `diff_file` (required): Path to the previous diff file +- `table_name` (required): Fully qualified table name +- `dbname` (optional): Database name +- `quiet` (optional): Suppress output (default: false) +- `behavior` (optional): Processing behavior ["multiprocessing", "hostdb"] + +**Example Request:** +```bash +curl -X POST "http://localhost:5000/ace/table-rerun" \ + -H "Content-Type: application/json" \ + --cert /path/to/client.crt \ + --key /path/to/client.key \ + -d '{ + "cluster_name": "my_cluster", + "diff_file": "/path/to/diff.json", + "table_name": "public.users" + }' +``` + +**Example Response:** +```json +{ + "task_id": "tr_20240315_123456", + "submitted_at": "2024-03-15T12:34:56.789Z" +} +``` + +### The task-status API + +Retrieves the status of a submitted task. 
+ +**Endpoint:** `GET /ace/task-status/` + +**Parameters:** +- `task_id` (required): The ID of the task to check + +**Example Request:** +```bash +curl "http://localhost:5000/ace/task-status?task_id=td_20240315_123456" \ + --cert /path/to/client.crt \ + --key /path/to/client.key +``` + +**Example Response:** +```json +{ + "task_id": "td_20240315_123456", + "task_type": "table-diff", + "status": "COMPLETED", + "started_at": "2024-03-15T12:34:56.789Z", + "finished_at": "2024-03-15T12:35:01.234Z", + "time_taken": 4.445, + "result": { + "diff_file": "/path/to/output.json", + "total_rows": 10000, + "mismatched_rows": 5, + "summary": { + // Additional task-specific details + } + } +} +``` + +### Spock Exception Update API + +Updates the status of a Spock exception. + +**Endpoint:** `POST /ace/update-spock-exception` + +**Request Body:** +```json +{ + "cluster_name": "my_cluster", // required + "node_name": "node1", // required + "dbname": "mydb", // optional + "exception_details": { // required + "remote_origin": "origin_oid", // required + "remote_commit_ts": "2024-03-15T12:34:56Z", // required + "remote_xid": "123456", // required + "command_counter": 1, // optional + "status": "RESOLVED", // required + "resolution_details": { // optional + "details": "Issue fixed" + } + } +} +``` + +**Parameters:** +- `cluster_name` (required): Name of the cluster +- `node_name` (required): The name of the node +- `dbname` (optional): The name of the database +- `exception_details` (required) + - `remote_origin` (optional): The OID of the origin + - `remote_commit_ts` (optional): The timestamp of the exception + - `remote_xid` (optional): The XID of the transaction + - `command_counter` (optional): The number of commands executed + - `status` (optional): The current state of the exception + - `resolution_details` (optional): + - `details`: Include details about the exception + +**Example Request:** +```bash +curl -X POST "http://localhost:5000/ace/update-spock-exception" \ + -H 
"Content-Type: application/json" \ + --cert /path/to/client.crt \ + --key /path/to/client.key \ + -d '{ + "cluster_name": "my_cluster", + "node_name": "node1", + "exception_details": { + "remote_origin": "origin1", + "remote_commit_ts": "2024-03-15T12:34:56Z", + "remote_xid": "123456", + "status": "RESOLVED" + } + }' +``` + +**Example Response:** +```json +{ + "message": "Exception status updated successfully" +} +``` + +## API Error Responses + +ACE API endpoints return error responses in the following format: + +```json +{ + "error": "Description of what went wrong" +} +``` + +Common HTTP status codes: +- 200: Success +- 400: Bad Request (missing or invalid parameters) +- 401: Unauthorized (missing or invalid client certificate) +- 415: Unsupported Media Type (request body is not JSON) +- 500: Internal Server Error + + + + diff --git a/docs/ace_functions.md b/docs/ace_functions.md new file mode 100644 index 0000000..75de3f3 --- /dev/null +++ b/docs/ace_functions.md @@ -0,0 +1,411 @@ +# ACE Functions + +ACE provides functions that compare the data from one object to the data on other object, and optionally repairs the differences it finds. ACE functions include: + +| Command | Description | +|---------|-------------| +| [ACE table-diff](#ace-table-diff) | Compare two tables to identify differences. | +| [ACE repset-diff](#ace-repset-diff) | Compare two replication sets to identify differences. | +| [ACE schema-diff](#ace-schema-diff) | Compare two schemas to identify differences | +| [ACE spock-diff](#ace-spock-diff) | Compare two sets of spock meta-data to identify differences. | +| [ACE table-repair](#ace-table-repair) | Repair data inconsistencies identified by the table-diff function. | +| [ACE table-rerun](#ace-table-rerun) | Rerun a diff to confirm that a fix has been correctly applied. 
| + + +## ACE Diff Functions + +ACE diff functions compare two objects and identify the differences; the output is a report that contains: + +- Summary of compared rows +- Mismatched data details +- Node-specific statistics +- Error logs (if any) + +If you generate an html report, ACE generates an interactive report with: +- Colour-coded differences +- Expandable row details +- Primary key highlighting +- Missing row indicators + +Common use cases for the ACE diff functions include: + + * Performing routine content verification. + * Performing a performance-optimized large table scan. + * Performing a focused comparison between nodes, tables, or schemas. + +As a best practice, you should experiment with different block sizes and CPU utilisation to find the best performance/resource-usage balance for your workload. Use `--table-filter` on large tables to reduce the comparison scope, and generate HTML reports to make differences easier to analyse. + +As you work, ensure that diffs have not overrun the `MAX_ALLOWED_DIFFS` limit; if your diffs surpass this limit, `table-repair` will only be able to partially repair the table. + +### ACE table-diff + +Use the `table-diff` command to compare the tables in a cluster and produce a csv, json, or html report showing any differences. + +The syntax is: + +`$ ./pgedge ace table-diff cluster_name schema.table_name [options]` + +* `cluster_name` is the name of the pgEdge cluster in which the table resides. +* `schema.table_name` is the schema-qualified name of the table that you are comparing across cluster nodes. + +**Optional Arguments** + +Include the following optional arguments to customize ACE table-diff behavior: + +* `-d` or `--dbname` is a string value that specifies the database name; `dbname` defaults to the name of the first database in the cluster configuration. +* `--block-rows` is an integer value that specifies the number of rows to process per block. 
+ - Min: 1000 + - Max: 100000 + - Default: 10000 + - Higher values improve performance but increase memory usage. + - This is a configurable parameter in `ace_config.py`. +* `-m` or `--max-cpu-ratio` is a float value that specifies the maximum CPU utilisation; the accepted range is 0.0-1.0. The default is 0.6. + - This value is configurable in `ace_config.py`. +* `--batch-size` is an integer value that specifies the number of blocks to process per multiprocessing worker (default: `1`). + - The higher the number, the lower the parallelism. + - This value is configurable in `ace_config.py`. +* `-o` or `--output` specifies the output type; choose from `html`, `json`, or `csv` when including the `--output` option to select the output type for a report. By default, the report is written to `diffs//diffs_.json`. If the output mode is csv or html, ACE will generate colored diff files to highlight differences. +* `-n` or `--nodes` specifies a comma-delimited subset of nodes on which the command will be executed. ACE allows up to a three-way node comparison. We do not recommend simultaneously comparing more than three nodes at once. +* `-q` or `--quiet` suppresses messages about sanity checks and the progress bar in `stdout`. If ACE encounters no differences, ACE will exit without messages. Otherwise, it will print the differences to JSON in `stdout` (without writing to a file). +* `-t` or `--table-filter` is a `SQL WHERE` clause that allows you to filter rows for comparison. + +**ACE table-diff Command Examples** + +The following example reports a difference when comparing a table (`public.foo`) across all nodes and generates an html report: + +```bash +$ ./pgedge ace table-diff demo public.foo --output=html +✔ Cluster demo exists +✔ Connections successful to nodes in cluster +✔ Table public.foo is comparable across nodes +Getting primary key offsets for table... +Starting jobs to compare tables... 
+ + 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 [ 0:00:00 < 0:00:00 , ? it/s ] +⚠ TABLES DO NOT MATCH +⚠ FOUND 1 DIFFS BETWEEN n1 AND n2 +Diffs written out to diffs/2025-04-08/diffs_072159340.json +HTML report generated: diffs/2025-04-08/diffs_072159340.html +TOTAL ROWS CHECKED = 5 +RUN TIME = 0.40 seconds +``` + +The following example reports a difference when comparing a table (`public.foo`) across nodes `n1` and `n2`, with a custom block size (`50000`): + +```bash +$ ./pgedge ace table-diff demo public.foo --nodes="n1,n2" --block-rows=50000 +✔ Cluster demo exists +✔ Connections successful to nodes in cluster +✔ Table public.foo is comparable across nodes +Getting primary key offsets for table... +Starting jobs to compare tables... + + 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 [ 0:00:00 < 0:00:00 , ? it/s ] +⚠ TABLES DO NOT MATCH +⚠ FOUND 1 DIFFS BETWEEN n1 AND n2 +Diffs written out to diffs/2025-04-08/diffs_072804313.json +TOTAL ROWS CHECKED = 5 +RUN TIME = 0.40 seconds +``` + +### ACE repset-diff + +Use the `repset-diff` command to loop through the tables in a replication set and produce a csv, json, or html report showing any differences. The syntax is: + +`$ ./pgedge ace repset-diff cluster_name repset_name [options]` + +* `cluster_name` is the name of the cluster in which the replication set is a member. +* `repset_name` is the name of the replication set in which the tables being compared reside. + +**Optional Arguments** +* `-d` or `--dbname=db_name` is the name of the database in which to run the `repset-diff` command; the default is `none`. +* `-m` or `--max_cpu_ratio` specifies the percentage of CPU power you are allotting for use by ACE. A value of `1` instructs the server to use all available CPUs, while `.5` means use half of the available CPUs. The default is `.6` (or 60% of the CPUs). 
+* `--block_rows` specifies the number of tuples to be used at a time during table comparisons. If `block_rows` is set to `1000`, then a thousand tuples are compared per job across tables. +* `-o` or `--output` specifies the output type; choose from `html`, `json`, or `csv` when including the `--output` parameter to select the output type for a report. By default, the report is written to `diffs//diffs_.json`. If the output type is csv or html, ACE will generate coloured diff files to highlight differences. +* `-n` or `--nodes` specifies a comma-delimited list of nodes on which the command will be executed. +* `--batch-size` is an integer value that specifies the number of blocks to process per multiprocessing worker (default: `1`). +* `-q` or `--quiet` suppresses output from ACE; this defaults to `False`. +* `--skip_tables=table_name` instructs ACE to not evaluate the specified table for differences. +* `--skip_file=file_name` allows you to specify the name of a file that contains a list of tables that you would like to skip. + +**ACE repset-diff Example** + +The following example reports a difference when comparing the `default` repset across all nodes: + +```bash +$ ./pgedge ace repset-diff demo default +✔ Cluster demo exists +✔ Connections successful to nodes in cluster + +CHECKING TABLE public.foo... + +✔ Cluster demo exists +✔ Connections successful to nodes in cluster +✔ Table public.foo is comparable across nodes +Getting primary key offsets for table... +Starting jobs to compare tables... + + 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 [ 0:00:00 < 0:00:00 , ? 
it/s ] +⚠ TABLES DO NOT MATCH +⚠ FOUND 1 DIFFS BETWEEN n1 AND n2 +Diffs written out to diffs/2025-04-08/diffs_090241529.json +TOTAL ROWS CHECKED = 5 +RUN TIME = 0.40 seconds +``` + + +### ACE schema-diff + +Use the `schema-diff` command to compare the schemas in a cluster and report differences in .json format; the syntax is: + +`$ ./pgedge ace schema-diff cluster_name schema_name [options]` + +* `cluster_name` is the name of the cluster in which the schema resides. +* `schema_name` is the name of the schema you will be comparing. + +**Optional Arguments** +* `-n=node_list` specifies a list of nodes on which the schema will be compared; `node_list` is a comma-delimited list of node names. If omitted, the default is all nodes. +* `--dbname=db_name` specifies the name of the database in which you would like to run the diff; defaults to `none`. +* `--ddl_only` instructs ACE to check for only DDL differences. +* `--skip_tables=table_name` instructs ACE to not evaluate the specified table for differences. +* `--skip_file=file_name` allows you to specify the name of a file that contains a list of tables that you would like to skip. +* `-q` or `--quiet` suppresses output from ACE; this defaults to `False`. + +**ACE schema-diff Example** + +The following example demonstrates using the `schema-diff` command to check for differences in the `public` schema in a cluster named `demo` on nodes `n1` and `n2`: + +```bash +$ ./pgedge ace schema-diff demo public -nodes=n1,n2 +✔ Cluster demo exists +✔ Connections successful to nodes in cluster + +Comparing nodes 127.0.0.1:6432 and 127.0.0.1:6433: +✔ No differences found +``` + +### ACE spock-diff + +Use the `spock-diff` command to compare the Spock meta-data on two cluster nodes and produce a report showing any differences. The syntax is: + +`$ ./pgedge ace spock-diff cluster_name [options]` + +* `cluster_name` is the name of the cluster whose nodes you are comparing. 
+ +**Optional Arguments** +* `-n=node_list` specifies a list of nodes on which spock will be compared; `node_list` is a comma-delimited list of node names. If omitted, the default is all nodes. +* `-q` or `--quiet` suppresses output from ACE; this defaults to `False`. + + +## ACE table-repair + +The `ACE table-repair` function fixes data inconsistencies identified by the `table-diff` function. ACE table-repair uses a specified node as the source of truth to correct data on other nodes. Common use cases for `table-repair` include: + + * **Spock Exception Repair** for exceptions arising from insert/update/delete conflicts +during replication. + * **Network Partition Repair** to restore consistency across nodes after a network partition. + * **Temporary Node Outage Repair** to bring a node up to speed after a temporary outage. + +The function has a number of safety and audit features that you should consider before invoking the command: + + * **Dry run mode** allows you to test repairs without making changes. + * **Report generation** produces a detailed repair audit trail of all changes made. + * **Upsert-only mode** prevents data deletion. + * **Transaction safety** ensures that all changes are atomic. If your repair fails midway for any reason, the entire transaction will be rolled back, and no changes will be made to the database. + +When using `table-repair`, remember that: + + * Table-repair is intended to be used to repair differences that arise from incidents such as spock exceptions, network partition irregularities, temporary node outages, etc. If the 'blast radius' of a failure event is too large (say, millions of records across several tables), we recommend a dump and restore using native Postgres tooling instead, even though table-repair can handle a repair of that size. + * Table-repair can only repair rows found in the diff file. If your diff exceeds `MAX_ALLOWED_DIFFS`, table-repair will only be able to partially repair the table. 
This may even be desirable if you want to repair the table in batches; you can perform a `diff->repair->diff->repair` cycle until no more differences are reported. + * You should invoke `ACE table-repair` with `--dry-run` first to review proposed changes. + * Use `--upsert-only` or `--insert-only` for critical tables where data deletion may be risky. + * You should verify your table structure and constraints before repair. + +The command syntax is: + +```bash +./pgedge ace table-repair --diff-file= <--source-of-truth>[options] +``` + +* `cluster_name` is the name of the cluster in which the table resides. +* `diff_file` is the path and name of the file that contains the table differences. +* `schema.table_name` is the schema-qualified name of the table that you are repairing. +* `-s` or `--source-of-truth` is a string value specifying the node name to use as the source of truth for repairs. Note: If you are performing a repair that specifies the `--bidirectional` or `--fix-nulls` option, the `--source-of-truth` is not required. + +**Optional Arguments** +* `--dry-run` is a boolean value that simulates repair operations without making changes. The default is `false`. +* `--upsert_only` (or `-u`) - Set this option to `true` to specify that ACE should make only additions to the *non-source of truth nodes*, skipping any `DELETE` statements that may be needed to make the data match. This option does not guarantee that nodes will match when the command completes, but can be useful if you want to merge the contents of different nodes. The default value is `false`. +* `--generate_report` (or `-g`) - Set this option to `true` to generate a .json report of the actions performed; Reports are written to files identified by a timestamp in the format: `reports//report_`.json. The default is `false`. +* `--dbname` is a string value that specifies the database name; dbname defaults to `none`. +* `--quiet` is a boolean value that suppresses non-essential output. 
The default is `false`. +* `--generate-report` is a boolean value that instructs the server to create a detailed report of repair operations. The default is `false`. +* `--upsert-only` is a boolean value that instructs the server to only perform inserts/updates, and skip deletions. The default is `false`. +* `-i` or `--insert-only` is a boolean value that instructs the server to only perform inserts, and skip updates and deletions. Note: This option uses `INSERT INTO ... ON CONFLICT DO NOTHING`. If rows that share a key hold different values on different nodes, this option alone is not enough to fully repair the table. The default is `false`. +* `-b` or `--bidirectional` is a boolean value that must be used with `--insert-only`. Like `--insert-only`, it inserts missing rows, but it does so in both directions: ACE applies the differences found between nodes to create a *distinct union* of the content. In a distinct union, each row that is missing is recreated on the node from which it is missing, eventually leading to a data set (on all nodes) in which all rows are represented exactly once. For example, if you are performing a repair in a case where node A has rows with IDs 1, 2, 3 and node B has rows with IDs 2, 3, 4, the repair will ensure that both node A and node B have rows with IDs 1, 2, 3, and 4. +- `--fix-nulls` is a boolean value that instructs the server to fix NULL values by comparing values across nodes. For example, if you have an issue where a column is not being replicated, you can use this option to fix the NULL values on the target nodes. This option does not need a source of truth node because it consults the diff file to determine which rows have NULL values. Use it for this special case only, not for other types of data inconsistencies. 
+- `--fire-triggers` is a boolean value that instructs triggers to fire when ACE performs a repair; note that `ENABLE ALWAYS` triggers will fire regardless of the value of `--fire-triggers`. The default is `false`. + + +**ACE table-repair Command Examples** + +The following commands first perform a table-repair dry run of the `public.foo` table, specifying a diff file (`--diff-file=diffs/2025-04-08/diffs_090241529.json`) and using node `n1` as the source of truth: + +```bash +[rocky@ip-172-31-15-12 pgedge]$ ./pgedge ace table-repair demo public.foo --diff-file=diffs/2025-04-08/diffs_090241529.json --source-of-truth=n1 --dry_run=True +✔ Cluster demo exists +✔ Connections successful to nodes in cluster +######## DRY RUN ######## + +Repair would have attempted to upsert 0 rows and delete 1 row on n2 + +######## END DRY RUN ######## +``` +After performing the dry run, we change the `--dry_run` flag to `False`, confirming that we want to apply the changes we reviewed in the first command iteration: + +```bash +$ ./pgedge ace table-repair demo public.foo --diff-file=diffs/2025-04-08/diffs_090241529.json -s=n1 --dry_run=False +✔ Cluster demo exists +✔ Connections successful to nodes in cluster +✔ Successfully applied diffs to public.foo in cluster demo + +*** SUMMARY *** + +n2 UPSERTED = 0 rows + +n2 DELETED = 1 rows +RUN TIME = 0.00 seconds +``` + +The following example performs a unidirectional insert-only repair on the `public.foo` table. 
In a situation where node 2 is missing a row when compared to node 1, including the `--insert-only` option inserts the missing rows from `node 1` to `node 2`: + +```bash +$ ./pgedge ace table-repair demo public.foo diffs/2025-04-09/diffs_101804246.json --source-of-truth=n1 --insert-only=True +✔ Cluster demo exists +✔ Connections successful to nodes in cluster +✔ Successfully applied diffs to public.foo in cluster demo + +*** SUMMARY *** + +n2 INSERTED = 1 rows + +RUN TIME = 0.00 seconds +``` + +The following example performs a bidirectional insert-only repair. If you have a network partition between `node 1` and `node 2`, and they each separately received new records, including the `--bidirectional` option will insert the missing records from `node 1` to `node 2` and vice versa: + +```bash +$ ./pgedge ace table-repair demo public.foo diffs/2025-04-09/diffs_103544698.json --source-of-truth=n1 --insert-only=True --bidirectional=True +✔ Cluster demo exists +✔ Connections successful to nodes in cluster + +Performing bidirectional repair: +Overall progress: 0%| | 0/100 [00:00 --diff_file=/path/to/diff_file.json schema.table_name` + +* `cluster_name` is a string value that specifies the name of the cluster as defined in your configuration file. +* `schema.table_name` is a string value that specifies the fully qualified table name (e.g., "public.users"). +* `diff_file` is a string value that specifies the path to the JSON diff file from a previous table-diff operation. + +**Optional Arguments** + +* `-d` or `--dbname` is a string value that specifies the database name; this defaults to the first database in the cluster config file. +* `-q` or `--quiet` is a boolean value that suppresses non-essential output. +* `-b` or `--behavior` is a string value that specifies the processing behavior [`multiprocessing` or `hostdb`]. + - `multiprocessing` (the default) uses parallel processing for faster execution. 
+ - `hostdb` uses the host database to create temporary tables for faster comparisons. This is useful for very large tables and diffs. + +**ACE table-rerun Command Examples** + +To perform a table-rerun of a previous diff (specifying a diff file with the `--diff-file=diffs/2025-04-08/diffs_090241529.json` clause): + +```bash +$ ./pgedge ace table-rerun demo --diff-file=diffs/2025-04-08/diffs_090241529.json public.foo +✔ Cluster demo exists +✔ Connections successful to nodes in cluster +✔ Table public.foo is comparable across nodes +Starting jobs to compare tables ... + + 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 [ 0:00:00 < 0:00:00 , ? it/s ] +✔ TABLES MATCH OK + +TOTAL ROWS CHECKED = 2 +RUN TIME = 0.40 seconds +``` + + + diff --git a/docs/ace_use_cases.md b/docs/ace_use_cases.md new file mode 100644 index 0000000..c1051b9 --- /dev/null +++ b/docs/ace_use_cases.md @@ -0,0 +1,25 @@ +# ACE (the Active Consistency Engine) + +In an eventually consistent system (like a cluster), nodes can potentially go out of sync (diverge) due to replication exceptions, replication lag, network partitions, or node failures. Node divergences can also arise out of planned maintenance windows or even unexpected cloud provider failures. The Active Consistency Engine (ACE) allows for reconciliation of such differences across databases by performing efficient node state comparisons and repairs in a highly controlled manner. + +You can invoke ACE functions on the [command line](../docs/ace_functions.md), or from an [API endpoint](../docs/ace_api.md). [Scheduling options](../docs/schedule_ace.md) in ACE make it convenient to run unmanned checks as part of routine maintenance to keep your cluster healthy. 
+ +![A cluster using ACE](./images/ace_cluster.png) + +## Common Use Cases for ACE + +**Node Failures:** When node outages happen, whether planned or unplanned, they may result in a node-state divergence when the node eventually rejoins the cluster. ACE is well suited to such cases; it can report the magnitude and the nature of the divergence, and also repair the node's state in a highly controlled manner. + +**Network Partitions:** In a typical pgEdge Distributed Postgres deployment, a database cluster is often spread across multiple regions and data centers. Due to the volatility that comes with a large-scale deployment, link degradation and other network-related issues are common. This volatility can result in Spock replication exceptions. You can use ACE to identify the precise set of affected records, and perform targeted repairs that bring a node back to a consistent state while causing little to no disruption to the application. + +**Planned Maintenance:** During planned maintenance windows, a node can fall behind its peers. Even within a cluster, you can use ACE to rapidly bring the node back up to speed by configuring the repair to perform inserts and updates en masse. + +**Post-Repair Verification:** You can use ACE's rerun functionality to confirm whether ACE has resolved previously identified diffs. + +**Confirm if Diffs Still Persist:** Use the rerun function to verify whether diffs identified by table-diff still exist after a replication lag window has elapsed. + +**Spock Exception Repair:** You can use repair functions to fix exceptions arising from insert/update/delete conflicts during replication. + +**Network Partition Repair:** You can use repair functions to restore consistency across nodes after repairing a network partition. + +**Temporary Node Outage Repair:** You can use repair functions to bring a cluster up to speed after a temporary node failure. 
diff --git a/docs/configuring.md b/docs/configuring.md new file mode 100644 index 0000000..a066c11 --- /dev/null +++ b/docs/configuring.md @@ -0,0 +1,285 @@ +# Configuring ACE Preferences with the ace_config.py File + +You can provide your preferences for ACE configuration options in the `ace_config.py` file (by default, created in `$PGEDGE_HOME/hub/scripts/`). You can use the configuration file to specify: + +* ACE operational preferences. +* Job and Schedule information for ACE jobs. +* ACE Auto Repair options +* SSL certificate details for API users. + +!!! hint + + If you're already running the ACE process, and need to modify the `ace_config.py` file, use `Ctrl+C` to stop the process before making changes. + +## Specifying ACE Operational Preferences + +Use properties in the `Postgres options` section of the `ace_config.py` file to specify your timeout preferences: + +```bash +# Postgres options +STATEMENT_TIMEOUT = 60000 # in milliseconds +CONNECTION_TIMEOUT = 10 # in seconds +``` + +* `STATEMENT_TIMEOUT` (default=60000) is equivalent to Postgres' `statement_timeout`. Aborts any query that takes more than the specified amount of time (in milliseconds). A value of 0 disables the timeout. Could be useful when running table-diff on very large tables to minimise performance impact on the database. +* `CONNECTION_TIMEOUT` (default=10): Equivalent to Postgres' `connect_timeout`. Maximum time to wait while connecting, in seconds. A value of 0 means that ACE will wait indefinitely to connect. 
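To show what these two settings correspond to at the driver level, the following minimal sketch maps them onto libpq connection keywords. It is an illustration only, not ACE's implementation: the `build_conn_params` helper, the host name, and the user name are hypothetical, and the sketch assumes the timeouts are passed through libpq's `connect_timeout` and `options` keywords.

```python
# Values as they appear in the Postgres options section of ace_config.py.
STATEMENT_TIMEOUT = 60000  # in milliseconds
CONNECTION_TIMEOUT = 10  # in seconds


def build_conn_params(host, port, dbname, user):
    """Hypothetical helper: map the two ACE timeouts onto libpq keywords."""
    return {
        "host": host,
        "port": port,
        "dbname": dbname,
        "user": user,
        # libpq's connect_timeout is expressed in seconds.
        "connect_timeout": CONNECTION_TIMEOUT,
        # statement_timeout is read as milliseconds when no unit is given.
        "options": f"-c statement_timeout={STATEMENT_TIMEOUT}",
    }


# Placeholder connection details, for illustration only.
params = build_conn_params("n1.example.com", 5432, "demo", "ace_user")
print(params["options"])
```

A dictionary like this could be handed to a libpq-based driver as keyword arguments; whether ACE builds its connections this way is not stated in this document.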
+ +Use properties in the `Default values for ACE table-diff` section to specify default values for the ACE table-diff command: + +```bash +# Default values for ACE table-diff +MAX_DIFF_ROWS = 10000 +MIN_ALLOWED_BLOCK_SIZE = 1000 +MAX_ALLOWED_BLOCK_SIZE = 100000 +BLOCK_ROWS_DEFAULT = os.environ.get("ACE_BLOCK_ROWS", 10000) +MAX_CPU_RATIO_DEFAULT = os.environ.get("ACE_MAX_CPU_RATIO", 0.6) +BATCH_SIZE_DEFAULT = os.environ.get("ACE_BATCH_SIZE", 1) +MAX_BATCH_SIZE = 1000 +``` + +* `MAX_DIFF_ROWS` (default=10000) is the number of differences after which table-diff will abort the run. A value of 10000 means that table-diff will abort if there are more than 10000 differences between the specified nodes. +* `MIN_ALLOWED_BLOCK_SIZE` (default=1000) is the smallest allowed block size during table-diff. +* `MAX_ALLOWED_BLOCK_SIZE` (default=100000) is the largest allowed block size during table-diff. +* `BLOCK_ROWS_DEFAULT` (default=10000) is the default number of rows per block to use if nothing is specified. ACE attempts to read the environment variable `ACE_BLOCK_ROWS`, and falls back to 10000 if it is not set. You can override this value with the CLI option `--block-rows`. +* `MAX_CPU_RATIO_DEFAULT` (default=0.6) is a float value between 0 and 1 that determines how many multiprocessing workers are used during table-diff; the worker count is given by `NUM_CPUS * MAX_CPU_RATIO_DEFAULT`. You can override this value with the CLI option `--max-cpu-ratio`. +* `BATCH_SIZE_DEFAULT` (default=1) is the number of work items to process per worker. You can override this value with the CLI option `--batch-size`. +* `MAX_BATCH_SIZE` (default=1000) is the largest batch size allowed during a table-diff. 
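The interaction between the environment variables, the block-size limits, and the worker count can be sketched as follows. This is a hedged illustration rather than ACE's own code: the `resolve_table_diff_defaults` helper is hypothetical, and the explicit type conversion, the range checks, and rounding the worker count down (with at least one worker) are assumptions of the sketch. One point it highlights is that `os.environ.get` returns a string whenever the variable is set, so the value needs converting before it is used as a number.

```python
import os

# Limits copied from the ace_config.py listing in this document.
MIN_ALLOWED_BLOCK_SIZE = 1000
MAX_ALLOWED_BLOCK_SIZE = 100000
MAX_BATCH_SIZE = 1000


def resolve_table_diff_defaults(env=None, num_cpus=None):
    """Hypothetical helper: turn environment overrides into typed settings."""
    env = os.environ if env is None else env
    num_cpus = (os.cpu_count() or 1) if num_cpus is None else num_cpus

    # Environment values arrive as strings, so convert them explicitly.
    block_rows = int(env.get("ACE_BLOCK_ROWS", 10000))
    max_cpu_ratio = float(env.get("ACE_MAX_CPU_RATIO", 0.6))
    batch_size = int(env.get("ACE_BATCH_SIZE", 1))

    # Range checks are an assumption of this sketch.
    if not MIN_ALLOWED_BLOCK_SIZE <= block_rows <= MAX_ALLOWED_BLOCK_SIZE:
        raise ValueError(f"block_rows out of range: {block_rows}")
    if not 0 < max_cpu_ratio <= 1:
        raise ValueError(f"max_cpu_ratio out of range: {max_cpu_ratio}")
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size out of range: {batch_size}")

    # Worker count follows NUM_CPUS * MAX_CPU_RATIO_DEFAULT; rounding down
    # and keeping at least one worker are choices made for this sketch.
    workers = max(1, int(num_cpus * max_cpu_ratio))
    return {
        "block_rows": block_rows,
        "max_cpu_ratio": max_cpu_ratio,
        "batch_size": batch_size,
        "workers": workers,
    }


print(resolve_table_diff_defaults(env={"ACE_BLOCK_ROWS": "50000"}, num_cpus=8))
```

Passing `env` and `num_cpus` explicitly keeps the sketch easy to test; in normal use both arguments can be omitted so that the real environment and CPU count are read.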
+ +Use properties in the `ACE Auto-repair Options` section to specify your preferences related to auto-repair: + +```bash +auto_repair_config = { + "enabled": False, + "cluster_name": "eqn-t9da", + "dbname": "demo", + "poll_frequency": "10m", + "repair_frequency": "15m", +} +``` +* `enabled` is a boolean value that specifies that auto-repair should or should not be enabled; the default is `false`. +* `cluster_name` is the name of your cluster. +* `dbname` is the name of the Postgres database that auto-repair is monitoring. +* `poll_frequency` is the interval at which to poll the `exception_log` table and populate the `exception_status` and `exception_status_detail` tables. +* `repair_frequency` is the interval at which to repair exceptions. + +### The ACE Configuration File (ace_config.py) + +The following listing is a complete `ace_config.py` file, provided for reference only. Please note that your configuration values will vary. + +``` +import os +from datetime import timedelta + +""" + +** ACE CLI and common configuration options ** + +""" + +# ============================================================================== +# Postgres options +STATEMENT_TIMEOUT = 60000 # in milliseconds +CONNECTION_TIMEOUT = 10 # in seconds + + +# Default values for ACE table-diff +MAX_DIFF_ROWS = 10000 +MIN_ALLOWED_BLOCK_SIZE = 1000 +MAX_ALLOWED_BLOCK_SIZE = 100000 +BLOCK_ROWS_DEFAULT = os.environ.get("ACE_BLOCK_ROWS", 10000) +MAX_CPU_RATIO_DEFAULT = os.environ.get("ACE_MAX_CPU_RATIO", 0.6) +BATCH_SIZE_DEFAULT = os.environ.get("ACE_BATCH_SIZE", 1) +MAX_BATCH_SIZE = 1000 + + +# Return codes for compare_checksums +BLOCK_OK = 0 +MAX_DIFFS_EXCEEDED = 1 +BLOCK_MISMATCH = 2 +BLOCK_ERROR = 3 + +# The minimum version of Spock that supports the repair mode +SPOCK_REPAIR_MODE_MIN_VERSION = 4.0 + +# ============================================================================== + +""" + +** ACE Background Service Options ** + +""" + +LISTEN_ADDRESS = "0.0.0.0" +LISTEN_PORT = 5000 + +# 
Smallest interval that can be used for any ACE background service +MIN_RUN_FREQUENCY = timedelta(minutes=5) + +""" +Table-diff scheduling options +Specify a list of job definitions. Currently, only table-diff and repset-diff +jobs are supported. + +A job definition must have the following fields: +- name: The name of the job +- cluster_name: The name of the cluster +- table_name: The name of the table +OR +- repset_name: The name of the repset for repset-diff + +If finer control is needed, you could also specify additional arguments in the +args field. +args currently supports the following fields: +- max_cpu_ratio: The maximum number of CPUs to use for the job. Expressed as a + float between 0 and 1. +- batch_size: The batch size to use for the job. How many blocks does a single + job process at a time. +- block_rows: The maximum number of rows per block. How many rows does a + single block contain. Each multiprocessing worker running in parallel will + process this many rows at a time. +- nodes: A list of node OIDs--if you'd like to run the job only on specific nodes. +- output: The output format to use for the job. Can be "json", "html" or "csv". +- quiet: Whether to suppress output. +- table_filter: A where clause to run table-diff only on a subset of rows. E.g. + "id < 10000". Note: table_filter argument will be ignored for repset-diff. +- dbname: The database to use. +- skip_tables: A list of tables to skip during repset-diff. + + +NOTE: For best results, stagger the jobs by at least a few seconds. 
Do not run +overlapping jobs + +Example: +schedule_jobs = [ + { + "name": "t1", + "cluster_name": "eqn-t9da", + "table_name": "public.t1", + }, + { + "name": "t2", + "cluster_name": "eqn-t9da", + "table_name": "public.t2", + "args": { + "max_cpu_ratio": 0.7, + "batch_size": 1000, + "block_rows": 10000, + "nodes": "all", + "output": "json", + "quiet": False, + "dbname": "demo", + }, + }, + { + "name": "t3", + "cluster_name": "eqn-t9da", + "repset_name": "demo_repset", + "args": { + "max_cpu_ratio": 0.7, + "batch_size": 1000, + "block_rows": 10000, + "nodes": "all", + "output": "json", + "quiet": False, + "dbname": "demo", + "skip_tables": ["public.test_table_1", "public.test_table_2"], + }, + }, +] +""" + +schedule_jobs = [] + +""" +Specify a list of jobs and their crontab schedules or run_frequency as a string. +This list must reference job names from schedule_jobs above. +run_frequency can be string like "1 h", "5 min" or "30 s". +If the crontab_schedule is specified, run_frequency is ignored. +Minimum run_frequency is 5 minutes by default. Can be overriden by setting +MIN_RUN_FREQUENCY above. + +Example: +schedule_config = [ + { + "job_name": "t1", + "crontab_schedule": "0 0 * * *", + "run_frequency": "30s", + "enabled": False, + }, + { + "job_name": "t2", + "crontab_schedule": "0 0 * * *", + "run_frequency": "5s", + "enabled": False, + }, + { + "job_name": "t3", + "crontab_schedule": "0 0 * * *", + "run_frequency": "30s", + "enabled": False, + }, +] +""" + +schedule_config = [] + +""" + +ACE Auto-repair Options + +Auto-repair is a feature in ACE to automatically repair tables that are +detected to have diverged. Currently, auto-repair supports handling only +insert-insert exceptions. Handling other types of exception will need replication +information from Spock, which it either doesn't track or simply doesn't provide. +The detection happens by polling the spock.exception_status table. 
However, since +Spock does not automatically insert into the spock.exception_status or the +spock.exception_status_detail tables, ACE has to manually insert into them by +performing a MERGE INTO using all three tables. + +How often the exception_status is populated is controlled by the poll_frequency +setting, and how often the repair happens is controlled by the repair_frequency +setting. + +To enable auto-repair, set the following options: +- enabled: Whether to enable auto-repair. +- cluster_name: The name of the cluster. +- dbname: The name of the database. +- poll_frequency: The interval at which to poll the exception_log table and + populate the exception_status and exception_status_detail tables. +- repair_frequency: The interval at which to repair exceptions. + +auto_repair_config = { + "enabled": False, + "cluster_name": "eqn-t9da", + "dbname": "demo", + "poll_frequency": "10m", + "repair_frequency": "15m", +} +""" + +auto_repair_config = {} + +""" +Cert-based auth options + +Client-cert-based auth is a *required* option for using the ACE APIs. It can +optionally be used with the CLI modules as well. + +""" +USE_CERT_AUTH = False +ACE_USER_CERT_FILE = "" +ACE_USER_KEY_FILE = "" +CA_CERT_FILE = "" + +# Prior to version 42.0.0 of cryptography, the not_valid_before and not_valid_after +# fields in the certificate object returned a naive datetime object. +# These fields were deprecated starting with version 42.0.0. +# If the user has a pre- 42.0.0 version of cryptography, we need to support it +# correctly. 
+USE_NAIVE_DATETIME = False + +DEBUG_MODE = False + +# ============================================================================== + +``` \ No newline at end of file diff --git a/docs/images/ace_cluster.png b/docs/images/ace_cluster.png new file mode 100644 index 0000000..bfe15af Binary files /dev/null and b/docs/images/ace_cluster.png differ diff --git a/docs/images/ace_elephant.png b/docs/images/ace_elephant.png new file mode 100644 index 0000000..9d96587 Binary files /dev/null and b/docs/images/ace_elephant.png differ diff --git a/docs/images/anti-chaos.png b/docs/images/anti-chaos.png new file mode 100644 index 0000000..af56ba1 Binary files /dev/null and b/docs/images/anti-chaos.png differ diff --git a/docs/images/arch1.png b/docs/images/arch1.png new file mode 100644 index 0000000..4763845 Binary files /dev/null and b/docs/images/arch1.png differ diff --git a/docs/images/arch2.png b/docs/images/arch2.png new file mode 100644 index 0000000..11bd34d Binary files /dev/null and b/docs/images/arch2.png differ diff --git a/docs/images/arch3.png b/docs/images/arch3.png new file mode 100644 index 0000000..3eff028 Binary files /dev/null and b/docs/images/arch3.png differ diff --git a/docs/images/arch4.png b/docs/images/arch4.png new file mode 100644 index 0000000..9eb7a92 Binary files /dev/null and b/docs/images/arch4.png differ diff --git a/docs/images/arch5.png b/docs/images/arch5.png new file mode 100644 index 0000000..c0f37d3 Binary files /dev/null and b/docs/images/arch5.png differ diff --git a/docs/images/arch6.png b/docs/images/arch6.png new file mode 100644 index 0000000..1c96fc6 Binary files /dev/null and b/docs/images/arch6.png differ diff --git a/docs/images/arch7.png b/docs/images/arch7.png new file mode 100644 index 0000000..77c11c1 Binary files /dev/null and b/docs/images/arch7.png differ diff --git a/docs/images/arch8.png b/docs/images/arch8.png new file mode 100644 index 0000000..aee3962 Binary files /dev/null and b/docs/images/arch8.png differ diff 
--git a/docs/images/arch9.png b/docs/images/arch9.png new file mode 100644 index 0000000..bf52a4e Binary files /dev/null and b/docs/images/arch9.png differ diff --git a/docs/images/hintplan.png b/docs/images/hintplan.png new file mode 100644 index 0000000..07302ca Binary files /dev/null and b/docs/images/hintplan.png differ diff --git a/docs/images/merkle_one.png b/docs/images/merkle_one.png new file mode 100644 index 0000000..25a5428 Binary files /dev/null and b/docs/images/merkle_one.png differ diff --git a/docs/images/merkle_two.png b/docs/images/merkle_two.png new file mode 100644 index 0000000..21e1123 Binary files /dev/null and b/docs/images/merkle_two.png differ diff --git a/docs/images/partitioning.png b/docs/images/partitioning.png new file mode 100644 index 0000000..4e9b938 Binary files /dev/null and b/docs/images/partitioning.png differ diff --git a/docs/images/ultra_high_availability.png b/docs/images/ultra_high_availability.png new file mode 100644 index 0000000..b606edb Binary files /dev/null and b/docs/images/ultra_high_availability.png differ diff --git a/docs/img/favicon.ico b/docs/img/favicon.ico new file mode 100644 index 0000000..e7ce2cf Binary files /dev/null and b/docs/img/favicon.ico differ diff --git a/docs/img/logo-dark.png b/docs/img/logo-dark.png new file mode 100644 index 0000000..90587c6 Binary files /dev/null and b/docs/img/logo-dark.png differ diff --git a/docs/img/logo-light.png b/docs/img/logo-light.png new file mode 100644 index 0000000..f998225 Binary files /dev/null and b/docs/img/logo-light.png differ diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..71832ac --- /dev/null +++ b/docs/index.md @@ -0,0 +1,46 @@ +# ACE (the Active Consistency Engine) + +ACE is a powerful tool designed to ensure and maintain consistency across nodes in a pgEdge Distributed Postgres cluster. 
It helps identify and resolve data inconsistencies, schema differences, and replication configuration mismatches across nodes in a cluster. + +Key features of ACE include: + +- Table-level data comparison and repair +- Replication set level verification +- Automated repair capabilities +- Schema comparison +- Spock configuration validation + +## ACE Deployment Considerations + +ACE is very efficient when it comes to comparing tables, and uses a lot of optimisations to speed up the process. However, it is important to consider certain factors that can affect the runtime of a `table-diff` command. +ACE first looks at table sizes, and then based on the specified runtime options, splits up the task into multiple processes and executes them in parallel. Each multiprocessing worker initially computes a hash of the data block, and if it finds there is a hash mismatch, attempts to fetch those records to generate a report. +How fast it can execute a `table-diff` command depends on the: + +* configuration of the machine you're running ACE on – how many cores, how much memory, etc. +* resources you allow ACE to use (`max_cpu_ratio`). This is a strong determinant of runtime performance, even in the absence of other tuning options. +* runtime tuning options: `block-rows`, `batch-size`, `table-filter`, and `nodes`. +* size of your table, size of individual rows, and column datatypes. Sometimes performing a table-diff on tables with a very large number of records may take just a few minutes, while a smaller table with fewer rows, but with larger row sizes may take much longer. An example of the latter case is when embeddings or binary data is stored in the table. +* distribution of differences: Any differences in blocks of data identified by ACE require ACE to pull all those records together to generate a report. So, the smaller the data transfer between the database nodes and ACE, the faster it will run. 
If diffs are spread across numerous data blocks throughout the key space, it will take longer for ACE to be able to pull all the records. If you expect to see differences in certain blocks, using a table-filter and adjusting the block size can greatly speed up the process. +* network latency between the ACE node and your database nodes: The closer the ACE node is to the database nodes, the faster it can run. + +ACE uses the cluster definition JSON file to connect to nodes and execute SQL statements. It might even be desirable to set up a connection pooler like pgBouncer or pgCat separately and point to that in the cluster JSON file for faster runtime performance. + +### Improving ACE Performance when Invoking Diff Functions + +The following runtime options can impact ACE performance during a `table-diff`: + +* `--block-rows` specifies the number of tuples to be used at a time during table comparisons. ACE computes an MD5 sum on the full chunk of rows per block and compares it with the hash of the same chunk on the other nodes. If the hashes match up between nodes, then ACE moves on to the next block. Otherwise, the rows get pulled in and a set difference is computed. If `block_rows` is set to `1000`, then a thousand tuples are compared per job across tables. +It is worth noting here that while it may appear that larger block sizes yield faster results, it may not always be the case. Using a larger block size will result in a speed up, but only up to a threshold. If the block size is too large, the Postgres [array_agg()](https://www.postgresql.org/docs/16/functions-aggregate.html) function may run out of memory, or the hash might take longer to compute, thus annulling the benefit of using a larger block size. The sweet spot is a block size that is large enough to yield quicker runtimes, but still small enough to avoid the issues listed above. ACE enforces that block sizes are between 10^3 and 10^5 rows. 
+* `--batch-size` dictates how many sets of `block_rows` a single process should handle. By default, this is set to `1` to achieve the maximum possible parallelism – each process in the multiprocessing pool works on one block at a time. However, in some cases, you may want to limit process creation overheads and use a larger batch size. We recommend you leave this setting at its default value, unless there is a specific use case that demands changing it. +* `--max-cpu-ratio` specifies the fraction of the available CPUs you are allotting for use by ACE. A value of `1` instructs the server to use all available CPUs, while `.5` means use half of the available CPUs. The default is `.6` (or 60% of the CPUs). Setting it to its maximum (1.0) will result in faster execution times. This should be modified as needed. + +To evaluate and improve ACE performance: + +1. Experiment with different block sizes and CPU utilisation to find the best performance/resource-usage balance for your workload. +2. Use `--table-filter` for large tables to reduce comparison scope. +3. Generate HTML reports for easier analysis of differences. +4. Ensure the diffs have not overrun the `MAX_DIFF_ROWS` limit; otherwise, table-repair will only be able to partially repair the table. + +### Known Limitations + +* ACE cannot be used on a table without a primary key, because primary keys are the basis for range partitioning, hash calculations, and other critical functions in ACE. diff --git a/docs/merkle.md b/docs/merkle.md new file mode 100644 index 0000000..69c8dd1 --- /dev/null +++ b/docs/merkle.md @@ -0,0 +1,83 @@ +# Enhancing ACE Performance with Merkle Trees + +!!! info + + ACE Merkle trees are added as an experimental optimisation with the release of pgEdge Distributed Postgres (CLI 25.2); we encourage caution before using this feature in a production environment. + +ACE adds functionality that uses [Merkle trees](https://en.wikipedia.org/wiki/Merkle_tree) to make table comparisons significantly faster. 
In most cases, performing a normal mode table-diff, when run with tuned parameters, can produce diff reports in anywhere from a few seconds to a few minutes depending on the size of the table, network latency, disc I/O latency, and similar factors. The Merkle tree feature in ACE is intended for tables where performing a diff without a Merkle tree might take hours to complete. + +## Initialising the Merkle Tree Objects for a Table + +You must perform two initialisation steps before using Merkle trees: + +* Create the Merkle tree functions and operators at the database level. +* Create a Merkle tree metadata table for the candidate table. + +The following command initialises functions and operators at the database level: + +`./pgedge ace mtree init cluster_name` + +The second command creates pre-computed hash objects and triggers on the candidate table. If you pass `--recreate-objects=true` during the build phase, the previous step is not necessary. Use the following command to build a pre-computed hash table: + +`./pgedge ace mtree build cluster_name schema_name.table_name --max-cpu-ratio=1 --recreate-objects=true` + +This command creates a table in which to store the Merkle tree of the candidate table, and adds triggers to track modifications to it. + +!!! info + + Building the pre-computed hash table (the Merkle tree table) is a one-time operation. Once built, ACE tracks changes on the table and automatically updates the tree when you perform a table-diff. + +It is also worth noting here that because the Merkle tree feature in ACE is built to handle very large tables, it uses probabilistic sampling and estimates to compute things such as the number of rows, primary key ranges, etc. For these estimates to work correctly, the table should be `ANALYZED` beforehand. 
You can pass in `--analyse=true` during the Merkle tree build to let ACE analyse the table, but we recommend that an `ANALYZE table` is performed manually on the table before invoking `mtree build` with ACE simply because it might take a while for the analysis to complete. However, if you are not actively using the table, then `ANALYZE` may not be necessary. + +### Building Merkle Trees in Parallel + +If your table is extremely large (say, close to a billion rows, or ~1 TB in size), then building the tree even on a single node might take a non-trivial amount of time. The good news is that the build operation needs to happen just once per table per cluster. If the ACE (management) node is used to remotely build the trees on each node, there is additional latency because of the network distance between the ACE host and the database instances. + +For example, if you are using the `n1` node as the ACE management node of a three-node pgEdge cluster (with `n1`, `n2`, and `n3`), building the tree on `n1` may be comparatively faster (because of lower network latency) than when ACE on `n1` tries to build the same tree on `n2` or `n3`. + +As a workaround, you can build the Merkle trees in parallel; when you invoke the `mtree build` command, include the `--write-ranges=true` clause as shown below: + +`./pgedge ace mtree build cluster_name schema_name.table_name --max-cpu-ratio=1 --write-ranges=true` + +![Building the Merkle trees in parallel](../../platform/images/merkle_one.png) + +This command outputs the computed ranges to a file, and then begins the hash computations. While that command is running, you can `scp` the `ranges` file to the remaining nodes. 
Then, on the other nodes, use the same `mtree build` command, but this time, specify the file name with the `--ranges-file=/path/to/ranges-file.txt` option as shown below: + +`./pgedge ace mtree build cluster_name schema_name.table_name --max-cpu-ratio=1 --recreate-objects=true --nodes=node_name --ranges-file=/path/to/ranges-file.txt` + +![Building the Merkle trees in parallel](../../platform/images/merkle_two.png) + +While running the build on each node using the `--ranges-file` option, make sure to specify only one specific node in the `--nodes` option. For example, when building a tree on `n1`, include `--nodes=n1`, and when building a tree on `n2`, include `--nodes=n2`; otherwise, ACE will attempt to remotely create the Merkle tree tables on other nodes in the cluster, which is the very thing you want to avoid for large tables. + +## Using ACE Functions with Merkle Trees + +The following ACE commands use and manage Merkle tree functionality. + +### mtree table-diff + +To perform a table-diff using a Merkle tree, use the command: + +`./pgedge ace mtree table-diff cluster_name schema_name.table_name` + +### mtree update + +Performing an mtree table-diff automatically updates the Merkle tree before performing the diff. You can also perform a Merkle tree update with the command: + +`./pgedge ace mtree update cluster_name schema_name.table_name` + +Passing in `--rebalance=true` with either a diff or update command will perform splits and merges of blocks based on changes in the underlying keyspace; you don't need to perform a rebalancing unless it's essential. The default update operation in `mtree table-diff` takes care of block splits and updates but defers merges. This is to preserve parent-child relationships and avoid costly recursions due to merged blocks. 
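To illustrate why a Merkle tree makes comparisons cheaper in general, the following minimal sketch builds a hash tree over blocks of rows and walks down only the subtrees whose hashes differ. It is a generic illustration of the data structure, not ACE's implementation: the function names are hypothetical, SHA-256 is an arbitrary choice for the sketch, the trees live in memory rather than in a metadata table, and both nodes are assumed to use identical block boundaries.

```python
import hashlib


def block_hash(rows):
    """Hash one block of rows (a list of tuples)."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def build_tree(blocks):
    """Return the tree as a list of levels, leaves first and root last."""
    level = [block_hash(rows) for rows in blocks]
    levels = [level]
    while len(level) > 1:
        # Each parent hashes the concatenation of up to two child hashes.
        level = [
            hashlib.sha256("".join(level[i:i + 2]).encode("utf-8")).hexdigest()
            for i in range(0, len(level), 2)
        ]
        levels.append(level)
    return levels


def mismatched_blocks(tree_a, tree_b):
    """Return leaf indexes that differ, visiting only mismatched subtrees."""
    if not tree_a[0]:
        return []
    suspects = [0]  # node indexes at the current level, starting at the root
    for depth in range(len(tree_a) - 1, 0, -1):
        children = []
        for idx in suspects:
            if tree_a[depth][idx] != tree_b[depth][idx]:
                for child in (2 * idx, 2 * idx + 1):
                    if child < len(tree_a[depth - 1]):
                        children.append(child)
        suspects = children
    return [i for i in suspects if tree_a[0][i] != tree_b[0][i]]


node_a = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")], [(5, "e"), (6, "f")]]
node_b = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")], [(5, "e"), (6, "X")]]
print(mismatched_blocks(build_tree(node_a), build_tree(node_b)))
```

When the root hashes match, the walk stops immediately without touching any block, which is where the savings on very large and mostly identical tables come from.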
+ +### mtree teardown + +You can remove table-specific triggers and objects with the following command: + +`./pgedge ace mtree teardown cluster_name schema_name.table_name` + +To remove cluster-level objects (custom operators, generic functions, etc.) use the following command: + +`./pgedge ace mtree teardown cluster_name` + + +## Performance Considerations + +Since the Merkle tree initialisation process adds triggers to the candidate table, you may encounter a performance impact on normal user queries to the table. This impact is application dependent and needs to be measured and deemed acceptable before using this feature. diff --git a/docs/overrides/partials/logo.html b/docs/overrides/partials/logo.html new file mode 100644 index 0000000..780ddb5 --- /dev/null +++ b/docs/overrides/partials/logo.html @@ -0,0 +1,2 @@ +logo +logo \ No newline at end of file diff --git a/docs/schedule_ace.md b/docs/schedule_ace.md new file mode 100644 index 0000000..086f0b5 --- /dev/null +++ b/docs/schedule_ace.md @@ -0,0 +1,194 @@ +# Scheduling ACE Diff Operations (Beta) + +ACE supports automated scheduling of table-diff and repset-diff operations through configuration settings in `ace_config.py`. The job scheduler allows you to perform regular consistency checks without manual intervention. + +Use properties in the `ACE Background Service Options` section of the `ace_config.py` file to specify general background service preferences: + +```bash +** ACE Background Service Options ** + +LISTEN_ADDRESS = "0.0.0.0" +LISTEN_PORT = 5000 + +# Smallest interval that can be used for any ACE background service +MIN_RUN_FREQUENCY = timedelta(minutes=5) +``` + +* `LISTEN_ADDRESS` (default="0.0.0.0") is the network address ACE should bind to when started as a background process. +* `LISTEN_PORT` (default=5000) is the default port ACE should listen on when started as a background process. 
+* `MIN_RUN_FREQUENCY` (default=timedelta(minutes=5)) is the minimum interval between consecutive runs of a background job. This value can be set using any timedelta unit -- such as minutes, seconds, or hours. For example, if MIN_RUN_FREQUENCY is set to 5 minutes, then no job can be scheduled to run more frequently than once every 5 minutes. + +Additionally, use properties in the following sections to define jobs and schedules for their execution. + +## Scheduling a Job + +[The `ace_config.py` file](../ace/installing_ace.md#configuring-ace-preferences-with-the-ace_configpy-file) (by default, located in `$PGEDGE_HOME/hub/scripts/`) contains information about jobs and their schedules in two .json-formatted sections; first, use the following property:value pairs in the `schedule_jobs` section to define jobs: + +**Job Configuration Options** + +Each job in `schedule_jobs` supports: + +- `name` (required): Unique identifier for the job +- `cluster_name` (required): Name of the cluster +- `table_name` OR `repset_name` (required): Fully qualified table name or repset name +- `args` (optional): Dictionary of table-diff parameters + - `max_cpu_ratio`: Maximum CPU usage ratio + - `batch_size`: Batch size for processing + - `block_rows`: Number of rows per block + - `table_filter`: `SQL WHERE` clause used to filter rows for comparison + - `nodes`: Nodes to include + - `output`: Output format ["json", "csv", "html"] + - `quiet`: Suppress output + - `dbname`: Database name + +**For Example** + +```python +# Define the jobs +schedule_jobs = [ + { + "name": "t1", + "cluster_name": "my_cluster", + "table_name": "public.users" + }, + { + "name": "t2", + "cluster_name": "my_cluster", + "table_name": "public.orders", + "args": { + "max_cpu_ratio": 0.7, + "batch_size": 1000, + "block_rows": 10000, + "nodes": "all", + "output": "json", + "quiet": False, + "dbname": "mydb" + } + } +] +``` + +Then, use the property:value pairs in the `schedule_config` section to define the schedule for 
each job: + +**Schedule Configuration Options** + +Each schedule in `schedule_config` supports: + +- `job_name` (required): Name of the job to schedule (must match a job name) +- `crontab_schedule`: Cron-style schedule expression + - **Cron Format**: `* * * * *` (minute hour day_of_month month day_of_week) + - Examples: + - `0 0 * * *`: Daily at midnight + - `0 */4 * * *`: Every 4 hours + - `0 0 * * 0`: Weekly on Sunday +- `run_frequency`: Alternative to crontab, using time units (e.g., "30s", "5m", "1h") + - **Run Frequency Format**: `<number><unit>` + - Units: "s" (seconds), "m" (minutes), "h" (hours) + - Minimum: 5 minutes by default (set by `MIN_RUN_FREQUENCY`) + - Examples: + - "30s": Every 30 seconds (only valid if `MIN_RUN_FREQUENCY` is lowered) + - "5m": Every 5 minutes + - "1h": Every hour +- `enabled`: Whether the schedule is active (default: False) +- `rerun_after`: Time to wait before rerunning if differences are found + +**For Example** + +```python +schedule_config = [ + { + "job_name": "t1", + "crontab_schedule": "0 0 * * *", # Run at midnight + "run_frequency": "30s", # Alternative to crontab + "enabled": True, + "rerun_after": "1h" # If a diff is found, rerun after 1 hour + }, + { + "job_name": "t2", + "crontab_schedule": "0 */4 * * *", # Every 4 hours + "run_frequency": "5m", # Alternative to crontab + "enabled": True, + "rerun_after": "30m" + } +] +``` + +**Starting the Scheduler** + +The scheduler starts automatically when ACE is started. + +```bash +./pgedge ace start +``` + + +**Best Practices** + +1. **Resource Management**: + - Stagger schedules to avoid overlapping resource-intensive jobs + - Set appropriate `max_cpu_ratio`, `block_rows`, and `batch_size` values based on the + table size and expected load +2. **Frequency Selection**: + - Use `crontab_schedule` for specific times + - Use `run_frequency` for regular intervals + + +## Scheduling Auto-Repair Jobs (Beta) + +The `auto-repair` module monitors and repairs `INSERT`-`INSERT` exceptions in tables whose data has been detected to have diverged. 
It runs as a background process, periodically checking for inconsistencies and applying repairs based on configured settings. + +To enable auto-repair, specify your auto-repair preferences in `ace_config.py`: + +```python +auto_repair_config = { + "enabled": False, + "cluster_name": "eqn-t9da", + "dbname": "demo", + "poll_frequency": "10m", + "repair_frequency": "15m" +} +``` + +**Configuration Options** + +- `enabled`: Enable/disable auto-repair functionality (default: False) +- `cluster_name`: Name of the cluster to monitor +- `dbname`: Database name to monitor +- `poll_frequency`: How often the Spock exception log is polled to check for new exceptions. +- `repair_frequency`: How often to repair exceptions that have been detected. + +**Time Intervals** + +You can specify the time intervals for execution in either cron format or a simple frequency format. Both `poll_frequency` and `repair_frequency` accept time strings in the following formats: + +**Cron Format**: `* * * * *` (minute hour day_of_month month day_of_week); for example: + - `0 0 * * *`: Daily at midnight + - `0 */4 * * *`: Every 4 hours + - `0 0 * * 0`: Weekly on Sunday + +**Run Frequency Format**: `<number><unit>`, where: + - Units: "s" (seconds), "m" (minutes), "h" (hours) + - Minimum: 5 minutes by default + - Examples: + - "30s": Every 30 seconds + - "5m": Every 5 minutes + - "1h": Every hour + +Note: By default, the shortest interval allowed is 5 minutes. You can change that limit by editing the `MIN_RUN_FREQUENCY` variable in `ace_config.py`. + +**Controlling the auto-repair Daemon** + +The auto-repair daemon starts automatically when ACE is started. + +```bash +./pgedge ace start +``` + + +**Common Use Cases** + +Auto-repair is a good fit for use cases that have a high probability of `INSERT`-`INSERT` conflicts. For example, on bidding and reservation servers, `INSERT`-`INSERT` conflicts are likely to arise across multiple nodes. 
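To make the run-frequency format concrete, the sketch below shows one way strings such as "30s", "5m", and "1h" could be converted to `timedelta` values and checked against `MIN_RUN_FREQUENCY`. This is an illustration only and is not taken from the ACE source: `parse_frequency` and `check_frequency` are hypothetical names, and ACE's own parsing and validation may differ.

```python
import re
from datetime import timedelta

# Mirrors the default shown in ace_config.py; change it if you edited yours.
MIN_RUN_FREQUENCY = timedelta(minutes=5)

# Map each documented unit suffix to the matching timedelta keyword.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}
_FREQUENCY_RE = re.compile(r"(\d+)([smh])")


def parse_frequency(value: str) -> timedelta:
    """Convert a string such as '30s', '5m', or '1h' into a timedelta."""
    match = _FREQUENCY_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"invalid frequency {value!r}; expected <number><unit>")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def check_frequency(value: str, minimum: timedelta = MIN_RUN_FREQUENCY) -> timedelta:
    """Parse a frequency and reject intervals shorter than the minimum."""
    interval = parse_frequency(value)
    if interval < minimum:
        raise ValueError(f"{value!r} is shorter than the minimum of {minimum}")
    return interval
```

With the default minimum, `check_frequency("10m")` succeeds and `check_frequency("30s")` raises `ValueError`, which matches the rule that no job can run more often than `MIN_RUN_FREQUENCY` allows.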
+ +### Limitations and Considerations + +- The auto-repair daemon is currently limited to handling `INSERT`-`INSERT` conflicts only. diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css new file mode 100644 index 0000000..ebd98ec --- /dev/null +++ b/docs/stylesheets/extra.css @@ -0,0 +1,17 @@ +#logo_light_mode { + display: var(--md-footer-logo-light-mode); +} + +#logo_dark_mode { + display: var(--md-footer-logo-dark-mode); +} + +[data-md-color-scheme="default"] { + --md-footer-logo-dark-mode: none; + --md-footer-logo-light-mode: block; +} + +[data-md-color-scheme="slate"] { + --md-footer-logo-dark-mode: block; + --md-footer-logo-light-mode: none; +} \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000..06fcd26 --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,64 @@ +site_name: ace + +extra_css: + - stylesheets/extra.css + +markdown_extensions: + - admonition + - pymdownx.details + - pymdownx.superfences + +plugins: + - search + +theme: + name: material + custom_dir: docs/overrides + + logo_dark_mode: 'img/logo-dark.png' + logo_light_mode: 'img/logo-light.png' + + features: + - navigation.instant + - navigation.tracking + - navigation.prune + - navigation.top + - toc.follow + + palette: + - media: "(prefers-color-scheme: light)" + scheme: default + primary: white + accent: cyan + toggle: + icon: material/brightness-7 + name: Switch to dark mode + + - media: "(prefers-color-scheme: dark)" + scheme: slate + primary: black + accent: cyan + toggle: + icon: material/brightness-4 + name: Switch to system preference + +extra: + generator: false + +copyright: Copyright © 2023 - 2025 pgEdge, Inc +repo_url: https://github.com/pgEdge/ace + +nav: + - ACE Overview: index.md + - Understanding ACE Use Cases: ace_use_cases.md + - Configuring ACE: configuring.md + - Using Merkle Trees with ACE: merkle.md + - Using ACE Functions: ace_functions.md + - Using the ACE API: ace_api.md + - API Reference: api.md + - Scheduling ACE: 
schedule_ace.md + + - Internals: + - Design: internals-doc/DESIGN.md + - Output Plugin: internals-doc/OUTPUT.md + - Protocol: internals-doc/protocol.txt