Data Archiver

A high-performance CLI tool for archiving database data to S3-compatible object storage.

Currently supports: PostgreSQL input (partitioned tables) and S3-compatible object storage output.

Screenshots

Terminal UI (TUI)

Data Archiver TUI Screenshot

Web-based Cache Viewer

Data Archiver Web UI Screenshot

Features

  • πŸš€ Parallel Processing - Archive multiple partitions concurrently with configurable workers
  • πŸ“Š Beautiful Progress UI - Real-time progress tracking with dual progress bars
  • 🌐 Embedded Cache Viewer - Beautiful web interface with real-time updates:
    • WebSocket Live Updates - Real-time data streaming without polling
    • Interactive task monitoring showing current partition and operation
    • Clickable partition names to jump directly to table row
    • Shows archiver status (running/idle) with PID tracking
    • Live statistics: total partitions, sizes, compression ratios
    • Sortable table with S3 upload status indicators
    • Smooth animations highlight data changes
    • Error tracking with timestamps
    • Auto-reconnecting WebSocket for reliability
  • πŸ’Ύ Intelligent Caching - Advanced caching system for maximum efficiency:
    • Caches row counts for 24 hours (refreshed daily)
    • Caches file metadata permanently (size, MD5, compression ratio)
    • Tracks errors with timestamps
    • Skip extraction/compression entirely when cached metadata matches S3
  • πŸ” Data Integrity - Comprehensive file integrity verification:
    • Size comparison (both compressed and uncompressed)
    • MD5 hash verification for single-part uploads
    • Multipart ETag verification for large files (>100MB)
    • Automatic multipart upload for files >100MB
  • ⚑ Smart Compression - Uses Zstandard compression with multi-core support
  • πŸ”„ Intelligent Resume - Three-level skip detection:
    1. Fast skip using cached metadata (no extraction needed)
    2. Skip if S3 file matches after local processing
    3. Re-upload if size or hash differs
  • 🎯 Flexible Partition Support - Handles multiple partition naming formats:
    • table_YYYYMMDD (e.g., messages_20240315)
    • table_pYYYYMMDD (e.g., messages_p20240315)
    • table_YYYY_MM (e.g., messages_2024_03)

πŸ“‹ Prerequisites

  • Go 1.22 or higher
  • PostgreSQL database with partitioned tables (e.g., tablename_YYYYMMDD; see Partition Discovery below for all supported naming formats)
  • S3-compatible object storage (Hetzner, AWS S3, MinIO, etc.)

πŸ”§ Installation

Homebrew (macOS/Linux)

The easiest way to install on macOS or Linux:

brew install airframesio/tap/data-archiver

Pre-built Binaries

Download the latest release for your platform from the releases page.

Go Install

go install github.com/airframesio/data-archiver@latest

Build from Source

git clone https://github.com/airframesio/data-archiver.git
cd data-archiver
go build -o data-archiver

πŸš€ Quick Start

data-archiver \
  --db-user myuser \
  --db-password mypass \
  --db-name mydb \
  --table flights \
  --s3-endpoint https://fsn1.your-objectstorage.com \
  --s3-bucket my-archive-bucket \
  --s3-access-key YOUR_ACCESS_KEY \
  --s3-secret-key YOUR_SECRET_KEY \
  --path-template "archives/{table}/{YYYY}/{MM}" \
  --start-date 2024-01-01 \
  --end-date 2024-01-31

Advanced Example with Custom Output:

data-archiver \
  --db-user myuser \
  --db-password mypass \
  --db-name mydb \
  --table flights \
  --s3-endpoint https://fsn1.your-objectstorage.com \
  --s3-bucket my-archive-bucket \
  --s3-access-key YOUR_ACCESS_KEY \
  --s3-secret-key YOUR_SECRET_KEY \
  --path-template "data/{table}/year={YYYY}/month={MM}" \
  --output-format parquet \
  --compression lz4 \
  --compression-level 5 \
  --output-duration daily \
  --start-date 2024-01-01 \
  --end-date 2024-01-31

🎯 Usage

Basic Command Structure

# Archive data to S3
data-archiver archive [flags]

# Dump database using pg_dump to S3
data-archiver dump [flags]

# Restore data from S3
data-archiver restore [flags]

Help Output

Data Archiver

A CLI tool to efficiently archive database data to object storage.
Supports multiple output formats (JSONL/CSV/Parquet), compression types (Zstandard/LZ4/Gzip),
and flexible path templates for S3-compatible storage.

Usage:
  data-archiver [flags]

Flags:
      --viewer                       start embedded cache viewer web server
      --compression string           compression type: zstd, lz4, gzip, none (default "zstd")
      --compression-level int        compression level (zstd: 1-22, lz4/gzip: 1-9, none: 0) (default 3)
      --config string                config file (default is $HOME/.data-archiver.yaml)
      --date-column string           timestamp column name for duration-based splitting (optional)
      --db-host string               PostgreSQL host (default "localhost")
      --db-name string               PostgreSQL database name
      --db-password string           PostgreSQL password
      --db-port int                  PostgreSQL port (default 5432)
      --db-sslmode string            PostgreSQL SSL mode (disable, require, verify-ca, verify-full) (default "disable")
      --db-user string               PostgreSQL user
  -d, --debug                        enable debug output
      --dry-run                      perform a dry run without uploading
      --end-date string              end date (YYYY-MM-DD) (default "2025-08-27")
  -h, --help                         help for data-archiver
      --output-duration string       output file duration: hourly, daily, weekly, monthly, yearly (default "daily")
      --output-format string         output format: jsonl, csv, parquet (default "jsonl")
      --path-template string         S3 path template with placeholders: {table}, {YYYY}, {MM}, {DD}, {HH} (required)
      --s3-access-key string         S3 access key
      --s3-bucket string             S3 bucket name
      --s3-endpoint string           S3-compatible endpoint URL
      --s3-region string             S3 region (default "auto")
      --s3-secret-key string         S3 secret key
      --skip-count                   skip counting rows (faster startup, no progress bars)
      --start-date string            start date (YYYY-MM-DD)
      --table string                 base table name (required)
      --viewer-port int              port for cache viewer web server (default 8080)
      --workers int                  number of parallel workers (default 4)

Required Flags

  • --table - Base table name (without date suffix)
  • --path-template - S3 path template with placeholders (e.g., "archives/{table}/{YYYY}/{MM}")
  • --db-user - PostgreSQL username
  • --db-name - PostgreSQL database name
  • --s3-endpoint - S3-compatible endpoint URL
  • --s3-bucket - S3 bucket name
  • --s3-access-key - S3 access key
  • --s3-secret-key - S3 secret key

Output Configuration Flags

  • --output-format - Output file format: jsonl (default), csv, or parquet
  • --compression - Compression type: zstd (default), lz4, gzip, or none
  • --compression-level - Compression level (default: 3)
    • Zstandard: 1-22 (higher = better compression, slower)
    • LZ4/Gzip: 1-9 (higher = better compression, slower)
  • --output-duration - File duration: hourly, daily (default), weekly, monthly, or yearly
  • --date-column - Optional timestamp column for duration-based splitting
  • --chunk-size - Number of rows to process per chunk (default: 10000, range: 100-1000000)
    • Tune based on average row size for optimal memory usage
    • Smaller chunks for large rows, larger chunks for small rows

βš™οΈ Configuration

The tool supports three configuration methods (in order of precedence):

  1. Command-line flags (highest priority)
  2. Environment variables (prefix: ARCHIVE_)
  3. Configuration file (lowest priority)
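
For example, an explicit flag overrides the corresponding environment variable (hypothetical values, shown only to illustrate precedence):

# ARCHIVE_WORKERS is set to 4, but the --workers flag takes precedence, so 8 workers are used
export ARCHIVE_WORKERS=4
data-archiver --workers 8 --table flights ...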

Environment Variables

export ARCHIVE_DB_HOST=localhost
export ARCHIVE_DB_PORT=5432
export ARCHIVE_DB_USER=myuser
export ARCHIVE_DB_PASSWORD=mypass
export ARCHIVE_DB_NAME=mydb
export ARCHIVE_DB_SSLMODE=disable
export ARCHIVE_S3_ENDPOINT=https://fsn1.your-objectstorage.com
export ARCHIVE_S3_BUCKET=my-bucket
export ARCHIVE_S3_ACCESS_KEY=your_key
export ARCHIVE_S3_SECRET_KEY=your_secret
export ARCHIVE_S3_PATH_TEMPLATE="archives/{table}/{YYYY}/{MM}"
export ARCHIVE_TABLE=flights
export ARCHIVE_OUTPUT_FORMAT=jsonl           # Options: jsonl, csv, parquet
export ARCHIVE_COMPRESSION=zstd              # Options: zstd, lz4, gzip, none
export ARCHIVE_COMPRESSION_LEVEL=3           # zstd: 1-22, lz4/gzip: 1-9
export ARCHIVE_OUTPUT_DURATION=daily         # Options: hourly, daily, weekly, monthly, yearly
export ARCHIVE_WORKERS=8
export ARCHIVE_CACHE_VIEWER=true
export ARCHIVE_VIEWER_PORT=8080

Configuration File

Create ~/.data-archiver.yaml:

db:
  host: localhost
  port: 5432
  user: myuser
  password: mypass
  name: mydb
  sslmode: disable  # Options: disable, require, verify-ca, verify-full

s3:
  endpoint: https://fsn1.your-objectstorage.com
  bucket: my-archive-bucket
  access_key: your_access_key
  secret_key: your_secret_key
  region: auto
  path_template: "archives/{table}/{YYYY}/{MM}"  # S3 path template with placeholders

table: flights
output_format: jsonl          # Options: jsonl, csv, parquet
compression: zstd             # Options: zstd, lz4, gzip, none
compression_level: 3          # zstd: 1-22, lz4/gzip: 1-9
output_duration: daily        # Options: hourly, daily, weekly, monthly, yearly
workers: 8
start_date: "2024-01-01"
end_date: "2024-12-31"
cache_viewer: false           # Enable embedded cache viewer
viewer_port: 8080             # Port for cache viewer web server

πŸ“ Output Structure

Files are organized in S3 according to your configured --path-template, which supports the following placeholders:

  • {table} - Table name
  • {YYYY} - 4-digit year
  • {MM} - 2-digit month
  • {DD} - 2-digit day
  • {HH} - 2-digit hour (for hourly duration)

Example with default settings (--path-template "archives/{table}/{YYYY}/{MM}" --output-format jsonl --compression zstd --output-duration daily):

bucket/
└── archives/
    └── flights/
        └── 2024/
            └── 01/
                ├── flights-2024-01-01.jsonl.zst
                ├── flights-2024-01-02.jsonl.zst
                └── flights-2024-01-03.jsonl.zst

Example with Parquet and LZ4 (--path-template "data/{table}/year={YYYY}/month={MM}" --output-format parquet --compression lz4):

bucket/
└── data/
    └── flights/
        └── year=2024/
            └── month=01/
                ├── flights-2024-01-01.parquet.lz4
                ├── flights-2024-01-02.parquet.lz4
                └── flights-2024-01-03.parquet.lz4

Example with uncompressed CSV (--path-template "{table}/{YYYY}" --output-format csv --compression none --output-duration monthly):

bucket/
└── flights/
    └── 2024/
        ├── flights-2024-01.csv
        ├── flights-2024-02.csv
        └── flights-2024-03.csv

🎨 Features in Detail

Cache Viewer Web Interface

The archiver includes an embedded web server for monitoring cache and progress:

# Start archiver with embedded cache viewer
data-archiver --viewer --viewer-port 8080 [other options]

# Or run standalone cache viewer
data-archiver viewer --port 8080

Features:

  • WebSocket Real-time Updates: Live data streaming with automatic reconnection
  • Interactive Status Panel:
    • Shows current partition being processed with clickable link
    • Displays specific operation (e.g., "Checking if exists", "Extracting", "Compressing", "Uploading")
    • Progress bar with completion percentage and partition count
    • Elapsed time tracking
  • Visual Change Detection: Smooth animations highlight updated cells and stats
  • S3 Upload Status: Shows which files are uploaded vs only processed locally
  • Comprehensive Metrics: Shows both compressed and uncompressed sizes
  • Compression Ratios: Visual display of space savings
  • Error Tracking: Displays last error and timestamp for failed partitions
  • Smart Rendering: No page flashing - only updates changed values
  • Sortable Columns: Click any column header to sort (default: partition name)
  • File Counts: Shows total partitions, processed, uploaded, and errors
  • Process Monitoring: Checks if archiver is currently running via PID
  • Connection Status: Visual indicator shows WebSocket connection state

Access the viewer at http://localhost:8080 (or your configured port).

Technical Details

The cache viewer uses modern web technologies for optimal performance:

  • WebSocket Protocol: Bi-directional communication for instant updates
  • Automatic Reconnection: Reconnects every 2 seconds if connection drops
  • Event-Driven File Monitoring: Uses fsnotify for instant file change detection
  • Efficient Updates: Only transmits and renders changed data
  • No Polling Overhead: WebSocket eliminates the need for HTTP polling

Interactive Progress Display

The tool features a beautiful terminal UI with:

  • Per-partition progress bar: Shows real-time progress for data extraction, compression, and upload
  • Overall progress bar: Tracks completion across all partitions
  • Live statistics: Displays elapsed time, estimated remaining time, and recent completions
  • Row counter: Shows progress through large tables during extraction

Partition Discovery

The tool automatically discovers partitions matching these naming patterns:

  1. Daily partitions (standard): {base_table}_YYYYMMDD
    • Example: flights_20240101, flights_20240102
  2. Daily partitions (with prefix): {base_table}_pYYYYMMDD
    • Example: flights_p20240101, flights_p20240102
  3. Monthly partitions: {base_table}_YYYY_MM
    • Example: flights_2024_01, flights_2024_02
    • Note: Monthly partitions are processed as the first day of the month

For example, if your base table is flights, the tool will find and process all of these:

  • flights_20240101 (daily)
  • flights_p20240102 (daily with prefix)
  • flights_2024_01 (monthly)
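
For reference, here is a minimal sketch of how a daily partition matching the {base_table}_YYYYMMDD pattern might be created in PostgreSQL. This assumes a range-partitioned parent table with a hypothetical departure timestamp column; the archiver itself only matches on table names and does not require declarative partitioning.

# Illustrative only -- not a data-archiver command
psql -d mydb -c "CREATE TABLE flights_20240101 PARTITION OF flights
  FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');"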

JSONL Format

Each row from the partition is exported as a single JSON object on its own line:

{"id":1,"flight_number":"AA123","departure":"2024-01-01T10:00:00Z"}
{"id":2,"flight_number":"UA456","departure":"2024-01-01T11:00:00Z"}

Compression

By default, the tool uses Zstandard compression with:

  • Multi-core parallel compression
  • "Better Compression" preset for optimal size/speed balance
  • Typically achieves 5-10x compression ratios on JSON data

Skip Logic

Files are skipped if:

  • They already exist in S3 at the same path
  • The file size and checksum (MD5 or multipart ETag) match (prevents re-uploading identical data)

πŸ› Debugging

Enable debug mode for detailed output:

data-archiver --debug --table flights ...

Debug mode shows:

  • Database connection details
  • Discovered partitions and row counts
  • Extraction progress (every 10,000 rows)
  • Compression ratios
  • Upload destinations
  • Detailed error messages

πŸƒ Dry Run Mode

Test your configuration without uploading:

data-archiver --dry-run --table flights ...

This will:

  • Connect to the database
  • Discover partitions
  • Extract and compress data
  • Calculate file sizes and MD5 hashes
  • Skip the actual upload

πŸ’Ύ Caching System

The archiver uses an intelligent two-tier caching system to maximize performance:

Row Count Cache

  • Caches partition row counts for 24 hours
  • Speeds up progress bar initialization
  • Always recounts today's partition for accuracy
  • Cache location: ~/.data-archiver/cache/{table}_metadata.json

File Metadata Cache

  • Caches compressed/uncompressed sizes, MD5 hash, and S3 upload status
  • Tracks whether files have been successfully uploaded to S3
  • Enables fast skipping without extraction/compression on subsequent runs
  • Validates against S3 metadata before skipping
  • Preserves all metadata when updating row counts
  • Stores error messages with timestamps for failed uploads
  • File metadata is kept permanently (only row counts expire after 24 hours)

Cache Efficiency

On subsequent runs with cached metadata:

  1. Check cached size/MD5 against S3 (milliseconds)
  2. Skip extraction and compression if match found
  3. Result: 100-1000x faster for already-processed partitions
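
Because the cache is plain JSON on disk, it can be inspected directly; for example (jq is optional, and the exact field layout is internal and may change):

# Inspect the cached metadata for the flights table
cat ~/.data-archiver/cache/flights_metadata.json | jq .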

πŸ“Š Process Monitoring

The archiver provides real-time monitoring capabilities:

PID Tracking

  • Creates PID file at ~/.data-archiver/archiver.pid when running
  • Allows external tools to check if archiver is active
  • Automatically cleaned up on exit

Task Progress File

  • Writes current task details to ~/.data-archiver/current_task.json
  • Includes:
    • Current operation (connecting, counting, extracting, uploading)
    • Progress percentage
    • Total and completed partitions
    • Start time and last update time
  • Updated in real-time during processing
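
Both files are plain text/JSON, so external tooling can consume them directly. A minimal shell sketch (assumes the PID file exists; the task file's field layout is internal and may change):

# Check whether the archiver is currently running
kill -0 "$(cat ~/.data-archiver/archiver.pid)" 2>/dev/null && echo running || echo "not running"

# Watch the current task as it updates
watch -n 1 cat ~/.data-archiver/current_task.json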

Web API Endpoints

The cache viewer provides REST API and WebSocket endpoints:

  • /api/cache - Returns all cached metadata (REST)
  • /api/status - Returns archiver running status and current task (REST)
  • /ws - WebSocket endpoint for real-time updates
    • Sends cache updates when files change
    • Streams status updates during archiving
    • Automatic reconnection support
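
From a shell, the REST endpoints can be queried with curl (response schemas are internal and may change between versions):

# Cached metadata for all partitions
curl -s http://localhost:8080/api/cache

# Archiver status and current task
curl -s http://localhost:8080/api/status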

πŸ” Data Integrity Verification

The archiver ensures data integrity through multiple verification methods:

Single-Part Uploads (files <100MB)

  • Calculates MD5 hash of compressed data
  • Compares with S3 ETag (which is MD5 for single-part uploads)
  • Only skips if both size and MD5 match exactly

Multipart Uploads (files β‰₯100MB)

  • Automatically uses multipart upload for large files
  • Calculates multipart ETag using S3's algorithm
  • Verifies size and multipart ETag match before skipping

Verification Process

  1. First Run: Extract β†’ Compress β†’ Calculate MD5 β†’ Upload β†’ Cache metadata
  2. Subsequent Runs with Cache: Check cache β†’ Compare with S3 β†’ Skip if match
  3. Subsequent Runs without Cache: Extract β†’ Compress β†’ Calculate MD5 β†’ Compare with S3 β†’ Skip or upload
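
To spot-check an object independently of the archiver, compare the local MD5 against the stored ETag; this comparison only holds for single-part uploads, since multipart ETags are not a plain MD5. A sketch using the AWS CLI (assumed to be installed and pointed at your endpoint), with the example path from the Output Structure section:

# Local MD5 of the compressed file
md5sum flights-2024-01-01.jsonl.zst

# ETag reported by the object store
aws s3api head-object \
  --endpoint-url https://fsn1.your-objectstorage.com \
  --bucket my-archive-bucket \
  --key archives/flights/2024/01/flights-2024-01-01.jsonl.zst \
  --query ETag --output text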

πŸ” Examples

Archive Last 30 Days

data-archiver \
  --table events \
  --start-date $(date -d '30 days ago' +%Y-%m-%d) \
  --config ~/.archive-config.yaml
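
The date -d syntax above is GNU date; on macOS (BSD date) use -v instead:

data-archiver \
  --table events \
  --start-date $(date -v-30d +%Y-%m-%d) \
  --config ~/.archive-config.yaml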

Archive Specific Month with Debug

data-archiver \
  --table transactions \
  --start-date 2024-06-01 \
  --end-date 2024-06-30 \
  --debug \
  --workers 8

Dry Run with Custom Config

data-archiver \
  --config production.yaml \
  --table orders \
  --dry-run \
  --debug

πŸ’Ύ Dump Command

The dump subcommand uses PostgreSQL's pg_dump utility to create database dumps in custom format with heavy compression, streaming the output directly to S3.

Dump Basic Usage

data-archiver dump \
  --db-user myuser \
  --db-password mypass \
  --db-name mydb \
  --s3-endpoint https://fsn1.your-objectstorage.com \
  --s3-bucket my-archive-bucket \
  --s3-access-key YOUR_ACCESS_KEY \
  --s3-secret-key YOUR_SECRET_KEY \
  --path-template "dumps/{table}/{YYYY}/{MM}"

Dump with Schema Only

data-archiver dump \
  --db-user myuser \
  --db-password mypass \
  --db-name mydb \
  --dump-mode schema-only \
  --s3-endpoint https://fsn1.your-objectstorage.com \
  --s3-bucket my-archive-bucket \
  --s3-access-key YOUR_ACCESS_KEY \
  --s3-secret-key YOUR_SECRET_KEY \
  --path-template "dumps/{table}/{YYYY}/{MM}"

Dump with Data Only

data-archiver dump \
  --db-user myuser \
  --db-password mypass \
  --db-name mydb \
  --dump-mode data-only \
  --workers 8 \
  --s3-endpoint https://fsn1.your-objectstorage.com \
  --s3-bucket my-archive-bucket \
  --s3-access-key YOUR_ACCESS_KEY \
  --s3-secret-key YOUR_SECRET_KEY \
  --path-template "dumps/{table}/{YYYY}/{MM}"

Dump Specific Table

data-archiver dump \
  --db-user myuser \
  --db-password mypass \
  --db-name mydb \
  --table flights \
  --workers 4 \
  --s3-endpoint https://fsn1.your-objectstorage.com \
  --s3-bucket my-archive-bucket \
  --s3-access-key YOUR_ACCESS_KEY \
  --s3-secret-key YOUR_SECRET_KEY \
  --path-template "dumps/{table}/{YYYY}/{MM}"

Dump Flags

  • --db-host - PostgreSQL host (default: localhost)
  • --db-port - PostgreSQL port (default: 5432)
  • --db-user - PostgreSQL user (required)
  • --db-password - PostgreSQL password (required)
  • --db-name - PostgreSQL database name (required)
  • --db-sslmode - PostgreSQL SSL mode (disable, require, verify-ca, verify-full)
  • --s3-endpoint - S3-compatible endpoint URL (required)
  • --s3-bucket - S3 bucket name (required)
  • --s3-access-key - S3 access key (required)
  • --s3-secret-key - S3 secret key (required)
  • --s3-region - S3 region (default: auto)
  • --path-template - S3 path template with placeholders (required)
  • --table - Table name to dump (optional, dumps entire database if not specified)
  • --workers - Number of parallel jobs for pg_dump (default: 4)
  • --dump-mode - Dump mode: schema-only, data-only, or schema-and-data (default: schema-and-data)

Dump Features

  • Custom Format: Uses PostgreSQL's custom format (-Fc) which supports parallel dumps and compression
  • Heavy Compression: Uses maximum compression level (-Z 9) for optimal file size
  • Parallel Processing: Honors the --workers flag to run multiple parallel jobs
  • Streaming Upload: Streams output directly to S3 without creating intermediate files
  • Flexible Modes: Supports schema-only, data-only, or both schema and data
  • Table-Specific: Can dump individual tables or entire databases
  • Schema-Only Optimization: For schema-only dumps:
    • Automatically discovers and dumps only top-level tables (excludes partitions)
    • Partitions share the same schema as their parent table, so scanning them is unnecessary
    • Use --table flag to dump a specific table's schema
    • Without --table, dumps schemas for all top-level tables
  • Automatic Naming: Generates filenames with timestamp and mode suffix (e.g., flights-schema-20240115-120000.dump)

Dump Examples

Dump entire database:

data-archiver dump \
  --db-user myuser \
  --db-password mypass \
  --db-name mydb \
  --path-template "dumps/{table}/{YYYY}/{MM}"

Dump specific table with parallel processing:

data-archiver dump \
  --db-user myuser \
  --db-password mypass \
  --db-name mydb \
  --table flights \
  --workers 8 \
  --path-template "dumps/{table}/{YYYY}/{MM}"

Dry run dump (validate without uploading):

data-archiver dump \
  --db-user myuser \
  --db-password mypass \
  --db-name mydb \
  --dry-run \
  --path-template "dumps/{table}/{YYYY}/{MM}"

πŸ”„ Restore Command

The restore subcommand reverses the archive process: it downloads files from S3, decompresses them, parses the file format (JSONL/CSV/Parquet), and inserts the data into PostgreSQL tables, creating tables and partitions automatically as needed.

Restore Basic Usage

data-archiver restore \
  --table flights \
  --path-template "archives/{table}/{YYYY}/{MM}" \
  --start-date 2024-01-01 \
  --end-date 2024-01-31

Restore with Partitioning

data-archiver restore \
  --table flights \
  --path-template "archives/{table}/{YYYY}/{MM}" \
  --table-partition-range daily \
  --start-date 2024-01-01 \
  --end-date 2024-01-31

Restore Flags

  • --table - Base table name (required)
  • --path-template - S3 path template matching archive configuration (required)
  • --start-date - Start date filter (YYYY-MM-DD, optional)
  • --end-date - End date filter (YYYY-MM-DD, optional)
  • --table-partition-range - Partition range: hourly, daily, monthly, quarterly, yearly (optional)
  • --output-format - Override format detection: jsonl, csv, parquet (optional, auto-detected from file extensions)
  • --compression - Override compression detection: zstd, lz4, gzip, none (optional, auto-detected from file extensions)

Restore Features

  • Automatic Format Detection: Detects format and compression from file extensions (.jsonl.zst, .csv.lz4, .parquet.gz, etc.)
  • Automatic Table Creation: Creates tables automatically if they don't exist, inferring schema from data
  • Partition Support: Automatically creates partitions based on --table-partition-range:
    • hourly: Creates partitions like table_2024010115
    • daily: Creates partitions like table_20240101
    • monthly: Creates partitions like table_202401
    • quarterly: Creates partitions like table_2024Q1
    • yearly: Creates partitions like table_2024
  • Conflict Handling: Uses ON CONFLICT DO NOTHING to skip existing rows
  • Date Range Filtering: Only restores files matching the specified date range
  • Sequential Processing: Processes files one at a time (parallel support may be added later)
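
After a restore, the created tables and partitions can be listed with standard psql tooling (not part of data-archiver):

# List tables created for the flights base table
psql -d mydb -c '\dt flights*'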

Restore Examples

Restore all files for a table:

data-archiver restore \
  --table flights \
  --path-template "archives/{table}/{YYYY}/{MM}"

Restore specific date range with daily partitions:

data-archiver restore \
  --table flights \
  --path-template "archives/{table}/{YYYY}/{MM}" \
  --table-partition-range daily \
  --start-date 2024-01-01 \
  --end-date 2024-01-31

Restore with format override:

data-archiver restore \
  --table flights \
  --path-template "archives/{table}/{YYYY}/{MM}" \
  --output-format parquet \
  --compression zstd

Dry run restore (validate without inserting):

data-archiver restore \
  --table flights \
  --path-template "archives/{table}/{YYYY}/{MM}" \
  --dry-run

🚨 Error Handling

The tool provides detailed error messages for common issues:

  • Database Connection: Checks connectivity before processing
  • Partition Discovery: Reports invalid partition formats
  • Data Extraction: Handles large datasets with streaming
  • Compression: Reports compression failures and ratios
  • S3 Upload: Retries on transient failures
  • Configuration: Validates all required parameters

πŸ“Š Performance Tips

  1. Increase Workers: Use --workers to process more partitions in parallel
  2. Network: Ensure good bandwidth to S3 endpoint
  3. Database: Add indexes on date columns for faster queries
  4. Memory Management:
    • Tool uses streaming architecture with constant ~150 MB memory footprint
    • Memory usage independent of partition size (no OOM on multi-GB partitions)
    • Tune --chunk-size based on average row size:
      • Small rows (~1 KB): --chunk-size 50000 (~50 MB)
      • Medium rows (~10 KB): --chunk-size 10000 (~100 MB, default)
      • Large rows (~100 KB): --chunk-size 1000 (~100 MB)
      • Very large rows (1+ MB): --chunk-size 100 (~100 MB)
  5. Compression: Multi-core zstd scales with CPU cores
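
Applying the chunk-size guidance above, a table of small rows might be archived with a larger chunk size and more workers (an illustrative starting point, not a benchmark):

data-archiver \
  --table events \
  --workers 8 \
  --chunk-size 50000 \
  --start-date 2024-01-01 \
  --end-date 2024-01-31 \
  --config ~/.archive-config.yaml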

πŸ§ͺ Testing

The project includes a comprehensive test suite covering:

  • Cache Operations: Row count and file metadata caching, TTL expiration, legacy migration
  • Configuration Validation: Required fields, default values, date formats
  • Process Management: PID file operations, task tracking, process status checks

Run tests with:

# Run all tests
go test ./...

# Run with verbose output
go test -v ./...

# Run with coverage
go test -cover ./...

# Run specific tests
go test -run TestPartitionCache ./cmd

🐳 Docker Support

Build and run with Docker:

# Build the Docker image
docker build -t data-archiver .

# Run with environment variables
docker run --rm \
  -e ARCHIVE_DB_HOST=host.docker.internal \
  -e ARCHIVE_DB_USER=myuser \
  -e ARCHIVE_DB_PASSWORD=mypass \
  -e ARCHIVE_DB_NAME=mydb \
  -e ARCHIVE_S3_ENDPOINT=https://s3.example.com \
  -e ARCHIVE_S3_BUCKET=my-bucket \
  -e ARCHIVE_S3_ACCESS_KEY=key \
  -e ARCHIVE_S3_SECRET_KEY=secret \
  -e ARCHIVE_TABLE=events \
  data-archiver

# Run with config file
docker run --rm \
  -v ~/.data-archiver.yaml:/root/.data-archiver.yaml \
  data-archiver

πŸ”§ Development

Prerequisites

  • Go 1.21+
  • PostgreSQL database for testing
  • S3-compatible storage for testing

Building from Source

# Clone the repository
git clone https://github.com/airframesio/data-archiver.git
cd data-archiver

# Install Go dependencies
go mod download

# Install Node.js dependencies (for web asset minification)
npm install

# Minify web assets (CSS, JavaScript, HTML)
npm run minify

# Build the binary (with minified assets embedded)
go build -o data-archiver

# Or use the npm build script which minifies and builds in one command
npm run build

# Run tests
go test ./...

# Build for different platforms
GOOS=linux GOARCH=amd64 go build -o data-archiver-linux-amd64
GOOS=darwin GOARCH=arm64 go build -o data-archiver-darwin-arm64
GOOS=windows GOARCH=amd64 go build -o data-archiver.exe

Web Asset Minification

The cache viewer web UI uses minified assets in production builds to reduce load times:

  • Original size: 98,389 bytes (HTML + CSS + JS + design system)
  • Minified size: 60,995 bytes (38% reduction)

The minification process:

  1. Uses csso-cli for CSS minification
  2. Uses terser for JavaScript minification with mangling and compression
  3. Uses html-minifier-terser for HTML minification
  4. Automatically runs in CI/CD before building binaries

To minify manually:

# Run the minification script
./scripts/minify.sh

# Or use npm
npm run minify

The minified files are automatically embedded into the Go binary during build.

CI/CD

The project uses GitHub Actions for continuous integration:

  • Test Matrix: Tests on Go 1.21.x and 1.22.x
  • Platforms: Linux, macOS, Windows
  • Coverage: Runs tests with coverage reporting
  • Linting: Ensures code quality with golangci-lint
  • Binary Builds: Creates binaries for multiple platforms

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“„ License

MIT License - see LICENSE file for details

πŸ™ Acknowledgments

Built with many awesome open-source libraries.
