Cluster Reboot Coordinator

The Cluster Reboot Coordinator orchestrates safe, serialized kernel reboots across Linux fleets. Each node runs the coordinator, evaluates reboot requirements, validates cluster health, and acquires a distributed lock in etcd before triggering the configured reboot command.

Overview

The daemon has matured beyond the early building blocks: it now ships the full orchestration loop, lock manager, observability plumbing, packaging assets, and CI/CD automation required by the PRD. Operators receive a predictable service that favours safety, explicit configuration, and verifiable supply-chain artefacts.

Key Features

Detector engine – combines file and command detectors, re-runs checks after lock acquisition, and surfaces detailed timing/exit-code data for diagnostics.
Health gating – executes an operator-supplied script twice (pre- and post-lock) with rich environment variables for node identity, cluster policies, maintenance windows, and optional metrics endpoints.
Cluster-wide health coordination – persists unhealthy node markers in etcd so peers refuse to reboot while any script is reporting failure, keeps publishing each node's health even when no reboot is pending, applies configured cluster policy thresholds (minimum healthy counts, fractions, fallback protections) before allowing another reboot, and clears the block automatically once the node becomes healthy again.
Distributed coordination – etcd-backed mutex with annotated metadata (node, pid, acquired_at) so operators can inspect lock holders during incidents.
Safeguards – kill switch file, dry-run mode, deny/allow maintenance windows, a configurable cooldown between successful reboots, and retry/jitter controls for transient failures.
Observability – structured JSON logs on stderr plus optional Prometheus metrics served from a configurable listener.
Packaging & release – reproducible .deb/.rpm packages with SBOMs, checksums, and cosign signatures produced via the repository Makefile and GitHub Actions pipeline.
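
The cluster policy thresholds mentioned under cluster-wide health coordination can be pictured as a small predicate. The sketch below is illustrative only: the `Policy` type, its field names, and the rule that the rebooting node is subtracted from the healthy count are assumptions, not the project's configuration schema or implementation.

```go
package main

import "fmt"

// Policy is a hypothetical stand-in for the configured cluster policy.
// The field names are illustrative, not the project's configuration keys.
type Policy struct {
	MinHealthyCount    int     // absolute floor of healthy nodes; 0 disables the check
	MinHealthyFraction float64 // floor as a share of all nodes; 0 disables the check
}

// AllowsReboot reports whether one more node may go down. It assumes the
// rebooting node is currently counted as healthy, so both thresholds are
// evaluated against healthy-1.
func (p Policy) AllowsReboot(total, healthy int) bool {
	if total <= 0 || healthy <= 0 {
		return false
	}
	remaining := healthy - 1
	if remaining < p.MinHealthyCount {
		return false
	}
	if p.MinHealthyFraction > 0 && float64(remaining)/float64(total) < p.MinHealthyFraction {
		return false
	}
	return true
}

func main() {
	p := Policy{MinHealthyCount: 3, MinHealthyFraction: 0.6}
	fmt.Println(p.AllowsReboot(5, 5))  // 4 of 5 nodes would stay healthy
	fmt.Println(p.AllowsReboot(5, 3))  // only 2 nodes would stay healthy
	fmt.Println(p.AllowsReboot(10, 6)) // 5 of 10 falls below the fraction
}
```

Evaluating the thresholds against the post-reboot state is the conservative choice here: a node that is about to disappear should not count towards the quorum that justifies its own reboot.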

Repository Layout

cmd/clusterrebootd      # CLI entrypoint and daemon bootstrapper
internal/testutil           # Shared helpers for integration and packaging tests
pkg/config                  # YAML configuration structures, defaults, and validation
pkg/detector                # Pluggable reboot-required detectors (file/command)
pkg/health                  # Cluster health script runner with timeout enforcement
pkg/lock                    # etcd lock manager with metadata annotations
pkg/observability           # JSON logger, event types, and Prometheus collector
pkg/orchestrator            # Orchestration runner, loop, and outcome reporting
pkg/version                 # Version metadata exposed via the CLI
pkg/windows                 # Maintenance window parsing and evaluation
packaging/                  # nfpm config, systemd unit, tmpfiles entry, maintainer scripts, smoke tests
.github/workflows/          # CI and release automation
examples/config.yaml        # Annotated production-style configuration sample

Documentation

Architecture Overview – component model, interfaces, and roadmap.
Operations Guide – installation, configuration, health script guidance, and troubleshooting.
Packaging Blueprint – agreed filesystem layout and packaging contract.
CI Pipeline Blueprint – reference GitHub Actions stages and security posture.
Project State – canonical backlog, next steps, and open questions.

Quick Start

Install Go 1.23 or newer.
Fetch dependencies and run the test suite:
```
go test ./...
```
Alternatively, use make test, which also enforces formatting.
Create a configuration file based on examples/config.yaml. The service defaults to /etc/clusterrebootd/config.yaml but the CLI accepts --config for alternate paths.

Validate your configuration before running the daemon:

clusterrebootd validate-config --config /path/to/config.yaml

Start the coordinator once everything validates. Use --dry-run during initial rollouts to exercise the full loop without rebooting the host.

CLI Commands

clusterrebootd run [--config FILE] [--dry-run] [--once] – start the orchestration loop. --once performs a single diagnostic pass while still honouring lock acquisition and health gating.
clusterrebootd status [--skip-health] [--skip-lock] – execute a dry-run orchestration pass and report the outcome (detectors, health gate, lock). Skipping health or lock annotates the environment so scripts are aware of the bypass.
clusterrebootd simulate – instantiate detectors, execute them once, and print per-detector summaries without contacting etcd or running the health script.
clusterrebootd validate-config – parse and validate the YAML configuration.
clusterrebootd version – print the build metadata.
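
To make the detector pass performed by simulate more concrete, here is a minimal sketch of file and command detectors behind one interface. Everything in it is hypothetical: the `Detector` interface, the type names, and the convention that exit code 0 from a command means "reboot required" are assumptions for illustration, and the real pkg/detector API and its configurable exit-code handling may differ.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"os/exec"
)

// Detector is a hypothetical interface, not the one defined in pkg/detector.
type Detector interface {
	Name() string
	RebootRequired() (bool, error)
}

// FileDetector signals a pending reboot when Path exists.
type FileDetector struct{ Path string }

func (d FileDetector) Name() string { return "file:" + d.Path }

func (d FileDetector) RebootRequired() (bool, error) {
	_, err := os.Stat(d.Path)
	if errors.Is(err, fs.ErrNotExist) {
		return false, nil // a missing marker file is a normal "no" answer
	}
	return err == nil, err
}

// CommandDetector runs Argv. Assumption for this sketch: exit code 0 means a
// reboot is required and any other exit code means it is not.
type CommandDetector struct{ Argv []string }

func (d CommandDetector) Name() string { return fmt.Sprintf("command:%v", d.Argv) }

func (d CommandDetector) RebootRequired() (bool, error) {
	if len(d.Argv) == 0 {
		return false, errors.New("command detector: empty argv")
	}
	err := exec.Command(d.Argv[0], d.Argv[1:]...).Run()
	var exitErr *exec.ExitError
	switch {
	case err == nil:
		return true, nil
	case errors.As(err, &exitErr):
		return false, nil // the command ran and exited non-zero
	default:
		return false, err // the command could not be started
	}
}

func main() {
	detectors := []Detector{
		FileDetector{Path: "/var/run/reboot-required"}, // Debian/Ubuntu marker file
		CommandDetector{Argv: []string{"false"}},       // placeholder command
	}
	for _, d := range detectors {
		required, err := d.RebootRequired()
		fmt.Printf("%s required=%t err=%v\n", d.Name(), required, err)
	}
}
```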

Observability & Telemetry

clusterrebootd run emits structured JSON logs to stderr for each orchestration stage. Entries include timestamps, levels, node identity, event labels, and contextual fields so journald or log shippers can route them without additional parsing. When metrics are enabled the daemon listens on the configured address, serves Prometheus counters/histograms, and exports RC_METRICS_ENDPOINT into the health script environment so custom checks can verify scrapeability.

The etcd lock metadata is stored alongside the mutex key in JSON form:

{"node":"node-a","pid":1234,"acquired_at":"2024-03-07T11:45:12.123Z"}

Use etcdctl get <lock-key> to inspect the current holder during investigations.
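
For tooling that consumes this value, the JSON shown above decodes directly into a small Go struct. This is a minimal sketch: the `LockMetadata` type and `ParseLockMetadata` function are illustrative names, not the project's own, and only the three fields from the example are assumed to exist.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// LockMetadata mirrors the three JSON fields shown in the example value.
// The Go type name is illustrative and may differ from the project's type.
type LockMetadata struct {
	Node       string    `json:"node"`
	PID        int       `json:"pid"`
	AcquiredAt time.Time `json:"acquired_at"` // RFC 3339 timestamp
}

// ParseLockMetadata decodes the JSON value stored alongside the lock key.
func ParseLockMetadata(raw []byte) (LockMetadata, error) {
	var m LockMetadata
	err := json.Unmarshal(raw, &m)
	return m, err
}

func main() {
	raw := []byte(`{"node":"node-a","pid":1234,"acquired_at":"2024-03-07T11:45:12.123Z"}`)
	m, err := ParseLockMetadata(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("held by %s (pid %d) since %s\n", m.Node, m.PID, m.AcquiredAt.Format(time.RFC3339))
}
```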

Build, Packaging, and Release

The top-level Makefile streamlines builds and packaging:

make build – compile a statically linked Linux binary and stage it in dist/.
make package – cross-compile for amd64/arm64, run nfpm to produce .deb/.rpm packages, generate CycloneDX SBOMs via syft, write SHA-256/512 manifests, and create cosign signatures when signing keys are supplied. Outputs live under dist/packages/.
packaging/scripts/verify_artifacts.sh – re-validate checksums and signatures after a build by supplying COSIGN_PUBLIC_KEY.

GitHub Actions mirror this workflow (.github/workflows/ci.yaml) and upload build artefacts on every push/PR. The release workflow rebuilds tagged revisions, generates release notes, and publishes packages, SBOMs, checksums, and signatures to the GitHub Release.

Development Environment

A VS Code Dev Container definition under .devcontainer/ provisions Go 1.22, etcd 3.6.4, and nfpm 2.43.1 so packaging and smoke tests run without additional setup. Launch it via Reopen in Container or devcontainer up. An etcd instance suitable for smoke tests can be started inside the container with:

etcd --data-dir /tmp/etcd-data \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://127.0.0.1:2379

The dev container now ships Podman alongside a passwordless docker shim that proxies to sudo podman, enabling the packaging smoke tests to build and launch privileged containers without extra setup.

Operations

The Operations Guide expands on deployment, maintenance windows, health script practices, and incident response. Highlights include graceful handling of SIGINT/SIGTERM, exponential backoff for transient failures, and recommended packaging workflows.
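
As a rough illustration of exponential backoff with jitter for transient failures, the following sketch doubles a base delay per attempt, caps it, and spreads the result by a configurable fraction. The function names, parameters, and the specific doubling and jitter formulas are assumptions chosen for clarity; the daemon's actual retry logic and configuration keys are described in the Operations Guide and may differ.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Backoff returns base * 2^attempt, capped at max. Attempts count from zero.
func Backoff(attempt int, base, max time.Duration) time.Duration {
	if base <= 0 || max <= 0 {
		return 0
	}
	d := base
	for i := 0; i < attempt; i++ {
		if d > max/2 {
			return max // doubling again would exceed the cap
		}
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

// Jitter shifts d by up to ±fraction of its value. r is a random value in
// [0, 1); passing it in keeps the function deterministic and easy to test.
func Jitter(d time.Duration, fraction, r float64) time.Duration {
	offset := (2*r - 1) * fraction * float64(d)
	return d + time.Duration(offset)
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		d := Backoff(attempt, 500*time.Millisecond, 10*time.Second)
		fmt.Println(attempt, Jitter(d, 0.2, rand.Float64()))
	}
}
```

Jitter matters in a fleet setting because nodes that fail at the same moment would otherwise retry in lockstep and contend for the lock again at the same instant.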

Exit Codes

| Code | Meaning | Returned By |
| --- | --- | --- |
| 0 | Success. No reboot required, prerequisites satisfied, or configuration validated. | All commands on success, including `run --once` and `status` when they report `no_action`, `recheck_cleared`, or `ready`. |
| 1 | Runtime failure. Setup or orchestration error prevented evaluation. | `run`, `run --once`, `status`, `simulate`. |
| 2 | Invalid configuration. | `run`, `status`, `validate-config`. |
| 3 | Blocked by the health script (pre- or post-lock). | `run --once`, `status`, and long-running `run` when terminated while health is blocking. |
| 4 | Lock contention prevented progress. | `run --once`, `status`, and long-running `run` when terminated while unable to acquire the lock. |
| 5 | Kill switch present. | `run --once`, `status`, and long-running `run` when terminated while the kill switch is active. |
| 6 | Detector evaluation failed during simulation. | `simulate`. |
| 64 | CLI usage error (unknown command or flag parsing failure). | All commands. |
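
Automation that wraps the CLI can branch on these codes. The sketch below copies the table into a lookup map and shows one way to recover the exit code from `os/exec`; the helper names `DescribeExit` and `ExitCode` are illustrative, and the example assumes a `clusterrebootd` binary is available on `PATH`.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// exitMeanings is transcribed from the exit code table in this README.
var exitMeanings = map[int]string{
	0:  "success",
	1:  "runtime failure",
	2:  "invalid configuration",
	3:  "blocked by health script",
	4:  "lock contention",
	5:  "kill switch present",
	6:  "detector evaluation failed",
	64: "CLI usage error",
}

// DescribeExit returns a short label for a documented exit code.
func DescribeExit(code int) string {
	if meaning, ok := exitMeanings[code]; ok {
		return meaning
	}
	return fmt.Sprintf("unknown exit code %d", code)
}

// ExitCode extracts the exit code from the error returned by (*exec.Cmd).Run.
// It returns -1 when the process could not be started or was killed by a signal.
func ExitCode(err error) int {
	if err == nil {
		return 0
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode()
	}
	return -1
}

func main() {
	// Assumes clusterrebootd is on PATH; if it is not, Run returns a lookup
	// error and ExitCode reports -1.
	err := exec.Command("clusterrebootd", "status").Run()
	code := ExitCode(err)
	fmt.Printf("status exited with %d: %s\n", code, DescribeExit(code))
}
```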

Development Philosophy

The project prioritises safety, resilience, and long-term maintainability:

Safety: strict configuration validation, conservative defaults, and explicit kill-switch semantics prevent accidental reboot storms.
Extensibility: detectors, health checks, and observability hooks are pluggable so environments can tailor policies.
Testability: extensive unit tests cover configuration parsing, detectors, health execution, locking, packaging assets, and the orchestration loop.

Contributions should continue to respect these principles and the staged roadmap captured in the PRD.
