Good-enough (mvp) tool for rack health checks to support automated updates

(updated title and scope based on the 7/1/2025 control plane sync discussion)
Delivering self-service update introduces a new support problem: we’ll no longer be running the healthcheck script or doing any of the other checks we currently do when we update customers’ systems. If something non-fatal goes wrong during the update (like #7668), how will a customer know? Presumably we need to build some basic health reporting into the external API and deliver that in R17 or shortly after. We haven’t been tracking this work or planning for it in the self-service update project.

- We need a plan for the minimum we can deliver for R17
- This is a tool an operator would use as part of the automated update process
- Initially, the user triggers and downloads a support bundle to get the health check reports
- Next, add the ability to interpret the health checks and alerts
- Failures / problems could be surfaced as “active problems” through the existing FMA reporting API
- We can also incorporate CockroachDB SMF status into inventory (along with other data points such as under-replicated ranges) and use that to generate active problems
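
To make the last two bullets concrete, here is a minimal sketch of how inventory data might be turned into active problems. It is illustrative only: `CockroachNodeStatus`, `ActiveProblem`, `Severity`, and `derive_active_problems` are hypothetical names, the field layout is an assumption about what inventory could record, and nothing here reflects the actual FMA reporting API.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class CockroachNodeStatus:
    """Hypothetical per-node record collected into inventory."""
    node_id: str
    smf_state: str               # SMF service state, e.g. "online" or "maintenance"
    ranges_underreplicated: int  # count of under-replicated ranges seen by this node


@dataclass(frozen=True)
class ActiveProblem:
    """Hypothetical shape of a problem that would be reported to the operator."""
    source: str
    severity: Severity
    summary: str


def derive_active_problems(nodes):
    """Return one ActiveProblem per unhealthy condition found in the records."""
    problems = []
    for node in nodes:
        source = f"cockroachdb/{node.node_id}"
        # Any SMF state other than "online" means the service is not running normally.
        if node.smf_state != "online":
            problems.append(ActiveProblem(
                source=source,
                severity=Severity.CRITICAL,
                summary=f"SMF service state is '{node.smf_state}', expected 'online'",
            ))
        # Under-replicated ranges are non-fatal but worth surfacing after an update.
        if node.ranges_underreplicated > 0:
            problems.append(ActiveProblem(
                source=source,
                severity=Severity.WARNING,
                summary=f"{node.ranges_underreplicated} under-replicated ranges",
            ))
    return problems
```

A caller would pass the latest inventory records and hand the resulting list to whatever reporting path is chosen. Keeping the rules as a pure function over inventory data means the same checks could run against a downloaded support bundle first and a live API later.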



(original 1/2/2040 request replaced with the discussion above)
This is intended to be an inexpensive stop-gap that can be used in the field before we have adequate fault management feature coverage. Anything better than a rolled-up bash script of the [manual health checks](https://github.com/oxidecomputer/meta/blob/master/engineering/rack-support/rack-health-check.adoc) would already be an improvement. If we manage to identify or build a reusable long-term technician API in the process, even better, but that is not a requirement.

Good-enough (mvp) tool for rack health checks to support automated updates #4745
