Good-enough (mvp) tool for rack health checks to support automated updates #4745

@askfongjojo

(updated title and scope based on the 7/1/2025 control plane sync discussion)
Delivering self-service update introduces a new support problem: we’ll no longer be running the healthcheck script or doing any of the other checks we currently do when we update customers’ systems. If something non-fatal goes wrong during the update (like #7668), how will a customer know? Presumably we need to build some basic health reporting into the external API and deliver that in R17 or shortly after? We haven’t been tracking this work or planning for this in the self-service update project.

  • Need a plan for the minimum we can deliver for R17
  • This is a tool an operator would use as part of the automated update process
  • Initially, have the user trigger and download a support bundle to get health check reports
  • Next, add the ability to interpret the health checks and raise alerts
  • Failures / problems could be “active problems” and use the existing FMA reporting API
  • We can also incorporate CockroachDB SMF status in inventory (along with other data points such as under-replicated ranges) and use that to generate active problems
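
To make the last two bullets concrete, here is a rough sketch of how inventory data points could be turned into "active problems". Everything in it is an assumption for illustration: `InventorySnapshot`, `ActiveProblem`, `derive_active_problems`, the field names, and the severity levels are hypothetical and do not reflect the actual inventory schema or the FMA reporting API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InventorySnapshot:
    """Hypothetical subset of inventory data; field names are assumptions."""
    cockroachdb_smf_state: str   # SMF service state, e.g. "online" or "maintenance"
    ranges_underreplicated: int  # count of under-replicated CockroachDB ranges


@dataclass
class ActiveProblem:
    """Hypothetical shape of a reported problem."""
    severity: str  # "warning" or "critical" (assumed levels)
    summary: str


def derive_active_problems(snapshot: InventorySnapshot) -> List[ActiveProblem]:
    """Map raw inventory data points to a list of active problems."""
    problems: List[ActiveProblem] = []

    # Any SMF state other than "online" means the service is not running normally.
    if snapshot.cockroachdb_smf_state != "online":
        problems.append(ActiveProblem(
            severity="critical",
            summary=f"CockroachDB SMF service is {snapshot.cockroachdb_smf_state}",
        ))

    # Under-replicated ranges are non-fatal but should be visible to the operator.
    if snapshot.ranges_underreplicated > 0:
        problems.append(ActiveProblem(
            severity="warning",
            summary=f"{snapshot.ranges_underreplicated} under-replicated CockroachDB ranges",
        ))

    return problems


if __name__ == "__main__":
    snapshot = InventorySnapshot(cockroachdb_smf_state="maintenance",
                                 ranges_underreplicated=3)
    for problem in derive_active_problems(snapshot):
        print(f"[{problem.severity}] {problem.summary}")
```

An empty list would mean no active problems were derived from that snapshot. Keeping the mapping a pure function of the inventory snapshot would make it easy to run both at update time and from a support bundle.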

(original 1/2/2024 request replaced with the discussion above)
This is intended to be an inexpensive stop-gap that can be used in the field before we have adequate fault management feature coverage. Anything better than a rolled-up bash script of the manual health checks would already be an improvement. If we manage to identify or build a reusable long-term technician API along the way, that would be even better, but it is not a requirement.
