Description
(updated title and scope based on 7/1/2025 control plane sync discussion)
Delivering self-service update introduces a new support problem: we’ll no longer be running the healthcheck script or doing any of the other checks we currently do when we update customers’ systems. If something non-fatal goes wrong during the update (like #7668), how will a customer know? Presumably we need to build some basic health reporting into the external API and deliver that in R17 or shortly after. We haven’t been tracking or planning for this work in the self-service update project.
- Need a plan for the minimum we can deliver for R17
- This is a tool an operator would use as part of the automated update process
- Initially, have the user trigger and download a support bundle to get health check reports
- Next, add the ability to interpret the health checks and alerts
- Failures / problems could be “active problems” and use the existing FMA reporting API
- Also, we can incorporate CockroachDB SMF status in inventory (along with other data points such as under-replicated ranges) and use that to generate active problems
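
As a rough illustration of the last two bullets, here is a minimal Python sketch of turning inventory data into a list of active problems. Everything in it is an assumption made for illustration: the `CockroachNodeStatus` and `ActiveProblem` types, their field names, and the severity strings are hypothetical and do not reflect the real inventory schema or the FMA reporting API. The only detail borrowed from the real system is that `online` is the healthy SMF service state.

```python
from dataclasses import dataclass


@dataclass
class CockroachNodeStatus:
    # Hypothetical shape of one inventory entry; the real schema
    # is not described in this issue.
    node_id: str
    smf_state: str               # SMF service state, e.g. "online", "maintenance"
    ranges_underreplicated: int  # count of under-replicated ranges seen by this node


@dataclass
class ActiveProblem:
    # Hypothetical stand-in for whatever the FMA reporting API would expose.
    source: str
    severity: str
    summary: str


def derive_active_problems(nodes):
    """Return one ActiveProblem per unhealthy condition found in the inventory."""
    problems = []
    for node in nodes:
        # Any SMF state other than "online" means the service is not running normally.
        if node.smf_state != "online":
            problems.append(ActiveProblem(
                source=node.node_id,
                severity="error",
                summary=f"CockroachDB SMF state is {node.smf_state!r}, expected 'online'",
            ))
        # Under-replicated ranges are non-fatal, so report them at a lower severity.
        if node.ranges_underreplicated > 0:
            problems.append(ActiveProblem(
                source=node.node_id,
                severity="warning",
                summary=f"{node.ranges_underreplicated} under-replicated range(s)",
            ))
    return problems
```

An empty result would mean that none of the checked conditions is present, which is not the same as the system being healthy; the checks cover only what the inventory happens to collect.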
(original 1/2/2040 request replaced with the discussion above)
This is intended to be an inexpensive stop-gap that can be used in the field before we have adequate fault management feature coverage. Anything better than a rolled-up bash script of the manual health checks would already be an improvement. If we manage to identify or build a reusable long-term technician API in the process, so much the better, but that is not a requirement.