✨ Add support for checking Machine conditions in MachineHealthCheck #12827
Conversation
Force-pushed from af68d6e to 7cb44c3.
Force-pushed from 7cb44c3 to ab19424.
MachineHealthCheck currently only allows checking Node conditions to validate whether a machine is healthy. However, Machine conditions capture information that does not exist on Nodes, for example control plane machine conditions such as EtcdPodHealthy and SchedulerPodHealthy, which can indicate whether a control plane machine has been created correctly. Adding support for Machine conditions enables us to perform remediation during control plane upgrades.

This PR introduces a new field as part of MachineHealthCheckChecks:

- `UnhealthyMachineConditions`

This mirrors the behavior of `UnhealthyNodeConditions`, but the MachineHealthCheck controller checks the Machine conditions instead. This reimplements and extends earlier work originally proposed in a previous PR, #12275.

Co-authored-by: Justin Miron <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
Force-pushed from ab19424 to c6a7148.
/test pull-cluster-api-e2e-main
Force-pushed from 2c052cf to 424114f.
Signed-off-by: Furkat Gofurov <[email protected]>
Force-pushed from 424114f to 5ee7d25.
v1beta1conditions.MarkFalse(t.Machine, clusterv1.MachineHealthCheckSucceededV1Beta1Condition, clusterv1.UnhealthyMachineConditionV1Beta1Reason, clusterv1.ConditionSeverityWarning, "Condition %s on Machine is reporting status %s for more than %s", c.Type, c.Status, timeoutSecondsDuration.String())
logger.V(3).Info("Target is unhealthy: condition is in state longer than allowed timeout", "condition", c.Type, "state", c.Status, "timeout", timeoutSecondsDuration.String())

conditions.Set(t.Machine, metav1.Condition{
    Type:    clusterv1.MachineHealthCheckSucceededCondition,
    Status:  metav1.ConditionFalse,
    Reason:  clusterv1.MachineHealthCheckUnhealthyMachineReason,
    Message: fmt.Sprintf("Health check failed: Condition %s on Machine is reporting status %s for more than %s", c.Type, c.Status, timeoutSecondsDuration.String()),
})
return true, time.Duration(0)
Q:
If I'm not wrong, when there are both node unhealthy conditions and machine unhealthy conditions, the latter overrides the former. Is this intentional? Is the current priority (machine over node) OK?
Should we consider a different approach, where we pick one reason or the other but combine all the messages?
@fabriziopandini thanks for catching it, you're absolutely right.
When both node and machine unhealthy conditions are present, the machine conditions will indeed override the node conditions, because:
- Node conditions are processed first
- Machine conditions are processed afterwards
- Both use the same condition type, clusterv1.MachineHealthCheckSucceededCondition, so the conditions.Set() call in the machine conditions loop will overwrite any condition set by the node conditions loop 💥
This means that if both a node condition and a machine condition are unhealthy, only the machine condition will be reflected in the final condition status, and the node condition information will be lost...
I will see how we can combine all the messages 👍🏼
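One possible shape for that is sketched below: collect per-check messages into slices and set the condition only once. This is a minimal sketch, not the PR's final code; the helper name, the nodeReason parameter, and the priority rule are assumptions, and the usual imports (strings, metav1, plus the clusterv1/conditions aliases already used in machinehealthcheck_targets.go) are assumed.

```go
// setUnhealthyCondition is a hypothetical helper: it is called once, after both
// the node-condition and the machine-condition loops have run, so neither set
// of messages is lost. Preferring the machine reason whenever machine messages
// exist is only an example of a possible priority rule.
func setUnhealthyCondition(machine *clusterv1.Machine, nodeReason string, unhealthyNodeMessages, unhealthyMachineMessages []string) {
    reason := nodeReason
    if len(unhealthyMachineMessages) > 0 {
        reason = clusterv1.MachineHealthCheckUnhealthyMachineReason
    }

    // Merge all messages so both node and machine findings show up in the condition.
    messages := make([]string, 0, len(unhealthyNodeMessages)+len(unhealthyMachineMessages))
    messages = append(messages, unhealthyNodeMessages...)
    messages = append(messages, unhealthyMachineMessages...)

    conditions.Set(machine, metav1.Condition{
        Type:    clusterv1.MachineHealthCheckSucceededCondition,
        Status:  metav1.ConditionFalse,
        Reason:  reason,
        Message: "Health check failed:\n* " + strings.Join(messages, "\n* "),
    })
}
```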
I came up with: f96b742 PTAL and let me know what you think!
…iation() method
If both a node condition and machine condition are unhealthy, pick one reason but combine all the messages
Signed-off-by: Furkat Gofurov <[email protected]>
/test pull-cluster-api-e2e-main
Quick update on failing main tests:
--- FAIL: TestMachineHealthCheck_Reconcile (441.16s)
--- FAIL: TestHealthCheckTargets (0.00s)
Currently, I am trying to fix this by relaxing the custom matcher to be order-insensitive and subset-based, and by standardizing the combined unhealthy message to include the prefix. I am open to any other suggestions, thank you.
I think gomega should already not care about the order if we use the right matcher. I also thought we have some matcher that allows ignoring timestamps for condition comparisons (grep for HaveSameStateOf, maybe it's useful).
I have found the root cause of the failures in TestMachineHealthCheck_Reconcile and provided a few suggestions for the computation of the condition (see comments).
Unfortunately I have also found another issue in the needsRemediation func that probably needs a bigger refactor.
The current logic in needsRemediation sort of assumes that checks were only applied to nodes, so e.g. it returns immediately if the node is not showing up at startup, or if the node has been deleted at a later stage:
cluster-api/internal/controllers/machinehealthcheck/machinehealthcheck_targets.go
Lines 134 to 183 in f96b742
if t.Node == nil {
    if timeoutForMachineToHaveNode == disabledNodeStartupTimeout {
        // Startup timeout is disabled so no need to go any further.
        // No node yet to check conditions, can return early here.
        return false, 0
    }

    controlPlaneInitialized := conditions.GetLastTransitionTime(t.Cluster, clusterv1.ClusterControlPlaneInitializedCondition)
    clusterInfraReady := conditions.GetLastTransitionTime(t.Cluster, clusterv1.ClusterInfrastructureReadyCondition)
    machineInfraReady := conditions.GetLastTransitionTime(t.Machine, clusterv1.MachineInfrastructureReadyCondition)
    machineCreationTime := t.Machine.CreationTimestamp.Time

    // Use the latest of the following timestamps.
    comparisonTime := machineCreationTime
    logger.V(5).Info("Determining comparison time",
        "machineCreationTime", machineCreationTime,
        "clusterInfraReadyTime", clusterInfraReady,
        "controlPlaneInitializedTime", controlPlaneInitialized,
        "machineInfraReadyTime", machineInfraReady,
    )
    if conditions.IsTrue(t.Cluster, clusterv1.ClusterControlPlaneInitializedCondition) && controlPlaneInitialized != nil && controlPlaneInitialized.After(comparisonTime) {
        comparisonTime = controlPlaneInitialized.Time
    }
    if conditions.IsTrue(t.Cluster, clusterv1.ClusterInfrastructureReadyCondition) && clusterInfraReady != nil && clusterInfraReady.After(comparisonTime) {
        comparisonTime = clusterInfraReady.Time
    }
    if conditions.IsTrue(t.Machine, clusterv1.MachineInfrastructureReadyCondition) && machineInfraReady != nil && machineInfraReady.After(comparisonTime) {
        comparisonTime = machineInfraReady.Time
    }
    logger.V(5).Info("Using comparison time", "time", comparisonTime)

    timeoutDuration := timeoutForMachineToHaveNode.Duration
    if comparisonTime.Add(timeoutForMachineToHaveNode.Duration).Before(now) {
        v1beta1conditions.MarkFalse(t.Machine, clusterv1.MachineHealthCheckSucceededV1Beta1Condition, clusterv1.NodeStartupTimeoutV1Beta1Reason, clusterv1.ConditionSeverityWarning, "Node failed to report startup in %s", timeoutDuration)
        logger.V(3).Info("Target is unhealthy: machine has no node", "duration", timeoutDuration)

        conditions.Set(t.Machine, metav1.Condition{
            Type:    clusterv1.MachineHealthCheckSucceededCondition,
            Status:  metav1.ConditionFalse,
            Reason:  clusterv1.MachineHealthCheckNodeStartupTimeoutReason,
            Message: fmt.Sprintf("Health check failed: Node failed to report startup in %s", timeoutDuration),
        })
        return true, time.Duration(0)
    }

    durationUnhealthy := now.Sub(comparisonTime)
    nextCheck := timeoutDuration - durationUnhealthy + time.Second

    return false, nextCheck
}
cluster-api/internal/controllers/machinehealthcheck/machinehealthcheck_targets.go
Lines 105 to 116 in f96b742
if t.nodeMissing {
    logger.V(3).Info("Target is unhealthy: node is missing")
    v1beta1conditions.MarkFalse(t.Machine, clusterv1.MachineHealthCheckSucceededV1Beta1Condition, clusterv1.NodeNotFoundV1Beta1Reason, clusterv1.ConditionSeverityWarning, "")

    conditions.Set(t.Machine, metav1.Condition{
        Type:    clusterv1.MachineHealthCheckSucceededCondition,
        Status:  metav1.ConditionFalse,
        Reason:  clusterv1.MachineHealthCheckNodeDeletedReason,
        Message: fmt.Sprintf("Health check failed: Node %s has been deleted", t.Machine.Status.NodeRef.Name),
    })
    return true, time.Duration(0)
}
While this code structure worked well when we were only checking node conditions at the end of the func, it does not work well with the addition of the machine condition checks at the end of the func.
More specifically, I think we should find a way to always check machine conditions, not only when the node exists / the func doesn't hit the two if branches highlighted above, as in the current implementation.
Additionally, we should make sure that we merge messages from machine conditions and from node conditions in all possible scenarios:
- when the node is not showing up at startup
- when the node has been deleted at a later stage
- when the node exists (which is the only scenario covered in the current change set)
I will try to come up with some ideas to solve this problem, but of course suggestions are more than welcome.
unhealthyMessages       []string
unhealthyReasons        []string
foundUnhealthyCondition bool
I would simplify this by keeping track only of the messages (everything else can be inferred from the two message arrays when setting the condition).
I did simplify it by removing unhealthyReasons and foundUnhealthyCondition as suggested, and now use only the unhealthyNodeMessages & unhealthyMachineMessages arrays to infer the removed information.
UnhealthyMachineConditions: []clusterv1.UnhealthyMachineCondition{
    {
        Type:           clusterv1.MachineReadyCondition,
        Status:         metav1.ConditionUnknown,
        TimeoutSeconds: ptr.To(int32(5 * 60)),
    },
},
Looks like this additional check is the reason why most of the tests in TestMachineHealthCheck_Reconcile are failing.
What happens is that in those tests the machine's Ready condition is Unknown, and thus it matches the rule above.
When a machine matches a rule, even if the timeout has not expired, it is no longer counted as a healthy machine when computing MHC's replica counters, and this is what triggers the test failures (unexpected replica counters).
The relevant code is here:
cluster-api/internal/controllers/machinehealthcheck/machinehealthcheck_targets.go
Lines 404 to 416 in f96b742
if nextCheck > 0 {
    logger.V(3).Info("Target is likely to go unhealthy", "timeUntilUnhealthy", nextCheck.Truncate(time.Second).String())
    r.recorder.Eventf(
        t.Machine,
        corev1.EventTypeNormal,
        EventDetectedUnhealthy,
        "Machine %s has unhealthy Node %s",
        klog.KObj(t.Machine),
        t.nodeName(),
    )
    nextCheckTimes = append(nextCheckTimes, nextCheck)
    continue
}
Two things are worth noticing:
- When a machine matches a rule but the timeout has not expired, MHC counts the machine neither as healthy nor as unhealthy. The machine is considered "likely to go unhealthy", which is something that doesn't surface anywhere in status 😓
- As you might notice, when a machine enters the "likely to go unhealthy" state, MHC generates an event; the event message is "Machine %s has unhealthy Node %s", and it should be changed because now we are also checking machine conditions. I would suggest using something like "Machine %s (Node %s) is failing machine health check rules and it is likely to go unhealthy" as the new message (suggestions are welcome).
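Applied to the Eventf call quoted above, the suggested wording would look roughly like the following sketch (not the final implementation; a later comment in this thread suggests dropping the event entirely):

```go
// Sketch: same event as before, with a message that covers both node and machine checks.
r.recorder.Eventf(
    t.Machine,
    corev1.EventTypeNormal,
    EventDetectedUnhealthy,
    "Machine %s (Node %s) is failing machine health check rules and it is likely to go unhealthy",
    klog.KObj(t.Machine),
    t.nodeName(),
)
```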
Thanks, it makes sense to use the event message you suggested 👍🏼
> Looks like this additional check is the reason why most of the tests in TestMachineHealthCheck_Reconcile are failing.

Yes, that was it. I have now removed it from newMachineHealthCheck, since ALL tests that use this helper suddenly get machine condition evaluation enabled. Instead, I added it directly to the specific test, and the tests started passing again!
@furkatgofurov7 let's drop the event https://github.com/furkatgofurov7/cluster-api/blob/2c1ca0374c287e0016a2323a587c42045878c6f0/internal/controllers/machinehealthcheck/machinehealthcheck_targets.go#L447 (it is very noisy; it is generated at every reconcile for every machine that does not meet a check, and for the entire time until the timeout expires)
Signed-off-by: Furkat Gofurov <[email protected]>
Refactors `needsRemediation`; specifically, the following changes were made:
- Move machine condition evaluation to always execute first, regardless of node state
- Ensure machine conditions are checked in ALL scenarios:
  * When the node is missing (t.nodeMissing)
  * When the node hasn't appeared yet (t.Node == nil)
  * When the node exists (t.Node != nil)
- Consistently merge node and machine condition messages in all failure scenarios
- Maintain backward compatibility with existing condition message formats
- Use appropriate condition reasons based on which conditions are unhealthy

Signed-off-by: Furkat Gofurov <[email protected]>
/test pull-cluster-api-test-main
Signed-off-by: Furkat Gofurov <[email protected]>
@fabriziopandini Thanks for the detailed feedback! You're absolutely right about the inconsistent behavior. I've now refactored needsRemediation; the changes I made are listed in the commit message above.
Let me know what you think about the refactor.
@sbueringer, thanks for another round of feedback on the conversion; hopefully all your suggestions are incorporated now.
/test pull-cluster-api-e2e-main
Thanks @furkatgofurov7 for this iteration!
I'm wondering if we can further simplify the code / improve readability by using two sub-functions, one for machineChecks and the other for nodeChecks.
The resulting needsRemediation would look like:
func (t *healthCheckTarget) needsRemediation(logger logr.Logger, timeoutForMachineToHaveNode metav1.Duration) (bool, time.Duration) {
    // checks for HasRemediateMachineAnnotation, ClusterControlPlaneInitializedCondition, ClusterInfrastructureReadyCondition
    ...

    // Check machine conditions
    unhealthyMachineMessages, nextMachineCheck := t.machineChecks(logger)

    // Check node conditions
    nodeConditionReason, nodeV1beta1ConditionReason, unhealthyNodeMessages, nextNodeCheck := t.nodeChecks(logger, timeoutForMachineToHaveNode)

    // Combine results and set conditions
    ...
}
Another benefit of this code structure is that condition management is implemented in only one place.
In case it can help, this is a commit where I experimented a bit with this idea.
wdyt?
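For illustration, the "combine results and set conditions" step at the end of needsRemediation could look roughly like the fragment below. It assumes the machineChecks/nodeChecks signatures from the snippet above, omits the v1beta1 condition handling, and is a sketch rather than the PR's final code.

```go
// At least one check failed and its timeout expired: remediate now.
if len(unhealthyNodeMessages) > 0 || len(unhealthyMachineMessages) > 0 {
    // Set the MachineHealthCheckSucceeded condition exactly once here,
    // merging all node and machine messages into a single multiline message.
    return true, time.Duration(0)
}

// Target is healthy so far: requeue at the earliest upcoming check, if any.
nextCheck := nextNodeCheck
if nextMachineCheck > 0 && (nextCheck == 0 || nextMachineCheck < nextCheck) {
    nextCheck = nextMachineCheck
}
return false, nextCheck
```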
    v1beta1Reason = clusterv1.UnhealthyMachineConditionV1Beta1Reason
}

// For v1beta2 we use a single-line message prefixed with "Health check failed: "
For v1beta2 we should use a multiline message with a * prefix on every message (I pasted an example of how to compute it in my previous comment).
Also, we usually avoid calling out v1beta2 (v1beta2 conditions are just conditions), so we won't need to fix the comment when the v1beta1 code goes away.
(Same for the other places where we are computing the condition message.)
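For example, with one unhealthy node condition and one unhealthy machine condition, the combined condition message might render like this (illustrative wording only; the per-entry text mirrors the existing "Condition %s on Machine/Node is reporting status %s for more than %s" messages):

```
Health check failed:
* Condition Ready on Node is reporting status Unknown for more than 5m0s
* Condition EtcdPodHealthy on Machine is reporting status False for more than 5m0s
```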
// the node has not been set yet
if t.Node == nil {
    // Check if we already have unhealthy machine conditions that should trigger remediation
    if len(unhealthyMachineMessages) > 0 {
Looks like when len(unhealthyMachineMessages) > 0 we return without checking the node startup timeout.
…ns: one for machineChecks and the other for nodeChecks. Another benefit of this code structure is that condition management is implemented in only one place.
Co-authored-by: Fabrizio Pandini
Signed-off-by: Furkat Gofurov <[email protected]>
/test pull-cluster-api-e2e-main
What this PR does / why we need it:
MachineHealthCheck currently only allows checking Node conditions to validate whether a machine is healthy. However, Machine conditions capture information that does not exist on Nodes, for example control plane machine conditions such as EtcdPodHealthy and SchedulerPodHealthy, which can indicate whether a control plane machine has been created correctly.
Adding support for Machine conditions enables us to perform remediation during control plane upgrades.
This PR introduces a new field as part of the MachineHealthCheckChecks:
- `UnhealthyMachineConditions`

This mirrors the behavior of `UnhealthyNodeConditions`, but the MachineHealthCheck controller checks the Machine conditions instead. This reimplements and extends the work originally proposed by @justinmir in PR #12275.
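For illustration only, a checks configuration combining both node and machine rules might look roughly like the Go sketch below. It is based on the test snippet discussed in the review; the node-side field names are assumed to mirror the machine-side ones, and "EtcdPodHealthy" is written as a plain string for one of the control plane machine conditions mentioned above.

```go
checks := clusterv1.MachineHealthCheckChecks{
    // Existing behavior: rules evaluated against Node conditions.
    UnhealthyNodeConditions: []clusterv1.UnhealthyNodeCondition{
        {
            Type:           corev1.NodeReady,
            Status:         corev1.ConditionUnknown,
            TimeoutSeconds: ptr.To(int32(5 * 60)),
        },
    },
    // New in this PR: rules evaluated against Machine conditions.
    UnhealthyMachineConditions: []clusterv1.UnhealthyMachineCondition{
        {
            Type:           "EtcdPodHealthy",
            Status:         metav1.ConditionFalse,
            TimeoutSeconds: ptr.To(int32(5 * 60)),
        },
    },
}
```

In YAML this would correspond to spec.checks.unhealthyNodeConditions / spec.checks.unhealthyMachineConditions, again assuming the field naming follows the pattern above.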
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #5450
Label(s) to be applied
/kind feature
/area machinehealthcheck
Notes for Reviewers
We updated the tests to validate the new MachineHealthCheck code paths for UnhealthyMachineConditions in the following ways:
- internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go includes a new envtest.
- internal/controllers/machinehealthcheck/machinehealthcheck_targets_test.go includes a unit test verifying that the machine will need remediation.
- The remaining test changes are boilerplate to ensure that this doesn't break existing functionality: every place we use UnhealthyNodeConditions, we also specify UnhealthyMachineConditions.
Core Logic Refactor: Modified needsRemediation() in machinehealthcheck_targets.go to always evaluate machine conditions regardless of node state and to merge node and machine condition messages into a single condition (see the refactor commit above for the full list of changes).
Event Message Updates: The "Machine %s has unhealthy Node %s" event message was revisited, since machine conditions are now checked as well (see the review discussion above).