✨ Add support for checking Machine conditions in MachineHealthCheck #12827
Conversation
Force-pushed from af68d6e to 7cb44c3.
Force-pushed from 7cb44c3 to ab19424.
MachineHealthCheck currently only allows checking Node conditions to validate whether a machine is healthy. However, Machine conditions capture information that does not exist on Nodes, for example control plane machine conditions such as EtcdPodHealthy and SchedulerPodHealthy, which can indicate whether a control plane machine has been created correctly. Adding support for Machine conditions enables us to perform remediation during control plane upgrades.

This PR introduces a new field as part of MachineHealthCheckChecks:

- `UnhealthyMachineConditions`

This mirrors the behavior of `UnhealthyNodeConditions`, but the MachineHealthCheck controller checks the Machine conditions instead. This reimplements and extends earlier work originally proposed in a previous PR, #12275.

Co-authored-by: Justin Miron <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
Force-pushed from ab19424 to c6a7148.
/test pull-cluster-api-e2e-main
Force-pushed from 2c052cf to 424114f.
Signed-off-by: Furkat Gofurov <[email protected]>
Force-pushed from 424114f to 5ee7d25.
v1beta1conditions.MarkFalse(t.Machine, clusterv1.MachineHealthCheckSucceededV1Beta1Condition, clusterv1.UnhealthyMachineConditionV1Beta1Reason, clusterv1.ConditionSeverityWarning, "Condition %s on Machine is reporting status %s for more than %s", c.Type, c.Status, timeoutSecondsDuration.String())
logger.V(3).Info("Target is unhealthy: condition is in state longer than allowed timeout", "condition", c.Type, "state", c.Status, "timeout", timeoutSecondsDuration.String())

conditions.Set(t.Machine, metav1.Condition{
    Type:    clusterv1.MachineHealthCheckSucceededCondition,
    Status:  metav1.ConditionFalse,
    Reason:  clusterv1.MachineHealthCheckUnhealthyMachineReason,
    Message: fmt.Sprintf("Health check failed: Condition %s on Machine is reporting status %s for more than %s", c.Type, c.Status, timeoutSecondsDuration.String()),
})
return true, time.Duration(0)
Q:
If I'm not wrong, when there are both node unhealthy conditions and machine unhealthy conditions, the latter overrides the former. Is this intentional? Is the current priority (machine over node) OK?
Should we consider a different approach, where we pick one reason or the other but combine all the messages?
@fabriziopandini thanks for catching it, you're absolutely right.
When both node and machine unhealthy conditions are present, the machine conditions will indeed override the node conditions, because:
- Node conditions are processed first
- Machine conditions are processed afterwards
- Both use the same condition type, clusterv1.MachineHealthCheckSucceededCondition, so the conditions.Set() call in the machine conditions loop will overwrite any condition set by the node conditions loop 💥
This means that if both a node condition and a machine condition are unhealthy, only the machine condition will be reflected in the final condition status, and the node condition information will be lost...
I will see how we can combine all the messages 👍🏼
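One possible shape for that is sketched below: collect per-check messages into slices and set the condition only once. This is a minimal sketch, not the PR's final code; the helper name, the nodeReason parameter, and the priority rule are assumptions, and the usual imports (strings, metav1, plus the clusterv1/conditions aliases already used in machinehealthcheck_targets.go) are assumed.

```go
// setUnhealthyCondition is a hypothetical helper: it is called once, after both
// the node-condition and the machine-condition loops have run, so neither set
// of messages is lost. Preferring the machine reason whenever machine messages
// exist is only an example of a possible priority rule.
func setUnhealthyCondition(machine *clusterv1.Machine, nodeReason string, unhealthyNodeMessages, unhealthyMachineMessages []string) {
    reason := nodeReason
    if len(unhealthyMachineMessages) > 0 {
        reason = clusterv1.MachineHealthCheckUnhealthyMachineReason
    }

    // Merge all messages so both node and machine findings show up in the condition.
    messages := make([]string, 0, len(unhealthyNodeMessages)+len(unhealthyMachineMessages))
    messages = append(messages, unhealthyNodeMessages...)
    messages = append(messages, unhealthyMachineMessages...)

    conditions.Set(machine, metav1.Condition{
        Type:    clusterv1.MachineHealthCheckSucceededCondition,
        Status:  metav1.ConditionFalse,
        Reason:  reason,
        Message: "Health check failed:\n* " + strings.Join(messages, "\n* "),
    })
}
```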
I came up with: f96b742 PTAL and let me know what you think!
…iation() method
If both a node condition and machine condition are unhealthy, pick one reason but combine all the messages
Signed-off-by: Furkat Gofurov <[email protected]>
/test pull-cluster-api-e2e-main
Quick update on failing main tests:
--- FAIL: TestMachineHealthCheck_Reconcile (441.16s)
--- FAIL: TestHealthCheckTargets (0.00s)
Currently, I am trying to fix this by relaxing the custom matcher to be order-insensitive and subset-based, and by standardizing the combined unhealthy message to include the prefix. I am open to any other suggestions, thank you.
I think gomega should already not care about the order if we use the right matcher. I also thought we have some matcher that allows ignoring timestamps for condition comparisons (grep for HaveSameStateOf, maybe it's useful).
I have found the root cause of the failures in TestMachineHealthCheck_Reconcile and provided a few suggestions for the computation of the condition (see comments).
Unfortunately I have also found another issue in the needsRemediation func that probably needs a bigger refactor.
The current logic in needsRemediation sort of assumes that checks were only applied to nodes, so e.g. it returns immediately if the node is not showing up at startup, or if the node has been deleted at a later stage:
cluster-api/internal/controllers/machinehealthcheck/machinehealthcheck_targets.go
Lines 134 to 183 in f96b742
if t.Node == nil {
    if timeoutForMachineToHaveNode == disabledNodeStartupTimeout {
        // Startup timeout is disabled so no need to go any further.
        // No node yet to check conditions, can return early here.
        return false, 0
    }

    controlPlaneInitialized := conditions.GetLastTransitionTime(t.Cluster, clusterv1.ClusterControlPlaneInitializedCondition)
    clusterInfraReady := conditions.GetLastTransitionTime(t.Cluster, clusterv1.ClusterInfrastructureReadyCondition)
    machineInfraReady := conditions.GetLastTransitionTime(t.Machine, clusterv1.MachineInfrastructureReadyCondition)
    machineCreationTime := t.Machine.CreationTimestamp.Time

    // Use the latest of the following timestamps.
    comparisonTime := machineCreationTime
    logger.V(5).Info("Determining comparison time",
        "machineCreationTime", machineCreationTime,
        "clusterInfraReadyTime", clusterInfraReady,
        "controlPlaneInitializedTime", controlPlaneInitialized,
        "machineInfraReadyTime", machineInfraReady,
    )
    if conditions.IsTrue(t.Cluster, clusterv1.ClusterControlPlaneInitializedCondition) && controlPlaneInitialized != nil && controlPlaneInitialized.After(comparisonTime) {
        comparisonTime = controlPlaneInitialized.Time
    }
    if conditions.IsTrue(t.Cluster, clusterv1.ClusterInfrastructureReadyCondition) && clusterInfraReady != nil && clusterInfraReady.After(comparisonTime) {
        comparisonTime = clusterInfraReady.Time
    }
    if conditions.IsTrue(t.Machine, clusterv1.MachineInfrastructureReadyCondition) && machineInfraReady != nil && machineInfraReady.After(comparisonTime) {
        comparisonTime = machineInfraReady.Time
    }
    logger.V(5).Info("Using comparison time", "time", comparisonTime)

    timeoutDuration := timeoutForMachineToHaveNode.Duration
    if comparisonTime.Add(timeoutForMachineToHaveNode.Duration).Before(now) {
        v1beta1conditions.MarkFalse(t.Machine, clusterv1.MachineHealthCheckSucceededV1Beta1Condition, clusterv1.NodeStartupTimeoutV1Beta1Reason, clusterv1.ConditionSeverityWarning, "Node failed to report startup in %s", timeoutDuration)
        logger.V(3).Info("Target is unhealthy: machine has no node", "duration", timeoutDuration)

        conditions.Set(t.Machine, metav1.Condition{
            Type:    clusterv1.MachineHealthCheckSucceededCondition,
            Status:  metav1.ConditionFalse,
            Reason:  clusterv1.MachineHealthCheckNodeStartupTimeoutReason,
            Message: fmt.Sprintf("Health check failed: Node failed to report startup in %s", timeoutDuration),
        })
        return true, time.Duration(0)
    }

    durationUnhealthy := now.Sub(comparisonTime)
    nextCheck := timeoutDuration - durationUnhealthy + time.Second

    return false, nextCheck
}
cluster-api/internal/controllers/machinehealthcheck/machinehealthcheck_targets.go
Lines 105 to 116 in f96b742
if t.nodeMissing {
    logger.V(3).Info("Target is unhealthy: node is missing")
    v1beta1conditions.MarkFalse(t.Machine, clusterv1.MachineHealthCheckSucceededV1Beta1Condition, clusterv1.NodeNotFoundV1Beta1Reason, clusterv1.ConditionSeverityWarning, "")

    conditions.Set(t.Machine, metav1.Condition{
        Type:    clusterv1.MachineHealthCheckSucceededCondition,
        Status:  metav1.ConditionFalse,
        Reason:  clusterv1.MachineHealthCheckNodeDeletedReason,
        Message: fmt.Sprintf("Health check failed: Node %s has been deleted", t.Machine.Status.NodeRef.Name),
    })
    return true, time.Duration(0)
}
While this code structure worked well when we were only checking node conditions at the end of the func, it does not work well with the addition of the machine condition checks at the end of the func.
More specifically, I think we should find a way to always check machine conditions, not only when the node exists / the func doesn't hit the two if branches highlighted above, as in the current implementation.
Additionally, we should make sure that we merge messages from machine conditions and from node conditions in all possible scenarios:
- when the node is not showing up at startup
- when the node has been deleted at a later stage
- when the node exists (which is the only scenario covered in the current change set)
I will try to come up with some ideas to solve this problem, but of course suggestions are more than welcome.
unhealthyMessages       []string
unhealthyReasons        []string
foundUnhealthyCondition bool
I would simplify this by keeping track only of the messages (everything else can be inferred from the two message arrays when setting the condition).
I did simplify it by removing unhealthyReasons and foundUnhealthyCondition as suggested, and now use only the unhealthyNodeMessages & unhealthyMachineMessages arrays to infer the removed information.
UnhealthyMachineConditions: []clusterv1.UnhealthyMachineCondition{
    {
        Type:           clusterv1.MachineReadyCondition,
        Status:         metav1.ConditionUnknown,
        TimeoutSeconds: ptr.To(int32(5 * 60)),
    },
},
Looks like this additional check is the reason why most of the tests in TestMachineHealthCheck_Reconcile are failing.
What happens is that in those tests the machine's Ready condition is Unknown, and thus it matches the rule above.
When a machine matches a rule, even if the timeout has not expired, it is no longer counted as a healthy machine when computing MHC's replica counters, and this is what triggers the test failures (unexpected replica counters).
The relevant code is here:
cluster-api/internal/controllers/machinehealthcheck/machinehealthcheck_targets.go
Lines 404 to 416 in f96b742
if nextCheck > 0 {
    logger.V(3).Info("Target is likely to go unhealthy", "timeUntilUnhealthy", nextCheck.Truncate(time.Second).String())
    r.recorder.Eventf(
        t.Machine,
        corev1.EventTypeNormal,
        EventDetectedUnhealthy,
        "Machine %s has unhealthy Node %s",
        klog.KObj(t.Machine),
        t.nodeName(),
    )
    nextCheckTimes = append(nextCheckTimes, nextCheck)
    continue
}
Two things are worth noticing:
- When a machine matches a rule but the timeout has not expired, MHC counts the machine neither as healthy nor as unhealthy. The machine is considered "likely to go unhealthy", which is something that doesn't surface anywhere in status 😓
- As you might notice, when a machine enters the "likely to go unhealthy" state, MHC generates an event; the event message is "Machine %s has unhealthy Node %s", and it should be changed because now we are also checking machine conditions. I would suggest using something like "Machine %s (Node %s) is failing machine health check rules and it is likely to go unhealthy" as the new message (suggestions are welcome).
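Applied to the Eventf call quoted above, the suggested wording would look roughly like the following sketch (not the final implementation; a later comment in this thread suggests dropping the event entirely):

```go
// Sketch: same event as before, with a message that covers both node and machine checks.
r.recorder.Eventf(
    t.Machine,
    corev1.EventTypeNormal,
    EventDetectedUnhealthy,
    "Machine %s (Node %s) is failing machine health check rules and it is likely to go unhealthy",
    klog.KObj(t.Machine),
    t.nodeName(),
)
```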
Thanks, it makes sense to use the event message you suggested 👍🏼
> Looks like this additional check is the reason why most of the tests in TestMachineHealthCheck_Reconcile are failing.

Yes, that was it. I have now removed it from newMachineHealthCheck, since ALL tests that use this helper suddenly get machine condition evaluation enabled. Instead, I added it directly to the specific test, and the tests started passing again!
@furkatgofurov7 let's drop the event https://github.com/furkatgofurov7/cluster-api/blob/2c1ca0374c287e0016a2323a587c42045878c6f0/internal/controllers/machinehealthcheck/machinehealthcheck_targets.go#L447 (it is very noisy; it is generated at every reconcile for every machine that does not meet a check, and for the entire time until the timeout expires)
Signed-off-by: Furkat Gofurov <[email protected]>
Refactors `needsRemediation`; specifically, the following changes were made:
- Move machine condition evaluation to always execute first, regardless of node state
- Ensure machine conditions are checked in ALL scenarios:
  * When the node is missing (t.nodeMissing)
  * When the node hasn't appeared yet (t.Node == nil)
  * When the node exists (t.Node != nil)
- Consistently merge node and machine condition messages in all failure scenarios
- Maintain backward compatibility with existing condition message formats
- Use appropriate condition reasons based on which conditions are unhealthy

Signed-off-by: Furkat Gofurov <[email protected]>
/test pull-cluster-api-test-main
Signed-off-by: Furkat Gofurov <[email protected]>
@fabriziopandini Thanks for the detailed feedback! You're absolutely right about the inconsistent behavior. I've now refactored needsRemediation; the changes I made are listed in the commit message above.
Let me know what you think about the refactor.
@sbueringer, thanks for another round of feedback on the conversion; hopefully all your suggestions are incorporated now.
/test pull-cluster-api-e2e-main
Thanks @furkatgofurov7 for this iteration!
I'm wondering if we can further simplify the code / improve readability by using two sub-functions, one for machineChecks and the other for nodeChecks.
The resulting needsRemediation would look like:
func (t *healthCheckTarget) needsRemediation(logger logr.Logger, timeoutForMachineToHaveNode metav1.Duration) (bool, time.Duration) {
    // checks for HasRemediateMachineAnnotation, ClusterControlPlaneInitializedCondition, ClusterInfrastructureReadyCondition
    ...

    // Check machine conditions
    unhealthyMachineMessages, nextMachineCheck := t.machineChecks(logger)

    // Check node conditions
    nodeConditionReason, nodeV1beta1ConditionReason, unhealthyNodeMessages, nextNodeCheck := t.nodeChecks(logger, timeoutForMachineToHaveNode)

    // Combine results and set conditions
    ...
}
Another benefit of this code structure is that condition management is implemented in only one place.
In case it can help, this is a commit where I experimented a bit with this idea.
wdyt?
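For illustration, the "combine results and set conditions" step at the end of needsRemediation could look roughly like the fragment below. It assumes the machineChecks/nodeChecks signatures from the snippet above, omits the v1beta1 condition handling, and is a sketch rather than the PR's final code.

```go
// At least one check failed and its timeout expired: remediate now.
if len(unhealthyNodeMessages) > 0 || len(unhealthyMachineMessages) > 0 {
    // Set the MachineHealthCheckSucceeded condition exactly once here,
    // merging all node and machine messages into a single multiline message.
    return true, time.Duration(0)
}

// Target is healthy so far: requeue at the earliest upcoming check, if any.
nextCheck := nextNodeCheck
if nextMachineCheck > 0 && (nextCheck == 0 || nextMachineCheck < nextCheck) {
    nextCheck = nextMachineCheck
}
return false, nextCheck
```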
    v1beta1Reason = clusterv1.UnhealthyMachineConditionV1Beta1Reason
}

// For v1beta2 we use a single-line message prefixed with "Health check failed: "
For v1beta2 we should use a multiline message with a * prefix on every message (I pasted an example of how to compute it in my previous comment).
Also, we usually avoid calling out v1beta2 (v1beta2 conditions are just conditions), so we won't need to fix the comment when the v1beta1 code goes away.
(Same for the other places where we are computing the condition message.)
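For example, with one unhealthy node condition and one unhealthy machine condition, the combined condition message might render like this (illustrative wording only; the per-entry text mirrors the existing "Condition %s on Machine/Node is reporting status %s for more than %s" messages):

```
Health check failed:
* Condition Ready on Node is reporting status Unknown for more than 5m0s
* Condition EtcdPodHealthy on Machine is reporting status False for more than 5m0s
```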
// the node has not been set yet
if t.Node == nil {
    // Check if we already have unhealthy machine conditions that should trigger remediation
    if len(unhealthyMachineMessages) > 0 {
Looks like when len(unhealthyMachineMessages) > 0 we return without checking the node startup timeout.
…ns: one for machineChecks and the other for nodeChecks. Another benefit of this code structure is that condition management is implemented in only one place.
Co-authored-by: Fabrizio Pandini
Signed-off-by: Furkat Gofurov <[email protected]>
/test pull-cluster-api-e2e-main
What this PR does / why we need it:
MachineHealthCheck currently only allows checking Node conditions to validate whether a machine is healthy. However, Machine conditions capture information that does not exist on Nodes, for example control plane machine conditions such as EtcdPodHealthy and SchedulerPodHealthy, which can indicate whether a control plane machine has been created correctly.
Adding support for Machine conditions enables us to perform remediation during control plane upgrades.
This PR introduces a new field as part of the MachineHealthCheckChecks:
- `UnhealthyMachineConditions`

This mirrors the behavior of `UnhealthyNodeConditions`, but the MachineHealthCheck controller checks the Machine conditions instead. This reimplements and extends the work originally proposed by @justinmir in PR #12275.
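For illustration only, a checks configuration combining both node and machine rules might look roughly like the Go sketch below. It is based on the test snippet discussed in the review; the node-side field names are assumed to mirror the machine-side ones, and "EtcdPodHealthy" is written as a plain string for one of the control plane machine conditions mentioned above.

```go
checks := clusterv1.MachineHealthCheckChecks{
    // Existing behavior: rules evaluated against Node conditions.
    UnhealthyNodeConditions: []clusterv1.UnhealthyNodeCondition{
        {
            Type:           corev1.NodeReady,
            Status:         corev1.ConditionUnknown,
            TimeoutSeconds: ptr.To(int32(5 * 60)),
        },
    },
    // New in this PR: rules evaluated against Machine conditions.
    UnhealthyMachineConditions: []clusterv1.UnhealthyMachineCondition{
        {
            Type:           "EtcdPodHealthy",
            Status:         metav1.ConditionFalse,
            TimeoutSeconds: ptr.To(int32(5 * 60)),
        },
    },
}
```

In YAML this would correspond to spec.checks.unhealthyNodeConditions / spec.checks.unhealthyMachineConditions, again assuming the field naming follows the pattern above.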
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #5450
Label(s) to be applied
/kind feature
/area machinehealthcheck
Notes for Reviewers
We updated the tests to validate the new MachineHealthCheck code paths for UnhealthyMachineConditions in the following ways:
- internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go includes a new envtest.
- internal/controllers/machinehealthcheck/machinehealthcheck_targets_test.go includes a unit test verifying that the machine will need remediation.
- The remaining test changes are boilerplate to ensure that this doesn't break existing functionality: every place we use UnhealthyNodeConditions, we also specify UnhealthyMachineConditions.
Core Logic Refactor: Modified needsRemediation() in machinehealthcheck_targets.go to always evaluate machine conditions regardless of node state and to merge node and machine condition messages into a single condition (see the refactor commit above for the full list of changes).
Event Message Updates: The "Machine %s has unhealthy Node %s" event message was revisited, since machine conditions are now checked as well (see the review discussion above).