
Conversation

furkatgofurov7
Member

@furkatgofurov7 furkatgofurov7 commented Oct 7, 2025

What this PR does / why we need it:
MachineHealthCheck currently only allows checking Node conditions to validate whether a machine is healthy. However, Machine conditions capture state that does not exist on Nodes, for example control plane conditions such as EtcdPodHealthy and SchedulerPodHealthy, which can indicate whether a control plane machine has been created correctly.

Adding support for Machine conditions enables us to perform remediation during control plane upgrades.

This PR introduces a new field as part of the MachineHealthCheckChecks:

  • UnhealthyMachineConditions

This will mirror the behavior of UnhealthyNodeConditions but the MachineHealthCheck controller will instead check the machine conditions.
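
For illustration, here is a minimal, self-contained sketch of the timeout semantics such a check mirrors. The type and field names below are simplified stand-ins for illustration only, not the actual API added by this PR:

package main

import (
	"fmt"
	"time"
)

// unhealthyCheck is a simplified stand-in for an UnhealthyMachineCondition /
// UnhealthyNodeCondition entry: a condition type, the status considered
// unhealthy, and how long that status may persist before remediation.
type unhealthyCheck struct {
	Type           string
	Status         string
	TimeoutSeconds int32
}

// observedCondition is a simplified stand-in for a condition reported on a Machine.
type observedCondition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
}

// needsRemediationFor returns true if an observed condition matches the check and
// has been in the unhealthy status for longer than the configured timeout.
func needsRemediationFor(check unhealthyCheck, observed []observedCondition, now time.Time) bool {
	for _, c := range observed {
		if c.Type != check.Type || c.Status != check.Status {
			continue
		}
		if now.Sub(c.LastTransitionTime) > time.Duration(check.TimeoutSeconds)*time.Second {
			return true
		}
	}
	return false
}

func main() {
	check := unhealthyCheck{Type: "EtcdPodHealthy", Status: "False", TimeoutSeconds: 300}
	observed := []observedCondition{{Type: "EtcdPodHealthy", Status: "False", LastTransitionTime: time.Now().Add(-10 * time.Minute)}}
	// The condition has been False for 10 minutes, longer than the 5-minute timeout.
	fmt.Println(needsRemediationFor(check, observed, time.Now())) // true
}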

This reimplements and extends the work originally proposed by @justinmir in PR #12275.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes: #5450

Label(s) to be applied
/kind feature
/area machinehealthcheck

Notes for Reviewers
We updated the tests to validate the new MachineHealthCheck code paths for UnhealthyMachineConditions in the following ways:

  • internal/controllers/machinehealthcheck/machinehealthcheck_controller_test.go includes a new envtest-based test case

  • internal/controllers/machinehealthcheck/machinehealthcheck_targets_test.go includes a unit test verifying that a machine with an unhealthy Machine condition needs remediation.

  • Remaining test changes are boilerplate to ensure that this doesn't break existing functionality: every place we use UnhealthyNodeConditions, we also specify UnhealthyMachineConditions.

  • Core Logic Refactor: Modified needsRemediation() in machinehealthcheck_targets.go to:

    • Always evaluate machine conditions first, regardless of node state
    • Ensure machine conditions are checked in ALL scenarios (node missing, startup timeout, node exists)
    • Consistently merge machine and node condition messages across all failure scenarios
    • Maintain backward compatibility with existing condition message formats
  • Event Message Updates:

    • Updated event message to use maintainer's suggested wording: "Machine %s (Node %s) is failing machine health check rules and it is likely to go unhealthy"
    • Updated EventDetectedUnhealthy comment to reflect both machine and node condition checking

@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. kind/feature Categorizes issue or PR as related to a new feature. area/machinehealthcheck Issues or PRs related to machinehealthchecks cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 7, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 7, 2025
@furkatgofurov7 furkatgofurov7 force-pushed the unhealthyMachineConditions-check-mhc branch from af68d6e to 7cb44c3 on October 7, 2025 21:24
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Oct 7, 2025
@furkatgofurov7 furkatgofurov7 force-pushed the unhealthyMachineConditions-check-mhc branch from 7cb44c3 to ab19424 on October 7, 2025 21:42
furkatgofurov7 and others added 2 commits October 8, 2025 01:06
MachineHealthCheck currently only allows checking Node conditions to
validate if a machine is healthy. However, machine conditions capture
conditions that do not exist on nodes, for example, control plane node
conditions such as EtcdPodHealthy, SchedulerPodHealthy that can indicate
if a controlplane machine has been created correctly.

Adding support for Machine conditions enables us to perform remediation
during control plane upgrades.

This PR introduces a new field as part of the MachineHealthCheckChecks:
  - `UnhealthyMachineConditions`

This will mirror the behavior of `UnhealthyNodeConditions` but the
MachineHealthCheck controller will instead check the machine conditions.

This reimplements and extends the work originally proposed in PR #12275.

Co-authored-by: Justin Miron <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
Signed-off-by: Furkat Gofurov <[email protected]>
@furkatgofurov7 furkatgofurov7 force-pushed the unhealthyMachineConditions-check-mhc branch from ab19424 to c6a7148 on October 7, 2025 22:08
@furkatgofurov7
Member Author

/test pull-cluster-api-e2e-main

@furkatgofurov7 furkatgofurov7 force-pushed the unhealthyMachineConditions-check-mhc branch 2 times, most recently from 2c052cf to 424114f on October 13, 2025 18:19
Signed-off-by: Furkat Gofurov <[email protected]>
@furkatgofurov7 furkatgofurov7 force-pushed the unhealthyMachineConditions-check-mhc branch from 424114f to 5ee7d25 on October 13, 2025 18:21
Comment on lines 233 to 242
v1beta1conditions.MarkFalse(t.Machine, clusterv1.MachineHealthCheckSucceededV1Beta1Condition, clusterv1.UnhealthyMachineConditionV1Beta1Reason, clusterv1.ConditionSeverityWarning, "Condition %s on Machine is reporting status %s for more than %s", c.Type, c.Status, timeoutSecondsDuration.String())
logger.V(3).Info("Target is unhealthy: condition is in state longer than allowed timeout", "condition", c.Type, "state", c.Status, "timeout", timeoutSecondsDuration.String())

conditions.Set(t.Machine, metav1.Condition{
	Type:    clusterv1.MachineHealthCheckSucceededCondition,
	Status:  metav1.ConditionFalse,
	Reason:  clusterv1.MachineHealthCheckUnhealthyMachineReason,
	Message: fmt.Sprintf("Health check failed: Condition %s on Machine is reporting status %s for more than %s", c.Type, c.Status, timeoutSecondsDuration.String()),
})
return true, time.Duration(0)
Member

q:

if I'm not wrong, when there are both node unhealthy conditions and machine unhealthy conditions, the latter overrides the former. Is this intentional? Is the current priority (machine over node) ok?

should we consider a different approach, where we pick one or the other reason, but we combine all the messages?

Member Author

@fabriziopandini thanks for catching it, you're absolutely right.
When both node and machine unhealthy conditions are present, the machine conditions will indeed override the node conditions because:

  • Node conditions are processed initially
  • Machine conditions are processed after
  • Both use the same condition type (clusterv1.MachineHealthCheckSucceededCondition), so the conditions.Set() call in the machine conditions loop will overwrite any condition set by the node conditions loop 💥

This means if both a node condition and a machine condition are unhealthy, only the machine condition will be reflected in the final condition status, and the node condition information will be lost...

I will see how we can combine all the messages 👍🏼

Member Author

@furkatgofurov7 furkatgofurov7 Oct 13, 2025

I came up with: f96b742 PTAL and let me know what you think!

…iation() method

If both a node condition and machine condition are unhealthy, pick one reason but
combine all the messages

Signed-off-by: Furkat Gofurov <[email protected]>
@sbueringer
Member

/test pull-cluster-api-e2e-main

@furkatgofurov7
Member Author

furkatgofurov7 commented Oct 14, 2025

Quick update on failing main tests:

  1. they are failing now, probably because the condition matcher is too strict (it checks order/length and dynamic fields like ObservedGeneration/LastTransitionTime); example failure from a local run:
--- FAIL: TestMachineHealthCheck_Reconcile (441.16s)
machinehealthcheck_controller_test.go:232:  
Timed out after 5.001s.  
expected  
    &v1beta2.MachineHealthCheckStatus{Conditions:[]v1.Condition{v1.Condition{Type:"RemediationAllowed", Status:"True", ObservedGeneration:1, LastTransitionTime:time.Date(2025, time.October, 14, 0, 40, 34, 0, time.Local), Reason:"RemediationAllowed", Message:""}, v1.Condition{Type:"Paused", Status:"False", ObservedGeneration:1, LastTransitionTime:time.Date(2025, time.October, 14, 0, 40, 34, 0, time.Local), Reason:"NotPaused", Message:""}}, ExpectedMachines:(*int32)(0x14000c9498c), CurrentHealthy:(*int32)(0x14000c94990), RemediationsAllowed:(*int32)(0x14000c94994), ObservedGeneration:1, Targets:[]string{"test-mhc-machine-pm82p", "test-mhc-machine-v74dz"}, Deprecated:(*v1beta2.MachineHealthCheckDeprecatedStatus)(0x140008227d0)}  
to match  
    &v1beta2.MachineHealthCheckStatus{Conditions:[]v1.Condition{v1.Condition{Type:"RemediationAllowed", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Reason:"RemediationAllowed", Message:""}, v1.Condition{Type:"Paused", Status:"False", ObservedGeneration:0, LastTransitionTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Reason:"NotPaused", Message:""}}, ExpectedMachines:(*int32)(0x14001fa10ac), CurrentHealthy:(*int32)(0x14001fa10b0), RemediationsAllowed:(*int32)(0x14001fa10b4), ObservedGeneration:1, Targets:[]string{"test-mhc-machine-pm82p", "test-mhc-machine-v74dz"}, Deprecated:(*v1beta2.MachineHealthCheckDeprecatedStatus)(0x14001d9e6d0)}  

machinehealthcheck_controller_test.go:366:  
Timed out after 30.001s.  
expected  
    &v1beta2.MachineHealthCheckStatus{Conditions:[]v1.Condition{v1.Condition{Type:"RemediationAllowed", Status:"True", ObservedGeneration:1, LastTransitionTime:time.Date(2025, time.October, 14, 0, 40, 40, 0, time.Local), Reason:"RemediationAllowed", Message:""}, v1.Condition{Type:"Paused", Status:"False", ObservedGeneration:1, LastTransitionTime:time.Date(2025, time.October, 14, 0, 40, 40, 0, time.Local), Reason:"NotPaused", Message:""}}, ExpectedMachines:(*int32)(0x14001ebd99c), CurrentHealthy:(*int32)(0x14001ebd9a0), RemediationsAllowed:(*int32)(0x14001ebd9a4), ObservedGeneration:1, Targets:[]string{"test-mhc-machine-9mctz", "test-mhc-machine-jdf2x"}, Deprecated:(*v1beta2.MachineHealthCheckDeprecatedStatus)(0x1400061eb08)}  
to match  
    &v1beta2.MachineHealthCheckStatus{Conditions:[]v1.Condition{v1.Condition{Type:"RemediationAllowed", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Reason:"RemediationAllowed", Message:""}}, ExpectedMachines:(*int32)(0x14001ebc1cc), CurrentHealthy:(*int32)(0x14001ebc1d0), RemediationsAllowed:(*int32)(0x14001ebc1d4), ObservedGeneration:1, Targets:[]string{"test-mhc-machine-9mctz", "test-mhc-machine-jdf2x"}, Deprecated:(*v1beta2.MachineHealthCheckDeprecatedStatus)(0x1400061e240)}
  2. aggregated unhealthy messages didn’t consistently include the “Health check failed:” prefix:
--- FAIL: TestHealthCheckTargets (0.00s)
machinehealthcheck_targets_test.go:636:  
Expected  
<[]v1.Condition | len:1, cap:1>:  
[
  {
    Type: "HealthCheckSucceeded",
    Status: "False",
    ObservedGeneration: 0,
    LastTransitionTime: { Time: 0001-01-01T00:00:00Z },
    Reason: "UnhealthyNode",
    Message: "Condition Ready on Node is reporting status Unknown for more than 5m0s",
  },
]  

to contain elements  
<[]v1.Condition | len:1, cap:1>:  
[
  {
    Type: "HealthCheckSucceeded",
    Status: "False",
    ObservedGeneration: 0,
    LastTransitionTime: { Time: 0001-01-01T00:00:00Z },
    Reason: "UnhealthyNode",
    Message: "Health check failed: Condition Ready on Node is reporting status Unknown for more than 5m0s",
  },
]  

the missing elements were  
<[]v1.Condition | len:1, cap:1>:  
[
  {
    Type: "HealthCheckSucceeded",
    Status: "False",
    ObservedGeneration: 0,
    LastTransitionTime: { Time: 0001-01-01T00:00:00Z },
    Reason: "UnhealthyNode",
    Message: "Health check failed: Condition Ready on Node is reporting status Unknown for more than 5m0s",
  },
]

Currently, I am trying to fix this by relaxing the custom matcher to be order-insensitive and subset-based, and standardizing the combined unhealthy message to include the prefix. However, I am open to any other suggestions, thank you.

@sbueringer
Member

sbueringer commented Oct 14, 2025

Currently, I am trying to fix this by relaxing the custom matcher to be order-insensitive and subset-based, and standardizing the combined unhealthy message to include the prefix. However, I am open to any other suggestions, thank you.

I think gomega should already not care about the order if we use the right matcher

I also thought we have some matcher that allows ignoring timestamps for condition comparisons (grep for HaveSameStateOf, maybe it's useful)
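
For example, a minimal sketch of an order-insensitive, subset-based comparison with plain gomega, clearing the volatile fields before comparing (just an illustration; the normalize helper is hypothetical, and the repository's HaveSameStateOf-style matchers may already cover this):

package mhc_test

import (
	"testing"

	. "github.com/onsi/gomega"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// normalize clears the fields we don't want to assert on (LastTransitionTime,
// ObservedGeneration) so the comparison only looks at Type/Status/Reason/Message.
func normalize(conditions []metav1.Condition) []metav1.Condition {
	out := make([]metav1.Condition, 0, len(conditions))
	for _, c := range conditions {
		c.LastTransitionTime = metav1.Time{}
		c.ObservedGeneration = 0
		out = append(out, c)
	}
	return out
}

func TestConditionsMatchIgnoringVolatileFields(t *testing.T) {
	g := NewWithT(t)

	got := []metav1.Condition{
		{Type: "Paused", Status: metav1.ConditionFalse, Reason: "NotPaused", LastTransitionTime: metav1.Now(), ObservedGeneration: 1},
		{Type: "RemediationAllowed", Status: metav1.ConditionTrue, Reason: "RemediationAllowed", LastTransitionTime: metav1.Now(), ObservedGeneration: 1},
	}
	want := metav1.Condition{Type: "RemediationAllowed", Status: metav1.ConditionTrue, Reason: "RemediationAllowed"}

	// ContainElements is order-insensitive and only requires the expected elements
	// to be a subset of the actual slice.
	g.Expect(normalize(got)).To(ContainElements(want))
}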

Member

@fabriziopandini fabriziopandini left a comment

I have found the root cause of the failures in TestMachineHealthCheck_Reconcile and provided a few suggestions for the computation of the condition (see comments).

Unfortunately I have also found another issue in the needsRemediation func, which probably needs a bigger refactor.

the current logic in needsRemediation sort of assumes that checks were only applied to nodes, so e.g. it returns immediately if the node is not showing up at startup, or if the node has been deleted at a later stage

if t.Node == nil {
	if timeoutForMachineToHaveNode == disabledNodeStartupTimeout {
		// Startup timeout is disabled so no need to go any further.
		// No node yet to check conditions, can return early here.
		return false, 0
	}
	controlPlaneInitialized := conditions.GetLastTransitionTime(t.Cluster, clusterv1.ClusterControlPlaneInitializedCondition)
	clusterInfraReady := conditions.GetLastTransitionTime(t.Cluster, clusterv1.ClusterInfrastructureReadyCondition)
	machineInfraReady := conditions.GetLastTransitionTime(t.Machine, clusterv1.MachineInfrastructureReadyCondition)
	machineCreationTime := t.Machine.CreationTimestamp.Time
	// Use the latest of the following timestamps.
	comparisonTime := machineCreationTime
	logger.V(5).Info("Determining comparison time",
		"machineCreationTime", machineCreationTime,
		"clusterInfraReadyTime", clusterInfraReady,
		"controlPlaneInitializedTime", controlPlaneInitialized,
		"machineInfraReadyTime", machineInfraReady,
	)
	if conditions.IsTrue(t.Cluster, clusterv1.ClusterControlPlaneInitializedCondition) && controlPlaneInitialized != nil && controlPlaneInitialized.After(comparisonTime) {
		comparisonTime = controlPlaneInitialized.Time
	}
	if conditions.IsTrue(t.Cluster, clusterv1.ClusterInfrastructureReadyCondition) && clusterInfraReady != nil && clusterInfraReady.After(comparisonTime) {
		comparisonTime = clusterInfraReady.Time
	}
	if conditions.IsTrue(t.Machine, clusterv1.MachineInfrastructureReadyCondition) && machineInfraReady != nil && machineInfraReady.After(comparisonTime) {
		comparisonTime = machineInfraReady.Time
	}
	logger.V(5).Info("Using comparison time", "time", comparisonTime)
	timeoutDuration := timeoutForMachineToHaveNode.Duration
	if comparisonTime.Add(timeoutForMachineToHaveNode.Duration).Before(now) {
		v1beta1conditions.MarkFalse(t.Machine, clusterv1.MachineHealthCheckSucceededV1Beta1Condition, clusterv1.NodeStartupTimeoutV1Beta1Reason, clusterv1.ConditionSeverityWarning, "Node failed to report startup in %s", timeoutDuration)
		logger.V(3).Info("Target is unhealthy: machine has no node", "duration", timeoutDuration)
		conditions.Set(t.Machine, metav1.Condition{
			Type:    clusterv1.MachineHealthCheckSucceededCondition,
			Status:  metav1.ConditionFalse,
			Reason:  clusterv1.MachineHealthCheckNodeStartupTimeoutReason,
			Message: fmt.Sprintf("Health check failed: Node failed to report startup in %s", timeoutDuration),
		})
		return true, time.Duration(0)
	}
	durationUnhealthy := now.Sub(comparisonTime)
	nextCheck := timeoutDuration - durationUnhealthy + time.Second
	return false, nextCheck
}

if t.nodeMissing {
	logger.V(3).Info("Target is unhealthy: node is missing")
	v1beta1conditions.MarkFalse(t.Machine, clusterv1.MachineHealthCheckSucceededV1Beta1Condition, clusterv1.NodeNotFoundV1Beta1Reason, clusterv1.ConditionSeverityWarning, "")
	conditions.Set(t.Machine, metav1.Condition{
		Type:    clusterv1.MachineHealthCheckSucceededCondition,
		Status:  metav1.ConditionFalse,
		Reason:  clusterv1.MachineHealthCheckNodeDeletedReason,
		Message: fmt.Sprintf("Health check failed: Node %s has been deleted", t.Machine.Status.NodeRef.Name),
	})
	return true, time.Duration(0)
}

While this code structure worked well when checking only node conditions at the end of the func, it does not work well with the addition of the check for machine conditions at the end of the func.

More specifically, I think that we should find a way to always check machine conditions, not only when the node exists / the func doesn't hit the two if branches highlighted above, as in the current implementation.

Additionally, we should make sure that we merge messages from machine conditions and from node conditions in all possible scenarios:

  • when node is not showing up at startup
  • when the node has been deleted at a later stage
  • when the node exists (which is the only scenario covered in the current change set)

I will try to come up with some ideas to solve this problem, but of course suggestions are more than welcome

Comment on lines 187 to 189
unhealthyMessages       []string
unhealthyReasons        []string
foundUnhealthyCondition bool
Member

I would simplify this by keeping track only of messages (everything else can be inferred from those two arrays when setting the condition)

Member Author

@furkatgofurov7 furkatgofurov7 Oct 17, 2025

I did simplify it by removing unhealthyReasons and foundUnhealthyCondition as suggested, and now use only the unhealthyNodeMessages & unhealthyMachineMessages arrays to infer what was removed
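
As a rough illustration of that simplification, the reason can be inferred from which of the two slices is non-empty; the helper and the reason strings below are hypothetical placeholders, not the constants used in the PR:

package main

import "fmt"

// chooseReason sketches how a single condition reason could be derived purely from
// the two message slices (placeholder reason strings, illustration only).
func chooseReason(unhealthyNodeMessages, unhealthyMachineMessages []string) string {
	switch {
	case len(unhealthyNodeMessages) > 0 && len(unhealthyMachineMessages) > 0:
		// Both node and machine checks failed: pick one reason; all messages get
		// merged into the condition message.
		return "UnhealthyNode"
	case len(unhealthyMachineMessages) > 0:
		return "UnhealthyMachine"
	case len(unhealthyNodeMessages) > 0:
		return "UnhealthyNode"
	default:
		return "HealthCheckSucceeded"
	}
}

func main() {
	fmt.Println(chooseReason(nil, []string{"Condition EtcdPodHealthy on Machine is reporting status False for more than 5m0s"}))
	// Output: UnhealthyMachine
}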

Comment on lines 2882 to 2888
UnhealthyMachineConditions: []clusterv1.UnhealthyMachineCondition{
	{
		Type:           clusterv1.MachineReadyCondition,
		Status:         metav1.ConditionUnknown,
		TimeoutSeconds: ptr.To(int32(5 * 60)),
	},
},
Member

@fabriziopandini fabriziopandini Oct 15, 2025

Looks like this additional check is the reason why most of the tests in TestMachineHealthCheck_Reconcile are failing.

What happens is that in those tests the machine's Ready condition is Unknown, and thus it matches the rule above.

When a machine has a matching rule, even if the timeout has not expired, it is no longer counted as a healthy machine when computing MHC's replica counters, and this is what triggers the test failures (unexpected replica counters).

The relevant code is here:

if nextCheck > 0 {
	logger.V(3).Info("Target is likely to go unhealthy", "timeUntilUnhealthy", nextCheck.Truncate(time.Second).String())
	r.recorder.Eventf(
		t.Machine,
		corev1.EventTypeNormal,
		EventDetectedUnhealthy,
		"Machine %s has unhealthy Node %s",
		klog.KObj(t.Machine),
		t.nodeName(),
	)
	nextCheckTimes = append(nextCheckTimes, nextCheck)
	continue
}

Two things are worth noticing:

  • When a machine matches a rule but the timeout has not expired, MHC counts the machine neither as healthy nor as unhealthy. The machine is considered "likely to go unhealthy", which is something that doesn't surface anywhere in status 😓
  • As you might notice, when a machine enters the "likely to go unhealthy" state, MHC generates an event; the event message is "Machine %s has unhealthy Node %s", and it should be changed because now we are also checking machine conditions. I would suggest using something like "Machine %s (Node %s) is failing machine health check rules and it is likely to go unhealthy" as the new message (suggestions are welcome)

Member Author

Thanks, it makes sense to use the event message you suggested 👍🏼
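
For reference, a standalone sketch of what the reworded event emission could look like, using a FakeRecorder so it runs outside the controller; the object and the literal arguments below stand in for t.Machine, klog.KObj(t.Machine) and t.nodeName():

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

func main() {
	// FakeRecorder lets us see the rendered event message without a running cluster.
	recorder := record.NewFakeRecorder(1)

	// Placeholder object; in the controller this would be t.Machine.
	machine := &corev1.Pod{}

	recorder.Eventf(
		machine,
		corev1.EventTypeNormal,
		"DetectedUnhealthy", // stand-in for the EventDetectedUnhealthy constant
		"Machine %s (Node %s) is failing machine health check rules and it is likely to go unhealthy",
		"default/test-machine", // klog.KObj(t.Machine) in the controller
		"test-node",            // t.nodeName() in the controller
	)

	fmt.Println(<-recorder.Events)
}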

Member Author

Looks like this additional check is the reason why most of the tests in TestMachineHealthCheck_Reconcile are failing.

Yes, that was it. I have now removed it from newMachineHealthCheck, since ALL tests that use this helper would suddenly get machine condition evaluation enabled. Instead, I added it directly to the specific test, and the tests started passing again!

Member

@furkatgofurov7 let's drop the event https://github.com/furkatgofurov7/cluster-api/blob/2c1ca0374c287e0016a2323a587c42045878c6f0/internal/controllers/machinehealthcheck/machinehealthcheck_targets.go#L447 (it is very noisy, it is generated at every reconcile for every machine that does not meet a check, for the entire time until the timeout expires)

Refactors `needsRemediation`; specifically, the following changes were made:
- Move machine condition evaluation to always execute first, regardless of node state
- Ensure machine conditions are checked in ALL scenarios:
  * When node is missing (t.nodeMissing)
  * When node hasn't appeared yet (t.Node == nil)
  * When node exists (t.Node != nil)
- Consistently merge node and machine condition messages in all failure scenarios
- Maintain backward compatibility with existing condition message formats
- Use appropriate condition reasons based on which conditions are unhealthy

Signed-off-by: Furkat Gofurov <[email protected]>
@furkatgofurov7
Member Author

/test pull-cluster-api-test-main

@furkatgofurov7
Member Author

furkatgofurov7 commented Oct 17, 2025

@fabriziopandini Thanks for the detailed feedback! You're absolutely right about the inconsistent behavior. I've now refactored the needsRemediation function to address all the concerns you raised.

Changes I made:

  1. We now always evaluate Machine conditions first
    Machine conditions are now evaluated before any node-related logic, to ensure they're checked in ALL scenarios:
  • When the node is missing (t.nodeMissing)
  • When the node hasn't appeared yet (t.Node == nil)
  • When the node exists (t.Node != nil)
  2. Consistent message merging in ALL scenarios
    Error messages from both machine and node conditions are now merged in every failure scenario:
  • Node missing: "Node X has been deleted; Condition Y on Machine is reporting status Z"
  • Node startup timeout: "Node failed to report startup in 5m; Condition Y on Machine is reporting status Z"
  • Node exists: "Condition A on Node is reporting status B; Condition Y on Machine is reporting status Z"
  3. No more early returns bypassing machine evaluation
    The problematic early-return paths now occur after machine conditions are evaluated, not before.

Let me know what you think about the refactor.

@furkatgofurov7
Member Author

@sbueringer, thanks for another round of feedback on the conversion code; hopefully all your suggestions are incorporated now.

@furkatgofurov7
Member Author

/test pull-cluster-api-e2e-main

Member

@fabriziopandini fabriziopandini left a comment

Thanks @furkatgofurov7 for this iteration!

I'm wondering if we can further simplify the code/improve readability by using two sub functions, one for machineChecks and the other for nodeChecks.

The resulting needsRemediation will look like:

func (t *healthCheckTarget) needsRemediation(logger logr.Logger, timeoutForMachineToHaveNode metav1.Duration) (bool, time.Duration) {
	// checks for HasRemediateMachineAnnotation, ClusterControlPlaneInitializedCondition, ClusterInfrastructureReadyCondition
        ...

	// Check machine conditions
	unhealthyMachineMessages, nextMachineCheck := t.machineChecks(logger)

	// Check node conditions
	nodeConditionReason, nodeV1beta1ConditionReason, unhealthyNodeMessages, nextNodeCheck := t.nodeChecks(logger, timeoutForMachineToHaveNode)

	// Combine results and set conditions
	...
}

Another benefit of this code structure is that condition management is implemented in only one place.

In case it can help, this is a commit where I experimented a little bit with this idea.

wdyt?

v1beta1Reason = clusterv1.UnhealthyMachineConditionV1Beta1Reason
}

// For v1beta2 we use a single-line message prefixed with "Health check failed: "
Member

For v1beta2 we should use a multiline message with a * prefix on every message (I pasted an example of how to compute it in my previous comment).

Also, we usually avoid calling out v1beta2 (v1beta2 conditions are just conditions), so we won't need to fix the comment when the v1beta1 code goes away.

(Same for other places where we are computing condition message)
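
For illustration, a small self-contained sketch of that multiline format, with one "* "-prefixed line per unhealthy message; the helper name and exact wording are assumptions, not the code in this PR:

package main

import (
	"fmt"
	"strings"
)

// healthCheckMessage builds a multiline condition message: a short header line
// followed by one "* "-prefixed line per unhealthy node/machine message.
func healthCheckMessage(unhealthyNodeMessages, unhealthyMachineMessages []string) string {
	all := append(append([]string{}, unhealthyNodeMessages...), unhealthyMachineMessages...)
	if len(all) == 0 {
		return ""
	}
	lines := make([]string, 0, len(all)+1)
	lines = append(lines, "Health check failed:")
	for _, m := range all {
		lines = append(lines, "* "+m)
	}
	return strings.Join(lines, "\n")
}

func main() {
	fmt.Println(healthCheckMessage(
		[]string{"Condition Ready on Node is reporting status Unknown for more than 5m0s"},
		[]string{"Condition EtcdPodHealthy on Machine is reporting status False for more than 5m0s"},
	))
}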

// the node has not been set yet
if t.Node == nil {
	// Check if we already have unhealthy machine conditions that should trigger remediation
	if len(unhealthyMachineMessages) > 0 {
Member

Looks like when len(unhealthyMachineMessages) > 0 we are returning without checking the node startup timeout

…ns: one for machineChecks and the other for nodeChecks.

Another benefit of this code structure is that condition management is implemented in only one place.

Co-authored-by: Fabrizio Pandini
Signed-off-by: Furkat Gofurov <[email protected]>
@furkatgofurov7
Member Author

/test pull-cluster-api-e2e-main

