[WIP] Fix issue 251 peer control nodes #255
Conversation
Hi @mark-scott-jr-dell. Thanks for your PR. I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with the appropriate command. Once the patch is verified, the new status will be reflected by the corresponding label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Sorry for the late first review, we are pretty busy...
        
          
pkg/apicheck/check.go (Outdated)
          
        
//canOtherControlPlanesBeReached := c.canOtherControlPlanesBeReached()
peersResponse = c.getPeersResponse(peers.ControlPlane)

// MES: This does not appear to have any actual relevance. To me, it appears that all the necessary
IsControlPlaneHealthy() not being relevant is a bold statement ;)
However, before going on with a more detailed review, I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
cc @mshitrit
I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
+1
I think this change significantly changes current logic.
A couple of things I've noticed:
- In the new code for CP nodes we completely ignore feedback from worker nodes. For most use cases worker nodes can accurately report the status of the CP nodes, and even though I expect the CP peers to report the same, I'm not sure that ignoring the worker peers would be the best option.
- The diagnostic logic (i.e. isDiagnosticsPassed()) is removed, which means the node can be falsely considered healthy in some use cases.
IsControlPlaneHealthy() not being relevant is a bold statement ;)
Haha, to be fair I did say "does not APPEAR to have any actual relevance", based on my observations. I definitely left room in there for me to be wrong 😂.
However, before going on with a more detailed review, I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
This would help a lot. My code changes were based on how I understood the expected flow to go, which I interpreted from the intent I saw in the code. My goal was not to change too much and to keep today's behaviors the same, since I personally don't know all the intentions, nor did I find them documented in detail anywhere (correct me if I'm wrong, to be sure!).
I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
+1
I think this change significantly changes the current logic. A couple of things I've noticed:
- In the new code for CP nodes we completely ignore feedback from worker nodes. For most use cases worker nodes can accurately report the status of the CP nodes, and even though I expect the CP peers to report the same, I'm not sure that ignoring the worker peers would be the best option.
- The diagnostic logic (i.e. isDiagnosticsPassed()) is removed, which means the node can be falsely considered healthy in some use cases.
I'll look back at this later today to respond, especially w.r.t. isDiagnosticsPassed, but I did spend some time walking through the flows and found multiple checks that basically referenced the same data multiple times, so I was attempting to simplify to make it clear what the code was doing. I felt it was visually unclear what was actually going on.
Ultimately, if we use the updated unit test just to prove out the core issue, I'm still good. Our goal was to prove the issue so that it could be fixed, since it's a pain to get logs in that case due to the nature of the cluster status at that point. I personally picked the stretch goal of creating the unit test, which should be better for the long term (in theory).
@mshitrit Not sure if you've had a chance to think about the logic flows here, but, if I'm reading things correctly, an update like this?
- Attempt to get control plane responses
- Attempt to get worker responses
- Some combination of these should say that the node is healthy. Also add in isDiagnosticsPassed()
I'm willing to implement it and push a new PR, just want to be sure that I use the flow that you have in mind.
We are looking to pull the latest build whenever this is merged to main, and get some needed CVE fixes and other things, so I'd love to drive this to a close ASAP. I thought I had posted this message a week and a half ago but I guess it went into the ether.
Hi, I think that's pretty close. Writing down both the current flow and what I understand to be the desired flow:

Current flow:
- Attempt to get worker responses
- If a worker, return that response
- else (assuming it's a control plane):
  - get a control plane response
  - some combination of these should say that the node is healthy; also add in isDiagnosticsPassed()

IIUC the fix is aiming for this flow (a rough sketch in code follows below):
- Attempt to get control plane responses
- Attempt to get worker responses
- If a worker: some combination of these should say that the node is healthy (the CP response is only relevant for some use cases of a worker node healthy response, otherwise it can be ignored)
- else: some combination of these should say that the node is healthy; also add in isDiagnosticsPassed()
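To make that ordering concrete, here is a minimal, self-contained sketch of the proposed decision logic. It is illustrative only: the type names, the peer-response map, and the exact way the worker and control plane responses are combined are assumptions made for the sketch, not code from this PR.

```go
package main

import "fmt"

// Minimal stand-ins for the real types; illustrative only.
type Role int

const (
	Worker Role = iota
	ControlPlane
)

type PeersResponse struct {
	IsHealthy bool
}

type node struct {
	isControlPlane    bool
	peerResponse      map[Role]PeersResponse
	diagnosticsPassed bool
}

// isConsideredHealthy sketches the ordering above: a worker trusts the worker
// peers' verdict, while a control plane node also consults the control plane
// peers and its local diagnostics before declaring itself healthy.
func (n *node) isConsideredHealthy() bool {
	workerResp := n.peerResponse[Worker]
	if !n.isControlPlane {
		return workerResp.IsHealthy
	}
	cpResp := n.peerResponse[ControlPlane]
	peersThinkHealthy := workerResp.IsHealthy || cpResp.IsHealthy
	return peersThinkHealthy && n.diagnosticsPassed
}

func main() {
	cp := &node{
		isControlPlane: true,
		peerResponse: map[Role]PeersResponse{
			Worker:       {IsHealthy: false},
			ControlPlane: {IsHealthy: false},
		},
		diagnosticsPassed: true,
	}
	fmt.Println("control plane considered healthy:", cp.isConsideredHealthy())
}
```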
@mshitrit I pushed a set of changes to IsConsideredHealthy which I believe matches what you have described as the desired flow. Let me know your thoughts.
        
          
pkg/apicheck/check.go (Outdated)
          
        
}

func (c *ApiConnectivityCheck) getWorkerPeersResponse() peers.Response {
func (c *ApiConnectivityCheck) getPeersResponse(role peers.Role) peers.Response {
I like this refactoring 👍
eb2c8c8 to 6397c3e (Compare)
            
          
pkg/apicheck/check.go (Outdated)
          
        
controlPlanePeersResponse := c.getPeersResponse(peers.ControlPlane)

c.config.Log.Info("isConsideredHealthy: since peers think I'm unhealthy, double checking "+
	"by returning what the control plane nodes think of my state",
	"controlPlanePeersResponse.IsHealthy", controlPlanePeersResponse.IsHealthy)
return controlPlanePeersResponse.IsHealthy
IIUC basically this means that for a worker node the CP peers response will override the Worker peers response (unless worker peers response is healthy).
It completely ignores why the worker response was unhealthy.
TBH I'm having a hard time thinking of an example where this would not work as expected, but ignoring the reason still feels dangerous.
@slintes maybe you have a stronger opinion either way ?
@mshitrit I hear you. I still don't necessarily have a full picture of the fully desired behavior, so I'm trying to translate what I'm hearing from you and what I've seen overall.
I will definitely change it to match what the medik8s team thinks is appropriate. My main goal was to prove the issue with the unit test and to come in with a proposed solution rather than just toss the problem in your collective laps. The intention is not to change intended behaviors, especially since that could have impacts on existing installations.
We are still discussing details on the PR without knowing what we're aiming at.
Again, can we write down the expected flow first?
My apologies, @slintes. I'm honestly looking for y'all's feedback on what the expected flow is. I identified a problem case wherein a control plane node can go down; I didn't fully know what the expected flow was supposed to be, but I did know that in this case it was wrong.
So, the unit test proves the issue, which I believe could affect others utilizing SNR in different ways. I tried to provide a solution, but I'm totally OK if it's not the right one.
If you are asking me what the expected flow is, I can sit down and draft what I think it should be for the entire system, for CP nodes as well as worker nodes, and I'm happy to do that, but I was hoping to start from a baseline of what the medik8s team believed the workflow was theoretically supposed to be.
So, if this reply is directed at me, I'd ask that you be more specific: are you asking me to write down the expected flow for the entire system? (Which is what I seem to have possibly gotten incorrect in my proposed solution.)
We are still discussing details on the PR without knowing what we're aiming at. Again, can we write down the expected flow first?
If so, I'm up for taking a stab at it, but I don't have the background on why existing decisions were made.
I identified a problem case
the unit test proves the issue
And that's great, thank you 👍🏼
But I think the code changes significantly change the existing flow of things, which deserves some general discussion of what the expected flow should be.
If you are asking me what the expected flow is
No no, it was a general ask, sorry if it sounded inappropriate. I would do it myself if I had more time for this...
Our docs contain some diagrams. As a first step we can verify whether they are up to date and aligned with the current code, and then whether they still make sense. Is the issue even visible there?
https://www.medik8s.io/remediation/self-node-remediation/how-it-works/
I identified a problem case
the unit test proves the issue
And that's great, thank you 👍🏼 But I think the code changes significantly change the existing flow of things, which deserves some general discussion of what the expected flow should be.
If you are asking me what the expected flow is
No no, it was a general ask, sorry if it sounded inappropriate. I would do it myself if I had more time for this...
Ok, understood.
Our docs contain some diagrams. As a first step we can verify whether they are up to date and aligned with the current code, and then whether they still make sense. Is the issue even visible there?
https://www.medik8s.io/remediation/self-node-remediation/how-it-works/
It had been so long since our initial implementation that I forgot these diagrams exist. Let me review them today and see if I can offer a proposal and update the diagrams, given your time constraints. Perhaps I can shorten the time required.
Walkthrough
Injected role-aware remote health checks, failure tracking, and quorum/time-window configuration into API connectivity; refactored control-plane evaluation to outcome-driven logic; expanded peers and test harness (wrappers, simulated peers, random pod IPs); added FailureTracker and tests; plus logging and error-message improvements across controllers and utils.
Sequence Diagram(s)
sequenceDiagram
    autonumber
    participant SNR as SelfNodeRemediation
    participant ApiCheck as ApiConnectivityCheck
    participant Peers as Peers
    participant Control as ControlPlaneManager
    participant Remote as RemotePeer (gRPC / injected)
    Note over ApiCheck: FailureTracker, FailureWindow, PeerQuorumTimeout, injected remote health check
    SNR->>ApiCheck: Request aggregated health (node role)
    ApiCheck->>Peers: listPeers(role)
    Peers-->>ApiCheck: peer Pod IPs
    ApiCheck->>Remote: getHealthStatusFromPeer(endpointIp) (injected)
    Remote-->>ApiCheck: HealthCheckResponseCode
    ApiCheck->>ApiCheck: gatherPeerResponses → evaluate (quorum, failure tracker, escalation)
    ApiCheck-->>SNR: outcome (Healthy / Remediate / Isolation / AwaitQuorum)
    alt outcome requires control-plane validation
        SNR->>Control: IsControlPlaneHealthy(outcome)
        Control-->>SNR: boolean decision (logged)
    end
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 4
🧹 Nitpick comments (6)
pkg/apicheck/check.go (3)
209-213: Log message prints the wrong variable
isControlPlaneHealthy is computed a few lines above, but the log prints controlPlanePeersResponse.IsHealthy, duplicating earlier output and hiding the final decision.
- c.config.Log.Info("isConsideredHealthy: we have checkd the control plane peer responses and cross "+
-     "checked it against the control plane diagnostics ",
-     "isControlPlaneHealthy", controlPlanePeersResponse.IsHealthy)
+ c.config.Log.Info("isConsideredHealthy: evaluated peer responses & diagnostics",
+     "isControlPlaneHealthy", isControlPlaneHealthy)
223-225: Misleading log text claims "I consider myself a WORKER" irrespective of actual role
The hard-coded message talks about WORKER even when role == peers.ControlPlane.
Replace the literal with role.String() (or similar) to avoid confusion during incident triage.
386-388: Setter lacks concurrency protection
SetHealthStatusFunc may be called from tests while ApiConnectivityCheck is running in a goroutine.
If that ever happens, the write is unsynchronised with reads in getHealthStatusFromPeer, leading to a data race.
Wrap the field access with the existing mutex or document it as "write-once before Start()".
vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
87-91: Panic message could be clearer
The panic mentions "function that takes one argument and returns (bool, error)", but omits the possibility of a typed first parameter which you explicitly support.
Consider:
- panic("MakeMatcher must be passed a function that takes one argument and returns (bool, error)")
+ panic("MakeMatcher expects func(<any single param>) (bool, error)")
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
1044-1049: Deep-equality on Node.Status is brittle and frequently fails
reflect.DeepEqual on the whole Status block compares timestamps, resource versions, conditions, etc. that can legitimately change between the expected skeleton returned by getNode and the actual cluster object – even when the node is perfectly "equal" for test purposes.
This can introduce non-deterministic test failures.
Recommendation
• Compare only deterministic fields (e.g., labels, taints) or use a semantic helper such as equality.Semantic.DeepEqual with a well-scoped struct (see the sketch below).
• Alternatively, omit Status from the comparison altogether unless a specific field is being asserted.
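For illustration, a scoped comparison along those lines could look like the following sketch; the helper name and the chosen fields are assumptions, not the suite's actual code.

```go
package main

import (
	"fmt"
	"reflect"

	corev1 "k8s.io/api/core/v1"
)

// nodesMatchForTest compares only the deterministic parts of a Node that the
// test cares about, instead of deep-equality on the whole object.
func nodesMatchForTest(expected, actual *corev1.Node) bool {
	return reflect.DeepEqual(expected.Labels, actual.Labels) &&
		reflect.DeepEqual(expected.Spec.Taints, actual.Spec.Taints)
}

func main() {
	a := &corev1.Node{}
	a.Labels = map[string]string{"node-role.kubernetes.io/worker": ""}

	b := a.DeepCopy()
	// Status drift (conditions, timestamps, etc.) no longer breaks the comparison.
	b.Status.Conditions = []corev1.NodeCondition{{Type: corev1.NodeReady}}

	fmt.Println("considered equal for the test:", nodesMatchForTest(a, b))
}
```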
790-791: Unnecessary time.Sleep slows the suite
A fixed time.Sleep(1 * time.Second) immediately before an Eventually poll adds a full second to every invocation of createGenericSelfNodeRemediationPod.
Eventually already waits until the pod is observed; the explicit sleep can be safely removed to cut runtime.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
⛔ Files ignored due to path filters (2)
- pkg/peerhealth/peerhealth.pb.go is excluded by !**/*.pb.go
- pkg/peerhealth/peerhealth_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (12)
- controllers/selfnoderemediation_controller.go(3 hunks)
- controllers/tests/config/suite_test.go(1 hunks)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/controller/suite_test.go(3 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- go.mod(1 hunks)
- pkg/apicheck/check.go(5 hunks)
- pkg/controlplane/manager.go(2 hunks)
- pkg/peers/peers.go(4 hunks)
- pkg/utils/pods.go(3 hunks)
- vendor/github.com/onsi/gomega/gcustom/make_matcher.go(1 hunks)
- vendor/modules.txt(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
controllers/tests/config/suite_test.go (1)
controllers/tests/shared/shared.go (1)
MinPeersForRemediationConfigDefaultValue(47-47)
controllers/tests/controller/suite_test.go (1)
controllers/tests/shared/shared.go (3)
ApiConnectivityCheckWrapper(59-65)
MinPeersForRemediationConfigDefaultValue(47-47)
NewApiConnectivityCheckWrapper(109-131)
controllers/tests/shared/shared.go (4)
pkg/apicheck/check.go (3)
ApiConnectivityCheck(28-37)
New(59-70)
ApiConnectivityCheckConfig(41-57)
api/HealthCheckResponse.go (1)
HealthCheckResponseCode(3-3)
pkg/controlplane/manager.go (1)
Manager(29-36)
vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
MakeMatcher(87-128)
🪛 golangci-lint (1.64.8)
pkg/apicheck/check.go
110-110: S1023: redundant return statement
(gosimple)
🔇 Additional comments (29)
go.mod (1)
25-25: Explicit dependency declaration for UUID library
Moving the UUID package from indirect to direct dependencies correctly reflects its usage in the codebase, specifically for generating random IPv6 addresses in test utilities.
vendor/modules.txt (1)
144-144: Added Gomega custom matcher package
The addition of github.com/onsi/gomega/gcustom in the vendor modules reflects the use of custom matchers in tests, which provides more flexible failure message formatting.
pkg/controlplane/manager.go (2)
57-60: Enhanced logging for control plane role verificationAdding detailed logging about node roles improves observability when checking if a node is a control plane node, which will help with debugging role-based issues.
138-140: Added node labels logging during role assignmentThis logging enhancement provides visibility into node labels during role determination, which is valuable for troubleshooting control plane node identification issues, especially when dealing with peer control nodes.
controllers/tests/config/suite_test.go (1)
129-129:✅ Verification successful
Updated minimum peers configuration constant
Changed from using shared.MinPeersForRemediation to shared.MinPeersForRemediationConfigDefaultValue (value of 1) to align with the new API connectivity check wrapper's default configuration values.
🏁 Script executed:
#!/bin/bash
# Verify the default value in the shared package
grep -r "MinPeersForRemediationConfigDefaultValue" --include="*.go" controllers/tests/shared/

Length of output: 174
Change Approved: Verified default peers configuration constant
The shared.MinPeersForRemediationConfigDefaultValue constant is defined as 1 in controllers/tests/shared/shared.go, matching the intended default. No further changes required.
- File updated: controllers/tests/config/suite_test.go (line 129)
pkg/utils/pods.go (3)
5-5: Good addition of the fmt package for improved error handling.
The import of the fmt package is required for the enhanced error wrapping in the error handling sections.
23-24: Excellent improvement to error handling with context.
Using fmt.Errorf with the %w verb provides better error context by wrapping the original error instead of losing the root cause. This change makes debugging much easier by preserving the full error chain.
33-33: Good enhancement to error message clarity.
The improved error message now explicitly includes the node name in the formatted string, making it easier to identify which node is affected when troubleshooting.
controllers/tests/controller/suite_test.go (3)
63-63: Good update to variable type for enhanced testing capabilities.
Changing the type to *shared.ApiConnectivityCheckWrapper allows for more flexible testing by enabling simulation of peer health check responses.
166-166: Updated constant reference for better code maintenance.
Changing from shared.MinPeersForRemediation to shared.MinPeersForRemediationConfigDefaultValue improves code clarity by using a more descriptive constant name and aligns with updated constants in the shared test package.
168-170: Good refactoring to use wrapper for API connectivity checks.
The change from directly using apicheck.New to using shared.NewApiConnectivityCheckWrapper enhances testing capabilities by allowing simulation of peer responses, which is essential for thorough testing of peer control node scenarios.
controllers/selfnoderemediation_controller.go (7)
453-453: Good addition of informative logging at phase start.This logging statement improves observability by explicitly marking entry into the fencing start phase, making it easier to track the remediation workflow.
456-456: Enhanced logging for pre-reboot phase entry.This logging statement improves traceability by clearly marking the transition to the pre-reboot completed phase.
459-459: Improved phase transition visibility with logging.Adding explicit logging for entering the reboot completed phase enhances observability of the remediation workflow.
462-462: Clear logging for fencing completion phase.This logging statement provides clear indication of reaching the final fencing complete phase in the remediation workflow.
466-466: Enhanced error message with phase value inclusion.Including the actual phase value in the error message provides more context for troubleshooting unknown phase errors.
500-500: Good indication of pre-reboot completion.Adding a log statement that clearly indicates when the pre-reboot phase is completed improves workflow visibility.
631-632: Improved error context for node reboot capability.The expanded error message clearly explains the consequence of the error - that failure to get the agent pod makes the node not reboot capable, which is important context for troubleshooting.
pkg/peers/peers.go (9)
104-116: Good addition of detailed logging and reset functionality.
The refactored updateWorkerPeers method now includes comprehensive logging and a dedicated reset function, improving observability and state management. The use of closures for setter, getter, and reset functions makes the code more modular and testable (sketched below).
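For readers unfamiliar with the pattern, here is the closure-based setter/getter/reset idea reduced to a standalone sketch; the function names and the fetch step are placeholders, not the real method, which works against the Kubernetes node list.

```go
package main

import "fmt"

// updatePeers shows the closure-based pattern: the caller passes fetch, set,
// get, and reset functions, so the same update logic can manage different
// peer lists (worker peers, control plane peers) without duplication.
func updatePeers(fetch func() ([]string, error),
	set func([]string), get func() []string, reset func()) {

	addresses, err := fetch()
	if err != nil {
		// On error, fall back to a clean state instead of keeping stale peers.
		reset()
		return
	}
	set(addresses)
	fmt.Println("current peers:", get())
}

func main() {
	var workerPeers []string

	updatePeers(
		func() ([]string, error) { return []string{"10.0.0.1", "10.0.0.2"}, nil },
		func(a []string) { workerPeers = a },
		func() []string { return workerPeers },
		func() { workerPeers = nil },
	)
}
```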
119-134: Well-structured implementation of UpdateControlPlanePeers.
The new implementation follows the same pattern as updateWorkerPeers, providing consistent logging, clear state management, and a dedicated reset function. This parallelism in implementation makes the code more maintainable.
136-137: Good enhancement toupdatePeerswith reset functionality.Adding the
resetPeersparameter allows for more flexible peer state management, particularly in error recovery scenarios.
150-150: Improved state management with reset function.Using the dedicated reset function instead of directly manipulating state improves encapsulation and ensures consistent state management across the codebase.
156-156: Enhanced logging of node information.Adding detailed logging of the nodes found during peer updates improves observability and debugging capabilities, especially for distributed node issues.
179-183: Informative logging for skipped node/pod combinations.Adding logging when skipping node/pod combinations provides visibility into the peer selection process, which is helpful for understanding which peers are considered during updates.
194-195: Comprehensive peer address logging.The detailed logging of both worker and control plane peer addresses when retrieving peer addresses significantly improves observability, particularly for debugging peer connectivity issues.
200-200: Improved role-specific peer logging.Adding specific log messages for worker and control plane peer retrieval requests enhances the clarity of the peer discovery process, making it easier to track role-based peer operations.
Also applies to: 203-203
86-87:✅ Verification successful
Method name change from updateControlPlanePeers to UpdateControlPlanePeers.
Also applies to: 92-93
🏁 Script executed:
#!/bin/bash # Check for all callers of the renamed method rg -A 2 -B 2 -p "updateControlPlanePeers|UpdateControlPlanePeers" --type goLength of output: 1869
All callers updated: Public method rename is safe
Confirmed that the private
updateControlPlanePeersidentifier has been fully replaced by the publicUpdateControlPlanePeersacross the codebase. No remaining references to the old name were found.Key call sites:
- pkg/peers/peers.go: lines 85–87 and 91–93
- controllers/tests/controller/selfnoderemediation_controller_test.go: lines 1142 & 1147
No further changes required.
controllers/tests/shared/shared.go (1)
101-107: UUID-based IPv6 test addresses can collide across processes
While unlikely, deriving an IPv6 address from the raw UUID bytes does not set the locally-administered bit (RFC 4193) and may occasionally overlap with real cluster traffic if the test runs on the host network.
A safer pattern is to prependfd00:(ULA range) or usenet.ParseIP("::1")for pure unit tests.Would you like a patch that generates deterministic ULA addresses (
fdxx:) instead?controllers/tests/controller/selfnoderemediation_controller_test.go (1)
1237-1244: Global mutable test state – verify isolation
configureRemediationStrategy mutates the package-level remediationStrategy variable that is also toggled directly in other test contexts.
While Ginkgo executes tests serially by default, parallel runs or future refactors can introduce surprising cross-test coupling.
Please confirm that:
- No test suite is executed with -parallel > 1.
- No goroutines read this variable concurrently.
If parallelism is anticipated, consider passing the strategy explicitly to helper functions instead of relying on a shared global.
        
          
controllers/tests/controller/selfnoderemediation_controller_test.go (Outdated)
        
6397c3e to befbcbb (Compare)
  
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mark-dell-usa. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Actionable comments posted: 1
♻️ Duplicate comments (4)
controllers/tests/shared/shared.go (2)
82-89: Random IP helper forgets to update.Status.PodIP
assignRandomIpAddressesPods()populatesPodIPs, but callers relying on the legacy singular field (pod.Status.PodIP) will still observe an empty value.
Several components (including controller-runtime caches and some older network add-ons) still look atPodIPwhen only one address is expected.
Consider mirroring the first generated address into both fields:func assignRandomIpAddressesPods(pods *corev1.PodList) { for i := range pods.Items { - pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: GetRandomIpAddress()}} + randIP := GetRandomIpAddress() + pods.Items[i].Status.PodIP = randIP // keep legacy field in sync + pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: randIP}} } }
116-128: Deadlock risk: simulated peer sends > 1 message to a bounded channel
getHealthStatusFromPeers()allocatesresultsChanwithcapacity = len(addresses)and waits for exactly one message per peer.
The current override loops overckw.SimulatePeerResponsesand may write multiple codes, causing the goroutine to block on the second send and the whole test to hang.- case ckw.ShouldSimulatePeerResponses: - for _, code := range ckw.SimulatePeerResponses { - results <- code - } - return + case ckw.ShouldSimulatePeerResponses: + // Provide exactly one deterministic response per call + if len(ckw.SimulatePeerResponses) > 0 { + results <- ckw.SimulatePeerResponses[0] + } else { + results <- selfNodeRemediation.RequestFailed + } + returnIf you need per-peer variance, consider indexing into the slice with a counter guarded by a mutex instead of pushing multiple values at once.
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
875-885: Variable shadowing hides the named return value and loses contextThe inner
err := …shadows the function's namederrreturn value, then the assignmenterr = …inside the block modifies the shadowed variable, not the function return.
As a result, the caller may receive anilerror even though the Get failed.- if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil { - if apierrors.IsNotFound(err) && !throwErrorIfNotFound { + getErr := k8sClient.Client.Get(context.Background(), podKey, pod) + if getErr != nil { + if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound { logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName) err = nil return } - - err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err) + err = fmt.Errorf("unable to get pod with name '%s' in order to delete it: %w", podName, getErr) return }
1197-1201: Cleanup doesn't persist the label rollback – subsequent tests may inherit the control-plane labelInside
configureUnhealthyNodeAsControlNode, theDeferCleanupcallback only changes the in-memoryunhealthyNodeobject but never updates the cluster.
If another test starts before this process ends (or if the global state is inspected inverifyCleanState), the node will still carry thenode-role.kubernetes.io/master=truelabel, leading to flaky or misleading results.- unhealthyNode.Labels[labels2.MasterRole] = previousRole + unhealthyNode.Labels[labels2.MasterRole] = previousRole + // Persist the rollback so follow-up tests start from + // a clean state + Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(), + "failed to restore the unhealthy node label after test cleanup")
🧹 Nitpick comments (2)
pkg/apicheck/check.go (2)
72-114: Clean up redundant return and outdated TODOTwo minor improvements:
- Remove the redundant
returnstatement at line 110 (flagged by static analysis)- Remove the outdated TODO comment since IPv6 compatibility is confirmed- // TODO does this work with IPv6? - // MES: Yes it does, we've tested this phClient, err := peerhealth.NewClient(fmt.Sprintf("%v:%v", endpointIp.IP, c.config.PeerHealthPort), c.config.PeerDialTimeout, c.config.Log.WithName("peerhealth client"), c.clientCreds) ... results <- selfNodeRemediation.HealthCheckResponseCode(resp.Status) - return }
173-215: Well-structured health check logic with improved observabilityThe refactored
isConsideredHealthymethod provides clear separation between worker and control plane logic with excellent logging for debugging. The implementation aligns with the discussed flow from past reviews.Minor suggestion: Consider extracting the complex boolean expression on line 177 into a named variable for clarity:
- isWorkerNode := isControlPlaneManagerNil || !c.controlPlaneManager.IsControlPlane() + isControlPlane := !isControlPlaneManagerNil && c.controlPlaneManager.IsControlPlane() + isWorkerNode := !isControlPlane
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- pkg/peerhealth/peerhealth.pb.gois excluded by- !**/*.pb.go
- pkg/peerhealth/peerhealth_grpc.pb.gois excluded by- !**/*.pb.go
📒 Files selected for processing (12)
- controllers/selfnoderemediation_controller.go(3 hunks)
- controllers/tests/config/suite_test.go(1 hunks)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/controller/suite_test.go(3 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- go.mod(1 hunks)
- pkg/apicheck/check.go(5 hunks)
- pkg/controlplane/manager.go(2 hunks)
- pkg/peers/peers.go(4 hunks)
- pkg/utils/pods.go(3 hunks)
- vendor/github.com/onsi/gomega/gcustom/make_matcher.go(1 hunks)
- vendor/modules.txt(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- pkg/controlplane/manager.go
- go.mod
- pkg/utils/pods.go
- controllers/tests/config/suite_test.go
- vendor/modules.txt
- controllers/tests/controller/suite_test.go
- controllers/selfnoderemediation_controller.go
- pkg/peers/peers.go
🧰 Additional context used
🪛 golangci-lint (1.64.8)
pkg/apicheck/check.go
110-110: S1023: redundant return statement
(gosimple)
🔇 Additional comments (8)
vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
1-271: Vendor dependency addition looks good
This vendor file adds the gcustom package from Gomega v1.27.0+ which provides custom matcher functionality. The integration with the test suite is appropriate for creating the new error matchers in shared.go.
Note: Static analysis detected a redundant return statement at line 110, but since this is vendor code, it should be addressed upstream rather than modified locally.
controllers/tests/shared/shared.go (1)
167-206: Well-implemented custom error matchersThe custom Gomega matchers for Kubernetes errors are cleanly implemented with proper nil handling. This improves test readability and reusability.
pkg/apicheck/check.go (2)
28-40: Excellent refactoring for testability
The introduction of GetHealthStatusFromRemoteFunc and the function injection pattern improves testability while maintaining backward compatibility with the default implementation. This follows the dependency injection principle effectively.
Also applies to: 59-70
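In isolation, the injection pattern looks roughly like this sketch; the type and method names are placeholders, not the PR's actual signatures.

```go
package main

import "fmt"

type healthCode int

// getHealthFunc is the injectable dependency: production code wires a default
// remote-call implementation, tests swap in a canned one.
type getHealthFunc func(endpoint string) healthCode

type connectivityCheck struct {
	getHealth getHealthFunc
}

func newConnectivityCheck() *connectivityCheck {
	c := &connectivityCheck{}
	c.getHealth = c.defaultGetHealth // default wired in the constructor
	return c
}

// defaultGetHealth stands in for the real peer-health call.
func (c *connectivityCheck) defaultGetHealth(endpoint string) healthCode {
	return 0 // "healthy" placeholder
}

// SetHealthFunc lets tests override the remote call before the check starts.
func (c *connectivityCheck) SetHealthFunc(f getHealthFunc) { c.getHealth = f }

func main() {
	c := newConnectivityCheck()
	c.SetHealthFunc(func(endpoint string) healthCode { return 2 }) // simulate "unhealthy"
	fmt.Println("simulated peer response:", c.getHealth("fd00::1"))
}
```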
217-322: Good refactoring to support role-based peer queryingThe generalization from worker-specific to role-based peer querying improves code reusability. The additional logging statements will be valuable for troubleshooting peer communication issues in production.
controllers/tests/controller/selfnoderemediation_controller_test.go (4)
39-82: Excellent test setup refactoringThe restructured setup with descriptive
Byblocks and comprehensive cleanup improves test maintainability and debugging. The separation of concerns betweenBeforeEachandJustBeforeEachis well-organized.
448-495: Well-structured test case for issue #251This test effectively reproduces the control plane peer health check scenario described in issue #251. The setup clearly configures the environment with control plane nodes and simulated unhealthy responses.
1134-1337: Excellent test helper functionsThe new helper functions provide clean abstractions for test configuration with proper cleanup. The consistent use of
DeferCleanupensures test isolation, and the descriptive function names make test scenarios self-documenting.
1003-1041: Comprehensive cluster state verificationThe enhanced
verifyCleanStatefunction thoroughly validates the cluster state after tests, including checking for specific SNR pods by name. This will help catch cleanup issues early.
Self Node Remediation - Control Plane Edge Case Analysis
Overview
This document analyzes a critical edge case in the Self Node Remediation (SNR) operator where a control plane node becomes partially degraded but never remediates itself. This occurs when:
The Edge Case Scenario
Initial State
Workflow Diagrams
1. Normal SNR Health Check Flow
flowchart TD
    Start([SNR Pod Health Check Timer]) --> APICheck[Check API Server /readyz]
    APICheck -->|Success| Healthy[Mark as Healthy]
    APICheck -->|Failure| ErrorCount{Error Count > Threshold?}
    ErrorCount -->|No| IncrementError[Increment Error Counter]
    IncrementError --> Wait[Wait for Next Check]
    ErrorCount -->|Yes| PeerCheck[Query Peer Nodes]
    
    PeerCheck --> PeerResponse{Peer Response Analysis}
    PeerResponse -->|Majority say Unhealthy| CreateSNR[Trigger Remediation]
    PeerResponse -->|Majority say Healthy| ResetCounter[Reset Error Counter]
    PeerResponse -->|Most can't reach API| ControlPlaneCheck{Is Control Plane?}
    
    ControlPlaneCheck -->|Worker Node| ConsiderHealthy[Consider Healthy]
    ControlPlaneCheck -->|Control Plane| RunDiagnostics[Run Diagnostics]
    
    RunDiagnostics --> DiagResult{Diagnostics Pass?}
    DiagResult -->|Yes| ConsiderHealthy
    DiagResult -->|No| CreateSNR
    
    ResetCounter --> Wait
    ConsiderHealthy --> Wait
    CreateSNR --> Remediate[Begin Remediation Process]
2. Control Plane Edge Case Flow (The Bug)
flowchart TD
    Start([Control Plane Node:<br/>API Server DOWN]) --> APIFails[API Check Fails Repeatedly]
    APIFails --> QueryPeers[Query Worker Peers]
    
    QueryPeers --> PeerStatus{Worker Peer Responses}
    PeerStatus -->|">50% also can't<br/>reach API Server"| MostCantAccess[Status: HealthyBecauseMostPeersCantAccessAPIServer]
    
    MostCantAccess --> CPDiag[Run Control Plane Diagnostics]
    
    CPDiag --> EndpointCheck{Check Endpoint<br/>Health URL}
    EndpointCheck -->|Not Configured or<br/>Was Never Accessible| EndpointPass[Endpoint Check: PASS]
    
    EndpointPass --> KubeletCheck{Is Kubelet<br/>Running?}
    KubeletCheck -->|Port 10250<br/>Responds| KubeletPass[Kubelet Check: PASS]
    
    KubeletPass --> DiagPass[Diagnostics: PASSED ✓]
    DiagPass --> MarkHealthy[Node Marked as HEALTHY]
    MarkHealthy --> NoRemediation[❌ NO REMEDIATION TRIGGERED]
    
    NoRemediation --> Impact[Control Plane Remains Broken:<br/>- No API Server<br/>- No Scheduling<br/>- No Controllers<br/>- Cluster Partially Down]
3. Peer Health Check Details
sequenceDiagram
    participant CP as Control Plane<br/>(Broken API)
    participant W1 as Worker 1
    participant W2 as Worker 2
    participant W3 as Worker 3
    
    Note over CP: API Check Fails
    CP->>W1: Is my SNR CR present?
    CP->>W2: Is my SNR CR present?
    CP->>W3: Is my SNR CR present?
    
    W1--xCP: Error: Can't reach API
    W2--xCP: Error: Can't reach API
    W3--xCP: Error: Can't reach API
    
    Note over CP: >50% peers have API errors
    Note over CP: Status = HealthyBecauseMostPeersCantAccessAPIServer
    Note over CP: Run Diagnostics:
    Note over CP: ✓ Kubelet Running
    Note over CP: ✓ No Endpoint URL
    Note over CP: = HEALTHY (Bug!)
4. The Diagnostic Gap
flowchart LR
    subgraph "Current Diagnostics"
        D1[Endpoint Health Check]
        D2[Kubelet Service Check]
    end
    
    subgraph "Missing Checks"
        M1[API Server Process]
        M2[Controller Manager]
        M3[Scheduler]
        M4[Etcd Connectivity]
    end
    
    subgraph "Result"
        R1[False Positive:<br/>Node Considered Healthy<br/>Despite Being Non-Functional]
    end
    
    D1 --> R1
    D2 --> R1
    M1 -.->|Should Check| R1
    M2 -.->|Should Check| R1
    M3 -.->|Should Check| R1
    M4 -.->|Should Check| R1
Code Analysis
The Bug Location
File:
func (manager *Manager) IsControlPlaneHealthy(workerPeersResponse peers.Response,
    canOtherControlPlanesBeReached bool) bool {
    
    switch workerPeersResponse.ReasonCode {
    // ...
    case peers.HealthyBecauseMostPeersCantAccessAPIServer:
        didDiagnosticsPass := manager.isDiagnosticsPassed()
        manager.log.Info("The peers couldn't access the API server, so we are returning whether "+
            "diagnostics passed", "didDiagnosticsPass", didDiagnosticsPass)
        return didDiagnosticsPass  // <-- BUG: Returns true if kubelet is running
    // ...
    }
}
func (manager *Manager) isDiagnosticsPassed() bool {
    manager.log.Info("Starting control-plane node diagnostics")
    if manager.isEndpointAccessLost() {
        return false
    } else if !manager.isKubeletServiceRunning() {  // <-- Only checks kubelet!
        return false
    }
    manager.log.Info("Control-plane node diagnostics passed successfully")
    return true
}
The Problem
The diagnostics only check:
 They DO NOT check: 
Impact
This edge case creates a situation where:
Recommended Solutions
Solution 1: Enhanced Diagnostics
func (manager *Manager) isDiagnosticsPassed() bool {
    // Existing checks...
    
    // Add: Check if API server is running locally
    if !manager.isAPIServerRunningLocally() {
        manager.log.Info("API server is not running locally")
        return false
    }
    
    // Add: Check local API server connectivity
    if !manager.canReachLocalAPIServer() {
        manager.log.Info("Cannot reach local API server")
        return false
    }
    
    return true
}
Solution 2: Control Plane Specific Logic
When
Solution 3: Timeout-Based Remediation
If a control plane node remains in this state for a configurable duration (one possible shape of such a check is sketched below):
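For illustration, a minimal sketch of what such a timeout could look like; the struct, the reason-code flag, and the five-minute window are assumptions, not existing SNR code or configuration.

```go
package main

import (
	"fmt"
	"time"
)

// escalator remediates a control plane node that has been stuck in the
// "most peers can't access the API server" state for longer than a window.
type escalator struct {
	window time.Duration
	since  *time.Time // when the degraded state was first observed
}

func (e *escalator) shouldRemediate(mostPeersCantAccessAPI bool, now time.Time) bool {
	if !mostPeersCantAccessAPI {
		e.since = nil // healthy again, reset the clock
		return false
	}
	if e.since == nil {
		e.since = &now
	}
	return now.Sub(*e.since) >= e.window
}

func main() {
	e := &escalator{window: 5 * time.Minute}
	t0 := time.Now()
	fmt.Println(e.shouldRemediate(true, t0))                    // false: window just started
	fmt.Println(e.shouldRemediate(true, t0.Add(6*time.Minute))) // true: window exceeded
}
```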
 References
Conclusion
The current SNR implementation has a critical gap in control plane node health assessment. When the API server fails but kubelet remains running, the node incorrectly considers itself healthy, preventing automatic remediation. This edge case requires either enhanced diagnostics that specifically check control plane components or a fundamental change in how control plane node health is evaluated during cluster-wide API server outages.
Find an updated analysis here of what I did; perhaps it will help. I attempted to pull it together with diagrams to make it clear: https://gist.github.com/mark-dell-usa/576901e5cc420114bcf1076ff5d57f52
…#251 exists and is a real problem.
befbcbb to 2fe5996 (Compare)
  
    …y and getWorkerPeersResponse for issue medik8s#251
Actionable comments posted: 2
♻️ Duplicate comments (5)
controllers/tests/shared/shared.go (3)
82-89: Random IP helper forgets to update.Status.PodIP
assignRandomIpAddressesPods()populatesPodIPs, but callers relying on the legacy singular field (pod.Status.PodIP) will still observe an empty value. Several components (including controller-runtime caches and some older network add-ons) still look atPodIPwhen only one address is expected.Consider mirroring the first generated address into both fields:
func assignRandomIpAddressesPods(pods *corev1.PodList) { for i := range pods.Items { - pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: GetRandomIpAddress()}} + randIP := GetRandomIpAddress() + pods.Items[i].Status.PodIP = randIP // keep legacy field in sync + pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: randIP}} } }
101-107: Use proper IPv6 address generation instead of UUID bytesConverting UUID bytes directly to IPv6 addresses may generate invalid or reserved addresses. Consider using proper IPv6 address generation that respects network standards:
func GetRandomIpAddress() (randomIP string) { - u := uuid.New() - ip := net.IP(u[:net.IPv6len]) - randomIP = ip.String() + // Generate a random IPv6 address in the fd00::/8 range (Unique Local Address) + bytes := make([]byte, net.IPv6len) + bytes[0] = 0xfd // ULA prefix + _, _ = rand.Read(bytes[1:]) + ip := net.IP(bytes) + randomIP = ip.String() return }
116-128: Deadlock risk: simulated peer sends > 1 message to a bounded channel
getHealthStatusFromPeers() allocates resultsChan with capacity = len(addresses) and waits for exactly one message per peer. The current override loops over ckw.SimulatePeerResponses and may write multiple codes, causing the goroutine to block on the second send and the whole test to hang.
- case ckw.ShouldSimulatePeerResponses:
-     for _, code := range ckw.SimulatePeerResponses {
-         results <- code
-     }
-     return
+ case ckw.ShouldSimulatePeerResponses:
+     // Provide exactly one deterministic response per call
+     if len(ckw.SimulatePeerResponses) > 0 {
+         results <- ckw.SimulatePeerResponses[0]
+     } else {
+         results <- selfNodeRemediation.RequestFailed
+     }
+     return
If you need per-peer variance, consider indexing into the slice with a counter guarded by a mutex instead of pushing multiple values at once.
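A standalone sketch of that mutex-guarded counter approach; the type and field names are placeholders rather than the actual test wrapper.

```go
package main

import (
	"fmt"
	"sync"
)

type responseCode int

// simulatedPeers hands out one canned response per call, in order, so a
// bounded results channel never receives more messages than expected.
type simulatedPeers struct {
	mu        sync.Mutex
	responses []responseCode
	next      int
}

func (s *simulatedPeers) nextResponse(fallback responseCode) responseCode {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.next >= len(s.responses) {
		return fallback
	}
	code := s.responses[s.next]
	s.next++
	return code
}

func main() {
	sim := &simulatedPeers{responses: []responseCode{1, 2}}
	results := make(chan responseCode, 3) // one slot per simulated peer
	for i := 0; i < 3; i++ {
		results <- sim.nextResponse(0) // 0 stands in for a "request failed" code
	}
	close(results)
	for code := range results {
		fmt.Println("peer response:", code)
	}
}
```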
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
1246-1267: Cleanup doesn't persist the label rollback – subsequent tests may inherit the control-plane labelInside
configureUnhealthyNodeAsControlNode, theDeferCleanupcallback only changes the in-memoryunhealthyNodeobject but never updates the cluster. If another test starts before this process ends (or if the global state is inspected inverifyCleanState), the node will still carry thenode-role.kubernetes.io/master=truelabel, leading to flaky or misleading results.DeferCleanup(func() { By("Revert the unhealthy node's role to its previous value", func() { unhealthyNode.Labels[labels2.MasterRole] = previousRole + // Persist the rollback so follow-up tests start from + // a clean state + Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(), + "failed to restore the unhealthy node label after test cleanup") }) })
934-945: Variable shadowing hides the named return value and loses contextThe inner
err := …shadows the function's namederrreturn value, then the assignmenterr = …inside the block modifies the shadowed variable, not the function return. As a result, the caller may receive anilerror even though the Get failed.By(fmt.Sprintf("Attempting to get pod '%s' before deleting it", podName), func() { - if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil { - if apierrors.IsNotFound(err) && !throwErrorIfNotFound { + getErr := k8sClient.Client.Get(context.Background(), podKey, pod) + if getErr != nil { + if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound { logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName) err = nil return } - - err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err) + err = fmt.Errorf("unable to get pod with name '%s' in order to delete it: %w", podName, getErr) return } })
🧹 Nitpick comments (4)
pkg/peers/peers.go (4)
105-117: Excessive logging in updateWorkerPeers creates noise.The function logs entry, setter calls, and getter calls at Info level. These logs fire on every peer update interval and provide limited operational value. In production clusters with frequent peer updates, this generates unnecessary log volume.
Consider removing or downgrading these logs to a trace/debug level that can be conditionally enabled, keeping only errors or significant state changes at Info level.
120-135: Excessive logging in UpdateControlPlanePeers creates noise.Similar to
updateWorkerPeers, this function logs entry, setter calls, and getter calls at Info level on every update interval, creating unnecessary log volume in production.Consider removing or downgrading these logs to trace/debug level.
157-157: Logging entire nodes list may produce excessive output.In clusters with many nodes, logging the complete
nodesobject dumps significant data at Info level. This can overwhelm logs and make troubleshooting harder.Consider logging a summary instead:
- p.log.Info("updatePeers", "nodes", nodes) + p.log.Info("updatePeers retrieved nodes", "nodeCount", len(nodes.Items))
209-219: Reduce logging frequency in GetPeersAddresses.This function logs peer addresses at Info level on every call. Depending on how frequently health checks or peer queries occur, this can generate significant log volume with redundant information.
Consider downgrading to debug level or logging only when addresses change:
- p.log.Info("GetPeersAddresses", "workerPeersAddresses", p.workerPeersAddresses, - "controlPlanePeersAddresses", p.controlPlanePeersAddresses) + p.log.V(1).Info("GetPeersAddresses", "workerPeersAddresses", p.workerPeersAddresses, + "controlPlanePeersAddresses", p.controlPlanePeersAddresses)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- pkg/peerhealth/peerhealth.pb.gois excluded by- !**/*.pb.go
- pkg/peerhealth/peerhealth_grpc.pb.gois excluded by- !**/*.pb.go
📒 Files selected for processing (12)
- controllers/selfnoderemediation_controller.go(3 hunks)
- controllers/tests/config/suite_test.go(1 hunks)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/controller/suite_test.go(3 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- go.mod(1 hunks)
- pkg/apicheck/check.go(6 hunks)
- pkg/controlplane/manager.go(2 hunks)
- pkg/peers/peers.go(4 hunks)
- pkg/utils/pods.go(3 hunks)
- vendor/github.com/onsi/gomega/gcustom/make_matcher.go(1 hunks)
- vendor/modules.txt(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- controllers/tests/config/suite_test.go
- pkg/controlplane/manager.go
- controllers/selfnoderemediation_controller.go
- controllers/tests/controller/suite_test.go
- go.mod
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
PR: medik8s/self-node-remediation#262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.
Applied to files:
- controllers/tests/controller/selfnoderemediation_controller_test.go
- controllers/tests/shared/shared.go
🧬 Code graph analysis (3)
controllers/tests/controller/selfnoderemediation_controller_test.go (5)
api/v1alpha1/selfnoderemediation_types.go (4)
RemediationStrategyType(44-44)
SelfNodeRemediation(92-98)
ResourceDeletionRemediationStrategy(25-25)
SelfNodeRemediationList(103-107)controllers/tests/shared/shared.go (10)
UnhealthyNodeName(35-35)
Namespace(34-34)
GenerateTestConfig(133-145)
PeerNodeName(36-36)
Peer2NodeName(37-37)
SnrPodName2(41-41)
Peer3NodeName(38-38)
SnrPodName3(42-42)
IsK8sNotFoundError(190-206)
SnrPodName1(40-40)vendor/github.com/medik8s/common/pkg/labels/labels.go (1)
MasterRole(7-7)api/HealthCheckResponse.go (2)
Unhealthy(8-8)
HealthCheckResponseCode(3-3)pkg/controlplane/manager.go (1)
NewManager(39-47)
pkg/apicheck/check.go (6)
pkg/controlplane/manager.go (1)
Manager(29-36)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peers/peers.go (5)
New(47-58)
Worker(29-29)
ControlPlane(30-30)
Role(26-26)
Peers(36-45)pkg/peerhealth/client.go (1)
NewClient(19-44)pkg/peerhealth/peerhealth.pb.go (3)
HealthRequest(23-30)
HealthRequest(45-45)
HealthRequest(60-62)pkg/peers/response.go (2)
Response(3-6)
HealthyBecauseErrorsThresholdNotReached(12-12)
controllers/tests/shared/shared.go (3)
pkg/apicheck/check.go (3)
ApiConnectivityCheck(36-45)
New(68-79)
ApiConnectivityCheckConfig(49-66)api/HealthCheckResponse.go (1)
HealthCheckResponseCode(3-3)pkg/controlplane/manager.go (1)
Manager(29-36)
🔇 Additional comments (17)
pkg/utils/pods.go (3)
5-5: LGTM!The import change from
errorstofmtis appropriate given the error formatting improvements below.
21-25: LGTM! Proper error wrapping.The error is now properly wrapped using
fmt.Errorfwith the%wverb, which preserves the error chain and adds descriptive context. This follows Go best practices and improves debugging.
33-33: LGTM! Improved error message.Including the node name in the error message significantly improves debugging and observability when the pod is not found.
vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
18-270: Upstream gcustom matcher looks solidReflect wrapper and templated failure messages behave as expected; nothing concerning here.
vendor/modules.txt (1)
140-145: Vendor manifest entry matches added packageThe gcustom module entry is consistent with the new vendored files.
controllers/tests/shared/shared.go (2)
67-91: LGTM on K8sClientWrapper.List refactoring.The switch-based flow with fallthrough is clear and correct. The IP randomization is properly applied after the underlying client.List call.
167-206: LGTM on error matcher utilities.The custom Gomega matchers for K8s error testing follow the correct pattern and provide useful abstractions for test assertions.
pkg/apicheck/check.go (4)
47-47: LGTM on injectable health-check function pattern.The introduction of
GetHealthStatusFromRemoteFuncas a type and the ability to inject it viaSetHealthStatusFuncis a solid pattern for testability. The constructor properly sets a default implementation, ensuring production behavior remains unchanged while enabling test overrides.Also applies to: 68-79
81-124: LGTM on GetDefaultPeerHealthCheckFunc extraction.Extracting the default peer health-check logic into a named method improves code organization and makes the health-check implementation explicit and replaceable.
183-225: Verify the new health-check flow aligns with intended behavior.The refactored
isConsideredHealthylogic significantly changes the health determination flow:
- Worker nodes now consult control-plane peers if worker peers report unhealthy
- Control-plane nodes rely on control-plane peer responses and diagnostics
Based on past review comments, there are ongoing discussions about whether this flow correctly handles all scenarios (e.g., ignoring worker response reasons for CP nodes). Ensure the logic matches the expected behavior documented in the PR objectives and addresses Issue #251.
414-426: LGTM on health-check function accessors.The setter, getter, and delegation pattern for
getHealthStatusFromRemoteFuncis correctly implemented and enables the test wrapper to override behavior.controllers/tests/controller/selfnoderemediation_controller_test.go (6)
39-39: Verify global remediation strategy is properly managed across tests.The package-level
remediationStrategyvariable is mutable test state. If tests don't consistently reset it viaconfigureRemediationStrategy()or BeforeEach blocks, residual values from one test can leak into another.Scan test cases to confirm each that modifies
remediationStrategyproperly restores it in cleanup.
507-554: Test case demonstrates Issue #251 but comment suggests expected failure. The test "two control node peers found, they tell me I'm unhealthy" includes the comment: "It's expected that the next line will fail, even though it shouldn't!" This suggests the test is meant to demonstrate the bug rather than pass after the fix is applied.
If the PR intends to fix Issue #251, this test should pass after the code changes. Update the test or comment to reflect whether:
- The test currently demonstrates the bug (it should then be marked as pending/skipped until fixed; see the sketch after this list)
- The fix in this PR makes the test pass (remove the comment about expected failure)
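If the pending/skipped route is taken, Ginkgo supports this directly. A minimal standalone sketch, assuming Ginkgo v2 and Gomega as the repo's test framework (the package name, spec text, and placeholder assertion are illustrative, not the actual test):

```go
package demo_test

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestDemo(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "pending spec demo")
}

var _ = Describe("watchdog behaviour for issue #251", func() {
	// Option A: PIt marks the spec pending; it is reported but never executed,
	// so a known-failing assertion cannot break CI.
	PIt("triggers a reboot once the control-plane peer flow is fixed", func() {
		Expect(true).To(BeFalse()) // placeholder for the real watchdog assertion
	})

	// Option B: keep the spec active but bail out with an explanation.
	It("triggers a reboot (skipped until the fix lands)", func() {
		Skip("demonstrates issue #251; re-enable when the fix is in")
	})
})
```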
830-859: LGTM on createGenericSelfNodeRemediationPod helper. The function properly creates an SNR pod with correct labels, registers cleanup, and verifies creation. The use of `DeferCleanup` ensures proper test hygiene.
1311-1376: LGTM on addNodes helper for test orchestration. The `addNodes` helper provides a clean, declarative way to set up multi-node test scenarios with associated pods and simulated responses. The cleanup registration ensures proper teardown.
1062-1100: Verify verifyCleanState handles dynamically added nodes. `verifyCleanState` expects exactly 2 nodes (`UnhealthyNodeName` and `PeerNodeName`), but tests using `addNodes` can introduce additional nodes (`Peer2NodeName`, `Peer3NodeName`). If `verifyCleanState` runs before `addNodes` cleanup completes, the assertion at line 1067 will fail. Ensure test cleanup ordering is correct, or make `verifyCleanState` more flexible to handle transient nodes during cleanup.
1193-1210: LGTM on addControlPlaneManager helper. The helper properly creates, wires, and cleans up the control plane manager for tests. The cleanup ensures peer state is reset when the manager is removed.
| } else { | ||
| p.log.Info("Skipping current node/pod combo", | ||
| "node.Name", node.Name, | ||
| "pod.Spec.NodeName", pod.Spec.NodeName) | ||
| } | 
🛠️ Refactor suggestion | 🟠 Major
Remove debug logging from hot path in nested loop.
This Info log fires for every node/pod combination that doesn't match, which is the common case. In a cluster with N nodes and M pods, this can generate N×M log entries per peer update interval, overwhelming logs.
Remove this log statement or make it conditional on a debug flag:
-			} else {
-				p.log.Info("Skipping current node/pod combo",
-					"node.Name", node.Name,
-					"pod.Spec.NodeName", pod.Spec.NodeName)
			}
🤖 Prompt for AI Agents
In pkg/peers/peers.go around lines 191 to 195, the Info log inside the node/pod
nested loop is executed for every non-matching pair and floods logs (N×M
entries); remove this Info log from the hot path or gate it behind a debug flag
so it only emits when verbose/debug logging is enabled (e.g., check the logger's
debug/verbose state before logging), keeping normal operation silent for the
common non-matching case.
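For reference, the usual way to gate such a message in controller-runtime style code is a logr verbosity level rather than a hand-rolled flag. A minimal sketch, assuming the logger is a `logr.Logger` built with controller-runtime's zap helper (the node/pod names are placeholders):

```go
package main

import (
	"github.com/go-logr/logr"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	// zap.New is the controller-runtime helper commonly used to build the operator's logger.
	var log logr.Logger = zap.New(zap.UseDevMode(true))

	nodeName, podNodeName := "worker-1", "worker-2"

	// V(1) messages are emitted only when the logger's verbosity allows them,
	// so this line stays silent on the hot path at the default production level.
	log.V(1).Info("Skipping current node/pod combo",
		"node.Name", nodeName,
		"pod.Spec.NodeName", podNodeName)
}
```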
2fe5996 to d1c5111 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (4)
controllers/tests/shared/shared.go (2)
102-108: Use proper IPv6 address generation instead of raw UUID bytes. Converting UUID bytes directly to an IPv6 address can produce invalid or reserved addresses. Consider generating addresses in a proper test range such as the Unique Local Address (ULA) range `fd00::/8`. Apply this diff:
 func GetRandomIpAddress() (randomIP string) {
-	u := uuid.New()
-	ip := net.IP(u[:net.IPv6len])
-	randomIP = ip.String()
-
+	// Generate a random IPv6 address in the fd00::/8 range (Unique Local Address)
+	bytes := make([]byte, net.IPv6len)
+	bytes[0] = 0xfd // ULA prefix
+	u := uuid.New()
+	copy(bytes[1:], u[:net.IPv6len-1])
+	randomIP = net.IP(bytes).String()
 	return
 }
83-97: Set the legacy PodIP field for compatibility. While `PodIPs` is the current standard, some components still rely on the singular `pod.Status.PodIP` field. The helper only populates `PodIPs`, which may cause compatibility issues. Apply this diff to also populate the legacy field:
 func assignRandomIpAddressesPods(pods *corev1.PodList) {
 	for i := range pods.Items {
-		pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: GetRandomIpAddress()}}
+		randIP := GetRandomIpAddress()
+		pods.Items[i].Status.PodIP = randIP
+		pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: randIP}}
 	}
-
-	return
 }
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
1246-1267: Fix: Cleanup doesn't persist the label rollback. The `DeferCleanup` callback at lines 1261-1265 modifies the in-memory `unhealthyNode.Labels` but never calls `k8sClient.Update()` to persist the change. Subsequent tests may inherit the `node-role.kubernetes.io/master` label, causing flaky behavior. Apply this diff to persist the rollback:
 DeferCleanup(func() {
 	By("Revert the unhealthy node's role to its previous value", func() {
 		unhealthyNode.Labels[labels2.MasterRole] = previousRole
+		Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(),
+			"failed to restore the unhealthy node label after test cleanup")
 	})
 })
927-950: Fix: Variable shadowing prevents error return. The `err :=` short declaration at line 935 shadows the named return variable `err`, so assignments to `err` inside the `By` block (lines 938, 942) modify the local variable rather than the return value. The function always returns `nil` even when errors occur. Apply this diff to fix the shadowing:
 By(fmt.Sprintf("Attempting to get pod '%s' before deleting it", podName), func() {
-	if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil {
-		if apierrors.IsNotFound(err) && !throwErrorIfNotFound {
+	getErr := k8sClient.Client.Get(context.Background(), podKey, pod)
+	if getErr != nil {
+		if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound {
 			logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName)
-			err = nil
 			return
 		}
-		err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err)
+		err = fmt.Errorf("unable to get pod with name '%s': %w", podName, getErr)
 		return
 	}
 })
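The underlying Go pitfall is easy to reproduce in isolation; a minimal sketch with invented names, separate from the test code above:

```go
package main

import (
	"errors"
	"fmt"
)

// broken always returns nil: the `err :=` in the if statement declares a new
// variable that shadows the named return value.
func broken() (err error) {
	func() {
		if err := errors.New("boom"); err != nil {
			return // the outer err is never assigned
		}
	}()
	return
}

// fixed uses a differently named local and assigns to the named return.
func fixed() (err error) {
	func() {
		if innerErr := errors.New("boom"); innerErr != nil {
			err = fmt.Errorf("wrapped: %w", innerErr)
		}
	}()
	return
}

func main() {
	fmt.Println(broken()) // <nil>
	fmt.Println(fixed())  // wrapped: boom
}
```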
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- pkg/peerhealth/peerhealth.pb.go is excluded by `!**/*.pb.go`
- pkg/peerhealth/peerhealth_grpc.pb.go is excluded by `!**/*.pb.go`
📒 Files selected for processing (3)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- pkg/apicheck/check.go(6 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
PR: medik8s/self-node-remediation#262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.
Applied to files:
- controllers/tests/controller/selfnoderemediation_controller_test.go
- controllers/tests/shared/shared.go
🧬 Code graph analysis (3)
controllers/tests/controller/selfnoderemediation_controller_test.go (5)
api/v1alpha1/selfnoderemediation_types.go (4)
RemediationStrategyType(44-44)
SelfNodeRemediation(92-98)
ResourceDeletionRemediationStrategy(25-25)
SelfNodeRemediationList(103-107)controllers/tests/shared/shared.go (10)
UnhealthyNodeName(36-36)
Namespace(35-35)
GenerateTestConfig(180-192)
PeerNodeName(37-37)
Peer2NodeName(38-38)
SnrPodName2(42-42)
Peer3NodeName(39-39)
SnrPodName3(43-43)
IsK8sNotFoundError(237-253)
SnrPodName1(41-41)vendor/github.com/medik8s/common/pkg/labels/labels.go (1)
MasterRole(7-7)api/HealthCheckResponse.go (2)
Unhealthy(8-8)
HealthCheckResponseCode(3-3)pkg/controlplane/manager.go (1)
NewManager(39-47)
pkg/apicheck/check.go (6)
pkg/controlplane/manager.go (1)
Manager(29-36)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peers/peers.go (5)
New(47-58)
Worker(29-29)
ControlPlane(30-30)
Role(26-26)
Peers(36-45)pkg/peerhealth/client.go (1)
NewClient(19-44)pkg/peerhealth/peerhealth.pb.go (3)
HealthRequest(23-30)
HealthRequest(45-45)
HealthRequest(60-62)pkg/peers/response.go (2)
Response(3-6)
HealthyBecauseErrorsThresholdNotReached(12-12)
controllers/tests/shared/shared.go (4)
pkg/apicheck/check.go (3)
ApiConnectivityCheck(36-45)
New(68-79)
ApiConnectivityCheckConfig(49-66)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/controlplane/manager.go (1)
Manager(29-36)vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
MakeMatcher(87-128)
🔇 Additional comments (10)
pkg/apicheck/check.go (4)
47-47: LGTM: Injectable health check function enables testing. The new `GetHealthStatusFromRemoteFunc` type allows test code to inject custom health-check behavior, which is essential for simulating peer responses in unit tests without requiring actual network calls.
68-79: LGTM: Constructor properly initializes the default health check function. The refactored constructor now returns a named return value and correctly wires up the default peer health check function via `SetHealthStatusFunc(c.GetDefaultPeerHealthCheckFunc())`. This provides a clean separation between the default implementation and test overrides.
183-225: LGTM: Health check logic now handles control plane nodes correctly. The refactored `isConsideredHealthy()` method now properly differentiates between worker and control plane nodes, addressing the core issue in #251. For control plane nodes, it queries control plane peers and cross-checks with diagnostics, while worker nodes first check worker peers and fall back to control plane peers as a secondary verification.
227-339: LGTM: Generalized peer response handling with improved logging. The renamed `getPeersResponse()` method now accepts a `role` parameter, enabling it to query either worker or control plane peers. The enhanced logging that includes the role name addresses the misleading log message flagged in previous reviews.
controllers/tests/shared/shared.go (1)
217-235: LGTM: Custom Gomega matcher for ignored errors. The `IsIgnoredK8sError()` matcher correctly uses `gcustom.MakeMatcher` to create a reusable matcher that checks if an error matches any of the provided testing functions. This is a clean pattern for test assertions.
controllers/tests/controller/selfnoderemediation_controller_test.go (5)
507-554: LGTM: Test case reproduces issue #251. This test case properly simulates the scenario described in issue #251: a control plane node with multiple control plane peers reporting it as unhealthy. The comment on line 551-552 indicates this test demonstrates the bug (watchdog should trigger but currently doesn't), which aligns with the PR objectives.
766-788: LGTM: Improved remediation cleanup with better error handling. The refactored `deleteRemediations()` now uses `Eventually` with Gomega assertions for more robust cleanup, properly removes finalizers before deletion, and verifies the list is empty afterwards.
830-859: LGTM: Reusable pod creation helper with automatic cleanup. The new `createGenericSelfNodeRemediationPod()` helper centralizes pod creation logic and uses `DeferCleanup` to ensure proper teardown. This reduces code duplication across test cases.
1193-1210: LGTM: Control plane manager helper with proper cleanup. The `addControlPlaneManager()` helper properly initializes the control plane manager, updates peer addresses, and uses `DeferCleanup` to restore the previous state. This enables testing control plane scenarios.
1327-1373: LGTM: Flexible node and pod creation for test scenarios. The `addNodes()` helper accepts a slice of `newNodeConfig` structs, allowing test cases to declaratively define complex multi-node scenarios with custom labels and simulated peer responses. The automatic cleanup ensures tests don't pollute subsequent runs.
d1c5111 to 281a674 (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (5)
controllers/tests/shared/shared.go (3)
94-101: Mirror PodIPs[0] into legacy PodIP. Some consumers still read Status.PodIP. Keep both in sync.
 func assignRandomIpAddressesPods(pods *corev1.PodList) {
-	for i := range pods.Items {
-		pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: GetRandomIpAddress()}}
-	}
+	for i := range pods.Items {
+		randIP := GetRandomIpAddress()
+		pods.Items[i].Status.PodIP = randIP
+		pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: randIP}}
+	}
 	return
 }
102-108: Generate valid IPv6 (ULA) instead of raw UUID bytes. Stay within fd00::/8 to avoid invalid/reserved addrs. Keeps imports the same.
 func GetRandomIpAddress() (randomIP string) {
-	u := uuid.New()
-	ip := net.IP(u[:net.IPv6len])
-	randomIP = ip.String()
+	// Generate a Unique Local Address in fd00::/8
+	u := uuid.New()
+	b := u[:net.IPv6len]
+	b[0] = 0xfd
+	ip := net.IP(b)
+	randomIP = ip.String()
 	return
 }
130-144: Bug: queue head not removed when length == 1 (repeats last simulated response). Slice must always shrink by 1 after pop.
 func (ckw *ApiConnectivityCheckWrapper) nextSimulatedPeerResponse() selfNodeRemediation.HealthCheckResponseCode {
 	ckw.responsesMu.Lock()
 	defer ckw.responsesMu.Unlock()
 	if len(ckw.simulatedPeerResponses) == 0 {
 		return selfNodeRemediation.RequestFailed
 	}
-	code := ckw.simulatedPeerResponses[0]
-	if len(ckw.simulatedPeerResponses) > 1 {
-		ckw.simulatedPeerResponses = append([]selfNodeRemediation.HealthCheckResponseCode{}, ckw.simulatedPeerResponses[1:]...)
-	}
-
-	return code
+	code := ckw.simulatedPeerResponses[0]
+	ckw.simulatedPeerResponses = ckw.simulatedPeerResponses[1:]
+	return code
 }
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
934-944: Fix err shadowing; ensure named return is set. Current code shadows err and may return nil on failure.
-	By(fmt.Sprintf("Attempting to get pod '%s' before deleting it", podName), func() {
-		if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil {
-			if apierrors.IsNotFound(err) && !throwErrorIfNotFound {
-				logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName)
-				err = nil
-				return
-			}
-
-			err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err)
-			return
-		}
-	})
+	By(fmt.Sprintf("Attempting to get pod '%s' before deleting it", podName), func() {
+		getErr := k8sClient.Client.Get(context.Background(), podKey, pod)
+		if getErr != nil {
+			if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound {
+				logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName)
+				err = nil
+				return
+			}
+			err = fmt.Errorf("unable to get pod with name '%s' in order to delete it: %w", podName, getErr)
+			return
+		}
+	})
1261-1265: Persist label rollback in cleanup to avoid test bleed-over. Cleanup only edits the local object; update the cluster too.
 DeferCleanup(func() {
 	By("Revert the unhealthy node's role to its previous value", func() {
 		unhealthyNode.Labels[labels2.MasterRole] = previousRole
+		Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(),
+			"failed to restore the unhealthy node label after test cleanup")
 	})
 })
🧹 Nitpick comments (4)
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
907-910: Fix log message: it's a pod, not a node. Minor wording nit.
-	ExpectWithOffset(1, err).To(Or(Succeed(), shared.IsK8sNotFoundError()),
-		"expected the delete operation to succeed, or for it to have told us that node '%s'"+
-			" didn't exist", pod.Name)
+	ExpectWithOffset(1, err).To(Or(Succeed(), shared.IsK8sNotFoundError()),
+		"expected the delete operation to succeed, or for it to have told us that pod '%s' didn't exist", pod.Name)
1194-1210: Verify control-plane manager lifecycle; avoid leaking goroutines. Starting the manager directly with context.Background() may not stop on test teardown. Prefer wiring it via the envtest manager (Add) or use a cancellable context.
pkg/apicheck/check.go (2)
430-433: Tighten comment and guard nil handler. Reflect actual behavior and add a safety fallback.
-// GetHealthStatusFromPeer issues a GET request to the specified IP and returns the result from the peer into the given channel
+// getHealthStatusFromPeer delegates to the configured health-check function (gRPC peer query by default).
 func (c *ApiConnectivityCheck) getHealthStatusFromPeer(endpointIp corev1.PodIP, results chan<- selfNodeRemediation.HealthCheckResponseCode) {
-	c.getHealthStatusFromRemoteFunc(endpointIp, results)
+	if c.getHealthStatusFromRemoteFunc == nil {
+		c.SetHealthStatusFunc(c.GetDefaultPeerHealthCheckFunc())
+	}
+	c.getHealthStatusFromRemoteFunc(endpointIp, results)
 }
255-272: Clarify mutually exclusive branches for 0 peers. The len(peersToAsk)==0 branch only executes when MinPeersForRemediation==0 (otherwise the prior < MinPeers check returns healthy). Consider a comment or reorder for clarity.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/controller/suite_test.go(3 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- pkg/apicheck/check.go(6 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
PR: medik8s/self-node-remediation#262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.
Applied to files:
- controllers/tests/shared/shared.go
- controllers/tests/controller/selfnoderemediation_controller_test.go
🧬 Code graph analysis (4)
pkg/apicheck/check.go (5)
pkg/controlplane/manager.go (1)
Manager(29-36)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peers/peers.go (5)
New(47-58)
Worker(29-29)
ControlPlane(30-30)
Role(26-26)
Peers(36-45)pkg/peerhealth/client.go (1)
NewClient(19-44)pkg/peers/response.go (2)
Response(3-6)
HealthyBecauseErrorsThresholdNotReached(12-12)
controllers/tests/shared/shared.go (5)
pkg/apicheck/check.go (3)
ApiConnectivityCheck(36-45)
New(68-79)
ApiConnectivityCheckConfig(49-66)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peers/peers.go (1)
New(47-58)pkg/controlplane/manager.go (1)
Manager(29-36)vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
MakeMatcher(87-128)
controllers/tests/controller/suite_test.go (1)
controllers/tests/shared/shared.go (3)
ApiConnectivityCheckWrapper(60-66)
MinPeersForRemediationConfigDefaultValue(48-48)
NewApiConnectivityCheckWrapper(110-128)
controllers/tests/controller/selfnoderemediation_controller_test.go (5)
api/v1alpha1/selfnoderemediation_types.go (4)
RemediationStrategyType(44-44)
SelfNodeRemediation(92-98)
ResourceDeletionRemediationStrategy(25-25)
SelfNodeRemediationList(103-107)controllers/tests/shared/shared.go (10)
UnhealthyNodeName(36-36)
Namespace(35-35)
GenerateTestConfig(180-192)
PeerNodeName(37-37)
Peer2NodeName(38-38)
SnrPodName2(42-42)
Peer3NodeName(39-39)
SnrPodName3(43-43)
IsK8sNotFoundError(237-253)
SnrPodName1(41-41)vendor/github.com/medik8s/common/pkg/labels/labels.go (1)
MasterRole(7-7)api/HealthCheckResponse.go (2)
Unhealthy(8-8)
HealthCheckResponseCode(3-3)pkg/controlplane/manager.go (1)
NewManager(39-47)
🔇 Additional comments (1)
controllers/tests/controller/suite_test.go (1)
169-176: LGTM: wrapper-based ApiConnectivityCheck wiring. Using the wrapper and enabling simulated peer responses by default keeps tests deterministic and avoids dialing peers.
| // TODO does this work with IPv6? | ||
| // MES: Yes it does, we've tested this | ||
| phClient, err := peerhealth.NewClient(fmt.Sprintf("%v:%v", endpointIp.IP, c.config.PeerHealthPort), c.config.PeerDialTimeout, c.config.Log.WithName("peerhealth client"), c.clientCreds) | ||
| if err != nil { | ||
| logger.Error(err, "failed to init grpc client") | ||
| results <- selfNodeRemediation.RequestFailed | ||
| return | ||
| } | ||
| defer phClient.Close() | 
Fix IPv6 dialing: wrap IPv6 literals in brackets
Without brackets, gRPC/net parsing fails for IPv6.
-		phClient, err := peerhealth.NewClient(fmt.Sprintf("%v:%v", endpointIp.IP, c.config.PeerHealthPort), c.config.PeerDialTimeout, c.config.Log.WithName("peerhealth client"), c.clientCreds)
+		addr := fmt.Sprintf("%s:%d", endpointIp.IP, c.config.PeerHealthPort)
+		// IPv6 literals must be bracketed in host:port form
+		if strings.Contains(endpointIp.IP, ":") {
+			addr = fmt.Sprintf("[%s]:%d", endpointIp.IP, c.config.PeerHealthPort)
+		}
+		phClient, err := peerhealth.NewClient(addr, c.config.PeerDialTimeout, c.config.Log.WithName("peerhealth client"), c.clientCreds)
Add import:
import "strings"🤖 Prompt for AI Agents
In pkg/apicheck/check.go around lines 93 to 101, the gRPC dialing address uses
fmt.Sprintf("%v:%v", endpointIp.IP, c.config.PeerHealthPort) which breaks for
IPv6 literals; update the code to detect IPv6 addresses (use
strings.Contains(endpointIp.IP, ":") after adding import "strings") and wrap
IPv6 literals in brackets when building the host:port (i.e., use "[IP]:port" for
IPv6, plain "IP:port" for IPv4), then pass that formatted address into
peerhealth.NewClient; keep the existing error handling and defer close.
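An alternative to hand-building the brackets is `net.JoinHostPort` from the standard library, which brackets IPv6 literals automatically. A small standalone sketch (the port value is illustrative, not the real PeerHealthPort default):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	port := "30001" // illustrative only

	// IPv4 addresses pass through unchanged; IPv6 literals get bracketed.
	fmt.Println(net.JoinHostPort("10.0.0.7", port)) // 10.0.0.7:30001
	fmt.Println(net.JoinHostPort("fd00::7", port))  // [fd00::7]:30001
}
```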
        
          
pkg/apicheck/check.go (Outdated)
      | c.config.Log.Info("isConsideredHealthy: we have checked the control plane peer responses and cross "+ | ||
| "checked it against the control plane diagnostics ", | ||
| "isControlPlaneHealthy", controlPlanePeersResponse.IsHealthy) | ||
|  | ||
| return isControlPlaneHealthy | ||
|  | 
Fix log: wrong variable under key isControlPlaneHealthy
It logs controlPlanePeersResponse.IsHealthy instead of the computed isControlPlaneHealthy.
-	c.config.Log.Info("isConsideredHealthy: we have checked the control plane peer responses and cross "+
-		"checked it against the control plane diagnostics ",
-		"isControlPlaneHealthy", controlPlanePeersResponse.IsHealthy)
+	c.config.Log.Info("isConsideredHealthy: we have checked the control plane peer responses and cross "+
+		"checked it against the control plane diagnostics ",
+		"isControlPlaneHealthy", isControlPlaneHealthy)🤖 Prompt for AI Agents
In pkg/apicheck/check.go around lines 219 to 224, the log call is passing
controlPlanePeersResponse.IsHealthy under the "isControlPlaneHealthy" key
instead of the computed isControlPlaneHealthy; update the log invocation to
supply isControlPlaneHealthy as the value for the "isControlPlaneHealthy" key
(and verify key/value ordering matches the logger's expected signature) so the
log reflects the computed result.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
pkg/controlplane/manager.go (1)
169-191: Add HTTP client timeout to avoid hangs in kubelet check. Set a reasonable timeout on the client to prevent indefinite hangs during network issues and to stabilize unit tests.
-	httpClient := &http.Client{Transport: tr}
+	httpClient := &http.Client{
+		Transport: tr,
+		Timeout:   5 * time.Second,
+	}
♻️ Duplicate comments (6)
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
1284-1289: Cleanup doesn't persist the label rollback – update the cluster state. The DeferCleanup only mutates the in-memory object; it doesn't issue an Update to persist the label change. This can leak the control-plane label into subsequent tests.
Apply:
 DeferCleanup(func() {
 	By("Revert the unhealthy node's role to its previous value", func() {
 		unhealthyNode.Labels[labels2.MasterRole] = previousRole
-	})
+		Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(),
+			"failed to restore the unhealthy node label after test cleanup")
+	})
 })
957-972: Fix error shadowing; function may return nil on Get failure. The short variable declaration shadows the named return err. Use a separate local or assign to the named return.
-	if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil {
-		if apierrors.IsNotFound(err) && !throwErrorIfNotFound {
+	getErr := k8sClient.Client.Get(context.Background(), podKey, pod)
+	if getErr != nil {
+		if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound {
 			logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName)
 			err = nil
 			return
 		}
-		err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err)
+		err = fmt.Errorf("unable to get pod with name '%s' in order to delete it: %w", podName, getErr)
 		return
 	}
controllers/tests/shared/shared.go (3)
93-100: Random IP helper forgets to update `.Status.PodIP`. `assignRandomIpAddressesPods()` populates `PodIPs`, but callers relying on the legacy singular field (`pod.Status.PodIP`) will still observe an empty value. Several components (including controller-runtime caches and some older network add-ons) still look at `PodIP` when only one address is expected.
112-118: Use proper IPv6 address generation instead of UUID bytes. Converting UUID bytes directly to IPv6 addresses may generate invalid or reserved addresses. Consider using proper IPv6 address generation that respects network standards.
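One standards-respecting option is to stay inside the `fd00::/8` Unique Local Address block, in the same spirit as the diff suggested in an earlier round. A sketch with an invented helper name:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"net"
)

// randomULAAddress returns a random IPv6 address inside fd00::/8 so test
// traffic can never collide with a globally routable address.
func randomULAAddress() (string, error) {
	b := make([]byte, net.IPv6len)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	b[0] = 0xfd // force the ULA prefix
	return net.IP(b).String(), nil
}

func main() {
	ip, err := randomULAAddress()
	if err != nil {
		panic(err)
	}
	fmt.Println(ip)
}
```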
140-154: Fix: Last simulated response is never removed from the queue. When there's exactly one element in the queue, the code extracts it but doesn't remove it from the slice. Subsequent calls will return the same response repeatedly instead of returning `RequestFailed`.
101-103: IPv6 address formatting missing brackets for gRPC dial. The TODO comment claims IPv6 works, but the code uses `fmt.Sprintf("%v:%v", endpointIp.IP, c.config.PeerHealthPort)`, which breaks for IPv6 literals in gRPC/net parsing. IPv6 addresses must be wrapped in brackets when combined with a port.
🧹 Nitpick comments (8)
pkg/apicheck/failure_tracker_test.go (1)
8-27: Solid coverage of windowed escalation; consider a couple of edge cases. Looks good. Optionally add tests for:
- window <= 0 (ShouldEscalate must be true)
- nil receiver safety (methods no-op / false)
Also applies to: 29-46, 48-63
pkg/apicheck/failure_tracker.go (1)
21-47: Clarify semantics and consider minor cleanups
- Document explicitly that window <= 0 forces escalation (true), to avoid surprises. Also document that ShouldEscalate mutates internal state (window reset on gaps).
- Not thread-safe by design; note single-goroutine expectation.
- Minor: the IsZero guard on lastFailure is redundant after consecutive==0 path; can drop it.
Diff (optional tidy-up):
-	if ft.lastFailure.IsZero() {
-		ft.lastFailure = at
-	}
Also applies to: 61-92
pkg/apicheck/check_internal_test.go (1)
16-22: Avoid unsafe reflection to mutate peers in tests. Using unsafe to set unexported fields bypasses locks and is brittle. Prefer:
- Injecting a PeersOverrideFunc in apicheck (already present in other tests), or
- Adding a test-only setter behind a build tag in peers, or
- Exposing a minimal exported method to set peer addresses for tests.
This eliminates race/breakage risks if fields or sync change.
main.go (2)
254-265: Support human-readable duration env values. getOptionalDurEnvVar only accepts integers (interpreted as nanoseconds). Consider supporting time.ParseDuration (e.g., "30s", "500ms") with an integer fallback for backward compatibility.
 func getOptionalDurEnvVar(varName string, fallback time.Duration) time.Duration {
-	val, exists := os.LookupEnv(varName)
+	val, exists := os.LookupEnv(varName)
 	if !exists || val == "" {
 		return fallback
 	}
-	parsed, err := strconv.Atoi(val)
-	if err != nil {
-		setupLog.Error(err, "failed to convert env variable to int", "var name", varName, "var value", val)
-		os.Exit(1)
-	}
-	return time.Duration(parsed)
+	// First try Go duration syntax (e.g. "30s", "500ms")
+	if dur, err := time.ParseDuration(val); err == nil {
+		return dur
+	}
+	// Fallback: raw integer as nanoseconds (legacy behavior)
+	parsed, err := strconv.Atoi(val)
+	if err != nil {
+		setupLog.Error(err, "failed to parse duration env var", "var name", varName, "var value", val)
+		os.Exit(1)
+	}
+	return time.Duration(parsed)
 }
332-365: Document fallback behavior and clarify units for timeout configs
- MaxTimeForNoPeersResponse is still required as the default/fallback in effectivePeerQuorumTimeout – don’t remove it; update docs to explain its role.
- In your configuration guide, specify the expected units/formats for MAX_API_FAILURE_WINDOW and PEER_QUORUM_TIMEOUT (e.g. duration strings like “30s”).
pkg/controlplane/manager.go (1)
74-97: Outcome-based health flow looks sane; diagnostics path depends on kubelet check. Switching on EvaluationOutcome is clear and aligns with the redesign. Ensure callers never pass unknown outcomes (you log and return false, which is safe).
Consider adding a package comment on EvaluationOutcome semantics for maintainers.
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
50-57: Avoid fixed sleeps; wait on observable conditions. Replace time.Sleep with Eventually/Consistently on the condition under setup (e.g., config presence) to make tests deterministic and faster.
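For illustration, the same pattern in plain Gomega, polling a condition instead of sleeping (a standalone sketch; the channel stands in for whatever the setup actually waits on):

```go
package demo_test

import (
	"testing"
	"time"

	. "github.com/onsi/gomega"
)

func TestEventuallyInsteadOfSleep(t *testing.T) {
	g := NewWithT(t)

	ready := make(chan struct{})
	go func() {
		time.Sleep(300 * time.Millisecond) // stand-in for "the config shows up eventually"
		close(ready)
	}()

	// Poll every 50ms for up to 2s until the condition holds, instead of a blind sleep.
	g.Eventually(func() bool {
		select {
		case <-ready:
			return true
		default:
			return false
		}
	}, 2*time.Second, 50*time.Millisecond).Should(BeTrue())
}
```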
853-882: Remove fixed sleep; rely on Eventually to observe pod creation. The sleep is redundant as the next step already waits for the pod. Drop the sleep to reduce test latency.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- controllers/tests/controller/control_plane_isolation_testdata.md(1 hunks)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- main.go(3 hunks)
- pkg/apicheck/check.go(5 hunks)
- pkg/apicheck/check_internal_test.go(1 hunks)
- pkg/apicheck/failure_tracker.go(1 hunks)
- pkg/apicheck/failure_tracker_test.go(1 hunks)
- pkg/controlplane/manager.go(3 hunks)
- pkg/controlplane/manager_internal_test.go(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
PR: medik8s/self-node-remediation#262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.
Applied to files:
- controllers/tests/shared/shared.go
🧬 Code graph analysis (8)
pkg/apicheck/failure_tracker_test.go (1)
pkg/apicheck/failure_tracker.go (1)
NewFailureTracker(17-19)
main.go (1)
pkg/reboot/calculator.go (1)
MaxTimeForNoPeersResponse(22-22)
pkg/controlplane/manager.go (1)
pkg/peers/peers.go (1)
ControlPlane(30-30)
pkg/apicheck/check_internal_test.go (3)
pkg/peers/peers.go (1)
Peers(36-45)pkg/apicheck/check.go (2)
ApiConnectivityCheckConfig(55-74)
New(76-87)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
Unhealthy(8-8)
pkg/controlplane/manager_internal_test.go (1)
pkg/controlplane/manager.go (3)
Manager(39-46)
EvaluationIsolation(34-34)
EvaluationGlobalOutage(33-33)
controllers/tests/controller/selfnoderemediation_controller_test.go (7)
api/v1alpha1/selfnoderemediation_types.go (5)
RemediationStrategyType(44-44)
SelfNodeRemediation(92-98)
AutomaticRemediationStrategy(24-24)
ResourceDeletionRemediationStrategy(25-25)
SelfNodeRemediationList(103-107)controllers/tests/shared/shared.go (11)
UnhealthyNodeName(37-37)
Namespace(36-36)
GenerateTestConfig(276-288)
PeerNodeName(38-38)
Peer2NodeName(39-39)
SnrPodName2(43-43)
Peer3NodeName(40-40)
SnrPodName3(44-44)
IsK8sNotFoundError(333-349)
SnrPodName1(42-42)
PeersOverrideFunc(76-76)pkg/peers/peers.go (3)
Role(26-26)
ControlPlane(30-30)
Peers(36-45)api/HealthCheckResponse.go (2)
Unhealthy(8-8)
HealthCheckResponseCode(3-3)vendor/github.com/medik8s/common/pkg/labels/labels.go (1)
MasterRole(7-7)pkg/controlplane/manager.go (1)
NewManager(49-57)pkg/apicheck/check.go (1)
PeersOverrideFunc(36-36)
pkg/apicheck/check.go (7)
pkg/peers/peers.go (5)
Role(26-26)
New(47-58)
Worker(29-29)
ControlPlane(30-30)
Peers(36-45)pkg/apicheck/failure_tracker.go (2)
FailureTracker(9-14)
NewFailureTracker(17-19)pkg/controlplane/manager.go (7)
Manager(39-46)
EvaluationOutcome(28-28)
EvaluationIsolation(34-34)
EvaluationAwaitQuorum(35-35)
EvaluationHealthy(31-31)
EvaluationGlobalOutage(33-33)
EvaluationRemediate(32-32)controllers/tests/shared/shared.go (1)
PeersOverrideFunc(76-76)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peerhealth/client.go (1)
NewClient(19-44)pkg/utils/peers.go (1)
GetNextBatchSize(9-31)
controllers/tests/shared/shared.go (4)
pkg/apicheck/check.go (4)
ApiConnectivityCheck(38-51)
PeersOverrideFunc(36-36)
New(76-87)
ApiConnectivityCheckConfig(55-74)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peers/peers.go (2)
Role(26-26)
New(47-58)pkg/controlplane/manager.go (1)
Manager(39-46)
🔇 Additional comments (31)
controllers/tests/controller/control_plane_isolation_testdata.md (1)
1-12: Helpful scenario documentation. Clear articulation of legacy vs redesigned expectations. LGTM.
controllers/tests/shared/shared.go (11)
6-28: LGTM: Import additions support new test infrastructure. The new imports are appropriate for the added functionality: random IP generation, concurrency primitives, custom Gomega matchers, and integration with apicheck/controlplane/peers packages.
31-50: LGTM: Test constants are well-organized. The constants provide clear defaults for test configuration and improve readability. The naming conventions are consistent.
58-59: LGTM: Test control field for random IP generation. The `ShouldReturnRandomPodIPs` field provides a clean way for tests to enable/disable random IP address generation behavior.
61-71: LGTM: Wrapper design supports comprehensive test scenarios. The `ApiConnectivityCheckWrapper` provides excellent test infrastructure with thread-safe response simulation, peer override capabilities, and baseline snapshot management for stateful tests.
73-76: LGTM: Well-documented type for test peer injection. The `PeersOverrideFunc` type and its documentation clearly explain the purpose of allowing tests to supply synthetic peer address lists.
120-138: LGTM: Constructor properly wires simulation logic. The constructor correctly initializes the wrapper and sets up the health check function override with proper delegation between simulated and default behavior.
156-215: LGTM: Thread-safe response management with snapshot capabilities. The response management methods provide comprehensive test state control with proper mutex protection. The snapshot and baseline patterns are particularly useful for complex test scenarios.
217-242: LGTM: Peer override methods with proper wrapping. The peer override methods correctly handle the test-to-production type boundary with proper wrapping and thread-safe access.
244-270: LGTM: Thread-safe timestamp recording for test verification. The peer response recording methods provide proper thread-safe access to timestamps, useful for verifying timing behavior in tests.
272-274: LGTM: Proper delegation to wrapped instance. The `ResetPeerTimers` method correctly delegates to the underlying `ApiConnectivityCheck` instance.
310-349: LGTM: Custom Gomega matchers for K8s error testing. The K8s error testing helpers provide useful custom matchers with proper Gomega integration. The implementation correctly handles nil errors and provides flexible error matching capabilities.
pkg/apicheck/check.go (19)
36-53: LGTM: Enhanced types support testability and state tracking. The new types and struct fields provide excellent support for dependency injection (GetHealthStatusFromRemoteFunc, PeersOverrideFunc) and comprehensive peer state tracking (timing fields, FailureTracker). This design improves both testability and production functionality.
72-73: LGTM: Configuration fields for timeout control. The new `FailureWindow` and `PeerQuorumTimeout` fields provide fine-grained control over timeout behavior in the health evaluation logic.
76-87: LGTM: Constructor properly initializes new components. The constructor correctly initializes the failure tracker and sets the default health status function, establishing a clean default configuration.
134-146: LGTM: Clean accessor methods with proper synchronization. The control plane manager and peers override methods provide straightforward access with appropriate mutex protection for the override.
186-189: LGTM: Proper failure tracker reset with defensive nil check. The code correctly resets the failure tracker after a successful API call, with appropriate nil checking.
199-224: LGTM: Well-structured health evaluation with role separation. The refactored `isConsideredHealthy` method cleanly separates worker and control-plane evaluation paths with proper failure tracking and delegation to the control plane manager.
226-251: LGTM: Peer summary abstraction improves readability. The `peerSummary` type and helper methods provide a clean abstraction for aggregating peer response data with clear logic for majority calculations.
253-293: LGTM: Comprehensive worker evaluation logic. The `evaluateWorker` method handles all major scenarios with clear logic: isolation, health, escalation to control planes, global outage detection, and quorum awaiting. The timeout handling is appropriate.
295-343: LGTM: Control-plane evaluation with proper prioritization. The `evaluateControlPlane` method correctly prioritizes worker feedback for global outage detection and escalation scenarios, then evaluates control-plane peers with appropriate fallback to worker responses.
345-372: LGTM: Consistent escalation logic. The `escalateToControlPlanes` method provides consistent escalation logic with the same structure as the evaluation methods, ensuring predictable behavior.
374-419: LGTM: Well-structured peer response gathering with batching. The `gatherPeerResponses` method efficiently gathers responses using batching, properly records timing data, and provides useful logging for debugging. The implementation is clear and correct.
421-441: LGTM: Safe peer listing with override support. The `listPeers` method properly handles overrides and makes defensive copies to prevent mutation issues. The nil checks are appropriate.
443-463: LGTM: Clean peer timing state management. The peer activity and silence recording methods provide clean role-based tracking with proper zero-time checking for silence start timestamps.
465-470: LGTM: Complete timer reset utility. The `ResetPeerTimers` method correctly resets all peer timing fields, useful for test state management.
472-495: LGTM: Clear outcome interpretation helpers. The outcome helper methods provide clear interpretation of evaluation outcomes with consistent logic: Remediate and Isolation are unhealthy, all others are healthy.
497-525: LGTM: Defensive calculation of effective configuration values. The effective value methods provide proper fallback logic and defensive initialization, ensuring sensible defaults when configuration values are not explicitly set.
527-545: LGTM: Correct peer timeout detection. The `peerTimeoutExceeded` method correctly checks role-specific silence timestamps against the configured timeout with proper zero-time handling.
605-617: LGTM: Clean dependency injection for health checks. The health status function methods provide clean dependency injection with a proper getter/setter/delegation pattern, enabling comprehensive testing.
578-579: LGTM: Useful logging for peer health queries. The added log statement provides helpful context when querying peer health status, aiding in debugging.
| By(fmt.Sprintf("Clear node state for '%s'", shared.UnhealthyNodeName), func() { | ||
| Expect(k8sClient.Update(context.Background(), getNode(shared.UnhealthyNodeName))) | ||
| }) | ||
|  | ||
| By(fmt.Sprintf("Clear node state for '%s'", shared.PeerNodeName), func() { | ||
| Expect(k8sClient.Update(context.Background(), getNode(shared.PeerNodeName))) | ||
| }) | ||
|  | ||
| time.Sleep(time.Second * 2) | 
Missing assertion and likely invalid Update (resourceVersion)
- Expect(...) is not followed by .To(Succeed()), so failures aren’t asserted.
- Updating a freshly constructed object (getNode(...)) typically lacks resourceVersion; Update will fail. Fetch the node and patch/merge instead.
Example:
-Expect(k8sClient.Update(context.Background(), getNode(shared.UnhealthyNodeName)))
+node := &v1.Node{}
+Expect(k8sClient.Get(context.Background(), unhealthyNodeNamespacedName, node)).To(Succeed())
+clean := getNode(shared.UnhealthyNodeName)
+patch := client.MergeFrom(node.DeepCopy())
+node.Spec = clean.Spec
+node.Annotations = clean.Annotations
+node.Labels = clean.Labels
+Expect(k8sClient.Patch(context.Background(), node, patch)).To(Succeed())
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| By(fmt.Sprintf("Clear node state for '%s'", shared.UnhealthyNodeName), func() { | |
| Expect(k8sClient.Update(context.Background(), getNode(shared.UnhealthyNodeName))) | |
| }) | |
| By(fmt.Sprintf("Clear node state for '%s'", shared.PeerNodeName), func() { | |
| Expect(k8sClient.Update(context.Background(), getNode(shared.PeerNodeName))) | |
| }) | |
| time.Sleep(time.Second * 2) | |
| By(fmt.Sprintf("Clear node state for '%s'", shared.UnhealthyNodeName), func() { | |
| // Fetch the live object so we have a resourceVersion | |
| node := &v1.Node{} | |
| Expect(k8sClient.Get(context.Background(), unhealthyNodeNamespacedName, node)).To(Succeed()) | |
| // Prepare a clean copy of the desired state | |
| clean := getNode(shared.UnhealthyNodeName) | |
| patch := client.MergeFrom(node.DeepCopy()) | |
| node.Spec = clean.Spec | |
| node.Annotations = clean.Annotations | |
| node.Labels = clean.Labels | |
| // Apply the patch and assert success | |
| Expect(k8sClient.Patch(context.Background(), node, patch)).To(Succeed()) | |
| }) | |
| By(fmt.Sprintf("Clear node state for '%s'", shared.PeerNodeName), func() { | |
| Expect(k8sClient.Update(context.Background(), getNode(shared.PeerNodeName))).To(Succeed()) | |
| }) | |
| time.Sleep(time.Second * 2) | 
🤖 Prompt for AI Agents
In controllers/tests/controller/selfnoderemediation_controller_test.go around
lines 63 to 71, the test currently calls
Expect(k8sClient.Update(context.Background(), getNode(...))) without asserting
the result and tries to Update a freshly constructed node object (missing
resourceVersion) which will fail; change the Expect calls to
Expect(...).To(Succeed()) and replace the blind Update of a new object with a
proper fetch-and-modify or a Patch: first Get the existing node from the API
(populate resourceVersion), mutate the fields you need, then call Update and
assert .To(Succeed()), or create a MergeFrom patch by fetching the original
object, applying changes to a copy, and call client.Patch(..., patch) with
Expect(...).To(Succeed()).
Actionable comments posted: 0
♻️ Duplicate comments (2)
controllers/tests/controller/selfnoderemediation_controller_test.go (1)
972-981: Fix error shadowing and wrap original error. Inner err shadows the named return; callers may receive nil on failure. Use a separate variable and %w.
-	By(fmt.Sprintf("Attempting to get pod '%s' before deleting it", podName), func() {
-		if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil {
-			if apierrors.IsNotFound(err) && !throwErrorIfNotFound {
-				logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName)
-				err = nil
-				return
-			}
-
-			err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err)
-			return
-		}
-	})
+	By(fmt.Sprintf("Attempting to get pod '%s' before deleting it", podName), func() {
+		getErr := k8sClient.Client.Get(context.Background(), podKey, pod)
+		if getErr != nil {
+			if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound {
+				logf.Log.Info("pod not found; skipping delete", "name", podName)
+				err = nil
+				return
+			}
+			err = fmt.Errorf("unable to get pod with name %q in order to delete it: %w", podName, getErr)
+			return
+		}
+	})
controllers/tests/shared/shared.go (1)
150-164: Bug: last simulated response never dequeued (repeats same code). When length is 1, the slice isn't advanced; subsequent calls replay the same response and can mask "no more responses" cases. Fix by always removing index 0.
 func (ckw *ApiConnectivityCheckWrapper) nextSimulatedPeerResponse() selfNodeRemediation.HealthCheckResponseCode {
 	ckw.responsesMu.Lock()
 	defer ckw.responsesMu.Unlock()
 	if len(ckw.simulatedPeerResponses) == 0 {
 		return selfNodeRemediation.RequestFailed
 	}
 	code := ckw.simulatedPeerResponses[0]
-	if len(ckw.simulatedPeerResponses) > 1 {
-		ckw.simulatedPeerResponses = append([]selfNodeRemediation.HealthCheckResponseCode{}, ckw.simulatedPeerResponses[1:]...)
-	}
+	// Always consume the head element
+	ckw.simulatedPeerResponses = ckw.simulatedPeerResponses[1:]
 	return code
 }
🧹 Nitpick comments (6)
controllers/tests/shared/shared.go (2)
114-125: Avoid panic in test helper; return deterministic fallback on RNG error. Panic here can flake tests. Prefer a safe fallback.
 func GetRandomIpAddress() (randomIP string) {
 @@
-	if _, err := rand.Read(bytes[1:]); err != nil {
-		panic(err)
-	}
+	if _, err := rand.Read(bytes[1:]); err != nil {
+		logf.Log.Error(err, "random IPv6 generation failed; using fallback fd00::1")
+		return "fd00::1"
+	}
 	randomIP = net.IP(bytes).String()
 	return
 }
47-49: Unused constant. K8sClientReturnRandomPodIPAddressesByDefault is never referenced. Remove to reduce noise.
-	K8sClientReturnRandomPodIPAddressesByDefault = false
pkg/controlplane/manager.go (1)
169-195: Treat non-2xx/3xx kubelet responses as unhealthy. HTTP success should depend on the status code, not only on network success.
 resp, err := httpClient.Do(req)
 if err != nil {
 	manager.log.Error(err, "kubelet service is down", "node name", manager.nodeName)
 	return false
 }
 defer resp.Body.Close()
-return true
+if resp.StatusCode >= 200 && resp.StatusCode < 400 {
+	return true
+}
+manager.log.Info("kubelet responded with non-success status",
+	"node name", manager.nodeName, "status", resp.StatusCode)
+return false
controllers/tests/controller/selfnoderemediation_controller_test.go (3)
585-587: Remove misleading comment. “This is expected to fail” contradicts the assertion and confuses future readers. Drop it.
-	// It's expected that the next line will fail, even though it shouldn't!
 	verifyWatchdogTriggered()
886-889: Avoid fixed sleep; rely on Eventually already below. The explicit 1s sleep is unnecessary and can slow tests; the subsequent Eventually handles readiness.
-	time.Sleep(1 * time.Second)
-
 	verifySelfNodeRemediationPodByExistsByName(podName)
1255-1261: Replace config-change sleep with an assertion. Sleeping 1s after setting MinPeersForRemediation is brittle. Assert the value instead.
-	time.Sleep(1 * time.Second)
+	Eventually(func() int {
+		return apiConnectivityCheckConfig.MinPeersForRemediation
+	}, 3*time.Second, 100*time.Millisecond).Should(Equal(minimumNumberOfPeers))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- pkg/controlplane/manager.go(4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
PR: medik8s/self-node-remediation#262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.
Applied to files:
- controllers/tests/shared/shared.go
🧬 Code graph analysis (3)
controllers/tests/shared/shared.go (5)
pkg/apicheck/check.go (4)
ApiConnectivityCheck(38-51)
PeersOverrideFunc(36-36)
New(76-87)
ApiConnectivityCheckConfig(55-74)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peers/peers.go (2)
Role(26-26)
New(47-58)pkg/controlplane/manager.go (1)
Manager(39-46)vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
MakeMatcher(87-128)
controllers/tests/controller/selfnoderemediation_controller_test.go (7)
api/v1alpha1/selfnoderemediation_types.go (4)
RemediationStrategyType(44-44)
SelfNodeRemediation(92-98)
AutomaticRemediationStrategy(24-24)
ResourceDeletionRemediationStrategy(25-25)controllers/tests/shared/shared.go (11)
UnhealthyNodeName(37-37)
Namespace(36-36)
GenerateTestConfig(286-298)
PeerNodeName(38-38)
Peer2NodeName(39-39)
SnrPodName2(43-43)
Peer3NodeName(40-40)
SnrPodName3(44-44)
IsK8sNotFoundError(343-359)
SnrPodName1(42-42)
PeersOverrideFunc(76-76)pkg/peers/peers.go (3)
Role(26-26)
ControlPlane(30-30)
Peers(36-45)api/HealthCheckResponse.go (2)
Unhealthy(8-8)
HealthCheckResponseCode(3-3)vendor/github.com/medik8s/common/pkg/labels/labels.go (1)
MasterRole(7-7)pkg/controlplane/manager.go (1)
NewManager(49-57)pkg/apicheck/check.go (1)
PeersOverrideFunc(36-36)
pkg/controlplane/manager.go (1)
pkg/peers/peers.go (1)
ControlPlane(30-30)
🔇 Additional comments (4)
controllers/tests/shared/shared.go (3)
104-110: LGTM: PodIP and PodIPs kept in sync. Mirrors the first IP into both fields; avoids consumers seeing an empty PodIP.
135-145: Single response per peer — good change. Wrapper enforces one send per peer; prevents bounded-channel deadlocks.
322-341: Custom Gomega matcher for ignored k8s errors — looks good. Clear, composable, and nil-safe.
pkg/controlplane/manager.go (1)
28-36: API migration verified: all call sites use the new EvaluationOutcome signature. Verification confirms no stale two-argument invocations remain. Three call sites were found (two in manager_internal_test.go, one in apicheck/check.go), all correctly using the new single-argument signature. The function implementation handles all five evaluation outcomes with a proper default case for unknown values.
Also, it DID NOT like working with a PowerShell wrapper.
| PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. | 
Actionable comments posted: 3
♻️ Duplicate comments (1)
controllers/tests/shared/shared.go (1)
102-107: Legacy PodIP now mirrored — LGTM. Setting both Status.PodIP and Status.PodIPs fixes earlier compatibility issues. Good.
🧹 Nitpick comments (4)
controllers/tests/shared/shared.go (2)
149-161: Avoid extra allocation when popping the head. Use slice re-slicing instead of append-copy to trim the consumed element.
 func (ckw *ApiConnectivityCheckWrapper) nextSimulatedPeerResponse() selfNodeRemediation.HealthCheckResponseCode {
 	ckw.responsesMu.Lock()
 	defer ckw.responsesMu.Unlock()
 	if len(ckw.simulatedPeerResponses) == 0 {
 		return selfNodeRemediation.RequestFailed
 	}
 	code := ckw.simulatedPeerResponses[0]
-	ckw.simulatedPeerResponses = append([]selfNodeRemediation.HealthCheckResponseCode{}, ckw.simulatedPeerResponses[1:]...)
+	ckw.simulatedPeerResponses = ckw.simulatedPeerResponses[1:]
 	return code
 }
76-89: Sharper failure simulation and safer defaults
- If SimulatedFailureMessage is empty, return a sensible default to aid debugging.
- Keep behavior as-is; randomization already gated to PodList. Optional: restrict randomization to SNR pods via label selector to reduce side effects.
 case kcw.ShouldSimulatePodDeleteFailure:
 	if _, ok := list.(*corev1.NamespaceList); ok {
-		err = errors.New(kcw.SimulatedFailureMessage)
+		msg := kcw.SimulatedFailureMessage
+		if msg == "" {
+			msg = "simulated API error (namespace list)"
+		}
+		err = errors.New(msg)
 		return
 	}
 	fallthrough
Also applies to: 91-100
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
1195-1212: Stop the control-plane manager on cleanup. Start() may spawn background goroutines; add a cancellable context or Stop if available to avoid leaks between specs.
 func addControlPlaneManager() {
 	By("Add a control plane manager", func() {
-		controlPlaneMgr := controlplane.NewManager(shared.UnhealthyNodeName, k8sClient)
-		Expect(controlPlaneMgr.Start(context.Background())).To(Succeed(), "we should"+
+		controlPlaneMgr := controlplane.NewManager(shared.UnhealthyNodeName, k8sClient)
+		ctx, cancel := context.WithCancel(context.Background())
+		Expect(controlPlaneMgr.Start(ctx)).To(Succeed(), "we should"+
 			"have been able to enable a control plane manager for the current node")
 		apiCheck.SetControlPlaneManager(controlPlaneMgr)
 		Expect(apiConnectivityCheckConfig.Peers.UpdateControlPlanePeers(context.Background())).To(Succeed())
 		DeferCleanup(func() {
 			By("Removing the control plane manager", func() {
+				cancel() // stop background workers
 				apiCheck.SetControlPlaneManager(nil)
 				Expect(apiConnectivityCheckConfig.Peers.UpdateControlPlanePeers(context.Background())).To(Succeed())
 			})
 		})
 	})
 }
936-962: Use Delete option helper directly. Nit: you can pass the option inline without the temporary var for clarity.
-	var grace client.GracePeriodSeconds = 0
-	err := k8sClient.Client.Delete(context.Background(), pod, grace)
+	err := k8sClient.Client.Delete(context.Background(), pod, client.GracePeriodSeconds(0))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- pkg/controlplane/manager.go(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/controlplane/manager.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
PR: medik8s/self-node-remediation#262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.
Applied to files:
- controllers/tests/shared/shared.go
🧬 Code graph analysis (2)
controllers/tests/controller/selfnoderemediation_controller_test.go (7)
api/v1alpha1/selfnoderemediation_types.go (4)
RemediationStrategyType(44-44)
SelfNodeRemediation(92-98)
AutomaticRemediationStrategy(24-24)
ResourceDeletionRemediationStrategy(25-25)controllers/tests/shared/shared.go (11)
UnhealthyNodeName(37-37)
Namespace(36-36)
GenerateTestConfig(283-295)
PeerNodeName(38-38)
Peer2NodeName(39-39)
SnrPodName2(43-43)
Peer3NodeName(40-40)
SnrPodName3(44-44)
IsK8sNotFoundError(320-336)
SnrPodName1(42-42)
PeersOverrideFunc(74-74)pkg/peers/peers.go (3)
Role(26-26)
ControlPlane(30-30)
Peers(36-45)api/HealthCheckResponse.go (1)
Unhealthy(8-8)vendor/github.com/medik8s/common/pkg/labels/labels.go (1)
MasterRole(7-7)pkg/controlplane/manager.go (1)
NewManager(49-57)pkg/apicheck/check.go (1)
PeersOverrideFunc(36-36)
controllers/tests/shared/shared.go (5)
pkg/apicheck/check.go (4)
ApiConnectivityCheck(38-51)
PeersOverrideFunc(36-36)
New(76-87)
ApiConnectivityCheckConfig(55-74)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peers/peers.go (2)
Role(26-26)
New(47-58)pkg/controlplane/manager.go (1)
Manager(39-46)vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
MakeMatcher(87-128)
🔇 Additional comments (4)
controllers/tests/shared/shared.go (2)
112-124: Valid ULA IPv6 generation — LGTM. Using fd00::/8 with crypto/rand is correct; reasonable fallback on error.
319-336: Custom matcher for IsNotFound — LGTM. Matches error semantics cleanly using gcustom.MakeMatcher.
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
543-583: Order-of-operations bug: responses appended before enabling simulation get wiped. In the “two control node peers found” setup you append responses in addNodes, then later enable simulation and reset timers, which clears baselines. Move configureSimulatedPeerResponses(true) before addNodes(...) or stop clearing on enable (see prior comment).
Also applies to: 1394-1410
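A sketch of the first option (reordering the setup), reusing the helpers already defined in this test file; the node arguments shown are illustrative and may not match addNodes' actual signature.

BeforeEach(func() {
	// Enable simulation first so responses staged while nodes are added
	// are not wiped by the enable/reset step.
	configureSimulatedPeerResponses(true)
	addNodes(shared.PeerNodeName, shared.Peer2NodeName) // stages per-peer responses
})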
1065-1101: Strict “exactly two nodes” post-condition can flake if cleanup fails

If a prior test leaves an extra peer due to a transient API error, this hard check will fail unrelated specs. Consider filtering by known names or asserting a superset condition.
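For example, the assertion could be relaxed to check only for the known nodes. A sketch, assuming a controller-runtime client and corev1 "k8s.io/api/core/v1"; variable names are illustrative.

nodeList := &corev1.NodeList{}
Expect(k8sClient.Client.List(context.Background(), nodeList)).To(Succeed())

names := make([]string, 0, len(nodeList.Items))
for _, node := range nodeList.Items {
	names = append(names, node.Name)
}
// Require the known nodes to be present rather than demanding an exact count.
Expect(names).To(ContainElements(shared.UnhealthyNodeName, shared.PeerNodeName))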
| It("check that we actually get a triggered watchdog reboot", func() { | ||
| // It's expected that the next line will fail, even though it shouldn't! | ||
| verifyWatchdogTriggered() | ||
| }) | 
Remove “expected to fail” assertion or fix the expectation
Tests must be deterministic. Either make verifyWatchdogTriggered() reliably true for this setup or flip to verifyWatchdogNotTriggered() with matching setup. Leaving a known-failing assertion will break CI.
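If option (B) reflects the intended behavior, the spec could simply become the following (a sketch, assuming the surrounding setup keeps the node healthy and that verifyWatchdogNotTriggered exists as mentioned above):

It("check that the watchdog is not triggered", func() {
	verifyWatchdogNotTriggered()
})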
🤖 Prompt for AI Agents
In controllers/tests/controller/selfnoderemediation_controller_test.go around lines 584 to 587, the test currently calls verifyWatchdogTriggered() but the comment indicates it is expected to fail; make the test deterministic by either (A) ensuring the test setup actually triggers the watchdog before calling verifyWatchdogTriggered() — e.g., configure the fake/mock watchdog state, timeouts or events so the trigger condition is met — or (B) if the intended behavior is that the watchdog should not be triggered in this scenario, replace the call with verifyWatchdogNotTriggered() and adjust any setup/fixtures to match that expectation; also remove/update the misleading comment so CI no longer contains a known-failing assertion.
func configureSimulatedPeerResponses(simulateResponses bool) {
	By("Start simulating peer responses", func() {
		orgValue := apiCheck.ShouldSimulatePeerResponses
		orgResponses := apiCheck.SnapshotSimulatedPeerResponses()
		apiCheck.ShouldSimulatePeerResponses = simulateResponses
		if simulateResponses {
			apiCheck.ClearSimulatedPeerResponses()
		}

		DeferCleanup(func() {
			apiCheck.ShouldSimulatePeerResponses = orgValue
			apiCheck.RestoreSimulatedPeerResponses(orgResponses)
			apiCheck.RememberSimulatedPeerResponses()
			if !simulateResponses {
				apiCheck.SetPeersOverride(nil)
			}
		})
	})
}
🛠️ Refactor suggestion | 🟠 Major
Simulated response lifecycle is inverted; do not clear on enable and don’t couple override cleanup here
- Enabling simulation currently clears pre-staged responses, breaking scenarios that append before toggling.
- Cleanup clears PeersOverride only when simulateResponses is false — backwards.
 func configureSimulatedPeerResponses(simulateResponses bool) {
   By("Start simulating peer responses", func() {
     orgValue := apiCheck.ShouldSimulatePeerResponses
     orgResponses := apiCheck.SnapshotSimulatedPeerResponses()
     apiCheck.ShouldSimulatePeerResponses = simulateResponses
-    if simulateResponses {
-      apiCheck.ClearSimulatedPeerResponses()
-    }
+    // Preserve any previously staged responses when enabling simulation.
+    // Callers can ClearSimulatedPeerResponses() explicitly if needed.
     DeferCleanup(func() {
       apiCheck.ShouldSimulatePeerResponses = orgValue
       apiCheck.RestoreSimulatedPeerResponses(orgResponses)
-      apiCheck.RememberSimulatedPeerResponses()
-      if !simulateResponses {
-        apiCheck.SetPeersOverride(nil)
-      }
+      // Leave PeersOverride restoration to configurePeersOverride’s own cleanup.
     })
   })
 }Also applies to: 1305-1321
🤖 Prompt for AI Agents
controllers/tests/controller/selfnoderemediation_controller_test.go around lines 1279-1297 (and apply the same change to 1305-1321): the code currently clears pre-staged simulated responses when enabling simulation and only clears PeersOverride on cleanup when simulateResponses is false — both inverted. Change the toggle logic so that enabling simulation (simulateResponses == true) does NOT call ClearSimulatedPeerResponses (preserve pre-staged responses) and disabling simulation (simulateResponses == false) clears simulated responses as needed; also invert the DeferCleanup condition so SetPeersOverride(nil) runs when we previously enabled simulation (simulateResponses == true) rather than when it was false. Ensure the same fixes are applied to the other block at 1305-1321.
func configurePeersOverride(fn shared.PeersOverrideFunc) {
	By("Configure peer override", func() {
		apiCheck.SetPeersOverride(fn)
	})
}
Peers override leaks across tests — add restoration in cleanup
configurePeersOverride sets an override but never restores it, polluting later specs.
 func configurePeersOverride(fn shared.PeersOverrideFunc) {
   By("Configure peer override", func() {
-    apiCheck.SetPeersOverride(fn)
+    prev := apiCheck.PeersOverride()
+    apiCheck.SetPeersOverride(fn)
+    DeferCleanup(func() {
+      apiCheck.SetPeersOverride(prev)
+    })
   })
 }

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
func configurePeersOverride(fn shared.PeersOverrideFunc) {
	By("Configure peer override", func() {
		prev := apiCheck.PeersOverride()
		apiCheck.SetPeersOverride(fn)
		DeferCleanup(func() {
			apiCheck.SetPeersOverride(prev)
		})
	})
}
🤖 Prompt for AI Agents
In controllers/tests/controller/selfnoderemediation_controller_test.go around lines 1299-1303, configurePeersOverride currently sets a global override without restoring it, leaking state across specs; change it to capture the current override before calling apiCheck.SetPeersOverride(fn) and register a cleanup to restore the previous value (using Ginkgo's DeferCleanup or an AfterEach) so the original override is reinstated after the spec finishes.
Why we need this PR
A unit test that demonstrates the problem described in issue #251, along with a potential fix.

Please feel free to disregard as much or as little of this as you want; hopefully the unit test refactoring is seen as useful, and I believe it sufficiently highlights the core issue. I separated the work into multiple commits so you can look at the branch history and see the individual updates prior to the squash commit that will happen to main. The intent is to make the delta of the actual code fix visible on its own; it is relatively small and could be pulled in separately in the worst case.

Note: I created an official account for my Dell contributions; I'm still the same Mark who originally opened issue #251.
Changes made
Which issue(s) this PR fixes
Fixes #251 & RHWA-384
Test plan
Summary by CodeRabbit
New Features
Tests
Bug Fixes
Refactor
Style
Chores