[WIP] Fix issue 251 peer control nodes #255
Conversation
Hi @mark-scott-jr-dell. Thanks for your PR. I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with the appropriate command. Once the patch is verified, the new status will be reflected by the corresponding label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Sorry for the late first review, we are pretty busy...
        
          
pkg/apicheck/check.go (Outdated)
          
        
//canOtherControlPlanesBeReached := c.canOtherControlPlanesBeReached()
peersResponse = c.getPeersResponse(peers.ControlPlane)

// MES: This does not appear to have any actual relevance. To me, it appears that all the necessary
IsControlPlaneHealthy() not being relevant is a bold statement ;)
However, before going on with a more detailed review, I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
cc @mshitrit
I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
+1
I think this change significantly changes current logic.
A couple of things I've noticed:
- In the new code for CP nodes we completely ignore feedback from worker nodes. For most use cases worker nodes can accurately report the status of the CP nodes, and even though I expect the CP peers to report the same, I'm not sure that ignoring the worker peers would be the best option.
- The diagnostic logic (i.e. isDiagnosticsPassed()) is removed, which means the node can be falsely considered healthy in some use cases.
IsControlPlaneHealthy() not being relevant is a bold statement ;)
Haha, to be fair I did say "does not APPEAR to have any actual relevance", based on my observations. I definitely left room in there for me to be wrong 😂.
However, before going on with a more detailed review, I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
This would help a lot. My code changes were based on how I understood the expected flow to go, which I interpreted from the intent I saw in the code. My goal was not to change too much and to keep today's behaviors the same, since I personally don't know all the intentions, nor did I find them documented in detail anywhere (correct me if I'm wrong, to be sure!).
I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
+1
I think this change significantly changes the current logic. A couple of things I've noticed:
- In the new code for CP nodes we completely ignore feedback from worker nodes. For most use cases worker nodes can accurately report the status of the CP nodes, and even though I expect the CP peers to report the same, I'm not sure that ignoring the worker peers would be the best option.
- The diagnostic logic (i.e. isDiagnosticsPassed()) is removed, which means the node can be falsely considered healthy in some use cases.
I'll look back at this later today to respond, especially w.r.t. isDiagnosticsPassed, but I did spend some time walking through the flows and found multiple checks that basically referenced the same data multiple times, so I was attempting to simplify to make it clear what the code was doing. I felt it was visually unclear what was actually going on.
Ultimately, if we use the updated unit test just to prove out the core issue, I'm still good. Our goal was to prove the issue so that it could be fixed, since it's a pain to get logs in that case due to the nature of the cluster status at that point. I personally picked the stretch goal of creating the unit test, which should be better for the long term (in theory).
@mshitrit Not sure if you've had a chance to think about the logic flows here, but, if I'm reading things correctly, an update like this?
- Attempt to get control plane responses
- Attempt to get worker responses
- Some combination of these should say that the node is healthy. Also add in isDiagnosticsPassed()
I'm willing to implement it and push a new PR, just want to be sure that I use the flow that you have in mind.
We are looking to pull the latest build whenever this is merged to main, and get some needed CVE fixes and other things, so I'd love to drive this to a close ASAP. I thought I had posted this message a week and a half ago but I guess it went into the ether.
Hi, I think that's pretty close. Writing down both the current flow and what I understand to be the desired flow:

Current flow:
- Attempt to get worker responses
- If a worker, return that response
- else (assuming it's a control plane):
  - get a control plane response
  - some combination of these should say that the node is healthy; also add in isDiagnosticsPassed()

IIUC the fix is aiming for this flow (a rough sketch in code follows below):
- Attempt to get control plane responses
- Attempt to get worker responses
- If a worker: some combination of these should say that the node is healthy (the CP response is only relevant for some use cases of a worker node healthy response, otherwise it can be ignored)
- else: some combination of these should say that the node is healthy; also add in isDiagnosticsPassed()
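To make that ordering concrete, here is a minimal, self-contained sketch of the proposed decision logic. It is illustrative only: the type names, the peer-response map, and the exact way the worker and control plane responses are combined are assumptions made for the sketch, not code from this PR.

```go
package main

import "fmt"

// Minimal stand-ins for the real types; illustrative only.
type Role int

const (
	Worker Role = iota
	ControlPlane
)

type PeersResponse struct {
	IsHealthy bool
}

type node struct {
	isControlPlane    bool
	peerResponse      map[Role]PeersResponse
	diagnosticsPassed bool
}

// isConsideredHealthy sketches the ordering above: a worker trusts the worker
// peers' verdict, while a control plane node also consults the control plane
// peers and its local diagnostics before declaring itself healthy.
func (n *node) isConsideredHealthy() bool {
	workerResp := n.peerResponse[Worker]
	if !n.isControlPlane {
		return workerResp.IsHealthy
	}
	cpResp := n.peerResponse[ControlPlane]
	peersThinkHealthy := workerResp.IsHealthy || cpResp.IsHealthy
	return peersThinkHealthy && n.diagnosticsPassed
}

func main() {
	cp := &node{
		isControlPlane: true,
		peerResponse: map[Role]PeersResponse{
			Worker:       {IsHealthy: false},
			ControlPlane: {IsHealthy: false},
		},
		diagnosticsPassed: true,
	}
	fmt.Println("control plane considered healthy:", cp.isConsideredHealthy())
}
```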
@mshitrit I pushed a set of changes to IsConsideredHealthy which I believe matches what you have described as the desired flow. Let me know your thoughts.
        
          
pkg/apicheck/check.go (Outdated)
          
        
}

func (c *ApiConnectivityCheck) getWorkerPeersResponse() peers.Response {
func (c *ApiConnectivityCheck) getPeersResponse(role peers.Role) peers.Response {
I like this refactoring 👍
eb2c8c8 to 6397c3e (Compare)
            
          
pkg/apicheck/check.go (Outdated)
          
        
controlPlanePeersResponse := c.getPeersResponse(peers.ControlPlane)

c.config.Log.Info("isConsideredHealthy: since peers think I'm unhealthy, double checking "+
	"by returning what the control plane nodes think of my state",
	"controlPlanePeersResponse.IsHealthy", controlPlanePeersResponse.IsHealthy)
return controlPlanePeersResponse.IsHealthy
IIUC basically this means that for a worker node the CP peers response will override the Worker peers response (unless worker peers response is healthy).
It completely ignores why the worker response was unhealthy.
TBH I'm having a hard time thinking of an example where this would not work as expected, but ignoring the reason still feels dangerous.
@slintes maybe you have a stronger opinion either way ?
@mshitrit I hear you. I still don't necessarily have a full picture of the fully desired behavior, so I'm trying to translate what I'm hearing from you and what I've seen overall.
I will definitely change it to match what the medik8s team thinks is appropriate. My main goal was to prove the issue with the unit test and to come in with a proposed solution rather than just toss the problem in your collective laps. The intention is not to change intended behaviors, especially since that could have impacts on existing installations.
We are still discussing details on the PR without knowing what we're aiming at.
Again, can we write down the expected flow first?
My apologies, @slintes. I'm honestly looking for y'all's feedback on what the expected flow is. I identified a problem case wherein a control plane node can go down; I didn't fully know what the expected flow was supposed to be, but I did know that in this case it was wrong.
So, the unit test proves the issue, which I believe could affect others utilizing SNR in different ways. I tried to provide a solution, but I'm totally OK if it's not the right one.
If you are asking me what the expected flow is, I can sit down and draft what I think it should be for the entire system, for CP nodes as well as worker nodes, and I'm happy to do that, but I was hoping to start from a baseline of what the medik8s team believed the workflow was theoretically supposed to be.
So, if this reply is directed at me, I'd ask that you be more specific: are you asking me to write down the expected flow for the entire system? (Which is what I seem to have possibly gotten incorrect in my proposed solution.)
We are still discussing details on the PR without knowing what we're aiming at. Again, can we write down the expected flow first?
If so, I'm up for taking a stab at it, but I don't have the background on why existing decisions were made.
I identified a problem case
the unit test proves the issue
And that's great, thank you 👍🏼
But I think the code changes significantly change the existing flow of things, which deserves some general discussion of what the expected flow should be.
If you are asking me what the expected flow is
No no, it was a general ask, sorry if it sounded inappropriate. I would do it myself if I had more time for this...
Our docs contain some diagrams. As a first step we can verify whether they are up to date and aligned with the current code, and then whether they still make sense. Is the issue even visible there?
https://www.medik8s.io/remediation/self-node-remediation/how-it-works/
I identified a problem case
the unit test proves the issue
And that's great, thank you 👍🏼 But I think the code changes significantly change the existing flow of things, which deserves some general discussion of what the expected flow should be.
If you are asking me what the expected flow is
No no, it was a general ask, sorry if it sounded inappropriate. I would do it myself if I had more time for this...
Ok, understood.
Our docs contain some diagrams. As a first step we can verify whether they are up to date and aligned with the current code, and then whether they still make sense. Is the issue even visible there?
https://www.medik8s.io/remediation/self-node-remediation/how-it-works/
It had been so long since our initial implementation that I forgot these diagrams exist. Let me review them today and see if I can offer a proposal and update the diagrams, given your time constraints. Perhaps I can shorten the time required.
Walkthrough
Injected role-aware remote health checks, failure tracking, and quorum/time-window configuration into API connectivity; refactored control-plane evaluation to outcome-driven logic; expanded peers and test harness (wrappers, simulated peers, random pod IPs); added FailureTracker and tests; plus logging and error-message improvements across controllers and utils.
Sequence Diagram(s)
sequenceDiagram
    autonumber
    participant SNR as SelfNodeRemediation
    participant ApiCheck as ApiConnectivityCheck
    participant Peers as Peers
    participant Control as ControlPlaneManager
    participant Remote as RemotePeer (gRPC / injected)
    Note over ApiCheck: FailureTracker, FailureWindow, PeerQuorumTimeout, injected remote health check
    SNR->>ApiCheck: Request aggregated health (node role)
    ApiCheck->>Peers: listPeers(role)
    Peers-->>ApiCheck: peer Pod IPs
    ApiCheck->>Remote: getHealthStatusFromPeer(endpointIp) (injected)
    Remote-->>ApiCheck: HealthCheckResponseCode
    ApiCheck->>ApiCheck: gatherPeerResponses → evaluate (quorum, failure tracker, escalation)
    ApiCheck-->>SNR: outcome (Healthy / Remediate / Isolation / AwaitQuorum)
    alt outcome requires control-plane validation
        SNR->>Control: IsControlPlaneHealthy(outcome)
        Control-->>SNR: boolean decision (logged)
    end
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 4
🧹 Nitpick comments (6)
pkg/apicheck/check.go (3)
209-213: Log message prints the wrong variable
isControlPlaneHealthy is computed a few lines above, but the log prints controlPlanePeersResponse.IsHealthy, duplicating earlier output and hiding the final decision.
- c.config.Log.Info("isConsideredHealthy: we have checkd the control plane peer responses and cross "+
-     "checked it against the control plane diagnostics ",
-     "isControlPlaneHealthy", controlPlanePeersResponse.IsHealthy)
+ c.config.Log.Info("isConsideredHealthy: evaluated peer responses & diagnostics",
+     "isControlPlaneHealthy", isControlPlaneHealthy)
223-225: Misleading log text claims "I consider myself a WORKER" irrespective of actual role
The hard-coded message talks about WORKER even when role == peers.ControlPlane.
Replace the literal with role.String() (or similar) to avoid confusion during incident triage.
386-388: Setter lacks concurrency protection
SetHealthStatusFunc may be called from tests while ApiConnectivityCheck is running in a goroutine.
If that ever happens, the write is unsynchronised with reads in getHealthStatusFromPeer, leading to a data race.
Wrap the field access with the existing mutex or document it as "write-once before Start()".
vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
87-91: Panic message could be clearer
The panic mentions "function that takes one argument and returns (bool, error)", but omits the possibility of a typed first parameter which you explicitly support.
Consider:
- panic("MakeMatcher must be passed a function that takes one argument and returns (bool, error)")
+ panic("MakeMatcher expects func(<any single param>) (bool, error)")
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
1044-1049: Deep-equality on Node.Status is brittle and frequently fails
reflect.DeepEqual on the whole Status block compares timestamps, resource versions, conditions, etc. that can legitimately change between the expected skeleton returned by getNode and the actual cluster object – even when the node is perfectly "equal" for test purposes.
This can introduce non-deterministic test failures.
Recommendation
• Compare only deterministic fields (e.g., labels, taints) or use a semantic helper such as equality.Semantic.DeepEqual with a well-scoped struct (see the sketch below).
• Alternatively, omit Status from the comparison altogether unless a specific field is being asserted.
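For illustration, a scoped comparison along those lines could look like the following sketch; the helper name and the chosen fields are assumptions, not the suite's actual code.

```go
package main

import (
	"fmt"
	"reflect"

	corev1 "k8s.io/api/core/v1"
)

// nodesMatchForTest compares only the deterministic parts of a Node that the
// test cares about, instead of deep-equality on the whole object.
func nodesMatchForTest(expected, actual *corev1.Node) bool {
	return reflect.DeepEqual(expected.Labels, actual.Labels) &&
		reflect.DeepEqual(expected.Spec.Taints, actual.Spec.Taints)
}

func main() {
	a := &corev1.Node{}
	a.Labels = map[string]string{"node-role.kubernetes.io/worker": ""}

	b := a.DeepCopy()
	// Status drift (conditions, timestamps, etc.) no longer breaks the comparison.
	b.Status.Conditions = []corev1.NodeCondition{{Type: corev1.NodeReady}}

	fmt.Println("considered equal for the test:", nodesMatchForTest(a, b))
}
```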
790-791: Unnecessary time.Sleep slows the suite
A fixed time.Sleep(1 * time.Second) immediately before an Eventually poll adds a full second to every invocation of createGenericSelfNodeRemediationPod.
Eventually already waits until the pod is observed; the explicit sleep can be safely removed to cut runtime.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
⛔ Files ignored due to path filters (2)
- pkg/peerhealth/peerhealth.pb.go is excluded by !**/*.pb.go
- pkg/peerhealth/peerhealth_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (12)
- controllers/selfnoderemediation_controller.go(3 hunks)
- controllers/tests/config/suite_test.go(1 hunks)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/controller/suite_test.go(3 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- go.mod(1 hunks)
- pkg/apicheck/check.go(5 hunks)
- pkg/controlplane/manager.go(2 hunks)
- pkg/peers/peers.go(4 hunks)
- pkg/utils/pods.go(3 hunks)
- vendor/github.com/onsi/gomega/gcustom/make_matcher.go(1 hunks)
- vendor/modules.txt(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
controllers/tests/config/suite_test.go (1)
controllers/tests/shared/shared.go (1)
MinPeersForRemediationConfigDefaultValue(47-47)
controllers/tests/controller/suite_test.go (1)
controllers/tests/shared/shared.go (3)
ApiConnectivityCheckWrapper(59-65)
MinPeersForRemediationConfigDefaultValue(47-47)
NewApiConnectivityCheckWrapper(109-131)
controllers/tests/shared/shared.go (4)
pkg/apicheck/check.go (3)
ApiConnectivityCheck(28-37)
New(59-70)
ApiConnectivityCheckConfig(41-57)
api/HealthCheckResponse.go (1)
HealthCheckResponseCode(3-3)
pkg/controlplane/manager.go (1)
Manager(29-36)
vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
MakeMatcher(87-128)
🪛 golangci-lint (1.64.8)
pkg/apicheck/check.go
110-110: S1023: redundant return statement
(gosimple)
🔇 Additional comments (29)
go.mod (1)
25-25: Explicit dependency declaration for UUID library
Moving the UUID package from indirect to direct dependencies correctly reflects its usage in the codebase, specifically for generating random IPv6 addresses in test utilities.
vendor/modules.txt (1)
144-144: Added Gomega custom matcher package
The addition of github.com/onsi/gomega/gcustom in the vendor modules reflects the use of custom matchers in tests, which provides more flexible failure message formatting.
pkg/controlplane/manager.go (2)
57-60: Enhanced logging for control plane role verificationAdding detailed logging about node roles improves observability when checking if a node is a control plane node, which will help with debugging role-based issues.
138-140: Added node labels logging during role assignmentThis logging enhancement provides visibility into node labels during role determination, which is valuable for troubleshooting control plane node identification issues, especially when dealing with peer control nodes.
controllers/tests/config/suite_test.go (1)
129-129:✅ Verification successful
Updated minimum peers configuration constant
Changed from using shared.MinPeersForRemediation to shared.MinPeersForRemediationConfigDefaultValue (value of 1) to align with the new API connectivity check wrapper's default configuration values.
🏁 Script executed:
#!/bin/bash
# Verify the default value in the shared package
grep -r "MinPeersForRemediationConfigDefaultValue" --include="*.go" controllers/tests/shared/

Length of output: 174
Change Approved: Verified default peers configuration constant
The shared.MinPeersForRemediationConfigDefaultValue constant is defined as 1 in controllers/tests/shared/shared.go, matching the intended default. No further changes required.
- File updated: controllers/tests/config/suite_test.go (line 129)
pkg/utils/pods.go (3)
5-5: Good addition of the fmt package for improved error handling.
The import of the fmt package is required for the enhanced error wrapping in the error handling sections.
23-24: Excellent improvement to error handling with context.
Using fmt.Errorf with the %w verb provides better error context by wrapping the original error instead of losing the root cause. This change makes debugging much easier by preserving the full error chain.
33-33: Good enhancement to error message clarity.
The improved error message now explicitly includes the node name in the formatted string, making it easier to identify which node is affected when troubleshooting.
controllers/tests/controller/suite_test.go (3)
63-63: Good update to variable type for enhanced testing capabilities.
Changing the type to *shared.ApiConnectivityCheckWrapper allows for more flexible testing by enabling simulation of peer health check responses.
166-166: Updated constant reference for better code maintenance.
Changing from shared.MinPeersForRemediation to shared.MinPeersForRemediationConfigDefaultValue improves code clarity by using a more descriptive constant name and aligns with updated constants in the shared test package.
168-170: Good refactoring to use wrapper for API connectivity checks.
The change from directly using apicheck.New to using shared.NewApiConnectivityCheckWrapper enhances testing capabilities by allowing simulation of peer responses, which is essential for thorough testing of peer control node scenarios.
controllers/selfnoderemediation_controller.go (7)
453-453: Good addition of informative logging at phase start.This logging statement improves observability by explicitly marking entry into the fencing start phase, making it easier to track the remediation workflow.
456-456: Enhanced logging for pre-reboot phase entry.This logging statement improves traceability by clearly marking the transition to the pre-reboot completed phase.
459-459: Improved phase transition visibility with logging.Adding explicit logging for entering the reboot completed phase enhances observability of the remediation workflow.
462-462: Clear logging for fencing completion phase.This logging statement provides clear indication of reaching the final fencing complete phase in the remediation workflow.
466-466: Enhanced error message with phase value inclusion.Including the actual phase value in the error message provides more context for troubleshooting unknown phase errors.
500-500: Good indication of pre-reboot completion.Adding a log statement that clearly indicates when the pre-reboot phase is completed improves workflow visibility.
631-632: Improved error context for node reboot capability.The expanded error message clearly explains the consequence of the error - that failure to get the agent pod makes the node not reboot capable, which is important context for troubleshooting.
pkg/peers/peers.go (9)
104-116: Good addition of detailed logging and reset functionality.
The refactored updateWorkerPeers method now includes comprehensive logging and a dedicated reset function, improving observability and state management. The use of closures for setter, getter, and reset functions makes the code more modular and testable (sketched below).
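For readers unfamiliar with the pattern, here is the closure-based setter/getter/reset idea reduced to a standalone sketch; the function names and the fetch step are placeholders, not the real method, which works against the Kubernetes node list.

```go
package main

import "fmt"

// updatePeers shows the closure-based pattern: the caller passes fetch, set,
// get, and reset functions, so the same update logic can manage different
// peer lists (worker peers, control plane peers) without duplication.
func updatePeers(fetch func() ([]string, error),
	set func([]string), get func() []string, reset func()) {

	addresses, err := fetch()
	if err != nil {
		// On error, fall back to a clean state instead of keeping stale peers.
		reset()
		return
	}
	set(addresses)
	fmt.Println("current peers:", get())
}

func main() {
	var workerPeers []string

	updatePeers(
		func() ([]string, error) { return []string{"10.0.0.1", "10.0.0.2"}, nil },
		func(a []string) { workerPeers = a },
		func() []string { return workerPeers },
		func() { workerPeers = nil },
	)
}
```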
119-134: Well-structured implementation of UpdateControlPlanePeers.
The new implementation follows the same pattern as updateWorkerPeers, providing consistent logging, clear state management, and a dedicated reset function. This parallelism in implementation makes the code more maintainable.
136-137: Good enhancement toupdatePeerswith reset functionality.Adding the
resetPeersparameter allows for more flexible peer state management, particularly in error recovery scenarios.
150-150: Improved state management with reset function.Using the dedicated reset function instead of directly manipulating state improves encapsulation and ensures consistent state management across the codebase.
156-156: Enhanced logging of node information.Adding detailed logging of the nodes found during peer updates improves observability and debugging capabilities, especially for distributed node issues.
179-183: Informative logging for skipped node/pod combinations.Adding logging when skipping node/pod combinations provides visibility into the peer selection process, which is helpful for understanding which peers are considered during updates.
194-195: Comprehensive peer address logging.The detailed logging of both worker and control plane peer addresses when retrieving peer addresses significantly improves observability, particularly for debugging peer connectivity issues.
200-200: Improved role-specific peer logging.Adding specific log messages for worker and control plane peer retrieval requests enhances the clarity of the peer discovery process, making it easier to track role-based peer operations.
Also applies to: 203-203
86-87:✅ Verification successful
Method name change from updateControlPlanePeers to UpdateControlPlanePeers.
Also applies to: 92-93
🏁 Script executed:
#!/bin/bash # Check for all callers of the renamed method rg -A 2 -B 2 -p "updateControlPlanePeers|UpdateControlPlanePeers" --type goLength of output: 1869
All callers updated: Public method rename is safe
Confirmed that the private
updateControlPlanePeersidentifier has been fully replaced by the publicUpdateControlPlanePeersacross the codebase. No remaining references to the old name were found.Key call sites:
- pkg/peers/peers.go: lines 85–87 and 91–93
- controllers/tests/controller/selfnoderemediation_controller_test.go: lines 1142 & 1147
No further changes required.
controllers/tests/shared/shared.go (1)
101-107: UUID-based IPv6 test addresses can collide across processes
While unlikely, deriving an IPv6 address from the raw UUID bytes does not set the locally-administered bit (RFC 4193) and may occasionally overlap with real cluster traffic if the test runs on the host network.
A safer pattern is to prependfd00:(ULA range) or usenet.ParseIP("::1")for pure unit tests.Would you like a patch that generates deterministic ULA addresses (
fdxx:) instead?controllers/tests/controller/selfnoderemediation_controller_test.go (1)
1237-1244: Global mutable test state – verify isolation
configureRemediationStrategy mutates the package-level remediationStrategy variable that is also toggled directly in other test contexts.
While Ginkgo executes tests serially by default, parallel runs or future refactors can introduce surprising cross-test coupling.
Please confirm that:
- No test suite is executed with -parallel > 1.
- No goroutines read this variable concurrently.
If parallelism is anticipated, consider passing the strategy explicitly to helper functions instead of relying on a shared global.
        
          
controllers/tests/controller/selfnoderemediation_controller_test.go (Outdated)
        
6397c3e to befbcbb (Compare)
  
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mark-dell-usa. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Actionable comments posted: 1
♻️ Duplicate comments (4)
controllers/tests/shared/shared.go (2)
82-89: Random IP helper forgets to update.Status.PodIP
assignRandomIpAddressesPods()populatesPodIPs, but callers relying on the legacy singular field (pod.Status.PodIP) will still observe an empty value.
Several components (including controller-runtime caches and some older network add-ons) still look atPodIPwhen only one address is expected.
Consider mirroring the first generated address into both fields:func assignRandomIpAddressesPods(pods *corev1.PodList) { for i := range pods.Items { - pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: GetRandomIpAddress()}} + randIP := GetRandomIpAddress() + pods.Items[i].Status.PodIP = randIP // keep legacy field in sync + pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: randIP}} } }
116-128: Deadlock risk: simulated peer sends > 1 message to a bounded channel
getHealthStatusFromPeers()allocatesresultsChanwithcapacity = len(addresses)and waits for exactly one message per peer.
The current override loops overckw.SimulatePeerResponsesand may write multiple codes, causing the goroutine to block on the second send and the whole test to hang.- case ckw.ShouldSimulatePeerResponses: - for _, code := range ckw.SimulatePeerResponses { - results <- code - } - return + case ckw.ShouldSimulatePeerResponses: + // Provide exactly one deterministic response per call + if len(ckw.SimulatePeerResponses) > 0 { + results <- ckw.SimulatePeerResponses[0] + } else { + results <- selfNodeRemediation.RequestFailed + } + returnIf you need per-peer variance, consider indexing into the slice with a counter guarded by a mutex instead of pushing multiple values at once.
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
875-885: Variable shadowing hides the named return value and loses contextThe inner
err := …shadows the function's namederrreturn value, then the assignmenterr = …inside the block modifies the shadowed variable, not the function return.
As a result, the caller may receive anilerror even though the Get failed.- if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil { - if apierrors.IsNotFound(err) && !throwErrorIfNotFound { + getErr := k8sClient.Client.Get(context.Background(), podKey, pod) + if getErr != nil { + if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound { logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName) err = nil return } - - err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err) + err = fmt.Errorf("unable to get pod with name '%s' in order to delete it: %w", podName, getErr) return }
1197-1201: Cleanup doesn't persist the label rollback – subsequent tests may inherit the control-plane labelInside
configureUnhealthyNodeAsControlNode, theDeferCleanupcallback only changes the in-memoryunhealthyNodeobject but never updates the cluster.
If another test starts before this process ends (or if the global state is inspected inverifyCleanState), the node will still carry thenode-role.kubernetes.io/master=truelabel, leading to flaky or misleading results.- unhealthyNode.Labels[labels2.MasterRole] = previousRole + unhealthyNode.Labels[labels2.MasterRole] = previousRole + // Persist the rollback so follow-up tests start from + // a clean state + Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(), + "failed to restore the unhealthy node label after test cleanup")
🧹 Nitpick comments (2)
pkg/apicheck/check.go (2)
72-114: Clean up redundant return and outdated TODOTwo minor improvements:
- Remove the redundant
returnstatement at line 110 (flagged by static analysis)- Remove the outdated TODO comment since IPv6 compatibility is confirmed- // TODO does this work with IPv6? - // MES: Yes it does, we've tested this phClient, err := peerhealth.NewClient(fmt.Sprintf("%v:%v", endpointIp.IP, c.config.PeerHealthPort), c.config.PeerDialTimeout, c.config.Log.WithName("peerhealth client"), c.clientCreds) ... results <- selfNodeRemediation.HealthCheckResponseCode(resp.Status) - return }
173-215: Well-structured health check logic with improved observabilityThe refactored
isConsideredHealthymethod provides clear separation between worker and control plane logic with excellent logging for debugging. The implementation aligns with the discussed flow from past reviews.Minor suggestion: Consider extracting the complex boolean expression on line 177 into a named variable for clarity:
- isWorkerNode := isControlPlaneManagerNil || !c.controlPlaneManager.IsControlPlane() + isControlPlane := !isControlPlaneManagerNil && c.controlPlaneManager.IsControlPlane() + isWorkerNode := !isControlPlane
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- pkg/peerhealth/peerhealth.pb.gois excluded by- !**/*.pb.go
- pkg/peerhealth/peerhealth_grpc.pb.gois excluded by- !**/*.pb.go
📒 Files selected for processing (12)
- controllers/selfnoderemediation_controller.go(3 hunks)
- controllers/tests/config/suite_test.go(1 hunks)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/controller/suite_test.go(3 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- go.mod(1 hunks)
- pkg/apicheck/check.go(5 hunks)
- pkg/controlplane/manager.go(2 hunks)
- pkg/peers/peers.go(4 hunks)
- pkg/utils/pods.go(3 hunks)
- vendor/github.com/onsi/gomega/gcustom/make_matcher.go(1 hunks)
- vendor/modules.txt(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- pkg/controlplane/manager.go
- go.mod
- pkg/utils/pods.go
- controllers/tests/config/suite_test.go
- vendor/modules.txt
- controllers/tests/controller/suite_test.go
- controllers/selfnoderemediation_controller.go
- pkg/peers/peers.go
🧰 Additional context used
🪛 golangci-lint (1.64.8)
pkg/apicheck/check.go
110-110: S1023: redundant return statement
(gosimple)
🔇 Additional comments (8)
vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
1-271: Vendor dependency addition looks good
This vendor file adds the gcustom package from Gomega v1.27.0+ which provides custom matcher functionality. The integration with the test suite is appropriate for creating the new error matchers in shared.go.
Note: Static analysis detected a redundant return statement at line 110, but since this is vendor code, it should be addressed upstream rather than modified locally.
controllers/tests/shared/shared.go (1)
167-206: Well-implemented custom error matchersThe custom Gomega matchers for Kubernetes errors are cleanly implemented with proper nil handling. This improves test readability and reusability.
pkg/apicheck/check.go (2)
28-40: Excellent refactoring for testability
The introduction of GetHealthStatusFromRemoteFunc and the function injection pattern improves testability while maintaining backward compatibility with the default implementation. This follows the dependency injection principle effectively.
Also applies to: 59-70
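In isolation, the injection pattern looks roughly like this sketch; the type and method names are placeholders, not the PR's actual signatures.

```go
package main

import "fmt"

type healthCode int

// getHealthFunc is the injectable dependency: production code wires a default
// remote-call implementation, tests swap in a canned one.
type getHealthFunc func(endpoint string) healthCode

type connectivityCheck struct {
	getHealth getHealthFunc
}

func newConnectivityCheck() *connectivityCheck {
	c := &connectivityCheck{}
	c.getHealth = c.defaultGetHealth // default wired in the constructor
	return c
}

// defaultGetHealth stands in for the real peer-health call.
func (c *connectivityCheck) defaultGetHealth(endpoint string) healthCode {
	return 0 // "healthy" placeholder
}

// SetHealthFunc lets tests override the remote call before the check starts.
func (c *connectivityCheck) SetHealthFunc(f getHealthFunc) { c.getHealth = f }

func main() {
	c := newConnectivityCheck()
	c.SetHealthFunc(func(endpoint string) healthCode { return 2 }) // simulate "unhealthy"
	fmt.Println("simulated peer response:", c.getHealth("fd00::1"))
}
```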
217-322: Good refactoring to support role-based peer queryingThe generalization from worker-specific to role-based peer querying improves code reusability. The additional logging statements will be valuable for troubleshooting peer communication issues in production.
controllers/tests/controller/selfnoderemediation_controller_test.go (4)
39-82: Excellent test setup refactoringThe restructured setup with descriptive
Byblocks and comprehensive cleanup improves test maintainability and debugging. The separation of concerns betweenBeforeEachandJustBeforeEachis well-organized.
448-495: Well-structured test case for issue #251This test effectively reproduces the control plane peer health check scenario described in issue #251. The setup clearly configures the environment with control plane nodes and simulated unhealthy responses.
1134-1337: Excellent test helper functionsThe new helper functions provide clean abstractions for test configuration with proper cleanup. The consistent use of
DeferCleanupensures test isolation, and the descriptive function names make test scenarios self-documenting.
1003-1041: Comprehensive cluster state verificationThe enhanced
verifyCleanStatefunction thoroughly validates the cluster state after tests, including checking for specific SNR pods by name. This will help catch cleanup issues early.
Self Node Remediation - Control Plane Edge Case Analysis
Overview
This document analyzes a critical edge case in the Self Node Remediation (SNR) operator where a control plane node becomes partially degraded but never remediates itself. This occurs when:
The Edge Case Scenario
Initial State
Workflow Diagrams
1. Normal SNR Health Check Flow
flowchart TD
    Start([SNR Pod Health Check Timer]) --> APICheck[Check API Server /readyz]
    APICheck -->|Success| Healthy[Mark as Healthy]
    APICheck -->|Failure| ErrorCount{Error Count > Threshold?}
    ErrorCount -->|No| IncrementError[Increment Error Counter]
    IncrementError --> Wait[Wait for Next Check]
    ErrorCount -->|Yes| PeerCheck[Query Peer Nodes]
    
    PeerCheck --> PeerResponse{Peer Response Analysis}
    PeerResponse -->|Majority say Unhealthy| CreateSNR[Trigger Remediation]
    PeerResponse -->|Majority say Healthy| ResetCounter[Reset Error Counter]
    PeerResponse -->|Most can't reach API| ControlPlaneCheck{Is Control Plane?}
    
    ControlPlaneCheck -->|Worker Node| ConsiderHealthy[Consider Healthy]
    ControlPlaneCheck -->|Control Plane| RunDiagnostics[Run Diagnostics]
    
    RunDiagnostics --> DiagResult{Diagnostics Pass?}
    DiagResult -->|Yes| ConsiderHealthy
    DiagResult -->|No| CreateSNR
    
    ResetCounter --> Wait
    ConsiderHealthy --> Wait
    CreateSNR --> Remediate[Begin Remediation Process]
2. Control Plane Edge Case Flow (The Bug)
flowchart TD
    Start([Control Plane Node:<br/>API Server DOWN]) --> APIFails[API Check Fails Repeatedly]
    APIFails --> QueryPeers[Query Worker Peers]
    
    QueryPeers --> PeerStatus{Worker Peer Responses}
    PeerStatus -->|">50% also can't<br/>reach API Server"| MostCantAccess[Status: HealthyBecauseMostPeersCantAccessAPIServer]
    
    MostCantAccess --> CPDiag[Run Control Plane Diagnostics]
    
    CPDiag --> EndpointCheck{Check Endpoint<br/>Health URL}
    EndpointCheck -->|Not Configured or<br/>Was Never Accessible| EndpointPass[Endpoint Check: PASS]
    
    EndpointPass --> KubeletCheck{Is Kubelet<br/>Running?}
    KubeletCheck -->|Port 10250<br/>Responds| KubeletPass[Kubelet Check: PASS]
    
    KubeletPass --> DiagPass[Diagnostics: PASSED ✓]
    DiagPass --> MarkHealthy[Node Marked as HEALTHY]
    MarkHealthy --> NoRemediation[❌ NO REMEDIATION TRIGGERED]
    
    NoRemediation --> Impact[Control Plane Remains Broken:<br/>- No API Server<br/>- No Scheduling<br/>- No Controllers<br/>- Cluster Partially Down]
3. Peer Health Check Details
sequenceDiagram
    participant CP as Control Plane<br/>(Broken API)
    participant W1 as Worker 1
    participant W2 as Worker 2
    participant W3 as Worker 3
    
    Note over CP: API Check Fails
    CP->>W1: Is my SNR CR present?
    CP->>W2: Is my SNR CR present?
    CP->>W3: Is my SNR CR present?
    
    W1--xCP: Error: Can't reach API
    W2--xCP: Error: Can't reach API
    W3--xCP: Error: Can't reach API
    
    Note over CP: >50% peers have API errors
    Note over CP: Status = HealthyBecauseMostPeersCantAccessAPIServer
    Note over CP: Run Diagnostics:
    Note over CP: ✓ Kubelet Running
    Note over CP: ✓ No Endpoint URL
    Note over CP: = HEALTHY (Bug!)
4. The Diagnostic Gap
flowchart LR
    subgraph "Current Diagnostics"
        D1[Endpoint Health Check]
        D2[Kubelet Service Check]
    end
    
    subgraph "Missing Checks"
        M1[API Server Process]
        M2[Controller Manager]
        M3[Scheduler]
        M4[Etcd Connectivity]
    end
    
    subgraph "Result"
        R1[False Positive:<br/>Node Considered Healthy<br/>Despite Being Non-Functional]
    end
    
    D1 --> R1
    D2 --> R1
    M1 -.->|Should Check| R1
    M2 -.->|Should Check| R1
    M3 -.->|Should Check| R1
    M4 -.->|Should Check| R1
Code Analysis
The Bug Location
File:
func (manager *Manager) IsControlPlaneHealthy(workerPeersResponse peers.Response,
    canOtherControlPlanesBeReached bool) bool {
    
    switch workerPeersResponse.ReasonCode {
    // ...
    case peers.HealthyBecauseMostPeersCantAccessAPIServer:
        didDiagnosticsPass := manager.isDiagnosticsPassed()
        manager.log.Info("The peers couldn't access the API server, so we are returning whether "+
            "diagnostics passed", "didDiagnosticsPass", didDiagnosticsPass)
        return didDiagnosticsPass  // <-- BUG: Returns true if kubelet is running
    // ...
    }
}
func (manager *Manager) isDiagnosticsPassed() bool {
    manager.log.Info("Starting control-plane node diagnostics")
    if manager.isEndpointAccessLost() {
        return false
    } else if !manager.isKubeletServiceRunning() {  // <-- Only checks kubelet!
        return false
    }
    manager.log.Info("Control-plane node diagnostics passed successfully")
    return true
}
The Problem
The diagnostics only check:
 They DO NOT check: 
Impact
This edge case creates a situation where:
Recommended Solutions
Solution 1: Enhanced Diagnostics
func (manager *Manager) isDiagnosticsPassed() bool {
    // Existing checks...
    
    // Add: Check if API server is running locally
    if !manager.isAPIServerRunningLocally() {
        manager.log.Info("API server is not running locally")
        return false
    }
    
    // Add: Check local API server connectivity
    if !manager.canReachLocalAPIServer() {
        manager.log.Info("Cannot reach local API server")
        return false
    }
    
    return true
}
Solution 2: Control Plane Specific Logic
When
Solution 3: Timeout-Based Remediation
If a control plane node remains in this state for a configurable duration (one possible shape of such a check is sketched below):
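For illustration, a minimal sketch of what such a timeout could look like; the struct, the reason-code flag, and the five-minute window are assumptions, not existing SNR code or configuration.

```go
package main

import (
	"fmt"
	"time"
)

// escalator remediates a control plane node that has been stuck in the
// "most peers can't access the API server" state for longer than a window.
type escalator struct {
	window time.Duration
	since  *time.Time // when the degraded state was first observed
}

func (e *escalator) shouldRemediate(mostPeersCantAccessAPI bool, now time.Time) bool {
	if !mostPeersCantAccessAPI {
		e.since = nil // healthy again, reset the clock
		return false
	}
	if e.since == nil {
		e.since = &now
	}
	return now.Sub(*e.since) >= e.window
}

func main() {
	e := &escalator{window: 5 * time.Minute}
	t0 := time.Now()
	fmt.Println(e.shouldRemediate(true, t0))                    // false: window just started
	fmt.Println(e.shouldRemediate(true, t0.Add(6*time.Minute))) // true: window exceeded
}
```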
 References
Conclusion
The current SNR implementation has a critical gap in control plane node health assessment. When the API server fails but kubelet remains running, the node incorrectly considers itself healthy, preventing automatic remediation. This edge case requires either enhanced diagnostics that specifically check control plane components or a fundamental change in how control plane node health is evaluated during cluster-wide API server outages.
Find an updated analysis here of what I did; perhaps it will help. I attempted to pull it together with diagrams to make it clear: https://gist.github.com/mark-dell-usa/576901e5cc420114bcf1076ff5d57f52
…#251 exists and is a real problem.
befbcbb to 2fe5996 (Compare)
  
    …y and getWorkerPeersResponse for issue medik8s#251
Actionable comments posted: 2
♻️ Duplicate comments (5)
controllers/tests/shared/shared.go (3)
82-89: Random IP helper forgets to update.Status.PodIP
assignRandomIpAddressesPods()populatesPodIPs, but callers relying on the legacy singular field (pod.Status.PodIP) will still observe an empty value. Several components (including controller-runtime caches and some older network add-ons) still look atPodIPwhen only one address is expected.Consider mirroring the first generated address into both fields:
func assignRandomIpAddressesPods(pods *corev1.PodList) { for i := range pods.Items { - pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: GetRandomIpAddress()}} + randIP := GetRandomIpAddress() + pods.Items[i].Status.PodIP = randIP // keep legacy field in sync + pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: randIP}} } }
101-107: Use proper IPv6 address generation instead of UUID bytesConverting UUID bytes directly to IPv6 addresses may generate invalid or reserved addresses. Consider using proper IPv6 address generation that respects network standards:
func GetRandomIpAddress() (randomIP string) { - u := uuid.New() - ip := net.IP(u[:net.IPv6len]) - randomIP = ip.String() + // Generate a random IPv6 address in the fd00::/8 range (Unique Local Address) + bytes := make([]byte, net.IPv6len) + bytes[0] = 0xfd // ULA prefix + _, _ = rand.Read(bytes[1:]) + ip := net.IP(bytes) + randomIP = ip.String() return }
116-128: Deadlock risk: simulated peer sends > 1 message to a bounded channel
getHealthStatusFromPeers() allocates resultsChan with capacity = len(addresses) and waits for exactly one message per peer. The current override loops over ckw.SimulatePeerResponses and may write multiple codes, causing the goroutine to block on the second send and the whole test to hang.
- case ckw.ShouldSimulatePeerResponses:
-     for _, code := range ckw.SimulatePeerResponses {
-         results <- code
-     }
-     return
+ case ckw.ShouldSimulatePeerResponses:
+     // Provide exactly one deterministic response per call
+     if len(ckw.SimulatePeerResponses) > 0 {
+         results <- ckw.SimulatePeerResponses[0]
+     } else {
+         results <- selfNodeRemediation.RequestFailed
+     }
+     return
If you need per-peer variance, consider indexing into the slice with a counter guarded by a mutex instead of pushing multiple values at once.
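A standalone sketch of that mutex-guarded counter approach; the type and field names are placeholders rather than the actual test wrapper.

```go
package main

import (
	"fmt"
	"sync"
)

type responseCode int

// simulatedPeers hands out one canned response per call, in order, so a
// bounded results channel never receives more messages than expected.
type simulatedPeers struct {
	mu        sync.Mutex
	responses []responseCode
	next      int
}

func (s *simulatedPeers) nextResponse(fallback responseCode) responseCode {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.next >= len(s.responses) {
		return fallback
	}
	code := s.responses[s.next]
	s.next++
	return code
}

func main() {
	sim := &simulatedPeers{responses: []responseCode{1, 2}}
	results := make(chan responseCode, 3) // one slot per simulated peer
	for i := 0; i < 3; i++ {
		results <- sim.nextResponse(0) // 0 stands in for a "request failed" code
	}
	close(results)
	for code := range results {
		fmt.Println("peer response:", code)
	}
}
```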
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
1246-1267: Cleanup doesn't persist the label rollback – subsequent tests may inherit the control-plane labelInside
configureUnhealthyNodeAsControlNode, theDeferCleanupcallback only changes the in-memoryunhealthyNodeobject but never updates the cluster. If another test starts before this process ends (or if the global state is inspected inverifyCleanState), the node will still carry thenode-role.kubernetes.io/master=truelabel, leading to flaky or misleading results.DeferCleanup(func() { By("Revert the unhealthy node's role to its previous value", func() { unhealthyNode.Labels[labels2.MasterRole] = previousRole + // Persist the rollback so follow-up tests start from + // a clean state + Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(), + "failed to restore the unhealthy node label after test cleanup") }) })
934-945: Variable shadowing hides the named return value and loses contextThe inner
err := …shadows the function's namederrreturn value, then the assignmenterr = …inside the block modifies the shadowed variable, not the function return. As a result, the caller may receive anilerror even though the Get failed.By(fmt.Sprintf("Attempting to get pod '%s' before deleting it", podName), func() { - if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil { - if apierrors.IsNotFound(err) && !throwErrorIfNotFound { + getErr := k8sClient.Client.Get(context.Background(), podKey, pod) + if getErr != nil { + if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound { logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName) err = nil return } - - err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err) + err = fmt.Errorf("unable to get pod with name '%s' in order to delete it: %w", podName, getErr) return } })
🧹 Nitpick comments (4)
pkg/peers/peers.go (4)
105-117: Excessive logging in updateWorkerPeers creates noise.The function logs entry, setter calls, and getter calls at Info level. These logs fire on every peer update interval and provide limited operational value. In production clusters with frequent peer updates, this generates unnecessary log volume.
Consider removing or downgrading these logs to a trace/debug level that can be conditionally enabled, keeping only errors or significant state changes at Info level.
120-135: Excessive logging in UpdateControlPlanePeers creates noise.Similar to
updateWorkerPeers, this function logs entry, setter calls, and getter calls at Info level on every update interval, creating unnecessary log volume in production.Consider removing or downgrading these logs to trace/debug level.
157-157: Logging entire nodes list may produce excessive output.In clusters with many nodes, logging the complete
nodesobject dumps significant data at Info level. This can overwhelm logs and make troubleshooting harder.Consider logging a summary instead:
- p.log.Info("updatePeers", "nodes", nodes) + p.log.Info("updatePeers retrieved nodes", "nodeCount", len(nodes.Items))
209-219: Reduce logging frequency in GetPeersAddresses.This function logs peer addresses at Info level on every call. Depending on how frequently health checks or peer queries occur, this can generate significant log volume with redundant information.
Consider downgrading to debug level or logging only when addresses change:
- p.log.Info("GetPeersAddresses", "workerPeersAddresses", p.workerPeersAddresses, - "controlPlanePeersAddresses", p.controlPlanePeersAddresses) + p.log.V(1).Info("GetPeersAddresses", "workerPeersAddresses", p.workerPeersAddresses, + "controlPlanePeersAddresses", p.controlPlanePeersAddresses)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- pkg/peerhealth/peerhealth.pb.gois excluded by- !**/*.pb.go
- pkg/peerhealth/peerhealth_grpc.pb.gois excluded by- !**/*.pb.go
📒 Files selected for processing (12)
- controllers/selfnoderemediation_controller.go(3 hunks)
- controllers/tests/config/suite_test.go(1 hunks)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/controller/suite_test.go(3 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- go.mod(1 hunks)
- pkg/apicheck/check.go(6 hunks)
- pkg/controlplane/manager.go(2 hunks)
- pkg/peers/peers.go(4 hunks)
- pkg/utils/pods.go(3 hunks)
- vendor/github.com/onsi/gomega/gcustom/make_matcher.go(1 hunks)
- vendor/modules.txt(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- controllers/tests/config/suite_test.go
- pkg/controlplane/manager.go
- controllers/selfnoderemediation_controller.go
- controllers/tests/controller/suite_test.go
- go.mod
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
PR: medik8s/self-node-remediation#262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.
Applied to files:
- controllers/tests/controller/selfnoderemediation_controller_test.go
- controllers/tests/shared/shared.go
🧬 Code graph analysis (3)
controllers/tests/controller/selfnoderemediation_controller_test.go (5)
api/v1alpha1/selfnoderemediation_types.go (4)
RemediationStrategyType(44-44)
SelfNodeRemediation(92-98)
ResourceDeletionRemediationStrategy(25-25)
SelfNodeRemediationList(103-107)controllers/tests/shared/shared.go (10)
UnhealthyNodeName(35-35)
Namespace(34-34)
GenerateTestConfig(133-145)
PeerNodeName(36-36)
Peer2NodeName(37-37)
SnrPodName2(41-41)
Peer3NodeName(38-38)
SnrPodName3(42-42)
IsK8sNotFoundError(190-206)
SnrPodName1(40-40)vendor/github.com/medik8s/common/pkg/labels/labels.go (1)
MasterRole(7-7)api/HealthCheckResponse.go (2)
Unhealthy(8-8)
HealthCheckResponseCode(3-3)pkg/controlplane/manager.go (1)
NewManager(39-47)
pkg/apicheck/check.go (6)
pkg/controlplane/manager.go (1)
Manager(29-36)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peers/peers.go (5)
New(47-58)
Worker(29-29)
ControlPlane(30-30)
Role(26-26)
Peers(36-45)pkg/peerhealth/client.go (1)
NewClient(19-44)pkg/peerhealth/peerhealth.pb.go (3)
HealthRequest(23-30)
HealthRequest(45-45)
HealthRequest(60-62)pkg/peers/response.go (2)
Response(3-6)
HealthyBecauseErrorsThresholdNotReached(12-12)
controllers/tests/shared/shared.go (3)
pkg/apicheck/check.go (3)
ApiConnectivityCheck(36-45)
New(68-79)
ApiConnectivityCheckConfig(49-66)api/HealthCheckResponse.go (1)
HealthCheckResponseCode(3-3)pkg/controlplane/manager.go (1)
Manager(29-36)
🔇 Additional comments (17)
pkg/utils/pods.go (3)
5-5: LGTM!The import change from
errorstofmtis appropriate given the error formatting improvements below.
21-25: LGTM! Proper error wrapping.The error is now properly wrapped using
fmt.Errorfwith the%wverb, which preserves the error chain and adds descriptive context. This follows Go best practices and improves debugging.
33-33: LGTM! Improved error message.Including the node name in the error message significantly improves debugging and observability when the pod is not found.
vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
18-270: Upstream gcustom matcher looks solidReflect wrapper and templated failure messages behave as expected; nothing concerning here.
vendor/modules.txt (1)
140-145: Vendor manifest entry matches added packageThe gcustom module entry is consistent with the new vendored files.
controllers/tests/shared/shared.go (2)
67-91: LGTM on K8sClientWrapper.List refactoring.The switch-based flow with fallthrough is clear and correct. The IP randomization is properly applied after the underlying client.List call.
167-206: LGTM on error matcher utilities.The custom Gomega matchers for K8s error testing follow the correct pattern and provide useful abstractions for test assertions.
pkg/apicheck/check.go (4)
47-47: LGTM on injectable health-check function pattern.The introduction of
GetHealthStatusFromRemoteFuncas a type and the ability to inject it viaSetHealthStatusFuncis a solid pattern for testability. The constructor properly sets a default implementation, ensuring production behavior remains unchanged while enabling test overrides.Also applies to: 68-79
81-124: LGTM on GetDefaultPeerHealthCheckFunc extraction.Extracting the default peer health-check logic into a named method improves code organization and makes the health-check implementation explicit and replaceable.
183-225: Verify the new health-check flow aligns with intended behavior.The refactored
isConsideredHealthylogic significantly changes the health determination flow:
- Worker nodes now consult control-plane peers if worker peers report unhealthy
- Control-plane nodes rely on control-plane peer responses and diagnostics
Based on past review comments, there are ongoing discussions about whether this flow correctly handles all scenarios (e.g., ignoring worker response reasons for CP nodes). Ensure the logic matches the expected behavior documented in the PR objectives and addresses Issue #251.
414-426: LGTM on health-check function accessors.The setter, getter, and delegation pattern for
getHealthStatusFromRemoteFuncis correctly implemented and enables the test wrapper to override behavior.controllers/tests/controller/selfnoderemediation_controller_test.go (6)
39-39: Verify global remediation strategy is properly managed across tests.The package-level
remediationStrategyvariable is mutable test state. If tests don't consistently reset it viaconfigureRemediationStrategy()or BeforeEach blocks, residual values from one test can leak into another.Scan test cases to confirm each that modifies
remediationStrategyproperly restores it in cleanup.
507-554: Test case demonstrates Issue #251 but comment suggests expected failure. The test "two control node peers found, they tell me I'm unhealthy" includes the comment: "It's expected that the next line will fail, even though it shouldn't!" This suggests the test is meant to demonstrate the bug rather than pass after the fix is applied.
If the PR intends to fix Issue #251, this test should pass after the code changes. Update the test or comment to reflect whether:
- The test currently demonstrates the bug (it should then be marked as pending/skipped until fixed; see the sketch after this list)
- The fix in this PR makes the test pass (remove the comment about expected failure)
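If the pending/skipped route is taken, Ginkgo supports this directly. A minimal standalone sketch, assuming Ginkgo v2 and Gomega as the repo's test framework (the package name, spec text, and placeholder assertion are illustrative, not the actual test):

```go
package demo_test

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestDemo(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "pending spec demo")
}

var _ = Describe("watchdog behaviour for issue #251", func() {
	// Option A: PIt marks the spec pending; it is reported but never executed,
	// so a known-failing assertion cannot break CI.
	PIt("triggers a reboot once the control-plane peer flow is fixed", func() {
		Expect(true).To(BeFalse()) // placeholder for the real watchdog assertion
	})

	// Option B: keep the spec active but bail out with an explanation.
	It("triggers a reboot (skipped until the fix lands)", func() {
		Skip("demonstrates issue #251; re-enable when the fix is in")
	})
})
```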
830-859: LGTM on createGenericSelfNodeRemediationPod helper. The function properly creates an SNR pod with correct labels, registers cleanup, and verifies creation. The use of `DeferCleanup` ensures proper test hygiene.
1311-1376: LGTM on addNodes helper for test orchestration. The `addNodes` helper provides a clean, declarative way to set up multi-node test scenarios with associated pods and simulated responses. The cleanup registration ensures proper teardown.
1062-1100: Verify verifyCleanState handles dynamically added nodes. `verifyCleanState` expects exactly 2 nodes (`UnhealthyNodeName` and `PeerNodeName`), but tests using `addNodes` can introduce additional nodes (`Peer2NodeName`, `Peer3NodeName`). If `verifyCleanState` runs before `addNodes` cleanup completes, the assertion at line 1067 will fail. Ensure test cleanup ordering is correct, or make `verifyCleanState` more flexible to handle transient nodes during cleanup.
1193-1210: LGTM on addControlPlaneManager helper. The helper properly creates, wires, and cleans up the control plane manager for tests. The cleanup ensures peer state is reset when the manager is removed.
| } else { | ||
| p.log.Info("Skipping current node/pod combo", | ||
| "node.Name", node.Name, | ||
| "pod.Spec.NodeName", pod.Spec.NodeName) | ||
| } | 
🛠️ Refactor suggestion | 🟠 Major
Remove debug logging from hot path in nested loop.
This Info log fires for every node/pod combination that doesn't match, which is the common case. In a cluster with N nodes and M pods, this can generate N×M log entries per peer update interval, overwhelming logs.
Remove this log statement or make it conditional on a debug flag:
-			} else {
-				p.log.Info("Skipping current node/pod combo",
-					"node.Name", node.Name,
-					"pod.Spec.NodeName", pod.Spec.NodeName)
			}
🤖 Prompt for AI Agents
In pkg/peers/peers.go around lines 191 to 195, the Info log inside the node/pod
nested loop is executed for every non-matching pair and floods logs (N×M
entries); remove this Info log from the hot path or gate it behind a debug flag
so it only emits when verbose/debug logging is enabled (e.g., check the logger's
debug/verbose state before logging), keeping normal operation silent for the
common non-matching case.
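For reference, the usual way to gate such a message in controller-runtime style code is a logr verbosity level rather than a hand-rolled flag. A minimal sketch, assuming the logger is a `logr.Logger` built with controller-runtime's zap helper (the node/pod names are placeholders):

```go
package main

import (
	"github.com/go-logr/logr"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	// zap.New is the controller-runtime helper commonly used to build the operator's logger.
	var log logr.Logger = zap.New(zap.UseDevMode(true))

	nodeName, podNodeName := "worker-1", "worker-2"

	// V(1) messages are emitted only when the logger's verbosity allows them,
	// so this line stays silent on the hot path at the default production level.
	log.V(1).Info("Skipping current node/pod combo",
		"node.Name", nodeName,
		"pod.Spec.NodeName", podNodeName)
}
```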
2fe5996 to d1c5111 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (4)
controllers/tests/shared/shared.go (2)
102-108: Use proper IPv6 address generation instead of raw UUID bytes. Converting UUID bytes directly to an IPv6 address can produce invalid or reserved addresses. Consider generating addresses in a proper test range such as the Unique Local Address (ULA) range `fd00::/8`. Apply this diff:
 func GetRandomIpAddress() (randomIP string) {
-	u := uuid.New()
-	ip := net.IP(u[:net.IPv6len])
-	randomIP = ip.String()
-
+	// Generate a random IPv6 address in the fd00::/8 range (Unique Local Address)
+	bytes := make([]byte, net.IPv6len)
+	bytes[0] = 0xfd // ULA prefix
+	u := uuid.New()
+	copy(bytes[1:], u[:net.IPv6len-1])
+	randomIP = net.IP(bytes).String()
 	return
 }
83-97: Set the legacy PodIP field for compatibility. While `PodIPs` is the current standard, some components still rely on the singular `pod.Status.PodIP` field. The helper only populates `PodIPs`, which may cause compatibility issues. Apply this diff to also populate the legacy field:
 func assignRandomIpAddressesPods(pods *corev1.PodList) {
 	for i := range pods.Items {
-		pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: GetRandomIpAddress()}}
+		randIP := GetRandomIpAddress()
+		pods.Items[i].Status.PodIP = randIP
+		pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: randIP}}
 	}
-
-	return
 }
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
1246-1267: Fix: Cleanup doesn't persist the label rollback. The `DeferCleanup` callback at lines 1261-1265 modifies the in-memory `unhealthyNode.Labels` but never calls `k8sClient.Update()` to persist the change. Subsequent tests may inherit the `node-role.kubernetes.io/master` label, causing flaky behavior. Apply this diff to persist the rollback:
 DeferCleanup(func() {
 	By("Revert the unhealthy node's role to its previous value", func() {
 		unhealthyNode.Labels[labels2.MasterRole] = previousRole
+		Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(),
+			"failed to restore the unhealthy node label after test cleanup")
 	})
 })
927-950: Fix: Variable shadowing prevents error return. The `err :=` short declaration at line 935 shadows the named return variable `err`, so assignments to `err` inside the `By` block (lines 938, 942) modify the local variable rather than the return value. The function always returns `nil` even when errors occur. Apply this diff to fix the shadowing:
 By(fmt.Sprintf("Attempting to get pod '%s' before deleting it", podName), func() {
-	if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil {
-		if apierrors.IsNotFound(err) && !throwErrorIfNotFound {
+	getErr := k8sClient.Client.Get(context.Background(), podKey, pod)
+	if getErr != nil {
+		if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound {
 			logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName)
-			err = nil
 			return
 		}
-		err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err)
+		err = fmt.Errorf("unable to get pod with name '%s': %w", podName, getErr)
 		return
 	}
 })
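The underlying Go pitfall is easy to reproduce in isolation; a minimal sketch with invented names, separate from the test code above:

```go
package main

import (
	"errors"
	"fmt"
)

// broken always returns nil: the `err :=` in the if statement declares a new
// variable that shadows the named return value.
func broken() (err error) {
	func() {
		if err := errors.New("boom"); err != nil {
			return // the outer err is never assigned
		}
	}()
	return
}

// fixed uses a differently named local and assigns to the named return.
func fixed() (err error) {
	func() {
		if innerErr := errors.New("boom"); innerErr != nil {
			err = fmt.Errorf("wrapped: %w", innerErr)
		}
	}()
	return
}

func main() {
	fmt.Println(broken()) // <nil>
	fmt.Println(fixed())  // wrapped: boom
}
```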
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- pkg/peerhealth/peerhealth.pb.go is excluded by `!**/*.pb.go`
- pkg/peerhealth/peerhealth_grpc.pb.go is excluded by `!**/*.pb.go`
📒 Files selected for processing (3)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- pkg/apicheck/check.go(6 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
PR: medik8s/self-node-remediation#262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.
Applied to files:
- controllers/tests/controller/selfnoderemediation_controller_test.go
- controllers/tests/shared/shared.go
🧬 Code graph analysis (3)
controllers/tests/controller/selfnoderemediation_controller_test.go (5)
api/v1alpha1/selfnoderemediation_types.go (4)
RemediationStrategyType(44-44)
SelfNodeRemediation(92-98)
ResourceDeletionRemediationStrategy(25-25)
SelfNodeRemediationList(103-107)controllers/tests/shared/shared.go (10)
UnhealthyNodeName(36-36)
Namespace(35-35)
GenerateTestConfig(180-192)
PeerNodeName(37-37)
Peer2NodeName(38-38)
SnrPodName2(42-42)
Peer3NodeName(39-39)
SnrPodName3(43-43)
IsK8sNotFoundError(237-253)
SnrPodName1(41-41)vendor/github.com/medik8s/common/pkg/labels/labels.go (1)
MasterRole(7-7)api/HealthCheckResponse.go (2)
Unhealthy(8-8)
HealthCheckResponseCode(3-3)pkg/controlplane/manager.go (1)
NewManager(39-47)
pkg/apicheck/check.go (6)
pkg/controlplane/manager.go (1)
Manager(29-36)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peers/peers.go (5)
New(47-58)
Worker(29-29)
ControlPlane(30-30)
Role(26-26)
Peers(36-45)pkg/peerhealth/client.go (1)
NewClient(19-44)pkg/peerhealth/peerhealth.pb.go (3)
HealthRequest(23-30)
HealthRequest(45-45)
HealthRequest(60-62)pkg/peers/response.go (2)
Response(3-6)
HealthyBecauseErrorsThresholdNotReached(12-12)
controllers/tests/shared/shared.go (4)
pkg/apicheck/check.go (3)
ApiConnectivityCheck(36-45)
New(68-79)
ApiConnectivityCheckConfig(49-66)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/controlplane/manager.go (1)
Manager(29-36)vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
MakeMatcher(87-128)
🔇 Additional comments (10)
pkg/apicheck/check.go (4)
47-47: LGTM: Injectable health check function enables testing. The new `GetHealthStatusFromRemoteFunc` type allows test code to inject custom health-check behavior, which is essential for simulating peer responses in unit tests without requiring actual network calls.
68-79: LGTM: Constructor properly initializes the default health check function. The refactored constructor now returns a named return value and correctly wires up the default peer health check function via `SetHealthStatusFunc(c.GetDefaultPeerHealthCheckFunc())`. This provides a clean separation between the default implementation and test overrides.
183-225: LGTM: Health check logic now handles control plane nodes correctly. The refactored `isConsideredHealthy()` method now properly differentiates between worker and control plane nodes, addressing the core issue in #251. For control plane nodes, it queries control plane peers and cross-checks with diagnostics, while worker nodes first check worker peers and fall back to control plane peers as a secondary verification.
227-339: LGTM: Generalized peer response handling with improved logging. The renamed `getPeersResponse()` method now accepts a `role` parameter, enabling it to query either worker or control plane peers. The enhanced logging that includes the role name addresses the misleading log message flagged in previous reviews.
controllers/tests/shared/shared.go (1)
217-235: LGTM: Custom Gomega matcher for ignored errors. The `IsIgnoredK8sError()` matcher correctly uses `gcustom.MakeMatcher` to create a reusable matcher that checks if an error matches any of the provided testing functions. This is a clean pattern for test assertions.
controllers/tests/controller/selfnoderemediation_controller_test.go (5)
507-554: LGTM: Test case reproduces issue #251. This test case properly simulates the scenario described in issue #251: a control plane node with multiple control plane peers reporting it as unhealthy. The comment on line 551-552 indicates this test demonstrates the bug (watchdog should trigger but currently doesn't), which aligns with the PR objectives.
766-788: LGTM: Improved remediation cleanup with better error handling. The refactored `deleteRemediations()` now uses `Eventually` with Gomega assertions for more robust cleanup, properly removes finalizers before deletion, and verifies the list is empty afterwards.
830-859: LGTM: Reusable pod creation helper with automatic cleanup. The new `createGenericSelfNodeRemediationPod()` helper centralizes pod creation logic and uses `DeferCleanup` to ensure proper teardown. This reduces code duplication across test cases.
1193-1210: LGTM: Control plane manager helper with proper cleanup. The `addControlPlaneManager()` helper properly initializes the control plane manager, updates peer addresses, and uses `DeferCleanup` to restore the previous state. This enables testing control plane scenarios.
1327-1373: LGTM: Flexible node and pod creation for test scenarios. The `addNodes()` helper accepts a slice of `newNodeConfig` structs, allowing test cases to declaratively define complex multi-node scenarios with custom labels and simulated peer responses. The automatic cleanup ensures tests don't pollute subsequent runs.
d1c5111 to 281a674 (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (5)
controllers/tests/shared/shared.go (3)
94-101: Mirror PodIPs[0] into legacy PodIP. Some consumers still read Status.PodIP. Keep both in sync.
 func assignRandomIpAddressesPods(pods *corev1.PodList) {
-	for i := range pods.Items {
-		pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: GetRandomIpAddress()}}
-	}
+	for i := range pods.Items {
+		randIP := GetRandomIpAddress()
+		pods.Items[i].Status.PodIP = randIP
+		pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: randIP}}
+	}
 	return
 }
102-108: Generate valid IPv6 (ULA) instead of raw UUID bytes. Stay within fd00::/8 to avoid invalid/reserved addrs. Keeps imports the same.
 func GetRandomIpAddress() (randomIP string) {
-	u := uuid.New()
-	ip := net.IP(u[:net.IPv6len])
-	randomIP = ip.String()
+	// Generate a Unique Local Address in fd00::/8
+	u := uuid.New()
+	b := u[:net.IPv6len]
+	b[0] = 0xfd
+	ip := net.IP(b)
+	randomIP = ip.String()
 	return
 }
130-144: Bug: queue head not removed when length == 1 (repeats last simulated response). Slice must always shrink by 1 after pop.
 func (ckw *ApiConnectivityCheckWrapper) nextSimulatedPeerResponse() selfNodeRemediation.HealthCheckResponseCode {
 	ckw.responsesMu.Lock()
 	defer ckw.responsesMu.Unlock()
 	if len(ckw.simulatedPeerResponses) == 0 {
 		return selfNodeRemediation.RequestFailed
 	}
-	code := ckw.simulatedPeerResponses[0]
-	if len(ckw.simulatedPeerResponses) > 1 {
-		ckw.simulatedPeerResponses = append([]selfNodeRemediation.HealthCheckResponseCode{}, ckw.simulatedPeerResponses[1:]...)
-	}
-
-	return code
+	code := ckw.simulatedPeerResponses[0]
+	ckw.simulatedPeerResponses = ckw.simulatedPeerResponses[1:]
+	return code
 }
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
934-944: Fix err shadowing; ensure named return is set. Current code shadows err and may return nil on failure.
-	By(fmt.Sprintf("Attempting to get pod '%s' before deleting it", podName), func() {
-		if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil {
-			if apierrors.IsNotFound(err) && !throwErrorIfNotFound {
-				logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName)
-				err = nil
-				return
-			}
-
-			err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err)
-			return
-		}
-	})
+	By(fmt.Sprintf("Attempting to get pod '%s' before deleting it", podName), func() {
+		getErr := k8sClient.Client.Get(context.Background(), podKey, pod)
+		if getErr != nil {
+			if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound {
+				logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName)
+				err = nil
+				return
+			}
+			err = fmt.Errorf("unable to get pod with name '%s' in order to delete it: %w", podName, getErr)
+			return
+		}
+	})
1261-1265: Persist label rollback in cleanup to avoid test bleed-over. Cleanup only edits the local object; update the cluster too.
 DeferCleanup(func() {
 	By("Revert the unhealthy node's role to its previous value", func() {
 		unhealthyNode.Labels[labels2.MasterRole] = previousRole
+		Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(),
+			"failed to restore the unhealthy node label after test cleanup")
 	})
 })
🧹 Nitpick comments (4)
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
907-910: Fix log message: it's a pod, not a node. Minor wording nit.
-	ExpectWithOffset(1, err).To(Or(Succeed(), shared.IsK8sNotFoundError()),
-		"expected the delete operation to succeed, or for it to have told us that node '%s'"+
-			" didn't exist", pod.Name)
+	ExpectWithOffset(1, err).To(Or(Succeed(), shared.IsK8sNotFoundError()),
+		"expected the delete operation to succeed, or for it to have told us that pod '%s' didn't exist", pod.Name)
1194-1210: Verify control-plane manager lifecycle; avoid leaking goroutines. Starting the manager directly with context.Background() may not stop on test teardown. Prefer wiring it via the envtest manager (Add) or use a cancellable context.
pkg/apicheck/check.go (2)
430-433: Tighten comment and guard nil handler. Reflect actual behavior and add a safety fallback.
-// GetHealthStatusFromPeer issues a GET request to the specified IP and returns the result from the peer into the given channel
+// getHealthStatusFromPeer delegates to the configured health-check function (gRPC peer query by default).
 func (c *ApiConnectivityCheck) getHealthStatusFromPeer(endpointIp corev1.PodIP, results chan<- selfNodeRemediation.HealthCheckResponseCode) {
-	c.getHealthStatusFromRemoteFunc(endpointIp, results)
+	if c.getHealthStatusFromRemoteFunc == nil {
+		c.SetHealthStatusFunc(c.GetDefaultPeerHealthCheckFunc())
+	}
+	c.getHealthStatusFromRemoteFunc(endpointIp, results)
 }
255-272: Clarify mutually exclusive branches for 0 peers. The len(peersToAsk)==0 branch only executes when MinPeersForRemediation==0 (otherwise the prior < MinPeers check returns healthy). Consider a comment or reorder for clarity.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/controller/suite_test.go(3 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- pkg/apicheck/check.go(6 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
PR: medik8s/self-node-remediation#262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.
Applied to files:
- controllers/tests/shared/shared.go
- controllers/tests/controller/selfnoderemediation_controller_test.go
🧬 Code graph analysis (4)
pkg/apicheck/check.go (5)
pkg/controlplane/manager.go (1)
Manager(29-36)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peers/peers.go (5)
New(47-58)
Worker(29-29)
ControlPlane(30-30)
Role(26-26)
Peers(36-45)pkg/peerhealth/client.go (1)
NewClient(19-44)pkg/peers/response.go (2)
Response(3-6)
HealthyBecauseErrorsThresholdNotReached(12-12)
controllers/tests/shared/shared.go (5)
pkg/apicheck/check.go (3)
ApiConnectivityCheck(36-45)
New(68-79)
ApiConnectivityCheckConfig(49-66)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peers/peers.go (1)
New(47-58)pkg/controlplane/manager.go (1)
Manager(29-36)vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
MakeMatcher(87-128)
controllers/tests/controller/suite_test.go (1)
controllers/tests/shared/shared.go (3)
ApiConnectivityCheckWrapper(60-66)
MinPeersForRemediationConfigDefaultValue(48-48)
NewApiConnectivityCheckWrapper(110-128)
controllers/tests/controller/selfnoderemediation_controller_test.go (5)
api/v1alpha1/selfnoderemediation_types.go (4)
RemediationStrategyType(44-44)
SelfNodeRemediation(92-98)
ResourceDeletionRemediationStrategy(25-25)
SelfNodeRemediationList(103-107)controllers/tests/shared/shared.go (10)
UnhealthyNodeName(36-36)
Namespace(35-35)
GenerateTestConfig(180-192)
PeerNodeName(37-37)
Peer2NodeName(38-38)
SnrPodName2(42-42)
Peer3NodeName(39-39)
SnrPodName3(43-43)
IsK8sNotFoundError(237-253)
SnrPodName1(41-41)vendor/github.com/medik8s/common/pkg/labels/labels.go (1)
MasterRole(7-7)api/HealthCheckResponse.go (2)
Unhealthy(8-8)
HealthCheckResponseCode(3-3)pkg/controlplane/manager.go (1)
NewManager(39-47)
🔇 Additional comments (1)
controllers/tests/controller/suite_test.go (1)
169-176: LGTM: wrapper-based ApiConnectivityCheck wiring. Using the wrapper and enabling simulated peer responses by default keeps tests deterministic and avoids dialing peers.
| // TODO does this work with IPv6? | ||
| // MES: Yes it does, we've tested this | ||
| phClient, err := peerhealth.NewClient(fmt.Sprintf("%v:%v", endpointIp.IP, c.config.PeerHealthPort), c.config.PeerDialTimeout, c.config.Log.WithName("peerhealth client"), c.clientCreds) | ||
| if err != nil { | ||
| logger.Error(err, "failed to init grpc client") | ||
| results <- selfNodeRemediation.RequestFailed | ||
| return | ||
| } | ||
| defer phClient.Close() | 
Fix IPv6 dialing: wrap IPv6 literals in brackets
Without brackets, gRPC/net parsing fails for IPv6.
-		phClient, err := peerhealth.NewClient(fmt.Sprintf("%v:%v", endpointIp.IP, c.config.PeerHealthPort), c.config.PeerDialTimeout, c.config.Log.WithName("peerhealth client"), c.clientCreds)
+		addr := fmt.Sprintf("%s:%d", endpointIp.IP, c.config.PeerHealthPort)
+		// IPv6 literals must be bracketed in host:port form
+		if strings.Contains(endpointIp.IP, ":") {
+			addr = fmt.Sprintf("[%s]:%d", endpointIp.IP, c.config.PeerHealthPort)
+		}
+		phClient, err := peerhealth.NewClient(addr, c.config.PeerDialTimeout, c.config.Log.WithName("peerhealth client"), c.clientCreds)
Add import:
import "strings"🤖 Prompt for AI Agents
In pkg/apicheck/check.go around lines 93 to 101, the gRPC dialing address uses
fmt.Sprintf("%v:%v", endpointIp.IP, c.config.PeerHealthPort) which breaks for
IPv6 literals; update the code to detect IPv6 addresses (use
strings.Contains(endpointIp.IP, ":") after adding import "strings") and wrap
IPv6 literals in brackets when building the host:port (i.e., use "[IP]:port" for
IPv6, plain "IP:port" for IPv4), then pass that formatted address into
peerhealth.NewClient; keep the existing error handling and defer close.
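An alternative to hand-building the brackets is `net.JoinHostPort` from the standard library, which brackets IPv6 literals automatically. A small standalone sketch (the port value is illustrative, not the real PeerHealthPort default):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	port := "30001" // illustrative only

	// IPv4 addresses pass through unchanged; IPv6 literals get bracketed.
	fmt.Println(net.JoinHostPort("10.0.0.7", port)) // 10.0.0.7:30001
	fmt.Println(net.JoinHostPort("fd00::7", port))  // [fd00::7]:30001
}
```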
        
          
pkg/apicheck/check.go (Outdated)
      | c.config.Log.Info("isConsideredHealthy: we have checked the control plane peer responses and cross "+ | ||
| "checked it against the control plane diagnostics ", | ||
| "isControlPlaneHealthy", controlPlanePeersResponse.IsHealthy) | ||
|  | ||
| return isControlPlaneHealthy | ||
|  | 
Fix log: wrong variable under key isControlPlaneHealthy
It logs controlPlanePeersResponse.IsHealthy instead of the computed isControlPlaneHealthy.
-	c.config.Log.Info("isConsideredHealthy: we have checked the control plane peer responses and cross "+
-		"checked it against the control plane diagnostics ",
-		"isControlPlaneHealthy", controlPlanePeersResponse.IsHealthy)
+	c.config.Log.Info("isConsideredHealthy: we have checked the control plane peer responses and cross "+
+		"checked it against the control plane diagnostics ",
+		"isControlPlaneHealthy", isControlPlaneHealthy)🤖 Prompt for AI Agents
In pkg/apicheck/check.go around lines 219 to 224, the log call is passing
controlPlanePeersResponse.IsHealthy under the "isControlPlaneHealthy" key
instead of the computed isControlPlaneHealthy; update the log invocation to
supply isControlPlaneHealthy as the value for the "isControlPlaneHealthy" key
(and verify key/value ordering matches the logger's expected signature) so the
log reflects the computed result.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
pkg/controlplane/manager.go (1)
169-191: Add HTTP client timeout to avoid hangs in kubelet check. Set a reasonable timeout on the client to prevent indefinite hangs during network issues and to stabilize unit tests.
-	httpClient := &http.Client{Transport: tr}
+	httpClient := &http.Client{
+		Transport: tr,
+		Timeout:   5 * time.Second,
+	}
♻️ Duplicate comments (6)
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
1284-1289: Cleanup doesn't persist the label rollback – update the cluster state. The DeferCleanup only mutates the in-memory object; it doesn't issue an Update to persist the label change. This can leak the control-plane label into subsequent tests.
Apply:
 DeferCleanup(func() {
 	By("Revert the unhealthy node's role to its previous value", func() {
 		unhealthyNode.Labels[labels2.MasterRole] = previousRole
-	})
+		Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(),
+			"failed to restore the unhealthy node label after test cleanup")
+	})
 })
957-972: Fix error shadowing; function may return nil on Get failure. The short variable declaration shadows the named return err. Use a separate local or assign to the named return.
-	if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil {
-		if apierrors.IsNotFound(err) && !throwErrorIfNotFound {
+	getErr := k8sClient.Client.Get(context.Background(), podKey, pod)
+	if getErr != nil {
+		if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound {
 			logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName)
 			err = nil
 			return
 		}
-		err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err)
+		err = fmt.Errorf("unable to get pod with name '%s' in order to delete it: %w", podName, getErr)
 		return
 	}
controllers/tests/shared/shared.go (3)
93-100: Random IP helper forgets to update `.Status.PodIP`. `assignRandomIpAddressesPods()` populates `PodIPs`, but callers relying on the legacy singular field (`pod.Status.PodIP`) will still observe an empty value. Several components (including controller-runtime caches and some older network add-ons) still look at `PodIP` when only one address is expected.
112-118: Use proper IPv6 address generation instead of UUID bytes. Converting UUID bytes directly to IPv6 addresses may generate invalid or reserved addresses. Consider using proper IPv6 address generation that respects network standards.
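One standards-respecting option is to stay inside the `fd00::/8` Unique Local Address block, in the same spirit as the diff suggested in an earlier round. A sketch with an invented helper name:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"net"
)

// randomULAAddress returns a random IPv6 address inside fd00::/8 so test
// traffic can never collide with a globally routable address.
func randomULAAddress() (string, error) {
	b := make([]byte, net.IPv6len)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	b[0] = 0xfd // force the ULA prefix
	return net.IP(b).String(), nil
}

func main() {
	ip, err := randomULAAddress()
	if err != nil {
		panic(err)
	}
	fmt.Println(ip)
}
```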
140-154: Fix: Last simulated response is never removed from the queue. When there's exactly one element in the queue, the code extracts it but doesn't remove it from the slice. Subsequent calls will return the same response repeatedly instead of returning `RequestFailed`.
101-103: IPv6 address formatting missing brackets for gRPC dial. The TODO comment claims IPv6 works, but the code uses `fmt.Sprintf("%v:%v", endpointIp.IP, c.config.PeerHealthPort)`, which breaks for IPv6 literals in gRPC/net parsing. IPv6 addresses must be wrapped in brackets when combined with a port.
🧹 Nitpick comments (8)
pkg/apicheck/failure_tracker_test.go (1)
8-27: Solid coverage of windowed escalation; consider a couple of edge cases. Looks good. Optionally add tests for:
- window <= 0 (ShouldEscalate must be true)
- nil receiver safety (methods no-op / false)
Also applies to: 29-46, 48-63
pkg/apicheck/failure_tracker.go (1)
21-47: Clarify semantics and consider minor cleanups
- Document explicitly that window <= 0 forces escalation (true), to avoid surprises. Also document that ShouldEscalate mutates internal state (window reset on gaps).
- Not thread-safe by design; note single-goroutine expectation.
- Minor: the IsZero guard on lastFailure is redundant after consecutive==0 path; can drop it.
Diff (optional tidy-up):
-	if ft.lastFailure.IsZero() {
-		ft.lastFailure = at
-	}
Also applies to: 61-92
pkg/apicheck/check_internal_test.go (1)
16-22: Avoid unsafe reflection to mutate peers in tests. Using unsafe to set unexported fields bypasses locks and is brittle. Prefer:
- Injecting a PeersOverrideFunc in apicheck (already present in other tests), or
- Adding a test-only setter behind a build tag in peers, or
- Exposing a minimal exported method to set peer addresses for tests.
This eliminates race/breakage risks if fields or sync change.
main.go (2)
254-265: Support human-readable duration env values. getOptionalDurEnvVar only accepts integers (interpreted as nanoseconds). Consider supporting time.ParseDuration (e.g., "30s", "500ms") with an integer fallback for backward compatibility.
 func getOptionalDurEnvVar(varName string, fallback time.Duration) time.Duration {
-	val, exists := os.LookupEnv(varName)
+	val, exists := os.LookupEnv(varName)
 	if !exists || val == "" {
 		return fallback
 	}
-	parsed, err := strconv.Atoi(val)
-	if err != nil {
-		setupLog.Error(err, "failed to convert env variable to int", "var name", varName, "var value", val)
-		os.Exit(1)
-	}
-	return time.Duration(parsed)
+	// First try Go duration syntax (e.g. "30s", "500ms")
+	if dur, err := time.ParseDuration(val); err == nil {
+		return dur
+	}
+	// Fallback: raw integer as nanoseconds (legacy behavior)
+	parsed, err := strconv.Atoi(val)
+	if err != nil {
+		setupLog.Error(err, "failed to parse duration env var", "var name", varName, "var value", val)
+		os.Exit(1)
+	}
+	return time.Duration(parsed)
 }
332-365: Document fallback behavior and clarify units for timeout configs
- MaxTimeForNoPeersResponse is still required as the default/fallback in effectivePeerQuorumTimeout – don’t remove it; update docs to explain its role.
- In your configuration guide, specify the expected units/formats for MAX_API_FAILURE_WINDOW and PEER_QUORUM_TIMEOUT (e.g. duration strings like “30s”).
pkg/controlplane/manager.go (1)
74-97: Outcome-based health flow looks sane; diagnostics path depends on kubelet check. Switching on EvaluationOutcome is clear and aligns with the redesign. Ensure callers never pass unknown outcomes (you log and return false, which is safe).
Consider adding a package comment on EvaluationOutcome semantics for maintainers.
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
50-57: Avoid fixed sleeps; wait on observable conditions. Replace time.Sleep with Eventually/Consistently on the condition under setup (e.g., config presence) to make tests deterministic and faster.
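For illustration, the same pattern in plain Gomega, polling a condition instead of sleeping (a standalone sketch; the channel stands in for whatever the setup actually waits on):

```go
package demo_test

import (
	"testing"
	"time"

	. "github.com/onsi/gomega"
)

func TestEventuallyInsteadOfSleep(t *testing.T) {
	g := NewWithT(t)

	ready := make(chan struct{})
	go func() {
		time.Sleep(300 * time.Millisecond) // stand-in for "the config shows up eventually"
		close(ready)
	}()

	// Poll every 50ms for up to 2s until the condition holds, instead of a blind sleep.
	g.Eventually(func() bool {
		select {
		case <-ready:
			return true
		default:
			return false
		}
	}, 2*time.Second, 50*time.Millisecond).Should(BeTrue())
}
```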
853-882: Remove fixed sleep; rely on Eventually to observe pod creation. The sleep is redundant as the next step already waits for the pod. Drop the sleep to reduce test latency.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- controllers/tests/controller/control_plane_isolation_testdata.md(1 hunks)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- main.go(3 hunks)
- pkg/apicheck/check.go(5 hunks)
- pkg/apicheck/check_internal_test.go(1 hunks)
- pkg/apicheck/failure_tracker.go(1 hunks)
- pkg/apicheck/failure_tracker_test.go(1 hunks)
- pkg/controlplane/manager.go(3 hunks)
- pkg/controlplane/manager_internal_test.go(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
PR: medik8s/self-node-remediation#262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.
Applied to files:
- controllers/tests/shared/shared.go
🧬 Code graph analysis (8)
pkg/apicheck/failure_tracker_test.go (1)
pkg/apicheck/failure_tracker.go (1)
NewFailureTracker(17-19)
main.go (1)
pkg/reboot/calculator.go (1)
MaxTimeForNoPeersResponse(22-22)
pkg/controlplane/manager.go (1)
pkg/peers/peers.go (1)
ControlPlane(30-30)
pkg/apicheck/check_internal_test.go (3)
pkg/peers/peers.go (1)
Peers(36-45)pkg/apicheck/check.go (2)
ApiConnectivityCheckConfig(55-74)
New(76-87)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
Unhealthy(8-8)
pkg/controlplane/manager_internal_test.go (1)
pkg/controlplane/manager.go (3)
Manager(39-46)
EvaluationIsolation(34-34)
EvaluationGlobalOutage(33-33)
controllers/tests/controller/selfnoderemediation_controller_test.go (7)
api/v1alpha1/selfnoderemediation_types.go (5)
RemediationStrategyType(44-44)
SelfNodeRemediation(92-98)
AutomaticRemediationStrategy(24-24)
ResourceDeletionRemediationStrategy(25-25)
SelfNodeRemediationList(103-107)controllers/tests/shared/shared.go (11)
UnhealthyNodeName(37-37)
Namespace(36-36)
GenerateTestConfig(276-288)
PeerNodeName(38-38)
Peer2NodeName(39-39)
SnrPodName2(43-43)
Peer3NodeName(40-40)
SnrPodName3(44-44)
IsK8sNotFoundError(333-349)
SnrPodName1(42-42)
PeersOverrideFunc(76-76)pkg/peers/peers.go (3)
Role(26-26)
ControlPlane(30-30)
Peers(36-45)api/HealthCheckResponse.go (2)
Unhealthy(8-8)
HealthCheckResponseCode(3-3)vendor/github.com/medik8s/common/pkg/labels/labels.go (1)
MasterRole(7-7)pkg/controlplane/manager.go (1)
NewManager(49-57)pkg/apicheck/check.go (1)
PeersOverrideFunc(36-36)
pkg/apicheck/check.go (7)
pkg/peers/peers.go (5)
Role(26-26)
New(47-58)
Worker(29-29)
ControlPlane(30-30)
Peers(36-45)pkg/apicheck/failure_tracker.go (2)
FailureTracker(9-14)
NewFailureTracker(17-19)pkg/controlplane/manager.go (7)
Manager(39-46)
EvaluationOutcome(28-28)
EvaluationIsolation(34-34)
EvaluationAwaitQuorum(35-35)
EvaluationHealthy(31-31)
EvaluationGlobalOutage(33-33)
EvaluationRemediate(32-32)controllers/tests/shared/shared.go (1)
PeersOverrideFunc(76-76)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peerhealth/client.go (1)
NewClient(19-44)pkg/utils/peers.go (1)
GetNextBatchSize(9-31)
controllers/tests/shared/shared.go (4)
pkg/apicheck/check.go (4)
ApiConnectivityCheck(38-51)
PeersOverrideFunc(36-36)
New(76-87)
ApiConnectivityCheckConfig(55-74)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peers/peers.go (2)
Role(26-26)
New(47-58)pkg/controlplane/manager.go (1)
Manager(39-46)
🔇 Additional comments (31)
controllers/tests/controller/control_plane_isolation_testdata.md (1)
1-12: Helpful scenario documentation. Clear articulation of legacy vs redesigned expectations. LGTM.
controllers/tests/shared/shared.go (11)
6-28: LGTM: Import additions support new test infrastructure. The new imports are appropriate for the added functionality: random IP generation, concurrency primitives, custom Gomega matchers, and integration with apicheck/controlplane/peers packages.
31-50: LGTM: Test constants are well-organized. The constants provide clear defaults for test configuration and improve readability. The naming conventions are consistent.
58-59: LGTM: Test control field for random IP generation. The `ShouldReturnRandomPodIPs` field provides a clean way for tests to enable/disable random IP address generation behavior.
61-71: LGTM: Wrapper design supports comprehensive test scenarios. The `ApiConnectivityCheckWrapper` provides excellent test infrastructure with thread-safe response simulation, peer override capabilities, and baseline snapshot management for stateful tests.
73-76: LGTM: Well-documented type for test peer injection. The `PeersOverrideFunc` type and its documentation clearly explain the purpose of allowing tests to supply synthetic peer address lists.
120-138: LGTM: Constructor properly wires simulation logic. The constructor correctly initializes the wrapper and sets up the health check function override with proper delegation between simulated and default behavior.
156-215: LGTM: Thread-safe response management with snapshot capabilities. The response management methods provide comprehensive test state control with proper mutex protection. The snapshot and baseline patterns are particularly useful for complex test scenarios.
217-242: LGTM: Peer override methods with proper wrapping. The peer override methods correctly handle the test-to-production type boundary with proper wrapping and thread-safe access.
244-270: LGTM: Thread-safe timestamp recording for test verification. The peer response recording methods provide proper thread-safe access to timestamps, useful for verifying timing behavior in tests.
272-274: LGTM: Proper delegation to wrapped instance. The `ResetPeerTimers` method correctly delegates to the underlying `ApiConnectivityCheck` instance.
310-349: LGTM: Custom Gomega matchers for K8s error testing. The K8s error testing helpers provide useful custom matchers with proper Gomega integration. The implementation correctly handles nil errors and provides flexible error matching capabilities.
pkg/apicheck/check.go (19)
36-53: LGTM: Enhanced types support testability and state tracking. The new types and struct fields provide excellent support for dependency injection (GetHealthStatusFromRemoteFunc, PeersOverrideFunc) and comprehensive peer state tracking (timing fields, FailureTracker). This design improves both testability and production functionality.
72-73: LGTM: Configuration fields for timeout control. The new `FailureWindow` and `PeerQuorumTimeout` fields provide fine-grained control over timeout behavior in the health evaluation logic.
76-87: LGTM: Constructor properly initializes new components. The constructor correctly initializes the failure tracker and sets the default health status function, establishing a clean default configuration.
134-146: LGTM: Clean accessor methods with proper synchronization. The control plane manager and peers override methods provide straightforward access with appropriate mutex protection for the override.
186-189: LGTM: Proper failure tracker reset with defensive nil check. The code correctly resets the failure tracker after a successful API call, with appropriate nil checking.
199-224: LGTM: Well-structured health evaluation with role separation. The refactored `isConsideredHealthy` method cleanly separates worker and control-plane evaluation paths with proper failure tracking and delegation to the control plane manager.
226-251: LGTM: Peer summary abstraction improves readability. The `peerSummary` type and helper methods provide a clean abstraction for aggregating peer response data with clear logic for majority calculations.
253-293: LGTM: Comprehensive worker evaluation logic. The `evaluateWorker` method handles all major scenarios with clear logic: isolation, health, escalation to control planes, global outage detection, and quorum awaiting. The timeout handling is appropriate.
295-343: LGTM: Control-plane evaluation with proper prioritization. The `evaluateControlPlane` method correctly prioritizes worker feedback for global outage detection and escalation scenarios, then evaluates control-plane peers with appropriate fallback to worker responses.
345-372: LGTM: Consistent escalation logic. The `escalateToControlPlanes` method provides consistent escalation logic with the same structure as the evaluation methods, ensuring predictable behavior.
374-419: LGTM: Well-structured peer response gathering with batching. The `gatherPeerResponses` method efficiently gathers responses using batching, properly records timing data, and provides useful logging for debugging. The implementation is clear and correct.
421-441: LGTM: Safe peer listing with override support. The `listPeers` method properly handles overrides and makes defensive copies to prevent mutation issues. The nil checks are appropriate.
443-463: LGTM: Clean peer timing state management. The peer activity and silence recording methods provide clean role-based tracking with proper zero-time checking for silence start timestamps.
465-470: LGTM: Complete timer reset utility. The `ResetPeerTimers` method correctly resets all peer timing fields, useful for test state management.
472-495: LGTM: Clear outcome interpretation helpers. The outcome helper methods provide clear interpretation of evaluation outcomes with consistent logic: Remediate and Isolation are unhealthy, all others are healthy.
497-525: LGTM: Defensive calculation of effective configuration values. The effective value methods provide proper fallback logic and defensive initialization, ensuring sensible defaults when configuration values are not explicitly set.
527-545: LGTM: Correct peer timeout detection. The `peerTimeoutExceeded` method correctly checks role-specific silence timestamps against the configured timeout with proper zero-time handling.
605-617: LGTM: Clean dependency injection for health checks. The health status function methods provide clean dependency injection with a proper getter/setter/delegation pattern, enabling comprehensive testing.
578-579: LGTM: Useful logging for peer health queries. The added log statement provides helpful context when querying peer health status, aiding in debugging.
| By(fmt.Sprintf("Clear node state for '%s'", shared.UnhealthyNodeName), func() { | ||
| Expect(k8sClient.Update(context.Background(), getNode(shared.UnhealthyNodeName))) | ||
| }) | ||
|  | ||
| By(fmt.Sprintf("Clear node state for '%s'", shared.PeerNodeName), func() { | ||
| Expect(k8sClient.Update(context.Background(), getNode(shared.PeerNodeName))) | ||
| }) | ||
|  | ||
| time.Sleep(time.Second * 2) | 
Missing assertion and likely invalid Update (resourceVersion)
- Expect(...) is not followed by .To(Succeed()), so failures aren’t asserted.
- Updating a freshly constructed object (getNode(...)) typically lacks resourceVersion; Update will fail. Fetch the node and patch/merge instead.
Example:
-Expect(k8sClient.Update(context.Background(), getNode(shared.UnhealthyNodeName)))
+node := &v1.Node{}
+Expect(k8sClient.Get(context.Background(), unhealthyNodeNamespacedName, node)).To(Succeed())
+clean := getNode(shared.UnhealthyNodeName)
+patch := client.MergeFrom(node.DeepCopy())
+node.Spec = clean.Spec
+node.Annotations = clean.Annotations
+node.Labels = clean.Labels
+Expect(k8sClient.Patch(context.Background(), node, patch)).To(Succeed())
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| By(fmt.Sprintf("Clear node state for '%s'", shared.UnhealthyNodeName), func() { | |
| Expect(k8sClient.Update(context.Background(), getNode(shared.UnhealthyNodeName))) | |
| }) | |
| By(fmt.Sprintf("Clear node state for '%s'", shared.PeerNodeName), func() { | |
| Expect(k8sClient.Update(context.Background(), getNode(shared.PeerNodeName))) | |
| }) | |
| time.Sleep(time.Second * 2) | |
| By(fmt.Sprintf("Clear node state for '%s'", shared.UnhealthyNodeName), func() { | |
| // Fetch the live object so we have a resourceVersion | |
| node := &v1.Node{} | |
| Expect(k8sClient.Get(context.Background(), unhealthyNodeNamespacedName, node)).To(Succeed()) | |
| // Prepare a clean copy of the desired state | |
| clean := getNode(shared.UnhealthyNodeName) | |
| patch := client.MergeFrom(node.DeepCopy()) | |
| node.Spec = clean.Spec | |
| node.Annotations = clean.Annotations | |
| node.Labels = clean.Labels | |
| // Apply the patch and assert success | |
| Expect(k8sClient.Patch(context.Background(), node, patch)).To(Succeed()) | |
| }) | |
| By(fmt.Sprintf("Clear node state for '%s'", shared.PeerNodeName), func() { | |
| Expect(k8sClient.Update(context.Background(), getNode(shared.PeerNodeName))).To(Succeed()) | |
| }) | |
| time.Sleep(time.Second * 2) | 
🤖 Prompt for AI Agents
In controllers/tests/controller/selfnoderemediation_controller_test.go around
lines 63 to 71, the test currently calls
Expect(k8sClient.Update(context.Background(), getNode(...))) without asserting
the result and tries to Update a freshly constructed node object (missing
resourceVersion) which will fail; change the Expect calls to
Expect(...).To(Succeed()) and replace the blind Update of a new object with a
proper fetch-and-modify or a Patch: first Get the existing node from the API
(populate resourceVersion), mutate the fields you need, then call Update and
assert .To(Succeed()), or create a MergeFrom patch by fetching the original
object, applying changes to a copy, and call client.Patch(..., patch) with
Expect(...).To(Succeed()).
Actionable comments posted: 0
♻️ Duplicate comments (2)
controllers/tests/controller/selfnoderemediation_controller_test.go (1)
972-981: Fix error shadowing and wrap original error. Inner err shadows the named return; callers may receive nil on failure. Use a separate variable and %w.
-	By(fmt.Sprintf("Attempting to get pod '%s' before deleting it", podName), func() {
-		if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil {
-			if apierrors.IsNotFound(err) && !throwErrorIfNotFound {
-				logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName)
-				err = nil
-				return
-			}
-
-			err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err)
-			return
-		}
-	})
+	By(fmt.Sprintf("Attempting to get pod '%s' before deleting it", podName), func() {
+		getErr := k8sClient.Client.Get(context.Background(), podKey, pod)
+		if getErr != nil {
+			if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound {
+				logf.Log.Info("pod not found; skipping delete", "name", podName)
+				err = nil
+				return
+			}
+			err = fmt.Errorf("unable to get pod with name %q in order to delete it: %w", podName, getErr)
+			return
+		}
+	})
controllers/tests/shared/shared.go (1)
150-164: Bug: last simulated response never dequeued (repeats same code). When length is 1, the slice isn't advanced; subsequent calls replay the same response and can mask "no more responses" cases. Fix by always removing index 0.
 func (ckw *ApiConnectivityCheckWrapper) nextSimulatedPeerResponse() selfNodeRemediation.HealthCheckResponseCode {
 	ckw.responsesMu.Lock()
 	defer ckw.responsesMu.Unlock()
 	if len(ckw.simulatedPeerResponses) == 0 {
 		return selfNodeRemediation.RequestFailed
 	}
 	code := ckw.simulatedPeerResponses[0]
-	if len(ckw.simulatedPeerResponses) > 1 {
-		ckw.simulatedPeerResponses = append([]selfNodeRemediation.HealthCheckResponseCode{}, ckw.simulatedPeerResponses[1:]...)
-	}
+	// Always consume the head element
+	ckw.simulatedPeerResponses = ckw.simulatedPeerResponses[1:]
 	return code
 }
🧹 Nitpick comments (6)
controllers/tests/shared/shared.go (2)
114-125: Avoid panic in test helper; return deterministic fallback on RNG error. Panic here can flake tests. Prefer a safe fallback.
 func GetRandomIpAddress() (randomIP string) {
 @@
-	if _, err := rand.Read(bytes[1:]); err != nil {
-		panic(err)
-	}
+	if _, err := rand.Read(bytes[1:]); err != nil {
+		logf.Log.Error(err, "random IPv6 generation failed; using fallback fd00::1")
+		return "fd00::1"
+	}
 	randomIP = net.IP(bytes).String()
 	return
 }
47-49: Unused constant. K8sClientReturnRandomPodIPAddressesByDefault is never referenced. Remove to reduce noise.
-	K8sClientReturnRandomPodIPAddressesByDefault = false
pkg/controlplane/manager.go (1)
169-195: Treat non-2xx/3xx kubelet responses as unhealthy. HTTP success should depend on the status code, not only on network success.
 resp, err := httpClient.Do(req)
 if err != nil {
 	manager.log.Error(err, "kubelet service is down", "node name", manager.nodeName)
 	return false
 }
 defer resp.Body.Close()
-return true
+if resp.StatusCode >= 200 && resp.StatusCode < 400 {
+	return true
+}
+manager.log.Info("kubelet responded with non-success status",
+	"node name", manager.nodeName, "status", resp.StatusCode)
+return false
controllers/tests/controller/selfnoderemediation_controller_test.go (3)
585-587: Remove misleading comment. “This is expected to fail” contradicts the assertion and confuses future readers. Drop it.
-	// It's expected that the next line will fail, even though it shouldn't!
 	verifyWatchdogTriggered()
886-889: Avoid fixed sleep; rely on Eventually already below. The explicit 1s sleep is unnecessary and can slow tests; the subsequent Eventually handles readiness.
-	time.Sleep(1 * time.Second)
-
 	verifySelfNodeRemediationPodByExistsByName(podName)
1255-1261: Replace config-change sleep with an assertion. Sleeping 1s after setting MinPeersForRemediation is brittle. Assert the value instead.
-	time.Sleep(1 * time.Second)
+	Eventually(func() int {
+		return apiConnectivityCheckConfig.MinPeersForRemediation
+	}, 3*time.Second, 100*time.Millisecond).Should(Equal(minimumNumberOfPeers))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- pkg/controlplane/manager.go(4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
PR: medik8s/self-node-remediation#262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.
Applied to files:
- controllers/tests/shared/shared.go
🧬 Code graph analysis (3)
controllers/tests/shared/shared.go (5)
pkg/apicheck/check.go (4)
ApiConnectivityCheck(38-51)
PeersOverrideFunc(36-36)
New(76-87)
ApiConnectivityCheckConfig(55-74)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peers/peers.go (2)
Role(26-26)
New(47-58)pkg/controlplane/manager.go (1)
Manager(39-46)vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
MakeMatcher(87-128)
controllers/tests/controller/selfnoderemediation_controller_test.go (7)
api/v1alpha1/selfnoderemediation_types.go (4)
RemediationStrategyType(44-44)
SelfNodeRemediation(92-98)
AutomaticRemediationStrategy(24-24)
ResourceDeletionRemediationStrategy(25-25)controllers/tests/shared/shared.go (11)
UnhealthyNodeName(37-37)
Namespace(36-36)
GenerateTestConfig(286-298)
PeerNodeName(38-38)
Peer2NodeName(39-39)
SnrPodName2(43-43)
Peer3NodeName(40-40)
SnrPodName3(44-44)
IsK8sNotFoundError(343-359)
SnrPodName1(42-42)
PeersOverrideFunc(76-76)pkg/peers/peers.go (3)
Role(26-26)
ControlPlane(30-30)
Peers(36-45)api/HealthCheckResponse.go (2)
Unhealthy(8-8)
HealthCheckResponseCode(3-3)vendor/github.com/medik8s/common/pkg/labels/labels.go (1)
MasterRole(7-7)pkg/controlplane/manager.go (1)
NewManager(49-57)pkg/apicheck/check.go (1)
PeersOverrideFunc(36-36)
pkg/controlplane/manager.go (1)
pkg/peers/peers.go (1)
ControlPlane(30-30)
🔇 Additional comments (4)
controllers/tests/shared/shared.go (3)
104-110: LGTM: PodIP and PodIPs kept in sync. Mirrors the first IP into both fields; avoids consumers seeing an empty PodIP.
135-145: Single response per peer — good change. Wrapper enforces one send per peer; prevents bounded-channel deadlocks.
322-341: Custom Gomega matcher for ignored k8s errors — looks good. Clear, composable, and nil-safe.
pkg/controlplane/manager.go (1)
28-36: API migration verified: all call sites use the new EvaluationOutcome signature. Verification confirms no stale two-argument invocations remain. Three call sites were found (two in manager_internal_test.go, one in apicheck/check.go), all correctly using the new single-argument signature. The function implementation handles all five evaluation outcomes with a proper default case for unknown values.
Also, it DID NOT like working with a PowerShell wrapper.
| PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. | 
Actionable comments posted: 3
♻️ Duplicate comments (1)
controllers/tests/shared/shared.go (1)
102-107: Legacy PodIP now mirrored — LGTM. Setting both Status.PodIP and Status.PodIPs fixes earlier compatibility issues. Good.
🧹 Nitpick comments (4)
controllers/tests/shared/shared.go (2)
149-161: Avoid extra allocation when popping the head. Use slice re-slicing instead of append-copy to trim the consumed element.
 func (ckw *ApiConnectivityCheckWrapper) nextSimulatedPeerResponse() selfNodeRemediation.HealthCheckResponseCode {
 	ckw.responsesMu.Lock()
 	defer ckw.responsesMu.Unlock()
 	if len(ckw.simulatedPeerResponses) == 0 {
 		return selfNodeRemediation.RequestFailed
 	}
 	code := ckw.simulatedPeerResponses[0]
-	ckw.simulatedPeerResponses = append([]selfNodeRemediation.HealthCheckResponseCode{}, ckw.simulatedPeerResponses[1:]...)
+	ckw.simulatedPeerResponses = ckw.simulatedPeerResponses[1:]
 	return code
 }
76-89: Sharper failure simulation and safer defaults
- If SimulatedFailureMessage is empty, return a sensible default to aid debugging.
- Keep behavior as-is; randomization already gated to PodList. Optional: restrict randomization to SNR pods via label selector to reduce side effects.
 case kcw.ShouldSimulatePodDeleteFailure:
 	if _, ok := list.(*corev1.NamespaceList); ok {
-		err = errors.New(kcw.SimulatedFailureMessage)
+		msg := kcw.SimulatedFailureMessage
+		if msg == "" {
+			msg = "simulated API error (namespace list)"
+		}
+		err = errors.New(msg)
 		return
 	}
 	fallthrough
Also applies to: 91-100
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
1195-1212: Stop the control-plane manager on cleanup. Start() may spawn background goroutines; add a cancellable context or Stop if available to avoid leaks between specs.
 func addControlPlaneManager() {
 	By("Add a control plane manager", func() {
-		controlPlaneMgr := controlplane.NewManager(shared.UnhealthyNodeName, k8sClient)
-		Expect(controlPlaneMgr.Start(context.Background())).To(Succeed(), "we should"+
+		controlPlaneMgr := controlplane.NewManager(shared.UnhealthyNodeName, k8sClient)
+		ctx, cancel := context.WithCancel(context.Background())
+		Expect(controlPlaneMgr.Start(ctx)).To(Succeed(), "we should"+
 			"have been able to enable a control plane manager for the current node")
 		apiCheck.SetControlPlaneManager(controlPlaneMgr)
 		Expect(apiConnectivityCheckConfig.Peers.UpdateControlPlanePeers(context.Background())).To(Succeed())
 		DeferCleanup(func() {
 			By("Removing the control plane manager", func() {
+				cancel() // stop background workers
 				apiCheck.SetControlPlaneManager(nil)
 				Expect(apiConnectivityCheckConfig.Peers.UpdateControlPlanePeers(context.Background())).To(Succeed())
 			})
 		})
 	})
 }
936-962: Use Delete option helper directly. Nit: you can pass the option inline without the temporary var for clarity.
-	var grace client.GracePeriodSeconds = 0
-	err := k8sClient.Client.Delete(context.Background(), pod, grace)
+	err := k8sClient.Client.Delete(context.Background(), pod, client.GracePeriodSeconds(0))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- controllers/tests/controller/selfnoderemediation_controller_test.go(11 hunks)
- controllers/tests/shared/shared.go(3 hunks)
- pkg/controlplane/manager.go(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/controlplane/manager.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
PR: medik8s/self-node-remediation#262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.
Applied to files:
- controllers/tests/shared/shared.go
🧬 Code graph analysis (2)
controllers/tests/controller/selfnoderemediation_controller_test.go (7)
api/v1alpha1/selfnoderemediation_types.go (4)
RemediationStrategyType(44-44)
SelfNodeRemediation(92-98)
AutomaticRemediationStrategy(24-24)
ResourceDeletionRemediationStrategy(25-25)controllers/tests/shared/shared.go (11)
UnhealthyNodeName(37-37)
Namespace(36-36)
GenerateTestConfig(283-295)
PeerNodeName(38-38)
Peer2NodeName(39-39)
SnrPodName2(43-43)
Peer3NodeName(40-40)
SnrPodName3(44-44)
IsK8sNotFoundError(320-336)
SnrPodName1(42-42)
PeersOverrideFunc(74-74)pkg/peers/peers.go (3)
Role(26-26)
ControlPlane(30-30)
Peers(36-45)api/HealthCheckResponse.go (1)
Unhealthy(8-8)vendor/github.com/medik8s/common/pkg/labels/labels.go (1)
MasterRole(7-7)pkg/controlplane/manager.go (1)
NewManager(49-57)pkg/apicheck/check.go (1)
PeersOverrideFunc(36-36)
controllers/tests/shared/shared.go (5)
pkg/apicheck/check.go (4)
ApiConnectivityCheck(38-51)
PeersOverrideFunc(36-36)
New(76-87)
ApiConnectivityCheckConfig(55-74)api/HealthCheckResponse.go (2)
HealthCheckResponseCode(3-3)
RequestFailed(6-6)pkg/peers/peers.go (2)
Role(26-26)
New(47-58)pkg/controlplane/manager.go (1)
Manager(39-46)vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
MakeMatcher(87-128)
🔇 Additional comments (4)
controllers/tests/shared/shared.go (2)
112-124: Valid ULA IPv6 generation — LGTM. Using fd00::/8 with crypto/rand is correct; reasonable fallback on error.
319-336: Custom matcher for IsNotFound — LGTM. Matches error semantics cleanly using gcustom.MakeMatcher.
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
543-583: Order-of-operations bug: responses appended before enabling simulation get wiped. In the “two control node peers found” setup you append responses in addNodes, then later enable simulation and reset timers, which clears baselines. Move configureSimulatedPeerResponses(true) before addNodes(...) or stop clearing on enable (see prior comment).
Also applies to: 1394-1410
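A sketch of the first option (reordering the setup), reusing the helpers already defined in this test file; the node arguments shown are illustrative and may not match addNodes' actual signature.

BeforeEach(func() {
	// Enable simulation first so responses staged while nodes are added
	// are not wiped by the enable/reset step.
	configureSimulatedPeerResponses(true)
	addNodes(shared.PeerNodeName, shared.Peer2NodeName) // stages per-peer responses
})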
1065-1101: Strict “exactly two nodes” post-condition can flake if cleanup fails

If a prior test leaves an extra peer due to a transient API error, this hard check will fail unrelated specs. Consider filtering by known names or asserting a superset condition.
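For example, the assertion could be relaxed to check only for the known nodes. A sketch, assuming a controller-runtime client and corev1 "k8s.io/api/core/v1"; variable names are illustrative.

nodeList := &corev1.NodeList{}
Expect(k8sClient.Client.List(context.Background(), nodeList)).To(Succeed())

names := make([]string, 0, len(nodeList.Items))
for _, node := range nodeList.Items {
	names = append(names, node.Name)
}
// Require the known nodes to be present rather than demanding an exact count.
Expect(names).To(ContainElements(shared.UnhealthyNodeName, shared.PeerNodeName))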
| It("check that we actually get a triggered watchdog reboot", func() { | ||
| // It's expected that the next line will fail, even though it shouldn't! | ||
| verifyWatchdogTriggered() | ||
| }) | 
Remove “expected to fail” assertion or fix the expectation
Tests must be deterministic. Either make verifyWatchdogTriggered() reliably true for this setup or flip to verifyWatchdogNotTriggered() with matching setup. Leaving a known-failing assertion will break CI.
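If option (B) reflects the intended behavior, the spec could simply become the following (a sketch, assuming the surrounding setup keeps the node healthy and that verifyWatchdogNotTriggered exists as mentioned above):

It("check that the watchdog is not triggered", func() {
	verifyWatchdogNotTriggered()
})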
🤖 Prompt for AI Agents
In controllers/tests/controller/selfnoderemediation_controller_test.go around lines 584 to 587, the test currently calls verifyWatchdogTriggered() but the comment indicates it is expected to fail; make the test deterministic by either (A) ensuring the test setup actually triggers the watchdog before calling verifyWatchdogTriggered() — e.g., configure the fake/mock watchdog state, timeouts or events so the trigger condition is met — or (B) if the intended behavior is that the watchdog should not be triggered in this scenario, replace the call with verifyWatchdogNotTriggered() and adjust any setup/fixtures to match that expectation; also remove/update the misleading comment so CI no longer contains a known-failing assertion.
func configureSimulatedPeerResponses(simulateResponses bool) {
	By("Start simulating peer responses", func() {
		orgValue := apiCheck.ShouldSimulatePeerResponses
		orgResponses := apiCheck.SnapshotSimulatedPeerResponses()
		apiCheck.ShouldSimulatePeerResponses = simulateResponses
		if simulateResponses {
			apiCheck.ClearSimulatedPeerResponses()
		}

		DeferCleanup(func() {
			apiCheck.ShouldSimulatePeerResponses = orgValue
			apiCheck.RestoreSimulatedPeerResponses(orgResponses)
			apiCheck.RememberSimulatedPeerResponses()
			if !simulateResponses {
				apiCheck.SetPeersOverride(nil)
			}
		})
	})
}
🛠️ Refactor suggestion | 🟠 Major
Simulated response lifecycle is inverted; do not clear on enable and don’t couple override cleanup here
- Enabling simulation currently clears pre-staged responses, breaking scenarios that append before toggling.
- Cleanup clears PeersOverride only when simulateResponses is false — backwards.
 func configureSimulatedPeerResponses(simulateResponses bool) {
   By("Start simulating peer responses", func() {
     orgValue := apiCheck.ShouldSimulatePeerResponses
     orgResponses := apiCheck.SnapshotSimulatedPeerResponses()
     apiCheck.ShouldSimulatePeerResponses = simulateResponses
-    if simulateResponses {
-      apiCheck.ClearSimulatedPeerResponses()
-    }
+    // Preserve any previously staged responses when enabling simulation.
+    // Callers can ClearSimulatedPeerResponses() explicitly if needed.
     DeferCleanup(func() {
       apiCheck.ShouldSimulatePeerResponses = orgValue
       apiCheck.RestoreSimulatedPeerResponses(orgResponses)
-      apiCheck.RememberSimulatedPeerResponses()
-      if !simulateResponses {
-        apiCheck.SetPeersOverride(nil)
-      }
+      // Leave PeersOverride restoration to configurePeersOverride’s own cleanup.
     })
   })
 }Also applies to: 1305-1321
🤖 Prompt for AI Agents
controllers/tests/controller/selfnoderemediation_controller_test.go around lines 1279-1297 (and apply the same change to 1305-1321): the code currently clears pre-staged simulated responses when enabling simulation and only clears PeersOverride on cleanup when simulateResponses is false — both inverted. Change the toggle logic so that enabling simulation (simulateResponses == true) does NOT call ClearSimulatedPeerResponses (preserve pre-staged responses) and disabling simulation (simulateResponses == false) clears simulated responses as needed; also invert the DeferCleanup condition so SetPeersOverride(nil) runs when we previously enabled simulation (simulateResponses == true) rather than when it was false. Ensure the same fixes are applied to the other block at 1305-1321.
func configurePeersOverride(fn shared.PeersOverrideFunc) {
	By("Configure peer override", func() {
		apiCheck.SetPeersOverride(fn)
	})
}
Peers override leaks across tests — add restoration in cleanup
configurePeersOverride sets an override but never restores it, polluting later specs.
 func configurePeersOverride(fn shared.PeersOverrideFunc) {
   By("Configure peer override", func() {
-    apiCheck.SetPeersOverride(fn)
+    prev := apiCheck.PeersOverride()
+    apiCheck.SetPeersOverride(fn)
+    DeferCleanup(func() {
+      apiCheck.SetPeersOverride(prev)
+    })
   })
 }

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
func configurePeersOverride(fn shared.PeersOverrideFunc) {
	By("Configure peer override", func() {
		prev := apiCheck.PeersOverride()
		apiCheck.SetPeersOverride(fn)
		DeferCleanup(func() {
			apiCheck.SetPeersOverride(prev)
		})
	})
}
🤖 Prompt for AI Agents
In controllers/tests/controller/selfnoderemediation_controller_test.go around lines 1299-1303, configurePeersOverride currently sets a global override without restoring it, leaking state across specs; change it to capture the current override before calling apiCheck.SetPeersOverride(fn) and register a cleanup to restore the previous value (using Ginkgo's DeferCleanup or an AfterEach) so the original override is reinstated after the spec finishes.
Why we need this PR
A unit test that demonstrates the problem described in issue #251, along with a potential fix.

Please feel free to disregard as much or as little of this as you want; hopefully the unit test refactoring is seen as useful, and I believe it sufficiently highlights the core issue. I separated the work into multiple commits so you can look at the branch history and see the individual updates prior to the squash commit that will happen to main. The intent is to make the delta of the actual code fix visible on its own; it is relatively small and could be pulled in separately in the worst case.

Note: I created an official account for my Dell contributions; I'm still the same Mark who originally opened issue #251.
Changes made
Which issue(s) this PR fixes
Fixes #251 & RHWA-384
Test plan
Summary by CodeRabbit
New Features
Tests
Bug Fixes
Refactor
Style
Chores