Fix nil pointer panic and spurious Auto Mode updates #171
Hi @demikl. Thanks for your PR. I'm waiting for a aws-controllers-k8s member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Auto Mode Configuration Logic
This PR adds proper validation and handling for EKS Auto Mode, which has specific requirements from AWS:
Auto Mode Requirements:
Changes Made:
Breaking Change: Custom Resources with partial Auto Mode configurations that previously succeeded will now fail validation with clear error messages, guiding users toward correct configurations. This ensures the controller behavior matches AWS Auto Mode requirements and prevents the nil pointer crashes reported in issue aws-controllers-k8s/community#2619.
8290936 to 85a97c1
/ok-to-test |
Hey @demikl, thank you for this! Can you add tests for the auto-mode behavior? One that creates a non-auto cluster and switches it to Auto Mode, and one that tries to turn Auto Mode off with incorrect parameters.
/retest |
Hi @rushmash91, I’m seeing behavior in the Auto Mode activation test that I can’t explain: Test flow (
Hi @demikl , Questions: I usually test the api describe and update behavior directly via the CLI to see if there are any caveats being missed. |
Thanks for the clarification, @rushmash91. I can reliably reproduce the expected behavior (DescribeCluster showing the three Auto Mode sections) when running the same workflow locally/manually. Because I still can’t determine why the Prow run’s DescribeCluster response omits those sections after a Successful AutoModeUpdate, I’ve updated the test to assert the transition using list-updates + describe-update only (type=AutoModeUpdate, status=Successful, three params all enabled). This has been consistent across runs. Let me know if you’d prefer that I keep a (soft) DescribeCluster check as a best-effort, or leave it as-is with the update-based validation. |
/retest |
Hi, do you need anything more from me for this PR to be accepted? |
Hey @demikl , thank you! this is great! |
- Add isAutoModeCluster() function to detect valid Auto Mode configurations
- Add validateAutoModeConfig() to enforce AWS requirement that compute, storage, and load balancing must all be enabled/disabled together
- Only call updateComputeConfig() for actual Auto Mode clusters
- Ignore elasticLoadBalancing absent vs false diffs for non-Auto Mode clusters
Fixes aws-controllers-k8s/community#2619
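The all-or-nothing rule in this commit message can be illustrated with a small, self-contained Go sketch. This is not the PR's code: the `AutoModeSpec` type and its field names are hypothetical stand-ins for the three Auto Mode sections of the Cluster spec, and the two function names are borrowed from the commit message only for readability. It assumes that an unset (nil) flag counts as disabled.

```go
package main

import (
	"errors"
	"fmt"
)

// AutoModeSpec is a hypothetical, simplified stand-in for the three Auto Mode
// sections of the Cluster spec. A nil pointer means the flag was not set.
type AutoModeSpec struct {
	ComputeEnabled       *bool
	BlockStorageEnabled  *bool
	LoadBalancingEnabled *bool
}

// enabled treats a missing (nil) flag as false, so it never dereferences nil.
func enabled(b *bool) bool {
	return b != nil && *b
}

// validateAutoModeConfig enforces the rule that compute, storage, and load
// balancing must be enabled together or disabled together.
func validateAutoModeConfig(s AutoModeSpec) error {
	c := enabled(s.ComputeEnabled)
	st := enabled(s.BlockStorageEnabled)
	lb := enabled(s.LoadBalancingEnabled)
	if c == st && st == lb {
		return nil
	}
	return errors.New("auto mode: compute, storage, and load balancing must all be enabled or all be disabled")
}

// isAutoModeCluster reports whether the spec is a valid, fully enabled
// Auto Mode configuration. Partial configurations return an error.
func isAutoModeCluster(s AutoModeSpec) (bool, error) {
	if err := validateAutoModeConfig(s); err != nil {
		return false, err
	}
	return enabled(s.ComputeEnabled), nil
}

func main() {
	yes, no := true, false
	full := AutoModeSpec{ComputeEnabled: &yes, BlockStorageEnabled: &yes, LoadBalancingEnabled: &yes}
	partial := AutoModeSpec{ComputeEnabled: &yes, BlockStorageEnabled: &no}

	for _, s := range []AutoModeSpec{full, partial, {}} {
		ok, err := isAutoModeCluster(s)
		fmt.Println(ok, err)
	}
}
```

Returning an error for partial configurations lets the caller decide how to surface it, for example by wrapping it in a terminal error so the resource is not retried indefinitely.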
a00dd91 to d5a11c1
I've merged the tests as requested. It looks like there is a flaky test that impacts my PR checks 😞 : test_cluster_adopt_update |
/test eks-kind-e2e |
Hi @demikl ,
Thank you! The tests look good; I left a few small nits.
/retest |
Hey @demikl, I see that your tests are failing. We can merge this if they pass. The changes look good to me 🙂
In the latest run, the two failing tests are outside the scope of my changes. I hope the current state is OK for you?
/retest |
/test eks-kind-e2e |
b2c711b to 3a92e5c
3a92e5c to 3b5856f
@demikl: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: a-hilaly, demikl The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Left a couple questions.
return returnClusterUpdating(updatedRes)
// If not Auto Mode, ignore the diff
rlog.Info("ignoring diff on compute/storage/network config for non-Auto Mode cluster")
Q: Will this not result in the delta still being present in the next reconcile loop? Might be tough to avoid this if the API is returning invalid auto-mode flag combinations unless we treat nil as equal to false in the delta comparison.
Yes, we will see the delta in the logs, but the update payload will not be sent. Any suggestion on what should be done instead?
If we're treating nil as false in the validation logic, we could do the same in the delta comparison. That way a partially false set of flags won't register as a diff.
We would need to validate that nil is equivalent to false in the EKS service as well though.
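The suggestion in this thread, comparing flags with nil normalized to false, could look roughly like the following Go sketch. The helper names `boolOrFalse` and `sameFlag` are hypothetical and not part of the controller or the ACK runtime. The sketch also bakes in the assumption raised above, which still needs to be confirmed against the service: that EKS treats an absent flag the same as false.

```go
package main

import "fmt"

// boolOrFalse returns the pointed-to value, treating nil as false.
// Assumption: EKS treats an absent flag the same as an explicit false.
func boolOrFalse(b *bool) bool {
	return b != nil && *b
}

// sameFlag reports whether the desired and latest values match once nil is
// normalized to false, so "absent" versus "false" is not counted as a diff.
func sameFlag(desired, latest *bool) bool {
	return boolOrFalse(desired) == boolOrFalse(latest)
}

func main() {
	yes, no := true, false
	fmt.Println(sameFlag(nil, &no))   // absent vs false
	fmt.Println(sameFlag(nil, &yes))  // absent vs true
	fmt.Println(sameFlag(&yes, &yes)) // true vs true
}
```

With a comparison like this, a spec that omits a flag and an API response that reports it as false would no longer produce a delta on every reconcile loop.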
if err != nil {
	return nil, ackerr.NewTerminalError(err)
}
if isAutoMode {
Q: If this is false due to a user removing the auto-mode flags do we need to take any action? As-is we won't send any API request and leave the EKS cluster with whatever values were already present.
Yup, if it's an invalid auto-mode payload, it's not sent. No action is taken apart from logging it.
It seems like this could lead to some odd behavior from a user's perspective. Here's a sequence of events that could happen with this logic.
- User creates a cluster without any auto-mode configs set. Cluster created in EKS has auto-mode disabled as expected.
- User adds auto-mode configs with all values true. Cluster is modified to use auto-mode as expected.
- User decides they don't want auto-mode and rolls back the ACK resource to the original Spec. We log that we are ignoring the diff for a non-automode cluster. However, the actual cluster in EKS still has auto-mode enabled.
Should we still make the API request?
That would make this change still safe even if the API behavior changes
We can mark the encountered error terminal for now
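One way to reason about the rollback sequence described in this thread is to decide the action from both the desired spec and the latest observed state, rather than from the desired spec alone. The Go sketch below is only an illustration of that idea under stated assumptions: the `flags` type, the `action` values, and `decide` are hypothetical names, nil is assumed to mean disabled, and nothing here reflects what the PR actually merged.

```go
package main

import (
	"errors"
	"fmt"
)

// flags is a hypothetical holder for the three Auto Mode toggles.
type flags struct{ compute, storage, loadBalancing *bool }

// on treats a nil flag as disabled (an assumption, see the thread above).
func on(b *bool) bool { return b != nil && *b }

func (f flags) allOn() bool  { return on(f.compute) && on(f.storage) && on(f.loadBalancing) }
func (f flags) allOff() bool { return !on(f.compute) && !on(f.storage) && !on(f.loadBalancing) }

type action string

const (
	actionNone    action = "none"
	actionEnable  action = "enable"
	actionDisable action = "disable"
)

var errPartial = errors.New("partial Auto Mode configuration")

// decide compares the desired spec with the latest observed state.
// A rollback to an empty spec on a cluster that still has Auto Mode on
// yields actionDisable instead of being silently ignored.
func decide(desired, latest flags) (action, error) {
	switch {
	case !desired.allOn() && !desired.allOff():
		return actionNone, errPartial // candidate for a terminal error
	case desired.allOn() && !latest.allOn():
		return actionEnable, nil
	case desired.allOff() && !latest.allOff():
		return actionDisable, nil
	default:
		return actionNone, nil
	}
}

func main() {
	yes := true
	unset := flags{}
	allOn := flags{compute: &yes, storage: &yes, loadBalancing: &yes}

	fmt.Println(decide(unset, unset)) // step 1: create without Auto Mode
	fmt.Println(decide(allOn, unset)) // step 2: user enables Auto Mode
	fmt.Println(decide(unset, allOn)) // step 3: user rolls back the spec
}
```

The third call corresponds to the rollback step in the sequence above: because the observed cluster still has all three flags on, the sketch asks for an explicit disable request.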
Description
Fixes nil pointer panic in updateComputeConfig and prevents spurious Auto Mode updates for non-Auto Mode clusters
Related Issue
Fixes aws-controllers-k8s/community#2619
Changes
- Add isAutoModeCluster() function to detect Auto Mode clusters
- Only call updateComputeConfig() for actual Auto Mode clusters
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.