Skip to content

Conversation

demikl
Copy link

@demikl demikl commented Sep 4, 2025

Description

Fixes nil pointer panic in updateComputeConfig and prevents spurious Auto Mode updates for non-Auto Mode clusters

Related Issue

Fixes aws-controllers-k8s/community#2619

Changes

  • Add isAutoModeCluster() function to detect Auto Mode clusters
  • Only invoke updateComputeConfig() for actual Auto Mode clusters
  • Ignore `auto-mode field updates for non-Auto Mode clusters

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ack-prow ack-prow bot requested review from a-hilaly and michaelhtm September 4, 2025 14:24
Copy link

ack-prow bot commented Sep 4, 2025

Hi @demikl. Thanks for your PR.

I'm waiting for a aws-controllers-k8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@demikl
Copy link
Author

demikl commented Sep 4, 2025

⚠️ Breaking Change Notice

This PR introduces a breaking change in validation behavior for Auto Mode configurations. Users who previously had partial Auto Mode configurations in their Custom Resources will now encounter validation errors.

What Changed

Previously, the controller allowed partial Auto Mode configurations where only one or two of the three required capabilities were specified. This led to:

  • Nil pointer panics when ComputeConfig was present but other configs were missing
  • Confusing behavior where missing fields were implicitly treated as false
  • Spurious reconciliation loops due to AWS API inconsistencies

New Validation Rules

With this change, if any Auto Mode configuration is specified, ALL three capabilities must be explicitly declared:

  1. spec.computeConfig.enabled
  2. spec.storageConfig.blockStorage.enabled
  3. spec.kubernetesNetworkConfig.elasticLoadBalancing.enabled

Example of Breaking Change

This configuration will now fail validation:

apiVersion: eks.services.k8s.aws/v1alpha1
kind: Cluster
metadata:
  name: self
spec:
  computeConfig:
    enabled: false  # Only one capability specified
  kubernetesNetworkConfig:
    ipFamily: ipv4
    serviceIPv4CIDR: 172.20.0.0/16

Error result:

status:
  conditions:
  - lastTransitionTime: "2025-09-04T15:03:00Z"
    status: "True"
    type: ACK.ResourceSynced
  - message: 'invalid Auto Mode configuration: when configuring Auto Mode, all three
      capabilities must be specified (compute=true, storage=false, elb=false)'
    status: "True"
    type: ACK.Terminal

How to Fix Your Custom Resources

Users have three options:

Option 1: Enable Auto Mode completely

spec:
  computeConfig:
    enabled: true
  storageConfig:
    blockStorage:
      enabled: true
  kubernetesNetworkConfig:
    elasticLoadBalancing:
      enabled: true

Option 2: Disable Auto Mode completely

spec:
  computeConfig:
    enabled: false
  storageConfig:
    blockStorage:
      enabled: false
  kubernetesNetworkConfig:
    elasticLoadBalancing:
      enabled: false

Option 3: Don't use Auto Mode

Remove all Auto Mode configurations from your CR entirely.

Why This Change is Necessary

  1. Prevents crashes: Eliminates nil pointer panics when partial configurations are present
  2. Follows AWS requirements: EKS Auto Mode requires all three capabilities to be configured together
  3. Improves user experience: Clear validation errors instead of silent failures or crashes
  4. Reduces confusion: No more implicit false treatment of missing fields

This aligns the controller behavior with AWS EKS Auto Mode documentation and prevents the undefined behavior that was causing production issues.

@demikl
Copy link
Author

demikl commented Sep 4, 2025

Auto Mode Configuration Logic

This PR adds proper validation and handling for EKS Auto Mode, which has specific requirements from AWS:

Auto Mode Requirements:

  • Activate: All three capabilities (compute, storage, elasticLoadBalancing) must be true
  • Deactivate: All three capabilities must be false
  • Invalid: Any partial or inconsistent configuration (e.g., compute=true, storage=false)

Changes Made:

  1. Strict Validation: The controller now validates that Auto Mode configurations are complete and consistent. If a Custom Resource specifies only some capabilities or has inconsistent true/false values, the resource status is set to failed with a clear error message.
  2. Auto Mode Detection: Only clusters with valid Auto Mode configurations (all true or all false) will trigger Auto Mode-specific logic. Other clusters skip Auto Mode processing entirely.
  3. AWS API Resilience: If the EKS API returns inconsistent responses (which we've observed with the elasticLoadBalancing field), the controller ignores these inconsistencies rather than crashing, preventing API quirks from affecting cluster management.

Breaking Change: Custom Resources with partial Auto Mode configurations that previously succeeded will now fail validation with clear error messages, guiding users toward correct configurations.

This ensures the controller behavior matches AWS Auto Mode requirements and prevents the nil pointer crashes reported in issue aws-controllers-k8s/community#2619.

@demikl demikl force-pushed the fix-automode-nil-pointer-issue-2619 branch from 8290936 to 85a97c1 Compare September 4, 2025 15:42
@rushmash91
Copy link
Member

/ok-to-test

@ack-prow ack-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 5, 2025
@rushmash91
Copy link
Member

Hey @demikl , Thank you for this!!

Can you add a test for the auto-mode behavior? Creating a non-auto cluster and making it auto mode and one trying to turn it off with incorrect parameterS..

@demikl
Copy link
Author

demikl commented Sep 8, 2025

/retest

@demikl
Copy link
Author

demikl commented Sep 9, 2025

Hi @rushmash91,

I’m seeing behavior in the Auto Mode activation test that I can’t explain:

Test flow (test_enable_auto_mode_on_standard_cluster in test_cluster_automode_updates.py):

  1. Create a standard (non–Auto Mode) cluster using the existing simple-cluster template.
  2. Wait until the cluster is ACTIVE (both CR status and eks:DescribeCluster agree).
  3. Patch the Cluster CR to add:
  • spec.computeConfig.enabled = true
  • spec.storageConfig.blockStorage.enabled = true
  • spec.kubernetesNetworkConfig.elasticLoadBalancing.enabled = true
  1. Cluster transitions to UPDATING (observed in CR status and eks:DescribeCluster).
  2. Cluster returns to ACTIVE.
  3. The CR now contains the three Auto Mode sections with enabled=true and the controller reports no further drift.
  4. However, eks:DescribeCluster still does not include computeConfig, storageConfig, or kubernetesNetworkConfig.elasticLoadBalancing (the fields are absent, not just disabled).

Additional debugging I added:

  • Logged intermediate CR + DescribeCluster snapshots (pre‑patch, during UPDATING, after ACTIVE).
  • Called list_updates + describe_update for each updateId: there is exactly one update of type AutoModeUpdate with status Successful.
  • This suggests the service accepted and applied the transition.
  • When I manually reproduce the same workflow in my company’s AWS account (create standard cluster → patch to enable Auto Mode → describe), the DescribeCluster response DOES include the three sections as expected.

Questions / hypotheses:

  • Do Prow tests run against the real EKS API or some proxy/mock that may not yet surface these new fields?
  • Is there longer propagation latency in this environment? (I currently poll up to 180s.)
  • Could the region under test have partial rollout of the Auto Mode describe fields?
  • Are additional fields (e.g. nodeRoleARN, nodePools) required in this environment for the sections to appear in DescribeCluster even though just enabled=true works elsewhere?
  • Is the API silently accepting the update but withholding the fields until another condition is met?

Next steps I can take if helpful:

  • Enrich the patch with nodeRoleARN (from bootstrap) and optional nodePools.
  • Extend polling window and dump the raw JSON describe payload.
  • Add a temporary xfail if this is a known service-side lag or rollout gap.

Could you please confirm:

  • That the test environment is using the live EKS control plane.
  • Whether any extra prerequisites are required for DescribeCluster to return the Auto Mode sections after an update.

Happy to adjust the test once I understand the expected behavior here.

Thanks!

@rushmash91
Copy link
Member

Hi @demikl ,

Questions:
That the test environment is using the live EKS control plane - Yes, the environment is an EKS cluster
Whether any extra prerequisites are required for DescribeCluster to return the Auto Mode sections after an update - No

I usually test the api describe and update behavior directly via the CLI to see if there are any caveats being missed.

@demikl
Copy link
Author

demikl commented Sep 11, 2025

Thanks for the clarification, @rushmash91.

I can reliably reproduce the expected behavior (DescribeCluster showing the three Auto Mode sections) when running the same workflow locally/manually.

Because I still can’t determine why the Prow run’s DescribeCluster response omits those sections after a Successful AutoModeUpdate, I’ve updated the test to assert the transition using list-updates + describe-update only (type=AutoModeUpdate, status=Successful, three params all enabled). This has been consistent across runs.

Let me know if you’d prefer that I keep a (soft) DescribeCluster check as a best-effort, or leave it as-is with the update-based validation.

@demikl
Copy link
Author

demikl commented Sep 11, 2025

/retest

1 similar comment
@demikl
Copy link
Author

demikl commented Sep 12, 2025

/retest

@demikl
Copy link
Author

demikl commented Sep 18, 2025

Hi, do you need anything more from me for this PR to be accepted?

@rushmash91
Copy link
Member

Hey @demikl , thank you! this is great!
One comment i had is can we have the auto-mode test be part of the cluster auto mode test file already there..

- Add isAutoModeCluster() function to detect valid Auto Mode configurations
- Add validateAutoModeConfig() to enforce AWS requirement that compute,
  storage, and load balancing must all be enabled/disabled together
- Only call updateComputeConfig() for actual Auto Mode clusters
- Ignore elasticLoadBalancing absent vs false diffs for non-Auto Mode clusters

Fixes aws-controllers-k8s/community#2619
@demikl demikl force-pushed the fix-automode-nil-pointer-issue-2619 branch from a00dd91 to d5a11c1 Compare September 22, 2025 06:29
@demikl
Copy link
Author

demikl commented Sep 22, 2025

I've merged the tests as requested. It looks like there is a flaky test that impacts my PR checks 😞 : test_cluster_adopt_update

@demikl
Copy link
Author

demikl commented Sep 22, 2025

/test eks-kind-e2e

Copy link
Member

@rushmash91 rushmash91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @demikl ,

Thank you! the tests look good, left a few small nits..

@rushmash91
Copy link
Member

/retest

@rushmash91
Copy link
Member

Hey @demikl ,

I see the your tests are failing, We can merge this if they pass. The changes look good to me 🙂

@demikl
Copy link
Author

demikl commented Sep 26, 2025

Hey @demikl ,

I see the your tests are failing, We can merge this if they pass. The changes look good to me 🙂

In the latest run, the 2 tests that are failing are out-of-scope of my changes. I hope the current state is OK for you ?

@rushmash91
Copy link
Member

/retest

@a-hilaly
Copy link
Member

/test eks-kind-e2e

@rushmash91 rushmash91 force-pushed the fix-automode-nil-pointer-issue-2619 branch 2 times, most recently from b2c711b to 3a92e5c Compare October 3, 2025 19:10
@rushmash91 rushmash91 force-pushed the fix-automode-nil-pointer-issue-2619 branch from 3a92e5c to 3b5856f Compare October 3, 2025 19:15
Copy link

ack-prow bot commented Oct 3, 2025

@demikl: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
eks-verify-attribution 3b5856f link false /test eks-verify-attribution

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Copy link

ack-prow bot commented Oct 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: a-hilaly, demikl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ack-prow ack-prow bot added the approved label Oct 6, 2025
Copy link
Contributor

@knottnt knottnt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple questions.


return returnClusterUpdating(updatedRes)
// If not Auto Mode, ignore the diff
rlog.Info("ignoring diff on compute/storage/network config for non-Auto Mode cluster")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Will this not result in the delta still being present in the next reconcile loop? Might be tough to avoid this if the API is returning invalid auto-mode flag combinations unless we treat nil as equal to false in the delta comparison.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we will see the delta in logs but not sent the payload for update. Suggestion on what should be done instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're treating nil as false in the validation logic, we could do the same in the delta comparison. That way a partially false set of flags won't register as a diff.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need to validate that nil is equivalent to false in the EKS service as well though.

if err != nil {
return nil, ackerr.NewTerminalError(err)
}
if isAutoMode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: If this is false due to a user removing the auto-mode flags do we need to take any action? As-is we won't send any API request and leave the EKS cluster with whatever values were already present.

Copy link
Member

@rushmash91 rushmash91 Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, if it's the an invalid automode payload it's not sent. No action is taken apart from logging it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this could lead to some odd behavior from a user's perspective. Here's a sequence of events that could happen with this logic.

  1. User creates a cluster without any auto-mode configs set. Cluster created in EKS has auto-mode disabled as expected.

  2. User adds auto-mode configs with all values true. Cluster is modified to use auto-mode as expected.

  3. User decides they don't want auto-mode and rollback the ACK resource to the original Spec. We log that we are ignoring the diff for a non-automode cluster. However, the actual cluster in EKS still has automode enabled .

Copy link
Member

@michaelhtm michaelhtm Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we still make the API request?
That would make this change still safe even if the API behavior changes
We can mark the encountered error terminal for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

eks-controller: Nil pointer panic in updateComputeConfig + false drift on elasticLoadBalancing for non–Auto Mode clusters

5 participants