From 8f10dbc932758053eb78b68fbbec2eb19bfbdce5 Mon Sep 17 00:00:00 2001 From: Guillermo Gaston Date: Mon, 7 Oct 2024 17:37:42 -0500 Subject: [PATCH 01/14] Add KEP-4872 Harden Kubelet serving cert validation --- .../README.md | 386 ++++++++++++++++++ .../kep.yaml | 40 ++ 2 files changed, 426 insertions(+) create mode 100644 keps/sig-auth/4872-harden-kubelet-cert-validation/README.md create mode 100644 keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml diff --git a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md new file mode 100644 index 00000000000..70b4a281c05 --- /dev/null +++ b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md @@ -0,0 +1,386 @@ +# KEP-4872: Harden Kubelet Serving Certificate Validation in Kube-API server + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Impact of node impersonation](#impact-of-node-impersonation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Enabling the feature](#enabling-the-feature) + - [Metrics](#metrics) + - [TLS insecure](#tls-insecure) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and 
Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed](#infrastructure-needed) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + +[kubernetes.io]: 
https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +This proposal aims to enhance the security of the Kube API server by validating the Common Name (CN) of the kubelet's serving certificate to ensure it matches the expected node name. +This validation prevents a compromised node that has obtained a certificate for an IP address it does not own from using it to impersonate another node. + +## Motivation + +In cloud environments, IPs can change rapidly due to the ephemeral nature of the infrastructure. +If IPs or machines rotate faster than kubelet serving certificates expire, a certificate issued to an old node could be used to respond to requests aimed at a new node, provided they share an IP. + +In addition, in on-premises environments, verifying that the IP addresses in a Certificate Signing Request (CSR) are owned by the requesting node can be challenging due to the lack of a reliable source of truth for IP ownership. +Even when such a source exists, integration can be complex, leading to unsafe practices like auto-approval of CSRs without a strong guarantee of IP ownership. +This vulnerability can be exploited through ARP poisoning or other routing attacks, allowing a rogue node to obtain a certificate for an IP it does not own and reroute traffic to itself. + +When the Kube API server connects to a kubelet, it verifies that the serving certificate is signed by a trusted CA and that the IP or hostname it’s connecting to is included in the certificate's SANs. +If a rogue node obtained a certificate for an IP it does not own and rerouted traffic to itself, it would be able to impersonate a Node that reports that IP.
+ +### Impact of node impersonation + +If an actor with control of a node can impersonate another node, the impact would be: + +* Break confidentiality of the requests sent by the Kube-API server to the kubelet (e.g. kubectl exec/logs). These are usually user-driven requests. This gives the threat actor the possibility of producing incorrect or misleading feedback. In the exec case, it could allow a threat actor to issue prompts for credentials. In addition, the exec commands might contain user secrets. +* Break confidentiality of credentials if the client uses token-based authentication. This is probably more common for non-Kube-API server clients, given mTLS is common for Kube-API server to kubelet communication. + +### Goals + +* Ensure the Kube API server validates that the node’s serving certificate's CN matches the expected node name. +* Prevent rogue nodes from using certificates issued for IPs they do not own. + +### Non-Goals + +* This proposal does not address certificate validation for clients other than the Kube API server, such as metrics scrapers. However, we'll consider an implementation in client-go that could be used by those other clients. + +## Proposal + +We propose that the Kube API server be modified to validate that the Common Name (CN) of the kubelet's serving certificate is equal to `system:node:`. +`nodename` is the name of the Node object as reported by the kubelet. When the Kube-API server connects to the kubelet server (e.g. for logs, exec, port-forward), it always knows the Node it's connecting to. + +### User Stories (Optional) + +#### Story 1 + +As a cluster administrator, I want to ensure that kubelet serving certificates are validated based on the node name, reducing the risk of IP-based impersonation attacks. + +#### Story 2 + +As a cluster administrator using custom serving certificates for the kubelet server, I want to be able to disable the Subject's CN validation.
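
The CN comparison described in the Proposal can be illustrated with a minimal sketch. It is not the kube-apiserver implementation: the names `NODE_CN_PREFIX`, `expected_kubelet_cn` and `kubelet_cert_cn_matches` are hypothetical, and the sketch assumes that the expected CN is the `system:node:` prefix followed by the Node name, as the reference to `nodename` above suggests. Parsing the certificate and the regular CA and SAN checks are out of scope here; the function only receives the CN string already extracted from the serving certificate.

```python
# Hypothetical sketch of the CN check; names are illustrative only.
NODE_CN_PREFIX = "system:node:"


def expected_kubelet_cn(node_name: str) -> str:
    """Return the CN assumed to be expected for the given Node name."""
    return NODE_CN_PREFIX + node_name


def kubelet_cert_cn_matches(cert_common_name: str, node_name: str) -> bool:
    """Report whether the serving certificate CN belongs to the target Node.

    cert_common_name is the Subject CN taken from the kubelet serving
    certificate; node_name is the Node the API server intends to reach.
    """
    return cert_common_name == expected_kubelet_cn(node_name)


if __name__ == "__main__":
    # A certificate issued to node-a is accepted only when connecting to node-a,
    # even if its SANs happen to contain the IP of another node.
    print(kubelet_cert_cn_matches("system:node:node-a", "node-a"))
    print(kubelet_cert_cn_matches("system:node:node-a", "node-b"))
```

Under these assumptions, a certificate that a rogue node obtained for another node's IP would still carry the rogue node's own name in the CN, so the comparison against the intended Node name would fail.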
+ +### Notes/Constraints/Caveats (Optional) + +When the kubelet requests a certificate through a CSR, it sets the CN to `system:node:`, enforced by the admission controller as per [PR \#126015](https://github.com/kubernetes/kubernetes/pull/126015). + +However, certificates issued manually or through other mechanisms may not follow this convention. +With the new validation, any certificate not following this `system:node:` convention will be deemed invalid by the Kube API server. +This will require cluster administrators to reissue any non-conforming certificates before enabling this feature. + +### Risks and Mitigations + +This could disrupt existing clusters that are using custom kubelet serving certificates. +These clusters will need to reissue their certificates before enabling this feature. We will allow to disable the validation through a command-line flag to allow for a smooth transition. + +## Design Details + +### Enabling the feature + +We will introduce a feature flag `KubeletCertCNValidation` that will gate the usage of the new validation. +This gate will start off by default in Alpha, will be turned on by default in Beta and will be removed in GA. + +In addition, we will allow to disable the validation through a command-line flag `--disable-kubelet-cert-cn-validation`. +This flag can only be set if the `KubeletCertCNValidation` feature flag is enabled. +This flag will allow cluster administrators to opt-out of this validation if they are using custom kubelet serving certificates that don't follow the `system:node:` convention even after the feature gate is removed. + +#### Metrics + +In order to help cluster administrators determine if it's safe to enable the feature, we propose to add a new metric `kube_apiserver_validation_kubelet_cert_cn_errors` that will track the number of errors due to the new CN validation. 
+If the feature gate is disabled, we will still add the validation code to the HTTP transport, however, if the validation fails we won't return an error, we will just increment the metric counter. +In addition, we will log the error including the node name, so cluster administrators can identify which nodes are affected and need to reissue their certificates. + +We purposefully don't add the node name to the metric to avoid a high cardinality. +The purpose of the metric is to easily/cheaply tell administrators if they can flip the feature on or not. If the answer is no (counter is greater than 0), the rest of the necessary information to detect the offending nodes will come from logs. + +Given that running the validation to feed the metric still has a cost, we won't run it if the validation is explicitly disabled with `--disable-kubelet-cert-cn-validation`. + +We will remove the metric once the feature is GA. + +> TODO: let's discuss this in the review. We could consider adding the node name to the metric or even keeping the metric post GA if it's valuable. + +### TLS insecure + +Currently, if the Kube-API server is not configured with a `--kubelet-certificate-authority` the TLS client for kubelet server will skip the server certificate validation. +Additionally, `logs` requests allow to configure `InsecureSkipTLSVerifyBackend` per request to skip the server certificate validation. + +To align with this behavior, we won't execute the CN validation if `--kubelet-certificate-authority` is not set or if `InsecureSkipTLSVerifyBackend` is set to true. + +### Test Plan + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + +##### Unit tests + +Unit tests will be added along with any new code introduced. 
+ +Existing test coverage for the packages we anticipate modifying: + +- `k8s.io/kubernetes/pkg/kubelet/client`: `2024-10-07` - `28.2` +- `k8s.io/client-go/transport`: `2024-10-07` - `59.4` + +##### Integration tests + +Integration tests will be added to ensure the following: +* An error is returned if `--disable-kubelet-cert-cn-validation` is set but `KubeletCertCNValidation` feature flag is not enabled. +* Validation for custom certificates works if feature flag is not enabled. +* Validation for custom certificates works if feature flag enabled and `--disable-kubelet-cert-cn-validation` is set to true. +* Validation for custom certificates fails if feature flag enabled and `--disable-kubelet-cert-cn-validation` is set to false or not set. +* Validation for kubernetes issued certificates works if feature flag enabled and `--disable-kubelet-cert-cn-validation` is set to false or not set. + +##### e2e tests + +End-to-end tests won't be needed as unit and integration tests will cover all the scenarios. + +### Graduation Criteria + +#### Alpha + +* Add feature flag for gating usage, off by default +* Add flag to disable extra validation +* Unit and integration tests + +#### Beta +* Address user reviews and iterate if needed +* Feature flag on by default + +#### GA +* Remove feature flag + +### Upgrade / Downgrade Strategy + +Once feature flag is on by default (starting in Beta), administrators using custom serving certs +can use the proposed flag to disable the extra validation and maintain current behavior. +They will be able to use this flag even after the feature flag is removed. + +### Version Skew Strategy + +Not applicable. + +## Production Readiness Review Questionnaire + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? 
+ +- [x] Feature gate + - Feature gate name: `KubeletCertCNValidation` + - Components depending on the feature gate: kube-apiserver +- [x] Other + - Describe the mechanism: kube-apiserver command-line flag `--disable-kubelet-cert-cn-validation` + - Will enabling / disabling the feature require downtime of the control + plane? No. But requires restarting the kube-apiserver. + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? No. + +###### Does enabling the feature change any default behavior? + +Yes. If a cluster is using custom kubelet serving certificates that don't follow the same convention as kubernetes issued certificates (CN is `system:node:`), +enabling this feature will make any connection initiated by the kube-api server fail (logs, exec and port-forwarding). + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +Yes, the feature can be disabled once enabled by just setting the command-line flag to true. + +###### What happens if we reenable the feature if it was previously rolled back? + +You just get back the new behavior with the extra cert validation, no extra considerations needed. + +###### Are there any tests for feature enablement/disablement? + +We will add integration tests to validate the enablement/disablement flow. Test cases specified in a previous section. + +### Rollout, Upgrade and Rollback Planning + +###### How can a rollout or rollback fail? Can it impact already running workloads? + +A rollout can fail if the feature flag is not enabled but the command-line flag is set. + +Already running workloads won't be impacted but cluster users won't be able to access the control plane if the cluster is single-node. + +###### What specific metrics should inform a rollback? + +Not applicable. + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + +No. 
There is no data stored for this feature which persists between upgrade / downgrade, or between enable / disable. +The feature is purely an API server configuration option. + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + +No. + +### Monitoring Requirements + +###### How can an operator determine if the feature is in use by workloads? + +The cluster administrators can check the flags passed to the kube-apiserver if they have access to the control plane nodes. +If the `--disable-kubelet-cert-cn-validation` flag is not set or set to false, the feature is being used. +Alternatively, they can check the `kubernetes_feature_enabled` metric. + +###### How can someone using this feature know that it is working for their instance? + +- [x] Other + - Details: users can create a Node with a kubelet serving certificate that doesn't meet the CN requirements enforced by this validation (something different than `system:node:`). Then run `kubectl logs` for any pod running on that node. If it returns an error for an invalid certificate, the feature is working. + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + +The average `apiserver_request_duration_seconds` for logs/exec/port-forward requests is within reasonable limits. +A rising value after enabling this feature could signal overhead introduced by the extra validation. + +> TODO: I expect the overhead to be negligible and probably to fall within the standard deviation of the current average, especially for long-running requests like port-forward and exec. Is this even valuable to have here? + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+ +- [x] Metrics + - Metric name: `kube_apiserver_pod_logs_backend_tls_failure_total` + - Components exposing the metric: kube-apiserver + +> TODO: should `kube_apiserver_pod_logs_backend_tls_failure_total` reflect errors due to the new CN validation? +> It's technically a TLS failure, but it's not part of the base TLS client validations. + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + +We could add a metric specific to track the number of requests that failed due to the new CN validation. In addition, we could track the time spent per request on the CN validation. + +However, we consider these metrics to not provide enough value to justify the work to maintain them. + +### Dependencies + +###### Does this feature depend on any specific services running in the cluster? + +No. + +### Scalability + +###### Will enabling / using this feature result in any new API calls? + +No. + +###### Will enabling / using this feature result in introducing new API types? + +No. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + +No. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + +No. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + +No. This only affects streaming APIs and these are not covered by SLIs/SLOs. + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + +No. + +Note: depending on the implementation (caching the client-go transport or not) there might be a slight increase in memory (due to one transport per node being cached) or in CPU usage (due to building the transport on the fly for every request). This should be negligible. 
+ +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + +No. + +### Troubleshooting + +###### How does this feature react if the API server and/or etcd is unavailable? + +It's part of the API server, so the feature will be unavailable. + +###### What are other known failure modes? + +- [API server can't connect to Nodes with custom kubelet serving certificates that don't follow the `system:node:` convention] + - Detection: `kubectl logs` returns a certificate validation error. + - Mitigations: disable the validation with the `--disable-kubelet-cert-cn-validation` flag. + - Diagnostics: error is returned by the API server, no additional logging needed. + - Testing: We will have tests for this, this is basically testing that the feature works. + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + +## Drawbacks + +This could disrupt clusters that are using custom kubelet serving certificates. These clusters will need to reissue their certificates before enabling this feature. + +## Alternatives + +None. + +## Infrastructure Needed + +None. diff --git a/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml b/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml new file mode 100644 index 00000000000..e9631c6dfb7 --- /dev/null +++ b/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml @@ -0,0 +1,40 @@ +title: Harden Kubelet Serving Certificate Validation in Kube-API server +kep-number: 4872 +authors: + - "@g-gaston" +owning-sig: sig-auth +participating-sigs: +status: provisional +creation-date: 2024-09-24 +reviewers: + - TBD +approvers: + - TBD + +see-also: +replaces: + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha|beta|stable + +# The most recent milestone for which work toward delivery of this KEP has been +# done. 
This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "" + beta: "" + stable: "" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: KubeletCertCNValidation + components: + - kube-apiserver +disable-supported: true + +# The following PRR answers are required at beta release +metrics: From 16c59e6df4e9dbac846a175e6d1a9a012f86e7eb Mon Sep 17 00:00:00 2001 From: Guillermo Gaston Date: Thu, 10 Oct 2024 09:30:19 -0500 Subject: [PATCH 02/14] Update keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml Co-authored-by: Tim Bannister --- keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml b/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml index e9631c6dfb7..2b1a36b2b47 100644 --- a/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml +++ b/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml @@ -15,7 +15,7 @@ see-also: replaces: # The target maturity stage in the current dev cycle for this KEP. -stage: alpha|beta|stable +stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. 
This can be the current (upcoming) milestone, if it is being actively From b8ea822a3c0848f50e4a689ee82c918835b37174 Mon Sep 17 00:00:00 2001 From: Guillermo Gaston Date: Tue, 3 Dec 2024 13:22:33 -0600 Subject: [PATCH 03/14] Only produce metric if feature gate is enabled --- .../4872-harden-kubelet-cert-validation/README.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md index 70b4a281c05..a28281fec84 100644 --- a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md +++ b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md @@ -154,13 +154,15 @@ This flag will allow cluster administrators to opt-out of this validation if the #### Metrics In order to help cluster administrators determine if it's safe to enable the feature, we propose to add a new metric `kube_apiserver_validation_kubelet_cert_cn_errors` that will track the number of errors due to the new CN validation. -If the feature gate is disabled, we will still add the validation code to the HTTP transport, however, if the validation fails we won't return an error, we will just increment the metric counter. In addition, we will log the error including the node name, so cluster administrators can identify which nodes are affected and need to reissue their certificates. -We purposefully don't add the node name to the metric to avoid a high cardinality. +If the feature gate is disabled, we won't publish the metric or run any validation code at all. + +If the feature gate is enabled but the feature is disabled (with `--disable-kubelet-cert-cn-validation`), we will still add the validation code to the HTTP transport, however, if the validation fails we won't return an error, we will just increment the metric counter. + +We intentionally don't add the node name to the metric to avoid a high cardinality. 
The purpose of the metric is to easily/cheaply tell administrators if they can flip the feature on or not. If the answer is no (counter is greater than 0), the rest of the necessary information to detect the offending nodes will come from logs. -Given that running the validation to feed the metric still has a cost, we won't run it if the validation is explicitly disabled with `--disable-kubelet-cert-cn-validation`. We will remove the metric once the feature is GA. @@ -384,3 +386,4 @@ None. ## Infrastructure Needed None. +**** \ No newline at end of file From 786d43d948736c74539816b3a254aa90ed514b1e Mon Sep 17 00:00:00 2001 From: Guillermo Gaston Date: Tue, 3 Dec 2024 13:25:47 -0600 Subject: [PATCH 04/14] We will add e2e tests if we can't cover everything with integration --- keps/sig-auth/4872-harden-kubelet-cert-validation/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md index a28281fec84..7c3758eefe3 100644 --- a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md +++ b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md @@ -203,7 +203,7 @@ Integration tests will be added to ensure the following: ##### e2e tests -End-to-end tests won't be needed as unit and integration tests will cover all the scenarios. +We believe it is likely that end-to-end tests won't be needed, as unit and integration tests will cover all the scenarios. If it's not possible to cover all the scenarios, we will add e2e tests. It's also quite likely that existing e2e tests will cover the new behavior once the feature gate is enabled, so new tests might only be needed for the transition period.
### Graduation Criteria From 6425c791148b5e13c4e2da2fe5d5d6b360a66367 Mon Sep 17 00:00:00 2001 From: Guillermo Gaston Date: Mon, 9 Jun 2025 11:50:48 -0500 Subject: [PATCH 05/14] Update metadata files for enhancements review --- keps/prod-readiness/sig-auth/4872.yaml | 3 +++ .../sig-auth/4872-harden-kubelet-cert-validation/kep.yaml | 8 ++++---- 2 files changed, 7 insertions(+), 4 deletions(-) create mode 100644 keps/prod-readiness/sig-auth/4872.yaml diff --git a/keps/prod-readiness/sig-auth/4872.yaml b/keps/prod-readiness/sig-auth/4872.yaml new file mode 100644 index 00000000000..ebf7d79287d --- /dev/null +++ b/keps/prod-readiness/sig-auth/4872.yaml @@ -0,0 +1,3 @@ +kep-number: 4872 +alpha: + approver: "@liggitt" diff --git a/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml b/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml index 2b1a36b2b47..fffccc7d130 100644 --- a/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml +++ b/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml @@ -20,13 +20,13 @@ stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "" +latest-milestone: "1.34" # The milestone at which this feature was, or is targeted to be, at each stage. 
milestone: - alpha: "" - beta: "" - stable: "" + alpha: "1.34" + beta: "1.35" + stable: "1.37" # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled From 316144bd5a164a12c4693854832a96a9ceab81f9 Mon Sep 17 00:00:00 2001 From: Guillermo Gaston Date: Mon, 16 Jun 2025 12:08:56 -0500 Subject: [PATCH 06/14] Update prr reviewer to soltysh --- keps/prod-readiness/sig-auth/4872.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/prod-readiness/sig-auth/4872.yaml b/keps/prod-readiness/sig-auth/4872.yaml index ebf7d79287d..462a0d1cbca 100644 --- a/keps/prod-readiness/sig-auth/4872.yaml +++ b/keps/prod-readiness/sig-auth/4872.yaml @@ -1,3 +1,3 @@ kep-number: 4872 alpha: - approver: "@liggitt" + approver: "@soltysh" From 79fb591f44e336b7b756e2756a98f2b73fff2b6c Mon Sep 17 00:00:00 2001 From: Guillermo Gaston Date: Mon, 16 Jun 2025 12:10:06 -0500 Subject: [PATCH 07/14] Mark kep as implementable --- keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml b/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml index fffccc7d130..0c9d945d06b 100644 --- a/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml +++ b/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml @@ -4,7 +4,7 @@ authors: - "@g-gaston" owning-sig: sig-auth participating-sigs: -status: provisional +status: implementable creation-date: 2024-09-24 reviewers: - TBD From ed8f3e260122132c9c5412b5a8693b8ef0901a67 Mon Sep 17 00:00:00 2001 From: Guillermo Gaston Date: Mon, 16 Jun 2025 12:44:02 -0500 Subject: [PATCH 08/14] Make the validation opt-in --- .../README.md | 49 +++++++++---------- 1 file changed, 24 insertions(+), 25 deletions(-) diff --git a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md index 
7c3758eefe3..5717d6011eb 100644 --- a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md +++ b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md @@ -138,7 +138,10 @@ This will require cluster administrators to reissue any non-conforming certifica ### Risks and Mitigations This could disrupt existing clusters that are using custom kubelet serving certificates. -These clusters will need to reissue their certificates before enabling this feature. We will allow to disable the validation through a command-line flag to allow for a smooth transition. + +In order to maintain compatibility by default with these clusters even after this feature goes GA, we will make it opt-in. + +Before enabling this feature on clusters with custom kubelet serving certificates, cluster administrators will need to reissue those certificates. ## Design Details @@ -147,33 +150,29 @@ These clusters will need to reissue their certificates before enabling this feat We will introduce a feature flag `KubeletCertCNValidation` that will gate the usage of the new validation. This gate will start off by default in Alpha, will be turned on by default in Beta and will be removed in GA. -In addition, we will allow to disable the validation through a command-line flag `--disable-kubelet-cert-cn-validation`. -This flag can only be set if the `KubeletCertCNValidation` feature flag is enabled. -This flag will allow cluster administrators to opt-out of this validation if they are using custom kubelet serving certificates that don't follow the `system:node:` convention even after the feature gate is removed. +In addition, the validation will be opt-in and enabled through a new command-line flag `--enable-kubelet-cert-cn-validation`. +This flag can only be set if the `KubeletCertCNValidation` feature flag is enabled and if `--kubelet-certificate-authority` is set. 
+ +Making the feature opt-in maintains compatibility with existing clusters using custom kubelet serving certificates that don't follow the `system:node:` convention even after the feature gate is removed. #### Metrics In order to help cluster administrators determine if it's safe to enable the feature, we propose to add a new metric `kube_apiserver_validation_kubelet_cert_cn_errors` that will track the number of errors due to the new CN validation. In addition, we will log the error including the node name, so cluster administrators can identify which nodes are affected and need to reissue their certificates. -If the feature gate is disabled, we won't publish the metric or run any validation code at all. +If the feature gate is disabled or if `--kubelet-certificate-authority` is not set, we won't publish the metric or run any validation code at all. -If the feature gate is enabled but the feature is disabled (with `--disable-kubelet-cert-cn-validation`), we will still add the validation code to the HTTP transport, however, if the validation fails we won't return an error, we will just increment the metric counter. +If the feature gate is enabled, the kubelet CA is set (`--kubelet-certificate-authority`) but this feature is disabled, we will still run the validation code to collect the metric. However, if the validation fails we won't return an error, we will just increment the metric counter. We intentionally don't add the node name to the metric to avoid a high cardinality. The purpose of the metric is to easily/cheaply tell administrators if they can flip the feature on or not. If the answer is no (counter is greater than 0), the rest of the necessary information to detect the offending nodes will come from logs. - -We will remove the metric once the feature is GA. - -> TODO: let's discuss this in the review. We could consider adding the node name to the metric or even keeping the metric post GA if it's valuable. 
- ### TLS insecure Currently, if the Kube-API server is not configured with a `--kubelet-certificate-authority` the TLS client for kubelet server will skip the server certificate validation. Additionally, `logs` requests allow to configure `InsecureSkipTLSVerifyBackend` per request to skip the server certificate validation. -To align with this behavior, we won't execute the CN validation if `--kubelet-certificate-authority` is not set or if `InsecureSkipTLSVerifyBackend` is set to true. +To align with this behavior, we won't allow to enable the validation if `--kubelet-certificate-authority` is not set and we won't execute the CN validation if `InsecureSkipTLSVerifyBackend` is set to true. ### Test Plan @@ -195,11 +194,12 @@ Existing test coverage for the packages we anticipate modifying: ##### Integration tests Integration tests will be added to ensure the following: -* An error is returned if `--disable-kubelet-cert-cn-validation` is set but `KubeletCertCNValidation` feature flag is not enabled. +* An error is returned if `--enable-kubelet-cert-cn-validation` is set but `KubeletCertCNValidation` feature flag is not enabled. +* An error is returned if the feature `KubeletCertCNValidation` is enabled, `--enable-kubelet-cert-cn-validation` is set to true but `--kubelet-certificate-authority` is not set. * Validation for custom certificates works if feature flag is not enabled. -* Validation for custom certificates works if feature flag enabled and `--disable-kubelet-cert-cn-validation` is set to true. -* Validation for custom certificates fails if feature flag enabled and `--disable-kubelet-cert-cn-validation` is set to false or not set. -* Validation for kubernetes issued certificates works if feature flag enabled and `--disable-kubelet-cert-cn-validation` is set to false or not set. +* Validation for custom certificates works if feature flag enabled and `--enable-kubelet-cert-cn-validation` is not set or set to false. 
+* Validation for custom certificates fails if feature flag enabled, `--kubelet-certificate-authority` is set and `--enable-kubelet-cert-cn-validation` is set to true. +* Validation for kubernetes issued certificates works if feature flag enabled, `--kubelet-certificate-authority` is set and `--enable-kubelet-cert-cn-validation` is set to true. ##### e2e tests @@ -222,9 +222,7 @@ We believe is likely end-to-end tests won't be needed as unit and integration te ### Upgrade / Downgrade Strategy -Once feature flag is on by default (starting in Beta), administrators using custom serving certs -can use the proposed flag to disable the extra validation and maintain current behavior. -They will be able to use this flag even after the feature flag is removed. +The feature is opt-in and it can be disabled at any time by just not setting the `--enable-kubelet-cert-cn-validation` flag. ### Version Skew Strategy @@ -240,7 +238,7 @@ Not applicable. - Feature gate name: `KubeletCertCNValidation` - Components depending on the feature gate: kube-apiserver - [x] Other - - Describe the mechanism: kube-apiserver command-line flag `--disable-kubelet-cert-cn-validation` + - Describe the mechanism: kube-apiserver command-line flag `--enable-kubelet-cert-cn-validation` - Will enabling / disabling the feature require downtime of the control plane? No. But requires restarting the kube-apiserver. - Will enabling / disabling the feature require downtime or reprovisioning @@ -248,8 +246,9 @@ Not applicable. ###### Does enabling the feature change any default behavior? -Yes. If a cluster is using custom kubelet serving certificates that don't follow the same convention as kubernetes issued certificates (CN is `system:node:`), -enabling this feature will make any connection initiated by the kube-api server fail (logs, exec and port-forwarding). +Enabling the feature gate doesn't change any behavior. + +Enabling the validation does change the default certificate validation behavior. 
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? @@ -289,7 +288,7 @@ No. ###### How can an operator determine if the feature is in use by workloads? The cluster administrators can check the flags passed to the kube-apiserver if they have access to the control plane nodes. -If the `--disable-kubelet-cert-cn-validation` flag is not set or set to false, the feature is being used. +If the `--enable-kubelet-cert-cn-validation` flag set to true, the feature is being used. Alternatively the can check the `kubernetes_feature_enabled` metric. ###### How can someone using this feature know that it is working for their instance? @@ -367,7 +366,7 @@ It's part of the API server, so the feature will be unavailable. - [API server can't connect to Nodes with custom kubelet serving certificates that don't follow the `system:node:` convention] - Detection: `kubectl logs` returns a certificate validation error. - - Mitigations: disable the validation with the `--disable-kubelet-cert-cn-validation` flag. + - Mitigations: disable the validation byt not setting `--enable-kubelet-cert-cn-validation` flag. - Diagnostics: error is returned by the API server, no additional logging needed. - Testing: We will have tests for this, this is basically testing that the feature works. @@ -377,7 +376,7 @@ It's part of the API server, so the feature will be unavailable. ## Drawbacks -This could disrupt clusters that are using custom kubelet serving certificates. These clusters will need to reissue their certificates before enabling this feature. +None. 
## Alternatives From f26878886039e432c4c3fb68c32d4425032fa5a6 Mon Sep 17 00:00:00 2001 From: Guillermo Gaston Date: Mon, 16 Jun 2025 13:14:35 -0500 Subject: [PATCH 09/14] Improve metrics --- .../README.md | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md index 5717d6011eb..9de1a6277c1 100644 --- a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md +++ b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md @@ -157,7 +157,7 @@ Making the feature opt-in maintains compatibility with existing clusters using c #### Metrics -In order to help cluster administrators determine if it's safe to enable the feature, we propose to add a new metric `kube_apiserver_validation_kubelet_cert_cn_errors` that will track the number of errors due to the new CN validation. +In order to help cluster administrators determine if it's safe to enable the feature, we propose to add a new metric `kube_apiserver_validation_kubelet_cert_cn_total`. We will have two labels `success` and `failure`, allowing to track the number of errors due to the new CN validation. In addition, we will log the error including the node name, so cluster administrators can identify which nodes are affected and need to reissue their certificates. If the feature gate is disabled or if `--kubelet-certificate-authority` is not set, we won't publish the metric or run any validation code at all. @@ -165,7 +165,7 @@ If the feature gate is disabled or if `--kubelet-certificate-authority` is not s If the feature gate is enabled, the kubelet CA is set (`--kubelet-certificate-authority`) but this feature is disabled, we will still run the validation code to collect the metric. However, if the validation fails we won't return an error, we will just increment the metric counter. We intentionally don't add the node name to the metric to avoid a high cardinality. 
-The purpose of the metric is to easily/cheaply tell administrators if they can flip the feature on or not. If the answer is no (counter is greater than 0), the rest of the necessary information to detect the offending nodes will come from logs. +The purpose of the metric is to easily/cheaply tell administrators if they can flip the feature on or not. If the answer is no (counter for `failure` label is greater than 0), the rest of the necessary information to detect the offending nodes will come from logs. ### TLS insecure @@ -272,7 +272,7 @@ Already running workloads won't be impacted but cluster users won't be able to a ###### What specific metrics should inform a rollback? -Not applicable. +`kube_apiserver_validation_kubelet_cert_cn_total` can help inform a rollback. A non-zero value for the `failure` label will require invetsigation: if the rejected requests are going to legitimate nodes, the feature should be rolled back until kuebeler serving certificates are reissued. ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? @@ -294,7 +294,7 @@ Alternatively the can check the `kubernetes_feature_enabled` metric. ###### How can someone using this feature know that it is working for their instance? - [x] Other - - Details: users can create a Node with a kubelet serving certificate that doesn't meet the CN requirements enforced by this validation (something different than `system:node:`).Then run `kubectl logs` for any pod running in that node. If it returns an error for an invalid certificate, the feature is working. + - Details: when the feature is enabled, the metric `kube_apiserver_validation_kubelet_cert_cn_total` will increase for the `success` label. ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? 
@@ -306,17 +306,16 @@ A raising value after enabling this feature could signal overhead introduced by ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? - [x] Metrics - - Metric name: `kube_apiserver_pod_logs_backend_tls_failure_total` + - Metric name: `kube_apiserver_validation_kubelet_cert_cn_total` - Components exposing the metric: kube-apiserver - -> TODO: should `kube_apiserver_pod_logs_backend_tls_failure_total` reflect errors due to the new CN validation? -> It's technically a TLS failure, but it's not part of the base TLS client validations. + - If the feature is enabled, and the metric increases for the `failure` label, it signals a problem. + - If the service is healthy, the metric should increase. ###### Are there any missing metrics that would be useful to have to improve observability of this feature? -We could add a metric specific to track the number of requests that failed due to the new CN validation. In addition, we could track the time spent per request on the CN validation. +We could add a metric to track the time spent per request on the CN validation. -However, we consider these metrics to not provide enough value to justify the work to maintain them. +However, we consider this metric to not provide enough value to justify the work to maintain it. 
### Dependencies From 7359ae02f25031bacffca5f5910ab553af4f608c Mon Sep 17 00:00:00 2001 From: Guillermo Gaston Date: Mon, 16 Jun 2025 13:18:34 -0500 Subject: [PATCH 10/14] Improve testing --- .../4872-harden-kubelet-cert-validation/README.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md index 9de1a6277c1..d66884bdc01 100644 --- a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md +++ b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md @@ -191,11 +191,13 @@ Existing test coverage for the packages we anticipate modifying: - `k8s.io/kubernetes/pkg/kubelet/client`: `2024-10-07` - `28.2` - `k8s.io/client-go/transport`: `2024-10-07` - `59.4` +On top of testing the validation itself, we will test that: +* An error is returned if `--enable-kubelet-cert-cn-validation` is set but `KubeletCertCNValidation` feature flag is not enabled. +* An error is returned if the feature `KubeletCertCNValidation` is enabled, `--enable-kubelet-cert-cn-validation` is set to true but `--kubelet-certificate-authority` is not set. + ##### Integration tests Integration tests will be added to ensure the following: -* An error is returned if `--enable-kubelet-cert-cn-validation` is set but `KubeletCertCNValidation` feature flag is not enabled. -* An error is returned if the feature `KubeletCertCNValidation` is enabled, `--enable-kubelet-cert-cn-validation` is set to true but `--kubelet-certificate-authority` is not set. * Validation for custom certificates works if feature flag is not enabled. * Validation for custom certificates works if feature flag enabled and `--enable-kubelet-cert-cn-validation` is not set or set to false. * Validation for custom certificates fails if feature flag enabled, `--kubelet-certificate-authority` is set and `--enable-kubelet-cert-cn-validation` is set to true. 
@@ -203,7 +205,7 @@ Integration tests will be added to ensure the following: ##### e2e tests -We believe is likely end-to-end tests won't be needed as unit and integration tests will cover all the scenarios. If it's not possible to cover all the scenarios, we will add e2e tests. It's also quite likely that existing e2e tests will cover the new behavior once the feature gate is enabled, so new tests might only be needed for the transition period. +We will update the alpha kind e2e tests job to exercise this flow to start with, and once the functionality is beta, we will update all kind e2e test jobs to run with this verification. ### Graduation Criteria @@ -216,6 +218,7 @@ We believe is likely end-to-end tests won't be needed as unit and integration te #### Beta * Address user reviews and iterate if needed * Feature flag on by default +* Validation enabled for all kind e2e test jobs #### GA * Remove feature flag From c9b7424ae23653c4e9d727711a83b99c1f206f4a Mon Sep 17 00:00:00 2001 From: Guillermo Gaston Date: Mon, 16 Jun 2025 13:29:30 -0500 Subject: [PATCH 11/14] Be explicit about implementation not creating additional TLS connections --- keps/sig-auth/4872-harden-kubelet-cert-validation/README.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md index d66884bdc01..2186e31945d 100644 --- a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md +++ b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md @@ -304,7 +304,7 @@ Alternatively the can check the `kubernetes_feature_enabled` metric. The average `apiserver_request_duration_seconds` for logs/exec/port-forward requests is within reasonable limits. A raising value after enabling this feature could signal overhead introduced by the extra validation. 
-> TODO: I expect the overhead to be negligible and probably to fall in within the standard deviation of the current average. Specially for long running requests like port-forward and exec. Is this even valuable to have here? +In addition, the number of TLS connections made from API server to nodes should not increase. ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? @@ -352,8 +352,6 @@ No. This only affects streaming APIs and these are not covered by SLIs/SLOs. No. -Note: depending on the implementation (caching the client-go transport or not) there might be a slight increase in memory (due to one transport per node being cached) or in CPU usage (due to building the transport on the fly for every request). This should be negligible. - ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? No. From 557b0f89e9e171db935032afd7437e221b0fa3c2 Mon Sep 17 00:00:00 2001 From: Guillermo Gaston Date: Mon, 16 Jun 2025 13:31:20 -0500 Subject: [PATCH 12/14] Address other review comments --- keps/sig-auth/4872-harden-kubelet-cert-validation/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md index 2186e31945d..99aa2ead0b7 100644 --- a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md +++ b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md @@ -221,7 +221,7 @@ We will update the alpha kind e2e tests job to exercise this flow to start with, * Validation enabled for all kind e2e test jobs #### GA -* Remove feature flag +* Successful adoption by at least one provider ### Upgrade / Downgrade Strategy @@ -374,6 +374,8 @@ It's part of the API server, so the feature will be unavailable. 
## Implementation History +* Implemenation options discussion: https://docs.google.com/document/d/1RqhAkGov_coHsB3lbAo-qfQl1MOfYvgpPUjiGMJ_3PY + ## Drawbacks None. From b7b0e1d661dd146b23e5f3d2b40bb090c22ae0ba Mon Sep 17 00:00:00 2001 From: Guillermo Gaston Date: Mon, 16 Jun 2025 13:50:12 -0500 Subject: [PATCH 13/14] Fix misspelling and grammar --- .../README.md | 34 +++++++++---------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md index 99aa2ead0b7..048d742ce74 100644 --- a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md +++ b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md @@ -100,7 +100,7 @@ If a rogue node obtained a certificate for an IP it does not own and reroute tra Provided an actor with control of a node can impersonate another node, the impact would be: -* Break confidentiality of the requests sent by the Kube-API server to the kubelet (e.g kubectl exec/logs).These are usually user-driven requests. That gives the threat actor the possibility of producing incorrect or mis-leading feedback. In the exec case, it could allow a threat actor to issue prompts for credentials. In addition, the exec commands might contain user secrets. +* Break confidentiality of the requests sent by the Kube-API server to the kubelet (e.g kubectl exec/logs). These are usually user-driven requests. That gives the threat actor the possibility of producing incorrect or misleading feedback. In the exec case, it could allow a threat actor to issue prompts for credentials. In addition, the exec commands might contain user secrets. * Break confidentiality of credentials if the client uses token based authentication. This is probably more common for non Kube-API server clients, given mTLS is common for Kube-API server to kubelet communication. 
### Goals @@ -114,7 +114,7 @@ Provided an actor with control of a node can impersonate another node, the impac ## Proposal -We propose that the Kube API server is modified to validate the Common Name (CN) of the kubelet's serving certificate is equal to `system:node:<nodename>`. +We propose that the Kube API server is modified to validate the Common Name (CN) of the kubelet's serving certificate to be equal to `system:node:<nodename>`. `nodename` is the name of the Node object as reported by the kubelet. When the Kube-API server connects to the kubelet server (e.g. for logs, exec, port-forward), it always knows the Node it's connecting to. ### User Stories (Optional) @@ -148,7 +148,7 @@ Before enabling this feature on clusters with custom kubelet serving certificate ### Enabling the feature We will introduce a feature flag `KubeletCertCNValidation` that will gate the usage of the new validation. -This gate will start off by default in Alpha, will be turned on by default in Beta and will be removed in GA. +This gate will start disabled by default in Alpha, will be turned on by default in Beta and will be removed in GA. In addition, the validation will be opt-in and enabled through a new command-line flag `--enable-kubelet-cert-cn-validation`. This flag can only be set if the `KubeletCertCNValidation` feature flag is enabled and if `--kubelet-certificate-authority` is set. @@ -157,7 +157,7 @@ Making the feature opt-in maintains compatibility with existing clusters using c #### Metrics -In order to help cluster administrators determine if it's safe to enable the feature, we propose to add a new metric `kube_apiserver_validation_kubelet_cert_cn_total`. We will have two labels `success` and `failure`, allowing to track the number of errors due to the new CN validation. +In order to help cluster administrators determine if it's safe to enable the feature, we propose to add a new metric `kube_apiserver_validation_kubelet_cert_cn_total`.
We will have two labels `success` and `failure`, allowing us to track the number of errors due to the new CN validation. In addition, we will log the error including the node name, so cluster administrators can identify which nodes are affected and need to reissue their certificates. If the feature gate is disabled or if `--kubelet-certificate-authority` is not set, we won't publish the metric or run any validation code at all. @@ -170,9 +170,9 @@ The purpose of the metric is to easily/cheaply tell administrators if they can f ### TLS insecure Currently, if the Kube-API server is not configured with a `--kubelet-certificate-authority` the TLS client for kubelet server will skip the server certificate validation. -Additionally, `logs` requests allow to configure `InsecureSkipTLSVerifyBackend` per request to skip the server certificate validation. +Additionally, `logs` requests allow configuring `InsecureSkipTLSVerifyBackend` per request to skip the server certificate validation. -To align with this behavior, we won't allow to enable the validation if `--kubelet-certificate-authority` is not set and we won't execute the CN validation if `InsecureSkipTLSVerifyBackend` is set to true. +To align with this behavior, we won't allow enabling the validation if `--kubelet-certificate-authority` is not set and we won't execute the CN validation if `InsecureSkipTLSVerifyBackend` is set to true. ### Test Plan @@ -198,10 +198,10 @@ On top of testing the validation itself, we will test that: ##### Integration tests Integration tests will be added to ensure the following: -* Validation for custom certificates works if feature flag is not enabled. -* Validation for custom certificates works if feature flag enabled and `--enable-kubelet-cert-cn-validation` is not set or set to false. -* Validation for custom certificates fails if feature flag enabled, `--kubelet-certificate-authority` is set and `--enable-kubelet-cert-cn-validation` is set to true. 
-* Validation for kubernetes issued certificates works if feature flag enabled, `--kubelet-certificate-authority` is set and `--enable-kubelet-cert-cn-validation` is set to true. +* Validation for custom certificates works if the feature flag is not enabled. +* Validation for custom certificates works if the feature flag is enabled and `--enable-kubelet-cert-cn-validation` is not set or set to false. +* Validation for custom certificates fails if the feature flag is enabled, `--kubelet-certificate-authority` is set and `--enable-kubelet-cert-cn-validation` is set to true. +* Validation for kubernetes issued certificates works if the feature flag is enabled, `--kubelet-certificate-authority` is set and `--enable-kubelet-cert-cn-validation` is set to true. ##### e2e tests @@ -255,7 +255,7 @@ Enabling the validation does change the default certificate validation behavior. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? -Yes, the feature can be disabled once enabled by just setting the command-line flag to true. +Yes, the feature can be disabled once enabled by not setting the command-line flag. ###### What happens if we reenable the feature if it was previously rolled back? @@ -275,7 +275,7 @@ Already running workloads won't be impacted but cluster users won't be able to a ###### What specific metrics should inform a rollback? -`kube_apiserver_validation_kubelet_cert_cn_total` can help inform a rollback. A non-zero value for the `failure` label will require invetsigation: if the rejected requests are going to legitimate nodes, the feature should be rolled back until kuebeler serving certificates are reissued. +`kube_apiserver_validation_kubelet_cert_cn_total` can help inform a rollback. A non-zero value for the `failure` label will require investigation: if the rejected requests are going to legitimate nodes, the feature should be rolled back until kubelet serving certificates are reissued. 
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? @@ -291,8 +291,8 @@ No. ###### How can an operator determine if the feature is in use by workloads? The cluster administrators can check the flags passed to the kube-apiserver if they have access to the control plane nodes. -If the `--enable-kubelet-cert-cn-validation` flag set to true, the feature is being used. -Alternatively the can check the `kubernetes_feature_enabled` metric. +If the `--enable-kubelet-cert-cn-validation` flag is set to true, the feature is being used. +Alternatively, they can check the `kubernetes_feature_enabled` metric. ###### How can someone using this feature know that it is working for their instance? @@ -302,7 +302,7 @@ Alternatively the can check the `kubernetes_feature_enabled` metric. ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? The average `apiserver_request_duration_seconds` for logs/exec/port-forward requests is within reasonable limits. -A raising value after enabling this feature could signal overhead introduced by the extra validation. +A rising value after enabling this feature could signal overhead introduced by the extra validation. In addition, the number of TLS connections made from API server to nodes should not increase. @@ -366,7 +366,7 @@ It's part of the API server, so the feature will be unavailable. - [API server can't connect to Nodes with custom kubelet serving certificates that don't follow the `system:node:` convention] - Detection: `kubectl logs` returns a certificate validation error. - - Mitigations: disable the validation byt not setting `--enable-kubelet-cert-cn-validation` flag. + - Mitigations: disable the validation by not setting `--enable-kubelet-cert-cn-validation` flag. - Diagnostics: error is returned by the API server, no additional logging needed. - Testing: We will have tests for this, this is basically testing that the feature works. 
@@ -374,7 +374,7 @@ It's part of the API server, so the feature will be unavailable. ## Implementation History -* Implemenation options discussion: https://docs.google.com/document/d/1RqhAkGov_coHsB3lbAo-qfQl1MOfYvgpPUjiGMJ_3PY +* Implementation options discussion: https://docs.google.com/document/d/1RqhAkGov_coHsB3lbAo-qfQl1MOfYvgpPUjiGMJ_3PY ## Drawbacks From 9084d3549005cbfd3b555fd8efa1a23310ab5d68 Mon Sep 17 00:00:00 2001 From: Guillermo Gaston Date: Tue, 17 Jun 2025 16:01:50 -0500 Subject: [PATCH 14/14] Address PRR review comments --- .../README.md | 18 ++++++++++-------- .../kep.yaml | 8 ++++++-- 2 files changed, 16 insertions(+), 10 deletions(-) diff --git a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md index 048d742ce74..215d3f7e9a9 100644 --- a/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md +++ b/keps/sig-auth/4872-harden-kubelet-cert-validation/README.md @@ -59,18 +59,18 @@ checklist items _must_ be updated for the enhancement to be released. Items marked with (R) are required *prior to targeting to a milestone / release*. 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) - [ ] (R) KEP approvers have approved the KEP status as `implementable` -- [ ] (R) Design details are appropriately documented +- [x] (R) Design details are appropriately documented - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) - [ ] e2e Tests for all Beta API Operations (endpoints) - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free -- [ ] (R) Graduation criteria is in place - - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) -- [ ] (R) Production readiness review completed -- [ ] (R) Production readiness review approved -- [ ] "Implementation History" section is up-to-date for milestone +- [x] (R) Graduation criteria is in place + - [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [x] (R) Production readiness review completed +- [x] (R) Production readiness review approved +- [x] "Implementation History" section is up-to-date for milestone - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes 
@@ -374,7 +374,9 @@ It's part of the API server, so the feature will be unavailable. ## Implementation History -* Implementation options discussion: https://docs.google.com/document/d/1RqhAkGov_coHsB3lbAo-qfQl1MOfYvgpPUjiGMJ_3PY +* 2024-10-08: KEP created +* 2025-05-08: Implementation options discussion: https://docs.google.com/document/d/1RqhAkGov_coHsB3lbAo-qfQl1MOfYvgpPUjiGMJ_3PY + ## Drawbacks diff --git a/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml b/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml index 0c9d945d06b..fb56c1d1181 100644 --- a/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml +++ b/keps/sig-auth/4872-harden-kubelet-cert-validation/kep.yaml @@ -7,9 +7,11 @@ participating-sigs: status: implementable creation-date: 2024-09-24 reviewers: - - TBD + - "@enj" + - "@liggitt" approvers: - - TBD + - "@enj" + - "@liggitt" see-also: replaces: @@ -38,3 +40,5 @@ disable-supported: true # The following PRR answers are required at beta release metrics: + - "kube_apiserver_validation_kubelet_cert_cn_total" + - "apiserver_request_duration_seconds"
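A minimal sketch of the CN check described in the README changes of this series, offered as an illustration and not as the KEP's implementation. It assumes the expected CN is the `system:node:` prefix followed by the Node name, that the check is attached through the `VerifyConnection` hook of Go's standard `crypto/tls` (the series does not specify the mechanism), and that a `record` callback stands in for incrementing `kube_apiserver_validation_kubelet_cert_cn_total` with a `success` or `failure` label. The helper names `checkNodeCN`, `evaluateNodeCN`, and `withNodeCNCheck` and the `enforce` parameter (standing in for `--enable-kubelet-cert-cn-validation`) are hypothetical.

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// nodeCNPrefix is the Common Name prefix used by Kubernetes-issued kubelet
// serving certificates.
const nodeCNPrefix = "system:node:"

// checkNodeCN returns an error unless cn is exactly "system:node:" followed by
// the name of the Node the API server intends to reach.
func checkNodeCN(cn, nodeName string) error {
	if want := nodeCNPrefix + nodeName; cn != want {
		return fmt.Errorf("kubelet serving certificate CN %q does not match %q", cn, want)
	}
	return nil
}

// evaluateNodeCN runs the check, reports "success" or "failure" through record
// (a stand-in for the two metric labels), and returns the error only when
// enforce is true. With enforce false the failure is counted but not returned.
func evaluateNodeCN(cn, nodeName string, enforce bool, record func(result string)) error {
	err := checkNodeCN(cn, nodeName)
	if err == nil {
		record("success")
		return nil
	}
	record("failure")
	if enforce {
		return err
	}
	return nil
}

// withNodeCNCheck is a hypothetical helper. It clones base (which must be
// non-nil) and installs the CN check as a VerifyConnection callback, so the
// check runs as part of the same TLS handshake as chain verification.
func withNodeCNCheck(base *tls.Config, nodeName string, enforce bool, record func(result string)) *tls.Config {
	cfg := base.Clone()
	cfg.VerifyConnection = func(cs tls.ConnectionState) error {
		if len(cs.PeerCertificates) == 0 {
			return fmt.Errorf("kubelet presented no serving certificate")
		}
		return evaluateNodeCN(cs.PeerCertificates[0].Subject.CommonName, nodeName, enforce, record)
	}
	return cfg
}
```

With `enforce` set to false, a mismatching certificate only records `failure` and the handshake continues, which mirrors the report-only behavior described under Metrics. Following the "TLS insecure" section, a caller would presumably not install this hook when `--kubelet-certificate-authority` is unset or when `InsecureSkipTLSVerifyBackend` is true.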