From 2fb4744510be731b0087b9763291a3a43e2a554a Mon Sep 17 00:00:00 2001 From: David Porter Date: Mon, 21 Sep 2020 21:28:29 +0000 Subject: [PATCH 1/3] Add initial graceful shutdown KEP --- .../2000-graceful-node-shutdown/README.md | 866 ++++++++++++++++++ .../2000-graceful-node-shutdown/kep.yaml | 34 + 2 files changed, 900 insertions(+) create mode 100644 keps/sig-node/2000-graceful-node-shutdown/README.md create mode 100644 keps/sig-node/2000-graceful-node-shutdown/kep.yaml diff --git a/keps/sig-node/2000-graceful-node-shutdown/README.md b/keps/sig-node/2000-graceful-node-shutdown/README.md new file mode 100644 index 00000000000..bae96a2164d --- /dev/null +++ b/keps/sig-node/2000-graceful-node-shutdown/README.md @@ -0,0 +1,866 @@ + +# KEP-2000: Graceful Node Shutdown + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Background on Linux Shutdown](#background-on-linux-shutdown) + - [Background on Inhibitors](#background-on-inhibitors) + - [Implementation](#implementation) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input +- [ ] (R) Graduation criteria is in place +- [ ] (R) Production readiness review completed +- [ ] Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + + + +Kubelet should be aware of node shutdown and trigger graceful shutdown of pods +during a machine shutdown. 

## Motivation

Users and cluster administrators expect that pods will adhere to the expected
pod lifecycle, including pod termination. Currently, when a node shuts down,
pods do not follow the expected pod termination lifecycle and are not
terminated gracefully, which can cause issues for some workloads. This KEP aims
to address this problem by making the kubelet aware of the underlying node
shutdown. The kubelet will propagate this signal to pods, ensuring they can
shut down as gracefully as possible.

### Goals

* Make the kubelet aware of the underlying node shutdown event and trigger pod
  termination with a sufficient grace period to shut down properly
* Handle node shutdown in a cloud-provider-agnostic way
* Introduce minimal shutdown delay so that the node shuts down as soon as
  possible (but not sooner)
* Focus on handling shutdown on systemd based machines

### Non-Goals

* Let users modify or change the existing pod lifecycle or introduce new
  inter-pod dependencies / shutdown ordering
* Support every Linux init and ACPI event handling mechanism (focus on the
  widely used logind from systemd)
* Guarantee graceful node shutdown in all cases; for example, an abrupt
  shutdown or a sudden power cable pull cannot result in a graceful shutdown

## Proposal

### User Stories (Optional)

#### Story 1

* As a cluster administrator, I can configure the nodes in my cluster to
  allocate X seconds for my pods to terminate gracefully during a node
  shutdown

#### Story 2

* As a developer, I can expect that my pods will terminate gracefully during
  node shutdowns

### Background on Linux Shutdown

In the context of this KEP, shutdown refers to shutdown of the underlying
machine. On most Linux distros, shutdown can be initiated via a variety of
methods, for example:

1. `shutdown -h now`
2. `shutdown -h +30` `# schedule a delayed shutdown in 30 minutes`
3. `systemctl poweroff`
4. Physically pressing the power button on the machine
5. If a machine is a VM, the underlying hypervisor can press the “virtual”
   power button
6. For a cloud instance, stopping the instance via Cloud API, e.g. via `gcloud
   compute instances stop`. Depending on the cloud provider, this may result in
   a virtual power button press by the underlying hypervisor.

Some of these cases will involve the machine receiving an ACPI event to change
the power state. The machine can go from G0 (working state) to G2 (Soft Off)
and finally to G3 (Off) ([more info on ACPI
states](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface)).
On Linux, prior to shutdown a system daemon will usually listen for these
events and perform a series of actions before userspace calls the
[reboot(2)](https://man7.org/linux/man-pages/man2/reboot.2.html) system call
with `LINUX_REBOOT_CMD_POWER_OFF` or `LINUX_REBOOT_CMD_HALT` to actually shut
down the machine.

Historically, ACPI events were often handled by the
[acpid](https://wiki.archlinux.org/index.php/Acpid) daemon, which uses a
variety of mechanisms to watch ACPI events (e.g. reading `/proc/acpi/event` or
`/dev/input/eventX` to react to power button presses). However, in most modern
Linux distros today,
[systemd-logind](https://www.freedesktop.org/wiki/Software/systemd/logind/)
has taken over as the main component [reacting to ACPI
events](https://wiki.archlinux.org/index.php/Power_management#ACPI_events) and
initiating shutdown of the machine.
On a system with systemd-logind, for example, pressing the power button will
result in the systemd
[poweroff](https://www.freedesktop.org/software/systemd/man/systemd-halt.service.html)
target being run (see
[HandlePowerKey](https://www.freedesktop.org/software/systemd/man/logind.conf.html)),
which will terminate all the systemd services running on the machine and
eventually shut it down. However, in the context of Kubernetes, systemd is not
aware of the pods and containers running on the machine and will simply kill
them as regular Linux processes.

### Background on Inhibitors

`systemd-logind` provides the ability for applications to delay shutdown and
perform a series of actions before the shutdown completes, through a mechanism
called ["Inhibitor
Locks"](https://www.freedesktop.org/wiki/Software/systemd/inhibit/).
Applications request to delay shutdown by taking an inhibitor lock, which is
obtained by sending messages to logind over dbus. For delay-based locks,
applications can request up to `InhibitDelayMaxSec` (a setting configured in
`logind.conf`); such locks allow applications to receive sleep and shutdown
events and block the shutdown from proceeding for up to `InhibitDelayMaxSec`
while they execute critical work prior to shutdown/sleep. Inhibitor Locks were
introduced in [systemd 183](https://lwn.net/Articles/499480/) (released in
2012).

We believe that making use of systemd is a reasonable approach considering that
almost all popular modern Linux distros are systemd based (RHEL, Google COS,
Ubuntu, CentOS, Debian, Fedora, Flatcar Linux; see widespread
[adoption](https://en.wikipedia.org/wiki/Systemd#Adoption)) and that inhibitor
support has been available since [systemd
183](https://lwn.net/Articles/499480/) (released in 2012).

Thanks to @giuseppe for helping with getting systemd inhibitors working!

### Implementation

Introduce a new Kubelet Config setting, `kubeletConfig.ShutdownGracePeriod`,
defaulting to 0 seconds. Upon kubelet startup:

* if the setting is greater than 0 seconds,
    * the kubelet will query the current `InhibitDelayMaxSec` over dbus to
      check whether `kubeletConfig.ShutdownGracePeriod <= InhibitDelayMaxSec`.
* if `kubeletConfig.ShutdownGracePeriod` > `InhibitDelayMaxSec`,
    * the kubelet will attempt to raise the `InhibitDelayMaxSec` setting by
      writing a config file to `/etc/systemd/logind.conf.d/kubelet.conf` and
      sending a SIGHUP to logind to reload the configuration, ensuring that
      `InhibitDelayMaxSec` is equal to the `ShutdownGracePeriod` from the
      kubelet config.

After updating `InhibitDelayMaxSec` on the node if needed, the kubelet will
query dbus for the final value of `InhibitDelayMaxSec` set on the node and
treat min(`InhibitDelayMaxSec`, `kubeletConfig.ShutdownGracePeriod`) as the
allocatable shutdown grace period, which will be referred to in this KEP as
`ShutdownGracePeriod`.

The kubelet will register with dbus a delay-mode systemd inhibitor lock of
`ShutdownGracePeriod` for the shutdown event. The kubelet will also subscribe
to the `PrepareForShutdown` signal, which is emitted just before the shutdown
begins. Upon receiving the signal, the kubelet will have an additional
`ShutdownGracePeriod` of time before the node actually initiates the shutdown.

**Handling the shutdown**

When a shutdown occurs, the kubelet will gracefully terminate all the pods
running on the node and update the Ready condition of the node to false with a
message `Node Shutting Down`, thereby ensuring new workloads will not get
scheduled to the node.
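
To make the dbus interaction described above more concrete, the following is a
minimal, illustrative Go sketch (not the actual kubelet implementation) of
taking a delay-mode inhibitor lock and reacting to logind's
`PrepareForShutdown` signal using the `github.com/godbus/dbus/v5` package
referenced later in this KEP. The hard-coded grace period and the
`terminatePods` helper are placeholders for the kubelet logic described here.

```
package main

import (
	"fmt"
	"os"
	"time"

	"github.com/godbus/dbus/v5"
)

// shutdownGracePeriod stands in for the resolved kubelet ShutdownGracePeriod.
const shutdownGracePeriod = 30 * time.Second

func main() {
	conn, err := dbus.SystemBus()
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to connect to system bus:", err)
		os.Exit(1)
	}
	logind := conn.Object("org.freedesktop.login1", "/org/freedesktop/login1")

	// Read the currently effective InhibitDelayMaxSec (exposed over dbus in
	// microseconds as the InhibitDelayMaxUSec property).
	if prop, err := logind.GetProperty("org.freedesktop.login1.Manager.InhibitDelayMaxUSec"); err == nil {
		if usec, ok := prop.Value().(uint64); ok {
			fmt.Println("logind InhibitDelayMaxSec:", time.Duration(usec)*time.Microsecond)
		}
	}

	// Take a delay-mode inhibitor lock for shutdown. logind returns a file
	// descriptor; the lock is held until that descriptor is closed.
	var fd dbus.UnixFD
	err = logind.Call("org.freedesktop.login1.Manager.Inhibit", 0,
		"shutdown", "kubelet", "Graceful pod termination", "delay").Store(&fd)
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to take inhibitor lock:", err)
		os.Exit(1)
	}
	lock := os.NewFile(uintptr(fd), "inhibitor-lock")

	// Subscribe to the PrepareForShutdown signal emitted by logind.
	if err := conn.AddMatchSignal(
		dbus.WithMatchInterface("org.freedesktop.login1.Manager"),
		dbus.WithMatchMember("PrepareForShutdown"),
		dbus.WithMatchObjectPath("/org/freedesktop/login1"),
	); err != nil {
		fmt.Fprintln(os.Stderr, "failed to subscribe to PrepareForShutdown:", err)
		os.Exit(1)
	}
	signals := make(chan *dbus.Signal, 1)
	conn.Signal(signals)

	for sig := range signals {
		if sig.Name != "org.freedesktop.login1.Manager.PrepareForShutdown" || len(sig.Body) == 0 {
			continue
		}
		// The signal carries a single bool: true when a shutdown is starting.
		if starting, ok := sig.Body[0].(bool); ok && starting {
			terminatePods(shutdownGracePeriod) // placeholder for the kubelet's pod termination
			lock.Close()                       // release the lock so shutdown can proceed
			return
		}
	}
}

// terminatePods is a placeholder for the kubelet logic that gracefully stops
// pods within the allocated grace period.
func terminatePods(gracePeriod time.Duration) {
	fmt.Printf("terminating pods within %v\n", gracePeriod)
}
```

In the kubelet, the grace period would instead be the resolved
`ShutdownGracePeriod` (the min of `InhibitDelayMaxSec` and
`kubeletConfig.ShutdownGracePeriod` described above), and the lock would be
released only once pod termination has completed or the grace period has
elapsed.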

Since some of the pods running on the node are often critical to the other
workloads running on the node (e.g. a logging daemonset, kube-proxy, kube-dns),
we choose to split the pods running on the node into two categories, “critical
system pods” and regular pods. Critical system pods should be terminated last
because, for example, if the logging pod is terminated first, logs from the
other workloads will not be captured. Critical system pods are
identified as those that are in the `system-cluster-critical` or
`system-node-critical` [priority
classes](https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical).

Upon shutdown Kubelet will:

1. Gracefully terminate all non critical system pods with a gracePeriodOverride
computed as `min(podSpec.terminationGracePeriodSeconds, ShutdownGracePeriod)`
2. Gracefully terminate all critical system pods with gracePeriodOverride of 2
seconds

Kubelet will use the same existing
[killPod](https://github.com/kubernetes/kubernetes/blob/release-1.19/pkg/kubelet/pod_workers.go#L292)
function to perform the termination of pods, using `gracePeriodOverride` to set
the appropriate grace period. During the termination process, normal [pod
termination](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination)
processes will apply, e.g. preStopHooks will be called, SIGTERM to containers
delivered, etc.

2 seconds as gracePeriodOverride for critical system pods was decided to ensure
that they can also perform a graceful shutdown and 2 seconds is currently
[defined](https://github.com/kubernetes/kubernetes/blob/release-1.19/pkg/kubelet/kuberuntime/kuberuntime_container.go#L626-L629)
as the minimum grace period defined in the kubelet.

POC: I’ve prototyped an initial POC
[here](https://github.com/bobbypage/kubernetes/tree/shutdown) of the proposed
implementation on the `shutdown` branch.

### Notes/Constraints/Caveats (Optional)

### Risks and Mitigations

* Kubelet does not receive shutdown event or is able to create inhibitor lock
    * Mitigation: Kubelet does not provide graceful shutdown to pods (same as
      today’s existing behavior)
* Kubelet is unable to update `InhibitDelayMaxSec` in logind to match that of
  `kubeletConfig.ShutdownGracePeriod`
    * If there are multiple logind configuration file overrides in
      `/etc/systemd/logind.conf.d/`, logind will use the config file with the
      lexicographically latest name. As a result, in rare cases the kubelet’s
      `InhibitDelayMaxSec` conf file override may be overwritten by another
      config file (possibly placed by another service on the machine).
    * Mitigation: Kubelet will use the current value of `InhibitDelayMaxSec`
      from logind as the shutdown period, which may be less than
      `kubeletConfig.ShutdownGracePeriod`.
* OS / Distro does not use systemd or systemd version < 183
    * Mitigation: Kubelet will not provide graceful shutdown to pods (same as
      today’s existing behavior).

## Design Details

The design proposes adding a new KubeletConfig field `ShutdownGracePeriod` used
to specify total time period kubelet should delay shutdown by and thus total time
allocated to the graceful termination process.

```
type KubeletConfiguration struct {
    ...
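    // ShutdownGracePeriod specifies the total duration by which the node
    // should delay shutdown, i.e. the total time allocated to graceful pod
    // termination. Defaults to 0, which disables the feature.
    // (Descriptive comment added for clarity; it is not part of the original
    // proposed struct.)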
+ ShutdownGracePeriod metav1.Duration +} +``` + +Communication with systemd over dbus for (creating inhibitor lock, receiving +`PrepareForShutdown` callback, etc), will make use of the +`github.com/godbus/dbus/v5` package which is already included in +[`vendor/`](https://github.com/kubernetes/kubernetes/tree/release-1.19/vendor/github.com/godbus/dbus/v5). + +Termination of pods will make use of the existing +[killPod](https://github.com/kubernetes/kubernetes/blob/release-1.19/pkg/kubelet/pod_workers.go#L292)function +from the `kubelet` package and specify the appropriate `gracePeriodOverride` as +necessary. + +### Test Plan + + + +* Unit tests for kubelet of handling shutdown event +* New E2E tests to validate node graceful shutdown (note limitation that K8S + E2E tests currently only run on GCE). + * Shutdown grace period unspecified, feature is not active + * Pod’s ExecStop and SIGTERM handlers are given gracePeriodSeconds for + case when gracePeriodSeconds <= kubeletConfig.ShutdownGracePeriod + * Pod’s ExecStop and SIGTERM handlers are given + kubeletConfig.ShutdownGracePeriod for case when gracePeriodSeconds > + kubeletConfig.ShutdownGracePeriod + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +n/a + +### Version Skew Strategy + + + +n/a + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + +_This section must be completed when targeting alpha to a release._ + +* **How can this feature be enabled / disabled in a live cluster?** + - [ ] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: `GracefulNodeShutdown` + - Components depending on the feature gate: + - `kubelet` + - [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - no + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). + - yes (will require restart of kubelet) + +* **Does enabling the feature change any default behavior?** + Any change of default behavior may be surprising to users or break existing + automations, so be extremely careful here. + + * The main behavior change is that during a node shutdown, pods running on + the node will be terminated gracefully. + +* **Can the feature be disabled once it has been enabled (i.e. can we roll back + the enablement)?** + Also set `disable-supported` to `true` or `false` in `kep.yaml`. + Describe the consequences on existing workloads (e.g., if this is a runtime + feature, can it break the existing applications?). + + * Yes, the feature can be disabled by either disabling the feature gate, or + setting `kubeletConfig.ShutdownGracePeriod` to 0 seconds. + +* **What happens if we reenable the feature if it was previously rolled back?** + + * Kubelet will attempt to perform graceful termination of pods during a + node shutdown. + +* **Are there any tests for feature enablement/disablement?** + The e2e framework does not currently support enabling or disabling feature + gates. However, unit tests in each component dealing with managing data, created + with and without the feature, are necessary. At the very least, think about + conversion tests if API types are being modified. + + * n/a + +### Rollout, Upgrade and Rollback Planning + +_This section must be completed when targeting beta graduation to a release._ + +* **How can a rollout fail? 
Can it impact already running workloads?** + Try to be as paranoid as possible - e.g., what if some components will restart + mid-rollout? + +* **What specific metrics should inform a rollback?** + +* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?** + Describe manual testing that was done and the outcomes. + Longer term, we may want to require automated upgrade/rollback tests, but we + are missing a bunch of machinery and tooling and can't do that now. + +* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, +fields of API types, flags, etc.?** + Even if applying deprecation policies, they may still surprise some users. + +### Monitoring Requirements + +_This section must be completed when targeting beta graduation to a release._ + +* **How can an operator determine if the feature is in use by workloads?** + Ideally, this should be a metric. Operations against the Kubernetes API (e.g., + checking if there are objects with field X set) may be a last resort. Avoid + logs or events for this purpose. + +* **What are the SLIs (Service Level Indicators) an operator can use to determine +the health of the service?** + - [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: + - [ ] Other (treat as last resort) + - Details: + +* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** + At a high level, this usually will be in the form of "high percentile of SLI + per day <= X". It's impossible to provide comprehensive guidance, but at the very + high level (needs more precise definitions) those may be things like: + - per-day percentage of API calls finishing with 5XX errors <= 1% + - 99% percentile over day of absolute value from (job creation time minus expected + job creation time) for cron job <= 10% + - 99,9% of /health requests per day finish with 200 code + +* **Are there any missing metrics that would be useful to have to improve observability +of this feature?** + Describe the metrics themselves and the reasons why they weren't added (e.g., cost, + implementation difficulties, etc.). + +### Dependencies + +_This section must be completed when targeting beta graduation to a release._ + +* **Does this feature depend on any specific services running in the cluster?** + Think about both cluster-level services (e.g. metrics-server) as well + as node-level agents (e.g. specific version of CRI). Focus on external or + optional services that are needed. For example, if this feature depends on + a cloud provider API, or upon an external software-defined storage or network + control plane. + + For each of these, fill in the following—thinking about running existing user workloads + and creating new ones, as well as about cluster-level services (e.g. DNS): + - [Dependency name] + - Usage description: + - Impact of its outage on the feature: + - Impact of its degraded performance or high-error rates on the feature: + + +### Scalability + +_For alpha, this section is encouraged: reviewers should consider these questions +and attempt to answer them._ + +_For beta, this section is required: reviewers must answer these questions._ + +_For GA, this section is required: approvers should be able to confirm the +previous answers based on experience in the field._ + +* **Will enabling / using this feature result in any new API calls?** + Describe them, providing: + - API call type (e.g. PATCH pods) + - estimated throughput + - originating component(s) (e.g. 
Kubelet, Feature-X-controller) + focusing mostly on: + - components listing and/or watching resources they didn't before + - API calls that may be triggered by changes of some Kubernetes resources + (e.g. update of object X triggers new updates of object Y) + - periodic API calls to reconcile state (e.g. periodic fetching state, + heartbeats, leader election, etc.) + +* **Will enabling / using this feature result in introducing new API types?** + Describe them, providing: + - API type + - Supported number of objects per cluster + - Supported number of objects per namespace (for namespace-scoped objects) + +* **Will enabling / using this feature result in any new calls to the cloud +provider?** + +* **Will enabling / using this feature result in increasing size or count of +the existing API objects?** + Describe them, providing: + - API type(s): + - Estimated increase in size: (e.g., new annotation of size 32B) + - Estimated amount of new objects: (e.g., new Object X for every existing Pod) + +* **Will enabling / using this feature result in increasing time taken by any +operations covered by [existing SLIs/SLOs]?** + Think about adding additional work or introducing new steps in between + (e.g. need to do X to start a container), etc. Please describe the details. + +* **Will enabling / using this feature result in non-negligible increase of +resource usage (CPU, RAM, disk, IO, ...) in any components?** + Things to keep in mind include: additional in-memory state, additional + non-trivial computations, excessive access to disks (including increased log + volume), significant amount of data sent and/or received over network, etc. + This through this both in small and large cases, again with respect to the + [supported limits]. + +### Troubleshooting + +The Troubleshooting section currently serves the `Playbook` role. We may consider +splitting it into a dedicated `Playbook` document (potentially with some monitoring +details). For now, we leave it here. + +_This section must be completed when targeting beta graduation to a release._ + +* **How does this feature react if the API server and/or etcd is unavailable?** + +* **What are other known failure modes?** + For each of them, fill in the following information by copying the below template: + - [Failure mode brief description] + - Detection: How can it be detected via metrics? Stated another way: + how can an operator troubleshoot without logging into a master or worker node? + - Mitigations: What can be done to stop the bleeding, especially for already + running user workloads? + - Diagnostics: What are the useful log messages and their required logging + levels that could help debug the issue? + Not required until feature graduated to beta. + - Testing: Are there any tests for failure mode? If not, describe why. + +* **What steps should be taken if SLOs are not being met to determine the problem?** + +[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md +[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos + +## Implementation History + + + +* May 26 - [Original GH issue #91472 + filed](https://github.com/kubernetes/kubernetes/issues/91472) + +## Drawbacks + + + +## Alternatives + + + +* Use systemd cgroup driver to set `TimeoutStopSec=` on scopes underlying + containers + * Set `TimeStopSec=` for the container scopes using the value set in the pod + for termination grace period. 
The problem with this approach is that + systemd doesn’t understand the prestop hooks. +* Use systemd cgroup driver to set `Before=kubelet.service` on scopes + underlying containers + * Set `Before=kubelet.service` and container runtime service for the + container scopes. Systemd would then stop the containers after the + kubelet giving the kubelet a chance to stop the containers itself. This + depends upon using the systemd cgroups driver and is coupled to systemd. +* Use systemd cgroup driver to set controller property on scope to delegate + control to kubelet + * Set Controller dbus property for the container scopes and set + `After=kubelet.service` for the containers. Systemd would then signal the + kubelet over dbus to delegate the container scope termination. This + requires more work in the kubelet and is also coupled to systemd and the + systemd cgroup driver. +* Don’t handle node shutdown events at all, and have users drain nodes before + shutting them down. + * This is not always possible, for example if the shutdown is controlled by + some external system (e.g. Preemptible VMs). +* Avoid relying on systemd and logind and directly hook into ACPI events on the + node. + * Unfortunately, this can create conflicts because only one systemd daemon + should be monitoring ACPI events. Additionally, if the system is using + systemd but kubelet did not integrate with it, systemd by default would + terminate kubelet and other processes during a shutdown event. +* Provide more configuration options on how to split time during shutdown (e.g. + split between critical pods and user workloads). Need more feedback from the + community here. + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-node/2000-graceful-node-shutdown/kep.yaml b/keps/sig-node/2000-graceful-node-shutdown/kep.yaml new file mode 100644 index 00000000000..66fdc47c1cc --- /dev/null +++ b/keps/sig-node/2000-graceful-node-shutdown/kep.yaml @@ -0,0 +1,34 @@ +title: Graceful Node Shutdown +kep-number: 2000 +authors: + - "bobbypage" + - "mrunalp" +owning-sig: sig-node +status: provisional +creation-date: 2020-09-21 + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.20" + +# The milestone at which this feature was, or is targeted to be, at each stage. 
+milestone: + alpha: "v1.21" + beta: "x.y" + stable: "x.y" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: GracefulNodeShutdown + components: + - kubelet +disable-supported: true + +# The following PRR answers are required at beta release +metrics: + - "N/A" From e49f2ca9b5848cf976ae083c37bccd5682793d77 Mon Sep 17 00:00:00 2001 From: David Porter Date: Fri, 2 Oct 2020 00:08:28 +0000 Subject: [PATCH 2/3] Update based on feedback * Change ready status to false during node shutdown * Add note about new KubeletConfig option, `ShutdownGracePeriodCriticalPods`, to configure shutdown gracePeriod for critical pods * Update status to implementable --- .../2000-graceful-node-shutdown/README.md | 34 +++++++++++++------ .../2000-graceful-node-shutdown/kep.yaml | 2 +- 2 files changed, 25 insertions(+), 11 deletions(-) diff --git a/keps/sig-node/2000-graceful-node-shutdown/README.md b/keps/sig-node/2000-graceful-node-shutdown/README.md index bae96a2164d..24928167161 100644 --- a/keps/sig-node/2000-graceful-node-shutdown/README.md +++ b/keps/sig-node/2000-graceful-node-shutdown/README.md @@ -364,23 +364,26 @@ classes](https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduli Upon shutdown Kubelet will: -1. Gracefully terminate all non critical system pods with a gracePeriodOverride -computed as `min(podSpec.terminationGracePeriodSeconds, ShutdownGracePeriod)` -2. Gracefully terminate all critical system pods with gracePeriodOverride of 2 -seconds +1. Update the Node's `Ready` condition to `false`, with the reason `Node is + shutting down` +2. Gracefully terminate all non critical system pods with a gracePeriodOverride + computed as `min(podSpec.terminationGracePeriodSeconds, + ShutdownGracePeriod-ShutdownGracePeriodCriticalPods)` +3. Gracefully terminate all critical system pods with gracePeriodOverride of + `ShutdownGracePeriodCriticalPods` seconds Kubelet will use the same existing [killPod](https://github.com/kubernetes/kubernetes/blob/release-1.19/pkg/kubelet/pod_workers.go#L292) function to perform the termination of pods, using `gracePeriodOverride` to set the appropriate grace period. During the termination process, normal [pod termination](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination) -processes will apply, e.g. preStopHooks will be called, SIGTERM to containers +processes will apply, e.g. preStop Hooks will be called, `SIGTERM` to containers delivered, etc. -2 seconds as gracePeriodOverride for critical system pods was decided to ensure -that they can also perform a graceful shutdown and 2 seconds is currently -[defined](https://github.com/kubernetes/kubernetes/blob/release-1.19/pkg/kubelet/kuberuntime/kuberuntime_container.go#L626-L629) -as the minimum grace period defined in the kubelet. +To ensure `gracePeriodOverride` is respected, Github issue +[#92432](https://github.com/kubernetes/kubernetes/issues/92432) should also be +addressed to ensure that `gracePeriod` override will be respected for `preStop` +hooks. POC: I’ve prototyped an initial POC [here](https://github.com/bobbypage/kubernetes/tree/shutdown) of the proposed @@ -412,7 +415,10 @@ Consider including folks who also work outside the SIG or subproject. * Kubelet does not receive shutdown event or is able to create inhibitor lock * Mitigation: Kubelet does not provide graceful shutdown to pods (same as - today’s existing behavior) + today’s existing behavior). 
For alpha stage, to track shutdown behavior + and if it was successful, we plan to add a debugging log statement just + prior to kubelet's shutdown process being completed, so it's possible + to verify if kubelet shutdown the node gracefully. * Kubelet is unable to update `InhibitDelayMaxSec` in logind to match that of `kubeletConfig.ShutdownGracePeriod` * If there are multiple logind configuration file overrides in @@ -440,10 +446,18 @@ The design proposes adding a new KubeletConfig field `ShutdownGracePeriod` used to specify total time period kubelet should delay shutdown by and thus total time allocated to the graceful termination process. +In addition to `ShutdownGracePeriod`, another KubeletConfig field will be added +`ShutdownGracePeriodCriticalPods`. During the shutdown, the +`ShutdownGracePeriod-ShutdownGracePeriodCriticalPods` duration will be grace +period for non critical system pods like user workloads, while the remaining +time of `ShutdownGracePeriodCriticalPods` will be the grace period for critical +pods like node logging daemonsets. + ``` type KubeletConfiguration struct { ... ShutdownGracePeriod metav1.Duration + ShutdownGracePeriodCriticalPods metav1.Duration } ``` diff --git a/keps/sig-node/2000-graceful-node-shutdown/kep.yaml b/keps/sig-node/2000-graceful-node-shutdown/kep.yaml index 66fdc47c1cc..6501abf9c26 100644 --- a/keps/sig-node/2000-graceful-node-shutdown/kep.yaml +++ b/keps/sig-node/2000-graceful-node-shutdown/kep.yaml @@ -4,7 +4,7 @@ authors: - "bobbypage" - "mrunalp" owning-sig: sig-node -status: provisional +status: implementable creation-date: 2020-09-21 # The target maturity stage in the current dev cycle for this KEP. From f419e61b560dca382e2fe99e68de9906deba4329 Mon Sep 17 00:00:00 2001 From: David Porter Date: Fri, 2 Oct 2020 07:57:04 +0000 Subject: [PATCH 3/3] Add graduation criteria and update alpha milestone --- .../2000-graceful-node-shutdown/README.md | 27 ++++++++++++++----- .../2000-graceful-node-shutdown/kep.yaml | 2 +- 2 files changed, 22 insertions(+), 7 deletions(-) diff --git a/keps/sig-node/2000-graceful-node-shutdown/README.md b/keps/sig-node/2000-graceful-node-shutdown/README.md index 24928167161..0da8ddbb2f6 100644 --- a/keps/sig-node/2000-graceful-node-shutdown/README.md +++ b/keps/sig-node/2000-graceful-node-shutdown/README.md @@ -94,6 +94,8 @@ tags, and then generate with `hack/update-toc.sh`. 
- [Design Details](#design-details) - [Test Plan](#test-plan) - [Graduation Criteria](#graduation-criteria) + - [Alpha -> Beta Graduation](#alpha---beta-graduation) + - [Beta -> GA Graduation](#beta---ga-graduation) - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - [Version Skew Strategy](#version-skew-strategy) - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) @@ -530,15 +532,16 @@ Below are some examples to consider, in addition to the aforementioned [maturity #### Alpha -> Beta Graduation -* Addresses feedback from alpha testers -* Sufficient E2E and unit testing +- Gather feedback from developers and surveys +- Complete features A, B, C +- Tests are in Testgrid and linked in KEP #### Beta -> GA Graduation -* Addresses feedback from beta -* Sufficient number of users using the feature -* Confident that no further API / kubeletConfig configuration options changes are needed -* Close on any remaining open issues / bugs +- N examples of real-world usage +- N installs +- More rigorous forms of testing—e.g., downgrade tests and scalability tests +- Allowing time for feedback **Note:** Generally we also wait at least two releases between beta and GA/stable, because there's no opportunity for user feedback, or even bug reports, @@ -557,6 +560,18 @@ in back-to-back releases. [conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md --> +#### Alpha -> Beta Graduation + +* Addresses feedback from alpha testers +* Sufficient E2E and unit testing + +#### Beta -> GA Graduation + +* Addresses feedback from beta +* Sufficient number of users using the feature +* Confident that no further API / kubelet config configuration options changes are needed +* Close on any remaining open issues & bugs + ### Upgrade / Downgrade Strategy