Graduate "Forensic Container Checkpointing" to Beta

adrianreber · adrianreber · commit 7239c9935004 · 2024-02-06T16:22:16.000+01:00
As defined in the existing KEP, the steps to graduate from Alpha to Beta
are:

   At least one container engine has to have implemented the
   corresponding CRI APIs to introduce e2e test for checkpointing.

   - [ ] Enable the feature per default
   - [ ] No major bugs reported in the previous cycle

CRI-O implemented the corresponding CRI RPC and no major bugs
have been reported since the initial release in 1.25.

Signed-off-by: Adrian Reber <areber@redhat.com>
diff --git a/keps/prod-readiness/sig-node/2008.yaml b/keps/prod-readiness/sig-node/2008.yaml
@@ -1,3 +1,5 @@
 kep-number: 2008
 alpha:
   approver: "@ehashman"
+beta:
+  approver: "@deads2k"
diff --git a/keps/sig-node/2008-forensic-container-checkpointing/README.md b/keps/sig-node/2008-forensic-container-checkpointing/README.md
@@ -25,8 +25,11 @@
   - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
 - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
   - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
   - [Dependencies](#dependencies)
   - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
@@ -125,6 +128,10 @@ message CheckpointContainerRequest {
     string container_id = 1;
     // Location of the checkpoint archive used for export/import
     string location = 2;
+    // Timeout in seconds for the checkpoint to complete.
+    // Timeout of zero means to use the CRI default.
+    // Timeout > 0 means to use the user specified timeout.
+    int64 timeout = 3;
 }
 
 message CheckpointContainerResponse {}
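
The timeout semantics described in the comments of the new `timeout` field can be illustrated with a minimal sketch. The function name `effective_timeout`, the constant `DEFAULT_TIMEOUT_SECONDS` and its value are hypothetical; the real default is chosen by the CRI implementation, and rejecting negative values is an assumption, since the KEP only defines zero and positive values.

```python
# Hypothetical default; the real value is decided by the CRI implementation.
DEFAULT_TIMEOUT_SECONDS = 600


def effective_timeout(requested: int, default: int = DEFAULT_TIMEOUT_SECONDS) -> int:
    """Resolve the timeout (in seconds) for a checkpoint request.

    0   -> use the CRI default
    > 0 -> use the user-specified timeout
    < 0 -> rejected here (assumption: the KEP does not define negative values)
    """
    if requested < 0:
        raise ValueError("timeout must be zero or positive")
    if requested == 0:
        return default
    return requested
```

For example, `effective_timeout(0)` falls back to the default, while `effective_timeout(30)` keeps the requested 30 seconds.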
@@ -146,6 +153,16 @@ In its first implementation the risks are low as it tries to be a CRI API
 change with minimal changes to the kubelet and it is gated by the feature
 gate `ContainerCheckpoint`.
 
+One possible risk that was identified during Alpha is that the disk of
+the node requesting the checkpoints could fill up if too many checkpoints
+are created. One approach to solve this was some kind of garbage collection
+of checkpoint archives. A pull request to implement garbage collection
+was opened ([#115888](https://github.com/kubernetes/kubernetes/pull/115888))
+but during review it became clear that the kubelet might not be the right
+place to implement checkpoint archive garbage collection and the pull request
+was closed again. Currently the most likely solution seems to be to implement
+the garbage collection in an operator.
+
 ## Design Details
 
 The feature gate `ContainerCheckpoint` will ensure that the API
@@ -244,21 +261,41 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
   Once CRI implementation provide the relevant RPC calls
   the e2e tests will not fail but need to be extended.
 
+- Since the initial Alpha release, CRI-O supports the
+  `CheckpointContainer` CRI RPC and the tests have been
+  enhanced to support CRI implementations that implement
+  the `CheckpointContainer` CRI RPC.
+
+- After Kubernetes was released with the `CheckpointContainer` CRI RPC,
+  CRI-O was updated to support the new CRI RPC.
+  The tests have been enhanced to work with CRI implementations
+  that support the `CheckpointContainer` CRI RPC as well as
+  CRI implementations that do not support it. The tests also handle
+  the corresponding feature gate being either disabled or enabled:
+  <https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/checkpoint_container.go>
+
 ### Graduation Criteria
 
 #### Alpha
 
-- [ ] Implement the new feature gate and kubelet implementation
-- [ ] Ensure proper tests are in place
-- [ ] Update documentation to make the feature visible
+- [X] Implement the new feature gate and kubelet implementation
+- [X] Ensure proper tests are in place
+- [X] Update documentation to make the feature visible
+  - <https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/>
+  - <https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/>
+  - <https://kubernetes.io/blog/2023/03/10/forensic-container-analysis/>
 
 #### Alpha to Beta Graduation
 
-At least one container engine has to have implemented the
-corresponding CRI APIs to introduce e2e test for checkpointing.
+CRI-O as well as containerd have to have implemented the corresponding CRI APIs:
+
+- [x] CRI-O
+- [ ] containerd (<https://github.com/containerd/containerd/pull/6965>)
+
+In Kubernetes:
 
 - [ ] Enable the feature per default
-- [ ] No major bugs reported in the previous cycle
+- [x] No major bugs reported in the previous cycle
 
 #### Beta to GA Graduation
 
@@ -292,14 +329,94 @@ Checkpointing containers will be possible again.
 
 ###### Are there any tests for feature enablement/disablement?
 
-Currently no.
+Currently the test will automatically be skipped if the feature is not enabled.
+
+### Rollout, Upgrade and Rollback Planning
+
+Does not apply as the feature is an additional API endpoint with no
+dependencies on other functionality. If it is not enabled via the feature
+gate it will return `404 page not found`. If it is not enabled in the
+underlying container engine a `500` will be returned with an error
+message from the container engine. If it is enabled, the API endpoint exists;
+if it is disabled, the endpoint does not exist. No planning is necessary.
+
+Documented at <https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/>
+<!--
+This section must be completed when targeting beta to a release.
+-->
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+At this point it is still a kubelet-only API endpoint and has no dependencies
+on other components.
+
+###### What specific metrics should inform a rollback?
+
+The only metric is the return code from the API endpoint.
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+No, this does not seem to apply to this feature.
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+No.
+
+### Monitoring Requirements
+
+Querying the state of the feature gate shows whether the API endpoint
+will return `404`.
+
+<!--
+This section must be completed when targeting beta to a release.
+
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+-->
+
+###### How can an operator determine if the feature is in use by workloads?
+
+As it is not exposed in the Kubernetes API it cannot be determined. This is
+only visible in the kubelet.
+
+<!--
+Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
+checking if there are objects with field X set) may be a last resort. Avoid
+logs or events for this purpose.
+-->
+
+###### How can someone using this feature know that it is working for their instance?
+
+The kubelet API endpoint can return the following codes:
+
+- 200: checkpoint archive was successfully created
+- 404: feature is not enabled
+- 500: underlying container engine does not support checkpointing containers
+
+Documented at <https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/>
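
A client could map the return codes listed above to their meaning roughly as follows. This is a minimal sketch: the function name `describe_checkpoint_status` is hypothetical, and only the three codes named in this section are covered; any other code is reported as unexpected.

```python
# Meanings of the kubelet checkpoint endpoint return codes as listed in this KEP.
CHECKPOINT_STATUS_MEANINGS = {
    200: "checkpoint archive was successfully created",
    404: "feature is not enabled",
    500: "underlying container engine does not support checkpointing containers",
}


def describe_checkpoint_status(status_code: int) -> str:
    """Return a human-readable meaning for a checkpoint endpoint status code."""
    # Codes not listed in the KEP are reported as unexpected rather than guessed.
    return CHECKPOINT_STATUS_MEANINGS.get(
        status_code, f"unexpected status code {status_code}"
    )
```

A caller would pass the HTTP status of the response, for example `describe_checkpoint_status(404)`, and treat anything other than 200 as a failed checkpoint.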
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+Does not apply as the enhancement will only be called when requested. Not a service.
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+Does not apply as the enhancement will only be called when requested. Not a service.
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+There are no metrics.
 
 ### Dependencies
 
 CRIU needs to be installed on the node, but on most distributions it is already
 a dependency of runc/crun. It does not require any specific services on the
 cluster.
 
+###### Does this feature depend on any specific services running in the cluster?
+
+No. The container engine, however, must support the checkpoint CRI API call.
+
 ### Scalability
 
 ###### Will enabling / using this feature result in any new API calls?
@@ -334,6 +451,64 @@ Disk usage will overall increase by the used memory of the container and the cha
 Checkpoint archive written to disk can optionally be compressed. The current implementation
 does not compress the checkpoint archive on disk.
 
+To avoid running out of disk space an operator has been introduced: <https://github.com/checkpoint-restore/checkpoint-restore-operator>
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+During checkpointing each memory page will be written to disk. Disk usage will increase by
+the size of all memory pages in the checkpointed container. Each file in the container that
+has been changed compared to the original version will also be part of the checkpoint.
+Disk usage will overall increase by the used memory of the container and the changed files.
+Checkpoint archive written to disk can optionally be compressed. The current implementation
+does not compress the checkpoint archive on disk.
+
+To avoid running out of disk space an operator has been introduced: <https://github.com/checkpoint-restore/checkpoint-restore-operator>
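
The disk usage reasoning above (used memory plus changed files, without compression) can be turned into a rough estimate. This is only a sketch under stated assumptions: the helper names `estimate_checkpoint_bytes` and `fits_on_disk` are hypothetical, archive metadata overhead is ignored, and the caller is assumed to already know the container's memory usage and the sizes of its changed files.

```python
MIB = 1024 * 1024


def estimate_checkpoint_bytes(memory_bytes: int, changed_file_sizes: list[int]) -> int:
    """Estimate the uncompressed checkpoint size: memory pages plus changed files."""
    return memory_bytes + sum(changed_file_sizes)


def fits_on_disk(estimate_bytes: int, free_bytes: int, reserve_bytes: int = 0) -> bool:
    """Check whether the estimated archive fits while keeping a reserve free."""
    return estimate_bytes + reserve_bytes <= free_bytes


# Example: a container using 512 MiB of memory with two changed files
# of 10 MiB and 2 MiB needs roughly 524 MiB for one checkpoint.
example_estimate = estimate_checkpoint_bytes(512 * MIB, [10 * MIB, 2 * MIB])
```

Such an estimate only bounds a single checkpoint; repeated checkpoints accumulate, which is why archive cleanup is left to an operator.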
+
+### Troubleshooting
+
+<!--
+This section must be completed when targeting beta to a release.
+
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now, we leave it here.
+-->
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+The feature is not affected if the API server and/or etcd is unavailable.
+
+###### What are other known failure modes?
+
+- The creation of the checkpoint archive can fail.
+  - Detection: See <https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/>
+  - Mitigation: Do not checkpoint a container that cannot be checkpointed by CRIU.
+  - Diagnostics: The container engine will provide the location of the log file
+    created by CRIU, which contains more details.
+  - Testing: The tests currently cover whether checkpointing is enabled in the
+    kubelet and whether the underlying container engine supports the
+    corresponding CRI API calls. The most common checkpointing failure occurs when
+    the container uses an external hardware device such as a GPU or InfiniBand;
+    such devices usually do not exist in test systems.
+
+<!--
+For each of them, fill in the following information by copying the below template:
+  - [Failure mode brief description]
+    - Detection: How can it be detected via metrics? Stated another way:
+      how can an operator troubleshoot without logging into a master or worker node?
+    - Mitigations: What can be done to stop the bleeding, especially for already
+      running user workloads?
+    - Diagnostics: What are the useful log messages and their required logging
+      levels that could help debug the issue?
+      Not required until feature graduated to beta.
+    - Testing: Are there any tests for failure mode? If not, describe why.
+-->
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
 ## Implementation History
 
 * 2020-09-16: Initial version of this KEP
@@ -350,6 +525,7 @@ does not compress the checkpoint archive on disk.
 * 2022-01-20: Reworked based on review and renamed feature gate to `ContainerCheckpoint`
 * 2022-04-05: Added CRI API section and targeted 1.25
 * 2022-05-17: Remove *restore* RPC from the CRI API
+* 2023-10-09: Beta graduation in 1.30
 
 ## Drawbacks
 
diff --git a/keps/sig-node/2008-forensic-container-checkpointing/kep.yaml b/keps/sig-node/2008-forensic-container-checkpointing/kep.yaml
@@ -15,18 +15,18 @@ approvers:
   - "@dchen1107"
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.25"
+latest-milestone: "v1.30"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.25"
-  beta: "v1.26"
-  stable: "v1.28"
+  beta: "v1.30"
+  stable: "v1.33"
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled