Commit 7239c99

Graduate "Forensic Container Checkpointing" to Beta

As defined in the existing KEP the steps to graduate from Alpha to Beta are

  At least one container engine has to have implemented the
  corresponding CRI APIs to introduce e2e test for checkpointing.

  - [ ] Enable the feature per default
  - [ ] No major bugs reported in the previous cycle

CRI-O implemented the corresponding CRI RPC and no major bugs have been
reported since the initial release in 1.25.

Signed-off-by: Adrian Reber <[email protected]>
1 parent 12cc497 commit 7239c99

3 files changed, 189 additions and 11 deletions
Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 2008
 alpha:
   approver: "@ehashman"
+beta:
+  approver: "@deads2k"

keps/sig-node/2008-forensic-container-checkpointing/README.md

Lines changed: 183 additions & 7 deletions
@@ -25,8 +25,11 @@
   - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
 - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
   - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
   - [Dependencies](#dependencies)
   - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
@@ -125,6 +128,10 @@ message CheckpointContainerRequest {
     string container_id = 1;
     // Location of the checkpoint archive used for export/import
     string location = 2;
+    // Timeout in seconds for the checkpoint to complete.
+    // Timeout of zero means to use the CRI default.
+    // Timeout > 0 means to use the user specified timeout.
+    int64 timeout = 3;
 }
 
 message CheckpointContainerResponse {}
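
Reviewer note: the comments on the new `timeout` field in the hunk above describe a small decision rule. The following is a minimal Python sketch of that rule, not the kubelet or CRI implementation. The helper name `effective_timeout` and the value of `CRI_DEFAULT_TIMEOUT_SECONDS` are hypothetical (the KEP does not state what the CRI default is), and rejecting negative values is an assumption because the proto comments do not cover them.

```python
# Hypothetical default: the KEP only says "the CRI default" without a number.
CRI_DEFAULT_TIMEOUT_SECONDS = 120


def effective_timeout(requested: int, cri_default: int = CRI_DEFAULT_TIMEOUT_SECONDS) -> int:
    """Return the timeout in seconds that would apply to a checkpoint request."""
    if requested < 0:
        # Assumption: negative values are not defined by the proto comments.
        raise ValueError("timeout must be zero or positive")
    if requested == 0:
        # Zero means: use the CRI default.
        return cri_default
    # Any value > 0 is the user-specified timeout.
    return requested
```

Under these assumptions, `effective_timeout(0)` falls back to the default and `effective_timeout(30)` returns the caller's value unchanged.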
@@ -146,6 +153,16 @@ In its first implementation the risks are low as it tries to be a CRI API
 change with minimal changes to the kubelet and it is gated by the feature
 gate `ContainerCheckpoint`.
 
+One possible risk that was identified during Alpha is that the disk of
+the node requesting the checkpoints could fill up if too many checkpoints
+are created. One approach to solve this was some kind of garbage collection
+of checkpoint archives. A pull request to implement garbage collection
+was opened ([#115888](https://github.com/kubernetes/kubernetes/pull/115888))
+but during review it became clear that the kubelet might not be the right
+place to implement checkpoint archive garbage collection and the pull request
+was closed again. Currently the most likely solution seems to be to implement
+the garbage collection in an operator.
+
 ## Design Details
 
 The feature gate `ContainerCheckpoint` will ensure that the API
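
Reviewer note: the garbage collection idea described in the hunk above could, for example, keep only the newest few archives in a directory. The Python sketch below is an illustration only. It does not reproduce the closed pull request or any operator; the function name `prune_checkpoints`, the `keep` parameter, and the `checkpoint-*.tar` file pattern are assumptions made for this example.

```python
from pathlib import Path


def prune_checkpoints(directory: str, keep: int = 5, pattern: str = "checkpoint-*.tar") -> list[str]:
    """Delete all but the `keep` most recently modified archives.

    Returns the file names that were deleted.
    """
    if keep < 0:
        raise ValueError("keep must be zero or positive")
    archives = sorted(
        Path(directory).glob(pattern),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    deleted = []
    for archive in archives[keep:]:
        archive.unlink()
        deleted.append(archive.name)
    return deleted
```

A retention count is only one possible policy; a limit on total bytes or on archive age would follow the same pattern of sorting and trimming.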
@@ -244,21 +261,41 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
 Once CRI implementation provide the relevant RPC calls
 the e2e tests will not fail but need to be extended.
 
+- Once the initial Alpha release CRI-O supports the
+`CheckpointContainer` CRI RPC and tests have been
+enhanced to support CRI implementation that implement
+the `CheckpointContainer` CRI RPC
+
+- Once Kubernetes was released with the `CheckpointContainer` CRI RPC
+CRI-O has been updated to support the new CRI RPC.
+The tests have been enhanced to work with CRI implementations
+that support the `CheckpointContainer` CRI RPC as well as
+CRI implementations that do not support it. The tests also handle
+if the corresponding feature gate is disabled or enabled:
+<https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/checkpoint_container.go>
+
 ### Graduation Criteria
 
 #### Alpha
 
-- [ ] Implement the new feature gate and kubelet implementation
-- [ ] Ensure proper tests are in place
-- [ ] Update documentation to make the feature visible
+- [X] Implement the new feature gate and kubelet implementation
+- [X] Ensure proper tests are in place
+- [X] Update documentation to make the feature visible
+  - <https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/>
+  - <https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/>
+  - <https://kubernetes.io/blog/2023/03/10/forensic-container-analysis/>
 
 #### Alpha to Beta Graduation
 
-At least one container engine has to have implemented the
-corresponding CRI APIs to introduce e2e test for checkpointing.
+CRI-O as well as containerd have to have implemented the corresponding CRI APIs:
+
+- [x] CRI-O
+- [ ] containerd (<https://github.com/containerd/containerd/pull/6965>)
+
+In Kubernetes:
 
 - [ ] Enable the feature per default
-- [ ] No major bugs reported in the previous cycle
+- [x] No major bugs reported in the previous cycle
 
 #### Beta to GA Graduation

@@ -292,14 +329,94 @@ Checkpointing containers will be possible again.
 
 ###### Are there any tests for feature enablement/disablement?
 
-Currently no.
+Currently the test will automatically be skipped if the feature is not enabled.
+
+### Rollout, Upgrade and Rollback Planning
+
+Does not apply as the feature is an additional API endpoint with no
+dependencies on other functionality. If it is not enabled via the feature
+gate it will return `404 page not found`. If it is not enabled in the
+underlying container engine a `500` will be returned with an error
+message from the container engine. If it is enabled the API endpoint exists
+if disabled then it does not exist. No planning necessary.
+
+Documented at <https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/>
+<!--
+This section must be completed when targeting beta to a release.
+-->
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+At this point it is still a kubelet only API endpoint and has no dependencies
+on other components.
+
+###### What specific metrics should inform a rollback?
+
+The only metric is the return code from the API endpoint.
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+No, this does not seem to apply for this feature.
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+No.
+
+### Monitoring Requirements
+
+Querying the state of the feature gate offers the possibility to detect
+if the API endpoint will return `404` or not.
+
+<!--
+This section must be completed when targeting beta to a release.
+
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+-->
+
+###### How can an operator determine if the feature is in use by workloads?
+
+As it is not exposed in the Kubernetes API it cannot be determined. This is
+only visible in the kubelet.
+
+<!--
+Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
+checking if there are objects with field X set) may be a last resort. Avoid
+logs or events for this purpose.
+-->
+
+###### How can someone using this feature know that it is working for their instance?
+
+The kubelet API endpoint can return following codes:
+
+- 200: checkpoint archive was successfully created
+- 404: feature is not enabled
+- 500: underlying container engine does not support checkpointing containers
+
+Documented at <https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/>
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+Does not apply as the enhancement will only be called when requested. Not a service.
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+Does not apply as the enhancement will only be called when requested. Not a service.
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+There are no metrics.
 
 ### Dependencies
 
 CRIU needs to be installed on the node, but on most distributions it is already
 a dependency of runc/crun. It does not require any specific services on the
 cluster.
 
+###### Does this feature depend on any specific services running in the cluster?
+
+No, the container engine, however, must support the checkpoint CRI API call.
+
 ### Scalability
 
 ###### Will enabling / using this feature result in any new API calls?
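
Reviewer note: the three return codes listed in the hunk above (under "How can someone using this feature know that it is working for their instance?") lend themselves to a simple client-side lookup. The Python sketch below is a minimal illustration. The three meanings are taken from the KEP text, while the function name `describe_checkpoint_status` and the fallback message for any other code are assumptions made for this example.

```python
# Meanings copied from the KEP text in this commit.
CHECKPOINT_STATUS = {
    200: "checkpoint archive was successfully created",
    404: "feature is not enabled",
    500: "underlying container engine does not support checkpointing containers",
}


def describe_checkpoint_status(code: int) -> str:
    """Map a kubelet checkpoint endpoint status code to its documented meaning."""
    # Assumption: any other code is reported as unexpected rather than guessed at.
    return CHECKPOINT_STATUS.get(code, f"unexpected status code {code}")
```

Since the KEP states that the return code is the only available signal, a caller would typically log this description next to the container engine's error message when the code is `500`.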
@@ -334,6 +451,64 @@ Disk usage will overall increase by the used memory of the container and the cha
 Checkpoint archive written to disk can optionally be compressed. The current implementation
 does not compress the checkpoint archive on disk.
 
+To avoid running out of disk space an operator has been introduced: <https://github.com/checkpoint-restore/checkpoint-restore-operator>
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+During checkpointing each memory page will be written to disk. Disk usage will increase by
+the size of all memory pages in the checkpointed container. Each file in the container that
+has been changed compared to the original version will also be part of the checkpoint.
+Disk usage will overall increase by the used memory of the container and the changed files.
+Checkpoint archive written to disk can optionally be compressed. The current implementation
+does not compress the checkpoint archive on disk.
+
+To avoid running out of disk space an operator has been introduced: <https://github.com/checkpoint-restore/checkpoint-restore-operator>
+
+### Troubleshooting
+
+<!--
+This section must be completed when targeting beta to a release.
+
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now, we leave it here.
+-->
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+The feature does not care if the API server and/or etcd is unavailable.
+
+###### What are other known failure modes?
+
+- The creation of the checkpoint archive can fail.
+  - Detection: See https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/
+  - Mitigation: Do not checkpoint a container that cannot be checkpointed by CRIU.
+  - Diagnostics: The container engine will provide the location of log file created
+    by CRIU with more details.
+  - Testing: Tests are currently covering if checkpointing is enabled in the kubelet
+    or not as well as covering if the underlying container engine supports the
+    corresponding CRI API calls. The most common checkpointing failure is if the
+    container is using an external hardware device like a GPU or InfiniBand which
+    usually do not exist in test systems.
+
+<!--
+For each of them, fill in the following information by copying the below template:
+  - [Failure mode brief description]
+    - Detection: How can it be detected via metrics? Stated another way:
+      how can an operator troubleshoot without logging into a master or worker node?
+    - Mitigations: What can be done to stop the bleeding, especially for already
+      running user workloads?
+    - Diagnostics: What are the useful log messages and their required logging
+      levels that could help debug the issue?
+      Not required until feature graduated to beta.
+    - Testing: Are there any tests for failure mode? If not, describe why.
+-->
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
 ## Implementation History
 
 * 2020-09-16: Initial version of this KEP
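
Reviewer note: the resource-exhaustion answer in the hunk above says that disk usage grows by the used memory of the container plus the changed files, with no compression in the current implementation. That statement can be written as a small estimate. The Python sketch below is illustrative only: the function name `estimate_checkpoint_bytes` is hypothetical, and the estimate ignores archive metadata, which the KEP text does not quantify.

```python
def estimate_checkpoint_bytes(memory_bytes: int, changed_file_bytes: list[int]) -> int:
    """Rough size of one uncompressed checkpoint archive.

    Sums the container's used memory and the sizes of all files that changed
    compared to the original image. Archive metadata is ignored (assumption).
    """
    if memory_bytes < 0 or any(size < 0 for size in changed_file_bytes):
        raise ValueError("sizes must not be negative")
    return memory_bytes + sum(changed_file_bytes)


MiB = 1024 * 1024
# Example: 512 MiB of used memory and three changed files of 10 MiB each.
estimate = estimate_checkpoint_bytes(512 * MiB, [10 * MiB, 10 * MiB, 10 * MiB])
print(estimate // MiB)  # 542
```

Multiplying such an estimate by the number of retained checkpoints gives a first approximation of the node disk space that needs to be reserved.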
@@ -350,6 +525,7 @@ does not compress the checkpoint archive on disk.
 * 2022-01-20: Reworked based on review and renamed feature gate to `ContainerCheckpoint`
 * 2022-04-05: Added CRI API section and targeted 1.25
 * 2022-05-17: Remove *restore* RPC from the CRI API
+* 2023-10-09: Beta graduation in 1.30
 
 ## Drawbacks

keps/sig-node/2008-forensic-container-checkpointing/kep.yaml

Lines changed: 4 additions & 4 deletions
@@ -15,18 +15,18 @@ approvers:
 - "@dchen1107"
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.25"
+latest-milestone: "v1.30"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.25"
-  beta: "v1.26"
-  stable: "v1.28"
+  beta: "v1.30"
+  stable: "v1.33"
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
