Graduate "Forensic Container Checkpointing" to Beta
As defined in the existing KEP, the steps to graduate from Alpha to Beta
are:
At least one container engine has to have implemented the
corresponding CRI APIs to introduce e2e tests for checkpointing.
- [ ] Enable the feature by default
- [ ] No major bugs reported in the previous cycle
CRI-O implemented the corresponding CRI RPC and no major bugs
have been reported since the initial release in 1.25.
Signed-off-by: Adrian Reber <[email protected]>
@@ -292,14 +329,94 @@ Checkpointing containers will be possible again.
 ###### Are there any tests for feature enablement/disablement?

-Currently no.
+Currently the test will automatically be skipped if the feature is not enabled.
+
+### Rollout, Upgrade and Rollback Planning
+
+Does not apply, as the feature is an additional API endpoint with no
+dependencies on other functionality. If it is not enabled via the feature
+gate, the endpoint returns `404 page not found`. If it is not enabled in the
+underlying container engine, a `500` is returned with an error
+message from the container engine. If the feature is enabled the API endpoint
+exists; if it is disabled the endpoint does not exist. No planning is necessary.
+
+Documented at <https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/>
+<!--
+This section must be completed when targeting beta to a release.
+-->
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+At this point it is still a kubelet only API endpoint and has no dependencies
+on other components.
+
+###### What specific metrics should inform a rollback?
+
+The only metric is the return code from the API endpoint.
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+No, this does not seem to apply for this feature.
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+No.
+
+### Monitoring Requirements
+
+Querying the state of the feature gate offers the possibility to detect
+if the API endpoint will return `404` or not.
+
+<!--
+This section must be completed when targeting beta to a release.
+
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+-->
+
+###### How can an operator determine if the feature is in use by workloads?
+
+As it is not exposed in the Kubernetes API it cannot be determined. This is
+only visible in the kubelet.
+
+<!--
+Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
+checking if there are objects with field X set) may be a last resort. Avoid
+logs or events for this purpose.
+-->
+
+###### How can someone using this feature know that it is working for their instance?
+
+The kubelet API endpoint can return the following codes:
+
+- 200: checkpoint archive was successfully created
+- 404: feature is not enabled
+- 500: underlying container engine does not support checkpointing containers
+
+Documented at <https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/>
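
The return codes listed above can be turned into a small client-side check. The following is a minimal sketch and not part of the KEP: the helper names `checkpoint_url` and `describe_checkpoint_status` are hypothetical, and the URL layout (`POST /checkpoint/{namespace}/{pod}/{container}` on the kubelet port, assumed here to be the default 10250) follows the linked kubelet checkpoint API documentation.

```python
# Illustrative only: maps the status codes named in this section to their
# documented meaning. The helper names are hypothetical, not kubelet code.

CHECKPOINT_STATUS = {
    200: "checkpoint archive was successfully created",
    404: "feature is not enabled",
    500: "underlying container engine does not support checkpointing containers",
}


def checkpoint_url(node: str, namespace: str, pod: str, container: str,
                   port: int = 10250) -> str:
    """Build the kubelet checkpoint URL (layout assumed from the linked docs)."""
    return f"https://{node}:{port}/checkpoint/{namespace}/{pod}/{container}"


def describe_checkpoint_status(code: int) -> str:
    """Return the documented meaning of a status code, or flag it as unlisted."""
    return CHECKPOINT_STATUS.get(code, f"unexpected status code {code}")


if __name__ == "__main__":
    # Example names ("default", "counters", "counter") are placeholders.
    print(checkpoint_url("localhost", "default", "counters", "counter"))
    for code in (200, 404, 500, 401):
        print(code, "->", describe_checkpoint_status(code))
```

Any code outside the three listed ones is reported as unexpected rather than guessed at, since this section only defines 200, 404 and 500.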
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+Does not apply as the enhancement will only be called when requested. Not a service.
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+Does not apply as the enhancement will only be called when requested. Not a service.
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+There are no metrics.

 ### Dependencies

 CRIU needs to be installed on the node, but on most distributions it is already
 a dependency of runc/crun. It does not require any specific services on the
 cluster.

+###### Does this feature depend on any specific services running in the cluster?
+
+No. However, the container engine must support the checkpoint CRI API call.
+
 ### Scalability

 ###### Will enabling / using this feature result in any new API calls?
@@ -334,6 +451,64 @@ Disk usage will overall increase by the used memory of the container and the cha
 Checkpoint archive written to disk can optionally be compressed. The current implementation
 does not compress the checkpoint archive on disk.

+To avoid running out of disk space an operator has been introduced: <https://github.com/checkpoint-restore/checkpoint-restore-operator>
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+During checkpointing each memory page will be written to disk. Disk usage will increase by
+the size of all memory pages in the checkpointed container. Each file in the container that
+has been changed compared to the original version will also be part of the checkpoint.
+Disk usage will overall increase by the used memory of the container and the changed files.
+Checkpoint archive written to disk can optionally be compressed. The current implementation
+does not compress the checkpoint archive on disk.
+
+To avoid running out of disk space an operator has been introduced: <https://github.com/checkpoint-restore/checkpoint-restore-operator>
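
As a rough illustration of the disk-usage reasoning above, the sketch below adds the container's used memory to the size of its changed files. The function name and the example figures are hypothetical. The estimate ignores archive metadata and assumes no compression, in line with the statement that the current implementation does not compress the archive.

```python
def estimate_checkpoint_bytes(used_memory_bytes: int,
                              changed_file_sizes: list[int]) -> int:
    """Estimate the uncompressed checkpoint size: used memory plus changed files."""
    if used_memory_bytes < 0 or any(size < 0 for size in changed_file_sizes):
        raise ValueError("sizes must be non-negative")
    return used_memory_bytes + sum(changed_file_sizes)


if __name__ == "__main__":
    MiB = 1024 * 1024
    # Hypothetical container: 512 MiB of used memory and two changed files.
    estimate = estimate_checkpoint_bytes(512 * MiB, [4 * MiB, 1 * MiB])
    print(f"estimated archive size: {estimate / MiB:.0f} MiB")
```

An operator could compare such an estimate against the free space on the node before requesting a checkpoint; the real archive size depends on the container engine and CRIU.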
+
+### Troubleshooting
+
+<!--
+This section must be completed when targeting beta to a release.
+
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now, we leave it here.
+-->
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+The feature is not affected if the API server and/or etcd is unavailable.
+
+###### What are other known failure modes?
+
+- The creation of the checkpoint archive can fail.
+  - Detection: See <https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/>
+  - Mitigation: Do not checkpoint a container that cannot be checkpointed by CRIU.
+  - Diagnostics: The container engine will provide the location of the log file created
+    by CRIU with more details.
+  - Testing: Tests currently cover whether checkpointing is enabled in the kubelet
+    and whether the underlying container engine supports the
+    corresponding CRI API calls. The most common checkpointing failure occurs when the
+    container is using an external hardware device like a GPU or InfiniBand, which
+    usually does not exist in test systems.
+
+<!--
+For each of them, fill in the following information by copying the below template:
+  - [Failure mode brief description]
+    - Detection: How can it be detected via metrics? Stated another way:
+      how can an operator troubleshoot without logging into a master or worker node?
+    - Mitigations: What can be done to stop the bleeding, especially for already
+      running user workloads?
+    - Diagnostics: What are the useful log messages and their required logging
+      levels that could help debug the issue?
+      Not required until feature graduated to beta.
+    - Testing: Are there any tests for failure mode? If not, describe why.
+-->
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
 ## Implementation History

 * 2020-09-16: Initial version of this KEP
@@ -350,6 +525,7 @@ does not compress the checkpoint archive on disk.
 * 2022-01-20: Reworked based on review and renamed feature gate to `ContainerCheckpoint`
 * 2022-04-05: Added CRI API section and targeted 1.25
 * 2022-05-17: Remove *restore* RPC from the CRI API