Provide details about additional checkpoint/restore use cases

adrianreber · adrianreber · commit 843dd61c478b · 2023-10-19T15:45:15.000+02:00
Signed-off-by: Adrian Reber <areber@redhat.com>
diff --git a/keps/sig-node/2008-forensic-container-checkpointing/README.md b/keps/sig-node/2008-forensic-container-checkpointing/README.md
@@ -10,6 +10,13 @@
   - [Implementation](#implementation)
     - [CRI Updates](#cri-updates)
   - [User Stories](#user-stories)
+    - [Forensic Container Checkpointing](#forensic-container-checkpointing)
+    - [Fast Container Startup](#fast-container-startup)
+    - [Container Migration](#container-migration)
+      - [Fault Tolerance](#fault-tolerance)
+      - [Load Balancing](#load-balancing)
+      - [Spot Instances](#spot-instances)
+      - [Scheduler Integration](#scheduler-integration)
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
   - [Future Enhancements](#future-enhancements)
@@ -132,6 +139,16 @@ message CheckpointContainerResponse {}
 
 ### User Stories
 
+For the initial Alpha release this KEP focused on "Forensic Container
+Checkpointing". The checkpoint/restore technology, however, opens up the
+possibility of many different use cases. Since its introduction in Kubernetes
+1.25, users have reported using the checkpoint functionality for some of
+those other use cases. The following sections describe some of the
+possible use cases, starting with the original
+"Forensic Container Checkpointing" use case.
+
+#### Forensic Container Checkpointing
+
 To analyze unusual activities in a container, the container should
 be checkpointed without stopping the container or without the container
 knowing it was checkpointed. Using checkpointing it is possible to take
@@ -140,10 +157,93 @@ continue to run without knowing a copy was created. This copy can then
 be restored in another (sandboxed) environment in the context of another
 container engine for detailed analysis of a possible attack.
 
+#### Fast Container Startup
+
+In addition to forensic analysis of a container, checkpointing can be used to
+offer a way to quickly start containers. This is especially useful for
+containers that need a long time to start. Either the software in the container
+needs a long time to initialize by loading many libraries or the container
+requires time to read data from a storage device. Using checkpointing it is
+possible to wait once until the container has finished its initialization and save
+the initialized state to a checkpoint archive. Based on this checkpoint archive
+one or multiple copies of the container can be created without the need to wait
+for the initialization to finish. The startup time is reduced to the time
+necessary to read back all memory pages to their previous location.
+
+This feature is already used in production to decrease startup time of
+containers.
+
+A similar use case for starting containers more quickly has been reported in
+combination with persistent memory systems. The combination of checkpointed
+containers and persistent memory systems can reduce startup time after a reboot
+tremendously.
+
+#### Container Migration
+
+One of the main use cases for checkpointing and restoring containers is
+container migration. An open issue asking for container migration in
+Kubernetes has existed since 2015: [#3949][migration-issue].
+
+With the current Alpha-based implementation, container migration is already
+possible as documented in [Forensic Container Checkpointing Alpha][kubernetes-blog-post].
+
+The following sections give an overview of possible use cases for
+container migration.
+
+##### Fault Tolerance
+
+Container migration for fault tolerance is one of the typical reasons to
+migrate containers or processes. It is a well-researched topic, especially
+in the field of high performance computing (HPC). To avoid losing work
+already done by a container, the container is migrated to another node before
+the current node crashes. There are many scientific papers describing how
+to detect a node that might soon have a hardware problem. The main goal
+of using container migration for fault tolerance is to avoid losing work that
+has already been done. This is, in contrast to the forensic container
+checkpointing use case, only useful for stateful containers.
+
+##### Load Balancing
+
+Container migration for load balancing is an area where checkpoint/restore
+as implemented by CRIU is already used in production today. A prominent example
+is Google, as presented at the Linux Plumbers Conference in 2018:
+[Task Migration at Scale Using CRIU][task-migration].
+
+If multiple containers are running on one node, checkpoint/restore and thus
+container migration open up the possibility to migrate containers to another
+node in case not enough resources are available on the current node. High
+priority workloads can continue to run if low priority containers are
+migrated to another node. This way, stateful low priority containers do
+not have to restart from scratch but can continue to run on another node.
+
+This might be especially interesting for AI training containers, which can be
+preempted for higher priority tasks without having to restart the training
+from the beginning.
+
+##### Spot Instances
+
+Yet another possible use case where checkpoint/restore is already used today
+is spot instances. Spot instances are usually resources that are cheaper but
+have the drawback that they might shut down on very short notice. With the
+help of checkpoint/restore, workloads on spot instances can either be
+checkpointed regularly or the checkpointing can be triggered by a signal.
+Once checkpointed, the container can be moved to another instance and
+continue to run without having to start from the beginning.
+
+##### Scheduler Integration
+
+All of the above-mentioned container migration use cases currently require
+manual checkpointing, manual transfer of the checkpoint archive and manual
+restore of the container (see [Forensic Container Checkpointing
+Alpha][kubernetes-blog-post] for details). If all these steps could be performed
+automatically by the scheduler, this would greatly improve the user experience.
+Scheduler integration, however, is probably a use case that will be
+implemented at the very end and is left for future enhancements.
+
 ### Risks and Mitigations
 
-In its first implementation the risks are low as it tries to be a CRI API
-change with minimal changes to the kubelet and it is gated by the feature
+In its first implementation the risks are low as it tries to be a CRI API change
+with minimal changes to the kubelet and it is gated by the feature
 gate `ContainerCheckpoint`.
 
 ## Design Details
@@ -168,7 +268,7 @@ containers in the pod are checkpointed.
 
 One possible result of being able to checkpoint and restore containers and pods
 might be the possibility to migrate containers and pods in the future as
-discussed in [#3949](https://github.com/kubernetes/kubernetes/issues/3949).
+discussed in [#3949][migration-issue].
 
 ### Test Plan
 
@@ -387,3 +487,7 @@ using checkpoint and restore in Kubernetes through the existing paths of
 runtimes and engines is not well known and maybe not even possible as
 checkpointing and restoring is tightly integrated as it requires much
 information only available by working closely with runtimes and engines.
+
+[kubernetes-blog-post]: https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/
+[migration-issue]: https://github.com/kubernetes/kubernetes/issues/3949
+[task-migration]: https://lpc.events/event/2/contributions/69/attachments/205/374/Task_Migration_at_Scale_Using_CRIU_-_LPC_2018.pdf