Commit 843dd61

Provide details about additional checkpoint/restore use cases
Signed-off-by: Adrian Reber <[email protected]>
1 parent 5eea480 commit 843dd61

File tree

1 file changed

+107
-3
lines changed
  • keps/sig-node/2008-forensic-container-checkpointing

keps/sig-node/2008-forensic-container-checkpointing/README.md

Lines changed: 107 additions & 3 deletions
@@ -10,6 +10,13 @@
1010
- [Implementation](#implementation)
1111
- [CRI Updates](#cri-updates)
1212
- [User Stories](#user-stories)
13+
- [Forensic Container Checkpointing](#forensic-container-checkpointing)
14+
- [Fast Container Startup](#fast-container-startup)
15+
- [Container Migration](#container-migration)
16+
- [Fault Tolerance](#fault-tolerance)
17+
- [Load Balancing](#load-balancing)
18+
- [Spot Instances](#spot-instances)
19+
- [Scheduler Integration](#scheduler-integration)
1320
- [Risks and Mitigations](#risks-and-mitigations)
1421
- [Design Details](#design-details)
1522
- [Future Enhancements](#future-enhancements)
@@ -132,6 +139,16 @@ message CheckpointContainerResponse {}
132139

133140
### User Stories
134141

142+
For the initial Alpha release this KEP focused on "Forensic Container
143+
Checkpointing". The checkpoint/restore technology, however, opens up the
144+
possibility of many different use cases. Since the introduction in Kubernetes
145+
1.25 there has been feedback from users that were using the checkpoint
146+
functionality for some of those other use cases. The following sections
147+
describe some of the possible use cases, starting with the original
148+
"Forensic Container Checkpointing" use case.
149+
150+
#### Forensic Container Checkpointing
151+
135152
To analyze unusual activities in a container, the container should
136153
be checkpointed without stopping the container or without the container
137154
knowing it was checkpointed. Using checkpointing it is possible to take
@@ -140,10 +157,93 @@ continue to run without knowing a copy was created. This copy can then
140157
be restored in another (sandboxed) environment in the context of another
141158
container engine for detailed analysis of a possible attack.
142159

160+
#### Fast Container Startup
161+
162+
In addition to forensic analysis of a container, checkpointing can be used to
163+
offer a way to quickly start containers. This is especially useful for
164+
containers that need a long time to start. Either the software in the container
165+
needs a long time to initialize by loading many libraries or the container
166+
requires time to read data from a storage device. Using checkpointing it is
167+
possible to wait once until the container has finished the initialization and save
168+
the initialized state to a checkpoint archive. Based on this checkpoint archive
169+
one or multiple copies of the container can be created without the need to wait
170+
for the initialization to finish. The startup time is reduced to the time
171+
necessary to read back all memory pages to their previous location.
172+
173+
This feature is already used in production to decrease startup time of
174+
containers.
175+
176+
Another similar use case for quicker starting containers has been reported in
177+
combination with persistent memory systems. The combination of checkpointed
178+
containers and persistent memory systems can reduce startup time after a reboot
179+
tremendously.
180+
181+
#### Container Migration
182+
183+
One of the main use cases for checkpointing and restoring containers is
184+
container migration. An open issue asking for container migration in
185+
Kubernetes has existed since 2015: [#3949][migration-issue].
186+
187+
With the current Alpha-based implementation, container migration is already
188+
possible as documented in [Forensic Container Checkpointing Alpha][kubernetes-blog-post].
189+
190+
The following tries to give an overview of possible use cases for
191+
container migration.
192+
193+
##### Fault Tolerance
194+
195+
Container migration for fault tolerance is one of the typical reasons to
196+
migrate containers or processes. It is a well-researched topic, especially
197+
in the field of high performance computing (HPC). To avoid loss of work
198+
already done by a container, the container is migrated to another node before
199+
the current node crashes. There are many scientific papers describing how
200+
to detect a node that might soon have a hardware problem. The main goal
201+
of using container migration for fault tolerance is to avoid loss of already
202+
done work. This is, in contrast to the forensic container checkpointing use
203+
case, only useful for stateful containers.
204+
205+
##### Load Balancing
206+
207+
Container migration for load balancing is something where checkpoint/restore
208+
as implemented by CRIU is already used in production today. A prominent example
209+
is Google as presented at the Linux Plumbers conference in 2018:
210+
[Task Migration at Scale Using CRIU][task-migration].
211+
212+
If multiple containers are running on one node, checkpoint/restore and thus
213+
container migration open up the possibility to migrate containers to another
214+
node in case not enough resources are available on the current node. High
215+
priority workloads can continue to run if low priority containers are
216+
migrated to another node. This way, stateful low priority containers do
217+
not have to restart from scratch but can continue to run on another node.
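
The selection step described above could be sketched as follows. This is only an illustration under stated assumptions: the `Container` type, its `priority` and `memory_mib` fields, and the greedy policy are hypothetical and not part of this KEP or of any Kubernetes API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    """Hypothetical view of a running container, for illustration only."""
    name: str
    priority: int    # assumed convention: higher value = more important
    memory_mib: int  # memory currently used by the container


def pick_migration_candidates(containers, needed_mib):
    """Greedily pick the lowest-priority containers until enough memory is freed.

    Returns an empty list if migrating every container would still not
    free `needed_mib`.
    """
    chosen, freed = [], 0
    # Lowest priority first; among equal priorities, larger containers first.
    for c in sorted(containers, key=lambda c: (c.priority, -c.memory_mib)):
        if freed >= needed_mib:
            break
        chosen.append(c)
        freed += c.memory_mib
    return chosen if freed >= needed_mib else []
```

The containers returned by such a policy would be checkpointed and restored on another node instead of being restarted from scratch.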
218+
219+
This might be especially interesting for AI training containers which can be
220+
preempted for higher priority tasks without having to restart the training
221+
from the start.
222+
223+
##### Spot Instances
224+
225+
Yet another possible use case where checkpoint/restore is already used today
226+
is spot instances. Spot instances are usually resources that are cheaper but
227+
with the drawback that they might shut down with very short notice. With the
228+
help of checkpoint/restore, workloads on spot instances can either be
229+
checkpointed regularly or the checkpointing can be triggered by a signal.
230+
Once checkpointed, the container can be moved to another instance and
231+
continue to run without having to start from the beginning.
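
The two trigger modes mentioned above (regular checkpoints and checkpoints triggered by a signal) could be combined roughly as in the following sketch. The `CheckpointTrigger` class, the use of `SIGTERM` as the shutdown notice, and the `checkpoint_fn` callback are assumptions made for illustration; how a spot instance announces its shutdown depends on the provider.

```python
import signal
import time


class CheckpointTrigger:
    """Hypothetical helper: checkpoint every `interval_s` seconds or on a signal."""

    def __init__(self, interval_s, checkpoint_fn, clock=time.monotonic):
        self.interval_s = interval_s
        self.checkpoint_fn = checkpoint_fn  # assumed to create the checkpoint
        self.clock = clock
        self.last = clock()
        self.notice_received = False

    def on_signal(self, signum, frame):
        # Only record the notice; the checkpoint itself is taken in poll().
        self.notice_received = True

    def install(self, signum=signal.SIGTERM):
        # Assumption: the shutdown notice is delivered as a POSIX signal.
        signal.signal(signum, self.on_signal)

    def poll(self):
        """Take a checkpoint if one is due; return True if one was taken."""
        now = self.clock()
        due = now - self.last >= self.interval_s
        if self.notice_received or due:
            self.checkpoint_fn()
            self.last = now
            self.notice_received = False
            return True
        return False
```

A supervising loop would call `install()` once and then `poll()` periodically; the injected `clock` only exists to make the sketch easy to test.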
232+
233+
##### Scheduler Integration
234+
235+
All of the above-mentioned container migration use cases currently require
236+
manual checkpointing, manual transfer of the checkpoint archive and manual
237+
restore of the container (see [Forensic Container Checkpointing
238+
Alpha][kubernetes-blog-post] for details). If all these steps could automatically
239+
be performed by the scheduler, this would greatly improve the user experience.
240+
Scheduler integration, however, is probably one use case which will be
241+
implemented at the very end and something for future enhancements.
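
To make the first of the manual steps concrete, the following sketch builds the URL that a client would `POST` to in order to request a checkpoint from the kubelet. It assumes the kubelet endpoint has the form `/checkpoint/{namespace}/{pod}/{container}` on port 10250 with an optional `timeout` query parameter, as described in the linked blog post and the Kubernetes documentation; the helper name `checkpoint_url` and the example node and pod names are hypothetical. Authentication, the transfer of the archive, and the restore are not shown.

```python
from urllib.parse import quote, urlencode


def checkpoint_url(node, namespace, pod, container, port=10250, timeout_s=None):
    """Build the (assumed) kubelet checkpoint URL for one container."""
    # Percent-encode each path segment so unusual names cannot alter the path.
    segments = (namespace, pod, container)
    path = "/checkpoint/" + "/".join(quote(s, safe="") for s in segments)
    query = "?" + urlencode({"timeout": timeout_s}) if timeout_s is not None else ""
    return f"https://{node}:{port}{path}{query}"


# Hypothetical names, for illustration only.
print(checkpoint_url("node-1", "default", "counters", "counter", timeout_s=30))
```

An automated integration would have to chain this request with the archive transfer and the restore on the target node, which is exactly the part that is manual today.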
242+
143243
### Risks and Mitigations
144244

145-
In its first implementation the risks are low as it tries to be a CRI API
146-
change with minimal changes to the kubelet and it is gated by the feature
245+
In its first implementation the risks are low as it tries to be a CRI API change
246+
with minimal changes to the kubelet and it is gated by the feature
147247
gate `ContainerCheckpoint`.
148248

149249
## Design Details
@@ -168,7 +268,7 @@ containers in the pod are checkpointed.
168268

169269
One possible result of being able to checkpoint and restore containers and pods
170270
might be the possibility to migrate containers and pods in the future as
171-
discussed in [#3949](https://github.com/kubernetes/kubernetes/issues/3949).
271+
discussed in [#3949][migration-issue].
172272

173273
### Test Plan
174274

@@ -387,3 +487,7 @@ using checkpoint and restore in Kubernetes through the existing paths of
387487
runtimes and engines is not well known and maybe not even possible as
388488
checkpointing and restoring is tightly integrated as it requires much
389489
information only available by working closely with runtimes and engines.
490+
491+
[kubernetes-blog-post]: https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/
492+
[migration-issue]: https://github.com/kubernetes/kubernetes/issues/3949
493+
[task-migration]: https://lpc.events/event/2/contributions/69/attachments/205/374/Task_Migration_at_Scale_Using_CRIU_-_LPC_2018.pdf
