Commit 5d237de

adrianreber and rst0git committed
Provide details about additional checkpoint/restore use cases
Co-Authored-By: Radostin Stoyanov <[email protected]>
Signed-off-by: Adrian Reber <[email protected]>
1 parent f451a19 commit 5d237de

keps/sig-node/2008-forensic-container-checkpointing/README.md

Lines changed: 180 additions & 3 deletions
@@ -10,9 +10,19 @@
- [Implementation](#implementation)
- [CRI Updates](#cri-updates)
- [User Stories](#user-stories)
  - [Forensic Container Checkpointing](#forensic-container-checkpointing)
  - [Fast Container Startup](#fast-container-startup)
  - [Container Migration](#container-migration)
    - [Fault Tolerance](#fault-tolerance)
    - [Load Balancing](#load-balancing)
    - [Spot Instances](#spot-instances)
    - [Scheduler Integration](#scheduler-integration)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Future Enhancements](#future-enhancements)
  - [Checkpoint Archive Management](#checkpoint-archive-management)
  - [CLI (kubectl) Integration](#cli-kubectl-integration)
  - [Checkpoint Options](#checkpoint-options)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
- [Unit tests](#unit-tests)
@@ -132,6 +142,24 @@ message CheckpointContainerResponse {}

### User Stories

For the initial Alpha release this KEP focused on "Forensic Container
Checkpointing". The checkpoint/restore technology, however, opens up the
possibility of many different use cases. Since its introduction in Kubernetes
1.25 there has been feedback from users who have been using the checkpoint
functionality for some of those other use cases. The following describes some
of the possible use cases, starting with the original "Forensic Container
Checkpointing" use case. At which point any of these use cases will be
supported in Kubernetes is not defined yet, but at this point all of them can
already be realized with the currently available implementation. One question
for the future will be to what extent the possible use cases can be made more
user friendly by additional Kubernetes features. Especially the container
migration use case has many possibilities for optimization. CRIU, which sits
at the lowest level of the checkpoint/restore stack, offers the possibility to
decrease container downtime during migration with techniques well known from
virtual machine migration, such as pre-copy or post-copy migration.

#### Forensic Container Checkpointing

To analyze unusual activities in a container, the container should
be checkpointed without stopping the container and without the container
knowing it was checkpointed. Using checkpointing it is possible to take
@@ -140,10 +168,105 @@
a copy of a running container and the container will
continue to run without knowing a copy was created. This copy can then
be restored in another (sandboxed) environment in the context of another
container engine for detailed analysis of a possible attack.
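
To make this concrete, the following is a minimal sketch of how such a
checkpoint can be requested through the *kubelet* API endpoint described in
this KEP. The node address, the namespace, pod and container names, and the
permissive TLS configuration are illustrative assumptions; a real client would
authenticate against the kubelet with proper credentials.

```go
// Sketch: request a checkpoint of a single container via the kubelet's
// checkpoint endpoint (POST /checkpoint/{namespace}/{pod}/{container}).
// Node address, namespace, pod and container names are placeholders.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
)

func main() {
	url := "https://node-1:10250/checkpoint/default/webserver/app"

	client := &http.Client{
		Transport: &http.Transport{
			// For brevity only; a real client verifies the kubelet certificate
			// and presents its own credentials.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	resp, err := client.Post(url, "application/json", nil)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// On success the kubelet replies with the location of the checkpoint
	// archive it has written below its checkpoint directory.
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```

The resulting archive can then be copied off the node and restored in a
sandboxed environment with a container engine that supports restoring
checkpoints.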

#### Fast Container Startup

In addition to the forensic analysis of a container, checkpointing can be used
to start containers more quickly. This is especially useful for containers
that need a long time to start, either because the software in the container
needs a long time to initialize by loading many libraries or because the
container requires time to read data from a storage device. Using
checkpointing it is possible to wait once until the container has finished its
initialization and to save the initialized state to a checkpoint archive.
Based on this checkpoint archive one or multiple copies of the container can
be created without the need to wait for the initialization to finish. The
startup time is reduced to the time necessary to read all memory pages back to
their previous location.
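
As an illustration of how the initial checkpoint archive for such a
"checkpoint once, start many times" workflow could be created, the following
sketch calls the `CheckpointContainer` CRI RPC added by this KEP directly
against a CRI runtime. The runtime socket path, the container ID, and the
archive location are placeholder assumptions.

```go
// Sketch: create a checkpoint archive of an already initialized container by
// calling the CheckpointContainer CRI RPC directly against the runtime.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	// Socket path is an assumption (CRI-O shown here).
	conn, err := grpc.Dial("unix:///run/crio/crio.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	// Checkpoint the initialized container into an archive that can later be
	// used to create one or more pre-initialized copies.
	_, err = client.CheckpointContainer(ctx, &runtimeapi.CheckpointContainerRequest{
		ContainerId: "abcdef123456",
		Location:    "/var/lib/kubelet/checkpoints/startup-template.tar",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("checkpoint written")
}
```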

This feature is already used in production to decrease the startup time of
containers.

Another, similar use case for faster container startup has been reported in
combination with persistent memory systems. The combination of checkpointed
containers and persistent memory systems can reduce the startup time after a
reboot tremendously.

#### Container Migration

One of the main use cases for checkpointing and restoring containers is
container migration. An open issue asking for container migration in
Kubernetes has existed since 2015: [#3949][migration-issue].

With the current Alpha implementation, container migration is already possible,
as documented in [Forensic Container Checkpointing Alpha][kubernetes-blog-post].

The following subsections give an overview of possible use cases for
container migration.

##### Fault Tolerance

Container migration for fault tolerance is one of the typical reasons to
migrate containers or processes. It is a well-researched topic, especially
in the field of high performance computing (HPC). To avoid losing work a
container has already done, the container is migrated to another node before
the current node crashes. There are many scientific papers describing how to
detect a node that might soon have a hardware problem. The main goal of using
container migration for fault tolerance is to avoid losing work that has
already been done. In contrast to the forensic container checkpointing use
case, this is only useful for stateful containers.

With GPUs becoming a costly commodity, there is an opportunity to help
users save on costs by leveraging container checkpointing to prevent
re-computation if there are any faults.

##### Load Balancing

Container migration for load balancing is something where checkpoint/restore,
as implemented by CRIU, is already used in production today. A prominent
example is Google, as presented at the Linux Plumbers Conference in 2018:
[Task Migration at Scale Using CRIU][task-migration].

If multiple containers are running on the same physical node in a cluster,
checkpoint/restore, and thus container migration, opens up the possibility
of migrating containers across cluster nodes in case there are not enough
computational resources (e.g., CPU, memory) available on the current node.
While high-priority workloads can continue to run on the same node, containers
with lower priority can be migrated. This way, stateful applications with low
priority can continue to run on a different node without losing their progress
or state.

This functionality is especially valuable in distributed infrastructure services
for AI workloads, as it helps reduce the cost of AI by maximizing the aggregate
useful throughput on a given pool with a fixed capacity of hardware accelerators.
Microsoft's globally distributed scheduling service, [Singularity][singularity],
is an example that demonstrates the efficiency and reliability of this mechanism
with deep learning training and inference workloads.

##### Spot Instances

Yet another possible use case where checkpoint/restore is already used today
is spot instances. Spot instances are usually cheaper resources, with the
drawback that they might be shut down on very short notice. With the help of
checkpoint/restore, workloads on spot instances can either be checkpointed
regularly or the checkpointing can be triggered by a signal.
Once checkpointed, the container can be moved to another instance and
continue to run without having to start from the beginning.
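
The following sketch shows what a minimal, signal-triggered checkpoint on such
an instance could look like. Treating SIGTERM as the preemption notice, the
pod and container names, and the small helper around the kubelet endpoint are
all illustrative assumptions; cloud providers deliver termination notices in
different ways.

```go
// Sketch: a node-local agent that checkpoints a container when a spot-instance
// termination signal arrives.
package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
)

// checkpointContainer uses the same kubelet endpoint as in the forensic
// example above; names and TLS handling are placeholders.
func checkpointContainer(namespace, pod, container string) error {
	url := "https://localhost:10250/checkpoint/" + namespace + "/" + pod + "/" + container
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // brevity only
	}}
	resp, err := client.Post(url, "application/json", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	log.Printf("checkpoint request returned %s", resp.Status)
	return nil
}

func main() {
	// Wait for the termination notice; here it is assumed to arrive as SIGTERM.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)
	<-sig

	if err := checkpointContainer("default", "training-job", "worker"); err != nil {
		log.Fatal(err)
	}
	// The resulting archive can now be copied off the node and restored on
	// another instance to continue the preempted work.
}
```
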
##### Scheduler Integration

All of the above-mentioned container migration use cases currently require manual
checkpointing, manual transfer of the checkpoint archive, and manual restoration
of the container (see [Forensic Container Checkpointing Alpha][kubernetes-blog-post]
for details). If all these steps could be automatically performed by the scheduler,
it would greatly improve the user experience and enable more efficient resource
utilization. For example, the scheduler could transparently checkpoint, preempt,
and migrate workloads across nodes while keeping track of available resources and
identifying suitable nodes (with compatible hardware accelerators) where a container
can be migrated. However, scheduler integration is likely to be implemented at a later
stage and is a subject for future enhancements.
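
Purely as an illustration of the steps such an automation would have to
perform, the following pseudologic sketches the migration flow. All helper
functions are hypothetical placeholders; none of this exists in Kubernetes
today.

```go
// Sketch: the high-level steps of a scheduler-driven migration, expressed as
// plain Go pseudologic with hypothetical helpers.
package main

import "log"

func selectTargetNode(pod string) (string, bool)  { return "node-2", true }          // hypothetical
func checkpoint(node, pod string) (string, error) { return "/tmp/" + pod + ".tar", nil } // hypothetical
func transferArchive(archive, node string) error  { return nil }                      // hypothetical
func restore(node, archive string) error          { return nil }                      // hypothetical

func main() {
	pod := "low-priority-training"

	// 1. Find a node with enough free resources and compatible accelerators.
	target, ok := selectTargetNode(pod)
	if !ok {
		log.Fatal("no suitable target node found")
	}
	// 2. Checkpoint the workload on its current node.
	archive, err := checkpoint("node-1", pod)
	if err != nil {
		log.Fatal(err)
	}
	// 3. Move the checkpoint archive and restore the workload on the target.
	if err := transferArchive(archive, target); err != nil {
		log.Fatal(err)
	}
	if err := restore(target, archive); err != nil {
		log.Fatal(err)
	}
	log.Printf("migrated %s to %s", pod, target)
}
```
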
### Risks and Mitigations

In its first implementation the risks are low as it tries to be a CRI API change
with minimal changes to the kubelet and it is gated by the feature
gate `ContainerCheckpoint`.
## Design Details
@@ -168,7 +291,54 @@ containers in the pod are checkpointed.

One possible result of being able to checkpoint and restore containers and pods
might be the possibility to migrate containers and pods in the future as
discussed in [#3949][migration-issue].

#### Checkpoint Archive Management

One of the questions from users has been what happens with old checkpoint
archives. Especially if there are multiple checkpoints on a single node, these
checkpoint archives can occupy node-local disk space. Depending on the
checkpoint archive size, this could result in a situation where the node runs
out of local disk space.

One approach to avoid out-of-disk-space situations would be some kind of
checkpoint archive management or garbage collection of old checkpoint archives.

One possible argument against checkpoint archive management could be that,
especially for the forensic use case, once the checkpoint archive has been
created the user should retrieve it from the node and delete it. As there are,
however, many different use cases for container checkpointing, it seems more
realistic to have checkpoint archive management in place that automatically
cleans up old checkpoint archives.

In its simplest form, checkpoint archive management could just start deleting
checkpoint archives once the number of checkpoints reaches a certain threshold.
If more checkpoint archives than the configurable threshold exist, older
checkpoint archives are deleted (see [#115888][checkpoint-management]).
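
A minimal sketch of such a threshold-based cleanup is shown below. The
checkpoint directory, the archive naming, and the fixed threshold are
assumptions for illustration; an actual implementation could live in the
kubelet and use its configuration.

```go
// Sketch: threshold-based garbage collection of checkpoint archives.
package main

import (
	"log"
	"os"
	"path/filepath"
	"sort"
)

const (
	checkpointDir = "/var/lib/kubelet/checkpoints" // assumed location
	maxArchives   = 10                             // assumed configurable threshold
)

func main() {
	entries, err := os.ReadDir(checkpointDir)
	if err != nil {
		log.Fatal(err)
	}

	// Collect checkpoint archives together with their modification time.
	type archive struct {
		path    string
		modTime int64
	}
	var archives []archive
	for _, e := range entries {
		if e.IsDir() || filepath.Ext(e.Name()) != ".tar" {
			continue
		}
		info, err := e.Info()
		if err != nil {
			continue
		}
		archives = append(archives, archive{
			path:    filepath.Join(checkpointDir, e.Name()),
			modTime: info.ModTime().Unix(),
		})
	}

	// Oldest first, then delete everything beyond the threshold.
	sort.Slice(archives, func(i, j int) bool { return archives[i].modTime < archives[j].modTime })
	for len(archives) > maxArchives {
		if err := os.Remove(archives[0].path); err != nil {
			log.Printf("failed to remove %s: %v", archives[0].path, err)
		}
		archives = archives[1:]
	}
}
```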

Another way to manage checkpoint archives would be to delete checkpoint
archives once a certain amount of disk space is used or if not enough free
space is available.

A third way to manage checkpoint archives would be to keep one checkpoint
archive per day/week/month. The best way to manage checkpoint archives probably
depends on the checkpointing use case. For the forensic use case, older
checkpoint archives might still be of interest, in contrast to checkpointing
and restoring containers in combination with spot instances, where probably
only the latest checkpoint is of interest to be able to continue preempted
work.

#### CLI (kubectl) Integration

The current (Alpha) implementation only offers access to the checkpoint
functionality through the *kubelet* API endpoint. A more user-friendly
interface would be a *kubectl* integration. As of this writing a pull request
adding the *checkpoint* verb to *kubectl* exists: [#120898][kubectl-checkpoint].

#### Checkpoint Options

The current (Alpha) implementation does not allow additional checkpoint
parameters to be passed to CRIU. During the integration of checkpoint/restore
in other container projects (CRI-O, Docker, Podman, runc, crun, lxc) many
CRIU-specific options were exposed to the user. Common options include
checkpointing established TCP connections (`--tcp-established`), stopping the
container after checkpointing, using pre-copy or post-copy algorithms to
decrease container downtime during migration, or selecting the compression
method of the checkpoint archive (currently uncompressed, but formats such as
zstd or gzip would be possible).
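
As a sketch of what such options could look like if they were ever exposed,
the following hypothetical option set mirrors parameters that CRIU and the
container engines listed above already support. None of these fields exist in
the current CRI `CheckpointContainerRequest`.

```go
// Sketch: a hypothetical checkpoint option set, not part of any API today.
package main

import "fmt"

type CheckpointOptions struct {
	TCPEstablished bool   // checkpoint established TCP connections (CRIU --tcp-established)
	LeaveStopped   bool   // stop the container after the checkpoint was written
	PreCopy        bool   // iterative pre-copy memory dumps to shrink downtime
	PostCopy       bool   // lazy post-copy restore (pages migrated on demand)
	Compression    string // e.g. "none", "gzip", "zstd"
}

func main() {
	opts := CheckpointOptions{TCPEstablished: true, Compression: "zstd"}
	fmt.Printf("%+v\n", opts)
}
```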
### Test Plan

@@ -387,3 +557,10 @@
using checkpoint and restore in Kubernetes through the existing paths of
runtimes and engines is not well known and maybe not even possible as
checkpointing and restoring is tightly integrated as it requires much
information only available by working closely with runtimes and engines.

[kubernetes-blog-post]: https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/
[migration-issue]: https://github.com/kubernetes/kubernetes/issues/3949
[task-migration]: https://lpc.events/event/2/contributions/69/attachments/205/374/Task_Migration_at_Scale_Using_CRIU_-_LPC_2018.pdf
[singularity]: https://arxiv.org/abs/2202.07848
[kubectl-checkpoint]: https://github.com/kubernetes/kubernetes/pull/120898
[checkpoint-management]: https://github.com/kubernetes/kubernetes/pull/115888
