Commit 36fe3cc

tests/int/cpt: fix lazy-pages flakiness
"checkpoint --lazy-pages and restore" test sometimes fails on restore
in our CI on Fedora 33 when systemd cgroup driver is used:

> (00.076104) Error (compel/src/lib/infect.c:1513): Task 48521 is in unexpected state: f7f
> (00.076122) Error (compel/src/lib/infect.c:1520): Task stopped with 15: Terminated
> ...
> (00.078246) Error (criu/cr-restore.c:2483): Restoring FAILED.

I think what happens is

1. The test runs runc checkpoint in lazy-pages mode in background.

2. The test runs criu lazy-pages in background.

3. The test runs runc restore.

Now, all three are working together: criu restore restores, criu
lazy-pages listens for page faults on a uffd and fetches missing pages
from runc checkpoint, which serves those pages.

At some point criu lazy-pages decides to fetch the rest of the pages,
and once it's done it exits, and runc checkpoint, as there are no more
pages to serve, exits too.

At the end of runc checkpoint the container is removed (see "defer
destroy(container)" in checkpoint.go). This involves a call to
cgroupManager.Destroy, which, in case systemd manager is used, calls
stopUnit, which makes systemd not just remove the unit, but also send
SIGTERM to its processes, if there are any.

As the container is being restored into the same systemd unit,
sometimes this results in sending SIGTERM to a process which criu
restores, and thus restoring fails.

The remedy here is to change the name of the systemd unit to which the
container is restored.

Signed-off-by: Kir Kolyshkin <[email protected]>
1 parent 2dd62b3 commit 36fe3cc

1 file changed: +3 -1 lines changed

tests/integration/checkpoint.bats

Lines changed: 3 additions & 1 deletion
@@ -211,11 +211,13 @@ function simple_cr() {
 lp_pid=$!
 
 # Restore lazily from checkpoint.
-# The restored container needs a different name as the checkpointed
+# The restored container needs a different name (as well as systemd
+# unit name, in case systemd cgroup driver is used) as the checkpointed
 # container is not yet destroyed. It is only destroyed at that point
 # in time when the last page is lazily transferred to the destination.
 # Killing the CRIU on the checkpoint side will let the container
 # continue to run if the migration failed at some point.
+[ -n "$RUNC_USE_SYSTEMD" ] && set_cgroups_path
 runc_restore_with_pipes ./image-dir test_busybox_restore --lazy-pages
 
 wait $cpt_pid
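
The idea behind the fix, giving the restored container a systemd unit name that differs from the checkpointed one, can be sketched roughly as below. This is only an illustration and not the test suite's actual `set_cgroups_path` helper: the function names `random_cgroups_path` and `unit_name_for`, the `runc-test` prefix, and the `restore-` name are hypothetical. It assumes the `slice:prefix:name` form of `cgroupsPath` used with runc's systemd cgroup driver, where the resulting unit is named `prefix-name.scope`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of the runc test suite.

# Build a fresh "slice:prefix:name" cgroupsPath with a random suffix, so
# that a restored container does not land in the unit that the
# checkpoint side is about to stop.
random_cgroups_path() {
	local prefix="${1:-runc-test}"
	local suffix
	# 4 random bytes rendered as 8 hex characters.
	suffix="$(od -An -N4 -tx1 /dev/urandom | tr -d ' \n')"
	printf 'machine.slice:%s:restore-%s\n' "$prefix" "$suffix"
}

# Derive the unit name expected for such a path, assuming the
# "prefix-name.scope" convention of the systemd cgroup driver.
unit_name_for() {
	local path="$1" prefix name
	IFS=: read -r _ prefix name <<<"$path"
	printf '%s-%s.scope\n' "$prefix" "$name"
}
```

With two different paths, stopping the unit of the checkpointed container should no longer deliver SIGTERM to processes that are being restored into the other unit.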
