You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
"checkpoint --lazy-pages and restore" test sometimes fails on restore
in our CI on Fedora 33 when systemd cgroup driver is used:
> (00.076104) Error (compel/src/lib/infect.c:1513): Task 48521 is in unexpected state: f7f
> (00.076122) Error (compel/src/lib/infect.c:1520): Task stopped with 15: Terminated
> ...
> (00.078246) Error (criu/cr-restore.c:2483): Restoring FAILED.
I think what happens is
1. The test runs runc checkpoint in lazy-pages mode in background.
2. The test runs criu lazy-pages in background.
3. The test runs runc restore.
Now, all three are working in together: criu restore restores, criu
lazy-pages listens for page faults on a uffd and fetch missing pages
from runc checkpoint, who serves those pages.
At some point criu lazy-pages decides to fetch the rest of the pages,
and once it's done it exits, and runc checkpoint, as there are no more
pages to serve, exits too.
At the end of runc checkpoint the container is removed (see "defer
destroy(container)" in checkpoint.go. This involves a call to
cgroupManager.Destroy, which, in case systemd manager is used,
calls stopUnit, which makes systemd to not just remove the unit,
but also send SIGTERM to its processes, if there are any.
As the container is being restored into the same systemd unit,
sometimes this results in sending SIGTERM to a process which
criu restores, and thus restoring fails.
The remedy here is to change the name of systemd unit to which the
container is restored.
Signed-off-by: Kir Kolyshkin <[email protected]>
0 commit comments