You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Processes can watch /proc/self/mounts or /mountinfo, and the kernel
will notify them whenever the namespace's mount table is modified. The
notified process still needs to read and parse the mountinfo to
determine what changed once notified. Many such processes, including
udisksd and SystemD < v248, make no attempt to rate-limit their
mountinfo notifications. This tends to not be a problem on many systems,
where mount tables are small and mounting and unmounting is uncommon.
Every runC exec which successfully uses the try_bindfd container-escape
mitigation performs two mount()s and one umount() in the host's mount
namespace, causing any mount-watching processes to wake up and parse the
mountinfo file three times in a row. Consequently, using 'exec' health
checks on containers has a larger-than-expected impact on system load
when such mount-watching daemons are running. Furthermore, the size of
the mount table in the host's mount namespace tends to be proportional
to the number of OCI containers as a unique mount is required for the
rootfs of each container. Therefore, on systems with mount-watching
processes, the system load increases *quadratically* with the number of
running containers which use health checks!
Prevent runC from incidentally modifying the host's mount namespace for
container-escape mitigations by setting up the mitigation in a private
mount namespace.
Signed-off-by: Cory Snider <[email protected]>
0 commit comments