[feature request] *: introduce pidfd-socket flag #4045

fuweid · 2023-10-01T08:02:41Z

The container manager like containerd-shim can't use cgroup.kill feature or freeze all the processes in cgroup to terminate the exec init process. It's unsafe to call kill(2) since the pid can be recycled. It's good to provide the pidfd of init process through the pidfd-socket. It's similar to the console-socket. With the pidfd, the container manager like containerd-shim can send the signal to target process safely.

And for the standard init process, we can have polling support to get exit event instead of blocking on wait4.

Let me explain why the containerd-shim needs this feature for containerd init process.

Without pidfd, containerd-shim can't tell which process exits. It has to use reap all the zombies.
However, it requires all the fork/exec operations needs to use the reap-event-framework in containerd.
For example, the mount go-package needs to fork child process which unshares to get brand-new userns. If the child process has been killed and reaped by containerd-shim, the child process's pid can be reused. In order to know the exit event, the mount go-package needs to use reap-event-framework, which doesn't make senses.

With pidfd support, we can use polling support to know which process exits instead of calling wait4 syscall.

And one more detail is that the containerd-shim only cares the container init process and exec init processes.

Currently, containerd-shim uses PR_SET_CHILD_SUBREAPER, and watch the signal SIGCHLD to reap all the zombie processes, including container init process and exec init processes. Before v4.11 kernel-exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction, the process X double-forked by the exec init process will be reparented to containerd-shim so that containerd-shim can cleanup the zombie.

# docker exec
root@50c69c07310e:/go# uname -r
4.4.0-210-generic
root@50c69c07310e:/go# go install github.com/fuweid/[email protected]
root@50c69c07310e:/go#
root@50c69c07310e:/go# cmdrun-go sleep 30
root@50c69c07310e:/go#
root@50c69c07310e:/go# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 04:57 ?        00:00:00 sleep 1d
root         7     0  0 04:58 pts/0    00:00:00 bash
root       262     0  0 05:11 pts/0    00:00:00 sleep 30
root       264     7  0 05:11 pts/0    00:00:00 ps -ef
root@50c69c07310e:/go#

Right now (>= v4.11 kernel), the containerd-shim only receives the SIGCHLD from init processes because the double-forked processes will be reparent to pid-1 in the pid namespace. So, containerd-shim doesn't need to care any double-forked processes. The pidfd can help containerd-shim to focus on the correct processes.

# docker exec
root@d46705a1963c:/go# uname -r
6.4.0-4-amd64
root@d46705a1963c:/go#
root@d46705a1963c:/go# go install github.com/fuweid/[email protected]
root@d46705a1963c:/go#
root@d46705a1963c:/go# cmdrun-go sleep 5
root@d46705a1963c:/go#
root@d46705a1963c:/go# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 05:15 ?        00:00:00 /bin/sleep 1d
root           7       0  0 05:15 pts/0    00:00:00 /bin/bash
root         103       1  0 05:16 pts/0    00:00:00 sleep 5
root         104       7  0 05:16 pts/0    00:00:00 ps -ef
root@d46705a1963c:/go#
root@d46705a1963c:/go# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 05:15 ?        00:00:00 /bin/sleep 1d
root           7       0  0 05:15 pts/0    00:00:00 /bin/bash
root         103       1  0 05:16 pts/0    00:00:00 [sleep] <defunct>
root         105       7  0 05:16 pts/0    00:00:00 ps -ef
root@d46705a1963c:/go#

REF: containerd/containerd#9175

lifubang · 2023-10-01T09:29:43Z

The king of involution, is writing code during holidays also a form of rest? 😄

The container manager like containerd-shim can't use cgroup.kill feature or freeze all the processes in cgroup to terminate the exec init process.

Could you please explain why? And why we can’t use ‘runc kill’ or runc libcontainer API directly?

It's unsafe to call kill(2) since the pid can be recycled.

I seem to remember that ‘contained’ has used kill(2) for many many years.

libcontainer/init_linux.go

fuweid · 2023-10-02T05:04:09Z

Could you please explain why? And why we can’t use ‘runc kill’ or runc libcontainer API directly?

runc-kill is used to send the signal to the container init process. As far as I know, there is no runc-commandline to send signal to the exec init process.

I seem to remember that ‘contained’ has used kill(2) for many many years.

Yes because pidfd is available since v5.3 kernel. pidfd can ensure that we can send the signal to the correct process, especially the exec-probe has timeout.

fuweid · 2023-10-02T05:30:28Z

ping @AkihiroSuda @thaJeztah
cc @dmcgowan

fuweid · 2023-10-02T23:10:03Z

also ping @lifubang @cyphar @kolyshkin

lifubang · 2023-10-03T00:32:36Z

I think this is a nice feature, because I have hated the big for loop in containerd to find out whether the exit signal is from the init process or not in many years ago.

Just only one question, I think maybe we can simplify the implementation, I don’t know whether my solution could work or not:
Like ‘createPidFile’, How about create a function named ‘sendPidFd’ in utils_linux.go, use process.Pid() to get PidFd, and call it in function ’run’.

runc/utils_linux.go

Lines 270 to 278 in ee45b9b

    
           tty.ClosePostStart() 
        
           if r.pidFile != "" { 
        
           	if err = createPidFile(r.pidFile, process); err != nil { 
        
           		r.terminate(process) 
        
           		return -1, err 
        
           	} 
        
           } 
        
           status, err := handler.forward(process, tty, detach) 
        
           if err != nil {

fuweid · 2023-10-03T00:47:21Z

Just only one question, I think maybe we can simplify the implementation, I don’t know whether my solution could work > or not:
Like ‘createPidFile’, How about create a function named ‘sendPidFd’ in utils_linux.go, use process.Pid() to get PidFd, and > call it in function ’run’.

Basically, yes. The runc-{create/exec/run} process is still parent of the init process before exit. We should check the status of process by WNOHANG after pidfd_open. Since the pidfd_open allows to open the own process's pid, it's safe and not race-condition. So I think it's straightforward to handle it in the init process. Sounds good to you?

cyphar · 2023-10-04T03:51:47Z

It seems like it would've been a good idea to make --console-socket a proper communication channel so we don't need to add a new one for each file descriptor (unfortunately I don't think we can fix this -- old runc callers expect to only get the console socket)...

FWIW, I don't like adding features to runc's command-line if we can avoid it -- it makes life harder for other OCI runtimes because we are creating non-standard behaviour that everyone has to copy from us in order to work with runtimes that depend on it. I made this mistake with --console-socket (in retrospect, we probably should've implemented something like LXC's monitor process) and it'd be a shame to repeat it if it's unnecessary.

But then again, I don't see another way of solving it, other than re-architecting runc... Hmmm...

lifubang · 2023-10-04T12:01:02Z

I don't like adding features to runc's command-line if we can avoid it

I think it isn’t a problem to add features to runc's command-line, because if there is a way that we accept pidFd solution without cmd flag, other OCI runtimes should also have to support it with the way like runc uses. What I mean is that, Can we independently solve this problem on containerd side? @fuweid For example, when containerd have fetched the init process’ pid, how about get the pidFd from containerd side?

fuweid · 2023-10-04T12:21:02Z

Thanks for the comment! @cyphar

because we are creating non-standard behaviour that everyone has to copy from us in order to work with runtimes that depend on it.

Understand. Currently, no spec is to describe what the command line looks like. For example, the standard init process has two steps to setup: runc-create and runc-start. But the setns init process only has one step runc-exec. It's more about best practice.

But then again, I don't see another way of solving it, other than re-architecting runc... Hmmm...

Just wondering about what re-architecting runc looks like. If it's not related to spec or standard, I think we still have problem to align with all the runtime implementations. Any new features could introduce new flag.

It seems like it would've been a good idea to make --console-socket a proper communication channel so we don't need to add a new one for each file descriptor

Totally agrees. I was thinking about introduce --experimental-xxx to support new command-line features. The feature needs baking time before GA.

Hi @lifubang

What I mean is that, Can we independently solve this problem on containerd side? @fuweid For example, when containerd have fetched the init process’ pid, how about get the pidFd from containerd side?

It requires the sub-reaper setting. The idea comes from refactoring the containerd-shim process manager.
Without sub-reaper setting, there is race-condition to open pidfd after receiving the pid file, because of pid recycled.
With the dup fd by socket, the receiver can ensure that pidfd is correct and there is no race-condition.

I think it's useful to non-sub-reaper use case as well.

libcontainer/init_linux.go

fuweid · 2023-10-09T03:20:00Z

ping @opencontainers/runc-maintainers ~

fuweid · 2023-10-19T02:54:12Z

Sorry to ping @cyphar @kolyshkin @AkihiroSuda @thaJeztah @lifubang again. Any thoughts on this pull request? Thanks

AkihiroSuda · 2023-10-19T11:43:02Z

run.go

+		cli.StringFlag{
+			Name:  "pidfd-socket",
+			Usage: "path to an AF_UNIX socket which will receive a file descriptor referencing the init process",
+		},


Needs a test

Added. cc @lifubang

CI is green now. @lifubang @AkihiroSuda PTAL thanks

utils_linux.go

Makefile

lifubang

LGTM
Please add a changelog entry in your PR description.

lifubang · 2023-10-20T13:31:51Z

@cyphar @kolyshkin @thaJeztah PTAL, if there is no objection, I will merge it in next week.
There are just two things needs to discuss:

Is it need to add a description: This is an experimental flag at this time.
There is another way to implement this feature, using env, just like NOTIFY_SOCKET in systemd area. But I think this way is not better than command line.

exec.go

The container manager like containerd-shim can't use cgroup.kill feature or freeze all the processes in cgroup to terminate the exec init process. It's unsafe to call kill(2) since the pid can be recycled. It's good to provide the pidfd of init process through the pidfd-socket. It's similar to the console-socket. With the pidfd, the container manager like containerd-shim can send the signal to target process safely. And for the standard init process, we can have polling support to get exit event instead of blocking on wait4. Signed-off-by: Wei Fu <[email protected]>

lifubang reviewed Oct 1, 2023

View reviewed changes

libcontainer/init_linux.go Show resolved Hide resolved

lifubang reviewed Oct 1, 2023

View reviewed changes

libcontainer/init_linux.go Show resolved Hide resolved

lifubang added the kind/feature label Oct 1, 2023

fuweid force-pushed the support-pidfd-socket branch from 5a4d6eb to 2328294 Compare October 2, 2023 04:59

fuweid force-pushed the support-pidfd-socket branch from 2328294 to 71ff429 Compare October 4, 2023 01:04

fuweid force-pushed the support-pidfd-socket branch from 71ff429 to 7f3dfd9 Compare October 4, 2023 12:30

lifubang requested changes Oct 4, 2023

View reviewed changes

libcontainer/init_linux.go Outdated Show resolved Hide resolved

fuweid force-pushed the support-pidfd-socket branch from 7f3dfd9 to 0117ed9 Compare October 4, 2023 13:03

lifubang reviewed Oct 4, 2023

View reviewed changes

libcontainer/init_linux.go Show resolved Hide resolved

fuweid force-pushed the support-pidfd-socket branch 2 times, most recently from 16c6989 to 1105572 Compare October 6, 2023 14:21

fuweid mentioned this pull request Oct 11, 2023

clarify kill and delete operation for shared pid namespace container opencontainers/runtime-spec#1234

Open

AkihiroSuda reviewed Oct 19, 2023

View reviewed changes

utils_linux.go Show resolved Hide resolved

AkihiroSuda reviewed Oct 19, 2023

View reviewed changes

utils_linux.go Show resolved Hide resolved

AkihiroSuda reviewed Oct 19, 2023

View reviewed changes

utils_linux.go Show resolved Hide resolved

fuweid force-pushed the support-pidfd-socket branch 3 times, most recently from 661a689 to 5fe6606 Compare October 20, 2023 07:27

fuweid force-pushed the support-pidfd-socket branch 2 times, most recently from d77625a to 911366a Compare October 20, 2023 07:41

AkihiroSuda approved these changes Oct 20, 2023

View reviewed changes

lifubang requested changes Oct 20, 2023

View reviewed changes

Makefile Show resolved Hide resolved

fuweid force-pushed the support-pidfd-socket branch from 911366a to 52ad8b5 Compare October 20, 2023 12:08

lifubang approved these changes Oct 20, 2023

View reviewed changes

lifubang added impact/changelog status/4-merge labels Oct 20, 2023

lifubang force-pushed the support-pidfd-socket branch from 52ad8b5 to 12c2dab Compare October 24, 2023 11:42

lifubang reviewed Oct 28, 2023

View reviewed changes

exec.go Outdated Show resolved Hide resolved

fuweid force-pushed the support-pidfd-socket branch from 12c2dab to 94505a0 Compare November 21, 2023 10:29

lifubang merged commit 95a93c1 into opencontainers:main Nov 22, 2023

fuweid deleted the support-pidfd-socket branch December 1, 2023 07:27

fuweid mentioned this pull request Dec 1, 2023

feature: support Two Phases to Start Exec Process like Init fuweid/embedshim#27

Open

cyphar mentioned this pull request Mar 14, 2024

release 1.2.0-rc.1 #4221

Merged

cpuguy83 mentioned this pull request Jun 5, 2024

Investigate if pidfd socket can help with exec issues cpuguy83/containerd-shim-systemd-v1#8

Open

rata mentioned this pull request Mar 3, 2025

Release 1.3.0-rc.1 #4657

Merged

github-actions bot mentioned this pull request Mar 9, 2025

Bump runc from v1.1.13 to v1.3.0-rc.1 kokyhm/kubespray#159

Open

[feature request] *: introduce pidfd-socket flag #4045

[feature request] *: introduce pidfd-socket flag #4045

Uh oh!

Conversation

fuweid commented Oct 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lifubang commented Oct 1, 2023

Uh oh!

Uh oh!

Uh oh!

fuweid commented Oct 2, 2023

Uh oh!

fuweid commented Oct 2, 2023

Uh oh!

fuweid commented Oct 2, 2023

Uh oh!

lifubang commented Oct 3, 2023

Uh oh!

fuweid commented Oct 3, 2023

Uh oh!

cyphar commented Oct 4, 2023

Uh oh!

lifubang commented Oct 4, 2023

Uh oh!

fuweid commented Oct 4, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fuweid commented Oct 9, 2023

Uh oh!

fuweid commented Oct 19, 2023

Uh oh!

AkihiroSuda Oct 19, 2023

Choose a reason for hiding this comment

Uh oh!

fuweid Oct 20, 2023

Choose a reason for hiding this comment

Uh oh!

fuweid Oct 20, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lifubang left a comment

Choose a reason for hiding this comment

Uh oh!

lifubang commented Oct 20, 2023

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

fuweid commented Oct 1, 2023 •

edited

Loading

fuweid commented Oct 4, 2023 •

edited

Loading