[feature request] *: introduce pidfd-socket flag #4045
The king of involution! Is writing code during the holidays also a form of rest? 😄
Could you please explain why? And why can't we use ‘runc kill’ or the runc libcontainer API directly?
As far as I remember, containerd has used kill(2) for many years.
runc-kill is used to send a signal to the container init process. As far as I know, there is no runc command line for sending a signal to an exec init process.
Yes, because pidfd has only been available since the v5.3 kernel. A pidfd ensures that the signal goes to the correct process, which matters especially for exec probes, since they have a timeout.
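To make the pid-recycling point concrete, here is a minimal sketch of signalling through a pidfd from Go. It is not code from this PR. It assumes Linux >= 5.3, a `sleep` binary on the PATH, and raw syscall numbers 424 and 434 (valid on amd64 and arm64; the standard `syscall` package does not export them). The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// Assumption: raw syscall numbers for amd64 and arm64.
const (
	sysPidfdSendSignal = 424
	sysPidfdOpen       = 434
)

// pidfdOpen returns a file descriptor that refers to the process with the given pid.
func pidfdOpen(pid int) (int, error) {
	fd, _, errno := syscall.Syscall(sysPidfdOpen, uintptr(pid), 0, 0)
	if errno != 0 {
		return -1, errno
	}
	return int(fd), nil
}

// pidfdSendSignal delivers sig to the process that pidfd refers to.
func pidfdSendSignal(pidfd int, sig syscall.Signal) error {
	_, _, errno := syscall.Syscall6(sysPidfdSendSignal, uintptr(pidfd), uintptr(sig), 0, 0, 0, 0)
	if errno != 0 {
		return errno
	}
	return nil
}

// killChildViaPidfd starts a child, kills it through its pidfd, and reports
// how the child terminated.
func killChildViaPidfd() (string, error) {
	cmd := exec.Command("sleep", "60")
	if err := cmd.Start(); err != nil {
		return "", err
	}
	pidfd, err := pidfdOpen(cmd.Process.Pid)
	if err != nil {
		cmd.Process.Kill()
		cmd.Wait()
		return "", err
	}
	defer syscall.Close(pidfd)

	if err := pidfdSendSignal(pidfd, syscall.SIGKILL); err != nil {
		return "", err
	}
	waitErr := cmd.Wait() // reaps the child
	if waitErr == nil {
		return "", fmt.Errorf("child exited cleanly, expected it to be killed")
	}
	// Once the child is reaped, the pidfd never refers to a recycled pid:
	// the kernel answers ESRCH instead of signalling some other process.
	if err := pidfdSendSignal(pidfd, 0); err != syscall.ESRCH {
		return "", fmt.Errorf("expected ESRCH after reap, got %v", err)
	}
	return waitErr.Error(), nil
}

func main() {
	state, err := killChildViaPidfd()
	if err != nil {
		panic(err)
	}
	fmt.Println("child state:", state)
}
```

With kill(2), the second signal could hit an unrelated process that had reused the pid. With the pidfd it fails cleanly.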
ping @AkihiroSuda @thaJeztah
also ping @lifubang @cyphar @kolyshkin
I think this is a nice feature. For many years I have hated the big for loop in containerd that finds out whether an exit signal came from the init process. Just one question: maybe we can simplify the implementation, though I don't know whether my solution would work.
Lines 270 to 278 in ee45b9b
Basically, yes. The runc-{create/exec/run} process is still the parent of the init process before it exits. We should check the status of the process by
It seems like it would've been a good idea to make
FWIW, I don't like adding features to runc's command line if we can avoid it, because it makes life harder for other OCI runtimes: we are creating non-standard behaviour that everyone has to copy from us in order to work with runtimes that depend on it. I made this mistake with
But then again, I don't see another way of solving it, other than re-architecting runc... Hmmm...
I don't think adding features to runc's command line is a problem: even if we found a way to accept the pidfd solution without a command-line flag, other OCI runtimes would still have to support it the same way runc does. What I mean is, can we solve this problem independently on the containerd side? @fuweid For example, once containerd has fetched the init process's pid, how about getting the pidfd on the containerd side?
Thanks for the comment! @cyphar
Understood. Currently, no spec describes what the command line looks like. For example, the standard init process has two steps to set up:
Just wondering what re-architecting runc would look like. If it's not related to the spec or a standard, I think we will still have trouble aligning all the runtime implementations. Any new feature could introduce a new flag.
Totally agree. I was thinking about introducing
Hi @lifubang
It requires the sub-reaper setting. The idea comes from refactoring the containerd-shim process manager. I think it's useful for the non-sub-reaper use case as well.
ping @opencontainers/runc-maintainers ~
Sorry to ping @cyphar @kolyshkin @AkihiroSuda @thaJeztah @lifubang again. Any thoughts on this pull request? Thanks
    cli.StringFlag{
        Name:  "pidfd-socket",
        Usage: "path to an AF_UNIX socket which will receive a file descriptor referencing the init process",
    },
Needs a test
Added. cc @lifubang
CI is green now. @lifubang @AkihiroSuda PTAL thanks
LGTM
Please add a changelog entry in your PR description.
@cyphar @kolyshkin @thaJeztah PTAL. If there are no objections, I will merge it next week.
A container manager such as containerd-shim can't use the cgroup.kill feature, or freeze all the processes in the cgroup, to terminate an exec init process. Calling kill(2) is unsafe because the pid can be recycled. It is better to provide a pidfd of the init process through the pidfd-socket, similar to the console-socket. With the pidfd, a container manager such as containerd-shim can send a signal to the target process safely.
For the standard init process, the pidfd also allows polling for the exit event instead of blocking on wait4.
Signed-off-by: Wei Fu <[email protected]>
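For readers unfamiliar with how a file descriptor travels over an AF_UNIX socket, here is a minimal sketch of the SCM_RIGHTS mechanism that the console-socket comparison refers to. It is an illustration, not runc's implementation. The message payload ("demo"), the socket file name, and the helper names are all hypothetical, and a pipe stands in for the pidfd so the example runs without a container runtime.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"path/filepath"
	"syscall"
)

// sendFd passes fd over a connected AF_UNIX socket as SCM_RIGHTS ancillary data.
func sendFd(conn *net.UnixConn, name string, fd int) error {
	_, _, err := conn.WriteMsgUnix([]byte(name), syscall.UnixRights(fd), nil)
	return err
}

// recvFd reads one message and returns the first file descriptor attached to it.
func recvFd(conn *net.UnixConn) (int, error) {
	buf := make([]byte, 64)
	oob := make([]byte, syscall.CmsgSpace(4)) // room for one 32-bit descriptor
	_, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
	if err != nil {
		return -1, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return -1, err
	}
	for i := range msgs {
		fds, err := syscall.ParseUnixRights(&msgs[i])
		if err == nil && len(fds) > 0 {
			return fds[0], nil
		}
	}
	return -1, fmt.Errorf("no file descriptor in message")
}

// passPipeOverSocket plays both roles: the manager listens on a socket path,
// the "runtime" side dials it and sends a descriptor.
func passPipeOverSocket() (string, error) {
	dir, err := os.MkdirTemp("", "pidfd-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	addr := &net.UnixAddr{Name: filepath.Join(dir, "pidfd.sock"), Net: "unix"}

	ln, err := net.ListenUnix("unix", addr)
	if err != nil {
		return "", err
	}
	defer ln.Close()

	r, w, err := os.Pipe() // stand-in for the pidfd
	if err != nil {
		return "", err
	}
	defer r.Close()
	defer w.Close()

	sender, err := net.DialUnix("unix", nil, addr)
	if err != nil {
		return "", err
	}
	defer sender.Close()
	if err := sendFd(sender, "demo", int(r.Fd())); err != nil {
		return "", err
	}

	receiver, err := ln.AcceptUnix()
	if err != nil {
		return "", err
	}
	defer receiver.Close()
	fd, err := recvFd(receiver)
	if err != nil {
		return "", err
	}
	received := os.NewFile(uintptr(fd), "received")
	defer received.Close()

	// Prove the received descriptor refers to the same pipe.
	if _, err := w.WriteString("hello"); err != nil {
		return "", err
	}
	out := make([]byte, 5)
	if _, err := io.ReadFull(received, out); err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	got, err := passPipeOverSocket()
	if err != nil {
		panic(err)
	}
	fmt.Println("read through received fd:", got)
}
```

The receiver gets its own descriptor number that refers to the same kernel object, which is why a pidfd received this way stays bound to the original process.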
Let me explain why containerd-shim needs this feature for the container init process.
Without a pidfd, containerd-shim can't tell which process has exited; it has to reap all the zombies.
However, that requires every fork/exec operation to go through the reap-event framework in containerd.
For example, the mount go-package needs to fork a child process which unshares to get a brand-new userns. If the child process has been killed and reaped by containerd-shim, the child's pid can be reused. To learn about the exit event, the mount go-package would need to use the reap-event framework, which doesn't make sense.
With pidfd support, we can use polling to know which process has exited instead of calling the wait4 syscall.
And one more detail: containerd-shim only cares about the container init process and exec init processes.
Currently, containerd-shim uses PR_SET_CHILD_SUBREAPER and watches for the SIGCHLD signal to reap all the zombie processes, including the container init process and exec init processes. Before the v4.11 kernel commit "exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction", a process X double-forked by the exec init process was reparented to containerd-shim so that containerd-shim could clean up the zombie.
Right now (>= v4.11 kernel), containerd-shim only receives SIGCHLD from init processes, because double-forked processes are reparented to pid 1 in the pid namespace. So containerd-shim doesn't need to care about any double-forked processes. The pidfd helps containerd-shim focus on the correct processes.
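The polling idea mentioned above can be sketched roughly as follows: a pidfd becomes readable when its process terminates, so it can sit in an epoll set instead of a blocking wait4 loop. This is only an illustration, not containerd-shim code. It assumes Linux >= 5.3, a `sleep` binary on the PATH, and raw syscall number 434 for pidfd_open (valid on amd64 and arm64). The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// Assumption: raw pidfd_open syscall number for amd64 and arm64.
const sysPidfdOpen = 434

// pidfdOpen returns a file descriptor that refers to the process with the given pid.
func pidfdOpen(pid int) (int, error) {
	fd, _, errno := syscall.Syscall(sysPidfdOpen, uintptr(pid), 0, 0)
	if errno != 0 {
		return -1, errno
	}
	return int(fd), nil
}

// waitExit reports whether the process behind pidfd terminated within timeoutMs.
// The timeout restarts if the wait is interrupted by a signal.
func waitExit(pidfd int, timeoutMs int) (bool, error) {
	epfd, err := syscall.EpollCreate1(syscall.EPOLL_CLOEXEC)
	if err != nil {
		return false, err
	}
	defer syscall.Close(epfd)

	ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(pidfd)}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, pidfd, &ev); err != nil {
		return false, err
	}
	events := make([]syscall.EpollEvent, 1)
	for {
		n, err := syscall.EpollWait(epfd, events, timeoutMs)
		if err == syscall.EINTR {
			continue // interrupted by a signal, retry
		}
		if err != nil {
			return false, err
		}
		return n == 1 && events[0].Fd == int32(pidfd), nil
	}
}

// pollChildExit starts a short-lived child and learns about its exit from
// the pidfd rather than from a blocking wait.
func pollChildExit() (bool, error) {
	cmd := exec.Command("sleep", "1")
	if err := cmd.Start(); err != nil {
		return false, err
	}
	defer cmd.Wait() // still reap the child once the exit has been observed

	pidfd, err := pidfdOpen(cmd.Process.Pid)
	if err != nil {
		return false, err
	}
	defer syscall.Close(pidfd)

	return waitExit(pidfd, 5000)
}

func main() {
	exited, err := pollChildExit()
	if err != nil {
		panic(err)
	}
	fmt.Println("exit observed via pidfd:", exited)
}
```

Because each pidfd identifies exactly one process, a manager can register only the init and exec init processes it cares about and ignore every other child.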
REF: containerd/containerd#9175