Description
While the list of capabilities in the kernel has been relatively stable, new
capabilities were added recently (CAP_PERFMON, CAP_BPF, and CAP_CHECKPOINT_RESTORE).
This proved to be a challenge: docker, for example, was updated to be aware
of these new capabilities (and detects whether the kernel on which it's running supports them),
but the current runc release (and possibly other runtimes) does not yet recognize them.
The specification currently defines that, in order to grant capabilities to a container process,
the container configuration has to specify those capabilities:
capabilities (object, OPTIONAL) is an object containing arrays that
specifies the sets of capabilities for the process.
Valid values are defined in the [capabilities(7)][capabilities.7] man page,
such as CAP_CHOWN. Any value which cannot be mapped to a relevant kernel
interface MUST cause an error.
In most situations, this is not a problem. For example, if I'm running on a 5.8+ kernel
and want to grant my container CAP_BPF capabilities, I start the container with --cap-add CAP_BPF.
Attempting to do the same on an older kernel version will produce an error (either generated
by dockerd, or by runc).
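To make the explicit case concrete, here is a minimal Go sketch of the capabilities object an engine might write into the container configuration for `--cap-add CAP_BPF`. The `capabilities` struct is a local stand-in that mirrors the five sets named in the specification, and the `grant` helper is hypothetical. Which sets an engine actually populates is an assumption of this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// capabilities is a local stand-in for the "capabilities" object of the
// process configuration; it is not imported from any real package.
type capabilities struct {
	Bounding    []string `json:"bounding,omitempty"`
	Effective   []string `json:"effective,omitempty"`
	Inheritable []string `json:"inheritable,omitempty"`
	Permitted   []string `json:"permitted,omitempty"`
	Ambient     []string `json:"ambient,omitempty"`
}

// grant is a hypothetical helper that puts the given names into the
// bounding, effective and permitted sets (an assumption of this sketch).
func grant(names ...string) capabilities {
	return capabilities{
		Bounding:  append([]string(nil), names...),
		Effective: append([]string(nil), names...),
		Permitted: append([]string(nil), names...),
	}
}

func main() {
	out, err := json.MarshalIndent(grant("CAP_CHOWN", "CAP_BPF"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Every name is spelled out by the caller here, so a runtime that cannot map one of them can reject the configuration with a precise error.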
However, when granting a container all capabilities (for example, when using
--cap-add=ALL, or when running a container with --privileged), things become
problematic.
In this situation, dockerd generates a list of all capabilities supported by the
host's kernel, and sets those capabilities in the container configuration. On a
5.8+ kernel, this will include the new capabilities (CAP_PERFMON, CAP_BPF, and
CAP_CHECKPOINT_RESTORE). Docker has no way to detect which capabilities are
supported by the runtime, and runc (or another runtime), for its part, processes
the list of capabilities and produces an error for any "unknown" capability.
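The mismatch can be sketched in a few lines of Go. This is an illustration under stated assumptions, not runc's actual code: `runtimeKnown` is a hypothetical, heavily abbreviated table of names an older runtime was built with, and `fromKernel` is a hypothetical list an engine might generate on a 5.8+ kernel.

```go
package main

import "fmt"

// runtimeKnown is a hypothetical compiled-in table of capability names
// known to an older runtime (abbreviated for the example).
var runtimeKnown = map[string]bool{
	"CAP_CHOWN":     true,
	"CAP_NET_ADMIN": true,
	"CAP_SYS_ADMIN": true,
}

// validate applies the rule from the specification: any value that the
// runtime cannot map must cause an error.
func validate(requested []string) error {
	for _, name := range requested {
		if !runtimeKnown[name] {
			return fmt.Errorf("unknown capability %q", name)
		}
	}
	return nil
}

func main() {
	// Hypothetical list generated by the engine from the host kernel.
	fromKernel := []string{"CAP_CHOWN", "CAP_NET_ADMIN", "CAP_SYS_ADMIN", "CAP_BPF"}
	if err := validate(fromKernel); err != nil {
		fmt.Println("container start fails:", err)
	}
}
```

The engine and the runtime each consult their own source of truth (the kernel versus a built-in table), so a newer kernel alone is enough to break `--privileged`.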
While docker could account for the runtime not supporting certain capabilities
(which is what's currently done as a temporary solution in moby/moby#41563),
doing so is undesirable, as it would tightly couple docker to the runtime (and would
complicate using alternative runtimes, such as crun, gVisor (runsc) or others).
Proposal
My proposal is to delegate generation of the "all capabilities" list to the runtime,
and to include a special ALL_CAPS (just a suggestion, I'm not attached to the name)
value in the specification.
- runtimes that do not support the ALL_CAPS special value consider it an
  "unknown capability", and will produce an error (as defined by the specification).
- runtimes that do support the ALL_CAPS special value will materialize the list
  of capabilities, and add all capabilities that the runtime (and active kernel)
  supports.
- when combining ALL_CAPS with other capabilities (e.g. ALL_CAPS and CAP_CHMOD),
  ALL_CAPS must take precedence. Alternatively, this situation could be considered
  ambiguous, and an error can be produced (we should consider what's more future-proof
  in case additional "special" values are to be added in future).
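The rules above could be sketched roughly as follows. Everything here is hypothetical: `resolveCaps`, its `strict` flag, and the `supported` argument (standing in for whatever the runtime and active kernel both support) are illustrative names, not part of any existing runtime. With `strict` set to false ALL_CAPS takes precedence; with `strict` set to true, combining it with other values is treated as an error.

```go
package main

import (
	"errors"
	"fmt"
)

// allCaps is the special value suggested in this proposal.
const allCaps = "ALL_CAPS"

// resolveCaps is a hypothetical helper that turns the requested names into
// the concrete list a runtime would apply.
func resolveCaps(requested, supported []string, strict bool) ([]string, error) {
	for _, name := range requested {
		if name != allCaps {
			continue
		}
		if strict && len(requested) > 1 {
			return nil, errors.New("ALL_CAPS cannot be combined with other values")
		}
		// ALL_CAPS takes precedence: materialize everything supported.
		return append([]string(nil), supported...), nil
	}
	known := make(map[string]bool, len(supported))
	for _, name := range supported {
		known[name] = true
	}
	for _, name := range requested {
		if !known[name] {
			return nil, fmt.Errorf("unknown capability %q", name)
		}
	}
	return append([]string(nil), requested...), nil
}

func main() {
	// Hypothetical, abbreviated set supported by runtime and kernel.
	supported := []string{"CAP_CHOWN", "CAP_SYS_ADMIN", "CAP_BPF"}
	caps, err := resolveCaps([]string{allCaps}, supported, false)
	fmt.Println(caps, err)
}
```

A runtime that predates the special value needs no change to satisfy the first rule: ALL_CAPS simply fails its existing lookup like any other unknown name.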
Compatibility and downsides
Ideally, docker would be able to detect what version of the runtime-spec is supported
by a runtime, but this is likely a separate discussion to have.
As described above, runtimes that do not support the ALL_CAPS special value
will produce an error. This could be considered a breaking change. On the other
hand, the current situation already fails to handle new capabilities being added
to the list.
Having an ALL_CAPS capability makes the container configuration "non-declarative";
the meaning of "all" capabilities will depend on the runtime, and the kernel on
which it's running. I don't think that's worse than the current situation, in
which the same applies, only at a higher level (dockerd or containerd supporting
the new capabilities).