seccomp: add support for "clone3" syscall in default policy

If no seccomp policy is requested, then the built-in default policy in
dockerd applies. This has no rule for "clone3" defined, nor any default
errno defined. So when runc receives the config it attempts to determine
a default errno, using logic defined in its commit:

  7a8d7162f9

As explained in the above commit message, runc uses a heuristic to
decide which errno to return by default:

[quote]
  The solution applied here is to prepend a "stub" filter which returns
  -ENOSYS if the requested syscall has a larger syscall number than any
  syscall mentioned in the filter. The reason for this specific rule is
  that syscall numbers are (roughly) allocated sequentially and thus newer
  syscalls will (usually) have a larger syscall number -- thus causing our
  filters to produce -ENOSYS if the filter was written before the syscall
  existed.
[/quote]

Unfortunately clone3 appears to one of the edge cases that does not
result in use of ENOSYS, instead ending up with the historical EPERM
errno.

Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use
clone3 by default. If it sees ENOSYS then it will automatically
fallback to using clone. Any other errno is treated as a fatal
error. Thus when docker seccomp policy triggers EPERM from clone3,
no fallback occurs and programs are thus unable to spawn threads.

The clone3 syscall is much more complicated than clone, most notably its
flags are not exposed as a directly argument any more. Instead they are
hidden inside a struct. This means that seccomp filters are unable to
apply policy based on values seen in flags. Thus we can't directly
replicate the current "clone" filtering for "clone3". We can at least
ensure "clone3" returns ENOSYS errno, to trigger fallback to "clone"
at which point we can filter on flags.

Fixes: https://github.com/moby/moby/issues/42680
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
This commit is contained in:
Daniel P. Berrangé 2021-07-26 19:10:01 +01:00
parent e9b07a730e
commit 9f6b562dd1
2 changed files with 27 additions and 0 deletions

View File

@ -553,6 +553,7 @@
"names": [
"bpf",
"clone",
"clone3",
"fanotify_init",
"fsconfig",
"fsmount",
@ -627,6 +628,18 @@
]
}
},
{
"names": [
"clone3"
],
"action": "SCMP_ACT_ERRNO",
"errnoRet": 38,
"excludes": {
"caps": [
"CAP_SYS_ADMIN"
]
}
},
{
"names": [
"reboot"

View File

@ -42,6 +42,7 @@ func arches() []Architecture {
// DefaultProfile defines the allowed syscalls for the default seccomp profile.
func DefaultProfile() *Seccomp {
nosys := uint(unix.ENOSYS)
syscalls := []*Syscall{
{
LinuxSyscall: specs.LinuxSyscall{
@ -546,6 +547,7 @@ func DefaultProfile() *Seccomp {
Names: []string{
"bpf",
"clone",
"clone3",
"fanotify_init",
"fsconfig",
"fsmount",
@ -615,6 +617,18 @@ func DefaultProfile() *Seccomp {
Caps: []string{"CAP_SYS_ADMIN"},
},
},
{
LinuxSyscall: specs.LinuxSyscall{
Names: []string{
"clone3",
},
Action: specs.ActErrno,
ErrnoRet: &nosys,
},
Excludes: &Filter{
Caps: []string{"CAP_SYS_ADMIN"},
},
},
{
LinuxSyscall: specs.LinuxSyscall{
Names: []string{