This system call is only available on the 32- and 64-bit PowerPC, it is
used by modern programming language implementations (such as gcc-go) to
implement coroutine features through userspace context switches.
Other container environment, such as Systemd nspawn already whitelist
this system call in their seccomp profile [1] [2]. As such, it would be
nice to also whitelist it in moby.
This issue was encountered on Alpine Linux GitLab CI system, which uses
moby, when attempting to execute gcc-go compiled software on ppc64le.
[1]: https://github.com/systemd/systemd/pull/9487
[2]: https://github.com/systemd/systemd/issues/9485
Signed-off-by: Sören Tempel <soeren+git@soeren-tempel.net>
Since commit "seccomp: Sync fields with runtime-spec fields"
(5d244675bd) we support to specify the
DefaultErrnoRet to be used.
Before that commit it was not specified and EPERM was used by default.
This commit keeps the same behaviour but just makes it explicit that the
default is EPERM.
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
If no seccomp policy is requested, then the built-in default policy in
dockerd applies. This has no rule for "clone3" defined, nor any default
errno defined. So when runc receives the config it attempts to determine
a default errno, using logic defined in its commit:
7a8d7162f9
As explained in the above commit message, runc uses a heuristic to
decide which errno to return by default:
[quote]
The solution applied here is to prepend a "stub" filter which returns
-ENOSYS if the requested syscall has a larger syscall number than any
syscall mentioned in the filter. The reason for this specific rule is
that syscall numbers are (roughly) allocated sequentially and thus newer
syscalls will (usually) have a larger syscall number -- thus causing our
filters to produce -ENOSYS if the filter was written before the syscall
existed.
[/quote]
Unfortunately clone3 appears to one of the edge cases that does not
result in use of ENOSYS, instead ending up with the historical EPERM
errno.
Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use
clone3 by default. If it sees ENOSYS then it will automatically
fallback to using clone. Any other errno is treated as a fatal
error. Thus when docker seccomp policy triggers EPERM from clone3,
no fallback occurs and programs are thus unable to spawn threads.
The clone3 syscall is much more complicated than clone, most notably its
flags are not exposed as a directly argument any more. Instead they are
hidden inside a struct. This means that seccomp filters are unable to
apply policy based on values seen in flags. Thus we can't directly
replicate the current "clone" filtering for "clone3". We can at least
ensure "clone3" returns ENOSYS errno, to trigger fallback to "clone"
at which point we can filter on flags.
Fixes: https://github.com/moby/moby/issues/42680
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
This patch, similar to d92739713c, embeds the
`LinuxSeccomp` type of the runtime-spec, so that we can support all options
provided by the spec, and decorates it with our own fields.
With this, profiles can make use of the recently added "Flags" field, to
specify flags that must be passed to seccomp(2) when installing the filter.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This makes the type better reflect the difference with the "runtime" profile;
our local type is used to generate a runtime-spec seccomp profile and extends
the runtime-spec type with additional fields; adding a "Name" field for backward
compatibility with older JSON representations, additional "Comment" metadata,
and conditional rules ("Includes", "Excludes") used during generation to adjust
the profile based on the container (capabilities) and host's (architecture, kernel)
configuration.
This change introduces one change in the type; the "runtime-spec" type uses a
`[]LinuxSeccompArg` for the `Args` field, whereas the local type used pointers;
`[]*LinuxSeccompArg`.
In addition, the runtime-spec Syscall type brings a new `ErrnoRet` field, allowing
the profile to specify the errno code returned for the syscall, which allows
changing the default EPERM for specific syscalls.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
These syscalls were disabled in #18971
due to them requiring CAP_PTRACE. CAP_PTRACE was blocked by default due
to a ptrace related exploit. This has been patched in the Linux kernel
(version 4.8) and thus `ptrace` has been re-enabled. However, these
associated syscalls seem to have been left behind. This commit brings
them in line with `ptrace`, and re-enables it for kernel > 4.8.
Signed-off-by: clubby789 <jamie@hill-daniel.co.uk>
These syscalls (some of which have been in Linux for a while but were
missing from the profile) fall into a few buckets:
* close_range(2), epoll_pwait2(2) are just extensions of existing "safe
for everyone" syscalls.
* The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
all equivalent to aspects of mount(2) and thus go into the
CAP_SYS_ADMIN category.
* process_madvise(2) is similar to the other process_*(2) syscalls and
thus goes in the CAP_SYS_PTRACE category.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
These types were not used in the API, so could not come up with
a reason why they were in that package.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This was just using libseccomp to get the right arch, but we can use
GOARCH to get this.
The nativeToSeccomp map needed to be adjusted a bit for mipsle vs mipsel
since that's go how refers to it. Also added some other arches to it.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2020-09-11 22:48:42 +00:00
Renamed from profiles/seccomp/seccomp_default.go (Browse further)