Commit Graph

25 Commits

Author SHA1 Message Date
Sebastiaan van Stijn 7b7d1132e8
seccomp: allow "bpf", "perf_event_open", gated by CAP_BPF, CAP_PERFMON
Update the profile to make use of CAP_BPF and CAP_PERFMON capabilities. Prior to
kernel 5.8, bpf and perf_event_open required CAP_SYS_ADMIN. This change enables
finer control of the privilege setting, thus allowing us to run certain system
tracing tools with minimal privileges.

Based on the original patch from Henry Wang in the containerd repository.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-08-18 18:34:09 +02:00
zhubojun e258d66f17 profiles: seccomp: add syscalls related to PKU in default policy
Add pkey_alloc(2), pkey_free(2) and pkey_mprotect(2) in seccomp default profile.
pkey_alloc(2), pkey_free(2) and pkey_mprotect(2) can only configure
the calling process's own memory, so they are existing "safe for everyone" syscalls.

close issue: #43481

Signed-off-by: zhubojun <bojun.zhu@foxmail.com>
2022-07-11 09:50:53 +08:00
Bastien Pascard 420142a886 profiles: seccomp: allow clock_settime64 when CAP_SYS_TIME is added
Signed-off-by: Bastien Pascard <bpascard@hotmail.com>
2022-07-06 23:45:13 +02:00
Djordje Lukic 7de9f4f82d Allow different syscalls from kernels 5.12 -> 5.16
Kernel 5.12:

    mount_setattr: needs CAP_SYS_ADMIN

Kernel 5.14:

    quotactl_fd: needs CAP_SYS_ADMIN
    memfd_secret: always allowed

Kernel 5.15:

    process_mrelease: always allowed

Kernel 5.16:

    futex_waitv: always allowed

Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
2022-05-13 12:35:08 +02:00
Justin Cormack f1dd6bf84e
Merge pull request #43553 from AkihiroSuda/riscv64
seccomp: support riscv64
2022-05-13 10:41:53 +01:00
Sebastiaan van Stijn e9712464ad
Merge pull request #43199 from Xyene/allow-landlock
seccomp: add support for Landlock syscalls in default policy
2022-05-13 10:18:45 +02:00
Tianon Gravi c9e19a2aa1 Remove "seccomp" build tag
Similar to the (now removed) `apparmor` build tag, this build-time toggle existed for users who needed to build without the `libseccomp` library.  That's no longer necessary, and given the importance of seccomp to the overall default security profile of Docker containers, it makes sense that any binary built for Linux should support (and use by default) seccomp if the underlying host does.

Signed-off-by: Tianon Gravi <admwiggin@gmail.com>
2022-05-12 14:48:35 -07:00
Akihiro Suda 4c2f18f6cc
seccomp: support riscv64
Corresponds to containerd PR 6882

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2022-05-02 17:41:43 +09:00
Tudor Brindus af819bf623 seccomp: add support for Landlock syscalls in default policy
This commit allows the Landlock[0] system calls in the default seccomp
policy.

Landlock was introduced in kernel 5.13, to fill the gap that inspecting
filepaths passed as arguments to filesystem system calls is not really
possible with pure `seccomp` (unless involving `ptrace`).

Allowing Landlock by default fits in with allowing `seccomp` for
containerized applications to voluntarily restrict their access rights
to files within the container.

[0]: https://www.kernel.org/doc/html/latest/userspace-api/landlock.html

Signed-off-by: Tudor Brindus <me@tbrindus.ca>
2022-01-31 08:44:04 -05:00
Sören Tempel 85eaf23bf4 seccomp: add support for "swapcontext" syscall in default policy
This system call is only available on the 32- and 64-bit PowerPC, it is
used by modern programming language implementations (such as gcc-go) to
implement coroutine features through userspace context switches.

Other container environment, such as Systemd nspawn already whitelist
this system call in their seccomp profile [1] [2]. As such, it would be
nice to also whitelist it in moby.

This issue was encountered on Alpine Linux GitLab CI system, which uses
moby, when attempting to execute gcc-go compiled software on ppc64le.

[1]: https://github.com/systemd/systemd/pull/9487
[2]: https://github.com/systemd/systemd/issues/9485

Signed-off-by: Sören Tempel <soeren+git@soeren-tempel.net>
2021-12-18 14:06:07 +01:00
Sebastiaan van Stijn 686be57d0a
Update to Go 1.17.0, and gofmt with Go 1.17
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-08-24 23:33:27 +02:00
Sebastiaan van Stijn 2480bebf59
Merge pull request #42649 from kinvolk/rata/seccomp-default-errno
seccomp: Use explicit DefaultErrnoRet
2021-08-03 15:13:42 +02:00
Rodrigo Campos fb794166d9 seccomp: Use explicit DefaultErrnoRet
Since commit "seccomp: Sync fields with runtime-spec fields"
(5d244675bd) we support to specify the
DefaultErrnoRet to be used.

Before that commit it was not specified and EPERM was used by default.
This commit keeps the same behaviour but just makes it explicit that the
default is EPERM.

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
2021-07-30 19:13:21 +02:00
Daniel P. Berrangé 9f6b562dd1 seccomp: add support for "clone3" syscall in default policy
If no seccomp policy is requested, then the built-in default policy in
dockerd applies. This has no rule for "clone3" defined, nor any default
errno defined. So when runc receives the config it attempts to determine
a default errno, using logic defined in its commit:

  7a8d7162f9

As explained in the above commit message, runc uses a heuristic to
decide which errno to return by default:

[quote]
  The solution applied here is to prepend a "stub" filter which returns
  -ENOSYS if the requested syscall has a larger syscall number than any
  syscall mentioned in the filter. The reason for this specific rule is
  that syscall numbers are (roughly) allocated sequentially and thus newer
  syscalls will (usually) have a larger syscall number -- thus causing our
  filters to produce -ENOSYS if the filter was written before the syscall
  existed.
[/quote]

Unfortunately clone3 appears to one of the edge cases that does not
result in use of ENOSYS, instead ending up with the historical EPERM
errno.

Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use
clone3 by default. If it sees ENOSYS then it will automatically
fallback to using clone. Any other errno is treated as a fatal
error. Thus when docker seccomp policy triggers EPERM from clone3,
no fallback occurs and programs are thus unable to spawn threads.

The clone3 syscall is much more complicated than clone, most notably its
flags are not exposed as a directly argument any more. Instead they are
hidden inside a struct. This means that seccomp filters are unable to
apply policy based on values seen in flags. Thus we can't directly
replicate the current "clone" filtering for "clone3". We can at least
ensure "clone3" returns ENOSYS errno, to trigger fallback to "clone"
at which point we can filter on flags.

Fixes: https://github.com/moby/moby/issues/42680
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2021-07-27 10:56:07 +01:00
Sebastiaan van Stijn 0ef7e727d2
seccomp: Seccomp: embed oci-spec LinuxSeccomp, add support for seccomp flags
This patch, similar to d92739713c, embeds the
`LinuxSeccomp` type of the runtime-spec, so that we can support all options
provided by the spec, and decorates it with our own fields.

With this, profiles can make use of the recently added "Flags" field, to
specify flags that must be passed to seccomp(2) when installing the filter.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-07-17 15:57:54 +02:00
Sebastiaan van Stijn c7cd1b9436
profiles/seccomp.Syscall: use pointers and omitempty
These fields are optional, and this makes the JSON representation
slightly less verbose.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-06-17 21:25:09 +02:00
Sebastiaan van Stijn d92739713c
seccomp.Syscall: embed runtime-spec Syscall type
This makes the type better reflect the difference with the "runtime" profile;
our local type is used to generate a runtime-spec seccomp profile and extends
the runtime-spec type with additional fields; adding a "Name" field for backward
compatibility with older JSON representations, additional "Comment" metadata,
and conditional rules ("Includes", "Excludes") used during generation to adjust
the profile based on the container (capabilities) and host's (architecture, kernel)
configuration.

This change introduces one change in the type; the "runtime-spec" type uses a
`[]LinuxSeccompArg` for the `Args` field, whereas the local type used pointers;
`[]*LinuxSeccompArg`.

In addition, the runtime-spec Syscall type brings a new `ErrnoRet` field, allowing
the profile to specify the errno code returned for the syscall, which allows
changing the default EPERM for specific syscalls.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-06-17 21:25:06 +02:00
clubby789 d39b075302 Enable `process_vm_readv` and `process_vm_writev` for kernel > 4.8
These syscalls were disabled in #18971
due to them requiring CAP_PTRACE. CAP_PTRACE was blocked by default due
to a ptrace related exploit. This has been patched in the Linux kernel
(version 4.8) and thus `ptrace` has been re-enabled. However, these
associated syscalls seem to have been left behind. This commit brings
them in line with `ptrace`, and re-enables it for kernel > 4.8.

Signed-off-by: clubby789 <jamie@hill-daniel.co.uk>
2021-03-04 17:12:01 +00:00
Aleksa Sarai 54eff4354b
profiles: seccomp: update to Linux 5.11 syscall list
These syscalls (some of which have been in Linux for a while but were
missing from the profile) fall into a few buckets:

 * close_range(2), epoll_pwait2(2) are just extensions of existing "safe
   for everyone" syscalls.

 * The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
   all equivalent to aspects of mount(2) and thus go into the
   CAP_SYS_ADMIN category.

 * process_madvise(2) is similar to the other process_*(2) syscalls and
   thus goes in the CAP_SYS_PTRACE category.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2021-01-27 13:25:49 +11:00
Mark Vainomaa f7bcb02f67
seccomp: Add pidfd_getfd syscall
Signed-off-by: Mark Vainomaa <mikroskeem@mikroskeem.eu>
2020-11-12 15:31:07 +02:00
Mark Vainomaa 5e3ffe6464
seccomp: Add pidfd_open and pidfd_send_signal
Signed-off-by: Mark Vainomaa <mikroskeem@mikroskeem.eu>
2020-11-11 15:20:34 +02:00
Sebastiaan van Stijn 4539e7f0eb
seccomp: implement marshal/unmarshall for MinVersion
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-10-07 17:48:25 +02:00
Sebastiaan van Stijn 0d75b63987
seccomp: replace types with runtime-spec types
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-18 19:33:58 +02:00
Sebastiaan van Stijn 0efee50b95
seccomp: move seccomp types from api into seccomp profile
These types were not used in the API, so could not come up with
a reason why they were in that package.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-18 18:14:16 +02:00
Brian Goff ccbb00c815 Remove dependency in dockerd on libseccomp
This was just using libseccomp to get the right arch, but we can use
GOARCH to get this.
The nativeToSeccomp map needed to be adjusted a bit for mipsle vs mipsel
since that's go how refers to it. Also added some other arches to it.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2020-09-11 22:48:42 +00:00
Renamed from profiles/seccomp/seccomp_default.go (Browse further)