Signed-off-by: Antonio Murdaca <runcom@redhat.com>
<!-- [metadata]>
+++
title = "Seccomp security profiles for Docker"
description = "Enabling seccomp in Docker"
keywords = ["seccomp, security, docker, documentation"]
[menu.main]
parent= "smn_secure_docker"
weight=90
+++
<![end-metadata]-->

# Seccomp security profiles for Docker

Secure computing mode (seccomp) is a Linux kernel feature that you can use to
restrict the actions available within a container. The `seccomp()` system
call operates on the seccomp state of the calling process. You can use this
feature to restrict the system calls your application is able to make.

This feature is available only if Docker has been built with seccomp and the
kernel is configured with `CONFIG_SECCOMP` enabled. To check if your kernel
supports seccomp:

```bash
$ grep CONFIG_SECCOMP= /boot/config-$(uname -r)
CONFIG_SECCOMP=y
```

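The same check can be scripted. The following is a minimal sketch in Python, assuming the kernel configuration is available as plain text (as it is under `/boot` on many distributions). The helper name `seccomp_enabled` is hypothetical and not part of Docker.

```python
def seccomp_enabled(config_text):
    """Return True if the kernel config text contains the line CONFIG_SECCOMP=y."""
    for line in config_text.splitlines():
        # Match the whole line so CONFIG_SECCOMP_FILTER=y is not mistaken for it.
        if line.strip() == "CONFIG_SECCOMP=y":
            return True
    return False


# Illustrative config fragments, not taken from a real system.
print(seccomp_enabled("CONFIG_SECCOMP=y\nCONFIG_SECCOMP_FILTER=y\n"))
print(seccomp_enabled("# CONFIG_SECCOMP is not set\n"))
```

To use it on a real host, read the file that the `grep` command above inspects and pass its contents to the function.
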
> **Note**: seccomp profiles require seccomp 2.2.1 and are only
> available starting with Debian 9 "Stretch", Ubuntu 15.10 "Wily",
> Fedora 22, CentOS 7 and Oracle Linux 7. To use this feature on
> Ubuntu 14.04, Debian Wheezy, or Debian Jessie, you must download the
> [latest static Docker Linux binary](../installation/binaries.md).
> This feature is currently *not* available on other distributions.

## Passing a profile for a container

The default seccomp profile provides a sane default for running containers with
seccomp and disables around 44 system calls out of 300+. It is moderately
protective while providing wide application compatibility. The
[default Docker profile](https://github.com/docker/docker/blob/master/profiles/seccomp/default.json)
has a JSON layout of the following form:

```json
{
    "defaultAction": "SCMP_ACT_ERRNO",
    "archMap": [
        {
            "architecture": "SCMP_ARCH_X86_64",
            "subArchitectures": [
                "SCMP_ARCH_X86",
                "SCMP_ARCH_X32"
            ]
        },
        ...
    ],
    "syscalls": [
        {
            "names": [
                "accept",
                "accept4",
                "access",
                "alarm",
                "bind",
                "brk",
                ...
                "waitid",
                "waitpid",
                "write",
                "writev"
            ],
            "action": "SCMP_ACT_ALLOW",
            "args": [],
            "comment": "",
            "includes": {},
            "excludes": {}
        },
        {
            "names": [
                "clone"
            ],
            "action": "SCMP_ACT_ALLOW",
            "args": [
                {
                    "index": 1,
                    "value": 2080505856,
                    "valueTwo": 0,
                    "op": "SCMP_CMP_MASKED_EQ"
                }
            ],
            "comment": "s390 parameter ordering for clone is different",
            "includes": {
                "arches": [
                    "s390",
                    "s390x"
                ]
            },
            "excludes": {
                "caps": [
                    "CAP_SYS_ADMIN"
                ]
            }
        },
        ...
    ]
}
```

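The `value` in the `clone` rule above is easier to read as a bit mask. The sketch below decodes it against the namespace-related `clone(2)` flags from `<linux/sched.h>` and illustrates the `SCMP_CMP_MASKED_EQ` comparison, which matches when `(argument & value) == valueTwo`. The helper names `flags_in_mask` and `masked_eq` are made up for illustration; this is not Docker's or libseccomp's implementation.

```python
# Namespace-related clone(2) flags, values from <linux/sched.h>.
CLONE_FLAGS = {
    "CLONE_NEWNS": 0x00020000,
    "CLONE_NEWUTS": 0x04000000,
    "CLONE_NEWIPC": 0x08000000,
    "CLONE_NEWUSER": 0x10000000,
    "CLONE_NEWPID": 0x20000000,
    "CLONE_NEWNET": 0x40000000,
}

MASK = 2080505856  # "value" from the clone rule in the profile


def flags_in_mask(mask):
    """Return the names of the known flags set in mask, ordered by bit value."""
    ordered = sorted(CLONE_FLAGS.items(), key=lambda item: item[1])
    return [name for name, bit in ordered if mask & bit]


def masked_eq(arg, value, value_two):
    """Model SCMP_CMP_MASKED_EQ: the rule matches when (arg & value) == value_two."""
    return (arg & value) == value_two


print(hex(MASK))
print(flags_in_mask(MASK))

SIGCHLD = 17  # a plain fork-style clone passes only the exit signal
print(masked_eq(SIGCHLD, MASK, 0))
print(masked_eq(SIGCHLD | CLONE_FLAGS["CLONE_NEWUSER"], MASK, 0))
```

With `valueTwo` set to 0, the rule only allows `clone` calls in which none of the masked namespace flags are set. A call that requests a new namespace does not match the rule and falls through to `defaultAction`.
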
When you run a container, it uses the default profile unless you override
it with the `--security-opt` option. For example, the following explicitly
specifies a policy:

```
$ docker run --rm -it --security-opt seccomp=/path/to/seccomp/profile.json hello-world
```

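A custom profile is often derived from an existing one. As a rough sketch, the Python below removes selected names from the allow rules of a profile; the `deny_syscalls` helper and the tiny `base` profile are hypothetical stand-ins, and a real profile lists far more calls.

```python
import copy
import json


def deny_syscalls(profile, denied):
    """Return a copy of profile with the denied names removed from every rule.

    A name that no longer appears in any rule falls through to defaultAction.
    """
    result = copy.deepcopy(profile)
    for rule in result.get("syscalls", []):
        rule["names"] = [n for n in rule.get("names", []) if n not in denied]
    # Drop rules that no longer name any syscall.
    result["syscalls"] = [r for r in result.get("syscalls", []) if r["names"]]
    return result


# Illustrative profile only; not the Docker default.
base = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["chmod", "read", "write"], "action": "SCMP_ACT_ALLOW", "args": []},
    ],
}

custom = deny_syscalls(base, {"chmod"})
print(json.dumps(custom, indent=2))
```

Saving the printed JSON to a file and passing that path to `--security-opt seccomp=` applies it in the same way as the command above.
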
### Significant syscalls blocked by the default profile

Docker's default seccomp profile is a whitelist which specifies the calls that
are allowed. The table below lists the significant (but not all) syscalls that
are effectively blocked because they are not on the whitelist. The table includes
the reason each syscall is blocked rather than whitelisted.

| Syscall | Description |
|---------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| `acct` | Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by `CAP_SYS_PACCT`. |
| `add_key` | Prevent containers from using the kernel keyring, which is not namespaced. |
| `adjtimex` | Similar to `clock_settime` and `settimeofday`, time/date is not namespaced. |
| `bpf` | Deny loading potentially persistent bpf programs into kernel, already gated by `CAP_SYS_ADMIN`. |
| `clock_adjtime` | Time/date is not namespaced. |
| `clock_settime` | Time/date is not namespaced. |
| `clone` | Deny cloning new namespaces. Also gated by `CAP_SYS_ADMIN` for CLONE_* flags, except `CLONE_NEWUSER`. |
| `create_module` | Deny manipulation and functions on kernel modules. |
| `delete_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
| `finit_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
| `get_kernel_syms` | Deny retrieval of exported kernel and module symbols. |
| `get_mempolicy` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
| `init_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
| `ioperm` | Prevent containers from modifying kernel I/O privilege levels. Already gated by `CAP_SYS_RAWIO`. |
| `iopl` | Prevent containers from modifying kernel I/O privilege levels. Already gated by `CAP_SYS_RAWIO`. |
| `kcmp` | Restrict process inspection capabilities, already blocked by dropping `CAP_SYS_PTRACE`. |
| `kexec_file_load` | Sister syscall of `kexec_load` that does the same thing, slightly different arguments. |
| `kexec_load` | Deny loading a new kernel for later execution. |
| `keyctl` | Prevent containers from using the kernel keyring, which is not namespaced. |
| `lookup_dcookie` | Tracing/profiling syscall, which could leak a lot of information on the host. |
| `mbind` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
| `mount` | Deny mounting, already gated by `CAP_SYS_ADMIN`. |
| `move_pages` | Syscall that modifies kernel memory and NUMA settings. |
| `name_to_handle_at` | Sister syscall to `open_by_handle_at`. Already gated by `CAP_SYS_NICE`. |
| `nfsservctl` | Deny interaction with the kernel NFS daemon. |
| `open_by_handle_at` | Cause of an old container breakout. Also gated by `CAP_DAC_READ_SEARCH`. |
| `perf_event_open` | Tracing/profiling syscall, which could leak a lot of information on the host. |
| `personality` | Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulnerabilities. |
| `pivot_root` | Deny `pivot_root`, should be privileged operation. |
| `process_vm_readv` | Restrict process inspection capabilities, already blocked by dropping `CAP_SYS_PTRACE`. |
| `process_vm_writev` | Restrict process inspection capabilities, already blocked by dropping `CAP_SYS_PTRACE`. |
| `ptrace` | Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping `CAP_SYS_PTRACE`. |
| `query_module` | Deny manipulation and functions on kernel modules. |
| `quotactl` | Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by `CAP_SYS_ADMIN`. |
| `reboot` | Don't let containers reboot the host. Also gated by `CAP_SYS_BOOT`. |
| `request_key` | Prevent containers from using the kernel keyring, which is not namespaced. |
| `set_mempolicy` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
| `setns` | Deny associating a thread with a namespace. Also gated by `CAP_SYS_ADMIN`. |
| `settimeofday` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
| `stime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
| `swapon` | Deny start/stop swapping to file/device. Also gated by `CAP_SYS_ADMIN`. |
| `swapoff` | Deny start/stop swapping to file/device. Also gated by `CAP_SYS_ADMIN`. |
| `sysfs` | Obsolete syscall. |
| `_sysctl` | Obsolete, replaced by /proc/sys. |
| `umount` | Should be a privileged operation. Also gated by `CAP_SYS_ADMIN`. |
| `umount2` | Should be a privileged operation. |
| `unshare` | Deny cloning new namespaces for processes. Also gated by `CAP_SYS_ADMIN`, with the exception of `unshare --user`. |
| `uselib` | Older syscall related to shared libraries, unused for a long time. |
| `userfaultfd` | Userspace page fault handling, largely needed for process migration. |
| `ustat` | Obsolete syscall. |
| `vm86` | In-kernel x86 real mode virtual machine. Also gated by `CAP_SYS_ADMIN`. |
| `vm86old` | In-kernel x86 real mode virtual machine. Also gated by `CAP_SYS_ADMIN`. |

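The whitelist behaviour described in this section can be modelled in a few lines. The sketch below is a deliberate simplification, assuming a profile dictionary with the same layout as the JSON shown earlier: it ignores `args`, `includes`, and `excludes`, and the `action_for` helper and sample profile are hypothetical.

```python
def action_for(profile, syscall):
    """Return the action of the first rule naming syscall, else defaultAction.

    Simplification: argument filters and includes/excludes are not evaluated.
    """
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]


# Illustrative profile only; not the Docker default.
sample = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"},
    ],
}

for name in ("read", "reboot"):
    print(name, "->", action_for(sample, name))
```

A call such as `reboot` is never named in a deny rule; it is blocked only because no rule allows it.
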
## Run without the default seccomp profile

You can pass `unconfined` to run a container without the default seccomp
profile.

```
$ docker run --rm -it --security-opt seccomp=unconfined debian:jessie \
    unshare --map-root-user --user sh -c whoami
```
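
One way to see whether a process is running under a seccomp filter is the `Seccomp` field of `/proc/<pid>/status`, which the kernel reports as 0 (disabled), 1 (strict), or 2 (filter). The following sketch parses that field; it assumes a kernel recent enough to expose it, and the `seccomp_mode` helper name is hypothetical.

```python
MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Return the seccomp mode named in /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        # "Seccomp:" does not match the separate "Seccomp_filters:" field.
        if line.startswith("Seccomp:"):
            return MODES.get(int(line.split(":", 1)[1].strip()))
    return None


# Illustrative status excerpts, not captured from a real container.
print(seccomp_mode("Name:\tsh\nSeccomp:\t2\n"))
print(seccomp_mode("Name:\tsh\nSeccomp:\t0\n"))
```

Inside a container, passing the contents of `/proc/self/status` to this function would typically report `filter` under a seccomp profile and `disabled` when the container was started with `seccomp=unconfined`.
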