Signed-off-by: Sven Dowideit <SvenDowideit@home.org.au>
20 KiB
Runtime metrics
Docker stats
You can use the docker stats
command to live stream a container's
runtime metrics. The command supports CPU, memory usage, memory limit,
and network IO metrics.
The following is a sample output from the docker stats
command
$ docker stats redis1 redis2
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O
redis1 0.07% 796 KB / 64 MB 1.21% 788 B / 648 B 3.568 MB / 512 KB
redis2 0.07% 2.746 MB / 64 MB 4.29% 1.266 KB / 648 B 12.4 MB / 0 B
The docker stats reference page has
more details about the docker stats
command.
Control groups
Linux Containers rely on control groups which not only track groups of processes, but also expose metrics about CPU, memory, and block I/O usage. You can access those metrics and obtain network usage metrics as well. This is relevant for "pure" LXC containers, as well as for Docker containers.
Control groups are exposed through a pseudo-filesystem. In recent
distros, you should find this filesystem under /sys/fs/cgroup
. Under
that directory, you will see multiple sub-directories, called devices,
freezer, blkio, etc.; each sub-directory actually corresponds to a different
cgroup hierarchy.
On older systems, the control groups might be mounted on /cgroup
, without
distinct hierarchies. In that case, instead of seeing the sub-directories,
you will see a bunch of files in that directory, and possibly some directories
corresponding to existing containers.
To figure out where your control groups are mounted, you can run:
$ grep cgroup /proc/mounts
Enumerating cgroups
You can look into /proc/cgroups
to see the different control group subsystems
known to the system, the hierarchy they belong to, and how many groups they contain.
You can also look at /proc/<pid>/cgroup
to see which control groups a process
belongs to. The control group will be shown as a path relative to the root of
the hierarchy mountpoint; e.g., /
means “this process has not been assigned into
a particular group”, while /lxc/pumpkin
means that the process is likely to be
a member of a container named pumpkin
.
Finding the cgroup for a given container
For each container, one cgroup will be created in each hierarchy. On
older systems with older versions of the LXC userland tools, the name of
the cgroup will be the name of the container. With more recent versions
of the LXC tools, the cgroup will be lxc/<container_name>.
For Docker containers using cgroups, the container name will be the full
ID or long ID of the container. If a container shows up as ae836c95b4c3
in docker ps
, its long ID might be something like
ae836c95b4c3c9e9179e0e91015512da89fdec91612f63cebae57df9a5444c79
. You can
look it up with docker inspect
or docker ps --no-trunc
.
Putting everything together to look at the memory metrics for a Docker
container, take a look at /sys/fs/cgroup/memory/docker/<longid>/
.
Metrics from cgroups: memory, CPU, block I/O
For each subsystem (memory, CPU, and block I/O), you will find one or more pseudo-files containing statistics.
Memory metrics: memory.stat
Memory metrics are found in the "memory" cgroup. Note that the memory
control group adds a little overhead, because it does very fine-grained
accounting of the memory usage on your host. Therefore, many distros
chose to not enable it by default. Generally, to enable it, all you have
to do is to add some kernel command-line parameters:
cgroup_enable=memory swapaccount=1
.
The metrics are in the pseudo-file memory.stat
.
Here is what it will look like:
cache 11492564992
rss 1930993664
mapped_file 306728960
pgpgin 406632648
pgpgout 403355412
swap 0
pgfault 728281223
pgmajfault 1724
inactive_anon 46608384
active_anon 1884520448
inactive_file 7003344896
active_file 4489052160
unevictable 32768
hierarchical_memory_limit 9223372036854775807
hierarchical_memsw_limit 9223372036854775807
total_cache 11492564992
total_rss 1930993664
total_mapped_file 306728960
total_pgpgin 406632648
total_pgpgout 403355412
total_swap 0
total_pgfault 728281223
total_pgmajfault 1724
total_inactive_anon 46608384
total_active_anon 1884520448
total_inactive_file 7003344896
total_active_file 4489052160
total_unevictable 32768
The first half (without the total_
prefix) contains statistics relevant
to the processes within the cgroup, excluding sub-cgroups. The second half
(with the total_
prefix) includes sub-cgroups as well.
Some metrics are "gauges", i.e., values that can increase or decrease (e.g., swap, the amount of swap space used by the members of the cgroup). Some others are "counters", i.e., values that can only go up, because they represent occurrences of a specific event (e.g., pgfault, which indicates the number of page faults which happened since the creation of the cgroup; this number can never decrease).
-
cache:
the amount of memory used by the processes of this control group that can be associated precisely with a block on a block device. When you read from and write to files on disk, this amount will increase. This will be the case if you use "conventional" I/O (open
,read
,write
syscalls) as well as mapped files (withmmap
). It also accounts for the memory used bytmpfs
mounts, though the reasons are unclear. -
rss:
the amount of memory that doesn't correspond to anything on disk: stacks, heaps, and anonymous memory maps. -
mapped_file:
indicates the amount of memory mapped by the processes in the control group. It doesn't give you information about how much memory is used; it rather tells you how it is used. -
pgfault and pgmajfault:
indicate the number of times that a process of the cgroup triggered a "page fault" and a "major fault", respectively. A page fault happens when a process accesses a part of its virtual memory space which is nonexistent or protected. The former can happen if the process is buggy and tries to access an invalid address (it will then be sent aSIGSEGV
signal, typically killing it with the famousSegmentation fault
message). The latter can happen when the process reads from a memory zone which has been swapped out, or which corresponds to a mapped file: in that case, the kernel will load the page from disk, and let the CPU complete the memory access. It can also happen when the process writes to a copy-on-write memory zone: likewise, the kernel will preempt the process, duplicate the memory page, and resume the write operation on the process` own copy of the page. "Major" faults happen when the kernel actually has to read the data from disk. When it just has to duplicate an existing page, or allocate an empty page, it's a regular (or "minor") fault. -
swap:
the amount of swap currently used by the processes in this cgroup. -
active_anon and inactive_anon:
the amount of anonymous memory that has been identified has respectively active and inactive by the kernel. "Anonymous" memory is the memory that is not linked to disk pages. In other words, that's the equivalent of the rss counter described above. In fact, the very definition of the rss counter is active_anon + inactive_anon - tmpfs (where tmpfs is the amount of memory used up bytmpfs
filesystems mounted by this control group). Now, what's the difference between "active" and "inactive"? Pages are initially "active"; and at regular intervals, the kernel sweeps over the memory, and tags some pages as "inactive". Whenever they are accessed again, they are immediately retagged "active". When the kernel is almost out of memory, and time comes to swap out to disk, the kernel will swap "inactive" pages. -
active_file and inactive_file:
cache memory, with active and inactive similar to the anon memory above. The exact formula is cache = active_file + inactive_file + tmpfs. The exact rules used by the kernel to move memory pages between active and inactive sets are different from the ones used for anonymous memory, but the general principle is the same. Note that when the kernel needs to reclaim memory, it is cheaper to reclaim a clean (=non modified) page from this pool, since it can be reclaimed immediately (while anonymous pages and dirty/modified pages have to be written to disk first). -
unevictable:
the amount of memory that cannot be reclaimed; generally, it will account for memory that has been "locked" withmlock
. It is often used by crypto frameworks to make sure that secret keys and other sensitive material never gets swapped out to disk. -
memory and memsw limits:
These are not really metrics, but a reminder of the limits applied to this cgroup. The first one indicates the maximum amount of physical memory that can be used by the processes of this control group; the second one indicates the maximum amount of RAM+swap.
Accounting for memory in the page cache is very complex. If two processes in different control groups both read the same file (ultimately relying on the same blocks on disk), the corresponding memory charge will be split between the control groups. It's nice, but it also means that when a cgroup is terminated, it could increase the memory usage of another cgroup, because they are not splitting the cost anymore for those memory pages.
CPU metrics: cpuacct.stat
Now that we've covered memory metrics, everything else will look very
simple in comparison. CPU metrics will be found in the
cpuacct
controller.
For each container, you will find a pseudo-file cpuacct.stat
,
containing the CPU usage accumulated by the processes of the container,
broken down between user
and system
time. If you're not familiar
with the distinction, user
is the time during which the processes were
in direct control of the CPU (i.e., executing process code), and system
is the time during which the CPU was executing system calls on behalf of
those processes.
Those times are expressed in ticks of 1/100th of a second. Actually,
they are expressed in "user jiffies". There are USER_HZ
"jiffies" per second, and on x86 systems,
USER_HZ
is 100. This used to map exactly to the
number of scheduler "ticks" per second; but with the advent of higher
frequency scheduling, as well as tickless kernels, the number of kernel ticks
wasn't relevant anymore. It stuck around anyway, mainly for legacy and
compatibility reasons.
Block I/O metrics
Block I/O is accounted in the blkio
controller.
Different metrics are scattered across different files. While you can
find in-depth details in the blkio-controller
file in the kernel documentation, here is a short list of the most
relevant ones:
-
blkio.sectors:
contain the number of 512-bytes sectors read and written by the processes member of the cgroup, device by device. Reads and writes are merged in a single counter. -
blkio.io_service_bytes:
indicates the number of bytes read and written by the cgroup. It has 4 counters per device, because for each device, it differentiates between synchronous vs. asynchronous I/O, and reads vs. writes. -
blkio.io_serviced:
the number of I/O operations performed, regardless of their size. It also has 4 counters per device. -
blkio.io_queued:
indicates the number of I/O operations currently queued for this cgroup. In other words, if the cgroup isn't doing any I/O, this will be zero. Note that the opposite is not true. In other words, if there is no I/O queued, it does not mean that the cgroup is idle (I/O-wise). It could be doing purely synchronous reads on an otherwise quiescent device, which is therefore able to handle them immediately, without queuing. Also, while it is helpful to figure out which cgroup is putting stress on the I/O subsystem, keep in mind that it is a relative quantity. Even if a process group does not perform more I/O, its queue size can increase just because the device load increases because of other devices.
Network metrics
Network metrics are not exposed directly by control groups. There is a
good explanation for that: network interfaces exist within the context
of network namespaces. The kernel could probably accumulate metrics
about packets and bytes sent and received by a group of processes, but
those metrics wouldn't be very useful. You want per-interface metrics
(because traffic happening on the local lo
interface doesn't really count). But since processes in a single cgroup
can belong to multiple network namespaces, those metrics would be harder
to interpret: multiple network namespaces means multiple lo
interfaces, potentially multiple eth0
interfaces, etc.; so this is why there is no easy way to gather network
metrics with control groups.
Instead we can gather network metrics from other sources:
IPtables
IPtables (or rather, the netfilter framework for which iptables is just an interface) can do some serious accounting.
For instance, you can setup a rule to account for the outbound HTTP traffic on a web server:
$ iptables -I OUTPUT -p tcp --sport 80
There is no -j
or -g
flag,
so the rule will just count matched packets and go to the following
rule.
Later, you can check the values of the counters, with:
$ iptables -nxvL OUTPUT
Technically, -n
is not required, but it will
prevent iptables from doing DNS reverse lookups, which are probably
useless in this scenario.
Counters include packets and bytes. If you want to setup metrics for
container traffic like this, you could execute a for
loop to add two iptables
rules per
container IP address (one in each direction), in the FORWARD
chain. This will only meter traffic going through the NAT
layer; you will also have to add traffic going through the userland
proxy.
Then, you will need to check those counters on a regular basis. If you
happen to use collectd
, there is a nice plugin
to automate iptables counters collection.
Interface-level counters
Since each container has a virtual Ethernet interface, you might want to
check directly the TX and RX counters of this interface. You will notice
that each container is associated to a virtual Ethernet interface in
your host, with a name like vethKk8Zqi
. Figuring
out which interface corresponds to which container is, unfortunately,
difficult.
But for now, the best way is to check the metrics from within the containers. To accomplish this, you can run an executable from the host environment within the network namespace of a container using ip-netns magic.
The ip-netns exec
command will let you execute any
program (present in the host system) within any network namespace
visible to the current process. This means that your host will be able
to enter the network namespace of your containers, but your containers
won't be able to access the host, nor their sibling containers.
Containers will be able to “see” and affect their sub-containers,
though.
The exact format of the command is:
$ ip netns exec <nsname> <command...>
For example:
$ ip netns exec mycontainer netstat -i
ip netns
finds the "mycontainer" container by
using namespaces pseudo-files. Each process belongs to one network
namespace, one PID namespace, one mnt
namespace,
etc., and those namespaces are materialized under
/proc/<pid>/ns/
. For example, the network
namespace of PID 42 is materialized by the pseudo-file
/proc/42/ns/net
.
When you run ip netns exec mycontainer ...
, it
expects /var/run/netns/mycontainer
to be one of
those pseudo-files. (Symlinks are accepted.)
In other words, to execute a command within the network namespace of a container, we need to:
- Find out the PID of any process within the container that we want to investigate;
- Create a symlink from
/var/run/netns/<somename>
to/proc/<thepid>/ns/net
- Execute
ip netns exec <somename> ....
Please review Enumerating Cgroups to learn how to find
the cgroup of a process running in the container of which you want to
measure network usage. From there, you can examine the pseudo-file named
tasks
, which contains the PIDs that are in the
control group (i.e., in the container). Pick any one of them.
Putting everything together, if the "short ID" of a container is held in
the environment variable $CID
, then you can do this:
$ TASKS=/sys/fs/cgroup/devices/docker/$CID*/tasks
$ PID=$(head -n 1 $TASKS)
$ mkdir -p /var/run/netns
$ ln -sf /proc/$PID/ns/net /var/run/netns/$CID
$ ip netns exec $CID netstat -i
Tips for high-performance metric collection
Note that running a new process each time you want to update metrics is (relatively) expensive. If you want to collect metrics at high resolutions, and/or over a large number of containers (think 1000 containers on a single host), you do not want to fork a new process each time.
Here is how to collect metrics from a single process. You will have to
write your metric collector in C (or any language that lets you do
low-level system calls). You need to use a special system call,
setns()
, which lets the current process enter any
arbitrary namespace. It requires, however, an open file descriptor to
the namespace pseudo-file (remember: that's the pseudo-file in
/proc/<pid>/ns/net
).
However, there is a catch: you must not keep this file descriptor open. If you do, when the last process of the control group exits, the namespace will not be destroyed, and its network resources (like the virtual interface of the container) will stay around for ever (or until you close that file descriptor).
The right approach would be to keep track of the first PID of each container, and re-open the namespace pseudo-file each time.
Collecting metrics when a container exits
Sometimes, you do not care about real time metric collection, but when a container exits, you want to know how much CPU, memory, etc. it has used.
Docker makes this difficult because it relies on lxc-start
, which
carefully cleans up after itself, but it is still possible. It is
usually easier to collect metrics at regular intervals (e.g., every
minute, with the collectd LXC plugin) and rely on that instead.
But, if you'd still like to gather the stats when a container stops, here is how:
For each container, start a collection process, and move it to the control groups that you want to monitor by writing its PID to the tasks file of the cgroup. The collection process should periodically re-read the tasks file to check if it's the last process of the control group. (If you also want to collect network statistics as explained in the previous section, you should also move the process to the appropriate network namespace.)
When the container exits, lxc-start
will try to
delete the control groups. It will fail, since the control group is
still in use; but that's fine. You process should now detect that it is
the only one remaining in the group. Now is the right time to collect
all the metrics you need!
Finally, your process should move itself back to the root control group,
and remove the container control group. To remove a control group, just
rmdir
its directory. It's counter-intuitive to
rmdir
a directory as it still contains files; but
remember that this is a pseudo-filesystem, so usual rules don't apply.
After the cleanup is done, the collection process can exit safely.