Add special memory management file

updating after assignment for Nigel
Adding in some notes from Nigel work
Updating with the storage driver content Nigel added
Updating with Nigel's polishing tech
Adding in Nigel graphics
First pass of aufs material
Capturing Nigel's latest
Comments back to Nigel on devicemapper
Incorporating Nigel's comments v3
Converting images for dm
Entering comments into aufs page
Adding the btfs storage driver
Moving to userguide
Adding in two new driver articles from Nigel
Optimized images
Updating with comments

Signed-off-by: Mary Anthony <mary@docker.com>
This commit is contained in:
Mary Anthony 2015-09-16 10:09:10 -07:00
parent 43077f9b64
commit 950fbf99b1
32 changed files with 1604 additions and 3 deletions

@ -25,9 +25,7 @@ Docker.
## Data volumes
A *data volume* is a specially-designated directory within one or more
containers that bypasses the [*Union File System*](../reference/glossary.md#union-file-system). Data volumes provide several useful features for persistent or shared data:
- Volumes are initialized when a container is created. If the container's
base image contains data at the specified mount point, that existing data is
@ -267,6 +265,12 @@ Then un-tar the backup file in the new container's data volume.
You can use the techniques above to automate backup, migration and
restore testing using your preferred tools.
## Important tips on using shared volumes
Multiple containers can also share one or more data volumes. However, multiple containers writing to a single shared volume can cause data corruption. Make sure your applications are designed to write to shared data stores.
Data volumes are directly accessible from the Docker host. This means you can read and write to them with normal Linux tools. In most cases you should not do this as it can cause data corruption if your containers and applications are unaware of your direct access.
# Next steps
Now we've learned a bit more about how to use Docker we're going to see how to

@ -0,0 +1,197 @@
<!--[metadata]>
+++
title = "AUFS storage driver in practice"
description = "Learn how to optimize your use of AUFS driver."
keywords = ["container, storage, driver, AUFS "]
[menu.main]
parent = "mn_storage_docker"
+++
<![end-metadata]-->
# Docker and AUFS in practice
AUFS was the first storage driver in use with Docker. As a result, it has a long and close history with Docker, is very stable, has a lot of real-world deployments, and has strong community support. AUFS has several features that make it a good choice for Docker. These features enable:
- Fast container startup times.
- Efficient use of storage.
- Efficient use of memory.
Despite its capabilities and long history with Docker, some Linux distributions do not support AUFS. This is usually because AUFS is not included in the mainline (upstream) Linux kernel.
The following sections examine some AUFS features and how they relate to Docker.
## Image layering and sharing with AUFS
AUFS is a *unification filesystem*. This means that it takes multiple directories on a single Linux host, stacks them on top of each other, and provides a single unified view. To achieve this, AUFS uses a *union mount*.
AUFS stacks multiple directories and exposes them as a unified view through a single mount point. All of the directories in the stack, as well as the union mount point, must exist on the same Linux host. AUFS refers to each directory that it stacks as a *branch*.
Within Docker, AUFS union mounts enable image layering. The AUFS storage driver implements Docker image layers using this union mount system. AUFS branches correspond to Docker image layers. The diagram below shows a Docker container based on the `ubuntu:latest` image.
![](images/aufs_layers.jpg)
This diagram shows the relationship between the Docker image layers and the AUFS branches (directories) in `/var/lib/docker/aufs`. Each image layer and the container layer correspond to an AUFS branch (directory) in the Docker host's local storage area. The union mount point gives the unified view of all layers.
AUFS also supports copy-on-write (CoW) technology. Not all storage drivers do.
## Container reads and writes with AUFS
Docker leverages AUFS CoW technology to enable image sharing and minimize the use of disk space. AUFS works at the file level. This means that all AUFS CoW operations copy entire files - even if only a small part of the file is being modified. This behavior can have a noticeable impact on container performance, especially if the files being copied are large, below a lot of image layers, or the CoW operation must search a deep directory tree.
Consider, for example, an application running in a container that needs to add a single new value to a large key-value store (file). If this is the first time the file is modified, it does not yet exist in the container's top writable layer. So, the CoW operation must *copy up* the file from the underlying image. The AUFS storage driver searches each image layer for the file. The search order is from top to bottom. When it is found, the entire file is *copied up* to the container's top writable layer. From there, it can be opened and modified.
Larger files obviously take longer to *copy up* than smaller files, and files that exist in lower image layers take longer to locate than those in higher layers. However, a *copy up* operation only occurs once per file on any given container. Subsequent reads and writes happen against the file's copy already *copied-up* to the container's top layer.
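The copy-up search described above can be modeled with ordinary directories. The sketch below is only an illustration, assuming plain directories stand in for AUFS branches; the `copy_up` function and the directory names are hypothetical and are not part of Docker or AUFS.

```bash
#!/usr/bin/env bash
# Illustrative model only: plain directories stand in for AUFS branches.
# copy_up is a hypothetical helper, not a Docker or AUFS command.

# copy_up FILE TOP_LAYER LOWER_LAYER...
# Lower layers are listed from top to bottom, matching the search order.
copy_up() {
    local file=$1 top=$2
    shift 2
    # Already copied up to the writable layer: nothing to do.
    if [ -e "$top/$file" ]; then
        return 0
    fi
    local layer
    for layer in "$@"; do
        if [ -e "$layer/$file" ]; then
            mkdir -p "$(dirname "$top/$file")"
            cp -p "$layer/$file" "$top/$file"   # the entire file is copied
            return 0
        fi
    done
    return 1   # the file is not present in any layer
}

# Demo: the file lives in the lowest layer and is modified in the container.
work=$(mktemp -d)
mkdir -p "$work/container" "$work/layer2" "$work/layer1"
echo "key=value" > "$work/layer1/store.db"
copy_up store.db "$work/container" "$work/layer2" "$work/layer1"
echo "key2=value2" >> "$work/container/store.db"
cat "$work/container/store.db"
```

After the first call, the modified copy exists only in the `container` directory, and the file in `layer1` is unchanged. A second call to `copy_up` for the same file returns immediately, which mirrors the once-per-file cost described above.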
## Deleting files with the AUFS storage driver
The AUFS storage driver deletes a file from a container by placing a *whiteout
file* in the container's top layer. The whiteout file effectively obscures the
existence of the file in the image's lower, read-only layers. The simplified
diagram below shows a container based on an image with three image layers.
![](images/aufs_delete.jpg)
`file3` was deleted from the container, so the AUFS storage driver placed
a whiteout file in the container's top layer. This whiteout file effectively
"deletes" `file3` from the container by obscuring any of the original file's
existence in the image's read-only base layer. Of course, the file could have
been in any of the other layers instead, or in addition, depending on how the
layers are built.
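The effect of a whiteout file on lookups can be sketched with plain directories. This is a simplified model, not AUFS itself: the `visible` function is hypothetical, and the only AUFS detail it relies on is that a whiteout for `file3` is stored as an entry named `.wh.file3`.

```bash
#!/usr/bin/env bash
# Illustrative model only: directories stand in for AUFS branches.
# visible is a hypothetical helper, not a Docker or AUFS command.

# visible FILE LAYER...   (layers listed from top to bottom)
# Succeeds if FILE would appear in the unified view.
visible() {
    local file=$1
    shift
    local dir base layer
    dir=$(dirname "$file")
    base=$(basename "$file")
    for layer in "$@"; do
        # A whiteout hides the file in every layer below this one.
        if [ -e "$layer/$dir/.wh.$base" ]; then
            return 1
        fi
        if [ -e "$layer/$file" ]; then
            return 0
        fi
    done
    return 1
}

# Demo: file3 exists in the base layer until the container "deletes" it.
work=$(mktemp -d)
mkdir -p "$work/container" "$work/layer3" "$work/layer2" "$work/layer1"
touch "$work/layer1/file3"
touch "$work/container/.wh.file3"
if visible file3 "$work/container" "$work/layer3" "$work/layer2" "$work/layer1"; then
    echo "file3 is visible"
else
    echo "file3 is hidden by a whiteout"
fi
```

Note that the file in the base layer is never touched; only the lookup result changes.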
## Configure Docker with AUFS
You can only use the AUFS storage driver on Linux systems with AUFS installed. Use the following command to determine if your system supports AUFS.
```bash
$ grep aufs /proc/filesystems
nodev aufs
```
This output indicates the system supports AUFS. Once you've verified your
system supports AUFS, you must instruct the Docker daemon to use it. You do
this from the command line with the `docker daemon` command:
```bash
$ sudo docker daemon --storage-driver=aufs &
```
Alternatively, you can edit the Docker config file and add the
`--storage-driver=aufs` option to the `DOCKER_OPTS` line.
```bash
# Use DOCKER_OPTS to modify the daemon startup options.
DOCKER_OPTS="--storage-driver=aufs"
```
Once your daemon is running, verify the storage driver with the `docker info` command.
```bash
$ sudo docker info
Containers: 1
Images: 4
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 6
Dirperm1 Supported: false
Execution Driver: native-0.2
...output truncated...
```
The output above shows that the Docker daemon is running the AUFS storage driver on top of an existing ext4 backing filesystem.
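If you want to check the driver from a script, one possible approach is to filter the `docker info` output. The sketch below assumes the `Storage Driver:` line format shown above; the `storage_driver` function name is hypothetical.

```bash
#!/usr/bin/env bash
# storage_driver: read `docker info` output on stdin and print the driver name.
# Hypothetical helper; it assumes a "Storage Driver: <name>" line as shown above.
storage_driver() {
    sed -n 's/^[[:space:]]*Storage Driver:[[:space:]]*//p' | head -n 1
}

# Demo with sample text taken from the output above.
printf '%s\n' 'Containers: 1' 'Images: 4' 'Storage Driver: aufs' | storage_driver
```

On a host with a running daemon, the same function could be used as `sudo docker info | storage_driver`.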
## Local storage and AUFS
As the `docker daemon` runs with the AUFS driver, the driver stores images and containers within the Docker host's local storage area in the `/var/lib/docker/aufs` directory.
### Images
Image layers and their contents are stored under the
`/var/lib/docker/aufs/diff/<image-id>` directory. The contents of an image
layer in this location include all the files and directories belonging to that
image layer.
The `/var/lib/docker/aufs/layers/` directory contains metadata about how image
layers are stacked. This directory contains one file for every image or
container layer on the Docker host. Inside each file are the names of the image
layers that exist below it. The diagram below shows an image with 4 layers.
![](images/aufs_metadata.jpg)
Inspecting the contents of the file relating to the top layer of the image
shows the three image layers below it. They are listed in the order they are
stacked.
```bash
$ cat /var/lib/docker/aufs/layers/91e54dfb11794fad694460162bf0cb0a4fa710cfa3f60979c177d920813e267c
d74508fb6632491cea586a1fd7d748dfc5274cd6fdfedee309ecdcbc2bf5cb82
c22013c8472965aa5b62559f2b540cd440716ef149756e7b958a1b2aba421e87
d3a1f33e8a5a513092f01bb7eb1c2abf4d711e5105390a3fe1ae2248cfde1391
```
The base layer in an image has no image layers below it, so its file is empty.
### Containers
Running containers are mounted at locations in the
`/var/lib/docker/aufs/mnt/<container-id>` directory. This is the AUFS union
mount point that exposes the container and all underlying image layers as a
single unified view. If a container is not running, its directory still exists
but is empty. This is because containers are only mounted when they are running.
Container metadata and various config files that are placed into the running
container are stored in `/var/lib/docker/containers/<container-id>`. Files in
this directory exist for all containers on the system, including ones that are
stopped. However, when a container is running, the container's log files are
also in this directory.
A container's thin writable layer is stored under
`/var/lib/docker/aufs/diff/<container-id>`. This directory is stacked by AUFS as
the container's top writable layer and is where all changes to the container are
stored. The directory exists even if the container is stopped. This means that
restarting a container will not lose changes made to it. Once a container is
deleted, this directory is deleted.
Information about which image layers are stacked below a container's top
writable layer is stored in the following file
`/var/lib/docker/aufs/layers/<container-id>`. The command below shows that the
container with ID `b41a6e5a508d` has 4 image layers below it:
```bash
$ cat /var/lib/docker/aufs/layers/b41a6e5a508dfa02607199dfe51ed9345a675c977f2cafe8ef3e4b0b5773404e-init
91e54dfb11794fad694460162bf0cb0a4fa710cfa3f60979c177d920813e267c
d74508fb6632491cea586a1fd7d748dfc5274cd6fdfedee309ecdcbc2bf5cb82
c22013c8472965aa5b62559f2b540cd440716ef149756e7b958a1b2aba421e87
d3a1f33e8a5a513092f01bb7eb1c2abf4d711e5105390a3fe1ae2248cfde1391
```
The image layers are shown in order. In the output above, the layer starting
with image ID "d3a1..." is the image's base layer. The image layer starting
with "91e5..." is the image's topmost layer.
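A layers metadata file can be summarized with standard tools. The sketch below assumes the format shown above (one layer ID per line, topmost layer first, base layer last); the `layer_summary` function is a hypothetical helper, and the demo writes the IDs from the example into a temporary file rather than reading `/var/lib/docker`.

```bash
#!/usr/bin/env bash
# layer_summary FILE: FILE lists layer IDs, topmost first and base layer last.
# Hypothetical helper for illustration; it only reads the file.
layer_summary() {
    local file=$1
    local count top base
    count=$(grep -c . "$file" || true)      # non-empty lines = layers below
    top=$(head -n 1 "$file" | cut -c1-12)   # short ID of the topmost layer
    base=$(tail -n 1 "$file" | cut -c1-12)  # short ID of the base layer
    echo "layers below: $count"
    echo "topmost: $top"
    echo "base: $base"
}

# Demo using the IDs from the example output above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
91e54dfb11794fad694460162bf0cb0a4fa710cfa3f60979c177d920813e267c
d74508fb6632491cea586a1fd7d748dfc5274cd6fdfedee309ecdcbc2bf5cb82
c22013c8472965aa5b62559f2b540cd440716ef149756e7b958a1b2aba421e87
d3a1f33e8a5a513092f01bb7eb1c2abf4d711e5105390a3fe1ae2248cfde1391
EOF
layer_summary "$sample"
```

For a base image layer, whose file is empty, the count is 0 and both IDs are blank.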
## AUFS and Docker performance
To summarize some of the performance-related aspects already mentioned:
- The AUFS storage driver is a good choice for PaaS and other similar use-cases where container density is important. This is because AUFS efficiently shares images between multiple running containers, enabling fast container start times and minimal use of disk space.
- The underlying mechanics of how AUFS shares files between image layers and containers use the system's page cache very efficiently.
- The AUFS storage driver can introduce significant latencies into container write performance. This is because the first time a container writes to any file, the file has to be located and copied into the container's top writable layer. These latencies increase and are compounded when these files exist below many image layers and the files themselves are large.
One final point. Data volumes provide the best and most predictable performance.
This is because they bypass the storage driver and do not incur any of the
potential overheads introduced by thin provisioning and copy-on-write. For this
reason, you may want to place heavy write workloads on data volumes.
## Related information
* [Understand images, containers, and storage drivers](imagesandcontainers.md)
* [Select a storage driver](selectadriver.md)
* [BTRFS storage driver in practice](btrfs-driver.md)
* [Device Mapper storage driver in practice](device-mapper-driver.md)

@ -0,0 +1,280 @@
<!--[metadata]>
+++
title = "BTRFS storage in practice"
description = "Learn how to optimize your use of BTRFS driver."
keywords = ["container, storage, driver, BTRFS "]
[menu.main]
parent = "mn_storage_docker"
+++
<![end-metadata]-->
# Docker and BTRFS in practice
Btrfs is a next generation copy-on-write filesystem that supports many advanced
storage technologies that make it a good fit for Docker. Btrfs is included in
the mainline Linux kernel and its on-disk format is now considered stable.
However, many of its features are still under heavy development and users should
consider it a fast-moving target.
Docker's `btrfs` storage driver leverages many Btrfs features for image and
container management. Among these features are thin provisioning, copy-on-write,
and snapshotting.
This article refers to Docker's Btrfs storage driver as `btrfs` and the overall Btrfs Filesystem as Btrfs.
>**Note**: The [Commercially Supported Docker Engine (CS-Engine)](https://www.docker.com/compatibility-maintenance) does not currently support the `btrfs` storage driver.
## The future of Btrfs
Btrfs has long been hailed as the future of Linux filesystems. With full support in the mainline Linux kernel, a stable on-disk format, and active development with a focus on stability, this is now becoming more of a reality.
As far as Docker on the Linux platform goes, many people see the `btrfs` storage driver as a potential long-term replacement for the `devicemapper` storage driver. However, at the time of writing, the `devicemapper` storage driver should be considered safer, more stable, and more *production ready*. You should only consider the `btrfs` driver for production deployments if you understand it well and have existing experience with Btrfs.
## Image layering and sharing with Btrfs
Docker leverages Btrfs *subvolumes* and *snapshots* for managing the on-disk components of image and container layers. Btrfs subvolumes look and feel like a normal Unix filesystem. As such, they can have their own internal directory structure that hooks into the wider Unix filesystem.
Subvolumes are natively copy-on-write and have space allocated to them on-demand
from an underlying storage pool. They can also be nested and snapped. The
diagram below shows 4 subvolumes. 'Subvolume 2' and 'Subvolume 3' are nested,
whereas 'Subvolume 4' shows its own internal directory tree.
![](images/btfs_subvolume.jpg)
Snapshots are a point-in-time read-write copy of an entire subvolume. They exist directly below the subvolume they were created from. You can create snapshots of snapshots as shown in the diagram below.
![](images/btfs_snapshots.jpg)
Btrfs allocates space to subvolumes and snapshots on demand from an underlying pool of storage. The unit of allocation is referred to as a *chunk* and *chunks* are normally ~1GB in size.
Snapshots are first-class citizens in a Btrfs filesystem. This means that they look, feel, and operate just like regular subvolumes. The technology required to create them is built directly into the Btrfs filesystem thanks to its native copy-on-write design. This means that Btrfs snapshots are space efficient with little or no performance overhead. The diagram below shows a subvolume and its snapshot sharing the same data.
![](images/btfs_pool.jpg)
Docker's `btrfs` storage driver stores every image layer and container in its own Btrfs subvolume or snapshot. The base layer of an image is stored as a subvolume whereas child image layers and containers are stored as snapshots. This is shown in the diagram below.
![](images/btfs_container_layer.jpg)
The high level process for creating images and containers on Docker hosts running the `btrfs` driver is as follows:
1. The image's base layer is stored in a Btrfs subvolume under
`/var/lib/docker/btrfs/subvolumes`.
The image ID is used as the subvolume name. E.g., a base layer with image ID
"f9a9f253f6105141e0f8e091a6bcdb19e3f27af949842db93acba9048ed2410b" will be
stored in
`/var/lib/docker/btrfs/subvolumes/f9a9f253f6105141e0f8e091a6bcdb19e3f27af949842db93acba9048ed2410b`
2. Subsequent image layers are stored as a Btrfs snapshot of the parent layer's subvolume or snapshot.
The diagram below shows a three-layer image. The base layer is a subvolume. Layer 1 is a snapshot of the base layer's subvolume. Layer 2 is a snapshot of Layer 1's snapshot.
![](images/btfs_constructs.jpg)
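The parent/child relationships above can be expressed with the `btrfs subvolume create` and `btrfs subvolume snapshot` commands. The sketch below only prints the commands that would build such a chain; nothing is executed. It is not a description of how the `btrfs` storage driver is implemented, and the `plan_layers` function and the layer names `base`, `layer1`, and `layer2` are hypothetical stand-ins for real image IDs.

```bash
#!/usr/bin/env bash
# plan_layers ROOT BASE [CHILD...]: print the btrfs commands that would build
# a base subvolume followed by a chain of snapshots. Dry run only; the helper
# and layer names are hypothetical and nothing touches the filesystem.
plan_layers() {
    local root=$1 parent=$2
    shift 2
    # The base layer is an ordinary subvolume.
    echo "btrfs subvolume create $root/$parent"
    local child
    for child in "$@"; do
        # Each later layer is a snapshot of the layer directly below it.
        echo "btrfs subvolume snapshot $root/$parent $root/$child"
        parent=$child
    done
}

plan_layers /var/lib/docker/btrfs/subvolumes base layer1 layer2
```

A container would add one more snapshot to the end of the chain, taken from the image's top layer.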
## Image and container on-disk constructs
Image layers and containers are visible in the Docker host's filesystem at
`/var/lib/docker/btrfs/subvolumes/<image-id> OR <container-id>`. Directories for
containers are present even for containers with a stopped status. This is
because the `btrfs` storage driver mounts a default, top-level subvolume at
`/var/lib/docker/subvolumes`. All other subvolumes and snapshots exist below
that as Btrfs filesystem objects and not as individual mounts.
The following example shows a single Docker image with four image layers.
```bash
$ sudo docker images -a
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
ubuntu latest 0a17decee413 2 weeks ago 188.3 MB
<none> <none> 3c9a9d7cc6a2 2 weeks ago 188.3 MB
<none> <none> eeb7cb91b09d 2 weeks ago 188.3 MB
<none> <none> f9a9f253f610 2 weeks ago 188.1 MB
```
Each image layer exists as a Btrfs subvolume or snapshot with the same name as its image ID, as illustrated by the `btrfs subvolume list` command shown below:
```bash
$ sudo btrfs subvolume list /var/lib/docker
ID 257 gen 9 top level 5 path btrfs/subvolumes/f9a9f253f6105141e0f8e091a6bcdb19e3f27af949842db93acba9048ed2410b
ID 258 gen 10 top level 5 path btrfs/subvolumes/eeb7cb91b09d5de9edb2798301aeedf50848eacc2123e98538f9d014f80f243c
ID 260 gen 11 top level 5 path btrfs/subvolumes/3c9a9d7cc6a235eb2de58ca9ef3551c67ae42a991933ba4958d207b29142902b
ID 261 gen 12 top level 5 path btrfs/subvolumes/0a17decee4139b0de68478f149cc16346f5e711c5ae3bb969895f22dd6723751
```
Under the `/var/lib/docker/btrfs/subvolumes` directory, each of these subvolumes and snapshots is visible as a normal Unix directory:
```bash
$ ls -l /var/lib/docker/btrfs/subvolumes/
total 0
drwxr-xr-x 1 root root 132 Oct 16 14:44 0a17decee4139b0de68478f149cc16346f5e711c5ae3bb969895f22dd6723751
drwxr-xr-x 1 root root 132 Oct 16 14:44 3c9a9d7cc6a235eb2de58ca9ef3551c67ae42a991933ba4958d207b29142902b
drwxr-xr-x 1 root root 132 Oct 16 14:44 eeb7cb91b09d5de9edb2798301aeedf50848eacc2123e98538f9d014f80f243c
drwxr-xr-x 1 root root 132 Oct 16 14:44 f9a9f253f6105141e0f8e091a6bcdb19e3f27af949842db93acba9048ed2410b
```
Because Btrfs works at the filesystem level and not the block level, each image
and container layer can be browsed in the filesystem using normal Unix commands.
The example below shows a truncated output of an `ls -l` command against the
image's top layer:
```bash
$ ls -l /var/lib/docker/btrfs/subvolumes/0a17decee4139b0de68478f149cc16346f5e711c5ae3bb969895f22dd6723751/
total 0
drwxr-xr-x 1 root root 1372 Oct 9 08:39 bin
drwxr-xr-x 1 root root 0 Apr 10 2014 boot
drwxr-xr-x 1 root root 882 Oct 9 08:38 dev
drwxr-xr-x 1 root root 2040 Oct 12 17:27 etc
drwxr-xr-x 1 root root 0 Apr 10 2014 home
...output truncated...
```
## Container reads and writes with Btrfs
A container is a space-efficient snapshot of an image. Metadata in the snapshot
points to the actual data blocks in the storage pool. This is the same as with a
subvolume. Therefore, reads performed against a snapshot are essentially the
same as reads performed against a subvolume. As a result, no performance
overhead is incurred from the Btrfs driver.
Writing a new file to a container invokes an allocate-on-demand operation to
allocate a new data block to the container's snapshot. The file is then written
to this new space. The allocate-on-demand operation is native to all writes with
Btrfs and is the same as writing new data to a subvolume. As a result, writing
new files to a container's snapshot operates at native Btrfs speeds.
Updating an existing file in a container causes a copy-on-write operation
(technically *redirect-on-write*). The driver leaves the original data and
allocates new space to the snapshot. The updated data is written to this new
space. Then, the driver updates the filesystem metadata in the snapshot to point
to this new data. The original data is preserved in-place for subvolumes and
snapshots further up the tree. This behavior is native to copy-on-write
filesystems like Btrfs and incurs very little overhead.
With Btrfs, writing and updating lots of small files can result in slow performance. More on this later.
## Configuring Docker with Btrfs
The `btrfs` storage driver only operates on a Docker host where `/var/lib/docker` is mounted as a Btrfs filesystem. The following procedure shows how to configure Btrfs on Ubuntu 14.04 LTS.
### Prerequisites
If you have already used the Docker daemon on your Docker host and have images you want to keep, `push` them to Docker Hub or your private Docker Trusted Registry before attempting this procedure.
Stop the Docker daemon. Then, ensure that you have a spare block device at `/dev/xvdb`. The device identifier may be different in your environment and you should substitute your own values throughout the procedure.
The procedure also assumes your kernel has the appropriate Btrfs modules loaded. To verify this, use the following command:
```bash
$ cat /proc/filesystems | grep btrfs
```
### Configure Btrfs on Ubuntu 14.04 LTS
Assuming your system meets the prerequisites, do the following:
1. Install the "btrfs-tools" package.
$ sudo apt-get install btrfs-tools
Reading package lists... Done
Building dependency tree
<output truncated>
2. Create the Btrfs storage pool.
Btrfs storage pools are created with the `mkfs.btrfs` command. Passing multiple devices to the `mkfs.btrfs` command creates a pool across all of those devices. Here you create a pool with a single device at `/dev/xvdb`.
$ sudo mkfs.btrfs -f /dev/xvdb
WARNING! - Btrfs v3.12 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
fs created label (null) on /dev/xvdb
nodesize 16384 leafsize 16384 sectorsize 4096 size 4.00GiB
Btrfs v3.12
Be sure to substitute `/dev/xvdb` with the appropriate device(s) on your
system.
> **Warning**: Take note of the warning about Btrfs being experimental. As
noted earlier, Btrfs is not currently recommended for production deployments
unless you already have extensive experience.
3. If it does not already exist, create a directory for the Docker host's local storage area at `/var/lib/docker`.
$ sudo mkdir /var/lib/docker
4. Configure the system to automatically mount the Btrfs filesystem each time the system boots.
a. Obtain the Btrfs filesystem's UUID.
$ sudo blkid /dev/xvdb
/dev/xvdb: UUID="a0ed851e-158b-4120-8416-c9b072c8cf47" UUID_SUB="c3927a64-4454-4eef-95c2-a7d44ac0cf27" TYPE="btrfs"
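b. Create an `/etc/fstab` entry to automatically mount `/var/lib/docker` each time the system boots. Either of the following lines works. If you use the UUID form, substitute the value obtained in the previous step.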
/dev/xvdb /var/lib/docker btrfs defaults 0 0
UUID="a0ed851e-158b-4120-8416-c9b072c8cf47" /var/lib/docker btrfs defaults 0 0
5. Mount the new filesystem and verify the operation.
$ sudo mount -a
$ mount
/dev/xvda1 on / type ext4 (rw,discard)
<output truncated>
/dev/xvdb on /var/lib/docker type btrfs (rw)
The last line in the output above shows the `/dev/xvdb` mounted at `/var/lib/docker` as Btrfs.
Now that you have a Btrfs filesystem mounted at `/var/lib/docker`, the daemon should automatically load with the `btrfs` storage driver.
1. Start the Docker daemon.
$ sudo service docker start
docker start/running, process 2315
The procedure for starting the Docker daemon may differ depending on the
Linux distribution you are using.
You can start the Docker daemon with the `btrfs` storage driver by passing
the `--storage-driver=btrfs` flag to the `docker daemon` command or you can
add the `DOCKER_OPTS` line to the Docker config file.
2. Verify the storage driver with the `docker info` command.
$ sudo docker info
Containers: 0
Images: 0
Storage Driver: btrfs
[...]
Your Docker host is now configured to use the `btrfs` storage driver.
## BTRFS and Docker performance
There are several factors that influence Docker's performance under the `btrfs` storage driver.
- **Page caching**. Btrfs does not support page cache sharing. This means that *n* containers accessing the same file require *n* copies to be cached. As a result, the `btrfs` driver may not be the best choice for PaaS and other high density container use cases.
- **Small writes**. Containers performing lots of small writes (including Docker hosts that start and stop many containers) can lead to poor use of Btrfs chunks. This can ultimately lead to out-of-space conditions on your Docker host and stop it working. This is currently a major drawback to using current versions of Btrfs.
If you use the `btrfs` storage driver, closely monitor the free space on your Btrfs filesystem using the `btrfs filesys show` command. Do not trust the output of normal Unix commands such as `df`; always use the Btrfs native commands.
- **Sequential writes**. Btrfs writes data to disk using a journaling technique. This can impact sequential writes, reducing performance by up to half.
- **Fragmentation**. Fragmentation is a natural byproduct of copy-on-write filesystems like Btrfs. Many small random writes can compound this issue. It can manifest as CPU spikes on Docker hosts using SSD media and head thrashing on Docker hosts using spinning media. Both of these result in poor performance.
Recent versions of Btrfs allow you to specify `autodefrag` as a mount option. This mode attempts to detect random writes and defragment them. You should perform your own tests before enabling this option on your Docker hosts. Some tests have shown this option has a negative performance impact on Docker hosts performing lots of small writes (including systems that start and stop many containers).
- **Solid State Devices (SSD)**. Btrfs has native optimizations for SSD media. To enable these, mount with the `-o ssd` mount option. These optimizations include enhanced SSD write performance by avoiding things like *seek optimizations* that have no use on SSD media.
Btrfs also supports the TRIM/Discard primitives. However, mounting with the `-o discard` mount option can cause performance issues. Therefore, it is recommended you perform your own tests before using this option.
- **Use Data Volumes**. Data volumes provide the best and most predictable performance. This is because they bypass the storage driver and do not incur any of the potential overheads introduced by thin provisioning and copy-on-write. For this reason, you may want to place heavy write workloads on data volumes.
## Related Information
* [Understand images, containers, and storage drivers](imagesandcontainers.md)
* [Select a storage driver](selectadriver.md)
* [AUFS storage driver in practice](aufs-driver.md)
* [Device Mapper storage driver in practice](device-mapper-driver.md)

@ -0,0 +1,310 @@
<!--[metadata]>
+++
title="Device mapper storage in practice"
description="Learn how to optimize your use of device mapper driver."
keywords=["container, storage, driver, device mapper"]
[menu.main]
parent="mn_storage_docker"
+++
<![end-metadata]-->
# Docker and the Device Mapper storage driver
Device Mapper is a kernel-based framework that underpins many advanced
volume management technologies on Linux. Docker's `devicemapper` storage driver
leverages the thin provisioning and snapshotting capabilities of this framework
for image and container management. This article refers to the Device Mapper
storage driver as `devicemapper`, and the kernel framework as `Device Mapper`.
>**Note**: The [Commercially Supported Docker Engine (CS-Engine) running on RHEL and CentOS Linux](https://www.docker.com/compatibility-maintenance) requires that you use the `devicemapper` storage driver.
## An alternative to AUFS
Docker originally ran on Ubuntu and Debian Linux and used AUFS for its storage
backend. As Docker became popular, many of the companies that wanted to use it
were using Red Hat Enterprise Linux (RHEL). Unfortunately, because the upstream
mainline Linux kernel did not include AUFS, RHEL did not use AUFS either.
To correct this, Red Hat developers investigated getting AUFS into the mainline
kernel. Ultimately, though, they decided a better idea was to develop a new
storage backend. Moreover, they would base this new storage backend on existing
`Device Mapper` technology.
Red Hat collaborated with Docker Inc. to contribute this new driver. As a result
of this collaboration, Docker's Engine was re-engineered to make the storage
backend pluggable. So it was that the `devicemapper` became the second storage
driver Docker supported.
Device Mapper has been included in the mainline Linux kernel since version
2.6.9. It is a core part of the RHEL family of Linux distributions. This means
that the `devicemapper` storage driver is based on stable code that has a lot of
real-world production deployments and strong community support.
## Image layering and sharing
The `devicemapper` driver stores every image and container on its own virtual
device. These devices are thin-provisioned copy-on-write snapshot devices.
Device Mapper technology works at the block level rather than the file level.
This means that the `devicemapper` storage driver's thin provisioning and
copy-on-write operations work with blocks rather than entire files.
>**Note**: Snapshots are also referred to as *thin devices* or *virtual devices*. They all mean the same thing in the context of the `devicemapper` storage driver.
With `devicemapper`, the high-level process for creating images is as follows:
1. The `devicemapper` storage driver creates a thin pool.
The pool is created from block devices or loop mounted sparse files (more on this later).
2. Next it creates a *base device*.
A base device is a thin device with a filesystem. You can see which filesystem is in use by running the `docker info` command and checking the `Backing filesystem` value.
3. Each new image (and image layer) is a snapshot of this base device.
These are thin provisioned copy-on-write snapshots. This means that they are initially empty and only consume space from the pool when data is written to them.
With `devicemapper`, container layers are snapshots of the image they are created from. Just as with images, container snapshots are thin provisioned copy-on-write snapshots. The container snapshot stores all updates to the container. The `devicemapper` allocates space to them on-demand from the pool as and when data is written to the container.
The high level diagram below shows a thin pool with a base device and two images.
![](images/base_device.jpg)
If you look closely at the diagram you'll see that it's snapshots all the way down. Each image layer is a snapshot of the layer below it. The lowest layer of each image is a snapshot of the base device that exists in the pool. This base device is a `Device Mapper` artifact and not a Docker image layer.
A container is a snapshot of the image it is created from. The diagram below shows two containers - one based on the Ubuntu image and the other based on the Busybox image.
![](images/two_dm_container.jpg)
## Reads with the devicemapper
Let's look at how reads and writes occur using the `devicemapper` storage driver. The diagram below shows the high level process for reading a single block (`0x44f`) in an example container.
![](images/dm_container.jpg)
1. An application makes a read request for block 0x44f in the container.
Because the container is a thin snapshot of an image it does not have the data. Instead, it has a pointer (PTR) to where the data is stored in the image snapshot lower down in the image stack.
2. The storage driver follows the pointer to block `0xf33` in the snapshot relating to image layer `a005...`.
3. The `devicemapper` copies the contents of block `0xf33` from the image snapshot to memory in the container.
4. The storage driver returns the data to the requesting application.
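The read path in the steps above can be sketched as a pointer lookup. This is a conceptual sketch only: the dictionaries and the `read_block` function are hypothetical, and the block numbers are the ones used in the example.

```python
# Blocks owned by the snapshot for image layer a005... (from the example above).
image_blocks = {0xF33: b"data stored in the image snapshot"}

# The container snapshot holds no data for block 0x44f, only a pointer.
container_pointers = {0x44F: 0xF33}
container_blocks = {}


def read_block(block_no):
    """Return the data for a block as seen from inside the container."""
    if block_no in container_blocks:
        # The container already owns this block, so read it directly.
        return container_blocks[block_no]
    # Otherwise follow the pointer to the block in the image snapshot.
    target = container_pointers[block_no]
    return image_blocks[target]


print(read_block(0x44F))
```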
### Write examples
With the `devicemapper` driver, writing new data to a container is accomplished by an *allocate-on-demand* operation. Updating existing data uses a copy-on-write operation. Because Device Mapper is a block-based technology these operations occur at the block level.
For example, when making a small change to a large file in a container, the `devicemapper` storage driver does not copy the entire file. It only copies the blocks to be modified. Each block is 64KB.
#### Writing new data
To write 56KB of new data to a container:
1. An application makes a request to write 56KB of new data to the container.
2. The allocate-on-demand operation allocates a single new 64KB block to the container's snapshot.
If the write operation is larger than 64KB, multiple new blocks are allocated to the container snapshot.
3. The data is written to the newly allocated block.
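The number of blocks allocated for a write follows from simple arithmetic. The sketch below assumes the write starts at a block boundary; `blocks_needed` is a hypothetical helper, not part of Docker.

```python
BLOCK_SIZE_KB = 64  # block size described on this page


def blocks_needed(write_kb):
    """Number of 64KB blocks allocated for a new write of write_kb kilobytes."""
    if write_kb <= 0:
        return 0
    # Ceiling division: a partly filled block still occupies a whole block.
    return -(-write_kb // BLOCK_SIZE_KB)


for size in (56, 64, 65, 200):
    print(size, "KB ->", blocks_needed(size), "block(s)")
```

A 56KB write therefore still consumes a full 64KB block, leaving 8KB of that block unused by the write.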
#### Overwriting existing data
To modify existing data for the first time:
1. An application makes a request to modify some data in the container.
2. A copy-on-write operation locates the blocks that need updating.
3. The operation allocates new blocks to the container snapshot and copies the data into those blocks.
4. The modified data is written into the newly allocated blocks.
The application in the container is unaware of any of these
allocate-on-demand and copy-on-write operations. However, they may add latency
to the application's read and write operations.
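As a rough illustration of the copy-on-write steps above, the following sketch copies a block from the image snapshot into the container snapshot the first time it is modified. The dictionaries and the `overwrite` function are hypothetical and only model the idea.

```python
BLOCK_SIZE = 64 * 1024

# Two blocks owned by the image snapshot; the container snapshot starts empty.
image_snapshot = {0: b"A" * BLOCK_SIZE, 1: b"B" * BLOCK_SIZE}
container_snapshot = {}


def overwrite(block_no, offset, data):
    """Modify part of a block, copying it into the container on first write."""
    if offset + len(data) > BLOCK_SIZE:
        raise ValueError("write crosses a block boundary")
    if block_no not in container_snapshot:
        # First modification: allocate a block and copy the original data.
        container_snapshot[block_no] = bytearray(image_snapshot[block_no])
    # The modified data is written into the container's own copy.
    container_snapshot[block_no][offset:offset + len(data)] = data


overwrite(1, 0, b"hello")
print(sorted(container_snapshot))   # only the modified block was copied
print(image_snapshot[1][:5])        # the image snapshot is unchanged
```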
## Configuring Docker with Device Mapper
The `devicemapper` is the default Docker storage driver on some Linux
distributions. This includes RHEL and most of its forks. Currently, the following distributions support the driver:
* RHEL/CentOS/Fedora
* Ubuntu 12.04
* Ubuntu 14.04
* Debian
Docker hosts running the `devicemapper` storage driver default to a
configuration mode known as `loop-lvm`. This mode uses sparse files to build
the thin pool used by image and container snapshots. The mode is designed to work out-of-the-box
with no additional configuration. However, production deployments should not run
under `loop-lvm` mode.
You can detect the mode by viewing the output of the `docker info` command:
$ sudo docker info
Containers: 0
Images: 0
Storage Driver: devicemapper
Pool Name: docker-202:2-25220302-pool
Pool Blocksize: 65.54 kB
Backing Filesystem: xfs
...
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.93-RHEL7 (2015-01-28)
...
The output above shows a Docker host running with the `devicemapper` storage driver operating in `loop-lvm` mode. This is indicated by the `Data loop file` and `Metadata loop file` entries, which point to files under `/var/lib/docker/devicemapper/devicemapper`. These are loopback-mounted sparse files.
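If you want to automate this check, one possible approach is to scan saved `docker info` output for the loop file entries. The sketch below is an assumption-laden example: `uses_loop_lvm` is a hypothetical helper, and it relies on the field names shown in the sample output above, which may differ between Docker versions.

```python
def uses_loop_lvm(docker_info_text):
    """Return True if the given `docker info` text lists loop files."""
    for line in docker_info_text.splitlines():
        key, _, value = line.strip().partition(":")
        if key in ("Data loop file", "Metadata loop file") and value.strip():
            return True
    return False


# Sample text copied from the output shown above.
sample = """\
Storage Driver: devicemapper
 Pool Name: docker-202:2-25220302-pool
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
"""

print(uses_loop_lvm(sample))
```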
### Configure direct-lvm mode for production
The preferred configuration for production deployments is `direct-lvm`. This
mode uses block devices to create the thin pool. The following procedure shows
you how to configure a Docker host to use the `devicemapper` storage driver in a
`direct-lvm` configuration.
> **Caution:** If you have already run the Docker daemon on your Docker host and have images you want to keep, `push` them to Docker Hub or your private Docker Trusted Registry before attempting this procedure.
The procedure below creates a 90GB data volume and a 4GB metadata volume to use as backing for the storage pool. It assumes that you have a spare block device at `/dev/xvdf` with enough free space to complete the task. The device identifier and volume sizes may be different in your environment, and you should substitute your own values throughout the procedure. The procedure also assumes that the Docker daemon is in the `stopped` state.
1. Log in to the Docker host you want to configure and stop the Docker daemon.
2. If it exists, delete your existing image store by removing the `/var/lib/docker` directory.
$ sudo rm -rf /var/lib/docker
3. Create an LVM physical volume (PV) on your spare block device using the `pvcreate` command.
$ sudo pvcreate /dev/xvdf
Physical volume `/dev/xvdf` successfully created
The device identifier may be different on your system. Remember to substitute your value in the command above.
4. Create a new volume group (VG) called `vg-docker` using the PV created in the previous step.
$ sudo vgcreate vg-docker /dev/xvdf
Volume group `vg-docker` successfully created
5. Create a new 90GB logical volume (LV) called `data` from space in the `vg-docker` volume group.
$ sudo lvcreate -L 90G -n data vg-docker
Logical volume `data` created.
The command creates an LVM logical volume called `data` and an associated block device file at `/dev/vg-docker/data`. In a later step, you instruct the `devicemapper` storage driver to use this block device to store image and container data.
If you receive a signature detection warning, make sure you are working on the correct devices before continuing. Signature warnings indicate that the device you're working on is currently in use by LVM or has been used by LVM in the past.
6. Create a new logical volume (LV) called `metadata` from space in the `vg-docker` volume group.
$ sudo lvcreate -L 4G -n metadata vg-docker
Logical volume `metadata` created.
This creates an LVM logical volume called `metadata` and an associated block device file at `/dev/vg-docker/metadata`. In the next step you instruct the `devicemapper` storage driver to use this block device to store image and container metadata.
7. Start the Docker daemon with the `devicemapper` storage driver and the `--storage-opt` flags.
The `data` and `metadata` devices that you pass to the `--storage-opt` options were created in the previous steps.
$ sudo docker daemon --storage-driver=devicemapper --storage-opt dm.datadev=/dev/vg-docker/data --storage-opt dm.metadatadev=/dev/vg-docker/metadata &
[1] 2163
[root@ip-10-0-0-75 centos]# INFO[0000] Listening for HTTP on unix (/var/run/docker.sock)
INFO[0027] Option DefaultDriver: bridge
INFO[0027] Option DefaultNetwork: bridge
<output truncated>
INFO[0027] Daemon has completed initialization
INFO[0027] Docker daemon commit=0a8c2e3 execdriver=native-0.2 graphdriver=devicemapper version=1.8.2
It is also possible to set the `--storage-driver` and `--storage-opt` flags in the Docker config file and start the daemon normally using the `service` or `systemd` commands.
8. Use the `docker info` command to verify that the daemon is using the `data` and `metadata` devices you created.
$ sudo docker info
INFO[0180] GET /v1.20/info
Containers: 0
Images: 0
Storage Driver: devicemapper
Pool Name: docker-202:1-1032-pool
Pool Blocksize: 65.54 kB
Backing Filesystem: xfs
Data file: /dev/vg-docker/data
Metadata file: /dev/vg-docker/metadata
[...]
The output of the command above shows the storage driver as `devicemapper`. The last two lines also confirm that the correct devices are being used for the `Data file` and the `Metadata file`.
### Examine devicemapper structures on the host
You can use the `lsblk` command to see the device files created above and the `pool` that the `devicemapper` storage driver creates on top of them.
$ sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda 202:0 0 8G 0 disk
└─xvda1 202:1 0 8G 0 part /
xvdf 202:80 0 100G 0 disk
├─vg--docker-data 253:0 0 90G 0 lvm
│ └─docker-202:1-1032-pool 253:2 0 100G 0 dm
└─vg--docker-metadata 253:1 0 4G 0 lvm
└─docker-202:1-1032-pool 253:2 0 100G 0 dm
The diagram below shows the image from prior examples updated with the detail from the `lsblk` command above.
![](http://farm1.staticflickr.com/703/22116692899_0471e5e160_b.jpg)
In the diagram, the pool is named `docker-202:1-1032-pool` and spans the `data` and `metadata` devices created earlier. The `devicemapper` storage driver constructs the pool name as follows:
```
docker-MAJ:MIN-INO-pool
```
`MAJ`, `MIN` and `INO` refer to the major device number, the minor device number, and the inode number.
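To pull the individual numbers out of a pool name, a small parser along these lines may help. It is a sketch based only on the naming pattern described above; `parse_pool_name` is a hypothetical helper and the prefix is matched case-insensitively.

```python
import re

# Pattern for names such as docker-202:1-1032-pool (MAJ:MIN-INO).
POOL_NAME = re.compile(r"^docker-(\d+):(\d+)-(\d+)-pool$", re.IGNORECASE)


def parse_pool_name(name):
    """Split a pool name into its major, minor, and inode numbers."""
    match = POOL_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected pool name: {name!r}")
    major, minor, inode = (int(part) for part in match.groups())
    return {"major": major, "minor": minor, "inode": inode}


print(parse_pool_name("docker-202:1-1032-pool"))
```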
Because Device Mapper operates at the block level it is more difficult to see
diffs between image layers and containers. However, there are two key
directories. The `/var/lib/docker/devicemapper/mnt` directory contains the mount
points for images and containers. The `/var/lib/docker/devicemapper/metadata`
directory contains one file for every image and container snapshot. The files
contain metadata about each snapshot in JSON format.
## Device Mapper and Docker performance
It is important to understand the impact that allocate-on-demand and copy-on-write operations can have on overall container performance.
### Allocate-on-demand performance impact
The `devicemapper` storage driver allocates new blocks to a container via an allocate-on-demand operation. This means that each time an app writes to somewhere new inside a container, one or more empty blocks have to be located in the pool and mapped into the container.
All blocks are 64KB. A write that uses less than 64KB still results in a single 64KB block being allocated. Writing more than 64KB of data uses multiple 64KB blocks. This can impact container performance, especially in containers that perform lots of small writes. However, once a block is allocated to a container subsequent reads and writes can operate directly on that block.
### Copy-on-write performance impact
Each time a container updates existing data for the first time, the `devicemapper` storage driver has to perform a copy-on-write operation. This copies the data from the image snapshot to the container's snapshot. This process can have a noticeable impact on container performance.
All copy-on-write operations have a 64KB granularity. As a result, updating 32KB of a 1GB file causes the driver to copy a single 64KB block into the container's snapshot. This has obvious performance advantages over file-level copy-on-write operations, which would require copying the entire 1GB file into the container layer.
In practice, however, containers that perform lots of small block writes (<64KB) can perform worse with `devicemapper` than with AUFS.
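The difference between block-level and file-level copy-on-write can be worked through numerically. The sketch below is illustrative only; `block_level_copy_bytes` is a hypothetical helper, and it shows that an update that straddles a block boundary copies two blocks rather than one.

```python
KB = 1024
GB = 1024 ** 3
BLOCK = 64 * KB  # copy-on-write granularity described above


def block_level_copy_bytes(offset, length):
    """Bytes copied when `length` bytes at `offset` are updated for the first time."""
    first_block = offset // BLOCK
    last_block = (offset + length - 1) // BLOCK
    return (last_block - first_block + 1) * BLOCK


aligned = block_level_copy_bytes(0, 32 * KB)          # fits inside one block
straddling = block_level_copy_bytes(48 * KB, 32 * KB) # crosses into a second block
file_level = 1 * GB                                   # file-level CoW copies the whole file

print(aligned // KB, "KB copied for an aligned 32KB update")
print(straddling // KB, "KB copied for a 32KB update that crosses a boundary")
print(file_level // aligned, "times more data copied by a file-level copy of a 1GB file")
```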
### Other device mapper performance considerations
There are several other things that impact the performance of the `devicemapper` storage driver.
- **The mode.** The default mode for Docker running the `devicemapper` storage driver is `loop-lvm`. This mode uses sparse files and suffers from poor performance. It is **not recommended for production**. The recommended mode for production environments is `direct-lvm` where the storage driver writes directly to raw block devices.
- **High speed storage.** For best performance you should place the `Data file` and `Metadata file` on high speed storage such as SSD. This can be direct attached storage or from a SAN or NAS array.
- **Memory usage.** `devicemapper` is not the most memory efficient Docker storage driver. Launching *n* copies of the same container loads *n* copies of its files into memory. This can have a memory impact on your Docker host. As a result, the `devicemapper` storage driver may not be the best choice for PaaS and other high density use cases.
One final point: data volumes provide the best and most predictable performance. This is because they bypass the storage driver and do not incur any of the potential overheads introduced by thin provisioning and copy-on-write. For this reason, you may want to place heavy write workloads on data volumes.
## Related Information
* [Understand images, containers, and storage drivers](imagesandcontainers.md)
* [Select a storage driver](selectadriver.md)
* [AUFS storage driver in practice](aufs-driver.md)
* [BTRFS storage driver in practice](btrfs-driver.md)

View File

@ -0,0 +1,255 @@
<!--[metadata]>
+++
title = "Understand images, containers, and storage drivers"
description = "Learn the technologies that support storage drivers."
keywords = ["container, storage, driver, AUFS, btrfs, devicemapper,zfs"]
[menu.main]
parent = "mn_storage_docker"
weight = -2
+++
<![end-metadata]-->
# Understand images, containers, and storage drivers
To use storage drivers effectively, you must understand how Docker builds and
stores images. Then, you need an understanding of how these images are used in containers. Finally, you'll need a short introduction to the technologies that enable both image and container operations.
## Images and containers rely on layers
Docker images are a series of read-only layers that are stacked
on top of each other to form a single unified view. The first image in the stack
is called a *base image* and all the other layers are stacked on top of this
layer. The diagram below shows the Ubuntu 15.04 image comprising 4 stacked image layers.
![](images/image-layers.jpg)
When you make a change inside a container by, for example, adding a new file to the Ubuntu 15.04 image, you add a new layer on top of the underlying image stack. This change creates a new image layer containing the newly added file. Each image layer has its own universally unique identifier (UUID) and each successive image layer builds on top of the image layer below it.
Containers (in the storage context) are a combination of a Docker image with a
thin writable layer added to the top known as the *container layer*. The diagram below shows a container running the Ubuntu 15.04 image.
![](images/container-layers.jpg)
The major difference between a container and an image is this writable layer. All writes to the container that add new data or modify existing data are stored in this writable layer. When the container is deleted, the writable layer is also deleted. The image remains unchanged.
Because each container has its own thin writable container layer and all data is stored in this container layer, multiple containers can share access to the same underlying image and yet have their own data state. The diagram below shows multiple containers sharing the same Ubuntu 15.04 image.
![](images/sharing-layers.jpg)
A storage driver is responsible for enabling and managing both the image layers and the writable container layer. How a storage driver accomplishes these behaviors can vary. Two key technologies behind Docker image and container management are stackable image layers and copy-on-write (CoW).
## The copy-on-write strategy
Sharing is a good way to optimize resources. People do this instinctively in
daily life. For example, twins Jane and Joseph taking an Algebra class at
different times from different teachers can share the same exercise book by
passing it between each other. Now, suppose Jane gets an assignment to complete
the homework on page 11 in the book. At that point, Jane copies page 11, completes the homework, and hands in her copy. The original exercise book is unchanged and only Jane has a copy of the changed page 11.
Copy-on-write is a similar strategy of sharing and copying. In this strategy,
system processes that need the same data share the same instance of that data
rather than having their own copy. At some point, if one process needs to modify
or write to the data, only then does the operating system make a copy of the
data for that process to use. Only the process that needs to write has access to
the data copy. All the other processes continue to use the original data.
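The same idea can be shown in a few lines of code. This is a toy sketch of the strategy, not an operating system mechanism: the `CowView` class is a hypothetical name, and the exercise book from the example above stands in for the shared data.

```python
class CowView:
    """A reader shares the original data until its first write, then gets a copy."""

    def __init__(self, shared):
        self._data = shared
        self._owns_copy = False

    def read(self, index):
        return self._data[index]

    def write(self, index, value):
        if not self._owns_copy:
            # Copy only now, at the moment a write is actually needed.
            self._data = list(self._data)
            self._owns_copy = True
        self._data[index] = value

    def shares(self, original):
        return self._data is original


exercise_book = [f"page {n}" for n in range(1, 21)]
jane = CowView(exercise_book)
joseph = CowView(exercise_book)

jane.write(10, "page 11 with Jane's homework")  # index 10 holds page 11

print(jane.read(10))
print(joseph.read(10))               # Joseph still sees the original page
print(joseph.shares(exercise_book))  # and still shares the original book
```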
Docker uses a copy-on-write technology with both images and containers. This CoW
strategy optimizes both image disk space usage and the performance of container
start times. The next sections look at how copy-on-write is leveraged with
images and containers through sharing and copying.
### Sharing promotes smaller images
This section looks at image layers and copy-on-write technology. All image and container layers exist inside the Docker host's *local storage area*, a location on the host's
filesystem that is managed by the storage driver.
The Docker client reports on image layers when instructed to pull and push
images with `docker pull` and `docker push`. The command below pulls the
`ubuntu:15.04` Docker image from Docker Hub.
$ docker pull ubuntu:15.04
15.04: Pulling from library/ubuntu
6e6a100fa147: Pull complete
13c0c663a321: Pull complete
2bd276ed39d5: Pull complete
013f3d01d247: Pull complete
Digest: sha256:c7ecf33cef00ae34b131605c31486c91f5fd9a76315d075db2afd39d1ccdf3ed
Status: Downloaded newer image for ubuntu:15.04
From the output, you'll see that the command actually pulls 4 image layers.
Each of the above lines lists an image layer and its UUID. The combination of
these four layers makes up the `ubuntu:15.04` Docker image.
The image layers are stored in the Docker host's local storage area. Typically,
the local storage area is in the host's `/var/lib/docker` directory. Depending
on which storage driver is in use, the local storage area may be in a different location. You can list the layers in the local storage area. The following example shows the storage as it appears under the AUFS storage driver:
$ sudo ls /var/lib/docker/aufs/layers
013f3d01d24738964bb7101fa83a926181d600ebecca7206dced59669e6e6778 2bd276ed39d5fcfd3d00ce0a190beeea508332f5aec3c6a125cc619a3fdbade6
13c0c663a321cd83a97f4ce1ecbaf17c2ba166527c3b06daaefe30695c5fcb8c 6e6a100fa147e6db53b684c8516e3e2588b160fd4898b6265545d5d4edb6796d
If you `pull` another image that shares some of the same image layers as the `ubuntu:15.04` image, the Docker daemon recognizes this and only pulls the layers it hasn't already stored. After the second pull, the two images share any common image layers.
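Conceptually, deciding which layers to pull is a comparison between the layers an image needs and the layers already stored locally. The sketch below illustrates that comparison only; `layers_to_pull` and the `hypothetical-layer-1` identifier are invented for the example and do not reflect the daemon's actual code.

```python
def layers_to_pull(image_layers, local_layers):
    """Return the layers of an image that are not yet in local storage."""
    return [layer for layer in image_layers if layer not in local_layers]


# Layers stored locally after pulling ubuntu:15.04 (short IDs from the output above).
local_layers = {"6e6a100fa147", "13c0c663a321", "2bd276ed39d5", "013f3d01d247"}

# A hypothetical second image that reuses two of those layers and adds one of its own.
other_image = ["6e6a100fa147", "13c0c663a321", "hypothetical-layer-1"]

print(layers_to_pull(other_image, local_layers))
```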
You can illustrate this now for yourself. Starting with the `ubuntu:15.04` image that
you just pulled, make a change to it, and build a new image based on the change.
One way to do this is using a Dockerfile and the `docker build` command.
1. In an empty directory, create a simple `Dockerfile` that starts with the `ubuntu:15.04` image.
FROM ubuntu:15.04
2. Add a new file called "newfile" in the image's `/tmp` directory with the text "Hello world" in it.
When you are done, the `Dockerfile` contains two lines:
FROM ubuntu:15.04
RUN echo "Hello world" > /tmp/newfile
3. Save and close the file.
4. From a terminal in the same folder as your Dockerfile, run the following command:
$ docker build -t changed-ubuntu .
Sending build context to Docker daemon 2.048 kB
Step 0 : FROM ubuntu:15.04
---> 013f3d01d247
Step 1 : RUN echo "Hello world" > /tmp/newfile
---> Running in 2023460815df
---> 03b964f68d06
Removing intermediate container 2023460815df
Successfully built 03b964f68d06
> **Note:** The period (.) at the end of the above command is important. It tells the `docker build` command to use the current working directory as its build context.
The output above shows a new image with image ID `03b964f68d06`.
5. Run the `docker images` command to verify the new image is in the Docker host's local storage area.
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
changed-ubuntu latest 03b964f68d06 33 seconds ago 131.4 MB
ubuntu
6. Run the `docker history` command to see which image layers were used to create the new `changed-ubuntu` image.
$ docker history changed-ubuntu
IMAGE CREATED CREATED BY SIZE COMMENT
03b964f68d06 About a minute ago /bin/sh -c echo "Hello world" > /tmp/newfile 12 B
013f3d01d247 6 weeks ago /bin/sh -c #(nop) CMD ["/bin/bash"] 0 B
2bd276ed39d5 6 weeks ago /bin/sh -c sed -i 's/^#\s*\(deb.*universe\)$/ 1.879 kB
13c0c663a321 6 weeks ago /bin/sh -c echo '#!/bin/sh' > /usr/sbin/polic 701 B
6e6a100fa147 6 weeks ago /bin/sh -c #(nop) ADD file:49710b44e2ae0edef4 131.4 MB
The `docker history` output shows the new `03b964f68d06` image layer at the
top. You know that the `03b964f68d06` layer was added because it was created
by the `echo "Hello world" > /tmp/newfile` command in your `Dockerfile`.
The 4 image layers below it are the exact same image layers that make up the
`ubuntu:15.04` image, as their UUIDs match.
7. List the contents of the local storage area to further confirm.
$ sudo ls /var/lib/docker/aufs/layers
013f3d01d24738964bb7101fa83a926181d600ebecca7206dced59669e6e6778 2bd276ed39d5fcfd3d00ce0a190beeea508332f5aec3c6a125cc619a3fdbade6
03b964f68d06a373933bd6d61d37610a34a355c168b08dfc604f57b20647e073 6e6a100fa147e6db53b684c8516e3e2588b160fd4898b6265545d5d4edb6796d
13c0c663a321cd83a97f4ce1ecbaf17c2ba166527c3b06daaefe30695c5fcb8c
Where before you had four layers stored, you now have five.
Notice the new `changed-ubuntu` image does not have its own copies of every layer. As can be seen in the diagram below, the new image is sharing its four underlying layers with the `ubuntu:15.04` image.
![](images/saving-space.jpg)
The `docker history` command also shows the size of each image layer. The `03b964f68d06` layer only consumes 12 bytes of disk space. Because all of the layers below it already exist on the Docker host and are shared with the `ubuntu:15.04` image, the entire `changed-ubuntu` image only consumes an additional 12 bytes of disk space.
This sharing of image layers is what makes Docker images and containers so space
efficient.
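The space saving can be checked with some quick arithmetic on the layer sizes reported by `docker history`. This is a rough sketch: the sizes are the rounded values from the output above, with kB and MB assumed to be decimal units, and `disk_usage` and `naive_usage` are hypothetical helpers.

```python
# Layer sizes in bytes, taken from the `docker history` output above.
LAYER_SIZES = {
    "6e6a100fa147": 131_400_000,
    "13c0c663a321": 701,
    "2bd276ed39d5": 1_879,
    "013f3d01d247": 0,
    "03b964f68d06": 12,
}

UBUNTU = ["6e6a100fa147", "13c0c663a321", "2bd276ed39d5", "013f3d01d247"]
CHANGED_UBUNTU = UBUNTU + ["03b964f68d06"]


def disk_usage(images):
    """Bytes used when every distinct layer is stored once and shared."""
    unique_layers = set().union(*images)
    return sum(LAYER_SIZES[layer] for layer in unique_layers)


def naive_usage(images):
    """Bytes used if each image kept a private copy of all of its layers."""
    return sum(LAYER_SIZES[layer] for image in images for layer in image)


shared = disk_usage([UBUNTU, CHANGED_UBUNTU])
extra = shared - disk_usage([UBUNTU])
print("extra bytes for changed-ubuntu:", extra)
print("bytes saved by sharing:", naive_usage([UBUNTU, CHANGED_UBUNTU]) - shared)
```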
### Copying makes containers efficient
You learned earlier that a container is a Docker image with a thin writable container layer added. The diagram below shows the layers of a container based on the `ubuntu:15.04` image:
![](images/container-layers.jpg)
All writes made to a container are stored in the thin writable container layer. The other layers are read-only (RO) image layers and can't be changed. This means that multiple containers can safely share a single underlying image. The diagram below shows multiple containers sharing a single copy of the `ubuntu:15.04` image. Each container has its own thin RW layer, but they all share a single instance of the ubuntu:15.04 image:
![](images/sharing-layers.jpg)
When a write operation occurs in a container, Docker uses the storage driver to perform a copy-on-write operation. The type of operation depends on the storage driver. For AUFS and OverlayFS storage drivers the copy-on-write operation is pretty much as follows:
* Search through the layers for the file to update. The process starts at the top, newest layer and works down to the base layer one-at-a-time.
* Perform a "copy-up" operation on the first copy of the file that is found. A "copy up" copies the file up to the container's own thin writable layer.
* Modify the *copy of the file* in the container's thin writable layer.
Btrfs, ZFS, and other drivers handle the copy-on-write differently. You can read more about the methods of these drivers later in their detailed descriptions.
Containers that write a lot of data will consume more space than containers that do not. This is because most write operations consume new space in the container's thin writable top layer. If your container needs to write a lot of data, you can use a data volume.
A copy-up operation can incur a noticeable performance overhead. This overhead is different depending on which storage driver is in use. However, large files, lots of layers, and deep directory trees can make the impact more noticeable. Fortunately, the operation only occurs the first time any particular file is modified. Subsequent modifications to the same file do not cause a copy-up operation and can operate directly on the file's existing copy already present in the container layer.
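The search, copy-up, and modify sequence described above for AUFS and OverlayFS can be sketched with dictionaries standing in for layers. This is a simplified model under stated assumptions: `append_to_file` is a hypothetical function, each layer maps a path to file contents, and details such as permissions and deletions are ignored.

```python
def append_to_file(container_layer, image_layers, path, text):
    """Append text to a file, copying it up first if needed.

    image_layers is ordered from the top (newest) layer down to the base layer.
    Returns True if a copy-up was performed.
    """
    copied_up = False
    if path not in container_layer:
        for layer in image_layers:  # search from the newest layer downwards
            if path in layer:
                # Copy-up: the first copy found goes into the writable layer.
                container_layer[path] = layer[path]
                copied_up = True
                break
        else:
            raise FileNotFoundError(path)
    # Modify only the copy in the container's writable layer.
    container_layer[path] += text
    return copied_up


image_layers = [{"/etc/app.conf": "v1"}, {"/bin/sh": "binary"}]
container_layer = {}

first = append_to_file(container_layer, image_layers, "/etc/app.conf", "\nextra")
second = append_to_file(container_layer, image_layers, "/etc/app.conf", "\nmore")
print(first, second)
```

In this model only the first modification triggers a copy-up; the second one works directly on the copy already held in the container layer, and the image layers are never changed.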
Let's see what happens if we spin up 5 containers based on the `changed-ubuntu` image we built earlier:
1. From a terminal on your Docker host, run the following `docker run` command 5 times.
$ docker run -dit changed-ubuntu bash
75bab0d54f3cf193cfdc3a86483466363f442fba30859f7dcd1b816b6ede82d4
$ docker run -dit changed-ubuntu bash
9280e777d109e2eb4b13ab211553516124a3d4d4280a0edfc7abf75c59024d47
$ docker run -dit changed-ubuntu bash
a651680bd6c2ef64902e154eeb8a064b85c9abf08ac46f922ad8dfc11bb5cd8a
$ docker run -dit changed-ubuntu bash
8eb24b3b2d246f225b24f2fca39625aaad71689c392a7b552b78baf264647373
$ docker run -dit changed-ubuntu bash
0ad25d06bdf6fca0dedc38301b2aff7478b3e1ce3d1acd676573bba57cb1cfef
This launches 5 containers based on the `changed-ubuntu` image. As each container is created, Docker adds a writable layer and assigns it a UUID. This is the value returned from the `docker run` command.
2. Run the `docker ps` command to verify the 5 containers are running.
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0ad25d06bdf6 changed-ubuntu "bash" About a minute ago Up About a minute stoic_ptolemy
8eb24b3b2d24 changed-ubuntu "bash" About a minute ago Up About a minute pensive_bartik
a651680bd6c2 changed-ubuntu "bash" 2 minutes ago Up 2 minutes hopeful_turing
9280e777d109 changed-ubuntu "bash" 2 minutes ago Up 2 minutes backstabbing_mahavira
75bab0d54f3c changed-ubuntu "bash" 2 minutes ago Up 2 minutes boring_pasteur
The output above shows 5 running containers, all sharing the `changed-ubuntu` image. Each `CONTAINER ID` is derived from the UUID when creating each container.
3. List the contents of the local storage area.
$ sudo ls /var/lib/docker/containers
0ad25d06bdf6fca0dedc38301b2aff7478b3e1ce3d1acd676573bba57cb1cfef 9280e777d109e2eb4b13ab211553516124a3d4d4280a0edfc7abf75c59024d47
75bab0d54f3cf193cfdc3a86483466363f442fba30859f7dcd1b816b6ede82d4 a651680bd6c2ef64902e154eeb8a064b85c9abf08ac46f922ad8dfc11bb5cd8a
8eb24b3b2d246f225b24f2fca39625aaad71689c392a7b552b78baf264647373
Docker's copy-on-write strategy not only reduces the amount of space consumed by containers, it also reduces the time required to start a container. At start time, Docker only has to create the thin writable layer for each container. The diagram below shows these 5 containers sharing a single read-only (RO) copy of the `changed-ubuntu` image.
![](images/shared-uuid.jpg)
If Docker had to make an entire copy of the underlying image stack each time it
started a new container, container start times and disk space used would be
significantly increased.
## Data volumes and the storage driver
When a container is deleted, any data written to the container that is not stored in a *data volume* is deleted along with the container. A data volume is a directory or file that is mounted directly into a container.
Data volumes are not controlled by the storage driver. Reads and writes to data
volumes bypass the storage driver and operate at native host speeds. You can mount any number of data volumes into a container. Multiple containers can also share one or more data volumes.
The diagram below shows a single Docker host running two containers. Each container exists inside of its own address space within the Docker host's local storage area. There is also a single shared data volume located at `/data` on the Docker host. This is mounted directly into both containers.
![](images/shared-volume.jpg)
The data volume resides outside of the local storage area on the Docker host, further reinforcing its independence from the storage driver's control. When a container is deleted, any data stored in shared data volumes persists on the Docker host.
For detailed information about data volumes, see [Managing data in containers](https://docs.docker.com/userguide/dockervolumes/).
## Related information
* [Select a storage driver](selectadriver.md)
* [AUFS storage driver in practice](aufs-driver.md)
* [BTRFS storage driver in practice](btrfs-driver.md)
* [Device Mapper storage driver in practice](device-mapper-driver.md)

View File

@ -0,0 +1,28 @@
<!--[metadata]>
+++
title = "Docker storage drivers"
description = "Learn how to select the proper storage driver for your container."
keywords = ["container, storage, driver, AUFS, btrfs, devicemapper,zfs"]
[menu.main]
identifier = "mn_storage_docker"
parent = "mn_use_docker"
weight = 7
+++
<![end-metadata]-->
# Docker storage drivers
Docker relies on driver technology to manage the storage and interactions associated with images and the containers that run them. This section contains the following pages:
* [Understand images, containers, and storage drivers](imagesandcontainers.md)
* [Select a storage driver](selectadriver.md)
* [AUFS storage driver in practice](aufs-driver.md)
* [BTRFS storage driver in practice](btrfs-driver.md)
* [Device Mapper storage driver in practice](device-mapper-driver.md)
* [OverlayFS in practice](overlayfs-driver.md)
* [ZFS storage in practice](zfs-driver.md)
If you are new to Docker containers make sure you read ["Understand images, containers, and storage drivers"](imagesandcontainers.md) first. It explains key concepts and technologies that can help you when working with storage drivers.
&nbsp;

View File

@ -0,0 +1,190 @@
<!--[metadata]>
+++
title = "OverlayFS storage in practice"
description = "Learn how to optimize your use of OverlayFS driver."
keywords = ["container, storage, driver, OverlayFS "]
[menu.main]
parent = "mn_storage_docker"
+++
<![end-metadata]-->
# Docker and OverlayFS in practice
OverlayFS is a modern *union filesystem* that is similar to AUFS. In comparison to AUFS, OverlayFS:
* has a simpler design
* has been in the mainline Linux kernel since version 3.18
* is potentially faster
As a result, OverlayFS is rapidly gaining popularity in the Docker community and is seen by many as a natural successor to AUFS. As promising as OverlayFS is, it is still relatively young. Therefore caution should be taken before using it in production Docker environments.
Docker's `overlay` storage driver leverages several OverlayFS features to build and manage the on-disk structures of images and containers.
>**Note**: Since it was merged into the mainline kernel, the OverlayFS *kernel module* was renamed from "overlayfs" to "overlay". As a result you may see the two terms used interchangeably in some documentation. However, this document uses "OverlayFS" to refer to the overall filesystem, and `overlay` to refer to Docker's storage-driver.
## Image layering and sharing with OverlayFS
OverlayFS takes two directories on a single Linux host, layers one on top of the other, and provides a single unified view. These directories are often referred to as *layers* and the technology used to layer them is known as a *union mount*. The OverlayFS terminology is "lowerdir" for the bottom layer and "upperdir" for the top layer. The unified view is exposed through its own directory called "merged".
The diagram below shows how a Docker image and a Docker container are layered. The image layer is the "lowerdir" and the container layer is the "upperdir". The unified view is exposed through a directory called "merged" which is effectively the container's mount point. The diagram shows how Docker constructs map to OverlayFS constructs.
![](images/overlay_constructs.jpg)
Notice how the image layer and container layer can contain the same files. When this happens, the files in the container layer ("upperdir") are dominant and obscure the existence of the same files in the image layer ("lowerdir"). The container mount ("merged") presents the unified view.
OverlayFS only works with two layers. This means that multi-layered images cannot be implemented as multiple OverlayFS layers. Instead, each image layer is implemented as its own directory under `/var/lib/docker/overlay`. Hard links are then used as a space-efficient way to reference data shared with lower layers. The diagram below shows a four-layer image and how it is represented in the Docker host's filesystem.
![](images/overlay_constructs2.jpg)
To create a container, the `overlay` driver combines the directory representing the image's top layer with a new directory for the container. The image's top layer is the "lowerdir" in the overlay and is read-only. The new directory for the container is the "upperdir" and is writable.
## Example: Image and container on-disk constructs
The following `docker images -a` command shows a Docker host with a single image. As can be seen, the image consists of four layers.
$ docker images -a
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
ubuntu latest 1d073211c498 7 days ago 187.9 MB
<none> <none> 5a4526e952f0 7 days ago 187.9 MB
<none> <none> 99fcaefe76ef 7 days ago 187.9 MB
<none> <none> c63fb41c2213 7 days ago 187.7 MB
The output of the following command shows that each of the four image layers has its own directory under `/var/lib/docker/overlay/`.
$ ls -l /var/lib/docker/overlay/
total 24
drwx------ 3 root root 4096 Oct 28 11:02 1d073211c498fd5022699b46a936b4e4bdacb04f637ad64d3475f558783f5c3e
drwx------ 3 root root 4096 Oct 28 11:02 5a4526e952f0aa24f3fcc1b6971f7744eb5465d572a48d47c492cb6bbf9cbcda
drwx------ 5 root root 4096 Oct 28 11:06 99fcaefe76ef1aa4077b90a413af57fd17d19dce4e50d7964a273aae67055235
drwx------ 3 root root 4096 Oct 28 11:01 c63fb41c2213f511f12f294dd729b9903a64d88f098c20d2350905ac1fdbcbba
Each directory is named after the corresponding image layer ID shown by the previous `docker images -a` command. The image layer directories contain the files unique to that layer as well as hard links to the data that is shared with lower layers. This allows for efficient use of disk space.
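The effect of hard links on disk usage can be seen without Docker. The following is a minimal sketch that uses two stand-in "layer" directories in a temporary location; the directory and file names are hypothetical and nothing here touches `/var/lib/docker`.

```shell
#!/bin/sh
# Two stand-in layer directories that share one file through a hard link.
demo=$(mktemp -d)
mkdir -p "$demo/layer-base/root" "$demo/layer-child/root"

# The base layer owns the file.
echo "shared data" > "$demo/layer-base/root/shared.txt"

# The child layer references the same data with a hard link instead of a copy.
ln "$demo/layer-base/root/shared.txt" "$demo/layer-child/root/shared.txt"

# Both names point at one inode, so the data is stored only once.
base_inode=$(ls -i "$demo/layer-base/root/shared.txt" | awk '{print $1}')
child_inode=$(ls -i "$demo/layer-child/root/shared.txt" | awk '{print $1}')
link_count=$(ls -ld "$demo/layer-base/root/shared.txt" | awk '{print $2}')

echo "base inode:  $base_inode"
echo "child inode: $child_inode"
echo "link count:  $link_count"

rm -rf "$demo"
```

The two inode numbers match and the link count is 2, which is why a file shared by several layers does not consume additional data blocks.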
The following `docker ps` command shows the same Docker host running a single container. The container ID is "73de7176c223".
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
73de7176c223 ubuntu "bash" 2 days ago Up 2 days stupefied_nobel
This container exists on-disk in the Docker host's filesystem under `/var/lib/docker/overlay/73de7176c223...`. If you inspect this directory using the `ls -l` command you find the following file and directories.
$ ls -l /var/lib/docker/overlay/73de7176c223a6c82fd46c48c5f152f2c8a7e49ecb795a7197c3bb795c4d879e
total 16
-rw-r--r-- 1 root root 64 Oct 28 11:06 lower-id
drwxr-xr-x 1 root root 4096 Oct 28 11:06 merged
drwxr-xr-x 4 root root 4096 Oct 28 11:06 upper
drwx------ 3 root root 4096 Oct 28 11:06 work
These four filesystem objects are all artifacts of OverlayFS. The "lower-id" file contains the ID of the top layer of the image the container is based on. This is used by OverlayFS as the "lowerdir".
$ cat /var/lib/docker/overlay/73de7176c223a6c82fd46c48c5f152f2c8a7e49ecb795a7197c3bb795c4d879e/lower-id
1d073211c498fd5022699b46a936b4e4bdacb04f637ad64d3475f558783f5c3e
The "upper" directory is the container's read-write layer. Any changes made to the container are written to this directory.
The "merged" directory is effectively the container's mount point. This is where the unified view of the image ("lowerdir") and container ("upperdir") is exposed. Any changes written to the container are immediately reflected in this directory.
The "work" directory is required for OverlayFS to function. It is used for things such as *copy_up* operations.
You can verify all of these constructs from the output of the `mount` command. (Ellipses and line breaks are used in the output below to enhance readability.)
$ mount | grep overlay
overlay on /var/lib/docker/overlay/73de7176c223.../merged
type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay/1d073211c498.../root,
upperdir=/var/lib/docker/overlay/73de7176c223.../upper,
workdir=/var/lib/docker/overlay/73de7176c223.../work)
The output shows that the overlay is mounted as read-write ("rw").
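If you want to pick the individual directories out of that option string in a script, a small helper is enough. This is a sketch only: `mount_opt` is a hypothetical helper name, and the option string is copied from the output above with its elided IDs left as they are.

```shell
#!/bin/sh
# Prints the value of option $1 from a comma-separated mount option string $2.
mount_opt() {
    printf '%s\n' "$2" | tr ',' '\n' | sed -n "s/^$1=//p"
}

# Option string taken from the mount output above (IDs remain truncated).
opts='rw,relatime,lowerdir=/var/lib/docker/overlay/1d073211c498.../root,upperdir=/var/lib/docker/overlay/73de7176c223.../upper,workdir=/var/lib/docker/overlay/73de7176c223.../work'

for name in lowerdir upperdir workdir; do
    printf '%s -> %s\n' "$name" "$(mount_opt "$name" "$opts")"
done
```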
## Container reads and writes with overlay
Consider three scenarios where a container opens a file for read access with overlay.
- **The file does not exist in the container layer**. If a container opens a file for read access and the file does not already exist in the container ("upperdir"), it is read from the image ("lowerdir"). This should incur very little performance overhead.
- **The file only exists in the container layer**. If a container opens a file for read access and the file exists in the container ("upperdir") and not in the image ("lowerdir"), it is read directly from the container.
- **The file exists in the container layer and the image layer**. If a container opens a file for read access and the file exists in the image layer and the container layer, the file's version in the container layer is read. This is because files in the container layer ("upperdir") obscure files with the same name in the image layer ("lowerdir").
Consider some scenarios where files in a container are modified.
- **Writing to a file for the first time**. The first time a container writes to an existing file, that file does not exist in the container ("upperdir"). The `overlay` driver performs a *copy_up* operation to copy the file from the image ("lowerdir") to the container ("upperdir"). The container then writes the changes to the new copy of the file in the container layer.
However, OverlayFS works at the file level, not the block level. This means that all OverlayFS copy_up operations copy entire files, even if the file is very large and only a small part of it is being modified. This can have a noticeable impact on container write performance. However, two things are worth noting:
* The copy_up operation only occurs the first time any given file is written to. Subsequent writes to the same file will operate against the copy of the file already copied up to the container.
* OverlayFS only works with two layers. This means that performance should be better than AUFS which can suffer noticeable latencies when searching for files in images with many layers.
- **Deleting files and directories**. When files are deleted within a container, a *whiteout* file is created in the container's "upperdir". The version of the file in the image layer ("lowerdir") is not deleted. However, the whiteout file in the container obscures it.
Deleting a directory in a container results in an *opaque directory* being created in the "upperdir". This has the same effect as a whiteout file and effectively masks the existence of the directory in the image's "lowerdir".
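The read, write, and delete rules above can be modelled in a few lines of shell. This is a userspace simulation under stated assumptions, not how the kernel implements OverlayFS: the function names are hypothetical, and a `.deleted.<name>` marker file stands in for a real whiteout.

```shell
#!/bin/sh
# Simulated layers in a temporary directory.
WORK=$(mktemp -d)
LOWER="$WORK/lower"   # stands in for the image layer ("lowerdir")
UPPER="$WORK/upper"   # stands in for the container layer ("upperdir")
mkdir -p "$LOWER" "$UPPER"

# Read: a whiteout marker hides the file; otherwise upper wins, then lower.
overlay_read() {
    if [ -e "$UPPER/.deleted.$1" ]; then
        return 1
    elif [ -e "$UPPER/$1" ]; then
        cat "$UPPER/$1"
    elif [ -e "$LOWER/$1" ]; then
        cat "$LOWER/$1"
    else
        return 1
    fi
}

# Write (append): the first write copies the whole file up; later writes reuse the copy.
overlay_append() {
    if [ -e "$UPPER/.deleted.$1" ]; then
        rm -f "$UPPER/.deleted.$1"      # recreating a deleted file starts empty
    elif [ ! -e "$UPPER/$1" ] && [ -e "$LOWER/$1" ]; then
        cp "$LOWER/$1" "$UPPER/$1"      # copy_up of the entire file
    fi
    printf '%s\n' "$2" >> "$UPPER/$1"
}

# Delete: remove any upper copy and mask the lower copy with a marker.
overlay_delete() {
    rm -f "$UPPER/$1"
    if [ -e "$LOWER/$1" ]; then
        : > "$UPPER/.deleted.$1"
    fi
}

echo "from image" > "$LOWER/app.conf"
overlay_read app.conf                     # served from the image layer
overlay_append app.conf "from container"  # triggers the one-time copy_up
overlay_read app.conf                     # served from the container layer
overlay_delete app.conf
overlay_read app.conf || echo "app.conf is hidden from the container"
cat "$LOWER/app.conf"                     # the image copy is untouched
```

The image copy never changes in this model, which mirrors why image layers can be shared safely between containers.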
## Configure Docker with the overlay storage driver
To configure Docker to use the overlay storage driver, your Docker host must be running version 3.18 or newer of the Linux kernel with the overlay kernel module loaded. OverlayFS can operate on top of most supported Linux filesystems. However, ext4 is currently recommended for use in production environments.
The following procedure shows you how to configure your Docker host to use OverlayFS. The procedure assumes that the Docker daemon is in a stopped state.
> **Caution:** If you have already run the Docker daemon on your Docker host and have images you want to keep, `push` them to Docker Hub or your private Docker Trusted Registry before attempting this procedure.
1. If it is running, stop the Docker `daemon`.
2. Verify your kernel version and that the overlay kernel module is loaded.
$ uname -r
3.19.0-21-generic
$ lsmod | grep overlay
overlay
3. Start the Docker daemon with the `overlay` storage driver.
$ docker daemon --storage-driver=overlay &
[1] 29403
root@ip-10-0-0-174:/home/ubuntu# INFO[0000] Listening for HTTP on unix (/var/run/docker.sock)
INFO[0000] Option DefaultDriver: bridge
INFO[0000] Option DefaultNetwork: bridge
<output truncated>
Alternatively, you can force the Docker daemon to automatically start with
the `overlay` driver by editing the Docker config file and adding the
`--storage-driver=overlay` flag to the `DOCKER_OPTS` line. Once this option
is set you can start the daemon using normal startup scripts without having
to manually pass in the `--storage-driver` flag.
4. Verify that the daemon is using the `overlay` storage driver.
$ docker info
Containers: 0
Images: 0
Storage Driver: overlay
Backing Filesystem: extfs
<output truncated>
Notice that the *Backing filesystem* in the output above is showing as `extfs`. Multiple backing filesystems are supported but `extfs` (ext4) is recommended for production use cases.
Your Docker host is now using the `overlay` storage driver. If you run the `mount` command, you'll find Docker has automatically created the `overlay` mount with the required "lowerdir", "upperdir", "merged" and "workdir" constructs.
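The kernel version check from step 2 can be scripted so that it runs before the daemon is started. The following is a minimal sketch; `kernel_at_least` is a hypothetical helper that compares only the major and minor numbers of a release string such as the one printed by `uname -r`. It does not check whether the overlay module is loaded, which still needs `lsmod | grep overlay` as shown in the procedure.

```shell
#!/bin/sh
# Succeeds when kernel release $1 is at least version $2.$3.
kernel_at_least() {
    kv_release=${1%%-*}              # drop a suffix such as "-21-generic"
    kv_major=${kv_release%%.*}
    kv_rest=${kv_release#*.}
    kv_minor=${kv_rest%%.*}
    [ "$kv_major" -gt "$2" ] || { [ "$kv_major" -eq "$2" ] && [ "$kv_minor" -ge "$3" ]; }
}

if kernel_at_least "$(uname -r)" 3 18; then
    echo "kernel $(uname -r) is new enough for the overlay driver"
else
    echo "kernel $(uname -r) is older than 3.18"
fi
```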
## OverlayFS and Docker Performance
As a general rule, the `overlay` driver should be fast, almost certainly faster than `aufs` and `devicemapper`. In certain circumstances it may also be faster than `btrfs`. That said, there are a few things to be aware of relative to the performance of Docker using the `overlay` storage driver.
- **Page Caching**. OverlayFS supports page cache sharing. This means multiple containers accessing the same file can share a single page cache entry (or entries). This makes the `overlay` driver efficient with memory and a good option for PaaS and other high density use cases.
- **copy_up**. As with AUFS, OverlayFS has to perform copy_up operations any time a container writes to a file for the first time. This can insert latency into the write operation, especially if the file being copied up is large. However, once the file has been copied up, all subsequent writes to that file occur without the need for further copy_up operations.
The OverlayFS copy_up operation should be faster than the same operation with AUFS. This is because AUFS supports more layers than OverlayFS and it is possible to incur far larger latencies if searching through many AUFS layers.
- **RPMs and Yum**. OverlayFS only implements a subset of the POSIX standards. This can result in certain OverlayFS operations breaking POSIX standards. One such operation is the *copy-up* operation. Therefore, using `yum` inside of a container on a Docker host using the `overlay` storage driver is unlikely to work without implementing workarounds.
- **Inode limits**. Use of the `overlay` storage driver can cause excessive inode consumption. This is especially so as the number of images and containers on the Docker host grows. A Docker host with a large number of images and lots of started and stopped containers can quickly run out of inodes.
Unfortunately you can only specify the number of inodes in a filesystem at the time of creation. For this reason, you may wish to consider putting `/var/lib/docker` on a separate device with its own filesystem or manually specifying the number of inodes when creating the filesystem.
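Inode consumption is easy to watch with `df -i`. The sketch below assumes the GNU `df -i` column layout (filesystem, inodes, used, free, use%, mount point); `inode_alert` is a hypothetical helper, and the sample figures are made up for illustration.

```shell
#!/bin/sh
# Reads `df -i` style output on stdin and prints each mount point whose
# inode usage is at or above $1 percent.
inode_alert() {
    awk -v limit="$1" 'NR > 1 && $5 ~ /^[0-9]+%$/ {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Made-up sample; on a real host use: df -i /var/lib/docker | inode_alert 80
inode_alert 80 <<'EOF'
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/xvda1      524288  60000  464288   12% /
/dev/xvdf       262144 240000   22144   92% /var/lib/docker
EOF
```

If you create a dedicated ext4 filesystem for `/var/lib/docker`, the inode count can be set at creation time with the `-N` option of `mkfs.ext4`.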
The following generic performance best practices also apply to OverlayFS.
- **Solid State Devices (SSD)**. For best performance it is always a good idea to use fast storage media such as solid state devices (SSD).
- **Use Data Volumes**. Data volumes provide the best and most predictable performance. This is because they bypass the storage driver and do not incur any of the potential overheads introduced by thin provisioning and copy-on-write. For this reason, you may want to place heavy write workloads on data volumes.

@ -0,0 +1,119 @@
<!--[metadata]>
+++
title = "Select a storage driver"
description = "Learn how to select the proper storage driver for your container."
keywords = ["container, storage, driver, AUFS, btrfs, devicemapper, zfs"]
[menu.main]
parent = "mn_storage_docker"
weight = -1
+++
<![end-metadata]-->
# Select a storage driver
This page describes Docker's storage driver feature. It lists the storage
drivers that Docker supports and the basic commands associated with managing them. Finally, this page provides guidance on choosing a storage driver.
The material on this page is intended for readers who already have an [understanding of the storage driver technology](imagesandcontainers.md).
## A pluggable storage driver architecture
Docker has a pluggable storage driver architecture. This gives you the flexibility to "plug in" the storage driver that is best for your environment and use case. Each Docker storage driver is based on a Linux filesystem or volume manager. Further, each storage driver is free to implement the management of image layers and the container layer in its own unique way. This means some storage drivers perform better than others in different circumstances.
Once you decide which driver is best, you set this driver on the Docker daemon at start time. As a result, the Docker daemon can only run one storage driver, and all containers created by that daemon instance use the same storage driver. The table below shows the supported storage driver technologies and the driver names:
|Technology |Storage driver name |
|--------------|---------------------|
|OverlayFS |`overlay` |
|AUFS |`aufs` |
|BTRFS |`btrfs` |
|Device Mapper |`devicemapper` |
|VFS* |`vfs` |
|ZFS |`zfs` |
To find out which storage driver is set on the daemon, use the `docker info` command:
$ docker info
Containers: 0
Images: 0
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.19.0-15-generic
Operating System: Ubuntu 15.04
... output truncated ...
The `info` subcommand reveals that the Docker daemon is using the `overlay` storage driver with a `Backing Filesystem` value of `extfs`. The `extfs` value means that the `overlay` storage driver is operating on top of an existing (ext) filesystem. The backing filesystem refers to the filesystem that was used to create the Docker host's local storage area under `/var/lib/docker`.
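When you need these values in a script, you can filter the `docker info` text. The following sketch assumes the `Key: value` line layout shown above; `info_field` is a hypothetical helper, and the sample text is copied from that output so the example runs without a daemon.

```shell
#!/bin/sh
# Prints the value of the first "KEY: value" line on stdin whose key equals $1.
# Leading whitespace is ignored, and values may themselves contain colons.
info_field() {
    awk -v key="$1" '{
        line = $0
        sub(/^[ \t]+/, "", line)
        prefix = key ": "
        if (index(line, prefix) == 1) {
            print substr(line, length(prefix) + 1)
            exit
        }
    }'
}

# Sample copied from the output above; on a real host use:
#   docker info | info_field "Storage Driver"
sample='Containers: 0
Images: 0
Storage Driver: overlay
 Backing Filesystem: extfs'

printf '%s\n' "$sample" | info_field "Storage Driver"
printf '%s\n' "$sample" | info_field "Backing Filesystem"
```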
Which storage driver you use depends, in part, on the backing filesystem you plan to use for your Docker host's local storage area. Some storage drivers can operate on top of different backing filesystems. However, other storage drivers require the backing filesystem to be the same as the storage driver. For example, the `btrfs` storage driver requires a `btrfs` backing filesystem. The following table lists each storage driver and whether it must match the host's backing filesystem:
|Storage driver |Must match backing filesystem |
|---------------|------------------------------|
|overlay |No |
|aufs |No |
|btrfs |Yes |
|devicemapper |No |
|vfs* |No |
|zfs |Yes |
You set the storage driver by passing the `--storage-driver=<name>` option to the `docker daemon` command line, or by setting the option on the `DOCKER_OPTS` line in the `/etc/default/docker` file.
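As an illustration of the second approach, the line in the Docker config file might look like the fragment below. It assumes a distribution whose startup scripts read `DOCKER_OPTS` from that file (for example `/etc/default/docker` on Ubuntu) and that no other options are already set on the line.

```shell
# Docker config file fragment: start the daemon with the devicemapper driver.
DOCKER_OPTS="--storage-driver=devicemapper"
```

If the line already carries other flags, add `--storage-driver=<name>` to the existing quoted string instead of replacing it.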
The following command shows how to start the Docker daemon with the `devicemapper` storage driver using the `docker daemon` command:
$ docker daemon --storage-driver=devicemapper &
$ docker info
Containers: 0
Images: 0
Storage Driver: devicemapper
Pool Name: docker-252:0-147544-pool
Pool Blocksize: 65.54 kB
Backing Filesystem: extfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 1.821 GB
Data Space Total: 107.4 GB
Data Space Available: 3.174 GB
Metadata Space Used: 1.479 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.146 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.90 (2014-09-01)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.19.0-15-generic
Operating System: Ubuntu 15.04
<output truncated>
Your choice of storage driver can affect the performance of your containerized applications, so it's important to understand the different storage driver options available and select the right one for your application. Later in this page, you'll find some advice for choosing an appropriate driver.
## Shared storage systems and the storage driver
Many enterprises consume storage from shared storage systems such as SAN and NAS arrays. These often provide increased performance and availability, as well as advanced features such as thin provisioning, deduplication and compression.
The Docker storage driver and data volumes can both operate on top of storage provided by shared storage systems. This allows Docker to leverage the increased performance and availability these systems provide. However, Docker does not integrate with these underlying systems.
Remember that each Docker storage driver is based on a Linux filesystem or volume manager. Be sure to follow existing best practices for operating your storage driver (filesystem or volume manager) on top of your shared storage system. For example, if using the ZFS storage driver on top of *XYZ* shared storage system, be sure to follow best practices for operating ZFS filesystems on top of XYZ shared storage system.
## Which storage driver should you choose?
As you might expect, the answer to this question is "it depends". While there are some clear cases where one particular storage driver outperforms others for certain workloads, you should factor all of the following into your decision:
Choose a storage driver that you and your team/organization are comfortable with. Consider how much experience you have with a particular storage driver. There is no substitute for experience and it is rarely a good idea to try something brand new in production. That's what labs and laptops are for!
If your Docker infrastructure is under support contracts, choose an option that will get you good support. You probably don't want to go with a solution that your support partners have little or no experience with.
Whichever driver you choose, make sure it has strong community support and momentum. This is important because storage driver development in the Docker project relies on the community as much as the Docker staff to thrive.
## Related information
* [Understand images, containers, and storage drivers](imagesandcontainers.md)
* [AUFS storage driver in practice](aufs-driver.md)
* [BTRFS storage driver in practice](btrfs-driver.md)
* [Device Mapper storage driver in practice](device-mapper-driver.md)

@ -0,0 +1,218 @@
<!--[metadata]>
+++
title = "ZFS storage in practice"
description = "Learn how to optimize your use of ZFS driver."
keywords = ["container, storage, driver, ZFS "]
[menu.main]
parent = "mn_storage_docker"
+++
<![end-metadata]-->
# Docker and ZFS in practice
ZFS is a next-generation filesystem that supports many advanced storage technologies such as volume management, snapshots, checksumming, compression, deduplication, replication, and more.
It was created by Sun Microsystems (now Oracle Corporation) and is open sourced under the CDDL license. Due to licensing incompatibilities between the CDDL and GPL, ZFS cannot be shipped as part of the mainline Linux kernel. However, the ZFS On Linux (ZoL) project provides an out-of-tree kernel module and userspace tools which can be installed separately.
The ZFS on Linux (ZoL) port is healthy and maturing. However, at this point in time it is not recommended to use the `zfs` Docker storage driver for production use unless you have substantial experience with ZFS on Linux.
> **Note:** There is also a FUSE implementation of ZFS on the Linux platform. This should work with Docker but is not recommended. The native ZFS driver (ZoL) is better tested, more performant, and more widely used. The remainder of this document relates to the native ZoL port.
## Image layering and sharing with ZFS
The Docker `zfs` storage driver makes extensive use of three types of ZFS dataset:
- filesystems
- snapshots
- clones
ZFS filesystems are thinly provisioned and have space allocated to them from a ZFS pool (zpool) via allocate-on-demand operations. Snapshots and clones are space-efficient point-in-time copies of ZFS filesystems. Snapshots are read-only. Clones are read-write. Clones can only be created from snapshots. This simple relationship is shown in the diagram below.
![](images/zfs_clones.jpg)
The solid line in the diagram shows the process flow for creating a clone. Step 1 creates a snapshot of the filesystem, and step 2 creates the clone from the snapshot. The dashed line shows the relationship between the clone and the filesystem, via the snapshot. All three ZFS datasets draw space from the same underlying zpool.
On Docker hosts using the `zfs` storage driver, the base layer of an image is a ZFS filesystem. Each child layer is a ZFS clone based on a ZFS snapshot of the layer below it. A container is a ZFS clone based on a ZFS snapshot of the top layer of the image it's created from. All ZFS datasets draw their space from a common zpool. The diagram below shows how this is put together with a running container based on a two-layer image.
![](images/zfs_zpool.jpg)
The following process explains how images are layered and containers created. The process is based on the diagram above.
1. The base layer of the image exists on the Docker host as a ZFS filesystem.
This filesystem consumes space from the zpool used to create the Docker host's local storage area at `/var/lib/docker`.
2. Each additional image layer is a clone of the dataset hosting the image layer directly below it.
In the diagram, "Layer 1" is added by making a ZFS snapshot of the base layer and then creating a clone from that snapshot. The clone is writable and consumes space on-demand from the zpool. The snapshot is read-only, maintaining the base layer as an immutable object.
3. When the container is launched, a read-write layer is added above the image.
In the diagram above, the container's read-write layer is created by making a snapshot of the top layer of the image (Layer 1) and creating a clone from that snapshot.
As changes are made to the container, space is allocated to it from the zpool via allocate-on-demand operations. By default, ZFS will allocate space in blocks of 128K.
This process of creating child layers and containers from *read-only* snapshots allows images to be maintained as immutable objects.
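The snapshot-then-clone pattern described above maps onto two ZFS commands, `zfs snapshot` and `zfs clone`. The sketch below only prints the commands rather than running them, so it needs neither root nor ZFS. The `derive_layer` helper, the dataset names, and the snapshot naming scheme are all hypothetical; the names Docker actually generates will differ.

```shell
#!/bin/sh
# Prints the commands that derive a writable child dataset ($2) from a parent ($1):
# a read-only snapshot of the parent, then a clone of that snapshot.
derive_layer() {
    dl_snapshot="$1@for-${2##*/}"
    echo zfs snapshot "$dl_snapshot"
    echo zfs clone "$dl_snapshot" "$2"
}

# Base layer -> Layer 1 -> container, following the diagram above.
derive_layer zpool-docker/docker/base zpool-docker/docker/layer1
derive_layer zpool-docker/docker/layer1 zpool-docker/docker/container
```

Each child is created from a snapshot rather than from the live parent, which is what keeps the parent layer immutable.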
## Container reads and writes with ZFS
Container reads with the `zfs` storage driver are very simple. A newly launched container is based on a ZFS clone. This clone initially shares all of its data with the dataset it was created from. This means that read operations with the `zfs` storage driver are fast, even if the data being read has not been copied into the container yet. This sharing of data blocks is shown in the diagram below.
![](images/zpool_blocks.jpg)
Writing new data to a container is accomplished via an allocate-on-demand operation. Every time a new area of the container needs writing to, a new block is allocated to the container (ZFS clone) from the underlying zpool. This means that containers consume additional space as new data is written to them.
Updating *existing data* in a container is accomplished by allocating new blocks to the container's clone and storing the changed data in those new blocks. The original blocks are unchanged, allowing the underlying image dataset to remain immutable. This is the same as writing to a normal ZFS filesystem and is an implementation of copy-on-write semantics.
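To get a feel for how much new space a small update allocates, the arithmetic can be sketched as follows. This is a simplification that assumes fixed 128K blocks (131072 bytes, the default mentioned earlier) and ignores metadata; `blocks_touched` is a hypothetical helper.

```shell
#!/bin/sh
# Prints how many fixed-size blocks a write of $2 bytes at byte offset $1 touches.
# The block size defaults to 131072 bytes (128K) and can be overridden with $3.
blocks_touched() {
    bt_size=${3:-131072}
    bt_first=$(( $1 / bt_size ))
    bt_last=$(( ($1 + $2 - 1) / bt_size ))
    echo $(( bt_last - bt_first + 1 ))
}

# A 4096-byte update that fits inside one block.
blocks_touched 1000000 4096
# The same update straddling a block boundary.
blocks_touched 131000 4096
```

Under these assumptions, only the blocks that the update touches are newly allocated, regardless of how large the file is.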
## Configure Docker with the ZFS storage driver
The `zfs` storage driver is only supported on a Docker host where `/var/lib/docker` is mounted as a ZFS filesystem. This section shows you how to install and configure native ZFS on Linux (ZoL) on an Ubuntu 14.04 system.
### Prerequisites
If you have already used the Docker daemon on your Docker host and have images you want to keep, `push` them to Docker Hub or your private Docker Trusted Registry before attempting this procedure.
Stop the Docker daemon. Then, ensure that you have a spare block device at `/dev/xvdb`. The device identifier may be different in your environment and you should substitute your own values throughout the procedure.
### Install ZFS on Ubuntu 14.04 LTS
1. If it is running, stop the Docker `daemon`.
1. Install the `software-properties-common` package.
This is required for the `add-apt-repository` command.
$ sudo apt-get install software-properties-common
Reading package lists... Done
Building dependency tree
<output truncated>
2. Add the `zfs-native` package archive.
$ sudo add-apt-repository ppa:zfs-native/stable
The native ZFS filesystem for Linux. Install the ubuntu-zfs package.
<output truncated>
gpg: key F6B0FC61: public key "Launchpad PPA for Native ZFS for Linux" imported
gpg: Total number processed: 1
gpg: imported: 1 (RSA: 1)
OK
3. Get the latest package lists for all registered repositories and package archives.
$ sudo apt-get update
Ign http://us-west-2.ec2.archive.ubuntu.com trusty InRelease
Get:1 http://us-west-2.ec2.archive.ubuntu.com trusty-updates InRelease [64.4 kB]
<output truncated>
Fetched 10.3 MB in 4s (2,370 kB/s)
Reading package lists... Done
4. Install the `ubuntu-zfs` package.
$ sudo apt-get install -y ubuntu-zfs
Reading package lists... Done
Building dependency tree
<output truncated>
5. Load the `zfs` module.
$ sudo modprobe zfs
6. Verify that it loaded correctly.
$ lsmod | grep zfs
zfs 2768247 0
zunicode 331170 1 zfs
zcommon 55411 1 zfs
znvpair 89086 2 zfs,zcommon
spl 96378 3 zfs,zcommon,znvpair
zavl 15236 1 zfs
## Configure ZFS for Docker
Once ZFS is installed and loaded, you're ready to configure ZFS for Docker.
1. Create a new `zpool`.
$ sudo zpool create -f zpool-docker /dev/xvdb
The command creates the `zpool` and gives it the name "zpool-docker". The name is arbitrary.
2. Check that the `zpool` exists.
$ sudo zfs list
NAME USED AVAIL REFER MOUNTPOINT
zpool-docker 55K 3.84G 19K /zpool-docker
3. Create and mount a new ZFS filesystem to `/var/lib/docker`.
$ sudo zfs create -o mountpoint=/var/lib/docker zpool-docker/docker
4. Check that the previous step worked.
$ sudo zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
zpool-docker 93.5K 3.84G 19K /zpool-docker
zpool-docker/docker 19K 3.84G 19K /var/lib/docker
Now that you have a ZFS filesystem mounted to `/var/lib/docker`, the daemon should automatically load with the `zfs` storage driver.
5. Start the Docker daemon.
$ sudo service docker start
docker start/running, process 2315
The procedure for starting the Docker daemon may differ depending on the
Linux distribution you are using. It is possible to force the Docker daemon
to start with the `zfs` storage driver by passing the `--storage-driver=zfs`
flag to the `docker daemon` command, or to the `DOCKER_OPTS` line in the
Docker config file.
6. Verify that the daemon is using the `zfs` storage driver.
$ sudo docker info
Containers: 0
Images: 0
Storage Driver: zfs
Zpool: zpool-docker
Zpool Health: ONLINE
Parent Dataset: zpool-docker/docker
Space Used By Parent: 27648
Space Available: 4128139776
Parent Quota: no
Compression: off
Execution Driver: native-0.2
[...]
The output of the command above shows that the Docker daemon is using the
`zfs` storage driver and that the parent dataset is the `zpool-docker/docker`
filesystem created earlier.
Your Docker host is now using ZFS to store and manage its images and containers.
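Note that `docker info` reports space in bytes, while `zfs list` uses abbreviated units. A quick conversion shows that the two outputs above agree. The `bytes_to_gib` helper is a hypothetical name, and the conversion assumes binary units (1G = 1024^3 bytes).

```shell
#!/bin/sh
# Converts a byte count to GiB with two decimal places.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

# "Space Available" from the docker info output above.
bytes_to_gib 4128139776   # prints 3.84, matching the AVAIL column of zfs list
```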
## ZFS and Docker performance
There are several factors that influence the performance of Docker using the `zfs` storage driver.
- **Memory**. Memory has a major impact on ZFS performance. This goes back to the fact that ZFS was originally designed for use on big Sun Solaris servers with large amounts of memory. Keep this in mind when sizing your Docker hosts.
- **ZFS Features**. Using ZFS features, such as deduplication, can significantly increase the amount
of memory ZFS uses. For memory consumption and performance reasons it is
recommended to turn off ZFS deduplication. However, deduplication at other
layers in the stack (such as SAN or NAS arrays) can still be used as these do
not impact ZFS memory usage and performance. If using SAN, NAS or other hardware
RAID technologies you should continue to follow existing best practices for
using them with ZFS.
- **ZFS Caching**. ZFS caches disk blocks in a memory structure called the adaptive replacement cache (ARC). The *Single Copy ARC* feature of ZFS allows a single cached copy of a block to be shared by multiple clones of a filesystem. This means that multiple running containers can share a single copy of a cached block, which makes ZFS a good option for PaaS and other high density use cases.
- **Fragmentation**. Fragmentation is a natural byproduct of copy-on-write filesystems like ZFS. However, ZFS writes in 128K blocks and allocates *slabs* (multiple 128K blocks) to CoW operations in an attempt to reduce fragmentation. The ZFS intent log (ZIL) and the coalescing of writes (delayed writes) also help to reduce fragmentation.
- **Use the native ZFS driver for Linux**. Although the Docker `zfs` storage driver supports the ZFS FUSE implementation, it is not recommended when high performance is required. The native ZFS on Linux driver tends to perform better than the FUSE implementation.
The following generic performance best practices also apply to ZFS.
- **Use of SSD**. For best performance it is always a good idea to use fast storage media such as solid state devices (SSD). However, if you only have a limited amount of SSD storage available it is recommended to place the ZIL on SSD.
- **Use Data Volumes**. Data volumes provide the best and most predictable performance. This is because they bypass the storage driver and do not incur any of the potential overheads introduced by thin provisioning and copy-on-write. For this reason, you may want to place heavy write workloads on data volumes.