Part of the benefit of containers is their temporary, transient nature. A container can be spun up in seconds, used for just as long as necessary and then terminated with minimal overhead on system resources. Unfortunately, this ephemeral nature also applies to the data within the container itself, and that can be a problem.
Initially, it was assumed application data could be injected into the container at startup and extracted at shutdown, or that the container could access resources over NFS. Neither of these approaches is practical for security and usability reasons, so Docker (the company) had to do something to address the need for persistent storage with containers. Here are four options for supporting persistent storage in Docker.
Docker data volumes
Docker data volumes provide the ability to create a resource that can be used to persistently store and retrieve data within a container. Data volume functionality was significantly enhanced in Docker version 1.9, which added the ability to assign meaningful names to a volume, list volumes, and list the containers associated with a volume.
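As a brief sketch of the Docker 1.9 volume subcommands (the volume name appdata and image name myapp are illustrative):

    docker volume create --name appdata    # create a named volume
    docker volume ls                       # list all volumes on the host
    docker volume inspect appdata          # show details, including the path on the host
    docker run -d -v appdata:/data myapp   # mount the named volume at /data in a container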
Data volumes are a step forward from storing data within the container itself and offer better performance for the application. A running container is built from a snapshot of the base container image using file-based copy-on-write (CoW) techniques, so any data stored natively in the container incurs significant management overhead. Data volumes sit outside this CoW mechanism and exist on the host filesystem, so they're more efficient to read from and write to.
However, there are issues with using data volumes. For example, a volume can't be attached to a container that is already running, and because volumes are not removed automatically when their containers are deleted, a volume can end up orphaned.
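For instance, volumes no longer referenced by any container can be found with Docker's dangling filter and then removed by name:

    docker volume ls -f dangling=true   # list orphaned volumes
    docker volume rm appdata            # remove an orphaned volume by name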
Data volume container
An alternative solution is to use a dedicated container to host a volume and to mount that volume space to other containers -- a so-called data volume container. In this technique, the volume container outlasts the application containers and can be used to share data between more than one container at the same time.
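A minimal sketch of the pattern, assuming illustrative names (datastore for the volume container, myapp for the application image):

    docker create -v /data --name datastore busybox /bin/true   # container exists only to own /data
    docker run -d --volumes-from datastore --name app1 myapp    # first container sharing /data
    docker run -d --volumes-from datastore --name app2 myapp    # second container sharing the same data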
Having a long-running container to store data provides other opportunities. For instance, a backup container can be spun up to copy or back up the data in the container volume. In both of the above scenarios, the volume sits within the file structure of the Docker installation, typically under /var/lib/docker/volumes. This means you can use standard tools to access this data, but beware: Docker provides no locking or security mechanisms to maintain data integrity.
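For example, a throwaway container can archive the shared volume to the host with standard tools (names and paths here are illustrative):

    docker run --rm --volumes-from datastore -v $(pwd):/backup busybox \
        tar cvf /backup/data-backup.tar /data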
Directory mounts
A third option for persistent data is to mount a local host directory into a container. This goes a step further than the methods described above in that the source directory can be any directory on the host running the container, rather than one under the Docker volumes folder. At container start time, the volume and mount point are specified on the docker run command, providing a directory within the container that can be used by the application, e.g., /data.
This approach improves on the previous volume options because the directory can be used by one or more containers at the same time and can be populated with data before a new container starts. It's possible to mount multiple directories at once, so one could hold application code or the pages of a website while another is used to read and write application data, as in the sketch below.
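A sketch of both cases, assuming illustrative host paths; the second command uses the standard nginx image, which serves content from /usr/share/nginx/html:

    docker run -d -v /srv/appdata:/data myapp          # single host directory mounted at /data
    docker run -d \
        -v /srv/website/html:/usr/share/nginx/html \   # website pages
        -v /srv/website/data:/data \                   # read/write application data
        nginx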
Directory mounts are very powerful and currently unrestricted in terms of the access they offer to the host file system. An administrator could, for example, mount critical host directories into a container, where they could then be overwritten. Mitigating this risk requires applying some common sense or stricter security rules, such as SELinux mandatory access controls.
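Mount options provide some of that control: a directory can be mounted read-only, and on SELinux-enabled hosts the z/Z suffixes relabel content so containers can access it safely (paths again illustrative):

    docker run -d -v /srv/website/html:/usr/share/nginx/html:ro nginx   # read-only mount
    docker run -d -v /srv/appdata:/data:Z myapp                         # private SELinux label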
Storage plugins
Probably the most interesting development for persistent storage has been the ability to connect to external storage platforms through storage plugins. The plugin architecture provides an interface and API that lets storage vendors build drivers that automate the creation and mapping of storage from external arrays and appliances into Docker, where it can be assigned to a container.
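With a plugin installed, a volume is created by naming its driver; the driver name vendor-driver below is a placeholder for whatever name the vendor's plugin registers:

    docker volume create --driver vendor-driver --name array-vol   # provision storage on the external array
    docker run -d -v array-vol:/data myapp                         # attach it like any other volume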
Today there are plugins to automate storage provisioning from HPE 3PAR, EMC (ScaleIO, XtremIO, VMAX, Isilon) and NetApp. There are also plugins to support storage from public cloud providers, such as Azure File Storage and Google Cloud Platform. More details can be found on Docker's plugin page.
Plugins map storage from a single host to an external storage source, typically an appliance or array. However, if a container is moved to another host for load balancing or failover, that storage association is lost. ClusterHQ has developed a platform called Flocker that manages and automates the process of moving a container's volume to another host along with it. Many storage vendors, including Hedvig, Nexenta, Kaminario, EMC, Dell, NetApp and Pure Storage, have chosen to write to the Flocker API, providing resilient storage and clustered container support within a single data center.
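As a rough sketch of the Flocker workflow (hostnames and file names are illustrative, and the exact syntax depends on the Flocker release): the deployment file maps containers to nodes, so moving a container is a matter of editing that mapping and redeploying, at which point Flocker migrates the associated volume.

    flocker-deploy control.example.com deployment.yml application.yml
    # edit deployment.yml to map the container to a different node, then rerun:
    flocker-deploy control.example.com deployment.yml application.yml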
There are now many options for supporting persistent storage in Docker. The next step will be support for storage that is distributed geographically. At that point, both data and applications will have become truly portable.