Docker allows researchers to document, preserve, and share the computational environment used in their research project.
There are two ways researchers can use Docker.
Beginners will want to start by setting up Docker and running existing containers.
More advanced users may want to create their own containers.
Docker is an open software platform that is used to create containers. "A container [...] packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings."
In terms of reproducible research, a container platform such as Docker can be used to document and share the computational environment used for your research.
Docker provides several products:
Installation
| Product | Linux | macOS | Windows |
| --- | --- | --- | --- |
| Docker Engine | Several distributions are available, installed through the CLI | Available through Docker Desktop | Available through Docker Desktop |
| Docker Compose | Several distributions are available, installed through the CLI | Available through Docker Desktop | Available through Docker Desktop |
| Docker Desktop | Not available for Linux | Mojave and above | Windows 10 |
| Docker Hub | Available through a browser | Available through a browser or Docker Desktop | Available through a browser or Docker Desktop |
The installation of Docker Desktop on macOS is fairly straightforward. When installing Docker Desktop on Windows 10, you might be prompted to answer a question: use Hyper-V or switch to WSL2? So, what are Hyper-V and WSL2? Without getting too in depth, Hyper-V is a toolkit in Windows for creating virtual machines; most people would use Hyper-V to create a VM to run Linux. WSL2 (Windows Subsystem for Linux 2) is itself a virtual machine running a Linux kernel. I think the most important thing to know when making this choice is that Docker favors WSL2 and has stated it will eventually replace Hyper-V with WSL2. The second important thing to know is that some containers may rely on Hyper-V, but until Docker completely phases out Hyper-V, you can still switch back to it if needed.
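If you are not sure whether WSL2 is already set up before installing Docker Desktop, a quick check from PowerShell might look like the following (these are standard WSL commands, not part of Docker, and assume Windows 10 version 2004 or later):

# list installed Linux distributions and which WSL version each one uses
wsl --list --verbose
# make WSL2 the default for any distributions installed later
wsl --set-default-version 2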
Running Docker images
A lot of the basics about Docker can be learned through their tutorial, which is launched by running
docker run -d -p 80:80 docker/getting-started
After running the above command from the Command Line Interface (CLI), you can access the tutorial by going to http://localhost/tutorial/ in a browser. Alternatively, you can launch the tutorial from Docker Desktop.
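In that command, the -d flag runs the container detached (in the background) and -p 80:80 maps port 80 on your computer to port 80 in the container. When you are finished, you can find and stop the tutorial container from the CLI; a minimal sketch:

# list running containers and note the container ID of docker/getting-started
docker ps
# stop the container, replacing CONTAINER_ID with the ID from the previous command
docker stop CONTAINER_ID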
Data created or used by containers can be made persistent by mounting a volume. Basically, as you launch a container, you tell Docker to connect a directory on your computer with a directory in the container to save and share data.
If you went through the Docker tutorial episode Persisting our DB, you will have created a volume titled todo-db and then run the command
docker run -dp 3000:3000 -v todo-db:/etc/todos getting-started
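If you skipped that episode, the named volume can be created explicitly before running the container (Docker will also create it automatically the first time it is referenced):

# create a named volume for the to-do database
docker volume create todo-db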
The tutorial also shows you how to locate where the volume is stored by using the inspect command
docker volume inspect todo-db
[
    {
        "CreatedAt": "2019-09-26T02:18:36Z",
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/todo-db/_data",
        "Name": "todo-db",
        "Options": {},
        "Scope": "local"
    }
]
Working in Windows
Some of you running Windows might ask, "But where is /var/lib/docker/volumes/todo-db/_data?", because the tutorial makes it sound like you can easily access the data. But you can't easily access that directory on Windows 10!
The data isn't really on your computer, at least not in the Windows OS. It's actually inside the Linux virtual machine that Docker runs in, and since containers (and that virtual machine) run separately from everything else unless you specify otherwise, you can't easily get the data out of the container.
While this isn't very useful for researchers who want to get data into or out of the container, there is a solution! Docker allows you to specify the location of the directory you want to mount.
Using the previous getting-started to-do list example in the tutorial, I'm going to create a data sub-directory in my tutorial directory to save the data. The full path is C:\Users\Kat Koziar\Desktop\docker\tutorial\data, so the Docker command is now
docker run -dp 3000:3000 -v "C:\Users\Kat Koziar\Desktop\docker\tutorial\data:/etc/todos" getting-started
Notice the double quotes around the directory path? This is because there is a space in the directory name Kat Koziar that I am unable to remove. The quotes tell the computer to read the space as part of the directory path; otherwise, it reads the space as separating commands, flags, or input variables.
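If your path contains no spaces, the quotes are optional. For example, with a hypothetical path such as C:\docker\tutorial\data, the command could be written as

docker run -dp 3000:3000 -v C:\docker\tutorial\data:/etc/todos getting-started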
What is $(pwd)?
You may see the term $(pwd) in commands. What is it? You might be tempted to think pwd is shorthand for password, but it stands for Print Working Directory. pwd is a Bash command that returns the path of the directory you are currently in. If I am in the command line interface and navigate to the tutorial directory (from the Docker command example above), executing pwd returns C:\Users\Kat Koziar\Desktop\docker\tutorial. When the command is surrounded by parentheses and preceded by a dollar sign, as in $(pwd), the shell runs the command in the parentheses and substitutes the result into the rest of the command. So,
docker run -dp 3000:3000 -v "$(pwd)\data:/etc/todos" getting-started
is the same as
docker run -dp 3000:3000 -v "C:\Users\Kat Koziar\Desktop\docker\tutorial\data:/etc/todos" getting-started
but with less typing. It is important to keep track of which directory you're currently in when using $(pwd) in a command.
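A quick way to see what $(pwd) will expand to before using it in a longer command is to echo it (the exact format of the path may differ between PowerShell, Git Bash, and other shells):

# prints the current working directory, e.g. C:\Users\Kat Koziar\Desktop\docker\tutorial
echo "$(pwd)"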
Containers are created using a Dockerfile, which is a plain text file with no extension that contains a sequence of commands Docker executes in order to build the container. The Docker tutorial demonstrates building a container using a Dockerfile, with a passing reference to the contents of the file. Docker also provides long and very technical documentation in their reference library. Microsoft documentation provides an introduction to Dockerfiles that covers the basics and is a shorter read.
Here is a sample Dockerfile to create a container that runs a Python-based Jupyter Notebook and includes the contents of the directory mycontainer. The Dockerfile is in the main folder that contains the items you want to add to the container, in this case mycontainer. Comments are preceded by the octothorpe (#).
# this will create the base image of the container
ARG BASE_CONTAINER=jupyter/minimal-notebook
FROM $BASE_CONTAINER
LABEL author="Container Author"
USER root
# set the working directory inside the container and copy in the contents
# of the build context (the folder containing this Dockerfile);
# without the COPY, the Jupyter notebook would start with no other files
WORKDIR mycontainer/
COPY . .
# installs the libraries into the container
RUN pip install pandas numpy matplotlib plotly
# Switch back to jovyan to avoid accidental container runs as root
USER $NB_UID
Once the Dockerfile is created, and while in the directory that contains the file, the command to build the container is
docker build -t username/container-name .
Note that at the end of the command is a period preceded by a space. This tells Docker to use the current directory as the build context.
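Once the build completes, you can confirm the new image exists and try running it; a sketch, assuming the jupyter base image's default notebook port of 8888:

# list local images; username/container-name should appear near the top
docker images
# run the new container, mapping the Jupyter port to your computer
docker run -p 8888:8888 username/container-name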
As when documenting any code, the libraries (and their versions) installed in the container should be documented. This may be done by using the text displayed in the terminal, which indicates the versions of the installed libraries,
Successfully installed cycler-0.10.0 kiwisolver-1.3.1 matplotlib-3.3.3 numpy-1.19.4 pandas-1.1.5 pillow-8.0.1 plotly-4.14.1 pytz-2020.4 retrying-1.3.3
or, if the build used Docker's cache, by running the container and printing the version attribute of each package. For example, to get the version of the numpy package, print(numpy.__version__) is run after the numpy package has been imported.
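If you would rather not dig through the build output, the same information can be pulled from the finished container; a minimal sketch, assuming the image was tagged username/container-name as in the build command above:

# start a throwaway container and print the installed numpy version
docker run --rm username/container-name python -c "import numpy; print(numpy.__version__)"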
Sharing a container on hub.docker.com is fairly simple.
1. Create your container
2. Locally, tag your image if it wasn't tagged during the build process
docker tag repo-name YOUR-USER-NAME/repo-name
3. Push your local container to Docker Hub (see the login note below)
docker push YOUR-USER-NAME/repo-name
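Note that pushing requires being signed in to Docker Hub with the account that owns YOUR-USER-NAME, which you can do from the command line before the push:

# authenticate with Docker Hub; you will be prompted for your username and password
docker login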