
Reproducible Research Methods

What is Docker?

Docker

Docker is an open software platform that is used to create containers.  "A container [...] packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings."

In terms of reproducible research, a container platform such as Docker can be used to document and share the computational environment used for your research.

Examples of Containers in Research

Using Docker

Docker provides several products:

  • The Docker Engine, the technology used to create containers;
  • Docker Compose, a tool used to create multi-container applications;
  • Docker Hub, a repository of container images; and,
  • Docker Desktop, an application that bundles several products, while providing a dashboard to quickly access containers.

 

Installation

Product        | Linux                                                      | macOS                                         | Windows
Docker Engine  | Several distributions are available, installed through CLI | Available through Docker Desktop              | Available through Docker Desktop
Docker Compose | Several distributions are available, installed through CLI | Available through Docker Desktop              | Available through Docker Desktop
Docker Desktop | Not available for Linux                                    | Mojave and above                              | Windows 10
Docker Hub     | Available through a browser                                | Available through a browser or Docker Desktop | Available through a browser or Docker Desktop

Installing Docker Desktop on macOS is fairly straightforward.  When installing Docker Desktop on Windows 10, you might be prompted to answer a question: use Hyper-V or switch to WSL2?  So, what are Hyper-V and WSL2?  Without getting too in depth, Hyper-V is the Windows toolkit for creating virtual machines; most people would use it to create a VM that runs Linux.  WSL2 (Windows Subsystem for Linux 2) is itself a lightweight virtual machine running a Linux kernel.  I think the most important thing to know when making this choice is that Docker favors WSL2, and has stated it will eventually replace Hyper-V with WSL2.  The second thing to know is that some containers may rely on Hyper-V, but until Docker completely phases out Hyper-V, you can still switch back to it if needed.

 

Running Docker images

A lot of the basics about Docker can be learned through its tutorial, which is started with the command

docker run -d -p 80:80 docker/getting-started

After running the above command from the Command Line Interface (CLI), you access the tutorial by going to http://localhost/tutorial/ in a browser.  Alternatively, you can launch the tutorial from Docker Desktop. 

[Image: the Open In Browser icon circled in the Docker Dashboard]

 

Data created or used by a container is made persistent by mounting a volume.  Basically, as you launch a container, you tell Docker to connect a directory on your computer with a directory in the container to save and share data.

If you went through the Docker tutorial episode Persisting our DB, you will have created a volume titled todo-db, then run the command

docker run -dp 3000:3000 -v todo-db:/etc/todos getting-started

The tutorial also tells you how to locate where the volume is stored by using the inspect command

docker volume inspect todo-db
[
    {
        "CreatedAt": "2019-09-26T02:18:36Z",
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/todo-db/_data",
        "Name": "todo-db",
        "Options": {},
        "Scope": "local"
    }
]
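
If you want to use the mountpoint in a script rather than read it off the screen, the JSON printed by the inspect command can be parsed.  What follows is only a minimal Python sketch: it assumes the output shown above has already been captured as text (here it is simply pasted in as a string), and the variable names are illustrative.

```python
import json

# Sample output copied from the `docker volume inspect todo-db` call above.
# In a real script this text would come from running the command yourself.
inspect_output = """
[
    {
        "CreatedAt": "2019-09-26T02:18:36Z",
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/todo-db/_data",
        "Name": "todo-db",
        "Options": {},
        "Scope": "local"
    }
]
"""

# The command prints a JSON list with one entry per inspected volume.
volumes = json.loads(inspect_output)
mountpoint = volumes[0]["Mountpoint"]
print(mountpoint)
```

Reading the value from the parsed JSON avoids copying the path by hand, which matters when the same script is run on more than one machine.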

 

Working in Windows

Some of you running Windows might ask, "But where is /var/lib/docker/volumes/todo-db/_data?", because the tutorial makes it sound like you can easily access the data.  But you can't easily access that directory on Windows 10!

The data isn't really on your computer; at least not in the Windows OS.  It's actually in the virtual machine Docker is running on. 

 

On Windows, the Docker Engine runs inside that virtual machine, and since the virtual machine runs separately from everything else (unless you specify otherwise), you can't easily get at data stored inside it.

 

While this isn't very useful for researchers who want to get data into or out of the container, there is a solution!  Docker allows you to specify the location of the directory you want to mount. 

Using the previous getting-started to-do list example in the tutorial, I'm going to create a data sub-directory in my tutorial directory to save the data.  The full path is C:\Users\Kat Koziar\Desktop\docker\tutorial\data so, the Docker command is now

docker run -dp 3000:3000 -v "C:\Users\Kat Koziar\Desktop\docker\tutorial\data:/etc/todos" getting-started

Notice the double-quotes around the directory path?  They are there because the directory name Kat Koziar contains a space that I am unable to remove.  The quotes tell the computer to read the space as part of the directory path; otherwise, it reads the space as separating commands, flags, or arguments.
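
To see the effect of the quotes without running Docker, the splitting can be imitated in Python.  This is a rough sketch, not what Docker or Windows does internally: shlex.split follows POSIX shell rules, and the path /home/Kat Koziar/data is a hypothetical stand-in for the Windows path above.

```python
import shlex

# Hypothetical POSIX-style path containing a space, used only for illustration.
unquoted = 'docker run -v /home/Kat Koziar/data:/etc/todos getting-started'
quoted = 'docker run -v "/home/Kat Koziar/data:/etc/todos" getting-started'

# shlex.split breaks a command line into arguments the way a POSIX shell would.
unquoted_args = shlex.split(unquoted)
quoted_args = shlex.split(quoted)

print(unquoted_args[3:5])  # the path is cut in two at the space
print(quoted_args[3])      # the path stays in one piece
```

Without the quotes, Docker would receive two separate arguments and treat the second half of the path as something else entirely.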

What is $(pwd)?

You may see the term $(pwd) in commands. What is it? You might be tempted to think pwd is shorthand for password, but it stands for Print Working Directory.  pwd is a shell command, available in Bash and, as an alias, in PowerShell, that returns the path of the directory you are currently in.  If I am in the command line interface and navigate to the tutorial directory (in the above docker command example), when I execute the command pwd, the response will be C:\Users\Kat Koziar\Desktop\docker\tutorial .  When the command is surrounded by parentheses and preceded by a dollar sign, $(pwd), the computer will run the command in the parentheses and paste the result into the rest of the command. So,

docker run -dp 3000:3000 -v "$(pwd)\data:/etc/todos" getting-started

is the same as 

docker run -dp 3000:3000 -v "C:\Users\Kat Koziar\Desktop\docker\tutorial\data:/etc/todos" getting-started

but with less typing.  It is important to keep track of which directory you're currently in when using $(pwd) in a command.
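
The same substitution can be done from a script.  The sketch below is one possible Python equivalent, assuming the data sub-directory and the /etc/todos target from the example; it only builds and prints the command, and the variable names are illustrative.

```python
from pathlib import Path

# Path.cwd() plays the role of $(pwd): it returns the current working directory.
host_data_dir = Path.cwd() / "data"

# Build the value passed to -v:
#   <directory on your computer>:<directory in the container>
volume_arg = f"{host_data_dir}:/etc/todos"

# Keeping the arguments in a list means a space in the path needs no extra quoting.
command = ["docker", "run", "-dp", "3000:3000", "-v", volume_arg, "getting-started"]
print(command)
```

A list like this could be handed to subprocess.run to start the container, and, just as with $(pwd), the result depends on the directory the script is run from.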

Containers are created using a Dockerfile, which is a plain text file with no extension that contains a sequence of commands Docker executes in order to build the container (strictly speaking, the image that containers are started from).  The Docker tutorial demonstrates building a container using a Dockerfile, with a passing reference to the contents of the file.  Docker also provides long and very technical documentation in their reference library.   Microsoft documentation provides an introduction to Dockerfiles that covers the basics and is a shorter read.

Here is a sample Dockerfile to create a container which runs the code of a Python-based Jupyter Notebook, and includes the contents of the directory mycontainer.   The Dockerfile is in the main folder which contains the items you want to add to the container, in this case, mycontainer.  Comments are preceded by the octothorpe (#).

# sets the base image the container is built on
ARG BASE_CONTAINER=jupyter/minimal-notebook
FROM $BASE_CONTAINER

LABEL author="Container Author"

USER root

# sets the working directory inside the container, then copies the contents
#   of the folder holding this Dockerfile into it
# when creating a Jupyter notebook, if you didn't copy in files,
#   it would run a Jupyter notebook with no other files
WORKDIR mycontainer/
COPY . .

# installs the libraries into the container
RUN pip install pandas numpy matplotlib plotly

# Switch back to jovyan to avoid accidental container runs as root
USER $NB_UID

Once the Dockerfile is created and while in the directory that contains the file, the command to build the container is 

docker build -t username/container-name .

Note that at the end of the command is a period preceded by a space.  The period tells Docker to use the current directory as the build context, which is where it looks for the Dockerfile and for the files to copy.

As when documenting any code, libraries (and their versions) installed in the container should be documented.  This may be done by using the text displayed in the terminal which indicates the versions of the libraries, 

Successfully installed cycler-0.10.0 kiwisolver-1.3.1 matplotlib-3.3.3 numpy-1.19.4 pandas-1.1.5 
pillow-8.0.1 plotly-4.14.1 pytz-2020.4 retrying-1.3.3

or, if the build is using Docker’s cache, by running the container and printing each library's version attribute.  For example, to get the version of the numpy package, run print(numpy.__version__) after the numpy package has been imported.
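
Checking each library one at a time gets tedious, so the versions can also be collected in one pass.  The following is a minimal sketch, meant to be run inside the container, that uses Python's standard importlib.metadata module; the function name installed_versions is made up for this example, and the package list is the one from the sample Dockerfile.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package_names):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in package_names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The libraries installed by the sample Dockerfile.
versions = installed_versions(["pandas", "numpy", "matplotlib", "plotly"])
for name, ver in versions.items():
    print(f"{name}=={ver}" if ver else f"{name} is not installed")
```

The printed lines can be pasted into your documentation, or saved to a file that is shared alongside the container.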

Sharing a container on hub.docker.com is fairly simple. 

1. Create your container

2. Locally, tag your container with your Docker Hub user name if it wasn't tagged during the build process

docker tag repo-name YOUR-USER-NAME/repo-name

3. Push your local container to Docker Hub (you may need to run docker login first)

docker push YOUR-USER-NAME/repo-name
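
If you share containers regularly, the tag and push steps can be scripted.  The sketch below only builds and prints the two commands and does not execute anything; hub_commands is a made-up helper name, and repo-name and YOUR-USER-NAME are the same placeholders used above.

```python
def hub_commands(local_name, user_name):
    """Build the tag and push commands as argument lists (nothing is executed)."""
    remote_name = f"{user_name}/{local_name}"
    return [
        ["docker", "tag", local_name, remote_name],
        ["docker", "push", remote_name],
    ]


for command in hub_commands("repo-name", "YOUR-USER-NAME"):
    print(" ".join(command))
```

Each list could be passed to subprocess.run once you have confirmed the names are correct.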