4. Manage dependencies and containers
Computational workflows are rarely composed of a single script or tool. More often, they depend on dozens of software components or libraries.
Installing and maintaining such dependencies is a challenging task and a common source of irreproducibility in scientific applications.
To overcome these issues, we use containers, which allow software dependencies, i.e. the tools and libraries required by a data analysis application, to be encapsulated in one or more self-contained, ready-to-run, immutable Linux container images that can easily be deployed on any platform supporting the container runtime.
Containers run in isolation from the host system, each with its own copy of the file system, process space, memory management, and so on.
Info
Containers were first introduced with kernel 2.6 as a Linux feature known as Control Groups or Cgroups.
4.1 Docker
Docker is a handy management tool to build, run and share container images.
These images can be uploaded and published in a centralized repository known as Docker Hub, or hosted by other parties like Quay.
4.1.1 Run a container
A container can be run using the following command:
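In its simplest form, where `<image-name>` is a placeholder for the image you want to run:

```bash
docker run <image-name>
```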
Try for example the following publicly available container (if you have Docker installed):
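A commonly used public test image is `hello-world` (the original course may use a different example):

```bash
docker run hello-world
```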
4.1.2 Pull a container
The pull command allows you to download a Docker image without running it. For example:
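Assuming the same Debian image used throughout this section:

```bash
docker pull debian:bullseye-slim
```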
The above command downloads a Debian Linux image. You can check it exists by using:
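```bash
docker images
```

This lists all container images available locally; the Debian image should appear in the output.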
4.1.3 Run a container in interactive mode
Launching a BASH shell in the container allows you to operate in an interactive mode in the containerized operating system. For example:
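A sketch using the Debian image pulled earlier:

```bash
# -i keeps STDIN open, -t allocates a pseudo-terminal
docker run -it debian:bullseye-slim bash
```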
Once launched, you will notice that it is running as root (!). Use the usual commands to navigate the file system. This is useful to check if the expected programs are present within a container.
To exit from the container, stop the BASH session with the `exit` command.
4.1.4 Your first Dockerfile
Docker images are created by using a so-called `Dockerfile`, which is a simple text file containing a list of commands to assemble and configure the image with the software packages required.
Here, you will create a Docker image containing the cowsay utility and the Salmon tool.
Warning
The Docker build process automatically copies all files that are located in the current directory to the Docker daemon in order to create the image. This can take a lot of time when big/many files exist. For this reason, it’s important to always work in a directory containing only the files you really need to include in your Docker image. Alternatively, you can use the `.dockerignore` file to select paths to exclude from the build.
Use your favorite editor (e.g., `vim` or `nano`) to create a file named `Dockerfile` and copy the following content:
```dockerfile
FROM debian:bullseye-slim

LABEL image.author.name "Your Name Here"
LABEL image.author.email "your@email.here"

RUN apt-get update && apt-get install -y curl cowsay

ENV PATH=$PATH:/usr/games/
```
4.1.5 Build the image
Build the Docker image based on the Dockerfile by using the following command:
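```bash
# build an image from the Dockerfile in the current directory (.)
docker build -t my-image .
```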
Where "my-image" is the user-specified name for the container image you plan to build .
Tip
Don’t miss the dot in the above command.
When it completes, verify that the image has been created by listing all available images:
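```bash
docker images
```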
You can try your new container by running this command:
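For instance, invoking the cowsay program installed in the image (the message is arbitrary):

```bash
docker run my-image cowsay "Hello Docker"
```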
4.1.6 Add a software package to the image
Add the Salmon package to the Docker image by adding the following snippet to the `Dockerfile`:
```dockerfile
RUN curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.5.2/salmon-1.5.2_linux_x86_64.tar.gz | tar xz \
    && mv /salmon-*/bin/* /usr/bin/ \
    && mv /salmon-*/lib/* /usr/lib/
```
Save the file and build the image again with the same command as before:
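```bash
docker build -t my-image .
```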
You will notice that it creates a new Docker image with the same name but with a different image ID.
4.1.7 Run Salmon in the container
Check that Salmon is running correctly in the container as shown below:
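For instance, by printing the version (a sketch; any Salmon subcommand would do):

```bash
docker run my-image salmon --version
```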
You can even launch a container in an interactive mode by using the following command:
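```bash
docker run -it my-image bash
```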
Use the `exit` command to terminate the interactive session.
4.1.8 File system mounts
Create a genome index file by running Salmon in the container.
Try to run Salmon in the container with the following command:
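```bash
# note: transcriptome.fa lives on the host file system
docker run my-image \
    salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index
```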
The above command fails because Salmon cannot access the input file. This happens because the container runs in a completely separate file system and cannot access the host file system by default. You will need to use the `--volume` command-line option to mount the input file(s), e.g.:
```bash
docker run --volume $PWD/data/ggal/transcriptome.fa:/transcriptome.fa my-image \
    salmon index -t /transcriptome.fa -i transcript-index
```
Warning
The generated `transcript-index` directory is still not accessible in the host file system.
An easier way is to mount a parent directory to an identical path in the container; this allows you to use the same path when running commands in the container, e.g.:
```bash
docker run --volume $PWD:$PWD --workdir $PWD my-image \
    salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index
```
Alternatively, set the folder you want to mount as an environment variable, called `DATA`:
```bash
DATA=/workspace/gitpod/nf-training/data
docker run --volume $DATA:$DATA --workdir $PWD my-image \
    salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index
```
Now check the content of the `transcript-index` folder by entering the command:
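```bash
ls -la transcript-index
```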
Note
Note that the files created by the Docker execution are owned by `root`.
4.1.9 Upload the container to Docker Hub (bonus)
Publish your container to Docker Hub to share it with other people.
Create an account on the https://hub.docker.com website. Then from your shell terminal run the following command, entering the username and password you specified when registering with the Hub:
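```bash
docker login
```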
Rename the image to include your Docker username:
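A sketch, with `<myrepo>` standing in for your Docker username:

```bash
docker tag my-image <myrepo>/my-image
```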
Finally push it to the Docker Hub:
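```bash
docker push <myrepo>/my-image
```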
After that anyone will be able to download it by using the command:
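```bash
docker pull <myrepo>/my-image
```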
Note how, after a pull or push operation, Docker prints the container digest, e.g.:
```
Digest: sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266
Status: Downloaded newer image for nextflow/rnaseq-nf:latest
```
This is a unique and immutable identifier that can be used to reference a container image unambiguously. For example:
```bash
docker pull nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266
```
4.1.10 Run a Nextflow script using a Docker container
The simplest way to run a Nextflow script with a Docker image is by using the `-with-docker` command-line option:
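For example (script2.nf is the training script referenced in the exercises at the end of this section):

```bash
nextflow run script2.nf -with-docker
```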
As seen in the last section, you can also configure the Nextflow config file (`nextflow.config`) to select which container to use instead of having to specify it as a command-line argument every time.
4.2 Singularity
Singularity is a container runtime designed to work in high-performance computing data centers, where the usage of Docker is generally not allowed due to security constraints.
Singularity implements a container execution model similar to Docker. However, it uses a completely different implementation design.
A Singularity container image is archived as a plain file that can be stored in a shared file system and accessed by many computing nodes managed using a batch scheduler.
Warning
Singularity will not work with Gitpod. If you wish to try this section, please do it locally, or on an HPC.
4.2.1 Create a Singularity image
Singularity images are created using a `Singularity` definition file, in a similar manner to Docker but with a different syntax:
```
Bootstrap: docker
From: debian:bullseye-slim

%environment
export PATH=$PATH:/usr/games/

%labels
AUTHOR <your name>

%post
apt-get update && apt-get install -y locales-all curl cowsay
curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.0.0/salmon-1.0.0_linux_x86_64.tar.gz | tar xz \
&& mv /salmon-*/bin/* /usr/bin/ \
&& mv /salmon-*/lib/* /usr/lib/
```
Once you have saved the `Singularity` file, you can create the image with these commands:
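A sketch, assuming the definition file is named `Singularity` and the output image `my-image.sif`:

```bash
sudo singularity build my-image.sif Singularity
```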
Note: the `build` command requires `sudo` permissions. A common workaround consists of building the image on a local workstation and then deploying it to the cluster by copying the image file.
4.2.2 Running a container
Once done, you can run your container with the following command:
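For example, assuming the `my-image.sif` file built above:

```bash
singularity exec my-image.sif cowsay 'Hello Singularity'
```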
By using the `shell` command, you can enter the container in interactive mode. For example:
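```bash
singularity shell my-image.sif
```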
Once in the container instance run the following commands:
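For instance (illustrative commands; any file operations will do):

```bash
touch hello.txt
ls -la
```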
Info
Note how the files on the host environment are shown. Singularity automatically mounts the host `$HOME` directory and uses the current working directory.
4.2.3 Import a Docker image
An easier way to create a Singularity container, which does not require `sudo` permission and improves container interoperability, is to import a Docker container image by pulling it directly from a Docker registry. For example:
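A sketch consistent with the output file named below (older Singularity releases name the file after the image and tag):

```bash
singularity pull docker://debian:jessie
```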
The above command automatically downloads the Debian Docker image and converts it to a Singularity image in the current directory with the name `debian-jessie.simg`.
4.2.4 Run a Nextflow script using a Singularity container
Nextflow supports the transparent use of Singularity containers, just as easily as Docker ones.
Simply enable the use of the Singularity engine in place of Docker in the Nextflow command line by using the `-with-singularity` command-line option:
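For example, assuming the `my-image.sif` file built earlier (any Singularity image path would work):

```bash
nextflow run script7.nf -with-singularity my-image.sif
```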
As before, the Singularity container can also be provided in the Nextflow config file. We’ll see how to do this later.
4.2.5 The Singularity Container Library
The authors of Singularity, SyLabs, maintain their own repository of Singularity containers.
In the same way that we can push Docker images to Docker Hub, we can upload Singularity images to the Singularity Library.
4.3 Conda/Bioconda packages
Conda is a popular package and environment manager. The built-in support for Conda allows Nextflow workflows to automatically create and activate the Conda environment(s), given the dependencies specified by each process.
In this Gitpod environment, conda is already installed.
4.3.1 Using conda
A Conda environment is defined using a YAML file, which lists the required software packages. The first thing you need to do is initialize conda for shell interaction, and then open a new terminal by running bash.
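A sketch of those two steps, assuming a Bash shell as in the Gitpod environment:

```bash
conda init bash
bash
```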
Then write your YAML file (to `env.yml`). There is already a file named `env.yml` in the `nf-training` folder as an example. Its content is shown below:
```yaml
name: nf-tutorial
channels:
  - conda-forge
  - defaults
  - bioconda
dependencies:
  - bioconda::salmon=1.5.1
  - bioconda::fastqc=0.11.9
  - bioconda::multiqc=1.12
  - conda-forge::tbb=2020.2
```
Given the recipe file, the environment is created using the command shown below. The `conda env create` command may take several minutes, as conda tries to resolve dependencies of the desired packages at runtime and then downloads everything that is required.
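```bash
conda env create --file env.yml
```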
You can check the environment was created successfully with the command shown below:
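```bash
conda env list
```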
This should look something like this:
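Illustrative output (the exact paths depend on where conda is installed):

```
# conda environments:
#
base                  *  /opt/conda
nf-tutorial              /opt/conda/envs/nf-tutorial
```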
To enable the environment, you can use the `activate` command:
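```bash
conda activate nf-tutorial
```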
Nextflow is able to manage the activation of a Conda environment when its directory is specified using the `-with-conda` option (using the same path shown by the `conda env list` command). For example:
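A sketch, assuming the environment was created under `/opt/conda/envs` as in the listing above:

```bash
nextflow run script7.nf -with-conda /opt/conda/envs/nf-tutorial
```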
Info
When creating a Conda environment with a YAML recipe file, Nextflow automatically downloads the required dependencies, builds the environment and activates it.
This makes it easier to manage different environments for the processes in the workflow script.
See the docs for details.
4.3.2 Create and use conda-like environments using micromamba
Another way to build conda-like environments is through a `Dockerfile` and `micromamba`. `micromamba` is a fast and robust package manager for building small conda-based environments. This saves having to build a conda environment each time you want to use it (as outlined in the previous sections). To do this, you simply need a `Dockerfile` that uses micromamba to install the packages. However, it is good practice to have a YAML recipe file as in the previous section, so we will do that here too, using the same `env.yml` as before:
```yaml
name: nf-tutorial
channels:
  - conda-forge
  - defaults
  - bioconda
dependencies:
  - bioconda::salmon=1.5.1
  - bioconda::fastqc=0.11.9
  - bioconda::multiqc=1.12
  - conda-forge::tbb=2020.2
```
Then, we can write our `Dockerfile`, using micromamba to install the packages from this recipe file:
```dockerfile
FROM mambaorg/micromamba:0.25.1

LABEL image.author.name "Your Name Here"
LABEL image.author.email "your@email.here"

COPY --chown=$MAMBA_USER:$MAMBA_USER env.yml /tmp/env.yml

RUN micromamba create -n nf-tutorial

RUN micromamba install -y -n nf-tutorial -f /tmp/env.yml && \
    micromamba clean --all --yes

ENV PATH /opt/conda/envs/nf-tutorial/bin:$PATH
```
The above `Dockerfile` takes the parent image mambaorg/micromamba, creates a `conda` environment with `micromamba`, and installs `salmon`, `fastqc` and `multiqc` into it.
Try executing the RNA-Seq workflow from earlier (script7.nf). Start by building your own micromamba `Dockerfile` (from above), then push the image to your Docker Hub repo and direct Nextflow to run from this container (by changing your `nextflow.config`).
Warning
Building a Docker container and pushing to your personal repo can take >10 minutes.
For an overview of the steps to take:

1. Make a file called `Dockerfile` in the current directory (with the code above).
2. Build the image: `docker build -t my-image .` (don’t forget the `.`).
3. Publish the Docker image to your online Docker account, using commands similar to those sketched after this list, with `<myrepo>` replaced by your own Docker ID (without the < and > characters!). `my-image` could be replaced with any name you choose; as good practice, choose something memorable and ensure the name matches the one you used in the previous command.
4. Add the container image name to the `nextflow.config` file, i.e. remove the existing `process.container` line and replace it with one pointing at your image (see the sketch after this list).
5. Try running Nextflow, e.g. with the command sketched after this list.

Nextflow should now be able to find `salmon` to run the process.
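A minimal sketch of the commands for steps 3-5, with `my-image` as the image name and `<myrepo>` as the placeholder for your Docker ID:

```bash
# step 3: tag the image with your Docker ID and push it to Docker Hub
docker login
docker tag my-image <myrepo>/my-image
docker push <myrepo>/my-image
```

In `nextflow.config` (step 4), remove:

```groovy
process.container = 'nextflow/rnaseq-nf'
```

and replace it with:

```groovy
process.container = '<myrepo>/my-image'
```

Then run the workflow with Docker enabled (step 5):

```bash
nextflow run script7.nf -with-docker
```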
4.4 BioContainers
Another useful resource linking together Bioconda and containers is the BioContainers project. BioContainers is a community initiative that provides a registry of container images for every Bioconda recipe.
So far, we’ve seen how to install packages with conda and micromamba, both locally and within containers. With BioContainers, you don’t need to create your own container image for the tools you want, nor use conda or micromamba to install the packages. BioContainers already provides you with a Docker image containing the programs you want installed. For example, you can get the container image of fastqc from BioContainers with:
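A sketch using the `v0.11.5` tag mentioned below; pick whichever tag suits your needs:

```bash
docker pull biocontainers/fastqc:v0.11.5
```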
You can check the registry for the packages you want at the BioContainers official website. To find multi-tool container images, check their Multi-package images.
Contrary to other registries, which pull the latest image when no tag (version) is provided, you must specify a tag when pulling from BioContainers (after a colon `:`, e.g. `fastqc:v0.11.5`). Check the tags within the registry and pick the one that best suits your needs.
You can also install `galaxy-tool-util` and search for mulled containers from your CLI. Instructions are shown below, using conda to install the tool:
```bash
conda activate a-conda-env-you-already-have
conda install galaxy-tool-util
mulled-search --destination quay singularity --channel bioconda --search bowtie samtools | grep mulled
```
Tip
You can have more complex definitions within your process block by letting the appropriate container image or conda package be used depending on whether the user selected Singularity, Docker or Conda. You can click here for more information and here for an example.
4.4.1 Exercises
Exercise
During the earlier RNA-Seq tutorial (script2.nf), we created an index with the salmon tool. Given that we do not have salmon installed locally in the machine provided by Gitpod, we had to run it either `-with-conda` or `-with-docker`. Your task now is to run it again `-with-docker`, but without having to create your own Docker container image. Instead, use the BioContainers image for salmon 1.7.0.
Bonus Exercise
Change the process directives in `script5.nf` or the `nextflow.config` file to make the workflow automatically use BioContainers when using salmon or fastqc.
Hint
Temporarily comment out the line `process.container = 'nextflow/rnaseq-nf'` in the `nextflow.config` file to make sure the processes are using the BioContainers that you set, and not the container image we have been using in this training.
Solution
With these changes, you should be able to run the workflow with BioContainers by running the following in the command line:
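```bash
nextflow run script2.nf -with-docker
```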
with the following container directives for each process:
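For the INDEX process, a sketch using the same Salmon BioContainers image as below (the process name comes from the earlier scripts):

```groovy
process INDEX {
    container 'quay.io/biocontainers/salmon:1.7.0--h84f40af_0'
    ...
```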
and
```groovy
process QUANTIFICATION {
    tag "Salmon on $sample_id"
    container 'quay.io/biocontainers/salmon:1.7.0--h84f40af_0'
    publishDir params.outdir, mode: 'copy'
    ...
```
Check the `.command.run` file in the work directory and ensure that the run line contains the correct BioContainers image.