Advanced HPC use cases

Overview

Teaching: 35 min
Exercises: 15 min
Questions
  • How can I run an MPI enabled application in a container using a bind approach?

  • How can I run a GPU enabled application in a container?

  • How can I store files inside a container filesystem?

  • How can I use workflow engines in conjunction with containers?

Objectives
  • Discuss the bind approach for containerised MPI applications, including its performance

  • Get started with containerised GPU applications

  • Work through common pitfalls with mounting host system files

  • See real world MPI/GPU examples using OpenFoam and Gromacs

  • See a data-intensive pipeline that saves its data in an overlay filesystem

  • Get an idea of the interplay between containers and workflow engines

DEMO: Container vs bare-metal MPI performance

NOTE: this part was executed on the Pawsey Zeus cluster, but similar performance characteristics (using the system MPI libraries pays off!) apply on most HPC clusters. You can follow along with the outputs here.

The Pawsey Centre provides a set of MPI base images, which also ship with the OSU Benchmark Suite. Let's use one of them to get a feel for the difference that using (or not using) the high-speed interconnect makes.
We’re going to run a small bandwidth benchmark using the image pawsey/mpich-base:3.1.4_ubuntu18.04. All of the required commands can be found in the directory path of the first OpenFoam example, in the script benchmark_pawsey.sh:

#!/bin/bash -l

#SBATCH --job-name=mpi
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:20:00
#SBATCH --output=benchmark_pawsey.out

image="docker://pawsey/mpich-base:3.1.4_ubuntu18.04"
osu_dir="/usr/local/libexec/osu-micro-benchmarks/mpi"

# this configuration depends on the host
module load singularity


# see that SINGULARITYENV_LD_LIBRARY_PATH is defined (host MPI/interconnect libraries)
echo $SINGULARITYENV_LD_LIBRARY_PATH

# 1st test, with host MPI/interconnect libraries
srun singularity exec $image \
  $osu_dir/pt2pt/osu_bw -m 1024:1048576


# unset SINGULARITYENV_LD_LIBRARY_PATH
unset SINGULARITYENV_LD_LIBRARY_PATH

# 2nd test, without host MPI/interconnect libraries
srun singularity exec $image \
  $osu_dir/pt2pt/osu_bw -m 1024:1048576

We're running the test twice: the first time using the full bind-approach configuration as provided by the singularity module on the cluster, and the second time after unsetting the variable that makes the host MPI/interconnect libraries available inside containers.
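If you are curious about what that full configuration contains on a given system, the singularity module itself is a good place to look. As a quick check (a sketch; the exact module contents vary from site to site):

module show singularity                  # what does the module define for us?
echo $SINGULARITY_BINDPATH               # host directories bound into every container
echo $SINGULARITYENV_LD_LIBRARY_PATH     # host MPI/interconnect library paths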

Here is the first output (using the interconnect):

# OSU MPI Bandwidth Test v5.4.1
# Size      Bandwidth (MB/s)
1024                 2281.94
2048                 3322.45
4096                 3976.66
8192                 5124.91
16384                5535.30
32768                5628.40
65536               10511.64
131072              11574.12
262144              11819.82
524288              11933.73
1048576             12035.23

And here is the second one:

# OSU MPI Bandwidth Test v5.4.1
# Size      Bandwidth (MB/s)
1024                   74.47
2048                   93.45
4096                  106.15
8192                  109.57
16384                 113.79
32768                 116.01
65536                 116.76
131072                116.82
262144                117.19
524288                117.37
1048576               117.44

You can see that for a 1 MB message the bandwidth is about 12 GB/s versus roughly 117 MB/s, about a hundredfold difference in performance!

Interactive Work: Avoiding pitfalls with mounts

As with the previous Singularity example, Docker can also make use of host system libraries mounted into the container. Work through this example to understand what needs to be done to do this correctly. We start with a simple Dockerfile that builds the OSU benchmarks.

FROM ethcscs/mpich:ub1804_cuda92_mpi314

RUN apt update \
    && apt install -y ca-certificates patch

RUN cd /tmp \
    && wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.6.1.tar.gz \
    && tar xf osu-micro-benchmarks-5.6.1.tar.gz \
    && cd osu-micro-benchmarks-5.6.1 \
    && ./configure --prefix=/usr/local CC=$(which mpicc) CFLAGS=-O3 \
    && make -j2 \
    && make install \
    && cd .. \
    && rm -rf osu-micro-benchmarks-5.6.1*

WORKDIR /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt
CMD ["mpiexec", "-n", "2", "-bind-to", "core", "./osu_latency"]

Let us build this. Remember that this will take a few minutes the first time, but because the build steps are cached, we will be able to update this quickly as we move forward.

docker build -t mpich_osu:1.0 .

Let us take a look inside the container and make sure we understand what we are currently looking at.

docker run --rm -it mpich_osu:1.0  /bin/bash
ldd osu_latency

Take note of the MPI library that the executable is linked against; replacing it with the host's library is what we will work towards in this part of the tutorial.
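Since ldd prints a few dozen entries, a handy trick is to filter its output for the MPI library only:

ldd osu_latency | grep libmpi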

Volume mounting

We can make files on our host system available within the container by mounting them with the -v option. First, we need to find where our system MPI is installed.
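Here is a sketch of how you might locate it, assuming the host provides mpicc on its PATH (the path in the comment is only illustrative):

which mpicc                                     # e.g. /usr/local/mvapich4-plus/bin/mpicc
MPI_DIR=$(dirname $(dirname $(which mpicc)))    # strip bin/mpicc to get the install root
ls $MPI_DIR                                     # it should contain both bin/ and lib/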

Then we can run our container and mount this directory within it. It is generally best practice to mount host files into a new directory under / to avoid accidentally overwriting something in the container's filesystem; here we use a new directory, /newMPI. Find where the MPI executables and libraries are located. Hint: use which and make sure you pick the directory that contains both bin and lib.

docker run -v /path/to/MPI:/newMPI --rm -it mpich_osu:1.0  /bin/bash

When you are in the container, find what changed. Did the MPI library for the executable change? Are the bin and lib directories available where they were mounted? Are these in your path?

If you did this step correctly, you should have access to those files, but because they are not in your path, the linked libraries for our osu_latency executable will not have changed. Try to add these to the required paths (PATH and LD_LIBRARY_PATH) and see what changes.

Hint: /usr/local/mvapich4-plus
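A minimal sketch of what you might try inside the container, assuming the mount at /newMPI as above:

export PATH="/newMPI/bin:$PATH"
export LD_LIBRARY_PATH="/newMPI/lib:$LD_LIBRARY_PATH"
which mpiexec                    # does it now resolve to /newMPI/bin/mpiexec?
ldd osu_latency | grep libmpi    # which libmpi does the loader pick up now?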

Adding environment variables

The previous step showed that volume mounting the host files works, but that we still need to add them to the relevant paths within the container for this to be useful.

Let us update our Dockerfile in two ways. First, we will make the directory we are going to mount our system files into. This is not a technical requirement - we saw in the previous step that we could do the mount without it. However, because we are going to add the environment variables next, it is a reminder to other users and our future selves of what we intend this container to be used with. Second, let us put in the same commands we used before to add to the relevant paths.

FROM ethcscs/mpich:ub1804_cuda92_mpi314

RUN apt update \
    && apt install -y ca-certificates patch

RUN cd /tmp \
    && wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.6.1.tar.gz \
    && tar xf osu-micro-benchmarks-5.6.1.tar.gz \
    && cd osu-micro-benchmarks-5.6.1 \
    && ./configure --prefix=/usr/local CC=$(which mpicc) CFLAGS=-O3 \
    && make -j2 \
    && make install \
    && cd .. \
    && rm -rf osu-micro-benchmarks-5.6.1*

RUN mkdir /newMPI \
    && export PATH="/newMPI/bin:$PATH" \
    && export LD_LIBRARY_PATH="/newMPI/lib:$LD_LIBRARY_PATH"

WORKDIR /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt
CMD ["mpiexec", "-n", "2", "-bind-to", "core", "./osu_latency"]

We will rebuild this, remembering to iterate our version number:

docker build -t mpich_osu:1.1 .

Now, start the container interactively and look at the libraries used for osu_latency and the environment.

docker run -v /usr/local/mvapich4-plus:/newMPI --rm -it mpich_osu:1.1  /bin/bash

Is this what you expect?
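A few quick checks inside the container will answer that (remember how the environment was set in the Dockerfile above):

echo $PATH
echo $LD_LIBRARY_PATH
ldd osu_latency | grep libmpi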

Whoops! Correcting the environment variables

Typically, what you run interactively in a container can be copied directly into the Dockerfile. Environment variables, however, are an exception: an export inside a RUN instruction only lasts for that single build step and is not preserved in the image. Docker instead provides the ENV instruction to set environment variables so that they are baked into the container.

FROM ethcscs/mpich:ub1804_cuda92_mpi314

RUN apt update \
    && apt install -y ca-certificates patch

RUN cd /tmp \
    && wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.6.1.tar.gz \
    && tar xf osu-micro-benchmarks-5.6.1.tar.gz \
    && cd osu-micro-benchmarks-5.6.1 \
    && ./configure --prefix=/usr/local CC=$(which mpicc) CFLAGS=-O3 \
    && make -j2 \
    && make install \
    && cd .. \
    && rm -rf osu-micro-benchmarks-5.6.1*

RUN mkdir /newMPI
ENV PATH="/newMPI/bin:$PATH"
ENV LD_LIBRARY_PATH="/newMPI/lib:$LD_LIBRARY_PATH"

WORKDIR /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt
CMD ["mpiexec", "-n", "2", "-bind-to", "core", "./osu_latency"]

Rebuild with this, remembering to iterate your version number once again, and then run interactively to understand what has happened. It turns out that, because the executable is linked against the versioned name libmpi.so.12 and the mounted host installation does not provide a library with that exact name, the loader still picks up the original. We can add a soft link so that the versioned name points to the mounted libmpi:

ln -s /newMPI/lib/libmpi.so /newMPI/lib/libmpi.so.12

This isn’t fully up and running, but we are getting close! What does this tell you needs to be added next?
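One way to answer that is to ask the loader what is still unresolved; run this inside the container (the output depends on your host libraries):

ldd osu_latency | grep "not found"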

What else?

The Dockerfile used for this example is relatively old, both with respect to the OS and to the versions of MPICH and the OSU benchmarks used. If the imported communication library is incompatible with the library the software was compiled against, there may be problems both with running and with performance. Note that users often move faster than HPC system software, so the opposite also happens: the libraries you import from the host may lack functionality required by your executable.

Another important thing to note is that libraries you bring in from the host may themselves require other libraries that the version inside the container does not. In the example above, libmpi requires libfabric and libhwloc, and libfabric in turn often requires libcxi. Our base containers may be fairly portable, but on an individual system it can be quite a bit of work to make your application performant. For this reason, some HPC centers provide wrappers that inject the host system functionality into containers: for example, podman-hpc at NERSC has a --cuda-mpi flag that inserts the CUDA-aware, vendor-tuned MPICH on Perlmutter. Options like this might be available to you, and likely work well for most containerised applications.
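A hedged way to discover such second-level dependencies is to run ldd on the mounted host library itself from inside the container (assuming the /newMPI mount used above):

ldd /newMPI/lib/libmpi.so | grep "not found"    # host libraries that still need to be made available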

As an exercise, please try to update the MPICH and OSU Benchmarks versions used in this container so that it can properly use a mounted system communication library. If you are doing this during the tutorial, there is a latency benchmark that is already built and ready to run for you to compare against. It is in

~/examples/osu-benchmarks/latency.qsub

and can be submitted with

qsub latency.qsub

from that directory. The output file provides the results you are trying to match. If you are doing this on your host system, please install and run the OSU benchmark test you are interested in comparing on bare metal, using the communication library you are attempting to mount directly.
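If you are building your own bare-metal reference, a sketch of the steps (mirroring the Dockerfile above; load your system MPI first, and adjust paths as needed) could look like:

wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.6.1.tar.gz
tar xf osu-micro-benchmarks-5.6.1.tar.gz
cd osu-micro-benchmarks-5.6.1
./configure --prefix=$HOME/osu CC=$(which mpicc) CFLAGS=-O3
make -j2 && make install
mpiexec -n 2 -bind-to core $HOME/osu/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency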

EXAMPLES from previous years

** From here down: These are showcase examples from previous years. **

We don’t have enough time to cover any/everything, and so are focusing this year on helping you work through common issues. The following examples have been used in previous tutorials but are not currently used or supported.

Showcasing how to configure the MPI/interconnect bind approach for Singularity

Before we start, let us cd to the openfoam example directory:

cd ~/sc-tutorials/exercises/openfoam

Now, suppose you have an MPI installation in your host and a containerised MPI application, built upon MPI libraries that are ABI compatible with the former.

For this tutorial, we do have an MPICH-ABI-compatible MPI (MVAPICH) installed on the host machine:

which mpirun
/opt/mvapich2-x/gnu11.1.0/mofed/aws/mpirun/bin/mpirun

and we’re going to pull an OpenFoam container, which was built on top of MPICH as well.

Since this is a large image, we will pull a copy to /tmp for everyone to share.

(cd /tmp && [ -e sc22-openfoam_v2012.sif ] || singularity pull docker://quay.io/pawsey/sc22-openfoam:v2012 && chmod a+r /tmp/sc22-openfoam_v2012.sif)

OpenFoam comes with a collection of executables, one of which is simpleFoam. We can use the Linux command ldd to investigate the libraries that this executable links to. As simpleFoam links to a few tens of libraries, let’s specifically look for MPI (libmpi*) libraries in the command output:

singularity exec /tmp/sc22-openfoam_v2012.sif bash -c 'ldd $(which simpleFoam) |grep libmpi'
	libmpi.so.12 => /installs/lib/libmpi.so.12 (0x00007f3b982de000)

This is the container MPI installation that was used to build OpenFoam.

How do we set up a bind approach to make use of the host MPI installation?
We can use Singularity-specific environment variables to make the host libraries available in the container (see the location of the host MPI from which mpirun above):

# We need to get the efa library
mkdir -p /tmp/lib
podman run -v /tmp/lib:/tmp/lib scanon/efalibraries:20.04
chmod a+rx -R /tmp/lib

export MPICH_ROOT="/opt/mvapich2-x/gnu11.1.0/mofed/aws/mpirun"
export SINGULARITY_BINDPATH="$MPICH_ROOT"
export SINGULARITYENV_LD_LIBRARY_PATH="$MPICH_ROOT/lib64:/tmp/lib:\$LD_LIBRARY_PATH"

Now, if we inspect the dynamic linking of simpleFoam again:

singularity exec /tmp/sc22-openfoam_v2012.sif bash -c 'ldd $(which simpleFoam) |grep libmpi'
	libmpi.so.12 => /opt/mvapich2-x/gnu11.1.0/mofed/aws/mpirun/lib64/libmpi.so.12 (0x0000145ea2c00000)

Now OpenFoam is picking up the host MPI libraries!

Note that, on an HPC cluster, the same mechanism can be used to expose the host interconnect libraries in the container, to achieve maximum communication performance.
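As a sketch, exposing interconnect libraries works the same way as exposing MPI; the location below is purely hypothetical and entirely system dependent:

export IB_LIBS="/usr/lib64/libibverbs"                                     # hypothetical host path
export SINGULARITY_BINDPATH="$MPICH_ROOT,$IB_LIBS"
export SINGULARITYENV_LD_LIBRARY_PATH="$MPICH_ROOT/lib64:$IB_LIBS:\$LD_LIBRARY_PATH"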

Let’s run OpenFoam in a container!

To get the real feeling of running an MPI application in a container, let’s run a practical example.
We're using OpenFoam, a widely used package for Computational Fluid Dynamics simulations, which can scale on parallel architectures up to thousands of processes by leveraging an MPI library.
The sample inputs come straight from the OpenFoam installation tree, namely $FOAM_TUTORIALS/incompressible/pimpleFoam/LES/periodicHill/steadyState/.

Before getting started, let’s make sure that no previous output file is present in the exercise directory:

./clean-outputs.sh

Now, let’s execute the script in the current directory:

./mpirun_para.sh

This will take a few minutes to run. In the end, you will get the following output files/directories:

ls -ltr
total 1121572
-rwxr-xr-x. 1 tutorial livetau 1148433339 Nov  4 21:40 openfoam_v2012.sif
drwxr-xr-x. 2 tutorial livetau         59 Nov  4 21:57 0
-rw-r--r--. 1 tutorial livetau        798 Nov  4 21:57 slurm_pawsey.sh
-rwxr-xr-x. 1 tutorial livetau        843 Nov  4 21:57 mpirun.sh
-rwxr-xr-x. 1 tutorial livetau        197 Nov  4 21:57 clean-outputs.sh
-rwxr-xr-x. 1 tutorial livetau       1167 Nov  4 21:57 update-settings.sh
drwxr-xr-x. 2 tutorial livetau        141 Nov  4 21:57 system
drwxr-xr-x. 4 tutorial livetau         72 Nov  4 22:02 dynamicCode
drwxr-xr-x. 3 tutorial livetau         77 Nov  4 22:02 constant
-rw-r--r--. 1 tutorial livetau       3497 Nov  4 22:02 log.blockMesh
-rw-r--r--. 1 tutorial livetau       1941 Nov  4 22:03 log.topoSet
-rw-r--r--. 1 tutorial livetau       2304 Nov  4 22:03 log.decomposePar
drwxr-xr-x. 8 tutorial livetau         70 Nov  4 22:05 processor1
drwxr-xr-x. 8 tutorial livetau         70 Nov  4 22:05 processor0
-rw-r--r--. 1 tutorial livetau      18583 Nov  4 22:05 log.simpleFoam
drwxr-xr-x. 3 tutorial livetau         76 Nov  4 22:06 20
-rw-r--r--. 1 tutorial livetau       1533 Nov  4 22:06 log.reconstructPar

We ran using 2 MPI processes, which created outputs in the directories processor0 and processor1, respectively.
The final reconstruction creates results in the directory 20 (which stands for the 20th and last simulation step in this very short demo run), as well as the output file log.reconstructPar.

While execution proceeds, let's ask ourselves: what does running Singularity with MPI look like in the script? Here's the script we're executing:

#!/bin/bash

NTASKS="2"
#image="docker://quay.io/pawsey/sc22-openfoam:v2012"
image=/tmp/sc22-openfoam_v2012.sif

# this configuration depends on the host
export MPICH_ROOT="/usr/local/mvapich4-plus"
export MPICH_LIBS="$( which mpirun )"
export MPICH_LIBS="${MPICH_LIBS%/bin/mpirun*}/lib64/:/tmp/lib"

export SINGULARITY_BINDPATH="$MPICH_ROOT"
export SINGULARITYENV_LD_LIBRARY_PATH="$MPICH_LIBS:\$LD_LIBRARY_PATH"

# pre-processing
singularity exec $image \
  blockMesh | tee log.blockMesh

singularity exec $image \
  topoSet | tee log.topoSet

singularity exec $image \
  decomposePar -fileHandler uncollated | tee log.decomposePar


# run OpenFoam with MPI
mpirun -n $NTASKS \
  singularity exec $image \
  simpleFoam -fileHandler uncollated -parallel | tee log.simpleFoam


# post-processing
singularity exec $image \
  reconstructPar -latestTime -fileHandler uncollated | tee log.reconstructPar

At the beginning, the Singularity variables SINGULARITY_BINDPATH and SINGULARITYENV_LD_LIBRARY_PATH are defined to set up the bind approach for MPI.
Then, a series of OpenFoam commands is executed, only one of which runs in parallel:

mpirun -n $NTASKS \
  singularity exec $image \
  simpleFoam -fileHandler uncollated -parallel | tee log.simpleFoam

That’s as simple as prepending mpirun to the singularity command line, as for any other MPI application.

Singularity interface to Slurm

Now, have a look at the script variant for the Slurm scheduler, slurm_pawsey.sh:

srun -n $SLURM_NTASKS \
  singularity exec $image \
  simpleFoam -fileHandler uncollated -parallel | tee log.simpleFoam

The key difference is that every OpenFoam command is executed via srun, i.e. Slurm's own launcher, which takes the place of mpirun. Other schedulers will require a different command.
In practice, all we had to do was replace mpirun with srun, as for any other MPI application.
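For reference, a minimal batch script around this (a sketch mirroring the benchmark script shown earlier; module names and bind-approach variables depend on your site) might look like:

#!/bin/bash -l
#SBATCH --ntasks=2
#SBATCH --time=00:20:00

module load singularity
image=/tmp/sc22-openfoam_v2012.sif
# bind-approach variables (SINGULARITY_BINDPATH, SINGULARITYENV_LD_LIBRARY_PATH) set as above

srun -n $SLURM_NTASKS \
  singularity exec $image \
  simpleFoam -fileHandler uncollated -parallel | tee log.simpleFoam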

Store the outputs of an RNA assembly pipeline inside an overlay filesystem using containers

There can be instances where, rather than reading/writing files in the host filesystem, it comes in handy to persistently store them inside the container filesystem.
A practical use case is running, on a host parallel filesystem such as Lustre, applications that create a large number (e.g. millions) of small files. This practice puts a huge load on the filesystem's metadata servers, degrading its performance. In this context, significant performance benefits can be achieved by reading/writing these files inside the container.

Singularity offers a feature to achieve this, based on OverlayFS.
Let us cd into the trinity example directory:

cd ~/sc-tutorials/exercises/trinity

And then execute the script run.sh:

./run.sh

It will run for a few minutes as we discuss its contents. The first part of the script defines the container image to be used for the analysis, and creates the filesystem-in-a-file:

#!/bin/bash

image="docker://trinityrnaseq/trinityrnaseq:2.8.6"

# create overlay
export COUNT="200"
export BS="1M"
export FILE="my_overlay"
singularity exec docker://ubuntu:18.04 bash -c " \
  mkdir -p overlay_tmp/upper overlay_tmp/work && \
  dd if=/dev/zero of=$FILE count=$COUNT bs=$BS && \
  mkfs.ext3 -d overlay_tmp $FILE && \
  rm -rf overlay_tmp"

Here, the Linux tools dd and mkfs.ext3 are used to create and format an empty ext3 filesystem in a file, which we are calling my_overlay. These Linux tools typically require sudo privileges to run. However, we can bypass this requirement by using the ones provided inside a standard Ubuntu container. This command looks a bit cumbersome, but it is just an idiomatic way of achieving our goal with Singularity (up to versions 3.7.x).
We have wrapped four commands into a single bash call from a container, just for the convenience of running it once. We've also defined shell variables for better clarity. What are the individual commands doing?
We are creating (and then deleting at the end) two service directories, overlay_tmp/upper and overlay_tmp/work, that will be used by the command mkfs.ext3.
The dd command creates a file named my_overlay, made up of blocks of zeros, namely count blocks of size bs (the unit here is megabytes); the product count*bs gives the total file size, in this case 200 MB. The command mkfs.ext3 is then used to format the file as an ext3 filesystem image, which will be usable by Singularity. Here we are passing the service directory we created, overlay_tmp, with the flag -d, to tell mkfs that we want the filesystem to be owned by the same owner as this directory, i.e. by the current user. If we skipped this option, we would end up with a filesystem that is writable only by root, which would not be very useful.

Note how, starting from version 3.8, Singularity offers a dedicated syntax that wraps around the commands above, providing a simpler interface (here the size must be given in MB):

export SIZE="200"
export FILE="my_overlay"
singularity overlay create --size $SIZE $FILE

The second part uses the filesystem file we have just created. We mount it at container runtime using the flag --overlay followed by the overlay filename, and then create the directory /trinity_out_dir, which will live in the overlay filesystem:


# create output directory in overlay
OUTPUT_DIR="/trinity_out_dir"
singularity exec --overlay my_overlay docker://ubuntu:18.04 mkdir $OUTPUT_DIR

Finally, we are running the Trinity pipeline from the container, with the overlay filesystem mounted, and while telling Trinity to write the output in the directory we have just created:

# run analysis in overlay
singularity exec --overlay my_overlay $image \
  Trinity \
  --seqType fq --left trinity_test_data/reads.left.fq.gz  \
  --right trinity_test_data/reads.right.fq.gz \
  --max_memory 1G --CPU 1 --output $OUTPUT_DIR

When the execution finishes, we can inspect the outputs. Because these are stored in the OverlayFS, we need to use a Singularity container to inspect them:

singularity exec --overlay my_overlay docker://ubuntu:18.04 ls /trinity_out_dir
Trinity.fasta		      both.fa.read_count	       insilico_read_normalization   partitioned_reads.files.list.ok   recursive_trinity.cmds.ok
Trinity.fasta.gene_trans_map  chrysalis			       jellyfish.kmers.fa	     pipeliner.18881.cmds	       right.fa.ok
Trinity.timing		      inchworm.K25.L25.DS.fa	       jellyfish.kmers.fa.histo      read_partitions		       scaffolding_entries.sam
both.fa			      inchworm.K25.L25.DS.fa.finished  left.fa.ok		     recursive_trinity.cmds
both.fa.ok		      inchworm.kmer_count	       partitioned_reads.files.list  recursive_trinity.cmds.completed

Now let’s copy the assembled sequence and transcripts, Trinity.fasta*, in the current directory:

singularity exec --overlay my_overlay docker://ubuntu:18.04 bash -c 'cp -p /trinity_out_dir/Trinity.fasta* ./'

Note how we’re wrapping the copy command within bash -c; this is to defer the evaluation of the * wildcard to when the container runs the command.

We’ve run the entire workflow within the OverlayFS, and got only the two relevant output files out in the host filesystem!

ls -l Trinity.fasta*
-rw-r--r-- 1 ubuntu ubuntu 171507 Nov  4 05:49 Trinity.fasta
-rw-r--r-- 1 ubuntu ubuntu   2818 Nov  4 05:49 Trinity.fasta.gene_trans_map

Key Points

  • Appropriate Singularity/Docker environment variables can be used to configure the bind approach for MPI containers (sys admins can help); Shifter achieves this via a configuration file

  • Singularity, Docker and Shifter interface almost transparently with HPC schedulers such as Slurm

  • MPI performance of containerised applications almost coincides with that of a native run

  • You can run containerised GPU applications with Singularity using the flags --nv or --rocm for Nvidia or AMD GPUs, respectively

  • Singularity and Shifter allow creating and using filesystems-in-a-file, leveraging the OverlayFS technology

  • Mount an overlay filesystem with Singularity using the flag --overlay <filename>

  • Some workflow engines offer transparent APIs for running containerised applications

  • If you need to run data analysis pipelines, the combination of containers and workflow engines can really make your life easier!