
Documentation

Installation

Follow the Installation guidelines to compile and install mpiBWA.

Prerequisites

As mpiBWA relies on the Message Passing Interface (MPI) standard, mpirun must be available in your PATH to run the program. Several MPI implementations exist, such as MPICH, Open MPI or the Intel® MPI Library.

Install mpirun on CentOS

  • for open-mpi: sudo yum install openmpi openmpi-devel
  • for mpich: sudo yum install mpich

Install mpirun on Ubuntu

  • for open-mpi: sudo apt-get install openmpi-bin
  • for mpich: sudo apt-get install mpich
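Once an MPI implementation is installed, a quick sanity check that mpirun is visible on your PATH might look like this (the exact version string depends on the implementation):

# Check that an MPI launcher is available
which mpirun
mpirun --version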

Usage

We have 2 versions of mpiBWA:

  1. the first version creates a single SAM file and is available in the binary mpiBWA
  2. the second version splits the output by the chromosomes present in the header and is available in the binary mpiBWAByChr

After installation, 3 binaries are created:

  1. mpiBWA
  2. mpiBWAByChr
  3. mpiBWAIdx
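After the build, a quick hedged check that the three binaries are reachable (assuming they were installed into a directory on your PATH):

# Verify that the three binaries are on the PATH
which mpiBWA mpiBWAByChr mpiBWAIdx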

mpiBWAIdx

mpiBWAIdx creates a binary image of the reference genome. This image is subsequently loaded into shared memory by mpiBWA, so every mpiBWA process on the same computing node shares the same reference genome, which saves memory. mpiBWAIdx does not need MPI to run. To create a binary image from a reference genome, type:

mpiBWAIdx myReferenceGenome.fa

This creates the file myReferenceGenome.fa.map.

Importantly, mpiBWAIdx requires the following files to build the .map file: myReferenceGenome.fa.sa, myReferenceGenome.fa.bwt, myReferenceGenome.fa.ann, myReferenceGenome.fa.pac and myReferenceGenome.fa.amb. These files are generated by bwa index (see the bwa documentation). It also works with .fasta files.

The .map file needs only to be generated once.

The genome.sh script provides an example to build the .map file.
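For reference, here is a minimal sketch of the full indexing workflow, assuming bwa and mpiBWAIdx are on your PATH (the file name is a placeholder; genome.sh is the project's own example):

# Build the standard bwa index files (.sa, .bwt, .ann, .pac, .amb)
bwa index myReferenceGenome.fa

# Build the binary image that mpiBWA loads into shared memory
mpiBWAIdx myReferenceGenome.fa     # produces myReferenceGenome.fa.map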

mpiBWA

mpiBWA is executed with the mpirun program, for example:

mpirun -n 2 mpiBWA mem -t 8 -o ./HCC1187C.sam hg19.small.fa examples/data/HCC1187C_R1_10K.fastq examples/data/HCC1187C_R2_10K.fastq

This command launches 2 MPI processes, and each mpiBWA process creates 8 bwa threads.

WARNING: do not write the .map extension for the reference. If the file is myReferenceGenome.fa.map, just pass myReferenceGenome.fa to mpiBWA on the command line.

The -n option passed to mpirun indicates the number of processes to run in parallel. The total number of cores required is the product of the values provided in the -n and -t options (in the previous example, 16 cores are required). For more details on how to choose the number of processes, see the Informatic resources section.

mpiBWA requires several mandatory arguments:

  • Input: a reference genome in .map format and one or two fastq files (trimmed or not)
  • Output: a SAM file containing the entire reference header (e.g. HCC1187C.sam)

mpiBWAByChr

If you want to split the results of the alignment by chromosome, use mpiBWAByChr, for example:

mpirun -n 2 mpiBWAByChr mem -t 8 -o ./HCC1187C.sam hg19.small.fa examples/data/HCC1187C_R1_10K.fastq examples/data/HCC1187C_R2_10K.fastq

mpiBWAByChr requires several mandatory arguments:

  • Input: a reference genome in .map format and one or two fastq files (trimmed or not)
  • Output: a SAM file containing the entire reference header, one SAM file per chromosome, discordant.sam and unmapped.sam

Input

  • a reference genome in .map format (but do not write the extension '.map' for the reference, e.g. myReferenceGenome.fa)
  • fastq file for R1 (trimmed or not)
  • fastq file for R2 (trimmed or not) if the data come from paired-end sequencing (optional)

Options

  • bwa mem options can be passed from the command line (e.g. mpiBWA mem -t 8 -k 18)
  • -f: add this option if you want to fix the mates during the mapping (optional)
  • -b: add this option to write output files in BAM format
  • -g: add this option to write output files in BGZF format
  • -z: add this option to place the reference genome in a NUMA domain. Possible values are shared (default), socket, numa, l1, l2 and l3

example1: mpirun -n 2 mpiBWAByChr mem -t 8 -b -f -o SAM REF FASTQ1 FASTQ2

The -f option adds the mate CIGAR and mate quality tags to each mate. This is equivalent to "samtools fixmate -m" and allows duplicates to be marked with samtools markdup in the downstream analysis. The overhead is negligible as this part is multithreaded.
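As an illustration, here is a hedged sketch of the downstream duplicate marking that -f enables, assuming samtools is installed and using placeholder file names (the output must be coordinate-sorted before markdup):

# Sort the mpiBWA output by coordinate
samtools sort -@ 8 -o HCC1187C.sorted.bam HCC1187C.sam

# Mark duplicates; the mate tags written by -f make a separate "samtools fixmate -m" pass unnecessary
samtools markdup -@ 8 HCC1187C.sorted.bam HCC1187C.markdup.bam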

example2: mpirun -n 2 --map-by numa mpiBWA mem -t 8 -z numa -o SAM REF FASTQ1 FASTQ2

The -z numa option places the reference genome in the NUMA domain. Each MPI job is placed in a NUMA node (--map-by numa) and the reference genome is loaded in the memory of that domain. NUMA domains matter for memory bandwidth optimization (see the Memory section in Informatic resources).

Output

In the case of mpiBWA:

  • a sam file containing the entire reference header

In the case of mpiBWAByChr, additional files are provided:

  • Individual chrN.sam files with the reads aligned to each chrN (where chrN is a chromosome name from the header). Each chrN.sam contains the header for chrN and the reads mapping to that chromosome, including primary and supplementary alignments; discordant supplementary mappings are not filtered out. The file can be sorted independently with mpiSORT, and during the sorting these supplementary alignments are filtered out into the discordant.gz file (see the mpiSORT documentation).

  • The file discordant.sam contains primary chimeric alignments with their secondary alignments.

  • The unmapped.sam contains unmapped reads.

Note that:

  1. Supplementary or secondary alignments can be ignored with the -M bwa-mem option passed to mpiBWA. In all cases they are filtered out by mpiSORT.

  2. The discordant fragments are not extracted from the chromosome files; they are just copied. The purpose of the discordant SAM file is to help mark duplicates in the downstream analysis: it is easier to mark them separately and pass the result back to the chromosome files.

Informatic resources

Memory

The total memory used during the alignment is approximately the size of the .map file plus the size of the SAM chunks loaded by each bwa task. A bwa thread takes around 300 MB of memory. To optimize access to the reference genome, you can place the reference in a NUMA domain. To get the number of NUMA domains, use lstopo.

For instance, on an AMD Milan architecture:

lstopo | grep NUMA
NUMANode L#0 (P#0 31GB)
NUMANode L#1 (P#1 31GB)
NUMANode L#2 (P#2 31GB)
NUMANode L#3 (P#3 31GB)
NUMANode L#4 (P#4 31GB)
NUMANode L#5 (P#5 31GB)
NUMANode L#6 (P#6 31GB)
NUMANode L#7 (P#7 31GB)

Each NUMANode has 16 associated cores.

We thus have 8 NUMANodes of 31 GB each on the same compute node, which is enough for the human reference genome. Place each MPI job in a NUMANode (mpirun -n 8 --map-by numa) and load the reference genome in the associated memory ($MPIBWA mem -t 16 -z numa -o $OUTPUT $REF $FASTQ1 $FASTQ2), as in the sketch below.
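Putting the two pieces together, a hedged sketch of the launch on such a node could look like this ($MPIBWA, $OUTPUT, $REF, $FASTQ1 and $FASTQ2 are placeholders):

# 8 MPI jobs, one per NUMA domain, 16 bwa threads each (8 x 16 = 128 cores)
mpirun -n 8 --map-by numa $MPIBWA mem -t 16 -z numa -o $OUTPUT $REF $FASTQ1 $FASTQ2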

NB: This feature has only been tested with openMPI.

Cpu

The number of cores used is the product of the number of MPI ranks and the number of bwa threads requested with the -t option. For example, the command line mpirun -n 4 mpiBWA mem -t 8 -o HCC1187C.sam hg19.small.fa examples/data/HCC1187C_R1_10K.fastq examples/data/HCC1187C_R2_10K.fastq will use 32 cores, as shown below.
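For clarity, the same command written out as a block, with the core count in a comment:

# 4 MPI processes x 8 bwa threads each = 32 cores in total
mpirun -n 4 mpiBWA mem -t 8 -o HCC1187C.sam hg19.small.fa examples/data/HCC1187C_R1_10K.fastq examples/data/HCC1187C_R2_10K.fastq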

Benchmark

This section provides guidelines for benchmarking mpiBWA and bwa on your infrastructure. It is intended to help you find the use case and configuration that benefit most from MPI multithreading on your computing cluster. We strongly recommend reading this section carefully before running mpiBWA on your cluster.

Assess the performance baseline with the standard bwa

1 thread on 1 node:

bwa mem -t 1
[M::mem_process_seqs] Processed 40246 reads in 23.303 CPU sec, 23.367 real sec

8 threads on 1 node:

bwa mem -t 8
[M::mem_process_seqs] Processed 322302 reads in 199.261 CPU sec, 25.104 real sec

16 threads on 1 node:

bwa mem -t 16
[M::mem_process_seqs] Processed 644448 reads in 413.054 CPU sec, 26.000 real sec

Assess the performance baseline with the standard mpiBWA

Start with one node and one bwa thread:

1 thread on 1 node:

mpirun -n 1 mpiBWA mem -t 1
[M::mem_process_seqs] Processed 40224 reads in 25.779 CPU sec, 25.840 real sec

Now use several nodes and one bwa thread:

1 thread per node on 8 nodes:

mpirun -N 8 -npernode 1 -n 8 mpiBWA mem -t 1
[M::mem_process_seqs] Processed 40244 reads in 24.416 CPU sec, 24.475 real sec

So far, the walltime for both bwa and mpiBWA is pretty much the same (~25 sec) whatever the configuration.

mpiBWA MEM + multithreads

Now we go further with the parallelization and increase the number of bwa threads per MPI job.

10 threads on 1 node:

mpirun -N 1 -n 1 mpiBWA mem -t 10
[M::mem_process_seqs] Processed 402610 reads in 257.005 CPU sec, 25.803 real sec

20 threads on 1 node:

mpirun -N 1 -n 1 mpiBWA mem -t 20
[M::mem_process_seqs] Processed 804856 reads in 546.449 CPU sec, 27.477 real sec

mpiBWA MEM + multithreads + multi nodes

Now you can increase the number of nodes.

20 threads on 2 nodes:

mpirun -N 2 -npernode 1 -n 2 --bind-to socket mpiBWA mem -t 10
[M::mem_process_seqs] Processed 403144 reads in 260.081 CPU sec, 26.114 real sec

40 threads on 2 nodes:

mpirun -N 2 -npernode 1 -n 2 --bind-to socket mpiBWA mem -t 20
[M::mem_process_seqs] Processed 805198 reads in 549.086 CPU sec, 27.610 real sec

Still, the walltime is pretty much the same (~25 sec) whether you use 1 node or 2 nodes. You can then repeat the experiment with 3 or more nodes.

Conclusion

In this benchmark example, running mpiBWA with 10 threads per MPI job seems a good configuration. We notice a small increase in walltime when all the cores of a node are used, so we recommend leaving some cores for the system. Explore the mpirun options, as we did with --bind-to socket; this can help remove contention such as NUMA effects. Also test the setup with mpiBWAByChr.

Examples

There are many ways to distribute and bind MPI jobs depending on your architecture. We provide several examples below for launching MPI jobs in a standard manner or with a job scheduling system such as Slurm or PBS/Torque.

Two toy datasets (fastq files trimmed or not) are provided in the examples/data folder for testing the program.

Standard

mpirun can be launched in a standard manner without using any job scheduling systems. For example:

mpirun -n 4 mpiBWA mem -t 8 -o ${HOME}/mpiBWAExample/HCC1187C.sam ${HOME}/mpiBWAExample/hg19.small.fa examples/data/HCC1187C_R1_10K.fastq examples/data/HCC1187C_R2_10K.fastq

If needed, the server names can be provided to mpirun with the -host option (or via a host file). We invite you to read the mpirun documentation for more details.

You can go to the examples directory and execute the standard.sh script to test the program.

Slurm

In order to submit a job using Slurm, you can write a shell script as follows:

#! /bin/bash
#SBATCH -J MPIBWA_32_JOBS
#SBATCH -N 2                            # Ask 2 nodes
#SBATCH -n 2                            # total number of mpi jobs 
#SBATCH -c 16                           # use 16 cores per mpi job
#SBATCH --tasks-per-node=1              # Ask 1 mpi job per node
#SBATCH --mem-per-cpu=${MEM}            # See the Memory section in Informatic resources
#SBATCH -t 01:00:00
#SBATCH -o STDOUT_FILE.%j.o
#SBATCH -e STDERR_FILE.%j.e

mpirun mpiBWA mem -t 16 -o ${HOME}/mpiBWAExample/HCC1187C.sam ${HOME}/mpiBWAExample/hg19.small.fa examples/data/HCC1187C_R1_10K.fastq examples/data/HCC1187C_R2_10K.fastq

You can go to the examples directory and submit the job with sbatch using the slurm.sh script to test the program.

PBS/Torque

In order to submit a job using PBS/Torque, you can write a shell script as follows:

#! /bin/bash
#PBS -N MPIBWA_32_JOBS
#PBS -l nodes=2:ppn=16:mem=${MEM}        # Ask 2 nodes and 16 cores per node
#PBS -l walltime=24:00:00
#PBS -o STDOUT_FILE.%j.o
#PBS -e STDERR_FILE.%j.e

mpirun -n 2 mpiBWA mem -t 16 -o ${HOME}/mpiBWAExample/HCC1187C.sam ${HOME}/mpiBWAExample/hg19.small.fa examples/data/HCC1187C_R1_10K.fastq examples/data/HCC1187C_R2_10K.fastq

You can go to the examples directory and submit the job with the qsub command using the pbs.sh script to test the program.

Parallel filesystems

This software needs a parallel filesystem for execution. The program has been tested with Lustre and BeeGFS.

WARNING: be aware that flock mode (i.e. file locking) must be enabled on parallel filesystems (Lustre, BeeGFS, etc.); otherwise, reproducibility is not guaranteed.
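As a quick, hedged check on a Lustre client (whether flock appears depends on how your site mounted the filesystem), you can inspect the mount options:

# List mounted Lustre filesystems and look for "flock" in the mount options
mount -t lustre
# or
grep lustre /proc/mounts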

Algorithm

The algorithm consists of 3 parts:

  1. MPI jobs create chunks of reads. All these chunks contain the same number of nucleotides.
  2. MPI jobs call the aligner. This part invokes the BWA-MEM algorithm.
  3. MPI jobs write the alignment results in a SAM file or in individual chromosome SAM files. This part uses shared file pointers.

FAQ

Why doesn't mpiBWA take fastq.gz as input?

Because of the DEFLATE format, gunzip cannot be parallelized and therefore becomes a bottleneck.
One option is to use bzip2 for better compression of the fastq files (on average 20% better) and pbzip2, mpibzip2 or unpigz to stage in the fastqs in parallel.
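For illustration, a hedged sketch of staging compressed fastqs in parallel before calling mpiBWA (file names are placeholders; pbzip2 and unpigz must be installed):

# Decompress bzip2-compressed fastqs with several threads (here 8)
pbzip2 -d -p8 sample_R1.fastq.bz2 sample_R2.fastq.bz2

# Or, for gzip-compressed fastqs, decompress with unpigz
unpigz sample_R1.fastq.gz sample_R2.fastq.gz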

Can you give an example of how to pin/map jobs on a multi-NUMA-node architecture?

Here is an example tested on a 2-socket AMD EPYC with 8 NUMA nodes of 18 cores each:

#MSUB -r MPI.MPIBWA.1NODE.128CPU               
#MSUB -N 1  
#MSUB -n 8 
#MSUB --tasks-per-node=8 
#MSUB -c 16 
#MSUB -t 6000 
#MSUB -x   
#MSUB -o test.HG002.HiSeq30x.mpi.bwa.%I.o            
#MSUB -e test.HG002.HiSeq30x.mpi.bwa.%I.e    

mpirun --map-by 'numa:PE=16' --bind-to core numactl -N 0,1,2,3,4,5,6,7 -l mpiBWA mem -t 16 -f -z numa -Y -K 100000000 -o HG002.HiSeq30x hg19.fasta HG002.HiSeq30x.subsampled.R1.fastq HG002.HiSeq30x.subsampled.R2.fastq

References

This work is based on the original bwa aligner (Burrows-Wheeler Aligner for short-read alignment) written by Li et al.:

Shared memory with MPI: