
Getting started

Before running YAMP you need to ensure that all the required tools and databases are available on your system (that is, either on your local machine or HPC facility).

Dependencies

To run YAMP you will need to install Nextflow (version 20.10 or higher), as explained here. Please note that Nextflow requires BASH and Java 7+, both of which are already available on most POSIX-compatible systems (Linux, Solaris, OS X, etc.).
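For instance, a typical installation following the Nextflow documentation looks like the commands below (the version checks are just a sanity test; where you place the launcher afterwards is up to you):

# Check that Java is available (Nextflow needs it)
java -version
# Download the Nextflow launcher into the current directory
curl -s https://get.nextflow.io | bash
# Confirm the launcher works
./nextflow -version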

If you are using the containerised version of YAMP (as we strongly suggest), you should also install Docker or Singularity, as explained here and here, respectively.

Once you have either Docker or Singularity up and running, you will not need to install any additional tools. All the pieces of software are already specified in the YAMP pipeline and will be downloaded during the first run. Please refer to our How to use Docker and How to use Singularity tutorials for more details.
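Before launching YAMP, it is worth checking that your container engine actually works. A quick sanity check, depending on which engine you installed, could be:

# Docker: version check plus a minimal test container
docker --version
docker run hello-world

# Singularity: version check
singularity --version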

For expert users only

If you do not want to use the containerised version of YAMP, you must install the following pieces of software:

All of them should be available in the system path, with read and execute permissions.

Following the links above, you will find detailed installation instructions provided by the tools' developers. Notably, many of these tools are also available on bioconda.
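As a sketch only (the exact tool set depends on the YAMP version you are using, so treat the package list below as an example rather than a definitive list), a bioconda-based installation could look like:

# Create an isolated environment with (some of) the tools YAMP relies on
conda create -n yamp-tools -c conda-forge -c bioconda fastqc bbmap metaphlan humann multiqc
conda activate yamp-tools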

External databases

YAMP requires a set of databases that are queried during its execution. Some of them ship with YAMP, others are downloaded automatically either the first time you use the tool (MetaPhlAn) or via specialised scripts (HUMAnN), and others should be created by the user. Specifically, you will need:

  • A FASTA file listing the adapter sequences to remove in the trimming step. A basic version is provided in this repository (./assets/data/adapters.fa), but please note that this file may need to be customised (see the sketch after this list).
  • Two FASTA files describing synthetic contaminants. Basic versions are provided in this repository (./assets/data/sequencing_artifacts.fa.gz and ./assets/data/phix174_ill.ref.fa.gz), but please note that both may need to be customised.
  • A FASTA file describing the contaminant (pan)genome. This file should be created by the users according to the contaminants present in their dataset. When analysing human metagenomes, we recommend always including the human genome. We suggest downloading the FASTA file provided by Brian Bushnell for removing human contamination, using the instructions available here and/or in the section below.
  • The Bowtie2 database files for MetaPhlAn. These files are downloaded the first time you run MetaPhlAn. Please refer to their webpage for details, or to the section below for an alternative approach.
  • The ChocoPhlAn and UniRef databases for HUMAnN. Both can be downloaded directly by HUMAnN. Please refer to their webpage for details and/or to the section below.
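For instance, if your libraries use an adapter that is not in the bundled file, you can simply append it. The record below is purely illustrative (both the name and the sequence are placeholders for your own adapter):

# Append a custom adapter record to the bundled adapter file
cat >> ./assets/data/adapters.fa <<'EOF'
>my_custom_adapter
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
EOF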

Please note that we are no longer providing all these files on Zenodo. If this is a problem for you, please open an issue and we will try to help.

We now provide demo datasets to allow our tests to run (YAMP v0.9.5+). However, no demo set is available for MetaPhlAn; please refer to the section below for more details.

Notes on the contaminant human genome

We have made available the FASTA file for removing human contamination provided by Brian Bushnell at the following link. If you want to use the standard YAMP config, please save this file in ./assets/data/, using the instructions below (which assume you are in the YAMP repository folder). You can find more information regarding the folder layout here.

cd ./assets/data/
wget https://zenodo.org/record/4629921/files/hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz
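Since this is a fairly large download, you may want to check that the compressed archive is intact before using it, for example:

# Test the gzip archive for corruption or truncation
gzip -t hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz && echo "archive OK"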

YAMP allows you to provide a FASTA file describing the contaminant (pan)genome, which will be indexed automatically. However, to save time, you can index it beforehand using the following command:

cd ./assets/data/
bbmap.sh -Xmx24G ref=my_foreign_genome.fa.gz

(this command assumes that the FASTA file is saved in the folder above).
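If the indexing succeeded, BBMap will have created a ref/ subfolder in the current working directory (that is, ./assets/data/ref when following the commands above), which you can quickly confirm:

# The BBMap index lives in the ref/ subfolder of the working directory
ls ./ref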

You can then set the path to the index in the foreign_genome_ref parameter (while leaving the foreign_genome parameter empty) in your configuration file(s), as:

foreign_genome = ""
foreign_genome_ref = "$baseDir/assets/data/ref"

(more on configuration files here)
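As a sketch (my_custom.config is a placeholder for your own configuration file, and the profile name assumes your setup defines one for Docker), a customised configuration can then be passed to Nextflow on the command line:

# Run YAMP with a custom configuration file
nextflow run YAMP.nf -c my_custom.config -profile docker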

Notes on the MetaPhlAn databases

To allow the tests to run, we have provided some demo datasets. However, no demo set was available for MetaPhlAn: GitHub limits individual files in a repository to a maximum size of 100 MB, which makes it impossible to host the MetaPhlAn databases here. Therefore, to help users who just want to try YAMP and/or do not wish to deal with the MetaPhlAn auto-download procedure, we have provided them here (version: mpa_v30_CHOCOPhlAn_201901). Please download the tar file into the ./assets/data/ folder and decompress it there. You can use the following commands (which assume you are in the YAMP repository folder):

cd ./assets/data/
wget https://zenodo.org/record/4629921/files/metaphlan_databases.tar.gz
tar -xzf metaphlan_databases.tar.gz
  • Please note: this file is about 2 GB in size!
  • Please remember: these are the actual MetaPhlAn databases (not demo files) and can be used in all your analyses!
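Once the archive has been extracted successfully, you can optionally reclaim about 2 GB of disk space by removing the tarball:

# Remove the downloaded archive once its contents are in place
rm metaphlan_databases.tar.gz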

Notes on the HUMAnN databases

HUMAnN provides a very helpful and easy-to-use command to download the required databases, namely humann_databases. This command requires users to specify which database to download (ChocoPhlAn and UniRef) and where to save them (./assets/data/chocophlan and ./assets/data/uniref, if you are using the layout used in the YAMP tutorials and config files, described here).

humann_databases is included in the Docker/Singularity container specified in nextflow.config; therefore, you can download the databases using the commands below.

If you are using Docker. These commands assume that i) Docker is installed and running, ii) the biobakery image is available (you can pull it with: docker pull biobakery/workflows:3.0.0.a.6.metaphlanv3.0.7), iii) you are using the suggested folder layout, and iv) you are running the commands from the YAMP home folder.

docker container run --volume $HOME:$HOME --workdir $PWD -it biobakery/workflows:3.0.0.a.6.metaphlanv3.0.7 \
   humann_databases --download chocophlan full ./assets/data/chocophlan
docker container run --volume $HOME:$HOME --workdir $PWD -it biobakery/workflows:3.0.0.a.6.metaphlanv3.0.7 \
   humann_databases --download uniref uniref90_diamond ./assets/data/uniref

If you are using Singularity. These commands assume that i) Singularity is installed and running, ii) the biobakery image is available (you can pull it with: singularity pull docker://biobakery/workflows:3.0.0.a.6.metaphlanv3.0.7), iii) you are using the suggested folder layout, and iv) you are running the commands from the YAMP home folder.

singularity run biobakery/workflows:3.0.0.a.6.metaphlanv3.0.7 \
   humann_databases --download chocophlan full ./assets/data/chocophlan
singularity run biobakery/workflows:3.0.0.a.6.metaphlanv3.0.7 \
   humann_databases --download uniref uniref90_diamond ./assets/data/uniref

The time required to download these databases will depend upon the speed of your internet connection.
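Once they finish, a quick way to confirm that the databases landed in the expected folders (assuming the suggested layout) is to list their contents:

# Both folders should be non-empty after a successful download
ls ./assets/data/chocophlan | head
ls ./assets/data/uniref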

Notes on the Zenodo data file and backward compatibility

If you have downloaded the Zenodo file provided with YAMP versions prior to 0.9.5 (published here), please note that this file is not only no longer compatible but also mostly outdated, and should not be used.