3 Setup

Clone the Tourmaline directory

Download the Tourmaline directory and files.

Every time you run Tourmaline on a new dataset—or with new parameters on the same dataset—you will clone (download) a fresh copy of the Tourmaline GitHub repository (directory and contents) to your computer. This allows Snakemake to run smoothly without having to account for previous inputs and outputs. If you don't want to save the output from a previous run, it is fine to delete the output and re-run the Tourmaline commands in the same directory.

Navigate to your project directory and clone the Tourmaline repository there. If using the Docker container, a good place to use as your project directory is /data:

cd /data # or your project directory
git clone https://github.com/aomlomics/tourmaline.git

Rename the tourmaline directory to something else before working in it. We recommend adding the name of the project (e.g., tourmaline-test or tourmaline-myproject) and/or the date (e.g., tourmaline-test-YYYYMMDD). For example:

mv tourmaline tourmaline-test

Initialize from an existing Tourmaline directory

Set up a new Tourmaline run with the parameters from a previous run (optional).

If you want to re-run Tourmaline on an existing dataset with a new set of parameters (and keep the output from your old parameters), you'll want to clone a fresh copy of the Tourmaline GitHub repository but then initialize it with all the parameters, files, and links you set up before.

Script to initialize Tourmaline directory

After you clone a fresh copy of the Tourmaline GitHub repository, go inside that newly downloaded Tourmaline directory and run the script initialize_dir_from_existing_tourmaline_dir.sh. This will copy config.yaml from an existing Tourmaline directory, remove the test files, then copy the data files and symlinks from the existing Tourmaline directory.

From the new Tourmaline top-level directory, run this:

scripts/initialize_dir_from_existing_tourmaline_dir.sh /path/to/tourmaline-existing

You may get error messages if some of these files don't exist, but the script will still copy the files that do. The files that will be copied from the existing directory to the new directory are:

config.yaml
00-data/manifest_pe.csv
00-data/manifest_se.csv
00-data/metadata.tsv
00-data/repseqs_to_filter_dada2-pe.tsv
00-data/repseqs_to_filter_dada2-se.tsv
00-data/repseqs_to_filter_deblur-se.tsv
01-imported/refseqs.qza
01-imported/reftax.qza
01-imported/classifier.qza

Check all the files. If everything looks good, you can breeze through the rest of the setup; you only need to change the parameters in config.yaml that you want to update for this run. In some cases you may also want to delete files that you want to be regenerated.
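
For example, a quick way to confirm what was copied is to list the default locations (the directory name tourmaline-new is hypothetical; use your own):

cd /path/to/tourmaline-new
ls -l config.yaml 00-data/ 01-imported/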

Config file

Edit the config file to contain the file paths and parameters you want to use.

The configuration file—named config.yaml and residing in the top-level directory—contains all of the parameters that are editable for a Tourmaline run. Always make sure the parameters are appropriate for your dataset. (Note: If you are comfortable editing Python code, it is possible to change more parameters by editing Snakefile.)

Test data: review the config file

The configuration file config.yaml comes ready to run with the test data. Note: Because the sequencing depth of the test dataset is very low (1000 sequences per sample) and the sequences are long (2x300 bp; 269 and 268 bp after trimming primers) relative to the sequenced amplicon length (~316 bp), several of the parameters in the config file may not be appropriate for a normal run of Tourmaline.

Below are the full contents of config.yaml with the default settings:

# config.yaml - configuration file for the Tourmaline Snakemake workflow
# - the user MUST edit these parameters before running with their own data
# - see https://github.com/aomlomics/tourmaline/wiki for detailed instructions

# metadata file
metadata: 00-data/metadata.tsv

# fastq data
# option 1 - if your fastq sequences are not yet imported, the parameters below will be used to find your manifest file(s)
manifest_pe: 00-data/manifest_pe.csv
manifest_se: 00-data/manifest_se.csv
# option 2 - if your fastq sequences are already in qza format, the parameters below will be used to find your qza file(s)
fastq_pe_qza: 01-imported/fastq_pe.qza
fastq_se_qza: 01-imported/fastq_se.qza

# reference database
# option 1 - if your reference database is not yet imported, the parameters below will be used to find the fna and tsv files
refseqs_fna: 00-data/refseqs.fna
reftax_tsv: 00-data/reftax.tsv
# option 2 - if your reference database is already in qza format, the parameters below will be used to find your qza file(s)
refseqs_qza: 01-imported/refseqs.qza
reftax_qza: 01-imported/reftax.qza

# dada2 paired-end
# - denoising parameters for command "qiime dada2 denoise-paired"
dada2pe_trunc_len_f: 240
dada2pe_trunc_len_r: 190
dada2pe_trim_left_f: 0
dada2pe_trim_left_r: 0
dada2pe_max_ee_f: 2
dada2pe_max_ee_r: 2
dada2pe_trunc_q: 2
dada2pe_pooling_method: independent
dada2pe_chimera_method: consensus
dada2pe_min_fold_parent_over_abundance: 1
dada2pe_n_reads_learn: 1000000
dada2pe_hashed_feature_ids: --p-hashed-feature-ids

# dada2 single-end
# - denoising parameters for command "qiime dada2 denoise-single"
dada2se_trunc_len: 240
dada2se_trim_left: 0
dada2se_max_ee: 2
dada2se_trunc_q: 2
dada2se_pooling_method: independent
dada2se_chimera_method: consensus
dada2se_min_fold_parent_over_abundance: 1
dada2se_n_reads_learn: 1000000
dada2se_hashed_feature_ids: --p-hashed-feature-ids

# deblur parameters
# - denoising parameters for command "qiime deblur denoise-other"
deblur_trim_length: 240
deblur_sample_stats: --p-sample-stats
deblur_mean_error: 0.005
deblur_indel_prob: 0.01
deblur_indel_max: 3
deblur_min_reads: 10
deblur_min_size: 2
deblur_hashed_feature_ids: --p-hashed-feature-ids

# multiple sequence alignment
# - method: choose from: muscle, clustalo, mafft
# - muscle_maxiters: maximum number of iterations (integer, default 16)
# - muscle_diags: choose from: -diags (find diagonals - faster for similar sequences) or leave blank
alignment_method: muscle
alignment_muscle_maxiters: 2
alignment_muscle_diags: -diags

# taxonomic classification
# - method: choose from: naive-bayes, consensus-blast, consensus-vsearch
# - parameters: add arbitrary parameters or at least one parameter; see "qiime feature-classifier COMMAND --help"
classify_method: consensus-vsearch
classify_parameters: --verbose

# filtering - implemented by filtering commands, eg "snakemake dada2_pe_report_filtered", the parameters below are applied to:
#     1. representative sequences
#     2. taxonomy
#     3. feature table

# filter by featureid
# - these feature metadata tsv files have a header line with two columns: 1. "featureid", 2. anything
# - to create a feature metadata file for filtering repseqs:
#     1. go to 02-output-{method}-{filter}/02-alignment-tree
#     2. merge (removing duplicates) or copy repseqs_to_filter_outliers.tsv and/or repseqs_to_filter_unassigned.tsv
#     3. rename to the file name below corresponding to your method:
#          00-data/repseqs_to_filter_dada2-pe.tsv
#          00-data/repseqs_to_filter_dada2-se.tsv
#          00-data/repseqs_to_filter_deblur-se.tsv
# - these paths are hard-coded (they must match those above), not specified in config.yaml
# - to skip filtering by feature ID: provide a nonsense feature ID or don't run filtering commands

# filter by taxonomy
# - separate terms with commas (e.g., mitochondria,chloroplast,eukaryota)
# - terms are not case-sensitive
# - to skip filtering by taxonomy: provide a nonsense term or don't run filtering commands
exclude_terms: xxxxxxxx

# filter by length
# - increase minimum and/or decrease maximum to impose length limits for filtering
# - leave defaults (0, 10000) to do no filtering
# - limits are inclusive, ie, greater than or equal to minimum, less than or equal to maximum
repseq_min_length: 251
repseq_max_length: 255

# filter by abundance and prevalence
# - increase minimum and/or decrease maximum to impose abundance/prevalence limits for filtering
# - leave defaults (0, 1) to do no filtering
# - ranges are inclusive, ie, greater than or equal to minimum, less than or equal to maximum
repseq_min_abundance: 0
repseq_min_prevalence: 0

# representative sequence outlier detection using odseq
# - distance_metric: choose from: linear, affine
# - bootstrap_replicates: number of bootstrap replicates (integer)
# - threshold: probability to be at right of the bootstrap scores distribution (float)
odseq_distance_metric: linear # choose from: linear, affine
odseq_bootstrap_replicates: 100
odseq_threshold: 0.025

# subsampling (rarefaction)
# - core_sampling_depth: subsampling depth for core diversity analyses
# - alpha_max_depth: maximum subsampling depth for alpha diversity rarefaction analysis
core_sampling_depth: 500
alpha_max_depth: 500

# beta group significance
# - column: metadata column to test beta group significance
# - method: choose from: permanova, anosim, permdisp
# - pairwise: choose from: --p-no-pairwise, --p-pairwise (can be very slow)
beta_group_column: region
beta_group_method: permanova
beta_group_pairwise: --p-pairwise

# report theme for html version
# - choose from: github, gothic, newsprint, night, pixyll, whitey
report_theme: github

# threads
# - max number of threads for individual steps
# - threads used will be lower of this and snakemake parameter --cores
dada2pe_threads: 8
dada2se_threads: 8
deblur_threads: 8
alignment_threads: 8
feature_classifier_threads: 8
phylogeny_fasttree_threads: 8
diversity_core_metrics_phylogenetic_threads: 8

Your data: edit parameters customized to your data and analysis

Edit the configuration file config.yaml to change DADA2/Deblur parameters, subsampling (rarefaction) depth, multiple sequence alignment method, taxonomic classification method, and other parameters. We recommend that you don't change the input filenames in the sections "fastq data" and "reference database" but rather use symbolic links with those filenames instead. You should consider changing the following parameters:

Parameter Description Recommendation Help
dada2pe_trunc_len_f dada2pe_trunc_len_r dada2se_trunc_len Truncate bases (integer) from the 3' (right) ends of reads in DADA2. Choose values that maximize length but remove low-quality ends. Note that DADA2 paired-end mode requires a minimum overlap of 12 bp to merge Read 1 and Read 2. See the section below "Sequence quality control and choice of truncation length" for instructions on using the included script fastqc_per_base_sequence_quality_dropoff.py. dada2 denoise-paired; dada2 denoise-single
dada2pe_trim_left_f dada2pe_trim_left_r dada2se_trim_left Trim bases (integer) from the 5' (left) ends of reads in DADA2. Depending on your amplicon sequencing method, and if trimming was not done prior to running Tourmaline, you may have primer sequences, indexes, and/or adapters on the 5' ends of your reads. If so, set this parameter to remove those bases. If not, set this parameter to zero. Note that 5' trimming (this parameter) is done after 3' truncation (above parameter). dada2 denoise-paired; dada2 denoise-single
deblur_trim_length Truncate bases (integer) from the 3' (right) ends of reads in Deblur. Choose values that maximize length but remove low-quality ends. See the section below "Sequence quality control and choice of truncation length" for instructions on using the included script fastqc_per_base_sequence_quality_dropoff.py. deblur denoise-other
dada2pe_pooling_method dada2se_pooling_method DADA2 pooling method. Choose pseudo for pseudo-pooling or independent for no pooling. dada2 denoise-paired; dada2 denoise-single
dada2pe_chimera_method dada2se_chimera_method DADA2 chimera method. Choose pooled if using pseudo-pooling, otherwise consensus or none. dada2 denoise-paired; dada2 denoise-single
alignment_method Multiple sequence alignment method. Choose muscle or clustalo for best accuracy or mafft for faster results. muscle; clustalo; mafft
classify_method Taxonomic classification method. Choose naive-bayes for best accuracy or consensus-blast for faster results. feature-classifier
exclude_terms Filter terms (taxa) from taxonomy. Specify terms (comma-separated, no spaces) to find in taxonomy and filter out (case-insensitive), or provide a nonsense term to skip this step when filtering. taxa filter-seqs
repseq_min_length repseq_max_length Set minimum and maximum sequence lengths to filter representative sequences by. Limits are inclusive, i.e., sequences will be retained if greater than or equal to minimum, less than or equal to maximum. Leave defaults (0, 10000) to do no filtering. link
repseq_min_abundance repseq_min_prevalence Set minimum abundance and prevalence limits to filter representative sequences by. Limits are inclusive, i.e., sequences will be retained if greater than or equal to minimum. Leave defaults (0) to do no filtering.
odseq_distance_metric Distance metric for odseq. Choose metric from: linear, affine. odseq
odseq_bootstrap_replicates Number (integer) of bootstrap replicates for odseq. Choose more replicates for more robust detection of outliers, fewer replicates for faster processing. odseq
odseq_threshold Threshold (float) for bootstrap probability distribution for odseq. Probability to be at the right of the bootstrap scores distribution when computing outliers. Tune this parameter depending on the diversity and occurrence of outliers in the MSA. odseq
core_sampling_depth Rarefaction depth (integer) for core diversity metrics. Choose a value that balances sequencing depth (more is better) with number of samples retained (more is better). diversity core-metrics-phylogenetic
alpha_max_depth Rarefaction depth (integer) for alpha rarefaction. Choose a value that balances sequencing depth (more is better) with number of samples retained (more is better). diversity alpha-rarefaction
beta_group_column Column (text) in your metadata to test beta-diversity group significance. Choose a category that may differentiate your samples. This analysis can be rerun with different columns by renaming the output file and changing the value in config.yaml before running again. diversity beta-group-significance
report_theme HTML report theme. Choose from: github, gothic, newsprint, night, pixyll, whitey. Typora theme gallery

Reference database

Format reference sequences and taxonomy for QIIME 2, if not done already.

Reference sequences and taxonomy are used by QIIME 2 to assign taxonomy to the representative sequences you get from DADA2 and Deblur. Any marker gene (amplicon locus) can be used with Tourmaline. The only requirement is that the reference sequence (.fasta) and taxonomy (.tsv) files are formatted for QIIME 2: the sequence IDs in the two files must match, and the taxonomy must have two tab-delimited columns where the second column is taxonomic levels delimited by semicolons. Here is what they should look like (shortened for simplicity):

refseqs.fna

>AB302407.1.2962
TCCGGTTGATCCTGCCGGACCCGACCGCTATCGGGGTAGGGCTAAGCCATGCGA
>KU725476.45629.48552
AGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGGCATGCTTAACACATGCAAG

reftax.tsv

AF352532.1.1501	D_0__Bacteria;D_1__Patescibacteria;D_2__Parcubacteria;D_3__GW2011-GWA2-46-7
KM650414.1.1439	D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales

The reference sequence (.fasta) and taxonomy (.tsv) files, and their derived QIIME 2 archives (.qza), can be used for multiple studies and need only exist in one place on your computer. Therefore, it is best to store these files in a special directory for databases—for example, in a directory named ~/databases/qiime2—then create symbolic links to them in the respective project subdirectories (you'll see how to do this below).

When you run Snakemake, if you get an error like, "Missing input files for rule import_ref_seqs: 00-data/refseqs.fna", you can fix this either by placing the reference sequence (.fasta) and taxonomy (.tsv) files in 00-data or by placing their derived QIIME 2 archives (.qza) in 01-imported. You will do the latter below.

Test data: download reference database and create symbolic links

The small test dataset included with Tourmaline is a 16S rRNA amplicon dataset sequenced using the 515F-806R primers, so we will download a 16S+18S reference database trimmed to this region (Silva 138 SSURef NR99 515F/806R region sequences and taxonomy). Please note the citations and license for this dataset here.

cd ~/databases/qiime2
wget https://data.qiime2.org/2021.2/common/silva-138-99-seqs-515-806.qza
wget https://data.qiime2.org/2021.2/common/silva-138-99-tax-515-806.qza

Now navigate to the directory 01-imported inside your new Tourmaline directory:

cd /path/to/tourmaline-test/01-imported

Create symbolic links to the database files:

ln -s ~/databases/qiime2/silva-138-99-seqs-515-806.qza refseqs.qza
ln -s ~/databases/qiime2/silva-138-99-tax-515-806.qza reftax.qza

Your data: format and import your reference database

There are three options for setting up your QIIME 2 reference database files. In each case, the underlying reference sequence (.fna) and taxonomy (.tsv) text files have to follow the QIIME 2 format.

A QIIME 2 reference sequence (.fna) file looks like this:

>AB302407
TCCGGTTGATCCTGCCGGACCCGACCGCTATCGGGGTAGGGCTAAGCCATGCGAGTCGCGCGCCCGGGGGCGCCCGGGAG
>KU725476
AGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGGCATGCTTAACACATGCAAGTCGGACGTTGTCTTCAAATTAGAATA
>KP109803
AGAGTTTGATCCTGGCTCAGAATGAACGCTAGCGATATGCTTAACACATGCAAGTCGAACGTTGTCTTCAAATTAGAATA

A QIIME 2 reference taxonomy (.tsv) file looks like this (sequence IDs must match those in the sequence file, not necessarily in the same order):

AB302407<TAB>D_0__Archaea;D_1__Crenarchaeota;D_2__Thermoprotei;D_3__Thermoproteales;D_4__Thermoproteaceae;D_5__Pyrobaculum;D_6__Pyrobaculum sp. M0H
KU725476<TAB>D_0__Bacteria;D_1__Proteobacteria;D_2__Alphaproteobacteria;D_3__Rickettsiales;D_4__Mitochondria;D_5__Sphagnum girgensohnii;D_6__Sphagnum girgensohnii
KP109803<TAB>D_0__Bacteria;D_1__Proteobacteria;D_2__Alphaproteobacteria;D_3__Rickettsiales;D_4__Mitochondria;D_5__Sphagnum palustre;D_6__Sphagnum palustre

Option 1 (import reference database using Tourmaline)

Start with the reference sequence and taxonomy files and let Tourmaline create the QIIME 2 archives (.qza). Place the text files (or symbolic links to them) in 00-data, and Tourmaline will create the .qza archives for you in 01-imported. For example, if your Tourmaline directory is /path/to/tourmaline-myproject and your (QIIME 2-ready) reference sequences and taxonomy are in ~/databases/qiime2/16s, you could create symbolic links to them using the commands below (you can change the filenames refseqs.fna and reftax.tsv in config.yaml, but we don't recommend it):

cd /path/to/tourmaline-myproject/00-data
ln -s ~/databases/qiime2/16s/16s_refseqs.fna refseqs.fna
ln -s ~/databases/qiime2/16s/16s_reftax.tsv reftax.tsv

Option 2 (import reference database on your own or use pre-imported database)

Import your reference database files to .qza files yourself before running Tourmaline, or use the .qza files from a previous run or the internet (e.g., the Silva .qza files mentioned above).

To import your own reference database files, use these commands:

qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path 00-data/refseqs.fna \
--output-path 01-imported/refseqs.qza

qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path 00-data/reftax.tsv \
--output-path 01-imported/reftax.qza

If you already have the .qza files, you can put them in your databases directory (somewhere on your computer) and create symbolic links to them in 01-imported. Tourmaline will see these files and skip the step that creates them (you can change the filenames refseqs.qza and reftax.qza in config.yaml, but we don't recommend it):

cd /path/to/tourmaline-myproject/01-imported
ln -s ~/databases/qiime2/16s/16s_refseqs.qza refseqs.qza
ln -s ~/databases/qiime2/16s/16s_reftax.qza reftax.qza

Advanced: If you have extracted a specific region or created a naive Bayes classifier in a previous run, you can place these in 01-imported also:

ln -s ~/databases/qiime2/16s/16s_refseqs_515f_806r.qza refseqs_extracted.qza
ln -s ~/databases/qiime2/16s/16s_classifier.qza classifier.qza

For Snakemake to work with these symbolic links, you may have to run snakemake --cleanup-metadata <filenames> on them first.
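
For example, assuming the default reference filenames linked in 01-imported (adjust the paths to match your own links):

snakemake --cleanup-metadata 01-imported/refseqs.qza 01-imported/reftax.qza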

Sequence files

List your fastq.gz files in QIIME 2 FASTQ manifest file(s), or use your pre-imported FASTQ QIIME 2 archive file(s).

Tourmaline requires that your amplicon sequence data is already demultiplexed. There are two options to provide your sequence data:

Option 1 is to keep your data as per-sample fastq.gz files with a FASTQ manifest file and let Tourmaline create the FASTQ archive (.qza) file(s). The FASTQ manifest files—by default named manifest_se.csv and manifest_pe.csv and found in 00-data—tell QIIME 2 where to find your FASTQ files and what samples they correspond to. See QIIME 2 Docs for more information on FASTQ manifest files. Hint: Gzipped files like fastq.gz files can be viewed using zcat (or gzcat on Mac) and zless.
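
As a sketch of the format, a paired-end CSV manifest (the style used by the test data) has a header line and one row per sample per read direction; the sample IDs and filepaths below are hypothetical:

sample-id,absolute-filepath,direction
sample1,/path/to/fastq/sample1_R1_001.fastq.gz,forward
sample1,/path/to/fastq/sample1_R2_001.fastq.gz,reverse
sample2,/path/to/fastq/sample2_R1_001.fastq.gz,forward
sample2,/path/to/fastq/sample2_R2_001.fastq.gz,reverse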

Option 2 is to use a pre-imported .qza file (QIIME 2 archive), which already contains your demultiplexed sequences and a manifest file.

Test data: edit the manifest file to match your filepath

We'll use Option 1 (FASTQ manifest and individual fastq.gz files) for the test data. The test sequence data were subsampled to 1000 sequences per sample and are provided in 00-data. They are already demultiplexed and in ".fastq.gz" format, and the file names match those in the FASTQ manifest files and metadata file. To run the test data, unless you are using the Docker container, you must edit the manifest files to point to the absolute filepaths of the sequences in your local copy of tourmaline (which you renamed to tourmaline-test). For example, if the filepath of your project is /Users/me/myproject, these commands will set up the manifest files appropriately:

cd /Users/me/myproject/tourmaline-test/00-data
cat manifest_pe.csv | sed 's|/data/tourmaline|/Users/me/myproject/tourmaline-test|' > temp
mv temp manifest_pe.csv 
cat manifest_pe.csv | grep -v "reverse" > manifest_se.csv

If you are using the Docker container and you cloned tourmaline into /data, you don't need to modify the manifest files!

Your data: choose manifest+sequences or pre-imported sequence artifact

First, delete the test fastq.gz files that came in the 00-data directory.

Demultiplex (if necessary)

If your sequences are not already demultiplexed, demultiplex them, then gzip the resulting per-sample FASTQ files. See the QIIME 2 Docs for more information on importing and demultiplexing FASTQ files. If your sequences are in multiplexed EMP format, you can use the command qiime tools import to import them as type EMPPairedEndSequences, then qiime demux emp-paired to demultiplex them.
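
A minimal sketch of those two commands, assuming EMP-format paired-end reads in a directory emp-paired-end-sequences/ and a barcode-sequence column in your metadata (the directory, column, and output names here are illustrative):

qiime tools import \
--type EMPPairedEndSequences \
--input-path emp-paired-end-sequences \
--output-path emp-paired-end-sequences.qza

qiime demux emp-paired \
--i-seqs emp-paired-end-sequences.qza \
--m-barcodes-file 00-data/metadata.tsv \
--m-barcodes-column barcode-sequence \
--o-per-sample-sequences demux.qza \
--o-error-correction-details demux-details.qza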

Option 1 (FASTQ manifest and individual fastq.gz files)

Find the location of your sequence data on your computer or external drive. It is recommended that you leave the sequences where they are and just tell Tourmaline where to find them using the FASTQ manifest file(s) in the next step.

Using the sample names in your metadata file and the absolute filepaths of the forward and reverse demultiplexed gzipped sequence files for each sample, create your FASTQ manifest file. Ensure the sample IDs match those in your metadata file. The included script create_manifest_from_metadata.py (see below) can help with this.

Tip: Create and finalize your paired-end manifest (manifest_pe.csv) first, then run this command to generate the single-end manifest (manifest_se.csv):

cat manifest_pe.csv | grep -v "reverse" > manifest_se.csv

Two scripts are provided to help you create your FASTQ manifest files:

create_manifest_from_fastq_directory.py – Create a FASTQ manifest file from a directory of FASTQ files. This script makes the following assumptions: the first characters after the sample names are "_S[0-9]{1,3}"; only Lane 1 data is present ("_L001"); R1 and R2 files are both present; and only one file (with suffix "_001") is present for each of R1 and R2.

scripts/create_manifest_from_fastq_directory.py /path/to/fastq \
00-data/manifest_pe.csv 00-data/manifest_se.csv

match_manifest_to_metadata.py – Given a metadata file and a FASTQ manifest file, generate two new manifest files (paired-end and single-end) corresponding to the samples in the metadata file. This can be useful if you demultiplexed from "Earth Microbiome Project (EMP) protocol" format for paired-end reads and unzipped the output to get the manifest file.

scripts/match_manifest_to_metadata.py 00-data/metadata.tsv /path/to/demux/data/MANIFEST \
00-data/manifest_pe.csv 00-data/manifest_se.csv

Option 2 (pre-imported sequences as .qza file)

Put the .qza files in 01-imported. It is recommended that you name them (or use symbolic links to name them) fastq_pe.qza (paired-end data) and fastq_se.qza (single-end data), but you can change these filenames in config.yaml.
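
For example, to link a previously imported paired-end artifact into place (the source path and filename demux_pe.qza are hypothetical):

cd /path/to/tourmaline-myproject/01-imported
ln -s /path/to/sequences/demux_pe.qza fastq_pe.qza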

If your .qza file is in the correct format, the output of the command qiime tools peek FILE should be:

Paired-end data:

Type:        SampleData[PairedEndSequencesWithQuality]
Data format: SingleLanePerSamplePairedEndFastqDirFmt

Single-end data:

Type:        SampleData[SequencesWithQuality]
Data format: SingleLanePerSampleSingleEndFastqDirFmt
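
For example, to check a paired-end artifact placed in 01-imported (assuming the default filename):

qiime tools peek 01-imported/fastq_pe.qza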

Sequence QC and truncation length

After your sequences are demultiplexed to per-sample .fastq.gz files, you can (optionally) run some diagnostics on them:

  • FastQC to get error profiles and lots of other information about your sequences.
  • MultiQC to collate and summarize the FastQC results.
  • Helper script fastqc_per_base_sequence_quality_dropoff.py to help choose appropriate truncation (DADA2) and trim (Deblur) lengths for your configuration file.

FastQC & MultiQC

Installation note: This section requires a separate Conda environment, multiqc, which you can create and activate with these commands:

conda create -n multiqc -c bioconda fastqc multiqc
conda activate multiqc

First, create directories for FastQC and MultiQC output for Read 1 and Read 2:

cd /path/to/fastq
mkdir fastqc-R1
mkdir fastqc-R2

Then run FastQC on the forward and the reverse reads (change the .fastq.gz filename wildcards as necessary):

fastqc *_R1_001.fastq.gz -o fastqc-R1
fastqc *_R2_001.fastq.gz -o fastqc-R2

Then run MultiQC on the output of each FastQC run:

cd fastqc-R1
multiqc --export .
cd ../fastqc-R2
multiqc --export .

Choose sequence truncation lengths

Finally, run fastqc_per_base_sequence_quality_dropoff.py, which will determine the position where median per-base sequence quality drops below some fraction (default: 0.90, here: 0.85) of its maximum value. The output of this script can be used as your parameter setting for DADA2 truncation or Deblur trim values (use output from Read 1 for dada2pe_trunc_len_f, dada2se_trunc_len, and deblur_trim_length; use output from Read 2 for dada2pe_trunc_len_r):

cd /path/to/tourmaline
scripts/fastqc_per_base_sequence_quality_dropoff.py \
--input /path/to/fastq/fastqc-R1/multiqc_data/mqc_fastqc_per_base_sequence_quality_plot_1.txt \
--cutoff 0.85

Note that DADA2 paired-end mode requires a minimum overlap of 12 bp to merge Read 1 and Read 2.

Metadata file

Format metadata file for QIIME 2.

The metadata file—by default named metadata.tsv and found in 00-data—is a tab-delimited text file (e.g., exported from Excel) with samples as rows and metadata categories as columns. It is also known as the mapping file.
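
As a sketch of the layout, a minimal metadata file might look like the following (tab-delimited; the sample IDs and columns are hypothetical and should be replaced with your own):

sample-id	region	collection_timestamp	latitude	longitude
sample1	north	2020-06-15	25.74	-80.16
sample2	south	2020-06-16	24.55	-81.78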

Test data: metadata is ready to go

The metadata file that comes with Tourmaline is ready to go with the test data. It is complete, curated, and its sample names match the sample names in the FASTQ manifest file.

Your data: standardize and curate your metadata with meaningful categories

It's important to have rich and complete sample metadata before you begin your analyses. Your sample metadata should include basic collection information like collection_timestamp, latitude, and longitude, plus categories that describe treatment groups and environmental metadata relevant to your samples.

Sample preparation information is also part of your metadata and includes information about how the samples were sequenced.

The following are some suggested columns to include in your metadata for each project:

  • project_name
  • experiment_design_description
  • target_gene
  • target_subfragment
  • pcr_primers
  • pcr_primer_names
  • platform
  • instrument_model
  • run_center
  • run_date

The above columns follow the standards set by Qiita. For additional help formatting your metadata see Metadata in QIIME 2, the EMP Metadata Guide, QIIMP, and NMDC's Introduction to Metadata and Ontologies.

Symlinks to metadata and FASTQ manifest files

It's a good idea to keep current versions of your metadata file (and perhaps your FASTQ manifest files) in a dedicated metadata directory in your project directory. Then make a symbolic link to these files in the 00-data directory. Use the code below to remove the test files and create links to your own metadata and FASTQ manifest files (the files in /path/to/metadata can be named however you like):

cd /path/to/tourmaline/00-data
rm metadata.tsv manifest_se.csv manifest_pe.csv
ln -s /path/to/metadata/metadata.tsv metadata.tsv
ln -s /path/to/metadata/manifest_se.csv manifest_se.csv
ln -s /path/to/metadata/manifest_pe.csv manifest_pe.csv

Appendix: Helper scripts

As we have seen already, Tourmaline comes with helper scripts in the scripts directory.

Setup

These scripts might come in handy in setting up your workflow.

initialize_dir_from_existing_tourmaline_dir.sh – From the main directory of a newly cloned tourmaline directory, run this script to copy the config.yaml and Snakefile from an existing tourmaline directory, remove the test files, then copy the data files and symlinks from the existing tourmaline directory. Very useful when re-running an analysis on the same dataset. Simply clone a new copy of Tourmaline, then run this script to copy everything from the old Tourmaline directory to the new one, then make your desired changes to the parameters.

scripts/initialize_dir_from_existing_tourmaline_dir.sh /path/to/existing/tourmaline

create_manifest_from_fastq_directory.py – Create a FASTQ manifest file from a directory of FASTQ files. This script makes the following assumptions: the first characters after the sample names are "_S[0-9]{1,3}"; only Lane 1 data is present ("_L001"); R1 and R2 files are both present; and only one file (with suffix "_001") is present for each of R1 and R2.

scripts/create_manifest_from_fastq_directory.py /path/to/fastq/dir \
00-data/manifest_pe.csv 00-data/manifest_se.csv

match_manifest_to_metadata.py – Given a metadata file and a FASTQ manifest file, generate two new manifest files (paired-end and single-end) corresponding to the samples in the metadata file. This can be useful if you demultiplexed from "Earth Microbiome Project (EMP) protocol" format for paired-end reads and unzipped the output to get the manifest file.

scripts/match_manifest_to_metadata.py 00-data/metadata.tsv /path/to/demux/data/MANIFEST \
00-data/manifest_pe.csv 00-data/manifest_se.csv

fastqc_per_base_sequence_quality_dropoff.py – Determine the position where median per base sequence quality drops below some fraction (default: 0.90) of its maximum value. This is useful for defining 3' truncation positions in DADA2 (trunc-len) and Deblur (trim-length). This script should be run separately for Read 1 and Read 2 fastqc/multiqc output.

scripts/fastqc_per_base_sequence_quality_dropoff.py \
--input /path/to/fastqc-R1/multiqc_data/mqc_fastqc_per_base_sequence_quality_plot_1.txt \
--cutoff 0.85

Wrapped

These scripts are wrapped by Tourmaline (used by the workflow), but some of them can be run separately.

detect_amplicon_locus.py – Try to guess the amplicon locus of a fasta file based on the first four nucleotides. Used by rule repseq_detect_amplicon_locus.

scripts/detect_amplicon_locus.py \
-i 02-output-dada2-pe/00-table-repseqs/repseqs.fasta \
> 02-output-dada2-pe/00-table-repseqs/repseqs_amplicon_type.txt

fastaLengths.pl – Calculate the length of every sequence in a fasta file. Used by rule repseq_length_distribution.

scripts/fastaLengths.pl 02-output-dada2-pe/00-table-repseqs/repseqs.fasta \
> 02-output-dada2-pe/00-table-repseqs/repseqs_lengths.txt

run_odseq.R – Determine which representative sequences are likely to be outliers using the R package odseq. Used by rule alignment_detect_outliers.

Rscript --vanilla scripts/run_odseq.R \
02-output-dada2-pe/01-alignment-tree-taxonomy/aligned_dna_sequences.fasta \
affine 1000 0.025 \
02-output-dada2-pe/01-alignment-tree-taxonomy/aligned_dna_sequences_outliers.csv