Merge pull request #40 from qbic-pipelines/dev

Release 1.2.0
nf-core · Jan 13, 2022 · b1fc825
2 parents 303f6cd + 0f7bf76
commit b1fc825
Showing 12 changed files with 222 additions and 84 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -20,7 +20,7 @@ jobs:
       matrix:
         # Nextflow versions: check pipeline minimum and current latest
         nxf_ver: ['20.04.1', '']
-        config: ['test_chr','test_bai']
+        config: ['test_chr','test_bai','test_cram']
     steps:
       - name: Check out pipeline code
         uses: actions/checkout@v2
@@ -33,13 +33,13 @@ jobs:
             environment.yml
       - name: Build new docker image
         if: env.MATCHED_FILES
-        run: docker build --no-cache . -t qbicpipelines/bamtofastq:1.1.0
+        run: docker build --no-cache . -t qbicpipelines/bamtofastq:1.2.0
 
       - name: Pull docker image
         if: ${{ !env.MATCHED_FILES }}
         run: |
-          docker pull qbicpipelines/bamtofastq:dev
-          docker tag qbicpipelines/bamtofastq:dev qbicpipelines/bamtofastq:1.1.0
+          docker pull qbicpipelines/bamtofastq:1.2.0
+          docker tag qbicpipelines/bamtofastq:1.2.0 qbicpipelines/bamtofastq:1.2.0
       - name: Install Nextflow
         run: |
           wget -qO- get.nextflow.io | bash

diff --git a/.github/workflows/push_dockerhub.yml b/.github/workflows/push_dockerhub.yml
@@ -30,7 +30,7 @@ jobs:
           echo "$DOCKERHUB_PASS" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
           docker tag qbicpipelines/bamtofastq:latest qbicpipelines/bamtofastq:dev
           docker push qbicpipelines/bamtofastq:dev
-          
+
       - name: Push Docker image to DockerHub (release)
         if: ${{ github.event_name == 'release' }}
         run: |

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,12 @@
 # nf-core/bamtofastq: Changelog
 
+## v1.2.0 - Anna Winlock
+
+- [#36](https://github.com/qbic-pipelines/bamtofastq/pull/36) Add options `--cram_files` and `--reference_fasta` to add support for CRAM files.
+- [#31](https://github.com/qbic-pipelines/bamtofastq/pull/31) Add option `--samtools_collate_fast` and improve speed of cat.
+- [#32](https://github.com/qbic-pipelines/bamtofastq/pull/32) Added `--samtools_collate_fast` to sortExtractMapped and changed cat command to append.
+- [#33](https://github.com/qbic-pipelines/bamtofastq/pull/33) Added flag `--reads_in_memory` to specify how many reads shall be stored in memory.
+
 ## v1.1.0 -  Katherine Johnson
 
 - [#21](https://github.com/qbic-pipelines/bamtofastq/21) Allows bam indices as additional input files

diff --git a/Dockerfile b/Dockerfile
@@ -4,4 +4,4 @@ LABEL authors="Friederike Hanssen" \
 
 COPY environment.yml /
 RUN conda env create -f /environment.yml && conda clean -a
-ENV PATH /opt/conda/envs/qbic-pipelines-bamtofastq-1.1.0/bin:$PATH
+ENV PATH /opt/conda/envs/qbic-pipelines-bamtofastq-1.2.0/bin:$PATH
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 # ![qbic-pipelines/bamtofastq](docs/images/qbic-pipelines-bamtofastq_logo.png)
 
-> **An open-source pipeline converting (un)mapped single-end or paired-end bam files to fastq.gz**.
+> **An open-source pipeline converting (un)mapped single-end or paired-end bam/cram files to fastq.gz**.
 
 [![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A520.04.1-brightgreen.svg)](https://www.nextflow.io/)
 
@@ -14,8 +14,8 @@
 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4022137.svg)](https://doi.org/10.5281/zenodo.4022137)
 ## Introduction
 
-This pipeline converts (un)mapped `.bam` files into `fq.gz` files.
-Initially, it auto-detects, whether the input file contains single-end or paired-end reads. Following this step, the reads are sorted using `samtools collate` and extracted with `samtools fastq`. Furthermore, for mapped bam files it is possible to only convert reads mapping to a specific region or chromosome. The obtained FastQ files can then be used to further process with other pipelines.
+This pipeline converts (un)mapped `.bam` files (or `.cram` files with the `--cram_files` option) into `fq.gz` files.
+Initially, it auto-detects whether the input file contains single-end or paired-end reads. Following this step, the reads are sorted using `samtools collate` and extracted with `samtools fastq`. Furthermore, for mapped bam/cram files it is possible to convert only reads mapping to a specific region or chromosome. The obtained FastQ files can then be processed further with other pipelines.
 
 The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers making installation trivial and results highly reproducible.
 
@@ -62,6 +62,7 @@ Helpful contributors:
 * [Gisela Gabernet](https://github.com/ggabernet)
 * [Matilda Åslin](https://github.com/matrulda)
 * [Susanne Jodoin](https://github.com/SusiJo)
+* [Bruno Grande](https://github.com/BrunoGrandePhd)
 
 ### Resources
 

diff --git a/conf/base.config b/conf/base.config
@@ -36,7 +36,7 @@ process {
   }
   withLabel:process_high {
     cpus = { check_max( 15 * task.attempt, 'cpus' ) }
-    memory = { check_max( 120.GB * task.attempt, 'memory' ) }
+    memory = { check_max( 200.GB * task.attempt, 'memory' ) }
     time = { check_max( 10.h * task.attempt, 'time' ) }
   }
   withLabel:process_long {

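Note on the `process_high` change: the memory request scales with `task.attempt` and is then passed through `check_max`. The arithmetic can be sketched as follows, assuming `check_max` simply returns the smaller of the request and the `--max_memory` ceiling (as in the standard nf-core template). The helper name `capped_memory_gb` is hypothetical and not part of the pipeline.

```shell
# Hypothetical helper: memory request in GB for a given attempt, capped at a ceiling.
# Assumption: check_max behaves like min(request, max_memory).
capped_memory_gb() {
  attempt=$1   # task.attempt (1 on the first try)
  max_gb=$2    # value of --max_memory, in GB
  request=$((200 * attempt))
  if [ "$request" -gt "$max_gb" ]; then
    echo "$max_gb"
  else
    echo "$request"
  fi
}

capped_memory_gb 1 500   # first attempt, 500 GB ceiling
capped_memory_gb 3 500   # third attempt asks for 600 GB, so the ceiling applies
capped_memory_gb 1 6     # test profiles set max_memory = 6.GB
```

Under this assumption, the test profiles are unaffected by the increase because their 6 GB ceiling is far below either value.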
diff --git a/conf/test_bai.config b/conf/test_bai.config
@@ -15,6 +15,11 @@ params {
   max_cpus = 2
   max_memory = 6.GB
   max_time = 48.h
+  samtools_collate_fast = true
+  reads_in_memory = '10000'
+  no_stats = true
+  no_read_QC = true
+
 
   index_files = true
   input_paths = [

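The two new test parameters correspond to the `samtools collate` flags `-f` and `-r INT` named in the usage docs. A minimal sketch of how such a mapping could look is given here; it is not the pipeline's actual process code, and the helper name `collate_flags` and the file name `input.bam` are hypothetical.

```shell
# Hypothetical helper: build the extra flags for `samtools collate`.
collate_flags() {
  fast=$1    # "true" when --samtools_collate_fast is set
  reads=$2   # value of --reads_in_memory
  if [ "$fast" = "true" ]; then
    # -f: fast mode (primary alignments only); -r: reads kept in memory
    echo "-f -r $reads"
  else
    echo ""
  fi
}

# Print the command that would be run for the test_bai settings.
echo "samtools collate $(collate_flags true 10000) -O input.bam"
```

Without the fast flag the helper returns an empty string, so `--reads_in_memory` has no effect on its own, which matches the "only relevant in combination" wording in the docs.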
diff --git a/conf/test_cram.config b/conf/test_cram.config
@@ -0,0 +1,25 @@
+/*
+ * -------------------------------------------------
+ *  Nextflow config file for running tests
+ * -------------------------------------------------
+ * Defines bundled input files and everything required
+ * to run a fast and simple test. Use as follows:
+ *   nextflow run qbic-pipelines/bamtofastq -profile test_cram
+ */
+
+
+params {
+  config_profile_name = 'Test profile'
+  config_profile_description = 'Minimal test dataset to check pipeline function'
+  // Limit resources so that this can run on GitHub Actions
+  max_cpus = 2
+  max_memory = 6.GB
+  max_time = 48.h
+
+  cram_files = true
+  input = [
+          'https://github.com/qbic-pipelines/bamtofastq/master/testdata/First_SmallTest_Paired.cram',
+          'https://github.com/qbic-pipelines/bamtofastq/master/testdata/Second_SmallTest_Paired.cram'
+          ]
+  reference_fasta = 'ftp://ftp.broadinstitute.org/pub/seq/references/Homo_sapiens_assembly19.fasta'
+}
diff --git a/docs/usage.md b/docs/usage.md
@@ -4,39 +4,44 @@
 
 <!-- Install Atom plugin markdown-toc-auto for this ToC to auto-update on save -->
 <!-- TOC START min:2 max:3 link:true asterisk:true update:true -->
-* [Table of contents](#table-of-contents)
-* [Introduction](#introduction)
-* [Running the pipeline](#running-the-pipeline)
-  * [Updating the pipeline](#updating-the-pipeline)
-  * [Reproducibility](#reproducibility)
-* [Main arguments](#main-arguments)
-  * [`-profile`](#-profile)
-  * [`--input`](#--input)
-  * [`--index_files`](#--index_files)
-  * [`--chr`](#--chr)
-  * [`--no_read_QC`](#--no_read_QC)
-  * [`--no_stats`](#--no_stats)
-* [Job resources](#job-resources)
-  * [Automatic resubmission](#automatic-resubmission)
-  * [Custom resource requests](#custom-resource-requests)
-* [AWS Batch specific parameters](#aws-batch-specific-parameters)
-  * [`--awsqueue`](#--awsqueue)
-  * [`--awsregion`](#--awsregion)
-* [Other command line parameters](#other-command-line-parameters)
-  * [`--outdir`](#--outdir)
-  * [`--email`](#--email)
-  * [`--email_on_fail`](#--email_on_fail)
-  * [`-name`](#-name)
-  * [`-resume`](#-resume)
-  * [`-c`](#-c)
-  * [`--custom_config_version`](#--custom_config_version)
-  * [`--custom_config_base`](#--custom_config_base)
-  * [`--max_memory`](#--max_memory)
-  * [`--max_time`](#--max_time)
-  * [`--max_cpus`](#--max_cpus)
-  * [`--plaintext_email`](#--plaintext_email)
-  * [`--monochrome_logs`](#--monochrome_logs)
-  * [`--multiqc_config`](#--multiqc_config)
+- [qbic-pipelines/bamtofastq: Usage](#qbic-pipelinesbamtofastq-usage)
+  - [Table of contents](#table-of-contents)
+  - [Introduction](#introduction)
+  - [Running the pipeline](#running-the-pipeline)
+    - [Updating the pipeline](#updating-the-pipeline)
+    - [Reproducibility](#reproducibility)
+  - [Main arguments](#main-arguments)
+    - [`-profile`](#-profile)
+    - [`--input`](#--input)
+    - [`--index_files`](#--index_files)
+    - [`--cram_files`](#--cram_files)
+    - [`--reference_fasta`](#--reference_fasta)
+    - [`--chr` (optional)](#--chr-optional)
+    - [`--no_read_QC` (optional)](#--no_read_qc-optional)
+    - [`--samtools_collate_fast` (optional)](#--samtools_collate_fast-optional)
+    - [`--reads_in_memory` (optional)](#--reads_in_memory-optional)
+    - [`--no_stats` (optional)](#--no_stats-optional)
+  - [Job resources](#job-resources)
+    - [Automatic resubmission](#automatic-resubmission)
+    - [Custom resource requests](#custom-resource-requests)
+  - [AWS Batch specific parameters](#aws-batch-specific-parameters)
+    - [`--awsqueue`](#--awsqueue)
+    - [`--awsregion`](#--awsregion)
+  - [Other command line parameters](#other-command-line-parameters)
+    - [`--outdir`](#--outdir)
+    - [`--email`](#--email)
+    - [`--email_on_fail`](#--email_on_fail)
+    - [`-name`](#-name)
+    - [`-resume`](#-resume)
+    - [`-c`](#-c)
+    - [`--custom_config_version`](#--custom_config_version)
+    - [`--custom_config_base`](#--custom_config_base)
+    - [`--max_memory`](#--max_memory)
+    - [`--max_time`](#--max_time)
+    - [`--max_cpus`](#--max_cpus)
+    - [`--plaintext_email`](#--plaintext_email)
+    - [`--monochrome_logs`](#--monochrome_logs)
+    - [`--multiqc_config`](#--multiqc_config)
 <!-- TOC END -->
 
 ## Introduction
@@ -92,24 +97,24 @@ Use this parameter to choose a configuration profile. Profiles can give configur
 
 If `-profile` is not specified at all the pipeline will be run locally and expects all software to be installed and available on the `PATH`.
 
-* `awsbatch`
-  * A generic configuration profile to be used with AWS Batch.
-* `conda`
-  * A generic configuration profile to be used with [conda](https://conda.io/docs/)
-  * Pulls most software from [Bioconda](https://bioconda.github.io/)
-* `docker`
-  * A generic configuration profile to be used with [Docker](http://docker.com/)
-  * Pulls software from dockerhub: [`nfcore/bamtofastq`](http://hub.docker.com/r/nfcore/bamtofastq/)
-* `singularity`
-  * A generic configuration profile to be used with [Singularity](http://singularity.lbl.gov/)
-  * Pulls software from DockerHub: [`nfcore/bamtofastq`](http://hub.docker.com/r/nfcore/bamtofastq/)
-* `test`
-  * A profile with a complete configuration for automated testing
-  * Includes links to test data so needs no other parameters
+- `awsbatch`
+  - A generic configuration profile to be used with AWS Batch.
+- `conda`
+  - A generic configuration profile to be used with [conda](https://conda.io/docs/)
+  - Pulls most software from [Bioconda](https://bioconda.github.io/)
+- `docker`
+  - A generic configuration profile to be used with [Docker](http://docker.com/)
+  - Pulls software from DockerHub: [`nfcore/bamtofastq`](http://hub.docker.com/r/nfcore/bamtofastq/)
+- `singularity`
+  - A generic configuration profile to be used with [Singularity](http://singularity.lbl.gov/)
+  - Pulls software from DockerHub: [`nfcore/bamtofastq`](http://hub.docker.com/r/nfcore/bamtofastq/)
+- `test`
+  - A profile with a complete configuration for automated testing
+  - Includes links to test data so needs no other parameters
 
 ### `--input`
 
-Use this to specify the location of your input Bam files. For example:
+Use this to specify the location of your input BAM files (or CRAM files if used with [`--cram_files`](#--cram_files)). For example:
 
 ```bash
 --input 'path/to/data/sample_*.bam'
@@ -118,7 +123,7 @@ Use this to specify the location of your input Bam files. For example:
 Please note the following requirements:
 
 1. The path must be enclosed in quotes
-2. The path must have at least one `*` wildcard character
+2. The path must have at least one `*`/`**` wildcard character
 
 ### `--index_files`
 
@@ -133,10 +138,34 @@ Please note the following requirements:
 1. The path must be enclosed in quotes
 2. The path must have at least one `*` wildcard character
 
+### `--cram_files`
+
+Use this to indicate that **all** of the files listed in `--input` are CRAM files instead of BAM files. This enables a step at the beginning of the workflow that converts each CRAM file to BAM format on the fly. Note that this option is incompatible with [`--index_files`](#--index_files). For example:
+
+```bash
+--cram_files --input 'path/to/data/sample_*.cram'
+```
+
+While the above command is valid, it will only work if the reference genome FASTA file listed in the CRAM header is available (_e.g._ via HTTP/FTP or on the local file system). Otherwise, you will need to use the [`--reference_fasta` option](#--reference_fasta). You can check which reference FASTA file is indicated in the CRAM header with the following command:
+
+```bash
+samtools view -H path/to/sample.cram | grep '@SQ'
+```
+
+Unfortunately, at the time of writing, FastQC [doesn't support](https://github.com/s-andrews/FastQC/issues/54) CRAM files as input. Hence, a benefit of converting CRAM files to BAM format as opposed to converting directly to FASTQ format is that you can perform QC before the final conversion.
+
+### `--reference_fasta`
+
+Use this option to indicate which reference genome FASTA file to use when decompressing CRAM files. This is useful if the FASTA file indicated in the CRAM header is not available (see [`--cram_files`](#--cram_files) for more information). For example:
+
+```bash
+--cram_files --input 'path/to/data/sample_*.cram' --reference_fasta 'ftp://ftp.broadinstitute.org/pub/seq/references/Homo_sapiens_assembly19.fasta'
+```
+
 ### `--chr` (optional)
 
 Use to only obtain reads mapping to a specific chromosome or region.
-> It is important to specify the chromsome or region name **exactly** as set in the bam file. Otherwise no reads may be extracted!
+> It is important to specify the chromosome or region name **exactly** as set in the bam file. Otherwise no reads may be extracted!
 
 For example:
 
@@ -154,6 +183,20 @@ Use to skip `FastQC` on obtained reads. This is useful, when the reads are used
 --no_read_QC
 ```
 
+### `--samtools_collate_fast` (optional)
+
+Use to specify the fast mode for the `samtools collate` command in the processes `sortExtractMapped`, `sortExtractUnmapped` and `sortExtractSingleEnd`. This option relies on the samtools command line flags `-f -r INT` and will output primary alignments only. For full documentation of this mode please refer to the [samtools documentation](http://www.htslib.org/doc/samtools-collate.html#OPTIONS).
+
+### `--reads_in_memory` (optional)
+
+Only relevant in combination with `--samtools_collate_fast`. It specifies how many reads are kept in memory (default: `'100000'`). This is useful for speeding up the processes `sortExtractMapped`, `sortExtractUnmapped` and `sortExtractSingleEnd`.
+
+Example:
+
+```bash
+--samtools_collate_fast --reads_in_memory '1000000'
+```
+
 ### `--no_stats` (optional)
 
 Use to skip `FastQC` on both input bam and output reads, as well as all `samtools flagstat`, `samtools idxstats`, and `samtools stats`. This is useful for large datasets, since the quality metrics processes require a significant amount of time and resources.

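The `@SQ` check described in the `--cram_files` section can be narrowed down to the reference location itself. A small sketch follows, assuming the header records it in the optional `UR:` tag of each `@SQ` line (the SAM specification's field for the sequence URI). The helper name `sq_reference_uris` and the example URI are hypothetical.

```shell
# Hypothetical helper: read SAM/CRAM header text on stdin and
# print the unique UR: values found on @SQ lines.
sq_reference_uris() {
  awk -F '\t' '$1 == "@SQ" {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^UR:/) print substr($i, 4)   # strip the "UR:" prefix
  }' | sort -u
}

# With samtools and a real file this would be used as:
#   samtools view -H path/to/sample.cram | sq_reference_uris
# Stand-in header line so the sketch runs without samtools:
printf '@SQ\tSN:1\tLN:249250621\tUR:ftp://example.org/ref.fasta\n' | sq_reference_uris
```

If the helper prints nothing, the header carries no `UR:` tag, and `--reference_fasta` is likely needed unless the reference can be resolved another way.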
diff --git a/environment.yml b/environment.yml
@@ -1,6 +1,6 @@
 # You can use this file to create a conda environment for this pipeline:
 #   conda env create -f environment.yml
-name: qbic-pipelines-bamtofastq-1.1.0
+name: qbic-pipelines-bamtofastq-1.2.0
 channels:
   - conda-forge
   - bioconda