Skip to content

Commit

Permalink
Merge pull request #221 from nf-core/salmon
Browse files Browse the repository at this point in the history
[Feature Addition: Salmon]
  • Loading branch information
apeltzer committed Jul 8, 2019
2 parents e23da52 + 2ab698d commit e9284af
Show file tree
Hide file tree
Showing 20 changed files with 1,117 additions and 636 deletions.
2 changes: 1 addition & 1 deletion .github/CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ Typically, pull-requests are only fully reviewed when these tests are passing, t
There are typically two types of tests that run:

### Lint Tests
The nf-core has a [set of guidelines](http://nf-co.re/guidelines) which all pipelines must adhere to.
The nf-core has a [set of guidelines](https://nf-co.re/developers/guidelines) which all pipelines must adhere to.
To enforce these and ensure that all pipelines stay in sync, we have developed a helper tool which runs checks on the pipeline code. This is in the [nf-core/tools repository](https://github.com/nf-core/tools) and once installed can be run locally with the `nf-core lint <pipeline-directory>` command.

If any failures or warnings are encountered, please follow the listed URL for more documentation.
Expand Down
6 changes: 4 additions & 2 deletions .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,9 @@ script:
- nf-core lint ${TRAVIS_BUILD_DIR}
# Lint the documentation
- markdownlint ${TRAVIS_BUILD_DIR} -c ${TRAVIS_BUILD_DIR}/.github/markdownlint.yml
# Run, build reference genome with STAR
# Run with STAR
- nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker
# Run, build reference genome with HISAT2
# Run with HISAT2
- nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker --aligner hisat2
# Run with STAR and Salmon
- nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker --pseudo_aligner salmon
17 changes: 17 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,14 +4,28 @@

### Pipeline updates

* Removed `genebody_coverage` process [#195](https://github.com/nf-core/rnaseq/issues/195)
* Implemented Pearsons correlation instead of euclidean distance [#146](https://github.com/nf-core/rnaseq/issues/146)
* Add `--stringTieIgnoreGTF` parameter [#206](https://github.com/nf-core/rnaseq/issues/206)
* Resolved link to guidelines is broken [#203](https://github.com/nf-core/rnaseq/issues/203)
* Removed unnecessary `stringtie` channels for `MultiQC`
* Added tximport to merge salmon output
* Added Salmon as an supplementary method to STAR and HiSAT2
* Added `--psuedo_aligner`, `--transcript_fasta` and `--salmon_index` parameters
* Add `Citation` and `Quick Start` section to `README.md`
* Closed missing multiqc_plots in dev branch output [#200](https://github.com/nf-core/rnaseq/issues/200)
* Integrate changes in `nf-core/tools v1.6` template which resolved [#90](https://github.com/nf-core/rnaseq/issues/90)
* Moved process "convertGFFtoGTF" before "makeSTARindex" [#215](https://github.com/nf-core/rnaseq/issues/215)
* Add tximport and summarizedexperiment dependency [#171](https://github.com/nf-core/rnaseq/issues/171)
* Change all boolean parameters from snake_case to camelCase and vice versa for value parameters
* Appointed changes because of missing output of the multiqc_plots folder [#200](https://github.com/nf-core/rnaseq/issues/200)
* Add Qualimap dependency [#202](https://github.com/nf-core/rnaseq/issues/202)
* Add SM ReadGroup info for QualiMap compatibility[#238](https://github.com/nf-core/rnaseq/issues/238)
* Obtain edgeR + dupRadar version information [#198](https://github.com/nf-core/rnaseq/issues/198) and [#112](https://github.com/nf-core/rnaseq/issues/112)
* Get MultiQC to save plots as [standalone files](https://github.com/nf-core/rnaseq/issues/183)
* Get MultiQC to save plots as [standalone files](https://github.com/nf-core/rnaseq/issues/183): added the folder "multiqc_plots" to the output.
* Get MultiQC to write out the software versions in a .csv file [#185](https://github.com/nf-core/rnaseq/issues/185)
* Add `--gencode` option for compatibility of Salmon and featureCounts biotypes with GENCODE gene annotations
* Use `file` instead of `new File` to create `pipeline_report.{html,txt}` files, and properly create subfolders

### Dependency Updates
Expand All @@ -24,6 +38,7 @@
* qualimap 2.2.2b -> 2.2.2c
* trim-galore 0.6.1 -> 0.6.2
* gffread 0.9.12 -> 0.11.4
* Force matplotlib=3.0.3
* Added Salmon 0.14.0
* Added RSEM 1.3.2
* Added tximport 1.0.3
Expand Down Expand Up @@ -63,6 +78,8 @@
* deeptools 3.2.0 -> 3.2.1
* trim-galore 0.5.0 -> 0.6.1
* qualimap 2.2.2b
* matplotlib 3.0.3
* r-base 3.5.1

## [Version 1.2](https://github.com/nf-core/rnaseq/releases/tag/1.2) - 2018-12-12

Expand Down
36 changes: 33 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,15 +7,34 @@
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg)](http://bioconda.github.io/)
[![Docker](https://img.shields.io/docker/automated/nfcore/rnaseq.svg)](https://hub.docker.com/r/nfcore/rnaseq/)


### Introduction

**nf-core/rnaseq** is a bioinformatics analysis pipeline used for RNA sequencing data.

The workflow processes raw data from FastQ inputs ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), [Trim Galore!](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)), aligns the reads ([STAR](https://github.com/alexdobin/STAR) or [HiSAT2](https://ccb.jhu.edu/software/hisat2/index.shtml)), generates gene counts ([featureCounts](http://bioinf.wehi.edu.au/featureCounts/), [StringTie](https://ccb.jhu.edu/software/stringtie/)) and performs extensive quality-control on the results ([RSeQC](http://rseqc.sourceforge.net/), [Qualimap](http://qualimap.bioinfo.cipf.es/), [dupRadar](https://bioconductor.org/packages/release/bioc/html/dupRadar.html), [Preseq](http://smithlabresearch.org/software/preseq/), [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html), [MultiQC](http://multiqc.info/)). See the [output documentation](docs/output.md) for more details of the results.
The workflow processes raw data from FastQ inputs ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), [Trim Galore!](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)), aligns the reads ([STAR](https://github.com/alexdobin/STAR) or [HiSAT2](https://ccb.jhu.edu/software/hisat2/index.shtml)), generates counts relative to genes ([featureCounts](http://bioinf.wehi.edu.au/featureCounts/), [StringTie](https://ccb.jhu.edu/software/stringtie/)) or transcripts ([Salmon](https://combine-lab.github.io/salmon/), [tximport](https://bioconductor.org/packages/release/bioc/html/tximport.html)) and performs extensive quality-control on the results ([RSeQC](http://rseqc.sourceforge.net/), [Qualimap](http://qualimap.bioinfo.cipf.es/), [dupRadar](https://bioconductor.org/packages/release/bioc/html/dupRadar.html), [Preseq](http://smithlabresearch.org/software/preseq/), [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html), [MultiQC](http://multiqc.info/)). See the [output documentation](docs/output.md) for more details of the results.

The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers making installation trivial and results highly reproducible.

## Quick Start

i. Install [`nextflow`](https://nf-co.re/usage/installation)

ii. Install one of [`docker`](https://docs.docker.com/engine/installation/), [`singularity`](https://www.sylabs.io/guides/3.0/user-guide/) or [`conda`](https://conda.io/miniconda.html)

iii. Download the pipeline and test it on a minimal dataset with a single command

```bash
nextflow run nf-core/rnaseq -profile test,<docker/singularity/conda>
```

iv. Start running your own analysis!

```bash
nextflow run nf-core/rnaseq -profile <docker/singularity/conda> --reads '*_R{1,2}.fastq.gz' --genome GRCh37
```

See [usage docs](docs/usage.md) for all of the available options when running the pipeline.

### Documentation
The nf-core/rnaseq pipeline comes with documentation about the pipeline, found in the `docs/` directory:

Expand All @@ -37,4 +56,15 @@ Many thanks to other who have helped out along the way too, including (but not l
[@orzechoj](https://github.com/orzechoj),
[@apeltzer](https://github.com/apeltzer),
[@colindaven](https://github.com/colindaven),
[@jburos](https://github.com/jburos).
[@lpantano](https://github.com/lpantano),
[@olgabot](https://github.com/olgabot),
[@jburos](https://github.com/jburos),
[@drpatelh](https://github.com/drpatelh).

## Citation

<!-- TODO nf-core: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi. -->
If you use nf-core/rnaseq for your analysis, please cite it using the following doi: [10.5281/zenodo.1400710](https://doi.org/10.5281/zenodo.1400710)

You can cite the `nf-core` pre-print as follows:
Ewels PA, Peltzer A, Fillinger S, Alneberg JA, Patel H, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. **nf-core: Community curated bioinformatics pipelines**. *bioRxiv*. 2019. p. 610741. [doi: 10.1101/610741](https://www.biorxiv.org/content/10.1101/610741v1).
4 changes: 2 additions & 2 deletions assets/heatmap_header.txt
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,10 @@
# section_name: 'edgeR: Sample Similarity'
# description: "is generated from normalised gene counts through
# <a href='https://bioconductor.org/packages/release/bioc/html/edgeR.html' target='_blank'>edgeR</a>.
# Euclidean distances between log<sub>2</sub> normalised CPM values are then calculated and clustered."
# Pearson's correlation between log<sub>2</sub> normalised CPM values are then calculated and clustered."
# plot_type: 'heatmap'
# anchor: 'ngi_rnaseq-sample_similarity'
# pconfig:
# title: 'edgeR: Euclidean distances'
# title: 'edgeR: Pearsons correlation'
# xlab: True
# reverseColors: True
1 change: 1 addition & 0 deletions assets/multiqc_config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ extra_fn_clean_exts:
- _R2
- .hisat
- '.sorted.markDups'
- '.sorted'

report_comment: >
This report has been generated by the <a href="https://github.com/nf-core/rnaseq" target="_blank">nf-core/rnaseq</a>
Expand Down
19 changes: 9 additions & 10 deletions bin/edgeR_heatmap_MDS.r
Original file line number Diff line number Diff line change
Expand Up @@ -67,25 +67,24 @@ write.csv(MDSxy, 'edgeR_MDS_Aplot_coordinates_mqc.csv', quote=FALSE, append=TRUE
# Get the log counts per million values
logcpm <- cpm(dataNorm, prior.count=2, log=TRUE)

# Calculate the euclidean distances between samples
dists = dist(t(logcpm))

# Calculate the Pearsons correlation between samples
# Plot a heatmap of correlations
pdf('log2CPM_sample_distances_heatmap.pdf')
hmap <- heatmap.2(as.matrix(dists),
main="Sample Correlations", key.title="Distance", trace="none",
pdf('log2CPM_sample_correlation_heatmap.pdf')
hmap <- heatmap.2(as.matrix(cor(logcpm, method="pearson")),
key.title="Pearsons Correlation", trace="none",
dendrogram="row", margin=c(9, 9)
)
dev.off()

# Write correlation values to file
write.csv(hmap$carpet, 'log2CPM_sample_correlation_mqc.csv', quote=FALSE, append=TRUE)

# Plot the heatmap dendrogram
pdf('log2CPM_sample_distances_dendrogram.pdf')
plot(hmap$rowDendrogram, main="Sample Dendrogram")
hmap <- heatmap.2(as.matrix(dist(t(logcpm))))
plot(hmap$rowDendrogram, main="Sample Euclidean Distance Clustering")
dev.off()

# Write clustered distance values to file
write.csv(hmap$carpet, 'log2CPM_sample_distances_mqc.csv', quote=FALSE, append=TRUE)

file.create("corr.done")

# Printing sessioninfo to standard out
Expand Down
75 changes: 75 additions & 0 deletions bin/parse_gtf.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
#!/usr/bin/env python
from __future__ import print_function
from collections import OrderedDict, defaultdict, Counter
import logging
import argparse
import glob
import os

# Create a logger
logging.basicConfig(format='%(name)s - %(asctime)s %(levelname)s: %(message)s')
logger = logging.getLogger(__file__)
logger.setLevel(logging.INFO)


def read_top_transcript(salmon):
txs = set()
fn = glob.glob(os.path.join(salmon, "*", "quant.sf"))[0]
with open(fn) as inh:
for line in inh:
if line.startswith("Name"):
continue
txs.add(line.split()[0])
if len(txs) > 100:
break
logger.info("Transcripts found in FASTA: %s" % txs)
return txs


def tx2gene(gtf, salmon, gene_id, extra, out):
txs = read_top_transcript(salmon)
votes = Counter()
gene_dict = defaultdict(list)
with open(gtf) as inh:
for line in inh:
if line.startswith("#"):
continue
cols = line.split("\t")
attr_dict = OrderedDict()
for gff_item in cols[8].split(";"):
item_pair = gff_item.strip().split(" ")
if len(item_pair) > 1:
value = item_pair[1].strip().replace("\"", "")
if value in txs:
votes[item_pair[0].strip()] += 1

attr_dict[item_pair[0].strip()] = value
gene_dict[attr_dict[gene_id]].append(attr_dict)

if not votes:
logger.warning("No attribute in GTF matching transcripts")
return None

txid = votes.most_common(1)[0][0]
logger.info("Attributed found to be transcript: %s" % txid)
seen = set()
with open(out, 'w') as outh:
for gene in gene_dict:
for row in gene_dict[gene]:
if txid not in row:
continue
if (gene, row[txid]) not in seen:
seen.add((gene, row[txid]))
print("%s,%s,%s" % (row[txid], gene, row[extra]), file=outh)


if __name__ == "__main__":
parser = argparse.ArgumentParser(description="""Get tx to gene names for tximport""")
parser.add_argument("--gtf", type=str, help="GTF file")
parser.add_argument("--salmon", type=str, help="output of salmon")
parser.add_argument("--id", type=str, help="gene id in the gtf file")
parser.add_argument("--extra", type=str, help="extra id in the gtf file")
parser.add_argument("-o", "--output", dest='output', default='tx2gene.csv', type=str, help="file with output")

args = parser.parse_args()
tx2gene(args.gtf, args.salmon, args.id, args.extra, args.output)
2 changes: 2 additions & 0 deletions bin/scrape_software_versions.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
'Picard MarkDuplicates': ['v_markduplicates.txt', r"([\d\.]+)-SNAPSHOT"],
'Samtools': ['v_samtools.txt', r"samtools (\S+)"],
'featureCounts': ['v_featurecounts.txt', r"featureCounts v(\S+)"],
'Salmon': ['v_salmon.txt', r"salmon (\S+)"],
'deepTools': ['v_deeptools.txt', r"bamCoverage (\S+)"],
'StringTie': ['v_stringtie.txt', r"(\S+)"],
'Preseq': ['v_preseq.txt', r"Version: (\S+)"],
Expand All @@ -34,6 +35,7 @@
results['Picard MarkDuplicates'] = '<span style="color:#999999;\">N/A</span>'
results['Samtools'] = '<span style="color:#999999;\">N/A</span>'
results['featureCounts'] = '<span style="color:#999999;\">N/A</span>'
results['Salmon'] = '<span style="color:#999999;\">N/A</span>'
results['StringTie'] = '<span style="color:#999999;\">N/A</span>'
results['Preseq'] = '<span style="color:#999999;\">N/A</span>'
results['deepTools'] = '<span style="color:#999999;\">N/A</span>'
Expand Down
81 changes: 81 additions & 0 deletions bin/tximport.r
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
#!/usr/bin/env Rscript

args = commandArgs(trailingOnly=TRUE)
if (length(args) < 2) {
stop("Usage: tximeta.r <coldata> <salmon_out>", call.=FALSE)
}

path = args[2]
coldata = args[1]

sample_name = args[3]

prefix = paste(c(sample_name, "salmon"), sep="_")

tx2gene = "tx2gene.csv"
info = file.info(tx2gene)
if (info$size == 0){
tx2gene = NULL
}else{
rowdata = read.csv(tx2gene, header = FALSE)
colnames(rowdata) = c("tx", "gene_id", "gene_name")
tx2gene = rowdata[,1:2]
}

fns = list.files(path, pattern = "quant.sf", recursive = T, full.names = T)
names = basename(dirname(fns))
names(fns) = names

if (file.exists(coldata)){
coldata = read.csv(coldata)
coldata = coldata[match(names, coldata[,1]),]
coldata = cbind(files = fns, coldata)
}else{
message("ColData not avaliable ", coldata)
coldata = data.frame(files = fns, names = names)
}

library(SummarizedExperiment)
library(tximport)

txi = tximport(fns, type = "salmon", txOut = TRUE)
rownames(coldata) = coldata[["names"]]
extra = setdiff(rownames(txi[[1]]), as.character(rowdata[["tx"]]))
if (length(extra) > 0){
rowdata = rbind(rowdata,
data.frame(tx=extra,
gene_id=extra,
gene_name=extra))
}
rowdata = rowdata[match(rownames(txi[[1]]), as.character(rowdata[["tx"]])),]
rownames(rowdata) = rowdata[["tx"]]
se = SummarizedExperiment(assays = list(counts = txi[["counts"]],
abundance = txi[["abundance"]],
length = txi[["length"]]),
colData = DataFrame(coldata),
rowData = rowdata)
if (!is.null(tx2gene)){
gi = summarizeToGene(txi, tx2gene = tx2gene)
growdata = unique(rowdata[,2:3])
growdata = growdata[match(rownames(gi[[1]]), growdata[["gene_id"]]),]
rownames(growdata) = growdata[["tx"]]
gse = SummarizedExperiment(assays = list(counts = gi[["counts"]],
abundance = gi[["abundance"]],
length = gi[["length"]]),
colData = DataFrame(coldata),
rowData = growdata)
}

if(exists("gse")){
saveRDS(gse, file = "gse.rds")
write.csv(assays(gse)[["abundance"]], paste(c(prefix, "gene_tpm.csv"), collapse="_"), quote=FALSE)
write.csv(assays(gse)[["counts"]], paste(c(prefix, "gene_counts.csv"), collapse="_"), quote=FALSE)
}

saveRDS(se, file = "se.rds")
write.csv(assays(se)[["abundance"]], paste(c(prefix, "transcript_tpm.csv"), collapse="_"), quote=FALSE)
write.csv(assays(se)[["counts"]], paste(c(prefix, "transcript_counts.csv"), collapse="_"), quote=FALSE)

# Print sessioninfo to standard out
citation("tximeta")
sessionInfo()
Loading

0 comments on commit e9284af

Please sign in to comment.