Restores mapping to nf-core; Keeps gatk *Spark #84

Closed
wants to merge 70 commits
Commits
d4201e4
Adds GenomeChronicler; Changes "withLabel:VEP" -> "withName:VEP"
cgpu Oct 21, 2019
48c51c9
Creates minimal bam channel with only file
cgpu Oct 21, 2019
b0107f6
C-out for testing
cgpu Oct 21, 2019
e264e9f
Added url; baseDir does not work when on an AWS instance
cgpu Oct 21, 2019
a3f5b3a
Adds back nf-core functions; WIP testing
cgpu Oct 21, 2019
ad06d26
Adds genomechronicler; Comments out all following processes; WIP
Oct 21, 2019
39d99e8
Changes output dir; WIP
Oct 21, 2019
a50fded
Fixes reports for GenomeChronicler; WIP (vep html not implemented)
cgpu Oct 22, 2019
982cf87
WIP; report update
cgpu Oct 22, 2019
de24ddc
Absorbs path-release 2.5.1-Årjep-Ålkatjjekna :tada:
cgpu Oct 25, 2019
299057d
Merge branch 'nf-core-master'
cgpu Oct 25, 2019
4969b57
Restores channel name from nfcore
cgpu Oct 25, 2019
034057b
Changes MultiQC loc; enables display
cgpu Oct 25, 2019
58215f3
Updated docker build used
cgpu Oct 25, 2019
4b3686b
Added --resultsDir to reflect .pl changes
cgpu Oct 25, 2019
804053e
Adds --resultsDir flag requirement
Oct 25, 2019
7f59320
Update to reflect https://github.com/PGP-UK/GenomeChronicler/commit/6…
cgpu Oct 26, 2019
ad6386f
Removed resources limitation from samtools sort
cgpu Oct 30, 2019
dc6c16a
Decreased --MAX_RECORDS_IN_RAM
cgpu Oct 30, 2019
eb8615d
Restoring to default 500000
cgpu Oct 31, 2019
eeb3078
Removes java options, MAX_RECORDS_IN_RAM and TMP_DIR
cgpu Nov 1, 2019
ecccfd6
Revert "Removes java options, MAX_RECORDS_IN_RAM and TMP_DIR"
cgpu Nov 1, 2019
0c877b8
Added labels memory and cpus max to MarkDuplicates
cgpu Nov 1, 2019
b64788d
Updated MAX_RECORDS_IN_RAM to extremely low value (500)
cgpu Nov 1, 2019
b5d9f7e
Merge pull request #2 from nf-core/master
cgpu Nov 13, 2019
6e19b5f
Updates container version
cgpu Nov 20, 2019
ed6b72f
Reverts to previous container tag
cgpu Nov 20, 2019
a378a12
Adds spark version of MarkDuplicates and BaseRecalibrator
cgpu Nov 24, 2019
79ced20
Merge branch 'master' of https://github.com/cgpu/sarek
cgpu Nov 24, 2019
b2b6023
Adds cpus and container declaration for Spark versions of gatk tools
cgpu Nov 24, 2019
4a2ed13
Minor typo
cgpu Nov 24, 2019
84879a3
Modifies MarkDuplicates command
cgpu Nov 24, 2019
fa090ba
Adds updated lowercase params to Spark tools
cgpu Nov 25, 2019
6c628d1
Adds verbosity ERROR
cgpu Nov 25, 2019
eafcf9f
Adding error verbosity
cgpu Nov 25, 2019
1b6bd45
Adding error verbosity
cgpu Nov 25, 2019
a7a1977
Adds redirection of log message to file (now it prints to stderr)
cgpu Nov 25, 2019
80099e8
Changes temp_dir from container's tmp to s3 remote workdir; Changes .…
cgpu Nov 25, 2019
699a32b
Restricts resources allocated to BaseRecal
cgpu Nov 25, 2019
3921fd7
Adds explicit declaration of maxForks for processes with --intervals
cgpu Nov 25, 2019
1f94c1b
Adds ApplyBQSRSpark
cgpu Nov 25, 2019
da3fd41
Defines maxForks for --intervals processes
cgpu Nov 25, 2019
9981c96
Defines maxForks for --intervals processes
cgpu Nov 25, 2019
2a26bbe
Adds custom genome GRCh38_PGP_UK
cgpu Nov 25, 2019
b7d4713
Adds default genome to match GenomeChronicler resources
cgpu Nov 25, 2019
c560cc1
Removes residue bwaIndex from GRCh38
cgpu Nov 26, 2019
752b953
Removes GenomeChronicler .dict, .fa.fai
cgpu Nov 26, 2019
e574989
Testing 2FA
cgpu Nov 26, 2019
d2cfa80
Update nextflow.config
cgpu Nov 26, 2019
25c74b7
Adds explicit memory definition for processes
cgpu Nov 27, 2019
7fcdaeb
Merge branch 'master' of cgpu/sarek
cgpu Nov 27, 2019
2fe8908
Adds more resources to BaseRecalibratorSpark processes
cgpu Nov 27, 2019
379a35d
Adds less resources to BaseRecalibratorSpark processes
cgpu Nov 27, 2019
96fd293
Adds ultra many resources; To revert back after test
cgpu Nov 27, 2019
46a7a5b
Restricts resources; Needed for using instances with Mem.GB/vCPU rati…
cgpu Nov 27, 2019
5d54bd8
Update resources for PGP-UK
cgpu Nov 28, 2019
3ad4a0c
Addresses https://github.com/cgpu/sarek/issues/3 ; Known issue with b…
cgpu Dec 4, 2019
c67c282
Updates resource alloc to *Spark versions; Adds retry for MultiQC
cgpu Dec 4, 2019
9725de1
Adds new image built with https://github.com/PGP-UK/GenomeChronicler/…
cgpu Dec 4, 2019
97c7d5e
Adds verbosity INFO for debugging
cgpu Dec 4, 2019
0a56cdb
Freezes broadinstitute/gatk to 4.1.4.0
cgpu Dec 4, 2019
e2f6d08
Change GATK Spark tools verbosity to debug
cgpu Dec 7, 2019
88e08a2
Restricts resource alloc of MarkDuplicatesSpark
cgpu Dec 7, 2019
40bd743
Removes explicit resource alloc from bamqc (qualimap)
cgpu Dec 8, 2019
d577d68
Removes metrics file; Refer to this: https://gatkforums.broadinstitut…
cgpu Dec 8, 2019
ca60c30
Downgrades gatk from 4.1.4.0 to 4.1.3.0
cgpu Dec 8, 2019
d353841
Returns mapping to nfcore version 2.5.1
cgpu Dec 8, 2019
c3a9567
Reformats `bwa mem | samtools sort` command; WIP suboptimal resource …
cgpu Dec 20, 2019
126603a
Addresses https://github.com/PGP-UK/GenomeChronicler-Sarek-nf/issues/…
cgpu Dec 23, 2019
70ad1e3
Restores mapping to nf-core; Keeps gatk *Spark
cgpu Dec 23, 2019
Empty file added assets/no_vepFile.txt
Empty file.
2 changes: 1 addition & 1 deletion conf/base.config
@@ -72,4 +72,4 @@ process {
container = {(params.annotation_cache && params.vep_cache) ? 'nfcore/sarek:2.5.2' : "nfcore/sarekvep:2.5.2.${params.genome}"}
errorStrategy = {task.exitStatus == 143 ? 'retry' : 'ignore'}
}
}
}
16 changes: 16 additions & 0 deletions conf/igenomes.config
@@ -201,5 +201,21 @@ params {
bwaIndex = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/BWAIndex/genome.fa.{amb,ann,bwt,pac,sa}"
fasta = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/WholeGenomeFasta/genome.fa"
}
'GRCh38_PGP_UK' {
acLoci = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/1000G_phase3_GRCh38_maf0.3.loci"
acLociGC = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh38/Annotation/ASCAT/1000G_phase3_GRCh38_maf0.3.loci.gc"
chrDir = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh38/Sequence/Chromosomes"
chrLength = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh38/Sequence/Length/Homo_sapiens_assembly38.len"
dbsnp = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/dbsnp_146.hg38.vcf.gz"
dbsnpIndex = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/dbsnp_146.hg38.vcf.gz.tbi"
fasta = "s3://lifebit-featured-datasets/pipelines/nf-core-sarek/resources/GRCh38_full_analysis_set_plus_decoy_hla_noChr.fa"
germlineResource = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh38/Annotation/GermlineResource/gnomAD.r2.1.1.GRCh38.PASS.AC.AF.only.vcf.gz"
germlineResourceIndex = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh38/Annotation/GermlineResource/gnomAD.r2.1.1.GRCh38.PASS.AC.AF.only.vcf.gz.tbi"
intervals = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh38/Annotation/intervals/wgs_calling_regions.hg38.bed"
knownIndels = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/{Mills_and_1000G_gold_standard.indels.hg38,beta/Homo_sapiens_assembly38.known_indels}.vcf.gz"
knownIndelsIndex = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh38/Annotation/GATKBundle/{Mills_and_1000G_gold_standard.indels.hg38,beta/Homo_sapiens_assembly38.known_indels}.vcf.gz.tbi"
snpeffDb = "GRCh38.86"
vepCacheVersion = "95"
}
}
}
101 changes: 74 additions & 27 deletions main.nf
@@ -793,6 +793,12 @@ process MapReads {
extra = status == 1 ? "-B 3" : ""
convertToFastq = hasExtension(inputFile1, "bam") ? "gatk --java-options -Xmx${task.memory.toGiga()}g SamToFastq --INPUT=${inputFile1} --FASTQ=/dev/stdout --INTERLEAVE=true --NON_PF=true | \\" : ""
input = hasExtension(inputFile1, "bam") ? "-p /dev/stdin - 2> >(tee ${inputFile1}.bwa.stderr.log >&2)" : "${inputFile1} ${inputFile2}"
// Pseudo-code: soft-coded memory split between bwa mem and samtools sort
bwa_memory = task.memory.toGiga() * 0.60
sort_memory = task.memory.toGiga() * 0.40
// Pseudo-code: soft-coded CPU split between bwa mem and samtools sort
bwa_cpus = (task.cpus * 0.60).toInteger()
sort_cpus = (task.cpus * 0.40).toInteger()
"""
${convertToFastq}
bwa mem -K 100000000 -R \"${readGroup}\" ${extra} -t ${task.cpus} -M ${fasta} \
@@ -940,10 +946,12 @@ process IndexBamFile {

// STEP 2: MARKING DUPLICATES

process MarkDuplicates {
label 'cpus_16'
process MarkDuplicatesSpark {
label 'cpus_max'
label 'memory_max'

tag {idPatient + "-" + idSample}
echo true

publishDir params.outdir, mode: params.publishDirMode,
saveAs: {
@@ -956,23 +964,21 @@ process MarkDuplicates {
set idPatient, idSample, file("${idSample}.bam") from mergedBam

output:
set idPatient, idSample, file("${idSample}.md.bam"), file("${idSample}.md.bai") into duplicateMarkedBams
set idPatient, idSample, file("${idSample}.md.bam"), file("${idSample}.md.bam.bai") into duplicateMarkedBams
file ("${idSample}.bam.metrics") into markDuplicatesReport

when: params.knownIndels

script:
markdup_java_options = task.memory.toGiga() > 8 ? params.markdup_java_options : "\"-Xms" + (task.memory.toGiga() / 2).trunc() + "g -Xmx" + (task.memory.toGiga() - 1) + "g\""
"""
gatk --java-options ${markdup_java_options} \
MarkDuplicates \
--MAX_RECORDS_IN_RAM 50000 \
--INPUT ${idSample}.bam \
--METRICS_FILE ${idSample}.bam.metrics \
--TMP_DIR . \
--ASSUME_SORT_ORDER coordinate \
--CREATE_INDEX true \
--OUTPUT ${idSample}.md.bam
gatk MarkDuplicatesSpark \
--input ${idSample}.bam \
--output ${idSample}.md.bam \
--tmp-dir . \
--verbosity DEBUG \
--create-output-bam-index true \
--spark-runner LOCAL --spark-master local[${task.cpus}]
"""
}
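The `markdup_java_options` fallback computed in the script block above is retained from the non-Spark process. Its heuristic (above 8 GB use the pipeline default; below that, `-Xms` is half the allocation truncated and `-Xmx` is the allocation minus 1 GB) can be sketched as follows; the default string here is a hypothetical stand-in for `params.markdup_java_options`:

```python
# Sketch of the markdup_java_options heuristic; the default value is an
# assumption standing in for params.markdup_java_options.
def markdup_java_options(memory_gb: int, default: str = '"-Xms4000m -Xmx7g"') -> str:
    if memory_gb > 8:
        return default
    # Below 8 GB: -Xms is half the allocation (truncated), -Xmx leaves 1 GB headroom
    return f'"-Xms{int(memory_gb / 2)}g -Xmx{memory_gb - 1}g"'

print(markdup_java_options(4))   # → "-Xms2g -Xmx3g"
print(markdup_java_options(16))  # falls back to the default
```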

@@ -1044,10 +1050,11 @@ process SentieonDedup {

// STEP 3: CREATING RECALIBRATION TABLES

process BaseRecalibrator {
process BaseRecalibratorSpark {
label 'cpus_1'

tag {idPatient + "-" + idSample + "-" + intervalBed.baseName}
tag {idPatient + "-" + idSample + "-" + intervalBed.simpleName}
echo true

input:
set idPatient, idSample, file(bam), file(bai), file(intervalBed) from bamBaseRecalibrator
@@ -1073,15 +1080,16 @@ process BaseRecalibrator {
// TODO: --use-original-qualities ???
"""
gatk --java-options -Xmx${task.memory.toGiga()}g \
BaseRecalibrator \
-I ${bam} \
-O ${prefix}${idSample}.recal.table \
--tmp-dir /tmp \
-R ${fasta} \
BaseRecalibratorSpark \
--input ${bam} \
--output ${prefix}${idSample}.recal.table \
--tmp-dir . \
--reference ${fasta} \
${intervalsOptions} \
${dbsnpOptions} \
${knownOptions} \
--verbosity INFO
--verbosity INFO \
--spark-runner LOCAL --spark-master local[${task.cpus}]
"""
}

@@ -1164,7 +1172,7 @@ bamApplyBQSR = bamApplyBQSR.dump(tag:'BAM + BAI + RECAL TABLE + INT')

// STEP 4: RECALIBRATING

process ApplyBQSR {
process ApplyBQSRSpark {
label 'memory_singleCPU_2_task'
label 'cpus_2'

@@ -1184,12 +1192,14 @@ process ApplyBQSR {
intervalsOptions = params.no_intervals ? "" : "-L ${intervalBed}"
"""
gatk --java-options -Xmx${task.memory.toGiga()}g \
ApplyBQSR \
-R ${fasta} \
ApplyBQSRSpark \
--reference ${fasta} \
--input ${bam} \
--output ${prefix}${idSample}.recal.bam \
${intervalsOptions} \
--bqsr-recal-file ${recalibrationReport}
--bqsr-recal-file ${recalibrationReport} \
--verbosity DEBUG \
--spark-runner LOCAL --spark-master local[${task.cpus}] &> applyBQSRspark.log.txt
"""
}

@@ -1284,7 +1294,7 @@ bamRecalSentieonSampleTSV
["recalibrated_sentieon_${idSample}.tsv", "${idPatient}\t${gender}\t${status}\t${idSample}\t${bam}\t${bai}\n"]
}

// STEP 4.5: MERGING THE RECALIBRATED BAM FILES
// STEP 4.5.1: MERGING THE RECALIBRATED BAM FILES

process MergeBamRecal {
label 'cpus_8'
@@ -1300,6 +1310,7 @@ process MergeBamRecal {
set idPatient, idSample, file("${idSample}.recal.bam"), file("${idSample}.recal.bam.bai") into bamRecal
set idPatient, idSample, file("${idSample}.recal.bam") into bamRecalQC
set idPatient, idSample into bamRecalTSV
file("${idSample}.recal.bam") into (bamGenomeChronicler, bamGenomeChroniclerToPrint)

when: !(params.no_intervals)

@@ -1309,6 +1320,40 @@
samtools index ${idSample}.recal.bam
"""
}
bamGenomeChroniclerToPrint.view()

// TODO: Bind this with HaplotypeCaller output and migrate this process chunk after HaplotypeCaller + VEP
Channel.fromPath(params.vepFile)
.ifEmpty { exit 1, "--vepFile not specified or no file found at that destination with the suffix .html. Please make sure to provide the file path correctly." }
.set { vepGenomeChronicler }


// STEP 4.5.2: RUNNING GenomeChronicler FOR THE RECALIBRATED BAM FILES
// TODO: Update this when there is a different VEP html report for each bam
process RunGenomeChronicler {
tag "$bam"
publishDir "$params.outdir/GenomeChronicler", mode: 'copy'
echo true

input:
file(bam) from bamGenomeChronicler
each file(vep) from vepGenomeChronicler

output:
file("results_${bam.simpleName}") into chronicler_results

script:

optional_argument = vep.endsWith("no_vepFile.txt") ? '' : "--vepFile ${vep}"

"""
genomechronicler \
--resultsDir '/GenomeChronicler' \
--bamFile $bam $optional_argument &> STDERR.txt
cp -r /GenomeChronicler/results/results_${bam.simpleName} .
mv STDERR.txt results_${bam.simpleName}/
"""
}
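The `optional_argument` ternary above drops the `--vepFile` flag whenever the channel falls back to the bundled `assets/no_vepFile.txt` placeholder. A minimal sketch of that selection logic:

```python
# Sketch of the optional_argument selection in RunGenomeChronicler: the
# bundled placeholder assets/no_vepFile.txt disables the --vepFile flag.
def vep_argument(vep_path: str) -> str:
    if vep_path.endswith("no_vepFile.txt"):
        return ""  # flag omitted entirely
    return f"--vepFile {vep_path}"

print(vep_argument("assets/no_vepFile.txt"))  # → (empty string)
print(vep_argument("sample.vep.html"))        # → --vepFile sample.vep.html
```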

// STEP 4.5': INDEXING THE RECALIBRATED BAM FILES

@@ -2830,7 +2875,7 @@ compressVCFsnpEffOut = compressVCFsnpEffOut.dump(tag:'VCF')
process VEP {
label 'VEP'
label 'cpus_4'

echo true
tag {"${idSample} - ${variantCaller} - ${vcf}"}

publishDir params.outdir, mode: params.publishDirMode, saveAs: {
@@ -2861,6 +2906,8 @@ process VEP {
cadd = (params.cadd_cache && params.cadd_WG_SNVs && params.cadd_InDels) ? "--plugin CADD,whole_genome_SNVs.tsv.gz,InDels.tsv.gz" : ""
genesplicer = params.genesplicer ? "--plugin GeneSplicer,/opt/conda/envs/nf-core-sarek-${workflow.manifest.version}/bin/genesplicer,/opt/conda/envs/nf-core-sarek-${workflow.manifest.version}/share/genesplicer-1.0-1/human,context=200,tmpdir=\$PWD/${reducedVCF}" : "--offline"
"""
hello_message="I am here at VEP"
echo \$hello_message
mkdir ${reducedVCF}

vep \
@@ -2983,7 +3030,7 @@ compressVCFOutVEP = compressVCFOutVEP.dump(tag:'VCF')
// STEP MULTIQC

process MultiQC {
publishDir "${params.outdir}/Reports/MultiQC", mode: params.publishDirMode
publishDir "${params.outdir}/MultiQC", mode: params.publishDirMode

input:
file (multiqcConfig) from Channel.value(params.multiqc_config ? file(params.multiqc_config) : "")
72 changes: 71 additions & 1 deletion nextflow.config
@@ -10,14 +10,15 @@ params {

// Workflow flags
annotateTools = null // Only with --step annotate
genome = 'GRCh38'
genome = 'GRCh38_PGP_UK'
input = null // No default input
noGVCF = null // g.vcf are produced by HaplotypeCaller
noStrelkaBP = null // Strelka will use Manta candidateSmallIndels if available
no_intervals = null // Intervals will be built from the fasta file
skipQC = null // All QC tools are used
step = 'mapping' // Starts with mapping
tools = null // No default Variant Calling or Annotation tools
vepFile = 'https://raw.githubusercontent.com/cgpu/sarek/master/assets/no_vepFile.txt'

// Workflow settings
annotation_cache = null // Annotation cache disabled
@@ -89,6 +90,75 @@ params {
// Developmental code should specify dev
process.container = 'nfcore/sarek:2.5.2'


process {

withName: MapReads {
container = "nfcore/sarek:2.5.1"
cpus = 30
memory = 210.GB
maxForks = 2
}

withName: BaseRecalibratorSpark {
container = "broadinstitute/gatk:4.1.4.0"
cpus = 8
memory = 32.GB
maxForks = 96
}

withName: MarkDuplicatesSpark {
container = "broadinstitute/gatk:4.1.3.0"
cpus = 28
memory = 180.GB
maxForks = 2
}

withName: RunGenomeChronicler {
container = "lifebitai/genomechronicler:pgp-uk-5513c6f"
cpus = 1
memory = 4.GB
maxForks = 96
}

withName: ApplyBQSRSpark {
container = "broadinstitute/gatk:4.1.4.0"
cpus = 4
memory = 16.GB
maxForks = 64
}

withName: HaplotypeCaller {
container = "broadinstitute/gatk:4.1.4.0"
cpus = 2
memory = 8.GB
maxForks = 64
}

withName: Mutect2 {
container = "broadinstitute/gatk:4.1.4.0"
cpus = 2
memory = 8.GB
maxForks = 32
}

withName: PileupSummariesForMutect2 {
container = "broadinstitute/gatk:4.1.4.0"
cpus = 1
memory = 4.GB
maxForks = 96
}

withName: MultiQC {
container = "nfcore/sarek:2.5.1"
cpus = 8
memory = 32.GB
maxForks = 1
errorStrategy = 'retry'
maxRetries = 4
}
}
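The hard-coded `cpus`/`memory` pairs above were tuned so each process fits the memory-per-vCPU ratio of the target instances (see the commit "Restricts resources; Needed for using instances with Mem.GB/vCPU rati…"). A quick sanity check of those allocations; the 7.5 GB/vCPU ceiling is an assumption here, not stated in the config:

```python
# Sanity check: each process's memory/cpus ratio must not exceed the target
# instance family's GB-per-vCPU ratio (assumed 7.5 GB/vCPU for illustration).
allocations = {
    "MapReads":              (30, 210),  # cpus, memory (GB), from nextflow.config
    "MarkDuplicatesSpark":   (28, 180),
    "BaseRecalibratorSpark": (8, 32),
    "ApplyBQSRSpark":        (4, 16),
}

MAX_GB_PER_VCPU = 7.5  # assumed instance-family ceiling

for name, (cpus, mem_gb) in allocations.items():
    ratio = mem_gb / cpus
    status = "OK" if ratio <= MAX_GB_PER_VCPU else "TOO HIGH"
    print(f"{name}: {ratio:.2f} GB/vCPU {status}")
```

With the values above, MapReads is the tightest fit at 7.00 GB/vCPU, so all four processes stay schedulable under the assumed ceiling.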

// Load base.config by default for all pipelines
includeConfig 'conf/base.config'
