Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Joint Germline subworkflow haplotypecaller -> Vqsr #595

Merged
merged 90 commits into from
Jul 21, 2022

Conversation

nickhsmith
Copy link
Contributor

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
    • If you've added a new tool - add to the software_versions process and a regex to scrape_software_versions.py
    • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
    • If necessary, also make a PR on the nf-core/sarek branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint .).
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@maxulysse
Copy link
Member

Duplicate of #546, but more advanced, so taking over

@nickhsmith nickhsmith changed the title Vqsr Joint Germline subworkflow haplotypecaller -> Vqsr Jun 20, 2022
@@ -18,6 +18,7 @@ params {
chr_dir = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh37/Sequence/Chromosomes"
dbsnp = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh37/Annotation/GATKBundle/dbsnp_138.b37.vcf"
dbsnp_tbi = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh37/Annotation/GATKBundle/dbsnp_138.b37.vcf.idx"
dbsnp_vqsr = 'dbsnp,known=false,training=true,truth=false,prior=2 dbsnp_138.b37.vcf'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer having args in the modules.config, and avoiding adding extra files in igenomes.config

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that doesn't fit with the nf-core/module styling as this is expected to be an inputted value

Comment on lines 42 to 52
// resources for GATK joint germline variant recalibration
RESOURCE_SNP = [
[ res_1000g, dbsnp ],
[ res_1000g, dbsnp_tbi ],
[ res_1000g_vqsr, dbsnp_vqsr ]
]
resource_INDEL = [
[ known_indels, dbsnp ],
[ known_indels_tbi, dbsnp_tbi ],
[ known_indels_mills_vqsr, known_indels_1000g_vqsr, dbsnp_vqsr ]
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that, but I feel like it should be done in the sarek script or in the joint germline variant calling workflow instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would then have to use less descriptive names as for hg19 and hg38 the files are slightly different. So the naming convention has to match regardless of the genome

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about something like known_snps (dbsnp should stay separate because tools like haplotypecaller explicetly want that file)

@@ -34,11 +34,10 @@ workflow RUN_HAPLOTYPECALLER {
// group by interval
genotype_gvcf_to_call = HAPLOTYPECALLER.out.vcf.join(HAPLOTYPECALLER.out.tbi).map{
meta, gvcf, tbi ->
interval_name = meta.num_intervals > 1 ? (gvcf.simpleName - "${meta.id}_").replaceFirst("_",":") : meta.id
new_meta = [id: "joint_germline", interval_name: interval_name, num_intervals: meta.num_intervals]
interval_name = meta.num_intervals > 1 ? (gvcf.simpleName - "${meta.id}_").replaceFirst("_",":") : file(params.intervals).simpleName
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be very careful here. I am afraid this may lead to the weird resume errors/unmatching meta mpas we had before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, would it be smarter to keep meta the same, and group py another (temporary) key?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem is more the reqirval of the file name. I still can't explain this, but sometimes, when retrieving something from the file name like here, the name is incorrectly resolved later on (even though the matched file in the channel element is the correct one). In this case here this would lead to some very wrong results. (in the rest of sarek, since we group on patient ID it only lead to file name clashes, a) easy to find the bug, b) the actual output results were not impacted)

ext.when = { params.tools && params.tools.split(',').contains('haplotypecaller') && params.joint_germline}
withName: 'GATK4_GENOMICSDBIMPORT' {
ext.prefix = { meta.num_intervals > 1 ? meta.intervals_name : "joint_interval" }
ext.when = { params.tools && params.tools.split(',').contains('haplotypecaller') && params.joint_germline && !params.no_intervals}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ext.when = { params.tools && params.tools.split(',').contains('haplotypecaller') && params.joint_germline && !params.no_intervals}
ext.when = { params.tools && params.tools.split(',').contains('haplotypecaller') && params.joint_germline && !params.no_intervals}

@@ -118,17 +119,23 @@ workflow GERMLINE_VARIANT_CALLING {
if (tools.split(',').contains('haplotypecaller')){
cram_recalibrated_intervals_haplotypecaller = cram_recalibrated_intervals
.map{ meta, cram, crai, intervals ->
[meta, cram, crai, intervals, []]

intervals_name = meta.num_intervals == 0 ? "no_interval" : intervals.simpleName
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we also need a conditional for joint germline. This addition of meta can only happen for haplotypecaller + joint germline or we need to rewrite a bunch of logic for the single sample case

@@ -47,7 +48,7 @@ workflow GERMLINE_VARIANT_CALLING {
.map{ meta, cram, crai, intervals, num_intervals ->

//If no interval file provided (0) then add empty list
intervals_new = num_intervals == 0 ? [] : intervals
intervals_new = num_intervals == 0 ? [] : intervals
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
intervals_new = num_intervals == 0 ? [] : intervals
intervals_new = num_intervals == 0 ? [] : intervals

//Merge scatter/gather vcfs & index
//Rework meta for variantscalled.csv and annotation tools
MERGE_GENOTYPEGVCFS(vcfs_sorted_input.intervals.map{meta, vcf ->
[[id: "joint_variant_calling", patient: "all_samples", variantcaller: "haplotypecaller", num_intervals: meta.num_intervals], vcf]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do the same nice formatting here as you did in line 116 ff?

[meta, cram, crai, intervals, []]

intervals_name = meta.num_intervals == 0 ? "no_interval" : intervals.simpleName
new_meta = [patient:meta.patient, sample:meta.sample, sex:meta.sex, status:meta.status, id:meta.sample, data_type:meta.data_type, num_intervals:meta.num_intervals, intervals_name:intervals_name]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
new_meta = [patient:meta.patient, sample:meta.sample, sex:meta.sex, status:meta.status, id:meta.sample, data_type:meta.data_type, num_intervals:meta.num_intervals, intervals_name:intervals_name]
new_meta = [
data_type:meta.data_type,
id:meta.sample,
intervals_name:intervals_name,
num_intervals:meta.num_intervals,
patient:meta.patient,
sample:meta.sample,
sex:meta.sex,
status:meta.status
]


versions = ch_versions // channel: [ versions.yml ]
versions = ch_versions // channel: [ versions.yml ]
genotype_vcf = Channel.empty().mix(vcfs_sorted_input.no_intervals,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the vcfs_sorted_input.no_intervals also needs the variantcaller: "haplotypecaller" in it smeta map here to make sure annotation is placing it in the proper folder

//Merge scatter/gather vcfs & index
//Rework meta for variantscalled.csv and annotation tools
MERGE_GENOTYPEGVCFS(vcfs_sorted_input.intervals.map{meta, vcf ->
[[id: "joint_variant_calling", patient: "all_samples", variantcaller: "haplotypecaller", num_intervals: meta.num_intervals], vcf]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[[id: "joint_variant_calling", patient: "all_samples", variantcaller: "haplotypecaller", num_intervals: meta.num_intervals], vcf]
[[
id: "joint_variant_calling",
num_intervals: meta.num_intervals,
patient: "all_samples",
variantcaller: "haplotypecaller"
], vcf]

nickhsmith and others added 11 commits July 20, 2022 15:58
Co-authored-by: FriederikeHanssen <Friederike.hanssen@qbic.uni-tuebingen.de>
Co-authored-by: Maxime U. Garcia <maxime.garcia@scilifelab.se>
Co-authored-by: FriederikeHanssen <Friederike.hanssen@qbic.uni-tuebingen.de>
Co-authored-by: FriederikeHanssen <Friederike.hanssen@qbic.uni-tuebingen.de>
Co-authored-by: Maxime U. Garcia <maxime.garcia@scilifelab.se>
@FriederikeHanssen FriederikeHanssen merged commit 21b9f62 into nf-core:dev Jul 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants