Nf core dev / variant calling bugs / ideas #489

nickhsmith · 2022-02-22T10:17:20Z

PR checklist

The following changes were made after trying to run the variant_calling pipeline (specifically --tools haplotypecaller). I have readded the gatk4/genotypegvcfs module from nf-core (which was updated to include interval index files)

An overview of the changes file by file:

concatenateVCFs.sh : The default for bcftools sorting is on the /tmp directory (or /scratch) however it makes sense to me to maintain the temp.sorting files in the location of your work directory (.)

conf/base.config: GENOTYPEGVCFS is listed as a medium_priority tool, but for WGS it was constantly failing the first 1 or 2 attempts. Giving it the default high_priority mem/cpus

conf/modules.config: added the GENOTYPEGVCFS module to publish the final vcf file. Edited the HAPLOTYPECALLER module to not publish the gvcfs by default (could be set to true, but they are huge for a WGS sample) and to name them "{meta.id}.g" to differentiate them from the final vcf file.

lib/WorkflowMain.groovy: the nextflow pipeline requires an input file. But this is in direct contrast to the automatic restart based on the results/../.csv used in sarek.nf to determine the step and input files. Not elegant, but works to restart if appropriate step is named

subworkflows/local/germline_variant_calling.nf: include GENOTYPEGVCFS and account for both with intervals genotyping (work on the gvcfs after concat) or no intervals (work on the gvfs without concat) and create a new channel for the GENOTYPEGVCFS input. Should eventually be linked with logic for joint variant calling. The current behaviour is ALWAYS do single sample variant calling, and if joint variant calling is called, do that also. (not ideal)

…e-dev

maxulysse · 2022-02-22T11:24:17Z

concatenateVCFs.sh : The default for bcftools sorting is on the /tmp directory (or /scratch) however it makes sense to me to maintain the temp.sorting files in the location of your work directory (.)

I'd rather keep the default for bcftools or change the default.

conf/base.config: GENOTYPEGVCFS is listed as a medium_priority tool, but for WGS it was constantly failing the first 1 or 2 attempts. Giving it the default high_priority mem/cpus

Why not changing the modules to a high_priority tool?

conf/modules.config: added the GENOTYPEGVCFS module to publish the final vcf file. Edited the HAPLOTYPECALLER module to not publish the gvcfs by default (could be set to true, but they are huge for a WGS sample) and to name them "{meta.id}.g" to differentiate them from the final vcf file.

Sounds like a good idea

lib/WorkflowMain.groovy: the nextflow pipeline requires an input file. But this is in direct contrast to the automatic restart based on the results/../.csv used in sarek.nf to determine the step and input files. Not elegant, but works to restart if appropriate step is named

Not seeing any other solution there for now

subworkflows/local/germline_variant_calling.nf: include GENOTYPEGVCFS and account for both with intervals genotyping (work on the gvcfs after concat) or no intervals (work on the gvfs without concat) and create a new channel for the GENOTYPEGVCFS input. Should eventually be linked with logic for joint variant calling. The current behaviour is ALWAYS do single sample variant calling, and if joint variant calling is called, do that also. (not ideal)

I see what you mean there

FriederikeHanssen · 2022-02-22T11:34:29Z

concatenateVCFs.sh : The default for bcftools sorting is on the /tmp directory (or /scratch) however it makes sense to me to maintain the temp.sorting files in the location of your work directory (.)

I don't understand the advantages of this. can you elaborate?

conf/base.config: GENOTYPEGVCFS is listed as a medium_priority tool, but for WGS it was constantly failing the first 1 or 2 attempts. Giving it the default high_priority mem/cpus

I agree that this should be changed upstream

lib/WorkflowMain.groovy: the nextflow pipeline requires an input file. But this is in direct contrast to the automatic restart based on the results/../.csv used in sarek.nf to determine the step and input files. Not elegant, but works to restart if appropriate step is named

not sure about a solution, but also haven't given it any thought. Honestly, not very happy about changing the temaplte code. Can we maybe take this disucssion to #tools before making any changes here? Looks like a release there will happen soon anyways, so it would maybe be good to get it in there now

subworkflows/local/germline_variant_calling.nf: include GENOTYPEGVCFS and account for both with intervals genotyping (work on the gvcfs after concat) or no intervals (work on the gvfs without concat) and create a new channel for the GENOTYPEGVCFS input. Should eventually be linked with logic for joint variant calling. The current behaviour is ALWAYS do single sample variant calling, and if joint variant calling is called, do that also. (not ideal)

Joint variant calling subworkflow needs a bit of work still anyways. So fine with merging it here and then work on the changes in a separate PR.

nickhsmith · 2022-02-22T13:00:26Z

the bcftools sort in the concatenateVCFs.sh file is used because the concatenation of all of the intervals (by chromosome) builds a vcf that is unsorted. bcftools by default uses the /tmp/ dir to create temporary /tmp/bcftools-sort.XXXXXX files that are then merged to create the sorted vcf file.

If a user is on a shared file system that doesn't have write access to /tmp/ or if this is not very large (for WGS these temp gvcf files can get quite big) then you get errors. By changing the default temp space to the local directory . which is actually in the work/hash1/hash2/ dir we make sure that the user has write access to this location, and that presumably the work_dir is large enough to handle all of the the intermediate files this negates the problems mentioned above (although at the cost of potential I/O depending on how /tmp/ and /PATH/TO/WORK/ storage is built)

nickhsmith · 2022-02-22T13:03:13Z

I agree that we can change the upstream priority for nf-core/modules/genotypegvfs but that makes assumptions for other users that may not be ideal. I will make a PR for that though

FriederikeHanssen · 2022-02-22T13:08:06Z

ah now I see. Yes we had the same issue with GATK tools in the past and had to add --TMP-DIR or so to also allow to write to defined scratch space with the nextflow parameter scratch = true or scratch = /some/place/where/intermediate/file/are/stored . Did you try the process with setting the scratch directory as well to make sure that would work to? It would be preferabe to not enforce the work dir to be used for intermediate files, since it can slow down execution. I think your solution should work, but just be sure maybe a quick test would be nice. Scratch directory for any process can be set in the custom.config too

FriederikeHanssen · 2022-02-22T13:09:11Z

I agree that we can change the upstream priority for nf-core/modules/genotypegvfs but that makes assumptions for other users that may not be ideal. I will make a PR for that though

oretty sure we are the only ones using it at the moment :D If the raredisease pipeline needs it they would run into the same problem with WGS data.

maxulysse · 2022-02-22T13:18:13Z

I agree that we can change the upstream priority for nf-core/modules/genotypegvfs but that makes assumptions for other users that may not be ideal. I will make a PR for that though

oretty sure we are the only ones using it at the moment :D If the raredisease pipeline needs it they would run into the same problem with WGS data.

I think default should be more, so that it's easier to restrict when needed.
If default is less, then we really need to be sure to add more resources

nickhsmith · 2022-02-22T15:23:24Z

ah now I see. Yes we had the same issue with GATK tools in the past and had to add --TMP-DIR or so to also allow to write to defined scratch space with the nextflow parameter scratch = true or scratch = /some/place/where/intermediate/file/are/stored . Did you try the process with setting the scratch directory as well to make sure that would work to? It would be preferabe to not enforce the work dir to be used for intermediate files, since it can slow down execution. I think your solution should work, but just be sure maybe a quick test would be nice. Scratch directory for any process can be set in the custom.config too

What about a params.scratchdir so that we can set any scratch directive with process.scratch = "${params.scratchdir}"

FriederikeHanssen · 2022-02-22T15:45:28Z

this already exists to provide in a custom config/institutional config etc.. not sure, if process configurations should be command line parameters

nickhsmith · 2022-02-22T15:49:34Z

this already exists to provide in a custom config/institutional config etc.. not sure, if process configurations should be command line parameters

Like (or something):

withName CONCAT_VCF{
    scratch = 'path/to/scratch'
}

FriederikeHanssen · 2022-02-22T15:53:00Z

yep it can also be set globally for all pipelines like here: https://github.com/nf-core/configs/blob/91b5875b44b01edfd2992e19adec0d1b552feef8/conf/cfc.config#L13 or just for your pipeline run for all processes:

process{
   scratch = false (writes to work/ then)
   
   or
   
   scratch = /some/directory/where/there/is/space
 }

Docu: https://www.nextflow.io/docs/latest/process.html#scratch

FriederikeHanssen · 2022-02-23T08:50:40Z

I am confused. What do you mean :D Lasse Folkersen is implementing the module right now, so I can add it to the subworkflow soon, but nothing going on in this PR as far as I see.... it's all still in nf-core/modules

maxulysse · 2022-02-23T08:52:41Z

I am confused. What do you mean :D Lasse Folkersen is implementing the module right now, so I can add it to the subworkflow soon.

Sorry, replied in the wrong PR...

subworkflows/local/germline_variant_calling.nf

.gitignore

lib/WorkflowMain.groovy

Co-authored-by: FriederikeHanssen <Friederike.hanssen@qbic.uni-tuebingen.de>

conf/base.config

github-actions · 2022-02-24T10:03:56Z

Markdown linting is failing

To keep the code consistent with lots of contributors, we run automated code consistency checks.
To fix this CI test, please run:

Install markdownlint-cli
- On Mac: brew install markdownlint-cli
- Everything else: Install npm then install markdownlint-cli (npm install -g markdownlint-cli)
Fix the markdown errors
- Automatically: markdownlint . --fix
- Manually resolve anything left from markdownlint .

Once you push these changes the test should pass, and you can hide this comment 👍

We highly recommend setting up markdownlint in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help!

Thanks again for your contribution!

Smith Nicholas and others added 8 commits February 14, 2022 11:36

remove mandatory input

230bca6

fix typo

d621bf0

copy enabled processes

a1a8cc3

ignore vim edits

ab02aae

bug fixes and input/output matching

59cf6bb

change configs. Naming and cpu/mem allocation

3491dd2

merge with nf-core/dev

8ec1fac

Update modules.config

3b75504

nickhsmith requested a review from maxulysse as a code owner February 22, 2022 10:17

Smith Nicholas added 2 commits February 22, 2022 11:22

update genotypegvcfs

324d027

Merge branch 'nf-core-dev' of github.com:nickhsmith/sarek into nf-cor…

a3333fb

…e-dev

maxulysse approved these changes Feb 22, 2022

View reviewed changes

Smith Nicholas added 2 commits February 22, 2022 16:31

add scratchdir param

73061a5

update scratch

1e2a585

remove scratchdir

bc9fdcb

maxulysse reviewed Feb 24, 2022

View reviewed changes

subworkflows/local/germline_variant_calling.nf Outdated Show resolved Hide resolved

maxulysse approved these changes Feb 24, 2022

View reviewed changes

Update subworkflows/local/germline_variant_calling.nf

fcf6371

FriederikeHanssen reviewed Feb 24, 2022

View reviewed changes

.gitignore Outdated Show resolved Hide resolved

maxulysse reviewed Feb 24, 2022

View reviewed changes

lib/WorkflowMain.groovy Outdated Show resolved Hide resolved

Apply suggestions from code review

a9781f7

Co-authored-by: FriederikeHanssen <Friederike.hanssen@qbic.uni-tuebingen.de>

maxulysse reviewed Feb 24, 2022

View reviewed changes

conf/base.config Outdated Show resolved Hide resolved

Update conf/base.config

720b2b9

maxulysse merged commit 9bc4815 into nf-core:dev Feb 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nf core dev / variant calling bugs / ideas #489

Nf core dev / variant calling bugs / ideas #489

nickhsmith commented Feb 22, 2022

maxulysse commented Feb 22, 2022

FriederikeHanssen commented Feb 22, 2022

nickhsmith commented Feb 22, 2022

nickhsmith commented Feb 22, 2022

FriederikeHanssen commented Feb 22, 2022 •

edited

Loading

FriederikeHanssen commented Feb 22, 2022

maxulysse commented Feb 22, 2022

nickhsmith commented Feb 22, 2022 •

edited

Loading

FriederikeHanssen commented Feb 22, 2022

nickhsmith commented Feb 22, 2022

FriederikeHanssen commented Feb 22, 2022 •

edited

Loading

FriederikeHanssen commented Feb 23, 2022 •

edited

Loading

maxulysse commented Feb 23, 2022

github-actions bot commented Feb 24, 2022

Nf core dev / variant calling bugs / ideas #489

Nf core dev / variant calling bugs / ideas #489

Conversation

nickhsmith commented Feb 22, 2022

PR checklist

An overview of the changes file by file:

maxulysse commented Feb 22, 2022

FriederikeHanssen commented Feb 22, 2022

nickhsmith commented Feb 22, 2022

nickhsmith commented Feb 22, 2022

FriederikeHanssen commented Feb 22, 2022 • edited Loading

FriederikeHanssen commented Feb 22, 2022

maxulysse commented Feb 22, 2022

nickhsmith commented Feb 22, 2022 • edited Loading

FriederikeHanssen commented Feb 22, 2022

nickhsmith commented Feb 22, 2022

FriederikeHanssen commented Feb 22, 2022 • edited Loading

FriederikeHanssen commented Feb 23, 2022 • edited Loading

maxulysse commented Feb 23, 2022

github-actions bot commented Feb 24, 2022

Markdown linting is failing

FriederikeHanssen commented Feb 22, 2022 •

edited

Loading

nickhsmith commented Feb 22, 2022 •

edited

Loading

FriederikeHanssen commented Feb 22, 2022 •

edited

Loading

FriederikeHanssen commented Feb 23, 2022 •

edited

Loading