
convert final pipeline output to vcf #16

Merged
merged 33 commits into master from 2vcf on Jun 27, 2020

Conversation

@aryarm (Owner) commented Jun 18, 2020

Changes to the Code

Resolves #6 by adding two steps to the end of the classify subworkflow, which 1) convert the final TSV output to VCF and 2) add contigs to the VCF header, so that the output can be used with other software like GATK's ValidateVariants. A new script, 2vcf.py, handles the conversion to VCF. It also has a "-i" switch for internal use, which can create trained linear models for mapping RF probabilities to recalibrated QUAL scores.
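
The recalibration itself uses trained linear models, but the underlying idea of mapping a classifier probability to a Phred-scaled QUAL can be sketched as below (a minimal illustration, not 2vcf.py's actual code; the function name and cap are assumptions):

```python
import math

def prob_to_qual(p, max_qual=99.0, eps=1e-10):
    """Map a classifier probability p (that the variant is real) to a
    Phred-scaled quality score: QUAL = -10 * log10(P(error)) = -10 * log10(1 - p).
    Probabilities at or near 1.0 are clamped so log10 stays defined."""
    p = min(max(p, 0.0), 1.0)
    qual = -10.0 * math.log10(max(1.0 - p, eps))
    return min(qual, max_qual)
```

For example, a probability of 0.9 corresponds to QUAL 10, and 0.999 to QUAL 30; a trained model would adjust these raw values so they better reflect the empirical error rate.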

Resolves #8 by adding a Singularity container (with Docker and Conda) to the beginning of all Snakefiles. Unfortunately, this option is still untested because I had a lot of trouble installing Singularity on our server without root access. But I'm pretty sure it should work, as long as there isn't a bug in Snakemake.

Resolves #9 by adding reads from chr1 of the Jurkat and MOLT-4 samples to the example data. That should allow the user to run the prepare subworkflow as well as the classify subworkflow when they run the pipeline on the example data. To demonstrate how the pipeline can take both FASTQs and BAM/BED files, I uploaded the MOLT-4 reads as FASTQs and the Jurkat reads as a BAM file and a BED file. By using only chromosome 1, I was able to cut the runtime of the pipeline on the example data down to about 1 hour (excluding dependency installation).

Resolves #10 by specifying ==<version_number> at the end of each entry in the envs/ files.
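
A pinned entry in one of the envs/ files would look something like this (the package names and versions here are illustrative placeholders, not the repo's actual pins):

```yaml
# envs/example.yaml -- illustrative only; packages and versions are placeholders
channels:
  - bioconda
  - conda-forge
dependencies:
  - samtools==1.10
  - gatk4==4.1.7.0
```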

Resolves #15 by converting the prepare and classify subworkflows to .smk rules files in the rules/ dir and creating a Snakefile that imports those subworkflows. From here on out, we refer to this Snakefile as the "master pipeline" in the documentation. I added a README to discuss how to execute the master pipeline and the two subworkflows, and another README to explain the config options for each. The master pipeline takes a new config file, unoriginally named config.yaml, which includes only the most important parameters from the prepare.yaml and classify.yaml config files. All other config options are unset, which forces the pipeline to use appropriate defaults. However, any of the config options from the prepare.yaml config file can be added to config.yaml for more advanced usage.
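
The master pipeline's structure can be sketched roughly like this (a minimal Snakemake sketch; the rule-file paths match the layout described above, but the output path and config key are assumptions):

```python
# Snakefile (master pipeline) -- a sketch, not the repo's actual file
configfile: "config.yaml"

# pull in the two subworkflows as rules files
include: "rules/prepare.smk"
include: "rules/classify.smk"

rule all:
    input:
        # final VCF produced at the end of the classify subworkflow
        # (the exact path is hypothetical)
        config["out"] + "/final.vcf.gz"
```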

Resolves #17 by checking the file extensions of the first and second entries in the samples.tsv file. If the first entry has a '.bam' file extension, the pipeline will skip the alignment step and the PCR removal steps. If the second entry exists and has a '.bed' file extension, the pipeline will also skip the peak calling step. Otherwise, it will assume the files are FASTQs and run the pipeline from start to finish. Unfortunately, this means that if the user provides a BAM file, it must be constructed in the same way that our BAM files are constructed in order for everything to run smoothly when it is used by the variant callers in the ensemble. I've outlined those requirements in the new config README and in the config.yaml and prepare.yaml config files.
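
The extension-based dispatch described above amounts to logic like the following (a hypothetical helper for illustration; the function and step names are not the pipeline's actual rule names):

```python
def steps_to_skip(entries):
    """Given the file entries from one samples.tsv row, decide which
    steps can be skipped based on their file extensions."""
    skip = set()
    if entries and entries[0].endswith(".bam"):
        # a pre-aligned BAM makes alignment and PCR-duplicate removal unnecessary
        skip.update({"align", "rm_pcr_dups"})
    if len(entries) > 1 and entries[1].endswith(".bed"):
        # a peaks BED file additionally lets us skip peak calling
        skip.add("call_peaks")
    # FASTQ inputs skip nothing: the pipeline runs start to finish
    return skip
```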

Changes to the Documentation

I've added two new README files: the rules README and the config README. In addition to their main roles as described above, they also generally serve to explain the structure of the pipeline and what a user should do for specific use-cases of the pipeline. I've also added a section in the main README for users who have no prior experience with Snakemake.

Lastly, I rewrote the run.bash script to work for both local and cluster execution of Snakemake, so that the user wouldn't be bombarded with output from stdout and stderr when running the pipeline locally.
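
The local-vs-cluster switch can be sketched like this (a minimal sketch of the idea only; the flag name and the snakemake invocations in the comments are assumptions, not run.bash's actual contents):

```shell
#!/usr/bin/env bash
# choose_mode prints which execution path to take based on the first argument
choose_mode() {
    if [ "$1" = "--cluster" ]; then
        # e.g. snakemake --profile cluster ... (the scheduler captures job logs)
        echo "cluster"
    else
        # e.g. snakemake --cores 4 &> pipeline.log (keeps the terminal quiet)
        echo "local"
    fi
}
```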

aryarm and others added 30 commits March 16, 2020 10:26
…see #6)

still TODO:
1) recalibrate the qual scores
2) test that the vcf passes muster with other software (mainly GATK)
3) ensure that it writes all of the desired sites properly
1) move manta and strelka configuration to the configs directory
2) move the caller specific parameters to their own config file
3) explain required vs optional params in the caller specific params config file
… other

also make it clear which config params are required vs optional
…nes as subworkflows

resolves #15 and updates the README documentation
and prepare configs for training with GM12878
…flow

using a new 'subset_callers' config option
@aryarm aryarm merged commit ec971d6 into master Jun 27, 2020
@aryarm aryarm deleted the 2vcf branch June 27, 2020 16:18