
convert final pipeline output to vcf #16

Merged
merged 33 commits into master from 2vcf on Jun 27, 2020

Conversation

@aryarm (Owner) commented Jun 18, 2020

Changes to the Code

Resolves #6 by adding two steps to the end of the classify subworkflow, which 1) convert the final TSV output to VCF and 2) add contigs to the VCF header, so that the output can be used with other software like GATK's ValidateVariants. A new script, 2vcf.py, handles the conversion to VCF. It also has a "-i" switch for internal use, which can create trained linear models for mapping RF probabilities to recalibrated QUAL scores.
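
The recalibration itself uses trained linear models, but the underlying idea of mapping a classifier probability to a Phred-scaled QUAL can be sketched as below (a minimal illustration, not 2vcf.py's actual code; the function name and cap are assumptions):

```python
import math

def prob_to_qual(p, max_qual=99.0, eps=1e-10):
    """Map a classifier probability p (that the variant is real) to a
    Phred-scaled quality score: QUAL = -10 * log10(P(error)) = -10 * log10(1 - p).
    Probabilities at or near 1.0 are clamped so log10 stays defined."""
    p = min(max(p, 0.0), 1.0)
    qual = -10.0 * math.log10(max(1.0 - p, eps))
    return min(qual, max_qual)
```

For example, a probability of 0.9 corresponds to QUAL 10, and 0.999 to QUAL 30; a trained model would adjust these raw values so they better reflect the empirical error rate.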

Resolves #8 by adding a Singularity container (with Docker and Conda) to the beginning of all Snakefiles. Unfortunately, this option is still untested because I had a lot of trouble installing Singularity on our server without root access. But I'm pretty sure it should work, as long as there isn't a bug in Snakemake.

Resolves #9 by adding reads from chr1 of the Jurkat and MOLT-4 samples to the example data. That should allow the user to run the prepare subworkflow as well as the classify subworkflow when they run the pipeline on the example data. To demonstrate how the pipeline can take both FASTQs and BAM/BED files, I uploaded the MOLT-4 reads as FASTQs and the Jurkat reads as a BAM file and a BED file. By using only chromosome 1, I was able to cut the runtime of the pipeline on the example data down to about 1 hour (excluding dependency installation).

Resolves #10 by specifying ==<version_number> at the end of each entry in the envs/ files.
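
A pinned entry in one of the envs/ files would look something like this (the package names and versions here are illustrative placeholders, not the repo's actual pins):

```yaml
# envs/example.yaml -- illustrative only; packages and versions are placeholders
channels:
  - bioconda
  - conda-forge
dependencies:
  - samtools==1.10
  - gatk4==4.1.7.0
```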

Resolves #15 by converting the prepare and classify subworkflows to .smk rules files in the rules/ dir and creating a Snakefile that imports those subworkflows. From here on out, we refer to this Snakefile as the "master pipeline" in the documentation. I added a README to discuss how to execute the master pipeline and the two subworkflows, and another README to explain the config options for each. The master pipeline takes a new config file, unoriginally named config.yaml, which includes only the most important parameters from the prepare.yaml and classify.yaml config files. All other config options are unset, which forces the pipeline to use appropriate defaults. However, any of the config options from the prepare.yaml config file can be added to config.yaml for more advanced usage.
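
The master pipeline's structure can be sketched roughly like this (a minimal Snakemake sketch; the rule-file paths match the layout described above, but the output path and config key are assumptions):

```python
# Snakefile (master pipeline) -- a sketch, not the repo's actual file
configfile: "config.yaml"

# pull in the two subworkflows as rules files
include: "rules/prepare.smk"
include: "rules/classify.smk"

rule all:
    input:
        # final VCF produced at the end of the classify subworkflow
        # (the exact path is hypothetical)
        config["out"] + "/final.vcf.gz"
```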

Resolves #17 by checking the file extensions of the first and second entries in the samples.tsv file. If the first entry has a '.bam' file extension, the pipeline will skip the alignment step and the PCR removal steps. If the second entry exists and has a '.bed' file extension, the pipeline will also skip the peak calling step. Otherwise, it will assume the files are FASTQs and run the pipeline from start to finish. Unfortunately, this means that if the user provides a BAM file, it must be constructed in the same way that our BAM files are constructed in order for everything to run smoothly when it is used by the variant callers in the ensemble. I've outlined those requirements in the new config README and in the config.yaml and prepare.yaml config files.
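
The extension-based dispatch described above amounts to logic like the following (a hypothetical helper for illustration; the function and step names are not the pipeline's actual rule names):

```python
def steps_to_skip(entries):
    """Given the file entries from one samples.tsv row, decide which
    steps can be skipped based on their file extensions."""
    skip = set()
    if entries and entries[0].endswith(".bam"):
        # a pre-aligned BAM makes alignment and PCR-duplicate removal unnecessary
        skip.update({"align", "rm_pcr_dups"})
    if len(entries) > 1 and entries[1].endswith(".bed"):
        # a peaks BED file additionally lets us skip peak calling
        skip.add("call_peaks")
    # FASTQ inputs skip nothing: the pipeline runs start to finish
    return skip
```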

Changes to the Documentation

I've added two new README files: the rules README and the config README. In addition to their main roles as described above, they also generally serve to explain the structure of the pipeline and what a user should do for specific use-cases of the pipeline. I've also added a section in the main README for users who have no prior experience with Snakemake.

Lastly, I rewrote the run.bash script to work for both local and cluster execution of Snakemake, so that the user wouldn't be bombarded with output from stdout and stderr when running the pipeline locally.
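
The local-vs-cluster switch can be sketched like this (a minimal sketch of the idea only; the flag name and the snakemake invocations in the comments are assumptions, not run.bash's actual contents):

```shell
#!/usr/bin/env bash
# choose_mode prints which execution path to take based on the first argument
choose_mode() {
    if [ "$1" = "--cluster" ]; then
        # e.g. snakemake --profile cluster ... (the scheduler captures job logs)
        echo "cluster"
    else
        # e.g. snakemake --cores 4 &> pipeline.log (keeps the terminal quiet)
        echo "local"
    fi
}
```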

aryarm and others added 30 commits March 16, 2020 10:26
…see #6)

still TODO:
1) recalibrate the qual scores
2) test that the vcf passes muster with other software (mainly GATK)
3) ensure that it writes all of the desired sites properly
1) move manta and strelka configuration to the configs directory
2) move the caller specific parameters to their own config file
3) explain required vs optional params in the caller specific params config file
… other

also make it clear which config params are required vs optional
…nes as subworkflows

resolves #15 and updates the README documentation
and prepare configs for training with GM12878
…flow

using a new 'subset_callers' config option
@aryarm aryarm merged commit ec971d6 into master Jun 27, 2020
@aryarm aryarm deleted the 2vcf branch June 27, 2020 16:18