convert final pipeline output to vcf #16

Merged: 33 commits, Jun 27, 2020

Commits:
d5d5591  start 2vcf script (aryarm, Mar 16, 2020)
c419828  allow statistics.py to output threshold values (aryarm, May 7, 2020)
aded5fb  use thresholds in Snakefile-classify (aryarm, May 7, 2020)
13cecc1  create a script for converting the outputs of the pipeline to a VCF (… (aryarm, Jun 4, 2020)
e527b1c  fix indexing of output vcf in 2vcf.py (aryarm, Jun 4, 2020)
89b23e3  suppress mlr2 warning (see issue #10) (aryarm, Jun 6, 2020)
57c120a  let snakemake use docker and singularity if installed (resolves #8) (aryarm, Jun 10, 2020)
2490ab8  remove {threads} from the prepare pipeline. every rule gets one thread (aryarm, Jun 10, 2020)
f1d330c  add default parameters to the caller scripts that take params (aryarm, Jun 11, 2020)
39c11b1  update all dependencies and freeze them at the versions we used (reso… (aryarm, Jun 11, 2020)
e998743  simplify config checks using a custom function (aryarm, Jun 11, 2020)
bd8ff56  allow bam files as input to prepare pipeline (aryarm, Jun 11, 2020)
796c128  upgrade snakemake version to 5.18.1 since 5.19 is too unstable still (aryarm, Jun 11, 2020)
5911876  remove -S param from samtools; it didn't do anything anyway (aryarm, Jun 11, 2020)
0435ec0  make more config files (aryarm, Jun 15, 2020)
dcde6a9  rewrite pipelines so they will execute the example data one after the… (aryarm, Jun 16, 2020)
9704ce2  fix vcf header coming from 2vcf.py (see issue #6) (aryarm, Jun 16, 2020)
9ca4405  create a snakefile for executing both the prepare and classify pipeli… (aryarm, Jun 16, 2020)
a5ba3ed  apparently, ranger 0.11.5 doesn't exist. oops (aryarm, Jun 17, 2020)
2ef0a0d  explain config options more thoroughly (aryarm, Jun 18, 2020)
ba9b6a2  update vcf script with old changes (aryarm, Jun 18, 2020)
1bd6ee7  clarify that we haven't tried using --use-singularity (aryarm, Jun 18, 2020)
a28cea3  allow the user to more easily exclude callers in the classify subwork… (aryarm, Jun 24, 2020)
ab8bfde  remove breakca, manta, and delly from defaults b/c didn't improve f-b… (aryarm, Jun 25, 2020)
90942de  silence error that occurs when subset_callers is not provided (aryarm, Jun 25, 2020)
7ba740a  allow bed files as input. resolves #17 (aryarm, Jun 26, 2020)
f76f6fa  clarify .bam requirements in config files (aryarm, Jun 27, 2020)
5738275  fix indentation in master workflow and prepare subworkflow (aryarm, Jun 27, 2020)
c251533  add local (non-SGE) execution to run.bash (aryarm, Jun 27, 2020)
3a80fd1  rename breakca to varca in the final output (aryarm, Jun 27, 2020)
f9dc1e7  check snp or indel in 2vcf.py (aryarm, Jun 27, 2020)
8ea04e6  add linear regression model to 2vcf.py for QUAL recalibration (aryarm, Jun 27, 2020)
309374d  Merge branch 'master' into 2vcf (aryarm, Jun 27, 2020)
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,3 +1,3 @@
out
.snakemake/
data/
data
55 changes: 34 additions & 21 deletions README.md
@@ -1,10 +1,10 @@
[![Snakemake](https://img.shields.io/badge/snakemake-≥5.5.0-brightgreen.svg?style=flat-square)](https://snakemake.bitbucket.io)
[![Snakemake](https://img.shields.io/badge/snakemake-5.18.0-brightgreen.svg?style=flat-square)](https://snakemake.readthedocs.io/)
[![License](https://img.shields.io/apm/l/vim-mode.svg)](LICENSE)

# varCA
A pipeline for running an ensemble of variant callers to predict variants in ATAC-seq reads.

The entire pipeline is made up of two smaller pipelines. The `prepare` pipeline calls each variant caller and prepares the resulting data for use by the `classify` pipeline, which runs the ensemble classifier to predict the existence of variants at each site.
The entire pipeline is made up of two smaller subworkflows. The `prepare` subworkflow calls each variant caller and prepares the resulting data for use by the `classify` subworkflow, which runs the ensemble classifier to predict the existence of variants at each site.

# download
Execute the following commands or download the [latest release](https://github.com/aryam7/varCA/releases/latest) manually.
@@ -19,51 +19,64 @@ wget -O- -q https://github.com/aryam7/varCA/releases/latest/download/data.tar.gz
```
The example data includes the following files:

- `samples.tsv` - An example samples file, required for the `predict` pipeline.
- `samples.tsv` - An example samples file; required for the master pipeline and the `predict` subworkflow
- `snp.rda` - A trained RF model for classifying SNVs
- `indel.rda` - A trained RF model for classifying indels
- `snp.tsv.gz` - Prepared, example SNV training data
- `indel.tsv.gz` - Prepared, example indel training data
- `even-indels.tsv.gz` - Prepared, example indel test data
- `hg38.chr1.*` - Chromosome 1 of the hg38 reference genome and its index files
- `jurkat.chr1.bam` - BAM file from ATAC-seq on chromosome 1 from the Jurkat sample in GSE129086
- `jurkat.chr1.bam.bai` - Index for the Jurkat BAM file
- `molt4.chr1.*.fq.gz` - FASTQ files from paired ATAC-seq reads for chromosome 1 from the Molt-4 sample in GSE129086

# execution
On example data:
```
# install snakemake via conda (if not already installed)
conda install -c bioconda -c conda-forge 'snakemake>=5.5.0'
# 1) install snakemake via conda (if not already installed)
conda install -c bioconda -c conda-forge 'snakemake==5.18.0'

# execute the pipeline on example data locally
snakemake -s Snakefiles/Snakefile-classify --use-conda
# execute the pipeline on example data on an SGE cluster
#qsub run.bash
# 2) execute the pipeline on example data locally
./run.bash &

# OR execute the pipeline on example data on an SGE cluster
qsub run.bash
```

The pipeline is written as Snakefiles, so it must be executed via [Snakemake](https://snakemake.readthedocs.io/en/stable/). See the [`run.bash` script](run.bash) for an example. Make sure to provide required input and options in the [config files](configs) before executing. The `classify.yaml` config file is currently configured to run the pipeline on the example data provided.
The pipeline is written as Snakefiles, so it must be executed via [Snakemake](https://snakemake.readthedocs.io). See the [`run.bash` script](run.bash) for an example. Make sure to provide required input and options in the [config files](configs) before executing. The config files are currently configured to run the pipeline on the example data provided.

### If this is your first time using Snakemake
We highly recommend that you run `snakemake --help` to learn about all of the options available to you. You might discover, for example, that calling Snakemake with the `-n -p -r` flags can be a helpful way to check that the pipeline will be executed correctly before you run it. This can also be a good way to familiarize yourself with the steps of the pipeline and their inputs and outputs (the latter of which are inputs to the first rule in each workflow -- i.e., the `all` rule).
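
For example, a dry run of the master pipeline from the repository root might look like this (a sketch using only standard Snakemake flags):
```
# show the jobs that would run (-n), print their shell commands (-p),
# and give the reason each job is scheduled (-r)
snakemake -n -p -r --use-conda
```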

By default, the pipeline will automatically delete some files it deems unnecessary (ex: unsorted copies of a BAM). You can opt to keep these files instead by providing the `--notemp` flag to Snakemake when executing each pipeline.
Another important thing to know is that Snakemake will not recreate output that it has already generated unless you ask it to. If a job fails or is interrupted, subsequent executions of Snakemake will simply pick up where the previous run left off. This also applies to files that *you* create and provide in place of the files it would have generated.

By default, the pipeline will automatically delete some files it deems unnecessary (ex: unsorted copies of a BAM). You can opt to keep these files instead by providing the `--notemp` flag to Snakemake when executing the pipeline.

# dependencies
We highly recommend you install [Snakemake via conda](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda) so that you can use the `--use-conda` flag when calling `snakemake` to let it automatically handle all dependencies of the pipelines. Otherwise, you must manually install the dependencies listed in the [env files](envs).
We highly recommend you install [Snakemake via conda](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda) so that you can use the `--use-conda` flag when calling `snakemake` to let it automatically handle all dependencies of the pipeline. Otherwise, you must manually install the dependencies listed in the [env files](envs).
Although currently untested, we also provide the option of executing the pipeline in a Docker container using [`singularity`](https://sylabs.io/guides/3.0/user-guide/quick_start.html#quick-installation-steps). Just provide Snakemake with the `--use-conda --use-singularity` flags when you execute it. Unlike with the previous method, having `conda` installed on your machine is not a requirement for this option, since it will be installed automatically in the Docker container.
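
As a sketch, the (currently untested) container-based invocation would be:
```
# run each rule inside the Docker container declared by the pipeline
snakemake --use-conda --use-singularity
```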

# files and directories

### [Snakefiles/Snakefile-prepare](Snakefiles/Snakefile-prepare)
A [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for preparing data for the classifier. It uses ATAC-seq FASTQ files to generate a tab-delimited table containing variant caller output for every site in open chromatin regions of the genome. The output of this pipeline can be fed into `Snakefiles/Snakefile-classify`.
### [Snakefile](Snakefile)
A [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for calling variants from a set of ATAC-seq reads. This pipeline is made up of two subworkflows:

1. the [`prepare` subworkflow](rules/prepare.smk), which prepares the reads for classification and
2. the [`classify` subworkflow](rules/classify.smk), which creates a VCF containing predicted variants

### [Snakefiles/Snakefile-classify](Snakefiles/Snakefile-classify)
A [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for training and testing the classifier. It uses the output of `Snakefiles/Snakefile-prepare`.
### [rules/](rules)
Snakemake rules for the `prepare` and `classify` subworkflows. You can either execute these subworkflows from the [master Snakefile](#snakefile) or individually as their own Snakefiles. See the [rules README](rules/README.md) for more information.

### [configs/](configs)
Config files that define options and input for the `prepare` and `classify` pipelines. You should start by filling these out.
Config files that define options and input for the pipeline and the `prepare` and `classify` subworkflows. If you want to predict variants from your own ATAC-seq data, you should start by filling out [the config file for the pipeline](/configs#configyaml).

### [callers/](callers)
Scripts for executing each of the variant callers which are used by the `prepare` pipeline. Small pipelines can be written for each caller by using a special naming convention. See the [caller README](callers/README.md) for more information.
Scripts for executing each of the variant callers which are used by the `prepare` subworkflow. Small pipelines can be written for each caller by using a special naming convention. See the [caller README](callers/README.md) for more information.

### [breakCA/](breakCA)
Scripts for calculating posterior probabilities for the existence of an insertion or deletion, which can be used when running the classifier. These scripts are an adaptation from [@Arkosen](https://github.com/Arkosen)'s [BreakCA code](https://www.biorxiv.org/content/10.1101/605642v1.abstract).
Scripts for calculating posterior probabilities for the existence of an insertion or deletion, which can be used as features for the classifier. These scripts are adapted from [@Arkosen](https://github.com/Arkosen)'s [BreakCA code](https://www.biorxiv.org/content/10.1101/605642v1.abstract).

### [scripts/](scripts)
Various scripts used by the pipeline. See the [script README](scripts/README.md) for more information.

### [run.bash](run.bash)
An example bash script for executing the `prepare` and `classify` pipelines on an SGE cluster using `snakemake` and `conda`.
An example bash script for executing the pipeline using `snakemake` and `conda`. Any parameters to this script are passed directly to `snakemake`.
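
For instance (an illustrative invocation; `--notemp` is the Snakemake flag described above):
```
# keep intermediate files by forwarding --notemp straight to snakemake
./run.bash --notemp &
```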
77 changes: 77 additions & 0 deletions Snakefile
@@ -0,0 +1,77 @@
import warnings

from snakemake.utils import min_version

##### set minimum snakemake version #####
min_version("5.18.0")

configfile: "configs/config.yaml"
configfile: "configs/callers.yaml"

container: "docker://continuumio/miniconda3:4.8.2"


def check_config(value, default=False, place=config):
    """ return the config value if it exists and is truthy, else the default """
    return place[value] if (value in place and place[value]) else default
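
# Illustrative usage (not part of the original Snakefile):
#   check_config('snp_callers')    -> config['snp_callers'] if present and truthy, else False
#   check_config('subset_callers') -> False when the key is absent, rather than a KeyError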

def read_samples():
    """
    Function to get names and dna fastq paths from a sample file
    specified in the configuration. Input file is expected to have 3
    columns: <unique_sample_id> <fastq1_path> <fastq2_path> or
    <unique_sample_id> <paired_bam_path> <bed_path>. Modify this function
    as needed to provide a dictionary of sample_id keys and either a tuple
    of strings: (fastq1, fastq2) OR a single string: paired_bam
    """
    samp_dict = {}
    with open(config['sample_file'], "r") as f:
        for line in f:
            words = line.strip().split("\t")
            if len(words) == 2:
                samp_dict[words[0]] = (words[1], "")
            elif len(words) == 3:
                samp_dict[words[0]] = (words[1], words[2])
            else:
                raise ValueError('Your sample_file is not formatted correctly. Make sure that it has the correct number of tab-separated columns for every row.')
    return samp_dict
SAMP = read_samples()
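
# A hypothetical sample file mixing both accepted layouts (tab-separated; the
# jurkat BED path is made up for illustration):
#   molt4   data/molt4.chr1.1.fq.gz   data/molt4.chr1.2.fq.gz
#   jurkat  data/jurkat.chr1.bam      data/jurkat.chr1.bed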

# the user can change config['SAMP_NAMES'] here (or define it in the config
# file) to contain whichever sample names they'd like to run the pipeline on
if 'SAMP_NAMES' not in config or not config['SAMP_NAMES']:
    config['SAMP_NAMES'] = list(SAMP.keys())
else:
    # double check that the user isn't asking for samples they haven't provided
    user_samps = set(config['SAMP_NAMES'])
    config['SAMP_NAMES'] = list(set(SAMP.keys()).intersection(user_samps))
    if len(config['SAMP_NAMES']) != len(user_samps):
        warnings.warn("Not all of the samples requested have input provided. Proceeding with as many samples as possible...")


rule all:
    input:
        expand(
            config['out']+"/classify/{sample}_{type}/final.vcf.gz",
            sample=config['SAMP_NAMES'],
            type=[i for i in ["snp", "indel"] if check_config(i+"_callers")]
        )
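
# e.g. with SAMP_NAMES == ['jurkat'], config['out'] == 'out', and both
# snp_callers and indel_callers set, the expand() above yields (illustratively)
# out/classify/jurkat_snp/final.vcf.gz and out/classify/jurkat_indel/final.vcf.gz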

# an internal variable we use to tell the other subworkflows not to import their configs
config['imported'] = True

include: "rules/prepare.smk"

config['predict'] = []
config['data'] = {}
for samp in config['SAMP_NAMES']:
    for i in ['snp', 'indel']:
        if check_config(i+"_callers"):
            sample_name = samp + "_" + i
            config['predict'].append(sample_name)
            config['data'][sample_name] = {
                'path': config['out']+"/merged_"+i+"/"+samp+"/final.tsv.gz",
                'merged': config['out']+"/merged_"+i+"/"+samp+"/merge.tsv.gz",
                'model': config[i+"_model"]
            }
config['out'] += "/classify"

include: "rules/classify.smk"
11 changes: 5 additions & 6 deletions callers/README.md
@@ -1,13 +1,13 @@
## What is a caller script?
An isolated script in which you can define the logic for running a variant caller. These scripts are called, in turn, by the `prepare` pipeline.
An isolated script in which you can define the logic for running a variant caller. These scripts are called, in turn, by the `prepare` subworkflow.

## Creating a caller script
If you'd like to add another variant caller to the pipeline, you can specify a script to execute it in the [callers dir](/callers).
The script should run the caller on a single sample.

Each caller script is referred to in the pipeline by a unique identifier.
Currently, the filename of the script is used as the identifier.
When writing the [config file](/configs/prepare.yaml) for the `prepare` pipeline, you must refer to callers by their identifier.
When writing the `prepare.yaml` and `callers.yaml` [config files](/configs) for the `prepare` subworkflow, you must refer to callers by their identifier.

### Caller script inputs
Each caller script is provided the following arguments respectively (any of which can be ignored):
@@ -16,16 +16,15 @@ Each caller script is provided the following arguments respectively (any of whic
- a reference genome
- the path to a directory in which the script must place any of its output
- the sample name
- the number of threads to use
- the path to the output directory of a script that prepares files for the caller script (only if it is relevant -- see [below](#caller-scripts-that-depend-on-other-scripts))
- user provided parameters (passed via the [config file](/configs/prepare.yaml))
- user provided parameters (passed via the [`callers.yaml` config file](/configs/callers.yaml))

### Caller script outputs
Each caller script must use the provided output directory path for all of its output.

Besides this requirement, there is only one other: that at the end of its execution, the script create a VCF file named by the caller identifier followed by a ".vcf" extension. The VCF file cannot be gzipped.
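
As a sketch, a minimal caller script honoring this contract might look like the following (argument positions follow the list above; `mycaller` and `some_variant_caller` are made-up names):
```
#!/usr/bin/env bash
genome="$3"      # the reference genome
output_dir="$4"  # all output must be placed in this directory
sample="$5"      # the sample name

# run the actual variant caller, then leave behind an uncompressed VCF
# named after the caller identifier (this script would be saved as "mycaller")
some_variant_caller --ref "$genome" --sample "$sample" --out "$output_dir"/mycaller.vcf
```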

You may optionally specify that the caller script outputs a TSV instead of a VCF. Just add an 'ext' attribute with a value of 'tsv' to the caller-specific parameters in the `prepare` [config file](/configs/prepare.yaml). The TSV file must be named similarly to the VCF, except that it must have a ".tsv" instead of a ".vcf" extension. The first two columns of the TSV must be "CHROM" and "POS", in that order.
You may optionally specify that the caller script outputs a TSV instead of a VCF. Just add an 'ext' attribute with a value of 'tsv' to the caller-specific parameters in the [`callers.yaml` config file](/configs/callers.yaml). The TSV file must be named similarly to the VCF, except that it must have a ".tsv" instead of a ".vcf" extension. The first two columns of the TSV must be "CHROM" and "POS", in that order.
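
For instance, a hypothetical caller `mycaller` with its 'ext' attribute set to 'tsv' might emit a `mycaller.tsv` beginning like this (columns beyond the first two are caller-specific; the values are made up):
```
CHROM	POS	PROB	DEPTH
chr1	10177	0.87	35
```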

### Caller scripts that depend on other scripts
Some caller scripts must depend on a different script for special input.
Expand All @@ -38,6 +37,6 @@ Then `gatk-snp` and `gatk-indel` can each extract SNVs and indels from the outpu
Note that in this example, `gatk` is not a caller script (and cannot be used as one) because it will not produce the correct output. Even if it did, it would be ignored.

By providing a dash character `-` in the caller identifier, the caller script (ex: `gatk-snp`) can communicate to the pipeline that it requires input(s) from another special script with the same filename but without the characters after the final dash (ex: `gatk-snp` => `gatk`).
The pipeline will run this separate script first but with the same parameters as the caller script, and the directory containing its output will be passed to the original caller script.
The pipeline will run this separate script first but with the same parameters as the caller script, and the directory containing its output will be passed as an argument to the original caller script. (For example, the directory containing the output of the `gatk` special script will be passed as a parameter to the `gatk-snp` caller script.)

By providing multiple dashes in your caller identifiers using this scheme, you may design complicated caller script hierarchies involving multiple levels of nesting. In a sense, you can create small pipelines within the pipeline.
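
As a sketch, the `gatk` scripts described above already form a one-level hierarchy:
```
callers/
├── gatk          # shared special script; its output directory is passed to the two below
├── gatk-snp      # extracts SNVs from the output of gatk
└── gatk-indel    # extracts indels from the output of gatk
```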
11 changes: 7 additions & 4 deletions callers/breakca
@@ -9,10 +9,13 @@ genome="$3"
[[ -z "$3" ]] && { echo "Parameter 3 is empty" 1>&2; exit 1; }
output_dir="$4"
[[ -z "$4" ]] && { echo "Parameter 4 is empty" 1>&2; exit 1; }
breakca_dir="$7"
[[ -z "$7" ]] && { echo "Parameter 7 is empty" 1>&2; exit 1; }
Rscript="$8"
if [[ -z "$8" ]]; then
breakca_dir="$6"
[[ -z "$6" ]] && { echo "Path to breakCA scripts directory was not specified. Using 'breakCA'." 1>&2 && breakca_dir="breakCA"; }
if [ ! -f "$breakca_dir"/get_reads_from_bam.R ] || [ ! -f "$breakca_dir"/count_reads_per_base.R ] || [ ! -f "$breakca_dir"/calculate_posterior.R ]; then
    echo "The breakCA scripts directory ($breakca_dir) does not contain the proper scripts." 1>&2; exit 1;
fi
Rscript="$7"
if [[ -z "$7" ]]; then
    echo "Rscript path not specified. Attempting to retrieve from conda env..." 1>&2
    if [[ -z "$CONDA_PREFIX" ]]; then
        echo "Error: not running in conda env" 1>&2; exit 1;
4 changes: 2 additions & 2 deletions callers/gatk-indel
@@ -9,8 +9,8 @@ genome="$3"
[[ -z "$3" ]] && { echo "Parameter 3 is empty" 1>&2; exit 1; }
output_dir="$4"
[[ -z "$4" ]] && { echo "Parameter 4 is empty" 1>&2; exit 1; }
gatk_dir="$7"
[[ -z "$7" ]] && { echo "Parameter 7 is empty" 1>&2; exit 1; }
gatk_dir="$6"
[[ -z "$6" ]] && { echo "Parameter 6 is empty" 1>&2; exit 1; }


gatk --java-options "-Xmx4g" SelectVariants \
4 changes: 2 additions & 2 deletions callers/gatk-snp
@@ -9,8 +9,8 @@ genome="$3"
[[ -z "$3" ]] && { echo "Parameter 3 is empty" 1>&2; exit 1; }
output_dir="$4"
[[ -z "$4" ]] && { echo "Parameter 4 is empty" 1>&2; exit 1; }
gatk_dir="$7"
[[ -z "$7" ]] && { echo "Parameter 7 is empty" 1>&2; exit 1; }
gatk_dir="$6"
[[ -z "$6" ]] && { echo "Parameter 6 is empty" 1>&2; exit 1; }


gatk --java-options "-Xmx4g" SelectVariants \
13 changes: 8 additions & 5 deletions callers/illumina
@@ -9,10 +9,13 @@ genome="$3"
[[ -z "$3" ]] && { echo "Parameter 3 is empty" 1>&2; exit 1; }
output_dir="$4"
[[ -z "$4" ]] && { echo "Parameter 4 is empty" 1>&2; exit 1; }
config="$7"
[[ -z "$7" ]] && { echo "Parameter 7 is empty" 1>&2; exit 1; }
manta_dir="$8"
if [[ -z "$8" ]]; then
config="$6"
[[ -z "$6" ]] && { config="configs/configManta.py.ini" && echo "Path to config file was not specified. Using '$config'." 1>&2; }
if [ ! -f "$config" ]; then
    echo "The manta config file ($config) could not be found. Check that it exists." 1>&2; exit 1;
fi
manta_dir="$7"
if [[ -z "$7" ]]; then
    echo "Manta directory not specified. Attempting to retrieve from conda env..." 1>&2
    if [[ -z "$CONDA_PREFIX" ]]; then
        echo "Error: not running in conda env" 1>&2; exit 1;
@@ -24,7 +27,7 @@ if [[ -z "$8" ]]; then
        manta_dir="$CONDA_PREFIX"/share/"$manta_dir"
    fi
else
    manta_dir="$8"
    manta_dir="$7"
fi
echo "Using manta dir: $manta_dir" 1>&2

4 changes: 2 additions & 2 deletions callers/illumina-manta
@@ -9,8 +9,8 @@ genome="$3"
[[ -z "$3" ]] && { echo "Parameter 3 is empty" 1>&2; exit 1; }
output_dir="$4"
[[ -z "$4" ]] && { echo "Parameter 4 is empty" 1>&2; exit 1; }
illumina_dir="$7"
[[ -z "$7" ]] && { echo "Parameter 7 is empty" 1>&2; exit 1; }
illumina_dir="$6"
[[ -z "$6" ]] && { echo "Parameter 6 is empty" 1>&2; exit 1; }


ln -sf "$(realpath --relative-to "$output_dir" "$illumina_dir"/results/variants/diploidSV.vcf)" "$output_dir/illumina-manta.vcf"