convert final pipeline output to vcf #16

Merged: 33 commits, Jun 27, 2020

Commits:
d5d5591  start 2vcf script (aryarm, Mar 16, 2020)
c419828  allow statistics.py to output threshold values (aryarm, May 7, 2020)
aded5fb  use thresholds in Snakefile-classify (aryarm, May 7, 2020)
13cecc1  create a script for converting the outputs of the pipeline to a VCF (… (aryarm, Jun 4, 2020)
e527b1c  fix indexing of output vcf in 2vcf.py (aryarm, Jun 4, 2020)
89b23e3  suppress mlr2 warning (see issue #10) (aryarm, Jun 6, 2020)
57c120a  let snakemake use docker and singularity if installed (resolves #8) (aryarm, Jun 10, 2020)
2490ab8  remove {threads} from the prepare pipeline. every rule gets one thread (aryarm, Jun 10, 2020)
f1d330c  add default parameters to the caller scripts that take params (aryarm, Jun 11, 2020)
39c11b1  update all dependencies and freeze them at the versions we used (reso… (aryarm, Jun 11, 2020)
e998743  simplify config checks using a custom function (aryarm, Jun 11, 2020)
bd8ff56  allow bam files as input to prepare pipeline (aryarm, Jun 11, 2020)
796c128  upgrade snakemake version to 5.18.1 since 5.19 is too unstable still (aryarm, Jun 11, 2020)
5911876  remove -S param from samtools; it didn't do anything anyway (aryarm, Jun 11, 2020)
0435ec0  make more config files (aryarm, Jun 15, 2020)
dcde6a9  rewrite pipelines so they will execute the example data one after the… (aryarm, Jun 16, 2020)
9704ce2  fix vcf header coming from 2vcf.py (see issue #6) (aryarm, Jun 16, 2020)
9ca4405  create a snakefile for executing both the prepare and classify pipeli… (aryarm, Jun 16, 2020)
a5ba3ed  apparently, ranger 0.11.5 doesn't exist. oops (aryarm, Jun 17, 2020)
2ef0a0d  explain config options more thoroughly (aryarm, Jun 18, 2020)
ba9b6a2  update vcf script with old changes (aryarm, Jun 18, 2020)
1bd6ee7  clarify that we haven't tried using --use-singularity (aryarm, Jun 18, 2020)
a28cea3  allow the user to more easily exclude callers in the classify subwork… (aryarm, Jun 24, 2020)
ab8bfde  remove breakca, manta, and delly from defaults b/c didn't improve f-b… (aryarm, Jun 25, 2020)
90942de  silence error that occurs when subset_callers is not provided (aryarm, Jun 25, 2020)
7ba740a  allow bed files as input. resolves #17 (aryarm, Jun 26, 2020)
f76f6fa  clarify .bam requirements in config files (aryarm, Jun 27, 2020)
5738275  fix indentation in master workflow and prepare subworkflow (aryarm, Jun 27, 2020)
c251533  add local (non-SGE) execution to run.bash (aryarm, Jun 27, 2020)
3a80fd1  rename breakca to varca in the final output (aryarm, Jun 27, 2020)
f9dc1e7  check snp or indel in 2vcf.py (aryarm, Jun 27, 2020)
8ea04e6  add linear regression model to 2vcf.py for QUAL recalibration (aryarm, Jun 27, 2020)
309374d  Merge branch 'master' into 2vcf (aryarm, Jun 27, 2020)
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,3 +1,3 @@
out
.snakemake/
data/
data
55 changes: 34 additions & 21 deletions README.md
@@ -1,10 +1,10 @@
[![Snakemake](https://img.shields.io/badge/snakemake-≥5.5.0-brightgreen.svg?style=flat-square)](https://snakemake.bitbucket.io)
[![Snakemake](https://img.shields.io/badge/snakemake-5.18.0-brightgreen.svg?style=flat-square)](https://snakemake.readthedocs.io/)
[![License](https://img.shields.io/apm/l/vim-mode.svg)](LICENSE)

# varCA
A pipeline for running an ensemble of variant callers to predict variants in ATAC-seq reads.

The entire pipeline is made up of two smaller pipelines. The `prepare` pipeline calls each variant caller and prepares the resulting data for use by the `classify` pipeline, which runs the ensemble classifier to predict the existence of variants at each site.
The entire pipeline is made up of two smaller subworkflows. The `prepare` subworkflow calls each variant caller and prepares the resulting data for use by the `classify` subworkflow, which runs the ensemble classifier to predict the existence of variants at each site.

# download
Execute the following commands or download the [latest release](https://github.com/aryam7/varCA/releases/latest) manually.
@@ -19,51 +19,64 @@ wget -O- -q https://github.com/aryam7/varCA/releases/latest/download/data.tar.gz
```
The example data includes the following files:

- `samples.tsv` - An example samples file, required for the `predict` pipeline.
- `samples.tsv` - An example samples file; required for the master pipeline and the `predict` subworkflow
- `snp.rda` - A trained RF model for classifying SNVs
- `indel.rda` - A trained RF model for classifying indels
- `snp.tsv.gz` - Prepared, example SNV training data
- `indel.tsv.gz` - Prepared, example indel training data
- `even-indels.tsv.gz` - Prepared, example indel test data
- `hg38.chr1.*` - Chromosome 1 of the hg38 reference genome and its index files
- `jurkat.chr1.bam` - BAM file from ATAC-seq on chromosome 1 from the Jurkat sample in GSE129086
- `jurkat.chr1.bam.bai` - Index for the Jurkat BAM file
- `molt4.chr1.*.fq.gz` - FASTQ files from paired ATAC-seq reads for chromosome 1 from the Molt-4 sample in GSE129086

# execution
On example data:
```
# install snakemake via conda (if not already installed)
conda install -c bioconda -c conda-forge 'snakemake>=5.5.0'
# 1) install snakemake via conda (if not already installed)
conda install -c bioconda -c conda-forge 'snakemake==5.18.0'

# execute the pipeline on example data locally
snakemake -s Snakefiles/Snakefile-classify --use-conda
# execute the pipeline on example data on an SGE cluster
#qsub run.bash
# 2) execute the pipeline on example data locally
./run.bash &

# OR execute the pipeline on example data on an SGE cluster
qsub run.bash
```

The pipeline is written as Snakefiles, so it must be executed via [Snakemake](https://snakemake.readthedocs.io/en/stable/). See the [`run.bash` script](run.bash) for an example. Make sure to provide required input and options in the [config files](configs) before executing. The `classify.yaml` config file is currently configured to run the pipeline on the example data provided.
The pipeline is written as Snakefiles, so it must be executed via [Snakemake](https://snakemake.readthedocs.io). See the [`run.bash` script](run.bash) for an example. Make sure to provide required input and options in the [config files](configs) before executing. The config files are currently configured to run the pipeline on the example data provided.

### If this is your first time using Snakemake
We highly recommend that you run `snakemake --help` to learn about all of the options available to you. You might discover, for example, that calling Snakemake with the `-n -p -r` flags can be a helpful way to check that the pipeline will be executed correctly before you run it. This can also be a good way to familiarize yourself with the steps of the pipeline and their inputs and outputs (the latter of which are inputs to the first rule in each workflow -- i.e., the `all` rule).
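
For example, a dry run of the master pipeline from the repository root might look like this (a sketch using only standard Snakemake flags):
```
# show the jobs that would run (-n), print their shell commands (-p),
# and give the reason each job is scheduled (-r)
snakemake -n -p -r --use-conda
```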

By default, the pipeline will automatically delete some files it deems unnecessary (ex: unsorted copies of a BAM). You can opt to keep these files instead by providing the `--notemp` flag to Snakemake when executing each pipeline.
Another important thing to know is that Snakemake will not recreate output that it has already generated unless you ask it to. If a job fails or is interrupted, subsequent executions of Snakemake will simply pick up where the previous run left off. This also applies to files that *you* create and provide in place of the files it would have generated.

By default, the pipeline will automatically delete some files it deems unnecessary (ex: unsorted copies of a BAM). You can opt to keep these files instead by providing the `--notemp` flag to Snakemake when executing the pipeline.

# dependencies
We highly recommend you install [Snakemake via conda](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda) so that you can use the `--use-conda` flag when calling `snakemake` to let it automatically handle all dependencies of the pipelines. Otherwise, you must manually install the dependencies listed in the [env files](envs).
We highly recommend you install [Snakemake via conda](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html#installation-via-conda) so that you can use the `--use-conda` flag when calling `snakemake` to let it automatically handle all dependencies of the pipeline. Otherwise, you must manually install the dependencies listed in the [env files](envs).
Although currently untested, we also provide the option of executing the pipeline in a Docker container using [`singularity`](https://sylabs.io/guides/3.0/user-guide/quick_start.html#quick-installation-steps). Just provide Snakemake with the `--use-conda --use-singularity` flags when you execute it. Unlike with the previous method, having `conda` installed on your machine is not a requirement for this option, since it will be installed automatically in the Docker container.
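
As a sketch, the (currently untested) container-based invocation would be:
```
# run each rule inside the Docker container declared by the pipeline
snakemake --use-conda --use-singularity
```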

# files and directories

### [Snakefiles/Snakefile-prepare](Snakefiles/Snakefile-prepare)
A [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for preparing data for the classifier. It uses ATAC-seq FASTQ files to generate a tab-delimited table containing variant caller output for every site in open chromatin regions of the genome. The output of this pipeline can be fed into `Snakefiles/Snakefile-classify`.
### [Snakefile](Snakefile)
A [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for calling variants from a set of ATAC-seq reads. This pipeline is made up of two subworkflows:

1. the [`prepare` subworkflow](rules/prepare.smk), which prepares the reads for classification and
2. the [`classify` subworkflow](rules/classify.smk), which creates a VCF containing predicted variants

### [Snakefiles/Snakefile-classify](Snakefiles/Snakefile-classify)
A [Snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline for training and testing the classifier. It uses the output of `Snakefiles/Snakefile-prepare`.
### [rules/](rules)
Snakemake rules for the `prepare` and `classify` subworkflows. You can either execute these subworkflows from the [master Snakefile](#snakefile) or individually as their own Snakefiles. See the [rules README](rules/README.md) for more information.

### [configs/](configs)
Config files that define options and input for the `prepare` and `classify` pipelines. You should start by filling these out.
Config files that define options and input for the pipeline and the `prepare` and `classify` subworkflows. If you want to predict variants from your own ATAC-seq data, you should start by filling out [the config file for the pipeline](/configs#configyaml).

### [callers/](callers)
Scripts for executing each of the variant callers which are used by the `prepare` pipeline. Small pipelines can be written for each caller by using a special naming convention. See the [caller README](callers/README.md) for more information.
Scripts for executing each of the variant callers which are used by the `prepare` subworkflow. Small pipelines can be written for each caller by using a special naming convention. See the [caller README](callers/README.md) for more information.

### [breakCA/](breakCA)
Scripts for calculating posterior probabilities for the existence of an insertion or deletion, which can be used when running the classifier. These scripts are an adaptation from [@Arkosen](https://github.com/Arkosen)'s [BreakCA code](https://www.biorxiv.org/content/10.1101/605642v1.abstract).
Scripts for calculating posterior probabilities for the existence of an insertion or deletion, which can be used as features for the classifier. These scripts are adapted from [@Arkosen](https://github.com/Arkosen)'s [BreakCA code](https://www.biorxiv.org/content/10.1101/605642v1.abstract).

### [scripts/](scripts)
Various scripts used by the pipeline. See the [script README](scripts/README.md) for more information.

### [run.bash](run.bash)
An example bash script for executing the `prepare` and `classify` pipelines on an SGE cluster using `snakemake` and `conda`.
An example bash script for executing the pipeline using `snakemake` and `conda`. Any parameters to this script are passed directly to `snakemake`.
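
For instance (an illustrative invocation; `--notemp` is the Snakemake flag described above):
```
# keep intermediate files by forwarding --notemp straight to snakemake
./run.bash --notemp &
```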
77 changes: 77 additions & 0 deletions Snakefile
@@ -0,0 +1,77 @@
import warnings

from snakemake.utils import min_version

##### set minimum snakemake version #####
min_version("5.18.0")

configfile: "configs/config.yaml"
configfile: "configs/callers.yaml"

container: "docker://continuumio/miniconda3:4.8.2"


def check_config(value, default=False, place=config):
    """ return the config value if it exists and is truthy, else the default """
    return place[value] if (value in place and place[value]) else default
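
# Illustrative usage (not part of the original Snakefile):
#   check_config('snp_callers')    -> config['snp_callers'] if present and truthy, else False
#   check_config('subset_callers') -> False when the key is absent, rather than a KeyError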

def read_samples():
    """
    Function to get names and dna fastq paths from a sample file
    specified in the configuration. Input file is expected to have 3
    columns: <unique_sample_id> <fastq1_path> <fastq2_path> or
    <unique_sample_id> <paired_bam_path> <bed_path>. Modify this function
    as needed to provide a dictionary of sample_id keys and either a tuple
    of strings: (fastq1, fastq2) OR a single string: paired_bam
    """
    samp_dict = {}
    with open(config['sample_file'], "r") as f:
        for line in f:
            words = line.strip().split("\t")
            if len(words) == 2:
                samp_dict[words[0]] = (words[1], "")
            elif len(words) == 3:
                samp_dict[words[0]] = (words[1], words[2])
            else:
                raise ValueError('Your sample_file is not formatted correctly. Make sure that it has the correct number of tab-separated columns for every row.')
    return samp_dict
SAMP = read_samples()
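
# A hypothetical sample file mixing both accepted layouts (tab-separated; the
# jurkat BED path is made up for illustration):
#   molt4   data/molt4.chr1.1.fq.gz   data/molt4.chr1.2.fq.gz
#   jurkat  data/jurkat.chr1.bam      data/jurkat.chr1.bed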

# the user can change config['SAMP_NAMES'] here (or define it in the config
# file) to contain whichever sample names they'd like to run the pipeline on
if 'SAMP_NAMES' not in config or not config['SAMP_NAMES']:
    config['SAMP_NAMES'] = list(SAMP.keys())
else:
    # double check that the user isn't asking for samples they haven't provided
    user_samps = set(config['SAMP_NAMES'])
    config['SAMP_NAMES'] = list(set(SAMP.keys()).intersection(user_samps))
    if len(config['SAMP_NAMES']) != len(user_samps):
        warnings.warn("Not all of the samples requested have input provided. Proceeding with as many samples as possible...")


rule all:
    input:
        expand(
            config['out']+"/classify/{sample}_{type}/final.vcf.gz",
            sample=config['SAMP_NAMES'],
            type=[i for i in ["snp", "indel"] if check_config(i+"_callers")]
        )
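
# e.g. with SAMP_NAMES == ['jurkat'], config['out'] == 'out', and both
# snp_callers and indel_callers set, the expand() above yields (illustratively)
# out/classify/jurkat_snp/final.vcf.gz and out/classify/jurkat_indel/final.vcf.gz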

# an internal variable we use to tell the other subworkflows not to import their configs
config['imported'] = True

include: "rules/prepare.smk"

config['predict'] = []
config['data'] = {}
for samp in config['SAMP_NAMES']:
    for i in ['snp', 'indel']:
        if check_config(i+"_callers"):
            sample_name = samp + "_" + i
            config['predict'].append(sample_name)
            config['data'][sample_name] = {
                'path': config['out']+"/merged_"+i+"/"+samp+"/final.tsv.gz",
                'merged': config['out']+"/merged_"+i+"/"+samp+"/merge.tsv.gz",
                'model': config[i+"_model"]
            }
config['out'] += "/classify"

include: "rules/classify.smk"
11 changes: 5 additions & 6 deletions callers/README.md
@@ -1,13 +1,13 @@
## What is a caller script?
An isolated script in which you can define the logic for running a variant caller. These scripts are called, in turn, by the `prepare` pipeline.
An isolated script in which you can define the logic for running a variant caller. These scripts are called, in turn, by the `prepare` subworkflow.

## Creating a caller script
If you'd like to add another variant caller to the pipeline, you can specify a script to execute it in the [callers dir](/callers).
The script should run the caller on a single sample.

Each caller script is referred to in the pipeline by a unique identifier.
Currently, the filename of the script is used as the identifier.
When writing the [config file](/configs/prepare.yaml) for the `prepare` pipeline, you must refer to callers by their identifier.
When writing the `prepare.yaml` and `callers.yaml` [config files](/configs) for the `prepare` subworkflow, you must refer to callers by their identifier.

### Caller script inputs
Each caller script is provided the following arguments respectively (any of which can be ignored):
@@ -16,16 +16,15 @@ Each caller script is provided the following arguments respectively (any of whic
- a reference genome
- the path to a directory in which the script must place any of its output
- the sample name
- the number of threads to use
- the path to the output directory of a script that prepares files for the caller script (only if it is relevant -- see [below](#caller-scripts-that-depend-on-other-scripts))
- user provided parameters (passed via the [config file](/configs/prepare.yaml))
- user provided parameters (passed via the [`callers.yaml` config file](/configs/callers.yaml))

### Caller script outputs
Each caller script must use the provided output directory path for all of its output.

Besides this requirement, there is only one other: that at the end of its execution, the script create a VCF file named by the caller identifier followed by a ".vcf" extension. The VCF file cannot be gzipped.
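
As a sketch, a minimal caller script honoring this contract might look like the following (argument positions follow the list above; `mycaller` and `some_variant_caller` are made-up names):
```
#!/usr/bin/env bash
genome="$3"      # the reference genome
output_dir="$4"  # all output must be placed in this directory
sample="$5"      # the sample name

# run the actual variant caller, then leave behind an uncompressed VCF
# named after the caller identifier (this script would be saved as "mycaller")
some_variant_caller --ref "$genome" --sample "$sample" --out "$output_dir"/mycaller.vcf
```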

You may optionally specify that the caller script outputs a TSV instead of a VCF. Just add an 'ext' attribute with a value of 'tsv' to the caller-specific parameters in the `prepare` [config file](/configs/prepare.yaml). The TSV file must be named similarly to the VCF, except that it must have a ".tsv" instead of a ".vcf" extension. The first two columns of the TSV must be "CHROM" and "POS", in that order.
You may optionally specify that the caller script outputs a TSV instead of a VCF. Just add an 'ext' attribute with a value of 'tsv' to the caller-specific parameters in the [`callers.yaml` config file](/configs/callers.yaml). The TSV file must be named similarly to the VCF, except that it must have a ".tsv" instead of a ".vcf" extension. The first two columns of the TSV must be "CHROM" and "POS", in that order.
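
For instance, a hypothetical caller `mycaller` with its 'ext' attribute set to 'tsv' might emit a `mycaller.tsv` beginning like this (columns beyond the first two are caller-specific; the values are made up):
```
CHROM	POS	PROB	DEPTH
chr1	10177	0.87	35
```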

### Caller scripts that depend on other scripts
Some caller scripts must depend on a different script for special input.
Expand All @@ -38,6 +37,6 @@ Then `gatk-snp` and `gatk-indel` can each extract SNVs and indels from the outpu
Note that in this example, `gatk` is not a caller script (and cannot be used as one) because it will not produce the correct output. Even if it did, it would be ignored.

By providing a dash character `-` in the caller identifier, the caller script (ex: `gatk-snp`) can communicate to the pipeline that it requires input(s) from another special script with the same filename but without the characters after the final dash (ex: `gatk-snp` => `gatk`).
The pipeline will run this separate script first but with the same parameters as the caller script, and the directory containing its output will be passed to the original caller script.
The pipeline will run this separate script first but with the same parameters as the caller script, and the directory containing its output will be passed as an argument to the original caller script. (For example, the directory containing the output of the `gatk` special script will be passed as a parameter to the `gatk-snp` caller script.)

By providing multiple dashes in your caller identifiers using this scheme, you may design complicated caller script hierarchies involving multiple levels of nesting. In a sense, you can create small pipelines within the pipeline.
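
As a sketch, the `gatk` scripts described above already form a one-level hierarchy:
```
callers/
├── gatk          # shared special script; its output directory is passed to the two below
├── gatk-snp      # extracts SNVs from the output of gatk
└── gatk-indel    # extracts indels from the output of gatk
```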
11 changes: 7 additions & 4 deletions callers/breakca
@@ -9,10 +9,13 @@ genome="$3"
[[ -z "$3" ]] && { echo "Parameter 3 is empty" 1>&2; exit 1; }
output_dir="$4"
[[ -z "$4" ]] && { echo "Parameter 4 is empty" 1>&2; exit 1; }
breakca_dir="$7"
[[ -z "$7" ]] && { echo "Parameter 7 is empty" 1>&2; exit 1; }
Rscript="$8"
if [[ -z "$8" ]]; then
breakca_dir="$6"
[[ -z "$6" ]] && { echo "Path to breakCA scripts directory was not specified. Using 'breakCA'." 1>&2 && breakca_dir="breakCA"; }
if [ ! -f "$breakca_dir"/get_reads_from_bam.R ] || [ ! -f "$breakca_dir"/count_reads_per_base.R ] || [ ! -f "$breakca_dir"/calculate_posterior.R ]; then
    echo "The breakCA scripts directory ($breakca_dir) does not contain the proper scripts." 1>&2; exit 1;
fi
Rscript="$7"
if [[ -z "$7" ]]; then
    echo "Rscript path not specified. Attempting to retrieve from conda env..." 1>&2
    if [[ -z "$CONDA_PREFIX" ]]; then
        echo "Error: not running in conda env" 1>&2; exit 1;
4 changes: 2 additions & 2 deletions callers/gatk-indel
@@ -9,8 +9,8 @@ genome="$3"
[[ -z "$3" ]] && { echo "Parameter 3 is empty" 1>&2; exit 1; }
output_dir="$4"
[[ -z "$4" ]] && { echo "Parameter 4 is empty" 1>&2; exit 1; }
gatk_dir="$7"
[[ -z "$7" ]] && { echo "Parameter 7 is empty" 1>&2; exit 1; }
gatk_dir="$6"
[[ -z "$6" ]] && { echo "Parameter 6 is empty" 1>&2; exit 1; }


gatk --java-options "-Xmx4g" SelectVariants \
4 changes: 2 additions & 2 deletions callers/gatk-snp
@@ -9,8 +9,8 @@ genome="$3"
[[ -z "$3" ]] && { echo "Parameter 3 is empty" 1>&2; exit 1; }
output_dir="$4"
[[ -z "$4" ]] && { echo "Parameter 4 is empty" 1>&2; exit 1; }
gatk_dir="$7"
[[ -z "$7" ]] && { echo "Parameter 7 is empty" 1>&2; exit 1; }
gatk_dir="$6"
[[ -z "$6" ]] && { echo "Parameter 6 is empty" 1>&2; exit 1; }


gatk --java-options "-Xmx4g" SelectVariants \
13 changes: 8 additions & 5 deletions callers/illumina
@@ -9,10 +9,13 @@ genome="$3"
[[ -z "$3" ]] && { echo "Parameter 3 is empty" 1>&2; exit 1; }
output_dir="$4"
[[ -z "$4" ]] && { echo "Parameter 4 is empty" 1>&2; exit 1; }
config="$7"
[[ -z "$7" ]] && { echo "Parameter 7 is empty" 1>&2; exit 1; }
manta_dir="$8"
if [[ -z "$8" ]]; then
config="$6"
[[ -z "$6" ]] && { config="configs/configManta.py.ini" && echo "Path to config file was not specified. Using '$config'." 1>&2; }
if [ ! -f "$config" ]; then
    echo "The manta config file ($config) could not be found. Check that it exists." 1>&2; exit 1;
fi
manta_dir="$7"
if [[ -z "$7" ]]; then
    echo "Manta directory not specified. Attempting to retrieve from conda env..." 1>&2
    if [[ -z "$CONDA_PREFIX" ]]; then
        echo "Error: not running in conda env" 1>&2; exit 1;
@@ -24,7 +27,7 @@ if [[ -z "$8" ]]; then
        manta_dir="$CONDA_PREFIX"/share/"$manta_dir"
    fi
else
    manta_dir="$8"
    manta_dir="$7"
fi
echo "Using manta dir: $manta_dir" 1>&2

4 changes: 2 additions & 2 deletions callers/illumina-manta
@@ -9,8 +9,8 @@ genome="$3"
[[ -z "$3" ]] && { echo "Parameter 3 is empty" 1>&2; exit 1; }
output_dir="$4"
[[ -z "$4" ]] && { echo "Parameter 4 is empty" 1>&2; exit 1; }
illumina_dir="$7"
[[ -z "$7" ]] && { echo "Parameter 7 is empty" 1>&2; exit 1; }
illumina_dir="$6"
[[ -z "$6" ]] && { echo "Parameter 6 is empty" 1>&2; exit 1; }


ln -sf "$(realpath --relative-to "$output_dir" "$illumina_dir"/results/variants/diploidSV.vcf)" "$output_dir/illumina-manta.vcf"