This pipeline processes validation datasets for testing how well ENCODE-rE2G predicts enhancer-gene pairs. It consists of four stages:
- Differential Expression Analysis: Uses SCEPTRE to identify significant enhancer-gene interactions in various CRISPRi Perturb-seq datasets
- Power Analysis: Simulates perturbation effects at various effect sizes to create a high-confidence set of non-significant element-gene pairs
- Data Integration: Incorporates various other held-out datasets and applies filtering based on genic features to eliminate confounders
- Benchmarking Output: Produces final held-out dataset for enhancer-gene prediction benchmarking
The final output file used for benchmarking can be found in results/combine_val_data_and_format/Final_Validation_Dataset.tsv.gz
ENCODE_Sceptre_Analysis/
├── config/
│ └── config.yml # Pipeline configuration parameters
├── workflow/
│ ├── Snakefile # Main workflow definition
│ ├── rules/ # Snakemake rule definitions
│ │ ├── sceptre_setup.smk
│ │ ├── sceptre_power_analysis.smk
│ │ ├── create_encode_output.smk
│ │ └── combine_val_data_and_format.smk
│ ├── scripts/ # Analysis scripts
│ │ ├── sceptre_setup/
│ │ ├── sceptre_power_analysis/
│ │ ├── encode_datasets/
│ │ └── combine_val_data_and_format/
│ └── envs/ # Conda environment definitions
├── resources/ # Input data and reference files
├── results/ # Pipeline outputs
├── Perturb_Seq_Test_Set_Preprocessing/ # Pre-processing to create pipeline inputs - README.md included in this folder
└── README.md
- Snakemake: 7.3.2
- Conda: 24.11.3
Dependencies are managed automatically through the conda environments defined in workflow/envs/. Snakemake will create and activate the appropriate environment for each step.
Key environments include:
- sceptre_env.yml: SCEPTRE differential expression analysis
- analyze_crispr_screen.yml: Single-cell analysis tools
- r_process_crispr_data.yml: Data processing and formatting
- Externally processed enhancer-gene pairs
  - DC TAP-seq data: resources/combine_val_data_and_format/DC_TAP_Seq_data.tsv
  - ENCODE-rE2G training dataset: resources/combine_val_data_and_format/EPCrisprBenchmark_ensemble_data_GRCh38.tsv
  - Other test datasets: resources/create_encode_output/ENCODE/EPCrisprBenchmark/
- Raw data input created in Perturb_Seq_Test_Set_Preprocessing (see the Data Processing section)
  - Klann et al. 2021: resources/sceptre_setup/Klann/
  - Morris et al. 2023: resources/sceptre_setup/Morrisv1/ and resources/sceptre_setup/Morrisv2/
  - Xie et al. 2019: resources/sceptre_setup/Xie/
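Because the pipeline depends on several input locations, it can help to confirm they exist before launching Snakemake. The following is a minimal sketch, not part of the repository: the function name `missing_inputs` is hypothetical, and the paths are copied from the lists above and assumed to be relative to the repository root.

```python
from pathlib import Path

# Input locations copied from the lists above, relative to the repository root.
EXPECTED_INPUTS = [
    "resources/combine_val_data_and_format/DC_TAP_Seq_data.tsv",
    "resources/combine_val_data_and_format/EPCrisprBenchmark_ensemble_data_GRCh38.tsv",
    "resources/create_encode_output/ENCODE/EPCrisprBenchmark",
    "resources/sceptre_setup/Klann",
    "resources/sceptre_setup/Morrisv1",
    "resources/sceptre_setup/Morrisv2",
    "resources/sceptre_setup/Xie",
]


def missing_inputs(repo_root="."):
    """Return the expected input paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_INPUTS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing inputs:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected inputs are present.")
```

The check only tests that each path exists; it does not validate file contents or the layout inside each dataset directory.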
Before running the main pipeline, the raw CRISPR screen data must be preprocessed. This step is optional if you use the preprocessed results available on SYNAPSE.
See the comprehensive guide in: Perturb_Seq_Test_Set_Preprocessing/README.md
This preprocessing step includes:
- Converting raw sequencing data to count matrices
- Filtering for high-confidence guides and creating a guide annotation file
- Generating metadata files
- Clone the repository:
git clone https://github.com/jamesgalante/ENCODE_Test_Dataset_Analysis.git
cd ENCODE_Test_Dataset_Analysis
- Complete data preprocessing (see Perturb_Seq_Test_Set_Preprocessing/README.md)
- Alternatively, download the preprocessed data from [SYNAPSE](INCLUDE LINK TO SYNAPSE)
- Run the complete pipeline:
# For HPC with SLURM
snakemake all
# For local execution (not recommended due to size)
snakemake --use-conda --cores 8 all
These configuration parameters can also be passed as flags (e.g. --use-conda) and often vary with the HPC setup, but the profile used to develop this pipeline is provided here. Store it in a Snakemake profile config file (e.g., ~/.config/snakemake/slurm_profile/config.yaml):
jobs: 500
cluster: slurm
use-conda: true
notemp: true
default-resources:
- runtime="13h"
- mem="32G"
# Add your specific SLURM configuration:
# - slurm_account=your_account
# - slurm_partition=your_partition
# - slurm_extra="--nice"
Then run:
snakemake --profile slurm_profile all
- Downloads genome annotation files
- Pairs all tested elements to any gene within 1Mb
- Creates SCEPTRE input objects for each dataset
- Runs SCEPTRE differential expression analysis
- Performs power simulations at multiple effect sizes (2%, 3%, 5%, 10%, 15%, 20%, 25%, 50%)
- Estimates statistical power for detecting enhancer-gene interactions
- Filters based on genic features
- Integrates other test datasets from resources/create_encode_output/ENCODE/EPCrisprBenchmark/
- Integrates DC TAP-seq data
- Removes training set overlaps
- Resolves duplicates between datasets
- Produces final held-out dataset
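The step "Pairs all tested elements to any gene within 1Mb" can be illustrated with a short sketch. This is not the pipeline's implementation: the `Element` and `Gene` records and the function `pair_elements_to_genes` are hypothetical, and measuring distance from the element midpoint to the gene TSS is an assumption made here for illustration.

```python
from collections import namedtuple

# Hypothetical record types for this sketch only.
Element = namedtuple("Element", ["name", "chrom", "start", "end"])
Gene = namedtuple("Gene", ["name", "chrom", "tss"])

MAX_DISTANCE = 1_000_000  # the 1 Mb window named in the step list


def pair_elements_to_genes(elements, genes, max_distance=MAX_DISTANCE):
    """Return (element, gene, distance) tuples for same-chromosome pairs in range.

    Distance is measured from the element midpoint to the gene TSS, which is
    an assumption of this sketch rather than the pipeline's documented rule.
    """
    # Group genes by chromosome so each element is compared only to local genes.
    genes_by_chrom = {}
    for gene in genes:
        genes_by_chrom.setdefault(gene.chrom, []).append(gene)

    pairs = []
    for element in elements:
        midpoint = (element.start + element.end) // 2
        for gene in genes_by_chrom.get(element.chrom, []):
            distance = abs(gene.tss - midpoint)
            if distance <= max_distance:
                pairs.append((element.name, gene.name, distance))
    return pairs


# Toy coordinates, invented for the example.
elements = [Element("e1", "chr1", 1_000_000, 1_000_500)]
genes = [
    Gene("GENE_A", "chr1", 1_500_000),  # 499,750 bp from the midpoint: kept
    Gene("GENE_B", "chr1", 2_500_000),  # about 1.5 Mb away: dropped
    Gene("GENE_C", "chr2", 1_000_000),  # different chromosome: dropped
]
print(pair_elements_to_genes(elements, genes))
# → [('e1', 'GENE_A', 499750)]
```

The nested loop is adequate for a toy example; for genome-scale inputs, an interval-based tool would be a more practical choice.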
- Final Validation Dataset: results/combine_val_data_and_format/Final_Validation_Dataset.tsv.gz
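Once the pipeline finishes, a quick way to inspect the final file is to read its header and count its rows. The sketch below uses only the Python standard library. It assumes the file is a gzip-compressed, tab-separated table with a single header row; the function name `summarize_tsv_gz` is hypothetical, and no column names are assumed.

```python
import csv
import gzip
from pathlib import Path

FINAL_DATASET = "results/combine_val_data_and_format/Final_Validation_Dataset.tsv.gz"


def summarize_tsv_gz(path):
    """Return (column_names, row_count) for a gzip-compressed TSV with a header row."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])  # empty list if the file has no lines
        row_count = sum(1 for _ in reader)
    return header, row_count


if __name__ == "__main__":
    # Only summarize when the pipeline output is actually present.
    if Path(FINAL_DATASET).exists():
        columns, n_rows = summarize_tsv_gz(FINAL_DATASET)
        print(f"{n_rows} rows, columns: {columns}")
```

Run it from the repository root so the relative path resolves.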