This repository provides scripts for copy number variation (CNV) analysis of RNA-seq data. It currently supports the human genome and single-end bulk RNA-seq data.
Install Homebrew
xcode-select --install
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Set up pyenv
brew install pyenv
pyenv install 3.8.2
pyenv global 3.8.2
echo -e 'if command -v pyenv 1>/dev/null 2>&1; then\n eval "$(pyenv init -)"\nfi' >> ~/.zshrc
Install cutadapt
pip install cutadapt
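Download fetchChromSizes (the URL below is the Linux x86_64 build; on macOS, download it from the macOSX.x86_64 directory of the same server instead)
wget https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/fetchChromSizes
chmod +x fetchChromSizes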
Download trim_galore
chmod +x trim_galore
ln -s /path/to/trim_galore /usr/local/bin/trim_galore
Download FastQC (select the .zip archive, even on macOS)
chmod +x fastqc
ln -s /path/to/fastqc /usr/local/bin/fastqc
Download STAR
chmod +x STAR
ln -s /path/to/STAR /usr/local/bin/STAR
Download BAFExtract
make BAFExtract
chmod +x BAFExtract
ln -s /path/to/BAFExtract /usr/local/bin/BAFExtract
Download samtools
make
make install
chmod +x samtools
ln -s /path/to/samtools /usr/local/bin/samtools
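Once the command-line tools are linked, it can help to confirm that each one resolves on the PATH before moving on to the R setup. The following is a minimal sketch, not part of this repository's scripts; the helper name check_tools is hypothetical, and the tool list simply mirrors the programs installed above.

```shell
# Report which of the given commands are missing from PATH.
# Prints one line per tool and returns non-zero if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok       $tool"
    else
      echo "MISSING  $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Tool names as installed in the steps above.
check_tools cutadapt trim_galore fastqc STAR BAFExtract samtools \
  || echo "Install or link the missing tools before running the pipeline."
```

fetchChromSizes is left out of the list because the steps above do not link it into /usr/local/bin; add it if you place it on your PATH.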
Download and install R
Download and install RStudio
Install or update BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()
Install CaSpER dependencies
BiocManager::install(c('HMMcopy', 'GenomeGraphs', 'biomaRt', 'limma', 'GO.db', 'org.Hs.eg.db', 'GOstats'))
Install devtools
install.packages("devtools")
Windows users will need to download and install Rtools
Install CaSpER
require(devtools)
install_github("akdess/CaSpER")
The pipeline assumes the following files are downloaded into the project folder.
Download hg38 genome sequence in FASTA format
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
Download the hg38 gene annotation GTF file
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ensGene.gtf.gz
gunzip hg38.ensGene.gtf.gz
Download cytoband and centromere information
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/cytoBand.txt.gz
gunzip cytoBand.txt.gz
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/cytoBand.txt.gz" | gunzip -c | grep acen > centromere.tab
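To see what the centromere filter keeps, the sketch below applies the same idea to a small made-up file. It assumes the five-column cytoBand layout (chromosome, start, end, band name, stain); the file names toy_cytoBand.txt and toy_centromere.tab and all coordinates are hypothetical, chosen only for illustration. The per-chromosome summary at the end is for inspection and is not something the pipeline requires.

```shell
# Hypothetical rows in the cytoBand layout: chrom, start, end, band name, stain.
{
  printf 'chr1\t0\t1000\tp36.33\tgneg\n'
  printf 'chr1\t1000\t2000\tp11.1\tacen\n'
  printf 'chr1\t2000\t3000\tq11\tacen\n'
  printf 'chr1\t3000\t4000\tq12\tgvar\n'
  printf 'chr2\t0\t500\tp25.3\tgneg\n'
  printf 'chr2\t500\t900\tp11.1\tacen\n'
  printf 'chr2\t900\t1300\tq11.1\tacen\n'
} > toy_cytoBand.txt

# Keep only centromeric bands, matching on the stain column rather than the whole line.
awk -F'\t' '$5 == "acen"' toy_cytoBand.txt > toy_centromere.tab

# For inspection: one interval per chromosome, from the smallest start to the largest end.
summary=$(awk -F'\t' '
  !($1 in lo) || $2 + 0 < lo[$1] { lo[$1] = $2 + 0 }
  !($1 in hi) || $3 + 0 > hi[$1] { hi[$1] = $3 + 0 }
  END { for (c in lo) print c, lo[c], hi[c] }' toy_centromere.tab | sort)
echo "$summary"
```

Each chromosome normally contributes two acen rows (one on the p arm, one on the q arm), so a chromosome with a different count in the real centromere.tab is worth a second look.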
Run scripts/00_genome_sort.sh to prepare the genome file with chromosomes sorted in the right order
Run scripts/01_trim_galore.sh to remove adapters and analyze quality of RNA-seq reads
Index the genome using scripts/02_star_index.sh
Align reads to the UCSC reference genome using scripts/03_star.sh
Compute B-allele frequencies (BAF) with BAFExtract using scripts/04_BAFExtract.sh
Run CaSpER on the BAF values and aligned reads using scripts/05_CaSpER.Rmd
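STAR writes one *ReadsPerGene.out.tab file per sample. When the file is read into R without a header, its columns are: V1, gene ID; V2, counts for non-stranded libraries; V3, counts for forward-stranded libraries; V4, counts for reverse-stranded libraries. The first four rows are summary counters (N_unmapped, N_multimapping, N_noFeature, N_ambiguous), not genes.
Select the column with the most reads to create the new data frame, counts.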
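The column choice can be checked from the command line before loading the counts in R. The following is a minimal sketch, not part of this repository's scripts: the file names sample_ReadsPerGene.out.tab and counts.tsv and the numbers in them are made up for illustration. It assumes the four N_* summary rows that STAR places at the top of the file and skips them before totalling.

```shell
# Hypothetical toy file in the ReadsPerGene.out.tab layout (tab-separated).
f=sample_ReadsPerGene.out.tab
{
  printf 'N_unmapped\t100\t100\t100\n'
  printf 'N_multimapping\t50\t50\t50\n'
  printf 'N_noFeature\t20\t300\t25\n'
  printf 'N_ambiguous\t5\t1\t2\n'
  printf 'ENSG00000000003\t120\t3\t118\n'
  printf 'ENSG00000000005\t0\t0\t0\n'
  printf 'ENSG00000000419\t80\t2\t79\n'
} > "$f"

# Skip the four summary rows, then total columns 2, 3 and 4 over all genes.
totals=$(awk -F'\t' 'NR > 4 { for (c = 2; c <= 4; c++) t[c] += $c }
                     END { print t[2], t[3], t[4] }' "$f")

# Pick the file column (2, 3 or 4) with the largest total; ties keep the earlier column.
best_col=$(echo "$totals" | awk '{ best = 1
                                   for (i = 2; i <= NF; i++) if ($i > $best) best = i
                                   print best + 1 }')

echo "column totals (2 3 4): $totals"
echo "column with the most reads: $best_col"

# Write gene ID plus the chosen column as a two-column counts table.
awk -F'\t' -v c="$best_col" 'NR > 4 { print $1 "\t" $c }' "$f" > counts.tsv
```

Printing all three totals, rather than only the winner, makes it easier to judge close calls: in the toy data column 4 is nearly as large as column 2 while column 3 is close to zero, which is worth knowing before settling on a column.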
Please refer to CaSpER documentation for functions to create output graphs.