-
Notifications
You must be signed in to change notification settings - Fork 12
RNASeq I practical RNASeq mapping & alignment visualization
cd /lustre/haven/courses/EPP622-2018Fa/$USER
mkdir RNASeq_lab_I
cd RNASeq_lab_I && mkdir raw_data
cd raw_data
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-28/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.28.dna.genome.fa.gz
wget ftp://ftp.ensemblgenomes.org/pub/release-28/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.28.gtf.gz
gunzip *gz ## uncompress all .gz files in your current directory
- RAN sequences from 4 phenotypes of A. thaliana under 4 different stress treatment. https://www.ebi.ac.uk/ena/data/view/PRJDB2508
- Phenotype and Stress Info: https://github.com/mestato/epp622/blob/master/RNA_labs_data/experimental_info.csv
ln -s /lustre/haven/courses/EPP622-2018Fa/nsoltan1/RNASeq_lab_I/raw_data/DRR016140_1.fastq
ln -s /lustre/haven/courses/EPP622-2018Fa/nsoltan1/RNASeq_lab_I/raw_data/DRR016140_2.fastq
qsub -I -A ACF-UTK0085 -q debug -l nodes=1:ppn=1
module load star
cd ..
mkdir analysis && cd analysis
mkdir alignment_STAR && cd alignment_STAR
mkdir genomeDir
-
Index reference genome
STAR --runMode genomeGenerate \ --genomeDir ./genomeDir \ --genomeFastaFiles ../../raw_data/Arabidopsis_thaliana.TAIR10.28.dna.genome.fa \ --runThreadN 1 \ --sjdbGTFfile ../../raw_data/Arabidopsis_thaliana.TAIR10.28.gtf \ --sjdbOverhang 101
-
--runMode genomeGenerate
: run genome indices generation job, default is to run alignment. -
--genomeDir
: specify the directory for storing genome indices -
--genomeFastaFiles
: one or more FASTA files with genome reference sequences -
--runThreadN
: the number of threads to use. -
--sjdbGTFfile
: The annotation file that STAR uses to build splice junctions database -
--sjdbOverhang
: specifies the length of genomic sequence around the annotated junction. Usually it is set to Readlength - 1.
-
Command line to get the read length
-
Read 1:
head -2 ../../raw_data/DRR016140_1.fastq | awk "{print length}" | tail -1
-
Read 2:
head -2 ../../raw_data/DRR016140_2.fastq | awk "{print length}" | tail -1
-
-
-
Align the reads
We have 16 pairs of reads file. We can use a for loop to get the job done, instead of running the command 16 times. But, since it takes long time, let's use one of the reads DRR016140
in alignment.
Make sure you are in the right directory (/RNASeq_lab_I/analysis/alignment_STAR
) and then create a directory to store the alignment output files:
mkdir alignment_output
-
STAR alignment for one paired read
STAR \ --genomeDir genomeDir \ --readFilesIn ../../raw_data/DRR016140_1.fastq ../../raw_data/DRR016140_2.fastq \ --outFileNamePrefix alignment_output/DRR016140_ \ --outSAMtype BAM SortedByCoordinate \ --runThreadN 1
-
--genomeDir
: specifies the directory where you put your genome indices -
--readFilesIn
: your paired RNASeq reads files. -
--outFileNamePrefix
: your output file name prefix. -
--outSAMtype
: your output file type. Here we want the generated bam file to be sorted by coordination. -
--runThreadN
: the number of threads to be used.
-
-
Alignment of all 16 files to the reference genome by loop command:
for i1 in `ls /lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/raw_data/*_1.fastq` do i2=`sed 's/_1.fastq/_2.fastq/' <(echo $i1)` BASE=$(basename $i1 | sed 's/_1.fastq*//g') echo "i1 $i1" echo "i2 $i2" echo "BASE $BASE" STAR \ --genomeDir genomeDir \ --readFilesIn $i1 $i2 \ --outFileNamePrefix alignment_output/$BASE. \ --outSAMtype BAM SortedByCoordinate \ --runThreadN 1 done
-
This file
Log.final.out
holds the actual alignment information for each readcat alignment_output/DRR016140_Log.final.out
cp /lustre/haven/courses/EPP622-2018Fa/nsoltan1/RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140_Aligned.sortedByCoord.out.bam .
cp /lustre/haven/courses/EPP622-2018Fa/nsoltan1/RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140.Log.final.out .
cd alignment_output
module load samtools
samtools index \
DRR016140_Aligned.sortedByCoord.out.bam
This creates an index file, which ends in .bai
(bam index).
IGV (Integrative Genomics Viewer), from the Broad Institute, is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets.
It can read in different file formats, details are available on this web page
In this case, we will use:
- the reference genome
- the annotations for the reference genome in gtf format
- the read alignment file, which must be a sorted BAM file
- an index file for the sorted BAM file
To transfer the files, you have the same two options as earlier, FileZilla or scp. Here are the file locations if you are using FileZilla (replace your username
instead of $USER):
/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/raw_data/Arabidopsis_thaliana.TAIR10.28.dna.genome.fa
/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/raw_data/Arabidopsis_thaliana.TAIR10.28.gtf
/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140_Aligned.sortedByCoord.out.bam
/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140_Aligned.sortedByCoord.out.bam
.bai
For secure copy, here are example commands to run on your terminal (after you have navigated to your Desktop or another location you would like the files to be):
scp yourusername@duo.acf.tennessee.edu:/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/raw_data/Arabidopsis_thaliana.TAIR10.28.dna.genome.fa .
scp yourusername@duo.acf.tennessee.edu:/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/raw_data/Arabidopsis_thaliana.TAIR10.28.gtf .
scp yourusername@duo.acf.tennessee.edu:/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140_Aligned.sortedByCoord.out.bam .
scp yourusername@duo.acf.tennessee.edu:/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140_Aligned.sortedByCoord.out.bam
.bai .
Now, we will visualize the mapping on IGV
Download and unzip IGV, then launch it.
Make sure that files we need to upload to IGV are in a same folder:
Arabidopsis_thaliana.TAIR10.28.dna.genome.fa
Arabidopsis_thaliana.TAIR10.28.gtf
DRR016140_Aligned.sortedByCoord.out.bam
DRR016140_Aligned.sortedByCoord.out.bam.bai