Skip to content

RNASeq I practical RNASeq mapping & alignment visualization

Nourolah edited this page Oct 29, 2018 · 20 revisions

RNASeq Mapping by STAR

Set up data directory

cd /lustre/haven/courses/EPP622-2018Fa/$USER
mkdir RNASeq_lab_I
cd RNASeq_lab_I && mkdir raw_data
cd raw_data

Get Arabidopsis thaliana reference genome

wget ftp://ftp.ensemblgenomes.org/pub/plants/release-28/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.28.dna.genome.fa.gz

Get Arabidopsis thaliana genome annotation

wget ftp://ftp.ensemblgenomes.org/pub/release-28/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.28.gtf.gz
gunzip *gz ## uncompress all .gz files in your current directory

Get raw reads

create a softlink from the directory containing one of the paired-end raw reads

ln -s /lustre/haven/courses/EPP622-2018Fa/nsoltan1/RNASeq_lab_I/raw_data/DRR016140_1.fastq 
ln -s /lustre/haven/courses/EPP622-2018Fa/nsoltan1/RNASeq_lab_I/raw_data/DRR016140_2.fastq

Let's get an interactive session for our work

qsub -I -A ACF-UTK0085 -q debug -l nodes=1:ppn=1

Run STAR

module load star
cd ..
mkdir analysis && cd analysis 
mkdir alignment_STAR && cd alignment_STAR
mkdir genomeDir
  1. Index reference genome

     STAR --runMode genomeGenerate     \
     --genomeDir ./genomeDir       \
     --genomeFastaFiles ../../raw_data/Arabidopsis_thaliana.TAIR10.28.dna.genome.fa  \
     --runThreadN 1      \
     --sjdbGTFfile ../../raw_data/Arabidopsis_thaliana.TAIR10.28.gtf   \
     --sjdbOverhang 101
    
    • --runMode genomeGenerate: run genome indices generation job, default is to run alignment.
    • --genomeDir: specify the directory for storing genome indices
    • --genomeFastaFiles: one or more FASTA files with genome reference sequences
    • --runThreadN: the number of threads to use.
    • --sjdbGTFfile: The annotation file that STAR uses to build splice junctions database
    • --sjdbOverhang: specifies the length of genomic sequence around the annotated junction. Usually it is set to Readlength - 1.
    • Command line to get the read length

      • Read 1:

        head -2 ../../raw_data/DRR016140_1.fastq | awk "{print length}" | tail -1
        
      • Read 2:

        head -2 ../../raw_data/DRR016140_2.fastq | awk "{print length}" | tail -1
        
  2. Align the reads

We have 16 pairs of reads file. We can use a for loop to get the job done, instead of running the command 16 times. But, since it takes long time, let's use one of the reads DRR016140 in alignment.

Make sure you are in the right directory (/RNASeq_lab_I/analysis/alignment_STAR) and then create a directory to store the alignment output files:

      mkdir alignment_output   
  • STAR alignment for one paired read

     STAR \
     --genomeDir genomeDir \
     --readFilesIn ../../raw_data/DRR016140_1.fastq ../../raw_data/DRR016140_2.fastq \
     --outFileNamePrefix alignment_output/DRR016140_  \
     --outSAMtype BAM SortedByCoordinate     \
     --runThreadN 1
    
    • --genomeDir: specifies the directory where you put your genome indices
    • --readFilesIn: your paired RNASeq reads files.
    • --outFileNamePrefix: your output file name prefix.
    • --outSAMtype: your output file type. Here we want the generated bam file to be sorted by coordination.
    • --runThreadN: the number of threads to be used.
  • Alignment of all 16 files to the reference genome by loop command:

    for i1 in `ls /lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/raw_data/*_1.fastq`
    
    do
    
      i2=`sed 's/_1.fastq/_2.fastq/' <(echo $i1)`
      BASE=$(basename $i1 | sed 's/_1.fastq*//g')
      echo "i1 $i1"
      echo "i2 $i2"
      echo "BASE $BASE"
    
      STAR \
      --genomeDir genomeDir \
      --readFilesIn $i1 $i2 \
      --outFileNamePrefix alignment_output/$BASE.  \
      --outSAMtype BAM SortedByCoordinate     \
      --runThreadN 1
    
     done
    
    

Obtain alignment info.

  • This file Log.final.out holds the actual alignment information for each read

    cat alignment_output/DRR016140_Log.final.out
    

If you couldn't get the analysis done, just copy the results to your directory.

cp /lustre/haven/courses/EPP622-2018Fa/nsoltan1/RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140_Aligned.sortedByCoord.out.bam .

cp /lustre/haven/courses/EPP622-2018Fa/nsoltan1/RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140.Log.final.out .

Use .bam file for alignment visualization

cd alignment_output

module load samtools
samtools index \
DRR016140_Aligned.sortedByCoord.out.bam

This creates an index file, which ends in .bai (bam index).

Visualize with IGV

IGV (Integrative Genomics Viewer), from the Broad Institute, is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets.

It can read in different file formats, details are available on this web page

In this case, we will use:

  • the reference genome
  • the annotations for the reference genome in gtf format
  • the read alignment file, which must be a sorted BAM file
  • an index file for the sorted BAM file

To transfer the files, you have the same two options as earlier, FileZilla or scp. Here are the file locations if you are using FileZilla (replace your username instead of $USER):

/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/raw_data/Arabidopsis_thaliana.TAIR10.28.dna.genome.fa
/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/raw_data/Arabidopsis_thaliana.TAIR10.28.gtf
/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140_Aligned.sortedByCoord.out.bam
/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140_Aligned.sortedByCoord.out.bam
.bai

For secure copy, here are example commands to run on your terminal (after you have navigated to your Desktop or another location you would like the files to be):

scp yourusername@duo.acf.tennessee.edu:/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/raw_data/Arabidopsis_thaliana.TAIR10.28.dna.genome.fa .
scp yourusername@duo.acf.tennessee.edu:/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/raw_data/Arabidopsis_thaliana.TAIR10.28.gtf .
scp yourusername@duo.acf.tennessee.edu:/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140_Aligned.sortedByCoord.out.bam .
scp yourusername@duo.acf.tennessee.edu:/lustre/haven/courses/EPP622-2018Fa/$USER/RNASeq_lab_I/analysis/alignment_STAR/alignment_output/DRR016140_Aligned.sortedByCoord.out.bam
.bai .

Now, we will visualize the mapping on IGV

Download and unzip IGV, then launch it.

Make sure that files we need to upload to IGV are in a same folder:

Arabidopsis_thaliana.TAIR10.28.dna.genome.fa
Arabidopsis_thaliana.TAIR10.28.gtf
DRR016140_Aligned.sortedByCoord.out.bam
DRR016140_Aligned.sortedByCoord.out.bam.bai
Clone this wiki locally