DNASeq III Practical

Log into ACR and qsub

Will it work today? Who knows, lets find out! You want to end up in your class project directory, back in e_coli_fixed.

Filesystem

This is the directory structure we left off with last time, make sure your folder structure looks like this:

Visualize alignments with samtools

We'll be using the reference genome a lot in this lesson. Lets create a symbolic link.

ln -s ../../raw_data/GCF_000005845.2_ASM584v2_genomic.fna

Index the newest bam file. (Indexing the bam file is needed for many types of software to operate on a bam file)

/lustre/haven/courses/EPP622-2018Fa/software/samtools-1.9/bin/samtools \
index \
DRR021342_BWA_alignments_fixmate_lanesorted.bam

And launch the visualization:

/lustre/haven/courses/EPP622-2018Fa/software/samtools-1.9/bin/samtools \
tview \
DRR021342_BWA_alignments_fixmate_lanesorted.bam \
GCF_000005845.2_ASM584v2_genomic.fna

Some instructions:

'?' for help
'g' then '=10,000' to go to a particular region (this latter instruction only works if you have a single reference sequence).
'q' to quit

Try to find a real variant in the viewer. For example, type g, then =8690

Improving alignments with GATK

Continuing with the HTSLIB mapping to variant workflow, now we'll start using GATK and picard tools. We'll use them to improve the alignments (they have many other functions too!)

GATK has much better workflows for variant calling and you should use them! The developers released GATK4 in January, 2018. Its quite a bit different than the old version, and I haven't yet learned to use it yet, so we're going to stick with our simplified GATK3, samtools and bcftools protocol. Also, we just don't have enough lab time to get into a super detailed variant calling protocol. But you should see enough of the process to be able to expand to the full recommended workflow if you are doing variant calling for your own research.

GATK is going to help improve our alignments, so lets add it to the analysis folder name. You should start from your class project directory:

cd /lustre/haven/courses/EPP622-2018Fa/yourusername/e_coli_fixed/analysis/

Rename the 3_bwa folder

mv 3_bwa 3_bwa_and_gatk
cd 3_bwa_and_gatk

Create an index file for the reference genome.

java \
-jar /lustre/haven/courses/EPP622-2018Fa/software/picard-2.18.14.jar \
CreateSequenceDictionary \
R=GCF_000005845.2_ASM584v2_genomic.fna \
O=GCF_000005845.2_ASM584v2_genomic.dict

In order to reduce the number of miscalls of INDELs in your data it is helpful to realign your raw gapped alignment with the Broad’s GATK Realigner. First, we'll run RealignerTargetCreator

java -Xmx2g \
-jar /lustre/haven/courses/EPP622-2018Fa/software/GenomeAnalysisTK-3.8-1-0-gf15c1c3ef/GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R GCF_000005845.2_ASM584v2_genomic.fna \
-I DRR021342_BWA_alignments_fixmate_lanesorted.bam \
-o DRR021342.intervals

Next, run IndelRealigner

This one takes a while (11 minutes yesterday):

java -Xmx4g \
-jar /lustre/haven/courses/EPP622-2018Fa/software/GenomeAnalysisTK-3.8-1-0-gf15c1c3ef/GenomeAnalysisTK.jar \
-T IndelRealigner \
-R GCF_000005845.2_ASM584v2_genomic.fna \
-I DRR021342_BWA_alignments_fixmate_lanesorted.bam \
-targetIntervals DRR021342.intervals \
-o DRR021342_BWA_alignments_fixmate_lanesorted_realigned.bam

To fully follow the workflow, next you would want to do a number of additional steps:

BQSR from the Broad’s GATK allows you to reduce the effects of analysis artefacts produced by your sequencing machines.
Marking PCR and optical duplicates.
realign indels AGAIN

(Of course as per the statement at the beginning of the lesson, you'd better off to dig into the full set of best practices from GATK4!)

We're going to skip those to save time, and move on. Finally lets re-index our new BAM using samtools:

/lustre/haven/courses/EPP622-2018Fa/software/samtools-1.9/bin/samtools \
index \
DRR021342_BWA_alignments_fixmate_lanesorted_realigned.bam

Clean up

At this point, our directory is quite large. So how do we find out how much storage space a directory is taking up?

cd ..
du -skh 3_bwa_and_gatk

We really don't need the intermediate sam or bam files for the rest of our work, so lets delete them. Remove all the sam and bam files other than DRR021342_BWA_alignments_fixmate_lanesorted_realigned.bam. (I removed DRR021342_BWA_alignments.sam, DRR021342_BWA_alignments_fixmate.bam, DRR021342_BWA_alignments_fixmate_lanesorted.bam, and DRR021342_BWA_alignments_fixmate_lanesorted.bam.bai)

How much space did you free up?

Call Variants

Onto bcftools, another great software package, and the third member of the htslib/samtools/bcftools suite of programs. This one is used for reading/writing BCF2/VCF/gVCF files and calling/filtering/summarizing SNP and short indel sequence variants/

Lets create a new directory and create symbolic links for our bam file, our bam index (the .bai file) and the reference.

cd ..
mkdir 4_bcftools
cd 4_bcftools
ln -s ../3_bwa_and_gatk/DRR021342_BWA_alignments_fixmate_lanesorted_realigned.ba* .
ln -s ../../raw_data/GCF_000005845.2_ASM584v2_genomic.fna

To convert your BAM file into genomic positions we first use mpileup to produce a BCF file that contains all of the locations in the genome. We use this information to call genotypes and reduce our list of sites to those found to be variant by passing this file into bcftools call.

To specify the ploidy of your organism, you need to create a file. Using nano, create a file named ploidy.txt. In this file put the following line:

*	*	*	*	1

This specifies a ploidy of 1 (i.e. haploid) for all samples at all locations.

You can do this using a pipe as shown here (this one takes a realllly long time - I recommend logging out and getting a new qsub job so you have the full hour to wait):

/lustre/haven/courses/EPP622-2018Fa/software/bcftools-1.9/bin/bcftools \
mpileup -Ou \
-f GCF_000005845.2_ASM584v2_genomic.fna \
DRR021342_BWA_alignments_fixmate_lanesorted_realigned.bam \
 | \
/lustre/haven/courses/EPP622-2018Fa/software/bcftools-1.9/bin/bcftools \
call -vmO z \
--ploidy-file ploidy.txt \
-o DRR021342.vcf.gz

Oh boy, what the heck are all those flags? Well, guess what, figuring that out is your homework! (No really, learning to read these manuals and understand how to use flags is SUPER IMPORTANT).

Additionally you may find it helpful to get some statistics to assist you in filtering your variants.

/lustre/haven/courses/EPP622-2018Fa/software/bcftools-1.9/bin/bcftools \
stats \
-F GCF_000005845.2_ASM584v2_genomic.fna \
-s - DRR021342.vcf.gz \
> DRR021342.vcfstats

Finally you will probably need to filter your data, because it likely has a lot of things that aren't very reliable. You can use a command such as:

/lustre/haven/courses/EPP622-2018Fa/software/bcftools-1.9/bin/bcftools \
filter \
-O z \
-o DRR021342.filtered.vcf.gz \
-s LOWQUAL \
-i'%QUAL>10' \
DRR021342.vcf.gz

How many variants were lost to the filtering? How would you figure it out? What if you change the quality cut off?

Preferably you would fire up that samtools tview or another viewer and LOOK AT THE DATA. Does the filtering make sense?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly