Skip to content

Finish up variant calling

Meg Staton edited this page Sep 28, 2016 · 12 revisions

Set up

Log into newton

qrsh 

Get back to your variant calling folder and make sure your variant calling from last week worked.

cd e_coli/analysis/
cd e_coli/analysis/5_samtools
wc -l *raw.vcf

This should yield ~242 lines. You should also use head and tail to check that it looks like a vcf formatted file.

Filtering variant calls

There could still be low quality variants in this file, lets do some basic filtering. We have almost 1 billion bases, yielding an average depth of coverage of 214X. Lets filter SNPs with more than 500X coverage and a quality value of less than 20.

module load bcftools
bcftools filter -s LowQual -e '%QUAL<20 || DP>500' DRR021342.raw.vcf > DRR021342.flt.vcf

BCFtools manua has much more explanation about the filtering command and various ways to filter.

How many SNPs were removed by this command?

wc -l *vcf

There are actually MORE lines in our filtered vcf. What happened? The header was updated with new lines to reflect the filtering. Also, SNPs are not removed - instead bcftools changes in the information in the filter column from "PASS" to "LowQual".

grep 'LowQual' DRR021342.flt.vcf

SNPEff

Unfortunately, at this point you will just have to watch instead of doing this yourself. There is an issue where we can't install databases on the public snpEff instance due to permissions, and I havne't had a chance to ask the newton admin to fix. So then I thought 'no problem, we'll do a local install'. But that doesn't work because it requires the use of the /tmp directory on the head node. Which turns into a problem if we all try to install the same files in that location. Long story short: if you need a reference genome added to this software, let me know and I can show you the work around locally. And, welcome to the world of bioinformatics. Truthfully, you'll spend 80% of your time banging your head against a wall trying to get software to work.

We have to manually install the software in our home directories - its installed on newton, but we can't add a new reference genome due to permission issues. These install instructions are taken from these SnpEff installation directions. Fortunately, it doesn't require to much work - just download and decompress.

cd ~
mkdir software
cd software
wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
unzip snpEff_latest_core.zip
cd ~/e_coli/analysis
mkdir 6_snpEff
cd 6_snpEff

To find a database (over 20,000 represented)

java -jar /data/apps/snpeff/4.2/snpEff.jar databases | grep -i Escherichia_coli

This produces a lot files - a lot of E coli strains have been sequenced. Lets search by our particular reference

java -jar /data/apps/snpeff/4.2/snpEff.jar databases | grep -i  GCA_000005845

You can now use this info to install the database. But you'll have to open a new session on the head node, so that you are connected to the internet.

java -jar /data/apps/snpeff/4.2/snpEff.jar download -v GCA_000005845.2.29

It turns out our genome sequence name and the name from the SNPeff database are different. So we can use the handy sed command to fix the genome sequence name.

(This is another debug step that took me a long time to figure out! You're welcome. But seriously, running your own analysis is going to take waaaaayyyyy longer than these nice labs where I figured out how to make it work. Please consider this further encouragement to start on your project early :)

cp ../5_samtools/DRR021342.flt.vcf .
sed 's/NC_000913.3/Chromosome/' DRR021342.flt.vcf  > DRR021342.flt.renamed.vcf

If you are trying to do this in the future, a good recommendation is to start with the same genome from Ensembl as the one that is in the snpEff database. That should avoid this step of renaming the vcf file reference sequences.

Lets run snpEff with our genome and filtered vcf file. This will output a new vcf file with the functional annotations.

java -Xmx4g -jar ~/software/snpEff/snpEff.jar \
GCA_000005845.2.29 \
DRR021342.flt.renamed.vcf \
> DRR021342.flt.ann.vcf

This produces an html file that you can transfer to your local computer to check out. It also adds the ANN flags to the vcf file.

Extensive SnpEff documentation

Clone this wiki locally