-
Notifications
You must be signed in to change notification settings - Fork 12
Finish up variant calling
Log into newton
qrsh
Get back to your variant calling folder and make sure your variant calling from last week worked.
cd e_coli/analysis/
cd e_coli/analysis/5_samtools
wc -l *raw.vcf
This should yield ~242 lines. You should also use head and tail to check that it looks like a vcf formatted file.
There could still be low quality variants in this file, lets do some basic filtering. We have almost 1 billion bases, yielding an average depth of coverage of 214X. Lets filter SNPs with more than 500X coverage and a quality value of less than 20.
module load bcftools
bcftools filter -s LowQual -e '%QUAL<20 || DP>500' DRR021342.raw.vcf > DRR021342.flt.vcf
BCFtools manua has much more explanation about the filtering command and various ways to filter.
How many SNPs were removed by this command?
wc -l *vcf
There are actually MORE lines in our filtered vcf. What happened? The header was updated with new lines to reflect the filtering. Also, SNPs are not removed - instead bcftools changes in the information in the filter column from "PASS" to "LowQual".
grep 'LowQual' DRR021342.flt.vcf
Unfortunately, at this point you will just have to watch instead of doing this yourself. There is an issue where we can't install databases on the public snpEff instance due to permissions, and I havne't had a chance to ask the newton admin to fix. So then I thought 'no problem, we'll do a local install'. But that doesn't work because it requires the use of the /tmp directory on the head node. Which turns into a problem if we all try to install the same files in that location. Long story short: if you need a reference genome added to this software, let me know and I can show you the work around locally. And, welcome to the world of bioinformatics. Truthfully, you'll spend 80% of your time banging your head against a wall trying to get software to work.
We have to manually install the software in our home directories - its installed on newton, but we can't add a new reference genome due to permission issues. These install instructions are taken from these SnpEff installation directions. Fortunately, it doesn't require to much work - just download and decompress.
cd ~
mkdir software
cd software
wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
unzip snpEff_latest_core.zip
cd ~/e_coli/analysis
mkdir 6_snpEff
cd 6_snpEff
To find a database (over 20,000 represented)
java -jar /data/apps/snpeff/4.2/snpEff.jar databases | grep -i Escherichia_coli
This produces a lot files - a lot of E coli strains have been sequenced. Lets search by our particular reference
java -jar /data/apps/snpeff/4.2/snpEff.jar databases | grep -i GCA_000005845
You can now use this info to install the database. But you'll have to open a new session on the head node, so that you are connected to the internet.
java -jar /data/apps/snpeff/4.2/snpEff.jar download -v GCA_000005845.2.29
It turns out our genome sequence name and the name from the SNPeff database are different. So we can use the handy sed command to fix the genome sequence name.
(This is another debug step that took me a long time to figure out! You're welcome. But seriously, running your own analysis is going to take waaaaayyyyy longer than these nice labs where I figured out how to make it work. Please consider this further encouragement to start on your project early :)
cp ../5_samtools/DRR021342.flt.vcf .
sed 's/NC_000913.3/Chromosome/' DRR021342.flt.vcf > DRR021342.flt.renamed.vcf
If you are trying to do this in the future, a good recommendation is to start with the same genome from Ensembl as the one that is in the snpEff database. That should avoid this step of renaming the vcf file reference sequences.
Lets run snpEff with our genome and filtered vcf file. This will output a new vcf file with the functional annotations.
java -Xmx4g -jar ~/software/snpEff/snpEff.jar \
GCA_000005845.2.29 \
DRR021342.flt.renamed.vcf \
> DRR021342.flt.ann.vcf
This produces an html file that you can transfer to your local computer to check out. It also adds the ANN flags to the vcf file.
Extensive SnpEff documentation