Skip to content

BLAST practical

Meg Staton edited this page Sep 12, 2016 · 11 revisions

BLAST

Log in and get data

Log into newton but DO NOT get an interactive session (yet).

	ssh userid@newlogin.newton.utk.edu

wget doesn't work on interactive sessions. Lets start by downloading some data.

Grab some data to play with. Grab some cow and human RefSeq proteins:

	wget ftp://ftp.ncbi.nih.gov/refseq/B_taurus/mRNA_Prot/cow.1.protein.faa.gz
	wget ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.1.protein.faa.gz

This is only part of the human and cow protein files.

Now we can get an interactive session on newton.

	qrsh
	hostname

Interactive session and load session

BLAST is available through the module system:

module load blast

If this fails you may need to try module switch intel-compilers/2016u3 first, then load the blast software again.

You can check that you now have all the blast programs you might need (blastp, blastx, etc). Lets download the data we need for this lesson.

blastn -version
blastp -help

Make a directory to hold our blast lesson

	cd ~
	mkdir blast_examples
	cd blast_examples

##Protein vs Protein on the command line

The database files are both gzipped, so lets unzip them

	gunzip *gz
	ls

Take a look at the head of each file:

	head cow.1.protein.faa 
	head human.1.protein.faa 

Note that the files are in fasta format, even though they end if ".faa" instead of the usual ".fasta". This NCBI's way of denoting that this is a fasta file with amino acids instead of nucleotides.

How many sequences are in each one?

	grep -c '^>' cow.1.protein.faa
	grep -c '^>' human.1.protein.faa 

This grep command uses the c flag, which reports a count of lines with match to the pattern. In this case, the pattern is a regular expression, meaning match only lines that begin with a >.

This is a bit too big, lets take a smaller set for practice. Lets take the first two sequences of the cow proteins, which we can see are on the first 6 lines

	head -6 cow.1.protein.faa > cow.small.faa

Now we can blast these two cow sequences against the set of human sequences. First, we need to tell blast about our database. BLAST needs to do some pre-work on the database file prior to searching. This helps to make the software work a lot faster. Use the makeblastdb command:

	makeblastdb -in human.1.protein.faa -dbtype prot
	ls

Note that this makes a lot of extra files, with the same name as the database plus new extensions (.pin, .psq, etc). To make blast work, these files, called index files, must be in the same directory as the fasta file.

Now we can run the blast job. We will use blastp, which is appropriate for protein to protein comparisons.

	blastp -query cow.small.faa -db human.1.protein.faa -out cow_vs_human_blast_results.txt

Take a look at the results in nano. Note that there can be more than one match between the query and the same subject. These are referred to as high-scoring segment pairs (HSPs).

Lets also take a look at the help pages. Unfortunately there are no man pages (those are usually reserved for shell commands, but some software authors will provide them as well), but there is a text help output

	blastp -help

To scroll through slowly

	blastp -help | less

To quit the less screen, press the q key.

Parameters of interest: -evalue ( Default is 10?!?) and -outfmt

Lets get only very meaningful matches with a different output format:

	 blastp \
	 -query cow.small.faa \
	 -db human.1.protein.faa \
	 -out cow_vs_human_blast_results.tab \
	 -evalue 1e-5 \
	 -outfmt 7

Check out the results with nano.

Tab delimited has these default columns:

	qseqid 		Query sequence ID
	sseqid		Subject (ie DB) sequence ID
	pident		Percent Identity across the alignment
	length 		Alignment length
	mismatch 	# of mismatches
	gapopen 	Number of gap openings
	qstart 		Start of alignment in query
	qend 		End of alignment in query 
	sstart 		Start of alignment in subject
	send		End of alignment in subject
	evalue 		E-value
	bitscore	Bit score

I find it very useful to add in the subject sequence description to the tabular output. If you go through help some more, you will find that you can decide which types of output you would like in the tab delimited file and what order they should be in. For example:

	blastp \
	-query cow.small.faa \
	-db human.1.protein.faa \
	-out cow_vs_human_blast_results.tab \
	-evalue 1e-5 \
	-outfmt "7 std stitle" 

Lets try a medium sized data set next

	head -203 cow.1.protein.faa > cow.medium.faa

What size is this db?

	grep -c '^>' cow.medium.faa 

Lets run the blast again, but this time lets not have the comment lines, and lets return only the best hit for each query.

	blastp \
	-query cow.medium.faa \
	-db human.1.protein.faa \
	-out cow_vs_human_blast_results.tab \
	-evalue 1e-5 \
	-outfmt "6 std stitle" \
	-max_target_seqs 1

##Protein vs Protein - submitting a job to Newton with threads

Lets write a script to submit a large BLAST job on newton. Since it is large and we don't want to wait for it to finish, we can write a newton submission script

nano newton_threaded_blast.qsh

Inside this file put:

#$ -N blast
#$ -q medium*
#$ -cwd
#$ -pe threads 8
#$ -S /bin/bash
module switch intel-compilers/2016u3
module load blast
blastp \
-query cow.medium.faa \
-db human.1.protein.faa \
-out cow_vs_human_blast_results.THREADED.tab \
-evalue 1e-5 \
-outfmt "6 std stitle" \
-num_threads 8

Now run:

	qsub newton_threaded_blast.qsh

Note - this is limited by the number of cores per node. Check out the newton technical details page. You can't request more threads than there are cores in a single machine.

As soon as we learn some biopython, we can split the query file into multiple segments and run this as a parallel job instead of a threaded job! That's coming up soon.

Clone this wiki locally