BLAST practical

BLAST

BLAST is one of the most common bioinformatic tools out there, and many biology students learn to run BLAST for individual or small sets of sequences via a web interface: you can find a quick overview of web BLAST here. The most popular is probably NCBI BLAST. You can scale up your BLAST runs significantly on an HPC.

Log in and get a directory set up

Log into ACF login node

	ssh yourusername@duo.acf.tennessee.edu

Change into our class project directory

        cd /lustre/haven/courses/EPP622-2018Fa

Create your own personal folder

        mkdir yourusername

Look at the permissions

        ls -l

By default, others should not be able to write over anything in your folder. However, anyone will be able to see your work. If you want to change that, lets alter the permissions.

         chmod go-rwx yourusername

Seeing an "S"? What the heck is that? Explanation here.

Okay, now before you download data, get into your own folder.

         cd yourusername

Get Data

Make a directory to hold our blast lesson. Make sure you are in your own folder in our shared project folder.

	mkdir blast_examples
	cd blast_examples

wget doesn't work on compute nodes. So anytime you need to download data, you will have to do it on the login node. Lets start by downloading some data.

Grab some cow and human RefSeq proteins:

	wget ftp://ftp.ncbi.nih.gov/refseq/B_taurus/mRNA_Prot/cow.1.protein.faa.gz
	wget ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.1.protein.faa.gz

This is only part of the human and cow protein files. check out how big they are:

         ls -lh

Now we can get an interactive session on a compute node.

	qsub -I -A ACF-UTK0085 -q debug -l walltime=1:00:00,nodes=1

Change back to the directory with your blast files

        cd /lustre/haven/courses/EPP622-2018Fa/yourusername

Interactive session and load software

BLAST is available through the module system. This is a way for a computer system to manage many software packages and make them available to users. Lets see what software is available on the ACF:

module avail

We know we want blast, so we can ask if it is available directly

module avail blast

Unfortunately its not the most current version, but it'll do. To use this software, we need to "load" it. This sets up your compute environment to use the software and means you can run it without specifying the path.

module load blast

One easy way to see if a program is working is to look for the version or the help pages.

blastn -version
blastp -help

BLAST has a number of possible programs to run depending on whether you have nucleotide or protein sequences:

nucleotide query and nucleotide db - blastn
nucleotide query and nucleotide db - tblastx (includes six frame translation of query and db sequences)
nucleotide query and protein db - blastx (includes six frame translation of query sequences)
protein query and nucleotide db - tblastn (includes six frame translation of db sequences)
protein query and protein db - blastp

Running BLAST - Protein vs Protein on the command line

The database files are both gzipped, so lets unzip them

	gunzip *gz
	ls

Take a look at the head of each file:

	head cow.1.protein.faa 
	head human.1.protein.faa

Note that the files are in fasta format, even though they end if ".faa" instead of the usual ".fasta". This NCBI's way of denoting that this is a fasta file with amino acids instead of nucleotides.

How many sequences are in each one?

	grep -c '^>' cow.1.protein.faa
	grep -c '^>' human.1.protein.faa

Grep? Grep is short had for "globally search a regular expression and print". So what exactly is a regular expression? Well, its a formal language to define search patterns. In this case, we are asking to search for lines that begin with (the ^ symbol) a greater than sign. Regular expressions (also known as regex) are extremely powerful, and if you are going to be doing a lot of file manipulation and data science, worth learning more about!

This grep command uses the c flag, which reports a count of lines with match to the pattern. In this case, the pattern is a regular expression, meaning match only lines that begin with a >.

This is a bit too big, lets take a smaller set for practice. Lets take the first two sequences of the cow proteins, which we can see are on the first 6 lines

	head -n 16 cow.1.protein.faa > cow.small.faa

Now we can blast these two cow sequences against the set of human sequences. First, we need to tell blast about our database. BLAST needs to do some pre-work on the database file prior to searching. This helps to make the software work a lot faster. Use the makeblastdb command:

	makeblastdb -in human.1.protein.faa -dbtype prot
	ls

Note that this makes a lot of extra files, with the same name as the database plus new extensions (.pin, .psq, etc). To make blast work, these files, called index files, must be in the same directory as the fasta file.

Now we can run the blast job. We will use blastp, which is appropriate for protein to protein comparisons.

	blastp -query cow.small.faa -db human.1.protein.faa -out cow_vs_human_blast_results.txt

Take a look at the results in nano. Note that there can be more than one match between the query and the same subject. These are referred to as high-scoring segment pairs (HSPs).

Lets also take a look at the help pages. Unfortunately there are no man pages (those are usually reserved for shell commands, but some software authors will provide them as well), but there is a text help output

	blastp -help

To scroll through slowly

	blastp -help | less

To quit the less screen, press the q key.

Parameters of interest: -evalue ( Default is 10?!?) and -outfmt

Lets get only very meaningful matches with a different output format:

	 blastp \
	 -query cow.small.faa \
	 -db human.1.protein.faa \
	 -out cow_vs_human_blast_results.tab \
	 -evalue .1 \
	 -outfmt 7

Whoa!? What are those slashes? In a practical sense, these tell the shell "I'm not done with this command yet". In a literal sense, the forward slash is an "escape" character, and its telling the shell to interpret the newline as just a newline, not a "submit" of the command.

Check out the results with nano.

Tab delimited has these default columns:

	qseqid 		Query sequence ID
	sseqid		Subject (ie DB) sequence ID
	pident		Percent Identity across the alignment
	length 		Alignment length
	mismatch 	# of mismatches
	gapopen 	Number of gap openings
	qstart 		Start of alignment in query
	qend 		End of alignment in query 
	sstart 		Start of alignment in subject
	send		End of alignment in subject
	evalue 		E-value
	bitscore	Bit score

I find it very useful to add in the subject sequence description to the tabular output. If you go through help some more, you will find that you can decide which types of output you would like in the tab delimited file and what order they should be in. For example:

	blastp \
	-query cow.small.faa \
	-db human.1.protein.faa \
	-out cow_vs_human_blast_results.tab \
	-evalue .1 \
	-outfmt "7 std stitle"

Lets try a medium sized data set next

	head -n 500 cow.1.protein.faa > cow.medium.faa

What size is this db?

	grep -c '^>' cow.medium.faa

Lets run the blast again, but this time lets not have the comment lines, and lets return only the best hit for each query.

	blastp \
	-query cow.medium.faa \
	-db human.1.protein.faa \
	-out cow_vs_human_blast_results.tab \
	-evalue .1 \
	-outfmt "6 std stitle" \
	-max_target_seqs 1

This is still very inefficient compared to what we can do. Lets imagine needing to BLAST 50,000 sequences against a database of 500,000 proteins (a surprisingly common task!). By the end of the semester, our goal is to be able to:

Write a short python script to split the 50,000 input sequences into a set of 100 files (each containing 500 sequences)
Write an ACF submission script that will submit 100 jobs (one for each input file) to the cluster to run in parallel
Get the results back in a few short hours!

Provide feedback

Saved searches