Centrifuge nt

Overview

The Centrifuge nt database is the NCBI nt database pre-processed so that it can be used by Centrifuge, enabling rapid and sensitive classification across a huge range of organisms.

For further details about the NCBI nt database, please consult the NCBI nt Database section below.

Fast-paced way

The most straightforward way of generating an updated version of the nt database is to use the Makefile provided with Centrifuge:

cd your_centrifuge_folder/indices
make THREADS=16 nt

Set THREADS to whatever number of threads you can dedicate to the build of the nt database. As the build will take some time, the larger this number, the better (within the scalability limits of the computer architecture where you are running the code). In addition to Centrifuge, you will need dustmasker in your path, a program that identifies and masks low-complexity regions; it is part of the NCBI BLAST command-line applications. In case you don't want to do this masking, you can pass DONT_DUSTMASK=1 to make, as shown below.
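
For example, to run the build with 32 threads while skipping the masking step (both variables come straight from the invocation above; the thread count is just an illustration):

cd your_centrifuge_folder/indices
make THREADS=32 DONT_DUSTMASK=1 nt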

If this "automatic" build fails, or if you need or want to take control of each stage, the rest of this page provides instructions to generate your own updated version of the database step by step.

Step by step instructions

You will need high-performance computing (HPC) resources to generate the Centrifuge nt database; typically, a current fat node will do the job. The last successful build required 128 cores, 2 TiB of memory, a fast scratch storage system, and more than a week of runtime. These are the step-by-step instructions:

  1. The first step is to download the NCBI nt database and unzip it (both operations will take some time, as the file is hundreds of GiB in size):
mkdir nt
cd nt
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
gunzip nt.gz
mv -v nt nt.fa
  2. Do the same with the taxdump databases (this is the shortest step):
mkdir taxonomy
cd taxonomy
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xvzf taxdump.tar.gz
cd ..
  3. Generate the accession-to-taxid mapping file (which exceeded 16 GiB as of November 2021) using the following commands (see the note on file headers after this list):
wget "ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_*.accession2taxid.gz"
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz
gunzip -c *.accession2taxid.gz | awk -v OFS='\t' '{print $2, $3}' >> acc2tax.map
  4. This step is optional; it is the one skipped by the Makefile build when you pass DONT_DUSTMASK=1 to make. It masks low-complexity sequences using DustMasker, an NCBI BLAST command-line application that you should install on your own (with the rest of the NCBI BLAST+ tools or as a standalone binary). We will run dustmasker with the DUST level (the score threshold for subwindows) set to 20, which is the default. Finally, all the nucleotides masked in the DustMasker output are replaced with N using sed (a quick sanity check is shown after this list):
mv nt.fa nt_unmasked.fa 
dustmasker -infmt fasta -in nt_unmasked.fa -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > nt.fa
  5. Last but not least, issue the Centrifuge command that will generate the Centrifuge nt database (this is the part that actually benefits from high-performance computing):
centrifuge-build --ftabchars=14 -p 32 --bmax 1342177280 --conversion-table acc2tax.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp nt.fa nt
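
Note on step 3: each accession2taxid file begins with a one-line header (accession, accession.version, taxid, gi), so the command above also copies those header fields into acc2tax.map. They are generally harmless, but if you prefer a clean mapping file, a variant that skips the first line of each file is:

# Run awk once per file so FNR > 1 reliably skips each header line
for f in *.accession2taxid.gz; do
    gunzip -c "$f" | awk -v OFS='\t' 'FNR > 1 {print $2, $3}'
done > acc2tax.map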
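
A quick sanity check for step 4 (a hypothetical check, not part of the original procedure): count the masked bases in the final FASTA. The count also includes IUPAC ambiguity codes converted by the same sed expression, so expect a large number:

# Count N bases in sequence lines only (headers excluded)
grep -v '^>' nt.fa | tr -cd 'N' | wc -c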

The centrifuge-build process will finish with a line like this:

Total time for call to driver() for forward index: HH:MM:SS

That is the time of the last step; in our case (using 32 cores) it took more than 20 hours. As this is not a short time, if centrifuge-build is launched in an interactive session instead of through a batch system, I strongly recommend using some mechanism to protect the process from unintentional interruptions, for instance nohup:

nohup centrifuge-build --ftabchars=14 -p 32 --bmax 1342177280 --conversion-table acc2tax.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp nt.fa nt &

Then use tail -f nohup.out to safely follow the progress.
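
If a batch system is available, that is the safer option. As a minimal sketch, assuming a SLURM cluster (job name, resource figures, and walltime are placeholders to adapt to your site and to the hardware notes above):

#!/bin/bash
#SBATCH --job-name=cf-nt-build
#SBATCH --cpus-per-task=32      # match the -p value passed to centrifuge-build
#SBATCH --mem=2000G             # the nt build is very memory-hungry (see above)
#SBATCH --time=7-00:00:00       # about a week; see the timings reported above
centrifuge-build --ftabchars=14 -p 32 --bmax 1342177280 \
    --conversion-table acc2tax.map --taxonomy-tree taxonomy/nodes.dmp \
    --name-table taxonomy/names.dmp nt.fa nt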

Download [OBSOLETE]

As an alternative to generating your own database, you can download the latest version of the nt database prepared by the Centrifuge team here: ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/

If you need a previous version, this is one I generated with the procedure described above:

  • Version: NCBI nt database downloaded in August 2017.
  • 7z ultra-compressed file (58.2 GiB): DOWNLOAD nt.cf.7z FILE
  • Contents: The files uncompressed are nt.1.cf, nt.2.cf and nt.3.cf, with a total size of about 90 GiB.
  • The md5 file: DOWNLOAD nt.cf.7z.md5 FILE

After downloading both files, you can check the MD5 checksum with the command:

md5sum -c nt.cf.7z.md5

You can extract the contents simply with:

7z e nt.cf.7z
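
After extraction, a quick way to verify that the index is usable is centrifuge-inspect, which ships with Centrifuge (the options below are the ones listed in the Centrifuge manual):

centrifuge-inspect -s nt                     # print a summary of the index
centrifuge-inspect --name-table nt | head    # first entries of the taxid-to-name table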

Details

NCBI nt Database

Contrary to what its name may suggest, the NCBI nt database does not contain all the sequences from the NCBI nucleotide databases. Currently, it contains "all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS, STS, PAT, EST, HTG, and WGS". So, it does include:

| DB  | Contents                   |
|-----|----------------------------|
| TSA | Transcriptome shotgun data |
| ENV | Environmental samples      |
| PHG | Phages                     |
| BCT | Bacteria                   |
| INV | Invertebrates              |
| VRL | Viruses                    |
| MAM | Other mammals              |
| PLN | Plants                     |
| SYN | Synthetic                  |
| VRT | Other vertebrates          |
| UNA | Unannotated                |
| PRI | Primates                   |
| ROD | Rodents                    |
| HTC | High-throughput cDNA       |

and it does NOT include:

| DB  | Contents                          |
|-----|-----------------------------------|
| GSS | Genome survey sequences           |
| STS | Sequence tagged sites             |
| PAT | Patented sequences                |
| EST | Expressed sequence tags           |
| HTG | High-throughput genomic sequences |
| WGS | Whole-genome shotgun data         |

Of these excluded divisions, the following are downloadable from NCBI as separate databases, in addition to nt:

| DB  | Compressed filename(s)                    |
|-----|-------------------------------------------|
| STS | sts.gz                                    |
| PAT | patnt.gz                                  |
| EST | est_human.gz, est_mouse.gz, est_others.gz |
| HTG | htgs.*tar.gz                              |

The WGS sequences can be downloaded on a per-project basis.
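
As an illustration, and assuming the separate databases sit in the same FTP directory as the nt.gz file used earlier on this page (an assumption; check the NCBI FTP site for the current layout), the patent division could be fetched and unpacked with:

# Hypothetical URL mirroring the nt.gz location shown above
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/patnt.gz
gunzip patnt.gz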