Centrifuge nt

Overview

The Centrifuge nt database is the NCBI nt database pre-processed so that it can be used by Centrifuge, enabling rapid and sensitive classification across a huge range of organisms.

For further details about the NCBI nt database, please consult the NCBI nt Database section below.

Fast-paced way

The most straightforward way of generating an updated version of the nt database is to use the Makefile provided with Centrifuge:

cd your_centrifuge_folder/indices
make THREADS=16 nt

Set THREADS to whatever number of threads you can dedicate to the build of the nt database. As the build will take some time, the larger this number, the better (within the scalability limits of the computer architecture where you are running the code). In addition to Centrifuge, you will need dustmasker in your path, a program that identifies and masks low-complexity regions; it is part of the NCBI BLAST command-line applications. In case you don't want to do this masking, you can pass DONT_DUSTMASK=1 to make, as shown below.
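
For example, to run the build with 32 threads while skipping the masking step (both variables come straight from the invocation above; the thread count is just an illustration):

cd your_centrifuge_folder/indices
make THREADS=32 DONT_DUSTMASK=1 nt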

If this "automatic" build fails, or if you need or want to take control of each stage, the rest of this page provides instructions to generate your own updated version of the database step by step.

Step by step instructions

You will need high-performance computing (HPC) resources to generate the Centrifuge nt database; typically, a current fat node will do the job. The last successful build required 128 cores, 2 TiB of memory, a fast scratch storage system, and more than a week of runtime. These are the step-by-step instructions:

  1. The first step is to download the NCBI nt database and unzip it (both operations will take some time, as the file is hundreds of GiB in size):
mkdir nt
cd nt
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
gunzip nt.gz
mv -v nt nt.fa
  2. Do the same with the taxdump databases (this is the shortest step):
mkdir taxonomy
cd taxonomy
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xvzf taxdump.tar.gz
cd ..
  3. Generate the accession-to-taxid mapping file (which exceeded 16 GiB as of November 2021) using the following commands (see the note on file headers after this list):
wget "ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_*.accession2taxid.gz"
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz
gunzip -c *.accession2taxid.gz | awk -v OFS='\t' '{print $2, $3}' >> acc2tax.map
  4. This step is optional; it is the one skipped by the Makefile build when you pass DONT_DUSTMASK=1 to make. It masks low-complexity sequences using DustMasker, an NCBI BLAST command-line application that you should install on your own (with the rest of the NCBI BLAST+ tools or as a standalone binary). We will run dustmasker with the DUST level (the score threshold for subwindows) set to 20, which is the default. Finally, all the nucleotides masked in the DustMasker output are replaced with N using sed (a quick sanity check is shown after this list):
mv nt.fa nt_unmasked.fa 
dustmasker -infmt fasta -in nt_unmasked.fa -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > nt.fa
  5. Last but not least, issue the Centrifuge command that will generate the Centrifuge nt database (this is the part that actually benefits from high-performance computing):
centrifuge-build --ftabchars=14 -p 32 --bmax 1342177280 --conversion-table acc2tax.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp nt.fa nt
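
Note on step 3: each accession2taxid file begins with a one-line header (accession, accession.version, taxid, gi), so the command above also copies those header fields into acc2tax.map. They are generally harmless, but if you prefer a clean mapping file, a variant that skips the first line of each file is:

# Run awk once per file so FNR > 1 reliably skips each header line
for f in *.accession2taxid.gz; do
    gunzip -c "$f" | awk -v OFS='\t' 'FNR > 1 {print $2, $3}'
done > acc2tax.map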
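
A quick sanity check for step 4 (a hypothetical check, not part of the original procedure): count the masked bases in the final FASTA. The count also includes IUPAC ambiguity codes converted by the same sed expression, so expect a large number:

# Count N bases in sequence lines only (headers excluded)
grep -v '^>' nt.fa | tr -cd 'N' | wc -c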

The centrifuge-build process will finish with a line like this:

Total time for call to driver() for forward index: HH:MM:SS

That is the time of the last step; in our case (using 32 cores) it took more than 20 hours. As this is not a short time, if centrifuge-build is launched in an interactive session instead of through a batch system, I strongly recommend using some mechanism to protect the process from unintentional interruptions, for instance nohup:

nohup centrifuge-build --ftabchars=14 -p 32 --bmax 1342177280 --conversion-table acc2tax.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp nt.fa nt &

Then use tail -f nohup.out to safely follow the progress.
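
If a batch system is available, that is the safer option. As a minimal sketch, assuming a SLURM cluster (job name, resource figures, and walltime are placeholders to adapt to your site and to the hardware notes above):

#!/bin/bash
#SBATCH --job-name=cf-nt-build
#SBATCH --cpus-per-task=32      # match the -p value passed to centrifuge-build
#SBATCH --mem=2000G             # the nt build is very memory-hungry (see above)
#SBATCH --time=7-00:00:00       # about a week; see the timings reported above
centrifuge-build --ftabchars=14 -p 32 --bmax 1342177280 \
    --conversion-table acc2tax.map --taxonomy-tree taxonomy/nodes.dmp \
    --name-table taxonomy/names.dmp nt.fa nt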

Download [OBSOLETE]

As an alternative to generating your own database, you can download the latest version of the nt database prepared by the Centrifuge team here: ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/

If you need a previous version, this is one I generated with the procedure described above:

  • Version: NCBI nt database downloaded in August 2017.
  • 7z ultra-compressed file (58.2 GiB): DOWNLOAD nt.cf.7z FILE
  • Contents: The files uncompressed are nt.1.cf, nt.2.cf and nt.3.cf, with a total size of about 90 GiB.
  • The md5 file: DOWNLOAD nt.cf.7z.md5 FILE

After downloading both files, you can check the MD5 checksum with the command:

md5sum -c nt.cf.7z.md5

You can extract the contents simply with:

7z e nt.cf.7z
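
After extraction, a quick way to verify that the index is usable is centrifuge-inspect, which ships with Centrifuge (the options below are the ones listed in the Centrifuge manual):

centrifuge-inspect -s nt                     # print a summary of the index
centrifuge-inspect --name-table nt | head    # first entries of the taxid-to-name table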

Details

NCBI nt Database

Contrary to what its name may suggest, the NCBI nt database does not contain all the sequences from the NCBI nucleotide databases. Currently, it contains "all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS, STS, PAT, EST, HTG, and WGS". So, it does include:

| DB  | Contents                   |
|-----|----------------------------|
| TSA | Transcriptome shotgun data |
| ENV | Environmental samples      |
| PHG | Phages                     |
| BCT | Bacteria                   |
| INV | Invertebrates              |
| VRL | Viruses                    |
| MAM | Other mammals              |
| PLN | Plants                     |
| SYN | Synthetic                  |
| VRT | Other vertebrates          |
| UNA | Unannotated                |
| PRI | Primates                   |
| ROD | Rodents                    |
| HTC | High-throughput cDNA       |

and it does NOT include:

| DB  | Contents                          |
|-----|-----------------------------------|
| GSS | Genome survey sequences           |
| STS | Sequence tagged sites             |
| PAT | Patented sequences                |
| EST | Expressed sequence tags           |
| HTG | High-throughput genomic sequences |
| WGS | Whole-genome shotgun data         |

Of these excluded divisions, the following are downloadable from NCBI as separate databases, in addition to nt:

| DB  | Compressed filename(s)                    |
|-----|-------------------------------------------|
| STS | sts.gz                                    |
| PAT | patnt.gz                                  |
| EST | est_human.gz, est_mouse.gz, est_others.gz |
| HTG | htgs.*tar.gz                              |

The WGS sequences can be downloaded on a per-project basis.
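
As an illustration, and assuming the separate databases sit in the same FTP directory as the nt.gz file used earlier on this page (an assumption; check the NCBI FTP site for the current layout), the patent division could be fetched and unpacked with:

# Hypothetical URL mirroring the nt.gz location shown above
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/patnt.gz
gunzip patnt.gz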