Centrifuge nt
The Centrifuge nt database is the NCBI nt database pre-processed for use with Centrifuge, allowing rapid and sensitive classification across a huge range of organisms.
For further details about the NCBI nt database, please consult the NCBI nt database section on this page.
The most straightforward way of generating an updated version of the nt database is using the Makefile provided with Centrifuge:
cd your_centrifuge_folder/indices
make THREADS=16 nt
or whatever number of threads you can dedicate to building the nt database. As the build takes some time, the larger this number, the better (up to the scalability limit of the machine running the code). In addition to Centrifuge, you will need `dustmasker` in your path, a program that identifies and masks out low-complexity regions. It is part of the NCBI BLAST command-line applications. If you don't want this masking, you can pass `DONT_DUSTMASK=1` to `make`.
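As a minimal sketch of the choice just described, the helper below prints the arguments for `make` depending on whether the masking tool is found in the path. The function name `nt_make_args` is hypothetical (it is not part of Centrifuge), and silently falling back to `DONT_DUSTMASK=1` is only one option; you may prefer to stop and install BLAST+ instead.

```shell
# Hypothetical helper: print the arguments to pass to `make` for the nt target.
# $1: number of threads
# $2: masking tool to look for in the path (defaults to dustmasker)
nt_make_args() {
    if command -v "${2:-dustmasker}" >/dev/null 2>&1; then
        printf 'THREADS=%s nt\n' "$1"
    else
        # Tool not found: warn and skip the masking stage.
        echo "warning: ${2:-dustmasker} not found, building without masking" >&2
        printf 'THREADS=%s DONT_DUSTMASK=1 nt\n' "$1"
    fi
}
```

From `your_centrifuge_folder/indices`, this could be used as `make $(nt_make_args 16)`.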
If this "automatic" build fails, or if you want to take control of each stage, the rest of this page gives step-by-step instructions to generate your own updated version of the database.
You will need high performance computing (HPC) resources to generate the Centrifuge nt database. Typically, a current fat node will do the job. The last successful build required 128 cores, 2 tebibytes of memory, a fast scratch storage system, and more than a week. These are the step-by-step instructions:
- The first step is to download the NCBI nt database and unzip it (both operations will take some time, as the file is hundreds of GiB in size):
mkdir nt
cd nt
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
gunzip nt.gz
mv -v nt nt.fa
- Do the same with the taxdump database (this is the shortest step):
mkdir taxonomy
cd taxonomy
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xvzf taxdump.tar.gz
cd ..
- Next, generate the accession-to-taxid mapping file (which exceeded 16 GiB as of Nov 2021) using the following commands:
wget "ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_*.accession2taxid.gz"
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/pdb.accession2taxid.gz
gunzip -c *.accession2taxid.gz | awk -v OFS='\t' '{print $2, $3}' >> acc2tax.map
- This step is optional (when building with the Makefile, you skip it by passing `DONT_DUSTMASK=1` to `make`). It masks low-complexity sequences using DustMasker, an NCBI BLAST command-line application that you should install on your own (with the rest of the NCBI BLAST+ tools from here or alone from here). We will run `dustmasker` with the DUST level (score threshold for subwindows) set to 20, which is the default. Finally, all the masked nucleotides in the DustMasker output will be remasked as `N` using `sed`:
mv nt.fa nt_unmasked.fa
dustmasker -infmt fasta -in nt_unmasked.fa -level 20 -outfmt fasta | sed '/^>/! s/[^AGCT]/N/g' > nt.fa
- Last but not least, we issue the Centrifuge command that will generate the Centrifuge nt database (this is the part that actually benefits from high performance computing):
centrifuge-build --ftabchars=14 -p 32 --bmax 1342177280 --conversion-table acc2tax.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp nt.fa nt
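Before committing many hours of compute, the two text-processing steps in the list above (building `acc2tax.map` and remasking the DustMasker output) can be tried on toy input. The sketch below wraps each in a small function; the names `acc2tax_lines` and `remask_to_n` are hypothetical. Note one deliberate difference from the command above: each NCBI accession2taxid file begins with a header line (accession, accession.version, taxid, gi), and this variant drops it. Whether `centrifuge-build` tolerates those header lines in the map is not stated on this page, so treat the filter as a precaution rather than a requirement.

```shell
# Hypothetical helper: read accession2taxid-formatted text on stdin and
# print "accession.version<TAB>taxid", skipping the per-file header line.
acc2tax_lines() {
    awk -F'\t' -v OFS='\t' '$1 != "accession" {print $2, $3}'
}

# Hypothetical helper: on sequence lines (those not starting with '>'),
# replace every character other than A, G, C or T with N. This turns the
# lowercase (soft-masked) bases in the DustMasker output into N.
remask_to_n() {
    sed '/^>/! s/[^AGCT]/N/g'
}

# Toy run, safe to execute anywhere:
#   printf '>seq1\nACGTacgtAC\n' | remask_to_n
```

On real data these would slot into the pipelines as `gunzip -c *.accession2taxid.gz | acc2tax_lines > acc2tax.map` and `dustmasker ... -outfmt fasta | remask_to_n > nt.fa`.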
When it completes, centrifuge-build will finish with a line like this:
Total time for call to driver() for forward index: HH:MM:SS
That is the time of the last step. In our case (using 32 cores) it took more than 20 hours. As this is not a short time, if `centrifuge-build` is launched in an interactive session rather than through a batch system, I strongly recommend protecting the process from unintentional interruptions, for instance with `nohup`:
nohup centrifuge-build --ftabchars=14 -p 32 --bmax 1342177280 --conversion-table acc2tax.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp nt.fa nt &
Then use `tail -f nohup.out` to safely follow the progress.
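If you would rather check for completion from a script than watch the log, a minimal sketch is to search the log for the final timing line quoted above. The function name `build_finished` is hypothetical, and the check assumes the log is the `nohup.out` file produced by the command above (or another file you pass in).

```shell
# Hypothetical helper: succeed only if the log contains the final timing
# line that centrifuge-build prints when the forward index is done.
# $1: log file (defaults to nohup.out)
build_finished() {
    grep -qF 'Total time for call to driver() for forward index' "${1:-nohup.out}"
}
```

For example, `build_finished && ls -lh nt.*.cf` lists the index files only once the build has reported its total time.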
As an alternative to generating your own database, you can download the latest version of the nt database prepared by the Centrifuge team here: ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/
If you need a previous version, this is one I generated with the procedure described above:
- Version: NCBI nt database downloaded in August 2017.
- 7z ultra-compressed file (58.2 Gb): DOWNLOAD nt.cf.7z FILE
- Contents: The uncompressed files are `nt.1.cf`, `nt.2.cf` and `nt.3.cf`, with a total size of about 90 GiB.
- The md5 file: DOWNLOAD nt.cf.7z.md5 FILE
After downloading both files, you can check the MD5 checksum with the command:
md5sum -c nt.cf.7z.md5
You can extract the contents just with:
7z e nt.cf.7z
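Since extracting a damaged 58 GB archive wastes a lot of time, the two commands above can be chained so that extraction only happens when the checksum verifies. This is a sketch under two assumptions: `md5sum` is the GNU coreutils version (for the `--status` option), and the `.md5` file sits next to the archive in the current directory. The function name `verify_then_extract` is hypothetical.

```shell
# Hypothetical helper: extract the archive only if its MD5 checksum verifies.
# $1: archive name; expects "$1.md5" in the current directory.
verify_then_extract() {
    if md5sum --status -c "$1.md5"; then
        7z e "$1"
    else
        echo "Checksum mismatch for $1; not extracting" >&2
        return 1
    fi
}
```

Usage would be `verify_then_extract nt.cf.7z`.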
It is not true that the NCBI nt database contains all the sequences from the NCBI nucleotide databases. Currently, it contains "all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS, STS, PAT, EST, HTG, and WGS". So, it does include:
DB | Contents |
---|---|
TSA | Transcriptome shotgun data |
ENV | Environmental samples |
PHG | Phages |
BCT | Bacteria |
INV | Invertebrates |
VRL | Viruses |
MAM | Other mammals |
PLN | Plants |
SYN | Synthetic |
VRT | Other vertebrates |
UNA | Unannotated |
PRI | Primates |
ROD | Rodents |
HTC | High-throughput cDNA |
and it does NOT include:
DB | Contents |
---|---|
GSS | Genome survey sequences |
STS | Sequence tagged sites |
PAT | Patented sequences |
EST | Expressed sequence tags |
HTG | High-throughput genomic sequences |
WGS | Whole-genome shotgun data |
Of these excluded divisions, the following are downloadable from NCBI as databases separate from nt:
DB | Compressed filename(s) |
---|---|
STS | sts.gz |
PAT | patnt.gz |
EST | est_human.gz, est_mouse.gz, est_others.gz |
HTG | htgs.*tar.gz |
The WGS sequences can be downloaded on a per-project basis.
If you use Recentrifuge in your research, please consider citing the paper. Thanks!
Martí JM (2019) Recentrifuge: Robust comparative analysis and contamination removal for metagenomics. PLOS Computational Biology 15(4): e1006967. https://doi.org/10.1371/journal.pcbi.1006967