
Advice By Input Type

hyattpd edited this page Aug 11, 2014 · 43 revisions

Finished Genomes

We define a finished genome to be a genome where each chromosome or plasmid is in one contig, and there are no runs of N's (gaps).

Finished genomes should be run in Normal Mode.

For genomes where you are sure the first and last bases of the sequence(s) do not fall inside a gene, you should consider the -c option.

   -c, --closed:         Closed ends.  Do not allow partial genes
                         at edges of sequence.

If the genome consists of multiple chromosomes, you can analyze them together or separately. Chromosomes should only be separated if (1) each chromosome is at least 500kb, and (2) you have reason to believe the chromosomes are quite different in terms of GC content, RBS motif usage, and other parameters.
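To check the two criteria above before deciding, you could summarize each sequence's length and GC content. This is a minimal sketch and not part of Prodigal: the function names are hypothetical, the 500,000 bp default simply mirrors the 500kb guideline, and RBS motif usage is not examined at all.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of {sequence id: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {k: "".join(v) for k, v in records.items()}


def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T)."""
    counted = sum(seq.count(b) for b in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def chromosome_summary(text, min_len=500_000):
    """Report length, GC fraction, and whether each sequence reaches min_len."""
    summary = {}
    for name, seq in read_fasta(text).items():
        summary[name] = {
            "length": len(seq),
            "gc": gc_fraction(seq),
            "long_enough": len(seq) >= min_len,  # assumed cutoff: the 500kb guideline
        }
    return summary
```

If every chromosome is long enough and the GC fractions differ noticeably, separate runs may be worth trying. How large a difference counts as "quite different" is a judgment call that this sketch does not make for you.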

Plasmids are trickier, and it isn't clear what the best approach is. They can either be included alongside the chromosomes (in which case Prodigal will train on the chromosomes and plasmids together), or you can analyze them separately, as discussed below. Your decision should be guided again by whether or not the plasmid is similar to or different from the rest of the genome.

Draft Genomes

In most cases, draft genomes should be analyzed in Normal Mode. Prodigal should do fine even if the average contig length is small (3000+ bp). In addition, the presence of even one long contig is usually sufficient to provide good training data.
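To get a rough idea of how fragmented a draft is before running Prodigal, you could compute the contig count, mean contig length, and longest contig yourself. This is a sketch with hypothetical function names. It measures the whole input file, which is not necessarily how Prodigal computes the length of its training set.

```python
def contig_lengths(text):
    """Return the length of each sequence in FASTA text, in file order."""
    lengths = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def contig_stats(text):
    """Summarize contig count, mean length, and longest contig."""
    lengths = contig_lengths(text)
    if not lengths:
        return {"count": 0, "mean": 0.0, "longest": 0}
    return {
        "count": len(lengths),
        "mean": sum(lengths) / len(lengths),
        "longest": max(lengths),
    }
```

A mean well below the 3000 bp figure mentioned above, with no single long contig, suggests the training set may be weak.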

If Prodigal is having trouble building a good training set (due to the sequence being in too many contigs), it will output warnings that look like this:

Warning: Average training set contig length is short at 720.10 bases.
You may get better results with the '-p anon' option.


Warning: Training sequence is highly fragmented.
You may get better results with the '-p anon' option.

By default, Prodigal's parameters are ideal for scaffolds and/or multiple FASTA files with many contigs. Partial genes are allowed to run into gaps of N's, which means you should get the same results analyzing 1000 contigs in one file, or analyzing one scaffold with the 1000 contigs joined together by runs of N's. In addition, genes are allowed to run off the edges of a sequence. You should never use the -c option with draft genomes.

Prodigal can handle gaps (defined as two or more consecutive codons of completely ambiguous characters) in a variety of ways, using the -e option:

  -e, --gap_mode:       Specify gap-handling behavior.
                          0:    Partial genes run into gaps.
                                (Default)
                          1:    Genes cannot run into gaps.
                          2:    Do not treat N's as gaps.
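To see where such gaps fall in a sequence, you could scan for runs of N's. The sketch below is an approximation, not Prodigal's own detection logic. It assumes "completely ambiguous" means the character N, and it treats any run of at least six N's (two codons' worth) as a gap regardless of reading frame. The function name is hypothetical.

```python
import re


def find_gaps(seq, min_codons=2):
    """Return (start, end) spans of N runs at least 3 * min_codons long.

    Coordinates are 0-based with an exclusive end, as in Python slices.
    """
    min_len = 3 * min_codons
    pattern = re.compile(r"N{%d,}" % min_len, re.IGNORECASE)
    return [(m.start(), m.end()) for m in pattern.finditer(seq)]
```

Listing the gaps this way can help you judge whether short runs of N's in a low quality sequence are really meant to be gaps before choosing a -e setting.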

In some rare cases, where you are certain you have exactly the correct number of N's in all your gaps (so as to preserve reading frame), you might choose the -e 2 option, which would allow Prodigal to build gene models that span the gap. You might also use this option if your sequence is low quality and contains many short runs of N's that are not meant to be treated as gaps.

Prodigal 2.x: Older versions of Prodigal do not contain gap handling (except for the -m option, which acts similarly to -e 1 above, but requires a run of 50 N's before it considers it a gap).

If you feel like your draft genome is in too many contigs to get a good result (or if you see the warnings shown above), an alternative is to find a closely related genome that is finished, train on it, and use that training file to analyze your highly fragmented draft genome. This process is described in the section on Training Mode.

If your genome is a low quality draft, and you do not have a high quality, closely related genome to train on, you should analyze the sequence in Anonymous Mode.

Metagenomes

The simplest approach for metagenomes is to put all the sequences in one FASTA file and analyze them in Anonymous Mode. This will produce reasonable results (about 95% as good as if Prodigal had been trained on the actual genomes). It also has the advantage of being easily parallelized, as each sequence in the file can be processed independently from any other sequence in the file.
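Because each sequence is processed independently in Anonymous Mode, one way to parallelize is to split the multiple FASTA file into chunks and run a separate job on each. The sketch below only does the splitting. The function names are hypothetical, and round-robin assignment is an arbitrary choice that balances record counts rather than total bases.

```python
def split_fasta_records(text):
    """Split FASTA text into a list of single-record strings."""
    records = []
    current = []
    for line in text.splitlines():
        if line.startswith(">"):
            if current:
                records.append("\n".join(current) + "\n")
            current = [line]
        elif current and line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


def chunk_records(records, n_chunks):
    """Distribute records round-robin into at most n_chunks FASTA strings."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [[] for _ in range(n_chunks)]
    for i, rec in enumerate(records):
        chunks[i % n_chunks].append(rec)
    # Drop empty chunks when there are fewer records than chunks.
    return ["".join(c) for c in chunks if c]
```

Each returned string can be written to its own file and analyzed separately. Concatenating the per-chunk outputs afterwards is left to the reader.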

A better solution, when possible, is to assemble as many genomes as you can from the sample, put each genome into its own FASTA file, and analyze each genome using Normal Mode. You can then analyze the leftovers using Anonymous Mode.

Similarly, you might bin the sequences using a classification program (these programs usually rely on GC content, BLAST searches, or other information). You could then make a multiple FASTA file from each bin and analyze it using Normal Mode.

TIP: Never analyze a multiple FASTA file containing sequences from more than one genome using Normal Mode. The only exception to this rule would be if the genomes are closely related (strains of the same species).

Both of the above solutions should produce better results than Anonymous Mode, since Prodigal always does better when it can train on the sequence itself rather than relying on preset training files. These methods involve a lot of preprocessing work, though, and cannot be run as easily in parallel. The fastest solution is just to use Anonymous Mode.

Alternate Genetic Codes

Prodigal supports all genetic codes defined by NCBI. Most bacteria and archaea use genetic code 11, which uses three stop codons (TAA, TGA, and TAG). Some bacteria do not use TGA as a stop codon. Mycoplasma, Spiroplasma, and Ureaplasma translate UGA to tryptophan (W) (genetic code 4), while bacteria using genetic code 25 translate UGA to glycine (G).
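The practical effect of the choice of genetic code can be illustrated with a toy open reading frame. The sketch below is not Prodigal code and the function name is hypothetical. It only counts the codons before the first in-frame stop, using the stop codon sets of NCBI tables 11, 4, and 25.

```python
# Stop codons per NCBI translation table (DNA alphabet).
STOP_CODONS = {
    11: {"TAA", "TAG", "TGA"},
    4: {"TAA", "TAG"},   # TGA codes for tryptophan
    25: {"TAA", "TAG"},  # TGA codes for glycine
}


def orf_length(seq, table=11):
    """Count codons from position 0 up to, but not including, the first stop."""
    stops = STOP_CODONS[table]
    seq = seq.upper()
    count = 0
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            break
        count += 1
    return count


toy = "ATGAAATGAAAATAA"  # ATG AAA TGA AAA TAA
print(orf_length(toy, table=11))  # 2: TGA ends the frame
print(orf_length(toy, table=4))   # 4: TGA is read through
```

Reading a code 4 or code 25 genome with code 11 truncates genes at every in-frame TGA, so predicted genes come out shorter than they should. Tables 4 and 25 share the same stop codons, so stop positions alone cannot tell them apart.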

By default, Prodigal tries genetic code 11. If the average gene length is too low, it tries genetic code 4. If the average gene length is still too low, it reverts to genetic code 11 and outputs a warning. This looks like the following:

Building training set using genetic code 11...done!
Checking average training gene length...459.7, too low.
Trying genetic code 4...still bad, switching back to genetic code 11.
Redoing genome with genetic code 11...done.

Warning: Average training gene length is low (459.7).
Double check translation table or check for pseudogenes/gene decay.

Examining upstream regions and training starts...done.

Prodigal cannot automatically distinguish between genetic code 4 and genetic code 25. In such cases, it will likely choose genetic code 4, and you will need to rerun manually using genetic code 25.

Autodetection is highly reliable (we tested it on 20,000 genomes and it made no errors in genetic code determination). However, you can also specify the genetic code explicitly using the -g option.

  -g, --trans_table:    Specify a translation table to use.
                          auto: Tries 11 then 4 (Default)
                          11:   Standard Bacteria/Archaea
                          4:    Mycoplasma/Spiroplasma
                          #:    Other genetic codes 1-25

This will be necessary for any organism using genetic code 25, and potentially for some organisms using genetic code 4 that are not recognized by the autodetection. If you know the genetic code, you might as well override the autodetection and specify it explicitly using this option.

Organisms with Gene Decay

Plasmids, Viruses, and Other Short Sequences