
Advice By Input Type

hyattpd edited this page Aug 10, 2014 · 43 revisions

Finished Genomes

We define a finished genome to be a genome where each chromosome or plasmid is in one contig, and there are no runs of N's (gaps).

Finished genomes should be run in Normal Mode.

For genomes where you are sure the first and last bases of the sequence(s) do not fall inside a gene, you should consider the -c option.

   -c, --closed:         Closed ends.  Do not allow partial genes
                         at edges of sequence.
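
As a rough illustration, the small Python helper below assembles a command line for a finished genome and adds -c only when you are sure the sequence ends do not fall inside a gene. The helper name `prodigal_command` and the file name `genome.fna` are hypothetical, and the `-i` input flag is an assumption here; check `prodigal -h` for your version. The helper only builds the argument list and does not run Prodigal.

```python
def prodigal_command(input_fasta, closed_ends=False):
    """Assemble an argument list for a Normal Mode run (sketch only)."""
    # "-i" as the input-file flag is an assumption; confirm with `prodigal -h`.
    args = ["prodigal", "-i", input_fasta]
    if closed_ends:
        # -c / --closed: do not allow partial genes at edges of sequence.
        args.append("-c")
    return args


if __name__ == "__main__":
    # "genome.fna" is a placeholder file name.
    print(" ".join(prodigal_command("genome.fna", closed_ends=True)))
```

The resulting list could be passed to `subprocess.run` once the flags have been checked against your installed version.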

If the genome consists of multiple chromosomes, you can analyze them together or separately. Chromosomes should only be separated if (1) each chromosome is at least 500kb, and (2) you have reason to believe the chromosomes are quite different in terms of GC content, RBS motif usage, and other parameters.

Plasmids are trickier, and it isn't clear what the best approach is. You can either include them alongside the chromosomes (in which case Prodigal will train on the chromosomes and plasmids together), or analyze them separately, as discussed below. Your decision should again be guided by whether the plasmid is similar to or different from the rest of the genome.

Draft Genomes

In most cases, draft genomes should be analyzed in Normal Mode. Prodigal should do fine even if the average contig length is small (3000+ bp). Failing that, the presence of even one long contig is usually sufficient to provide good training data.
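
If you want a quick look at your contig lengths before running Prodigal, a minimal Python sketch like the one below reports the contig count, the mean contig length, and the longest contig in a FASTA file. The function names are hypothetical, and this is not how Prodigal itself computes the figure it prints in its warnings; it is only a simple mean over all records.

```python
def contig_lengths(fasta_text):
    """Return the sequence length of each record in FASTA-formatted text."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize_contigs(fasta_text):
    """Summarize contig count, mean length, and longest contig."""
    lengths = contig_lengths(fasta_text)
    if not lengths:
        return {"contigs": 0, "mean_length": 0.0, "longest": 0}
    return {
        "contigs": len(lengths),
        "mean_length": sum(lengths) / len(lengths),
        "longest": max(lengths),
    }


if __name__ == "__main__":
    example = ">contig1\nACGT\nAC\n>contig2\nACGTACGTAC\n"
    print(summarize_contigs(example))
```

To use it on a real file, read the file contents into a string and pass that string to `summarize_contigs`.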

If Prodigal is having trouble building a good training set (because the sequence is in too many contigs), it will likely output a warning that looks like one of the following:

Warning: Average training set contig length is short at 720.10 bases.
You may get better results with the '-p anon' option.


Warning: Training sequence is highly fragmented.
You may get better results with the '-p anon' option.

By default, Prodigal's parameters are ideal for scaffolds and/or multi-FASTA files with many contigs. Partial genes are allowed to run into gaps of N's, which means you should get the same results whether you analyze 1000 contigs in one file or one scaffold with the 1000 contigs joined together by runs of N's. In addition, genes are allowed to run off the edges. You should never use the -c option with draft genomes.

Prodigal can handle gaps (defined as two or more consecutive codons of completely ambiguous characters) in several ways, using the -e option:

  -e, --gap_mode:       Specify gap-handling behavior.
                          0:    Partial genes run into gaps.
                                (Default)
                          1:    Genes cannot run into gaps.
                          2:    Do not treat N's as gaps.
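
To see where the gaps in a scaffold are before choosing a gap mode, you could scan for runs of N's. The Python sketch below approximates the definition above as any run of six or more N's (two codons' worth), ignoring reading frame and other ambiguity codes. That threshold and the function name `find_gaps` are assumptions made for illustration, not Prodigal's actual implementation.

```python
import re

# Approximation: "two or more consecutive codons of completely ambiguous
# characters" is treated here as a run of at least 6 N's, in any frame.
GAP_RUN = re.compile(r"[Nn]{6,}")


def find_gaps(sequence):
    """Return (start, end) pairs, 0-based and end-exclusive, for each N run."""
    return [(m.start(), m.end()) for m in GAP_RUN.finditer(sequence)]


if __name__ == "__main__":
    # One 6-base run of N's (reported) and one 3-base run (too short to report).
    scaffold = "ACGT" + "N" * 6 + "ACG" + "NNN" + "TT"
    for start, end in find_gaps(scaffold):
        print(f"gap at {start}-{end} ({end - start} bases)")
```

Listing the gap lengths this way can also help you judge whether short runs of N's in a low quality sequence are real gaps at all.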

In some rare cases, where you are certain you have exactly the correct number of N's in all your gaps (so as to preserve reading frame), you might choose the -e 2 option, which allows Prodigal to build gene models that span the gap. You might also use this option if your sequence is low quality and contains many short runs of N's that are not meant to be treated as gaps.

Prodigal 2.x: Older versions of Prodigal do not contain gap handling (except for the -m option, which acts similarly to -e 1 above, but requires a run of 50 N's before it considers it a gap).

If you feel like your draft genome is in too many contigs to get a good result (and definitely if you see the warnings shown above), an alternative is to find a closely related genome that is finished, train on it, and use that training file to analyze your highly fragmented draft genome. This process is described in the section on Training Mode.

If your genome is a low-quality draft, and you do not have a high-quality, closely related genome to train on, you should analyze the sequence in Anonymous Mode.

Metagenomes

Alternate Genetic Codes

Organisms with Gene Decay

Plasmids, Viruses, and Other Short Sequences