AUGUSTUS is used in many genome annotation projects. Below are some accuracy values in comparison to other programs. As accuracy measures we use sensitivity (Sn) and specificity (Sp). For a feature type (coding base, exon, transcript, gene), the sensitivity is the number of correctly predicted features divided by the number of annotated features, and the specificity is the number of correctly predicted features divided by the number of predicted features. A predicted exon is considered correct if both of its splice sites are at the annotated positions of an exon. A predicted transcript is considered correct if all exons of the annotated transcript are correctly predicted and the prediction contains no additional exons that are not in the annotation. A predicted gene is considered correct if any of its transcripts is correct, i.e. if at least one isoform of the gene is exactly as annotated in the reference annotation.
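The definitions above can be illustrated with a minimal Python sketch. It is not the evaluation code used to produce the tables below. The exon representation `(sequence, strand, start, end)`, the function names, and the toy coordinates are all assumptions made for illustration. Transcripts are compared by their exact exon sets, and a gene counts as correct if any of its transcripts matches.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical representations, chosen only for this sketch.
Exon = Tuple[str, str, int, int]   # (sequence, strand, start, end)
Transcript = FrozenSet[Exon]       # a transcript is identified by its exact exon set
Gene = List[Transcript]            # a gene is a list of alternative transcripts


def sensitivity_specificity(predicted: Set, annotated: Set) -> Tuple[float, float]:
    """Return (Sn, Sp) for two sets of hashable features (exons or transcripts)."""
    correct = len(predicted & annotated)
    sn = correct / len(annotated) if annotated else 0.0
    sp = correct / len(predicted) if predicted else 0.0
    return sn, sp


def gene_is_correct(gene: Gene, other_transcripts: Set[Transcript]) -> bool:
    """A gene is correct if at least one of its transcripts matches exactly."""
    return any(t in other_transcripts for t in gene)


def gene_level(predicted: List[Gene], annotated: List[Gene]) -> Tuple[float, float]:
    """Return gene-level (Sn, Sp) under the 'any isoform matches' rule."""
    annotated_tx = {t for g in annotated for t in g}
    predicted_tx = {t for g in predicted for t in g}
    found = sum(gene_is_correct(g, predicted_tx) for g in annotated)
    correct = sum(gene_is_correct(g, annotated_tx) for g in predicted)
    sn = found / len(annotated) if annotated else 0.0
    sp = correct / len(predicted) if predicted else 0.0
    return sn, sp


# Toy data (made-up coordinates): two annotated genes, three predicted genes.
gene_a = [frozenset({("chr1", "+", 100, 200), ("chr1", "+", 300, 400)})]
gene_b = [frozenset({("chr1", "-", 1000, 1200), ("chr1", "-", 1500, 1600)})]
annotated_genes = [gene_a, gene_b]

pred_1 = [frozenset({("chr1", "+", 100, 200), ("chr1", "+", 300, 400)})]      # exact match
pred_2 = [frozenset({("chr1", "-", 1000, 1200), ("chr1", "-", 1500, 1650)})]  # one wrong boundary
pred_3 = [frozenset({("chr1", "+", 5000, 5100)})]                             # not annotated
predicted_genes = [pred_1, pred_2, pred_3]

annotated_exons = {e for g in annotated_genes for t in g for e in t}
predicted_exons = {e for g in predicted_genes for t in g for e in t}
exon_sn, exon_sp = sensitivity_specificity(predicted_exons, annotated_exons)

annotated_transcripts = {t for g in annotated_genes for t in g}
predicted_transcripts = {t for g in predicted_genes for t in g}
tx_sn, tx_sp = sensitivity_specificity(predicted_transcripts, annotated_transcripts)

gene_sn, gene_sp = gene_level(predicted_genes, annotated_genes)

print(f"exon       Sn={exon_sn:.2f} Sp={exon_sp:.2f}")
print(f"transcript Sn={tx_sn:.2f} Sp={tx_sp:.2f}")
print(f"gene       Sn={gene_sn:.2f} Sp={gene_sp:.2f}")
```

In this toy case three of the four annotated exons are predicted exactly, but only one of the two annotated transcripts is. This mirrors the tables, where transcript and gene level values are much lower than base and exon level values.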
program | base Sn | base Sp | exon Sn | exon Sp | transcript Sn | transcript Sp | gene Sn | gene Sp |
AUGUSTUS | 99.0 | 90.5 | 92.5 | 80.2 | 68.3 | 47.1 | 80.1 | 51.8 |
Fgenesh++ | 97.6 | 89.7 | 90.4 | 80.9 | 65.5 | 53.4 | 78.3 | 54.2 |
MGENE | 98.7 | 91.9 | 91.0 | 80.6 | 57.7 | 48.0 | 70.6 | 51.1 |
EUGENE | 98.5 | 85.1 | 92.1 | 70.3 | 60.8 | 31.5 | 68.8 | 36.1 |
ExonHunter | 93.7 | 92.0 | 81.2 | 76.9 | 37.2 | 39.7 | 45.6 | 40.5 |
Gramene | 98.2 | 95.4 | 88.5 | 71.8 | 41.7 | 19.6 | 48.7 | 37.2 |
MAKER | 92.9 | 88.5 | 80.7 | 66.3 | 41.3 | 19.6 | 50.7 | 47.6 |
program | base Sn | base Sp | exon Sn | exon Sp | transcript Sn | transcript Sp | gene Sn | gene Sp |
AUGUSTUS | 97.0 | 89.0 | 86.1 | 72.6 | 50.1 | 28.7 | 61.1 | 38.4 |
Fgenesh | 98.2 | 87.1 | 86.4 | 73.6 | 47.1 | 34.6 | 57.8 | 35.4 |
GeneMark.hmm | 98.3 | 83.1 | 83.2 | 65.6 | 37.7 | 24.0 | 46.3 | 24.5 |
MGENE | 97.2 | 91.5 | 84.6 | 78.6 | 44.6 | 40.9 | 54.8 | 42.3 |
GeneID | 93.9 | 88.2 | 77.0 | 68.6 | 36.2 | 22.8 | 44.4 | 25.1 |
Agene | 93.8 | 83.4 | 68.9 | 61.1 | 9.8 | 13.1 | 12.0 | 14.1 |
CRAIG | 95.6 | 90.9 | 80.2 | 78.2 | 35.7 | 36.3 | 43.8 | 37.8 |
EUGENE | 94.0 | 89.5 | 80.3 | 73.0 | 49.1 | 28.8 | 60.2 | 30.2 |
ExonHunter | 95.4 | 86.0 | 72.6 | 62.5 | 15.5 | 18.6 | 19.1 | 19.2 |
GlimmerHMM | 97.6 | 87.6 | 84.4 | 71.4 | 47.3 | 29.3 | 58.0 | 30.6 |
SNAP | 94.0 | 84.5 | 74.6 | 61.3 | 32.6 | 18.6 | 40.0 | 19.1 |
long human sequences | AUGUSTUS | GENSCAN | GENEID | GENEMARK | GENEZILLA |
base level sensitivity | 78.65% | 84.17% | 76.77% | 76.09% | 87.56% |
base level specificity | 75.29% | 60.60% | 76.48% | 62.94% | 50.93% |
exon level sensitivity | 52.39% | 58.65% | 53.84% | 48.15% | 62.08% |
exon level specificity | 62.93% | 46.37% | 61.08% | 47.25% | 50.25% |
gene level sensitivity | 24.32% | 15.54% | 10.47% | 16.89% | 19.59% |
gene level specificity | 17.22% | 10.13% | 8.78% | 7.91% | 8.84% |
long Drosophila sequence | AUGUSTUS | GENEID | GENIE |
base level sensitivity (std1) | 98% | 96% | 96% |
base level specificity (std3) | 93% | 92% | 92% |
exon level sensitivity (std1) | 86% | 71% | 70% |
exon level specificity (std3) | 66% | 62% | 57% |
gene level sensitivity (std1) | 71% | 47% | 40% |
gene level specificity (std3) | 39% | 33% | 29% |
multi-gene sequences | AUGUSTUS |
base level sensitivity | 97% |
base level specificity | 72% |
exon level sensitivity | 89% |
exon level specificity | 70% |
gene level sensitivity | 62% |
gene level specificity | 39% |
adh222 is a single sequence of Drosophila melanogaster that is 2.9 Mb long. There are two sets of annotations. The first, smaller set, called std1, was chosen so that the genes in it are likely to be correctly annotated. The second, larger set, called std3, was chosen to be as complete as possible. Accordingly, sensitivity is measured against std1 and specificity against std3, as labeled in the table. This dataset and its annotation were taken from here. In the corrected version, std1 contains 38 genes with a total of 111 exons and std3 contains 222 genes with a total of 909 exons. The genes lie on both strands.
araset is a set of 74 multi-gene sequences of Arabidopsis thaliana with 168 genes in total. The specificity is likely to be underestimated because genes at the boundaries of a sequence are sometimes not annotated, so a correct prediction of such a gene is counted as a false positive.
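The effect of unannotated genes on the measured specificity can be illustrated with a small arithmetic sketch. All counts below are hypothetical and are not taken from araset. They only show the direction of the bias.

```python
def specificity(correct: int, predicted: int) -> float:
    """Specificity as defined on this page: correct predictions / all predictions."""
    return correct / predicted


# Hypothetical counts, for illustration only.
predicted_genes = 10
matching_annotation = 6    # predictions that agree with the reference annotation
real_but_unannotated = 2   # real genes (e.g. at sequence boundaries) missing from the annotation

# What the evaluation reports: unannotated genes count as wrong predictions.
measured_sp = specificity(matching_annotation, predicted_genes)

# What it would report if the annotation were complete.
adjusted_sp = specificity(matching_annotation + real_but_unannotated, predicted_genes)

print(f"measured Sp = {measured_sp:.0%}, Sp with complete annotation = {adjusted_sp:.0%}")
```

Sensitivity is unaffected in this scenario, because its denominator only counts annotated genes.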
The datasets can be downloaded.