Augustus [datasets]

The following sequence files were used to train AUGUSTUS or to test its accuracy. Some of the datasets are described in the paper “Gene Prediction with a Hidden Markov Model and a new Intron Submodel”, which was presented at the European Conference on Computational Biology in September 2003 and appeared in the proceedings.

Test sets:

human:

h178:

178 single-gene short human sequences
h178.gb.gz (gzipped genbank format)

sag178:

Semi-artificial genomic sequences from Guigó et al.:
sag178.gb.gz (gzipped genbank format)
sag178.fa.gz (gzipped fasta format)
sag178.gff (annotation in gff format)
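
The gzipped fasta files listed on this page can be read without unpacking them first. The following is a minimal sketch using only the Python standard library. The function name read_fasta is hypothetical and not part of AUGUSTUS, and the sketch assumes conventional fasta layout (a ">" header line followed by one or more sequence lines).

```python
import gzip

def read_fasta(path):
    """Yield (header, sequence) pairs from a fasta file, gzipped or plain.

    Hypothetical helper for illustration; not part of AUGUSTUS.
    """
    # Pick the opener from the file extension (assumption: ".gz" means gzip).
    opener = gzip.open if str(path).endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    # Emit the last record in the file.
    if header is not None:
        yield header, "".join(chunks)
```

For example, `for name, seq in read_fasta("sag178.fa.gz"): print(name, len(seq))` would list each sequence name with its length, assuming the file has been downloaded to the working directory.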

fly:

fly100:

100 single-gene sequences from FlyBase:
fly100.gb.gz (gzipped Genbank format)

adh122:

A 2.9 Mb sequence from the Drosophila Adh region (copied from the GASP dataset page)
adh.fa.gz (gzipped fasta format)
adh.std1.gff_corrected (gff format)
adh.std1+3.gff (gff format)
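
The gff annotation files can be inspected with a few lines of code. The sketch below assumes the usual tab-separated gff layout (sequence name, source, feature, start, end, score, strand, frame, then an optional attribute column). The function names parse_gff_line and count_features are hypothetical, and the exact column contents of the files listed here have not been checked.

```python
from collections import Counter

def parse_gff_line(line):
    """Split one gff line into a dict of named fields.

    Hypothetical helper; assumes tab-separated columns in standard gff order.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated gff columns")
    return {
        "seqname": fields[0],
        "source": fields[1],
        "feature": fields[2],
        "start": int(fields[3]),   # 1-based, inclusive
        "end": int(fields[4]),     # 1-based, inclusive
        "score": fields[5],
        "strand": fields[6],
        "frame": fields[7],
        # The attribute column is optional in older gff versions.
        "attributes": fields[8] if len(fields) > 8 else "",
    }

def count_features(lines):
    """Count how often each feature type occurs, skipping comments and blanks."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        counts[parse_gff_line(line)["feature"]] += 1
    return counts
```

Calling `count_features(open("adh.std1+3.gff"))` would then give a quick overview of which feature types an annotation file contains.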

Arabidopsis thaliana:

Araset. 74 sequences with 168 genes.
araset.gb.gz (gzipped genbank format)

Training sets:

human:

Single-gene sequences from GenBank (1284 genes):
human.train.gb.gz (gzipped genbank format)

11739 human splice sites, originally from Guigó et al., but filtered for similarity to h178 and sag178:
splicesites.gz (gzipped flat file)

fly:

320 single-gene sequences from FlyBase, disjoint from fly100:
fly.train.gb.gz (gzipped genbank format)

400 single-gene sequences from FlyBase, disjoint from adh122:
adh.train.gb.gz (gzipped genbank format)

Arabidopsis:

249 single-gene sequences, obtained by removing from the Araball set those sequences that overlap with sequences in Araset:
araball.train.gb (gzipped Genbank format)
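
Several training sets on this page are kept disjoint from the corresponding test sets. This page does not state how overlap was determined. The sketch below only illustrates the general idea under the simplifying assumption that overlap means a shared sequence identifier. The function name drop_overlapping is hypothetical, and this is not the procedure that produced the files.

```python
def drop_overlapping(train_ids, test_ids):
    """Return the training identifiers that do not occur in the test set.

    Assumption for illustration: two sequences "overlap" when they share
    an identifier. The order of the training identifiers is preserved.
    """
    blocked = set(test_ids)  # set lookup keeps the filter linear in size
    return [seq_id for seq_id in train_ids if seq_id not in blocked]
```

A filter based on sequence similarity or genomic coordinates would need an alignment or interval comparison in place of the identifier lookup.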

Coprinus cinereus (a fungus):

851 single-gene sequences predicted by genewise and compiled by Jason Stajich. 261 genes are complete; 590 genes are incomplete at the 3' end. Genes redundant with those in the GenBank annotations were deleted:
cop.genomewise.gb.gz (gzipped Genbank format)

91 sequences containing 93 genes from GenBank. GenBank entries containing nothing but the coding sequence were omitted. Identical or extremely similar genes in GenBank were used only once. This set was first used as a test set for the training set above. The Coprinus version used here:
cop.gb.clean.gb.gz (gzipped Genbank format)