1. COMPILE A SET OF TRAINING AND TEST GENES
You will need a set of sequences with bona fide gene structures. Currently,
only the coding parts of the gene are important (plus a short window upstream).
Generally, the more genes you have the better. So far I haven't tried
it with fewer than 200 genes for training. Also make sure that the number
of multi-exon genes in particular is large. These are needed to train the
introns. Needless to say, the gene structures of these genes should be
as accurate as possible. However, they do not have to be 100% correct,
nor does the annotation have to be complete.
It is more important that the start codon is correctly annotated than that
the stop codon is correctly annotated. Your set of annotated sequences
should be non-redundant. If two different sequences contain genes with
almost identical annotated amino acid sequences, then delete
one of them. (My criterion is: No two genes in the set are more than 70% identical on the amino acid level.)
Non-redundancy is very important for avoiding overfitting, and it is just as important if you want to measure the accuracy
on a test set. Each sequence can contain one or more genes, and the genes can
be on either strand. However, the genes must not overlap, and only one
transcript per gene is allowed. Store the sequences together with
their annotation in a simple genbank format. For the exact format
that the training program can read, look at one of the example training genbank
files on the augustus web server: http://augustus.gobics.de/datasets/
Randomly split the set of annotated sequences into a training and a test set. Let's call these two files myspecies.train.gb and myspecies.test.gb. For the test accuracy to be statistically meaningful, the test set should also be large enough (100-200 genes). The script that splits the large genbank file into these two files must split the set randomly. Do not just take the first and the last part of the file, as the test set is then unlikely to be representative. The script 'randomSplit.pl' in the scripts directory does it correctly.
In addition to the complete genes you may specify a set of splice site sequences that should be used for training.
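To illustrate what a random split means here, the following is a minimal Python sketch. It is not the implementation of randomSplit.pl. It assumes that every genbank record ends with a line containing only "//", and the function name split_genbank and its arguments are hypothetical.

```python
import random

def split_genbank(text, n_test, seed=0):
    """Split multi-record genbank text into (train, test) lists of records.

    Assumption: each record is terminated by a line that contains only "//".
    """
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "//":          # end of one record
            records.append("".join(current))
            current = []
    rng = random.Random(seed)             # fixed seed makes the split reproducible
    rng.shuffle(records)                  # random order, not "first part / last part"
    return records[n_test:], records[:n_test]
```

Writing "".join(train) and "".join(test) to myspecies.train.gb and myspecies.test.gb would then give the two files.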
Options for compiling a set of gene structures:
NOT ENOUGH TRAINING DATA OR ONLY CODING SEQUENCES AVAILABLE
In case you want to train for species X but do not have
enough training data for X you may consider the following 'artificial monster
gene trick'. This trick allows you to train the coding content models using one species
and the signal models and length distributions using another species.
You need enough (200-300 genes) training data from another
species Y which is similar to X especially in intron and exon length distributions
and signal patterns. You take all the coding sequences you have for X (or
a species with a very similar coding content), exclude the start and stop
codons and concatenate these. Let s be that sequence. Then you create an
artificial genomic sequence g like this:
...non-coding sequence from X... ATG sssssssssssssssss TAA ...non-coding sequence from X...
                                     (n copies of s)
Then you add this monster gene sequence g and all the
available training data from X to your training set from species Y and
train AUGUSTUS with this extended training set. AUGUSTUS will train the
intron and exon length distributions, splice site patterns, translation start patterns, branch
point regions, etc. from species Y and X, weighted by the occurrences of
these signals or lengths, respectively. If n is large enough, the amount
of coding sequence in g will outweigh the amount of coding sequence
in Y, so AUGUSTUS uses approximately the coding content of species X. If
s is too short, then n should not be very large, as this would overfit the
model to the sequences in s.
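The construction of g can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name monster_gene and its arguments are hypothetical, the coding sequences are assumed to be complete and in frame (length divisible by 3), and start and stop codons are stripped only when they are present.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}       # standard code; adjust for other codes

def monster_gene(coding_seqs, n, flank_5, flank_3, stop="TAA"):
    """Build g = flank_5 + ATG + (n copies of s) + stop + flank_3."""
    parts = []
    for cds in coding_seqs:
        cds = cds.upper()
        if len(cds) % 3 != 0:
            raise ValueError("coding sequence length is not a multiple of 3")
        if cds[:3] == "ATG":              # exclude the start codon
            cds = cds[3:]
        if cds[-3:] in STOP_CODONS:       # exclude the stop codon
            cds = cds[:-3]
        parts.append(cds)
    s = "".join(parts)                    # concatenated coding content of X
    return flank_5 + "ATG" + s * n + stop + flank_3
```

The result still has to be stored as an annotated genbank record with a single CDS from the ATG to the stop codon before it can be added to the training set.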
2. CREATE A META PARAMETERS FILE FOR YOUR SPECIES
I call parameters such as the size of the window of the splice site models and the order of the Markov model 'meta parameters'. Say you want to train AUGUSTUS for a species called 'myspecies'. Copy the parameter file generic_parameters.cfg to myspecies_parameters.cfg and copy the file generic_weightmatrix.txt to myspecies_weightmatrix.txt. Then adjust myspecies_parameters.cfg according to the comments in this file.
The optional file with the filename given by /IntronModel/splicefile
may contain a list of sequence windows of known splice sites.
The format is as in the following example.
--------example begin-------
dss gccgagaactccgctcgttctgtgcgttctcctgtcccaggtagggaagaggggctgccgggcgcgctctgcgccccgtttc
dss cgtgattgtcggggggaaagacatccagggctccttgcaggtaacacatctgtttgagataacttgggttcaaggaggacat
dss agagaatcagagacagcctttcccaagagatgttggcaaggtaagtcagacaaacagcaaatgacaaaaacatgtttttatg
dss cattgtcactgttgtgtcacctgcgctgctggaccgagaggtgagctgaaaagaataccactttctttttcacgagaataga
dss tgacaaaaatgatcactcaccaaaattcaccaagaaagaggtaaacccctgtgccaaacaccaaccaccactgtggtcacag
ass gttagtatgcttctttaattttttttctccctgaaattataggaaccagatgttaaaaaattagaagaccaacttcaaggcg
ass --------------------------ggctttgtctttgcagaatttatagagcggcagcacgcaaagaacaggtattacta
ass gattccttgtgattagcctctcttgctccttttctccaccagcaaagtcgaccaagaaattatcaacattatgcaggatcgg
ass aaccgtagtaaacagcatgaatcgtgttttgtttttgaacagaccactggccttgtgggattggctgtgtgcaatactcctc
------example end--------
dss: donor (=5') splice site. 40 letters + gt + 40 letters
ass: acceptor (=3') splice site. 40 letters + ag + 40 letters
use '-' for unknown characters
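Before training, it can help to check that every line of the splice file follows the layout described above. The following Python sketch does that. The function name check_splice_line is hypothetical, and the check is stricter than AUGUSTUS itself may be: it accepts only the letters a, c, g, t and '-'.

```python
CONSENSUS = {"dss": "gt", "ass": "ag"}    # dinucleotide expected at the site
FLANK = 40                                # letters on each side of the site

def check_splice_line(line):
    """Return True if line is '<dss|ass> <40 letters><gt|ag><40 letters>'."""
    fields = line.split()
    if len(fields) != 2 or fields[0] not in CONSENSUS:
        return False
    kind, seq = fields[0], fields[1].lower()
    if len(seq) != 2 * FLANK + 2:
        return False
    if not set(seq) <= set("acgt-"):      # '-' stands for an unknown character
        return False
    return seq[FLANK:FLANK + 2] == CONSENSUS[kind]
```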
myspecies_weightmatrix.txt
Changes in myspecies_weightmatrix.txt are often not necessary. If you do not want to mess with this,
also leave /Constant/decomp_num_steps as it is (=1). This section describes the meaning of the meta parameter /Constant/decomp_num_steps
and the matrix in the file myspecies_weightmatrix.txt. For some species it makes sense to let the model
parameters depend on the average frequencies of the 4 bases in the query sequence. For example, in human, the GC content stays
consistently above or below average over long sequence stretches (isochores). AUGUSTUS can use for each piece of the query sequence
(at most maxDNAPieceSize=200kb by default) different parameters that are adjusted to the base composition of this piece.
I describe here only the dependency on the GC content. /Constant/decomp_num_steps is the number of different levels of GC
content that is taken into account, i.e. AUGUSTUS uses /Constant/decomp_num_steps different sets of parameters, each for a different GC content.
Values between 1 and 10 are useful. The GC content ranges between /Constant/gc_range_min and /Constant/gc_range_max. These parameters can be set in
the meta parameters file. Given a target GC content, each sequence in the training set is weighted depending on how similar its GC content
is to the target GC content. It gets an integer weight between 1 and 10. The closer the GC content
of the training sequence is to the target GC content, the higher the weight. The two non-zero numbers in the middle of the 4x4 matrix in myspecies_weightmatrix.txt
(default: 200) determine the influence of the deviation in GC content.
A high value (like 300) means that mostly training sequences with a GC
content very similar to the target GC content are taken into account. The advantage is that the training is more specific to the target GC content.
The disadvantage of a high value is that this effectively reduces the size of the training set.
A low value (like 150) means that all training sequences are taken into account, except
that training sequences with a similar GC content are weighted somewhat more strongly.
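To make the idea of GC content levels concrete, here is a small Python sketch. It is not how AUGUSTUS computes its levels or its weights: the even spacing of the target levels between /Constant/gc_range_min and /Constant/gc_range_max is an assumption made for illustration, and all function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of g and c among the a, c, g, t letters of seq."""
    seq = seq.lower()
    known = sum(seq.count(b) for b in "acgt")
    return (seq.count("g") + seq.count("c")) / known if known else 0.0

def target_levels(gc_min, gc_max, steps):
    """ASSUMPTION: 'steps' evenly spaced target GC contents in [gc_min, gc_max]."""
    if steps == 1:
        return [(gc_min + gc_max) / 2]
    width = (gc_max - gc_min) / (steps - 1)
    return [gc_min + i * width for i in range(steps)]

def nearest_level(seq, levels):
    """Index of the target level closest to the GC content of seq."""
    gc = gc_content(seq)
    return min(range(len(levels)), key=lambda i: abs(levels[i] - gc))
```

With steps=1 there is a single parameter set, which corresponds to leaving /Constant/decomp_num_steps at 1.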
3. RUN THE SCRIPT optimize_augustus.pl
Please check for updates on this part of the documentation.
This script optimizes the prediction accuracy by adjusting the meta parameters in
the myspecies_parameters.cfg file. The script uses the programs augustus and etraining. They must be in the $PATH.
You need to tell optimize_augustus.pl which meta parameters it should optimize. Do this by adjusting the file
config/generic_metapars.cfg. (You may also make a copy of it and then pass the command line parameter
--metapars=nameofmycopy to the script optimize_augustus.pl.)
Run
scripts/optimize_augustus.pl --species=myspecies myspecies.train.gb
In an evaluation step this script does the following 10-fold cross validation. It splits the set myspecies.train.gb randomly into 10 sets (buckets) of equal size (±1). Then it takes 9 of the 10 sets for training and the remaining one for evaluating the prediction accuracy using the annotation of the genbank file. Each of the 10 buckets is used once for evaluation, and in each case the other sequences are used for training. The script computes a single target value that is subject to optimization. The target is a weighted average of the sensitivities and specificities on the base, exon and gene level. (Feel free to change the weighting to your preferences in the function 'sub gettarget'.)
For a meta parameter the script repeats the above evaluation step for different values of the meta parameter. If it finds an improvement in the target value it adjusts the value of the meta parameter in your myspecies_parameters.cfg file. It tries new values for the parameter until it finds no more improvement. Then it optimizes the next meta parameter. When it has optimized all meta parameters once, it starts again with the first meta parameter. It does at most 5 rounds of optimization, but stops earlier if no improvements are found. This script probably has to run overnight. When it is done your myspecies_parameters.cfg will hopefully have well suited meta parameters for your species and your training set.
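The procedure described above can be sketched in Python. This is a simplified illustration, not the code of optimize_augustus.pl. The names make_buckets, cross_validation_folds and optimize are hypothetical. The evaluate argument stands for "train on 9 buckets, predict on the tenth, compute the target value", and, unlike the script, the sketch simply tries every candidate value of a meta parameter in each round.

```python
import random

def make_buckets(items, k=10, seed=0):
    """Randomly split items into k buckets whose sizes differ by at most 1."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::k] for i in range(k)]

def cross_validation_folds(items, k=10, seed=0):
    """Yield (training items, evaluation items), once for every bucket."""
    buckets = make_buckets(items, k, seed)
    for i, held_out in enumerate(buckets):
        train = [x for j, b in enumerate(buckets) if j != i for x in b]
        yield train, held_out

def optimize(params, candidates, evaluate, max_rounds=5):
    """Optimize one meta parameter at a time, for at most max_rounds rounds."""
    params = dict(params)
    best = evaluate(params)
    for _ in range(max_rounds):
        improved = False
        for name, values in candidates.items():
            for value in values:
                trial = dict(params, **{name: value})
                score = evaluate(trial)
                if score > best:          # keep a value only if the target improves
                    params, best, improved = trial, score, True
        if not improved:                  # stop early when a round finds nothing
            break
    return params, best
```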
After optimize_augustus.pl has finished you have to train AUGUSTUS with the metaparameters it has found.
etraining --species=myspecies myspecies.train.gb
If you have a test set, you can now check the prediction accuracy on this test set by running
augustus --species=myspecies myspecies.test.gb
The end of the output will then contain a summary of the accuracy of
the prediction. If the gene level sensitivity is below 20% it is likely
that the training set is not large enough, that its quality is
poor, or that the species is somehow 'special'.
If you succeeded in creating a good AUGUSTUS version for your
species I would be very interested in your results. If possible please
share your results and give me your myspecies_parameters.cfg and the test
and training set.
4. SPECIAL CASE: ORGANISM WITH DIFFERENT GENETIC CODE
AUGUSTUS can be told to use a different translation table,
in particular one with a different set of stop codons.
This is useful for a small number of species such as Tetrahymena thermophila, in which some codons translate
to a different amino acid than usual. If you train AUGUSTUS for such a species, set the variable translation_table
in the parameter file of your species. Further, adjust the stop codon probabilities in the same config file. E.g. say
translation_table 6
/Constant/amberprob 0 # Prob(stop codon = tag), if 0 tag is assumed to code for amino acid
/Constant/ochreprob 0 # Prob(stop codon = taa), if 0 taa is assumed to code for amino acid
/Constant/opalprob 1 # Prob(stop codon = tga), if 0 tga is assumed to code for amino acid
in the case of Tetrahymena, where taa and tag code for glutamine (Q).
Choose the translation table number according to the table below. translation_table=1 is
the default value and the standard code with stop codons taa, tga, tag. If you have a species with the standard genetic code you don't have to do anything.
In case your species' code is not covered by this table, send us a note with the string of 64 one-letter amino acid codes in the codon order below.
translation | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | g | g | g | g | g | g | g | g | g | g | g | g | g | g | g | g | t | t | t | t | t | t | t | t | t | t | t | t | t | t | t | t |
table | a | a | a | a | c | c | c | c | g | g | g | g | t | t | t | t | a | a | a | a | c | c | c | c | g | g | g | g | t | t | t | t | a | a | a | a | c | c | c | c | g | g | g | g | t | t | t | t | a | a | a | a | c | c | c | c | g | g | g | g | t | t | t | t |
number | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t | a | c | g | t |
1 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | * | C | W | C | L | F | L | F |
2 | K | N | K | N | T | T | T | T | * | S | * | S | M | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
3 | K | N | K | N | T | T | T | T | R | S | R | S | M | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | T | T | T | T | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
4 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
5 | K | N | K | N | T | T | T | T | S | S | S | S | M | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
6 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | Q | Y | Q | Y | S | S | S | S | * | C | W | C | L | F | L | F |
9 | N | N | K | N | T | T | T | T | S | S | S | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
10 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | C | C | W | C | L | F | L | F |
11 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | * | C | W | C | L | F | L | F |
12 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | S | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | * | C | W | C | L | F | L | F |
13 | K | N | K | N | T | T | T | T | G | S | G | S | M | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
14 | N | N | K | N | T | T | T | T | S | S | S | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | Y | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
15 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | Q | Y | S | S | S | S | * | C | W | C | L | F | L | F |
16 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | L | Y | S | S | S | S | * | C | W | C | L | F | L | F |
21 | N | N | K | N | T | T | T | T | S | S | S | S | M | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | W | C | W | C | L | F | L | F |
22 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | L | Y | * | S | S | S | * | C | W | C | L | F | L | F |
23 | K | N | K | N | T | T | T | T | R | S | R | S | I | I | M | I | Q | H | Q | H | P | P | P | P | R | R | R | R | L | L | L | L | E | D | E | D | A | A | A | A | G | G | G | G | V | V | V | V | * | Y | * | Y | S | S | S | S | * | C | W | C | * | F | L | F |
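Each row of the table can be read as a string of 64 one-letter codes in the codon order aaa, aac, aag, aat, aca, ..., ttt. The following Python sketch shows how such a string maps codons to amino acids. The two strings are copied from rows 1 and 6 of the table. The function name translate is hypothetical, and the sketch is not part of AUGUSTUS.

```python
from itertools import product

# Codons in the order of the table columns: aaa, aac, aag, aat, aca, ..., ttt
CODONS = ["".join(c) for c in product("acgt", repeat=3)]

TABLES = {
    1: "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF",
    6: "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVVQYQYSSSS*CWCLFLF",
}

def translate(cds, table=1):
    """Translate a coding sequence; '*' is a stop codon, 'X' an unknown codon."""
    code = dict(zip(CODONS, TABLES[table]))
    cds = cds.lower()
    usable = len(cds) - len(cds) % 3      # ignore an incomplete last codon
    return "".join(code.get(cds[i:i + 3], "X") for i in range(0, usable, 3))
```

For example, taa and tag are stop codons with table 1 but are translated to Q with table 6, which matches the Tetrahymena settings above.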