Retraining AUGUSTUS

This manual is intended for those who want to retrain the ab initio version of AUGUSTUS for another species on their own. Please do not rely blindly on this manual or on the scripts and programs: check what they do on your data! This document is a work in progress. If you want us to retrain AUGUSTUS for your species, please contact me at mstanke@gwdg.de.

1. COMPILE A SET OF TRAINING AND TEST GENES
You will need a set of sequences with bona fide gene structures. Currently, only the coding parts of the genes are important (plus a short window upstream). Generally, the more genes you have, the better. So far I haven't tried it with fewer than 200 genes for training. Also make sure that the number of multi-exon genes in particular is large, as these are needed to train the intron models. The gene structures should be as accurate as possible. However, they need not be 100% correct, nor does the annotation have to be complete. It is more important that the start codon is correctly annotated than that the stop codon is.

Your set of annotated sequences should be non-redundant. If genes with almost identical annotated amino acid sequences occur in two different sequences, delete one of them. (My criterion is: no two genes in the set are more than 70% identical at the amino acid level.) Non-redundancy is very important for avoiding overfitting, and it is equally important if you want to measure the accuracy on a test set.

Each sequence can contain one or more genes, and the genes can be on either strand. However, the genes must not overlap, and only one transcript per gene is allowed. Store the sequences together with their annotation in a simple GenBank format. For the exact format that the training program can read, look at one of the training GenBank files on the AUGUSTUS web server as an example: http://augustus.gobics.de/datasets/

Randomly split the set of annotated sequences into a training set and a test set. Let's call these two files myspecies.train.gb and myspecies.test.gb. For the test accuracy to be statistically meaningful, the test set should also be large enough (100-200 genes). The script that splits the large GenBank file into these two files must split the set truly at random! Do not just take the first and the last part of the file, as the test set is then unlikely to be representative. The script 'randomSplit.pl' in the scripts directory does it correctly. In addition to the complete genes, you may specify a set of splice site sequences that should be used for training.
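To illustrate what a random split means here, the following is a minimal Python sketch. It is not the code of randomSplit.pl. It assumes that every GenBank record ends with a line containing only '//', and the function names and the file names in the usage note are hypothetical.

```python
import random


def split_genbank_records(text):
    """Return the GenBank records in `text`; each record ends with a '//' line.

    Anything after the last '//' line is ignored.
    """
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "//":
            records.append("".join(current))
            current = []
    return records


def random_split(records, test_size, seed=None):
    """Shuffle a copy of `records` and return (train, test).

    `test` holds `test_size` randomly chosen records, `train` the rest.
    """
    if not 0 <= test_size <= len(records):
        raise ValueError("test_size must be between 0 and the number of records")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]
```

For example, one could read a combined file (say myspecies.gb), call random_split(records, test_size=100), and write the two lists to myspecies.train.gb and myspecies.test.gb. Fixing the seed only makes the split reproducible; it does not make it less random.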

Options for compiling a set of gene structures:

NOT ENOUGH TRAINING DATA OR ONLY CODING SEQUENCES AVAILABLE

In case you want to train for species X but do not have enough training data for X, you may consider the following 'artificial monster gene trick'. This trick allows you to train the coding content models using one species and the signal models and length distributions using another species. You need enough training data (200-300 genes) from another species Y which is similar to X, especially in intron and exon length distributions and signal patterns. You take all the coding sequences you have for X (or for a species with a very similar coding content), exclude the start and stop codons and concatenate these. Let s be that sequence. Then you create an artificial genomic sequence g like this:

...non-coding sequence from X... ATG sssssssssssssssss TAA ...non-coding sequence from X...
                                     (n copies of s)

Then you add this monster gene sequence g and all the available training data from X to your training set from species Y and train AUGUSTUS with this extended training set. AUGUSTUS will train the intron and exon length distributions, splice site patterns, translation start patterns, branch point regions, etc. from species Y and X, weighted by the occurrences of these signals or lengths, respectively. If n is large enough, the amount of coding sequence in g will outweigh the amount of coding sequence in Y, so AUGUSTUS uses approximately the coding content of species X. If s is short, then n should not be very large, as this would overfit the model to the sequences in s.
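The construction of g can be sketched in Python as follows. This is only an illustration under some assumptions that are not part of the description above: the coding sequences use the standard stop codons, each one is a whole number of codons long after its start and stop codon are removed, and the function name is hypothetical.

```python
def build_monster_gene(coding_seqs, upstream, downstream, n):
    """Build the artificial sequence g: upstream + ATG + n copies of s + TAA + downstream.

    s is the concatenation of all coding sequences without their start and stop codons.
    Returns (g, cds_start, cds_end) with 1-based, inclusive CDS coordinates.
    """
    stops = {"taa", "tag", "tga"}  # assumption: standard stop codons
    parts = []
    for cds in coding_seqs:
        cds = cds.lower()
        if cds.startswith("atg"):
            cds = cds[3:]          # drop the start codon
        if cds[-3:] in stops:
            cds = cds[:-3]         # drop the stop codon
        if len(cds) % 3 != 0:
            # assumption: keep the reading frame intact across the concatenation
            raise ValueError("coding sequence length is not a multiple of 3")
        parts.append(cds)
    s = "".join(parts)
    g = upstream.lower() + "atg" + s * n + "taa" + downstream.lower()
    cds_start = len(upstream) + 1
    cds_end = len(upstream) + 3 + n * len(s) + 3
    return g, cds_start, cds_end
```

The returned coordinates are meant for annotating g as a single-exon gene when it is written to the training GenBank file.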

2. CREATE A META PARAMETERS FILE FOR YOUR SPECIES

I call parameters such as the window size of the splice site models and the order of the Markov model 'meta parameters'. Say you want to train AUGUSTUS for a species called 'myspecies'. Copy the parameter file generic_parameters.cfg to myspecies_parameters.cfg and copy the file generic_weightmatrix.cfg to myspecies_weightmatrix.txt. Then adjust myspecies_parameters.cfg according to the comments in this file.

The optional file with the filename given by /IntronModel/splicefile may contain a list of sequence windows of known splice sites.
The format is as in the following example.
--------example begin-------
dss gccgagaactccgctcgttctgtgcgttctcctgtcccaggtagggaagaggggctgccgggcgcgctctgcgccccgtttc
dss cgtgattgtcggggggaaagacatccagggctccttgcaggtaacacatctgtttgagataacttgggttcaaggaggacat
dss agagaatcagagacagcctttcccaagagatgttggcaaggtaagtcagacaaacagcaaatgacaaaaacatgtttttatg
dss cattgtcactgttgtgtcacctgcgctgctggaccgagaggtgagctgaaaagaataccactttctttttcacgagaataga
dss tgacaaaaatgatcactcaccaaaattcaccaagaaagaggtaaacccctgtgccaaacaccaaccaccactgtggtcacag
ass gttagtatgcttctttaattttttttctccctgaaattataggaaccagatgttaaaaaattagaagaccaacttcaaggcg
ass --------------------------ggctttgtctttgcagaatttatagagcggcagcacgcaaagaacaggtattacta
ass gattccttgtgattagcctctcttgctccttttctccaccagcaaagtcgaccaagaaattatcaacattatgcaggatcgg
ass aaccgtagtaaacagcatgaatcgtgttttgtttttgaacagaccactggccttgtgggattggctgtgtgcaatactcctc
------example end--------

dss: donor (=5') splice site. 40 letters + gt + 40 letters
ass: acceptor (=3') splice site. 40 letters + ag + 40 letters
use '-' for unknown characters
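
A splice file in this format can be checked before training with a few lines of Python. The sketch below only tests the layout described above (82 characters, 'gt' or 'ag' at positions 41-42). The function name is hypothetical, and the restriction to the characters a, c, g, t and '-' is an assumption.

```python
def parse_splice_line(line):
    """Parse one 'dss <window>' or 'ass <window>' line and check the window layout."""
    kind, window = line.split()
    window = window.lower()
    expected = {"dss": "gt", "ass": "ag"}
    if kind not in expected:
        raise ValueError(f"unknown splice site type: {kind!r}")
    if len(window) != 82:
        raise ValueError(f"window has {len(window)} characters, expected 40 + 2 + 40 = 82")
    if window[40:42] != expected[kind]:
        raise ValueError(f"{kind} window must have '{expected[kind]}' at positions 41-42")
    if set(window) - set("acgt-"):
        # assumption: only a, c, g, t and '-' (unknown) are accepted
        raise ValueError("window contains characters other than a, c, g, t and '-'")
    return kind, window
```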

myspecies_weightmatrix.txt
Changes to myspecies_weightmatrix.txt are often not necessary. If you do not want to deal with this, leave /Constant/decomp_num_steps at its default value (=1) as well. This section describes the meaning of the meta parameter /Constant/decomp_num_steps and of the matrix in the file myspecies_weightmatrix.txt.

For some species it makes sense to let the model parameters depend on the average frequencies of the 4 bases in the query sequence. For example, in human, the GC content stays consistently above or below average over long sequence stretches (isochores). For each piece of the query sequence (at most maxDNAPieceSize=200kb by default), AUGUSTUS can use different parameters that are adjusted to the base composition of that piece. I describe here only the dependency on the GC content.

/Constant/decomp_num_steps is the number of different levels of GC content that are taken into account, i.e. AUGUSTUS uses /Constant/decomp_num_steps different sets of parameters, each for a different GC content. Values between 1 and 10 are useful. The GC content levels range between /Constant/gc_range_min and /Constant/gc_range_max. These parameters can be set in the meta parameters file.

Given a target GC content, each sequence in the training set is weighted according to how similar its GC content is to the target GC content. It gets an integer weight between 1 and 10: the closer the GC content of the training sequence is to the target GC content, the higher the weight. The two non-zero numbers in the middle of the 4x4 matrix in myspecies_weightmatrix.txt (default: 200) determine the influence of the deviation in GC content. A high value (like 300) means that mostly training sequences with a GC content very similar to the target GC content are taken into account. The advantage is that the training is more specific to the target GC content; the disadvantage is that this effectively reduces the size of the training set. A low value (like 150) means that all training sequences are taken into account, but training sequences with a similar GC content are weighted somewhat more strongly.
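
When choosing /Constant/gc_range_min and /Constant/gc_range_max it may help to look at the GC content of your training sequences first. The following Python sketch computes it. It does not reproduce the weighting that AUGUSTUS applies internally, and the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of g and c among the a, c, g, t characters of `seq` (other characters are ignored)."""
    seq = seq.lower()
    acgt = sum(seq.count(base) for base in "acgt")
    if acgt == 0:
        raise ValueError("sequence contains no a, c, g or t")
    return (seq.count("g") + seq.count("c")) / acgt


def gc_range(seqs):
    """Smallest and largest GC content over a collection of sequences."""
    values = [gc_content(s) for s in seqs]
    return min(values), max(values)
```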

3. RUN THE SCRIPT optimize_augustus.pl
Please check for updates on this part of the documentation.

This script optimizes the prediction accuracy by adjusting the meta parameters in the myspecies_parameters.cfg file. The script uses the programs augustus and etraining, which must be in your $PATH. You need to tell optimize_augustus.pl which meta parameters it should optimize. Do this by adjusting the file config/generic_metapars.cfg. (You may also make a copy of it and then pass the command line parameter --metapars=nameofmycopy to optimize_augustus.pl.)
Run

scripts/optimize_augustus.pl --species=myspecies myspecies.train.gb

In an evaluation step this script performs the following 10-fold cross validation. It splits the set myspecies.train.gb randomly into 10 sets (buckets) of nearly equal size (the sizes differ by at most 1). It then takes 9 of the 10 sets for training and the remaining one for evaluating the prediction accuracy against the annotation in the GenBank file. The evaluation bucket is rotated through all 10 buckets, and in each case the other sequences are used for training. The script computes a single target value that is subject to optimization. The target is a weighted average of the sensitivities and specificities on the base, exon and gene level. (Feel free to change the weighting to your preferences in the function 'sub gettarget'.)
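
The bucket rotation can be pictured with the following Python sketch. It is an illustration of k-fold cross validation in general, not the code of optimize_augustus.pl, and the function names are hypothetical.

```python
import random


def make_buckets(items, k=10, seed=None):
    """Shuffle `items` and deal them into k buckets whose sizes differ by at most 1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def cross_validation_rounds(buckets):
    """Yield (training, evaluation) pairs; each bucket is the evaluation set exactly once."""
    for i, held_out in enumerate(buckets):
        training = [x for j, bucket in enumerate(buckets) if j != i for x in bucket]
        yield training, held_out
```

Each of the 10 rounds would correspond to one training run on the 9 training buckets and one prediction run on the held-out bucket.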

For each meta parameter, the script repeats the above evaluation step for different values of the meta parameter. If it finds an improvement in the target value, it adjusts the value of the meta parameter in your myspecies_parameters.cfg file. It tries new values for the parameter until it finds no more improvement. Then it optimizes the next meta parameter. When it has optimized all meta parameters once, it starts again with the first meta parameter. It does at most 5 rounds of optimization, but stops earlier if no improvements are found. This script probably has to run overnight. When it is done, your myspecies_parameters.cfg will hopefully contain meta parameters that are well suited for your species and your training set.
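
The overall search strategy, optimizing one meta parameter at a time for at most 5 rounds, can be sketched in Python as below. This is a simplification and not the actual logic of optimize_augustus.pl: it assumes a fixed list of candidate values per parameter and an evaluate function that returns the target value (in the real script, the cross validation described above). All names are hypothetical.

```python
def optimize(params, candidates, evaluate, max_rounds=5):
    """Greedy search that changes one parameter at a time.

    params:     dict mapping parameter name to its starting value
    candidates: dict mapping parameter name to the values to try
    evaluate:   function taking a params dict and returning a target value (higher is better)
    """
    best = dict(params)
    best_score = evaluate(best)
    for _ in range(max_rounds):
        improved = False
        for name, values in candidates.items():
            for value in values:
                trial = dict(best)
                trial[name] = value
                score = evaluate(trial)
                if score > best_score:  # keep the new value only if the target improves
                    best, best_score, improved = trial, score, True
        if not improved:
            break  # a full round without improvement: stop early
    return best, best_score
```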

After optimize_augustus.pl has finished you have to train AUGUSTUS with the metaparameters it has found.

etraining --species=myspecies myspecies.train.gb

If you have a test set, you can now check the prediction accuracy on this test set by running

augustus --species=myspecies myspecies.test.gb

The end of the output will then contain a summary of the accuracy of the prediction. If the gene level sensitivity is below 20%, it is likely that the training set is not large enough, that its quality is poor or that the species is somehow 'special'.
If you succeeded in creating a good AUGUSTUS version for your species, I would be very interested in your results. If possible, please share your results and send me your myspecies_parameters.cfg and the test and training sets.

4. SPECIAL CASE: ORGANISM WITH DIFFERENT GENETIC CODE

AUGUSTUS can be told to use a different translation table, in particular one with a different set of stop codons. This is useful for a small number of species such as Tetrahymena thermophila, in which some codons translate to a different amino acid than usual. If you train AUGUSTUS for such a species, set the variable translation_table in the parameter file of your species. Further, adjust the stop codon probabilities in the same config file. E.g. say

translation_table 6
/Constant/amberprob 0 # Prob(stop codon = tag), if 0 tag is assumed to code for amino acid
/Constant/ochreprob 0 # Prob(stop codon = taa), if 0 taa is assumed to code for amino acid
/Constant/opalprob 1 # Prob(stop codon = tga), if 0 tga is assumed to code for amino acid

in the case of Tetrahymena, where taa and tag code for glutamine (Q). Choose the translation table number according to the table below. translation_table=1 is the default value and the standard code with stop codons taa, tga, tag. If your species uses the standard genetic code, you don't have to do anything. If your species' code is not covered by this table, send us a note with the string of 64 one-letter amino acid codes in the codon order below.

translation  aaaaaaaaaaaaaaaa cccccccccccccccc gggggggggggggggg tttttttttttttttt
table aaaaccccggggtttt aaaaccccggggtttt aaaaccccggggtttt aaaaccccggggtttt
number acgtacgtacgtacgt acgtacgtacgtacgt acgtacgtacgtacgt acgtacgtacgtacgt
1KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF
2KNKNTTTT*S*SMIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSSWCWCLFLF
3KNKNTTTTRSRSMIMIQHQHPPPPRRRRTTTTEDEDAAAAGGGGVVVV*Y*YSSSSWCWCLFLF
4KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSSWCWCLFLF
5KNKNTTTTSSSSMIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSSWCWCLFLF
6KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVVQYQYSSSS*CWCLFLF
9NNKNTTTTSSSSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSSWCWCLFLF
10KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSSCCWCLFLF
11KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF
12KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLSLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF
13KNKNTTTTGSGSMIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSSWCWCLFLF
14NNKNTTTTSSSSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVVYY*YSSSSWCWCLFLF
15KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*YQYSSSS*CWCLFLF
16KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*YLYSSSS*CWCLFLF
21NNKNTTTTSSSSMIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSSWCWCLFLF
22KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*YLY*SSS*CWCLFLF
23KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWC*FLF
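
To read the table: the three header lines give the first, second and third base of each codon, so with the base order a, c, g, t the codon at index 16*i + 4*j + k of a 64-letter string has bases number i, j and k. The following Python sketch looks up codons this way for the strings of tables 1 and 6 above. The function names are hypothetical and this is not code from AUGUSTUS.

```python
BASES = "acgt"

# 64-letter strings for translation tables 1 and 6, taken from the table above
TABLE_1 = "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF"
TABLE_6 = "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVVQYQYSSSS*CWCLFLF"


def translate_codon(codon, table):
    """Return the amino acid letter ('*' for stop) of `codon` in a 64-letter table string."""
    i, j, k = (BASES.index(base) for base in codon.lower())
    return table[16 * i + 4 * j + k]


def stop_codons(table):
    """List all codons that the table string marks as stop ('*')."""
    return [a + b + c
            for a in BASES for b in BASES for c in BASES
            if translate_codon(a + b + c, table) == "*"]
```

For table 6 this yields tga as the only stop codon, which matches the amberprob, ochreprob and opalprob settings shown above for Tetrahymena.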