Gene modelling and annotation for Matina/Mars cacao genome.

Tables and figures referenced below are in sub-folders at
http://server7.eugenes.org:8091/cacao/genes10/

Methods G. Gene predictions with AUGUSTUS.
-----------------------------------------------

Several gene prediction sets are produced to create a superset of models that includes the best models. All sets are produced with one prediction program, AUGUSTUS, which is flexible in its use of both HMM training models and the gene evidence available for each locus.

Training the predictor's hidden Markov model follows the steps described in [AUG ref], with a starting template (generic, or from a species similar to this one) and validated genes for this species. Valid cacao genes (n=6100) are selected from EST/RNA transcript assemblies that appear to be full length. These are split into subsets for training and for validating the resulting predictor (optimize_augustus script). Three training sets were created and used, plus an un-optimized one. The arabidopsis species configuration for AUGUSTUS is used for starting values.

Evidence sets and configuration weightings are constructed to include (1) complete gene structure information (exon, CDS, intron, gene spans) and (2) an extra influence of one major component (proteins, EST exons, full transcript assemblies). The first is needed to reduce aberrant gene models caused by over-influence of one structure component; e.g., evidence of exons only from ESTs leads to missed introns and missed gene ends. The second, the extra influence of one gene evidence class, reduces conflicting signals and returns better models for that class; e.g., extra influence of homologous proteins returns models that more closely match those proteins.

Evidence for AUGUSTUS is a location table derived from expression and protein homology mappings to the genome, indicating the gene part supported (exon, CDS-exon, intron, start/stop sites, etc.; Table G1).

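
To illustrate the kind of location table described above, the following sketch models one evidence row as a small record and tallies rows by gene part, which is how counts like those in Table G1 could be produced. The `EvidenceLocation` fields, the `count_parts` and `count_parts_by_source` helpers, and the sample rows are all assumptions made for illustration; they are not the file format actually passed to AUGUSTUS.

```python
from collections import Counter
from typing import NamedTuple


class EvidenceLocation(NamedTuple):
    """One row of a hypothetical evidence location table."""
    seqid: str    # scaffold or chromosome name
    start: int    # 1-based inclusive start
    end: int      # 1-based inclusive end
    strand: str   # "+", "-" or "." when unknown
    part: str     # gene part supported: exon, CDS, intron, genespan, nonexon
    source: str   # evidence class, e.g. "EST", "RNA", "protein"


def count_parts(rows):
    """Count evidence locations per gene part (the columns of Table G1)."""
    return Counter(row.part for row in rows)


def count_parts_by_source(rows):
    """Count evidence locations per (source, part) pair."""
    return Counter((row.source, row.part) for row in rows)


if __name__ == "__main__":
    # Made-up rows, only to show the shape of the data.
    rows = [
        EvidenceLocation("scaffold_1", 1000, 1200, "+", "exon", "EST"),
        EvidenceLocation("scaffold_1", 1201, 1500, "+", "intron", "EST"),
        EvidenceLocation("scaffold_1", 1050, 1200, "+", "CDS", "protein"),
        EvidenceLocation("scaffold_1", 1501, 1700, "+", "exon", "RNA"),
    ]
    for part, n in sorted(count_parts(rows).items()):
        print(part, n)
```

Keeping the source class next to the gene part makes it possible to build evidence mixes that favour one class (expression or protein) while still carrying every structure component.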
Evidence configuration for AUGUSTUS is a table with entries for each type of gene evidence, with weights that influence how strongly that evidence is used to generate predictions, and whether the evidence carries gene structure information (grouped exons) or is simple base-level evidence. Following each prediction run, the result is assessed for overall quality and match to evidence. This assessment then suggests the options for new configuration and evidence mixes (Table G2).

AUGUSTUS can model alternate transcripts where the evidence indicates them (alternately spliced ESTs). These alternates often are not supported by transcript assemblies, and tend to include aberrations such as joined genes. This option is used to generate additional candidates for the best model per locus, but the resulting alternates are not used as locus alternates. Only transcripts assembled directly from EST/RNA reads are used in alternate selection.

Gene selection with EvidentialGene.
-----------------------------------------------

Gene models are annotated with evidence scores (EST, RNA, introns, proteins). A model's overall score is a weighted sum of its scores for each evidence type. The best gene set is selected from all models using two basic filters: (1) drop all models with a score sum below a minimum, (2) select the highest-scoring model per locus, where "locus" is defined as a location of overlapping CDS-exons. There are complexities in scoring gene joins and splits. One indicator of a joined model with homology is that its homology score is no greater than that of the unjoined models.

Best gene selection is an iterative process that involves evaluation after selection, modification of score weights, and reselection. After the majority of optimal models are found, smaller subsets of problem loci are sampled and examined, with additional evaluations to resolve these. This is a negative-feedback process designed to filter out errors and suboptimal gene models, with successive iterations changing fewer models until the optimal set is found.
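
The two basic filters described above can be sketched as follows. This is a minimal illustration and not the EvidentialGene implementation: the `Model` record, the weight values in `WEIGHTS`, the `min_score` threshold, and the requirement that overlapping CDS-exons lie on the same strand are all assumptions, and the greedy pass is a simplification that ignores the join/split complexities noted in the text.

```python
from dataclasses import dataclass

# Hypothetical weights; the values actually used are not given in the text.
WEIGHTS = {"est": 1.0, "rna": 1.0, "intron": 2.0, "protein": 3.0}


@dataclass
class Model:
    model_id: str
    seqid: str
    strand: str
    cds_exons: list   # list of (start, end) tuples, inclusive coordinates
    evidence: dict    # evidence type -> score for that type


def score(model, weights=WEIGHTS):
    """Weighted sum of the per-type evidence scores."""
    return sum(weights.get(kind, 0.0) * value
               for kind, value in model.evidence.items())


def cds_overlap(a, b):
    """True if any CDS-exon of a overlaps any CDS-exon of b (same strand assumed)."""
    if a.seqid != b.seqid or a.strand != b.strand:
        return False
    return any(s1 <= e2 and s2 <= e1
               for s1, e1 in a.cds_exons
               for s2, e2 in b.cds_exons)


def select_best(models, min_score, weights=WEIGHTS):
    """Filter 1: drop low scores. Filter 2: keep the top model per locus."""
    scored = [(score(m, weights), m) for m in models]
    kept = [(s, m) for s, m in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    best = []
    for _, model in kept:
        # A model joins the best set only if no better model shares its locus.
        if not any(cds_overlap(model, chosen) for chosen in best):
            best.append(model)
    return best
```

In the iterative process described above, a caller would rerun `select_best` after each evaluation with adjusted weights and compare how many loci change between rounds.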

Best gene selection also involves extensive expert curation to identify and remove suboptimal models. Because models are drawn from the outputs of several programs, model errors of various types are checked and corrected with accessory evigene scripts. These include intron errors (exons that span evidence introns without evidence of alternate splicing), CDS-exons that are mismatched with protein sequence, strand information errors with single exon genes, handling of internal stop codons and/or internal gaps, and partial gene handling.


Table G1. Gene evidence location counts used for predictions. These include overlapping evidence of different types.

Evidence    component
            exon    CDS     intron  genespan  nonexon
-----------------------------------------------
epir1       600k    300k    300k    ~50k      100k
epir3       600k    600k    300k    ~50k      100k
pie2b       400k    400k    300k    0         100k
piern7      400k    600k    300k    ~100k     150k
pier8       600k    600k    300k    ~50k      100k
-----------------------------------------------
CDS parts come from plant species proteins mapped with tblastn/exonerate.
Exon and intron parts come from mapping of EST/RNA and their assemblies;
gene spans come from full assemblies and protein spans.
The "epir" sets gave precedence to expression evidence; the "pier" sets gave precedence to protein evidence.


Table G2. Gene prediction set configurations
# -- add col No. predictions? in Table G4

RunID    Config  SppHMM    Evidence  Alts  In final set
-------------------------------------------------
epir1    run1    cacao2    epir1     no    320
epir1a   run1    cacao2    epir1     yes   3666
epir2    run1    cacao11a  epir1     no    not used
epir3    run1    cacao3    epir1     no    1901
epir5    run1    cacaomaz  epir1     no    not used
pie3     run2    cacao2    pie2b     no    1872
pier6    run3    cacao3    epir3     no    2496
piern7   run7    cacao3    piern7    no    1659
pier8    run8    cacao3    pier8     no    291
pier8a   run8    cacao3    pier8     yes   4335
pier8c   run8    cacao11c  pier8     no    not used
CGD09    run0    cacao1    prelim    no    7612
-------------------------------------------------
The CGD09 set is the preliminary release gene set v0.9 (2010), selected from a combination of several predictions in a similar fashion to this, but with less evidence.
Sets not used for final gene selection were assessed to lack useful additional gene models.


# Table G3. Transcript assembly sets used with gene predictions for selecting final gene models.
# Same now as Table Xn. in cacao11genes_expression.txt

Table G3 == X4. Count of best transcript assemblies used for gene modelling, for primary locus and valid alternates, by assembly method. Best transcripts were selected using scores for CDS and exon sizes, C/X ratio, and protein homology.

                           Primary gene  Alternates
Assembly         input     final set     final set
-------------------------------------------
EST-Newbler      8950      1600          1729
RNA-Cufflinks    10008     1300          2598
ESTRNA-Velvet    29446     1600          8465
PASA             0         0             2222
total            48404     4500          15014
---------------------------------------------
Transcript assembly software (see X. Methods): EST-Newbler = Newbler 454 EST assembler; RNA-Cufflinks = Cufflinks, v0.8 and v1.0.3; ESTRNA-Velvet = Velvet/Oases; PASA = PASA assembly of EST/RNA assemblies.


Table G4. Summary of evidence in gene sets used for final (Ma11) model/locus set.
-- summary stats from evaluate output here : n models, EST/Prot/Intron evd.
-- add stats for final pub3i, cirad1? see cacao3d/genes/cacao3.eval3bout.txt or elsewhere

Evid.    Nevd    Statistic       Ma11    CGD09   epir1   epir1a  epir3   pie3    pier6   pier8   pier8a  piern7  rnaest8
------   ------  -------------   ------  ------  ------  ------  ------  ------  ------  ------  ------  ------  ------
EST      49Mb    BaseOverlap     0.657   0.620   0.668   0.687   0.659   0.677   0.662   0.673   0.687   0.530   0.614
Pro      36Mb    BaseOverlap     0.764   0.756   0.744   0.758   0.744   0.769   0.739   0.766   0.778   0.681   0.678
RNA      67Mb    BaseOverlap     0.573   0.526   0.586   0.602   0.574   0.593   0.580   0.591   0.603   0.442   0.534
Intron   161333  SplicesHit      0.907   0.829   0.911   0.915   0.911   0.906   0.893   0.908   0.911   0.886   0.926
Progene  25481   Equal66%        29183   14126   15699   18255   15834   16138   16093   16046   18581   14959   15043
RNAgene  48404   Equal66%        32363   14925   15473   18280   15505   15080   15718   15515   18060   15715   48404
Homolog  28007   homolog.Nmatch  25430   30641   24095   25708   24173   24666   23944   24404   25590   21834   19457
Homolog  28007   homolog.Nfound  20209   20374   19689   19774   19702   19719   19648   19682   19669   18654   15844
Homolog  28007   homolog.%found  0.725   0.727   0.703   0.706+  0.703   0.704   0.702   0.703   0.702   0.666   0.566
Genome   --      Coding Mb       35Mb    43Mb    37Mb    39Mb    37Mb    38Mb    36Mb    38Mb    39Mb    31Mb    21Mb
Genome   --      Exon Mbase      54Mb    65Mb    68Mb    72Mb    65Mb    67Mb    65Mb    67Mb    69Mb    44Mb    31Mb
Genome   --      Gene count      29408   35601   39494   42626   35990   34894   35632   34700   35577   29275   48404
----------------------------------------------------------------------------------------------------------------------


# Maybe move some/all tables here from ??
# cacao3d/docs/cacao3gene-evid2012sum.txt
  Table E1. Cacao gene sets summary counts
  Table E2. Cacao gene evidence recovered in gene sets
  Table E3. Homology average for gene set proteins
  Table E4. Homology that supports Cacao gene sets, separated by locus agreement
  Table E5. Expression evidence from EST/RNA transcript assemblies aligned to gene sets.
  Table E6. EST/RNA transcript assemblies and unmapped counts


Table Gn. == Table E1. Cacao gene sets summary counts

Statistic               Mars1.1   Cirad1.0
---------               -------   --------
Locus count             29408     29484
Same locus+CDS          13519     13709
Same locus/different    8582      10646
Unique locus            7307      4928
Alternate transcripts   14996     0
Poor models             17244     17342
Coding bases            35 Mb     34 Mb
Exon bases              54 Mb     48 Mb
ave protein size        319       286
ave transcript size     2.3 Kb    1.5 Kb
----------------------------------------
Locus = good gene loci, excluding those identified as transposons, fragments, or unsupported by gene evidence.
Alternate transcripts of the Mars gene set are all from EST/RNA transcript assemblies.
Poor models are not counted for coding and transcript sizes.
Same/unique loci for the two gene sets are described in tables E4, E5.


Table Gn. == Table E2. Cacao gene evidence recovered in gene sets

Evidence    Nevd     Ma11   Cirad10
---------   ------   ----   ----
Proteins    36Mb     76%    73%
RNA exons   67Mb     57%    48%
Introns     161333   91%    82%
RNA genes   48404    67%    32%
-----------------------------------
Proteins and RNA exons are bases of evidence aligned to the genome, and percent of gene models that match those.
Introns are the number of unique introns from multiple EST/RNA reads, and percent of gene models matching both splice ends.
RNA genes are unique transcript assemblies, and percent of gene models that align >= 66%.

---------------------------------------------------------------------------

Table Gn. Gene structure statistics for Cacao (Mars, Cirad) and Arabidopsis
# or Table Sn. cacao11genes_summaries-work.txt
# separate doc? want for cacao3ig, cirad1, poplar?, arath, others?
# see ~/Desktop/dspp-work/daphwork/dpxgenestruc/daphnia-genostats2.txt
# see also gene struct stats for arthropods with methods details
# http://arthropods.eugenes.org/arthropods/summaries/arthropod-genestruc-table.pdf

          GenomeMB  N_Intron            CDS_size            Exon_sizes          Gene_size           UTR_size            Intron_sizes        Interg_size
----------------------------------------------------------------------------------------------------------------------------------------------------
arath     120       2 4.06 +/- 0.031    1032 1204 +/- 5.5   135 241 +/- 0.87    1538 1849 +/- 9     316 352 +/- 1.3     99 159 +/- 0.5      1218 2528 +/- 47
cacaoM11  331       2 3.73 +/- 0.028    935 1148 +/- 5.5    138 246 +/- 0.93    1993 3091 +/- 62    562 624 +/- 1.8     183 454 +/- 2.9     2612 8844 +/- 134
cacaoC10  291       2 4.02 +/- 0.029    945 1170 +/- 5.7    137 237 +/- 0.88    1938 2679 +/- 16    403 663 +/- 6.3     172 375 +/- 1.6     3079 8392 +/- 92
poptr     400       2 3.75 +/- 0.022    932 1133 +/- 4.3    138 241 +/- 0.73    1786 2433 +/- 11    362 409 +/- 1.68    172 347 +/- 1.0     4419 7532 +/- 60
sorbi     690       2 3.81 +/- 0.027    1092 1260 +/- 5.0   139 265 +/- 0.99    2146 2918 +/- 19    367 394 +/- 1.74    146 432 +/- 2.8     5336 22029 +/- 732
soybn     955       3 4.70 +/- 0.026    1043 1242 +/- 4.9   131 222 +/- 0.56    2427 3222 +/- 15    396 435 +/- 1.32    181 422 +/- 1.4     6371 17367 +/- 202
vitvi     485       3 4.95 +/- 0.032    873 1134 +/- 7.8    124 190 +/- 0.62    3018 5938 +/- 57    367 463 +/- 2.88    209 960 +/- 6.0     4809 12489 +/- 168
----------------------------------------------------------------------------------------------------------------------------------------------------
Each column gives Median Mean +/- SEM values.

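
For readers who want to reproduce the Median, Mean +/- SEM entries from raw feature sizes, here is a minimal sketch. It assumes SEM is the sample standard deviation divided by the square root of N; the helper names `summarize` and `sem_from_sd` are hypothetical, and the text does not say whether the sample (n-1) or population form of SD was used.

```python
import math
import statistics


def summarize(values):
    """Return N, median, mean, SD and SEM for a list of sizes or counts."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate SD")
    sd = statistics.stdev(values)  # sample SD (n - 1); an assumption
    return {
        "n": n,
        "median": statistics.median(values),
        "mean": statistics.fmean(values),
        "sd": sd,
        "sem": sd / math.sqrt(n),
    }


def sem_from_sd(sd, n):
    """SEM from an already computed SD and count."""
    return sd / math.sqrt(n)


if __name__ == "__main__":
    # Arabidopsis N_Intron detail row in this document: N=27227, SD=5.06
    print(round(sem_from_sd(5.06, 27227), 3))
```

With the Arabidopsis N_Intron detail values (N=27227, SD=5.06) this gives an SEM that rounds to 0.031, consistent with the summary row for arath.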
Two+ exon genes
          1-exon%  N_Intron            CDS_size            Exon_sizes
-------------------------------------------------------------------------
arath     25%      4 5.41 +/- 0.036    1116 1322 +/- 6.6   129 208 +/- 0.71
cacaoM11  24%      3 4.93 +/- 0.033    1048 1265 +/- 6.6   131 217 +/- 0.79
cacaoC10  19%      3 4.96 +/- 0.033    1023 1250 +/- 6.6   132 213 +/- 0.75
poptr     21%      3 4.74 +/- 0.026    1020 1227 +/- 5.1   132 217 +/- 0.64
sorbi     16%      3 4.92 +/- 0.032    1140 1332 +/- 5.9   132 228 +/- 0.82
soybn     23%      4 5.61 +/- 0.029    1100 1318 +/- 5.6   128 203 +/- 0.49
vitvi     8%       4 5.39 +/- 0.033    931 1184 +/- 8.2    123 185 +/- 0.59
-------------------------------------------------------------------------
The difference in N_Intron counts between cacao Mars and Cirad comes from 1-exon genes.
The Mars set has 24% 1-exon genes, close to Arabidopsis (25%), while the Cirad set has only 19% 1-exon genes.


Intron bi-modal sizes
          Short  Long  %Long
--------------------------
aratht10  93     321   19%
cacaoM11  105    517   48%
cacaoC10  102    489   46%
poptr     103    486   46%
vitvi     103    775   51%
soybn     99     531   48%
sorbi     101    543   42%
--------------------------
Intron distribution is bi-modal, with short introns of near constant size, and long introns of variable and extreme length, so that the mean is not a useful descriptor.

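
One simple way to describe a bi-modal intron size distribution is to split it at a cutoff and summarize each class separately. The sketch below does that; the `bimodal_summary` helper, the cutoff value, and the use of the median for each class are assumptions for illustration, since the text does not state how the Short, Long and %Long values were derived.

```python
import statistics


def bimodal_summary(intron_sizes, cutoff):
    """Summarize short (< cutoff) and long (>= cutoff) introns separately."""
    if not intron_sizes:
        raise ValueError("no intron sizes given")
    short = [s for s in intron_sizes if s < cutoff]
    long_ = [s for s in intron_sizes if s >= cutoff]
    return {
        "short_median": statistics.median(short) if short else None,
        "long_median": statistics.median(long_) if long_ else None,
        "pct_long": 100.0 * len(long_) / len(intron_sizes),
    }


if __name__ == "__main__":
    # Made-up sizes: a tight short class plus a few long, variable introns.
    sizes = [90, 95, 100, 105, 110, 400, 600, 5000]
    print(bimodal_summary(sizes, cutoff=200))
    print(statistics.fmean(sizes))  # the mean is pulled up by one extreme intron
```

In the made-up sample, a single 5000-base intron moves the mean far above every short intron, which is the reason the per-class summary is more informative than the overall mean.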
# Details for above tables

Arabidopsis reference
   Genome        Statistic     N       Median  Mean      SD        PerGenome  Sum
1  aratht10      N_Intron      27227   2       4.06      5.06      0.000923   110452
2  aratht10      Gene_size     27227   1538    1848.88   1498.19   0.420660   50339426
3  aratht10      CDS_size      27227   1032    1204.48   903.34    0.274045   32794344
4  aratht10      Exon_sizes    136676  135     240.50    322.28    0.274543   32853938
5  aratht10      Intron_sizes  110503  99      158.92    170.14    0.146609   17544361
6  aratht10      UTR_size      28253   316     352.26    221.73    0.056799   6797005
7  aratht10      Interg_size   27076   1218    2528.26   7723.93   0.572044   68455264

#------------------------------
Cacao gene sets

Cacao Evigene pub3i/Matina-Mars genome
   Genome        Statistic     N       Median  Mean      SD        PerGenome  Sum
1  cacao3gi      N_Intron      29093   2       3.73      4.76      0.000324   108626
2  cacao3gi      Gene_size     29093   1993    3090.98   10609.90  0.268056   89925952
3  cacao3gi      CDS_size      29093   935     1147.80   939.62    0.099540   33392964
4  cacao3gi      Exon_sizes    136243  138     246.31    343.17    0.099662   33433982
5  cacao3gi      Intron_sizes  109042  183     454.18    977.50    0.146020   48985833
6  cacao3gi      UTR_size      43245   562     624.07    384.73    0.050061   16794083
7  cacao3gi      Interg_size   28116   2612    8843.62   22479.49  0.741181   248647261

#------------------------------
Cacao Cirad1/Criollo genes + genome
   Genome        Statistic     N       Median  Mean      SD        PerGenome  Sum
1  cacaocir1     N_Intron      28450   2       4.02      4.86      0.000359   114346
2  cacaocir1     Gene_size     28450   1938    2679.41   2733.39   0.239063   76229273
3  cacaocir1     CDS_size      28450   945     1170.30   965.41    0.104417   33295135
4  cacaocir1     Exon_sizes    140180  137     237.31    327.81    0.104327   33266289
5  cacaocir1     Intron_sizes  114346  172     375.48    562.96    0.134646   42934138
6  cacaocir1     UTR_size      20586   403     663.22    897.58    0.042817   13653009
7  cacaocir1     Interg_size   28439   3079    8391.93   15505.13  0.748458   238658152

#------------------------------
Two+ exon genes
   Genome        Statistic     N       Median  Mean      SD        PerGenome  Count of 2+ exon genes
1  aratht10      N_Intron      20412   4       5.41      5.18      0.000923   75%, 20412/27227
1  cacao3gi      N_Intron      22012   3       4.93      4.9       0.000324   76%, 22012/29093
1  cacaocir1     N_Intron      23035   3       4.96      4.94      0.000359   81%, 23035/28450
1  cacao3eg2cir  N_Intron      20476   3       5.06      5.09      0.000325   74%, 20476/27797

#------------------------------
# cacao3gi mapped to cir1asm (gmap.gff) :
All Cacao Evigene pub3i transcripts mapped to the Criollo genome.
This includes genes with poor mapping, but not split (chimeric) or no mappings.
29093 input genes = 819 split + 198 nomap + 279 problems + 27797 mapped
Of those mapped, 1848 have <90% alignment, which skews these stats some.

   Genome        Statistic     N       Median  Mean      SD        PerGenome  Sum
1  cacao3eg2cir  N_Intron      27797   2       3.73      4.91      0.000325   1.04e+05
2  cacao3eg2cir  Gene_size     27797   1895    3629.53   20345.26  0.316402   1.01e+08
3  cacao3eg2cir  CDS_size      27797   903     1108.61   922.39    0.096643   3.08e+07
4  cacao3eg2cir  Exon_sizes    126633  136     239.66    328.02    0.073919   2.36e+07
5  cacao3eg2cir  Intron_sizes  100038  181     464.27    1140.43   0.104526   3.33e+07
6  cacao3eg2cir  UTR_size      41430   565     643.03    436.98    0.053408   1.70e+07
7  cacao3eg2cir  Interg_size   22004   2740    8381.58   19092.16  0.578387   1.84e+08

#------------------------------
Two+ exons only
   Genome        Statistic     N       Median  Mean      SD        PerGenome  Sum
1  cacao3eg2cir  N_Intron      20476   3       5.06      5.09      0.000325   1.04e+05
2  cacao3eg2cir  Gene_size     20476   2610    4672.24   23614.56  0.300028   9.57e+07
3  cacao3eg2cir  CDS_size      20476   1047    1250.00   960.07    0.080269   2.56e+07
4  cacao3eg2cir  Exon_sizes    119210  129     210.36    269.30    0.058426   1.86e+07
5  cacao3eg2cir  Intron_sizes  100038  181     464.27    1140.43   0.104526   3.33e+07
6  cacao3eg2cir  UTR_size      33773   607     680.80    427.99    0.043675   1.39e+07
7  cacao3eg2cir  Interg_size   15342   2605    7952.69   18252.96  0.382637   1.22e+08