Cacao gene set evidence comparison

Gene sets are Mars Thecc (2012, n loci=29452) and Cirad Tc v1 (2011, n loci=29283)

Table E1. Cacao gene sets summary counts

Statistic                 Mars.v11  Cirad.v1
--------------------------------------------
Locus count                  29283     29484
Same locus+CDS               13519     13709
Same locus/different          8599     10646
Unique locus                  7337      4928
Alternate transcripts        14920         0
Poor models                  17244     17342
Coding bases                 35 Mb     34 Mb
Exon bases                   54 Mb     48 Mb
ave protein size               319       286
ave transcript size         2.3 Kb    1.5 Kb
--------------------------------------------
Locus = good gene loci, excluding those identified as transposons, fragments, or unsupported by gene evidence.
Alternate transcripts of Mars gene set are all from EST/RNA transcript assemblies.
Poor models are not counted for coding and transcript sizes.
Same/unique loci for two gene sets are described in tables E4, E5.

Table E2. Cacao gene evidence recovered in gene sets

Evidence        Nevd   Mars  Cirad
----------------------------------
Proteins        36Mb    76%    73%
RNA exons       67Mb    57%    48%
Introns       161333    91%    82%
RNA genes      48404    67%    32%
----------------------------------
Proteins and RNA exons are bases of evidence aligned to genome, and percent of gene models that match those.
Introns are number of unique introns from multiple EST/RNA reads, and percent of gene models matching both splice ends.
RNA genes are unique transcript assemblies, and percent gene models that align >= 66%.

Table E3. Homology average for gene set proteins

Tree gene set     TAIR10  Plant8
--------------------------------
Cacao11_mars         632     549
Cacao1_cirad         620     522
Poplar               609
Castor bean          591
Grape                563
--------------------------------
TAIR10 = average blastp bitscore to Arabidopsis, TAIR10, using 10253 TAIR genes that are common best matches to all 5 gene sets.
Plant8 = average blastp bitscore to best matching plant protein of 8 plant proteome sets.
Note that the differences between the two Cacao gene sets are in the same range as the differences among tree species gene sets, so phylogeny and gene construction quality differences are confounded.

Table E4. Homology that supports Cacao gene sets, separated by locus agreement

Subset                         nGene  nHomolog  Bits  Align
-----------------------------------------------------------
All genes
  Mars                         29452     24682   549    448
  Cirad                        29283     24956   522    428
Same locus, same CDS
  Mars                         13519     13381   657    496
  Cirad                        13709     13500   654    496
Same locus, different CDS
  Mars                          8599      7925   484    420
  Cirad                        10646      9300   421    385
Unique locus
  Mars                          7337      3379   269    320
  Cirad                         4928      2156   131    176
-----------------------------------------------------------
Bits, Align = average bit score and alignment bases to 8 plant protein sets, using blastp e-value <= 1e-5.
Mars and Cirad shared loci are determined with Cirad transcripts aligned to Matina genome.
Same locus, same CDS is determined by >= 90% CDS exon alignment.
Same locus, different CDS has lower CDS alignment.
Unique loci are those with no significant exon overlap for Mars11 and Cirad1 gene sets.

Table E5. Expression evidence from EST/RNA transcript assemblies aligned to gene sets

Subset                         nGene    nRNA  Align  %Align     95%     66%
---------------------------------------------------------------------------
All genes
  Mars                         29452   25133   1340    88.3   14919   22163
  Cirad                        29283   20615   1269    82.8    8634   18460
Same locus, same CDS
  Mars                         13519   12966   1688    92.1    8685   12186
  Cirad                        13709   12902   1525    88.1    6252   11801
Same locus, different CDS
  Mars                          8599    7392   1225    86.5    3953    6370
  Cirad                        10646    7874   1011    77.6    2186    5894
Unique locus
  Mars                          7337    4778    572    80.5    2282    3609
  Cirad                         4928    1427    375    64.0     196     765
---------------------------------------------------------------------------
Gene locus agreement of gene sets is as in table E4.
nRNA = number of genes with some expression assembly alignment.
Align = average aligned bases.
%Align = average percent of gene transcript aligned to an expression assembly.
95% = number of gene transcripts with >= 95% alignment.
66% = number of gene transcripts with >= 66% alignment.
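
The share of genes with homology or expression support in each subset follows from dividing nHomolog (Table E4) and nRNA (Table E5) by nGene. A minimal sketch in Python, with counts copied from the two tables; the dictionary layout and the function name `support_fraction` are illustrative assumptions, not part of the original analysis.

```python
# Counts copied from Tables E4 (nHomolog) and E5 (nRNA).
# The dictionary layout and the function name are illustrative assumptions.
COUNTS = {
    ("All genes", "Mars"): {"nGene": 29452, "nHomolog": 24682, "nRNA": 25133},
    ("All genes", "Cirad"): {"nGene": 29283, "nHomolog": 24956, "nRNA": 20615},
    ("Same locus, same CDS", "Mars"): {"nGene": 13519, "nHomolog": 13381, "nRNA": 12966},
    ("Same locus, same CDS", "Cirad"): {"nGene": 13709, "nHomolog": 13500, "nRNA": 12902},
    ("Same locus, different CDS", "Mars"): {"nGene": 8599, "nHomolog": 7925, "nRNA": 7392},
    ("Same locus, different CDS", "Cirad"): {"nGene": 10646, "nHomolog": 9300, "nRNA": 7874},
    ("Unique locus", "Mars"): {"nGene": 7337, "nHomolog": 3379, "nRNA": 4778},
    ("Unique locus", "Cirad"): {"nGene": 4928, "nHomolog": 2156, "nRNA": 1427},
}


def support_fraction(subset, gene_set, evidence):
    """Fraction of genes in a subset that have the given evidence type."""
    row = COUNTS[(subset, gene_set)]
    return row[evidence] / row["nGene"]


# Print both support rates, as percentages, for every subset and gene set.
for (subset, gene_set) in COUNTS:
    homolog = 100 * support_fraction(subset, gene_set, "nHomolog")
    rna = 100 * support_fraction(subset, gene_set, "nRNA")
    print(f"{subset:<26} {gene_set:<5} homolog {homolog:5.1f}%  RNA {rna:5.1f}%")
```

For the unique loci this gives roughly 65% of Mars genes and 29% of Cirad genes with some expression assembly alignment.
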
The expression measures in Table E5 use 353350 transcript assemblies from the EST and RNAseq assemblies totaled in Table E6.

Table E6. EST/RNA transcript assemblies and unmapped counts

                        Mars11 genes    Cirad1 genes   Mars11 genome   Cirad1 genome
Assembly        nAsm   Nomap  %Nomap   Nomap  %Nomap   Nomap  %Nomap   Nomap  %Nomap
------------------------------------------------------------------------------------
EST.genbank   157996   39357   24.9%   37196   23.5%   11220    7.1%   19698   12.4%
EST.bean       25501    3819   14.9%    5633   22.0%    1227    4.8%    1434    5.6%
EST.leaf       33237   11835   35.6%   13415   40.3%    9224   27.7%    9342   28.1%
EST.pistil1    20507    2894   14.1%    3447   16.8%    1307    6.3%    1597    7.7%
EST.pistil2    25415    4055   15.9%    5145   20.2%    1315    5.1%    1834    7.2%
EST.CCN51      25152    2075    8.2%    3582   14.2%      53    0.2%     169    0.6%
EST.TSH1188    26528    3238   12.2%    4267   16.0%    1272    4.7%    1495    5.6%
RNA+EST       197010   57011   28.9%   76135   38.6%      --      --      --      --
------------------------------------------------------------------------------------
EST/RNA transcript assemblies are mapped to gene model transcripts and genome assembly with GMAP v2011-10.
CCN51 and TSH1188 groups were not used to inform Mars11 gene construction; the others were.
EST.genbank data were available to both projects.
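
The %Nomap columns of Table E6 are Nomap divided by nAsm. A minimal sketch in Python that recomputes the Mars11 genes column from counts copied from the table. The names are illustrative, and truncating to one decimal place (rather than rounding) is an assumption inferred from the printed values: 3819/25501 is 14.98%, which the table shows as 14.9%. The sketch also checks the 353350 total cited for Table E5. That total equals the sum of every nAsm row except EST.genbank, which is an arithmetic observation, not something the text states.

```python
# Counts copied from Table E6; variable and function names are illustrative.
N_ASM = {
    "EST.genbank": 157996, "EST.bean": 25501, "EST.leaf": 33237,
    "EST.pistil1": 20507, "EST.pistil2": 25415, "EST.CCN51": 25152,
    "EST.TSH1188": 26528, "RNA+EST": 197010,
}
NOMAP_MARS_GENES = {
    "EST.genbank": 39357, "EST.bean": 3819, "EST.leaf": 11835,
    "EST.pistil1": 2894, "EST.pistil2": 4055, "EST.CCN51": 2075,
    "EST.TSH1188": 3238, "RNA+EST": 57011,
}
# %Nomap values as printed in the Mars11 genes column of Table E6.
PRINTED_PCT = {
    "EST.genbank": 24.9, "EST.bean": 14.9, "EST.leaf": 35.6,
    "EST.pistil1": 14.1, "EST.pistil2": 15.9, "EST.CCN51": 8.2,
    "EST.TSH1188": 12.2, "RNA+EST": 28.9,
}


def pct_truncated(part, whole):
    """Percent of whole, truncated (not rounded) to one decimal place.

    Integer arithmetic avoids floating-point surprises near the cut-off.
    Truncation is an assumption inferred from the printed table values.
    """
    return (part * 1000 // whole) / 10


# Recompute each percentage and compare it with the printed value.
for name, n_asm in N_ASM.items():
    pct = pct_truncated(NOMAP_MARS_GENES[name], n_asm)
    status = "ok" if pct == PRINTED_PCT[name] else "differs"
    print(f"{name:<12} {pct:5.1f}%  printed {PRINTED_PCT[name]:5.1f}%  {status}")

# Sum of nAsm over all rows except EST.genbank.
total_without_genbank = sum(n for name, n in N_ASM.items() if name != "EST.genbank")
print("nAsm total without EST.genbank:", total_without_genbank)
```

Under the truncation assumption, every recomputed value in this column agrees with the printed one.
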