Gene summaries for Matina/Mars cacao genome.

Tables and figures referenced below are in sub-folders at
  http://server7.eugenes.org:8091/cacao/genes10/

Methods S. For Table Sn. Gene structure statistics for Cacao:
Sizes are given in bases as median, mean +/- std. error of mean.
Gene_size is the span including introns and UTR.
CDS size is the coding sequence length without introns or UTRs.
N_Intron is the count of introns. Exon_size is the size of CDS-exons.
Intergenic size is the distance between adjacent genes.
These statistics have a standard deviation close to the mean, but
Intergenic size has a much larger variance. Intron size is non-normally
distributed. The "Intron bi-modal sizes" table lists the primary and
secondary peaks, mean and the percent of introns larger than exons.
UTR size is an over-estimate, as it is measured only where exons
extend past coding sequence, and misses true cases of zero length UTRs.

# Maybe move some/all tables here from
#   cacao3d/docs/cacao3gene-evid2012sum.txt
Table E1. Cacao gene sets summary counts
Table E2. Cacao gene evidence recovered in gene sets
Table E3. Homology average for gene set proteins
Table E4. Homology that supports Cacao gene sets, separated by locus agreement
Table E5. Expression evidence from EST/RNA transcript assemblies aligned to gene sets.
Table E6. EST/RNA transcript assemblies and unmapped counts

# drop: not interesting, no diff for expressed genes,
# Table Sn. Alternate splicing and Gene Duplication
#
#            Paralog >=66%        Paralog+Express 66%   Paralog+Express 33%
# has alt    1389 / 7661   18%    1242 / 6922   18%     3688 / 7621   48%
# no alt     5457 / 24893  22%    2407 / 12726  19%     8574 / 17776  48%
# -----------------------------------------------------
# ref: Talavera D, Vogel C, Orozco M, Teichmann SA, de la Cruz X (2007)
#   The (in)dependence of alternative splicing and gene duplication. PLoS Comput Biol 3(3): e33.
#   doi:10.1371/journal.pcbi.0030033

# this may or may not be interesting, no conclusive role in mars/cirad diff.
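The "median, mean +/- std. error of mean" summaries described in the Methods
note above can be sketched as follows. This is a minimal illustration, not the
script used to build the tables; the function name summarize and the toy sizes
are assumptions made for the example, and SEM is taken as the sample standard
deviation divided by sqrt(n).

```python
import math
import statistics


def summarize(sizes):
    """Return (median, mean, SEM) for a list of feature sizes in bases.

    SEM is computed as sample standard deviation / sqrt(n); this is an
    assumed definition, the original notes do not spell it out.
    """
    n = len(sizes)
    if n < 2:
        raise ValueError("need at least two values to estimate the SEM")
    mean = statistics.mean(sizes)
    sem = statistics.stdev(sizes) / math.sqrt(n)
    return statistics.median(sizes), mean, sem


# Hypothetical toy CDS sizes, not cacao data.
toy_cds = [300, 900, 1200, 1500, 2100]
median, mean, sem = summarize(toy_cds)
# Print in the same "Median Mean +/- SEM" layout the tables use.
print(f"CDS_size  {median:.0f}  {mean:.0f} +/- {sem:.1f}")
```

Reporting the median next to the mean matters here because sizes such as
intron and intergenic length are strongly skewed, so the two can differ widely.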
Table Sn. Pentatricopeptide repeat (PPR) and Tetratricopeptide repeat (TPR) superfamily

              Loci  Notes
----------------------------------------------------------------
arabidopsis   651
cacaomars     609   956 mRNA, expressed>33% = 568 loci (907 mRNA)
cacaocirad    538
mars-cirad1   403   equivalent loci (>=90%)
----------------------------------------------------------------

PPR repeats are a degenerate ~35 amino acid motif that occurs in multiple
tandem copies within a protein [37].
PPR repeat genes are a potential source of assembly error.

ref1: Schlueter,... Randy C Shoemaker. Gene duplication and paleopolyploidy
  in soybean and the implications for whole genome sequencing.
  BMC Genomics 2007, 8:330. doi:10.1186/1471-2164-8-330
ref37: Geddy R, Brown GG: Genes encoding pentatricopeptide repeat (PPR) proteins
  are not conserved in location in plant genomes and may be subject to
  diversifying selection. BMC Genomics 2007, 8:130.

Table Sn. Disease resistance genes
-------------------------------------------------------------------

Table Sn.1 Cacao (Matina) overabundant disease resistance families,
  versus 8 species, as classified by OrthoMCL.

  OID   Nt   Ng  Cacao  Ave  Max    G  Description
---------------------------------------------------------------
  157    6   43     27    2    9   75  LRR and NB-ARC domains-containing disease resistance protein
  198    8   38     15    3    8   34  LRR and NB-ARC domains-containing disease resistance protein
 1694    2   16     12    0    4   32  Disease resistance protein RPP8
 2068    4   15      9    1    3   19  TMV resistance protein N
 5033    3   12      7    1    3   15  NB-ARC domain-containing disease resistance protein
12410    2    8      7    0    1   14  CC-NBS-LRR class disease resistance protein
-------------------------------------------------------------------
OID = group ID, Nt = number of taxa, Ng = no. genes, Cacao = cacao genes,
Ave = average no. genes among other 8 species, Max = max no. among other 8 species,
G = g-statistic for difference from other species, all significant p<0.001
-- should add non-cacao groups with 2+ species for balance.
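For reference, a g-statistic of the kind reported in the G column has the
general form G = 2 * sum(O * ln(O / E)). The sketch below is an illustration
only: the notes do not say how expected counts were derived, so the two-class
setup (cacao versus the pooled other species, with an equal expected share per
species) and the function names are assumptions, and the result is not expected
to reproduce the tabulated G values, which likely used per-species counts.

```python
import math


def g_statistic(observed, expected):
    """Likelihood-ratio G = 2 * sum(O * ln(O / E)).

    Cells with an observed count of zero contribute nothing to the sum.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


def cacao_vs_others_g(cacao_genes, total_genes, n_species=9):
    """Two-class goodness-of-fit G: cacao vs. all other species pooled.

    Assumption (not from the source): under the null, each of the
    n_species genomes holds an equal share of the family's genes.
    """
    others = total_genes - cacao_genes
    expected_cacao = total_genes / n_species
    expected_others = total_genes - expected_cacao
    return g_statistic([cacao_genes, others], [expected_cacao, expected_others])


# Family OID 157 from Table Sn.1: Ng = 43 genes, 27 of them in cacao.
g = cacao_vs_others_g(27, 43)
print(f"G = {g:.1f} (1 degree of freedom under this two-class setup)")
```

With one degree of freedom, G above about 10.83 corresponds to p<0.001 on the
chi-square distribution, which is the usual reference for this statistic.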
Table Sn.2 Cacao Matina vs Criollo assembly genes from poorly cross-mapping
  transcripts (non-aligned regions and others) classified by OrthoMCL into
  disease resistance families

Matina  Criollo  Family
-------------------------------------------------------------------
   3       2     Disease resistance-responsive (dirigent-like protein) family protein
  10       2     Disease resistance protein (CC-NBS-LRR class) family
  16      16     LRR and NB-ARC domains-containing disease resistance protein
  16       7     NB-ARC domain-containing disease resistance protein
   5       1     disease resistance RPP8-like protein 2 (NB-LRR class)
  19      16     disease resistance family protein / LRR family protein
   6       2     disease resistance protein (TIR-NBS-LRR class)
   2       0     TMV resistance protein N
-------------------------------------------------------------------

Table Sn.3 Genes containing disease resistance protein domains,
  using rpsblast of gene proteins x disease-domains from NCBI Conserved Domain Database.

                        Protein domains
Species         Dirig   Psyr    NBS    TIR  TIR-NBS
---------------------------------------------------------
cacao_MA           37    410    341     10     21
cacao_MA.noCR       6     89     86      1      7
cacao_CR           38    334    267     18     14
cacao_CR.noMA       2     43     34      1      2
arath              29    284     88     42    126
poptr              46    693    430     68    112
ricco              24    174    127     13     31
vitvi              18    338    300     10     21
---------------------------------------------------------
cacao_CR = Criollo/Cirad v1 genome; cacao_MA = Matina/CGD v1.1 genome;
cacao_CR.noMA = Criollo genes from non-aligned regions (n=1690)
cacao_MA.noCR = Matina genes from non-aligned regions (n=2986)

The non-aligned genes are located in spans where the assemblies do not align,
and their transcripts fail to map properly elsewhere in the other assembly.
All proteins per species were aligned using RPSBLAST to a disease domain
database, and the type and number of domains per gene were counted, then
summed over all genes.

Disease resistance domains:
  Dirig = CDD:190504 pfam03018 Dirigent, Dirigent-like protein
  Psyr  = CDD:178749 PLN03210 Psyringae, PLN03210, Resistant to P.
          syringae 6
  NBS   = CDD:201512 pfam00931 NB-ARC, NB-ARC domain
  TIR   = CDD:201870 pfam01582 TIR, TIR domain
       or CDD:205852 pfam13676 TIR_2, TIR domain

# mars annotated as "disease.resistance|TMV.resist|NB.ARC |NBS.LRR "
#   loci=393 mrna=469,
#   in non-align cirad spans, loci=165 mrna=180
#
# cirad1 annotated as "disease.resistance|TMV.resist|NB.ARC |NBS.LRR "
#   loci=323 ,
#   in non-align-mars spans, loci=36

# Table Sn.3 == M3.4 Defense response genes from non-aligned regions
# -------------------------------------------------------------------
#   20  LRR and NB-ARC domains-containing disease resistance protein
#   20  NB-ARC domain-containing disease resistance protein
#   20  CC-NBS-LRR class disease resistance protein
#   11  NBS-LRR type disease resistance protein
#    3  TMV resistance protein N
#   10  Disease resistance protein families
#    5  Disease resistance protein RPP8
#    2  Disease resistance RPS5
#    1  Disease resistance RPM1
#    1  NBS type disease resistance protein
#    1  Disease resistance (TIR-NBS-LRR class)
#    1  TIR-NBS-LRR resistance-like protein
#  ---
#   95  total (see above, count 165, this is more limited set of ciradpoormap+nonalign)

# %% See also 2011 medicago gnopp, S6.3 NBS-LRR discovery and analysis : 705 NBS-LRR genes
#   medicago Table S14. NBS-LRRs in clusters
#   medicago Table S15. NBS-LRR domain organization
#   med. supl disc. S3. NBS-LRR resistance genes
#   -- better way to find these is tblastn with found NBS-LRR genes (tr or aa?)
#
#
# medicago Table S14. NBS-LRRs in clusters
#   582 NBS-LRRs
#   549 in chromosomes      33 in unassembled BACs
#   254 TIR                 11 TIR
#   295 nonTIR              22 nonTIR
# --------------------
# Clusters
#                      100 kb window                    250 kb window
#                      genes                 clusters   genes                  clusters
# TIR-NBS-LRR          172                   31         142                    20
# nonTIR-NBS-LRR       192                   40         163                    23
# Mixed                86 (47 CC + 39 TIR)   15         177 (95 CC + 82 TIR)   22
# Total in clusters    450 (82%)             86         482 (88%)              65
# Single TIR           43 (17%)                         30 (12%)
# Single nonTIR        56 (19%)                         37 (13%)

# ** See also 2006 poplar genome paper for table of these disease res.
#    genes (high in poptr)
# ** Compare likewise, use omcl families?, limit to those for arabidopsis?
#
# Table 2. Numbers of genes that encode domains similar to plant R
# proteins in Populus, Arabidopsis (81), and Oryza (82). *, BED finger
# and/or DUF1544 domain; CC, coiled coil; –, not detected.
#
# Predicted domains       Letter  Populus  Arabidopsis  Oryza
# TIR-NBS                 TN         10        21         –
# TIR-NBS-LRR             TNL        64        83         –
# TIR-NBS-LRR-TIR         TNLT       13
# TIR-NBS-LRR-NBS         TNLN        1
# NBS-LRR-TIR             NLT         1
# TIR-CC-NBS-LRR          TCNL        2
# CC-NBS                  CN         19         4          7
# CC-NBS-LRR              CNL       119        51        159
# BED/DUF1544*-NBS        BN          5
# NBS-BED/DUF1544*        NB          1
# BED/DUF1544*-NBS-LRR    BNL        24
# NBS-LRR                 NL         90         6         40
# NBS                     N          49         1         45
# Others                  –           –        41        284
# Total NBS genes                   398       207        535
# --------------------------------------------

Table Sn. Gene structure statistics for Cacao (Mars, Cirad) and Arabidopsis

           N_Intron          Gene_size          CDS_size           Exon_size         UTR_size         Intron_sizes      Interg_size
arath      2 4.06 +/- 0.031  1538 1849 +/- 9    1032 1204 +/- 5.5  135 241 +/- 0.87  316 352 +/- 1.3  99 159 +/- 0.51   1218 2528 +/- 47
cacaomars  2 3.73 +/- 0.028  1993 3091 +/- 62   935 1148 +/- 5.5   138 246 +/- 0.93  562 624 +/- 1.8  183 454 +/- 2.96  2612 8844 +/- 134
cacaocir1  2 4.02 +/- 0.029  1938 2679 +/- 16   945 1170 +/- 5.7   137 237 +/- 0.88  403 663 +/- 6.3  172 375 +/- 1.66  3079 8392 +/- 92
--------------------------------------------------------------------------------------------------------------------------------------------
Each column gives Median Mean +/- SEM values.

# see also gene struct stats for arthropods with methods details
# http://arthropods.eugenes.org/arthropods/summaries/arthropod-genestruc-table.pdf

Two+ exon genes
           N_Intron          CDS_size           Exon_sizes
arath      4 5.41 +/- 0.036  1116 1322 +/- 6.6  129 208 +/- 0.71
cacaomars  3 4.93 +/- 0.033  1048 1265 +/- 6.6  131 217 +/- 0.79
cacaocir1  3 4.96 +/- 0.033  1023 1250 +/- 6.6  132 213 +/- 0.75
-----------------------------------------------------------------
Difference in N_Intron counts for cacao mars, cirad is from 1 exon genes.
Mars set has 25% 1-exon genes, same as Arabidopsis, while Cirad set has
only 19% 1-exon genes.

Intron bi-modal sizes
           Short   Long  %Long
--------------------------
arath         93    321   19%
cacaomars    105    517   48%
cacaocir1    102    489   46%
-- vs model animals --
nematode      51    502   33%
fruitfly      63    757   27%
mouse        107   1588   86%
--------------------------

# === related plant geno papers ==
#
# cucumber genome pp supl table 15 : skip this, cannot match ERF gene of this table
# Supplementary Table 15: Number of genes involved in ethylene signaling pathway in cucumber,
# Arabidopsis, papaya, poplar, grapevine and rice.
#   Gene   Cucumber  Arabidopsis  Papaya  Poplar  Grapevine  Rice
#   Total    137        160        109     239      141      175
# SAM synthase: S-adenosylmethionine synthase; ACO: 1-aminocyclopropane-1-carboxylate oxidase; ACS:
# 1-aminocyclopropane-1-carboxylate synthase; ReACS: regulator of ACS; ETR: ethylene receptor;
# CTR1: constitutive triple response-1; EIN2: ethylene insensitive 2; EIN3: ethylene
# insensitive 3; ERF: ethylene responsive factor.
#
# >> probably missing by these horrid names, most 110+ are ERF, others in singles.
# egrep -i 'S-adenosylmethionine synthase|SAM synthase|1-aminocyclopropane-1-carboxylate oxidase|1-aminocyclopropane-1-carboxylate synthase|regulator of ACS'
#   cacao n=19 << this about same as cuc, arath,grape
# egrep -i 'ethylene.receptor|constitutive.triple.response|ethylene.insensitive|ethylene.responsive'
#   cacao n=37 << missing 100+ ERF, probably other name, BUT TAIR names only 12 ethylene.responsive, not 123 of this table

# cucumber Supplementary Table 17: Number of predicted genes in GA biosynthetic and signaling pathways
# in cucumber, Arabidopsis, papaya, poplar, grapevine and rice.
# >> give gene names, total 25-50 genes
# CYP88A, cytochrome P450 88A; GA7ox, GA7-oxidase; GA20ox, GA20-oxidase; GA3ox, GA3-oxidase;
# GA2ox, GA2-oxidase; GID, gibberellin-insensitive dwarf, receptor of GA; DELLA, the subfamily of
# GRAS transcription factors, which negatively regulate GA signaling; SPINDLY, gibberellin signal
# transduction protein; GASA, gibberellin-regulated protein.
#
# cucumber Supplementary Table 20: Number of lignin and cellulose biosynthesis related genes in cucumber,
# Arabidopsis, papaya, poplar, grapevine and rice.
#   Gene   Cucumber  Arabidopsis  Papaya  Poplar  Grapevine  Rice
#   Total     26         28         39      48       49       40    Lignin synth
#   Total     18         21         17      36       21       20    Cellulose synth
#
# Lignin: PAL, Phenylalanine ammonia lyase; C4H, Trans-cinnamate 4-monooxygenase; 4CL,
# 4-Coumarate:CoA Ligase; HCT, hydroxycinnamoyl-CoA shikimate/quinate
# hydroxycinnamoyltransferase; COMT, Trans-caffeoyl-CoA 3-O-methyltransferase; C3H,
# p-Coumaroyl shikimate 3'-hydroxylase/Coumaroyl 3-Hydroxylase; CCoAOMT, Caffeoyl-coenzyme
# A (CoA) O-methyltransferase; CCR, Cinnamoyl CoA reductase; F5H, Coniferylaldehyde
# 5-hydroxylase/Ferulate 5-hydroxylase; CAD, Cinnamyl alcohol dehydrogenase;
# Cellulose: CeSA, Cellulose Synthase; COBRA, cellulose orientation genes.
#
#
# Supplementary Figure 14: Genomic locations of R genes on the cucumber chromosomes.
#   Three R genes could not be anchored on specific chromosome
#   .. what are R genes ? cute karyotype glyph w/ genes located, mostly near telomeres.
#
#
# Title: The Medicago genome provides insight into the evolution of rhizobial symbioses
# Author(s): Young Nevin D.; Debelle Frederic; Oldroyd Giles E. D.; et al.
# Source: NATURE Volume: 480 Issue: 7378 Pages: 520-524 DOI: 10.1038/nature10625 Published: DEC 22 2011