Protein homology assessment for Matina/Mars cacao genome.

Tables and figures referenced below are in sub-folders at
  http://server7.eugenes.org:8091/cacao/genes10/

Methods P. Homology assessment methods and steps

1. Homology proteins aligned to genome assembly
     tblastn of each proteome x genome DNA, p <= 1e-5,
     then join local alignments of the same gene (evigene script).
   Used as evidence for directed gene predictions.
   Used as evidence of protein bases in gene model assessment.
   Displayed on genome map as evidence for expert annotation.

2. Refined protein gene mapping
     exonerate with options model=protein2genome:bestfit, exhaustive
     for best mapped alignment, to extend tblastn local alignments
     to the best matching complete gene alignment;
     then the best protein gene per locus is selected, using majority vote
     among 3+ plant proteins to reduce spurious protein models.
   Used as complete gene evidence for directed gene predictions.
   Used as evidence of protein gene structures in gene model assessment.
   Displayed on genome map as evidence for expert annotation.

3. Homology proteins matched to all cacao gene set proteins
     blastp of plant8 x each gene model set,
     reduced to the best homology score for each gene model.
   Used as evidence for protein homology in gene model assessment.

4. Orthology assessment of final cacao gene set x related plant proteins
     blastp all x all of 9 species proteomes,
     reciprocal best-blast clustering to gene families using orthomcl.

5. TAIR and Uniprot/UniRef50 homology proteins, and the Orthomcl gene group
   consensus name, are used to choose the best gene name, from the highest
   identity match and name qualities.

--------------------------------------

Table P1. Source proteomes used for gene homology assessments are 8 related plants,
  the most current versions as of 2011 Sept.

Sppid  Species                                      No. proteins  Date of data
         URL
--------------------------------------------------------------------------
arath  Arabidopsis thaliana                         naa=35386     2010 Dec 15
         ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR10_blastsets/TAIR10_pep_20101214
poptr  Populus trichocarpa (poplar tree) Version 2.2  naa=45033   2011 Mar 31
         ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v7.0/Ptrichocarpa/assembly/Ptrichocarpa_156.fa.gz
soybn  Glycine max (soybean, v Glyma1.0)            naa=55787     2011 Mar 31
         ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v7.0/Gmax/assembly/Gmax_109.fa.gz
frave  Fragaria vesca (woodland strawberry)         naa=34809
         https://strawberry.plantandfood.co.nz/gbrowse/navbar/strawberry/DownloadData/vescagenemodels2.faa
ricco  Ricinus communis (castor bean, TIGR 0.1)     naa=31221     2008 May 22
         ftp://ftp.tigr.org/pub/data/castorbean/release_0.1/TIGR_castorWGS_release_0.1.aa.fsa.gz
soltu  Solanum tuberosum (potato)                   naa=39031
         http://potatogenomics.plantbiology.msu.edu/data/PGSC_DM_v3.4_pep_representative.fasta.zip
sorbi  Sorghum bicolor (Sbi1.4)                     naa=29448     2011 Mar 31
         ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v7.0/Sbicolor/assembly/Sbicolor_79.fa.gz
vitvi  Vitis vinifera (grape)                       naa=26346     2011 Mar 31
         ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v7.0/Vvinifera/assembly/Vvinifera_145.fa.gz

       UniProt UniRef50 plant subset                naa=345729    2011 June
         ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz
         used for gene naming
--------------------------------------------------------------------------

Table P2. Homology average for gene set proteins
  # (expand this from Table E3)

Tree gene set    TAIR10   Plant8
-----------------------------------
Cacao11_mars     632      549
Cacao1_cirad     620      522
Poplar           609
Castor bean      591
Grape            563
-----------------------------------
  TAIR10 = average blastp bitscore to Arabidopsis (TAIR10), using the 10253 TAIR genes
           that are common best matches to all 5 gene sets.
  Plant8 = average blastp bitscore to the best matching plant protein
           of the 8 plant proteome sets.
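The best-score reduction of step 3 and the TAIR10 average of Table P2 can be
sketched as below. This is a minimal illustration, not the evigene code. It
assumes 12-column BLAST tabular output (query, subject, ..., e-value, bitscore),
and it assumes that "common best matches" means the reference proteins that are
a best hit for at least one gene in every gene set. The function names
best_hits and average_common_bits are hypothetical.

```python
def best_hits(blast_lines, max_evalue=1e-5):
    """Keep the highest-bitscore hit per query gene model.

    blast_lines: iterable of 12-column tab-separated BLAST rows (assumed format).
    Returns {query_id: (reference_id, bitscore)}.
    """
    best = {}
    for line in blast_lines:
        if line.startswith("#"):
            continue  # skip comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # skip malformed rows
        query, ref = cols[0], cols[1]
        evalue, bits = float(cols[10]), float(cols[11])
        if evalue > max_evalue:
            continue  # below the significance cutoff used in the methods
        if query not in best or bits > best[query][1]:
            best[query] = (ref, bits)
    return best


def average_common_bits(best_by_set):
    """Average best bitscore per gene set over reference proteins common to all sets.

    best_by_set: {set_name: output of best_hits()}.
    Returns ({set_name: average bitscore}, number of common reference proteins).
    """
    if not best_by_set:
        return {}, 0
    top = {}
    for name, best in best_by_set.items():
        per_ref = {}
        for ref, bits in best.values():
            # several gene models may hit one reference; keep the best score
            per_ref[ref] = max(bits, per_ref.get(ref, 0.0))
        top[name] = per_ref
    common = set.intersection(*(set(p) for p in top.values()))
    if not common:
        return {name: 0.0 for name in top}, 0
    averages = {
        name: sum(p[r] for r in common) / len(common) for name, p in top.items()
    }
    return averages, len(common)
```

Restricting the average to reference proteins found by every gene set keeps the
comparison on the same genes, so a set is not penalized or rewarded for the
number of weakly matching extra models it contains.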
Note on Table P2: the difference between the two Cacao gene sets is in the same
range as the differences among the tree species gene sets, so phylogeny and
gene construction quality differences are confounded.

Table P3. Homology that supports Cacao gene sets, separated by locus agreement
  # (expand/replace this from Table E4, leave out cirad here?)

Subset                       nGene  nHomolog   Bits  Align
---------------------------------------------------------
All genes
  Mars                       29452     24682    549    448
  Cirad                      29283     24956    522    428
Same locus, same CDS
  Mars                       13519     13381    657    496
  Cirad                      13709     13500    654    496
Same locus, different CDS
  Mars                        8599      7925    484    420
  Cirad                      10646      9300    421    385
Unique locus
  Mars                        7337      3379    269    320
  Cirad                       4928      2156    131    176
---------------------------------------------------------
  Average bit score and alignment bases to 8 plant protein sets, using blastp e-value <= 1e-5.
  Mars and Cirad shared loci are determined with Cirad transcripts aligned to the Matina genome.
  Same locus, same CDS is determined by >=90% CDS exon alignment.
  Same locus, different CDS has lower CDS alignment.
  Unique loci are those with no significant exon overlap between the Mars11 and Cirad1 gene sets.

Table P4. Plant orthology gene groups (OrthoMCL)

       ----------- GROUPS -----------  ---------------- GENES ----------------
        nGroup   OrGrp  OrMis1 UniqGrp   nGene   Orth1   OrDup   Uniq1    UDup
       ------------------------------  ---------------------------------------
poptr    16701   15600      56    1101   41064    7844   19479   10040    3701
cacao    16226   15523      43     703   29408   13700    4929    7726    3053
ricco    15680   15085     228     595   31221   13738    3251   12713    1519
soybn    16431   14216      77    2215   47549    3519   29587    8421    6022
frave    15084   13273     582    1811   34809   11759    4038   12348    6664
vitvi    13856   13124     538     732   26233   11583    4160    8373    2117
arath    13724   12755     330     969   27300    9498    8160    5974    3668
soltu    14158   12341     960    1817   35953    9421    8027    7302   11203
sorbi    11973   10499    1086    1474   27667    7999    6589    8315    4764
------------------------------------------------------------------------------
  Uniq1,UDup    = single-copy and duplicated species-unique genes
  Orth1,OrDup   = single-copy and duplicated orthologous genes
  UniqGrp,OrGrp = species-unique and orthologous groups
  OrMis1        = groups missing in species that all other species have
  Excluded: Alternate transcripts and TE-gene groups.
  # redone 20120324 to remove cacao-TE gene groups, affects only UDup, UniqGrp counts
  # full table at genes10/orthomcl/plant9-orthomcl-gclass.tab

Figure P5. Venn diagram of gene counts in shared gene families for 5 species
  (arath, cacao, frave, poptr, vitvi). It shows the number of genes common to
  all 5 (n=7980), species-unique genes (n=6630 for arath, n=9840 for cacao,
  and more for the other 3), counts of gene families missing from 1 species only
  (107 for cacao is fewest, 150 for poptr, 944 for arath, 957 for frave,
  1041 for vitvi), and pair-wise shared groups.
  genes10/orthomcl/cacao_plant9_genefam_venn.pdf

Table P6. Gene family counts for all 9 species.
  This table lists counts for each species of genes in family clusters.
  genes10/orthomcl/plant9-orthomcl-count.tab

Table P7. Gene family annotation for all 9 species.
  These data describe gene family clusters, with details of each species
  gene ID, name, homology score, and consensus family name.
  genes10/orthomcl/plant9_genes.ugp.txt
  genes10/orthomcl/plant9_genes.ugp_brief.txt : subset of the above, listing in row format
      for each group the species counts, description, and cacao gene ids.
  genes10/orthomcl/plant9_omclgn.consensus_def.txt : table of group IDs and consensus description.

Table P8. Transposon groups in cacao, from subsequent transposon analysis.
  genes10/orthomcl/plant9-cacao-teistrue.ugp_brief.txt
  Lists 16 groups, 2204 genes, of cacao-only genes that are identified as transposon-origin.
  These are the 16 most abundant cacao-only groups, which were previously listed as
  likely transposon groups (file plant9-cacao-telikely.ids).

Table P9. Gene families overabundant in Cacao.
  genes10/orthomcl/overgroups/table.overgroups.cacao.txt
  This table includes gene family ID, number of taxa, number of genes in family,
  number of cacao genes, average and maximum gene count excluding cacao, and
  significance of overabundance or underabundance, with G statistic.

  Among significantly overabundant gene families are
  1. disease resistance proteins
     (group ids: 157, 198, 840, 1694, 2068, 5033, 11985, 12410, 14528)
       > see docs/cacao11genes_summaries-work.txt
       > Table Sn. Disease resistance genes
  2. cytochrome P450 (groups 374, 5041)
  .. more to list ?

  There are also several groups of uncharacterized function that are overabundant
  in cacao, both shared with the other plants and unique to cacao.
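
A G statistic for one gene family, as used in Table P9, can be illustrated with
the sketch below. The exact formulation behind table.overgroups.cacao.txt is not
given in these notes, so this is an assumption: a two-class G-test of goodness
of fit that compares the cacao gene count in a family with the count in the
other species, with expected counts proportional to the total gene numbers.
The function name g_test_family and its arguments are hypothetical.

```python
import math


def g_test_family(cacao_in_family, others_in_family, cacao_total, others_total):
    """Two-class G-test (1 degree of freedom) for family over/under-abundance.

    Expected counts are the family total split in proportion to the total gene
    numbers of cacao and of the other species combined (an assumed null model).
    Returns (G, p_value, direction), direction being "over", "under" or "equal".
    """
    family_total = cacao_in_family + others_in_family
    grand_total = cacao_total + others_total
    if family_total == 0 or cacao_total <= 0 or others_total <= 0:
        raise ValueError("need a non-empty family and positive gene totals")

    expected_cacao = family_total * cacao_total / grand_total
    expected_others = family_total * others_total / grand_total

    g = 0.0
    for observed, expected in (
        (cacao_in_family, expected_cacao),
        (others_in_family, expected_others),
    ):
        if observed > 0:  # a zero count contributes 0 to the sum
            g += observed * math.log(observed / expected)
    g = max(0.0, 2.0 * g)  # guard against tiny negative rounding error

    # Chi-square survival function for 1 degree of freedom
    p_value = math.erfc(math.sqrt(g / 2.0))

    if cacao_in_family > expected_cacao:
        direction = "over"
    elif cacao_in_family < expected_cacao:
        direction = "under"
    else:
        direction = "equal"
    return g, p_value, direction
```

When many gene families are tested this way, the per-family p-values would
normally need a multiple-testing correction before families are called
significantly overabundant.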