See also EvidentialGene sets compared with others for completeness, for 11 species of Arthropods, Plants and Fishes.
      Name                         Last modified       Size

[DIR] Parent Directory             05-Aug-2024 16:20   -
[DIR] arabidopsis/                 30-Oct-2021 16:32   -
[DIR] banana/                      10-Jun-2013 16:09   -
[DIR] blastplants/                 21-May-2017 16:08   -
[DIR] cacao/                       24-Oct-2022 23:07   -
[DIR] corn/                        26-Jan-2017 21:02   -
[DIR] pine/                        20-Dec-2016 12:57   -
[TXT] plant_geneset_qual2014.txt   09-Feb-2014 18:08   7k


An existing dogma in genome projects, that the quality of a gene set depends on the 
quality of the genome assembly, is no longer accurate.

 - mRNA-seq assembly now does as well or better than genome-gene modelling.
   Both together, with methods that emphasize mRNA-seq assembly and address 
   genome-assembly and prediction errors, do the best.

 - The most common problem with first gene sets is fragmented, short and missed genes.
   The plant gene sets in the table below show this, as does the comparison of same-species
   sets: the less complete sets have shorter genes and fewer orthology groups.

 - Cacao genes from the Mars/USDA sponsored project are at top in gene-set completeness.
   These were built using mixed methods that include more mRNA-assembly than genome-gene models.

 - Banana genes assembled only from mRNA are about as complete as the banana genome-gene
   and amborella genome-gene sets.  Banana expressed mRNA-seq has
   about 10% fewer orthology groups than the genome-set, a rate I find
   generally for various species.
 
 - mRNA-seq assembled genes avoid 2 important sources of error: genome-assembly errors, and
   other species' proteins used in genome-gene modelling, which add both
   phylogenetic biases and other gene set mistakes.  It is usual to have genome-gene
   models that closely match the structure of other species' proteins, where those other
   proteins inform or constrain the models, but the models are a mismatch to expressed 
   mRNA-seq assembled genes.  Genome and mRNA gene assembly now use the same basic
   methodology, but genome assembly is a more difficult problem, and lacks the
   validation evidence provided by orthologous species genes.

 - The results of mRNA-assembly of genes are encouraging: both
   orthologous and species-specific genes constructed this way may be
   biologically more valid than genome-predicted genes. Genome-assembly
   related errors are eliminated or reduced, and there is no
   phylogenetically introduced error source.  Given that complete or near
   complete orthology gene sets are recovered, the species-specific genes
   recovered by the same methods are of higher validity.
   
 - The EvidentialGene gene-assembly methods I've developed are available, and some
   are being used successfully in other projects.  I've recently finished a killifish gene set,
   and find it the most complete among 10 fish, including recent ones from NCBI
   and Ensembl.  A recent summary document is
     http://arthropods.eugenes.org/EvidentialGene/about/EvigeneRNA2013poster.pdf
     
- Don Gilbert, 2014 Feb.

  Gene set completeness for plant orthologs
  ranked by completeness (Bitscores, aaSize, nGroup, Tiny)
            Common families   All families  
Geneset     cBits    dSize   aBits   nGroup Tiny   
------------------------------------------------------
cacao1ma     671     15       544    15161  111 (0.7%) 
cotton       653      3       519    15026  153 (1%) 
orange1cn    648      0       499    14249  198 (1.3%) 
poplar       639     -2       512    15130  244 (1.6%)   
castorbean   631     -7       493    14605  460 (3.1%)  
capsella     603      0       435    13397  171 (1.2%)
eucalypt     624     -5       468    13877  312 (2.2%)
soybean      618    -17       477    14559  402 (2.7%) 
arabido.th   600     -1       428    13345  135 (1.0%) 
arabido.ly   604     -1       430    13304  253 (1.9%) 
brassica     594      2       432    13714  283 (2%) 
grape        611    -20       447    13203  726 (5.4%) 
amborella    548     -6       355    11766  489 (4.1%) 
banana1g     542    -19       369    12537  577 (4.6%) 
------------------------------------------------------
    Common families n=7540, All families n=15928
    Bits  = bitscore from blastp, for groups common (cBits) to all and for 
            all (aBits) families with 3+ plants
    dSize = protein size difference from family median
    nGroup = count of orthology families (groups) found in the gene set
    Tiny  = count of tiny protein size outliers (more than 3 sd below family median)
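
The dSize and Tiny columns defined above can be illustrated with a minimal
Python sketch.  This is not the EvidentialGene code; the function name
size_stats, the input layout (family id -> {gene set name: protein length in aa}),
and the example data are all hypothetical.  It also assumes that dSize is the
mean size difference across families, and that the spread is the sample standard
deviation of all proteins in the family; the original text does not specify
either choice.

```python
from statistics import mean, median, stdev


def size_stats(families, geneset, nsd=3.0):
    """Return (dSize, tiny) for one gene set.

    families: dict of family id -> {gene set name: protein length (aa)}  (assumed layout)
    dSize:    mean over families of (protein length - family median length)  (assumed)
    tiny:     number of families where the protein is more than `nsd`
              standard deviations below the family median
    """
    diffs = []
    tiny = 0
    for members in families.values():
        # Skip families lacking this gene set, or too small to estimate a spread.
        if geneset not in members or len(members) < 3:
            continue
        sizes = list(members.values())
        med = median(sizes)
        sd = stdev(sizes)  # sample standard deviation over the whole family
        size = members[geneset]
        diffs.append(size - med)
        if sd > 0 and size < med - nsd * sd:
            tiny += 1
    dsize = mean(diffs) if diffs else 0.0
    return dsize, tiny


# Hypothetical example: 20 reference proteins of 480..520 aa per family,
# plus one "target" gene set that is fragmented in fam2.
refs = {f"ref{i}": 480 + 10 * (i % 5) for i in range(20)}
families = {
    "fam1": {**refs, "target": 520},
    "fam2": {**refs, "target": 100},  # short fragment, expected to count as tiny
    "fam3": {**refs, "target": 480},
}

dsize, tiny = size_stats(families, "target")
print(f"dSize={dsize:.1f} tiny={tiny}")
```

Because the family median and standard deviation here include the gene set
being scored, a very short protein inflates the spread it is tested against;
in small families that makes the 3 sd cutoff hard to reach, so this sketch is
only meaningful for families with many members.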

Notes: cacao1ma, orange1cn, banana1g are the best of 2 independent gene sets for 
those species.  cotton is a close relative of cacao and its gene set was 
built using the cacao1ma gene set (among others).  Bitscores are influenced
by phylogeny as well as quality; scores by alignment (somewhat less phylo-dependent)
show the same ordering.  Protein size is closely and positively correlated with bitscore.
Ranking quality by protein size and orthology families (nGroup) gives a similar
result, but arabido.th and brassica move up to the middle (6th, 7th).

  Gene set completeness for plant orthologs
  comparing 2 independent gene sets for 3 species
              Common families      All families  
Geneset     cBits    dSize      aBits  nGroup  Tiny   
--------------------------------------------------------
cacao1ma     653     15         547    15161  112 (0.7%) 
cacao1cr     641     11         530    14897  235 (1.5%) 
orange1cn    629     0          502    14249  199 (1.3%) 
orange1jg    610     -21        480    14039  658 (4.6%) 
banana1g     522     -19        371    12537  577 (4.6%) 
banana1e     521     -21        349    11733  880 (7.5%) 
--------------------------------------------------------
    Common families n=8461, All families n=15838

Plant comparison gene sets
  amborella = amborella genome-gene predictions
              BioProject PRJNA212863, http://www.amborella.org/, doi:10.1126/science.1241089 
  banana1g = Banana genome-gene predictions
             BioProject PRJNA81189, http://www.musagenomics.org/, doi:10.1038/nature11241
  banana1e = Banana mRNA-seq only assembly with Evigene
             http://arthropods.eugenes.org/EvidentialGene/plants/banana/
  cacao1cr = Cacao Cirad genome-gene predictions
             http://cocoagendb.cirad.fr/ doi:10.1038/ng.736
  cacao1ma = Cacao Mars mRNA-assembly + genome-genes with Evigene
              BioProject PRJNA51633,  http://arthropods.eugenes.org/EvidentialGene/plants/cacao/ doi:10.1186/gb-2013-14-6-r53
  orange1cn = Sweet orange, Cn genome-genes gene set 
              BioProject PRJNA86123, http://citrus.hzau.edu.cn/orange, doi:10.1038/ng.2472
  orange1jg = Sweet orange, JGI genome-genes gene set
              http://www.phytozome.net/citrus.php
              
  arath = arabido.th, arabidopsis TAIR10,
  poptr = poplar, Populus poptr_Ptrichocarpa_156 JGI phytozome
  ricco = castorbean, Ricinus v0.1 from castorbean.jcvi.org
  soybn = soybean, soybn_Gmax_109 JGI phytozome
  vitvi = grape, vitvi_Vvinifera_145 JGI phytozome
  soltu = potato, Solanum v3.4 from potatogenomics.plantbiology.msu.edu/
  sorbi = sorghum, sorbi_Sbicolor_79 JGI phytozome

  cotton = gossypium phytozome/v9.0/Graimondii/
  capsella = phytozome/v9.0/Crubella/
  eucalyptus = phytozome/v9.0/Egrandis/
  brassica = phytozome/v9.0/Brapa/
  arabido.ly = phytozome/v9.0/Alyrata/
................................................................................  

Orthology reference set of 8 plants: 
  arath TAIR10 35k, poplar 45k, castorbn 31k, soybn 55k, grape 26k, strawb 35k, potato 39k, sorghum 29k



Developed at the Genome Informatics Lab of Indiana University Biology Department