EvidentialGene gene construction for Zea mays corn plant

The EvidentialGene gene set for maize is more accurate and complete than other published maize gene sets, as measured by orthology. A quality comparison ranks Evigene genes above the corn gene sets of Gramene (Ensembl/Maker models), PacBio gene assemblies (2016), NCBI gene models, and JGI gene assemblies, for primary ortholog loci and also for alternate transcripts and duplicate genes. See evigene_maize_info/corngenes_qualsum/. These Evigene sets are in-progress drafts as of 2016-10 and may be improved. .. Don Gilbert, 18 Oct 2016, gilbertd at

EvidentialGene is a genome informatics project and pipeline for gene set construction with measurably high accuracy and completeness, compared with other gene informatics methods used for animals and plants. See or

Gene orthology accuracy and completeness, measured by protein homology to reference species genes, are summarized here for the EvidentialGene gene sets of Zea mays (corn plant) in comparison with other maize gene sets: Zea_mays.AGPv4 by Gramene/Ensembl, B73_RefGen_v3 by NCBI, and genes assembled with Rnnotator by JGI. The EvidentialGene gene assembly uses the same published RNA-seq, assembles it with 4 gene assemblers, then reduces the result to a concise and accurate locus/alternate gene set. In these tests, Evigene produced the more accurate gene sets, with minimal time and effort.

Zea_mays gene sets compared

Zea_mays x REFERENCE Arabidopsis thal. model (Araport 2015 version, ngene=28902)
          Evigene5   Gramene4   NCBIRef3   JGI14denovo
Found     80.9%      80.4%      80.6%      79.2%
Align     91.7%      89.3%      89.0%      84.4%
AlignAA   428        412        405        388

Zea_mays x REFERENCE Sorghum (Sbicolor_313 v3.1 of JGI Phytozome, ngene=31054)
          Evigene5   Gramene4   NCBIRef3   JGI14denovo
Found     83.4%      82.0%      80.0%      77.3%
Align     93.8%      91.7%      89.7%      82.4%
AlignAA   436        419        409        381

Component assemblers used for Evigene x Sorghum REFERENCE 
          Velv/Oases  idba_tran  SOAPtrans  Trinity
Found     80.0%       78.6%      78.0%      77.8%
Align     88.6%       86.0%      86.0%      83.7%
AlignAA   413         398        400        388

Maize gene sets compared:
a. Evigene5 evg5corn, genes de-novo assembled and classified with Evigene methods using four gene assemblers and 3 Illumina RNA-seq sets (JGI-2014 PRJNA168080, CSHL-2016 PRJEB10406, UCBerkeley-2016 PRJNA306885). The ohnolog/paralog loci are resolved with locations on the chromosome assembly of (c). Gene loci=50963, mRNA=231177 (a0. evg4corn loci=42597)
b. Gramene4 = gene set Zea_mays.AGPv4.32 from Gramene/Ensembl, 2016, MAKER modelled on chr assembly. Gene loci=39310, mRNA=149669
c. NCBIRef3, genes/genome release B73_RefGen_v3, 2013, from NCBI reference genomes. Gene loci=39873, mRNA=58277
d. JGI14_denovo_maize, genes assembled with Rnnotator from JGI, doi:10.1038/srep04519, 2014, RNA-seq from maize seedling, 250 M Illumina pairs. Gene loci=133756, mRNA=187045

Found = % reference proteins with significant alignment to test gene sets
Align = % alignment of target protein sets to reference proteins
AlignAA = average alignment size (in aminos) to reference proteins
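
The three statistics defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Evigene implementation: the `BestHit` record, the `orthology_summary` function, and the choice to compute Align as total aligned aminos over total reference length of the found proteins are all hypothetical, since the text here does not give the exact formula. It assumes the significance filter on alignments has already been applied upstream.

```python
from dataclasses import dataclass


@dataclass
class BestHit:
    """One significant alignment of a test-set protein to a reference protein.

    Hypothetical record type for illustration only.
    """
    ref_id: str     # reference protein identifier
    ref_len: int    # reference protein length, in amino acids
    align_len: int  # aligned span on the reference, in amino acids


def orthology_summary(hits, n_ref):
    """Summarize Found, Align and AlignAA for one test gene set.

    hits  : iterable of BestHit, already filtered for significance
    n_ref : total number of reference proteins (e.g. ngene of the reference)
    """
    # Keep only the longest alignment per reference protein.
    best = {}
    for h in hits:
        prev = best.get(h.ref_id)
        if prev is None or h.align_len > prev.align_len:
            best[h.ref_id] = h

    n_found = len(best)
    if n_found == 0 or n_ref == 0:
        return {"Found": 0.0, "Align": 0.0, "AlignAA": 0.0}

    total_align = sum(h.align_len for h in best.values())
    total_ref = sum(h.ref_len for h in best.values())
    return {
        # % of reference proteins with a significant alignment
        "Found": 100.0 * n_found / n_ref,
        # ASSUMPTION: % of reference length covered, over found proteins only
        "Align": 100.0 * total_align / total_ref,
        # average alignment size in amino acids, over found proteins
        "AlignAA": total_align / n_found,
    }
```

Taking only the longest alignment per reference protein mirrors the idea of scoring each reference gene once; a real comparison would also need to decide how to treat alternate transcripts and partial hits.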

Evigene ref: Gilbert, Donald (2013) Gene-omes built from mRNA seq not genome DNA.
7th annual arthropod genomics symposium. Notre Dame; doi:10.7490/f1000research.1112594.1

Case: Duplicate genes from chromosome duplication

Reliable homeologous genes (ohnologs) in maize that are conserved with single loci in rice, sorghum and Arabidopsis are identified by Schnable et al. (doi:10.1073/pnas.1101368108). These are 1750 paired loci, with each member of a pair on a separate chromosome (3500 loci). Of these, 1661 paired loci are identified in corn gene sets via alignment to sorghum loci. See further details in corn ohnolog and alternate transcript reconstruction.

Zea_mays Ohnologs x REFERENCE Sorghum

          Evigene   NCBIv3   JGI14
Found     3201      3218     3111
Miss1     25        66       83
Mixup     26        0        200
Align     86.8%     87.9%    83.7%
Sorghum n=1661, corn loci n=3322
Found = contains ohnolog gene model (align >= 25%)
Miss1 = missing locus model that other two gene sets contain
Mixup = transcripts on separate chromosomes classed as alternates of one locus
Align = % alignment of target protein sets to reference proteins
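
The Miss1 and Mixup definitions above can be illustrated with a short Python sketch. The function names and input structures here are hypothetical, chosen only for illustration, and the sketch flags mixups per locus, whereas the table may count them differently.

```python
def miss1_counts(found_by_set):
    """Count, per gene set, the ohnolog loci it lacks that all other sets contain.

    found_by_set : dict mapping gene set name -> set of ohnolog locus ids
                   that the gene set contains (e.g. align >= 25%).
    """
    counts = {}
    for name, found in found_by_set.items():
        others = [s for n, s in found_by_set.items() if n != name]
        # Loci present in every other gene set
        shared = set.intersection(*others) if others else set()
        counts[name] = len(shared - found)
    return counts


def mixup_loci(transcript_chroms_by_locus):
    """Return loci whose alternate transcripts map to more than one chromosome.

    transcript_chroms_by_locus : dict mapping locus id -> list of chromosome
                                 names, one per transcript classed at that locus.
    """
    return sorted(
        locus
        for locus, chroms in transcript_chroms_by_locus.items()
        if len(set(chroms)) > 1
    )
```

With three gene sets, as in the table, Miss1 for one set is the number of loci found by both of the other two but absent from it, so a locus missed by two sets at once is not counted for either.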

Case: Mediator of RNA polymerase II transcription subunit gene family

Mediator of RNA polymerase II transcription subunit genes are a well-conserved animal and plant gene family of 25 to 30 loci, ranging in size from 2000 aa to 100 aa. The Arabidopsis reference contains 44 of these loci, though several are of uncertain or weak association. Of those, 36 are found in the Sorghum reference set, and 36 across all maize gene sets, at >= 25% protein alignment identity. Generally these are all well-expressed housekeeping genes, and all gene sets should be able to find and assemble/model them. In gene set comparisons this is not always so: modellers and assemblers are prone to miss some. The shorter ones may be joined to the ends of other genes, and the longer ones may be only partly assembled/modelled.

          ------ Gene Sets ------   ---- Gene Assemblers ----
Stat.     Evigene  NCBIv3  JGI14    Oases  IDBA  SOAP  Trinity
Found     36       31      32       36     36    34    34
Align%    92       82      79       92     86    86    83
AlignAA   504      423     422      500    478   480   456

-- Don Gilbert, gilbertd at_indiana_edu
-- update: 12 Oct 2016 to evg5corn set, adding 2 RNA data sets that sample several tissues, to fill in expressed genes (JGI-14 RNA of evg4 uses only seedling tops)
-- update: 01 Sept 2016 to evg4corn2g set, orig: 16 Jul 2016
-- update primarily involved resolution of ohnologs (high-identity paralogs in recent chromosome duplications), with a small improvement to orthology statistics vs the other 2 corn gene sets. No new gene assemblies were added, only reclassified as to locus. v4.1/1607 has gene loci=42387, mRNA=245534, about the same counts as this 4.2 update.

