Animal and Plant gene set reconstructions with EvidentialGene:
Comparisons to other popular and recent gene reconstructions.
D.G. Gilbert, gilbertd at indiana.edu, 2016/2017
Recent plant and animal EvidentialGene constructions surpass PacBio,
Maker, NCBI and Trinity methods for Arabidopsis, corn,
white fly and water flea.
In comparison to gene sets of these other commonly used methods,
the Evigene methods are more accurate at recovering genes as measured by
homology across species and by expression data.
In particular, for 3 plant species, Illumina RNA assemblies done
according to Evigene methods surpass PacBio RNA genes not only in total gene set
accuracy but also in per-locus accuracy (at loci where both methods recover
transcripts) for primary, alternate and paralog transcript reconstruction.
Trinity assemblies of Illumina RNA are likewise incomplete compared to
Evigene's multiple-assembler/reduction approach.
In comparison to genome-modeled gene sets, derived from many
sources of gene evidence (prediction from chromosomes, RNA, other species proteins),
Evigene's RNA-only constructions often surpass the accuracy of those modeled
genes. This is likely due to the greater complexity of merging many evidence
sources in modeled genes, with greater chances of mis-modeling.
Evigene Illumina-RNA versus PacBio RNA comparisons include
the Arabidopsis model plant and
Zea mays corn, both summarized below, as well as
pine trees.
Evigene versus Trinity-only comparisons include these plants and animals such as
Bemisia white fly,
Daphnia water fleas,
Aedes and Anopheles mosquitoes,
Honey bee,
mice, fishes and others (including several by independent authors of animal
and plant gene sets).
Evigene versus genome modeled sets include those produced by NCBI EGAP,
MAKER software, AUGUSTUS and similar gene modelers, for Arabidopsis, corn,
pine and other plants, and animals including mosquitoes, water fleas, honey bee,
and others.
The zebrafish model animal is added with a 2017-Dec reconstruction by Evigene methods,
compared with the modeled gene sets of NCBI and Ensembl, surpassing both on
average for complete fish/vertebrate protein homology and intron recovery. This
draft Evigene zebrafish set contains incomplete genes, however, because, as with
Arabidopsis, only a small subset of RNA data was used.
1. Plant model Arabidopsis thal. gene reconstructions ... evigene2017_arabidopsis
Gene assemblies of Illumina RNA-seq vs PacBio
           AtAraport genes   Cacao genes       Introns
Geneset    Found%  AlignT%   Found%  AlignF%   Found%
AtAraport  --      --        88.7    70.7      88.1
AtEvigene  95.4    95.0      89.1    70.3      87.5
AtOases    90.0    91.2      na      na        81.1
AtIDBAtr   89.5    89.1      na      na        80.7
AtSOAPtr   88.9    87.0      na      na        79.1
AtTrinity  88.4    84.1      na      na        81.4
AtPacBio   58.1    48.2      64.2    60.5      56.3
--------------------------------------------------------
2. Corn Zea mays gene reconstructions ... evigene2016_corn
Gene assemblies of Illumina, PacBio, and genes modeled on chromosome assembly
           Sorghum genes     Introns
Geneset    Found%  AlignT%   Found%
ZmEvigene  82.9    91.1      68.7
ZmGramene  81.9    90.3      68.1
ZmNCBI     81.3    89.6      na
ZmPacBio   78.0    82.4      68.2
ZmJgi4     77.6    81.2      68.9
------------------------------------
3. White fly gene reconstructions ... evigene2016_whitefly
Bemisia tabaci (cotton/crop plant pest)
           Reference species               RNA
           Pea aphid       Fruit fly       Introns
Geneset    Found%  AlnT%   Found%  AlnT%   Found%
BtEvigene  81.2    88.0    74.1    74.9    68.5
BtNCBI     79.7    82.3    73.4    71.6    69.4
BtMaker    77.4    73.8    72.1    66.0    57.7
BtTrinity  73.5    59.2    68.0    53.2    50.5
----------------------------------------------
4. Water flea Daphnia pulex gene reconstructions ... evigene2017_daphnia_pulex
           Reference species               RNA
           Daphnia magna   Fruit fly       Introns
Geneset    Found%  AlnT%   Found%  AlnT%   Found%
DpEvigene  72.0    88.6    67.9    80.3    66.6
DpMaker    58.9    69.9    64.3    74.5    46.7
----------------------------------------------
5. Zebrafish model Danio rerio gene reconstructions ... zebrafish17evigene
Evigene RNA assemblies vs NCBI, Ensembl genome-gene models
           Cavefish               Human genes            Vertebrate_BUSCO    Introns
Geneset    Found%  AlnT%  Frag%   Found%  AlnT%  Frag%   Align  Miss  Frag   Found%
DrEvigene  97.0    96.5   0.5     87.5    90.8   0.5     446.8  9     5      81.6
DrNCBI     93.9    92.7   3.9     86.9    89.3   1.2     434.6  19    13     76.4
DrEnsembl  93.1    90.3   5.7     86.3    88.4   2.2     428.2  29    47     57.6
------------------------------ ------------------ ----------------- -----
6. Pig, Sus scrofa, gene reconstructions compared
           Human genes            Vertebrate_BUSCO
Geneset    Align%  Miss%  Frag%   Align  Miss  Frag
SsEvigene  97.0    0.7    1.4     447    8     10
SsNCBI     96.0    0.7    0.7     440    17    2
SsEnsembl  95.0    0.9    1.1     431    14    20
----------------------------- -----------------
Arabidopsis gene sets
AtAraport = public gene set of 2016 of Arabidopsis thal. from Araport.org
AtEvigene = Evigene classification/reduction of Illumina RNA assemblies
http://arthropods.eugenes.org/EvidentialGene/plants/arabidopsis/evigene2017_arabidopsis/
AtOases = Velvet/Oases assembly of Illumina RNA,
AtIDBAtr = idba_tran assembly of Illumina RNA,
AtSOAPtr = SOAP-Trans assembly of Illumina RNA,
AtTrinity = Trinity assembly of Illumina RNA,
AtPacBio = PacBio "no-assembly" assembly (PacBio xxx method) of PacBio RNA data
Zea mays gene sets
ZmEvigene = Evigene Zeamay5fEVm 2016 assembly of Illumina RNA-seq, public at
http://arthropods.eugenes.org/EvidentialGene/plants/corn/evg5corn/
ZmGramene = Ensembl/Gramene 2016.09 Zm000nnnn,
ZmPacBio = CSHL/Gramene PacBio gene assemblies of 2016 as SRA entries SRR3147024..054,
ZmNCBI = NCBI 2014 refgen zeamay
ZmJgi4 = JGI Rnnotator assembly set of Illumina RNA-Seq, 2014
Bemisia tabaci gene sets
BtEvigene = Evigene gene assembly, 2016 update (vers 3), available [soon] at
http://arthropods.eugenes.org/EvidentialGene/arthropods/whitefly/whitefly3evigene/
BtNCBI = NCBI RefSeq gene models, 2016
BtMaker = Whitefly genome project genes modeled with MAKER, 2016, whiteflygenomics.org
BtTrinity = TSA.GBII gene assembly 2015, Trinity of Illumina
Daphnia pulex gene sets
DpEvigene = Evigene genes of 2017 (DpEvig7) from
http://arthropods.eugenes.org/EvidentialGene/daphnia/daphnia_pulex/daphnia_pulex_genes2017/
DpMaker = MAKER genes of 2017 (DpMaker7) from report of doi:10.1534/g3.116.038638
Danio rerio gene sets
DrEvigene = Evigene gene assembly, 2017 Dec
http://eugenes.org/EvidentialGene/vertebrates/zebrafish/zebrafish17evigene/
DrNCBI = NCBI RefSeq gene models, 2016 Dec, accession GCF_000002035.5_GRCz10
DrEnsembl = Ensembl gene models, 2017 Nov
Pig gene sets
SsEvigene = http://eugenes.org/EvidentialGene/vertebrates/pig/pig18evigene/
SsNCBI = NCBI RefSeq genes for the pig, 2018-May
SsEnsembl = Ensembl genes for the pig, 2018-June
Human RefSeq 2018 genes: 37,868 proteins matched, at 19122 of 20011 loci
Vertebrate BUSCO set of 2586 proteins of OrthoDB v9
Measures
Genes Found% = percent of reference genes with a significant alignment to the gene set (BLASTp/n of proteins or CDS),
Genes AlnT% (AlignT%) = percent of reference gene bases that are aligned to the gene set,
Introns Found% = percent of evidence introns aligned to gene set exons,
with intron evidence from Illumina RNA-seq mapped to chromosome assemblies
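As a minimal sketch of the two gene measures above, the Python below computes Found% and AlnT% from BLAST tabular output (the 12 default columns, reference genes as queries). It is an illustration only, not Evigene's own scripts. The function names, the e-value cutoff of 1e-5, the use of the single best-scoring hit per reference gene, and computing AlnT% over found genes only are all assumptions made here; the toy data are invented.

```python
import csv
import io


def read_blast_tab(handle):
    """Read BLAST tabular rows (12 default columns), skipping comment lines."""
    return [row for row in csv.reader(handle, delimiter="\t")
            if row and not row[0].startswith("#")]


def gene_recovery(blast_rows, ref_lengths, max_evalue=1e-5):
    """Return (found_pct, aligned_pct) for reference genes used as queries.

    Assumptions of this sketch: a gene is "found" if it has any hit with
    e-value <= max_evalue; its aligned span is taken from the best-bitscore
    hit; aligned_pct is computed over the found genes only.
    """
    best = {}  # reference id -> (bitscore, aligned span on the reference)
    for row in blast_rows:
        ref_id = row[0]
        if ref_id not in ref_lengths:
            continue
        qstart, qend = int(row[6]), int(row[7])
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        span = abs(qend - qstart) + 1
        if ref_id not in best or bitscore > best[ref_id][0]:
            best[ref_id] = (bitscore, span)

    n_ref = len(ref_lengths)
    found_pct = 100.0 * len(best) / n_ref if n_ref else 0.0
    aligned = sum(min(span, ref_lengths[r]) for r, (_, span) in best.items())
    total = sum(ref_lengths[r] for r in best)
    aligned_pct = 100.0 * aligned / total if total else 0.0
    return found_pct, aligned_pct


# Toy data, invented for illustration (not taken from the tables above).
toy_blast = io.StringIO(
    "refA\ttx1\t98.0\t270\t5\t0\t1\t270\t1\t270\t1e-50\t500\n"
    "refA\ttx2\t90.0\t150\t15\t0\t1\t150\t1\t150\t1e-20\t200\n"
    "refB\ttx3\t95.0\t100\t5\t0\t11\t110\t1\t100\t1e-30\t180\n"
    "refC\ttx4\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t20\n"
)
toy_lengths = {"refA": 300, "refB": 200, "refC": 100, "refD": 400}

found_pct, aligned_pct = gene_recovery(read_blast_tab(toy_blast), toy_lengths)
print(f"Found% = {found_pct:.1f}, AlnT% = {aligned_pct:.1f}")
```

Because AlnT% is taken over found genes only in this sketch, it can exceed Found%, as it does in several of the tables; whether the published figures were computed the same way is not stated here.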
Further details are in evigene_plantsanimals_2017.txt
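
As a minimal illustration of the Introns Found% measure, the Python sketch below counts evidence introns that are matched by introns implied by the exons of a gene set located on chromosomes. It is not Evigene's own code. The function names, the 1-based inclusive coordinates, the exact-match rule on both splice sites, and the omission of strand are assumptions of this sketch; the coordinates are invented.

```python
def introns_from_exons(exons):
    """Return introns (start, end) implied by the gaps between sorted exons.

    Exons are (start, end) pairs, 1-based inclusive, on one chromosome.
    """
    ordered = sorted(exons)
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
            if next_start - prev_end > 1]


def intron_found_pct(evidence_introns, transcripts):
    """Percent of evidence introns exactly matched by gene set introns.

    evidence_introns: set of (chrom, start, end) from mapped RNA-seq.
    transcripts: iterable of (chrom, exon_list) for the gene set.
    Exact matching of both splice sites is an assumption of this sketch.
    """
    model_introns = set()
    for chrom, exons in transcripts:
        for start, end in introns_from_exons(exons):
            model_introns.add((chrom, start, end))
    if not evidence_introns:
        return 0.0
    hits = sum(1 for intron in evidence_introns if intron in model_introns)
    return 100.0 * hits / len(evidence_introns)


# Toy coordinates, invented for illustration.
toy_evidence = {
    ("chr1", 101, 199),
    ("chr1", 301, 399),
    ("chr1", 501, 599),
    ("chr2", 51, 99),
}
toy_transcripts = [
    ("chr1", [(1, 100), (200, 300), (400, 500)]),
    ("chr2", [(1, 50), (100, 150)]),
]

intron_pct = intron_found_pct(toy_evidence, toy_transcripts)
print(f"Introns Found% = {intron_pct:.1f}")
```

For RNA-assembly gene sets, the exon coordinates would come from mapping the assembled transcripts to the chromosome assembly first; that step is outside this sketch.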