See also
EvidentialGene sets compared with others for completeness, for 11 species of Arthropods, Plants and Fishes.
Name Last modified Size
Parent Directory 05-Aug-2024 16:20 -
arabidopsis/ 30-Oct-2021 16:32 -
banana/ 10-Jun-2013 16:09 -
blastplants/ 21-May-2017 16:08 -
cacao/ 24-Oct-2022 23:07 -
corn/ 26-Jan-2017 21:02 -
pine/ 20-Dec-2016 12:57 -
plant_geneset_qual2014.txt 09-Feb-2014 18:08 7k
An existing dogma in genome projects, that the quality of a gene set depends on the
quality of the genome assembly, is no longer accurate.
- mRNA-seq assembly now does as well or better than genome-gene modelling.
Both together, with methods that emphasize mRNA-seq assembly and address
genome-assembly and prediction errors, do the best.
- The most common problem with first gene sets is fragmented, short and missed genes.
The plant gene sets in the table below show this, as does the comparison of same-species
sets: the less complete sets have shorter genes and fewer orthology groups.
- Cacao genes from the Mars/USDA sponsored project are at top in gene-set completeness.
These were built using mixed methods that include more mRNA-assembly than genome-gene models.
- Banana genes assembled only from mRNA are about as complete as the banana genome-gene
and amborella genome-gene sets. Banana expressed mRNA-seq has
about 10% fewer orthology groups than the genome-set, a rate I find
generally for various species.
- mRNA-seq assembled genes avoid 2 important errors: from genome-assembly, and
from other species proteins used in genome-gene modelling, which add both
phylogenetic biases and other gene set mistakes. It is usual to have genome-gene
models that closely match structure of other species proteins, where those other proteins
inform or constrain the models, but the models are a mismatch to expressed
mRNA-seq assembled genes. Genome and mRNA gene assembly now use same basic
methodology, but genome assembly is a more difficult problem, and lacks the
validation evidence provided by orthologous species genes.
- Results from mRNA-assembly of genes are encouraging: both
orthologous and species-specific genes constructed this way may be
biologically more valid than genome-predicted genes. Genome-assembly
related errors are eliminated or reduced, and there is no
phylogenetically introduced error source. Given that complete or near
complete orthology gene sets are recovered, the species-specific genes
recovered by the same methods are of higher validity.
- EvidentialGene gene-assembly methods I've developed are available, and some
are being used successfully in other projects. I've recently finished a killifish gene set,
and find it the most complete among 10 fish, including recent ones from NCBI
and Ensembl. A recent summary document is
http://arthropods.eugenes.org/EvidentialGene/about/EvigeneRNA2013poster.pdf
- Don Gilbert, 2014 Feb.
Gene set completeness for plant orthologs
ranked by completeness (Bitscores, aaSize, nGroup, Tiny)
            Common families   All families
Geneset      cBits  dSize     aBits  nGroup   Tiny
------------------------------------------------------
cacao1ma       671     15       544   15161    111 (0.7%)
cotton         653      3       519   15026    153 (1%)
orange1cn      648      0       499   14249    198 (1.3%)
poplar         639     -2       512   15130    244 (1.6%)
castorbean     631     -7       493   14605    460 (3.1%)
capsella       603      0       435   13397    171 (1.2%)
eucalypt       624     -5       468   13877    312 (2.2%)
soybean        618    -17       477   14559    402 (2.7%)
arabido.th     600     -1       428   13345    135 (1.0%)
arabido.ly     604     -1       430   13304    253 (1.9%)
brassica       594      2       432   13714    283 (2%)
grape          611    -20       447   13203    726 (5.4%)
amborella      548     -6       355   11766    489 (4.1%)
banana1g       542    -19       369   12537    577 (4.6%)
------------------------------------------------------
Common families n=7540, All families n=15928
Bits = bitscore from blastp, for families common to all (cBits) and for
all families with 3+ plants (aBits)
dSize = protein size difference from family median
nGroup = count of orthology families (groups)
Tiny = count of tiny protein size outliers (more than 3 sd below family median)
Notes: cacao1ma, orange1cn, banana1g are the best of 2 independent gene sets for
those species. Cotton is a close relative of cacao, and its gene set was
built using the cacao1ma gene set (among others). Bitscores are influenced
by phylogeny as well as quality; scores by alignment (somewhat less phylo-dependent)
show the same ordering. Protein size is closely and positively correlated with bitscore.
Ranking quality by protein size and orthology families (nGroup) gives a similar
result, but arabido.th and brassica move up to the middle (6th, 7th).
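The dSize and Tiny definitions above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Evigene implementation: the input layout (family id mapped to (geneset, protein length) pairs), the function name size_stats, the use of a population standard deviation over all family members, and the averaging of dSize per gene set are all hypothetical choices.

```python
from collections import defaultdict
from statistics import median, pstdev


def size_stats(families, min_members=3):
    """Return {geneset: (mean dSize, Tiny count)}.

    families: dict of family id -> list of (geneset, protein_length).
    This input layout is an assumption made for this sketch.
    """
    dsize = defaultdict(list)   # per gene set: size differences from family median
    tiny = defaultdict(int)     # per gene set: count of tiny outliers
    for members in families.values():
        if len(members) < min_members:
            continue            # the legend only considers families with 3+ plants
        sizes = [length for _, length in members]
        med = median(sizes)
        sd = pstdev(sizes)      # assumption: sd taken over all family members
        for geneset, length in members:
            dsize[geneset].append(length - med)
            # Tiny: protein more than 3 sd below the family median
            if sd > 0 and length < med - 3 * sd:
                tiny[geneset] += 1
    return {g: (sum(d) / len(d), tiny[g]) for g, d in dsize.items()}


if __name__ == "__main__":
    # Toy family: 13 proteins of length 500 and one fragment of length 100.
    toy = {"fam1": [("set%d" % i, 500) for i in range(13)] + [("frag", 100)]}
    for geneset, (mean_dsize, n_tiny) in sorted(size_stats(toy).items()):
        print(geneset, mean_dsize, n_tiny)
```

With a standard deviation computed this way, a single short protein can only fall 3 sd below the median in families with enough members, so a real analysis might prefer a different spread estimate.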
Gene set completeness for plant orthologs
comparing 2 independent gene sets for 3 species
            Common families   All families
Geneset      cBits  dSize     aBits  nGroup   Tiny
--------------------------------------------------------
cacao1ma       653     15       547   15161    112 (0.7%)
cacao1cr       641     11       530   14897    235 (1.5%)
orange1cn      629      0       502   14249    199 (1.3%)
orange1jg      610    -21       480   14039    658 (4.6%)
banana1g       522    -19       371   12537    577 (4.6%)
banana1e       521    -21       349   11733    880 (7.5%)
--------------------------------------------------------
Common families n=8461, All families n=15838
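The paired rows above can be compared with a little arithmetic. A minimal sketch, using nGroup and Tiny values copied from the paired table; the function names are hypothetical, and reading the parenthesized Tiny percentage as Tiny divided by nGroup is an assumption that agrees with the listed values to within 0.1 percentage point.

```python
# Values copied from the paired table: (geneset, nGroup, Tiny, listed Tiny %)
PAIRS = {
    "cacao": [("cacao1ma", 15161, 112, 0.7), ("cacao1cr", 14897, 235, 1.5)],
    "orange": [("orange1cn", 14249, 199, 1.3), ("orange1jg", 14039, 658, 4.6)],
    "banana": [("banana1g", 12537, 577, 4.6), ("banana1e", 11733, 880, 7.5)],
}


def tiny_percent(ngroup, tiny):
    """Tiny outliers as a percentage of orthology groups (assumed definition)."""
    return 100.0 * tiny / ngroup


def group_deficit(best_ngroup, other_ngroup):
    """Percent fewer orthology groups in the second set relative to the first."""
    return 100.0 * (best_ngroup - other_ngroup) / best_ngroup


if __name__ == "__main__":
    for species, (best, other) in PAIRS.items():
        deficit = group_deficit(best[1], other[1])
        print(f"{species}: {other[0]} has {deficit:.1f}% fewer groups than {best[0]}; "
              f"Tiny {tiny_percent(best[1], best[2]):.2f}% vs "
              f"{tiny_percent(other[1], other[2]):.2f}%")
```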
Plant comparison gene sets
amborella = amborella genome-gene predictions
BioProject PRJNA212863, http://www.amborella.org/, doi:10.1126/science.1241089
banana1g = Banana genome-gene predictions
BioProject PRJNA81189, http://www.musagenomics.org/, doi:10.1038/nature11241
banana1e = Banana mRNA-seq only assembly with Evigene
http://arthropods.eugenes.org/EvidentialGene/plants/banana/
cacao1cr = Cacao Cirad genome-gene predictions
http://cocoagendb.cirad.fr/ doi:10.1038/ng.736
cacao1ma = Cacao Mars mRNA-assembly + genome-genes with Evigene
BioProject PRJNA51633, http://arthropods.eugenes.org/EvidentialGene/plants/cacao/ doi:10.1186/gb-2013-14-6-r53
orange1cn = Sweet orange, Cn genome-genes gene set
BioProject PRJNA86123, http://citrus.hzau.edu.cn/orange, doi:10.1038/ng.2472
orange1jg = Sweet orange, JGI genome-genes gene set
http://www.phytozome.net/citrus.php
arath = arabido.th, arabidopsis TAIR10
poptr = poplar, Populus poptr_Ptrichocarpa_156 JGI phytozome
ricco = castorbean, Ricinus v0.1 from castorbean.jcvi.org
soybn = soybean, soybn_Gmax_109 JGI phytozome
vitvi = grape, vitvi_Vvinifera_145 JGI phytozome
soltu = potato, Solanum v3.4 from potatogenomics.plantbiology.msu.edu/
sorbi = sorghum, sorbi_Sbicolor_79 JGI phytozome
cotton = gossypium phytozome/v9.0/Graimondii/
capsella = phytozome/v9.0/Crubella/
eucalypt = eucalyptus, phytozome/v9.0/Egrandis/
brassica = phytozome/v9.0/Brapa/
arabido.ly = phytozome/v9.0/Alyrata/
................................................................................
Orthology reference set of 8 plants:
arath TAIR10 35k, poplar 45k, castorbn 31k, soybn 55k, grape 26k, strawb 35k, potato 39k, sorghum 29k