EvidentialGene: Orthology completeness of Arthropod gene assemblies
These EvidentialGene sets for 12 arthropods, fishes and plants are the
most orthology-complete when compared to gene sets of the same or related
species produced by well-regarded gene discovery methods (NCBI, Ensembl,
Augustus, Maker, Glean, ...).
Orthology completeness measures the presence and fullness of genes with
protein homology to related species. It is measured for the subset
of genes with orthology to about 9,000 gene families that are shared
by 10 species (8 arthropods plus human and a fish). A similar
ranking of gene set completeness is obtained using conserved protein
domains (NCBI CDD).
mRNA-assembled genes are
constructed without reference protein information and without genome
data. Genome gene sets use both reference genes and genomes to
produce gene models. I.e., the genome genes have the answers in front
of them, while mRNA-seq genes do not.
Some of these gene set comparisons use the same RNA-seq data, where the genome-based
versions have added sources of information, and of errors. If genes can be fully
constructed from mRNA-seq alone, then adding genome data, prediction software, and
related species proteins may contribute more mistakes than improvements.
This summary shows that Arthropod species gene sets constructed from mRNA-seq are often more
complete than gene sets from genome-modelling methods. mRNA-seq genes are
more biologically convincing because they do not rely on reference
species genes, predictions or genome assembly, and thus are not
subject to artifacts from those sources. When they match the
reference genes, it is true discovery.
Fig 4. Orthology completeness summaries per Arthropod species-clade
Fig 4a. Honey bee
Fig 4b. Daphnia water fleas (D. magna, pulex)
Fig 4c. Ticks (Deer tick, Zebra tick, Spider mite)
Fig 4d. Tribolium Beetle
Fig 4f. Fruit fly
Fig 4. Legend
- Nref_common = Reference ortholog families found in the gene set,
as percent of the common ortholog gene families
(common set = 8723 groups, common to each of 10 species including these)
- AaSize = protein size, as percent of ortholog reference protein size
- Align_common = Alignment to Reference ortholog protein, as percent of ref protein.
- Best_of_Species = percent of all reference gene families for which this gene set has the
best alignment score, or ties for best, among the compared gene sets.
I.e. 100% best for setA = all orthologs have best/tie score for setA, none better in sets B or C;
90% for setA implies 10% are better in another gene set (setB, setC, ...).
- Species gene sets left-right order, newest to oldest, as listed below.
Evigene sets are solid color bars, others are hashed color.
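As a rough illustration of the Nref_common and Best_of_Species legend entries, the following minimal Python sketch tallies both percentages from per-group labels. The function name `summarize`, the tuple layout, and the toy rows are hypothetical, and dividing both counts by the number of common groups is an assumption of the sketch, not a statement of how the plotted values were computed.

```python
def summarize(rows, n_common):
    """Tally per-gene-set percentages from (geneset, group_id, label) rows.

    label is one of best / same / diff / miss. Both percentages are taken
    over n_common reference groups (an assumption of this sketch).
    """
    found, top = {}, {}
    for geneset, _group, label in rows:
        # A group counts as found unless the gene set misses it entirely.
        found[geneset] = found.get(geneset, 0) + (label != "miss")
        # Best_of_Species counts outright best scores and ties for best.
        top[geneset] = top.get(geneset, 0) + (label in ("best", "same"))
    return {
        g: {
            "Nref_common": 100.0 * found[g] / n_common,
            "Best_of_Species": 100.0 * top[g] / n_common,
        }
        for g in found
    }

# Hypothetical toy data: two gene sets scored over four reference groups.
toy_rows = [
    ("setA", "g1", "best"), ("setB", "g1", "diff"),
    ("setA", "g2", "same"), ("setB", "g2", "same"),
    ("setA", "g3", "diff"), ("setB", "g3", "best"),
    ("setA", "g4", "best"), ("setB", "g4", "miss"),
]
toy_summary = summarize(toy_rows, n_common=4)
```

In this toy case setA is best or tied in 3 of 4 groups, so one group is better in setB, matching the "90% for setA implies 10% better elsewhere" reading of the legend.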
Arthropod Species gene sets
Honey bee, Insecta
apisevg14, Apis mellifera, Evigene mRNA assembly 2014.06
apis14nc, Apis mel., NCBI genome gene preds 2014.02 (r102)
apis45, Apis mel., OGS v3.2 (genome v4.5) genome gene preds 2012, doi: 10.1186/1471-2164-15-86
apis1, Apis mel., OGS v1.1 genome gene preds 2006/8, doi:10.1038/nature05260
Water flea, Crustacea
dapmagevg14, Daphnia magna, Evigene mRNA assembly + genome predict. 2014.08
dapmag11, Daphnia magna, Evigene genome gene preds 2011
dapplx10evg, Daphnia pulex, Evigene genome gene preds 2010
dapplxjgiv12, Daphnia pulex, OGS v1.1 + JGI updates 2011 (v1.2)
dapplxjgiv11, Daphnia pulex, OGS v1.1 genome gene preds 2007, doi: 10.1126/science.1197761
Ticks & Mites, Chelicerata
tickixevg14, Ixodes scapularis deer tick, Evigene mRNA assembly 2014.06
tickix, Ixodes scap. deer tick, genome gene preds 2011 (IscaW1.1)
tickte, Tetranychus ur. spider mite, genome gene preds 2011 (v.0620111123), doi:10.1038/nature10640
tickzbevg13, Rhipicephalus pulchellus zebra tick, Evigene mRNA assembly 2013
Beetles, Insecta
pogoevg13, Pogonus chalceus beetle, Evigene mRNA assembly 2013
tribcas4evg2, Tribolium castaneum beetle, Evigene mRNA assembly 2014.12
tribcas14nc, Tribolium cast., NCBI genome gene preds 2014.05 (r102)
tribcas4aug, Tribolium cast., AUGUSTUS genome gene preds, tcas4.0 2014
tribcas1, Tribolium cast., OGS v1 genome gene preds 2006/08, doi:10.1038/nature06784
Fruit fly, Insecta
drosmel548n, Drosophila melanogaster, rel 5.48, 2013
drosmelr5, Drosophila mel., rel 5.30, 2011
drosmelr4, Drosophila mel., rel 4.0, 2004
Plant & Fish Species gene sets
Fishes
Catfishevg, Ictalurus punctatus, Evigene mRNA assembly 2013
Killifishevg, Fundulus heteroclitus, Evigene mRNA assembly + genome preds 2014
Cavefish, Astyanax mexicanus, EnsEMBL r74 genome gene preds 2013.11
Mayzebr, Maylandia zebra african cichlid, NCBI genome gene preds 2013 (r100/MetZeb1.1)
Platyfish, Xiphophorus maculatus, EnsEMBL r74 genome gene preds 2013.11
Tilapia, Oreochromis niloticus, EnsEMBL r74 genome gene preds 2013.11
Zfish, Danio rerio zebrafish, Zv9/EnsEMBL r74 genome gene preds 2013.11
Plants
Banana1evg, Musa acuminata Banana plant, Evigene mRNA assembly 2013
Banana1g, Banana, genome gene preds 2012, doi:10.1038/nature11241
Cacao1evg, Theobroma cacao chocolate tree, Evigene mRNA assembly + genome predict. 2012, doi:10.1186/gb-2013-14-6-r53
Cacao1cr, Theobroma cacao, genome gene preds 2011, doi:10.1038/ng.736
Pine1evg, Loblolly pine tree, Evigene mRNA assembly 2013, doi:10.1186/gb-2014-15-3-r59
Pine1mk, Loblolly, genome gene preds (Maker) 2013, doi:10.1186/gb-2014-15-3-r59
Fig 5. Conserved Domain gene-set completeness of Arthropods, Fish and Plants
Fig 5a. Conserved Domains in Arthropod gene sets
Fig 5b. Conserved Domains in Fishes
Fig 5c. Conserved Domains in Plants
Fig 5. Legend
- Conserved Domains from NCBI CDD (2014.10), subsets by taxonomic group,
CD found with deltablast (a) in Arthropods, (b) in Vertebrates, and (c) in Plants.
CD set for Arthropods (a) is determined as Eukaryote CD that occur in all 3 clades
Insecta, Crustacea and Chelicerata, in at least one species gene set.
- pHit = Conserved Domains found,
as percent of all conserved domains in species/clade set
- Alignment = percent Alignment to Domains.
- Best_of_Species = domains for which the gene set has the longest alignment, or ties
for longest, as percent of all domains.
I.e. 100% best for setA = all domains have best/tie alignment for setA,
none better in sets B or C; 60% for setA implies 40% better in other gene sets.
- Calculations are within species/clade (color coded sets), normalized to 100% of all CD per clade.
- Arthropod gene sets are left-right ordered, newest to oldest, as listed above.
Evigene sets are solid color bars, others are hashed color.
- Gene set rank order for Best_of_Species is the same for Orthology group and Conserved Domain analyses.
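The pHit and Best_of_Species entries of this legend can be sketched in a few lines of Python. The function `cdd_scores` and its input layout are hypothetical. The alignment lengths are the Aln values of the three sample domain rows shown in the methods section, whose fourth gene set is elided and therefore omitted here. Treating only exact equality of alignment length as a tie is an assumption.

```python
def cdd_scores(aln_by_domain, genesets):
    """Percent of domains hit (pHit) and best-or-tied (pbest) per gene set.

    aln_by_domain maps CD id -> {geneset: aligned length}; a gene set
    absent from the inner dict has no hit to that domain.
    """
    n_ref = len(aln_by_domain)
    hits = {g: 0 for g in genesets}
    best = {g: 0 for g in genesets}
    for alns in aln_by_domain.values():
        top = max(alns.values(), default=0)
        for g in genesets:
            aln = alns.get(g, 0)
            if aln > 0:
                hits[g] += 1
                # Exact ties with the longest alignment count as best.
                if aln == top:
                    best[g] += 1
    if n_ref == 0:
        return {g: {"pHit": 0.0, "pbest": 0.0} for g in genesets}
    return {
        g: {"pHit": 100.0 * hits[g] / n_ref, "pbest": 100.0 * best[g] / n_ref}
        for g in genesets
    }

# Aln values of three honey bee gene sets, from the sample CD table rows.
sample_aln = {
    "CDD:100015": {"amelevg14": 466, "apimel1": 434, "apimel14nc": 434},
    "CDD:100016": {"amelevg14": 452, "apimel1": 431, "apimel14nc": 466},
    "CDD:100017": {"amelevg14": 432, "apimel1": 432, "apimel14nc": 432},
}
sample_scores = cdd_scores(sample_aln, ["amelevg14", "apimel1", "apimel14nc"])
```

On these three rows the longest-alignment rule picks the same winners as the Bestset column of the sample table (amelevg14, apimel14nc, and a tie).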
Table 4. Orthology gene set completeness across species,
measured with average protein size and orthology in gene groups
Table 4a. Gene set completeness of Arthropod species
            Common Families         All Families
Species     cBits  aaSize  oMiss    tBits  oGroup  Tiny
---------   --------------------    -------------------
daphniam      650      46     18      466   11523  1.8%
daphniap      643     -25     36      462   11670  5.1%
beetlet       526     -26     42      351    8765  4.1%
beetlep       541      15     78      358    8875  3.5%
honeybee      532      38    161      346    8682  3.1%
fruitfly      470      68    203      290    7801  1.8%
-------------------------------------------------------
Table 4b. Gene set completeness of Plant species
            Common           All Families
Geneset     cBits  aaSize    tBits  oGroup  Tiny
---------   -------------    -------------------
cacao1evg     653      15      547   15161  0.7%
cacao1cr      641      11      530   14897  1.5%
banana1g      522     -19      371   12537  4.6%
banana1evg    521     -21      349   11733  7.5%
------------------------------------------------
Table 4c. Gene set completeness of Fish species
            Common Families         All Families
Species     cBits  aaSize  oMiss    tBits  oGroup  Tiny
---------   --------------------    -------------------
killifish     803      50     18      585   17272  1.1%
maylandia     824      45     76      596   16469  1.1%
tilapia       822       6    223      568   14905  1.9%
platyfish     783     -12    118      549   15305  4.7%
zebrafish     711      -9    366      478   15190  4.8%
catfish       725      21    729      470   14276  3.4%
-------------------------------------------------------
Legend:
cBits = bitscore average for 4740 common gene groups (plant common n=8461);
tBits = bitscore average for all ortholog groups;
aaSize = average protein size difference from group median;
oMiss = missing ortholog groups that are common to other 9 of 10 species;
oGroup = number of ortholog gene groups in species;
Tiny = percent species gene size outliers below 2sd of group median size;
Arthropod Species gene sets:
daphniam = daphnia magna dapmagevg14, daphniap = daphnia pulex dapplx10evg,
beetlet = tribolium cas tribcas1, beetlep = pogonus cha. pogoevg13,
honeybee = apis mel. apisevg14, fruitfly = dros. mel. drosmel548n
Analysis source:
Arthropods, arp7bor5/arp7s10f-orthomcl, 2014.09.30;
Fish, fish11gor3-orthomcl, 2013.12.11 ;
Plants, orthomcl 2014.02.09
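The aaSize and Tiny columns of Table 4 can be illustrated with a short Python sketch. It assumes that the median and standard deviation are taken over all member proteins of each ortholog group, including the species being scored, and that the sample standard deviation is used. The function `size_stats` and the toy sizes are hypothetical.

```python
import statistics

def size_stats(sizes_by_group, species):
    """Return (aaSize, Tiny) for one species.

    sizes_by_group maps group id -> {species: protein length in aa}.
    aaSize = mean difference of the species protein size from the group median.
    Tiny   = percent of its proteins smaller than median - 2 * sd of the group.
    """
    diffs = []
    tiny = 0
    for sizes in sizes_by_group.values():
        if species not in sizes or len(sizes) < 2:
            continue  # no protein for this species, or sd undefined
        values = list(sizes.values())
        median = statistics.median(values)
        sd = statistics.stdev(values)
        diffs.append(sizes[species] - median)
        if sizes[species] < median - 2 * sd:
            tiny += 1
    if not diffs:
        return None, None
    return sum(diffs) / len(diffs), 100.0 * tiny / len(diffs)

# Hypothetical toy groups: ten species each, "sp" is the species scored.
others = ["s%d" % i for i in range(9)]
toy_groups = {
    "g1": {**{s: 500 for s in others}, "sp": 100},  # sp is a size outlier
    "g2": {**{s: 300 for s in others}, "sp": 310},  # sp is near the median
}
toy_aasize, toy_tiny = size_stats(toy_groups, "sp")
```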
Orthology analysis methods
2015-Mar update to 2014-Nov assessment
The assignment of target genes/transcripts to ortho-groups is changed to one
group per target locus (per species gene set). The prior method allowed
alternate transcripts of a locus to fill other ortho-groups, which can be
biologically valid for some loci, but also included a large set of weak,
partially aligned group members. The per-gene calculation of alignment and
identity from blastp tables is also updated to improve the addition of
partial hits (HSPs), with little effect except for a few largish genes. This
update reduces the number of "missed" ortho-groups for the gene sets with
few alternates. The CDD analysis is not affected by alternates and is
unchanged.
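The one-group-per-locus rule can be sketched as follows in Python. This is an illustration only: the helper names are hypothetical, the locus is derived by assuming that Evigene transcript IDs end in t<number> for alternates, and keeping the highest-scoring hit per locus is an assumed tie-breaking choice. The second hit in the example (group ARP7f_G9999) is invented for illustration.

```python
import re

def evigene_locus(transcript_id):
    """Strip a trailing t<number> alternate suffix (assumed ID convention)."""
    return re.sub(r"t\d+$", "", transcript_id)

def one_group_per_locus(hits, locus_of=evigene_locus):
    """Keep, for each locus, only its single best (transcript, group) hit.

    hits: iterable of (transcript_id, group_id, align_score).
    Returns {locus: (transcript_id, group_id, align_score)}.
    """
    best = {}
    for tid, gid, score in hits:
        locus = locus_of(tid)
        # An alternate transcript replaces the kept hit only if it scores higher.
        if locus not in best or score > best[locus][2]:
            best[locus] = (tid, gid, score)
    return best

example_hits = [
    ("Apimel3EVm000005t4", "ARP7f_G1370", 8474),
    ("Apimel3EVm000005t1", "ARP7f_G9999", 1200),  # hypothetical weak alternate hit
    ("Apimel3EVm000003t2", "ARP7f_G251", 8819),
]
assigned = one_group_per_locus(example_hits)
```

Gene sets with other ID schemes (e.g. NCBI protein accessions) would need their own transcript-to-locus mapping passed as `locus_of`.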
One visible consequence in the per-group bar graphs (Figs 6a-6f) is that
the longest reference gene groups are almost all filled by each gene set
(i.e. few misses for these most obvious ortho-genes), while short
reference gene groups still contain a fair number of misses per gene set
(shorter genes are in some ways harder to find accurately, due to
"noise" factors). For the longest genes, sizes often differ among gene
sets; in particular, the initial gene sets have many short or partial models of
complex long genes. See e.g. Ticks (Fig 6c), where partial genes are common
with the fragmented genome assembly of deer tick OGS1.
Another consequence in the summary bar graphs is that the
Best-of-species percentages have increased for poorer gene sets, to
within 20% to 30% of top scoring gene sets. This is still a large range
in gene set completeness. Note the comparatively small range among Fruitfly gene
set versions that span over a decade of effort.
Both these changes indicate a likely more valid way to assess gene set
qualities in this restricted case for orthology-completeness. Bear in
mind this is one gene set quality, albeit perhaps the most biologically
rigorous one available.
Prior data and plots of
2014-Nov assessment are available here.
Orthology completeness assessment
- Orthology gene groups are those of ARP7 OrthoMCL analysis set of 10 species, group IDs of
ARP7f_Gnnn, such as ARP7f_G1000.
This gene orthology database ARP7 is at
arthropods.eugenes.org/arthropods/orthologs/ARP7/,
including orthology tabulations and annotations, and species gene set proteins.
Gene family report pages are found with ARP7 IDs at
http://arthropods.eugenes.org/genepage/arp7xml/ARP7f_G1000
- Orthology gene groups that are common to all 10 species are used,
though missing from some species gene sets, n=8723 listed in
allspp-refarp7s10fset1.comgrpid2
Tables for this analysis were produced using blastp of all transcript proteins
(primary and alternate) against the ARP7 OrthoMCL reference protein database (n=131876 proteins of 10 species).
The best aligned or best bitscore match of query gene set to reference database is tabulated,
along with reference ARP7 gene group. These tabulations exclude same-species matches (query-ref).
Alignment score is preferred over bitscore (or e-value) in this analysis because the bitscore
is very strongly affected by taxonomic distance, alignment less so. Using bitscore gives different
results per gene, but returns the same basic results in overall gene set qualities.
- Tabulation of closest alignment of query genes x reference orthology gene group includes
classification of Best_of_Species across gene sets per species, as Best=higher alignment than any
others, Same=tied for top alignment (within 2% of top), Diff=lower than top alignment,
Miss=no alignment to reference orthology group gene.
For Ticks gene sets, related species were grouped into Best_of_Species classification.
- Alignment score tables used for above summaries are in this format
tgeneset orgroupid refgeneid trgeneid bits iden algn pal dlen rsize tsize isbest comorgrp
amelevg14 ARP7f_G1370 nasvit:Nasvi2EG037131t1 amevg14:Apimel3EVm000005t4 14052 6667 8474 89.9 93.6 9421 8815 best 1
apimel14nc ARP7f_G1370 nasvit:Nasvi2EG037131t1 apimel14nc:XP_006568818 12637 6007 7574 80.4 81.2 9421 7654 diff 1
apismel45 ARP7f_G1370 nasvit:Nasvi2EG037131t1 apismel45:GB55483-PA 13980 6618 8456 89.8 100 9421 9504 diff 1
apimel1 ARP7f_G1370 nasvit:Nasvi2EG037131t1 apimel1:GB11358-PA 12598 5976 7655 81.3 98.6 9421 9290 diff 1
amelevg14 ARP7f_G251 tribcas:XP_008191512 amevg14:Apimel3EVm000003t2 11260 5507 8819 34.6 40.7 25481 10366 diff 1
apimel14nc ARP7f_G251 tribcas:XP_008191512 apimel14nc:XP_006564376 13101 6650 11658 45.8 78.3 25481 19952 best 1
apismel45 ARP7f_G251 tribcas:XP_008191512 apismel45:GB47977-PA 10029 5206 10376 40.7 91.9 25481 23421 diff 1
apimel1 ARP7f_G251 tribcas:XP_008191512 apimel1:GB14642-PA 8242 4124 7786 30.6 67.7 25481 17256 diff 1
found here:
apismel4set,
tribolium4set,
daphmag2set,
daphplx3set,
dromel3set,
tick4set
Alignment score summary table for all species gene sets
- ARP7 OrthoMCL analysis includes 10 species, 8 arthropods plus Human and a Fish (maylandia),
with species proteins excluding Daph magna available at
EvidentialGene/daphnia/daphnia_magna_new/Proteins/
arp7s10b14nodmag.species.txt
45212 daphplx Daphnia pulex water flea, 2010 beta gene set (~10,000 noncoding included)
29127 daphmag Daphnia magna water flea, 2014 gene set
13927 dromel Drosophila melanogaster fruit fly, FBgn version 5.x (date?)
72392 honbee Apis mellifera honey bee, 2014 evigene mRNA-assembly (url)
39357 human Homo sapiens human, UniProt 2014 ?
23194 mayzebr Maylandia zeb. cichlid fish, 2014 NCBI gene set
36390 nasvit Nasonia vitripennis jewel wasp, 2010 evigene
26962 penmon Pen. mon. tiger shrimp, 2013 evigene mRNA assembly (url)
21503 pogcha Pog. cha. beetle, 2013 evigene mRNA assembly (url)
12420 tribcas Tribolium castaneum beetle, 2014 NCBI gene set
- OrthoMCL (www.orthomcl.org) methods include reciprocal best blast hit (bbh) of all primary transcript proteins
(one protein/gene locus), then tabulation of best reciprocal hits for within (paralog) and
between (ortholog) species, followed by Markov clustering (MCL) to form gene groups. Several
papers attest to the validity of OrthoMCL and MCL clustering for identifying orthology gene families.
- Conserved Domain analysis is performed with NCBI deltablast, using cdd_delta database of 2014 Oct.
CD subsets are derived from taxonomic division common to all sequences provided as representatives
of each CD.
Longest alignment to each CD, or highest identity, per gene set is used to measure gene set
presence (Hit) and score. Alignment and Identity scores give equivalent rank orders for gene sets.
Bitscore and e-value are not used because they are confounded by sizes of proteins that
are always longer than domains they contain.
- Conserved Domain alignment score tables used for above summaries are in this format
CDid CDlen Bestset Geneset1 Bit1 Id1 Aln1 Geneset2 Bit2 Id2 Aln2 Geneset3...
CDD:100015 434 amelevg14 amelevg14:Apimel3aEVm009774t2 928 404 466 apimel1:GB10009-PA 965 395 434 apimel14nc:XP_006558759 896 356 434 ..
CDD:100016 425 apimel14nc amelevg14:Apimel3aEVm006739t16 942 417 452 apimel1:GB11920-PA 944 391 431 apimel14nc:XP_006567220 663 294 466 ..
CDD:100017 431 same4 amelevg14:Apimel3aEVm010780t1 880 353 432 apimel1:GB18755-PA 889 353 432 apimel14nc:XP_394981 889 353 432 ..
found here:
apismel4set,
beetle3eu2e,
daphnia2set,
dromel3set,
tick4set,
nasonia3set,
plants3set,
fish7set (longest alignments),
with summary tables cdd-alleuks-dompair5sum.txt (longest align)
and cdd-alleuks-dompair4sum.txt (highest identity)
of the 34 animal and plant species gene sets analyzed.
Statistics pHit, paln and pbest (the percentages of nRef CD that are present (hit), aligned, and best scoring)
are plotted above, from columns "nRef nHit pHit bits iden algn paln ptop best same diff miss pbest".
NCBI CD descriptions and taxonomy are in cdd-delta14desc.txt
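The alignment score table format and the Best/Same/Diff/Miss classification described in the assessment steps above can be sketched in Python as follows. The column names come from the table header shown earlier. The reading of pal as 100 * algn / rsize is inferred from the sample rows rather than stated in the text. The tie rule (relative, within 2% of the top value) is an assumption, and the published labels may follow a different rule: the ARP7f_G1370 sample rows, for instance, mark 89.8 against a top of 89.9 as diff.

```python
FIELDS = ("tgeneset orgroupid refgeneid trgeneid bits iden algn "
          "pal dlen rsize tsize isbest comorgrp").split()

def parse_row(line):
    """Split one whitespace-delimited table row into a typed record."""
    rec = dict(zip(FIELDS, line.split()))
    for key in ("bits", "iden", "algn", "rsize", "tsize", "comorgrp"):
        rec[key] = int(rec[key])
    for key in ("pal", "dlen"):
        rec[key] = float(rec[key])
    return rec

def percent_aligned(rec):
    """Aligned residues as percent of reference protein size (inferred)."""
    return round(100.0 * rec["algn"] / rec["rsize"], 1)

def classify(pal_by_set, tie_pct=2.0):
    """Label each gene set best / same / diff / miss for one reference group.

    pal_by_set maps geneset -> percent alignment, or None for no hit.
    """
    labels = {g: "miss" for g in pal_by_set}
    hits = {g: p for g, p in pal_by_set.items() if p is not None}
    if not hits:
        return labels
    top = max(hits.values())
    floor = top * (1.0 - tie_pct / 100.0)  # assumed relative tolerance
    near_top = [g for g, p in hits.items() if p >= floor]
    for g in hits:
        if g not in near_top:
            labels[g] = "diff"
        elif len(near_top) == 1:
            labels[g] = "best"
        else:
            labels[g] = "same"
    return labels

sample_line = ("amelevg14 ARP7f_G1370 nasvit:Nasvi2EG037131t1 "
               "amevg14:Apimel3EVm000005t4 14052 6667 8474 89.9 93.6 "
               "9421 8815 best 1")
sample_rec = parse_row(sample_line)

# pal values of the four honey bee gene sets for group ARP7f_G251.
g251_labels = classify({"amelevg14": 34.6, "apimel14nc": 45.8,
                        "apismel45": 40.7, "apimel1": 30.6})
```

For group ARP7f_G251 this sketch gives best for apimel14nc and diff for the other three sets, which agrees with the isbest column of the sample rows.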
Fig 6a. Honey bee gene sets, per gene alignment to orthology reference genes
Percent alignment to reference (Align_common) on y-axis,
with gene groups on x-axis from longest (left) to shorter genes,
Best_of_Species genes are colored as red-orange=Best align,
pink=Same/Best align,
blue-gray=Poorer align,
purple dotted=Missing
Note: Apis Evigene 2014 and Apis NCBI 2014 used same Apis RNA-seq,
NCBI used additional genome data.
Best includes ties, where 2nd is nearly same as best score.
Apis Evigene-M 2014, apisevg14
Apis NCBI 2014, apis14nc
Apis OGS 4.5, apis45
Apis OGS 1, apis1
Longest orthogene groups (1-200)
Mid sized orthogene groups (1000-1200)
Shorter orthogene groups (5000-5200)
Fig 6b. Daphnia water flea gene sets, per gene alignment to orthology reference genes
Percent alignment to reference (Align_common) on y-axis,
with gene groups on x-axis from longest (left) to shorter genes,
Best_of_Species genes are colored as red-orange=Best align,
blue-gray=Poorer align,
purple dotted=Missing
Note: Best_of_Species colors are for 2 Daph. magna gene sets, and 2 Daph. pulex gene sets, separately
Best includes ties, where 2nd is nearly same as best score.
Daph. magna Evigene-M 2014, dapmagevg14
Daphnia pulex 2010 Evigene-G, dapplx10evg
Daph. magna Evigene-G 2011, dapmag11
Daphnia pulex OGS1, dapplxjgiv11
Longest orthogene groups (1-200)
Mid sized orthogene groups (1000-1200)
Shorter orthogene groups (5000-5200)
Fig 6c. Tick species gene sets, per gene alignment to orthology reference genes
Deer tick Evigene-M 2014, tickixevg14
Deer tick OGS1 2010, tickix
Spider mite OGS1, tickte
Zebra tick Evigene-M 2013, tickzbevg13
Longest orthogene groups (1-200)
Mid sized orthogene groups (1000-1200)
Shorter orthogene groups (5000-5200)
Fig 6d. Tribolium beetle gene sets, per gene alignment to orthology reference genes
Percent alignment to reference (Align_common) on y-axis,
with gene groups on x-axis from longest (left) to shorter genes,
Best_of_Species genes are colored as red-orange=Best align,
pink=Same/Best align,
blue-gray=Poorer align,
purple dotted=Missing
Tribolium Evigene-M 2014, tribcas4evg2
Tribolium AUGUSTUS 2014, tribcas4aug
Tribolium NCBI 2014, tribcas14nc
Tribolium OGS1, tribcas1
Longest orthogene groups (1-200)
Mid sized orthogene groups (1000-1200)
Shorter orthogene groups (5000-5200)
Fig 6f. Fruitfly gene sets, per gene alignment to orthology reference genes
Fruitfly rel5.48, drosmel548n
Fruitfly rel5.0, drosmelr5
Fruitfly rel4, drosmelr4
Longest orthogene groups (1-200)
Mid sized orthogene groups (1000-1200)
Shorter orthogene groups (5000-5200)