EvidentialGene: Orthology completeness of Arthropod gene assemblies
EvidentialGene gene sets for 12 arthropod, fish and plant species are the
most orthology-complete when compared with gene sets for the same or related
species produced by well-regarded gene discovery methods (NCBI, Ensembl,
Augustus, Maker, Glean, ...).
Orthology completeness measures the presence and fullness of genes with
protein homology to related species. This measure uses the subset
of genes that have orthology to about 9,000 gene families shared
by 10 species: arthropods plus human and a fish. A similar
completeness ranking of gene sets is obtained using conserved protein
domains (NCBI CDD).
mRNA-assembled genes are
constructed without reference protein information and without genome
data. Genome gene sets use both reference genes and genome assemblies to
produce gene models. That is, the genome-based genes have the answers in
front of them, while the mRNA-seq genes do not.
Some of these gene set comparisons use the same RNA-seq data, where the genome-based
versions have added sources of information, and of errors. If genes can be fully
constructed from mRNA-seq alone, then adding genome data, prediction software, and
related species proteins may contribute more mistakes than improvements.
This summary shows that Arthropod species gene sets constructed from mRNA-seq are often more
complete than genes from genome-modelling methods. mRNA-seq genes are
more biologically convincing as they don't rely on the reference
species genes, predictions or genome assembly, and thus are not
subject to artifacts from those sources. When they match the
reference genes, it is true discovery.
Fig 4. Orthology completeness summaries per Arthropod species-clade
Fig 4a. Honey bee
|
Fig 4b. Daphnia water fleas (D. magna, pulex)
|
Fig 4c. Ticks (Deer tick, Zebra tick, Spider mite)
|
Fig 4d. Tribolium Beetle
|
Fig 4f. Fruit fly
|
Fig 4. Legend
- Nref_common = Reference ortholog families found,
as percent of common ortholog gene families found in gene set
(common set = 8723 groups, common to each of 10 species including these)
- AaSize = protein size, as percent of ortholog reference protein size
- Align_common = Alignment to Reference ortholog protein, as percent of ref protein.
- Best_of_Species = gene set score over all reference orthologs, as percent of all ref gene families
for which this gene set has the best alignment score per reference gene, or ties the best, among the gene sets compared.
I.e. 100% best for setA = all orthologs have best/tie score for setA, none better in sets B or C;
90% for setA implies 10% are better in another gene set (setB, setC, ...).
- Species gene sets left-right order, newest to oldest, as listed below.
Evigene sets are solid color bars, others are hashed color.
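The Best_of_Species score can be illustrated with a short Python sketch. This is not the EvidentialGene code: the function names and the second, made-up group are hypothetical. The tie tolerance of 9 amino acids is the one given under Orthology analysis methods below, and the first group uses the Align values of the four honey bee rows listed there for ARP7f_G1000.

```python
from collections import Counter

TIE_TOLERANCE = 9  # amino acids; ties are described as +/- 9 aminos in the methods


def classify_group(align_by_set, tol=TIE_TOLERANCE):
    """Label each gene set for one reference ortholog group.

    align_by_set maps gene set name -> alignment length, or None if no alignment.
    Returns gene set name -> 'best', 'same', 'diff' or 'miss'.
    """
    found = [a for a in align_by_set.values() if a is not None]
    top = max(found, default=None)
    # Gene sets whose alignment is within the tolerance of the top alignment
    near_top = [s for s, a in align_by_set.items()
                if a is not None and top - a <= tol]
    labels = {}
    for s, a in align_by_set.items():
        if a is None:
            labels[s] = "miss"
        elif s in near_top:
            labels[s] = "best" if len(near_top) == 1 else "same"
        else:
            labels[s] = "diff"
    return labels


def best_of_species_percent(groups, tol=TIE_TOLERANCE):
    """Percent of reference groups where each gene set is best or tied for best."""
    wins = Counter()
    set_names = set()
    for align_by_set in groups.values():
        set_names.update(align_by_set)
        for s, label in classify_group(align_by_set, tol).items():
            if label in ("best", "same"):
                wins[s] += 1
    n = len(groups)
    return {s: 100.0 * wins[s] / n for s in sorted(set_names)} if n else {}


groups = {
    # Align column of the four honey bee rows for ARP7f_G1000 in the methods table
    "ARP7f_G1000": {"apimel14nc": 705, "apimel1": 648,
                    "amelevg14": 634, "apismel45": 569},
    # Made-up group, for illustration only
    "hypothetical_group": {"apimel14nc": 500, "apimel1": None,
                           "amelevg14": 505, "apismel45": 300},
}

for name, pct in best_of_species_percent(groups).items():
    print(f"{name}\t{pct:.0f}%")
```

In this sketch the denominator is the number of reference groups, so a gene set that misses a group simply earns no credit for it.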
Fig 4d2 Pogonus + Tribolium beetles
|
Arthropod Species gene sets
Honey bee, Insecta
apisevg14, Apis mellifera, Evigene mRNA assembly 2014.06
apis14nc, Apis mel., NCBI genome gene preds 2014.02 (r102)
apis45, Apis mel., OGS v3.2 (genome v4.5) genome gene preds 2012, doi: 10.1186/1471-2164-15-86
apis1, Apis mel., OGS v1.1 genome gene preds 2006/8, doi:10.1038/nature05260
Water flea, Crustacea
dapmagevg14, Daphnia magna, Evigene mRNA assembly + genome predict. 2014.08
dapmag11, Daphnia magna, Evigene genome gene preds 2011
dapplx10evg, Daphnia pulex, Evigene genome gene preds 2010
dapplxjgiv12, Daphnia pulex, OGS v1.1 + JGI updates 2011 (v1.2)
dapplxjgiv11, Daphnia pulex, OGS v1.1 genome gene preds 2007, doi: 10.1126/science.1197761
Ticks & Mites, Chelicerata
tickixevg14, Ixodes scapularis deer tick, Evigene mRNA assembly 2014.06
tickix, Ixodes scap. deer tick, genome gene preds 2011 (IscaW1.1)
tickte, Tetranychus ur. spider mite, genome gene preds 2011 (v.0620111123), doi:10.1038/nature10640
tickzbevg13, Rhipicephalus pulchellus zebra tick, Evigene mRNA assembly 2013
Beetles, Insecta
pogoevg13, Pogonus chalceus beetle, Evigene mRNA assembly 2013
tribcas4evg2, Tribolium castaneum beetle, Evigene mRNA assembly 2014.12
tribcas14nc, Tribolium cast., NCBI genome gene preds 2014.05 (r102)
tribcas4aug, Tribolium cast., AUGUSTUS genome gene preds, tcas4.0 2014
tribcas1, Tribolium cast., OGS v1 genome gene preds 2006/08, doi:10.1038/nature06784
Fruit fly, Insecta
drosmel548n, Drosophila melanogaster, rel 5.48, 2013
drosmelr5, Drosophila mel., rel 5.30, 2011
drosmelr4, Drosophila mel., rel 4.0, 2004
|
Plant & Fish Species gene sets
Fishes
Catfishevg, Ictalurus punctatus, Evigene mRNA assembly 2013
Killifishevg, Fundulus heteroclitus, Evigene mRNA assembly + genome preds 2014
Cavefish, Astyanax mexicanus, EnsEMBL r74 genome gene preds 2013.11
Mayzebr, Maylandia zebra african cichlid, NCBI genome gene preds 2013 (r100/MetZeb1.1)
Platyfish, Xiphophorus maculatus, EnsEMBL r74 genome gene preds 2013.11
Tilapia, Oreochromis niloticus, EnsEMBL r74 genome gene preds 2013.11
Zfish, Danio rerio zebrafish, Zv9/EnsEMBL r74 genome gene preds 2013.11
Plants
Banana1evg, Musa acuminata Banana plant, Evigene mRNA assembly 2013
Banana1g, Banana, genome gene preds 2012, doi:10.1038/nature11241
Cacao1evg, Theobroma cacao chocolate tree, Evigene mRNA assembly + genome predict. 2012, doi:10.1186/gb-2013-14-6-r53
Cacao1cr, Theobroma cacao, genome gene preds 2011, doi:10.1038/ng.736
Pine1evg, Loblolly pine tree, Evigene mRNA assembly 2013, doi:10.1186/gb-2014-15-3-r59
Pine1mk, Loblolly, genome gene preds (Maker) 2013, doi:10.1186/gb-2014-15-3-r59
|
Fig 5. Conserved Domain gene-set completeness of Arthropods, Fish and Plants
Fig 5a. Conserved Domains in Arthropod gene sets
|
Fig 5b. Conserved Domains in Fishes
|
Fig 5c. Conserved Domains in Plants
|
Fig 5. Legend
- Conserved Domains from NCBI CDD (2014.10), subsets by taxonomic group,
CD found (a) in Arthropods, (b) in Vertebrates, and (c) in Plants with deltablastp.
CD set for Arthropods (a) is determined as Eukaryote CD that occur in all 3 clades
Insecta, Crustacea and Chelicerata, in at least one species gene set.
- pHit = Conserved Domains found,
as percent of all conserved domains in species/clade set
- Alignment = percent Alignment to Domains.
- Best_of_Species = gene set best with longest alignment, or same as best (tie), per domain,
as percent of all domains.
I.e. 100% best for setA = all genes have best/tie domain alignment for setA,
none better in sets B or C, 60% for setA implies 40% better in other gene sets.
- Calculations are within species/clade (color coded sets), normalized to 100% of all CD per clade.
- Arthropod gene sets are left-right ordered, newest to oldest, as listed above.
Evigene sets are solid color bars, others are hashed color.
- Gene set rank order for Best_of_Species is the same for Orthology group and Conserved Domain analyses.
|
Table 4. Orthology gene set completeness across species,
measured with average protein size and orthology in gene groups
Table 4a. Gene set completeness of Arthropod species
           Common Families        All Families
Species    cBits  aaSize  oMiss   tBits  oGroup  Tiny
---------  --------------------   -------------------
daphniam     650      46     18     466   11523  1.8%
daphniap     643     -25     36     462   11670  5.1%
beetlet      526     -26     42     351    8765  4.1%
beetlep      541      15     78     358    8875  3.5%
honeybee     532      38    161     346    8682  3.1%
fruitfly     470      68    203     290    7801  1.8%
-----------------------------------------------------
|
Table 4b. Gene set completeness of Plant species
            Common          All Families
Geneset     cBits  aaSize   tBits  oGroup  Tiny
----------  -------------   -------------------
cacao1evg     653      15     547   15161  0.7%
cacao1cr      641      11     530   14897  1.5%
banana1g      522     -19     371   12537  4.6%
banana1evg    521     -21     349   11733  7.5%
-----------------------------------------------
|
Table 4c. Gene set completeness of Fish species
           Common Families        All Families
Species    cBits  aaSize  oMiss   tBits  oGroup  Tiny
---------  --------------------   -------------------
killifish    803      50     18     585   17272  1.1%
maylandia    824      45     76     596   16469  1.1%
tilapia      822       6    223     568   14905  1.9%
platyfish    783     -12    118     549   15305  4.7%
zebrafish    711      -9    366     478   15190  4.8%
catfish      725      21    729     470   14276  3.4%
-----------------------------------------------------
|
Legend:
cBits = bitscore average for 4740 common gene groups (plant common n=8461);
tBits = bitscore average for all ortholog groups;
aaSize = average protein size difference from group median;
oMiss = missing ortholog groups that are common to other 9 of 10 species;
oGroup = number of ortholog gene groups in species;
Tiny = percent species gene size outliers below 2sd of group median size;
Arthropod Species gene sets:
daphniam = daphnia magna dapmagevg14, daphniap = daphnia pulex dapplx10evg,
beetlet = tribolium cas tribcas1, beetlep = pogonus cha. pogoevg13,
honeybee = apis mel. apisevg14, fruitfly = dros. mel. drosmel548n
Analysis source:
Arthropods, arp7bor5/arp7s10f-orthomcl, 2014.09.30;
Fish, fish11gor3-orthomcl, 2013.12.11 ;
Plants, orthomcl 2014.02.09
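The aaSize and Tiny columns defined in the legend can be sketched in Python as follows. This is an illustration under stated assumptions, not the original tabulation code: the function name and input layout are hypothetical, "below 2sd of group median size" is read here as a protein shorter than the group median minus two standard deviations, and the population standard deviation is an arbitrary choice.

```python
from statistics import mean, median, pstdev


def size_stats(species_sizes, group_sizes):
    """Summarize protein sizes of one species against its ortholog groups.

    species_sizes: group ID -> protein length (aa) of the species gene in that group
    group_sizes:   group ID -> protein lengths (aa) of all members of the group
    Returns aaSize (average difference from group median) and Tiny (percent of
    species genes more than 2 sd below the group median; assumed reading).
    """
    diffs = []
    tiny = 0
    for gid, size in species_sizes.items():
        members = group_sizes[gid]
        med = median(members)
        sd = pstdev(members)  # population sd; an assumption of this sketch
        diffs.append(size - med)
        if size < med - 2 * sd:
            tiny += 1
    n = len(diffs)
    return {"aaSize": mean(diffs), "Tiny": 100.0 * tiny / n}


# Made-up sizes for two hypothetical groups, for illustration only
group_sizes = {
    "G1": [400, 410, 420, 430, 440],
    "G2": [200, 200, 200, 200, 100],
}
species_sizes = {"G1": 430, "G2": 100}

stats = size_stats(species_sizes, group_sizes)
print(f"aaSize={stats['aaSize']:.0f}  Tiny={stats['Tiny']:.1f}%")
```

A negative aaSize, as for several gene sets in Table 4, means the proteins of that gene set are on average shorter than the group median.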
Orthology analysis methods
- Orthology gene groups are those of ARP7 OrthoMCL analysis set of 10 species, group IDs of
ARP7f_Gnnn, such as ARP7f_G1000.
This gene orthology database ARP7 is at
arthropods.eugenes.org/arthropods/orthologs/ARP7/,
including orthology tabulations and annotations, and species gene set proteins.
Gene family report pages are found with ARP7 IDs at
http://arthropods.eugenes.org/genepage/arp7xml/ARP7f_G1000
- Orthology gene groups that are common to all 10 species are used,
though missing from some species gene sets, n=8723 listed in
allspp-refarp7s10fset1.comgrpid2
Tables for this analysis were produced using blastp of all transcript proteins
(primary and alternate) to the ARP7 OrthoMCL reference protein database (n=131876 proteins of 10 species).
The best aligned or best bitscore match of query gene set to reference database is tabulated,
along with reference ARP7 gene group. These tabulations exclude same-species matches (query-ref).
Alignment score is preferred over bitscore (or e-value) in this analysis because the bitscore
is very strongly affected by taxonomic distance, alignment less so. Using bitscore gives different
results per gene, but the same basic results for overall gene set quality.
- Tabulation of closest alignment of query genes x reference orthology gene group includes
classification of Best_of_Species across gene sets per species, as Best=higher alignment than any
others, Same=tied for top alignment (+/- 9 aminos), Diff=lower than top alignment, Miss=no alignment
to reference orthology group gene. For gene sets of
Ticks and Beetles, related species were grouped into Best_of_Species classification.
- Alignment score tables used for above summaries are in this format
BestClass         OrthoMCL_ID,referencegene_ID         Geneset_Best_ID                Bits  Ident  Align  refSize  querySize
best3.apimel14nc  ARP7f_G1000,nasvit:Nasvi2EG002657t1  apimel14nc:XP_006560316         409    479    705      571        690
diff.apimel1      ARP7f_G1000,nasvit:Nasvi2EG002657t1  apimel1:GB13919-PA              415    443    648      571        639
diff.amelevg14    ARP7f_G1000,nasvit:Nasvi2EG002657t1  amelevg14:Apimel3aEVm003441t1   419    458    634      571        842
diff.apismel45    ARP7f_G1000,nasvit:Nasvi2EG002657t1  apismel45:GB48026-PA            369    397    569      571        559
found here:
apismel4set,
beetle3set,
tribolium4set,
daphmag2set,
daphplx3set,
dromel3set,
tick4set
- ARP7 OrthoMCL analysis includes 10 species, 8 arthropods plus Human and a Fish (maylandia),
with species proteins excluding Daph magna available at
EvidentialGene/daphnia/daphnia_magna_new/Proteins/
arp7s10b14nodmag.species.txt
45212 daphplx Daphnia pulex water flea, 2010 beta gene set (~10,000 noncoding included)
29127 daphmag Daphnia magna water flea, 2014 gene set
13927 dromel Drosophila melanogaster fruit fly, FBgn version 5.x (date?)
72392 honbee Apis mellifera honey bee, 2014 evigene mRNA-assembly (url)
39357 human Homo sapiens human, UniProt 2014 ?
23194 mayzebr Maylandia zebra cichlid fish, 2014 NCBI gene set
36390 nasvit Nasonia vitripennis jewel wasp, 2010 evigene
26962 penmon Pen. mon. tiger shrimp, 2013 evigene mRNA assembly (url)
21503 pogcha Pog. cha. beetle, 2013 evigene mRNA assembly (url)
12420 tribcas Tribolium castaneum beetle, 2014 NCBI gene set
- OrthoMCL (www.orthomcl.org) methods include reciprocal best blast hit (bbh) of all primary transcript proteins
(one protein/gene locus), then tabulation of best reciprocal hits for within (paralog) and
between (ortholog) species, followed by Markov clustering (MCL) to form gene groups. Several
papers attest to the validity of OrthoMCL and MCL clustering for identifying orthology gene families.
- Conserved Domain analysis is performed with NCBI deltablast, using cdd_delta database of 2014 Oct.
CD subsets are derived from taxonomic division common to all sequences provided as representatives
of each CD.
Longest alignment to each CD, or highest identity, per gene set is used to measure gene set
presence (Hit) and score. Alignment and Identity scores give equivalent rank orders for gene sets.
Bitscore and e-value are not used because they are confounded by sizes of proteins that
are always longer than domains they contain.
- Conserved Domain alignment score tables used for above summaries are in this format
CDid CDlen Bestset Geneset1 Bit1 Id1 Aln1 Geneset2 Bit2 Id2 Aln2 Geneset3...
CDD:100015 434 amelevg14 amelevg14:Apimel3aEVm009774t2 928 404 466 apimel1:GB10009-PA 965 395 434 apimel14nc:XP_006558759 896 356 434 ..
CDD:100016 425 apimel14nc amelevg14:Apimel3aEVm006739t16 942 417 452 apimel1:GB11920-PA 944 391 431 apimel14nc:XP_006567220 663 294 466 ..
CDD:100017 431 same4 amelevg14:Apimel3aEVm010780t1 880 353 432 apimel1:GB18755-PA 889 353 432 apimel14nc:XP_394981 889 353 432 ..
found here:
apismel4set,
beetle3eu2e,
daphnia2set,
dromel3set,
tick4set,
nasonia3set,
plants3set,
fish7set (longest alignments),
with summary tables cdd-alleuks-dompair5sum.txt (longest align)
and cdd-alleuks-dompair4sum.txt (highest identity)
of the 34 animal and plant species gene sets analyzed.
Statistics pHit, paln and pbest (the percentages present (hit), aligned, and best score per CD, out of nRef CD) are
plotted above, from columns "nRef nHit pHit bits iden algn paln ptop best same diff miss pbest".
NCBI CD descriptions and taxonomy are in cdd-delta14desc.txt
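The Conserved Domain score table format shown above can be read with a few lines of Python. This is a minimal sketch, not the original analysis code: the function names are hypothetical, and it assumes whitespace-separated fields, four fields per gene set (gene ID, bitscore, identity, alignment), and a gene set name that is the part of the gene ID before the colon. The three rows are the example rows from the format listing, including their trailing ".." for elided columns.

```python
def parse_cd_row(line):
    """Split one row into (CD id, CD length, Bestset label, hits by gene set)."""
    fields = line.split()
    cd_id, cd_len, bestset = fields[0], int(fields[1]), fields[2]
    rest = fields[3:]
    hits = {}
    # Each gene set contributes 4 fields: gene ID, bitscore, identity, alignment.
    # A trailing ".." (elided columns) leaves an incomplete chunk that is skipped.
    for i in range(0, len(rest) - len(rest) % 4, 4):
        gene_id, bits, ident, aln = rest[i:i + 4]
        set_name = gene_id.split(":", 1)[0]
        hits[set_name] = {"gene": gene_id, "bits": int(bits),
                          "ident": int(ident), "aln": int(aln)}
    return cd_id, cd_len, bestset, hits


def longest_aligned(hits):
    """Gene sets with the longest alignment to the domain (ties included)."""
    top = max(h["aln"] for h in hits.values())
    return sorted(s for s, h in hits.items() if h["aln"] == top)


rows = [
    "CDD:100015 434 amelevg14 amelevg14:Apimel3aEVm009774t2 928 404 466 "
    "apimel1:GB10009-PA 965 395 434 apimel14nc:XP_006558759 896 356 434 ..",
    "CDD:100016 425 apimel14nc amelevg14:Apimel3aEVm006739t16 942 417 452 "
    "apimel1:GB11920-PA 944 391 431 apimel14nc:XP_006567220 663 294 466 ..",
    "CDD:100017 431 same4 amelevg14:Apimel3aEVm010780t1 880 353 432 "
    "apimel1:GB18755-PA 889 353 432 apimel14nc:XP_394981 889 353 432 ..",
]

for row in rows:
    cd_id, cd_len, bestset, hits = parse_cd_row(row)
    print(cd_id, bestset, longest_aligned(hits))
```

For the first two rows the gene set with the longest alignment agrees with the Bestset column, and in the third row all listed gene sets tie. In the first row the highest bitscore belongs to a different gene set (apimel1, 965) than the longest alignment (amelevg14, 466), which shows how the choice of score can change the per-domain ranking.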
Fig 6a. Honey bee gene sets, per gene alignment to orthology reference genes
Percent alignment to reference (Align_common) on y-axis,
with gene groups on x-axis from longest (left) to shorter genes,
Best_of_Species genes are colored as red-orange=Best align,
blue-gray=Poorer align,
green dotted=Missing
Note: Apis Evigene 2014 and Apis NCBI 2014 used the same Apis RNA-seq data;
NCBI also used genome data.
Best includes ties, where 2nd is nearly same as best score.
Apis Evigene-M 2014, apisevg14
Apis NCBI 2014, apis14nc
Apis OGS 4.5, apis45
Apis OGS 1, apis1
|
Longest orthogene groups (1-200)
|
Mid sized orthogene groups (1000-1200)
|
Shorter orthogene groups (5000-5200)
|
Fig 6b. Daphnia water flea gene sets, per gene alignment to orthology reference genes
Percent alignment to reference (Align_common) on y-axis,
with gene groups on x-axis from longest (left) to shorter genes,
Best_of_Species genes are colored as red-orange=Best align,
blue-gray=Poorer align,
green dotted=Missing
Note: Best_of_Species colors are for 2 Daph. magna gene sets, and 2 Daph. pulex gene sets, separately
Best includes ties, where 2nd is nearly same as best score.
Daph. magna Evigene-M 2014, dapmagevg14
Daph. magna Evigene-G 2011, dapmag11
Daphnia pulex 2010 Evigene-G, dapplx10evg
Daphnia pulex OGS1, dapplxjgiv11
|
Longest orthogene groups (1-200)
|
Mid sized orthogene groups (1000-1200)
|
Shorter orthogene groups (5000-5200)
|
Fig 6c. Tick species gene sets, per gene alignment to orthology reference genes
Deer tick Evigene-M 2014, tickixevg14
Deer tick OGS1 2010, tickix
Spider mite OGS1, tickte
Zebra tick Evigene-M 2013, tickzbevg13
|
Longest orthogene groups (1-200)
|
Mid sized orthogene groups (1000-1200)
|
Shorter orthogene groups (5000-5200)
|
Fig 6d. Tribolium beetle gene sets, per gene alignment to orthology reference genes
Tribolium Evigene-M 2014, tribcas4evg2
Tribolium AUGUSTUS 2014, tribcas4aug
Tribolium NCBI 2014, tribcas14nc
Tribolium OGS1, tribcas1
|
Longest orthogene groups (1-200)
|
Mid sized orthogene groups (1000-1200)
|
Shorter orthogene groups (5000-5200)
|
Fig 6d2. Beetle species gene sets, per gene alignment to orthology reference genes
Pogonus Evigene-M 2013, pogoevg13
Tribolium NCBI 2014, tribcas14nc
Tribolium OGS1, tribcas1
|
Longest orthogene groups (1-200)
|
Mid sized orthogene groups (1000-1200)
|
Shorter orthogene groups (5000-5200)
|
Fig 6f. Fruitfly gene sets, per gene alignment to orthology reference genes
Fruitfly rel5.48, drosmel548n
Fruitfly rel5.0, drosmelr5
Fruitfly rel4, drosmelr4
|
Longest orthogene groups (1-200)
|
Mid sized orthogene groups (1000-1200)
|
Shorter orthogene groups (5000-5200)
|