
EvidentialGene: Orthology completeness of Arthropod gene assemblies

2014-Nov, D. Gilbert, gilbertd at indiana edu
2015-Mar update; See also EvidentialGene: Alignment scores of Gene sets Constructed from mRNA-Seq Assembly or Genome Predictions/mappings, 2013

These EvidentialGene sets for 12 arthropods, fishes and plants are the most orthology-complete when compared to gene sets for the same or related species produced by well-regarded gene discovery methods (NCBI, Ensembl, Augustus, Maker, Glean, ...). Orthology completeness measures the presence and fullness of genes with protein homology to related species. It is measured here on the subset of genes with orthology to about 9,000 gene families that are shared by 10 species: 8 arthropods plus human and a fish. A similar ranking of gene set completeness is obtained using conserved protein domains (NCBI CDD).

mRNA-assembled genes are constructed without reference protein information or genome data. Genome gene sets use both reference genes and genomes to produce gene models. That is, the genome genes have the answers in front of them, while mRNA-seq genes do not. Some of these gene set comparisons use the same RNA-seq data, where the genome-based versions have added sources of information, and of errors. If genes can be fully constructed from mRNA-seq alone, then adding genome data, prediction software, and related species proteins may contribute more mistakes than improvements.

This summary shows that Arthropod species gene sets constructed from mRNA-seq are often more complete than genes from genome-modelling methods. mRNA-seq genes are more biologically convincing as they don't rely on the reference species genes, predictions or genome assembly, and thus are not subject to artifacts from those sources. When they match the reference genes, it is true discovery.


Fig 4. Orthology completeness summaries per Arthropod species-clade

Fig 4a. Honey bee

Honeybee4

Fig 4b. Daphnia water fleas (D. magna, pulex)

Daphnia4

Fig 4c. Ticks (Deer tick, Zebra tick, Spider mite)

Ticks4

Fig 4d. Tribolium Beetle

Tribolium4

Fig 4f. Fruit fly

Fruitfly4

Fig 4. Legend

  1. Nref_common = Reference ortholog families found, as percent of common ortholog gene families found in gene set (common set = 8723 groups, common to each of 10 species including these)
  2. AaSize = protein size, as percent of ortholog reference protein size
  3. Align_common = Alignment to Reference ortholog protein, as percent of ref protein.
  4. Best_of_Species = gene set score over all reference orthologs, as percent of all reference gene families for which this gene set has the best alignment score, or ties the best, among the gene sets compared. E.g. 100% for setA means setA has the best or tied score for every ortholog, with none better in sets B or C; 90% for setA implies 10% are better in setB, setC, ...
  5. Species gene sets left-right order, newest to oldest, as listed below. Evigene sets are solid color bars, others are hashed color.
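
As a minimal sketch of how the Best_of_Species percentage in item 4 could be computed, the Python below counts, for each gene set, the share of reference groups where it has the best or tied alignment. The function name, the input layout and the example scores are hypothetical. The 2% tie tolerance follows the methods description on this page, read here as a relative difference from the top score, which is an assumption.

```python
def best_of_species(scores, tie_frac=0.02):
    """Percent of reference groups where each gene set is best or tied.

    scores:   dict of group id -> dict of gene set -> percent alignment;
              a gene set absent from the inner dict is a miss for that group.
    tie_frac: relative tolerance for a tie with the top score (assumed).
    """
    gene_sets = sorted({s for per_group in scores.values() for s in per_group})
    wins = {s: 0 for s in gene_sets}
    for per_group in scores.values():
        if not per_group:
            continue  # no gene set aligned to this group
        top = max(per_group.values())
        for s, aln in per_group.items():
            if aln >= top * (1.0 - tie_frac):
                wins[s] += 1
    n_groups = len(scores)
    return {s: round(100.0 * wins[s] / n_groups, 1) for s in gene_sets}


# Hypothetical alignment scores for three gene sets over four reference groups.
example = {
    "G1": {"setA": 90.0, "setB": 80.0, "setC": 89.0},
    "G2": {"setA": 50.0, "setB": 75.0},
    "G3": {"setA": 99.0, "setB": 99.0, "setC": 60.0},
    "G4": {"setA": 70.0, "setC": 40.0},
}
result = best_of_species(example)
print(result)
```

Because ties count for every tied set, the percentages across gene sets need not sum to 100.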

Arthropod Species gene sets

Honey bee, Insecta
apisevg14, Apis mellifera, Evigene mRNA assembly 2014.06
apis14nc, Apis mel., NCBI genome gene preds 2014.02 (r102)
apis45, Apis mel., OGS v3.2 (genome v4.5) genome gene preds 2012, doi: 10.1186/1471-2164-15-86
apis1, Apis mel., OGS v1.1 genome gene preds 2006/8, doi:10.1038/nature05260
Water flea, Crustacea
dapmagevg14, Daphnia magna, Evigene mRNA assembly + genome predict. 2014.08
dapmag11, Daphnia magna, Evigene genome gene preds 2011
dapplx10evg, Daphnia pulex, Evigene genome gene preds 2010
dapplxjgiv12, Daphnia pulex, OGS v1.1 + JGI updates 2011 (v1.2)
dapplxjgiv11, Daphnia pulex, OGS v1.1 genome gene preds 2007, doi: 10.1126/science.1197761
Ticks & Mites, Chelicerata
tickixevg14, Ixodes scapularis deer tick, Evigene mRNA assembly 2014.06
tickix, Ixodes scap. deer tick, genome gene preds 2011 (IscaW1.1)
tickte, Tetranychus ur. spider mite, genome gene preds 2011 (v.0620111123), doi:10.1038/nature10640
tickzbevg13, Rhipicephalus pulchellus zebra tick, Evigene mRNA assembly 2013
Beetles, Insecta
pogoevg13, Pogonus chalceus beetle, Evigene mRNA assembly 2013
tribcas4evg2, Tribolium castaneum beetle, Evigene mRNA assembly 2014.12
tribcas14nc, Tribolium cast., NCBI genome gene preds 2014.05 (r102)
tribcas4aug, Tribolium cast., AUGUSTUS genome gene preds, tcas4.0 2014
tribcas1, Tribolium cast., OGS v1 genome gene preds 2006/08, doi:10.1038/nature06784
Fruit fly, Insecta
drosmel548n, Drosophila melanogaster, rel 5.48, 2013
drosmelr5, Drosophila mel., rel 5.30, 2011
drosmelr4, Drosophila mel., rel 4.0, 2004

Plant & Fish Species gene sets

Fishes
Catfishevg, Ictalurus punctatus, Evigene mRNA assembly 2013
Killifishevg, Fundulus heteroclitus, Evigene mRNA assembly + genome preds 2014
Cavefish, Astyanax mexicanus, EnsEMBL r74 genome gene preds 2013.11
Mayzebr, Maylandia zebra african cichlid, NCBI genome gene preds 2013 (r100/MetZeb1.1)
Platyfish, Xiphophorus maculatus, EnsEMBL r74 genome gene preds 2013.11
Tilapia, Oreochromis niloticus, EnsEMBL r74 genome gene preds 2013.11
Zfish, Danio rerio zebrafish, Zv9/EnsEMBL r74 genome gene preds 2013.11

Plants
Banana1evg, Musa acuminata Banana plant, Evigene mRNA assembly 2013
Banana1g, Banana, genome gene preds 2012, doi:10.1038/nature11241
Cacao1evg, Theobroma cacao chocolate tree, Evigene mRNA assembly + genome predict. 2012, doi:10.1186/gb-2013-14-6-r53
Cacao1cr, Theobroma cacao, genome gene preds 2011, doi:10.1038/ng.736
Pine1evg, Loblolly pine tree, Evigene mRNA assembly 2013, doi:10.1186/gb-2014-15-3-r59
Pine1mk, Loblolly, genome gene preds (Maker) 2013, doi:10.1186/gb-2014-15-3-r59

Fig 5. Conserved Domain gene-set completeness of Arthropods, Fish and Plants

Fig 5a. Conserved Domains in Arthropod gene sets

Fig 5b. Conserved Domains in Fishes

Fig 5c. Conserved Domains in Plants

Fig 5. Legend

  1. Conserved Domains from NCBI CDD (2014.10), subsets by taxonomic group, CD found (a) in Arthropods, (b) in Vertebrates, and (c) in Plants with deltablastp. CD set for Arthropods (a) is determined as Eukaryote CD that occur in all 3 clades Insecta, Crustacea and Chelicerata, in at least one species gene set.
  2. pHit = Conserved Domains found, as percent of all conserved domains in species/clade set
  3. Alignment = percent Alignment to Domains.
  4. Best_of_Species = gene set best with longest alignment, or same as best (tie), per domain, as percent of all domains. I.e. 100% best for setA = all genes have best/tie domain alignment for setA, none better in sets B or C, 60% for setA implies 40% better in other gene sets.
  5. Calculations are within species/clade (color coded sets), normalized to 100% of all CD per clade.
  6. Arthropod gene sets are left-right ordered, newest to oldest, as listed above. Evigene sets are solid color bars, others are hashed color.
  7. Gene set rank order for Best_of_Species is the same for the Orthology group and Conserved Domain analyses.
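
The pHit and Alignment statistics in items 2 and 3 can be illustrated with a short Python sketch. The function name `cd_summary`, the dictionary inputs and the example domain lengths are hypothetical. Capping alignment at the domain length, and averaging over all reference CDs with misses counted as zero (in the spirit of item 5), are assumptions rather than confirmed details of the analysis.

```python
def cd_summary(cd_lengths, hits):
    """Summarize conserved-domain coverage for one gene set.

    cd_lengths: dict of CD id -> domain length, for all CDs in the clade set.
    hits:       dict of CD id -> longest alignment length found in the gene
                set; CDs with no hit are absent.
    Returns (pHit, mean percent alignment over all reference CDs).
    """
    n_ref = len(cd_lengths)
    if n_ref == 0:
        return 0.0, 0.0
    n_hit = 0
    aln_total = 0.0
    for cd_id, cd_len in cd_lengths.items():
        aln = hits.get(cd_id)
        if aln is None:
            continue  # miss: contributes zero alignment
        n_hit += 1
        # Alignments longer than the domain are capped at 100% (assumed).
        aln_total += min(100.0, 100.0 * aln / cd_len)
    p_hit = round(100.0 * n_hit / n_ref, 1)
    p_aln = round(aln_total / n_ref, 1)
    return p_hit, p_aln


# Hypothetical domains and hits for one gene set.
cd_lengths = {"CD1": 400, "CD2": 200, "CD3": 100, "CD4": 50}
hits = {"CD1": 320, "CD2": 220, "CD3": 100}
summary = cd_summary(cd_lengths, hits)
print(summary)
```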



Table 4. Orthology gene set completeness across species,
measured with average protein size and orthology in gene groups

Table 4a. Gene set completeness of Arthropod species

             Common Families     All Families
Species   cBits aaSize oMiss tBits oGroup Tiny
--------- ------------------ -----------------
daphniam   650   46     18    466  11523  1.8%
daphniap   643  -25     36    462  11670  5.1%
beetlet    526  -26     42    351   8765  4.1%
beetlep    541   15     78    358   8875  3.5%
honeybee   532   38    161    346   8682  3.1%
fruitfly   470   68    203    290   7801  1.8%
----------------------------------------------

Table 4b. Gene set completeness of Plant species

             Common Families   All Families
Geneset      cBits  aaSize    tBits  oGroup  Tiny
-----------  -------------    -------------------
cacao1evg      653      15      547   15161  0.7%
cacao1cr       641      11      530   14897  1.5%
banana1g       522     -19      371   12537  4.6%
banana1evg     521     -21      349   11733  7.5%
-------------------------------------------------

Table 4c. Gene set completeness of Fish species

             Common Families    All Families
Species   cBits aaSize oMiss  tBits  oGroup Tiny
--------- ------------------- -------------------
killifish  803     50     18   585   17272   1.1%
maylandia  824     45     76   596   16469   1.1% 
tilapia    822      6    223   568   14905   1.9% 
platyfish  783    -12    118   549   15305   4.7% 
zebrafish  711     -9    366   478   15190   4.8% 
catfish    725     21    729   470   14276   3.4% 
-------------------------------------------------
Legend: cBits = bitscore average for 4740 common gene groups (plant common n=8461); tBits = bitscore average for all ortholog groups; aaSize = average protein size difference from group median; oMiss = missing ortholog groups that are common to other 9 of 10 species; oGroup = number of ortholog gene groups in species; Tiny = percent species gene size outliers below 2sd of group median size;
Arthropod Species gene sets: daphniam = daphnia magna dapmagevg14, daphniap = daphnia pulex dapplx10evg, beetlet = tribolium cas tribcas1, beetlep = pogonus cha. pogoevg13, honeybee = apis mel. apisevg14, fruitfly = dros. mel. drosmel548n
Analysis source: Arthropods, arp7bor5/arp7s10f-orthomcl, 2014.09.30; Fish, fish11gor3-orthomcl, 2013.12.11 ; Plants, orthomcl 2014.02.09
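
The aaSize and Tiny columns in the legend above can be sketched in Python as follows. This is an illustration under assumptions, not the analysis code: the function name, the input layout and the example sizes are hypothetical, the standard deviation is taken over all members of a group (including the species being scored), and sizes are assumed to be in amino acid residues.

```python
from statistics import median, pstdev


def size_stats(groups, species):
    """Return (aaSize, Tiny) for one species.

    groups:  dict of group id -> dict of species -> protein size.
    aaSize:  average difference of the species protein size from the
             group median.
    Tiny:    percent of the species genes more than 2 sd below the
             group median size.
    """
    diffs = []
    tiny = 0
    for sizes in groups.values():
        if species not in sizes or len(sizes) < 2:
            continue  # species missing, or no group to compare against
        values = list(sizes.values())
        med = median(values)
        sd = pstdev(values)
        size = sizes[species]
        diffs.append(size - med)
        if size < med - 2.0 * sd:
            tiny += 1
    if not diffs:
        return 0.0, 0.0
    return (round(sum(diffs) / len(diffs), 1),
            round(100.0 * tiny / len(diffs), 1))


# Hypothetical protein sizes for six species in two ortholog groups.
groups = {
    "G1": {"sp1": 100, "sp2": 500, "sp3": 500, "sp4": 500, "sp5": 500, "sp6": 500},
    "G2": {"sp1": 320, "sp2": 300, "sp3": 300, "sp4": 310, "sp5": 290, "sp6": 300},
}
stats = size_stats(groups, "sp1")
print(stats)
```

In this example sp1 has one fragment-sized protein (G1) and one near-median protein (G2), so half of its genes are flagged as Tiny.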



Orthology analysis methods

2015-Mar update to 2014-Nov assessment

The assignment of target genes/transcripts to ortho-groups is changed to one group per target locus (per species gene set). The prior method allowed alternate transcripts of a locus to fill other ortho-groups, which can be biologically valid for some loci, but it also included a large set of weak, partially aligned group members. The per-gene calculation of alignment and identity from blastp tables is also updated to improve the addition of part-hits (HSPs), with little effect except for a few largish genes. This update reduces the number of "missed" ortho-groups for the gene sets with few alternates. The CDD analysis is not affected by alternates and is unchanged.

One visible consequence in the per-group bar graphs (Figs 6a-6f) is that the longest reference gene groups are almost all filled by each gene set (i.e. few misses for these most obvious ortho-genes), while short reference gene groups still contain a fair number of misses per gene set (shorter genes are in some ways harder to find accurately, due to "noise" factors). For the longest genes, sizes often differ among gene sets; in particular, the initial gene sets have many short or partial models of complex long genes. See e.g. Ticks (Fig 6c), where partial genes are common with the fragmented genome assembly of deer tick OGS1.

Another consequence in the summary bar graphs is that the Best_of_Species percentages have increased for poorer gene sets, to within 20% to 30% of the top-scoring gene sets. This is still a large range in gene set completeness. Notice the comparably small range among Fruitfly gene set versions that span over a decade of effort. Both these changes indicate a likely more valid way to assess gene set quality in this restricted case of orthology completeness. Bear in mind this is one gene set quality, albeit perhaps the most biologically rigorous one available.
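
The one-group-per-locus assignment described in this update can be sketched as below. This is a hypothetical illustration: the function names and example IDs are invented, and deriving the locus by stripping a trailing `t<number>` suffix from the transcript ID is an assumption about the ID style, not a documented rule of the analysis.

```python
import re


def locus_of(transcript_id):
    """Strip a trailing alternate-transcript suffix such as 't4' (assumed ID style)."""
    return re.sub(r"t\d+$", "", transcript_id)


def one_group_per_locus(hits):
    """Keep only the top-scoring ortho-group hit for each locus.

    hits: iterable of (transcript_id, group_id, align_score).
    Returns dict of locus -> (group_id, transcript_id, align_score).
    """
    best = {}
    for tid, gid, score in hits:
        locus = locus_of(tid)
        if locus not in best or score > best[locus][2]:
            best[locus] = (gid, tid, score)
    return best


# Hypothetical hits: three alternates of one locus, one transcript of another.
hits = [
    ("Gene000005t1", "G10", 8000),
    ("Gene000005t4", "G10", 8474),
    ("Gene000005t7", "G99", 300),   # weak hit of an alternate to another group
    ("Gene000003t2", "G25", 8819),
]
assigned = one_group_per_locus(hits)
filled_groups = {gid for gid, _, _ in assigned.values()}
print(assigned)
print(sorted(filled_groups))
```

Under this rule the weak hit to G99 no longer fills that group, so G99 would count as a miss unless another locus aligns to it.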

Prior data and plots of 2014-Nov assessment are available here.

Orthology completeness assessment

  • Orthology gene groups are those of ARP7 OrthoMCL analysis set of 10 species, group IDs of ARP7f_Gnnn, such as ARP7f_G1000.
    This gene orthology database ARP7 is at arthropods.eugenes.org/arthropods/orthologs/ARP7/,
    including orthology tabulations and annotations, and species gene set proteins.
    Gene family report pages are found with ARP7 IDs at http://arthropods.eugenes.org/genepage/arp7xml/ARP7f_G1000

  • Orthology gene groups that are common to all 10 species are used, though missing from some species gene sets, n=8723 listed in allspp-refarp7s10fset1.comgrpid2
    Tables for this analysis were produced using blastp of all transcript proteins (primary and alternate) to the ARP7 OrthoMCL reference protein database (n=131876 proteins of 10 species). The best aligned or best bitscore match of query gene set to reference database is tabulated, along with the reference ARP7 gene group. These tabulations exclude same-species matches (query-ref). Alignment score is preferred over bitscore (or e-value) in this analysis because the bitscore is very strongly affected by taxonomic distance, alignment less so. Using bitscore gives different results per gene, but returns the same basic results in overall gene set qualities.

  • Tabulation of closest alignment of query genes x reference orthology gene group includes classification of Best_of_Species across gene sets per species, as Best=higher alignment than any others, Same=tied for top alignment (within 2% of top), Diff=lower than top alignment, Miss=no alignment to reference orthology group gene. For Ticks gene sets, related species were grouped into Best_of_Species classification.

  • Alignment score tables used for above summaries are in this format
    tgeneset     orgroupid    refgeneid                 trgeneid                    bits    iden    algn    pal     dlen    rsize   tsize   isbest  comorgrp
    amelevg14    ARP7f_G1370  nasvit:Nasvi2EG037131t1   amevg14:Apimel3EVm000005t4  14052   6667    8474    89.9    93.6    9421    8815    best    1
    apimel14nc   ARP7f_G1370  nasvit:Nasvi2EG037131t1   apimel14nc:XP_006568818     12637   6007    7574    80.4    81.2    9421    7654    diff    1
    apismel45    ARP7f_G1370  nasvit:Nasvi2EG037131t1   apismel45:GB55483-PA        13980   6618    8456    89.8    100     9421    9504    diff    1
    apimel1      ARP7f_G1370  nasvit:Nasvi2EG037131t1   apimel1:GB11358-PA          12598   5976    7655    81.3    98.6    9421    9290    diff    1
    
    amelevg14    ARP7f_G251   tribcas:XP_008191512      amevg14:Apimel3EVm000003t2  11260   5507    8819    34.6    40.7    25481   10366   diff    1
    apimel14nc   ARP7f_G251   tribcas:XP_008191512      apimel14nc:XP_006564376     13101   6650    11658   45.8    78.3    25481   19952   best    1
    apismel45    ARP7f_G251   tribcas:XP_008191512      apismel45:GB47977-PA        10029   5206    10376   40.7    91.9    25481   23421   diff    1
    apimel1      ARP7f_G251   tribcas:XP_008191512      apimel1:GB14642-PA          8242    4124    7786    30.6    67.7    25481   17256   diff    1
    
    found here: apismel4set, tribolium4set, daphmag2set, daphplx3set, dromel3set, tick4set
    Alignment score summary table for all species gene sets

  • ARP7 OrthoMCL analysis includes 10 species, 8 arthropods plus Human and a Fish (maylandia), with species proteins excluding Daph magna available at EvidentialGene/daphnia/daphnia_magna_new/Proteins/
    arp7s10b14nodmag.species.txt
    45212 daphplx Daphnia pulex water flea, 2010 beta gene set (~10,000 noncoding included)
    29127 daphmag Daphnia magna water flea, 2014 gene set
    13927 dromel  Drosophila melanogaster fruit fly, FBgn version 5.x (date?)
    72392 honbee  Apis mellifera honey bee, 2014 evigene mRNA-assembly (url)
    39357 human   Homo sapiens human, UniProt 2014 ?
    23194 mayzebr Maylandia zeb.. cichlid fish, 2014 NCBI gene set
    36390 nasvit  Nasonia vitripennis jewel wasp, 2010 evigene
    26962 penmon  Pen. mon. tiger shrimp, 2013 evigene mRNA assembly (url)
    21503 pogcha  Pog. cha. beetle, 2013 evigene mRNA assembly (url)
    12420 tribcas Tribolium castaneum beetle, 2014 NCBI gene set

  • OrthoMCL (www.orthomcl.org) methods include reciprocal best blast hit (bbh) of all primary transcript proteins (one protein/gene locus), then tabulation of best reciprocal hits for within (paralog) and between (ortholog) species, followed by Markov clustering (MCL) to form gene groups. Several papers attest to the validity of OrthoMCL and MCL clustering for identifying orthology gene families.

  • Conserved Domain analysis is performed with NCBI deltablast, using cdd_delta database of 2014 Oct. CD subsets are derived from taxonomic division common to all sequences provided as representatives of each CD. Longest alignment to each CD, or highest identity, per gene set is used to measure gene set presence (Hit) and score. Alignment and Identity scores give equivalent rank orders for gene sets. Bitscore and e-value are not used because they are confounded by sizes of proteins that are always longer than domains they contain.

  • Conserved Domain alignment score tables used for above summaries are in this format
    CDid      CDlen Bestset    Geneset1                      Bit1 Id1 Aln1 Geneset2          Bit2 Id2 Aln2 Geneset3...
    CDD:100015  434 amelevg14  amelevg14:Apimel3aEVm009774t2  928 404 466 apimel1:GB10009-PA  965 395 434 apimel14nc:XP_006558759 896 356 434 ..
    CDD:100016  425 apimel14nc amelevg14:Apimel3aEVm006739t16 942 417 452 apimel1:GB11920-PA  944 391 431 apimel14nc:XP_006567220 663 294 466 ..
    CDD:100017  431 same4      amelevg14:Apimel3aEVm010780t1  880 353 432 apimel1:GB18755-PA  889 353 432 apimel14nc:XP_394981    889 353 432 ..
    
    found here: apismel4set, beetle3eu2e, daphnia2set, dromel3set, tick4set, nasonia3set, plants3set, fish7set (longest alignments),
    with summary tables cdd-alleuks-dompair5sum.txt (longest align) and cdd-alleuks-dompair4sum.txt (highest identity) of the 34 animal and plant species gene sets analyzed.
    Statistics pHit, paln, pbest, the percentages of present (hit), aligned and best score per CD of nRef CD are plotted above, from columns "nRef nHit pHit bits iden algn paln ptop best same diff miss pbest"
    for NCBI CD descriptions and taxonomy in cdd-delta14desc.txt
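
The orthology alignment score table format shown in the methods above can be read with a few lines of Python. This is a sketch under assumptions: the column names are taken from the header row of the example, the table is treated as whitespace-delimited, and the relations pal = 100 * algn / rsize and dlen = 100 * tsize / rsize (capped at 100) are inferred from the example rows rather than stated in the methods. The function names are hypothetical.

```python
COLUMNS = ["tgeneset", "orgroupid", "refgeneid", "trgeneid", "bits", "iden",
           "algn", "pal", "dlen", "rsize", "tsize", "isbest", "comorgrp"]
INT_COLS = {"bits", "iden", "algn", "rsize", "tsize", "comorgrp"}
FLOAT_COLS = {"pal", "dlen"}


def parse_rows(text):
    """Parse whitespace-delimited score rows into dicts; skip header and blank lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) or fields[0] == "tgeneset":
            continue
        row = dict(zip(COLUMNS, fields))
        for col in INT_COLS:
            row[col] = int(row[col])
        for col in FLOAT_COLS:
            row[col] = float(row[col])
        rows.append(row)
    return rows


def percent_of_ref(value, rsize):
    """Percent of reference protein size, capped at 100 (inferred relation)."""
    return round(min(100.0, 100.0 * value / rsize), 1)


# Two rows copied from the example table for group ARP7f_G1370.
SAMPLE = """\
amelevg14    ARP7f_G1370  nasvit:Nasvi2EG037131t1   amevg14:Apimel3EVm000005t4  14052   6667    8474    89.9    93.6    9421    8815    best    1
apismel45    ARP7f_G1370  nasvit:Nasvi2EG037131t1   apismel45:GB55483-PA        13980   6618    8456    89.8    100     9421    9504    diff    1
"""
rows = parse_rows(SAMPLE)
for r in rows:
    print(r["tgeneset"],
          percent_of_ref(r["algn"], r["rsize"]),   # compare with the pal column
          percent_of_ref(r["tsize"], r["rsize"]))  # compare with the dlen column
```

For these two rows the recomputed values match the tabulated pal and dlen columns, which is the basis for the inferred relations.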


Fig 6a. Honey bee gene sets, per gene alignment to orthology reference genes

Percent alignment to reference (Align_common) on y-axis, with gene groups on x-axis from longest (left) to shorter genes,
Best_of_Species genes are colored as red-orange=Best align, pink=Same/Best align, blue-gray=Poorer align, purple dotted=Missing
Note: Apis Evigene 2014 and Apis NCBI 2014 used same Apis RNA-seq, NCBI used additional genome data.
Best includes ties, where 2nd is nearly same as best score.



Apis Evigene-M 2014, apisevg14



Apis NCBI 2014, apis14nc



Apis OGS 4.5, apis45



Apis OGS 1, apis1
Longest orthogene groups (1-200)
arp7genesiden66_Honeybee0
Mid sized orthogene groups (1000-1200)
arp7genesiden66_Honeybee1000
Shorter orthogene groups (5000-5200)
arp7genesiden66_Honeybee5000

Fig 6b. Daphnia water flea gene sets, per gene alignment to orthology reference genes

Percent alignment to reference (Align_common) on y-axis, with gene groups on x-axis from longest (left) to shorter genes,
Best_of_Species genes are colored as red-orange=Best align, blue-gray=Poorer align, purple dotted=Missing
Note: Best_of_Species colors are for 2 Daph. magna gene sets, and 2 Daph. pulex gene sets, separately
Best includes ties, where 2nd is nearly same as best score.



Daph. magna Evigene-M 2014, dapmagevg14


Daphnia pulex 2010 Evigene-G, dapplx10evg


Daph. magna Evigene-G 2011, dapmag11


Daphnia pulex OGS1, dapplxjgiv11
Longest orthogene groups (1-200)
arp7genesiden66_Daphnia0
Mid sized orthogene groups (1000-1200)
arp7genesiden66_Daphnia1000
Shorter orthogene groups (5000-5200)
arp7genesiden66_Daphnia5000

Fig 6c. Tick species gene sets, per gene alignment to orthology reference genes



Deer tick Evigene-M 2014, tickixevg14


Deer tick OGS1 2010, tickix



Spider mite OGS1, tickte



Zebra tick Evigene-M 2013, tickzbevg13
Longest orthogene groups (1-200)
arp7genesiden66_Ticks0
Mid sized orthogene groups (1000-1200)
arp7genesiden66_Ticks1000
Shorter orthogene groups (5000-5200)
arp7genesiden66_Ticks5000

Fig 6d. Tribolium beetle gene sets, per gene alignment to orthology reference genes

Percent alignment to reference (Align_common) on y-axis, with gene groups on x-axis from longest (left) to shorter genes,
Best_of_Species genes are colored as red-orange=Best align, pink=Same/Best align, blue-gray=Poorer align, purple dotted=Missing



Tribolium Evigene-M 2014, tribcas4evg2


Tribolium AUGUSTUS 2014, tribcas4aug


Tribolium NCBI 2014, tribcas14nc



Tribolium OGS1, tribcas1



Longest orthogene groups (1-200)
arp7genesiden66_Tribolium0
Mid sized orthogene groups (1000-1200)
arp7genesiden66_Tribolium1000
Shorter orthogene groups (5000-5200)
arp7genesiden66_Tribolium5000

Fig 6f. Fruitfly gene sets, per gene alignment to orthology reference genes



Fruitfly rel5.48, drosmel548n


Fruitfly rel5.0, drosmelr5



Fruitfly rel4, drosmelr4



Longest orthogene groups (1-200)
arp7genesiden66_Fruitfly0
Mid sized orthogene groups (1000-1200)
arp7genesiden66_Fruitfly1000
Shorter orthogene groups (5000-5200)
arp7genesiden66_Fruitfly5000



Developed at the Genome Informatics Lab of Indiana University Biology Department