Summary of OrthoMCL clustering of proteins among arthropod genomes (Aug 2008)

                   Uniq   UPar    Or1 Or+Par
Species    nGene   p_n0   p_n1   p_n2   p_n3      n0     n1     n2     n3
------------------------------------------------------------------------
Aphid      34660   0.30  0.287   0.17  0.242   10281   9933   6043   8403
Daphnia    30983   0.33  0.337   0.19  0.139   10292  10446   5933   4312
Nasonia    26252   0.23  0.275   0.28  0.213    6057   7232   7372   5591
...                 ^v     ^v     ^v
Culex      18783   0.21  0.105   0.49  0.200    3940   1965   9121   3757
DrosPse    18352   0.19  0.099   0.55  0.163    3409   1814  10138   2991
DrosMoj    16809   0.15  0.092   0.59  0.167    2589   1539   9880   2801
Tribolium  16222   0.32  0.085   0.46  0.136    5183   1383   7446   2210
Aedes      15419   0.13  0.050   0.56  0.257    1977    778   8707   3957
Ixodes     14528   0.57  0.088   0.27  0.067    8344   1279   3930    975  * outlier in n0, n2, n3
Apis       14436   0.30  0.048   0.53  0.125    4301    688   7643   1804
DrosMel    13421   0.12  0.029   0.75  0.100    1647    383  10052   1339  * removed 6000 alt-tr
Anopheles  12457   0.16  0.050   0.67  0.121    2036    620   8296   1505
Pediculus  11186   0.27  0.015   0.65  0.068    3019    170   7241    756
------------------------------------------------------------------------

Gene categories:
... n0    : unique singleton (no ortholog, no paralog)
... n1    : species-specific paralogs (no ortholog)
... n2    : has ortholog and no paralog
... n3    : has ortholog and >1 paralog
... p_n   : proportion of total gene calls (nGene)
... nGene : total gene predictions

The n0 and n1 categories can include genes that match species not
considered here; they simply have no ortholog, at the criteria used, within
this proteome collection.

Gene predictions are as provided by the sources. Different criteria for
retaining predictions were used (see source data notes below). Notably, some
sources exclude most predictions that lack homology or EST evidence; this
includes some or all Vecbase sources (mosquitos). Others, notably the top 3
gene counts, include all predictions.

Besides differing policies on what to report as a gene prediction, the main
factor separating high from low gene counts appears to be the number of
species-specific paralogs (n1) relative to 1-1 or 1-n orthologs (n2/n3).
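The p_n columns in the summary table are each category count divided by
nGene, where nGene is the sum n0 + n1 + n2 + n3. A minimal Python sketch of
that arithmetic follows, using three rows copied from the table; the names
`counts` and `proportions` are illustrative and not part of the original
analysis.

```python
# Category counts (n0, n1, n2, n3) copied from three rows of the summary table.
counts = {
    "Aphid":   (10281, 9933, 6043, 8403),
    "Ixodes":  (8344, 1279, 3930, 975),
    "DrosMel": (1647, 383, 10052, 1339),
}


def proportions(n):
    """Return (nGene, [p_n0, p_n1, p_n2, p_n3]) for one species."""
    n_gene = sum(n)  # nGene is the total over the four categories
    return n_gene, [x / n_gene for x in n]


for species, n in counts.items():
    n_gene, p = proportions(n)
    # Print in roughly the same layout as the summary table.
    print(f"{species:<10}{n_gene:>6}  " + "  ".join(f"{x:.3f}" for x in p))
```

Sorting species by p_n1 instead of nGene is one quick way to see how much of
a high gene count comes from species-specific paralogs.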

Methods:

An all-against-all BlastP was performed on these proteins, after removing
small (< 40 aa) predicted proteins. Alternate transcripts were removed after
BlastP matching, so that the most similar gene variants were used; these
included 6500 alternate transcripts from Dros. melanogaster, 1300 from Aedes,
and fewer than 800 from the others.

The similar genes were clustered using the standard methods outlined for
OrthoMCL [Li et al 2003; Chen et al 2007], which can be summarized this way.
Significance criteria are applied with the recommended options: a similarity
P-value <= 1e-05, protein percent identity >= 40%, and MCL inflation of 1.5
(this affects the granularity of clustering). Reciprocal best similarity
pairs between species, and reciprocal better similarity pairs within species
(i.e., recently arisen paralogs, or in-paralogs: proteins that are more
similar to each other within one species than to any protein in the other
species), are added to a similarity matrix. The matrix is normalized by
species and subjected to Markov clustering (MCL; Stijn van Dongen, 2000) to
generate ortholog groups that include recent in-paralogs. An additional round
of MCL clustering was applied to link related gene groups.

Citations:

  D. Gilbert, OrthoMCL clustering among 13 arthropod proteomes.
    http://insects.eugenes.org/arthropods/ Aug. 2008, gilbertd@indiana.edu

  OrthoMCL: http://www.orthomcl.org/

  Li Li, Christian J. Stoeckert, Jr., and David S. Roos
    OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes.
    Genome Res. 2003 13: 2178-2189.

  Feng Chen, Aaron J. Mackey, Jeroen K. Vermunt, and David S. Roos
    Assessing Performance of Orthology Detection Strategies Applied to
    Eukaryotic Genomes. PLoS ONE 2007 2(4): e383.

Source data notes

..SPECIES .. N genes ....... NOTES .................. SOURCE .............
Aphid        37994 genes   6500 <40aa, few alt-tr     (NCBI Gnomon)
  Aphid notes: ~4000 TE-genes, ~25000 with EST or protein match
    10% of aphid genes w/ daphnia match are to daphnia genes w/ no prot/EST
    evidence (571 no_evd_daphnia / 5976 daphnia)
Aedes        16789 genes                              (Vecbase)
Apis         17182 genes   2700 <40aa                 (NCBI Gnomon)
Anopheles    13134 genes                              (Vecbase)
Culex        18883 genes                              (Vecbase/Broad)
  Culex notes: 10693 Named proteins, 7414 conserved hypothetical prot.,
    539 hypothetical, 237 predicted (this last is the only group w/o evidence)
    see http://www.broad.mit.edu/annotation/genome/culex_pipiens.4/GeneFinding.html
Daphnia      31952 genes   1900 <40aa, few alt-tr     (NCBI Gnomon)
  Daphnia notes: ~600 TE genes, ~17000 with EST or protein match
    - 11571 daphnia predictions have no Protein,EST evidence as of Sep 2007 (noEvd)
    -- 7000 daphnia noEvd genes have paralog duplicate
    -- 1500 daphnia noEvd genes match Nasonia, of 12151 total matching Nasonia (10%)
    -- 2444 match one or more of these other arthropods (Aug 2008 data)
       : 1468 aphid, 595 aedes, 452 apis, 476 anopheles, 529 culex,
       : 723 dmel, 452 dmoj, 474 dpse,
       : 386 ixodes, 1095 nasonia, 415 pediculus, 575 tribolium
    -- 2128 have strong tile expression (1975 of these lack new aabugs match);
    -- 5624 have moderate tile expression (5261 lack new aabugs match)
Nasonia      27287 genes   2100 <40aa, few alt-tr     (NCBI Gnomon)
  Nasonia notes: 12563 with protein or EST match, ~5000 EST matched (from 30K est)
DrosMel      20513 genes   includes ~6000 alt-tr      (Ref 5)
DrosMoj      17950 genes   4800 <40aa                 (NCBI Gnomon CAF1)
DrosPse      19259 genes   1100 <40aa                 (NCBI Gnomon CAF1)
Ixodes       17742 genes                              (Vecbase/JCVI)
Pediculus    11198 genes                              (Vecbase)
Tribolium    16422 genes                              (Beetlebase rel3)
----------------------------------------------------------------------------

Transposon gene assessment
--------------------------
This uses two assessments: transposon naming in gene homology descriptions,
and PILER-DF transposon predictions for Nasonia (Chris Smith). In general,
these two assessments are consistent.

Species     allclus  TEname  TEwasp
Aedes         14783      50      68
Anopheles     11040      35      48
Aphid         24376    1261    1372  *B 3855 TEnamed genes using DroSpeGe aphid annotation
Apis          10134     151     110
Culex         14842      63      89
Daphnia       20689     262     166
DrosMel       12528      34     112  *C TEwasp clusters here appear to be non-TE genes (see below)
DrosMoj       14213     295     375
DrosPse       14935     235     216
Ixodes         6182      26      40
Nasonia       20195    2447    7514  *A 4037 PILER-DF annotated wasp genes (C.Smith)
Pediculus      8170      65      92
Tribolium     11037     211     232

Key:
  allclus = gene count in all OrthoMCL clusters (excludes singleton genes
            that can have homology to non-arthropod species)
  TEname  = gene count from clusters with a transposon name in one or more
            species
  TEwasp  = gene count from clusters that include wasp genes with Piler
            annotated TE
  TEName patterns include
    polyprotein|transpos(on|able)|retrovirus|jockey|gypsy|mariner|reverse transcriptase|mobile element|DNA polymerase|transposase|envelope

* A. A list of Nasonia gene predictions has 4037 identified as
  transposon-like by PILER-DF repeat predictions (Chris Smith; sfsu.edu).
  Of these, 2400 are also in TEnamed orthologous groups, while 7500 Nasonia
  genes are included in orthologous groups with at least 1 Piler-predicted
  wasp TE gene. For instance, the largest Arthropod orthologous group
  (ARP1_G0) has 413 Nasonia genes and 1 Daphnia gene; of the Nasonia genes,
  329 are identified by Piler as TE genes. The other 84 in this group are
  also good candidates for TE genes. However, this inference does not apply
  to all such cases; one could say the TE gene count in Nasonia lies between
  2400 and 7500. See also (C) for cases where these are not TE genes.

* B. A more detailed annotation of Aphid genes at DroSpeGe finds more
  TE-named genes.

* C. Are the TEwasp gene clusters with well annotated DrosMel genes that
  lack TE names really transposon-related? Here is one check. It suggests
  that the 80 DrosMel genes, and a number of Nasonia genes, from TEwasp
  clusters are not really TE genes, but are well known (and repetitive)
  arthropod genes.
  This is not a large fraction of those identified as TE genes. It does,
  however, recommend caution with TE annotations: flagging these as possible
  TE, rather than excluding them from further use, is likely the best action.

Arthropod gene groups with DrosMel>0 AND TEwasp BUT NOT TEname

ArpID       Description
ARP1_G9     cytoplasmic dynein heavy chain; src=aedes_AAEL000885-PA
ARP1_G17    Histone H4 replacement CG3379-PC (LOC656232); src=tcas3_GLEAN_05056
ARP1_G21    DSCAM, down syndrome cell adhesion molecule; src=culex_CPIJ000087
ARP1_G37    histone H2B.1; src=culex_CPIJ020276
ARP1_G41    histone H3 type 2; src=culex_CPIJ012445
ARP1_G163   CG40081-PA.3; src=dmel_NP_001015348.1
ARP1_G436   CG16863 CG16863-PB; src=dmel_NP_001036361.1
ARP1_G465   CG14204, CG14219, CG14205, CG13325, .. DrosMel; conserved hypothetical protein; src=nasonia_NCBI_hmm490274
ARP1_G541   Sorbitol dehydrogenase-2 CG4649-PA; src=acyr1_ncbi_hmm76343
ARP1_G1873  lysine-specific histone demethylase 1; src=culex_CPIJ002197
ARP1_G2087  AMME syndrome candidate gene 1 protein; src=culex_CPIJ003770
ARP1_G2375  strawberry notch CG1903-PC; src=acyr1_ncbi_hmm9583
ARP1_G2939  conserved hypothetical protein, CG32030 DrosMel; src=culex_CPIJ006609
ARP1_G4728  Niemann-Pick Type C-2; src=aedes_AAEL015136-PA
ARP1_G5244  CG41123 CG41123-PA; src=dmel_NP_001015393.1
ARP1_G6462  Smg6 CG6369 DrosMel; Telomerase-binding protein EST1A; src=amel4_ncbi_hmm914
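
The selection used for the list above (DrosMel>0 AND TEwasp BUT NOT TEname)
can be sketched in Python as follows. This is only an illustration under
stated assumptions: the regular expression is the TEName pattern from the key,
applied case-insensitively (an assumption), and a cluster is tested against a
single description, whereas the report checks names in one or more species.
The per-cluster DrosMel counts and Piler flags in `clusters` are made up for
the example, and the EXAMPLE_* clusters are hypothetical; only the ARP1_G9 and
ARP1_G37 descriptions come from the list.

```python
import re

# TEName patterns as listed in the key; case-insensitive matching is an assumption.
TE_NAME = re.compile(
    r"polyprotein|transpos(on|able)|retrovirus|jockey|gypsy|mariner|"
    r"reverse transcriptase|mobile element|DNA polymerase|transposase|envelope",
    re.IGNORECASE,
)


def is_te_named(description):
    """True if a gene or cluster description matches any TEName pattern."""
    return TE_NAME.search(description) is not None


# (cluster id, description, DrosMel gene count, has Piler-annotated wasp TE gene)
# The counts and flags are illustrative values, not data from the report.
clusters = [
    ("ARP1_G9", "cytoplasmic dynein heavy chain", 1, True),
    ("ARP1_G37", "histone H2B.1", 2, True),
    ("EXAMPLE_1", "gypsy retrotransposon polyprotein", 1, True),  # hypothetical
    ("EXAMPLE_2", "hypothetical protein", 0, True),               # hypothetical
]


def suspect_non_te(records):
    """Cluster ids with DrosMel>0 AND TEwasp BUT NOT TEname."""
    return [
        cid
        for cid, desc, n_dmel, te_wasp in records
        if n_dmel > 0 and te_wasp and not is_te_named(desc)
    ]


print(suspect_non_te(clusters))
```

Clusters returned by such a filter are candidates for being flagged as
possible TE rather than excluded, in line with the caution in note C.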