Summary of OrthoMCL clustering of proteins among arthropod genomes, version ARP2 (Dec 2009)

ARTHROPOD GENE GROUP PHYLOGENY (v. ARP2, 2010.01, D. Gilbert)
===============================================================
From OrthoMCL/BlastP analysis of 14 species proteomes: a distance matrix of
best reciprocal orthologs (using formula D2 of doi:10.1093/nar/gki181),
followed by Phylip Fitch distance trees (J=250), rooted at Ixodes.
Plotted in arthropod-arp2-tree.pdf and arthropod-arp2-pheno.pdf

#--------------------------------
# Arthropod Gene Tree (ARP2)
(((((((
((DrosMel:0.16946,DrosPse:0.27553):0.03808,DrosMoj:0.29297):0.37569,
((Culex:0.28236,Aedes:0.20645):0.14114,Anopheles:0.27578):0.19011):0.10815,
Bombyx:0.57957):0.01218,Tribolium:0.46032):0.02573,
(Apis:0.32155,Nasonia:0.56158):0.17971):0.01871,
Pediculus:0.41209):0.02903,Aphid:0.81649):0.10212,
Daphnia:0.75698,Ixodes:0.72260);
#--------------------------------

SUMMARY OF ORTHOLOGY/DUPLICATION IN ARTHROPOD GENOMES (Dec 2009) ***
===================================================================================
species  nGene nGroup Uniq1  UDup Orth1 OrDup Guniq Gmax Gmin Dupls Singl  dDP  dSI
-----------------------------------------------------------------------------------
aphid    34409   9837 10246  9822  6226  8115  2263  813  134 17937 16472  4.6  1.4
daphnia  30770   9458 10109  9975  6175  4511  2326  594  106 14486 16284  3.7  1.4
nasonia  26141   9442  5968  7146  7468  5559  1253  295  128 12705 13436  3.3  1.1
culex    18779  11110  3833  1927  9158  3861   528  729  112  5788 12991  1.5  1.1
aedes    16004  10810  2208   807  8816  4173   323  899   42  4980 11024  1.3  0.9
ixodes   20311   6965  9824  2986  5712  1789   732  247  558  4775 15536  1.2  1.3
drospse  17677  11271  3372  1832 10009  2464   448  347  121  4296 13381  1.1  1.1
drosmoj  16469  10665  2556  1501  9779  2633   250  218   36  4134 12335  1.1  1.0
tribol   16437   8671  4883  1573  7606  2375   359  245   54  3948 12489  1.0  1.0
bombyx   14517   7661  5141  1108  6870  1398   273  163  221  2506 12011  0.6  1.0
anophe   12607   9298  1760   664  8425  1758   237  210   45  2422 10185  0.6  0.8
apis     14096   8501  4194   669  7726  1507   215  219   86  2176 11920  0.6  1.0
drosmel  14063  11232  1704   364 10451  1544   178  186    2  1908 12155  0.5  1.0
pedicu   10759   7556  2723    88  7213   735    38   70   60   823  9936  0.2  0.8
-----------------------------------------------------------------------------------
*** Putative transposon genes have NOT YET been removed from these counts,
    giving a larger Dupl difference from the ARP1 result
    (esp. ~6000 Nasonia, ~2000 Aphid)

Categories
  nGene    : total gene predictions
  nGroup   : orthomcl gene groups
  ... Uniq1  : unique singleton (no ortholog, no paralog)
  ... UDup   : species-specific paralogs (no ortholog)
  ... Orth1  : has ortholog and no paralog
  ... OrDup  : has ortholog and >= 1 paralog
  Groups
  ... Guniq  : number of species-unique groups
  ... Gmax   : species has max count
  ... Gmin   : species has min count/group
  Dupls    : All duplicates (UDup + OrDup)
  Singl    : All singletons (Uniq1 + Orth1)
  dDP, dSI : Dupls and Singl relative to the Dipteran average
             (dDP = Dupls / 3900, dSI = Singl / 12000)

See also these documents
  table.overgroups.txt  : Summary of over/under abundant gene groups
  table.arp2methods.txt : Recipe of computational methods
  table.arp2sources.txt : Genome sources

MISSING GENES FOUND ANALYSIS
========================================
Species missing from well-conserved gene groups are tested to see whether the
misses are artifacts of gene prediction/combining. This is done with tblastn,
matching the missed proteins from other species against each species' genome,
with a significance cutoff of p <= 1e-5. The results are mapped to GFF.
Newly found proteins are removed if they overlap any existing gene models.
These Lo+Fnd results are significant matches, at un-predicted locations, to the
missed conserved genes. This test doesn't ensure a full, expressible gene at
these locations.
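
The overlap-removal step described above could be sketched roughly as follows.
This is only an illustration, not the script used for ARP2: the Feature tuple,
the function names, and the example coordinates are hypothetical, coordinates
are assumed to be 1-based and inclusive as in GFF, and strand is ignored
(an assumption; the text does not say whether strand was considered).

```python
from collections import defaultdict
from typing import NamedTuple


class Feature(NamedTuple):
    """Hypothetical minimal GFF-like record (1-based, inclusive coordinates)."""
    seqid: str
    start: int
    end: int
    name: str


def overlaps(a: Feature, b: Feature) -> bool:
    # Two features overlap if they share a scaffold and their ranges intersect.
    return a.seqid == b.seqid and a.start <= b.end and b.start <= a.end


def novel_hits(hits, gene_models):
    """Keep only tblastn hits that overlap no existing gene model."""
    by_seq = defaultdict(list)
    for gene in gene_models:
        by_seq[gene.seqid].append(gene)
    return [h for h in hits
            if not any(overlaps(h, g) for g in by_seq[h.seqid])]


# Made-up example data, for illustration only.
genes = [Feature("scaffold_1", 1000, 5000, "geneA")]
hits = [
    Feature("scaffold_1", 4800, 5600, "hit1"),  # overlaps geneA, dropped
    Feature("scaffold_1", 6000, 6900, "hit2"),  # un-predicted location, kept
    Feature("scaffold_2", 1200, 2000, "hit3"),  # scaffold with no models, kept
]
kept = [h.name for h in novel_hits(hits, genes)]
print(kept)
```

A linear scan per scaffold is adequate for a few hundred hits per genome; an
interval index would only matter at much larger scale.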

Species      Found / Missed
--------     --------------
anopheles       42 /  45
aphid           17 / 129
bombyx         143 / 298   ** > 50%
daphnia         14 / 105
drosmoj         13 /  35
ixodes         315 / 550   ** > 50%
nasonia         35 / 128
pediculus        5 /  58

COMMENTS ON THESE GENOMES
===============================================================

* Pediculus (the human parasitic body louse) has an interesting gene set:
  it has single copies of most of the common insect orthologous gene set
  (more than apis, aphid, culex, or nasonia), but almost no paralogs.
  If one wanted an example of the basic / primordial insect gene set,
  Pediculus would be a good choice (maybe along with Tribolium).

* Aphid and Daphnia have about 4 times the Dipteran average number of gene
  duplications (paralogs; dDP of 4.6 and 3.7), the highest among these
  arthropods, as previously found. Of interest, Aphid and Pediculus, at
  opposite extremes in duplicates, are most closely related.

* Ixodes and Bombyx gene sets have artifactually missed a significant number
  of genes. For Ixodes at least, this is explained by a high proportion of
  repetitive/transposon DNA, a challenge for assembly, yielding a fragmented
  genome with genes split across scaffolds. This gene-finding problem is
  compounded by Ixodes having mostly long introns, longer on average than a
  full coding transcript, in contrast to the other arthropods with mostly
  short introns, except for Bombyx (which may also have had gene-finding
  challenges).

BRIEF METHODS
===============================================================
An all-against-all BlastP is performed on these proteins, after removing small
(< 40 aa) predicted proteins. Alternate transcripts were removed after BlastP
matching, in order to use the most similar gene variants; these included 6500
alternate transcripts from Dros. melanogaster, 1300 from Aedes, and fewer than
800 from the others. The similar genes are clustered using the standard
methods outlined for OrthoMCL [Li et al 2003; Chen et al 2007], which can be
summarized this way.
Significance criteria are applied with recommended options: a similarity
P-value <= 1e-05, protein percent identity >= 40%, and MCL inflation of 1.5
(this affects granularity of clustering). Reciprocal best similarity pairs
between species, and reciprocal better similarity pairs within species (i.e.,
recently arisen paralogs, or in-paralogs, proteins that are more similar to
each other within one species than to any protein in the other species) are
added to a similarity matrix. The matrix is normalized by species and
subjected to Markov clustering (MCL; Stijn van Dongen, 2000) to generate
ortholog groups including recent in-paralogs. An additional round of MCL
clustering was applied to link related gene groups.

TRANSPOSON GENE ASSESSMENT
===============================================================
See ARP1/arthropod-orthomcl-v1.txt

CITATIONS
===============================================================
D. Gilbert, OrthoMCL clustering among 14 arthropod proteomes (ARP2).
  http://arthropods.eugenes.org/arthropods/  Dec. 2009, gilbertd@indiana.edu

D. Gilbert, OrthoMCL clustering among 13 arthropod proteomes.
  http://insects.eugenes.org/arthropods/  Aug. 2008, gilbertd@indiana.edu

OrthoMCL: http://www.orthomcl.org/
  Li Li, Christian J. Stoeckert, Jr., and David S. Roos
  OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes.
  Genome Res. 2003 13: 2178-2189.

  Feng Chen, Aaron J. Mackey, Jeroen K. Vermunt, and David S. Roos
  Assessing Performance of Orthology Detection Strategies Applied to
  Eukaryotic Genomes. PLoS ONE 2007 2(4): e383.

=======
EST assemblies
*? add Dap magna here to show D pulex errors are assembly related?
see ~/Desktop/dspp-work/arthropod/arp_est/estmaperr.work.txt
    ~/Desktop/dspp-work/daphwork/dmagna/dmagna-asm24quality.txt

EST mapping errors
   Species       N OK    pOK   pDUP  pMISS
6  aphid       132559   0.76  0.076  0.085
4  bombyx      209897   0.80  0.083  0.030
5  dappulex    115809   0.78  0.052  0.158
2  dapmagna   1211902   0.94  0.019  0.025
1  drosmel     520952   0.95  0.018  0.016
7  ixodes      149662   0.74  0.079  0.102
3  nasvit      156288   0.88  0.030  0.063
pOK = valid map, pDUP = duplicate location, pMISS = missing or low identity

Also see better calcs ~/Desktop/dspp-work/daphwork/dmagna/dmagna24asm.info

EST assemblies summary
DB Name           daphnia   dapmag  drosmel  nasonia
Total EST          166289  1426011   567759   175853
Any alignment      145578  1304846   561200   167823
Valid align        114128  1161824   533435   147382
Assemblies          18211    73525    42618    21865
Subclusters         15827    54436    33329    17847
Comparison              1        2        1        9
Date                 2008     2010     2008     2009
Genes w/ EST        10595     8300     9202     8676
Incorporated         6351     8111     7704     1350
UTR addition         2388     1759     5423     5600
Gene extension        655      862     1062      452
Gene Merging          837     1135     1128      760
Gene Splitting          0        0        0        0
Alt splice           1102     3096     3298     2044
New Gene             3094     6419     5925     6000
Antisense             452      362      190      240
Single exon inc.      908     3818     2049     1317

Percent of Genes w/ EST
Genes_update(%)   aphid bombyx daphnia drosmel ixodes nasonia
Incorporated         58     15      60      84     34      16
UTR addition         27     54      23      59     18      65
Gene extension        9     13       6      12      8       5
Gene merging          4      7       8      12      9       9
Alt splice           13     15      10      36     11      24
Antisense             7      2       4       2      3       3
Single exon.         16     19       9      22      9      15
New gene           5714  18734    3094    5925   6029    6000   ## Count, move up?
##New gene%         119    220      29      64     78      69   << Need N not % here

Assembly_err(%)   aphid bombyx daphnia drosmel ixodes nasonia
Low/0 identity       14     13      22       5     30      12
Duplicates          7.6    8.3     5.2     1.8    7.9     3.0
Split scafs         0.9    0.6     0.3     0.2    2.7     1.0

** this is bogus calc, more assemblies does not mean more genes w/ EST
# Percent of assemblies
# DB Name             daphnia  dapmag drosmel nasonia
# % Incorporated           35      11      18       6
# % UTR addition           13       2      13      26
# % Gene extension          4       1       2       2
# % Gene Merging            5       2       3       3
# % Alt splice              6       4       8       9
# % New Gene               17       9      14      27
# % Antisense               2       0       0       1
# % Single exon inc.        5       5       5       6
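
As a cross-check, the daphnia, drosmel and nasonia columns of the "Percent of
Genes w/ EST" table appear to follow from the counts in the EST assemblies
summary, as count / Genes w/ EST. A minimal sketch of that arithmetic is given
here; the dictionary layout and the helper name percent_of_genes are
illustrative only, and whole-percent rounding is an assumption.

```python
# Counts copied from the "EST assemblies summary" table in this document.
genes_with_est = {"daphnia": 10595, "drosmel": 9202, "nasonia": 8676}
counts = {
    "Incorporated": {"daphnia": 6351, "drosmel": 7704, "nasonia": 1350},
    "UTR addition": {"daphnia": 2388, "drosmel": 5423, "nasonia": 5600},
    "New gene": {"daphnia": 3094, "drosmel": 5925, "nasonia": 6000},
}


def percent_of_genes(category):
    """Hypothetical helper: count as a whole percent of genes w/ EST."""
    return {
        species: round(100 * n / genes_with_est[species])
        for species, n in counts[category].items()
    }


for category in counts:
    print(category, percent_of_genes(category))
```

The "New gene" percentages computed this way match the commented-out
"##New gene%" row for these three species.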