2012 February

I'm in the process of building and testing a non-insect-centric arthropod orthology gene set that includes some transcript assemblies for crustaceans and ticks/Chelicerata. I hope to make these publicly available this spring. They come from already published data, though the protein genes are hard to find for some; the exception is the Daphnia magna set, which is not yet public. I think these will help to place genes of the rest of the arthropods. A few of the transcript data sets may be too poor to use (I get only gene fragments, missing many orthologs), except maybe to pick out a small subset of full genes. It currently appears that the Chelicerata sets (now ixodes + spidermite, and a very poor dog tick assembly) are all artifactually incomplete, but the union of these may provide a more complete Chelicerata orthology set.
- Don Gilbert

======= orthomcl gene groups draft 2 ================
aax13u2omcl/Feb_18

Clade presence for gene families, ncommon gene=24565
Clade     Only   Miss  OutAny  OutOnly  OutMiss
Crust       30    538    7033      165      183
Tick        61   1824    6272       48      566
inSect     300   3175    6533      114      324
----------------------------------------------
OutAny  = 1+ species in clade has outgroup gene family
OutOnly = 2+ species in clade have outgroup gene family, none of other clades have
OutMiss = no species in clade has outgroup, both other clades have
Only    = all species in clade have family, none of other clades have
Miss    = no species in clade has family, both other clades have

Common gene families presence, ncommon=5416, min taxa=9
Species   Have   Miss
tribol    5223    193
daphmag   5202    214
wasp      5192    224
daphplx   5167    249
drosmel   5157    259
human     5110    306
zfish     5107    309
aphid     5051    365
ixodes    4777    639   # so-so incomplete geneset
tetur     4538    878   # so-so incomplete geneset
shrimp    4382   1034   # weak,incomplete gene set
barnacl   3466   1950   # very poor gene set
dogtick   2328   3088   # very poor gene set
-----------------------

species   inGene   oGene  nGroup  Uniq1   UDup  Orth1  OrDup  OrGrp  OrMis1  Guniq  Gmax  Gmin
# Crustaceans
daphmag   109520   38517   20697     na  20065   9394   9058  12823       1   7874  1519     1
daphplx    49018   32007   15051     na  13549   9341   9117  11767      10   3284  1120    10
shrimp     73151   20752   11475     na  11430   6261   3061   7481      56   3994   383    55
  -- shrimp weak gene set, partial prots, could assemble more (75bp paired rnaseq)
barnacl   151549  117377   21574     na  89923   4144  23310   7910    231<  13664  2660   229
  -- barnacle very poor, partial gene set from 454 ESTs
# Ticks/Chelicerata
ixodes     20249   12061    8733     na   2146   7352   2563   8081      97    652   238    98
tetur      17072   11230    6694     na   4027   5218   1985   5940      99    754   174   100
  -- both ixodes, tetur are so-so incomplete genesets, union may give Chelicerata common genes
dogtick   121372   58885   15888     na  49157   4529   5199   6016    803<   9872   764   795
  -- dogtick very poor, partial gene set from 454 ESTs
# Insects
aphid      31962   25109    9589     na   9463   6702   8944   7975      29   1614   494    29
  # add Locust, near aphid, may be full gene set from transcript assembly (30477 uniq prots)
drosmel    14289   11535    8406     na   2513   6869   2153   7588      17    818   188    18
tribol     16985   12519    8846     na   2190   7502   2827   8346      17    500   223    17
wasp       24296   18672    9504     na   7616   7359   3697   8166      18   1338   243    18
# Outgroups
human      21830   18845   11800     na   2299   8223   8323  11063      30    737  1020     7
zfish      24150   19971   11736     na   3045   8066   8860  11144      46    592  1298    10
------------------------------------------------------------------------------------------------------
nGroup = number of gene family groups (2+genes), orthology + species-unique
OrGrp  = count of ortho groups (nGroup = OrGrp + unique paralog groups)
Uniq1  = species-unique single gene
UDup   = species-unique duplicated paralog genes
Orth1  = count of single ortho gene
OrDup  = count of duplicated ortho gene
OrMis1 = groups missing gene all others have (ignoring human)

======= Data sources for arpx13 gene groups ================
inSect/
  aphid2bo3all_cd.aa.gz      : n 31963 ; pea aphid 2 best-of-3 gene set
  trica.noalt.aa.gz          : n 16985 ; uniprot 2011 tribolium
  wasp_ogs2.noalt.aa.gz      : n 24296 ; nasonia ogs2 (2012)
  drosmel.noalt.aa.gz        : n 14289 ; drosophila mel (ncbi refseq/flybase)
  add Locust near aphid, rna assembly may be complete set

Crustacean/
  daphmag2_estvel5asm.aa.gz  : n 109520 : not public yet
  daphplx_evg10jgi6cd.aa.gz  : n 49018 : daphnia pulex JGI2007+Evigene2010 (has some dups to remove)
  shrimpv1asm_cd.aa.gz       : n 73151 : shrimp Pandalus latirostris; useful assembly but partial genes, missing more than should
  barnacle3vela_cdtop.aa.gz  : n 151549 : acorn barnacle Balanus amphitrite; drop: very poor transcript assembly, fragmented, not useful

Tick/Chelicerata
  ixodes2011v11_cd.aa.gz     : n 20249 : ixodes and tetur spidermite both are partial gene sets
  mites_tetur_cd.aa.gz       : n 17072 : Tetranychus urticae; "
      union of both may give Chelicerata common gene set
  dogtick_velcap1asm_cd.aa   : n 121372 : Dermacentor variabilis; drop: very poor transcript assembly, fragmented, not useful

Outgroup/
  human_ncbi_cd.aa.gz        : n 21830
  zfish_ncbi_cd.aa.gz        : n 24150 Danio rerio
  maybe add water bear, Tardigrada in Panarthropoda, 1 mill 454 EST reads, but looks to be poor/partial gene set

===== protein size distribution =============
n = number of unique proteins found (cd-hit 90%)
n1000 = number >= 1000aa, n500 = no. >= 500aa; aveaa= average; maxaa= max size

human_ncbi_cd          : n=21830; n1000=2682; n500=8947; aveaa=573; maxaa=33423
zfish_ncbi_cd          : n=24150; n1000=2628; n500=9095; aveaa=538; maxaa=32757
daphmag_estvel5asm     : n=109520; n1000=1009; n500=5000; aveaa=146; maxaa=6299
daphplx_evg10jgi6cd    : n=49018; n1000=2601; n500=9954; aveaa=350; maxaa=7809
aphid2bo3all_cd        : n=31963; n1000=2274; n500=8954; aveaa=419; maxaa=20627
drosmel.noalt          : n=14289; n1000=1508; n500=5268; aveaa=521; maxaa=22971
trica.noalt            : n=16985; n1000=1346; n500=4824; aveaa=447; maxaa=21117
wasp.noalt             : n=24296; n1000=1464; n500=5698; aveaa=385; maxaa=16711
locust1vel_cd          : n=88929; n1000=1222; n500=4568; aveaa=141; maxaa=8523  # looks ok, from rna asm
# .. poor gene sets from genome projects ..
mites_tetur_cd         : n=17072; n1000=881; n500=3789; aveaa=361; maxaa=18253
ixodes2011v11_cd       : n=20249; n1000=549; n500=2738; aveaa=285; maxaa=4588
# .. poor gene sets from EST/rna assembly ..
shrimpv1asm_cd         : n=73151; n1000=349; n500=2441; aveaa=116; maxaa=3206
# .. very poor gene sets from EST/rna assembly .. too poor to use
dogtick_velcap1asm_cd  : n=121372; n1000=17; n500=193; aveaa=87; maxaa=1479
dogtickcap5merg_cd.aa    n=22187; n1000=12; n500=138; aveaa=106; maxaa=1485  # better than below, but not great
   # merge of cap3 + mira EST454 asms; plus 30k singletons may include useful genes.
barnacle3vela_cdtop    : n=151549; n1000=4; n500=18; aveaa=143; maxaa=3345  # velvet/oases
barnacle3cap.aa          n=33252; n1000=0; n500=0; aveaa=82; maxaa=359  # cap
barnaclemira2adcylv.aa   n=63146; n1000=0; n500=0; aveaa=78; maxaa=300  # mira: no improvement
#.. ^dgg assembled, give up, cannot get useful gene set from this data
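
The size-distribution columns above (n, n1000, n500, aveaa, maxaa) could be recomputed from any of the .aa protein FASTA files with something like the sketch below. This is not the script used to produce the table. The function names are hypothetical, the input is assumed to be already clustered (e.g. by cd-hit at 90%) so that n is simply the record count, '*' stop symbols are assumed not to count as residues, and aveaa is rounded to the nearest integer, which may differ from how the table values were rounded.

```python
import gzip
from typing import Dict, Iterable, Iterator


def read_fasta_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the residue count of each record in FASTA-formatted lines."""
    length = None  # stays None until the first '>' header is seen
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if length is not None:
                yield length  # finish the previous record
            length = 0
        elif length is not None:
            # Assumption: '*' stop symbols are not counted as residues.
            length += len(line.replace("*", ""))
    if length is not None:
        yield length


def protein_size_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Summarize protein lengths using the same columns as the table."""
    lengths = list(read_fasta_lengths(lines))
    n = len(lengths)
    return {
        "n": n,
        "n1000": sum(1 for x in lengths if x >= 1000),
        "n500": sum(1 for x in lengths if x >= 500),
        # Assumption: average is rounded to the nearest integer.
        "aveaa": round(sum(lengths) / n) if n else 0,
        "maxaa": max(lengths, default=0),
    }


def format_stats(name: str, stats: Dict[str, int]) -> str:
    """Render one row in the 'name : n=..; n1000=..' layout."""
    return ("{} : n={n}; n1000={n1000}; n500={n500}; "
            "aveaa={aveaa}; maxaa={maxaa}").format(name, **stats)


def stats_for_file(path: str) -> Dict[str, int]:
    """Read a plain or gzip-compressed FASTA file and summarize it."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return protein_size_stats(handle)
```

For example, `print(format_stats("ixodes2011v11_cd", stats_for_file("ixodes2011v11_cd.aa.gz")))` would print one row in the layout above, assuming that file is in the working directory.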