# kfish2rae5g_sum.txt
2013-Nov-12++
Killifish, Fundulus heteroclitus genome project
  http://arthropods.eugenes.org/EvidentialGene/killifish/project/
Gene families, Fish orthology: Search at killifish/project/ (FISH11G)

TABLE G2a. Fish species average orthology gene groups
             ---Common Groups---     ----All Groups-----
Species      cBits  aaSize  orMiss   tBits orGroup    Tiny
---------    --------------------    --------------------
killifish      803      50      18     585   17272    1.1%
maylandia      824      45      76     596   16469    1.1%
tilapia        822       6     223     568   14905    1.9%
platyfish      783     -12     118     549   15305    4.7%
zebrafish      711      -9     366     478   15190    4.8%
sticklebk      763     -42     342     509   14343    7.8%
catfish        725      21     729     470   14276    3.4%
medaka         743     -45     654     478   13541    9.7%
tetraodon      732     -50     658     473   13423    7.9%
spotgar         --    -110    2588     329   10882   20.2%
human           --      27      --     395   12606    1.7%
----------------------------------------------------
source: kfish2/prot/fish11c/fish11gor3, Dec11 .. fish11gor3 update
 cBits   = bitscore average for 8656 common fish gene groups
 tBits   = bitscore average for all ortholog groups
 aaSize  = average protein size difference from group median
 orMiss  = missing ortholog groups that are common to other 8 of 9 fish (-gar)
 orGroup = number of ortholog gene groups in species
 Tiny    = percent species gene size outliers below 2sd of group median size
----------------------------------------------------

TABLE G2b.
Fish gene orthology categories (using OrthoMCL)
              ----------- GENES -----------     ------ GROUPS -----  Ortho to Kfish
              nGene  Orlog Inpara  Uniq1   UDup  OrGrp OrMis1 UniGrp Shared    Best
              -----------------------------     -------------------  -------------
killifish     34931  21133   3672   7694   2432  17272     10    682    ---     ---
maylandia     23194  21021    879   1171    122  16469     46     52  15468*   5159
tilapia       21437  19461    975    810    189  14904    135     78  14019     954
platyfish     20366  19483    214    641     29  15307     54     14  14609*  13352
zebrafish     26190  18465   4089   2202   1439  15187    188    226  13286*   1363
stickleback   20787  17954   1254   1396    181  14344    180     44  13317*    636
catfish       43671  17279   2964  15470   7938  14246    407   1561  12799*   1208
medaka        19686  16881    912   1535    361  13542    366    106  12670*   1071
tetraodon     19602  16814    904   1796    176  13425    360     67  12475*    696
spotgar       15734  11841   2655   1009    230  10880   1514     56   9492     116
human         39357  12699   7758  12497   6402  12608    420   2221  11265*   1182
-------------------------------------------------------------------------------
source: kfish2/prot/fish11c/fish11gor3, Dec11 .. gor3 upd Dec27
 nGene        = count of input genes, excludes alternate isoforms/locus.
 Orlog        = Orthologous genes (one-to-one matches among species)
 Inpara       = Inparalogs (recent ortholog duplicates) of orthologous genes
 Uniq1,UDup   = single-copy and duplicated species-unique genes
 OrGrp,UniGrp = orthologous and species-unique groups
 OrMis1       = groups missing in species that all other species have
 Ortho to Kfish:
   Shared  = count of ortho groups shared with killifish
   Best    = count of groups with closest homolog
   Shared* = maximum shared of 10 choices; tilapia shares more with maylandia,
             and spotgar with zebrafish.
------------------------------------------------------------------

TABLE G2c.
Fish Taxonomy with Human gene alignment stats
    Human genes       Fish Taxonomy
-------------------------------------------------------------------------------
Nhuman Ident% Align%
                     Neopterygii
                      + Teleostei
                      + + + Euteleostei
                      + + + + + + Pseudocrenilabrinae
 14822   66    71    :+ + + + + + + + Maylandia zebra  # african cichlid Zebra Mbuna, NCBI : KF2
 14181   65    70    :+ + + + + + + + Oreochromis niloticus  # Tilapia, Ensembl : KF1,2
                      + + + + + Smegmamorpha
 14893   64    70    :+ + + + + + + Gasterosteus aculeatus  # Stickleback, Ensembl? : KF1,2
                      + + + + + + Atherinomorpha
 14478   64    67    :+ + + + + + + + Oryzias latipes  # Medaka, Ensembl? : KF1,2
                      + + + + + + + Cyprinodontiformes
 15033   64    71    :+ + + + + + + + + + Xiphophorus maculatus  # Platyfish, Consortium : KF2
 17072   65    76    :+ + + + + + + + + + Fundulus heteroclitus  # Killifish, Evigene : KF1,2
                      + + + Otocephala
 16127   65    70    :+ + + + + + Ictalurus punctatus  # Catfish, Evigene : KF2
 16871   64    73    :+ + + + + + Danio rerio  # Zebrafish, Consortium/Ensembl : KF1,2
                      + Semionotiformes
 13081   70    54    :+ + Lepisosteus oculatus  # Spotted gar, Draft Ensembl : KF2
-------------------------------------------------------------------------------
for N =26859 human genes, nc=16555 common to 7+ fish for align score

Notes:
-------------
Killifish, Maylandia and Tilapia form a good comparison of gene methods and
results, as top-scored gene sets, recently built by 3 groups with "good" gene
construction pipelines. Some artifacts of methods may be found.
  Tilapia has Ensembl genewise+exonerate models from a mix of RNA-seq and
  UniProt proteins.
  Maylandia (Mayzebr) is annotated with NCBI Gnomon, also from a mix of RNA-seq
  and related proteins.
  Kfish2 is annotated with Evigene, relying strongly on mRNA-assembled genes
  but also using related species proteins.
The other two do use RNA-seq assembly, but not as extensively or carefully, and
rely on mapping to the genome assembly. Ensembl genewise protein mapping has
the potential to add artifacts of homolog models. NCBI Gnomon now uses RNA-seq
more carefully than in the past, and better than Ensembl, I think.

Killifish and Platyfish form another useful comparison, platyfish being the
closest relative, and also a recent genome product built with current data and
software. Differences that can be highlighted:

1. "The quality of a gene set is dependent on the quality of the genome
   assembly" (from Ensembl platyfish gene build document). This can also be
   derived from the methods of the platyfish genome paper (e.g. the methods
   included discarding mRNA assemblies that did not map well to the genome
   assembly). In contrast, killifish genes v2 are not dependent on the quality
   of the genome assembly, merging both mRNA-assembly and genome-mapped methods
   to pick the best set from both.

2. The human gene orthology stats indicate killifish surpasses platyfish in
   completeness of genes (Table G2c: Nhuman 17072 vs 15033, Align% 76 vs 71).

Killifish, Maylandia and Catfish form a third special comparison to other fish
genes. You will find in the Orthology search that these three share more gene
families that are missed in the other fish than any other 3-fish comparison,
by about 100 families. This is, I think, an effect of (a) mRNA assembly
independent of genome genes, used for Killifish and Catfish, and (b) for
Maylandia, NCBI having improved its use of mRNA evidence enough to be roughly
equivalent, discovering genes that may be poorly modelled on the genome
assembly.

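The 3-fish comparison described above can be sketched as a count over a
family-by-species table. This is only an illustration: the input layout
(family id mapped to per-species gene counts), the function name and the toy
data are assumptions made here, not the project's files or code, and the
nine-fish list follows the species named in the Orthology search query (gar
and human left out).

```python
from collections import Counter

# Nine fish compared; spotted gar and human are left out (assumption that
# follows the species named in the project's search query).
FISH = ["Killifish", "Catfish", "Maylandia", "Medaka", "Stickleback",
        "Tetraodon", "Tilapia", "Zebrafish", "Platyfish"]


def trio_exclusive_counts(families):
    """Count families found in exactly three fish and missed in the other six.

    families: dict of family id -> dict of species -> gene count
              (hypothetical layout, not the project's file format).
    Returns a Counter keyed by the alphabetically sorted species trio.
    """
    counts = Counter()
    for genes in families.values():
        present = tuple(sorted(sp for sp in FISH if genes.get(sp, 0) > 0))
        if len(present) == 3:
            counts[present] += 1
    return counts


# Toy input invented for illustration only; these are not project data.
toy = {
    "FAM_A": {"Killifish": 1, "Catfish": 2, "Maylandia": 1},
    "FAM_B": {"Killifish": 1, "Catfish": 1, "Maylandia": 3},
    "FAM_C": {"Medaka": 1, "Tilapia": 1, "Zebrafish": 1},
    "FAM_D": {sp: 1 for sp in FISH},  # present everywhere, so not counted
}

counts = trio_exclusive_counts(toy)
for trio, n in counts.most_common():
    print(n, ", ".join(trio))
```

Ranking the resulting counts across all trios is what would show whether one
trio stands out from the rest.
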
Find families shared by just these 3 fish (killifish, catfish, maylandia):
  http://arthropods.eugenes.org/lucegene_arthropod/search?q=fish11xml-all:geneid+AND+Killifish:[1+TO+999]+AND+Catfish:[1+TO+999]+AND+Maylandia:[1+TO+999]+AND+Medaka:0+AND+Stickleback:0+AND+Tetraodon:0+AND+Tilapia:0+AND+Zebrafish:0+AND+Platyfish:0

One of these is
  http://arthropods.eugenes.org/genepage/fish11xml/FISH11G_G18567

 FISH11G_G18567 : new D-tyrosyl-tRNA(Tyr) deacylase, one of 3 same named
     families, in killifish, catfish and maylandia only,
     maylandia: XP_004554729.1/LOC101478506, 1035 aa

 FISH11G_G1773 : D-tyrosyl-tRNA deacylase member 2, all but killifish,
     various number of genes (Tetraodon has 10)
     maylandia: XP_004554728.1/LOC101478225, 168 aa

     G1773 and G18567 are related in that the shorter G1773 protein aligns to
     the longer G18567; both have the same CDD:202294 domain. In maylandia,
     these are tandem genes. In killifish, the missing shorter one would be
     where a genome gap exists (mRNA assembly may or may not have a partial
     version).

 FISH11G_G5675 : D-tyrosyl-tRNA deacylase member 1, one gene in all 11 species
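
The search URL above follows a regular pattern, so the same query can be
written for any other trio. A minimal sketch: the base URL, the field names
and the [1+TO+999] range are copied from that URL, while the helper name
trio_query_url is made up here, and whether the server accepts other
combinations has not been checked.

```python
# Base URL and query terms are copied from the search link in this document.
BASE = "http://arthropods.eugenes.org/lucegene_arthropod/search"

# Species fields, in the order they appear in that link.
FISH = ["Killifish", "Catfish", "Maylandia", "Medaka", "Stickleback",
        "Tetraodon", "Tilapia", "Zebrafish", "Platyfish"]


def trio_query_url(trio, fish=FISH):
    """Build a search URL for families present in `trio` and absent elsewhere.

    Hypothetical helper: it only assembles the query string and never
    contacts the server.
    """
    unknown = [sp for sp in trio if sp not in fish]
    if unknown:
        raise ValueError("unknown species: " + ", ".join(unknown))
    present = [sp + ":[1+TO+999]" for sp in trio]          # 1 or more genes
    absent = [sp + ":0" for sp in fish if sp not in trio]  # no genes at all
    terms = ["fish11xml-all:geneid"] + present + absent
    return BASE + "?q=" + "+AND+".join(terms)


url = trio_query_url(["Killifish", "Catfish", "Maylandia"])
print(url)
```

For killifish, catfish and maylandia this reproduces the link given above;
swapping in another three names from the list gives the matching query for
that trio.
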