Daphnia magna gene set 201304 status
2014 Jan, from Don Gilbert, gilbertd at indiana.edu

NOTE: Dmagna5x_evigene_2014jan is a pre-release gene set that will be replaced in summer 2014.

0. Prior D. magna gene sets of 2011 (genome-modelled) and 2012 (RNA-seq assembled) are incomplete (missing ortholog families, short/fragmented genes). DMagna project folks can find these at
   http://server7.wfleabase.org/genome/Daphnia_magna/prerelease/gene-predictions/

1. The 2013 gene set is constructed from StressFlea + NotreDame mRNA-seq with EvidentialGene methods. It follows current RNA-seq de-novo assembly methods, adding 20% genome-modelled genes for low expression, orthology models. This gene set has 38,800 loci with 144,000 alternate transcripts. There is a 35% orthology improvement over the 2011 gene set, and ~90% are fully supported by expression.

2. Orthology and expression validations are essentially done. This D. magna gene set has more complete (longer) genes than D. pulex, and about the same number of ortholog gene families.
   [ picture here.. ]
   This D. magna gene set has a bit more human orthology than D. pulex, and more than any other tested arthropod among 9 insects, 4 arachnids, 2 shrimp.

3. The major remaining problem is accurately classifying among paralogs, alternate transcripts and artifactual assemblies. It is both a hard problem and an important one to solve. E.g., I have not yet been able to accurately count and model all the Hemoglobin genes in Daphnia magna, primarily because they are high-identity recent paralogs. These are also differentially expressed genes, responding to several environmental stressors. This problem holds for many gene families.

4. I'm looking for collaborators to publish this gene set, sooner rather than later this year. It appears to be of better quality than the current D. pulex genes in useful respects, and adds a needed good-quality crustacean gene set. A limitation is my pressing need to find funding to support this work.
Strong human-gene orthology, coupled with existing methods of measuring environmentally responsive genes, reinforces the role of Daphnia as an important model organism for enviro-genomics. Having a very accurate Daphnia gene set for this is an important step that I ask your help with.

* "The quality of a gene set is dependent on the quality of the genome assembly"

This dogma, quoted from one recent genome project and implicit in many, is now wrong. With today's accurate and inexpensive RNA-seq data, the quality of eukaryote gene sets is more related to the quality of mRNA-gene assembly (e.g. http://arthropods.eugenes.org/EvidentialGene/about/EvidentialGene_quality.html).

Complete and biologically accurate expressed gene catalogs can now be determined better from mRNA-seq assembly than from gene prediction on genome assemblies. Using both approaches improves genes, but also turns up conflicting gene evidence that makes reaching a higher level of accuracy a difficult problem.

See http://arthropods.eugenes.org/EvidentialGene/about/alntopbars/index.html for a comparison of gene set quality for insects, crustacea, ticks, fish and plants. In all comparable cases, mRNA assemblies have more complete orthology genes than genome-modelled genes.

* Eco-Environ relevant genes remain a hard problem.

- Recently evolved genes are the hardest to accurately assemble or model, with duplication repeat problems, weak or no orthology models, and variable expression (i.e. lower than orthology genes for standard environs, but high for special environs). Such genes are subject to mis-assembly, gene-joins, poor genome mapping, gapping, and other gene construction problems.

- Environmentally responsive genes are more often recently evolved, species-specific or new paralogs, with lesser-known functions inferred from orthology. This finding turned up for Daphnia pulex, and appears with Daphnia magna and Killifish, as well as other gene x environ studies.
Daphnia magna Gene Orthology x Expression effects
Express   Ortholog  Inparalog  Unique
HS7        9%-      12%        78%+    Cyanobacteria,Carbaryl,Crowding,Fish,Parasite,Triops/HS
ND4       22%-      20%        58%+    Cadmium,Salt,pH5,UVlight/ND
NDxPB     50%+      10%-       40%     Lead/ND
No diff   31%       23%        46%     no differential expression
------------------------------------------------------
Express   = High expression group gene proportions for
            HS=StressFlea/Helsinki, ND=NotreDame/Pfrender
Ortholog  = has ortholog gene outside Daphnia (insect, arthro, vert)
Unique    = species-specific genes
Inparalog = Inparalogs (new genes), including Daphnia magna+pulex
------------------------------------------------------

Killifish Gene Orthology x Expression effects
Express     Inparalog  Unique
Adult-Env   31%+       35%+
Embryo       7%-       11%-
No diff     12%        31%
------------------------------------------------------
Express   = High expression group gene proportions for
            Adult-Env = from adult tissues, with environ stressors
            Embryo    = from embryonic tissues, no stressors
Unique    = species-specific genes versus ortholog genes
Inparalog = Inparalogs (new genes) versus orthologs
------------------------------------------------------

My opinion on accurate reconstruction of recent genes is that available gene data is often sufficient, but improved methods are needed for construction and validation.

- Developing new methods for inferring or measuring gene functions of these environmentally responsive, recently evolved genes is an important problem for researchers in this area now (Daphnia-philes and others).