Daphnia magna gene set status 2014 August, from Don Gilbert, gilbertd at indiana.edu
===========================================================

Prior Daphnia magna gene sets that I produced for the Daphnia magna genome consortium, from 2010 through 2013, including genes modelled on a partly assembled genome and genes assembled from RNA-seq, are incomplete and have significant numbers of aberrations such as joined genes, fragmented genes, and poor coding sequence quality. On the whole, however, the level of orthology completeness is high for these various D. magna gene sets (numbers of gene families shared with arthropods, and size of proteins relative to those shared orthologs). Each new set brought improvements, but a high level of the various errors remained.

From late 2013 through summer 2014 I invested a substantial amount of expertise, time and effort to disentangle the gene artifacts from the biology, and to construct a quality gene information set for this well-studied organism of importance to environmental, toxicological, health and basic biological sciences and applications.

The major remaining problem solved this year has been accurate classification among near-identical gene transcripts: paralog loci, alternate transcripts and artifactual assemblies. It is both a hard problem and an important one to solve. These are also differentially expressed genes, responding to several environmental stressors. This problem holds for many gene families.

The solution I have found is to use mRNA gene information from 3 independent Daphnia magna clones (contributed by XXXX). These produce independently constructed gene and alternate transcript sets, with a high level of agreement for many gene loci. Classification and clustering of gene transcripts across independently produced clone sets provides a basis for confidently discriminating paralogs, alternate transcripts and artifactual assemblies.

A limitation on this project is my need for funds to support this work.
I have contributed substantial effort without salary to this Daphnia magna and related genome information engineering and dissemination, amounting to over $60,000 of salary in 2013-2014. To enable use of this work and future Daphnia genome informatics, I am asking those with research budgets who wish to use these D. magna genes to contribute significant funds to defray my contribution. Those without non-personnel research funding are welcome to freely use this genome information. Those of you with budgets for such work may pay for this as a research expense and in support of wider use by the public. Otherwise I ask you to refrain from use of this new Daphnia magna gene information for 1 year, until 2015 September.

Contact Don Gilbert, email: gilbertd @ indiana.edu, for details about contributing funds for this work.

Some problems remain with this first publicly released Daphnia magna gene set, and I hope to be able to contribute improvements in the coming months and year.

Don Gilbert, 30 July 2014
-------------------------------------------------

* "The quality of a gene set is dependent on the quality of the genome assembly"

This dogma, a quote from one and implicit in many recent genome projects, is now wrong. With today's accurate and inexpensive RNA-seq data, the quality of eukaryote gene sets is more closely related to the quality of mRNA-gene assembly (e.g. http://arthropods.eugenes.org/EvidentialGene/about/EvidentialGene_quality.html). Complete and biologically accurate expressed gene catalogs can now be determined better from mRNA-seq assembly than from gene prediction on genome assemblies. Using both approaches improves genes, but also turns up conflicting gene evidence that makes reaching a higher level of accuracy a difficult problem.

See http://arthropods.eugenes.org/EvidentialGene/about/alntopbars/index.html for comparison of gene set quality for insects, crustacea, ticks, fish and plants.
In all comparable cases, mRNA assemblies have more complete orthology genes than genome-modelled genes.

* Eco-Environ relevant genes remain a hard problem.

Recently evolved genes are the hardest to accurately assemble or model, with duplication and repeat problems, weak or no orthology models, and variable expression (i.e. lower than orthology genes in standard environments, but high in special environments). Such genes are subject to mis-assembly, gene joins, poor genome mapping, gapping, and other gene construction problems.

Environmentally responsive genes are more often recently evolved, species-specific or new paralogs, with functions less well known from orthology-based inference. This finding turned up for Daphnia pulex, and appears with Daphnia magna and Killifish, as well as in other gene x environment studies.

Complex but relevant gene and genome structure and function in Daphnia, and good Daphnia-Human orthology, coupled with existing methods of measuring environmentally responsive genes, reinforce the role of Daphnia as an important model organism for enviro-genomics. Constructing a very accurate Daphnia gene set for this is an important step.

Daphnia magna Gene Orthology x Expression effects
Express   Ortholog  Inparalog  Unique
HS7        9%-      12%        78%+    Cyanobacteria,Carbaryl,Crowding,Fish,Parasite,Triops/HS
ND4       22%-      20%        58%+    Cadmium,Salt,pH5,UVlight/ND
NDxPB     50%+      10%-       40%     Lead/ND
No diff   31%       23%        46%     no differential expression
------------------------------------------------------
Express   = High expression group gene proportions for HS=StressFlea/Helsinki, ND=NotreDame/Pfrender
Ortholog  = has ortholog gene outside Daphnia (insect, arthro, vert)
Unique    = species-specific genes
Inparalog = Inparalogs (new genes), including Daphnia magna+pulex
------------------------------------------------------

My opinion on accurate reconstruction of recent genes is that available gene data is often sufficient, but improved methods are needed for construction and validation.
Developing new methods for inferring or measuring gene functions of these environmentally responsive, recently evolved genes is an important problem for researchers in this area now (Daphnia-philes and others).