Dear genomics folks,

Re: EvidentialGene project at http://eugenes.org/EvidentialGene/

EvidentialGene has a high accuracy rate for gene set construction, compared with other gene informatics methods, for fish, plants, and various arthropods. Recently I generated gene sets for two Anopheles mosquito species with Evigene mRNA assembly, and they surpass the recently published* gene sets from the Vectorbase project in orthology completeness, using the same RNA-seq that project reports.

The software pipeline pair of MAKER and Trinity is now a common recipe for genome biologists, who, I suspect, do not realize that greater accuracy is possible and not much harder to obtain. In all cases I have tested, with fishes, plants, and insects, Evigene produces notably more accurate and ortho-complete gene sets. See below for mosquitoes, and http://eugenes.org/EvidentialGene/vertebrates/ for fishes.

The EvidentialGene gene reconstruction methods have been used for several animal and plant genome projects, where they produce gene sets more accurate than those of peer annotation methods. There are basic reasons these methods have high accuracy: careful, complete assembly of the now highly accurate RNA sequences, and extensive use of protein orthology testing to validate (reject or accept) alternate gene constructions.

Assembly of RNA sequences is similar to, but simpler than, assembly of genomic DNA: RNA-seq read sizes are near to gene transcript sizes, and there are no repetitive transposons or problematic intron breaks. Accurate RNA assembly solves problems that exist for traditional genome gene-modelling: artifacts from draft genome assemblies, from modelling prediction algorithms that are not gene-level accurate, and from related-species gene models.

Improvements to the Evigene locus classifier, including a chromosome-assembly map classifier, are producing better discrimination of alternate transcripts versus paralog genes.
I hope to offer an update in coming months that (a) improves gene locus classification (removing some duplication, improving alternate transcript classification), and (b) offers an initial mRNA-assembly by chromosome-assembly classifier (i.e. genome mapping of transcript assemblies).

If you are interested in accurate animal and plant gene-ome construction from RNA sequences, with or without a chromosome assembly, this project may be of interest. I would like to work with a few collaborators who have genome + transcriptome data sets plus genome-modelled gene sets (e.g. from pipelines such as MAKER, NCBI, Augustus, EvidenceModeller, etc.) to compare with EvidentialGene results.

Don Gilbert, 2016.feb

-----------------
* Evigene vs MAKER gene set of doi: 10.1126/science.1258522
  Highly evolvable malaria vectors: the genomes of 16 Anopheles mosquitoes

Protein homology to reference genes, 2 gene sets for 2 species of Anopheles mosquito. For both species, the published RNA-seq was assembled with 4 gene assemblers, then reduced to locus/alternate gene sets with Evigene (roughly 3 days' work). The RNA data sets here were about half the recommended size, so some genes did not assemble properly. With 100+ M read pairs instead of the 50 M provided, the completeness of the Evigene sets would be improved.

Highly conserved REFERENCE (BUSCO drosmel, nr=3038)
         Anopheles-funestus     Anopheles-albimanus
         Evigene    MAKER       Evigene    MAKER
found    99.4%      97.7%       98.3%      97.3%
align    87.3%      83.2%       87.3%      83.2%
best     30%        11.8%       26.5%      12.6%
equal          58%                    61%

Drosophila mel. model REFERENCE (nr=10902)
         Anopheles-funestus     Anopheles-albimanus
         Evigene    MAKER       Evigene    MAKER
found    98.4%      96.1%       95.8%      95.8%
align    87.3%      83.2%       77.5%      76.8%
best     31.6%      15.1%       28.6%      18.6%
equal          58%                    53%

Anopheles gambiae REFERENCE (tr total=14870, locus total=12994)
         Anopheles-funestus     Anopheles-albimanus
         Evigene    MAKER       Evigene    MAKER
found    97.9%      96.6%       94.7%      96.3%
align    93.1%      89.3%       86.4%      87.5%
best     33.9%      16%         30.7%      21.2%
equal          50%                    48%
---------------------------------------------------
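
For readers who want to run a similar comparison on their own gene sets, here is a minimal sketch of how found/best/equal percentages might be tallied. It is not the Evigene code, and the row meanings are my assumptions, since this message does not define them: "found" is taken as the share of reference genes with any hit in a gene set, "best" as the share where that set's hit outscores the other set's, and "equal" as the share where both score the same. The function name `tally_homology` and its inputs (per-reference best-hit scores, e.g. bit scores keyed by reference gene ID) are hypothetical. The "align" row is left out because its definition is not clear from the tables.

```python
from typing import Dict, Mapping


def tally_homology(
    scores_a: Mapping[str, float],
    scores_b: Mapping[str, float],
    n_ref: int,
) -> Dict[str, float]:
    """Summarize two gene sets against one reference gene set.

    scores_a / scores_b map a reference gene ID to the best hit score
    (hypothetical input, e.g. a bit score) found in gene set A or B.
    A missing ID, or a score of 0, means no hit. n_ref is the total
    number of reference genes. All results are percentages of n_ref.
    """
    if n_ref <= 0:
        raise ValueError("n_ref must be positive")

    def pct(count: int) -> float:
        return round(100.0 * count / n_ref, 1)

    # Assumed meaning of "found": reference genes with any hit in the set.
    found_a = sum(1 for s in scores_a.values() if s > 0)
    found_b = sum(1 for s in scores_b.values() if s > 0)

    best_a = best_b = equal = 0
    for ref_id in set(scores_a) | set(scores_b):
        a = scores_a.get(ref_id, 0.0)
        b = scores_b.get(ref_id, 0.0)
        if a <= 0 and b <= 0:
            continue  # neither set has a hit; counts toward no category
        if a > b:
            best_a += 1
        elif b > a:
            best_b += 1
        else:
            equal += 1

    return {
        "found_a": pct(found_a),
        "found_b": pct(found_b),
        "best_a": pct(best_a),
        "best_b": pct(best_b),
        "equal": pct(equal),
    }
```

Under these assumptions, best_a + best_b + equal is the share of reference genes found by at least one of the two sets, which gives a quick sanity check on a tally.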