
EvidentialGene 2020 March, version4

There is a major EvidentialGene software update available now, in evigene20mar15.tar at EvidentialGene/other/evigene_old/ and SourceForge.

This includes updates described in this paper, and newer ones. A brief summary of updates is listed in evigene20mar15_updates.txt.
The major new script versions are and, with updates and additions of associated pipeline components. tr2ncrna: a new pipeline that selects the non-coding RNA subset of the input transcripts.

If you have the interest and time to try it, please provide feedback. I've tested these updates extensively, and revised and debugged them over the last several months. However, your uses will turn up some new wrinkles to be smoothed out, and I'll appreciate those comments.

This well-used portion can be used as in prior versions, but has new options, new parts in the reduction algorithm, and some new outputs. Most of the updates here are changes that I've tested and used over several years as separate scripts, and have now folded in.

If you want to test the full Evigene pipeline, which includes reference homology tests and helps one see where the tr2aacds component fits into a fuller reconstruction methodology,
see SRA2Genes here: EvidentialGene/other/sra2genes_testdrive/sra2genes4v_testdrive/

run_plant1kYYPE.txt is a brief explanation of these steps, including assemblies, tr2aacds reduction, blastp reference.aa and busco tests, and tr2ncrna.

Evigene documents that cover some of this include docs/EvigeneR/tr2aacds4_about.txt and docs/EvigeneR/tr2ncrna_about.txt.
I continue to work on documents and examples for this set of updates. In particular, the SRA2Genes omnibus has all the basic parts needed for a good example of how to use and combine each component in a good transcriptome reconstruction pipeline.

Most of the updates are precision improvements, pushing accuracy from 95% to 98%. You won't find large improvements in BUSCO scores for conserved unique genes, because those are the easiest cases and are properly kept by both prior and new versions. The improvements focus on alternates, paralogs, coding/non-coding classes, and other complexities that matter for accurate, complete gene set reconstruction. For instance, there is a new mixed-strand reorient test that recovers occasional conserved proteins that are shorter than the longest computed ORF. The new tr2ncrna pipeline sometimes picks up 1 or 2 conserved ortholog proteins from poor-quality transcript assemblies, at the edge of accuracy, which other methods tend to miss as well.
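
To make the "shorter than the longest computed ORF" case concrete, here is a toy Python sketch that lists ORFs on both strands of one transcript. It is not EvidentialGene code and does not reproduce the reorient test. The function names and the example sequence are made up for illustration, and an ORF is simply taken to run from ATG to the first in-frame stop codon.

```python
# Toy sketch, not EvidentialGene code: all names here are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, strand):
    """Return (strand, start, end, codons) for each ATG..stop ORF in 3 frames.

    Coordinates are 0-based, end-exclusive, and relative to the sequence
    passed in; the codon count includes the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((strand, start, i + 3, (i + 3 - start) // 3))
                start = None
    return orfs


def orfs_both_strands(seq):
    """List ORFs from both strands, longest first.

    Minus-strand coordinates refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    found = find_orfs(seq, "+") + find_orfs(reverse_complement(seq), "-")
    return sorted(found, key=lambda orf: orf[3], reverse=True)


# Made-up transcript: a longer ORF on the plus strand and a shorter ORF
# encoded on the minus strand.
plus_orf = "ATG" + "GCT" * 5 + "TAA"
minus_orf = "ATG" + "AAA" * 2 + "TGA"
transcript = plus_orf + "CC" + reverse_complement(minus_orf)

for orf in orfs_both_strands(transcript):
    print(orf)
```

If only the longest ORF were kept, the minus-strand candidate would be discarded. Keeping such candidates available for a homology check is the general idea this sketch illustrates.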

Effort was spent to ensure that the accuracy of prior versions was not lost through side effects of these "improvements". One result is that this new version introduces a -pHeterozygosity option, to balance improved precision for alternate/paralog models against noise from heterozygous samples. Without such an option, RNA pools of non-isogenic individuals produce many more putative gene loci, alts and paralogs, than isogenic RNA samples do.
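
The trade-off described above can be pictured with a toy Python sketch. This is an assumption-laden illustration, not how Evigene or its -pHeterozygosity option is implemented: the function `count_loci`, its parameters, the 98% threshold, and the identity values are all hypothetical. It only shows that a strict identity cutoff splits allelic variants into separate loci, while a small allowance merges them and still keeps a more divergent paralog apart.

```python
# Toy illustration only, NOT EvidentialGene's algorithm. The names
# count_loci, min_identity and heterozygosity are hypothetical.

def count_loci(n_seqs, pair_identity, min_identity=98.0, heterozygosity=0.0):
    """Count single-linkage clusters ("loci") among n_seqs sequences.

    pair_identity maps (i, j) index pairs to percent identity. Two sequences
    join the same locus when identity >= min_identity - heterozygosity.
    """
    parent = list(range(n_seqs))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    threshold = min_identity - heterozygosity
    for (i, j), identity in pair_identity.items():
        if identity >= threshold:
            parent[find(i)] = find(j)
    return len({find(x) for x in range(n_seqs)})


# Four made-up transcripts: 0 and 1 differ only by allelic variation
# (97.5% identical), 2 is a paralog of 0 (about 90%), 3 is unrelated.
identities = {(0, 1): 97.5, (0, 2): 90.0, (1, 2): 89.5}

strict = count_loci(4, identities)
relaxed = count_loci(4, identities, heterozygosity=1.0)
print(strict, relaxed)
```

In this made-up case the strict cutoff reports one more locus than the relaxed one, because the two alleles are counted separately.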

Don Gilbert, gilbertd at

Developed at the Genome Informatics Lab of Indiana University Biology Department