There is a major EvidentialGene software update available now, in evigene20mar15.tar
at
EvidentialGene/other/evigene_old/
and SourceForge.
This includes updates described in this paper, and newer ones.
A brief summary of updates is listed in
evigene20mar15_updates.txt.
The major new script versions are
tr2aacds4.pl and evgpipe_sra2genes4v.pl, with updates and additions of associated pipeline components.
tr2ncrna.pl is a new pipeline that selects the non-coding RNA subset of input transcripts.
If you have the interest and time to try it, please provide feedback. I've tested these updates extensively, and have revised and debugged them over the last several months. However, your uses will turn up some new wrinkles to be smoothed out, and I'll appreciate those comments.
The well-used portion, tr2aacds4.pl, can be used as in prior versions, but has new options, new reduction algorithm parts, and some new outputs. Most of the updates here are folded-in changes that I've tested and used over several years as separate scripts.
If you want to test the full Evigene pipeline, which includes reference homology tests and shows where the tr2aacds component fits into a fuller reconstruction methodology,
see the SRA2Genes test drive here:
EvidentialGene/other/sra2genes_testdrive/sra2genes4v_testdrive/
run_plant1kYYPE.txt is a brief explanation of these steps, including assemblies, tr2aacds reduction, blastp reference.aa and busco tests, and tr2ncrna.
Evigene docs with some of this include:
docs/EvigeneR/tr2aacds4_about.txt, and
docs/EvigeneR/tr2ncrna_about.txt
I am continuing to work on documents and examples for this set of updates. In particular, the SRA2Genes omnibus has all the basic parts needed for a good example of how to use and combine each component in a transcriptome reconstruction pipeline.
Most of the updates are precision improvements, pushing from 95% to 98% accuracy. You won't find large improvements in BUSCO scores for conserved unique genes, because those genes are the easiest to reconstruct and are properly kept by both prior and new versions. The improvements focus on alternates, paralogs, coding/non-coding classes, and other complexities of accurate, complete gene set reconstruction. For instance, there is a new mixed-strand reorient test that recovers occasional conserved proteins that are shorter than the longest computed ORF. The new tr2ncrna pipeline sometimes picks up 1 or 2 conserved ortholog proteins from poor-quality transcript assemblies, at the edge of accuracy, which other methods also tend to miss.
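To illustrate why a strand test can matter, here is a minimal Python sketch, not Evigene code, that finds the longest complete ORF (ATG to stop) on each strand of a transcript. The function names and the toy sequence are hypothetical, and the sketch assumes the standard genetic code and ignores partial ORFs. It only shows that a transcript can carry a longer ORF on one strand and a shorter one on the other, so choosing by length alone may discard the shorter one even when it is the conserved protein.

```python
# Illustrative sketch only; this is not how tr2aacds4.pl is implemented.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq):
    """Longest complete ATG..stop ORF on the given strand.

    Returns (codons, start, frame); codons excludes the stop codon.
    Returns (0, -1, -1) when no complete ORF is found.
    """
    seq = seq.upper()
    best = (0, -1, -1)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOPS:
                codons = (i - start) // 3
                if codons > best[0]:
                    best = (codons, start, frame)
                start = None  # close the ORF, keep scanning this frame
    return best


def orfs_by_strand(seq):
    """Longest complete ORF on the forward (+) and reverse (-) strands."""
    return {"+": longest_orf(seq), "-": longest_orf(reverse_complement(seq))}


# Toy transcript: a 6-codon ORF on the + strand, a 3-codon ORF on the - strand.
plus_orf = "ATG" + "GCC" * 5 + "TAA"
minus_orf = "ATG" + "AAA" * 2 + "TGA"
toy = plus_orf + reverse_complement(minus_orf)
result = orfs_by_strand(toy)
print(result)
```

In this toy case a longest-ORF rule would keep the + strand ORF; a reorient test in the spirit described above would also weigh the - strand ORF, for example by its homology to reference proteins.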
Effort was spent to ensure that the accuracy of prior versions was not lost to side effects of the "improvements". One result is that this new version introduces a -pHeterozygosity option, to balance improved precision for alternate/paralog models against noise from heterozygous samples. Without such an option, RNA pools of non-isogenic individuals produce many more putative gene loci, alts and paralogs than isogenic RNA samples do.