SRA2Genes Test Drive, 2019-May
  for transcript assembly input, step7, to public gene set, step10

Here is an update in progress, a replacement for the 'tr2aacds' component
of EvidentialGene. SRA2Genes does more than tr2aacds, which it includes as
one part of a full gene set reconstruction pipeline. tr2aacds reduces a
large over-assembly of transcripts by using only self-referential
coding-gene metrics. That is very useful but also fairly limited and rough,
in that it uses only the gene evidence from that transcript assembly. The
more complete gene reconstruction pipeline of SRA2Genes brings in external
gene evidence, notably the wealth of conserved gene information. Genome
biologists should consider using SRA2Genes in place of tr2aacds.

Test gene transcript set: Arabidopsis thaliana, Araport gene set of 2016
  Araport11_genes.201606.mrna/cdna,aa,cds from
  http://eugenes.org/EvidentialGene/plants/arabidopsis/evigene2017_arabidopsis/gene_models/
  source https://www.araport.org/data/araport11/

Setup steps:

1. Fetch software and data from sra2genes_testdrive/
     http://eugenes.org/EvidentialGene/other/sra2genes_testdrive/
   sra2genes_start7test.tar.gz contains starting data sets, trsets/ refset/ and genome/
   evigene19may14.tar has updates to evgpipe_sra2genes.pl and components

2. Install evigene19may14.tar for updates to sra2genes that cure a few bugs
   in the following usage:
     cd your-path-to-scripts; tar -xf evigene19may14.tar
   The pre-configured path for this is $HOME/bio/apps/evigene

2b. Install component applications for Linux OS (x86_64), if you use that
   and want them: fetch and extract with
     gtar -zxf evigene_apps_linux_x86_64_19may14.tar.gz
   These fill in $HOME/bio/apps/ with ncbi/bin, exonerate/bin, cdhit/bin, etc.

3. Unpack starting data set:
     cd your-test-drive-path
     gtar -zxf sra2genes_start7test.tar.gz
     cd sra2genes_start7test

4. Edit run_evgsra2genes.sh to set PATH for $HOME/bio/apps used by sra2genes.
   Install needed bioapps (ncbi/bin, exonerate/bin, cdhit/bin).
   See run_evgsra2genes.sh for bioapp sources; Linux x86_64 binaries are
   now provided (step 2b).

5. Drive through steps 7 to 10 (public, annotated gene set).
   evgpipe_sra2genes.pl generates a unix shell script for each step that
   uses some cpu/memory. You can run these from the command line, or send
   them to a cluster batch system, as needed.
   (set ncpu= maxmem= for what your system has; maxmem= is megabytes of memory)
   Rerun run_evgsra2genes.sh the same way after each "run_sNNN.sh" step, to
   update the following steps.

6. Compare your final result to the contents of sra2genes_finish7test.tar
   They should be the same; look at the summary of genes in
   arath16test.genesum.txt

7. Re-use with your data sets: replace the contents of sra2genes_start7test in
   trsets/input.cdna, refset/refgenes.aa and genome/chrassembly.fa (not required)

Test drive steps:

  env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=2 maxmem=8000 datad=`pwd` ./run_evgsra2genes.sh
  ./run_s7_tr2aacds.arath16test.sh >& log.tr2aa

  env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=2 maxmem=8000 datad=`pwd` ./run_evgsra2genes.sh
  ./run_s8_evgblastp.arath16test.sh >& log.blp

  env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=2 maxmem=8000 datad=`pwd` ./run_evgsra2genes.sh
  ./run_s9a_evgtrimvec.arath16test.sh >& log.vecs

  env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=2 maxmem=8000 datad=`pwd` ./run_evgsra2genes.sh
  ./run_s9b_gmapgenes.arath16test.sh >& log.gmap

  env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=2 maxmem=8000 datad=`pwd` ./run_evgsra2genes.sh
  ./run_s10_evgpubset.arath16test.sh >& log.pub

  env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=2 maxmem=8000 datad=`pwd` ./run_evgsra2genes.sh
  ./run_s11_evgpub2submit.arath16test.sh >& log.sub
  ** run_s11_evgpub2submit fails (missing config) **
  touch Fixme.evgpub2submit

  ./run_s10b_evgclean.arath16test.sh >& clean.log

  cd ../
  mv sra2genes_start7test sra2genes_finish7test
  gtar -X sra2genes_finish7test.xclude -cvf sra2genes_finish7test.tar sra2genes_finish7test
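
The test drive repeats one pattern: rerun run_evgsra2genes.sh, then run the step script it generated. As a convenience, that pattern could be wrapped in a small shell function. This is only a sketch, not part of EvidentialGene: the function name run_steps and the log.<step> log names are made up here (the commands above use log.tr2aa, log.blp, and so on), and it assumes each generated script is named run_<step>.<name>.sh, as in the test drive. It uses POSIX `> file 2>&1` in place of `>&`.

```shell
#!/bin/sh
# Hypothetical helper (not part of EvidentialGene): run the test-drive
# steps in order, rerunning run_evgsra2genes.sh before each step so that
# the next step script is generated or updated.
# Usage: run_steps NAME STEP [STEP ...]
run_steps() {
  name=$1
  shift
  for step in "$@"; do
    # Same settings as the test drive; override ncpu/maxmem from the caller's environment.
    env name="$name" species=Arabidopsis_thaliana runsteps=start7 \
        ncpu="${ncpu:-2}" maxmem="${maxmem:-8000}" datad="$(pwd)" \
        ./run_evgsra2genes.sh || return 1

    # Assumed naming pattern, taken from the commands in the test drive.
    script="./run_${step}.${name}.sh"
    if [ ! -f "$script" ]; then
      echo "missing step script: $script" >&2
      return 1
    fi

    # Stop at the first failing step, keeping its log for inspection.
    "$script" > "log.${step}" 2>&1 || {
      echo "step failed: $script (see log.${step})" >&2
      return 1
    }
  done
}
```

Called from inside sra2genes_start7test, a run through the public gene set step might look like:
  run_steps arath16test s7_tr2aacds s8_evgblastp s9a_evgtrimvec s9b_gmapgenes s10_evgpubset
Stopping at the first failure is a deliberate choice here, since each step depends on outputs of the one before it.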