
Index of /EvidentialGene/other/evigene_old/sra2genes_testdrive

      Name                                       Last modified       Size

[DIR] Parent Directory                           03-Dec-2021 14:36      -
[   ] sra2genes_start7test.tar.gz                07-May-2019 23:33  81.5M
[   ] sra2genes_finish7test.tar.gz               08-May-2019 16:49   301M
[TXT] sra2genes_finish7test.list                 08-May-2019 16:52     4k
[TXT] sra2genes_start7test.list                  08-May-2019 16:52     1k
[   ] run_evgsra2genes.sh                        13-May-2019 15:02     3k
[   ] evigene_apps_linux_x86_64_19may14.tar.gz   13-May-2019 15:03   548M
[TXT] evigene_apps_linux_x86_64_19may14.list     13-May-2019 15:10     1k
[   ] evigene19may14.tar                         13-May-2019 15:40   8.0M
[TXT] sra2genes_test.readme.txt                  16-May-2019 14:53     4k
[DIR] sra2genes4v_testdrive/                     04-Dec-2021 15:09      -

 
 SRA2Genes Test Drive, 2019-May
 for transcript assembly input, step7, to public gene set, step10

Here is an update in progress: SRA2Genes, a replacement for the 'tr2aacds' 
component of EvidentialGene.  SRA2Genes does more than tr2aacds, and includes 
it as one step of a full gene set reconstruction pipeline.  tr2aacds reduces a 
large over-assembly of transcripts by using only self-referential 
coding-gene metrics.  That is very useful but also fairly limited and rough, 
in that it uses only the gene evidence from that transcript assembly.
The more complete gene reconstruction pipeline of SRA2Genes brings in external 
gene evidence, notably the wealth of conserved gene information.
Genome biologists should consider using SRA2Genes in place of tr2aacds.  

 Test gene transcript set, Arabidopsis thaliana, Araport gene set of 2016 
  Araport11_genes.201606.mrna/cdna,aa,cds
 from http://eugenes.org/EvidentialGene/plants/arabidopsis/evigene2017_arabidopsis/gene_models/
 source https://www.araport.org/data/araport11/ 

 Setup steps:

  1. fetch software, data from sra2genes_testdrive/
      http://eugenes.org/EvidentialGene/other/sra2genes_testdrive/

    sra2genes_start7test.tar.gz contains starting data sets, trsets/ refset/ and genome/
    evigene19may14.tar has updates to evgpipe_sra2genes.pl and components

  2. Install evigene19may14.tar for updates to sra2genes that fix a few bugs in the usage below:
    cd your-path-to-scripts; tar -xf evigene19may14.tar
    The pre-configured path for this is $HOME/bio/apps/evigene

  2b. Optionally, install component applications for Linux (x86_64) if you use that OS:
     fetch and extract with gtar -zxf evigene_apps_linux_x86_64_19may14.tar.gz
     These fill in $HOME/bio/apps/ with ncbi/bin, exonerate/bin, cdhit/bin, etc.

  3. Unpack starting data set:
     cd your-test-drive-path
     gtar -zxf sra2genes_start7test.tar.gz
     cd sra2genes_start7test

  4. Edit run_evgsra2genes.sh to set PATH for the $HOME/bio/apps programs used by sra2genes.
     Install the needed bioapps (ncbi/bin, exonerate/bin, cdhit/bin).
     See run_evgsra2genes.sh for bioapp sources.
     Linux x86_64 binaries are now provided (see step 2b).

  5. Drive through steps 7 to 10 (public, annotated gene set)
      evgpipe_sra2genes.pl generates a unix shell script for each step that uses some cpu/memory.
      You can run these from the command line, or send them to a cluster batch system, as needed.
      (set ncpu= and maxmem= to what your system has; maxmem is in megabytes of memory)
     Rerun run_evgsra2genes.sh the same way after each "run_sNNN.sh" step, to update the following steps.

  6. Compare your final result to the contents of sra2genes_finish7test.tar.gz
     They should be the same; look at the summary of genes in arath16test.genesum.txt

  7. Re-use with your data sets: replace the contents of sra2genes_start7test 
       in  trsets/input.cdna, refset/refgenes.aa and genome/chrassembly.fa (the genome is not required)
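
 Before starting the test drive, it can help to confirm that the bioapps from
 steps 2b and 4 are actually reachable. The sketch below is only an illustration:
 missing_tools is a hypothetical helper, the $HOME/bio/apps layout follows the
 steps above, and the program names (blastn, makeblastdb, fastanrdb, cd-hit-est)
 are assumptions about what the pipeline calls; check run_evgsra2genes.sh for
 the real list.

```shell
#!/bin/bash
# Print each named program that cannot be found on PATH (hypothetical helper).
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumed layout from steps 2 and 2b: bioapps under $HOME/bio/apps/<name>/bin.
APPS="${APPS:-$HOME/bio/apps}"
PATH="$APPS/ncbi/bin:$APPS/exonerate/bin:$APPS/cdhit/bin:$PATH"
export PATH

# Assumed program names; adjust to what run_evgsra2genes.sh really uses.
missing_tools blastn makeblastdb fastanrdb cd-hit-est
```

 An empty result means every listed program was found; any name printed still
 needs to be installed or added to PATH.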

 Test drive steps:

 env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=2 maxmem=8000 datad=`pwd` ./run_evgsra2genes.sh
 ./run_s7_tr2aacds.arath16test.sh >& log.tr2aa

 env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=2 maxmem=8000 datad=`pwd` ./run_evgsra2genes.sh
 ./run_s8_evgblastp.arath16test.sh >& log.blp

 env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=2 maxmem=8000 datad=`pwd` ./run_evgsra2genes.sh
 ./run_s9a_evgtrimvec.arath16test.sh >& log.vecs

 env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=2 maxmem=8000 datad=`pwd` ./run_evgsra2genes.sh
 ./run_s9b_gmapgenes.arath16test.sh >& log.gmap

 env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=2 maxmem=8000 datad=`pwd` ./run_evgsra2genes.sh
 ./run_s10_evgpubset.arath16test.sh >& log.pub

 env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=2 maxmem=8000 datad=`pwd` ./run_evgsra2genes.sh
 ./run_s11_evgpub2submit.arath16test.sh >& log.sub
   ** run_s11_evgpub2submit fails (missing config) **
 touch Fixme.evgpub2submit

 ./run_s10b_evgclean.arath16test.sh >& clean.log
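
 The regenerate-then-run pattern used for steps 7 to 10 above can be wrapped in
 a small loop. This is a sketch, not part of EvidentialGene: run_step and
 drive_all are hypothetical helper names, the step tags and log names are copied
 from the commands above, and it assumes run_evgsra2genes.sh exits non-zero when
 it fails. Step 11 is left out because it fails on a missing config.

```shell
#!/bin/bash
# Sketch only: regenerate the step scripts, run one step, repeat.
# Values mirror the test drive commands; override them in the environment.
NAME=${NAME:-arath16test}
SPECIES=${SPECIES:-Arabidopsis_thaliana}
NCPU=${NCPU:-2}
MAXMEM=${MAXMEM:-8000}     # megabytes of memory

run_step() {   # usage: run_step <step-tag> <log-file>
  # Refresh the generated run_sNNN scripts for the current state of the data.
  env name="$NAME" species="$SPECIES" runsteps=start7 ncpu="$NCPU" \
      maxmem="$MAXMEM" datad="$PWD" ./run_evgsra2genes.sh || return 1
  step_script="./run_$1.$NAME.sh"
  if [ ! -x "$step_script" ]; then
    echo "missing or non-executable: $step_script" >&2
    return 1
  fi
  # Capture stdout and stderr of the step in its log file.
  "$step_script" > "$2" 2>&1
}

drive_all() {
  # Each entry is <step-tag>:<log-file>; stop at the first step that fails.
  for pair in s7_tr2aacds:log.tr2aa s8_evgblastp:log.blp \
              s9a_evgtrimvec:log.vecs s9b_gmapgenes:log.gmap \
              s10_evgpubset:log.pub; do
    run_step "${pair%%:*}" "${pair##*:}" || {
      echo "stopped at step ${pair%%:*}" >&2
      return 1
    }
  done
}
```

 Run drive_all from inside sra2genes_start7test. On a cluster, the body of
 run_step would submit the step script to the batch system instead of calling it.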

 cd ../
 mv sra2genes_start7test sra2genes_finish7test
 gtar -X sra2genes_finish7test.xclude  -cvf sra2genes_finish7test.tar sra2genes_finish7test
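
 For the comparison in setup step 6, a minimal sketch follows. compare_genesum
 is a hypothetical helper; this readme does not say where arath16test.genesum.txt
 sits inside the result tree, so the sketch assumes one copy per tree and
 locates it with find.

```shell
#!/bin/bash
# Compare the gene summary of your run against the reference result.
# Assumption: each tree holds one arath16test.genesum.txt somewhere below it.
compare_genesum() {   # usage: compare_genesum <your-dir> <reference-dir>
  mine=$(find "$1" -name 'arath16test.genesum.txt' | head -n 1)
  ref=$(find "$2" -name 'arath16test.genesum.txt' | head -n 1)
  if [ -z "$mine" ] || [ -z "$ref" ]; then
    echo "genesum file not found" >&2
    return 2
  fi
  # diff prints the differing lines, if any, and sets the exit status.
  if diff -u "$ref" "$mine"; then
    echo "gene summaries match"
  else
    echo "gene summaries differ" >&2
    return 1
  fi
}
```

 Because your own run directory is renamed to sra2genes_finish7test above,
 unpack the downloaded reference somewhere else first, for example
 mkdir ref; gtar -zxf sra2genes_finish7test.tar.gz -C ref
 and then call compare_genesum sra2genes_finish7test ref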


Developed at the Genome Informatics Lab of Indiana University Biology Department