
Index of /EvidentialGene/vertebrates/pig/pig18evigene/evgmethods

      Name                       Last modified       Size

[DIR] Parent Directory           17-May-2019 15:32   -
[TXT] evgsra2genepipe_help.txt   16-Aug-2018 14:48   6k
[DIR] runscripts/                08-Dec-2018 15:01   -

EvidentialGene sra2genes DRAFT VERSION 2017.12.07
  omnibus pipe for evigene methods, from SRA RNA-seq data to annotated public gene set
Usage: -SRAtable=myspecies_sra.csv | -SRAids=SRRnnnn,SRRmmmm 
opts: -help -runname MyProjectXXX -nCPU=0 -idprefix Thecc1EG ..  
      -runstep 1,2,3,4..10  -log -dryrun -debug 

  *** EARLY DRAFT VERSION, Expect problems ***
Current pipeline design: 
    Process SRA RNA-seq data into a finished, annotated gene set, in steps, using existing, 
    tested Evigene methods.  Compute-intensive steps are run
    asynchronously: the pipeline generates cluster-ready shell scripts that you then submit to your
    cluster batch queue.  These steps include runassemblers, reduceassemblies, refblastgenes.
    See '', an example cluster script that calls this omnibus pipe.  It sets
    paths to component software (assemblers, NCBI tools, others) that you must adjust.
    After these cluster runs finish, rerun this pipeline to proceed to the next steps.  E.g.

evgpipe STEPs 1..4:
  env sratable=daphsim16huau_srarna.csv  name=dapsim_sra2evg  datad=`pwd` prog=./ sbatch

ASYNC run assemblers (~ 8 hr each)
  env ncpu=8  datad=`pwd` prog=./ sbatch
  env ncpu=12  datad=`pwd` prog=./ sbatch

evgpipe STEPs 5..7:
  env sratable=daphsim16huau_srarna.csv  name=dapsim_sra2evg  datad=`pwd` prog=./ sbatch

ASYNC run assembly reduction to genes (~ 2 hr)
  env ncpu=20 maxmem=120000 prog=./ datad=`pwd` sbatch

evgpipe STEPs 8..9:
  env REFAA=refset/refarp7s10fset1.aa  sratable=daphsim16huau_srarna.csv  name=dapsim_sra2evg  datad=`pwd` prog=./ sbatch

ASYNC run blastp (~ 8 hr)
  env ncpu=20 maxmem=120000 prog=./ datad=`pwd` sbatch

evgpipe STEP 10:
  env sratable=daphsim16huau_srarna.csv  name=dapsim_sra2evg  datad=`pwd` prog=./ sbatch
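
The alternation shown above (rerun the omnibus pipe, then submit a generated cluster job, then rerun) can be summarized as a small table. This is only a sketch for keeping track of where a project stands: the stage list is copied from the example above, while the names STAGES, next_stage and describe are hypothetical and not part of Evigene.

```python
# Stage order as listed in the example run above; labels are descriptive only.
# "evgpipe" = rerun the omnibus pipe, "cluster" = submit a generated batch script.
STAGES = [
    ("evgpipe", "STEPs 1..4"),
    ("cluster", "run assemblers (~8 hr each)"),
    ("evgpipe", "STEPs 5..7"),
    ("cluster", "assembly reduction to genes (~2 hr)"),
    ("evgpipe", "STEPs 8..9 (with REFAA set)"),
    ("cluster", "blastp (~8 hr)"),
    ("evgpipe", "STEP 10"),
]


def next_stage(n_completed):
    """Return (kind, label) of the next stage, or None when all stages are done."""
    if n_completed < 0:
        raise ValueError("n_completed must be >= 0")
    if n_completed >= len(STAGES):
        return None
    return STAGES[n_completed]


def describe(n_completed):
    """Return a one-line reminder of what to run next."""
    stage = next_stage(n_completed)
    if stage is None:
        return "all stages finished"
    kind, label = stage
    if kind == "evgpipe":
        return f"rerun the omnibus pipe for {label}"
    return f"submit the generated cluster script: {label}"
```

For instance, describe(1) reports that the assembler jobs are the next thing to submit.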

STEPS in pipeline (will change)
  STEP2_sra2fasta STEP2a_sra2spot STEP2b_pairfa
  STEP5_collectassemblies STEP5b_qualassemblies (option)
  STEP9_annotgenes   STEP9a_trimvec  STEP9c_contamcheck STEP9b_consdomains
  More details via 'pod2man | nroff -man |less' 
Component applications currently used on PATH:
  app=fastq-dump, path=/bio/sratoolkit/sratoolkit281/bin/fastq-dump
  app=blastn, path=/bio/ncbi/bin/blastn
  app=cd-hit-est, path=/bio/cdhit466/bin/cd-hit-est
  app=fastanrdb, path=/bio/exonerate/bin/fastanrdb, path=/bio/khmer/scripts/
  app=vecscreen, path=/bio/ncbi/bin/vecscreen
  app=velveth, oases, path=/bio/velvet1210/bin4/velveth
  app=idba_tran, path=/bio/idba/bin/idba_tran
  app=SOAPdenovo-Trans-127mer, path=/bio/soaptrans103/SOAPdenovo-Trans-127mer
  app=Trinity, path=/bio/trinity/Trinity
  data=UniVec, path=
      Pipeline will work without some of these, eg assemblers.
      sratoolkit: need current v281+ for web fetch by SRR id
      velvet: fixme multi kmer binaries, bin4 = 151mer; bin2 = 99mer
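
Before a run it can help to confirm which of these applications are actually on PATH. The following is a minimal sketch using Python's standard shutil.which; it is not how the pipeline itself locates software. The app names come from the list above, and the split into CORE_APPS and ASSEMBLERS is an assumption based on the note that the pipeline works without some of them.

```python
import shutil

# App names copied from the list above. The grouping is an assumption:
# the help text only says the pipeline works without some apps, eg assemblers.
CORE_APPS = ["fastq-dump", "blastn", "cd-hit-est", "fastanrdb", "vecscreen"]
ASSEMBLERS = ["velveth", "idba_tran", "SOAPdenovo-Trans-127mer", "Trinity"]


def find_apps(names, which=shutil.which):
    """Map each app name to its full path on PATH, or None if not found."""
    return {name: which(name) for name in names}


def report(found):
    """Format results in the same 'app=..., path=...' style as the list above."""
    return [f"app={name}, path={path if path else 'MISSING'}"
            for name, path in found.items()]


if __name__ == "__main__":
    for line in report(find_apps(CORE_APPS + ASSEMBLERS)):
        print(line)
```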
INPUT  -SRAtable=myspecies_sra.csv is NCBI SraRunInfo.csv, 2017 format
    from ( Send TO File, Format RunInfo)
The expected sra.csv input format may change; use of -SRAids alone is to be enabled.
Currently requires an NCBI sratoolkit/fastq-dump that has web-fetch of data by SRA id enabled.
That will become one option; the others will be to fetch SRA/ENA data yourself, or to supply RNA-read-pairs.fasta/fastq
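
To see which runs an SraRunInfo.csv table covers, or to build a value for -SRAids from it, the run accessions can be read from the table's "Run" column. This is a minimal sketch, not the pipeline's own parser; it assumes the RunInfo columns "Run" and "LibraryLayout", and the accessions in SAMPLE are made-up placeholders.

```python
import csv
import io

# Placeholder table with only the columns this sketch uses; real RunInfo
# files have many more columns, and these accessions are invented.
SAMPLE = (
    "Run,spots,LibraryLayout\n"
    "SRR0000001,100,PAIRED\n"
    "SRR0000002,50,SINGLE\n"
)


def read_run_ids(handle, layout=None):
    """Collect 'Run' values, optionally keeping only one LibraryLayout."""
    ids = []
    for row in csv.DictReader(handle):
        run = (row.get("Run") or "").strip()
        # Skip blank rows and header lines repeated inside the file.
        if not run or run == "Run":
            continue
        row_layout = (row.get("LibraryLayout") or "").strip().upper()
        if layout and row_layout != layout.upper():
            continue
        ids.append(run)
    return ids


if __name__ == "__main__":
    ids = read_run_ids(io.StringIO(SAMPLE), layout="PAIRED")
    print("-SRAids=" + ",".join(ids))
```

With a real table, pass an open file handle, for example open("myspecies_sra.csv", newline=""), in place of the StringIO sample.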
Layout of project directory:
  spotfa:	 1. SRA spot (joined read pairs) files, from fastq-dump of SRAids
  pairfa:	 2. unjoined read pair files, _1.fa and _2.fa
  rnasets:	 3. read pair rna sets, input to assemblers, various pairfa data slices
  tra_XXX:	 4. subfolders per assembler/data slice
  trsets:	 5. assembled transcripts from several assembly runs
  inputset:	 6. all transcripts/cds/aa from trsets as input to tr2aacds reduction
  okayset:	 7. non-redundant transcripts of tr2aacds, as gene locus primary (okay) and alternates (okalt)
  dropset:	 8. redundant transcripts of tr2aacds
  refset:	 9. reference sequences for annotation, eg refgenes.aa for homology, vector/contam screen
  publicset:	10. public transcript/cds/aa sequences, annotations of evgmrna2tsa
  submitset:	11. submission set for TSA database,  of evgmrna2tsa
  genome:	20. chromosome assembly, where available, for EvigeneH methods
  aaeval:	21. protein homology annotations, comparisons
  geneval:	22. mRNA/CDS sequence annotations, comparisons
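
A quick way to see how far a project has progressed is to check which of the folders listed above exist. The sketch below does only that; it does not create anything and is not part of Evigene. Folder names are copied from the layout list, tra_XXX is left out because its per-assembler names are not spelled out there, and the function names are hypothetical.

```python
from pathlib import Path

# Folder names from the layout list above, in the same order.
PROJECT_DIRS = [
    "spotfa", "pairfa", "rnasets", "trsets", "inputset", "okayset",
    "dropset", "refset", "publicset", "submitset",
    "genome", "aaeval", "geneval",
]


def layout_status(root):
    """Map each expected folder name to True if it exists under root."""
    root = Path(root)
    return {name: (root / name).is_dir() for name in PROJECT_DIRS}


def missing_dirs(root):
    """List expected folders not yet present under root, in layout order."""
    return [name for name, present in layout_status(root).items() if not present]


if __name__ == "__main__":
    for name, present in layout_status(".").items():
        print(f"{name}:\t{'present' if present else 'absent'}")
```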


Developed at the Genome Informatics Lab of Indiana University Biology Department