Index of /EvidentialGene/vertebrates/pig/pig18evigene/evgmethods

      Name                       Last modified       Size

[DIR] Parent Directory           17-May-2019 15:32   -
[TXT] evgsra2genepipe_help.txt   16-Aug-2018 14:48   6k
[DIR] runscripts/                08-Dec-2018 15:01   -

EvidentialGene sra2genes DRAFT VERSION 2017.12.07
  omnibus pipe for evigene methods, from SRA RNA-seq data to annotated public gene set
  
Usage: evgpipe_sra2genes.pl -SRAtable=myspecies_sra.csv | -SRAids=SRRnnnn,SRRmmmm 
opts: -help -runname MyProjectXXX -nCPU=0 -idprefix Thecc1EG ..  
      -runstep 1,2,3,4..10  -log -dryrun -debug 

  *** EARLY DRAFT VERSION, Expect problems ***
---------------------------------------------------------------
Current pipeline design: 
    Process SRA RNA-seq data to a finished, annotated gene set, in steps, using existing, 
    tested Evigene methods.   Compute-intensive steps are run
    asynchronously, by generating cluster-ready shell scripts that you then submit to your
    cluster batch queue.  These steps include runassemblers, reduceassemblies, refblastgenes.
    
    See 'run_evgsra2genes.sh', an example cluster script that calls this omnibus pipe. It sets
    paths to component software (assemblers, NCBI tools, others) that you must adjust.
    
    After each of these cluster runs, rerun this pipeline to proceed to the next steps, e.g.:

evgpipe STEPs 1..4:
  env sratable=daphsim16huau_srarna.csv  name=dapsim_sra2evg  datad=`pwd` prog=./run_evgsra2genes.sh sbatch srun_shared.sh

ASYNC run assemblers (~ 8 hr each)
  env ncpu=8  datad=`pwd` prog=./runvelo.sh sbatch srun_comet.sh
  env ncpu=12  datad=`pwd` prog=./runidba.sh sbatch srun_comet.sh

evgpipe STEPs 5..7:
  env sratable=daphsim16huau_srarna.csv  name=dapsim_sra2evg  datad=`pwd` prog=./run_evgsra2genes.sh sbatch srun_shared.sh

ASYNC run assembly reduction to genes (~ 2 hr)
  env ncpu=20 maxmem=120000 prog=./run_tr2aacds.sh datad=`pwd` sbatch srun_comet.sh

evgpipe STEPs 8..9:
  env REFAA=refset/refarp7s10fset1.aa  sratable=daphsim16huau_srarna.csv  name=dapsim_sra2evg  datad=`pwd` prog=./run_evgsra2genes.sh sbatch srun_shared.sh

ASYNC run blastp (~ 8 hr)
  env ncpu=20 maxmem=120000 prog=./run_evgblastp.sh datad=`pwd` sbatch srun_comet.sh

evgpipe STEP 10:
  env sratable=daphsim16huau_srarna.csv  name=dapsim_sra2evg  datad=`pwd` prog=./run_evgsra2genes.sh sbatch srun_shared.sh
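
The alternation above (a pipeline call, then an asynchronous cluster job, then the pipeline again) can be written down as a small command generator. This is only a sketch that prints the same kind of 'env ... sbatch ...' lines as the examples. The helper names sbatch_command and staged_commands are hypothetical and not part of Evigene, the path /path/to/project stands in for `pwd`, and nothing is submitted to a queue.

```python
import shlex


def sbatch_command(prog, wrapper, **env):
    """Build an 'env KEY=VALUE ... prog=PROG sbatch WRAPPER' command line."""
    settings = dict(env, prog=prog)
    pairs = " ".join(
        f"{key}={shlex.quote(str(value))}" for key, value in settings.items()
    )
    return f"env {pairs} sbatch {shlex.quote(wrapper)}"


def staged_commands(sratable, name, datad, refaa):
    """Return (label, command) pairs in the order shown in the help text."""
    common = dict(sratable=sratable, name=name, datad=datad)
    pipe = "./run_evgsra2genes.sh"
    big = dict(ncpu=20, maxmem=120000, datad=datad)
    return [
        ("evgpipe STEPs 1..4", sbatch_command(pipe, "srun_shared.sh", **common)),
        ("ASYNC runvelo.sh",
         sbatch_command("./runvelo.sh", "srun_comet.sh", ncpu=8, datad=datad)),
        ("ASYNC runidba.sh",
         sbatch_command("./runidba.sh", "srun_comet.sh", ncpu=12, datad=datad)),
        ("evgpipe STEPs 5..7", sbatch_command(pipe, "srun_shared.sh", **common)),
        ("ASYNC run_tr2aacds.sh",
         sbatch_command("./run_tr2aacds.sh", "srun_comet.sh", **big)),
        ("evgpipe STEPs 8..9",
         sbatch_command(pipe, "srun_shared.sh", REFAA=refaa, **common)),
        ("ASYNC run_evgblastp.sh",
         sbatch_command("./run_evgblastp.sh", "srun_comet.sh", **big)),
        ("evgpipe STEP 10", sbatch_command(pipe, "srun_shared.sh", **common)),
    ]


# Print the command lines for review; each job must finish before the next is submitted.
for label, command in staged_commands(
    "daphsim16huau_srarna.csv", "dapsim_sra2evg",
    "/path/to/project", "refset/refarp7s10fset1.aa",
):
    print(f"{label}:\n  {command}")
```

Printing rather than submitting keeps the sketch safe to run anywhere; the wait between stages is still a manual step, as in the examples above.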

STEPS in pipeline (will change)
  STEP1_sraget
  STEP2_sra2fasta STEP2a_sra2spot STEP2b_pairfa
  STEP3_selectrna
  STEP4_runassemblers
  STEP5_collectassemblies STEP5b_qualassemblies (option)
  STEP6_reduceassemblies
  STEP7_refblastgenes      
  STEP9_annotgenes   STEP9a_trimvec  STEP9c_contamcheck STEP9b_consdomains
  STEP10_publicgenes      
  More details via 'pod2man evgpipe_sra2genes.pl | nroff -man |less' 
---------------------------------------------------------------
Component applications currently used on PATH:
  app=fastq-dump, path=/bio/sratoolkit/sratoolkit281/bin/fastq-dump
    https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/
  app=blastn, path=/bio/ncbi/bin/blastn
    https://blast.ncbi.nlm.nih.gov/
  app=cd-hit-est, path=/bio/cdhit466/bin/cd-hit-est
    https://github.com/weizhongli/cdhit/
  app=fastanrdb, path=/bio/exonerate/bin/fastanrdb
    https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate
  app=normalize-by-median.py, path=/bio/khmer/scripts/normalize-by-median.py
    https://github.com/ged-lab/khmer
  app=vecscreen, path=/bio/ncbi/bin/vecscreen
    http://ncbi.nlm.nih.gov/tools/vecscreen/
  app=velveth, oases, path=/bio/velvet1210/bin4/velveth
    https://www.ebi.ac.uk/~zerbino/oases/
  app=idba_tran, path=/bio/idba/bin/idba_tran
    http://hku-idba.googlecode.com/files/idba-1.1.1.tar.gz
  app=SOAPdenovo-Trans-127mer, path=/bio/soaptrans103/SOAPdenovo-Trans-127mer
    http://soap.genomics.org.cn/SOAPdenovo-Trans.html
  app=Trinity, path=/bio/trinity/Trinity
    https://github.com/trinityrnaseq/trinityrnaseq
  data=UniVec, path=
      Pipeline will work without some of these, eg assemblers.
      sratoolkit: need current v281+ for web fetch by SRR id
      velvet: fixme multi kmer binaries, bin4 = 151mer; bin2 = 99mer
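
Before a run it can help to see which of these applications are visible on PATH. The following is a minimal sketch using Python's shutil.which, not the check the pipeline itself performs. The function names find_apps and report are hypothetical, and the output format only imitates the 'app=..., path=...' lines above.

```python
import shutil

# Application names copied from the component list above.
APPS = [
    "fastq-dump", "blastn", "cd-hit-est", "fastanrdb",
    "normalize-by-median.py", "vecscreen", "velveth", "idba_tran",
    "SOAPdenovo-Trans-127mer", "Trinity",
]


def find_apps(apps=APPS, path=None):
    """Map each application name to its full path, or None if not found."""
    return {app: shutil.which(app, path=path) for app in apps}


def report(found):
    """Format the lookup result in the style of the list above."""
    return [f"app={app}, path={location or 'MISSING'}"
            for app, location in found.items()]


print("\n".join(report(find_apps())))
```

A MISSING entry is not necessarily fatal, since the pipeline is stated to work without some components, e.g. some of the assemblers.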
---------------------------------------------------------------
INPUT  -SRAtable=myspecies_sra.csv is NCBI SraRunInfo.csv, 2017 format
    from https://www.ncbi.nlm.nih.gov/sra/ (Send to: File, Format: RunInfo)
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,
  Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,
  Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,
  TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,
  Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,
  dbgap_study_accession,Consent,RunHash,ReadHash
The expected sra.csv input format may change; running with only -SRAids is to be enabled.
The pipeline now requires an NCBI sratoolkit fastq-dump that has web fetch of data by SRA id enabled.
That will become one option; the others will be to fetch SRA/ENA data yourself, or to supply
RNA read pairs as fasta/fastq files.
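
To inspect a RunInfo table before handing it to the pipeline, the columns listed above can be read with Python's csv module. This sketch picks paired-end RNA-Seq runs and formats them as an -SRAids option. The selection rule (LibraryStrategy equal to RNA-Seq, LibraryLayout equal to PAIRED) is an assumption made for illustration, not the pipeline's own logic, and the run ids and the shortened column set in SAMPLE are made up.

```python
import csv
import io


def rna_runs(handle, layout="PAIRED"):
    """Return Run ids of RNA-Seq rows that have the given LibraryLayout."""
    runs = []
    for row in csv.DictReader(handle):
        run = row.get("Run")
        # Skip empty ids and header lines repeated inside concatenated tables.
        if not run or run == "Run":
            continue
        if row.get("LibraryStrategy") == "RNA-Seq" and row.get("LibraryLayout") == layout:
            runs.append(run)
    return runs


def sraids_option(runs):
    """Format run ids as the -SRAids option shown in the usage line."""
    return "-SRAids=" + ",".join(runs)


# Made-up table with only a few of the RunInfo columns.
SAMPLE = """Run,spots,LibraryStrategy,LibraryLayout,ScientificName
SRR0000001,1000,RNA-Seq,PAIRED,Example species
SRR0000002,2000,WGS,PAIRED,Example species
SRR0000003,3000,RNA-Seq,SINGLE,Example species

Run,spots,LibraryStrategy,LibraryLayout,ScientificName
SRR0000004,4000,RNA-Seq,PAIRED,Example species
"""

print(sraids_option(rna_runs(io.StringIO(SAMPLE))))
```

For a real table, pass an open file instead of the StringIO object, e.g. open("myspecies_sra.csv", newline="").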
---------------------------------------------------------------
Layout of project directory:
  spotfa:	 1. SRA spot (joined read pairs) files, from fastq-dump of SRAids
  pairfa:	 2. unjoined read pair files, _1.fa and _2.fa
  rnasets:	 3. read pair rna sets, input to assemblers, various pairfa data slices
  tra_XXX:	 4. subfolders per assembler/data slice
  trsets:	 5. assembled transcripts from several assembly runs
  inputset:	 6. all transcripts/cds/aa from trsets as input to tr2aacds reduction
  okayset:	 7. non-redundant transcripts of tr2aacds, as gene locus primary (okay) and alternates (okalt)
  dropset:	 8. redundant transcripts of tr2aacds
  refset:	 9. reference sequences for annotation, eg refgenes.aa for homology, vector/contam screen
  publicset:	10. public transcript/cds/aa sequences, annotations of evgmrna2tsa
  submitset:	11. submission set for TSA database,  of evgmrna2tsa
  genome:	20. chromosome assembly, where available, for EvigeneH methods
  aaeval:	21. protein homology annotations, comparisons
  geneval:	22. mRNA/CDS sequence annotations, comparisons
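
Since each folder in this layout is associated with a stage of the pipeline, listing which folders exist gives a rough picture of how far a project has progressed. The sketch below only checks for directories and does not inspect their contents. The function name layout_status is hypothetical, and it assumes that the per-assembler folders written as tra_XXX above have names starting with 'tra_'.

```python
from pathlib import Path

# Folder names copied from the layout list above, in the same order.
LAYOUT = [
    "spotfa", "pairfa", "rnasets", "trsets", "inputset", "okayset",
    "dropset", "refset", "publicset", "submitset",
    "genome", "aaeval", "geneval",
]


def layout_status(project_dir):
    """Report which layout folders exist under project_dir."""
    root = Path(project_dir)
    status = {name: (root / name).is_dir() for name in LAYOUT}
    # Assumption: assembler subfolders share the 'tra_' prefix.
    status["tra_XXX"] = sorted(p.name for p in root.glob("tra_*") if p.is_dir())
    return status


for name, state in layout_status(".").items():
    print(f"{name}:\t{state}")
```

An existing folder does not prove that its stage completed; the check only shows that the stage was at least started.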

==============================================================

Developed at the Genome Informatics Lab of Indiana University Biology Department