
SRA2Genes Test Drive for EvidentialGene

SRA2Genes is a complete pipeline to reconstruct genes from RNA data sources, to publishable gene sets for animals and plants.
Name                                       Last modified      Size
evigene20may20.tar                         20-May-2020 22:37  9.0M
evigene_apps_linux_x86_64_19may14.list     13-May-2019 15:10  1k
evigene_apps_linux_x86_64_19may14.tar.gz   13-May-2019 15:03  548M
run_evgsra2genes4v.sh                      07-Mar-2020 20:42  3k
run_plant1kYYPE.txt                        17-Mar-2020 15:41  5k
run_plantATtest.txt                        17-Mar-2020 14:48  1k
sra2genes4v_about.txt                      17-Mar-2020 15:44  7k
sra2genes_start7test/                      18-Mar-2020 20:41  -
sra2genes_testdrive_help.txt               04-Dec-2021 15:04  10k
trasm_1kplants_nat19_SraRunInfo.csv        09-Nov-2019 21:14  567k
trasm_1kplants_nat19_info.txt              09-Nov-2019 20:25  1k
trasm_1kplants_nat19st1clean.tab           09-Nov-2019 20:25  208k



SRA2Genes Test Drive for EvidentialGene software
2021-Dec update

SRA2Genes does more than EvidentialGene's tr2aacds, which it includes
as one part (STEP7) of a full gene set reconstruction pipeline.
tr2aacds reduces a large over-assembly of transcripts using only
self-referential coding-gene metrics. The more complete gene
reconstruction pipeline of SRA2Genes brings in external gene evidence,
notably the wealth of conserved gene information.

URL: http://arthropods.eugenes.org/EvidentialGene/other/sra2genes_testdrive/sra2genes4v_testdrive/

Contents:
 run_plant1kYYPE.txt : full example for 1000-Plants RNA-seq data (steps 1..10)
 run_plantATtest.txt : simplest example for Arabidopsis transcripts (steps 7..10)
 evigene20may20.tar  : Evigene source code and documents tar file  
 sra2genes_start7test: starting data needed for run_plantATtest, and also for
      run_plant1kYYPE, though that example fetches its RNA-seq data from NCBI SRA

SRA2Genes creates and uses a data file/folder layout that you can
re-use by careful replacement of its data parts. Start the pipeline at
these steps, depending on your needs:
  a. Your initial transcript set, already assembled by any RNA assembly software, placed in trsets/ folder,
     start at STEP7
  b. NCBI SRA info on RNA data, srainfo.csv for Illumina paired-end RNA-seq, start at STEP1
  c. Your RNA data, unassembled Illumina paired-end RNA-seq in fasta/fastq format placed in pairfa/ folder, 
     start at STEP3

This test drive shows how options a and b can be run on a computer cluster batch system; option c is outlined at the end.

refset/ folder: Choices a, b, c need reference data for your species
Reference data sets for SRA2Genes are currently defined as
  REFAA=refset/refgenes.aa and refgenes.names : some reference protein set to align to transcript.aa 
  BUSCO=refset/busco : OrthoDB BUSCO database of conserved 1:1 gene proteins, such as embryophyta_odb9 vertebrata_odb9
  DFAM=refset/dfam  : transposon nucleic acid motif database from dfam.org

  refset/contam : contaminant reference data, 2020-May, replaces and adds to  UniVec, see ContamEvg_README.txt
  UniVec=refset/UniVec_Core17.fa.nsq,nhr,nin : UniVec vector BlastN database 

The samples here are for a plant with Arabidopsis reference proteins.
You need to change the busco and dfam symlinks in refset/ to valid data paths on your system.
   busco -> /YOUR/PATH/TO/BUSCO/embryophyta_odb9
   dfam -> /YOUR/PATH/TO/DFAMdb
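
A minimal sketch of repointing the two symlinks, assuming they live in refset/ as the BUSCO= and DFAM= definitions above indicate. BUSCO_DB and DFAM_DB are hypothetical variable names introduced here; their defaults are the placeholder paths shown above and must be replaced with real locations on your system.

```shell
# Hypothetical variables: set these to the real database folders on your system.
BUSCO_DB=${BUSCO_DB:-/YOUR/PATH/TO/BUSCO/embryophyta_odb9}
DFAM_DB=${DFAM_DB:-/YOUR/PATH/TO/DFAMdb}

mkdir -p refset
# -s symbolic link, -f replace an existing link, -n do not follow an old link to a folder
ln -sfn "$BUSCO_DB" refset/busco
ln -sfn "$DFAM_DB"  refset/dfam

# Report links whose target does not exist, so a bad path is caught before the pipeline runs
for link in refset/busco refset/dfam; do
  if [ -e "$link" ]; then
    echo "ok: $link -> $(readlink "$link")"
  else
    echo "dangling: $link -> $(readlink "$link")" >&2
  fi
done
```

With the placeholder defaults left unchanged, both links are reported as dangling, which is the expected reminder to edit the paths.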

genome/ folder:
If you have chromosome assembly data, place it in the genome/ folder; genes are then mapped onto
the assembly, and genome locations are added to gene annotations.

trsets/ folder:
This contains assembled gene transcripts, produced by steps 1..6 or added by you.

SRA2Genes steps are
  data selection and RNA assemblies
        1. get RNA data (from NCBI SRA, or other sources)
        2. reformat sra to fasta
        3. subset(s) of data, digital normalize/reduce; 
        4. run several assemblers, with kmer size options, other opts
        5. post process assembly sets (trformat.pl)
        6. quick assessment:  aastats per assembly, report

  reduction to best draft gene set, self-referential (no external gene evidence)
        7. run evg over-assembly reduction, tr2aacds.pl

  refinement with external gene evidence
        8. ref protein blastp x evg okayset
        9. annotate and name genes, vector/contam screen, conserved domains, transposons
       10. make annotated public gene set 
       11. make NCBI TSA submission file set 

# =========================================
# Unix(Linux) Command Lines for Test Drive
# =========================================

mkdir sra2genestest
cd sra2genestest

# copy software, test data
wget --mirror -R 'index.html*'  -np -nH -L  --cut-dirs=3 \
   http://arthropods.eugenes.org/EvidentialGene/other/sra2genes_testdrive/sra2genes4v_testdrive/

cd sra2genes4v_testdrive
tar -xf evigene20may20.tar   # expands to evigene/..
tar -zxf evigene_apps_linux_x86_64_19may14.tar.gz     # expands to bio/apps/..
# NOTE: some of these compiled apps_linux_x86_64_19may14 programs may not work on your Linux system and may need to be updated
mv evigene bio/apps/         # move to common software folder
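
Because some of the precompiled programs may not run on a given system, a quick existence check after unpacking can save a failed batch job. This is a minimal sketch: check_apps is a hypothetical helper, and the two relative paths in the usage line are the ones that appear in the log lines of this test drive. It only tests that each file is present and executable, not that the program actually runs.

```shell
check_apps() {
  # $1 = bioapps folder; remaining args = program paths relative to it
  bioapps=$1; shift
  status=0
  for rel in "$@"; do
    if [ -x "$bioapps/$rel" ]; then
      echo "ok       $rel"
    elif [ -e "$bioapps/$rel" ]; then
      echo "no-exec  $rel"        # present but not executable (may still run via its interpreter)
    else
      echo "missing  $rel"
      status=1
    fi
  done
  return $status
}

# Usage, from the sra2genes4v_testdrive folder:
# check_apps bio/apps evigene/scripts/prot/tr2aacds4.pl sratools/bin/fastq-dump
```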

cd sra2genes_start7test
# edit run_evgsra2genes4v.sh with any text editor, setting the bioapps path to:  bioapps=../bio/apps
${EDITOR:-vi} run_evgsra2genes4v.sh
chmod +x run_evgsra2genes4v.sh

cd ../
cp -rp sra2genes_start7test start7test_arath
cp -rp sra2genes_start7test start1test_plyype

# ** FIXME: all run_*.sh have '#PBS' headers that need to be removed for Slurm use
# ** .. update needs sra2genes_startup.sh cluster header

#=== START STEP7 TEST with transcript assembly set in trsets/arath16ap.cdna

cd start7test_arath/
env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=8 maxmem=16000 datad=`pwd` ./run_evgsra2genes4v.sh

# disable the '#PBS' header lines in run*.sh (rewritten as '#...') to run on the Slurm sbatch system
perl -pi -e 's/^#PBS/#.../;' run_s*.arath16test.sh
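
As an alternative to passing every option on the sbatch command line, the disabled PBS header can be replaced with an equivalent Slurm header. The following is only a sketch: pbs2slurm is a hypothetical helper, not part of Evigene, and PARTITION, NCPU and WALLTIME are made-up variable names whose defaults mirror the sbatch options used in this test drive. It writes the converted script to standard output and leaves the original untouched.

```shell
pbs2slurm() {
  # usage: pbs2slurm run_script.sh > run_script.slurm.sh
  awk -v part="${PARTITION:-debug}" -v ncpu="${NCPU:-8}" -v wall="${WALLTIME:-2:00:00}" '
    function header() {
      print "#SBATCH -p " part
      print "#SBATCH --nodes=1"
      print "#SBATCH --ntasks-per-node=" ncpu
      print "#SBATCH -t " wall
      hdr_done = 1
    }
    NR == 1 && /^#!/ { print; header(); next }   # keep the shebang as the first line
    !hdr_done        { header() }                # no shebang: header goes at the very top
    /^#PBS/          { sub(/^#PBS/, "#...") }    # same effect as the perl one-liner
                     { print }
  ' "$1"
}

# Usage:
# pbs2slurm run_s07_tr2aacds.arath16test.sh > run_s07_tr2aacds.arath16test.slurm.sh
```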

# -- for Slurm batch system, others like PBS, ..
sbatch  -p debug --nodes=1 --ntasks-per-node=8 -t 2:00:00 ./run_s07_tr2aacds.arath16test.sh
#  -- OR --
srun  -p debug --nodes=1 --ntasks-per-node=8 -t 2:00:00 ./run_s07_tr2aacds.arath16test.sh

#START Fri Dec  3 15:13:57 EST 2021 
# sra2genes4v_testdrive/bio/apps/evigene/scripts/prot/tr2aacds4.pl -NCPU 8 -MAXMEM 16000 -log -cdna arath16test.tr
#DONE : Fri Dec  3 15:16:32 EST 2021

#== STEP7 OUTPUT okayset ===
[start7test_arath]$ ls okayset
arath16test.ann.txt    arath16test.genesum.txt	arath16test.okay.mrna		arath16test.pubids.realt.log
arath16test.cull.aa    arath16test.mainalt.tab	arath16test.okay.mrna.checktab
arath16test.cull.cds   arath16test.okay.aa	arath16test.pubids
arath16test.cull.mrna  arath16test.okay.cds	arath16test.pubids.old

# CONTINUE with next steps
env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=8 maxmem=16000 datad=`pwd` ./run_evgsra2genes4v.sh
# for each of these new run scripts
 run_s08_evgblastp.arath16test.sh
 run_s09a_evgtrimvec.arath16test.sh
 run_s09a_tr2ncrna.arath16test.sh
 run_s09b_gmapgenes.arath16test.sh
 run_s12_evgclean.arath16test.sh

perl -pi -e 's/^#PBS/#.../;' run_s*.arath16test.sh
sbatch  -p debug --nodes=1 --ntasks-per-node=8 -t 1:00:00  run_s08_evgblastp.arath16test.sh
# .. ditto for the run_s09a and run_s09b scripts
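
The repeated submissions can be written as a loop. This is a sketch under stated assumptions: submit_steps and the SBATCH variable are hypothetical, the sbatch options are copied from the step 8 command above, and the default is a dry run that only prints each command. Set SBATCH=sbatch to actually submit.

```shell
SBATCH=${SBATCH:-echo sbatch}    # dry run by default: prints the command instead of submitting

submit_steps() {
  # $1 = run name; remaining args = step labels as they appear in the run script names
  name=$1; shift
  for step in "$@"; do
    $SBATCH -p debug --nodes=1 --ntasks-per-node=8 -t 1:00:00 "./run_${step}.${name}.sh"
  done
}

# Usage:
# submit_steps arath16test s09a_evgtrimvec s09a_tr2ncrna s09b_gmapgenes
```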

# Rerun this after each run.sh finishes, to check data and update for the next step in the pipeline
env name=arath16test .. ./run_evgsra2genes4v.sh
#  .. creates step run_s10.sh from new data of steps 8,9
env name=arath16test .. ./run_evgsra2genes4v.sh
#  .. creates step run_s11.sh from new data of step 10

# DONE all these steps s7..s12 
runscripts/
run_s07_tr2aacds.arath16test.sh*     run_s09a_tr2ncrna.arath16test.sh*	 run_s11_evgpub2submit.arath16test.sh*
run_s08_evgblastp.arath16test.sh*    run_s09b_gmapgenes.arath16test.sh*  run_s12_evgclean.arath16test.sh*
run_s09a_evgtrimvec.arath16test.sh*  run_s10_evgpubset.arath16test.sh*

# Gene data directories : publicset/ for public use, submitset/ to send to NCBI TSA database (if desired)
# Text Summaries in arath16test.genesum.txt, arath16test.trclass.sum.txt
[start7test_arath]$ ls
aaeval/			     cdshexprob.codepot  okayset/	       run_evgsra2genes4v.sh*  tmpfiles/
arath16test.genesum.txt@     dropset/		 okayset1st/	       run_plant1kYYPE.info    tmpsets/
arath16test.sra2genes.info   geneval/		 plYYPE.srainfo.csv    runlogs/		       trsets/
arath16test.trclass	     genome/		 publicset/	       runscripts/	       vectrimset/
arath16test.trclass.orig     inputset/		 refset/	       srun_shared20h.sh
arath16test.trclass.sum.txt  ncrnaset/		 run_arath16test.info  submitset/


#==== START STEP1 TEST with NCBI SRA data of tiny plant RNA set ===========

cd start1test_plyype/

env name=plYYPE sratable=plYYPE.srainfo.csv ncpu=8 maxmem=64000 datad=`pwd`  ./run_evgsra2genes4v.sh

plYYPE.sra2genes.log
#s2g: CMD: evgpipe_sra2genes.pl  -NCPU 8 -MAXMEM 64000 -log -debug -runname plYYPE -SRAtable plYYPE.srainfo.csv
#s2g: BEGIN with input= plYYPE.srainfo.csv date= Sat Dec  4 13:47:39 EST 2021
#s2g: sra_info: Run=ERR2040805; ScientificName=Austrocedrus chilensis; Platform=ILLUMINA; CenterName=DEPARTMENT OF BIOLOGICAL SCIENCES; size_MB=1828; spots=16951334; BioProject=PRJEB21674;
# ..
#s2g: forkCMD= /N/slate/gilbertd/chrs/evigenes/sra2genes4v_testdrive/bio/apps/sratools/bin/fastq-dump -O spotfa --qual-filter --fasta 0 ERR2040805

# initial data sets
spotfa: from NCBI ERR2040805.sra, ILLUMINA paired-end RNA data
 3.9G Dec  4 13:56 ERR2040805.fasta

pairfa: spotfa split into pair parts _1,_2
   98 Dec  4 13:56 ERR2040805.fa.info
 1.8G Dec  4 13:56 ERR2040805_1.fa
 1.8G Dec  4 13:56 ERR2040805_2.fa

rnasets: RNA data to assemble, depending on sizes, types
 168 Dec  4 13:56 sBn1l1ERR2040805.fa.info
  25 Dec  4 13:56 sBn1l1ERR2040805_1.fa -> ../pairfa/ERR2040805_1.fa
  25 Dec  4 13:56 sBn1l1ERR2040805_2.fa -> ../pairfa/ERR2040805_2.fa

sBn1l1ERR2040805.fa.info : used by sra2genes to track RNA data sources
  nreads=16951334; maxlen=90; totlen=3051240120; lfn=rnasets/sBn1l1ERR2040805_1.fa; 
  rfn=rnasets/sBn1l1ERR2040805_2.fa; sample=100%; satype=mixset; safrom=pairfa/ERR2040805_1.fa
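
The .fa.info fields are plain `key=value;` pairs, so they are easy to read back in a shell script. The sketch below uses a hypothetical helper, info_field, and the values shown above. For this data set totlen equals nreads x maxlen x 2 (16951334 x 90 x 2 = 3051240120), which fits two reads of 90 bases per spot; that is an observation about these numbers, not a documented rule of the file format.

```shell
info_field() {
  # $1 = key; stdin = .fa.info text; prints the value of the first matching key=value pair
  tr ';' '\n' | sed -n "s/^[[:space:]]*$1=//p" | head -n 1
}

# Usage, with the values shown above (in practice: info_field nreads < rnasets/NAME.fa.info)
info='nreads=16951334; maxlen=90; totlen=3051240120; lfn=rnasets/sBn1l1ERR2040805_1.fa;'
nreads=$(printf '%s\n' "$info" | info_field nreads)
maxlen=$(printf '%s\n' "$info" | info_field maxlen)
totlen=$(printf '%s\n' "$info" | info_field totlen)

# two reads per spot, each at most maxlen bases long
echo "upper bound on bases: $(( nreads * maxlen * 2 )), reported totlen: $totlen"
```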

# initial run_ scripts to assemble rnasets/sBn1l1ERR2040805_[12].fa
 run_s02_diginorm.plYYPE.sh : normalize/reduce RNA data, can skip for small data set
 run_s04_idba.plYYPE.sh     : RNA assembly scripts
 run_s04_soap.plYYPE.sh
 run_s04_velo.plYYPE.sh

# disable the '#PBS' header lines in run*.sh (rewritten as '#...') to run on the Slurm sbatch system
perl -pi -e 's/^#PBS/#.../;' run_s*.plYYPE.sh

# submit to the cluster batch system; note the --mem= allocation, which will need to be larger for larger RNA data sets
sbatch  -p general --nodes=1 --ntasks-per-node=16 --mem=64G -t 8:00:00 ./run_s04_velo.plYYPE.sh
sbatch  -p general --nodes=1 --ntasks-per-node=16 --mem=64G -t 8:00:00 ./run_s04_soap.plYYPE.sh
sbatch  -p general --nodes=1 --ntasks-per-node=16 --mem=64G -t 8:00:00 ./run_s04_idba.plYYPE.sh


#==== START STEP3 TEST with your own unassembled RNA-seq data ====

Run the first part of the STEP1 test above, then replace the pairfa/ data with your own RNAset_1.fa and RNAset_2.fa.
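
Before copying your own read files into pairfa/, it may help to confirm that the left and right files hold the same number of reads. This is a minimal sketch for FASTA input only (FASTQ would need a different count); check_pairs is a hypothetical helper, not part of Evigene.

```shell
check_pairs() {
  # $1, $2 = left and right read files in FASTA format
  n1=$(grep -c '^>' "$1")
  n2=$(grep -c '^>' "$2")
  if [ "$n1" -eq "$n2" ] && [ "$n1" -gt 0 ]; then
    echo "ok: $n1 read pairs"
  else
    echo "mismatch: $1 has $n1 reads, $2 has $n2 reads" >&2
    return 1
  fi
}

# Usage:
# check_pairs RNAset_1.fa RNAset_2.fa && cp -p RNAset_1.fa RNAset_2.fa pairfa/
```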

Developed at the Genome Informatics Lab of Indiana University Biology Department