SRA2Genes Test Drive for EvidentialGene
SRA2Genes is a complete pipeline to reconstruct genes from RNA data
sources, to publishable gene sets for animals and plants.
Name                                        Last modified      Size
evigene20may20.tar                          20-May-2020 22:37  9.0M
evigene_apps_linux_x86_64_19may14.list      13-May-2019 15:10  1k
evigene_apps_linux_x86_64_19may14.tar.gz    13-May-2019 15:03  548M
                                            07-Mar-2020 20:42  3k
run_plant1kYYPE.txt                         17-Mar-2020 15:41  5k
run_plantATtest.txt                         17-Mar-2020 14:48  1k
sra2genes4v_about.txt                       17-Mar-2020 15:44  7k
sra2genes_start7test/                       18-Mar-2020 20:41  -
sra2genes_testdrive_help.txt                04-Dec-2021 15:04  10k
trasm_1kplants_nat19_SraRunInfo.csv         09-Nov-2019 21:14  567k
trasm_1kplants_nat19_info.txt               09-Nov-2019 20:25  1k
                                            09-Nov-2019 20:25  208k
SRA2Genes Test Drive for EvidentialGene software
2021-Dec update
SRA2Genes does more than EvidentialGene's tr2aacds, which it includes as
one part of a full gene set reconstruction pipeline. tr2aacds reduces a
large over-assembly of transcripts using only self-referential
coding-gene metrics. The more complete gene reconstruction pipeline of
SRA2Genes also brings in external gene evidence, notably the wealth of
conserved gene information.
run_plant1kYYPE.txt   : full example for 1000-Plants RNA-seq data (steps 1..10)
run_plantATtest.txt   : simplest example for Arabidopsis transcripts (steps 7..10)
evigene20may20.tar    : Evigene source code and documents tar file
sra2genes_start7test/ : starting data needed for run_plantATtest, and also for
                        run_plant1kYYPE, though that fetches RNA-Seq from NCBI SRA
SRA2Genes creates and uses a data file/folder layout that you can
re-use by careful replacement of its data parts. Start the pipeline at
these steps, depending on your needs:
a. Your initial transcript set, already assembled by any RNA assembly software and placed in the trsets/ folder: start at STEP7
b. NCBI SRA info on RNA data, srainfo.csv for Illumina paired-end RNA-seq: start at STEP1
c. Your RNA data, unassembled Illumina paired-end RNA-seq in fasta/fastq format, placed in the pairfa/ folder: start at STEP3
This test drive shows how a, b can be run on a computer cluster batch system.
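As a quick reminder of which start step goes with which input, here is a small shell sketch. The function suggest_start_step is a hypothetical helper, not part of Evigene, and it only looks at which folders hold data. The pipeline itself is told where to start by the parameters given on its command line (runsteps=, sratable=), so treat this as a checklist aid under those assumptions.

```shell
# Hypothetical helper (not part of Evigene): suggest a start step from
# the input data already present in a project folder.
suggest_start_step() {
  datad="$1"
  if [ -n "$(ls -A "$datad/trsets" 2>/dev/null)" ]; then
    echo STEP7      # choice a: assembled transcripts in trsets/
  elif [ -n "$(ls -A "$datad/pairfa" 2>/dev/null)" ]; then
    echo STEP3      # choice c: unassembled paired-end reads in pairfa/
  elif ls "$datad"/*srainfo.csv >/dev/null 2>&1; then
    echo STEP1      # choice b: NCBI SRA info table
  else
    echo "no input data found"
  fi
}

# Demo on a throwaway folder with mock (empty) input files.
demo=$(mktemp -d)
mkdir -p "$demo/a/trsets" "$demo/b" "$demo/c/pairfa"
touch "$demo/a/trsets/mock.cdna" "$demo/b/mock.srainfo.csv" \
      "$demo/c/pairfa/mock_1.fa" "$demo/c/pairfa/mock_2.fa"
for d in a b c; do
  echo "$d: $(suggest_start_step "$demo/$d")"
done
```

Note that the test-drive folder holds both trsets/ data and an srainfo.csv file, so a folder check like this cannot replace the explicit runsteps= or sratable= setting.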
refset/ folder: Choices a, b, c need reference data for your species
Reference data sets for SRA2Genes are currently defined as
REFAA=refset/refgenes.aa and refgenes.names : some reference protein set to align to transcript.aa
BUSCO=refset/busco : OrthoDB BUSCO database of conserved 1:1 gene proteins, such as embryophyta_odb9 vertebrata_odb9
DFAM=refset/dfam : transposon nucleic acid motif database from
refset/contam : contaminant reference data, 2020-May, replaces and adds to UniVec, see ContamEvg_README.txt
UniVec=refset/UniVec_Core17.fa.nsq,nhr,nin : UniVec vector BlastN database
The samples here are for a plant with Arabidopsis reference proteins.
You need to change busco and dfam symlinks to a valid data path on your system.
busco -> /YOUR/PATH/TO/BUSCO/embryophyta_odb9
dfam -> /YOUR/PATH/TO/DFAMdb
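One way to reset these symlinks is sketched below. The function link_refdb is a hypothetical helper, and the demo uses mock folders under a temporary directory; substitute your real BUSCO and Dfam locations for the mock paths.

```shell
# Hypothetical helper: point a refset/ symlink at a reference database folder.
# Args: refset folder, link name (busco or dfam), target database folder.
link_refdb() {
  refset="$1"; name="$2"; target="$3"
  if [ ! -d "$target" ]; then
    echo "missing reference data: $target" >&2
    return 1
  fi
  # -f replaces an existing link, -n avoids following an old link to a folder
  ln -sfn "$target" "$refset/$name"
}

# Demo with mock folders; replace these with your real database paths.
demo=$(mktemp -d)
mkdir -p "$demo/refset" "$demo/db/embryophyta_odb9" "$demo/db/DFAMdb"
link_refdb "$demo/refset" busco "$demo/db/embryophyta_odb9"
link_refdb "$demo/refset" dfam  "$demo/db/DFAMdb"
ls -l "$demo/refset"
```

Checking that the target exists before linking avoids leaving a dangling symlink that would only show up as an error in a later pipeline step.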
genome/ folder:
If you have chromosome assembly data, place it in the genome/ folder. Genes are then mapped onto it,
and genome locations are added to the gene annotations.
trsets/ folder:
This contains assembled gene transcripts, produced by steps 1..6 or added by you.
SRA2Genes steps are
  data selection and RNA assemblies
    1. get RNA data (from NCBI SRA, or other sources)
    2. reformat sra to fasta
    3. subset(s) of data, digital normalize/reduce;
    4. run several assemblers, with kmer size options, other opts
    5. post process assembly sets (
    6. quick assessment: aastats per assembly, report
  reduction to best draft gene set, self-referential (no external gene evidence)
    7. run evg over-assembly reduction,
  refinement with external gene evidence
    8. ref protein blastp x evg okayset
    9. annotate and name genes, vector/contam screen, conserved domains, transposons
    10. make annotated public gene set
    11. make NCBI TSA submission file set
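The grouping of steps into phases can be written down as a small lookup. This is only a sketch; phase_of_step is a hypothetical helper that restates the step list above and is not part of Evigene.

```shell
# Hypothetical helper: name the pipeline phase for a step number (1..11),
# following the grouping in the step list above.
phase_of_step() {
  case "$1" in
    [1-6])      echo "data selection and RNA assemblies" ;;
    7)          echo "reduction to best draft gene set" ;;
    8|9|10|11)  echo "refinement with external gene evidence" ;;
    *)          echo "unknown step" ;;
  esac
}

# The three start points used in this test drive.
for step in 1 3 7; do
  echo "start at STEP$step: $(phase_of_step "$step")"
done
```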
# =========================================
# Unix(Linux) Command Lines for Test Drive
# =========================================
mkdir sra2genestest
cd sra2genestest
# copy software, test data
wget --mirror -R 'index.html*' -np -nH -L --cut-dirs=3 \
cd sra2genes4v_testdrive
tar -xf evigene20may20.tar # expands to evigene/..
tar -zxf evigene_apps_linux_x86_64_19may14.tar.gz # expands to bio/apps/..
# NOTE: some of the compiled apps in apps_linux_x86_64_19may14 may not work on your Linux system and may need updating
mv evigene bio/apps/ # move to common software folder
cd sra2genes_start7test
# edit bioapps path to this bioapps=../bio/apps
chmod +x
cd ../
cp -rp sra2genes_start7test start7test_arath
cp -rp sra2genes_start7test start1test_plyype
# ** FIXME: all run_*.sh have '#PBS' headers that need to be removed for Slurm use
# ** .. update needs cluster header
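Until the run scripts carry a cluster-specific header, one possible workaround is to write a Slurm-ready copy of each script. The sketch below is an assumption-laden example: pbs_to_slurm is a hypothetical helper, the mock script and its PBS lines are invented for illustration, and the script is assumed to have its shebang on line 1. The #SBATCH options mirror the sbatch options used later in this test drive.

```shell
# Hypothetical helper: write a Slurm-ready copy of a PBS run script.
# Args: input script, output script, cpus per node, memory, time limit.
pbs_to_slurm() {
  src="$1"; dst="$2"; ncpu="$3"; mem="$4"; tlimit="$5"
  {
    head -n 1 "$src"                          # keep the shebang (assumed on line 1)
    echo "#SBATCH --nodes=1"
    echo "#SBATCH --ntasks-per-node=$ncpu"
    echo "#SBATCH --mem=$mem"
    echo "#SBATCH --time=$tlimit"
    tail -n +2 "$src" | sed 's/^#PBS/#.../'   # disable the PBS directives
  } > "$dst"
}

# Demo on a mock script; these PBS lines are invented for illustration.
demo=$(mktemp -d)
cat > "$demo/run_mock.sh" <<'EOF'
#!/bin/bash
#PBS -N mockjob
#PBS -l nodes=1:ppn=8
echo "mock step"
EOF
pbs_to_slurm "$demo/run_mock.sh" "$demo/run_mock.slurm.sh" 8 16G 2:00:00
cat "$demo/run_mock.slurm.sh"
```

Options given on the sbatch command line still take precedence over #SBATCH lines inside the script, so the command lines in this test drive work with or without such a header.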
#=== START STEP7 TEST with transcript assembly set in trsets/arath16ap.cdna
cd start7test_arath/
env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=8 maxmem=16000 datad=`pwd` ./
# remove '#PBS' lines from run*.sh to run on Slurm sbatch system
perl -pi -e 's/^#PBS/#.../;' run_s*
# -- for Slurm batch system, others like PBS, ..
sbatch -p debug --nodes=1 --ntasks-per-node=8 -t 2:00:00 ./
# -- OR --
srun -p debug --nodes=1 --ntasks-per-node=8 -t 2:00:00 ./
#START Fri Dec 3 15:13:57 EST 2021
# sra2genes4v_testdrive/bio/apps/evigene/scripts/prot/ -NCPU 8 -MAXMEM 16000 -log -cdna
#DONE : Fri Dec 3 15:16:32 EST 2021
#== STEP7 OUTPUT okayset ===
[start7test_arath]$ ls okayset
arath16test.ann.txt arath16test.genesum.txt arath16test.okay.mrna arath16test.pubids.realt.log
arath16test.cull.aa arath16test.okay.mrna.checktab
arath16test.cull.cds arath16test.okay.aa arath16test.pubids
arath16test.cull.mrna arath16test.okay.cds arath16test.pubids.old
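Before continuing, it can help to confirm that the main okayset files exist and are not empty. The sketch below assumes the .okay.mrna, .okay.aa and .okay.cds files are FASTA files with one '>' header per sequence; check_okayset is a hypothetical helper, and the demo runs on mock files rather than real pipeline output.

```shell
# Hypothetical helper: report sequence counts for the main okayset files.
# Args: okayset folder, run name (e.g. arath16test).
check_okayset() {
  dir="$1"; name="$2"; status=0
  for suffix in okay.mrna okay.aa okay.cds; do
    f="$dir/$name.$suffix"
    if [ -s "$f" ]; then
      echo "$f: $(grep -c '^>' "$f") sequences"
    else
      echo "$f: missing or empty" >&2
      status=1
    fi
  done
  return $status
}

# Demo with mock FASTA files of two sequences each.
demo=$(mktemp -d)
mkdir -p "$demo/okayset"
for suffix in okay.mrna okay.aa okay.cds; do
  printf '>seq1\nACGT\n>seq2\nACGTAC\n' > "$demo/okayset/mock.$suffix"
done
check_okayset "$demo/okayset" mock
```

On the real output, the call would be check_okayset okayset arath16test from inside start7test_arath/.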
# CONTINUE with next steps
env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=8 maxmem=16000 datad=`pwd` ./
# for each of these new run scripts
perl -pi -e 's/^#PBS/#.../;' run_s*
sbatch -p debug --nodes=1 --ntasks-per-node=8 -t 1:00:00
.. ditto for run_s09a,run_s09b,
# Rerun this after each finishes, to check data and update for next step in pipeline
env name=arath16test .. ./
.. creates step from new data of steps 8,9
env name=arath16test .. ./
.. creates step from new data of step 10
# DONE all these steps s7..s12
# Gene data directories : publicset/ for public use, submitset/ to send to NCBI TSA database (if desired)
# Text Summaries in arath16test.genesum.txt, arath16test.trclass.sum.txt
aaeval/                      dropset/    okayset1st/         submitset/
arath16test.genesum.txt@     geneval/    plYYPE.srainfo.csv  tmpfiles/
arath16test.trclass          genome/     publicset/          tmpsets/
arath16test.trclass.orig     inputset/   refset/             trsets/
arath16test.trclass.sum.txt  ncrnaset/   runlogs/            vectrimset/
cdshexprob.codepot           okayset/*   runscripts/
#==== START STEP1 TEST with NCBI SRA data of tiny plant RNA set ===========
cd start1test_plyype/
env name=plYYPE sratable=plYYPE.srainfo.csv ncpu=8 maxmem=64000 datad=`pwd` ./
#s2g: CMD: -NCPU 8 -MAXMEM 64000 -log -debug -runname plYYPE -SRAtable plYYPE.srainfo.csv
#s2g: BEGIN with input= plYYPE.srainfo.csv date= Sat Dec 4 13:47:39 EST 2021
#s2g: sra_info: Run=ERR2040805; ScientificName=Austrocedrus chilensis; Platform=ILLUMINA; CenterName=DEPARTMENT OF BIOLOGICAL SCIENCES; size_MB=1828; spots=16951334; BioProject=PRJEB21674;
# ..
#s2g: forkCMD= /N/slate/gilbertd/chrs/evigenes/sra2genes4v_testdrive/bio/apps/sratools/bin/fastq-dump -O spotfa --qual-filter --fasta 0 ERR2040805
# initial data sets
spotfa: from NCBI ERR2040805.sra, ILLUMINA paired-end RNA data
3.9G Dec 4 13:56 ERR2040805.fasta
pairfa: spotfa split into pair parts _1,_2
98 Dec 4 13:56
1.8G Dec 4 13:56 ERR2040805_1.fa
1.8G Dec 4 13:56 ERR2040805_2.fa
rnasets: RNA data to assemble, depending on sizes, types
168 Dec 4 13:56
25 Dec 4 13:56 sBn1l1ERR2040805_1.fa -> ../pairfa/ERR2040805_1.fa
25 Dec 4 13:56 sBn1l1ERR2040805_2.fa -> ../pairfa/ERR2040805_2.fa : used by sra2genes to track RNA data sources
nreads=16951334; maxlen=90; totlen=3051240120; lfn=rnasets/sBn1l1ERR2040805_1.fa;
rfn=rnasets/sBn1l1ERR2040805_2.fa; sample=100%; satype=mixset; safrom=pairfa/ERR2040805_1.fa
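The numbers in this record can be cross-checked by hand. They are consistent with every read being 90 bases long and totlen covering both mate files: 16951334 x 2 x 90 = 3051240120. That reading is an inference from the numbers, not something stated by the pipeline. The sketch below repeats the arithmetic and shows one way to compute similar statistics for a FASTA file; fasta_stats is a hypothetical helper, not the code sra2genes uses, and the demo file is a mock.

```shell
# Arithmetic check of the reported values: reads x 2 mates x 90 bases.
expected_totlen=$((16951334 * 2 * 90))
echo "expected totlen=$expected_totlen"

# Hypothetical helper: count reads, longest read and total bases in a FASTA file.
fasta_stats() {
  awk '
    /^>/ { if (len > maxlen) maxlen = len; n++; len = 0; next }
         { len += length($0); tot += length($0) }
    END  { if (len > maxlen) maxlen = len
           printf "nreads=%d; maxlen=%d; totlen=%.0f;\n", n, maxlen, tot }
  ' "$1"
}

# Demo on a mock file with two reads of 4 and 6 bases.
demo=$(mktemp -d)
printf '>r1\nACGT\n>r2\nACGTAC\n' > "$demo/mock_1.fa"
fasta_stats "$demo/mock_1.fa"
```

Run on a single mate file such as pairfa/ERR2040805_1.fa, a count like this covers only that file, so its total would be half of a figure that spans both mates.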
# initial run_ scripts to assemble rnasets/sBn1l1ERR2040805_[12].fa : normalize/reduce RNA data, can skip for small data set : RNA assembly scripts
# remove '#PBS' lines from run*.sh to run on Slurm sbatch system
perl -pi -e 's/^#PBS/#.../;' run_s*
# submit to cluster batch system, note --mem= allocation, will need more for larger RNA data sets
sbatch -p general --nodes=1 --ntasks-per-node=16 --mem=64G -t 8:00:00 ./
sbatch -p general --nodes=1 --ntasks-per-node=16 --mem=64G -t 8:00:00 ./
sbatch -p general --nodes=1 --ntasks-per-node=16 --mem=64G -t 8:00:00 ./
#==== START STEP3 TEST with your own unassembled RNA-seq data ====
Run the first part of the STEP1 test above, then replace the pairfa/ data with your own RNAset_1.fa and RNAset_2.fa.
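Before replacing the pairfa/ data, it is worth checking that the two mate files hold the same number of reads. The sketch below assumes FASTA input, with one '>' header per read; fastq files would need a different count. check_pairs is a hypothetical helper, and the demo uses tiny mock files in place of your RNAset_1.fa and RNAset_2.fa.

```shell
# Hypothetical helper: confirm two paired FASTA files have equal read counts.
check_pairs() {
  if [ ! -s "$1" ] || [ ! -s "$2" ]; then
    echo "missing or empty input file" >&2
    return 1
  fi
  n1=$(grep -c '^>' "$1" || true)   # grep -c prints 0 but exits 1 on no match
  n2=$(grep -c '^>' "$2" || true)
  if [ "$n1" -eq "$n2" ] && [ "$n1" -gt 0 ]; then
    echo "ok: $n1 read pairs"
  else
    echo "mismatch: $n1 vs $n2 reads" >&2
    return 1
  fi
}

# Demo with mock mate files of two reads each.
demo=$(mktemp -d)
mkdir -p "$demo/pairfa"
printf '>r1/1\nACGT\n>r2/1\nACGT\n' > "$demo/pairfa/RNAset_1.fa"
printf '>r1/2\nTGCA\n>r2/2\nTGCA\n' > "$demo/pairfa/RNAset_2.fa"
check_pairs "$demo/pairfa/RNAset_1.fa" "$demo/pairfa/RNAset_2.fa"
```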