Name Last modified Size
aaeval/ 10-Dec-2017 17:10 -
evgmethods/ 01-Jan-2018 19:44 -
evgsra2genepipe_help.txt 14-Dec-2017 17:59 5k
evgsra2genepipe_urchin.sum.txt 16-Jun-2018 15:34 13k
geneval/ 01-Jan-2018 19:56 -
genome/ 14-Dec-2017 17:08 -
greenseaurchin_SraRunInfo.csv 06-Dec-2017 20:20 2k
gurchin2sra4d.names 12-Dec-2017 02:04 27.8M
gurchin2sra4d.sra2genes.info 12-Dec-2017 14:05 1k
gurchin2sra4d.trclass.sum.txt 10-Dec-2017 23:15 1k
logfiles/ 01-Jan-2018 19:24 -
okayset/ 13-Dec-2017 13:24 -
pairfa/ 07-Dec-2017 18:22 -
publicset/ 01-Jan-2018 19:50 -
trsets/ 10-Dec-2017 08:29 -
trsets_reduced/ 14-Dec-2017 16:43 -
evgpipe_sra2genes Trial results for greenseaurchin_SraRunInfo
Wed Dec 13 2017, by Don Gilbert
Testing new sra2genes method of
http://arthropods.eugenes.org/EvidentialGene/evigene/scripts/evgpipe_sra2genes.pl
STEPs 1-3: Fetch and pre-process RNA-seq data from SRA
using 4 tissue-stage SRR sets of 18 available for Lytechinus variegatus
STEP 4: Assemble RNA data slices with multiple assemblers and options
STEPs 5,7 repeated: Two-level reduction of transcript over-assembly
a. per SRR sample data slice, reduce several assemblies (method,kmer)
e.g. sRn1l2SRR1661111.tr2aacds
b. combine and reduce results (4 data slices) of a.
to gurchin2sra4d.tr2aacds
STEPs 8-10: Public gene set with names, annotations, vector screen, chr mapping,
from non-redundant okayset/transcripts,cds,proteins.
---------------------------------------------------------
Conserved proteins recovery (BUSCO metazoa ordb9)
compared for 3 gene sets of green sea urchin Lytechinus variegatus.
Echinobase project results (David Mathog), for
MAKER genome genes vs Evigene assemblies,
model proteins from http://www.echinobase.org/Echinobase/LvDownloads
maker = LVA_protein sequence(Genome Assembly version2.2)
evigene = Evigene predicted peptides, by DM,
and Evigene set, built with sra2genes by dgg,
http://arthropods.eugenes.org/EvidentialGene/inverts/sea_urchin/publicset/
gurchin2sra4d.aa_pub.fa.gz
http://arthropods.eugenes.org/EvidentialGene/inverts/sea_urchin/
==> run_bumeLv_evgsra2genes4d/short_summary_bumeLv_evgsra2genes4d.txt <==
C:99.6%[S:18.0%,D:81.6%],F:0.1%,M:0.3%,n:978
974 Complete BUSCOs (C)
1 Fragmented BUSCOs (F)
3 Missing BUSCOs (M) [other two miss these 3 also]
EOG091G0GA7,EOG091G0IMI,EOG091G0OPI
Note: Complete+Duplicated is not informative here, as all
alternate transcripts are included in the BUSCO test.
http://www.echinobase.org/Echinobase/Lv_other_assemblies
==> run_bumeLv_evigene_pep/short_summary_bumeLv_evigene_pep.txt <==
C:97.3%[S:92.3%,D:5.0%],F:0.3%,M:2.4%,n:978
952 Complete BUSCOs (C)
3 Fragmented BUSCOs (F)
23 Missing BUSCOs (M)
http://www.echinobase.org/Echinobase/Lv_LVA_Genes
==> run_bumeLvar22_maker_pep/short_summary_bumeLvar22_maker_pep.txt <==
C:45.7%[S:26.3%,D:19.4%],F:19.0%,M:35.3%,n:978
447 Complete BUSCOs (C)
186 Fragmented BUSCOs (F)
345 Missing BUSCOs (M)
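The three one-line BUSCO scores above can be compared by script. A minimal Python sketch, assuming the score notation shown here (C, S, D, F, M percentages and n); parse_busco_line and approx_count are names made up for this example and are not part of BUSCO or Evigene:

```python
import re

# Pattern for the one-line BUSCO score notation quoted in this summary.
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)

def parse_busco_line(line):
    """Hypothetical helper: percentages as floats, n as int."""
    m = BUSCO_RE.search(line)
    if m is None:
        raise ValueError("not a BUSCO score line: %r" % line)
    scores = {k: float(v) for k, v in m.groupdict().items() if k != "n"}
    scores["n"] = int(m.group("n"))
    return scores

def approx_count(scores, key):
    """Convert a percentage back to a count of BUSCO groups (nearest integer)."""
    return round(scores[key] * scores["n"] / 100.0)

# Score lines copied from the three short_summary files above.
sets = {
    "evgsra2genes4d": "C:99.6%[S:18.0%,D:81.6%],F:0.1%,M:0.3%,n:978",
    "evigene_pep": "C:97.3%[S:92.3%,D:5.0%],F:0.3%,M:2.4%,n:978",
    "maker_pep": "C:45.7%[S:26.3%,D:19.4%],F:19.0%,M:35.3%,n:978",
}

for name, line in sets.items():
    s = parse_busco_line(line)
    print(name, "C=%d F=%d M=%d" % tuple(approx_count(s, k) for k in "CFM"))
```

The counts recovered this way agree with the Complete/Fragmented/Missing lines listed for each set, which is a quick check that the percentages and counts were copied consistently.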
---------------------------------------------------------
gurchin2sra4d.sra2genes.log
#s2g: EvidentialGene evgpipe_sra2genes.pl (-help for info), VERSION 2017.12.07
#s2g: CMD: evgpipe_sra2genes.pl -NCPU 8 -MAXMEM 32000 -log -debug -runname gurchin2sra4d -SRAtable greenseaurchin_SraRunInfo.csv
#s2g: BEGIN with input= greenseaurchin_SraRunInfo.csv date= Tue Dec 12 11:05:22 PST 2017
#s2g: STEP1_sraget
#s2g: sra_info: Run=SRR1661090;SRR1661111;SRR1661397;SRR1661409; ScientificName=Lytechinus variegatus; Platform=ILLUMINA;
#s2g: CenterName=BOSTON UNIVERSITY; size_MB=11057;12201;8157;14099; spots=88877018;97227011;69147257;111354178;
#s2g: BioProject=PRJNA241187;
#s2g: STEP2_sra2fasta
#s2g: sra2fasta ids: SRR1661090 SRR1661111 SRR1661397 SRR1661409
#s2g: done STEP2_sra2fasta 2 spotfa=4,dnorm done=0
#s2g: STEP3_selectrna
#s2g: sample_reads type=pairs, nreads=49505385/88877018 (55% for 10000 maxMB) of ids=SRR1661090 to rnasets/sRn1l2SRR1661090, pok=0
#s2g: sample_reads type=pairs, nreads=49507108/97227011 (50% for 10000 maxMB) of ids=SRR1661111 to rnasets/sRn1l2SRR1661111, pok=0
#s2g: sample_reads type=pairs, nreads=49507594/69147257 (71% for 10000 maxMB) of ids=SRR1661397 to rnasets/sRn1l2SRR1661397, pok=0
#s2g: sample_reads type=pairs, nreads=49506147/111354178 (44% for 10000 maxMB) of ids=SRR1661409 to rnasets/sRn1l3SRR1661409, pok=0
#s2g: done STEP3_selectrna 1 rnasets=4
#s2g: STEP4_runassemblers
#s2g: STEP4_runassemblers have scripts: runvelo.sh;runidba.sh;runsoap.sh;runtrin.sh
#s2g: STEP5_collectassemblies
#s2g: done STEP5_collectassemblies 1 trsets ready=13 (done=13), err=0, waitfor=0
#s2g: STEP7_reduceassemblies have scripts: run_tr2aacds.sh
#s2g: STEP8_refblastgenes have scripts: run_evgblastp.sh
#s2g: STEP9_annotgenes
#s2g: STEP9a_trimvec
#s2g: STEP10_publicgenes have scripts: run_evgpubset.sh
#s2g: settings saved to gurchin2sra4d.sra2genes.info
#s2g: ======================================
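The STEP3_selectrna sample_reads lines in the log above give reads kept / reads available per SRR set. A minimal Python sketch for pulling those numbers out of the log, assuming the line layout shown in this one log (other versions of evgpipe_sra2genes.pl may differ); parse_sample_reads is a hypothetical helper name:

```python
import re

# Layout assumed from the sample_reads lines of this log only.
SAMPLE_RE = re.compile(
    r"nreads=(\d+)/(\d+) \((\d+)% for (\d+) maxMB\) of ids=(\S+)"
)

def parse_sample_reads(line):
    """Hypothetical helper: return a dict for a sample_reads line, else None."""
    m = SAMPLE_RE.search(line)
    if m is None:
        return None
    kept, total, logged_pct, max_mb = (int(x) for x in m.group(1, 2, 3, 4))
    return {
        "id": m.group(5),
        "kept": kept,
        "total": total,
        "logged_pct": logged_pct,
        "max_mb": max_mb,
        "pct": 100.0 * kept / total,  # recomputed, unrounded
    }

# Lines copied from the log above.
log_lines = [
    "#s2g: sample_reads type=pairs, nreads=49505385/88877018 (55% for 10000 maxMB) of ids=SRR1661090 to rnasets/sRn1l2SRR1661090, pok=0",
    "#s2g: sample_reads type=pairs, nreads=49507108/97227011 (50% for 10000 maxMB) of ids=SRR1661111 to rnasets/sRn1l2SRR1661111, pok=0",
    "#s2g: sample_reads type=pairs, nreads=49507594/69147257 (71% for 10000 maxMB) of ids=SRR1661397 to rnasets/sRn1l2SRR1661397, pok=0",
    "#s2g: sample_reads type=pairs, nreads=49506147/111354178 (44% for 10000 maxMB) of ids=SRR1661409 to rnasets/sRn1l3SRR1661409, pok=0",
]

samples = [s for s in map(parse_sample_reads, log_lines) if s]
for s in samples:
    print("%s kept %d of %d reads (%.1f%%)" % (s["id"], s["kept"], s["total"], s["pct"]))
```

Each slice keeps about 49.5 million reads whatever the size of the SRR set, so the sampled fraction falls as the set grows. The logged percentages match the recomputed ones when truncated to an integer.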
Project folder with data sets and scripts for seaurchin
aaeval refset sRn1l3SRR1661409.trclass
buscoscan.sh rnasets spotfa
dropset run_evgblastp.sh srun_comet.sh
geneval run_evgpubset1a.sh srun_share8m48h40.sh
genome run_evgpubset2b.sh srun_shared.sh
greenseaurchin_SraRunInfo.csv run_evgsra2genes.sh submitset
greenseaurchin_SraRunInfo.csv_all run_evgtrimvec1a.sh tmpfiles
greenseaurchin_taxon.info run_evgtrimvec2b.sh tr2cds.13266894.comet-06-18.out
gurchin2sra4d.egtrimvec.log run_tr2aacds.sh tr2cds.13275618.comet-03-33.out
gurchin2sra4d.mrna2tsa.info rundiginorm_gurchin2sra4d.sh tridba1a_sNn4l2SRR1661090
gurchin2sra4d.mrna2tsa.log runidba.sh tridba1a_sRn1l2SRR1661111
gurchin2sra4d.names runidba1b.sh tridba1a_sRn1l3SRR1661409
gurchin2sra4d.sra2genes.info runsoap.sh tridba1b_sRn1l2SRR1661090
gurchin2sra4d.sra2genes.log runtr2genome3.sh tridba1b_sRn1l2SRR1661397
gurchin2sra4d.tr2aacds.log runtrin.sh trimset
gurchin2sra4d.trclass runvelo.sh trsets
gurchin2sra4d.trclass.sum.txt runvelo1b.sh trsoap1a_sNn4l2SRR1661090
gurchin2sra4d_evgmethods.tar.gz runvelofin.sh trsoap1a_sRn1l2SRR1661111
gurchin2sra4d_okall.aa sNn4l2SRR1661090.tr2aacds.info trsoap1a_sRn1l3SRR1661409
gurchin2sra4d_okall.aa.qual sNn4l2SRR1661090.tr2aacds.log trvelo1a_sNn4l2SRR1661090
gurchin2sra4d_okall.names sNn4l2SRR1661090.trclass trvelo1a_sRn1l2SRR1661111
gurchin2sra4d_okall.names.Fixmename sRn1l2SRR1661090.tr2aacds.info trvelo1a_sRn1l3SRR1661409
gurchin2sra4d_pubset.tar.gz sRn1l2SRR1661090.tr2aacds.log trvelo1b_sRn1l2SRR1661090
gurchin2sraevg.sra2genes.info sRn1l2SRR1661090.trclass trvelo1b_sRn1l2SRR1661397
gurchin2sraevg.sra2genes.log sRn1l2SRR1661111.tr2aacds.info try1
inputset sRn1l2SRR1661111.tr2aacds.log try2
logs sRn1l2SRR1661111.trclass try3
okayset sRn1l2SRR1661397.tr2aacds.info urchin1.hist
pairfa sRn1l2SRR1661397.tr2aacds.log urchin2.hist
publicset sRn1l2SRR1661397.trclass urchin3.hist
refgenes3urchin-gurchin2sra4d.aa.btall sRn1l3SRR1661409.tr2aacds.info workf
refgenes3urchin-gurchin2sra4d.blastp.gz sRn1l3SRR1661409.tr2aacds.log
---------------------------------------------------------
Disk usage of data sets : [seaurchin]$ du -sh */
384M aaeval/ # protein annots, eval
48M refset/ # refgenes, ..
320M genome/ # spp chromosome assembly
938M geneval/ # publicset/genes.mrna gmapped to chr assembly
.. evg tr2aacds reduced trsets ..
14G inputset/ # input transcripts from trsets/, with cds,aa
2.6G okayset/ # non redundant output set
13G dropset/ # redundant and fragment output set
.. evgmrna2tsa of gurchin2sra4d ..
1.1G publicset/ # public release annotated gene set
490M submitset/ # NCBI TSA submit set, .fsa, .tbl annots
21M trimset/ # vecscreen results (trimmed of vectors/adaptors or dropped)
26G tmpfiles/ # intermediate data files
.. SRA RNA data sets ..
75G spotfa/ # from SRA, gzipped
137G pairfa/ # spots split to _1,left/_2,right read pairs
47G rnasets/ # data slices from pairfa, for assemblers
.. assembler runs ..
7.1G trsets/ # trformat of assembled transcripts, per asm run
23G tridba1a_sNn4l2SRR1661090/ # idba_trans
13G tridba1a_sRn1l2SRR1661111/
12G tridba1a_sRn1l3SRR1661409/
16G tridba1b_sRn1l2SRR1661090/
16G tridba1b_sRn1l2SRR1661397/
673M trsoap1a_sNn4l2SRR1661090/ # SOAP_tran
806M trsoap1a_sRn1l2SRR1661111/
784M trsoap1a_sRn1l3SRR1661409/
12G trvelo1a_sNn4l2SRR1661090/ # Velvet/Oases
244G trvelo1a_sRn1l2SRR1661111/ # clear out big Graph files
9.4G trvelo1a_sRn1l3SRR1661409/
8.9G trvelo1b_sRn1l2SRR1661090/
9.0G trvelo1b_sRn1l2SRR1661397/
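Disk use per assembler can be totalled from the du listing above. A minimal Python sketch, assuming du -sh style sizes (K/M/G/T suffixes in powers of 1024) and the trvelo/tridba/trsoap folder prefixes used in this project; the function names are made up for the example:

```python
import re
from collections import defaultdict

# du -h reports binary units; convert everything to gigabytes.
TO_GB = {"K": 1.0 / 1024**2, "M": 1.0 / 1024, "G": 1.0, "T": 1024.0}

def size_to_gb(text):
    """Convert a du -h size such as '673M' or '9.4G' to gigabytes."""
    return float(text[:-1]) * TO_GB[text[-1].upper()]

def assembler_totals(du_lines):
    """Sum sizes of assembler run folders, grouped by folder prefix."""
    totals = defaultdict(float)
    for line in du_lines:
        fields = line.split("#")[0].split()  # drop trailing comments
        if len(fields) != 2:
            continue  # skip section notes such as '.. assembler runs ..'
        size, path = fields
        m = re.match(r"tr(velo|idba|soap)\d", path)
        if m:
            totals[m.group(1)] += size_to_gb(size)
    return dict(totals)

# Lines copied from the listing above.
du_lines = [
    ".. assembler runs ..",
    "7.1G trsets/ # trformat of assembled transcripts, per asm run",
    "23G tridba1a_sNn4l2SRR1661090/ # idba_trans",
    "13G tridba1a_sRn1l2SRR1661111/",
    "12G tridba1a_sRn1l3SRR1661409/",
    "16G tridba1b_sRn1l2SRR1661090/",
    "16G tridba1b_sRn1l2SRR1661397/",
    "673M trsoap1a_sNn4l2SRR1661090/ # SOAP_tran",
    "806M trsoap1a_sRn1l2SRR1661111/",
    "784M trsoap1a_sRn1l3SRR1661409/",
    "12G trvelo1a_sNn4l2SRR1661090/ # Velvet/Oases",
    "244G trvelo1a_sRn1l2SRR1661111/ # clear out big Graph files",
    "9.4G trvelo1a_sRn1l3SRR1661409/",
    "8.9G trvelo1b_sRn1l2SRR1661090/",
    "9.0G trvelo1b_sRn1l2SRR1661397/",
]

totals = assembler_totals(du_lines)
for name in sorted(totals, key=totals.get, reverse=True):
    print("%-5s %7.1f G" % (name, totals[name]))
```

Most of the Velvet/Oases total comes from the single 244G folder flagged for clearing out Graph files.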
---------------------------------------------------------
Public gene set stats
741313 publicset transcripts & proteins
235169 primary transcript (t1) of gene "loci"
60000 are >= 100 aa, 50000 are < 50 aa, 125000 are sized between 50 and 100 aa
24576 of primaries have protein homology to seaurchinp (22279), starfishc (1730), or human
506144 alternate transcripts of gene loci
134082 alts have protein homology to three ref species
longest 10,030 aa (longest animal aa =~ 36,000; 15,000-20,000 is common in arthropods)
median 70 aa (lots of short things), min 19 aa (despite 30aa MIN cut off)
Busco score to metazoa is 'C:99.6%[S:18.0%,D:81.6%],F:0.1%,M:0.3%,n:978', missing 3/978.
Of transcripts gmapped to chromosome assembly (Lytechinus variegatus Lvar 01 from NCBI)
Of 235169 primary transcripts:
70305 are uniquely mapped
140044 overlap another locus (i.e. ~70022 = 140044/2 are extras at the same locus)
4475 are not mapped
66885 have introns (2+exons),
three have >=100 exons (Lytvar1tEVm000007t1/Fibrillin, Lytvar1tEVm000006t1/Muscle assembly, Lytvar1tEVm000022t1/Ryanodine receptor)
165753 have single mapped exon
of 66885 with introns, 23235 map with >= 98% identity, and 7515 map with <90% identity
of 165753 single exon, 65251 map with >= 98% identity, and 16373 map with <90% identity
of 66885 with introns, 10467 are split-mapped over 2+ scaffolds, 199 are split-mapped over same scaffold,
and 2383 are split-mapped at same location
---------------------------------------------------------
Sources of longest 1000 genes (t1 only)
assembler (limited 3 soap runs)
559 velv
366 idba
75 soap
data slices
288 sNn4l2d1090 # diginorm of other 4
132 sRn1l2d1090
198 sRn1l2d1111
256 sRn1l2d1397
126 sRn1l3d1409
kmers (rounded to d5)
53 k25
77 k35
62 k45
452 k55
139 k65
129 k75
71 k85
17 k95
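The three tallies above each partition the same 1000 longest primary genes. A minimal Python sketch that checks the sums and prints shares, using only the counts listed here; the dictionary layout and the shares function are illustrative choices:

```python
# Counts copied from the tables above.
tallies = {
    "assembler": {"velv": 559, "idba": 366, "soap": 75},
    "data slice": {
        "sNn4l2d1090": 288, "sRn1l2d1090": 132, "sRn1l2d1111": 198,
        "sRn1l2d1397": 256, "sRn1l3d1409": 126,
    },
    "kmer": {
        "k25": 53, "k35": 77, "k45": 62, "k55": 452,
        "k65": 139, "k75": 129, "k85": 71, "k95": 17,
    },
}

def shares(counts):
    """Percent of the total contributed by each category."""
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in counts.items()}

for label, counts in tallies.items():
    total = sum(counts.values())
    top = max(counts, key=counts.get)
    print("%-10s total=%d top=%s (%.1f%%)" % (label, total, top, shares(counts)[top]))
```

No single assembler, data slice, or kmer setting supplies all of the longest genes, which is the case for combining several assemblies before reduction.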
---------------------------------------------------------
Busco scores at aaeval/buscof/busumm_versions.txt
See also refgenes3urchin-gurchin2sra4d.blastp and gurchin2sra4d.names for full orthology
Evigene publicset of 4 read sets, gurchin2sra4d
==> gurchin2sra4d_pub/short_summary_bugurchin2sra4d_pub.txt <==
C:99.6%[S:18.0%,D:81.6%],F:0.1%,M:0.3%,n:978
974 Complete BUSCOs (C)
176 Complete and single-copy BUSCOs (S)
798 Complete and duplicated BUSCOs (D)
1 Fragmented BUSCOs (F)
3 Missing BUSCOs (M) # likely findable among unused tissue expression SRR sets
978 Total BUSCO groups searched
Single read set SRR1661111 data slice
okayset/okay.aa = primary (longest) protein, n=145,247
==> sRn1l2SRR1661111.okay/short_summary_busRn1l2SRR1661111.okay.txt <==
C:95.3%[S:94.8%,D:0.5%],F:1.3%,M:3.4%,n:978
932 Complete BUSCOs (C)
927 Complete and single-copy BUSCOs (S)
5 Complete and duplicated BUSCOs (D)
13 Fragmented BUSCOs (F)
33 Missing BUSCOs (M)
978 Total BUSCO groups searched
okayset/okay+okalt.aa = all transcript proteins per gene locus, non-redundant n=321,255
==> sRn1l2SRR1661111_okall/short_summary_busRn1l2SRR1661111_okall.txt <==
C:97.6%[S:42.7%,D:54.9%],F:0.7%,M:1.7%,n:978
955 Complete BUSCOs (C)
418 Complete and single-copy BUSCOs (S)
537 Complete and duplicated BUSCOs (D)
7 Fragmented BUSCOs (F)
16 Missing BUSCOs (M)
978 Total BUSCO groups searched
inputset/Rn1l2SRR1661111.aa = all assembled transcripts, redundant n=3,832,824
==> sRn1l2SRR1661111_input/short_summary_busRn1l2SRR1661111_input.txt <==
C:98.1%[S:0.7%,D:97.4%],F:1.0%,M:0.9%,n:978
960 Complete BUSCOs (C)
7 Complete and single-copy BUSCOs (S)
953 Complete and duplicated BUSCOs (D)
10 Fragmented BUSCOs (F)
8 Missing BUSCOs (M)
978 Total BUSCO groups searched
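The count lines of each short_summary block can be checked for internal consistency (S + D = C, and C + F + M = total). A minimal Python sketch, assuming the count-line wording shown in the blocks above; parse_counts and is_consistent are hypothetical helper names:

```python
import re

# Matches lines such as '932 Complete BUSCOs (C)'; trailing comments are ignored.
COUNT_RE = re.compile(r"^\s*(\d+)\s+.*?\((C|S|D|F|M)\)")

def parse_counts(text):
    """Hypothetical helper: map C, S, D, F, M and n to integer counts."""
    counts = {}
    for line in text.splitlines():
        m = COUNT_RE.match(line)
        if m:
            counts[m.group(2)] = int(m.group(1))
        elif "Total BUSCO groups searched" in line:
            counts["n"] = int(line.split()[0])
    return counts

def is_consistent(counts):
    """True if single + duplicated = complete and all classes sum to n."""
    return (counts["S"] + counts["D"] == counts["C"]
            and counts["C"] + counts["F"] + counts["M"] == counts["n"])

# Block copied from the sRn1l2SRR1661111.okay summary above.
okay_summary = """\
932 Complete BUSCOs (C)
927 Complete and single-copy BUSCOs (S)
5 Complete and duplicated BUSCOs (D)
13 Fragmented BUSCOs (F)
33 Missing BUSCOs (M)
978 Total BUSCO groups searched
"""

counts = parse_counts(okay_summary)
print(counts, "consistent" if is_consistent(counts) else "INCONSISTENT")
```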
---------------------------------------------------------