EvidentialGene : evgpipe_sra2genes trial results for
Lytechinus variegatus green sea urchin

Lytechinus variegatus (green sea urchin)
Lineage (full): cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Echinodermata; Eleutherozoa; Echinozoa; Echinoidea; Euechinoidea; Echinacea; Temnopleuroida; Toxopneustidae; Lytechinus

Related species reference genes
Strongylocentrotus_purpuratus = purple sea urchin, NCBI EGAP 2016, 27728 pc-genes, 35773 isoforms
Acanthaster_planci = crown-of-thorns starfish, NCBI EGAP 2017, 16468 pc-genes, 33201 isoforms

  1. Lytechinus variegatus Gene/Genome map

      Name                              Last modified       Size

[DIR] Parent Directory                  14-Dec-2017 17:23   -
[DIR] aaeval/                           10-Dec-2017 17:10   -
[DIR] evgmethods/                       01-Jan-2018 19:44   -
[TXT] evgsra2genepipe_help.txt          14-Dec-2017 17:59   5k
[TXT] evgsra2genepipe_urchin.sum.txt    16-Jun-2018 15:34   13k
[DIR] geneval/                          01-Jan-2018 19:56   -
[DIR] genome/                           14-Dec-2017 17:08   -
[TXT] greenseaurchin_SraRunInfo.csv     06-Dec-2017 20:20   2k
[TXT] gurchin2sra4d.names               12-Dec-2017 02:04   27.8M
[TXT] gurchin2sra4d.sra2genes.info      12-Dec-2017 14:05   1k
[TXT] gurchin2sra4d.trclass.sum.txt     10-Dec-2017 23:15   1k
[DIR] logfiles/                         01-Jan-2018 19:24   -
[DIR] okayset/                          13-Dec-2017 13:24   -
[DIR] pairfa/                           07-Dec-2017 18:22   -
[DIR] publicset/                        01-Jan-2018 19:50   -
[DIR] trsets/                           10-Dec-2017 08:29   -
[DIR] trsets_reduced/                   14-Dec-2017 16:43   -


evgpipe_sra2genes Trial results for greenseaurchin_SraRunInfo 
Wed Dec 13 2017, by Don Gilbert

Testing new sra2genes method of
  http://arthropods.eugenes.org/EvidentialGene/evigene/scripts/evgpipe_sra2genes.pl

STEPs 1-3: Fetch and pre-process RNA-seq data from SRA
      using 4 tissue-stage SRR sets of 18 available for Lytechinus variegatus
      
STEP 4: Assemble RNA data slices with multiple assemblers and options

STEPs 5,7 repeated: Two-level reduction of transcript over-assembly 
  a. per SRR sample data slice, reduce several assemblies (method,kmer)
      e.g. sRn1l2SRR1661111.tr2aacds
  b. combine and reduce results (4 data slices) of a.
      to gurchin2sra4d.tr2aacds
    
STEPs 8-10: Public gene set with names, annotations, vector screen, chr mapping,
  from non-redundant okayset/transcripts,cds,proteins.
---------------------------------------------------------

Conserved proteins recovery (BUSCO metazoa odb9)
compared for 3 gene sets of green sea urchin Lytechinus variegatus.

Two sets are Echinobase project results (David Mathog), MAKER genome genes
and Evigene assemblies, with model proteins from
http://www.echinobase.org/Echinobase/LvDownloads
  maker = LVA_protein sequence(Genome Assembly version2.2)
  evigene = Evigene predicted peptides, by DM
The third is the Evigene set built with sra2genes by dgg,
 http://arthropods.eugenes.org/EvidentialGene/inverts/sea_urchin/publicset/
  gurchin2sra4d.aa_pub.fa.gz  

http://arthropods.eugenes.org/EvidentialGene/inverts/sea_urchin/
==> run_bumeLv_evgsra2genes4d/short_summary_bumeLv_evgsra2genes4d.txt <==
	C:99.6%[S:18.0%,D:81.6%],F:0.1%,M:0.3%,n:978
	974	Complete BUSCOs (C)
	1	Fragmented BUSCOs (F)
	3	Missing BUSCOs (M) [other two miss these 3 also]
		EOG091G0GA7,EOG091G0IMI,EOG091G0OPI
  Note: Complete+Duplicated is not informative here, as all
  alternate transcripts are included in the BUSCO test.

http://www.echinobase.org/Echinobase/Lv_other_assemblies
==> run_bumeLv_evigene_pep/short_summary_bumeLv_evigene_pep.txt <==
	C:97.3%[S:92.3%,D:5.0%],F:0.3%,M:2.4%,n:978
	952	Complete BUSCOs (C)
	3	Fragmented BUSCOs (F)
	23	Missing BUSCOs (M)

http://www.echinobase.org/Echinobase/Lv_LVA_Genes
==> run_bumeLvar22_maker_pep/short_summary_bumeLvar22_maker_pep.txt <==
	C:45.7%[S:26.3%,D:19.4%],F:19.0%,M:35.3%,n:978
	447	Complete BUSCOs (C)
	186	Fragmented BUSCOs (F)
	345	Missing BUSCOs (M)
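
The three one-line BUSCO summaries above can be compared side by side by parsing the C/S/D/F/M notation. The following is a minimal sketch, not part of the Evigene pipeline; the function names parse_busco and rank_by_complete and the short set labels are illustrative choices, and the summary strings are copied from the reports above.

```python
import re

# One-line summaries copied from the three BUSCO reports above.
SUMMARIES = {
    "evgsra2genes4d": "C:99.6%[S:18.0%,D:81.6%],F:0.1%,M:0.3%,n:978",
    "evigene_pep": "C:97.3%[S:92.3%,D:5.0%],F:0.3%,M:2.4%,n:978",
    "maker_pep": "C:45.7%[S:26.3%,D:19.4%],F:19.0%,M:35.3%,n:978",
}

# Matches the BUSCO short-summary notation: C:..%[S:..%,D:..%],F:..%,M:..%,n:..
PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(line):
    """Return a dict of percent scores (C, S, D, F, M) and group count n."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    scores = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    scores["n"] = int(match.group("n"))
    return scores


def rank_by_complete(summaries):
    """Sort gene sets from highest to lowest Complete (C) percentage."""
    parsed = {name: parse_busco(line) for name, line in summaries.items()}
    return sorted(parsed.items(), key=lambda item: item[1]["C"], reverse=True)


for name, s in rank_by_complete(SUMMARIES):
    print(f"{name:16s} C={s['C']:5.1f}%  F={s['F']:5.1f}%  M={s['M']:5.1f}%  n={s['n']}")
```

Ranking on C alone ignores the S/D split, which fits the note above that Complete+Duplicated is not informative when alternate transcripts are included.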

---------------------------------------------------------

gurchin2sra4d.sra2genes.log
#s2g: EvidentialGene evgpipe_sra2genes.pl (-help for info), VERSION 2017.12.07
#s2g: CMD: evgpipe_sra2genes.pl -NCPU 8 -MAXMEM 32000 -log -debug -runname gurchin2sra4d -SRAtable greenseaurchin_SraRunInfo.csv

#s2g: BEGIN with input= greenseaurchin_SraRunInfo.csv date= Tue Dec 12 11:05:22 PST 2017
#s2g: STEP1_sraget
#s2g: sra_info: Run=SRR1661090;SRR1661111;SRR1661397;SRR1661409; ScientificName=Lytechinus variegatus; Platform=ILLUMINA; 
#s2g:   CenterName=BOSTON UNIVERSITY; size_MB=11057;12201;8157;14099; spots=88877018;97227011;69147257;111354178; 
#s2g:   BioProject=PRJNA241187;

#s2g: STEP2_sra2fasta
#s2g: sra2fasta ids: SRR1661090 SRR1661111 SRR1661397 SRR1661409
#s2g: done STEP2_sra2fasta 2 spotfa=4,dnorm done=0

#s2g: STEP3_selectrna
#s2g: sample_reads type=pairs, nreads=49505385/88877018 (55% for 10000 maxMB) of ids=SRR1661090 to rnasets/sRn1l2SRR1661090, pok=0
#s2g: sample_reads type=pairs, nreads=49507108/97227011 (50% for 10000 maxMB) of ids=SRR1661111 to rnasets/sRn1l2SRR1661111, pok=0
#s2g: sample_reads type=pairs, nreads=49507594/69147257 (71% for 10000 maxMB) of ids=SRR1661397 to rnasets/sRn1l2SRR1661397, pok=0
#s2g: sample_reads type=pairs, nreads=49506147/111354178 (44% for 10000 maxMB) of ids=SRR1661409 to rnasets/sRn1l3SRR1661409, pok=0
#s2g: done STEP3_selectrna 1 rnasets=4

#s2g: STEP4_runassemblers
#s2g: STEP4_runassemblers have scripts: runvelo.sh;runidba.sh;runsoap.sh;runtrin.sh

#s2g: STEP5_collectassemblies
#s2g: done STEP5_collectassemblies 1 trsets ready=13 (done=13), err=0, waitfor=0

#s2g: STEP7_reduceassemblies have scripts: run_tr2aacds.sh

#s2g: STEP8_refblastgenes have scripts: run_evgblastp.sh

#s2g: STEP9_annotgenes
#s2g: STEP9a_trimvec

#s2g: STEP10_publicgenes have scripts: run_evgpubset.sh
#s2g: settings saved to gurchin2sra4d.sra2genes.info
#s2g: ======================================
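
The sampling percentages in the STEP3_selectrna log lines above can be re-derived from the read counts. This is a small sketch under one assumption: that the log truncates the percentage to a whole number rather than rounding it, which is inferred from the figures and not confirmed by the pipeline source. The helper name percent_sampled is illustrative.

```python
# (run id, sampled read pairs, total spots), copied from the STEP3_selectrna log lines.
SAMPLES = [
    ("SRR1661090", 49505385, 88877018),
    ("SRR1661111", 49507108, 97227011),
    ("SRR1661397", 49507594, 69147257),
    ("SRR1661409", 49506147, 111354178),
]


def percent_sampled(nreads, spots):
    """Whole-number percent of spots kept, truncated (assumed log behaviour)."""
    return nreads * 100 // spots


for run, nreads, spots in SAMPLES:
    print(f"{run}: {nreads}/{spots} = {percent_sampled(nreads, spots)}%")
```

Each run is cut to roughly the same number of read pairs (about 49.5 million), so the percentage kept varies only with the size of the original SRR set.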

Project folder with data sets and scripts for seaurchin 
aaeval                                   refset                          sRn1l3SRR1661409.trclass
buscoscan.sh                             rnasets                         spotfa
dropset                                  run_evgblastp.sh                srun_comet.sh
geneval                                  run_evgpubset1a.sh              srun_share8m48h40.sh
genome                                   run_evgpubset2b.sh              srun_shared.sh
greenseaurchin_SraRunInfo.csv            run_evgsra2genes.sh             submitset
greenseaurchin_SraRunInfo.csv_all        run_evgtrimvec1a.sh             tmpfiles
greenseaurchin_taxon.info                run_evgtrimvec2b.sh             tr2cds.13266894.comet-06-18.out
gurchin2sra4d.egtrimvec.log              run_tr2aacds.sh                 tr2cds.13275618.comet-03-33.out
gurchin2sra4d.mrna2tsa.info              rundiginorm_gurchin2sra4d.sh    tridba1a_sNn4l2SRR1661090
gurchin2sra4d.mrna2tsa.log               runidba.sh                      tridba1a_sRn1l2SRR1661111
gurchin2sra4d.names                      runidba1b.sh                    tridba1a_sRn1l3SRR1661409
gurchin2sra4d.sra2genes.info             runsoap.sh                      tridba1b_sRn1l2SRR1661090
gurchin2sra4d.sra2genes.log              runtr2genome3.sh                tridba1b_sRn1l2SRR1661397
gurchin2sra4d.tr2aacds.log               runtrin.sh                      trimset
gurchin2sra4d.trclass                    runvelo.sh                      trsets
gurchin2sra4d.trclass.sum.txt            runvelo1b.sh                    trsoap1a_sNn4l2SRR1661090
gurchin2sra4d_evgmethods.tar.gz          runvelofin.sh                   trsoap1a_sRn1l2SRR1661111
gurchin2sra4d_okall.aa                   sNn4l2SRR1661090.tr2aacds.info  trsoap1a_sRn1l3SRR1661409
gurchin2sra4d_okall.aa.qual              sNn4l2SRR1661090.tr2aacds.log   trvelo1a_sNn4l2SRR1661090
gurchin2sra4d_okall.names                sNn4l2SRR1661090.trclass        trvelo1a_sRn1l2SRR1661111
gurchin2sra4d_okall.names.Fixmename      sRn1l2SRR1661090.tr2aacds.info  trvelo1a_sRn1l3SRR1661409
gurchin2sra4d_pubset.tar.gz              sRn1l2SRR1661090.tr2aacds.log   trvelo1b_sRn1l2SRR1661090
gurchin2sraevg.sra2genes.info            sRn1l2SRR1661090.trclass        trvelo1b_sRn1l2SRR1661397
gurchin2sraevg.sra2genes.log             sRn1l2SRR1661111.tr2aacds.info  try1
inputset                                 sRn1l2SRR1661111.tr2aacds.log   try2
logs                                     sRn1l2SRR1661111.trclass        try3
okayset                                  sRn1l2SRR1661397.tr2aacds.info  urchin1.hist
pairfa                                   sRn1l2SRR1661397.tr2aacds.log   urchin2.hist
publicset                                sRn1l2SRR1661397.trclass        urchin3.hist
refgenes3urchin-gurchin2sra4d.aa.btall   sRn1l3SRR1661409.tr2aacds.info  workf
refgenes3urchin-gurchin2sra4d.blastp.gz  sRn1l3SRR1661409.tr2aacds.log
---------------------------------------------------------

Disk usage of data sets : [seaurchin]$ du -sh */
384M    aaeval/     # protein annots, eval
48M     refset/     # refgenes, ..
320M    genome/     # spp chromosome assembly
938M    geneval/    # publicset/genes.mrna gmapped to chr assembly 

  .. evg tr2aacds reduced trsets ..
14G     inputset/   # input transcripts from trsets/, with cds,aa
2.6G    okayset/    # non redundant output set
13G     dropset/    # redundant and fragment output set

  .. evgmrna2tsa of gurchin2sra4d ..
1.1G    publicset/  # public release annotated gene set
490M    submitset/  # NCBI TSA submit set, .fsa, .tbl annots
21M     trimset/    # vecscreen results (trimmed of vectors/adaptors or dropped)
26G     tmpfiles/   # intermediate data files

.. SRA RNA data sets ..
75G     spotfa/     # from SRA, gzipped
137G    pairfa/     # spots split to _1,left/_2,right read pairs
47G     rnasets/    # data slices from pairfa, for assemblers, 

.. assembler runs ..
7.1G    trsets/     # trformat of assembled transcripts, per asm run

23G     tridba1a_sNn4l2SRR1661090/  # idba_trans
13G     tridba1a_sRn1l2SRR1661111/
12G     tridba1a_sRn1l3SRR1661409/
16G     tridba1b_sRn1l2SRR1661090/
16G     tridba1b_sRn1l2SRR1661397/
673M    trsoap1a_sNn4l2SRR1661090/  # SOAP_tran
806M    trsoap1a_sRn1l2SRR1661111/
784M    trsoap1a_sRn1l3SRR1661409/
12G     trvelo1a_sNn4l2SRR1661090/  # Velvet/Oases
244G    trvelo1a_sRn1l2SRR1661111/  # clear out big Graph files
9.4G    trvelo1a_sRn1l3SRR1661409/
8.9G    trvelo1b_sRn1l2SRR1661090/
9.0G    trvelo1b_sRn1l2SRR1661397/
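
To total a group of the du -sh figures above, the human-readable sizes need converting back to bytes. The sketch below assumes du -h units are powers of 1024 (the GNU du default) and skips any line whose first field is not a size; parse_size and total_bytes are illustrative names, and the sample lines are the three SRA data set entries listed above.

```python
# Multipliers for du -h suffixes, assuming powers of 1024 (GNU du default).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a du -h size such as '2.6G' or '384M' to bytes."""
    text = text.strip()
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def total_bytes(du_lines):
    """Sum the leading size field of each line; ignore headings and comments."""
    total = 0.0
    for line in du_lines:
        fields = line.split()
        if not fields:
            continue
        try:
            total += parse_size(fields[0])
        except ValueError:
            continue  # first field is not a size, e.g. '..' heading lines
    return total


# The three SRA RNA data set entries from the listing above.
DU_LINES = """\
.. SRA RNA data sets ..
75G     spotfa/     # from SRA, gzipped
137G    pairfa/     # spots split to _1,left/_2,right read pairs
47G     rnasets/    # data slices from pairfa, for assemblers
""".splitlines()

print(f"SRA data sets total: {total_bytes(DU_LINES) / 1024**3:.0f}G")
```

Because du -sh rounds each entry, a total built this way is approximate.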
---------------------------------------------------------

Public gene set stats
  741313 publicset transcripts & proteins    
  235169 primary transcript (t1) of gene "loci" 
   ~60000 are >= 100 aa, ~125000 are between 50 and 100 aa, ~50000 are < 50 aa
   24576 of primaries have protein homology to seaurchinp (22279), starfishc (1730), or human
    
  506144 alternate transcripts of gene loci
  134082 alts have protein homology to three ref species
  
  longest 10,030 aa (longest animal aa =~ 36,000; 15,000-20,000 is common in arthropods)
  median  70 aa (lots of short things), min 19 aa (despite 30aa MIN cut off)
  Busco score to metazoa is 'C:99.6%[S:18.0%,D:81.6%],F:0.1%,M:0.3%,n:978', missing 3/978.
 
  Of transcripts gmapped to chromosome assembly (Lytechinus variegatus Lvar 01 from NCBI)
  Of 235169 primary transcripts:
     70305 are uniquely mapped
    140044 are overlapped with another locus (i.e. ~70022 = 140044/2 are extras at the same locus)
      4475 are not mapped
     66885 have introns (2+exons), 
      three have >=100 exons (Lytvar1tEVm000007t1/Fibrillin, Lytvar1tEVm000006t1/Muscle assembly, Lytvar1tEVm000022t1/Ryanodine receptor)
    165753 have single mapped exon
    of 66885 with introns, 23235 map with >= 98% identity, and  7515 map with <90% identity
    of 165753 single exon, 65251 map with >= 98% identity, and 16373 map with <90% identity
    of 66885 with introns,  10467 are split-mapped over 2+ scaffolds, 199 are split-mapped over same scaffold, 
              and 2383 are split-mapped at same location
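
A quick consistency check of the headline counts above: primary plus alternate transcripts should equal the publicset total, and the homology counts can be expressed as shares of each class. This is only an arithmetic sketch on the numbers reported in this section; the constant names and the pct helper are illustrative.

```python
# Counts copied from the public gene set stats above.
TOTAL = 741313             # publicset transcripts & proteins
PRIMARY = 235169           # primary transcripts (t1) of gene loci
ALTERNATE = 506144         # alternate transcripts of gene loci
PRIMARY_HOMOLOGY = 24576   # primaries with protein homology to ref species
ALT_HOMOLOGY = 134082      # alternates with protein homology to ref species


def pct(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Primary and alternate transcripts together account for the whole publicset.
print("primary + alternate == total:", PRIMARY + ALTERNATE == TOTAL)
print(f"primaries with homology:  {pct(PRIMARY_HOMOLOGY, PRIMARY)}%")
print(f"alternates with homology: {pct(ALT_HOMOLOGY, ALTERNATE)}%")
```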
---------------------------------------------------------
         
Sources of longest 1000 genes (t1 only)
assembler (limited 3 soap runs)
    559 velv
    366 idba
     75 soap

data slices
    288 sNn4l2d1090 # diginorm of other 4
    132 sRn1l2d1090
    198 sRn1l2d1111
    256 sRn1l2d1397
    126 sRn1l3d1409

kmers (rounded to d5)
     53 k25
     77 k35
     62 k45
    452 k55
    139 k65
    129 k75
     71 k85
     17 k95
---------------------------------------------------------

Busco scores at aaeval/buscof/busumm_versions.txt
See also refgenes3urchin-gurchin2sra4d.blastp and gurchin2sra4d.names for full orthology

Evigene publicset of 4 read sets, gurchin2sra4d
==> gurchin2sra4d_pub/short_summary_bugurchin2sra4d_pub.txt <==

        C:99.6%[S:18.0%,D:81.6%],F:0.1%,M:0.3%,n:978
        974     Complete BUSCOs (C)
        176     Complete and single-copy BUSCOs (S)
        798     Complete and duplicated BUSCOs (D)
        1       Fragmented BUSCOs (F)
        3       Missing BUSCOs (M)      # likely findable among unused tissue expression SRR sets
        978     Total BUSCO groups searched

Single read set SRR1661111 data slice      
okayset/okay.aa = primary (longest) protein, n=145,247                                                
==> sRn1l2SRR1661111.okay/short_summary_busRn1l2SRR1661111.okay.txt <==
        C:95.3%[S:94.8%,D:0.5%],F:1.3%,M:3.4%,n:978
        932     Complete BUSCOs (C)
        927     Complete and single-copy BUSCOs (S)
        5       Complete and duplicated BUSCOs (D)
        13      Fragmented BUSCOs (F)
        33      Missing BUSCOs (M)
        978     Total BUSCO groups searched

okayset/okay+okalt.aa = all transcript proteins per gene locus, non-redundant n=321,255 
==> sRn1l2SRR1661111_okall/short_summary_busRn1l2SRR1661111_okall.txt <==
        C:97.6%[S:42.7%,D:54.9%],F:0.7%,M:1.7%,n:978
        955     Complete BUSCOs (C)
        418     Complete and single-copy BUSCOs (S)
        537     Complete and duplicated BUSCOs (D)
        7       Fragmented BUSCOs (F)
        16      Missing BUSCOs (M)
        978     Total BUSCO groups searched

inputset/Rn1l2SRR1661111.aa = all assembled transcripts, redundant n=3,832,824
==> sRn1l2SRR1661111_input/short_summary_busRn1l2SRR1661111_input.txt <==
        C:98.1%[S:0.7%,D:97.4%],F:1.0%,M:0.9%,n:978
        960     Complete BUSCOs (C)
        7       Complete and single-copy BUSCOs (S)
        953     Complete and duplicated BUSCOs (D)
        10      Fragmented BUSCOs (F)
        8       Missing BUSCOs (M)
        978     Total BUSCO groups searched
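
The three SRR1661111 reports above show what redundancy reduction costs in conserved-gene recovery. The sketch below tabulates, for each stage, the share of input proteins retained next to the BUSCO Complete and Duplicated percentages. The counts and scores are copied from the reports above; the stage labels and the retained_pct helper are illustrative.

```python
# (stage, protein count, BUSCO Complete %, BUSCO Duplicated %), from the reports above.
STAGES = [
    ("inputset (all assembled, redundant)", 3832824, 98.1, 97.4),
    ("okay+okalt (non-redundant, all per locus)", 321255, 97.6, 54.9),
    ("okay (primary protein per locus)", 145247, 95.3, 0.5),
]


def retained_pct(count, input_count):
    """Share of the input protein set kept at a stage, to one decimal place."""
    return round(100.0 * count / input_count, 1)


input_count = STAGES[0][1]
for stage, count, complete, duplicated in STAGES:
    print(
        f"{stage:44s} n={count:>9,d} "
        f"kept={retained_pct(count, input_count):5.1f}%  "
        f"C={complete}%  D={duplicated}%"
    )
```

Read this way, the primary set keeps under 4% of the input proteins while Complete falls from 98.1% to 95.3% and Duplicated falls from 97.4% to 0.5%.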
---------------------------------------------------------


Developed at the Genome Informatics Lab of Indiana University Biology Department