
EvidentialGene : evgpipe_sra2genes trial results for
Lytechinus variegatus green sea urchin

Lytechinus variegatus (green sea urchin)
Lineage (full): cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Echinodermata; Eleutherozoa; Echinozoa; Echinoidea; Euechinoidea; Echinacea; Temnopleuroida; Toxopneustidae; Lytechinus

Related species reference genes
Strongylocentrotus_purpuratus = purple sea urchin, NCBI EGAP 2016, 27728 pc-genes, 35773 isoforms
Acanthaster_planci = crown-of-thorns starfish, NCBI EGAP 2017, 16468 pc-genes, 33201 isoforms

  1. Lytechinus variegatus Gene/Genome map

      Name                             Last modified

[DIR] Parent Directory                 01-Jan-2018 20:03
[DIR] aaeval/                          10-Dec-2017 17:10
[DIR] evgmethods/                      01-Jan-2018 19:44
[TXT] evgsra2genepipe_help.txt         14-Dec-2017 17:59
[TXT] evgsra2genepipe_urchin.sum.txt   15-Dec-2017 13:55
[DIR] geneval/                         01-Jan-2018 19:56
[DIR] genome/                          14-Dec-2017 17:08
[TXT] greenseaurchin_SraRunInfo.csv    06-Dec-2017 20:20
[TXT] gurchin2sra4d.names              12-Dec-2017 02:04
[TXT]                                  12-Dec-2017 14:05
[TXT] gurchin2sra4d.trclass.sum.txt    10-Dec-2017 23:15
[DIR] logfiles/                        01-Jan-2018 19:24
[DIR] okayset/                         13-Dec-2017 13:24
[DIR] pairfa/                          07-Dec-2017 18:22
[DIR] publicset/                       01-Jan-2018 19:50
[DIR] trsets/                          10-Dec-2017 08:29
[DIR] trsets_reduced/                  14-Dec-2017 16:43

evgpipe_sra2genes Trial results for greenseaurchin_SraRunInfo 
Wed Dec 13 2017, by Don Gilbert

Testing the new sra2genes method, which runs these steps:

STEPs 1-3: Fetch and pre-process RNA-seq data from SRA
      using 4 tissue-stage SRR sets of 18 available for Lytechinus variegatus
STEP 4: Assemble RNA data slices with multiple assemblers and options

STEPs 5,7 repeated: Two-level reduction of transcript over-assembly 
  a. per SRR sample data slice, reduce several assemblies (method,kmer)
      e.g. sRn1l2SRR1661111.tr2aacds
  b. combine and reduce results (4 data slices) of a.
      to gurchin2sra4d.tr2aacds
STEPs 8-10: Public gene set with names, annotations, vector screen, chr mapping,
  from non-redundant okayset/transcripts,cds,proteins.
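
The two-level reduction of STEPs 5,7 can be illustrated with a toy sketch. This is not the tr2aacds algorithm: `reduce_set` is a hypothetical stand-in that only drops exact duplicates and sequences contained in a longer kept sequence, and the slice names and protein strings are invented for illustration.

```python
def reduce_set(seqs):
    """Toy reduction (assumption, not tr2aacds): keep the longest sequences,
    drop exact duplicates and any sequence contained in a kept one."""
    kept = []
    # Longest first, ties broken alphabetically for a stable result.
    for s in sorted(set(seqs), key=lambda x: (-len(x), x)):
        if not any(s in k for k in kept):
            kept.append(s)
    return kept


def two_level_reduce(slices):
    """Level a: reduce each data slice separately.
    Level b: pool the per-slice results and reduce them again."""
    per_slice = {name: reduce_set(seqs) for name, seqs in slices.items()}
    pooled = [s for seqs in per_slice.values() for s in seqs]
    return per_slice, reduce_set(pooled)


# Invented example data: two slices with overlapping assemblies.
slices = {
    "sliceA": ["MKTAYIAK", "MKTAY", "MKTAYIAK", "MGGS"],
    "sliceB": ["MKTAYIAKQR", "MGGS", "MPLW"],
}
per_slice, combined = two_level_reduce(slices)
print(per_slice)
print(combined)
```

Reducing per slice first keeps each reduction job small; the second pass then removes redundancy shared between slices.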

#s2g: EvidentialGene (-help for info), VERSION 2017.12.07
#s2g: CMD: -NCPU 8 -MAXMEM 32000 -log -debug -runname gurchin2sra4d -SRAtable greenseaurchin_SraRunInfo.csv

#s2g: BEGIN with input= greenseaurchin_SraRunInfo.csv date= Tue Dec 12 11:05:22 PST 2017
#s2g: STEP1_sraget
#s2g: sra_info: Run=SRR1661090;SRR1661111;SRR1661397;SRR1661409; ScientificName=Lytechinus variegatus; Platform=ILLUMINA; 
#s2g:   CenterName=BOSTON UNIVERSITY; size_MB=11057;12201;8157;14099; spots=88877018;97227011;69147257;111354178; 
#s2g:   BioProject=PRJNA241187;

#s2g: STEP2_sra2fasta
#s2g: sra2fasta ids: SRR1661090 SRR1661111 SRR1661397 SRR1661409
#s2g: done STEP2_sra2fasta 2 spotfa=4,dnorm done=0

#s2g: STEP3_selectrna
#s2g: sample_reads type=pairs, nreads=49505385/88877018 (55% for 10000 maxMB) of ids=SRR1661090 to rnasets/sRn1l2SRR1661090, pok=0
#s2g: sample_reads type=pairs, nreads=49507108/97227011 (50% for 10000 maxMB) of ids=SRR1661111 to rnasets/sRn1l2SRR1661111, pok=0
#s2g: sample_reads type=pairs, nreads=49507594/69147257 (71% for 10000 maxMB) of ids=SRR1661397 to rnasets/sRn1l2SRR1661397, pok=0
#s2g: sample_reads type=pairs, nreads=49506147/111354178 (44% for 10000 maxMB) of ids=SRR1661409 to rnasets/sRn1l3SRR1661409, pok=0
#s2g: done STEP3_selectrna 1 rnasets=4

#s2g: STEP4_runassemblers
#s2g: STEP4_runassemblers have scripts:;;;

#s2g: STEP5_collectassemblies
#s2g: done STEP5_collectassemblies 1 trsets ready=13 (done=13), err=0, waitfor=0

#s2g: STEP7_reduceassemblies have scripts:

#s2g: STEP8_refblastgenes have scripts:

#s2g: STEP9_annotgenes
#s2g: STEP9a_trimvec

#s2g: STEP10_publicgenes have scripts:
#s2g: settings saved to
#s2g: ======================================
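
The STEP3 `sample_reads` lines in the log above can be checked with a short script. This is a minimal sketch: the regular expression is an assumption fitted to the four lines shown here, not a documented log format, and `parse_sample_reads` is a hypothetical helper name.

```python
import re

# The four STEP3 lines, copied from the log above.
LOG = """\
#s2g: sample_reads type=pairs, nreads=49505385/88877018 (55% for 10000 maxMB) of ids=SRR1661090 to rnasets/sRn1l2SRR1661090, pok=0
#s2g: sample_reads type=pairs, nreads=49507108/97227011 (50% for 10000 maxMB) of ids=SRR1661111 to rnasets/sRn1l2SRR1661111, pok=0
#s2g: sample_reads type=pairs, nreads=49507594/69147257 (71% for 10000 maxMB) of ids=SRR1661397 to rnasets/sRn1l2SRR1661397, pok=0
#s2g: sample_reads type=pairs, nreads=49506147/111354178 (44% for 10000 maxMB) of ids=SRR1661409 to rnasets/sRn1l3SRR1661409, pok=0
"""

# Assumed pattern: nreads=<kept>/<total> (<pct>% for <maxMB> maxMB) of ids=<run>
PATTERN = re.compile(r"nreads=(\d+)/(\d+) \((\d+)% for (\d+) maxMB\) of ids=(\S+)")


def parse_sample_reads(text):
    """Return one record per sample_reads line with the kept fraction recomputed."""
    rows = []
    for m in PATTERN.finditer(text):
        kept, total = int(m.group(1)), int(m.group(2))
        rows.append({
            "run": m.group(5),
            "kept": kept,
            "total": total,
            "logged_pct": int(m.group(3)),
            "fraction": kept / total,
        })
    return rows


rows = parse_sample_reads(LOG)
for r in rows:
    print(f'{r["run"]}: kept {r["kept"]} of {r["total"]} spots ({100 * r["fraction"]:.1f}%)')
```

Each slice keeps roughly 49.5 million read pairs regardless of the run size, so the kept fraction varies with the number of spots in the run.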

Project folder with data sets and scripts for seaurchin 
aaeval                               publicset                                tridba1a_sRn1l2SRR1661111
greenseaurchin_SraRunInfo.csv        refgenes3urchin-gurchin2sra4d.aa.btall   tridba1a_sRn1l3SRR1661409
greenseaurchin_SraRunInfo.csv_all    refgenes3urchin-gurchin2sra4d.blastp.gz  tridba1b_sRn1l2SRR1661090
gurchin2sra4d.egtrimvec.log          refset                                   tridba1b_sRn1l2SRR1661397
gurchin2sra4d.mrna2tsa.log           rnasets                                  trimset
gurchin2sra4d.names                  sNn4l2SRR1661090.tr2aacds.log            trsets
gurchin2sra4d.sra2genes.log          sNn4l2SRR1661090.trclass                 trsoap1a_sNn4l2SRR1661090
gurchin2sra4d.tr2aacds.log           sRn1l2SRR1661090.tr2aacds.log            trsoap1a_sRn1l2SRR1661111
gurchin2sra4d.trclass                sRn1l2SRR1661090.trclass                 trsoap1a_sRn1l3SRR1661409
gurchin2sra4d.trclass.sum.txt        sRn1l2SRR1661111.tr2aacds.log            trvelo1a_sNn4l2SRR1661090
gurchin2sra4d_evgmethods.tar.gz      sRn1l2SRR1661111.trclass                 trvelo1a_sRn1l2SRR1661111
gurchin2sra4d_okall.aa               sRn1l2SRR1661397.tr2aacds.log            trvelo1a_sRn1l3SRR1661409
gurchin2sra4d_okall.aa.qual          sRn1l2SRR1661397.trclass                 trvelo1b_sRn1l2SRR1661090
gurchin2sra4d_okall.names            sRn1l3SRR1661409.tr2aacds.log            trvelo1b_sRn1l2SRR1661397
gurchin2sra4d_okall.names.Fixmename  sRn1l3SRR1661409.trclass                 try1
gurchin2sra4d_pubset.tar.gz          spotfa                                   try2
gurchin2sraevg.sra2genes.log         submitset                                try3
inputset                             tmpfiles                                 urchin1.hist
logs                                 tr2cds.13266894.comet-06-18.out          urchin2.hist
okayset                              tr2cds.13275618.comet-03-33.out          urchin3.hist
pairfa                               tridba1a_sNn4l2SRR1661090                workf

Disk usage of data sets : [seaurchin]$ du -sh */
384M    aaeval/     # protein annots, eval
48M     refset/     # refgenes, ..
320M    genome/     # spp chromosome assembly
938M    geneval/    # publicset/genes.mrna gmapped to chr assembly 

  .. evg tr2aacds reduced trsets ..
14G     inputset/   # input transcripts from trsets/, with cds,aa
2.6G    okayset/    # non redundant output set
13G     dropset/    # redundant and fragment output set

  .. evgmrna2tsa of gurchin2sra4d ..
1.1G    publicset/  # public release annotated gene set
490M    submitset/  # NCBI TSA submit set, .fsa, .tbl annots
21M     trimset/    # vecscreen results (trimmed of vectors/adaptors or dropped)
26G     tmpfiles/   # intermediate data files

.. SRA RNA data sets ..
75G     spotfa/     # from SRA, gzipped
137G    pairfa/     # spots split to _1,left/_2,right read pairs
47G     rnasets/    # data slices from pairfa, for assemblers, 

.. assembler runs ..
7.1G    trsets/     # trformat of assembled transcripts, per asm run

23G     tridba1a_sNn4l2SRR1661090/  # idba_trans
13G     tridba1a_sRn1l2SRR1661111/
12G     tridba1a_sRn1l3SRR1661409/
16G     tridba1b_sRn1l2SRR1661090/
16G     tridba1b_sRn1l2SRR1661397/
673M    trsoap1a_sNn4l2SRR1661090/  # SOAP_tran
806M    trsoap1a_sRn1l2SRR1661111/
784M    trsoap1a_sRn1l3SRR1661409/
12G     trvelo1a_sNn4l2SRR1661090/  # Velvet/Oases
244G    trvelo1a_sRn1l2SRR1661111/  # clear out big Graph files
9.4G    trvelo1a_sRn1l3SRR1661409/
8.9G    trvelo1b_sRn1l2SRR1661090/
9.0G    trvelo1b_sRn1l2SRR1661397/

Public gene set stats
  741313 publicset transcripts & proteins    
  235169 primary transcript (t1) of gene "loci" 
   60000 are >= 100 aa, 50000 are < 50 aa, 125000 are between 50 and 100 aa (approximate counts)
   24576 of primaries have protein homology to seaurchinp (22279), starfishc (1730), or human
  506144 alternate transcripts of gene loci
  134082 alts have protein homology to three ref species
  longest 10,030 aa (longest animal aa =~ 36,000; 15,000-20,000 is common in arthropods)
  median  70 aa (lots of short things), min 19 aa (despite 30aa MIN cut off)
  Busco score to metazoa is 'C:99.6%[S:18.0%,D:81.6%],F:0.1%,M:0.3%,n:978', missing 3/978.
  Transcripts were gmapped to the chromosome assembly (Lytechinus variegatus Lvar 01 from NCBI).
  Of 235169 primary transcripts:
     70305 are uniquely mapped
    140044 are overlapped with another locus (i.e. ~70022, 140044/2 are extras at same locus)
      4475 are not mapped
     66885 have introns (2+exons), 
      three have >=100 exons (Lytvar1tEVm000007t1/Fibrillin, Lytvar1tEVm000006t1/Muscle assembly, Lytvar1tEVm000022t1/Ryanodine receptor)
    165753 have single mapped exon
    of 66885 with introns, 23235 map with >= 98% identity, and  7515 map with <90% identity
    of 165753 single exon, 65251 map with >= 98% identity, and 16373 map with <90% identity
    of 66885 with introns,  10467 are split-mapped over 2+ scaffolds, 199 are split-mapped over same scaffold, 
              and 2383 are split-mapped at same location
Sources of longest 1000 genes (t1 only)
assembler (limited 3 soap runs)
    559 velv
    366 idba
     75 soap

data slices
    288 sNn4l2d1090 # diginorm of other 4
    132 sRn1l2d1090
    198 sRn1l2d1111
    256 sRn1l2d1397
    126 sRn1l3d1409

kmers (rounded to d5)
     53 k25
     77 k35
     62 k45
    452 k55
    139 k65
    129 k75
     71 k85
     17 k95
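
The three tallies above can be turned into shares of the longest 1000 genes with a few lines. This is a small sketch using only the counts listed above; `shares` is a hypothetical helper name.

```python
# Counts copied from the "Sources of longest 1000 genes" lists above.
tallies = {
    "assembler": {"velv": 559, "idba": 366, "soap": 75},
    "data slice": {
        "sNn4l2d1090": 288, "sRn1l2d1090": 132, "sRn1l2d1111": 198,
        "sRn1l2d1397": 256, "sRn1l3d1409": 126,
    },
    "kmer": {
        "k25": 53, "k35": 77, "k45": 62, "k55": 452,
        "k65": 139, "k75": 129, "k85": 71, "k95": 17,
    },
}


def shares(counts):
    """Return the total and each category's percentage of that total."""
    total = sum(counts.values())
    return total, {k: round(100 * v / total, 1) for k, v in counts.items()}


for name, counts in tallies.items():
    total, pct = shares(counts)
    top = max(pct, key=pct.get)
    print(f"{name}: total={total}, largest share={top} ({pct[top]}%)")
```

Each list sums to 1000, so the three tallies classify the same set of genes by assembler, data slice, and kmer.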

Busco scores at aaeval/buscof/busumm_versions.txt
See also refgenes3urchin-gurchin2sra4d.blastp and gurchin2sra4d.names for full orthology

Evigene publicset of 4 read sets, gurchin2sra4d
==> gurchin2sra4d_pub/short_summary_bugurchin2sra4d_pub.txt <==

        974     Complete BUSCOs (C)
        176     Complete and single-copy BUSCOs (S)
        798     Complete and duplicated BUSCOs (D)
        1       Fragmented BUSCOs (F)
        3       Missing BUSCOs (M)      # likely findable among unused tissue expression SRR sets
        978     Total BUSCO groups searched
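
The one-line Busco score quoted under "Public gene set stats" follows from the counts in this summary. A minimal sketch of that arithmetic, where `busco_notation` is a hypothetical helper and not part of BUSCO or EvidentialGene:

```python
def busco_notation(single, duplicated, fragmented, missing):
    """Build the compact C/S/D/F/M score string from BUSCO group counts."""
    n = single + duplicated + fragmented + missing
    complete = single + duplicated

    def pct(count):
        return f"{100 * count / n:.1f}%"

    return (
        f"C:{pct(complete)}[S:{pct(single)},D:{pct(duplicated)}],"
        f"F:{pct(fragmented)},M:{pct(missing)},n:{n}"
    )


# Counts from the gurchin2sra4d publicset summary above.
score = busco_notation(single=176, duplicated=798, fragmented=1, missing=3)
print(score)  # C:99.6%[S:18.0%,D:81.6%],F:0.1%,M:0.3%,n:978
```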

Single read set SRR1661111 data slice      
okayset/okay.aa = primary (longest) protein, n=145,247                                                
==> sRn1l2SRR1661111.okay/short_summary_busRn1l2SRR1661111.okay.txt <==
        932     Complete BUSCOs (C)
        927     Complete and single-copy BUSCOs (S)
        5       Complete and duplicated BUSCOs (D)
        13      Fragmented BUSCOs (F)
        33      Missing BUSCOs (M)
        978     Total BUSCO groups searched

okayset/okay+okalt.aa = all transcript proteins per gene locus, non-redundant n=321,255 
==> sRn1l2SRR1661111_okall/short_summary_busRn1l2SRR1661111_okall.txt <==
        955     Complete BUSCOs (C)
        418     Complete and single-copy BUSCOs (S)
        537     Complete and duplicated BUSCOs (D)
        7       Fragmented BUSCOs (F)
        16      Missing BUSCOs (M)
        978     Total BUSCO groups searched

inputset/sRn1l2SRR1661111.aa = all assembled transcripts, redundant n=3,832,824
==> sRn1l2SRR1661111_input/short_summary_busRn1l2SRR1661111_input.txt <==
        960     Complete BUSCOs (C)
        7       Complete and single-copy BUSCOs (S)
        953     Complete and duplicated BUSCOs (D)
        10      Fragmented BUSCOs (F)
        8       Missing BUSCOs (M)
        978     Total BUSCO groups searched
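
The three SRR1661111 summaries above show how reduction trades set size against BUSCO completeness and duplication. A short sketch that tabulates this from the counts above; the set labels and the `summarize` helper are illustrative names only.

```python
# (label, protein count, single, duplicated, fragmented, missing),
# copied from the three SRR1661111 BUSCO summaries above.
levels = [
    ("inputset (all assemblies)", 3832824, 7, 953, 10, 8),
    ("okay+okalt (non-redundant)", 321255, 418, 537, 7, 16),
    ("okay (primary only)", 145247, 927, 5, 13, 33),
]


def summarize(levels):
    """Compare each set with the first (unreduced) one."""
    input_size = levels[0][1]
    rows = []
    for label, nprot, s, d, f, m in levels:
        n = s + d + f + m
        rows.append({
            "set": label,
            "proteins": nprot,
            "kept_pct": 100 * nprot / input_size,
            "complete_pct": 100 * (s + d) / n,
            "duplicated_pct": 100 * d / n,
        })
    return rows


rows = summarize(levels)
for r in rows:
    print(
        f'{r["set"]:28s} n={r["proteins"]:>9,d} '
        f'kept={r["kept_pct"]:5.1f}% complete={r["complete_pct"]:5.1f}% '
        f'duplicated={r["duplicated_pct"]:5.1f}%'
    )
```

In these numbers the primary set keeps under 4% of the input proteins while losing 28 complete BUSCOs (960 to 932), and duplicated BUSCOs fall from 953 to 5.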

Developed at the Genome Informatics Lab of Indiana University Biology Department