
Index of /EvidentialGene/arthropods/deertick/ixotick1evg1

      Name                              Last modified       Size

[DIR] Parent Directory                  05-Apr-2014 13:17      -
[TXT] evg1itick-ncbi-biosamples.txt     05-Apr-2014 13:15     4k
[TXT] evg1itick-ncbi-sra-pj239331.csv   29-Mar-2014 17:59     2k
[TXT] evg1itick.mrna2tsa.info           05-Jun-2014 15:43     1k
[TXT] evg1itick.mrna2tsa.log            05-Jun-2014 15:43    29k
[TXT] evg1itick.tr2aacds.log            03-Apr-2014 22:24    15k
[   ] evg1itick.trclass.gz              03-Apr-2014 22:15  42.8M
[DIR] inputset/                         30-Jul-2014 13:32      -
[DIR] publicset/                        30-Jun-2015 21:37      -
[   ] run_evgmrna2tsa.sh                08-Jun-2014 16:24     2k
[   ] runtr2cds.sh                      03-Apr-2014 13:53     1k


====== Ixodes scapularis (Deer Tick) EvidentialGene gene construction set ========
2014-jun-07, by Don Gilbert, gilbertd at indiana edu
http://arthropods.eugenes.org/EvidentialGene/arthropods/deertick/

This is a 'reference-free' gene set assembled from mRNA-seq, without reference to
a genome assembly and without training on or mapping from other species' genes. As such
it has different merits than genome-based gene sets; an important one is that no external
artifacts or errors contribute to these genes.  Any protein orthology measured has not been
influenced by gene modelling using other species (with their artifacts) or by genome
assembly errors.

The Ixodes scapularis mRNA-seq data are public NCBI SRA data, BioProject PRJNA239331,
with accessions listed in itick_pj239331_nobact.csv

#t2ac: EvidentialGene tr2aacds.pl VERSION 2013.07.27
#t2ac: BEGIN with cdnaseq= evg1itick.tr date= Thu Apr  3 17:29:18 PDT 2014
#t2ac: bestorf_cds= evg1itick.cds nrec= 5895165
#t2ac: nonredundant_cds= evg1iticknr.cds nrec= 2792966
#t2ac: nofragments_cds= evg1iticknrcd1.cds nrec= 1725680
#t2ac: asmdupfilter_cds= evg1itick.trclass

# Class Table for evg1itick.trclass 
class           %okay   %drop   n.okay  n.drop
althi           2.6     17.3    45175   298756
althi1          3       21.9    53369   378193
althia2         0       0.4     0       7382
altmfrag        0.4     0.6     7551    11395
altmfraga2      0       0       778     998
altmid          0.7     1.3     13181   23676
altmida2        0       0       734     1198
main            5.7     8       99872   139614    # too many main/noclass, frags, .. see below
maina2          0.4     0.3     7205    5640
noclass         2.4     17.1    41799   295300
noclassa2       0       0       93      573
parthi          0       11.2    0       194358
parthi1         0       4.4     0       77612
parthia2        0       1.2     0       21175
---------------------------------------------
total           15.6    84.3    269757  1455870
=============================================
# AA-quality for okay set of evg1itick.aa.qual (no okalt): all and longest 1000 summary 
okay.top         n=1000; average=1738; median=1464; min,max=1187,10671; sum=1738176; gaps=7731,7.7
okay.all         n=148969; average=158; median=116; min,max=40,10671; sum=23548951; gaps=1098349,7.3
#t2ac: DONE at date= Thu Apr  3 19:24:11 PDT 2014
#t2ac: ======================================
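
The percentage columns of the class table can be re-derived from the count columns. The following is a minimal Python sketch, not part of tr2aacds.pl, assuming that the first two numeric columns are percentages of all classified transcripts (okay + drop) and the last two are counts. The helper names `parse_class_table` and `pct_trunc` are illustrative. Comparing rows such as `main` (99872 okay, listed as 5.7) suggests the listed percentages are truncated, not rounded, to one decimal; that is an inference from the numbers, not a documented behavior.

```python
# Rows copied from the class table above: class, %okay, %drop, okay count, drop count.
CLASS_TABLE = """\
althi           2.6     17.3    45175   298756
althi1          3       21.9    53369   378193
althia2         0       0.4     0       7382
altmfrag        0.4     0.6     7551    11395
altmfraga2      0       0       778     998
altmid          0.7     1.3     13181   23676
altmida2        0       0       734     1198
main            5.7     8       99872   139614
maina2          0.4     0.3     7205    5640
noclass         2.4     17.1    41799   295300
noclassa2       0       0       93      573
parthi          0       11.2    0       194358
parthi1         0       4.4     0       77612
parthia2        0       1.2     0       21175
"""


def parse_class_table(text):
    """Return {class_name: (okay_count, drop_count)} from whitespace-delimited rows."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed rows
        counts[fields[0]] = (int(fields[3]), int(fields[4]))
    return counts


def pct_trunc(n, total):
    """Percentage of total, truncated (not rounded) to one decimal place."""
    return (n * 1000 // total) / 10


counts = parse_class_table(CLASS_TABLE)
total_okay = sum(okay for okay, _ in counts.values())
total_drop = sum(drop for _, drop in counts.values())
total = total_okay + total_drop

print("okay:", total_okay, pct_trunc(total_okay, total))
print("drop:", total_drop, pct_trunc(total_drop, total))
for name, (okay, drop) in counts.items():
    print(f"{name:12s} {pct_trunc(okay, total):5.1f} {pct_trunc(drop, total):5.1f}")
```

The recomputed totals match the `total` row of the table (269757 okay, 1455870 drop).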

# itick1 publicset class count
71709  althi        # alternates, high-identity exon alignment to main
6726   althim       # alternate <> main swapped
10610  altmid       # alternates, lower-identity exon alignment; will include some paralogs
3983   altmidfrag   # shorter altmid
119588 main         # main transcripts w/ alternates; large count, see noho.tab; most have no homology (to 3 ref spp)
29975  noclass      # main without alternates
------
15249  dropalt      # removed as uninformative, short
11917  dropnoclass
----------------------------------------------------

aaeval/...................................
Basic orthology comparison, blastp x human,daphnia,tribolium  

Align averages
refspp  nref    ixodes  ixoevg  tetur   ztickevg
daphnia 5827    377     471     408     436
human   7780    411     515     444     479
trica   5700    368     462     400     429

Human genes found (n=16631)
geneset         hit%    alnh    alnt    Gene set method, species
................................................................
ixodes.evg      95.7    434     415     mRNA-assembly, deer tick
ztick.evg       91.4    416     380     mRNA-assembly, zebra tick
ixodes.gno      89.5    364     326     genome-predict, deer tick
tetur.gno       83.2    399     332     genome-predict, spider mite
................................................................
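
To put the hit% column in absolute terms, the percentages can be converted to approximate gene counts. This Python sketch assumes that hit% is taken relative to the n=16631 human genes named in the table title; that denominator is an assumption, and the resulting counts are only as precise as the one-decimal percentages allow.

```python
# Assumption: hit% is relative to the 16631 human reference genes in the table title.
N_HUMAN_REF = 16631

# hit% values copied from the "Human genes found" table above.
HIT_PCT = {
    "ixodes.evg": 95.7,  # mRNA-assembly, deer tick
    "ztick.evg": 91.4,   # mRNA-assembly, zebra tick
    "ixodes.gno": 89.5,  # genome-predict, deer tick
    "tetur.gno": 83.2,   # genome-predict, spider mite
}


def approx_hits(pct, n_ref=N_HUMAN_REF):
    """Approximate number of reference genes with a hit, from a percentage."""
    return round(n_ref * pct / 100)


hits = {name: approx_hits(pct) for name, pct in HIT_PCT.items()}
for name, n in hits.items():
    print(f"{name:12s} ~{n} of {N_HUMAN_REF}")

# Approximate difference between the two deer tick gene sets.
gain = hits["ixodes.evg"] - hits["ixodes.gno"]
print("ixodes.evg finds about", gain, "more human genes than ixodes.gno")
```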

-------------------------------------------------
Main transcript checks
 wc -l evg1itick.maint1.*.tab
   15958 evg1itick.maint1.homol.tab   : 738 have TE names; the rest have homology to the reference species above
  133605 evg1itick.maint1.noho.tab    : no homology to the above 3 ref species, median aa size ~ 100aa
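
As a quick consistency check, the two maint1 tables together hold as many rows as the publicset `main` and `noclass` classes combined. A small Python sketch of that arithmetic follows, using only counts quoted in this file. The reading that the maint1 tables cover exactly main + noclass is an inference from the matching totals, not something stated in these notes.

```python
# Counts copied from the publicset class count and the wc -l output above.
publicset = {"main": 119588, "noclass": 29975}
maint1 = {"homol": 15958, "noho": 133605}

primary = sum(publicset.values())  # main transcripts, with and without alternates
checked = sum(maint1.values())     # rows in the homol + noho tables

print("main + noclass:", primary)
print("homol + noho  :", checked)

# Share of checked main transcripts without homology to the 3 reference species.
noho_pct = 100 * maint1["noho"] / checked
print(f"noho share: {noho_pct:.1f}%")
```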

noho set:
   cd-hit90 clustering of maint1.noho.aa: only 1400 cluster w/ others, so not useful for reclassifying
   in:120906  finished     119341  clusters

  12336 utrorf's : discard all but the longest (>= 200aa, ~1000)
  60000 short half of noho, <= 110 aa; 24000 are partial and/or utrpoor/bad.  discard all?
  check random set of noho for NR blastp hits .. any contams, TE genes? bacteria?
  don't know w/o doing blastp what proportion of the noho set to keep
  likely that a large NR protein set will categorize many of these into 4 bins:
   1. eukaryote (may include ixodes/ticks)
   2. bacterial/contam
   3. transposon/virus/contam
   4. no detectable homology .. probable ixodes, maybe not
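
The triage ideas in these notes can be sketched as a small filter. This is a hypothetical Python illustration, not the procedure actually applied to the public set: the `NohoRecord` fields and the `triage` function are invented names, and only the thresholds (utrorf >= 200aa, short <= 110 aa) come from the notes. The notes leave open whether all short noho proteins should be discarded, so this sketch drops only the short ones that are also partial or utrpoor/bad and sends everything else on to an NR blastp check.

```python
from dataclasses import dataclass


@dataclass
class NohoRecord:
    """Hypothetical summary of one no-homology main transcript."""
    seq_id: str
    aa_len: int
    is_utrorf: bool = False   # ORF called from a UTR region
    is_partial: bool = False  # incomplete CDS
    utr_poor: bool = False    # flagged utrpoor/bad


def triage(rec):
    """Return 'keep', 'discard', or 'check_blastp' for one record."""
    if rec.is_utrorf:
        # Notes: discard all utrorfs but the longest (>= 200aa).
        return "keep" if rec.aa_len >= 200 else "discard"
    if rec.aa_len <= 110 and (rec.is_partial or rec.utr_poor):
        # Short and poor quality: strongest discard candidates.
        return "discard"
    # Everything else waits for an NR blastp check (the 4 bins above).
    return "check_blastp"


# Made-up example records, for illustration only.
examples = [
    NohoRecord("ex1", 150, is_utrorf=True),
    NohoRecord("ex2", 240, is_utrorf=True),
    NohoRecord("ex3", 95, is_partial=True),
    NohoRecord("ex4", 95),
    NohoRecord("ex5", 300),
]
for rec in examples:
    print(rec.seq_id, rec.aa_len, triage(rec))
```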
 
Random noho check at NCBI NR Blastp
Ixosca1aEVm000051t1 3039aa; NRBl=polyprotein-virus*; 
Ixosca1aEVm006413t1 489aa; NRBl=hypothetical protein IscW_ISCW011049 [Ixodes scapularis] (98aa, hi ident) + other Ixo
   .. also pea aphid  LOC100574918, daphnia, other weaker partial matches
Ixosca1aEVm029071t1 185aa; NRBl=IscW_ISCW014947, other weak (ecoli) 
Ixosca1aEVm018013t1 235aa; NRBl=LOC100888907 [Strongylocentrotus purpuratus]  and other Euks, ~full match
Ixosca1aEVm041859t1 159aa; NRBl=weak hits to bacteria
Ixosca1aEVm024727t1 199aa; NRBl=weak hits to euks
Ixosca1aEVm045897t1 153aa; NRBl=weak hit to 1 bact
-------------

Developed at the Genome Informatics Lab of Indiana University Biology Department