Index of /EvidentialGene/arthropods/deertick/ixotick1evg1
Name Last modified Size
evg1itick-ncbi-biosamples.txt 05-Apr-2014 13:15 4k
evg1itick-ncbi-sra-pj239331.csv 29-Mar-2014 17:59 2k
evg1itick.mrna2tsa.info 05-Jun-2014 15:43 1k
evg1itick.mrna2tsa.log 05-Jun-2014 15:43 29k
evg1itick.tr2aacds.log 03-Apr-2014 22:24 15k
evg1itick.trclass.gz 03-Apr-2014 22:15 42.8M
inputset/ 30-Jul-2014 13:32 -
publicset/ 30-Jun-2015 21:37 -
run_evgmrna2tsa.sh 08-Jun-2014 16:24 2k
runtr2cds.sh 03-Apr-2014 13:53 1k
====== Ixodes scapularis (Deer Tick) EvidentialGene gene construction set ========
2014-jun-07, by Don Gilbert, gilbertd at indiana edu
http://arthropods.eugenes.org/EvidentialGene/arthropods/deertick/
This is a 'reference-free' gene set assembled from mRNA-seq, without reference to
a genome assembly and without training on, or mapping from, other species' genes. As such
it has different merits than genome-based gene sets. An important one is that no external
artifacts or errors contribute to these genes: any protein orthology measured has not been
influenced by gene modelling with other species' genes (and their artifacts), nor by
genome assembly errors.
The Ixodes scapularis mRNA-seq data are public NCBI SRA data, BioProject PRJNA239331,
with accessions listed in itick_pj239331_nobact.csv
#t2ac: EvidentialGene tr2aacds.pl VERSION 2013.07.27
#t2ac: BEGIN with cdnaseq= evg1itick.tr date= Thu Apr 3 17:29:18 PDT 2014
#t2ac: bestorf_cds= evg1itick.cds nrec= 5895165
#t2ac: nonredundant_cds= evg1iticknr.cds nrec= 2792966
#t2ac: nofragments_cds= evg1iticknrcd1.cds nrec= 1725680
#t2ac: asmdupfilter_cds= evg1itick.trclass
# Class Table for evg1itick.trclass (percent of all classified transcripts, then counts)
class        %okay  %drop   n_okay    n_drop
althi          2.6   17.3    45175    298756
althi1           3   21.9    53369    378193
althia2          0    0.4        0      7382
altmfrag       0.4    0.6     7551     11395
altmfraga2       0      0      778       998
altmid         0.7    1.3    13181     23676
altmida2         0      0      734      1198
main           5.7      8    99872    139614  # too many main/noclass, frags, .. see below
maina2         0.4    0.3     7205      5640
noclass        2.4   17.1    41799    295300
noclassa2        0      0       93       573
parthi           0   11.2        0    194358
parthi1          0    4.4        0     77612
parthia2         0    1.2        0     21175
---------------------------------------------
total         15.6   84.3   269757   1455870
=============================================
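As a cross-check on the class table above, the following minimal sketch recomputes the first two numeric columns (percent okay, percent drop) from the two count columns. Two things are assumptions inferred from the numbers, not taken from tr2aacds.pl itself: the denominator is the sum of all okay and drop counts, and percentages are truncated (not rounded) to one decimal, which is what reproduces values such as 5.7 for main.

```python
# Counts (okay, drop) copied from the class table above.
counts = {
    "althi": (45175, 298756), "althi1": (53369, 378193), "althia2": (0, 7382),
    "altmfrag": (7551, 11395), "altmfraga2": (778, 998),
    "altmid": (13181, 23676), "altmida2": (734, 1198),
    "main": (99872, 139614), "maina2": (7205, 5640),
    "noclass": (41799, 295300), "noclassa2": (93, 573),
    "parthi": (0, 194358), "parthi1": (0, 77612), "parthia2": (0, 21175),
}

okay_total = sum(ok for ok, _ in counts.values())
drop_total = sum(drop for _, drop in counts.values())
total = okay_total + drop_total  # assumed denominator for the percent columns


def pct_trunc(n, denom):
    """Percent of denom, truncated to one decimal (assumption, see lead-in)."""
    return (1000 * n // denom) / 10


pct = {cls: (pct_trunc(ok, total), pct_trunc(drop, total))
       for cls, (ok, drop) in counts.items()}

for cls, (p_ok, p_drop) in pct.items():
    print(f"{cls:12s} {p_ok:5.1f} {p_drop:5.1f}")
print(f"{'total':12s} {pct_trunc(okay_total, total):5.1f} {pct_trunc(drop_total, total):5.1f}")
```

With plain rounding the main row would read 5.8 and 8.1 instead of the 5.7 and 8 shown in the table, which is why truncation is assumed here.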
# AA-quality for okay set of evg1itick.aa.qual (no okalt): all and longest 1000 summary
okay.top n=1000; average=1738; median=1464; min,max=1187,10671; sum=1738176; gaps=7731,7.7
okay.all n=148969; average=158; median=116; min,max=40,10671; sum=23548951; gaps=1098349,7.3
#t2ac: DONE at date= Thu Apr 3 19:24:11 PDT 2014
#t2ac: ======================================
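The nrec values in the #t2ac log above show how much each reduction step removes. This short sketch only does arithmetic on those three logged counts; the step names are the log's own labels and nothing here calls tr2aacds.pl.

```python
# Record counts copied from the "#t2ac:" log lines above.
steps = [
    ("bestorf_cds", 5895165),
    ("nonredundant_cds", 2792966),
    ("nofragments_cds", 1725680),
]

start = steps[0][1]
retained = {}  # step name -> (fraction kept vs previous step, fraction kept vs start)
prev = start
for name, nrec in steps:
    retained[name] = (nrec / prev, nrec / start)
    print(f"{name:18s} nrec={nrec:8d}  vs_prev={nrec / prev:6.1%}  vs_start={nrec / start:6.1%}")
    prev = nrec
```

Roughly half of the best-ORF CDS survive the non-redundant step, and under a third of the starting set remains after fragment removal, before the okay/drop classification is applied.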
# itick1 publicset class count
71709 althi # alternates, hi identity exon align to main
6726 althim # alternate<> main swapped
10610 altmid # alternates, lower ident exon align, will include some paralogs
3983 altmidfrag # shorter altmid
119588 main # main transcripts w/ alternates, large count see noho.tab, most have no homology (of 3 ref spp)
29975 noclass # main without alternates
------
15249 dropalt # removed as uninformative, short,
11917 dropnoclass
----------------------------------------------------
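A quick consistency check on the publicset class counts above: summing the kept and dropped classes should account for the okay transcripts of the trclass table. The sketch below only adds up numbers listed in this file; treating the publicset as derived from exactly the trclass okay set is an assumption that the totals happen to support.

```python
# Publicset class counts copied from the list above.
kept = {
    "althi": 71709, "althim": 6726, "altmid": 10610,
    "altmidfrag": 3983, "main": 119588, "noclass": 29975,
}
dropped = {"dropalt": 15249, "dropnoclass": 11917}

kept_total = sum(kept.values())
dropped_total = sum(dropped.values())
grand_total = kept_total + dropped_total

# Okay total from the trclass class table earlier in this file.
TRCLASS_OKAY_TOTAL = 269757

print(f"kept={kept_total} dropped={dropped_total} all={grand_total}")
print("matches trclass okay total:", grand_total == TRCLASS_OKAY_TOTAL)
```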
aaeval/...................................
Basic orthology comparison, blastp x human,daphnia,tribolium
Align averages
refspp   nref  ixodes  ixoevg  tetur  ztickevg
daphnia  5827     377     471    408       436
human    7780     411     515    444       479
trica    5700     368     462    400       429
Human genes found (n=16631)
geneset     hit%  alnh  alnt  Gene set method, species
................................................................
ixodes.evg  95.7   434   415  mRNA-assembly, deer tick
ztick.evg   91.4   416   380  mRNA-assembly, zebra tick
ixodes.gno  89.5   364   326  genome-predict, deer tick
tetur.gno   83.2   399   332  genome-predict, spider mite
................................................................
-------------------------------------------------
Main transcript checks
wc -l evg1itick.maint1.*.tab
15958 evg1itick.maint1.homol.tab : 738 have TEnames, rest have other above species homology
133605 evg1itick.maint1.noho.tab : no homol to above 3 ref species, median aa size ~ 100aa
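The two line counts above can be tied back to the publicset classes. This sketch assumes the wc -l counts are data rows with no header lines; under that assumption the homol and noho tables together cover exactly the main plus noclass transcripts, and only about a tenth of them have homology to the three reference species.

```python
# Line counts copied from the wc -l output above.
homol_rows = 15958   # evg1itick.maint1.homol.tab
noho_rows = 133605   # evg1itick.maint1.noho.tab
te_named = 738       # homol rows that carry TE names

# Publicset counts for main and noclass, from the class count list in this file.
main_plus_noclass = 119588 + 29975

maint1_total = homol_rows + noho_rows
homol_fraction = homol_rows / maint1_total
non_te_homol = homol_rows - te_named

print(f"maint1 total={maint1_total} (main+noclass={main_plus_noclass})")
print(f"with homology: {homol_fraction:.1%}; non-TE homology rows: {non_te_homol}")
```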
noho set:
cd-hit90 cluster of maint1.noho.aa, only 1400 cluster w/ others, not useful to reclassify
in:120906 finished 119341 clusters
12336 utrorf's : discard all but longest (>= 200aa, ~1000)
60000 short half of noho, <= 110 aa, 24000 are partial or/and utrpoor/bad. discard all?
check random set of noho for NR blastp hits .. any contams, TE genes? bacteria?
dont know w/o doing blastp what proportion of noho set to keep;
likely that large NR protein set will categorize many of these in bins 1-3, leaving the rest in bin 4:
1. eukaryote (may include ixodes/ticks)
2. bacterial/contam
3. transposon/virus/contam
4. no detectable homology .. probable ixodes, maybe not
Random noho check at NCBI NR Blastp
Ixosca1aEVm000051t1 3039aa; NRBl=polyprotein-virus*;
Ixosca1aEVm006413t1 489aa; NRBl=hypothetical protein IscW_ISCW011049 [Ixodes scapularis] (98aa, hi ident) + other Ixo
.. also pea aphid LOC100574918, daphnia, other weaker partial matches
Ixosca1aEVm029071t1 185aa; NRBl=IscW_ISCW014947, other weak (ecoli)
Ixosca1aEVm018013t1 235aa; NRBl=LOC100888907 [Strongylocentrotus purpuratus] and other Euks, ~full match
Ixosca1aEVm041859t1 159aa; NRBl=weak hits to bacteria
Ixosca1aEVm024727t1 199aa; NRBl=weak hits to euks
Ixosca1aEVm045897t1 153aa; NRBl=weak hit to 1 bact
-------------
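The binning proposed for the noho set can be sketched as a tally over the seven randomly checked transcripts above. The group label and the weak flag for each transcript are a hand reading of the NRBl notes, not program output, and the function noho_bin with its count_weak_hits switch is hypothetical; a real triage would work from blastp tables and NCBI taxonomy.

```python
from collections import Counter

# Bin names follow the numbered list in the notes above.
BINS = {
    "eukaryote": "1. eukaryote",
    "bacteria": "2. bacterial/contam",
    "virus": "3. transposon/virus/contam",
    "transposon": "3. transposon/virus/contam",
}
NO_HOMOLOGY = "4. no detectable homology"


def noho_bin(hit_group, weak=False, count_weak_hits=True):
    """Assign a bin from the best-hit group (hypothetical helper).

    hit_group: 'eukaryote', 'bacteria', 'virus', 'transposon', or None for no hit.
    weak: True when the notes describe the hit as weak.
    count_weak_hits: if False, weak hits are treated as no detectable homology.
    """
    if hit_group is None or (weak and not count_weak_hits):
        return NO_HOMOLOGY
    return BINS[hit_group]


# (transcript ID, hand-assigned group, weak hit?) read from the NRBl notes above.
checked = [
    ("Ixosca1aEVm000051t1", "virus", False),
    ("Ixosca1aEVm006413t1", "eukaryote", False),
    ("Ixosca1aEVm029071t1", "eukaryote", False),
    ("Ixosca1aEVm018013t1", "eukaryote", False),
    ("Ixosca1aEVm041859t1", "bacteria", True),
    ("Ixosca1aEVm024727t1", "eukaryote", True),
    ("Ixosca1aEVm045897t1", "bacteria", True),
]

tally = Counter(noho_bin(g, w) for _, g, w in checked)
strict = Counter(noho_bin(g, w, count_weak_hits=False) for _, g, w in checked)

for label in sorted(set(tally) | set(strict)):
    print(f"{label:30s} lenient={tally[label]} strict={strict[label]}")
```

Seven transcripts are far too few to estimate proportions for the whole noho set; the point is only that the outcome depends heavily on how weak hits are counted.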