Index of /EvidentialGene/arthropods/triboliumbeetle/evg2tribol
Name Last modified Size
Parent Directory 26-Dec-2014 13:19 -
aaeval/ 26-Dec-2014 13:55 -
evg2tribol.mrna2tsa.info 09-Dec-2014 13:44 1k
evg2tribol.mrna2tsa.log 09-Dec-2014 13:44 18k
evg2tribol.sra_result.csv 30-Nov-2014 22:56 17k
evg2tribol.tr2aacds.info 09-Dec-2014 13:16 1k
evg2tribol.tr2aacds.log 09-Dec-2014 11:36 16k
evg2tribol.trclass.gz 09-Dec-2014 11:24 51.1M
inputset/ 09-Dec-2014 14:03 -
publicset/ 26-Dec-2014 13:40 -
rnasource_tribcas14_sra.csv 30-Nov-2014 22:56 17k
rnasource_tribcas14_sra.readme.txt 24-Dec-2014 17:29 2k
run_evgmrna2tsa.sh 09-Dec-2014 10:20 2k
runtr2cds.sh 04-Dec-2014 11:56 1k
scripts/ 24-Dec-2014 17:19 -
subsets/ 18-Dec-2014 14:37 -
Tribolium_castaneum evg2tribol. 2014.12
EvidentialGene mRNA gene set assembled from RNA-seq
by Don Gilbert, gilbertd at indiana edu
http://arthropods.eugenes.org/EvidentialGene/
EvidentialGene gene set evg2tribol for Tribolium_castaneum is more
complete than 2 other recent Tribolium gene sets, measured by orthology
completeness. See Figs 4d, 5a, and 6d of
http://arthropods.eugenes.org/EvidentialGene/arthropods/Arthropod_Orthology_Completeness/
== evg2tribol public data set ==============================
Gene data files in evg2tribol/publicset/
evg2tribol.fin1alt.aa.gz evg2tribol.fin1cull.aa.gz evg2tribol.fin1loc.aa.gz
evg2tribol.fin1alt.ann.txt.gz evg2tribol.fin1cull.ann.txt.gz evg2tribol.fin1loc.ann.txt.gz
..
where file names are "evg2tribol.fin1"{class}.{contents}.gz:
Gene class is loc (primary transcript/locus), alt (alternate transcripts), or cull (uninteresting extras).
Gene sequences are in fasta, with a suffix for contents: aa (protein), cds (coding transcript), mrna (full transcript).
Gene locations in gff are mapped to the tcas3 assembly of NCBI genomes.
Gene annotations table is ann.txt (see below).
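The naming scheme above can be sketched in Python. This is a minimal illustration, assuming the pattern evg2tribol.fin1{class}.{contents}.gz seen in the listed file names; the helper public_file() is hypothetical and not part of EvidentialGene, and only the suffixes named in this readme are accepted.

```python
# Sketch only: the name pattern is inferred from the files listed above,
# and public_file() is a hypothetical helper, not part of EvidentialGene.
GENE_CLASSES = ("loc", "alt", "cull")        # primary, alternates, culled extras
CONTENTS = ("aa", "cds", "mrna", "ann.txt")  # suffixes named in this readme


def public_file(gene_class, contents, prefix="evg2tribol.fin1"):
    """Return the assumed publicset file name for one gene class and content type."""
    if gene_class not in GENE_CLASSES:
        raise ValueError(f"unknown gene class: {gene_class}")
    if contents not in CONTENTS:
        raise ValueError(f"unknown contents suffix: {contents}")
    return f"{prefix}{gene_class}.{contents}.gz"


print(public_file("loc", "aa"))        # evg2tribol.fin1loc.aa.gz
print(public_file("cull", "ann.txt"))  # evg2tribol.fin1cull.ann.txt.gz
```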
== evg2tribol class table ==================================
2014.12.09 ; Evigene tr2aacds pipeline summary
# Class Table for evg2tribol.trclass
class        %okay  %drop  n_okay   n_drop
althi          2.9    5.5   50021    95581
althi1        14.6   32.3  251929   558120  # large count includes mix of uninformative (clone-diff, teprots), some ok
althia2          0    0.3       0     5771
altmfrag       1.4    0.7   24169    13571
altmfraga2     0.1    0.2    3087     3644
altmid           1    0.5   18376     9820
altmida2         0      0    1214      656
main           1.1    2.5   20266    44573
maina2         0.1    0.1    2343     2945
noclass        0.3    4.6    6576    79591
noclassa2        0      0      22      324
parthi           0   16.2       0   280898
parthi1          0   11.8       0   204005
parthia2         0    2.6       0    46490
---------------------------------------------
total         21.9     78  378003  1345989
=============================================
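The two percent columns appear to be each class count divided by the grand total of okay plus drop transcripts (378003 + 1345989), truncated to one decimal. That reading is inferred from the numbers in the table, not stated by the pipeline; the function name class_percent is hypothetical. A short check:

```python
# Totals from the class table above.
TOTAL_OKAY, TOTAL_DROP = 378003, 1345989
GRAND_TOTAL = TOTAL_OKAY + TOTAL_DROP  # all classified transcripts


def class_percent(count, total=GRAND_TOTAL):
    """Percent of all transcripts, truncated (not rounded) to one decimal."""
    return (1000 * count // total) / 10


print(class_percent(251929), class_percent(558120))          # althi1 okay, drop
print(class_percent(TOTAL_OKAY), class_percent(TOTAL_DROP))  # total row
```

Plain rounding would give 32.4 for the althi1 drop count and 2.6 for the main drop count, where the table shows 32.3 and 2.5, which is why truncation is assumed here.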
# AA-quality for okay set of evg2tribol.aa.qual (no okalt): all and longest 1000 summary
okay.top n=1000; average=1834; median=1491; min,max=1145,18274; sum=1834289; gaps=3507,3.5
okay.all n=29207; average=302; median=157; min,max=36,18274; sum=8842928; gaps=73490,2.5
--------
Notes:
Done -- need to run cull step as for apis, remove some of frags, althi1;
Done -- remove TE gene/prots, using CDD TE domains and ref blast hits ; have fair number of these, some map to genome
Done -- need table of correspondence to tcas4, tcas3ncbi: mRNA equivs and genome-mapped
Done -- also cull alts w/ identical prots; pubset aa nr=231966/378003
.. keep only 1 of each isoform (?) or use some criteria to keep alts w/ ident aa
Done -- annot: names tab w/ best ref gene name and CDD names, gmap align.tab or map.attr,
eqgene tables for tcas4aug and tcas3ncbi for Dbxref, eqref columns
Cull steps
additional removals from okayset, using sensible criteria.
Culls are retained in public set as separate data, may contain useful genes but less likely.
cull1: TE protein genes, w/ CDD hits, arp7 hits; some in cull2/nopathnoho
cull2: loci without homology nor genome map, nogenomap-nohomol, tcas4evg/gmap/
cull3: uninformative alts, identical prots and short/partial/utrbad prots
cull5: genomap main.eqgene overlaps, cull overlapped(cds>33 or exon>50, splits?) + lowerqual of eqgenes
also remove alts of main over culls..
cull totals:
310974 publicset/evg2tribol.ann2.tab
1514 publicset/evg2tribol.ann2.cull1
1910 publicset/evg2tribol.ann2.cull2
78691 publicset/evg2tribol.ann2.cull3
8081 publicset/evg2tribol.ann2.cull5
culled.ids n=81908 for 1,2,3; n=89987 for 1,2,3,5
keep n= 229066; t1.n=29626 loci ; ta.n=199440 alts (ga.n=15107)
keep nopath t1.n=1077, noname t1.n=10616,
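The cull bookkeeping above amounts to set arithmetic on gene IDs. The sketch below uses toy IDs and a hypothetical helper keep_ids(); the real cull files and their column layout are not described here, so reading them is left out. The line counts of cull1, cull2 and cull3 sum to 82115, more than the 81908 culled IDs reported, which would be consistent with some IDs appearing in more than one cull list, though that is an assumption.

```python
def keep_ids(all_ids, *cull_lists):
    """Return (kept IDs in original order, set of culled IDs)."""
    culled = set().union(*cull_lists)
    kept = [gene_id for gene_id in all_ids if gene_id not in culled]
    return kept, culled


# Toy IDs for illustration only; "g3" sits in two cull lists.
all_ids = ["g1", "g2", "g3", "g4", "g5"]
cull1, cull2, cull3 = ["g2", "g3"], ["g3"], ["g5"]
kept, culled = keep_ids(all_ids, cull1, cull2, cull3)
print(kept, len(culled))

# Counts reported above: table rows minus culls 1,2,3 give the keep count,
# which is also the sum of kept loci (t1) and kept alternates (ta).
assert 310974 - 81908 == 229066 == 29626 + 199440
```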
correspondence to other gene sets
tcas4aug t1n=16110, 11560/13331 uniq t1/ta, 2976 aug t1 dups
tcas3ncb t1n=14492, 11420/15261 uniq t1/ta, 2060 nc t1 dupid
=============
See orthology completeness of this gene set vs other tribolium gene sets, at
http://arthropods.eugenes.org/EvidentialGene/arthropods/Arthropod_Orthology_Completeness/
aaeval-bitscore comparison of 4 tribol gene sets
# summary for beetle4enoculset-refarp7s8set2.arpbest7bits, iscore=2
# ref: ARP7f, ngene=13566, ngroup=13566, ncomgrp=8723 ................
tspp           tng   png  pncom   bits   algn   pal   dlen   bcom   acom  pcom   dcom  best  same  diff  miss  pbest  ppoor
tribcas1      9541  70.3   83.8  321.3  331.5  54.2   66.5  440.4  431.3  67.0   78.1    63  4442  2803  1387   51.8   48.2
tribcas14nc  11236  82.8   94.4  353.3  377.9  63.0   84.2  476.3  480.5  75.2   94.7   185  6159  1887   464   73.0   27.0
tribcas4a    10320  76.1   88.5  336.5  354.4  58.1   76.2  457.3  455.1  70.7   86.6    64  5244  2410   977   61.0   39.0
tribca4evg2  13253  97.7   98.8  377.4  420.8  72.0  108.5  491.2  502.7  78.7  103.0  1912  5960   749    74   90.5    9.5
------------
tspp = target gene set; tng = number of ref genes with a best-aligned gene in the target set;
png, pncom = percent of total ref genes, or of common ref genes;
com = subset of common arthropod gene families;
bits = bitscore, algn = blast align score, pal = %align to ref, dlen = difference in length to ref;
bcom, acom, pcom, dcom are the above for common ref genes;
best, same, diff, miss = count of per gene-family quality class to reference genes among target gene sets;
pbest = percent best+same; ppoor = percent diff+miss
The above statistics are plotted in Arthropod_Orthology_Completeness Fig 4d,
pncom = % Nref_common, pcom = % Align_common, pbest = % Best_of_Species
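The png, pbest and ppoor columns can be recomputed from the counts in the table, as a check on how they are defined. This sketch assumes png = tng / ngene and that pbest and ppoor are taken over the sum of the four quality-class counts (8695 for every gene set); both assumptions reproduce the tabulated values. The names GENE_SETS and summarize are hypothetical.

```python
NGENE = 13566  # reference genes (ngene in the table header)

# tng and quality-class counts copied from the table above.
GENE_SETS = {
    "tribcas1":    dict(tng=9541,  best=63,   same=4442, diff=2803, miss=1387),
    "tribcas14nc": dict(tng=11236, best=185,  same=6159, diff=1887, miss=464),
    "tribcas4a":   dict(tng=10320, best=64,   same=5244, diff=2410, miss=977),
    "tribca4evg2": dict(tng=13253, best=1912, same=5960, diff=749,  miss=74),
}


def summarize(row):
    """Return (png, pbest, ppoor) as percentages with one decimal."""
    ranked = row["best"] + row["same"] + row["diff"] + row["miss"]
    png = round(100 * row["tng"] / NGENE, 1)
    pbest = round(100 * (row["best"] + row["same"]) / ranked, 1)
    ppoor = round(100 * (row["diff"] + row["miss"]) / ranked, 1)
    return png, pbest, ppoor


for name, row in GENE_SETS.items():
    print(name, *summarize(row))
```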
beetle4enoculset-refarp7s8set2.arpbest7bits is tabulated as largest bitscore in target gene sets (tspp)
to each of reference gene family genes (8 species from ARP7 orthology analysis), using
blastp -query refarp7s8set2.aa -db beetle4eset of tspp genesets
For each of the ngroup=13566 reference groups (ncomgrp=8723 common groups) in the refarp7 proteins table
(1 row per reference group), each target gene set has a row entry, or a missing value, for its best aligned gene.
Average statistics of ngene (tng), bits, algn, dlen are summarized from this table, along with average
percentages relative to reference genes.
The best,same,diff,miss are counts of per ref-gene rank (bitscore or alignment score) comparing the 4 gene
sets.
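The "largest bitscore per reference gene" tabulation described above can be sketched as follows. This assumes blastp output in the standard 12-column tabular format (-outfmt 6, bitscore in the last column); the readme does not say which output format was actually used, and the rows in the demo are made up. The function name best_bits is hypothetical.

```python
import csv
import io


def best_bits(tabular_text):
    """Map each query (reference gene) to its (subject, bitscore) with the largest bitscore."""
    best = {}
    for cols in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not cols or cols[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject, bits = cols[0], cols[1], float(cols[11])
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return best


# Made-up rows for illustration; IDs and scores are not from the real analysis.
demo = (
    "refA\tgeneX\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t180.0\n"
    "refA\tgeneY\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t210.5\n"
    "refB\tgeneZ\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-10\t60.2\n"
)
print(best_bits(demo))
```

Reference genes with no hit in a target set are simply absent from the result, which corresponds to the missing value mentioned above.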
In Fig 6d of Arthropod_Orthology_Completeness (Tribolium beetle gene sets, per gene alignment to
orthology reference genes), individual ref-gene group rows are bar-graphed as alignment %,
with best,same,diff,miss scores as bar height and color for each tspp target gene set. These
graphs indicate that the best,diff,miss rankings are spread over the protein size range of ref gene families,
and that each gene set contains different best and missed families. The oldest, tribcas1, has the most misses
and poor diff entries, while tribcas4evg2 has the most best-aligned entries to ref genes.
Orthology gene groups are those of the ARP7 OrthoMCL analysis set of 10 species.
This gene orthology database ARP7 is at arthropods.eugenes.org/arthropods/orthologs/ARP7/
tribcas14nc is an ARP7 reference gene set; however, it is excluded from refarp7s8set2.aa for this analysis.
=============
Annotation table contents, evg2tribol.fin1loc.ann.txt
PublicID : public id
OrigID : original gene transcript id
ClassV : class, version as main/alternate/culled
TrLen : transcript length
CDSoff : coding start-end in transc.
AAqual : protein size,%coding,quality
TrGaps : gap count
MapCov : map % coverage on genome assembly
MapIdn : map % identity on genome assembly
MapInExon : map intron/exon count (introns with valid splice sites)
MapLocus : map location on gen. assembly
MapPath : map split paths on gen. assembly
DbXref : database cross reference IDs, from blastp homology, orthoMCL group and conserved domains
OGenes : OtherGene set IDs (ncbi14, tcas4augustus)
NamePct : naming percent alignment
ProductName : name from reference protein (1st in DbXref)
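--------------
Example record (values in the field order above):
PublicID    : Tribca2aEVm000001t1
OrigID      : tcas4sb2p8nmvelvk47Loc2069t1
ClassV      : main,1
TrLen       : 56083
CDSoff      : 349-55173
AAqual      : 18274,97%,complete
TrGaps      : 0
MapCov      : 100
MapIdn      : 99
MapInExon   : 88/93
MapLocus    : tcas3NC_007424:13110520-13178971:-
MapPath     : 0
DbXref      : RefID:ARP7f_G2613,dromel:FBgn0053196,CDD:249587,CDD:224280,CDD:240526,CDD:238011/2,CDD:236766
OGenes      : tcas3nc:XP_008197897/90
NamePct     : 100%,19215/11023,18624
ProductName : Neurogenic locus notch protein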
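Several fields of the example record pack more than one value (AAqual, CDSoff, MapLocus). The parsers below are a minimal sketch based only on the example values shown; the function names are hypothetical, and other rows may use forms not seen in this one record.

```python
def parse_aaqual(value):
    """'18274,97%,complete' -> (18274, 97, 'complete')."""
    size, pct_coding, quality = value.split(",")
    return int(size), int(pct_coding.rstrip("%")), quality


def parse_cdsoff(value):
    """'349-55173' -> (349, 55173)."""
    start, end = value.split("-")
    return int(start), int(end)


def parse_maplocus(value):
    """'tcas3NC_007424:13110520-13178971:-' -> (scaffold, start, end, strand)."""
    scaffold, span, strand = value.rsplit(":", 2)
    start, end = span.split("-")
    return scaffold, int(start), int(end), strand


aa_len, pct_cds, quality = parse_aaqual("18274,97%,complete")
cds_start, cds_end = parse_cdsoff("349-55173")
print(aa_len, pct_cds, quality, cds_end - cds_start + 1)
print(parse_maplocus("tcas3NC_007424:13110520-13178971:-"))
```

For this record the CDS span is 54825 bases, or 18275 codons, one more than the 18274 residues. That would be consistent with the stop codon being counted in CDSoff, though this is an inference from a single row.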
PublicID has the form Tribca2aEVm000029t1, Tribca2aEVm000029t2, ..; the t1 .. t100 suffix numbers the
primary (t1) and alternate transcripts of one locus.
Low locus ID numbers are larger proteins; sizes are in .aa.qual tables.
Tribca2aEVm000001t1 18274,97%,complete
Tribca2aEVm000002t1 15475,97%,complete
Tribca2aEVm000003t1 14640,98%,partial3
Tribca2aEVm000004t1 9923,99%,partial3
Tribca2aEVm000005t1 8806,98%,complete
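A PublicID can be split into its locus number and transcript number with a regular expression. This sketch assumes every ID follows the Tribca2aEVm{locus}t{n} form shown above; the function names are hypothetical.

```python
import re

# Assumed layout: fixed prefix, zero-padded locus number, "t", transcript number.
PUBLIC_ID = re.compile(r"^Tribca2aEVm(?P<locus>\d+)t(?P<alt>\d+)$")


def parse_public_id(public_id):
    """'Tribca2aEVm000029t2' -> (29, 2)."""
    match = PUBLIC_ID.match(public_id)
    if match is None:
        raise ValueError(f"not a recognized PublicID: {public_id}")
    return int(match.group("locus")), int(match.group("alt"))


def is_primary(public_id):
    """True for the t1 (primary) transcript of a locus."""
    return parse_public_id(public_id)[1] == 1


print(parse_public_id("Tribca2aEVm000029t2"))
print(is_primary("Tribca2aEVm000001t1"))
```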
OrigID "tcas4sb2p8nmvelvk47Loc2069t1" encodes strain source (sb=SB or cr=Cro1
of triboliumbeetle/sradata/), transcript assembler
(velv=VelvetO, soap=SOAPdenovoTr, trin=Trinity), RNA-shredding kmer k47,
digital normalization (nm), and other data-slice/parameter information.
Each transcript assembly is of a single strain, but alternates at a locus
can come from both strains, and so include some strain differences. "cull3" likely
includes SNP differences across strains that produce identical protein
isoforms.
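The OrigID tokens described above can be picked out with simple pattern searches. This is a rough sketch built from the single example ID: it assumes the "tcas4" prefix is followed directly by the strain code, and it searches for the assembler, kmer and "nm" tokens anywhere in the ID, which could misfire on IDs laid out differently. The function name describe_orig_id is hypothetical.

```python
import re

STRAINS = {"sb": "SB", "cr": "Cro1"}
ASSEMBLERS = {"velv": "VelvetO", "soap": "SOAPdenovoTr", "trin": "Trinity"}


def describe_orig_id(orig_id):
    """Best-effort decoding of strain, assembler, kmer and normalization tokens."""
    strain = re.match(r"tcas4(sb|cr)", orig_id)         # assumed position of strain code
    assembler = re.search(r"velv|soap|trin", orig_id)
    kmer = re.search(r"k(\d+)", orig_id)
    return {
        "strain": STRAINS[strain.group(1)] if strain else None,
        "assembler": ASSEMBLERS[assembler.group(0)] if assembler else None,
        "kmer": int(kmer.group(1)) if kmer else None,
        "normalized": "nm" in orig_id,                  # digital normalization token
    }


print(describe_orig_id("tcas4sb2p8nmvelvk47Loc2069t1"))
```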
=============