Tribolium_castaneum evg2tribol. 2014.12  
EvidentialGene mRNA gene set assembled from RNA-seq
by Don Gilbert, gilbertd at indiana edu

The EvidentialGene gene set evg2tribol for Tribolium_castaneum is more
complete than 2 other recent Tribolium gene sets, as measured by orthology
completeness.  See Figs 4d, 5a, and 6d of

== evg2tribol public data set ==============================
Gene data files in evg2tribol/publicset/
  evg2tribol.fin1alt.aa.gz         evg2tribol.fin1cull.aa.gz        evg2tribol.fin1loc.aa.gz
  evg2tribol.fin1alt.ann.txt.gz    evg2tribol.fin1cull.ann.txt.gz   evg2tribol.fin1loc.ann.txt.gz
where file names are "evg2tribol.fin1"{class}.{contents}.gz,
Gene class is loc (primary transcript/locus), alt (alternate transcripts), cull (uninteresting extras)
Gene sequences are in fasta with suffix for contents: aa (protein), cds (coding transcript), mrna (full transcript)
Gene locations in gff are mapped to tcas3 assembly of NCBI genomes
Gene annotations table is ann.txt (see below)
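
A minimal Python sketch of the naming scheme above, assuming file names are
built as evg2tribol.fin1{class}.{contents}.gz (as in the listing) and that
sequence files are gzipped fasta.  The helper names publicset_name and
count_fasta_records are hypothetical, not part of EvidentialGene.

```python
import gzip

# Gene classes named in this README: primary locus, alternates, culled extras.
GENE_CLASSES = ("loc", "alt", "cull")


def publicset_name(gene_class, contents, prefix="evg2tribol.fin1"):
    """Build a publicset file name, e.g. evg2tribol.fin1loc.aa.gz."""
    if gene_class not in GENE_CLASSES:
        raise ValueError(f"unknown gene class: {gene_class}")
    return f"{prefix}{gene_class}.{contents}.gz"


def count_fasta_records(path):
    """Count '>' header lines in a gzipped fasta file."""
    with gzip.open(path, "rt") as fh:
        return sum(1 for line in fh if line.startswith(">"))
```

For example, publicset_name("alt", "ann.txt") gives
evg2tribol.fin1alt.ann.txt.gz, matching the listing.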

== evg2tribol class table ==================================
2014.12.09 ; Evigene tr2aacds pipeline summary

# Class Table for evg2tribol.trclass 
class           %okay   %drop   n.okay  n.drop
althi           2.9     5.5     50021   95581
althi1          14.6    32.3    251929  558120    # large count includes mix of uninformative(clone-diff,teprots),some ok
althia2         0       0.3     0       5771
altmfrag        1.4     0.7     24169   13571
altmfraga2      0.1     0.2     3087    3644
altmid          1       0.5     18376   9820
altmida2        0       0       1214    656
main            1.1     2.5     20266   44573
maina2          0.1     0.1     2343    2945
noclass         0.3     4.6     6576    79591
noclassa2       0       0       22      324
parthi          0       16.2    0       280898
parthi1         0       11.8    0       204005
parthia2        0       2.6     0       46490
total           21.9    78      378003  1345989
# AA-quality for okay set of evg2tribol.aa.qual (no okalt): all and longest 1000 summary
okay.top         n=1000; average=1834; median=1491; min,max=1145,18274; sum=1834289; gaps=3507,3.5
okay.all         n=29207; average=302; median=157; min,max=36,18274; sum=8842928; gaps=73490,2.5
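
The first two numeric columns of the class table appear to be percentages of
all classified transcripts (okay + drop), and the last two are counts.  A small
Python check, using counts copied from a few rows of the table.  Truncating to
one decimal (rather than rounding) is an assumption that happens to reproduce
the printed values.

```python
# (okay, drop) counts copied from a subset of class table rows.
counts = {
    "althi": (50021, 95581),
    "althi1": (251929, 558120),
    "main": (20266, 44573),
    "parthi": (0, 280898),
}
TOTAL_OKAY, TOTAL_DROP = 378003, 1345989
total = TOTAL_OKAY + TOTAL_DROP  # all transcripts in the class table


def pct(n, total=total):
    """Percent of all transcripts, truncated (not rounded) to one decimal."""
    return (n * 1000 // total) / 10


for name, (okay, drop) in counts.items():
    print(f"{name:8s} {pct(okay):5.1f} {pct(drop):5.1f} {okay:8d} {drop:8d}")
print(f"{'total':8s} {pct(TOTAL_OKAY):5.1f} {pct(TOTAL_DROP):5.1f}")
```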

  Done -- need to run cull step as for apis, remove some of frags, althi1; 
  Done -- remove TE gene/prots, using CDD TE domains and ref blast hits ; have fair number of these, some map to genome
  Done -- need table of correspondence to tcas4, tcas3ncbi: mRNA equivs and genome-mapped
  Done -- also cull alts w/ identical prots; pubset aa nr=231966/378003 
     .. keep only 1 of each  isoform (?) or use some criteria to keep alts w/ ident aa
  Done -- annot: names tab w/ best ref gene name and CDD names, gmap or map.attr, 
      eqgene tables for tcas4aug and tcas3ncbi for Dbxref, eqref columns

Cull steps
  additional removals from okayset, using sensible criteria.
  Culls are retained in public set as separate data, may contain useful genes but less likely.

cull1: TE protein genes, w/ CDD hits, arp7 hits; some in cull2/nopathnoho
cull2: loci with neither homology nor genome map, nogenomap-nohomol, tcas4evg/gmap/
cull3: uninformative alts, identical prots and short/partial/utrbad prots
cull5: genomap main.eqgene overlaps, cull overlapped(cds>33 or exon>50, splits?) + lowerqual of eqgenes
       also remove alts of main over culls..

cull totals:
  310974 publicset/
    1514 publicset/evg2tribol.ann2.cull1
    1910 publicset/evg2tribol.ann2.cull2
   78691 publicset/evg2tribol.ann2.cull3
    8081 publicset/evg2tribol.ann2.cull5

  culled.ids n=81908 for 1,2,3; n=89987 for 1,2,3,5
  keep  n= 229066; t1.n=29626 loci ; ta.n=199440 alts (ga.n=15107)
    keep nopath t1.n=1077, noname t1.n=10616, 
  correspondence to other gene sets
    tcas4aug t1n=16110, 11560/13331 uniq t1/ta, 2976 aug t1 dups
    tcas3ncb t1n=14492, 11420/15261 uniq t1/ta, 2060 nc t1 dupid

See orthology completeness of this gene set vs other tribolium gene sets, at

aaeval-bitscore comparison of 4 tribol gene sets 
# summary for beetle4enoculset-refarp7s8set2.arpbest7bits, iscore=2
# ref: ARP7f, ngene=13566, ngroup=13566, ncomgrp=8723 ................

tspp        tng   png   pncom bits  algn  pal   dlen  bcom  acom  pcom  dcom  best  same  diff  miss  pbest ppoor
tribcas1    9541  70.3  83.8  321.3 331.5 54.2  66.5  440.4 431.3 67.0  78.1  63    4442  2803  1387  51.8  48.2
tribcas14nc 11236 82.8  94.4  353.3 377.9 63.0  84.2  476.3 480.5 75.2  94.7  185   6159  1887  464   73.0  27.0
tribcas4a   10320 76.1  88.5  336.5 354.4 58.1  76.2  457.3 455.1 70.7  86.6  64    5244  2410  977   61.0  39.0
tribca4evg2 13253 97.7  98.8  377.4 420.8 72.0  108.5 491.2 502.7 78.7  103.0 1912  5960  749   74    90.5  9.5
  png, pncom = percent of total ref genes, or common ref genes;
  com = subset of common arthropod gene families;
  bits = bitscore, algn = blast align score, pal = %align to ref, dlen = difference in length to ref;
  bcom, acom, pcom, dcom are the above for common ref genes;
  best, same, diff, miss = count of per gene-family quality class to reference genes among target gene sets;
     pbest = percent best+same, ppoor = percent diff+miss
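
The percent columns can be recomputed from the count columns of the table.
A Python sketch, with counts copied from the table and ngene, ncomgrp from the
"# ref:" line.  The pncom formula (ranked groups that are not missed, over
ncomgrp) is inferred from the numbers, not stated by the source, and the
function name summarize is hypothetical.

```python
NGENE, NCOMGRP = 13566, 8723  # from the "# ref: ARP7f" summary line

# tspp: (tng, best, same, diff, miss), copied from the table above
rows = {
    "tribcas1": (9541, 63, 4442, 2803, 1387),
    "tribcas14nc": (11236, 185, 6159, 1887, 464),
    "tribcas4a": (10320, 64, 5244, 2410, 977),
    "tribca4evg2": (13253, 1912, 5960, 749, 74),
}


def summarize(tng, best, same, diff, miss):
    """Recompute png, pncom, pbest, ppoor from the count columns."""
    ranked = best + same + diff + miss  # ref groups ranked across gene sets
    return {
        "png": round(100 * tng / NGENE, 1),
        "pncom": round(100 * (ranked - miss) / NCOMGRP, 1),  # inferred formula
        "pbest": round(100 * (best + same) / ranked, 1),
        "ppoor": round(100 * (diff + miss) / ranked, 1),
    }


for tspp, counts in rows.items():
    print(tspp, summarize(*counts))
```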

 The above statistics are plotted in Arthropod_Orthology_Completeness Fig 4d, 
   pncom = % Nref_common, pcom = % Align_common, pbest = % Best_of_Species

 beetle4enoculset-refarp7s8set2.arpbest7bits is tabulated as largest bitscore in target gene sets (tspp)
 to each of reference gene family genes (8 species from ARP7 orthology analysis), using 
    blastp -query refarp7s8set2.aa -db beetle4eset of tspp genesets  
 For each of ngroup=13566, ncomgrp=8723, in refarp7 proteins table (1 row per reference group), each target gene set 
 has a row entry, or missing value, for best aligned gene.  Average statistics of ngene (ng), bits, algn, 
 dlen are summarized from this table, along with average percentages relative to reference genes.  
 The best,same,diff,miss are counts of per ref-gene rank (bitscore or alignment score) comparing the 4 gene sets.
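
 The tabulation step described above (largest bitscore per reference gene in
 each target gene set, with a missing value where a set has no hit) can be
 sketched in Python.  This is an illustration, not the script used for this
 analysis: the function name best_bits_table, the input tuple layout, and the
 example IDs and scores are all hypothetical.

```python
from collections import defaultdict


def best_bits_table(hits, genesets):
    """hits: iterable of (ref_gene, geneset, target_gene, bitscore).

    Returns {ref_gene: {geneset: (target_gene, bitscore) or None}},
    keeping the largest bitscore per reference gene and gene set.
    Reference genes with no hit in any set do not appear.
    """
    best = defaultdict(dict)
    for ref, tspp, target, bits in hits:
        cur = best[ref].get(tspp)
        if cur is None or bits > cur[1]:
            best[ref][tspp] = (target, bits)
    # Fill in None where a gene set has no hit for a reference gene.
    return {ref: {t: row.get(t) for t in genesets} for ref, row in best.items()}


# Hypothetical IDs and scores, for illustration only.
hits = [
    ("ref1", "tribcas1", "geneA", 300.0),
    ("ref1", "tribcas1", "geneB", 350.0),
    ("ref1", "tribca4evg2", "geneC", 410.0),
    ("ref2", "tribca4evg2", "geneD", 120.0),
]
table = best_bits_table(hits, ["tribcas1", "tribca4evg2"])
```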

 In Fig 6d. Tribolium beetle gene sets, per gene alignment to orthology reference genes of 
 Arthropod_Orthology_Completeness, individual ref-gene group rows are bar-graphed as alignment %,
 with best,same,diff,miss scores as bar height and color for each tspp target gene set.  These
 graphs indicate the best,diff,miss rankings are spread over the protein size range of ref gene families,
 and each gene set contains different best and missed families. The oldest tribcas1 has the most misses and poor
 diff entries, while tribcas4evg2 has the most best-aligned ref genes.

 Orthology gene groups are those of ARP7 OrthoMCL analysis set of 10 species. 
 This gene orthology database ARP7 is at
 tribcas14nc is an ARP7 reference gene set, however it is excluded in refarp7s8set2.aa for this analysis. 

Annotation table contents, evg2tribol.fin1loc.ann.txt
	PublicID  : public id
	OrigID  : original gene transcript id
	ClassV  : class, version as main/alternate/culled
	TrLen   : transcript length
	CDSoff  : coding start-end in transc.
	AAqual  : protein size,%coding,quality
	TrGaps  : gap count
	MapCov  : map % coverage on genome assembly
	MapIdn  : map % identity on genome assembly
	MapInExon : map intron/exon count (introns with valid splice sites)
	MapLocus  : map location on gen. assembly
	MapPath   : map split paths on gen. assembly
	DbXref    : database cross reference IDs, from blastp homology, orthoMCL group and conserved domains
	OGenes    : OtherGene set IDs (ncbi14, tcas4augustus)
	NamePct   : naming percent alignment
	ProductName : name from reference protein (1st in DbXref), e.g. Neurogenic locus notch protein
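
A minimal Python sketch for reading rows of the annotation table, assuming it
is tab-delimited with one column per field in the order listed above (not
verified against the file itself).  The AAqual format "size,%coding,quality"
follows the description above; parse_ann_line and parse_aaqual are
hypothetical helper names.

```python
# Column names in the order listed in this README (assumed to be file order).
ANN_COLUMNS = [
    "PublicID", "OrigID", "ClassV", "TrLen", "CDSoff", "AAqual", "TrGaps",
    "MapCov", "MapIdn", "MapInExon", "MapLocus", "MapPath", "DbXref",
    "OGenes", "NamePct", "ProductName",
]


def parse_ann_line(line):
    """Split one tab-delimited annotation row into a dict keyed by column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(ANN_COLUMNS):
        raise ValueError(f"expected {len(ANN_COLUMNS)} columns, got {len(fields)}")
    return dict(zip(ANN_COLUMNS, fields))


def parse_aaqual(value):
    """Parse 'size,%coding,quality' into (int size, int percent, str quality)."""
    size, pcoding, quality = value.split(",", 2)
    return int(size), int(pcoding.rstrip("%")), quality
```

For example, parse_aaqual("18274,97%,complete") returns (18274, 97, "complete").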

PublicID is in form: Tribca2aEVm000029t1, Tribca2aEVm000029t2, .. with t1..t100 suffix for primary/alternates of one locus.
      Low locus ID numbers are larger proteins; sizes are in .aa.qual tables, e.g.
  Tribca2aEVm000001t1	18274,97%,complete
  Tribca2aEVm000002t1	15475,97%,complete
  Tribca2aEVm000003t1	14640,98%,partial3
  Tribca2aEVm000004t1	9923,99%,partial3
  Tribca2aEVm000005t1	8806,98%,complete
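
A Python sketch for splitting a PublicID into locus and transcript numbers,
assuming every ID follows the pattern of the examples above (a prefix ending
in "EVm", a zero-padded locus number, then "t" and a transcript number).
Treating t1 as the primary transcript is an assumption, and parse_public_id
is a hypothetical name.

```python
import re

# Prefix ending in "EVm", zero-padded locus number, "t", transcript number.
PUBID_RE = re.compile(r"^(?P<prefix>[A-Za-z0-9]+EVm)(?P<locus>\d+)t(?P<alt>\d+)$")


def parse_public_id(pubid):
    """Return (prefix, locus number, transcript number) for a PublicID."""
    m = PUBID_RE.match(pubid)
    if m is None:
        raise ValueError(f"not an EvidentialGene-style PublicID: {pubid}")
    return m.group("prefix"), int(m.group("locus")), int(m.group("alt"))


def is_primary(pubid):
    """True for the t1 transcript (assumed to be the primary of its locus)."""
    return parse_public_id(pubid)[2] == 1
```

Sorting IDs by parse_public_id(id)[1:] groups alternates under their locus, in
locus number order.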
OrigID "tcas4sb2p8nmvelvk47Loc2069t1" encodes strain source (sb=SB or cr=Cro1 
of triboliumbeetle/sradata/), transcript assembler
(velv=VelvetO, soap=SOAPdenovoTr, trin=Trinity), RNA-shredding kmer k47,
digital normalization (nm), and other data-slice/parameter information.
Each transcript assembly is of a single strain, but alternates at a locus
can come from both strains, including some strain differences.  "cull3"  likely
includes SNP differences across strains that produce identical protein
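
A Python sketch that decodes the OrigID fields named above (strain, assembler,
kmer, digital normalization).  The field order in the pattern is inferred from
the single example "tcas4sb2p8nmvelvk47Loc2069t1" and may not hold for all
IDs.  The group names and the function parse_orig_id are hypothetical, and
the data-slice part is kept as an opaque string.

```python
import re

STRAINS = {"sb": "SB", "cr": "Cro1"}
ASSEMBLERS = {"velv": "VelvetO", "soap": "SOAPdenovoTr", "trin": "Trinity"}

# Layout inferred from one example ID only; treat as an assumption.
ORIGID_RE = re.compile(
    r"^(?P<project>[a-z]+\d+)"      # e.g. tcas4
    r"(?P<strain>sb|cr)"            # strain source
    r"(?P<slice>.*?)"               # data-slice/parameter codes, incl. nm
    r"(?P<asm>velv|soap|trin)"      # transcript assembler
    r"k(?P<kmer>\d+)"               # kmer size
    r"Loc(?P<loc>\d+)t(?P<t>\d+)$"  # assembler locus and transcript
)


def parse_orig_id(origid):
    """Decode strain, assembler, kmer and normalization flag from an OrigID."""
    m = ORIGID_RE.match(origid)
    if m is None:
        raise ValueError(f"unrecognized OrigID layout: {origid}")
    return {
        "strain": STRAINS[m.group("strain")],
        "assembler": ASSEMBLERS[m.group("asm")],
        "kmer": int(m.group("kmer")),
        "normalized": "nm" in m.group("slice"),
        "locus": int(m.group("loc")),
        "transcript": int(m.group("t")),
    }
```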


