Pea aphid v2 choice of best gene model per locus
from v2 of NCBI RefSeq, EvidentialGene and version 1 Aphid official genes,
using two basic scores, protein homology and valid expression span, at

This best-of-3 choice yields 360,000 more homology bits
and 1.4 million more expressed bases than RefSeq models alone
at the same loci, an 8% average improvement for 15,500 loci.

This 8% improvement seems small, but we are not starting at
zero; v2 improvements are incremental.  For comparable loci,
the RefSeq v2 improvement over Aphid genes v1 is 4%, Evigene
has a 7% improvement, and best-of-3 has a 10% improvement.

Summary statistics:

 All loci  RefSeq+Evigene loci  Best-of-3 source
 --------  -------------------  ----------------
    37063                15501  total
     2442                  837  Acypi1
    22012                 6101  Evigene
     5384                 4521  Ncbi RefSeq
     8159                   82  Noscore
     4878                 3600  Same2
      442                  442  Same3

Average scores for loci with both Ncbi and Evigene models:
stat    hbest   xbest   hevg    xevg    hncbi   xncbi
ave:    341     1962    332     1857    319     1862
   n loci=15501, h=homology bitscore, x=valid expression span

Averages for all loci, excluding Noscore:
stat    hbest   xbest   hevg    xevg    hncbi   xncbi   hacyp   xacyp
ave:    222     1100    224     1081    303     1790    254     911
ngene:  35158   35158   33747   33747   16426   16426   20065   20065

Alternate transcripts of best models:
6482 total in 3561 genes
4162 Evigene
2296 RefSeq
  24 Acypi1

Result files:
  aphid2-bestof3gene-table.txt : classification of best model per locus, as in the summary above but with more details
  aphid2-bestof3gene.ids       : IDs from column 2 of bestof3gene-table
  aphid2-bestof3gene.gff       : gene features from IDs
  aphid2-bestof3gene.aa        : proteins from IDs
  aphid2-bestof3alttr.idtab    : alternate transcripts of best models
  aphid2-bestof3alttr.gff      : alternate-tr features
  aphid2-bestof3alttr.aa       : alternate-tr proteins
     Note the choice algorithm selected the best model among alternates
     using the same scoring.  The alternates here all score lower
     than the best model.
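
The .ids file is described above as the IDs from column 2 of the gene table.
A minimal sketch of that extraction, assuming the table is whitespace-delimited
and that lines starting with '#' are comments; the layout of the example rows
and the model names are hypothetical, not taken from the real table:

```python
def read_ids(table_lines, column=2):
    """Collect the values of one 1-based column from a whitespace-delimited table."""
    ids = []
    for line in table_lines:
        line = line.strip()
        # Skip blank lines and comment lines (assumed to start with '#').
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= column:
            ids.append(fields[column - 1])
    return ids


# Hypothetical rows: class in column 1, best model ID in column 2.
example = [
    "# class  best_id  other_columns",
    "Evigene  modelA   more",
    "Same2    modelB   more",
]
print("\n".join(read_ids(example)))
```

With a real file, the same function could be called on an open file handle,
since it only iterates over lines.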
Combining tables:
    = equivalent models per locus, from exon overlap, with equivalence scores
    = homology, expression scores per best transcript, and alternates

Combining methods:
  classify best gene model per locus using homology, expression scores.
   inputs:
    1. evidence tables of gene id, h=homology, x=expression scores, a=alternate transcripts, per gene set
    2. pairwise tables of 2 gene sets with equivalent transcript ids
         transcript-*-hxscore.txt == rnas/tr*
   steps:
    1.  read input tables
    2.  merge as compare3 table, row of equivalent models per locus with evidence scores
    3.  classify best model per 1 locus row, using scores
    4.  classify join/split loci over several rows
  cat *.equal.ids | sed 's/^/eq /' | cat - transcript-*-hxscore.txt   | \
  $evigene/scripts/ > 

  Classification rules for choice of best model/locus:
     rule 6SJa = if (i is split and  c is join) 
       if( iscore / cscore > SPLITBEST ) ibest  elsif ( iscore / cscore < JOINBEST ) cbest
     rule 6SJb = if (c is split and i is join) ... converse rule
     rule 4XH = rule1H + h small + x large = h/x discrepancy
     rule 1H  = diff h large enough
     rule 2XH = diff x large enough, h not too small
     rule 3S  = diff x and h both small
     rule 5L  = x and h too small
      (see : sub classify0)
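
A rough sketch of how rules like these might be coded for one pair of models,
i and c, each with (h, x) scores.  All threshold values, the order in which the
rules are tested, and the reading of rule 4XH (h picks one model, but h is small
and x strongly favours the other) are assumptions made for illustration; the
actual logic is in sub classify0 and may differ.

```python
# Hypothetical thresholds; none of these values come from the source.
H_DIFF, X_DIFF = 20.0, 100.0     # "large enough" differences in h and x
H_MIN, X_MIN = 50.0, 100.0       # below these, a score counts as "too small"
SPLITBEST, JOINBEST = 1.5, 1.0   # score-ratio cutoffs for split vs join


def classify_pair(i, c):
    """Compare models i and c, given as (h, x); return (choice, rule)."""
    (ih, ix), (ch, cx) = i, c
    dh, dx = ih - ch, ix - cx
    # Rule 5L: both x and h too small for either model (tested first here).
    if max(ih, ch) < H_MIN and max(ix, cx) < X_MIN:
        return "none", "5L"
    if abs(dh) >= H_DIFF:
        h_winner = "i" if dh > 0 else "c"
        x_winner = "i" if dx > 0 else "c"
        # Rule 4XH: h difference is large but h is small and x disagrees.
        if max(ih, ch) < H_MIN and abs(dx) >= X_DIFF and x_winner != h_winner:
            return x_winner, "4XH"
        # Rule 1H: h difference large enough.
        return h_winner, "1H"
    if abs(dx) >= X_DIFF:
        # Rule 2XH: x difference large enough, winner's h not too small.
        winner, winner_h = ("i", ih) if dx > 0 else ("c", ch)
        if winner_h >= H_MIN:
            return winner, "2XH"
        return "undecided", ""
    # Rule 3S: differences in x and h are both small.
    return "same", "3S"


def classify_split_join(iscore, cscore):
    """Rule 6SJa: i is a split model, c is a join model over the same span."""
    if cscore <= 0:
        return "i"  # avoid division by zero; an arbitrary choice in this sketch
    ratio = iscore / cscore
    if ratio > SPLITBEST:
        return "i"
    if ratio < JOINBEST:
        return "c"
    return "undecided"


print(classify_pair((300, 1800), (250, 1800)))
print(classify_split_join(300, 150))
```

Rule 6SJb would be the same ratio test with the roles of i and c exchanged.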

Developed at the Genome Informatics Lab of Indiana University Biology Department