Theobroma cacao public gene set pub3i
08 March 2012, D. Gilbert, gilbertd at indiana edu

Gene class counts
21806 Class:Strong      : >= 66% expression/homology evidence
 5101 Class:Medium      : >= 33% expression/homology evidence
 2691 Class:Weak        : >=  5% evidence, worth considering if more evidence turns up
13035 Class:Transposon  : >= 33% transposon and no/weak expression
 3465 Class:Poor        : mixed bag of partial models
  550 Class:None        : no evidence
14977 Class:AltStrong   : alternate transcripts from EST/RNA assemblies
   20 Class:AltMedium

pub3i.good.ids (main=29408, alts=14996) includes Class:(Strong|Medium|Weak) and Alt transcripts.
   The first transcript ID ends with 't1', but it is not always the best of the alternates.
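
The ID convention above (gene ID plus a 't' and isoform number) can be sketched
in Python as below. The regular expression is inferred from the example IDs in
this document (such as Thecc1EG000002t1), not from a published ID specification,
and the function names are hypothetical.

```python
import re

# Pattern inferred from example IDs in this README (e.g. Thecc1EG000002t1):
# gene ID, then 't', then the isoform number. This is an assumption.
TRANSCRIPT_ID = re.compile(r"^(?P<gene>Thecc1EG\d+)t(?P<isoform>\d+)$")


def split_transcript_id(transcript_id):
    """Return (gene_id, isoform_number) for a transcript ID."""
    m = TRANSCRIPT_ID.match(transcript_id)
    if m is None:
        raise ValueError(f"unrecognized transcript ID: {transcript_id!r}")
    return m.group("gene"), int(m.group("isoform"))


def group_by_gene(transcript_ids):
    """Map each gene ID to its transcript IDs, sorted by isoform number."""
    genes = {}
    for tid in transcript_ids:
        gene, isoform = split_transcript_id(tid)
        genes.setdefault(gene, []).append((isoform, tid))
    return {g: [tid for _, tid in sorted(v)] for g, v in genes.items()}
```

Because 't1' is not always the best alternate, grouping by gene and then ranking
on an evidence field is likely safer than simply keeping the 't1' records.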

pub3i.good corrects CDS-exon errors found in pub3h and pub3g: off-by-1, missing strand,
   and partly mangled proteins. These came from transcript-gene-assembly software that is
   weak on CDS/protein methods.

   pub3h>3i: nupdate=570, ndrop=108
  34 changemrna  (CDS/exon/protein changes)
  32 renamelocus  ; 8 renamealt 
 491 newgoodlocus ; 5 other = newgoodlocus (shifted from notgood to good subset)
  73 dropoverlaplocus; 21 droplocus ; 14 other = drop
   2826 updated transcripts: 785 with CDS exon changes (333 main, 452 alts), 400 altered proteins,
   1641 strand additions, 41 dropped records.
   CDS sequences in good set translate to protein sequences.  There are CDS mismatches in non-good set.

2012.07.11: cacao11genes_pub3i.good.gff corrected 1 record transcript ID=Thecc1EG015900t2 
            with "puevd3b" in score column instead of numeric score.
2012.08.23: cacao11genes_pub3i.good.shortaa contains 29 proteins with size < 40aa,
            excluded from cacao11genes_pub3i.good.aa.  No significant homology is apparent;
            these are likely from either non-coding genes or gene fragments.

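
A minimal Python sketch of the size split described in the 2012.08.23 note:
records shorter than 40 aa go to one list, the rest to another. This is not the
script used to build the .shortaa file. The function names are hypothetical, and
ignoring a trailing '*' stop symbol when measuring length is an assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


MIN_AA = 40  # cutoff stated in the 2012.08.23 note


def split_by_length(records, min_aa=MIN_AA):
    """Split (header, seq) records into (kept, short) lists.

    Assumption: a trailing '*' stop symbol is not counted in the length.
    """
    kept, short = [], []
    for header, seq in records:
        length = len(seq.rstrip("*"))
        (short if length < min_aa else kept).append((header, seq))
    return kept, short
```

Usage would be along the lines of split_by_length(read_fasta(open(path))) for a
protein fasta file at a path of your choosing.
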
The cacao mitochondrial genome and associated genes,
M16_mito_v1.0, have been withdrawn from public use for now (6 Dec 2011).
This includes 214 genes, mostly of Class Strong (112) or Medium; about 40 are 1-1 orthologs
to genes in other tested plant gene sets. The IDs for these are in cacao11genes_pub3g.mitoremoved.ids

Gene data files:
 cacao11genes_pub3i.aa		protein fasta
 cacao11genes_pub3i.cds		coding dna
 cacao11genes_pub3i.tr		transcript dna
 cacao11genes_pub3i.attr.txt	gene annotation table (tabbed)
 cacao11genes_pub3i.gff		gene location/annotation format
 cacao11genes_pub3i.good.ids	IDs of Class:Strong|Medium|Weak (Alt included)
 cacao11genes_pub3i.good.{aa,tr,cds}	fasta subset of Class:Strong|Medium|Weak
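
The "good" subset rule can be sketched in Python as a filter on the Class value.
This assumes class values are written as in this document (Class:Strong,
Class:AltStrong, ...); the function names and the (ID, class) input layout are
hypothetical, not the method used to build the .good.ids file.

```python
import re

# Classes named for the good subset: Strong, Medium, Weak, plus the Alt
# transcript classes (AltStrong, AltMedium) listed under "Gene class counts".
# The pattern would also accept AltWeak, which this README does not list.
GOOD_CLASS = re.compile(r"^Class:(?:Alt)?(?:Strong|Medium|Weak)$")


def is_good(class_value):
    """True if a Class value belongs to the good subset."""
    return GOOD_CLASS.match(class_value.strip()) is not None


def good_ids(records):
    """records: iterable of (transcript_id, class_value) pairs."""
    return [tid for tid, cls in records if is_good(cls)]
```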

Annotation fields in gene.attr.txt. Same values are in mRNA lines of gene GFF.
        transcriptID      Thecc1EG000002t1              Thecc1EG000005t1
        geneID            Thecc1EG000002                Thecc1EG000005
        isoform           1                             1
        quality1          Class:Strong                  Class:Strong
        quality2          Express:Strong                Express:Strong
        quality3          Homology:OrthologStrong       Homology:OrthologStrong        
        quality4          Intron:Strong                 Intron:Strong
        quality5          Protein:complete              Protein:complete 
        aaSize            205                           1269
        cdsSize1          62%                           77%
        cdsSize2          618/977                       3810/4930
        Name1             Cystathionine beta-synthase.. Kinesin-like calmodulin-binding..          
        Name2             82%T                          74%T
        oname1            Uncharacterized protein       Uncharacterized protein      
        oname2            87%U                          77%U
        groupname         Cystathionine beta-synthase   Kinesin-like calmodulin-binding..              
        Dbxref1           TAIR:AT5G10860.1              TAIR:AT5G65930.2
        Dbxref2           82%                           74%
        ortholog1         frave:gene01181               ricco:29682.m000589 
        ortholog2         87%                           83%
        paralog1          Thecc1EG034062t1              Thecc1EG000957t1 
        paralog2          51%                           12%
        uniprot1          UniRef50_B9I794               UniRef50_B9GJK9
        uniprot2          87%                           77%
        genegroup1        PLA9_G6641                    PLA9_G3639
        genegroup2        1/11/9                        1/13/9
        cacaoGD09         CGD0000016/C99.77             na   
        cacaoTCR1         na                            Tc01_t000060/C99.83
        intron1           100%                          100%
        intron2           10/10                         46/46
        express1          94%                           82%
        express2          75                            99
        estgroup          LeafPistil                    LeafPistil
        location          scaffold_1:7897-10405:+       scaffold_1:17413-27097:+        
        oid               rna8b:r8L_g13025t00001        mar7g.mar11f:AUGepir7p1s1g7t1  
        score             7946                          40120
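
The location values in the table above (for example scaffold_1:7897-10405:+)
can be split into parts with a short Python sketch like this one. The names are
hypothetical. Treating coordinates as 1-based and inclusive, as in GFF, and
accepting '.' for an unknown strand are assumptions.

```python
from typing import NamedTuple


class Location(NamedTuple):
    scaffold: str
    start: int
    end: int
    strand: str

    def span_length(self):
        # Assumes 1-based inclusive coordinates, as in GFF.
        return self.end - self.start + 1


def parse_location(text):
    """Parse 'scaffold:start-end:strand' into a Location."""
    # Split from the right so a scaffold name containing ':' still parses.
    scaffold, span, strand = text.strip().rsplit(":", 2)
    start, end = (int(x) for x in span.split("-"))
    if strand not in ("+", "-", "."):
        raise ValueError(f"unexpected strand in location: {text!r}")
    return Location(scaffold, start, end, strand)
```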

Guide to cacao Evigene annotation table columns and GFF mRNA  attributes:
  transcriptID    (ID in gff mRNA)
  geneID          (gene in gff mRNA, is Parent= to mRNA)
  isoform   : alternate transcript number if > 1, matches ID suffix (t2,t3...)
  quality   : evidence quality values for Expression Homology Intron Protein         
  aaSize    : protein aa length
  cdsSize   : CDS as percent of transcript (cdsSize1), and cds length / transcript length (cdsSize2)
  Name      : homology-derived gene name, P:Plant9 family, U:UniProt or T:TAIR, 
               with percent align (88%P, 62%T, 74%U)
  oname     : other name (from next best classifier above)
  Dbxref    : cross reference gene IDs to TAIR, UniProt
  express   : expressed span as percent of transcript
  estgroup  : has significant expression from tissue groups Leaf,Pistil and/or Bean
  ortholog  : protein orthology percent identity, and protein IDs
  paralog   : protein paralogy percent identity, and gene ID
  genegroup : gene family ID from OrthoMCL grouping of 9 plants
    genegroup2 : 1/11/9 means 1 cacao gene / 11 plant genes / 9 plant species (of 9 max)
  cacaoGD09 : equivalent Cacao CGD (Mars v0.9) gene
  cacaoTCR1 : equivalent Cacao Tc (Cirad v1.0) gene
  intron    : intron splice sites matched by evidence (10/10 for 5 matched introns, 2 splice sites each)
  location  : genome location
  oid       : original model ID
  score     : evidence score sum
  scorevec  : evidence score vector
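
Several of the fields above pack more than one value into a short string
(82%T, 1/11/9, 10/10). A hedged Python sketch for unpacking them follows. The
function names are hypothetical, and reading the intron pair as matched/total
splice sites is an assumption based on the guide text.

```python
import re

# Source letters from the Name field description above.
SOURCES = {"P": "Plant9", "T": "TAIR", "U": "UniProt"}


def parse_name_align(text):
    """'82%T' -> (82, 'TAIR'): percent alignment and naming source."""
    m = re.fullmatch(r"(\d+)%([PTU])", text.strip())
    if m is None:
        raise ValueError(f"unrecognized name alignment: {text!r}")
    return int(m.group(1)), SOURCES[m.group(2)]


def parse_genegroup_counts(text):
    """'1/11/9' -> counts of cacao genes, plant genes, plant species."""
    cacao, plant_genes, species = (int(x) for x in text.strip().split("/"))
    return {"cacao_genes": cacao, "plant_genes": plant_genes, "species": species}


def parse_splice_counts(text):
    """'10/10' -> (matched, total); assumes the order is matched/total."""
    matched, total = (int(x) for x in text.strip().split("/"))
    return matched, total
```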

Quality notes:
  Values are generally Strong/Medium/Weak/None
  Class:    gene quality class as sum of evidence parts; Transposon and Poor are special classes
  Express:  Strong/Medium/Weak for percent of transcript with expression
  Homology: Ortholog if the best match is in another species, Paralog if in this species
  Protein:  complete or partial
  Intron:   Strong/Medium/Weak depending on percent and total of splice sites matched

Developed at the Genome Informatics Lab of Indiana University Biology Department