
Evidential Gene for Pea aphid assembly 2
June 2011, by Don Gilbert

aphid2_evigene8f.gff  : annotated gene models, GFFv3 format
aphid2_evigene8f.annot.txt : table of gene annotations, tabbed
aphid2_evigene8f.aa   : fasta sequences: aa (proteins), tr (transcript na), cds (coding na)

quality/         : gene quality information, including validated chimeric splits of ACYPI v1 genes
		   quality/compare3-uniprot-blastp.txt compares homology for evigene, ncbi refseq, acypi1 at same locus
other/           : additional gene models and supporting information

  Names are derived from protein homology to Uniprot of May 2011, uniref50-arthropods,
  and related named gene data sets.  A match criterion of >33% alignment to the named
  protein is used, and the alignment percent is noted on names as (nn%).
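
  A minimal sketch of separating the name from its (nn%) suffix. The pattern
  assumes the percent is an integer in parentheses at the end of the Name value,
  as in "uncharacterized protein (66%)"; split_name is a hypothetical helper,
  not part of the Evigene tools.

```python
import re

# Assumption: the alignment percent is a trailing "(nn%)" on the Name value.
NAME_PCT = re.compile(r"^(?P<name>.*?)\s*\((?P<pct>\d+)%\)\s*$")

def split_name(name_value):
    """Return (name, percent); percent is None when no (nn%) suffix is present."""
    m = NAME_PCT.match(name_value)
    if m is None:
        return name_value.strip(), None
    return m.group("name"), int(m.group("pct"))

name, pct = split_name("uncharacterized protein (66%)")
print(name, pct)
```

  Names without a suffix come back unchanged with percent None, so the helper
  can be applied to every Name value without pre-filtering.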

  A small set of curated proteins is included, mostly from chimeric splits,
  that cannot be computed from gene.gff.  See the quality=Protein:curated flag.

  GFF format is 3-level (gene/mRNA/exon,CDS) with alternate transcripts flagged as isoform=N,
  and ID=...t1,t2,t3 to indicate alternates.  All primary models have the t1 ID suffix, but the
  primary model may not be the "best" form (longest protein).
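
  A minimal sketch of reading the ID suffix from an mRNA attribute column.
  The example attribute string (including the t2 alternate and the Parent
  value) is hypothetical; only the tN suffix convention and the isoform=N
  flag come from the notes above.

```python
import re

def parse_attributes(column9):
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def isoform_number(transcript_id):
    """Return N from a trailing tN suffix, or None if the ID has no such suffix."""
    m = re.search(r"t(\d+)$", transcript_id)
    return int(m.group(1)) if m else None

def is_primary(transcript_id):
    """Primary models carry the t1 suffix."""
    return isoform_number(transcript_id) == 1

# Hypothetical attribute column for an alternate transcript.
attrs = parse_attributes("ID=acyp2eg0001707t2;Parent=acyp2eg0001707;isoform=2")
print(attrs["ID"], isoform_number(attrs["ID"]), is_primary(attrs["ID"]))
```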

  Long introns in gene models are all supported by evidence from rna/est assemblies:
     many are > 20kb, a few > 100kb, and > 35 genes span over 250kb (more than bee, but same ballpark).

  False UTRs were worked over, and many but not all were removed.
     These extend into the next gene, or include introns, sometimes many utr-exons.
     They are areas of high expression that are joined to gene ends when they should not be,
     or coding sections broken artifactually into non-coding;
     they are commonest in est/rna-assemblies by PASA, cufflinks.

  Chimera/split genes from version 1: 1000 computed but <100 validated,
     some matched alternate models.  These include a few well known genes like 
     dicer-1, maleless, sex-determining fem-1
     Chimeric genes are entered 2 times in genes.gff, with 2 separate IDs, to conform
     to GFF format requirements.  Protein is listed only once.  See annotation chimera=1,2

   quality/compare3-uniprot-blastp.txt lists homology scores to Uniprot for the 3 models at each locus.
      This can help resolve which may be the best model per locus, but best bitscore by itself
      is not enough to pick the best model (some lower bitscore models have better intron, expression evidence).


Summary of Evidential Gene models for Pea Aphid

36,500 genes are located in aphid2_evigene8f gene set
14,000 are fully supported by evidence (expression/orthology),
24,000 have above 66% evidence support,
33,000 have above 33% evidence support, 
the remainder have evidence but at lower levels.

23,200 have paralogs to pea aphid genes above 33%
13,500 have orthologs to other species above 33%
11,800 orthologs are true orthologs, the rest have stronger paralogy to another pea aphid gene
5000 alternate transcripts among 2700 genes add to these primary transcripts.
4400 are non-coding or poorly coding genes.
4000 have partial proteins (missed start,stop,inner stop)

3300 are likely transposon genes; 2800 have expression, strong to moderate, but
     only 400 have valid introns (by comparison, 2/3 of non-TE genes with expression
     have valid introns).

90 genes are valid chimeric models from version 1, split across scaffolds now.
33 have long valid introns, genes span > 250 Kb (acyp2eg0001707t1 approaches 1 Mb)

16,800 Evigene models are equivalent to NCBI RefSeq (13,000)/Gnomon (3800)
      for >90% coding sequence.  10,000 are equivalent to ACYPI v1 genes; 
      many ACYPI1 models are partial components of this gene set.

Protein size is 262 aa (median), 21 Kb largest,  for 42 Mb coding bases in genome.
Transcript size is 1.6 Kb (median), 62 Kb largest, for 73 Mb transcript bases in genome,
with average 58% coding/transcript ratio.
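
As a rough consistency check on these totals, the genome-wide coding fraction
can be computed from the two Mb figures. This is a sketch only: the 58% quoted
above is stated as an average ratio, which need not equal the ratio of the
totals, and coding_fraction is a hypothetical helper name.

```python
def coding_fraction(coding_mb, transcript_mb):
    """Fraction of transcript bases that are coding, from genome-wide totals."""
    return coding_mb / transcript_mb

# Totals quoted above: 42 Mb coding bases, 73 Mb transcript bases.
fraction = coding_fraction(42, 73)
print(f"{fraction:.1%}")
```

The result of the totals-based ratio lands close to the quoted average.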

Gene Evidence Summary for pea_aphid2, 2011 June

Evid.   Nevd    Statistic       evig8   evig3   ACYPI   ncbi2
------  ------  -------------   ------  -----   -----   -----
EST     36Mb    BaseOverlap     0.79    0.82    0.49    0.69
Pro     27Mb    BaseOverlap     0.76    0.82    0.47    0.46
RNA     55Mb    BaseOverlap     0.49    0.44    0.27    0.43   * all under 50% of expression
Intron  127076  SplicesHit      0.70    0.66    0.52    0.68

ESTgene 10371   Perfect         2837    2143    1808    2583
ESTgene 10371   Sensitv.        0.77    0.78    0.55    0.72
ESTgene 10371   Specifc.        0.47    0.40    0.64    0.48

Progene 12860   Perfect         4494    4548    3355    4051
Progene 12860   Sensitv.        0.49    0.51    0.37    0.40
Progene 12860   Specifc.        0.59    0.58    0.66    0.63

Ortholog --     N_found         26523   24079   17656   -
Paralog  --     N_found         30447   26496   22272   -

Genome  --      Coding Mb       42Mb    42Mb    28Mb    21Mb
Genome  --      Exon Mbase      73Mb    74Mb    33Mb    36Mb
Genome  --      Gene count      36586   32967   35722   16894
evig8e=genes/aphid2_evigene8e.gff,    2011-June
evig3=genes/aphid2_mix3.gff,          2010-Oct  
ncbi2=genes/acyr2_ncbigenes.gff,      2011-May          
ACYPI=genes/acyr1-ACYPImRNA.gff,      2009

Gene Homology of 3 prediction sets to Uniprot-arthropods
from loci with all 3 predictors

Best transcript, 3 predictors at same locus, including 0 hits;   
pred    nho     avebit  nbest 
acypi1  10127   321     537   
evigene 11114   330     1370  
ncbiref 10157   343     796   
same    -       -       8759  

Best transcript, 3 predictors at same locus, excluding 0 hits 
pred    nho     avebit  nbest
acypi1  9688    387     352
evigene 10198   402     843
ncbiref 9678    415     581
same    -       -       8422

  where nho=number of models with Uniprot hit, avebit = average bitscore,
  nbest = number where this predictor has best homology (>=5% of others).
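
  One way to read the nbest column is as a share of the loci where a single
  predictor is best, leaving out the "same" row. A minimal sketch using the
  counts from the "excluding 0 hits" table; best_shares is a hypothetical
  helper, and this summary is an illustration, not part of the original analysis.

```python
# nbest counts copied from the "excluding 0 hits" table above.
nbest = {"acypi1": 352, "evigene": 843, "ncbiref": 581, "same": 8422}

def best_shares(counts):
    """Share of loci with a single best predictor, per predictor ("same" left out)."""
    decided = {k: v for k, v in counts.items() if k != "same"}
    total = sum(decided.values())
    return {k: v / total for k, v in decided.items()}

for pred, share in best_shares(nbest).items():
    print(f"{pred}\t{share:.1%}")
```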

Guide to pea aphid Evigene annot.txt columns and GFF mRNA  attributes:

  transcriptID :   mRNA transcript public ID (ID= in gff mRNA)
  geneID    : gene ID (gene= in gff mRNA), the Parent= of the mRNA
  isoform   : alternate transcript number if > 1, matches ID suffix (t2,t3...)
  quality   : list of quality values for Expression, Homology, Intron, Mate-pairing, Protein
  aaSize    : protein aa length, percent of transcript
  cdsSize   : CDS length / transcript length
  Name      : homology-derived gene name, UniProt arthropods and related databases
  Dbxref    : cross reference gene IDs to AphidBase v1, NCBI RefSeq v2
  express   : expressed span as percent of transcript, and read count for EST, RNA-seq
  ortholog  : protein orthology percent identity, bit score, and protein IDs
  paralog   : protein paralogy percent identity, bit score and gene ID
  intron    : evidence introns from expression / model introns
  location  : genome location
  oid       : original model ID
  chimera   : validated split or chimeric model from ACYPI v1 gene, has 2 locations (and 2 transcript IDs)
  score     : evidence score sum
  scorevec  : evidence score vector

Quality notes:
  Values are generally Strong/Medium/Weak/None
  Homology:  Ortholog if best match is other species, Paralog for this species
  Protein:  curated_complete indicates curated by expert, including chimera split ACYPI genes
            and that protein cannot be computed from genome sequence.
  Intron: and Mated: (mate pairing) qualities include perfect/complete for all exons supported in gene,
          good, poor, none : levels of intron, mate pair quality

Other field notes:
  Dbxref  = gene cross reference, includes percent equivalence, and "I" or "C" flag.
            I = identical model, C = >= 90% coding sequence identity
  chimera = includes location of other split part, and computed gene model that matches part
    ID=acyp2eg0037508t1 chimera=1,Scaffold298:481226-487632:+,acyp2eg0018229t1,complete
    ID=acyp2eg0037509t1 chimera=2,Scaffold298:618846-621340:+,acyp2eg0018215t1,complete
    These should/will have gene records added to show part equivalence.
  scorevec fields are defined in top of GFF file and used to make total gene model score, using weighted values
  ##gff-version 3
  #program: overbestgenes, selection of best gene set by evidence scores
  #scoretype: homolog:9,paralog:1,ref:2,est:3,pro:3,rseq:2,intr:20,nintron:40,inqual:20,maqual:5,terepeat:-3,UTR:3,CDS:1
  Name=uncharacterized protein (66%)
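
  A minimal sketch of parsing the chimera= value and the #scoretype header
  shown above. Assumptions are labeled in the comments: the meaning of the
  fourth chimera field ("complete") is not documented here, so it is kept as
  an opaque status string, and the weighted sum in weighted_score is a guess
  at how weights might combine with per-field values, since the scorevec
  encoding is not described in this readme. All function and field names are
  hypothetical.

```python
from typing import NamedTuple

class ChimeraPart(NamedTuple):
    part: int        # 1 or 2
    scaffold: str
    start: int
    end: int
    strand: str
    model_id: str    # computed gene model that matches this part
    status: str      # assumption: undocumented 4th field, kept as-is

def parse_chimera(value):
    """Parse 'part,Scaffold:start-end:strand,modelID,status'."""
    part, location, model_id, status = value.split(",")
    scaffold, span, strand = location.split(":")
    start, end = span.split("-")
    return ChimeraPart(int(part), scaffold, int(start), int(end),
                       strand, model_id, status)

def parse_scoretype(header_line):
    """Parse '#scoretype: name:weight,name:weight,...' into a dict of int weights."""
    _, _, spec = header_line.partition(":")
    weights = {}
    for item in spec.strip().split(","):
        name, weight = item.split(":")
        weights[name] = int(weight)
    return weights

def weighted_score(weights, values):
    """Hypothetical total: sum of weight * value over the fields given in values."""
    return sum(weights[name] * value for name, value in values.items())

part = parse_chimera("1,Scaffold298:481226-487632:+,acyp2eg0018229t1,complete")
weights = parse_scoretype(
    "#scoretype: homolog:9,paralog:1,ref:2,est:3,pro:3,rseq:2,intr:20,"
    "nintron:40,inqual:20,maqual:5,terepeat:-3,UTR:3,CDS:1"
)
print(part.scaffold, part.start, part.end, weights["intr"])
```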

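Alternate transcript indicator in ID and isoform field:
  ID suffix t1 marks the primary transcript; t2, t3, ... mark alternates, and isoform=N matches the suffix.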

Genera of best homolog in Uniprot 2011.05 all arthropods + human + 3-bacteria (726442 proteins)
for primary transcript, n=27675 genes in aphid2_evigene8f

   7308 Tribolium
   3612 Camponotus    ant
   2406 Pediculus
   2180 Drosophila
   1990 Harpegnathos  ant
   1486 Solenopsis    ant
   1126 Daphnia
   1034 HUMAN
    945 Aedes         mosquito
    890 Culex         mosquito
    786 Anopheles     mosquito
    668 Bombyx
    623 Ixodes
    730 ECOLI/BACSU   bacteria
    236 Glyptapanteles
    ... others under 100
The Uniref-50 cluster name for best UP homolog is in most cases given as Name in this gene set.

Developed at the Genome Informatics Lab of Indiana University Biology Department