Index of /EvidentialGene/pea_aphid2/genes
Name Last modified Description
Parent Directory 29-Jan-2012 19:30
aphid2_evigene3_2010/ 11-Nov-2011 13:05
aphid2_evigene8e.readme.txt 07-Jun-2011 17:38
aphid2_evigene8f.aa.gz 05-Jun-2011 23:41
aphid2_evigene8f.annot.txt.gz 06-Jun-2011 12:55
aphid2_evigene8f.cds.gz 05-Jun-2011 23:52
aphid2_evigene8f.gff.gz 06-Jun-2011 13:21
aphid2_evigene8f.tbl.gz 05-Jun-2011 23:16
aphid2_evigene8f.tr.gz 05-Jun-2011 23:50
aphid2_genemodels/ 03-Jun-2011 19:09
evigene_aphid2.conf 30-May-2011 18:08
evigene_aphid2ndary.conf 17-Apr-2011 11:49
other/ 03-Jun-2011 18:47
quality/ 07-Jun-2011 13:56
Evidential Gene for Pea aphid assembly 2
June 2011, by Don Gilbert
aphid2_evigene8f.gff : annotated gene models, GFFv3 format
aphid2_evigene8f.annot.txt : table of gene annotations, tabbed
aphid2_evigene8f.aa : fasta sequence of aa (proteins)
aphid2_evigene8f.cds : fasta sequence of cds (coding na)
aphid2_evigene8f.tr : fasta sequence of tr (transcript na)
quality/ : gene quality information, including validated chimeric splits of ACYPI v1 genes
quality/compare3-uniprot-blastp.txt compares homology for evigene, ncbi refseq, acypi1 at same locus
other/ : additional gene models and supporting information
Notes:
Names are derived from protein homology to Uniprot of May 2011, uniref50-arthropods,
and related named gene data sets. A match criterion of >33% alignment is used for naming,
and the alignment percent is noted on names as (nn%).
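A minimal sketch of separating the (nn%) suffix from a gene name, assuming names look like the Name= value in the sample record of this readme. The function name split_name is hypothetical and not part of the Evigene tools.

```python
import re

def split_name(name):
    """Split 'gene name (nn%)' into (name, percent).

    Returns percent as None when no (nn%) suffix is present.
    """
    match = re.match(r"^(.*?)\s*\((\d+)%\)\s*$", name)
    if match:
        return match.group(1), int(match.group(2))
    return name.strip(), None

# Name value copied from the sample record in this readme.
print(split_name("uncharacterized protein (66%)"))
```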
A small set of curated proteins is included, mostly from chimeric splits,
that cannot be computed from gene.gff. See quality=Protein:curated flag.
GFF format is 3 level (gene/mRNA/exon,CDS) with alternate transcripts flagged as isoform=N,
and ID=...t1,t2,t3 to indicate alternates. All primary models have ID=t1 suffix, but may not
be "best" form (longest protein).
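The ID convention above can be sketched in a few lines of Python. This assumes standard GFF3 key=value;key=value attributes in column 9; the attribute string in the example is hypothetical, built from the sample IDs in this readme, and the helper names are illustrative only.

```python
import re

def parse_gff_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def transcript_number(transcript_id):
    """Return the numeric tN suffix of an ID such as acyp2eg0000002t2."""
    match = re.search(r"t(\d+)$", transcript_id)
    if match is None:
        raise ValueError(f"no tN suffix in {transcript_id!r}")
    return int(match.group(1))

def gene_id(transcript_id):
    """Strip the tN suffix to get the gene ID."""
    return re.sub(r"t\d+$", "", transcript_id)

def is_primary(transcript_id):
    """Primary models carry the t1 suffix; alternates are t2, t3, ..."""
    return transcript_number(transcript_id) == 1

# Hypothetical mRNA attribute column, assembled for illustration.
attrs = parse_gff_attributes("ID=acyp2eg0000002t2;Parent=acyp2eg0000002;isoform=2")
print(attrs["ID"], gene_id(attrs["ID"]), is_primary(attrs["ID"]))
```

Since t1 is not guaranteed to be the longest protein, a selection of the "best" form would need the aaSize values as well, not only the suffix.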
Long introns in gene models are all evidence-supported from rna/est assemblies;
many are > 20kb, a few >100kb, > 35 genes span over 250kb (more than bee, but same ballpark).
False UTRs were worked over, and many but not all were removed.
These extend into the next gene, or include introns, sometimes many utr-exons.
These are areas of high expression, joined to gene ends when they should not be,
or coding sections broken artifactually into non-coding;
they are commonest in est/rna-assemblies by PASA, cufflinks.
Chimera/split genes from version 1: 1000 computed but <100 validated,
some matched alternate models. These include a few well known genes like
dicer-1, maleless, sex-determining fem-1
ACYPI000122|ACYPI006952|ACYPI005290|ACYPI003167|ACYPI006652
Chimeric genes are entered 2 times in genes.gff, with 2 separate IDs, to conform
to GFF format requirements. Protein is listed only once. See annotation chimera=1,2
quality/compare3-uniprot-blastp.txt has homology score to Uniprot of 3 models at each locus.
This can help resolve which may be best model/locus, but best bitscore by itself
is not enough to pick best model (some lower bitscore models have better intron, expression evidence)
#----------------------------------------------------------------------
Summary of Evidential Gene models for Pea Aphid
36,500 genes are located in aphid2_evigene8f gene set
14,000 are fully supported by evidence (expression/orthology),
24,000 have above 66% evidence support,
33,000 have above 33% evidence support,
the remainder have evidence but at lower levels.
23,200 have paralogs to pea aphid genes above 33%
13,500 have orthologs to other species above 33%
11,800 orthologs are true orthologs, the rest have stronger paralogy to another pea aphid gene
5000 alternate transcripts among 2700 genes add to these primary transcripts.
4400 are non-coding or poorly coding genes.
4000 have partial proteins (missed start,stop,inner stop)
3300 are likely transposon genes; 2800 have expression, strong to moderate, but
only 400 have valid introns (2/3 of non-TE genes with expression have
valid introns).
90 genes are valid chimeric models from version 1, split across scaffolds now.
33 have long valid introns, genes span > 250 Kb (acyp2eg0001707t1 approaches 1 Mb)
16,800 Evigene models are equivalent to NCBI RefSeq (13,000)/Gnomon (3800)
for >90% coding sequence. 10,000 are equivalent to ACYPI v1 genes;
many ACYPI1 models are partial components of this gene set.
Protein size is 262 aa (median), 21 Kb largest, for 42 Mb coding bases in genome.
Transcript size is 1.6 Kb (median), 62 Kb largest, for 73 Mb transcript bases in genome,
with average 58% coding/transcript ratio.
Gene Evidence Summary for pea_aphid2, 2011 June
Evid.     Nevd    Statistic    evig8  evig3  ACYPI  ncbi2
--------  ------  -----------  -----  -----  -----  -----
EST       36Mb    BaseOverlap   0.79   0.82   0.49   0.69
Pro       27Mb    BaseOverlap   0.76   0.82   0.47   0.46
RNA       55Mb    BaseOverlap   0.49   0.44   0.27   0.43  * all under 50% of expression
Intron    127076  SplicesHit    0.70   0.66   0.52   0.68
ESTgene   10371   Perfect       2837   2143   1808   2583
ESTgene   10371   Sensitivity   0.77   0.78   0.55   0.72
ESTgene   10371   Specificity   0.47   0.40   0.64   0.48
Progene   12860   Perfect       4494   4548   3355   4051
Progene   12860   Sensitivity   0.49   0.51   0.37   0.40
Progene   12860   Specificity   0.59   0.58   0.66   0.63
Ortholog  --      N_found      26523  24079  17656      -
Paralog   --      N_found      30447  26496  22272      -
Genome    --      Coding Mb     42Mb   42Mb   28Mb   21Mb
Genome    --      Exon Mbase    73Mb   74Mb   33Mb   36Mb
Genome    --      Gene count   36586  32967  35722  16894
------------------------------------------------------------
Predictors
evig8e=genes/aphid2_evigene8e.gff, 2011-June
evig3=genes/aphid2_mix3.gff, 2010-Oct
ncbi2=genes/acyr2_ncbigenes.gff, 2011-May
ACYPI=genes/acyr1-ACYPImRNA.gff, 2009
#-----------------------------------------------------------------------
Gene Homology of 3 prediction sets to Uniprot-arthropods
from loci with all 3 predictors
Best transcript, 3 predictors at same locus, including 0 hits
pred     nho    avebit  nbest
acypi1   10127  321     537
evigene  11114  330     1370
ncbiref  10157  343     796
same     -      -       8759
Best transcript, 3 predictors at same locus, excluding 0 hits
pred     nho    avebit  nbest
acypi1   9688   387     352
evigene  10198  402     843
ncbiref  9678   415     581
same     -      -       8422
where nho=number of models with Uniprot hit, avebit = average bitscore,
nbest = number where this predictor has best homology (>=5% of others).
#-----------------------------------------------------------------------
Guide to pea aphid Evigene annot.txt columns and GFF mRNA attributes:
transcriptID : mRNA transcript public ID (ID= in gff mRNA)
geneID : (gene= in gff mRNA) is Parent= to mRNA
isoform : alternate transcript number if > 1, matches ID suffix (t2,t3...)
quality : list of quality values for Expression, Homology, Intron, Mate-pairing, Protein
aaSize : protein aa length, percent of transcript
cdsSize : CDS length / transcript length
Name : homology-derived gene name, UniProt arthropods and related databases
Dbxref : cross reference gene IDs to AphidBase v1, NCBI RefSeq v2
express : expressed span as percent of transcript, and read count for EST, RNA-seq
ortholog : protein orthology percent identity, bit score, and protein IDs
paralog : protein paralogy percent identity, bit score and gene ID
intron : evidence introns from expression / model introns
location : genome location
oid : original model ID
chimera : validated split or chimeric model from ACYPI v1 gene, has 2 locations (and 2 transcript IDs)
score : evidence score sum
scorevec : evidence score vector
Quality notes:
Values are generally Strong/Medium/Weak/None
Homology: Ortholog if best match is other species, Paralog for this species
Protein: curated_complete indicates curated by expert, including chimera split ACYPI genes
and that protein cannot be computed from genome sequence.
Intron: and Mated: (mate pairing) qualities include perfect/complete for all exons supported in gene,
good, poor, none : levels of intron, mate pair quality
Other field notes:
Dbxref = gene cross reference, includes percent equivalence, and "I" or "C" flag.
I = identical model, C = >= 90% coding sequence identity
chimera = includes location of other split part, and computed gene model that matches part
ID=acyp2eg0037508t1 chimera=1,Scaffold298:481226-487632:+,acyp2eg0018229t1,complete
ID=acyp2eg0037509t1 chimera=2,Scaffold298:618846-621340:+,acyp2eg0018215t1,complete
These should/will have gene records added to show part equivalence.
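As an illustration of the chimera field layout described above, here is a small parsing sketch. It assumes the value always has four comma-separated parts (part number, location as scaffold:start-end:strand, matching computed model, status), as in the two example lines; the ChimeraPart type and parse_chimera function are hypothetical names.

```python
from typing import NamedTuple

class ChimeraPart(NamedTuple):
    part: int            # 1 or 2: which part of the split gene this is
    scaffold: str
    start: int
    end: int
    strand: str
    matched_model: str   # computed gene model that matches the part
    status: str

def parse_chimera(value):
    """Parse a chimera= value such as
    '1,Scaffold298:481226-487632:+,acyp2eg0018229t1,complete'."""
    part, location, model, status = value.split(",")
    # rsplit keeps any ':' inside a scaffold name intact.
    scaffold, span, strand = location.rsplit(":", 2)
    start, end = span.split("-")
    return ChimeraPart(int(part), scaffold, int(start), int(end),
                       strand, model, status)

# Value copied from the first example line in this readme.
p = parse_chimera("1,Scaffold298:481226-487632:+,acyp2eg0018229t1,complete")
print(p.scaffold, p.start, p.end, p.end - p.start + 1)
```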
scorevec fields are defined at the top of the GFF file and used to make the total gene model score, using weighted values
##gff-version 3
#program: overbestgenes, selection of best gene set by evidence scores
#scoretype: homolog:9,paralog:1,ref:2,est:3,pro:3,rseq:2,intr:20,nintron:40,inqual:20,maqual:5,terepeat:-3,UTR:3,CDS:1
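A minimal sketch of the weighted score, assuming each scorevec value is paired by position with the corresponding #scoretype weight and that values are integers. That pairing is an assumption, not a documented rule, but it does reproduce score=31792 from the scorevec of the sample record in this readme. The function names are hypothetical.

```python
SCORETYPE = ("homolog:9,paralog:1,ref:2,est:3,pro:3,rseq:2,intr:20,"
             "nintron:40,inqual:20,maqual:5,terepeat:-3,UTR:3,CDS:1")

def parse_scoretype(text):
    """Turn 'name:weight,name:weight,...' into a list of (name, weight)."""
    weights = []
    for item in text.split(","):
        name, weight = item.split(":")
        weights.append((name, int(weight)))
    return weights

def total_score(scorevec, weights):
    """Weighted sum of scorevec values, paired with weights by position."""
    values = [int(v) for v in scorevec.split(",")]
    if len(values) != len(weights):
        raise ValueError("scorevec and scoretype have different lengths")
    return sum(value * weight for value, (_, weight) in zip(values, weights))

weights = parse_scoretype(SCORETYPE)
# scorevec copied from the sample record (transcriptID=acyp2eg0000002t1).
print(total_score("71,0,0,0,0,67,28,2,75,4496,0,1150,2949", weights))
```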
Sample
transcriptID=acyp2eg0000002t1
gene=acyp2eg0000002
quality=Express:Strong,Homology:None,Intron:good,Mated:perfect,Protein:complete
aaSize=982,66%
cdsSize=2949/4496
Name=uncharacterized protein (66%)
Dbxref=APHIDBASE:ACYPI52644-RA,54%C,RefSeq:XM_003239978.1,83%C
express=100%,12226r
ortholog=4%,71.6,UniProt:E0VP94_PEDHC,Arp:pediculus_PHUM354810-PA
intron=2/2
oid=ars27cuf8:aphid_cuf8r27Gsc1.248.1
score=31792
scorevec=71,0,0,0,0,67,28,2,75,4496,0,1150,2949
Alternate transcript indicator in ID and isoform field:
transcriptID=acyp2eg0000002t2
gene=acyp2eg0000002
isoform=2
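The sample record can be read into a dictionary with a few lines of Python. This is a sketch that assumes the one key=value per line layout shown in the sample; the real annot.txt is described above as a tabbed table, so the splitting step would need adjusting for that file. All function names are hypothetical.

```python
import re

SAMPLE = """\
transcriptID=acyp2eg0000002t1
gene=acyp2eg0000002
quality=Express:Strong,Homology:None,Intron:good,Mated:perfect,Protein:complete
aaSize=982,66%
cdsSize=2949/4496
Dbxref=APHIDBASE:ACYPI52644-RA,54%C,RefSeq:XM_003239978.1,83%C
"""

def parse_record(text):
    """Read key=value lines into a dict, skipping lines without '='."""
    record = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.strip().split("=", 1)
            record[key] = value
    return record

def parse_quality(value):
    """'Express:Strong,Homology:None,...' -> {'Express': 'Strong', ...}"""
    return dict(item.split(":", 1) for item in value.split(","))

def coding_percent(cds_size):
    """'2949/4496' -> CDS length as a rounded percent of transcript length."""
    cds, transcript = (int(x) for x in cds_size.split("/"))
    return round(100 * cds / transcript)

def parse_dbxref(value):
    """Pair each DB:accession with the nn% and I/C flag that follows it."""
    refs = []
    for token in value.split(","):
        match = re.match(r"^(\d+)%([IC])$", token)
        if match and refs:
            refs[-1]["percent"] = int(match.group(1))
            refs[-1]["flag"] = match.group(2)
        else:
            db, _, accession = token.partition(":")
            refs.append({"db": db, "id": accession})
    return refs

rec = parse_record(SAMPLE)
print(parse_quality(rec["quality"]))
print(coding_percent(rec["cdsSize"]))
print(parse_dbxref(rec["Dbxref"]))
```

For this record the rounded CDS/transcript percent agrees with the 66% given in aaSize, and the CDS length of 2949 equals 3 x (982 + 1), which would be consistent with the CDS including a stop codon.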
#------------------------------------------------------------------
Genera of best homolog in Uniprot 2011.05 all arthropods + human + 3-bacteria (726442 proteins)
for primary transcript, n=27675 genes in aphid2_evigene8f
7308 Tribolium
3612 Camponotus ant
2406 Pediculus
2180 Drosophila
1990 Harpegnathos ant
1486 Solenopsis ant
1126 Daphnia
1034 HUMAN
945 Aedes mosquito
890 Culex mosquito
786 Anopheles mosquito
668 Bombyx
623 Ixodes
730 ECOLI/BACSU bacteria
236 Glyptapanteles
... others under 100
The Uniref-50 cluster name for best UP homolog is in most cases given as Name in this gene set.