
See below a brief description of these data files for
Nasonia vit. NCBI Gnomon predictions, not in RefSeq or Glean6 gene sets
-- Don Gilbert
      Name                                    Last modified       Size  Description

[DIR] Parent Directory                        04-Feb-2010 14:57      -
[TXT] notglean_gnomon.blasttab                05-Dec-2008 12:44   482k
[TXT] notglean_gnomon.hmm                     12-Nov-2008 18:50   206k
[TXT] notglean_gnomon_est.hmm                 05-Dec-2008 15:05     6k
[TXT] notglean_gnomon_evch-phylo_dist.txt     15-Nov-2008 17:30    18k
[TXT] notglean_gnomon_evch.hmm                05-Dec-2008 15:06    27k
[TXT] notglean_gnomon_notevchormcl.hmm        05-Dec-2008 15:07    75k
[TXT] notglean_gnomon_ortholog.e15.hmm        05-Dec-2008 15:07     8k
[TXT] notglean_good.gene.gff                  05-Dec-2008 16:46   537k
[TXT]                                         05-Dec-2008 17:10   294k
[TXT] notglean_omclgn2sum.hmm                 05-Dec-2008 15:05   244k
[TXT] notglean_tegene.hmm                     05-Dec-2008 15:05    61k
[TXT] notglean_wellknown.gnomon.gff           12-Nov-2008 20:30     9k
[TXT] pasa_nasv.novelgenes.gff                05-Dec-2008 17:40   303k
[TXT] pasa_nasv.novelgenes.protein.fa         05-Dec-2008 17:34   122k
[TXT] pasa_nasv.novelgenes.transcript.fa      05-Dec-2008 17:37   289k
[TXT]                                         05-Dec-2008 14:00   721k
[DIR] work/                                   05-Dec-2008 17:08      -

Nasonia vit. NCBI Gnomon predictions, not in RefSeq or Glean6 gene sets

08Dec05 update: : table of additional gene predictions with evidence
notglean_good.gene.gff     : NCBI Gnomon mRNA GFF lines with added evidence annotation

These include 2271 genes with protein homology and/or EST evidence
but lacking Transposon (TEgene) annotation from Chris Smith's analysis,
out of 8906 total gene predictions not in the RefSeq + Glean6 sets (file notglean_gnomon.hmm).
The table lists Gnomon_ID, evidence flags (EST, Homology),
     ortholog gene IDs with BLAST score, and Arthropod gene cluster ID.

notglean_good.gene.gff has this same information pasted into the attributes of the Gnomon GFF lines.
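
To pull the evidence annotation back out of the GFF attributes, something like the following Python sketch could work. It assumes GFF3-style key=value;key=value attributes in column 9; the attribute keys (ID, evidence, cluster) and the gene ID in the example line are hypothetical, so check them against the actual file before relying on this.

```python
def parse_gff_attributes(column9):
    """Split a GFF3-style attribute column (key=value;key=value) into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition("=")
        attributes[key] = value
    return attributes


def read_gene_evidence(gff_lines):
    """Yield (gene_id, attributes) for each mRNA feature line."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "mRNA":
            continue  # only well-formed mRNA lines carry the annotation
        attributes = parse_gff_attributes(columns[8])
        yield attributes.get("ID", ""), attributes


# Hypothetical example line; the attribute keys and values are illustrative only.
example = [
    "scaffold1\tgnomon\tmRNA\t100\t900\t.\t+\t.\t"
    "ID=hmm000001;evidence=EST,Homology;cluster=ARP1_G548\n",
]
for gene_id, attrs in read_gene_evidence(example):
    print(gene_id, attrs.get("evidence"), attrs.get("cluster"))
```

For the real file, pass an open file handle for notglean_good.gene.gff in place of the example list.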

====================  Background  =================
Exon data:
% wc -l *exons
   90614 nasonia_glean6.exons
  119571 nasv_pred_gnomon.exons
   33644 notglean_gnomon.exons   < subset with no glean6 gene overlap 
          (removed all exons of any gene with an exon overlapping a glean6 gene)

Gene ID lists from exons:
% wc -l *gnomon.gids
   17386 hasglean_gnomon.gids   < overlap glean6  
    9665 notglean_gnomon.gids   < no overlap

RefSeq genes in above:
% grep -c '^LOC' *gnomon.gids
  hasglean_gnomon.gids:8395       < RefSeq genes in glean6
  notglean_gnomon.gids:759        < RefSeq genes not in glean6

Remainder: Gnomon predictions not in Glean6 or RefSeq:
 8906 notglean_notrefseq_gnomon.gids
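
The gene ID bookkeeping above can be sketched in Python as below. This is only an illustration of the set logic, not the commands actually used: the toy IDs are hypothetical, and the 'LOC' prefix test for RefSeq genes mirrors the grep pattern shown above.

```python
def partition_gene_ids(all_gene_ids, glean_overlap_ids):
    """Split Gnomon gene IDs by Glean6 overlap, then drop RefSeq (LOC*) IDs."""
    overlap = set(glean_overlap_ids)
    unique_ids = set(all_gene_ids)
    has_glean = sorted(g for g in unique_ids if g in overlap)
    not_glean = sorted(g for g in unique_ids if g not in overlap)
    # RefSeq genes are recognized by the 'LOC' prefix, as in grep '^LOC'.
    not_glean_not_refseq = [g for g in not_glean if not g.startswith("LOC")]
    return has_glean, not_glean, not_glean_not_refseq


# Toy, hypothetical IDs to show the three lists.
all_ids = ["LOC100001", "LOC100002", "hmm1", "hmm2", "hmm3"]
overlap_ids = ["LOC100001", "hmm1"]
has_glean, not_glean, remainder = partition_gene_ids(all_ids, overlap_ids)
print(remainder)

# Consistency check on the counts reported above:
# 9665 no-overlap genes minus 759 RefSeq genes leaves 8906.
assert 9665 - 759 == 8906
```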

Orthology from BlastP to 12 arthropod (10 insect) proteomes.
 8055 notglean_notrefseq_gnomon match some other gene (same or different species)
      using BlastP evalue <= 1e-5 reciprocal matches from arthropod orthology

  242 notglean_notrefseq_gnomon match another species, at evalue <= 1e-15
    = file: notglean_gnomon_ortholog.e15.hmm

 4920 notglean_notrefseq_gnomon match other Nasonia gene at evalue <= 1e-15
        many/most of these seem to be transposon genes (they form the largest clusters, with 100s of genes)
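
A rough Python sketch of the evalue <= 1e-15 split into cross-species and Nasonia-only matches follows. It assumes the standard 12-column BLAST tabular layout (evalue in column 11), and that the species can be read from a prefix on the gene ID; that prefix convention, the 'nasvi' tag and the toy rows are hypothetical. It also assumes a gene with any cross-species hit is counted only in the cross-species group.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-15


def species_of(gene_id):
    """Hypothetical convention: species tag is the text before the first '_'."""
    return gene_id.split("_", 1)[0]


def classify_hits(blasttab_handle, cutoff=E_VALUE_CUTOFF, focal_species="nasvi"):
    """Return (genes with cross-species hits, genes with same-species hits only)."""
    cross_species, same_species = set(), set()
    for row in csv.reader(blasttab_handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        query, subject, evalue = row[0], row[1], float(row[10])
        if query == subject or evalue > cutoff:
            continue  # ignore self hits and weak matches
        if species_of(subject) == focal_species:
            same_species.add(query)
        else:
            cross_species.add(query)
    return cross_species, same_species - cross_species


# Toy rows in 12-column BLAST tabular format (all IDs are made up).
rows = (
    "nasvi_g1\tapime_g9\t80.0\t200\t40\t0\t1\t200\t1\t200\t1e-40\t300\n"
    "nasvi_g2\tnasvi_g3\t90.0\t150\t15\t0\t1\t150\t1\t150\t1e-30\t250\n"
    "nasvi_g4\tapime_g7\t30.0\t100\t70\t0\t1\t100\t1\t100\t1e-6\t50\n"
)
cross, same = classify_hits(io.StringIO(rows))
print(sorted(cross), sorted(same))
```

For real data, open notglean_gnomon.blasttab and pass the file handle in place of the StringIO, after adjusting species_of to the ID scheme actually used.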

Arthropod gene cluster descriptions for  notglean_gnomon_ortholog
  file: notglean_omclgn2sum.e15.arpdesc

Count of these nas-notglean genes with insect orthologs:

Nas      No.
Genes    Taxa 
  85 ntaxa: 2
  33 ntaxa: 3
  22 ntaxa: 4
  13 ntaxa: 5
   9 ntaxa: 6
   4 ntaxa: 7
   1 ntaxa: 8
   4 ntaxa: 9
   9 ntaxa: 10
  11 ntaxa: 11  |  These 50 are well known insect/arthropod orthologs.
  19 ntaxa: 12  |  Should be in RefSeq but are not. 
  21 ntaxa: 13  |  36 are called Pseudogene, expert should look at, could be non-pseudo.
                   file: notglean_wellknown.gnomon.gff
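
A tally like the one above could be produced along these lines in Python. This is a sketch under assumptions: the input mapping from gene to taxa is hypothetical, and the taxa count is assumed to include Nasonia itself (so the minimum is 2 and the maximum with 12 other proteomes is 13).

```python
from collections import Counter


def taxa_histogram(gene_to_taxa):
    """Count genes by the number of distinct taxa in their ortholog group."""
    return Counter(len(set(taxa)) for taxa in gene_to_taxa.values())


# Hypothetical genes and species tags, for illustration only.
gene_to_taxa = {
    "g1": ["nasvi", "apime"],
    "g2": ["nasvi", "apime", "drome"],
    "g3": ["nasvi", "trica"],
}
histogram = taxa_histogram(gene_to_taxa)
for ntaxa in sorted(histogram):
    # Same layout as the table above: gene count, then number of taxa.
    print(f"{histogram[ntaxa]:5d} ntaxa: {ntaxa}")
```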


The count by your criteria is lower than what I find when I look with other criteria, 
but you win this bet, with 242 of these missed orthology genes.

Find here these gene lists and supporting data (let me know what is unclear):

This is the Gnomon gene file for the not-glean-not-refseq with orthologs, evalue < e-15:

About 50 of these are valuable 1-1 orthologs across 10 other insects, genes
you don't want to miss in Nasonia.  However, Gnomon calls 36 of these
Pseudogenes.  An expert should look at them, as Gnomon and others can mistake
the end of a scaffold or NNN error for a pseudogene.

This is a table of orthologs with my ARP ID. Some have useful
descriptions, others no description or hypothetical protein:

This table summarizes all the blastp matches for these as gene pairs with e-values:

The 242 misses are a small enough count that you can work them into your current gene set
without much effort.  The majority of the ones that turn up as matching other genes 
are matching other Nasonia genes, presumably many of them transposons;
about 5000 of the 8000 non-glean set are of this variety.  Some, possibly many, are likely real wasp 
genes with paralogs.

I can get a higher count of possible orthologs among these notglean predictions:
many of the Nasonia genes that have significant paralogs, but no cross-species matches of their own,
fall in orthology clusters with other species.  There are about 1000 notglean genes with
this possible orthology.  
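
One way to list such genes is sketched below in Python. It is not the procedure actually used here: the cluster representation (cluster ID mapped to a list of species/gene pairs), the 'nasvi' species tag and all IDs are hypothetical stand-ins for whatever the OrthoMCL output provides.

```python
def possible_orthologs(clusters, nasonia_only_genes, focal_species="nasvi"):
    """Find Nasonia-only-match genes that sit in clusters with another species.

    clusters: dict of cluster_id -> list of (species, gene_id) pairs.
    nasonia_only_genes: genes whose BLAST matches are all to other Nasonia genes.
    """
    found = {}
    for cluster_id, members in clusters.items():
        species_in_cluster = {species for species, _ in members}
        if not species_in_cluster - {focal_species}:
            continue  # Nasonia-only cluster: no cross-species support
        for species, gene in members:
            if species == focal_species and gene in nasonia_only_genes:
                found[gene] = cluster_id
    return found


# Toy clusters with made-up IDs.
clusters = {
    "cluster_A": [("apime", "a1"), ("nasvi", "n1"), ("nasvi", "n2")],
    "cluster_B": [("nasvi", "n3"), ("nasvi", "n4")],
}
nasonia_only = {"n2", "n3"}
print(possible_orthologs(clusters, nasonia_only))
```

Here n2 is reported because its cluster also holds a gene from another species, while n3 is not, since its cluster contains only Nasonia genes.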

The OrthoMCL clustering says there is common homology here;
you may or may not agree with that, but it gives a basis for thinking these may be
real but derived genes.  Just to pick one at random, ARP1_G548 "Odorant receptor 30aCG13106-PA;"
is a cluster of one Apis gene and 23 Nasonia genes, 3 of which are in this Nasonia-only
blastp matching category.  Which of the 23 listed here would seem to be false positives?

You can read more about use of OrthoMCL for detecting orthology here:
   Li Li, Christian J. Stoeckert, Jr., and David S. Roos
   OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes. Genome Res. 2003 13: 2178-2189.
   Feng Chen, Aaron J. Mackey, Jeroen K. Vermunt, and David S. Roos
   Assessing Performance of Orthology Detection Strategies Applied to Eukaryotic Genomes. 
   PLoS ONE 2007 2(4): e383.

I applied the methods as described in these papers.

- Don

Developed at the Genome Informatics Lab of Indiana University Biology Department