Nasonia vitripennis: NCBI Gnomon predictions not in the RefSeq or Glean6 gene sets
notglean_good_evidence.tab : table of additional gene predictions with evidence
notglean_good.gene.gff : NCBI Gnomon mRNA GFF lines with added evidence annotation
These files cover 2271 genes that have protein homology and/or EST evidence
and lack a Transposon (TEgene) annotation in Chris Smith's analysis,
out of 8906 total gene predictions not in the RefSeq + Glean6 sets (file notglean_gnomon.hmm).
The notglean_good_evidence.tab table lists the Gnomon_ID, evidence flags (EST, Homology),
ortholog gene IDs with blast score, and the Arthropod gene cluster ID.
notglean_good.gene.gff has the same information pasted into the attributes of the Gnomon GFF lines.
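As a rough sketch of how the table columns could be pasted into the GFF attributes: the column order (Gnomon_ID, evidence flags, orthologs, cluster), the attribute keys `evidence`, `ortholog` and `cluster`, and the sample IDs used to exercise it are all assumptions for illustration, not the actual layout of the two files.

```python
def parse_evidence(tab_lines):
    """Map Gnomon_ID -> evidence fields (assumes 4 tab-separated columns)."""
    table = {}
    for line in tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Pad short rows so missing trailing columns become empty strings.
        fields = (line.rstrip("\n").split("\t") + [""] * 4)[:4]
        gid, flags, orthologs, cluster = fields
        table[gid] = {"evidence": flags, "ortholog": orthologs, "cluster": cluster}
    return table


def annotate_gff(gff_lines, evidence):
    """Append evidence fields to column 9 of mRNA lines whose ID is in the table."""
    out = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "mRNA":
            attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
            ev = evidence.get(attrs.get("ID", ""), {})
            extra = ";".join(f"{key}={val}" for key, val in ev.items() if val)
            if extra:
                cols[8] = cols[8].rstrip(";") + ";" + extra
        out.append("\t".join(cols))
    return out
```

Lines that are not mRNA features, or whose ID has no row in the table, pass through unchanged.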
==================== Background =================
% wc -l *exons
33644 notglean_gnomon.exons < subset with no glean6 gene overlap
(removed all exons of any gene that has an exon overlapping a glean6 gene)
Gene ID lists from exons:
% wc -l *gnomon.gids
17386 hasglean_gnomon.gids < overlap glean6
9665 notglean_gnomon.gids < no overlap
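The split into hasglean / notglean ID lists can be sketched as below, assuming a gene counts as overlapping when any one of its exons overlaps a glean6 gene span on the same scaffold. The tuple layouts and function name are illustrative assumptions, not the format of the .exons files, and the scan is a simple quadratic one rather than an indexed interval search.

```python
from collections import defaultdict


def split_by_overlap(gnomon_exons, glean_genes):
    """Split Gnomon gene IDs by overlap with glean6 genes.

    gnomon_exons: iterable of (gene_id, scaffold, start, end)
    glean_genes:  iterable of (scaffold, start, end)
    Coordinates are 1-based and inclusive.
    Returns (has_overlap_ids, no_overlap_ids) as sorted lists.
    """
    spans = defaultdict(list)
    for scaffold, start, end in glean_genes:
        spans[scaffold].append((start, end))

    all_ids, hit_ids = set(), set()
    for gene_id, scaffold, start, end in gnomon_exons:
        all_ids.add(gene_id)
        # Two closed intervals overlap when each starts before the other ends.
        if any(start <= g_end and g_start <= end
               for g_start, g_end in spans.get(scaffold, ())):
            hit_ids.add(gene_id)
    return sorted(hit_ids), sorted(all_ids - hit_ids)
```

A gene with a single overlapping exon lands wholly in the first list, so none of its exons remain in the no-overlap subset.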
RefSeq genes in above:
% grep -c '^LOC' *gnomon.gids
hasglean_gnomon.gids:8395 < RefSeq genes in glean6
notglean_gnomon.gids:759 < RefSeq genes not in glean6
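For reference, the same count and the resulting remainder in a few lines of Python. The helper name is an assumption, and the `LOC` prefix test simply mirrors the grep pattern above.

```python
def count_refseq(gene_ids, prefix="LOC"):
    """Count IDs that start with the RefSeq-style prefix, like grep -c '^LOC'."""
    return sum(1 for gene_id in gene_ids if gene_id.startswith(prefix))


# Counts quoted in the listings above.
not_glean_total = 9665   # notglean_gnomon.gids
not_glean_refseq = 759   # of those, RefSeq (LOC) genes
remainder = not_glean_total - not_glean_refseq
print(remainder)  # 8906
```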
Remainder: 8906 Gnomon predictions (9665 - 759) not in Glean6 or RefSeq:
Orthology from BlastP to 12 arthropod (10 insect) proteomes.
8055 notglean_notrefseq_gnomon match some other gene (same or different species)
using BlastP reciprocal matches at evalue <= 1e-5 from the arthropod orthology,
242 notglean_notrefseq_gnomon match another species, at evalue <= 1e-15
= file: notglean_gnomon_ortholog.e15.hmm
4920 notglean_notrefseq_gnomon match other Nasonia gene at evalue <= 1e-15
many/most of these seem to be transposon genes (they form the largest clusters, with 100s of members)
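The two evalue <= 1e-15 categories above can be sketched as a simple pass over BlastP hits. The tuple layout, the species labels, and treating the two categories as mutually exclusive (a gene with any cross-species hit is not counted as Nasonia-only) are assumptions for illustration.

```python
def classify_hits(hits, query_species="nasonia", max_evalue=1e-15):
    """Classify query genes by the species of their significant hits.

    hits: iterable of (query_id, subject_id, subject_species, evalue)
    Returns (cross_species_ids, same_species_only_ids) as sets.
    """
    cross, same = set(), set()
    for query, subject, species, evalue in hits:
        # Skip self-hits and hits weaker than the cutoff.
        if query == subject or evalue > max_evalue:
            continue
        if species == query_species:
            same.add(query)
        else:
            cross.add(query)
    # A gene with any cross-species hit is removed from the same-species-only set.
    return cross, same - cross
```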
Arthropod gene cluster descriptions for notglean_gnomon_ortholog
Count of these nas-notglean genes with insect orthologs:
85 ntaxa: 2
33 ntaxa: 3
22 ntaxa: 4
13 ntaxa: 5
9 ntaxa: 6
4 ntaxa: 7
1 ntaxa: 8
4 ntaxa: 9
9 ntaxa: 10
11 ntaxa: 11 | These ~50 (ntaxa 11-13) are well-known insect/arthropod orthologs.
19 ntaxa: 12 | They should be in RefSeq but are not.
21 ntaxa: 13 | Gnomon calls 36 of them Pseudogene; an expert should look, as some may not be pseudogenes.
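A table like the one above can be tallied from cluster membership as sketched here. The two mapping arguments are assumed data structures, not the format of the cluster files. The second part only re-adds the counts printed above.

```python
from collections import Counter


def ntaxa_histogram(gene_cluster, cluster_species):
    """Count genes by the number of taxa in their cluster.

    gene_cluster:    gene_id -> cluster_id
    cluster_species: cluster_id -> set of species names in that cluster
    Returns a Counter mapping ntaxa -> number of genes.
    """
    return Counter(len(cluster_species[c]) for c in gene_cluster.values())


# Counts as printed in the table above (ntaxa -> genes).
table = {2: 85, 3: 33, 4: 22, 5: 13, 6: 9, 7: 4, 8: 1, 9: 4, 10: 9,
         11: 11, 12: 19, 13: 21}
print(sum(table.values()))                                  # 231
print(sum(n for taxa, n in table.items() if taxa >= 11))    # 51
```

The rows sum to 231 against the 242 quoted for this set; the notes do not say where the other 11 fall.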
The count by your criteria is lower than what I find when I look with other criteria,
but you win this bet: 242 of these are missed genes with orthology.
Find here these gene lists and supporting data (let me know what is unclear):
This is the Gnomon gene file for the not-glean-not-refseq predictions with orthologs, evalue <= 1e-15:
About 50 of these are valuable 1-1 orthologs across 10 other insects, genes
you don't want to miss in Nasonia. However, Gnomon calls 36 of these
Pseudogenes. An expert should look at them, as Gnomon and other predictors can mistake
the end of a scaffold or an NNN error for a pseudogene.
This is a table of orthologs with my ARP ID. Some have useful
descriptions; others have no description or only "hypothetical protein":
This table summarizes all the blastp matches for these as gene pairs with e-values:
The 242 misses are a small enough count that you can work them into your current gene set
without much effort. Most of the predictions that turn up as matching other genes
match other Nasonia genes, presumably many of them transposons;
about 5000 of the 8000 in the non-glean set are of this variety. Some, possibly many,
are likely real wasp genes with paralogs.
I can get a higher count of possible orthologs among these notglean predictions:
many of the Nasonia genes that have significant paralogs, but no cross-species matches
of their own, fall in orthology clusters with other species. There are about 1000
notglean genes with this possible orthology.
The OrthoMCL clustering says there is common homology here.
You may or may not agree with that, but it gives a basis for thinking these may be
real but derived genes. Just to pick one at random, ARP1_G548 "Odorant receptor 30aCG13106-PA;"
is a cluster of one Apis gene and 23 Nasonia genes, 3 of which are in this Nasonia-only
blastp matching category. Which of the 23 listed here would seem false positives?
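To make the cluster argument concrete, here is a small sketch that tallies species per cluster and pulls out Nasonia-only-match genes that nonetheless sit in a cluster with another species. The (cluster_id, species, gene_id) tuple layout, the function names, and the species labels are assumptions; this is not OrthoMCL's own output format.

```python
from collections import Counter, defaultdict


def cluster_composition(members):
    """members: iterable of (cluster_id, species, gene_id).
    Returns cluster_id -> Counter of species."""
    comp = defaultdict(Counter)
    for cluster, species, _gene in members:
        comp[cluster][species] += 1
    return comp


def possible_orthologs(members, candidates, own_species="nasonia"):
    """Return candidate genes that sit in clusters also holding another species."""
    members = list(members)  # iterated twice below
    comp = cluster_composition(members)
    return sorted(
        gene
        for cluster, species, gene in members
        if gene in candidates
        and species == own_species
        and any(other != own_species for other in comp[cluster])
    )
```

Applied to a cluster shaped like ARP1_G548 (one Apis gene, 23 Nasonia genes), any of the Nasonia-only-match candidates among the 23 would be reported, while candidates in purely Nasonia clusters would not.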
You can read more about use of OrthoMCL for detecting orthology here:
Li Li, Christian J. Stoeckert, Jr., and David S. Roos
OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes. Genome Res. 2003 13: 2178-2189.
Feng Chen, Aaron J. Mackey, Jeroen K. Vermunt, and David S. Roos
Assessing Performance of Orthology Detection Strategies Applied to Eukaryotic Genomes.
PLoS ONE 2007 2(4): e383.
I applied the methods as described in these papers.