VectorBase Anopheles+Aedes locus counts
  Ano. gambiae    12994 pc,   -- nc             (AgamP4.3)
  Ano. funestus   13471 pc,  412 nc,  161 alts  (AfunF1.3)
  Ano. albimanus  12085 pc,  424 nc,  110 alts  (AalbS1.3)
  Aedes aegypti   15696 pc, 1680 nc, 1462 alts  (AaegL3.3, )
  pc = protein coding, nc = non-coding

Evigene reconstruction of Anopheles+Aedes genes
  Ano. funestus   27848 loci
  Ano. albimanus  29310 loci
  Aedes aegypti   68737 loci

Q: Way too many genes there, Don, what's wrong?
A: Classify those loci with gene characteristics ..
   Conservatively classified coding loci are comparable in count to
   VectorBase coding set, additional characteristics suggest additional
   coding and non-coding loci.

Classified Evigene reconstruction of Anopheles+Aedes genes
  Ano. funestus   14053..23467 pc, 2665..8303 nc loci, 29880 alts at 10653 loci
  Ano. albimanus  12000..18000 pc, nnnn..8812 nc loci, 38089 alts at 10522 loci  [classify in progress]
  Aedes aegypti   20546..23000 pc, 6344..24000 nc loci, 86437 alts at 17439 loci  [classify in progress]

Don Gilbert, 2016.05.06
-----------
Evigene reconstruction of Anopheles gene loci
  http://arthropods.eugenes.org/EvidentialGene/arthropods/mosquito/

funestus 27,848 total funestus loci classified by gene characteristics
  a. insect prot homology           13201
     Of these, 40% improve homology, 48% equal, vs Vectorbase Ano.funestus loci
  b. related species conservation    8981  (not in a., Ano.minimus+culicifacies for Ano.fun)
     b1. coding-conservation          852  (signif Ka/Ks)
     b2. sequence-aligned            8129  (not signif Ka/Ks)
  c. transposon/rRNA                  150  [excludes 440 with a.insect ho, ambiguous TE call, eg AGAP028564-PA]
  d. coding-potential                      (not a,b,c; calculated)
     d1. good coding           1285..3693  (with..without b2)
     d2. poor/non-coding       2665..8303  (w/o b2, 426 ambiguous)
  e. contaminant/artifact                  (partly in c, unfinished)

  Total classed : 26,282
  coding.1      : 14,053 = 13201.a + 852.b1
  coding.2      : 17,746 = 13201.a + 852.b1 + 3693.d1
  coding.3      : 23,467 = 13201.a + 8981.b1,2 + 1285.d1
  noncoding     : 2,665..8,303 (d2)
  [remainder 1566 are what?
  not in ho/noho id lists]
-----------------------
albimanus 29,310 total albimanus loci classified by gene characteristics
  a. insect prot homology           11649
     Of these, 41% improve homology, 38% equal, vs Vectorbase Ano.albimanus loci
  b. related species conservation    nnnn  (not a.)
     b1. coding-conservation          nnn  (signif Ka/Ks)
     b2. sequence-aligned            nnnn  (not signif Ka/Ks)
  c. transposon                      ~150
  d. coding-potential                      (includes a,b,c)
     d1. good coding                18731
     d2. poor/non-coding             8812, 1767 ambiguous
  e. contaminant/artifact                  not done yet
-----------------------
Evigene reconstruction of Aedes gene loci
  http://arthropods.eugenes.org/EvidentialGene/arthropods/mosquito/aedes_aegypti/

aegypti 68,737 total
  good 30018, subset has insect homology and/or multiple introns on chromosomes
  poor 38725, subset lacks both homology and introns

  a.good, insect prot homology      20546
     Of these, 44% improve homology, 45% equal, vs Vectorbase Aedes.aegypti loci
  a.poor, insect prot homology          0
  b. related species conservation          (includes a.)
     b1.good, coding-conservation   20368  (sig Ka/Ks)
     b2.good, sequence-aligned       4921  (ns Ka/Ks)
     b1.poor, coding-conservation    5326  (sig Ka/Ks)
     b2.poor, sequence-aligned      13816  (ns Ka/Ks)
  c.good, transposons/rRNA      497..1036  (ambiguous a.insect ho vs transposons)
  c.poor, transposons/rRNA           1573
  d. coding-potential                      (includes a,b,c)
     d1.good, good coding           22483
     d2.good, poor/non-coding        6344  (1235 ambiguous)
     d1.poor, good coding    12530..18012
     d2.poor, poor/non-coding 19267..24749 (1440 ambiguous)
  e. contaminant/artifact                  not done yet
---------------
Coding+sequence conservation is found in related species genomes,
not in related spp gene models.

A classifier for gene loci is based on measurable attributes of genes,
generally all measured for all loci. Hierarchical classification rules
then apply those measures in order of reliability (which can include
complex weightings, but simplify to):

  a. protein orthology/homology to reference + related species genes
     (i.e., protein matches well to reliably known protein)
  > remainder: b/c/e weighted character discrimination to b,c,e categories
     (i.e., scores b/c/e = 1/0/0 > b.category, scores 0/1/0 > c,
      scores 0.5/0.5/0.5 > ambiguous; depends on reliability of scoring)
  > remainder: d coding potential character
     (i.e., looks like coding or looks like non-coding)

Other classifying rules, and refinements, are possible (what does UniProt use?)
---------------
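
The simplified rule order above (a, then b/c/e, then d) can be sketched in Python as below. This is only an illustration and not the Evigene code: the Locus fields, the classify function, and the cutoffs (min_score, margin, coding_cutoff) are all hypothetical names and values, picked so that the example scores in the notes (1/0/0, 0/1/0, 0.5/0.5/0.5) come out as b, c and ambiguous.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Locus:
    """Measured attributes for one locus. All field names are hypothetical."""
    locus_id: str
    has_homology: bool = False   # a. protein matches a reliably known protein
    score_b: float = 0.0         # b. related-species conservation score, 0..1
    score_c: float = 0.0         # c. transposon/rRNA score, 0..1
    score_e: float = 0.0         # e. contaminant/artifact score, 0..1
    coding_potential: Optional[float] = None  # d. calculated score, 0..1


def classify(locus: Locus, min_score: float = 0.5, margin: float = 0.25,
             coding_cutoff: float = 0.5) -> str:
    """Apply the rules in order of reliability: a, then b/c/e, then d.

    The three cutoffs are assumed values for illustration only.
    """
    # Rule a: protein homology is the most reliable evidence, so it wins.
    if locus.has_homology:
        return "a"

    # Rule b/c/e: choose the best-scoring category, but only when it is
    # clearly ahead of the runner-up; otherwise call the locus ambiguous.
    scores = {"b": locus.score_b, "c": locus.score_c, "e": locus.score_e}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    if top >= min_score:
        return best if top - second >= margin else "ambiguous"

    # Rule d: the remainder is split by calculated coding potential.
    if locus.coding_potential is None:
        return "unclassed"
    return "d1" if locus.coding_potential >= coding_cutoff else "d2"


if __name__ == "__main__":
    from collections import Counter

    # Made-up loci, one per rule outcome; not real Evigene records.
    demo = [
        Locus("locus1", has_homology=True),
        Locus("locus2", score_b=1.0),
        Locus("locus3", score_c=1.0),
        Locus("locus4", score_b=0.5, score_c=0.5, score_e=0.5),
        Locus("locus5", coding_potential=0.8),
        Locus("locus6", coding_potential=0.2),
    ]
    print(Counter(classify(x) for x in demo))
```

Tallying the returned labels over all loci would give per-class counts in the style of the tables above. Replacing the fixed margin with reliability-weighted scores is one of the refinements the notes leave open.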