Daphnia magna gene set update
http://arthropods.eugenes.org/EvidentialGene/daphnia/daphnia_magna_new/
public candidate: evg7vose/pubset9b/  03 Oct 2014

TABLE G1. Daphnia magna Gene set numbers, version evg7f9b, 03 Oct 2014
---------------------------------------------------
 29121 gene loci, all supported by mRNA-seq and/or protein homology evidence
 26825 (92%) are mRNA assemblies, 2296 (8%) are genome-modelled
 28127 (99%) have uniquely mapped RNA-seq, and 98% of 7 Billion RNA reads map to these transcripts
 22059 (76%) have complete proteins, 7068 have partial proteins
 22063 (76%) have homology to other species (blastp e<=1e-5 to proteins or conserved domains),
 11770 (40%) are orthologs to other species, 4535 (16%) are inparalogs of orthologs,
       and 12826 (44%) are species-unique (by OrthoMCL clustering, some uniques have homology)
  5170 (18%) have homology only to other Daphnia.
 18962 (65%) are properly mapped to genome (>=80% coverage, no splits),
 10189 (35%) improperly gmapped, 3389 (12%) are un-mapped genes,
       3386 (12%) partial-mapped <80% coverage, 3414 (12%) split-mapped >= 80% cover.
 2860/20558 (14%) are single-exon loci of those mapping >= 40% to genome.
 84898 alternate transcripts are at 17473 loci (60%), ave. 5 transcripts per locus,
       DSCAM has 123 alts, 10 have >= 100 alts, 56 have >= 50, 2496 have >= 10 alts.
 400-1000+ are trans-spliced genes (mRNA/protein and introns are on reverse strands),
       more loci show bi-directional transcription rates from <0.1% to ~50%, by intron counts.
       This is likely the first evidence of trans-spliced genes in Daphnia; a few other
       crustaceans show it, but it generally has not been seen much among arthropods
       (a well-known example is fly mod(mdg4)).

 Gene locus IDs: Dapma7bEVm000001t1 .. Dapma7bEVm030756t1,
 Alternate transcripts have ID suffix t2 .. t100.
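The locus/alternate ID convention above can be split apart with a short sketch like the one below. The regular expression is an assumption inferred only from the two example IDs (a fixed Dapma7bEVm prefix, six digits, then "t" plus a transcript number), and the function name split_evg_id is hypothetical, not part of EvidentialGene.

```python
import re

# Assumed ID layout, inferred from the examples in Table G1:
#   Dapma7bEVm + 6-digit locus number + "t" + transcript number
# t1 is taken as the primary transcript; t2 and above are alternates.
EVG_ID = re.compile(r"^(Dapma7bEVm\d{6})t(\d+)$")


def split_evg_id(transcript_id):
    """Return (locus_id, transcript_number, is_primary) for one transcript ID."""
    match = EVG_ID.match(transcript_id)
    if match is None:
        raise ValueError(f"not a recognized gene ID: {transcript_id!r}")
    locus_id = match.group(1)
    transcript_number = int(match.group(2))
    return locus_id, transcript_number, transcript_number == 1


if __name__ == "__main__":
    for tid in ("Dapma7bEVm000001t1", "Dapma7bEVm030756t12"):
        print(tid, split_evg_id(tid))
```

Grouping transcripts by the returned locus_id gives one record per locus, which is how the locus counts in this table exclude alternate isoforms.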
--------------------------------------------------------
# * Note, large change in mapqual from evg7f8, 30Aug14, is due to added mblast+gmap2 mappings
#   evg7f8: 15973 (56%) are properly mapped to genome, 12431 (44%) improperly gmapped,

TABLE G2. Arthropod gene orthology categories (using OrthoMCL)
         ----------------- GENES ------------------   ------ GROUPS -------
           nGene  Orth1  Ordup Inpara  Uniq1   UDup     OrGrp OrMis1 UniGrp
---------------------------------------------------   ---------------------
daphniam   29127   9139   2627   4360  10487   2339     11523     18    737
daphniap   45212   9865   2212   6296  15661  11220     11670     36   2795
beetlet    12420   8078    820   1298   1360    874      8765     42    250
beetlep    21503   6800    921   4500   6654   2678      8875     78    731
honeybee   72392   7951    926   1101  58685   3742      8682    161   1449
wasp       36326   7849    882   9057   9078   9616      8688    126   1319
fruitfly   13927   7049    734   1570   2967   1610      7801    203    464
shrimp     26962   6637    611   2272  14065   3397      7516    395   1038
human      39357   6533   4516  10324  12233   5723     11265     59   1984
fish_mz    23194   7968   3975   5803   3248   2206     11270     52    545
--------------------------------------------------------------------
source: arp7bor5/arp7s10f-orthomcl-gclass.tab, 2014.09.30
daphniam = daphnia magna, daphniap = daphnia pulex, beetlet = tribolium cas,
beetlep = pogonus cha., wasp = nasonia vit., honeybee = apis mel.,
fruitfly = dros. mel., shrimp = penaeus monodon tiger shrimp, fish_mz = maylandia z.
Key:
  nGene  = count of input genes, excludes alternate isoforms/locus.
  Orth1  = single copy orthologous genes,
  Ordup  = multi-copy old-ortholog genes (one-to-one matches among multicopies),
  Inpara = Inparalogs (recent ortholog duplicates) of orthologous genes
  Uniq1, UDup = single-copy and duplicated species-unique genes
  OrMis1 = groups missing in species that all other species have
  OrGrp, UniGrp = orthologous and species-unique groups
--------------------------------------------------------------------

TABLE G3.
Arthropod species gene set completeness, measured with
average protein size and orthology in gene groups
           ---Common Groups---     ----All Groups-----
Species    cBits aaSize OrMiss     tBits OrGroup   Tiny
---------  --------------------    --------------------
daphniam     650     46     18       466   11523   1.8%
daphniap     643    -25     36       462   11670   5.1%
beetlet      526    -26     42       351    8765   4.1%
beetlep      541     15     78       358    8875   3.5%
honeybee     532     38    161       346    8682   3.1%
wasp         526    -16    126       347    8688   5.1%
fruitfly     470     68    203       290    7801   1.8%
shrimp       463     22    395       285    7516   6.6%
----------------------------------------------------
source: arp7bor5/arp7s10f-orthomcl
  cBits   = bitscore average for 4740 common gene groups
  tBits   = bitscore average for all ortholog groups
  aaSize  = average protein size difference from group median
  OrMiss  = missing ortholog groups that are common to other 9 of 10 species
  OrGroup = number of ortholog gene groups in species
  Tiny    = percent species gene size outliers below 2sd of group median size
----------------------------------------------------

Gene Expression x Orthology
=======================================

TABLE M1. RNA-seq read mapping to Daphnia magna mRNA transcripts,
from 3 experiments, 2 clones (I,X), 6-8 treatments, 3 replicates/treatment,
approx. 100 M paired reads in each of 66 read sets.

ReadGroup  mRNAset    nMapread    nTotRead     nNomap  pctMap
---------------------------------------------------------------
hscItotal  finall8  2736214261  2789627581   53413320   98.1%
hscXtotal  finall8  3172301416  3233374500   61073084   98.1%
ndcXtotal  finall8   857334019   885996197   28662178   96.8%
hscItotal  finloc8  2429376789  2789627581  360250792   87.1%
hscXtotal  finloc8  2814739850  3233374500  418634650   87.1%
ndcXtotal  finloc8   791867853   885996197   94128344   89.4%
---------------------------------------------------------------
finall8 = primary and alt mRNA transcripts, n=112805, alts add ~11% read mapping.
finloc8 = primary mRNA transcripts only, n=28400
Read pairs are mapped to transcripts with GSNAP
  (v2014-05-15, opts:-N 0 --gmap-mode=none --pairexpect=400)
Other transcripts account for the 1.9% - 3.2% unmapped reads:
  a. 1991 contaminant assemblies are RNA-identical to contam species
     (mouse, human, a few others like human acne bacteria)
  b. 17283 fragment and semi-duplicate assemblies culled from finloc8 set,
     contain unattached alt exons, noncoding RNA, other transcribed "stuff".

TABLE M2. RNA-seq expression summary for Daphnia magna mRNA transcripts,
as levels of mapped-reads/Kb/Mill (R=RPKM) of 28400 loci, 112805 transcripts,
and 22 read-groups.

             R>0    R>=1   R>=10  R>=100 R>=1000
-------------------------------------------------
maximal expression
  loci     28146   19914   12716    2234     175
  trans   111561   96918   67055   12094     615
median expression
  loci     25794   15936    7202     659      42
  trans   108932   84466   39350    3609      76
uniquely mapped reads, combining 22 groups
  loci     28128   25857   21376   14105    5138
-------------------------------------------------
No expression was measured for 254 loci.

TABLE M3a. Daphnia magna gene Orthology by DE effects
as percent of loci per treatment effect.

Treat    orlog  inpar   uniq   oSUM
----------------------------------------
uHSi        7%    11%    80%    600
uHSx       10%    16%    73%    427
uNDx       10%    25%    63%   1046
uNDxPB     24%    22%    52%    225
none       33%    14%    51%  24760
dHSi       10%    14%    74%    249
dHSx       10%    29%    59%    204
dNDx       12%    22%    65%   1181
dNDxPB     13%    17%    68%    530
xSUM      8726   4499  15588  28813
----------------------------------------

TABLE M3b. Daphnia magna gene Orthology by DE group effects, as locus counts.
Treat    orlog  inpar   uniq   oSUM
----------------------------------------
uHSi        47     69    484    600
uHSx        43     69    315    427
uNDx       110    270    666   1046
uNDxPB      56     51    118    225
none      8269   3676  12815  24760
dHSi        27     36    186    249
dHSx        22     60    122    204
dNDx       143    260    778   1181
dNDxPB      74     95    361    530
xSUM      8726   4499  15588  28813
----------------------------------------
  orlog = gene has ortholog,
  inpar = gene is recent paralog of orthologs,
  uniq  = daphnia species unique gene, using ortho-daphplx,daphmag-only = uniq,
  u/d   = up/down DExpress, treatments combined by expt.
  groups: HSi = Helsinki INB clone, HSx = XIN clone,
          NDx = Notre Dame XIN, NDxPB = PB treatment only

TABLE G4. Daphnia magna Gene set sources, processing steps and versions
-------------------------------------------------------------------------------------
Stage1. Transcript assemblies of mRNA-seq with several de-novo assemblers and parameters,
        followed by EvidentialGene tr2aacds redundancy removal applied to each assembly set,
        non-redundant output.
Stage2. Locus/Alternate classification from clone assembly NR sets of stage 1 is done with
        several attributes, all with errors but of different types:
        transcript alignment classification (tr2aacds),
        genome-map location and consensus map loci,
        consensus protein homology and quality,
        cross-clone transcript consensus (MCL cluster of transcript alignment),
        and other qualities.
Stage3. Candidate Locus/Alternate gene set selection, with expert curation + computational
        reclassification, focus on hard cases (alt/paralog, unmap/mismap,etc)

A.
Input transcript assembly sets, 1st stage

  Input_Tr  NR_out  Name         Provenance
----------------------------------------
 3,751,425  140192  dmagset36m   McTaggart clones rna 2012May
                                 (Dapma6rm, daphmag3, dmag2vel, tag41 id patt)
16,454,489  256607  dmagset56tx  XIN clone assembly, 2014Jun-2013Aug of input
                                 (Dapma6tx, hsX, ndX, vel4x id patt)
 9,469,773  272398  dmagset56ri  INB clone set, 2014May
                                 (Dapma6ti,hsI,vel4i id patt) of input trasm
~1,000,000   64487  dmagset56ru  Low express assembly from XIN clone,
                                 1st pass unassembled reads 2014Jun
                                 (Dapma6rx, xun, nun id patt)
----------------------------------------
Transcripts are assembled on pair-end RNA-seq, 100bp or 50-70bp (36m)
with Velvet/Oases, SoapDenovoTR, Trinity,
with multi-kmer settings (k23 .. k95~readsize) for Velvet and Soap,
with and w/o digital normalized read sets, and other options/filters.
"ru" low express set built from reads not mapping to XIN clone dmagset56tx assembly.

B. Locus/Alternate classification Transcript source sets, 2nd stage

Input_Tr  Name              Provenance
----------------------------------------
   34530  dmagset1m8        Genome predicted 2010 (m8AUG id patt)
  140192  dmagset36m        McTaggart clones rna 2012May
                            (Dapma6rm, daphmag3, dmag2vel, tag41 id patt)
  256607  dmagset56tx       XIN clone assembly, 2014Jun-2013Aug of 16454489 input
                            (Dapma6tx, hsX, ndX, vel4x id patt)
  272398  dmagset56ri       INB clone set, 2014May
                            (Dapma6ti,hsI,vel4i id patt) of 9469773 input trasm
   64487  dmagset56ru       Low express assembly from XIN clone,
                            1st pass unassembled reads 2014Jun
                            (Dapma6rx, xun, nun id patt)
  120122  dmagset4pub1208   Stressflea rna 2012Aug, XIN clone mostly,
                            used to fill in missed loci
  182909  dmagset5xpub1401  Pre-release 2014Jan, used to fill in missed loci,
                            from 2013-2010 transcripts
ntr=1,071,245 following 1st round of redundant removal
  from ~30 million mRNA assemblies, from ~9 billion reads.
----------------------------------------

C.
Candidate Locus/Alternate sets, 3rd stage

Name      nLoci  Notes
----------------------------------------
pubset1   97140  evg7vose-tr2aacds, input tr=988788 of 4 separately assembled and
                 reduced RNA-seq sets (3-clones) and genome-predict set, no-omcl 04Jul2014.
                 Sets 4 (1208) and 5 (1401) were not pubset1 inputs.
pubset2   44762  no-omcl 24Jul2014 ; cross-clone consensus classification
                 (MCG loci/alts common across clone sets)
pubset3   28363  arp7aor1 30Jul2014
pubset4   27239  no-omcl 14Aug2014 ; intron-miss loci, paralog/alt reclass
pubset5   27218  no-omcl 19Aug2014 ; remove ~1,200 contaminant assemblies
                 (human,mouse,bacteria,..)
pubset6   26886  no-omcl 20Aug2014 ; intron-miss loci, paralog/alt reclass, v2

           nLoci  oGene   UDup  Orth1  OrDup  OrGrp  OMis1
pubset3    28363  17661   1835   9659   6167  11749     28  ; 30Jul2014, arp7bor1
pubset7    27775  17558   2196   9571   5791  11516     29  ; 21Aug2014, arp7bor2b
pubset8    28400  18327   2157   9184   6986  11541     31  ; 21Sep2014, arp7bor3b, various checks,
                                                              ~600 missed loci from analyses
pubset9a   29074  18722   2367   9164   7191  11570     31  ; 24Sep2014, arp7bor4
pubset9b   29127  18768   2383   9193   7192  11599     21  ; 30Sep2014, arp7bor5a, found 55 ortho-misses
daphplx    45212* 29900  11526   9955   8419  11788     37  ; daphplx comparison
----------------------------------------
pubset1 = evg7vose: mixture from 6 separately assembled RNA-seq sets (3-clones)
  + 1 genome-predict set
later versions from applying various classification, filtering methods
  using several gene analyses, attributes including
  consensus transcripts of 3 independent clone-sets,
  orthology/paralogy scores, rna-express/readmap scores,
  protein quality scores, genome-map quality scores,
* daphplx, daphnia pulex 2010 beta gene prediction set, includes ~10,000
  non-coding-like loci, and other unclassified gene models.
-------------------------------------------------------------------------------------

Notes:
=====================================================================

Almost all source RNA-seq (98.6%) is recovered in mRNA transcripts, and
almost all gene loci (99%) recovered uniquely mapped RNA-seq; 91% are
at/above one read/kilobase/MMr, for the ~7 billion mapped reads of this
data set. Alternate transcripts add 11% of mapped reads not found in
primary transcripts.

Differentially expressed (DE) genes include proportionally fewer orthologs,
and more species-unique and inparalog genes, than non-DE genes.

This gene set includes what is likely the first evidence of trans-spliced
genes in Daphnia; a few other crustaceans show it, but it generally has not
been seen much among arthropods (eg the well-known fruitfly mod(mdg4)
trans-spliced gene).

Trans-spliced genes (mRNA/protein and introns are on reverse strands), in
the 100s at least, are relatively abundant compared to other arthropods
where they have been found. Calling these precisely is difficult, as most
gene informatics tools and gene predictors do not handle trans-splicing.
There are a few hundred obvious cases where protein and introns are clearly
in reversed directions, and 1000+ ambiguous cases where bi-directional
expression appears. There may be an association between tandem duplicate
clusters and trans-splicing (trans-spliced regions may be more likely to
contain duplicate clusters).

Overall, a few thousand loci/regions have bi-directional transcription rates
from 0.1% to ~50%, by intron counts. At the low end (< 1%) this may include
errors as well as weak effects and non-coding extras.
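
The per-group read recovery in Table M1, and the share of mapped reads contributed by alternate transcripts, can be rechecked from the table's own counts. The following is a minimal sketch: the counts are copied from Table M1, while the names M1_COUNTS, pct_mapped and alt_gain are hypothetical helpers introduced here for illustration.

```python
# Counts copied from Table M1: (read group, mRNA set) -> (nMapread, nTotRead).
M1_COUNTS = {
    ("hscItotal", "finall8"): (2736214261, 2789627581),
    ("hscXtotal", "finall8"): (3172301416, 3233374500),
    ("ndcXtotal", "finall8"): (857334019, 885996197),
    ("hscItotal", "finloc8"): (2429376789, 2789627581),
    ("hscXtotal", "finloc8"): (2814739850, 3233374500),
    ("ndcXtotal", "finloc8"): (791867853, 885996197),
}


def pct_mapped(group, mrnaset):
    """Percent of a read group's reads that map to the given mRNA set."""
    mapped, total = M1_COUNTS[(group, mrnaset)]
    return 100.0 * mapped / total


def alt_gain(group):
    """Percentage points added by alternates: finall8 rate minus finloc8 rate."""
    return pct_mapped(group, "finall8") - pct_mapped(group, "finloc8")


if __name__ == "__main__":
    for group in ("hscItotal", "hscXtotal", "ndcXtotal"):
        print(f"{group}: all={pct_mapped(group, 'finall8'):.1f}% "
              f"primary={pct_mapped(group, 'finloc8'):.1f}% "
              f"alt gain={alt_gain(group):.1f} points")
```

For the hscItotal group this gives a gain of about 11 percentage points from alternate transcripts, in line with the ~11% stated under Table M1.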