Cacao assemblies vs EST-intron differences

Total intron size accounts for some of the Mars <> Cirad assembly size
difference. However, this appears to be an assembly quality effect rather
than a strain biology effect: when only perfectly mapped ESTs are used, no
intron difference is found, but fewer ESTs map perfectly to the Cirad assembly.

2012 March update

I mapped the long ESTs (ccn51, tsh1188) to both the Mars and Cirad assemblies.
Their introns agree with the last email: there is no structural difference in
intron size between strains. The larger intron span found in the Mars assembly
is due to a larger number of *long* introns from ESTs that map well to the
assemblies.

Paired intron differences for the same EST mapping well to both assemblies
(+ = Mars, - = Cirad):

  intron97.ccn51    n=27859;  aved=-2 bp;  sumd=-82100 bp
  intron97.tsh1188  n=26694;  aved= 1 bp;  sumd= 33368 bp

Genome total intron spans

The difference of 6.8 Mb more Mars intron span is due to EST mapping quality,
since the paired comparison above shows no difference. A small number of long
introns account for most of the genome total difference: 4.2 Mb more Mars in
6100 introns over 1000 bp, versus 0.7 Mb more Mars in 37600 introns under 500 bp.
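The size-class breakdown above (long introns over 1000 bp versus short introns under 500 bp) could be tallied along these lines. This is only a sketch, not the script used for these notes: the function names, the middle 500 to 1000 bp class, and the toy intron widths are all hypothetical.

```python
from collections import defaultdict

def size_class(width):
    """Assign an intron width (bp) to a coarse class.

    The 500 and 1000 bp cutoffs follow the split quoted in the notes;
    the handling of the exact boundaries is an assumption.
    """
    if width < 500:
        return "under500"
    if width <= 1000:
        return "500to1000"
    return "over1000"

def span_by_class(widths):
    """Return {class: (intron count, summed width)} for a list of widths."""
    totals = defaultdict(lambda: [0, 0])
    for w in widths:
        cls = size_class(w)
        totals[cls][0] += 1
        totals[cls][1] += w
    return {cls: tuple(v) for cls, v in totals.items()}

# Hypothetical intron widths, NOT taken from the cacao data.
mars_widths = [120, 300, 450, 800, 2500, 6000]
cirad_widths = [120, 310, 440, 800, 2400]

mars = span_by_class(mars_widths)
cirad = span_by_class(cirad_widths)
for cls in ("under500", "500to1000", "over1000"):
    m_n, m_sw = mars.get(cls, (0, 0))
    c_n, c_sw = cirad.get(cls, (0, 0))
    print(f"{cls}: nInt diff={m_n - c_n}; swInt diff={m_sw - c_sw} bp")
```

In the toy data a single extra long intron dominates the total span difference, which is the same pattern the notes describe for the real assemblies.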
EST introns with >= 97% mapping identity, mapped to mars11 and cirad1 assemblies

EST ccn51
Chr    nInt  awInt     swInt  Group           Chr    nInt  awInt     swInt  Group
Ma01   8522    575   4904574  ccn51.mars11    Tc01   7932    529   4197514  ccn51.cirad1
Ma02   6923    543   3761026  ccn51.mars11    Tc02   5942    557   3312254  ccn51.cirad1
Ma03   6375    628   4009375  ccn51.mars11    Tc03   5975    565   3378971  ccn51.cirad1
Ma04   5940    617   3667415  ccn51.mars11    Tc04   5418    544   2951077  ccn51.cirad1
Ma05   6292    711   4479748  ccn51.mars11    Tc05   5100    563   2872030  ccn51.cirad1
Ma06   4738    622   2951219  ccn51.mars11    Tc06   4022    530   2134002  ccn51.cirad1
Ma07   2649   1185   3140494  ccn51.mars11    Tc07   2068   1225   2535089  ccn51.cirad1
Ma08   4056    669   2717144  ccn51.mars11    Tc08   3464    587   2034650  ccn51.cirad1
Ma09   7873    606   4774073  ccn51.mars11    Tc09   7393    559   4134156  ccn51.cirad1
tot   53368    644  34405068  ccn51.mars11    tot   47314    582  27549743  ccn51.cirad1

EST tsh1188
Chr    nInt  awInt     swInt  Group           Chr    nInt  awInt     swInt  Group
Ma01   8025    576   4624225  tsh1188.mars11  Tc01   7456    509   3801863  tsh1188.cirad1
Ma02   6418    519   3335813  tsh1188.mars11  Tc02   5477    516   2828239  tsh1188.cirad1
Ma03   5996    612   3672895  tsh1188.mars11  Tc03   5630    547   3083773  tsh1188.cirad1
Ma04   5583    596   3332346  tsh1188.mars11  Tc04   5136    615   3161810  tsh1188.cirad1
Ma05   5954    746   4444725  tsh1188.mars11  Tc05   4837    665   3217971  tsh1188.cirad1
Ma06   4440    674   2996292  tsh1188.mars11  Tc06   3785    547   2072343  tsh1188.cirad1
Ma07   2466    985   2431164  tsh1188.mars11  Tc07   1971    948   1868595  tsh1188.cirad1
Ma08   3894    575   2239643  tsh1188.mars11  Tc08   3330    538   1792374  tsh1188.cirad1
Ma09   7370    619   4565661  tsh1188.mars11  Tc09   6903    549   3795466  tsh1188.cirad1
tot   50146    631  31642764  tsh1188.mars11  tot   44525    575  25622434  tsh1188.cirad1

Ma = mars assembly, Tc = cirad assembly
nInt = number of introns; awInt = average width; swInt = sum of widths
Note: the skewed width distribution of introns makes average width not that useful.

====================

For all EST introns, Mars 1.1 assembly has an excess of 3.5 Megabases of
intron span versus Cirad1 assembly.
See wDiff for Matot (Mars) vs Tctot (Cirad).

>> Newer EST mapping, 5 Nov + 4 Dec 2011.
.. but this does or doesn't account for mapping quality difference.

Cirad/Mars assembly for Introns from EST reads mapped to assemblies, 2011 Dec
: is this right?

==> intron/intron.note.diff1e <==
InID  Asm1    nInt  avwIn      wInt  Asm2    nInt  avwIn      wInt    wDiff
All   Ma01   10178    394   4010322  Tc01    9582    383   3672079   338243
All   Ma02    7995    380   3039573  Tc02    6862    382   2627209   412364
All   Ma03    7522    395   2974822  Tc03    7133    379   2704183   270639
All   Ma04    6850    420   2883213  Tc04    6321    407   2574275   308938
All   Ma05    7144    425   3037745  Tc05    5919    395   2341024   696721
All   Ma06    5365    412   2212303  Tc06    4721    389   1840989   371314
All   Ma07    2796    430   1203298  Tc07    2242    430    965939   237359
All   Ma08    4806    388   1868293  Tc08    4144    376   1561988   306305
All   Ma09    9261    394   3655112  Tc09    8747    382   3342589   312523
All   Matot  65232    404  26359017  Tctot  58399    391  22843411  3515606

Date: Mon, 22 Aug 2011 22:12:36 -0500 (EST)
From: Don Gilbert
Message-Id: <201108230312.p7N3Caq16480@cricket.bio.indiana.edu>
To: gilbertd@cricket.bio.indiana.edu, kmockait@indiana.edu
Cc: mnrusimh@indiana.edu
Subject: Re: EST mapping to mars11 and cirad1 genomes

Get even less excited.. I forgot to adjust for transcript mapping quality,
which is lower for these ESTs going to cirad genome. That accounts for
missing intron span. Using only the subset of EST assemblies that map to
both at >= 99%, the difference is minimal. I.e. the gene structures have not
changed, but your ESTs don't match quite as well to cirad genome.
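The adjustment described in the email above (restricting to ESTs that map to both assemblies at >= 99% identity before comparing intron spans) might look roughly like this. It is a minimal sketch under stated assumptions, not the original analysis: the record layout (percent identity, intron span per EST), the function names, and the toy values are all hypothetical.

```python
def shared_high_identity(mars_hits, cirad_hits, min_ident=99.0):
    """Ids of ESTs mapped to both assemblies at or above min_ident percent."""
    return sorted(
        est
        for est in set(mars_hits) & set(cirad_hits)
        if mars_hits[est][0] >= min_ident and cirad_hits[est][0] >= min_ident
    )

def total_span(hits, ids=None):
    """Sum intron span (bp) over the given EST ids, or over all hits."""
    ids = hits if ids is None else ids
    return sum(hits[est][1] for est in ids)

# Hypothetical records: EST id -> (percent identity, intron span in bp).
mars_hits = {"estA": (99.8, 1500), "estB": (99.5, 4000), "estC": (99.9, 700)}
cirad_hits = {"estA": (99.6, 1500), "estB": (96.0, 1200), "estC": (99.2, 705)}

# Unfiltered comparison: every mapped EST counts, whatever its quality.
all_diff = total_span(mars_hits) - total_span(cirad_hits)

# Filtered comparison: only ESTs mapping well to both assemblies.
both = shared_high_identity(mars_hits, cirad_hits)
sub_diff = total_span(mars_hits, both) - total_span(cirad_hits, both)

print(f"all ESTs:   Mars - Cirad = {all_diff} bp")
print(f"both >=99%: n={len(both)}; Mars - Cirad = {sub_diff} bp")
```

In the toy data the poorly mapped EST (estB) produces nearly all of the unfiltered difference, and the difference almost vanishes once it is excluded.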
>> Older EST mapping, Aug 2011

cgbAssembly.bean.intron.diffn    : n=8197; an=0; aw=-10; sn=18;  sw=-86368;
cgbAssembly.leaf.intron.diffn    : n=7739; an=0; aw=-44; sn=-52; sw=-342334;
cgbAssembly.pistil1.intron.diffn : n=4783; an=0; aw=-8;  sn=-20; sw=-39902;
cgbAssembly.pistil2.intron.diffn : n=6844; an=0; aw=19;  sn=0;   sw=136576;

an=ave count diff; aw=ave width diff
sn=sum count diff; sw=sum width diff

-------------------
Date: Tue, 23 Aug 2011 16:20:15 -0500 (EST)
From: Don Gilbert
To: gilbertd@cricket.bio.indiana.edu, kmockait@indiana.edu
Cc: mnrusimh@indiana.edu
Subject: Re: M-C files

Keithanne,

Find here an update with nonaligned spans for mars x cirad assemblies,
  http://server7.eugenes.org:8091/cacao/genes10/genome/align/
    mars_cacao10asm.nonalign.gff.gz
    cirad_cacao1asm.nonalign.gff.gz

These are inverse of aligned.gff (the gap portions) but removing spans
with NNN gaps. Summarized as base counts here:
  http://server7.eugenes.org:8091/cacao/genes10/genome/assembly_align_summary.txt
for the total genome sizes, aligned, non-aligned (with and without gaps).

Non-aligned span bases, all spans with but excluding NNN from base count
  mars10    22678238   23 mb non-aligned
  cirad1c    8088740    8 mb non-aligned
                      +15 mb mars

Non-aligned mars contents
  1858  consensus1 genes in spans of >= 1,000 bp
  2945  consensus1 exons
  675 common introns, 4197 rare introns, from expressed reads
  1700+ EST exons (1704 bean, 1418 leaf, 1166 pistil1, 1593 pistil2
        from cgb EST assembly)

Regarding that bit yesterday about EST alignment and intron span differences,
this is another indication of the reason the mars assembly is larger, that is
there are more transcripts that fully align to it, along with their introns.
If that difference is due to higher quality in this assembly, that is useful
information .. e.g. the few thousand genes or EST transcripts that fall into
non-aligned regions can account for genome size difference, along with
introns that go with them.
- Don
--------------------------------------------------------------
# out of date
# Date: Tue, 23 Aug 2011 16:41:59 -0500 (EST)
# From: Don Gilbert
# Message-Id: <201108232141.p7NLfxl20407@cricket.bio.indiana.edu>
# To: gilbertd@cricket.bio.indiana.edu, kmockait@indiana.edu
# Cc: mnrusimh@indiana.edu
# Subject: Re: M-C files
#
# These are updated transcript mapping counts (after I made correct cirad transcripts),
# using gmap.
#
# gene transcript gmap counts
#
# cacao9_consensus1.tr:       34997   34991 in mars11   34945 in cirad1
# cacao9_epir7.newgenes.tr:    3458    3458 in mars11     972 in cirad1
# cirad1cacao_genes.tr:       46140   46140 in cirad1   46083 in mars11
#
# Also for the EST assemblies, though more map to mars11, some map
# only to cirad1 assembly: 1290 Mars only, 832 Cirad only.