Expression assembly and annotation for Matina/Mars cacao genome.

Tables and figures referenced below are in sub-folders at
  http://server7.eugenes.org:8091/cacao/genes10/

Methods X.

ESTs
  EST 454 long reads and transcript assemblies are mapped to the genome assembly with
  GMAP, versions 2011-08-30 through 2011-10-16, using the npaths option.

RNA-Seq
  RNA-Seq short, paired-end reads are mapped to the genome assembly with GSNAP software,
  versions 2011-08-30 through 2011-10-16, using the novelsplicing option.

EST/RNA transcript assemblies
  Transcript assembly methods:
    Cufflinks, version 0.8x (2010), genome-mapped short reads
    Cufflinks, version 1.0.3 (2011), genome-mapped short reads
    Newbler, EST/454 Roche de-novo long read assembler
    PASA, version 2011, genome-mapped, pre-assembled short reads plus long reads
    Velvet/Oases, version 1.2, de-novo short + long read assembler, multiple kmer options (19..29)

  Post-processing:
    add protein, CDS with best ORF finding method (evigene script)
    check/correct with valid introns
    select best transcripts/locus using valid intron and protein homology scores,
      CDS and transcript sizes (evigene overbestgenes script)
    select valid alternate transcripts (evigene altbest script) with criteria
      - equivalent exons to most of primary gene model (66% CDS equivalence)
      - distinct exon/intron splice pattern (distinct intron chain)

Introns from EST/RNA reads
  Short and long reads are mapped using the spliced-alignment options of GMAP/GSNAP.
  Splice site locations that meet a low-error criterion are extracted from the mapping
  results and tabulated by the number of reads supporting each intron span, within and
  over all read groups. A table is created of valid introns with a minimum of 5 read
  splices, or 10 reads for long introns > 20,000 bp.

--------------------------------------------------------------------------------------------------
Table X1.
EST read and assembly mapping to genome assemblies

Group                     Nin      Nmap  %mapped  %cover  %ident  match  mismat  indel  indel/bp
--------------------------------------------------------------------------------------------------
T. cacao Matina v1.1 asm x EST assemblies
cgba.bean               25501     24274    95.19   98.13   98.72   1174    4.84   2.90    0.0025
cgba.leaf               33237     24013    72.25   99.12   99.25   1111    1.41   1.58    0.0014
cgba.pistil1            20507     19200    93.63   95.21   98.65    745    2.84   2.41    0.0032
cgba.pistil2            25415     24100    94.83   93.39   98.35   1019    4.73   3.56    0.0035
cgba100.CCN51           25964     25911    99.80   92.73   98.90   1400    4.60   2.75    0.0020
cgba100.TSH1188         26528     25256    95.21   93.75   98.78   1360    4.89   2.48    0.0018
total_assemblies       157152    142754    90.84   95.35   98.78   1153    3.94   2.62    0.0023

T. cacao Matina v1.1 asm x EST reads
cacao_dbEST            157996    146776    92.90   95.07   97.79    445    6.10   0.69    0.0016
reads.cacao_bean      3363703   3317671    98.63   99.06   98.68    370    1.23   1.81    0.0049
reads.cacao_leaf      1515418   1504017    99.25   99.11   98.62    379    0.75   2.29    0.0060
reads.cacao_pistil    2436789   2150379    88.25   96.64   98.44    284    1.26   1.82    0.0064
reads.CCN51           1834700   1817952    99.09   97.01   98.43    616    2.58   4.24    0.0069
reads.TSH1188         1841549   1803089    97.91   97.77   98.48    615    2.44   3.93    0.0064
total_reads          10992159  10593108    96.37   98.01   98.54    438    1.60   2.66    0.0061
--------------------------------------------------------------------------------------------------

Group                     Nin      Nmap  %mapped  %cover  %ident  match  mismat  indel  indel/bp
--------------------------------------------------------------------------------------------------
T. cacao Criollo v1 asm x EST assemblies
cgba.bean               25501     24067    94.38   97.39   98.62   1154    6.45   2.29    0.0020
cgba.leaf               33237     23895    71.89   97.99   98.65   1092    6.33   2.38    0.0022
cgba.pistil1            20507     18910    92.21   94.83   98.49    744    4.34   2.21    0.0030
cgba.pistil2            25415     23581    92.78   93.30   98.39   1019    6.24   3.14    0.0031
cgba100.CCN51           25964     25795    99.35   91.92   98.72   1377    6.43   3.06    0.0022
cgba100.TSH1188         26528     25033    94.36   93.26   98.68   1351    6.95   2.87    0.0021
total_assemblies       157152    141281    89.90   94.74   98.60   1142    6.20   2.68    0.0023

T. cacao Criollo v1 asm x EST reads
cacao_dbEST            157996    138298    87.53   93.64   97.32    444    7.57   0.74    0.0017
reads.bean            3363703   3292775    97.89   98.61   98.51    368    1.82   1.83    0.0050
reads.leaf            1515418   1477784    97.52   98.45   98.19    376    2.24   2.40    0.0064
reads.pistil          2436789   2076878    85.23   96.31   98.17    285    1.92   1.85    0.0065
reads.CCN51           1834700   1808600    98.58   96.57   98.28    613    3.34   4.30    0.0070
reads.TSH1188         1841549   1794036    97.42   97.35   98.30    612    3.42   4.05    0.0066
total_reads          10992159  10450073    95.07   97.56   98.32    437    2.43   2.72    0.0062
--------------------------------------------------------------------------------------------------
  cgba = Newbler 454 Isotig assemblies; reads = 454 EST reads;
  cacao_dbEST = NCBI dbEST entries for T. cacao, mapped to genome assembly with GMAP v2011-08 software.
  Nin = number of input sequences; Nmap = number with any mapping; %mapped = percent mapped;
  %cover = percent coverage of EST span; %ident = percent identity;
  match = average aligned bases; mismat = average mismatched bases;
  indel = average indels; indel/bp = indels / matched bases

Table X2. RNA-seq samples mapped to Matina v1.1 assembly and used in transcript assembly.
Counts of reads in total, those mapped and those properly paired, of paired-end RNA-seq.
IUCGB reads are 106 bases, NCGR reads are 54 bases.

Source        Samples            Total  Mapped                Properly paired
---------------------------------------------------------------------------------------
iucgb_2x106b  Pistil         451150078   414731867 (91.9%)     401416513 (89.0%)
ncgr_2x54b    Leaf.Matina     53722751    51010673 (95.0%)      47498698 (88.4%)
ncgr_2x54b    Leaf.mix14     251947835   232145736 (92.1%)     221143451 (87.8%)
ncgr_2x54b    Pistil         472514442   384192153 (81.3%)     340387973 (72.0%)
total         mixed         1229335106  1082080429 (88.0%)    1010446635 (82.2%)
---------------------------------------------------------------------------------------

Table X3. EST/RNA assemblies obtained, with coding/transcript sizes

              EST-Newbler                ESTRNA-Velvet              RNA-Cufflinks
                   N   CDS  Exon %C/X         N   CDS  Exon %C/X         N   CDS  Exon %C/X
              ------------------------   ------------------------   ------------------------
NonoverALL     58207   636  1035   61    152279   292   591   49     18447   652   971   67
NonoverTOP     25000  1109  1456   76     25000  1131  1859   60
WithAltALL     96702   638  1022   62    205854   363   726   50     21513   685  1011   67
WithAltTOP     25000  1284  1646   78     25000  1628  2482   65
--------------------------------------------------------------------------------------
  Nonover = assemblies with non-overlapping locations; WithAlt = assemblies with
  alternate transcripts; ALL = all assemblies; TOP = largest 25,000.
  N = number of assemblies; CDS, Exon = coding and exon (transcript) sizes;
  %C/X = CDS size as percent of exon size.

Table X4. Count of best transcript assemblies used for gene modelling, for primary locus
and valid alternates, by assembly method. Best transcripts were selected using scores for
CDS and exon sizes, C/X ratio, and protein homology.

                  Primary gene         Alternates
Assembly          input   final set    final set
---------------------------------------------
EST-Newbler        8950      1600         1729
RNA-Cufflinks     10008      1300         2598
ESTRNA-Velvet     29446      1600         8465
PASA                  0         0         2222
total             48404      4500        15014
---------------------------------------------
  Transcript assembly software (see Methods X): EST-Newbler = Newbler 454 EST assembler;
  RNA-Cufflinks = Cufflinks, v0.8 and v1.0.3; ESTRNA-Velvet = Velvet/Oases;
  PASA = PASA assembly of EST/RNA assemblies.
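
The valid-intron rule stated in Methods X (a minimum of 5 read splices, or 10 reads for long introns > 20,000 bp) can be illustrated with a short Python sketch. This is not the evigene code: the record layout, the names Intron, is_valid and valid_introns, and the width convention (end - start + 1 on inclusive coordinates) are assumptions made for illustration only. The read count is assumed to be already summed over all read groups.

```python
from typing import NamedTuple

# Thresholds taken from Methods X; the names are hypothetical.
LONG_INTRON_BP = 20_000   # introns wider than this count as "long"
MIN_READS_SHORT = 5       # minimum read splices for ordinary introns
MIN_READS_LONG = 10       # minimum read splices for long introns


class Intron(NamedTuple):
    """Hypothetical record for one intron span tabulated from spliced reads."""
    chrom: str
    start: int    # assumed 1-based, inclusive
    end: int      # assumed inclusive
    strand: str
    reads: int    # spliced reads supporting this span, over all read groups


def is_valid(intron: Intron) -> bool:
    """Apply the read-support threshold that matches the intron width."""
    width = intron.end - intron.start + 1
    needed = MIN_READS_LONG if width > LONG_INTRON_BP else MIN_READS_SHORT
    return intron.reads >= needed


def valid_introns(introns):
    """Return only the introns that pass the support threshold."""
    return [i for i in introns if is_valid(i)]


sample = [
    Intron("Ma01", 1000, 1499, "+", 5),     # 500 bp, 5 reads: kept
    Intron("Ma01", 2000, 2499, "+", 4),     # 500 bp, 4 reads: dropped
    Intron("Ma02", 10000, 35000, "-", 9),   # 25,001 bp, 9 reads: dropped
    Intron("Ma02", 40000, 65000, "-", 10),  # 25,001 bp, 10 reads: kept
]
kept = valid_introns(sample)
```

In this sketch an intron of exactly 20,000 bp is treated as short, because the rule in Methods X is written as "> 20,000 bp".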

======== END of good parts ===========================================================================

## Table Xn. Count of best transcript assemblies: leave out submethods parts
#   submethods       primary  alternates
#   -----------------------------------
#   cuff08              7772      2598
#   cuff10              2236         0
#   nblr.bean           3367       675
#   nblr.leaf           2278       530
#   nblr.pistil1        1527       174
#   nblr.pistil2        1778       350
#   velvet1             8423      2528
#   velvet3             9720      1685
#   velvet4             6966      2083
#   velvet5             4337      2169
#   PASA                   0      2222
#   total              48404     15014
#   -----------------------------------
#
# ** Reduce this to a simpler table? Don't need each group listed; sum over groups with the same source.

# Longer Table Xn. RNA-seq samples mapped to genome and used in transcript assembly.
#
#   Group         Total      Mapped                Properly paired
#   --------------------------------------------------------------------------
#   cgb_ca1       22294788   21094956 (94.62%)     20831145 (93.44%)
#   cgb_ca2       19233030   17939649 (93.28%)     17613376 (91.58%)
#   cgb_ca3.1     62431332   59615511 (95.49%)     57340794 (91.85%)
#   cgb_ca3.2     62433126   59615740 (95.49%)     57341147 (91.84%)
#   cgb_ca3.3     62433127   59618717 (95.49%)     57346022 (91.85%)
#   cgb_ca3.4     62432124   59617408 (95.49%)     57344347 (91.85%)
#   cgb_ca4       19291972   18562886 (96.22%)     18304898 (94.88%)
#   cgb_ca5       28151289   23872663 (84.80%)     22931093 (81.46%)
#   cgb_ca6       15815992   13870888 (87.70%)     13435167 (84.95%)
#   cgb_ca7       18977144   16823538 (88.65%)     16421487 (86.53%)
#   cgb_ca8       25305730   20632835 (81.53%)     20165104 (79.69%)
#   cgb_ca9       31597305   25661767 (81.22%)     24866302 (78.70%)
#   cgb_ca10      20753119   17805309 (85.80%)     17475631 (84.21%)
#   --------------------------------------------------------------------------
#
#   NcgrID        Group            Total      Mapped                Properly paired
#   ---------------------------------------------------------------------------------------
#   ncgr090511_1  Matina_1         20480787   19443480 (94.94%)     18054708 (88.15%)
#   ncgr090511_2  Matina_2         33241964   31567193 (94.96%)     29443990 (88.57%)
#
#   ncgr090609_1  1_SCA6           16584768   13630752 (82.19%)     13438732 (81.03%)
#   ncgr090609_2  2_U48            18103547   17086240 (94.38%)     16555932 (91.45%)
#   ncgr090609_3  3_GU255_P        23194475   21545664 (92.89%)     20662251 (89.08%)
#   ncgr090609_4  4_IMC51          18411596   17248544 (93.68%)     16335132 (88.72%)
#   ncgr090609_6  5_AMAZ_15_15     20575318   19797741 (96.22%)     18037452 (87.67%)
#   ncgr090714_2  COC3335          14705946   13805680 (93.88%)     13202298 (89.78%)
#   ncgr090714_3  NAP_30           14127879   13033139 (92.25%)     12692099 (89.84%)
#   ncgr090714_4  CRIOLLO_13       14105999   13323499 (94.45%)     12922516 (91.61%)
#   ncgr090714_6  PA_120_B         17519951   16001391 (91.33%)     15534374 (88.67%)
#   ncgr090714_7  PA_150           16640689   15726383 (94.51%)     15097710 (90.73%)
#   ncgr090714_8  U26              16085479   15282384 (95.01%)     14915572 (92.73%)
#   ncgr090728_7  Pound_5C_a       22356061   19876486 (88.91%)     18511771 (82.80%)
#   ncgr090728_8  Pound_7_2        24912934   22593605 (90.69%)     20714117 (83.15%)
#   ncgr091005_6  EBC_148_leaf_1   14623193   13194228 (90.23%)     12523495 (85.64%)
#   ---------------------------------------------------------------------------------------
#
#   NcgrID        Group            Total      Mapped                Properly paired
#   ---------------------------------------------------------------------------------------
#   ncgr090922_7  Pistil_1         27957870   22259567 (79.62%)     20298560 (72.60%)
#   ncgr090922_8  Pistil_2         28006092   22927875 (81.87%)     20287548 (72.44%)
#   ncgr090929_1  Pistil_3         19811594   14418304 (72.78%)     -- na --
#   ncgr090929_2  Pistil_4         43813792   38015761 (86.77%)     33371148 (76.17%)
#   ncgr090929_3  Pistil_5         49740617   43958973 (88.38%)     38398564 (77.20%)
#   ncgr090929_4  Pistil_6         42900783   33345213 (77.73%)     28885632 (67.33%)
#   ncgr090929_6  Pistil_7         41215621   35240433 (85.50%)     31054088 (75.35%)
#   ncgr090929_7  Pistil_8         45415987   39539183 (87.06%)     34924038 (76.90%)
#   ncgr090929_8  Pistil_9         36318759   28175197 (77.58%)     25751718 (70.90%)
#   ncgr091005_1  Pistil_10        27697734   19632563 (70.88%)     17686200 (63.85%)
#   ncgr091005_2  Pistil_11        38647049   31940616 (82.65%)     27558826 (71.31%)
#   ncgr091005_3  Pistil_12        35425040   24823640 (70.07%)     21415486 (60.45%)
#   ncgr091005_4  Pistil_13        35563504   29914828 (84.12%)     25699354 (72.26%)
#   ---------------------------------------------------------------------------------------

# Revise these from cacao3d/docs/cacao3asm-estintron2011.info
------------------------------------------
Table Xn. Intron statistics from spliced reads.
  short intron locations n=160327
  long intron locations  n=1477 ; size distribution: mode: median: mean:
      20k-29k:367 30k-39k:193 ... >100k:75
--------------------------------------------------------------
# see now Table Gn. Gene structure statistics for Cacao
# full gene struct table? exon, cds, intron sizes, numbers/gene,

Paired intron differences for the same EST mapping well to both assemblies (+ = Mars, - = Cirad)
  intron97.ccn51    n=27859; aved=-2 bp; sumd=-82100 bp
  intron97.tsh1188  n=26694; aved= 1 bp; sumd= 33368 bp

Genome total intron spans
  The difference of 6.8 Mb more Mars intron span is due to EST mapping quality, since the
  paired comparison above shows no difference for ESTs that map well to both assemblies.
  A small number of long introns account for most of the genome total difference:
  4.2 Mb more Mars in 6100 introns over 1000 bp, versus 0.7 Mb more Mars in 37600 introns
  under 500 bp.

EST introns with >= 97% mapping identity, mapped to mars11 and cirad1 assemblies

EST ccn51
Chr      nInt   awInt      swInt  Group             Chr      nInt   awInt      swInt  Group
Ma01     8522     575    4904574  ccn51.mars11      Tc01     7932     529    4197514  ccn51.cirad1
Ma02     6923     543    3761026  ccn51.mars11      Tc02     5942     557    3312254  ccn51.cirad1
Ma03     6375     628    4009375  ccn51.mars11      Tc03     5975     565    3378971  ccn51.cirad1
Ma04     5940     617    3667415  ccn51.mars11      Tc04     5418     544    2951077  ccn51.cirad1
Ma05     6292     711    4479748  ccn51.mars11      Tc05     5100     563    2872030  ccn51.cirad1
Ma06     4738     622    2951219  ccn51.mars11      Tc06     4022     530    2134002  ccn51.cirad1
Ma07     2649    1185    3140494  ccn51.mars11      Tc07     2068    1225    2535089  ccn51.cirad1
Ma08     4056     669    2717144  ccn51.mars11      Tc08     3464     587    2034650  ccn51.cirad1
Ma09     7873     606    4774073  ccn51.mars11      Tc09     7393     559    4134156  ccn51.cirad1
tot     53368     644   34405068  ccn51.mars11      tot     47314     582   27549743  ccn51.cirad1

EST tsh1188
Chr      nInt   awInt      swInt  Group             Chr      nInt   awInt      swInt  Group
Ma01     8025     576    4624225  tsh1188.mars11    Tc01     7456     509    3801863  tsh1188.cirad1
Ma02     6418     519    3335813  tsh1188.mars11    Tc02     5477     516    2828239  tsh1188.cirad1
Ma03     5996     612    3672895  tsh1188.mars11    Tc03     5630     547    3083773  tsh1188.cirad1
Ma04     5583     596    3332346  tsh1188.mars11    Tc04     5136     615    3161810  tsh1188.cirad1
Ma05     5954     746    4444725  tsh1188.mars11    Tc05     4837     665    3217971  tsh1188.cirad1
Ma06     4440     674    2996292  tsh1188.mars11    Tc06     3785     547    2072343  tsh1188.cirad1
Ma07     2466     985    2431164  tsh1188.mars11    Tc07     1971     948    1868595  tsh1188.cirad1
Ma08     3894     575    2239643  tsh1188.mars11    Tc08     3330     538    1792374  tsh1188.cirad1
Ma09     7370     619    4565661  tsh1188.mars11    Tc09     6903     549    3795466  tsh1188.cirad1
tot     50146     631   31642764  tsh1188.mars11    tot     44525     575   25622434  tsh1188.cirad1

  Ma = mars assembly, Tc = cirad assembly;
  nInt = number of introns; awInt = average width; swInt = sum of widths.
  Note: the skewed width distribution of introns makes average width not that useful.
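
The genome-total comparison can be re-derived from the "tot" rows of the table above. The Python sketch below copies only those four totals; the names totals and span_summary are hypothetical. It recomputes the average width as the integer part of swInt / nInt, which matches the awInt values shown in the "tot" rows, and takes the Mars minus Cirad difference in summed intron width (about 6.86 Mb for the ccn51 ESTs and about 6.02 Mb for tsh1188).

```python
# (nInt, swInt) copied from the "tot" rows of the table above.
totals = {
    "ccn51": {"mars11": (53368, 34405068), "cirad1": (47314, 27549743)},
    "tsh1188": {"mars11": (50146, 31642764), "cirad1": (44525, 25622434)},
}


def span_summary(est: str) -> dict:
    """Compare total intron counts and spans between the two assemblies."""
    n_mars, sw_mars = totals[est]["mars11"]
    n_cirad, sw_cirad = totals[est]["cirad1"]
    return {
        # Integer division reproduces the awInt column of the "tot" rows.
        "aw_mars": sw_mars // n_mars,
        "aw_cirad": sw_cirad // n_cirad,
        "extra_mars_introns": n_mars - n_cirad,
        "extra_mars_bp": sw_mars - sw_cirad,
    }


for est in totals:
    s = span_summary(est)
    print(est, s["extra_mars_introns"], "introns,",
          round(s["extra_mars_bp"] / 1e6, 2), "Mb more in Mars")
```

Only the totals are used here, so the sketch says nothing about which size classes of introns carry the difference; that breakdown needs the per-intron widths.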