Non-align spans of Matina/Mars vs Criollo/Cirad cacao genome assemblies. final in cacao3d/docs/cacao11genes_cir1nonalign.txt * add methods paragraph to each summary table Methods M. 1. Cacao genome assembly alignments using mugsy/Mummer/Nucmer align of 2 cacao genome assemblies. 2. Gene content of two assemblies, align/nonalign spans from 1. adding transcript dna alignments (gene models, est assemblies) Gmap align of tr dna to assembly dna. Count no align, and poor align at criterion of % identically aligned bases >= 95%? of transcript. 3. Attributes of aligned/nonaligned genes, using gene annotations of parolog, ortholog, alternate transcripts, EST coverage. 4. Gene ontology groups of aligned/nonaligned, using best TAIR v10 protein homology, and TAIR GO annotation table (2011) for GO groups. ATH_GO_GOSLIM.txt 2011/08 ------------------------------------------------------------- Table M1. Assembly sizes ------------------------------------------------------------- Genome sizes not-Gap Gap mars11 330821837, 331 mb 15171838, 15 mb cirad1 291426886, 291 mb 35926235, 36 mb +40 mb mars -21 mb mars Aligned span bases mars11 308143599, 308 mb n spans=33057 cirad1 283338146, 283 mb n spans=31394 +25 mb mars, cirad aligns 1.1 times to mars Non-aligned span bases, excluding gaps mars11 22678238, 23 mb non-aligned cirad1 8088740, 8 mb non-aligned +15 mb mars ------------------------------------------------------------- mars11 = Matina/Mars assembly v1.1, scaffold_1..10 and 700 unlocated scaffolds, chloroplast, mitochondria, rRNA cirad1 = Criollo/Cirad published assembly, Tc01..Tc10 + Tc00 Assembly alignment performed with Mummer/Nucmer. Assembly size counts (cacao11allasm,cirad_cacao1_chrs.fa.count) are at http://server7.eugenes.org:8091/cacao/genes10/genome/ # **? add here Table Mn. EST, Intron content differences # see cacao3d/docs/cacao3asm-estintron2011.info # -- add intron sizes (more in matina), paired intron sizes (no diff) and intron counts, # est aligned bases Table M2. Gene content differences ================================== Cacao genes in Matina/Mars non-align spans (vs Criollo/Cirad assembly) and/or lacking transcript alignment to Cirad genes. M2.1 Gene alignment differences ------------------------------------------------------------- Genes in non-aligned assembly spans Mars11 Cirad1 5606 1014 Gene transcripts that do not map, or map poorly (<90% align) Mars11 Cirad1 Aligned assembly spans 1009 (35) 1335 (21) Non-aligned spans 1978 (90) 355 (20) total 2987 (125) 1690 (41) ------------------------------------------------------------- Number in parantheses: transcript does not map at all. M2.2 Gene attributes for loci in non-aligned and aligned spans Nonalign Align %Nona %Aln N-A% Chi p --------------------------------------------------------- has_alternates 411 7472 16 28 -11 153 0 has_ortholog 1927 22747 76 84 -8 104 0 has_paralog 1985 20954 78 77 1 1.3 1 estbean 425 9505 17 35 -18 348 0 estleaf 959 15193 38 56 -18 310 0 estpistil 1057 16952 42 63 -21 422 0 total loci 2538 27127 --------------------------------------------------------- Counts of genes in genome assembly spans that are aligned, and non-aligned between Matina and Criollo assemblies. The percentages, %Nona %Aln, are relative to total for each group, and N-A% is the difference in these, with Chi-sq statistic and p-value indicating significant deviation from equal rates. See also docs/cacao3gene-evid2012sum.txt, Tables E1,E4,E5. M2.3 Chromosome location Nonalign Align pNonal pAlign diff% g p ----------------------------------------------------------- scaffold_1 308 5938 10.6 14.4 -3.7 28.97 5e-04 scaffold_9 295 5406 10.2 13.1 -2.9 18.81 5e-04 scaffold_2 317 5155 10.9 12.5 -1.5 5.27 5e-04 scaffold_3 304 4636 10.5 11.2 -0.7 1.27 5e-04 scaffold_5 360 4510 12.4 10.9 1.5 5.48 5e-04 * scaffold_4 279 4461 9.6 10.8 -1.1 3.49 5e-04 scaffold_6 266 3371 9.2 8.1 1.0 3.38 5e-04 scaffold_8 193 2975 6.6 7.2 -0.5 1.09 5e-04 scaffold_10r 310 2458 10.7 5.9 4.7 81.84 5e-04 * most Nonalign excess scaffold_7 254 2290 8.8 5.5 3.2 42.77 5e-04 * ----------------------------------------------------------- Plot of genes, transposons and criollo/matina gaps (cgap) per chromosome http://server7.eugenes.org:8091/cacao/genes10/docs/feature_coverage/cacao_genomemap.chrs10k.pdf Chrs 10, 7 and 5 show more cirad gaps in genic regions Table M3. Gene functions ========================= Table M3.1 GO classes overabundant in Genes in Matina/Mars assembly spans non-aligned to Criollo/Cirad assembly No. Genes Pct Genes GOid Nocir Cir %Nocir %Cir diff% g p GOTerm -------------------------------------------------------------------------------------------- GO:0006952 184 662 1.87 0.42 1.45 233.608 1.0e-60 P:defense response GO:0004675 57 99 0.58 0.06 0.52 129.053 0.000 F:transmembrane receptor prote GO:0006468 579 5706 5.9 3.67 2.23 103.505 0 P:protein phosphorylation GO:0032259 80 480 0.81 0.3 0.51 51.081 4.2e-64 P:methylation GO:0019761 20 48 0.2 0.03 0.17 36.384 0.000 P:glucosinolate biosynthetic p GO:0004674 170 1628 1.73 1.04 0.69 33.860 4e-259 F:protein serine/threonine kin GO:0050832 62 406 0.63 0.26 0.37 33.695 6.1e-57 P:defense response to fungus GO:0004672 164 1574 1.67 1.01 0.65 32.403 9e-251 F:protein kinase activity GO:0010025 20 58 0.2 0.03 0.17 31.193 1.6e-05 P:wax biosynthetic process GO:0005524 176 1768 1.79 1.13 0.66 29.152 1e-285 F:ATP binding GO:0009809 32 164 0.32 0.1 0.22 26.273 4.1e-21 P:lignin biosynthetic process GO:0007165 66 520 0.67 0.33 0.34 23.782 1.7e-78 P:signal transduction GO:0019825 38 233 0.38 0.14 0.24 23.350 2.2e-32 F:oxygen binding GO:0016301 132 1319 1.34 0.84 0.5 22.337 3e-213 F:kinase activity GO:0009626 30 166 0.3 0.1 0.2 21.918 2.6e-22 P:plant-type hypersensitive re GO:0051555 14 44 0.14 0.02 0.12 20.307 8.1e-05 P:flavonol biosynthetic proces GO:0012505 177 1943 1.8 1.24 0.56 19.549 7e-322 C:endomembrane system GO:0000162 14 52 0.14 0.03 0.11 17.186 2.9e-06 P:tryptophan biosynthetic proc GO:0009699 12 40 0.12 0.02 0.1 16.456 0.000 P:phenylpropanoid biosynthetic GO:0052544 12 42 0.12 0.02 0.1 15.674 4.4e-05 P:defense response by callose GO:0004497 33 242 0.33 0.15 0.18 14.132 2.0e-36 F:monooxygenase activity GO:0030244 20 116 0.2 0.07 0.13 13.532 1.8e-16 P:cellulose biosynthetic proce GO:0009396 10 34 0.1 0.02 0.08 13.448 0.000 P:folic acid-containing compou GO:0043680 12 50 0.12 0.03 0.09 12.939 1.3e-06 C:filiform apparatus GO:0010483 12 50 0.12 0.03 0.09 12.939 1.3e-06 P:pollen tube reception GO:0006457 81 840 0.82 0.54 0.27 11.724 4e-138 P:protein folding GO:0051707 15 80 0.15 0.05 0.1 11.604 2.5e-11 P:response to other organism GO:0004713 10 40 0.1 0.02 0.08 11.308 2.2e-05 F:protein tyrosine kinase acti GO:0006098 8 28 0.08 0.01 0.07 10.449 0.000 P:pentose-phosphate shunt GO:0016571 12 60 0.12 0.03 0.09 10.209 1.5e-08 P:histone methylation -------------------------------------------------------------------------------------------- pub3ig1.ciradpoor+nonalign.gostat Nocir = genes in Matina/Mars assembly spans non-aligned to cirad assembly, Cir = cirad aligned Pct genes = percent of subset total (Nocir or Cir) diff% = difference in percent genes, this selection removed negative diff (missing in non-aligned) as uninformative/expected for housekeeping genes. g= G-statistic (log likelihood, similar to Chi-sq); p= prob(G). Note: each gene can have several GO terms. These are assigned with TAIR annotation using best TAIR protein match to cacao gene. Table M3.2 GO classes overabundant in Genes with no/poor alignment, but in Matina/Mars assembly spans aligned to Criollo/Cirad assembly No. Genes Pct Genes GOid Nocir Cir %Nocir %Cir diff% g p GOTerm -------------------------------------------------------------------------------------------- GO:0004675 24 99 0.6 0.06 0.54 57.429 1.3e-11 F:transmembrane receptor prote +above GO:0006464 12 84 0.3 0.05 0.25 18.969 2.0e-13 P:protein modification process GO:0001522 8 46 0.2 0.03 0.17 15.021 2.3e-07 P:pseudouridine synthesis GO:0031625 10 75 0.25 0.05 0.2 14.788 1.7e-12 F:ubiquitin protein ligase bin GO:0008610 6 26 0.15 0.01 0.13 13.894 0.000 P:lipid biosynthetic process GO:0046274 6 28 0.15 0.01 0.13 13.201 0.000 P:lignin catabolic process GO:0005388 12 120 0.3 0.08 0.21 12.853 5.4e-21 F:calcium-transporting ATPase GO:0009451 10 86 0.25 0.05 0.2 12.810 8.7e-15 P:RNA modification GO:0006457 42 840 1.05 0.58 0.47 11.725 4e-159 P:protein folding +above GO:0009231 6 34 0.15 0.02 0.13 11.400 9.5e-06 P:riboflavin biosynthetic proc GO:0042742 28 492 0.7 0.34 0.35 11.162 4.8e-92 P:defense response to bacteriu GO:0080043 5 25 0.12 0.01 0.11 10.465 0.000 F:quercetin 3-O-glucosyltransf GO:0000139 8 68 0.2 0.04 0.16 10.381 5.8e-12 C:Golgi membrane -------------------------------------------------------------------------------------------- pub3ig1.ciradpoor-nonalign.gostat # Table M3.3 Details for # GOid Nocir Cir %Nocir %Cir diff% g p GOTerm # -------------------------------------------------------------------------------------------- # GO:0006952 184 662 1.87 0.42 1.45 233.608 1.0e-60 P:defense response # GO:0004675 57 99 0.58 0.06 0.52 129.053 0.000 F:transmembrane receptor prote # GO:0019761 20 48 0.2 0.03 0.17 36.384 0.000 P:glucosinolate biosynthetic p # GO:0010025 20 58 0.2 0.03 0.17 31.193 1.6e-05 P:wax biosynthetic process # GO:0051555 14 44 0.14 0.02 0.12 20.307 8.1e-05 P:flavonol biosynthetic proces Table M3.4 Defense response and transmembrane gene names from non-aligned regions n=190 gene names, n=344 genes -------------------------------------------------------------------------------------------- 18 LRR and NB-ARC domains-containing disease resistance protein, putative 16 NB-ARC domain-containing disease resistance protein, putative 3 NB-ARC domain-containing disease resistance protein 2 LRR and NB-ARC domains-containing disease resistance protein 11 Cc-nbs-lrr resistance protein, putative 2 Cc-nbs-lrr resistance protein 10 Kinase superfamily protein 7 Leucine-rich repeat transmembrane protein kinase 7 Nbs-lrr resistance protein 3 NBS-LRR type disease resistance protein 7 S-locus lectin protein kinase family protein 6 Serine/threonine-protein kinase PBS1 5 Dynamin related protein 4C 5 Leucine-rich repeat containing protein 5 P-loop containing nucleoside triphosphate hydrolases superfamily protein 4 Disease resistance protein 4 Disease resistance protein RPP8 2 Disease resistance RPS5, putative-like protein 2 Disease resistance protein family 2 Disease resistance protein family, putative 2 Disease resistance protein, putative 4 Kinase superfamily protein, putative 4 S-locus lectin protein kinase family protein, putative 4 Serine/threonine kinases,protein kinases,ATP binding,sugar binding,kinases,carbohydrate binding, putative 3 12-oxophytodienoate reductase 2 3 CC-NBS-LRR class disease resistance protein 3 CC-NBS-LRR class disease resistance protein, putative 1 CC-NBS-LRR protein 3 Cysteine-rich RLK 10, putative 2 TMV resistance protein N 1 Disease resistance (TIR-NBS-LRR class), putative-like protein 1 Disease resistance RPM1, putative-like protein 1 Disease resistance protein RPP8, putative 1 NB-ARC domain-containing disease resistance-like protein 1 NBS type disease resistance protein, putative 1 Nbs-lrr resistance-like protein 1 TMV resistance N, putative-like protein 1 Tir-nbs-lrr resistance-like protein -------------------------------------------------------------------------------------------- See Table Sn. Disease resistance genes in cacao11genes_summaries-work.txt # ** Should add other class of disease resistance genes: receptor protein kinase # Malectin/receptor prot kinase is overabundant vs other plants and in cacao_MA.noCR (by ~10 more loci) # Malectin contributes to downy mildew *disease* not resistance?: http://www.ncbi.nlm.nih.gov/pubmed/21711359 # grep -c Malectin ciradgene1.marspoormap.cdd.tab = 7 # grep -c Malectin cacao11pub3i.ciradpoormap.cdd.tab = 17 # # orthomcl groups for Malectin/RPK Thecc genes: 24, all RPK Thecc genes: 53 # non-align span Malectin/RPK Thecc genes: 17 # ---------------------------------------------------------------- Table Sn.3 Genes containing disease resistance protein domains, using rpsblast of gene proteins x disease-domains from NCBI Conserved Domain Database. Protein domains Species Dirig Psyr NBS TIR TIR-NBS Malectin/RPK ------------------------------------------------------------------------- cacao_MA 37 410 341 10 21 cacao_MA.noCR 6 89 86 1 7 17 cacao_CR 38 334 267 18 14 cacao_CR.noMA 2 43 34 1 2 7 arath 29 284 88 42 126 poptr 46 693 430 68 112 ricco 24 174 127 13 31 vitvi 18 338 300 10 21 --------------------------------------------------------- cacao_CR = Criollo/Cirad v1 genome; cacao_MA = Matina/CGD v1.1 genome; cacao_CR.noMA = Criollo genes from non-aligned regions (n=1690) cacao_MA.noCR = Matina genes from non-aligned regions (n=2986) The non-aligned genes are located in spans where assemblies do not align, and the transcripts fail to map properly elsewhere in other assembly. All proteins per species were aligned using RPSBLAST to a disease domain database, and type and number of domains per gene counted, then summed over all genes. Disease resistance domains: Dirig= CDD:190504 pfam03018 Dirigent, Dirigent-like protein Psyr = CDD:178749 PLN03210 Psyringae, PLN03210, Resistant to P. syringae 6 NBS = CDD:201512 pfam00931 NB-ARC, NB-ARC domain TIR = CDD:201870 pfam01582 TIR, TIR domain or CDD:205852 pfam13676 TIR_2, TIR domain -------------------------------------------------------------------------------------------- New chromosome feature map Counting bases spanned by features in 100kb sliding window over genome 1. Gene exon density (pub3i.good.gff), max exon density is 46kb/100kb window 2. Transposon density (transposon_mazhao.gff), max transposon density is 75kb/100kb 3a. Matina / Criollo non-alignment gaps from Mummer align, max density is 26kb 3b. Matina / Criollo gDNA alignment (gDNA cover density), inverted for gaps (100k - cover), max density 43kb 4. Disease genes identified by homology as disease resistance, n=616, max density 10kb (but 90% of spans=0) 5. Tandem duplicate genes, near (<=10kb) to paralogs in same gene family (Orthomcl), n=616, max dens 12kb (but 90% of spans=0) Sets 4+5 share only 82 same genes, even though 4 tends to be found in clusters 6. not plotted: Control geneset with ATP binding function (GO:0005524), n=1162 Final plot here: http://server7.eugenes.org:8091/cacao/genes10/docs/feature_coverage/cacao_genomemap.chrs10k.pdf Summary of correlation of gene sets, alignment gaps, transposons: 0. Genes anticorrelate with Transposons (r=-0.87) 1. Disease R genes and Tandem paralogs both correlate with Criollo align gaps [Genic regions strongest] DiseaseR x gaps is higher by pearson R (r=0.38 vs .33), TandemDups x gaps is higher by spearman R 2. Partial corr of each geneset x gap, controlled for other geneset, gives no major change, partial of both is lower by 25%, but still significant. 3. Criollo align gaps have stronger corr with gene sets than Criollo gDNA gaps (r=0.38 for align-gap vs 0.23 for gDNA-gap in genic spans, for Disease genes) 4. Control geneset ATP binding (GO:0005524) has no corr with gaps (r=-0.05), but strong corr (r=0.42) with overall gene density, vs gene density x Disease R or Tandem paralogs (r=0.20). 5. Per chromosome, DiseaseR x gaps is highest on chrs 5,7,10r (r=.64,.47,.38), on other chrs, Tandem paralogs x gaps is highest, in range of 0.20-0.33, DiseaseR contents are limited. -------------------------------------------------------------------------------------------- Viewing Matina/Criollo non-align gaps and associated genes 1. Pick out spots on chromosome maps here where genes are above transposons (Green over Blue) and cgap (Red) is down valley (non-align criollo). http://server7.eugenes.org:8091/cacao/genes10/docs/feature_coverage/gene_calign_te.call6.pdf 2. Find matching location in GBrowse, eg. scaffold_5 at 38.5 MB http://server7.eugenes.org:8091/gbrowse/cgi-bin/gbrowse/cacao11x/ 3. Disease resistance genes often are found with no/low Criollo alignment Criollo non-align gap examples: scaffold_2:778524-853524 several NB-ARC Disease R genes scaffold_4:31208458-31283458 several Disease resistance / LRR family scaffold_5:38441020-38516020 several NB-ARC Disease R genes scaffold_7:1969211-2051711 disease R genes scaffold_7:2051712-2134212 disease R genes scaffold_7:2134213-2216713 disease R genes .. aligned span .. scaffold_7:2546718..2629218 Leucine repeat genes LRR scaffold_10r:225923-308423 several NB-ARC disease resistance scaffold_10r:1415118-1497618 3 tandem Disease resistance protein RPP8, plus transposon; small/part gap scaffold_10r:1497619-1580119 UDP-Glycosyltransferase tandems, spotty gaps scaffold_10r:2689791-2772291 Cysteine/Histidine-rich C1 domain genes scaffold_10r:3019795-3102295 Ankyrin repeat family duplicates scaffold_1:38478520-38553520 not diseR but tandem dupl. Sterol methyltransferase scaffold_3:1153529-1228529 not diseR but several Ankyrin repeat family scaffold_3:33256170-33331170 not diseR but several tandem Fiber protein Fb17 scaffold_6:25780951-25863450 not diseR but several tandem Cysteine-rich RLK 30 Gbrowse tracks: Genome features/ Cirad1 align = Mummer alignment of assemblies Criollo gDNA cover or xyplot = your Criollo13 gDNA coverage maps -- these two usually are consistant; sometimes Mummer align covers spans without gDNA cover -------------------------------------------------------------------------------------------- ======== END of good parts =========================================================================== /////////////////////////////////////////////////////////////////////////////// ** Redo this again; counts dont match w/ other method. cacao3d/prot/estfunc/pub3ig1.ciradpoor+nonalign.gostat cacao3d/prot/estfunc/pub3ig1.ciradpoor-nonalign.gostat ** Drop below list. # grep -F -f prot/defensego6952.tairids relseq/cacao11good_pub3i.ciradpoormap+nonalign.attr | cut -f1 | perl -pe's/t\d+$//;' | sort -u | wc -l # nonaln n=108 w/defence TAIR, more than below 69 # align n=433, more than below 245 .. about same ratio # ?? what did I miss in above selection? cacao11genes_pub3i.tairgo.tab may be wrong; # # ==> estfunc/pub3it1.gotair.cirpoor+nonalign.count2 <== # Gene functions, TAIR GO Group, in two subsets of Matina/Mars cacao genome. # NOCir = non-aligned to criollo/cirad cacao genome + no/poor gene alignment # Cir = good gene alignment, and aligned to criollo/cirad genome # nGene %Genes nGene %Genes TAIR GO group # NOCir 119 5.4** Cir 1070 2.5 ATP binding/GO:0005524/F # NOCir 101 4.6 Cir 3177 7.4 biological_process/GO:0008150/P # NOCir 99 4.5 Cir 1664 3.8 protein phosphorylation/GO:0006468/P # NOCir 90 4.1 Cir 2950 6.8 molecular_function/GO:0003674/F # NOCir 69 3.1** Cir 245 0.5 defense response/GO:0006952/P # NOCir 67 3 Cir 1093 2.5 metabolic process/GO:0008152/P # NOCir 60 2.7 Cir 718 1.6 oxidation-reduction process/GO:0055114/P # NOCir 52 2.3 Cir 493 1.1 protein kinase activity/GO:0004672/F # NOCir 48 2.2 Cir 459 1 protein serine/threonine kinase activity/GO:0004674/F # NOCir 36 1.6 Cir 754 1.7 catalytic activity/GO:0003824/F # NOCir 31 1.4 Cir 415 0.9 nucleotide binding/GO:0000166/F # NOCir 29 1.3 Cir 1051 2.4 regulation of transcription, DNA-dependent/GO:0006355/P # NOCir 28 1.2 Cir 730 1.7 DNA binding/GO:0003677/F # NOCir 27 1.2 Cir 241 0.5 heme binding/GO:0020037/F # NOCir 28 1.2 Cir 793 1.8 zinc ion binding/GO:0008270/F # NOCir 22 1 Cir 545 1.2 binding/GO:0005488/F # NOCir 22 1 Cir 273 0.6 electron carrier activity/GO:0009055/F # NOCir 23 1 Cir 186 0.4 monooxygenase activity/GO:0004497/F # NOCir 22 1 Cir 739 1.7 protein binding/GO:0005515/F # NOCir 21 0.9 Cir 218 0.5 iron ion binding/GO:0005506/F # NOCir 20 0.9 Cir 371 0.8 oxidoreductase activity/GO:0016491/F # NOCir 20 0.9 Cir 739 1.7 sequence-specific DNA binding transcription factor activity/GO:0003700/F # NOCir 21 0.9 Cir 170 0.3 signal transduction/GO:0007165/P # NOCir 19 0.8 Cir 135 0.3 nucleoside-triphosphatase activity/GO:0017111/F # NOCir 18 0.8 Cir 238 0.5 proteolysis/GO:0006508/P # NOCir 16 0.7 Cir 252 0.5 carbohydrate metabolic process/GO:0005975/P # NOCir 15 0.6+ Cir 29 0 innate immune response/GO:0045087/P # NOCir 15 0.6 Cir 113 0.2 sugar binding/GO:0005529/F # NOCir 15 0.6+ Cir 28 0 transmembrane signaling receptor activity/GO:0004888/F # .. # NOCir 10 0.4 Cir 85 0.1 flavin adenine dinucleotide binding/GO:0050660/F # NOCir 8 0.3? Cir 29 0 cellulose biosynthetic process/GO:0030244/P # NOCir 7 0.3? Cir 29 0 flavonoid biosynthetic process/GO:0009813/P # NOCir 4 0.1+ Cir 0 0 diacylglycerol binding/GO:0019992/F # # Chi-test of above, p<0.05 # > rp<- (retn[,4] < 0.05) # > cbind(data.frame(ret), tab.noa[,c(2,5,7)])[rp,] # lkNOC lkCir g p NOC Cir GO term # 1 116.6 -90.3 26.4 1.95e-167 119 1070 ATP binding/GO:0005524/F # 2 -139.0 203.2 64.2 0.00e+00 101 3177 biological_process/GO:0008150/P # 4 -131.0 196.0 64.9 0.00e+00 90 2950 molecular_function/GO:0003674/F # 5 176.2 -90.6 85.6 3.01e-23 69 245 defense response/GO:0006952/P # 12 -47.9 75.8 27.9 2.53e-212 29 1051 regulation of transcription, DNA-dependent/GO:0006355/P # 15 -32.8 45.3 12.5 4.88e-157 28 793 zinc ion binding/GO:0008270/F # 19 -33.1 50.2 17.1 6.22e-149 22 739 protein binding/GO:0005515/F # 22 -33.8 54.0 20.3 3.84e-150 20 739 seq-sp DNA binding transcription factor activity/GO:0003700/F # 27 51.5 -20.5 31.0 3.48e-02 15 29 innate immune response/GO:0045087/P # 29 52.2 -20.5 31.7 4.74e-02 15 28 transmembrane signaling receptor activity/GO:0004888/F # 31 -22.6 38.3 15.7 5.47e-101 12 490 nucleic acid binding/GO:0003676/F # 62 18.5 -8.2 10.3 4.95e-02 6 15 glucosinolate biosynthetic process/GO:0019761/P # 98 22.3 0.0 22.3 4.55e-02 4 0 diacylglycerol binding/GO:0019992/F # # # ==> estfunc/pub3it1.gotair.cirpoor-nonalign.count2 <== # Gene functions, TAIR GO Group, in two subsets of Matina/Mars cacao genome. # NOCir = no/poor gene alignment, but aligned to criollo/cirad genome # Cir = good gene alignment, and aligned to criollo/cirad genome # nGene %Genes nGene %Genes TAIR GO group # NOCir 53 6 Cir 3225 7.3 biological_process/GO:0008150/P # NOCir 46 5.2 Cir 2994 6.7 molecular_function/GO:0003674/F # NOCir 36 4.1 Cir 1727 3.9 protein phosphorylation/GO:0006468/P # NOCir 33 3.7 Cir 1127 2.5 metabolic process/GO:0008152/P # NOCir 32 3.6 Cir 1157 2.6 ATP binding/GO:0005524/F # NOCir 21 2.4 Cir 1059 2.4 regulation of transcription, DNA-dependent/GO:0006355/P # NOCir 20 2.2 Cir 758 1.7 oxidation-reduction process/GO:0055114/P # NOCir 17 1.9 Cir 490 1.1 protein serine/threonine kinase activity/GO:0004674/F # NOCir 17 1.9 Cir 528 1.1 protein kinase activity/GO:0004672/F # NOCir 13 1.4 Cir 777 1.7 catalytic activity/GO:0003824/F # NOCir 13 1.4 Cir 378 0.8 oxidoreductase activity/GO:0016491/F # NOCir 13 1.4 Cir 746 1.6 sequence-specific DNA binding transcription factor activity/GO:0003700/F # NOCir 12 1.3 Cir 809 1.8 zinc ion binding/GO:0008270/F # NOCir 11 1.2 Cir 750 1.7 protein binding/GO:0005515/F # NOCir 9 1 Cir 749 1.6 DNA binding/GO:0003677/F # NOCir 9 1 Cir 305 0.6 defense response/GO:0006952/P # NOCir 9 1 Cir 437 0.9 nucleotide binding/GO:0000166/F # NOCir 8 0.9? Cir 87 0.1 flavin adenine dinucleotide binding/GO:0050660/F # # Chi-test of above, only sig set, no signif. positive cirpoor-nonalign # > rp<- (ret[,4] < 0.05) # > cbind(data.frame(ret), tab.aln[,c(2,5,7)])[rp,] # lkNOC lkCir g p NOC Cir GOterm # 1 -53.7 70.4 16.6 0 53 3225 biological_process/GO:0008150/P # 2 -52.7 71.6 18.9 0 46 2994 molecular_function/GO:0003674/F cat ../relseq/cacao11good_pub3i.ciradpoormap+nonalign.ids gonamid.tab cacao11genes_pub3i.tairgo.tab | perl -ne\ 'chomp; if(/^(Thec\w+)$/) { $sid{$1}=$S[0]; next; } elsif(/\tGO:\d+$/){ ($na,$go)=split"\t"; $gon{$go}=$na; next; }\ ($tg,$at,$p,$go)=split"\t"; next unless($tg=~/t1$/); $s=$sid{$tg}||$S[1]; %didg=(); foreach $g (split",",$go) { \ next if($g=~m,/C, or $didg{$g}++); $sc{$s}++; $gc{$g}{$s}++ } END{ foreach $g (sort keys %gc) { next unless($gc{$g}{NOCir}); \ for $s (@S) { $c=$gc{$g}{$s}||0; $sc=$sc{$s}; $p=int(1000*$c/$sc)/10; print join("\t",$s,$c,$p); print "\t"; } \ ($g1=$g)=~s,/.,,; $gn=$gon{$g1}||"na"; print "$gn/$g\n"; } } BEGIN{ @S=qw(NOCir Cir); }' | \ sort -k3,3nr -k7,7i > estfunc/pub3it1.gotair.cirpoor+nonalign.count2 Frequency of GOslim function categories (using homology to TAIR GOslim) GO Slim, all groups. -------------------- cat $caca/genes/pub3i/pub3i.good.ciradnonalign.ids cacao11genes_pub3i.goslim2.tab | perl -ne\ 'chomp; if(/^(Thec\w+)$/) { $sid{$1}=$S[0]; next; } ($tg,$at,$p,$go)=split"\t"; next unless($tg=~m/t1$/); \ $s=$sid{$tg}||$S[1]; foreach $g (split",",$go) { $sc{$s}++; $gc{$g}{$s}++ } \ END{ foreach $g (sort keys %gc) { for $s (@S) { $c=$gc{$g}{$s}; $sc=$sc{$s}; $p=int(1000*$c/$sc)/10; \ print join("\t",$s,$c,$p); print "\t"; } print "$g\n"; } } BEGIN{ @S=qw(NOCir Cir); }' | \ sort -k3,3nr -k6,6n -k7,7 > estfunc/pub3it1.goslim2.nocir.count estfunc/pub3it1.goslim2.nocir.count nGene %Genes nGene %Genes TAIR GOSlim group NOCir 958 10.7 Cir 6382 10.2 P:other metabolic processes NOCir 879 9.8 Cir 6174 9.9 P:other cellular processes NOCir 442 4.9* Cir 2340 3.7 C:other cellular components NOCir 400 4.4 Cir 2711 4.3 P:protein metabolism NOCir 378 4.2 Cir 2345 3.7 F:other binding NOCir 353 3.9 Cir 2762 4.4 C:other intracellular components NOCir 348 3.8** Cir 1315 2.1 F:nucleotide binding NOCir 326 3.6 Cir 1834 2.9 F:other enzyme activity NOCir 310 3.4 Cir 1635 2.6 C:plasma membrane NOCir 301 3.3- Cir 2977 4.8 P:unknown biological processes NOCir 281 3.1 Cir 2128 3.4 C:other membranes NOCir 281 3.1 Cir 2255 3.6 C:other cytoplasmic components NOCir 275 3 Cir 1571 2.5 F:hydrolase activity NOCir 273 3 Cir 2170 3.4 C:chloroplast NOCir 267 2.9 Cir 2334 3.7 C:unknown cellular components NOCir 248 2.7* Cir 1118 1.8 P:response to stress NOCir 248 2.7 Cir 1368 2.2 F:transferase activity NOCir 248 2.7-- Cir 2792 4.5 F:unknown molecular functions NOCir 169 1.8 Cir 1275 2 C:cytosol NOCir 153 1.7 Cir 1050 1.6 P:transport NOCir 156 1.7 Cir 1361 2.1 C:nucleus NOCir 140 1.5 Cir 640 1 F:kinase activity NOCir 142 1.5 Cir 945 1.5 P:response to abiotic or biotic stimulus NOCir 141 1.5 Cir 1032 1.6 F:DNA or RNA binding NOCir 135 1.5 Cir 1129 1.8 P:other biological processes NOCir 132 1.4 Cir 1059 1.7 F:protein binding NOCir 120 1.3 Cir 943 1.5 C:plastid NOCir 109 1.2 Cir 877 1.4 P:developmental processes NOCir 89 0.9 Cir 524 0.8 P:signal transduction NOCir 77 0.8 Cir 536 0.8 F:other molecular functions NOCir 76 0.8 Cir 683 1.1 F:transcription factor activity NOCir 68 0.7 Cir 440 0.7 C:cell wall NOCir 64 0.7 Cir 449 0.7 F:transporter activity NOCir 57 0.6 Cir 466 0.7 F:nucleic acid binding NOCir 54 0.6 Cir 467 0.7 P:cell organization and biogenesis NOCir 46 0.5 Cir 511 0.8 C:mitochondria NOCir 40 0.4 Cir 300 0.4 C:extracellular NOCir 27 0.3 Cir 60 0 F:receptor binding or activity NOCir 35 0.3 Cir 179 0.2 C:ER NOCir 35 0.3 Cir 215 0.3 P:DNA or RNA metabolism NOCir 26 0.2 Cir 259 0.4 F:structural molecule activity NOCir 15 0.1 Cir 97 0.1 P:electron transport or energy pathways NOCir 16 0.1 Cir 125 0.2 C:Golgi apparatus NOCir 13 0.1 Cir 180 0.2 C:ribosome ============================ cat pub3i.gostressnbind.nocirad.attr.tab | egrep -v '/[CI]' | cut -f2 | perl -pe's/ \(\d+%..//;' | sort | uniq -c | sort -k1,1nr | head -20 n=190 gene names, n=344 genes 18 LRR and NB-ARC domains-containing disease resistance protein, putative 16 NB-ARC domain-containing disease resistance protein, putative 3 NB-ARC domain-containing disease resistance protein 2 LRR and NB-ARC domains-containing disease resistance protein 11 Cc-nbs-lrr resistance protein, putative 2 Cc-nbs-lrr resistance protein 10 Kinase superfamily protein 7 Leucine-rich repeat transmembrane protein kinase 7 Nbs-lrr resistance protein 3 NBS-LRR type disease resistance protein 7 S-locus lectin protein kinase family protein 6 Serine/threonine-protein kinase PBS1 5 Dynamin related protein 4C 5 Leucine-rich repeat containing protein 5 P-loop containing nucleoside triphosphate hydrolases superfamily protein 4 Disease resistance protein 4 Disease resistance protein RPP8 2 Disease resistance RPS5, putative-like protein 2 Disease resistance protein family 2 Disease resistance protein family, putative 2 Disease resistance protein, putative 4 Kinase superfamily protein, putative 4 S-locus lectin protein kinase family protein, putative 4 Serine/threonine kinases,protein kinases,ATP binding,sugar binding,kinases,carbohydrate binding, putative 3 12-oxophytodienoate reductase 2 3 CC-NBS-LRR class disease resistance protein 3 CC-NBS-LRR class disease resistance protein, putative 1 CC-NBS-LRR protein 3 Cysteine-rich RLK 10, putative 2 TMV resistance protein N 1 Disease resistance (TIR-NBS-LRR class), putative-like protein 1 Disease resistance RPM1, putative-like protein 1 Disease resistance protein RPP8, putative 1 NB-ARC domain-containing disease resistance-like protein 1 NBS type disease resistance protein, putative 1 Nbs-lrr resistance-like protein 1 TMV resistance N, putative-like protein 1 Tir-nbs-lrr resistance-like protein -- disease resistance, non-aligned genes -- 20 lrr and nb-arc domains-containing disease resistance protein 19 nb-arc domain-containing disease resistance protein 13 cc-nbs-lrr resistance protein 7 nbs-lrr resistance protein 6 cc-nbs-lrr class disease resistance protein 6 disease resistance protein 5 disease resistance protein rpp8 5 tmv resistance protein n 4 disease resistance protein family 3 nbs-lrr type disease resistance protein 2 disease resistance rps5-like protein 1 cc-nbs-lrr resistance-like protein 1 disease resistance (tir-nbs-lrr class)-like protein 1 disease resistance rpm1-like protein 1 nb-arc domain-containing disease resistance-like protein 1 nbs type disease resistance protein 1 nbs-lrr resistance-like protein 1 pleiotropic drug resistance 10 1 tir-nbs-lrr resistance-like protein 1 tmv resistance n-like protein 1 lrr receptor-like serine/threonine-protein kinase cat pub3i.all.nocirad.attr.tab | egrep -v '/[CI]' | cut -f2 | perl -pe's/ \(\d+%..//;' | sort | uniq -c | sort -k1,1nr | head -20 n=944 gene names, n=1641 genes 83 Uncharacterized protein 19 Cysteine/Histidine-rich C1 domain family protein, putative 18 LRR and NB-ARC domains-containing disease resistance protein, putative 17 NB-ARC domain-containing disease resistance protein, putative 15 Cysteine-rich RLK (RECEPTOR-like protein kinase) 8 15 Leucine-rich repeat protein kinase family protein, putative 14 DNA/RNA polymerases superfamily protein 12 Cc-nbs-lrr resistance protein, putative 12 Cytochrome P450, putative 12 Kinase superfamily protein 11 Alpha/beta-Hydrolases superfamily protein 11 Ankyrin repeat-containing protein, putative 11 Cytochrome P450 10 Kinase superfamily protein, putative 10 Receptor like protein 33 10 S-adenosyl-L-methionine-dependent methyltransferases superfamily protein 9 Nbs-lrr resistance protein 9 PR5-like receptor kinase 8 Disease resistance family protein / LRR family protein, putative 8 Serine-threonine protein kinase, putative #..... full pub3i.gostressnucbind.nocirad ... cat pub3i.gostressnbind.nocirad.attr.tab | egrep -v '/[CI]' | cut -f2 | perl -pe's/ \(\d+%..//;' | sort | uniq -c | sort -k1,1nr | less 18 LRR and NB-ARC domains-containing disease resistance protein, putative 16 NB-ARC domain-containing disease resistance protein, putative 11 Cc-nbs-lrr resistance protein, putative 10 Kinase superfamily protein 7 Leucine-rich repeat transmembrane protein kinase 7 Nbs-lrr resistance protein 7 S-locus lectin protein kinase family protein 6 Serine/threonine-protein kinase PBS1 5 Dynamin related protein 4C 5 Leucine-rich repeat containing protein 5 P-loop containing nucleoside triphosphate hydrolases superfamily protein 4 Disease resistance protein 4 Disease resistance protein RPP8 4 Kinase superfamily protein, putative 4 S-locus lectin protein kinase family protein, putative 4 Serine/threonine kinases,protein kinases,ATP binding,sugar binding,kinases,carbohydrate binding, putative 3 12-oxophytodienoate reductase 2 3 CC-NBS-LRR class disease resistance protein 3 CC-NBS-LRR class disease resistance protein, putative 3 Cysteine-rich RLK 10, putative 3 Leucine-rich repeat containing protein, putative 3 Leucine-rich repeat-containing protein, putative 3 Lipoxygenase 1 3 NB-ARC domain-containing disease resistance protein 3 NBS-LRR type disease resistance protein 3 Peroxidase superfamily protein 3 Serine-threonine protein kinase 3 TMV resistance protein N, putative 3 Transducin/WD40 repeat-like superfamily protein, putative 2 Beta-6 tubulin 2 Calcium-dependent protein kinase 29 2 Cc-nbs-lrr resistance protein 2 Cysteine-rich RLK 10 2 DEA(D/H)-box RNA helicase family protein 2 Dehydration-responsive element-binding protein 2C 2 Disease resistance RPS5, putative-like protein 2 Disease resistance protein family 2 Disease resistance protein family, putative 2 Disease resistance protein, putative 2 Eukaryotic translation initiation factor 2 family protein 2 Flavin-dependent monooxygenase 1 2 G-type lectin S-receptor serine/threonine-protein kinase 2 Heat shock protein 70 family protein 2 Kinase family protein with leucine-rich repeat domain, putative 2 LRR and NB-ARC domains-containing disease resistance protein 2 Leucine-rich repeat receptor-like protein kinase family protein, putative 2 Leucine-rich repeat transmembrane protein kinase protein, putative 2 MLP protein 28 2 PR5-like receptor kinase 2 RNA-binding family protein 2 Receptor like protein 34 2 Receptor-like protein kinase 1 2 S-locus-specific glycoprotein S6, putative 2 Serine-threonine protein kinase, plant-type, putative 2 TMV resistance protein N 2 Transducin/WD40 repeat-like superfamily protein 1 2-nonaprenyl-3-methyl-6-methoxy-1,4-benzoquinol hydroxylase 1 3,4-dihydroxy-2-butanone kinase, putative 1 3R-linalool synthase 1 3R-linalool synthase, putative 1 5'-3' exonuclease family protein, putative 1 AGC, putative 1 AP2 6l, putative 1 AP2/ERF domain-containing transcription factor-like protein 1 ATP binding protein, putative 1 ATP-dependent RNA helicase, putative 1 ATPase family AAA domain-containing protein 1-A 1 Alpha-glucan water dikinase, chloroplast, putative 1 Arginase 1 Arginine/serine-rich splicing factor 35 1 Argonaute family protein, putative 1 Argonaute protein group 1 Autophagy 18 H 1 Autophagy protein Apg5 family 1 BRI1-like 3, putative 1 BTB-POZ and MATH domain 4 1 BZIP domain class transcription factor 1 Basal transcription factor complex subunit-related 1 CBL-interacting protein kinase 9 1 CC-NBS-LRR protein 1 CDPK-related kinase 3 1 Caleosin-related family protein 1 Cc-nbs-lrr resistance-like protein 1 Chalcone and stilbene synthase family protein 1 Chaperonin 20-like protein 1 Chromatin remodeling complex subunit 1 Chromatin remodeling factor18 1 Cyclin F-box 1 Cysteine-rich RLK 30, putative 1 Cytochrome BC1 synthesis, putative 1 D-isomer specific 2-hydroxyacid dehydrogenase family protein, putative 1 DYNAMIN-like 1E 1 Di-glucose binding protein with Kinesin motor domain 1 Di-glucose binding protein with Kinesin motor domain, putative 1 Di-glucose binding with Kinesin motor domain-like protein 1 Disease resistance (TIR-NBS-LRR class), putative-like protein 1 Disease resistance RPM1, putative-like protein 1 Disease resistance protein RPP8, putative 1 F-box family protein 1 Fatty acid reductase 4 1 Flavin-dependent monooxygenase 1, putative 1 Flavonol 3-sulfotransferase, putative 1 Forkhead-associated domain-containing protein / FHA domain-containing protein, putative 1 Forms aploid and binucleate cells 1c, putative 1 GTP binding Elongation factor Tu family protein 1 GTP-dependent nucleic acid-binding protein engD 1 Glutathione peroxidase 8 1 Glyceraldehyde-3-phosphate dehydrogenase C subunit 1 1 HEAT repeat,WD domain 1 HSP20-like chaperones superfamily protein 1 Heat shock protein 70B 1 Histidine kinase 1 Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase family protein 1 Hydrolases, acting on acid anhydrides, in phosphorus-containing anhydrides,ATP-dependent helicases,nucleic acid binding,ATP binding,RNA binding,helicases, putative 1 Kinase family protein with ARM repeat domain 1 Kinase family protein with leucine-rich repeat domain 1 Kinase, putative 1 LRR receptor-like serine/threonine-protein kinase 1 Leo1-like family protein 1 Leucine-rich repeat protein kinase family protein 1 Leucine-rich repeat receptor protein kinase family protein, putative 1 Leucine-rich repeat transmembrane protein kinase, putative 1 Leucine-rich repeat-containing protein DDB_G0290503, putative 1 MMS ZWEI 1 MORC family CW-type zinc finger protein 3 1 MRNA capping enzyme family protein 1 MUTM-1 1 Malectin/receptor-like protein kinase family protein 1 Minichromosome maintenance 9 1 Mitochondrial HSO70 2 1 Mitochondrial-processing peptidase subunit alpha 1 Molecular chaperone Hsp40/DnaJ family protein 1 Molybdenum cofactor sulfurase (LOS5) (ABA3) 1 Myb-like HTH transcriptional regulator family protein 1 Myb-like transcription factor family protein 1 NAC domain protein 1 NAD(P)-linked oxidoreductase superfamily protein, putative 1 NB-ARC domain-containing disease resistance-like protein 1 NBS type disease resistance protein, putative 1 Nbs-lrr resistance-like protein 1 P-loop containing nucleoside triphosphate hydrolases superfamily protein, putative 1 PAD4, putative 1 PR5 receptor kinase 1 Pathogenesis-related protein P2 1 Pathogenesis-related thaumatin-like protein, putative 1 Peroxidase superfamily protein, putative 1 Phosphoenolpyruvate carboxylase-related kinase 2 1 Pleiotropic drug resistance 10 1 Polypyrimidine tract-binding protein 3 1 Potassium channel in 2 1 Pre-mRNA-splicing factor ATP-dependent RNA helicase PRP16 1 RAB GTPase A1F 1 RAB GTPase A5E 1 RAB GTPase A6A 1 RAD3-like DNA-binding helicase protein 1 RECQ helicase L2 1 RHO-related from plants 2-like protein 1 Ras 5 1 RecA DNA recombination family protein 1 Receptor protein kinase 1 1 Receptor serine/threonine kinase 1 Receptor serine/threonine kinase, putative 1 Receptor-like protein kinase 1, putative 1 Ribulose bisphosphate carboxylase/oxygenase activase 1 1 S-adenosyl-L-methionine-dependent methyltransferases superfamily protein 1 S-locus-specific glycoprotein S6 1 Senescence-associated gene 13 1 Serine-threonine protein kinase, putative 1 Serine/threonine-protein kinase GCN2 1 Seryl-tRNA synthetase / serine--tRNA ligase, putative 1 Seven transmembrane MLO family protein 1 Signal recognition particle receptor protein 1 Small GTP-binding protein 1 Small nuclear ribonucleoprotein family protein 1 Squalene monooxygenase, putative 1 Sucrose nonfermenting 4 1 TCP-1/cpn60 chaperonin family protein 1 TGACG motif-binding factor 6 1 TMV resistance N, putative-like protein 1 TRNA modification GTPase 1 TRNA synthetase class I (I, L, M and V) family protein 1 Tetratricopeptide repeat-like superfamily protein, putative 1 Tir-nbs-lrr resistance-like protein 1 UB-like protease 1D 1 Uridine diphosphate glycosyltransferase 74E2, putative 1 Valyl-tRNA synthetase / valine--tRNA ligase (VALRS)-like protein 1 WD domain containing protein, putative 1 WRKY DNA-binding protein 27, putative 1 Wall-associated kinase 2, putative 1 Zinc finger C-x8-C-x5-C-x3-H type family protein -------------------------------------------------------------------------------------- SEE ABOVE: New chromosome feature map Counting bases spanned by features in 100kb sliding window over genome 1. Gene exon density (pub3i.good.gff) 2. Transposon density (transposon_mazhao.gff) 3a. Matina / Criollo non-alignment gaps from Mummer align 3b. Matina / Criollo gDNA alignment (gDNA cover density), inverted for gaps (100k - cover) 4. Disease gene location/density, using genes identified by homology as disease resistance, n=616 5. Tandem duplicate genes, near (<=10kb) to paralogs in same gene family (Orthomcl), n=616 Sets 4+5 share only 82 same genes, even though 4 tends to be found in clusters Final plot here: http://server7.eugenes.org:8091/cacao/genes10/docs/feature_coverage/cacao_genomemap.chrs10k.pdf #.............. data ...................... cat genes/pub3i/pub3i.good.gff | egrep '^scaffold_(10r|[1-9]) ' | grep ' exon' | sort -k1,1 -k4,4n -k5,5nr | \ env chrs=genome/cacao11chrs.fa.count bin=100000 slide=1 feat="exon" $evigene/scripts/genomefeatdens.pl \ > map/dens8/pub3i.good.bed8 gzcat misc/transposon_mazhao.gff.gz | egrep '^scaffold_(10r|[1-9]) ' | sort -k1,1 -k4,4n -k5,5nr |\ env chrs=genome/cacao11chrs.fa.count bin=100000 slide=1 feat="transposon" $evigene/scripts/genomefeatdens.pl \ > map/dens8/transposon_ma.bed8 cat genes/pub3i/pub3i.good.gff | egrep '^scaffold_(10r|[1-9]) ' | grep ' exon' |\ ggrep -F -f prot/diseasep/disease_pub3i1_omclfam+named.ids - | sort -k1,1 -k4,4n -k5,5nr | \ env chrs=genome/cacao11chrs.fa.count bin=100000 slide=1 feat="exon" $evigene/scripts/genomefeatdens.pl \ > map/dens8/pub3i.diseasep.bed8 # >> revised this, redid calculation, needs normalizing over windows.. # > the gaps are not showing up in this gdna score density; invert, count spans < 10 gdna reads = gaps? set cacag=/export/disk2/work/cacao2d/gdna gzcat $cacag/criollo13gdna-mars11asm.bed.gz | egrep '^scaffold_(10r|[1-9]) ' |\ env chrs=genome/cacao11chrs.fa.count bed=1 bin=100000 slide=1 $evigene/scripts/genomefeatdens.pl \ > map/dens8/criollogdna.bed8 # ** maybe poor gap data; look at 100k - criollogdna.bed8 and see if it seems better; # and/or redo using cirad.gaps.gff # NOTE: gaps also exist without rows; interpolate like introns gzcat $cacag/criollo13gdna-mars11asm.bed.gz | egrep '^scaffold_(10r|[1-9]) ' |\ perl -ne'($r,$b,$e,$c)=split; \ if($lr eq $r and $b > $le+9) { ($ib,$ie)=($le+1,$b-1); print join("\t",$r,$ib,$ie,0)."\n"; }\ print if($c<10); ($lr,$lb,$le,$lc)=($r,$b,$e,$c);' |\ env chrs=genome/cacao11chrs.fa.count bed=1 bin=100000 slide=1 $evigene/scripts/genomefeatdens.pl \ > map/dens8/criollogdna_gaps.bed8 #^?? is that right? compare w/ cirad.gaps.gff gzcat relseq/mars_cacao11asm.nonalign.gff.gz | egrep '^(scaffold_10r|scaffold_[1-9]) ' | sort -k1,1 -k4,4n -k5,5nr |\ env chrs=genome/cacao11chrs.fa.count feat=gap bin=100000 slide=1 $evigene/scripts/genomefeatdens.pl \ > map/dens8/cirad_nonalign.bed8 & === n= 16463 grep mRNA genes/pub3i/pub3i.good.gff | grep 't1;gene=' | grep 'genegroup=PLA9' | grep 'paralog=Thecc' | perl -ne'($r,$b,$e)=(split)[0,3,4]; ($d,$pa,$gg)=m/(?:ID|paralog|genegroup)=(\w+)/g; print join("\t",$d,$gg,$pa,$r,$b,$e)."\n";' > genes/pub3i/pub3i.paralogs.tab cat pub3i.paralogs.tab | perl -ne'($g,$og,$pg,$r1,$b,$e)=split; \ ($rn)= m/scaffold_(\d+)/; next unless($rn > 0 and $rn < 11); \ push(@{$og{$og}{$r1}}, $b, $e); push(@{$ogg{$og}{$r1}}, $g); END{ foreach $og (sort keys %og) { \ @r=sort keys %{$og{$og}}; foreach $r (@r) { @be=@{$og{$og}{$r}}; @gd=@{$ogg{$og}{$r}}; \ next if(@be<4); ($lb,$le,$nd,$sb,$se,$i)=(0)x10; @sgd=@sb=(); while( ($b,$e)=splice(@be,0,2) ) { \ if($le and $b>$lb and $b < $le+9999) { push @sgd,$gd[$i-1] unless($nd); push @sgd,$gd[$i];\ $nd++; $se=$e; $sb=$lb unless($sb); } \ elsif($se) { push(@sb,$sb,$se); $sb=$se=0; } ($lb,$le)=($b,$e); $i++; } \ $nd++ if($nd); $sgd=join",",@sgd; while( ($b,$e)=splice(@sb,0,2) ) { $w=1+$e-$b; print \ join("\t",$r,"tandup\tregion",$b,$e,$nd,".",".","ID=$og;len=$w;tdg=$sgd")."\n"; } } } }' | \ sort -k1,1 -k4,4n -k5,5n > pub3i.tanduplomcl.gff cat pub3i.tanduplomcl.gff | perl -ne'($td)=m/tdg=([^;\s]+)/; @td=split",",$td; print join("\n",@td)."\n";' |\ sort -u > pub3i.tanduplomcl.ids cat genes/pub3i/pub3i.good.gff | egrep '^scaffold_(10r|[1-9]) ' | grep ' exon' |\ ggrep -F -f genes/pub3i/pub3i.tanduplomcl.ids - | sort -k1,1 -k4,4n -k5,5nr | \ env chrs=genome/cacao11chrs.fa.count bin=100000 slide=1 feat="exon" $evigene/scripts/genomefeatdens.pl \ > map/dens8/pub3i.tanduplexons.bed8 ## ^^ use this one ## some of these tandup.bed spans look too big; instead pull out gene IDs and do w/ genes.gff like diseaseR cat genes/pub3i/pub3i.tanduplomcl.gff | egrep '^scaffold_(10r|[1-9]) ' |\ env chrs=genome/cacao11chrs.fa.count bin=100000 slide=1 feat="region" $evigene/scripts/genomefeatdens.pl \ > map/dens8/pub3i.tandupl.bed8 # ^^ this is bad -------------------------------------------------------------------------------------- # recheck 25mar ** the %vals are wrong here, used colSums when these items dont sum. # .. pct are right for scaffold columns . # # M2.2 Gene attributes # Nonalign Align %Nonal %Align diff% g p # --------------------------------------------------------- # no_alts 2127 19655 19.4 11.5 7.9 450.4 0 # has_alt 860 21762 7.8 12.8 -4.9 230.3 0 -* all mrna # hasa loci 411 7472 # has_ortho 2476 34380 22.6 20.2 2.4 27.9 0 # 2423 37083 < corrected ortho col; all mrna # haso loci 1927 22747 # has_parlg 2251 28347 20.5 16.6 3.9 86.2 0 +* # 2381 31492 < corrected paralog col; all mrna # hasp loci 1985 20954 # # estbean 528 12182 4.8 7.1 -2.3 89.8 0 # estleaf 1296 25710 11.8 15.1 -3.3 79.8 0 # estpistil 1410 27954 12.8 16.4 -3.6 86.4 0 # # total 2987 41417 # #total loci 2485 26923 # BAD # total loci 2538 27127 #should be this, 2485 bad # nonaln 1978 in nonalign spans, and no/poor transcript align to cirad # 1680 nonaln loci # poormap 1009 in align spans, but no/poor transcript align to cirad # 874 poormap loci # --------------------------------------------------------- # # Loci recount # Nonalign Align %Nona %Aln diff% Chi p # has_altloc 411 7472 16 28 -11 153 0 # has_ortholoc 1927 22747 76 84 -8 104 0 # has_parlgloc 1985 20954 78 77 1 1.3 1 # estbean 425 9505 17 35 -18 348 0 # estleaf 959 15193 38 56 -18 310 0 # estpistil 1057 16952 42 63 -21 422 0 # totalloci 2538 27127 # M2.4 Paralog overlap # using best paralog # parlg 454 NonToAln ; 1165 AlnToNon # parology average identity # align n=18348; n85=1583,8%; ave=51; # noal+notr n=1292; n85=277,21%; ave=60; # aln+notr n=636; n85=120,18%; ave=59; # # using orthomcl gene families # align nfam: 15662; with nonaln: 380 # noal+notr nfam: 732; with align: 380; 52% # aln+notr nfam: 447; with aln-map: 211; 47% # .............. # # set pt=ciradpoormap-nonalign ; # cat cacao11good_pub3i.$pt.attr | cut -f1,5 | perl -ne'($t,$pa)=split; ($g=$t)=~s/t\d+$//; ($pi)=m/,(\d+)%/; if($pi and not $did{$g}++){ $sp+=$pi; $n++; $n75++ if($pi>84); } END { $ap=int($sp/$n); $p7=int(100*$n75/$n); print "n=$n; n85=$n75,$p7%; ap=$ap; sp=$sp\n"; }' # ciradalign n=18348; n85=1583,8%; ap=51; sp=947494 # ciradpoormap+nonalign n=1292; n85=277,21%; ap=60; sp=78490 # ciradpoormap-nonalign n=636; n85=120,18%; ap=59; sp=37986