Index of /EvidentialGene/other/gnodes/gnodes_cacao22measure
Name Last modified Size
cacao22asm.metad 21-Oct-2022 15:54 1k
cacao22genomes.txt 24-Oct-2022 22:40 5k
cacao22gnodes_outputs/ 24-Oct-2022 23:11 -
cacao_duplicated_flavorgenes.gids 23-Oct-2022 14:48 1k
cacao_flavorgenes_info.txt 24-Oct-2022 08:55 15k
run_gnodes_cacao.info.txt 24-Oct-2022 22:33 1k
Gnodes analyses of Theobroma cacao genes and chromosome assemblies:
Counting Cacao flavor genes, and Theobroma cacao genome "completeness", with comments on a 2021 report
2022.Oct, Don Gilbert, gilbertd@indiana.edu
-------------------
Th. cacao chromosome assembly completeness
Summary:
Both public cacao chromosome assemblies are missing duplicated parts of the genome: they fall 24% (2011 Matina) to 30% (2019 Criollo) below the nuclear genome size, as measured with flow cytometry and with DNA samples (Gnodes). Duplicated DNA accounts for most of the genome missing from the assemblies, at about 100 Mb, and it is mostly un-annotated sequence (NOann in this analysis: not annotated genic CDS, transposons, or simple repeats). However, the DNA evidence indicates that some gene-bearing segments are also missing, likely tandemly duplicated spans such as the esterases with putative flavor effects in the example below. This contrasts with some other plants analyzed with Gnodes: for example, Arabidopsis is missing simple repeats in telomeric and centromeric sections, and Zea mays is missing transposons.
The NCBI gene set used, with 21437 coding loci, under-represents the total number of coding loci according to DNA depth analysis. There are 1670 annotated loci with DNA copy number > 1.8, which represent 9500 to 12500 loci (including pseudo- and partial gene loci; 35 have 100+ copies and 400 have 10+ copies). Some of these duplicates are located on the chromosome assemblies but unannotated; others are missing from the assemblies. The Cacao genome project of 2011 (Matina cultivar) annotated 29,400 coding loci, with more duplicates.
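The tally described above (loci over the copy-number threshold, and the number of loci they represent) can be sketched as follows. This is a minimal illustration, not the Gnodes implementation: the gene IDs and copy numbers are hypothetical, the function name `summarize_duplicates` is made up for this example, and reading "10+" and "100+" as >= 10 and >= 100 is an assumption. Only the 1.8 threshold comes from the text.

```python
# Hypothetical per-locus DNA copy numbers (gene ID -> copy number); NOT real cacao data.
copy_numbers = {"geneA": 1.0, "geneB": 2.2, "geneC": 12.4, "geneD": 0.9, "geneE": 1.8}

DUP_THRESHOLD = 1.8  # from the text: loci with DNA copy number > 1.8 count as duplicated


def summarize_duplicates(cn, threshold=DUP_THRESHOLD):
    """Count loci above the threshold and estimate how many copies they represent."""
    dup = {gene: c for gene, c in cn.items() if c > threshold}
    return {
        "n_duplicated_loci": len(dup),
        # Sum of copy numbers approximates the loci represented by the duplicated set.
        "estimated_copies": round(sum(dup.values())),
        # Assumption: "10+" and "100+" mean >= 10 and >= 100 copies.
        "n_10plus": sum(1 for c in dup.values() if c >= 10),
        "n_100plus": sum(1 for c in dup.values() if c >= 100),
    }


summary = summarize_duplicates(copy_numbers)
print(summary)
```

Note that geneE, at exactly 1.8, is not counted, because the text gives the threshold as a strict "greater than".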
Among esterase and hydrolase genes, potential flavor factors in the cacao bean, a small portion have high-identity, likely tandem, duplicates. These duplicates are poorly annotated and/or missing in current chromosome assemblies, which complicates the search for pathways from genome to "flavorome" and the use of breeding techniques to select flavors. Gene duplicates, especially recent high-identity ones, are known to vary among populations, ecotypes and cultivars of a species, with functions that often include interactions with the environment (defense, disease resistance, reproduction, venoms, sensory/olfactory/visual).
The flow-cytometry-measured size of the Th. cacao nuclear genome averages 434 Mb (https://cvalues.science.kew.org/).
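As a rough check on the summary figures, the sketch below converts a percent deficit into megabases. It reads the 24% and 30% deficits as fractions of the 434 Mb flow-cytometry average, which the text does not confirm (the Gnodes DNA-based estimate may differ), so the results are illustrative and are not the reported assembly sizes. The function name `implied_sizes` is hypothetical.

```python
# Flow-cytometry average genome size from the text, in Mb.
FLOW_CYTOMETRY_MB = 434.0

# Deficits quoted in the summary. Treating them as fractions of the
# flow-cytometry size is an ASSUMPTION made only for this sketch.
DEFICIT = {"cacao11Matina": 0.24, "cacao19Criollo": 0.30}


def implied_sizes(genome_mb, deficit):
    """Return (implied assembled Mb, missing Mb) for a fractional deficit."""
    missing = genome_mb * deficit
    return genome_mb - missing, missing


for name, frac in DEFICIT.items():
    asm_mb, missing_mb = implied_sizes(FLOW_CYTOMETRY_MB, frac)
    print(f"{name}: ~{asm_mb:.0f} Mb assembled, ~{missing_mb:.0f} Mb missing")
```

Under this reading, the missing portion comes to a little over 100 Mb for Matina and about 130 Mb for Criollo, which is the same order as the roughly 100 Mb of duplicated DNA cited in the summary.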
Public Chr. assemblies
cacao11Matina = Matina cultivar, 2011 assembly, NCBI Genome GCA_000403535.1
cacao19Criollo = Criollo cultivar, 2019 assembly, NCBI Genome GCA_000208745.2
The coding gene set used is the NCBI gene set built on the cacao19Criollo assembly, nloci=21437