Gnodes analyses of Theobroma cacao genes and chromosome assemblies:
Counting Cacao flavor genes, and Theobroma cacao genome "completeness", with comments on a 2021 report
2022.Oct, Don Gilbert,

Th. cacao chromosome assembly completeness

Both public cacao chromosome assemblies are missing duplicated sequence: they fall 24% (2011 Matina) to 30% (2019 Criollo) below the nuclear genome size, as measured with flow cytometry and with DNA samples (Gnodes).  Duplicated DNA, at 100 Mb, accounts for most of the genome missing from the assemblies, and this is mostly un-annotated sequence (NOann in this analysis: neither annotated genic CDS, nor transposons, nor simple repeats).  However, the DNA evidence indicates that some gene-bearing segments are also missing, likely tandemly duplicated spans, as in the example below of esterases with putative flavor effects.  This contrasts with some other plants analyzed with Gnodes: e.g. the Arabidopsis assembly is missing simple repeats in telomeric and centromeric sections, and the Zea mays assembly is missing transposons.
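
The arithmetic behind these figures can be sketched as follows. This is a rough check, not part of the Gnodes analysis: it assumes the 24% and 30% deficits are taken relative to the 434 Mb flow cytometry size quoted in this report, and that the 100 Mb of missing duplicated DNA applies to each assembly. The function names are illustrative.

```python
# Figures quoted in this report.
GENOME_MB = 434.0              # flow cytometry average genome size
MISSING_DUPLICATED_MB = 100.0  # duplicated DNA missing from assemblies
DEFICITS = {"cacao11Matina": 0.24, "cacao19Criollo": 0.30}


def missing_mb(deficit, genome_mb=GENOME_MB):
    """Megabases absent from an assembly, given its fractional deficit.

    Assumption: the deficit is a fraction of the flow cytometry size.
    """
    return deficit * genome_mb


def duplicated_share(deficit, dup_mb=MISSING_DUPLICATED_MB, genome_mb=GENOME_MB):
    """Share of the missing sequence that duplicated DNA would explain.

    Assumption: the same dup_mb figure applies to each assembly.
    """
    return dup_mb / missing_mb(deficit, genome_mb)


if __name__ == "__main__":
    for name, deficit in DEFICITS.items():
        print(f"{name}: {missing_mb(deficit):.0f} Mb missing, "
              f"{duplicated_share(deficit):.0%} of it duplicated DNA")
```

Under these assumptions the missing sequence is roughly 104 to 130 Mb, and 100 Mb of duplicated DNA would be about 77% to 96% of it, which is in line with "most of the genome missing".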

The NCBI gene set used, with 21437 coding loci, under-represents the total number of coding loci, according to DNA depth analysis.  There are 1670 annotated loci with DNA copy number > 1.8, which represent 9500 to 12500 loci (including pseudo- and partial gene loci; 35 have 100+ copies, 400 have 10+ copies).  Some of these duplicates are located on the chromosome assemblies but unannotated; others are missing from the assemblies.  The Cacao genome project of 2011, for the Matina cultivar, annotated 29,400 coding loci, including more duplicates.
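
A minimal sketch of how these counts relate to each other is given below. The locus counts and the 1.8 threshold are from this report. The depth-ratio estimator (gene read depth divided by the depth of single-copy genes) is an assumption about how a depth-based copy number could be computed, not a description of the Gnodes implementation, and the implied total assumes every locus at or below the threshold counts once. All function names are hypothetical.

```python
ANNOTATED_LOCI = 21437            # NCBI coding loci, from this report
DUPLICATED_LOCI = 1670            # annotated loci with copy number > 1.8
REPRESENTED_LOCI = (9500, 12500)  # loci those duplicated genes represent
CN_THRESHOLD = 1.8                # copy number cutoff, from this report


def copy_number(gene_depth, single_copy_depth):
    """Assumed estimator: read depth relative to single-copy gene depth."""
    if single_copy_depth <= 0:
        raise ValueError("single_copy_depth must be positive")
    return gene_depth / single_copy_depth


def is_duplicated(cn, threshold=CN_THRESHOLD):
    """True when the copy number exceeds the report's cutoff."""
    return cn > threshold


def mean_copies(represented, duplicated=DUPLICATED_LOCI):
    """Average copies per duplicated annotated locus."""
    return represented / duplicated


def implied_total_loci(represented):
    """Total loci if non-duplicated annotated loci each count once (assumption)."""
    return ANNOTATED_LOCI - DUPLICATED_LOCI + represented


if __name__ == "__main__":
    for represented in REPRESENTED_LOCI:
        print(f"{represented} represented loci: "
              f"{mean_copies(represented):.1f} copies per duplicated locus, "
              f"{implied_total_loci(represented)} loci in total")
```

With these assumptions the duplicated loci average about 5.7 to 7.5 copies each, and the implied total is 29267 to 32267 loci. That total includes pseudo- and partial gene loci, so it is not directly comparable with a count of annotated coding loci.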

For esterase and hydrolase genes, potential flavor factors in the cacao bean, a small portion have high-identity, likely tandem, duplicates.  These duplicates are poorly annotated and/or missing in current chromosome assemblies, which complicates the search for pathways from genome to "flavorome" and the use of breeding techniques to select flavors.  Gene duplicates, especially recent high-identity ones, are known to vary among populations, ecotypes and cultivars of a species, with functions that often include interactions with the environment (defense, disease resistance, reproduction, venoms, sensory/olfactory/visual).

Flow cytometry measured size of Th. cacao nuclear genome is 434 Mb average (
Public Chr. assemblies
  cacao11Matina = Matina cultivar, 2011 assembly, NCBI Genome GCA_000403535.1
  cacao19Criollo = Criollo cultivar, 2019 assembly, NCBI Genome GCA_000208745.2
Coding gene set used is the NCBI gene set built on the cacao19Criollo assembly, nloci=21437

