Gnodes#1 document, 2022 May
Genes ruler for genomes, Gnodes, measures assembly accuracy in animals and plants.
Author: Donald G. Gilbert; Indiana University, Bloomington, IN, USA;
DOI URL: https://doi.org/10.1101/2022.05.13.491861
Abstract
Gnodes is a Genome Depth Estimator for animal and plant genomes, and also a genome size estimator. It calculates genome sizes based on DNA coverage of assemblies, using unique, conserved gene spans for its standard depth. Results of this tool match independent flow cytometry measures of genome size quite well in tests with plants and animals. Tests on a range of model and non-model animal and plant genome assemblies give reliable and accurate results, in contrast to less reliable K-mer histogram methods. The problem of half-sized assemblies of duplication-rich Daphnia is addressed. A 20-year-old Arabidopsis genome discrepancy is resolved in favor of 157Mb, as measured with flow cytometry. Not all genome DNA samples contain a genome; examples and reasons for this are discussed. The T2T completed human genome assembly of 2022 is complete by Gnodes measures, with about 5% uncertainty. With full genome DNA, Gnodes measures within 10%, usually within 5%, of flow cytometry, indicating they are both measuring the same content.
Public URL: http://eugenes.org/EvidentialGene/other/gnodes/
---------------------------------------
Gnodes#2 document, 2024 June-Sep.
Measuring DNA contents of animal and plant genomes with Gnodes, the long and short of it.
Author: Donald G. Gilbert; Indiana University, Bloomington, IN, USA;
DOI URL: xxxxx
Abstract
Measurement of DNA contents of genomes is valuable for understanding genome biology, including assessments of genome assemblies, but it is not a trivial problem.
Measuring contents of DNA shotgun reads is complicated by several factors: biological contents of genomes at species, individual and tissue/cell levels, laboratory methods, sequencing technology, and computational processing for measurement and assembly. These complications are comparable to, and shared with, those of cytometric (Cym) and related molecular measurements of genome size and contents. There is an obvious discrepancy between cytometric measurements and current long-read genome assemblies (Asm): genome assemblies average 12% below Cym measured sizes, differing in amounts of duplicated content. This report examines five DNA read types to see if they can be used for more precise and reliable discrimination of major genome contents and sizes. The read types are short, accurate Illumina; long PacBio of low and high accuracy; and long Oxford Nanopore Tech. of low and high accuracy. Gnodes is the measurement tool used; it maps DNA to an assembly and measures DNA copy numbers for major genome contents of genes, transposons, repeats, and others, using as its measurement unit the single copies of unique conserved genes. Public data of five well-studied genomes (human, corn, zebrafish, sorghum and rice) are used as the primary evidence of this work. The results are mixed and open to interpretation. In broad terms, all DNA types measure about the same genome contents, at or below 90% agreement, a level of discrepancy to which the other complications can contribute. For precision above the 90% level, long read types differ: low accuracy reads support the larger cytometric sizes, and high accuracy reads support the smaller assembly sizes, with accurate short reads roughly between. The weight of interpreted evidence suggests that "low accuracy" long reads are un-biased, or less biased, for genome measurement, and that "high accuracy" long reads have a bias of reduced duplications introduced by computational averaging or filtering.
The several complicating factors noted can produce discrepancies larger than this average Cym - Asm difference, and are difficult to control.
---------------------------------------
Gnodes#3 document, 2023 June-Dec.
Measure of major contents in animal and plant genomes, using Gnodes, finds under-assemblies of model plant, Daphnia, fire ant and others.
Author: Donald G. Gilbert; Indiana University, Bloomington, IN, USA;
DOI URL: https://doi.org/10.1101/2023.12.20.572422
Abstract
Significant discrepancies in genome sizes measured by cytometric methods versus DNA sequence estimates are frequent, including for recent long-read DNA assemblies of plant and animal genomes. A new DNA sequence measure using a baseline of unique conserved genes, Gnodes, finds the larger cytometric measures are often accurate. DNA-informatic measures of size, as well as assembly methods, have errors in methodology that under-measure duplicated genome spans. Major contents of several model and discrepant genomes are assessed here, including human, corn, chicken, insects, crustaceans, and the model plant. Transposons dominate larger genomes, while structural repeats are often a major portion of smaller ones. Gene coding sequences are found in similar amounts across the taxonomic spread. The largest contributors to size discrepancies are higher-order repeats, but duplicated coding sequences are a significant missed content, as are transposons in some examined species. Informatics of measuring DNA and producing assemblies, including recent long-read telomere-to-telomere approaches, are subject to mistakes in operation and/or interpretation that are biased against repeats and duplications. Mistaken aspects include alignment methods that are inaccurate for high-copy duplicated spans; misclassification of true repetitive sequence as heterozygosity and artifact; software default settings that exclude high-copy DNA; and overly conservative data processing that reduces duplicated genomic spans.
Re-assemblies with balanced methods recover the missing portions of problem genomes, including the model plant, water fleas and fire ant.
---------------------------------------
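Illustrative sketch (not from the Gnodes software). The abstracts above describe estimating genome size from DNA coverage, with the depth over unique conserved gene spans as the single-copy unit, and measuring copy numbers of other genome contents against that unit. The arithmetic can be sketched as below. All function names are hypothetical, the use of a median as the standard depth is an assumption, and the numbers are toy values rather than results from any of the three documents.

```python
from statistics import median


def unique_gene_depth(gene_depths):
    """Standard single-copy depth: here assumed to be the median read depth
    over unique conserved gene spans (the choice of median is an assumption)."""
    return median(gene_depths)


def estimate_genome_size(total_read_bases, gene_depths):
    """Depth-based size estimate: total read bases / single-copy depth."""
    return total_read_bases / unique_gene_depth(gene_depths)


def copy_number(span_depth, gene_depths):
    """Copy number of a span (gene, transposon, repeat) in single-copy units."""
    return span_depth / unique_gene_depth(gene_depths)


def percent_below(assembly_size, reference_size):
    """How far an assembly falls below a reference size, in percent."""
    return 100.0 * (reference_size - assembly_size) / reference_size


# Toy values, for illustration only.
toy_gene_depths = [48, 50, 52, 50, 49]   # read depth over unique gene spans
toy_total_read_bases = 5_000_000_000     # bases of genomic DNA reads

toy_size = estimate_genome_size(toy_total_read_bases, toy_gene_depths)
toy_repeat_copies = copy_number(200, toy_gene_depths)
toy_shortfall = percent_below(88_000_000, toy_size)

print(f"estimated size: {toy_size:.0f} bases")
print(f"repeat span copies: {toy_repeat_copies:.1f}")
print(f"assembly shortfall: {toy_shortfall:.1f}%")
```

A span whose read depth is several times the unique-gene depth is read as duplicated that many times; an assembly that collapses such spans comes out smaller than the depth-based estimate.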