Gnodes, Genome measurement for animal and plant genomes
Documents
Gnodes#1 document pdf, 2022 May
Genes ruler for genomes, Gnodes, measures assembly accuracy in animals and plants.
DOI: 10.1101/2022.05.13.491861
Gnodes#2 document pdf, 2024 June-Sep.
Measuring DNA contents of animal and plant genomes with Gnodes, the long and short of it.
DOI: not yet assigned.
Gnodes#3 document pdf, 2023 June-Dec.
Measure of major contents in animal and plant genomes, using Gnodes, finds under-assemblies of model plant, Daphnia, fire ant and others.
DOI: 10.1101/2023.12.20.572422
Part of EvidentialGene project
Name                               Last modified      Size
Parent Directory                   12-Aug-2024 14:25  -
Gnodes1doc.pdf                     12-May-2022 21:23  4.6M
Gnodes3doc.pdf                     20-Dec-2023 20:17  2.6M
Gnodes2doc.pdf                     06-Oct-2024 15:06  1.5M
Gnodes123_abstracts.txt            05-Aug-2024 15:58  6k
2023_Holiday_Genomics_Puzzle.html  18-Dec-2023 22:00  1k
hdr2024.html                       05-Aug-2024 19:33  1k
gnodes_newasm23/                   13-Oct-2023 15:14  -
gnodes_newasm/                     08-May-2022 15:33  -
gnodes_doctabs/                    12-May-2022 17:30  -
gnodes_docsup/                     11-May-2022 21:38  -
gnodes_doc3tables/                 18-Dec-2023 21:47  -
gnodes2docf/                       06-Oct-2024 15:07  -
Gnodes#1 document, 2022 May
Genes ruler for genomes, Gnodes, measures assembly accuracy in animals and plants.
Author: Donald G. Gilbert; Indiana University, Bloomington, IN, USA;
DOI URL: https://doi.org/10.1101/2022.05.13.491861
Abstract
Gnodes is a Genome Depth Estimator for animal and plant genomes, also
a genome size estimator. It calculates genome sizes based on DNA
coverage of assemblies, using unique, conserved gene spans for its
standard depth. Results of this tool match the independent measures
from flow cytometry of genome size quite well in tests with plants and
animals. Tests on a range of model and non-model animal and plant
genome assemblies give reliable and accurate results, in contrast to
less reliable K-mer histogram methods. The problem of half-sized
assemblies of duplication-rich Daphnia is addressed. A 20-year-old
Arabidopsis genome discrepancy is resolved in favor of 157 Mb, as
measured with flow cytometry. Not all genome DNA samples contain a
genome; examples and reasons for this are discussed. The T2T completed
human genome assembly of 2022 is complete by Gnodes measures, with
about 5% uncertainty. With full genome DNA, Gnodes measures within
10%, usually within 5%, of flow cytometry, indicating they are both
measuring the same content. Public URL:
http://eugenes.org/EvidentialGene/other/gnodes/
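
The depth-based size calculation described in this abstract can be
illustrated with a minimal sketch. It is not the Gnodes code: the function
names, the use of a median over unique-gene depths, and the example numbers
are all assumptions made here for illustration.

```python
from statistics import median


def estimate_genome_size(total_read_bases, unique_gene_depths):
    """Genome size estimate: total read bases / single-copy depth.

    Taking the median of unique-gene depths as the standard depth is an
    assumption of this sketch, not a documented Gnodes detail.
    """
    if not unique_gene_depths:
        raise ValueError("need at least one unique-gene depth value")
    unit_depth = median(unique_gene_depths)
    if unit_depth <= 0:
        raise ValueError("single-copy depth must be positive")
    return total_read_bases / unit_depth


def percent_difference(estimate, reference):
    """Signed difference of an estimate from a reference size, in percent."""
    return 100.0 * (estimate - reference) / reference


# Made-up illustration values, not data from the Gnodes documents.
example_depths = [29.0, 30.0, 31.0, 30.5, 29.5]
example_size = estimate_genome_size(3_000_000_000, example_depths)
```

With these made-up inputs the standard depth is 30 and the estimate is
100 Mb. percent_difference() expresses the kind of comparison the abstract
makes between a sequence-based estimate and a flow cytometry size.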
---------------------------------------
Gnodes#2 document, 2024 June-Sep.
Measuring DNA contents of animal and plant genomes with Gnodes,
the long and short of it.
Author: Donald G. Gilbert; Indiana University, Bloomington, IN, USA;
DOI URL: not yet assigned
Abstract
Measurement of DNA contents of genomes is valuable for
understanding genome biology, including assessments of genome
assemblies, but it is not a trivial problem. Measuring contents
of DNA shotgun reads is complicated by several factors: biological
contents of genomes at species, individual and tissue/cell levels,
laboratory methods, sequencing technology and computational
processing for measurement and assembly. This measurement compares
with, and shares complications with, cytometric (Cym) and related
molecular measurements of genome size and contents.
There is an obvious discrepancy between cytometric measurements
and current long-read genome assemblies (Asm): genome assemblies
average 12% below Cym measured sizes, differing in amounts of
duplicated content. This report examines five DNA read types to
see if they can be used for more precise and reliable
discrimination of major genome contents and sizes. The read types
are short, accurate Illumina, long PacBio, of low and high
accuracy, and long Oxford Nanopore Tech. of low and high accuracy.
Gnodes is the measurement tool used, which maps DNA to assembly,
and measures DNA copy numbers for major genome contents of genes,
transposons, repeats, and others, using as a measurement unit the
single copies of unique conserved genes. Public data of five well
studied genomes, human, corn, zebrafish, sorghum and rice, are
used for the primary evidence of this work.
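
As a rough illustration of using single-copy unique genes as the
measurement unit, the sketch below divides the mapped read depth of each
content class by the unique-gene depth to get a copy number. The function
names, class labels and depth values are hypothetical and are not taken
from Gnodes or its data.

```python
def copy_numbers(class_depths, unit_depth):
    """Copy number per content class, in units of single-copy gene depth."""
    if unit_depth <= 0:
        raise ValueError("unit depth must be positive")
    return {name: depth / unit_depth for name, depth in class_depths.items()}


def measured_span(assembled_span, copy_number):
    """Span implied by the reads: assembled span scaled by its copy number."""
    return assembled_span * copy_number


# Hypothetical mean read depths per content class (made-up values).
class_depth_example = {"genes": 30.0, "transposons": 45.0, "repeats": 60.0}
example_copies = copy_numbers(class_depth_example, unit_depth=30.0)
```

A copy number above 1 for a class would suggest that the reads hold more
of that content than the assembly does, which is the kind of duplicated
content difference the abstract discusses.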
Results of this are mixed and open to interpretation. In broad
terms, all DNA types measure about the same genome contents, at or
below 90% agreement, a level at which the other complications can
contribute to differences. For precision above a 90% level, long read types
differ in supporting larger cytometric sizes (low accuracy reads),
and smaller assembly sizes (high accuracy reads), with accurate
short reads roughly between. The weight of interpreted evidence
suggests that "low accuracy" long reads are unbiased, or less
biased, for genome measurement, and that "high accuracy" long reads
have a bias toward reduced duplications, introduced by computational
averaging or filtering. The several complicating factors noted
can produce discrepancies larger than this average Cym - Asm
difference, and are difficult to control.
---------------------------------------
Gnodes#3 document, 2023 June-Dec.
Measure of major contents in animal and plant genomes, using Gnodes,
finds under-assemblies of model plant, Daphnia, fire ant and others.
Author: Donald G. Gilbert; Indiana University, Bloomington, IN, USA;
DOI URL: https://doi.org/10.1101/2023.12.20.572422
Abstract
Significant discrepancies in genome sizes measured by cytometric
methods versus DNA sequence estimates are frequent, including for recent
long-read DNA assemblies of plant and animal genomes. A new DNA
sequence measure using a baseline of unique conserved genes, Gnodes,
finds the larger cytometric measures are often accurate.
DNA-informatic measures of size, as well as assembly methods, have
errors in methodology that under-measure duplicated genome spans.
Major contents of several model and discrepant genomes are assessed
here, including human, corn, chicken, insects, crustaceans, and the
model plant. Transposons dominate larger genomes, while structural
repeats are often a major portion of smaller ones. Gene coding
sequences are found in similar amounts across the taxonomic spread.
The largest contributors to size discrepancies are higher-order
repeats, but duplicated coding sequences are a significant missed
content, as are transposons in some examined species.
Informatics of measuring DNA and producing assemblies, including
recent long-read telomere to telomere approaches, are subject to
mistakes in operation and/or interpretation that are biased against
repeats and duplications. Mistaken aspects include alignment methods
that are inaccurate for high-copy duplicated spans; misclassification
of true repetitive sequence as heterozygosity and artifact; software
default settings that exclude high-copy DNA; and overly conservative
data processing that reduces duplicated genomic spans. Re-assemblies
with balanced methods recover the missing portions of problem genomes
including model plant, water fleas and fire ant.
---------------------------------------