Gnodes/Genome Depth Estimator
EvidentialGene package for genome coverage depth estimation for animal & plant genomes
Gnodes is a Genome Depth Estimator for animal and plant genomes, and also
a genome size estimator. It calculates genome size from the DNA coverage
depth of assemblies, using unique, conserved gene spans as its standard
depth. In tests with plants and animals, its results match independent
flow cytometry measures of genome size quite well.
Tests on a range of model and non-model animal and plant genome assemblies
give reliable and accurate results, in contrast to unreliable K-mer histogram methods.
Gnodes draft publication with supplemental data is now available (2022 May),
with extensive results (Abstract text and Document PDF).
See also Gnodes DNA Depth Deficit Analyses
for a synopsis of missing matter in assemblies and gene copy numbers.
Chromosome plots of DNA-Depth with major components are
useful to resolve deficits.
Also see How to Use Gnodes.
Boxplots (median, range) of Estimators, for equivalence to Flow
Cytometry (FC) measured genome sizes. Gnodes is very accurate, whereas
K-mer histogram methods (GenoScope, covest, findGSE) are rather
inaccurate, with a wide range of estimates. Assembly sizes are
typically below FC measured sizes.
Estimations relative to FC value, for measured animal and plant
genomes, with median, range and values from three estimators: Assembly,
Gnodes, and GenoScope. Flow cytometry sizes in megabases are
given, ranging from 160 Mb (plant) to 3400 Mb (human).
Genome reconstruction is a Goldilocks problem: answers are
often too hot, or too cold; the just-right solution takes effort to
discriminate among these outcomes. Gnodes provides a measuring stick
for too hot and too cold genome assemblies. When used to
compare several assemblies of one organism, it spots over- and
under-assembled portions, relative to its unique gene DNA depth
measure. It can be used to estimate genome size from only gene coding
sequences mapped with genomic DNA, and these tests show it is reliable
for that.
Gnodes is now a component of the EvidentialGene package:
evigene/scripts/genoasm/gnodes_pipe.pl
Gnodes resolves a few discrepancies, such as Daphnia water flea genome
assemblies that are only half the size measured by flow cytometry, and the
well-known 40 megabase discrepancy in Arabidopsis (Bennett et al 2003).
Extensive gene coding sequence duplication is a likely reason that assemblies
of Daphnia genomes have faltered at half-size.
Half of Daphnia genomic DNA aligns to gene coding sequence,
much more than the 10-20% in measured insects and vertebrates,
or 25% in measured plants.
Name Last modified Size
agdplots8v/ 21-Nov-2021 13:19 -
crplots/ 10-Aug-2021 12:53 -
gnodes_afuk_plots/ 13-Apr-2021 15:38 -
gnodes_cacao22measure/ 24-Oct-2022 23:04 -
gnodesdoc/ 05-Aug-2024 19:33 -
soft_evigene_gnodes_update/ 13-Sep-2023 20:59 -
soft_evigene_package/ 15-Jul-2023 16:53 -
gnodes_chrplot3a.html 10-Aug-2021 00:00 3k
gnodes_covstats_sum21feb.txt 23-Feb-2021 22:11 4k
gnodes_covstats_intro.html 12-May-2022 21:48 4k
gnodes_covstats_readme21apr.txt 12-Apr-2021 00:27 8k
gnodes_pipe_algo.txt 23-Feb-2021 22:18 9k
Gnodes_help.txt 08-Nov-2021 14:03 16k
gnodes_doc3draft.pdf 20-Dec-2023 20:17 2.6M
gnodes_doc2draft.pdf 12-May-2022 21:23 4.6M
Gnodes/Genome Depth Estimator 2021.apr
Summary of results
--------------------
Insects   Fruitfly 20 UC           Honeybee 19 Ha           Plants    Arabidopsis 18 TAIR      Arabidopsis 20 Max
Part      Obs.Mb   Est.Mb  xCopy   Obs.Mb   Est.Mb  xCopy   Part      Obs.Mb   Est.Mb  xCopy   Obs.Mb   Est.Mb  xCopy
--------  ----------------------   ----------------------   --------  ----------------------   ----------------------
Flowcyto  161-180  .       .       234-264  .       .       Flowcyto  157-166  .       .       157-166  .       .
LN/C Est  .        168     .       .        267     .       LN/C Est  .        156     .       .        156     .
Totalasm  163      164     .       224      222     .       Totalasm  120      154     .       130      158     .
Measured  163      164     1.00    223      221     0.99    Measured  115      149     1.30    126      154     1.22
uniqasm   129      130     1.00    213      188     0.88    uniqasm   98       104     1.07    108      104     0.96
dupasm    34       34      1.02    9.4      33.5    3.56    dupasm    16.9     44.3    2.62    17.3     49.6    2.86
CDSann    31       33      1.04    50       56      1.12    CDSann    42       57      1.37    45       57      1.28
TEann     26       23      0.89    3.6      5.2     1.45    TEann     16.4     17.2    1.05    20       19      0.94
RPTann    .        .       .       38       47      1.24    RPTann    4.7      20.9    4.44    8.2      21.5    2.61
NOann     109      111     1.02    141      126     0.89    NOann     54       70      1.28    58       72      1.25
--------  ----------------------   ----------------------   --------  ----------------------   ----------------------
Size=LN/C C=94, N=105 Mb, L=150    C=25, N=50 Mb, L=150     Size=LN/C C=52, N=54M, L=150       C=52, N=54M, L=150
C, C_UCG = read copy depth measured for unique conserved genes
xCopy = excess/deficit in read copy depth: C_part/C_UCG, depth at partition / depth at uniq conserved genes.
Obs.Mb = partition size in megabases, Est.Mb = estimated size: observed * xCopy
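As a worked check of the two formulas in this legend, the sketch below recomputes a few values from the summary table. It assumes that L is read length in bases and N is the number of reads in millions (the legend does not define them), so that L*N/C comes out in megabases. The function names are hypothetical and are not part of Gnodes.

```shell
#!/bin/sh
# Hypothetical helpers, not part of gnodes_pipe.pl.
# Assumption: L = read length (bases), N = reads in millions,
# C = read copy depth at unique conserved genes.

# genome_size_mb L N C : prints L*N/C, rounded to whole megabases
genome_size_mb() {
  awk -v l="$1" -v n="$2" -v c="$3" 'BEGIN { printf "%.0f\n", l * n / c }'
}

# est_mb OBS_MB XCOPY : prints observed partition size times xCopy
est_mb() {
  awk -v obs="$1" -v x="$2" 'BEGIN { printf "%.1f\n", obs * x }'
}

genome_size_mb 150 54 52    # Arabidopsis: C=52, N=54M, L=150
genome_size_mb 150 105 94   # Fruitfly: C=94, N=105, L=150
est_mb 16.9 2.62            # Arabidopsis 18 TAIR, dupasm partition
```

The three calls print 156, 168 and 44.3, which agree with the LN/C Est entries for Arabidopsis and Fruitfly and with the dupasm Est.Mb entry for Arabidopsis 18 TAIR.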
What Is Gnodes?
-----------------
Gnodes = G.no.. D.... Es........
Genome depth estimation is a critical measurement for genome size estimates from DNA sequence data. There are
many software tools for estimating genome sizes from DNA sequences. The commonly used ones at this writing are based on
K-mer shredding of DNA to very small pieces, then counting frequency of pieces. This is a statistical method that is
rather distant from the biological evidence, where choices of K-mer size and other options strongly influence estimates.
The Gnodes method is based on two biological or molecular-method assumptions: a. Molecular sequencing methods for genomes
produce an even depth of DNA pieces from chromosomes, and b. Depth of coverage is measurable, with smallest error, most
reliably at known unique genome sequence spans, such as unique conserved genes (UCG).
Gnodes measures DNA cover depth by mapping DNA pieces, tabulating read depth in bins, like samtools pileup, but different
in that it measures multi-mapping explicitly. It uses gene and transposon data, as two largest and commonly measured
genome attributes, to annotate chromosome assemblies. It tabulates depth for several whole-genome partitions, unique and
duplicated DNA spans, coding, transposon and simple repeat spans. Its main result is a measure of over- and
under-assembly (xCopy) relative to the standard depth for unique conserved genes. It can also be used in detailed
comparison of chromosome spans, to detect regions of mis-assembly, both over- and under-assembly. These methods and
results are in agreement with those reported by Pflug et al 2020, comparing genome size measures of flow cytometry,
read-depth and k-mer counting.
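The idea of tabulating read depth in bins and expressing it relative to the unique-gene depth can be sketched with a short awk filter. This is only an illustration, not the Gnodes implementation: it does no multi-map accounting, it averages over the positions present in the input, and the function name and bin size are hypothetical. The input layout (chromosome, position, depth, tab-separated) is the one printed by `samtools depth`.

```shell
#!/bin/sh
# Illustrative only: not the Gnodes implementation, and it ignores multi-mapped reads.
# Input: tab-separated lines of chromosome, 1-based position, depth.

# bin_xcopy BINSIZE C_UCG : reads depth lines on stdin, prints
# chromosome, bin start, mean depth, and xCopy = mean depth / C_UCG
bin_xcopy() {
  awk -v bin="$1" -v cucg="$2" '
    {
      b = int(($2 - 1) / bin)               # zero-based bin index
      key = $1 "\t" (b * bin + 1)           # chromosome and bin start position
      if (!(key in sum)) order[++n] = key   # remember first-seen order
      sum[key] += $3
      cnt[key]++
    }
    END {
      for (i = 1; i <= n; i++) {
        k = order[i]
        mean = sum[k] / cnt[k]              # mean over positions seen in this bin
        printf "%s\t%.1f\t%.2f\n", k, mean, mean / cucg
      }
    }'
}

# Tiny made-up input: two 2-base bins on one contig, with C_UCG = 52
printf 'chr1\t1\t50\nchr1\t2\t54\nchr1\t3\t100\nchr1\t4\t108\n' | bin_xcopy 2 52
```

With this made-up input the first bin has mean depth 52.0 and xCopy 1.00, and the second has mean depth 104.0 and xCopy 2.00, which is the pattern expected where an assembly has collapsed two copies into one span.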
With further work, it may be used in combining the accurate portions of multiple assemblies. Genome assemblers today
produce rather different results for the same DNA data, and with a measure of their accuracy at contig levels from Gnodes,
those portions of each can be combined, much like software for removing heterozygous assembly spans works now. Gnodes
looks at both sides of this coin of over- and under-assembly, in contrast to the heterozygosity-reducing tools that
look at only the over-assembly side of coverage depth.
I think that Gnodes is a reliable estimator whose results are determined primarily by the biological properties of the
genome, with minor influence of computational options. More testing will answer that.
-- Don Gilbert, 2021.Feb
References
----------
[1] Bennett,M.D. et al. (2003) Comparisons with Caenorhabditis (100Mb) and Drosophila (175Mb) using flow cytometry show
genome size in Arabidopsis to be 157Mb and thus 25% larger than the Arabidopsis genome initiative estimate of 125Mb. Ann.
Botany, 91, 547-557 doi:10.1093/aob/mcg057
[2] Pflug,J.M., Holmes,V.R., Burrus,C., Johnston,J.S. and Maddison,D.R. (2020) Measuring Genome Sizes Using Read-Depth,
k-mers, and Flow Cytometry: Methodological Comparisons in Beetles (Coleoptera). G3: Genes, Genomes, Genetics, 10,
3047-3060 doi:10.1534/g3.120.401028
Software
--------------------
evigene/scripts/genoasm/gnodes_pipe.pl and component Evigene scripts.
It requires common genome informatics components: ncbi-blast bwa-mem samtools repeatmasker busco. Some of these may become
optional later. Mapping DNA reads has been tested with both bwa-mem and bowtie2, with similar results for this tool, but
bwa-mem is faster.
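Before running the pipeline it can help to confirm that the required programs are on your PATH. The sketch below is a generic check, not part of Evigene. The executable names passed to it (blastn, bwa, samtools, RepeatMasker, busco) are assumptions about how these packages are commonly installed, and may differ on your system.

```shell
#!/bin/sh
# check_tools NAME... : report any named executable missing from PATH.
# Returns 0 if all are found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Assumed executable names; adjust to your installation.
check_tools blastn bwa samtools RepeatMasker busco ||
  echo "install or module-load the missing tools"
```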
gnodes_pipe needs a genome.metadata file and gnodes_setup.sh unix system script for your compute cluster.
Gnodes requires data inputs of:
a. Accurate genomic DNA pieces, as from current Illumina sequencers. Read sizes of 150 bp are common now and work well;
reads of 100bp or lower may result in some different estimates.
b. Chromosome assembly(ies), whether contig or chromosome level. Gnodes is most useful comparing assemblies, from same or
related species.
c. Gene coding sequences for species. Gene CDS assembled from RNA independently of chromosome assemblies provide measures
of errors in chr-assemblies for more reliable estimates, including missing genes. The Evigene package has a pipeline,
SRA2Genes, for constructing accurate and complete gene sets from RNA.
d. Optionally, transposon sequences known for your species. Now repeatmasker, optionally repeatmodeler, have been tested
and used for Gnodes.
e. Busco and Repeatmasker require data sets for your organism.
gnodes_setup.sh for your computer cluster
## --- gnodes_setup.sh PBS version ---
#PBS -N gnodes_pipe
#PBS -l vmem=128gb,nodes=1:ppn=24,walltime=12:00:00
#PBS -V
module load blast bwa-mem samtools repeatmasker busco
## --- end gnodes_setup.sh ---
## --- gnodes_setup.sh for Slurm ---
#SBATCH --job-name="gnodes_pipe"
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH -t 12:00:00
#SBATCH --export=ALL
module load blast bwa-mem samtools repeatmasker busco
## --- end gnodes_setup.sh ---
Arabidopsis Example
--------------------
# data fetches
wget -nd -q -b https://sra-download.ncbi.nlm.nih.gov/traces/era14/ERR/ERR4586/ERR4586299
# two genome chrasm (chromosome assemblies)
# arath18tair_chr = ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/
# arath20max_chr = ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/904/420/315/GCA_904420315.1_AT9943.Cdm-0.scaffold
$evigene/scripts/genoasm/gnodes_pipe.pl -title arath18t1a -chr arath18tair_chr.fa -cds arath18tair1cds.fa \
-sumdata arath20asm.metad -ncpu 24 -maxmem 128gb -reads readsf/ERR4586299_1.fastq
$evigene/scripts/genoasm/gnodes_pipe.pl -title arath20m1a -chr arath20max_chr.fa -cds arath18tair1cds.fa \
-sumdata arath20asm.metad -ncpu 24 -maxmem 128gb -reads readsf/ERR4586299_1.fastq
Arabidopsis gnodes metadata
--------------
arath20gnodes/arath20asm.metad
asmid=arath18tair_chr
flowcyto=157-166 Mb
asmtotal=120 Mb
asmname=Arath18TAIR
species=Arabidopsis_thaliana
buscodb=embryophyta_odb9
rmaskdb=Arabidopsis
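The metadata file is a plain list of key=value lines, so individual values are easy to read back in a wrapper script. The helper below is a hypothetical sketch for that purpose; it is not taken from gnodes_pipe.pl, which reads the file itself.

```shell
#!/bin/sh
# metad_get FILE KEY : print the value of the first KEY=value line in FILE.
# Hypothetical helper for wrapper scripts; not part of gnodes_pipe.pl.
metad_get() {
  awk -F= -v key="$2" '$1 == key { print substr($0, length(key) + 2); exit }' "$1"
}

# Example with a few lines copied from the metadata above
metad=$(mktemp)
cat > "$metad" <<'EOF'
asmid=arath18tair_chr
flowcyto=157-166 Mb
asmtotal=120 Mb
EOF
metad_get "$metad" flowcyto
rm -f "$metad"
```

The example call prints `157-166 Mb`. A key that is absent from the file prints nothing.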