Gnodes/Genome Depth Estimator
Gnodes is a Genome Depth Estimator for animal and plant genomes, also
a genome size estimator. It calculates genome sizes based on DNA
coverage of assemblies, using unique, conserved gene spans for its
standard depth. In tests on a range of model and non-model animal and plant
genome assemblies, its results closely match independent flow cytometry
measures of genome size, in contrast to unreliable K-mer histogram methods.
Boxplots (median, range) of Estimators, for equivalence to Flow Cytometry (FC) measured genome sizes. Gnodes is very accurate, whereas K-mer histogram methods (GenoScope, covest, findGSE) are rather inaccurate, with a wide range of estimates. Assembly sizes are typically below FC measured sizes.
Estimations relative to FC value, for measured animal and plant genomes, with median, range and values from three estimators: Assembly, Gnodes, and GenoScope. Flow cytometry sizes in megabases are given, ranging from 160 Mb (plant) to 3400 Mb (human).
Genome reconstruction is a Goldilocks problem: answers are
often too hot, or too cold; the just-right solution takes effort to
discriminate among these outcomes. Gnodes provides a measuring stick
for too hot and too cold genome assemblies. When used to
compare several assemblies of one organism, it spots over- and
under-assembled portions, relative to its unique gene DNA depth
measure. It can be used to estimate genome size from only gene coding
sequences mapped with genomic DNA, and these tests show it is reliable.
Gnodes is now a component of the EvidentialGene package.
Gnodes resolves a few discrepancies, such as Daphnia water flea genome assemblies that are only half the size measured by flow cytometry, and the well-known 40 megabase discrepancy in Arabidopsis (Bennett et al 2003).
Extensive gene coding sequence duplication is a likely reason that assemblies of Daphnia genomes have faltered at half-size. Half of Daphnia genomic DNA aligns to gene coding sequence, much more than the 10-20% in measured insects and vertebrates, or 25% in measured plants.
Gnodes/Genome Depth Estimator 2021.apr

Summary of results
--------------------

Insects    Fruitfly 20 UC            Honeybee 19 Ha
Part       Obs.Mb   Est.Mb  xCopy    Obs.Mb   Est.Mb  xCopy
--------   ----------------------    ----------------------
Flowcyto   161-180  .       .        234-264  .       .
LN/C Est   .        168     .        .        267     .
Totalasm   163      164     .        224      222     .
Measured   163      164     1.00     223      221     0.99
uniqasm    129      130     1.00     213      188     0.88
dupasm     34       34      1.02     9.4      33.5    3.56
CDSann     31       33      1.04     50       56      1.12
TEann      26       23      0.89     3.6      5.2     1.45
RPTann     .        .       .        38       47      1.24
NOann      109      111     1.02     141      126     0.89
--------   ----------------------    ----------------------
Size=LN/C  C=94, N=105 Mb, L=150     C=25, N=50 Mb, L=150

Plants     Arabidopsis 18 TAIR       Arabidopsis 20 Max
Part       Obs.Mb   Est.Mb  xCopy    Obs.Mb   Est.Mb  xCopy
--------   ----------------------    ----------------------
Flowcyto   157-166  .       .        157-166  .       .
LN/C Est   .        156     .        .        156     .
Totalasm   120      154     .        130      158     .
Measured   115      149     1.30     126      154     1.22
uniqasm    98       104     1.07     108      104     0.96
dupasm     16.9     44.3    2.62     17.3     49.6    2.86
CDSann     42       57      1.37     45       57      1.28
TEann      16.4     17.2    1.05     20       19      0.94
RPTann     4.7      20.9    4.44     8.2      21.5    2.61
NOann      54       70      1.28     58       72      1.25
--------   ----------------------    ----------------------
Size=LN/C  C=52, N=54M, L=150        C=52, N=54M, L=150

  C, C_UCG = read copy depth measured for unique conserved genes
  xCopy    = excess/deficit in read copy depth: C_part/C_UCG,
             depth at partition / depth at uniq conserved genes.
  Obs.Mb   = partition size in megabases
  Est.Mb   = estimated size: observed * xCopy

What Is Gnodes?
-----------------
Gnodes = G.no.. D.... Es........

Genome depth estimation is a critical measurement for genome size estimates from DNA sequence data. There are many software tools for estimating genome sizes from DNA sequences. The commonly used ones at this writing are based on K-mer shredding of DNA to very small pieces, then counting frequency of pieces. This is a statistical method that is rather distant from the biological evidence, where choices of K-mer size and other options strongly influence estimates.
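
The arithmetic in the summary table legend above (Size = L*N/C, and Est.Mb = Obs.Mb * xCopy) can be checked with a short sketch. This is only an illustration, not code from the Gnodes pipeline: the function names are hypothetical, and it assumes that N in the table footer is the number of reads in millions and L the read length in bases, so that L*N/C comes out in megabases.

```python
# Illustrative sketch only; function names are hypothetical, not part of Gnodes.
# Assumption: N is given in millions of reads, L in bases, C is the read depth
# measured at unique conserved genes (C_UCG), so L*N/C is in megabases.

def size_from_depth(read_len, n_reads_million, depth_ucg):
    """Genome size estimate in Mb, Size = L*N/C."""
    return read_len * n_reads_million / depth_ucg

def xcopy(depth_part, depth_ucg):
    """Excess/deficit in read copy depth of a partition: C_part / C_UCG."""
    return depth_part / depth_ucg

def est_mb(obs_mb, xcopy_value):
    """Estimated partition size: observed size * xCopy."""
    return obs_mb * xcopy_value

# Values taken from the summary table footer and rows.
fruitfly_mb = size_from_depth(150, 105, 94)      # table lists LN/C Est = 168
arabidopsis_mb = size_from_depth(150, 54, 52)    # table lists LN/C Est = 156
honeybee_dup_mb = est_mb(9.4, 3.56)              # table lists dupasm Est.Mb = 33.5

print(round(fruitfly_mb), round(arabidopsis_mb), round(honeybee_dup_mb, 1))
```

Because C and N are rounded in the table footer, recomputed values may differ slightly from the listed estimates for some species.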
The Gnodes method is based on two biological or molecular-method assumptions:
  a. Molecular sequencing methods for genomes produce an even depth of DNA pieces
     from chromosomes, and
  b. Depth of coverage is measurable, with smallest error, most reliably at known
     unique genome sequence spans, such as unique conserved genes (UCG).

Gnodes measures DNA cover depth by mapping DNA pieces, tabulating read depth in bins, like samtools pileup, but different in that it measures multi-mapping explicitly. It uses gene and transposon data, as the two largest and most commonly measured genome attributes, to annotate chromosome assemblies. It tabulates depth for several whole-genome partitions: unique and duplicated DNA spans, and coding, transposon and simple repeat spans. Its main result is a measure of over- and under-assembly (xCopy) relative to the standard depth for unique conserved genes. It can also be used in detailed comparison of chromosome spans, to detect regions of mis-assembly, both over- and under-assembly.

These methods and results are in agreement with those reported by Pflug et al 2020, comparing genome size measures of flow cytometry, read-depth and k-mer counting.

With further work, it may be used in combining the accurate portions of multiple assemblies. Genome assemblers today produce rather different results for the same DNA data, and with a measure of their accuracy at contig levels from Gnodes, those portions of each can be combined, much like software for removing heterozygotic assembly spans works now. Gnodes looks at both sides of this coin of over- and under-assembly, in contrast to the heterozygosity reducing tools that look at only the over-assembly side of coverage depth.

I think that Gnodes is a reliable estimator whose results are determined primarily by the biological properties of the genome, with minor influence of computational options. More testing will answer that.
-- Don Gilbert, 2021.Feb

References
----------
Bennett, M.D. et al.
(2003) Comparisons with Caenorhabditis (100Mb) and Drosophila (175Mb) using flow cytometry show genome size in Arabidopsis to be 157Mb and thus 25% larger than the Arabidopsis genome initiative estimate of 125Mb. Ann. Botany, 91, 547-557 doi:10.1093/aob/mcg057

Pflug JM, V R Holmes, C Burrus, JS Johnston, and DR Maddison (2020). Measuring Genome Sizes Using Read-Depth, k-mers, and Flow Cytometry: Methodological Comparisons in Beetles (Coleoptera). G3: Genes, Genomes, Genetics, 10:3047-3060; doi: https://doi.org/10.1534/g3.120.401028

Software
--------------------
evigene/scripts/genoasm/gnodes_pipe.pl and component Evigene scripts.

It requires common genome informatics components: ncbi-blast bwa-mem samtools repeatmasker busco. Some of these may become optional later. Mapping DNA reads has been tested with both bwa-mem and bowtie2, with similar results for this tool, but bwa-mem is faster.

gnodes_pipe needs a genome.metadata file and a gnodes_setup.sh unix system script for your compute cluster.

Gnodes requires data inputs of
  a. Accurate genomic DNA pieces, as from current Illumina sequencers. Read sizes of
     150 bp are common now and work well; reads of 100 bp or lower may result in
     somewhat different estimates.
  b. Chromosome assembly(ies), whether contig or chromosome level. Gnodes is most
     useful comparing assemblies, from same or related species.
  c. Gene coding sequences for species. Gene CDS assembled from RNA independently of
     chromosome assemblies provide measures of errors in chr-assemblies for more
     reliable estimates, including missing genes. The Evigene package has a pipeline,
     SRA2Genes, for constructing accurate and complete gene sets from RNA.
  d. Optionally, transposon sequences known for your species. So far repeatmasker,
     and optionally repeatmodeler, have been tested and used for Gnodes.
  e. Busco and Repeatmasker require data sets for your organism.
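
As a toy illustration of what is done with these inputs, the sketch below tabulates read depth in fixed-size bins and keeps a separate tally for multi-mapped reads, in the spirit of the method description earlier in this document. It is not the gnodes_pipe implementation. The function name, the bin size, and the choice to split a read's bases evenly among all of its mapped locations are assumptions made only for this example.

```python
from collections import defaultdict

# Hypothetical sketch, not Gnodes code. Assumptions: each alignment is a
# (read_id, contig, start) tuple, all of a read's bases are credited to the bin
# holding its start, and a read mapped to n locations gives 1/n to each.

def tabulate_depth(alignments, read_len=150, bin_size=100):
    """Return {(contig, bin_index): {"total": depth, "multi": depth}}."""
    locations = defaultdict(list)
    for read_id, contig, start in alignments:
        locations[read_id].append((contig, start))

    bins = defaultdict(lambda: {"total": 0.0, "multi": 0.0})
    for read_id, locs in locations.items():
        # Depth credited to each location of this read.
        share = read_len / len(locs) / bin_size
        for contig, start in locs:
            cell = bins[(contig, start // bin_size)]
            cell["total"] += share
            if len(locs) > 1:
                cell["multi"] += share   # explicit multi-mapping tally
    return dict(bins)

# Three reads: r1 and r2 map uniquely, r3 maps to two places on chr1.
toy = [("r1", "chr1", 10), ("r2", "chr1", 40),
       ("r3", "chr1", 120), ("r3", "chr1", 930)]
for key, cell in sorted(tabulate_depth(toy).items()):
    print(key, cell)
```

Dividing the depth of a bin or partition by the depth found over unique conserved gene spans would then give an xCopy-like ratio for that span.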
gnodes_setup.sh for your computer cluster

  ## --- gnodes_setup.sh PBS version ---
  #PBS -N gnodes_pipe
  #PBS -l vmem=128gb,nodes=1:ppn=24,walltime=12:00:00
  #PBS -V
  module load blast bwa-mem samtools repeatmasker busco
  ## --- end gnodes_setup.sh ---

  ## --- gnodes_setup.sh for Slurm ---
  #SBATCH --job-name="gnodes_pipe"
  #SBATCH --partition=compute
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=24
  #SBATCH -t 12:00:00
  #SBATCH --export=ALL
  module load blast bwa-mem samtools repeatmasker busco
  ## --- end gnodes_setup.sh ---

Arabidopsis Example
--------------------
  # data fetches
  wget -nd -q -b https://sra-download.ncbi.nlm.nih.gov/traces/era14/ERR/ERR4586/ERR4586299

  # two genome chrasm
  arath18tair_chr = ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/
  arath20max_chr  = ftp.ncbi.nlm.nih.gov:/genomes/all/GCA/904/420/315/GCA_904420315.1_AT9943.Cdm-0.scaffold

  $evigene/scripts/genoasm/gnodes_pipe.pl -title arath18t1a -chr arath18tair_chr.fa -cds arath18tair1cds.fa \
    -sumdata arath20asm.metad -ncpu 24 -maxmem 128gb -reads readsf/ERR4586299_1.fastq

  $evigene/scripts/genoasm/gnodes_pipe.pl -title arath20m1a -chr arath20max_chr.fa -cds arath18tair1cds.fa \
    -sumdata arath20asm.metad -ncpu 24 -maxmem 128gb -reads readsf/ERR4586299_1.fastq

Arabidopsis gnodes metadata
--------------
  arath20gnodes/arath20asm.metad

  asmid=arath18tair_chr
  flowcyto=157-166 Mb
  asmtotal=120 Mb
  asmname=Arath18TAIR
  species=Arabidopsis_thaliana
  buscodb=embryophyta_odb9
  rmaskdb=Arabidopsis
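
The metadata shown above is a set of key=value pairs. A minimal sketch of reading such a record and comparing the assembly total against the flow cytometry range follows. The parser and helper names are hypothetical, and the format accepted by gnodes_pipe.pl may allow more than this sketch handles (for example, several asmid records in one file).

```python
# Hypothetical helpers for illustration; not part of the Evigene/Gnodes package.

def parse_metadata(text):
    """Collect key=value lines into a dict, skipping blanks and other lines."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        meta[key.strip()] = value.strip()
    return meta

def parse_mb_range(value):
    """Turn '157-166 Mb' into (157.0, 166.0) and '120 Mb' into (120.0, 120.0)."""
    number = value.split()[0]
    low, _, high = number.partition("-")
    return float(low), float(high or low)

example = """\
asmid=arath18tair_chr
flowcyto=157-166 Mb
asmtotal=120 Mb
asmname=Arath18TAIR
species=Arabidopsis_thaliana
buscodb=embryophyta_odb9
rmaskdb=Arabidopsis
"""

meta = parse_metadata(example)
fc_low, fc_high = parse_mb_range(meta["flowcyto"])
asm_mb, _ = parse_mb_range(meta["asmtotal"])
print(f"{meta['asmname']}: assembly total is {asm_mb / fc_low:.0%} "
      f"of the lower flow cytometry bound ({fc_low:.0f} Mb)")
```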