Gnodes/Genome Depth Estimator
2021.apr

Summary of results
--------------------
          Insects                                                       Plants
          Fruitfly 20 UC            Honeybee 19 Ha                      Arabidopsis 18 TAIR       Arabidopsis 20 Max
Part      Obs.Mb  Est.Mb  xCopy     Obs.Mb  Est.Mb  xCopy     Part      Obs.Mb  Est.Mb  xCopy     Obs.Mb  Est.Mb  xCopy
--------  ------------------------  ------------------------  --------  ------------------------  ------------------------
Flowcyto  161-180 .       .         234-264 .       .         Flowcyto  157-166 .       .         157-166 .       .
LN/C Est  .       168     .         .       267     .         LN/C Est  .       156     .         .       156     .
Totalasm  163     164     .         224     222     .         Totalasm  120     154     .         130     158     .
Measured  163     164     1.00      223     221     0.99      Measured  115     149     1.30      126     154     1.22
uniqasm   129     130     1.00      213     188     0.88      uniqasm   98      104     1.07      108     104     0.96
dupasm    34      34      1.02      9.4     33.5    3.56      dupasm    16.9    44.3    2.62      17.3    49.6    2.86
CDSann    31      33      1.04      50      56      1.12      CDSann    42      57      1.37      45      57      1.28
TEann     26      23      0.89      3.6     5.2     1.45      TEann     16.4    17.2    1.05      20      19      0.94
RPTann    .       .       .         38      47      1.24      RPTann    4.7     20.9    4.44      8.2     21.5    2.61
NOann     109     111     1.02      141     126     0.89      NOann     54      70      1.28      58      72      1.25
--------  ------------------------  ------------------------  --------  ------------------------  ------------------------
Size=LN/C C=94, N=105 Mb, L=150     C=25, N=50 Mb, L=150      Size=LN/C C=52, N=54M, L=150        C=52, N=54M, L=150

  C, C_UCG = read copy depth measured for unique conserved genes
  xCopy    = excess/deficit in read copy depth: C_part/C_UCG,
             depth at partition / depth at uniq conserved genes.
  Obs.Mb   = partition size in megabases
  Est.Mb   = estimated size: observed * xCopy

What Is Gnodes?
-----------------
Gnodes = G.no.. D.... Es........

Genome depth estimation is a critical measurement for genome size estimates from DNA
sequence data. There are many software tools for estimating genome sizes from DNA
sequences. The commonly used ones at this writing are based on K-mer shredding of DNA
to very small pieces, then counting frequency of pieces. This is a statistical method
that is rather distant from the biological evidence, where choices of K-mer size and
other options strongly influence estimates.
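
As a worked check on the summary table, the two formulas in its legend (Size = LN/C and
Est.Mb = Obs.Mb * xCopy) can be reproduced with simple arithmetic. The sketch below is
not part of Gnodes: the function names are hypothetical, and it assumes that L is the
read length in bases, N the number of reads in millions, and C the read depth at unique
conserved genes, as the Size=LN/C row suggests.

```python
def size_lnc_mb(read_len, reads_millions, depth_ucg):
    """Genome size in Mb from Size = L * N / C (N assumed to be in millions of reads)."""
    return read_len * reads_millions / depth_ucg


def est_mb(obs_mb, xcopy):
    """Estimated partition size in Mb: observed size scaled by xCopy."""
    return obs_mb * xcopy


# Fruitfly 20 UC: C=94, N=105, L=150 (table lists LN/C Est = 168)
print(round(size_lnc_mb(150, 105, 94)))    # 168
# Arabidopsis: C=52, N=54, L=150 (table lists LN/C Est = 156)
print(round(size_lnc_mb(150, 54, 52)))     # 156
# Honeybee dupasm: Obs.Mb=9.4, xCopy=3.56 (table lists Est.Mb = 33.5)
print(round(est_mb(9.4, 3.56), 1))         # 33.5
# Arabidopsis 18 TAIR dupasm: Obs.Mb=16.9, xCopy=2.62 (table lists Est.Mb = 44.3)
print(round(est_mb(16.9, 2.62), 1))        # 44.3
```

The inputs here are the rounded values printed in the table, so results computed this
way can differ slightly from tabulated entries that were derived from unrounded data.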

The Gnodes method is based on two biological or molecular-method assumptions:
  a. Molecular sequencing methods for genomes produce an even depth of DNA pieces
     from chromosomes, and
  b. Depth of coverage is measurable, with smallest error, most reliably at known
     unique genome sequence spans, such as unique conserved genes (UCG).

Gnodes measures DNA cover depth by mapping DNA pieces and tabulating read depth in
bins, like samtools pileup, but it differs in that it measures multi-mapping
explicitly. It uses gene and transposon data, the two largest and most commonly
measured genome attributes, to annotate chromosome assemblies. It tabulates depth for
several whole-genome partitions: unique and duplicated DNA spans, and coding,
transposon and simple repeat spans. Its main result is a measure of over- and
under-assembly (xCopy) relative to the standard depth for unique conserved genes.
It can also be used in detailed comparison of chromosome spans, to detect regions of
mis-assembly, both over- and under-assembly.

These methods and results are in agreement with those reported by Pflug et al 2020 [2],
comparing genome size measures of flow cytometry, read-depth and k-mer counting.

With further work, it may be used in combining the accurate portions of multiple
assemblies. Genome assemblers today produce rather different results for the same DNA
data, and with a measure of their accuracy at contig levels from Gnodes, the accurate
portions of each can be combined, much as software for removing heterozygous assembly
spans works now. Gnodes looks at both sides of this coin of over- and under-assembly,
in contrast to the heterozygosity-reducing tools, which look only at the over-assembly
side of coverage depth.

I think that Gnodes is a reliable estimator whose results are determined primarily by
the biological properties of the genome, with minor influence of computational options.
More testing will answer that.
-- Don Gilbert, 2021.Feb

References
----------
[1] Bennett, M.D. et al. (2003). Comparisons with Caenorhabditis (100Mb) and
    Drosophila (175Mb) using flow cytometry show genome size in Arabidopsis to be
    157Mb and thus 25% larger than the Arabidopsis genome initiative estimate of
    125Mb. Ann. Botany, 91:547-557; doi:10.1093/aob/mcg057
[2] Pflug, J.M., Holmes, V.R., Burrus, C., Johnston, J.S. and Maddison, D.R. (2020).
    Measuring Genome Sizes Using Read-Depth, k-mers, and Flow Cytometry:
    Methodological Comparisons in Beetles (Coleoptera). G3: Genes, Genomes, Genetics,
    10:3047-3060; doi:10.1534/g3.120.401028

Software
--------------------
evigene/scripts/genoasm/gnodes_pipe.pl and component Evigene scripts.

It requires common genome informatics components: ncbi-blast bwa-mem samtools
repeatmasker busco. Some of these may become optional later. Mapping DNA reads has
been tested with both bwa-mem and bowtie2, with similar results for this tool, but
bwa-mem is faster. gnodes_pipe needs a genome.metadata file and a gnodes_setup.sh unix
system script for your compute cluster.

Gnodes requires data inputs of
  a. Accurate genomic DNA pieces, as from current Illumina sequencers. Read sizes of
     150 bp are common now and work well; reads of 100 bp or shorter may give somewhat
     different estimates.
  b. Chromosome assembly(ies), whether contig or chromosome level. Gnodes is most
     useful for comparing assemblies from the same or related species.
  c. Gene coding sequences for the species. Gene CDS assembled from RNA independently
     of chromosome assemblies provide measures of errors in chr-assemblies, including
     missing genes, for more reliable estimates. The Evigene package has a pipeline,
     SRA2Genes, for constructing accurate and complete gene sets from RNA.
  d. Optionally, transposon sequences known for your species. So far repeatmasker,
     and optionally repeatmodeler, have been tested and used for Gnodes.
  e. Busco and Repeatmasker require data sets for your organism.
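
Before starting a run it can help to confirm that the required components listed above
are available on the PATH. The following is a minimal sketch, not part of Gnodes or
Evigene. The executable names are assumptions (installations and module systems name
these programs differently, and BUSCO in particular has shipped under more than one
script name), so adjust the list for your cluster.

```python
import shutil

# Assumed executable names for the required components; these are
# illustrative guesses, not taken from the Gnodes scripts.
ASSUMED_TOOLS = ["blastn", "bwa", "samtools", "RepeatMasker", "busco"]


def missing_tools(tools=ASSUMED_TOOLS, which=shutil.which):
    """Return the names in `tools` that `which` cannot locate on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("not found on PATH:", ", ".join(missing))
    else:
        print("all assumed tools found")
```

Run it after the `module load` line of your gnodes_setup.sh has taken effect, so the
check sees the same environment as the pipeline.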

gnodes_setup.sh for your computer cluster

## --- gnodes_setup.sh PBS version ---
#PBS -N gnodes_pipe
#PBS -l vmem=128gb,nodes=1:ppn=24,walltime=12:00:00
#PBS -V
module load blast bwa-mem samtools repeatmasker busco
## --- end gnodes_setup.sh ---

## --- gnodes_setup.sh for Slurm ---
#SBATCH --job-name="gnodes_pipe"
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH -t 12:00:00
#SBATCH --export=ALL
module load blast bwa-mem samtools repeatmasker busco
## --- end gnodes_setup.sh ---

Arabidopsis Example
--------------------
# data fetches
wget -nd -q -b https://sra-download.ncbi.nlm.nih.gov/traces/era14/ERR/ERR4586/ERR4586299

# two genome chrasm
arath18tair_chr = ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/
arath20max_chr = ftp.ncbi.nlm.nih.gov:/genomes/all/GCA/904/420/315/GCA_904420315.1_AT9943.Cdm-0.scaffold

$evigene/scripts/genoasm/gnodes_pipe.pl -title arath18t1a -chr arath18tair_chr.fa -cds arath18tair1cds.fa \
  -sumdata arath20asm.metad -ncpu 24 -maxmem 128gb -reads readsf/ERR4586299_1.fastq

$evigene/scripts/genoasm/gnodes_pipe.pl -title arath20m1a -chr arath20max_chr.fa -cds arath18tair1cds.fa \
  -sumdata arath20asm.metad -ncpu 24 -maxmem 128gb -reads readsf/ERR4586299_1.fastq

Arabidopsis gnodes metadata
--------------
arath20gnodes/arath20asm.metad
asmid=arath18tair_chr
flowcyto=157-166 Mb
asmtotal=120 Mb
asmname=Arath18TAIR
species=Arabidopsis_thaliana
buscodb=embryophyta_odb9
rmaskdb=Arabidopsis