
Gnodes/Genome Depth Estimator
EvidentialGene package for genome coverage depth estimation for animal & plant genomes

Gnodes is a Genome Depth Estimator for animal and plant genomes, and also a genome size estimator. It calculates genome sizes from the DNA coverage depth of assemblies, using unique, conserved gene spans as its standard depth. In tests on a range of model and non-model animal and plant genome assemblies, its results match independent flow cytometry measures of genome size quite well, in contrast to unreliable K-mer histogram methods.


Boxplots (median, range) of Estimators, for equivalence to Flow Cytometry (FC) measured genome sizes. Gnodes is very accurate, whereas K-mer histogram methods (GenoScope, covest, findGSE) are rather inaccurate, with a wide range of estimates. Assembly sizes are typically below FC measured sizes.

Estimations relative to FC value, for measured animal and plant genomes, with median, range and values from three estimators: Assembly, Gnodes, and GenoScope. Flow cytometry sizes in megabases are given, ranging from 160 Mb (plant) to 3400 Mb (human).

Genome reconstruction is a Goldilocks problem: answers are often too hot, or too cold; the just-right solution takes effort to discriminate among these outcomes. Gnodes provides a measuring stick for too hot and too cold genome assemblies. When used to compare several assemblies of one organism, it spots over- and under-assembled portions, relative to its unique gene DNA depth measure. It can be used to estimate genome size from only gene coding sequences mapped with genomic DNA, and these tests show it is reliable for that. Gnodes is now a component of the EvidentialGene package: evigene/scripts/genoasm/gnodes_pipe.pl

Gnodes resolves a few discrepancies, such as Daphnia water flea genome assemblies that are only half the size measured by flow cytometry, and the well-known 40 megabase discrepancy in Arabidopsis (Bennett et al 2003).

Extensive gene coding sequence duplication is a likely reason that assemblies of Daphnia genomes have faltered at half-size. Half of Daphnia genomic DNA aligns to gene coding sequences, much more than the 10-20% of measured insects and vertebrates, or 25% in measured plants.

      Name                              Last modified       Size

[DIR] Parent Directory                  23-Feb-2021 22:12   -
[DIR] arath20gnodes/                    25-Feb-2021 11:49   -
[DIR] gnodes_afuk_plots/                13-Apr-2021 15:38   -
[TXT] gnodes_covstats_intro.html        13-Apr-2021 15:57   4k
[TXT] gnodes_covstats_readme21apr.txt   12-Apr-2021 00:27   8k
[TXT] gnodes_covstats_sum21feb.txt      23-Feb-2021 22:11   4k
[TXT] gnodes_pipe_algo.txt              23-Feb-2021 22:18   9k
[DIR] soft_evigene_gnodes_update/       10-Mar-2021 22:00   -
[DIR] soft_evigene_package/             21-May-2020 14:05   -

Gnodes/Genome Depth Estimator  2021.apr

Summary of results
--------------------
Insects     Fruitfly 20 UC            Honeybee 19 Ha          Plants      Arabidopsis 18 TAIR      Arabidopsis 20 Max          
Part        Obs.Mb  Est.Mb  xCopy     Obs.Mb  Est.Mb  xCopy   Part        Obs.Mb  Est.Mb  xCopy    Obs.Mb  Est.Mb  xCopy 
--------    ----------------------    ----------------------  -------     ---------------------   -----------------------
Flowcyto    161-180  .      .         234-264  .      .       Flowcyto    157-166  .      .        157-166  .      .     
LN/C Est    .       168     .         .       267     .       LN/C Est    .       156     .        .       156     .     
Totalasm    163     164     .         224     222     .       Totalasm    120     154     .        130     158     .     
Measured    163     164     1.00      223     221     0.99    Measured    115     149     1.30     126     154     1.22  
uniqasm     129     130     1.00      213     188     0.88    uniqasm     98      104     1.07     108     104     0.96  
dupasm      34      34      1.02      9.4     33.5    3.56    dupasm      16.9    44.3    2.62     17.3    49.6    2.86  
CDSann      31      33      1.04      50      56      1.12    CDSann      42      57      1.37     45      57      1.28  
TEann       26      23      0.89      3.6     5.2     1.45    TEann       16.4    17.2    1.05     20      19      0.94  
RPTann       .      .       .         38      47      1.24    RPTann      4.7     20.9    4.44     8.2     21.5    2.61  
NOann       109     111     1.02      141     126     0.89    NOann       54      70      1.28     58      72      1.25  
--------    ----------------------    ----------------------  -------     ---------------------    ---------------------
Size=LN/C   C=94, N=105 Mb, L=150     C=25, N=50 Mb, L=150    Size=LN/C   C=52, N=54M, L=150       C=52, N=54M, L=150    

C, C_UCG = read copy depth measured for unique conserved genes
xCopy  = excess/deficit in read copy depth: C_part/C_UCG, depth at partition / depth at uniq conserved genes.
Obs.Mb = partition size in megabases, Est.Mb = estimated size: observed * xCopy
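
As a worked check of the formulas above, the sketch below recomputes a few values from the summary table. It assumes
that L is read length in bases, N is the number of reads in millions, and C is read depth at unique conserved genes;
the table does not spell out L and N, so that reading is an assumption. The function names are illustrative only and
are not part of Gnodes.

```python
def size_from_depth(read_len, n_reads_millions, ucg_depth):
    """Genome size in Mb from Size = L*N/C (assumed meaning of L, N, C)."""
    return read_len * n_reads_millions / ucg_depth


def xcopy(depth_part, depth_ucg):
    """Excess/deficit in read copy depth: C_part / C_UCG."""
    return depth_part / depth_ucg


def estimated_mb(obs_mb, xcopy_value):
    """Estimated partition size: observed size * xCopy."""
    return obs_mb * xcopy_value


# Values taken from the summary table above.
print(round(size_from_depth(150, 105, 94)))   # Fruitfly LN/C Est
print(round(size_from_depth(150, 54, 52)))    # Arabidopsis LN/C Est
print(round(estimated_mb(16.9, 2.62), 1))     # Arabidopsis 18 TAIR dupasm Est.Mb
```

With these inputs the fruitfly and Arabidopsis lines reproduce the LN/C Est row (168 and 156 Mb), and the dupasm line
reproduces the 44.3 Mb estimate.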

What Is Gnodes?
-----------------
Gnodes = G.no.. D.... Es........ 
         Genome depth estimation is a critical measurement for genome size estimates from DNA sequence data.  There are
many software tools for estimating genome sizes from DNA sequences.  The commonly used ones at this writing are based on
K-mer shredding of DNA to very small pieces, then counting frequency of pieces.  This is a statistical method that is
rather distant from the biological evidence, where choices of K-mer size and other options strongly influence estimates.  
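
To make "K-mer shredding" concrete, here is a minimal, generic sketch of k-mer counting and the frequency histogram
that such tools start from. It is not the algorithm of any particular estimator; the toy sequence and function names
are made up for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_histogram(counts):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


toy = "ACGTACGTAC"
for k in (3, 4):
    hist = kmer_histogram(kmer_counts(toy, k))
    print(k, sorted(hist.items()))
```

Even on this toy sequence the histogram changes shape between k=3 and k=4, which is a small-scale view of how the
choice of K-mer size can influence what is estimated from the histogram.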

The Gnodes method is based on two biological or molecular-method assumptions: a. Molecular sequencing methods for genomes
produce an even depth of DNA pieces from chromosomes, and b. Depth of coverage is measurable, with smallest error, most
reliably at known unique genome sequence spans, such as unique conserved genes (UCG).

Gnodes measures DNA cover depth by mapping DNA pieces and tabulating read depth in bins, like samtools pileup, but differs
in that it measures multi-mapping explicitly.  It uses gene and transposon data, as the two largest and commonly measured
genome attributes, to annotate chromosome assemblies.  It tabulates depth for several whole-genome partitions: unique and
duplicated DNA spans, and coding, transposon and simple repeat spans.  Its main result is a measure of over- and
under-assembly (xCopy) relative to the standard depth for unique conserved genes.  It can also be used in detailed
comparison of chromosome spans, to detect regions of mis-assembly, both over- and under-assembly.  These methods and
results are in agreement with those reported by Pflug et al 2020, comparing genome size measures of flow cytometry,
read-depth and k-mer counting.
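
The binning and xCopy idea can be sketched as follows. This is a simplified illustration under stated assumptions, not
the gnodes_pipe.pl implementation: it takes a plain list of per-base depths, ignores multi-mapping, labels each bin
with one hypothetical partition name, and uses the median as the depth summary (the summary statistic Gnodes uses is
not stated here).

```python
from statistics import median


def bin_depth(depth, bin_size):
    """Mean read depth in consecutive bins of bin_size bases."""
    bins = []
    for start in range(0, len(depth), bin_size):
        chunk = depth[start:start + bin_size]
        bins.append(sum(chunk) / len(chunk))
    return bins


def partition_xcopy(bins, labels, standard="UCG"):
    """xCopy per partition: median bin depth divided by the median depth of the standard bins."""
    c_ucg = median(b for b, lab in zip(bins, labels) if lab == standard)
    by_part = {}
    for b, lab in zip(bins, labels):
        by_part.setdefault(lab, []).append(b)
    return {lab: median(vals) / c_ucg for lab, vals in by_part.items()}


# Toy data: two unique-gene bins at depth 50, one bin at 100, one at 25.
depth = [50] * 200 + [100] * 100 + [25] * 100
labels = ["UCG", "UCG", "dup", "uniq"]  # hypothetical partition labels, one per bin
print(partition_xcopy(bin_depth(depth, 100), labels))
```

In this toy case the "dup" bin has xCopy 2.0, read as under-assembly (twice the standard depth piles onto one assembled
span), and the "uniq" bin has xCopy 0.5, read as over-assembly.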

With further work, it may be used in combining the accurate portions of multiple assemblies. Genome assemblers today
produce rather different results for the same DNA data, and with a measure of their accuracy at contig levels from Gnodes,
the accurate portions of each can be combined, much as software for removing heterozygous assembly spans works now.  Gnodes
looks at both sides of this coin of over- and under-assembly, in contrast to the heterozygosity-reducing tools that
look at only the over-assembly side of coverage depth.

I think that Gnodes is a reliable estimator whose results are determined primarily by the biological properties of the
genome, with minor influence of computational options.  More testing will answer that.

-- Don Gilbert, 2021.Feb

References
----------
[1] Bennett,M.D. et al. (2003) Comparisons with Caenorhabditis (100Mb) and Drosophila (175Mb) using flow cytometry show
genome size in Arabidopsis to be 157Mb and thus 25% larger than the Arabidopsis genome initiative estimate of 125Mb. Ann.
Botany, 91, 547-557. doi:10.1093/aob/mcg057
[2] Pflug,J.M., Holmes,V.R., Burrus,C., Johnston,J.S. and Maddison,D.R. (2020) Measuring Genome Sizes Using Read-Depth,
k-mers, and Flow Cytometry: Methodological Comparisons in Beetles (Coleoptera). G3: Genes, Genomes, Genetics, 10,
3047-3060. doi:10.1534/g3.120.401028


Software
--------------------
evigene/scripts/genoasm/gnodes_pipe.pl and component Evigene scripts.

It requires common genome informatics components: ncbi-blast, bwa-mem, samtools, repeatmasker and busco.  Some of these may
become optional later.  Mapping DNA reads has been tested with both bwa-mem and bowtie2, with similar results for this
tool, but bwa-mem is faster.

gnodes_pipe needs a genome.metadata file and a gnodes_setup.sh unix system script for your compute cluster.

Gnodes requires data inputs of 

a. Accurate genomic DNA pieces, as from current Illumina sequencers.  Read sizes of 150 bp are common now and work well;
reads of 100 bp or shorter may result in somewhat different estimates.

b. Chromosome assembly(ies), whether contig or chromosome level.  Gnodes is most useful for comparing assemblies from the
same or related species.

c. Gene coding sequences for the species. Gene CDS assembled from RNA independently of chromosome assemblies provide
measures of errors in chr-assemblies, including missing genes, for more reliable estimates.  The Evigene package has a
pipeline, SRA2Genes, for constructing accurate and complete gene sets from RNA.

d. Optionally, transposon sequences known for your species.  So far repeatmasker, optionally with repeatmodeler, has been
tested and used for Gnodes.

e. Busco and Repeatmasker require data sets for your organism.


gnodes_setup.sh for your computer cluster
## --- gnodes_setup.sh PBS version ---    
#PBS -N gnodes_pipe
#PBS -l vmem=128gb,nodes=1:ppn=24,walltime=12:00:00
#PBS -V

module load blast bwa-mem samtools repeatmasker busco
## --- end gnodes_setup.sh ---    

## --- gnodes_setup.sh for Slurm ---    
#SBATCH --job-name="gnodes_pipe"
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH -t 12:00:00
#SBATCH --export=ALL

module load blast bwa-mem samtools repeatmasker busco
## --- end gnodes_setup.sh ---    


Arabidopsis Example
--------------------
# data fetches
wget -nd -q -b https://sra-download.ncbi.nlm.nih.gov/traces/era14/ERR/ERR4586/ERR4586299

# two genome chrasm (chromosome assembly) sources
# arath18tair_chr = ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/
# arath20max_chr  = ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/904/420/315/GCA_904420315.1_AT9943.Cdm-0.scaffold

$evigene/scripts/genoasm/gnodes_pipe.pl -title arath18t1a -chr arath18tair_chr.fa -cds arath18tair1cds.fa \
 -sumdata arath20asm.metad -ncpu 24  -maxmem 128gb -reads readsf/ERR4586299_1.fastq 

$evigene/scripts/genoasm/gnodes_pipe.pl -title arath20m1a -chr arath20max_chr.fa -cds arath18tair1cds.fa \
 -sumdata arath20asm.metad -ncpu 24  -maxmem 128gb -reads readsf/ERR4586299_1.fastq 

Arabidopsis gnodes metadata
--------------
arath20gnodes/arath20asm.metad

asmid=arath18tair_chr
flowcyto=157-166 Mb
asmtotal=120 Mb
asmname=Arath18TAIR
species=Arabidopsis_thaliana
buscodb=embryophyta_odb9
rmaskdb=Arabidopsis
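
The metadata above is a simple key=value layout. As a hedged illustration of how such a file could be read, the sketch
below parses it into a dictionary and turns the "157-166 Mb" style values into numbers. This is a hypothetical reader
written for this note, not the parser used by gnodes_pipe.pl; skipping "#" lines as comments is an assumption.

```python
def parse_metad(text):
    """Parse key=value lines into a dict, skipping blank lines and '#' lines (assumed comments)."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        meta[key.strip()] = value.strip()
    return meta


def mb_values(value):
    """Turn '157-166 Mb' or '120 Mb' into a tuple of floats, in megabases."""
    numbers = value.replace("Mb", "").strip().split("-")
    return tuple(float(n) for n in numbers)


example = """asmid=arath18tair_chr
flowcyto=157-166 Mb
asmtotal=120 Mb
species=Arabidopsis_thaliana
"""
meta = parse_metad(example)
print(meta["asmid"], mb_values(meta["flowcyto"]), mb_values(meta["asmtotal"]))
```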

Developed at the Genome Informatics Lab of Indiana University Biology Department