Gnodes DNA Depth Deficit Analyses:
Chr Assembly Components and Gene Copy Numbers
Deficit Synopsis of Major Components
Deficit is the difference between the observed assembly content or gene copy number
and the content expected from uniform DNA coverage depth, as measured
over unique conserved genes. Percent deficit (Y-axis) is this difference as a
percentage of the expected value: 100 * (obs - exp) / exp. The desired value of 0 indicates
that the chromosome assembly contains all the DNA expected from coverage depth. A value above 0 indicates excess,
a value below 0 a deficit. The Y-axis for Gene-CN on the right, with a minimum of -10% deficit,
is narrower than that for Chr Assembly on the left.
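As a minimal sketch of the percent deficit formula given above, the following Python function computes 100 * (obs - exp) / exp. The function name and the example numbers are illustrative assumptions, not part of Gnodes itself.

```python
def percent_deficit(observed: float, expected: float) -> float:
    """Percent deficit: 100 * (obs - exp) / exp.

    Negative values are a deficit, positive values an excess,
    and 0 means the observed content matches the expectation.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    return 100.0 * (observed - expected) / expected


# Hypothetical span totals in megabases, for illustration only.
print(percent_deficit(83.0, 100.0))   # negative value: a deficit
print(percent_deficit(100.0, 100.0))  # zero: observed matches expected
print(percent_deficit(106.0, 100.0))  # positive value: an excess
```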
Chromosome assembly components (gray), ALL: all assembly,
UNIQ: unique read map spans, DUP: duplicate map spans, CDS: Gene CDS-mapped read spans.
UNIQ, DUP are full partitions of ALL (UNIQ+DUP=ALL), but need careful
interpretation: a deficit in UNIQ here means that duplicates are lower than
expected (i.e. under-assembled duplications). Excess in DUP means some are
unique (i.e. over-assembled). A deficit in one balanced by excess in the other means
roughly that it is just-right assembled, e.g. Drosophila mel.2020 Pi assembly.
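The balancing of a UNIQ deficit against a DUP excess can be sketched as follows. This assumes, for illustration, that the expected ALL content is the sum of the expected UNIQ and DUP contents, in line with UNIQ+DUP=ALL. The span totals are hypothetical, not measured values.

```python
def percent_deficit(observed: float, expected: float) -> float:
    """Percent deficit: 100 * (obs - exp) / exp."""
    return 100.0 * (observed - expected) / expected


def all_deficit(uniq_obs: float, uniq_exp: float,
                dup_obs: float, dup_exp: float) -> float:
    """Percent deficit of ALL, taking ALL = UNIQ + DUP for both
    observed and expected spans (an assumption of this sketch)."""
    return percent_deficit(uniq_obs + dup_obs, uniq_exp + dup_exp)


# Hypothetical span totals in megabases.
uniq_obs, uniq_exp = 95.0, 100.0   # UNIQ is 5 Mb short
dup_obs, dup_exp = 45.0, 40.0      # DUP is 5 Mb over

print(percent_deficit(uniq_obs, uniq_exp))               # UNIQ: deficit
print(percent_deficit(dup_obs, dup_exp))                 # DUP: excess
print(all_deficit(uniq_obs, uniq_exp, dup_obs, dup_exp)) # ALL: balanced
```

Note that the percentages themselves do not add: the parts balance because the absolute differences in spans cancel, while each percentage is relative to its own expected value.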
Gene copy number levels (green) C1: one copy
in genomic DNA reads mapped to gene-cds, C2-9: two to nine copies, C10-99: ten
to ninety-nine copies. These values are the percentage of genes in measurement
set with a deficit in copies found on chromosome assemblies, using CDS-mapped
genomic DNA reads. This does not measure the total number of missing copies, only
the share of genes in the measured set that have a deficit.
Cmiss (red) : Gene CDS-mapped genomic DNA reads
that are missing from Chr-assembly, as percent of all CDS-mapped reads.
Measurable Cmiss percentages (>=1%) indicate a poor assembly, i.e. missing all
copies of gene coding sequence DNA.
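The Cmiss percentage described above can be sketched as a simple ratio with the 1% threshold applied. The function names and read counts below are hypothetical, chosen only to illustrate the calculation.

```python
CMISS_THRESHOLD_PCT = 1.0  # ">=1%" threshold quoted in the text


def cmiss_percent(missing_cds_reads: int, all_cds_reads: int) -> float:
    """CDS-mapped reads missing from the chr assembly,
    as a percent of all CDS-mapped reads."""
    if all_cds_reads <= 0:
        raise ValueError("all_cds_reads must be positive")
    return 100.0 * missing_cds_reads / all_cds_reads


def is_measurable_cmiss(pct: float) -> bool:
    """True when Cmiss reaches the level the text calls measurable."""
    return pct >= CMISS_THRESHOLD_PCT


# Hypothetical read counts, for illustration only.
pct = cmiss_percent(7_000, 1_000_000)
print(f"Cmiss = {pct:.2f}%, measurable: {is_measurable_cmiss(pct)}")
```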
Arabidopsis thaliana: 2018.TAIR, 2020.Max Planck assemblies, 2018T x F1 Heterozygote DNA
Ara.th. TAIR assembly has a -30% deficit in DNA spans, notably including genes with 2-9 copies.
Arath20Max assembly is larger by 10 Mb or 6%, and has approx. 6% lower DNA deficits in chr assembly;
notably it has 50% more spans with simple repeats.
However this Arath20Max has a greater deficit in gene copy recovery, and 10x more
missing unique gene DNA (0.40% Arath20Max vs 0.04% for Arath18TAIR), possibly an effect of sample population differences.
Heterozygous DNA (F1 of Col-0 x Cvi-0, right tair_evg8hetrc panel) has no significant effect on genome-wide measures, versus homozygous DNA of Col-0 strain (left panel). Measurable effects of F1-DNA are (a) higher map error rate, with more incomplete gene span aligns (30% F1 vs 5% parent Col-0) and (b) a small number of genes with copy number changes (10% with more or fewer copies in F1 mix).
Gene copy deficits in Arath18TAIR include Ribosomal proteins, Cytochromes, Transcription factors, Plant self-incompatibility, Disease resistance, Transmembrane genes, Transposon genes, among others.
Uncharacterized genes account for roughly half of copy deficits.
Drosophila mel.r6, mel.2020 and pse.2020 assemblies
Dros. melanogaster Release 6 assembly (drmel6r) has a noticeable deficit of -17%, or 30 Megabases.
Recent Dros. mel. 2020 (drmel20, Pi2) and Dros. pseudoobscura 2020 (drpse20) assemblies are
at zero deficit, within measurement error.
For drmel20, the apparent deficit in UNIQ and the excess in DUP sum to zero deficit.
The public standard Release 6 deficit is in part found among
genes with 2-9 and 10-99 copies in genome, and also in transposon and repeat regions (not shown).
Gene copy deficits in Drome6 include Histones, Chorion, Mucin, Ubiquitin genes among others. Uncharacterized genes account for a small portion, Histones are the majority. There are no Transposon genes in the Drome gene set used for annotations.
Human, Chicken, Pig chr assemblies
The reference human19grc assembly is close to accurately recovering flow cytometry measures of
genomic DNA (3099 of 3423 Mb). Proportionally largest deficits are in simple repeat spans, of 50-80 Mb,
and all duplicated spans including repeats, of 75-125 Mb.
Gene copy number deficits for human19grc are largest for a small number of families with 10-99 copies, and a larger number of 2-9 copy genes missing one copy. Notable gene copy deficits are found for olfactory and taste receptor gene duplications, homeoboxes, antigens, ribosomal proteins, zinc finger containing and uncharacterized genes.
Two human chromosome assemblies were measured, reference human19grc and a more recent human20ash.
The latter is 2% larger, has minimal differences measured by gnodes from the reference assembly,
in both whole genome partitions and in gene copy numbers, and is not shown here.
Chicken and pig reference chromosome assemblies are also analyzed with gnodes. Pig genome size is similar to human; chicken is 1/3 of that, at 1200 Mb. Chicken appears to be close to accurately assembled. Pig however has significant deficits in whole chromosome parts and in gene copy numbers, as well as missing unique gene DNA (missing 0.670% of pig gene DNA, vs 0.008% for human, 0.017% for chicken).
Daphnia pulex 2016ml, 2019ml, 2020ma assemblies
Daph. pulex PA42 assembly of 2016 has large deficiencies, esp. 2-9 copies of genes, and is
missing significant gene-DNA sequence (> 1%). PA 2019 assembly has narrowed this deficit to nearer
zero, but has a measurable deficit in gene copies and gene DNA. 2020-ma (MaSurCA) assembly of same PA DNA
as 2019ml erases most deficits, and shows some excess DNA spans, on average closer to zero deficit than
the other two.
Gene copy deficits in Dap19 include Ribosomal proteins, Centromere proteins, Cuticle genes, Transcription factors, Heat shock genes, Sex-determining genes, Transposon genes, among others.
Uncharacterized genes account for more than half of copy deficits.
Daphnia magna 2010nw, 2019sk, 2020ma assemblies
Daph. magna 2010nw (Newbler assembler, 454-reads, publ 2015) has a deficit of nearly 50% in the chr.
assembly, missing many gene copies, and missing measurable gene DNA. D. magna 2019sk (using odd assembler,
Illumina DNA) has even greater deficits, and a notably large missing gene DNA portion (up to 7%).
This poorest assembly unfortunately was recently chosen to represent D. magna in NCBI genomes.
2020ma uses DNA data of this project re-assembled with MaSurCA, and
has notably reduced deficits, still large but improved.
Cucumber 2019 CGI and 2020 PCC chr assemblies
Cucumber chr assemblies, 2019 CGI of 220 Mb (reference) and 2020 PCC of 340 Mb, have deficits in
duplicated regions from the 400-500 Mb chromosome set determined by flow cytometry. 2019 CGI has
a large deficit, missing more than 50% of genome DNA. The 2020 PCC
assembly has improved recovery of coding spans and duplicated spans.
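As a rough check of the cucumber figures above, the assembly sizes can be compared with the flow cytometry range using the same percent deficit formula. This sketch compares whole-assembly size with genome size only; it is not the depth-based Gnodes measure, and the function names are illustrative assumptions.

```python
def percent_deficit(observed_mb: float, expected_mb: float) -> float:
    """Percent deficit: 100 * (obs - exp) / exp."""
    return 100.0 * (observed_mb - expected_mb) / expected_mb


# Values quoted in the text, in megabases.
FLOW_CYTOMETRY_MB = (400.0, 500.0)
ASSEMBLIES_MB = {"2019 CGI": 220.0, "2020 PCC": 340.0}


def deficit_range(assembly_mb: float,
                  genome_range_mb: tuple = FLOW_CYTOMETRY_MB) -> tuple:
    """Return (largest, smallest) percent deficit across the
    flow cytometry genome size range."""
    low, high = genome_range_mb
    return (percent_deficit(assembly_mb, high),
            percent_deficit(assembly_mb, low))


for name, size_mb in ASSEMBLIES_MB.items():
    largest, smallest = deficit_range(size_mb)
    print(f"{name}: {largest:.0f}% to {smallest:.0f}%")
```

By this size-only comparison the 2019 CGI assembly falls about 45% to 56% short of the flow cytometry range, and the 2020 PCC assembly about 15% to 32% short.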