
Gnodes DNA Depth Deficit Analyses:
Chr Assembly Components and Gene Copy Numbers

     Model plant ... Model fly ... Human, model vertebrates ... Daphnia waterfleas ... Cucumber plant

Deficit Synopsis of Major Components

Deficit is the difference between observed assembly and gene copy numbers and the contents expected from uniform DNA coverage depth, as measured over unique conserved genes. Percent deficit (Y-axis) is this difference expressed as a ratio: 100 * (obs - exp) / exp. The desired value of 0 indicates that the chromosome assembly has all of the expected DNA coverage depth. A value above 0 indicates an excess; a value below 0 is a deficit. The Y-axis for Gene-CN on the right, with a minimum of -10% deficit, is narrower than the one for Chr Assembly on the left.
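The percent deficit formula can be written out as a small Python sketch. The function name and the example numbers are illustrative assumptions, not gnodes code or measured values.

```python
def percent_deficit(observed: float, expected: float) -> float:
    """Percent deficit as defined in the text: 100 * (obs - exp) / exp."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return 100.0 * (observed - expected) / expected


# Hypothetical values, in any consistent unit (e.g. megabases).
print(percent_deficit(83.0, 100.0))   # negative value: a deficit
print(percent_deficit(100.0, 100.0))  # zero: all expected DNA is present
print(percent_deficit(110.0, 100.0))  # positive value: an excess
```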

Chromosome assembly components (gray): ALL: all assembly; UNIQ: unique read map spans; DUP: duplicate map spans; CDS: gene CDS-mapped read spans. UNIQ and DUP fully partition ALL (UNIQ+DUP=ALL), but need careful interpretation: a deficit in UNIQ here means that duplicates are lower than expected (i.e. under-assembled duplications). Excess in DUP means some are unique (i.e. over-assembled). A deficit in one balanced by an excess in the other means roughly that the assembly is just right, e.g. the Drosophila mel.2020 Pi assembly.
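A minimal sketch of the UNIQ/DUP partition and the balanced case described above. The span sizes are made up for illustration, and how gnodes derives the expected size of each partition is not stated here, so the inputs are assumptions.

```python
def percent_deficit(observed: float, expected: float) -> float:
    """100 * (obs - exp) / exp, as defined for the plots."""
    return 100.0 * (observed - expected) / expected


def partition_summary(uniq_obs, uniq_exp, dup_obs, dup_exp):
    """Percent deficit for UNIQ, DUP and their sum ALL (UNIQ + DUP = ALL)."""
    return {
        "UNIQ": percent_deficit(uniq_obs, uniq_exp),
        "DUP": percent_deficit(dup_obs, dup_exp),
        "ALL": percent_deficit(uniq_obs + dup_obs, uniq_exp + dup_exp),
    }


# Hypothetical spans in Mb: a UNIQ deficit offset by a DUP excess.
summary = partition_summary(uniq_obs=90.0, uniq_exp=100.0,
                            dup_obs=50.0, dup_exp=40.0)
for name, value in summary.items():
    print(f"{name}: {value:+.1f}%")
```

In this made-up case UNIQ is below zero and DUP is above zero, while ALL comes out at zero, which is the pattern the text reads as a roughly just-right assembly.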

Gene copy number levels (green): C1: one copy in genomic DNA reads mapped to gene CDS; C2-9: two to nine copies; C10-99: ten to ninety-nine copies. These values are the percentage of genes in the measurement set with a deficit in copies found on chromosome assemblies, using CDS-mapped genomic DNA reads. This does not measure the total number of missing copies, only the share of genes in the measured set that have a deficit.
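The per-class percentages can be sketched as follows. This is an illustration under stated assumptions: each gene has a read-depth copy number estimate and an assembly copy count, and a gene counts as having a deficit when the assembly holds fewer copies than the rounded read-depth estimate. The source does not spell out the exact rule gnodes uses, and the gene values below are invented.

```python
from collections import defaultdict
from typing import Optional


def cn_class(read_copies: float) -> Optional[str]:
    """Assign a gene to C1, C2-9 or C10-99 from its read-depth copy number."""
    n = round(read_copies)
    if n == 1:
        return "C1"
    if 2 <= n <= 9:
        return "C2-9"
    if 10 <= n <= 99:
        return "C10-99"
    return None  # outside the plotted classes


def percent_genes_with_deficit(genes):
    """genes: iterable of (read_depth_copies, assembly_copies) pairs.

    Returns, per class, the percentage of genes whose assembly copy
    count is below the rounded read-depth estimate (assumed rule).
    """
    totals = defaultdict(int)
    deficits = defaultdict(int)
    for read_copies, asm_copies in genes:
        cls = cn_class(read_copies)
        if cls is None:
            continue
        totals[cls] += 1
        if asm_copies < round(read_copies):
            deficits[cls] += 1
    return {cls: 100.0 * deficits[cls] / totals[cls] for cls in totals}


# Hypothetical genes: (copies estimated from reads, copies found in assembly)
example_genes = [(1.0, 1), (1.1, 1), (1.0, 1), (2.9, 2), (4.2, 4), (12.0, 7)]
print(percent_genes_with_deficit(example_genes))
```

As in the text, this counts genes with a deficit, not the number of copies missing: the gene with 7 of 12 copies contributes one gene to C10-99, not five copies.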

Cmiss (red): Gene CDS-mapped genomic DNA reads that are missing from the Chr-assembly, as a percent of all CDS-mapped reads. Measurable Cmiss percentages (>=1%) indicate a poor assembly, i.e. one missing all copies of some gene coding sequence DNA.
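A short sketch of the Cmiss percentage and the 1% threshold mentioned above. The function names and read counts are hypothetical; only the definition (missing CDS-mapped reads as a percent of all CDS-mapped reads) and the >=1% cutoff come from the text.

```python
def cmiss_percent(cds_reads_total: int, cds_reads_missing: int) -> float:
    """Percent of CDS-mapped reads that do not map to the chr assembly."""
    if cds_reads_total <= 0:
        raise ValueError("cds_reads_total must be positive")
    return 100.0 * cds_reads_missing / cds_reads_total


def is_measurable(cmiss: float, threshold: float = 1.0) -> bool:
    """True when Cmiss reaches the >=1% level the text calls measurable."""
    return cmiss >= threshold


# Hypothetical read counts for two assemblies.
for label, total, missing in [("asm_a", 1_000_000, 400),
                              ("asm_b", 1_000_000, 25_000)]:
    pct = cmiss_percent(total, missing)
    print(f"{label}: Cmiss = {pct:.2f}%  measurable: {is_measurable(pct)}")
```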


Arabidopsis thaliana: 2018.TAIR, 2020.Max Planck assemblies, 2018T x F1 Heterozygote DNA

ara18ch_asmgcn_plot ara20ma_asmgcn_plot ara21he_asmgcn_plot
Ara.th. TAIR assembly has a -30% deficit in DNA spans, notably including genes with 2-9 copies. The Arath20Max assembly is larger by 10 Mb or 6%, and has approx. 6% lower DNA deficits in the chr assembly; notably, it has 50% more spans with simple repeats. However, Arath20Max has a greater deficit in gene copy recovery, and 10x more missing unique gene DNA (0.40% for Arath20Max vs 0.04% for Arath18TAIR), possibly an effect of sample population differences.

Heterozygous DNA (F1 of Col-0 x Cvi-0, right tair_evg8hetrc panel) has no significant effect on genome-wide measures, versus homozygous DNA of the Col-0 strain (left panel). Measurable effects of F1-DNA are (a) a higher map error rate, with more incomplete gene span alignments (30% F1 vs 5% parent Col-0) and (b) a small number of genes with copy number changes (10% with more or fewer copies in the F1 mix).

Gene copy deficits in Arath18TAIR include Ribosomal proteins, Cytochromes, Transcription factors, Plant self-incompatibility, Disease resistance, Transmembrane genes, Transposon genes, among others. Uncharacterized genes account for roughly half of copy deficits.




Drosophila mel.r6, mel.2020 and pse.2020 assemblies

drmeref_asmgcn drmel20_asmgcn drpse20_asmgcn
Dros. melanogaster Release 6 assembly (drmel6r) has a noticeable deficit of -17%, or 30 Megabases. The recent Dros. mel. 2020 (drmel20, Pi2) and Dros. pseudoobscura 2020 (drpse20) assemblies are at zero deficit, within measurement error. For drmel20, the apparent deficit in Uniq and excess in Dupl parts sum to zero deficit. The deficit in the public standard Release 6 is found in part among genes with 2-9 and 10-99 copies in the genome, and also in transposon and repeat regions (not shown). Gene copy deficits in Drome6 include Histones, Chorion, Mucin, Ubiquitin genes among others. Uncharacterized genes account for a small portion; Histones are the majority. There are no Transposon genes in the Drome gene set used for annotations.




Human, Chicken, Pig chr assemblies

human19grc_asmgcn_plot chick19nc_asmgcn_plot pig11c_asmgcn_plot
The reference human19grc assembly is close to accurately recovering flow cytometry measures of genomic DNA (3099 of 3423 Mb). The proportionally largest deficits are in simple repeat spans, of 50-80 Mb, and in all duplicated spans including repeats, of 75-125 Mb. Gene copy number deficits for human19grc are largest for a small number of families with 10-99 copies, and for a larger number of 2-9 copy genes missing one copy. Notable gene copy deficits are found for olfactory and taste receptor gene duplications, homeoboxes, antigens, ribosomal proteins, zinc finger containing and uncharacterized genes. Two human chromosome assemblies were measured, the reference human19grc and a more recent human20ash. The latter is 2% larger, shows minimal differences from the reference assembly as measured by gnodes, in both whole genome partitions and gene copy numbers, and is not shown here.
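As a worked example, the two human sizes quoted above can be put through the percent deficit formula. This sketch assumes that 3099 Mb is the assembly size and 3423 Mb is the flow cytometry estimate used as the expected value. It is a size-based reading only, not the depth-based gnodes calculation.

```python
def percent_deficit(observed: float, expected: float) -> float:
    """100 * (obs - exp) / exp."""
    return 100.0 * (observed - expected) / expected


ASSEMBLY_MB = 3099.0        # from the text, read as assembly size
FLOW_CYTOMETRY_MB = 3423.0  # from the text, read as expected genome size

deficit = percent_deficit(ASSEMBLY_MB, FLOW_CYTOMETRY_MB)
shortfall_mb = ASSEMBLY_MB - FLOW_CYTOMETRY_MB
print(f"human19grc: {deficit:.1f}% ({shortfall_mb:.0f} Mb)")
# prints: human19grc: -9.5% (-324 Mb)
```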

Chicken and pig reference chromosome assemblies are also analyzed with gnodes. Pig genome size is similar to human; chicken is 1/3 of that, at 1200 Mb. Chicken appears to be close to accurately assembled. Pig, however, has significant deficits in whole chromosome parts and in gene copy numbers, as well as missing unique gene DNA (missing 0.670% of pig gene DNA, vs 0.008% for human and 0.017% for chicken).




Daphnia pulex 2016ml, 2019ml, 2020ma assemblies

dplx16ml_asmgcn_plot dplx19ml_asmgcn_plot dplx20ma4p_asmgcn_plot
Daph. pulex PA42 assembly of 2016 has large deficiencies, esp. in genes with 2-9 copies, and is missing significant gene-DNA sequence (> 1%). The PA 2019 assembly has narrowed this deficit to nearer zero, but has a measurable deficit in gene copies and gene DNA. The 2020-ma (MaSuRCA) assembly of the same PA DNA as 2019ml erases most deficits and shows some excess DNA spans, on average closer to zero deficit than the other two. Gene copy deficits in Dap19 include Ribosomal proteins, Centromere proteins, Cuticle genes, Transcription factors, Heat shock genes, Sex-determining genes, Transposon genes, among others. Uncharacterized genes account for more than half of copy deficits.




Daphnia magna 2010nw, 2019sk, 2020ma assemblies

dam15nw_asmgcn_plot dam19sk_asmgcn_plot dam20ma_asmgcn_plot
Daph. magna 2010nw (Newbler assembler, 454-reads, publ 2015) has a deficit of nearly 50% in the chr. assembly, missing many gene copies, and missing measurable gene DNA. D. magna 2019sk (using odd assembler, Illumina DNA) has even greater deficits, and a notably large missing gene DNA portion (up to 7%). This poorest assembly unfortunately was recently chosen to represent D. magna in NCBI genomes. 2020ma uses DNA data of this project re-assembled with MaSuRCA, and has notably reduced deficits, still large but improved.




Cucumber 2019 CGI and 2020 PCC chr assemblies

cucum19cgi_asmgcn_plot cucum20pcc_asmgcn_plot
Cucumber chr assemblies, 2019 CGI of 220 Mb (reference) and 2020 PCC of 340 Mb, have deficits in duplicated regions relative to the 400-500 Mb chromosome set determined by flow cytometry. 2019 CGI has a large deficit, missing more than 50% of genome DNA. The 2020 PCC assembly has improved recovery of coding spans and duplicated spans.
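The sizes quoted above allow a rough size-based comparison, sketched below. It assumes the 400-500 Mb flow cytometry range is the expected genome size and applies the percent deficit formula to each bound. The gnodes figures are depth-based and need not match these numbers exactly.

```python
def percent_deficit(observed: float, expected: float) -> float:
    """100 * (obs - exp) / exp."""
    return 100.0 * (observed - expected) / expected


ASSEMBLIES_MB = {"2019 CGI": 220.0, "2020 PCC": 340.0}  # sizes from the text
FLOW_RANGE_MB = (400.0, 500.0)                          # flow cytometry range


def deficit_range(assembly_mb: float, flow_range=FLOW_RANGE_MB):
    """Percent deficit against the low and high flow cytometry estimates."""
    low, high = flow_range
    return percent_deficit(assembly_mb, low), percent_deficit(assembly_mb, high)


for name, size in ASSEMBLIES_MB.items():
    at_low, at_high = deficit_range(size)
    print(f"{name}: {at_low:.0f}% to {at_high:.0f}% of expected genome size")
```

Under this reading the 2019 CGI shortfall passes 50% only toward the upper end of the flow cytometry range, while 2020 PCC is closer to the expected size at either bound.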



Developed at the Genome Informatics Lab of Indiana University Biology Department