Daphnia magna assembly 2.4 quality assessment
---------------------------------------------

Assembly Gaps: Dmagna vs Dpulex

The gap profiles are similar, but Dpulex has many more gaps of <= 50 bp and
many more gaps of > 10 Kb. This suggests the Dmagna assembly is not missing
an unusual amount of sequence; otherwise the missing sequence should show up
as many more gaps from paired reads.

Dmagna 2.4 asm, gapCount=10371, totalN=23 MB, avgGap=2264
         0 < size < 10    :  613: *******
       100 < size < 500   : 2282: ***************************
       500 < size < 1000  : 1957: ***********************
      1000 < size < 5000  : 3796: ********************************************
      5000 < size < 10000 : 1335: ***************
     10000 < size < 50000 :  247: **

Dpulex 2006 asm, gapCount=18441, totalN=60 MB, avgGap=3298
         0 < size < 10    : 2725: **********************
             size = 50    : 4151: *********************************
       100 < size < 500   : 1789: **************
       500 < size < 1000  : 1665: *************
      1000 < size < 5000  : 5501: *********************************************
      5000 < size < 10000 :  785: ******
     10000 < size < 50000 : 1497: ************

EST errors: Dmagna vs other species

ESTs mapped to genomes give an assessment of assembly errors. Few D. magna
ESTs are missed or map with poor identity: the error rate is lower than for
D. pulex or any other arthropod genome here except Dros. melanogaster. This
low error rate for Dmagna suggests the assembly is essentially complete.

EST mapping errors
    Species      N OK     pOK    pDUP   pMISS
  6 aphid       132559   0.76   0.076   0.085
  4 bombyx      209897   0.80   0.083   0.030
  5 dappulex    115809   0.78   0.052   0.158
  2 dapmagna   1211902   0.94   0.019   0.025
  1 drosmel     520952   0.95   0.018   0.016
  7 ixodes      149662   0.74   0.079   0.102
  3 nasvit      156288   0.88   0.030   0.063
  pOK = valid map, pDUP = duplicate location, pMISS = missing or low identity

Synteny: Gene order table

I've made a table of synteny between Dpulex and Dmagna from matching, ordered genes.
  http://server7.wfleabase.org/genome/Daphnia_magna/prerelease/Dmagna_asm2.4/gene-predictions/
    dmagna_dpulex_geneorder_table.txt

This is a rough matching, but there is much agreement. 16093 Dmagna and
Dpulex genes match, and about 2/3 are in ordered syntenic runs.

   7386 Dmagna genes do not match Dpulex
  28322 Dpulex genes do not match Dmagna

** The 16093 matching genes should be raised to 21,000 from tblastn results,
leaving about 23,000 non-matching Dpulex genes. These non-match counts are
likely overestimates due to independent duplications, e.g. only 4 of 8 genes
in the hemoglobin cluster match best.

For the pulex genes not in magna, the smaller scaffolds have more:

  DmagFound       DmagMiss
    10638    1      2233    1   dpx scaffolds 1-9
     4408   10      5003   10   dpx scaffolds 10-39
     3908   40      4904   40   dpx scaffolds 40-99
     4331  100     13576  100   dpx scaffolds 100-999  < largest # pulex genes missing from magna
      194 1000      2606 1000   dpx scaffolds 1000+

Comparing the evidence types for Dpulex genes found or missed in Dmagna
reveals that Homology, EST, and Expressed genes are deficient in the missed
group, and Paralogs are overabundant. This may be expected from biology.
High identity paralogs in missed genes could also be spurious duplicates.

DmagnaFound
  Sbin      N  AllEvd    EST  Express  Homolog  Paralog   none
     1  10638   0.306  0.278    0.296    0.259    0.146  0.000
    10   4408   0.991  0.796    0.862    0.798    0.579  0.007
    40   3908   0.987  0.747    0.806    0.772    0.576  0.012
   100   4331   0.981  0.466    0.494    0.698    0.738  0.017
  1000    194   0.871  0.031    0.052    0.485    0.758  0.129

DmagnaMiss
  Sbin      N  AllEvd    EST  Express  Homolog  Paralog   none
     1   2233   0.907  0.441    0.638    0.401    0.549  0.093
    10   5003   0.924  0.333    0.552    0.415    0.700  0.076
    40   4904   0.925  0.282    0.439    0.402    0.730  0.075
   100  13576   0.848  0.118    0.161    0.390    0.751  0.152
  1000   2606   0.574  0.008    0.015    0.276    0.472  0.426

If we missed 35% of the magna genome, the missing part corresponds mostly to
pulex scaffolds 100-999. Conversely, these pulex scaffolds could be somewhat
spurious, or this could be where the genomes differ.
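
To make the per-bin comparison easier to check, here is a minimal Python
sketch that recomputes, from the DmagFound/DmagMiss counts above, the
fraction of Dpulex genes missed within each scaffold bin and each bin's share
of all missed genes. The names SCAFFOLD_BINS, missed_fraction and summarize
are hypothetical, introduced only for this illustration; the counts are
copied from the table as given.

```python
# Counts copied from the DmagFound / DmagMiss table.
# Tuple layout (a convention assumed for this sketch only):
#   (Dpulex scaffold bin, genes found in Dmagna, genes missed in Dmagna)
SCAFFOLD_BINS = [
    ("1-9", 10638, 2233),
    ("10-39", 4408, 5003),
    ("40-99", 3908, 4904),
    ("100-999", 4331, 13576),
    ("1000+", 194, 2606),
]


def missed_fraction(found, missed):
    """Fraction of a bin's Dpulex genes that have no Dmagna match."""
    total = found + missed
    return missed / total if total else 0.0


def summarize(bins):
    """Return one dict per bin with in-bin and overall missed shares."""
    total_missed = sum(missed for _, _, missed in bins)
    rows = []
    for label, found, missed in bins:
        rows.append({
            "bin": label,
            "missed": missed,
            "frac_missed_in_bin": missed_fraction(found, missed),
            "share_of_all_missed": missed / total_missed if total_missed else 0.0,
        })
    return rows


if __name__ == "__main__":
    print(f"{'bin':>8}  {'missed':>6}  {'inBin':>5}  {'ofAll':>5}")
    for row in summarize(SCAFFOLD_BINS):
        print(f"{row['bin']:>8}  {row['missed']:>6}  "
              f"{row['frac_missed_in_bin']:.3f}  {row['share_of_all_missed']:.3f}")
```

The missed counts sum to 28322, the total of non-matching Dpulex genes quoted
above. Scaffolds 100-999 hold the largest number of missed genes, while
scaffolds 1000+ have the highest missed fraction within a bin.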

Inspection of 10 of these Dpulex-only gene scaffolds shows that most of the
genes have Dpulex paralogs that match better to Dmagna, so the Dmagna
assembly has fewer copies of these. Is this real biology or artifact? If
artifact, have we missed Dmagna duplicates, or counted Dpulex genes twice?

  Dpulex        ngene
  scaffold_257   17  == mostly transposon genes (gag-pol)
  scaffold_297   15  == hypothetical/conserved proteins, highly duplicated, guess transposons; several large NNN gaps
  scaffold_323   15  == mix of conserved and hypothetical genes, 1 transposon, some expression
  scaffold_305   13  == mix of conserved genes, several large NNN gaps  < good gene set to check magna for
                        hxAUG26us305g144t1/chrom reg maint; hxNCBI_GNO_24994/dystroph;
                        hxAUG26us305g143t1/transcription elongation factor
                        ** All of these genes are found in Dmagna as other Dpulex paralogs
  scaffold_313   12  == 2 genes duplicated several times, large NNN gaps
  scaffold_265   11  == mix of conserved genes, several large NNN gaps  < check magna for this
  scaffold_292   10  == mostly large NNN gaps, weak gene models
  scaffold_367   10  == mix of conserved and hypothetical, mostly large NNN gaps
  scaffold_601   10  == several conserved genes, some expression, NO gaps  < check magna for this
                        ** Genes here are found in Dmagna as other Dpulex paralogs
  scaffold_698   10  == 1 good gene hxAUG26us698g51t1/Ribonucleoside-diphosphate reductase,
                        others poor models; large NNN gaps

Of 9883 homologous genes, 6822 genes are in 1003 ordered runs between
species, with average mis-order of 0.80 (+/-0.02), where 0 indicates
identical order.

For comparison, the Drosophila pair melanogaster x mojavensis diverged about
50 MYA, and has syntenic runs of ordered genes: 7354 genes are in 1232
ordered runs between species, of 10610 homologous genes, with average
mis-order of 0.52 (+/-0.02).

Here is a graphic of the Drosophila mel x moj chromosome synteny:
  http://insects.eugenes.org/species/maps/muller-elements/
The graphic, "D. mojavensis genome Muller elements (x D. melanogaster
synteny)", shows that the 5 major chromosomes share a large amount of
agreement. Daphnia should be similar, if somewhat more divergent.

One can use this synteny to improve genome assemblies and gene models in both
species. The higher mis-order for Daphnia may indicate more assembly
mistakes, as well as phylogenetic divergence. I've been trying to distinguish
the two, but so far have no good answer.

I've looked at a handful of mis-order cases, and at some pulex genes missing
in magna, with no general conclusion:

  - Some have a magna NNN gap in the gene mis-order, and a small magna
    scaffold/contig fits precisely that gap and gene order (in another case
    a pulex NNN gap is rectified by magna gene order).
  - Some of the missed pulex genes are poor gene models in one or the other
    species.
  - Some cases of tandem duplicates in pulex look like duplicates are
    missing in magna.
  - Other mis-order cases suggest real gene translocations.

A simple program using gene order synteny can close some assembly gaps. 180
Dmagna small scaffolds/contigs with 270 genes are moved to 160 NNN gaps in
larger scaffolds using gene order runs.
See Dmagna_asm2.4/gene-predictions/dmagna_dpulex_geneorder_gapfill.txt
This is derived from dmagna_dpulex_geneorder_table.txt, showing gaps filled
with ordered genes.

-- Don Gilbert, May 2010
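
As an illustration of the gene-order gap filling described above, here is a
minimal Python sketch. It is not the program used to produce
dmagna_dpulex_geneorder_gapfill.txt. The Gene record, the GAP marker and
find_gap_fits are hypothetical names, the gene and scaffold labels in the
example are made up, and the fit rule is an assumption: a small scaffold fits
an NNN gap when the genes flanking the gap and all genes on the small
scaffold match the same Dpulex scaffold, and the small scaffold's Dpulex gene
ranks fall strictly between the flanking ranks, in either orientation.

```python
from typing import List, NamedTuple, Optional


class Gene(NamedTuple):
    name: str          # Dmagna gene id (hypothetical labels in the example)
    ref_scaffold: str  # Dpulex scaffold of the matching gene
    ref_rank: int      # position of the matching gene in Dpulex gene order


GAP = None  # marker for an NNN gap inside a large scaffold's gene list


def find_gap_fits(large: List[Optional[Gene]], small: List[Gene]) -> List[int]:
    """Return indices of gaps in `large` that `small` fits by gene order."""
    if not small:
        return []
    ref_scaffolds = {g.ref_scaffold for g in small}
    if len(ref_scaffolds) != 1:
        return []  # small scaffold matches several Dpulex scaffolds: skip
    ref = next(iter(ref_scaffolds))
    small_lo = min(g.ref_rank for g in small)
    small_hi = max(g.ref_rank for g in small)

    fits = []
    for i, item in enumerate(large):
        if item is not GAP:
            continue
        if i == 0 or i == len(large) - 1:
            continue  # need a flanking gene on both sides of the gap
        left, right = large[i - 1], large[i + 1]
        if left is GAP or right is GAP:
            continue
        if left.ref_scaffold != ref or right.ref_scaffold != ref:
            continue
        # sorted() makes the test independent of scaffold orientation
        lo, hi = sorted((left.ref_rank, right.ref_rank))
        if lo < small_lo and small_hi < hi:
            fits.append(i)
    return fits


if __name__ == "__main__":
    large = [
        Gene("m1", "scaffold_5", 10),
        Gene("m2", "scaffold_5", 11),
        GAP,
        Gene("m3", "scaffold_5", 14),
        GAP,
        Gene("m4", "scaffold_5", 20),
    ]
    small = [Gene("s1", "scaffold_5", 12), Gene("s2", "scaffold_5", 13)]
    print(find_gap_fits(large, small))
```

In this made-up example only the first gap qualifies, because Dpulex ranks 12
and 13 lie between the flanking ranks 11 and 14. A real run would also need
to resolve ties when several small scaffolds fit the same gap, which this
sketch does not attempt.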