
EvidentialGene Gene assembler best methods, 2016 update

This recent EvidentialGene reconstruction of Aedes and Anopheles mosquito genes uses several gene assemblers. A comparison of the assemblers' ability to reconstruct genes accurately is still needed, as many projects and publications do not use them effectively. The accuracy of coding gene construction can be measured objectively with coding sequence metrics (protein size, completeness and especially alignment to reference proteins), as it has been for a decade or more.

Gene assemblers
velvet/oases : v1.2.10 2013;     idba-tran : v.1.1.1 2013;     soap-trans : v.1.03 2013;     trinity : trinityrnaseq_r20140717 (v2.1.1)

Velvet/Oases remains the single best gene assembler, but note that each assembler contributes some uniquely best genes. The majority of genes are most accurately assembled with a kmer (read shred size) at or above 1/2 the read length of 100 bp. Trinity is less capable, in part due to its restricted kmer choice and its lack of scaffolding with read pairs.
Anopheles albimanus, Longest 10K genes
-----------------------------
  Count       Unique      Method
assembler
 4622,46.2%  1464,14.6%u   idba
 2900,29.0%   352, 3.5%u   soap
 2408,24.1%   305, 3.1%u   trin
 7636,76.4%  4492,44.9%u   velv 
kmer
 2219,22.2%     80, 0.8%u  k05
 3903,39.0%    811, 8.1%u  k25 
 5130,51.3%   1255,12.6%u  k35
 4897,49.0%    771, 7.7%u  k45
 4553,45.5%    492, 4.9%u  k55
 4764,47.6%    779, 7.8%u  k65
 4341,43.4%    544, 5.4%u  k75
 3920,39.2%    375, 3.8%u  k85
 3460,34.6%    168, 1.7%u  k95
------------------------------
Anopheles funestus, Longest 10K genes 
------------------------------
  Count       Unique      Method
assembler
 4092,40.9%  2450,24.5%u   idba
 2059,20.6%   682, 6.8%u   soap
 1754,17.5%   561, 5.6%u   trin
 6122,61.2%  4505,45.1%u   velv
kmer
 1495,15.0%    263, 2.6%u  k05
 2785,27.9%   1156,11.6%u  k25
 4047,40.5%   2077,20.8%u  k35
 3053,30.5%   1112,11.1%u  k45
 2831,28.3%    983, 9.8%u  k55
 2173,21.7%    680, 6.8%u  k65
 1520,15.2%    399, 4.0%u  k75
 1117,11.2%    378, 3.8%u  k85
  719, 7.2%    213, 2.1%u  k95
------------------------------
Anopheles albimanus, Highly conserved
  (BUSCO_drosmel 2561 genes)
  Count       Unique      Method
assembler
 1082,42.2%   309,12.1%u   idba
  692,27.0%    75, 2.9%u   soap
  569,22.2%    50, 2.0%u   trin
 2089,81.6%  1285,50.2%u   velv
kmer
  458,17.9%     28, 1.1%u  k05
  957,37.4%    174, 6.8%u  k25
 1177,46.0%    251, 9.8%u  k35
 1169,45.6%    200, 7.8%u  k45
 1085,42.4%    133, 5.2%u  k55
 1203,47.0%    266,10.4%u  k65
 1070,41.8%    192, 7.5%u  k75
  950,37.1%    136, 5.3%u  k85
  787,30.7%     70, 2.7%u  k95
------------------------------
Anopheles funestus, Highly conserved 
  (BUSCO_drosmel 2648 genes)
  Count       Unique      Method
assembler
 1269,47.9%   700,26.4%u   idba
  686,25.9%   178, 6.7%u   soap
  515,19.4%    90, 3.4%u   trin
 1655,62.5%  1054,39.8%u   velv
kmer
  494,18.7%    107, 4.0%u  k05
  822,31.0%    261, 9.9%u  k25
 1089,41.1%    465,17.6%u  k35
  925,34.9%    293,11.1%u  k45
  883,33.3%    245, 9.3%u  k55
  731,27.6%    165, 6.2%u  k65
  540,20.4%    103, 3.9%u  k75
  411,15.5%    119, 4.5%u  k85
  240, 9.1%     68, 2.6%u  k95
------------------------------
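
The percentages in the tables above are consistent with each count divided by the size of the gene set (for example, 4622 of 10,000 is 46.2%, and 1082 of 2561 BUSCO genes is 42.2%). As a rough sketch of how such a tally could be made, the Python below assumes that "Count" means genes for which a method produced a best model, with ties allowed, and that "Unique" means genes for which only that method did. Both readings are assumptions drawn from the table layout, and the function name and input format are hypothetical.

```python
from collections import Counter


def tally_best(best_by_gene):
    """Tally best-model counts per method.

    best_by_gene maps a gene ID to the set of methods (assembler or kmer
    labels) that produced a best model for that gene. This input format
    is hypothetical; it is not an Evigene file format.
    Returns {method: (count, count_pct, unique, unique_pct)}.
    """
    total = len(best_by_gene)
    count = Counter()
    unique = Counter()
    for methods in best_by_gene.values():
        for method in methods:
            count[method] += 1
        # A gene is "unique" to a method when no other method ties for best.
        if len(methods) == 1:
            unique[next(iter(methods))] += 1
    return {
        m: (count[m], 100 * count[m] / total, unique[m], 100 * unique[m] / total)
        for m in sorted(count)
    }
```

Because ties are counted once per method, the Count column can sum to more than 100%, as it does in the tables, while the Unique column cannot.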

There are various comparison papers out there, contradicting each other, on how to pick a best gene assembler. One reason for those contradictions is that some comparisons use only a single kmer setting, which misses the genes that assemble best at other kmer sizes, or use error-prone ways of merging multiple gene assemblies. The Evigene way is to produce and assess millions of gene assemblies for coding sequence qualities, pulling out the most complete genes from that huge pile of chaff.
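
The idea of picking the most complete genes out of many candidate assemblies can be illustrated with a toy selection step. This is not Evigene's code: the `Candidate` fields, the grouping of candidates by locus, and the ranking order (reference alignment first, then completeness, then protein length) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One assembled transcript and its coding metrics (hypothetical fields)."""
    locus: str        # gene locus the candidate belongs to, assumed known here
    assembler: str    # e.g. "velv", "idba", "soap", "trin"
    kmer: int         # kmer size used for this assembly
    aa_len: int       # length of the encoded protein, in amino acids
    complete: bool    # True if the protein has both start and stop codons
    ref_align: float  # percent of a reference protein aligned


def quality_key(c):
    # Assumed ranking: reference alignment, then completeness, then size.
    return (c.ref_align, c.complete, c.aa_len)


def pick_best(candidates):
    """Keep the highest-ranked candidate for each locus."""
    best = {}
    for c in candidates:
        current = best.get(c.locus)
        if current is None or quality_key(c) > quality_key(current):
            best[c.locus] = c
    return best
```

Pooling candidates from every assembler and kmer setting before this step is what lets each method contribute the genes it assembles best.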

Many gene assembler comparison papers focus on technical measures like "N50" length of transcripts, or "reads-mapped-back" counts of gene fragments recovered. These are not measures of biological accuracy. I can easily construct transcripts that map all reads and that are longer than anyone else's, but these are not biological transcripts, they are artifacts. A simple, meaningful replacement for N50 transcript length as a gene quality measure is the average length of the 1000 longest proteins, which has a biological maximum, is quick and easy to calculate, and usefully compares gene sets of the same and related species.
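
Both measures are short to compute, which makes the contrast easy to try. The sketch below is a minimal Python version, assuming protein sequences arrive as FASTA text lines; the function names are hypothetical and this is not Evigene's own script.

```python
def fasta_lengths(lines):
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            # Drop a trailing '*' stop symbol so it is not counted as a residue.
            current += len(line.rstrip("*"))
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length N such that sequences of length >= N hold half the total bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for n in ordered:
        running += n
        if running >= half:
            return n
    return 0


def mean_top_proteins(aa_lengths, top=1000):
    """Average length of the `top` longest proteins (fewer if not available)."""
    longest = sorted(aa_lengths, reverse=True)[:top]
    return sum(longest) / len(longest) if longest else 0.0
```

Joining four 500-base transcripts end to end into one 2000-base artifact raises `n50` from 500 to 2000 without adding any biology, which is the weakness described above. Protein lengths do not grow that way, since a falsely joined transcript still encodes proteins bounded by stop codons.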

-- Don Gilbert, 2016 March


Developed at the Genome Informatics Lab of Indiana University Biology Department