
EvidentialGene : evgpipe_sra2genes results for
Danio rerio (zebrafish): in progress, more data coming

  • Example Zebrafish genes improved in Evigene vs NCBI and Ensembl sets
  • Thousands of zebrafish genes are improved in the Evigene reconstruction versus the NCBI RefSeq gene set of 2016 and the Ensembl gene set of 2017 (the set ZFIN uses).

  • Danio rerio Gene/Genome map
    Name                Last modified        Size
    Parent Directory    30-Dec-2017 20:46    -
    aaeval/             28-Dec-2017 14:16    -
    docs/               05-Jan-2018 14:15    -
    evgmethods/         28-Dec-2017 14:10    -
    map/                30-Dec-2017 20:43    -
    publicset/          28-Dec-2017 14:11    -
    rnasets/            28-Dec-2017 13:56    -

    Zebrafish gene set improvement with EvidentialGene 
    using a new automated SRA2Genes pipeline.
    This SRA2Genes pipeline collects several EvidentialGene methods into a
    complete, nearly automated gene set reconstruction pipeline. It fetches
    public RNA-seq gene pieces from NCBI SRA, over-assembles them into many
    millions of gene models by varying assembly methods and data slices, then
    reduces the over-assembly to its most accurate non-redundant coding
    gene loci and alternates. This is followed by annotation with
    reference/related species proteins and gene names, checks for
    contaminants, and formatting of gene sequence sets to publication
    quality for public databases.
    Preliminary zebrafish17evigene gene set info is at
    The Evigene software package including omnibus is available at
       evigene18jan01.tar  (draft2 of evgpipe_sra2genes)
    I took zebrafish as one test case of this Evigene sra2genes pipeline, as
    it is among the top 10 species with public RNA-seq studies, and my prior
    work with fish genes suggested that the published zebrafish genes could
    be improved.  That proved true in comparisons to other fish and
    vertebrate gene sets.  The Evigene draft set is more complete and
    accurate in representing zebrafish genes than the Ensembl or NCBI sets by
    objective measures of gene orthology.
    Completeness and accuracy comparisons are to the NCBI and Ensembl gene
    sets of zebrafish, modeled on chromosome assembly GRCz10. The Evigene set
    is built from RNA assembly only, without using chromosomes or other
    species' genes to reconstruct.  That evidence is used instead for
    validating and reclassifying the RNA constructs.
    Conserved vertebrate genes in zebrafish gene sets 
    Gene set    Align   Compl Frag Miss
    Evigene17   443.1   2572    5    9   Evigene gene set, 2017 Dec
    NCBI16      433.8   2554   13   19   NCBI RefSeq gene set, 2016 Dec
    Ensembl17   426.8   2510   47   29   Ensembl gene set, 2017 Nov
    The NCBI RefSeq gene and chromosome ID used is GCF_000002035.5_GRCz10.
    These are measured against the BUSCO vertebrate subset of OrthoDB v9. The
    Align score is the average alignment to conserved (ancestral) proteins,
    and Compl/Frag/Miss are the complete, fragment and missing statistics from
    the BUSCO calculation of HMM search for those ancestral vertebrate
    one-copy genes.  Note that this is a subset of 10% or less of the
    ortholog genes in fishes; many of the rest are multi-copy or fish
    clade-specific.
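
    As a quick check on the BUSCO table above, the counts can be turned into
    percentages. This is a minimal sketch, not part of the Evigene software:
    the counts are copied from the table, while the function name
    busco_percentages is a hypothetical helper introduced here for
    illustration. Each row sums to the same total of 2586 genes searched.

```python
# Counts copied from the table above: (complete, fragment, missing).
busco_counts = {
    "Evigene17": (2572, 5, 9),
    "NCBI16": (2554, 13, 19),
    "Ensembl17": (2510, 47, 29),
}


def busco_percentages(complete, fragment, missing):
    """Return (complete%, fragment%, missing%) of all genes searched.

    Hypothetical helper for illustration; percentages are rounded
    to one decimal place.
    """
    total = complete + fragment + missing
    return tuple(round(100.0 * n / total, 1) for n in (complete, fragment, missing))


for name, counts in busco_counts.items():
    total = sum(counts)
    compl, frag, miss = busco_percentages(*counts)
    print(f"{name:<10} n={total}  complete={compl}%  fragment={frag}%  missing={miss}%")
```

    By this measure all three sets are close to complete for the conserved
    one-copy genes, and they differ mainly in their fragment and missing
    counts.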
    A more complete orthology assessment is done using 3 related fish: a
    cavefish, carp and catfish, all drawn from NCBI's RefSeq models. Although
    any single gene set can be presumed to have mistakes, cross-species
    alignments indicate biological accuracy, because errors should not be
    correlated between species, especially for the Evigene set, which did not
    use any cross-species models for reconstruction.
      Reference Cavefish_sa (n=28811, Sinocyclocheilus_anshuiensis)
    Gene set    Found   Align   Frag  Best
    Evigene17   97.0%   96.5%   0.5%  50.4%
    NCBI16      93.9%   92.7%   3.9%   6.3%  (43.1% equal)
      Reference Carp (n=36674, Cyprinus_carpio)
    Gene set    Found   Align   Frag  Best
    Evigene17   92.8%   94.1%   0.6%  52.4%
    NCBI16      85.6%   86.3%   8.9%   5.8%  (41.6% equal)
    Ensembl17     todo (below NCBI but has some not in NCBI set)
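
    The Best and equal columns above can be read as a per-reference-gene
    tally of which zebrafish set aligns better. The page does not say how
    they were computed, so the following is only a sketch under that assumed
    reading: the function tally_best, the score dictionaries and the gene
    IDs are all hypothetical, and the toy scores are not real data.

```python
def tally_best(scores_a, scores_b):
    """Compare two gene sets against the same reference proteins.

    scores_a, scores_b: dict mapping reference gene ID -> alignment score.
    A reference gene absent from a dict is treated as not found (score 0).
    Returns percentages (a_best, b_best, equal) over all reference genes.
    Assumed interpretation of the Best/equal columns, for illustration only.
    """
    refs = set(scores_a) | set(scores_b)
    if not refs:
        return (0.0, 0.0, 0.0)
    a_best = b_best = equal = 0
    for ref in refs:
        a = scores_a.get(ref, 0)
        b = scores_b.get(ref, 0)
        if a > b:
            a_best += 1
        elif b > a:
            b_best += 1
        else:
            equal += 1
    n = len(refs)
    return tuple(round(100.0 * k / n, 1) for k in (a_best, b_best, equal))


# Hypothetical toy scores for four reference genes (not real data).
evigene = {"ref1": 98, "ref2": 95, "ref3": 90, "ref4": 100}
ncbi = {"ref1": 98, "ref2": 80, "ref3": 92}

print(tally_best(evigene, ncbi))
```

    Under this reading, the two Best values and the equal value in each
    table account for nearly all reference genes (for cavefish,
    50.4% + 6.3% + 43.1% = 99.8%).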
    This zebrafish17evigene is a draft gene set with some missing and
    inaccurate genes.  After assembling genes from two public RNA projects,
    genes for eye, ear, nose and taste receptor functions, among others,
    were missing. Those selected projects did not include tissue samples
    from the whole head or body of adults, which is a limitation of
    reconstructing genes from expressed RNA: it only works for genes that
    are expressed in the samples.  Also Titin, the largest vertebrate gene
    at 30,000 aa or 100,000 bases, is still in pieces, the largest being
    20,000 aa.  The problem here is likely that not enough of the available
    data and assembly options were used for this repetitive muscle gene.

    Developed at the Genome Informatics Lab of Indiana University Biology Department