
Index of /EvidentialGene/vertebrates/zebrafish/zebrafish17evigene/publicset

      Name                                           Last modified       Size

[DIR] Parent Directory                               26-Apr-2018 20:50      -
[   ] zebrafish17evigene_m6pt.ann.txt.gz             19-Apr-2018 11:31  11.3M
[TXT] zebrafish17evigene_m6pt.pubids                 12-Apr-2018 15:39  73.9M
[   ] zebrafish17evigene_m6pt.public.aa.gz           19-Apr-2018 11:49  26.8M
[   ] zebrafish17evigene_m6pt.public.cds.gz          19-Apr-2018 11:52  69.7M
[   ] zebrafish17evigene_m6pt.public.grcz10.gff.gz   21-Apr-2018 15:49  51.5M
[   ] zebrafish17evigene_m6pt.public.mrna.gz         19-Apr-2018 11:47   109M
[TXT] zebrafish17evigene_m6pt.stats.txt              28-Apr-2018 14:50     8k
[   ] zebrafish17evigene_m6pt.xcull.aa.gz            19-Apr-2018 11:49  54.2M
[   ] zebrafish17evigene_m6pt.xcull.cds.gz           19-Apr-2018 11:52  93.3M
[   ] zebrafish17evigene_m6pt.xcull.grcz10.gff.gz    21-Apr-2018 15:46  61.6M
[   ] zebrafish17evigene_m6pt.xcull.mrna.gz          19-Apr-2018 11:47   122M
[TXT]                                                05-Jan-2018 15:11     4k

Zebrafish gene set improvement with EvidentialGene 
using a new automated SRA2Genes pipeline.

This SRA2Genes pipeline collects several EvidentialGene methods into a
complete, nearly automated gene set reconstruction pipeline. It fetches
public RNA-seq gene pieces from NCBI SRA, over-assembles them into many
millions of gene models by varying assembly methods and data slices, then
reduces the over-assembly to its most accurate non-redundant coding
gene loci and alternates. This is followed by annotation with
reference/related species proteins and gene names, checks for contaminants,
and formatting of gene sequence sets to publication quality for public
databases.

Preliminary zebrafish17evigene gene set info is at

The Evigene software package including omnibus is available at
   evigene18jan01.tar  (draft2 of evgpipe_sra2genes)
I took zebrafish as one test case of this Evigene sra2genes pipeline, as
it is among the top 10 species for public RNA-seq studies, and my prior work
with fish genes suggested that published zebrafish genes may be amenable to
improvement.  That proved true in comparisons to other fish and
vertebrate gene sets.  The Evigene draft set is more complete and
accurate in representing zebrafish genes than the Ensembl or NCBI sets by
objective measures of gene orthology.

Completeness and accuracy comparisons are to the NCBI and Ensembl gene sets
of zebrafish, which are modeled on chromosome assembly GRCz10. The Evigene
set is built from RNA assembly only, without using chromosomes or other
species' genes for reconstruction.  Those gene evidences are used for
validating and reclassifying the RNA constructs.

Conserved vertebrate genes in zebrafish gene sets 
Gene set    Align   Compl Frag Miss
Evigene17   443.1   2572    5    9   Evigene gene set, 2017 Dec
NCBI16      433.8   2554   13   19   NCBI RefSeq gene set, 2016 Dec
Ensembl17   426.8   2510   47   29   Ensembl gene set, 2017 Nov

The NCBI RefSeq gene and chromosome ID used is GCF_000002035.5_GRCz10.
These are measured against the BUSCO vertebrate subset of OrthoDB v9. The
Align score is the average alignment to conserved (ancestral) proteins, and
Compl/Frag/Miss are the complete, fragmented and missing statistics from the
BUSCO calculation of HMM search for those ancestral vertebrate one-copy
genes. Note that this is a 10% or less subset of the ortholog genes in
fishes; many others are multi-copy or fish clade-specific.
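
The BUSCO counts in the table can be turned into percentages for easier
comparison. The following is a minimal Python sketch, not part of the Evigene
package; the function and variable names are made up for illustration. The
counts are copied from the table, and the total of 2586 genes is only what the
three columns of each row add up to.

```python
# BUSCO counts copied from the table above: (complete, fragmented, missing).
busco_counts = {
    "Evigene17": (2572, 5, 9),
    "NCBI16": (2554, 13, 19),
    "Ensembl17": (2510, 47, 29),
}


def busco_percentages(complete, fragmented, missing):
    """Convert raw BUSCO counts to percentages of the row total."""
    total = complete + fragmented + missing
    return {
        "total": total,
        "complete_pct": round(100.0 * complete / total, 1),
        "fragmented_pct": round(100.0 * fragmented / total, 1),
        "missing_pct": round(100.0 * missing / total, 1),
    }


# Hypothetical summary structure: one dict of percentages per gene set.
summary = {name: busco_percentages(*counts) for name, counts in busco_counts.items()}

for name, stats in summary.items():
    print(
        f"{name:10s} n={stats['total']} "
        f"complete={stats['complete_pct']}% "
        f"fragmented={stats['fragmented_pct']}% "
        f"missing={stats['missing_pct']}%"
    )
```

Each row sums to the same total, so the three gene sets are scored against
the same set of conserved genes.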

A more complete orthology assessment is done using 3 related fish: a
cavefish, carp and catfish, all drawn from NCBI's RefSeq models. Although
any single gene set can be presumed to have mistakes, cross-species
alignments indicate biological accuracy, since errors should not be
correlated between species, especially for the Evigene set, which did not
use any cross-species models for reconstruction.

  Reference Cavefish_sa (n=28811, Sinocyclocheilus_anshuiensis)
Gene set    Found   Align   Frag   Best
Evigene17   97.0%   96.5%   0.5%  50.4%
NCBI16      93.9%   92.7%   3.9%   6.3%   (43.1% equal)

  Reference Carp (n=36674, Cyprinus_carpio)
Gene set    Found   Align   Frag   Best
Evigene17   92.8%   94.1%   0.6%  52.4%
NCBI16      85.6%   86.3%   8.9%   5.8%   (41.6% equal)
Ensembl17     todo (below NCBI but has some not in NCBI set)
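
How the Found/Best/equal columns were computed is not spelled out here. As a
rough illustration only, the Python sketch below tallies such columns from
per-reference-gene alignment scores for two gene sets. The function name, the
input layout (reference gene ID mapped to percent alignment) and the toy
scores are all assumptions, not Evigene code or real data. It assumes that
"Found" means any alignment above zero, and that "Best" means a strictly
higher score than the other set.

```python
def compare_gene_sets(scores_a, scores_b, reference_ids):
    """Tally Found/Best/Equal percentages for two gene sets.

    scores_a, scores_b: dict of reference gene ID -> alignment score;
    a reference gene with no hit is simply absent (treated as 0.0).
    """
    n = len(reference_ids)
    found_a = found_b = best_a = best_b = equal = 0
    for ref_id in reference_ids:
        a = scores_a.get(ref_id, 0.0)
        b = scores_b.get(ref_id, 0.0)
        found_a += a > 0
        found_b += b > 0
        if a > b:
            best_a += 1
        elif b > a:
            best_b += 1
        else:
            equal += 1  # ties, including genes missing from both sets

    def pct(count):
        return round(100.0 * count / n, 1)

    return {
        "found_a": pct(found_a), "found_b": pct(found_b),
        "best_a": pct(best_a), "best_b": pct(best_b),
        "equal": pct(equal),
    }


# Toy scores (made up): four reference genes, two candidate gene sets.
reference_ids = ["ref1", "ref2", "ref3", "ref4"]
toy_a = {"ref1": 98.0, "ref2": 90.0, "ref3": 75.0, "ref4": 60.0}
toy_b = {"ref1": 98.0, "ref2": 85.0, "ref4": 70.0}

toy_result = compare_gene_sets(toy_a, toy_b, reference_ids)
print(toy_result)
```

Under these assumptions the two Best columns and the equal share add up to
100%, which is roughly what the tables above show after rounding.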

This zebrafish17evigene is a draft gene set with some missing and
inaccurate genes.  After assembling genes from two public RNA projects,
gene functions were missing for eye, ear, nose and taste receptor
genes, among others. The selected projects did not include tissue samples
from the whole head or body of adults. This is a limitation of
reconstructing genes from expressed RNA: it only works for genes that are
expressed in the samples.  Also Titin, the largest vertebrate gene at
30,000 aa or 100,000 bases, is still in pieces, the largest being 20,000 aa.
The likely problem here is not using enough of the available data and
assembly options for this repetitive muscle gene.
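
One way to check for fragmented long genes such as Titin is to look at the
longest protein in a gene set. The Python sketch below is an illustration
only, using a small hand-written FASTA reader rather than any Evigene tool;
the function names are hypothetical and the demo sequences are made up. It
assumes the protein files are ordinary FASTA text, possibly with a trailing
"*" stop symbol.

```python
import gzip
import io


def read_fasta(handle):
    """Yield (sequence ID, sequence) pairs from a FASTA text handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            parts = line[1:].split()
            seq_id, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def longest_protein(handle):
    """Return (ID, length in aa) of the longest sequence, ignoring a trailing '*'."""
    best_id, best_len = None, 0
    for seq_id, seq in read_fasta(handle):
        length = len(seq.rstrip("*"))
        if length > best_len:
            best_id, best_len = seq_id, length
    return best_id, best_len


# Small in-memory demo with made-up sequences.
demo = io.StringIO(">geneA type=protein\nMKT\nLLV*\n>geneB\nMK\n")
demo_result = longest_protein(demo)
print(demo_result)

# On a downloaded file from the listing above (path assumed to be local):
# with gzip.open("zebrafish17evigene_m6pt.public.aa.gz", "rt") as fh:
#     print(longest_protein(fh))
```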

Developed at the Genome Informatics Lab of Indiana University Biology Department