Example Zebrafish genes improved in Evigene vs NCBI and Ensembl sets
Human genes found in Evigene zebrafish reconstructions, vs NCBI and Ensembl sets
Thousands of zebrafish genes are improved in the Evigene reconstruction,
compared with the NCBI RefSeq gene set of 2016 and the Ensembl gene set of
2017 (the set ZFIN uses).
Danio rerio Gene/Genome map
Zebrafish gene set improvement with EvidentialGene
using a new automated SRA2Genes pipeline.
This SRA2Genes pipeline collects several EvidentialGene methods into a
complete, nearly automated gene set reconstruction pipeline. It fetches
public RNA-seq gene pieces from NCBI SRA, over-assembles them into many
millions of gene models by varying assembly methods and data slices, then
reduces the over-assembly to its most accurate non-redundant coding
gene loci and alternates. This is followed by annotation with
reference/related species proteins and gene names, checks for contaminants,
and formatting of gene sequence sets to publication quality for public
database submission.
Preliminary zebrafish17evigene gene set info is at
http://eugenes.org/EvidentialGene/vertebrates/zebrafish/
The Evigene software package including omnibus evgpipe_sra2genes.pl is available at
http://arthropods.eugenes.org/EvidentialGene/other/evigene_old/
evigene18jan01.tar (draft2 of evgpipe_sra2genes)
I took zebrafish as one test case of this Evigene sra2genes pipeline, as
it is among the top 10 species by number of public RNA-seq studies, and my
prior work with fish genes suggested that the published zebrafish genes may
be amenable to improvement. That proved true in comparisons to other fish
and vertebrate gene sets. The Evigene draft set is more complete and
accurate in representing zebrafish genes than the Ensembl or NCBI sets, by
objective measures of gene orthology.
Completeness and accuracy comparisons are to the NCBI and Ensembl gene sets
of zebrafish, which are modeled on chromosome assembly GRCz10. The Evigene
set is built from RNA assembly only, without using chromosomes or other
species' genes for reconstruction. That chromosome and cross-species gene
evidence is used only for validating and reclassifying the RNA constructs.
Conserved vertebrate genes in zebrafish gene sets
Gene set    Align  Compl  Frag  Miss  Description
---------------------------------------------------------------
Evigene17   443.1   2572     5     9  Evigene gene set, 2017 Dec
NCBI16      433.8   2554    13    19  NCBI RefSeq gene set, 2016 Dec
Ensembl17   426.8   2510    47    29  Ensembl gene set, 2017 Nov
---------------------------------------------------------------
The NCBI RefSeq gene and chromosome ID used is GCF_000002035.5_GRCz10.
These are measured against the BUSCO vertebrate subset of OrthoDB v9. The
Align score is the average alignment to conserved (ancestral) proteins, and
Compl/Frag/Miss are the complete, fragment and missing statistics from the
BUSCO calculation of HMM search for those ancestral vertebrate one-copy
genes. Note that this is a subset of 10% or less of the orthologous genes in
fishes; many fish genes are multi-copy or fish clade-specific.
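The BUSCO counts in the table above can be turned into percentages for
easier comparison. The following is a minimal sketch, not part of Evigene
or BUSCO: the names busco_counts and busco_percentages are hypothetical,
and the total is assumed to be the sum of the Compl, Frag and Miss columns
copied from the table.

```python
# Counts copied from the table above: (complete, fragment, missing).
busco_counts = {
    "Evigene17": (2572, 5, 9),
    "NCBI16": (2554, 13, 19),
    "Ensembl17": (2510, 47, 29),
}


def busco_percentages(complete, fragment, missing):
    """Return the total and the percent of each class, rounded to 0.1."""
    # Assumption: the three classes partition the conserved gene set.
    total = complete + fragment + missing
    return {
        "total": total,
        "complete_pct": round(100.0 * complete / total, 1),
        "fragment_pct": round(100.0 * fragment / total, 1),
        "missing_pct": round(100.0 * missing / total, 1),
    }


summary = {name: busco_percentages(*counts)
           for name, counts in busco_counts.items()}

for name, row in summary.items():
    print(f"{name:10s} n={row['total']} "
          f"complete={row['complete_pct']}% "
          f"fragment={row['fragment_pct']}% "
          f"missing={row['missing_pct']}%")
```

All three rows sum to the same total, which is a quick consistency check
that the gene sets were scored against the same conserved gene collection.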
A more complete orthology assessment is done using 3 related fish: a
cavefish, carp and catfish, all drawn from NCBI's RefSeq models. Although
any single gene set can be presumed to have mistakes, cross-species
alignments indicate biological accuracy, because errors should not be
correlated between species. This holds especially for the Evigene set,
which did not use any cross-species models for reconstruction.
Reference Cavefish_sa (n=28811, Sinocyclocheilus_anshuiensis)
Gene set    Found   Align   Frag   Best
Evigene17   97.0%   96.5%   0.5%   50.4%
NCBI16      93.9%   92.7%   3.9%    6.3%   43.1% equal
Reference Carp (n=36674, Cyprinus_carpio)
Gene set    Found   Align   Frag   Best
Evigene17   92.8%   94.1%   0.6%   52.4%
NCBI16      85.6%   86.3%   8.9%    5.8%   41.6% equal
Ensembl17   todo (below NCBI but has some not in NCBI set)
------------------------------------
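The Best and equal figures above read as a head-to-head split of reference
genes between the two zebrafish sets. The sketch below checks that reading
under a stated assumption: that Best is the share of reference genes where
one set aligns better and equal is the share of ties. The names best_share
and head_to_head are hypothetical, and the numbers are copied from the
tables above.

```python
# Percent of reference genes, copied from the tables above.
best_share = {
    "Cavefish_sa": {"Evigene17": 50.4, "NCBI16": 6.3, "equal": 43.1},
    "Carp": {"Evigene17": 52.4, "NCBI16": 5.8, "equal": 41.6},
}


def head_to_head(shares):
    """Return (percent accounted for, Evigene17 minus NCBI16 margin)."""
    # If Best and equal partition the reference genes, this is near 100.
    accounted = round(sum(shares.values()), 1)
    margin = round(shares["Evigene17"] - shares["NCBI16"], 1)
    return accounted, margin


results = {ref: head_to_head(shares) for ref, shares in best_share.items()}

for ref, (accounted, margin) in results.items():
    print(f"{ref}: accounted={accounted}% Evigene17 margin={margin} points")
```

Under that assumption the three shares account for nearly all reference
genes in both comparisons, with the small remainder presumably due to
rounding or genes found in neither set.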
This zebrafish17evigene is a draft gene set with some missing and
inaccurate genes. After assembling genes from two public RNA projects,
gene functions were still missing for eye, ear, nose and taste receptor
genes, among others. The selected projects did not include tissue samples
from the whole head or body of adults. This is a limitation of
reconstructing genes from expressed RNA: it only works for genes that are
expressed in the samples. Also Titin, the largest vertebrate gene at
30,000 aa or 100,000 bases, is still in pieces, the largest being
20,000 aa. The problem here is likely that not enough of the available data
and assembly options were used for this repetitive muscle gene.