Pig gene set improvement with EvidentialGene using its new SRA2Genes pipeline.

This SRA2Genes pipeline collects several EvidentialGene methods into a
complete, automated gene set reconstruction pipeline for fetching
public RNA-seq gene pieces from NCBI SRA, over-assembling that into many
millions of gene models, varying assembly methods and data slices, then
reducing the over-assembly to its most accurate non-redundant coding
gene loci and alternates, followed by annotation with reference/related
species proteins and gene names, with checks for contaminants, and
formatting of gene sequence sets to publication quality for public databases.

Preliminary pig18evigene gene set info is at

The Evigene software package including omnibus is available at

Completeness and accuracy comparisons are to the NCBI RefSeq gene set of
the pig, modeled on the chromosome assembly. The Evigene set is built from
RNA assembly only, without using chromosomes or other species genes to
reconstruct.  Those gene evidences are used for validating and
reclassifying the RNA constructs.

TABLE G3. Sus_scrofa gene sets compared for gene evidence recovery

G3a.  Conserved vertebrate genes in pig gene sets (BUSCO v9)
Geneset Align   Full    Frag    Miss   Best
Evigene 447     2568    10       8     776 (30%),  1730 same (67%)
NCBI    440     2567     2      17      80 ( 3%)
Ensembl 431     2552    20      14      na

G3b. Reference Human (Homo_sapiens, NCBI 2018 RefSeq)
Geneset Found   Align   Frag  Best
Evigene 99.3%   96.0    1.7     20   55% equal
NCBI    99.3%   97.2    0.6     25
for 37,883 human protein isoforms that are uniquely found in either pig gene set

The G3a scores are measured against the BUSCO vertebrate subset of OrthoDB v9.  The
Full/Frag/Miss columns are the complete, fragment and missing statistics from the BUSCO
calculation of HMM search for those ancestral vertebrate one-copy genes.  Align = average
alignment (aa) to conserved (ancestral) reference proteins, Full = complete alignment to
conserved proteins, Frag = fragment alignment, Miss = no alignment, Best = percentage of
best alignments per gene set in pairwise matches to each reference gene.
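
As a quick arithmetic check on Table G3a, the sketch below recomputes
percentages from the counts printed in the table. It assumes that Full, Frag
and Miss partition the reference gene set, and that the Best column splits the
same set into Evigene-best, same, and NCBI-best; the variable and function
names are illustrative only and are not part of Evigene or BUSCO.

```python
# Counts copied from Table G3a: (Full, Frag, Miss) per gene set.
g3a_counts = {
    "Evigene": (2568, 10, 8),
    "NCBI": (2567, 2, 17),
    "Ensembl": (2552, 20, 14),
}


def busco_percentages(full, frag, miss):
    """Return the total reference genes and each class as a percentage."""
    total = full + frag + miss
    return {
        "total": total,
        "full_pct": round(100.0 * full / total, 1),
        "frag_pct": round(100.0 * frag / total, 1),
        "miss_pct": round(100.0 * miss / total, 1),
    }


summary = {name: busco_percentages(*c) for name, c in g3a_counts.items()}

# Best column of G3a, Evigene versus NCBI (assumed to cover every reference gene).
best_counts = {"Evigene best": 776, "same": 1730, "NCBI best": 80}
best_total = sum(best_counts.values())
best_pct = {k: round(100.0 * v / best_total) for k, v in best_counts.items()}

for name, row in summary.items():
    print(name, row)
print(best_total, best_pct)
```

Under those assumptions every row sums to the same number of reference genes,
and the Best counts sum to that number as well, which is consistent with the
30% / 67% / 3% split shown in the table.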

A more complete orthology assessment (G3b) is done using 4 vertebrates,
human, mouse, cow and zebrafish, all drawn from NCBI's RefSeq models. Although
any single gene set can be presumed to have mistakes, cross-species
alignments indicate biological accuracy, because errors should not be
correlated between species, especially for the Evigene set, which did not use
any cross-species models for reconstruction.
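
The pairwise "Best" comparison used in these tables can be sketched roughly as
follows. This is a minimal illustration, not the Evigene implementation: it
assumes each gene set has one alignment score per reference protein, treats a
missing reference as score 0, and counts scores within a tolerance as equal.
The function name, the tolerance parameter, and the toy scores are all
hypothetical.

```python
def tally_best(scores_first, scores_second, tolerance=0):
    """Count, per reference protein, which gene set aligns better.

    scores_first / scores_second map reference IDs to an alignment score
    (for example aligned amino acids). A reference absent from one gene
    set is treated as score 0 (an assumption of this sketch).
    """
    tally = {"first_best": 0, "second_best": 0, "same": 0}
    for ref_id in set(scores_first) | set(scores_second):
        a = scores_first.get(ref_id, 0)
        b = scores_second.get(ref_id, 0)
        if abs(a - b) <= tolerance:
            tally["same"] += 1
        elif a > b:
            tally["first_best"] += 1
        else:
            tally["second_best"] += 1
    return tally


def as_percent(tally):
    """Convert counts to whole-number percentages of all references."""
    total = sum(tally.values())
    return {k: round(100.0 * v / total) for k, v in tally.items()}


# Toy scores for illustration only; these are not values from the pig gene sets.
evigene_scores = {"refA": 520, "refB": 310, "refC": 415}
ncbi_scores = {"refA": 520, "refB": 298, "refD": 120}

counts = tally_best(evigene_scores, ncbi_scores)
print(counts, as_percent(counts))
```

Running the same tally separately against each reference species gives
per-species Best and equal percentages of the kind reported in G3b.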

Developed at the Genome Informatics Lab of Indiana University Biology Department