Pig gene set improvement with EvidentialGene using its new SRA2Genes pipeline.

SRA2Genes collects several EvidentialGene methods into a complete, automated gene set reconstruction pipeline. It fetches public RNA-seq gene pieces from NCBI SRA, over-assembles them into many millions of gene models by varying assembly methods and data slices, then reduces the over-assembly to its most accurate non-redundant coding gene loci and alternates. This is followed by annotation with reference/related species proteins and gene names, checks for contaminants, and formatting of gene sequence sets to publication quality for public database submission.

Preliminary pig18evigene gene set info is at http://eugenes.org/EvidentialGene/vertebrates/pig/
The Evigene software package, including the omnibus evgpipe_sra2genes.pl, is available at http://arthropods.eugenes.org/EvidentialGene/other/evigene_old/

Completeness and accuracy comparisons are to the NCBI RefSeq gene set of the pig, which is modeled on the chromosome assembly. The Evigene set is built from RNA assembly only, without using chromosomes or other species' genes for reconstruction. Those gene evidences are instead used for validating and reclassifying the RNA constructs.

TABLE G3. Sus_scrofa gene sets compared for gene evidence recovery

G3a. Conserved vertebrate genes in pig gene sets (BUSCO v9)
Geneset   Align  Full  Frag  Miss  Best
---------------------------------------------
Evigene   447    2568  10     8    776 (30%), 1730 same (67%)
NCBI      440    2567   2    17     80 ( 3%)
Ensembl   431    2552  20    14    na
---------------------------------------------

G3b. Reference Human (Homo_sapiens, NCBI 2018 RefSeq)
Geneset   Found  Align  Frag  Best
------------------------------------------
Evigene   99.3%  96.0   1.7   20    55% equal
NCBI      99.3%  97.2   0.6   25
------------------------------------------
for 37,883 human protein isoforms that are uniquely found in either pig gene set

The G3a scores are measured against the BUSCO vertebrate subset of OrthoDB v9.
The Align score is the average alignment to conserved (ancestral) proteins, and Full/Frag/Miss are the complete, fragment and missing statistics from the BUSCO calculation of HMM searches for those ancestral vertebrate one-copy genes.

Align = average alignment (aa) to reference proteins
Full  = complete alignment to conserved proteins
Frag  = fragment alignment
Miss  = no alignment
Best  = percentage of best alignments per gene set in pairwise matches to each reference gene

A more complete orthology assessment (G3b) is done using 4 vertebrates (human, mouse, cow and zebrafish), all drawn from NCBI's RefSeq models. Although any single gene set can be presumed to have mistakes, cross-species alignments indicate biological accuracy: errors should not be correlated between species, especially for the Evigene set, which did not use any cross-species models for reconstruction.
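
The Best column and the Full/Frag/Miss counts can be related to percentages with a little arithmetic. The following Python sketch is only an illustration, not part of the Evigene package: the function names tally_best and as_percent, the tolerance parameter, the rule that a missing reference counts as alignment 0, and the toy alignment values are all assumptions made here. It tallies, for each reference gene, which of two gene sets aligns better (or whether they align the same), then converts counts to percentages. The counts passed to as_percent at the end are copied from table G3a; the three counts in each G3a row sum to 2586.

```python
def tally_best(align_a, align_b, tolerance=0.0):
    """Pairwise comparison of two gene sets against shared reference genes.

    align_a, align_b: dict mapping reference gene ID -> alignment score
    (hypothetical input format). A reference gene absent from one set is
    treated as alignment 0 for that set (an assumption of this sketch).
    """
    counts = {"a_best": 0, "b_best": 0, "same": 0}
    for ref_id in set(align_a) | set(align_b):
        a = align_a.get(ref_id, 0.0)
        b = align_b.get(ref_id, 0.0)
        if abs(a - b) <= tolerance:
            counts["same"] += 1
        elif a > b:
            counts["a_best"] += 1
        else:
            counts["b_best"] += 1
    return counts


def as_percent(counts, ndigits=1):
    """Convert a dict of counts to percentages of their total."""
    total = sum(counts.values())
    return {key: round(100.0 * n / total, ndigits) for key, n in counts.items()}


if __name__ == "__main__":
    # Toy alignment scores, invented for illustration only.
    evigene = {"ref1": 98.0, "ref2": 75.0, "ref3": 100.0}
    ncbi = {"ref1": 98.0, "ref2": 90.0, "ref3": 60.0, "ref4": 50.0}
    print(tally_best(evigene, ncbi))

    # Counts taken from table G3a: Best column (Evigene vs NCBI).
    print(as_percent({"evigene_best": 776, "ncbi_best": 80, "same": 1730}, 0))

    # Counts taken from table G3a: Evigene Full/Frag/Miss.
    print(as_percent({"full": 2568, "frag": 10, "miss": 8}))
```

With rounding to whole percentages, the G3a Best counts reproduce the 30%, 3% and 67% shown in the table.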