Arabidopsis thaliana gene set reconstructed with EvidentialGene

RESULTS

De novo reconstruction of model plant genes from 3 RNA-seq sources, without use of chromosomes or genes of other species, is accurate. Comparison to gene sets from other methods, including Pac-Bio RNA sequencing, Trinity-Illumina assembly, and genome gene models, indicates that the Evigene methods are more accurate than commonly used methods.

A primary goal of this reconstruction is to compare how well RNA assembly methods recover accurate and complete gene sets, notably Evigene's methods versus Pac-Bio RNA gene reconstructions and Trinity/Illumina RNA reconstructions. Comparisons use the reference genes of the At Araport 2016 ("official") set, but genes of other reference species are also included to test whether At Araport is complete, as is the distinct gene evidence from introns of expressed RNA. Secondarily, the Evigene set is compared with the At-Araport and At-Ler genome-modeled gene sets. I used only a fraction of the available RNA-seq samples for this model plant, so, not surprisingly, some genes of the At-Araport set are not expressed in the samples used and were not reconstructed.

These statistical summaries of alignment to reference genes, and of expressed intron recovery, are objective comparisons of the basic accuracy and completeness of these coding gene sets. Overall, Evigene genes reconstructed only from RNA are the most complete and accurate set next to Araport 2016. The Evigene set also includes many additional, valid alternate transcripts relative to the Araport set. It appears a bit more complete than the At-Ler genome-modeled set. The incompleteness of both the Pac-Bio and Trinity-only RNA gene sets, from the same lab samples, relative to the Evigene set is clear: whether measuring all genes or only the subset found in those particular sets, they produce incomplete genes more often than desired.

-------------------------------------------------
Arabidopsis Evigene sets

evg3arath
  5,102,350 gene assemblies were generated from this RNA sample with 11 assembly runs. Evigene coding transcript-only classification reduced these to 37,393 locus transcripts, plus 128,412 alternates.

evg4arath
  5,039,388 gene assemblies were generated from two RNA samples with 10 assembly runs. Evigene coding transcript-only classification reduced these to 36,347 locus transcripts, plus 135,007 alternates.

evg5arath
  The transcript-only re-classification and merge of evg3 and evg4 produced a set of 34,299 locus transcripts, plus 132,604 alternate transcripts. Further locus classification by mapping to At TAIR10 chromosomes included reassignments among alternates and paralogs, and reduction of redundant same-locus extra transcripts. This evg5arath set with chromosome-mapped transcripts has 26,134 coding gene loci, 75,102 alternate transcripts, and 138 other locations of paralog locus transcripts. This compares to the reference gene set At2016-Araport11 with 27,655 coding genes and 20,707 alternates.

Arabidopsis PacBio gene set

pacbio16arath
  RNA genes extracted and assembled from 12 raw SRA PacBio RNA data entries, with Pacific Biosciences SMRTAnalysis software, resulted in 353,153 transcripts, an average of 27,165 transcripts per assembly. These were reduced by removing perfectly identical duplicate transcripts and identical sub-transcripts, yielding 94,102 distinct transcripts.

------------
Arabidopsis gene set versions compared

At16Ap = public gene set of 2016 of Arabidopsis thal.
  from Araport.org
Oases = velvet/oases assembly of Illumina RNA
IDBAtr = idba_tran asm of Illumina RNA
SOAPtr = SOAP-Trans asm of Illumina RNA
Trinity = Trinity asm of Illumina RNA
At16Pacb/PacBio = Pac-Bio SMRTAnalysis assembly of Pac-Bio RNA data
At16Ler = genes modeled on At-Ler chromosome assembly, NCBI accession GCA_001651475.1, PRJNA311266
At3EVm = Evigene classified gene set of Illumina RNA assemblies above (O,I,S,T)
At17EVm5/At5EVm = Evigene reduction of assemblies of Illumina RNA-seq, improved over At3EVm with added RNA sets and gene assemblies

At17EVm5 genes of 2017, along with comparison gene sets, are public at
http://arthropods.eugenes.org/EvidentialGene/plants/arabidopsis/evigene2017_arabidopsis/

Summary comparisons
-------------------
1a. Gene assembly methods measured against reference Arath public gene set (2016), unique coding sequences

Alignment to Arabidopsis gene set At16Ap (nt=37806 ref transcripts)
Geneset    nFound  Found%  AlignF%  AlignT%
At5EVm      36072    95.4     95.7     95.0
At3EVm      34294    90.7     94.0     92.3
Oases       34030    90.0     93.5     91.2
IDBAtr      33837    89.5     92.0     89.1
SOAPtr      33598    88.9     90.5     87.0
Trinity     33417    88.4     87.9     84.1
PacBio      21964    58.1     76.7     48.2
---------------------------
1b. Arabidopsis gene sets measured against related species Orange and Cacao.

           Cacao Reference                  Orange Reference
Geneset   nGene    nAlt  Found%  AlignF%   nGene   nAlt  Found%  AlignF%
At17EVm5  23042  134099    89.8     70.2   16739  22226    90.7     74.1
At16Ap    22473  132468    88.7     70.6   16850  22338    91.1     74.7
At16Ler   22650  132005    88.4     69.9   16661  22091    90.1     74.0
At16Pacb  17593   95437    64.2     60.5   11578  15608    64.0     63.5
---------------------------
1c. Intron recovery for Arabidopsis gene sets (ni=125481 of RNA-seq mapped to chrs)

Geneset    GeneTr  valExon  Found%
At17EVm5    83211   109970    87.6
At16Ap      42211   110654    88.1
At16Ler     25187   102381    81.5
At16Pacb    48848    70719    56.3
.. subset gene assemblies ..
Oases      312817   101784    81.1
IDBAtr     248429   101357    80.7
SOAPtr      96533    99274    79.1
Trinity    198380   102208    81.4
----------------------
1d. Equivalency of Evigene and At-Araport gene loci

At16Ap gene loci (n=27652) with equivalent At17EVm5 genes
  21646, 78% : Essentially same coding exons (>=95%)
  22965, 83% : Same or >50% same coding exons
  24065, 87% : Same, or some equal coding exons
   3618, 13% : No equal CDS (includes loci unexpressed in this RNA sample)
    408,  1% : No equal CDS with expressed introns (i.e. omission mistakes in Evigene set)
-------------
If we ignore 3200 (12%) un/weakly-expressed At16Ap genes in this RNA sample, the Evigene set is 90% same-location-equivalent to At16Ap coding sequences. This location-based measure differs from the 95% coding sequence alignment in 1a above (At5EVm to At16Ap), which collapses high-identity paralogs.

Intron recovery
  109751 introns are common to At16Ap and At17EVm5, 106421 also found in RNA-seq
  4354 RNA introns only in At16Ap
  3549 RNA introns only in At17EVm5

At17EVm5 with no equivalent At16Ap gene
  2514 primary and alternate transcripts, 381 of these have unique expressed introns
  ~100 of these appear to be valid genes, lacking an At16Ap equivalent
  141 At17EVm5 have no mapping to AtTAIR10 chromosomes, or very poor mapping, but have strong protein alignment to other plant species. Of these, 86 genes align well on At-Ler chromosomes, suggesting errors in assembly of AtTAIR10 chromosomes (gene RNA is from the same Col-0 genomic clone).

Most common protein functions of non-equivalent At17EVm5 (from At16Ap protein alignment)
  723 unknown, 209 hypothetical
  35 transmembrane, 27 F-box, 20 cytochrome, 19 UDP-glucosyl
  16 Disease resistance, 16 kinase, 15 maternal effect, 13 RING/U-box,
  12 Leucine-rich, 11 MLP-like, 11 UDP-Glycosyltransferase
  10 Cysteine/Histidine-rich, 10 fucosyltransferase
-----------------------------------
Reference Gene Alignments (1a,1b)
Method: BLASTn -query reference-unique.cds -db allgenesets.cds -evalue 1e-5 ..
Statistics
  Found%  = percent of reference transcripts found
  AlignF% = percent alignment to reference transcripts found
  AlignT% = percent alignment to total reference transcripts
Reference genes:
  Orange, NCBI genomes/refseq/plant/Citrus_clementina/GCF_000493195.1_Citrus_clementina_v1.0
  Cacao, Evigene update of Theobroma cacao from RNA-seq supplied by Mars.

Intron Recovery (1c)
Method: map RNA-seq (Illumina) to chromosome assembly with GSNAP, extract splice-mapped reads and their intron locations, tabulate gene-exon x RNA-intron matches.
Statistics
  GeneTr   = gene transcripts total in gene set
  valExon  = gene exons with validated intron
  InFound% = percent of all valid introns recovered between gene exons

Equivalency (1d)
Measured as CDS and exon location overlap between genes.gff mapped to AtTAIR10 chromosomes.
=============================================================
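
Illustration of the reference alignment statistics (1a, 1b)

The Found%, AlignF% and AlignT% columns could be tabulated roughly as in the sketch below. This is not the Evigene code, and several points are assumptions: that the BLASTn hits for one gene set are written in tabular form (-outfmt 6, which the truncated command above does not show), that alignment is counted as the union of aligned reference bases over all hits, and that AlignF% and AlignT% are base-level ratios over found and over all reference transcripts. The function names and the toy identifiers (r1, t1, ...) are hypothetical.

```python
from collections import defaultdict


def covered_length(spans):
    """Total bases covered by 1-based inclusive (start, end) spans, overlaps merged."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(spans):
        if cur_end is None or start > cur_end:
            # New, non-overlapping span: close out the previous one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def parse_blast_tab(lines):
    """Yield (query_id, qstart, qend) from BLAST tabular lines (assumed -outfmt 6)."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        qstart, qend = int(cols[6]), int(cols[7])
        yield cols[0], min(qstart, qend), max(qstart, qend)


def reference_alignment_stats(ref_lengths, hits):
    """Summarize hits of one gene set against reference transcripts.

    ref_lengths: dict of reference transcript id -> CDS length in bases.
    hits: iterable of (ref_id, start, end) aligned spans on the reference.
    """
    spans = defaultdict(list)
    for ref_id, start, end in hits:
        if ref_id in ref_lengths:
            spans[ref_id].append((start, end))

    n_ref = len(ref_lengths)
    aligned = sum(covered_length(s) for s in spans.values())
    found_len = sum(ref_lengths[r] for r in spans)
    total_len = sum(ref_lengths.values())
    return {
        "nFound": len(spans),
        "Found%": 100.0 * len(spans) / n_ref if n_ref else 0.0,
        # Assumed: aligned bases over bases of the found references only.
        "AlignF%": 100.0 * aligned / found_len if found_len else 0.0,
        # Assumed: aligned bases over bases of all references.
        "AlignT%": 100.0 * aligned / total_len if total_len else 0.0,
    }


if __name__ == "__main__":
    # Toy data, not taken from the Arabidopsis comparison.
    toy_refs = {"r1": 100, "r2": 200, "r3": 100}
    toy_lines = [
        "r1\tt1\t99.0\t50\t0\t0\t1\t50\t1\t50\t1e-20\t90",
        "r1\tt2\t99.0\t50\t0\t0\t41\t90\t1\t50\t1e-20\t90",
        "r2\tt3\t98.0\t100\t0\t0\t101\t200\t1\t100\t1e-30\t180",
    ]
    print(reference_alignment_stats(toy_refs, parse_blast_tab(toy_lines)))
```

Under these assumptions AlignT% can never exceed AlignF%, and the gap between them widens as more reference transcripts go unfound, which matches the pattern in table 1a (for example PacBio at 76.7 versus 48.2).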