Name Last modified Size
arabidopsis_evigene17_methods.txt 22-May-2017 13:43 4k
arabidopsis_evigene17_results.txt 22-May-2017 20:28 9k
blastplants/ 22-May-2017 11:31 -
evigene5arath/ 23-May-2017 14:05 -
gene_models/ 22-May-2017 11:31 -
illumina_asm/ 22-May-2017 11:31 -
pacbio_asm/ 21-May-2017 14:21 -
rnasource/ 01-May-2017 16:10 -
Arabidopsis thaliana gene set reconstructed with EvidentialGene
RESULTS
De-novo reconstruction of model plant genes from 3 RNA-seq sources,
without use of chromosomes or other species genes, is accurate.
Comparisons to gene sets of other methods, including Pac-Bio RNA
sequencing, Trinity-Illumina assembly, and genome gene models, indicate
that the Evigene methods are more accurate than commonly used methods.
A primary goal of this reconstruction is to compare how well RNA assembly methods
recover accurate and complete gene sets, notably Evigene's methods versus
Pac-Bio RNA gene reconstructions and Trinity/Illumina RNA reconstructions.
Comparisons use the reference genes of the At Araport 2016 ("official") set,
and also genes of other reference species, to see if At Araport is complete,
as well as the distinct gene evidence from introns of expressed RNA.
Secondarily, the Evigene set is compared with the At-Araport and At-Ler genome-modeled
gene sets. I used only a fraction of the available RNA-seq samples for this model plant,
and, not surprisingly, not all genes of the At-Araport set are expressed in the samples used,
so those genes were not reconstructed. These statistical summaries of alignment to reference
genes, and of expressed intron recovery, are objective comparisons of the basic accuracy and
completeness of these various coding gene sets. Overall, Evigene genes reconstructed only from
RNA are the most complete and accurate next to Araport 2016. The Evigene set also includes many
additional, valid alternate transcripts relative to the Araport set. It appears a bit more
complete than the At-Ler genome-modeled set. The incompleteness of both the Pac-Bio and
Trinity-only RNA gene sets, from the same lab samples, relative to the Evigene set, is clear:
whether measuring all genes or only the subset found in those particular sets, they produce
incomplete genes more often than desired.
-------------------------------------------------
Arabidopsis Evigene sets
evg3arath
5,102,350 gene assemblies were generated from this RNA sample with 11 assembly runs.
Evigene coding transcript-only classification reduced these to
37,393 locus transcripts, plus 128,412 alternates.
evg4arath
5,039,388 gene assemblies were generated from two RNA samples with 10 assembly runs.
Evigene coding transcript-only classification reduced these to
36,347 locus transcripts, plus 135,007 alternates.
evg5arath
The transcript-only re-classification and merge of evg3 and evg4
produced a set of 34,299 locus transcripts, plus 132,604 alternate
transcripts. Further locus classification by mapping to At TAIR10
chromosomes included reassignments among alternates and paralogs, and
reduction of redundant same-locus extra transcripts.
This evg5arath set with chromosome-mapped transcripts has 26,134 coding
gene loci, 75,102 alternate transcripts, and 138 other locations of
paralog locus transcripts. This compares to the reference gene set
At2016-Araport11 with 27,655 coding genes and 20,707 alternates.
Arabidopsis PacBio gene set
pacbio16arath
RNA genes were extracted and assembled from 12 raw SRA PacBio RNA data
entries with Pacific Biosciences SMRTAnalysis software, resulting in
353,153 transcripts, an average of 27,165 transcripts per assembly. These
were reduced by removing perfectly identical duplicate transcripts and
identical sub-transcripts, yielding 94,102 distinct transcripts.
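The reduction step described above can be sketched in a few lines. This is only an illustration of the idea (drop exact duplicates and exact sub-sequences of longer transcripts), not the software actually used for pacbio16arath; the function name `reduce_transcripts` and the in-memory dict input are assumptions made for the example, and reverse-strand matches are not considered.

```python
def reduce_transcripts(seqs):
    """Drop exact duplicates and exact sub-sequences of longer transcripts.

    seqs: dict of transcript id -> nucleotide sequence (hypothetical input).
    Returns a dict of the kept ids -> upper-cased sequences.
    Quadratic in the number of transcripts, so only suited to small sets.
    """
    # Visit longest first, so a containing transcript is always kept
    # before any fragment of it is examined; ties are broken by id.
    ordered = sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    kept = {}
    for tid, seq in ordered:
        s = seq.upper()
        # An identical duplicate is also a substring, so one test covers both.
        if any(s in longer for longer in kept.values()):
            continue
        kept[tid] = s
    return kept


if __name__ == "__main__":
    demo = {"a": "ACGTACGT", "b": "acgtacgt", "c": "GTAC", "d": "TTTT"}
    print(sorted(reduce_transcripts(demo)))
```

For sets of the size reported here, an indexed approach (hashing or a sequence clustering tool) would be needed in place of the pairwise substring test.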
------------
Arabidopsis gene set versions compared
At16Ap = public gene set of 2016 of Arabidopsis thal. from Araport.org
Oases = velvet/oases assembly of Illumina RNA,
IDBAtr = idba_tran asm of Illumina RNA,
SOAPtr = SOAP-Trans asm of Illumina RNA,
Trinity = Trinity asm of Illumina RNA,
At16Pacb/PacBio = Pac-Bio SMRTAnalysis assembly of Pac-Bio RNA data
At16Ler = genes modeled on At-Ler chromosome assembly, NCBI accession GCA_001651475.1, PRJNA311266
At3EVm = Evigene classified gene set of Illumina RNA assemblies above (O,I,S,T)
At17EVm5/At5EVm = Evigene reduction of assemblies of Illumina RNA-seq,
improved over At3EVm with added RNA sets and gene assemblies.
At17EVm5 genes of 2017, along with comparison gene sets, are public at
http://arthropods.eugenes.org/EvidentialGene/plants/arabidopsis/evigene2017_arabidopsis/
Summary comparisons
-------------------
1a. Gene assembly methods measured against
reference Arath public gene set (2016), unique coding sequences
Alignment to Arabidopsis gene set At16Ap (nt=37806 ref transcripts)
Geneset   nFound  Found%  AlignF%  AlignT%
At5EVm     36072    95.4     95.7     95.0
At3EVm     34294    90.7     94.0     92.3
Oases      34030    90.0     93.5     91.2
IDBAtr     33837    89.5     92.0     89.1
SOAPtr     33598    88.9     90.5     87.0
Trinity    33417    88.4     87.9     84.1
PacBio     21964    58.1     76.7     48.2
---------------------------
1b. Arabidopsis gene sets measured against related species Orange and Cacao.
                 Cacao Reference                  Orange Reference
Geneset   nGene    nAlt  Found%  AlignF%    nGene   nAlt  Found%  AlignF%
At17EVm5  23042  134099    89.8     70.2    16739  22226    90.7     74.1
At16Ap    22473  132468    88.7     70.6    16850  22338    91.1     74.7
At16Ler   22650  132005    88.4     69.9    16661  22091    90.1     74.0
At16Pacb  17593   95437    64.2     60.5    11578  15608    64.0     63.5
---------------------------
1c. Intron recovery for Arabidopsis gene sets (ni=125481 of RNA-seq mapped to chrs)
Geneset    GeneTr  valExon  Found%
At17EVm5    83211   109970    87.6
At16Ap      42211   110654    88.1
At16Ler     25187   102381    81.5
At16Pacb    48848    70719    56.3
.. subset gene assemblies ..
Oases      312817   101784    81.1
IDBAtr     248429   101357    80.7
SOAPtr      96533    99274    79.1
Trinity    198380   102208    81.4
----------------------
1d. Equivalency of Evigene and At-Araport gene loci
At16Ap gene loci (n=27652) with equivalent At17EVm5 genes
21646, 78% : Essentially same coding exons (>=95%)
22965, 83% : Same or >50% same coding exons
24065, 87% : Same, or some equal coding exons
3618, 13% : No equal CDS (includes loci unexpressed in this RNA sample)
408, 1% : No equal CDS with expressed introns (i.e. omission mistakes in Evigene set)
-------------
If we ignore 3200 (12%) un/weakly-expressed At16Ap genes in this RNA sample,
the Evigene set is 90% same-location-equivalent to At16Ap coding sequences.
This location-based measure differs from the 95% coding sequence
alignment of (1a) above (At5EVm to At16Ap), which collapses high-identity paralogs.
Intron recovery
109751 introns are common to At16Ap and At17EVm5, 106421 also found in RNA-seq
4354 RNA introns only in At16Ap
3549 RNA introns only in At17EVm5
At17EVm5 with no equivalent At16Ap gene
2514 primary and alternate transcripts,
381 of these have unique expressed introns
~100 of these appear as valid genes, lacking At16Ap equivalent
141 At17EVm5 genes have no mapping, or very poor mapping, to AtTAIR10
chromosomes, but have strong protein alignment to other plant species.
Of these, 86 genes align well on At-Ler chromosomes, suggesting
errors in assembly of AtTAIR10 chromosomes (gene RNA is from same Col-0
genomic clone).
Most common protein functions of non-equivalent At17EVm5 (from At16Ap protein alignment)
723 unknown, 209 hypothetical
35 transmembrane, 27 F-box, 20 cytochrome, 19 UDP-glucosyl
16 Disease resistance, 16 kinase, 15 maternal effect,
13 RING/U-box, 12 Leucine-rich, 11 MLP-like, 11 UDP-Glycosyltransferase
10 Cysteine/Histidine-rich, 10 fucosyltransferase
-----------------------------------
Reference Gene Alignments (1a,1b)
Method: BLASTn -query reference-unique.cds -db allgenesets.cds -evalue 1e-5 ..
Statistics
Found% = percent of reference transcripts found
AlignF% = percent aligned, relative to the reference transcripts found
AlignT% = percent aligned, relative to all reference transcripts
Reference genes:
Orange, NCBI genomes/refseq/plant/Citrus_clementina/GCF_000493195.1_Citrus_clementina_v1.0
Cacao, Evigene update of Theobroma cacao from RNA-seq supplied by Mars.
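As a rough illustration of how the three alignment statistics can be tabulated from BLASTn hits, here is a minimal sketch. The base-weighted definitions used for AlignF% and AlignT% are an assumption (the exact formulas are not stated in this report), as are the function names and the simplified hit tuples (reference id, query start, query end), which stand in for parsed BLAST tabular rows of one gene set.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total bases covered by a list of 1-based inclusive (start, end) spans."""
    total = 0
    cur_s = cur_e = None
    for s, e in sorted(intervals):
        if cur_e is None or s > cur_e + 1:
            # Gap before this span: close the previous run, start a new one.
            if cur_e is not None:
                total += cur_e - cur_s + 1
            cur_s, cur_e = s, e
        else:
            cur_e = max(cur_e, e)
    if cur_e is not None:
        total += cur_e - cur_s + 1
    return total


def alignment_stats(ref_lengths, hits):
    """ref_lengths: dict ref id -> CDS length; hits: (ref_id, qstart, qend)."""
    spans = defaultdict(list)
    for ref_id, qstart, qend in hits:
        if ref_id in ref_lengths:
            spans[ref_id].append((min(qstart, qend), max(qstart, qend)))
    aligned = sum(min(merged_length(v), ref_lengths[r]) for r, v in spans.items())
    found_bases = sum(ref_lengths[r] for r in spans)
    total_bases = sum(ref_lengths.values())
    return {
        "nFound": len(spans),
        "Found%": 100.0 * len(spans) / len(ref_lengths),
        # Assumed definition: aligned bases over bases of found references.
        "AlignF%": 100.0 * aligned / found_bases if found_bases else 0.0,
        # Assumed definition: aligned bases over bases of all references.
        "AlignT%": 100.0 * aligned / total_bases,
    }


if __name__ == "__main__":
    refs = {"r1": 100, "r2": 200, "r3": 100}
    hits = [("r1", 1, 50), ("r1", 41, 90), ("r2", 1, 200)]
    print(alignment_stats(refs, hits))
```

Found% itself needs no assumption: in table 1a it is nFound over the 37806 reference transcripts, e.g. 36072 / 37806 = 95.4% for At5EVm.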
Intron Recovery (1c)
Method:
map RNA-seq (Illumina) to chromosome assembly with GSNAP,
extract splice-mapped reads and their intron locations,
tabulate gene-exon x RNA-intron matches.
Statistics
GeneTr = gene transcripts total in gene set
valExon = gene exons w/ validated intron
Found% = percent of all valid introns (ni) recovered between gene exons
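The last step of the intron recovery method, tabulating gene-exon x RNA-intron matches, might look like the sketch below. It assumes exact boundary matching between an RNA-seq intron and the gap between two adjacent exons of a transcript; the function names and the tuple layouts are hypothetical, and the GSNAP mapping and intron extraction steps are taken as already done.

```python
def gene_introns(transcripts):
    """Introns implied by adjacent exons of each transcript.

    transcripts: iterable of (chrom, strand, exons), where exons is a list
    of 1-based inclusive (start, end) pairs (hypothetical layout).
    """
    introns = set()
    for chrom, strand, exons in transcripts:
        ex = sorted(exons)
        for (s1, e1), (s2, e2) in zip(ex, ex[1:]):
            if s2 > e1 + 1:  # skip abutting or overlapping exons
                introns.add((chrom, e1 + 1, s2 - 1, strand))
    return introns


def intron_recovery(rna_introns, transcripts):
    """Return (number of RNA introns recovered, percent of all RNA introns)."""
    rna = set(rna_introns)
    found = rna & gene_introns(transcripts)
    pct = 100.0 * len(found) / len(rna) if rna else 0.0
    return len(found), pct


if __name__ == "__main__":
    rna = [("chr1", 101, 200, "+"), ("chr1", 301, 400, "+"), ("chr2", 51, 99, "-")]
    genes = [
        ("chr1", "+", [(1, 100), (201, 300), (401, 500)]),
        ("chr2", "-", [(1, 50), (120, 200)]),
    ]
    print(intron_recovery(rna, genes))
```

The Found% column of table 1c is consistent with valExon divided by ni, e.g. 109970 / 125481 is about 87.6% for At17EVm5.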
Equivalency (1d)
Measured as CDS and exon location overlap between genes.gff
mapped to AtTAIR10 chromosomes.
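A location-overlap classification in the spirit of table 1d can be sketched as follows. The class thresholds (>=95%, >50%, any, none) come from 1d; measuring "same coding exons" as the percent of reference CDS bases overlapped is an assumption of this sketch, as are the function names. Both genes are assumed to be on the same chromosome and strand, with CDS given as 1-based inclusive, non-overlapping (start, end) spans.

```python
def overlap_bases(a, b):
    """Bases shared by two lists of non-overlapping inclusive (start, end) spans."""
    a, b = sorted(a), sorted(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            total += hi - lo + 1
        # Advance whichever span ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def equivalence_class(ref_cds, other_cds):
    """Classify a candidate gene against a reference gene by CDS overlap."""
    ref_len = sum(e - s + 1 for s, e in ref_cds)
    pct = 100.0 * overlap_bases(ref_cds, other_cds) / ref_len if ref_len else 0.0
    if pct >= 95:
        return "essentially same"
    if pct > 50:
        return "more than half same"
    if pct > 0:
        return "some equal"
    return "no equal CDS"


if __name__ == "__main__":
    ref = [(100, 199), (300, 399)]
    print(equivalence_class(ref, [(100, 199), (300, 330)]))
```

A full comparison would read both gene sets from genes.gff, index them by chromosome and strand, and keep the best-scoring candidate per reference locus.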
=============================================================