EvidentialGene : Arabidopsis thaliana model plant

      Name                                 Last modified       Size

[DIR] Parent Directory                     30-Oct-2021 16:32   -
[TXT] arabidopsis_evigene17_methods.txt    22-May-2017 13:43   4k
[TXT] arabidopsis_evigene17_results.txt    22-May-2017 20:28   9k
[DIR] blastplants/                         22-May-2017 11:31   -
[DIR] evigene5arath/                       23-May-2017 14:05   -
[DIR] gene_models/                         22-May-2017 11:31   -
[DIR] illumina_asm/                        22-May-2017 11:31   -
[DIR] pacbio_asm/                          21-May-2017 14:21   -
[DIR] rnasource/                           01-May-2017 16:10   -

Arabidopsis thaliana gene set reconstructed with EvidentialGene

RESULTS
De-novo reconstruction of model plant genes from 3 RNA-seq sources,
without use of chromosomes or other species genes, is accurate.
Comparison to gene sets from other methods, including Pac-Bio RNA
sequencing, Trinity-Illumina assembly, and genome gene models, indicates
that the Evigene methods are more accurate than these commonly used methods.

A primary goal of this reconstruction is to compare how well RNA assembly methods
recover accurate and complete gene sets, notably Evigene's methods versus
Pac-Bio RNA gene reconstructions and Trinity/Illumina RNA reconstructions.
Comparisons use the reference genes of the At Araport 2016 ("official") set.
Other reference species are included to test whether At Araport is complete,
as is the distinct gene evidence from introns of expressed RNA.

Secondarily, the Evigene set is compared with the At-Araport and At-Ler genome-modeled
gene sets.  I used only a fraction of the available RNA-seq samples for this model plant,
and not surprisingly, not all genes of the At-Araport set are expressed in the samples used,
so those were not reconstructed.  These statistical summaries of alignment to reference genes,
and of expressed intron recovery, are objective comparisons of the basic accuracy and
completeness of these various coding gene sets.  Overall, Evigene genes reconstructed only
from RNA are the most complete and accurate after Araport 2016.  The Evigene set also
includes many additional, valid alternate transcripts relative to the Araport set, and it
appears a bit more complete than the At-Ler genome-modeled set.  The incompleteness of both
the Pac-Bio and Trinity-only RNA gene sets, from the same lab samples, relative to the
Evigene set is clear: whether measured over all genes or only the subset found in those
particular sets, they produce incomplete genes more often than desired.

-------------------------------------------------

Arabidopsis Evigene sets
evg3arath
  5,102,350 gene assemblies were generated from this RNA sample with 11 assembly runs.
  Evigene coding transcript-only classification reduced these to 
  37,393 locus transcripts, plus 128,412 alternates.
  
evg4arath
  5,039,388  gene assemblies were generated from two RNA samples with 10 assembly runs.
  Evigene coding transcript-only classification reduced these to 
  36,347 locus transcripts, plus 135,007 alternates.

evg5arath
  The transcript-only re-classification and merge of evg3 and evg4
  produced a set of 34,299 locus transcripts, plus 132,604 alternate
  transcripts. Further locus classification, by mapping to At TAIR10
  chromosomes, included reassignments among alternates and paralogs, and
  reduction of redundant same-locus extra transcripts.
  
  This evg5arath set with chromosome-mapped transcripts has 26,134 coding
  gene loci, 75,102 alternate transcripts, and 138 other locations of
  paralog locus transcripts. This compares to the reference gene set
  At2016-Araport11, with 27,655 coding genes and 20,707 alternates.
  
Arabidopsis PacBio gene set
pacbio16arath
  RNA genes were extracted and assembled from 12 raw SRA PacBio RNA data
  entries with Pacific Biosciences SMRTAnalysis software, resulting in
  353,153 transcripts, an average of 27,165 transcripts per assembly. These
  were reduced by removing perfectly identical duplicate transcripts and
  identical sub-transcripts, yielding 94,102 distinct transcripts.
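
The duplicate and sub-transcript removal described above can be illustrated
with a minimal Python sketch. This is not the SMRTAnalysis or Evigene
implementation; the function name reduce_transcripts and the toy sequences are
hypothetical, and it assumes same-strand, exact string containment only.

```python
def reduce_transcripts(seqs):
    """Drop exact duplicates and sequences fully contained in a longer one.

    Assumption: containment is tested on the given strand only, with
    plain exact substring matching (no reverse complements, no mismatches).
    """
    # The set collapses perfect duplicates; sorting longest-first means any
    # containing sequence is already kept before its sub-sequences are tested.
    unique = sorted(set(seqs), key=lambda s: (-len(s), s))
    kept = []
    for seq in unique:
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


# Toy data (hypothetical): one duplicate and two contained fragments.
toy = ["ATGGCCTTAGGA", "ATGGCCTTAGGA", "GCCTTAG", "ATGCCC", "TTAGGA"]
distinct = reduce_transcripts(toy)
print(distinct)
```

The pairwise scan is quadratic and only suitable for small examples; a set of
several hundred thousand transcripts would need an indexed or clustering-based
approach.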
------------

Arabidopsis gene set versions compared
  At16Ap  = public gene set of 2016 of Arabidopsis thal. from Araport.org 
  Oases   = velvet/oases assembly of Illumina RNA,
  IDBAtr  = idba_tran asm of Illumina RNA,
  SOAPtr  = SOAP-Trans asm of Illumina RNA,
  Trinity = Trinity asm of Illumina RNA,
  At16Pacb/PacBio = Pac-Bio SMRTAnalysis assembly of Pac-Bio RNA data
  At16Ler = genes modeled on At-Ler chromosome assembly, NCBI accession GCA_001651475.1, PRJNA311266 
  At3EVm = Evigene classified gene set of Illumina RNA assemblies above (O,I,S,T)
  At17EVm5/At5EVm = Evigene reduction of assemblies of Illumina RNA-seq, 
           improved over At3EVm with added RNA sets and gene assemblies.
  At17EVm5 genes of 2017, along with comparison gene sets, are public at 
    http://arthropods.eugenes.org/EvidentialGene/plants/arabidopsis/evigene2017_arabidopsis/

Summary comparisons
-------------------  
1a.  Gene assembly methods measured against 
     reference Arath public gene set (2016), unique coding sequences

  Alignment to Arabidopsis gene set At16Ap (nt=37806 ref transcripts)
Geneset  nFound  Found%  AlignF% AlignT% 
At5EVm   36072   95.4    95.7    95.0     
At3EVm   34294   90.7    94.0    92.3   
Oases    34030   90.0    93.5    91.2   
IDBAtr   33837   89.5    92.0    89.1   
SOAPtr   33598   88.9    90.5    87.0   
Trinity  33417   88.4    87.9    84.1   
PacBio   21964   58.1    76.7    48.2   
         ---------------------------

1b. Arabidopsis gene sets measured against related species Orange and Cacao.
               Cacao Reference                Orange Reference
 Geneset  nGene  nAlt    Found% AlignF%   nGene  nAlt   Found% AlignF% 
 At17EVm5 23042  134099  89.8   70.2      16739  22226  90.7   74.1
 At16Ap   22473  132468  88.7   70.6      16850  22338  91.1   74.7
 At16Ler  22650  132005  88.4   69.9      16661  22091  90.1   74.0
 At16Pacb 17593  95437   64.2   60.5      11578  15608  64.0   63.5
          ----------------------------    ---------------------------

1c. Intron recovery for Arabidopsis gene sets (ni=125481 of RNA-seq mapped to chrs)
  Geneset  GeneTr  valExon Found%
  At17EVm5  83211  109970  87.6  
  At16Ap    42211  110654  88.1  
  At16Ler   25187  102381  81.5  
  At16Pacb  48848   70719  56.3   
  .. subset gene assemblies ..
  Oases    312817  101784  81.1
  IDBAtr   248429  101357  80.7
  SOAPtr    96533   99274  79.1
  Trinity  198380  102208  81.4
          ----------------------
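
As a check on table 1c, the Found% column appears to be valExon divided by the
ni=125481 RNA-seq introns. That relationship is inferred from the numbers, not
stated in the source; under it, the reported values are reproduced when the
percentage is truncated (not rounded) to one decimal. A minimal Python sketch,
with found_pct as a hypothetical helper name:

```python
NI = 125481  # RNA-seq introns mapped to chromosomes (table 1c header)

# valExon counts copied from table 1c.
VAL_EXON = {
    "At17EVm5": 109970,
    "At16Ap": 110654,
    "At16Ler": 102381,
    "At16Pacb": 70719,
    "Oases": 101784,
    "IDBAtr": 101357,
    "SOAPtr": 99274,
    "Trinity": 102208,
}


def found_pct(n_valid, n_total=NI):
    """Percent of n_total, truncated to one decimal via integer division."""
    return (n_valid * 1000 // n_total) / 10


table = {name: found_pct(n) for name, n in VAL_EXON.items()}
for name, pct in table.items():
    print(f"{name:<9} {pct:5.1f}")
```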
          
1d. Equivalency of Evigene and At-Araport gene loci

At16Ap gene loci (n=27652) with equivalent At17EVm5 genes
  21646, 78%  : Essentially same coding exons (>=95%)
  22965, 83%  : Same or >50% same coding exons
  24065, 87%  : Same, or some equal coding exons
   3618, 13%  : No equal CDS (includes loci unexpressed in this RNA sample)
    408,  1%  : No equal CDS with expressed introns (i.e. omission mistakes in Evigene set)
  -------------
If we ignore 3200 (12%) un/weakly-expressed At16Ap genes in this RNA sample,
the Evigene set is 90% same-location-equivalent to At16Ap coding sequences.
This location-based measure differs from the 95% coding sequence alignment
of (1a) above (At5EVm to At16Ap), which collapses high-identity paralogs.

Intron recovery
   109751 introns are common to At16Ap and At17EVm5, 106421 also found in RNA-seq
     4354 RNA introns only in At16Ap
     3549 RNA introns only in At17EVm5
     
At17EVm5 with no equivalent At16Ap gene
  2514 primary and alternate transcripts,
   381 of these have unique expressed introns
  ~100 of these appear as valid genes, lacking At16Ap equivalent
  
141 At17EVm5 genes have no mapping, or very poor mapping, to AtTAIR10
chromosomes, but have strong protein alignment to other plant species.
Of these, 86 genes align well on At-Ler chromosomes, suggesting
errors in assembly of AtTAIR10 chromosomes (gene RNA is from same Col-0
genomic clone).
  
Most common protein functions of non-equivalent At17EVm5 (from At16Ap protein alignment)
 723 unknown, 209 hypothetical
  35 transmembrane, 27 F-box, 20 cytochrome, 19 UDP-glucosyl
  16 Disease resistance, 16 kinase, 15 maternal effect,
  13 RING/U-box, 12 Leucine-rich, 11 MLP-like,  11 UDP-Glycosyltransferase
  10 Cysteine/Histidine-rich, 10 fucosyltransferase
  
-----------------------------------

Reference Gene Alignments (1a,1b)
Method:  BLASTn -query reference-unique.cds -db allgenesets.cds -evalue 1e-5 ..
Statistics
  Found% = percent of reference transcripts found
  AlignF% = percent aligned, over the reference transcripts found.
  AlignT% = percent aligned, over all reference transcripts.
Reference genes:
  Orange, NCBI genomes/refseq/plant/Citrus_clementina/GCF_000493195.1_Citrus_clementina_v1.0
  Cacao, Evigene update of Theobroma cacao from RNA-seq supplied by Mars.
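
One possible reading of the three alignment statistics, sketched in Python.
The length-weighted definitions here are an assumption (the source does not
give formulas), reference_stats is a hypothetical helper, and the toy lengths
stand in for values that would be derived from the BLASTn output.

```python
def reference_stats(ref_len, aligned):
    """Summarize alignment of a gene set to reference transcripts.

    ref_len: {ref_id: reference CDS length in bases}
    aligned: {ref_id: bases of that reference aligned by the gene set}
    Assumed definitions (not confirmed by the source):
      Found%  = references with any alignment / all references
      AlignF% = aligned bases / total length of found references
      AlignT% = aligned bases / total length of all references
    """
    found = [r for r in ref_len if aligned.get(r, 0) > 0]
    # Cap aligned bases at the reference length to guard against overcounting.
    aligned_bases = sum(min(aligned[r], ref_len[r]) for r in found)
    found_len = sum(ref_len[r] for r in found)
    total_len = sum(ref_len.values())
    return {
        "Found%": round(100 * len(found) / len(ref_len), 1),
        "AlignF%": round(100 * aligned_bases / found_len, 1) if found else 0.0,
        "AlignT%": round(100 * aligned_bases / total_len, 1),
    }


# Toy data (hypothetical): four references, three with alignments.
ref_len = {"ref1": 1000, "ref2": 500, "ref3": 1500, "ref4": 1000}
aligned = {"ref1": 900, "ref2": 500, "ref3": 1200}
stats = reference_stats(ref_len, aligned)
print(stats)
```

Under these definitions AlignT% can never exceed AlignF%, which agrees with
every row of table 1a, but that agreement does not confirm the formulas.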

Intron Recovery (1c)
Method:
  map RNA-seq (Illumina) to chromosome assembly with GSNAP, 
  extract splice-mapped reads and their intron locations, 
  tabulate gene-exon x RNA-intron matches.
Statistics
  GeneTr  = gene transcripts total in gene set
  valExon = gene exons w/ validated intron
  Found%  = percent of all valid introns (ni) recovered between gene exons
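
The tabulation step can be sketched in Python as a set intersection between
RNA-seq intron locations and the introns implied by each gene model. This is
a simplified illustration, not the Evigene tabulation itself: it counts
recovered introns (one reading of valExon), ignores strand, and assumes
1-based inclusive coordinates. The function names and toy coordinates are
hypothetical.

```python
def introns_of(exons):
    """Introns implied by one transcript: gaps between consecutive exons.

    exons: list of (start, end), 1-based inclusive (assumed convention).
    """
    ordered = sorted(exons)
    return {(a[1] + 1, b[0] - 1) for a, b in zip(ordered, ordered[1:])}


def intron_recovery(rna_introns, transcripts):
    """Count RNA-seq introns that exactly match a gene-model intron.

    rna_introns: set of (chrom, start, end) from splice-mapped reads
    transcripts: list of (chrom, exons) gene models
    Returns (number recovered, percent of RNA-seq introns recovered).
    """
    model_introns = set()
    for chrom, exons in transcripts:
        model_introns |= {(chrom, s, e) for s, e in introns_of(exons)}
    recovered = rna_introns & model_introns
    return len(recovered), 100 * len(recovered) / len(rna_introns)


# Toy data (hypothetical coordinates).
rna_introns = {
    ("Chr1", 201, 299), ("Chr1", 401, 499),
    ("Chr2", 151, 249), ("Chr2", 351, 449),
}
transcripts = [
    ("Chr1", [(100, 200), (300, 400), (500, 600)]),
    ("Chr2", [(50, 150), (250, 350)]),
]
n_found, pct_found = intron_recovery(rna_introns, transcripts)
print(n_found, pct_found)
```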

Equivalency (1d)
Measured as CDS and exon location overlap between genes.gff 
mapped to AtTAIR10 chromosomes.
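
A minimal Python sketch of how such location overlap might be binned into the
classes of table 1d. Measuring the shared fraction of reference CDS bases is
an assumption; the source may compare exons as whole units instead. The
function names and toy coordinates are hypothetical, and intervals are taken
as 1-based inclusive on the same chromosome and strand.

```python
def covered_bases(intervals):
    """Set of base positions covered by 1-based inclusive (start, end) spans."""
    return {p for start, end in intervals for p in range(start, end + 1)}


def equivalence_class(ref_cds, test_cds):
    """Bin a test gene by the fraction of reference CDS bases it shares."""
    ref = covered_bases(ref_cds)
    frac = len(ref & covered_bases(test_cds)) / len(ref)
    if frac >= 0.95:
        return "essentially same"
    if frac > 0.50:
        return "more than half same"
    if frac > 0:
        return "some equal"
    return "no equal CDS"


# Toy reference gene with two 100-base CDS exons (hypothetical coordinates).
ref_cds = [(100, 199), (300, 399)]
classes = [
    equivalence_class(ref_cds, [(100, 199), (300, 399)]),
    equivalence_class(ref_cds, [(100, 199), (300, 320)]),
    equivalence_class(ref_cds, [(150, 199)]),
    equivalence_class(ref_cds, [(1000, 1100)]),
]
print(classes)
```

Expanding intervals into per-base sets keeps the sketch short; interval
arithmetic would be the better choice at genome scale.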

=============================================================

Developed at the Genome Informatics Lab of Indiana University Biology Department