EvidentialGene: Alignment scores of Gene sets Constructed from
mRNA-Seq Assembly or Genome Predictions/mappings
Fig. 3 Protein alignment of mRNA Assemblies matched to longest Reference Genes.
These tables and graphs show the longest proteins of reference gene
sets, and alignment percent for draft constructed or predicted gene
sets to those long reference genes. The mRNA assembled genes are
constructed without any reference protein information, nor with genome
data. The Genome gene sets all use both reference genes and genomes to
produce gene models. I.e., the genome genes have the answers in front
of them, while mRNAseq genes do not, for this test.
This summary shows that genes constructed from mRNA-seq are often more
complete than genes from genome-modelling methods. mRNA-seq genes are
more biologically convincing as they don't rely on the reference
species genes, predictions or genome assembly, and thus are not
subject to artifacts from those sources. When they match the
reference genes, it is true discovery.
Combine this truth discovery with lower mRNA-seq costs, simpler
computational methods, and fewer artifact sources, or
"mRNA-sequencing+assembly" versus
"genome-sequencing+assembly+gene-prediction+refgene-mapping",
and the value of mRNA-seq genes over genome-genes should be apparent.
Summary of mRNA-genes and Genome-genes Alignment
to Reference genes for Animals and Plants
mRNA Genome mRNA
Clade Evigene Genes TSA Genes Organisms
---------------------------------------------------------------------------------
Fishes 78% 83%,71,67,65 39% tilapia, catfish, medaka, stickleback, killifish
Plants 93%,61 95%,63 10% cacao, banana
Insects 75%,54,48 56%,33 41%-4% beetles, whitefly, locust
Crustacea 77%,58,28 66%,58 5% waterfleas, black & white shrimps
Ticks 74% 57%,43 56% zebra-tick, spider-mite, deer-tick
---------------------------------------------------------------------------------
Statistic is percent of test genes within 90% of clade maximum alignment to
1000 longest reference genes, from below statistic tables. This statistic
approximately answers what portion of genes are accurate. Other statistics
such as percent alignment to references have same ordering. mRNA Evigene
and TSA (author published) are for same data and species; Genome genes are
for the most part other species.
Data tables are given below.
Test Gene set source:
GENOME = major portion from gene prediction on genome assembly and/or reference protein mappings.
mRNAseq = genes assembled from mRNA-seq, de-novo, without genome assembly or reference proteins.
mRNAseq assemblies include author-submitted (tsa) and
Evigene-produced (evg), using same source mRNA-seq, where the
difference is often dramatic (e.g. banana evg versus tsa versus
gno-genome, or tiger shrimp penaeus_evg versus penaeus_tsa).
Reference gene sets are chosen from good quality gene sets, distant
enough taxonomically from the test species clade to reduce
phylogenetic effects (human for fishes, arabidopsis for
plants, bee/tribolium for insects and crustacea). Ticks lack a near reference,
the reference is median gene per family contained in all of 3 species
(human, daphnia, tribolium).
The 1000 longest reference proteins are tabulated with aligned test genes,
and 200 are plotted. Alignment is scored with
'blastp -query reference.aa -db testspecies.aa'.
Plot bars:
The bar plots show percentage alignment to reference gene, for each of 200 longest ref genes,
as orange bars. Background blue bars are the maximal alignment for that group/clade of gene sets.
Each title lists test gene set, and average percent alignment to 200 reference genes.
Statistic table columns:
dref = mean percent difference in alignment from reference length,
sdref= std. deviation of mean,
dmax = mean pct. difference in alignment from maximum alignment for this clade group,
nmax90 = number of genes 90% or better of maximal alignment,
pmax90 = percent of genes 90% or better,
tmax = t-statistic for dmax less than 100%,
pmax = p-value of t.; df = number of items -1
Protein alignment % for RefHuDaTr_median-ticks
dref sdref dmax nmax90 pmax90 tmax pmax df
ixodes 62.2 28.3 75.2 427 42.8 -28.7 1.45e-132 996 GENOME
tetur 67.1 29.8 80.2 566 56.8 -22.6 6.50e-92 996 GENOME
ztick_evg 74.1 26.9 89.2 742 74.4 -16.3 1.11e-53 996 mRNAseq
ztick_tsa 62.8 33.9 75.0 558 56.0 -23.3 2.40e-96 996 mRNAseq
ixodes = Ixodes scapularis deer tick, tetur = Tetranychus urticae spider mite,
ztick = Rhipicephalus pulchellus zebra tick
Protein alignment % for RefArath-plants
dref sdref dmax nmax90 pmax90 tmax pmax df
banana_evg 78.7 28.3 82.1 608 61.0 -21.39 3.92e-84 995 mRNAseq
banana_gno 82.4 23.7 85.9 623 62.6 -21.18 8.65e-83 995 GENOME
banana_tsa 45.3 28.1 47.6 97 9.7 -59.05 0.00e+00 995 mRNAseq
cacao3_evr 92.6 15.7 97.6 922 92.6 -10.26 7.51e-24 995 mRNAseq
cacao1_evg 93.3 15.4 98.3 941 94.5 -8.25 2.41e-16 995 GENOME (including same mRNA seq as cacao3_evr)
cacao = Theobroma cacao chocolate tree, banana = Musa acuminata banana plant
Protein alignment % for RefHoneybee-insects
dref sdref dmax nmax90 pmax90 tmax pmax df
pgbeetle_evg 75.2 28.7 90.0 725 74.8 -15.5 4.59e-49 968 mRNAseq
pgbeetle_tsa 62.6 29.4 74.8 397 41.0 -30.8 2.56e-146 968 mRNAseq
pinebeetle 56.3 32.9 65.8 321 33.1 -33.9 3.34e-167 968 GENOME
tribobeetle 69.6 29.8 82.8 546 56.3 -23.1 5.61e-95 968 GENOME
whitefly1_evg 67.8 31.7 79.9 525 54.2 -22.8 5.72e-93 968 mRNAseq
whitefly1_tsa 31.8 21.0 38.8 38 3.9 -69.1 0.00000 968 mRNAseq
locust1_evg 63.1 32.9 75.2 468 48.3 -25.8 2.72e-112 968 mRNAseq
tribobeetle = Tribolium castaneum beetle, pgbeetle = Pogonus chalceus beetle,
pinebeetle = Dendroctonus ponderosae pine beetle,
whitefly = Bemisia tabaci whitefly, locust = Locusta migratoria
Protein alignment % for RefHuman-fish
dref sdref dmax nmax90 pmax90 tmax pmax df
catfish_evg 82.1 26.8 91.7 767 77.7 -14.9 1.12e-45 986 mRNAseq
catfish_tsa 61.6 36.6 68.8 383 38.9 -8.05 7.49e-13 986 mRNAseq
killifish_evg 78.4 28.3 86.6 639 64.7 -20.0 1.49e-75 986 GENOME
medaka 77.7 30.7 84.8 662 67.1 -19.0 5.37e-69 986 GENOME
stickleback 79.7 30.1 86.9 704 71.3 -17.2 1.25e-58 986 GENOME
tilapia 84.0 27.6 92.2 818 82.9 -12.5 1.77e-33 986 GENOME
Protein alignment % for RefBeetle-crusts
dref sdref dmax nmax90 pmax90 tmax pmax df
daphmag5_evg 70.8 31.1 89.8 740 77.0 -14.1 1.93e-41 960 mRNAseq
daphniamag 65.4 31.5 82.3 555 57.8 -21.4 7.04e-84 960 GENOME/mRNAseq
daphniaplx10 68.1 30.8 86.1 639 66.5 -19.0 6.03e-69 960 GENOME
litova1_evg 50.0 30.9 64.5 273 28.4 -36.7 2.60e-185 960 mRNAseq
penaeus_evg 65.0 31.5 82.3 557 58.0 -21.6 4.04e-85 960 mRNAseq
penaeus_tsa 24.4 23.8 32.1 44 4.6 -74.1 0.00e+00 960 mRNAseq
penaeus = Penaeus monodon black tiger shrimp,
litova = Litopenaeus vannamei pacific white shrimp,
daphniaplx, daphniamag, daphmag = Daphnia pulex and magna waterfleas
Data Table columns include ref gene ID and aa length, and
test gene Id, Bitscore, Identity, Align, and Length (all AA sizes)
E.g. datatables/refman1000-fish.topgene.tab
RefgeneID RefLen TestID1 Bits1 Iden1 Align1 Len1 TestID2 Bits2 Iden2 Align2 Len2 ...
HSAPI:ENSG00000155657 33423 cfishevg:socatfishv1k31loc102733t1 18536.9 8832 13118 13376
kfishevg:Funhe5EG014955t1 19833.5 12321 23137 28556
medaka:ENSORLP00000022735 33417.5 16233 25294 26141
stickleback:ENSGACP00000000362 27711.8 14071 23265 26129
tilapia:ENSONIP00000007298 25333 12654 20549 22001
HSAPI:ENSG00000131018 8797 cfishevg:socatfishv2k25loc98569t1 9838 5199 8842 8727
kfishevg:Funhe5EG001671t1 9640 5076 8778 8743
medaka:ENSORLP00000018390 8138 4540 8837 8775
stickleback:ENSGACP00000011594 8357 4582 8832 8782
tilapia:ENSONIP00000000873 9817 5221 8857 8743
R-Stat Tables
#R-shell ..............
NSTAT <- 1000
# YALN only, pct of ref only
ylab <- "Protein alignment %"; fkey <- "paln"; ymax <- 100
tvars<- c("tid","bits","iden","algn","slen");
tinset = system(paste("ls -1 ",ingdir,"/ref*1000*.topgene.tab",sep=""),intern = T)
inttab <- tinset[1]
for (inttab in tinset) {
gtab <- read.table(inttab, header=F, na.strings="na")
nspp <- (ncol(gtab) - 2) / 5
inghead <- c("refid","rlen");
for(c in 1:nspp) { inghead <- c(inghead, paste(tvars,c,sep="")) }
colnames(gtab) <- inghead
gname= sub("refbee","RefHoneybee", sub("reftribo","RefBeetle",
sub("refman","RefHuman", sub("refarath","RefArath",
sub("beetleslocustwhifly","insects",
sub("ref3a","RefHuDaTr", sub("ref3d","RefHuTrWa", sub("1000","", sub("_top1000","",
sub(".topgene.tab","",sub("/Users.*/","",inttab))))) ))))))
nrmax <- min(nrow(gtab),NSTAT)
gmat <- gtab[1:nrmax,grep("rlen|algn",colnames(gtab))]
labid <- gtab[1,grep("tid",colnames(gtab))] ## missing? YES: check >1; can use levels(labid[[i]])[1]
for (i in 1:length(labid)) { lid <- levels(labid[[i]])[1];
lid<- sub(":.*","",as.character(lid));
lid <- sub("kfish","killifish",sub("cfish","catfish",lid))
lid <- sub("whitefly","whitefly1evg",sub("locust","locust1evg",lid));
lid <- sub("(evg|evr|gno|tsa)","_\\1",lid)
labid[[i]] <- lid; }
# SUBSET here..
dmat <- 100*(gmat[,2:ncol(gmat)] / gmat[,1]) # ave pct of ref
colnames(dmat) <- labid;
dmat[dmat>100] <- 100;
nc<- ncol(dmat)
nr<- nrow(dmat)
dmax <- apply(dmat,1,max)
dmean<- mean(dmat); dsd <- sd(dmat)
maxmat<- 100*dmat/dmax # use this instead of dmat - dmax ?
maxmean<- mean(maxmat);
dstats <- data.frame( matrix(0,nrow=nc,ncol=8), row.names=labid)
for (j in 1:nc) {
dxt <- t.test(maxmat[,j] - 100, alternative="less");
nmax90 <- sum( maxmat[,j] >= 90); pmax90 <- round(100 * nmax90 / nr,1)
rstat <- cbind(dref=dmean[j],sdref=dsd[j],dmax=maxmean[j],
nmax90=nmax90, pmax90=pmax90,
tmax=dxt$statistic,pmax=dxt$p.value,df=dxt$parameter )
dstats[j,] <- rstat
if(j==1) colnames(dstats)<- colnames(rstat)
}
cat(ylab, "for", gname,"\n");
print( dstats);
cat('-------------------------------------------',"\n\n");
}
R-PLOT Ref gene x TrAssembly Top Sizes for multi-species/ref
#R-shell ..............
tvars<- c("tid","bits","iden","algn","slen");
ingdir<- "/Users/gilbertd/Desktop/dspp-work/arthropod/aabugs4/aabugs4qual/tsaevg/aaeval/ref3redo"
tinset = system(paste("ls -1 ",ingdir,"/ref*1000*.topgene.tab",sep=""),intern = T)
inttab <- tinset[1]
NSHOW <- 200 ; plwidth <- 14 # NSHOW <- 100 ; plwidth <- 8; #
PSAME <- 0.05; NOSAME <- F # NOSAME <- T ; #
SUBSET <- ""; subname<-""
SUBSET <- "banana"; SUBSET <- "cacao";
SUBSET <- "daph"; subname<- "daphnia"
SUBSET <- "beet"; subname<-"beetles";
SUBSET <- "pgbeetl_evg|whitefly|locust"; subname<-"insect3"; # various evgR insects from beetleslocustwhifly
SUBSET <- "daphmag5_evg|litova1_evg|penaeus_evg|penaeus_tsa"; subname<-"crustacea4"; # various evgR insects from beetleslocustwhifly
#..............
gtab <- read.table(inttab, header=F, na.strings="na")
nspp <- (ncol(gtab) - 2) / 5
inghead <- c("refid","rlen");
for(c in 1:nspp) { inghead <- c(inghead, paste(tvars,c,sep="")) }
colnames(gtab) <- inghead
gname= sub("refbee","RefHoneybee", sub("reftribo","RefBeetle",
sub("refman","RefHuman", sub("refarath","RefArath",
sub("beetleslocustwhifly","insects",
sub("ref3a","RefHuDaTr", sub("ref3d","RefHuTrWa", sub("1000","", sub("_top1000","",
sub(".topgene.tab","",sub("/Users.*/","",inttab))))) ))))))
nrmax <- min(nrow(gtab),NSHOW)
maint <- "Top Sizes"
# YALN only, pct of ref only
ylab <- "Protein alignment %"; fkey <- "paln"; ymax <- 100
gmat <- gtab[1:nrmax,grep("rlen|algn",colnames(gtab))]
labid <- gtab[1,grep("tid",colnames(gtab))] ## missing? YES: check >1; can use levels(labid[[i]])[1]
for (i in 1:length(labid)) { lid <- levels(labid[[i]])[1];
lid<- sub(":.*","",as.character(lid));
lid <- sub("kfish","killifish",sub("cfish","catfish",lid))
lid <- sub("whitefly","whitefly1evg",sub("locust","locust1evg",lid));
lid <- sub("(evg|evr|gno|tsa)","_\\1",lid)
labid[[i]] <- lid; }
if(nchar(SUBSET)>1) {
colset <- grep(SUBSET,labid) ## FIXME: move this up after read(gmat)
if(length(colset)>1) {
labid <- labid[colset]; gmat <- gmat[,c(1,1+colset)] ;
if(nchar(subname)<3) subname<- gsub('|','_',SUBSET);
fkey <- paste(fkey,subname,sep="-") #? gname also
}
}
dmat <- 100*(gmat[,2:ncol(gmat)] / gmat[,1]) # use this == ave pct of ref
dmat[dmat>100] <- 100; ## gmat <- dmat
dmean= mean(dmat)
dmax <- apply(dmat,1,max)
colnames(dmat) <- labid;
for (j in 1:length(dmean)) { pl<-paste("(",round(dmean[j]),"% ref)",sep="");
labid[j]= paste(labid[j],pl); }
nx= seq(to=nrow(dmat)) #? got 99 for nrow=100
nc<- ncol(dmat)
gcol=terrain.colors(1+nc); # rainbow(nc); #
gplotfile <- sub("topgene.tab",paste("top",fkey,".pdf",sep=""),inttab)
labmain=paste("Ref gene x TrAssembly: ",maint,"for",gname)
pdf( gplotfile, width=plwidth, height=2.5*nc)
layout(matrix(1:nc,nrow=nc))
for (j in 1:nc) {
labmainj= paste(gname, labid[j],sep=": ")
plot( nx, dmax, type="h", col="skyblue2", lwd=3, ## pink3 skyblue2 deeppink3
ylim=c(1,ymax), ylab = ylab, xlab="Longest Ref Genes", main=labmainj )
lines( nx, dmat[,j], type="h", col="orangered1", lwd=3) ## orangered3 gcol[j]
}
dev.off()
|