Index of /EvidentialGene/arthropods/honeybee/evg3hbee
Name Last modified Size
Parent Directory 30-Jun-2015 22:42 -
evg3hbee.mrna2tsa.info 07-Jun-2014 19:02 1k
evg3hbee.mrna2tsa.log 05-Jun-2014 13:04 24k
evg3hbee.tr2aacds.log 01-May-2014 22:05 13k
evg3hbee.trclass.gz 01-May-2014 21:53 38.4M
hbee_rnaseq/ 30-Jul-2014 13:32 -
inputset/ 30-Jun-2015 22:38 -
lola/ 30-Jul-2014 13:32 -
publicset/ 30-Jun-2015 22:39 -
run_evgmrna2tsa.sh 07-Jun-2014 19:03 2k
runtr2cds.sh 01-May-2014 16:53 2k
====== Honey Bee EvidentialGene gene construction set ========
2014-jun-07, by Don Gilbert, gilbertd at indiana edu
This is a 'reference-free' gene set assembled from mRNA-seq, without reference to
a genome assembly and without training on or mapping from other species' genes. As such it has
different values than genome-based gene sets; an important one is that no external artifacts or
errors contribute to these genes. Any protein orthology measured has not been influenced by
gene modelling using other species (with their artifacts), nor by genome assembly errors.
#t2ac: EvidentialGene tr2aacds.pl VERSION 2013.07.27
#t2ac: BEGIN with cdnaseq= evg3hbee.tr date= Thu May 1 14:16:23 PDT 2014
#t2ac: bestorf_cds= evg3hbee.cds nrec= 6156631
#t2ac: nonredundant_cds= evg3hbeenr.cds nrec= 2257631
#t2ac: nofragments_cds= evg3hbeenrcd1.cds nrec= 1353185
# Class Table for evg3hbee.trclass
class        %okay  %drop   n_okay   n_drop
althi          4.5   11.4    61969   154724
althi1         8.8   24.4   119671   331456
althia2        0      0.5        0     7687
altmfrag       0.4    0.4     6271     6534
altmfraga2     0      0        696      631
altmid         0.6    0.6     8381     9294
altmida2       0      0        690      498
main           4.4    4.6    60370    62381
maina2         0.3    0.2     4868     3490
noclass        2.3    7.7    32321   105044
noclassa2      0      0        138      166
parthi         0     16.2        0   220332
parthi1        0      9          0   122333
parthia2       0      2.4        0    33187
---------------------------------------------
total         21.8   78.1   295375  1057757  # ok = 5% of input.tr
=============================================
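As a cross-check of the class table above, the total row can be recomputed from the per-class counts. This is a minimal Python sketch, not part of tr2aacds.pl: the counts are copied from the table, while two things are assumptions inferred from the numbers: that percents are truncated (not rounded) to one decimal place, and that the bestorf_cds record count stands in for the size of input.tr.

```python
# Per-class (okay, drop) transcript counts, copied from the class table.
CLASS_COUNTS = {
    "althi": (61969, 154724),
    "althi1": (119671, 331456),
    "althia2": (0, 7687),
    "altmfrag": (6271, 6534),
    "altmfraga2": (696, 631),
    "altmid": (8381, 9294),
    "altmida2": (690, 498),
    "main": (60370, 62381),
    "maina2": (4868, 3490),
    "noclass": (32321, 105044),
    "noclassa2": (138, 166),
    "parthi": (0, 220332),
    "parthi1": (0, 122333),
    "parthia2": (0, 33187),
}

# Assumption: bestorf_cds nrec is used as the number of input transcripts.
INPUT_TR = 6156631


def pct(n, total):
    """Percent of total, truncated (not rounded) to one decimal place."""
    return (1000 * n) // total / 10


n_okay = sum(ok for ok, _ in CLASS_COUNTS.values())
n_drop = sum(dr for _, dr in CLASS_COUNTS.values())
n_all = n_okay + n_drop

print("total", pct(n_okay, n_all), pct(n_drop, n_all), n_okay, n_drop)
print("okay share of input.tr: %.1f%%" % (100 * n_okay / INPUT_TR))
```

With truncation, the first printed line matches the total row of the table; the second line is the basis for the "ok = 5% of input.tr" remark.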
# AA-quality for okay set of evg3hbee.aa.qual (no okalt): all and longest 1000 summary
okay.top n=1000; average=2024; median=1725; min,max=1362,16948; gaps=9430,9.4
okay.all n=97697; average=217; median=131; min,max=40,16948; gaps=681261,6.9
Revised class count table for publicset, after removing 30,000 gut parasite genes (from a euglenoid),
plus a subset of uninformative, short, fragmented transcripts (no homology, mostly no mapping to genome)
118027 althi # high identity exon alignment alternates, Apimel3aEVm000000t2..tn
5438 althim # main/alt swaps
5297 altmid # lower identity exon align alts, will contain some paralogs
2488 altmidfrag # shorter, low ident alts
59018 main # main loci with alternates (Apimel3aEVm000000t1),
13374 noclass # and main-noclass without alts (IDt1), adding to 72392 "loci"
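
The ID convention in the table above (main transcript IDt1, alternates IDt2..tn) makes it easy to group public transcripts by locus. The following is a sketch in Python under assumptions: the regular expression is inferred only from example IDs such as Apimel3aEVm000000t1, and the helper names split_id and transcripts_per_locus are hypothetical, not part of EvidentialGene.

```python
import re
from collections import Counter

# Assumed ID layout: <prefix>EVm<locus digits>t<transcript number>
ID_RE = re.compile(r"^(?P<locus>\w+?EVm\d+)t(?P<num>\d+)$")


def split_id(transcript_id):
    """Split e.g. 'Apimel3aEVm000555t314' into ('Apimel3aEVm000555', 314)."""
    match = ID_RE.match(transcript_id)
    if match is None:
        raise ValueError("unexpected transcript ID: %r" % transcript_id)
    return match.group("locus"), int(match.group("num"))


def transcripts_per_locus(transcript_ids):
    """Count transcripts (main t1 plus alternates) for each locus ID."""
    counts = Counter()
    for tid in transcript_ids:
        locus, _ = split_id(tid)
        counts[locus] += 1
    return counts


# Locus total from the revised table: main with alts + main-noclass without.
N_LOCI = 59018 + 13374
```

Given a list of IDs read from a publicset fasta or annotation table, transcripts_per_locus(ids).most_common(4) would list the four loci with the most transcripts.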
Notes:
Find useful data in honeybee/evg3hbee/publicset/ :
  protein, cds and mRNA fasta sequence files
  annotation table with homology Name (from blastp)
  view.gff location table, mapped to Apis amel45 genome assembly
publicset/evg3hbee.nopath.names
This name list contains ortholog gene function names
for NOPATH mRNA loci (that is, loci that do not map to the amel45 genome). There are likely
some true bee genes in this set that are missing from an incomplete genome assembly.
I filtered out obvious contaminant mRNA assemblies, notably 30000 loci of
a euglenoid parasite(?) from gut mRNA. Non-genome-mapping mRNA with bee-like
orthologs should be investigated by someone. Many are found in other bees and wasps:
973 DMELA, 857 wasp, 783 bomimp, 503 AECHI, 424 apimel, 361 megrot, 343 apidor, 154 apiflo, 125 TCAST
honeybee/evg3hbee/lola/
** Found Lola 100-alts, same hub intron as Nasonia; Nasvit alts map to Apis genome
lola = longitudinals lacking (one of those cute fruitfly gene names); however, this gene
is active/expressed and mucking around in most tissues, over the course of development,
including brain/nervous tissue, where many alt proteins may come into play for
interesting biology.
** Please investigate, you biologists: this may be a Hymenoptera-specific alternate expansion
affecting social/nervous behaviour (or maybe sting/venoms, or whatever.. I don't know)
same place Evigene Apis mRNA map.. 60 hub intron variants found in evg3hbee trasm,
Group9.12 intron 1489888 -> 1506360..1670677 (200 kb span)
most are locus Apimel3aEVm002442t alts (252 alts listed in publicset, not all map w/ diff hub intron)
.. 4 loci have 250+ alts in pubset,
Apimel3aEVm002442 = lola, amel45:GB53441 matches Nasvi2EG036900t == wasp lola
locus Group9.12: 1482542i .. 1489888ihub > 1512692..1670677
Apimel3aEVm000555 = DSCAM, 314 alts, aa=1674,87%,complete, Split genes
locus Group4.13:611675 -> 611794..671399
Group4.13:611675-671399
DSCAM is now a well-known locus with many alternate transcripts, but it doesn't have
all that many alt introns, just lots of exons to mix and match..
more later..
........
Transcript assemblies and input mRNA seq
honeybee/hbee_rnaseq
The mRNA-seq used is all from public data sets found at NCBI SRA for Apis mellifera;
see the hobee_study.list and sra_result.cvs for SRA accessions.
A brief summary of the 6+ million de novo transcript assemblies made from these RNA
is also there, with the primary statistic for selection and effort in the 'aastat' tables
of average protein sizes found in each tr-assembly run. I find this the best way to proceed,
to learn early on whether one has enough mRNA assemblies for a full animal/plant gene set.
Size and count proteins, not transcripts. N50 transcript size is meaningless for mRNA
gene sets. Measure the 1000 longest proteins, a statistic which has a biological max and is
strongly correlated with orthology (all the longest proteins are pretty much well known now).
(Use cd-hit to filter duplicates for a quick answer.)
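
The "1000 longest proteins" statistic described above can be sketched in a few lines of Python. This is an illustration only, not the code that produces the 'aastat' tables: the function names are hypothetical, the input is assumed to be a plain protein fasta (a trailing "*" stop symbol is not counted), and duplicate filtering (for example with cd-hit) is assumed to have been done beforehand.

```python
from statistics import mean, median


def protein_lengths(fasta_lines):
    """Return the length of each protein in an iterable of fasta lines."""
    lengths = []
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            # Do not count a trailing stop-codon symbol as a residue.
            current += len(line.rstrip("*"))
    if current is not None:
        lengths.append(current)
    return lengths


def longest_summary(lengths, top=1000):
    """Summarize the `top` longest proteins: n, average, median, min, max."""
    longest = sorted(lengths, reverse=True)[:top]
    return {
        "n": len(longest),
        "average": round(mean(longest)),
        "median": median(longest),
        "min": min(longest),
        "max": max(longest),
    }


# Usage with a hypothetical file name:
# with open("assembly_run.aa") as handle:
#     print(longest_summary(protein_lengths(handle)))
```

Running this on the protein set of each assembly run gives numbers comparable to the okay.top line above (n, average, median, min, max), so runs can be compared by protein size rather than by transcript N50.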
As in other work, the 3 main assemblers, in order of value for making orthology-complete
gene sets, are 1. Velvet/Oases, 2. SOAPdenovo-Trans, 3. Trinity (I've learned Trans-ABySS
is also useful, ~= SOAP, but I don't use it yet myself). If you just run Trinity (or any 1 assembler),
you are not getting complete assemblies of your mRNA-seq.
more later on methods...