
EvidentialGene : Killifish Genes

Gene Annotation search

      Name                    Last modified       Size

[DIR] Parent Directory        10-Jan-2014 14:45    -
[DIR] annotation/             04-Jul-2015 21:46    -
[DIR] current/                13-May-2016 16:03    -
[TXT] gene_summary.txt        09-Jan-2014 13:38   15k
[DIR] inotherfish/            13-May-2016 00:53    -
[TXT] kf2genesearch.html      09-Jan-2014 17:09    4k
[DIR] kfish2rae5/             13-May-2016 16:03    -
[DIR] kfish2submit/           14-Sep-2015 13:43    -
[DIR] mRNA_assembly/          04-Dec-2013 13:18    -
[DIR] modelled_on_genome/     21-Oct-2013 01:47    -
[DIR] old_versions/           04-Jul-2015 21:53    -

# kfish2rae5g_sum.txt
2013-Nov-12++

Killifish, Fundulus heteroclitus genome project
http://arthropods.eugenes.org/EvidentialGene/killifish/project/

Gene assemblies, v2 an5g, 2013 December
  at killifish/project/Genes/kfish2rae5/
  Summary in killifish/project/Genes/gene_summary.txt
  kfish2rae5g version files are separated into main (primary transcript) and alt (alternate) parts
Genome assembly v2.1, killifish201303asm
  at killifish/project/Genome/  
Gene annotation: search with the form at the bottom of killifish/project/, or at
  killifish/project/version2/genes/kf2genesearch.html
Gene families, fish orthology: search at killifish/project/ (FISH11G)
Genome maps: Genome v2b (2013 Mar) at killifish/project/
BLAST search genes, genome: killifish/project/BLAST/

Gene evidence:
  killifish/project/Introns/ 
    347K valid introns on genome asm from RNA,EST reads
  killifish/project/Proteins/
    proteins of 9 fish species + human used for orthology evidence (n=284956 main isoform set)
  killifish/project/RNAs/
    kf2evg367mixx11/    : mRNA assembly, 183K used of 723K, from 776M short reads, 65 Bil bases, see below  
    est_cgbAssembly100.fa : 57741 assemblies of 3.5M 454-reads, from 3 projects 2009-2012
  killifish/project/Repeats/
    RepeatMasker transposon and repeat finding, and masked genome assembly (12% masked)
      
---------------------------------------------------


TABLE G1.  Gene set numbers, version kfish2rae5d, 20 Nov 2013
---------------------------------------------------
35047 gene loci, all supported by mRNA transcripts and/or protein homology evidence
27597 from mRNA-assembly, 4929 genome-modelled, 2514 Kfish version 1 (mixed sources)
21083 are orthologs to other species, 3701 are inparalogs of orthologs, and 10263 are species-unique.
25753 properly-mapped to genome (>=80% coverage),
 2292 un-mapped genes, 2328 split-mapped, 4694 partial-mapped <80% coverage
 1851 single-exon loci (including partials, excluding TE)
79103 alternate transcripts at 20449 loci, ave. 5 transcripts per locus, 40 maximum.
27069 have complete proteins, 7966 have partial proteins
 2504 have <90% evidence coverage, 1690 have <60%, by orthology and/or RNA evidence  
 2460 Transposon-associated expressed transcripts
---------------------------------------------------
Gene set numbers, version kfish2rae5g, 15 .. 27 Dec 2013
  34928  gene loci
  - a few changes vs an5f, mostly seq corrections (antisense aa, seleno), same kfish2rae5f.gff?
  - recovered 28 kfish1 loci not in kf2, strong orthology (3 missed, 19 improved fish ortho grps)
  - omcl updated for new aa seqs
  - 3500 alternate transcripts were removed due to conflicting locus info; they are
    otherwise valid and should be retained, with the conflict noted and still to be
    resolved (two-loci or wrong-locus)
  * Note that some conflicts in alt-paralog classing remain (both ways).
  
Gene set numbers, version kfish2rae5f, 04 Dec 2013
  34903 gene loci
  - minor number changes since rae5d, 100 dup-loci removed, 300 loci updated
  - mostly resolves discrepancies between sequences and gff locations, matching object ids.
---------------------------------------------------

    
TABLE G2a.  Fish species average orthology gene groups 
           ---Common Groups---   ----All Groups-----
Species    cBits aaSize orMiss   tBits orGroup Tiny
--------- --------------------  --------------------
killifish  803     50     18     585   17272   1.1%
maylandia  824     45     76     596   16469   1.1% 
tilapia    822      6    223     568   14905   1.9% 
platyfish  783    -12    118     549   15305   4.7% 
zebrafish  711     -9    366     478   15190   4.8% 
sticklebk  763    -42    342     509   14343   7.8%
catfish    725     21    729     470   14276   3.4% 
medaka     743    -45    654     478   13541   9.7%
tetraodon  732    -50    658     473   13423   7.9%
spotgar    --    -110   2588     329   10882  20.2%
human      --      27    --      395   12606   1.7%
----------------------------------------------------
  source: kfish2/prot/fish11c/fish11gor3, Dec11 .. fish11gor3 update
  cBits = bitscore average for 8656 common fish gene groups
  tBits = bitscore average for all ortholog groups
  aaSize = average protein size difference from group median
  orMiss = missing ortholog groups that are common to other 8 of 9 fish (-gar)
  orGroup = number of ortholog gene groups in species
  Tiny  = percent species gene size outliers below 2sd of group median size  
----------------------------------------------------
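
A minimal sketch of how the Tiny column could be computed from the legend
above, assuming per-group lists of protein sizes are available. The function
name, the input layout, and the use of the population standard deviation are
assumptions for illustration; the source does not state the exact calculation
used for fish11gor3.

```python
from statistics import median, pstdev


def tiny_percent(species_sizes, group_sizes):
    """Percent of a species' genes that are size outliers.

    species_sizes: {group_id: protein size (aa) of this species' gene}
    group_sizes:   {group_id: [protein sizes (aa) of all genes in the group]}

    A gene counts as "tiny" when its size is below the group median minus
    two standard deviations of the group sizes (assumed definition).
    """
    tiny = 0
    counted = 0
    for group_id, size in species_sizes.items():
        sizes = group_sizes.get(group_id)
        if not sizes or len(sizes) < 2:
            continue  # no spread can be estimated for this group
        cutoff = median(sizes) - 2 * pstdev(sizes)
        counted += 1
        if size < cutoff:
            tiny += 1
    return 100.0 * tiny / counted if counted else 0.0


# Hypothetical toy data: two ortholog groups, one clearly truncated protein.
groups = {
    "g1": [100, 100, 100, 100, 100, 100, 100, 100, 100, 10],
    "g2": [200, 210, 190, 205],
}
species = {"g1": 10, "g2": 200}
print(tiny_percent(species, groups))
```

In the toy data the g1 gene (10 aa against a group median of 100 aa) falls
below the cutoff and the g2 gene does not, so half of the species' genes are
counted as tiny.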

  Killifish 2013/2012 improvements (relative to tilapia)
Geneset    cBits  dSize orMiss   tBits orGroup Tiny
--------- --------------------   -------------------
kfish.2v1    +12   +48    -100     +40  +1690  -3.6% : all improvements
---------------------------------------------------

TABLE G2b.  Fish gene orthology categories (using OrthoMCL)
            ----------- GENES -----------     ------ GROUPS -----  Ortho to Kfish
            nGene Orlog Inpara Uniq1 UDup     OrGrp OrMis1 UniGrp  Shared  Best 
            -----------------------------     -------------------  -------------  
killifish   34931 21133   3672  7694 2432     17272    10    682    ---     --- 
maylandia   23194 21021    879  1171  122     16469    46     52   15468*  5159 
tilapia     21437 19461    975   810  189     14904   135     78   14019    954 
platyfish   20366 19483    214   641   29     15307    54     14   14609* 13352 
zebrafish   26190 18465   4089  2202 1439     15187   188    226   13286*  1363 
stickleback 20787 17954   1254  1396  181     14344   180     44   13317*   636 
catfish     43671 17279   2964 15470 7938     14246   407   1561   12799*  1208 
medaka      19686 16881    912  1535  361     13542   366    106   12670*  1071 
tetraodon   19602 16814    904  1796  176     13425   360     67   12475*   696 
spotgar     15734 11841   2655  1009  230     10880  1514     56    9492    116 
human       39357 12699   7758 12497 6402     12608   420   2221   11265*  1182 
-------------------------------------------------------------------------------
  source: kfish2/prot/fish11c/fish11gor3, Dec11 .. gor3 upd Dec27
  nGene = count of input genes, excludes alternate isoforms/locus.
  Orlog = Orthologous genes (one-to-one matches among species)
  Inpara = Inparalogs (recent ortholog duplicates) of orthologous genes
  Uniq1,UDup  = single-copy and duplicated species-unique genes
  OrGrp,UniGrp = orthologous and species-unique groups
  OrMis1  = groups missing in species that all other species have
  Ortho to Kfish, Shared = count of ortho groups shared with killifish,
     Best    = count of groups with closest homolog,
     Shared* = species shares more groups with killifish than with any of
             the other 10 choices; tilapia shares more with maylandia,
             and spotgar with zebrafish.
------------------------------------------------------------------

TABLE G2c.  Fish Taxonomy with Human gene alignment stats 
    Human genes       Fish Taxonomy
-------------------------------------------------------------------------------
Nhuman Ident% Align%  Neopterygii
                      + Teleostei
                      + + + Euteleostei
                      + + + + + + Pseudocrenilabrinae
14822   66     71    :+ + + + + + + + Maylandia zebra # african cichlid Zebra Mbuna, NCBI : KF2
14181   65     70    :+ + + + + + + + Oreochromis niloticus # Tilapia, Ensembl : KF1,2
                      + + + + + Smegmamorpha
14893   64     70    :+ + + + + + + Gasterosteus aculeatus # Stickleback, Ensembl? : KF1,2
                      + + + + + + Atherinomorpha
14478   64     67    :+ + + + + + + + Oryzias latipes # Medaka, Ensembl? : KF1,2
                      + + + + + + + Cyprinodontiformes
15033   64     71    :+ + + + + + + + + + Xiphophorus maculatus # Platyfish, Consortium : KF2
17072   65     76    :+ + + + + + + + + + Fundulus heteroclitus # Killifish, Evigene : KF1,2
                      + + + Otocephala
16127   65     70    :+ + + + + + Ictalurus punctatus # Catfish, Evigene : KF2
16871   64     73    :+ + + + + + Danio rerio # Zebrafish, Consortium/Ensembl : KF1,2
                      + Semionotiformes
13081   70     54    :+ + Lepisosteus oculatus # Spotted gar, Draft Ensembl : KF2
-------------------------------------------------------------------------------
  for N =26859 human genes, nc=16555 common to 7+ fish for align score


  
TABLE G3a.  Gene Expression (RNA-seq) summary
---------------------------------------------------
34641 (99%) loci have FPKM >= 0.02 (~20+ reads/1000 bases)
  29139 >= 1 FPKM, 14882 >= 10 FPKM
  7 median, 58 mean, 20000 max FPKM
Approx. differential expression
  nodiff  :21881, 63.5%
  adultenv: 5707, 16.5% 
  embryo  : 4666, 13.5% 
  grandis : 2185,  6.3% 
---------------------------------------------------------------------
  RNA-seq read groups are whoi (embryo, 106M 2x100b read-pairs, 56% mapped), 
  mdibl (adult tissue/environ stress, 106M 2x100b read-pairs, 59% mapped), 
  and grandis (Fund. grandis sibling species, 176M 2x60b read-pairs, 67% mapped)
  total of 388 million read pairs, 65 billion bases.
  bowtie-mapped to CDS sequences [ kfish2rae5g.mainalt.rnax3.tab ]
  FPKM : fragments mapped to transcript, per sequence kilobase, per million mapped reads.
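
  The FPKM definition above can be written out as a short calculation. This
  is a sketch only: the function and argument names are illustrative, the
  example numbers are hypothetical, and it counts fragments (read pairs) in
  the per-million denominator, which is an assumption about how "mapped
  reads" was tallied for the table.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of sequence, per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    kilobases = transcript_length_bp / 1000.0
    millions = total_mapped_fragments / 1_000_000.0
    return fragments / kilobases / millions


# Hypothetical example: 500 fragments on a 2000-base CDS,
# with 50 million fragments mapped in total.
print(fpkm(500, 2000, 50_000_000))  # 5.0
```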
  
TABLE G3b.  Orthology, SNP by DE group effects 
           unique  inpar  snp
  adultenv  35%+   31%+   40%
  embryo    11%-    7%-   34%-
  grandis   38%+   23%    55%+
  nodiff    31%    12%    41%
---------------------------------------------------------------------
unique = species-specific genes versus ortholog genes
inpar  = Inparalogs (new genes) versus orthologs
snp    = presence/absence of read SNPs mapped to transcript
-----------------------------------------------------------------------


TABLE G3c.  Gene functions (GO terms) with the most divergent 
            differential expression responses
-----------------------------------------------------------------------
GOntol_ID   %adlt adult embr  grnd none   GO_Term
-----------------------------------------------------------------------
GO:0046906  90.0  63    21    38    70    F:tetrapyrrole binding
GO:0030246  82.9  58    13    18    70    F:carbohydrate binding
GO:0008289  55.3  26    13    10    47    F:lipid binding
GO:0016491  53.0  221   48    98    417   F:oxidoreductase activity
GO:0033218  51.2  22    3     8     43    F:amide binding
GO:0060090  50.6  42    33    2     83    F:binding, bridging
GO:0048037  38.4  78    22    44    203   F:cofactor binding
GO:0043021  38.1  16    1     1     42    F:ribonucleoprotein complex binding
GO:0043178  36.0  18    21    14    50    F:alcohol binding
..
GO:0022610  54.6  311   463   49    570   P:biological adhesion
GO:0051704  53.7  341   81    54    635   P:multi-organism process
GO:0050896  41.5  1208  713   302   2911  P:response to stimulus
GO:0043170  37.4  79    57    20    211   P:macromolecule metabolic process
GO:0044255  36.4  223   94    103   612   P:cellular lipid metabolic process
GO:0048469  36.2  21    24    3     58    P:cell maturation
-----------------------------------------------------------------------



TABLE G4. Evigene mRNA assembly set, 
    project/RNAs/kf2evg367mixx11/kfish2evg367mixx11pub

mRNA inputset: 723555 kfish2evg367mixx from subset assemblies:
  356953 Funhe2E6b, 175596 Funhe2Eq7, 136132 Fungr1EG3, 54874 Funhe2Emap3

mRNA classification:
Class           cull    drop    okay    Class notes
----------------------------------------------------------------------
althi           0       35078   51301   # high identity exon alternate
althi1          0       122038  53038   # higher identity alternate
altmap          0       8741    10828   # alternate from genome mapping
altmid          0       10694   10413   # mid identity alternate (may be paralog)
main            0       1713    45048   # main transcript, longest with alts
mainsingle      0       40406   9629    # main transcript, no alternate
frag1exon       16482   0       0       # fragment single exon, no homology
fragalthi       0       65834   0       # fragment high ident alternate
fragaltmid      0       21182   3025    # fragment mid ident alternate
fragnearg       31015   0       0       # alt-exon near gene, but unattached
fragtrivia      85597   0       0       # trivially short, no homology
fragnopath      6342    0       0       # short, no genome map, no homology
----------------------------------------------------------------------
total           139436  305686  183282
  okay= keep, drop= uninformative excess, cull= uninteresting excess
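
The column totals in Table G4 can be re-derived from the class rows. The
sketch below copies the counts from the table; the dictionary layout and
variable names are only for illustration.

```python
# (cull, drop, okay) counts per class, copied from Table G4.
classes = {
    "althi":      (0, 35078, 51301),
    "althi1":     (0, 122038, 53038),
    "altmap":     (0, 8741, 10828),
    "altmid":     (0, 10694, 10413),
    "main":       (0, 1713, 45048),
    "mainsingle": (0, 40406, 9629),
    "frag1exon":  (16482, 0, 0),
    "fragalthi":  (0, 65834, 0),
    "fragaltmid": (0, 21182, 3025),
    "fragnearg":  (31015, 0, 0),
    "fragtrivia": (85597, 0, 0),
    "fragnopath": (6342, 0, 0),
}

# Sum each column across all classes.
cull, drop, okay = (sum(column) for column in zip(*classes.values()))
classified = cull + drop + okay

print(cull, drop, okay)  # 139436 305686 183282
print(f"okay share of classified transcripts: {100.0 * okay / classified:.1f}%")
```

The three columns sum to 628404 classified transcripts, fewer than the 723555
in the input set; the table does not account for the difference.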


Notes: 
-------------
Killifish, Maylandia and Tilapia form a good comparison of gene methods
and results: they are the top-scored gene sets, recently built by 3 groups
with "good" gene construction pipelines. Some artifacts of methods may be
found.  Tilapia has Ensembl genewise+exonerate models from a mix of
RNA-seq and UniProt proteins.  Maylandia has NCBI Gnomon annotation, also
from a mix of RNA-seq and related proteins.  Kfish2 has Evigene annotation,
strong on mRNA-assembled genes but also using related-species proteins.
The other two do use RNA-seq assembly, but not as extensively or carefully,
and rely on mapping to the genome assembly.  Ensembl genewise protein
mapping has the potential to add artifacts of homolog models.  NCBI Gnomon
now uses RNA-seq more carefully than in the past, and better than Ensembl,
I think.

Killifish and Platyfish form another useful comparison, platyfish being the closest
relative, and also a recent genome product built with current data and software.
Differences that can be highlighted:
1.  "The quality of a gene set is dependent on the quality of the genome assembly"
  (from the Ensembl platyfish gene build document). This also can be derived from the
  methods of the platyfish genome paper (e.g. the methods included discarding mRNA
  assemblies that did not map well to the genome assembly).
  In contrast, killifish genes v2 are not dependent on quality of genome assembly,
  merging both mRNA-assembly and genome-mapped methods to pick the best set from both.
2.  The human gene orthology stats indicate killifish surpasses platyfish in
  completeness of genes.

Killifish, Maylandia and Catfish form a third special comparison to other
fish genes.  You will find in the Orthology search that these three share
more gene families that are missed in the other fish than any other 3-fish
comparison, by about 100 families.   This is, I think, an effect of (a) mRNA
assembly, independent of genome-modelled genes, being used for Killifish and
Catfish, and (b) for Maylandia, NCBI having improved its use of mRNA evidence
enough to be roughly equivalent in discovering genes that may be poorly
modelled on the genome assembly.


Find families shared by just these 3 fish,
http://arthropods.eugenes.org/lucegene_arthropod/search?q=fish11xml-all:geneid+AND+Killifish:[1+TO+999]+AND+Catfish:[1+TO+999]+AND+Maylandia:[1+TO+999]+AND+Medaka:0+AND+Stickleback:0+AND+Tetraodon:0+AND+Tilapia:0+AND+Zebrafish:0+AND+Platyfish:0

One of these is http://arthropods.eugenes.org/genepage/fish11xml/FISH11G_G18567
FISH11G_G18567  : new D-tyrosyl-tRNA(Tyr) deacylase, one of 3 same named families,
  in killifish, catfish and maylandia only, maylandia: XP_004554729.1/LOC101478506, 1035 aa 

FISH11G_G1773   : D-tyrosyl-tRNA deacylase member 2, all but killifish, various number of genes (Tetraodon has 10)
  maylandia: XP_004554728.1/LOC101478225, 168 aa
G1773 and G18567 are related in that the shorter G1773 protein aligns to the longer G18567;
  both have the same CDD:202294 domain. In maylandia, these are tandem genes.
  In killifish, the missing shorter gene would be located where a genome gap exists
  (the mRNA assembly may or may not have a partial version).

FISH11G_G5675	  : D-tyrosyl-tRNA deacylase member 1, one gene in all 11 species
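
The search URL for families shared by just these 3 fish follows a regular
pattern: species that must have genes get a `[1+TO+999]` range and species
that must lack them get `0`. A minimal sketch that rebuilds the same URL from
two species lists follows; the function name is hypothetical, and only the
species field names that appear in the URL above are used. Field names for
other species (for example spotted gar or human) are not shown in the source
and are not assumed here.

```python
BASE = "http://arthropods.eugenes.org/lucegene_arthropod/search"


def family_query_url(present, absent):
    """Build a gene-family search URL.

    present: species that must have 1..999 genes in the family
    absent:  species that must have 0 genes in the family
    """
    terms = ["fish11xml-all:geneid"]
    terms += [f"{species}:[1+TO+999]" for species in present]
    terms += [f"{species}:0" for species in absent]
    # Terms are joined with '+AND+', matching the URL form used on this page.
    return BASE + "?q=" + "+AND+".join(terms)


url = family_query_url(
    present=["Killifish", "Catfish", "Maylandia"],
    absent=["Medaka", "Stickleback", "Tetraodon", "Tilapia", "Zebrafish", "Platyfish"],
)
print(url)
```

Swapping species between the two lists gives the other 3-fish comparisons
mentioned in the notes.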
  


Developed at the Genome Informatics Lab of Indiana University Biology Department