
A Gene Information Perspective on Genome Terminology


The word genome was coined in 1920 to describe all the heritable traits of an organism. It co-existed with the terms chromosome and gene. In those early decades chromosomes were observed and experimented on directly (e.g. fruit fly polytene chromosomes visible in light microscopes). Genes were not then observable, but were inferred as discrete units from breeding experiments with heritable traits (Mendel and others). Following discoveries in genome biology from the 1950s onward, the term genome is now commonly defined in biology textbooks as all the DNA content of a cell, nuclear and non-nuclear, genic and non-genic, with most contained in chromosomes. All the genes transcribed from DNA to RNA in cells are commonly defined as a transcriptome.

I use the term gene-ome to describe a complete gene set (loci and alternates, transcribed or not) of an organism, to emphasize that discoverable information about genes may be independent of discoverable information about chromosomes. Both are aspects of the biological meaning of genome, but can and do differ in information meaning. Common usage of "assemble a genome" means, in information discovery terms, to assemble a chromosome set. Yet even a perfectly reconstructed chromosome set does not necessarily reconstruct an accurate gene set (gene-ome). For an objective view of genome information, we should in my opinion understand all the independently derivable genome information components, as they contribute some separate accuracies and errors to full knowledge of a genome.

When discussing genes, one should be clear what the information sources for those genes are, and what dependencies exist. A gene modeled on top of chromosomes is dependent in part on chromosome assembly accuracy, a gene assembled from RNA is dependent on gene assembly accuracy, and a gene determined with other experimental methods such as RT-PCR will have other accuracies. These reconstructions of genes often disagree in details, but discrepancies can be resolved when the error sources are understood and evaluated.

Gene reconstruction

Gene reconstruction is an encompassing term for gene and gene-ome discovery that includes methods of modeling, finding, assembly, experimentation, and others. Reconstruction can mix and merge these methods, e.g. RNA mapped and assembled on chromosomes, combined with modeling of gene sequence signals on those chromosomes.

Gene prediction

Gene prediction, or gene finding, is a term for informatics methods of locating gene signals on chromosome DNA (coding sequences with start and stop points, intron splice sites, transcription start and stop signals). This often uses statistical models of those gene signal motifs, such as hidden Markov models decoded with the Viterbi algorithm.
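As a rough illustration of the statistical approach, here is a minimal sketch of Viterbi decoding over a toy two-state hidden Markov model. The state names ("genic", "intergenic") and all probabilities are invented for this example; real gene finders use many more states (exons, introns, splice sites) and parameters trained on known genes.

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable state path for seq under a simple HMM."""
    # Log probabilities avoid numeric underflow on long sequences.
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]])
              for s in states}]
    back = [{}]
    for i in range(1, len(seq)):
        score.append({})
        back.append({})
        for s in states:
            # Best previous state from which to enter state s at position i.
            prev = max(states,
                       key=lambda p: score[i - 1][p] + math.log(trans_p[p][s]))
            score[i][s] = (score[i - 1][prev] + math.log(trans_p[prev][s])
                           + math.log(emit_p[s][seq[i]]))
            back[i][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for i in range(len(seq) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path

# Hypothetical toy parameters: GC-rich "genic" vs AT-rich "intergenic".
STATES = ("intergenic", "genic")
START_P = {"intergenic": 0.5, "genic": 0.5}
TRANS_P = {
    "intergenic": {"intergenic": 0.9, "genic": 0.1},
    "genic": {"intergenic": 0.1, "genic": 0.9},
}
EMIT_P = {
    "intergenic": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "genic": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

if __name__ == "__main__":
    dna = "ATATATAT" + "GCGCGCGCGC" + "ATATATAT"
    for base, state in zip(dna, viterbi(dna, STATES, START_P, TRANS_P, EMIT_P)):
        print(base, state)
```

The predicted path depends entirely on the chromosome sequence it is given, which is why a gene modeled this way inherits any errors in the chromosome assembly.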

Gene assembly

Gene assembly describes informatics methods of putting pieces of an experimentally derived gene sequence together to correctly represent a gene. The pieces are experimental evidence from molecular methods and sequencing machines. Assembly includes overlaying partial and error-containing sequences as well as fitting short parts into longer ones. Short but accurate sequences (e.g. from Illumina sequencers) and long but error-prone sequences (e.g. from PacBio sequencers) both need assembly into accurate gene representations or models. These are inferences of a gene, dependent on assembly methodology. Sometimes the gene assembly process uses external data, such as chromosome assemblies or other gene sequences, which adds complexity and error sources to the inferred gene models.
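The overlaying and fitting steps can be sketched with a toy greedy overlap merge. This is only an illustration under strong assumptions: the reads are error-free and all on the same strand, and the function names and the min_len parameter are hypothetical. Production RNA assemblers typically use graph-based methods and must handle sequencing errors and alternate transcripts.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge reads by repeatedly joining the pair with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Fit short parts into longer ones: drop reads contained in another read.
    contigs = [r for r in contigs
               if not any(r != o and r in o for o in contigs)]
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # no remaining overlaps: these are the gene models
        merged = contigs[best_i] + contigs[best_j][best_n:]
        # Keep the other contigs, except any now contained in the merged one.
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j) and c not in merged]
        contigs.append(merged)

if __name__ == "__main__":
    # Invented reads that tile one short made-up transcript.
    reads = ["ATGGCGTACG", "GTACGTTAGC", "TTAGCCTGA", "CGTT"]
    print(greedy_assemble(reads))
```

Changing min_len or the merge order can change the result when reads come from similar genes, which is one concrete sense in which an assembled gene is an inference dependent on assembly methodology.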

Don Gilbert, gilbertd at indiana edu, 2017 January

Developed at the Genome Informatics Lab of Indiana University Biology Department