What is a Gene? Colinearity and Transcription Units

By: Leslie A. Pray, Ph.D. © 2008 Nature Education
Citation: Pray, L. (2008) What is a gene? Colinearity and transcription units. Nature Education 1(1)

In 1958, Francis Crick’s sequence hypothesis finally provided an answer to the question: what is a gene? Why is this definition now considered overly simplistic?

 

In the early part of the twentieth century, scientists knew what genes did, but they did not know what they were. Francis Crick, one of the codiscoverers of the three-dimensional double helical structure of DNA, was among the first to propose that a gene was a linear sequence of nucleotides and that each gene encoded a single protein. Crick called this proposal the sequence hypothesis (Crick, 1958); other scientists have since referred to it as the genes-on-a-string hypothesis. In Crick's words, this hypothesis "assumes that the specificity of a piece of nucleic acid is expressed solely by the sequence of its bases, and this sequence is a (simple) code for the amino acid sequence of a particular protein." Crick freely admitted that his hypothesis was just that: a hypothesis "for which proof is completely lacking." However, to support his speculation, Crick cited experimental work with bacteriophages conducted by American molecular biologist Seymour Benzer. Benzer's work demonstrated that, in Crick's words, "a functional gene consists of many sites arranged strictly in a linear order" (Crick, 1958; italics original).

Today, scientists no longer speak of the sequence hypothesis. Instead, the notion that nucleotide sequences (genes) directly dictate amino acid sequences is known as colinearity (Figure 1). Scientists have confirmed that colinearity is a regular occurrence among many viruses, like the ones Benzer studied, as well as among bacteria. However, it turns out that colinearity is the exception, not the rule, in eukaryotic genomes.

Figure 1: The concept of colinearity.
Colinearity suggests that a continuous sequence of nucleotides in DNA encodes a continuous sequence of amino acids in a protein.
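
As a rough illustration of colinearity, the following Python sketch reads a continuous DNA coding sequence three nucleotides at a time and emits one amino acid per codon, so that position in the DNA maps directly onto position in the protein. The function name `translate` and the toy sequence are illustrative assumptions, not taken from Crick's paper; the codon table is the standard genetic code.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(coding_dna):
    """Translate a continuous coding sequence codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy gene: ATG GCC AAA TGG TAA
toy_gene = "ATGGCCAAATGGTAA"
print(translate(toy_gene))  # MAKW
```

In this sketch, the amino acid at position i of the protein always comes from nucleotides 3i through 3i + 2 of the DNA, which is the one-to-one ordering that colinearity describes.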

Alternatives to Colinearity

One of the first clues that the relationship between DNA and amino acid sequences is not as simple as Crick had proposed was the discovery of RNA splicing in the 1970s. Using adenoviruses, which can cause the common cold, as their experimental systems, English molecular chemist Richard Roberts and American molecular biologist Philip Sharp independently discovered that genes can be split into several segments along the genome (Berget et al., 1977; Chow et al., 1977). Using electron microscopy, both scientists observed that a single messenger RNA (mRNA) molecule hybridized not to a single stretch of DNA but to as many as four or more discontinuous DNA segments (Figure 2).

Roberts and Sharp also noted that the RNA copied from such a gene is cut apart and rejoined before it is used in protein synthesis. The sections of a gene that encode protein are known as exons, and the noncoding sections interspersed among the exons are known as introns. During splicing, which occurs after transcription (i.e., the synthesis of RNA from a DNA template), the introns are removed and the exons are joined, or spliced together.
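
The splicing step can be sketched in a few lines of Python. This is a minimal model under stated assumptions: the transcript is represented as an ordered list of labeled segments, the sequences are invented toy data (exons in uppercase, introns in lowercase), and the function name `splice` is hypothetical. It ignores the actual recognition of splice sites by the cell.

```python
# Toy pre-mRNA as ordered (kind, sequence) segments; sequences are made up.
segments = [
    ("exon", "AUGGCC"),
    ("intron", "guaaguuuag"),
    ("exon", "AAAUGG"),
    ("intron", "guccccag"),
    ("exon", "UAA"),
]


def splice(segments):
    """Remove introns and join the remaining exons in their original order."""
    return "".join(seq for kind, seq in segments if kind == "exon")


pre_mrna = "".join(seq for _, seq in segments)
mature_mrna = splice(segments)

print(len(pre_mrna), len(mature_mrna))  # 33 15
print(mature_mrna)                      # AUGGCCAAAUGGUAA
```

The mature mRNA is shorter than the primary transcript and no longer corresponds to one continuous stretch of the DNA template, which is why split genes break strict colinearity.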

Roberts's and Sharp's findings not only raised serious doubts about the concept of a gene as a continuous, clearly demarcated segment of DNA, but they also led to a flurry of research activity, with scientists curious about whether the same was true in other species. As other researchers were quick to discover, discontinuous gene structure and splicing during RNA processing are the norm, not the exception, in most eukaryotes. Some vertebrate genes contain as many as 50 exons, and exons often make up only a small portion of the transcribed region of a gene. For example, in one early splicing study that examined the intron-exon pattern of the chicken ovomucoid gene, Stein et al. (1980) measured eight exons ranging in length from 20 to 181 base pairs and seven introns ranging in length from 264 to 1,150 base pairs. Since that study, scientists have detected introns as long as 50,000 base pairs or more in some species.

The final protein products encoded by any given intron-exon sequence also vary in structure, depending on which exons are spliced back together during RNA processing. This so-called "alternative splicing" is illustrated in Figure 3. Scientists have also since learned that eukaryotic cells have evolved another "alternative" mRNA processing pathway: the use of multiple 3' cleavage sites in a single exon. (These cleavage sites are distinct from splice sites; every intron has its own 5' and 3' splice site.) As illustrated in Figure 3, the end result is the same as with alternative splicing: different mRNA molecules are produced from a single protein-coding gene. Clearly, contrary to the conventional notion of a single gene encoding a single protein, a single continuous stretch of DNA can encode multiple mRNA molecules and, ultimately, multiple protein products.
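
To make the idea of alternative splicing concrete, the sketch below enumerates the mRNAs that can arise when some exons may be either included or skipped. It models only exon skipping, one of several forms of alternative splicing; the function name `alternative_isoforms`, the `cassette` parameter, and the exon sequences are all hypothetical.

```python
from itertools import product


def alternative_isoforms(exons, cassette):
    """Return every mRNA formed by including or skipping each optional exon.

    exons:    exon sequences in their genomic order
    cassette: indices of exons that may be skipped (all others are always kept)
    """
    cassette = sorted(cassette)
    isoforms = []
    for choice in product([True, False], repeat=len(cassette)):
        skipped = {idx for idx, keep in zip(cassette, choice) if not keep}
        isoforms.append(
            "".join(seq for i, seq in enumerate(exons) if i not in skipped)
        )
    return isoforms


# Toy gene with three exons; the middle one is optional.
exons = ["AUGGCC", "AAAUGG", "GAUUAA"]
mrnas = alternative_isoforms(exons, cassette=[1])
print(mrnas)  # ['AUGGCCAAAUGGGAUUAA', 'AUGGCCGAUUAA']
```

With n optional exons, this simple model yields up to 2^n distinct mRNAs from one stretch of DNA, which is the sense in which a single gene can give rise to multiple protein products.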

Transcription Units Instead of Genes

Given the vast quantity of DNA that appears to have little protein-encoding power and the fact that so much of this DNA resides right in the middle of functional genes (as introns), some scientists prefer to think in terms of "transcription units" rather than "genes." A transcription unit is a linear sequence of DNA that extends from a transcription start site to a transcription stop site (Figure 4).

The promoter, a DNA sequence that lies upstream of the RNA coding region, serves as an indicator of where and in which direction transcription should proceed. The promoter is not actually transcribed; its role is purely regulatory. While promoters vary tremendously among eukaryotes, there are some common features. For example, most promoters lie immediately upstream of the transcription unit (transcription proceeds in an upstream to downstream direction), and most contain what is known as a TATA box; this is a sequence that is recognized and bound by a so-called TATA binding protein. The TATA binding protein helps position the RNA polymerase machinery and initiates transcription.

Some promoters work in concert with other types of regulatory sequences known as enhancers, which sometimes lie several kilobases further upstream or downstream from the coding sequence itself, or even within introns. Promoters and enhancers are able to interact because of the way DNA molecules bend in space, enabling sections that would otherwise be very far from each other to interact (via DNA-binding proteins). Enhancer regions serve as binding sites for proteins known as activators (Figure 5). The proteins that bind to promoters to regulate transcription are called transcription factors.

The RNA coding region, the main component of the transcription unit, contains the actual exons and introns. The terminator, a sequence of nucleotides at the end of the transcription unit, is transcribed along with the RNA coding region. The terminator serves as a speed bump of sorts; transcription stops only after this region has been transcribed.
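
The three components described above can be summarized in a small data structure. This is only a sketch: the class name `TranscriptionUnit`, its field names, and every sequence are hypothetical placeholders, and the DNA is assumed to be written as the coding strand so that the RNA is obtained by replacing T with U.

```python
from dataclasses import dataclass


@dataclass
class TranscriptionUnit:
    promoter: str           # upstream regulatory sequence; not transcribed
    rna_coding_region: str  # exons and introns
    terminator: str         # transcribed along with the RNA coding region

    def transcript(self) -> str:
        """Return the primary RNA: coding region plus terminator, no promoter."""
        transcribed_dna = self.rna_coding_region + self.terminator
        # Coding-strand convention (an assumption of this sketch): T becomes U.
        return transcribed_dna.replace("T", "U")


# Toy sequences, invented for illustration only.
unit = TranscriptionUnit(
    promoter="GCTATAAAAGG",
    rna_coding_region="ATGGCCAAATGGTAA",
    terminator="AATAAA",
)
print(unit.transcript())  # AUGGCCAAAUGGUAAAAUAAA
```

The promoter appears in the DNA but never in the output of `transcript()`, while the terminator does, mirroring the description of which parts of the unit are transcribed.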

Scientists have recently discovered that some mRNA molecules are coded by exons from multiple transcription units through a process known as trans-splicing. In fact, a European group of researchers estimated that about 4% to 5% of tandem transcription units (i.e., distinct but adjacent transcription units) in humans are transcribed together to create single "chimeric" mRNA molecules (Parra et al., 2006). Scientists are not sure how this occurs. Some speculate that transcription overrides the first transcription terminator and does not stop until it reaches the second termination site; others suspect that both transcripts are formed independently and then spliced together to form the chimeric mRNA molecule.

Delineating Gene Regions

The more scientists learn about the genome and gene expression, the harder it becomes to identify the point along a stretch of nucleotides at which a single gene begins and ends; indeed, it is increasingly difficult to determine whether genes have discrete nucleotide start and stop points at all. This complexity continues to make it difficult for scientists to agree on exactly what a gene is. At the very least, scientists now know that Crick's original sequence hypothesis was overly simplistic for eukaryotes. Eukaryotic genes are not simply linear sequences of DNA that correspond one-to-one with their protein counterparts.

Moreover, scientists now know that not all transcribed RNA molecules, or transcripts, end up being translated into protein products. For example, in a study of the mouse genome, researchers found that as much as 63% of the genome is transcribed but only about 1% to 2% is translated into a functional protein product (FANTOM Consortium et al., 2005). So not only is the notion of colinearity overly simplistic, but so too is the notion that all genes encode proteins. Many genes encode other types of molecules, like tRNA and rRNA, that have important known cellular functions. Other non-protein-coding RNAs work to regulate gene expression at multiple levels, and still other transcripts have unknown functions.

References and Recommended Reading


Beadle, G. W., & Tatum, E. L. Genetic control of biochemical reactions in Neurospora. Proceedings of the National Academy of Sciences 27, 499–506 (1941)

Berget, S. M., et al. Spliced segments at the 5' terminus of adenovirus 2 late mRNA. Proceedings of the National Academy of Sciences 74, 3171–3175 (1977)

Chow, L. T., et al. An amazing sequence arrangement at the 5' ends of adenovirus 2 messenger RNA. Cell 12, 1–8 (1977)

Crick, F. On protein synthesis. The Biological Replication of Macromolecules: Symposium for the Society of Experimental Biology 12, 138–162 (1958)

FANTOM Consortium, et al. The transcriptional landscape of the mammalian genome. Science 309, 1559–1563 (2005) doi:10.1126/science.1112014

Parra, G., et al. Tandem chimerism as a means to increase protein complexity in the human genome. Genome Research 16, 37–44 (2006)

Stein, J. P., et al. Ovomucoid intervening sequences specify functional domains and generate protein polymorphism. Cell 21, 681–687 (1980)

