A genomic analysis of yeast reveals that individual genes produce a rich complexity of RNA molecules with differing start and end sequences. The variation in these transcripts reflects the diversity of gene-regulation mechanisms. See Letter p.127
It seemed simple enough, the idea that one gene encodes one RNA transcript that is translated into one protein. But over the past 50 years, molecular biology has proven to be more complex than this 'central dogma', first proposed by Francis Crick in 1958 (ref. 1). On page 127 of this issue, Pelechano et al.² take transcript complexity to new ends, reporting nearly 2 million different RNA transcripts for a yeast genome that contains roughly 6,000 protein-encoding genes*.
The RNA molecules studied by the authors are called transcript-RNA isoforms, or TIFs. These are RNAs that traverse the same region of a genome, but have differing start (5′) and end (3′) sequences. Different TIFs have the potential to alter the coding and regulatory capacity of RNA³,⁴, and so the diversity of TIFs identified is intriguing. However, the plethora of isoforms may distil down to a relatively small number of functionally distinct RNAs for each gene (Fig. 1). For example, the TIFs found by the authors often have ends just a few nucleotides apart, and can be clustered into about 370,000 major TIFs (mTIFs). TIFs belonging to an mTIF probably arise from imprecise initiation of transcription. Moreover, isoforms can be quite rare compared with the predominant TIFs, which raises questions about the importance of low-abundance TIFs. The advance presented by Pelechano and colleagues' study is the comprehensiveness and resolution of TIF identification, and the complexity of transcript diversity, that can be revealed by sequencing both the 5′ and 3′ ends of the same RNA molecule.
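To make the idea of collapsing near-identical isoforms concrete, here is a minimal sketch in Python. It is not the authors' procedure: the greedy strategy, the 5-nucleotide tolerance, the function name cluster_tifs and the example counts are all assumptions chosen for illustration. Each TIF is represented only by its 5′ and 3′ coordinates and a read count.

```python
def cluster_tifs(tif_counts, tolerance=5):
    """Greedily group TIFs whose 5' and 3' ends both lie within
    `tolerance` nucleotides of a more abundant TIF.

    tif_counts maps (start, end) -> read count (hypothetical input format).
    Returns a dict mapping each representative (start, end) to the summed
    read count of all TIFs assigned to it.
    """
    clusters = {}
    # Visit the most abundant isoforms first so that they become the
    # representatives; ties are broken by coordinate for determinism.
    ordered = sorted(tif_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for (start, end), count in ordered:
        for rep in clusters:
            if abs(rep[0] - start) <= tolerance and abs(rep[1] - end) <= tolerance:
                clusters[rep] += count
                break
        else:
            # No existing representative is close enough: start a new cluster.
            clusters[(start, end)] = count
    return clusters


# Invented example: three isoforms differ by a few nucleotides at each end,
# one ends early and one starts late.
example = {
    (100, 900): 50,
    (102, 898): 7,
    (99, 903): 3,
    (100, 600): 12,
    (350, 900): 2,
}
major = cluster_tifs(example)
# major == {(100, 900): 60, (100, 600): 12, (350, 900): 2}
```

In this toy case five TIFs reduce to three clusters, and the rarest cluster carries only 2 of 74 reads, which mirrors the question raised above about how much weight low-abundance isoforms deserve.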
When piecing together the full complement of RNA transcripts in a cell from sequencing data, one may be erroneously led into thinking that two overlapping transcripts represent a single longer transcript, such that distinct transcripts might go unnoticed. The difficulty is exacerbated by the fact that some sequencing methods fall short of reading both ends of the same RNA molecule, owing to very long transcript lengths and short-read technology. It was addressed by methods that generate paired-end ditags (PETs)⁵, in which the 5′ and 3′ ends of an RNA molecule are converted to DNA and stitched together; sequencing across this junction allows simultaneous identification of both ends of the RNA. Pelechano et al. present a refined version of this method, called TIF-Seq, which they use in conjunction with deep sequencing, such that each RNA sequence is detected multiple times in the data set.
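The ambiguity described above can be illustrated with a toy Python sketch. It is a deliberately simplified model, not TIF-Seq or any real assembly algorithm: transcripts are reduced to (start, end) coordinate pairs, and the function names and coordinates are invented for the example.

```python
def merge_overlapping(intervals):
    """Collapse overlapping (start, end) intervals into longer ones,
    as an approach that sees only which positions are covered would."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def distinct_by_ends(intervals):
    """Keep every distinct (start, end) pair, which is possible only when
    both ends of each molecule are observed together."""
    return sorted(set(intervals))


# Three molecules: two copies of one transcript and one overlapping neighbour.
molecules = [(100, 700), (100, 700), (500, 1200)]

from_coverage = merge_overlapping(molecules)   # [(100, 1200)]
from_paired_ends = distinct_by_ends(molecules) # [(100, 700), (500, 1200)]
```

With coverage alone the two transcripts look like a single molecule spanning positions 100 to 1,200; once both ends of each molecule are known, they are resolved as two distinct isoforms.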
A typical protein-coding messenger RNA (mRNA) molecule has three main parts: a 5′ untranslated region (UTR), a coding region (also called an open reading frame, or ORF) and a 3′ UTR. Pelechano and colleagues' analysis shows that all of these regions can be varied in a cell's TIF repertoire, and suggests how this might regulate cell function. For example, the authors identify more than 200 genes with mTIFs that start or end within the coding region, thereby resulting in truncated proteins with potentially altered function. UTRs often contain regulatory elements that alter mRNA stability and protein-translation efficiency, and the authors also find that these regulatory elements occur more often where the location of 5′ and 3′ ends is most variable. Thus, by virtue of where the enzyme RNA polymerase II, which transcribes DNA to RNA, starts or stops, the resulting RNA may have a vastly different rate of turnover or translation.
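As a rough illustration of how a transcript's ends relate to the parts of an mRNA described above, the following Python sketch labels a TIF by where it starts and ends relative to a single ORF. The coordinate convention (forward strand, inclusive positions), the category names and the example numbers are assumptions made for this sketch, not the authors' classification scheme.

```python
def classify_tif(tif_start, tif_end, orf_start, orf_end):
    """Label a transcript by where its ends fall relative to one ORF.

    All coordinates are assumed to be on the forward strand and inclusive.
    """
    if tif_end < orf_start or tif_start > orf_end:
        return "no overlap"
    starts_inside = tif_start > orf_start  # 5' end lies within the coding region
    ends_inside = tif_end < orf_end        # 3' end lies within the coding region
    if starts_inside and ends_inside:
        return "internal"
    if starts_inside:
        return "starts within ORF"
    if ends_inside:
        return "ends within ORF"
    return "covers ORF"


# Hypothetical ORF spanning positions 200 to 800.
orf = (200, 800)
labels = [
    classify_tif(150, 900, *orf),  # "covers ORF": both UTRs present
    classify_tif(400, 900, *orf),  # "starts within ORF"
    classify_tif(150, 500, *orf),  # "ends within ORF"
    classify_tif(300, 500, *orf),  # "internal"
]
```

Only transcripts in the "covers ORF" category could encode the full-length protein; among those, differences in how far the ends extend beyond the ORF correspond to the variation in 5′ and 3′ UTR length discussed in the text.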
The authors also report more than 700 examples of differential UTR usage that is related to the presence of an upstream ORF (uORF). uORFs are thought to be too short to code for a protein, but they may regulate the translation frequency of the downstream ORF, thereby affecting the amount of protein produced. However, the TIF-Seq assay revealed that about half of the annotated uORFs are actually transcribed independently of the downstream ORF, suggesting that they have an independent function. Reciprocally, many examples were found in which two adjacent ORFs that were thought to be transcribed independently were actually encoded on the same transcript.
Some of the identified TIFs might have no protein-coding function, such as those that run across the start site of a neighbouring gene. It is conceivable that simply the act of transcription can regulate the expression of a nearby gene, by altering the local structure of chromatin (the complex of DNA and associated proteins that make up chromosomes) and thereby changing the accessibility of DNA in the region. In this scenario, the region that RNA polymerase II transcribes will be important, but the resulting TIF will be an irrelevant by-product.
So what is the origin of TIF diversity? Pelechano and colleagues focused on the possible sources of different 5′ ends, although their results indicate that there is actually even more variation in 3′ ends. One source of 5′ variation will be at the initiation of transcription. Transcription enzymes are guided to the DNA by sequence elements, such as the TATA box, in promoter regions in DNA. In yeast, RNA polymerase II is thought to scan downstream from these sites for a certain distance before initiating transcription. Initiation might be more probable at DNA sequences at which the enzyme dwells for longer, and this might be driven directly by the underlying sequence and/or by impeding chromatin structures⁶. Variability in 5′ ends might also arise through the influence of multiple promoters on a given gene. Alternatively, weak or defective promoters might produce a small number of RNA molecules that are measurable but not meaningful.
Although TIF-Seq will help us to understand these processes by providing robust estimates of transcript diversity, it is not without its limitations. First, bona fide TIFs that are not post-transcriptionally modified by the cell (by 5′ capping and 3′ polyadenylation) will not be detected. Second, the reverse-transcriptase enzyme used to convert the RNA to DNA in the TIF-Seq assay must read the entire length of the RNA, which is problematic for long RNA molecules. TIF-Seq may therefore be less sensitive in human systems, which have long RNAs. Third, because the method detects only RNA ends, it does not detect whether isoforms have had a portion of their interior spliced out, as is common in multicellular organisms.
Despite these caveats, there is no doubt that the diversity of isoforms per cell identified by Pelechano et al. is dramatic. The findings help us not only to understand the origin of these transcripts, but also to explain why cell populations can never be homogeneous, even if they are derived from a single starting cell. Such intrinsic heterogeneity has broad implications, from providing opportunities for adaptation during evolution to explaining in part why it is difficult to kill all cancer cells in a tumour.
*This article and the paper under discussion² were published online on 24 April 2013.