Advances in DNA-sequencing technology provide unprecedented insight into the entire collection of four genomes' transcribed sequences. They herald a new era in the study of gene regulation and genome function.
Genomes are the blueprints of life: they contain all the information necessary to build and operate their hosts. But we still have much to learn about the language of DNA to interpret the billions of Gs, As, Ts and Cs, the DNA bases that spell out life. The information-containing portions of genomes are transcribed into two RNA classes: messenger RNAs, which are translated into proteins; and non-coding RNAs, which have regulatory and mechanical roles. So studying the transcribed portion of the genome — the transcriptome — significantly aids gene identification, as well as providing insight into the inner workings of the genome and the biology of an organism. Five recent papers1,2,3,4,5, including one on page 1239 of this issue by Wilhelm et al.1, describe how advances in DNA-sequencing technology can be harnessed to explore transcriptomes in remarkable detail.
The concept of sequencing large numbers of randomly selected mRNAs is not new. It forms the basis of the controversial, yet revolutionary, expressed sequence tag (EST) method6, which was originally used to identify genes in the reference copy of the human genome. In this technique, genes are quickly identified through sequencing small fragments of large numbers of mRNAs. Although EST sequencing remains useful, it is relatively slow, requires considerable resources and generally cannot identify mRNAs that are expressed at low levels.
DNA microarrays are also powerful tools for transcriptome analysis. Particularly informative are tiling arrays, which are dotted with DNA sequences derived from defined intervals (for example, every 35 base pairs) throughout the genome. Fluorescently labelled RNA is then allowed to bind to the arrays, and the transcribed portions of the genome are identified by determining which DNA sequences pair with the RNA. But tiling arrays also have several shortcomings. First, they can be used only for organisms with known genome sequences. Second, their limited sensitivity, specificity and dynamic range (the ratio of the smallest to the largest fluorescent signal) make it difficult to identify low-abundance mRNAs and to distinguish between highly similar mRNA sequences. Finally, the number of DNA probes that fit on a microarray is limited, putting constraints on the minimum feasible genomic distance between the probes, and thus on the resolution at which a genome can be analysed.
Enter the trio of next-generation sequencing technologies — systems called 454 (from 454 Life Sciences), Solexa (from Illumina) and SOLiD (from ABI) — which can generate gigabases of sequence in a single experiment7. They differ from traditional sequencing methods in two ways. First, rather than sequencing individual DNA clones, hundreds of thousands (the 454 system) to tens of millions (Solexa and SOLiD) of DNA molecules are sequenced in parallel. Second, the sequences obtained are much shorter (25–50 nucleotides for the Illumina and ABI technologies, and 200–400 nucleotides for the 454 system) than those generated by traditional sequencing (typically more than 800 nucleotides). Matching these shorter sequences unambiguously to the reference genome is more difficult, but this is a relatively minor trade-off compared with the massive amount of total sequence generated using these technologies. The three sequencing systems have already revolutionized the study of chromatin structure, DNA-binding proteins, DNA methylation, genome organization and small RNAs7. But how useful they would be for studying transcriptomes was not known.
Five teams have now used a method called mRNA-Seq (Fig. 1) to sequence, at various levels of detail, the transcriptomes of four organisms: the fission yeast Schizosaccharomyces pombe1, the budding yeast Saccharomyces cerevisiae2, the plant Arabidopsis thaliana3 and the laboratory mouse4,5. For the sequencing step, four of these groups used the Solexa system1,2,3,4 and one team5 used the SOLiD system. In each study, between 30 and 125 million sequences, each 25–39 base pairs in length, were obtained. The most inclusive of these studies was performed by Wilhelm et al.1, who generated 122 million 39-base-pair sequences for S. pombe, corresponding to nearly five gigabases of sequence or 250 equivalents of this organism's genome.
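The arithmetic behind the 'nearly five gigabases' figure is simple to check. The short Python sketch below multiplies the read count by the read length reported for the S. pombe data set; the function names are illustrative only and do not come from any of the papers. A second helper shows how a total could be converted into genome equivalents once a genome length is supplied, but no genome length is assumed here.

```python
def total_bases(read_count: int, read_length: int) -> int:
    """Total sequence obtained when every read has the same length."""
    return read_count * read_length


def genome_equivalents(bases: int, genome_length: int) -> float:
    """How many times the sequenced bases would cover a genome of the given length.

    genome_length is supplied by the caller; no value is assumed here.
    """
    return bases / genome_length


# Figures quoted in the text for the S. pombe study: 122 million reads of 39 bp.
pombe_total = total_bases(122_000_000, 39)
print(f"{pombe_total / 1e9:.2f} gigabases")  # 4.76 gigabases
```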
But how comprehensively do these analyses cover the known genes? In the one billion bases of sequence obtained for S. cerevisiae, only about 91% of the known genes are detected. By contrast, sequencing five billion bases of the S. pombe transcriptome, Wilhelm et al. identify 99.3% of known genes. So although 'moderate' sequencing of the transcriptome can quickly detect most genes, identification of all genes requires extraordinarily 'deep' sequencing.
The mRNA-Seq method can also detect previously unidentified genes. In S. pombe, 453 new transcripts are identified, of which 427 seem to be non-coding. Similarly, in the S. cerevisiae transcriptome, 204 previously undetected transcripts are identified. Although these numbers sound relatively small, they are noteworthy because the organisms investigated had already been extensively studied. Undoubtedly, mRNA-Seq will also identify unknown genes in organisms that are not typically studied in the laboratory.
Genes consist of sequences called exons, which are separated by sequences known as introns. After transcription, introns are spliced out of the RNA to form mature mRNA containing only exons. One limitation posed by the spacing of DNA probes in tiling arrays is that short introns cannot be confidently identified. mRNA-Seq, by contrast, provides unparalleled resolution because many sequence 'reads', rather than matching introns, cover the sequences at the ends of the exons on either side of an intron. These reads not only identify introns but also precisely delineate the ends of exons and introns. Data on such intron-spanning reads confirm 78% and 93% of known introns in S. cerevisiae and S. pombe, respectively. Moreover, Wilhelm and colleagues discover1 20 new introns in S. pombe.
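How an intron-spanning read pins down an intron's boundaries can be illustrated with a toy example. The Python sketch below is not the alignment method used in any of the studies; it is a deliberately naive exact-match search, and the sequences, the function name find_intron and the min_anchor parameter are all invented for illustration. It splits a read at each possible position and asks whether the two halves match the genome with a gap between them; that gap is the inferred intron.

```python
def find_intron(genome: str, read: str, min_anchor: int = 4):
    """Return (start, end) of the genomic gap implied by a junction-spanning read.

    Coordinates are 0-based and half-open. Returns None if the read cannot be
    split into two exactly matching pieces separated by a gap. Only the first
    genomic match of the left piece is considered (a toy simplification).
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        if left_pos == -1:
            continue
        left_end = left_pos + len(left)
        # The right piece must start at least one base downstream of the left piece.
        right_pos = genome.find(right, left_end + 1)
        if right_pos != -1:
            return left_end, right_pos
    return None


# Invented toy sequences: two exons flanking a 19-base intron.
exon1 = "ATGGCTTCAGACCTGAAC"
intron = "GTAAGTTTTCTAACTTTAG"
exon2 = "TTCGATCCGGAGTAA"
genome = exon1 + intron + exon2

# A read from the mature mRNA: the last 8 bases of exon 1 joined to the first 8 of exon 2.
junction_read = exon1[-8:] + exon2[:8]
print(find_intron(genome, junction_read))  # (18, 37)
```

Real aligners must also cope with sequencing errors, repeated sequences and junctions whose exact position is ambiguous, none of which this sketch attempts.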
The dynamic range, sensitivity and specificity of mRNA-Seq also make it ideal for quantitatively analysing various aspects of gene regulation, including differences in transcript abundance. For example, a comparison3 of the transcriptome of normal A. thaliana with the transcriptomes of three strains of this plant that are defective in different aspects of DNA methylation — a modification that regulates gene expression — reveals scores of genes, some of them new, that are differentially expressed when DNA methylation pathways are perturbed.
The efficiency of intron removal is another aspect of gene regulation that can be monitored, by comparing the number of reads that span an intron (and so derive from spliced transcripts) with the number that span the corresponding exon–intron junctions (and so derive from unspliced transcripts) (Fig. 2). Wilhelm et al.1 compare S. pombe transcriptomes from proliferating cells with those of cells undergoing different stages of meiotic cell division. They identify 314 introns from 254 genes that are spliced more efficiently during meiosis than in rapidly proliferating cells; only 12 such meiotically spliced genes were previously known. Further analysis of this data set also reveals a striking correlation between transcription levels and splicing efficiency: the higher the level of transcripts, the more efficiently they are processed to mature mRNAs.
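One way to turn such read counts into a single number is sketched below in Python. The formula is an assumption made for illustration and is not necessarily the measure used by Wilhelm et al.: it averages the reads covering the two exon–intron junctions to estimate unspliced transcripts, then reports the spliced fraction. The function name and all read counts are invented.

```python
def splicing_efficiency(spliced: int, unspliced_5p: int, unspliced_3p: int):
    """Fraction of junction-covering reads that come from spliced transcripts.

    spliced       reads spanning the intron (exon-exon junction)
    unspliced_5p  reads spanning the upstream exon-intron junction
    unspliced_3p  reads spanning the downstream intron-exon junction
    Returns None when there are no informative reads.
    """
    unspliced = (unspliced_5p + unspliced_3p) / 2  # average the two junctions
    total = spliced + unspliced
    if total == 0:
        return None
    return spliced / total


# Invented counts for one intron in two hypothetical samples.
proliferating = splicing_efficiency(spliced=20, unspliced_5p=30, unspliced_3p=30)
meiotic = splicing_efficiency(spliced=90, unspliced_5p=12, unspliced_3p=8)
print(proliferating, meiotic)  # 0.4 0.9
```

An intron would be flagged as meiotically regulated in this toy scheme if its efficiency rose markedly between the two samples; a real analysis would also need to account for read depth and statistical noise.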
In organisms such as the mouse, exons can be spliced together in different patterns to generate several mRNA transcripts from a single gene — a process called alternative splicing. The intron-spanning reads obtained by mRNA-Seq can also be used to identify cases of alternative splicing and to quantify changes in alternative splicing that occur in different samples4,5. For example, a comparison of the mRNA-Seq transcriptomes obtained from mouse brain and muscle tissues singles out4 an exon in the Mef2d gene that is spliced in a specific way only in the muscle.
For transcriptome mining by mRNA-Seq, this is just the beginning of things to come. The method will become even more powerful with technological improvements such as longer reads, paired-end reads (the ability to obtain sequence from both ends of each DNA molecule and to determine the distance between those sequences)8, enrichment for sequences of interest9, DNA-strand-specific sequencing of the mRNA transcripts1, and methods to sequence all RNAs1 and not just mRNA. Algorithms that can accurately assemble short-sequence reads into longer stretches10 will further allow sequencing of the transcriptome of organisms for which a reference genome is not available. Together, these advances will provide even greater insight into transcriptional landscapes, regulation of gene expression and alternative splicing. Most importantly, next-generation sequencing has the potential to turn individual laboratories into small genome centres and to allow an individual scientist to determine the entire transcriptome of any source (any organism, tumour samples, tissues from patients with neurodegenerative disorders, and so on) in a matter of days, and for only a few thousand US dollars. This technology will have a lasting impact on the methods and speed with which we do science.