New work from Tom Gingeras and colleagues extends the findings of a series of recent global analyses of transcription1, 2, 3, 4, 5, 6, 7 by revealing a much larger number of nonpolyadenylated (polyA−) transcripts than expected and an extraordinary level of organizational complexity in the human transcriptome.

A variety of recent evidence indicates that the majority of sequences in eukaryotic genomes are transcribed and that the proportion of transcribed nonprotein-coding sequences increases with developmental complexity (Table 1). However, it is the novel approaches that Gingeras and colleagues employed that allowed them to add spectacularly to the findings of these previous studies. In particular, the use of tiling arrays to identify transcribed fragments (‘transfrags’) of the human genome gives more complete and global coverage of the transcriptome than standard cDNA cloning and sequencing approaches, although the relationship between adjacent transfrags that derive from nearby genomic regions is initially uncertain (see below).

Table 1 Increase in nonprotein-coding transcription in metazoa

Cheng et al6 isolated mature (ie postspliced) cytoplasmic polyA+ RNA from eight human cell lines and interrogated tiling chips covering 10 human chromosomes in triplicate. They found that the detectable transfrags in each cell line covered on average 5% of the genomic sequences on the arrays. Cumulatively, 10% of the genomic sequences were represented in the polyA+ RNA fraction of one or more cell lines, indicating that many of the observed RNAs were cell type specific. The average length of the transfrags (exons) was approximately 120 bp, but PCR cloning and sequencing showed that the detected transfrags are derived from much longer primary transcripts covering extended genomic regions.
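
The gap between the per-cell-line figure and the cumulative figure is what one would expect when transfrags are partly cell type specific: the cumulative value is the union of the transfrag intervals across cell lines. The following is a minimal sketch of that calculation, not the authors' pipeline. The cell-line names, coordinates, and region length are invented for illustration, and intervals are assumed to be half-open (start, end) positions in base pairs.

```python
# Toy illustration only: all coordinates below are hypothetical,
# not data from Cheng et al.

def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def covered_fraction(intervals, region_length):
    """Fraction of a region of the given length covered by the intervals."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / region_length

REGION_LENGTH = 10_000  # hypothetical length of the tiled region, in bp

transfrags = {  # hypothetical transfrags detected in each cell line
    "line_A": [(100, 220), (900, 1020), (5000, 5120)],
    "line_B": [(100, 220), (3000, 3120), (7000, 7120)],
    "line_C": [(900, 1020), (3050, 3170), (8000, 8120)],
}

per_line = {
    name: covered_fraction(frags, REGION_LENGTH)
    for name, frags in transfrags.items()
}
all_frags = [frag for frags in transfrags.values() for frag in frags]
cumulative = covered_fraction(all_frags, REGION_LENGTH)

for name, fraction in per_line.items():
    print(f"{name}: {fraction:.1%}")
print(f"cumulative: {cumulative:.1%}")
```

In this toy case, each line covers the same fraction of the region, yet the cumulative coverage is roughly twice as large because only some transfrags are shared between lines.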

The current annotation of the human genome indicates that less than 2% of the genome is present in known or predicted mRNAs. So, most of the RNAs that the authors observed are not derived from known or predicted transcripts: over 56% of the transfrags do not overlap with any well-characterized exon, mRNA or EST annotation; 30% map to ‘intergenic’ regions and 26% to introns of known genes.

Transcriptome analyses have traditionally focused on cytoplasmic polyA+ RNA. This strategy was used partly to exclude infrastructural RNAs (rRNAs and tRNAs) and incompletely processed primary transcripts, and partly because it was assumed that most transcripts are derived from protein-coding genes and so are processed to polyadenylated mRNAs that are exported to the cytoplasm for translation.

In a radical departure from this tradition, Cheng et al extended their study to examine polyA+ and polyA− RNAs fractionated from the nucleus and the cytoplasm of the cell line HepG2. In both fractions, they found more nonpolyadenylated than polyadenylated RNA, a pattern that is consistent with some early but largely forgotten studies from 30 years ago.8, 9, 10 Over half of the detected transfrags are unique to the largely unstudied polyA− and nuclear polyA+ fractions of the transcriptome. Kiyosawa et al11 recently reported similar observations in mouse. A very large and almost completely unexplored area of the expressed RNA repertoire in mammalian cells has just been reopened.

The tiling array technique has some limitations: it does not reveal which strand of the chromosome is transcribed (because the RNA sample is converted into double-stranded cDNA before hybridization), and it does not indicate which transfrags are connected in different transcripts. To address these limitations, the authors studied several hundred randomly selected, nonannotated transfrags in more detail. They used rapid amplification of cDNA ends (RACE) to generate the extended sequences that are linked upstream and downstream of the transfrags in vivo. To map and characterize the transcripts that contain the transfrags, these PCR-amplified products were used to reinterrogate the tiling arrays, as well as cloned and sequenced to confirm their structure.

Over half of the studied transfrags show evidence of transcription from both strands. In a number of cases, the authors found exact reverse complement transcripts, so that one transcript has the standard GT–AG sequence at its intron boundaries and its partner has the complementary but nonstandard sequence CT–AC. The most plausible explanation for this pattern is that an RNA-dependent RNA polymerase uses the partner transcript as a template, although no such enzyme has yet been identified in mammals.12 In this context, however, it is worth noting that a reservoir of replicable RNA molecules has been proposed to be responsible for the non-Mendelian inheritance of ancestral alleles not present in parental chromosomes in Arabidopsis.13
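
The relationship between the two boundary motifs can be checked directly: reverse-complementing an intron that begins with GT and ends with AG yields a sequence that begins with CT and ends with AC. The following is a minimal sketch; the intron sequence and the function names are hypothetical and serve only to illustrate the arithmetic.

```python
# Toy illustration only: the sequence below is invented, not from the study.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def intron_boundaries(intron):
    """Return the first and last two bases of an intron, e.g. 'GT-AG'."""
    return f"{intron[:2]}-{intron[-2:]}"

sense_intron = "GTAAGTCTGACCTTTCAG"  # hypothetical intron with GT...AG ends
antisense_intron = reverse_complement(sense_intron)

print(intron_boundaries(sense_intron))
print(intron_boundaries(antisense_intron))
```

This is why a CT–AC intron boundary on one strand is read as the signature of an exact reverse complement of a conventionally spliced transcript from the other strand.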

Perhaps the most basic question about these mysterious unannotated transcripts, termed TUFs (transcripts of unknown function), is whether or not they encode proteins. The cloning and sequencing of 178 TUFs reveals that most do not possess open reading frames greater than 100 amino acids. This pattern is consistent with these TUFs being noncoding, a conclusion also reached by others.2 However, some might encode short proteins, and more powerful techniques such as synonymous versus nonsynonymous substitution analysis will need to be employed to provide tighter bounds on this question. In any case, potential short protein coding sequences do not explain the vast extent of the hidden transcriptome that is being brought to light.
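
An open reading frame length test of this kind can be sketched as a scan of all six reading frames for the longest stretch running from an ATG to an in-frame stop codon. This is a simplified illustration and not the method used in the study: it assumes a DNA (cDNA) sequence, the standard stop codons, and it ignores reading frames that lack a stop codon. The function names and the default threshold parameter are hypothetical.

```python
# Simplified sketch: assumes DNA input, standard stop codons, and counts
# only ORFs that run from an ATG to an in-frame stop codon.

COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def longest_orf_codons(seq):
    """Length in codons (stop excluded) of the longest ORF in six frames."""
    seq = seq.upper()
    best = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None  # position of the first ATG of the current ORF
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    best = max(best, (i - start) // 3)
                    start = None
    return best

def lacks_long_orf(seq, max_codons=100):
    """True if no ORF exceeds max_codons amino acids (hypothetical cutoff)."""
    return longest_orf_codons(seq) <= max_codons

short_transcript = "ATG" + "GCT" * 5 + "TAA"   # toy sequence, 6-codon ORF
long_transcript = "ATG" + "GCT" * 120 + "TGA"  # toy sequence, 121-codon ORF

print(longest_orf_codons(short_transcript), lacks_long_orf(short_transcript))
print(longest_orf_codons(long_transcript), lacks_long_orf(long_transcript))
```

A length cutoff like this can only flag the absence of a long reading frame; as noted above, it cannot rule out short coding sequences.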

These studies also demonstrate the interlaced nature of transcription, so that rather than neatly separated genes, the genome harbors a network of nested and overlapping transcripts on both strands, where introns of one harbor exons of another. Large-scale cDNA sequencing projects such as FANTOM have also revealed such complex patterns, at least in part2 (Carninci et al., submitted for publication). Transcript overlap occurs not only on opposite strands but also on the same strand, so that there is often no clear distinction between splice variants and overlapping neighboring genes.

Kapranov et al7 explore these complex patterns further in a subsequent paper. They examined the structures of transcripts from 14 transcribed loci, representing both known genes and unannotated transcripts taken from those described in Cheng et al.6 They show that there is an amazing world of previously unknown and again barely explored transcripts. Even loci that encode well-known proteins, such as sonic hedgehog, are shown to have previously unknown exons and novel isoforms that are likely to have important functions. They also report that it is not uncommon that a single base pair is part of an intricate network of multiple isoforms of overlapping sense and antisense transcripts, the majority of which are unannotated.

The picture that emerges is that the human genome, far from being a desert with islands of protein-coding sequences, is a nest of interwoven transcriptional units that cover a large fraction of the genome, including many ‘intergenic’ regions previously considered to be inert. This complexity will undoubtedly have consequences for our understanding of genetic information and pleiotropy, since a mutation may affect multiple overlapping ‘genes’. Indeed, the utility of the gene concept itself is no longer clear, both in terms of its discreteness and in terms of the usual presumption that proteins express and transmit most genetic information.

The cDNA cloning and the tiling array approaches give complementary but incomplete views of the transcriptome. The depth of interrogation of the range of expressed transcripts, particularly rare transcripts, by whole tissue cDNA approaches has obvious limitations and is subject to diminishing returns, even using aggressive normalization techniques to remove common transcripts. Tiling arrays are more global, but the data are inherently noisier and more disconnected. Not only are the strand and exon linkages uncertain, but the exact exon boundaries are also not revealed with confidence. Even the RACE/array technique does not provide exact exon boundaries and transcript sequences. Cloning and sequencing of these RACE/PCR products is required to reveal the actual transcripts in the detail required for analyzing their characteristics and function. This in turn means that all transfrags might have to be examined in this way to provide a comprehensive view of the human transcriptome. Even if these procedures were refined to be high throughput, this task would be a huge undertaking, although not beyond the scale of past genome projects.

Finally, it is unlikely that tiling arrays and other techniques will have detected all the stable processed transcripts from the human genome. The cDNA approaches used to interrogate tiling arrays will readily miss short RNAs, for example microRNAs, of which around 1000 have thus far been identified in human.14, 15 These miRNAs regulate a wide variety of important developmental processes, and are probably just the tip of a very big iceberg of small regulatory RNAs, most of which remain to be discovered.15 Potentially important16 regulatory RNAs expressed below the detection limit of such assays are also likely to have been missed. Moreover, since a high proportion of the transcripts show cell-type-specific expression, and only eight cell lines were analyzed, many new transcripts will almost certainly be found in other cell types.

It is now beyond question that the majority of the human genome is transcribed, and that the vast majority of the transcribed sequences are nonprotein coding. There are only two choices – either this transcription is largely meaningless or it is fulfilling some unexpected function. The former explanation is becoming more difficult to sustain, as many of these transcripts and their splicing patterns show cell specificity, although very few have yet been experimentally studied.17 Both logic and a wide variety of molecular genetic evidence now suggest that there are two inter-related levels of genetic information expressed in complex organisms – one specifying the analog components of cells (mainly proteins, including their many isoforms) and the other an extensive regulatory RNA network (including microRNAs) transacted by sequence-specific recognition to form various RNA:RNA and RNA:DNA complexes that are in turn recognized and acted upon by different types of nucleic acid-binding proteins.15, 18, 19

We predict that the coming years will see an avalanche of studies demonstrating function for noncoding RNAs, including the many intronic and ‘antisense’ RNAs that are transcribed. If current indications hold, we may have to reassess many, if not most, of our conceptions of how genetic information is encoded and transacted in our genome.