A central challenge in genome annotation is determining the function of sequences that do not encode proteins, but make up the overwhelming bulk of large genomes — some 99% in humans. A significant fraction of these sequences are pseudogenes, or fossils of ancient proteins, and although many of them are transcribed into RNA, they have hitherto been deemed 'junk'. However, given the abundance of pseudogenes, it is unlikely that they are useless. One function suggested for them is gene regulation, and RNA interference (RNAi) has been proposed as the mechanism for carrying this out. Six papers1,2,3,4,5,6, including three in this issue (pages 793, 798 and 803), significantly expand the known scope of RNAi by describing the discovery of natural small interfering RNA (siRNA) sequences in mice and fruitflies, some of which are potentially transcribed from pseudogenes.

The textbook definition of a pseudogene is an inheritable genetic element that is similar to a functioning gene, yet is non-functional. But what is meant by non-functional is debatable — not transcribed, not translated, or not under control of a promoter sequence? Pseudogenes are similar to protein-coding genes because they are usually copied from a parent gene, either through unsuccessful duplication or by retrotransposition (whereby a gene is transcribed into RNA, which is then 'reverse-transcribed' back into DNA and inserted somewhere different in the genome). Because all this copying does not yield a normal, functioning protein, pseudogenes are usually identified by obvious 'disablements' in their sequence, such as frameshifts or premature stops. They have been of interest because they provide records of ancient molecules encoded by the genome.
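The two kinds of 'disablement' named above can be illustrated with a toy check. This is a minimal sketch, not the method used by any annotation pipeline: it assumes the candidate sequence is already aligned to the parent gene's reading frame and starts at the first codon, and the function names and example sequences are invented for illustration. Real pseudogene identification relies on alignment to the parent protein.

```python
# Toy detection of pseudogene 'disablements' in a DNA coding sequence.
# Assumption: the sequence is in frame with the parent gene's start codon.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(cds: str) -> list[int]:
    """Return 0-based codon indices of in-frame stops before the last codon."""
    seq = cds.upper()
    n_codons = len(seq) // 3
    # The final codon is excluded because a stop there is the normal terminator.
    return [i for i in range(n_codons - 1) if seq[3 * i:3 * i + 3] in STOP_CODONS]


def length_suggests_frameshift(cds: str) -> bool:
    """A length that is not a multiple of three hints at an indel (crude test)."""
    return len(cds) % 3 != 0


# Hypothetical sequences, for illustration only.
parent = "ATGGCTTGGAAATAA"       # ATG GCT TGG AAA TAA
point_mutant = "ATGGCTTGAAAATAA"  # TGG -> TGA introduces an early stop
deletion = "ATGGCTGGAAATAA"       # one base lost, reading frame shifted

print(premature_stops(parent))               # []
print(premature_stops(point_mutant))         # [2]
print(length_suggests_frameshift(deletion))  # True
```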

Although pseudogenes have generally been considered as evolutionary 'dead-ends', one of the surprises of genome sequencing has been how abundant they are: tens of thousands of pseudogenes are found in mammalian genomes (roughly the same number as protein-coding genes in all mammals sequenced so far)7. In addition, a large proportion of these sequences seem to be under some form of purifying selection8 — whereby natural selection eliminates deleterious mutations from the population — and genetic elements under selection have some use. Finally, several large-scale genomic studies probing non-gene parts of the genome for biochemical activity have found many pseudogenes being transcribed and regulatory factors binding upstream of them. One such investigation, the ENCODE pilot project9, which looked at a representative 1% of the sequence of the human genome, found strong evidence for at least one-fifth of pseudogenes being actively transcribed.

These observations indicate that pseudogenes might not be purely dead relics of past genes but could be resurrected for new biochemical activities. Indeed, functioning pseudogenes have been reported previously. For instance, in snails, a pseudogene is involved in translational control of the gene that codes for nitric oxide synthase10. And transcripts of the mouse pseudogene makorin1-p1 have been proposed to inhibit degradation of their parent gene's mRNA, effectively enhancing its expression11, although this observation has been debated. Nevertheless, a clear mechanism for the functioning of pseudogenes has been lacking. The six studies — four in flies1,2,3,4 and two in mice5,6 — provide such a direct pathway, showing that pseudogene transcripts can act as natural siRNAs.

Broadly speaking, RNAi involves various types of small 'guide' RNA that regulate protein levels by targeting mRNA for degradation. Pseudogenic siRNAs account for two of the four categories that the six studies use to organize the natural, or 'endo', siRNAs (Box 1).

Endo-siRNAs in the first category mediate transposon silencing, which is typically a feature of Piwi-interacting RNAs (piRNAs). The studies were therefore careful to distinguish transposon-associated endo-siRNAs from piRNAs on the basis of size (21–22 nucleotides versus 24–30) and Argonaute effector-protein partner (Ago2 versus Piwi). The second category of endo-siRNAs arises from bidirectional transcription of partially overlapping loci on opposite DNA strands1,12. Studies in mice5,6 identify a few examples of these, and around 1,000 have been reported in flies1, with their target genes consisting mainly of those with nucleic-acid functions, such as nuclease activity and transcription-factor binding12.
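The two criteria used to separate transposon-associated endo-siRNAs from piRNAs can be written as a simple rule. The sketch below only restates the size ranges and Argonaute partners given above; the function name, the string labels and the 'unclassified' fallback are assumptions made for illustration, and the studies' actual classification procedures are not reproduced here.

```python
# Rule-of-thumb classifier based on the two criteria quoted in the text:
# size (21-22 nt versus 24-30 nt) and Argonaute partner (Ago2 versus Piwi).

def classify_small_rna(length_nt: int, partner: str) -> str:
    """Label a small RNA from its length and its effector-protein partner."""
    partner = partner.strip().lower()
    if 21 <= length_nt <= 22 and partner == "ago2":
        return "endo-siRNA"
    if 24 <= length_nt <= 30 and partner == "piwi":
        return "piRNA"
    # Anything that fails either criterion is left undecided in this sketch.
    return "unclassified"


print(classify_small_rna(21, "Ago2"))  # endo-siRNA
print(classify_small_rna(27, "Piwi"))  # piRNA
print(classify_small_rna(23, "Ago2"))  # unclassified
```

Requiring both criteria to agree is a deliberately conservative choice: a read of the right length bound to the 'wrong' partner is not forced into either class.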

The third category of endo-siRNAs, identified only in mice5,6, consists of products of the interaction between a spliced mRNA transcript from a protein-coding parent gene and an antisense transcript from its pseudogene, which can be located far away from its parent gene, on the same or a different chromosome (Fig. 1a). Endo-siRNAs of the fourth category are closely related to those in the third. They arise from hairpin-shaped sequences, which in mice5,6 can come from inverted-repeat structures of pseudogenes (Fig. 1b). Here, the pseudogene also regulates its parent gene, but the double-stranded RNA precursor of the endo-siRNA comes from transcription of an inverted-repeat sequence, producing a hairpin. The reports show that mouse proteins affected by the third and fourth categories of endo-siRNAs are disproportionately involved in particular functions, such as regulating cytoskeletal dynamics, which indicates that the underlying pseudogene-mediated regulation has been explicitly selected for and is not simply caused by random pairing of transcribed genes and pseudogenes.

Figure 1: Pseudogene-mediated production of endogenous small interfering RNAs (endo-siRNAs).

Pseudogenes can arise through the copying of a parent gene (by duplication or by retrotransposition). a, An antisense transcript of the pseudogene and an mRNA transcript of its parent gene can then form a double-stranded RNA. b, Pseudogenic endo-siRNAs can also arise through copying of the parent gene as in a and then nearby duplication and inversion of this copy. The subsequent transcription of both copies results in a long RNA, which folds into a hairpin, as one half of it is complementary to its other half. In both a and b, the double-stranded RNA is cut by Dicer into 21-nucleotide endo-siRNAs, which are guided by the RISC complex to interact with, and degrade, the parent gene's remaining mRNA transcripts. The mRNA from genes is in red and that from pseudogenes is in blue. Green arrows indicate DNA rearrangements.
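The hairpin in panel b forms because one half of the transcript is the reverse complement of the other. A minimal sketch of that idea follows, under strong simplifying assumptions: base pairing is perfect, the arms sit at the two ends of the transcript, and 'dicing' is modelled as chopping one strand into consecutive 21-nucleotide pieces. Real Dicer products are short duplexes with overhanging ends, and the sequences and function names here are hypothetical.

```python
# Toy model of an inverted-repeat transcript folding into a hairpin
# and being cut into 21-nucleotide fragments.

_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA string written with A, C, G and U."""
    return rna.upper().translate(_COMPLEMENT)[::-1]


def folds_into_perfect_hairpin(transcript: str, arm_len: int) -> bool:
    """True if the first arm_len bases can pair perfectly with the last arm_len."""
    rna = transcript.upper()
    if arm_len <= 0 or 2 * arm_len > len(rna):
        return False
    return rna[:arm_len] == reverse_complement(rna[-arm_len:])


def dice(strand: str, size: int = 21) -> list[str]:
    """Split a strand into consecutive fragments of `size` nt, dropping any remainder."""
    return [strand[i:i + size] for i in range(0, len(strand) - size + 1, size)]


# Hypothetical 42-nt arm, a 4-nt loop, then the inverted copy of the arm.
arm = "AUGGCUUGGAAACCCGGGUUU" * 2
transcript = arm + "GAAA" + reverse_complement(arm)

print(folds_into_perfect_hairpin(transcript, len(arm)))  # True
print([len(fragment) for fragment in dice(arm)])         # [21, 21]
```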

Hairpin precursors of endo-siRNAs have also been found in flies, but the evidence links them only weakly with inverted repeats of pseudogenes. Thus, most of the new data for pseudogenic siRNAs come from mouse rather than fly studies. One possible reason for this is that the mouse genome contains many more pseudogenes than the fly genome13. In fact, even compared with other metazoan organisms such as worms, flies are particularly poor in pseudogenes, possibly owing to pronounced genomic deletion processes known to occur in this organism14.

The scarcity of pseudogenes in flies makes their detection particularly difficult. Nevertheless, there is suggestive evidence that fly pseudogenes can act as sources of endo-siRNAs. First, an appreciable number of them (30) have an inverted-repeat structure, which is associated with the formation of hairpins. Second, many of the sequences obtained by ultra-high-throughput sequencing of small RNAs in the fly coincide with DNA regions containing pseudogenes. In particular, a small but significant number of the 'reads' found using the Solexa sequencing technology1,4 can be intersected with some 70 pseudogenes, for an average of roughly 12 reads each. Finally, there is strong evidence that for several genes (particularly the β-esterase gene and its pseudogene) a duplicated pseudogene forms a functional complex with its parent gene, with regulatory consequences15.
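Intersecting reads with pseudogenes amounts to counting the reads whose genomic interval overlaps each annotated pseudogene, then averaging. The following is a sketch under stated assumptions: coordinates are half-open (start included, end excluded), the names and positions are invented, and the brute-force loop stands in for whatever tools the studies actually used. At the scale quoted above, 70 pseudogenes at roughly 12 reads each corresponds to about 840 overlapping reads in total.

```python
# Count small-RNA reads overlapping annotated pseudogene intervals.
# Assumption: half-open coordinates, (chromosome, start, end).

from collections import Counter


def count_overlapping_reads(reads, pseudogenes):
    """Return a Counter mapping pseudogene name to number of overlapping reads."""
    counts = Counter()
    for r_chrom, r_start, r_end in reads:
        for name, (p_chrom, p_start, p_end) in pseudogenes.items():
            # Two half-open intervals overlap if each starts before the other ends.
            if r_chrom == p_chrom and r_start < p_end and p_start < r_end:
                counts[name] += 1
    return counts


# Hypothetical annotations and reads, for illustration only.
pseudogenes = {"psi_A": ("chr2L", 100, 500), "psi_B": ("chr3R", 1000, 1400)}
reads = [
    ("chr2L", 90, 111),     # overlaps the start of psi_A
    ("chr2L", 480, 501),    # overlaps the end of psi_A
    ("chr2L", 500, 521),    # begins exactly where psi_A ends: no overlap
    ("chr3R", 1200, 1221),  # inside psi_B
    ("chrX", 100, 121),     # no pseudogene on this chromosome
]

counts = count_overlapping_reads(reads, pseudogenes)
mean_reads = sum(counts.values()) / len(pseudogenes)
print(dict(counts))  # {'psi_A': 2, 'psi_B': 1}
print(mean_reads)    # 1.5
```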

Of course, to demonstrate the activity of pseudogenes conclusively, further experiments are needed. Deleting a pseudogene and demonstrating an effect on its potentially regulated parent gene would be most definitive. Also of great value would be studying the expression patterns of a potential endo-siRNA-producing pseudogene and its regulated parent gene across various tissues — data which should be generated by the ENCODE and modENCODE projects.

In addition to connecting RNAi with pseudogenes, the new studies1,2,3,4,5,6 also blur the distinctions between the three 'traditional' classes of small RNA — siRNAs, piRNAs and microRNAs (miRNAs) — which are distinct in their biogenesis and cellular roles (Box 1). The studies1,2,3,4,5,6 find that endo-siRNAs regulate transposons as piRNAs do; that, like miRNAs, they can arise from hairpins; and that, in flies, their processing involves a similar co-factor to the processing of miRNAs (Box 1).

This blurring of boundaries among different types of small RNA, together with the newly established links between siRNAs and pseudogenes, has interesting evolutionary implications. In plants, inverted duplications containing a protein-coding gene have been proposed16 as a mechanism to create new miRNAs. Thus, one can imagine a gene being copied (either by duplication or by retrotransposition) and this copy then being duplicated again, in inverted fashion. Given the ubiquitous nature of genomic transcription, the copy and its inverted duplicate could potentially be transcribed into a hairpin precursor of endo-siRNAs that regulate the parent gene.

As the function of the hairpin no longer has anything to do with encoding protein, its sequence, still under selection, can acquire frameshifts and stop codons, making it seem pseudogenic. One could even imagine its sequence drifting further and becoming gradually transformed into a miRNA gene, the sequence of which is much less similar to the gene encoding its target mRNA. So pseudogenes encoding endo-siRNAs might provide a crucial intermediate link to understanding the evolution of miRNA-mediated regulation17. Although speculative, the plausibility of this theory is bolstered by a recent survey18 of the genomic context of more than 300 human miRNA loci, which identified two that lie within pseudogenes.