One of the most beautiful aspects of the genetic code is its simplicity: three letters of DNA combine in 64 different ways, easily spelled out in a handy table, to encode the 20 standard amino acids that combine to form a protein.
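
The arithmetic behind that table is easy to check. The short Python sketch below simply enumerates every ordered triplet of the four DNA letters; the variable names are illustrative only.

```python
from itertools import product

DNA_LETTERS = "ACGT"  # the four DNA bases

# Every ordered triplet of bases is one codon: 4 * 4 * 4 = 64.
codons = ["".join(triplet) for triplet in product(DNA_LETTERS, repeat=3)]

print(len(codons))  # 64
print(codons[:4])   # ['AAA', 'AAC', 'AAG', 'AAT']
```

Because 64 triplets cover only 20 amino acids plus the stop signals, most amino acids are spelled by more than one codon.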

RNA: a difficult beast to predict. Credit: LAGUNA DESIGN/SPL

But between DNA and proteins comes RNA, and an expanding realm of complexity. RNA is a shape-shifter, sometimes carrying genetic messages and sometimes regulating them, adopting a multitude of structures that can affect its function. In a paper published in this issue (see page 53), a team of researchers led by Benjamin Blencowe and Brendan Frey of the University of Toronto in Ontario, Canada, reports the first attempt to define a second genetic code: one that predicts how segments of messenger RNA transcribed from a given gene can be mixed and matched to yield multiple products in different tissues, a process called alternative splicing. This time there is no simple table — in its place are algorithms that combine more than 200 different features of DNA with predictions of RNA structure.

The work highlights the rapid progress that computational methods have made in modelling the RNA landscape. In addition to understanding alternative splicing, informatics is helping researchers to predict RNA structures, and to identify the targets of small regulatory snippets of RNA that do not encode protein. "It's an exciting time," says Christopher Burge, a computational biologist at the Massachusetts Institute of Technology in Cambridge. "There's going to be a lot of progress in the next few years."

The floodgates were opened by high-throughput technologies that allow researchers to compile comprehensive catalogues of RNA molecules found in various tissues and under different environmental conditions. Such techniques revealed that about 95% of human genes with multiple coding segments are alternatively spliced, and that changes in this process accompany many diseases. But no one knew how to predict which form of a particular gene would be expressed in a given tissue. "The splicing code is a problem that we've been bashing our heads against for years," says Burge. "Now we finally have the technologies we need."

Blencowe and Frey's team used the masses of data generated by these technologies to train a computer algorithm to predict the outcome of alternative splicing in mice. Given the DNA sequence of a particular gene, the algorithm predicts which segments of that DNA sequence will be included in a final messenger RNA molecule in one of four tissue types: the central nervous system, muscle, the digestive system and embryos. The model works well, says Burge, and is an important technological advance. But he hopes that it will be refined to mimic more closely the mechanism that the cellular splicing machinery uses to make its choices.
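
As a rough illustration of what predicting inclusion from sequence features can mean, the toy Python sketch below scores one segment with a logistic function of a few features, using one set of weights per tissue. Everything in it is an assumption made for illustration: the feature names, the weights and the `inclusion_probability` helper are hypothetical, and the published algorithm combines far more features in its own way.

```python
import math

# Hypothetical weights, one set per tissue; NOT taken from the paper.
TOY_WEIGHTS = {
    "central nervous system": {"bias": -0.5, "motif_count": 0.8, "exon_length_kb": -1.0},
    "muscle": {"bias": 0.2, "motif_count": -0.3, "exon_length_kb": 0.5},
}

def inclusion_probability(features, weights):
    """Logistic score: a weighted sum of features squashed into (0, 1)."""
    score = weights["bias"] + sum(
        weights[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical feature values for one segment of a gene.
segment = {"motif_count": 2.0, "exon_length_kb": 0.1}

for tissue, weights in TOY_WEIGHTS.items():
    print(tissue, round(inclusion_probability(segment, weights), 3))
```

The point of the sketch is only that the same segment can receive different inclusion scores in different tissues once the weights differ.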

Wiggle and jiggle

The sequence of letters in an RNA molecule is not the only determinant of how the molecule will function. Its three-dimensional structure can also affect how it interacts with other molecules, including drugs that are designed to target it. "RNA forms highly flexible structures that wiggle and jiggle just due to thermal motion," says Hashim Al-Hashimi, a biophysicist at the University of Michigan in Ann Arbor. "It is very difficult to define them as a static structure." Structures of the same molecule determined using various techniques sometimes look wildly different, Al-Hashimi adds, because RNA is sensitive to even small variations in its environment.

As a result, researchers including Al-Hashimi are eager to develop methods that will predict the three-dimensional structure of RNA on the basis of its sequence. At present, experimental techniques that reveal how an RNA molecule folds back on itself, known as its secondary structure, are fairly advanced. For example, in 2009, Kevin Weeks, a chemist at the University of North Carolina at Chapel Hill, and his colleagues reported the full secondary structure of the HIV-1 genome, a strand of RNA about 9,000 letters long (J. M. Watts et al. Nature 460, 711–716; 2009). Al-Hashimi has developed a method that combines such two-dimensional structures with knowledge of the constraints on RNA flexibility to predict aspects of the three-dimensional structure (M. H. Bailor et al. Science 327, 202–206; 2010).
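
To make the idea of secondary structure concrete, here is a toy Python sketch of the classic Nussinov base-pair maximization algorithm, which finds a folding with the largest number of nested base pairs and writes it in dot-bracket notation. It is neither the experimental approach used by Weeks's team nor Al-Hashimi's method. The allowed pairs, the minimum loop length and the function name `nussinov_fold` are simplifying assumptions, and practical predictors rely on measured energies rather than a simple pair count.

```python
# Allowed base pairs (Watson-Crick plus G-U wobble); a simplifying assumption.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # assumed minimum number of unpaired bases inside a hairpin

def nussinov_fold(seq):
    """Return a dot-bracket string with a maximum number of nested pairs."""
    n = len(seq)
    # best[i][j] = maximum number of pairs within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: base j stays unpaired
            for k in range(i, j - MIN_LOOP):  # case: base k pairs with j
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Trace back through the table to recover one optimal structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP:
            continue  # too short to contain a pair
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - MIN_LOOP):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if left + 1 + best[k + 1][j - 1] == best[i][j]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return "".join(structure)

print(nussinov_fold("GGGAAACCC"))  # (((...)))
```

Each matched pair of brackets marks two letters that pair with each other, and dots mark unpaired letters, so the example folds into a single hairpin with a three-letter loop.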

But automated programs for predicting three-dimensional structures are still quite limited in scope and need refining, says Tamar Schlick, a computational chemist at New York University.

Much of the enthusiasm for understanding RNA is motivated by the discovery of small RNAs that do not code for protein, yet can regulate gene expression. The hunt is on to catalogue these RNAs and their targets — a quest aided by advances in algorithm design and the accumulation of genome sequences. This allows researchers to search the vast stretches of noncoding DNA between genes: the conservation of sections in many species could suggest that they have important functions.
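
The conservation idea can be sketched in a few lines of Python. The example below scores each column of a small alignment by the fraction of sequences that share its most common letter. The aligned fragments are made up, the "species" are hypothetical and `column_conservation` is an illustrative helper rather than any published tool; real searches use statistical models of how sequences evolve.

```python
def column_conservation(aligned_seqs):
    """Fraction of sequences matching the most common letter in each column."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for column in zip(*aligned_seqs):
        most_common = max(set(column), key=column.count)
        scores.append(column.count(most_common) / len(column))
    return scores

# Made-up aligned fragments from three hypothetical species.
toy_alignment = ["ACGTTGCA", "ACGTAGCA", "ACGTCGCT"]

scores = column_conservation(toy_alignment)
fully_conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(fully_conserved)  # [0, 1, 2, 3, 5, 6]
```

Runs of fully conserved positions are the kind of signal that might flag a stretch of noncoding DNA for a closer look, although conservation alone does not show what the sequence does.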

But enthusiasm for finding functional noncoding RNAs may be getting out of hand, cautions Sean Eddy, a computational biologist at the Howard Hughes Medical Institute's Janelia Farm research campus in Ashburn, Virginia. Teams have reported thousands of such RNAs, but few researchers have followed up to confirm exactly what these RNAs do, or whether the molecules are simply aborted mistakes made by the machinery that converts DNA to RNA.

For now, Burge says he is enjoying the ongoing renaissance in RNA informatics. "These new technologies have given me hope."
