EMBO reports 5, 3, 236–238 (2004)
doi:10.1038/sj.embor.7400108
Biologists think bigger
Developments in academia and industry may encourage biologists to
use large-scale computation
Caroline Hadley
Biology is getting bigger. The science that once concerned itself with
studying the intricate details of individual organisms and molecules is now
taking a step back to get a larger picture of what life is: mechanisms,
pathways and systems. This shift is a natural progression, but it is also a
result of the huge amount of data that has been generated over the past decade
from large-scale sequencing efforts such as the Human Genome Project. Making
sense of this information requires entirely new ways of thinking, and an
equivalent revolution in methodology. First and foremost, it calls for massive
computing power to extract information from the raw data. This shift from
reductionism to complexity in biology is now revealing problems of a scale that
can only be solved using the computational power usually associated with the
harder sciences, such as physics, astronomy and mathematics—disciplines
that have many problems that require massive calculation to solve. What form
this computational power takes depends on the specific problem at hand, but
also on the biologists themselves. Industry may be useful, but in the past few
years, academia has shown that it can often find its own solutions.
Table 1: Further information about large-scale computing
In 1999, computer scientists at the University of California at Berkeley
(CA, USA) launched a project to search for extraterrestrial intelligence
(SETI) by analysing radio signals recorded on the Arecibo Observatory radio
telescope in Puerto Rico. Realizing that available computing resources were not
sufficient to process all the data received, they came up with an ingenious
solution: distribute the task to members of the public, all over the world, who
volunteer time on their own PCs. Users download software disguised as a
screensaver, and are then sent a chunk of data via the Internet that is
processed only when their PC is idle. The results are sent back to Berkeley,
and a new set of data is received. This 'distributed computing' model connects
thousands of PCs to create a virtual computer with more computational power
than the most advanced supercomputer. With more than 500,000 active users,
"in terms of total computing [power], we're the largest [distributed
computing] project in history," claimed David Anderson, head of
SETI@home.
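The cycle described above (split the data, send a chunk to a volunteer's PC, collect the result, send the next chunk) can be sketched in a few lines. This is only an illustrative toy, not SETI@home's actual software: the function names, the round-robin hand-out and the "strongest sample" analysis are all assumptions made for the example.

```python
from collections import deque


def split_into_work_units(samples, unit_size):
    """Server side: cut the recorded data into independent chunks."""
    return deque(
        samples[i:i + unit_size] for i in range(0, len(samples), unit_size)
    )


def process_work_unit(unit):
    """Volunteer side: hypothetical analysis that reports the strongest sample."""
    return max(unit)


def run_project(samples, unit_size, volunteers):
    """Hand chunks to volunteers in turn until no data is left."""
    pending = split_into_work_units(samples, unit_size)
    results = []
    while pending:
        for name in volunteers:
            if not pending:
                break
            unit = pending.popleft()  # the server sends one chunk
            # the volunteer's PC processes it and returns the result
            results.append((name, process_work_unit(unit)))
    return results


results = run_project([3, 1, 4, 1, 5, 9, 2, 6], unit_size=3,
                      volunteers=["pc-a", "pc-b"])
```

Each chunk is processed without reference to any other, which is what allows the work to be spread over an arbitrary number of machines.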
It did not take long for biologists to copy this approach. In 2000,
Vijay Pande and his colleagues at Stanford University (CA, USA) started the
Folding@home project to elucidate the mechanisms behind protein folding. Like
SETI@home, Pande's project uses distributed computing to harness the power of
PCs throughout the world. The project currently has more than 100,000 regular
users and over 750,000 total contributors. Unlike Anderson's project,
Folding@home has already produced results: its volunteers have successfully
simulated the folding of small, specially designed polypeptides (Nature
420: 102–106). "100,000 [volunteers' PCs] allows us to do
work that really couldn't be done any other way," Pande said.
But distributed computing is not necessarily suited to all biological
problems. Pande likened the approach to speeding up a task by a factor of
1,000. "If someone gave you 1000 assistants, it's unclear whether that
would really allow you to achieve that goal." Organizing and managing all
those people would take up most of the time. Similarly, distributed computing
is not simply a matter of dividing a complex problem into many smaller problems
that can be calculated independently—the way in which the problem is
approached is as important as the problem itself.
Nevertheless, other distributed computing projects have successfully
tackled various biological problems. Arthur J. Olson's group at the Scripps
Research Institute (La Jolla, CA, USA) uses the FightAIDS@home project to
screen candidate drug compounds against detailed models of evolving AIDS
viruses. Graham Richards, Chairman of the Chemistry Department at Oxford
University, UK, has elicited support from more than 2 million volunteers for
his cancer screening project. The main project aims to find new drugs for
cancer therapy by screening a database of millions of small molecules against a
selection of specific proteins thought to be involved in the development of the
disease. Other smaller projects have already found potential drugs against
smallpox and anthrax using the same approach.
By focusing on cancer and AIDS, these distributed computing projects
also increase the chances of drawing public support. Their common denominator
is an overwhelming public interest in being involved in and contributing to
scientific research. In fact, few of the researchers make a concerted effort to
encourage people to contribute; word of mouth is often enough. "In terms
of public understanding of science, and this was not one of the original aims
of the project, it has been remarkably successful," said Richards.
"To get the general public involved is really very valuable." Pande
explained, "there's also aspects of the way distributed computing works
that is designed to try to encourage people to gather friends to get
involved." Many projects keep track of who has volunteered the most
computing time, or who has cracked a particular problem. "That aspect
actually has a sort of a competition aspect to it that is in many ways a large
driving force as well," Pande said. Referring to SETI, Anderson commented
that "many people are motivated by the possibility of discovering life
outside Earth, others are motivated by the competition aspect, and others like
our high-tech screensaver graphics."
Although Pande and his group have obtained valuable results, distributed
computing alone may not be sufficient to 'solve' protein folding. Computer
giant IBM (Armonk, NY, USA) is ready to tackle the problem with brute force.
The Blue Gene project, announced at the end of 1999, will build a petaflop
supercomputer—capable of 1,000 trillion floating-point operations per
second—by the end of 2004. Blue Gene will be 500 times faster than the
current fastest supercomputer, and more than 2 million times faster than a
standard desktop PC. The project is as much about computer engineering as it is
about biology: IBM has revolutionized the construction of supercomputers in
order to meet this challenge. Bob Germain, manager of the Biomolecular Dynamics
and Scalable Modeling Group of the IBM Blue Gene team (Yorktown Heights, NY,
USA) explained: "We think we can advance our understanding of the protein
folding process through large-scale simulation ... and also study other
interesting biologically related systems."
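The speed comparisons quoted for Blue Gene imply rough figures for the machines it is measured against. The short calculation below simply divides the petaflop target by the two ratios given in the text; the resulting baselines are what those ratios imply, not measured benchmarks, and the variable names are chosen only for this example.

```python
# 1,000 trillion floating-point operations per second
PETAFLOP = 1_000 * 10**12


def implied_baseline(target_flops, times_faster):
    """Speed of the slower machine implied by a 'times faster' ratio."""
    return target_flops // times_faster


# "500 times faster than the current fastest supercomputer"
implied_fastest_supercomputer = implied_baseline(PETAFLOP, 500)

# "more than 2 million times faster than a standard desktop PC",
# so this value is an upper bound on the implied desktop speed
implied_desktop_pc = implied_baseline(PETAFLOP, 2_000_000)

print(implied_fastest_supercomputer / 10**12, "teraflops")
print(implied_desktop_pc / 10**6, "megaflops")
```

On these ratios, the reference supercomputer runs at about 2 teraflops and the desktop PC at no more than about 500 megaflops.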
IBM's foray into biology is indicative of the potential of this future
market. "Biologists are not traditionally strong users [of
supercomputers]," said Barry Utting, former General Manager and Vice
President of Cray Europe and present director of BDUX Limited. This may be for
economic reasons, as few academic researchers have the financial resources to
buy the latest high-performance supercomputer. On the other hand, biologists
may prefer to stick with what they know and leave supercomputing to the harder
sciences. IBM's investment in Blue Gene may be an attempt to counter that
attitude.
There are obvious differences between distributed computing and
supercomputing that make the two approaches suitable for specific needs.
"Each paradigm is good for some problems and not others," Anderson
explained. "Public computing is useful only for problems that have public
appeal (so you can get users), have a high computing/data ratio (so your
Internet bill is reasonable) and don't involve secret data (since it will be
visible to the world). There are quite a few problems that fall into this
category. Because of the huge numbers of PCs, public computing will likely
continue to outperform the other paradigms by some measures (such as total
number of floating-point operations)." Supercomputers, however, are ideal
"if your problem is big enough and communication-dependent enough,"
explained Utting. The difference lies in how each approach deals with the
problem: distributed computing is ideal for large problems that can be split
into many similar smaller problems that can be solved independently (so-called
'embarrassingly parallel' problems). Analysing the many different folding
trajectories of one protein, or screening a database of molecules against one
protein, are ideal applications for distributed computing. Supercomputers are
more suited to non-linear complex problems that are not easily subdivided, but
require high-speed interconnection. Understanding an entire system in which
many factors influence many others is one example.
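A minimal sketch of an 'embarrassingly parallel' job, assuming a toy scoring function in place of a real docking calculation: every molecule is scored against the same target independently, so the tasks can be handed to any number of workers in any order. The names `score_against_target` and `screen_library` are hypothetical, and local threads stand in here for volunteers' PCs.

```python
from concurrent.futures import ThreadPoolExecutor


def score_against_target(molecule, target):
    """Hypothetical stand-in for a docking score: count of shared characters."""
    return len(set(molecule) & set(target))


def screen_library(molecules, target, workers=4):
    """Score every molecule independently, then rank the best first."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # No task needs the result of any other task.
        scores = list(
            pool.map(lambda m: score_against_target(m, target), molecules)
        )
    return sorted(zip(molecules, scores), key=lambda pair: pair[1], reverse=True)


ranking = screen_library(["abc", "xyz", "abz"], target="abq")
```

A simulation in which every component must exchange information with the others at each step cannot be split this way, which is where fast interconnection between processors becomes decisive.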
Blue Gene is therefore unlikely to tread on the toes of Folding@home.
"The big difference between Folding@home and Blue Gene is that while
Folding@home probably does have more raw power, the communication between the
processors is a lot slower, whereas Blue Gene has state of the art
communication," Pande explained. Continuing his analogy with the 1000
assistants, "it's kind of like having 1000 assistants where they can talk
to each other extremely quickly such that they can help organize themselves
better." Indeed, the two groups have collaborated to their mutual
benefit, and both agree that their approaches are complementary. Many of the
algorithms used with Folding@home, in particular the models representing
various physical and chemical laws, could have an impact on Blue Gene's design
and software. Although the supercomputer will have more power than any other
computer on earth, its success in determining how proteins fold still depends
in part on the accuracy of these models. Germain confidently expects to go
where no supercomputer has gone before: "Because we will have access to
much larger computing resources than anyone has ever had, especially for this
problem, we can do the larger size systems and study systems at longer time
scales which gives us a better chance of connecting in a significant way with
physical experiments."
With so much computing power available, it is not clear why biologists
have yet to embrace this technology. Besides the obviously prohibitive cost of
buying a supercomputer, what is stopping biologists from tackling larger
problems with larger computers? "Certainly there is a discrepancy between
the technology that might be available to solve these problems, and what is
[used]," Utting observed. Distributed computing is similarly
under-exploited but, according to Anderson, "there are at least 100
million Internet-connected PCs, and less than 1% of them are participating in
distributed computing projects. There's no shortage of computers." It may
be that biologists have not yet learned to think 'big' enough to use
large-scale computation. "I think part of it is not the scale of the
thoughts but trying to figure out a way to use computers to actually be
useful," Pande said. So far, outside the realm of bioinformatics,
computers have not done much for biologists, he believes. "Traditionally
people think about large-scale simulation and calculation as the domain of
physics but I think what's going to be happening now is that it's going to
start to become very commonplace in biology," Pande predicted.
According to Utting, "I think the harder sciences naturally go to
simulation because there's a certain level of numeracy required to do it, and
biology in the past hasn't been a quantitative science and hasn't attracted
quantitative people." Richards believes that often technology starts off
with the physicists, subsequently moves into the field of chemistry, and will
eventually be picked up by biologists: "On the whole, the biological
community [are] not interested in the technology, they're interested in the
answers." To promote distributed computing, Pande is thus planning to
release the software behind Folding@home. "It takes a lot of work to
create all the software yourself, and I think that barrier has kept people
away," he said.
Ultimately, it may be that computing supply has so far outpaced demand.
With all this technology and only a few really appropriate problems to tackle,
there is the danger that biologists may fall into the trap of using more power
and less thought when designing experiments. As many in the bioinformatics field
would agree, it is easy to produce a swathe of data without investing much
brain power. Both Pande and Germain avoid this in their own groups by
collaborating closely with experimentalists. "The end arbiter to anything
is really its significance in terms of experiments and biomedical
relevance," Pande said. "It always makes me very proud that
experimentalists are excited about what we're doing...if they weren't excited
about it, I think it would mean that there is something that's
missing."
The use of computation in biology is still at a very early stage, but
the success of academic and industry projects may encourage more scientists to
use large-scale computing. The projects themselves will also evolve. According
to Pande, the next step for distributed computing is to start looking at larger
proteins and also proteins that are more biologically relevant. "If we
really believe that this technology is useful, we should try to attack problems
that are important," he said. Biologists would be able to model larger
systems at much longer time scales, and concentrate on understanding complex
systems that current methods cannot handle. Germain therefore
expects that scientists will eventually become more ambitious in the kinds of
calculations that they want to do—regardless of whether these
calculations are carried out on one expensive supercomputer, or millions of
personal computers scattered around the globe.