Nature | Toolbox

How to catch a cloud

Why cloud computing is attracting scientists — and advice from experienced researchers on how to get started.

Illustration: The Project Twins

In February, computer scientist Mark Howison was preparing to analyse RNA extracted from two dozen siphonophores — marine animals closely related to jellyfish and coral. But the local high-performance computer at Brown University in Providence, Rhode Island, was not back up to full reliability after maintenance. So Howison fired up Amazon's Elastic Compute Cloud and bid on a few 'spot instances' — vacant computing capacity that Amazon offers to bidders at a discounted price. After about two hours of fiddling, he had configured a virtual machine to run his software, and had uploaded the siphonophore sequences. Fourteen hours and US$61 later, the analysis was done.
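
For readers who want a concrete picture of what such a set-up involves, the sketch below shows how a spot-instance request can be made programmatically with Amazon's boto3 Python library. It is only an illustration: the machine image, instance type, key name and bid price are hypothetical placeholders, not the configuration Howison used.

# Minimal sketch: bidding for an Amazon EC2 'spot instance' with boto3.
# All identifiers below (AMI, instance type, key name, bid price) are
# hypothetical placeholders, not the values used in the article.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.request_spot_instances(
    SpotPrice="0.10",  # maximum price in US$ per hour you are willing to pay
    InstanceCount=1,
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",  # machine image with your software pre-installed
        "InstanceType": "c4.2xlarge",
        "KeyName": "my-ssh-key",  # key pair used to log in once the instance starts
    },
)
request_id = response["SpotInstanceRequests"][0]["SpotInstanceRequestId"]
print("Spot request submitted:", request_id)

Amazon keeps the instance running for as long as the going spot price stays below the bid; the same request can also be made by clicking through the EC2 web console.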

Researchers such as Howison are increasingly renting computing resources over the Internet from commercial providers such as Amazon, Google and Microsoft — and not just for emergency backup. As noted in a 2013 report sponsored by the US National Science Foundation (NSF) in Arlington, Virginia, the cloud provides labs with access to computing capabilities that they might not otherwise have (see go.nature.com/mxh4xy). Scientists who need bursts of computing power — such as seismologists combing through data from sensors after an earthquake or astronomers processing observations from space telescopes — can rent extra capacity as needed, instead of paying for permanent hardware.

Scientists can configure their cloud environment to suit their requirements. Although cloud computing cannot handle analyses that require a state-of-the-art supercomputer or quick communication between machines, it may be just right for projects that are too big to tackle on a desktop, but too small to merit a high-performance supercomputer. And working online makes it easy for teams to collaborate by sharing virtual snapshots of their data, software and computing configuration.

But shifting science into the cloud is not a trivial task. “You need a technical background. It's not really designed for an end user like a scientist,” says Howison. Although the activation energy might be high, there are recommended routes for scientists who want to try setting up a cloud environment for their own research group or lab.

A DIY guide to cloud computing

Most cloud platforms require users to have some basic computing skills, such as an understanding of how to work in the command line, and a familiarity with operating systems and file structures. Once researchers have a strong foundation, the next step is to try working in a cloud.

The most user-friendly cloud for scientists, says plant biologist Andreas Madlung, could be the platform Atmosphere, which was created as part of a collaborative cyberinfrastructure project called iPlant. Funded by the NSF and led by three US universities and the Cold Spring Harbor Laboratory on Long Island, New York, iPlant has been helping scientists to share software and run free analyses in the cloud since 2008.

Designed with scientists in mind, the platform's interface comes with pre-loaded software, a suite of practice data sets and discussion forums for users to help each other to tackle problems. Madlung, at the University of Puget Sound in Tacoma, Washington, teaches an undergraduate bioinformatics course that includes a section on cloud computing. He first introduces his students to the Unix operating system, then has them use that knowledge to analyse RNA sequence data on Atmosphere.

Those who sign up with iPlant are automatically given what equates to around 168 hours of processing time a month, and can request more if needed. Users can load up virtual machines with any extra software that they need, and if a job is too much for standard equipment to handle, tasks can be offloaded to a supercomputer at the Texas Advanced Computing Center in Austin, where iPlant has a guaranteed allocation.

Biologist Mike Covington of the University of California, Davis, shifted his lab's computing work to iPlant after his lab's servers kept crashing because they were overloaded. He has also made copies ('images') of his own virtual machine, so that his collaborators — and any iPlant user — can log in and access the same software, data and computing configuration. “If I spend several hours setting up my virtual machine perfectly for de novo genome assembly [reconstructing full-length sequences from short fragments of DNA], I can quickly and easily make it available to any other scientist in the world who wants to do de novo assembly with their own data,” Covington says.
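
Covington's workflow runs on iPlant, which has its own interface, but the underlying idea of freezing a configured machine as a shareable image can be scripted on commercial clouds as well. A rough sketch using Amazon's boto3 library, with a hypothetical instance ID and image name:

# Illustrative sketch (Amazon EC2 rather than iPlant): save a configured
# virtual machine as an image and make it publicly launchable.
# The instance ID and image name are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create an image of the running, fully configured instance.
image = ec2.create_image(
    InstanceId="i-0abc123def4567890",
    Name="denovo-assembly-environment-v1",
    Description="Virtual machine configured for de novo genome assembly",
)

# Allow any other user to launch instances from the image.
ec2.modify_image_attribute(
    ImageId=image["ImageId"],
    LaunchPermission={"Add": [{"Group": "all"}]},
)
print("Shared image:", image["ImageId"])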

Such virtual snapshots may become standard for projects that require computational work. Anyone who wants to reproduce, for example, the microbial-genome analysis described in one paper can access a snapshot of the authors' virtual machine on the Amazon cloud, simply by paying for Amazon computing time (B. Ragan-Kelley et al. ISME J. 7, 461–464; 2013).
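
Reproducing such an analysis then amounts to launching a fresh machine from the published image identifier and paying for the compute time. A minimal boto3 sketch, assuming a hypothetical public image ID:

# Minimal sketch: launch a compute instance from a publicly shared image,
# such as one published alongside a paper. The image ID and instance type
# are hypothetical placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-0fedcba9876543210",  # the authors' published machine image (placeholder ID)
    InstanceType="m4.xlarge",
    MinCount=1,
    MaxCount=1,
)
print("Launched instance:", instances[0].id)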

Pick a cloud

For some researchers, choosing a cloud is straightforward. Scientists at CERN, Europe's particle-physics laboratory near Geneva, Switzerland, have had access to a massive internal cloud running on the software platform OpenStack since 2013. A handful of institutions, such as Cornell University in New York and the University of Notre Dame in Indiana, have developed computing clouds, too. Some, including Notre Dame, outsource their clouds to companies such as Rackspace, a multinational firm in San Antonio, Texas, that sets up and manages cloud services for users. But for scientists who are not at an institution with a fully functional campus cloud, bushwhacking through the jungle of cloud options can be a frustrating adventure (see ‘A guide for the perplexed’). Cloud system set-up can vary, and proficiency with one provider does not guarantee an easy transition to others.

A guide for the perplexed

Clouds for researchers

The largest commercial providers include Amazon's Elastic Compute Cloud (which offers discounted 'spot instances'), Microsoft's Azure (which offers training courses) and Google's Cloud Platform (which has a specific programme for genomics). Other services are Terminal.com, aimed specifically at research; the (free) Atmosphere cloud platform, from the US National Science Foundation-backed iPlant collaboration; SageMathCloud; Cornell University's RedCloud; Digital Ocean — known for quick deployment of cloud apps; and Rackspace — a company that sets up clouds using OpenStack, an open-source cloud-software platform that the firm developed jointly with NASA.

Useful resources for cloud explorers

StarCluster is a tool developed at the Massachusetts Institute of Technology in Cambridge that helps to build a virtual research-computing cluster on Amazon's platform. Docker is an open-source platform that allows researchers to share a snapshot of the code, computing environment and data used to generate analyses. Project Jupyter provides shareable notebooks that make data, code and analysis easily accessible — and interactive (H. Shen, Nature 515, 151–152; 2014). Nimbus, partly developed by the Argonne National Laboratory in Illinois, helps to turn a normal computing cluster into a cloud system accessible by remote users.
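
As a taste of what the Docker route looks like in practice, the sketch below uses Docker's Python SDK to pull and run a containerized analysis environment; the image name, command and paths are placeholders rather than any specific published environment.

# Illustrative sketch: run an analysis inside a shared Docker container
# using Docker's Python SDK. The image name, command and paths are
# placeholders, not a specific published environment.
import docker

client = docker.from_env()

# Pull the shared environment (code and dependencies frozen together).
client.images.pull("example/rnaseq-environment:1.0")

# Run the analysis inside the container, mounting a local data directory.
logs = client.containers.run(
    "example/rnaseq-environment:1.0",
    command="run_analysis --input /data/reads.fastq",
    volumes={"/home/user/data": {"bind": "/data", "mode": "rw"}},
    remove=True,
)
print(logs.decode())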

Other computing resources

Practical Computing for Biologists, by Casey Dunn and Steven Haddock (Sinauer Associates; 2011).

The Software Carpentry computing workshops (see go.nature.com/jg86jj).

The University of Washington eScience Institute's advice on “Which compute platform should I use?”.

Casey Dunn, an evolutionary biologist who works with Howison at Brown University, prefers to train students on commercial platforms. “When they go on to a postdoc somewhere else or start their own lab, they'll still be able to log into Amazon,” he says.

Somalee Datta, the director of bioinformatics at Stanford University's Center for Genomics and Personalized Medicine in California, is using Google's cloud platform to handle the centre's enormous volume of genomics data and its computing demands, rather than relying only on the servers available at Stanford. She chose Google, she says, for several reasons: the company's developers were actively making tools available for genomics researchers, Google had demonstrated interest in health-care research — and the price was right.

Cloud concerns

For Datta and others, one key issue surrounding cloud computing is security. “It's a big concern,” she says. “Hackers understand where the value is, and they will turn their attention towards that.” Still, Datta thinks that clouds are no more or less secure than any other computer network. A university cloud system, for example, is only as solid as the university's firewall. “If I were working on my own or at a small college or company, I would probably feel more secure with Google's cloud,” Datta says (although Stanford has its own army of engineers watching security). The truth is, anyone working with extremely sensitive data might be better off keeping it away from the Internet altogether.

Another key issue for researchers who are venturing into cloud computing is the level of tech support needed. Getting software to run on a new system can take days, and determining how much computing power or memory a virtual machine needs can be an exercise in trial and error. All cloud providers offer training and tutorials, but dedicated support staff are more commonly found at universities with campus clouds.

Despite the challenges, cloud computing is increasingly appealing to scientists, says Darrin Hanson, vice-president of Rackspace Private Cloud. “The last few years have been mostly people who are absolutely out on the bleeding edge,” he says. “But now we're starting to see an influx of adopters.”

That isn't too surprising, Dunn says — the cloud is not as foreign as it can sometimes sound. “Nearly all consumer computer products now have a cloud component, be it mobile apps, content-streaming services like Netflix or desktop tools like Dropbox,” he says. “Research computing is not on the vanguard of some crazy and risky unknown frontier — we are just undergoing the same transitions that are already well under way in industry and the consumer marketplace.”

Journal name: Nature
Volume: 522
Pages: 115–116
DOI: 10.1038/522115a

Corrections

This article originally gave the wrong location for the Texas Advanced Computing Center. It is in Austin not San Antonio. The text has been updated to reflect this.

Author information

Affiliations

  1. Nadia Drake is a freelance science writer in San Francisco, California.


Comments

  1. Carlos Polanco

    To the editor: Considerations about cloud computing and security. Drake's article (1) [How to catch a cloud, Nature] describes several computing capabilities that cloud computing (CC) provides to the scientific community, in particular to those who require bioinformatics software. Cloud computing is a computational resource that dates back six decades, when it was known as "time sharing" (2). Today its competitive advantage is promoted by access through Internet-enabled devices and by the growing number of add-on application programs (3). However, data integrity is a real concern in terms of availability, accessibility and security, among other issues (4), especially for users who do not have local high-performance computing (HPC) resources, rely heavily on CC and thus use it constantly. Access to commercial CC platforms requires taking precautions to preserve proprietary data once it is released to the CC through the various security channels, and monitoring and keeping track of the computations being carried out. The scientific community should bear in mind that CC "discourages" the authorship of the software and, more importantly, that the user does not know the characteristics of the software that CC implementations offer. One way of overcoming this would be to add pseudocode (5) to the user manuals of the add-ons, and/or to promote the deployment of custom HPC implementations on platforms such as Amazon AWS.

    Sincerely, Carlos Polanco, Ph.D. (*,a), Alicia Morales Reyes, Ph.D. (b), and Miguel Arias-Estrada, Ph.D. (b). (a) Universidad Nacional Autónoma de México, México City, México. (b) Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, México.

    Carlos Polanco is an associate professor at the Department of Mathematics in the Universidad Nacional Autónoma de México, México City, México (polanco@unam.mx).

    Alicia Morales Reyes is a senior researcher at the Department of Computer Sciences in the Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, México.

    Miguel Arias Estrada is a senior researcher at the Department of Computer Sciences in the Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, México.

    References
    1. Drake, N. Nature 522, 115–116 (2015).
    2. Duane, C. Available from: http://www.constructioncloudcomputing.com/2010/08/14/cloudcomputing-history/ [last accessed 16 June 2015].
    3. Issa, S. A. Biomed Res. Int. 791051 (2013).
    4. Stallman, R. Available from: http://www.theguardian.com/technology/2008/sep/29/cloud.computing.richard.stallman [last accessed 16 June 2015].
    5. Polanco, C. (letter) Nature 521, 261 (2015).
  2. Yannick Pouliot
    Nice article, though I feel it necessary to emphasize that the activation energy is not necessarily high. This is a problem only if you have limited computing skills. Otherwise, anyone with decent UNIX skills can get up to speed very quickly on Amazon EC2 and Microsoft Azure, for example. The real problem here is scientists attempting to use computational tools without having cultivated the necessary skills, and indeed, frequently trying their best not to have to. As an example, consider that once you have reserved a machine and established a basic configuration on your favorite cloud provider, projects such as CloudBioLinux (http://cloudbiolinux.org/) and Bio-Linux (http://environmentalomics.org/bio-linux/) can provide incredible pre-configured machines ("software images") that deliver a vast array of bioinformatics software and other tools ready to go, ranging from a comprehensive R environment to MySQL and PostgreSQL database engines. The most demanding task there is probably running the update command, a trivial matter in UNIX. Setting up such machines from scratch, whether in the cloud or not, would take days at best. These projects are also notable for delivering excellent documentation that even a novice will find enabling. Having deployed such a software image, you can have a powerful computational engine operational in a matter of minutes, with the ability to add disk space equally quickly. Ask yourself how long it would take simply to add disk space in your work environment, much less to deploy a fully configured server. At the next level, there are impressive pipelining engines such as CloudMan (https://wiki.galaxyproject.org/CloudMan) that deploy the Galaxy system in a multi-machine cluster set-up in less than 15 minutes, all from a graphical interface. Really, this requires little more than the ordinary skills needed to run a UNIX machine, as long as you don't mind reading a bit of documentation (which is particularly good at Amazon) and are reasonably good at leveraging the experience of large numbers of users.