Gathering clouds and a sequencing storm

Journal name: Nature Biotechnology
Volume: 28
Page: 1
Year published: 2010
DOI: 10.1038/nbt0110-1

Why cloud computing could broaden community access to next-generation sequencing.

Those with any doubts as to whether we have entered the decade of the sequencer need only pay a visit to the Broad Institute in Cambridge, Massachusetts. There in the lobby, a wall of flat-screen TVs displays an endless stream of A's, T's, C's and G's, with a mind-boggling multiple-digit readout counting up the number of DNA base pairs sequenced. The Broad is one of a hundred or so research centers around the world currently generating thousands of gigabases of DNA sequence every week. But whereas researchers at these centers have a wide array of core computing resources and expertise at their disposal for analyzing the reams of data they generate, smaller laboratories that intend to purchase next-generation sequencers are not so fortunate. For the latter, more funding and effort should be devoted not only to the development of on- and off-site data management solutions but also to disseminating software from core facilities to the broader community.

In the coming year, it is not unreasonable to expect that the amount of sequence data generated around the world will outstrip that generated in the past decade. The Cancer Genome Atlas (TCGA; http://cancergenome.nih.gov/) of the National Institutes of Health is already ramping up its effort to sequence hundreds of genomes in 20 different types of cancer. And almost everywhere one looks, other sequencing projects are springing up or gathering pace; examples include the 1000 Genomes Project (a high-resolution map of human genomic variation from 1,000 individuals; http://www.1000genomes.org/), the Personal Genome Project (the exomes of tens of healthy volunteers; http://www.personalgenomes.org/), the 1001 Genomes Project (sequence variation in 1,001 strains of Arabidopsis thaliana; http://1001genomes.org/) and the Mouse Genomes Project (the genomes of 17 mouse strains; http://www.sanger.ac.uk/modelorgs/mousegenomes).

Next-generation sequencing platforms are playing an increasingly prominent role in resequencing efforts. But their role in de novo sequencing and assembly is also broadening, from assembling the genomes of simple microbes to filling gaps and providing finer sequence resolution and coverage in higher organisms; the characterization of the human pan-genome on p. 57 is one example.

These efforts, together with a burgeoning number of additional applications for next-generation platforms in small RNA discovery, transcriptomics, chromatin immunoprecipitation and copy number variation studies, mean that deep sequencing instruments are likely to become indispensable and ubiquitous tools in the biology laboratory. But for this technology to be truly widely adopted, challenges related to data handling and analysis remain to be addressed.

Next-generation sequencers produce a prodigious stream of data. A single Illumina instrument, for example, can generate up to 90 billion bases per run. This represents terabytes of raw image data that require at a minimum 4 GB of RAM and 750 GB of local storage capacity to carry out the data handling and analysis.
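To put the figure of 90 billion bases per run in perspective, the following is a back-of-envelope sketch of how much storage the called bases alone might occupy. The three encodings (2 bits per base, one text character per base, and one base plus one quality character) are illustrative assumptions, not instrument specifications, and the raw image data mentioned above would be far larger.

```python
# Back-of-envelope storage estimate for the called bases of one run.
BASES_PER_RUN = 90_000_000_000  # "up to 90 billion bases per run" (from the text)


def run_size_gb(bases, bytes_per_base):
    """Return storage in decimal gigabytes (1 GB = 10**9 bytes)."""
    return bases * bytes_per_base / 1e9


# Assumed encodings (illustrative only, not vendor specifications):
packed = run_size_gb(BASES_PER_RUN, 0.25)     # 2 bits per base
plain_text = run_size_gb(BASES_PER_RUN, 1)    # one ASCII character per base
with_quality = run_size_gb(BASES_PER_RUN, 2)  # base plus one quality character

print(f"2-bit packed:      {packed:.1f} GB")
print(f"plain text:        {plain_text:.1f} GB")
print(f"text plus quality: {with_quality:.1f} GB")
```

Even under the most compact of these assumptions, a single run yields tens of gigabytes before any intermediate analysis files are written.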

Whereas genome centers are set up to deal with such gargantuan files, most academic laboratories are in a completely different situation. They have no large central computing pool or data storage capacity. They are more likely to generate data in an ad hoc manner, rather than in a steady stream amenable to an automated data management pipeline. And they often lack sequencing specialists and support staff working under the same roof who can create software tailored to their needs and solve computational problems.

Some algorithms, such as those for mapping short DNA reads to a reference genome, have progressed to a high level of sophistication and are widely accessed. This is because these programs—written by cross-disciplinary individuals—have now been optimized by computer scientists to enhance user friendliness and remove bugs. In other areas, such as short RNA read mapping or analysis of genome structural variation, progress has been slower, in part because the problem is more complex and in part because the data have not been available.
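As a rough illustration of what mapping short reads to a reference involves, here is a minimal seed-and-verify sketch. It is a toy under stated assumptions, not the method of any published mapper: the function names, the seed length `k` and the `max_mismatches` threshold are all hypothetical, and production tools use far more compact indexes and also handle gaps and base qualities.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return reference positions where the read aligns with few mismatches.

    The first k bases of the read act as an exact-match seed; each seed hit
    is then verified by comparing the full read against the reference.
    """
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(pos)
    return hits


# Made-up reference and reads, for illustration only.
reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_index(reference, k=4)
print(map_read("TAGCCGAT", reference, index, k=4))  # read identical to the reference
print(map_read("TAGCCTAT", reference, index, k=4))  # read with one substituted base
```

Each read is mapped independently of the others, which is what makes the problem a natural fit for the parallel computing resources discussed next.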

One potential solution to the data handling/storage problem for smaller research groups is the use of cloud computing (see p. 13). In this approach, a user rents processing time on a computer cluster (e.g., from Amazon) through a virtual operating system (or 'cloud'), which can load software and provide an access point for running highly parallelized tasks. Sequencing data can be sent to the cluster either by disk or the internet (although the size of data sets presents its own problems for the latter).
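The kind of highly parallelized task described above is often organized as a map, shuffle and reduce pipeline, the programming model that Hadoop implements. The sketch below imitates that pattern in plain Python on a single machine. It does not use the Hadoop API, and the reference, the pre-aligned reads and the `min_depth` threshold are made up for illustration.

```python
from collections import Counter, defaultdict

REFERENCE = "ACGTACGTTAGC"

# Hypothetical pre-aligned reads: (start position on the reference, sequence).
ALIGNED_READS = [
    (0, "ACGTAC"),
    (2, "GTATGT"),  # differs from the reference at position 5 (C -> T)
    (3, "TATGTT"),  # same difference at position 5
    (6, "GTTAGC"),
]


def map_phase(read):
    """Map: emit one (position, base) pair per aligned base of a single read."""
    start, seq = read
    return [(start + offset, base) for offset, base in enumerate(seq)]


def shuffle(pairs):
    """Shuffle: group the emitted bases by reference position."""
    grouped = defaultdict(list)
    for pos, base in pairs:
        grouped[pos].append(base)
    return grouped


def reduce_phase(pos, bases, min_depth=2):
    """Reduce: report the majority base if it disagrees with the reference."""
    base, count = Counter(bases).most_common(1)[0]
    if len(bases) >= min_depth and base != REFERENCE[pos]:
        return (pos, REFERENCE[pos], base, count, len(bases))
    return None


# On a cluster, map calls (one per read) and reduce calls (one per position)
# could run on different nodes; here they simply run one after another.
pairs = [pair for read in ALIGNED_READS for pair in map_phase(read)]
grouped = shuffle(pairs)
candidates = [reduce_phase(pos, grouped[pos]) for pos in sorted(grouped)]
variants = [v for v in candidates if v is not None]
print(variants)
```

Because no map call depends on another read and no reduce call depends on another position, the work can be spread across as many rented machines as the data set warrants.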

The first software (Crossbow) capable of performing alignment and single nucleotide polymorphism analysis on multiple whole-human data sets on a computing cloud was published just 6 weeks ago (Genome Biol. 10, R134, 2009). Essentially, the package makes it possible to analyze an entire human genome in a single day while sitting with a laptop at your local Starbucks.

It remains unclear, however, whether routinely renting time on the cloud would be cost effective in the long term, particularly if a user intends to analyze billions of base pairs of genome sequence on a regular basis. What's more, if the wide uptake of sequence analysis on clouds depends on the availability of user-friendly, debugged software, bioinformaticians might not be willing to spend the time to familiarize themselves with Hadoop, the open-source software needed to process large data sets on a cloud, especially when their jobs focus on developing algorithms for their own local computer clusters.

Thus, for next-generation sequencing to move out of genome centers, more effort must focus on creating software compatible with cloud use or, better still, infrastructure software (similar to Apache for web servers) that would allow community-generated software for all types of sequence analysis to be plugged into it. This approach is likely to be particularly valuable for smaller laboratories lacking software development resources. And although it will not solve all the data management and analysis problems associated with next-generation platforms, it could give many the opportunity to adopt a powerful and rapidly advancing technology that would otherwise remain out of reach.
