Genomes free for download

The world's largest data set on human genetic variation, the 1000 Genomes Project, is now available for free through Amazon Web Services. The arrangement with Amazon, spearheaded by the US National Institutes of Health, makes data from sequencing the genomes of more than 1,700 people accessible through Amazon's cloud. Any scientist with an internet connection will be able to access all 200 terabytes of the data set in the cloud, avoiding the need to acquire additional bandwidth or data storage capacity. “It's a very large amount of data,” says William Spooner, chief technology officer at Cambridge, UK–based Eagle Genomics, which provides bioinformatics services for mining genomics data. “You would need dedicated server capacity for 200 terabytes.” Most pharma companies have the capacity to download that amount of data, but “these days there is no advantage to doing that,” he says. Plus, working with the data in the cloud makes it easier for companies to share their analyses with university collaborators who may not have the extra server capacity, Spooner says. “What the cloud offers pharma is the ability to share analyses securely and easily with university collaborators without any risk to company firewalls.” It's a good deal for both industry and public sector researchers, who will likely find the information from the 1000 Genomes Project enormously useful.

But working with data in the cloud has its drawbacks. Michael Snyder, chair of genetics at Stanford University and co-founder of Personalis, a genome analysis company in Palo Alto, California, says he has used Amazon's cloud for other projects and found it difficult to get data in and out. “We started out doing stuff in the cloud and found it to be a pain,” he says. Snyder says Personalis will access the data from the 1000 Genomes Project, but his team hasn't yet decided if they will do it through Amazon's cloud. Amazon Web Services could profit as well. Additional resources will be needed to crunch the data, and not everyone has this kind of computing power. Amazon can provide these resources for a fee.

The 1000 Genomes Project is an international effort initiated in 2008 to collect the genomes of more than 2,600 people from 26 populations around the world (Nat. Biotechnol. 26, 256, 2008). The remaining 900 samples will be sequenced this year. The move to put the project in the cloud is part of a larger effort by the US federal government to manage the deluge of 'big data' in science. The White House Office of Science and Technology Policy announced in March that more than $200 million would be doled out to six federal agencies to manage the mountains of data being created for scientific discovery.