Published online 3 September 2008 | Nature 455, 16-21 (2008) | doi:10.1038/455016a

News Feature

Big data: Welcome to the petacentre

What does it take to store bytes by the tens of thousands of trillions? Cory Doctorow meets the people and machines for which it's all in a day's work.

Ten seconds after I stepped into the roar of the data centre at the UK Wellcome Trust Sanger Institute, in rural Cambridgeshire, my video camera croaked: CARD FULL. Impossible.

Comments

  • "...the relentless march from kilo to [...] to peta to exa to zetta to yotta. The mad, inconceivable growth of computer performance and data storage is changing science, knowledge, ..." I find this alarming. Modern biology in particular seems to be degenerating into a kind of "high-throughput/low-insight" pseudoscience. Producing and storing huge datasets is relatively "easy". Making sense of it all could easily become a bottleneck.

    • 04 Sep, 2008
    • Posted by: András Aszódi
  • "Yet whether a network card is saturated or idle, it still burns 100% of its energy draw. The same with video cards, power supplies, RAM and every other component except for some CPUs. So these idle systems whir away, turning coal into electricity into heat that has to be cooled with coal turned into electricity turned into heat, and the planet warms and the bills soar. Every decibel of noise roaring through the centres is waste, energy pissed away for no benefit." --- That's badly written and factually incorrect. Apart from the fact that some countries don't use much coal (France is mostly nuclear, for instance), the energy consumed by a computing device does depend on the computational load. This is why your laptop battery can last almost a whole day of work, but is dead within an hour after you fire up World of Warcraft. Information theory tells you why.

    • 04 Sep, 2008
    • Posted by: Phillip Bentley
  • "The fallow cooling floor is matched in the compute centre below (these people all use 'compute' as an adjective)." - Mr. Doctorow, you don't seem to know what an adjective is. A 'job centre' is a centre for jobs; 'job' is a noun. A 'data centre' is a centre for data, and 'data' is another noun. As we find in the dictionary, 'compute' is another noun, synonymous with 'computation'. A centre for computation.

    • 04 Sep, 2008
    • Posted by: Thomas Dent
  • You should probably correct the 320TB quote. This is not the amount of data generated in 2 hours by a single run. A single Solexa machine generates about 30Gb in 2 hours.

    • 05 Sep, 2008
    • Posted by: Ben Berman
  • @András Aszódi http://www.wired.com/science/discoveries/magazine/16-07/pb_intro

    • 06 Sep, 2008
    • Posted by: Mike D
  • @Mike D: thanks for the link. My primary concern is the trend described in one of those Wired articles, the one by Mr Anderson: "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete". In particular, note the following sentences therein: "...the more we learn about biology, the further we find ourselves from a model that can explain it. There is now a better way. Petabytes allow us to say: 'Correlation is enough.'" Correlation science is pseudoscience. We desperately need new theories that can *make sense* of enormously large datasets. Being content with mindless data collection leads nowhere. We should find the right balance between generating data and the effort needed to interpret them in an intelligent way. Thankfully, some of those other Wired articles do point in this direction.

    • 07 Sep, 2008
    • Posted by: András Aszódi
  • "Today, mainframe is more synonymous with the creaky old legacy system that no one can be bothered to shut down because it is running an obscure piece of accounting software that would be a pain to port to a modern system." NO. Mainframes are about very high data throughput and extreme reliability: 99.999% uptime, self-diagnostics, and hot-swappable, redundant components. There are rather important application areas where such reliability is critical. "Financial transactions" comes to mind. Mr Doctorow may care to ask IBM why they still produce and sell those "creaky old legacy systems".

    • 07 Sep, 2008
    • Posted by: C D Todd
  • Cory - great article. Awesome (in all senses of the word) subject. I had some idea for the Sanger, but not for the other sites, and it's all quite mind-boggling. The great universal library in the making.

    • 07 Sep, 2008
    • Posted by: Heather Etchevers