The Fourth Paradigm: Data-Intensive Scientific Discovery

Edited by Tony Hey, Stewart Tansley & Kristin Tolle
Published at http://research.microsoft.com/en-us/collaboration/fourthparadigm

When it came online in 1946, the US Army's giant ENIAC — Electronic Numerical Integrator and Computer — was hailed as the world's first 'electronic brain', a major step forward in our ability to process information. It was put to work doing everything from modelling the hydrogen bomb to predicting the weather. Fast-forward to today, and the Large Hadron Collider at CERN, Europe's particle-physics laboratory near Geneva, Switzerland, will produce data in a single second that would take, on average, six million ENIACs to store. The Large Synoptic Survey Telescope, planned to begin operation in Chile in 2015, will produce data on a similar scale.

A data-storage facility at CERN hints at the huge scale of the information revolution. Credit: CERN

Hundreds of projects in fields ranging from genomics to computational linguistics to astronomy demonstrate a major shift in the scale at which scientific data are taken, and in how they are processed, shared and communicated to the world. Most significantly, there is a shift in how researchers find meaning in data, with sophisticated algorithms and statistical techniques becoming part of the standard scientific toolkit. The Fourth Paradigm is about this shift, how scientists are dealing with it, and some of the consequences. Its 30 chapters, written by some 70 authors, cover a wide range of aspects of data-intensive science.

The book is in four parts. The first two are a panorama of the new ways in which data are obtained, through new instruments and large-scale sensor networks. The fields covered range from cosmology to the environment and from healthcare to biology. Most of the chapters in these sections follow a common pattern. Each introduces a complex system of scientific interest — the human brain, the world's oceans, the global health system and so on — before explaining how we are building an instrument or a network of sensors to map out that system comprehensively and, in some cases, to track its real-time behaviour.

We learn in one chapter, for example, about steps towards building a complete map of the human brain — the 'connectome'. Another chapter describes the Ocean Observatories Initiative, a major effort funded by the US National Science Foundation to build an enormous underwater sensor network in the northeast Pacific, off the coasts of Oregon, Washington and British Columbia. And so on, example after example.

This repetition was, for me, the most enjoyable part of the book. It illuminates common questions being asked across these superficially very different fields: Who owns the data gathered? How should their release be managed? How should they be curated? How will we preserve them for future generations? Most of all, how can we understand the data?

In parts three and four of the book, these same questions return, from the broader perspective of how the answers could and should be reflected in scientific institutions. Part three tackles infrastructure requirements, and part four looks at scholarly communication. Topics include the technical challenges of doing large-scale data analysis, such as multicore and parallel computing; workflow tools that simplify data analysis and make experiments and analysis more reproducible; and the difficult social and technical challenges of moving to a world in which large data sets are routinely published as part of the scientific process and then integrated with other data sources. The most interesting theme that emerges here is a vision of an increasingly linked web of information: all of the world's scientific knowledge as one big database.

The book has some minor shortcomings. At times, it reads too much like a brochure — perhaps inevitable, given that nearly half of the contributing authors come from Microsoft. Many of the essays assume that progress comes mostly from big grants and massive centralized programmes, an assumption not justified by the history of networked innovation. Think of the Internet, or the preprint server arXiv hosted by Cornell University in Ithaca, New York, or the gene-sequence database GenBank — each started by individuals with limited institutional support.

I also found myself wishing that the scope were broader. Science is about more than data: it is about ideas, explanations and people. The same tools that are driving data-intensive science are also changing the nature of scientific collaboration, and the two changes are closely related. Yet this shift in how scientists team up to create meaning is addressed in only a few chapters.

These are minor criticisms. The rise of 'big data' is one of the major scientific stories of our time, and The Fourth Paradigm offers a broad view that is both informative and stimulating. Better still, the book has been released under a Creative Commons licence, and is available for free on the Internet.