The data deluge

    An Erratum to this article was published on 06 November 2012

    This article has been updated

    The Royal Society calls for a shift in the attitude of scientists and others, including funders, research institutions and publishers, towards data accessibility, curation and dissemination.

    Data is big and getting bigger. In 2011, Twitter users generated 200 million tweets a day and, on average, more than 300 million photos were uploaded to Facebook each day in the first three months of 2012. The Large Hadron Collider is estimated to produce 15 petabytes, or 15 million gigabytes, of data annually. On a more modest scale, the 1000 Genomes Project, a data set of human genetic variation, has already generated about 200 terabytes of data, and MitoCheck, a systematic genome-wide analysis of about 21,000 human genes, generated about 190,000 movies representing a total of 24 terabytes. Tremendous leaps in technology have now made it possible to accumulate large volumes of data in relatively short time spans. Scientists and others are thus increasingly contending with larger and more complex data sets that pose unique opportunities and challenges for data storage, curation, analysis, interpretation, dissemination and reuse. To fully reap the benefits of 'big data', the generation of data must proceed apace with the development of new tools for data analysis and visualization. To this end, the US government launched the Big Data Research and Development Initiative in March 2012, committing more than $200 million, distributed among six Federal agencies and departments, to fund the development of the technology and tools necessary for meeting the challenges thrown up by big data.

    Shortly before this issue went to press, the Royal Society released its report on “Science as an open enterprise”. The report issues a number of recommendations for the generation, preservation and dissemination of data so as to maximize the impact of research data and enhance its reproducibility in an age of big data. It emphasizes a need for change in current practices of data management and communication, in prevailing attitudes towards data sharing, and in mechanisms for crediting data generators. It identifies six main areas where change is needed: (1) developing greater openness in data sharing; (2) developing appropriate reward mechanisms for data generation, analysis and dissemination; (3) developing data standards to enable interoperability; (4) making data associated with published papers accessible, and amenable to assessment and reuse; (5) developing a cadre of 'data scientists'; (6) developing new tools for data analysis.

    For these changes to take root, various stakeholders involved in the scientific enterprise, including scientists, research institutions, funding bodies and publishers, must collectively commit to action. For example, alongside the push towards 'open data', funders, universities and research institutes need to develop mechanisms for incentivizing and crediting data generation and sharing. These institutions, along with publishers, can also spur data sharing by encouraging data deposition in appropriate community-recognized databases, where available. Most funding agencies have explicit directives on data sharing, and some funders also require a data management plan (describing data-sharing mechanisms) to be submitted with grant proposals. Nature journals require that data sets associated with a published paper be made freely available to readers from the date of publication, and submission of data sets to community-endorsed public repositories is often mandatory or strongly encouraged, depending on the nature of the data set and evolving standards of the community. Accession numbers for deposited data sets are required in papers published in Nature journals to facilitate data retrieval. Data-centric journals, such as GigaScience, aiming to publish 'big-data' studies, go a step further and propose to integrate the published paper with a journal-hosted repository for data sets, as well as providing data analysis tools and DOIs for associated data sets to facilitate citation.

    Many of the conclusions of the Royal Society's report echo those of a 2009 report from the Interagency Working Group on Digital Data in the US, “Harnessing the Power of Digital Data for Science and Society”, which set forth a strategy to ensure that scientific digital data is preserved and accessible for maximum use. This report rightly cautioned that considered decisions must be made about which scientific data needs to be preserved and for how long. Criteria to consider included the cost of preservation weighed against the cost of generating the data, whether existing data is rendered obsolete by new data, and whether certain types of data are irreproducible. For so-called 'preservation data' (data deemed high priority for long-term preservation), the report recommended detailed data management plans that would cover the full life cycle of data from generation to dissemination and reuse.

    Although we do not dispute the considerable benefits to research progress that can come from making research data widely accessible in a timely manner, there are several areas of biological research where data accessibility and reusability are significantly hampered by the absence of community-recognized repositories and data standards. Imaging technology has become increasingly sophisticated, allowing fine-resolution analysis within cells and organisms, and systematic genome-wide analysis. However, databases and data standards for imaging experiments, akin to what has been established for and by the gene expression communities, are sorely lacking (Nat. Cell Biol. 13, 183; 2011). Similarly, RNAi screening is becoming commonplace, but a unified repository does not yet exist and data standards such as MIARE (minimum information about an RNAi experiment) have not gained sufficient traction within the community (Nat. Cell Biol. 14, 115; 2012). Developing repositories and community consensus around data standards must be a core priority to harness the full potential of research data.

    Change history

    • 17 October 2012

      In the version of this paper initially published, the reported number of movies produced by the MitoCheck project was incorrect.

    Cite this article

    The data deluge. Nat Cell Biol 14, 775 (2012).
