Main

To tackle complex scientific questions, experimental datasets from different sources often need to be harmonized in regard to structure, formatting and annotation so as to open their content to (integrative) analysis. Vast swathes of bioscience data remain locked in esoteric formats, are described using nonstandard terminology, lack sufficient contextual information or simply are never shared due to the perceived cost or futility of the exercise. This loss of value continues to engender standardization initiatives and drives the ongoing conversation about the encouragement of data sharing through appropriate reward mechanisms.

Minimum reporting guidelines, terminologies and formats (hereafter referred to generally as reporting standards) are increasingly used in the structuring and curation of datasets, enabling data sharing to varying degrees. However, the mountain of frameworks needed to support data sharing between communities inhibits the development of tools for data management, reuse and integration. Here we describe a way in which a group of data producers and consumers work within an invisible metadata framework that enables the coordinated use of reporting standards by service providers and circumvents many of the problems caused by data diversity. The same framework enables researchers, bioinformaticians and data managers to operate within an open data commons.

From reusable data to reproducible research

Shared, annotated research data and methods offer new discovery opportunities and prevent unnecessary repetition of work. Although funding agencies, journals and community initiatives encourage good data stewardship and sharing through the use of community reporting standards, data sharing remains challenging1,2,3. More significant coordination has occurred in the food and drug regulatory arena4 and in commercial science, where investments in procedures and tools that integrate external sources with internal data now enhance decision-making processes5.

Funding agency 'encouragement' has normally taken the form of top-down data sharing policies. Increasingly, however, funding agencies are also requiring specific data management, preservation and sharing plans in grant applications and are monitoring adherence6. Such an approach requires researchers to follow or develop best practices collaboratively. These practices are also emerging organically through the provision of independent databases, tools and curators, driven by advocates of the sharing of both pre- and post-publication data7,8. Building an interoperable open data ecosystem will require leveraging all of these positive efforts and further increasing community buy-in.

Time to leap outside the box

Overall, most stakeholder groups accept the principles of data sharing, but in practice, achieving compliance is challenging, especially when new technologies or combinations of technologies are employed. The current wealth of domain-specific reporting standards provides proof of stakeholders' engagement with standardization and sharing, but the use of combinations of technologies presents challenges9,10. Descriptions of investigations of biological systems in which source material has been subject to several kinds of analyses (for example, genomic sequencing, protein-protein interaction assays and the measurement of metabolite concentrations) are particularly challenging to share as coherent units of research because of the diversity of reporting standards with which the parts must be formally represented. Equally, most repositories are designed for specific assay types, necessitating the fragmentation of complex datasets11,12,13,14,15. One way forward is to establish reciprocal data exchange between major repositories, but budgetary constraints limit such activities15,16, and a crop of differing methodologies still imposes barriers11,12.

Researchers acting as data consumers also face challenges when the component parts of an investigation are scattered across databases. Fragmented datasets can only be reassembled by those equipped to navigate the various reporting guidelines, terminologies and formats involved17. Cross-cutting, topic-specific reference datasets have been assembled, but predominantly by large initiatives (such as Sage Commons) and programs (such as ENCODE or the US National Institutes of Health–National Institute of Allergy and Infectious Diseases' Bioinformatics Resource Centers (BRCs)). These limitations fuel the indifference researchers feel about investing significant effort to share their data18.

As the main facilitators of data sharing, major public repositories are evolving to support the structure and detail increasingly present in complex, multipart datasets (such as the US National Center for Biotechnology Information's BioSample system). By importing data from external files under their own schemata, databases provide badly needed integration. The speed of this evolution is dependent on access to highly skilled biocurators able to generate and validate complex annotations, increasing the pressure on data producers to quality check data before submission19.

ISA commons: a part of the data-commoning revolution

New solutions are required that deliver economies of scale in data capture and inherently support data integration, rendering the process of data capture and annotation scalable in the face of the current 'data bonanza'. Here we refer to efforts toward such positive solutions as 'data commoning'. Box 1 presents an exemplar ecosystem of data curation and sharing solutions from groups working together to create a cross-domain data sharing vision of the future. These collaborative groups are, in essence, on the path to building a data commons, serving an increasingly diverse set of domains including environmental health, environmental genomics, metabolomics, (meta)genomics, proteomics, stem cell discovery, systems biology, transcriptomics and toxicogenomics, as well as communities working to characterize nucleic acid structures and to build a library of cellular signatures. This emerging commons depends on its participants' use of the metadata categories 'Investigation' (the project context), 'Study' (a unit of research) and 'Assay' (analytical measurement). This so-called ISA framework is the backbone that supports the discovery, exchange and informed integration of data sets.
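The three metadata categories nest naturally: an Investigation groups Studies, and a Study groups Assays. The following is a minimal sketch of that hierarchy in Python, not the ISA software's own data model; the class fields, the helper method and the example titles are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    """An analytical measurement; the field names are illustrative assumptions."""
    measurement: str   # e.g. "transcription profiling"
    technology: str    # e.g. "DNA microarray"


@dataclass
class Study:
    """A unit of research that groups one or more assays."""
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """The project context that groups one or more studies."""
    title: str
    studies: List[Study] = field(default_factory=list)

    def assay_technologies(self) -> List[str]:
        """Return the sorted, de-duplicated technologies used across all studies."""
        return sorted({a.technology for s in self.studies for a in s.assays})


# Hypothetical example: one study analysed with two kinds of assay.
investigation = Investigation(
    title="Hypothetical multi-omics project",
    studies=[
        Study(
            title="Hypothetical stress response study",
            assays=[
                Assay("transcription profiling", "DNA microarray"),
                Assay("metabolite profiling", "mass spectrometry"),
            ],
        )
    ],
)
print(investigation.assay_technologies())
```

Because every assay hangs off a study and every study off an investigation, a multi-assay experiment can be described as one coherent unit even when its data files end up in different repositories.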

At the heart of the ISA framework is the extensible, hierarchical 'ISA-Tab' file format20 that can be used alone or as a template for a variety of spreadsheet-based formats for data sharing21. ISA-Tab was developed by mapping a number of public repositories' submission formats onto one structure for representing experimental metadata, leveraging common elements while keeping data files external in their native or community-specific formats. ISA-Tab offers the chance for both project-specific and public repositories to adopt a common file format for representing experimental metadata, increasing the flow of richly described investigations into the public domain.
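To illustrate the idea of tabular metadata that points to external data files, the sketch below reads a small tab-delimited table with Python's standard `csv` module. The column headers, sample names and file names are assumptions made for this example; the ISA-Tab specification defines the actual headers and file layout.

```python
import csv
import io

# Hypothetical assay table: one row per sample, each pointing to an
# external data file that stays in its native format.
ASSAY_TABLE = (
    "Sample Name\tAssay Name\tRaw Data File\n"
    "sample-1\tassay-1\tsample-1.CEL\n"
    "sample-2\tassay-2\tsample-2.CEL\n"
)


def external_files_by_sample(table_text: str) -> dict:
    """Map each sample name to the external data file its row references."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return {row["Sample Name"]: row["Raw Data File"] for row in reader}


files = external_files_by_sample(ASSAY_TABLE)
print(files)
```

The metadata table carries only descriptions and file references, so the data files themselves can remain in whatever community-specific format the assay produced.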

The modular ISA software suite, which implements the ISA-Tab format, acts to (i) regularize local collection and management of experimental metadata, (ii) reduce the adoption barrier for using community minimum reporting guidelines and terminologies through customizable configuration, (iii) facilitate consistent curation at source and (iv) support direct submission to a growing number of public repositories, both in ISA-Tab format (such as MetaboLights and the other systems shown in Box 1) and through conversion to other supported formats12,13,14. An example of the ISA framework in action is illustrated by the Harvard Stem Cell Institute (HSCI)'s Stem Cell Discovery Engine (SCDE)22 and shown in Figure 1.

Figure 1: The ISA framework in action in the stem cell–based system of the Harvard Stem Cell Institute (HSCI).

The data management workflow of the HSCI's Stem Cell Discovery Engine (SCDE) system, powered by the ISA framework. (a) Curators use the ISAconfigurator and ISAcreator software modules to consistently curate a variety of internally generated stem cell-based genomics profiles according to community-developed minimum information guidelines and terminologies; published transcriptomics-based studies are also collected via the MAGEtoISA module, then curated and enriched for consistency. (b) Consistently represented investigations are loaded in the BioInvestigation Index (BII) component that stores and serves the (public and private) data sets to the HSCI and wider community. (c) Upon publication, investigations are directly submitted to those public repositories using ISA-Tab format, or converted to/from other supported formats via the ISAconverter.

Without community-level harmonization and interoperability, many community projects risk becoming data silos, aggravating the problem. Using the shared, metadata-focused ISA framework, it is now possible to aggregate investigations in community 'staging posts', merge them in various combinations, perform meta-analyses and more straightforwardly submit to public repositories. Furthermore, simplifying the integration of bioscience data can only speed systems biology research23 and improve the ability of the R&D community to utilize shared data24.

The growing number of communities using the ISA framework adds credibility to this metadata-focused data sharing vision. Taking this a step further, Figure 2 shows how these communities' systems—a mix of public and internal tools that use ISA software components or, minimally, the ISA-Tab format—will progressively interrelate to build the 'ISA commons'. Activities are already underway under the auspices of the World Wide Web Consortium (W3C) Semantic Web for Health Care and Life Sciences Interest Group (HCLSIG)'s Scientific Discourse task force to generate serialized ISA-Tab metadata in compliance with the recommendations of the international Linked Data community25. Semantic integration of bioscience data with the wider corpus of human knowledge then becomes more straightforward.
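As a rough illustration of what a Linked Data view of ISA-style metadata might look like, the sketch below emits subject-predicate-object statements in N-Triples syntax. It is not the task force's serialization: the `http://example.org/isa/` namespace, the `hasStudy` and `hasAssay` predicates and the identifiers are all hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical namespace; a real deployment would use a published vocabulary.
BASE = "http://example.org/isa/"

Triple = Tuple[str, str, str]


def study_triples(investigation_id: str, study_id: str,
                  assay_ids: List[str]) -> Iterator[Triple]:
    """Yield triples linking an investigation to a study and its assays."""
    yield (BASE + investigation_id, BASE + "hasStudy", BASE + study_id)
    for assay_id in assay_ids:
        yield (BASE + study_id, BASE + "hasAssay", BASE + assay_id)


def to_ntriples(triples: Iterable[Triple]) -> str:
    """Serialize IRI-only triples, one N-Triples statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


text = to_ntriples(study_triples("inv1", "study1", ["assay1", "assay2"]))
print(text)
```

Once experimental metadata are expressed as statements of this kind, they can in principle be loaded into a triple store and queried alongside other linked resources.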

Figure 2: Building the 'ISA commons', a growing ecosystem of resources that work to provide a data commons.

(a) Data sets of interest to each community are collected and curated. (b) Capture systems, either powered by the ISA software suite or supporting the hierarchical ISA-Tab structure, deliver a common representation of experimental content that transcends individual domains. (c) To achieve broader data integration, the next step is to explore the growing Linked Data universe. The European Innovative Medicines Initiative (IMI) Open PHACTS project, for example, will use semantic web approaches to make existing knowledge available for linking, querying and where possible, reasoning. This project will benefit greatly from study descriptions that draw on the ISA model to connect quantified information held in semantic triple stores to data from actual experiments performed. As a result, the project will connect public and private datasets to genomics resources, enabling the combination of existing and new experimental data.

BioSharing: standard cooperating procedures

It is widely acknowledged that unlocking shared data promises to accelerate discovery, but this process requires new models for the way we collaborate1,2,3,5,6,17,18,26. Reporting standards, however, differ in their levels of maturity, and duplication of effort is inevitable. Communication between standards initiatives is pivotal to ensure that a common or at least complementary set of standards exists and is widely used by the academic and commercial sectors to maximize the utility of shared data. Building on the effort of the Minimum Information for Biological and Biomedical Investigations (MIBBI) portal10, the BioSharing initiative works to strengthen collaborations between researchers, funders, industry and journals and to discourage redundant (if unintentional) competition between standards-generating groups27. The BioSharing catalog maps the landscape of standards and the systems implementing them, and it also works to build graphs of complementarities in scope and functionality. In time and after consultation, a set of criteria for assessing the usability and popularity of standards will be implemented to maximize their adoption and use, assisting the virtuous data cycle: from generation to standardization through publication to subsequent sharing and reuse.

The research community requires solutions that accommodate the current 'wealth' of standards and resources but hide it from users, thereby simplifying their efforts to meet (or ideally, exceed) applicable reporting requirements. Although ongoing activities hold promise, they are a drop in the ocean compared to the daunting challenges ahead: for example, the integration of clinical and biological data in translational medicine28 and the establishment of mechanisms to support credit for data sharing, which would benefit data producers for making their data accessible (for example, refs. 29,30).

Nonetheless, the vision of data sharing through a 'commons' is entirely technologically possible; communities simply need to agree on the largely organizational changes required. The continued collaborative development and uptake of standard frameworks, and the emergence of compliant tools and interoperable data sets such as we have described, illustrate the potential of the horizontal, synergistic approach that is data commoning. Such horizontal integration transcends individual life science domains and assay- or technology-focused communities.

A growing movement

The ISA commons is a growing exemplar ecosystem of data curation and sharing solutions built on a common metadata tracking framework, providing tools and resources to create and manage large, heterogeneous data sets in a coherent manner, and allowing users of (parts of) data sets to 'connect the metadata dots'. We are open to coordinating efforts with other data commons working on similar and related aspects of the same problem, whom we invite to adopt and contribute to the further evolution of the ISA framework, the result of years of effort to agree on a basic lingua franca for the standards community.

We urge new communities interested in breaching the boundary of their own bio-domain to join the growing ISA network and the BioSharing initiative, thereby contributing to the realization of this data-sharing vision: to empower ever more scientists to take data management and sharing into their own hands, using community standards while remaining blissfully unaware of the underlying complexities of the implementation of those standards.

Note: The views presented in this article do not necessarily reflect those of the US Food and Drug Administration.

URLs. BGI, http://en.genomics.cn/; BioLinux, http://nebc.nerc.ac.uk/tools/bio-linux; Bioplatforms Australia, http://bioplatforms.com.au/; CSIRO, http://www.bioinformatics.csiro.au; BioSharing, http://biosharing.org/; BIRN BioScholar Knowledge Management system, http://bmkeg.isi.edu/; DataCite's DOIs, http://www.datacite.org/; dbNP, http://www.dbnp.org/; ENCODE, http://encodeproject.org/ENCODE/dataStandards.html; Galaxy, http://galaxy.psu.edu/; GSC, http://gensc.org/; GigaScience, www.gigasciencejournal.com/; HSCI's SCDE, http://discovery.hsci.harvard.edu/; HSCI's Blood Genomics Repository, http://bloodprogram.hsci.harvard.edu/; ICoMM, http://icomm.mbl.edu/; IMI Open PHACTS, http://www.openphacts.org/; ISA Commons, http://www.isacommons.org/; ISA software suite and ISA-Tab, http://www.isa-tools.org/; Leibniz Institute of Plant Biochemistry, http://www.ipb-halle.de/en/research/stress-and-developmental-biology/research/bioinformatics-mass-spectrometry/research-projects; LINCS, http://lincs.hms.harvard.edu/; Linked Data, http://linkeddata.org/; MeRy-B, http://www.cbib.u-bordeaux2.fr/MERYB/index.php; MetaboLights, http://listserver.ebi.ac.uk/mailman/listinfo/metabolights/; MIRADA LTERs, http://amarallab.mbl.edu/mirada/mirada.html/; NIEHS' Center for Environmental Health, http://www.hsph.harvard.edu/research/niehs; NCBI's BioSample, http://www.ncbi.nlm.nih.gov/biosample; NERC EnvBase, http://bii.nwl.ac.uk/; NIBR, http://www.nibr.com/; NIH-NIAID's BRCs (Bioinformatics Resource Centers), http://www.niaid.nih.gov/labsandresources/resources/brc; Sage Commons, http://sagebase.org/commons/; SEEK, http://www.sysmo-db.org/; SIDR, http://sidr-dr.inist.fr/; SNRNASM, http://snrnasm.bio.unc.edu/; SysMO, http://www.sysmo.net/; http://www.fda.gov/AboutFDA/CentersOffices/OC/OfficeofScientificandMedicalPrograms/NCTR/WhatWeDo/NCTRCentersofExcellence/ucm078990.htm/; W3C HCLSIG Scientific Discourse task force, http://www.w3.org/wiki/HCLSIG/SWANSIOC.

Author contributions

S.-A.S. and P.R.-S. designed and led the development of the ISA framework and the BioSharing catalogue. D.F. and S.-A.S. are the cofounders of the BioSharing initiative. E.M. is the lead engineer of the ISA framework and, with P.R.-S., of the BioSharing site. C.T. coordinates the MIBBI portal. W.H. conceived SCDE and the role of an ISA approach to integration within its stem cell systems. W.H., O.H., B.C., S.J.H.S. and K.B. contributed to the development of the ISA framework and worked on the SCDE. W.T. and H.F. contributed to the development of the ISA framework and strategies to integrate it with the FDA's ArrayTrack tool. S.N. contributed to the development of the ISA framework and developed workflows to integrate it with lab equipment. L.A.-Z. worked toward the implementation of ISA for the MIRADA-LTERS and ICoMM data sets. T.B. developed the NERC Environmental Bioinformatics Center (NEBC) EnvBase catalogue. G.B. worked toward the implementation of ISA for the BIRN BioScholar Knowledge Management system. T.C. leads the W3C working subgroup on Scientific Discourse. S.D. led the development of the Harvard Stem Cell Institute (HSCI) Blood Genomics repository, and M.E. worked on the integration of ISA-Tab into the system. L.-A.C. assisted the ISA developers to make use of the DataCite Metadata Store to mint Digital Object Identifiers (DOIs). J.C. and C.E.S. worked toward the implementation of ISA for use with HMS LINCS data. A.d.D. and D.J. worked toward the implementation of ISA for the MeRy-B knowledgebase. S.E. and S.L. worked on the integration of the ISA framework into the GigaScience and BGI database infrastructure. C.T.E. worked toward the implementation of ISA in the dbNP database and provided links to the Open PHACTS project. J.G. worked toward the implementation of ISA at the Argonne National Laboratory. C.G. and K.W. worked on the implementation of ISA-Tab in the SEEK platform. J.K. led the CarcinoGENOMICS project under which the ISA framework was first funded and developed. K.H., P.d.M. and C.S. developed MetaboLights, powered by the ISA framework. A.L. led the implementation of ISA-Tab in the SNRNASM annotation guidelines. S.M. and D.R. worked toward the integration of selected ISA software components as part of an extended workflow at NIBR. M.R. headed the development of the SIDR repository and the implementation of the ISA-Tab format. A.M. worked toward the implementation of ISA at CSIRO. C.A.S. worked toward the implementation of ISA at Bioplatforms Australia.

A.T., B.W.-J., H.H., I.D., I.X., J.L.G., L.B., L.H., M.J.F. and P.G., along with all the other authors, have provided advice, suggestions and feedback to S.-A.S. and P.R.-S. during the design and development phase of the ISA framework. In particular, P.G. was also closely involved in the BioSharing effort, and L.H. and B.W.-J. were pivotal for the links to the Pistoia Alliance, industry groups and the IMI Open PHACTS project.

All the authors have contributed to the preparation of the manuscript at all stages; in particular, E.M. developed the figures and S.-A.S., P.R.-S., D.F. and C.T. led the writing process.