Correspondence

Alternative models for sharing confidential biomedical data

Nature Biotechnology volume 36, pages 391–392 (2018)

To the Editor:

Although much discussion has been focused on the need for more data sharing in the biomedical community, less attention has been paid to new kinds of biomedical data sharing, particularly the sharing of confidential patient data. In the traditional paradigm of data sharing, researchers transfer their data directly to data modelers. Here we describe an alternative model that allows the protection of confidential data through a process we term 'model to data' (MTD). In the MTD model, the flow of information between data generators and data modelers is reversed. This new sharing paradigm has been successfully demonstrated in crowdsourced competitions and represents a promising alternative for increasing the use of data that cannot—or will not—be more broadly shared.

Biomedical studies generate vast clinical, radiologic, cellular and molecular data sets, and enable new basic and translational science. However, there is substantial disagreement around the best ways to share these valuable assets, particularly in the context of clinical trials. Some advocate for immediate and fully open sharing, arguing that wide accessibility will facilitate creative new analyses and improved reproducibility1. Others suggest more closed and/or delayed data sharing, arguing that broad availability will disincentivize the substantial effort required to accurately collect and generate large data sets2. Still others highlight the importance of keeping patient data private and not betraying patients' trust3. These points were highlighted by researchers involved in a large cardiovascular study, who remarked that the public release of their data puts projects and manuscripts “in jeopardy of being scooped”4. Ultimately, the question at hand is simple: what data-sharing model will most effectively incentivize funders, clinicians, scientists and patients, and catalyze new biomedical discovery?

Data sharing traditionally implies a flow of information from data generator to data consumer. This 'data to modeler' (DTM) paradigm has been, and remains, the cornerstone of scientific research; data modelers acquire direct access to data to develop and test hypotheses. Recently, alternative forms of data sharing have emerged, enabled by new technologies and propelled by a small, albeit growing, community that organizes research questions around 'challenges'. These challenges are crowdsourced competitions that pose quantitative questions to the research community and encourage innovation through incentives and multidisciplinary collaboration (e.g., Kaggle, InnoCentive, CASP (Critical Assessment of Protein Structure Prediction), CAGI (Critical Assessment of Genome Interpretation), the DREAM (Dialogue for Reverse Engineering Assessments and Methods) Challenges, and others5,6). In these challenges, the dissemination and availability of data are critical to operation, which has motivated challenge organizations to experiment with alternative forms of data sharing.

Data sharing in most challenges is based on the DTM format; after agreeing to the challenge's terms of use, registered members of a team log onto a central server and download challenge data to their personal machine(s) (Fig. 1a). An example of this is the Prostate Cancer Prognosis DREAM Challenge, where organizers partnered with Project Data Sphere to curate and disseminate clinical data collected from five phase 3 clinical trials in metastatic castration-resistant prostate cancer. Challenge participants were asked to predict patient survival and treatment discontinuation using the clinical variables provided to them. Three of the five clinical cohorts were shared at the onset of the challenge, allowing participants to develop (i.e., train) models; the remaining two clinical trial cohorts were also available, but the prediction variables (patient survival and treatment discontinuation) were withheld for blinded model validation. The challenge resulted in the submission and evaluation of over 50 models, with many outperforming previous benchmarks in the field7.
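The blinded validation step of a DTM challenge can be illustrated with a minimal sketch. Everything here is hypothetical: the function names `withhold_outcomes` and `score_accuracy`, the record layout and the use of simple accuracy on a binary outcome are assumptions made for illustration, not the scoring procedure used in the Prostate Cancer DREAM Challenge.

```python
from typing import Dict, Set, Tuple

Record = Dict[str, object]


def withhold_outcomes(
    cohort: Dict[str, Record], outcome_keys: Set[str]
) -> Tuple[Dict[str, Record], Dict[str, Record]]:
    """Split a validation cohort into released covariates and hidden outcomes."""
    public: Dict[str, Record] = {}
    hidden: Dict[str, Record] = {}
    for patient_id, record in cohort.items():
        # Participants receive everything except the prediction variables.
        public[patient_id] = {k: v for k, v in record.items() if k not in outcome_keys}
        # Organizers keep the prediction variables for scoring.
        hidden[patient_id] = {k: v for k, v in record.items() if k in outcome_keys}
    return public, hidden


def score_accuracy(
    predictions: Dict[str, bool], hidden: Dict[str, Record], key: str
) -> float:
    """Fraction of patients whose predicted binary outcome matches the hidden value."""
    if set(predictions) != set(hidden):
        raise ValueError("predictions must cover exactly the validation patients")
    correct = sum(predictions[pid] == hidden[pid][key] for pid in hidden)
    return correct / len(hidden)
```

In this sketch, participants would download only the `public` part of the validation cohort and return a dictionary of predictions; organizers would then call `score_accuracy` against the `hidden` part.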

Figure 1: Sharing paradigms for data challenges.

(a) Data to modeler (DTM). Both training and validation data sets are provided to participants for model development and generation of predictions. (b) Model to data (MTD). Participants submit 'containerized' models to organizers. Hidden data sets are used for unbiased model validation, as well as potential model training.

In the Prostate Cancer DREAM Challenge, all challenge data were available and could be broadly disseminated to the participants. In many circumstances, however, restrictions on data sets prohibit this kind of broad access.

In response to such restrictions, interested parties have developed the MTD form of data sharing. In this sharing paradigm, the data remain stationary and the models move to the data. In the context of data challenges, participants submit executable models, which are then run on a team's behalf to generate predictions on unseen data (Fig. 1b). With MTD, validation data and sometimes training data are accessed only indirectly through submitted code and models, and direct sharing is limited to the data contributors and the challenge organizers. Consequently, data contributors can enjoy the benefits of data sharing (e.g., recognition and publication authorship) while preserving their control of the data.
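The reversal of information flow in MTD can be sketched in a few lines. In this hypothetical example, a plain Python callable stands in for a submitted executable model, and the class `HiddenDataHost`, its `evaluate` method and the 0.5 threshold are assumptions for illustration only; they do not describe the infrastructure used in any DREAM Challenge.

```python
from typing import Callable, Dict, List, Sequence

# A submitted model: maps one patient's features to a predicted risk in [0, 1].
Model = Callable[[Sequence[float]], float]


class HiddenDataHost:
    """Organizer-side holder of data that participants never see directly."""

    def __init__(self, features: List[Sequence[float]], labels: List[int]) -> None:
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self._features = features  # stays on the organizer's side
        self._labels = labels      # stays on the organizer's side

    def evaluate(self, model: Model, threshold: float = 0.5) -> Dict[str, float]:
        """Run the submitted model on the hidden data; return only aggregate results."""
        predicted = [model(x) >= threshold for x in self._features]
        correct = sum(p == bool(y) for p, y in zip(predicted, self._labels))
        # Only summary statistics leave the host, never individual records.
        return {"n": float(len(self._labels)), "accuracy": correct / len(self._labels)}
```

The design choice illustrated here is that the participant hands over the model and receives back a score, rather than receiving the data and handing over predictions.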

This MTD paradigm has been enabled by the emergence of two complementary technologies: container software and cloud computing. Container software, such as Docker or RKT, simplifies the bundling and transfer of an application (model) and its dependencies in a platform-agnostic way. Cloud computing has led to the commoditization of compute and data storage, and today provides a robust environment in which containers may be hosted and run. Together, these technologies allow a model to be migrated with minimal effort from one computing environment to another and executed at reasonable scale and cost.
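As a hedged illustration of how a containerized model might be run against stationary data, the sketch below assembles a `docker run` command that mounts the data read-only and disables networking. The helper name `build_docker_command`, the image name and the `/data` and `/output` paths are hypothetical; the source does not specify how challenge organizers configured their containers.

```python
from typing import List


def build_docker_command(image: str, data_dir: str, output_dir: str) -> List[str]:
    """Assemble (but do not execute) a docker invocation for a submitted model."""
    return [
        "docker", "run",
        "--rm",                          # remove the container when it exits
        "--network", "none",             # no network access from inside the container
        "-v", f"{data_dir}:/data:ro",    # hidden data mounted read-only
        "-v", f"{output_dir}:/output",   # writable location for predictions
        image,
    ]
```

On a host with Docker installed, the returned list could be passed to `subprocess.run(cmd, check=True)`; the function itself only builds the argument list.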

One of the first large-scale demonstrations of MTD in a data challenge was the Digital Mammography DREAM Challenge (http://www.synapse.org/Digital_Mammography_DREAM_Challenge). This challenge aimed to reduce the high false-positive rate in mammography screening, with one in ten women experiencing screening failures attributed to the misinterpretation of imaging features or overly cautious assessments by radiologists7. Advances in image processing and artificial intelligence (AI) raised hopes that the accuracy of cancer detection from mammography screening could be improved. As advanced AI methods require large training data sets, often on the order of many hundreds of thousands of images, challenge organizers partnered with insurance company Kaiser Permanente (Oakland, CA, USA), which contributed over 640,000 annotated breast images from 80,000 women. However, the conditions of use for these data prohibited participants from downloading or manually inspecting images. To address this restriction, organizers designed the challenge using MTD. Teams were required to submit containerized programs to train models on unseen training data, which were then validated on new, unseen data. To the best of our knowledge, this format of data sharing is unprecedented at this scale, with over 12 terabytes of imaging data available to participants8. Two additional challenges, the Multiple Myeloma DREAM Challenge and the NCI-DREAM Proteogenomic Challenge, have used the MTD design and successfully incorporated sensitive biomedical data that could not be broadly shared.

The MTD format is not without limitations. First, any data sharing opens up the potential for abuse and can result in unintended or unpermitted uses. In the Digital Mammography Challenge, although the imaging data were not downloadable, limited information on participants' models was provided to help teams troubleshoot and evaluate performance. Although this could theoretically allow leakage of imaging data, challenge organizers put in place auditing safeguards to minimize this risk. A second limitation is that participants have restricted feedback on the analysis of data sets, as they are barred from a direct and iterative exploration of the data. Consequently, successful challenges will require direct access to vital training data sets, with additional training and validation data held completely hidden from participants.
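One way to picture the trade-off between troubleshooting feedback and data leakage is a filter on the logs returned to participants. The sketch below is purely hypothetical: the function `redact_log` and its limits are assumptions for illustration and do not describe the auditing safeguards actually used in the Digital Mammography Challenge.

```python
from typing import List


def redact_log(raw_log: str, max_lines: int = 20, max_chars: int = 200) -> List[str]:
    """Return a bounded, printable-only excerpt of a model's log output."""
    lines = raw_log.splitlines()
    kept: List[str] = []
    for line in lines[:max_lines]:
        # Drop non-printable characters, then cap the line length so that
        # a log line cannot carry a large encoded payload.
        printable = "".join(ch for ch in line if ch.isprintable())
        kept.append(printable[:max_chars])
    if len(lines) > max_lines:
        kept.append(f"[{len(lines) - max_lines} further lines withheld]")
    return kept
```

Bounding the volume of returned text limits how much information about hidden images could be carried out through logs, at the cost of less detailed feedback for teams.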

The MTD data-sharing paradigm permits the use of private data that would otherwise remain closed and unavailable to the research community. We believe that the most permissive and accessible data-sharing policies are essential for a robust research community, in which researchers gain unfettered access to large, clinically characterized data sets. However, we acknowledge that restrictions around data access are likely to persist for the foreseeable future.

Several arguments are commonly given to justify data hoarding. First, data contain sensitive personal health information and could endanger patient privacy; second, data are proprietary and could undermine commercialization; and third, data are preliminary, and must remain embargoed until the completion of data quality assessments, peer-review and publication. Given these barriers, we as a scientific community must continue to innovate and find ways to work with, rather than against, these restrictions.

As researchers increase their use of cloud computing and container technologies, we can envision a future in which data repositories are designed to support access, not just by scientists who wish to download data, but by models capable of operating on the data. This would be especially valuable for researchers requiring access to protected health information in imaging databases or electronic health record systems. Such a system has the potential to reshape our concept of data sharing, and to refocus our priorities within the biomedical research community.

References

  1. et al. N. Engl. J. Med. 376, 1178–1181 (2017).

  2. International Consortium of Investigators for Fairness in Trial Data Sharing. N. Engl. J. Med. 375, 405–407 (2016).

  3. ACCESS CV. N. Engl. J. Med. 375, 407–409 (2016).

  4. Nature 543, 299 (2017).

  5. Nature 533, S62–S64 (2016).

  6. et al. Nat. Rev. Genet. 17, 310–318 (2016).

  7. et al. Lancet Oncol. 18, 132–142 (2017).

  8. N. Engl. J. Med. 375, 1438–1447 (2016).


Author information

Affiliations

  1. Sage Bionetworks, Seattle, Washington, USA.

    • Justin Guinney
  2. RWTH Aachen University, Faculty of Medicine, Germany.

    • Julio Saez-Rodriguez
  3. The European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK.

    • Julio Saez-Rodriguez

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Justin Guinney.

About this article

DOI

https://doi.org/10.1038/nbt.4128
