To the Editor:
Although much discussion has been focused on the need for more data sharing in the biomedical community, less attention has been paid to new kinds of biomedical data sharing, particularly the sharing of confidential patient data. In the traditional paradigm of data sharing, researchers transfer their data directly to data modelers. Here we describe an alternative model that allows the protection of confidential data through a process we term 'model to data' (MTD). In the MTD model, the flow of information between data generators and data modelers is reversed. This new sharing paradigm has been successfully demonstrated in crowdsourced competitions and represents a promising alternative for increasing the use of data that cannot—or will not—be more broadly shared.
Biomedical studies generate vast clinical, radiologic, cellular and molecular data sets, and enable new basic and translational science. However, there is substantial disagreement around the best ways to share these valuable assets, particularly in the context of clinical trials. Some advocate for immediate and fully open sharing, arguing that wide accessibility will facilitate creative new analyses and improved reproducibility1. Others suggest more closed and/or delayed data sharing, arguing that broad availability will disincentivize the substantial effort required to accurately collect and generate large data sets2. Still others highlight the importance of keeping patient data private and not betraying patients' trust3. These tensions were underscored by researchers involved in a large cardiovascular study, who remarked that the public release of their data put projects and manuscripts “in jeopardy of being scooped”4. Ultimately, the question at hand is simple: what data-sharing model will most effectively incentivize funders, clinicians, scientists and patients, and catalyze new biomedical discovery?
Data sharing traditionally implies a flow of information from data generator to data consumer. This 'data to modeler' (DTM) paradigm has been—and remains—the cornerstone of scientific research; data modelers acquire direct access to data to develop and test hypotheses. Recently, alternative forms of data sharing have emerged, enabled by new technologies and propelled by a small, albeit growing, community that organizes research questions around 'challenges'. These challenges are crowdsourced competitions that pose quantitative questions to the research community and encourage innovation through incentives and multidisciplinary collaboration (e.g., Kaggle, InnoCentive, CASP (Critical Assessment of Protein Structure Prediction), CAGI (Critical Assessment of Genome Interpretation), the DREAM (Dialogue for Reverse Engineering Assessments and Methods) Challenges, and others5,6). In these challenges, the dissemination and availability of data are critical to their operation and have motivated challenge organizations to experiment with alternative forms of data sharing.
In the Prostate Cancer DREAM Challenge, all challenge data could be broadly disseminated to the participants. In many circumstances, however, restrictions on data sets prohibit this kind of broad access.
In response to such restrictions, interested parties have developed the MTD form of data sharing. In this sharing paradigm, data remain stationary while models move to the data. In the context of data challenges, participants submit executable models, which are then run on a team's behalf to generate predictions on unseen data (Fig. 1b). With MTD, validation data and sometimes training data are accessed only indirectly through submitted code and models, and direct sharing is limited to the data contributors and the challenge organizers. Consequently, data contributors can enjoy the benefits of data sharing (e.g., recognition and publication authorship) while preserving their control of the data.
This MTD paradigm has been enabled by the emergence of two complementary technologies: container software and cloud computing. Container software, such as Docker or rkt, simplifies the bundling and transfer of an application (model) and its dependencies in a platform-agnostic way. Cloud computing has led to the commoditization of compute and data storage, and today provides a robust environment in which containers may be hosted and run. Together, these technologies allow a model to be migrated with minimal effort from one computing environment to another and executed at reasonable scale and cost.
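To make the workflow concrete, a participant's submission in an MTD challenge might amount to a containerized entrypoint that reads from a data directory mounted by the organizers and writes a predictions file, with the participant never inspecting the data. The sketch below is a minimal, hypothetical illustration in Python: the directory names, file layout and constant-baseline "model" are all assumptions for demonstration, not the design of any actual challenge. The demonstration at the end runs the entrypoint on synthetic stand-in files.

```python
# Hypothetical sketch of an MTD-style submission entrypoint.
# In a real challenge, organizers would mount read-only data into the
# container and collect the predictions file after the run; the
# participant never inspects the data directly.
import csv
import tempfile
from pathlib import Path


def predict(image_path: Path) -> float:
    """Stand-in for a real model; here, a constant baseline score."""
    return 0.5


def run(data_dir: Path, output_dir: Path) -> Path:
    """Score every image in data_dir and write predictions.csv."""
    output_dir.mkdir(parents=True, exist_ok=True)
    out = output_dir / "predictions.csv"
    with open(out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image_id", "score"])
        for image in sorted(data_dir.glob("*.png")):
            writer.writerow([image.stem, predict(image)])
    return out


# Demonstration on synthetic stand-in files (no real patient data):
tmp = Path(tempfile.mkdtemp())
data = tmp / "data"
data.mkdir()
(data / "scan_001.png").touch()
out = run(data, tmp / "output")
rows = list(csv.reader(open(out)))
```

In practice, such an entrypoint would be baked into a container image so that the identical artifact the participant tested locally can be executed unchanged on the organizers' infrastructure against the hidden data.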
One of the first large-scale demonstrations of MTD in a data challenge was the Digital Mammography DREAM Challenge (http://www.synapse.org/Digital_Mammography_DREAM_Challenge). This challenge aimed to reduce the high false-positive rate in mammography screening, in which roughly one in ten women experiences a screening failure attributed to the misinterpretation of imaging features or overly cautious assessments by radiologists7. Advances in image processing and artificial intelligence (AI) have raised hopes that the accuracy of cancer detection from mammography screening could be improved. Because advanced AI methods require large training data sets—often on the order of many hundreds of thousands of images—challenge organizers partnered with the healthcare consortium Kaiser Permanente (Oakland, CA, USA), which contributed over 640,000 annotated breast images from 80,000 women. However, the conditions of use for these data stipulated that participants could not download or manually inspect the images. To address this restriction, organizers designed the challenge using MTD: teams submitted containerized programs that were trained on data they could not see and then validated on further unseen data. To the best of our knowledge, this format of data sharing is unprecedented at this scale, with over 12 terabytes of imaging data available to participants8. Two additional challenges—the Multiple Myeloma DREAM Challenge and the NCI-CPTAC DREAM Proteogenomics Challenge—have since used the MTD design to successfully incorporate sensitive biomedical data that could not be broadly shared.
The MTD format is not without limitations. First, any data sharing opens the potential for abuse and can result in unintended or unpermitted uses. In the Digital Mammography Challenge, although the imaging data were not downloadable, limited information on participants' models was provided to help teams troubleshoot and evaluate performance. Although this could in theory allow leakage of imaging data, challenge organizers put auditing safeguards in place to minimize this risk. A second limitation is that participants receive restricted feedback on their analyses, as they are barred from direct and iterative exploration of the data. Consequently, successful challenges will likely require giving participants direct access to representative training data sets, with additional training and validation data held completely hidden from participants.
The MTD data-sharing paradigm permits the use of private data that would otherwise remain closed and unavailable to the research community. We believe that the most permissive and accessible data-sharing policies are essential for a robust research community, in which researchers gain unfettered access to large, clinically characterized data sets. However, we acknowledge that restrictions around data access are likely to persist for the foreseeable future.
Several arguments are commonly given to justify data hoarding: first, data contain sensitive personal health information, and sharing could endanger patient privacy; second, data are proprietary, and sharing could undermine commercialization; and third, data are preliminary and must remain embargoed until the completion of data quality assessments, peer review and publication. Given these barriers, we as a scientific community must continue to innovate and find ways to work with, rather than against, these restrictions.
As researchers increase their use of cloud computing and container technologies, we can envision a future in which data repositories are designed to support access, not just by scientists who wish to download data, but by models capable of operating on the data. This would be especially valuable for researchers requiring access to protected health information in imaging databases or electronic health record systems. Such a system has the potential to reshape our concept of data sharing, and to refocus our priorities within the biomedical research community.