Key Points
- Crowdsourcing is emerging as a novel framework to tackle scientific problems.
- A variant of crowdsourcing, scientific competitions known as 'Challenges', enables rigorous validation of methods, promotes reproducibility and fosters community building.
- Challenges also accelerate scientific discovery by allowing large numbers of groups to work jointly on a problem.
- Integrating the predictions from the different methods submitted by participants to solve a Challenge provides a robust solution that is often better than the best individual solution, a phenomenon known as the 'wisdom of crowds' (a minimal aggregation sketch follows this list).
- The patterns of similar findings that emerge from several independent Challenges can provide useful insight into various key questions in genetics and genomics.
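The aggregation behind the wisdom of crowds can be illustrated with a short, self-contained sketch. The Python fragment below is an assumption-laden illustration rather than the method of any specific Challenge: each method's scores are converted to ranks so that methods reporting on different scales contribute equally, and the ranks are then averaged into a consensus ordering, one simple aggregation scheme among several used in practice.

```python
# Illustrative wisdom-of-crowds aggregation (hypothetical data and names):
# convert each method's scores to ranks, then average the ranks so that
# methods reported on different scales contribute equally.
import numpy as np
from scipy.stats import rankdata

def aggregate_by_rank(score_matrix: np.ndarray) -> np.ndarray:
    """Rows = methods, columns = candidate predictions.

    Returns the mean rank of each candidate across all methods;
    larger values indicate higher consensus confidence.
    """
    ranks = np.vstack([rankdata(row) for row in score_matrix])
    return ranks.mean(axis=0)

# Three hypothetical methods scoring the same five candidates:
scores = np.array([
    [0.9, 0.1, 0.4, 0.8, 0.3],    # method A (probabilities)
    [12.0, 3.0, 7.0, 9.0, 1.0],   # method B (arbitrary units)
    [0.7, 0.2, 0.6, 0.9, 0.1],    # method C (probabilities)
])
consensus = aggregate_by_rank(scores)
print(np.argsort(-consensus))  # candidates ordered by consensus confidence
```

Rank averaging is only one possible choice; Challenges have also combined submissions by voting, score averaging and more elaborate weighting schemes.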
Abstract
The generation of large-scale biomedical data is creating unprecedented opportunities for basic and translational science. Typically, the data producers perform the initial analyses, but the most informative analytical methods may well reside with other groups. Crowdsourcing the analysis of complex and massive data has emerged as a framework for finding robust methodologies. When the crowdsourcing is done in the form of collaborative scientific competitions, known as Challenges, the validation of the methods is inherently addressed. Challenges also encourage open innovation, create collaborative communities to solve diverse and important biomedical problems, and foster the creation and dissemination of well-curated data repositories.
Acknowledgements
The authors thank N. Aghaeepour, M. Bansal, P. Bertone, E. Bilal, P. Boutros, S. E. Brenner, J. Dopazo, D. Earl, F. Eduati, L. Heiser, S. Hill, P.-R. Loh, D. Marbach, J. Moult, M. Peters, S. Sieberts, J. Stuart, M. Weirauch and N. Zach for information on the crowdsourcing efforts they organized. The authors also thank the DREAM Challenges community, who taught them everything about Challenges that they have tried to share in this Review.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
- Supplementary information S1 (box): Scoring metrics (PDF, 241 KB)
- Supplementary information S2 (table): Examples of collaborative competitions (PDF, 252 KB)
Glossary
- Cloud computing: An internet-based infrastructure used to perform computational tasks remotely.
- Crowdsourcing: A methodology that uses the voluntary help of large communities to solve problems posed by an organization.
- Challenges: (Also known as collaborative competitions.) Calls to a wide community to submit proposed solutions to a specific problem. The solutions are evaluated by a panel of experts using diverse criteria, and the best performer or winner is selected.
- Gamification: The abstraction of a problem in such a way that working towards its solution feels like playing a computer game.
- Benchmarking Challenge: A Challenge used to determine the relative performance of methodologies for a particular problem for which a known solution is available to the organizers but not to the participants. The organizers compare the submitted solutions against the solution available only to them (that is, the gold standard). Good solutions are expected to generalize to instances of the problem for which the solution is unknown; a minimal sketch after this glossary shows how the pieces of such a Challenge fit together.
- Gold standard: By analogy with the abandoned monetary system that fixed the true value of a currency, the gold standard in a Challenge is the true solution to the posed problem in one particular instance of that problem.
- Leaderboards: Tables that provide real-time feedback on the performance and scores of the solutions proposed in a Challenge, allowing participants to monitor their ranking.
- Training set: In general, the portion of the data used to train (fit) a computational model. In a Challenge, this is the data given to the participants to build their models; it normally encompasses most of the data.
- Cross-validation set: The portion of the training data that a participant holds out to tune model parameters, by checking how well models fitted on the remaining training data predict this held-out subset.
- Test set: The subset of the data that is kept separate from the training and cross-validation sets (that is, data that participants never have access to). The test set is used for the final assessment of the predictive power of the models.
- Wisdom of crowds: The collective wisdom that emerges when the solutions to a problem proposed by a large pool of people are aggregated. The aggregate solution is often better than the best individual solution.
- Hackathons: Events in which specialists in a topic, normally related to computation, get together to work on a specific problem.
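To make the glossary terms concrete, the following Python sketch walks through a hypothetical Benchmarking Challenge. Everything in it is an assumption for illustration: the synthetic data, the team names and the choice of AUROC as the scoring metric (real Challenges use problem-specific metrics; see Supplementary information S1).

```python
# Hypothetical Benchmarking Challenge workflow: a training set released to
# participants, a hidden test set whose gold standard only the organizers
# hold, and a leaderboard ranking submissions by a scoring metric.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Organizers split the data: most is released for training; the rest is
# withheld as the test set together with its gold-standard labels.
n, n_train = 1000, 800
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)  # gold standard
X_train, y_train = X[:n_train], y[:n_train]  # released to participants
X_test = X[n_train:]                         # released without labels
y_test = y[n_train:]                         # kept by the organizers

# Participants submit predicted probabilities for the hidden test set;
# here, two hypothetical submissions of very different quality.
submissions = {
    "team_alpha": 1 / (1 + np.exp(-X_test[:, 0])),  # exploits the signal
    "team_beta": rng.uniform(size=len(y_test)),     # random guessing
}

# Organizers score each submission against the gold standard and publish
# a leaderboard sorted by score.
leaderboard = sorted(
    ((team, roc_auc_score(y_test, pred)) for team, pred in submissions.items()),
    key=lambda item: item[1],
    reverse=True,
)
for rank, (team, auc) in enumerate(leaderboard, start=1):
    print(f"{rank}. {team}: AUROC = {auc:.3f}")
```

In a live Challenge the leaderboard would be recomputed on each submission, and participants would typically carve a cross-validation set out of the released training data to tune their models before submitting.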
About this article
Cite this article
Saez-Rodriguez, J., Costello, J., Friend, S. et al. Crowdsourcing biomedical research: leveraging communities as innovation engines. Nat. Rev. Genet. 17, 470–486 (2016). https://doi.org/10.1038/nrg.2016.69