  • Perspective
Machine learning and data mining: strategies for hypothesis generation


Strategies for generating knowledge in medicine have included observation of associations in clinical or research settings and, more recently, development of pathophysiological models based on molecular biology. Although critically important, these strategies limit hypothesis generation to an incremental pace. Machine learning and data mining are alternative approaches to identifying new vistas to pursue, as is already evident in the literature. In concert with these analytic strategies, novel approaches to data collection can also enhance the hypothesis pipeline. In data farming, data are obtained in an ‘organic’ way, in the sense that they are entered by patients themselves and are available for harvesting. In contrast, in evidence farming (EF), it is the provider who enters medical data about individual patients. EF differs from regular electronic medical record systems because frontline providers can use it to learn from their own past experience. Beyond the possibility of generating large databases with farming approaches, data-mining and machine-learning strategies are likely to help harness the power of large data sets, whether collected by farming or by more standard techniques. Exploiting large databases to develop new hypotheses regarding the neurobiological and genetic underpinnings of psychiatric illness is useful in itself, and it also affords the opportunity to identify novel mechanisms to be targeted in drug discovery and development.


Dr Blasco-Fontecilla acknowledges the Spanish Ministry of Health (Rio Hortega CM08/00170), Alicia Koplowitz Foundation, and Conchita Rabago Foundation for funding his post-doctoral rotation at CHRU, Montpellier, France. SAF2010-21849.

Correspondence to M A Oquendo.

Ethics declarations

Competing interests

Dr Oquendo has received unrestricted educational grants and/or lecture fees from Astra-Zeneca, Bristol Myers Squibb, Eli Lilly, Janssen, Otsuka, Pfizer, Sanofi-Aventis and Shire. Her family owns stock in Bristol Myers Squibb. The remaining authors declare no conflict of interest.

Oquendo, M., Baca-Garcia, E., Artés-Rodríguez, A. et al. Machine learning and data mining: strategies for hypothesis generation. Mol Psychiatry 17, 956–959 (2012).
