Machine learning and data mining: strategies for hypothesis generation


Strategies for generating knowledge in medicine have included observation of associations in clinical or research settings and, more recently, development of pathophysiological models based on molecular biology. Although critically important, they limit hypothesis generation to an incremental pace. Machine learning and data mining are alternative approaches to identifying new vistas to pursue, as is already evident in the literature. In concert with these analytic strategies, novel approaches to data collection can enhance the hypothesis pipeline as well. In data farming, data are obtained in an ‘organic’ way, in the sense that they are entered by patients themselves and are available for harvesting. In contrast, in evidence farming (EF), it is the provider who enters medical data about individual patients. EF differs from regular electronic medical record systems because frontline providers can use it to learn from their own past experience. In addition to the possibility of generating large databases with farming approaches, it is likely that we can further harness the power of large data sets collected using either farming or more standard techniques through implementation of data-mining and machine-learning strategies. Exploiting large databases to develop new hypotheses regarding neurobiological and genetic underpinnings of psychiatric illness is useful in itself, but also affords the opportunity to identify novel mechanisms to be targeted in drug discovery and development.


Some of the major discoveries in medicine have been the product of serendipity combined with astute observation. In 1877, Louis Pasteur observed that the growth of anthrax bacilli in culture was inhibited when the cultures were contaminated with moulds, an observation that eventually led to the discovery of penicillin. In 1952, chlorpromazine, originally developed for surgical interventions as a sedative that did not induce unconsciousness, was found not only to have serenic effects, but also to improve behavior and thinking in psychotic patients. These discoveries, cornerstones of major advances in pharmacology, were largely the product of observation on the part of scientists who were open to discovery and not prejudiced by their own hypotheses, such that they were free to observe and record previously unexpected effects.

Despite several centuries of empirical scientific approach to medical research, medicine remains beholden to a handful of strategies for generating new knowledge. One is the aforementioned time-honored observation of associations in clinical practice or research settings. More recently, advances in molecular biology have made possible developments that are based on hypotheses about pathophysiology that lead to the generation of treatment approaches that address the underlying mechanism. For example, the serotonin transporter blocker zimelidine was developed in the 1960s as an antidepressant,1 based on the observation that tricyclics blocked both norepinephrine and serotonin transporters. The antidepressant action was thought to be mediated through norepinephrine reuptake blockade, leading investigators to develop a selective serotonin transporter blocker to assess its effects.

Unfortunately, the former approach to discovery is both unpredictable and plodding, and relies on ‘creativity’ and ‘imagination’ in the context of unbiased observation. The latter is limited by the incremental development of knowledge about the underlying molecular biology implicated in disease, a process that is necessarily constrained by the time required to perform the painstaking experiments needed to develop the basis for a pathophysiological model.

One possible approach to breaking the barriers to rapid growth of the knowledge base in medicine is to make use of unbiased observation, used to great advantage by Pasteur, employing modern approaches. With the aid of recently developed tools and the availability of large databases, we may be able to accelerate discovery and generate new leads that can then be evaluated in subsequent hypothesis-testing studies.

One such tool is machine learning (ML), a field that seeks to answer the question, ‘How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?’2 ML has the advantage of consisting of systems that ‘learn’ from experience, observation, and/or other means. This results in a system that improves its efficiency and/or effectiveness over time. The usefulness of ML is bolstered by the versatility of its techniques (support vector machines or kernel methods, Gaussian processes, graphical models, deep belief networks or Dirichlet processes and so on) and its utility for artificial intelligence (classification, prediction, planning, recognition, regression, clustering, association rules and so on).3 Importantly, the use of ML approaches to data does not require a priori hypotheses. Much like a scientist might observe his/her subject to glean understanding, ML ‘observes’ the data and ‘learns’ from it to build understanding and uncover previously unexpected associations. In this way, this computational approach allows exploration of data to identify patterns and structures not suspected a priori,4, 5 and thus can lead to the generation of new hypotheses. This is critical, especially in areas with huge data sets where hypothesis testing and/or traditional analytic strategies have led to disappointing results, such as in genetic and brain-imaging studies. Further, ML techniques have been developed for dealing with disparate data types, such as text and images, allowing analysis of heterogeneous data sets that contain a mixture of clinical, genetic and imaging data. As well, like more standard statistical approaches, ML permits addressing some of the common questions about data: What is the relationship between the variables? Are there associations between a given outcome variable and the predictor variables of interest? Might they be causal?
Given all these properties, ML offers great opportunities for exploratory analysis of emerging large-scale data repositories, and the availability of large data sets containing psychiatric information both in the public domain and in the private sector provides opportunities to generate new knowledge. Several federal funding agencies require investigators to make certain types of data, mostly genetic and epidemiologic, available to the public after a specified period of time.6 It is likely that we can further harness the power of large data sets by using these types of analytic strategies.
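To make the idea of hypothesis-free exploration concrete, the following is a minimal sketch of one very simple screening strategy: compute the correlation between every pair of variables in a data set and rank the pairs by strength, so that the strongest associations can be carried forward as candidate hypotheses. The variable names and values are hypothetical toy data, and plain Pearson correlation stands in for the far richer ML techniques listed above. Any pair surfaced this way is a lead to be tested in an independent sample, not a finding, and screening many pairs at once calls for correction for multiple comparisons.

```python
from itertools import combinations
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0  # a constant variable carries no association
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def rank_associations(table):
    """Return (var_a, var_b, r) for every pair, strongest |r| first."""
    pairs = [
        (a, b, pearson(table[a], table[b]))
        for a, b in combinations(sorted(table), 2)
    ]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Hypothetical toy data: one column per variable, one entry per patient.
records = {
    "sleep_hours": [4, 5, 6, 7, 8, 9],
    "symptom_score": [18, 16, 13, 11, 8, 6],
    "age": [31, 52, 24, 45, 38, 60],
}

ranked = rank_associations(records)
for a, b, r in ranked:
    print(f"{a} ~ {b}: r = {r:+.2f}")
```

No pair is singled out in advance; the ranking itself suggests where to look first.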

Akin to ML is data mining, and indeed distinctions between these two have always been blurred. Data mining strategies have been in use for decades.5 Development of these new tools was made possible by the rapid evolution in computing power, allowing sophisticated computations not possible until very recently. Indeed, data mining methods have been used routinely to screen the WHO and other databases for evidence of adverse drug reactions for some years.7, 8 However, apart from this type of application, medicine in general, and psychiatry in particular, has been relatively slow to adopt such techniques (see Figure 1).9 One significant barrier to the adoption of these techniques is that physicians are not trained in these methods and therefore are wary of them. Better appreciation of the potential value of these tools is essential for the advancement of psychiatry as well as other medical specialties.
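As an illustration of the kind of screening used on adverse drug reaction databases, the sketch below computes a proportional reporting ratio (PRR), one widely used disproportionality measure. It is offered as an assumption about what such a screen can look like, not as the specific algorithm applied to the WHO database in the studies cited above. The PRR compares how often an event is reported with the drug of interest to how often it is reported with all other drugs. The counts are hypothetical.

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR from a 2x2 table of spontaneous reports.

    a: reports with the drug of interest and the event of interest
    b: reports with the drug of interest and any other event
    c: reports with all other drugs and the event of interest
    d: reports with all other drugs and any other event
    """
    if a + b == 0 or c + d == 0 or c == 0:
        raise ValueError("PRR is undefined for this table")
    rate_drug = a / (a + b)    # share of the drug's reports that mention the event
    rate_others = c / (c + d)  # same share among all other drugs
    return rate_drug / rate_others


# Hypothetical counts: 20 of 1000 reports for the drug mention the event,
# against 100 of 99000 reports for all other drugs.
prr = proportional_reporting_ratio(a=20, b=980, c=100, d=98900)
print(f"PRR = {prr:.1f}")
```

A PRR well above 1 flags a drug-event pair for closer review; it does not by itself establish causation, because spontaneous reports are subject to reporting biases.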

Figure 1

Percent of publications using ML or data mining cited in Institute for Scientific Information (ISI) across five disciplines. Search was performed as follows: Topic=(‘data-mining’ or ‘data mining’ or ‘machine learning’ or ‘machine-learning’ or ‘support vector machine’ or ‘SVM’).

Of interest, exciting new leads have been generated in other fields of medicine using hypothesis-generating and hypothesis-testing strategies. For instance, a ground-breaking study in neuroscience by Ray et al.10 used ML to classify patients as having Alzheimer's dementia or not. Archived plasma from 259 patients with presymptomatic to late-stage Alzheimer's disease and controls was used to examine 120 known signaling proteins, quantified with ELISA. The split sample method was used to divide the Alzheimer's and non-demented control groups into: (a) a training sample to be used for predictor discovery and supervised classification to generate a plausible hypothesis and (b) a test sample or validation sample to test or validate the hypothesis generated in the training sample. The training sample was subjected to a shrunken centroid algorithm called predictive analysis of microarray. Predictive analysis of microarray identified 18 proteins that were predictive of Alzheimer's. Using this algorithm on the training sample, 95% of all Alzheimer's cases were correctly identified (positive agreement). On the other hand, 83% of non-demented control cases were classified in accordance with the clinical diagnosis (negative agreement). The 18 predictors were then used to classify subjects in the test sample as having Alzheimer's or not. Relative to the clinical diagnosis, predictive analysis of microarray classified subjects in the test sample with 90% sensitivity (for the Alzheimer's samples) and 88% specificity (for the non-Alzheimer's samples). The authors then confirmed some of these results using postmortem diagnosis. In all, 8 out of 9 postmortem-confirmed subjects with Alzheimer's disease were classified correctly by the predictive analysis of microarray algorithm, as were 10 out of the 11 non-Alzheimer's subjects. This led the authors to suggest that this 18-protein array may constitute a biosignature for Alzheimer's disease.
This is an example of how a peripheral measure identified with machine-learning strategies can comport not only with a clinical diagnosis but also with the gold standard: a postmortem neuropathological diagnosis.
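The split-sample logic described above can be sketched in a few lines. The example below is a deliberately simplified nearest shrunken centroid classifier, not the predictive analysis of microarray procedure used by Ray et al.: it omits the standardization by within-class variability and the class priors of the full method, and the two-feature data set, the threshold `delta` and all function names are hypothetical. It shows the three steps in order: split each group into training and test halves, learn shrunken class centroids on the training half only, and then report sensitivity and specificity on the untouched test half.

```python
import random


def soft_threshold(value, delta):
    """Shrink value toward zero by delta; anything within delta becomes 0."""
    if value > delta:
        return value - delta
    if value < -delta:
        return value + delta
    return 0.0


def fit_shrunken_centroids(X, y, delta):
    """Return {label: centroid}, each shrunk toward the overall centroid."""
    n_features = len(X[0])
    overall = [sum(row[j] for row in X) / len(X) for j in range(n_features)]
    centroids = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        centroids[label] = [
            overall[j] + soft_threshold(centroid[j] - overall[j], delta)
            for j in range(n_features)
        ]
    return centroids


def predict(centroids, row):
    """Assign the label of the nearest shrunken centroid (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))


def sensitivity_specificity(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)


def split_half(rows, rng):
    """Randomly split one group into training and test halves."""
    rows = rows[:]
    rng.shuffle(rows)
    half = len(rows) // 2
    return rows[:half], rows[half:]


# Hypothetical toy data: feature 0 separates the groups, feature 1 is noise.
cases = [[3.0 + 0.1 * (i % 5 - 2), 0.2 * (i % 3 - 1)] for i in range(20)]
controls = [[0.0 + 0.1 * (i % 5 - 2), 0.2 * ((i + 1) % 3 - 1)] for i in range(20)]

rng = random.Random(0)
case_train, case_test = split_half(cases, rng)
ctrl_train, ctrl_test = split_half(controls, rng)

X_train = case_train + ctrl_train
y_train = ["case"] * len(case_train) + ["control"] * len(ctrl_train)
X_test = case_test + ctrl_test
y_test = ["case"] * len(case_test) + ["control"] * len(ctrl_test)

centroids = fit_shrunken_centroids(X_train, y_train, delta=0.3)
y_pred = [predict(centroids, row) for row in X_test]
sens, spec = sensitivity_specificity(y_test, y_pred, positive="case")
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Shrinkage is what performs predictor selection here: the noise feature's class centroids collapse onto the overall centroid, so it no longer influences classification. The toy groups are cleanly separable by construction, so the resulting figures say nothing about real-world performance. The postmortem figures quoted above work out the same way: 8 of 9 is about 89% and 10 of 11 is about 91%.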

A handful of examples of implementation of machine-learning tools to psychiatric data sets highlight their utility. ML has been put to use in large samples for the study of the natural evolution of psychiatric illness11 and in the identification of single-nucleotide polymorphisms associated with mental conditions.12 Smaller samples of psychotic patients (n=36) and matched controls (n=36) have had MRI data subjected to Sparse Multinomial Logistic Regression Classifier, a ML approach that develops a classification function based on a weighted combination of basis functions, tuning the weights during the learning phase to optimize classification of training data.13 Using this technique, cortical gray matter density maps discriminated between controls and patients with 86% accuracy. In another small sample of patients with schizophrenia and normal controls, Feature Selection generated an algorithm that showed 92% accuracy in identifying affected individuals based on functional connectivity as assessed by resting state functional magnetic resonance.14 Although we still do not know whether results from these initial forays will be confirmed in the long run, it is evident that unanticipated results can be reliably generated.
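The idea of a classifier built from a weighted combination of basis functions, with weights tuned on training data and many of them driven to zero, can be illustrated with a small sketch. The code below is a binary logistic regression with an L1 penalty fitted by proximal gradient descent. It is a simplified stand-in, not the Sparse Multinomial Logistic Regression Classifier used in the cited study: the basis functions are simply the raw features, and the data, penalty `lam`, step size `lr` and iteration count are all hypothetical choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, delta):
    """Proximal step for the L1 penalty: shrink toward zero by delta."""
    if value > delta:
        return value - delta
    if value < -delta:
        return value + delta
    return 0.0


def fit_sparse_logistic(X, y, lam=0.1, lr=0.5, n_iter=500):
    """Fit weights w and intercept b; the L1 penalty lam applies to w only."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        # Gradient step on the log-loss, then shrinkage for the L1 term.
        w = [soft_threshold(w[j] - lr * grad_w[j] / n, lr * lam) for j in range(d)]
        b -= lr * grad_b / n
    return w, b


def classify(w, b, row):
    z = sum(wj * xj for wj, xj in zip(w, row)) + b
    return 1 if sigmoid(z) >= 0.5 else 0


# Hypothetical toy data: feature 0 tracks the label, feature 1 does not.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [1, 1, 0, 0]

w, b = fit_sparse_logistic(X, y)
predictions = [classify(w, b, row) for row in X]
print("weights:", [round(v, 3) for v in w])
print("training predictions:", predictions)
```

The uninformative feature ends with a weight of zero, which is the sense in which such models are sparse. Accuracy measured on the training data, as here, is optimistic; the accuracies quoted above are only meaningful to the extent that they were estimated on data not used for fitting.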

Although most recent papers using ML-based approaches have applied these analytical techniques to extant data sets from large epidemiological samples, brain-imaging data sets or genetic repositories, development of new methods to enhance our ability to collect data is essential, as well. A recent reconceptualization of data collection is illustrative. Instead of viewing data as something that scientists actively search for and collect, like fossil fuel deposited underground waiting to be mined, data can also be ‘organic,’ grown and harvested if suitable environmental conditions are provided. This notion is illustrated in the concept of ‘data farming’ or ‘evidence farming.’

One example of the data-farming paradigm is an online enterprise founded in 2004. This website provides a platform for patients to self-identify and enter their own data into a database. Data include illness and laboratory variables, rating scales, medication and other treatments, outcomes and so on. The patient may use tools provided by the website to track his or her own experience over time. One stated goal, which also serves as an incentive for patients to provide data, is to help patients learn about each other's course of illness, treatment and outcome such that individuals may compare their own experience with that farmed from others with similar conditions. The website also makes data available to partners of the organization for analysis and research, which is stated explicitly on the site. Of note, this website supports a variety of psychiatric conditions, including anxiety, bipolar disorder, depression, obsessive compulsive disorder and post-traumatic stress disorder, to name a few, with a total of 117 184 registered patients.

Although to our knowledge, this data is not yet being analyzed by psychiatric investigators, it may provide a powerful repository of information. In particular, because the data set is unbiased by scientific trends or funding agencies, it could potentially yield information opening new vistas on psychiatric disease. In essence, successful data farms such as this one can facilitate data collection available for analysis, using both traditional statistical methods and machine-learning methods, to expand our knowledge base. There are, of course, the obvious issues around selection bias, because only those who are willing to make their personal health information public will participate. Also, only those who are computer literate and/or technology savvy or who have the economic resources to access technology are likely to submit their data to the ‘farm,’ both for their own information and for sharing with others. In addition, there are no methods for ascertaining the validity of the stated diagnoses, nor are there assurances that laboratory data are comparable across patients. Some of these issues are certainly not unique to this type of data set, but nonetheless, the breadth of the sample may render it of critical importance.

A related variant of the data-farming paradigm is the concept of evidence farming discussed in Hay et al.15 In evidence farming, it is the provider who enters medical data about individual patients. Evidence farming differs from regular electronic medical record systems because it permits frontline providers to learn from their own past experience, examining it with user-friendly analytic tools and employing the results in their current clinical decision making.15 Unutzer et al.16 have described a research tracking tool that is more comprehensive than a general electronic medical record and that permits clinicians to compare outcomes of their own patients with outcomes of similar patients being seen by other clinicians, providing opportunities for use of the data by individual clinicians who are entering data from their own practice.

An important potential advantage for the farming (vs mining) paradigm is the incentive to the end-users who submit data to the ‘farm,’ such as patients participating in the website described above or clinicians participating in Unutzer's data registry. In contrast to the experience of subjects who enroll in traditional research studies, data-farm participants benefit directly from the ‘harvest’ of the data from the ‘farm’, including the use of their own data to monitor their own progress, and the opportunity to learn from data provided by other participants in the ‘farm.’ Those incentives provide opportunities to broaden the scope of the investigation, and include participants who might not participate in traditional research studies due to the lack of direct incentives.

At the same time, a successful ‘farm’ might also provide opportunities for investigators to analyze the data harvested from the ‘farm’ for the benefits of future patients who did not participate in the ‘farm.’ These research opportunities could be viewed as the by-product for the ‘farm’, in addition to the direct benefit to the participants themselves.

Given the paucity of paradigm shifting breakthroughs in psychiatric research in recent decades, it behooves the field to explore all promising strategies to generate new leads. Being able to exploit large databases for new hypotheses regarding neurobiological and genetic underpinnings of psychiatric illness is useful in itself, but also affords the opportunity to identify novel mechanisms to be targeted in drug discovery and development. With several large data sets available to qualified investigators, there is a wealth of data that could be subjected to these methods. Moreover, data-farming approaches to amassing large data sets at low cost from a diverse pool of end-users can also enhance our ability to develop new leads in understanding psychiatric disorders. Psychiatry needs novel ideas to pursue. ML and computational models based on it can provide such a path and data-derived ideas emerging from these repositories have special appeal for the empirically minded investigator. Our patients are waiting for improved therapies and quality of life. We are duty bound to chase them vigorously. Perhaps our current notions of disease, diagnosis, causality and cure can be advanced by generating cutting-edge, previously unsuspected hypotheses using these tools, to be tested/validated in subsequent confirmatory studies. Are we ready to enhance the pipeline of innovative hypotheses?


  1. Carlsson A. A paradigm shift in brain research. Science 2001; 294: 1021–1024.
  2. Mitchell TM. The Discipline of Machine Learning. School of Computer Science: Pittsburgh, PA, 2006.
  3. Nilsson NJ. Introduction to Machine Learning. An early draft of a proposed textbook. Robotics Laboratory, Department of Computer Science, Stanford University: Stanford, 1996.
  4. Hand DJ. Mining medical data. Stat Methods Med Res 2000; 9: 305–307.
  5. Smyth P. Data mining: data analysis on a grand scale? Stat Methods Med Res 2000; 9: 309–327.
  6. Burgun A, Bodenreider O. Accessing and integrating data and knowledge for biomedical research. Yearb Med Inform 2008; 47(Suppl 1): 91–101.
  7. Hochberg AM, Hauben M, Pearson RK, O’Hara DJ, Reisinger SJ, Goldsmith DI et al. An evaluation of three signal-detection algorithms using a highly inclusive reference event database. Drug Saf 2009; 32: 509–525.
  8. Sanz EJ, De-las-Cuevas C, Kiuru A, Bate A, Edwards R. Selective serotonin reuptake inhibitors in pregnant women and neonatal withdrawal syndrome: a database analysis. Lancet 2005; 365: 482–487.
  9. Baca-Garcia E, Perez-Rodriguez MM, Basurte-Villamor I, Saiz-Ruiz J, Leiva-Murillo JM, de Prado-Cumplido M et al. Using data mining to explore complex clinical decisions: A study of hospitalization after a suicide attempt. J Clin Psychiatry 2006; 67: 1124–1132.
  10. Ray S, Britschgi M, Herbert C, Takeda-Uchimura Y, Boxer A, Blennow K et al. Classification and prediction of clinical Alzheimer's diagnosis based on plasma signaling proteins. Nat Med 2007; 13: 1359–1362.
  11. Baca-Garcia E, Perez-Rodriguez MM, Basurte-Villamor I, Lopez-Castroman J, Fernandez del Moral AL, Jimenez-Arriero MA et al. Diagnostic stability and evolution of bipolar disorder in clinical practice: a prospective cohort study. Acta Psychiatr Scand 2007; 115: 473–480.
  12. Baca-Garcia E, Vaquero-Lorenzo C, Perez-Rodriguez MM, Gratacos M, Bayes M, Santiago-Mozos R et al. Nucleotide variation in central nervous system genes among male suicide attempters. Am J Med Genet B Neuropsychiatr Genet 2010; 153B: 208–213.
  13. Sun D, van Erp TG, Thompson PM, Bearden CE, Daley M, Kushan L et al. Elucidating a magnetic resonance imaging-based neuroanatomic biomarker for psychosis: classification analysis using probabilistic brain atlas and machine learning algorithms. Biol Psychiatry 2009; 66: 1055–1060.
  14. Shen H, Wang L, Liu Y, Hu D. Discriminative analysis of resting-state functional connectivity patterns of schizophrenia using low dimensional embedding of fMRI. Neuroimage 2010; 49: 3110–3121.
  15. Hay MC, Weisner TS, Subramanian S, Duan N, Niedzinski EJ, Kravitz RL. Harnessing experience: exploring the gap between evidence-based medicine and clinical practice. J Eval Clin Pract 2008; 14: 707–713.
  16. Unutzer J, Choi Y, Cook IA, Oishi S. A web-based data management system to improve care for depression in a multicenter clinical trial. Psychiatr Serv 2002; 53: 671–673.


Dr Blasco-Fontecilla acknowledges the Spanish Ministry of Health (Rio Hortega CM08/00170), Alicia Koplowitz Foundation, and Conchita Rabago Foundation for funding his post-doctoral rotation at CHRU, Montpellier, France. SAF2010-21849.

Author information



Corresponding author

Correspondence to M A Oquendo.

Ethics declarations

Competing interests

Dr Oquendo has received unrestricted educational grants and/or lecture fees from Astra-Zeneca, Bristol Myers Squibb, Eli Lilly, Janssen, Otsuka, Pfizer, Sanofi-Aventis and Shire. Her family owns stock in Bristol Myers Squibb. The remaining authors declare no conflict of interest.

About this article

Cite this article

Oquendo, M., Baca-Garcia, E., Artés-Rodríguez, A. et al. Machine learning and data mining: strategies for hypothesis generation. Mol Psychiatry 17, 956–959 (2012).


  • data farming
  • discovery
  • empiricism
  • inductive reasoning
