Introduction

Bipolar disorder (BD), characterized by episodes of hypomania/mania and depression [1], is a leading cause of disability [2]. Rates of suicide among patients with BD are 20- to 30-fold higher than in the general population [3], and BD is associated with substantial premature mortality from multiple causes [4]. The diagnosis of BD can be challenging and may require a prolonged diagnostic odyssey, averaging 6–10 years [5,6,7]. Affected patients frequently present initially with a major depressive episode and are misdiagnosed with unipolar depression. Misdiagnosis may lead to inappropriate prescribing of antidepressants without mood stabilization and an increased risk of switching into a manic state [8]. A longer duration of untreated BD is associated with more severe and recurrent mood episodes and more frequent suicide attempts [9, 10].

Identifying those at risk for BD might enable targeted assessment, early intervention, and more appropriate management. A recent systematic review of clinical trials to prevent bipolar disorder showed reliance on family history for risk identification [11]. However, given BD’s multifactorial nature, most affected individuals would not have a positive family history [12]. In addition, unlike schizophrenia, BD has no established prodrome. Newer methods of risk identification that do not rely on existing clinical signs or symptoms might therefore be of substantial value.

Longitudinal electronic health records (EHRs) coupled with predictive analytics might enable novel risk identification opportunities in BD. We have previously demonstrated that such data can produce valid diagnostic phenotyping of bipolar cases [13, 14]. Here, we extend this work to the domain of prognostication by leveraging the PsycheMERGE Network, a national research network of EHR-linked biobanks. Using longitudinal EHRs from three major healthcare systems, we trained and validated quantitative bipolar disorder risk prediction models based on high-dimensional structured EHR data and evaluated their performance individually and when ensembled.

Methods

Study settings

Participating study sites included three major academic medical centers in the United States: the Northeast (Mass General Brigham [MGB]), the Mid-South (Vanderbilt University Medical Center [VUMC]), and the Mid-Atlantic (Geisinger [GHS]). Each site participates in the PsycheMERGE Network and has an extensive EHR repository linked to a biobank. On average, each site serves 1.4 M patients per year and maintains an EHR repository covering ~2.7 M patients.

The methods were performed in accordance with relevant guidelines and regulations and approved by Institutional Review Boards at each participating study site: Vanderbilt University Medical Center, Geisinger Health System, Mass General Brigham.

Outcome definition

Cases of BD were defined by the published “Bipolar Coded-Broad” definition per Castro et al. [14]. This rule-based algorithm demonstrated a high positive predictive value (0.80) against a gold standard of semi-structured diagnostic interviews (SCID-IV) conducted by experienced doctoral-level clinicians blind to the algorithmic results. To meet “Bipolar Coded-Broad,” cases must have at least two BD diagnostic codes from versions nine or ten of the International Classification of Diseases (ICD) schema, with a minimum of four weeks between codes, and at least two documented medications used to treat BD (e.g., lithium or valproic acid) within one year of the index BD diagnosis. To rule out patients with related disorders, we required that the number of diagnostic codes for major depressive disorder (MDD), schizophrenia (SCZ), schizoaffective disorder, or organic affective syndrome (OAS) total fewer than half the number of BD codes. Only adult patients aged 18 years and older at the time of their index BD diagnosis were included in this analysis.
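For illustration, a minimal sketch of this rule in Python follows, assuming hypothetical precomputed per-patient fields (the field names are ours, not from the published algorithm [14]):

```python
import pandas as pd

def meets_bipolar_coded_broad(row: pd.Series) -> bool:
    """Sketch of the 'Bipolar Coded-Broad' rule; field names are hypothetical.

    Assumes precomputed per-patient summaries:
      n_bd_codes        - count of BD ICD-9/ICD-10 diagnostic codes
      bd_code_span_days - days between the two qualifying BD codes
      n_bd_meds_1yr     - BD medications (e.g., lithium, valproic acid)
                          documented within 1 year of the index diagnosis
      n_related_codes   - total MDD + SCZ + schizoaffective + OAS codes
      age_at_index      - age at the index BD diagnosis
    """
    return bool(
        row["n_bd_codes"] >= 2
        and row["bd_code_span_days"] >= 28            # >= 4 weeks apart
        and row["n_bd_meds_1yr"] >= 2
        and row["n_related_codes"] < row["n_bd_codes"] / 2
        and row["age_at_index"] >= 18
    )
```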

All adult patients with at least three documented healthcare encounters spanning at least six months were included, regardless of case status.

Predictive modeling approach

We compared three predictive modeling approaches that together span a range of model architectures and strategies for handling feature relationships (see “Algorithmic details”, below). Because of prior algorithmic experience at each study site, each team internally validated one of the three model types: L2-penalized regression (abbreviated here as “Ridge”) at MGB, random forests (RF) at VUMC, and gradient boosting machines (GBM) at GHS. In external validation, each site tested the remaining two models; e.g., MGB externally validated the RF from VUMC and the GBM from GHS.

Internal validation was conducted at each site with a randomly selected hold-out test set, and the best internally performing algorithms were shared for external validation. This reciprocal validation strategy tested the generalizability of each algorithm without requiring each site to train three separate algorithms in parallel.

Feature engineering

Variables for prediction included demographics: age (continuous), coded sex (categorical: Male, Female, and Unknown), and coded race (categorical: White, Black, Asian, Other, and Unknown); inpatient-administered and outpatient-prescribed medications (log-transformed counts); and diagnostic codes (log-transformed counts). To construct the log-transformed counts, we grouped medications by RxNorm ingredient and diagnostic codes by Clinical Classification Software (CCS) Level 2 category, then counted occurrences of each group per person in the time before the prediction date. Each count c was then transformed to its natural log after adding one, ln(c + 1), avoiding the undefined log of zero. This common transform has been shown to be helpful for sparse, skewed predictors (i.e., many zero values with a few very high counts), which are common in EHR-based datasets.

Dimensionality reduction thus consisted of grouping medications by RxNorm ingredient [15] and mapping diagnostic codes from ICD-9-CM and ICD-10-CM to CCS Level 2 codes [16]. The final feature set numbered up to ~2500 variables. Missing count variables were imputed as zero, and missing race and sex were categorized as Unknown (see Table 1).
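As a sketch of this feature construction (our own illustration in Python/pandas with hypothetical event data; each site’s actual pipeline used its own tooling):

```python
import numpy as np
import pandas as pd

# Hypothetical long-format events: one row per coded event before censoring,
# with 'group' already mapped to an RxNorm ingredient or CCS Level 2 category.
events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "group": ["lithium", "lithium", "CCS_anxiety", "metformin", "CCS_anxiety"],
})

# Count occurrences per patient per group, then pivot to a wide
# patient-by-feature matrix; absent counts are imputed as zero.
counts = (events.groupby(["patient_id", "group"]).size()
                .unstack(fill_value=0))

# Natural log of count + 1 (log1p), tempering sparse, right-skewed counts.
features = np.log1p(counts)
```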

Table 1 Baseline study participant characteristics.

Records meeting “Bipolar Coded-Broad” were right-censored at the day before the index diagnosis, both to represent a useful prediction target and to prevent leakage of bipolar-related data from driving model predictions. For those not meeting BD criteria, right-censoring occurred at the date of the last visit or at the first occurrence of an ICD code for BD in the EHR.
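One plausible reading of this censoring rule, expressed as a Python sketch with assumed attribute names (not the sites’ actual code):

```python
from datetime import timedelta

def censor_date(patient):
    """Return the date before which all predictor data must fall."""
    if patient.meets_bipolar_coded_broad:
        # Cases: the day before the index BD diagnosis, preventing
        # diagnosis-related data from leaking into the predictors.
        return patient.index_bd_diagnosis_date - timedelta(days=1)
    # Non-cases: the first BD ICD code if one exists, else the last visit.
    if patient.first_bd_icd_date is not None:
        return patient.first_bd_icd_date
    return patient.last_visit_date
```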

Algorithmic details

Ridge regression

Ridge regression [17] is a regularized regression model that shrinks regression coefficients to reduce multicollinearity and model variance, thereby improving prediction performance. Despite the shrinkage, coefficients are never shrunk exactly to zero, so all features remain in the final model. We used the widely adopted glmnet package in R [18,19,20] for model training, using the main (first-order) effects of all available features. The model was developed with 10-fold cross-validation on 60% of the data to find the best lambda value (i.e., the strength of regularization) and estimate model parameters, while the remaining 40% served as a hold-out test set. The 60–40 split was chosen because of the larger sample size at MGB: the 60% training/validation partition approached the input-size limits of the glmnet package, and preliminary analyses showed minimal performance differences when varying the train/test ratio.
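The site’s model used glmnet in R; the sketch below shows an analogous workflow with scikit-learn in Python (our assumption for illustration, with synthetic data standing in for the EHR feature matrix):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the patient-by-feature matrix (~1% prevalence).
X, y = make_classification(n_samples=20000, n_features=100,
                           weights=[0.99], random_state=0)

# 60-40 split: 60% for cross-validated training, 40% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.60, stratify=y, random_state=0)

# L2-penalized (ridge) logistic regression; 10-fold CV selects the
# regularization strength (glmnet's lambda corresponds to 1/C here).
ridge = LogisticRegressionCV(Cs=20, cv=10, penalty="l2",
                             scoring="roc_auc", max_iter=5000)
ridge.fit(X_train, y_train)

risk = ridge.predict_proba(X_test)[:, 1]  # predicted probability of BD
```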

RF

VUMC implemented the decision-tree-based RF algorithm. A commonly employed nonparametric method, RF accommodates nonlinear relationships among predictors, samples with replacement for both feature inclusion and model training, and tolerates the collinearity likely to be present in EHR data. After preliminary analyses varying these parameters, we used 200 trees, a minimum node size of five, and impurity (“purity”) for variable importance. Models were trained with an 80–20% train–test split and implemented using the ranger package in R [21].
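The site’s RF used ranger in R; an analogous sketch with scikit-learn in Python (again an illustration on synthetic data, not the site’s code) mirrors the reported settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=100,
                           weights=[0.99], random_state=0)

# 80-20 train-test split, as reported.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, stratify=y, random_state=0)

# 200 trees, minimum node size of five, impurity-based importance
# (scikit-learn's default), approximating the ranger configuration.
rf = RandomForestClassifier(n_estimators=200, min_samples_leaf=5,
                            n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)

risk = rf.predict_proba(X_test)[:, 1]   # predicted probability of BD
importances = rf.feature_importances_   # impurity ("purity") importance
```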

GBM

GHS implemented GBM for internal validation. Boosting is an ensemble technique that sequentially trains multiple “weak learner” algorithms into a strong one, iteratively improving prediction accuracy. GBM is a high-performance, decision-tree-based gradient boosting framework capable of handling imbalanced datasets, as boosting can strengthen the impact of the positive class (here, cases of BD). Tuning parameters included the ratio of features used, the ratio of training instances, the maximum depth of trees, and the learning rate. In preliminary analyses, dimensionality impacted model performance, so we selected the most prevalent medications across the GHS EHR, retaining those accounting for 95% of cumulative medication counts by Pareto analysis. Models were trained with a 75–25% train–test split and implemented in Python (package ‘lightgbm’) [22].
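A sketch of the Pareto-style medication selection and the lightgbm model (the medication counts and tuning values are placeholders; the text names the tuning dimensions but does not report the chosen values):

```python
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Pareto selection: keep the most prevalent medications accounting for
# 95% of cumulative medication counts (hypothetical counts shown).
med_counts = pd.Series({"lisinopril": 9000, "metformin": 7000,
                        "sertraline": 3000, "lithium": 900})
cum_share = med_counts.sort_values(ascending=False).cumsum() / med_counts.sum()
kept_meds = cum_share[cum_share <= 0.95].index.tolist()

# Synthetic stand-in for the resulting feature matrix; 75-25 split.
X, y = make_classification(n_samples=20000, n_features=100,
                           weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0)

# Tuning dimensions named in the text: feature ratio, instance ratio,
# maximum tree depth, and learning rate (placeholder values).
gbm = lgb.LGBMClassifier(colsample_bytree=0.8, subsample=0.8,
                         subsample_freq=1, max_depth=6,
                         learning_rate=0.05, n_estimators=500)
gbm.fit(X_train, y_train)
risk = gbm.predict_proba(X_test)[:, 1]
```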

Ensembling

Ensembling is designed to improve prediction accuracy by aggregating the strengths of diverse machine learning models into a single predictive model. Here, we ensembled the three algorithms via stacking at each site and evaluated performance. The stacked ensemble combined Ridge, GBM, and RF using a logistic regression, trained with ten-fold cross-validation on each site’s training set, that took the three individual model predictions as multivariate predictors; training on each site’s training set avoided leakage of test data for the internally validated model into the ensemble.
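A sketch of this stacking step in Python (our illustration, on synthetic data):

```python
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X_train, y_train = make_classification(n_samples=20000, n_features=100,
                                       weights=[0.99], random_state=0)

base_models = [
    LogisticRegression(penalty="l2", max_iter=5000),               # Ridge
    RandomForestClassifier(n_estimators=200, min_samples_leaf=5),  # RF
    lgb.LGBMClassifier(),                                          # GBM
]

# Out-of-fold predictions on the training set via ten-fold CV, so the
# meta-learner never sees predictions made on rows a base model trained on.
meta_X = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=10,
                      method="predict_proba")[:, 1]
    for m in base_models
])

# Logistic regression stacks the three predictions into a single score,
# recalibrating the ensemble on each site's training data.
stacker = LogisticRegression().fit(meta_X, y_train)
```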

Model evaluation

Evaluation metrics included the Area Under the Receiver Operating Characteristic curve (AUROC), precision-recall curves, and the Area Under the Precision-Recall curve (AUPR). Metrics at specific risk thresholds included sensitivity/recall, specificity, positive predictive value (PPV), and number needed to screen (NNS [23], the reciprocal of PPV for a predictive model). Calibration was measured with the Brier score and calibration slope/intercept [24].
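A sketch of these metrics at a risk-percentile threshold (our helper, built from standard scikit-learn metrics; calibration slope/intercept, fit by regressing the outcome on the logit of predicted risk, is omitted for brevity):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             precision_score, recall_score, roc_auc_score)

def threshold_metrics(y_true, risk, percentile=90):
    """Discrimination, threshold, and calibration metrics (a sketch)."""
    y_true, risk = np.asarray(y_true), np.asarray(risk)
    flagged = risk >= np.percentile(risk, percentile)
    ppv = precision_score(y_true, flagged)
    return {
        "auroc": roc_auc_score(y_true, risk),
        "aupr": average_precision_score(y_true, risk),
        "sensitivity": recall_score(y_true, flagged),
        "specificity": recall_score(1 - y_true, ~flagged),
        "ppv": ppv,
        "nns": 1.0 / ppv,                    # number needed to screen
        "brier": brier_score_loss(y_true, risk),
    }
```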

Feature importance by site

Because the algorithms (GBM, RF, and Ridge) have different inherent metrics of importance, we applied L1-penalized regression (LASSO) to the training datasets at each site and defined importance as the magnitude of the regression coefficient for each predictor. The top ten LASSO-selected features for each site are shown, ranked, in Table 4.
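A sketch of this common importance yardstick (L1-penalized logistic regression via scikit-learn; the regularization strength shown is a placeholder):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_train, y_train = make_classification(n_samples=20000, n_features=100,
                                       weights=[0.99], random_state=0)

# The L1 (LASSO) penalty zeroes out weak predictors; surviving
# coefficients provide a single importance scale comparable across sites.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X_train, y_train)

# Importance = magnitude of the coefficient; report the ten largest.
coef = np.abs(lasso.coef_.ravel())
top10 = np.argsort(coef)[::-1][:10]
```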

Results

Study site data are shown (Table 1).

Individual model performance by site

Discrimination performance is shown for each algorithm by training site, with internal validation (testing within site) denoted visually for ready comparison with external validation (testing across sites) (Table 2).

Table 2 Model performance by site.

As shown in Table 2, models performed comparably within and across sites, with a tendency toward better discrimination for locally trained algorithms at their internal validation sites and better calibration for GBM and for ensembles of GBM, RF, and Ridge.

Optimal thresholds and performance metrics by algorithm and site

Varying risk-percentile thresholds by algorithm and site showed that specificity closely tracked the thresholds themselves, while sensitivity (recall) tended to decrease and PPV to increase as thresholds rose (Table 3). NPV for all algorithms above these thresholds (90th percentile and above) exceeded 99%, largely because of the rarity of the outcome.

Table 3 Model performance by risk percentile threshold.

Predictor importance

The most important predictors for each model are listed in Table 4 (top ten by site) and in the eSupplement (top fifty by site). At all three sites, the leading features were dominated by medications, including contrast agents such as those used in imaging studies. We caution against overinterpretation of such predictor weights and underline that these statistics are correlative, not causal.

Table 4 Ten most important predictors by algorithm using L1-penalized regression for feature selection and weighting.

Discussion

Early identification of individuals at risk for BD offers opportunities for targeted assessment and prevention. Although a number of risk factors for BD have been established, including family history [25] and stressful life events [26], quantitative, scalable prediction of risk remains challenging. Prior studies have largely focused on individuals with a history of depression and/or have included relatively small samples [27, 28]. Here, we validated multiple algorithmic approaches across multiple well-powered longitudinal EHR sites, in the absence of a common data model, to generate a novel suite of prediction algorithms for BD. These models performed well across diverse geography and broad, heterogeneous patient populations. However, difficulty porting and transferring algorithms across sites remains a primary barrier to replication and implementation studies.

Our results demonstrate the feasibility and comparative performance of prediction algorithms using federated analyses of EHR data across the PsycheMERGE network. We compared three different machine learning approaches, each reliant on different assumptions and means of handling noisy, high dimensional data. Finally, we tested ensembles of these methods via stacking.

We highlight several noteworthy findings. First, regardless of method, performance was optimal at the site at which the model was developed, supporting the inference that portability of models may be limited by site-specific features - e.g., a local care practice common in one setting or region and uncommon in another. This also suggests the potential for overly optimistic performance estimates with internal validation - underlining again that no substitute exists for external validation in model evaluation. We also note little overlap among the most important predictive features at each site, which likely relates to both site-specific differences and algorithmic differences, e.g., parametric (Ridge) versus nonparametric (tree-based) approaches. The most generalizable algorithm, the stacked ensemble, matched internally validated algorithms in discrimination and was the only well-calibrated model, in part owing to its recalibration via regression at each site during the stacking process. Vigilance for drift and miscalibration over time would be necessary in planning downstream implementation.

Race was included as a predictor in these models, a practice being reconsidered for a number of clinical predictive applications. We opted not to blind our algorithms to race in this case, as blinding has been shown not to prevent algorithmic bias and might, in fact, introduce it. We emphasize that race is a social construct that does not itself cause mental illness but can be a marker of inequitable healthcare access, experiences of adversity, and systemic inequity of opportunity. As such, it might be predictive of a coded diagnosis despite not lying on the causal path to the outcome. Prior to implementing models like these that use variables such as coded race for prediction, close attention should be paid to algorithmic bias and the potential to create or exacerbate disparities [29, 30]. As an additional check that race as a predictor did not bias our model unfairly, we retrained the RF at VUMC without the race variable and compared validation-set performance by coded race across 1,000 bootstrap replicates. Performance distributions did not differ across or within race: for coded White race, AUC was 0.8 [0.79, 0.82] and 0.79 [0.77, 0.8] for the models with and without race, respectively; for coded Black race, AUC was 0.79 [0.76, 0.83] and 0.78 [0.75, 0.82], respectively.
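A sketch of this subgroup bootstrap (our illustration; applied within each coded-race stratum to the models trained with and without race):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(y_true, risk, n_boot=1000, seed=0):
    """Bootstrap AUC distribution within one subgroup (a sketch)."""
    rng = np.random.default_rng(seed)
    y_true, risk = np.asarray(y_true), np.asarray(risk)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # resample drew only one class; AUC undefined
        aucs.append(roc_auc_score(y_true[idx], risk[idx]))
    # Median and a 95% percentile interval across replicates.
    return np.percentile(aucs, [2.5, 50.0, 97.5])
```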

The clinical utility of predictive models for rare events like BD (<1% prevalence at each site) merits consideration, with particular attention to PPV. Here, the stacked ensembles achieved the best threshold-specific PPVs across all three sites (a ~40-fold increase over case prevalence). A resource-limited clinical environment prioritizing identification of those most likely to have undiagnosed BD, or prediction of BD onset, might benefit from models providing such PPVs for such rare outcomes. Of note, our models achieved an NNS as low as 20 or fewer at each site, meaning that roughly 20 or fewer patients flagged as high risk would need to be assessed to detect one true case.

These models rely on EHR data common to any modern hospital, agnostic to a common data model: demographics, diagnostic codes in a universally accepted schema (ICD), and medications mapped to a public ontology (RxNorm). For those who wish to leverage these trained and tested algorithms, a library of the individual models and the stacked ensemble will be made available on the PsycheMERGE Network website (psychemerge.com) [31].

Strengths

This study leveraged three large health systems with a validated definition of bipolar disorder as the prediction target. We applied three accepted algorithms (RF, Ridge Regression, GBM) to large real-world cohorts and assessed generalizability and model fit across partner sites. We ensembled these algorithms on over 3 million patient lives across three major biobanks - the largest modeling study of this kind in BD, to our knowledge. We relied on readily available structured EHR data for feature engineering. Finally, we disseminate these tools via the PsycheMERGE Network to facilitate replication studies and local deployment.

Limitations

Our results should also be interpreted in light of several limitations. First, while we explored the performance of multiple modeling approaches, others (including deep learning approaches) were not tested. Second, our study relied on structured longitudinal EHR data, a decision made to facilitate ready implementation across sites; however, natural language processing of narrative text might offer performance advantages in the longer term. Third, covariate shift in real-world data like these means the joint distribution of model inputs and outputs may differ between training and testing across sites [32]. Future studies might investigate covariate-shift detection and adaptation, for example via importance re-weighting or feature-dropping methods, to improve model performance. Finally, class imbalance remains a notable challenge in this study and studies like it, creating the potential for overfitting and spuriously high performance metrics (e.g., high AUROCs driven simply by identification of the majority class, here non-BD).

Conclusions

Generalizable predictive models of bipolar disorder trained and validated across health systems are feasible targets of clinical and precision-medicine initiatives, even in the absence of common data models across sites. Implications of these models include accelerating BD risk research, catalyzing pharmacoepidemiologic studies, and the potential for similar models to serve as probabilistic phenotypes in precision medicine research. Future work should assess their clinical utility and their potential to quantitatively phenotype this serious mental illness.