Introduction

Bipolar disorder (BD), characterized by episodes of hypomania/mania and depression [1], is a leading cause of disability [2]. Rates of suicide among patients with BD are 20- to 30-fold higher than in the general population [3], and BD is associated with substantial premature mortality from multiple causes [4]. The diagnosis of BD can be challenging and may require a prolonged diagnostic odyssey, averaging 6–10 years [5,6,7]. Affected patients frequently present initially with a major depressive episode and are misdiagnosed with unipolar depression. Misdiagnosis may lead to inappropriate prescribing of antidepressants without mood stabilization and an increased risk of switching into a manic state [8]. A longer duration of untreated BD is associated with more severe and recurrent mood episodes and more frequent suicide attempts [9, 10].

Identifying those at risk for BD might enable targeted assessment, early intervention, and more appropriate management. A recent systematic review of clinical trials to prevent bipolar disorder showed reliance on family history for risk identification [11]. However, given BD’s multifactorial nature, most affected individuals would not have a positive family history [12]. In addition, unlike schizophrenia, BD has no established prodrome. Newer methods of risk identification that do not rely on existing clinical signs or symptoms might therefore be of substantial value.

Longitudinal electronic health records (EHRs) coupled with predictive analytics might enable novel risk identification opportunities in BD. We have previously demonstrated that such data can produce valid diagnostic phenotyping of bipolar cases [13, 14]. Here, we extend this work to the domain of prognostication by leveraging the PsycheMERGE Network, a national research network of EHR-linked biobanks. Using longitudinal EHRs from three major healthcare systems, we trained and validated quantitative bipolar disorder risk prediction models based on high-dimensional structured EHR data and evaluated their performance individually and when ensembled.

Methods

Study settings

Participating study sites included three major academic medical centers in the United States: the Northeast (Mass General Brigham [MGB]), the Mid-South (Vanderbilt University Medical Center [VUMC]), and the Mid-Atlantic (Geisinger [GHS]). Each site participates in the PsycheMERGE Network and has an extensive EHR repository linked to a biobank. On average, each site serves 1.4 M patients per year and maintains an EHR repository covering ~2.7 M patients.

The methods were performed in accordance with relevant guidelines and regulations and approved by Institutional Review Boards at each participating study site: Vanderbilt University Medical Center, Geisinger Health System, Mass General Brigham.

Outcome definition

Cases of BD were defined by the published “Bipolar Coded-Broad” definition per Castro et al. [14]. This rule-based algorithm demonstrated a high positive predictive value (0.80) against a gold standard of semi-structured diagnostic interviews (SCID-IV) conducted by experienced doctoral-level clinicians blind to the algorithmic results. To meet “Bipolar Coded-Broad,” cases must have at least two BD diagnostic codes from versions nine or ten of the International Classification of Diseases (ICD) schema, with a minimum of four weeks between codes, and at least two documented medications used to treat BD (e.g., lithium or valproic acid) within one year of the index BD diagnosis. To rule out patients with related disorders, we required that the number of diagnostic codes for major depressive disorder (MDD), schizophrenia (SCZ), schizoaffective disorder, or organic affective syndrome (OAS) total fewer than half the number of BD codes. Only adult patients aged 18 years and older at the time of their index BD diagnosis were included in this analysis.
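For illustration, a minimal sketch of this rule in Python follows, assuming hypothetical precomputed per-patient fields (the field names are ours, not from the published algorithm [14]):

```python
import pandas as pd

def meets_bipolar_coded_broad(row: pd.Series) -> bool:
    """Sketch of the 'Bipolar Coded-Broad' rule; field names are hypothetical.

    Assumes precomputed per-patient summaries:
      n_bd_codes        - count of BD ICD-9/ICD-10 diagnostic codes
      bd_code_span_days - days between the two qualifying BD codes
      n_bd_meds_1yr     - BD medications (e.g., lithium, valproic acid)
                          documented within 1 year of the index diagnosis
      n_related_codes   - total MDD + SCZ + schizoaffective + OAS codes
      age_at_index      - age at the index BD diagnosis
    """
    return bool(
        row["n_bd_codes"] >= 2
        and row["bd_code_span_days"] >= 28            # >= 4 weeks apart
        and row["n_bd_meds_1yr"] >= 2
        and row["n_related_codes"] < row["n_bd_codes"] / 2
        and row["age_at_index"] >= 18
    )
```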

All adult patients with at least three documented healthcare encounters spanning at least six months were included, regardless of case status.

Predictive modeling approach

We compared three predictive modeling approaches that together span a range of model architectures and strategies for handling feature relationships (see “Algorithmic details”, below). Because of prior algorithmic experience at each study site, each team internally validated one of the three model types: L2-penalized regression (abbreviated here as “Ridge”) at MGB, random forests (RF) at VUMC, and gradient boosting machines (GBM) at GHS. In external validation, each site tested the remaining two models; e.g., MGB externally validated the RF from VUMC and the GBM from GHS.

Internal validation was conducted at each site with a randomly selected hold-out test set, and the best internally performing algorithms were shared for external validation. This reciprocal validation strategy tested the generalizability of each algorithm without requiring each site to train three separate algorithms in parallel.

Feature engineering

Variables for prediction included demographics: age (continuous), coded sex (categorical: Male, Female, and Unknown), and coded race (categorical: White, Black, Asian, Other, and Unknown); inpatient-administered and outpatient-prescribed medications (log-transformed counts); and diagnostic codes (log-transformed counts). To construct the log-transformed counts, we grouped medications by RxNorm ingredient and diagnostic codes by Clinical Classification Software (CCS) Level 2 category, then counted occurrences of each group per person in the time before the prediction date. Each count c was then transformed to its natural log after adding one, ln(c + 1), avoiding the undefined log of zero. This common transform has been shown to be helpful for sparse, skewed predictors (i.e., many zero values with a few very high counts), which are common in EHR-based datasets.

Dimensionality reduction thus consisted of grouping medications by RxNorm ingredient [15] and mapping diagnostic codes from ICD-9-CM and ICD-10-CM to CCS Level 2 codes [16]. The final feature set numbered up to ~2500 variables. Missing count variables were imputed as zero, and missing race and sex were categorized as Unknown (see Table 1).
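As a sketch of this feature construction (our own illustration in Python/pandas with hypothetical event data; each site’s actual pipeline used its own tooling):

```python
import numpy as np
import pandas as pd

# Hypothetical long-format events: one row per coded event before censoring,
# with 'group' already mapped to an RxNorm ingredient or CCS Level 2 category.
events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "group": ["lithium", "lithium", "CCS_anxiety", "metformin", "CCS_anxiety"],
})

# Count occurrences per patient per group, then pivot to a wide
# patient-by-feature matrix; absent counts are imputed as zero.
counts = (events.groupby(["patient_id", "group"]).size()
                .unstack(fill_value=0))

# Natural log of count + 1 (log1p), tempering sparse, right-skewed counts.
features = np.log1p(counts)
```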

Table 1 Baseline study participant characteristics.

Records meeting “Bipolar Coded-Broad” were right-censored at the day before the index diagnosis, both to represent a useful prediction target and to prevent leakage of bipolar-related data from driving model predictions. For those not meeting BD criteria, right-censoring occurred at the date of the last visit or at the first occurrence of an ICD code for BD in the EHR.
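One plausible reading of this censoring rule, expressed as a Python sketch with assumed attribute names (not the sites’ actual code):

```python
from datetime import timedelta

def censor_date(patient):
    """Return the date before which all predictor data must fall."""
    if patient.meets_bipolar_coded_broad:
        # Cases: the day before the index BD diagnosis, preventing
        # diagnosis-related data from leaking into the predictors.
        return patient.index_bd_diagnosis_date - timedelta(days=1)
    # Non-cases: the first BD ICD code if one exists, else the last visit.
    if patient.first_bd_icd_date is not None:
        return patient.first_bd_icd_date
    return patient.last_visit_date
```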

Algorithmic details

Ridge regression

Ridge regression [17] is a regularized regression model that shrinks regression coefficients to reduce multicollinearity and model variance, thereby improving prediction performance. Despite the shrinkage, coefficients are never shrunk exactly to zero, so all features remain in the final model. We used the widely adopted glmnet package in R [18,19,20] for model training, using the main (first-order) effects of all available features. The model was developed with 10-fold cross-validation on 60% of the data to find the best lambda value (i.e., the strength of regularization) and estimate model parameters, while the remaining 40% served as a hold-out test set. The 60–40 split was chosen because of the larger sample size at MGB: the 60% training/validation partition approached the input-size limits of the glmnet package, and preliminary analyses showed minimal performance differences when varying the train/test ratio.
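The site’s model used glmnet in R; the sketch below shows an analogous workflow with scikit-learn in Python (our assumption for illustration, with synthetic data standing in for the EHR feature matrix):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the patient-by-feature matrix (~1% prevalence).
X, y = make_classification(n_samples=20000, n_features=100,
                           weights=[0.99], random_state=0)

# 60-40 split: 60% for cross-validated training, 40% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.60, stratify=y, random_state=0)

# L2-penalized (ridge) logistic regression; 10-fold CV selects the
# regularization strength (glmnet's lambda corresponds to 1/C here).
ridge = LogisticRegressionCV(Cs=20, cv=10, penalty="l2",
                             scoring="roc_auc", max_iter=5000)
ridge.fit(X_train, y_train)

risk = ridge.predict_proba(X_test)[:, 1]  # predicted probability of BD
```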

RF

VUMC implemented the decision-tree-based RF algorithm. A commonly employed nonparametric method, RF accommodates nonlinear relationships among predictors, samples with replacement for both feature inclusion and model training, and tolerates the collinearity likely to be present in EHR data. After preliminary analyses varying these parameters, we used 200 trees, a minimum node size of five, and impurity (“purity”) for variable importance. Models were trained with an 80–20% train–test split and implemented using the ranger package in R [21].
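The site’s RF used ranger in R; an analogous sketch with scikit-learn in Python (again an illustration on synthetic data, not the site’s code) mirrors the reported settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=100,
                           weights=[0.99], random_state=0)

# 80-20 train-test split, as reported.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, stratify=y, random_state=0)

# 200 trees, minimum node size of five, impurity-based importance
# (scikit-learn's default), approximating the ranger configuration.
rf = RandomForestClassifier(n_estimators=200, min_samples_leaf=5,
                            n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)

risk = rf.predict_proba(X_test)[:, 1]   # predicted probability of BD
importances = rf.feature_importances_   # impurity ("purity") importance
```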

GBM

GHS implemented GBM for internal validation. Boosting is an ensemble technique that sequentially trains multiple “weak learner” algorithms into a strong one, iteratively improving prediction accuracy. GBM is a high-performance, decision-tree-based gradient boosting framework capable of handling imbalanced datasets, as boosting can strengthen the impact of the positive class (here, cases of BD). Tuning parameters included the ratio of features used, the ratio of training instances, the maximum depth of trees, and the learning rate. In preliminary analyses, dimensionality impacted model performance, so we selected the most prevalent medications across the GHS EHR, retaining those accounting for 95% of cumulative medication counts by Pareto analysis. Models were trained with a 75–25% train–test split and implemented in Python (package ‘lightgbm’) [22].
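A sketch of the Pareto-style medication selection and the lightgbm model (the medication counts and tuning values are placeholders; the text names the tuning dimensions but does not report the chosen values):

```python
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Pareto selection: keep the most prevalent medications accounting for
# 95% of cumulative medication counts (hypothetical counts shown).
med_counts = pd.Series({"lisinopril": 9000, "metformin": 7000,
                        "sertraline": 3000, "lithium": 900})
cum_share = med_counts.sort_values(ascending=False).cumsum() / med_counts.sum()
kept_meds = cum_share[cum_share <= 0.95].index.tolist()

# Synthetic stand-in for the resulting feature matrix; 75-25 split.
X, y = make_classification(n_samples=20000, n_features=100,
                           weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0)

# Tuning dimensions named in the text: feature ratio, instance ratio,
# maximum tree depth, and learning rate (placeholder values).
gbm = lgb.LGBMClassifier(colsample_bytree=0.8, subsample=0.8,
                         subsample_freq=1, max_depth=6,
                         learning_rate=0.05, n_estimators=500)
gbm.fit(X_train, y_train)
risk = gbm.predict_proba(X_test)[:, 1]
```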

Ensembling

Ensembling is designed to improve prediction accuracy by aggregating the strengths of diverse machine learning models into a single predictive model. Here, we ensembled the three algorithms via stacking at each site and evaluated performance. The stacked ensemble combined Ridge, GBM, and RF using a logistic regression, trained with ten-fold cross-validation on each site’s training set, that took the three individual model predictions as multivariate predictors; training on each site’s training set avoided leakage of test data for the internally validated model into the ensemble.
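A sketch of this stacking step in Python (our illustration, on synthetic data):

```python
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X_train, y_train = make_classification(n_samples=20000, n_features=100,
                                       weights=[0.99], random_state=0)

base_models = [
    LogisticRegression(penalty="l2", max_iter=5000),               # Ridge
    RandomForestClassifier(n_estimators=200, min_samples_leaf=5),  # RF
    lgb.LGBMClassifier(),                                          # GBM
]

# Out-of-fold predictions on the training set via ten-fold CV, so the
# meta-learner never sees predictions made on rows a base model trained on.
meta_X = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=10,
                      method="predict_proba")[:, 1]
    for m in base_models
])

# Logistic regression stacks the three predictions into a single score,
# recalibrating the ensemble on each site's training data.
stacker = LogisticRegression().fit(meta_X, y_train)
```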

Model evaluation

Evaluation metrics included the Area Under the Receiver Operating Characteristic curve (AUROC), precision-recall curves, and the Area Under the Precision-Recall curve (AUPR). Metrics at specific risk thresholds included sensitivity/recall, specificity, positive predictive value (PPV), and number needed to screen (NNS [23], the reciprocal of PPV for a predictive model). Calibration was measured with the Brier score and calibration slope/intercept [24].
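A sketch of these metrics at a risk-percentile threshold (our helper, built from standard scikit-learn metrics; calibration slope/intercept, fit by regressing the outcome on the logit of predicted risk, is omitted for brevity):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             precision_score, recall_score, roc_auc_score)

def threshold_metrics(y_true, risk, percentile=90):
    """Discrimination, threshold, and calibration metrics (a sketch)."""
    y_true, risk = np.asarray(y_true), np.asarray(risk)
    flagged = risk >= np.percentile(risk, percentile)
    ppv = precision_score(y_true, flagged)
    return {
        "auroc": roc_auc_score(y_true, risk),
        "aupr": average_precision_score(y_true, risk),
        "sensitivity": recall_score(y_true, flagged),
        "specificity": recall_score(1 - y_true, ~flagged),
        "ppv": ppv,
        "nns": 1.0 / ppv,                    # number needed to screen
        "brier": brier_score_loss(y_true, risk),
    }
```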

Feature importance by site

Because the algorithms (GBM, RF, and Ridge) have different inherent metrics of importance, we applied L1-penalized regression (LASSO) to the training datasets at each site and defined importance as the magnitude of the regression coefficient for each predictor. The top ten LASSO-selected features for each site are shown, ranked, in Table 4.
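A sketch of this common importance yardstick (L1-penalized logistic regression via scikit-learn; the regularization strength shown is a placeholder):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_train, y_train = make_classification(n_samples=20000, n_features=100,
                                       weights=[0.99], random_state=0)

# The L1 (LASSO) penalty zeroes out weak predictors; surviving
# coefficients provide a single importance scale comparable across sites.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X_train, y_train)

# Importance = magnitude of the coefficient; report the ten largest.
coef = np.abs(lasso.coef_.ravel())
top10 = np.argsort(coef)[::-1][:10]
```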

Results

Study site data are shown (Table 1).

Individual model performance by site

Discrimination performance is shown for each algorithm by training site, with internal validation (testing within site) denoted visually for ready comparison with external validation (testing across sites) (Table 2).

Table 2 Model performance by site.

As shown in Table 2, models performed comparably within and across sites, with a tendency toward better discrimination for locally trained algorithms at their internal validation sites and better calibration for GBM and for ensembles of GBM, RF, and Ridge.

Optimal thresholds and performance metrics by algorithm and site

Varying risk-percentile thresholds by algorithm and site showed that specificity closely tracked the thresholds themselves, while sensitivity (recall) tended to decrease and PPV to increase as thresholds rose (Table 3). NPV for all algorithms above these thresholds (90th percentile and above) exceeded 99%, largely because of the rarity of the outcome.

Table 3 Model performance by risk percentile threshold.

Predictor importance

The most important predictors for each model are listed in Table 4 (top ten by site) and in the eSupplement (top fifty by site). At all three sites, the leading features were dominated by medications, including contrast agents such as those used in imaging studies. We caution against overinterpretation of such predictor weights and underline that these statistics are correlative, not causal.

Table 4 Ten most important predictors by algorithm using L1-penalized regression for feature selection and weighting.

Discussion

Early identification of individuals at risk for BD offers opportunities for targeted assessment and prevention. Although a number of risk factors for BD have been established, including family history [25] and stressful life events [26], quantitative, scalable prediction of risk remains challenging. Prior studies have largely focused on individuals with a history of depression and/or have included relatively small samples [27, 28]. Here, we validated multiple algorithmic approaches across multiple well-powered longitudinal EHR sites, in the absence of a common data model, to generate a novel suite of prediction algorithms for BD. These models performed well across diverse geography and broad, heterogeneous patient populations. However, difficulty porting and transferring algorithms across sites remains a primary barrier to replication and implementation studies.

Our results demonstrate the feasibility and comparative performance of prediction algorithms using federated analyses of EHR data across the PsycheMERGE network. We compared three different machine learning approaches, each reliant on different assumptions and means of handling noisy, high dimensional data. Finally, we tested ensembles of these methods via stacking.

We highlight several noteworthy findings. First, regardless of method, performance was optimal at the site at which the model was developed, supporting the inference that portability of models may be limited by site-specific features - e.g., a local care practice common in one setting or region and uncommon in another. This also suggests the potential for overly optimistic performance estimates with internal validation - underlining again that no substitute exists for external validation in model evaluation. We also note little overlap among the most important predictive features at each site, which likely relates to both site-specific differences and algorithmic differences, e.g., parametric (Ridge) versus nonparametric (tree-based) approaches. The most generalizable algorithm, the stacked ensemble, matched internally validated algorithms in discrimination and was the only well-calibrated model, in part owing to its recalibration via regression at each site during the stacking process. Vigilance for drift and miscalibration over time would be necessary in planning downstream implementation.

Race was included as a predictor in these models, a practice being reconsidered for a number of clinical predictive applications. We opted not to blind our algorithms to race in this case, as blinding has been shown not to prevent algorithmic bias and might, in fact, introduce it. We emphasize that race is a social construct that does not itself cause mental illness but can be a marker of inequitable healthcare access, experiences of adversity, and systemic inequity of opportunity. As such, it might be predictive of a coded diagnosis despite not lying on the causal path to the outcome. Prior to implementing models like these that use variables such as coded race for prediction, close attention should be paid to algorithmic bias and the potential to create or exacerbate disparities [29, 30]. As an additional check that race as a predictor did not bias our model unfairly, we retrained the RF at VUMC without the race variable and compared validation-set performance by coded race across 1,000 bootstrap replicates. Performance distributions did not differ across or within race: for coded White race, AUC was 0.8 [0.79, 0.82] and 0.79 [0.77, 0.8] for the models with and without race, respectively; for coded Black race, AUC was 0.79 [0.76, 0.83] and 0.78 [0.75, 0.82], respectively.
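A sketch of this subgroup bootstrap (our illustration; applied within each coded-race stratum to the models trained with and without race):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(y_true, risk, n_boot=1000, seed=0):
    """Bootstrap AUC distribution within one subgroup (a sketch)."""
    rng = np.random.default_rng(seed)
    y_true, risk = np.asarray(y_true), np.asarray(risk)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # resample drew only one class; AUC undefined
        aucs.append(roc_auc_score(y_true[idx], risk[idx]))
    # Median and a 95% percentile interval across replicates.
    return np.percentile(aucs, [2.5, 50.0, 97.5])
```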

The clinical utility of predictive models for rare events like BD (<1% prevalence at each site) merits consideration, with particular attention to PPV. Here, the stacked ensembles achieved the best threshold-specific PPVs across all three sites (a ~40-fold increase over case prevalence). A resource-limited clinical environment prioritizing identification of those most likely to have undiagnosed BD, or prediction of BD onset, might benefit from models providing such PPVs for such rare outcomes. Of note, our models achieved an NNS as low as 20 or fewer at each site, meaning that roughly 20 or fewer patients flagged as high risk would need to be assessed to detect one true case.

These models rely on EHR data common to any modern hospital, agnostic to a common data model: demographics, diagnostic codes in a universally accepted schema (ICD), and medications mapped to a public ontology (RxNorm). For those who wish to leverage these trained and tested algorithms, a library of the individual models and the stacked ensemble will be made available on the PsycheMERGE Network website (psychemerge.com) [31].

Strengths

This study leveraged three large health systems with a validated definition of bipolar disorder as the prediction target. We applied three accepted algorithms (RF, Ridge Regression, GBM) to large real-world cohorts and assessed generalizability and model fit across partner sites. We ensembled these algorithms on over 3 million patient lives across three major biobanks - the largest modeling study of this kind in BD, to our knowledge. We relied on readily available structured EHR data for feature engineering. Finally, we disseminate these tools via the PsycheMERGE Network to facilitate replication studies and local deployment.

Limitations

Our results should also be interpreted in light of several limitations. First, while we explored the performance of multiple modeling approaches, others (including deep learning approaches) were not tested. Second, our study relied on structured longitudinal EHR data, a decision made to facilitate ready implementation across sites; however, natural language processing of narrative text might offer performance advantages in the longer term. Third, covariate shift in real-world data like these means the joint distribution of model inputs and outputs may differ between training and testing across sites [32]. Future studies might investigate covariate-shift detection and adaptation, for example via importance re-weighting or feature-dropping methods, to improve model performance. Finally, class imbalance remains a notable challenge in this study and studies like it, creating the potential for overfitting and spuriously high performance metrics (e.g., high AUROCs driven simply by identification of the majority class, here non-BD).

Conclusions

Generalizable predictive models of bipolar disorder trained and validated across health systems are feasible targets of clinical and precision-medicine initiatives, even in the absence of common data models across sites. Implications of these models include accelerating BD risk research, catalyzing pharmacoepidemiologic studies, and the potential for similar models to serve as probabilistic phenotypes in precision medicine research. Future work should assess their clinical utility and their potential to quantitatively phenotype this serious mental illness.