Introduction

Over the past decade, under-recognition of bipolar disorder (BPD) among individuals with depressive symptoms has become an area of increasing concern [1]. Individuals with BPD wait, on average, nearly 6 years between mood symptom onset and initial management [2], with one study suggesting that one-third of individuals waited 10 years or more before receiving a diagnosis [3]. Those with unrecognized BPD presenting in a major depressive episode may be prescribed antidepressants, an approach that not only may be ineffective [4], but also may increase their risk of mood cycling and mixed, manic, or hypomanic episodes in the absence of mood stabilizer treatment [4]. Antidepressant prescriptions have also been associated with treatment resistance and suicidality [5]. As a result, efforts to stratify risk, particularly early in the illness course, have been of great interest to both clinicians and investigators [6].

A change in diagnosis from depression to BPD may not reflect missed symptoms per se, but simply individuals who have not yet manifested sufficiently clear manic or hypomanic symptoms to merit a bipolar diagnosis [1]. Indeed, some individuals manifest depressive symptoms before a manic or hypomanic episode, rendering a diagnosis of BPD impossible early in their illness course. Still, regular and systematic assessment for manic and hypomanic symptoms in high-risk individuals may shorten the time to accurate diagnosis while minimizing the potential adverse outcomes associated with antidepressant treatment [7]. However, to date, efforts to identify clinical predictors have generally relied on small, selected cohorts [8], or on assessments of personality that may not be routinely available in clinical settings [9]; furthermore, commonly accepted predictors have been shown to perform poorly [10, 11]. Perhaps unsurprisingly, then, these efforts have not yielded widely applicable tools despite the clinical need [12].

Therefore, we sought to develop predictive models to enable short-term stratification of risk for bipolar conversion after antidepressant initiation among individuals initially diagnosed with major depressive disorder (MDD), based solely on information available in electronic health records as part of routine care. Rather than identifying individual clinical risk factors, or distinguishing among specific antidepressants, we aimed to build summary classifiers that offer high-quality estimation of risk. We built and tested such models in one health system and then examined the performance of these models in the network of a second health system.

Materials and methods

Study overview and cohort description

We identified 67,807 individuals age 18–80 years drawn from the outpatient clinical networks of two academic medical centers in New England, Massachusetts General Hospital and Brigham and Women’s Hospital, referred to subsequently as Site A (42,547 patients) and Site B (25,260 patients) respectively, who had received at least one electronically-prescribed antidepressant from among the nine most commonly-prescribed between March 2008 and December 2017 with a diagnosis of MDD (International Classification of Diseases, Ninth Revision (ICD9) codes 296.2×, 296.3×) or depressive disorder not otherwise specified (311) at the nearest visit to that prescription. See Supplementary Fig. 1 for a CONSORT flow diagram. We generated a data mart with variables extracted from electronic health records of these two health systems using i2b2 server software (i2b2, Boston, MA, USA) [13]. Available patient data included sociodemographic (age, gender, race) and insurance information, diagnostic and procedure codes, as well as inpatient medication administrations and outpatient medication electronic prescriptions. The study protocol was approved by the Partners HealthCare institutional review board, waiving the requirement for informed consent as only deidentified data were utilized and no human subjects contact was required.

Outcome definition

The primary study outcome was transition to a BPD diagnosis following the index prescription, defined as the first antidepressant (AD) ever prescribed in a patient’s recorded history. Transition to BPD was defined as the presence of an ICD code for BPD (296.1×, 296.4×, 296.5×, or 296.7×) at any visit within 3 months of the index AD prescription. Patients with no prescriptions before diagnosis of BPD were excluded, as were patients with medication and procedural codes (listed in Supplementary Table 1) commonly associated with BPD. We also excluded patients with no clinical history (i.e., no available diagnostic codes) before the antidepressant prescription, for whom efforts to predict subsequent outcome would likely be unrealistic. (Note that stricter exclusion criteria, such as excluding patients with less than 1 year of available data before and after the diagnosis of MDD, would reduce the final sample size and thus the generalizability and predictive power of the proposed models.) We considered AD prescriptions starting in March 2008, 6 months after electronic prescribing became mandatory in these hospital systems; patients receiving an AD between October 2007 and March 2008 were excluded because the identified ADs might not be newly prescribed, but rather the first recorded follow-up prescriptions. Finally, we excluded patients without follow-up (i.e., no code after 3 months from index prescription).
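
The outcome rule described above can be illustrated with a minimal Python sketch. It is not the study code: the function name, the input layout (a list of visit date and ICD code pairs), the approximation of 3 months as 90 days, and the choice to count only visits strictly after the index date are all assumptions made for illustration. Only the four ICD prefixes come from the definition above.

```python
from datetime import date, timedelta

# ICD prefixes for BPD taken from the outcome definition in the text.
BPD_PREFIXES = ("296.1", "296.4", "296.5", "296.7")


def transitioned_to_bpd(index_date, visits, window_days=90):
    """Return True if any visit after the index prescription, and within
    the window, carries a BPD code.

    visits: iterable of (visit_date, icd_code) pairs (hypothetical layout).
    window_days: 90 days is an assumed stand-in for "3 months".
    """
    window_end = index_date + timedelta(days=window_days)
    return any(
        index_date < visit_date <= window_end and code.startswith(BPD_PREFIXES)
        for visit_date, code in visits
    )
```

For example, a patient with an index prescription on 15 January 2010 and a visit coded 296.42 on 1 March 2010 would be labeled as a transition, whereas the same code recorded on 1 June 2010 would fall outside the window.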

Covariate definition

For each patient, available covariates include sociodemographic features (age, gender, and race), diagnostic codes (ICD-9 and ICD-10 diagnoses), procedures (CPT lab tests and procedures), any medication (RxNorm codes), and additional data (provider type, insurance type, and prescription date) available at the time of index prescription of any given antidepressant. From an initial set of 36,875 billing codes, we selected the 8654 codewords that ranked highest according to the term frequency-inverse document frequency (TF-IDF) metric and that occurred in at least 200 patients in Site A. Thus, a count vector of size 8654 represents a patient’s diagnostic and treatment history.
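
To make the feature construction concrete, the following Python sketch selects a code vocabulary and builds a count vector for one patient. The exact TF-IDF aggregation used to rank codes is not specified above, so the score here (total occurrences of a code multiplied by the log inverse patient frequency) is an assumption, as are the function names and the toy inputs.

```python
import math
from collections import Counter


def select_vocabulary(patient_codes, top_k, min_patients):
    """Rank codes by an assumed corpus-level TF-IDF score and keep the top_k
    among those seen in at least min_patients patients.

    patient_codes: list of per-patient code lists (hypothetical layout).
    """
    n_patients = len(patient_codes)
    term_freq = Counter()  # total occurrences of each code
    doc_freq = Counter()   # number of patients with each code
    for codes in patient_codes:
        term_freq.update(codes)
        doc_freq.update(set(codes))
    scores = {
        code: term_freq[code] * math.log(n_patients / doc_freq[code])
        for code in term_freq
        if doc_freq[code] >= min_patients
    }
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return ranked[:top_k]


def count_vector(codes, vocabulary):
    """Represent one patient's history as counts over the vocabulary."""
    counts = Counter(codes)
    return [counts[code] for code in vocabulary]
```

In the setting described above, `top_k` would be 8654 and `min_patients` would be 200, with the vocabulary derived from Site A only.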

Prediction task

The primary aim was to predict conversion to BPD diagnosis on the basis of sociodemographic features (age, gender, and race), diagnostic codes, procedures, any medication, and additional data (provider type, insurance type, and prescription date) available at the time of index prescription of any given antidepressant. The 42,547 eligible participants in the Site A cohort were randomly assigned to training (70%), validation (10%), and testing (20%) data sets in a stratified manner (preserving the same rate of transition to BPD in each set). All 25,260 patients from Site B were held out for testing.
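
A stratified 70/10/20 split of the kind described above can be sketched as follows. This is an illustrative stdlib-only version, not the authors' implementation; the function name, the fixed seed, and the rounding of per-class counts are assumptions.

```python
import random


def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split indices into train/validation/test while preserving the
    outcome rate in each part.

    labels: list of 0/1 outcomes, one per patient.
    Returns three lists of indices (train, validation, test).
    """
    rng = random.Random(seed)
    train, val, test = [], [], []
    for value in sorted(set(labels)):
        # Shuffle the indices of each class separately, then cut them
        # in the requested proportions.
        idx = [i for i, y in enumerate(labels) if y == value]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])
    return train, val, test
```

Because each class is partitioned separately, a rare outcome (on the order of 1 to 2% here) is represented at close to the same rate in all three sets, which a purely random split would not guarantee.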

We built models and evaluated the prediction task for any of the nine most-prescribed AD medications jointly. Each model yields a score representing the probability that a given medication will be associated with subsequent transition to a bipolar diagnosis within the next 3 months, conditional on knowledge available at the time of that prescription (i.e., with all subsequent data censored). Models were evaluated at every index antidepressant prescription provided, for which there was sufficient prior history—i.e., at least one prior “fact” of any kind, comprised of diagnostic code, procedure, or prescription.

Classification methods and metrics

We considered two standard classifiers, logistic regression with L1-norm regularization (LR) and random forests (RF). Both classifiers used the implementation provided in the open-source scikit-learn toolkit [14]. Hyperparameters were tuned on the Site A validation set, using grid search to find the parameter combination that performed best on the area under the receiver operating characteristic curve (AUC) metric.

In the case of LR, the regularization strength “C” was grid searched in the range [0.001–0.5] with 15 values. RF used three hyperparameters that were set on validation data: the number of trees (n_estimators, possible values = {16, 64, 256}), the fraction of features considered at each split (max_features, possible values = {0.04, 0.16, 0.64}), and the minimum number of samples at leaf nodes (min_samples_leaf, possible values = {16, 64, 256}). Finally, model performance was compared using the AUC in the held-out testing set from Site A, averaged over 50 different splits of the data, and then in the independent Site B cohort. To indicate the potential generalization to test sets from a similar empirical distribution, we show error bars with 95% confidence intervals of the AUC (i.e., scores at the 2.5 and 97.5 percentiles) computed from 500 bootstrap samples of each considered test set.
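
The percentile bootstrap for the AUC described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study code: the function names are hypothetical, the AUC is computed with the rank-sum (Mann-Whitney) formulation, resamples containing a single class are redrawn, and the percentile indices are simple approximations.

```python
import random


def auc(labels, scores):
    """Rank-based AUC: the probability that a random positive case scores
    higher than a random negative case (ties count one half)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores and give each the average rank.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auc_ci(labels, scores, n_boot=500, seed=0):
    """Approximate 95% percentile interval of the AUC from n_boot
    resamples (with replacement) of the test set."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # the AUC needs both classes present
            aucs.append(auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    return aucs[int(0.025 * n_boot)], aucs[int(0.975 * n_boot)]
```

With a rare outcome, resamples without any positive case are possible in small test sets, which is why the sketch redraws them rather than failing.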

Results

Characteristics of the two health networks are summarized in Table 1. A total of 636/42,547 individuals (1.49%) transitioned from MDD to BPD in Site A, and 289/25,260 (1.14%) in Site B, within 3 months of index antidepressant prescription. Rates of transition were substantially lower in primary care compared to specialty care settings: 1.22% versus 4.20% at Site A, and 1.06% versus 2.01% at Site B. Rates of transition were generally similar across antidepressants (Fig. 1), with the greatest rates observed among non-SSRIs. Across the two sites, transition rates were 1.23% for SSRIs, 1.93% for SNRIs, and 1.73% for other antidepressants (χ² (2 df) = 37.98; p = 5.64e-09).

Table 1 Per-patient statistics, stratified by gender, race, insurance type, and provider type (top: Site A, bottom: Site B).
Fig. 1: Bipolar rates among all index prescriptions between 2008 and 2017 for different antidepressant categories.

SNRIs, serotonin and norepinephrine reuptake inhibitors; SSRIs, selective serotonin reuptake inhibitors; MDD, major depressive disorder; BP, bipolar disorder.

Figure 2 reports model discrimination using sociodemographic features alone versus sociodemographic features with coded clinical features in the independent testing sample from Site A, and the full Site B cohort. For baseline sociodemographic features alone, AUCs for the LR and RF classifiers both reached a mean of 0.68 in the Site A testing set, ranging from 0.59 to 0.72 for the LR classifier and from 0.59 to 0.73 for the RF classifier. With inclusion of coded clinical data, AUCs increased to a mean of 0.73 (ranging from 0.68 to 0.77) for the LR classifiers, and 0.76 (ranging from 0.72 to 0.80) for the RF classifiers in the Site A testing set. Discrimination stratified by medication class (SSRI, SNRI, and other antidepressants) is illustrated in Supplementary Fig. 2.

Fig. 2: Area under the curve (AUC) in test set for the logistic regression classifier (left) and random forest classifier (right) for Site A and Site B.

Input data include sociodemographic features, specifically age, gender, and race (dem), date of prescription (date), type of insurance (insurance), type of provider (provider), and diagnostic/procedure codes (codes). Confidence intervals computed using 500 bootstraps across 50 different splits of the data in train/test/validation sets.

Figure 3 illustrates rates of BPD transition by individual risk decile; the horizontal dashed line represents the overall average BPD rate at each site. Lift in the top decile (rate of bipolar transitions in top decile with respect to mean rate in each site) was 16.77 for Site A and 3.02 for Site B for the best classifier (RF classifier). Figure 3 shows the utility of the proposed models for sorting patients according to their predicted BPD risk: bars above the dashed line correspond to patients with higher-than-average BPD risk. Figure 4 illustrates the positive predictive values (PPV) achieved at a range of negative predictive value (NPV) thresholds from 98 to 100%. For an NPV of 99%, the PPV is 32.0% in Site A and 6.0% in Site B.
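
The two summary quantities used above, lift in the top decile and PPV/NPV at a given operating point, can be computed as in the following sketch. The function names and the toy data in the usage note are hypothetical; the definitions follow the text (lift is the transition rate in the top-scoring decile divided by the overall rate).

```python
def top_decile_lift(labels, scores):
    """Transition rate among the 10% highest-scoring prescriptions,
    divided by the overall transition rate."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_top = max(1, len(order) // 10)
    top_rate = sum(labels[i] for i in order[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


def ppv_npv(labels, scores, threshold):
    """PPV and NPV when scores at or above the threshold are flagged."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        if s >= threshold:
            tp += y
            fp += 1 - y
        else:
            fn += y
            tn += 1 - y
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return ppv, npv
```

Sweeping `threshold` over the observed scores and plotting the resulting pairs gives a PPV versus NPV curve, with one point per operating point of the classifier.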

Fig. 3: Lift histogram for the random forest classifier (1st row) and logistic regression classifier (2nd row) in Site A (1st column) and Site B (2nd column) for a single split.

Prescriptions are sorted according to their predicted probability of transition to bipolar disorder. The dashed line corresponds to the average BPD rate at each site: bars above the dashed line correspond to patients with higher-than-average BPD risk.

Fig. 4: Positive predictive values (PPV) versus negative predictive values (NPV) for the random forest classifier (1st row) and logistic regression classifier (2nd row) in Site A (1st column) and Site B (2nd column) for a single split.

Each blue point corresponds to a different operating point (threshold) of the classifier.

Supplementary Table 2a, b list the coded features most strongly associated, positively and negatively, with risk for BPD transition over 3 months in regression models. Of note, provider type is the feature most strongly associated with risk for BPD. This is consistent with the BPD rates reported in Table 1 stratified by provider type: BPD rates are much higher under psychiatric providers. Other top predictors include markers of greater illness severity (e.g., emergency department visit, psychiatry versus primary care), psychosis, as well as a medication indicated for psychosis but commonly prescribed for insomnia. Similarly, Supplementary Table 2c illustrates key features from random forest models, and again includes primary versus specialty care and measures of psychosis, as well as age and gender.

Supplementary Fig. 3 illustrates calibration curves for each classifier across Site A and Site B (the closer to the diagonal dotted line, the better). Finally, a sensitivity analysis examining the effects of varying the temporal window in the outcome definition is shown in Supplementary Fig. 4: model performance is generally similar when varying the time window from 2 months, up to 1 year.
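
For readers unfamiliar with calibration curves, the sketch below shows one common way to compute the points of such a curve: predictions are grouped into equal-width probability bins, and the mean predicted probability in each bin is compared with the observed outcome rate. The binning scheme, the number of bins, and the function name are assumptions; the construction used for Supplementary Fig. 3 is not described here.

```python
def calibration_curve(labels, probs, n_bins=10):
    """Return (mean predicted probability, observed rate) for each
    non-empty equal-width bin of predicted probabilities."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        # Clamp p == 1.0 into the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        bins[b].append((p, y))
    curve = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            obs_rate = sum(y for _, y in members) / len(members)
            curve.append((mean_pred, obs_rate))
    return curve
```

A well-calibrated model yields points close to the diagonal, where the mean predicted probability equals the observed rate.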

Discussion

Among 67,807 individuals with MDD in two academic medical center-based health systems, rates of switch in diagnosis to BPD ranged from 1.23 to 1.93%. Rates were substantially greater in specialty compared to primary care settings, which may reflect greater severity, or more systematic detection of switch in specialty settings. Models incorporating clinical codes were highly discriminative in identifying patients at greatest risk for transition to mania within 3 months. While discrimination declined somewhat in a second health system, as expected, model performance remained promising.

Our results are generally consistent with prior reports of transition to BPD using similar designs, most notably Musliner and Østergaard [15], who found that 1% of individuals initially diagnosed with MDD were re-diagnosed with BPD within 1 year of initial diagnosis, not index antidepressant prescription. Our conversion rates are also generally similar to those observed in other study designs using structured interview or systematic evaluation by physicians in a mood disorder clinic [16]. For example, a meta-analysis incorporating more than 100 studies found rates of re-diagnosis ranging from 1.24% among antidepressant-untreated patients to 5.97% among antidepressant-treated patients [17]. Other estimates of conversion vary widely depending on context [7]. In the present work, the goal was not to catalog individual risk factors, amply addressed by other authors [7, 15, 18], but to develop useful prediction tools that operate on health records, and a means of aggregating easily-derived risk factors to arrive at a quantitative prediction about risk.

We cannot directly compare our prediction results to prior studies, as we could not identify a prior record-based prediction model or one tied to antidepressant prescribing, even though this represents a very common clinical decision point. One prospective study including 550 individuals with MDD reported that the total number of subthreshold hypomanic symptoms at intake yielded an AUC of 0.62 [19]. Another retrospective study contrasting 812 mood disorder cases in a specialty care setting reported an AUC of 0.72 on the basis of features such as earlier age of onset, family history of BPD, and cyclothymic temperament [9]. A small study (n = 52) of young patients attending an outpatient clinic found that severity of depression yielded an AUC of 0.72 [20]. Finally, in our own prior work using two waves from the National Epidemiologic Survey on Alcohol and Related Conditions cohort, risk factors included comorbid anxiety disorders, childhood adversity, and past-year social support disruption [21]; regression models yielded an AUC of 0.72.

We emphasize that the present results should not be viewed as a replacement for careful assessment to identify manic or hypomanic symptoms over time, the only means of diagnosing BPD reliably. Moreover, focused clinical assessments (for example, systematic use of the Hypomania Checklist [22], and the inclusion of additional risk factors such as family history of BPD) may well be able to improve upon these models. Indeed, one next-step intervention in high-risk populations emerging from our model might be application of additional time-intensive screening measures.

Limitations

Our results have multiple important limitations. First, the present study is based solely on diagnostic codes, which may be unreliable for identification of mood disorders. Undoubtedly, structured clinical assessment would decrease misclassification rates, although we note that DSM-5 field studies suggest that even gold standard measures perform poorly in mood disorders [23]. Likewise, incorporating quantitative measures of depression severity and unstructured clinical text might improve prediction; they were not included here as our intention was to rely solely on data commonly available across health systems and easy to parse. In particular, in future work narrative notes may be valuable in elucidating the rationale for change in diagnosis.

As such, we cannot state definitively that our results reflect transition from MDD to BPD. Rather, they capture individuals who are considered by their physician to have a diagnosis of depressive disorder and who are subsequently considered by their physician to have a diagnosis of BPD, a face-valid, if imprecise, distinction. Moreover, each patient might visit different clinicians: in such scenarios, a change in diagnosis might also be due to a low degree of concordance between different clinicians. While it is easy to dismiss administrative claims, we note that such claims reflect real-world clinician judgment and consequent actions.

A further limitation is the possibility that some apparent predictors may represent confounding by indication. We exclude mood stabilizing medications, for example, but not all antipsychotics; it is certainly possible that clinicians may document one diagnosis while treating another or “hedging their bets”, for example, by using an atypical antipsychotic for insomnia. However, we would argue that identifying patients in whom clinician concern may be manifest but not documented is still potentially useful. Moreover, secondary analyses excluding quetiapine do not yield meaningfully different results (see Supplementary Fig. 5).

Finally, we recognize that these models yield AUCs less than the 0.8 threshold often mischaracterized by reviewers as representing good discrimination [24]. At minimum, they represent a starting point to be improved upon by addition of further measures or other modeling approaches. Lift histograms in Fig. 3 illustrate the models’ potential for identifying patients with higher-than-average risk of BPD; such information could potentially be used to better allocate limited resources. Moreover, our plots of positive and negative predictive value in Fig. 4 suggest that it may be possible to apply our models, with appropriate re-calibration in a given health system, to enrich for high-risk populations, even if they are not strongly discriminative.

Strengths

We also note key strengths. First, these models utilize data readily available at time of medication prescription, such that they would be straightforward to deploy in an electronic health record as part of clinical decision support. Second, they reflect general clinical practice, rather than less generalizable clinical trial populations more typically studied to derive predictors. As tempting as it is to dismiss claims-based diagnosis, the uncomfortable reality is that we identify individuals in whom a clinician made one diagnosis, and the diagnosis was subsequently changed by that clinician, or by a new clinician. Finally, in demonstrating that these models replicate in a second health system, with modest diminution in discrimination, our results suggest the capacity to generalize more broadly across health systems.

The fact that BPD may not be apparent early in the course of a depressive disorder remains a source of anxiety for both clinicians and patients. Under-recognition of bipolar disorder has been associated with a significant increase in health care cost [25], though this must be balanced against the cost of undertreatment among individuals with MDD. Antidepressant prescription in BPD is a circumstance where iatrogenic injury is possible, precipitating worsening depressive symptoms, cycling, or switch to mania or mixed states in vulnerable patients. While it may be impossible to entirely avoid such circumstances, models such as those presented here provide a starting point for efforts to minimize this rare, but feared, treatment outcome.

Funding and disclosure

This work was funded by Oracle Labs, Harvard SEAS, the Harvard Data Science Initiative, and a grant from the National Institute of Mental Health (grant no. 1R01MH106577). The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication. RHP holds equity in Psy Therapeutics and Outermost Therapeutics; serves on the scientific advisory boards of Genomind and Takeda; and consults to RID Ventures. RHP receives research funding from NIMH, NHLBI, NHGRI, and Telefonica Alfa. RHP is an associate editor for JAMA Network Open. THM receives research funding from the Stanley Center at the Broad Institute, the Brain and Behavior Research Foundation, National Institute of Aging, and Telefonica Alfa. FDV consults for Davita Kidney Care and Google Health. The other authors have no disclosures to report.