AI-assisted prediction of differential response to antidepressant classes using electronic health records

Sheu, Yi-han; Magdamo, Colin; Miller, Matthew; Das, Sudeshna; Blacker, Deborah; Smoller, Jordan W.

doi:10.1038/s41746-023-00817-8

Download PDF

Article
Open access
Published: 26 April 2023

AI-assisted prediction of differential response to antidepressant classes using electronic health records

npj Digital Medicine volume 6, Article number: 73 (2023) Cite this article

5066 Accesses
11 Citations
90 Altmetric
Metrics details

Subjects

Abstract

Antidepressant selection is largely a trial-and-error process. We used electronic health record (EHR) data and artificial intelligence (AI) to predict response to four antidepressants classes (SSRI, SNRI, bupropion, and mirtazapine) 4 to 12 weeks after antidepressant initiation. The final data set comprised 17,556 patients. Predictors were derived from both structured and unstructured EHR data and models accounted for features predictive of treatment selection to minimize confounding by indication. Outcome labels were derived through expert chart review and AI-automated imputation. Regularized generalized linear model (GLM), random forest, gradient boosting machine (GBM), and deep neural network (DNN) models were trained and their performance compared. Predictor importance scores were derived using SHapley Additive exPlanations (SHAP). All models demonstrated similarly good prediction performance (AUROCs ≥ 0.70, AUPRCs ≥ 0.68). The models can estimate differential treatment response probabilities both between patients and between antidepressant classes for the same patient. In addition, patient-specific factors driving response probabilities for each antidepressant class can be generated. We show that antidepressant response can be accurately predicted from real-world EHR data with AI modeling, and our approach could inform further development of clinical decision support systems for more effective treatment selection.

Predicting change in diagnosis from major depression to bipolar disorder after antidepressant initiation

Article 14 September 2020

Predicting treatment dropout after antidepressant initiation

Article Open access 06 February 2020

Optimizing prediction of response to antidepressant medications using machine learning and integrated genetic, clinical, and demographic data

Article Open access 08 July 2021

Introduction

Depression is a common and often disabling psychiatric condition¹. According to the Centers for Disease Control and Prevention (CDC, USA), more than 10% of adults are prescribed antidepressants within any given 30-day window², making antidepressants one of the most commonly used categories of medications. The American Psychiatric Association guidelines³ for treatment of major depressive disorder (MDD) suggests four classes of first-line antidepressant medications: selective serotonin reuptake inhibitors (SSRIs), serotonin-norepinephrine reuptake inhibitors (SNRIs), mirtazapine, and bupropion. Unfortunately, identifying the most effective treatment for a given patient remains a trial-and-error proposition. As antidepressant effectiveness may not be evident before 4–12 weeks, each inadequate trial can incur prolonged morbidity, disability and exposure to adverse effects as well as substantial healthcare costs.

Although average response rates are similar across different antidepressant classes⁴, individual response can vary widely in clinical practice. Currently, clinicians initiating antidepressant treatment have few tools or evidence-based predictors on which to rely. A major goal of precision psychiatry is to optimize treatment matching using patient-specific profiles. The growing availability of large-scale health data such as electronic health records (EHRs) coupled with advances in machine learning offer new opportunities for developing clinical decision support tools that may address this challenge^5,6,7,8.

That said, setting up an accurate and scalable system to guide antidepressant selection poses specific challenges. First, patient characteristics that may contribute to response prediction, such as depression symptomatology, are not readily codified. Second, the gold standard for characterizing antidepressant response from the EHR remains expert chart review, which is labor- and time-intensive. Lastly, in contrast to modeling predictions under the naturalistic, observational setting, modeling for interventional recommendations requires additional consideration of potential confounds that could arise from non-random treatment assignments. While a handful of studies have modeled prediction of antidepressant response using data specifically collected for research such as brain imaging or EEG, sample sizes have been typically modest and these approaches can be costly and difficult to scale^{9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42}. Ideally, a clinically useful model would enable accurate prediction of antidepressant response, comparative predicted response for alternative treatment choices, and control for potential confounding – in particular, confounding by indication – that is, pretreatment factors associated with both the propensity to choose an antidepressant and treatment response.

To address these limitations, we developed a machine learning (ML) pipeline to predict antidepressant response using real-world, large-scale EHR data. The pipeline incorporates the following features: (1) a battery of prediction models with different levels of complexity, ranging from linear to highly non-linear models such as deep neural networks⁴³ (DNNs, often referred to as “AI”), which enables comparative model performance; (2) an AI-based natural language processing (NLP) proxy labeling system, as a complement to expert label curation, to enable scalable model training; (3) confounding control through explicit inclusion of potential confounders among the predictors; (4) direct incorporation of unstructured data (i.e., clinical notes) as an additional predictor component to maximize use of information contained in EHRs.

We use the ML pipeline to address the following scientific questions: (1) Can clinical data routinely obtained before initial treatment with an antidepressant predict outcomes within 4–12 weeks after initiation? (2) Can such information predict which class of medication would work better for a particular patient? (3) What might be the best model architecture to do so? As most antidepressant prescriptions are initiated in non-psychiatric settings⁴⁴, we focused on patients for whom treatment was started by a non-psychiatrist physician (e.g. primary care physician).

See Supplementary Note 1 for brief description of DNN and the DNN classes we used in this study.

Results

Patient characteristics

The EHR data query for the period of 1990–2018 retrieved 111,563 adult patients who had at least one International Classification of Diseases (ICD) code for depression and received a new antidepressant prescription at the same visit. After applying our exclusion criteria, a total of 17,556 patients were included in the analysis (Fig. 1). We annotated 3600 patients (details provided in the Methods section) with expert-curated outcome labels (46% labeled positive) and the remainder with imputed labels (42% labeled positive). Patient characteristics, overall and stratified by the four antidepressant classes, are provided in Table 1.

**Fig. 1: Flowchart for sample selection after sequential application of exclusion criteria.**

Table 1 Patient characteristics overall and by antidepressant class.

Full size table

Overall, the female/male ratio was 2:1 (66% female), consistent with the known gender ratio of diagnosed depression⁴⁵. The mean (SD) age of the patients was 50.04 (17.59) at the date of index visit. In general, patients started on mirtazapine were older, more likely to be male, and to have had a greater burden of medical illness compared to the other three groups. The bupropion group was slightly younger. There was only a slight variation in depression-related mental symptoms across the four groups (Table 1).

Model performance metrics

Table 2 reports the point estimates for model performance on the full test set across each setting (the full table with confidence intervals is provided as Supplementary Table 1). All models achieved an area under the receiver operating characteristic curve (AUROC) of at least 70% and an area under the precision-recall curve (AUPRC) of at least 68%, indicating good overall model discrimination. Among the four feed-forward deep learning models, three that included either i) the treatment response likelihood score (a decimal number between 0-1 representing a preliminary evaluation of response probability by a DNN model using clinical notes in a three-year window prior to the index visit) as a predictor or ii) the imputed labels, or iii) both (see Methods for further details) had higher AUROC point estimates than the one without both the treatment response likelihood score and the imputed labels (AUROC for the first three models = 74% and AUROC = 70% for the last model).

Table 2 Model performance metrics for antidepressant treatment response prediction.

Full size table

Given that no model tested was clearly superior across all metrics, and for other reasons discussed later (see Discussion), we selected a “representative model” and report additional model characteristics of this model for the purpose of discussion and illustration. One model that performed well and has several advantages (in terms of extensibility and flexibility) is the feed-forward DNN model that included both the treatment response likelihood score and imputed labels (see Methods). In addition, this model incorporates the broadest range of data among the four feed-forward DNN models and is less computationally intensive than the Transformer + feed-forward DNN model. For this model, AUPRC = 72%, positive predictive value (PPV) = 71%, negative predictive value (NPV) = 69%, sensitivity = 68%, specificity = 72%, F1 score (the harmonic mean of sensitivity and PPV) = 70%, and accuracy = 70%, based on a threshold that maximizes prediction accuracy (See Table 2). Brier score was 0.21. A calibration plot for the model is provided in Supplementary Fig. 1. ROC plots for all models are provided in Supplementary Figs. 2–6.

To contextualize the potential value of the model, we compared the overall prevalence of antidepressant response in the test set data (i.e. the base rate of response independent of applying the model) to the model-predicted response rate. We can think of the overall model-agnostic prevalence as the prior probability of response for these patients based on current practice. We can then compare this to the PPVs obtained with each model in the test data (as a kind of counterfactual response rate had the model been available). For example, the PPV observed with our representative model was 71%, which is statistically significantly higher than the base rate of response (50%) by a two-tailed test for two binomial proportions (z = 7.44, p < 0.00001, 95% CI of the difference in the two proportions = (0.16, 0.26)).

Supplementary Table 2 shows performance metrics for the representative model after patient stratification by age, number of co-morbidities, counts of depression-related symptoms, and antidepressant class initiated. Performance of the representative model was generally consistent for all model metrics across most strata. Figure 2 shows the most important predictors overall for the representative model by global sHapley Additive explanation (SHAP) values⁴⁶ (i.e., mean of absolute local SHAP values. Supplementary Fig. 8 shows mean local SHAP values of these important predictors to provide additional information on directionality of feature effects). The top three predictive variables by global SHAP values were the number of co-occurring medications, the treatment response likelihood score, and number of references to depressed mood and anhedonia in clinical notes. The appearance of SNRI initiation among the top 10 predictors confirms that choice of specific antidepressant class is an important contributor to likelihood of responding to an antidepressant.

**Fig. 2: Top 15 features by global SHAP score.**

Visualization of clinical utility

To illustrate the kind of information the models can provide, Fig. 3 shows actual and alternative test set predictions and local SHAP predictor importance (i.e., predictor importance for a given patient) for illustrative patients using the representative model (feed-forward DNN). Figure 3a demonstrates predicted response for each antidepressant for three patients drawn from strata that are modeled to be moderately likely (Patient 1), highly likely (Patient 2), and less likely (Patient 3) to respond (see Methods), respectively, to the actual antidepressant class prescribed. Overall, there is substantial variation in predicted treatment response among these patients to any of the four antidepressant classes (e.g., ~80 for Patient 2 and ~0.25 for Patient 3). Within an individual patient, we see more modest variation in predicted response to alternative antidepressants. For example, Patient 1 had a 66% probability of a positive response to an SSRI (the class that was actually prescribed in this case) but only a 51% probability of response had an SNRI been chosen. Figure 3b shows the corresponding local SHAP scores for each initiation scenario for Patient 1. The visualization (“force plot”⁴⁷) displays the relative importance of each predictor to the response prediction for a given antidepressant class. The bolded values are the SHAP model’s overall response prediction for each antidepressant. (Note: the SHAP estimates differ slightly from the DNN estimates (e.g., 0.67 vs 0.66 for SSRI) because they reflect SHAP as a linear estimator of the DNN model; the relative ranking of predicted responses and the feature importance are the same in both models). For each antidepressant, the visualization shows how the predicted probability of response would change given changes in the values of the predictors. Thus, for example, the patient’s response to an SSRI would be predicted to be higher had they not have a diagnosis of alcohol use disorder.

**Fig. 3: Illustration of differential treatment response predictions and local predictor importance for different treatment scenarios.**

Discussion

In this study, we developed an AI-assisted machine learning pipeline for antidepressant treatment response prediction using large-scale, real world healthcare data. We systematically examined 14 machine learning models that differed in complexity. The models generated from the pipeline performed generally well by test set metrics, with AUROCs of at least 0.70 and AUPRCs of 0.68 or higher. As a representative example, a feed-forward DNN achieved AUROC of 0.74, and PPV of 0.71 at a threshold that maximized model accuracy. These results show that through appropriate modeling, readily available clinical data (even without formal psychiatric assessment) may provide a basis for the future development of clinical decision support for antidepressant treatment selection. Moreover, we demonstrated how the pipeline could be applied in a clinical context by modeling outcomes for the same patient under each alternative antidepressant treatment scenarios. Using SHAP, we also illustrate the model’s interpretability by displaying the relative importance of predictors to the overall model and to an individual patient’s response to alternative prescription choices. No model was clearly superior to all others based on the full range of performance metrics in the current study. In practice, users might favor some models over others based on other considerations including model interpretability, computational complexity, extensibility (i.e., capability of incorporating additional data modalities), flexibility, or transportability.

A number of prior studies have explored the potential for antidepressant response prediction using a variety of data sources, though all have had limitations. A meta-analysis of 20 studies using some type of machine learning algorithm found none based on large-scale healthcare data and none with a sample size approaching the current study³¹. In addition, in many cases, model performance metrics were either missing or not fully reported. Most studies of antidepressant prediction algorithms have relied on a limited range of predictors, derived largely from demographic variables and symptom scales or have been based on biomarkers (neuroimaging, EEG, or genomics) that are not widely available in clinical practice³⁰.

In the feed-forward DNN model we used for illustration, the top 15 features by mean absolute SHAP scores (Fig. 2) include indicators of illness burden as well as specific depression-related symptoms. The top feature was the count of medication prescriptions in the previous 90 days, a possible indicator of psychiatric and medical comorbidity burden. The second-most important feature, the three-year response likelihood score, is a transformation of a patient’s medical history over the prior three years as captured in clinical notes. Several other high-ranking features reflect a range of core depressive symptoms (depressive mood, anhedonia, loss of energy, suicidal tendencies, insomnia) or commonly comorbid conditions that complicate treatment of depression (substance use disorder, anxiety disorder). Patient-specific profiles of these features are commonly considered by clinicians in selecting among classes of antidepressants; as such, they provide some reassurance that influential features selected by the model are consistent with patient factors presumed to be clinically-relevant.

Our study has a number of strengths. First, we performed a systematic examination of models of varying complexity, evaluating the impact of AI-imputed labels and alternative strategies for feature engineering. We demonstrated that AI-assisted label imputation provides a highly scalable alternative to manual chart curation that can otherwise be rate-limiting for model development. Second, we empirically tested the effect on prediction performance of including novel features, including AI-generated vectorized notes and a “treatment response likelihood score” derived from clinical notes. Our motivation for developing these features was the expectation that efficiently incorporating extensive information from longitudinal narrative notes could be helpful. However, in the current study, neither the addition of the 3-year treatment response likelihood score, nor directly using 3-year vectorized notes as features substantially enhanced performance (although the feed-forward DNN models using the response score did show numerical improvement in AUROC). Third, we controlled for confounding by indication by explicitly incorporating potential treatment selection features that might affect treatment response into feature engineering. This allowed us to address the clinically-relevant question for treatment selection: to which major antidepressant classes is a given patient most likely to have a therapeutic response? Finally, the ability to identify which features are most important for predicted response enhances interpretability and gives clinicians insight into individual patient characteristics that are informing the model output.

Our results should also be interpreted in light of several limitations. First, models were trained on labels derived from clinician-documented information on patients’ antidepressant response and augmented with AI-imputed labels. As in any model, misclassification of outcomes used for labeling can impair model accuracy. In addition, EHRs inevitably have missing data, (e.g., patients may receive some of their care outside of the healthcare system), and the model does not explicitly account for effects of concurrent psychotherapy; such missing data may have reduced the predictive performance of the models. Also, while the inclusion of 20 years of EHR can be strength, secular trends in clinician prescribing or documentation practices over this span may have adverse affected model performance. Moreover, as with in any observational setting, there may be residual confounding. However, our approach should minimize confounding by controlling for a broad range of predictors that might influence both antidepressant selection and treatment outcome.

Of note, we applied two data requirements that may impact the generalizability of the model. First, we required at least one clinical note within the 4–12 week period after initiating an antidepressant, which may limit the model’s performance for patients not seen within that window. In training the model, a substantial proportion (70%) of patients started on an antidepressants lacked a clinical note within the following 4–12 weeks. This window was chosen in part because treatment guidelines suggest that 4–8 weeks are typically needed before concluding that an antidepressant is not effective³. Beyond 12 weeks, patients may have been more likely to have additional interventions (e.g., antidepressant augmentation or switching) that could complicate outcome assessment. In addition, long prediction windows (e.g. many months) would be less suited to our goal of enhancing expeditious selection of effective treatments. Second, we included only patients with at least one note within 90 days prior to initiating an antidepressant. This was done to minimize confounding by indication as recent symptom severity is associated with clinician antidepressant choice⁴⁸ and also likely to be an important predictor of the outcome (i.e., antidepressant response). To address this, we used recent clinical status (derived by NLP of clinical notes) proximal to the treatment decision to adjust for propensity to select a given treatment. While the 90-day cut-off for “recent” was of necessity arbitrary, we chose it to balance the goal of assessing symptoms proximal to treatment initiation while not overly restricting the window for including such information.

In sum, we present a novel computational pipeline, based on real-world EHR data, for predicting differential response to the most commonly used classes of antidepressants. The resulting models achieved good accuracy, discrimination, and positive predictive value. The models’ ability to predict and compare responses across antidepressant classes could prove valuable for further efforts aiming to provide clinical decision support for prescribers. The approach we demonstrate here could also be adapted to a wide variety of other clinical applications for optimizing and individualizing treatment selection.

Methods

Institutional review board approval

All procedures were approved by the Institutional Review Board of Mass General Brigham (MGB) Healthcare System (Boston, Massachusetts, USA, Protocol 2018P000765), with a waiver of consent for the analysis of electronic health record data.

Data source

The data for the study were extracted from the Research Patient Data Registry (RPDR)⁴⁹ of the MGB Healthcare System. The RPDR is a centralized clinical data registry that gathers clinical information from the MGB system. The RPDR database includes more than 7 million patients with over 3 billion records seen across seven hospitals, including two major teaching hospitals: Massachusetts General Hospital and Brigham and Women’s Hospital. Clinical data recorded in the RPDR includes detailed patient information, encounter meta-data (e.g., time, location, provider, etc.), demographics, diagnoses, laboratory tests, medications, providers, procedures, radiology tests, reasons for visits, and narrative clinical notes⁴⁹.

Study population

EHR data spanning January 1990 to August 2018 were obtained for adult patients (age ≥ 18 years) with at least one visit with a diagnostic ICD code for a depressive disorder (defined as ICD-9-CM: 296.20–6, 296.30–6, and 311; ICD-10-CM: F32.0–9, F33.0–9) co-occurring with an antidepressant prescription, and at least one ICD code for non-recurrent depression (ICD-9-CM: 296.20–6 and 311; ICD-10-CM: F32.0–9) any time during their history. The first visit with an antidepressant prescription is defined as the “index visit” for each patient. Patients were excluded if: (1) the antidepressant prescription was initiated by a psychiatrist, as we focused on patients initiated by non-psychiatrists in this study; (2) initiated antidepressants not among the four classes of interest, or initiated more than one antidepressant; (3) no clinical notes or visit details were available in the 90 days prior to the index visit date or within 4–12 weeks after the index visit date; (4) first prescription occurred before 1997, the year during which use of the latest antidepressant category (mirtazapine) began; or (5) the patient had a diagnosis of bipolar disorder, schizophrenia, or schizoaffective disorder at or prior to the index visit as antidepressant treatment is less likely to be initiated by non-psychiatrists and depression in the context of bipolar or psychotic disorders might introduce heterogeneity in treatment response. We utilized data from all patients available that met the above inclusion/exclusion criteria in our EHR database. Details of the stepwise sample selection procedure are shown in Fig. 1.

Constructing the outcome labels

For each patient, notes within the outcome window (4–12 weeks after the index visit) were concatenated as a “note set.” One of the authors (YHS), a psychiatrist, randomly sampled 3600 note sets and manually labeled them into two categories. One category was evidence of improvement, based on the presence of notes within the time window indicating that the patient’s mood was improving, such as “depression is well controlled” or “mood-wise, the patient felt a lot better.” The second category was no evidence of improvement, based on either notes stating the patient’s mood is not improving or is worsening, no documentation of mood status, or evidence that mood status could not be addressed due to medical status (e.g., consciousness change). In note sets where mood status was discussed more than once, the most recent status was taken for labeling. These labels were then incorporated in a deep learning-based text classification model trained on the full set of clinical notes to impute labels for the remaining patients (N = 13,956) as described previously^50,51. In brief, we trained a deep learning model to “emulate” manual chart review and used it to impute labels for notes that were not manually reviewed. The model is a modified version of one described in Sheu and colleagues⁵¹ and the accuracy of the model is 79% (based on a test set where 45% of the samples are positive).

Constructing the predictors

Since there is substantial uncertainty about which factors predict response to antidepressants, we constructed a broad set of possible predictors based on prior literature of antidepressant treatment^1,52,53 as well as interviews with clinicians to identify demographic and clinical factors thought to influence treatment selection. The predictor set was designed to capture factors that might be correlated with “confounding by indication” so that we could control for such confounding in the analyses. The final predictor set include three components: (1) “Structured predictors (Supplementary Table 3),” which included demographics, history of diagnoses preceding the index visit (as binary indicators) as well as counts of concurrent medication prescriptions, NSAID prescriptions, and depressive symptoms mentioned in clinic notes at the index visit and up to 90 days prior. Because depressive symptoms are not readily captured in structured EHR data, these were extracted from clinical notes through application of a set of NLP rules, as previously described⁴⁸. Briefly, these features were extracted and constructed using a hierarchical approach consisting of the following four levels (from highest to lowest): 1. categories of depression-related symptoms (e.g., “depression and anhedonia”); 2. concepts within these categories; 3. specific terms used to describe these concepts; and 4. lexical derivatives and regular expressions (i.e., strings in the form the computer program reads) of the specific terms. Terms extraction (with the handling of negation) was performed with matching regular expressions on the clinical notes for each patient, which were then organized following the hierarchy defined above. The final features built that were included in the predictors were the number of concepts present per category. We also included the antidepressant class initiated at the index visit (as one-hot encoding variables contrasting SNRI, bupropion or mirtazapine vs. SSRI as the reference class). The selection and processing for the structured predictors is described in more detail in Sheu et al. 2022;⁴⁸ (2) “Treatment response likelihood score,” a decimal number between 0-1 representing a preliminary evaluation of response probability by a DNN model, constructed by processing the three-year note sets (i.e., clinical notes in a three-year window prior to the index visit) through a DNN model (“Longformer”)⁵⁰, which was trained on the manually-curated outcome labels (in the training set, as described below). The score was then used as a feature in each of the models we tested (with the exception of the Transformer + feed-forward DNN model); and (3) “Vectorized clinical notes,” instead of a single score, a numerical vector representing the same set of clinical notes as mentioned in (2).

Among 3600 patients with expert-curated labels, we randomly sampled 300 as a hold-out validation set for model tuning, and 600 as a hold-out test set. All performance results are reported based on the hold-out test set.

Models for treatment response prediction

We compared five model architectures for prediction (Fig. 4a): (1) regularized generalized linear model (GLM); (2) random forest;⁵⁴ (3) gradient boosting machine (GBM);⁵⁵ (4) feed-forward DNN (with 4 hidden layers); and (5) Transformer⁵⁶ + feed-forward DNN (Fig. 3b). The main difference between models (4) and (5) is that in model (5), we applied an additional “Transformer” component that takes in the structured predictors plus the vectorized notes as inputs (Fig. 4b). For models (1)–(4), we examined performance with and without the 3-year treatment response likelihood score. For all models, we also evaluated whether or not the imputed labels enhanced model performance. All models were trained on a fixed training set, and hyperparameters were tuned to maximize AUROC on the hold-out validation set. For the GLM, we tuned the amount of regularization (lambda), and set alpha at 0.5 (i.e., an elastic net⁵⁷ with both L1 and L2 regularization). For the GBM and random forest models, we tuned on the complexity of the base learners (tree depth) and the total number of trees grown. For deep learning models (model classes (4) and (5)), we tuned on learning rate, dropout probability, and a weight factor that discounts the contribution of patients with imputed labels, based on the confidence of the imputation. We developed the non-deep learning models with R’s h2o package⁵⁸, and the deep learning models with Python, PyTorch⁵⁹, PyTorch-lightning⁶⁰, Huggingface Transformers⁶¹, and SimpleTransformers⁶². More details regarding model architectures and training processes are provided in Supplementary Methods.

**Fig. 4: Diagrams for prediction modeling designs.**

Assessment of model performance: discrimination and calibration

For all models, we report AUROC, AUPRC, sensitivity, specificity, PPV, NPV, F1 score and accuracy on the test set. Threshold-dependent metrics are reported for the threshold that maximizes accuracy. Although all models performed similarly as shown in Results, we selected a feed-forward DNN as a “representative model” for discussion based on overall consideration of model performance, complexity, flexibility, extensibility, and data representativeness. For this model, we also calculated Brier score, and produced the corresponding calibration plot. To determine if the estimates the model provides are meaningfully different to the current practice, the PPV of the representative model is compared to the base prevalence of response in the test set by a two-tailed Z-test for two proportions and the corresponding confidence interval for the difference in proportions. The base prevalence of response represents the prior probability of response based on clinician judgement.

Prediction of relative treatment response probability for different antidepressant classes

For any given models tested, four probabilities (one corresponding to each antidepressant class) can be derived for each patient. To illustrate the differential probabilities of antidepressant responses the resulting models can provide at the time when an antidepressant is initiated, we (post-hoc) split the test set by predicted probability of response to a given antidepressant class into high (>70%), medium (40–70%), and low (<40%) using the representative model, based on the actual antidepressant class prescribed. We then sample one patient from each stratum and report the estimated probability of improvement had the patients been treated with each of the other three classes of antidepressants. In this way, we are able to (1) assess whether a patient would response to first-line antidepressants in general; and (2) compare predicted responses to each of the four antidepressant classes for a given patient.

Global and local predictor importance

For the representative model, we identify which predictors are most important to the overall prediction model (using global SHAP, i.e., mean of absolute local SHAP values)⁴⁶. We also calculate mean SHAP values to assess the overall directionality of the effect of each top feature. Lastly, we illustrate how the model can identify the most important predictors for selecting among the alternative antidepressant classes for a given patient (using the local SHAP metric).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

Protected Health Information restrictions apply to the availability of the clinical data here, which were used under IRB approval for use only in the current study. As a result, this dataset is not publicly available. Qualified researchers affiliated with the Mass General Brigham (MGB) may apply for access to these data the MGB EHR data repository (RPDR) through the MGB Institutional Review Board.

Code availability

Software packages utilized and cited in this study are available online on their respective websites (h2o for R:⁵⁸ https://cran.r-project.org/web/packages/h2o/index.html; Huggingface Transformers:⁶¹ https://github.com/huggingface/transformers; SimpleTransformers:⁶² https://simpletransformers.ai; SHAP:⁶³ https://github.com/slundberg/shap). Code developed for training of treatment response prediction models can be made available upon reasonable request.

References

Hasin, D. S. et al. Epidemiology of adult DSM-5 major depressive disorder and its specifiers in the United States. JAMA Psychiatry 75, 336–346 (2018).
Article PubMed PubMed Central Google Scholar
Pratt, L. A., Brody, D. J. & Gu, Q. Antidepressant use among persons aged 12 and over: United States, 2011-2014. NCHS Data Brief 283, 1–8 (2017).
Gelenberg, A. J. et al. Practice guideline for the treatment of patients with major depressive disorder (American Psychiatric Association, 2010).
Park, L. T. & Zarate, C. A. Depression in the primary care setting. N. Engl. J. Med. 380, 559–568 (2019).
Article PubMed PubMed Central Google Scholar
Su, C. et al. Machine learning for suicide risk prediction in children and adolescents with electronic health records. Transl. Psychiatry 10, 413 (2020).
Article PubMed PubMed Central Google Scholar
Tomašev, N. et al. Use of deep learning to develop continuous-risk models for adverse event prediction from electronic health records. Nat. Protoc. 16, 2765–2787 (2021).
Article PubMed Google Scholar
Simon, G. E. et al. Predicting suicide attempts and suicide deaths following outpatient visits using electronic health records. Am. J. Psychiatry 175, 951–960 (2018).
Article PubMed PubMed Central Google Scholar
Tomašev, N. et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 572, 116–119 (2019).
Article PubMed PubMed Central Google Scholar
Jaworska, N., de la Salle, S., Ibrahim, M.-H., Blier, P. & Knott, V. Leveraging machine learning approaches for predicting antidepressant treatment response using electroencephalography (EEG) and clinical data. Front. Psychiatry 9, 768 (2018).
Article PubMed Google Scholar
Köhler-Forsberg, K. et al. Predicting treatment outcome in major depressive disorder using serotonin 4 receptor pet brain imaging, functional MRI, cognitive-, EEG-based, and peripheral biomarkers: a NeuroPharm open label clinical trial protocol. Front. Psychiatry 11, 641 (2020).
Article PubMed PubMed Central Google Scholar
Wu, W. et al. An electroencephalographic signature predicts antidepressant response in major depression. Nat. Biotechnol. 38, 439–447 (2020).
Article CAS PubMed PubMed Central Google Scholar
Lin, E. et al. A deep learning approach for predicting antidepressant response in major depression using clinical and genetic biomarkers. Front. Psychiatry 9, 290 (2018).
Article PubMed PubMed Central Google Scholar
Klöbl, M. et al. Predicting antidepressant citalopram treatment response via changes in brain functional connectivity after acute intravenous challenge. Front. Comput. Neurosci. 14, 554186 (2020).
Article PubMed PubMed Central Google Scholar
Chang, B. et al. Arpnet: antidepressant response prediction network for major depressive disorder. Genes 10, 907 (2019).
Article CAS PubMed PubMed Central Google Scholar
Zhdanov, A. et al. Use of machine learning for predicting escitalopram treatment outcome from electroencephalography recordings in adult patients with depression. JAMA Netw. Open 3, e1918377 (2020).
Article PubMed PubMed Central Google Scholar
Pei, C. et al. Ensemble learning for early-response prediction of antidepressant treatment in major depressive disorder. J. Magn. Reson. Imaging 52, 161–171 (2020).
Article PubMed Google Scholar
Athreya, A. P. et al. Prediction of short-term antidepressant response using probabilistic graphical models with replication across multiple drugs and treatment settings. Neuropsychopharmacology 46, 1272–1282 (2021).
Article PubMed PubMed Central Google Scholar
Crane, N. A. et al. Multidimensional prediction of treatment response to antidepressants with cognitive control and functional MRI. Brain 140, 472–486 (2017).
Article PubMed PubMed Central Google Scholar
Cohen, S. E., Zantvoord, J. B., Wezenberg, B. N., Bockting, C. L. H. & van Wingen, G. A. Magnetic resonance imaging for individual prediction of treatment response in major depressive disorder: a systematic review and meta-analysis. Transl. Psychiatry 11, 168 (2021).
Article PubMed PubMed Central Google Scholar
Lin, E. et al. Prediction of antidepressant treatment response and remission using an ensemble machine learning framework. Pharmaceuticals 13, 305 (2020).
Article CAS PubMed PubMed Central Google Scholar
Hughes, M. C. et al. Assessment of a prediction model for antidepressant treatment stability using supervised topic models. JAMA Netw. Open 3, e205308 (2020).
Article PubMed PubMed Central Google Scholar
Langenecker, S. A. et al. Multidimensional imaging techniques for prediction of treatment response in major depressive disorder. Prog. Neuropsychopharmacol. Biol. Psychiatry 91, 38–48 (2019).
Article PubMed Google Scholar
Lee, H. S., Baik, S. Y., Kim, Y.-W., Kim, J.-Y. & Lee, S.-H. Prediction of antidepressant treatment outcome using event-related potential in patients with major depressive disorder. Diagnostics 10, 276 (2020).
Article CAS PubMed PubMed Central Google Scholar
Kautzky, A. et al. Combining machine learning algorithms for prediction of antidepressant treatment response. Acta Psychiatr. Scand. 143, 36–49 (2021).
Article CAS PubMed Google Scholar
Xiao, H. et al. Functional connectivity of the hippocampus in predicting early antidepressant efficacy in patients with major depressive disorder. J. Affect. Disord. 291, 315–321 (2021).
Article PubMed Google Scholar
Preuss, A. et al. SSRI treatment response prediction in depression based on brain activation by emotional stimuli. Front. Psychiatry 11, 538393 (2020).
Article PubMed PubMed Central Google Scholar
Kong, Y. et al. Spatio-temporal graph convolutional network for diagnosis and treatment response prediction of major depressive disorder from functional connectivity. Hum. Brain Mapp. 42, 3922–3933 (2021).
Article PubMed PubMed Central Google Scholar
Tian, S. et al. Predicting escitalopram monotherapy response in depression: the role of anterior cingulate cortex. Hum. Brain Mapp. 41, 1249–1260 (2020).
Article PubMed Google Scholar
Long, Z. et al. Prediction on treatment improvement in depression with resting state connectivity: a coordinate-based meta-analysis. J. Affect. Disord. 276, 62–68 (2020).
Article PubMed Google Scholar
Chekroud, A. M. et al. The promise of machine learning in predicting treatment outcomes in psychiatry. World Psychiatry 20, 154–170 (2021).
Article PubMed PubMed Central Google Scholar
Lee, Y. et al. Applications of machine learning algorithms to predict therapeutic outcomes in depression: a meta-analysis and systematic review. J. Affect. Disord. 241, 519–532 (2018).
Article PubMed Google Scholar
Taliaz, D. et al. Optimizing prediction of response to antidepressant medications using machine learning and integrated genetic, clinical, and demographic data. Transl. Psychiatry 11, 381 (2021).
Article CAS PubMed PubMed Central Google Scholar
Mehltretter, J. et al. Analysis of features selected by a deep learning model for differential treatment selection in depression. Front. Artif. Intell. 2, 31 (2019).
Article PubMed Google Scholar
Peng, Y. et al. Electroencephalographic network topologies predict antidepressant responses in patients with major depressive disorder. IEEE Trans. Neural Syst. Rehabil. Eng. 30, 2577–2588 (2022).
Article PubMed Google Scholar
Harris, J. K. et al. Predicting escitalopram treatment response from pre-treatment and early response resting state fMRI in a multi-site sample: A CAN-BIND-1 report. Neuroimage Clin. 35, 103120 (2022).
Article PubMed PubMed Central Google Scholar
Kaiser, R. H. et al. Dynamic resting-state network biomarkers of antidepressant treatment response. Biol. Psychiatry 92, 533–542 (2022).
Article CAS PubMed Google Scholar
Tsai, P.-L., Chang, H. H. & Chen, P. S. Predicting the treatment outcomes of antidepressants using a deep neural network of deep learning in drug-naïve major depressive patients. J. Pers. Med. 12, 693 (2022).
Article PubMed PubMed Central Google Scholar
Hill, K. R. et al. Measuring brain glucose metabolism in order to predict response to antidepressant or placebo: A randomized clinical trial. Neuroimage Clin. 32, 102858 (2021).
Article PubMed PubMed Central Google Scholar
Nunez, J.-J. et al. Replication of machine learning methods to predict treatment outcome with antidepressant medications in patients with major depressive disorder from STAR*D and CAN-BIND-1. PLoS One 16, e0253023 (2021).
Article CAS PubMed PubMed Central Google Scholar
Puac-Polanco, V. et al. Development of a model to predict antidepressant treatment response for depression among Veterans. Psychol. Med. 1–11 https://doi.org/10.1017/S0033291722001982 (2022).
Mehltretter, J. et al. Differential treatment benefit prediction for treatment selection in depression: a deep learning analysis of STAR*D and CO-MED Data. BioRxiv https://doi.org/10.1101/679779 (2019).
Kleinerman, A. et al. Treatment selection using prototyping in latent-space with application to depression treatment. PLoS One 16, e0258400 (2021).
Article CAS PubMed PubMed Central Google Scholar
Goodfellow, I., Bengio, Y. & Courville, A. Deep learning (The MIT Press, 2016).
Bushnell, G. A., Stürmer, T., Mack, C., Pate, V. & Miller, M. Who diagnosed and prescribed what? Using provider details to inform observational research. Pharmacoepidemiol. Drug Saf. 27, 1422–1426 (2018).
Article PubMed PubMed Central Google Scholar
Salk, R. H., Hyde, J. S. & Abramson, L. Y. Gender differences in depression in representative national samples: Meta-analyses of diagnoses and symptoms. Psychol. Bull. 143, 783–822 (2017).
Article PubMed PubMed Central Google Scholar
Lundberg, S. M. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems (eds. Guyon, I. et al.) vol. 30 (Curran Associates, Inc., 2017).
Lundberg, S. M. et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat. Biomed. Eng. 2, 749–760 (2018).
Article PubMed PubMed Central Google Scholar
Sheu, Y.-H., Magdamo, C., Miller, M., Smoller, J. W. & Blacker, D. Initial antidepressant choice by non-psychiatrists: Learning from large-scale electronic health records. Gen. Hosp. Psychiatry 81, 22–31 (2023).
Article PubMed Google Scholar
MassGeneralBrigham Healthcare System. Research Patient Data Registry (RPDR). https://rc.partners.org/about/who-we-are-risc/research-patient-data-registry (2018).
Beltagy, I., Peters, M. E. & Cohan, A. Longformer: The Long-Document Transformer. Preprint at https://arxiv.org/abs/2004.05150 (2020).
Sheu, Y. et al. Phenotyping Antidepressant Treatment Response with Deep Learning in Electronic Health Records. Preprint at https://www.medrxiv.org/content/10.1101/2021.08.04.21261512v1.full (2021).
Sadock, B. J., Sadock, V. A. & Ruiz, P. Kaplan & Sadock’s synopsis of psychiatry: behavioral sciences/clinical psychiatry (Lippincott Williams, 2015).
Stahl, S. S. Stahl’s essential psychopharmacology: neuroscientific basis and practical applications (Cambridge University Press, 2013).
Breiman, L. Random forests. Springer Science and Business Media LLC. https://doi.org/10.1023/a:1010933404324 (2001).
Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).
Article Google Scholar
Vaswani, A. et al. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems 6000–6010 (Curran Associates Inc., 2017).
Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67, 301–320 (2005).
Article Google Scholar
H2O.ai. R Interface for H2O. https://github.com/h2oai/h2o-3 (2021).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (eds. Wallach, H. et al.) vol. 32 (Curran Associates, Inc., 2019).
Falcon, W. A. PyTorch Lightning. PyTorch Lightning. https://github.com/PyTorchLightning/pytorch-lightning (2019).
Wolf, T. et al. Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 38–45 (Association for Computational Linguistics, 2020). https://doi.org/10.18653/v1/2020.emnlp-demos.6.
Rajapakse, T. C. Simple transformers. https://github.com/ThilinaRajapakse/simpletransformers (2019).
GitHub - slundberg/shap: a game theoretic approach to explain the output of any machine learning model. https://github.com/slundberg/shap (2017).

Download references

Acknowledgements

This research received no specific grant from any funding agency, commercial or not-for-profit sectors. It was conducted in part for a doctoral dissertation by Dr. Sheu who was supported in part by a scholarship (“Taiwanese Physicians Scholarship” through the Harvard T.H. Chan School of Public Health, Boston, MA, USA). Dr. Smoller and Dr. Sheu are supported in part by NIMH R01MH118233. Mr. Colin Magdamo was supported in part by NIA R01AG058063 and NIA P30AG062421.The authors thank Dr. Young-a Lee for providing helpful feedback on the illustrations presented in this paper

Author information

Authors and Affiliations

Center for Precision Psychiatry, Massachusetts General Hospital, Boston, MA, USA
Yi-han Sheu & Jordan W. Smoller
Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
Yi-han Sheu & Jordan W. Smoller
Department of Psychiatry, Massachusetts General Hospital / Harvard Medical School, Boston, MA, USA
Yi-han Sheu, Deborah Blacker & Jordan W. Smoller
Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Yi-han Sheu & Jordan W. Smoller
Department of Neurology, Massachusetts General Hospital / Harvard Medical School, Boston, MA, USA
Colin Magdamo & Sudeshna Das
Harvard Injury Control Research Center, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Matthew Miller
Bouvé College of Health Sciences, Northeastern University, Boston, MA, USA
Matthew Miller
Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Matthew Miller, Deborah Blacker & Jordan W. Smoller

Authors

Yi-han Sheu
View author publications
You can also search for this author in PubMed Google Scholar
Colin Magdamo
View author publications
You can also search for this author in PubMed Google Scholar
Matthew Miller
View author publications
You can also search for this author in PubMed Google Scholar
Sudeshna Das
View author publications
You can also search for this author in PubMed Google Scholar
Deborah Blacker
View author publications
You can also search for this author in PubMed Google Scholar
Jordan W. Smoller
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Y.-H.S. conceived of the study; Y.-H.S. and J.W.S. acquired the data for the study; Y.-H.S. and C.M. curated data and performed analyses of the study; J.W.S., D.B., and M.M. supervised analyses of the study; Y.-H.S., J.W.S., D.B., M.M. and S.D. contributed to interpretation of the analyses of the study. J.W.S. and S.D. provided computational resources for the study; Y.-H.S. wrote the first draft; and all authors revised the manuscript.

Corresponding author

Correspondence to Yi-han Sheu.

Ethics declarations

Competing interests

J.W.S. is a member of Scientific Advisory Board of Sensorium Therapeutics (with equity) and has received grant support from Biogen, Inc. He is PI of a collaborative study of the genetics of depression and bipolar disorder sponsored by 23andMe for which 23andMe provides analysis time as in-kind support but no payments. The other authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplmentary materials

REPORTING SUMMARY

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Sheu, Yh., Magdamo, C., Miller, M. et al. AI-assisted prediction of differential response to antidepressant classes using electronic health records. npj Digit. Med. 6, 73 (2023). https://doi.org/10.1038/s41746-023-00817-8

Download citation

Received: 17 October 2022
Accepted: 04 April 2023
Published: 26 April 2023
DOI: https://doi.org/10.1038/s41746-023-00817-8

This article is cited by

Revolutionizing healthcare: the role of artificial intelligence in clinical practice
- Shuroug A. Alowais
- Sahar S. Alghamdi
- Abdulkareem M. Albekairy
BMC Medical Education (2023)
Polygenic risk scores of lithium response and treatment resistance in major depressive disorder
- Ying Xiong
- Robert Karlsson
- Yi Lu
Translational Psychiatry (2023)