Prediction of short-term antidepressant response using probabilistic graphical models with replication across multiple drugs and treatment settings

Heterogeneity in the clinical presentation of major depressive disorder and response to antidepressants limits clinicians’ ability to accurately predict a specific patient’s eventual response to therapy. Validated depressive symptom profiles may be an important tool for identifying poor outcomes early in the course of treatment. To derive these symptom profiles, we first examined data from 947 depressed subjects treated with selective serotonin reuptake inhibitors (SSRIs) to delineate the heterogeneity of antidepressant response using probabilistic graphical models (PGMs). We then used unsupervised machine learning to identify specific depressive symptoms and thresholds of improvement that were predictive of antidepressant response by 4 weeks for a patient to achieve remission, response, or nonresponse by 8 weeks. Four depressive symptoms (depressed mood, guilt feelings and delusion, work and activities and psychic anxiety) and specific thresholds of change in each at 4 weeks predicted eventual outcome at 8 weeks to SSRI therapy with an average accuracy of 77% (p = 5.5E-08). The same four symptoms and prognostic thresholds derived from patients treated with SSRIs correctly predicted outcomes in 72% (p = 1.25E-05) of 1996 patients treated with other antidepressants in both inpatient and outpatient settings in independent publicly-available datasets. These predictive accuracies were higher than the accuracy of 53% for predicting SSRI response achieved using approaches that (i) incorporated only baseline clinical and sociodemographic factors, or (ii) used 4-week nonresponse status to predict likely outcomes at 8 weeks. The present findings suggest that PGMs providing interpretable predictions have the potential to enhance clinical treatment of depression and reduce the time burden associated with trials of ineffective antidepressants. Prospective trials examining this approach are forthcoming.


INTRODUCTION
Major depressive disorder (MDD) is a complex disease comprising several symptoms related to mood, capacity to derive pleasure, physical status, and cognitive functioning [1]. Despite variable efficacy rates [2], antidepressants are the most-commonly used treatments for MDD. Therapeutic responses to antidepressants can be reliably measured using validated rating scales (See Fig. 1A), which can then be used as a guide for clinical decision making [3,4]. However, there are no validated quantitative prognostic "symptom level" indicators that can be used to operationalize decisions about continuing or changing treatment based on the most-likely eventual treatment outcome. The high variability of depressive symptom presentations (See Fig. 1B) and clinical trajectories of MDD (See Fig. 1C) present formidable challenges for clinician decision making [5]. As a consequence, antidepressant treatment selection occurs on a "try-and-try-again" basis, based on lack of perceived treatment benefit by patients and clinicians [6]. Hence, there is a significant need to derive accurate and quantitatively-based prognoses of eventual treatment outcomes, given a set of measured changes in symptom severity at an intermediate timepoint [7,8], before therapeutic trials are declared to be fully complete, usually after 8 weeks of treatment [9][10][11][12].
Prior studies using STAR*D and other large datasets have investigated whether early improvements in total depression rating scale scores can be used to predict eventual treatment nonresponse [13], which would enable a change in treatment. These studies relied on the use of growth mixture models and trajectory analyses [14][15][16] which do not provide easily interpretable prognoses (prediction) of eventual treatment outcomes using specific patterns of improvement in depressive symptoms at intermediate treatment timepoints. These prior studies showed that, as would be expected, early response (i.e., >50% reduction in total depression severity scores at 4 weeks) is prognostic of response at 8 weeks, and a <20% reduction in total depression severity at 4 weeks is prognostic of nonresponse at 8 weeks. However, this observation accounts for variations in less than half the patients across the studies, and in the remaining majority of the patients, there is still significant heterogeneity in the 8-week outcomes of patients who are nonresponders at 4 weeks. Hence, the need for conditioning the likelihood of 8-week treatment outcomes on early improvements in individual depressive symptoms in conjunction with changes in total depression severity is highlighted by the observation that nearly half of nonresponders to antidepressant therapy (i.e., <50% reduction in total depression severity scores) at 4 weeks are eventual responders to therapy at 8 weeks [17].
Antidepressant response is probabilistic in nature (i.e., longitudinal variations in MDD severity and treatment outcomes vary in patients who begin treatment with the same MDD severity). Hence, we examined whether mathematical formulations such as probabilistic graphical models (PGMs) [18] that allow for reasoning under conditions of uncertainty, could thus be suitable methods to derive interpretable prognoses of antidepressant response. Fig. 1 Study overview. A Measurement-based psychiatry using validated rating scales such as the 17-item Hamilton Depression Rating Scale (HDRS) to measure severity of depression symptoms. HDRS total score is sum of severity of individual HDRS item (depressive symptom). B Heterogeneity of symptom severity in the training datasets (Mayo Clinic PGRN-AMPS and ISPC subjects) with HDRS total score of 25 at baseline. C Heterogeneity in longitudinal variations of HDRS total score in Mayo Clinic PGRN-AMPS and ISPC subjects treated with citalopram/ escitalopram. D Proposed analyses workflow to build probabilistic graphical model (PGM) and derive individualized prognoses of treatment outcomes at 8 weeks using changes in severity of focused set of depressive symptoms between baseline and after 4 weeks of antidepressant treatment.
Specifically, we used PGMs in conjunction with unsupervised machine learning methods to derive interpretable and accurate prognoses of antidepressant treatment outcomes first in a training dataset (see Fig. 1D), then through replication using other datasets. We hypothesized that a PGM-based model would result in significantly higher accuracy for the short-term prediction of response to antidepressants in adults with MDD, and achieve replications across multiple classes of antidepressants and treatment settings, compared with approaches that incorporated only baseline clinical and sociodemographic predictor variables.

Data sources
The datasets used for this study (described below and in Supplementary Methods) included subjects that met DSM-IV criteria for nonpsychotic MDD, confirmed using modules A, B (screen-only version), and D of the Structured Clinical Interview for DSM-IV (SCID). Subjects received at least 8 weeks of treatment with a study drug (see Supplementary Table 1), i.e., selective serotonin reuptake inhibitors (SSRIs), serotonin-norepinephrine reuptake inhibitors (SNRIs) or tricyclic antidepressants (TCAs). Depressive symptoms were measured using the 17-item clinician rated version of Hamilton Depression Rating Scale (HDRS) at baseline, 4 weeks, and 8 weeks. Participation in each of the studies required IRB approval at their respective institutions.
Training datasets. We used data from 947 MDD patients treated with SSRIs (citalopram/escitalopram) in two large, nonoverlapping clinical trial datasets from the Mayo Clinic Pharmacogenomics Research Network (PGRN-AMPS [19]) and the International SSRI Pharmacogenomics Consortium (ISPC [20]) to develop the PGM and derive prognoses rules.
Testing datasets. We then tested the prognostic capabilities of our model using datasets from independent cohorts of MDD patients as described: • Paroxetine, fluoxetine, sertraline (248 ISPC outpatients), or escitalopram (216 outpatients from a pooled dataset obtained from Eli Lilly and Co.); • Duloxetine (1067 outpatients from pooled datasets from Eli Lilly and Co.); and • Combination pharmacotherapy with an SSRI or SNRI plus a TCA (465 hospitalized participants in the Munich Antidepressant Response Signature [MARS [21]] Study).
Placebo data. Data from 575 patients who received a pill placebo was used for ascertaining the prognostic effects of depression symptoms that were most likely due to drug effects.

Outcomes
The categorical treatment outcomes based on HDRS total scores were remission at 8 weeks (HDRS total score ≤7), response without remission (referred to as response; a >50% reduction in HDRS total score from baseline and HDRS total score >7), and nonresponse (<50% reduction in HDRS total score from baseline).
Probabilistic graph: motivation and construction The PGM in this study was composed of states (nodes representing MDD severity) at each treatment timepoint and probabilistic transitions between states (i.e., fraction of patients moving between states of one timepoint to states of the next timepoint). To demonstrate the complexity of comprehending antidepressant response from a clinician's perspective, let the states of the PGM be N unique total HDRS scores observed at each treatment timepoint t. Then, for each treatment timepoint (t), the number of trajectories of scores is proportional to N t . As shown in Fig. 1C, such a complex array of trajectories is difficult to interpret and is of little clinical value for estimating treatment outcomes.
The strata inferred at 8 weeks (C1, C2, and C3) had acceptable clinical validity, given that all patients in the C1 stratum achieved remission and all patients in the C3 stratum were nonresponders. Eighty-seven percent of patients in the C2 stratum achieved response without remission and the remaining 13% were nonresponders.
In the absence of clustering, there were 680 unique MDD response trajectories among the 947 subjects in the training datasets (see Fig. 1C). With the use of patient clustering and stratification at each treatment timepoint, the number of MDD response trajectories reduced to a maximum of 27 paths (i.e., N = 3, and t = 3, and N t = 3 3 = 27). We then modeled the most-likely variations in depression severity along these paths for patients, starting from a given baseline stratum.
Probabilistic graph and path probabilities A hidden Markov model (HMM) with forward transitions was formulated to derive the trajectories of change in MDD severity in the training datasets. For the treatment timepoints (baseline, 4 and 8 weeks), the HMM was characterized by (1) hidden states (patient strata defined by range of total HDRS score, inferred from the study data); (2) observation states at 4 and 8 weeks (categorical response defined by HDRS total scores, based on transitions between hidden states of one timepoint to the next); and (3) forward transition probabilities (fraction of patients moving between strata of one timepoint to the next timepoint). The forward algorithm was used to derive the likelihood for all paths that originated from a given stratum at baseline, and terminated in a stratum at 8 weeks. By using the forward algorithm, we did not have to condition the trajectories originating from a baseline stratum based on an outcome of interest at 8 weeks. For every pair of strata at baseline and 8 weeks, the paths that had the highest likelihood and at least 10% of the patients from the baseline strata (tabulated in Supplementary Table 2) were chosen as the symptom dynamic paths.

Prognostic symptoms and prognoses rules
We sought to identify a group of prognostic symptoms that had (a) non-zero symptom severities at baseline across the majority of patients (to assess the quantum of early reductions in severity during treatment for predicting long-term response; see Supplementary Methods for details), (b) similar symptom severity scores (creating symptom clusters derived using hierarchical clustering for each stratum; illustrated in Fig. 2C (symptom clusters for A1 stratum) and Supplementary Fig. 1 (symptom clusters for all strata) at all timepoints on symptom dynamic paths originating from a baseline stratum (to establish how many symptoms with similar severity at baseline should improve at 4 weeks for predicting 8-week outcomes), and (c) different distributions of symptom severity scores between symptom dynamic paths (to quantify the level of change in a group of symptoms at 4 weeks needed to achieve specific outcomes at 8 weeks). These criteria allowed us to identify a group of depressive symptoms that had similar severities at baseline (criteria (a) and (b)) and across all treatment timepoints (a grouping effect) and had different levels of severity between individual symptoms dynamic pathways (a discriminatory effect, with criterion (c)).
The thresholds of change in prognostic symptom severity were derived based on absolute difference in median scores on symptom dynamic paths between baseline and 4-week strata (see Supplementary Table 3). Chi-square tests were used to identify the minimum number of prognostic symptoms needed to exceed (or not exceed) thresholds at 4 weeks to be prognostic of outcomes at 8 weeks (see Supplementary Methods for details). We then computed the accuracy (i.e., fraction of patients for whom the prognoses rule predicted the correct treatment outcome) and odds ratio (OR) of the most-likely outcome expected at 8 weeks in patients who transitioned from a baseline stratum to a stratum at 4 weeks (also tabulated in Table 1). The OR represents the odds that the expected treatment outcome at 8 weeks will occur if patients are covered by the prognoses rule, compared to the odds  of the same outcome occurring in patients not covered by the prognoses rule. The statistical significance (p value) of the prognostic accuracies derived using prognostic symptoms was established by comparing the observed accuracy against the null information rate (NIR)-a proxy for chance. The NIR of 0.53 represents the fraction of subjects in the training datasets for whom (i) baseline clinical and sociodemographic factors as predictors accurately predicted their treatment outcomes using Random Forests (derived from our prior work [23]), and (ii) categorical non-responder status at 4 weeks correctly predicted 8week outcomes (i.e., only 53% of the 514 subjects in our training data who were nonresponders at 4 weeks [<50% improvement in total HDRS score from baseline] were responders at 8 weeks). Finally, we used Kolmogorov-Smirnov (for age) and Chi-square tests (for sex and race) to evaluate if prognosis rules or accuracies were associated with age, sex, or race (the common sociodemographic factors across all datasets).

Symptom dynamic paths
For the patients treated with citalopram/escitalopram in the training dataset, specific symptom dynamic paths ( Fig. 2A) were derived (see Supplementary Table 2 for likelihood scores for symptom dynamic paths). Patients starting in any stratum at baseline were most likely to achieve remission at 8 weeks if they transitioned into the B1 stratum at 4 weeks, and the clinical observation at 4 weeks was response. Patients starting in the A2 or A3 strata at baseline were most likely to achieve response at 8 weeks if they transitioned into the B2 stratum at 4 weeks and the clinical observation at 4 weeks was response; and were most likely to be nonresponders at 8 weeks if they transitioned into the B3 stratum at 4 weeks and the clinical observation at 4 weeks was also a nonresponse. Patients starting in the A1 stratum at baseline were most likely to be nonresponders at 8 weeks if they transitioned into the B2 stratum at 4 weeks and the clinical observation at 4 weeks was also nonresponse. There was no symptom dynamic path between A1 to C3 since fewer than 8% of the patients reached the C3 stratum at 8 weeks via either the B3 or the B2 strata at 4 weeks.
Prognostic symptoms Four HDRS items (depressed mood, psychic anxiety, guilt feelings/ delusions, and work/activities) met the prognostic symptom criteria for patients in the training dataset. We illustrate the variations in severity of prognostic symptoms in patients with and without the superimposition of symptom dynamic paths (e.g., for depressed mood, see Fig. 2B, D), using data from subjects originating in A3 stratum at baseline. Improvement in the severity of depressed mood can be visualized at 4 and 8 weeks in Fig. 2B, but there is still a high degree of interpatient variation in the scores for depressed mood (as shown by the large spread of boxplots) when subjects are not stratified and analyzed using symptom dynamic paths. By stratifying patients and deriving symptom dynamic paths (e.g., those originating from stratum A3, as shown in Fig. 2D), the discriminatory effect of scores at 8 weeks was better reflected in the patterns of response at 4 weeks. No such discriminatory effects occur for nonprognostic symptoms, as shown in Supplementary Fig. 2. No prognostic symptoms could be identified for patients who received placebo using only the prognostic symptom criteria (see Fig. 2D).
Prediction of short-term antidepressant response using probabilistic.  Table 1) of the patients starting from any of the baseline strata. There were no associations with age, sex, or race for meeting the prognostic symptom criteria or accuracy of prognoses. The observed outcome was nonresponse for nearly all (92%) of the remaining patients.

Replication of prognostic performance of prognostic symptoms in testing datasets
We first assigned patients in the testing datasets who were treated with SSRIs, duloxetine, and combination therapy to a stratum at each timepoint, as defined by the same range of total HDRS scores derived from the training dataset. As shown in Fig. 2E, F, prognostic and nonprognostic symptom variations in the testing datasets (see Fig. 2E, F) were similar to those of the training dataset (see Fig. 2D). We then calculated the accuracies of forecasted outcomes at 8 weeks (see Fig. 3B) using the same prognostic thresholds of prognostic symptom changes at 4 weeks derived from the training cohort (additional details in Table 1). Prognoses performance of the change in prognostic symptoms at 4 weeks for predicting 8-week outcomes in the testing datasets are summarized below, and are shown in Table 1: • For patients originating in the A3 stratum: (1) the accuracies in the prediction of nonresponse at 8 weeks were 66%, 73%, and 67%, respectively, for patients treated with other SSRIs, duloxetine, and combination therapy who transitioned to the B3 stratum with ≥3 prognostic symptoms improved by ≤1 point at 4 weeks; (2) the accuracies in the prediction of response at 8 weeks were 88%, 84%, and 73%, respectively, for patients who transitioned to the B2 stratum with ≥2 prognostic symptoms improved by ≥2 points at 4 weeks; and (3) the accuracies in the prediction of remission at 8 weeks were noncalculable (due to lack of samples), 83% and 69% (p ≤ 0.08), respectively, for patients who transitioned to the B1 stratum with ≥2 prognostic symptoms improved by ≥2 points at 4 weeks.
• For patients originating in the A2 stratum: (1) the accuracies in prediction of nonresponse at 8 weeks was 93%, 78%, and 76%, respectively, for patients treated with other SSRIs, duloxetine, and combination therapy who transitioned to the B3 stratum with ≥3 prognostic symptoms improved by ≤1 point at 4 weeks; (2) the accuracies in the prediction of response at 8 weeks was 63%, 62%, and 68%, respectively, for patients who transitioned to the B2 stratum with ≥2 prognostic symptoms improved by ≥1 point at 4 weeks; and (3) the accuracies in prognoses of remission at 8 weeks was 72%, 77%, and 57%, respectively, for patients who transitioned to the B1 stratum with ≥2 prognostic symptoms improved by ≥2 points at 4 weeks.
• For patients originating in the A1 stratum: (1) the accuracies in the prediction of nonresponse at 8 weeks was 72%, 63%, and 76%, respectively, for patients treated with other SSRIs, duloxetine, and combination therapy who transitioned to the B2 or B3 stratum with ≥3 prognostic symptoms improved by ≤1 point at 4 weeks; (2) the accuracies in the prediction of remission at 8 weeks was 86%, 83%, and 57%, respectively, for patients who transitioned to the B1 stratum with ≥1 prognostic symptom improved by ≥2 points at 4 weeks.
Analogous to the case in the training dataset, the minimum prognostic symptom criteria captured variations in ≥71% of patients from each baseline cluster across all of the testing datasets, and sex was not associated with chances of meeting the prognostic symptom criteria or the prognoses accuracy. Nearly all (95%) of the remaining of patients had nonresponse as their outcome.
Lack of prognostic symptoms and prognoses in placebo-treated patients Prognostic depressive symptoms could not be identified using the criteria specified earlier in patients who received placebo. Instead, in Table 1, we report the accuracy and odds of outcomes in placebo patients (assigned to baseline and 4-week strata) using the four core HDRS-derived symptoms. The predictive accuracies in nearly all outcomes and the odds ratios for all outcomes were lower than those observed in escitalopram/citalopram-treated subjects from the training datasets (see Table 1). The only exception was that the odds ratio for predicting nonresponse was higher in placebo patients than escitalopram/citalopram-treated subjects.

DISCUSSION
We used probabilistic graphical models (PGMs) in conjunction with unsupervised machine learning methods to identify individual depressive symptoms that were highly predictive of antidepressant response, and thresholds of improvement needed in those symptoms by 4 weeks (an interim timepoint supported by treatment guidelines for making changes in antidepressant treatment [24][25][26]) to predict remission, response, or nonresponse by 8 weeks (which conservatively defines the end of a therapeutic antidepressant trial). The high levels of predictive accuracy achieved using a training dataset comprised of citalopram-or escitalopram-treated depressed outpatients replicated in three validation datasets that included depressed inpatients as well as outpatients treated with other SSRIs, duloxetine, and antidepressant combinations.
The prognostic depressive symptoms in this work were defined based on observed homogeneity in their responses at all timepoints, while demonstrating differential patterns of change under antidepressant treatment that were prognostic of clinical outcomes at 8 weeks. Whether they are core to the syndrome of MDD is a question not addressed in this work. However, there is a significant overlap of prognostic symptoms inferred in this work with symptoms in existing subscales , , HAMD7 [29], and VQIDS-C 5 [30]) that were derived from the fullscale HDRS or other rating scales to measure depressive symptoms that are more responsive to antidepressants and lesssensitive to their adverse effects. For example, the four prognostic symptoms derived from HDRS in this work were all included in Maier-6, Bech-6, and HAMD7. The prognostic depressive symptoms identified with QIDS-C in our study align with the items that were included in a brief version of QIDS-C [30] and with the "core emotional" symptoms of depression identified by others as being more responsive to citalopram/escitalopram treatment than were other depressive symptom clusters [14]. Our approach extends this prior work by establishing the prognostic capabilities of these symptoms using an unbiased approach.
The mathematical constructs of PGMs represent an analytical novelty in this work that permitted us to reason with uncertainty and overcome the challenges in interpreting longitudinal variations of antidepressant response when using other approaches, such as latent variable analyses with growth mixture models [14][15][16][31][32][33][34][35][36]. For example, we used probabilistic graphs in this work instead of growth mixture models, given that growth mixture models (1) do not find paths algorithmically by conditioning upon improvements in symptoms at intermediate timepoints, (2) offer very limited interpretability of dynamics of symptom changes, and (3) need sufficient domain expertise to define the number of latent classes and trajectories, and ensure appropriate model fit, and then interpret the results [37][38][39][40][41] (which might prove challenging in analyses that are exploratory in nature). PGMs also provide an extendable analytical framework to derive antidepressant response trajectories for longer observation periods beyond 8 weeks, with the additional ability to identify interpretable response trajectories when the study timeline is a continuum (e.g., extracting visit data from electronic health records) as opposed to discrete timepoints (by formulating the PGM as a Markov jump process [42]). Deep learning approaches have been explored for inferring patient subgroups based on homogeneity in disease trajectories in a data-driven manner [43]. In fact, deep learning approaches and probabilistic graphs both have the advantage of high utility for modeling outcomes without requiring a prespecified number of trajectories. The advantage of PGMs over traditional deep learning or growth mixture model approaches lies in the mathematical formulation of PGMs that allow for reasoning with uncertainty and permits to conditioning future disease variations based on trajectories up to an interim timepoint. In our work, the forward algorithm construct in our PGM parallels the logical scheme used by clinicians in the measurement-based care of depressed patients. That is, the severity of depressive symptoms at baseline and changes in these symptoms are used to drive treatment decisions at an interim timepoint, prior to completion of a therapeutic antidepressant trial.
Based on the clinically-driven design of our PGM (incorporating change in depressive symptoms at 4 weeks in stratified patients), our approach could begin to inform the development of clinical decision support tools to augment (but not replace) practitioner expertise, improve patient engagement, and enhance shared decision-making by providing highly-interpretable quantitative prognostic information as a supplement to clinical judgment and Prediction of short-term antidepressant response using probabilistic. . . AP Athreya et al. patient preferences. Importantly, the PGM-based approach described in this work allows for the integration of biological measures, which may then be used to not only improve the predictability of antidepressant outcomes, but may also serve as a future strategy for individualizing choices of therapy for people with depression [23,44]. Further work is needed to test the predictive capabilities of this approach, with the integration of biological measures, in prospective trials (NCT04355650), and in environments where measurement-based care of depressed patients is routinely delivered.
The consistently high predictive accuracies across numerous commonly-prescribed antidepressants observed in this work have several important implications that fit well with observations from the STAR*D trial: even with rigorously conducted antidepressant treatment, only 53% of patients may be expected to remit after 6 months [45][46][47]. By altering treatment at 4 weeks, an interim timepoint supported by practice guidelines and clinical evidence [9,24,25], a total of 2-4 months could be saved across two therapeutic trials that are each likely to fail after 8-12 weeks, a period of time that is often required for many depressed patients to remit or improve substantially [9,47]. Our approach, which relied on only a limited number of depressive symptoms in addition to total depression scores to predict treatment outcomes, may introduce needed efficiencies into busy practices in addition to optimizing predictive accuracies. This feature may be especially important in busy primary care practices, and may hasten referrals for specialty mental health consultation or treatment if the predicted outcomes of treatment are nonresponse. As a cautionary note, we do not suggest that the full versions of depression rating scales be replaced with shorter versions based on prognostic symptoms only, which would fail to consider all of the important elements of MDD severity for individual patients, including suicidal ideation. Rather, our results suggest that focusing on early changes in prognostic symptoms may increase the prognostic value of full-scale depression measures, which were designed to measure disease severity but not necessarily to predict outcomes.
It is of significant interest that prognostic symptoms could not be identified in patients who received placebo. This observation is important because placebo response rates in clinical trials of antidepressants in MDD patients are high, ranging from 35 to 40% [48]. Moreover, prior applications of machine learning to large antidepressant clinical trial datasets have not shown systematic differences in the patterns of change in individual depressive symptoms over time between placebo and active treatment, even in placebo responders [14]. Although not a direct test of hypothesis (considering a relatively smaller number of placebotreated subjects relative to those who received active treatment), our findings do suggest that the antidepressants we studied, as a group, exerted systematic effects on depressive symptoms that could not be demonstrated in placebo-treated subjects.
There are limitations to our study. Due to lack of data, we were unable to investigate whether changes in prognostic symptoms at timepoints earlier than 4 weeks can accurately predict clinical outcomes at 8 weeks, given evidence that eventual response may sometimes be predictable as early as 2 weeks [49]. The study data was restricted to three timepoints, which may not be sufficient to capture the full arc of the disease, including variations in depression severity and associated long-term outcomes that extend well beyond 8 weeks. There was no dose standardization across datasets, although this is less concerning given that drug dosage was not associated with clinical outcomes here or in previous studies [50]. Despite replication across independent testing datasets, additional studies are needed to establish the generalizability of our approach to other rating scales, medications and treatment approaches beyond those studied here, and longer follow-up durations. Our model, due to lack of data, does not account for the effects of nonadherence, comorbid diagnoses, environmental, and other socioeconomic factors. We were unable to address which treatments should be considered after failure to respond to a given medication due to the lack of sequential trial data. The impact of our findings on those who dropped out of treatment prior to 8 weeks is unknown because our analyses focused on trial completers. Finally, we did not have access to complete data on the number of previous therapeutic antidepressant trials for study patients, an important limitation given that the odds of achieving a positive treatment outcome with antidepressant treatment correlates inversely with the number of previous treatment failures [51].
In summary, this is the first study to examine PGMs in conjunction with unsupervised machine learning methods to derive interpretable and accurate prognoses of antidepressant treatment outcomes. The consistent results across several datasets from studies utilizing different antidepressant treatments and populations suggests this method to potentially utilize symptom trajectory improvements across time to provide much needed clinical decision support earlier in a patient's treatment course. and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF or the NIH.

AUTHOR CONTRIBUTIONS
APA and WVB designed and led the study, and developed the manuscript. DRN, RMW, MAF, LW, AJR, TM, MT, and PC contributed to the manuscript's discussion and analyses. TB and EB contributed the MARS study data and helped with manuscript development. DM, RKI, MS, JMB, and RC assisted with the methods and analyses of the work.