Treatment response classes in major depressive disorder identified by model-based clustering and validated by clinical prediction models

The identification of generalizable treatment response classes (TRC[s]) in major depressive disorder (MDD) would facilitate comparisons across studies and the development of treatment prediction algorithms. Here, we investigated whether such stable TRCs can be identified and predicted by clinical baseline items. We analyzed data from an observational MDD cohort (Munich Antidepressant Response Signature [MARS] study, N = 1017), treated individually by psychopharmacological and psychotherapeutic means, and a multicenter, partially randomized clinical/pharmacogenomic study (Genome-based Therapeutic Drugs for Depression [GENDEP], N = 809). Symptoms were evaluated up to week 16 (or discharge) in MARS and week 12 in GENDEP. Clustering was performed on 809 MARS patients (discovery sample) using a mixed model with the integrated completed likelihood criterion for the assessment of cluster stability, and validated through a distinct MARS validation sample and GENDEP. A random forest algorithm was used to identify prediction patterns based on 50 clinical baseline items. From the clustering of the MARS discovery sample, seven TRCs emerged ranging from fast and complete response (average 4.9 weeks until discharge, 94% remitted patients) to slow and incomplete response (10% remitted patients at week 16). These proved stable representations of treatment response dynamics in both the MARS and the GENDEP validation sample. TRCs were strongly associated with established response markers, particularly the rate of remitted patients at discharge. TRCs were predictable from clinical items, particularly personality items, life events, episode duration, and specific psychopathological features. Prediction accuracy improved significantly when cluster-derived slopes were modelled instead of individual slopes. In conclusion, model-based clustering identified distinct and clinically meaningful treatment response classes in MDD that proved robust with regard to capturing response profiles of differently designed studies. Response classes were predictable from clinical baseline characteristics. Conceptually, model-based clustering is translatable to any outcome measure and could advance the large-scale integration of studies on treatment efficacy or the neurobiology of treatment response.


Introduction
Developing a major depressive disorder (MDD) and recovering from it is a dynamic process. While consensus definitions of MDD include core symptoms such as anhedonia and a depressed mood 1 , multiple additional symptoms may co-occur during an episode, each with individual patterns and variability throughout the episode 2,3 . During the development of a MDD, patients may go through sub-clinical phases with areas of preserved functioning in daily life, yet already show impaired psychosocial stress tolerance 4,5 . Strong inter-individual differences in the sensitivity towards psychosocial stress-a major risk factor for MDD 6 -may underlie such symptom plurality. Similarly, the regression of symptoms under treatment shows strong inter-individual differences. However, it is hypothesized that stable subgroups [7][8][9] and predictive clinical patterns [8][9][10][11][12][13] exist.
The latter is important for the successful clinical management of MDD. Treatment should ideally lead to full remission, as the persistence of residual symptoms increases the likelihood of a relapse 14 . Accordingly, delays of treatment intensifications or a switch of medication further increases the risk of treatment resistance and chronification 15 . Early treatment response (e.g., within 2 weeks) is particularly predictive of the longer course 16 an established finding that also applies to outpatients and patients receiving a first-time antidepressant treatment 17 . Similarly, distinct psychopathological profiles reflect differences in the sensitivity of functional domains to stress and may be predictive of response patterns. For example, a patient suffering from severe anhedonia as a core symptom may respond particularly well to a treatment that addresses the dopaminergic system 18 .
Due the heterogeneous symptomatology of depression, treatment response classes are typically based on compound scores on which relative change criteria or absolute thresholds are then applied (e.g., depression severity below a certain threshold over a defined time period). Different multivariate statistical techniques have been employed to identify predictive patterns for such conventional treatment response classes 10,12,13 . Chekroud et al. 10 used an elastic net to identify 25 out of 164 patient-reportable variables of the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study that predicted response to citalopram. These variables were used to train a machine learning model, which could be validated with significant, yet low accuracy (59.7%) in an external sample. Nie et al. 12 , using data from the STAR*D study, trained five different machine learning algorithms on the full (700) or differently reduced (30 and 22, respectively) sets of clinical features to predict treatment resistance and non-resistance in STAR*D (at week 12) and an independent study (at week 6). Predictions were carried by early response markers and reached moderate accuracy. Wardenar et al. 13 reported that the effect of information on comorbidities significantly improved the prediction of depression persistence and severity. Yet, while the here predicted response classes are mostly rooted in the long-known importance of early response and full remission, they are not data-based and may thus not represent all patterns contained in the data. Here, clustering analysis may be useful to dissect the dataspace into subspaces based on features that are shared within a subgroup and distinct between subgroups 19 . Clustering analysis has so far mainly been applied towards cross sectional markers to identify subgroups based on clinical symptom profiles [20][21][22][23] , cognitive markers 24 , or functional imaging markers 25 , assuming that clusters could indirectly reflect distinct pathophysiological components. Here, we attempt to cluster the treatment response space based on (total) symptom severity trajectories, i.e., the patients' clinical development over a defined observation period.
Longitudinal latent class analysis has reported five 26 or nine such prototypical trajectories 27 based on 12 weeks of observation. More specifically, the first study 26 demonstrated rather limited prediction from~13 clinical baseline items and polygenic scores formed from five literature-based, treatment associated genetic variants. The second study 27 reported weak associations of response trajectories with the type of medication, yet investigated no clinical predictors. Another study revealed seven trajectories based on 1 year of observation 28 , yet, no prediction models were tested. One limitation of these studies, however, is their narrow generalizability as data from single centers studies were used.
In order to understand whether treatment response classes (TRC[s]) are specific to a study site-specific patient selection and treatment approach or whether they represent a generalizable dynamical fingerprint, we included two types of cohort studies in our work: First, the Munich Antidepressant Response Signature (MARS) study, a prospective, open, observational trial performed at the MPI of Psychiatry, Munich, and collaborating hospitals 29 . Second, the Genome-based Therapeutic Drugs for Depression (GENDEP) study, a partially randomized, multicenter clinical and pharmacogenomic study 30 . In both studies, the Hamilton Depression Rating Scale (HAM-D), which achieves good test-retest and interrater reliabilities 31 was used to assess current symptom levels, covering most domains that define MDD such as depressed mood, suicidality, anhedonia, lack of drive, circadian symptom changes, and autonomous nervous system disturbances.
The aims of this study were (i) to establish TRCs in a data-driven fashion, based on serial depression ratings as acquired during studies with naturalistic or partially randomized treatment, and (ii) to assess the generalizability and clinical validity of the resulting TRCs. For this purpose, we applied a mixed model-based, non-linear longitudinal clustering technique to detect TRCs (also referred to as clusters) in MDD in our discovery sample, a subsample of the MARS cohort. The core feature of this clustering technique is to assigns individuals to a cluster (here: a TRC) by while borrowing information from all other individuals and hereby improving cluster stability, which often is critical for generalizability and clinical applications. For the second aim, we assessed cluster generalizability empirically in a second subsample of MARS (MARS validation sample) and in the GENDEP sample, and employed random forest analyses to explore if clinical characteristics at baseline can predict the TRCs in the MARS discovery and validation sample.

Methods and materials
General study samples characterization Both the MARS and the GENDEP study protocol were approved by the respective local ethics committees. All participants gave their written informed consent before participation. MARS patients were admitted to the hospital of the MPIP, Munich, Germany, or collaborating hospitals in southern Bavaria and Switzerland for the treatment of different depressive disorders. Started in 2000, the study aimed at generating a large database of longitudinal observations with weekly ratings along with sociodemographic, psychopathological, and biological data from in-patients with all types of depressive disorders including MDD, bipolar depression, and schizoaffective disorder 29 . Diagnoses according to ICD10 32 were obtained from trained psychiatrists using patient interviews and clinical documentation 29 . Of 1286 available patients, only patients with either a single episode of MDD (ICD-10 F32, N = 373) or a recurrent (unipolar) depressive episode (ICD-10 F33, N = 698) were eligible. Patients with bipolar depression (N = 175), chronic depression (ICD-10 F34, N = 3), or patients with a baseline HAM-D score <14 (N = 37) were excluded. Of the remaining 1071 datasets suitable for this study, 834 patients (recruited 2002-2011) formed the discovery sample and 236 patients (recruited 2012-2016) the MARS validation sample. The split point represented an organizational intercept related to genotyping activity unrelated to this study. The age range was 18-87 years (see Table 1 for demographic and clinical details). Patients were treated psychopharmacologically according to the attending doctor's choice and received therapeutic drug monitoring to optimize plasma medication levels. Depression symptoms were evaluated weekly using the 21-item version of the HAM-D until week 6 and, after that, bi-weekly until discharge or, if not discharged, until week 16 as the latest assessment. During the first six weeks, 7.1% of the HAM-D scores were accidentally missing due to organizational reasons. Accidentally missing HAM-D scores of the first 6 weeks and bi-weekly skipped HAM-D scores between week 6 and 16 were linearly interpolated from the previous and subsequent scores to obtain complete time series. Eighty-eight percent of patients of the discovery and 99% of the MARS validation samples were discharged before week 16 and thus provided HAM-D time series with fewer than 17 data points.
The GENDEP study represents a partially randomized, multicenter clinical, and pharmacogenomic study on depression 33 into which 826 subjects were enrolled between July 2004 and December 2007. The main inclusion criterion was the diagnosis of a major depressive episode of at least moderate severity as defined by DSM-V 1 and ICD-10 criteria 32 and as established by the Schedules for Clinical Assessment in Neuropsychiatry (SCAN, version 2.1) 34 . Exclusion criteria were a first-degree relative with bipolar affective disorder or schizophrenia, a history of a hypomanic or manic episode, mood incongruent psychotic symptoms, primary substance misuse, primary organic disease, current treatment with an antipsychotic or a mood stabilizer, and pregnancy or lactation. Patients eligible for both antidepressants were randomly allocated to receive either nortriptyline (50-150 mg/day) or escitalopram (10-30 mg/day) for 12 weeks with clinically informed dose titration. Patients with a history of adverse effects, non-response or contraindications to one of these drugs were non-randomly allocated to the other drug. Patients who could not tolerate the initially allocated medication or who did not experience sufficient improvement with adequate dosage within 8 weeks were offered the other antidepressant. Depression symptoms were evaluated weekly until week 12 by psychiatrists or psychologists using the 17-item version of the HAM-D score 35 . The age range of all subjects was between 18 and 72 years, all patients were of European ethnicity. A total of 15 subjects who had missing data on all three suicidality items at baseline were excluded, as were patients with a baseline HAM-D score <14, leaving 809 patients for analysis 35 . Demographic data are given in Supplementary Table S1. Different biological aspects of treatment response 36,37 and psychopathological predictor schemes have been reported from this study 27 .

Clustering algorithm
A mixed model approach was used to describe the course of the individual HAM-D score time series, after (natural-) logarithm (ln) transformation (ln of [HAM-D scores +0.5]), considering information not only from the individual trajectory, but combining trajectories of several patients to identify TRCs. For a first organization of HAM-D responses into TRCs, we applied the FlexMix 38,39 clustering algorithm in R (version 3.3) on the HAM-D trajectories of the MARS discovery sample. FlexMix provides an infrastructure for the flexible fitting of finite mixture models using the expectation-maximization algorithm to cluster individual trajectories. The algorithm iterates between computing the expectation of the log-likelihood and maximizing it to find the parameters of the TRCs. To achieve a stable cluster solution, we ran the clustering model with 200 repetitions and determined the  Significan c e after Bonferroni correction for multiple testing, here: To allow optimal use in a parametric test, variables were coded as 0 (no relative affected), 1 (only second-degree relatives affected), and 2 (first-degree or first-degree and second-degree relatives affected).
e Not tested as these it e ms were not part of model 0.
optimal number of TRCs based on the Integrated Completed Likelihood (ICL) criterion generated by the model. To validate the stability and generalizability of the clustering solution, the coefficients of the model of the discovery sample were projected onto a second, later acquired subsample of the same cohort, referred to as MARS validation sample (N = 236). Here, the hypothesis was that the patients are classifiable into the defined TRCs with approximately equal proportions and similar clusterwise median HAM-D courses as had been observed for the discovery sample. In addition, we projected the same clustering model onto 12-week HAM-D courses of the GENDEP sample, hypothesizing similar median HAM-D courses per class, yet, not necessarily similar cluster proportions due to differences in the patient population and the study design. For both projection experiments, resulting proportions of classes were compared with the original distribution of the discovery sample using a χ 2 test. In order to assess suitability of the clustering solution for the validation samples, posterior likelihood values, classification log-likelihoods and eventually ICL values for were calculated on the basis of the clustering model of the discovery sample.
To assess the applicability of the original clustering coefficients to samples with a shorter observation interval, we systematically lowered the number of applied coefficients down to 1 and, for each observation interval report, compared this classification with the classification based on all coefficients (i.e., the full observation interval). The true distance (or dis-correlation) between the two solutions was calculated by Pearson correlation between model-based slope values of the respective TRC.

Multivariate prediction analyses
We then conducted a multivariate analysis using a random forest algorithm as implemented in the R package Ranger 40 to detect associations between clinical variables and the previously obtained TRCs in the MARS sample.

Clinical predictors
All 72 clinical variables are explained in Table 1 43 ). As random forest models require complete datasets, missing data were filled by the respective median of the total sample (for details see Supplementary Table S2). Extended models were: Model 1, which is model 0 expanded by 21 baseline HAM-D single items to investigate the effect of unfolding the baseline psychopathology; model 2, which is model 0 expanded by the partial response at week 2 to investigate the influence of early longitudinal observations; model 3, the combination of both expansions (Supplementary Fig.  S1).

Random forest-based prediction models
The basic algorithm used in the Ranger package is a fast implementation of random forests for high dimensional data. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node 44 . Two parameters were used to control this process: the number of prediction trees (bagging) and the number of features to search across to find the best feature (mtry). Mtry is the square root of D, which is the number of independent predictors used for classification. Predictions were obtained by aggregating the prediction trees (i.e., the majority votes for classification and the average for regression models). We calculated adjusted coefficients of multiple correlation R 2 (to quantify the explained variance and predictive quality of the entire model) and corresponding p-values. To characterize feature importance, a permutation based method that exploits the distribution of measured importance for each variable in a noninformative setting was applied 45 (10000 permutations); predictors with p < 0.05 are reported in more detail. Further, differences in R 2 between competing models were compared after Fisher's Z-transformation of the respective r values.
Prediction models were estimated on the pooled discovery and validation MARS sample. For each set of predictors, two ways of modeling the HAM-D time series were considered: first, the patient's individual treatment response slope, a simple linear regression on lntransformed HAM-D values, and, second, the slope derived from the clustering model. The rationale for this comparison was to determine the quality of the clustering method to generate meaningful and generalizable outcome classes. Further, class specific classification accuracy values (i.e., [true positives + true negatives]/[true positives + false positives + true negatives + false negatives]), were calculated on the basis of respective confusion matrices in which the class of interest was defined as true class, and the remaining six other classes as false class.

Clustering of HAM-D time courses
When applied to the HAM-D time courses of the discovery sample, the FlexMix clustering algorithm did not converge for any number of clusters k < 4 or k > 10. We therefore assessed cluster stability in more detail for k ≥ 4 and k ≤ 10, using 200 repetitions of the algorithm for each k. The lowest value of the ICL criterion, representing an optimal model fit, was found for seven clusters (Supplementary Fig. S2A). Figure 1 shows the resulting TRCs (C1 to C7), sorted by their model-derived slope. C1 showed the fastest symptom improvement, whereas C2 and C3 were characterized by improvements at slower rates. Cluster C4 reflected a more volatile symptom development, while C5, C6, and C7 were characterized by low improvement rates, with C7 showing practically no improvement over at least 16 Table S3).
To examine whether the TRCs represent stable and generalizable entities, we assigned the patients of the two MARS-and GENDEP-based validation samples to clusters, using the coefficients of the model estimated in the discovery sample. Figure 1 compares the individual trajectories across the three samples and shows the respective cluster-specific median time courses along with boundaries that include 95% of values of the discovery sample. Supplementary Fig. S2B shows ICL values for both validation samples, separately and combined. All samples showed an ICL minimum for seven clusters except for the MARS validation sample. The latter showed a flat ICL profile with a relative minimum at five clusters, most likely due to the relatively small sample size of about 30% compared with the MARS discovery and the GENDEP validation sample. For the MARS validation sample we observed that median HAM-D courses were highly similar to the discovery sample and cluster proportions were not different (Χ 2 = 6.157, p = 0.40). The GENDEP validation sample exhibited very similar median HAM-D courses compared with the discovery sample, except for C4, which had lower median values compared with MARS, caused by several patients with high volatility between week 4 and~10 and HAM-D values below the 95% threshold. Compared with the MARS discovery sample, GENDEP clusters had different proportions (Χ 2 = 177.13, p = 1.38 × 10 −35 ), showing fewer fast responders (e.g., in C1, average 4.9 weeks to discharge) and more slow responders (e.g., in C7 average 20.8 weeks to discharge).
Next, we analyzed to what degree a lower number of sequential observations would suffice to predict the TRCs instead of using the full observation interval. Here, we detected an almost linear increase of the correlation coefficient between the reduced and full solutions from week 0-4. Correlations were already high at week 8 for the MARS validation and the combined MARS sample (0.96-0.98) (Fig. 2). For GENDEP, as fully independent sample, the slope was generally lower, reaching 0.77 at week 8 and remaining linear until its maximum.
Strong correlations between the TRCs and established response markers (weeks until discharge, response [50% relative symptom decrease at discharge] and remission [HAM-D < 10 at discharge]) were confirmed (Supplementary Table S4). These differences were significant between~80% of neighboring clusters, particularly for remission as a conservative criterion (Supplementary  Table S5), highlighting an ecological importance of the cluster differences. Clusters also differed regarding the psychopharmacological treatment administered throughout the episode for three of nine medication classes (benzodiazepines, tricyclic antidepressant, and antipsychotics) (Supplementary Table S4).

Predicting TRCs from clinical characteristics
We assessed whether the attribution of patients to the TRCs can be predicted from clinical characteristics. While explorative, the analysis served mainly as general cluster validation step by probing if the TRCs associate with clinically plausible and previously reported prediction patterns. To this end, we analyzed four models with a focus on model 0 that comprised 50 clinical baseline items. Model 1 comprised additional baseline HAM-D single items, model 2 contained the early partial response at week 2, and model 3 combined models 1 and 2. All four models predicted treatment response in the combined MARS sample for both alternatives of modelling the slope (individual and cluster-derived) (both p < 2.17 × 10 −21 , Table 2). Overall, two performance levels (A and B) were observed for models using the cluster-derived slope: (A) Model 0 and 1 both explained 13% of the variance, which means that no gain was achieved by inclusion of the baseline HAM-D single items. (B) Model 2 and 3 explained 20% and 21% of the variance, respectively, with the improvement over (A) induced by the early partial response item; as observed in the first comparison (A), no added effect of the baseline HAM-D single items was seen for model 3. Predictions were also significant for all four models when analyzing the two MARS subsamples (p < 1.30 × 10 −17 and p < 8.71 × 10 −5 for the discovery and validation sample, respectively). It is worth mentioning that for the MARS validation sample the prediction analysis was entirely independent from the clustering procedure. Across all models, using the cluster-derived slope explained significantly more of the variance than using the individual slopes (Table 2). Classification accuracies as calculated from cluster-specific confusion matrices ranged between 75.0% and 95.2% (Supplementary Table S6 for details). Table 3 lists 10 (out of 50) predictors of model 0 that gained significance based on a multivariate comparison of the respective single item against all other competing items 45 . We also analyzed univariate associations of these items with the TRCs (likelihood ratio test on a generalized linear model). Concordantly, both types of comparison revealed strongest effects for the personality items neuroticism, extraversion, and harm avoidance. Furthermore, we investigated the cluster-specific averages of each clinical item, comparing them to the 95% confidence interval (CI) of the entire sample (  Table S8 summarizes significant predictors of the three extended models. In brief, model 1, compared to model 0, was characterized by prioritizing three baseline HAM-D single items; model 2 identified, as expected, early partial response as a strong predictor, along with minor other shifts; model 3 produced a combined pattern with baseline HAM-D single items, early partial response, and current psychotic symptoms as additional predictors over model 0.

Discussion
We employed model-based non-linear clustering 38,39 on symptom courses of 834 in-patients treated for MDD and identified seven TRCs. These classes were already distinct at the visual level and ranged from fast, unambiguous response to severe treatment resistance. The average HAM-D decrease differed strongly between classes, and classes were strongly associated with established response markers, highlighting that they represent clinically meaningful entities. Baseline severity was only weakly correlated with the response slope over a small HAM-D range, contradicting the intuitive expectation that a high initial disease severity is closely coupled to a steep symptom decline. Classification of 236 patients of the MARS validation sample and 809 patients of the GENDEP validation sample demonstrated that the patients' response dynamics can be captured by these clusters, yet studyspecific differences in the response profiles are also reflected.

Construct validity of the clustering solution
Similar cluster sizes and shape characteristics emerged when the discovery sample coefficients were applied to the validation samples (Fig. 1). The consistency observed in this validation is superior to previous latent variables analyses not using machine learning, which did not produce stable, symptom-based subtypes of depression 3 . Still, a major difference that limits the comparability is that the mentioned analyses (factor analyses, principal component analyses, latent class analyses) built their grouping on cross sectional symptom spectrum and not on trajectories of symptom changes.
Here, we applied a machine learning strategy to identify MDD subtypes based on longitudinal data collected over up to 16 weeks. Our results indicate that significant latent subtypes for MDD indeed exist in the MARS cohort. One advantage of our approach may have been the identification of the best model through the ICL criterion that appears more robust to the violation of some of the mixture model assumptions compared with the commonly used Bayesian Information Criterion. Therefore, the use of the ICL may have led to a more optimal choice for the number of clusters and, accordingly, to a more sensible data partitioning 46,47 . Within each model, the use of slopes derived from the linear mixed model characterizing each TRC led to higher R 2 coefficients than the use of individual slopes, particularly in the validation sample (Table 2). This observation strengthens the validity of the classes and highlights that the individual information of the HAM-D time courses was indeed assessed by the clustering algorithm. Moreover, this emphasizes that the average slope of the class is a good approximation of the response behavior, helping to denoise individual observations.

Clustering independent patient groups and simulating reduced observation intervals
To facilitate the translation of our clustering scheme to other cohorts and to understand the degree of generalizability of our clustering solution, we analyzed two aspects: First, we projected the clustering coefficients to an independent MARS subsample and found that these patients were assigned to classes with similarly shaped group plots and median HAM-D courses as observed for the discovery sample. The observation that the classes formed from the MARS validation sample were also equally proportioned as in the discovery sample confirmed that, within the MARS cohort, a stable solution had been gained. The additional projection onto the GENDEP sample was also informative: Here, patients could be captured equally well by the seven TRCs except for a small proportion of patients that exceeded the lower boundary of one (discovery) cluster due to volatile courses between week 4 and~10. More relevant, however, significantly different cluster proportions emerged compared with MARS. We speculate that the limited options to intensify treatment in the GENDEP study with defined treatments-or generally different patient characteristics -could underlie the proportional shift towards clusters that represent a slower treatment response. The combination these two observations led us to conclude that indeed generalizable response patterns seem to be described by the seven TRCs. Though, different cluster stability criteria may lead to different solutions, as for example pointed out by a longitudinal latent class analysis that used Bayesian Information Criterion and detected nine clusters in GENDEP 27 . Comparability with our solution, though, is hampered by the use of a different depression rating scale (Montgomery-Åsberg Depression Rating Scale). Second, in a simulation, we reduced the observation interval to probe whether studies with shorter observation windows could also benefit from the current clustering solution. We found that a correlation of r ≈ 0.96 was reached after eight weeks of HAM-D measurements in the MARS-based samples and r ≈ 0.77 in the independent GENDEP sample (Fig. 2). Of note, the remaining increase of prediction accuracy between weeks 8 and 12 was stronger in GENDEP, indicating that observation windows of 8 weeks generally seem sufficient, but that, expectedly, differences in study characteristics play a role, rendering more observations advisable. One such difference that could explain the difference at week 8 might have been the higher flexibility in the MARS study to adjust the treatment to the individual patient. Overall, the generalizability of our clustering solution could be higher for observational than for controlled studies.

Prediction of TRCs from clinical baseline features
We next investigated the clinical usefulness of the TRCs by testing whether these can be predicted from clinical baseline characteristics in a multivariate model (random forest algorithm) 40 . Rather than as a separate study we conceptualized this analysis as additional clinical validation of the clusters that primarily represent statistical constructs. Several machine learning techniques have before been used to predict treatment outcome in MDD [48][49][50] , yet, their models were directed towards classical categories of remission, non-remission 10 , treatment resistance 12 , or persistence-severity 13 . In brief, we found   Table 1  that 50 clinical baseline variables, obtained through interviews, symptom self-reports, and standard physical or laboratory tests, predicted about 13% of the variance of the TRCs. While seemingly low, this is actually in the range of previous multivariate analyses that focused on the prediction of two outcome categories, reporting low to medium accuracy values from receiver operating characteristic analyses 10,12,13 . In contrary to using predefined cutoff thresholds for these categories, clustering as exemplified here for the HAM-D measure can reveal more fine-grained, yet still sparse and data-driven classification systems. Of our clinical predictors, nine carried significantly more weight than the others: (i) the duration of the index episode, (ii-iv) symptom checklist-based scores for psychosocial self-assuredness, psychoticism, and phobic anxiety, (v-viii) the personality traits neuroticism, extraversion, psychoticism, and harm avoidance, and, (ix), sum scores for life events (weighted for their straining impact). Although all items support the overall prediction, a review of these nine items strengthened the clinical validity in several ways: A longer duration of time in depression before initiation of antidepressant treatment has before been identified as a negative predictor of treatment outcome 51 . In contrary, no consistent predictive value was found for the total duration of the current episode including periods with and without treatment 52,53 . As the period without treatment was not quantified in our sample, we speculate that our current episode duration marker incorporated the untreated period, and significance was gained through the large statistical power. Furthermore, baseline symptom profiles made a relevant contribution to the model. Several reports emphasized that strong anxiety symptoms during a depressive episode increase the risk for nonremission 54 . Of the predictive symptom items (phobic anxiety, psychosocial self-assuredness, and psychoticism) at least two reflect aspects of anxiety, corroborating that high anxiety levels in MDD impede treatment response. Of note, in an analysis on a MARS subsample, patients with high anxiety levels showed structural brain differences in areas involved in the processing of social cues 55 , critically overlapping with areas that predict treatment response over six weeks 56 .
While the symptom checklist covers state-related items, personality questionnaires target more stable characteristics of a person. Here, harm avoidance and neuroticism -which both represent similar concepts of developing feelings of anxiety and avoidance behavior in the face of challenges-were confirmed as predictors. Such an association has been reported before 57,58 , which constitutes an indirect validation of the TRCs. Extraversion has so far mainly been found to protect against developing clinical symptoms in the face of chronic stress 59 . We report a clearer direct impact on treatment response, a finding possibly facilitated by the random forest approach that integrates multiple interaction effects. Eventually, weighted life events emerged as a negative predictor, as reported 60,61 . Life events, particularly early adverse events, represent episodes of prolonged adaptation, stress, and liability that increase the risk for MDD, but that also influence recovery chances [51][52][53]62,63 . Information on early childhood adversity was only available in a subsample (≈35%), disqualifying it for the full model. We speculate that the inclusion of additional details on the type and timing of life events could improve the model.
In an earlier representative MARS sample 29 , previous treatment resistance-usually defined by at least two unsuccessful trials with different antidepressants in adequate dosages for at least six weeks 64 -has been identified as a strong univariate predictor of non-remission. In this study, treatment resistance was encoded by the Antidepressant Treatment Response Questionnaire (ATRQ) that showed no significant importance p-value (yet a significant univariate association [data not shown]). Results based on the ATRQ may differ because this measure tends to underreport failed trials 65 . Similarly, the BMI, previously reported to be associated with remission rates 29 and treatment response 66 , was not associated with the TRCs in our study. One explanation is the use of a binary cutoff (25 kg/m 2 ) in the positive report 66 , which may point to a non-linear relationship. Of note, the number of previous depressive episodes-a lifetime disease burden marker-did not emerge as a predictor, confirming other negative reports 64 . Similarly, age at onset (AAO), which is often inversely correlated with the number of episodes, was not predictive. Concerning this marker, reports are mixed, some finding no correlation 67,68 and some reporting an influence on remission speed 69 or treatment resistance 70 . Hidden interactions of AAO with subgroups (as reported for comorbid alcohol dependency) 71 or non-linear relationships may explain this variability. Baseline cortisol as a simple HPA axis marker was also not predictive; stimulation tests, particularly when obtained longitudinally, are most likely more sensitive 72 . TRCs also differed by the type of psychopharmacological treatment (Supplementary Table S2), yet, due to the observational study design, this likely reflects either disease acuity (anxiolytic medication), treatment escalation following non-response (e.g., tricyclic antidepressants), or episode severity (antipsychotic medication for psychotic depression). Similar confounding co-correlations between medication variables and disease severity have been reported for biological markers, e.g., in meta-analyses of brain structure 73,74 .
We explored two different strategies for improving our base model 0 (Table 1), by either adding single baseline HAM-D items or by adding information on the partial early response after 2 weeks. Interestingly, the inclusion of single baseline HAM-D items did not improve the model (Table 2), possibly because the current symptomatology was already reflected in the symptom checklist items. This does not imply that primary clustering of single item trajectories would not result in additional clusters. While representing an important follow-up question and adding clinical elaborateness, this conceptual modification would increase the number of observations per case and could lead to model instability. Eventually, including the partial early response increased the model fit markedly, confirming similar reports from both observational and controlled studies 12,16,[48][49][50]62,63 . Notably, personality items were among the strongest predictors in all models (Table 3).

Limitations
Our study has several limitations. First, due to a necessary tradeoff between higher statistical power through a large sample size and the use of powerful, specific single predictors, clinical variables like neurocognitive results, complex endocrine tests, or neuroimaging markers were not included, despite reports on them being potentially useful 72,75 . Second, while psychopharmacological treatments are well-documented in MARS, no formalized assessment of previous nonpharmacological treatments, including psychotherapy, was available, preventing an inclusion of these factors. Third, the MARS discovery and validation samples significantly differed in six clinical baseline items, which may explain minor differences of the prediction results. However, these six items showed no overlap with the most informative predictors of model 0 or predictors emerging from the other models.

Conclusions
By employing model-based non-linear clustering to clinical ratings of a large cohort of MDD patients, we detected seven distinct treatment response classes that proved stable in two validation samples. In a multivariate prediction analysis, these classes could be predicted from 50 clinical baseline variables, with personality items, life events, duration of the episode, and psychopathological baseline characteristics carrying particular weight. Overall, the construct and clinical validity of these treatment response classes in MDD encourages an exploration of their neurobiological underpinnings and, more generally, effectively describes response patterns across multiple clinical cohorts.