Predicting the naturalistic course of depression from a wide range of clinical, psychological, and biological data: a machine learning approach

Many variables have been linked to different course trajectories of depression. These findings, however, are based on group comparisons with unknown translational value. This study evaluated the prognostic value of a wide range of clinical, psychological, and biological characteristics for predicting the course of depression and aimed to identify the best set of predictors. Eight hundred and four unipolar depressed patients (major depressive disorder or dysthymia) were assessed on a set of 81 demographic, clinical, psychological, and biological measures and were clinically followed up for 2 years. Subjects were grouped according to (i) the presence of a depression diagnosis at 2-year follow-up (yes n = 397, no n = 407), and (ii) three disease course trajectory groups (rapid remission, n = 356; gradual improvement, n = 273; and chronic, n = 175) identified by a latent class growth analysis. A penalized logistic regression, followed by tight control over type I error, was used to predict depression course and to evaluate the prognostic value of individual variables. Based on the Inventory of Depressive Symptomatology (IDS), we could predict a rapid remission course of depression with an AUROC of 0.69 and 62% accuracy, and the presence of an MDD diagnosis at follow-up with an AUROC of 0.66 and 66% accuracy. Other clinical, psychological, or biological variables did not significantly improve the prediction. Among the large set of variables considered, only the IDS provided predictive value for course prediction on an individual level, although this analysis represents only one possible methodological approach. However, accuracy of course prediction was moderate at best and further improvement is required for these findings to be clinically useful.

Grip strength was measured with a Jamar hand-held dynamometer while the participant was seated with a straight back. Lung function was measured as peak expiratory flow using a mini Wright peak flow meter. Details of these procedures are described in Van Milligen et al. 8 The total number of chronic diseases was established from a self-report questionnaire and included lung disease, diabetes, cardiovascular disease, cancer, osteoarthritis, intestinal disorder, liver disease, epilepsy and thyroid gland diseases for which a respondent received treatment and/or medical attention.
Heart rate and heart rate variability were assessed using the Vrije Universiteit Ambulatory Monitoring System (VU-AMS) during the greater part of the baseline assessment: the medical examination, interview, and a computer task (approximately 100 minutes in total). The ECG was recorded using a six-electrode configuration. Non-stationary periods, identified by an accelerometer, were discarded; additional parts of the recording were discarded upon visual inspection.

Definition of outcome groups using Latent Class Growth Analysis
We used a 3-class outcome definition based on course trajectories of the burden of depressive symptoms identified previously in the same sample. 9 The burden of depressive symptoms was assessed at 2-year follow-up using the Life Chart Inventory. 10 The Life Chart Inventory retrospectively assessed the burden of depressive symptoms for each month of the previous 2 years (between baseline and follow-up). The occurrence of life events was explored first, to refresh memory; then the presence of depressive symptoms and their self-reported burden was assessed on a 5-point Likert scale. This resulted in 24 monthly scores, to which Latent Class Growth Analysis (LCGA) was applied. The LCGA identified 5 distinct course trajectories that differed both in initial severity and in the change of symptom burden over time (Figure S1). Because LCGA is known to oversplit the data (i.e. find more categories than are really there) 11 , and to ensure a sufficient number of subjects in each course trajectory group, we combined the categories with a similar course (slope) but different initial severity into 3 final outcome groups: rapid remission, gradual improvement and chronic.

Statistical analysis
To handle the high correlations between our variables and to prevent overfitting, and thus poor out-of-sample prediction, we used logistic regression with an elastic net penalty. Penalization is achieved by adding a penalization term to the loss function that measures data fit. The addition of the penalization term aims to create simpler, and therefore potentially less overfit, models. We used an elastic net penalty, 12 which linearly combines two other commonly used penalization methods: the lasso and ridge penalties.
More specifically, we used penalized binomial and multinomial logistic regression. The method is explained in detail elsewhere. 13 In brief, multinomial logistic regression is an extension of binomial logistic regression to K > 2 classes. Since binomial logistic regression can be seen as a special case of multinomial logistic regression with K = 2 classes, here we describe only the multinomial version. Multinomial logistic regression models the probability of subject i belonging to class (i.e. course trajectory group) ℓ with a softmax function:

P(y_i = ℓ | x_i) = exp(β_{0ℓ} + x_i^T β_ℓ) / Σ_{k=1}^{K} exp(β_{0k} + x_i^T β_k),

where β_{0ℓ} and β_ℓ are respectively the intercept and regression coefficients for the logit function modeling the ℓ-th class, K is the number of classes and x_i^T is the vector of predictor variables for the i-th sample.
This model is fit by maximizing the penalized multinomial log-likelihood function:

(1/N) Σ_{i=1}^{N} [ Σ_{ℓ=1}^{K} y_{iℓ} (β_{0ℓ} + x_i^T β_ℓ) − log Σ_{k=1}^{K} exp(β_{0k} + x_i^T β_k) ] − λ Σ_{ℓ=1}^{K} [ α ||β_ℓ||_1 + ((1 − α)/2) ||β_ℓ||_2^2 ].

Here, N is the number of samples and y_{iℓ} is the class label following a one-of-K coding scheme (i.e. it equals 1 if subject i belongs to class ℓ and zero otherwise). The last term is the elastic-net penalty, where λ controls the overall amount of applied penalization and α specifies the balance between the lasso and ridge penalty terms (i.e. the first and second terms in the penalty above). Logistic regression penalized with only the lasso or the ridge penalty is a special case of elastic net, with α = 1 for pure lasso and α = 0 for pure ridge. The ridge part shrinks the coefficients, whilst the lasso part encourages variable selection (effectively forcing some coefficients to be exactly zero, and therefore removing them from the model). Therefore, elastic net, similarly to the lasso, 14 can select only the discriminating features. Elastic net, however, behaves better in the presence of highly correlated, informative groups of predictors: while the lasso might select only one of a set of correlated variables, elastic net tends to select the whole group.
In our study we used a fixed value of α = 0.2 for all analyses. While ultimately arbitrary, this value is closer to ridge regression than to the lasso, yet it still retains the feature selection properties of the lasso. It therefore provides a reasonable balance: the model will select informative variables even if they are correlated with other variables, minimizing the chance of overlooking variables that are informative but might be overshadowed by other, highly correlated, slightly more informative variables.
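As a minimal sketch, a model of this form can be fit in Python with scikit-learn rather than the glmnet package used in the study (the data below are synthetic; scikit-learn's l1_ratio corresponds to α = 0.2, and C is the inverse of the overall penalty strength λ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
# synthetic 3-class outcome driven by the first two predictors
logits = np.column_stack([X[:, 0], X[:, 1], -X[:, 0] - X[:, 1]])
y = logits.argmax(axis=1)

# elastic-net penalty: l1_ratio plays the role of alpha (0.2 -> closer to
# ridge), C is the inverse of the overall penalty strength lambda
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.2, C=1.0, max_iter=5000)
model.fit(X, y)
proba = model.predict_proba(X)   # softmax probabilities per class
```

Note that some coefficients in `model.coef_` can be exactly zero thanks to the lasso component, while the ridge component keeps groups of correlated predictors in the model together.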

Cross-validation
We used repeated 10-fold cross-validation to assess the generalizability of the model. The dataset was divided into 10 subsamples, with approximately the same proportion of outcome classes in each subsample as in the full dataset. Nine subsamples served as a training set and the remaining tenth as a test set. The model was fit on the training set and evaluated on the test set. This was repeated 10 times, each time with a different subsample serving as the test set, so that each subsample served as a test set exactly once.
This whole procedure was repeated 10 times with different splits of the data, to decrease the dependence of the result on one particular data split. The final model performance was obtained by averaging results from the 100 test folds (10 repeats × 10 folds).
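The splitting scheme above corresponds to repeated stratified 10-fold cross-validation, which can be sketched with scikit-learn (synthetic data; the study's actual models are not refit here):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=100)
X = rng.normal(size=(100, 5))

# 10 repeats of stratified 10-fold CV -> 100 train/test splits;
# within each repeat, every subject appears in a test fold exactly once
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
n_splits = sum(1 for _ in rskf.split(X, y))   # 100 folds in total
```

Stratification preserves the outcome-class proportions in every fold, matching the description above.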

Optimal model parameters
To avoid double dipping and model selection bias, the optimal amount of regularization (λ) was selected for each fold separately, within an additional (nested) 10-fold CV loop performed on the training data only. The selection of the amount of regularization was therefore not influenced by the test data. CV was performed for 100 candidate λ values, ranging from a value of λ that would result in a model containing only one predictor to a value that would allow the model to keep all predictors, as generated by the glmnet package. 13 The λ that minimized the cross-validated mean squared error (Brier score) was then used to fit the model on the whole training set.
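A nested loop of this kind can be sketched in scikit-learn, with the inner cross-validation choosing the penalty strength on the training folds only (synthetic data; 5 outer folds and 20 candidate strengths are used here for brevity, versus 10 folds and 100 λ values in the study, and `neg_brier_score` stands in for the cross-validated Brier score):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=150) > 0).astype(int)

# outer CV estimates out-of-sample performance
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = []
for tr, te in outer.split(X, y):
    # inner CV over candidate penalty strengths (Cs ~ 1/lambda) is fit on
    # the training folds only, so the test fold never influences lambda
    inner = LogisticRegressionCV(Cs=20, cv=5, penalty="elasticnet",
                                 solver="saga", l1_ratios=[0.2],
                                 scoring="neg_brier_score", max_iter=2000)
    inner.fit(X[tr], y[tr])
    aucs.append(roc_auc_score(y[te], inner.predict_proba(X[te])[:, 1]))
mean_auc = float(np.mean(aucs))
```

Because λ is chosen inside each training set, the outer test folds give an unbiased estimate of generalization performance.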

Performance measures
Discrete measures (e.g. sensitivity, balanced accuracy) were obtained by thresholding the probability predictions. In the binomial case (presence/absence of a unipolar depression diagnosis at follow-up), predictions were thresholded by class proportion: a subject was assigned to a class if its predicted probability exceeded the proportion of subjects in that class. Using the same approach in the multinomial case (3 LCGA trajectory groups) can yield more than one predicted class per participant, and we therefore used a different thresholding procedure. Predicted probabilities for each class were normalized by the class proportions (the probability of an individual belonging to class X divided by the proportion of participants in class X), and the class with the highest normalized prediction was selected. Note that, when applied to the binomial case, this method produces the same result as simple thresholding by class proportions.
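The multinomial thresholding rule can be written compactly; the probabilities and proportions below are hypothetical illustrations (the proportions roughly mirror the 356/273/175 trajectory group sizes):

```python
import numpy as np

def predict_class(proba, class_props):
    """Pick the class with the highest probability after normalizing
    by the class proportions (prevalence-adjusted thresholding)."""
    return (proba / class_props).argmax(axis=1)

# hypothetical predicted probabilities for the 3 trajectory groups
proba = np.array([[0.50, 0.30, 0.20],
                  [0.40, 0.35, 0.25]])
# class proportions, e.g. roughly 356/804, 273/804, 175/804
props = np.array([0.44, 0.34, 0.22])
print(predict_class(proba, props))   # → [0 2]
```

The second subject illustrates the point of the normalization: its raw probabilities favor class 0, but relative to the small prevalence of class 2, the 0.25 prediction is the strongest signal.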

Permutation testing
To establish whether the estimated models' performance was statistically significant, we repeated the whole procedure 1000 times with randomly permuted labels. The p-value was calculated as the proportion of permutations that achieved a higher AUROC than the one obtained with the real labels.
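The p-value computation can be illustrated as follows. In the study the entire cross-validation pipeline was rerun for each permutation; this sketch instead permutes labels against a fixed synthetic score vector, purely to show how the null distribution and p-value are formed:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=200)
scores = y + rng.normal(size=200)     # stand-in for model predictions

observed = roc_auc_score(y, scores)
# null distribution: AUROC after randomly permuting the labels
perm_aucs = np.array([roc_auc_score(rng.permutation(y), scores)
                      for _ in range(1000)])
# p-value: proportion of permutations with a higher AUROC than observed
p_value = np.mean(perm_aucs > observed)
```

Permuted labels break any real association, so the permuted AUROCs cluster around 0.5 and the observed AUROC falls in the tail when the model carries signal.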

Imputation of missing values
Each missing value was replaced by the mean value of the 5 most similar individuals, as measured by Euclidean distance. Only data from the training set were used to compute this value.
The proportion of missing data was small; in total, 25 of the 80 variables contained missing values.
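A 5-nearest-neighbour imputation of this kind is available in scikit-learn; the sketch below uses synthetic data and shows the key point that the imputer is fit on the training set only:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
X[rng.random(X.shape) < 0.05] = np.nan   # ~5% of entries set to missing

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# mean of the 5 nearest neighbours (Euclidean distance over observed
# entries); fitting on the training set only keeps test data out of
# the imputation
imputer = KNNImputer(n_neighbors=5)
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)
```

Test-set missing values are filled using neighbours drawn from the training set, so no information leaks from the test fold into model fitting.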

Nonlinear and interaction analysis
We conducted additional exploratory analyses to model possible interaction and nonlinear age effects. We repeated the cross-validation and stability selection procedures for models with two additional sets of predictors: (I) all variables plus interactions between all pairs of variables; (II) all variables, quadratic and cubic expansions of age, and interactions between all pairs of variables (including those with the expanded age terms).
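Both expanded predictor sets can be built mechanically; the sketch below uses a small synthetic matrix in which the first column is assumed to be age:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 4))          # stand-in predictor matrix
age = X[:, 0]                         # suppose the first column is age

# set I: all variables plus all pairwise interactions
# -> 4 main effects + C(4,2) = 6 interactions = 10 columns
set_I = PolynomialFeatures(degree=2, interaction_only=True,
                           include_bias=False).fit_transform(X)

# set II: additionally add quadratic and cubic age terms before expanding
# -> 6 main effects + C(6,2) = 15 interactions = 21 columns
X_ext = np.column_stack([X, age**2, age**3])
set_II = PolynomialFeatures(degree=2, interaction_only=True,
                            include_bias=False).fit_transform(X_ext)
```

With the study's 81 variables the pairwise expansion grows quadratically (81 main effects plus 3240 interactions), which is why the penalized regression and stability selection steps are reused for these models.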

Stability selection
Additional separate analyses were performed to assess the statistical significance of individual predictors. For this purpose, we used a stability selection approach. 16 The selected variables are those whose probability of being selected across resamples exceeds a prespecified threshold. While the choice of this threshold is arbitrary, we used the default values suggested by the authors of the method, 15 which are justified because they lead to theoretical guarantees on bounding the number of falsely selected variables. 15 More specifically, stability selection allows us to compute the expected number of falsely selected variables at any given point on the stability path. This is used to specify a region of the stability path where the chance of falsely selected variables is smaller than a desired family-wise error rate (FWER).
Stability selection was performed using the R package c060. We used 1000 resamples of the default size of 63.2% of the full dataset, with elastic-net logistic regression with α = 0.2 as the feature selection method (the same parameter value as in our penalized logistic regression models) and a threshold of 0.75.
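The core of the resampling scheme can be approximated in Python; this is only a simplified stand-in for the c060/glmnet procedure (synthetic data, 200 rather than 1000 resamples, selection frequencies without the full stability-path FWER machinery):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n, p = 200, 10
X = rng.normal(size=(n, p))
# only the first two predictors carry signal
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

n_resamples, frac = 200, 0.632    # the study used 1000 resamples of 63.2%
counts = np.zeros(p)
for _ in range(n_resamples):
    idx = rng.choice(n, size=int(frac * n), replace=False)
    clf = LogisticRegression(penalty="elasticnet", solver="saga",
                             l1_ratio=0.2, C=0.01, max_iter=5000)
    clf.fit(X[idx], y[idx])
    counts += (clf.coef_[0] != 0)          # which coefficients survived

selection_prob = counts / n_resamples
stable = np.where(selection_prob >= 0.75)[0]  # threshold used in the study
```

Informative predictors are selected in nearly every resample, while noise predictors rarely survive the lasso component, which is what the 0.75 threshold exploits.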

IDS individual items
After identification of the IDS as the only significant predictor of the course of depression, we performed the same stability selection analysis to identify which individual items of the IDS questionnaire were predictive of the course of depression. Only the item "Feeling sad" was significant for the prediction of LCGA course trajectory classes, and "Reactivity of mood" for the diagnosis (presence/absence of an MDD diagnosis at follow-up) prediction. The AUROCs of models trained using only these two variables were 0.64 for predicting diagnosis and, for the LCGA groups, remission = 0.68, improvement = 0.61 and chronic = 0.61.

Positive and negative predictive values
The PPV and NPV provide important additional information, as they also take into account the prevalence of the group of interest (e.g. a chronic course trajectory). If, for example, the prevalence of the group of interest that is to be classified by the prediction model is low relative to the prevalence of the group it is classified against, the PPV may be low even though the sensitivity is high. Thus, a low PPV means that a high proportion of classified subjects are false positives.

Nonlinear and interaction analysis results
The diagnosis-of-dysthymia variables with one-month, 6-month and 12-month recency are highly correlated and represent the same construct (dysthymia); interactions between these variables with different recencies are therefore not very meaningful. All of the variables identified by stability selection in these additional exploratory models were amongst the highest-ranked variables and were selected by stability selection in the main analysis (models including only main effects), although variables other than the IDS total score did not survive FWER correction in the main analysis. These results show that the AUROCs of the models including interaction terms and nonlinear effects of age are similar to the AUROCs reported for models without these additional terms. Thus, the nonlinear and interaction effects we included did not improve overall predictive performance for course prediction.

Used variables
Demographic, clinical, psychological and biological variables stratified by the outcome groups (presence of a diagnosis at follow-up or LCGA course trajectories). Binary or dummy-coded variables are shown as bar charts; the remaining variables are shown as boxplots.