INTRODUCTION

Anxiety disorders are highly prevalent (Kessler et al, 2005a) and significantly impairing (Sherbourne et al, 2010). Furthermore, anxiety disorders have increased in prevalence over the past several decades despite simultaneous increases in the availability of effective treatments (Kessler et al, 2005b). Although efficacious treatments exist, it is unclear who will respond best to what type of treatment. Thus, matching a patient to the best treatment for that individual is one of the most critical problems in psychiatry, yet few investigations have examined these individual differences.

One approach is to identify pre-treatment characteristics that predict treatment response. In addition to variables such as demographics and clinical characteristics, functional neuroimaging data may be particularly useful, because they are thought to quantify the underlying biological disease state. Previous neuroimaging studies of outcomes in psychiatric disorders have focused predominantly on mood disorders and have identified the anterior cingulate as a potential predictor of antidepressant medication (Mayberg et al, 1997) or cognitive therapy (Siegle et al, 2012) response. In comparison, research in anxiety disorders has been limited to very small studies examining medication response and restricted to the amygdala and anterior cingulate (eg, Nitschke et al, 2009; Whalen et al, 2008). Results from these studies have been inconsistent and have not generated clinically useful single-subject predictions. There is therefore a clear need for research in anxiety disorders that (a) examines psychotherapy outcomes, (b) explores additional potential regions of interest (ROIs), and (c) moves towards clinically applicable single-subject prediction (as in Siegle et al, 2012).

In the present study, 48 patients with generalized anxiety disorder (GAD) or panic disorder (PD) were scanned while completing an emotion regulation task before cognitive behavioral therapy (CBT). Individuals with anxiety disorders have been hypothesized to have difficulty with emotion regulation (Aldao et al, 2010) and have demonstrated hypo-activation in dorsolateral, dorsomedial, and ventrolateral prefrontal cortex (PFC) during emotion regulation (Ball et al, 2013; Goldin et al, 2009a; Goldin et al, 2009b). We therefore sought to examine emotion regulation-related brain activation as predictors of CBT outcome.

A number of techniques have been proposed for classifying complex biological data, including support vector machines and logistic regression. Among these, random forest (Breiman, 2001) is one of the most consistently robust predictive techniques, yielding superior performance in independent replication (Qi et al, 2006). The random forest technique consists of a complex partitioning of the predictor variable space and is summarized in Figure 1 (see also Breiman, 2001; Genuer et al, 2010; Strobl et al, 2009). It is particularly appropriate when the number of predictor variables is much larger than the number of subjects (Bureau et al, 2005), which is typically the case with neuroimaging data sets. Moreover, random forest has a low tendency to over-fit (Strobl et al, 2009), and the stepwise partitioning of the feature space can yield high-order interactions among many predictor variables that cannot be identified using other classification procedures (Lunetta et al, 2004). Random forest models have successfully been used to detect Alzheimer’s disease (O’Bryant et al, 2010), identify tumor cells (Shi et al, 2004), and predict in-hospital mortality (Ward et al, 2006) but have not been used to generate psychiatric outcome predictions based on functional neuroimaging.

Figure 1

Random forest procedure. Step 1a is to build a decision tree based on a bootstrapped sample of participants (filled circles represent responders and open circles represent non-responders) and a random sample of predictor variables (eg, average activation in anatomically defined brain regions, arbitrary data shown). The random forest algorithm determines the optimal split point for each variable in order to correctly classify this subset of participants. Step 1b is to repeat this process hundreds or thousands of times to generate a forest of trees. In step 2, each tree classifies the participants that were not used in its original construction; each tree then ‘votes’ for the classification of these participants, and these votes are aggregated to provide the predicted status of each participant and thereby determine accuracy. Figures marked with an asterisk (*) indicate inaccurately classified participants in this example. Step 3 is the identification of the most important variables for prediction. Brain regions are ranked in terms of their variable importance scores: only those with greater importance than the absolute value of the most negative importance rating are selected for the final model (arbitrary data shown). The variables selected for inclusion are then used as the sole input variables for another iteration of steps 1 and 2, generating the final model.


In the present study, we used random forests to predict CBT outcomes in GAD and PD. We expected that key limbic (amygdala, insula, anterior cingulate) and prefrontal (dorsolateral, dorsomedial, ventrolateral PFC) regions previously implicated in the etiology of anxiety disorders (Shin and Liberzon, 2010) would have a role. In addition to identifying brain areas contributing most to outcome prediction in this sample, we also examined the utility of random forest modeling as a means to predict therapy outcome in anxiety. To that end, we calculated test characteristics of sensitivity, specificity, and positive and negative likelihood ratios.

MATERIALS AND METHODS

Participants

The University of California San Diego Human Research Protections Program approved this study. After providing written informed consent, 107 participants were screened by semi-structured diagnostic interview (Sheehan et al, 1998). Of these, 53 were eligible: they met DSM-IV criteria for clinically predominant PD (n=26: 22 with agoraphobia) or GAD (n=27), did not meet criteria for lifetime psychosis, past-year substance dependence, or past-month substance abuse, and participated in an fMRI session before CBT. Other anxiety disorders, including co-occurring PD and GAD, were permitted. Five participants were excluded from analysis due to poor fMRI data quality, leaving 23 PD and 25 GAD. All were free of psychotropic medications for 6 weeks (2 weeks for benzodiazepines) and met safety and eligibility criteria for fMRI. Data from 39 of the 48 participants were included in a previous report that did not include treatment data (Ball et al, 2013). Table 1 presents demographic and self-report measures.

Table 1 Clinical and Demographic Measures

Procedure

Eligible participants were screened with a medical examination consisting of a laboratory evaluation, EKG, and drug and pregnancy screen. Participants also completed self-report measures: the Overall Anxiety Severity and Impairment Scale (OASIS; Norman et al, 2006), an abbreviated version of the Penn State Worry Questionnaire (PSWQ-A; Hopko et al, 2003), the Anxiety Sensitivity Index (ASI; Reiss et al, 1986), the Intolerance of Uncertainty Scale (IUS; Buhr and Dugas, 2002), the Quick Inventory of Depressive Symptomatology (QIDS; Rush et al, 2003), and the NEO Personality Inventory (Costa and McCrae, 1992). Finally, all participants completed an fMRI scan before 10 sessions of open-label weekly individual CBT (Craske et al, 2009) with a clinical psychologist or master’s-level therapist supervised by a clinical psychologist with CBT expertise.

The OASIS was the primary outcome measure due to its clinical relevance and applicability to both disorders. Responders were classified based on OASIS scores of 5 at the end of therapy (Roy-Byrne et al, 2010), with 60% of participants classified as responders (14 GAD, 15 PD). OASIS scores were collected every 2 weeks: participants who dropped out of treatment after completing at least four sessions (n=7) were included in the analysis using their most recent score.

fMRI Task

Two processes were examined with the emotion regulation task: in each trial, individuals were asked to either ‘Keep Up’ (hereafter maintain) or ‘Reduce’ (hereafter reappraise) their emotional responses to negative images. Twelve negative images for each condition were selected from the International Affective Picture System (Lang et al, 2008). No neutral images were used; instead, a pixel-wise scrambled version of each image was presented as a baseline at the start of each trial. Participants were also prompted to rate their emotional state during each baseline and image presentation period. More details about this task have previously been described (Campbell-Sills et al, 2011).

Image Acquisition

One 9 min 40 s BOLD fMRI run was acquired, using a Signa EXCITE 3.0 Tesla-GE scanner (T2*-weighted echo planar imaging, TR=2000 ms, TE=32 ms, FOV=240 × 240 mm, 64 × 64 matrix, 30 2.6-mm axial slices with a 1.4-mm gap, 290 scans). For anatomical reference, a high-resolution T1-weighted image (SPGR, TI=450, TR=8 ms, TE=3 ms, FOV=250 × 250 mm, flip angle=12°, 172 sagittally acquired slices with 1-mm thickness) was obtained during the same session.

Image Processing

All structural and functional image processing was done with Analysis of Functional NeuroImages (AFNI) software (Cox, 1996). Time points with >2SDs more outlier voxels than the subject’s mean were excluded from analysis, as determined by the AFNI function 3dToutcount. Voxel time series were interpolated to correct for non-simultaneous slice acquisition and corrected for three-dimensional motion. Anatomical and functional volumes were co-registered algorithmically (Saad et al, 2009).

Individual participant time series data were analyzed with AFNI’s 3dDeconvolve program. Orthogonal regressors of interest modeled the maintain and reappraise conditions. Additional regressors of non-interest modeled the emotion rating periods, motion regressors, and linear and quadratic trends in the time series. Regressors were convolved with a modified gamma variate function to account for the hemodynamic response. Following deconvolution, data were converted to percentage of signal change by dividing the coefficient by the zero-order regressor within each voxel. Data were normalized to Talairach coordinates (Talairach and Tournoux, 1988) and subjected to 4 mm Gaussian spatial smoothing.

To obtain an easily replicable data set, we obtained average voxel-wise activation estimates for 70 anatomical ROIs during each of the two conditions (reappraise and maintain) to yield 140 independent variables. These ROIs were constructed as described in a previous report (Fonzo et al, 2012). Briefly, the Talairach atlas (Talairach and Tournoux, 1988) was combined with grey matter probabilities based on high-resolution T1 images from a group of 43 healthy adults. Grey matter probabilities were determined by applying grey matter segmentation with SPM5 (Statistical Parametric Mapping software; http://www.fil.ion.ucl.ac.uk/spm) for each subject, yielding voxel-wise probabilities of assignment to grey matter across all subjects. The probability maps were transformed to Talairach coordinates, and ROIs were defined using the Talairach atlas. Insula ROIs were divided into anterior and posterior at y=0, and the medial frontal gyrus was divided into ventromedial and dorsomedial at z=14. A total of 70 anatomical ROIs were created (Supplementary Table S1), and average activations were extracted for each subject, in each ROI, for each of the two task conditions, yielding 140 variables.

Statistical Analysis

Random forest classification was implemented in R statistics (http://cran.r-project.org; randomForest library). The goal of the procedure was to predict responder status using activations extracted from the anatomical ROIs during the two emotion regulation conditions. The random forest procedure involves several steps, as summarized in Figure 1 and described in detail elsewhere (Breiman, 2001; Genuer et al, 2010; Strobl et al, 2009). The first step is to construct a large number of classification trees (2000 trees were used, as recommended in Genuer et al, 2010), each using a bootstrapped subsample of participants and a randomly selected subset of the independent variables (Figure 1, Step 1). The number of independent variables used to create each classification tree was the square root of the total number of independent variables, as recommended by Díaz-Uriarte and De Andres (2006). At each node of the tree, a decision variable is selected randomly from the subset of available independent variables, and an optimal split point for that node is determined. The tree ends with each participant classified as a predicted responder or non-responder. In the second step, data from the out-of-bag sample (ie, for each classification tree, data from participants not used to create that tree) are used to evaluate how well the trees, in aggregate, classify subjects. Specifically, classification accuracy, sensitivity, specificity, and positive and negative likelihood ratios were evaluated (Figure 1, Step 2).
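The sampling behind Step 1 and the out-of-bag set used in Step 2 can be illustrated with a minimal sketch. It is not the randomForest code used in this study: it only draws the inputs for a single tree, and the function name `draw_tree_inputs`, the fixed seed, and the choice to round the square root down are assumptions made for illustration.

```python
import math
import random

N_PARTICIPANTS = 48   # sample size reported in the text
N_PREDICTORS = 140    # 70 ROIs x 2 task conditions


def draw_tree_inputs(rng, n_participants=N_PARTICIPANTS, n_predictors=N_PREDICTORS):
    """Draw in-bag rows, out-of-bag rows, and candidate predictors for one tree."""
    # Bootstrap: sample participants with replacement, same size as the data set.
    in_bag = [rng.randrange(n_participants) for _ in range(n_participants)]
    # Out-of-bag participants are those never drawn for this tree.
    out_of_bag = sorted(set(range(n_participants)) - set(in_bag))
    # Candidate predictors: square root of the total number of predictors.
    # Rounding down is an assumption; the text does not state the rounding.
    n_candidates = math.isqrt(n_predictors)
    # Many implementations repeat this draw at every split of the tree.
    candidates = sorted(rng.sample(range(n_predictors), n_candidates))
    return in_bag, out_of_bag, candidates


rng = random.Random(0)  # fixed seed (hypothetical) so the sketch is repeatable
in_bag, out_of_bag, candidates = draw_tree_inputs(rng)
print(len(in_bag), len(out_of_bag), len(candidates))
```

Because each bootstrap sample leaves out roughly a third of participants on average, every participant is out-of-bag for many of the 2000 trees, which is what allows accuracy to be estimated without a separate hold-out set.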

Third, data from the out-of-bag sample are used to evaluate variable importance, indicating how much each variable contributes to classification accuracy. Here, permutation importance scores were used to select the subset of best-performing variables for inclusion in a more parsimonious final model (Genuer et al, 2010). The permutation importance score of a predictor variable is defined as the decrease in overall accuracy when values of that variable are randomly permuted (Breiman, 2001). Variables that relate strongly to the outcome and contribute substantially to classification have large permutation importance, whereas those that do not contribute to classification have small or even negative scores (Strobl et al, 2009). Because negative scores are due to random variation around zero of the poor predictor variables (Strobl et al, 2009), only variables with an importance score greater than the magnitude of the most negative score were selected for inclusion (Figure 1, Step 3). Following Nicodemus et al (2010), the number of variables retained in the final model was limited to 10 and was based on median permutation importance scores from 500 repetitions of the random forest analysis to ensure stability of importance score estimates. The removal of poor performing variables can increase overall accuracy by increasing the selection of relevant variables for the decision trees and therefore increasing the relevance of included data to outcome prediction. A final, reduced model was built using the selected subset of predictor variables.
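The selection rule described above can be sketched as follows. This is an illustrative reading of the rule, not the authors' code: the function name `select_variables`, the toy ROI names and scores, the use of the median scores to define the most negative value, and the fallback threshold of zero when no score is negative are all assumptions.

```python
from statistics import median


def select_variables(importance_runs, max_vars=10):
    """Pick predictors from repeated permutation importance scores.

    importance_runs maps a predictor name to a list of importance scores,
    one per repetition of the random forest analysis.
    """
    # Summarize each predictor by its median score across repetitions.
    med = {name: median(scores) for name, scores in importance_runs.items()}
    most_negative = min(med.values())
    # Threshold: magnitude of the most negative median score.
    # Falling back to 0 when no score is negative is an assumption.
    threshold = abs(most_negative) if most_negative < 0 else 0.0
    kept = [name for name, score in med.items() if score > threshold]
    # Rank by importance and cap the number of retained predictors.
    kept.sort(key=lambda name: med[name], reverse=True)
    return kept[:max_vars]


# Hypothetical toy scores (three repetitions instead of 500).
toy_runs = {
    "roi_a": [0.020, 0.030, 0.025],
    "roi_b": [0.004, 0.002, 0.003],
    "roi_c": [-0.005, -0.004, -0.006],
    "roi_d": [0.010, 0.012, 0.008],
}
selected = select_variables(toy_runs)
print(selected)
```

In the toy data the most negative median is -0.005, so only predictors with a median score above 0.005 are retained.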

A similar analysis was undertaken using baseline clinical measures (OASIS, ASI, PSWQ-A, IUS, QIDS, NEO-Neuroticism, number of comorbid diagnoses) and demographics (age, gender, years of education) for comparison with the fMRI-based model. In addition, a third model was built using both brain imaging and clinical and demographic measures. For each of these, the best performing predictor variables were selected as described above.

Because the response rate was 60% (95% CI: 47–74%), a simple model that predicts that all cases will respond to therapy achieves 60% accuracy. Therefore, the accuracy of each random forest model was compared with the base response rate and its confidence interval. Models whose prediction accuracy falls above the base rate confidence interval (ie, >74%) perform statistically significantly better than predictions based only on the estimated responder rate and exceed the accuracy expected by chance.
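As a worked check of the base rate, the sketch below computes a normal-approximation (Wald) interval for 29 responders out of 48 participants. The interval method is an assumption, since the text does not name the one used, but this approximation reproduces the reported bounds; the function name `wald_interval` is hypothetical.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Proportion with a normal-approximation (Wald) confidence interval."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width


# 29 responders (14 GAD + 15 PD) out of 48 participants, as reported in the text.
rate, lower, upper = wald_interval(29, 48)
print(f"{rate:.0%} (95% CI: {lower:.0%}-{upper:.0%})")
```

Under this approximation, the upper bound of about 74% is the accuracy a model must exceed to outperform the rule of predicting that every patient responds.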

To statistically compare the three models to each other, McNemar’s test was used. In order to aid in the interpretation of the selected ROIs, correlations were performed between the ROIs in the final model and selected self-report measures (OASIS, ASI, PSWQ-A, and QIDS). Bonferroni correction (α=0.005) was used; however, uncorrected significance tests (α=0.05) are also reported for exploratory purposes.
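McNemar’s test compares two classifiers evaluated on the same participants by using only the discordant cases, that is, participants classified correctly by one model and incorrectly by the other. The sketch below shows the chi-squared form with a continuity correction. The function name `mcnemar_chi2`, the choice of correction, and the example counts are hypothetical; the discordant counts for the models in this study are not reported in the text.

```python
import math


def mcnemar_chi2(b, c, continuity=True):
    """McNemar chi-squared statistic (1 df) and P-value from discordant counts.

    b: cases correct under model A only; c: cases correct under model B only.
    """
    n = b + c
    if n == 0:
        # No discordant pairs: the two models never disagree.
        return 0.0, 1.0
    diff = abs(b - c)
    if continuity:
        # Continuity correction, clamped at zero (a choice made in this sketch).
        diff = max(diff - 1, 0)
    statistic = diff ** 2 / n
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical discordant counts, for illustration only.
statistic, p_value = mcnemar_chi2(3, 6)
print(round(statistic, 3), round(p_value, 3))
```

With few discordant pairs, as is likely in a sample of 48, the test has limited power to detect a difference between models.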

We used all participants (n=48) to create the random forest models for several reasons. First, this task yielded few differences at baseline between GAD and PD participants (Ball et al, 2013). Second, emotion regulation is considered a trans-diagnostic construct underlying affective disorders (Barlow et al, 2004), and the CBT protocol was designed such that it can be flexibly implemented across diagnoses (Craske et al, 2009). Finally, to best assess our ability to create predictive models using neuroimaging data, the increased power afforded by combining the groups was a distinct advantage. However, because the set of participants includes individuals with either primary GAD or PD, analyses were also conducted in each group separately. Individuals with comorbid PD and GAD were excluded from this supplemental analysis.

RESULTS

Table 1 shows baseline clinical and demographic measures. fMRI task effects, group effects, and brain–behavior relationships in the majority of these participants have been reported elsewhere (Ball et al, 2013). For each set of predictors, two models are described: an initial model built with all variables, and a final model built with the subset of predictors contributing most to classification accuracy. Results of each model in GAD and PD separately are in the Supplementary Results (Supplementary Table S2).

Clinical and Demographic Model

The initial model comprised baseline clinical and demographic variables as predictors of responder status and yielded accuracy of 63%. Three variables met criteria for inclusion in the final model: OASIS, ASI, and PSWQ-A, representing overall severity and impairment, anxiety sensitivity, and worry severity, respectively. The final model with these variables yielded accuracy of 69%, which is not statistically significantly different from the model based on response rate alone. Sensitivity was 0.79, and specificity was 0.53. The positive likelihood ratio of 1.67 (95% CI: 1.01, 2.79) and the negative likelihood ratio of 0.39 (95% CI: 0.17, 0.90) were statistically significantly different from each other and from 1.0 (p<0.05).

fMRI Model

The initial model contained average reappraise- and maintain-related activation before treatment in each of the 70 anatomical ROIs. These 140 variables were entered as predictors of responder status yielding accuracy of 65%. Ten variables met criteria for inclusion in the final model: right hippocampus and left uncus activation during maintenance, as well as left transverse temporal gyrus, left anterior insula, right and left superior temporal gyrus, left supramarginal gyrus, left precentral gyrus, left superior frontal gyrus, and right substantia nigra activation during reappraisal. The final model with these variables yielded accuracy of 79%, which is statistically significantly greater than the model based on response rate alone (p<0.05). Sensitivity was 0.86, and specificity was 0.68. The positive likelihood ratio of 2.73 (95% confidence interval: 1.39, 5.38) and negative likelihood ratio of 0.20 (95% confidence interval: 0.08, 0.53) were statistically significantly different from each other and from 1.0 (p<0.05).
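The test characteristics of the final fMRI model can be reproduced with the sketch below. The cell counts (25 of 29 responders and 13 of 19 non-responders correctly classified) are not stated in the text; they are inferred from the reported sensitivity, specificity, and group sizes. The log-scale confidence interval is likewise an assumed method, although it matches the reported intervals. The function name `likelihood_ratios` is hypothetical.

```python
import math


def likelihood_ratios(tp, fn, fp, tn, z=1.96):
    """Sensitivity, specificity, and likelihood ratios with log-scale 95% CIs.

    Assumes every cell count is nonzero; a zero cell raises ZeroDivisionError.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    # Standard errors of the log likelihood ratios.
    se_pos = math.sqrt(1 / tp - 1 / (tp + fn) + 1 / fp - 1 / (fp + tn))
    se_neg = math.sqrt(1 / fn - 1 / (tp + fn) + 1 / tn - 1 / (fp + tn))

    def interval(ratio, se):
        return ratio * math.exp(-z * se), ratio * math.exp(z * se)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "lr_pos": lr_pos,
        "lr_pos_ci": interval(lr_pos, se_pos),
        "lr_neg": lr_neg,
        "lr_neg_ci": interval(lr_neg, se_neg),
    }


# Counts inferred (not reported): 25 true positives, 4 false negatives,
# 6 false positives, 13 true negatives.
fmri = likelihood_ratios(tp=25, fn=4, fp=6, tn=13)
for key, value in fmri.items():
    print(key, value)
```

A positive likelihood ratio whose interval excludes 1.0 indicates that a predicted-responder result raises the odds of response beyond the base rate; the same logic applies in reverse to the negative likelihood ratio.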

Combined Model

The initial model comprised all baseline clinical, demographic, and fMRI variables as predictors of responder status and yielded accuracy of 60%. Ten variables met the criteria for inclusion in the final model: OASIS, ASI, PSWQ-A, as well as right hippocampus and left uncus activation during maintenance, and left transverse temporal gyrus, left supramarginal gyrus, left precentral gyrus, left superior frontal gyrus, and right substantia nigra activation during reappraisal. The final model with these variables yielded accuracy of 73%, which is not statistically significantly different from the model based on response rate alone. Sensitivity was 0.83, and specificity was 0.58. The positive likelihood ratio of 1.97 (95% confidence interval: 1.13, 3.42) and negative likelihood ratio of 0.30 (95% confidence interval: 0.12, 0.72) were statistically significantly different from each other and from 1.0 (p<0.05).

Model Comparisons

Table 2 shows the test characteristics of all three final models. Figure 2 illustrates the predictive information gained by each model over and above knowing the treatment response rate. In addition, McNemar’s chi-squared test was computed for all pairwise model comparisons. None of the comparisons were significant (p>0.2), indicating that the models did not statistically significantly differ from each other.
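The information a model adds beyond the response rate can be expressed as a post-test probability, obtained by converting the base rate to odds, multiplying by the likelihood ratio, and converting back. The sketch below applies this to the likelihood ratios reported for the fMRI model; the function name `post_test_probability` is hypothetical, and the resulting values are computed here rather than taken from the figure.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Update a pre-test probability with a likelihood ratio via odds."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


BASE_RATE = 29 / 48  # observed response rate (about 60%)

# Likelihood ratios reported for the final fMRI model.
p_if_predicted_responder = post_test_probability(BASE_RATE, 2.73)
p_if_predicted_nonresponder = post_test_probability(BASE_RATE, 0.20)
print(round(p_if_predicted_responder, 2), round(p_if_predicted_nonresponder, 2))
```

Under these inputs, the probability of response is roughly 0.81 for a predicted responder and roughly 0.23 for a predicted non-responder, compared with 0.60 before testing.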

Table 2 Test Characteristics of All Three Final Models
Figure 2

Model comparison of positive and negative likelihood ratios and posterior probabilities for (a) the clinical and demographic predictive model, (b) the fMRI predictive model, and (c) the combined model. Brackets indicate 95% confidence intervals. Upper lines indicate positive test result (ie, predicted responder) and lower lines indicate negative test result (ie, predicted non-responder).


Further Investigation of the fMRI Model

The precise contribution of each variable to the outcome prediction is complex, due to the high-order interactions critical to the success of random forests. However, main effects can be investigated straightforwardly. As shown in Figure 3, responders had greater activation than non-responders in all 10 ROIs comprising the final fMRI model. Baseline anxiety severity and impairment (OASIS) was inversely associated with activation in the left uncus during maintenance, as well as in the right superior temporal gyrus, left precentral gyrus, left superior frontal gyrus, and right substantia nigra during reappraisal (p<0.05). None of the regions were significantly associated with baseline anxiety sensitivity (ASI). Baseline worry severity (PSWQ-A) was inversely associated with activation in the right hippocampus and left uncus during the maintain condition (p<0.05). Finally, activation in the left transverse temporal gyrus and right superior temporal gyrus during reappraisal was associated with baseline depressive symptoms (QIDS; p<0.05). However, the only associations significant with Bonferroni correction were with the left uncus during maintenance, such that greater activation was associated with less overall anxiety and worry severity.

Figure 3

Average activation in responders and non-responders in regions selected for the final fMRI model. Error bars=SEM. Y axis is the percentage of signal change. Hippocampus and uncus activations were from the maintain condition; all other regions were from the reappraise condition. HIPP, hippocampus; INS, anterior insula; L, left; PRCEN, precentral gyrus; R, right; SFG, superior frontal gyrus; SN, substantia nigra; SPMG, supramarginal gyrus; STG, superior temporal gyrus; TTG, transverse temporal gyrus.


DISCUSSION

To our knowledge, this is the first study to use random forest models and functional neuroimaging data to address a fundamental problem in psychiatry: predicting who will respond to a treatment. The results provide proof-of-principle that functional neuroimaging data can be used to generate predictions with good test characteristics. The likelihood ratios indicate that relative to treatment response odds, predicted responders based on the fMRI model are almost three times more likely and predicted non-responders five times less likely to respond to treatment. Although none of the model comparisons were statistically significant, the fMRI model demonstrated numerically higher classification accuracy than the model based on clinical and demographic variables and even had slightly greater accuracy than the combined model that used both types of data. Furthermore, only the fMRI model generated predictions that were significantly better than chance (ie, the model based on the response rate alone). Therefore, functional neuroimaging together with sophisticated classification procedures may ultimately make it possible to predict who will respond to what type of treatment and inform clinical decision-making.

We also identified potential predictors of CBT outcome in individuals with GAD and PD. Although these results are exploratory and require replication in an independent sample, they expand on previous work in anxiety disorders (eg, Nitschke et al, 2009; Whalen et al, 2008) by including more than one disorder group, by investigating psychotherapy rather than medication, and by not constricting analyses to specific ROIs.

Consistent with our expectations, greater pre-treatment activation in cortico-limbic circuitry (ie, superior frontal gyrus, anterior insula) during emotion regulation was associated with CBT response. The superior frontal gyrus, particularly the dorsolateral PFC, has been implicated in top-down control and emotion regulation in healthy adults (Ochsner and Gross, 2005). However, individuals with anxiety disorders, including a large subsample of these GAD and PD patients (Ball et al, 2013), have demonstrated reduced engagement of this region during emotion regulation (Goldin et al, 2009a; Goldin et al, 2009b). The anterior insula is critical for interoception (Craig, 2009) and integration of emotional information (Lamm and Singer, 2010). Recent theoretical models have highlighted the role of the insula in maintaining problematic anxiety (Paulus and Stein, 2010). The identification of these regions as predictive by the random forest model provides face validity given their role in emotion regulation and anxiety pathology.

The uncus was also identified as a region contributing significantly to prediction accuracy. Furthermore, uncus activation was significantly correlated with worry and overall anxiety severity. Although the uncus has not been highlighted in previous studies of anxiety disorders, it has previously been associated with anxiety levels in response to emotional stimuli (Ewbank et al, 2009; Klumpp et al, 2011). Given the dearth of previous research examining neuroimaging predictors of therapy outcome in anxiety, the identification by the random forest model of non-hypothesized regions, including the uncus, underscores the benefit of an approach that is not constrained to known ROIs.

Perhaps the most studied region in anxiety is the amygdala, which has been robustly linked to the acquisition, experience, and expression of fear (Davis, 1992), and is hyperactive in anxiety disorders (Etkin and Wager, 2007). However, previous studies have not found amygdala activation to consistently predict anxiety treatment outcome (Nitschke et al, 2009; Whalen et al, 2008). The amygdala was not identified as a region contributing to treatment response prediction in the present study. It is possible that, despite the amygdala’s important role in anxiety pathology, its variability is less important for predicting who will respond to a particular treatment and who will not.

Limitations

One limitation of the present findings is the relatively small sample. Through the use of bootstrapping, random forest models generate reliable predictions even with small sample sizes and are not prone to over-fitting. However, though larger than previous investigations (eg, Nitschke et al, 2009; Whalen et al, 2008), the current sample may still be too small to statistically compare models with traditional hypothesis testing, and therefore strict conclusions about model comparisons are limited.

Another potential limitation is the combination of the two diagnostic groups. Whether mechanisms predicting treatment outcomes are consistent within anxiety disorders broadly or specific to each diagnosis remains an open question. Our sample included anxious adults with either clinically predominant GAD or PD. The rationale for this was partially theoretical, based on the trans-diagnostic approach of both the treatment (Craske et al, 2009) and the task used during brain imaging (Ball et al, 2013), and partially practical, to increase statistical power. We also report results from random forest models built using GAD or PD participants separately (Supplementary Material). These models also show strong test characteristics.

The present study examined response to a CBT protocol with established efficacy (Craske et al, 2009) and therefore did not include a waitlist or control group. The findings therefore may reflect a nonspecific tendency to improve with any intervention. Future research should utilize random forests in randomized clinical trials in order to establish whether the findings are specific to improvement with CBT. Examining differential treatment prediction will be especially important to assist clinicians in choosing among evidence-based treatments.

Finally, it is important to note that the regions identified in the present sample are exploratory and require replication. Although the use of bootstrapping in random forest is a significant strength, it does not remove the necessity for confirmatory testing of the identified regions in independent samples.

CONCLUSIONS

In conclusion, random forest models built with fMRI can provide single-subject predictions with good test characteristics. These preliminary data suggest that fMRI may have a role in predicting treatment outcomes. Future research should build on these findings by testing them in a larger sample and by continuing to use random forest methodology to develop predictive models with clinical relevance.

FUNDING AND DISCLOSURE

Supported by the NIMH Grants MH65413 and MH64122 to MBS. MBS reports that he is paid for his editorial work at UpToDate, Inc. and Depression and Anxiety (Wiley), as well as for consulting work at Care Management Technologies. All the other authors declare no conflicts of interest.