Introduction

Deep brain stimulation (DBS) is an emerging approach to treatment-resistant mental disorders [1,2,3], but response rates in formal clinical trials are mixed [1, 4,5,6,7]. More reliable outcomes might be achieved by improving anatomic targeting. As psychiatric disorders are increasingly understood as network disorders [8, 9], psychiatric DBS is moving away from using a single nucleus/structure as the target and towards attempts at affecting networks [1, 10,11,12]. There is particular enthusiasm for identifying target networks through diffusion tractography, which may enable DBS electrode placement to be customized to individual patients’ anatomy. Although there is controversy over how accurately tractography reconstructs white matter anatomy [13, 14], remarkable early results have been reported from DBS placement based on that imaging [10]. Further, there are multiple tools available to model the interaction of DBS electric fields and targeted tracts [15,16,17]. These tools could replace trial-and-error DBS programming with a mathematically optimal approach to activating desired pathways while minimizing off-target effects [18]. That could overcome the difficulty of correctly programming stimulation, a likely driver of inconsistent clinical outcomes [1, 4, 19].

To realize that promise, we need to know which tracts should/should not be stimulated. For DBS of the subcallosal white matter for depression, multiple groups have settled on a specific white matter confluence and are studying it prospectively (with varying clinical outcomes [10, 20]). For obsessive-compulsive disorder (OCD), a consensus may also be emerging. A theory linking OCD to dysfunction in cortico-striato-thalamic connectivity [21, 22] has led to a focus on white matter tracts linking prefrontal cortex (PFC) to striatum, basal ganglia, and thalamus. Retrospective studies from multiple institutions have implicated tracts to/from dorsolateral PFC [23, 24], ventrolateral PFC [12, 25, 26], and anterior cingulate [12, 24] as potentially important in response. Recent analyses of patients implanted at two different targets correlated OCD response with a tract linking the ventral internal capsule/striatum (VCVS) and the subthalamic nucleus (STN) with the medial PFC [12, 26,27,28,29]. One study further suggested that capture of tracts from orbitofrontal cortex (OFC) [23] led to non-response, although a qualitative synthesis [30] suggests that effective DBS tends to activate OFC-related fibers, and OFC-directed circuits can drive compulsive behaviors in animal models [31,32,33].

Although promising, these prior tractographic analyses are also limited. Many used standard atlases or connectomes derived from healthy controls, comparing these maps against electric fields from patient-specific DBS placements [12, 23, 27, 28]. Individual patients, however, show dramatic variation in their white matter topography compared to atlas standards [34]. Targeting maps computed using “normative” connectomes differ from those computed from patient-specific DTI images [24]. Other studies used simple isotropic field models [25], or distance between electrodes and a target tract [35] which may not accurately capture the DBS-induced electric field [16, 36].

Most importantly, these analyses focused on tracts that correlate with clinical response. A variable may correlate strongly with an outcome but not be able to reliably predict that outcome, e.g. if the means are separate but the tails of two distributions overlap [37,38,39]. Best practices in biomarker research suggest explicitly building predictive models, testing those models on held-out data, and reporting predictive performance in addition to correlation [37, 38, 40, 41]. Prediction-oriented analyses might better answer the question of whether a tractographic finding can be used as a programming target, i.e. whether it has strong predictive accuracy at the single-patient level [42].

Here, we address these limitations through an explicit attempt to predict single-patient response to DBS for OCD at the VCVS target, based on more precise field modeling approaches and using patient-specific tractography. We replicate in part prior studies’ findings that cingulate, medial PFC, and lateral PFC tracts are correlated with clinical response, but we show that these correlations do not provide strong clinical predictive power, and in some cases, we identify correlations that contradict earlier reports.

Methods

Study population and clinical treatment

Participants were 6 patients who enrolled in a clinical trial (NCT00640133) of VCVS DBS for OCD [43], plus 2 who received VCVS DBS for OCD under a Humanitarian Device Exemption. All patients had sufficiently severe OCD at baseline to qualify for DBS (Table S1). All patients received Medtronic model 3387 DBS leads, with the most ventral contact targeted to the ventral striatal grey matter. The Institutional Review Boards of Massachusetts General Hospital and Butler Hospital approved the protocols and provided ethical oversight. All participants gave informed consent, explicitly including separate consent for DBS and for neuroimaging. We report here all patients who agreed to undergo imaging. We analyzed both the Yale-Brown Obsessive-Compulsive Scale (YBOCS) and Montgomery-Asberg Depression Rating Scale (MADRS), collected at visits ~2–4 weeks apart by a trained rater. We did not limit our analysis to YBOCS and MADRS collected at specific timepoints, but used all available datapoints for which we also had recorded DBS settings.

Imaging and patient-specific tractography

Pre-operative MRI data were acquired on a 3T Siemens TimTrio scanner. Diffusion MRI (dMRI) scans had a spatial resolution of 2 mm (isotropic) with 10 non-diffusion weighted volumes and 60 diffusion weighted volumes, with gradient directions spread uniformly on the sphere with a b-value of 700 s/mm². dMRI data were registered to pre-operative T1- and T2-weighted MRI images and post-operative CT scans using a published pipeline [44] available at https://github.com/pnlbwh/. We then performed whole-brain tractography from the dMRI data, using a multi-tensor unscented Kalman filter (UKF) [45, 46]. The UKF fits a mixture model of two tensors to the dMRI data, providing a highly sensitive fiber tracking ability in the presence of crossing fibers [47,48,49,50]. The UKF method guides each fiber’s current tracking estimate by the previous one. This recursive estimation helps stabilize model fitting, making tracking more robust to imaging artifact/noise. Another benefit of UKF is that fiber tracking orientation is controlled by a probabilistic prior about the rate of change of fiber orientation, producing more accurate tracking than the hard limits on curvature used in typical tractography algorithms. We combined the UKF with a fiber clustering algorithm to create an anatomically curated and annotated white matter atlas [49]. The clustering method groups the streamlines from each patient using a spectral embedding algorithm. Each fiber cluster is matched to a tract from an a priori labeled atlas of the white matter derived from known connections in monkey and human brains. Fiber clustering was performed only on streamlines longer than 40 mm to annotate medium and long range tracts.

Tract activation modeling

For each clinical DBS setting used in each patient, we calculated the volume of tissue activated (VTA) using a modified version of StimVision [15]. This comprised 150 parameter sets, measured over 2–5 years for each patient. During this time, clinicians were actively programming devices and altering settings, leading to substantial fluctuations in VTAs and tract engagement. Briefly, the VTAs were calculated using artificial neural network predictor functions, which were based on the response of multi-compartment cable models of axons coupled to finite element models of the DBS electric field [51]. The VTAs used in this study were designed to estimate the spatial extent of activation for large diameter (5.7 µm) myelinated axons near the DBS electrode [52].

Based on theories that VCVS DBS acts by modulating circuits that run primarily in the internal capsule [14, 22, 30], we estimated activation of pathways linking thalamus with anterior cingulate and pericingulate cortex (ACC-PAC), dorsolateral PFC (dlPFC), ventrolateral PFC (vlPFC), dorsomedial PFC (DMPFC), medial orbitofrontal cortex (MOFC) and lateral OFC (LOFC). Pericingulate cortex includes rostral pre-cingulate cortex, but not the dorsal prefrontal cortex (such as the supplementary motor area). The atlas-guided fiber clustering algorithm [49] and a fiber clustering pipeline [53, 54] guided manual delineation of fiber bundles connecting these regions to thalamus. All pathway labelings were performed by an expert neuroanatomist (Dr. Makris). Examples of the traced bundles and their intersections with DBS VTAs are shown in Fig. 1A. Recent reports found that a tract connecting subthalamic nucleus (STN) to medial prefrontal cortex was strongly associated with clinical response to DBS in OCD [12, 26,27,28]. Therefore, we manually segmented the STN in each subject and extracted all fiber tracts connecting the STN with the prefrontal cortex (Fig. 1B). We verified that as the total charge delivered increased (leading to a larger VTA), the total number of activated fibers also increased (Fig. S1).

Fig. 1: Patient-specific tractographic mapping of OCD DBS response.

A Tract tracing and activation modeling examples. Shown are left/right oblique and axial views from one non-responder and one responder, with cortico-thalamic and cortico-STN tracts indicated by different colors. DBS leads are shown in teal and VTAs in red. In this panel, we show only tracts intersecting the VTAs for clarity. B Tracing of tracts between STN and frontal cortex, in the same responder as (A). To ensure capture of the tract reported in ref. [12], we broadly traced all streamlines originating in a seed around STN and extending anterior to the central sulcus. This includes fibers coursing dorsally to motor regions, and tracts as in ref. [12] connecting STN to ACC and medial PFC. Very few of these intersect the VTA in this patient, despite the good clinical response (YBOCS drop of 61% from baseline). To emphasize that point, this panel shows all fibers traced from the STN seed in this patient, regardless of VTA intersection.

Data analysis—independent/predictor variables

It is unclear whether the important “dose” of DBS is activation of a sufficient number of fibers (“total fiber” model), vs. the degree to which a sub-circuit is influenced (i.e., the fraction of the overall streamlines in a tract that are within the VTA, or a “percentage” model). We calculated both and fit them as two separate models for each dependent clinical outcome (see below). We also considered the possibility that DBS response is not determined by any individual tract/pathway, but instead requires capture of multiple pathways simultaneously. We, therefore, added a “total activation” variable to each prediction model. For total fiber models, this variable represented the total number of streamlines activated for all tracts. For percentage models, it represented the mean percentage activation across all reconstructed tracts. We standardized all input variables to the 0–1 interval to ensure that regression coefficients were comparable between independent variables.
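As a minimal sketch of the two ways of expressing activation described above, the following Python computes a total fiber and a percentage feature set for one DBS setting, adds the corresponding "total activation" variable, and rescales a column to the 0–1 interval. The tract names, streamline counts, and function names are hypothetical and are not taken from the study's code.

```python
# Hypothetical streamline counts for a single DBS setting (illustration only).
activated = {"ACC-PAC": 40, "dlPFC": 10, "vlPFC": 0, "MOFC": 25}   # streamlines inside the VTA
traced = {"ACC-PAC": 200, "dlPFC": 100, "vlPFC": 50, "MOFC": 100}  # streamlines reconstructed per tract


def total_fiber_features(activated):
    """Per-tract activated counts plus their sum as the 'total activation' variable."""
    features = dict(activated)
    features["total_activation"] = sum(activated.values())
    return features


def percentage_features(activated, traced):
    """Per-tract percentage of streamlines inside the VTA plus the mean percentage."""
    pct = {tract: 100.0 * activated[tract] / traced[tract] for tract in activated}
    pct["total_activation"] = sum(pct.values()) / len(pct)
    return pct


def min_max_scale(column):
    """Rescale one predictor (a list of values across visits) to the 0-1 interval."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


fiber = total_fiber_features(activated)
pct = percentage_features(activated, traced)
scaled = min_max_scale([10, 20, 30])
```

The same tract can look very different under the two encodings: a sparsely reconstructed tract may have few activated streamlines but a high percentage activation.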

All models were fit and evaluated using scikit-learn (0.24.1) in Python (3.8.5). With the exception of a necessary condition analysis described below, variables were coded at the single-visit level. That is, we predicted the clinical outcome at visit T from the DBS settings programmed at visit T-1.
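The visit-level coding can be sketched as follows: within each patient, the settings recorded at visit T-1 are paired with the outcome recorded at visit T, and pairs never cross patients. The record layout, field names, and values are hypothetical; in practice the "settings" entry would hold the tract activation features for that programming.

```python
# Hypothetical visit records, ordered by date within each patient.
visits_by_patient = {
    "P1": [
        {"settings": "A", "ybocs_improvement": 10.0},
        {"settings": "B", "ybocs_improvement": 25.0},
        {"settings": "C", "ybocs_improvement": 40.0},
    ],
    "P2": [
        {"settings": "D", "ybocs_improvement": 5.0},
        {"settings": "E", "ybocs_improvement": 15.0},
    ],
}


def lagged_pairs(visits_by_patient, outcome_key="ybocs_improvement"):
    """Pair the settings programmed at visit T-1 with the outcome at visit T."""
    pairs = []
    for patient, visits in visits_by_patient.items():
        # zip(visits, visits[1:]) walks consecutive visits of the same patient only
        for previous, current in zip(visits, visits[1:]):
            pairs.append((patient, previous["settings"], current[outcome_key]))
    return pairs


pairs = lagged_pairs(visits_by_patient)
```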

Data analysis—OCD response

White matter pathway activation might relate tightly to the degree of clinical improvement (YBOCS as a continuous variable) or to patients’ overall well-being (dichotomous responder/non-responder analysis). We thus modeled each separately. We analyzed continuous YBOCS as percentage decrease from baseline. Distribution fitting via the ‘fitdist’ package verified that YBOCS values were most compatible with a gamma distribution. We therefore predicted YBOCS improvement via an L1-regularized generalized linear regression (gamma distribution with identity link, Python package “pyglmnet”) and via a random forest regression with 100 trees. The dependent variable was percentage improvement in YBOCS. We compared these two approaches to assess whether conclusions might be sensitive to the model formulation. Regularized regression emphasizes selection of a small number of highly leveraged variables, which may be more helpful in defining clinical decision rules. Random forests can outperform generalized linear regression in at least some cases [55], particularly where there are nonlinearities better captured by thresholding.

We further analyzed categorical (non)response, defined as a 35% or greater YBOCS decrease from baseline [43]. For these, we compared an L1-regularized logistic regression and a random forest classifier with 100 trees. A minority of visits represented clinical response (29 visits out of 165, although 5 of 8 patients were in clinical response during at least one visit). To compensate for this imbalance, we applied the Synthetic Minority Oversampling Technique (SMOTE, [56]) with 3 nearest-neighbor examples. We chose L1 regularization for both regressions because dominant models of OCD argue that dysfunction in specific cortico-striatal loops leads to symptoms [21, 22] and/or that a relatively small number of fiber bundles can explain response [12, 26,27,28]. This should be reflected in clinical response being driven by a small subset of tracts.
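To make the response definition and the oversampling step concrete, here is a minimal pure-Python sketch. It labels a visit as a response when the YBOCS decrease from baseline is at least 35%, and it generates synthetic minority-class examples by interpolating between a minority point and one of its 3 nearest minority neighbors, which is the core idea of SMOTE. This is a simplified stand-in, not the implementation used in the study; the feature vectors and function names are hypothetical.

```python
import random


def is_responder(baseline_ybocs, current_ybocs, threshold_pct=35.0):
    """True if the percentage decrease from baseline meets the response threshold."""
    decrease_pct = 100.0 * (baseline_ybocs - current_ybocs) / baseline_ybocs
    return decrease_pct >= threshold_pct


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from minority-class feature vectors (tuples)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, by squared Euclidean distance
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical scaled tract-activation features for responder visits.
responder_visits = [(0.2, 0.8), (0.3, 0.7), (0.25, 0.9), (0.4, 0.6)]
new_points = smote_sketch(responder_visits, n_new=6)
```

Because synthetic points are built from real minority examples, oversampling has to happen only inside the training data; otherwise information from held-out visits would leak into training.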

Data analysis—depression response

VCVS may have more effects on mood than on compulsivity [57], which would be reflected in better prediction of mood (MADRS) than of YBOCS. We applied the modeling pipeline used for categorical YBOCS response to categorical MADRS response, defined as a 50% or greater MADRS decrease from pre-surgical baseline. 7 out of the 165 visits met MADRS response criteria, although this again represented 5 of 8 patients.

We further assessed tractographic models’ prediction of hypomania, a known and voltage-dependent complication of VCVS DBS [58, 59]; details are in the Supplement.

Data analysis—model evaluation

All categorical data sets were unbalanced, and the outcome of clinical interest was always the minority class. We, therefore, report balanced accuracy and recall (performance for the minority class) for the categorical dependent variables. Further, we report the area under the receiver operating characteristic curve (AUC), which is suggested to be the best summary of a categorical biomarker’s performance [37, 40]. For continuous YBOCS prediction, we report the fraction of variance explained and the coefficient of determination (R²). We emphasize that R² here is not the square of a correlation coefficient [37]; it can be negative when predictions on held-out data are worse than simply predicting the mean.
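The metrics named above can be written out in a few lines. The sketch below is a plain-Python illustration with made-up labels and scores, not the scikit-learn code used in the study. It shows why plain accuracy is misleading for unbalanced classes and why the coefficient of determination differs from a squared correlation: predictions that are perfectly anti-correlated with the outcome have a squared correlation of 1 but a strongly negative R².

```python
def recall(y_true, y_pred, positive=1):
    """Fraction of true `positive` cases that were predicted as `positive`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall for each of the two classes."""
    return (recall(y_true, y_pred, 1) + recall(y_true, y_pred, 0)) / 2


def auc(y_true, scores):
    """Probability that a random positive case is scored above a random negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# A classifier that always predicts "non-response" on an unbalanced set:
labels = [0, 0, 0, 0, 1, 1]
always_negative = [0, 0, 0, 0, 0, 0]
bal_acc = balanced_accuracy(labels, always_negative)  # chance level despite 67% raw accuracy

# Perfectly anti-correlated predictions: squared correlation is 1, R² is negative.
r2_anticorrelated = r_squared([10, 20, 30, 40], [40, 30, 20, 10])
```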

All metrics were calculated on a held-out test set [37, 38, 40, 41]. For each model, we held out 2 random patients from the dataset (effectively 4-fold cross-validation with resampling). This improves over leave-one-out approaches, which can overstate predictive performance [60]. We left out 25% of patients, rather than visits, because data were highly autocorrelated visit-to-visit, which also falsely inflates performance [37]. We then fit the predictive model on the remaining 6 patients, and we report the performance on the visit-level data from the held out patients. To prevent data leakage, the SMOTE upsampling was performed on the training set only, after the split. We obtained confidence intervals for all metrics by repeating this process over all 28 possible leave-two-out combinations, then calculating the range of performance falling within 2 standard deviations of the median performance.
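The patient-level hold-out scheme can be sketched with the standard library. The code enumerates every way of holding out 2 of 8 patients, which gives the 28 train-test splits mentioned above, and shows one possible reading of the interval calculation as the median plus or minus 2 standard deviations across splits. Patient identifiers and function names are hypothetical.

```python
from itertools import combinations
from statistics import median, stdev

patients = [f"P{i}" for i in range(1, 9)]  # hypothetical patient identifiers


def leave_two_out_splits(patient_ids):
    """Yield (train_patients, held_out_patients) for every pair of held-out patients."""
    for held_out in combinations(patient_ids, 2):
        train = [p for p in patient_ids if p not in held_out]
        yield train, list(held_out)


def interval_around_median(scores, n_sd=2):
    """Bounds at n_sd standard deviations on either side of the median score."""
    center = median(scores)
    spread = stdev(scores)
    return center - n_sd * spread, center + n_sd * spread


splits = list(leave_two_out_splits(patients))
```

Splitting by patient rather than by visit keeps all visits from a held-out patient out of training, so visit-to-visit autocorrelation cannot inflate the test metrics. Any resampling such as SMOTE would then be applied to the visits of the training patients only.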

We fit 16 models (4 outcomes × 2 types of model × 2 ways of expressing activation), cross-validating within each model. We interpreted the outcomes using an uncorrected 95% confidence interval to maximize power.

Data analysis—predictor importance

To detect potentially relevant tracts, we performed importance scoring on all models, regardless of whether they correctly predicted the clinical outcomes. For regression models, we computed the median and standard deviation of the regression coefficient for each tract, across all the train-test splits. For random forests, we applied permutation importance as implemented in scikit-learn. We permuted each independent variable 5 times for each of the train-test splits.
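Permutation importance can be illustrated without any library: shuffle one predictor at a time, re-score the fitted model, and record how much the score drops. The sketch below repeats each shuffle 5 times, as in the text, but it is a simplified stand-in for the scikit-learn implementation; the toy model, data, and scoring function are hypothetical.

```python
import random


def neg_mse(y_true, y_pred):
    """Negative mean squared error, so that higher values mean better predictions."""
    return -sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance_sketch(predict, X, y, score, n_repeats=5, seed=0):
    """Mean drop in score when each column of X (a list of row lists) is shuffled."""
    rng = random.Random(seed)
    baseline = score(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - score(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy fitted model that uses only the first predictor.
def toy_model(row):
    return 2.0 * row[0]


X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0], [5.0, 9.0], [6.0, 2.0]]
y = [toy_model(row) for row in X]
importances = permutation_importance_sketch(toy_model, X, y, neg_mse)
```

A predictor the model ignores gets an importance of exactly zero, while shuffling a predictor the model relies on lowers the score.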

Data analysis—alternative univariate approach

Recent papers [12, 26,27,28] used a different approach, based on comparison of VTAs to population-scale tractography. As an additional analysis (not pre-planned), we attempted a similar approach on this dataset. We calculated all linear correlations between YBOCS improvement (continuous variable) and the activation of each individual tract (either as a total fiber or percentage activation). These correlations were performed on the training set after holding out 2 random patients, consistent with [12]. To test whether this approach produced more generalizable predictors of DBS response, we used the same data to fit a univariate linear regression for each independent variable, then evaluated the model performance (coefficient of determination, R²) on the 2 held out patients.
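The contrast between an in-sample correlation and held-out predictive performance can be sketched for a single tract. The code computes the Pearson correlation on training visits, fits a univariate least-squares line to the same data, and scores that line on held-out visits. All numbers are invented for illustration and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented data: tract activation (0-1) and YBOCS improvement (%).
train_activation = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
train_improvement = [12, 18, 33, 38, 52, 58]
test_activation = [0.2, 0.5, 0.3, 0.6]   # visits from held-out patients
test_improvement = [40, 35, 45, 30]

train_r = pearson_r(train_activation, train_improvement)
slope, intercept = fit_line(train_activation, train_improvement)
test_predictions = [slope * a + intercept for a in test_activation]
held_out_r2 = r_squared(test_improvement, test_predictions)
```

In this toy case the training correlation is close to 1, yet the held-out R² is negative because the relationship does not carry over to the held-out patients.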

In a further exploratory analysis (see Supplement), we considered whether DBS outcomes depended not on the tracts activated, but the integrity of those tracts.

Data analysis—statistical power

Because our sample size is relatively small compared to other recent studies [12], we assessed the clinical effect size that we could reasonably have hoped to detect. We simulated a dataset with a known tract/outcome correlation r_true ranging from 0 to 1, with 100 replicates at each putative r_true value. We then calculated the fraction of times that our modeling approach identified statistical significance, both for a primary metric (AUC) and our secondary metric (size of the regression parameter for the putative predictive tract).
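The logic of a simulation-based power analysis can be sketched as follows. This is a deliberately simplified stand-in: it tests a plain Pearson correlation with a Fisher z approximation instead of running the full modeling pipeline, treats observations as independent rather than as repeated measures, and uses an arbitrary number of observations. For each assumed correlation it draws 100 replicate datasets and reports the fraction in which the correlation is significant at the two-sided 5% level.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_power(r_true, n_obs=30, n_reps=100, seed=0):
    """Fraction of replicates in which a correlation of size r_true is detected."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(1 - r_true ** 2)
    detections = 0
    for _ in range(n_reps):
        # Draw x and y with population correlation r_true
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        y = [r_true * xi + noise_scale * rng.gauss(0, 1) for xi in x]
        # Clamp so that atanh stays finite when the sample correlation is +/-1
        r_hat = max(-0.999999, min(0.999999, pearson_r(x, y)))
        z = math.atanh(r_hat) * math.sqrt(n_obs - 3)  # Fisher z statistic
        if abs(z) > 1.96:  # two-sided test at alpha = 0.05
            detections += 1
    return detections / n_reps


# n_obs = 30 is an arbitrary choice for illustration.
power_curve = {r: simulate_power(r) for r in (0.0, 0.2, 0.5, 0.8)}
```

Reading the curve against a 90% threshold gives the smallest correlation that a design of this size could be expected to detect.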

Results

Clinical outcomes—YBOCS

The mean YBOCS improvement (considering each patient’s best time point) was 46.6%, and 5 of the 8 patients (62.5%) were clinical responders (≥35% YBOCS drop) for at least one visit.

No tract reliably predicted continuous YBOCS improvement. By all metrics, model performance was worse than chance on the held-out test set (Table 1), for both total-activation and percentage-activation models. Consistent with this, no coefficients in the regression models were above zero (i.e., the dataset mean was more reliable than any tractographic predictor). In the random forest models, the highest importance was percentage activation of fibers connecting thalamus to left OFC, but this was at chance level (change in R² across models: mean 0.09, SD 0.24).

Table 1 Modeling outcomes for YBOCS improvement as a continuous variable.

Similarly, no model exceeded chance for response/nonresponse prediction (Table 2). In the logistic regression, highly weighted features across models were the number (but not percentage) of activated streamlines connecting thalamus to left cingulate, lateral OFC, medial OFC, and vlPFC. Cingulate and lateral OFC streamline activation were positively associated with response, whereas medial OFC and vlPFC activation were negatively associated (Fig. 2). For all of these tracts, the confidence interval for the coefficient estimated across all train-test splits included 0. These findings were sensitive to the modeling approach; the same tracts did not show median importance scores different from 0 in the random forest models. The ACC-PAC findings were corroborated by a Necessary Condition Analysis on white matter integrity (Supplementary Results).

Table 2 Modeling outcomes for YBOCS improvement as a categorical response.
Fig. 2: Non-zero regression coefficients across exhaustive leave-two-out cross-validation of regularized logistic regression to predict YBOCS response.

All confidence intervals include 0, with left medial OFC (non-response) and left ACC (response) coming closest to significance. All reported results are for total fiber capture; percentage capture did not have non-zero coefficients in this analysis. Data are coded such that positive regression coefficients represent clinical improvement.

The alternate mass-univariate approach also did not reliably predict response on the held-out test sets (Table 3). It was concordant with the categorical response analysis in that it identified streamlines connecting the left cingulate to thalamus as correlated with response, and similarly streamlines from bilateral vlPFC as correlated with non-response. There was more discordance than similarity, however. The medial OFC tracts identified by regression were not selected in the mass univariate approach, and conversely, the mass univariate approach predicted nonresponse if tracts projecting to dlPFC were within the VTA. Further, the mass univariate approach emphasized percentage capture, while the logistic regression emphasized total fibers within a VTA. We note that tracts from STN to PFC were negatively correlated with clinical outcomes, whereas prior reports identify them as positively correlated [12, 27, 28].

Table 3 Correlations between individual fiber tracts and YBOCS response, in the style of [12], filtered to tracts whose confidence interval excludes 0 on the training sets.

Clinical outcomes—MADRS

The mean MADRS improvement (considering each patient’s best time point) was 55.69%, and 5 of the 8 (62.5%) were responders (≥50% MADRS drop) at some point. Mood and OCD response were not linked (r = 0.13 for correlation between response status on YBOCS and MADRS). Consistent with other reports [57], there were more observations of MADRS response without YBOCS than of YBOCS response without MADRS (22 vs. 4).

No model reliably predicted MADRS response above chance (Table S1). For comparison with the YBOCS analysis, we further examined the non-zero coefficients of the total-fiber regression. Capture of streamlines between right cingulate and thalamus was correlated with MADRS response, and the confidence interval for this coefficient excluded zero (Fig. S1). This was not true of any other tract. Left vlPFC was associated with non-response (as it was in the categorical YBOCS analysis), but the distribution of coefficients across analyses included zero. Random forest importance scores were centered around zero.

Power analysis and detection bounds

Even with its relatively small sample size, the repeated-measures design of our analysis granted 90% power for detection of a tract-to-outcome correlation as low as 0.5 (Fig. 3), which is smaller than the reported correlations in the largest available normative dataset [12]. Critically, there was a large gap between sensitivity for clinical prediction (based on AUC) and sensitivity for the correlation itself (based on the regression coefficient). For the latter, we retained 90% power for detection of a correlation as low as 0.2.

Fig. 3: Power analysis.

The curves show the probability of reporting a significant result, given an assumed level of correlation between a single tract and YBOCS outcome. The red line represents 90% power.

Discussion

Our results are both concordant and discordant with prior efforts to predict clinical OCD DBS response from tractographic modeling of cortico-striatal and cortico-basal circuits. Critically, we implemented multiple analytic steps beyond prior studies: individualized, patient-specific tracts registered to individual lead placements, activation volume calculation beyond simple electric field assumptions, consideration of multiple clinical timepoints for each patient, and formal evaluation of predictive power (as compared to measurement of correlations between activation and response or group mean differences). With this more guideline-adherent approach, we found that no tract could reliably predict clinical response or complications, whether those were considered in a continuous or categorical approach. This is likely not a surprise—we and others have highlighted that group-level significant correlations/separations often do not have clinical predictive power [37,38,39,40]. In this sense, our results support calls for caution regarding the clinical role of tractography [16, 42]. We also showed that outcomes can be sensitive to the analytic approach—our random forest and regularized regression approaches produced very different results, even though both are commonly used approaches to prediction and variable selection.

Model inspection may offer some insight into variables for further investigation, even if pathway activation modeling approaches are not yet able to strongly predict response. Numerically, predictive power was greater (more non-zero regression coefficients after regularization) when predicting categorical rather than continuous outcomes. This may be because categorical outcomes effectively smooth out small fluctuations in continuous rating scales, fluctuations that may be primarily due to inter-rater variability or disease-unrelated variables rather than to DBS settings. The YBOCS in particular shows non-linear behavior at high scores that may exacerbate this [61]. We obtained non-zero regression coefficients for models using activated fiber counts, but not for percentage-activated models, implying that it is more important to get at least a portion of a key tract within the VTA. These results also make sense in the context of our finding that the integrity (traceability) of these tracts varies greatly between patients with OCD—a tract where response depends on tract integrity will have a large coefficient in a total-fibers model, but not in a percentage-activation model.

Our results in part support and in part diverge from a series of recent papers implicating pathways between PFC and basal ganglia as critical for OCD DBS [12, 26,27,28]. PAC to thalamus tracts were implicated in both YBOCS and MADRS response, and were the most positively weighted in our mass-univariate approach. Our white matter integrity analysis identified the same tracts as having the largest effect size (necessity). Also similar to that prior work, we found that activation of connections to medial OFC produced numerically worse outcomes. Inconsistent with the prior work [12, 26,27,28], we found negative correlations (in the mass univariate analysis) or null effects (in the predictive models) specifically for tracts connecting PFC to STN or vlPFC to thalamus. This again may reflect the importance of patient-specific imaging. Given that we have previously shown these tracts to have substantial inter-individual variability in their position within the internal capsule [34], and that here we note them to have similar variability in their overall integrity, a normative connectomic analysis may not reflect the actual fibers being successfully modulated in DBS cases. Alternatively, our results may highlight programming and surgical differences. These patients were implanted and programmed following the approach in [62], which emphasizes an initial search for a positive affective response. Other centers have reported very different programming algorithms [63], based more on standard anatomic positions. If response correlates with, e.g., the quality of concomitant therapy [26, 64] or general clinical expertise [65], those factors will likely be strongly correlated with the programming clinician, and thus will spuriously load onto the tracts and implant locations that clinician happens to prefer. Most importantly, our results highlight the importance of applying analyses designed specifically to identify clinical predictors [37]. 
Interestingly, we found that OFC engagement predicted worse OCD clinical response. OFC-originating components of cortico-striato-thalamic circuits are heavily emphasized in theoretical [21, 22, 30] and animal [31, 33, 66] models of OCD, and these findings may contribute to an ongoing debate over those models.

These results are tempered by three limitations. First, our sample size is small, consistent with the rarity of these patients [67]. Second, imaging was not performed on a connectome-optimized scanner. The 3T MRI used in this study has relatively weak gradients that influence our maximum image resolution. Scanning at 7 Tesla (as has now become more common [68]) might identify more tracts. Third, we used relatively simple models of DBS activation. All of these add noise, reducing our ability to detect subtle correlations, particularly given DTI’s susceptibility to false positives [14]. Practically, however, these limitations may not affect the clinical importance of our findings. We mitigated the lower resolution of these scans by use of an algorithm that is specifically designed to perform well in the presence of noise [46] and ensuring that our extracted tracts matched known, anatomically verified fiber bundles [49]. Regarding sample size, small samples tend to inflate effect sizes and bias towards positive conclusions [69], not the negative result we report. Most importantly, for a tractographic result to be sufficiently reliable to inform clinical targeting/programming, it would need to have a large and clear influence on outcomes, with robustness to minor variations in analytic or clinical technique. Such a large effect would be clearly detectable and consistent across studies even at small sample sizes, like the clinical effect of VCVS DBS, which shows consistent 60–70% response rates across many small to medium cohorts [57, 59, 70,71,72]. We verified this by showing that the present dataset would have been sufficient to identify a relatively modest tractography-to-outcome correlation of 0.5. In that context, failure to identify a significant predictor in this sample is relevant to both clinical practice and future study design. 
Our results identify the limits of current methods, and suggest a floor below which a biomarker would be unlikely to provide clinical value.

At the same time, our results support a growing argument that circuits linking ACC to thalamus and basal ganglia are important to VCVS DBS response. They dovetail with other work linking modulation of those circuits to increased cognitive control [73, 74], a construct that is thought to be deficient in OCD [75, 76]. Thus, these results do not imply that tractography and field modeling are not useful for understanding DBS. They establish a gap between our current level of understanding (which can identify mechanistic hypotheses for follow up) and the level needed for clinical practice. With multiple technologies emerging to better verify target engagement and address patient heterogeneity [1, 16], that understanding will likely grow in coming years.