Predicting language recovery in post-stroke aphasia using behavior and functional MRI

Language outcomes after speech and language therapy in post-stroke aphasia are challenging to predict. This study examines behavioral language measures and resting-state fMRI (rsfMRI) as predictors of treatment outcome. Fifty-seven patients with chronic aphasia were recruited and treated for one of three aphasia impairments: anomia, agrammatism, or dysgraphia. Treatment effect was measured by performance on a treatment-specific language measure, assessed before and after three months of language therapy. Each patient also completed 27 additional behavioral language measures and underwent a rsfMRI scan at baseline. Patient scans were decomposed into 20 components by group independent component analysis, and the fractional amplitude of low-frequency fluctuations (fALFF) was calculated for each component time series. Post-treatment performance was modelled with elastic net regression, using pre-treatment performance and either behavioral language measures or fALFF imaging predictors. Analysis showed strong performance for behavioral measures in anomia (R2 = 0.948, n = 28) and for fALFF predictors in agrammatism (R2 = 0.876, n = 11) and dysgraphia (R2 = 0.822, n = 18). Models of language outcomes after treatment trained using rsfMRI features may outperform models trained using behavioral language measures in some patient populations. This suggests that rsfMRI may have prognostic value for aphasia therapy outcomes.


Recruitment and assessment. Patients with chronic aphasia were recruited from the Aphasia Research
Laboratory at Boston University (BU), the Aphasia and Neurolinguistics Laboratory at Northwestern University (NU), and the Cognitive and Brain Sciences Laboratory at Johns Hopkins University (JHU). Patients were independently recruited, diagnosed, and treated for one aphasia impairment at each site: anomia at BU (N = 28), agrammatism at NU (N = 11), and dysgraphia at JHU (N = 18). The diagnosis of aphasia was made using the mean score on the Western Aphasia Battery-Revised (WAB-R)34. All patients presented with aphasia resulting from a single left-hemisphere thromboembolic or hemorrhagic stroke (see Supplementary Fig. S1 for the aggregate lesion map), were at least one year post-stroke, and had no other impairments that affected their ability to complete the behavioral or neural tasks (e.g. vision and hearing were within normal limits). All were monolingual English speakers, had at least a high school education, and provided written informed consent approved by each site's Institutional Review Board (IRB). All patients included in this study were right-handed, as determined by the Edinburgh Handedness Inventory. All experiments and protocols described in the following sections were performed in accordance with the guidelines and regulations put forth by the IRB of each participating institution.
Aphasia is primarily diagnosed and assessed through language assessments, such as the Western Aphasia Battery (WAB)34. However, the WAB lacks sensitivity to lexical-semantic deficits, sentence processing deficits, and spelling deficits, so supplementary language measures must also be administered to capture the full range of aphasic deficits [35][36][37]. While many language measures have been developed to test specific language deficits, relatively few have been assessed for psychometric validity [38][39][40][41]. As a result, there is a lack of consensus on the optimal aphasia assessment battery, and the prognostic utility of existing language measures is largely unknown. In order to be inclusive of potentially prognostic measures, we collected a broad range of 27 behavioral measures from eleven language and cognitive assessments (see Table 1). Each patient was also assessed on one of three treatment-specific measures (TSMs, defined below), which served as the primary metric for evaluating baseline aphasia severity and the response to speech and language therapy (SLT). Patients also underwent a comprehensive multi-modal imaging assessment, including T1 structural MRI, perfusion, diffusion, task-based, and resting-state fMRI. The present study examines only the rsfMRI data.
Speech and language therapy. Following baseline testing, each patient received a three-month course of SLT. The TSM was measured both before and after completion of the treatment protocol to estimate the treatment response. Detailed treatment protocols have been described previously and are summarized here.
Patients with anomia underwent a typicality-based semantic treatment 42. Each patient participated in a computer-based task in which they were presented with pictures from five semantic categories: birds, vegetables, fruit, clothing, and furniture. Patients sorted images into categories, then attempted to name each image. Naming was confirmed using written and auditory verification before a second naming attempt was made. Training occurred weekly, and each patient was assigned two half-categories for review. Assessment of treatment progress was made using a comprehensive naming test of items from three of the five categories (Anomia TSM).
Patients with agrammatism received sentence comprehension and production treatment through a Treatment of Underlying Forms program 43. In this treatment, the patient was shown an action picture depicting a scene. The patient was then given a set of cards with verbs and asked to point to the action verb that describes the scene. The examiner then built an active sentence using the action verb card and an active sentence template, and demonstrated to the patient how the active sentence could be changed into a passive sentence. Next, the patient formed the passive sentence and read it aloud. This treatment occurred twice per week for 90 min per session. Assessment of treatment progress was made using a sentence production test. Patients were shown two action pictures and given a sentence describing one of them. The patient then had to produce a similar sentence for the other picture. The produced sentences were recorded and subsequently assessed for semantic similarity to the prompt sentence (Agrammatism TSM).
Patients with dysgraphia underwent a spell-study-spell treatment protocol 44. Each patient was given a set of 40 training words to learn to spell. Word sets were customized for each patient such that each word had a baseline letter accuracy between 25 and 85%. Treatment consisted of the patient hearing the word, repeating it, and attempting to write it out. This was repeated until the word was spelled correctly, up to a maximum of three attempts per word. The patient was shown the correct spelling of the word regardless of accuracy. Training occurred twice per week for 90 min per session. Assessment of treatment progress was made by measuring the letter accuracy across the entire training set (Dysgraphia TSM).
Image acquisition. MRI scans were acquired using 3.0 T scanners (Siemens Skyra at BU, Siemens Trio/Prisma at NU, and Philips Intera at JHU).
Imaging protocols were harmonized across the sites to provide similar quality and timing. Structural images were collected using a 3D T1-weighted sequence (TR = 2300 ms, TE = 2.91 ms, flip angle = 9°, resolution = 1 mm³ isotropic). Whole-brain functional images were collected using a gradient-echo T2*-weighted sequence (TR = 2 or 2.4 s, TE = 20 ms, flip angle = 90°, resolution = 1.72 × 1.72 × 3 mm, 210 or 175 volumes). Initial studies (the first 5 NU subjects) used a 2 s TR, but additional coverage was required to obtain whole-brain data, so the TR was increased to 2.4 s. While NU and BU subjects had one scan of 210 volumes, JHU subjects received two runs of 175 volumes each, and only the run with the higher temporal signal-to-noise ratio (tSNR) was included for analysis.
Image preprocessing. All images were archived on NUNDA (Northwestern University Neuroimaging Data Archive, https://nunda.northwestern.edu) for storage and data analysis. Upon arrival in the archive, image quality assurance (QA) was performed using automatic pipelines for functional and structural data. For fMRI data, a slice-wise tSNR was calculated as the ratio of the mean signal to the standard deviation of the time course data from each slice, weighted by the number of brain voxels in the slice. Poor-quality scans (tSNR < 100) were repeated or excluded from analysis.
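The slice-wise tSNR QA metric described above can be sketched in NumPy. This is an illustrative reimplementation, not the NUNDA pipeline itself; the array layout and function name are our own assumptions.

```python
import numpy as np

def slicewise_tsnr(fmri, brain_mask):
    """Brain-voxel-weighted average of slice-wise tSNR.

    fmri: 4-D array (x, y, z, time); brain_mask: 3-D boolean array.
    Per slice, tSNR is the mean signal divided by the temporal standard
    deviation, averaged over brain voxels; slices are then combined with
    weights equal to their brain-voxel counts, as described in the text.
    """
    mean_img = fmri.mean(axis=3)
    std_img = fmri.std(axis=3)
    slice_vals, weights = [], []
    for z in range(fmri.shape[2]):
        m = brain_mask[:, :, z]
        if not m.any():
            continue  # skip slices with no brain voxels
        with np.errstate(divide="ignore", invalid="ignore"):
            voxel_tsnr = mean_img[:, :, z][m] / std_img[:, :, z][m]
        voxel_tsnr = voxel_tsnr[np.isfinite(voxel_tsnr)]
        slice_vals.append(voxel_tsnr.mean())
        weights.append(m.sum())
    return float(np.average(slice_vals, weights=weights))

# Per the text, a scan failing slicewise_tsnr(...) < 100 would be
# repeated or excluded.
```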
Image preprocessing was performed using the NUNDA "Robust fMRI preprocessing pipeline", which employs custom scripts built upon functions from AFNI, FSL, and SPM software [45][46][47]. First, the fMRI time series were despiked (AFNI 3dDespike) and coregistered to the mean image (AFNI 3dvolreg). Normalization to standard MNI space was performed in a concatenated two-step procedure. A transformation aligning the first image in the fMRI time series to the T1 was created using boundary-based registration (FSL BBR) 48. This was combined with the nonlinear warp of the T1 to an MNI template of 2 × 2 × 2 mm resolution (SPM Dartel Toolbox) 49. Structural images were corrected using enantiomorphic lesion transplant (SPM Clinical Toolbox) to minimize distortion effects caused by warping lesioned brains 50,51. Using the lesion mask as a reference, right-hemisphere homologous tissue was mirrored into the lesioned space to create a lesion-corrected left hemisphere. The optimal transform, calculated using the lesion-corrected brain, was then applied to the native brain.
rsfMRI analysis. The rsfMRI features which predict aphasia recovery are currently unknown, encouraging a data-driven approach. Group independent component analysis (GICA) is a data-driven method of decomposing signals into underlying spatial and temporal components. Applied to rsfMRI, GICA can identify statistically independent patterns of brain activity across subjects. These patterns may then be compared across subjects, permitting analysis of network activity within and across groups. GICA also offers intrinsic noise filtering, as noise which is statistically independent of the signal separates into its own components. These components can be removed to perform group-level artifact removal, which has been demonstrated to be more reliable than artifact removal at the subject level 52.
Subjects from all sites were processed together to attenuate the impact of single-scanner artifacts on the final components. GICA was performed using the GIFT toolbox for MATLAB 53. Data were decomposed using default parameters (20 components, InfoMax algorithm). Component projections clustered strongly across 100 random parameter initializations, indicating that the chosen parameters were stable in our analysis 54. GICA components were then backprojected, producing 20 spatial components and 20 corresponding time series for each subject. Component maps aggregated across subjects are displayed in Supplementary Fig. S3.
Measurement of low-frequency oscillations is broadly used as a generalized activity metric in fMRI analyses, including studies in both stroke 55,56 and aphasia 57,58 . The fractional amplitude of low-frequency fluctuations (fALFF) is the ratio of power in the 0.01-0.08 Hz band to the total power 59 . The fALFF was calculated for each time series of each component for each subject, then standardized such that the sum of a subject's 20 fALFF values equals one. This permits comparison of relative component power within a subject, and normalizes activity ranges prior to regression analyses. We combined fALFF with GICA as a data reduction technique, whereby a series of functional volumes is summarized by an activity measure of 20 components. This approach helps avoid overfitting by reducing problem dimensionality, and makes regression more feasible with the given sample sizes.
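Concretely, the fALFF calculation and per-subject normalization described above can be sketched as follows. This is a NumPy sketch under our own naming; in the actual analysis the inputs would be the backprojected GICA component time series.

```python
import numpy as np

def falff(ts, tr, band=(0.01, 0.08)):
    """Fractional ALFF: amplitude in the low-frequency band / total amplitude.

    ts: 1-D component time series; tr: repetition time in seconds.
    The 0.01-0.08 Hz band follows the definition cited in the text.
    """
    ts = np.asarray(ts, dtype=float)
    ts = ts - ts.mean()                      # remove the DC component
    amp = np.abs(np.fft.rfft(ts))            # one-sided amplitude spectrum
    freqs = np.fft.rfftfreq(len(ts), d=tr)   # frequencies in Hz
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return float(amp[in_band].sum() / amp.sum())

def normalize_falff(values):
    """Scale a subject's component fALFF values so they sum to one."""
    values = np.asarray(values, dtype=float)
    return values / values.sum()
```

A pure oscillation inside the band yields a fALFF near one; the normalization then makes the 20 values comparable across subjects, as described above.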
Model construction and validation. Prediction of post-treatment primary dependent measures was modeled using elastic net regression. This model was chosen because linear models are relatively robust to lower sample sizes and tend to generalize well when trained with regularization. Elastic net regression utilizes a combination of LASSO (L1-norm) and ridge (L2-norm) regularization penalties, and reduces overfitting by limiting coefficient magnitudes 60. Regularization hyperparameters were determined by leave-one-out cross-validation (LOOCV). To facilitate equitable interpretation of model coefficients, all input variables underwent z-score normalization prior to training. Model training and validation were performed with the caret package in R. Missing data were imputed using the randomForest package in R 61. Imputation hyperparameters were selected to minimize output variance. All subsequent analyses with missing data were repeated using data from 1000 imputations, and the summary statistics across imputations are shown for each.
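The study fit elastic net models with the caret package in R; purely to illustrate the underlying method, here is a minimal coordinate-descent elastic net with a LOOCV loop in Python/NumPy. Hyperparameter search, z-scoring, and imputation are omitted, and all function names are ours, not the authors'.

```python
import numpy as np

def elastic_net(X, y, lam=0.1, alpha=0.5, n_iter=500):
    """Coordinate-descent elastic net (no intercept; assumes X, y standardized).

    Minimizes (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||_2^2),
    combining the LASSO (L1) and ridge (L2) penalties described in the text.
    """
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n  # per-column mean square
    for _ in range(n_iter):
        for j in range(p):
            resid_j = y - X @ b + X[:, j] * b[j]  # partial residual without j
            rho = X[:, j] @ resid_j / n
            # soft-threshold (L1), then shrink (L2)
            b[j] = (np.sign(rho) * max(abs(rho) - lam * alpha, 0.0)
                    / (col_sq[j] + lam * (1.0 - alpha)))
    return b

def loocv_predictions(X, y, lam=0.1, alpha=0.5):
    """Leave-one-out cross-validated predictions, mirroring the validation scheme."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        b = elastic_net(X[keep], y[keep], lam, alpha)
        preds[i] = X[i] @ b
    return preds
```

With a near-zero penalty this reduces to ordinary least squares; increasing `lam` shrinks coefficients toward zero, which is what limits overfitting at these sample sizes.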
Model performance was assessed by comparing the post-treatment TSM predicted through LOOCV against the actual post-treatment TSM. If the model predicted a value outside the known range of a TSM (i.e. over 100% or below 0% accuracy), the prediction was clipped to the test's dynamic range. First, we measured error in model predictions using the median absolute deviation (MAD) between predicted and actual values. The MADs of each model were compared to the MADs from a zero-order model predicting the mean post-treatment TSM, using a paired Wilcoxon test to assess whether the model performed better than chance. This nonparametric approach was taken due to non-normality and heteroscedasticity in the dataset. Second, we measured the percent of variability in post-treatment TSMs explained by each model using the square of the Pearson correlation coefficient (R2). We then tested whether each correlation value was significantly greater than zero using a correlation coefficient hypothesis test.
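The evaluation scheme above (clipping out-of-range predictions, MAD against a zero-order mean model, and R2 as the squared Pearson correlation) might look like this in NumPy. The zero-order error definition here is our reading of the text, and the function name is ours.

```python
import numpy as np

def evaluate_predictions(pred, actual, lo=0.0, hi=100.0):
    """Score LOOCV predictions against observed post-treatment TSMs."""
    pred = np.clip(np.asarray(pred, dtype=float), lo, hi)  # bound to test range
    actual = np.asarray(actual, dtype=float)
    mad_model = np.median(np.abs(pred - actual))
    # Zero-order model: predict the mean outcome for every patient.
    mad_zero = np.median(np.abs(actual - actual.mean()))
    r = np.corrcoef(pred, actual)[0, 1]
    return {"MAD": mad_model, "MAD_zero": mad_zero, "R2": r ** 2}
```

A model beats chance when its MAD is reliably below the zero-order MAD; the paper tested this with a paired Wilcoxon test across patients.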
Ethics approval and consent to participate. All participants provided written informed consent according to Institutional Review Board policies at Boston University, Johns Hopkins University, and Northwestern University. All experiments in this study were performed in accordance with the guidelines and regulations put forth by these Institutional Review Boards.

Results
Participants. Demographics and aphasia severity for the 57 recruited participants are shown in Table 2.
Of the participants who entered the study, one dropped out immediately and one passed away before completing testing. Both of these participants were excluded from all analyses. Of the remaining participants, one dropped out and one suffered a hematoma during therapy. For these participants, the baseline language measures were used and the post-treatment dependent measures were imputed. In addition, two agrammatism participants were scanned with a different sequence, and these subjects were removed from fALFF-based analyses. All baseline TSM data were collected, and 3.3% of the other baseline language measurements were missing due to incomplete testing and/or lack of follow-up. Patient performance on the TSM increased significantly over the course of treatment for all therapy groups: Anomia (p = 2.8E−6), Agrammatism (p = 0.005), Dysgraphia (p = 1.0E−4) (one-sided Wilcoxon Signed Rank Test). Significant differences existed across the deficit groups in age, years of education, and overall aphasia severity. However, the outcome after treatment was modelled independently for each aphasia impairment, limiting the confounding effect of demographic differences between groups.

Behavioral measures.
Correlations between the behavioral measures included in our aphasia assessment battery are displayed in Fig. 1A. Hierarchical clustering was performed using one minus the Kendall's Tau-b correlation as a distance metric between tests (Fig. 1B).
Predicting scores on the post-treatment TSM with behavioral measures. Linear models predicting the post-treatment TSM were constructed for each aphasia impairment: anomia, agrammatism, and dysgraphia (Fig. 2). In Fig. 2, the dashed line represents a perfect prediction (predicted score = actual score), and black circles show the median predicted score for each patient across all 1000 imputations during LOOCV. Model coefficients are listed in Table 3, which reports the median coefficient for each behavioral model along with the minimum and maximum coefficients across 1000 data imputations.
Modeling with GICA fALFF. We next built prognostic models using fALFF values of independent components and baseline aphasia severity as measured by the pre-treatment TSM (Fig. 3). Model coefficients are listed in Table 4. The largest coefficients in the anomia model were due to the pre-treatment TSM (0.272) and the fALFF of Component 18 (0.132). In the agrammatism model, the coefficients for the fALFF of Component 5 (0.062), the fALFF of Component 1 (0.057), and the pre-treatment TSM (0.037) were largest. Coefficients in the dysgraphia model were more evenly distributed overall, with only the fALFF of Component 19 (0.080) standing out.
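The distance metric used for the clustering in Fig. 1B (one minus Kendall's Tau-b) can be computed directly. Below is a simple O(n²) NumPy sketch; the study presumably used standard statistical software, and the function names here are ours.

```python
import numpy as np

def kendall_tau_b(x, y):
    """Kendall's tau-b with tie correction (O(n^2); fine for test batteries)."""
    n = len(x)
    conc = disc = tie_x = tie_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both denominators
            elif dx == 0:
                tie_x += 1
            elif dy == 0:
                tie_y += 1
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    denom = np.sqrt((conc + disc + tie_x) * (conc + disc + tie_y))
    return (conc - disc) / denom if denom else 0.0

def tau_distance_matrix(scores):
    """Pairwise 1 - tau-b distances between columns (tests) of a score matrix."""
    p = scores.shape[1]
    D = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            D[i, j] = D[j, i] = 1.0 - kendall_tau_b(scores[:, i], scores[:, j])
    return D
```

The resulting distance matrix (0 for perfectly concordant tests, 2 for perfectly anticorrelated ones) is what a hierarchical clustering routine would consume.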

Discussion
Behavioral assessment multicollinearity. We have found strong multicollinearity across behavioral measures (Fig. 1), which has been observed previously 62. All behavioral measures show a consistent weak correlation with nearly all other measures (median pairwise correlation of 0.36), suggesting substantial overlap in the information collected across assessments. In addition, measures from within the same assessment show even higher association; for example, in the PALPA 40, the correlation in spelling scores between high-frequency and low-frequency words is 0.88. This matters because comprehensive aphasia testing is laborious for both patients and practitioners, and there is a need for shorter-form aphasia probes that can deliver adequate assessments 63. The Quick Aphasia Battery is a potential solution which can be rapidly administered and correlates highly with corresponding WAB measures 41. Our analysis shows that there is high correlation between WAB measures, suggesting that further efficiency gains in clinical aphasia assessment may be realized by prioritizing orthogonal measures.
While multicollinearity is strongest between language assessments, there is also some correlation with the Digit Span, an assessment of verbal short-term memory. Relatively few language assessments correlate with the Doors & People assessment, which is also a verbal memory assessment (median pairwise correlation of 0.13). However, Doors & People correlates weakly with the WAB Information Content measure, which has one of the highest median correlations with all other assessments (0.43). It is therefore possible that some unique information is shared between the D&P and WAB-IC measures that is not shared between the WAB-IC and most other language assessments. These observations support the current understanding that post-stroke aphasia is a multidimensional disorder that may present with a range of language and cognitive impairments as a result of damage to multiple brain areas 19,64.
Model performance. Prognostic models using initial severity and behavioral measures performed best in the anomia group. The behavioral model for anomia relied most on the pre-treatment TSM, PALPA 35 EXC, and PALPA 40 HF measures, suggesting that pre-treatment reading and spelling ability were approximately as important as initial anomia severity in prognosis. This is not entirely surprising, as patients who perform better on nonspecific language assessments may have more intact language networks prior to therapy. Patients with better baseline language function will tend to score higher than those with more impaired baseline language function, even after therapy. This may help explain why the majority of coefficients across the behavioral models were positive. However, higher initial performance on some language metrics indicated poorer outcome (e.g. SCT-N and SPPT-N in the anomia model). There is a pattern in the anomia model where paired measures have opposite coefficient signs (e.g.
SCT-N and SCT-C), indicating that anticorrelated performance on measures which are typically correlated may carry prognostic information. However, it is challenging to draw strong conclusions about individual predictor variables from model coefficients, due to the interaction between model regularization and multicollinearity. Ridge (L2-norm) regularization incentivizes the model to distribute coefficient magnitude across predictor variables during training. When multicollinearity is present, there is relatively more shared information between predictor variables, so model coefficients tend to be small even when some predictor variables are strong individual predictors.
Prognostic models using initial severity and GICA fALFF activity performed best in the agrammatism and dysgraphia groups. While the anomia fALFF model also demonstrated strong performance, much of this can be attributed to the pre-treatment TSM, further suggesting that behavioral assessments were more valuable than rsfMRI in the anomia group. Because GICA selects components to be statistically independent of one another, fALFF values are not multicollinear, permitting clearer interpretation of model coefficients when regularization is used. The GICA component maps generated in this analysis do not resemble known language networks, but rather show strong weighting in the ventricles and in brain regions known to be sensitive to physiological motion. This reflects our minimal image preprocessing pipeline and the inclusion of all voxels within the brain. When the analysis is repeated using rsfMRI data filtered to remove cardiac-driven physiological motion, the predictive power of the models is lost. We hypothesize that the data-driven patterns captured by our GICA components, and the fALFF-based prognostic models built on them, function as proxy variables for regional cerebral blood flow (rCBF).
Arterial spin labeling studies have shown that cerebral blood flow patterns are aberrant in both the acute and chronic phases of stroke recovery 65,66. Alterations in rCBF have been tied to cognitive functioning in numerous neuropsychiatric conditions, such as Alzheimer's disease, frontotemporal dementia, adolescent ADHD, and schizophrenia [67][68][69][70]. In post-stroke aphasia, one study has shown that low-frequency repetitive transcranial magnetic stimulation (LF-rTMS) drives an increase in local rCBF that is proportional to the degree of language recovery 71. Furthermore, the areas which experienced changes in rCBF extended beyond where LF-rTMS was applied. In our models, the coefficients are overall well distributed.
Table 4. Coefficients for models using baseline severity and component fALFF. The median model coefficient for each model is shown. For the agrammatism model, the range of minimum and maximum coefficients across 1000 data imputations is also shown. The anomia and dysgraphia models had no missing data for this analysis, so no imputations were computed for the corresponding groups.
Limitations. There were several limitations in this study. First, we encountered challenges in model selection due to sample size. While we recognize that the relationship of baseline aphasia severity to treatment outcomes likely has nonlinear components, we were limited to linear models with heavy regularization to overcome the challenge of having more predictor variables than subjects. Furthermore, model validation would have benefited from an independent validation set; however, this would have further reduced the available training data, making LOOCV the most viable approach. Recruitment was limited in part by the extensive imaging and behavioral testing performed for each subject alongside hours of treatment.
Second, while subjects were admitted to the study based on a single aphasia impairment (anomia, agrammatism, or dysgraphia), this design was based on treatment assignment and did not account for the possibility of overlapping impairments or possible nonlinear effects of treatment. Third, while we model responses to three treatment protocols, there are many other protocols for aphasia therapy. We have demonstrated large variability in model performance across protocols, and this variability likely extends to protocols not examined here. Fourth, while we model performance on the TSM after treatment as our primary outcome, it may be more precise to instead model the individual change in TSM performance (post-treatment TSM minus pre-treatment TSM). However, we were constrained by ceiling effects, where subjects with high baseline performance did not have substantial room for improvement given the dynamic range of the assessments used. When individual performance changes are modeled, patients with known positive prognostic factors (i.e. high baseline performance) are interpreted during training as experiencing little performance gain. This confounds model training and decreases performance, as our linear model lacks the capacity to condition the effects of other favorable prognostic variables on baseline severity (a nonlinear interaction).

Conclusions
High-performance predictive models of individual response to therapy were trained for each aphasia impairment. Models based on GICA fALFF were overall higher performing and more consistent than models based on behavioral measures alone. Furthermore, the average performance of the GICA fALFF models (R2 = 0.816-0.876) is competitive with prior work modelling aphasia outcomes (R2 = 0.56, 0.73) that relied on behavioral or anatomical variables 16,17. This encourages further study of rsfMRI as a prognostic tool for post-stroke aphasia. Continued effort on developing prognostic models which estimate treatment response trajectories may ultimately improve treatment. A series of high-performance prognostic models could be used to estimate the distribution of outcomes for a variety of therapeutic options, opening the door to personalized treatment. This approach may help overcome the unexplained variability in response to aphasia treatment, giving patients and practitioners more agency in the treatment selection process.
rsfMRI has several advantages over conventional language measures in assessment and prognosis. Language assessments have a limited dynamic range and therefore a limited range of deficit severity to which they are sensitive. This can make assessment of treatment response difficult in patients near the assessment ceiling or floor, a challenge observed in our dataset. In contrast, features based on rsfMRI are more continuous and may have a higher dynamic range, offering the potential to equitably assess a wider range of aphasia severity. Given the high dimensionality of rsfMRI data, creating quantitative profiles of aphasia severity from rsfMRI may be easier in practice than refining a battery of existing language measures. However, there are challenges facing clinical adoption of rsfMRI for aphasia. Imaging is relatively expensive, and access to quality scanning varies across patient populations.
It is challenging to image patients who are claustrophobic or unable to keep still. Clinical adoption of rsfMRI may become feasible with further advancements in imaging technology and/or support from healthcare systems.

Data availability
The minimal dataset needed to interpret, replicate, and build on the findings reported in this paper is available from the corresponding author on reasonable request.