Jumping over baselines with new methods to predict activation maps from resting-state fMRI

Cognitive fMRI research primarily relies on task-averaged responses over many subjects to describe general principles of brain function. Nonetheless, there is considerable variability between subjects, which is also reflected in spontaneous brain activity as measured by resting-state fMRI (rsfMRI). Leveraging this fact, several recent studies have aimed to predict task activation from rsfMRI using various machine learning methods, within a growing literature on ‘connectome fingerprinting’. In reviewing these results, we found a lack of evaluation against robust baselines that would reliably establish the novelty of the predictions. On closer examination of the reported methods, we found that most underperform trivial baseline models based on massive group averaging when whole-cortex prediction is considered. Here we present a modification of published methods that remedies this problem to a large extent. Our proposed modification is based on a single-vertex approach that replaces commonly used brain parcellations. We further summarize this model evaluation by characterizing empirical properties of where prediction appears possible for this task, explaining why predictions largely fail for certain targets. Finally, with these empirical observations in hand, we investigate whether individual prediction scores explain individual behavioral differences in a task.

Dice score sensitivity to chosen threshold: Single-subject GLM maps contain a considerable amount of noise, resulting in inflated statistical errors when detecting activation in task-based fMRI. Therefore, results reported at the vertex/voxel level are often based on a binary active vs. non-active classification, and such classifications are sometimes used to evaluate fMRI reproducibility measures. Here, we report on the stability of the model evaluation under such a thresholding procedure.
Varying thresholds (Z-scores: 1.4-3.3) were chosen to display the overall sensitivity of results relative to the baseline models: Group Z-Stat, Group Z-Stat corrected with Threshold-Free Cluster Enhancement (TFCE) at a family-wise error (FWE) rate of p < 0.05, and Group Mean. At liberal thresholds, Dice coefficient scores are higher for group-based models than for the top-performing fitted model. According to this evaluation metric, fitted models outperform group models only when thresholds become increasingly strict, especially at the highest threshold (0.2). Differences between these models depend on the number of training subjects, the threshold selected, and the contrast examined.
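The thresholded comparison above reduces to a Dice overlap between binarized maps. As an illustrative sketch (the array names and toy data are assumptions, not the paper's pipeline), the evaluation at a sweep of Z thresholds can be written as:

```python
import numpy as np

def dice_score(z_pred, z_true, threshold):
    """Dice overlap between two Z-maps binarized at `threshold`.

    Returns 2*|A & B| / (|A| + |B|), or NaN when both maps are empty
    after thresholding (no active vertices in either map).
    """
    a = z_pred > threshold
    b = z_true > threshold
    denom = a.sum() + b.sum()
    if denom == 0:
        return float("nan")
    return 2.0 * np.logical_and(a, b).sum() / denom

# Toy example: a "predicted" map as a noisy copy of a "true" map.
rng = np.random.default_rng(0)
truth = rng.normal(size=10_000)
pred = truth + rng.normal(scale=0.5, size=10_000)

for thr in (1.4, 2.3, 3.3):  # liberal to strict thresholds
    print(thr, round(dice_score(pred, truth, thr), 3))
```

Note the NaN guard: at very strict thresholds a subject's map can become entirely sub-threshold, which would otherwise divide by zero.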

Miscellaneous
Suppl. Figure S3. Dice score sensitivity to number of training subjects: A comparison between Group Z-Stat models (top row) and MMP-RR-PCR (bottom row) with varying numbers of subjects used to fit the models. Shown are the 3 contrasts used in the main text for subject-wise evaluation. Columns show increasing (left to right) Z thresholds chosen for the comparison, i.e., 1.7, 2.3, 3.1, and GGM, a threshold calculated from Gaussian-Gamma mixture models fitted individually to the actual task activation maps of individual subjects; this threshold, the estimated median of the positive Gamma component, tends to be considerably more conservative than Z > 3.1. At higher thresholds, increasing the number of subjects used to calculate Group Z-Stat models lowers the Dice score considerably. For the fitted model MMP-RR-PCR on contrasts FACES and RH, the fitted contrasts remain flat across all thresholds, without the considerable increases in Dice score expected from increasing sample size.

Suppl. Figure S4. Separate test-sample R² of model MMP-RR-PCR evaluations for a single task contrast from each task category: All 47 contrast targets belong to 7 task categories: Emotion, Gambling, Language, Motor, Relational, Social, Working Memory. The contrast in each category closest to that category's median Pearson r score was selected for display. The test-sample mean over all 47 contrasts is also plotted for convenience. A general pattern is clear across all contrasts: only within certain regions, e.g., association cortex, does a positive R² appear possible. Primary sensorimotor regions are consistently negative.

Suppl. Figure S5. Pearson r correlation score benchmark results for the 100-subject test set: The colorbar indicates the mean r-score difference between the model prediction score and the mean baseline across all test subjects for a given contrast and model. Scores are ordered by model and contrast exactly as in figure 2.
This figure is akin to figure 4, showing which models achieve an R² score above 0.
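The key property behind the negative R² values reported here is that R² is computed per vertex across test subjects, relative to the across-subject mean; a model can therefore score below zero wherever it predicts worse than that mean baseline. A minimal sketch (variable names and the toy data are assumptions):

```python
import numpy as np

def vertexwise_r2(y_true, y_pred):
    """Coefficient of determination per vertex across test subjects.

    y_true, y_pred: arrays of shape (n_subjects, n_vertices).
    R^2 = 1 - SS_res / SS_tot, where SS_tot is taken around the
    across-subject mean at each vertex; values below 0 mean the model
    predicts worse than that mean baseline does.
    """
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
y = rng.normal(size=(100, 500))                 # toy subjects x vertices
good = y + rng.normal(scale=0.3, size=y.shape)  # informative prediction
bad = rng.normal(size=y.shape)                  # uninformative prediction
print(vertexwise_r2(y, good).mean() > 0)  # True
print(vertexwise_r2(y, bad).mean() < 0)   # True
```

This is why "positive R² appears possible" is a meaningful statement: an uninformative model does not hover at zero but goes negative once the mean baseline is the reference.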

Suppl. Figure S6. Inter-subject variability of activation maps (Z-maps), sulcal depth, and MMP-RR-PCR rsfMRI features, and how they relate to the measured R² and to each other, plotted as a 3D scatter plot where each point represents a cortical vertex. The colormap represents a vertex's mean R² computed across all 47 contrasts. Inter-subject variability of features, sulcal depth, and Z-maps is computed as the vertex-wise standard deviation; rsfMRI features and Z-maps are averaged across all features (379) or contrasts (47), respectively. The plot shows that sulcal depth and rsfMRI functional correlation features are strongly correlated with one another. Unsurprisingly, the model's vertex-wise prediction ability as measured by R² is concentrated around the point-cloud mass where the inter-subject variability of the two factors is highest (upper right-most areas of the plots).

Suppl. Figure S8. Vertex-wise R² score sensitivity at 100 training samples: Scores indicate the fraction of the cortical surface with an R² score above a given threshold (plotted on the x-axis) for a given contrast and model.
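The quantity plotted on each curve in Figure S8 is simply the surviving fraction of vertices as the R² cutoff is swept. A sketch of that sweep (the per-vertex scores below are hypothetical, chosen only to mimic a few well-predicted association vertices amid many near-zero or negative ones):

```python
import numpy as np

def surface_fraction_above(r2_per_vertex, thresholds):
    """Fraction of cortical vertices whose R^2 exceeds each threshold.

    Mirrors the x-axis sweep in Figure S8: for each threshold t,
    report mean(r2 > t) over all vertices.
    """
    r2 = np.asarray(r2_per_vertex)
    return np.array([(r2 > t).mean() for t in thresholds])

# Hypothetical per-vertex scores: 200 well-predicted vertices,
# 800 slightly negative ones.
r2 = np.concatenate([np.full(200, 0.4), np.full(800, -0.05)])
print(surface_fraction_above(r2, [0.0, 0.25, 0.5]).tolist())
# -> [0.2, 0.2, 0.0]
```

The curve is monotonically non-increasing in the threshold, so comparing models amounts to comparing how slowly their curves decay.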

MMP-RR Task vs. MMP-RR Rest
Suppl. Figure S10. We investigated whether task data from separate task measurements provided better features for task prediction than resting-state data. It is implicitly assumed that during resting-state scans, as opposed to task-based data where only a very limited number of cognitive brain activity modes are probed, subjects enter multiple cognitive modes comprising default, visual, motor, executive-control, and attention processes. This is supported by the networks of brain activity elicited during a single measurement of rest largely overlapping those extracted during task 4, 5 . Furthermore, even across different task states, a core cognitive network appears to dominate 6 . Do these multiple cognitive modes, speculatively elicited during a rest scan, differentiate subjects better than tfMRI? Here, we tested whether a concatenation of data from the HCP battery of 6 diverse, but ultimately limited, tasks 7 provides better features. In the tfMRI case, separate features were calculated by selecting only 6 of the 7 tfMRI datasets, leaving out the tfMRI measurement of the to-be-predicted GLM task contrast; doing this excluded circularity. In the reported experiments, the data matrix X_i was a concatenation across 6 of the 7 task measurements, the excluded measurement being the one from which the contrast map was computed. This led to features being computed from 3468, 3314, 3188, 3252, 3356, 3272, and 3010 samples for EMOTION, GAMBLING, LANGUAGE, MOTOR, RELATIONAL, SOCIAL, and WM contrast map predictions, respectively.
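The leave-one-task-out construction of X_i described above can be sketched as follows. This is an illustrative assumption-laden toy (the dictionary layout, the uniform run length of 176 timepoints, and 50 vertices are all made up; real HCP runs differ in length, which is why the supplement's per-task sample counts are unequal):

```python
import numpy as np

# Hypothetical per-task time series for one subject:
# task name -> (timepoints x vertices) array.
TASKS = ["EMOTION", "GAMBLING", "LANGUAGE", "MOTOR",
         "RELATIONAL", "SOCIAL", "WM"]

def leave_one_task_out(task_data, target_task):
    """Concatenate time series from all tasks except `target_task`.

    Excluding the measurement that produced the target contrast map
    avoids circularity when the concatenated data are used to compute
    prediction features.
    """
    kept = [task_data[t] for t in TASKS if t != target_task]
    return np.concatenate(kept, axis=0)

rng = np.random.default_rng(2)
task_data = {t: rng.normal(size=(176, 50)) for t in TASKS}  # toy sizes
X = leave_one_task_out(task_data, "MOTOR")
print(X.shape)  # (1056, 50): 6 tasks x 176 timepoints each
```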

Suppl. Table S1. Correlation Score Table: Mean correlation scores averaged over subjects, shown across all models examined and ordered by contrast name as in figure 2. Additional supplementary material provides individual subject scores in a CSV (all_model_and_subject_r_scores.csv).

Suppl. Table S2. Weighted R² Score Table: R² scores across all models examined, ordered by contrast name as in figure 4. These scores are provided in a CSV (model_r2w_scores.csv).