Intermediately synchronised brain states optimise trade-off between subject specificity and predictive capacity

Functional connectivity (FC) refers to the statistical dependencies between the activity of distinct brain areas. To study temporal fluctuations in FC within the duration of a functional magnetic resonance imaging (fMRI) scanning session, researchers have proposed the computation of edge time series (ETS) and their derivatives. Evidence suggests that FC is driven by a few time points of high-amplitude co-fluctuation (HACF) in the ETS, which may also contribute disproportionately to interindividual differences. However, it remains unclear to what degree different time points actually contribute to brain-behaviour associations. Here, we systematically evaluate this question by assessing the predictive utility of FC estimates at different levels of co-fluctuation using machine learning (ML) approaches. We demonstrate that time points of lower and intermediate co-fluctuation levels provide the highest overall subject specificity as well as the highest predictive capacity of individual-level phenotypes.

2) Introduction / "These time points of high-amplitude co-fluctuations, called "events", contribute disproportionately to FC and are thought to reflect fluctuations in cognitive state": not all high-magnitude changes in resting-state signals are thought to be cognitively or behaviourally relevant; please soften the statement by integrating: Uddin, L. Q. (2020). Bring the noise: reconceptualizing spontaneous neural activity. Trends in Cognitive Sciences, 24(9), 734-746.
3) Methods: why and how were the 2 resting-state sessions (out of 4) from the HCP data selected? Perhaps it is a matter of wording; different phase-encoding acquisitions could be different sessions, too.
4) Methods: more information should be provided on whether and how global signal regression was applied and computed in the main analysis.
5) Methods: more information could/should be provided on how the spike regression for signal cleaning was carried out quantitatively.
6) Methods / connectome prediction: were the features cleaned or encoded differently? For what reasons were the kernels picked in ridge regression modeling? Was any hyperparameter selection done, and, if yes, how? How was the confound removal integrated into the cross-validation workflow? Poldrack, R. A., Huckins, G., & Varoquaux, G. (2020). Establishment of best practices for evidence for prediction: a review. JAMA Psychiatry, 77(5), 534-540.
7) General: the manuscript appears to be meta-stable between 'identification' and 'identifiability', which are similar but different notions; please keep and argue for one of them throughout the whole work.
Minor
1) Better to avoid abbreviations in subtitles.
2) Figures: the 'threshold' axis may be hard to follow for some readers and could be explained more. Also, the nature of the metric on the y-axis is not always directly apparent.

Referee #1
The authors have focused on the contribution of different edge time series time points to the relation between the brain and behavior. Via machine learning approaches, the authors tried to investigate the capability of functional connectivity (FC) in differential identifiability and identification.

Major Comments:

Major Comment:
This study lacks statistical analysis. The authors report several interesting results without statistical tests. For instance, it is mentioned that "In the sequential sampling strategy, we could replicate the finding that HACF frames yield higher differential identifiability (IDiff) than LACF frames (Fig. 1A)." However, there is no statistical report on this difference to clarify if the difference between HACF and LACF is significant or not. As another example in figure 2, there is no detailed information about = < > signs, there are no statistical reports about values such as correlation and p-values.

Response and Changes:
We thank the reviewer for their thorough review of our study. We appreciate the feedback and understand the concern regarding the statistical analysis of our results. To address this concern, we distinguish between results obtained in the identification experiments, and the results obtained in the machine learning experiments. Firstly, we have adjusted and re-run the identification analysis by leveraging a bootstrapping approach. This allows us to perform meaningful statistical tests to compare the identification and identifiability metrics. Secondly, we have adjusted the manuscript to go into more depth regarding the Bayesian "region-of-practical-equivalence" (ROPE) approach that we use to perform statistical comparisons between the prediction accuracies obtained using different sampling strategies.

1. Identification experiments:
Thus far we did not use inferential statistics for the identification analysis given that each identification problem only results in one accuracy score.
That is, for a given co-fluctuation bin, we perform identification once with the REST1 session as a source and the REST2 session as a target (REST1-REST2), resulting in one score, and then vice versa (i.e. using REST2 as a source and REST1 as a target (REST2-REST1)). Similarly, differential identifiability between two sessions always results in only one score. Therefore, with only two data points we could not use inferential statistics to compare different co-fluctuation bins.
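For illustration, the two subject-specificity scores described here could be computed along the following lines. This is a minimal sketch, not the code used in the study. It assumes that identification matches each source subject to the target subject with the highest Pearson correlation between vectorised FC matrices, and that differential identifiability is the mean within-subject correlation minus the mean between-subject correlation, scaled by 100. All function names and the toy data are hypothetical.

```python
import numpy as np


def identifiability_matrix(source, target):
    """Pearson correlation of every source subject with every target subject.

    source, target: arrays of shape (n_subjects, n_edges), rows aligned by subject.
    """
    src = (source - source.mean(axis=1, keepdims=True)) / source.std(axis=1, keepdims=True)
    tgt = (target - target.mean(axis=1, keepdims=True)) / target.std(axis=1, keepdims=True)
    return src @ tgt.T / source.shape[1]


def identification_accuracy(source, target):
    """Fraction of source subjects whose best-matching target row is their own."""
    corr = identifiability_matrix(source, target)
    return float(np.mean(corr.argmax(axis=1) == np.arange(corr.shape[0])))


def differential_identifiability(source, target):
    """(mean self-similarity - mean similarity to others) * 100 (assumed definition)."""
    corr = identifiability_matrix(source, target)
    n = corr.shape[0]
    i_self = np.trace(corr) / n
    i_others = (corr.sum() - np.trace(corr)) / (n * (n - 1))
    return float(100.0 * (i_self - i_others))


# Toy data: a stable subject-specific pattern plus independent session noise.
rng = np.random.default_rng(0)
pattern = rng.normal(size=(20, 300))
rest1 = pattern + 0.5 * rng.normal(size=pattern.shape)
rest2 = pattern + 0.5 * rng.normal(size=pattern.shape)

# One score per direction (REST1-REST2 and REST2-REST1), then their average.
acc = 0.5 * (identification_accuracy(rest1, rest2) + identification_accuracy(rest2, rest1))
idiff = differential_identifiability(rest1, rest2)
```

Each call yields a single number per pair of sessions, which is why a resampling scheme is needed before any inferential comparison between co-fluctuation bins can be made.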
In order to alleviate this problem, we have now changed and re-run this analysis using a bootstrapping approach. We have incorporated these changes into the manuscript on page 4, lines 108 to 123 in the "Results" section under "Differential identifiability and identification accuracy disagree in their assessment of functional connectivity "fingerprints"": "… To assess variance in identification accuracy and differential identifiability, we resampled the original 771 subjects 1000 times with replacement, keeping only the unique subjects from the resampled subject list. In the sequential sampling strategy, we could replicate the finding that HACF frames yield higher differential identifiability (IDiff) than LACF frames (Fig. 1a). To assess statistical significance of this observation we used two-tailed Wilcoxon signed-rank tests with a Bonferroni correction for multiple comparisons. Across all specified sampling thresholds, HACF-derived FC provided statistically significantly higher differential identifiability scores at the .05 alpha level of significance. At the same time, the sequential sampling strategy shows that LACF frames provide statistically significantly higher identification accuracy (IAcc) than HACF frames across all specified sampling thresholds (Fig. 1b). Moreover, it becomes evident across the individual and combined bins sampling strategies that the highest IDiff is in fact achieved by intermediate bins (Fig. 1c). In the individual and combined bins sampling strategies, bins of intermediate co-fluctuation achieved the highest IAcc overall (Fig. 1d). In the sequential sampling strategy, it is most apparent that IAcc and IDiff show opposing effects (Fig. 1a, b). …" We have also adjusted the "Materials and Methods" section to include information about the bootstrap procedure applied to the identification analysis.
We have added the following paragraph to "Subject Specificity: Differential Identifiability and Identification Accuracy" on page 23, lines 488 to 496: "… In order to obtain an estimate of the variance, and to assess whether differences in identification accuracy and differential identifiability are statistically significant, we performed a bootstrapping procedure via resampling the original 771 subjects with replacement 1000 times. In each resampling run only unique subjects from the resampled subject list were chosen to perform the identification experiment, resulting in 1000 identification accuracy scores as well as 1000 differential identifiability scores per co-fluctuation bin. For each co-fluctuation bin, scores for HACF- and LACF-derived FC were compared with two-tailed Wilcoxon signed-rank tests, resulting in one p-value per co-fluctuation bin. To account for multiple comparisons, a Bonferroni correction at the alpha level of significance of .05 was applied. …"
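The bootstrap and testing procedure quoted above might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function names and the toy data are hypothetical, the number of resampling runs is reduced from 1000 to 200 to keep the example fast, and the same resampled subject list is assumed to be used for HACF and LACF in each run so that the signed-rank test is paired. `scipy.stats.wilcoxon` is the only non-NumPy call.

```python
import numpy as np
from scipy.stats import wilcoxon


def identification_accuracy(source, target):
    """Fraction of subjects whose best-correlated target row is their own."""
    n = len(source)
    corr = np.corrcoef(source, target)[:n, n:]  # source-by-target block
    return float(np.mean(corr.argmax(axis=1) == np.arange(n)))


def bootstrap_accuracy(source, target, n_boot, seed):
    """Resample subjects with replacement; keep only unique subjects per run."""
    rng = np.random.default_rng(seed)
    n = len(source)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        unique = np.unique(rng.integers(0, n, size=n))
        scores[b] = identification_accuracy(source[unique], target[unique])
    return scores


def compare_hacf_lacf(hacf, lacf, n_boot=200, alpha=0.05, seed=0):
    """hacf and lacf map a threshold to a (source_fc, target_fc) pair of arrays."""
    n_tests = len(hacf)
    out = {}
    for thr in hacf:
        # Same seed for both conditions, hence the same resampled subjects (paired).
        h = bootstrap_accuracy(*hacf[thr], n_boot, seed)
        l = bootstrap_accuracy(*lacf[thr], n_boot, seed)
        p = wilcoxon(h, l, alternative="two-sided").pvalue
        p_bonf = min(1.0, float(p) * n_tests)  # Bonferroni: one test per threshold
        out[thr] = (p_bonf, bool(p_bonf < alpha))
    return out


# Toy data only: noisier sessions stand in for the condition with lower accuracy.
toy_rng = np.random.default_rng(1)


def toy_sessions(noise, n_subjects=30, n_edges=50):
    pattern = toy_rng.normal(size=(n_subjects, n_edges))
    return (pattern + noise * toy_rng.normal(size=pattern.shape),
            pattern + noise * toy_rng.normal(size=pattern.shape))


hacf = {5: toy_sessions(3.0), 10: toy_sessions(3.0)}
lacf = {5: toy_sessions(1.0), 10: toy_sessions(1.0)}
results = compare_hacf_lacf(hacf, lacf)  # threshold -> (corrected p, significant?)
```

Multiplying each p-value by the number of thresholds and comparing against .05 is equivalent to comparing the raw p-value against .05 divided by the number of tests.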

2. Prediction experiments:
To compare prediction accuracy between individual or combined bins and the full FC, we used a Bayesian 'region of practical equivalence' (ROPE) approach (Benavoli et al., 2017). We acknowledge that in our paper we did not dedicate a lot of text towards explaining this approach and we thank the reviewer for pointing this out. To better explain the rationale of our statistical tests in the prediction analyses, we added a sub-section entitled "Bayesian "region-of-practical-equivalence" (ROPE) approach" to the "Materials and Methods" section, where we expand a bit more on this particular method. The text from pages 25 to 26 (lines 560 to 589) was changed accordingly: "… To compare prediction accuracy between individual or combined bins and the full FC, we used a Bayesian 'region of practical equivalence' (ROPE) approach 38. In this approach one defines two models as practically equivalent if differences between accuracy scores do not exceed a pre-defined percentage. It is a statistical approach used in Bayesian inference to determine a region around a null value (our predefined percentage) in which the posterior probabilities of a given parameter falling within this region can be determined using Bayesian estimation 63. Using the prior and the results of the experiment (the observed data), a posterior distribution can be estimated. We can then estimate three different probabilities regarding differences in model accuracies (i.e. in relation to our hypotheses).
Assuming that differences (A − B) are obtained between two models A and B:
1. P(A < B): the posterior probability that model B performs better than A; this is the integral of the distribution to the left of the region of practical equivalence (i.e. where differences are negative)
2. P(A = B): the posterior probability that models A and B are practically equivalent; this is the integral of the distribution inside the region of practical equivalence
3. P(A > B): the posterior probability that model A performs better than B; this is the integral to the right of the region of practical equivalence (i.e. where differences are positive)
That is, the posterior probabilities represent the degree of belief in a given hypothesis after taking into account the observed data. One advantage of Bayesian estimation is that it does not rely on a point-wise null hypothesis but rather on a range of potential values within the ROPE. This ROPE can be specified to reflect the practical context (for example, if even a small difference in accuracy can incur a cost that is considered practically meaningful within a given context, one may choose a different ROPE). In our study, we used a 5% ROPE, so that 5% differences in model accuracy would be considered practically equivalent. …"
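As a rough illustration of how the three posterior probabilities can be obtained, the sketch below follows the Bayesian correlated t-test described by Benavoli et al. (2017), in which the posterior of the mean score difference between two models is a Student t distribution whose scale is inflated to account for overlapping training sets across folds. Several points are assumptions made for the example rather than details confirmed by the response: that this particular test was used, that the 5% ROPE corresponds to an absolute difference of 0.05 in score units, and that the correlation term equals one over the number of folds. The function name and toy scores are hypothetical, and the integrals are approximated by sampling from the posterior.

```python
import numpy as np


def rope_probabilities(scores_a, scores_b, rope=0.05, n_folds=10,
                       n_samples=100_000, seed=0):
    """Return (P(A < B), P(A = B), P(A > B)) for paired cross-validation scores."""
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = diff.size
    mean = diff.mean()
    var = diff.var(ddof=1)
    rho = 1.0 / n_folds  # assumed test-set fraction in k-fold cross-validation
    scale = np.sqrt((1.0 / n + rho / (1.0 - rho)) * var)

    # Draw from the Student t posterior of the mean difference (n - 1 d.o.f.).
    rng = np.random.default_rng(seed)
    posterior = mean + scale * rng.standard_t(df=n - 1, size=n_samples)

    p_left = float(np.mean(posterior < -rope))   # mass left of the ROPE
    p_right = float(np.mean(posterior > rope))   # mass right of the ROPE
    p_rope = 1.0 - p_left - p_right              # mass inside the ROPE
    return p_left, p_rope, p_right


# Toy fold-wise scores: model A is better than B by about 0.2, or about equal.
jitter = 0.01 * np.array([1, -1, 1, -1, 1, -1, 1, -1, 1, -1])
scores_b = np.linspace(0.30, 0.39, 10)
better = rope_probabilities(scores_b + 0.2 + jitter, scores_b)
equivalent = rope_probabilities(scores_b + jitter, scores_b)
```

In the first toy case nearly all posterior mass lies to the right of the ROPE, so model A would be reported as better; in the second nearly all of it lies inside the ROPE, so the two models would be reported as practically equivalent.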

Major Comment:
The aim of defining three different selection strategies is not clear in the manuscript. In other words, the authors did not mention which information they can get from each strategy.

Response:
We acknowledge that the rationale for selecting these strategies was not clearly stated in the manuscript, and we appreciate the opportunity to provide more clarity and insight into these.
The first strategy, which we refer to as the sequential strategy, was used to replicate and extend the findings of a previous study by Zamani Esfahlani et al. (2020). This strategy was chosen to allow us to obtain results that are directly comparable to theirs and to explore the generalisability of their findings.
The second strategy, which we call the individual bins strategy, was designed to investigate the intermediate bins that were not examined in the previous study by Zamani Esfahlani et al. (2020). Our goal was to extend their findings and to explore the possibility that intermediate bins may also be behaviourally relevant. We hypothesised that different levels of co-fluctuation in rs-fMRI may be differently predictive of different behaviours.
A third strategy that we employed in this study was to combine individual bins to test whether combining bins from different co-fluctuation levels could provide additional information, or whether behavioural information is maximised by any one level of co-fluctuation. This strategy was selected to examine the potential for shared information between co-fluctuation levels.
In summary, we chose the three strategies with specific objectives in mind: to replicate and extend previous findings, to investigate the intermediate bins, and to examine the relationship between different levels of co-fluctuation. Overall, the use of these three strategies allowed comprehensive investigation of different aspects of the relationship between co-fluctuation levels and behaviour. We have adjusted the "Materials and Methods" section to better explain these objectives.
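To make the three strategies concrete, the following sketch shows one way frames could be selected from an edge time series. It is a hedged illustration, not the study's pipeline: it assumes that the co-fluctuation amplitude of a frame is the root sum of squares across edges, that FC for a set of frames is the mean of the edge time series over those frames, and that individual bins are equally sized groups of frames ranked by amplitude. The function names, the number of bins, and the random toy data are all hypothetical.

```python
import numpy as np


def edge_time_series(ts):
    """ts: (n_timepoints, n_regions). Returns (n_timepoints, n_edges)."""
    z = (ts - ts.mean(axis=0)) / ts.std(axis=0)  # z-score each region over time
    i, j = np.triu_indices(ts.shape[1], k=1)
    return z[:, i] * z[:, j]  # frame-wise product for every region pair


def cofluctuation_amplitude(ets):
    """Root sum of squares across edges, one value per frame."""
    return np.sqrt((ets ** 2).sum(axis=1))


def fc_from_frames(ets, frames):
    """FC estimate from a subset of frames (mean of the edge time series)."""
    return ets[frames].mean(axis=0)


def sequential_frames(amplitude, threshold_pct, highest=True):
    """Top (HACF) or bottom (LACF) threshold_pct percent of frames."""
    n_keep = max(1, int(round(len(amplitude) * threshold_pct / 100)))
    order = np.argsort(amplitude)  # ascending amplitude
    return order[-n_keep:] if highest else order[:n_keep]


def individual_bins(amplitude, n_bins):
    """Equally sized bins of frames; bin 0 holds the lowest co-fluctuation."""
    return np.array_split(np.argsort(amplitude), n_bins)


def combined_bins(bins, a, b):
    """Frames from two bins pooled together."""
    return np.concatenate([bins[a], bins[b]])


# Toy example with random data standing in for parcellated rs-fMRI signals.
rng = np.random.default_rng(0)
ts = rng.normal(size=(200, 6))
ets = edge_time_series(ts)
amp = cofluctuation_amplitude(ets)

hacf_fc = fc_from_frames(ets, sequential_frames(amp, 10, highest=True))
lacf_fc = fc_from_frames(ets, sequential_frames(amp, 10, highest=False))
bins = individual_bins(amp, n_bins=5)
middle_fc = fc_from_frames(ets, bins[2])                   # one intermediate bin
mixed_fc = fc_from_frames(ets, combined_bins(bins, 0, 4))  # lowest + highest bin
full_fc = fc_from_frames(ets, np.arange(len(ets)))         # all frames
```

Averaging the edge time series over all frames recovers the ordinary Pearson correlation between regions, which is why a threshold of 100% in the sequential strategy corresponds to full FC.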

Changes:
Accordingly, we have added a paragraph in the "Materials and Methods" section, see page 20 (lines 439 to 446). The revised paragraph is shown below for convenience: "… Each sampling strategy was chosen with a specific goal in mind: The first strategy, which we referred to as the sequential strategy, was used to replicate and extend the findings of a previous study 18. The second strategy, which we called the individual bins strategy, was designed to investigate the intermediate bins that were not examined in the previous study. A third strategy that we employed in this study was to combine individual bins to test whether combining bins from very different co-fluctuation levels also provides additional information, or whether behavioural information is maximised by any one level of co-fluctuation. This strategy was selected to examine the potential for shared information between co-fluctuation levels. …"

Major Comment:
The authors defined three strategies. For two of them, "individual" and "combined", three states (low, intermediate, and high amplitude co-fluctuation) are considered. For the third one, sequential, the authors considered only low and high amplitude co-fluctuation, although the major finding of this work is related to intermediate bins.

Response:
We agree with the reviewer that the sequential strategy lacks information on intermediate bins. We used the sequential strategy to replicate and extend the findings of the previous study by Zamani Esfahlani et al. (2020), to assess whether our results are comparable to theirs and to explore the generalisability of their findings. As this strategy indeed disregards intermediate co-fluctuation levels we chose to additionally perform the individual bins strategy and the combined bins strategy.
Minor Comments:

Minor Comment:
The HCP dataset has 4 different sessions, I would suggest the authors clarify which sessions were used for this study.

Response:
We thank the reviewer for pointing this out and have clarified this in the following sections.

Changes:
We have updated the beginning of the "Results" section to explicitly state which fMRI sessions we used (page 4, lines 102 to 105): "… Identification was performed between rs-fMRI data from day 1 and day 2 of HCP-YA data and for each day FC was averaged across phase encoding directions. For prediction the FC obtained on both days were averaged, resulting in one FC matrix per subject per co-fluctuation bin. … providing four overall rs-fMRI datasets. Scans were acquired using a 3T Siemens connectome-Skyra scanner with a gradient-echo EPI sequence (TE=33.1ms, TR=720ms, flip angle = 52°, 2.0mm isotropic voxels, 72 slices, multiband factor of 8). …" We have also specified this in the "Subject Specificity: Differential Identifiability and Identification Accuracy" sub-section of the "Materials and Methods" section (page 23, lines 482 to 487): "… Identification was performed between rs-fMRI data from day 1 and day 2 of HCP-YA data collection and for each day FC was averaged across phase encoding directions. That is, for identification accuracy, identification was performed with day 1 data as a source and day 2 data as a target, and vice versa, resulting in two identification accuracy scores. These two scores were averaged to obtain one overall identification accuracy. Differential identifiability was also estimated between rs-fMRI data from day 1 and day 2. …" In addition, we have also adjusted a sentence in the "Prediction of Behavioural and

Response:
We thank the reviewer for their comment about the captions of figures 1 and 5, and agree that they were somewhat sparse. We have revised the captions to provide more detailed information on how the figures were obtained and what they illustrate. Additionally, we have updated figures 3, 4, 6, 7, and 8 to ensure they provide a more comprehensive understanding of our results. We have also changed all the bar plots to box plots following Communications Biology guidelines.

Changes:
We have adjusted the captions for Figs. 1, 3, 4, 5, 6, 7, and 8 as follows:

Figure 1.
Differential identifiability and identification accuracy in the HCP-YA dataset for the sequential sampling strategy (a and b), the individual bins sampling strategy (c and d), and the combined bins strategy (e). In a) and b), "Threshold" refers to the percentage of highest (HACF) or lowest (LACF) co-fluctuation frames chosen to estimate FC. In e), the lower triangle shows differential identifiability, whereas the upper triangle shows identification accuracy achieved by each pair of combined bins. In each identification experiment subjects were resampled with replacement 1000 times, and in each resampling run only the subset of unique subjects was chosen to perform the identification analysis. Asterisks ("*") in a) and b) indicate a significant difference as determined by a Wilcoxon signed-rank test between HACF-derived FC and LACF-derived FC at the .05 alpha significance level after Bonferroni correction, and the error bars indicate a 95% confidence interval. In c) and d) the box plot indicates the median (center line) and the interquartile range.

Figure 3.
Prediction scores (Pearson's r between observed and predicted values on the y-axis) for 9 phenotypic targets averaged across the ten folds in the grouped cross-validation scheme when using FC estimates derived from time points at different levels of co-fluctuation magnitude using the sequential sampling strategy. "Threshold" refers to the percentage of highest (HACF) or lowest (LACF) co-fluctuation frames chosen to estimate FC. Upper and lower boundaries of the fill colours indicate the standard deviation across folds. A threshold of 100% corresponds to full FC. These 9 targets are displayed because they yielded the best prediction accuracy using full FC compared to the other targets displayed in the supplementary information.

Figure 4.
Age prediction (Pearson's r) and sex classification (accuracy) using the sequential (a and b), individual bins (c and d), and combined bins (e) sampling strategies. In a) and b), "Threshold" (x-axis) refers to the percentage of highest (HACF) or lowest (LACF) co-fluctuation frames chosen to estimate FC and the y-axis indicates the prediction score. Upper and lower boundaries of the fill colours indicate the standard deviation across folds. In c) and d) the box plot indicates the median (center line) and the interquartile range. In e) comparison operators indicate whether scores obtained by a co-fluctuation bin are equivalent to scores obtained by full FC ("=") or whether they are less ("<") or greater (">") than scores obtained by full FC according to a 5% Bayesian ROPE 43.

Figure 5.
Differential identifiability and identification accuracy in the HCP-A dataset for the sequential sampling strategy (a and b), the individual bins sampling strategy (c and d), and the combined bins strategy (e). In a) and b), "Threshold" refers to the percentage of highest (HACF) or lowest (LACF) co-fluctuation frames chosen to estimate FC. In e), the lower triangle shows differential identifiability, whereas the upper triangle shows identification accuracy achieved by each pair of combined bins. In each identification experiment subjects were resampled with replacement 1000 times, and in each resampling run only the subset of unique subjects was chosen to perform the identification analysis. Asterisks ("*") in a) and b) indicate a significant difference as determined by a Wilcoxon signed-rank test between HACF-derived FC and LACF-derived FC at the .05 alpha significance level after Bonferroni correction. The error bars indicate a 95% confidence interval. In c) and d) the box plot indicates the median (center line) and the interquartile range. In e) and f) the box plot indicates the median (center line) and the interquartile range. In g) comparison operators indicate whether scores obtained by a co-fluctuation bin are equivalent to scores obtained by full FC ("=") or whether they are less ("<") or greater (">") than scores obtained by full FC according to a 5% Bayesian ROPE.

Minor Comment:
Before resubmitting, I would suggest the authors carefully go through the article and break the long sentences into short and clear sentences.

Response:
We have carefully reviewed the article and made efforts to reduce the complexity and length of sentences where appropriate. We recognize the importance of clear and concise writing, and have made a concerted effort to ensure that our writing is easy to follow and understand. We appreciate your suggestion and are grateful for the opportunity to improve the readability of our manuscript.