Task-induced brain state manipulation improves prediction of individual traits

Recent work has begun to relate individual differences in brain functional organization to human behaviors and cognition, but the best brain state to reveal such relationships remains an open question. In two large, independent data sets, we here show that cognitive tasks amplify trait-relevant individual differences in patterns of functional connectivity, such that predictive models built from task fMRI data outperform models built from resting-state fMRI data. Further, certain tasks consistently yield better predictions of fluid intelligence than others, and the task that generates the best-performing models varies by sex. By considering task-induced brain state and sex, the best-performing model explains over 20% of the variance in fluid intelligence scores, as compared to <6% of variance explained by rest-based models. This suggests that identifying and inducing the right brain state in a given group can better reveal brain-behavior relationships, motivating a paradigm shift from rest- to task-based functional connectivity analyses.

Major concerns.
1) HCP family structure. Based on the methods described in the current ms and also in the more general approach described in Nature Protocols (Shen et al., 2017), it appears that the authors have not taken into account the family structure of the HCP cohort (twins and non-twin siblings) in their statistical analyses. Failure to account for family structure can lead to false positives and inflated estimates of statistical significance (Winkler et al., Neuroimage, 2015). Taking family structure into account is unlikely to have a dramatic effect in this particular study, but in this reviewer's assessment it is essential in order for the statistics to be on a solid footing. Moreover, since the authors are promoting the CPM approach as a useful tool for other investigators to use, it is all the more incumbent upon them to make the CPM tools more robust for the community at large. The authors might find this URL helpful: https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/PALM/ExchangeabilityBlocks#EBs_for_data_of_the_Human_Connectome_Project
2) In Fig. 1, the difference in trait predictability for tasks versus rest is reasonably clear and impressive for most tasks in both HCP and PNC cohorts. The text (p. 6) states that the differences are significant for all but the HCP relational task. However, the significance is borderline (p = .04) for the HCP social vs R1. Besides the aforementioned concern about accounting for family structure, another methodological issue is whether these p values were corrected for multiple comparisons. Item 14 in the Additional Material indicates that this was indeed done; it should also be mentioned in the Methods (I didn't find it).
3) The contrast between task and rest is all the more impressive in view of the fact that for HCP, the individual tasks were much shorter in duration than the rest scans. The authors later make a point of this (Discussion, p. 14) but should give specifics in the Methods and also indicate whether the apparent differences between tasks are correlated with individual task duration.
Much rides on the claims that there are convincing differences across tasks in trait predictiveness. The statement at the bottom of p. 6 that "certain tasks yielding better predictions than others" may be technically correct, but it appears to rest on one leg: HCP gambling is better than relational tasks. While this one contrast appears to be highly significant, it is critical to know how general this is. At this place in the ms, the conclusion seems overstated.
4) The sex differences illustrated in Fig. 3 are fascinating and appropriately come later in the presentation. However, the scatterplots in Fig. 1b-d and f-h offer an intriguing opportunity to plot M vs F subjects in different colors (e.g., red, green, and yellow where they overlap). This might prove to be quite informative without taking up additional figure space; discussion of the sex differences could still be postponed to later in the text.
5) p. 7. Another possible control relates to whether the results depend specifically on use of regressing "global" signals (gray and white matter plus CSF; cf. Methods, p. 19). This step forces a zero-mean correlation and is controversial in the field. In the Shen (2017) paper the authors note in passing that there are alternative approaches such as partial correlation and ICA-based spatial denoising (e.g., Smith et al., NN 2015). Ideally, the authors would be able to run partial correlation for at least some of their tests to see if the results differ markedly. At a minimum they should justify their choices and note these potential confounds.
6) Edge overlap. Fig. 2 provides a data-rich comparison of edge overlap and node degree within and across models and cohorts. The edge overlap (Fig. 2a) is generally quite modest (~12% is the highest and most are much lower). Of course, there is inherently a lot of noise in these datasets, but it leaves this reviewer uneasy that the spatial patterns of connectivity appear to be pretty low in reproducibility. The edge overlap is a binary measure that entails thresholding and discounting of edge weights. The alternative of estimating overlap using weighted measures without thresholding might in principle have greater sensitivity and should at least be considered and mentioned.
7) Node degree overlap. A striking observation (p. 9 and Fig. 2b, c) is that CN and AN degree vectors are highly correlated.
A potential confound is that CNR/SNR is regionally variable and dependent on the head coil and pulse sequence. Higher CNR regions should presumably tend to have higher degree in general, related to data acquisition characteristics, not just neurobiological factors. This issue warrants careful consideration.
8) Sex differences. p. 12, Fig. 3, and Table 2. Some of the sex differences for different tasks are impressively large, highly significant, and consistent to some degree across HCP and PNC cohorts. However, the M vs F difference for WM is very large for HCP but quite modest for PNC (and not stated as to whether the latter difference is significant). Similarly, the M vs F difference for emotion is large for the PNC cohort (but significance not stated) whereas there is hardly any difference for the HCP cohort. On p. 12, para. 3, the authors report 'preliminary evidence for a sex-by-age interaction'. However, the data in Table 2 appear distinctly underwhelming and do not even show a consistent trend for females, and the claims about 'outperforming' noted in the table legend should either be tempered or buttressed by p values. This paragraph and table should perhaps be dropped or relegated to SI.
9) Somewhere in the discussion the authors should discuss whether some of the lower variance explained for rest might be related to differences in attention or arousal states, as the rest state may be associated with reduced arousal, drowsiness, and even overt sleep for part of the scan in some/many subjects.
10) Limitations (pp. 17-18), parcellation (p. 20) and parcellation resolution (pp. 23-24). The authors use a 268-node whole-brain parcellation ('functional atlas') from Shen et al. (2013), which is based on a volume-based alignment across individuals. For the PNC data, they compare results for a 600-node parcellation (Craddock et al., 2012) and find similar results.
Their implicit conclusion that the choice of parcellation and of alignment method doesn't matter much is overstated. Methods that achieve more accurate alignment of functional subdivisions (e.g., the areal-feature-based alignment for cortical structures in the publicly released HCP data) and the associated multimodal parcellation (Glasser et al., 2016) might well yield better predictive power. This issue warrants mention so that readers are aware that alignment quality and parcellation accuracy are important issues that may impact and eventually contribute to improved CPM performance.
11) Image reconstruction confound. p. 23. Another HCP-specific confound that needs to be addressed (but seems to have been overlooked) is that early in the HCP the fMRI image reconstruction was changed to an improved method. See https://wiki.humanconnectome.org/display/PublicData/Ramifications+of+Image+Reconstruction+Version+Differences. It should be straightforward to regress out this confound.
Minor comments.
p. 3, end of para. 1. "…ideal…" is not an 'ideal' word. "Attractive" would be better.
Fig. 1a and 3a. Please rotate the text to oblique or vertical and increase font size along x axis for legibility. On Fig. 1e, simply increase font size. Fig. 1i
Methods. Please state key parameters for HCP and PNC scans: spatial resolution, multiband factor, TR, scan durations, to spare readers from needing to look these up.
Reviewer #2 (Remarks to the Author):
"Task-induced brain state manipulation improves prediction of individual traits" is an extension of the authors' previously published study on resting state fMRI fingerprints. Here they examine whether models built from task fMRI data perform better than those built from resting-state fMRI data. This is a nice direction, but neither the results nor the approach is novel, as several previous studies have examined this question (Cole et al. and others). Further, this study has several other major limitations: it is not theoretically motivated, tests no neural hypotheses, and the findings are incremental and inconsistent with respect to a previous publication from the authors (Finn et al., 2015 NN). Surprisingly, findings from that study are not discussed, although the authors employed almost exactly the same pipeline and investigated the same question, "do individual connectivity profiles predict fluid intelligence". Overall, the study is weak and findings generally uninterpretable. Other concerns include:
(1) There is no theoretical justification for using gambling, motor, social, and emotion tasks to predict fluid intelligence. Gambling task activation turned out to be a strong predictor of gF. I wonder what the interpretation and theory here is.
(2) As in Finn et al., the results demonstrate that connectivity profiles can predict fluid intelligence. However, if we compare Finn et al. Fig. 5a with current Fig. 1d, it appears that models built from the previous small sample actually outperform those from the larger sample used here. Does this suggest that the predictive value of connectivity features (or the effectiveness of the current pipeline) decreases with increasing sample size? Moreover, in the Finn et al. study, just using features from fronto-parietal networks achieved the same performance as using features from the whole brain (Fig. 5a vs. 5c). Although the authors did not test if this is the case in the current study, it appears that visuomotor regions are more highly predictive of fluid intelligence in the current study (Fig. 2d). Does this fit any cognitive theory or models of fluid intelligence? I doubt it.
(3) Use of a developmental PNC cohort with a large age range is problematic.
(4) The authors employed a univariate approach to select features using connectivity features that show a significant correlation with fluid intelligence. This is highly circular and problematic, and a likely reason for the inflated findings in Finn et al. This is now quite problematic as the current results suggest a failure to replicate original Finn et al. findings.
Reviewer #3 (Remarks to the Author):
Review of: Task-induced brain state manipulation improves prediction of individual traits
I read this article with great interest, not least because I agree strongly with the central premise, which is that dynamic/task active connectivity should provide a better predictor of cognitive ability than resting state connectivity. This notion makes a great deal of sense, given that we see patterns of large-scale network activity/coherences during the performance of g-loaded tasks. The analyses and results have a number of important strengths; however, there are also a number of key points that I believe, if addressed, would greatly strengthen the article.
1) The authors state that "By considering task-induced brain state and sex, the best-performing model explains over 20% of the variance in fluid intelligence scores, as compared to less than 5% of variance explained by rest-based models built using the whole sample." This is an intriguing result, particularly given the WM/emotional task dissociation. The replication of the task*gender interaction across the two studies is a strength. My main concern though is that the difference in variance explained when accounting for gender could relate to differences in cohort size. By definition, there must be fewer subjects in the trained datasets for the gender analyses, i.e., because the cohort has been split into two groups. Is it not the case that they will be more prone to overfit, and therefore, could give the illusion of explaining more variance? To be truly convincing, the trained model would really need to be validated by application to a further dataset. I also was curious why only gender was examined as opposed to other factors (handedness, age, education level, etc.).
2) The main finding is that "brain state can be manipulated to better reveal brain-behavior relationships, and that identifying and inducing the right brain state in a given group can improve trait prediction".
This seems sensible to me: g-loaded tasks involve activity/coherence across certain networks, so it makes sense to examine those networks when they are expressing those active/synchronised states. The fact that "Results generalize across conditions and two large, independent datasets" is also a strength. The importance of reproducibility is finally coming to the fore in the imaging field at the moment, which is a good thing. A concern, though, is how well balanced this comparison actually is. Specifically, were the rest and task acquisitions the same duration? I.e., did they have the same acquisition parameters, and were there the same type and number of EPI scans for task and rest? This is an important point to consider. Longer acquisition will allow for a more robust estimate of correlation strength between each pair of nodes. This results in a connectivity matrix that has greater signal vs noise. One should ensure that the rest vs task difference is not simply a consequence of differences in the reliability of these estimates.
3) Relatedly to the above two points, when stating how much variance is actually explained, I would have been more convinced if the headline value was for a cross validation. By this, I mean where the model is trained on HCP data for one task, and then validated with more HCP data for that same task. I don't question that there is a connectivity-g relationship, but without such a step, the headline estimates of actual variance explained are hard to interpret (I note that this is a criticism that can be levelled at a number of other high impact papers, and the authors at least apply a cross validation across studies, albeit with somewhat different tasks/acquisitions).
4) The comparison of rest vs. task seems well balanced and sensible (assuming the above condition regarding number of TRs is met). A question is whether the authors also looked at measures taken of dynamic connectivity relative to steady state (PPI for example), or was the average taken across both rest and task for the task acquisitions? In a way it is critical to use the simple correlational as opposed to PPI approach, as this is more balanced across task and rest conditions, making them more comparable. Nonetheless, an obvious prediction of the authors' hypothesis is that the PPI measures (increased connectivity during task vs rest) should in fact give the best possible explanation of g.
5) I was somewhat confused by the description of supplementary analyses with more robust compensation for movement artefacts. Does this mean that movement still had a component in the connectivity-g relationship for the primary analyses as reported in the main text? If so, then really these supplementary analyses should be presented as the main analyses, and headline statistics altered accordingly. If not, and the minimal data cleaning pipeline was sufficient, then are these not redundant?
6) The observation that models trained on rest data provide stronger prediction of 'g' when applied to task data seems to be a strength, although I was not clear on how generally or consistently this was the case; it would be good to clarify this in the text. It would also be good to clarify that the task and rest acquisitions actually have the same amount of data, as outlined above.

7) Repeating analyses with different parcellation resolutions is a strength of the paper.
8) It is surprising that a leave-one-out approach was applied in such a large dataset; the gold standard would be to use cross validation with train and test sub-groups.
9) I was unclear whether the number of edges that were feature selected in the male and female populations was balanced. Obviously any differences in this would lead to a problem, whereby more edges give the model more degrees of freedom for fitting (and over-fitting) the data.
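Point 8 above contrasts leave-one-out with train/test sub-groups. A minimal sketch of a k-fold variant of a CPM-style pipeline, with feature selection repeated inside each training fold, might look like the following. This is not the authors' implementation: the function name `cpm_kfold`, the fixed |r| threshold (used here in place of a P-value threshold), and the single combined-network linear model are simplifying assumptions made for illustration.

```python
import numpy as np


def cpm_kfold(edges, behavior, k=10, r_thresh=0.2, seed=0):
    """K-fold CPM-style prediction (illustrative sketch, not the published code).

    edges:    (n_subjects, n_edges) array of connectivity strengths
    behavior: (n_subjects,) array of scores (e.g., gF)
    Assumes no edge is constant across the training subjects.
    """
    n_sub = edges.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_sub), k)
    predicted = np.full(n_sub, np.nan)

    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n_sub), test_idx)
        x, y = edges[train_idx], behavior[train_idx]

        # Pearson r between every edge and behavior, training subjects only,
        # so held-out subjects never influence feature selection.
        xz = (x - x.mean(axis=0)) / x.std(axis=0)
        yz = (y - y.mean()) / y.std()
        r = xz.T @ yz / len(train_idx)
        pos, neg = r > r_thresh, r < -r_thresh

        if not (pos.any() or neg.any()):
            # No edge survived selection: fall back to the training mean.
            predicted[test_idx] = y.mean()
            continue

        def network_score(e):
            # Summed positive-network strength minus summed negative-network strength.
            return e[:, pos].sum(axis=1) - e[:, neg].sum(axis=1)

        slope, intercept = np.polyfit(network_score(x), y, 1)
        predicted[test_idx] = slope * network_score(edges[test_idx]) + intercept

    return predicted
```

Because selection is repeated per fold, the number of selected edges can be recorded for each fold and compared between subgroups, which bears on point 9.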
In summary, this is an intriguing and potentially important paper with great potential. I hope that the above suggestions are helpful in strengthening it further.

Taking family structure into account is unlikely to have a dramatic effect in this particular study, but in this reviewer's assessment it is essential in order for the statistics to be on a solid footing. Moreover, since the authors are promoting the CPM approach as a useful tool for other investigators to use, it is all the more incumbent upon them to make the CPM tools more robust for the community at large. The authors might find this URL helpful: https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/PALM/ExchangeabilityBlocks#EBs_for_data_of_the_Human_Connectome_Project
We agree that performing significance tests with clarity and rigor is of utmost importance, and thank the reviewer for pointing out the issue of restrictions on exchangeability. To address this issue, we have implemented the block permutation pipeline of Winkler and colleagues [1,2] for main analyses in both the HCP and PNC data sets. In the latter, one subject out of the 571 was missing family information; because it seems unlikely that this information would have been collected from one family member but not another, we assumed that this subject did not have siblings in the cohort and coded this subject as such. A description of this approach has been added to Methods (Cognitive prediction, p24). Revised P values based on these permutation tests, as well as FDR-corrected q values, are reported in the text (Results, State manipulations improve trait predictions, p6).

2) In Fig. 1, the difference in trait predictability for tasks versus rest is reasonably clear and impressive for most tasks in both HCP and PNC cohorts. The text (p. 6) states that the differences are significant for all but the HCP relational task. However, the significance is borderline (p = .04) for the HCP social vs R1. Besides the aforementioned concern about accounting for family structure, another methodological issue is whether these p values were corrected for multiple comparisons. Item 14 in the Additional Material indicates that this was indeed done; it should also be mentioned in the Methods (I didn't find it).
The P values corresponding to results of main analyses in this manuscript (i.e., the CPM results depicted in Fig. 1a-h and described on page 6 of the main text) were FDR-corrected [3]; in the revised manuscript, in addition to stating overall q values at the beginning of the Results section entitled "State manipulations improve trait predictions" (p6), we have included both P and corrected q values throughout this section. We include both values for clarity and to facilitate comparison to our previous work [4,5], in which only rest data were used, and correction was thus not necessary (i.e., to avoid the illusion that significance of rest-based model prediction is different from that described in our previous work). Because of the sheer number of follow-up and confound analyses, all other analyses were considered to be post-hoc, and P values are thus presented without correction, except as otherwise indicated. Methods (Cognitive prediction, p24) and Results (p5) have been revised to clarify this point.
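To make the relationship between the reported P and q values concrete, a minimal sketch of an FDR adjustment follows. It assumes the Benjamini-Hochberg procedure, which the response does not name explicitly; the function name `bh_qvalues` and the example P values are hypothetical and are not taken from the manuscript.

```python
import numpy as np


def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted P values (q values) for a 1-D set of tests."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest P value by m / i.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: a q value may not exceed that of any larger P value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(scaled, 0.0, 1.0)
    return q


# Hypothetical P values for four model comparisons.
example_q = bh_qvalues([0.01, 0.04, 0.03, 0.005])
```

Reporting both values, as the response proposes, lets a reader see which results remain below a chosen threshold after adjustment.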

3) The contrast between task and rest is all the more impressive in view of the fact that for HCP, the individual tasks were much shorter in duration than the rest scans. The authors later make a point of this (Discussion, p. 14) but should give specifics in the Methods and also indicate whether the apparent differences between tasks are correlated with individual task duration.
We thank the reviewer for pointing out this oversight. We agree that the different durations of the conditions are important, and have included the length of each condition in the Methods sections entitled "Imaging parameters and preprocessing" (p21-22) for both the HCP and PNC data sets. To further explore the potential effects of condition duration on model performance, several additional analyses were performed. First, condition duration was correlated with model performance. For the HCP data, longer conditions yielded models with significantly worse performance (r = -0.78, P < 0.05), but this effect was dominated by the rest runs (correlation of duration and model performance for only task runs: r = 0.23, P = 0.6). For the PNC data, longer conditions yielded models with significantly better performance (r = 0.999, P < 0.05). That task-based models outperformed rest-based models both when task runs were longer than rest runs (PNC) and when task runs were shorter than rest runs (HCP) suggests that this effect is not driven by condition duration. However, to further explore this potential confound, time courses from all conditions within each data set were truncated to include the same number of frames as the shortest condition (HCP: 176 frames; PNC: 124 frames). Connectivity matrices were recalculated and submitted to the CPM pipeline; the pattern of results was largely unchanged (i.e., task-based models outperformed rest-based models, with the gambling task yielding the best-performing model in the HCP data, and the WM task yielding the best-performing model in the PNC data). These analyses are described briefly in Results (Investigation of potential confounds, p8) and more extensively in Methods (Effects of condition duration, p26-27), and corresponding results (100r²) are presented in the Supplementary material.

Much rides on the claims that there are convincing differences across tasks in trait predictiveness. The statement at the bottom of p. 6 that "certain tasks yielding better predictions than others" may be technically correct, but it appears to rest on one leg: HCP gambling is better than relational tasks. While this one contrast appears to be highly significant, it is critical to know how general this is. At this place in the ms, the conclusion seems overstated.
Our intention here was to demonstrate two points: task-based models outperformed rest-based models, and not all task-based models performed equally well. To avoid the many comparisons that would result from exhaustive comparison of each model pair, we chose to demonstrate the first point by comparing only the worst-performing task-based models and the best-performing rest-based model, and the second point by comparing only the best- and worst-performing task-based models in each data set. However, we appreciate that this approach is confusing, and may not adequately demonstrate these two points. In the revised manuscript, we have chosen to focus on the difference between task- and rest-based models' performance, and to this end, have replaced these analyses with a Mann-Whitney U test comparing all task-based models to all rest-based models (pooled across both data sets). This clearly demonstrates that task-based models significantly outperformed rest-based models. These results are described on page 6 ("In both data sets, some tasks yielded better gF predictions than others, but in all cases, task-based models outperformed rest-based models (rank sum = 71, two-sided P = 0.018, Mann-Whitney U test)."), and this revised approach is described in Methods (Cognitive Prediction, p24).

4) The sex differences illustrated in Fig. 3 are fascinating and appropriately come later in the presentation. However, the scatterplots in Fig. 1b-d and f-h offer an intriguing opportunity to plot M vs F subjects in different colors (e.g., red, green, and yellow where they overlap). This might prove to be quite informative without taking up additional figure space; discussion of the sex differences could still be postponed to later in the text.
The points on the scatterplots in Figure 1 are now colored by sex (red for females, and blue for males). For clarity, given the sheer number of subjects and the broad regions of overlap, we chose to use only two colors and increase the transparency of all points, but we welcome additional suggestions about using a third color, should the reviewer feel that this would add helpful information to the figure.

5) p. 7. Another possible control relates to whether the results depend specifically on use of regressing "global" signals (gray and white matter plus CSF; cf. Methods, p. 19). This step forces a zero-mean correlation and is controversial in the field. In the Shen (2017) paper the authors note in passing that there are alternative approaches such as partial correlation and ICA-based spatial denoising (e.g., Smith et al., NN 2015). Ideally, the authors would be able to run partial correlation for at least some of their tests to see if the results differ markedly. At a minimum they should justify their choices and note these potential confounds.
Given evidence that global signal regression (GSR) effectively reduces motion artifact in fMRI [6], we expected model performance to be improved by the inclusion of GSR in our preprocessing pipeline. Nevertheless, we appreciate the controversy surrounding the use of GSR; the most straightforward way to address the impact of this preprocessing choice on functional connectivity patterns and resulting CPM results is to repeat our main analyses using data that have not been subjected to GSR. We did so with the HCP data and found, as expected, that overall model performance decreased, but task-based models still outperformed rest-based models. This analysis is described in Methods (Effects of global signal regression, p27); results (100r²) are summarized in Results (Investigation of potential confounds, p9) and presented in full in the Supplementary material.
We share this reviewer's interest in partial correlation-based calculation of functional connectivity, not only because these approaches obviate the need for GSR, but also because they offer a data-driven method to more directly [7] and sensitively [8] measure connectivity between each node pair. To investigate the effects of connectivity estimation method on the main findings reported in this manuscript, we recalculated connectivity matrices using partial, rather than full, correlation. That is, we calculated the partial correlation between the mean time courses of every node pair, computed without GSR, such that each entry in the resulting connectivity matrix reflects the functional connectivity between the given node pair, controlling for all other nodes' mean time courses.
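The unregularized partial correlation described above can be sketched via the inverse covariance (precision) matrix. This is an illustrative assumption about the computation, not the authors' code; the function name `partial_correlation` is hypothetical, and the sketch deliberately refuses inputs with no more time points than nodes, where the sample covariance is not invertible and regularization would be needed.

```python
import numpy as np


def partial_correlation(ts):
    """Partial correlation between all node pairs, controlling for all other nodes.

    ts: (n_timepoints, n_nodes) array of mean node time courses.
    Uses the unregularized precision matrix, so it needs n_timepoints > n_nodes.
    """
    n_t, n_nodes = ts.shape
    if n_t <= n_nodes:
        raise ValueError("need more time points than nodes without regularization")

    precision = np.linalg.inv(np.cov(ts, rowvar=False))
    d = np.sqrt(np.diag(precision))
    # Partial correlation of nodes i and j: -P_ij / sqrt(P_ii * P_jj).
    pcorr = -precision / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr
```

When the number of time points is only slightly larger than the number of nodes, the covariance matrix is invertible but poorly conditioned, so estimates from a sketch like this can be unstable.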
While partial correlation-based functional connectivity measures have been compellingly validated [8], we note that in many cases, including several of the conditions in these data sets, there are fewer observations (i.e., time points) than nodes. In such cases, regularization is required to estimate partial correlation matrices, which involves the selection of a tuning parameter that controls sparsity, or the degree of regularization. While this parameter is often selected empirically, the appropriate degree of regularization likely depends on the particularities of a given data set, and results have been found to vary substantially as this parameter is varied [9]. To our knowledge, there is as yet no standardized method for selection of this parameter, and a thorough exploration of this issue is beyond the scope of the current work; to avoid the potential pitfalls of arbitrary parameterization, we chose to calculate partial correlation-based functional connectivity only for those HCP conditions that yield full-rank time-by-node matrices (language task, motor task, social task, WM task, rest1, and rest2). Given the range in condition length and the sensitivity of partial correlation to data quantity, we truncated all time courses to the length of the shortest included condition (274 time points). We note that this number of time points is only slightly greater than the number of included nodes, which is likely to cause instabilities in results [9]. Further, to our knowledge, the use of partial correlation-based approaches to calculate task-based functional connectivity remains relatively unexplored. Given these concerns about the approach, we predicted that results would be noisy and unstable, and found this to be the case. These results (100r²) are presented below:
Table columns: Task; feature-selection threshold P < 0.01 (n = 514), P < 0.005 (n = 514), P < 0.001 (n = 514).

6) Edge overlap. Fig. 2 provides a data-rich comparison of edge overlap and node degree within and across models and cohorts. The edge overlap (Fig. 2a) is generally quite modest (~12% is the highest and most are much lower). Of course, there is inherently a lot of noise in these datasets, but it leaves this reviewer uneasy that the spatial patterns of connectivity appear to be pretty low in reproducibility. The edge overlap is a binary measure that entails thresholding and discounting of edge weights. The alternative of estimating overlap using weighted measures without thresholding might in principle have greater sensitivity and should at least be considered and mentioned.
To examine the potential impact of thresholding on quantification of model overlap, we performed an additional analysis in which we compared all edges' correlations with gF across conditions. That is, for each condition, we correlated each edge's strengths across all subjects with gF scores, yielding a single correlation coefficient for the given edge in that condition. We repeated this procedure for all edges and all conditions, yielding a 1 × e vector of correlation coefficients for each condition, where e is the total number of edges. We then correlated these vectors to quantify the similarity of gF-related edge distributions across conditions. PNC correlations ranged from r = 0.37 to 0.51, HCP from r = 0.25 to 0.32, and cross-data set (within condition) from r = 0.12 to 0.19, and the pattern of results was similar to that identified in the overlap analyses (i.e., task-based models demonstrated the greatest similarity both within and between data sets). These results are presented in Supplementary Figure 5 (reproduced below), and the approach is described in Methods (Analysis of anatomic distribution of model edges, p27-28). We note that, by both accounts, this overlap is substantial (particularly given that there are over 30,000 unique edges in the brain), and far greater than would be expected by chance, highlighting the existence of core gF-related circuitry that is differentially perturbed by different tasks.
Supplementary Figure 5. Similarity of edges' correlations with gF between conditions, demonstrating substantial overlap between models both within data sets (PNC data: upper triangle; HCP data: bottom triangle) and between data sets (main diagonal). Values indicate Spearman's correlation coefficients.
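The unthresholded similarity analysis summarized in this figure can be sketched as follows. The sketch assumes Pearson correlation for the edge-by-gF step (the response does not specify which coefficient was used there) and Spearman correlation between the resulting vectors, as stated in the caption. The function names and the dictionary input format are hypothetical.

```python
import numpy as np


def edge_behavior_r(edges, behavior):
    """Pearson r between each edge (column) and behavior; returns a length-e vector."""
    xz = (edges - edges.mean(axis=0)) / edges.std(axis=0)
    yz = (behavior - behavior.mean()) / behavior.std()
    return xz.T @ yz / len(behavior)


def spearman(a, b):
    """Spearman correlation via ranks (assumes no ties, as with continuous data)."""
    ranks_a = np.argsort(np.argsort(a))
    ranks_b = np.argsort(np.argsort(b))
    return np.corrcoef(ranks_a, ranks_b)[0, 1]


def condition_similarity(edges_by_condition, behavior):
    """Similarity of edge-behavior correlation vectors between all condition pairs.

    edges_by_condition: dict mapping condition name -> (n_subjects, n_edges) array,
    with subjects in the same order as `behavior`.
    """
    names = list(edges_by_condition)
    vectors = [edge_behavior_r(edges_by_condition[name], behavior) for name in names]
    sim = np.array([[spearman(u, v) for v in vectors] for u in vectors])
    return names, sim
```

Unlike binary edge overlap, every edge contributes to this measure in proportion to its rank, so no feature-selection threshold has to be chosen.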

7)
Node degree overlap. A striking observation (p. 9 and Fig. 2b, c) is that CN and AN degree vectors are highly correlated. A potential confound is that CNR/SNR is regionally variable and dependent on the head coil and pulse sequence. Higher CNR regions should presumably tend to have higher degree in general, related to data acquisition characteristics, not just neurobiological factors. This issue warrants careful consideration.
This is an interesting and important potential issue that we have investigated in several ways. First, we leveraged previous analyses of temporal SNR (tSNR) in the HCP data 10, and compared the resulting tSNR map (which was reportedly similar across the various tasks) to our maps of node degree. By visual inspection, the maps are quite distinct (e.g., inferior temporal (IT) cortex has low tSNR; in contrast, several hubs are found in IT cortex). Further, if SNR were driving node degree, degree correlation across conditions (Fig. 2b) would be expected to be quite high. Next, as a proxy for SNR at the node level, we computed mean node reliability (i.e., the mean of reliability measured for every edge incident to the given node) in the HCP data 4 and calculated the Spearman's correlation between this measure and node degree for conditions shared across data sets (using a feature-selection threshold of P < 0.01 to facilitate comparison to Fig. 2b,c). The results are presented below, and demonstrate that degree was not significantly correlated with mean node reliability in any condition; insofar as reliability reflects SNR, this further suggests that node degree is not driven by SNR.

8)
Fig. 3, and Table 2. Some of the sex differences for different tasks are impressively large, highly significant, and consistent to some degree across HCP and PNC cohorts. However, the M vs F difference for WM is very large for HCP but quite modest for PNC (and not stated as to whether the latter difference is significant). Similarly, the M vs F difference for emotion is large for the PNC cohort (but significance not stated) whereas there is hardly any difference for the HCP cohort.
We thank the reviewer for raising this important point and apologize for any confusion caused by the presentation of these results. The primary goal of these analyses was to investigate, among the conditions shared across data sets, whether the condition that yields the best gF predictions varies by sex. Overall, we found that CPMs performed better in HCP males than in HCP females ( (males) = 0.32 versus (females) = 0.22). This main effect caused scaling differences that complicate the comparison of model performance for a given task between the sex groups (e.g., WM task-based model performance in males and females, as this reviewer suggests). Instead of pursuing this comparison, then, we chose to focus on the interaction between sex and condition (i.e., does the comparison between WM and emotion task-based models differ by sex?) to demonstrate that, for each group, a different task best perturbs circuitry relevant to the cognitive measure of interest.
On p. 12, para. 3, the authors report 'preliminary evidence for a sex-by-age interaction' However, the data in Table 2
Given our shared concerns about the preliminary and inconsistent sex-by-age interaction in model performance, these results have been removed from the manuscript.

9)
Somewhere in the discussion the authors should discuss whether some of the lower variance explained for rest might be related to differences in attention or arousal states, as the rest state may be associated with reduced arousal, drowsiness, and even overt sleep for part of the scan in some/many subjects.
We consider differences in arousal and attention during rest to be important examples of the unconstrained nature of the resting "state," and have added a brief exploration of this topic to the Discussion (p16): That task-based models outperform rest-based models likely reflects, in large part, the unconstrained nature of the resting state 11. Functional connectivity variability is greater during rest than tasks, a finding that has been suggested to demonstrate increased mind wandering at rest 12; recent experiences and brain states significantly alter patterns of resting-state functional connectivity 13-15; and in contrast to the task-relevant 16, distinct patterns of connectivity identified during task states, resting-state connectivity patterns are better characterized by the joint expression of many states 17. In short, rest is messy, and patterns of functional connectivity derived from it reflect many influences (arousal, attention, high-level processes associated with conscious thought) that remain difficult to measure. Conversely, tasks offer a controlled manipulation of brain state that taps into relevant circuitry 16; any individual differences in this circuitry will be amplified, facilitating the prediction of related traits.

results. Their implicit conclusion that the choice of parcellation and of alignment method doesn't matter much is overstated. Methods that achieve more accurate alignment of functional subdivisions (e.g., the areal-feature-based alignment for cortical structures in the publicly released HCP data) and the associated multimodal parcellation (Glasser et al, 2016) might well yield better predictive power. This issue warrants mention so that readers are aware that alignment quality and parcellation accuracy are important issues that may impact and eventually contribute to improved CPM performance.
We agree that improved registration and parcellation may, in turn, improve model performance.
We clarify in Results (Investigation of potential confounds, p8) and Methods (Effects of parcellation resolution and scan coverage, p26) that the comparison of results generated using the 268-node parcellation to results generated using the 600-node parcellation is solely an investigation of the potential effect of parcellation resolution on model performance. We have also added the following paragraph to the Discussion (p20): 19 , and correspondingly improve CPM performance. The impact of alignment and parcellation approaches on predictive model performance thus remains an important area for future investigation.

11) Image reconstruction confound. p. 23. Another HCP-specific confound that needs to be addressed (but seems to have been overlooked) is that early in the HCP the fMRI image reconstruction was changed to an improved method. See https://wiki.humanconnectome.org/display/PublicData/Ramifications+of+Image+Reconstruction+Version+Differences. It should be straightforward to regress out this confound.
We thank the reviewer for pointing out this important potential confound that we indeed overlooked. To account for this difference in reconstruction algorithm, we performed three additional analyses (Results, Investigation of potential confounds, p7 and Methods, Effects of HCP reconstruction method and quality control issues, p26). First, we excluded those subjects for whom the r227 algorithm was not available, leaving 402 subjects on whom to perform the main CPM analyses. Next, we incorporated algorithm version into the feature-selection (via partial correlation) and model building (via multilinear regression) steps. These analyses are analogous to our exploration of the effects of head motion, sex, and gF measurement technique. CPM results were not substantially affected by these efforts to account for image reconstruction method; given the substantial number of subjects affected and the apparent lack of impact on our main results, we chose not to exclude subjects from analyses reported in the main text on the basis of reconstruction algorithm, but present these results in Supplementary Table 3. As an additional control, we re-ran the main CPM analyses after excluding all HCP subjects (40 in total) with QC Issues B, C, and D (for a description of these issues, see Methods, Effects of HCP reconstruction method and quality control issues, p26) or without gradient-recalled echo field maps. Results (100r²) were largely unchanged, and are presented in Supplementary
Minor comments.
The word "ideal" has been replaced with the phrase "well suited."
The fill has been changed to light gray to ensure that the legend is still clear while enhancing visibility of the bars.

p. 7, para. 2, pp. 20-21, and Table S2: Clarify whether the two versions of the PMAT test refer to just the PNC cohort. Was one of the PNC versions identical to the HCP PMAT24 version?
It has been clarified that the two versions of the Penn Matrix Reasoning Test used to measure Pmat apply only to the PNC data in Results (p7; "because two versions of the Penn Matrix Reasoning Test were used to measure gF in the PNC…"), and a "PNC" label has been added to Supplementary

Fig. 2a, b. Can't read units on scales. Please enlarge font.
The colorbar font has been enlarged.
p. 11 para. 2. The FC measure is not directed, so please use 'patterns of connectivity with' rather than '…..to' here and elsewhere.
"To" has been replaced with "with" when discussing "patterns of connectivity," "connections," and regions that are "connected" with other regions.
Methods. Please state key parameters for HCP and PNC scans: Spatial resolution, multiband factor, TR, scan durations, to spare readers from needing to look these up.
Key imaging parameters have been added to Methods for both the HCP data set (Imaging parameters and preprocessing, p21: "In brief, all fMRI data were acquired on a 3T Siemens Skyra using a slice-accelerated, multiband, gradient-echo, echo planar imaging (EPI) sequence (TR = 720 ms, TE = 33.1 ms, flip angle = 52 degrees, resolution = 2.0 mm³, multiband factor = 8)") and the PNC data set (Imaging parameters and preprocessing, p22: "In brief, all fMRI data were acquired on a 3T Siemens TIM Trio using a multi-slice, gradient-echo EPI sequence (TR = 3,000 ms, TE = 32 ms, flip angle = 90 degrees, resolution = 3 mm³)"). Scan durations for each condition have also been added to these sections.

Reviewer #2 (Remarks to the Author):
"Task-induced brain state manipulation improves prediction of individual traits" is an extension of the authors' previously published study on resting state fMRI fingerprints.

Here they examine whether models built from task fMRI data perform better than those built from resting-state fMRI data. This is a nice direction, but neither the results nor approach is novel, as several previous studies have examined this question (Cole et al and others).
While others have investigated the group-level similarities, differences, and transitions 23-30 among "intrinsic" and task-related patterns of functional connectivity, to our knowledge, the few studies that have used task-based functional connectivity for prediction have focused on the prediction of state variables directly related to the given task (e.g., 31,32). Our work demonstrates not only the existence of meaningful differences between rest- and task-based functional connectivity patterns, but also the utility of these differences to reveal brain-behavior relationships (see Introduction, p4 and Discussion, p15). That is, we show that task data can be used to improve prediction of stable traits, not only task-relevant states, suggesting an opportunity to use specific tasks to make meaningful behavioral and clinical predictions about individuals, and to deepen our understanding of the neural bases of such individual traits. These results thus provide strong motivation for a shift from resting-state functional connectivity to task-based functional connectivity for the study of brain-behavior relationships, using thoughtfully selected tasks appropriate for the traits of interest.
(Finn et al., 2015 NN). Surprisingly, findings from that study are not discussed, although the authors employed almost exactly the same pipeline and investigated the same question "do individual connectivity profiles predict fluid intelligence".

Further, this study has several other major limitations: it is not theoretically motivated, tests no neural hypotheses, and the findings are incremental and inconsistent with respect to a previous publication from the authors.
First, this study is intentionally data driven, a choice that permits identification of gF-relevant circuitry that would likely not be studied in a hypothesis-driven study (see response to Reviewer 2, comment 1 below). Second, the findings of this study are consistent with previous work from our lab 4,5 and others' 33,34 : a significant amount of variance in gF is explained by gF-relevant network strength at rest. We note that the sample used by Finn et al. 5 was smaller than the sample used in the current study, which likely contributed to better rest-based model performance in that study (see response to Reviewer 2, comment 2), but the main finding is replicated here. Finally, we emphasize that we applied connectome-based predictive modeling in this study not to further develop the pipeline, itself, which has already been extensively validated 4,5,31,35 , but rather to demonstrate a key, generalizable point: cognitive tasks, by taxing trait-relevant neural circuitry, amplify individual differences in this circuitry and correspondingly permit more complete and robust characterization of brain-behavior relationships. This point has been clarified in the manuscript (see Discussion, p15).
Overall, the study is weak and findings generally uninterpretable. Other concerns include: (1) There is no theoretical justification for using gambling, motor, social, and emotion tasks to predict fluid intelligence. Gambling task activation turned out to be a strong predictor of gF. I wonder what the interpretation and theory here is.
The incentive processing, social cognition, and emotion processing tasks are complex tasks that activate broad swaths of the brain 10, and it is unsurprising that motor task-based models perform relatively well given this task's robust activation of motor cortices 10 and the finding that edges in the motor network tend to contribute disproportionately to gF predictive models (e.g., Fig. 2d; this is a discovery, we note, that likely would not have been made in a hypothesis-driven study). The consistently high performance of the incentive processing, or gambling, task-based model is perhaps also unsurprising given the high-level cognitive functions into which this task taps (e.g., reward processing, decision-making), the broad distribution of regions (both related and unrelated to reward) it activates 10,36, and the well-documented individual differences in striatal reward response that it elicits 10,37. Again, however, given the extensive literature highlighting the overlap between intelligence- and working memory-related networks (for a summary of this work, see 38), a hypothesis-driven study would likely have overlooked the incentive processing task, limiting our capacity to identify and understand the most effective perturbations for amplification of individual differences in gF-related circuitry. To clarify these points, a statement about the complexity of these tasks and the data-driven nature of this work has been added to Results (p5).
(2) As in Finn et al., the results demonstrate that connectivity profiles can predict fluid intelligence. However, if we compare Finn et al. Fig. 5a with current Fig. 1d, it appears that models built from the previous small sample actually outperform those from the larger sample used here. This suggests that predictive values of connectivity-features (or the effectiveness of the current pipeline) decreases with increasing sample size?
First, we highlight that, despite differences in sample size, feature-selection thresholds, and statistical procedures, this study replicates the core finding of several recent studies, including that of Finn and colleagues 4,5,33 : patterns of functional connectivity at rest can be used to predict individual measures of gF. That this finding generalizes across large, independent samples with different age ranges (i.e., HCP and PNC) is further evidence of its robustness.
Nevertheless, we appreciate that the differences between results in this manuscript and that of Finn et al. may at first appear counterintuitive and confusing. These differences are likely explained, in large part, by the difference in sample size, as this reviewer suggests. The results of Finn et al. were based on a smaller sample (n = 118) than is used in this study (n = 515).
Overfitting is well known to be more likely and problematic when the number of predictors is large and the number of samples is small 39, such that a smaller effect in a larger sample often reflects a more accurate estimate of the true effect size, rather than a failure to replicate a larger effect in a smaller sample. This conclusion, that a larger sample size has allowed us to more accurately estimate the percent of gF variance explained by functional connectivity strength in gF-related networks, is further supported by the consistency of these results with other recent work using patterns of functional connectivity to predict gF 4,33. We have revised the Discussion (p17-18) to clarify this issue.
Moreover, in Finn et al study, just using features from fronto-parietal networks achieved the same performance as using features from the whole brain (Fig. 5a vs. 5c). Although the authors did not test if this is the case in the current study, it appears that visuomotor regions are more highly predictive of fluid intelligence in the current study (Fig. 2d). Does this fit any cognitive theory or models of fluid intelligence? I doubt it.
We apologize for our conflation of the terms "predictive" and "overrepresented" in the manuscript, and note here that the goal of the localization analyses (Results, Model features are spatially distributed and overlapping, p11-13) was to identify regions that were overrepresented in gF-related circuitry; this does not require that these regions and networks be more predictive of gF than others, particularly given that we normalized results (i.e., cells in Fig. 2d) by network size, such that small networks may contribute few edges in absolute terms, but may be relatively overrepresented in a given model. We have replaced the term "predictive" throughout this section of the manuscript to clarify this point.
To demonstrate this distinction, we performed a virtual "lesion" analysis, removing from connectivity matrices all nodes in motor and visual networks; patterns of model performance were largely unchanged, as shown below (results presented as 100r² for each model):

[Table: HCP model performance by task at feature-selection thresholds P < 0.01, P < 0.005, and P < 0.001]

Further, the finding of widely distributed gF-related networks is consistent with an extensive literature documenting the neural underpinnings of gF 40-42 and general intelligence 38,43-47. The overrepresentation of motor and visual regions in predictive models of gF is similarly unsurprising as, in most cases, tasks used in the HCP and PNC data sets are visually complex and require motor responses 10,48. Representations of task strategy, engagement, and performance may thus be expected to be found in visual and motor regions, and it is likely that individual differences in these processes relate to individual differences in gF. Discussion in the main text of how these results fit into previous efforts to characterize the neural underpinnings of intelligence has been expanded (Discussion, p18-19).
In sum, our findings are consistent with an extensive literature documenting the distributed networks underlying gF and reasonable given the nature of the tasks. This study's data-driven approach has allowed us to look beyond oversimplified neural accounts of gF to interrogate this distributed circuitry and access additional insights into meaningful individual differences in it.
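The virtual "lesion" step described above (removing all motor and visual nodes before modeling) can be sketched as follows. This is an illustrative sketch only: the function names, the network labels, and the toy matrices are hypothetical and are not taken from the manuscript.

```python
import numpy as np

def lesion_networks(conn, node_labels, lesioned):
    """Drop every node whose network label is in `lesioned`.

    conn: (n_subjects, n_nodes, n_nodes) connectivity matrices.
    node_labels: (n_nodes,) array of network names (hypothetical labels).
    """
    keep = ~np.isin(node_labels, list(lesioned))
    return conn[:, keep][:, :, keep]

def upper_triangle_edges(conn):
    """Vectorize the unique edges (upper triangle, no diagonal) per subject."""
    iu = np.triu_indices(conn.shape[1], k=1)
    return conn[:, iu[0], iu[1]]

# Toy example: 4 subjects, 6 nodes, symmetric random matrices.
rng = np.random.default_rng(0)
labels = np.array(["motor", "visual", "frontoparietal", "default", "motor", "default"])
raw = rng.normal(size=(4, 6, 6))
conn = (raw + raw.transpose(0, 2, 1)) / 2

lesioned = lesion_networks(conn, labels, {"motor", "visual"})
edges = upper_triangle_edges(lesioned)  # input to the usual modeling steps
```

The remaining edges are then passed through the same feature-selection and model-building steps as the intact matrices, so any change in performance is attributable to the removed nodes.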

(3) Use of a developmental PNC cohort with a large age range is problematic.
It is of course true that there are meaningful differences between the brains of adults and children, but these differences make our results all the more compelling. That is, by using such different samples, we take external validation to its logical extreme: that a model built in adults can be applied to children and adolescents, and vice versa, is a strong endorsement of the generalizability of these models and their relative performance (see Discussion, p18). Further, in both data sets, task-based models outperformed rest-based models, and emotion task-based models outperformed WM task-based models in females while WM task-based models outperformed emotion task-based models in males; this replication in independent, very different data sets indicates that these results are robust and generalizable. We are excited to present such sample-invariant results that can guide future efforts to reveal and study brain-behavior relationships in a wide range of populations.

(4) The authors employed a univariate approach to select features using connectivity features that show a significant correlation with fluid intelligence. This is highly circular and problematic, and a likely reason for the inflated findings in Finn et al. This is now quite problematic as the current results suggest a failure to replicate original Finn et al. findings.
The use of a single measure (here, gF) for feature selection and prediction is neither circular nor problematic in our analyses because we use cross-validation to ensure that data used for training and testing the models are kept strictly separate (see Discussion, p17). That is, features are selected and each model is built using gF scores and edge strengths in n - 1 subjects; this model is then tested in the unseen nth subject. This standard, leave-one-out cross-validation approach to prediction is described in Methods (Cognitive prediction, p23-24). This separation is even more conservative, and the demonstration of generalizability more compelling 49, in the cross-data set analyses (Methods, p25), which show that models, particularly those built using task data, generalize from one data set to another, entirely independent data set (Results, p10-11). These procedures explicitly avoid the "double-dipping" 50 that could inflate findings, as this reviewer suggests, and thus do not explain differences between these results and those presented by Finn et al. 5. We note, again, that this work does replicate the main prediction results of Finn et al., with several sample and analysis differences that likely account for decreased rest-based model performance in this work (see response to Reviewer 2, comment 2).
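The separation of training and test data described above can be sketched as follows. This is a simplified illustration on simulated data, not the published pipeline: the function names are hypothetical, and summarizing selected edges as a single combined strength (positive minus negative) fitted with a one-predictor linear model is an assumption made for brevity. The point is only that feature selection happens inside each fold, on n - 1 subjects.

```python
import numpy as np
from scipy import stats

def edge_pvalues(edges, y):
    """Pearson r and two-sided p-value for each edge (column) against y."""
    n = len(y)
    ez = (edges - edges.mean(axis=0)) / edges.std(axis=0)
    yz = (y - y.mean()) / y.std()
    r = ez.T @ yz / n
    t = r * np.sqrt((n - 2) / (1 - r**2))
    return r, 2 * stats.t.sf(np.abs(t), df=n - 2)

def loo_cpm(edges, y, p_thresh=0.01):
    """Leave-one-out prediction; the held-out subject never informs selection."""
    n = len(y)
    predicted = np.full(n, np.nan)
    for i in range(n):
        train = np.arange(n) != i
        r, p = edge_pvalues(edges[train], y[train])
        pos = (p < p_thresh) & (r > 0)
        neg = (p < p_thresh) & (r < 0)
        if not (pos.any() or neg.any()):
            predicted[i] = y[train].mean()  # fallback if no edge survives
            continue
        # Combined network strength (assumed summary): positive minus negative.
        s_train = edges[train][:, pos].sum(axis=1) - edges[train][:, neg].sum(axis=1)
        s_test = edges[i, pos].sum() - edges[i, neg].sum()
        slope, intercept = np.polyfit(s_train, y[train], 1)
        predicted[i] = slope * s_test + intercept
    return predicted

# Simulated data (hypothetical): 20 of 200 edges carry signal about y.
rng = np.random.default_rng(0)
n_subjects, n_edges = 60, 200
y = rng.normal(size=n_subjects)
edges = rng.normal(size=(n_subjects, n_edges))
edges[:, :20] += 0.8 * y[:, None]

pred = loo_cpm(edges, y)
r_pred = np.corrcoef(pred, y)[0, 1]  # performance; 100 * r_pred**2 gives percent variance
```

Selecting edges once on the full sample and then cross-validating only the regression would leak test information into the model; keeping selection inside the loop is what rules out that form of circularity.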

Reviewer #3 (Remarks to the Author):
Review of: Task-induced brain state manipulation improves prediction of individual traits I read this article with great interest not least because I agree strongly with the central premise, which is that dynamic/task active connectivity should provide a better predictor of cognitive ability than resting state connectivity. This notion makes a great deal of sense, given that we see patterns of large-scale network activity/coherences during the performance of g-loaded tasks. The analyses and results have a number of important strengths; however, there are also a number of key points that I believe if addressed would greatly strengthen the article.
1) The authors state that "By considering task-induced brain state and sex, the best-performing model explains over 20% of the variance in fluid intelligence scores, as compared to less than 5% of variance explained by rest-based models built using the whole sample." This is an intriguing result, particularly given the WM/emotional task dissociation. The replication of the task*gender interaction across the two studies is a strength. My main concern though is that the difference in variance explained when accounting for gender could relate to differences in cohort size. By definition, there must be fewer subjects in the trained datasets for the gender analyses, i.e., because the cohort has been split into two groups. Is it not the case that they will be more prone to overfit, and therefore, could give the illusion of explaining more variance? To be truly convincing, the trained model would really need to be validated by application to a further dataset. I also was curious why only gender was examined as opposed to other factors (handedness, age, education level, etc?)
We chose to examine sex because it is a salient group feature that has received much attention in the human neuroimaging (and, more specifically, functional connectivity) literature. Sex was also a sound choice because it provides a natural measure for dividing the subject pool into only two subgroups, as opposed to other features, such as age or education level, that don't provide such clean separation, or that separate the sample into even smaller subgroups. We present these sex differences both because we believe them to be an important indication that the neural representations of fluid intelligence vary by sex, and, perhaps even more importantly, because they provide proof of the principle that predictive modeling may be most successful when appropriate models are defined for a given group.
Finally, the sex difference work demonstrates that brain state manipulations may not all have the same efficacy for every group. A more complete characterization of the features by which groups should be defined to maximize model performance is certainly of interest, as is a method to identify such features in a data-driven manner, and we consider these to be important questions for future research (see Discussion, p20).
As discussed elsewhere in this response, we share this reviewer's concern about overfitting, and recognize that any decrease in sample size increases the risk of overfitting 39, but we offer several key observations that mitigate this concern in the sex differences analysis. First, even after splitting the samples by sex, the groups are still quite large (each includes over 200 subjects). Second, with greater overfitting, average model performance would be expected to improve; this is not the case when the sample is split by sex (HCP whole sample mean r² = 7.1% versus HCP split sample mean r² = 6.4%; PNC whole sample mean r² = 8.7% versus PNC split sample mean r² = 6.8%). This suggests that overfitting is not more problematic in the sex differences analysis than in the whole-sample analysis. Finally, this analysis demonstrates a meaningful condition-by-sex interaction, but we make no claims about absolute model performance (see response to Reviewer 1, comment 8). If differences in model performance when the sample is split by sex were attributable to overfitting alone, all models would demonstrate comparable changes in performance, as there is no reason to expect that overfitting would affect one condition more than others, and the pattern of model performance across conditions would be preserved and similar in males and females. This is not the case (i.e., the relative performance of emotion and WM task-based models in males is opposite that in females), further evidence that the sex difference in which brain state most improves gF prediction is not the product of overfitting. We do, however, agree with this reviewer that it will be productive and interesting to explore this effect in additional data sets, and we look forward to doing so in the future.
2) The main finding is that "brain state can be manipulated to better reveal brain-behavior relationships, and that identifying and inducing the right brain state in a given group can improve trait prediction".
This seems sensible to me, g-loaded tasks involve activity/coherence across certain networks, it makes sense to examine those networks when they are expressing those active/synchronised states. The fact that "Results generalize across conditions and two large, independent datasets" is also a strength -the importance of reproducibility is finally coming to the fore in the imaging field at the moment, which is a good thing.
We thank the reviewer for these supportive comments.
A concern though, is how well balanced this comparison actually is? Specifically, were the rest and task acquisitions the same duration? I.e., did they have the same acquisition parameters, and were there the same type and number of EPI scans for task and rest? This is an important point to consider. Longer acquisition will allow for a more robust estimate of correlation strength between each pair of nodes. This results in a connectivity matrix that has greater signal vs noise. One should ensure that the rest vs task difference is not simply a consequence of differences in the reliability of these estimates.
Methods (Imaging parameters and preprocessing, p21-22) have been revised to clarify that all fMRI data in a given data set were acquired using the same imaging protocol, and to report relevant imaging parameters 48. We appreciate that condition duration may affect the reliability of functional connectivity measures and, in turn, the success of predictive modeling, and note that, for this reason, the poor performance of HCP rest-based models is particularly impressive given the longer duration of the rest runs (Discussion, p15). To further investigate the potential effects of condition duration on model performance, we performed several additional analyses, described in detail in our response to Reviewer 1, comment 3; we found no clear relationship between condition duration and resulting model performance, and task-based models still outperformed rest-based models after connectivity matrices were recalculated using the same number of frames from each condition. In the manuscript, these analyses are described in Methods (Effects of condition duration, p26-27), and corresponding results are presented in Supplementary Table 6 and summarized in Results (Investigation of potential confounds, p8). We also investigated a potential relationship between reliability at the node level and model node degree, and found no significant relationship (see response to Reviewer 1, comment 7).
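Recalculating connectivity matrices from the same number of frames in each condition can be sketched as follows. This is an illustration under stated assumptions: the frame counts and node count are hypothetical, taking the first frames of each run is an assumed truncation rule (the response does not say which frames were used), and the Fisher z-transform is likewise an assumption.

```python
import numpy as np

def connectivity_matrix(timeseries, n_frames=None):
    """Node-by-node Pearson correlation over the first `n_frames` frames.

    timeseries: (n_frames_total, n_nodes) array of node time courses.
    Returns Fisher z-transformed correlations with a zeroed diagonal.
    """
    ts = timeseries if n_frames is None else timeseries[:n_frames]
    r = np.corrcoef(ts, rowvar=False)
    np.fill_diagonal(r, 0.0)  # avoid arctanh(1) = inf on the diagonal
    return np.arctanh(r)

# Hypothetical runs of unequal length for one subject (frames x nodes).
rng = np.random.default_rng(0)
runs = {
    "rest": rng.normal(size=(300, 10)),
    "wm": rng.normal(size=(200, 10)),
    "emotion": rng.normal(size=(120, 10)),
}

# Truncate every condition to the shortest run before computing connectivity.
min_frames = min(ts.shape[0] for ts in runs.values())
matched = {name: connectivity_matrix(ts, min_frames) for name, ts in runs.items()}
```

With every matrix estimated from the same amount of data, a remaining task-versus-rest difference in model performance is harder to attribute to longer acquisitions yielding more reliable correlation estimates.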
3) Relatedly to the above two points, when stating how much variance is actually explained, I would have been more convinced if the headline value was for a cross validation. By this, I mean where the model is trained on HCP data for one task, and then validated with more HCP data for that same task. I don't question that there is a connectivity-g relationship, but without such a step, the headline estimates of actual variance explained are hard to interpret (I note, that this is a criticism that can be levelled at a number of other high impact papers, and the authors at least apply a cross validation across studies, albeit with somewhat different tasks/acquisitions).
We agree that cross-validation is critical to avoid overfitting and stringently test model generalizability, and clarify that, for this reason, we used a leave-one-out cross-validation approach when training and testing all CPM models, with the exception of the cross-data set analysis, such that data used for training and testing were always kept strictly separate. This ensures that all prediction results presented in the manuscript are cross-validated. These methods are described in full in Methods (Cognitive prediction, p23-24 and Validation of the models, p24-25). We also recognize the benefits and potential pitfalls of leave-one-out and k-fold cross-validation approaches, and have re-run our analyses using k-fold cross-validation for model training and testing. This analysis and corresponding results are discussed below (see response to Reviewer 3, comment 8).
4) The comparison of rest vs. task seems well balanced and sensible (assuming the above condition regarding number of TRs is met). A question is whether the authors also looked at measures taken of dynamic connectivity relative to steady state (PPI for example), or was the average taken across both rest and task for the task acquisitions? In a way it is critical to use the simple correlational as opposed to PPI approach, as this is more balanced across task and rest conditions, making them more comparable. Nonetheless, an obvious prediction of the authors' hypothesis is that the PPI measures (increased connectivity during task vs rest) should in fact give the best possible explanation of g.
We apologize for the confusion regarding our approach and note that we chose to use the simple correlational approach (that is, correlating time courses across entire task runs without accounting for task design) for several reasons: to make models derived from task and rest conditions more comparable, as suggested by this reviewer; to permit identification of stable, core gF-related circuitry in all conditions (including at rest; PPI would ignore any such connections that fail to meaningfully change across conditions, i.e., connections that are part of an "intrinsic" functional architecture [ref. 23]); and to treat each task as a separate and continuous brain state.
We are also interested in the performance of PPI measures, with the caveat, as noted above, that such measures may fail to capture stable gF-related circuitry; nevertheless, this remains an important question for future work, and we look forward to investigating it further to more comprehensively and mechanistically characterize task-induced brain states.
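As an illustration of the simple correlational approach described above, the following Python sketch computes one connectivity value per node pair from a whole-run time series, with no reference to task design. The function name is hypothetical, and the Fisher z-transform is an assumption that is common practice rather than something stated in this response.

```python
import numpy as np

def connectivity_edges(timeseries):
    """Whole-run functional connectivity for one subject and one condition.

    timeseries: array of shape (n_timepoints, n_nodes), one column per node.
    Returns the upper-triangle edges as a 1-D vector.
    """
    # Pearson correlation between every pair of node time courses.
    r = np.corrcoef(timeseries, rowvar=False)
    # Keep each unique node pair once, excluding the diagonal.
    upper = np.triu_indices_from(r, k=1)
    # Fisher z-transform (assumed here, not stated in the response).
    return np.arctanh(r[upper])
```

Applied identically to rest and task runs, this yields directly comparable edge vectors; a 268-node parcellation, for example, gives 268 × 267 / 2 = 35,778 edges per run.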

5) I was somewhat confused by the description of supplementary analyses with more robust compensation for movement artefacts. Does this mean that movement still had a component in the connectivity-g relationship for the primary analyses as reported in the main text? If so, then really these supplementary analyses should be presented as the main analyses, and headline statistics altered accordingly. If not, and the minimal data cleaning pipeline was sufficient, then are these not redundant?
While we employed a conservative threshold for motion exclusion in both data sets (mean frame-to-frame displacement less than 0.1 mm, and maximum frame-to-frame displacement less than 0.15 mm), gF remained correlated with motion in 3 out of 21 runs (Methods, p21-22). Given this, as well as the finding that model predictions were in many cases correlated with mean frame-to-frame displacement, we were concerned that motion could be confounding the primary analyses, as this reviewer suggests. To ensure that this was not the case, we undertook several additional analyses (incorporation of mean frame-to-frame displacement into feature-selection and model building steps; Methods, Effects of head motion, p25-26). We predicted that these analyses would be redundant, indicating that the variance in true and predicted gF that can be explained by motion is relatively non-overlapping with the variance in these measures explained by network strength. This was indeed the case, as model performance was not substantially changed by these additional analyses (Supplementary Table 4). In sum, we agree that these analyses are redundant, but feel that this redundancy is important to highlight, as it addresses concerns that models may be based on or predictive of participant motion, rather than gF; that is, these supplementary analyses indicate that motion is not a meaningful confound in our primary analyses.
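One way to incorporate mean frame-to-frame displacement into feature selection is to relate each edge to gF by partial correlation, controlling for motion. The Python sketch below shows that idea; using partial correlation is an assumption made for illustration (the exact procedure is given in Methods), and the function names are hypothetical.

```python
import numpy as np

def residualize(y, covariate):
    """Remove the linear effect of `covariate` (plus an intercept) from y.

    y may be a 1-D vector or a 2-D array with one variable per column.
    """
    design = np.column_stack([np.ones(len(covariate)), covariate])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

def partial_edge_correlations(edges, scores, motion):
    """Correlation between each edge and the score after regressing motion out of both."""
    e = residualize(edges, motion)
    s = residualize(scores, motion)
    return (e.T @ s) / np.sqrt((e ** 2).sum(axis=0) * (s ** 2).sum())
```

If edge-gF relationships were driven by motion, edges selected this way would differ sharply from those selected by ordinary correlation; similar selections and similar model performance suggest that motion is not driving the result.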
6) The observation that models trained on rest data provide stronger prediction of 'g' when applied to task data seems to be a strength, although I was not clear on how generally or consistently this was the case; it would be good to clarify this in the text. Also, it would be good to clarify that the task and rest acquisitions actually have the same amount of data, as outlined above.
We have clarified in the text that in ten out of twelve tested cases, models trained on rest data and tested on WM task data outperformed models trained and tested on rest data: "In fact, in all but 2 out of 12 tested cases, the rest-based models performed better when applied to the WM data than when applied to the rest data on which they were built" (Results, p10).
See responses to Reviewer 3, comment 2 and Reviewer 1, comment 3 for a discussion of the relationship between condition duration and model performance.

7) Repeating analyses with different parcellation resolutions is a strength of the paper.
We thank the reviewer for this supportive comment.
8) It is surprising that leave-one-out approach was applied in such a large dataset -the gold standard would be to use cross validation with train and test sub-groups.
We appreciate the bias-variance trade-off inherent in selecting a cross-validation approach (ref. 39); to empirically address this concern, we re-ran the main analyses using k-fold cross-validation, with k = 10, for both the HCP and PNC data sets. This analysis approach is described in Methods (Effects of cross-validation method, p27); corresponding results (100 × r²) are presented in Supplementary Table 7, reproduced below, and summarized in Results (Investigation of potential confounds, p8). Of note, these results demonstrate no substantial differences in absolute or relative model performance compared to models generated using a leave-one-out cross-validation approach.
9) I was unclear whether the number of edges that were feature selected in the male and female populations was balanced? Obviously any differences in this would lead to a problem, whereby more edges give the model more degrees of freedom for fitting (and over-fitting) the data.

Feature-selection threshold
We thank the reviewer for this important question, and note that it touches upon several points of great interest to us.
First, we demonstrate in the main analyses that P value-based thresholds and sparsity thresholds for feature-selection yield models with comparable absolute and relative performance (Supplementary Table 1). Additionally, when considering models generated using P value-based feature-selection thresholds, we find that better-performing models include more edges (Supplementary Fig. 2), a finding that lends further support to the idea that task-induced brain states improve gF prediction by amplifying and revealing individual differences in patterns of functional connectivity, such that more edges are significantly related to gF and thus leveraged for its prediction. We therefore suggest that allowing models to include varying numbers of edges is informative, and present analyses using P value-based thresholds in the manuscript and figures (for both whole-sample and sex differences analyses).
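The difference between the two kinds of feature-selection threshold can be sketched in Python as follows. This is an illustration, not the manuscript's code: the function names are hypothetical, a cutoff on |r| stands in for the P value cutoff (for a fixed sample size the two are equivalent, since P is a monotonic function of |r|), and ranking positive and negative edges together is an assumption.

```python
import numpy as np

def threshold_mask(r, r_thresh):
    """Significance-style selection: keep every edge whose |r| exceeds the cutoff.

    The number of selected edges varies with the data.
    """
    return np.abs(r) > r_thresh

def sparsity_mask(r, n_edges):
    """Sparsity selection: keep exactly the n_edges edges with the largest |r|.

    The number of selected edges is fixed in advance.
    """
    strongest = np.argsort(-np.abs(r))[:n_edges]
    mask = np.zeros(r.shape, dtype=bool)
    mask[strongest] = True
    return mask
```

With the first function, a condition that amplifies edge-gF relationships yields more selected edges; with the second, the edge count is held constant across conditions or groups, which is what makes it useful as a control.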
As discussed above, we are generally concerned and careful about potential overfitting, but because models are trained and tested on summary statistics (that is, on unweighted sums of selected edges' strengths), the number of edges included in the model should not change the model's degrees of freedom or the likelihood of overfitting.
Nevertheless, we believe that it is important to employ sparsity thresholds as a control to ensure that the number of edges included in the models does not have any unforeseen effects on model performance. It is for this reason that we used both P value-based and sparsity thresholds in the main analyses, and in the same spirit, we re-ran CPM on males and females separately using sparsity, rather than P value-based, thresholds. Below, we present results (i.e., 100 × r²) for the sparsity thresholds that select numbers of edges closest to those selected using a feature-selection threshold of P < 0.01. For comparison, we also present results of models generated using a feature-selection threshold of P < 0.01 (as presented in the manuscript), and note that the same trends hold (i.e., WM task-based models outperformed emotion task-based models in males, while the opposite was true in females):
In summary, this is an intriguing and potentially important paper with great potential. I hope that the above suggestions are helpful in strengthening it further.

Additional Note
We note one additional change to the manuscript: in the process of revising its contents, we discovered that there were several HCP subjects lacking coverage in 9/268 nodes. As in the PNC data set, we adopted the conservative approach of excluding these nodes from all subjects. We re-ran all analyses with these nodes excluded and revised the manuscript accordingly; there were no substantial changes in revised results or their interpretation.
Once again, we thank the reviewers for their helpful comments, and look forward to continuing to explore many of these important ideas.