Determining the Hierarchical Architecture of the Human Brain Using Subject-Level Clustering of Functional Networks

Optimal integration and segregation of neuronal connections are necessary for efficient large-scale network communication between distributed cortical regions while allowing for modular specialization. This dynamic in the cortex is enabled at the network mesoscale by the organization of nodes into communities. Previous in vivo efforts to map the mesoscale architecture in humans had several limitations. Here we characterize a consensus multiscale community organization of the functional cortical network. We derive this consensus from the clustering of subject-level networks. We applied this analysis to magnetic resonance imaging data from 1003 healthy individuals part of the Human Connectome Project. The hierarchical atlas and code will be made publicly available for future investigators.

Using surrogates of brain activity such as the blood-oxygen-level dependent (BOLD) signal obtained using functional magnetic resonance imaging (fMRI), whole-brain functional networks (i.e., connectomes) can be estimated in vivo. The brain functional connectome is organized at multiple spatial scales, one of which is the mesoscale. It has been recently shown that a full repertoire of functional communities-groups of nodes that are densely connected internally-can be consistently decoded, even at rest. In the brain, these communities are known to represent subsystems and mediate distinct neurophysiological functions (e.g., the brain's visual subnetwork) [1][2][3] . Moreover, this scale is highly sensitive to disease, where several psychiatric disorders have shown selective disruption in particular brain communities [4][5][6][7][8] . This apparent importance has prompted interest in this line of investigation; while a vast wealth of knowledge has been gained from these efforts, several methodological pitfalls remain.
To map the mesoscale architecture of the normal brain, previous studies have generally applied the community detection algorithm on a group-representative network obtained by averaging networks from a group of individuals. However, it is becoming increasingly clear that important features may be lost by such averaging (including some that are present across individuals 9 ), leading to a representation that may not resemble a true central tendency in the group [10][11][12][13][14][15] .
Further, the analysis of networks derived from time series (i.e., correlation networks) is challenging. First, unlike prototypical networks where edges are either present or absent 16 , edges in correlation networks represent the magnitude of statistical association, and so are on a continuum. Second, methods that are commonly used to index association-most commonly Pearson correlation-also produces negative values (anticorrelations). Third, these networks contain numerous indirect dependencies by virtue of the transitivity inherent in correlations [17][18][19] . This is particularly salient in large networks, as these indirect effects may be compounded with higher-order interactions.
Previous investigations have generally made use of thresholds prior to analysis in order to treat functional networks like usual sparse graphs. The rationale is that the strong connections that would be retained would likely be the most relevant and least likely to be artifactual, also leading to the elimination of negative edges. Recently, the relevance of weak network edges has become increasingly recognized, as they appear to convey unique information not encoded in strong edges [20][21][22][23][24] . Further, the choice of the threshold is also challenging and can give rise to heterogeneity in the findings 25,26 . Like weak edges, negative edges (i.e., anticorrelations) have been also shown to have a substantial physiological basis [27][28][29][30][31] .

Materials and Methods
neuroimaging dataset. The data that we use in this work is from the Washington University-Minnesota Consortium Human Connectome Project (HCP) 43 . The latest release at the time of writing was adopted (S1200; March 2017). Details regarding this dataset have been previously published 43,45 . Briefly, it included state-of-the-art whole-brain MRI acquisition with structural, functional, and diffusion-weighted imaging; the scanner was a customized Siemens Skyra 3 T scanner with slice-accelerated sequences for fMRI [48][49][50][51] . Whole-brain functional data were acquired in two sessions. Each session consisted of two phase-encoding directions (left-right and right-left) each a 15 min multiband gradient echo-planar resting-state run (Voxel size = 2.0 × 2.0 × 2.0 mm; TR = 720 ms; TE = 33.1 ms; Flip angle = 52; 72 slices; Bandwidth = 2,290 Hz/pixel; FOV = 208 × 180 mm). Informed consents were obtained from all subjects. The study procedures were approved by the institutional review board at Washington University in St. Louis. All experiments were performed in accordance with relevant guidelines and regulations. Permission was obtained from the HCP to use the Open Access data for the present study, which abides by the Data Use Terms (http://www.humanconnectome.org/data/data-use-terms). The S1200 release includes 1097 subjects with resting-state fMRI scans. Our analysis was restricted to subjects who completed all four resting-state scan runs; namely, two sessions, each with two encoding runs. This resulted in n = 1003 individuals; 534 females and 469 males. The age range was 22 to 37, with a mean of 28.7. first-level processing. The HCP minimal processing pipeline was used 51,52 . Briefly, this included projection to the surface space, 2 mm FWHM smoothing, ICA + FIX denoising with minimal high-pass filtering, and surface registration using MSMall [53][54][55][56][57][58][59] . To define our ROIs, we used a newly-developed multimodal parcellation (MMP) 60 . We chose this parcellation because it is neuroanatomically informed, with data from cortical architecture, connectivity, and topography. This group parcellation consisted of 360 ROIs (180 per hemisphere) that cover the entirety of the cerebral cortex (but does not include the subcortex or cerebellum), which we mapped to the individuals. In the standard surface space, we calculated the mean time series from all voxels in each ROI. Resting-state fMRI time courses from the left-right and right-left phase-encoding runs and the two sessions were concatenated (total duration 60 minutes; 4800 time point). Functional connections between nodes i and j were defined as the Pearson product-moment correlation of their respective time series.
Workflow overview. The first step consisted of extracting the multiresolution network community structure from each individual subject using a variation of the Louvain algorithm (detailed in sections 2.4 and 2.5). To address the near-degeneracy of the algorithm, we used multiple runs and a partition similarity-based framework to pick a stable output (detailed in section 2.6). The resulting outputs from all individuals were used to populate a co-classification matrix. We then used a hierarchical multiresolution method 46 to extract a consensus that is representative of the group (detailed in 2.7). community detection. This consisted of using the Louvain community detection algorithm 34 ; specifically, an iterated version as implemented in the GenLouvain package 61 . This versatile method is based on finding communities with a high degree of intra-and low degree of inter-connectedness. This is achieved heuristically by optimizing the modularity metric Q, which captures the degree of connectedness within communities compared to connectedness expected under a null model 34,36 : where A and P are the observed and null adjacency matrices, respectively; and = ∑ m A ij ij 1 2 is the total strength of the network. The Kronecker δ c c ( , ) i j function is equal to 1 when i and j are in the same community and 0 otherwise. As mentioned in the introduction, classical formulation of P is that of a permutation null model 36 Although this definition was extended to weighted signed networks [37][38][39] , it remains inappropriate for use with networks generated from correlation of time series. Namely because entries in the correlation matrices are not independent, which leads to a bias in community detection 17,18,[40][41][42] . We avoided this by adopting a null model that was specifically designed for correlation-based networks 41 (detailed below). The networks were not thresholded, retaining all weights (both positive and negative).
Methods related to modularity maximization are known to suffer from the resolution limit-the inability to detect communities smaller than a certain size 62 . Several methods have been used to address this 32,63-65 . Here we adopt a recursive hierarchical approach to recover the community structure at multiple scales 46,66 . After detecting the first-level (i.e., coarsest) communities in the initial run of the algorithm, we define each detected community as a separate subgraph and run the algorithm recursively on each. This is repeated until no more statistically significant subcommunities can be detected. This iterative procedure allows the resolution of a hierarchically nested structure. Indeed, this is consistent with notions of hierarchical organization in the cortex from microscopic tracer studies in other mammals 67,68 . We use this hierarchical framework at the subject-level, and a slightly different one at the reclustering stages (see below).
Due to the nearly degenerate outputs of the Louvain algorithm 69 , for each subject-level run, we performed 100 iterations and adopted the partition that is most similar to the ensemble, borrowing the approach from [70][71][72] . The rationale is that similarity to the ensemble can be used as a surrogate for stability, and the most stable partition solution is likely the closest to the ground truth 33,72 . To evaluate partition similarity, we used the z-score of the Rand coefficient (described below) and calculated the pairwise similarity between all 100 partition solutions, selecting for each subject the partition with the highest mean similarity to the ensemble. Since the community detection procedure was applied recursively in a hierarchical manner, 100 iterations and subsequent selection of the representative partition based on similarity was also applied to each subcommunity. This was done in order to avoid adopting an arbitrary solution from each subject.
A random matrix null model for correlation networks. To generate null models that are appropriate for use with correlation networks, we adopted a method based on random matrix theory (RMT). It is described in full detail in the original publication 41 . Briefly, this method builds null networks using a modified Wishart distribution with the same common trend and noise as the observed network, but without the community structure (sum of the random component and the dominant positive component 41 ). This method does not require thresholding and is compatible with negative values. The end result being that positive interactions are maximally concentrated within modules and negative interactions expelled outside.
For the hierarchical recursive application, after identification of a community c in the initial network, we regenerate a null model from time series of nodes belonging to this community, and the community detection is then applied to this subgraph. One criticism of such recursive procedures relates to the fact that there are no clear stopping rules 32 . Here, it is important to mention that with the RMT method, only network components that stand out from the ones predicted by the community-free null model in a "statistically significant" manner are kept in the filtered networks 41,42 . Therefore, when the community detection algorithm is ran on the RMT-filtered networks, all detected communities are statistically significant 41,42 . If no communities are detected, the recursive algorithm stops. partition similarity. In order to measure the similarity between two partitions, we adopted the z-score of the Rand coefficient 70,71 . In the original definition, the Rand coefficient of two partitions is calculated as the ratio of the number of node pairs classified in agreement in both partitions, to the total number of pairs 73 . Similar to Doron et al. 71 , Akiki et al. 6 , Betzel et al. 74 , Bassett et al. 72 , here we use the z-score variant introduced in Traud et al. 70 , which can be interpreted as a measure of how similar two partitions are, beyond what might arise at random 33,70,75 . An analytical formula for the z-score of the Rand coefficient between partition a and b can be written as: where M is the total number of node pairs, M a and M b the numbers of node pairs in the same community in a and b respectively, w ab the number of pairs that figure in the same community in both a and b, and σ w ab the standard deviation of w ab (as in Traud et al. 70 ). For more information, see SI.
Hierarchical consensus reclustering. For the hierarchical consensus reclustering step, we adopted a recently developed method by Jeub et al. 46 . Our multiresolution co-classification matrices embed information from the different hierarchical levels of community structures detected from the clustering results at the level of the subjects' networks. The method developed by Jeub et al. 46 extends the classical formulation of consensus reclustering 47 by allowing a hierarchical multiresolution output and has built-in tests for statistical significance 46 .
Here, the quality function was modularity-like as in Eq. 1, with a "local" variant of the permutation null model. Briefly, in addition to the constraint of the traditional permutation model of a fixed size and number of www.nature.com/scientificreports www.nature.com/scientificreports/ communities, this local model also assumes that node i is fixed and node j is random when calculating α P ( ) ij local . This means that the null network only receives contributions from nodes that are less frequently co-classified together compared to the local permutation model, at a statistical significance level α. More detail can be found in Jeub et al. 46 .
The same recursive principle and Louvain implementation described above were used here. To ensure a stable output, 100 iterations are used at each level (cases of output instability are dealt by meta-reclustering as in Lancichinetti and Fortunato 47 , Jeub et al. 46 ); the threshold for statistical significance is set at α = 0.05. partition task homogeneity. Task fMRI data from the Human Connectome Project span several domains (social cognition, motor, gambling, working memory, language processing, emotional processing, and relational processing) 45,76 . Here we used the effect size activation maps over 86 task contrasts (group average over 997 subjects from the S1200 release).
We adopted the framework of module functional homogeneity 12,77,78 to gauge the quality of partition solutions across the hierarchical levels as another metric independent of partition similarity. Nodes within well-defined modules should, in principle, respond in agreement across tasks. That is, modules should show a high degree of homogeneity (e.g., uniformly activated or deactivated when completing a certain task). Similar to Schaefer et al. 77 , Kong et al. 78 , for each task, we quantify homogeneity by calculating the negative of the standard deviation of nodal activation (Cohen's d) values for each module (the sign was flipped so that lower, more negative values indicate lower functional homogeneity, while greater, less negative values closer to zero indicate higher homogeneity). Of the 86 task contrasts we excluded redundant ones (e.g., the standard deviation from Faces -Shapes contrast is identical to the Shapes -Faces contrast), retaining 48 contrasts in total.
However, because we were to compare partitions of different size and numbers, this simple definition would not be sufficient [79][80][81] . For example, modules with a smaller number of nodes may be more homogeneous. To account for this bias, we standardized it against the null distribution obtained by randomly permuting node assignments to modules 10,000 times (keeping the number and size of the communities constant), and expressed the homogeneity as a z-score. For each hierarchy, the z-scores were averaged over all modules and then over all task contrasts. This resulted in a summary measure how functionally homogeneous each partition solution is, beyond what is expected by the size and number of modules.
Analysis of nodal consistency. To quantify how consistently each nodes is assigned to its community, we calculated the nodal consistency, using the hierarchical consensus partitions as a reference. For each consensus partition level, nodal consistency was calculated by counting the number of times (across subjects) a node is co-classified with other nodes that belong to the same community (using the hierarchical consensus partitions as a reference), divided by the total number of times that the node has been classified (i.e., to same or to a different community).
To better understand the variability in consistency across nodes, we calculated the following: 1) the nodal strength as a measure of network hubness calculated as = ∑ s A i j ij as the strength (hubness) of node i, 2) the signal-to-noise ratio defined as . For simplicity, these were calculated from group-averaged data (average connectivity matrix for hubness, group-representative time series for the S1200 sample 82 for SNR, and group-average task fMRI contrast maps for the task fMRI coefficient of variation). We then used these terms in a multiple linear regression model as predictors of nodal consistency.

Robustness analyses.
In-scanner head motion has been reported to confound the estimation of functional connectivity 83 . For the main analysis, we adopted the strategy proposed by the HCP group 51,52 which includes ICA + FIX 58,59 . We chose not to regress out the "global signal" [mean grey-matter time course regression (MGTR)] as there is concern that the process may distort the correlation structure by shifting the connections to an approximately zero-centered distribution, causing an artifactual increase in computed anticorrelations, and induce a shift in areal boundaries and a distance dependence in functional connections 60,[84][85][86] . The first point is particularly salient in our case as we do not threshold negative connections prior to the community detection analyses. MGTR may therefore spuriously shift (weakly) "positive" corrections into anticorrelations, which would directly impact how they are treated by the RMT-based method, which is based on expelling negative connections outside of modules 41 . However, it has been argued that the standard HCP denoising methods do not fully remove motion artifacts and that regression of the "global signal" remains an effective strategy in reducing the dependence of correlations on motion 27,[87][88][89] . To assess the robustness of our community detection results, we have repeated the main analysis after incorporating MGTR into the preprocessing pipeline. To avoid the influence of "artifactually" induced anticorrelations on the community detection, we removed negative values from the correlation matrices prior to the RMT decomposition [note: an example of the non-thresholded MGTR approach can be found in the accompanying Supporting Information (SI) document]. To compare the consistency between the partitions obtained with and without MGTR, we correlated the co-classification matrices obtained from the subject-level clustering of the two methods, and, for interpretability, the percentage of nodes that differ in the subject-derived consensus partitions(i.e., the Hamming distance). In our analysis, we used a multi-modal cortical parcellation 60 to define the network nodes. To ensure that this did not lead to idiosyncratic results, we have repeated the main analyses with a cortical parcellation by Shaefer et al. 77 which is based solely on function, and may be more functionally coherent (see SI). (2019) 9:19290 | https://doi.org/10.1038/s41598-019-55738-y www.nature.com/scientificreports www.nature.com/scientificreports/ Statistical analysis. Statistical tests for community detection are described above. To compare the similarity measures from the subject-derived vs. group-derived consensus, we used two-tailed permutation hypothesis tests (10,000,000 permutations), and Cohen's d for the effect size. Fisher's combined probability test was used to calculate a summary measure of the combined results from the individual permutation tests (at each hierarchical level) bearing on the same hypothesis.

Results
Mesoscale organization revealed by subject-derived consensus. We first performed the community detection at the level of the individual subjects' scans. Using the multi-modal parcellation (MMP) atlas (180 cortical areas per hemisphere, excluding subcortical structures) 60 , regional fMRI time series were used to generate functional networks and corresponding random matrix null models after appropriate preprocessing (see Materials and Methods). These were then used with the Louvain community detection algorithm. To index the full range of spatial resolutions, we applied the algorithm recursively: each daughter community was treated as a new network and the process was repeated until no statistically significant communities were found under the null model. Near-degeneracy of the Louvain algorithm was addressed by considering 100 runs of the algorithm and picking the most stable output before proceeding to the next hierarchical level (see Materials and Methods).
This resulted in a median of 5 hierarchical levels for each subject (range 1-8). The number of communities across all hierarchical levels ranged between 2 and 134. To identify a representative partitioning for the group based on the information from the subject-level partitioning, we adopted a consensus clustering approach 33,46,47,72 . This method consists of summarizing the outputs of the subject clustering results by quantifying the number of times nodes i and j were assigned to the same partition across subjects and hierarchies and populating a co-classification matrix C ij with these values. This co-classification matrix was then subjected to a recursive clustering algorithm recently introduced for multiresolution consensus reclustering 46 . This resulted in a subject-derived consensus hierarchical tree with 103 levels, ranging from 3 communities at the first hierarchical split to 112 communities at the finest-grained level (Fig. 1a). Thus, the finest branches of the tree contain about 3 nodes (areas).
By means of the methods that we used, all partition solutions in the group hierarchy were statistically significant under the respective null models (see Materials and Methods). To identify partition solutions in the consensus hierarchy that are most expressed at the level of individual subjects, we calculated the average similarity between each hierarchical consensus partition level and the subject-level partition solutions (first averaged within each subject across hierarchies, then across subjects). This allowed us to gauge the "representativeness" of the different levels, with those corresponding to local maxima considered to be of particular importance (Fig. 1b).
As an additional independent partition quality metric, we used a measure of functional homogeneity derived from task fMRI. Interestingly, homogeneity peaks appeared to coincide with the similarity peaks, suggesting a convergence between a partition's "representativeness" and its functional homogeneity across tasks (see SI).
To facilitate the interpretability of the modular organization, here we focus on the first local maximum yielding 6 communities (Fig. 1c). At this level, the organization consisted of an occipital community corresponding to the cortical visual system (visual); a community centered around the central fissure and extending to the transverse temporal gyrus, corresponding to the somatosensory, auditory, motor and supplementary cortices (somatomotor); a community with anterior and posterior midline components (medial prefrontal and posterior cingulate cortices) as well as a middle temporal component, collectively known as the default mode; a community with nodes predominantly in the frontoparietal cortex that is more expressed in the right hemisphere (central executive); a cingulo-opercular community spanning the ventral part of the salience system (ventral salience); and a more dorsal community spanning dorsal parts of the salience system (dorsal salience).
Interestingly, other prominent local maxima appear to have occurred at neurobiologically relevant divisions (Fig. 2). The level yielding 11 communities corresponds to the split of the default mode community into a midline core community and middle temporal lobe community (Fig. 2a). The level resulting in 13 communities represents the split of the auditory community from the somatomotor community (Fig. 2b). The level yielding 19 communities represents the delineation of the language community, more expressed on the left (Fig. 2c). The level yielding 30 communities represents a hemispheric split of the central executive community (Fig. 2d).
Subject-derived consensus partitions are more accurate representation of individuals' mesoscale features. As a control to the subject-level method, here we perform the community detection on a group-representative network. We concatenated the fMRI regional time series from all participants and generated connectivity and corresponding random matrix null networks. For a meaningful comparison with the subject-level method, we iterated the hierarchical algorithm an equal number of times as the subject-level method (1003 times) and used the multiresolution output from these runs to populate a co-classification matrix. Since the first-level clustering here was performed on a single group-representative network, a slight variability in the outputs of the 1003 runs was deemed necessary to achieve a rich co-classification matrix. For this reason, (2019) 9:19290 | https://doi.org/10.1038/s41598-019-55738-y www.nature.com/scientificreports www.nature.com/scientificreports/ the procedure to control for the near-degeneracy was applied at the hierarchical consensus clustering only (see Materials and Methods). Each run resulted with a median of 5 hierarchical levels (range 4-5). We then built a co-classification matrix C ij using the outputs and performed the hierarchical consensus reclustering step. This resulted in a group-derived consensus hierarchy of 101 levels, with a number of communities ranging between 4 and 143 (Fig. 3a). See SI for more details. www.nature.com/scientificreports www.nature.com/scientificreports/ One of the goals of this study was to assess whether applying the clustering algorithms at the subject level followed by meta-reclustering to reach a group consensus (i.e., the method in the previous section) confers a meaningful advantage over applying the clustering algorithm to a group-representative network as is generally done in the literature. For this reason, we calculated the average similarity between the group average-derived consensus partitions (Fig. 3a) and the partitions at the level of the individual subjects. We compared these values with the similarity between the subject-derived consensus partitions and the partitions at the level of the individual subjects (Fig. 3b). Compared to the group-derived consensus partitions, the subject-derived consensus partitions were more similar to the individuals' partitioning throughout the 65 hierarchical levels that are in common (two-tailed permutation tests; = n 1003; 10 −7 < p < 2.073 × 10 −4 ). The mean effect size was = .
partitions are robust to prepossessing methods. Subject-derived consensus partitioning after MGTR yielded 115 hierarchical levels, with an organization that is similar to the main analyses (Fig. 5a). The vectorized upper triangles of the co-classification matrix generated from the data with MGTR and without MGTR were strongly correlated ( = . r 0 9538, < − p 10 307 , = df 64618) (Fig. 5b). Similarly, the partition similarities were high across all hierarchies as evidenced by the z-score of the Rand coefficient (corresponding to < − p 10 307 ), and the percent agreement ranged between 75.56% and 96.11% (mean = 82.20%, SD = 3.49) for the other hierarchical levels (Fig. 5c).
As in the main analyses, the similarity of the consensus hierarchies to the subject-level partitioning as well as the homogeneity peaked at the level of 6 communities (Fig. 5d). This partitioning revealed the same subsystems identified in the main analyses (Fig. 5e), with 93.06% agreement (dashed red line in Fig. 5c).

Discussion
Using subject-level clustering of functional networks and a reclustering framework, we mapped the mesoscale architecture of the cortex at multiple scales. Confirming the study hypothesis, the subject-derived consensus framework yielded results that were more similar to the individual subjects compared to the group-derived consensus, despite the fact that both were obtained from the same initial data. While it is inevitable that any type of  www.nature.com/scientificreports www.nature.com/scientificreports/ acquisitions typically consisted of few minutes (e.g., 5-10 min; TR = 3000 ms), therefore, averaging may have been a good strategy to increase the signal-to-noise ratio. However, in the current study, the acquisitions consisted of 60 min (TR = 720 ms) potentially mitigating this issue.
The implicit assumption that the individual subjects' clustering results are an appropriate reference is reinforced by a body of literature showing that subject-per-subject extraction of community organization better agrees with the individuals' behavioral characteristics 78 . Further, the convergence of the two independent metrics that we used to assess the partition solutions-similarity and task functional homogeneity-gives credence to the notion that the similarity between the consensus partitions and those at the level of the individual is a relevant quality metric (see SI).
Data science approaches are being increasingly implemented to study the relationship between brain connectivity, behavior, and psychopathology. However, a major challenge in this field is the large number of features compared to the number of observations per study. One important aspect of the multiresolution modular atlas that we developed in this study is that it can be used to reduce the number of features in a neuroanatomically informed manner while accounting for the connectome architecture.
Several important neuroanatomical observations can be made. The representative partition solution that yielded 6 clusters (level 4; Fig. 1c) corresponds to a well-described set of functional systems 1,2,93 . The visual system can be seen represented in a large occipital community, ranging from early and higher-order visual cortices to the a b  www.nature.com/scientificreports www.nature.com/scientificreports/ dorsal and ventral streams. Even at the coarsest scale, the visual system comprised a separate community than the somatosensory system-likely, a parallel of the heavy anatomical differentiation in that area and consistent with electrophysiological studies in humans and primates 67 . The somatomotor community included both early and higher somatosensory and motor cortices, and the supplementary motor cortex. Relatively early on in the hierarchy, there is a separation of the auditory system (Fig. 2b). The rest of the somatomotor system remains relatively cohesive. The cluster that spans several midline (medial prefrontal cortex and posterior cingulate cortex) and middle temporal structures, is known in the literature as the default mode, and is known to be involved in task-negative processes 94,95 . At the level of 6 communities (Fig. 1c), the default mode constitutes a single entity, but subsequently branches into a midline cluster (anterior medial prefrontal and posterior cingulate cortices) and a middle temporal cluster (Fig. 2a). This division is consistent both functionally and anatomically; evidence from task-imaging showing that the midline structures are functionally specialized for self-relevant decisions and the inference of other people's mental states (i.e., theory of mind), whereas the temporal components are implicated in autobiographical memory 94 . At the level of 4 communities (Fig. 1a), there is a large "salience" community that is spread over several higher-order associative regions implicated in salience detection and response, which subsequently branches into specialized ventral (also known as the cingulo-opercular system) and dorsal salience systems (Fig. 1c), an organization that has been well described previously 96 . Interestingly, in Fig. 2c, there is branching of a community that is mostly expressed in the left hemisphere and includes Broca's area and the superior temporal gyrus, and bears resemblance to areas of activation during language processing tasks 45,60,97,98 . This community, which has traditionally evaded most 1,2 (but not all 97,98 ) mesoscale investigations, may represent a language system 97,98 . It is notable that this putative language community is substantially more in agreement with the one described in Langs et al. 98 than Spronk et al. 97 , despite the fact that the former used a markedly different analytic approach (voxel-based, clustering in embedding space, etc).
Another interesting observation is the absence of a parallel to the "limbic" network described by Yeo et al. 1 which was localized to the ventral surface of the brain. A possible explanation is that the older data that was used in their study had marked signal dropout at the base of the brain, where this network was identified. The HCP data used in the current study had significant improvements in terms of a reduction in signal dropout from the ventral brain and other regions due to air sinuses; notably the orbitofrontal, inferior temporal, and lateral mid-temporal cortices 99 . In our analyses, parcels in this region were assigned to the default mode and the central executive communities. Another possible explanation for this discrepancy could be the varying definition of what a "cluster" means across studies. Here clusters refer to communities, whereas in Yeo et al. 1 , clusters were identified based on connectivity profiles. A recent study by Spronk et al. 97 , that used data from the HCP and the MMP cortical parcellation, identified an "oribito-affective" community that corresponds to posterior orbitofrontal parts of the limbic network described in Yeo et al. 1 , though the authors note that it had the lowest "confidence score" of community assignment among the identified networks. Much of the nodes in the described orbito-affective community (i.e., posterior orbitofrontal) are included in our central executive community (Fig. 1c).
Disagreements between our community node assignments and those in Spronk et al. 97 could be due to several methodological differences, such as our use of a consensus procedure to avoid a network group average, a multiscale community detection framework, and a null model based on RMT for community detection that is more compatible with functional networks (compared to the more prevalent permutation null models). Because functional networks are estimated using correlation measures, they are subject to indirect effects that manifest themselves as artifactual connections. These indirect effects are due to second-, third-and higher-order interactions that are not present in the real (i.e., direct) network [for example, if nodes 1 and 2 and nodes 1 and 3 are strongly correlated, a correlation will also be observed between node 2 and 3 even where none exists (or, if a true relationship exists, here it would be overestimated)] 17,18 . This often leads to an inaccurate estimation of the network modular structure, if not accounted for by the null model 40,41 . The RMT-based community detection method that we employed mitigates these effects. In a recent study, a permutation null model failed to recover the ground truth community from neurophysiological recordings in the suprachiasmatic nucleus in mice, while an RMT-based null model was successful 42 . Another potential contributor to the discrepancy is the choice of the spatial scale of interest: while the current study characterized a wide range of spatial scales, and opted for a data-driven method to focus on those that are most expressed at the level of individuals in addition to functional homogeneity, Spronk et al. 97 described the organization at the scale yielding 12 clusters, based on a priori "hard criteria" (e.g., a requirement for a separate auditory community), in addition to stability metrics. It is therefore conceivable that exploring different spatial resolutions in their data could show more consistent parallels.
Calculating the nodal consistency revealed that certain nodes were less consistent than others in terms of community assignments (Fig. 4). To ensure that this was not caused by signal dropout in these particular regions, we calculated the SNR from the time course. This analysis did not reveal an association between the nodal SNR and consistency scores. Another potential reason that we wanted to rule out was that low consistency nodes are those that are highly task-specific. To quantify that, we used the group-wise task fMRI activation map contrasts and calculated a measure of how much variability each node is exhibiting in terms of activation across different tasks. That also did not turn out to contribute to the variance of the nodal consistency scores. The other potential explanation was that these nodes play a specific-hub-like-topological role, integrating information from different modules, and because their connectivity profiles is not primarily limited to a single module, they are less consistently classified. While there are numerous metrics that index hubness 90 , we opted for the simplest definition-nodal strength. This hubness metric was found to partially account for the variability in consistency (contributing approximately 7.6% of the variance). Although it only explained a small part of the variance, it remains entirely possible that the nodal strength is insufficient to capture hubness. In fact, indexing hubness in correlation networks is a known challenge and traditional metrics do not rigorously identify important hubs 100 . In fact, one of the alternate definitions proposed in Power et al. 100 , which is based on finding nodes in which multiple modules are represented, is conceptually similar to our consistency metric.
Processing the data with and without MGTR resulted in broadly similar partitions (Fig. 5). We were able to say that the subject-derived consensus partitions resemble individual-level partitions more than the group-derived consensus because both were derived from the same data. However, the same framework could not be used to compare consensus partitions derived with and without MGTR as the comparator is not the same. Despite the fact that MGTR removes some relevant physiological neural data present in the global signal and may alter the properties of the networks especially the negative edges [which we had to remove (see SI)], it is very effective at removing motion-related artifact 87,88 . Evidence that supports "sacrificing" some neural data in favor of a less contaminated signal includes a recent observation that MGTR strengthens the associations between connectivity and behavioral measures 101 . Therefore, we release all atlases without taking a stance regarding which method is superior.
While we used a multimodal gradient-based parcellation because it is believed to be the most neuroanatomically accurate to date 60 , other atlases based on unimodal analyses of functional connectivity may be more functionally-coherent 77,79,102 . We repeated the main analysis using the Schaefer et al. 77 atlas and recovered a broadly similar set of modules (see SI). However, a quantitative analysis 103 was not within the scope of this article, but should be pursued in the future. In this study, we constrained the analysis to the cerebral cortex, though we recognize the importance of extending the mapping of these communities to the subcortex and cerebellum 104,105 . The multimodal brain atlas that we adopted does not include any subcortical or cerebellar nodes. Some groups have added community affiliations for the subcortex or cerebellum on a post hoc basis 97,106,107 . The challenge in our case is to keep the definition of a community consistent. This will likely require the treatment of the nodes of the whole brain at the same time. Another point that should be addressed in the future relates to the fact that we got a varying number of hierarchical levels across subjects (since we did not constrain the number of communities or hierarchical levels); this means that subjects did not all contribute an equal amount of information to the co-classification matrix. The decision to define nodes as brain parcels instead of a voxel-wise analysis was motivated by factors that include using neurobiologically-meaningful building blocks, mitigating MRI-related limitations, and computational intractability 108 . However, voxel-wise investigations (e.g., Barga et al. 13 , Gordon et al. 12 ) are equally important in brain mapping endeavors as they naturally provide a finer level of granularity. In this study, we used the RMT null model 41 . Another null model that can be used with correlation networks without violating their properties is known as the uniform null model 40,74 . However, it requires the use of a tunable parameter, does not address the global mode (i.e., common trend), and can lead to a large number of singleton communities 40,42,109 . Finally, other mesoscale organizations should also be explored, such methods include stochastic block modeling 110 , overlapping communities 111 , and approaches that allow for varying spatial configuration of functional brain regions 112 .

Data availability
The dataset used in the study is from the Human Connectome Project, S1200 Release, available in the Human Connectome Project repository (http://www.humanconnectome.org). The code and hierarchical brain maps are available on GitHub (https://github.com/emergelab/hierarchical-brain-networks).