Element-centric clustering comparison unifies overlaps and hierarchy

Clustering is one of the most universal approaches for understanding complex data. A pivotal aspect of clustering analysis is quantitatively comparing clusterings; clustering comparison is the basis for many tasks such as clustering evaluation, consensus clustering, and tracking the temporal evolution of clusters. In particular, the extrinsic evaluation of clustering methods requires comparing the uncovered clusterings to planted clusterings or known metadata. Yet, as we demonstrate, existing clustering comparison measures have critical biases which undermine their usefulness, and no measure accommodates both overlapping and hierarchical clusterings. Here we unify the comparison of disjoint, overlapping, and hierarchically structured clusterings by proposing a new element-centric framework: elements are compared based on the relationships induced by the cluster structure, as opposed to the traditional cluster-centric philosophy. We demonstrate that, in contrast to standard clustering similarity measures, our framework does not suffer from critical biases and naturally provides unique insights into how the clusterings differ. We illustrate the strengths of our framework by revealing new insights into the organization of clusters in two applications: the improved classification of schizophrenia based on the overlapping and hierarchical community structure of fMRI brain networks, and the disentanglement of various social homophily factors in Facebook social networks. The universality of clustering suggests far-reaching impact of our framework throughout all areas of science.


Introduction
Clustering is one of the most basic and ubiquitous methods to analyze data. Classically, clustering is viewed as separating data elements into disjoint clusters of comparable sizes [1,2]. However, complications to this simplistic picture are becoming more prevalent, particularly given the rise of network science and nuanced clustering methods that reveal heterogeneous cluster size distributions [3,4], overlaps [5,6,7,8], and hierarchical structure [9,10,11,12] (see Figure 1a). These generalizations present new challenges for clustering comparison [13,3] and render current methods susceptible to critical biases [14,15,3,16,17]. In addition to the consistent grouping of elements into clusters, similarity measures must account for many other aspects of clusterings, such as the number of clusters, the size distribution of those clusters, multiple element memberships when clusters overlap, and scaling relations between levels of hierarchical clusterings.
Despite the increasing prevalence of irregular cluster features, the effect of such structure on clustering similarity has received little attention. Here we illustrate that all of the most popular clustering similarity measures are vulnerable to critical biases, calling into question the appropriateness of their general usage. We also argue that these biases are maintained or exacerbated by extensions to accommodate overlapping or hierarchical clusterings [18,19,20], suggesting that none of the existing frameworks for clustering similarity are adequate for comparing overlapping and hierarchically structured clusterings.

Element-centric clustering comparisons
Here we propose a new element-centric framework that not only addresses the common biases, but also naturally incorporates overlaps and hierarchy. In our approach, elements are compared based on the relationships induced by the cluster structure, in contrast to the traditional cluster-centric philosophy. As we will see, this change in perspective resolves many of the aforementioned difficulties.
Our approach captures cluster-induced relationships between the elements through the cluster affiliation graph, which is a bipartite graph where one vertex set corresponds to the original elements and the other corresponds to the clusters (see Figure 1b, Methods and Supplemental Information, SI, section S3). It naturally incorporates overlaps with multiple edges, and hierarchy with weighted edges. The cluster affiliation graph is then projected onto the element vertices to produce the cluster-induced element graph, which is a weighted, directed graph that summarizes the inter-element relationships induced by common cluster memberships [21] (see Figure 1c and Methods).
The traditional notion of pair-wise co-occurrence in a cluster is captured by the presence of an edge in the cluster-induced element graph. However, the focus on element pairs misses high-order relations (triplets, quadruplets, etc.) which are useful for characterizing cluster structure [22]. Such high-order co-occurrences can be captured through the presence of paths in the cluster-induced element graph. The weight of the path accounts for the relative importance of elements in the presence of overlapping and hierarchical cluster structures. Here, we incorporate every possible path between elements by obtaining the equilibrium distribution for a personalized diffusion process on the graph (often called "personalized pagerank" or "random walk with restart") [23,24,25]. A similarity score is calculated for each element as the corrected L1 metric distance between these discrete probability distributions; the final similarity score between the two clusterings is the average of the element-wise scores (Figure 1d and Methods). As illustrated in Figure 1, our element-centric framework unifies the comparison of disjoint, overlapping, and hierarchical clusterings.
Figure 1. a, Three examples of clusterings: a partition, a clustering with overlap, and a clustering with both overlapping and hierarchical structure. b, Cluster affiliation graphs derived from the overlapping and hierarchical clusterings. c, Cluster-induced element graphs found by projecting the cluster affiliation graphs in b to the element vertices. d, The element-affinity matrices found as the personalized pagerank equilibrium distribution. e, The corrected L1 metric distance between each affinity distribution in d gives an element-wise similarity between clusterings; the average element-wise similarity provides the final clustering similarity score. f, A binary hierarchical clustering is compared to each of its individual levels. g, The hierarchical scaling parameter for element-centric similarity acts as a "zooming lens", refocusing the similarity to different levels (1-4) of the hierarchical comparison in f.
Beyond naturally accommodating generalized clusterings, our element-centric similarity can provide detailed insights into how two clusterings differ because the similarity is calculated at the level of individual elements. Simply examining individual element-wise scores reveals how consistently each element is grouped across clusterings. The rank-distribution of element-wise scores reflects the elements' relative contributions to the total similarity: a flat distribution suggests the clusterings differ equally across all elements, while a skewed distribution suggests the clusterings are distinguished by a subset of elements (see SI, section S3.4). Additionally, the measure can be averaged over the pair-wise comparisons within a set of clusterings. The element-wise agreement is revealed by the average of these element-wise scores over comparisons between uncovered clusterings and a reference clustering (SI, section S3.6). The element-wise scores can also be averaged over all pair-wise comparisons within the set of uncovered clusterings, revealing the frustrated elements that cannot be consistently clustered.
Our element-centric framework is flexible and allows several choices to accommodate alternative interpretations. For example, our choice of hierarchical weighting function and the scaling parameter, r, reflects a continuum in the hierarchy (Figure 1g): lower r emphasizes higher levels and reflects a divisive hierarchy, in which lower levels of the dendrogram are treated as refinements of the higher levels, while larger r puts emphasis on lower levels and reflects an agglomerative hierarchy, in which higher levels of the dendrogram are seen as a coarsening of the lower level cluster structure. Other interpretations of hierarchy can be implemented by changing the specific hierarchical weighting function. Moreover, our choice of L1 comparisons between personalized pagerank distributions, which is based on a principled extension of element co-occurrence, can be replaced by another measure of graph similarity or probability metric with an alternative intuition of the trade-offs associated with clustering similarity. In fact, several common clustering similarity measures can be recovered by adapting other choices of graph similarity; for example, choosing the graph-edit distance between the two graphs induced from disjoint partitions reduces our measure to the Rand index.

Bias in clustering comparisons
Our element-centric similarity measure is the only clustering similarity method to follow our common-sense expectations and avoid critical biases when comparing generalized clusterings.
We demonstrate such biases by constructing a parametrized family of synthetic clusterings and observing the behavior of clustering similarity measures (listed in Figure 2d and SI, section S4).
In the first example, the consistent grouping of elements is tested by comparing a clustering with equally sized clusters against itself after a fraction of element memberships have been shuffled between clusters (Figure 2a). Intuition suggests that as the randomization increases, the similarity between the original clustering and the shuffled clustering should decrease from the maximum value (1.0 in all cases) to some non-zero value, reflecting the fact that the number and sizes of clusters are still identical. However, two measures reach zero, ignoring the similarity of the cluster size sequences. The overlapping normalized mutual information (ONMI) [19] is particularly conservative, reporting no similarity at just over 50% randomization; ONMI's surprising behavior highlights the difficulty of accommodating overlaps in a traditional similarity framework.
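The shuffling step of this first experiment can be sketched as follows. This is only an illustration under an assumed scheme: the labels of a randomly chosen fraction of elements are permuted among themselves, which keeps the number and sizes of clusters fixed. The function name `shuffle_memberships` and the fixed seed are hypothetical, and the procedure actually used for Figure 2a may differ.

```python
import random


def shuffle_memberships(labels, fraction, seed=0):
    """Permute the cluster labels of a random `fraction` of the elements.

    Labels are only exchanged among the chosen elements, so the cluster
    size sequence of the result is identical to that of `labels`.
    """
    rng = random.Random(seed)
    n = len(labels)
    chosen = rng.sample(range(n), round(fraction * n))
    moved = [labels[i] for i in chosen]
    rng.shuffle(moved)
    shuffled = list(labels)
    for i, label in zip(chosen, moved):
        shuffled[i] = label
    return shuffled


# Four equally sized clusters of eight elements each (assumed toy input).
original = [k for k in range(4) for _ in range(8)]
shuffled = shuffle_memberships(original, fraction=0.5)
```

Sweeping `fraction` from 0 to 1 and comparing `original` with each shuffled copy gives the kind of curve described for Figure 2a.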
The second example explores the bias favoring skewed cluster size sequences. Starting from an initial clustering with regularly sized clusters, we generate new, shuffled clusterings through a preferential attachment shuffling scheme (Figure 2b and SI, section S4). Intuition suggests that as the entropy of the cluster size sequence decreases (reflecting an increase in the cluster size heterogeneity), the two clusterings should become less similar. However, four similarity measures increase as the entropy of the cluster size sequence decreases.
Finally, we investigate a scenario where the number and sizes of clusters in two clusterings diverge (Figure 2c). This extreme case captures the bias of information theoretic measures towards comparisons with many clusters. Normalized mutual information (NMI) reports larger similarity if we simply increase the number of clusters in one of the clusterings.
These three examples suggest that the most common measures are subject to critical biases which render them inappropriate for comparing generalized clusterings; only our element-centric similarity measure displays the intuitive behavior in all examples. An extended analysis, additional examples, and additional measures (such as the variation of information, VI) are given in the SI, section S4.

Element-centric comparisons reveal insights into how K-means clusterings differ
Beyond serving as a global measure of clustering similarity, our element-centric similarity also provides detailed insights into how clusterings differ, in contrast to other measures. Consider an illustrative example from K-means clustering shown in Figure 3a; 19 clusters were randomly placed in a square with a randomly selected arrangement (Gaussian blob, anisotropic blob, circle, or spiral) and size (see SI, section S5.2). K-means has difficulty when the predefined clusters overlap or when they are circularly arranged [26]. This difficulty can be explicitly quantified by calculating the average element-wise similarity between the predefined clustering and 100 uncovered clusterings (Figure 3b). We then calculate the frustration: the average of all pair-wise comparisons between the 100 uncovered clusterings reveals data points that are consistently grouped into similar clusters or are assigned to drastically different clusters.
We also present a real-world example of handwriting recognition [27] (Figure 3d and SI, section S5.3). The same procedure reveals that some clusters of digits are correctly and consistently identified ("0"), while the error mostly results from incorrect grouping of other digit clusters ("9", "8", and "1"; Figure 3e). Element-wise clustering frustration shows that there are some digits which cannot be consistently classified ("3" and "8", Figure 3f), while some errors are regularly made ("1" and "9"). The extreme examples of these two types of error are shown in Figure 3g.

The convolution of meta-data in social networks
We now use our framework to explore the community structure of college friendship networks on Facebook. Previous research has suggested that friendship networks at major universities are organized into clusters which reflect the graduation year, dormitory, or student major [28,29]. However, the details of the organizing principles underlying this similarity are unknown. Here we demonstrate and visualize how multiple attributes interact and contribute to community structure. We first derive clusterings in binary friendship networks using the Louvain method (see SI section S5.4) and compare these to the aforementioned self-declared user attribute clusterings. Element-wise similarity reveals that school year closely captures the modular structure for most of the network, particularly for the students in early years, while students' major gradually takes over the cohort-based connections (Figure 3h,i red arrows). This result, which has only become straightforward through our framework, supports the intuition that network structure results from the convolution of multiple attributes [30].

Element-centric comparisons of overlapping and hierarchical clustering in brain networks
Finally, to illustrate the utility of our element-centric similarity measure, we demonstrate its ability to capture meaningful differences in clustering structure by classifying schizophrenic individuals based on the overlapping and hierarchical community structure of resting-state fMRI brain networks. There are several known distinctive and interpretable properties of resting-state fMRI brain networks in schizophrenia, but their classification utility is limited, with accuracies between 75% and 80% [31,32,33]. Network communities, in particular, are hypothesized to capture functionally integrated modules in the brain that reflect key properties of schizophrenia [31]. Here we demonstrate that employing our measure to compare communities derived from functional brain networks can improve the classification accuracy significantly. We extract communities with overlapping and hierarchical structure using OSLOM community detection [34] from the functional brain networks of 48 subjects (29 healthy controls and 19 individuals diagnosed with schizophrenia) analyzed in a previous study [32] (see SI, section S5.1 for details). The similarity between each pair of the subjects' hierarchical and overlapping clusterings was found using our element-centric similarity measure, producing a 48 × 48 similarity matrix (Figure 4a). This similarity matrix was then used in conjunction with a weighted k-nearest neighbors classifier to perform a binary classification of subjects as either schizophrenic or healthy controls (SI, section S5.1). Evaluated by a nested 10-fold cross-validation procedure, our approach achieves an average accuracy of 84%, outperforming other measures (ONMI) and state-of-the-art results (Figure 4b). Note that classification based on individual levels from the hierarchy does not perform as well as the method using the full hierarchy.
Figure 3 (panels d-i). d, The labeled handwritten digit data projected using t-SNE dimensionality reduction for visualization. e, The average element-wise similarity between the labels and 100 K-means clusterings. f, The average element-wise similarity between 100 K-means clusterings. g, Exemplar digits that are consistently grouped as in the ground-truth clustering, consistently clustered differently from the ground-truth clustering, least frustrated, and most frustrated. h,i, Facebook friendship networks for h College A and i College B. The element-wise similarity between user affiliation to school year, dorm, and major compared to Newman's modularity optimized by the Louvain method demonstrates that social networks can be organized by a convolution of different attributes (black vs red arrows). The similarity to school year attenuates with student's status (1st year to 4th year, orange arrows).
Our element-centric clustering similarity measure also provides insights into which brain regions are consistently clustered within groups. To find such group differences, we consider the element-centric similarity between all healthy controls, and the element-centric similarity between all schizophrenic patients. As seen in Figure 4c, the difference between the means of these two groups highlights several regions which are consistently clustered into similar functional modules in the healthy controls or schizophrenic patients. In particular, regions of interest (ROIs) located in the Fusiform gyrus (Brodmann Area 37) were consistently clustered in the healthy controls but displayed great variability in cluster structure for the schizophrenic patients (verified with a Bayesian difference of means test, see SI section S5.1). This result is corroborated by the fact that the Fusiform gyrus has previously been associated with abnormal activation in schizophrenia during semantic tasks [35,36].

Summary and discussion
In summary, we present an element-centric framework that intuitively unifies the comparison of disjoint, overlapping, and hierarchically structured clusterings. We have shown that our element-centric similarity does not suffer from the common counter-intuitive biases of existing measures, and that it also provides insights into how clusterings differ at the level of individual elements.
Our framework suggests straightforward extensions to more complex scenarios, such as soft or fuzzy clusterings, hierarchical clusterings specified by dendrograms with merge distance information, and hyper-graph similarity. The framework also provides a measure of pair-wise similarity between elements, akin to the nodal association matrix of Bassett et al. [37], and an element-wise clustering similarity which summarizes the difference in relationships induced by overlapping and hierarchically structured clusterings from the perspective of individual elements. Both of these objects hold promise for use in clustering ensemble methods [38,39].
As clustering methods advance to uncover more nuanced and accurate organizational structure of complex systems, so too should clustering similarity measures facilitate meaningful comparisons of these organizations. The element-centric framework proposed here provides an intuitive quantification of clustering similarity that holds great promise for uncovering the relationships amongst all types of clusters, such as network communities, ontogenies, and dendrograms. The ubiquity of clustering in all areas of science suggests the extensive potential impact of our framework.

Graph representation of clusterings
Given a clustering $C$ of $N$ elements $E = \{v_1, \ldots, v_N\}$ into $K_C$ clusters, the cluster affiliation graph is an undirected bipartite graph where one vertex set corresponds to the elements, the other corresponds to the clusters, and a weighted edge exists between a cluster and each of its elements. For hierarchically structured clusterings, each cluster $c_\beta \in C$ is assigned a hierarchical level $l_\beta \in [0, 1]$ by rescaling the hierarchy's acyclic graph (dendrogram) according to the maximum path length from the roots [40]. The weight of the cluster affiliation edge is given by the hierarchy weighting function $h(l_\beta)$:
$$h(l_\beta) = e^{r l_\beta},$$
where $r$ is a scaling parameter that determines the relative importance of membership at different levels of the hierarchy. The cluster-induced element graph is formed by projecting the cluster affiliation graph (with $N \times K_C$ bipartite adjacency matrix $A$, with entries $a_{i\beta}$) onto the element vertices, resulting in a directed graph with the edge $w_{ij}$ between elements $v_i$ and $v_j$ having weight:
$$w_{ij} = \sum_{\gamma} \frac{a_{i\gamma}\, a_{j\gamma}}{\sum_{k} a_{k\gamma} \, \sum_{\beta} a_{i\beta}}.$$

Personalized PageRank affinity
Given a cluster-induced element graph with weighted adjacency matrix $W$, the personalized PageRank (PPR) affinity from element $v_i$ to all elements $v_j$ is found as the stationary distribution $\mathbf{p}_i$ of a diffusion process with stochastic matrix $\Pi_i$ and restart probability $1 - \alpha$ to $v_i$:
$$\mathbf{p}_i = \mathbf{p}_i \Pi_i = \alpha\, \mathbf{p}_i W + (1 - \alpha)\, \mathbf{e}_i,$$
where $\mathbf{e}_i$ is the indicator (row) vector of element $v_i$. The value of $\alpha$ controls the influence of overlapping clusters and hierarchical clusters with shared lineages; here we use $\alpha = 0.90$. The above equation can be exactly solved for partitions: the affinity value for each co-clustered element pair is inversely proportional to the cluster size, and 0 otherwise. For clusterings with overlaps or hierarchy, several algorithms are available to quickly approximate the PPR affinity [41]. See the SI, section S3 for further comments about implementation.
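As a small illustration of the PPR affinity, the sketch below iterates the update $\mathbf{p}_i \leftarrow \alpha\, \mathbf{p}_i W + (1 - \alpha)\, \mathbf{e}_i$ until it stops changing. It assumes that $W$ is row-stochastic and that the restart has this standard form; the function name `personalized_pagerank` and the toy partition are hypothetical, and plain power iteration is used here rather than any of the faster approximation algorithms cited above.

```python
def personalized_pagerank(W, i, alpha=0.9, tol=1e-12, max_iter=10_000):
    """Stationary PPR distribution for restarts at element i (power iteration).

    W is assumed to be a row-stochastic matrix given as a list of lists.
    """
    n = len(W)
    p = [0.0] * n
    p[i] = 1.0
    for _ in range(max_iter):
        # One diffusion step along the graph, damped by alpha ...
        new = [alpha * sum(p[k] * W[k][j] for k in range(n)) for j in range(n)]
        # ... plus the restart mass returned to element i.
        new[i] += 1.0 - alpha
        if sum(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
    return p


# Toy partition {0, 1, 2}, {3, 4}: within a cluster of size n every edge
# of the cluster-induced element graph has weight 1/n (self-loops included).
clusters = [[0, 1, 2], [3, 4]]
W = [[0.0] * 5 for _ in range(5)]
for cluster in clusters:
    for u in cluster:
        for v in cluster:
            W[u][v] = 1.0 / len(cluster)

p0 = personalized_pagerank(W, 0)
```

For this partition the result agrees with the closed form: each co-clustered element receives $\alpha / 3 = 0.3$, and element 0 itself receives an additional $1 - \alpha = 0.1$.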

Element-centric similarity
The element-wise similarity of an element $v_i$ in two clusterings $A$ and $B$ is found by comparing the stationary probability distributions $\mathbf{p}^A_i$ and $\mathbf{p}^B_i$ induced by the PPR processes on the two cluster-induced element graphs. Here, we use the normalized L1 metric for probability distributions corrected to account for the PPR process:
$$S_i(A, B) = 1 - \frac{1}{2\alpha} \sum_{j=1}^{N} \left| p^A_{ij} - p^B_{ij} \right|.$$
The final element-centric similarity score $S(A, B)$ of two clusterings $A$, $B$ is the average of the element-wise similarities:
$$S(A, B) = \frac{1}{N} \sum_{i=1}^{N} S_i(A, B).$$
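For two partitions the whole calculation can be sketched in a few lines. The code assumes the closed-form PPR affinity for partitions (weight $\alpha / n$ on each member of an element's cluster of size $n$, plus the restart mass $1 - \alpha$ on the element itself) and the corrected L1 form $S_i = 1 - \frac{1}{2\alpha} \sum_j |p^A_{ij} - p^B_{ij}|$; the function names and the two example label vectors are hypothetical.

```python
def partition_affinity(labels, i, alpha=0.9):
    """Closed-form PPR affinity of element i for a partition (assumed form)."""
    members = [j for j, label in enumerate(labels) if label == labels[i]]
    p = [0.0] * len(labels)
    for j in members:
        p[j] = alpha / len(members)
    p[i] += 1.0 - alpha  # restart mass on the element itself
    return p


def element_scores(labels_a, labels_b, alpha=0.9):
    """Element-wise similarities between two partitions of the same elements."""
    scores = []
    for i in range(len(labels_a)):
        pa = partition_affinity(labels_a, i, alpha)
        pb = partition_affinity(labels_b, i, alpha)
        l1 = sum(abs(x - y) for x, y in zip(pa, pb))
        scores.append(1.0 - l1 / (2.0 * alpha))
    return scores


def element_centric_similarity(labels_a, labels_b, alpha=0.9):
    """Average of the element-wise similarities."""
    scores = element_scores(labels_a, labels_b, alpha)
    return sum(scores) / len(scores)


labels_a = [0, 0, 0, 1, 1, 1]
labels_b = [0, 0, 1, 1, 1, 1]  # element 2 has moved to the second cluster
scores = element_scores(labels_a, labels_b)
similarity = element_centric_similarity(labels_a, labels_b)
```

In this example the element that changed cluster (element 2) receives the lowest score, 0.25, while the overall similarity is $23/36 \approx 0.64$; the element-wise scores thus locate where the two partitions disagree.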

Datasets
Details on the synthetic clusterings, fMRI brain networks, K-means point, handwriting, and Facebook social network datasets can be found in SI Text.

S1 Clusterings
Throughout this work, we are focused on the grouping of elements (i.e., data points or vertices) into clusters (the groups). The set of clusters is called a clustering. Specifically, given a set of $N$ elements $V = \{v_1, \ldots, v_N\}$, a clustering $C$ is a set of $K_C$ clusters, each of which is a non-empty subset of $V$. We consider three classes of clusterings. A partition is a clustering in which all elements are members of one, and only one, cluster. An overlapping clustering allows elements to be members of multiple clusters. Hierarchical clusterings capture the nested organization of clusters at different scales and are accompanied by a directed acyclic graph (or dendrogram) showing the hierarchical relationships between clusters.
The rest of this paper focuses on the similarity of two clusterings over the same set of $N$ labeled elements, $A = \{A_1, \ldots, A_{K_A}\}$ (with $K_A$ clusters of sizes $a_i$) and $B = \{B_1, \ldots, B_{K_B}\}$ (with $K_B$ clusters of sizes $b_i$).

S2 Existing measures of clustering similarity
The clustering similarity measures can be roughly categorized into three classes [42]. The first class counts the pairs of elements co-assigned to the same cluster in both clusterings; Albatineh et al. [43] provide a list of 22 such clustering similarity measures based on pair counting. The second class identifies clusters which constitute a best match between the two clusterings [44]; examples include the maximum matching statistic [44], and the maximum matching ratio [45]. The third class captures the amount of information which exists about the cluster membership of a randomly selected element; examples include the mutual information and its normalized variants [46,47], as well as the variation of information [42]. Here, we focus on five of the most prominent measures from the clustering literature: the adjusted Rand index, the Jaccard index, the F measure, normalized mutual information (NMI), and overlapping normalized mutual information (ONMI).

S2.1 Rand Index
The Rand index [48] counts the number of element pairs which are either members of the same cluster, or members of different clusters, in both clusterings. The most common formulation of the Rand index focuses on the following four sets of the $\binom{N}{2}$ element pairs: $N_{11}$, the number of element pairs which are grouped in the same cluster in both clusterings; $N_{10}$, the number of element pairs which are grouped in the same cluster by $A$ but in different clusters by $B$; $N_{01}$, the number of element pairs which are grouped in the same cluster by $B$ but in different clusters by $A$; and $N_{00}$, the number of element pairs which are grouped in different clusters by both clusterings. These counts can be calculated from the contingency table, Table S1, whose entries $n_{km} = |A_k \cap B_m|$ count the elements shared by clusters $A_k$ and $B_m$, by the following set of equations:
$$\begin{aligned}
N_{11} &= \sum_{k=1}^{K_A} \sum_{m=1}^{K_B} \binom{n_{km}}{2}, &
N_{10} &= \sum_{k=1}^{K_A} \binom{a_k}{2} - N_{11}, \\
N_{01} &= \sum_{m=1}^{K_B} \binom{b_m}{2} - N_{11}, &
N_{00} &= \binom{N}{2} - N_{11} - N_{10} - N_{01}.
\end{aligned}$$
The Rand index between clusterings $A$ and $B$, $RI(A, B)$, is then given by the function:
$$RI(A, B) = \frac{N_{11} + N_{00}}{\binom{N}{2}}.$$
It lies between 0 and 1, where 1 indicates the clusterings are identical and 0 occurs for clusterings which do not share a single pair of elements (this only happens when one clustering is the full set of elements and the other clustering groups each element into its own singleton cluster). As the number of elements being clustered becomes large, the measure becomes dominated by the number of pairs which were classified into different clusters ($N_{00}$), resulting in decreased sensitivity to co-occurring element pairs [49].
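The pair counts and the Rand index can be computed directly from two label vectors, as in the following sketch. It uses the standard identities $N_{11} = \sum_{k,m} \binom{n_{km}}{2}$, $N_{10} = \sum_k \binom{a_k}{2} - N_{11}$, and $N_{01} = \sum_m \binom{b_m}{2} - N_{11}$; the function names and example labels are hypothetical.

```python
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Return (N11, N10, N01, N00) for two partitions of the same elements."""
    n = len(labels_a)
    # Pairs co-clustered in both clusterings, from the contingency table.
    n11 = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    n10 = pairs_a - n11
    n01 = pairs_b - n11
    n00 = comb(n, 2) - n11 - n10 - n01
    return n11, n10, n01, n00


def rand_index(labels_a, labels_b):
    n11, n10, n01, n00 = pair_counts(labels_a, labels_b)
    return (n11 + n00) / (n11 + n10 + n01 + n00)


labels_a = [0, 0, 0, 1, 1, 1]
labels_b = [0, 0, 1, 1, 1, 1]
counts = pair_counts(labels_a, labels_b)  # (4, 2, 3, 6)
ri = rand_index(labels_a, labels_b)       # 10 / 15
```

Comparing a single all-inclusive cluster with all singletons, e.g. `rand_index([0, 0, 0, 0], [0, 1, 2, 3])`, gives 0, the extreme case noted above.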

S2.2 Adjusted Rand index (ARI)
A popular extension of the Rand index, called the adjusted Rand index (ARI), considers the average of the measure in the context of a random ensemble of clusterings [50,22,43,17]. Such a correction for chance uses the expected similarity of all pair-wise comparisons between clusterings specified by a random null model to establish a baseline; the resulting similarity values have a new interpretation that facilitates comparisons within a set of clusterings. Specifically, once corrected for chance, a similarity value of 1 still corresponds to identical clusterings, but a value of 0 now corresponds to the expected value amongst random clusterings. Positive values of corrected similarity better reflect an intuitive comparison of clusterings, although they are still slightly biased [51]. However, the correction process also introduces negative values for similarity that occur when two clusterings are less similar than would be expected by chance.
The commonly used adjusted Rand index (ARI) of Hubert and Arabie [22] calculates the expectation of the Rand index under the assumption that random clusterings are drawn from the permutation model. In the permutation model the number and size of clusters within a clustering are fixed; all random clusterings are generated by shuffling the elements between the fixed clusters. The expectation of the Rand index with respect to the permutation model follows from the fact that the entries in Table S1 follow a generalized hypergeometric distribution. For the cluster size sequences of clusterings $A$ and $B$, this expectation is given by:
$$\mathbb{E}[RI(A, B)] = 1 + \frac{2 \sum_{k} \binom{a_k}{2} \sum_{m} \binom{b_m}{2}}{\binom{N}{2}^{2}} - \frac{\sum_{k} \binom{a_k}{2} + \sum_{m} \binom{b_m}{2}}{\binom{N}{2}}$$
(see Fowlkes and Mallows [49], Hubert and Arabie [22], or Albatineh and Niewiadomska-Bugaj [43] for the full derivation). Finally, the ARI between clusterings $A$ and $B$ is given by:
$$ARI(A, B) = \frac{RI(A, B) - \mathbb{E}[RI(A, B)]}{1 - \mathbb{E}[RI(A, B)]}.$$
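A compact sketch of the permutation-model ARI is given below. It uses the equivalent pair-count form $(N_{11} - E) / (M - E)$, where $E = \sum_k \binom{a_k}{2} \sum_m \binom{b_m}{2} / \binom{N}{2}$ is the expected number of co-clustered pairs and $M$ is the mean of the two within-clustering pair counts; the function name and example labels are hypothetical, and the degenerate case in which $M = E$ (for example, two all-singleton clusterings) is deliberately not handled.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie ARI under the permutation model (pair-count form).

    Raises ZeroDivisionError in the degenerate case where the maximum
    and expected pair counts coincide.
    """
    total = comb(len(labels_a), 2)
    n11 = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / total   # E[N11] under the permutation model
    maximum = (pairs_a + pairs_b) / 2      # largest attainable value of N11
    return (n11 - expected) / (maximum - expected)


labels_a = [0, 0, 0, 1, 1, 1]
labels_b = [0, 0, 1, 1, 1, 1]
ari = adjusted_rand_index(labels_a, labels_b)             # 12 / 37
ari_negative = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The second comparison yields a negative value, illustrating two clusterings that agree less than expected by chance.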

S2.3 Jaccard index
Another popular clustering similarity measure which utilizes pair-wise co-occurrence between the elements is the Jaccard index or Jaccard similarity coefficient [52]. Originally proposed to compare regional floras [53], the Jaccard index is a similarity measure for finite sets. It is defined as the number of element pairs which are grouped in the same cluster in both clusterings divided by the number of element pairs which are grouped in the same cluster in at least one of the clusterings. Thus, it ignores the number of element pairs that are grouped into different clusters by both clusterings. One minus the Jaccard index is a metric on the collection of finite sets [54]. Using the above notation from the contingency table, Table S1, the Jaccard index between clusterings $A$ and $B$ takes the form:
$$J(A, B) = \frac{N_{11}}{N_{11} + N_{10} + N_{01}}.$$

S2.4 F measure
The F measure has a long history of use in clustering validation, natural language processing, information retrieval, and machine learning. It is based on two asymmetric measures (sometimes called Dice's asymmetric coefficients) that count the proportion of element pairs which are correctly co-assigned to the same cluster in one of the clusterings, using the other clustering as a baseline. When one of these clusterings is considered to be a ground-truth clustering, these asymmetric measures are known as precision and recall. The F measure is the harmonic mean of the precision and recall. Specifically, the F measure between clusterings $A$ and $B$ is given by:
$$F(A, B) = \frac{2 N_{11}}{2 N_{11} + N_{10} + N_{01}}.$$
The F measure $F$ and Jaccard index $J$ are related by $J = F/(2 - F)$.
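The relation between the two measures can be checked numerically with a short sketch. It assumes the pair-count definitions $J = N_{11} / (N_{11} + N_{10} + N_{01})$ and $F$ as the harmonic mean of $N_{11} / (N_{11} + N_{10})$ and $N_{11} / (N_{11} + N_{01})$; which of the two ratios is called precision depends on which clustering is treated as ground truth. The function name and example labels are hypothetical, and clusterings with no co-clustered pairs (zero denominators) are not handled.

```python
from collections import Counter
from math import comb


def jaccard_and_f(labels_a, labels_b):
    """Pair-counting Jaccard index and F measure for two partitions."""
    n11 = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    n10 = sum(comb(c, 2) for c in Counter(labels_a).values()) - n11
    n01 = sum(comb(c, 2) for c in Counter(labels_b).values()) - n11
    jaccard = n11 / (n11 + n10 + n01)
    # Fraction of pairs co-clustered in A (resp. B) that are also
    # co-clustered in the other clustering.
    ratio_a = n11 / (n11 + n10)
    ratio_b = n11 / (n11 + n01)
    f_measure = 2 * ratio_a * ratio_b / (ratio_a + ratio_b)
    return jaccard, f_measure


labels_a = [0, 0, 0, 1, 1, 1]
labels_b = [0, 0, 1, 1, 1, 1]
jaccard, f_measure = jaccard_and_f(labels_a, labels_b)  # 4/9 and 8/13
```

Here $F / (2 - F) = (8/13) / (18/13) = 4/9$, which equals the Jaccard index as stated.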

S2.5 Normalized mutual information (NMI)
Another family of approaches for finding the similarity of two cluster coverings is based on the amount of information in each covering and the amount of information one covering contains about the other. These quantities can also be calculated from the contingency table, Table S1. The entropy $H$ of a clustering $A$ is given by
$$H(A) = -\sum_{k=1}^{K_A} \frac{a_k}{N} \log \frac{a_k}{N}$$
and, using the entries $n_{km}$ from the contingency table, Table S1, the joint entropy between two clusterings $A$ and $B$ is
$$H(A, B) = -\sum_{k=1}^{K_A} \sum_{m=1}^{K_B} \frac{n_{km}}{N} \log \frac{n_{km}}{N}.$$
Thus, the mutual information between two clusterings is given by:
$$I(A; B) = H(A) + H(B) - H(A, B).$$

S3 Detailed methods

S3.1 Cluster induced relationships
Formally, cluster induced relationships are represented via the cluster affiliation graph [63].A cluster affiliation graph is constructed for a clustering C of labeled elements V = {v 1 , . . ., v N } as a bipartite graph CT AG(V∪C, R) where one vertex set corresponds to the original elements V and the other vertex set corresponds to the clusters C.An undirected edge a iβ ∈ R ⊂ V ×C is placed between element v i ∈ V and cluster c β ∈ C if v i ∈ c β , i.e. the element is a member of the cluster.Notice that an element's membership in several overlapping clusters directly translates into several edges in the cluster affiliation graph.
The cluster affiliation graph also accommodates hierarchical cluster organization. Hierarchical cluster structure captures organization at different scales and is typically represented by a directed acyclic graph or a dendrogram, a tree-like structure in which more closely related elements have common ancestors lower in the tree than more distantly related elements [64]. Both directed acyclic graphs and dendrograms have nodes representing the clusters of the clustering, and directed edges between two nodes whenever the target cluster is a descendant of the source cluster. Clusters which are not descendants of any other cluster are called root clusters, while clusters which have no descendants are known as leaves. Following Czegel & Palla [40], every cluster $c_\beta$ in $C$ is given a rescaled hierarchical level $l_\beta$ according to the following process. The hierarchical level of all roots is 0.0, while the hierarchical level of all leaves is 1.0. If a cluster is not a root or a leaf, then we find the path of maximum length between a root and a leaf which passes through the cluster. The hierarchical level of the cluster is then the cluster's position in the path relative to the root (maximum distance from the root) divided by the total path length.
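The rescaling procedure can be sketched as follows. The sketch interprets the longest root-to-leaf path through a cluster as the sum of its maximum distance from any root and its maximum distance to any leaf; the function name `hierarchical_levels`, the dictionary-of-children input format, and the convention of assigning level 0.0 to an isolated cluster that is both root and leaf are assumptions made for illustration.

```python
from functools import lru_cache


def hierarchical_levels(children):
    """Rescaled hierarchical level in [0, 1] for every cluster in a DAG.

    `children` maps a cluster to the list of its direct descendants.
    """
    nodes = set(children) | {c for kids in children.values() for c in kids}
    parents = {v: [] for v in nodes}
    for parent, kids in children.items():
        for child in kids:
            parents[child].append(parent)

    @lru_cache(maxsize=None)
    def depth(v):
        # Maximum distance from any root.
        return 0 if not parents[v] else 1 + max(depth(p) for p in parents[v])

    @lru_cache(maxsize=None)
    def height(v):
        # Maximum distance to any leaf.
        kids = children.get(v, [])
        return 0 if not kids else 1 + max(height(c) for c in kids)

    levels = {}
    for v in nodes:
        total = depth(v) + height(v)  # longest root-to-leaf path through v
        levels[v] = depth(v) / total if total else 0.0
    return levels


# A small unbalanced dendrogram: the cluster "b" is a leaf at depth 1.
dendrogram = {"root": ["a", "b"], "a": ["a1", "a2"]}
levels = hierarchical_levels(dendrogram)
```

In this example the root has level 0.0, the intermediate cluster "a" has level 0.5, and all leaves, including the shallow leaf "b", have level 1.0.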
Given a clustering with a rescaled hierarchical structure (such that every cluster has a hierarchical level, see above), the edge weights of the cluster affiliation graph are given by a function of the cluster's hierarchical level. Specifically, if an element $v_i \in V$ is a member of a hierarchical cluster $c_\beta \in C$ which occurs at level $l_\beta$ in the acyclic graph or dendrogram capturing the hierarchical organization of $C$, the appropriate edge $a_{i\beta}$ in the cluster affiliation graph has weight $a_{i\beta} = h(l_\beta)$ given by the hierarchy weight function $h : [0, 1] \to \mathbb{R}^{+}$. The function $h$ reflects an important decision of hierarchical clustering similarity in general: one has to decide if the similarity of hierarchies should be more strongly focused on the coarser relationships (those at the top of the dendrogram) or the finer relationships (those at the bottom of the dendrogram). This distinction is related to the fact that hierarchies can be constructed in a divisive manner (a top-down approach in which clusters are successively subdivided into finer-grained structures) or an agglomerative manner (a bottom-up approach in which clusters are successively combined into coarser-grained structures). The shape of $h$ will determine what trade-offs are made in terms of hierarchical similarity: a constant function flattens the hierarchy into an overlapping clustering with one level, a monotonically decreasing $h$ will favor relationships induced by higher levels of the dendrogram, while a monotonically increasing $h$ will favor relationships induced by lower levels of the dendrogram over those at higher levels. A choice of $h$ that is not monotonically increasing or decreasing would suggest there are some resolutions which are more important than others but those resolutions are scattered throughout the dendrogram.
Here, we adapt the hierarchical weighting function:

$$h(l_\beta) = e^{r l_\beta},$$

where $r$ is a constant that determines the relative importance of membership at different levels of the hierarchy. For $r < 0$, similarities between higher levels of the dendrogram are favored over lower levels, while for $r > 0$ similarities between lower levels are more important than higher levels. While the appropriate value of $r$ depends on the specific application, we take the approach that similar hierarchical clusters should respect the finest graining of the network, and cluster memberships are further enhanced as one ascends the dendrogram. In general, we have found that the exact value of $r$ for which the lowest level of the dendrogram is considered the most important depends on the height of the hierarchy (the length of the maximum directed path in the acyclic graph). For all comparisons between hierarchical clusterings conducted in this work, we use a value of $r = 8$, but further investigation into the sensitivity of the measure to $r$ will be needed.
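To make the role of $r$ concrete, the following sketch evaluates a hierarchy weight function at a few levels. It assumes an exponential form, h(l) = exp(r l), which matches the behavior described above for negative and positive r but should be read as an assumption of this example; the function names are hypothetical.

```python
import math


def hierarchy_weight(level, r=8.0):
    """Weight of a membership at a rescaled level in [0, 1].

    Assumes the exponential form h(l) = exp(r * l); r = 8 is the value
    quoted in the text for hierarchical comparisons.
    """
    return math.exp(r * level)


def affiliation_weights(memberships, levels, r=8.0):
    """Build {element: {cluster: a_i_beta}} from memberships and cluster levels.

    memberships: {element: iterable of clusters} (hypothetical input format)
    levels: {cluster: rescaled hierarchical level}
    """
    return {
        element: {c: hierarchy_weight(levels[c], r) for c in clusters}
        for element, clusters in memberships.items()
    }


# Root cluster at level 0.0 and a leaf cluster at level 1.0.
weights_fine = affiliation_weights({"v1": ["root", "leaf"]},
                                   {"root": 0.0, "leaf": 1.0}, r=8.0)
weights_coarse = affiliation_weights({"v1": ["root", "leaf"]},
                                     {"root": 0.0, "leaf": 1.0}, r=-8.0)
```

With r > 0 the leaf membership dominates the element's total weight, with r < 0 the root membership dominates, and r = 0 gives every membership the same weight, which corresponds to the flattened hierarchy mentioned above.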

S3.2 Cluster-induced element graph
The cluster affiliation graph contains all of the same information as the original clustering structure; we now begin to summarize attributes of that structure which contribute to clustering similarity. To extract a coherent set of relationships between the elements induced by the clustering, the bipartite cluster affiliation graph is projected onto its element vertices to form the cluster-induced element graph. Specifically, the cluster-induced element graph is a weighted, directed network where the edge weight $w_{ij}$ captures the aggregated influence between elements $v_i$ and $v_j$, normalized by the total weight incident to element $v_i$:

$$w_{ij} = \sum_{\gamma} \frac{a_{i\gamma}\, a_{j\gamma}}{\sum_{k} a_{k\gamma} \, \sum_{\beta} a_{i\beta}}.$$

Note that self-loops occur throughout the element interaction graph.
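The projection step can be illustrated with a short sketch. It assumes a two-step random-walk normalization (each shared cluster contributes the product of the two affiliation weights, divided by the cluster's total weight and by the element's total weight), so that every row of the resulting matrix sums to one. The function name `element_graph` and the nested-dictionary format are hypothetical.

```python
def element_graph(affiliation):
    """Project {element: {cluster: weight}} onto a weighted element graph.

    Returns {i: {j: w_ij}}. Assumed normalization (see lead-in):
    w_ij = sum over shared clusters c of
           a_ic * a_jc / (total weight of c * total weight of i).
    """
    # Total affiliation weight attached to each cluster.
    cluster_total = {}
    for weights in affiliation.values():
        for c, a in weights.items():
            cluster_total[c] = cluster_total.get(c, 0.0) + a

    graph = {}
    for i, w_i in affiliation.items():
        element_total = sum(w_i.values())
        row = {}
        for j, w_j in affiliation.items():
            w = sum(
                a * w_j[c] / (cluster_total[c] * element_total)
                for c, a in w_i.items()
                if c in w_j
            )
            if w > 0:
                row[j] = w  # includes the self-loop j == i
        graph[i] = row
    return graph


# Two overlapping clusters X = {1, 2} and Y = {2, 3} with unit weights.
W_example = element_graph({1: {"X": 1.0}, 2: {"X": 1.0, "Y": 1.0}, 3: {"Y": 1.0}})
```

In this example element 2 belongs to both clusters, so its outgoing weight is split between the two neighborhoods, and elements 1 and 3 are not directly connected because they never co-occur.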

S3.3 Generalizing element co-occurrence
Next we extend the concept of element co-occurrence to the cluster-induced element graph.
As discussed in Section S2, many existing clustering similarity measures focus on the pair-wise co-occurrence of elements in clusters. In the cluster-induced element graph, the co-occurrence of two elements in at least one cluster is captured by the presence of an edge. The weight of this edge reflects the relative influence between the elements aggregated over all clusters in which the elements co-occur. The focus on element pairs misses high-order relations which are induced by the cluster structure and are beneficial for differentiating cluster structure [22]. The cluster-induced element graph captures such high-order co-occurrences through the presence of paths. Thus, all co-occurrences between 3 elements are captured by paths of length 2, while the co-occurrences between 4 elements are captured by paths of length 3, etc. The weight of the path accounts for the relative importance of neighboring elements in the presence of overlapping and hierarchical cluster structures. Note that in our generalization, singleton clusters are naturally accommodated by the presence of self-loops in the cluster-induced element graph, and hence by paths which contain multiple passes through the singleton element.
The information contained in all possible paths through a graph can be integrated using a diffusion process on the graph. However, from the perspective of each element, not all paths through the cluster-induced element graph are created equal. Instead, we want to favor those paths which explore the local neighborhood around each element. Thus, for our element-centric similarity measure, we utilize the stationary distribution of a personalized diffusion process as a useful proximity measure that integrates both local and global graph structure around an element.
Specifically, given a cluster-induced element graph with weighted adjacency matrix $W$, the personalized PageRank affinity between element $v_i$ and all elements $v_j$ is found as the stationary distribution of a diffusion process on the element interaction graph with restart probability $1 - \alpha$ to $v_i$.
The value of $\alpha$ controls the influence of longer paths in the element interaction graph, which relates to the relative importance of overlapping clusters and hierarchical clusters with shared lineages; here we use $\alpha = 0.9$. The complete matrix of pair-wise personalized PageRank affinities provides a relative measure of the similarity between two elements under the relationships induced by a clustering. One potential use of this matrix, not explored here, is to average the affinity matrices over several clusterings. The resulting object should function in a similar manner to the nodal-affinity matrices of Bassett et al. [37] and can become the subject of further consensus clustering routines [39].
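As an illustration of the personalized diffusion, the sketch below computes one affinity vector by power iteration. It assumes the standard recurrence p = (1 - alpha) e_i + alpha p W with a row-stochastic W stored as nested dictionaries; the function name and data format are hypothetical, and power iteration is used here in place of a matrix inversion.

```python
def personalized_pagerank(W, source, alpha=0.9, tol=1e-12, max_iter=10000):
    """Stationary distribution of a walk on W that restarts at `source`.

    W: {i: {j: w_ij}} with every node present as a key and rows summing to 1.
    alpha: probability of following an edge; 1 - alpha is the restart
    probability (alpha = 0.9 is the value used in the text).
    """
    nodes = list(W)
    p = {n: 0.0 for n in nodes}
    p[source] = 1.0
    for _ in range(max_iter):
        nxt = {n: 0.0 for n in nodes}
        nxt[source] = 1.0 - alpha          # restart mass
        for i, row in W.items():
            for j, w in row.items():
                nxt[j] += alpha * p[i] * w  # diffusion along edge i -> j
        # The update is a contraction with factor alpha in the L1 norm.
        if sum(abs(nxt[n] - p[n]) for n in nodes) < tol:
            return nxt
        p = nxt
    return p


# A single cluster of two elements: a clique with self-loops.
clique = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
ppr_a = personalized_pagerank(clique, "a")
```

For this two-element cluster the source keeps the restart mass 1 - alpha on top of its share of the uniformly spread remainder, so the affinity of "a" to itself is larger than its affinity to "b".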
In general, for large data sets and clusterings with many overlapping and hierarchical clusters, the calculation of personalized PageRank can be a computationally expensive process; a different matrix must be inverted for every element, with a resulting complexity of $O(N^4)$. However, there are some computational simplifications that can be made. First, the personalized PageRank affinity of strict partitions can be solved analytically (see Section S3.5). Second, when several elements share exactly the same cluster memberships, their resulting personalized PageRank affinity vectors are related by simple permutations; therefore, the personalized PageRank affinity vector need only be calculated once for each common cluster membership set. Third, due to the utility of personalized PageRank for recommendation systems, there have been many algorithms for the approximation of personalized PageRank [41]. Because the worst-case computational complexity of element-centric similarity will only occur for highly overlapping and deeply hierarchical clusterings, objects which were previously incomparable using traditional clustering similarity methods, we do not consider the computational complexity a drawback of our method.
It is also useful to note that our choice of the personalized PageRank equilibrium distribution can be motivated in terms of the graph similarity between the two cluster-induced element graphs [65]. While there are many different methods to assess the similarity of graphs, several other common choices have meaningful interpretations for clustering similarity. For example, the graph-edit distance between the cluster-induced element graphs derived from partitions results in the Rand index.

S3.4 Element-centric similarity
A convenient aspect of the element-centric similarity is that one can recover element-specific values of similarity under the different clusterings. For our element-centric similarity, we use the L1 metric for probability distributions, corrected to account for the personalized PageRank process:

$$S_i(A, B) = 1 - \frac{1}{2\alpha} \sum_{j=1}^{N} \left| p^{A}_{ij} - p^{B}_{ij} \right|,$$

where $p^{A}_{ij}$ and $p^{B}_{ij}$ denote the personalized PageRank affinities between $v_i$ and $v_j$ under clusterings $A$ and $B$. This correction accommodates the fact that personalized diffusion will always have at least $1 - \alpha$ probability centered at each vertex, so the largest difference between two personalized PageRank vectors is $2\alpha$ and not 2.
The element-wise similarity scores provide insight into how the clusterings differ. The ranked distribution of element-wise scores reflects the differences in cluster structure. A flat distribution occurs when all elements have the same similarity score. This suggests that all elements saw a roughly equal change in cluster structure. A highly skewed distribution occurs when some elements have much higher or lower similarity than the rest of the elements. This suggests that the two clusterings had their agreements and disagreements concentrated on this small set of elements.
The final element-centric similarity S(A, B) of two clusterings A, B is the average of the element-wise similarities.
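The two steps of this section, element-wise scores followed by their average, can be sketched as below. The example assumes the corrected L1 form described in the text, with the L1 distance between two affinity vectors divided by its maximum value of 2 alpha; the function names and the nested-dictionary format for the affinity matrices are hypothetical.

```python
def elementwise_similarity(aff_a, aff_b, alpha=0.9):
    """Element-wise scores from two affinity matrices {i: {j: p_ij}}.

    Assumes score_i = 1 - L1(p_i^A, p_i^B) / (2 * alpha), where missing
    entries are treated as zero affinity.
    """
    scores = {}
    for i in aff_a:
        targets = set(aff_a[i]) | set(aff_b[i])
        l1 = sum(abs(aff_a[i].get(j, 0.0) - aff_b[i].get(j, 0.0)) for j in targets)
        scores[i] = 1.0 - l1 / (2.0 * alpha)
    return scores


def element_centric_similarity(aff_a, aff_b, alpha=0.9):
    """Final similarity: the average of the element-wise scores."""
    scores = elementwise_similarity(aff_a, aff_b, alpha)
    return sum(scores.values()) / len(scores)


# Element "i" is grouped with "j" in one clustering and with "k" in the other.
# Only the restart mass (1 - alpha) on "i" itself is shared.
aff_a = {"i": {"i": 0.1, "j": 0.9}}
aff_b = {"i": {"i": 0.1, "k": 0.9}}
worst_case = elementwise_similarity(aff_a, aff_b)["i"]
```

The example shows why the correction is needed: even when the two affinity vectors agree only on the restart mass, the uncorrected L1 distance would stop at 2 alpha rather than 2, whereas the corrected score reaches its lower bound of zero.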

S3.5 Element-centric similarity for strict partitions
When the clustering is a strict partition (a clustering without overlapping memberships or hierarchical structure), the calculation of the personalized PageRank matrix $\Pi$ for the clustering can be solved analytically. First, note that in the absence of overlap and hierarchical structure, the element interaction graph of a clustering is a clique graph where each connected component corresponds to a single cluster from the clustering and all edge weights are 1.0.
For each element $v_i$, there is one cluster $c_\beta$ of size $|c_\beta|$ to which $v_i$ belongs, and the resulting personalized PageRank matrix has entries:

$$p_{ij} = \frac{\alpha}{|c_\beta|}\,\delta(c_\beta, c_\gamma) + (1 - \alpha)\,\delta_{ij},$$

where $c_\gamma$ is the cluster containing $v_j$ and $\delta$ is the Kronecker delta. Thus, the similarity of strict partitions has low computational overhead and can actually become much faster than traditional clustering similarity methods when many comparisons are made at once.
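For strict partitions, the closed form allows the element-wise scores to be computed from cluster sizes and overlaps alone. The sketch below assumes that each affinity vector is the restart mass 1 - alpha on the element plus alpha spread uniformly over the element's cluster; under that assumption the restart terms cancel in the difference and alpha drops out of the corrected L1 score. The function names and label-dictionary format are hypothetical.

```python
def group(labels):
    """Turn {element: label} into {label: set of elements}."""
    clusters = {}
    for element, label in labels.items():
        clusters.setdefault(label, set()).add(element)
    return clusters


def partition_elementwise_similarity(labels_a, labels_b):
    """Element-wise scores for two strict partitions of the same elements.

    For element i with clusters a (in A) and b (in B), the L1 distance
    between the two uniform-on-cluster distributions is
      |a & b| * |1/|a| - 1/|b||  +  |a - b| / |a|  +  |b - a| / |b|,
    and the score is 1 minus half of that distance.
    """
    clusters_a, clusters_b = group(labels_a), group(labels_b)
    scores = {}
    for i in labels_a:
        a = clusters_a[labels_a[i]]
        b = clusters_b[labels_b[i]]
        both = len(a & b)
        l1 = (both * abs(1.0 / len(a) - 1.0 / len(b))
              + (len(a) - both) / len(a)
              + (len(b) - both) / len(b))
        scores[i] = 1.0 - 0.5 * l1
    return scores


# Two clusters of two elements compared with a single cluster of four.
part_a = {1: "x", 2: "x", 3: "y", 4: "y"}
part_b = {1: "u", 2: "u", 3: "u", 4: "u"}
partition_scores = partition_elementwise_similarity(part_a, part_b)
```

Because only set sizes are needed, the cost per element is dominated by the cluster intersection, which is one way to read the remark above about low computational overhead.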

S3.6 Average agreement and frustration
Beyond a similarity measure between two clusterings, our element-centric similarity measure reveals how an arbitrary set of clusterings groups the elements. The average agreement between a reference clustering and a set of clusterings measures the regular grouping of elements with respect to the reference clustering. Specifically, given a clustering $G$ and a set of clusterings $R = \{R_1, \ldots, R_T\}$, the element-wise average agreement for element $v_i$ is evaluated as the element-wise similarity between $G$ and each $R_t$, averaged over the $T$ clusterings in $R$.
The frustration within a set of clusterings reflects the consistency with which elements are grouped by the clusterings. For the set of clusterings $R = \{R_1, \ldots, R_T\}$, the element-wise frustration for element $v_i$ is given by the element-wise similarity of all pair-wise comparisons between the clusterings in $R$, averaged over the comparisons (equation S27).
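Both quantities are averages of element-wise score vectors, which the following sketch makes explicit. It takes the element-wise similarity as a callable so that any implementation can be plugged in; the function names are hypothetical, frustration is computed here as the plain mean over all pair-wise comparisons (following the description in Section S5.2), and `label_match` is only a toy stand-in for the element-centric scores.

```python
from itertools import combinations


def mean_scores(score_dicts):
    """Average a list of {element: score} dictionaries element by element."""
    return {
        element: sum(s[element] for s in score_dicts) / len(score_dicts)
        for element in score_dicts[0]
    }


def average_agreement(reference, clusterings, elementwise):
    """Mean element-wise similarity between a reference and each clustering."""
    return mean_scores([elementwise(reference, r) for r in clusterings])


def frustration(clusterings, elementwise):
    """Mean element-wise similarity over all pairs of clusterings.

    Assumption: no further rescaling or sign change is applied.
    """
    pairs = combinations(clusterings, 2)
    return mean_scores([elementwise(a, b) for a, b in pairs])


def label_match(a, b):
    """Toy stand-in: 1.0 if an element carries the same label, else 0.0."""
    return {element: 1.0 if a[element] == b[element] else 0.0 for element in a}


reference = {"x": 0, "y": 1}
runs = [{"x": 0, "y": 1}, {"x": 0, "y": 0}, {"x": 0, "y": 1}]
agreement_example = average_agreement(reference, runs, label_match)
frustration_example = frustration(runs, label_match)
```

In this toy example element "x" is treated identically in every run, while element "y" is grouped differently in one of the three runs, which lowers both its agreement with the reference and its consistency across runs.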

S4 Comparing clustering similarity measures: extended discussion
In order to evaluate the behavior of clustering similarity measures, we introduced three comparison scenarios in which one clustering was held constant and the second clustering was randomly generated according to different constraints. These three scenarios specifically focused on the trade-offs made by clustering similarity measures when incorporating strong discrepancies in three aspects of the cluster structure: the consistent grouping of elements into clusters, the size distribution of the clusters, and the number of clusters. Here, we provide an extended analysis of the behaviors seen in the original three scenarios shown in Fig. 1 and introduce one additional comparison scenario. For all of the scenarios, we consider a baseline clustering, named clustering A, which contains 1,024 elements clustered into 32 clusters of equal size with no overlap. All results are averaged over 100 instantiations and reported with an error of one standard deviation.
In the first scenario, we compare clustering A to a second clustering generated by randomly shuffling the membership of a fraction $p$ of the elements in clustering A, leaving the number and size sequence of the clusters unchanged. As expected, all clustering similarity measures decrease as $p$ increases. The Jaccard index, F measure, NMI, and our element-centric similarity measure remain at a non-zero value, reflecting the fact that even after all of the element memberships have been fully shuffled, there will be a fraction of the elements which are co-assigned to the same cluster in both clusterings, a property of all clusterings with a similar number and distribution of clusters. However, the adjusted Rand index and ONMI eventually reach a base value of 0. This is particularly noticeable in the case of ONMI, which assesses 0 similarity between the clusterings with only $p \approx 0.5$, losing the ability to discern the similarity of clusterings with more randomization. The 0 base value for the adjusted Rand index reflects the underlying philosophy of the measure: random clusterings should have a similarity of 0, regardless of the number of clusters or cluster size sequence [22,17].
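One way to generate the second clustering of this scenario is to permute the labels among a randomly chosen subset of elements, which leaves every cluster size unchanged. The sketch below is an assumption about how such a shuffle could be implemented, not the procedure used to produce the reported results; the function name and seed are hypothetical.

```python
import random
from collections import Counter


def shuffle_memberships(labels, p, rng):
    """Shuffle the cluster labels of a fraction p of the elements.

    labels: list of cluster labels, one per element.
    The selected labels are permuted among the selected elements only,
    so the number of clusters and the cluster size sequence are preserved.
    """
    labels = list(labels)
    chosen = rng.sample(range(len(labels)), round(p * len(labels)))
    shuffled = [labels[i] for i in chosen]
    rng.shuffle(shuffled)
    for i, label in zip(chosen, shuffled):
        labels[i] = label
    return labels


# Baseline clustering A: 1,024 elements in 32 equal-sized clusters.
clustering_a = [k // 32 for k in range(1024)]
clustering_b = shuffle_memberships(clustering_a, 0.5, random.Random(0))
size_sequence_preserved = Counter(clustering_b) == Counter(clustering_a)
```

Note that a shuffled element can receive its original label back by chance, which is consistent with the observation above that some co-assignments survive even at p = 1.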
In the second scenario, we explore the effect of a skewed cluster size sequence. For this case, we compare clustering A to a second clustering B generated using a preferential attachment model of element assignment. Specifically, using a clustering with 32 equal-sized clusters (and randomized element memberships compared to clustering A) as the seed, at each step of our algorithm an element is chosen uniformly at random for reassignment to a new cluster selected based on the current sizes of the clusters. A move is rejected if it would result in an empty cluster. The shuffling procedure is run for a total of $10^4$ steps and the subsequent samples from all 100 trials are grouped into 40 bins according to their clustering entropy. There are now three distinct types of behavior exhibited by the clustering similarity measures. The NMI and our element-centric similarity measure exhibit the intuitive behavior and decrease as the clustering entropy decreases. The ONMI and ARI maintain a 0 similarity for all comparisons regardless of the clustering entropy. Note that the larger variation in NMI, ONMI, and ARI seen for small clustering entropy results from the presence of singleton and binary clusters, which contribute to statistical fluctuations in element memberships. Finally, the F measure and Jaccard index increase as the entropy decreases: they cannot account for the differences in the cluster size distribution. This increase is a consequence of their formulation in terms of the correctly co-assigned element pairs while disregarding the incorrectly co-assigned element pairs.
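The reassignment procedure can be sketched as follows. The text states only that the new cluster is chosen based on current cluster sizes; selecting it with probability proportional to size is an assumption of this example, as are the function names and the use of Shannon entropy in bits for the clustering entropy.

```python
import math
import random
from collections import Counter


def preferential_reassignment(labels, steps, rng):
    """Skew a clustering's size distribution by size-biased reassignment.

    At each step a uniformly random element is moved to a cluster drawn
    with probability proportional to current cluster sizes (assumption).
    Moves that would empty a cluster are rejected.
    """
    labels = list(labels)
    sizes = Counter(labels)
    for _ in range(steps):
        i = rng.randrange(len(labels))
        old = labels[i]
        if sizes[old] == 1:
            continue  # rejected: the move would leave an empty cluster
        clusters = list(sizes)
        new = rng.choices(clusters, weights=[sizes[c] for c in clusters])[0]
        sizes[old] -= 1
        sizes[new] += 1
        labels[i] = new
    return labels


def clustering_entropy(labels):
    """Shannon entropy (bits) of the cluster size distribution."""
    n = len(labels)
    return -sum((s / n) * math.log2(s / n) for s in Counter(labels).values())


seed_clustering = [k % 32 for k in range(1024)]
skewed = preferential_reassignment(seed_clustering, 10**4, random.Random(0))
```

The equal-sized seed clustering has the maximum entropy of 5 bits for 32 clusters, so the entropy of any clustering produced by this procedure can only be equal or lower.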
In the third scenario, we explore the effect of the number of clusters. Here, we compare clustering A against a second clustering B generated by randomly assigning the elements to $c$ regularly sized clusters, where $c$ is the control parameter for the scenario. Hence, one clustering remains the same size, while the other has $c$ regularly sized clusters. Again, we see two distinctly different behaviors of the clustering similarity measures: the Jaccard index, F measure, ONMI, ARI, and our element-centric similarity measure all follow our intuition and decrease with increasing $c$, while NMI increases with increasing $c$. The increasing behavior for NMI can be attributed to the information-theoretic bias towards comparisons with more clusters [14,66,16,17]. This bias makes NMI a particularly troubling measure for hierarchical clusterings, where we expect the number of clusters to vary over several orders of magnitude.
Finally, we introduce one additional scenario, depicted in Figure S1, in which both clustering A and clustering B are generated by randomly assigning the elements to $c$ regularly sized clusters. This scenario prominently demonstrates the trade-offs with which a clustering similarity measure must contend. Namely, in this scenario, the effect of randomized element memberships is opposed by the increasing similarity of the number of clusters and cluster size sequences. This trade-off is clearly illustrated by the behavior of our element-centric similarity measure; the initial decrease, resulting from the increasingly random element memberships, is eventually overcome by the relative similarity in the number and sizes of the clusters. Eventually, the similarity reaches the expected value of 1 when there are $2^{10}$ clusters: every element is placed within a singleton cluster in both clusterings and the randomization of element memberships has no effect. In contrast, NMI always increases as the number of clusters increases, suggesting that the aforementioned bias towards clusterings with more clusters is always stronger than the effect of element randomization. Once again, the extreme behavior of ONMI can be seen when the measure jumps to a similarity of 1 at $2^{10}$ clusters. The decreasing behavior for the Jaccard index and F measure results from their scaling behavior: when the number of clusters is large relative to the number of elements, there are very few elements co-occurring in each cluster.

Figure S3: The VI displays counter-intuitive behavior for skewed cluster sequences and differing number of clusters. The three scenarios from the main text, and one additional scenario described in Figure S1. Since the VI is a metric, the intuitive behavior differs from the similarity measures discussed in the main text; one would now expect the measure to increase in a-c and decrease in d. See Section S2.7 for the measure details.

S5 Clustering similarity applications
The dataset used here was originally collected for Cheng et al. [32]; please refer to that work for specific details of the data acquisition and pre-processing. Here we only offer a brief overview. Data were acquired from 19 individuals diagnosed with schizophrenia (6 female, mean age 33.1 ± 10.9 years) and 29 healthy controls (15 female, mean age 28.1 ± 8.4 years). Diagnosis of schizophrenia was based on the Structured Clinical Interview for the DSM-IV Axis I Disorders (SCID-I) [67] and medical chart review. All subjects were scanned on a Siemens TIM Trio 3 T MRI scanner using a 32-channel head coil. The high-resolution anatomical scan had a resolution of 1 mm$^3$. A total of 200 volumes of resting state fMRI data were acquired with EPI sequences for 8 min and 20 s. During the resting state fMRI scan, the subjects were at rest with eyes closed and instructed not to think of anything in particular. All functional data were motion corrected in FSL.
In conjunction with the anatomical image, the functional images were parcellated using a parcellation scheme proposed by Shen et al. [68]. This parcellation divides the cerebral cortex into 278 regions of interest (ROIs), and was derived from resting state functional data of the healthy subjects by maximizing functional homogeneity within each ROI. After regressing out head motion, the time signal was band-pass filtered between 0.01 and 0.10 Hz and the time courses were extracted from the 278 brain ROIs as the average over voxels.
The functional network was computed from the wavelet coherence between all pair-wise combinations of ROIs, giving rise to a square symmetric matrix (278 × 278). The resulting functional connectivity matrix has only positive edges. In order to identify a backbone network structure, the multiscale network backbone [69] was extracted using $\alpha = 0.2$. Technically, the multiscale backbone is a directed network; however, since our original graph was undirected, we convert the multiscale backbone back into an undirected network. The network was not corrected to ensure a single connected component.

S5.1.2 Overlapping and hierarchically structured clusterings
Overlapping and hierarchically structured clusterings were derived using Order Statistics Local Optimization Method (OSLOM) network community detection [34] with the following parameters: weighted, undirected edges, $p = 0.1$, 100 runs for the detection at the bottom of the hierarchy and 1000 runs for the detection at the top of the hierarchy. All singleton communities were kept. Due to the variability in clustering structure between runs of the algorithm, 10 clusterings were extracted for each subject.
The subject similarity matrix was then constructed as follows. Each diagonal entry is 1.0. Each off-diagonal entry in the (48 × 48) subject similarity matrix is the average element-centric similarity over all 10 × 10 = 100 comparisons between the 10 OSLOM clusterings uncovered for each of the two subjects. For all comparisons, we set $\alpha = 0.9$ and $r = 8.0$. Our choice of the scaling parameter, $r = 8.0$, was grounded in the explorations of synthetic binary hierarchies of equivalent height. The dis-similarity matrix is one minus the similarity matrix. Four additional matrices were found by slicing each OSLOM community dendrogram, retaining only the bottom or top communities, and performing all pair-wise comparisons with either the element-centric similarity or ONMI similarity measures.

S5.1.3 Classification
Given a dis-similarity matrix, a distance-weighted k-Nearest Neighbors (kNN) classifier was trained using nested and stratified 10-fold cross-validation [70]. Specifically, the data was randomly split into 10 groups such that the proportions of each class were kept relatively equal in each group. Each group in turn was then used as the testing set, while the other 9 groups formed the training set. For each training set, we first find the best $k$ for the kNN classifier using a grid search for $k$ between 1 and 15 and another stratified 10-fold cross-validation. The classifier was then retrained on the entire training set for the specified $k$. Finally, the accuracy of the trained classifier was found on the testing set. In the paper, we report the average accuracy identified in 100 random initializations of the nested 10-fold cross-validation technique [71,72].

S5.2 Point clusters
5,000 points were randomly formed into clusters using an algorithm akin to the process for constructing benchmark graphs [46]. Cluster sizes were randomly drawn from a power-law distribution with a minimum cluster size of 10, a maximum cluster size of 1000, and an exponent of 1.0. The center of each cluster was uniformly selected from points in a 40 × 40 box. The standard deviation (or spread) of each cluster was also drawn from a power-law distribution with a minimum of 0.2, a maximum of 2.0, and an exponent of 1.0. Next, the type of each cluster was uniformly selected from four options. The first option is the 2-D Gaussian blob with mean given by the cluster center and standard deviation given by the cluster standard deviation. The second option is the 2-D anisotropic blob with a mean given by the cluster center, standard deviation given by the cluster standard deviation, and transformation given by the matrix of equation (S28).
The scikit-learn [73] implementation of K-means clustering was initialized with $K = 19$ clusters and random initial centroids. The identification method was then run from 100 random centroid initializations. Clustering agreement was calculated by comparing all 100 uncovered clusterings with the ground-truth clustering; the element-wise similarity vector was found for each comparison and then averaged over the uncovered clusterings. Clustering frustration was calculated from all pair-wise comparisons between the 100 uncovered clusterings; the element-wise similarity vector was found for each comparison and then averaged over all comparisons.

S5.3 Handwritten digits
The digits data set is bundled with the scikit-learn source code and consists of 1797 images of 8 × 8 gray-level pixels for handwritten digits. The reference clustering contains 10 clusters corresponding to the true digit. The data set was originally assembled by Alimoglu and Alpaydin [27]. To provide a visualization, the data was projected to 2-D using the t-Distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction method [74] initialized from the PCA decomposition.
The scikit-learn [73] implementation of K-means clustering was initialized with $K = 10$ clusters and random initial centroids. The identification method was then run from 100 random centroid initializations. Clustering agreement was calculated by comparing all 100 uncovered clusterings with the ground-truth clustering; the element-wise similarity vector was found for each comparison and then averaged over the uncovered clusterings. Clustering frustration was calculated from all pair-wise comparisons between the 100 uncovered clusterings; the element-wise similarity vector was found for each comparison and then averaged over all comparisons.

S5.4 Facebook friendship networks
The Facebook friendship networks analyzed here were originally released as part of the Facebook 100 data set [28,29]. This data set contains a snapshot of all friendships at each of 100 schools in the fall of 2005. Additionally, the data includes several categorical variables volunteered by the users on their individual pages: gender, class year, high school, major, and dormitory residence. Here, we only analyze the networks corresponding to two schools: Oberlin (College A) and Rochester (College B). For each school we took the largest connected component. The extracted clusterings were uncovered using the Louvain method, an optimization scheme that identifies clusters with high Newman's modularity [75,76]. The categorical data for year, dorm, and major were used to create three non-overlapping clusterings. Every student with missing categorical data was placed into an individual singleton cluster.
Comparisons between the categorical clusterings and the extracted clustering were made using our element-centric similarity measure.For both schools, there is a high similarity to year, confirming previous results [28,29].The element-wise similarity scores indicate that this similarity is strongest for the first-year students, and fails to capture the clustering structure of other students (original text, Fig. 4 h,i black arrows).Those regions with low similarity to year actually have higher similarity to major.

Figure 1: The element-centric perspective naturally incorporates overlaps and hierarchy. a, Three examples of clusterings: a partition, a clustering with overlap, and a clustering with both overlapping and hierarchical structure. b, Cluster affiliation graphs derived from the overlapping and hierarchical clusterings. c, Cluster-induced element graphs found by projecting the cluster affiliation graphs in b to the element vertices. d, The element-affinity matrices found as the personalized PageRank equilibrium distribution. e, The corrected L1 metric distance between each affinity distribution in d gives an element-wise similarity between clusterings; the average element-wise similarity provides the final clustering similarity score. f, A binary hierarchical clustering is compared to each of its individual levels. g, The hierarchical scaling parameter for element-centric similarity acts as a "zooming lens", refocusing the similarity to different levels (1-4) of the hierarchical comparison in f.

Figure 2: Element-centric similarity behaves intuitively in three clustering similarity scenarios while common clustering similarity measures exhibit counter-intuitive behaviors. 1,024 elements are assigned to clusters according to the following scenarios (a-c) and compared using the Jaccard index, adjusted Rand index, the F measure, normalized mutual information (NMI), overlapping normalized mutual information (ONMI), and our element-centric similarity. All results are averaged over 100 runs and error bars denote one standard deviation. a, A clustering with 32 non-overlapping and equal-sized clusters is compared to a randomized version of itself where elements are shuffled. b, A clustering with 32 non-overlapping and equal-sized clusters is compared against clusterings with increasing cluster size skewness. c, A clustering with 8 non-overlapping and equal-sized clusters is compared against a clustering with n non-overlapping, equal-sized clusters and randomized element memberships for different values of n. d, Only our element-centric similarity measure follows the intuitive behavior in all three scenarios.

Figure 3: Element-wise clustering similarity reveals insights into how clusterings differ. a-c, A K-means clustering example. a, The planted clustering. b, The average element-wise similarity between the planted clustering and 100 K-means clusterings. c, The average element-wise similarity between 100 K-means clusterings. d-g, A handwriting classification example. d, The labeled handwritten digit data projected using t-SNE dimensionality reduction for visualization. e, The average element-wise similarity between the labels and 100 K-means clusterings. f, The average element-wise similarity between 100 K-means clusterings. g, Exemplar digits that are consistently grouped as in the ground-truth clustering, consistently clustered differently from the ground-truth clustering, least frustrated, and most frustrated. h,i, Facebook friendship networks for h College A and i College B. The element-wise similarity between user affiliation to school year, dorm, and major compared to Newman's modularity optimized by the Louvain method demonstrates that social networks can be organized by a convolution of different attributes (black vs red arrows). The similarity to school year attenuates with student's status (1st year to 4th year, orange arrows).

Figure 4: Our element-centric similarity better differentiates the overlapping and hierarchical community structure of functional brain networks in healthy and schizophrenic individuals. a, Hierarchical clustering of average pair-wise element-centric similarity using the entire OSLOM hierarchy closely reflects the true classification of participants as healthy (light blue) or schizophrenic (dark blue), while hierarchical clustering of the average pair-wise similarity using ONMI on the bottom level of the OSLOM hierarchy fails to uncover patient classification. b, Classification accuracy using different clustering similarity measures averaged over 100 instances of 10-fold cross-validation; error bars denote one standard deviation. c, The difference in element-centric similarity for each brain region when comparing amongst the healthy controls minus the similarity when comparing amongst the schizophrenic individuals; ROIs within the Fusiform gyrus are more consistently clustered in the healthy controls than in the schizophrenic individuals.

Figure S1: A fourth scenario demonstrates the trade-offs between element randomization, cluster size sequence, and the number of clusters. Two clusterings with random element memberships into $2^3 < c < 2^{10}$ non-overlapping and equal-sized clusters for different values of $c$.

Figure S2: NMI's bias towards the number of clusters is independent of normalization term. The three scenarios from the main text, and one additional scenario described in Figure S1 for different normalization terms of NMI: the minimum of cluster entropies (min), the average of the cluster entropies (sum), the geometric mean of the cluster entropies (sqrt), and the maximum of the cluster entropies (max). See Section S2.5 for the measure details.

$$\begin{pmatrix} a\cos(\theta) & -a\sin(\theta) \\ b\sin(\theta) & b\cos(\theta) \end{pmatrix}, \quad \text{(S28)}$$

where $a$ and $b$ were randomly drawn from the unit interval and $\theta$ was randomly drawn from the range $[0, \pi]$. The third option is the circle centered at the cluster center with radius given by the cluster standard deviation; the points were uniformly spread along the circle and Gaussian noise with mean 0 and standard deviation 0.2 was added to all points. The fourth option is the spiral with points uniformly spread in the range $[0, 10]$, converted to circular coordinates by $(x, y) \to (\sigma\sqrt{x}\cos(x), \sigma\sqrt{y}\cos(y))$, where $\sigma$ is the cluster standard deviation, randomly rotated by the rotation matrix of equation (S28) with $a = b = 1$ and $\theta$ randomly drawn from the range $[0, \pi]$; Gaussian noise with mean 0 and standard deviation 0.2 was added to all points.
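The anisotropic blob can be sketched as follows, using the matrix of equation (S28). The order of operations is an assumption of this example: an isotropic Gaussian sample around the origin is transformed by the matrix and then translated to the cluster center. The function names and the seed are hypothetical.

```python
import math
import random


def transform_matrix(a, b, theta):
    """The 2 x 2 matrix of equation (S28) as nested lists."""
    return [
        [a * math.cos(theta), -a * math.sin(theta)],
        [b * math.sin(theta), b * math.cos(theta)],
    ]


def anisotropic_blob(n, center, sigma, rng):
    """Sample n points of a 2-D anisotropic blob.

    a and b are drawn from the unit interval and theta from [0, pi],
    as stated in the text. Assumption: the transformation is applied
    to zero-mean Gaussian samples before shifting them to the center.
    """
    a, b = rng.random(), rng.random()
    theta = rng.uniform(0.0, math.pi)
    m = transform_matrix(a, b, theta)
    points = []
    for _ in range(n):
        x, y = rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)
        points.append((
            center[0] + m[0][0] * x + m[0][1] * y,
            center[1] + m[1][0] * x + m[1][1] * y,
        ))
    return points


blob = anisotropic_blob(50, center=(20.0, 20.0), sigma=1.0, rng=random.Random(0))
identity = transform_matrix(1.0, 1.0, 0.0)
```

With a = b = 1 the matrix reduces to a pure rotation, which is the setting the text describes for the spiral clusters.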

Table S1: The contingency table $T$ for two clusterings $A = \{A_1, \ldots, A_{K_A}\}$ and $B = \{B_1, \ldots, B_{K_B}\}$ of $N$ elements, where $n_{ij} = |A_i \cap B_j|$ is the number of elements in both cluster $A_i \in A$ and cluster $B_j \in B$.
by both A and B. Intuitively, $N_{11}$ and $N_{00}$ are indicators of the agreement between the two clusterings, while $N_{10}$ and $N_{01}$ reflect the disagreement between the clusterings. The aforementioned pair counts are identified from the contingency table $T$ between two clusterings, shown in Table