Mapping robust multiscale communities in chromosome contact networks

To better understand DNA’s 3D folding in cell nuclei, researchers developed chromosome capture methods such as Hi-C that measure the contact frequencies between all DNA segment pairs across the genome. As Hi-C data sets often are massive, it is common to use bioinformatics methods to group DNA segments into 3D regions with correlated contact patterns, such as Topologically associated domains and A/B compartments. Recently, another research direction emerged that treats the Hi-C data as a network of 3D contacts. In this representation, one can use community detection algorithms from complex network theory that group nodes into tightly connected mesoscale communities. However, because Hi-C networks are so densely connected, several node partitions may represent feasible solutions to the community detection problem but are indistinguishable unless including other data. Because this limitation is a fundamental property of the network, this problem persists regardless of the community-finding or data-clustering method. To help remedy this problem, we developed a method that charts the solution landscape of network partitions in Hi-C data from human cells. Our approach allows us to scan seamlessly through the scales of the network and determine regimes where we can expect reliable community structures. We find that some scales are more robust than others and that strong clusters may differ significantly. Our work highlights that finding a robust community structure hinges on thoughtful algorithm design or method cross-evaluation.


I. INTRODUCTION
Mammalian genomes fold into a network of 3D structures that facilitate and regulate genetic processes such as transcription, DNA repair, and epigenetics. 1;2;3;4 Most recent discoveries linking genetic processes and genomes' 3D organization derive from chromosome capture methods, such as Hi-C.;6;7 These maps depict chromosomes as having 3D structures on a broad range of scales: megabasescale A/B compartments, 5 sub-compartments (A1, A2, B1, . . ., B4), 8 sub-megabase-scale Topologically Associated Domains (TADs), 9 sub-TADs and short-ranged loops. 8;14;15 However, recently, there has been an emerging research direction alongside this development that takes advantage of the methods developed in complex network theory.;17;18;19 While these and many other community detection methods led to several impactful insights, underneath this approach reside an often overlooked fundamental limitation: in most networks, more than one node partition may represent a feasible network community division.Because this limitation is fundamental to the network, this type of degeneracy exists regardless of the community-finding method.Also, the degeneracy becomes increasingly problematic if trying to detect small-scale communities, where there is a significant risk of over-fitting, or in dense networks, where it is hard to determine node-community memberships with significant certainty. 20his degeneracy problem posits that Hi-C maps' community structure is particularly challenging because Hi-C networks are almost fully connected even if most links are weak.Therefore, we expect that these networks possess several community divisions that cannot be further rated without including new data, e.g., gene expression or epigenetic profiles.Yet more intriguing, this limitation hints that there is a noteworthy probability that community-finding or data-clustering algorithms disagree on the optimal division.;21 This paper explores these limitations by mapping out the landscape of possible network partitions in Hi-C data.To this end, we use the Generalized Louvain Method 22;23 that allows us to detect communities at different network scales.We also developed a method to determine regimes where the solution landscape is degenerate and where we find robust communities.

II. RESULTS
To study the multiscale 3D organization in chromosomes, we use Hi-C data from the B-lymphoblastoid human cell line (see Sec. IV.A for references and data handling).As in other approaches, 16;17;18 we convert the Hi-C data into a network, where nodes represent 10 5 base pair long DNA segments (100 kb), and the links stand for segment-segment 3D interactions, where the weights are associated with the Hi-C contact count.In this study, we focus on chromosome 10.
To partition the network and map out multiscale communities, we use the Generalized Louvain method (Gen-Louvain).GenLouvain separates the network into communities where nodes share more interconnections than some null model (we defer details to Sec.IV.B).To construct a realistic null model, we assume that the segmentsegment contact frequencies decay as a power-law l −α , with linear separation l and decay exponent α.This scaling feature appears in established polymer physics models 24 and in Hi-C data. 25Averaging the Hi-C contacts over many segments gives two regimes: α ≈ 1.08 for long distances (∼ 500-7000 kb), 5;8 and α ≈ 0.75 for short distances (∼ 200-1200 kb). 26See Eq. (3) in Sec.IV.B for how we implement this contact scaling in GenLouvain.
Besides the exponent α, GenLouvain has a scale parameter γ.By varying this parameter, users may scan the network hierarchies and find multiscale communities.Using this approach, we sample feasible partitions of the network.We call the collection of these partitions the solution landscape.

A. Classifying the solution landscape
GenLouvain optimises the modularity quality function Q (Eq.( 3)) to find mesoscale communities with above-average connectivity.Because the community division problem is NP-hard, it is practically impossible to enumerate all network divisions and determine which one is optimal.Instead, GenLouvain finds feasible divisions using a stochastic search algorithm. 27But as with most community detection algorithms, GenLouvain sometimes gets trapped in local quality maxima.We illustrate this trapping schematically in Fig. 1 that shows two wellseparated local maxima, and , overlayed in a quality contour plot.Depending on starting conditions, Gen-Louvain will gravitate to or drift towards .To increase the chance of finding the best-quality partition, we run 1,000 independent optimisation passes using dif- ferent random seeds and compare the Q values.
But for some networks, the solution landscape does not split into two distinct peaks as in Fig. 1.For example, the quality may be nearly identical even in distant parts of this landscape.This means that it is challenging to distinguish the optimal partition since they are degenerate.To detect such degeneracies, we calculate the distance between partitions P and P using the weighted mean Jaccard distance where C P i are the nodes in community i in P . 28Because the distances d P P are not symmetric (d P P = d P P ), we use the average: When d = 0, the partitions are identical.And if d = 1, they are completely dissimilar.We acknowledge that there are other thinkable distance metrics, such as variation of information, but using such metrics will not change the solution landscape's qualitative topology.Next, we classify the solution landscape using the Jaccard distances d and the partition qualities Q.We find three broad landscape categories depending on the variability of d and Q, Var(d) and Var(Q), summarised in Table I.First, if both Var(d) and Var(Q) are low, we find structurally similar partitions of almost the same quality.Second, we find dissimilar partitions of different qualities when both are high.For partitions in the third category (arguably the most interesting case), where Var(d) is high and Var(Q) is low, we may find dissimilar partitions having similar quality where no partition should be preferred over any other.In our notation, this case represents a degenerate solution landscape.The fourth regime (low Var(d) and high Var(Q)) is unsound as we find similar partitions with relatively large quality differences.This means that as long as we find similar partitions, there is no need to study the variability in Q to guarantee that GenLouvain found the global quality maximum.

B. Identifying robust core communities
We identified three solution landscapes in the previous section using the variabilities among the partitions' quality and pairwise distances.However, this only provides a qualitative assessment of the landscape's overall characteristics.Even when there are distinct peaks, there are always some deviations close to these peaks, where node assignments may differ.To quantify these differences, we tessellate the solution landscape by clustering the partitions and determining robust node-community assignments in each cluster.We start by grouping similar partitions into clusters and comparing their sizes and qualities.The partition with the locally highest quality represents the cluster centre.To cluster similar partitions relative to the cluster centre (d < d max ), we use a clustering algorithm, 28 modified to maximise in-cluster similarity.Below, we summarise the main steps: 1. Order all partitions by their quality Q and let the best partition form a cluster centre (Fig. 2a).2. Create new cluster centres with any partitions that are separated by at least d max from any already present cluster centres (Fig. 2b).
3. Assign the remaining partitions to the closest cluster centre (Fig. 2c).
In this procedure, the critical parameter is the distance threshold d max .This value balances the cluster size and partition similarity with the rest of the cluster.In this analysis, we use d max = 0.10, implying that the best-matching communities' weighted average fraction of shared nodes is at least 90 percent.Next, after finding the cluster centres, we study if some network communities are more robust than others.We want to know if specific nodes co-appear in the same community in most partitions within a cluster while other nodes tend to change community memberships.To do this, we first select clusters in the solution landscape with at least 100 partitions (Fig. 3a-b).Then, we search for the largest node subset C i of each community C i in P that is clustered together in at least a fraction p of the other co-clustered partitions. 30We call these subsets core communities of the cluster centre (Fig. 3c).The parameter p balances core communities' size with how many partitions in the cluster that supports them.We use p = 0.9 to compensate for that the partitions in the clusters are allowed to differ by 90 percent on average.

C. Mapping the solution landscape of human chromosome 10
In this section, we study the degeneracy of the Hi-C network for human chromosome 10, applying the results from the previous section (see Sec. IV.A for data handling).Particularly, we wish to know how the solution landscape and core communities change with the parameter α associated with chromatin folding and GenLouvain's scale parameter γ that sets the typical community size (see Sec. IV.B).To make the ensuing discussion less abstract, we express γ as a characteristic community size  II).TADs' effective size is 0.33 Mb.To fit them in the size axis, we show their effective size when omitting TADs smaller than five Hi-C-bins (∼ 0.83 Mb).We visualise the solution landscapes using DensMAP 29 on a contour plot of the quality scores.In panels a-d the distance between any two points is at least dmax.
ŝ (number of base pairs).This change simplifies the analysis, particularly when relating our results to established chromatin divisions.
Since the community sizes are relatively heterogeneous for most γ values, we calculate ŝ using the perplexity of the community sizes (see Eqs. ( 4) and (5) in Sec.IV.C).We choose this metric because it is a better representation of characteristic sizes than the median or the arithmetic mean.We depict the explicit ŝ-γ relationships in Fig. S1 for α = 0.75 and α = 1.08.
In Fig. 4a-d, we plot the solution landscapes for four pairs of α and ŝ, each landscape spanning 1,000 Gen-Louvain runs.Just as in Figs. 2 and 3, we illustrate clusters as markers on top of Q contour plots made using DensMAP. 29Each marker's diameter is proportional to the size of the cluster, and the colour represents the cluster's quality.
The panels (a-d) illustrate typical landscape behaviours.For example, (a) highlights a case where it is hard to find the optimal partition and distinguish the best community division because all partitions have nearly identical qualities but have dissimilar community structures.This leads to numerous size-one cluster centres scattered across the landscape.As pointed out in Table I, we characterise this case as degenerate because there is substantial variability among the cluster centres pairwise distances and low variability in quality (high Var(d) and low Var(Q)).So, in this case, we cannot be sure which cluster centre GenLouvain will gravitate towards from some random initial condition.
For larger community sizes (ŝ ∼ 70 Mb), the solution landscape becomes much easier to analyse because we have only a few large clusters.For example, in (b), GenLouvain recovers the same cluster centre most of the time.Also, around (b), we find the most peaked solution landscapes where all partitions belong to a single cluster.
In panels (a) and (b), we used the looping exponent α = 1.08, which is the genome-wide averaged contact decay in human cells for distances 1 Mb.However, α = 0.75 fits the data better for shorter distances (0.5-1.2 Mb).With this in mind, we made similar analyses as above but for α = 0.75 (Fig. 4c-d).This change made a noteworthy difference for the small communities [panel (c)]: the landscape has a clear cluster centre and a reliable, optimal solution.However, forcing GenLouvain to assemble large communities with α = 0.75 makes it increasingly degenerate up to a point (d) when the solution landscape has a global maximum alongside many local maxima with slightly lower Q.
Apart from these four examples, we made a parameter sweep of community sizes ŝ for α = 0.75 and α = 1.08.But instead of creating landscape plots for each parameter pair, we calculated the Jaccard distances d 1 , d 2 , . . ., d i , . . .[Eq. (2)] between all partition pairs.Then we calculated the simple average MD(d) = E [d i ] and the coefficient of variation CV(Q) of all partition qualities Q 1 , Q 2 , . . . .The middle panel shows how MD(d) varies with ŝ for α = 1.08 (crosses) and α = 0.75 (circles) where we colour-coded the markers using CV(Q).This plot allows us to identify scale regimes where MD(d) is large but CV(Q) is small, which is a hallmark of a degenerate solution landscape.For example, the plot demonstrates that α = 1.08 is not a suitable folding parameter to find reproducible small-scale communities in the range ∼ 1-4 Mb.
In the middle panel, we also indicate ŝ of published chromatin divisions, like TADs (> 0.5 Mb) and A/B compartments (see Sec. IV.A), by vertical dashed and dotted lines.The scales close to (b) (encircled) corresponds to characteristic A/B compartment sizes, ŝ = 66 Mb.Using α = 1.08, this scale is associated with a nondegenerate landscape leading to a reliable partition of the Hi-C network.But interestingly, we note that there seems to be an even better division at a slightly smaller ŝ.This panel also shows that we must use α = 0.75 to find reliable partitions with sizes similar to TADs ŝ = 0.33 Mb.Finally, sandwiched between A/B compartments and TADs, there is yet another commonly used Hi-C division denoted A1, A2, and B1,..., B3.This regime has less reliable communities because the landscape is flatter (exemplified in (d)).After classifying the solution landscape in Fig. 4, we analyzed how robust the partitions are by identifying the core communities across ŝ.As illustrated in Fig. 3, we extract robust communities by first clustering similar partitions and then quantifying the internal cluster differences.We quantify these differences by calculating the fraction of identical node-community memberships.We omit clusters with less than ten percent of the total partition ensemble for a given ŝ-α combination (100 out of a 1,000 partitions).We find robust communities when large clusters have a high fraction of nodes assigned to core communities (note marker sizes in Fig. 5).This finding holds for both folding parameters, α = 0.75 and α = 1.08.Conversely, we find a fuzzy community structure when small clusters have the same relative quality Q/Q max and a small fraction of core-assigned nodes.
For α = 0.75, we observe that the most robust scale is ŝ ∼ 10 0 Mb.Here, one dominating cluster contains more than half of all partitions in which the communities contain nodes interacting primarily over short distances.These communities are mostly unbroken DNA sequences (Fig. S3a) similar to TADs.But there are exceptions.For instance, we find a few large communities that join nodes from linearly separated DNA segments.We illustrate the complete scale-dependent node-community memberships in Fig. S3a.This figure shows how the nodes redistribute between communities when ŝ changes.Apart from observing stable communities (e.g, the beginning of the chromosome), we note that the 3D folding is not perfectly hierarchical, in which smaller communities form larger and larger super-structures.Albeit small, there are deviations that make the folding structure semi-nested. 18r α = 1.08, we detect more than 80 percent core nodes when ŝ > 40 Mb and the most robust scale for ŝ ∼ 100 Mb.But this scale is trivially robust as most nodes are in a giant community (Fig. S3b).A more interesting case is where ŝ ∼ 60 Mb and ŝ ∼ 90 Mb, with the former having a slightly larger fraction of core nodeassignments.While ŝ ∼ 60 Mb is similar to typical sizes of A/B compartments (Fig. 4), we find multiple clusters when ŝ ∼ 70 Mb that have similar quality but with lower core-node fractions.
Overall, we note that GenLouvain can detect reliable core communities at two distinct network scales (ŝ ∼ 1 Mb and ŝ ∼ 60 Mb) depending on the value of the folding parameter α.To investigate if there are other stable network scales, we made a sweep of α values for each ŝ and calculated the mean partition distances MD(d).As shown in the heat map Fig. 6, the most robust regimes are the top-left and bottom-right where MD(d) is the smallest.In the bottom left corner where α ∼ 1 and ŝ is small, we find the most degenerate solution landscape.

E. Established chromatin divisions differ from optimal network communities
In Fig. 4, we indicated typical sizes of a few established chromatin divisions, like large TADs and A/B compartments, by lines.These chromatin divisions have size distributions that differ from typical network communities.To make a better comparison, we varied γ to find the network partition that is most similar to the chromatin divisions, disregarding that the effective size ŝ may differ from ŝTAD or ŝA/B .Then we quantified the similarity by calculating the adjusted mutual information (AMI), commonly used to compare partitions.The AMI is 1 when the two partitions are identical and 0 when inseparable from chance.We summarise the results of our AMI analysis in Table II.
TABLE II Comparing optimal network partitions with established chromatin divisions.We derived the sizes for A1,A2,B1,...,B3 by aggregating A1,2/B1,2,3 segments ("A/B segments") and the A/B sizes from merging A1/. . ./B3 subcompartments (see Fig. S2).Notation: effective size ŝ, and adjusted mutual information (AMI), chromatin looping exponent α.We found no similar partition for A1,. . .,B3.For TADs (Table II, top row), we find the best correspondence when ŝ = 0.77 Mb, which is larger than TADs' effective size ŝTAD = 0.33 Mb.Here, the AMI score is 0.53, indicating that the community structures show significant deviations.This deviation is likely be-cause median TAD sizes are close to the data resolution we use (0.1 Mb).The AMI score is similar for A/B compartments (AMI = 0.47), but the scales match better (ŝ = 66 Mb vs ŝ = 59 Mb).We find the best overlap with the small-scale A 1,2 /B 1,2,3 segments (denoted "A/B segments" in Table II) with ŝ = 1.8 Mb and AMI = 0.72.We do not compare our results with A1/A2/B1/B2/B3 sub-compartments because we cannot detect robust communities in this regime.

Characteristic
Finally, in Fig. 7, we visualise how the nodecommunity membership differs between the A/B compartments and the optimal network partition at ŝ = 59 Mb.We observe that most sub-compartments are isolated into a single network community.But the A2 sub-compartment includes Hi-C bins assigned to the two largest communities.

III. DISCUSSION
Hi-C networks are densely connected.Therefore, finding reliable community structures across various scales is challenging.To better understand this problem, we have mapped out the solution landscape of feasible partitions in a chromosome contact network at different organization scales.We sampled 1,000 partitions using different scale-and DNA-looping parameters to detect regimes as-sociated with robust or degenerate solution landscapes.We classified these regimes in terms of the variabilities of the partition's qualities and pairwise distances.Then we used a partition clustering approach and compared cluster sizes and qualities.Also, studying the proximity of the best-quality partition, we find robust core communities supported by at least 90 percent of the proximate partitions.Finally, varying the looping parameter α We find robust small-scale communities for α = 0.75 and larger-scale communities for α = 1.08, roughly corresponding to TADs and A/B compartments.Between these extremes, we find a regime opaque to community detection methods.
We mapped out the multiscale solution landscape in Fig. 4 and discovered regimes where the landscape is degenerate, as illustrated in panel (a).It is critical to note this degeneracy problem is not easily resolved using another community detection method because there might not exist strong communities in the data at that scale.Therefore, different methods will provide different answers.We circumvented some degenerate scales by modifying the null model's folding parameter.For example, at ŝ ∼ 1 Mb, changing α from 1.08 to 0.75, GenLouvain recovers the same optimal partition most of the time.However, this approach is not straightforward to generalise.Furthermore, we found two distinct regimes in the αŝ parameter space where community detection is easy (in Fig. 6).But this finding does not exclude other robust network scales.In GenLouvain's modularity function, we assumed that node-node contacts decay as a power law with some exponent α.While this is consistent with the average contact decay in Hi-C maps and established polymer physics models (e.g., the Gaussian chain or the fractal globule), there could be other functional forms that better describe the actual folding mechanism or a blend of several competing mechanisms (e.g., short-ranged loop-extrusion and long-ranged phase separation). 31This amounts to improving the null model, which we leave as an avenue for future research.
We found that established chromatin divisions differ from the optimal GenLouvain partition associated with identical characteristic sizes (Table II).Even if sweeping through a range of characteristic sizes, we still find significant differences with the most similar GenLouvain partition.We achieved the best match for A 1,2 /B 1,2,3 segments, and the matching communities are robust.While we cannot reach perfect overlap using one single characteristic size, we point out that it is conceivable to increase the overlap if considering partitions from several ŝ.This indicates that our approach might find most chromatin divisions but not at a single ŝ.This finding helps benchmark our results to other published TAD-finding methods and offers a systematic approach to highlight deviations from expected network partitions under the null model (power law decaying contacts).
While this work focuses on Hi-C contact maps, Gen-Louvain is commonly used to detect communities in a wide range of networks.Therefore, our work is helpful to other researchers searching for robust communities when facing the degeneracy problem.

A. Assembling chromosome contact data
We downloaded Hi-C data for the B-lymphoblastoid human cell line (GM12878) 8 from the GEO database (MAPQG0 dataset, 100 kb resolution). 32The data file contains measured contact frequencies between DNA segment pairs in a cell population.We only consider intrachromosome contacts in our analysis, allowing us to study each chromosome by itself.We interpret the Hi-C data as a weighted network in sparse form, where each node represents a 100 kb DNA segment, and the link weight is the measured contact count.Before constructing the network, we normalise the data using the Knight-Ruiz matrix balancing algorithm.
In addition to Hi-C data, we use datasets associated with existing 3D divisions: 8 A/B sub-compartments and topologically associating domains (TAD) (downloaded from the GEO database 32 ).The sub-compartments divide chromosomes into regions called A1, A2, B1, B2, B3, and B4.While A1 and A2 exhibit high gene expression, B1-B3 are associated with repressed and inactive DNA regions (B4 is found only in chromosome 19 and does not participate in our study as we focus on chromosome 10).Also, functionally similar sub-compartments tend to have correlated contact patterns and are generally referred to as A-and B-compartments.Alongside the sub-compartment, we study TADs.Defined by the Arrowhead algorithm, 8 TADs are genomic regions with above-average contact frequencies, serving as microenvironments for co-regulated genes.TADs appear as squares along the main diagonal in Hi-C maps.

B. Multiscale community detection
To find network communities, we use the Generalized Louvain method (GenLouvain). 23GenLouvain searches for network partitions that maximise the modularity function Q, capturing local deviations from the expected background connectivity.While the most common choice is random connections, better known as the Newman-Girvan null model, 33 we rescale the expected link weights to mimic that nodes are interconnected DNA segments forming a long polymer chain that is folded in 3D inside the cell nucleus. 17Empirical data shows that the average link weight (∝ number of contacts) decays as a power-law with linear node separation.After this modification, the

FIG. 1
FIG.1Solution landscape of network partitions (circles) on a quality contour plot.The (locally) best-quality partitions appear on the landscape's peaks ( and ).The square partition has the highest quality.

FIG. 2
FIG. 2 Partition clusters in the solution landscape.(a) Partitions with different quality and distance to the best quality partition ( ).(b) The first partition separated by at least dmax from any cluster centre forms a new centre.This process repeats until all clusters are separated by at least dmax.(c) All partitions are assigned to the nearest cluster.

FIG. 3
FIG. 3 Identifying core communities in a cluster centre.(a) The best-quality cluster X (large ) in the solution landscape.(b) The cluster centre P and co-clustered partitions P1, P2, . . ., P k inside the cluster with possibly different community assignments.(c) Core communities of the best partition P are found in a fraction p of the co-clustered partitions.

FIG. 4
FIG.4 Solution landscapes at different scales.Mean difference (MD) of pairwise partition distances d for different α (main panel), surrounded by selected solution landscapes (a-d).The marker radius and colour are proportional to the quality Q's coefficient of variation.As vertical lines, we show the effective sizes of chromatin divisions (summarised in TableII).TADs' effective size is 0.33 Mb.To fit them in the size axis, we show their effective size when omitting TADs smaller than five Hi-C-bins (∼ 0.83 Mb).We visualise the solution landscapes using DensMAP 29 on a contour plot of the quality scores.In panels a-d the distance between any two points is at least dmax.

FIG. 5
FIG.5Cluster sizes and the fraction of core nodes.Core nodes are clustered in at least 90 percent of the cluster's partitions.We only show clusters with at least 100 partitions.

FIG. 6
FIG. 6 Mean absolute difference (MD) of pairwise partition distances d for chromosome 10 for different α.

FIG. 7
FIG. 7 Core communities and A/B compartments of chromosome 10.The leftmost column represents 5 Mb bins coloured by position.The middle column represents the partition most similar to the A/B compartments (ŝ = 66 Mb), with communities ordered by average position.Transparent segments do not belong to the core.The right-most column represents A/B sub-compartments.
FIG. S3 Alluvial diagram of core communities of chromosome 10 for α = 0.75 and α = 1.08 at different scales.The left-most column represents linear bins coloured by position.The remaining columns represent community structure at different scales, vertically ordered by their average position and coloured by the positions of their contained segments.Transparent segments are not in the core.