Hierarchical Reconstruction of High-Resolution 3D Models of Large Chromosomes

Eukaryotic chromosomes are often composed of components organized into multiple scales, such as nucleosomes, chromatin fibers, topologically associated domains (TAD), chromosome compartments, and chromosome territories. Therefore, reconstructing detailed 3D models of chromosomes in high resolution is useful for advancing genome research. However, the task of constructing quality high-resolution 3D models is still challenging with existing methods. Hence, we designed a hierarchical algorithm, called Hierarchical3DGenome, to reconstruct 3D chromosome models at high resolution (<=5 Kilobase (KB)). The algorithm first reconstructs high-resolution 3D models at TAD level. The TAD models are then assembled to form complete high-resolution chromosomal models. The assembly of TAD models is guided by a complete low-resolution chromosome model. The algorithm is successfully used to reconstruct 3D chromosome models at 5 KB resolution for the human B-cell (GM12878). These high-resolution models satisfy Hi-C chromosomal contacts well and are consistent with models built at lower (i.e. 1 MB) resolution, and with the data of fluorescent in situ hybridization experiments. The Java source code of Hierarchical3DGenome and its user manual are available here https://github.com/BDM-Lab/Hierarchical3DGenome.

different models corresponding with different local optimums. However, because LorDG put a high priority on satisfying contacts with high interaction frequencies, its models often satisfy a major set of contacts with high interaction frequencies and are similar to each other.
However, the ability to construct high-resolution (<=5 KB) 3D models of entire large chromosomes that are needed to study the detailed interactions between genes and regulatory elements, such as enhancers at the genome scale, are still out of reach for most, if not all, of the existing methods. More recently, Rieber, L. and Mahony, S 23 , developed an approximate multidimensional scaling (MDS) algorithm called miniMDS, capable of constructing the 3D structure of chromosomes and genome at high resolution from Hi-C datasets better than most of the existing methods. This algorithm partitions a Hi-C dataset into subsets, performs high-resolution MDS separately on each subset, and then reassembles the partitions using low-resolution MDS. At the time of writing this manuscript, the miniMDS is the only known method that has attempted to build a relatively higher resolution 3D models of an entire chromosome.
High-resolution genome structure modeling has several challenges. Firstly, the structure sampling is much more intensive since the number of chromosomal fragments to be modeled is much larger in high resolution. For instance, the number of chromosomal fragments to be considered at 5 KB resolution is 20 times as many as at 100 KB. Secondly, as the resolution increases, the number of contacts between fragments gets smaller, leaving less contact data for restraining the positions of the fragments. Finally, the search space for high-resolution models is much larger than low-resolution models, making spatial optimization much more complicated. And due to the substantial increase of the model space, models with different topologies in high resolution may satisfy the same chromosomal data. One way to reduce the search space is to require that high-resolution models have a structural topology similar to that of low-resolution models whose structure can be more stably constructed due to the availability of more contact data between larger fragments. In this work, we introduce a hierarchical algorithm to build high-resolution 3D chromosome models at 5 KB resolution by using low-resolution models at 1 MB resolution to assemble high-resolution models of chromosomal domains into full high-resolution models of entire chromosomes. Our results show that the high-resolution chromosome models reconstructed by our method satisfied input chromosomal contacts well, and are consistent with the experimental FISH data.

Methods
Data source. We used the Hi-C contact matrices datasets (GEO Accession Number: GSE63525)) of cell line GM12878 8 for all analyses.
Normalization. In this work, we used two normalizations in two modelling processes. The first one is Knight-Ruiz normalization (KR) method 30 for building high resolution model of individual domains (step 4 in Fig. 1). The second one is iterative correction and eigenvector decomposition (ICE) normalization 31 method for building the low-resolution model of the entire chromosome (step 2 in Fig. 1), where each domain is represented by a point or bead.
Overview of the algorithm. A chromosome is modeled as a string of beads in 3D space, where each bead denotes the midpoint of a DNA fragment at a specific resolution (e.g. 5 KB long). The position of a bead is then represented by three coordinates (x, y, z). Interaction frequencies between beads i, j are converted into spatial  Figure 1. The steps of Hierarchical3DGenome algorithm to reconstruct high-resolution models of chromosomes. The seven steps that Hierarchical3DGenome takes to reconstruct high-resolution models of chromosomes are: (1) break a chromosome into chromosomal domains according to input data and represent each domain as a point or bead, (2) build a 3D model of the entire chromosome at low resolution, (3) create a contact matrix for each domain, such that for n domains there are n contact matrices, (4) build 3D model of high resolution for each domain, (5) scale the 3D Models of the entire chromosome at low resolution to match with the models of the domains at high resolution, (6) substitute beads in low-resolution models with their high-resolution domain models, and (7) refine the high-resoluiton models of the entire chromosome to satisfy inter-domain contacts.
www.nature.com/scientificreports www.nature.com/scientificreports/ distances by the function = α d ij where IF ij and d ij are the interaction frequency and approximate distance (expected distance) between bead i and j, with α as a conversion factor. The goal is to place beads in 3D space so that their pairwise distances satisfy the expected distances converted from interaction frequencies as well as possible.
Our algorithm first reconstructs the 3D model of a chromosome at low resolution, which is used to guide the search for optimal models at high resolution. Each fragment (or point) in low-resolution models represents a contact domain, which is considered a structural unit of chromosome 8 . A chromosomal domain has substantially more contacts within itself than with other domains. Therefore, the accurate models of each chromosomal domain at high-resolution can be reconstructed individually. The models of individual domains are then assembled together according to the overall topology of full chromosomal models at low resolution.
Specifically, our hierarchical algorithm, Hierarchical3DGenome, constructs high resolution chromosome 3D models in seven steps (Fig. 1). The input is a contact matrix of a chromosome at a high-resolution (e.g. 5 KB). In Step 1, the chromosome is partitioned into contact domains (or topologically associated domains (TADs)) using the arrowhead domain algorithm 8 . When a domain contains small domains inside, only this domain is considered because the small domains have been represented by it. Then, each separate domain is represented by a point or bead and the interaction frequencies between beads are calculated to make a low-resolution contact matrix for the entire chromosome. The matrix is normalized by the iterative correction and eigenvector decomposition (ICE) normalization 31 to remove technical biases 32 , biological factors 33 and the different visibility of beads due to their different lengths. This new contact matrix is used to build a low-resolution model of the entire chromosome using our in-house method LorDG 17 in Step 2. We used the default parameter settings in LorDG. The default  www.nature.com/scientificreports www.nature.com/scientificreports/ parameter setting for LorDG algorithm allows it to search for the best conversion factor within the range [0.1, 3.0] with a step-size of 0.1 for an input contact matrix of a chromosome. The high-resolution contact matrices of individual domains are also extracted from the full high-resolution contact matrix of the chromosome in Step 3. The 3D models of each domain at the high-resolution are reconstructed individually in Step 4 (see the detailed description in the Sub Section "Construction of High-Resolution Models for each Domain").
The topology of a correct high-resolution model of a chromosome should be similar to that of its correct low-resolution model, even though it is not guaranteed that they are in the same scale. So, in Step 5, the low-resolution model of the full chromosome constructed in Step 2 is scaled by a ratio so that it can be used to guide the assembly of the high-resolution models of individual domains into a final high-resolution model of the entire chromosome. The ratio used for this scaling is estimated from the models of individual domains and the low-resolution model of the entire chromosome (see the detailed description in the Sub Section "Estimating the Scaling Ratio between High-Resolution and Low-Resolution Models").  www.nature.com/scientificreports www.nature.com/scientificreports/ After the low-resolution model is scaled to match with the scale of high resolution, in Step 6, each bead of the low-resolution model is substituted by a high-resolution model constructed in Step 4 for the corresponding domain that the bead represents. Finally, Step 7 is to further refine the location of the models of the domains to satisfy more inter-domain contacts (see the detailed description in the Sub Section "Model Refinement").

Construction of High-Resolution Models for Each Domain.
To build high-resolution models from the contact matrix of a domain, a good conversion factor (α) to translate interaction frequencies into spatial distances is needed because the conversion function plays a crucial role in determining the quality of reconstructed models. We used the LorDG 17 algorithm to search for the best conversion factor within the range [0.1, 3.0] with a step-size of 0.1 for an input contact matrix. Each domain could have a different conversion factor, therefore, the median of conversion factors from all the domains (e.g. α = . 0 9) was selected as the consensus conversion factor to translate interaction frequencies into spatial distances to build high resolution models of domains with LorDG.

Estimating the Scaling Ratio between High-Resolution and Low-Resolution Models.
To estimate the ratio to scale the low-resolution model to match with the high resolution model, the distance between centers (c x , c y ) of the mass of two adjacent domain models (x, y) is estimated by the following formula: , where d ij is the distance between fragment i in domain x and fragment j in domain y, d xi is the distance between c x and i and d yj is the distance between c y and j (Fig. 2). The rationale is that d xy is always less than or equal to yj . Therefore, d xy can be well approximated by given a sufficient number of fragments i and j are tested. It is worth noting that the adjacent domains are chosen because they have a high enrichment of inter-domain contacts between domains, hence, the distance estimated between them is more reliable.
Given the 3D models of domains, the distances d xi and d yj can be calculated from the coordinates of the centers (i.e. the average of the coordinates of the domain model (x, y)), the fragment i in domain x, and the fragment j in domain y. The distance d ij can be calculated from the formula = α d ij IF 1 using the interaction frequency (IF) between fragments i and j according to the conversion factor (α) found in Step 4.
The distance between the centers of two adjacent domains, d xy , calculated above, are then divided by their corresponding distance in the low-resolution model to obtain a scaling ratio. In total, there will be − n 1 estimated distance ratios where n is the number of domains or beads in the low-resolution model. The final ratio used to scale the low-resolution model is the median of these estimated ratios. The centers of mass of the high-resolution domain models are placed at the locations of the corresponding beads (or points) of the low-resolution model in order to obtain an initial high-resolution model of the entire chromosome for further refinement.
Model Refinement. In the refinement step (Step 7), we used the LorDG algorithm to adjust the coordinates of all the points of the initial high-resolution models of all the domains to satisfy high-resolution chromosomal contacts. Starting from the initial, unrefined model produced in Step 6, both intra-domain and inter-domain www.nature.com/scientificreports www.nature.com/scientificreports/ contacts are used in the optimization to refine it. LorDG uses all contacts to adjust the model to maximize the satisfaction of the contacts. The objective function of LorDG is non-convex and its optimization converges at local optimums. Therefore, the intra-domain contacts that have higher interaction frequency than inter-domain contacts and have already been well satisfied in the initial model are mostly preserved during the optimization. The optimization in the refinement step mostly tries to satisfy more inter-domain contacts to assemble domain models together.

Results
We used the high-resolution Hi-C dataset 8 with our method to build and evaluate the chromosome models at 5 KB resolution. Figure 3 shows a 3D model of Chromosome 11 represented by 26,065 points at 5 KB resolution, which is a 3D model of a human chromosome of the highest resolution to the best of our knowledge. We conducted five tests to evaluate the quality of these models. Firstly, we calculated the correlation between the fragment-fragment distances in the models and the expected distances calculated from contact matrices. Secondly, we checked if our 5 KB-resolution models were consistent with models reconstructed directly from contact matrices at 1 MB resolution. Thirdly, to figure out if domain models were well adjusted to satisfy inter-domain contacts, we extracted contact sub-matrices for every two adjacent domains consisting of both inter-and intra-domain contacts, reconstructed 3D models of the two adjacent domains from the matrices, and then compared them with the corresponding models of the two domains extracted from the full-chromosome models at 5 KB resolution. Fourthly, we investigated if the high resolution chromosomal models were consistent with the FISH data 8 . The comparison shows that our models exhibited the 4 loops on four different chromosomes that were identified from the FISH data. Finally, we compared our method with an existing method for high-resolution modeling.

Correlation between Reconstructed Distances and Expected Distances. We calculated the
Spearman's correlation between reconstructed fragment-fragment distances in the 3D models and their expected distances derived from the input contact matrices (Figs 4 and S1). The average and standard deviation of the correlations are 0.5357 and 0.0397, respectively. Considering the large number of expected distances to be correlated for each chromosome (e.g. 42,000,000 expected distances for Chromosome 1) and a lot of noise and inconsistency in these distances, these correlation values are good and suggest the 3D structures reconstructed models are of reasonable quality. In addition, we performed a comparison of the contact maps derived from each model with the input Hi-C data at varying contact cut-off (Figs 5 and S2). When contacts with low interaction frequencies are removed, the correlations are better. This indicates that low interaction frequencies, unreliable contacts drive the correlations down, and our models put a high priority on satisfying reliable contacts with high interaction frequencies. The correlation increases as the cut-off threshold increases, which suggests an increase in the model consistency with the Hi-C data as contacts with low interaction frequencies are removed.

Consistency between models of 5 KB resolution and 1 MB resolution. At 1 MB resolution,
because contacts often have high interaction frequencies and a fragment is in contact with more other fragments and therefore more restrained, the topology of models is generally more reliable and stable. We compare 5 KB-resolution models with the models built directly from contact matrices of 1 MB resolution to check if their topologies are consistent. To make this comparison, the 5 KB resolution models were zoomed out (reduced the resolution) to 1 MB resolution models. This was achieved by averaging coordinates of points of the same bin which are 1 MB long as in the 1 MB resolution models. We then calculated Spearman's correlation between www.nature.com/scientificreports www.nature.com/scientificreports/ pairwise distances of the zoomed-out model and the 1 MB resolution models (Figs 4 and S3). For all chromosomes, the correlations are >0.72, suggesting that models at 1 MB and 5 KB have the similar topologies.

Evaluation of inter-domain interactions.
To check if domain models were adjusted appropriately to satisfy inter-domain contacts, we built models of every two adjacent domains from their contact sub-matrices extracted from the full contact matrix of a chromosome. We compared these models with their counterparts extracted from 5 KB-resolution full-chromosome models. The boxplot in Fig. 6 shows Spearman's correlations between the models reconstructed individually and those extracted from the 5 KB resolution models for all chromosomes. The high average correlations suggest that the domains in the high-resolution full-chromosome models were generally well adjusted to satisfy inter-domain contacts.
Validation with FISH Data. We validated our models against the Fluorescence In Situ Hybridization (FISH) data 8 . The FISH data identified four chromatin loops in four chromosomes of the cell line GM12878 (Chr.

L1-L2 (Euclidean distance in models)
L2-L3 (Euclidean distance in models) 11 2  Table 2. The IF and distance between the three probes from the Hi-C data and 1 KB models of Chromosomes 11, 13 and 14, respectively. The genomic positions of three probes (L1, L2, and L3) on the four loops of chromosomes 11, 13, and 14, and the distances between these probes in the high-resolution 1 KB models. Columns 2 and 3 report the IF between the three probes from the Hi-C data, columns 4 and 5 report the distances between the three probes from the models. Distances in the models are consistent with IFs in Hi-C data, but they are not small enough to appear like loops.
www.nature.com/scientificreports www.nature.com/scientificreports/ 11, 13, 14 and 17). For each loop, the distances between three probes (or loci), i.e., L1, L2 and L3 on the loop, were measured by FISH experiments. Although the three probes have the same genomic distance, the spatial distance of L1-L2 is much smaller than the spatial distance of L2-L3 according to the FISH experiment.
We calculated the spatial distances L1-L2 and L2-L3 in the 5 KB high-resolution models and confirm that the distance of L1-L2 is indeed much smaller than the distance of L2-L3, indicating that the loops involving L1 and L2 were correctly reconstructed in the 3D models. These distances between the probes are shown in Table 1. The four loops in our models are visualized and shown in Fig. 7.
Validation based on two-compartment feature of chromosomes. Previous studies 1 has shown that the chromosomes can be divided into separate compartments (euchromatin and heterochromatin). We performed the Principal Component Analysis (PCA) on the Hi-C contact matrices to divide the chromosome into two compartments. Afterwards, we annotated the identified compartments with different colors to assess how separable they are in 3D chromosomal models. As illustrated in Fig. 8, regions in the same compartments are spatially grouped together in the models. Figure 8 shows the two compartments of chromosomes 19 and 21.

Comparison with an existing high-resolution model construction method. We compared
Hierarchical3DGenome with the state of the art high resolution model construction method, miniMDS 23 . We calculated the Spearman's correlation between input distances and distances inferred from the output 3D structure of each chromosome at 5 KB resolution produced by miniMDS. Only the non-missing points identified in the miniMDS output structure were used in this calculation. The default parameters of miniMDS were used. The result shows that the correlation is higher for Hierarchical3DGenome for every chromosome than miniMDS, indicating that Hierarchical3DGenome infers 3D structures that are more consistent with the input Hi-C data (Figs 4 and S4). The structure from miniMDS sometimes contains folds or cluttered points with unrecognizable chromosomal features and quite some fragments are missing. In comparison, the features in the structure of Hierarchical3DGenome are well distinguished (Fig. 9a-c versus Fig. 9d-f). www.nature.com/scientificreports www.nature.com/scientificreports/ Chromosome models at 1 KB resolution. We attempted to build chromosomal models at 1 KB resolution for Chromosomes 1, 11, 13, 14 and 17 to test the capability of our method for building models of even higher resolution (Fig. 10). These models satisfied input contacts reasonably well. Spearman's correlations between expected distances and reconstructed distances were 0.44, 0.45, 0.43, 0.48 and 0.53 respectively for these chromosomes. They were also similar to the models at 1 MB resolution, indicated by a high Spearman's correlations (>0.89) between them. However, the detailed shapes of the four loops identified by FISH experiments on chromosome 11, 13, 14 and 17 [8] in the models of 1 KB resolution did not appear as loop-like as they did in the models of 5 KB resolution. In particular, the distances between L1-L2 in the models is not small enough to appear like loops even though they are still smaller than the distances between L2-L3 in all cases as expected ( Table 2). One possible reason is that the input chromosomal contact data is not dense enough to build the loops of 1 KB resolution, which were initially predicted with Hi-C data at 10 KB resolution prior to being verified by FISH experiments. Another reason could be that at 1 KB resolution, the increased level of noise and structural variance in the dataset reduced the prediction performance of the basic modeling tool (e.g. LorDG) used by the hierarchical modeling algorithm in this work. The problem in the former case can be solved if the Hi-C data of higher quality and resolution can be generated. In the latter case, the hierarchical modeling algorithm could be further improved by using a more robust, basic 3D modeler to build domain models. The two issues will be investigated in the future.
Computational requirement and performance. We assessed Hierarchical3DGenome on two server machines: a x86_64 bit Redhat-Linux server consisting of multi-core Intel(R) Xeon(R) CPU E7-L8867 @ 2.13 GHz with 120 GB RAM and a high-performance computing cluster (Lewis) with Linux. For each computational task performed on the Lewis cluster, we allocated 10 cores and 80 G memory. On average on both servers, Hierarchical3DGenome takes about four to eight hours for each chromosome model construction, even though running on the Lewis cluster is faster. To run Hierarchical3DGenome for structure reconstruction on local computers, readers may follow the instructions in the user manual.
Comparison of Hierarchical3DGenome with LorDG. We compared the accuracy and performance of Hierarchical3DGenome with LorDG without chromosome domain partitioning on the high-resolution data. Using a high-performance computing (HPC) cluster machine, we allocated 10 cores, 80G of memory, with a time limit of 2 days for each chromosome structure reconstruction task. Hierarchical3DGenome took 4-8 hours to reconstruct the 5 KB structure for each chromosome (Fig. 11(a,c)). In contrast, for LorDG without domain partitioning, no structural models can be constructed for Chromosomes 1 to 10, and 23 largely because they required more than 2 days for their structure construction to be completed. This indicates that LorDG without domain partitioning is much slower than Hierarchical3DGenome. Additionally, the structures constructed from  www.nature.com/scientificreports www.nature.com/scientificreports/ LorDG without domain partitioning is unrealistically condensed and cluttered, making it difficult to identify the relationship between chromosomal regions at this resolution ( Fig. 11(b,d)).
Moreover, we performed a topology consistency check of the 5 KB models produced by the two methods with the 1 MB models built directly from contact matrices of 1 MB resolution (Table 3). At 1 MB resolution, most contacts have high interaction frequencies, and thus models are very reliable for serving as a reference. All 5 KB chromosome models of LorDG without domain partitioning have the spearman correlations with corresponding 1 MB resolution models less than 0.45, much lower than the correlation of >0.72 for the 5 KB models built by Hiearchical3DGenome. This result clearly demonstrates the limitation of LorDG without domain partitioning and supports the hierarchical reconstruction approach of Hierarchical3DGenome.

Discussion
The reconstruction of high-resolution 3D structures plays an important role in understanding and identifying various detailed interactions within a genome. It has been shown that the genome architecture and spatial structure is crucial for genome activity and DNA functions [34][35][36][37] because the spatial organization facilitates cellular activities such as gene expression and transcriptional regulation. Hence, modeling the chromosome 3D structures at higher resolution facilitates the identification of chromosomal regions such as chromosome territories, compartments, TADs and chromatin loops, and the relationship between these regions.
We introduced a hierarchical modeling algorithm to build high-resolution models of chromosomes by using low-resolution chromosomal models as a framework to position high-resolution models of topologically associated domains (TADs) constructed from Hi-C data. The algorithm was able to successfully reconstruct 3D models of full chromosomes of the human cell line GM12878 at 5 KB resolution. These high-resolution models were consistent with models at 1 MB resolution and FISH data. This algorithm has the potential to reconstruct 3D chromosome models at even higher resolution given Hi-C data of sufficient sequencing depth and quality.
Therefore, through the 3D structures generated by Hierarchical3DGenome researchers can study the relationship between chromosomal regions at a fine-grained scale. By using Hierarchical3DGenome, researchers can also identify structural patterns and relationships between regions related to topologically associated domains (TADs). This is possible because Hierarchical3DGenome takes TADs of chromosomes into consideration during structure construction to preserve important properties exhibited by intra-and inter-TAD contacts. Hence, Hierarchical3DGenome method is useful tool to gain detailed knowledge about chromatin organization and enable a rational interpretation of genome functions based on this organization.

Data Availability
The Java source code of Hierarchical3DGenome, sample datasets, its user manual, and the parameters for running the different analysis are available here: https://github.com/BDM-Lab/Hierarchical3DGenome.