Integrative genome modeling platform reveals essentiality of rare contact events in 3D genome organizations

A multitude of sequencing-based and microscopy technologies provide the means to unravel the relationship between the three-dimensional organization of genomes and key regulatory processes of genome function. Here, we develop a multimodal data integration approach to produce populations of single-cell genome structures that are highly predictive for nuclear locations of genes and nuclear bodies, local chromatin compaction and spatial segregation of functionally related chromatin. We demonstrate that multimodal data integration can compensate for systematic errors in some of the data and can greatly increase accuracy and coverage of genome structure models. We also show that alternative combinations of different orthogonal data sources can converge to models with similar predictive power. Moreover, our study reveals the key contributions of low-frequency (‘rare’) interchromosomal contacts to accurately predicting the global nuclear architecture, including the positioning of genes and chromosomes. Overall, our results highlight the benefits of multimodal data integration for genome structure analysis, available through the Integrative Genome Modeling software package.


Table S1
List of the five data sources used in genome population modeling (Hi-C, lamina DamID, 3D HIPMap pairwise and radial FISH, and single cell SPRITE), see also Fig. 1. The k-th data source is associated with its interpretation, its data variable and its latent variable * . indicates the number of domains, (> ) indicates the total number of genomic loci (two homologous copies are distinguished), is the number of distance bins and is the total number of structures in the population.
Given { }, we aim to estimate the structure population model such that the likelihood In the following section, we define the mathematical formulation of ( * | ) and ( | * , ) for all K data sources.
We will use the same notation from Methods and Table S1. ⃗ ⃗ = ( , , ) ∈ ℝ 3 is the coordinate of locus in structure of the population, and we use ‖… ‖ 2 to denote the Euclidean norm of a vector.
Note that capital letter indices, such as I and J, relate to genomic regions without distinguishing between two homologous copies, while lower letter indices i, i' and j, j' distinguish between the two copies.
Let us also assume that = ̅ is an integer labeling the bin into which distance ̅ falls, e.g. ̅ ∈ ℬ = ̅ .

Distributions of radial positions of genomic regions from 3D HIPMap FISH experiments
Expansion of ( | ) The .
Note that, without losing generalization, we use the origin (0,0,0) as the nuclear center, thus the Euclidian norm of the coordinate vector of chromatin ‖⃗ ⃗ ‖ 2 , is equivalent to its distance from the nuclear center. and +1 are the lower and upper bound of radial distances that define the radial distance range in bin ℬ .
′ is calculated from as where ̅ = ( ̅ ) × × is the "projected radial position tensor", which is derived from by projecting its representation (with homologous copies) to its counterpart without distinguishing between homologous copies. We then model each as = ′ + , where are independent and identical normally distributed random variables with mean zero and =∝ (0, 2 ) . ( is effectively set to 0).

Probability of chromatin contacts with the lamina from Lamin B1 DamID data (univariate data)
Expansion of ( | ) We introduced the binary valued latent tensor = ( ) × , which indicates whether the i-th chromatin domain is in contact with nuclear lamina in the s-th structure of the population ( = 1) or not ( = 0). Again: The numerator of the exponential approximates the distance of a genomic region to the nuclear envelope.
are the nuclear ellipsoid semi axes and > 0 is a threshold defining the interaction proximity of a genomic region with the nuclear lamina (see also "Simulate structural observables from a population of genome structures" and Supporting Information for more details). If the cell shape is spherical, such as in GM12878 cells, the calculation of the normal distance to the nuclear envelope can be performed exactly, and we would recover the probability distribution from Li et al. 2 Expansion of ( | , ) ( | ) can be expanded as ( | , ) = ∏ ( | ′ ) where ′ is the lamina-chromatin contact probability of the genomic region I computed from as where ̅ = ( ̅ ) × is the "projected lamina-chromatin contact tensor", which is derived from the latent lamina contact tensor by projecting its representation (with homologues copies) to its counterpart without distinguishing between homologous copies. We then model each as = ′ + , where are independent and identical normally distributed random variables with mean zero and =∝ (0, 2 ) . ( is effectively set to 0).

Pairwise Distance distributions from 3D FISH HIPMap experiments (bivariate data)
Expansion of ( | ) Pairwise distance distributions are cast in the form = ( ) × × , with being the probability that genomic loci and have a distance in the range defined by bin ℬ = [ , +1 ).
For each pair of genomic regions ( , ), the all-distance normalization ∑ = 1 holds. The corresponding latent variable is the indicator tensor = ( ) × × × and ( | ) is approximated as: Which now depends on two different genomic loci and . The probability associated with entries of tensor is then defined as: Expansion of ( | , )

Chromatin contact probabilities from Hi-C data (bivariate data)
Expansion of ( | ) We introduced the latent contact indicator tensor = ( ) × × for complementing every single cell's contact information. We approximate ( | ) as We assume a pair of chromatin domains (i, j) have a contact in structure s, if and only if is the excluded volume radius of sphere (Supplementary Information). Following Tjong et al. 1 we have: Expansion of ( | , ) Given = ( ) × as the contact probability derived from Hi-C data, also ( | ) can be expanded as , where ′ is the contact probability of the genomic regions I and J computed from as where ̅̅̅ = ( ̅ ) × × is the "projected contact tensor", which is derived from by projecting We then model each as = ′ + , where are independent and identical normally distributed random variables with mean zero and =∝ (0, 2 ) . ( is effectively set to 0).

Chromatin multi-way contacts from single cell SPRITE data (multivariate data)
Expansion of ( | ) We introduced the latent indicator tensor Since in principle any data point is independent on others: The product ∏ runs over the different clusters with loci. The probability of multi-way colocalization reads: The loci that co-localize define a cluster geometric center . We write the probability as: is the cluster radius, which we define in the Supplementary Information, and runs over the participating loci. In our modeling, multi-way clusters are interpreted as a product of pairwise contacts between each of the participating loci and the cluster centroid.
Expansion of ( | , ) as the multi-way contact probability of -loci derived from single cell SPRITE data, we can write where ′ 1 ,…, is the multi-way contact probability of the genomic regions 1 , … , computed from as 1 ,…, where ̅ = ( ̅ 1 ,…, , ) × is the "projected multi-way contact tensor", which is derived from by projecting its representation (with homologues domain copies) to its counterpart without homologous domain distinction. ) ( 1 ,… is effectively set to 0).

All clusters with arbitrary number of loci
The formulation can be easily extended to account for clusters of different sizes . Each is associated with a , so for example: where we multiply over the different cluster sizes. Overall, the notation ∏ ∏ denotes the product over all individual clusters.

Gathering all the termsformulation of the conditional probability function
With all probabilistic models in place we can now maximize the log-likelihood (see also In addition to the 5 data sources from 4 experimental methods (Supplementary Table S1), we also include a set of spatial constraints based on additional information about the genome organization. These data are included in form of general spatial constraints acting on chromatin domain domains: (i) a nuclear volume confinement restraint that forces all chromatin domains to be inside the nuclear volume, (ii) excluded volume restraints that prevent "hard core" overlap between any two chromatin domains and (iii) a polymer chain connectivity restraint between chromatin domain neighbors in a chromosome, which guarantees the structural integrity of the chromosomal chains. Additional information about these restraints is given in the Supporting

Information.
In summary, the maximum likelihood problem is formally expressed as follows, We will now detail how the optimization is performed by describing the Assignment and Modeling steps of the optimizations as implemented in the Integrated Genome Modeling platform for each data source.

A/M optimization: Assignment Step
Starting from the current population of structures ( ) , which resulted from a previous A/M optimization iteration, the optimal latent variables +1 , +1 , +1 , +1 , +1 for the following Modeling Step (see Methods) are fully determined by solving: For the sake of generality, we will assume each genomic region comes with two copies ( , ′ ), but the discussion can easily be adjusted for single copy loci.
Also, let [ : ] indicate the ranking function that returns the rank of the element ∈ when the t-uple is sorted in ascending order.

3D radial HIPMap FISH data
Assume the distributions of radial distances have been already assigned into distance bins {ℬ } and cast into the × variable. By using the information in the data × and the population ( ) , we determine the optimal latent variable × × +1 . (we will drop the + 1 superscript in the following, since there is no ambiguity).
Let us denote with all the entries of matrix that are associated with genomic locus . If { | = 1, … , } denotes the set of discrete distance values (from experiments for example) that define the minimal radial distribution of region (Methods), those distances are mapped onto bins that are identified by the integers = { = }, such that 1 ≤ 2 ≤ ⋯ (recall that the integer = ̅ labels the distance bin into which distance ̅ falls, e.g. ̅ ∈ ℬ = ̅ ). is non-zero on those bins only.
We compute all the radial distances of both copies ( , ′ ) in all structures of the population , and select the minimal distances and sort them in ascending order; we obtain a sequence of ordered (copy of domain , structure index) pairs: The latent variable is then populated as follows: Basically, = 1 only if the pair ( , ) has the same rank in the list of ordered pairs as the th distance bin does in the list of ordered distance bins . If ranks are different or does not label a target distance bin (i.e., ∉ , } ), then = 0.
The same procedure applies to assign distributions of radial maximal distances.
This maximizes the probability log ( , | ) = log ( | , ) log ( | ). Sorting both the distances in the models and in the input and running a one-to-one assignment makes sure that deviations from the target distances are smallest (maximizing log ( | )) and also makes equal to ′ , therefore maximizing log ( | , ) (see Methods).
Lamina DamID: By using the information in the data ×1 and the population ( ) , we determine the optimal latent variable × +1 .
We compute the distances of both copies of domain to the nuclear envelope (NE) { (⃗ ⃗ , ), (⃗ ⃗ ′ , ) | = 1, … , }) from population ( ) , and sort them in ascending order. The (2 ⋅ ) ℎ ranked value is chosen to be a lamina activation distance . The latent variable is then populated as follows: by counting the number of all pooled distances that are larger than the activation distance.
Computing the distance to the lamina is detailed later in the "A/M optimization: Modeling Step" section.
This procedure maximizes log ( , | ) = log ( | , ) ⋅ ( | ). This is true for two reasons. It assigns contacts only to domains with shortest distances to the lamina, therefore maximizing This maximizes the probability log ( , | ) = log ( | , ) ⋅ log ( | ). Sorting both the distances in the models and in the input and running a one-to-one assignment makes sure that deviations from the target distances are smallest (maximizing log ( | )) and also makes equal to ′ , see Methods.
Ensemble Hi-C By using the information in the data This procedure maximizes log ( , | ) = log ( | , ) ⋅ ( | ). This is true for two reasons.
It assigns contacts only to locus pairs with shortest distances, therefore maximizing ( | ); also, the distance threshold method heuristically maximizes the first term log ( | , ) = ∑ log ( , ′ ) by making equal to ′ . (See also Tjong et al. 1 )

SPRITE
Using the information in the data = ( 1 ,…, ) and the population ( ) , we determine the latent variables = ( 1 ,…, , ) × . For convenience, we will discuss Assignment for one cluster = ( 1 , … , ). There are two steps: first, we determine which copies of the genomic regions make up the cluster, and then to which structure in the population the cluster is assigned.
Let us assume that the cluster involves ℎ ≤ different chromosomes, and that all chromosomes come with two homologous copies. The two copies associated with domain 1 will be uniquely identified as ( 1 , ℎ = 1) and ( 1 , ℎ = 2). To avoid complications due to number of all possible diploid representations of a size cluster scaling as 2 , we additionally assume that loci mapping to the same chromosome are also physically on the same copy. We then solve the minimization: . . ℎ = ℎ , ℎ Where ℎ and ℎ indicate the copy indexes. The optimal copy indexes are those which minimize the radius of gyration   ) in a given structure, by forcing regions from the same chromosome to have the same copy index ℎ. The minimal radius of gyration is obtained for (ĥ 1 = 1, ĥ 2 = 1, ĥ 3 = 2), so the diploid representation of cluster 3 in that structure is ̃3 = {1,2,220}.
The Assignment step concludes with assigning the cluster to one structure in the population ( ) .
We collect optimal diploid representations ̃[ ] of the cluster in each structure of the population.
In order to avoid overloading a few structures with many clusters, we draw the optimal structure index from the Gibbs distribution: where is the integer number of clusters already assigned to current structure ("structure occupancy"), is the Heaviside step function and ⟨ ⟩ = / is the ratio between number of structures in the population and the total number of clusters to be assigned. Note that a non-zero penalization is applied only when the current structure occupancy is larger than ⟨ ⟩.
With all of this in mind, the SPRITE latent variable is finally populated as follows: Basically, for each cluster of size , 1 ,… , = 1 only if the -tuple ( 1 , … ) is the optimal diploid representation of the cluster (in the sense discussed above) in structure , which is drawn from the probability distribution ( ).

A/M optimization: Modeling Step
Starting from the current latent variables +1 , +1 , +1 , +1 , +1 which resulted from a previous A-step (see Equations S4-S8), the optimal coordinates +1 = ( 1 +1 , … , +1 ) of all the diploid genome structures for the following A/M iteration are fully determined by solving: The effect of each non-zero latent variable entry is modeled as an appropriate energy term (i.e., restraint), which is added to the genome energy function, and which is compatible with the probability representation of the data sources given in the Methods. Each energy term involves one (univariate), two (bivariate) or more (multivariate) chromatin loci, and actively constrains their distances; each interaction is associated with a positive scalar (residual error) which monitors to what extent a given restraint is violated. Specifically, > 0.05 indicates that a given distance in the model is off by more than 5% with respect to the expected distance from the input: we will call that a violation in the model. We now detail the form of the energy terms we use to model each data source.
3D radial HIPMap FISH data = 1 indicates that locus must have radial distance from nuclear center in distance bin The residual error (0 ≤ ≤ 1) is defined as: Ensemble Hi-C The latent variable = 1 indicates that loci and form a contact in structure . We then only apply an upper harmonic bound: 0, ℎ SPRITE 1 ,…, , = 1 indicates that loci 1 , … , colocalize in structure . We introduce a massless particle (centroid, no excluded volume effect) in structure located in the cluster geometric center, . We introduce the following spatial restraint:  Figures 1 and 2), and loci are harmonically restrained to the centroid in order to promote geometric integrity. Each of the terms in the summation is associated with a residual error (see also "Ensemble Hi-C").

Synthetic data simulations
Simulated Hi-C, lamina DamID and FISH (both radial and pairwise) data were extracted from the ground truth population as detailed in the Methods section. In particular, FISH probes were selected randomly across all chromosomes, FISH pairs ( , ) were selected by first downsampling the 15453 haploid loci using a uniform stride of 5, building all the pairs out of those, and then sampling randomly across this data set. A couple of technical remarks are in order. First, the same exact set of simulation parameters was used to generate all populations, in order to ensure consistency: eventually, all simulated populations satisfy more than 99.995% of the imposed restraints.
Second, data are incorporated into a population in a fashion similar to the steps adopted for real data, by resorting to decreasing optimization thresholds (see also Extended Data Figures 1-2).
In particular, non-zero final threshold values are used here too, which represent the highest accuracy we are confident the models could achieve. This reproduces the standard scenario where data are only known within a confidence range, and probability values that are too small could be misconstrued as noise or a perturbation.

Iterative contact refinement step
Since loci that are not expected to form a contact are not explicitly being restrained from doing that, excess contacts cannot be prevented, which are shown to lead to excessively compact structures that cannot be relaxed easily. A heuristic method has been put it place to compensate for such an effect, by allowing to assess the actual portion of expected contacts that are available for allocation, which then affects the way the activation distance is computed in Hi-C and DamID assignment steps. The empirical procedure relies on the predicament that when a population expresses more contacts than it should, we reduce the assignment probability: a lower probability (with fewer restraints) should be equally effective.
Assume the expected number of contacts is the sum of a number of effective contacts that are actually imposed, , and a number of incidental contacts (that are also expressed but not imposed), : The number of incidental contacts is here expressed as a fraction of the number of non-applied (non-enforced) contacts. The latter term can originate from cooperative effects, which automatically bring loci closer without an explicit bonding term operating. We can solve for . This is an effective probability which controls the number of restraints to be enforced. Now, we need an estimate for the scalar . , which is then used to update the activation distance . Please note that the correction is only implemented if there is a contact excess.

HFFc6 experimental data pre-processing
Ensemble Hi-C data 7 4DN portal identifier: 4DNES2R6PUEK We used in situ Hi-C datasets from HFFc6 cell line (reference genome hg38) 7 . Similar to the protocol by ref. 8 , low bin sequence coverage 3% regions were discarded during the normalization process. For data normalization, we adopted the same Knight-Ruiz normalization method used in ref. 9 , leading to a normalized contact frequency matrix = ( ) × at 20 kb resolution.
Contact frequencies at 20kb resolution are converted to contact probabilities by scaling frequencies with a normalization factor f max , which is chosen to represent the contact frequency value at which two bins have a 100% probability to form a contact. After obtaining the contact probability matrix at 200-kb resolution, we identified bins that have spurious and isolated inter-chromosomal interaction probabilities higher than 0.2 (only 50 out of ~187 million possible inter chromosomal interactions), and removed the corresponding bins in the 20-kb raw matrix (290 out of 154,423 bins).
We then repeated the entire process, including KR normalization at 20kb resolution, and regenerated the 200-kb contact probability matrix where no inter-chromosomal contact probability is higher than 0.2 following the same procedure explained above.
Finally, probabilities for nearest neighbors are set to 1 to ensure chain integrity.
This hierarchical procedure for generating 200kb probability matrices ensures that two contact maps at different resolutions are consistent with one another, by default.
We performed extensive benchmarking by generating multiple populations with different setups and assessed the quality of each setup by predicting orthogonal experimental data. Our analysis revealed that the best setup requires an up-scaling of all the inter-chromosomal interactions by a factor of 1.5. We observed that an up-scaling of inter-chromosomal contact probabilities can be be circumvented by integrating other data modalities, which confirmed our results from the synthetic calculation section, which showed that data biases can be corrected through data integration.

Estimating lamina contact probabilities from Lamina DamID signals
For Lamina DamID data processing we adapt our established protocol as described in Li et al. 2 We convert the experimental lamin B1 DamID signals 10 (indicated as for genome domain ) into an array of lamina contact probabilities . Chromatin regions with a baseline signal larger than 1 ( ≥ 1.0, Dam-LaminB1/Dam ratio) would have a lamina contact probability larger than 0, therefore we set all signals lower than 1 to 0. Then, lamin contact probabilities are computed as follows:

= ̅̅̅̅
Where ̅̅̅̅ is the genome-wide average of DamID signal 10  are computed from the raw data 10 as follows: Each region in the TSA-seq data was mapped to 200-kb regions with an overlap of 50% or higher.
After mapping, each 200-kb region had multiple TSA-seq regions, therefore the signals mapped to each 200-kb region were averaged (we first took the inverse log2 of the signals, then averaged and took the log2 of the average value).  ) and standard Pearson's correlation coefficients between experimental input and output chromosomal Hi-C maps from structures in the HDSF population (see Figure 2).