Evaluating the role of the nuclear microenvironment in gene function by population-based modeling

The nuclear folding of chromosomes relative to nuclear bodies is an integral part of gene function. Here, we demonstrate that population-based modeling—from ensemble Hi-C data—provides a detailed description of the nuclear microenvironment of genes and its role in gene function. We define the microenvironment by the subnuclear positions of genomic regions with respect to nuclear bodies, local chromatin compaction, and preferences in chromatin compartmentalization. These structural descriptors are determined in single-cell models, thereby revealing the structural variability between cells. We demonstrate that the microenvironment of a genomic region is linked to its functional potential in gene transcription, replication, and chromatin compartmentalization. Some chromatin regions feature a strong preference for a single microenvironment, due to association with specific nuclear bodies in most cells. Other chromatin shows high structural variability, which is a strong indicator of functional heterogeneity. Moreover, we identify specialized nuclear microenvironments, which distinguish chromatin in different functional states and reveal a key role of nuclear speckles in chromosome organization. We demonstrate that our method produces highly predictive three-dimensional genome structures, which accurately reproduce data from a variety of orthogonal experiments, thus considerably expanding the range of Hi-C data analysis.

. Pearson and stratum adjusted correlation coefficients between input/output for each chromosome.
-Table S2.List of experimental data sources used in this work (and relating accession numbers)                           where  ,  , is the number of regions from subcompartment  in shell   in structure ,   is the total number of regions in subcompartment , and  is the total number of structures.
iii.Comparison with GPSeq: GPSeq scores 7 are rescaled to have values between 0 -1, where scores 0 and 1 correspond to a chromatin region being at the nuclear lamina and nuclear center, respectively 7 .Average radial positions extracted from our structures vary between 0.48 -0.94 with higher values corresponding to proximity to nuclear lamina.For comparison with GPSeq, we subtract the average radial positions from 1 and then rescale the values to be between 0 -1.
iv.Average radial positions of regions from different replication phases: Genomic regions are divided into 6 groups (G1b, S1, S2, S3, S4, G2) based on their mapped replication phases 8 .For each group, the distribution of the average radial positions is then determined from the structure population.
Other RG related analysis i. TAD border detection: To investigate if chromatin regions with maxima in RG profiles coincide with TAD borders, we first identify peak regions in the average RG profiles with the detect_peaks python package 14 .2068 peak regions are detected genomewide with minimum peak distance (mpd) set to 3 (peaks must be at least 3 data points/600-kb apart from each other).We then check if these identified regions coincide with TAD borders detected by TopDom 15 , HiCseg 16 , InsulationScore 17 , and TADbit 18 .We determine an overlap if there is a TAD border within 200-kb window of a peak region.
ii. RG peak frequency: Peak regions in the RG profiles are detected in each individual structure using detect_peaks python package 14 with same parameters as in the previous section.The RG peak frequency (PF) of a region  is then calculated as: where   and   ′ are the number of structures in which region  and its homologus copy has an RG peak, and  is the number of genome structures in the population.
Other SpD, NuD related analysis Speckle distance heatmaps: A speckle distance heatmap for a chromosome visualizes, for a given chromatin region, the speckle distance variability across the population of models.For each copy of a chromatin region, the distance to the nearest predicted speckle is calculated in each structure of the population.These distances (20,000 distances total due to 2 copies and 10,000 structures) are ranked from lowest to highest values and plotted along a column of the speckle distance heatmap and color coded according to the distance.Colors range from low distance (red) to large distances (blue).
Other SAF, LAF, NAF related analyses i. Predicting lamin B1 DamID signals using LAF: The predicted laminaDamID signal of region  is calculated as: where  ̅̅̅̅̅̅ is the mean lamina association frequency calculated from all regions in the genome.
ii.Comparison with imaging data: We compare our SAF, LAF and NAF values with imaging data 12 .To calculate association frequencies from imaging and models, we use different distance thresholds (250, 500, 750, 1000 nm distance thresholds for SAF and LAF when calculated from imaging or models, and additional thresholds of 1250, 1500, 1750, 2000 nm for LAF when calculated from models) to define an association to the nuclear body of interest.We find that the best correlations are obtained when the following distance thresholds are used: -SAF: 500 nm for imaging, 750 nm for models -NAF: 1000 nm for imaging, 1000 nm for models -LAF: 1000 nm for imaging, 2000 nm for models For SAF comparisons, we use the predicted speckle partitions from interior regions (Case 2 for speckle partitions in Identifying spatial partitions via Markov clustering).
Other TSA-seq related analysis:

i. Predicting SON TSA-seq signals using only cis relationships in folded chromosomes:
To identify contributions of cis interactions in SON TSA-seq signals, speckle locations are defined by the geometric center of consecutive A1 sequence blocks formed by more than 1 A1 chromatin region (instead of the geometric center of A1 spatial partitions, which can be formed by both cis and trans chromosomal interactions).For single A1 regions, the bead center location is used instead.For each chromatin region, we then calculate its spatial distances to these predicted speckle locations in the folded chromosome, which are used to predict the resulting TSA-seq signals from cis interactions only.
ii. Predicting SON TSA-seq signals using only cis relationships in random conformations: We also repeat the same calculations as defined in the previous section, but instead of the folded chromosomes, use models with random chain configurations, generated without Hi-C data (i.e.only chain connectivity and excluded volume).TSA-seq data is calculated accordingly from the corresponding distances based on the random polymer chain configurations.These distances are then used to predict SON TSA-seq signals as defined above.

iv. Histone modification histograms based on predicted SON TSA-seq deciles:
Following the procedure described in ref 3 , we divide the 200-kb chromatin regions in our models into 10 decile groups based on their predicted SON TSA-seq signals; deciles 1 and 10 contain regions with the lowest and highest 10% predicted TSA-seq signals, respectively.We then count the number of mapped peaks of H3K27me3, H3K4me3, and H3K9ac as well as the number of A1, A2, A1+A2 regions in each decile, and calculate the fraction of histone modification peaks or A1/A2 regions accrued in each decile.For mapping histone modification peaks to 200-kb bins to match our models' resolution, see Mapping experimental data to models in Supplementary Information.Same histograms using experimental TSA-seq deciles are re-generated from Fig. 8 in ref 3 using WebPlotDigitizer 19 .

Comparison of gene expression with structural features ▪ Transcription frequency
Transcription frequency (TRF) of each gene in the single cell RNA-seq (scRNA-seq) data is defined as the fraction of cells in the population of cells, where the gene has non-zero mRNA transcription counts in the scRNA-seq data 11 .TRF is also calculated from the recently published nascent RNA-MERFISH imaging data as the fraction of cells where the gene is transcribed (transcription: on) in the population of imaged cells 12 .

▪ Gene expression heatmaps
Gene expression heatmaps for each chromosome visualize the variability of mRNA counts (the expression levels) for each gene in a population of cells 11 .For each chromatin region, the observed mRNA count in each cell of the population of models is ranked from highest to lowest values and plotted along a column.Colors ranged from high mRNA counts (red) to 0 (dark blue).

▪ ROC curve for assessing performance to classify lowly or highly expressed genes
We first identify the top 10% (T10) and the bottom 10% (B10) genes with the highest and the lowest total non-zero mRNA counts (i.e. gene expression values) in the scRNA-seq data 11 .Several structural features (mean radial ILF, mean speckle distances, SAF, variability of radial positions and speckle distances) are then calculated for all chromatin regions mapped to T10 genes and B10 genes.
To determine the most informative structural features for distinguishing T10 genes from B10 genes, we perform receiver operator characteristic (ROC) analysis.Specifically, for each structural feature, we define 10 threshold levels, equally separating the range of values for each structural feature.Then we determine how well the gene in the T10 and B10 groups are separated by each threshold value by calculating the corresponding number of true positives/negatives (TP, TN) and false positive/negatives (FP, FN).
For each structural featue  and for each threshold level, , the true positive rate (TPR) and false positive rates (FPR) are then calculated as The ROC curves are then plotted for each feature using TPR/FPR values.

Other structural analyses ▪ Experimental GRO-seq and TSA-seq data analysis
Averaging TSA-seq and GRO-seq signals in concentric shells around subcompartment partitions: To quantify average TSA-seq 3 and GRO-seq 6  Note that this measure only relies on the geometric position of a partition center and the folded genome (i.e.calculates average gene expression from all chromatin in a shell, independent of subcompartment annotations).

▪ Neighborhood composition
The neighborhood composition (NeC) shows how frequent chromatin regions in different subcompartments are in spatial proximity to regions of a specific subcompartment.The average percentage of subcompartment  in the neighborhood composition of subcompartment  in the population is calculated as: where  is the number of structures in the population,   is the number of 200-kb regions belonging to subcompartment , { , } is the set of 200-kb chromatin regions in the neighborhood of the region  in structure , and  ,, is the number of chromatin regions from subcompartment  in the set { , }.We define the neighborhood of  in structure  as  , = {:  ≠ ,   < 500 }, which contains the list of all chromatin regions with less than 500 nm center-to-center distance (  ) to chromatin region .
The neighborhood composition enrichment (NeCE) of subcompartment  in the neighborhood of subcompartment  is calculated as: where   is the neighborhood composition percentage calculated for subcompartment  in the neighborhood of subcompartment  and the denominator is the average percentage of subcompartment  observed in the neighborhood of all subcompartments.
Values greater than 1 (  > 1) indicate that subcompartment  is enriched in the neighborhood of subcompartment , whereas values lower than 1(  < 1) show depletion of  around .

▪ Enrichment heatmaps for various features
Enrichment of structural features, experimental TSA-seq, DamID, and GRO-seq signals, and histone modifications in various groups: To identify structural feature or experimental signal enrichments for chromatin in different groups (subcompartments, TSA-seq deciles, superenhancers, enhancers, replication phases, A/B-LV/HV groups, and T10/B10 genes), we first normalize each feature value to range between 0 and 1.We then calculate the enrichment of a structural feature , for group  as: where   is the number of 200-kb chromatin regions in group ,   is the structure feature value for chromatin region .For  ̅  , we first randomly select the same number (  ) of regions in the genome and calculate the average feature value, and repeat this step 1000 times.For the enrichment of histone modifications in A-LV and A-HV groups, we randomly select the same number of regions only from regions in compartment A. We then take the average of 1000 different average feature values calculated from randomly selected regions.
For visualization purposes, we reverse the ranges of radial positions, mean-speckle, and mean-nucleoli distances in the structural feature enrichment heatmaps, so lower values would be indicated with red.

Enrichment of replication phases, LADs, and subcompartments in A-LV, A-HV, B-LV, and B-HV groups:
We calculate the enrichment of various tags  (based on replication phases, LADs, or subcompartments), in group  as: where  , is the fraction of regions with tag t in group g.For  , ̅̅̅̅̅̅ , we first randomly select the same number of genomic regions (as in group g) and calculate the fraction of regions with tag t among those regions, and repeat this step 1000 times.We then take the average of 1000 different fraction values.

▪ K-means clustering of A and B compartments
For clustering, we first normalize all 17 structural features using log2-transformation.We then perform K-means clustering using all transformed features for A and B subcompartments separately.We use scikit-learn python package to perform K-means clustering 20 and set the n_clusters parameter to 2 for A and 3 for B compartments.and ) are calculated in each structure.The minimum distance from all possible pairs in each structure is then used to calculate the fraction of models in which both regions are colocalized.We assume a loci pair is colocalized in a structure if the calculated minimum distance in that structure is lower than 1 m (  < 1 ).

▪ Radial positions of trans and cis interactions
We select 1,000 random structures from the population and identify all the trans and cis chromatin interactions.Then we calculate the average radial position of the location where the trans or cis interaction occurs by taking the mean of the radial positions of the two loci forming the interaction in that structure.
After obtaining the contact probability matrix at 200-kb resolution, we identified bins that have spurious inter-chromosomal interaction probabilities (higher than 0.2), and removed the corresponding bins in the 20-kb raw matrix, and repeated the KR normalization, and regenerated the 200-kb contact probability matrix where no inter-chromosomal contact probability is higher than 0.2 following the same procedure explained above.Finally, we set contact probabilities between the consecutive domains as well as between domains up to 1 Mb distance in the gap regions to 1 in order to maintain the chain integrity.

Iterative refinement of restraint assignment
During the optimization procedure, it is possible that excess contacts can lead to compact structures that cannot be further relaxed easily.A heuristic method has been put in place to compensate for such an effect, by allowing us to assess the actual portion of expected contacts that are available for allocation, which then affects the way the activation distance    is computed in the Hi-C assignment steps (A).The empirical procedure relies on the predicament that when a population expresses more contacts than it should, we reduce the assignment probability; a lower probability (with fewer restraints) should be equally effective.
Assume the expected number of contacts    is the sum of a number of effective contacts that are actually imposed,   , and a number of incidental contacts (that are also expressed but not imposed),   : The number of incidental contacts is expressed as a fraction of the number of non-applied (non- ▪ GRO-seq GRO-seq read counts 6 at plus and minus strands were summed up for each 200-kb region.

▪ Superresolution imaging
For the recently publised superresolution imaging data 12 , we used the provided datasets at https://zenodo.org/record/3928890. 1041 imaged loci were first mapped to 200-kb resolution.All loci were either mapped to one or two 200-kb regions depending on their overlaps.If a locus was mapped to two regions, the features calculated from the models were averaged over these two regions in the analyses.

▪ ChIP-seq (Histone marks)
The signals and peaks for H3K36me3, H3K27me3, H3K9ac, H3K9me3, H3K27ac, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H4K20me1, and H2AFZ histone modifications were downloaded from ENCODE 9,10 .Each region in the ChIP-seq data was mapped to 200-kb regions with an overlap of 50% or higher.After mapping, each 200-kb region had multiple ChIP-seq regions; therefore the signals mapped to each 200-kb region were averaged (we first took the inverse log2 of the signals, then averaged and took the log2 of the average value.),and the number of peaks mapped to each region were counted.

▪ Subcompartments
The definition of subcompartment states was retrieved from ref 2 which are at 100-kb resolution.
For each 200-kb region of chromatin, we first mapped the subcompartment states and calculated their proportion.The state of the 200-kb region was then set to the majority of its constituent state (50% or more).If there was no majority constituent state, we then assigned the region as NA and discarded them from further subcompartment-related analyses.

▪ Compartments
The definition of compartment states were retrieved from the 4DN data portal (Table S2) and ref 2 which are at 250-kb resolution.We mapped each compartment state to a 200-kb region with the maximum overlap.If there was no mapped state to a 200-kb region, then we assigned it as NA and discarded them from further compartment-related analyses.

▪ LADs
The definitions of LADs (constitutive and facultative LADs and inter-LADs) were retrieved from refs. 4,7.LAD assignments were mapped to overlapping 200-kb regions.After mapping, if a 200kb region had only one mapped LAD state, we assigned that state to the region.However, if the region had multiple mapped LAD states, we assigned the region as NA and discarded them from further LAD-related analysis.This mapping procedure resulted in 1304, 495, 1010, 1904 regions with cLAD, fLAD, ciLAD, and fiLAD assignments, respectively.

▪ Enhancers/Superenhancers
The definitions of enhancers (EN) and superenhancers (SEN) were retrieved from ref 13 .Each EN and SEN region was mapped to 200-kb regions with an overlap of 50% or higher.After mapping, we assigned regions as EN/SEN if they overlapped with one or more EN/SEN peaks.We assigned regions as NA if they did not overlap with any EN/SEN peaks and discarded them from further EN/SEN-related analysis.

3D DNA FISH Experiments
We carried out a set of 3-color FISH experiment where all probes were on chromosome 6: RP11-945M14 (306,712 -532,406), RP11-1076L22 (130,076,074 -130,287,007), and RP11-111J1 (87,193,201 -87,351,481).For any particular chromosome domains (regions), multiple BAC clones were chosen and synthesized by Empire Genomics and tested individually for their specificity.The experiment was performed following the previous protocols 24,25 .GM12878 cells were cultured in a DMEM medium supplemented with 15% FBS, glutamine and penicillin/streptomycin as suggested by ENCODE.Two days before the experiment, 22mm x 22mm coverslips were cleaned and coated with L-poly-lysine (1mg/ml) at room temperature for 1-2 hours, and dried in a tissue culture hood after a brief rinse with sterile MilliQ water.On the day of the experiment, 10 million GM12878 cells were harvested by centrifugation at 100g for 10 minutes, resuspended in fresh culture medium (3x10 6 cells/ml), and seeded evenly on the coverslip in a 6-well tissue culture plate.After incubating at 37C for one hour and briefly washing with PBS, the cells (on coverslips) were fixed with 4% freshly made paraformaldehyde (in 0.4x PBS) at room temperature for 10 minutes.The cell membrane was permeabilized firstly with 0.5% triton X100/1xPBS at room temperature for 20 minutes, and then through 4-5 freeze-and-thaw cycles (by dipping in liquid nitrogen and then thawing in room temperature) in the next day after pretreatment overnight with 20% glycerol/1xPBS and also before each dip.To facilitate access of the FISH probe to the chromatin DNA, the samples were each washed twice with 0.05% triton X100/1xPBS for five minutes, and then treated with 0.1N HCl at room temperature for 5-10 minutes to remove basic nuclear proteins.The HCl was removed from the sample followed by two washes with 0.05% triton X100/PBS and one wash with 2x SSC (diluted from 20xSSC: 3M Fig. S1 Correlation between the mean distance to the nearest speckle and the interchromosomal contact probability (ICP) of genomic regions.-Fig.S2-22.Structural feature profiles for Chr2-22.-Fig.S23.Residual ratios -Fig.S24.Comparison of replicate structure populations -Fig.S25.Population size convergence plots -Fig.S26.Chromatin interaction networks for subcompartments -Table Fig. S1.Pearson correlation (r= -0.95, p=~0) between the mean distance to the nearest speckle and the inter-chromosomal contact probability (ICP) of genomic regions.

Fig. S23 .
Fig. S23.Residual ratios.The residual ratio Δ is defined as Δ  = (   −    )/   with    and    as the contact probabilities between regions k and l from experiment and models, respectively.Residual ratios are very small, and centered at a median of 0.03 (mean= -0.05) for intra-chromosomal (left) and 0.001 (mean = -0.002)for inter-chromosomal (right) contacts, showing excellent agreement between experiment and model.

Fig. S25 .
Fig. S25.Population size convergence plots.Pearson correlations between the population with 10,000 structures and populations with smaller sizes (50, 100, 1,000, and 5,000 structures).Contact probabilities (left) and radial positions of chromatin regions (right) already converge at 1,000 structures and have very high correlations (>0.99) with the 10,000 structure population.

Fig. S26 .
Fig. S26.Representative chromatin interaction networks (CIN) for chromatin in each subcompartment in a single structure.Each node in CINs represents a single chromatin region connected by edges if the two regions are in physical contact in the 3D structure.Nodes are colored by their neighborhood connectivity (i.e. the average contacts formed by their neighbor nodes) ranging from low (blue) to high (red).
iii.Predicting SON TSA-seq signals using speckle distances based on A1 sequence locations: Speckle locations are approximated by the sequence positions of A1 regions, either as median sequence position for a block of consecutive A1 chromatin regions or the sequence positions of individual A1 regions, if their neighboring regions are not part of the A1 subcompartment.The distance    between a chromatin region i and speckle position j, separated in sequence by n chromatin regions, is then defined as    = 2 ×   , where   = 118  is the excluded volume radius of a chromatin region in the models (see Genome representation).
signals for chromatin with respect to distance to spatial partition centers of each subcompartment, the nuclear volume around a spatial partition center is divided into concentric shells, with each consecutive shell radius increasing by 200 nm.The signals are then averaged over concentric shells around partition centers as follows: In each individual genome structure, the signals of chromatin located in the same shell volume is averaged, irrespective of the chromatin's subcompartment assignment.The average signal per shell are further averaged over all partition centers in the same subcompartment and over all structures of the population.
Clusters are then compared with actual subcompartment assignments to compute clustering accuracy.The highest prediction accuracies are obtained when clustering is performed with a subset of structural features for both A and B subcompartments.The used features in the clustering are cell-to-cell variability of radial positions, SAF, NAF, median trans A/B ratios for A, and cell-to-cell variability of radial positions and nucleoli distances, nucleoli TSA-seq, ICP, median trans A/B ratios for B subcompartment predictions, respectively.▪ Comparison with 3D in situ hybridization (3D-FISH) data FISH probes are mapped to 200-kb chromatin regions in our models according to the highest overlap.Radial positions and pairwise distances for each mapped probe are determined in each structure in the population and compared to the radial positions and pair distances in FISH experiments.FISH and model radial positions are normalized by their maximum values.Intra-chromosomal distances in models are defined by their surface-to-surface distances of the corresponding probe regions (in both copies of the chromosome).Colocalization fraction of inter-chromosomal pairs are calculated as following: first the center-to-center distances of all possible probe pairs ( − ,  −  ′ ,  ′ − ,  ′ −  ′ where  ′ and  ′ are the homologous copies of each 200-kb chromatin regions, enforced) contacts.The latter term can originate from cooperative effects, which automatically bring loci closer without an explicit bonding term operating.We can solve for probability  0 =   /:  0 =   − 1− .This is an effective probability which controls the number of restraints to be enforced.The estimate for the scalar  is calculated as follows.Let us compare the factual number of contacts in the population (expressed by the tensor    = ∑   (−1),X  =1 ) with the predicted number of contacts from the previous assignment step (   = ∑   (−1),assig  =1 ): 100-kb single-cell LaminaDamID lamina contact frequencies 4 were mapped to 200kb and averaged over 200-kb regions.▪ GP-seq First, 100-kb rescaled GP-seq scores were averaged over 4 replicate experiments (2 HindII, 2 MboI experiments) for each 100-kb region 7 .The values were then mapped to 200-kb and averaged over 200-kb regions.

Table S1 .
1earson and stratum adjusted correlation coefficients (SCC)1between the input and output Hi-C matrices for each chromosome.For SCC calculation, the smoothing parameter and the upper bound of the genomic distance for interacting loci were set to 0 and 50 Mb, respectively.

Table S2 .
Experimental data used in our analyses.