SnapFISH: a computational pipeline to identify chromatin loops from multiplexed DNA FISH data

Multiplexed DNA fluorescence in situ hybridization (FISH) imaging technologies have been developed to map the folding of chromatin fibers at tens of nanometers and up to several kilobases in resolution in single cells. However, computational methods to reliably identify chromatin loops from such imaging datasets are still lacking. Here we present a Single-Nucleus Analysis Pipeline for multiplexed DNA FISH (SnapFISH), to process the multiplexed DNA FISH data and identify chromatin loops. SnapFISH can identify known chromatin loops from mouse embryonic stem cells with high sensitivity and accuracy. In addition, SnapFISH obtains comparable results of chromatin loops across datasets generated from diverse imaging technologies. SnapFISH is freely available at https://github.com/HuMingLab/SnapFISH.

Many computational pipelines have been developed for identifying chromatin loops from Hi-C datasets, including HiCCUPS 1 , FitHiC 2 /FitHiC2 3 , SIP 4 , Peakachu 5 , etc (see detailed benchmark analysis in the reference 6 ). However, none of these methods can be directly applied to multiplexed DNA FISH data due to two major reasons. First, all existing Hi-C loop callers are based on statistical modeling of count data. The underlying Poisson or negative binomial model does not fit the continuous data measuring Euclidean distance in multiplexed DNA FISH data. Although researchers could convert multiplexed DNA FISH data into countbased data similar to Hi-C, by first pre-specifying a Euclidean distance threshold (as described in a recent preprint 7 ), and counting the number of cells with Euclidean distance smaller than the pre-specified threshold. However, due to missing loci in multiplexed DNA FISH data, the number of available cells at each loci pair is different. Therefore, one can only use the frequency, instead of the count, for downstream loop calling analysis, making the count-based Poisson or negative binomial model inappropriate.
To fill in such gap, we developed SnapFISH, a loop caller tailored for multiplexed DNA FISH data. Different from all existing Hi-C-based loop callers, SnapFISH uses two sample T-test to model the continuous Euclidean distance, providing rigorous statistical justification of the loop calling pipeline. SnapFISH also provides the nonparametric Wilcoxon rank test as an alternative, when the assumption of T-test is violated (see details in Supplementary Information Section 6). In addition, SnapFISH applies the contact frequency filtering based on the average Euclidean distance between two loci with 25Kb 1D genomic distance (see details in Supplementary Information Section 2), achieving high precision and satisfactory recall (Table S3). Taken together, the methodological innovations in SnapFISH make it a tailored loop caller for multiplexed DNA FISH data, compared to the existing Hi-C-based loop callers.
Section 2: Optimizing the SnapFISH algorithm using DNA seqFISH+ data in mESCs.
We evaluated different combinations of SnapFISH parameters, in order to achieve optimal F1-score, benchmarking against the reference loop set consisting of 35 HiCCUPS loops identified from deeply sequenced bulk Hi-C data. First of all, we performed both two sample T-test and the non-parametric Wilcoxon test. For both two tests, we converted P-values into false discovery rate (FDR), and evaluated 3 FDR threshold values (FDR<10%, FDR<5% and FDR<1%) to define loop candidates. We then grouped nearby loop candidates into clusters, and identified cluster summits and singletons. We evaluated two approaches to select summits: using the loci pair with the smallest FDR, or using the loci pair nearest to the geometric center of the cluster. In the end, the final loop list consists of cluster summits with population-level contact frequency >= Cutoff.summit and singletons with population-level contact frequency >= Cutoff.singleton. We applied a grid search to evaluate all combinations of Cutoff.summit and Cutoff.singleton, with the step size of 1/6 and under the constraint that Cutoff.summit <= Cutoff.singleton. We only calculated F1-score for the combination of SnapFISH parameters resulting in >=1 loop summit. Taken together, we evaluated 144 combinations of SnapFISH parameters, and selected the one corresponding to the highest F1-score ( In addition, we evaluated the performance of SnapFISH with different sizes of the local neighborhood control regions. As the default setting, SnapFISH defines the local neighborhood control regions as the loci pairs within 1D genomic distance 25Kb~50Kb to the loci pair of interest, for both 5Kb and 25Kb bin resolution multiplexed DNA FISH data (Figure S1), and such a distance threshold is motivated by the HiCCUPS 1 algorithm and our SnapHiC 8 algorithm. In this analysis, we consider 3 additional threshold values for 1D genomic distance: 0~25Kb, 50Kb~75Kb and 75Kb~100Kb. When using loci pairs with 1D genomic distance 0~25Kb as the control, SnapFISH did not find any loop candidate, since the minimal FDR based on two sample T-test is 0.1237, underscoring the necessity of removing immediately adjacent loci pairs in the local neighborhood control region. When using loci pairs with 50Kb~75Kb as the control, SnapFISH identified 18 loops with 10 true positives. The precision, recall and F1-score are 55.6%, 28.6% and 0.377, respectively. When using loci pairs with 75Kb~100Kb as the control, SnapFISH identified 11 loops with 6 true positives. The precision, recall and F1-score are 54.6%, 17.1% and 0.261, respectively. Taken together, our data suggest that the default setting with 1D genomic distance 25Kb ~ 50Kb achieves the best performance with the highest F1-score.
The current version of SnapFISH only allows 5Kb bin and 25Kb bin resolutions, and our results demonstrate that SnapFISH works well for all the datasets used in this study. We recommend using the default setting for bin resolution between 5Kb and 25Kb. For finer resolution data (i.e., 2Kb), future work is warranted to optimize SnapFISH parameters, including both the size of local neighborhood control regions, statistical significance levels and contact frequency thresholds. We will pursue this research direction when more genome-wide finer resolution imaging data become publicly available.
Section 3: Applying SnapFISH to DNA seqFISH+ data in three additional mouse brain cell types.
Takei et al study 9 contains DNA seqFISH+ data of 2,762 cells from nine major cell type clusters. We have shown that SnapFISH can identify enhancer-promoter loops in mouse excitatory neurons ( Figure 2C), which is the largest cell type cluster consisting of 1,895 cells (3,790 alleles). We then evaluated the performance of SnapFISH on other cell type clusters with a smaller number of cells. Since our previous analysis has shown that SnapFISH can only identify the Sox2 enhancer-promoter loop with more than 200 alleles ( Figure S5 and Table S5), we applied SnapFISH to DNA seqFISH+ data generated from three additional major cell type clusters with more than 200 alleles, including inhibitory neurons expressing parvalbumin (Pvalb for short, 155 cells, 310 alleles), astrocytes (Astro for short, 152 cells, 304 alleles) and endothelial cells (Endo for short, 240 cells, 480 alleles). All the remaining five cell type clusters consist of less than 200 alleles.
Notably, the targeted segment detection efficiency is 39.5%, 38.6% and 19.6% for Pvalb, Astro and Endo, respectively, which are comparable (Pvalb and Astro) or much lower (Endo) than the 43.1% targeted segment detection efficiency in excitatory neurons. The low targeted segment detection efficiency and the small number of alleles in these three cell types may result in the limited sensitivity of loop calling. As expected, SnapFISH identified 5, 5 and 3 loops in Pvalb, Astro and Endo, respectively (Table S4C), which are much less than the 87 loops identified from excitatory neurons. We also used Paired-Tag data 10 to annotate these SnapFISHidentified loops. Figure S6 showed an illustrative example of an enhancer-promoter loop in chromosome 3 in Astro, where the loop connects the promoter of gene Adra1a to an ATAC peak ~200Kb upstream. This example showcased that SnapFISH can identify enhancer-promoter loops in Astro with much smaller cell number than excitatory neurons.

Section 4. Evaluation the reproducibility of SnapFISH among biological replicates.
We first evaluated the reproducibility of SnapFISH using 25Kb bin resolution DNA seqFISH+ data generated from mESCs 11 , which consists of two biological replicates. Replicate #1 contains 201 cells (i.e., 402 alleles) with an average targeted segment detection efficiency of 58.6%, while Replicate #2 contains 245 cells (i.e., 490 alleles) with an average targeted segment detection efficiency of 70.6%. Applying SnapFISH to these two replicates separately identified 6 loops and 14 loops in Replicate #1 and Replicate #2, respectively ( Table S7). The more loops detected from Replicate #2 might be due to the fact that Replicate #2 contains more cells with a higher targeted segment detection efficiency. Reassuringly, all 6 loops identified from Replicate #1 overlap with loops identified from Replicate #2 ( Table S7). As one illustrative example in chromosome 16, Figure S7A shows that the loop identified exclusively by SnapFISH in Replicate #1 is reproducible in Replicate #2. Notably, SnapFISH-identified loops in both Replicate #1 and Replicate #2 overlap HiCCUPS-identified loops from mESC bulk Hi-C data ( Figure S7A).
We further evaluated the reproducibility of SnapFISH using 25Kb bin resolution DNA seqFISH+ data generated from mouse excitatory neurons 9 , which consists of three biological replicates.  (Table S8). As one illustrative example in chromosome 16, Figure S7B shows that SnapFISHidentified loops are reproducible among 3 replicates, while SnapFISH identified more loops from Replicate #3 with more alleles and higher detection efficiency than in Replicate #1 and Replicate #2.
Taken together, our data showed that SnapFISH can identify reproducible loops among different biological replicates in both mESCs and mouse excitatory neurons. The loops specific to only one replicate may be due to the high cell-to-cell variability of chromatin loops among the cell population.

Section 5. Evaluation the robustness of SnapFISH against different levels of measurement errors in multiplexed DNA FISH experiments.
SnapFISH takes the (X, Y, Z) coordinates measured in multiplexed DNA FISH experiments as the input, and generates a list of chromatin loops. However, those (X, Y, Z) coordinates do not reflect precise 3D positions, but rather positions estimated from Gaussian approximation or other fitting approaches. Here we performed benchmark analysis to evaluate the robustness of the SnapFISH loop calling again different levels of measurement errors in multiplexed DNA FISH experiments.
First of all, we collected 5Kb resolution ORCA data at the mESC Sox2 locus with measurement error, which contains the 95% confidence interval (CI) for each (X, Y, Z) coordinates. We calculated the average length of 95% CI for each allele, and divided all 6,007 alleles into three groups: low uncertainty (2,002 alleles, length of 95% CI <17.21nm, detection efficiency 64.8%), median uncertainty (2,002 alleles, 17.21nm <= length of 95% CI <20.56nm, detection efficiency 60.9%) and high uncertainty (2,003 alleles, length of 95% CI >=20.56nm, detection efficiency 45.0%). For all 3 groups of measurement errors, SnapFISH can accurately identify the Sox2 enhancer-promoter loop ( Figure S8A and Table S9A). Notably, in the low and median uncertainty groups, SnapFISH identified one additional CTCF-CTCF loop, which overlaps chromatin interactions identified from mESC CTCF PLAC-seq data 12 .
To the best of our knowledge, none of the two publicly available DNA seqFISH+ datasets 9,11 nor chromatin tracing data 13 contains uncertainty measurement. Therefore, we performed a simulation-based robustness analysis as follows.
We first added simulated measurement errors into the chromatin tracing data at the mESC Sox2 locus.
Previous study 14 has shown that measurement errors in chromatin tracing data are ~40nm for the X and Y coordinates, and ~50nm for the Z coordinate. Therefore, we simulated measurement errors from Gaussian distribution with mean 0 and three different standard deviations (sd): low uncertainty (sd=25nm), median uncertainty (sd=50nm) and high uncertainty (sd=100nm). We then added the simulated measurement error to the (X, Y, Z) coordinates of 1,416 129 alleles. For all 3 types of measurement errors, SnapFISH can accurately identify the Sox2 enhancer-promoter loop ( Figure S8B and Table S9B).
Moreover, we simulated measurement errors into the 25Kb resolution DNA seqFISH+ data from mESCs. Similar to the abovementioned analysis for chromatin tracing data, we added three types of measurement errors, and benchmarked the performance of SnapFISH on the perturbed data (Table S9C). Specifically, when adding low level of measurement errors (i.e., sd=25nm), SnapFISH identified 18 loops, among which 14 are true positive. The precision, recall and F1-score are 77.8%, 40.0% and 0.528, respectively. When adding median level of measurement errors (i.e., sd=50nm), SnapFISH identified 19 loops, among which again 14 loops are true positive. The precision, recall and F1-score are 73.7%, 40.0% and 0.519, respectively. When adding high level of measurement errors (i.e., sd=100nm), SnapFISH identified 10 loops, and 8 loops are true positive. The precision, recall and F1-score are 80.0%, 22.9% and 0.356, respectively.
Taken together, our results show that SnapFISH can robustly identify the Sox2 enhancer-promoter loop at different levels of measurement errors for both ORCA data and chromatin tracing data. For the genome-wide benchmark analysis using mESC DNA seqFISH+ data, adding low to median level of measurement error leads to a slight increase of false positive (i.e., a decrease in precision). However, adding high level of measurement error leads to a decrease in precision and large reduction in sensitivity. Collectively, our results demonstrate that SnapFISH is robust to different levels of measurement errors in multiplexed DNA FISH data.

Section 6. Using non-parametric Wilcoxon test as an alternative to the default two-sample T-test.
The distribution of Euclidean distance between a pair of loci across cells may deviate substantially from the normal distribution (e.g., heavily skewed, or having multiple modes). We therefore provide the non-parametric Wilcoxon test as an alternative to the default two-sample T-test. In our previous analysis (Table S3), we have shown that the Wilcoxon test and T-test provide similar genome-wide loop calling results in mESC DNA seqFISH+ data. As one illustrative example, we evaluated the Sox2 enhancer-promoter loop using the mESC chromatin tracing data.
Specifically, we first plotted the distribution of Euclidean distance between the Sox2 promoter and its superenhancer, and the average Euclidean distance for loci pairs in the local neighborhood control regions, in both CAST allele and 129 allele ( Figure S9A and Figure S9B). The distance distribution is approximately normal, supporting the use of T-test. In the CAST allele (Figure S9A), the p-value from two-sided two sample T-test and the two-sided two sample Wilcoxon rank test are 5.74e-20 and 2.86e-40, respectively. In the 129 allele ( Figure S9B), the p-value from two-sided two sample T-test and the two-sided two sample Wilcoxon rank test are 2.51e-13 and 1.52e-38, respectively. To evaluate why the Wilcoxon rank test generates more significant pvalues than T-test, we used the alleles without missing data at the Sox2 enhancer-promoter loop (N=469 for the CAST allele, N=498 for the 129 allele), and plotted the Euclidean distance of the Sox2 enhancer-promoter loop against the average Euclidean distance for loci pairs in the local neighborhood control regions ( Figure  S10A and Figure S10B). In both CAST and 129 alleles, the majority of alleles showed slightly smaller Euclidean distance at the Sox2 enhancer-promoter loop compared to that in its local neighborhood control regions. The majority imbalanced towards one direction but with relatively small differences together explain the more significant p-values from the Wilcoxon rank test than from T-test.

Section 7. Time and memory cost of SnapFISH.
We evaluated the time and memory cost of SnapFISH, using 5Kb bin resolution mESC multiplexed DNA FISH data with different numbers of the CAST alleles and the 129 alleles (i.e., # of alleles = 100, 200, 300, 400 and 500). Figure S11 showed that SnapFISH is computationally efficient, which only takes ~10 seconds to analyze 500 alleles, and the running time increases approximately linearly with the number of alleles. In addition, SnapFISH requires ~120Mb memory for different numbers of CAST alleles and 129 alleles. All computation tasks are performed in the single compute node with a 2.40 GHz Intel processor and 256 Gb RAM. Figure S1. Cartoon illustration of local neighborhood control region. A. 5Kb bin resolution for multiplexed DNA FISH data and ORCA data. B. 25Kb bin resolution for DNA seqFISH+ data. In both, the red area represents the pair of targeted segments of interest. The blue area represents the local neighborhood control region defined as a circle with 25Kb~50Kb 1D genomic distance (see details in Methods). In both, the grey areas are regions within 25Kb 1D genomic distance of the pair of targeted segments of interest, which are excluded from the definition of local neighborhood control region. Source data are provided.
A. B. 5Kb bin resolution 25Kb bin resolution Figure S2. Heatmap of average Euclidean distance (unit: nm) and population-level contact frequency from mESC DNA seqFISH+ data, and heatmap of Hi-C contact frequency (unit: log2 KR-normed Hi-C contact frequency) from mESC bulk Hi-C data. In each sub-figure, the left and middle panel is the heatmap of average Euclidean distance and contact frequency from mESC DNA seqFISH+ data, respectively. The right panel is the heatmap of Hi-C contact frequency from mESC bulk Hi-C data. We showed 8 chromosomes that contain 10Kb bin resolution and 25Kb bin resolution HiCCUPS-identified loops: