Detecting continuous structural heterogeneity in single-molecule localization microscopy data

Fusion of multiple chemically identical complexes, so-called particles, in localization microscopy, can improve the signal-to-noise ratio and overcome under-labeling. To this end, structural homogeneity of the data must be assumed. Biological heterogeneity, however, could be present in the data originating from distinct conformational variations or (continuous) variations in particle shapes. We present a prior-knowledge-free method for detecting continuous structural variations with localization microscopy. Detecting this heterogeneity leads to more faithful fusions and reconstructions of the localization microscopy data as their heterogeneity is taken into account. In experimental datasets, we show the continuous variation of the height of DNA origami tetrahedrons imaged with 3D PAINT and of the radius of Nuclear Pore Complexes imaged in 2D with STORM. In simulation, we study the impact on the heterogeneity detection pipeline of Degree Of Labeling and of structural variations in the form of two independent modes.

In this paper we present a model-free continuous heterogeneity detection method that works directly on localization coordinates in order to employ the full potential of the SMLM data.We apply our method to experimental data for the continuous detection of the height of DNA-origami tetrahedron structures imaged in 3D and of the radius of Nuclear Pore Complexes (NPCs) imaged in 2D.

Methods
Our continuous heterogeneity detection (CHD) pipeline is illustrated in Fig. 1.It starts by computing a pairwise registration of all particles to obtain a dissimilarity matrix of the particles, similar to the approach of Huijben et al. 23 .The dissimilarity values between particles are mapped via multi-dimensional scaling 29 to a high-dimensional feature space (MDS).This typically results in a low-dimensional manifold embedded in the high-dimensional MDS space.The low-dimensionality indicates that the particles are mostly alike but vary in one or just a few features.In the next step, the Isomap algorithm 30 is used to "unroll" the data into a lower dimensional embedding.Finally, principal component analysis is used to project this representation onto the axis carrying the largest variation in a 1D latent space.This sequence of transformations preserves local ordering, and therefore, the ordering in 1D latent space carries the information on the continuous heterogeneity captured by the dissimilarity measure.Since neighboring particles in the latent space are structurally similar to each other, we can now divide them into bins and fuse the particles per bin.
In the following we describe the different steps of the algorithm in detail.

Pairwise registration
We use the all-to-all registration 11 to register N particles from a 2D/3D SMLM dataset.Each particle is independently registered to all other particles in the dataset using a combination of Gaussian Mixture Model (GMM) registration 31 and the Bhattacharya cost function 14 .First, particles are aligned using GMM registration with multiple initial poses, which results in a set of rotation matrices and translation vectors.The final registration parameters are those that maximize the Bhattacharya cost function over the GMM optima for the different initializations.This procedure results in transformation parameters (rotation and translation) and the optimum cost value for each pair.The width of the Gaussian distribution (scale parameter) used in GMM registration is obtained by registering ten random groups of particles for different scales.In this scale-sweep approach, the scale parameter which gives the highest Bhattacharya cost value for all test sets is chosen as the scale value to be used for the whole dataset.It is necessary to select the proper scale value to avoid blurring of nearby binding sites or overfitting on each localization.The elements of the N × N matrix of optimum cost values (or rather the upper triangular part of this matrix) are normalized by the number of localizations for the two corresponding particles 23 .This makes the cost function matrix less sensitive to variations in the number of localizations.Finally, all pairwise cost values (quantifying the similarity between pairs of particles) are subtracted from the maximum pairwise cost value to create a dissimilarity matrix.

Multi-dimensional scaling (MDS) space
We assign coordinates to the N particles in a high-dimensional (dimension D = 30 ) space, by interpreting the values of the dissimilarity matrix as the Euclidean distance in this high-dimensional space, the so-called MDS space.This is done by iteratively updating the coordinates in MDS space to minimize the stress function: (1) where d ij is the pairwise dissimilarity between particle i and j, and x i is the MDS position vector of particle i ( i = 1, 2, . . .N ).We found that a dimensionality D > 15 resulted in a value for the stress function smaller than 10 −4 , which was sufficient for all our applications.As a rule of thumb, the number of dimensions is set to be D = 30 23 .

Isomap
We have observed that the point cloud in MDS space can and often is distributed across a lower-dimensional manifold.In addition, the ordering of the points on this manifold correlates with the ordering from the dissimilarity metric.In order to take advantage of this in case it occurs, we use a global geometric framework for nonlinear dimensionality reduction, the so-called Isomap algorithm 30 .The Isomap algorithm is used to unroll the low-dimensional manifold embedded in the high-dimensional MDS space by flattening the curved manifold, preferably into a non-curved shape.This in turn enables easier detection of the dominant directions of variation in the dissimilarity measure across the data in MDS space.Isomap unrolls the manifold while keeping the number of dimensions the same as the original MDS space using the following steps: (1) Clustering the particles 32 to find the k nearest neighbors based on the Euclidean distance for all N particles in MDS space.(2) Connecting the k neighborhoods for each point to construct a proximity graph.(3) Computing the shortest pairwise distance for all pairs of points in the graph.This results in a new N × N matrix in which the elements represent the geodesic distance.(4) Embedding the geodesic distance matrix into MDS space by minimizing the stress function, keeping the number of dimensions the same.The parameter k, which corresponds to the number of neighbors used in Isomap, should be empirically chosen based on the distribution of the particles in MDS space (typically between 4 and 12).We empirically found k = 4 works well.If k is too high, unrolling cannot be performed properly as neighbors can then be found not just along the manifold, but also via a "shortcut" through empty space.In cases where the successive differences between explained variance on the first three axes are higher than half of the average variance explained, Isomap can reveal the low dimensional latent space.Otherwise, it can not further reduce the number of dimensions due to an isotropic distribution in MDS space.

Principal component analysis (PCA)
After unrolling the data in the high dimensional MDS space, we used PCA to identify the direction of the largest variation.Here the axis with maximum variance explained corresponding to the highest eigenvalue was selected as the one embedding the maximum mode of variation in the data measured by the Bhattacharya cost function.Subsequently, by projecting all particles on its main principal component axis, a 1D latent space is created, which corresponds to some of the continuous heterogeneity that exists in the dataset.

Reconstruction per bin
To visualize the continuous heterogeneity revealed in the latent space, particles are divided into bins, and a single reconstruction for each bin is made.The way in which the particles are distributed over the bins is adapted to the distribution of the particles in the latent space.For a (near) uniform distribution of particles, the bin width is set such that all bins cover the dynamic range of values in latent space.For a more bell-shaped distribution of particles, the bin width is set to be equal to half of the fitted standard deviation of the distribution.The total number of bins is typically chosen in the range of 5-10, but can in principle be chosen arbitrarily, provided there are more than approximately 10 particles per bin.The resulting reconstructions per bin are expected to be more faithful since the particles in each bin are structurally close to each other.

Particle fusion
The superparticle reconstructions per bin are made with a template-free method developed previously by us 13 , based on earlier work by Evangelides and Horaud 33 , the so-called Joint Registration of Multiple Point Clouds (JRMPC).

Model-based shape parameter estimation
We identify the coordinate axis in 1D latent space indicating the heterogeneity parameter.In an experimental 2D NPC dataset, we compare the latent space coordinate to the estimated radius of the ring structure.To estimate the radius, we center all particles by subtracting the mean of all localizations in a particle from the localizations, as in Heydarian et al. 11 .Subsequently, the localization coordinates are transformed into polar coordinates.We take the mean of the radial coordinate as an estimation of NPC radius.To estimate the precision of the model we calculate the FWHM of the radius histogram and divide it by the square root of the number of localization events to find the standard error of the mean.In an experimental 3D DNA-origami tetrahedron dataset, we compare the latent space coordinate to the estimated height of the tetrahedron structure.To this end, we manually align the particles in each bin along the z-axis and project all localizations on the z-axis, giving a histogram of z coordinates of all localizations in the particle.This histogram has two peaks, one corresponding to the three binding sites in the base plate of the tetrahedron, and the other corresponding to the tip of the tetrahedron.By fitting a mixture of two Gaussian distributions to this histogram, we find the height as the difference between the mean of the two fitted Gaussian distributions.In a simulation study on elliptically shaped 2D NPCs we estimate the ellipticity of the ring structure as in Huijben et al. 23 by finding the center of the 8 blobs in each particle using k means clustering, followed by fitting an ellipse to the centers of the 8 blobs.
Vol:.( 1234567890)  100MG) and 500 µg/mL glucose oxidase (Sigma #G2133-50KU)) and a second buffer (10 mM NaCl, 50 mM Tris pH 8 and 10% Glucose).Images were taken using an Oxford Nanoimager (ONi) STORM microscope with oil-immersion objective (100X, 1.4NA) and laser power density of 4 kW/cm 2 .For imaging acquisition, a 641 nm laser was used to excite AF647-BG and a 405 nm laser was used to enhance blinking.Samples were pre-bleached with 190 mW of 641 nm laser (1500 frames), then acquisition was done with 120 mW of the 641 nm laser (5000 frames) and enhanced blinking with 0.2 mW of 405 nm and 120 mW of 641 nm (5000 frames).

Results
We applied the proposed CHD analysis pipeline to two experimental datasets, a 3D DNA origami tetrahedron dataset, and a 2D NPC dataset.We also investigated limitations of the method by two simulation studies, on the impact of the Degree Of Labeling (DOL) and on the sensitivity to two independent modes of continuous variation.

Continuous distribution of 3D DNA-origami tetrahedron height
We applied our CHD pipeline to a three-dimensional DNA-origami tetrahedron data set, imaged with DNA-PAINT 23,34 .This data set consists of 218 particles with an edge length of around 100 nm and a height of around 90 nm.There is a variation in tetrahedron height, which was analyzed previously by Huijben et al. 23 with their clustering approach.We have applied our CHD method to this data set as well and the key results are shown in Fig. 2. The reconstruction of the whole data set (Fig. 2a) was obtained using fast particle fusion based on JRMPC 13 .Figure 2b shows the distribution of particles over the latent space coordinate.There is a clear correlation of the found latent space coordinate with a continuous height variation of about 45 nm in the entire data set, providing additional support for the validity of our method.The variance explained in the first PC axis is 23% (clearly larger than in the next directions, which give rise to 14, 9, 5, and 4% variance explained).We divided the latent space into 10 bins with equal length in latent space as the particles are almost uniformly distributed in the latent space.Figure 2c-h shows the reconstructions for each bin.These reconstructions have less elongated blobs than the overall reconstruction of Fig. 2a.A 3D reconstruction of each bin can be seen in Supplementary video 1.

Continuous distribution of 2D NPC radius
The NPC dataset consists of 1339 particles imaged with 2D STORM (see Methods section).Each particle was picked (cropped) manually from a single SMLM image.The shape of lower-dimensional projections of the distribution of points in MDS space turns out to be closer to an ellipsoid structure as opposed to a more flat shape.As a result, the Isomap step does not provide additional value in this case, and we directly applied PCA to the distribution in MDS space.We attribute the more diffuse topology of the manifold in MDS space to other modes of variation in addition to the dominant one (NPC radius).These confounding modes of variation could be related to other modes of structural variation, and SMLM specific statistical variations such as Degree Of Labeling (DOL).By estimating the size of each particle as explained in the Model-based shape parameter estimation in Method section, we observed a continuous variation in the radius of the scaffold that seems drawn from a Gaussian distribution with a mean of around 55 nm, and a standard deviation of around 4 nm.This continuous heterogeneity causes blurriness in the reconstruction of all particles in the dataset (Fig. 3a) that assumes homogeneity of the underlying data.Fig. 3b shows the results of analyzing this dataset with our CHD algorithm.There is a very clear correlation between the latent space coordinate and the independently assessed radius, validating our method for discovery of continuous heterogeneity without any prior assumptions.Although the first principal axis shows a very good correlation with radius, the variance explained in this axis is only 4%.This may be related to the more diffuse topology of the MDS manifold compared to the 3D DNA-origami tetrahedron case.We also inspected the second or higher principal axis but did not observe a correlation to another structural feature.
Figure 3c-l shows reconstructions of particles in 10 bins, defined according to the observed bell-shaped distribution of the latent space coordinate (see Methods section).The results indicate that the registration of particles in the middle bins (Fig. 3f-i) leads to better reconstructions compared to the first (Fig. 3c, e) and last bins (Fig. 3j-l).As a quantitative measure, we calculated the spectral signal to noise 35 (SSNR) curves for each bin.To that end we divided each bin into two halves and registered each part independently using the fast particle fusion approach.After alignment of the two halves, we applied a random rotation in view of the 8-fold rotational symmetry of the structure, thereby avoiding hotspots.Figure 4 shows that for the middle latent space bins (4-7), the SSNR curves are higher than for the first (1-3) and last (8-10) bins, in agreement with the visual quality of the reconstructions per bin in (Fig. 3c-l).All SSNR curves level off to a noise plateau for spatial frequencies higher than approximately 0.15 nm −1 , indicating that the smallest features in the reconstructions are about 6 nm.The reason for the relatively poor reconstruction of the extreme bins is not just the lower number of similar particles that contribute to the reconstruction, but also that these bins are contaminated with outlier particles.This suggests an additional value of our proposed method, namely outlier particle detection.Figure 5 shows examples of particles that can be designated as outlier particles compared to randomly selected valid particles.As threshold for the definition of outlier, we take the extreme 1% of the distribution of particles in latent space.Visually, these outlier particles are relatively remote from the expected ring shaped point clouds.

Simulation on impact of DOL
A low Degree Of Labeling (DOL) of binding sites is a common problem in SMLM.It can be expected that DOL also has an impact on the ability to detect continuous heterogeneity in a dataset, as the randomness of which binding sites are labeled (and which not) affects the imaged structure of each individual particle.We investigated the impact of DOL by a simulation study of a 2D NPC structure.We applied our CHD pipeline to simulated data sets with five DOL values (30, 50, 70, 90, and 100%), and for particle radii drawn from uniform distributions with four ranges.The simulation is performed as Huijben et al. 23 with 250 particles in each dataset.Figure 6 shows the variance explained on the first PC axis (the latent space coordinate) as a function of DOL for the four distributions.As expected the variance explained decreases with decreasing DOL.If we take 50% as a minimum value for this performance criterion, we can conclude that the DOL should be at least in the range 50-70%.

Simulation on two modes of variation
So far, we have only considered cases in which there was a single, dominant mode of variation.To evaluate the ability of the CHD method to detect two modes of variation simultaneously, we made a simulation study of 2D NPC particles with both a variation in radius (uniform distribution ranging from 30 to 70 nm) and a variation in ellipticity of the ring (uniform distribution ranging from 0.6 to 1.0).The outcome was compared to simulations with only a variation in radius or in ellipticity.Figure 7 shows the similarity matrices for these different simulations.The matrices for single-mode variation simulations (Fig. 7a, b) were ordered such that the particles are sorted based on the ellipticity or radius of the particles.As a consequence, by moving from left to right in each row in Fig. 7a, b, the similarity value decreases since each particle is compared with a less similar structure in terms of ellipticity or radius.This applies to every row in the matrix and creates a diagonal band of higher similarity values.The width and the average hue of this diagonal band in the matrix are related to the sensitivity of the similarity measure to the ellipticity or radius.If we apply the same particle ordering procedure for either  the ellipticity or the radius for two-mode variation simulation (Fig. 7c, d), it turns out that the width and hue of the diagonal band is limited or seemingly absent.This indicates that the single similarity metric we use might have difficulties to provide sufficient information to detect multiple modes of variation.
In the simulated dataset with two modes of variation it appeared nevertheless possible to detect both modes.Figure 8 shows the distribution of particles in the 2D latent space (first and the second principal axes), where the color code shows the radius and ellipticity ground truth, respectively.Clearly, the variation in radius corresponds with the first principal axis, while the variation in ellipticity corresponds with the second principal axis.While these initial simulation results are encouraging, we have not been able to detect multiple modes of variation in both experimental datasets we studied.Several confounding factors, in particular a stronger underlabeling, could be limiting in detecting multiple variation modes in experiment.

Discussion
In summary, we have developed a model-free continuous structural heterogeneity tool to sort particles based on a dissimilarity measure.We successfully detected continuous structural variation in different localization microscopy datasets, such as DNA-origami tetrahedrons and NPCs, which led to more faithful fusions.The method should be applicable to any SMLM dataset that consists of particles that share a similar structure but vary in conformational state.In cryo-EM studying structure variation is applied to e.g.variation of position or motion of side groups.Similar variations could potentially be visible in our approach albeit at a lower resolution due to the imaging modality.
It is not clear how many samples are required per bin to detect structural heterogeneity, let alone how this number of samples depends on localisation precision and degree of labelling.Already for normal averaging, with the assumption of structural homegeneity, it is unclear how the FRC resolution of the reconstruction depends on the number of particles.This is very different from cryo-EM SPA where each extra particle reduces the noise and the improvement follows the expected 1/ √ N scaling).As best practice, we inspect the result and then judge if the numbers have been sufficient.
We found several limitations of the proposed template-free continuous heterogeneity detection method.Firstly, it is not a priori clear if the Isomap unrolling step is of use or not.Secondly, picking up modes of variation like variations in DOL, that have no clear geometrical interpretation such as size parameters, turn out in simulation to be too challenging to detect.Thirdly, the detection of multiple continuous modes of shape variation in SMLM data remains unsolved, despite initially hopeful simulation results.This may be due to the poorer quality of experimental data and variation modes that are entangled differently than foreseen in simulation.Finally, the method is based on a single dissimilarity metric that may be expected to have different sensitivities to different modes of variation.
An alternative to the proposed method may be to fit an a priori model to the data and then sort the data by the distribution of model parameters obtained by the fit.A major drawback of such an approach, however, is that the outcome of the analysis is prone to biases induced by the assumed model.A more fruitful next step may be to consider multiple metrics that quantify specific features and/or particle similarity for sorting the particles.This can be performed either by a semi-template-free approach with multiple features that are designed to be sensitive to specific modes of variation or by generating sets of more abstract features using deep neural networks and auto-encoders 36 .

Figure 1 .
Figure 1.Continuous heterogeneity detection pipeline.N particles are registered using the all-to-all registration procedure resulting in N(N − 1)/2 dissimilarity values.Multi-dimensional scaling embeds the dissimilarity matrix into an abstract Cartesian space (only the first 3 dimensions are shown).Points lying on a low-dimensional manifold are unrolled using Isomap.Projection of the unrolled manifold on the main principal component axis creates a 1D latent space in which the data is ordered based on the dissimilarity captured by the dissimilarity measure. https://doi.org/10.1038/s41598-023-46488-zwww.nature.com/scientificreports/

Figure 2 .Figure 3 .Figure 4 .
Figure 2. Continuous distribution of DNA origami tetrahedron height.(a) All particles are registered using fast particle fusion based on the homogeneity assumption.The fusion of structurally different particles causes blurry and elongated blobs.(b) The distribution of the particles in latent space.(c-l) Fusion of particles in 10 bins sorted along the latent space coordinate, showing a clear variation in the height of tetrahedrons.Scale bar in (c) applies to (d-l) as well.Colormap represents the localization density.

Figure 5 .
Figure 5. Outliers detected by the CHD pipeline.Images (a-l) correspond to particles that are in low-density regions of the latent space.As reference, images (m-x) are randomly selected from the whole data set.

Figure 6 .
Figure 6.Performance of continuous heterogeneity detection with DOL.Each curve consists of 5 data points, pertaining to datasets with five DOL values (30, 50, 70, 90, 100%).The radius of the particles is drawn from a uniform distribution with ranges indicated in the legend.

Figure 7 .
Figure 7. Sensitivity analysis of the cost function to two modes of variation.Plots illustrating the normalized pairwise cost value (similarity measure) for a variation in radius and/or ellipticity of simulated 2D NPC.(a) Cost values of a dataset with uniform distribution over ellipticity ratio ranging from 0.6 to 1. Particles are sorted based on their ellipticity from top to down and left to right in the matrix.The values in each column show the sensitivity of a particle with specific ellipticity to the other particles.As a result, the color separation over the diagonal band refers to the sensitivity of the method to the variations in ellipticity.(b) Cost values of a dataset with uniform distribution over the radius ranging from 30 nm to 70 nm.(c, d) Cost values of a dataset with two modes of variations.In (c) particles are sorted based on their ellipticity, while in (d) sorting is based on radius.A comparison of the diagonal bands in (c) and (d) shows that the method is more sensitive to variations in radius than in ellipticity.

Figure 8 .
Figure 8. 2D Latent space for two modes of variation.(a) Distribution of particles in the 2D latent space color coded respectively with the ground truth radius.(b) Distribution of particles in the 2D latent space color coded respectively with ellipticity.