Introduction

Super-resolution microscopy has opened up new opportunities in biological and biomedical research by providing unprecedented molecular insights into the inner workings of cells1,2. Classical light microscopy can only resolve structural features that are larger than the diffraction limit of light (a few hundred nanometers)3. By overcoming the diffraction limit, super-resolution microscopy has revealed long hidden mechanisms underlying intracellular transport processes4 and the spatial organization of mRNA translation5,6. The major feature of single-molecule localization microscopy (SMLM) is its ability to exploit the stochastic and sparse switching of fluorescence emission of specific labels binding to target molecules7. For example, in DNA-based point accumulation for imaging in nanoscale topography (DNA-PAINT), target molecules are labelled with a short DNA strand and detected through stochastic and transient hybridization with a sequence-complementary, fluorophore labeled DNA strand8. Over time, each target molecule generates fluorescent detection events that cluster in space. Such clustering of single or densely packed molecules is observed for proteins in cells, e.g. the AMPA receptor in neurons9, or on synthetic DNA origami structures10. In addition, SMLM data sets may contain detection events that represent ambiguous information. For instance, a super-resolved image may contain false positive localizations (i.e. background noise from nonspecific fluorescent signal that does not originate from a target molecule). Furthermore, in high density regions of target molecules, localizations from multiple target molecules can overlap or form complex structures11.

Because of the point-like nature of SMLM data, quantitative analysis opens opportunities to characterize cellular structures at the nanoscale12. One aim in SMLM data analysis is to group multiple detection events into a cluster, representing a single labeled protein or an assembly of densely packed proteins that cannot be spatially discriminated.

Current state-of-the-art cluster detection algorithms rely on some form of prior, user-provided information. This information can be the type of localization patterns to be detected or prior experience with similar data sets. One of the most widely used13,14,15,16,17,18 algorithms is the ’density-based spatial clustering algorithm’ (DBSCAN)19,20. While DBSCAN is intuitive, simple, and fast for 2D datasets21 the parameter choice that defines the clustering can suffer from human bias. To identify the ’most probable’ clustering parameters based on prior knowledge, empirical methods for parameter identification have been proposed20 (see Table S1). Another algorithm, ‘ordering points to identify the clustering structure’ (OPTICS)22 circumvents the use of a global clustering parameter by defining borders between clusters of localizations through changes in local point density. Fundamentally different approaches to clustering have also gained traction to circumvent the shortcomings of DBSCAN23,24,25,26. The importance of benchmarking clustering algorithms has recently been addressed by proposing a unifying framework for comparison27.

In 2015, Rubin-Delanchy et al. introduced a Bayesian parameter finding approach28,29 which starts from a user-generated prior parameter set for a cluster-proposing algorithm, and subsequently computes a posterior probability for each parameter set. Also in 2015, Levet et al. introduced SR Tesselation, an algorithm which segments large-scale images into polygonal regions30,31 and is specialized to reveal spatial structures at multiple scales32. Most recently in 2020, Williamson et al. developed a machine-learning approach named ‘cluster analysis by machine learning’ (CAML)33 that classifies localizations depending on their local neighborhood. This approach does not require the users to provide parameters. It also outperforms most classical algorithms in selected clustering challenges33, but depends on training data sets. We point out that these methods devise clustering approaches to group single detection events. In order to infer molecule numbers, additional analysis steps need to be incorporated that correct for multiple emission events detected for the same fluorophore. This can directly be implemented in the current approach, by considering the time domain34,35 or through the analysis of binding kinetics36.

Despite their successes, none of the current approaches have fully removed the dependency on prior knowledge – either a statistical model is needed, a reference density needs to be set, or a machine-learning model needs to be trained on a pre-selected data set. We address the problem of parameter sensitivity and user-generated bias by building on the widely applied DBSCAN13,37,38,39,40 algorithm and propose an unbiased parameter selection which we call FINDER. FINDER minimizes dependencies on prior knowledge by leveraging what is usually seen as a distractor: false positive localizations, or ‘noise’. The core principle of FINDER is to use information about the clustering variability with respect to the variation of the parameters to then select the most robust clustering. Global noise levels therefore act as a lower boundary for the sensitivity of the algorithm, preventing over-segmentation and minimizing false positive cluster inclusions. To validate this approach, we use clusters which we identified in super-resolution images8 to produce synthetic test sets, and compare the performance of FINDER with one of the currently best performing clustering algorithms, the adaptive machine-learning algorithm CAML33. We also compare the performance to classical DBSCAN and density-based OPTICS clustering approaches22. We show that the FINDER algorithm is both independent of training data and is computationally tractable. FINDER also exhibits a similar or better performance as measured by true positive detections, and a reduction in false-positive cluster detections. Finally, since the parameters explored by FINDER have a precise meaning, the algorithm outcome is straightforwardly interpretable.

Results

Figure 1
figure 1

Single-molecule localization microscopy (SMLM) datasets exhibit density variations, noise inclusion and can lead to different cluster analysis results. (a) Localizations representing AMPA-receptors in a dendritic segment of a neuron9. (b) Example of four DNA origami trimers, with different clustering results for different parameters choices. The FINDER algorithm we propose here identifies parameters that lead to a statistically reliable assignment of clusters of localizations.

The assignment of clusters of localizations in super-resolution microscopy is not trivial. Consider, for example, DNA-PAINT data sets of super-resolved neuronal AMPA receptor localizations. In Fig. 1 (left) one can see that owing to variable cluster sizes, high-noise and overlap the identification of clusters of localizations is difficult. Similarly, data sets of DNA origami trimers (Fig. 1 right) indicate that identifying localizations that belong to single molecules of interest and separating these locations from the background noise is also a challenge. Algorithms can propose candidate clusters that may correspond to single molecules of interest. However, verifying the reliability of the result is difficult because the ground truth is not always known and different parameter settings within the algorithm can lead to different outcomes. This means that the presence or absence of molecules of interest and their inferred location will vary depending on the parameter settings of the algorithm. In general, parameter settings, algorithm training and selection can become sources for clustering bias or errors.

Figure 2
figure 2

Schematics of FINDER algorithm (a) In DBSCAN, a new cluster is initiated when at least one core point (shown in red) is present that has at least minPts other points within distance \(\varepsilon\) from the core point (see circles, left). Inspired by DBSCAN, the clustering algorithm used in FINDER iteratively removes non-core points (shown in black) which results in a more frequent identification of noise localizations (grey points, right). (b) Two clustering assigments are considered similar, if the number of matching localizations is greater than the number of unmatched localizations. Example of two similar cluster assignments (top row) and two non-similar cluster assignments (bottom row). (c) Phase space of possible clustering outcomes. FINDER computes a similarity score among clustering results sharing the same value minPts (i.e., for each line on the plot, like the one highlighted with the dashed line). (1–3) represent three possible clustering outcomes within the parameter space. The parameters used for (1)-(3) correspond to the location of the respective number in the phase diagram, respectively.

To solve these challenges we developed a new clustering approach (FINDER) which is based on a similarity metric rather than prior knowledge. FINDER identifies optimal parameters based on a similarity measure that is computed across an interval of probable parameters. To do so, FINDER uses a density-based approach that is based on the core points defined in DBSCAN and then optimizes this parameter based on the parameter phase space in the particular data set. Core points are points that have at least minPts neighbors within radius \(\varepsilon\). With DBSCAN, these core points are used to initiate clusters. Instead, in FINDER, all non-core points are identified as noise localizations and iteratively removed. We refer to this more conservative definition of core points as ’noise-free DBSCAN’. Our tests suggest that this self-contained definition of core points is more robust to noise and leads to a lower number of false positive cluster detections (see Figs. S6S11). In the supplemental information, we compare the performance of FINDER with a version of FINDER using the classical density-based algorithm DBSCAN and with OPTICS, showing a higher false positive detection rate and lower robustness, respectively. The clustering results obtained from each parameter combination are then compared using a Similarity Score (see “Similarity of clusterings and the definition of the similarity score” in Methods) and the selected parameters are identified. A step-by-step explanation can be found in the "Methods" section “FINDER algorithm”. In Fig. 2 a visual schematic description of the FINDER algorithm is depicted.

To benchmark FINDER, we used super-resolution images of structurally well-defined DNA origami trimers and tetramers. DNA origami are folded DNA nanostructures with well- defined binding sites for fluorophores10. As such, the ground truth of the geometry of binding sites is known: the algorithms should identify not more than 3 or 4 subclusters for each DNA origami trimer and tetramer, respectively. Note that the subclusters of the identified oligomers exhibit considerable heterogeneity in their number of localizations per subcluster. We start by comparing FINDER to the recently proposed machine-learning based clustering algorithms (CAML), which outperformed most classical clustering algorithms in selected clustering test cases33. CAML feeds density variations of the neighborhood of each point into a trained classifier to identify clusters. Here, we use the pre-trained models ’CAML 07VEJJ’ and ’CAML 87B144’ from Ref.33, which consider the first 100 and 1000 neighboring points, respectively. Let us note that CAML models belong to the class of pre-trained machine learning models, and we did not re-train them for this or the following data sets (Figs. 35 and Supplemental Information). Since FINDER is meant to have a general scope and is intended to work across different length scales and cluster shapes, it must be compared with algorithms that are not fine-tuned to a specific dataset. Re-training CAML models to adapt to specific data features could presumably result in better performance but could compromise performance for data sets that deviate from these statistics. Since the FINDER algorithm is designed to be a general-purpose algorithm, we choose to compare it to existing CAML models, which are not fine-tuned to a specific dataset.

In Fig. 3, we show that the FINDER algorithm accurately predicts the number of binding sites of the DNA origami oligomer, even though the density of localizations is highly heterogeneous. Notably, the adaptive CAML algorithms lead to a wide variety of detected subclusters of localizations. CAML 07VEJJ detects 3-mers most often, but fails in detecting 4-mers. In constrast, CAML 87B144 is more accurate in detecting 4-mers than 3-mers. One explanation for this discrepancy is that, on average, the detected 3-mers have 156 localizations per subcluster, but 4-mers have 285 localizations per subcluster. This could explain why CAML 07VEJJ, which considers only the first 100 neighboring points, performs poorly for the tetramer dataset, but it does not explain the performance of CAML 87B144. It also does not explain the segmentation failure of CAML 87B144 if no random noise is added (see Fig. S5). This results suggest that a version of CAML which considers the first 1000 points would need to be retrained for such 3-mer and 4-mer datasets. Retraining of the model is one possibility to include global information into local clustering decisions but selecting training data while considering all aspects of the statistics to be captured can be challenging. Furthermore, the ground truth statistics regarding inter-and intra-cluster distance in experimentally recorded data sets is not a priori known which complicates the selection of the reference point for training data. It is therefore hard to define a good training data set without introducing user-generated training biases. This highlights a challenge that adaptive algorithms share: information about the local neighborhood is used for clustering decisions – but often, these decisions need information about the global properties of the dataset – such as noise intensity or cluster separation. Often, these global properties are assumed a priori through user-defined parameters or training data. This potential source of bias is avoided by FINDER, which systematically probes the full dataset to identify one set of global parameters for an easily interpretable density-based clustering algorithm. Concerning the computational load, FINDER usually requires more time than CAML 87B144, which in turn is slower than CAML0 7VEJJ. For example, for the 3-mers data of Fig. 3, the time required for running the algorithms on a laptop were: CAML 07VEJJ 9.914s, CAML 87B144 69.879s, FINDER 106.455s. For the 4-mers: CAML 07VEJJ 39.779s, CAML 87B144 229.632s, FINDER 716.328s. However, we point out that FINDER does not require any training time, while the other models were previously trained.

Figure 3
figure 3

Performance of clustering algorithms for an image composed of 25 DNA-Origami 3-mers (left) and 44 4-mers (right), with added random noise localizations (\(10\%\) of cluster localizations). The optimal radial parameters identified by FINDER for DBSCAN (noisefree) are \(\varepsilon = 8.05\)nm and \(minPts = 9\) (3-mers), and \(\varepsilon = 3.61\)nm and \(minPts = 8\) (4-mers). The histograms show the distribution of the number of subclusters detected for each 3-mer and 4-mer in our test data set, respectively. See Fig. S5 for clustering results without added noise, leading to a segmentation failure of CAML 87B144, suggesting that retraining is necessary.

As a second benchmark, we employed two libraries of unit clusters of localizations: (1) manually identified clusters in a SMLM dataset of a synapse9, which contain on average only 17 localizations per cluster (see Fig. S1 a) and (2) manually identified sub-clusters of DNA-origami trimers, with on average 113 localizations per cluster (see Fig. S1 b). We re-arranged these unit clusters in three different configurations and added random noise points. These surrogate test data sets provide a ground-truth, while also retaining the biological variability of the cluster geometry. Our clustering outcomes are summarized in Fig 4. As expected, the 07VEJJ model fails for the set of unit clusters where the number of points per cluster can be larger than the number of points considered (100, see Fig. S1 b). Interestingly, in several test cases CAML 87B144, which considers the first 1000 neighboring points, also fails for the smaller set of unit clusters (see Fig. S1 a). If the ‘correct’ CAML model is chosen, performance is good, with a high number of true positive and low number of false negative detections. For all test cases we considered here, FINDER leads to a similar number of true positive cluster detections as the CAML model that performs better for the given configuration. FINDER consistently results in the lowest number of false positive cluster detections. See Figs. S6-S11 for more expansive tests. This suggests that FINDER is able to identify global parameters which give robust results with low false positive detection rates.

Figure 4
figure 4

Performance of FINDER and CAML clustering algorithms across synthetic datasets which are composed of clusters of localizations from two libraries. Low-density clusters (ac) are composed of clusters from a SMLM dataset of a a synapse9 with an average of 17 localizations per cluster. High-density clusters (ef) are composed of manually identified sub-clusters of DNA-origami trimers with on average 113 localizations per cluster. (a,d) High overlap: A grid of \(5 \times 5\) clusters with distances equal to the maximal cluster diameter in the dataset, and with \(20\%\) random noise localizations (as a fraction of clustered localizations). (c,d) High noise: A grid of \(5 \times 5\) clusters with distances equal to the 1.5 times the maximal cluster diameter in the dataset, is superimposed with an equal number (\(100\%\)) of random noise localizations. (e,f) Unstructured: 25 clusters are randomly distributed along a sinusoidal path, with \(100\%\) and \(150\%\) added random noise localizations (as a fraction of clustered localizations) in c and f, respectively. The top row shows one instance of a randomized pattern for each case, with highlighted ground truth clusters. For further detail, see supplemental figures: S6-S11, and Fig. S16.

Finally, we applied FINDER and other clustering algorithms on SMLM datasets for which the ground truth is not known. In Fig. 5, we show the clustering results for single-molecule localization DNA-PAINT data of newly synthesized proteins after global homeostatic scaling in neuronal dendrites6 (see Fig. S13 for an analogous analysis for a dataset of neuronal AMPA-receptors). Most clusters of localizations detected by FINDER and the two CAML algorithms have 50 or fewer localizations. CAML (07VEJJ) leads to an abrupt cutoff of cluster sizes at 100, suggesting that typical clusters can exceed that size, and therefore more than 100 neighbors need to be included in the analysis. The cluster size distribution obtained using CAML (87B144) leads to a tail in which clusters with more than 150 localizations are found, and \(40\%\) of all localizations are identified as noise. In comparison, FINDER identifies \(21\%\) of all localizations as noise localizations. FINDER therefore includes more large clusters, which usually represent molecular aggregations, than CAML (87B144). The distribution of cluster sizes for CAML (87B144) (see second bin) and visual inspection of the clusters suggests that there is an over-segmentation, which may require additional filtering. We conclude that the cluster-size distributions are overall robust with respect to the choice of the algorithm. We also note that clustering-results provided by FINDER have the advantage of being easily interpretable and reproducible: For example here, a cluster is defined if 8 core points are found within \(\varepsilon = 76.8\)nm, and a noise localization is defined if it does not have 8 localizations within that radius. Note that the selected value for \(\varepsilon\) is much larger compared to the previous cases to which FINDER was applied (Figs. 3, 4 and Supplemental Information related to them). This is due to the different distribution of the points in this recording, which has a larger number of points (more than 400, 000) and a different density.

Figure 5
figure 5

Analysis of newly synthesized proteins in neuronal dendrites in DNA-PAINT data6. Left: Localizations analyzed using FINDER, CAML (07VEJJ) and CAML (87B144)33. Right: The top row depicts a section of the full field of view corresponding to the red rectangle in the left panel. Detected clusters are highlighted as colored points and localizations that were classified as noise are shown as grey points. The optimal parameters identified by FINDER for DBSCAN (noisefree) for \(minPts = 8\) are \(\varepsilon =76.80\)nm. FINDER, CAML (07VEJJ) and CAML (87B144) assigned \(21.8\%\), \(0\%\) and \(1.4\%\) of all localizations to clusters with more than 400 localizations, respectively. See red circle (I) for an example of a large cluster. The overall structure of the results is similar (eg. red circle (II)), but FINDER sets a higher threshold for the selection of small clusters. Therefore, it identifies more clusters with a low number of localizations (cluster size \(< 25\)) as noise, see eg. red circle (III). See Fig. S14 for the statistics showing the 10th-neighbor distances, Fig. S12 for an overview of the localizations not identified as noise, and for large clusters for each algorithm, and Fig. S17 for clustering outcomes within the full phasespace. See Fig. S13 for an analogous analysis of super resolved neuronal AMPA receptor localizations from9.

Discussion

The identification of clustered localizations in single-molecule localization microscopy data is a crucial step for quantifying proteins within nano-clusters24,41.

A bottleneck for automated image analysis is the identification of appropriate clustering parameters, particularly if target molecules themselves are clustered or if noise levels are high. Clustering algorithms used in scientific studies of SMLM datasets therefore need to fulfill many – sometimes contradicting – requirements. They have to be fast, robust with respect to parameter choice, and offer results that are easily interpretable. Manual parameter tuning needs to be avoided, as detailed knowledge about the system may not be readily available and can introduce human bias. Because control experiments with known ground truth clusterings are not always available, false positive cluster inclusions need to be minimized in order to avoid over- or misinterpretations of the data.

DBSCAN is one of the most popular clustering algorithms for SMLM data13,14,15,16,17,37,38,39,40. Part of its appeal is the fact that its density-based clustering rule is intuitive and it can offer close-to-optimal results if parameters are chosen carefully23. It has two input parameters: a radial parameter \(\varepsilon\) and the minimal number of points minPts within this radial parameter needed to initiate a cluster. However it is not always clear how to set the radial parameter manually for heterogeneous cluster sizes or datasets with changing density. Heuristic rules for setting these parameters exist20,23, but in many examples they are ambiguous. For instance, the optimal value for the radial parameter \(\varepsilon\) is commonly estimated from the k-th neighbor curves, which may be not well-defined and vary with k23. Several adaptations and improvements of DBSCAN have been proposed to deal with density-variations, and the parameter estimation problem42,43. Reverse-nearest-neighbor approaches such as RECORD44, IS-DBSCAN45, ISB-DBSCAN46 and RNN-DBSCAN47 only require setting one parameter k, and can deal with density-variations within the dataset.

In many applications, however, parameters are manually chosen (see Table S1) which may not lead to optimal results when analyzing many datasets with differing localization densities.

Here, we pursue an approach that does not locally adapt to varying densities, but which sets a global threshold for the required cluster density for scenarios in which the radial parameter cannot unambiguously be extracted from k-th-neighbor curves. The idea is to leverage the information given by overall noise levels to set a global threshold for cluster inclusions and therefore minimize the number of false positive cluster inclusions. To tackle the issue of parameter choice in this scenario, while retaining the advantages of density-based clustering methods such as DBSCAN, we presented here an unbiased, automated parameter identification algorithm FINDER. FINDER combines the following benefits: First, it does not require prior knowledge about the structure of clustered localizations and as such it side-steps the need for a statistical model, the need to perform supervised training on annotated data sets, or the need for manual parameter choices that might encompass user-generated biases. Second, FINDER leads to easily interpretable results and automatically defines parameters which are then applied across the full dataset. Finally, FINDER can be applied to large, heterogeneous data sets with clustering results which are robust to noise and signal overlap (Figs. 2, 3 and 4).

In the absence of a known ground truth- as is the case in most scientific exploratory data analysis- clustering algorithms need to be transparent, easily interpretable, and minimize false positive detections to avoid misinterpretation of data. In local adaptive algorithms, local density changes are employed to detect clusters of localizations. But it is often unclear on what basis the threshold for the local density change is set. In 2020, the local machine-learning based clustering method (CAML33) has been proposed. If the correctly trained CAML algorithm is chosen, clustering results are generally very good, with high true positive and low false positive detections (see Fig. 4). However in some cases, these methods can lead to severe over-segmentation (Figs. 3, 4, 5) and in other cases to under-counting of clusters (Fig. 3). For exploratory data analysis with an unknown ground truth, this sensitivity is critical, because one does not know which datasets are outside of the validity regime of a given trained model. This oversegmentation can be avoided if the global parameter choice is used to transfer information from the global to the local scale. For instance, information at the global scale is the overall amount of noise, or the general heterogeneity of the clusters. The local scale is the neighborhood of each localization. By avoiding pre-filtering of noise, and by performing both noise-filtering and clustering in one step, FINDER uses the global information to set the local parameters (minPts and \(\varepsilon\)). In brief, FINDER selects the parameters which leads to the most robust clustering.

We systematically benchmarked FINDER against existing algorithms using two sets of experimentally recorded clusters of localizations. We found that – despite using the same parameters across the full dataset – the cluster inclusion and exclusion criteria used in FINDER perform robustly compared to trained machine-learning models. Our tests on synthetic data sets in the high noise and the high overlap regimes showed that density variations due to noise can lead to over-segmentation in adaptive clustering algorithms if an incorrect algorithm is used (see Fig. 4). For example, for synthetic reconstructions, FINDER performed similar to the better of two pre-trained CAML models. FINDER was able to minimize the number of false positive clusters while maintaining a high ratio of true positive clusters. We also tested FINDER on a dataset of DNA-origami trimers and tetramers and showed that FINDER reduces the number of false positive cluster detections and at the same time retains many true positive detections – leading to an accurate prediction of the underlying molecular structure.

In conclusion, we showed that performing noise identification, parameter-choice and clustering in one single post-processing step, such as proposed in FINDER, provides a reliable and unbiased method for a spatial analysis of SMLM data sets. In most experimental settings, the ground truth is not known, and therefore minimizing the number of false positive cluster detections is important to avoid erroneous interpretations of experimental results. We showed here that an all-in-one cluster identification can help limit the effect of human biases, and can speed up the interpretation of single molecule microscopy datasets.

Methods

FINDER algorithm

The FINDER algorithm identifies the hyperparameters for a cluster-proposing algorithm. Here, we employ two cluster proposing algorithms: DBSCAN, as well as a version of DBSCAN based on iterative removal of non-core points (see "Methods"-section on “Noise-free DBSCAN” for details). Both algorithms take two parameters, which are a fixed minimum number of neighboring points (minPts), and a typical distance (\(\varepsilon\)). FINDER determines the optimal parameter pairs \((minPts^*,\varepsilon ^*)\) through the following steps:

  1. 1.

    Compute the distribution of the distance to the kth-nearest neighbors. Here, we set \(k=10\) (see discussion, and Fig. S17 and Fig. S18).

  2. 2.

    Define the interval in which the algorithm will search for the parameter \(\varepsilon\) as the interval between the 10th to the 90th percentile of the distribution of kth-nearest neighbors. The algorithm will explore n points linearly or logaritmically distributed in this interval. In our experiments, we set \(n = 15\) and set a logaritmic scale. Here, n governs the numerical precision of the final parameter values and the speed of convergence.

  3. 3.

    The interval for the second parameter, i.e. minPts, is a collection of integer values. In our experiments, they span from 5 to 20, since these are close to the biologically plausible lower and upper limits for cluster size, but other choices are also possible if they cover the biologically relevant domain.

  4. 4.

    For the input dataset, compute clustering results for every possible parameter combination using the cluster-proposing algorithm.

  5. 5.

    Compute the similarity score among clustering results sharing the same minPts value, by varying \(\varepsilon\). This choice was motivated by the fact that the radial parameter seems to play a predominant role in the clustering outcome (see18 and also the variation w.r.t. the number of clusters in Figs. S17 and S18). For details, refer to Methods section “Similarity of clusterings and similarity score”. see Methods section “Similarity of clusterings and similarity score”.

  6. 6.

    For each minPts, the value of \(\varepsilon\) that correspond to the clustering with the largest similarity score is selected. This list of pair of parameters is referred to as the line of optima.

  7. 7.

    The values for the similarity score of the elements of the line of optima are re-scaled so that they span from 0 to 1. This is accomplished by removing the minimal value from each element and then dividing by the maximum value.

  8. 8.

    The selected parameters are chosen moving along the line of optima. They are selected to be the first for which the normalized similarity score fall below \(\alpha = 0.5\), i.e., when its value is less than \(50\%\) of the highest similarity score.

The procedure is illustrated in Figs. S3 and S4.

Noise-free DBSCAN cluster definition

DBSCAN initiates clusters using core points. Core points are points which have a at least minPts neighboring points within a distance \(\varepsilon\). Here, we modified this classic DBSCAN cluster definition to make it more robust to noise. First, we iteratively remove all non-core points from the dataset of localizations X such that only core points remain (see Fig. S2). Next, FINDER partitions the remaining core points into clusters. The algorithm is illustrated by the following pseudo-code:

figure a

In all figures of the main manuscript, we used the noise-free DBSCAN cluster definition inside FINDER. For comparison with the classic DBSCAN cluster definition see Figs. S6-S11 in the supplementary material.

Similarity of clusterings and the definition of the similarity score

Let \(\mathcal {C}_i = \{X_1,X_2,\ldots ,X_r\}\) be a clustering of a set of points \(\mathcal {P}=\{p_1,p_2,\ldots ,p_N\}\). We define the similarity score of two clusterings \(\mathcal {C}_1 = \{X_1,X_2,\ldots ,X_r\}\) and \(\mathcal {C}_2 = \{Y_1,Y_2,\ldots ,Y_t\}\) as the sum of similar subsets within the partitions:

$$\begin{aligned} S(\mathcal {C}_1,\mathcal {C}_2) = \sum _{i,j} s(X_i,Y_j). \end{aligned}$$
(1)

Two subsets \(X_i\) and \(Y_j\) are said to be similar if the number of overlapping points (i.e., points shared by the two clusters) is larger than the number of non-overlapping points for each of the subsets:

$$\begin{aligned} s(X_i,Y_j) = \left\{ \begin{array}{ll} 1 &{} \text {if } |X_i \cap Y_j| > \max ( |X_i \setminus (X_i \cap Y_j) | , |Y_j \setminus (X_i \cap Y_j) |\\ 0 &{} \text {otherwise} \end{array} \right. . \end{aligned}$$
(2)

The similarity score of a clustering \(\mathcal {C}_i\) within an assembly of clusterings \(\mathcal {A} = \{\mathcal {C}_1,\mathcal {C}_2,\ldots ,\mathcal {C}_m\}\) is defined as the sum of similarity scores:

$$\begin{aligned} \bar{S}(\mathcal {C}_i,\mathcal {A}) = \sum _j S(\mathcal {C}_i,\mathcal {C}_j). \end{aligned}$$
(3)

Generation of surrogate test data from DNA origami images

To benchmark the performance of FINDER on experimental data sets and compare its performance to alternative existing clustering methods we used a DNA origami data set with a known cluster structure. We considered images of DNA origami containing three or four binding sites that we measured using DNA-PAINT8. Even though three or four localization clusters are expected for this DNA origami data set and this knowledge can provide a ground truth for clustering outcomes, we found that all clustering methods consistently detected a varying fraction of dimers. Upon visual inspection, we found that some expected trimers or tetramers appeared incomplete because some fluorophore binding sites were absent from some origami. We thus divided the trimer data into two groups: visibly resolved and visibly un-resolved trimers. We also introduced the group of resolved tetramers, in order to test the performance of all algorithms on a different geometric configuration See Fig. 3 for analysis of both groups.

In a second test, we accessed the robustness of the algorithms in the limit of overlapping clusters, in the high-noise limit, and with geometric constraints. We manually selected clusters representing single binding sites from resolved trimers (see Fig. 4b) and, in a second set, we manually selected from AMPA receptor images (see Fig. S1). We then re-assembled those monomers in pre-defined grid and path geometries, and with varying levels of added random noise localizations.

Definition of true and false positive cluster detections

In Fig. 4, a cluster X from the ground truth clustering \(\mathcal {C}_1\) is counted as being correctly detected by cluster Y from clustering \(\mathcal {C}_2\) if the overlap between both clusters covers at least \(30\%\) of the points of cluster X, i.e. if \(|X \cap Y| > 0.3 |X|\) and if cluster Y does not detect any other clusters of clustering \(\mathcal {C}\), i.e. one cluster Y cannot detect two clusters from \(\mathcal {C}_1\). All clusters from \(\mathcal {C}_2\) that can be attributed to a cluster of \(\mathcal {C}_1\) in such a way are counted as true positives, and the remaining clusters of \(\mathcal {C}_2\) count as false positives. In Fig.S16, we vary the overlap threshold between \(0\%\) and \(100\%\), and show robustness of the results with respect to this variation.

Note that other metrics for the similarity of two clusterings47 such as the Adjusted Rand Index (ARI)48 or the Normalized Mutual Information (NMI)49, mix the similarity and the number of correctly identified clusters. In contrast, for the second benchmark in Fig. 4, we focus on how many clusters have been correctly and incorrectly identified. We therefore use a metric that uses a hard, binary, threshold for individual clusters.

Experimentally recorded super-resolution microscopy data

DNA Origami data

DNA origami containing 3 (‘trimers’, 3-fold symmetry, 55 nm interspacing, see Fig. S15) and 4 (’tetramers’, 4 fold symmetry, 40 nm interspacing) binding sites (containing P1 docking oligos) were imaged on the same N-STORM system (Nikon, Japan) as the above reported AMPA-receptor data: an Eclipse Ti-E inverted microscope, equipped with a Perfect Focus System (Ti-PSF) and a motorized x-y stage. Total internal reflection fluorescence (TIRF) was adjusted using a motorized TIRF illuminator in combination with a 100 x oil-immersion objective (CFL Apo TIRF, NA 1.49) with a final pixel size of 158 nm. For imaging, 647 nm excitation wavelength was used, housed in a MLC400B (Agilent) laser combiner. An optical fiber guided the laser beam to the microscope body and via a dichroic mirror (T660LPXR, Chroma) to the sample plane. Fluorescence emission was separated from excitation light via a bandpass filter (ET705/72m, Chroma) and detected by an iXon Ultra EMCCD camera (DU - 897U-CS0-23 #BV, Andor). The software NIS-Elements Ar/C (Nikon) and \(\mu\)Manager were used to control the setup and the camera. TIRF illumination was used for super-resolution acquisitions of DNA origami data with a power of 30–40 mW, which was de- termined directly after the objective and under wide-field configuration. Time-lapse datasets with 24000 frames for DNA origami trimers and 10773 frames for DNA origami tetramers and 16 bit depth were acquired at 3.3 Hz frame rate and 5 MHz camera read-out bandwidth; pre-amplification: 3; electron multiplying gain: 50. For DNA-PAINT imaging, the imaging buffer contains P1- Atto655 (CTAGATGTAT-Atto655, Eurofins Genomics) in 500 mM NaCl, pH 7.3. The P1-Atto655 concentration was 10 nM for origami trimer and tetramer data, and 0.5 nM for AMPA receptor DNA-PAINT experiments9.

DNA-PAINT acquisitions were reconstructed using Picasso:Localize, a module of the Picasso software version 0.4.08 (https://github.com/jungmannlab/picasso), by applying a minimal net gradient of 1500. With Picasso:Render, drift corrections were applied based on the redundant cross-correlation (RCC), with a segmentation of 1000 was applied. Drift-corrected data was filtered using Picasso:Filter. To generate the DNA origami trimer and tetramer cluster datasets for cluster-identification validation, trimer and tetramer clusters were identified by eye using Picasso:Render and manually selected with a picking diameter of 2 camera pixels.

Newly synthesized protein data

Newly synthesized protein data was previously reported (see Ref. 6). In brief, cultured neuron was incubated in a growth medium containing (Tetrodotoxin) TTX for 1 h 15 mins before the treatment ended, the neuron was metabolically labelled with AHA. The immuno-stained neuron samples were then imaged using DNA-PAINT8.

AMPA-receptor data

AMPA-receptor validation data was previously reported (see Ref. 9). In brief, cultured neurons were stained by primary antibody against AMPA receptor GluA2 subunit before fixation and secondary antibody staining, in which the secondary antibody was modified to carry a P1 docking oligo. The immuno-stained neuron samples were then imaged using DNA-PAINT8.