Abstract
Spatial aggregation of proteins might have functional importance, e.g., in signaling, and nanoimaging can be used to study them. Such studies require accurate characterization of clusters based on noisy data. A set of spatial correlation approaches free of underlying cluster processes and input parameters have been widely used for this purpose. They include the radius of maximal aggregation r_{a} obtained from Ripley’s L(r) − r function as an estimator of cluster size, and the estimation of various cluster parameters based on an exponential model of the Pair Correlation Function(PCF). While convenient, the accuracy of these methods is not clear: e.g., does it depend on how the molecules are distributed within the clusters, or on cluster parameters? We analyze these methods for a variety of cluster models. We find that r_{a} relates to true cluster size by a factor that is nonlinearly dependent on parameters and that can be arbitrarily large. For the PCF method, for the models analyzed, we obtain linear relationships between the estimators and true parameters, and the estimators were found to be within ±100% of true parameters, depending on the model. Our results, based on an extendable general framework, point to the need for caution in applying these methods.
Introduction
In cell biology and elsewhere, spatial aggregation or clustering is an interesting phenomenon, possibly with a functional role — e.g., the behavior of proteins to form submicrometer sized clusters could be important for their functionality, e.g. in signaling^{1,2,3,4}, transcription^{5,6}, etc. The origins, structure and function of spatial heterogeneity of proteins in varous systems are only being studied. Spatial location information, available from fluorescence and electron microscopic imaging, and recently from subdiffraction limited fluorescence imaging such as Single Molecule Localization Microscopy(SMLM) techniques like PhotoActivated Localization Microscopy (PALM) and STochastic Optical Reconstruction Microscopy(STORM)^{7,8,9}, are key to such studies^{10,11,12,13}. Accurate characterization of clustering — its strength, scale and density — is an important part of these studies, whether for relative comparison between different systems, perturbation conditions and to test hypothesis (e.g. the relative importance of lipid rafts and actin cytoskeleton in membrane protein clustering^{14}, or the possible mechanisms of early Tcell signaling^{10}), or for absolute quantification, e.g. the size of clusters in a particular cell type in a particular condition, and the number of molecules in them.
A number of methods have been used to characterize the clusters from imaging data^{15,16,17}. The methods can be broadly categorized into two: (1) clustering or segmentation to identify the clusters, followed by their characterization; and (2) spatial statistics approaches based on a secondorder spatial summary statistic like Besag L(r) − r function or the Pair Correlation Function g(r). These secondorder functions can be used for comparison of clustering at different scales and between different experimental systems and perturbations, and estimators based on these functions can be used for ensemble cluster parameter estimation. In general, they have a few advantages over many of the segmentation approaches: they are parameterfree, can detect interactions at multiple spatial scales, can work with both dense and sparse point patterns, often have direct physical interpretations^{18}, and are amenable to rigorous extensions incorporating error models, crucial in the case of nanoscale imaging^{18,19,20}. Specifically, models of various sources of errors in SM imaging, such as the artifact clustering due to single fluorophore blinking^{11,18}, localization uncertainty of 10–50 nm (FWHM) due to limited number of signal photons collected and the imperfect detection efficiency (only 40–60% of photoactivable fluorophores in the sample can be typically detected) can be incorporated to correlation functions^{11,18,20}. Also, in the case of SMLM, the notion of spatial point patterns align well with the nature of its point localization readout. In practice, a major convenience of using such methods have been that they estimate ensemble functions at different scales and the various cluster parameters for a whole dataset, without requiring user set input parameters, making comparative studies easy in systems where variability within cluster sizes are not important.
Two spatial statistics based estimators of cluster parameters based on these functions widely reported in the nanoimaging and protein cluster analysis literature are 1) the radius of maximal aggregation r_{a}^{15,17,20,21,22,23,24,25,26,27,28,29}, the radius value corresponding to the maxima of the empirical L(r) − r function, as an estimator of cluster size(length scale); and 2) the functional approximation of g(r) as an exponential function^{5,11,18,30,31,32,33}, extended to 3D in ref. 34, leading to estimators of cluster size, amplitude or strength and number of molecules per cluster (see Methods). While both methods are based on secondorder correlation, the estimators are different – one is based on the empirical maxima of the L(r) − r function while the other is based on fitting the empirical pair correlation function to an approximate model to obtain the model parameters. These methods are not concerned with the underlying spatial distributions, e.g. the shapes of clusters and the distribution of molecules in them. Effects due to differences in underlying spatial distribution are either ignored or approximated, making the estimation process free of underlying cluster processes.
However, molecular distribution in clusters observed through bioimaging could be of different shapes, depending on the underlying physical mechanism. The molecules could be concentrated at the center of the cluster, either heavily or lightly, or distributed uniformly within the cluster (Fig. 1). In the case of SMLM imaging, e.g., the clusters formed due to photoblinking are reported to have a Gaussian^{11} or Cauchy peak shape^{33} – the latter a well known “fat tailed” distribution – depending on the photon count distribution within the cluster. It is plausible to model internalization in circular or spherical bodies with a hardcore process (a disk in 2D). Liquid clusters, due to surface tension, are expected to form spherical clusters^{35}. Analysis methods often assume Gaussian shapes for membrane clusters^{16,17}. Refs 36,37 have suggested modeling membrane protein distributions using 2DIsing model, to account for phase transitions and criticality. It is not clear how the parameter estimation approaches that are independent of underlying true cluster processes are biased or scaled due to these different underlying molecular distributions in clusters. They also raise the question of identifiability: e.g., can the size (i.e. length scale) estimator of these approaches be mapped exclusively to the size parameter of the true process, independent of other parameters, such as number of clusters per unit area or cluster density or amplitude? If the estimated size parameter is dependent on both the size and amplitude parameters of the underlying true process, one must account for it during the comparative analysis of cluster sizes, as it may not accurately reflect the true differences in size, estimation being affected by amplitudes as well. Other point pattern based parametric methods^{26,38} also have to deal with similar issues. The influence of shape and geometry – of crucial importance in biology^{39} – in estimation is observed in other fluorescence based technologies as well^{40}.
Some clues have been obtained from simulation studies. Kiskowski et al.^{21} studied the relation between the true radius of disk clusters R and estimates of r_{a} by means of simulations, and derived important insights — such as R ≤ r_{a} ≤ 2R, and a dependency of r_{a} on separation between clusters. However, since the study was based on simulations, with a limited set of parameters and models (only disk clusters), the understanding is limited, and the possibilities of generalization are not clear. Lagache et al.^{26} performed a theoretical analysis of a similar estimator — maxima of the Kfunction normalized with its variance — for disk shaped clusters, and reported a simpler, constant relation r_{a}/R = 1.3. Such a relationship would have been convenient, however its generality in terms of models and parameters is not clear. No studies of the bias introduced by the approximate model of g(r) has yet been reported, to the best of our knowledge.
Note that the accuracy or bias of an estimator cannot be improved by repeated measurements, unlike its precision. By definition, bias affects absolute quantification. The same is the case regarding their use as relative comparisons: the need to account for biases might become important if (1) the parameters are not separately identifiable or (2) involves scaling that are model dependent and the model is not known, or if the comparisons involve different models.
In this work, we explore, with theoretical rigor, the bias in parameter estimation and the questions of identifiability introduced by these approaches (Fig. 1). We consider a number of spatial cluster processes whose theoretical g(r) and L(r) − r are known, and then derive the relation between the parameters of these approaches (such as r_{a}) and the true process parameters (e.g., the cluster size parameter r_{t}). We find that, in general, for a large class of clustered point patterns, the ratio p of the radius of maximal aggregation r_{a} and the size parameter of the true process r_{t}(p = r_{a}/r_{t}) can be derived as an implicit nonlinear function of two cluster parameters: r_{t} and the number of clusters per unit area (κ). We also find that it possible to derive a theoretical lower bound for p, given a cluster model following some basic assumptions. We validate the theoretical results with simulations. We also perform similar analysis for the normalized Kfunction presented in ref. 26, to report a more complex relationship between the true cluster size and the estimator than presented in ref. 26. Then, we investigate the bias due to the exponential approximation model of g(r), for the different models. By minimizing the Least Square Error between the true and approximate PCFs, we obtain scaling laws between the approximate model and the true model parameters, and validate the approach by simulations. The extension for other cluster models is straightforward.
Results
Parameter identifiability issues and bias of radius of maximal aggregation
Here we obtain the relationship between the radius of maximal aggregation, defined as , and the true cluster size parameter (detailed definitions in Methods). We focus on a class of clustered point patterns with Kfunctions of the form
with h(r) = H′(r) and A > 0. Such a shape for Kfunction can represent diverse cluster models, such as uniform disk, Gaussian, etc (called Poisson cluster processes) and Ising processes (see Methods, Supplementary Table S1, for details). H typically is a function of the true cluster size parameter. The parameter A represents the number of clusters per unit area (κ) in the case of Poisson cluster processes, and ‘amplitude’ in the case of the Ising model (Methods). L(r)−r is typically also used to compare the ‘strength’ of clustering, and the expressions in Supplementary Table S1 relates L(r)−r to different cluster parameters. It can be noted that for a large class of models, the expressions of L(r)−r are independent of the number of molecules per cluster (Methods).
For point processes with the Kfunction as in (1), using the basic criteria for local maxima, we obtain L′(r_{a}) − 1 = 0, and hence K′(r_{a})^{2} = 4πK(r_{a}) using (5). Substituting this in (1), we obtain
That is, r_{a} depends on A in general, as A is not a parameter of H and h. (2) can be used to obtain a relation between for all the models listed in Supplementary Table S1, where r_{t} is the cluster size parameter of the true process. The results are shown in Supplementary Table S2 and the details of derivation are given in Supplementary Information.
We find that it is possible to write the relationship for all the Poisson cluster processes discussed, and hence, p (and r_{a}) is nonlinearly dependent on both κ and r_{t} (Supplementary Table S2). In the case of exponential PCF considered in detail in the next section, of which the varGamma process is an example, the results – nonlinear relation between r_{a}, r_{t} and κ, hold as well. In the case of the Ising process, the corresponding relationship is of the form , where a_{I} and ξ are the amplitude and size parameters of Ising process (Methods, Supplementary Tables S1). Figure 2a shows the contour plot of p vs κr_{t} for different models. The behavior of p in the case of a powerlaw PCF — heavily dependent on parameters — is given in Supplementary Figure S5 (derivation in Supplementary Information). Note that p is independent of the number of points per cluster (μ in Methods) if the expressions for Kfunctions are independent of it.
The singularity at H(r_{a}) − r_{a}h(r_{a}) = 0 in (2) provides a lower bound for p for all the models analyzed, and is also shown in Supplementary Table S2 and Fig. 2a. The lower bound so obtained is a fundamental characteristic of the cluster model. The existence of a lower bound for r_{a} for any cluster model with Kfunction of the form in (1) can be proved theoretically given some basic assumptions on h(r) (see Supplementary Information).
It can be seen that the lower bound for p is model dependent. For the disk model, e.g., 1.29564 < p < 2, whereas, for Gaussian model, 2.24181 < p < ∞. Now, the p for different processes cannot be directly compared, as the size parameter of the true process r_{t} is defined differently for them. A more comparable measure would be r_{q}, the (true) radius at which q fraction of the points are expected to lie for a particular distribution, typically obtainable in the form r_{q} = u_{q}r_{t}, such as the case of r_{0.95} = 2σ in the case of 1D Gaussian distribution. r_{q} is conceptually similar to “full width at half maximum” (FWHM), a measure that is widely used in the imaging literature. r_{q} would then correspond to the ratio p_{q} = p/u_{q}. Considering the case q = 95%, the values for u_{0.95} and the lower bounds for p_{0.95} for different distributions is given in Supplementary Information, and the plot p_{0.95} vs is shown in Fig. 2c. It can be seen that p_{0.95} is dependent on both the model as well as both the number of clusters per unit area and the true cluster size.
The systematic relationship established between p (or p_{0.95}), A and r_{t}, clarifies the bias and identifiability issues in estimation. The results agree with ref. 21, and provide a tighter theoretical lower bound (1.29564 instead of 1) for disk clusters. The approach can also explain the qualitative influence of intercluster distance on r_{a} observed by^{21}, through the dependency of p on κ, r_{q} would then correspond to the ratio. The dependency of p on other cluster parameters and the cluster model means that the estimator could be a poor choice as a comparison tool between different experiments r_{q} would then correspond to the ratio.
Validation with simulations
To establish the validity of the theoretical derivation obtained in previous section (shown in Supplementary Table S2) we performed a Monte Carlo simulation study. In addition to information about the accuracy of radius of maximal aggregation (the subject of the theoretical study), it also provides information about its precision as an estimator.
Clustered point patterns, belonging to either Gaussian or disk clusters, were simulated in a unit square, for varying κ and r_{t}. The theoretical value of p for a given κ and r_{t} were obtained by solving the analytical expressions in Supplementary Table S2, and was compared to was obtained from the empirical maximum of the L(r) − r curves. The results are shown in Fig. 3 (also see Supplementary Figure S1). The mean value of from simulations broadly agree with the theoretical results, though the deviation increases with increasing (see also the Mean Squared Error in Fig. 3c,d). This is probably the result of increasing number of clusters per unit area (increasing κ) or having larger clusters within the unit square used in the simulations (increasing r_{t}), both resulting in overlapping clusters, resulting in deviations from theoretical framework based on a particular cluster model. In fact, it can be seen that the deviation is most influenced by increasing radius (Fig. 3c,d).
Case of normalized Kfunction
In ref. 26, a variation of Kfunction was introduced, referred to as (details in Methods). The radius of maximal aggregation for was then obtained by setting . Using numerical approaches, they obtained the constant relation .
In our hands, for a square observation window (for simplicity) resulted in a more nuanced situation, as shown in Fig. 4 (details in Supplementary Information). We found that depends on the number of points n and the ratio m between the side length of the square observation window and the true size parameter r_{t}, and converges to a maximum value (an upper bound) at large m, which is approximately equal to the minimum values obtained in the case of r_{a} based on L(r) − r. For example, in the case of clusters with R = 20 nm with an area of analysis of size 10 μm, then m = 500, and is close to the maximum value (Fig. 4), and hence a constant (1.296 in the case of disk clusters, approximately equal to the factor of 1.3 obtained in ref. 26). On the other hand, if the area of analysis was smaller, say 1 μm, then m = 50, and depends critically on n (Fig. 4). The dependency of on n, in contrast with p in the case of L(r) − r, is because used in the definition of (Methods) is nonlinearly dependent on n, whereas the expression for K(r, n) (and L(r) − r) is independent of n. Note that is independent of κ and β, unlike the case of r_{a} and L(r) − r presented in the previous section.
Estimation based on exponential approximation of Pair Correlation Function
We now consider another estimator that has been suggested for estimating cluster parameters, the approach based on fitting Pair Correlation Functions. As discussed in the section Methods, the theoretical PCF is not unique to a cluster model, and its signature shape and sensitivity are often not sufficient to identify the models (e.g., see Fig. 5), not the least because each experiment provides a realization of a stochastic process, with the observed statistic approaching the theoretical one only as n → ∞. Model selection based on Monte Carlo (MC) rank tests^{41,42} — ranking the empirical statistic value among the values of the statistic from MC simulations based on estimated parameters — based on PCF or the related K or L(r) − r functions is not sound, if the same function was used for parameter estimation^{41}. The standard method in this case is to perform MC rank tests with a statistic that is different from the one that was used for parameter estimation, e.g., the nearest neighbor distribution function if the PCF was used for estimation. However, the approach is known to have low statistical power^{42}, and we too had similar experience during preliminary attempts to identify the cluster models from simulations and SMLM data (results not shown). Therefore, functional approximations such as g_{a}(r) = 1 + a exp(−r/d), proposed as part as the PCPALM method, have much appeal.
Here, we derive a measure of bias in parameters introduced by this approximation, given a true model. We aim to find the relations m = d/r_{t}, n = a/a_{t} and l = N_{a}/N_{t}, given a true model for the PCF in the form f(r) = 1 + a_{t}v(r, r_{t}). Here, N_{a} and N_{t} are the average number of points per cluster corresponding to the approximate model and the true model respectively, as per (7). Given a specific model for f(r), we find the relation between parameters in the case of the fit that provides the minimum (Least) Squared Error E, i.e.,
Note that the Least Squares criteria was used in original PCPALM papers for parameter estimation^{11,43}. If E has a minima at , then and at , which can be solved to obtain expressions for . Measures of m, n and l can then be found using these.
We were able to obtain measures of m,n and l for all the cluster models described in Supplementary Table S1, and the results are shown in Table 1 and the best fit PCFs can be seen in Fig. 6a (details in Supplementary Information). The m_{0.95} values: m_{0.95} = d/r_{0.95} = m/u_{0.95}, r_{0.95} being the scale at which 95% of points are expected to lie, can also be obtained as constant scalar values, given by 0.63, 0.82, 0.38 and 0.28, for Gaussian, disk, Cauchy and varGamma models respectively.
For example, in the case of Gaussian shaped clusters, with the PCF given in Supplementary Table S1, we obtain, for r_{m} > 6σ, m = d/σ ≈ 1.54, n ≈ 1.26, l ≈ 1.48, with m_{0.95} = 0.63. The parameters can be either upscaled or downscaled — e.g., the number of molecules per cluster is overestimated by 50% by using g_{a}(r) for estimation, whereas in the case of Ising process, it is underestimated by 40%. The overestimation/underestimation for all parameters is no more than by 100% in all the models the approach was applied, except in the case of the amplitude parameter in the Ising model. In this case too, while the a parameter is dependent on both the true amplitude a_{I} as well as true size parameter ξ, the effect is to the extend of n = 0.38–1.44 for ξ = 5–1000 nm, the case relevant in the case of protein clusters.
For the models in Supplementary Table S1, this means that (1) the estimated parameters scale linearly with the true ones, (2) the scaling is either independent of other parameters or only mildly dependent, (3) the theoretical scaling due to the exponential approximation is within 100%, in contrast with the radius of maximal aggregation, which can be several times higher (technically upto ∞) depending on models and parameter values.
We validated this theoretical approach by means of Monte Carlo simulations. We simulated Gaussian cluster processes in a unit square for different conditions, such as varying the numbers of points per cluster as well as cluster radius. The empirical PCF of these point patterns were fitted to both the theoretical PCF for Gaussian point patterns, as well as the functional approximation g_{a}(r), and the various parameters estimated. The estimates for N, the number of points per cluster is shown in Fig. 6b. It can be seen that the simulations agree with the theoretical prediction, with estimates using g_{a}(r) being overestimated, whereas the fit to Gaussian PCF providing accurate results.
Discussion
Quantitative studies of protein clustering from noisy single molecule data requires accurate and precise estimation tools. A key source of inaccuracy or bias in quantification could be the bias in the estimation tools themselves. Unlike in the case of precision, repeated measurements cannot remove the errors due to bias. The commonly used correlationbased estimators provide a convenient estimation tool by incorporating error models relevant to SM imaging and by not requiring the user to input parameters. However, the results should be interpreted after accounting for estimator bias.
The description of protein clustering involves multiple independent parameters, such as the number of clusters per unit area, cluster length scale, number of molecules per cluster and the cluster shape or distribution, all of which are unknown. Correlation approaches have been used in two ways: (1) to estimate some of these parameters, and (2) to estimate and compare the level or ‘strength’ of clustering between different conditions, e.g., based on the magnitude of L(r) − r curves. The latter task involves estimating a lumped measure that combines the above mentioned parameters, which describes the ‘strength’ of clustering. However, it is not clear how these estimators relate to true parameters. Can the true parameters be estimated independently (identifiability), or do the estimators correspond to some lumped parameters that combine independent true parameters? Also, what does the ‘strength’ of clustering mean with respect to the independent cluster parameters? It might be possible to have the same strength estimate values (e.g., L(r) − r magnitude) for different combinations of parameter values, if their net effect on the lumped measure is the same.
Based on the theoretical correlation functions corresponding to a diverse set of models (Table S1), we have theoretically analyzed three correlation based methods for estimating cluster parameters that have been proposed in the literature, for their identifiability and bias. The results also provide information about how different parameters influence the correlation functions for different cluster processes, i.e., what does ‘strength’ of clustering mean. Our results provide a cautionary tale about using these approaches, and also provides a general framework to analyze their bias.
Specifically, we analyzed: the radius of maximal aggregation based on L(r) − r function and on normalized Kfunction, both primarily estimators for cluster size (length scale), and the estimation based on the functional approximation with an exponential function for the Pair Correlation Function, proposed in the PCPALM method. The expressions in Table S1 provides the relation between cluster parameters and correlation functions (i.e., strength of clustering). We find that crucial independent parameters of clustering, such as the number of molecules per cluster, may not be reflected in the L(r) − r curves (Table S1). We were able to derive the theoretical relation between the radius of maximal aggregation and the true model parameters, for different models. These relations give an operational interpretation of the radius of maximal aggregation for different cluster generating models. In addition to the relation to model specific size parameter (such as radius for disk clusters, standard deviation for Gaussian clusters, etc), we also analyzed the relation between the radius of maximal aggregation and a cluster size measure that is defined independent of models (radius at which 95% of molecules are expected to lie). Our results illustrate that the ratio of radius of maximal aggregation (from L(r) − r curves) to the true cluster size nonlinearly depends on the true cluster size as well as the number of clusters per unit area (or corresponding parameters, such as amplitude) for all the models considered. While we were able to derive theoretical lower bounds for this ratio, we found that the ratio could be arbitrarily large depending on the parameters and models, illustrating that the radius of maximal aggregation is not always indicative of the true cluster size.
In the case of the approach based on the exponential approximation of Pair Correlation Function, we were able to derive the scaling laws between the parameters of the approximate model and the true model, based on the Least Squares criteria. It was found that the relationship between estimates and true parameters were found to be linear, and that the bias is limited to ±100% of true parameter values for the models we considered. These results apply to model independent measures such as number of molecules per cluster.
Our results provide a cautionary tale in the quantification of clustering based on commonly used correlation approaches. The estimates obtained are crucially dependent on various cluster parameters or models, meaning that considering them as absolute measures might be misleading. Moreover, in the case of many estimators (e.g., r_{a}), a qualitative, monotonic mapping between the corresponding parameters (size) are found to be not possible in general, instead, they might depend on other parameters as well (e.g., number of clusters per unit area κ in the case of r_{a}). That is, for certain combinations of parametric values, a higher r_{a} might not mean a higher true cluster size, since they might be confounded by κ. However, there are regimes of parametric values where the dependency on other parameters are not key (see Fig. 2), and guidance for whether the analysis do not fall in such regimes where a qualitative mapping is possible, can be obtained using the expressions in Table S2 (and corresponding figures in Fig. 2) along with estimates of relevant parameters. Since the results in the case of exponential approximation of PCF points to a linear mapping between the parameters and estimators for the models analyzed, and since qualitative, monotonic mapping is possible, it appears to be in general better estimator than r_{a}, for the models analyzed. Unfortunately, there seems to be a significant effect of model dependent scaling, and whether it is an acceptable level of bias depends on the specific problem and the magnitudes involved, and caution must be applied during analysis. Also, we find that the identification of cluster models based on secondorder functions is difficult, and we also describe how background noise might crucially affect estimation (Methods). We recommend that the results using the correlation approaches must ideally be corroborated by means of other methods (such as DBSCAN, or recent approaches such as^{16,17}), especially if the effect size is small.
Although only a limited set of models were analyzed here, the results illustrate the limits of the estimators. The analysis presented here can also be extended to other models and settings in a straightforward manner. For instance, we focused solely on the case of constant cluster size for all models for the purpose of illustration. However, the analysis can be easily extended to variable cluster sizes by modeling the cluster size as a random variable in (10). Also, our analysis shows that it might be possible to obtain theoretical bounds for parameters given a set of candidate models, e.g. by taking the worst bounds among candidates, even though the specific candidate model for a system is not known or is difficult to be inferred. It also points to a possible approach to reducing the bias: by using nonparametric models for the PCF, although care must be taken against overfitting and also in interpreting the results. This work only deals with the accuracy limits of the estimators, their precision could also be important in practical applications, which must be analyzed separately. The results presented in this work are not limited to protein clusters, and are applicable to any system with spatial clustering.
Methods
Background definitions
For a spatial point pattern in 2Dspace, Ripley’s Kfunction is defined^{41,42,44} as
where ρ is the spatial density (average number of points per unit area), and M(r) is the number of other events within distance r of a randomly chosen event. The Besag L(r) − r, a measure of cluster strength at r, is then given by
and the Pair Correlation Function by
Alternative but equivalent definitions of PCF starting with the notion of spatial autocorrelation are also possible^{11}.
The radius of maximal aggregation, .
The function g_{a}(r) = 1 + a exp(−r/d) has been proposed as a functional approximation for the Pair Correlation Function (PCF) of “2Dsystem of clusters with no predefined shape”^{11,30}. The parameter a is the amplitude, a measure of point density in the clusters, and d, the correlation length, gives the radius of the cluster^{11}. For the PCF g(r), the average number of points per cluster can then be obtained as
which is equal to N_{a} = 2πad^{2}ρ in the case of g_{a}(r), where ρ is the average density of points in the area of analysis^{11}.
In ref. 26, the normalized statistic was proposed, given by , where
where A is the area and P the perimeter of the observation window, and n the number of points.
For disk process, they use, similar to the expression in Supplementary Table S1:
where κ is the number of clusters per unit area, and β the clustered fraction.
Theoretical expressions for g(r) and L(r) − r
In order to derive the theoretical expressions for L(r) − r and g(r) for different cluster models, it is useful to focus on a class of spatial cluster processes, known as Poisson cluster processes, or NeymanScott processes(details in refs 41 and 42), which are generated in the following way. First, a set of parent points are created, following a spatial Poisson process (complete spatial randomness) with density (intensity) κ. Then, S number of points are distributed around each parent point according to the i.i.d bivariate PDF f_{pdf} (.), S following some i.i.d distribution with mean μ. We assume a constant cluster size (length scale) parameter here, though the analysis with a variable cluster size is straightforward. These offspring points form the clustered point pattern. Such simple spatial cluster models that consider different shapes of clusters provide a starting point for the theoretical analysis of estimators. The Ising model, also considered in analysis, provides a more physical example.
Assuming f_{pdf} (.) to be radially symmetric, let the PDF of the distance r between two offspring points within a cluster is given by h_{d}(r) and its Cumulative Distribution Function (CDF) by H_{d}(r). Then^{41}:
The density of the point pattern will be μκ. When , since E[S(S − 1)] = μ^{2} for Poisson distribution, (10) reduces to
The derivation in case of other distributions for points per cluster is straightforward. In the case of geometric or exponential distribution of S, behavior often observed in nanoimaging^{45,46,47}, for Thus, for a large class of cluster models, the Kfunction is independent of the number of molecules per cluster µ.
Note that H_{d}, being the CDF, is monotonic and nondecreasing. The corresponding PCF becomes:
The PCF and Kfunction for different cluster shapes are given in Supplementary Table S1, and the shapes of their PDFs are given in Supplementary Information. Note that disk clusters contain points distributed uniformly at random within a circle (disk), a process known as Matérn cluster process in spatial statistics (the case of Gaussian cluster shapes is known as Thomas process). Also note that r_{t} is defined differently for different cluster models: for a disk cluster, r_{t} = R, the true cluster radius, whereas for Gaussian clusters, we set r_{t} = σ, the true standard deviation (the full list can be found in Supplementary Table S1). We also add the physical Ising model to the compilation, since it is one of the models that has been proposed for membrane protein clustering^{18}, even though it is not a NeymanScott process. Also, note that the exponential approximation g_{a}(r) has the same shape as the variance Gamma function model (varGamma) in Supplementary Table S1, pointing at the nonuniqueness of g(r) shapes and the difficulty of identifying cluster models from data based on their PCF shapes.
Effect of background
To model a monomer fraction or background, a spatial Poisson distributed monomer point pattern can be superimposed to a purely clustered process, such that the purely clustered fraction of points is β. The resulting Kfunction and PCF can be obtained using the expression for superposition of two independent point processes^{41}. In the case of a clustered process with g(r) = 1 + Bv(r), superposition with such a background process results in the PCF:
where B_{e} = Bβ^{2}, β being the purely clustered fraction^{41}. Expressions for K(r) and L(r) − r undergo similar scaling in parameter. It can be noted that the shape of the function remains the same as the purely clustered process, the change in parameter B being the only change, again pointing at the nonuniqueness of PCF shapes, and the quadratic effect of background on the function (note the effect on (7)).
Simulation and analysis details
All simulations were done in R, using the spatstat library^{48}. Simulations of cluster processes were done with standard library functions, such as rThomas and rMatClust. Parameter estimation by minimum contrast method was done using kppm function, and using parameters “Thomas” and “VarGamma”. Analytical derivations were performed with the help of symbolic algebra software [Mathematica(Wolfram Research, USA)].
Additional Information
How to cite this article: Shivanandan, A. et al. On characterizing protein spatial clusters with correlation approaches. Sci. Rep. 6, 31164; doi: 10.1038/srep31164 (2016).
References
 1.
Sieber, J. J. et al. Anatomy and dynamics of a supramolecular membrane protein cluster. Science 317, 1072–1076 (2007).
 2.
Lang, T. & Rizzoli, S. O. Membrane protein clusters at nanoscale resolution: more than pretty pictures. Physiology 25, 116–124 (2010).
 3.
Hartman, N. C. & Groves, J. T. Signaling clusters in the cell membrane. Curr Opin Cell Biol. 23, 370–376 (2011).
 4.
GarciaParajo, M. F., Cambi, A., TorrenoPina, J. A., Thompson, N. & Jacobson, K. Nanoclustering as a dominant feature of plasma membrane organization. J Cell Sci. 127, 4995–5005 (2014).
 5.
Cisse, I. I. et al. Realtime dynamics of rna polymerase ii clustering in live human cells. Science 341, 664–667 (2013).
 6.
Ricci, M. A., Manzo, C., GarcaParajo, M. F., Lakadamyali, M. & Cosma, M. P. Chromatin fibers are formed by heterogeneous groups of nucleosomes in vivo. Cell 160, 1145–1158 (2015).
 7.
Betzig, E. et al. Imaging Intracellular Fluorescent Proteins at Nanometer Resolution. Science 313, 1642–1645 (2006).
 8.
Hess, S. T., Girirajan, T. P. & Mason, M. D. UltraHigh Resolution Imaging by Fluorescence Photoactivation Localization Microscopy. Biophys J. 91, 4258–4272 (2006).
 9.
Rust, M. J., Bates, M. & Zhuang, X. Subdiffractionlimit imaging by stochastic optical reconstruction microscopy (STORM). Nat Methods 3, 793–796 (2006).
 10.
Williamson, D. J. et al. Preexisting clusters of the adaptor lat do not participate in early t cell signaling events. Nat Immunol. 12, 655–662 (2011).
 11.
Sengupta, P. et al. Probing protein heterogeneity in the plasma membrane using PALM and pair correlation analysis. Nat Methods 8, 969–975 (2011).
 12.
Lillemeier, B. F. et al. TCR and lat are expressed on separate protein islands on t cell membranes and concatenate during activation. Nat Immunol. 11, 90–96 (2009).
 13.
Prior, I. A., Muncke, C., Parton, R. G. & Hancock, J. F. Direct visualization of ras proteins in spatially distinct cell surface microdomains. J Cell Biol. 160, 165–170 (2003).
 14.
Scarselli, M., Annibale, P., Gerace, C. & Radenovic, A. Enlightening gproteincoupled receptors on the plasma membrane using superresolution photoactivated localization microscopy. Biochem Soc T. 41, 191–196 (2013).
 15.
Deschout, H., Shivanandan, A., Annibale, P., Scarselli, M. & Radenovic, A. Progress in quantitative Single Molecule Localization Microscopy. Histochem Cell Biol. 142, 5–17 (2014).
 16.
RubinDelanchy, P. et al. Bayesian cluster identification in singlemolecule localization microscopy data. Nat Methods 12, 1072–1076 (2015).
 17.
Levet, F. et al. Srtesseler: a method to segment and quantify localizationbased superresolution microscopy data. Nat Methods 12, 1065–1071 (2015).
 18.
Veatch, S. L. et al. Correlation functions quantify superresolution images and estimate apparent clustering due to overcounting. Plos One 7, e31457 (2012).
 19.
Shivanandan, A., Deschout, H., Scarselli, M. & Radenovic, A. Challenges in quantitative Single Molecule Localization Microscopy. FEBS Lett. 588, 3595–3602 (2014).
 20.
Shivanandan, A., Unnikrishnan, J. & Radenovic, A. Accounting for limited detection efficiency and localization precision in cluster analysis in Single Molecule Localization Microscopy. Plos One 10, e0118767 (2015).
 21.
Kiskowski, M. A., Hancock, J. F. & Kenworthy, A. K. On the Use of Ripley’s KFunction and Its Derivatives to Analyze Domain Size. Biophys J. 97, 1095–1103 (2009).
 22.
Antoku, Y., Dedecker, P., Pinheiro, P. S., Vosch, T. & Srensen, J. B. Spatial distribution and temporal evolution of DRONPAfused SNAP25 clusters in adrenal chromaffin cells. Photochem Photobiol Sci. 14, 1005–1012 (2015).
 23.
Roy, N. H., Chan, J., Lambelé, M. & Thali, M. Clustering and Mobility of HIV1 Env at Viral Assembly Sites Predict Its Propensity To Induce CellCell Fusion. J Virol 87, 7516–7525 (2013).
 24.
Pezzarossa, A. Imaging plasma membrane domains in signaltransduction pathways. Ph.D. thesis, Leiden University (2012).
 25.
Malkusch, S., Muranyi, W., Müller, B., Kräusslich, H.G. & Heilemann, M. Singlemolecule coordinatebased analysis of the morphology of hiv1 assembly sites with nearmolecular spatial resolution. Histochem Cell Biol. 139, 173–179 (2013).
 26.
Lagache, T., Lang, G., Sauvonnet, N. & OlivoMarin, J.C. Analysis of the Spatial Organization of Molecules with Robust Statistics. Plos One 8, e80914 (2013).
 27.
Parton, R. G. & Hancock, J. F. Lipid rafts and plasma membrane microorganization: insights from Ras. Trends Cell Biol. 14, 141–147 (2004).
 28.
Felizzi, F. & Iber, D. Integrin clustering as a result of local membrane deformations and local signaling feedbacks. Physica A 408, 198–211 (2014).
 29.
Gurry, T., Kahramanogullari, O. & Endres, R. G. Biophysical mechanism for rasnanocluster formation and signaling in plasma membrane. Plos One 4, e6148 (2009).
 30.
Tobin, S. J. et al. Nanoscale effects of ethanol and naltrexone on protein organization in the plasma membrane studied by photoactivated localization microscopy (palm). Plos One 9, e87225 (2014).
 31.
Albertazzi, L. et al. Probing exchange pathways in onedimensional aggregates with superresolution microscopy. Science 344, 491–495 (2014).
 32.
You, C., Richter, C. P., Lochte, S., Wilmes, S. & Piehler, J. Dynamic submicroscopic signaling zones revealed by pair correlation tracking and localization microscopy. Anal Chem 86, 8593–8602 (2014).
 33.
Pertsinidis, A. et al. Ultrahighresolution imaging reveals formation of neuronal SNARE/Munc18 complexes in situ. Proc Natl Acad Sci USA 110, E2812–E2820 (2013).
 34.
Liu, Z. et al. 3d imaging of sox2 enhancer clusters in embryonic stem cells. Elife 3, e04236 (2014).
 35.
Hyman, A. A., Weber, C. A. & Jülicher, F. Liquidliquid phase separation in biology. Annu Rev Cell Dev Biol. 30, 39–58 (2014).
 36.
Veatch, S. L. et al. Critical fluctuations in plasma membrane vesicles. ACS Chem Biol. 3, 287–293 (2008).
 37.
Machta, B. B., Papanikolaou, S., Sethna, J. P. & Veatch, S. L. Minimal model of plasma membrane heterogeneity requires coupling cortical actin to criticality. Biophys J. 100, 1668–1677 (2011).
 38.
Shivanandan, A., Radenovic, A. & Sbalzarini, I. F. MosaicIA: an ImageJ/Fiji plugin for spatial pattern and interaction analysis. BMC Bioinformatics 14, 349 (2013).
 39.
Gould, S. J. D’Arcy Thompson and the science of form. New Lit Hist. 2, 229–258 (1971).
 40.
Sbalzarini, I. F., Mezzacasa, A., Helenius, A. & Koumoutsakos, P. Effects of organelle shape on fluorescence recovery after photobleaching. Biophys J. 89, 1482–1492 (2005).
 41.
Diggle, P. J. Statistical Analysis of Spatial Point Patterns (Arnold, London, 2003).
 42.
Illian, J., Penttinen, A., Stoyan, H. & Stoyan, D. Statistical analysis and modelling of spatial point patterns vol. 70 (John Wiley & Sons, 2008).
 43.
Sengupta, P., JovanovicTalisman, T. & LippincottSchwartz, J. Quantifying spatial organization in pointlocalization superresolution images using pair correlation analysis. Nat Protoc. 8, 345–354 (2013).
 44.
Møller, J. & Waagepetersen, R. Statistical Inference and Simulation for Spatial Point Processes (Chapman and Hall/CRC, Boca Raton, 2004).
 45.
Lee, S.H., Shin, J. Y., Lee, A. & Bustamante, C. Counting single photoactivatable fluorescent molecules by photoactivated localization microscopy (PALM). Proc Natl Acad Sci USA 109, 17436–17441 (2012).
 46.
Annibale, P., Vanni, S., Scarselli, M., Rothlisberger, U. & Radenovic, A. Quantitative photo activated localization microscopy: Unraveling the effects of photoblinking. Plos One 6, e22678 (2011).
 47.
Greenfield, D. et al. SelfOrganization of the Escherichia coli Chemotaxis Network Imaged with SuperResolution Light Microscopy. Plos Biol. 7, e1000137 (2009).
 48.
Baddeley, A. J. & Turner, R. Spatstat: An R Package for Analyzing Spatial Point Pattens J Stat Softw. 12, 1–42 (2005)
Acknowledgements
A.S. thanks Prof. Ivo Sbalzarini, Hendrik Deschout and Dileep Kalathil for early discussions and guidance. This work was financially supported by Swiss National Science Foundation grants No. 200021–125319 and No. 20021–132206. A.S. was funded by a PhD fellowship grant from NCCBI.
Author information
Author notes
 Arun Shivanandan
Present address: UMR 168, Institut Curie Research Center, Paris 75248, France.
 Jayakrishnan Unnikrishnan
Present address: GE Global Research, Niskayuna, New York 12309, United States.
Affiliations
Laboratory of Nanoscale Biology, Institute of Bioengineering, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne 1015, Switzerland
 Arun Shivanandan
 & Aleksandra Radenovic
Audiovisual Communications Laboratory, School of Computer and Communication Sciences, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne 1015, Switzerland
 Jayakrishnan Unnikrishnan
Authors
Search for Arun Shivanandan in:
Search for Jayakrishnan Unnikrishnan in:
Search for Aleksandra Radenovic in:
Contributions
A.S. conceived the project, performed the analysis, and wrote the manuscript, with inputs and guidance from J.U. and overall supervision from A.R. All authors read and approved the final manuscript.
Competing interests
The authors declare no competing financial interests.
Corresponding author
Correspondence to Arun Shivanandan.
Supplementary information
PDF files
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
About this article
Further reading

Can single molecule localization microscopy detect nanoclusters in T cells?
Current Opinion in Chemical Biology (2019)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.