Deconvoluting kernel density estimation and regression for locally differentially private data

Local differential privacy has become the gold standard in the privacy literature for gathering or releasing sensitive individual data points in a privacy-preserving manner. However, locally differentially private data can distort the probability density of the data because of the additive noise used to ensure privacy. In fact, the density of privacy-preserving data (no matter how many samples we gather) is always flatter than the density function of the original data points due to convolution with the privacy-preserving noise density function. The effect is especially pronounced when using slow-decaying privacy-preserving noises, such as the Laplace noise. This can result in under- or over-estimation of heavy hitters. This is an important challenge facing social scientists due to the use of differential privacy in the 2020 Census in the United States. In this paper, we develop density estimation methods using smoothing kernels. We use the framework of deconvoluting kernel density estimators to remove the effect of privacy-preserving noise. This approach also allows us to adapt results from non-parametric regression with errors-in-variables to develop regression models based on locally differentially private data. We demonstrate the performance of the developed methods on financial and demographic datasets.

Scientific Reports | (2020) 10:21361 | https://doi.org/10.1038/s41598-020-78323-0

The randomized response method (i.e., a respondent determines whether to answer a sensitive question truthfully or with a forced yes/no based on flipping a biased coin) is differentially private, and the probability of heads for the coin determines the privacy budget in differential privacy 15. However, differential privacy is a more general and flexible methodology that can be used for categorical and non-categorical (i.e., continuous-domain) questions 4,16,17. This paper specifically considers the problem of analyzing privacy-preserving data on a continuous domain, which is outside the scope of the randomized response methodology. Locally differentially private data can significantly distort our estimates of the probability density of the data because of the additive noise used to ensure privacy. The density of the privacy-preserving data can become flatter than the density function of the original data points due to convolution of its density with the privacy-preserving noise density. The situation can be even more troubling when using slow-decaying privacy-preserving noises, such as the Laplace noise. This concern holds irrespective of how many samples are gathered. This can result in under- or over-estimation of heavy hitters, a common and worrying criticism of using differential privacy in the US Census 18.
Estimating probability distributions/densities under differential privacy is extremely important as it is often the first step toward gaining deeper insights into the data, such as regression analysis. However, most of the existing work on probability distribution estimation based on locally differentially private data focuses on categorical data [19][20][21][22][23]. For categorical data (in contrast with numerical data), the privacy-preserving noise is no longer additive; e.g., the so-called exponential mechanism 24 or other bespoke differential privacy mechanisms 25 are often employed, which are not on offer in the 2020 US Census. The density estimation results for categorical data are also related to de-noising results in randomized response methods 12. The work on continuous domains is often done by binning or quantizing the domain. However, finding the optimal number of bins or quantization resolution as a function of the privacy parameters, the data distribution, and the number of data points is a challenging task.
In this paper, we take a different approach to density estimation by using kernels, thus eliminating the need to quantize the domain. Kernel density estimation is a non-parametric way to estimate the probability density function of a random variable using its samples, proposed independently by Parzen 26 and Rosenblatt 27. The methodology was extended to multi-variate variables in 28,29. These estimators work on batches of data; however, they can also be made recursive [30][31][32]. When the data samples are noisy because of measurement noise or, as in the case of this paper, privacy-preserving noise, we need to eliminate the effect of the additive noise on kernel density estimation by deconvolution 33. Therefore, we use the framework of deconvoluting kernel density estimators [33][34][35][36] to remove the effect of the privacy-preserving noise, which is often in the form of Laplace noise 37. This approach also allows us to adapt results from non-parametric regression with errors-in-variables [38][39][40] to develop regression models based on locally differentially private data. This is the first time that deconvoluting kernel density estimators have been used to analyze differentially private data. This is an important challenge facing social science researchers and demographers following the changes administered in the 2020 Census in the United States 4.

Methods
Consider independently distributed data points $\{x[i]\}_{i=1}^{n} \subset \mathbb{R}^q$, for some fixed dimension $q \geq 1$, drawn from a common probability density function $\phi_x$. Each data point $x[i] \in \mathbb{R}^q$ belongs to an individual. Under no privacy restrictions, the data points can be provided to the central aggregator to construct an estimate of the density $\phi_x$, denoted by $\hat{\phi}_x$. We may use a kernel $K$, which is a bounded even probability density function, to generate the density estimate $\hat{\phi}_x$. A widely recognized example of a kernel is the Gaussian kernel 41

$$K(x) = \frac{1}{(2\pi)^{q/2}} \exp\left(-\frac{\|x\|_2^2}{2}\right). \quad (1)$$

In the big-data regime $n \gg 1$, the choice of the kernel is not crucial to the accuracy of kernel density estimators so long as it meets the conditions in 34. In this paper, we keep the kernel general. Using kernel $K$, we can construct the estimate

$$\hat{\phi}_x(x) = \frac{1}{nh^q} \sum_{i=1}^{n} K\left(\frac{x - x[i]}{h}\right), \quad (2)$$

where $h > 0$ is the bandwidth. The bandwidth is often selected such that $h \to 0$ as $n \to \infty$. The optimal rate of decay for the bandwidth has been established for families of distributions 33,34.
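As a concrete illustration, the batch kernel density estimate in (2) can be sketched in a few lines of code. The Gaussian kernel, the bandwidth value, and the test distribution below are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2*pi) for scalar data (q = 1)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x_eval, samples, h):
    """Kernel density estimate (1/(n*h)) * sum_i K((x - x[i]) / h)."""
    u = (np.atleast_1d(x_eval)[:, None] - samples[None, :]) / h
    return gaussian_kernel(u).mean(axis=1) / h

# Illustrative usage on standard-normal samples (h = 0.3 chosen by hand).
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=5000)
grid = np.linspace(-5.0, 5.0, 401)
density = kde(grid, samples, h=0.3)
```

The estimate integrates to one by construction and converges point-wise to the true density as the bandwidth shrinks with growing $n$, as discussed above.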

Remark 1
The problem formulation in this paper considers real-valued data as opposed to categorical data. This distinguishes the paper from the computer science literature on this topic, which primarily focuses on categorical data [19][20][21][22][23]. Real-valued data can arise in two situations. First, the posed question can be non-categorical, e.g., the credit rating for loans or the interest rates of loans. We consider this in one of our experimental results. However, aggregated categorical data can also be real-valued. For instance, the 2020 US Census reports the aggregate number of individuals from a race or ethnicity group within different counties. These numbers will be made differentially private as part of the US Census Bureau's privacy initiative 4. Therefore, the methods developed in this paper are still relevant to categorical data, albeit in aggregated forms.

As discussed in the introduction, due to privacy restrictions, the exact data points $\{x[i]\}_{i=1}^{n}$ might not be available to generate the density estimate in (2). The aggregator may only have access to noisy versions of these data points:

$$z[i] = x[i] + n[i], \quad (3)$$

where $n[i]$ is a privacy-preserving additive noise. To ensure differential privacy, Laplace additive noise is often used 37. For any probability density $\phi$, we use the notation $\mathrm{supp}(\phi)$ to denote its support set, i.e., $\mathrm{supp}(\phi) := \{\xi : \phi(\xi) > 0\}$.
Assumption 1 is without loss of generality as we are always dealing with bounded domains in social sciences with a priori known bounds on the data (e.g., the population of a region).

Definition 1 ensures that the statistics of the privacy-preserving output $x[i] + n[i]$, determined by its distribution, do not change "significantly" (the magnitude of the change is bounded by the privacy parameter $\epsilon$) if the data $x[i]$ of an individual changes. As $\epsilon \to 0$, the output becomes noisier and a higher privacy guarantee is achieved. Laplace additive noise is generally used to ensure differential privacy. This is formalized in the following theorem, which is borrowed from 37.
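Assuming scalar data on a known bounded interval $[\mathrm{lo}, \mathrm{hi}]$ (Assumption 1), the Laplace reporting policy can be sketched as follows. The function name and interface are ours, not the paper's:

```python
import numpy as np

def ldp_report(x, eps, lo, hi, rng):
    """Report x in [lo, hi] with epsilon-local differential privacy by adding
    zero-mean Laplace noise of scale (hi - lo) / eps, i.e., the sensitivity
    of releasing a bounded scalar divided by the privacy budget."""
    scale = (hi - lo) / eps
    return x + rng.laplace(0.0, scale, size=np.shape(x))

# Illustrative usage: privatize uniform data on [0, 1] with eps = 5.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=200_000)
z = ldp_report(x, eps=5.0, lo=0.0, hi=1.0, rng=rng)
```

The reports remain unbiased, but the noise variance $2((\mathrm{hi} - \mathrm{lo})/\epsilon)^2$ is added to the data variance, which is exactly the flattening effect that the deconvolution step below must undo.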
In what follows, we assume that the reporting policy in Theorem 1 is used to generate the locally differentially private data points. Since the noises $\{n[i]\}_{i=1}^{n}$ are independent and identically distributed, the reports $\{z[i]\}_{i=1}^{n}$ also follow a common probability density, denoted by $\phi_z$. Note that

$$\Phi_z(t) = \Phi_x(t)\,\Phi_n(t), \quad (4)$$

where $\Phi_z$, $\Phi_x$, and $\Phi_n$ are the characteristic functions of $\phi_z$, $\phi_x$, and $\phi_n$, respectively. Using (4), we can use any approximation of $\Phi_z$ to construct an approximation of $\Phi_x$ and thus estimate $\phi_x$. If we use kernel $K$ for estimating the density of $z[i]$, $\forall i$, we get

$$\hat{\phi}_z(x) = \frac{1}{nh^q} \sum_{i=1}^{n} K\left(\frac{x - z[i]}{h}\right).$$

Here, $\hat{\phi}_z$ is used to denote the approximation of $\phi_z$. The characteristic function of $\hat{\phi}_z$ is given by $\hat{\Phi}_z(t) = \Phi_K(ht)\,\frac{1}{n}\sum_{i=1}^{n} e^{\mathrm{i} t^\top z[i]}$, where $\Phi_K$ is the characteristic function of the kernel $K$. Therefore, the characteristic function of $\hat{\phi}_x$ is given by $\hat{\Phi}_x(t) = \hat{\Phi}_z(t)/\Phi_n(t)$. Inverting this characteristic function yields the deconvoluting kernel density estimate with the adjusted kernel $\mathcal{K}_h$:

$$\hat{\phi}_x(x) = \frac{1}{nh^q} \sum_{i=1}^{n} \mathcal{K}_h\left(\frac{x - z[i]}{h}\right), \qquad \mathcal{K}_h(u) = \frac{1}{(2\pi)^q} \int e^{-\mathrm{i} t^\top u}\,\frac{\Phi_K(t)}{\Phi_n(t/h)}\,\mathrm{d}t. \quad (5)$$
Under appropriate conditions on the kernel $K$ 34, the estimate in (5) is asymptotically unbiased in the same sense as (2); on average, we cancel the effect of the differential-privacy noise. Selecting the bandwidth (or smoothing parameter) $h$ is an important aspect of kernel estimation. In 26, it was shown that $\lim_{n\to\infty} h = 0$ guarantees asymptotic unbiasedness (i.e., point-wise convergence of the kernel density estimate to the true density function) while $\lim_{n\to\infty} nh = +\infty$ is required to ensure asymptotic consistency. Many studies have focused on finding the optimal bandwidth 29,42-44. Numerical methods based on cross-validation for setting the bandwidth are proposed in 45,46. Often, it is recommended to compare the results from different bandwidth selection algorithms to avoid misleading conclusions caused by over-smoothing or under-smoothing of the density estimate 47. These results have also been extended to noisy measurements with deconvoluting kernel density estimation 34,48. If $h$ scales appropriately with $n$ (at the rates established in 34,48, stated in the Bachmann-Landau $O$ notation), then

$$\mathbb{E}\{(\hat{\phi}_x(x) - \phi_x(x))^2\} \to 0 \text{ as } n \to \infty. \quad (6)$$

This means that the effect of the differential-privacy noise is effectively negligible on large datasets.
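For the univariate case with a Gaussian smoothing kernel and Laplace(b) privacy noise, the deconvoluting estimator can be sketched by working directly with characteristic functions on a frequency grid. This is our illustrative implementation in the style of Stefanski and Carroll, not the paper's code, and the grid limits are ad hoc choices:

```python
import numpy as np

def deconv_kde(x_grid, z, h, b):
    """Deconvoluting KDE for reports z[i] = x[i] + Laplace(b) noise.
    The empirical characteristic function of z is damped by the Gaussian
    kernel cf exp(-(h t)^2 / 2), divided by the Laplace noise cf
    1 / (1 + (b t)^2), and Fourier-inverted on x_grid."""
    t = np.linspace(-8.0 / h, 8.0 / h, 1001)   # kernel cf is ~0 beyond |h t| > 8
    ecf = np.exp(1j * t[None, :] * z[:, None]).mean(axis=0)
    cf_x = ecf * np.exp(-0.5 * (h * t) ** 2) * (1.0 + (b * t) ** 2)
    dt = t[1] - t[0]
    # Inversion formula: phi(x) = (1 / (2 pi)) * integral exp(-i t x) cf(t) dt.
    return (np.exp(-1j * np.outer(x_grid, t)) * cf_x).real.sum(axis=1) * dt / (2.0 * np.pi)

# Illustrative usage: recover a standard-normal density from Laplace-noised reports.
rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=2000)
z = x + rng.laplace(0.0, 0.2, size=2000)
grid = np.linspace(-4.0, 4.0, 161)
density = deconv_kde(grid, z, h=0.3, b=0.2)
```

Dividing by the Laplace characteristic function amplifies high-frequency sampling noise, which is why the recovered density shows the extra fluctuations discussed later; the kernel's damping factor keeps that amplification bounded.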
For regression analysis, we consider independently distributed data points $\{(x[i], y[i])\}_{i=1}^{n}$ drawn from a common probability density function. We would like to understand the relationship between the inputs $x[i]$ and the outputs $y[i]$ for all $i$. As before, we assume that we can only access the noisy privacy-preserving inputs $z[i]$ in (3). Following the argument above, we can also construct the Nadaraya-Watson kernel regression (see, e.g., 49)

$$\hat{g}(x) = \frac{\sum_{i=1}^{n} y[i]\,\mathcal{K}_h\left((x - z[i])/h\right)}{\sum_{i=1}^{n} \mathcal{K}_h\left((x - z[i])/h\right)}. \quad (7)$$
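For reference, the noiseless Nadaraya-Watson estimator can be sketched as below; in the privacy-preserving setting of (7), the kernel would be replaced by the adjusted (deconvoluting) kernel and the inputs by the reports $z[i]$. The quadratic test function and all parameter values are arbitrary illustrations of ours:

```python
import numpy as np

def nadaraya_watson(x_eval, x, y, h, kernel):
    """g_hat(x) = sum_i y[i] K((x - x[i]) / h) / sum_i K((x - x[i]) / h).
    Any positive kernel works; normalization constants cancel in the ratio."""
    w = kernel((np.atleast_1d(x_eval)[:, None] - x[None, :]) / h)
    return (w * y[None, :]).sum(axis=1) / w.sum(axis=1)

# Illustrative usage: recover g(x) = x^2 from noisy observations.
rng = np.random.default_rng(3)
x = rng.uniform(-2.0, 2.0, size=5000)
y = x**2 + rng.normal(0.0, 0.1, size=5000)
g_hat = nadaraya_watson(np.array([-1.0, 0.0, 1.0]), x, y, h=0.1,
                        kernel=lambda u: np.exp(-0.5 * u**2))
```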

Results
In this section, we demonstrate the performance of the developed methods on multiple datasets. We first use a synthetic dataset for illustration purposes and then utilize real financial and demographic datasets. Throughout this section, we use the following original kernel:

$$K(x) = \frac{1}{\pi(1 + x^2)},$$

where $x \in \mathbb{R}$ is a scalar as we are only considering a single input ($q = 1$). This is the Cauchy distribution. Substituting this kernel and the Laplace noise characteristic function into (5) gives the adjusted kernel. We use the cross-validation procedure in (8) to find the bandwidth in the following simulation and experiments.
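With the Cauchy kernel (characteristic function $e^{-|t|}$) and Laplace noise of scale $b$ (characteristic function $1/(1 + b^2 t^2)$), the adjusted kernel in (5) admits a closed form. The derivation below is ours and should be checked against the paper's equation; the code cross-checks the closed form against direct numerical Fourier inversion:

```python
import numpy as np

def cauchy_kernel(x):
    """Original kernel K(x) = 1 / (pi * (1 + x^2)); its cf is exp(-|t|)."""
    return 1.0 / (np.pi * (1.0 + x**2))

def adjusted_cauchy_kernel(x, b_over_h):
    """Inverse Fourier transform of exp(-|t|) * (1 + (b/h)^2 t^2), i.e. the
    deconvoluting kernel for Laplace(b) noise (our derivation):
    K(x) + (b/h)^2 * (2 - 6 x^2) / (pi * (1 + x^2)^3)."""
    r2 = b_over_h**2
    return cauchy_kernel(x) + r2 * (2.0 - 6.0 * x**2) / (np.pi * (1.0 + x**2)**3)

# Cross-check at one point against direct numerical inversion of the cf
# (b/h = 0.5, evaluated at x = 0.7).
t = np.linspace(-60.0, 60.0, 240001)
numeric = (np.exp(-np.abs(t)) * (1.0 + 0.25 * t**2) * np.cos(0.7 * t)).sum() \
          * (t[1] - t[0]) / (2.0 * np.pi)
closed_form = adjusted_cauchy_kernel(0.7, 0.5)
```

Note that the correction term sharpens the kernel around the origin (and makes it negative for large $|x|$), which is how the estimator counteracts the flattening induced by the Laplace noise.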
Synthetic dataset. We use a simulation study to illustrate the performance of the Nadaraya-Watson kernel regression in (7) for privacy-preserving data. We consider multiple scenarios. We use two distributions for $\{x[i]\}_{i=1}^{n}$. The first is a Gaussian mixture $(1/3)\mathcal{N}(-1, 1) + (2/3)\mathcal{N}(3/2, 1/2)$ truncated over $[-3, 3]$. The second is a chi-squared distribution with three degrees of freedom $\chi^2(3)$ truncated over $[0, 3]$. The truncation in both cases is to satisfy Assumption 1. We also consider two regression curves: $g_1 : x \mapsto x^2(1 - x^2)/5$ and $g_2 : x \mapsto 4.5\sin(x) - 5$. Finally, we assume zero-mean Gaussian measurement noise $\mathcal{N}(0, 1)$ on the outputs. Figure 1 shows the kernel regression model (dashed black) and the true regression curve (solid black) for the mixture-Gaussian data made differentially private with $\epsilon = 10$. Here, we consider two dataset sizes of $n = 1000$ and $n = 10{,}000$ and the two regression curves $g_1$ and $g_2$ introduced earlier. The Nadaraya-Watson kernel regression using differentially private data provides fairly accurate predictions. The accuracy of the predictions improves as the dataset gets larger. Figure 2 illustrates the kernel regression model (dashed black) and the true regression curve (solid black) for the chi-squared data.

Loan dataset. We use the loan dataset in 50. For the accepted loans, the dataset contains the interest rates of the loans per annum and loan attributes, such as total loan size, and borrower information, such as the number of credit lines, credit rating, state of residence, and age. Here, we only focus on data from 2010 (to avoid possible yearly fluctuations of the interest rate), which contains 12,537 accepted loans. We also focus on the relationship between the FICO (https://www.fico.com/en/products/fico-score) credit score (low range) and the interest rates of the loans. This is an interesting relationship pointing to the value of credit rating reports 51.
The FICO credit score is very sensitive (as it relates to the financial health of an individual) and possesses significant commercial value (as it is sold by a for-profit corporation). Thus, we assume that it is made available publicly in a privacy-preserving manner using (3). Note that the original data in 50 provides this data in an anonymized manner without privacy-preserving noise. Figure 3 illustrates estimates of the probability density function of the credit score $\phi_x(x)$ using the original noiseless data with the original kernel (solid gray), the $\epsilon$-locally differentially private data with the original kernel, $\hat{\phi}_x(x) = \frac{1}{nh}\sum_{i=1}^{n} K((x - z[i])/h)$ (dashed black), and the $\epsilon$-locally differentially private data with the adjusted kernel in (5) (solid black) for $\epsilon = 5.0$ and bandwidth $h = 0.1$. Note that $\hat{\phi}_x(x) = \frac{1}{nh}\sum_{i=1}^{n} K((x - z[i])/h)$ is a naive density estimate as it does not try to cancel the effect of the privacy-preserving noise. Clearly, using the original kernel on the noisy privacy-preserving data flattens the density estimate. This is because we are in fact observing a convolution of the original probability density with the probability density of the Laplace noise. Upon using the adjusted kernel $\mathcal{K}_h(x)$, the estimate of the probability density using the noisy privacy-preserving data matches the estimate of the probability density with the original data (with additional fluctuations due to the presence of noise). This provides a numerical validation of (6). Now, let us focus on the regression analysis. Figure 4 shows the kernel regression model (solid black) and the linear regression model (dashed black) based on the original data with bandwidth $h = 0.02$, superimposed on the original noiseless data (gray dots). The mean squared error for the kernel regression model is 4.42 and the mean squared error for the linear regression model is 4.61.
The kernel regression model is thus slightly (roughly 4%) superior to the linear regression model; however, the gap is narrow. Figure 5 illustrates the kernel regression model (solid black) and the linear regression model (dashed black) based on the $\epsilon$-locally differentially private data with $\epsilon = 5$ and bandwidth $h = 0.20$, superimposed on the original noiseless data (gray dots). The mean squared error for the kernel regression model is 5.70 and the mean squared error for the linear regression model is 7.11. In this case, the kernel regression model is considerably (roughly 20%) better. In Fig. 6, we observe the mean squared errors for the kernel regression model and the linear regression model based on the $\epsilon$-locally differentially private data versus the privacy budget $\epsilon$. Clearly, the kernel regression model is consistently superior to the linear regression model. As $\epsilon$ grows larger, the performances of the kernel regression model and the linear regression model based on the $\epsilon$-locally differentially private data converge to the performances of the corresponding models based on the original noiseless data. This intuitively makes sense as, by increasing the privacy budget, the magnitude of the privacy-preserving noise becomes smaller.

Adult dataset. The dataset is available for download from the UCI repository 52. The dataset contains attributes such as education, age, work type, gender, race, and a binary report of whether the individual earns more than $50,000 per year. We focus on the relationship between education (in years) and the individual's ability to earn more than $50,000 per year. The education attribute is assumed to be made public in a privacy-preserving form following (3). This information can be considered private as it can be used in conjunction with other information to de-anonymize the dataset. In this case, the kernel regression model is slightly (roughly 4%) better. In Fig.
9, we observe the logarithm of the likelihood for the kernel regression model and the logistic regression model based on the $\epsilon$-locally differentially private data versus the privacy budget $\epsilon$. The horizontal lines show the logarithm of the likelihood for the kernel regression model and the logistic regression model based on the original noiseless data. Again, the kernel regression model is consistently superior to the logistic regression model. However, the effect is not as pronounced as for the linear regression in the previous subsection. Finally, again, as $\epsilon$ grows larger, the performances of the kernel regression model and the logistic regression model based on the $\epsilon$-locally differentially private data converge to the performances of the corresponding models based on the original noiseless data.

Discussion
The density of privacy-preserving data is always flatter than the density function of the original data points due to convolution with the privacy-preserving noise density function. This is certainly a cause for concern given the addition of differential-privacy noise in the 2020 US Census. This unfortunate effect is present irrespective of how many samples we gather because we observe the convolution of the original probability density with the probability density of the privacy-preserving noise. This can result in mis-estimation of the heavy hitters, which often play an important role in the social sciences due to their ties to minority groups. We developed density estimation methods using smoothing kernels and used the framework of deconvoluting kernel density estimators to remove the effect of the privacy-preserving noise. This can result in superior performance both for estimating probability density functions and for kernel regression in comparison to popular regression techniques, such as linear and logistic regression models. In the case of estimating the probability density function, we could entirely remove the flattening effect of the privacy-preserving noise at the cost of additional fluctuations. The fluctuations, however, can be reduced by gathering more data.