Differentially private density estimation with skew-normal mixtures model

The protection of private data is a pressing research issue in the era of big data. Differential privacy provides a strong privacy guarantee for data analysis. In this paper, we propose DP-MSNM, a parametric density estimation algorithm that brings the multivariate skew-normal mixture (MSNM) model under differential privacy. The MSNM model can handle asymmetric data sets and can approximate a wide range of distributions via the expectation–maximization (EM) algorithm. In this model, we add two extra steps to the estimated parameters in the M-step of each iteration. The first step adds calibrated noise to the estimated parameters based on the Laplace mechanism. The second step post-processes the noisy parameters to restore their intrinsic characteristics, using vector normalization and projection onto positive semi-definite matrices. Extensive experiments on real data sets evaluate the performance of DP-MSNM and demonstrate that the proposed method outperforms DPGMM.

With the rapid development of big data technology, data privacy protection has become an important issue. Differential privacy is an attack-resistant model with a rigorous theoretical foundation and a good quantitative formulation. Differentially private density estimation has been studied for mixture models. Most of this work considers the Gaussian mixture model (GMM) 1,2 , but real data can exhibit skewness or heavy-tailed behavior. Recently, it has been observed that the vulnerability of pre-trained models to attack is not uniform when the training data itself is skewed: Truex et al. 3 show that the risk from some attacks routinely increases when models are trained on skewed data. This risk may stem from the reduced accuracy of density estimates that assume normality of the data.
To the best of our knowledge, no existing work has considered the skew-normal mixture model under differential privacy. In a skew-normal mixture model, each observation is assumed to be drawn by random sampling from one component of the mixture, and the learning task is to estimate the proportion, location, scale, and skewness parameters of every component distribution in the model. In particular, this setting raises several issues: (1) the mixture is composed of skew-normal distributions; (2) it is unknown which component each sample belongs to; (3) only perturbed data are available; and (4) estimates of all parameters are required.
To address these problems, we propose DP-MSNM, which appends two extra steps (a noise adding step and a post-processing step) to the usual MSNM density estimation procedure in each iteration. The noise adding step adds classical Laplace noise to the originally estimated parameters in each iteration to achieve differential privacy. The post-processing step applies an eigendecomposition to the scale parameter to restore positive semi-definiteness and normalizes the weight parameters so that they sum to 1.
The remainder of the paper is organized as follows. Section "Preliminaries" introduces preliminaries. Section "Differentially private MSNM" gives the details of DP-MSNM. Section "Experimental setup" evaluates the experimental performance. Finally, Section "Conclusions" concludes the work and discusses future directions.

www.nature.com/scientificreports/

We call ε the privacy budget; a smaller ε means a higher level of privacy. Privacy is achieved by adding calibrated random noise to the output. We use the Laplace mechanism with L1-sensitivity for output perturbation.

Definition 2 (L1-sensitivity) The L1-sensitivity S(M) of a query function M is the maximum change in the L1-norm of the query output on neighboring data sets D and D′. That is,

S(M) = \max_{d(D, D') = 1} \| M(D) - M(D') \|_1,

where we use the Hamming distance to measure the distance between the two data sets D and D′, expressed as d(D, D′) = |{i : x_i ≠ y_i, x_i ∈ D, y_i ∈ D′}|. The noisy vector or matrix entries follow the Laplace distribution Lap(0, S(M)/ε), where S(M)/ε is the scale parameter. Given the L1-sensitivity S(M) and the privacy budget ε, this mechanism guarantees ε-differential privacy.
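As a sketch of the mechanism described above, the helper below adds Laplace noise scaled by sensitivity/ε to a query output; `laplace_mechanism` is an illustrative name, assuming NumPy:

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Perturb a scalar or array query output with Laplace(0, sensitivity/epsilon) noise."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale, size=np.shape(value))
```

A smaller ε gives a larger scale and hence more noise, matching the intuition that a smaller budget yields stronger privacy.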
Density estimation with MSNM. In the multivariate skew-normal mixture model (MSNM), each sample x_i is assumed i.i.d. with a density f_MSNM(x_i) written as a linear combination of skew-normal densities 5 in the form

f_{MSNM}(x_i) = \sum_{k=1}^{G} \pi_k f_{SN}(x_i; \xi_k, \Omega_k, \alpha_k), (1)

where x = (x_1, ..., x_n), π_k ≥ 0 is the mixture proportion satisfying \sum_{k=1}^{G} \pi_k = 1, and ξ_k, Ω_k, α_k are the parameters of the k-th component of the mixture model: α_k is the p-dimensional skewness parameter, Ω_k is a p × p positive semi-definite (PSD) scale matrix, and ξ_k is the p-dimensional location parameter. With φ_p(x; ξ, Ω) the density of a normal distribution N(ξ, Ω) and Φ(·) the cumulative distribution function of the standard normal, the function f_SN(x; ξ, Ω, α) can be written as

f_{SN}(x; \xi, \Omega, \alpha) = 2\, \phi_p(x; \xi, \Omega)\, \Phi(\alpha^T \sigma^{-1} (x - \xi)), (2)

where σ = (Ω ⊙ I_p)^{1/2} and ⊙ denotes the Hadamard product. If a random variable X has the density function of Eq. (2), we say X follows a multivariate skew-normal distribution and write X ∼ SN_p(ξ, Ω, α), as in Contreras-Reyes and Cortés, 2016 6 .
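For concreteness, the densities in Eqs. (1)–(2) can be evaluated directly; the sketch below assumes NumPy/SciPy, and the function names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def sn_pdf(x, xi, Omega, alpha):
    """SN_p(xi, Omega, alpha) density: 2 * phi_p(x; xi, Omega) * Phi(alpha' sigma^{-1} (x - xi))."""
    sigma_inv = 1.0 / np.sqrt(np.diag(Omega))   # sigma = (Omega ⊙ I_p)^{1/2} is diagonal
    z = (x - xi) * sigma_inv                    # componentwise standardisation
    return 2.0 * multivariate_normal.pdf(x, mean=xi, cov=Omega) * norm.cdf(alpha @ z)

def msnm_pdf(x, pis, xis, Omegas, alphas):
    """Mixture density: sum_k pi_k * f_SN(x; xi_k, Omega_k, alpha_k)."""
    return sum(p * sn_pdf(x, xi, Om, al)
               for p, xi, Om, al in zip(pis, xis, Omegas, alphas))
```

When α = 0, Φ(0) = 1/2 and the skew-normal density collapses to the ordinary multivariate normal, which is a convenient sanity check.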
Recalling the variables X_0 ∼ SN_p(0, \bar{Ω}, α) and X ∼ SN_p(ξ, Ω, α), we can obtain an additional representation of the multivariate skew-normal distribution from the results of Azzalini et al. 5 as follows. Let δ = (δ_1, δ_2, ..., δ_p)^T be a vector with elements in (−1, 1), and let U_0 and U_1 be normal variables of dimension p and 1 whose joint distribution is

(U_0^T, U_1)^T \sim N_{p+1}(0, \Omega^*), \quad \Omega^* = \begin{pmatrix} \bar{\Omega} & \delta \\ \delta^T & 1 \end{pmatrix},

where Ω* is a full-rank correlation matrix; then X_0 can be represented as (U_0 | U_1 > 0). By the transformation X = ξ + σ X_0, we have X = (ξ + σ U_0 | U_1 > 0). Let |U_1| = τ. Given τ, we obtain the conditional representation

X_0 \mid \tau \sim N_p(\delta \tau, \bar{\Omega} - \delta \delta^T).

For a p-dimensional SN distribution, the parameters have the following relationships by simple algebraic work (see 5):

\delta = \bar{\Omega}\alpha / \sqrt{1 + \alpha^T \bar{\Omega} \alpha}, \quad \alpha = \bar{\Omega}^{-1}\delta / \sqrt{1 - \delta^T \bar{\Omega}^{-1} \delta}. (5)

Furthermore, we have Ω = σ \bar{Ω} σ. Thus, letting η = σδ, the relation between Ω and \bar{Ω} gives

\Sigma = \sigma(\bar{\Omega} - \delta\delta^T)\sigma = \Omega - \eta\eta^T.

Therefore,

X \mid \tau \sim N_p(\xi + \eta\tau, \Sigma).

Let Z_ik be the membership indicator variable such that it equals 1 when x_i is from the k-th component of the MSNM model, and 0 otherwise. Consider the complete data (X_i, τ_i, Z_i), where Z_i = (Z_1i, ..., Z_Gi) follows a multinomial distribution with one trial and cell probabilities π_1, ..., π_G; let us write Z_i ∼ M(1; π_1, ..., π_G). Based on the component indicators, for each X_i, i = 1, ..., n, the hierarchical representation for MSNMs can be given following Cabral et al. 7:

X_i \mid \tau_i, Z_{ik} = 1 \sim N_p(\xi_k + \eta_k \tau_i, \Sigma_k),
\tau_i \mid Z_{ik} = 1 \sim TN_{[0,+\infty)}(0, 1),
Z_i \sim M(1; \pi_1, ..., \pi_G), (6)

where TN_{[0,+∞)}(0, 1) denotes the half-normal distribution. According to the hierarchical representation (6), ignoring additive constants, the complete-data log-likelihood function is

\ell_c(\Theta) = \sum_{i=1}^n \sum_{k=1}^G Z_{ik} \left[ \log \pi_k - \tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x_i - \xi_k - \eta_k\tau_i)^T \Sigma_k^{-1} (x_i - \xi_k - \eta_k\tau_i) - \tfrac{1}{2}\tau_i^2 \right].

According to the Bayesian theorem, the posterior probability that x_i belongs to the k-th component under the current parameters Θ^(t) is

r_{0ik} = \pi_k^{(t)} f_{SN}(x_i; \xi_k^{(t)}, \Omega_k^{(t)}, \alpha_k^{(t)}) \Big/ \sum_{j=1}^G \pi_j^{(t)} f_{SN}(x_i; \xi_j^{(t)}, \Omega_j^{(t)}, \alpha_j^{(t)}).

The EM algorithm then proceeds as follows.

E-step: compute the conditional expectations r_{0ik}, r_{1ik} = E[Z_ik τ_i | x_i, Θ^(t)], and r_{2ik} = E[Z_ik τ_i^2 | x_i, Θ^(t)]; the latter two are moments of a truncated normal distribution and are given by Cabral et al. 7. Substituting them into the complete-data log-likelihood, and dropping terms not depending on Θ, the Q function can be written as

Q(\Theta, \Theta^{(t)}) = \sum_{i,k} \left\{ r_{0ik}\left[\log \pi_k - \tfrac{1}{2}\log|\Sigma_k|\right] - \tfrac{1}{2}\,\mathrm{tr}\!\left[\Sigma_k^{-1}\left( r_{0ik}(x_i - \xi_k)(x_i - \xi_k)^T - r_{1ik}\big((x_i - \xi_k)\eta_k^T + \eta_k(x_i - \xi_k)^T\big) + r_{2ik}\,\eta_k\eta_k^T \right)\right] \right\}.

M-step: maximize Q(Θ, Θ^(t)) with respect to Θ under the restriction \sum_{k=1}^{G} \pi_k = 1; in particular, π_k^(t+1) = (1/n) \sum_{i=1}^n r_{0ik}.
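The hierarchical representation (6) immediately gives a sampler, which also serves as a numerical check on the algebra η = σδ and Σ = Ω − ηη^T; a sketch assuming NumPy (names are illustrative, and the skewness is passed as δ rather than α):

```python
import numpy as np

def sample_sn(n, xi, Omega, delta, rng=None):
    """Sample SN_p(xi, Omega, alpha) via the hierarchical form:
    tau ~ half-normal(0, 1), X | tau ~ N(xi + eta*tau, Sigma),
    with eta = sigma * delta and Sigma = Omega - eta eta^T."""
    rng = np.random.default_rng() if rng is None else rng
    p = len(xi)
    sigma = np.sqrt(np.diag(Omega))          # sigma = (Omega ⊙ I_p)^{1/2}
    eta = sigma * delta
    Sigma = Omega - np.outer(eta, eta)
    tau = np.abs(rng.standard_normal(n))     # half-normal latent variable
    eps = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return xi + np.outer(tau, eta) + eps
```

Since E[τ] = √(2/π) for a half-normal variable, the sample mean should be close to ξ + η√(2/π), which makes the skewness induced by δ easy to verify empirically.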

Differentially private MSNM
In this section, we present the DP-MSNM method, a differentially private density estimation algorithm with MSNM.
Main idea. Let G denote the number of components of the MSNM. The main idea of DP-MSNM is to add two extra steps after obtaining the original estimated parameters in the M-step of each iteration. The original estimated parameters comprise the mixture proportions (π), the location vectors (ξ_k), the scale matrices (Ω_k), and the skewness vectors (α_k) of each skew-normal component, where π = (π_1, ..., π_G) and k ∈ {1, ..., G}. First, we derive the noise scale of each parameter from its L1-sensitivity and allocated privacy budget, and add this noise to the original estimates, obtaining the noisy parameters π̄ and ξ̄_k, Ω̄_k, η̄_k. Second, since the added noise may break some intrinsic characteristics of the weight parameter and the component parameters, we post-process π̄ and ξ̄_k, Ω̄_k, η̄_k; this step outputs π̂ and ξ̂_k, Ω̂_k, η̂_k.

Noise adding step. In this part, we analyze the L1-sensitivities of π_k, ξ_k, Ω_k, η_k and add calibrated noise to the original estimated parameters. Suppose D = {x_1, ..., x_n} and D′ = {x_1, ..., x_{n−1}} are two neighboring data sets, where D′ contains the same first n − 1 records as D but lacks the n-th record. Let N_k = Σ_{i=1}^n r_{0ik}; we also assume the data set is well separated 8 and N_k ≥ n/(2G), k = 1, ..., G. We use a differentially private k-means algorithm 9 to obtain the cluster centers, and obtain η_k from the skewness parameter α_k through the Mardia measure method 10. We denote by S(π), S(ξ), S(Ω), S(η) the L1-sensitivities of π, ξ_k, Ω_k, η_k in each iteration. According to Eqs. (8)–(10), we give S(π), S(ξ), S(Ω), S(η) in Lemmas 1, 2, 3 and 4.
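The noise adding step can be sketched as below, assuming NumPy; the sensitivity values other than S(π) = G/n come from Lemmas 2–4 and are passed in as inputs here rather than derived, and the function name is illustrative:

```python
import numpy as np

def add_noise_to_params(params, sensitivities, budgets, rng=None):
    """Perturb each estimated parameter (vector or matrix) with Laplace noise
    whose scale is its L1-sensitivity divided by its allocated privacy budget."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = {}
    for name, value in params.items():
        scale = sensitivities[name] / budgets[name]
        noisy[name] = np.asarray(value) + rng.laplace(0.0, scale, size=np.shape(value))
    return noisy
```

For example, with G = 3 components and n = 100 records, the weight vector π would be perturbed with sensitivity S(π) = 3/100 under its allocated budget.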

Lemma 1
In each iteration, the L1-sensitivity of π is S(π) = G/n.
The proof of Lemma 1 is identical to that of Wu et al. 1 . To keep the proofs of the subsequent lemmas clear, we nevertheless write it out as follows.
Post-processing step. The noisy parameters π̄, ξ̄_k, Ω̄_k, η̄_k obtained in the noise adding step may not satisfy some of the basic properties of the original parameters. First, π̄ generally satisfies neither π̄_k ≥ 0 nor Σ_{k=1}^G π̄_k = 1. Second, Ω̄_k may fail to be a positive semi-definite matrix after the noise is added; if the noisy covariance matrix Ω̄_k is not PSD, the other parameters, for instance ξ̄_k and ᾱ_k, cannot be computed correctly. We solve these problems as follows. First, we normalize π̄ by

\hat{\pi}_k = (\bar{\pi}_k - \min\{\bar{\pi}\}) / (\max\{\bar{\pi}\} - \min\{\bar{\pi}\}),

and then make the weights sum to one via

\hat{\pi}_k = (\hat{\pi}_k + \zeta) \Big/ \left( \sum_{k=1}^{G} \hat{\pi}_k + G\zeta \right),

where ζ is a small constant that smooths π̂. Second, we can use the theory of Higham 11 to make the covariance matrix Ω̄_k PSD through post-processing. Like Wu et al. 1 , we use

\hat{\Omega}_k = \bar{\Omega}_k + \varrho(\bar{\Omega}_k) I, \quad \varrho(\bar{\Omega}_k) = \min\{r \ge 0 : \bar{\Omega}_k + rI \succeq 0\},

where ⪰ 0 means that all eigenvalues are non-negative and I is the identity matrix. We use Ω̂_k to obtain ξ̂_k, and then obtain η̂_k, Σ̂_k and α̂_k by Eqs. (5) and (7). The parameters ξ̂_k, Ω̂_k, α̂_k enter the next iteration.
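Both repairs can be sketched as follows (assuming NumPy; `project_weights` and `project_psd` are illustrative names). Since ϱ(Ω̄_k) = max(0, −λ_min), the PSD repair reduces to a single eigenvalue computation:

```python
import numpy as np

def project_weights(pi_noisy, zeta=1e-6):
    """Rescale noisy weights to [0, 1], then smooth-normalise them to sum to one."""
    pi = (pi_noisy - pi_noisy.min()) / (pi_noisy.max() - pi_noisy.min())
    return (pi + zeta) / (pi.sum() + len(pi) * zeta)

def project_psd(Sigma_noisy):
    """Add rho * I with rho = max(0, -lambda_min): the smallest diagonal shift
    that makes every eigenvalue non-negative (Higham-style post-processing)."""
    lam_min = np.linalg.eigvalsh(Sigma_noisy).min()
    rho = max(0.0, -lam_min)
    return Sigma_noisy + rho * np.eye(Sigma_noisy.shape[0])
```

Because these operations use only the noisy outputs (not the raw data), they are pure post-processing and spend no additional privacy budget.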
Although two extra steps are applied to the original parameters, the resulting parameters π̂, ξ̂, α̂, Ω̂ still satisfy ε-differential privacy, since differential privacy is immune to post-processing 12 .

DP-MSNM.
Based on the above discussion, we give the DP-MSNM algorithm in Algorithm 2, where lines 1-2 allocate the privacy budget and initialize the parameters, line 3 calculates the L1-sensitivities of all parameters, lines 5-6 execute the normal E-step and M-step, lines 7-12 execute the noise adding step, and lines 13-20 execute the post-processing step. Given a total privacy budget ε, Wu et al. 1 show that under the maximum iteration count T, Algorithm 2 satisfies ε-differential privacy. For k = 1, ..., G, we allocate the privacy budget ε_Ω = 0.6ε/(T·G) to Ω_k, and ε_ξ = 0.13ε/(T·G) and ε_η = 0.13ε/(T·G) to ξ_k and η_k respectively, because a higher ε_Ω reduces the amount of noise added to Ω_k.

Experimental setup. We evaluate DP-MSNM on two real data sets from the UCI Machine Learning Repository: the Australian Institute of Sport (AIS) data set and the 3DRoad Network data set. We use columns 2-3 of the AIS data set and columns 2-4 of the 3DRoad data set. For convenience, we assume the default values for both data sets are the same as those in Wu et al. 1 . In this article, we use the complete data sets, and the default values are recorded in Table 1.
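The per-parameter budget split used in Algorithm 2 (ε_Ω = 0.6ε/(T·G), ε_ξ = ε_η = 0.13ε/(T·G)) can be sketched as follows; the share left for π is an assumption, as it is not stated explicitly in this excerpt:

```python
def allocate_budget(epsilon, T, G):
    """Split the total budget epsilon across T iterations and G components,
    using the fractions from the text (0.6 for Omega, 0.13 each for xi and eta).
    The residual share for pi is assumed, not given in the excerpt."""
    unit = epsilon / (T * G)
    return {
        "Omega": 0.6 * unit,
        "xi": 0.13 * unit,
        "eta": 0.13 * unit,
        "pi": (1.0 - 0.6 - 0.13 - 0.13) * unit,  # assumed residual share
    }
```

Summed over all T iterations and G components, the per-step budgets add back up to the total ε, which is what the sequential composition argument of Wu et al. 1 requires.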
We use the log-likelihood mean, (1/n) Σ_{i=1}^n log f_MSNM(x_i; π, ξ, Ω, α), i.e., the average log-likelihood over the whole data set, to evaluate the performance of DP-MSNM. A larger log-likelihood mean indicates that the estimated parameters fit the data better. We compare the density estimation quality of SAGMM, DPGMM and DP-MSNM via the log-likelihood mean on the two data sets (see Figs. 1 and 2). In addition, we use the AIC and BIC criteria 6 to compare the fits of the DPGMM and DP-MSNM algorithms on the two data sets.
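The evaluation metric is straightforward to compute; a sketch assuming NumPy, with `density` standing in for the fitted f_MSNM:

```python
import numpy as np

def log_likelihood_mean(X, density):
    """Average log-likelihood (1/n) * sum_i log density(x_i) over the data set."""
    return float(np.mean([np.log(density(x)) for x in X]))
```

Because the metric is an average rather than a sum, it is comparable across data sets of different sizes.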

Results
Comparison with DPGMM and SAGMM. We implement SAGMM on the privacy-preserving platform GUPT, with the same default values as for DP-MSNM. The number of partitions and the spherical covariance in SAGMM are assumed known before the experiment; we set the number of partitions to 5, and for each experiment we try different variances and choose the best spherical covariance. From Figs. 1 and 2 we observe that the performance of all three methods improves as B increases, and that DP-MSNM outperforms SAGMM and DPGMM. Regarding the shape of the curves in Fig. 1, our results differ from those of Wu et al. 1 , mainly because we used a larger amount of data from the 3DRoad data set and, in order not to lose information, did not normalize each attribute of the data set.
We implement DPGMM in the R software under the same environment as DP-MSNM. From Figs. 1 and 2 we find that DP-MSNM outperforms DPGMM.

Effect of the number of skew-normal distributions (G).
For the AIS data set, as can be seen from Table 2, the AIC and BIC values of the DPGMM algorithm are minimized at k = 3, while those of the DP-MSNM algorithm are smallest at k = 2. For the 3DRoad data set, a similar pattern can be seen in Table 3. These results are consistent with Figs. 3 and 4: for the AIS data, the log-likelihood mean of the DPGMM algorithm at k = 3 is close to that of the DP-MSNM algorithm (Fig. 3), and a similar behavior is observed for the 3DRoad data (Fig. 4).
We therefore conclude that the number of components has a substantial effect on the results.

Conclusions
In this paper, we proposed DP-MSNM, an effective and accurate privacy-preserving mechanism for density estimation. We added noise to the original parameters and used a post-processing step to ensure that the weight and covariance parameters retain their intrinsic characteristics. In the future, we will extend the DP-MSNM model to high-dimensional data. We also plan to bring other skew-family distributions under differential privacy.