Privately vertically mining of sequential patterns based on differential privacy with high efficiency and utility

Sequential pattern mining is one of the fundamental tools for many important data analysis tasks, such as web browsing behavior analysis. Based on frequent patterns, decision-makers can obtain both economic gains and social value. Sequential data, however, frequently contain sensitive information, and analyzing these data directly raises privacy concerns for users. Differential privacy (DP), the most popular privacy model, has been employed to address this concern. Most existing DP solutions combine horizontal sequential pattern mining algorithms with differential privacy. Because horizontal algorithms are inefficient, these DP solutions cannot ensure high efficiency and accuracy while offering a strong privacy guarantee. Therefore, we propose privVertical, a new private sequential pattern mining scheme that combines a vertical mining algorithm with differential privacy to achieve this objective. Unlike DP solutions based on horizontal algorithms, privVertical improves efficiency by avoiding costly database scans and costly projected database constructions. Moreover, to improve accuracy, a differentially private hash map list (called privHashMap) is designed to record frequent co-occurrence items and their noisy support based on the Sparse Vector Technique. PrivHashMap is used to pre-prune excessive infrequent candidate sequences in private mining, and the Sparse Vector Technique is used to improve the accuracy of privHashMap. After these invalid candidate sequences are pruned, less noise is required to achieve the same level of privacy, which increases the accuracy of private mining.
Theoretical privacy analysis proves that privVertical satisfies ε-differential privacy. Experiments show that privVertical achieves higher accuracy and efficiency at the same privacy level.

a differentially private mining scheme based on the PrefixSpan algorithm. They first constructed a model-based prefix tree to mine prefixes and the candidate set of substring patterns. Then they refined the frequency of the substring patterns in a successive phase to reduce the perturbation noise. Xu et al. 9,10 proposed a differentially private SPM scheme based on Apriori-based algorithms. They perturbed the pattern frequencies with random noise to satisfy differential privacy. To improve the accuracy of private mining, they proposed shrinking long sequences and filtering invalid candidate patterns based on a sampled database. These DP solutions cannot provide high efficiency and accuracy while providing a high level of privacy. The reasons are summarized as follows: (1) Low efficiency: horizontal algorithms involve too many database scans or prefix-projected database constructions. Private processing needs to be designed for each scan or construction, which reduces efficiency. (2) Low accuracy: lower accuracy is caused by two factors. First, the more candidates generated in private mining, the lower the accuracy, because the amount of noise must be proportional to the number of candidates. Second, refinement or sampling errors arise when calculating the frequency of patterns after perturbation, which further lowers the accuracy. A practical mining scheme should provide a high privacy level while ensuring efficiency and accuracy, and existing solutions cannot achieve these objectives at the same time.
Therefore, we attempt to design a differentially private sequence pattern mining scheme with both high efficiency and high utility, while providing a high level of privacy. As far as we know, there are two types of non-private SPM algorithms: horizontal mining algorithms [18][19][20][21] and vertical mining algorithms [22][23][24][25][26]. The former are characterized by costly database scans or projected database constructions. The latter scan the original database once to create its vertical format (called IDList) and then generate candidate patterns through the cross-join of IDLists. Vertical mining algorithms are more efficient than horizontal mining algorithms. In light of these advantages, we design a private mining scheme based on the vertical sequential pattern mining algorithm. To make the private mining satisfy differential privacy constraints, we design a random noise addition scheme combined with the vertical mining process. To further improve the accuracy, we design a differentially private hash map list (called privHashMap) that records frequent co-occurrence items based on the Sparse Vector Technique 2. PrivHashMap is used to pre-prune excessive invalid candidate sequences in private mining. After these invalid candidate sequences are pruned, less noise is needed to maintain the same level of privacy, which improves the accuracy.

Differentially private SPM
Sequential pattern mining provides knowledge, but it also carries the risk of privacy disclosure. Several differentially private sequence pattern mining schemes (DP-SPM) have been proposed to address this privacy concern. Bonomi et al. 8 first proposed a differentially private mining scheme based on the PrefixSpan algorithm. They first constructed a model-based prefix tree to mine prefixes and a candidate set of substring patterns. Then they refined the frequency of the substring patterns in a successive phase to reduce the perturbation noise. Xu et al. 9,10 proposed a differentially private SPM scheme based on the Apriori-based algorithm. They first designed a sequence shrinkage technique to reduce the length of the sequences. Then they used the statistical information of a sampled dataset to prune invalid candidate patterns and improve the accuracy. In contrast to the above DP-SPM methods, the following works concentrate on differentially private SPM under different constraints. Cheng et al. 11 proposed a private mining scheme, DP-MFSM, for maximal frequent sequence mining. Li et al. 12 proposed a differentially private sequence pattern mining algorithm with time constraints. Le et al. 13 proposed a differentially private sequential pattern mining scheme that considers time intervals for electronic medical record systems. Assuming that data managers are untrusted, Le et al. 15 and Afrose et al. 16 employed local differential privacy to protect privacy in sequential pattern mining. Wang et al. 14,17 proposed several privacy-preserving schemes for critical or top-k pattern mining over data streams.

Non-private SPM
Non-private SPM algorithms are divided into two categories: mining algorithms based on the horizontal database format and mining algorithms based on the vertical database format.
(1) SPM algorithms based on the horizontal database format. Apriori-based algorithms (e.g. AprioriAll 18) are representative horizontal mining algorithms. In Apriori-based algorithms, candidate patterns are generated according to the downward closure property, and the original dataset is scanned several times to calculate the support of candidate patterns. They are inefficient due to multiple database scans and large candidate pattern sets. Some improved algorithms [19][20][21], such as PrefixSpan 19, have been proposed to improve efficiency. PrefixSpan uses prefix projection to reduce the effort of candidate subsequence generation and also belongs to the horizontal algorithms. However, efficiency remains its bottleneck because of the multiple constructions of prefix-projected databases.
(2) SPM algorithms based on the vertical database format. Vertical mining algorithms were proposed to solve the low mining efficiency caused by multiple data scans. Spade 22 is a representative vertical mining algorithm. Vertical algorithms improve efficiency by eliminating repeated database scans. However, too many cross-join operations on vertical lists also lower mining efficiency. The works [23][24][25][26] proposed several improvement strategies to address this issue.
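The vertical format described above can be sketched in a few lines of Java. This is a minimal illustration, not the implementation of Spade or of any cited algorithm: the class name `IdListSketch`, the `{SID, TID}` array encoding, and the second toy sequence are assumptions made for the example; only the first sequence is taken from the paper's Fig. 1 example (items p1..p7 written as integers).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative vertical-format helpers; all names here are hypothetical. */
final class IdListSketch {

    /** Toy database: s1 is the first sequence of Fig. 1; s2 is invented for the example. */
    static List<List<Set<Integer>>> exampleDb() {
        List<List<Set<Integer>>> db = new ArrayList<>();
        db.add(List.of(Set.of(1, 2), Set.of(3), Set.of(6, 7), Set.of(7), Set.of(5)));
        db.add(List.of(Set.of(1), Set.of(2)));
        return db;
    }

    /** Scans the database once and returns the IDList of one item as {SID, TID} pairs (1-based). */
    static List<int[]> idListOf(int item, List<List<Set<Integer>>> db) {
        List<int[]> idList = new ArrayList<>();
        for (int s = 0; s < db.size(); s++) {
            List<Set<Integer>> seq = db.get(s);
            for (int t = 0; t < seq.size(); t++) {
                if (seq.get(t).contains(item)) {
                    idList.add(new int[] {s + 1, t + 1});
                }
            }
        }
        return idList;
    }

    /** s-extension join: keeps entries of b that occur strictly after some entry of a in the same sequence. */
    static List<int[]> sJoin(List<int[]> a, List<int[]> b) {
        List<int[]> out = new ArrayList<>();
        for (int[] eb : b) {
            for (int[] ea : a) {
                if (ea[0] == eb[0] && ea[1] < eb[1]) {
                    out.add(eb);
                    break; // one earlier occurrence of a is enough
                }
            }
        }
        return out;
    }

    /** i-extension join: keeps entries of b that share both SID and TID with an entry of a. */
    static List<int[]> iJoin(List<int[]> a, List<int[]> b) {
        List<int[]> out = new ArrayList<>();
        for (int[] eb : b) {
            for (int[] ea : a) {
                if (ea[0] == eb[0] && ea[1] == eb[1]) {
                    out.add(eb);
                    break;
                }
            }
        }
        return out;
    }

    /** Support of a pattern = number of distinct SIDs in its IDList. */
    static int support(List<int[]> idList) {
        Set<Integer> sids = new HashSet<>();
        for (int[] e : idList) {
            sids.add(e[0]);
        }
        return sids.size();
    }

    public static void main(String[] args) {
        List<List<Set<Integer>>> db = exampleDb();
        List<int[]> p1 = idListOf(1, db);
        List<int[]> p2 = idListOf(2, db);
        List<int[]> p7 = idListOf(7, db);
        System.out.println("support of <{p1},{p7}> = " + support(sJoin(p1, p7)));
        System.out.println("support of <{p1,p2}>   = " + support(iJoin(p1, p2)));
    }
}
```

After the single scan in `idListOf`, every candidate support is obtained from joins on IDLists alone, which is the property the vertical approach relies on.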

Preliminaries

Differential privacy
Definition 1 (Differential privacy) 2. A randomized algorithm M is said to be ε-differentially private if, for any neighboring sequence datasets S and S′ and any subset FS of the output domain of M, the following constraint holds:

$$\Pr[M(S) \in FS] \le e^{\varepsilon} \cdot \Pr[M(S') \in FS].$$

ε is called the privacy budget and controls the privacy level of M. A smaller ε represents stronger privacy protection. S and S′ are neighboring sequential datasets, which means that they differ in exactly one sequence, i.e., | |S| − |S′| | = 1.
Definition 2 (Sensitivity) 2. Let Δf denote the sensitivity of the query function f. It can be calculated as follows:

$$\Delta f = \max_{S, S'} \lVert f(S) - f(S') \rVert_1,$$

where Δf equals the maximum L1-norm distance between f(S) and f(S′) over all neighboring datasets S and S′. The subscript 1 denotes the L1 norm, that is, the sum of the magnitudes of the vector components.

Definition 3 (Laplace mechanism). Let Q denote a query function sequence with dimension n, and let ΔQ denote the sensitivity of Q. Let ξ = <ξ_1, ξ_2, ..., ξ_n> be a random noise vector with ξ_i = Lap(λ), λ = ΔQ/ε, and probability density function $p(x \mid \lambda) = \frac{1}{2\lambda} e^{-|x|/\lambda}$. Then M(S) = Q(S) + ξ provides ε-differential privacy.

Sequential composition. If each M_i provides ε_i-differential privacy, a sequence of M_i(S) provides (Σ_i ε_i)-differential privacy.
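The Laplace mechanism of Definition 3 can be illustrated with a short Java sketch. It assumes the standard inverse-CDF sampling of a zero-mean Laplace variable; the class and method names are hypothetical, and the code is a teaching aid rather than a hardened implementation (for instance, it ignores floating-point side channels).

```java
import java.util.Random;

/** Illustrative Laplace mechanism; names are hypothetical. */
final class LaplaceSketch {

    /** Inverse-CDF transform: maps u in (-0.5, 0.5) to a draw from Lap(scale). */
    static double fromUniform(double u, double scale) {
        return -scale * Math.signum(u) * Math.log(1.0 - 2.0 * Math.abs(u));
    }

    /** Draws Lap(sensitivity / epsilon) noise. */
    static double sample(Random rng, double sensitivity, double epsilon) {
        double u;
        do {
            u = rng.nextDouble() - 0.5; // nextDouble() is in [0, 1)
        } while (u == -0.5);            // avoid log(0)
        return fromUniform(u, sensitivity / epsilon);
    }

    /** Adds independent Laplace noise to every answer of a query vector. */
    static double[] perturb(double[] answers, double sensitivity, double epsilon, Random rng) {
        double[] noisy = new double[answers.length];
        for (int i = 0; i < answers.length; i++) {
            noisy[i] = answers[i] + sample(rng, sensitivity, epsilon);
        }
        return noisy;
    }

    public static void main(String[] args) {
        Random rng = new Random(7);
        double[] noisy = perturb(new double[] {120, 45, 9}, 1.0, 0.5, rng);
        for (double v : noisy) {
            System.out.println(v);
        }
    }
}
```

A smaller `epsilon` or a larger `sensitivity` widens the noise distribution, which is the trade-off the rest of the scheme tries to improve by reducing sensitivity.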

Sequence database
Let U = {p_1, p_2, ..., p_l} denote the universe of items, where p_i represents a single item. Let I_x = {p_i, p_j, ..., p_k} ⊆ U denote an unordered set of distinct items (an itemset). A sequence s = <I_1, I_2, ..., I_n> is an ordered arrangement of itemsets such that I_k ⊆ U (1 ≤ k ≤ n). As shown in Fig. 1, the sequence database S consists of a set of sequences {s_1, s_2, ..., s_n}, and the first sequence s_1 contains five itemsets <{p_1, p_2}, {p_3}, {p_6, p_7}, {p_7}, {p_5}>.

Frequent sequence pattern
If the support of a sequence pattern pattern_i exceeds a certain threshold, then pattern_i is a frequent sequence pattern. A frequent pattern with length k is called a frequent k-sequence pattern.

The basic idea of the algorithm
The sequence pattern mining algorithm must be randomized to guarantee data privacy. Our scheme is built on the vertical sequence pattern mining algorithm 24 to achieve high efficiency, and differential privacy is employed to perturb the vertical mining algorithm. To make the mining process satisfy ε-differential privacy, a straight solution is designed as follows:
(1) Perturb the IDLists of candidate patterns according to the Laplace mechanism.
(2) Based on the perturbed IDLists, calculate the noisy supports of candidate patterns by performing cross-join operations on the noisy IDLists.
(3) Select frequent patterns based on their noisy supports and the minimum support threshold.
In the above process, the noise required for perturbation should be proportional to the sensitivity of the mining process and inversely proportional to the privacy budget. Suppose the maximum cardinality of frequent sequences is m. The privacy budget ε is divided equally among the iterations of the mining process, so the budget allocated to each iteration is ε/m. Suppose Q = {q_1, q_2, ..., q_m} is the query function in the private mining, where q_i represents the sub-query on the candidate i-sequences in the ith iteration. The sensitivity of q_i equals the maximum number of candidate i-sequences |CS_i|. Therefore, if Lap(m · |CS_i|/ε) noise is added to the IDLists of candidate i-sequences, the ith iteration satisfies (ε/m)-differential privacy. After m iterations, it follows from sequential composition that the private mining process satisfies ε-differential privacy. However, the sensitivity of the straight solution is too high, which results in a large amount of noise being required for privacy protection.
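As a worked illustration of why the straight solution is noisy, the Laplace scale for iteration i follows from Definition 3 (scale = sensitivity / budget) with budget ε/m and sensitivity |CS_i|. The numbers below are hypothetical and chosen only for easy arithmetic.

```latex
\[
\lambda_i \;=\; \frac{|CS_i|}{\varepsilon / m} \;=\; \frac{m\,|CS_i|}{\varepsilon},
\qquad
\varepsilon = 1,\quad m = 5,\quad |CS_i| = 100
\;\Longrightarrow\;
\lambda_i = \frac{5 \cdot 100}{1} = 500 .
\]
```

Under these assumed values, halving the number of candidates halves the scale to 250, which is the effect that candidate pruning aims for.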
PrivVertical, an improved algorithm, is proposed to reduce the high sensitivity. PrivVertical consists of three components: (1) construction of privHashMap based on the sparse vector technique; (2) perturbation of IDLists; (3) pruning of infrequent candidates based on privHashMap. To satisfy ε-differential privacy, the privacy budget is allocated as follows: the budget for the construction of privHashMap is ε_1 = α · ε, and the budget for the remaining two components is ε_2 = (1 − α) · ε.

(1) Construction of privHashMap based on the sparse vector technique
Too many infrequent candidates are generated in private mining. These patterns do not affect the accuracy of non-private mining, but they lower the accuracy of private mining. Infrequent candidates should therefore be pruned as early as possible, and privHashMap is proposed for this purpose. In the vertical mining process, when candidate i-patterns are generated from frequent (i − 1)-patterns, the generation can be of two types: i-extension and s-extension. Suppose s = <I_1, I_2, ..., I_n> is a sequence with I_k ⊆ U. If p_j, p_k ∈ I_x for an integer x such that 1 ≤ x ≤ n and p_k ≻_lex p_j, the item p_k is said to succeed p_j by i-extension. Otherwise, if p_j ∈ I_x and p_k ∈ I_y for some integers x and y such that 1 ≤ x < y ≤ n, the item p_k is said to succeed p_j by s-extension. For the sequence database in Fig. 1, all single items and their extension items are shown in Fig. 3. For example, {p_2} is an i-extension item of p_1, and {p_2, p_3, p_5, p_6} are s-extension items of p_1. PrivHashMap records, for each single item, its extension items and their co-occurrence frequencies. They are used to pre-prune infrequent candidates in the iterative mining process. After infrequent candidates are pruned, the sensitivity is reduced, less noise is required, and thus accuracy is improved. In our implementation, recording the co-occurrence frequency information in an n × n matrix would waste a large number of empty entries, and these empty entries would seriously affect the efficiency of candidate pattern filtering. We therefore implemented the co-occurrence frequency table as a hash table of HashSets. Each HashSet corresponds to an item p_k and its frequent co-occurrence (extension) items. Figure 3 shows an example of privHashMap.

PrivHashMap is constructed from the original sequence data. To avoid possible privacy leakage, its construction should be perturbed by Laplace noise. In this process, the sparse vector technique (SVT) 2 is employed to ensure accuracy while providing a high privacy level. The private construction is implemented in two steps. First, perturb the support threshold σ with Laplace noise Lap(2/ε_1) to obtain a noisy threshold σ̂. Second, perturb the frequency C(r_2) of each candidate 2-sequence r_2 with Laplace noise Lap(4 · Δ/ε_1) and compare C(r_2) + Lap(4 · Δ/ε_1) with the noisy threshold σ̂. Output 'above the threshold' if C(r_2) + Lap(4 · Δ/ε_1) ≥ σ̂; otherwise output 'below the threshold'. Here Δ represents the sensitivity of the private construction, and its value equals the number of frequent 2-sequences |FS_2|. In this way, privHashMap is constructed with both high accuracy and a high level of privacy.
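The two-step private construction can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's Algorithm 1: it handles a single extension map instead of separate i-extension and s-extension maps, takes the sensitivity as a parameter `delta`, and receives the Laplace sampler as an injected function `lap` (scale in, noise out) so that the logic can be checked deterministically. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.function.DoubleUnaryOperator;

/** Illustrative SVT-style construction of a co-occurrence hash map; names are hypothetical. */
final class PrivHashMapSketch {

    /**
     * counts: item -> (extension item -> true co-occurrence count).
     * Returns item -> set of extension items whose noisy count reaches the noisy threshold.
     */
    static Map<Integer, Set<Integer>> build(Map<Integer, Map<Integer, Integer>> counts,
                                            double sigma, double eps1, double delta,
                                            DoubleUnaryOperator lap) {
        // Step 1: perturb the support threshold once with Lap(2 / eps1).
        double noisyThreshold = sigma + lap.applyAsDouble(2.0 / eps1);
        Map<Integer, Set<Integer>> kept = new HashMap<>();
        for (Map.Entry<Integer, Map<Integer, Integer>> item : counts.entrySet()) {
            for (Map.Entry<Integer, Integer> ext : item.getValue().entrySet()) {
                // Step 2: perturb each count with Lap(4 * delta / eps1) and compare.
                double noisyCount = ext.getValue() + lap.applyAsDouble(4.0 * delta / eps1);
                if (noisyCount >= noisyThreshold) {
                    kept.computeIfAbsent(item.getKey(), k -> new HashSet<>()).add(ext.getKey());
                }
            }
        }
        return kept;
    }

    /** Made-up counts used only to exercise the sketch. */
    static Map<Integer, Map<Integer, Integer>> exampleCounts() {
        Map<Integer, Map<Integer, Integer>> counts = new HashMap<>();
        Map<Integer, Integer> ofOne = new HashMap<>();
        ofOne.put(2, 3);
        ofOne.put(5, 1);
        counts.put(1, ofOne);
        Map<Integer, Integer> ofThree = new HashMap<>();
        ofThree.put(7, 2);
        counts.put(3, ofThree);
        return counts;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // Inverse-CDF Laplace sampler; the u == -0.5 corner case is ignored in this sketch.
        DoubleUnaryOperator lap = scale -> {
            double u = rng.nextDouble() - 0.5;
            return -scale * Math.signum(u) * Math.log(1.0 - 2.0 * Math.abs(u));
        };
        System.out.println(build(exampleCounts(), 2.0, 1.0, 3.0, lap));
    }
}
```

Storing only the retained extension items in per-item hash sets mirrors the design choice described above: lookups stay constant-time and no space is spent on empty matrix cells.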
Proof Let A represent the query function of the frequency of co-occurrence items, and let A_i represent the query function of the ith co-occurrence item. For any neighboring sequence datasets S and S′, if we can prove that the probability of any output on S is at most e^{ε_1} times the probability of the same output on S′, then Algorithm 1 satisfies ε_1-differential privacy. Since S and S′ are neighboring sequence databases, A_i(S′) − 1 ≤ A_i(S) ≤ A_i(S′) + 1. Let r = {r_1, r_2, ..., r_l} represent an output vector in this process, with r_{≥σ} = {i : r_i ≥ σ} and r_{<σ} = {i : r_i < σ}. Let ρ represent the query function of the number of co-occurrence items,

(2) Perturbation of IDList
In vertical mining, candidate k-patterns (k > 1) are generated by performing the cross-join of the IDLists of frequent (k − 1)-patterns. To satisfy differential privacy, IDLists should be perturbed by random noise. As shown in Fig. 4, the IDList of a pattern r consists of a header node and several entries. The header node records the number of entries and the budget allocated to the IDList. Each entry records the identifier of a sequence in which r appears (SID) and the timestamp at which it appears (TID). The perturbation of the IDList follows two steps. First, perturb the true number of entries with the allocated budget. Second, perform a consistency adjustment of the IDList according to the noisy number. In step 1, for each candidate k-sequence, the noise required for the perturbation of its IDList is Lap(m · Δf_k/(n · ε_2)), where m represents the maximum cardinality of frequent patterns, Δf_k represents the sensitivity, and n is the size of the sequence database. In step 2, the consistency adjustment depends on the type of perturbation, which is either positive or negative. As Fig. 4 shows, dummy entries are inserted into the noisyIDList when the perturbation is positive. Each dummy entry is composed of a SID value and a TID value. To preserve mining accuracy, the SID value of a dummy entry should be different from all existing SID values. In this way, accuracy is not affected when candidate patterns are generated by performing the cross-join between IDLists. When the perturbation is negative, the consistency adjustment is implemented by deleting entries of the IDList according to the noisy count.
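The consistency adjustment can be sketched as below. The sketch makes several assumptions that the text does not fix: entries are `{SID, TID}` arrays, dummy entries use TID 1, "different from all existing SID values" is realized by a caller-supplied counter `firstUnusedSid`, and under negative perturbation the trailing entries are the ones deleted. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative IDList perturbation helpers; names and conventions are hypothetical. */
final class NoisyIdListSketch {

    /** Laplace scale m * deltaFk / (n * eps2) used for the entry count of a candidate k-sequence. */
    static double noiseScale(int m, double deltaFk, int n, double eps2) {
        return m * deltaFk / (n * eps2);
    }

    /**
     * Makes the IDList consistent with a noisy entry count.
     * Positive perturbation: append dummy entries with fresh SIDs.
     * Negative perturbation: keep only the first noisyCount entries.
     */
    static List<int[]> adjust(List<int[]> idList, int noisyCount, int firstUnusedSid) {
        int target = Math.max(0, noisyCount); // a count cannot be negative
        List<int[]> out = new ArrayList<>(idList.subList(0, Math.min(target, idList.size())));
        int sid = firstUnusedSid;
        while (out.size() < target) {
            out.add(new int[] {sid++, 1}); // dummy entry: fresh SID, placeholder TID
        }
        return out;
    }

    /** Made-up IDList with three entries. */
    static List<int[]> example() {
        List<int[]> idList = new ArrayList<>();
        idList.add(new int[] {1, 1});
        idList.add(new int[] {2, 3});
        idList.add(new int[] {4, 2});
        return idList;
    }

    public static void main(String[] args) {
        System.out.println("scale = " + noiseScale(5, 10.0, 1000, 0.5));
        System.out.println("positive: " + adjust(example(), 5, 100).size() + " entries");
        System.out.println("negative: " + adjust(example(), 1, 100).size() + " entries");
    }
}
```

Because dummy SIDs never match a real SID, a cross-join with another IDList cannot pair a dummy entry with a genuine one, which is the reason given above for choosing fresh SID values.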
(3) Candidates pruning based on privHashMap

Vertical mining is an iterative process. In the kth recursive process, candidate k-sequences are generated by performing the cross-join of the noisyIDLists of frequent (k − 1)-sequences. Meanwhile, infrequent candidate patterns are pruned based on privHashMap. The pruning rules are as follows. Let A_i and A_j represent any two frequent (k − 1)-sequence patterns with A_i = P ∪ x and A_j = P ∪ y, where P is the common prefix of A_i and A_j, x is the last item of A_i, y is the last item of A_j and the extension item of A_i, a is the last item of P, and r = A_i ∪ y. If y is an i-extension item or an s-extension item of a in privHashMap, r is retained in the candidate k-patterns set. Otherwise, r is filtered out.

For each candidate k-pattern r retained in the candidate set, perturb its IDList. The noise for perturbation is Lap(m · Δf_k/(n · ε_2)). Based on the noisyIDLists of candidate k-patterns and the support threshold, frequent k-patterns can be selected. The sensitivity of the queries in the kth recursive mining is analyzed as follows. The private vertical mining is a recursive process. Let T_k represent the candidate k-sequences generated in the kth recursive process; frequent k-patterns are selected from T_k. According to differential privacy, the magnitude of noise should be proportional to the sensitivity and inversely proportional to the budget. Suppose the query function in the private mining is Q = {q_1, q_2, ..., q_m}, where q_k represents the query on candidate k-sequences in the kth iteration. The sensitivity of q_k can be calculated as follows. Suppose the maximum cardinality of a sequence is l; then the maximum number of k-sequence patterns contained in this sequence is C_l^k. In the kth iteration of the straight solution, adding or deleting any one sequence therefore affects at most C_l^k candidate k-sequences, and the sensitivity of q_k is Δf_k = min(C_l^k, |T_k|). Let del_k represent the invalid candidate k-patterns pruned from T_k based on privHashMap. After the invalid patterns are pruned, the sensitivity of q_k becomes Δf_k = min(C_l^k, |T_k| − |del_k|), so the magnitude of the noise is reduced.
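The pruning rule can be sketched as a simple membership test. The sketch assumes privHashMap is held as two maps from an item to its retained i-extension and s-extension items, and it represents each candidate only by the pair (a, y) from the rule above; the example maps reuse the extension items of p_1 quoted in the text for Fig. 3. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative candidate pruning against a co-occurrence map; names are hypothetical. */
final class PruningSketch {

    /** Retain a candidate iff y is an i-extension or s-extension item of a. */
    static boolean keep(Map<Integer, Set<Integer>> iExt, Map<Integer, Set<Integer>> sExt,
                        int a, int y) {
        return iExt.getOrDefault(a, Set.of()).contains(y)
            || sExt.getOrDefault(a, Set.of()).contains(y);
    }

    /** Filters candidates given as {a, y} pairs and returns the retained ones. */
    static List<int[]> prune(List<int[]> candidates,
                             Map<Integer, Set<Integer>> iExt, Map<Integer, Set<Integer>> sExt) {
        List<int[]> retained = new ArrayList<>();
        for (int[] c : candidates) {
            if (keep(iExt, sExt, c[0], c[1])) {
                retained.add(c);
            }
        }
        return retained;
    }

    /** i-extension items of p1 as quoted in the text: {p2}. */
    static Map<Integer, Set<Integer>> exampleIExt() {
        Map<Integer, Set<Integer>> m = new HashMap<>();
        m.put(1, Set.of(2));
        return m;
    }

    /** s-extension items of p1 as quoted in the text: {p2, p3, p5, p6}. */
    static Map<Integer, Set<Integer>> exampleSExt() {
        Map<Integer, Set<Integer>> m = new HashMap<>();
        m.put(1, Set.of(2, 3, 5, 6));
        return m;
    }

    public static void main(String[] args) {
        List<int[]> candidates = new ArrayList<>();
        candidates.add(new int[] {1, 6});
        candidates.add(new int[] {1, 7});
        candidates.add(new int[] {4, 2});
        System.out.println("retained: " + prune(candidates, exampleIExt(), exampleSExt()).size());
    }
}
```

Every candidate removed here is one fewer IDList to perturb, so the set difference between the input and output of `prune` plays the role of del_k in the sensitivity discussion.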
Algorithm 1 describes the construction of privHashMap. First, perturb the true support threshold with Laplace noise Lap(2/ε_1) (line 1), and estimate the number of frequent co-occurrence items with budget ε_1/4 (line 2). Next, for each item p_i, find its s-extension items and i-extension items, calculate their co-occurrence frequencies, and record them in PrivHashMap_i and PrivHashMap_s respectively (lines 3-12). Traverse PrivHashMap (line 13), and for each r in PrivHashMap_i and PrivHashMap_s, perturb its true support with budget ε_1/4 (line 14). If the noisy support is no less than the noisy support threshold, retain r in PrivHashMap (lines 15-16). Otherwise, discard r. Continue to traverse PrivHashMap_i and PrivHashMap_s until all elements in PrivHashMap have been traversed (lines 17-20). Finally, return PrivHashMap (line 21).
Algorithm 2 describes the private vertical mining process. PrivVertical consists of three steps: the first is the construction of privHashMap, the second is the perturbation of IDLists, and the third is generating and pruning candidate patterns based on the noisy FS_1. To make Algorithm 2 satisfy ε-differential privacy, ε is divided into two parts, ε_1 and ε_2. ε_1 is used to construct privHashMap (Algorithm 2: line 2). The remaining budget ε_2 is used for private vertical mining. Allocate ε_2/m to each recursive process, construct the noisyIDLists of 1-sequences with ε_2/m, and find FS_1 (Algorithm 2: lines 3-4). Then recursively call the Enumerate & Pruning process to generate and prune candidate patterns and obtain frequent k-patterns (Algorithm 2: lines 5-6).
Algorithm 3 describes the process of Enumerate & Pruning:
(1) For any pattern A_i ∈ FS_k, add it to FS and output it (Algorithm 3: lines 1-2).
(2) For each pattern A_j ∈ FS_k, merge A_i and A_j as r (Algorithm 3: lines 3-4) and add it to T_k (Algorithm 3: lines 5-7).
(3) Prune invalid patterns in T_k based on privHashMap (Algorithm 3: lines 8-10).
(4) For each pattern r retained in the candidate set, construct its noisyIDList.
(5) Calculate the frequent k-sequence patterns FS_k based on the noisyIDLists (Algorithm 3: line 12).
(6) With FS_k as the input parameter, recursively call Algorithm 3 to mine frequent (k + 1)-sequence patterns (Algorithm 3: line 13).
Suppose the maximum length of frequent patterns is p_max, |FS_k| denotes the number of frequent k-sequence patterns, and C^2_{|FS_k|} is the number of extensions when generating candidate (k+1)-patterns based on FS_k. Let T_{k+1} denote the candidate (k+1)-patterns. Thus the complexity of Algorithm 3 is

Complexity analysis
Algorithm 2 has low complexity. To reduce the complexity, Algorithm 2 constructs privHashMap to filter out excessive candidate patterns. Once the number of candidates is reduced, the perturbation work is reduced as well, which keeps the complexity of Algorithm 2 low. A detailed running time evaluation is given in the "Efficiency evaluation" part of the experiments section.

Privacy analysis
Theorem 4 The private vertical mining satisfies ε 2 -differential privacy.
Proof Suppose the maximum cardinality of frequent patterns is m, so that the private vertical mining process consists of m recursive sub-processes. Let Q = {q_1, q_2, ..., q_m} represent the query function in this process, where q_k represents the query function of the kth recursive process. FS represents the set of frequent patterns, and FS_k represents the set of frequent k-sequence patterns. Δf_k represents the sensitivity of the kth recursive process. If we demonstrate that the following equation is true, we can conclude that the private vertical mining satisfies ε_2-differential privacy.
A detailed analysis is as follows: In the above proof, the first inequality follows from the triangle inequality, and the second inequality follows from the definition of sensitivity (Definition 2). By the same token, the following inequality holds: Therefore, the private vertical recursive mining algorithm satisfies ε_2-differential privacy.

Utility analysis
Theorem 2 For any β > 0, with probability at least 1 − β, the error between the noisy frequency and the true frequency of frequent k-patterns is bounded above by γ, where γ depends on the noise scale m · Δf_k/(n · ε_2), on the number of perturbed candidates, and on β.
Proof Suppose r is a candidate k-sequence and its true frequency is c(r). Since the noise added to the true frequency is Lap(m · Δf_k/(n · ε_2)), the probability that the error between the true frequency and the noisy frequency of r is no less than γ is given by the tail of the Laplace distribution. In the kth recursive mining process, since the number of perturbed candidate k-patterns is Δf_k, the union bound limits the probability that any candidate exceeds this error.
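One bound that is consistent with this proof sketch can be derived from the Laplace tail and a union bound. The derivation below is a reconstruction under the stated noise scale, not a quotation of the paper's expression for γ; the symbols b (the Laplace scale) and \tilde{c}(r) (the noisy frequency of r) are introduced here only for the derivation.

```latex
\begin{gather*}
b = \frac{m\,\Delta f_k}{n\,\varepsilon_2},
\qquad
\Pr\bigl[\,|\tilde{c}(r) - c(r)| \ge \gamma\,\bigr]
  = \Pr\bigl[\,|\mathrm{Lap}(b)| \ge \gamma\,\bigr]
  = e^{-\gamma / b}, \\[4pt]
\Pr\bigl[\,\exists\, r:\ |\tilde{c}(r) - c(r)| \ge \gamma\,\bigr]
  \le \Delta f_k \, e^{-\gamma / b}
  \quad\text{(union bound over } \Delta f_k \text{ candidates)}, \\[4pt]
\Delta f_k \, e^{-\gamma / b} \le \beta
  \iff
\gamma \ge \frac{m\,\Delta f_k}{n\,\varepsilon_2}\,
           \ln\frac{\Delta f_k}{\beta}.
\end{gather*}
```

Under this reading, the error bound grows roughly linearly in the sensitivity Δf_k, which is why pruning candidates tightens the utility guarantee.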

Experimental results
We conduct experiments to evaluate the utility and efficiency of PrivVertical. All algorithms are implemented in Java.

Metrics
We adopt the following metrics to measure utility: F-score and RE 9. F-score measures the utility of the private mining results by comparing the noisy frequent sequence patterns produced by the private mining scheme with the original frequent sequence patterns FS produced by the non-private mining scheme. RE (Relative Error) measures the error between the true support sup_x and the noisy support sup′_x of a pattern x. Running time is used to measure the efficiency of the algorithms.
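The two utility metrics can be sketched as follows, assuming the standard F-score (harmonic mean of precision and recall over the two pattern sets) and a per-pattern relative error |sup′_x − sup_x| / sup_x. How the per-pattern errors are aggregated into a single RE value is not stated here, so the sketch leaves that step out. Patterns are encoded as strings purely for illustration, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative utility metrics; definitions are the standard ones and are assumed here. */
final class MetricsSketch {

    /** F-score between the noisy pattern set and the true pattern set. */
    static double fScore(String[] noisy, String[] truth) {
        Set<String> noisySet = new HashSet<>(Arrays.asList(noisy));
        Set<String> truthSet = new HashSet<>(Arrays.asList(truth));
        if (noisySet.isEmpty() || truthSet.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(noisySet);
        common.retainAll(truthSet);
        double precision = (double) common.size() / noisySet.size();
        double recall = (double) common.size() / truthSet.size();
        if (precision + recall == 0.0) {
            return 0.0;
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    /** Relative error of one pattern's noisy support with respect to its true support. */
    static double relativeError(double trueSupport, double noisySupport) {
        return Math.abs(noisySupport - trueSupport) / trueSupport;
    }

    public static void main(String[] args) {
        String[] noisy = {"<a>", "<a,b>", "<b>"};
        String[] truth = {"<a,b>", "<b>", "<c>", "<b,c>"};
        System.out.println("F-score = " + fScore(noisy, truth));
        System.out.println("RE      = " + relativeError(100, 90));
    }
}
```

A higher F-score and a lower relative error both indicate that the private output stays closer to the non-private result.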

Datasets
The real datasets used in the experiments are MSNBC and Kosarak, which record the URL categories visited by users in time order and click-stream data, respectively. The datasets can be obtained from the SPMF website. Detailed information is given in Table 1, which reports for each dataset the number of records |S|, the number of distinct items, and the maximal and average record lengths Max|s| and Avg|s|.

Effect of ε on utility
Figure 5 shows how the parameter ε affects the accuracy of the three algorithms. In general, privVertical performs better at the same privacy level. The main reasons are as follows. PrivVertical prunes many invalid candidate patterns in private mining, so the sensitivity is reduced and the required noise is reduced as well. Although PrivApriori reduces the sensitivity by shrinking sequences and pruning invalid candidate patterns, it incurs a sampling error in private mining. Prefix uses a prefix tree to reorganize the sequence database and uses the projection technique to calculate the noisy support of a pattern. It incurs a reorganization error, which results in low accuracy. As ε increases, the privacy level decreases, the F-score increases, and RE decreases. The reason is that a higher ε means a lower privacy level, and a lower privacy level requires less noise, which improves accuracy. Compared with MSNBC, the utility on Kosarak is higher. The reason may be that the candidate patterns on Kosarak are sparse, the number of candidate patterns is relatively small, and the pruning strategy is more effective.

Effect of pruning strategy on utility
In this section, we evaluate the effect of the pruning strategy based on privHashMap on utility. The private vertical mining scheme without candidate pattern pruning is named StraightSolution. From Fig. 6a, b, we can see that the sensitivity reduction rate in private mining reaches 74-78% on MSNBC and 98% on Kosarak. From Fig. 6c, d, we can see that privVertical greatly improves the utility compared with StraightSolution. The main reason is that the sensitivity is reduced, so the required noise decreases.

Efficiency evaluation
Figure 7 shows how the pruning strategy based on privHashMap affects efficiency. We can see that PrivVertical performs better than StraightSolution. After invalid candidate patterns are pruned, the candidate pattern set is smaller, so the private processing of the candidates takes less time and the running time is reduced. As σ increases, the running time of both schemes decreases, because the candidate pattern space becomes smaller.

Conclusion
In this study, we analyzed why existing works cannot provide a private sequence pattern mining scheme with a high level of privacy while achieving both high utility and high efficiency. The first reason is the low efficiency of the horizontal mining style. The second reason is the low utility caused by too many candidate patterns being generated in private mining. To address these issues, privVertical, a private vertical sequential pattern mining scheme, is proposed for the first time. High efficiency is attained by reducing the number of database scans in the private mining process. This is implemented by perturbing a non-private vertical mining algorithm with differential privacy. The utility is enhanced because less noise is required for the same level of privacy. This is implemented by two strategies. The first is the design of privHashMap, a private co-occurrence hash map list based on the Sparse Vector Technique that is used to filter invalid candidate patterns in the subsequent private mining process, which reduces the magnitude of noise required for privacy protection. The second is a noise addition scheme for the vertical mining algorithm, which improves the mining efficiency while satisfying differential privacy constraints. A formal theoretical analysis gives an upper bound on the error of privVertical and proves its privacy guarantee. Experiments verified that privVertical has higher accuracy and efficiency than other state-of-the-art methods under the same privacy budget. This is because the Laplace noise required for perturbing the support of candidate patterns is greatly reduced. In the experiments, the candidate reduction rate on two real datasets reaches 74-98%, which greatly improves the utility of private mining. Our experimental results also demonstrated how the pruning strategy based on privHashMap affects efficiency. After invalid candidate patterns are pruned, the candidate pattern set is smaller. In this way, the private processing of the candidate patterns takes less time, and thus the running time is reduced. In future work, privVertical will be applied to other practical problems, such as product recommendation and biomedical data analysis. Future work may also include testing the proposal on other practical datasets or with other quality-related metrics.

Figure 1. Example of sequence pattern mining and privacy leakage.

Figure 2. Vertical format of sequence database (IDLists of single items).

Algorithm 2 describes the overall scheme. It consists of Algorithm 1 and Algorithm 3. Let N represent the dataset size (the number of sequences). Let max|s| represent the maximum cardinality of a sequence, so that C^2_{max|s|} is the number of combinations of different co-occurrence items. Thus the complexity of Algorithm 1 is O(N × C^2_{max|s|}).

Scientific Reports | (2023) 13:17866 | https://doi.org/10.1038/s41598-023-43030-z

Figure 6. Effect of pruning based on privHashMap on utility.

Figure 7. Effect of pruning based on privHashMap on efficiency.
Comparison

Our experiments include the following comparison algorithms: (1) Prefix 8: a representative privacy-preserving scheme based on PrefixSpan, which is implemented by perturbing the prefix tree. (2) PrivApriori: a representative private mining scheme based on an Apriori-based algorithm 9. These two methods are representative horizontal mining algorithms with differential privacy. They are compared with PrivVertical, which is a vertical mining algorithm with differential privacy.

Table 1. Detailed information of datasets.
Figure 5. Effect of ε on utility.