Quantum algorithm for MMNG-based DBSCAN

DBSCAN is a famous density-based clustering algorithm that can discover clusters of arbitrary shape while requiring only minimal domain knowledge to determine its input parameters. However, DBSCAN is not suitable for databases whose clusters have different local densities, and it is also a very time-consuming clustering algorithm. In this paper, we present a quantum mutual MinPts-nearest neighbor graph (MMNG)-based DBSCAN algorithm. The proposed algorithm performs better on databases with different local-density clusters. Furthermore, the proposed algorithm is dramatically faster than its classical counterpart.

The Eps-neighborhood, the fundamental definition of the algorithm, can be used to distinguish core points from noncore points. Eps is the distance threshold. Core points are the points inside a cluster, and noncore points are either points on the border of a cluster or points belonging to none of the clusters. Let p be a point in a database D, and let $|N_{Eps}(p)|$ denote the number of points within the Eps-neighborhood of p. Let MinPts be the threshold on the number of points: if $|N_{Eps}(p)| \ge$ MinPts, then p is a core point; otherwise, p is a noncore point.
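As a rough classical illustration of the core-point test above, the following Python sketch counts the points within Eps of each point and compares the count with MinPts. The helper names (`eps_neighborhood`, `is_core`), the Euclidean distance, and the toy coordinates are assumptions made for this example. The sketch also counts p as a member of its own neighborhood, which is a convention choice.

```python
import math

def eps_neighborhood(points, i, eps):
    """Indices j with Dist(p_i, p_j) <= eps (includes i itself, since Dist(p_i, p_i) = 0)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    """Core-point test: the Eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts

# Hypothetical toy data: three nearby points and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
core_flags = [is_core(points, i, eps=1.5, min_pts=3) for i in range(len(points))]
print(core_flags)
```

With these assumed values, the three nearby points are classified as core points and the isolated point as a noncore point.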
Definition 2 (directly density-reachable) A point p is directly density-reachable from a point q if 1. p ∈ $N_{Eps}(q)$ and 2. $|N_{Eps}(q)| \ge$ MinPts (that is, q is a core point).
Direct density reachability is not always symmetric. When p and q are both core points, it is symmetric; when one is a core point and the other is a border point, it is asymmetric.
Definition 3 (density-reachable) A point p is density-reachable from a point q if there is a chain of points $p_1, p_2, \ldots, p_N \in D$, $p_1 = q$, $p_N = p$, such that $p_{i+1}$ is directly density-reachable from $p_i$.
Definition 4 (density-connected) A point p is density-connected to a point q if there is a point o such that both p and q are density-reachable from o.
Definition 5 (cluster) Let D be a database of points. A cluster C is a nonempty subset of D satisfying the following conditions: 1. ∀p, q : if p ∈ C and q is density-reachable from p then q ∈ C. 2. ∀p, q ∈ C : p is density-connected to q.
In the database D , not all the points belong to clusters. The points that do not belong to any cluster are defined as "noise" in the DBSCAN algorithm.
Definition 6 (noise) Let $C_1, \ldots, C_k$ be the clusters of the database D, $i = 1, \ldots, k$. Then, we define the noise as the set of points in the database D not belonging to any cluster $C_i$, i.e., noise = $\{p \in D \mid \forall i : p \notin C_i\}$.
Grover's algorithm. Let us assume that we wish to find M (1 ≤ M ≤ N) solutions in an unstructured search space of N elements. Rather than examining the N elements one by one, Grover's algorithm checks the elements in parallel by assigning an index to each element and storing the indexes in a quantum register. Through a series of unitary operations that gradually amplify the success probability, Grover's algorithm can obtain the indexes of the target elements with high probability.
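The amplification step can be made concrete with a small classical simulation of the amplitudes. This is only a sketch of textbook Grover iterations (oracle phase flip followed by inversion about the mean), not an implementation from this paper; the function name, the search-space size, the marked indexes, and the iteration count $\lfloor (\pi/4)\sqrt{N/M} \rfloor$ are assumptions chosen for illustration.

```python
import math

def grover_success_probability(n_items, marked, iterations):
    """Simulate Grover iterations on a real amplitude vector and return
    the total probability of measuring one of the marked indexes."""
    amp = [1.0 / math.sqrt(n_items)] * n_items    # uniform superposition over indexes
    for _ in range(iterations):
        for j in marked:                           # oracle: phase flip on the targets
            amp[j] = -amp[j]
        mean = sum(amp) / n_items                  # diffusion: inversion about the mean
        amp = [2.0 * mean - a for a in amp]
    return sum(amp[j] ** 2 for j in marked)

# Hypothetical instance: N = 64 elements, M = 2 targets.
N, marked = 64, {3, 17}
T = math.floor(math.pi / 4 * math.sqrt(N / len(marked)))
p_success = grover_success_probability(N, marked, T)
print(T, round(p_success, 4))
```

In this assumed instance a handful of iterations is enough to push the probability of measuring a target close to 1, compared with up to N classical checks.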

The proposed algorithm
In this section, we design a quantum mutual MinPts-nearest neighbor graph algorithm and a quantum DBSCAN algorithm, and combine them into a quantum MMNG-based DBSCAN algorithm.
Quantum MMNG algorithm. Let D be a database, p and q be objects in D, and MinPts be a positive integer. The relevant concepts are introduced as follows.

Definition 7 (Mutual MinPts-nearest neighbor (MMN)):
If p is in the MinPts-nearest neighborhood of q and q is in the MinPts-nearest neighborhood of p , then we call p a mutual MinPts-nearest neighbor of q ; similarly, q is a mutual MinPts-nearest neighbor of p.

Definition 8 (Mutual MinPts-nearest neighbor graph (MMNG)):
The mutual MinPts-nearest neighbor graph can be constructed by connecting each point to its mutual MinPts-nearest neighbors.
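Definitions 7 and 8 can be illustrated with a short classical sketch. It is not the quantum algorithm 1; it only shows what the graph contains. The function names, the Euclidean distance, and the toy data are assumptions, and reading the connected components of the graph as the subsets of D is likewise an assumption made for this example.

```python
import math

def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])

def mutual_neighbor_graph(points, min_pts):
    """Edge (i, j) exists iff i and j lie in each other's MinPts-nearest neighborhood."""
    knn = [k_nearest(points, i, min_pts) for i in range(len(points))]
    return {i: {j for j in knn[i] if i in knn[j]} for i in range(len(points))}

def connected_components(graph):
    """Group vertices that are linked through mutual-neighbor edges."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(graph[v] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components

# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
graph = mutual_neighbor_graph(points, min_pts=2)
subsets = connected_components(graph)
print(subsets)
```

Because an edge requires both endpoints to rank each other among their nearest neighbors, points from groups of very different density are less likely to be linked than in a plain nearest-neighbor graph.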
Note that constructing the MMNG has high complexity. To speed it up, we use a quantum version of the MMNG algorithm, which builds on the quantum routine quant_find_smallest_values of Dürr et al. 27. In this paper, algorithm 1 is used to obtain the subsets of database D. After obtaining the subsets, we apply the quantum DBSCAN algorithm to each subset to obtain the final clusters and the noise set.
Quantum DBSCAN algorithm. We consider a database $D_N = \{p_1, \ldots, p_N\}$ composed of N points, where each point $p_i$ has k attributes. For each point $p_i$ in $D_N$, it is necessary to calculate Dist($p_i$, $p_j$) N − 1 times to determine the Eps-neighborhood of $p_i$, which is fairly time-consuming. To solve this problem, we screen the points in the Eps-neighborhood of $p_i$ with quantum search.
In our model, a quantum distance black box is proposed. The proposed black box accepts two types of inputs, as illustrated in Fig. 1: $|i\rangle$ is a single basis state holding the index of point $p_i$, and $|j\rangle$ is a superposition of inputs that includes the indexes of all the points. This is feasible because a qubit can be in a basis state or in a superposition of states. One query to this black box therefore asks for the distances between the point $p_i$ and all the points $p_j$ (when i = j, Dist($p_i$, $p_j$) = 0). After obtaining the distances, the black box compares them with the Eps distance. A selection function f(i, j) then assigns a value of 1 when Dist($p_i$, $p_j$) is smaller than or equal to the Eps distance and a value of 0 otherwise, as shown in Eq. (1).
$$f(i, j) = \begin{cases} 1, & \mathrm{Dist}(p_i, p_j) \le \mathrm{Eps} \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$
Meanwhile, a flipping operation is carried out in the black box. As depicted in Eq. (2), if f(i, j) = 1, the ancillary register $|r\rangle$ is flipped; if f(i, j) = 0, the ancillary register $|r\rangle$ remains unaltered. The symbol ⊕ denotes addition modulo 2, also known as exclusive-or.
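The action of the black box on individual basis states can be emulated classically. The sketch below only mirrors the selection function and the flip of the ancillary bit described above; it is not a quantum circuit, and the function names, the Euclidean distance, and the sample points are assumptions for illustration.

```python
import math

def selection(points, i, j, eps):
    """f(i, j): 1 if Dist(p_i, p_j) <= eps, else 0."""
    return 1 if math.dist(points[i], points[j]) <= eps else 0

def oracle(points, eps, i, j, r):
    """Basis-state action |i>|j>|r> -> |i>|j>|r XOR f(i, j)>."""
    return i, j, r ^ selection(points, i, j, eps)

# Hypothetical data: p_0 and p_1 are close, p_2 is far away.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 4.0)]
flipped = [oracle(points, 1.5, 0, j, 0)[2] for j in range(len(points))]
print(flipped)
```

Applying the same map twice restores the ancillary bit, which is what makes the operation reversible and therefore usable as a quantum oracle.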
Based on this black box, we design algorithm 2 (quant_find_Eps-neighborhood, as described below) as a subroutine of the quantum DBSCAN algorithm. Given a specific point $p_i$, algorithm 2 determines its Eps-neighborhood.
Once the Eps-neighborhoods can be determined quantum-mechanically, core points and noncore points can be distinguished according to the basic notions of classic DBSCAN. If a point is a noncore point, we keep looking for a core point, because there is no need to create a new cluster for a noncore point; if a point is a core point, we set up a new cluster and expand it. Using the expansion procedure of the original DBSCAN algorithm, the quantum DBSCAN algorithm quant_DBSCAN($D_N$, Eps, MinPts) is presented in algorithm 3.
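The control flow described here (skip noncore points, open a cluster at each unvisited core point, then expand it) can be sketched classically as follows. This is a generic DBSCAN expansion under stated assumptions, not the paper's algorithm 3: the names `dbscan` and `classical_region_query` are hypothetical, and the classical scan merely stands in for the place where quant_find_Eps-neighborhood would be called.

```python
import math

def classical_region_query(points, i, eps):
    """Classical stand-in for the neighborhood subroutine: scans every point."""
    return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts, region_query=classical_region_query):
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:       # noncore: keep looking for a core point
            labels[i] = NOISE
            continue
        cluster += 1                        # core point: open a new cluster and expand it
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:          # earlier "noise" turns out to be a border point
                labels[j] = cluster
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            more = region_query(points, j, eps)
            if len(more) >= min_pts:        # only core points propagate the cluster
                queue.extend(more)
    return labels

# Hypothetical toy data: two square groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9), (20, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Passing the neighborhood routine as a parameter keeps the expansion logic unchanged regardless of how the Eps-neighborhood is obtained.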
$$|i\rangle |j\rangle |r\rangle \rightarrow |i\rangle |j\rangle |r \oplus f(i, j)\rangle \qquad (2)$$
Figure 1. The oracle to compute the distance between $p_i$ and all the possible points $p_j$. $|r\rangle$ is an ancillary register.

The algorithm analysis
In this section, we briefly analyze the complexity of our algorithm first and then present the success probability of our algorithm.
The complexity. Dürr et al. 27 proved that the complexity of quant_find_smallest_values is $O(\sqrt{cn})$. It is easy to see that the complexity of algorithm 1 is $O(N\sqrt{\mathrm{MinPts} \cdot n})$.
For algorithm 2, there are $|N_{Eps}(p)| + 1$ targets for each point. According to the original version of Grover's algorithm, algorithm 2 needs to query the oracle approximately $\sqrt{N / (|N_{Eps}(p)| + 1)}$ times. The smaller $|N_{Eps}(p)|$ is, the more queries are needed. In the worst case, when $|N_{Eps}(p)| = 0$, approximately $\sqrt{N}$ oracle queries are needed. In other words, the complexity of algorithm 2 is $O(\sqrt{N})$. For algorithm 3, we need to calculate the Eps-neighborhood of every point, which means that algorithm 3 calls algorithm 2 N times. Thus, the complexity of algorithm 3 is at most $O(N\sqrt{N})$, even though the values of $|N_{Eps}(p)|$ differ from point to point.
Overall, the complexity of our proposed algorithm is approximately $O(N\sqrt{\mathrm{MinPts} \cdot n})$.
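To give a feel for these query counts, the following sketch evaluates the expressions for one made-up database size. The values N = 10,000 and 24 neighbors are purely hypothetical, the function name is an assumption, and constant factors are ignored, so the numbers indicate orders of magnitude only.

```python
import math

def grover_queries(n_points, n_neighbors):
    """Approximate oracle calls for one point: sqrt(N / (|N_Eps(p)| + 1))."""
    return math.sqrt(n_points / (n_neighbors + 1))

N = 10_000                               # hypothetical database size
worst = grover_queries(N, 0)             # no neighbors: about sqrt(N) queries
typical = grover_queries(N, 24)          # 24 neighbors plus the point itself
total_bound = N * worst                  # N calls to the subroutine, each at most sqrt(N)
classical = N * (N - 1)                  # N - 1 distance evaluations for each of N points
print(worst, typical, total_bound, classical)
```

Under these assumptions the worst-case bound is about one million oracle queries, against roughly one hundred million classical distance evaluations.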
The success probability. Dürr et al. 27 proved that quant_find_smallest_values is able to obtain the c nearest neighbors with high probability. It follows that algorithm 1 can obtain the subsets with high probability. Note that the values of $|N_{Eps}(p)|$ differ from point to point, which means that algorithm 2 faces a different number of targets for different points. As a result, the success probabilities differ when calculating different Eps-neighborhoods. Following prior work, the success probability of algorithm 2 after T iterations can be calculated via Eq. (3).
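For orientation, the textbook analysis of Grover's algorithm with M targets among N elements gives the success probability below. Taking $M = |N_{Eps}(p)| + 1$ follows the target count stated in the complexity analysis; whether this expression coincides exactly with Eq. (3) is an assumption.

```latex
P(T) = \sin^{2}\!\bigl((2T + 1)\,\theta\bigr),
\qquad \sin\theta = \sqrt{\frac{M}{N}},
\qquad M = |N_{Eps}(p)| + 1 .
```

With $T = \lfloor \pi / (4\theta) \rfloor$ iterations, this expression gives $P(T) \ge 1 - M/N$, so the failure probability is small whenever M is much smaller than N.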
We already know that $T \approx \sqrt{N / (|N_{Eps}(p)| + 1)}$. Usually, MinPts is far less than N, and the values of $|N_{Eps}(p)|$ are close to MinPts. As a result, it can be inferred that the success probability of algorithm 3 is high.
Our proposed method is a combination of algorithm 1 and algorithm 3; thus, the success probability of the proposed method is high.

Performance evaluation
To show the effectiveness of the proposed algorithm, we conduct a performance evaluation on two databases. To compare our algorithm with the classic DBSCAN method and the NaNG method, we use the two synthetic sample databases depicted in Fig. 2.
The experimental results on database 1 are shown in Fig. 3. Database 1 contains a total of 399 objects. In the figure, the black squares represent the points that are detected as outliers. DBSCAN performs poorly on database 1, with an accuracy of approximately 74.6%. NaNG performs better than DBSCAN, with an accuracy of 90.73%. Our proposed method has the best performance on database 1, with an accuracy of approximately 95.74%.
The experimental results on database 2 are shown in Fig. 4. As shown in Fig. 4a, database 2 includes 320 objects and 8 clusters. As shown in Fig. 4b, the result of DBSCAN is acceptable, with an accuracy of 92.5%. From the result shown in Fig. 4c, we can see that NaNG mistakenly combines two clusters into one; its accuracy is 87.5%. As shown in Fig. 4d, the proposed method matches DBSCAN, with an accuracy of 92.5%.

Conclusion
Inspired by the mutual neighbor method and quantum computing, in this work, we present a quantum MMNG-based DBSCAN. Compared to the original DBSCAN, the proposed method performs better on databases with different local-density clusters. Furthermore, the proposed method is dramatically faster than its classical counterpart.