An improved density peaks clustering algorithm based on grid screening and mutual neighborhood degree for network anomaly detection

With the rapid development of network technologies and the increasing amount of abnormal network traffic, network anomaly detection presents challenges. Existing supervised methods cannot detect unknown attacks, and unsupervised methods have low anomaly detection accuracy. Here, we propose a clustering-based network anomaly detection model, together with a novel density peaks clustering algorithm, DPC-GS-MND, based on grid screening and mutual neighborhood degree. The DPC-GS-MND algorithm utilizes grid screening to effectively reduce the computational complexity, improves the clustering accuracy through the mutual neighborhood degree, and defines a cluster center decision value for automatically selecting cluster centers. We carry out complete experiments on two real-world datasets, KDDCup99 and CIC-IDS-2017, and the experimental results demonstrate that the proposed DPC-GS-MND can detect anomalous network traffic with higher accuracy and efficiency. Together, these results indicate a good application prospect for network anomaly detection systems in complex network environments.

network anomaly or intrusion, such as R2L and U2R. HIDE 8 is a statistical-based network anomaly detection system that applies statistical models and neural network classifiers for anomaly detection. FSAS 9 is a statistical-based network anomaly detection system, including feature generators and flow-based detectors. Statistical-based methods can achieve high accuracy and detection rates when the threshold for identifying anomalies is correctly adjusted, and can provide accurate alarms for malicious activities without requiring prior knowledge of normal activities. However, it is usually not straightforward to choose the best statistic and set the values of different parameters. Classification-based methods rely on a profile of normal network traffic activity and treat activities that deviate from the baseline profile as anomalies 10. Several classification models have been applied to detect network anomalies, such as k-nearest neighbors (KNN), support vector machines (SVM), and decision trees. These models can classify network traffic into two categories (normal or anomaly) or a set of classes (normal, with each anomaly as a category) 1. Chen et al. 11 presented a network anomaly detection approach called FEW-NNN, which utilizes an improved KNN classification model based on fuzzy entropy weighting and natural nearest neighbors. Ambusaidi et al. 12 proposed a least squares SVM classification model to design a lightweight network anomaly detection system by selecting important network traffic features and detecting network anomalies. Abbes et al. 13 introduced an approach that constructs an adaptive decision tree with application layer protocol analysis for effective anomaly detection.
Although classification-based anomaly detection methods are popular and usually have a high detection rate for known attacks, they cannot detect unknown attacks or events unless relevant training information is provided, and they require more computational resources.
Clustering-based methods cluster large datasets into similar groups without relying on class labels. The most popular types are regular clustering and co-clustering, which differ in their strategies for handling observations and features 10. Specifically, regular clustering groups data points from the observations, while co-clustering considers both observations and features. Su et al. 14 applied a network traffic sampling method based on average-linkage hierarchical clustering for network anomaly detection. Petrovic et al. 15 proposed a network anomaly detection method, which combines the Davies-Bouldin index of a cluster with its centroid diameter. Ahmed et al. 16 applied X-means clustering to detect collective abnormal flows. Bhuyan et al. 17 proposed a clustering-based network anomaly detection system, which uses k-means to cluster legitimate data and computes reference points for each cluster. Clustering-based methods reduce the computational complexity and provide stable performance. However, it is difficult to evaluate the technique without assuming that larger clusters are normal and smaller clusters are anomalies or intrusions, and it is time-consuming to dynamically update the established configuration profiles.
Soft computing-based methods generally include genetic algorithms (GA), artificial neural networks (ANN) and fuzzy sets. Anderson et al. 18 combined a genetic algorithm and fuzzy logic to predict and detect network anomalies based on six attributes extracted from flow-based network data, in which the GA is used to predict network behavior and fuzzy logic is used to evaluate whether an instance represents abnormal behavior. Alnafessah et al. 19 presented an anomaly detection method based on artificial neural networks, which can accurately detect and classify abnormal behaviors, and can be easily used with online Spark systems. Abolhasanzadeh et al. 20 developed a deep autoencoder technique to reduce data dimensionality and then applied a shallow artificial neural network classification model to evaluate its effectiveness. Soft computing-based methods are applied when the decision of identifying an element of network traffic as anomalous or normal is uncertain. They have good efficiency and can effectively resolve inconsistency in the dataset with rough sets, but most methods have scalability problems, and training becomes very difficult without a reliable amount of normal traffic data.
Knowledge-based methods construct a rule set based on existing attack information, and then detect anomalies related to the constructed rule set. Common knowledge-based methods are rule and expert systems, as well as ontology- and logic-based approaches. Snort 21 is a popular rule-based intrusion detection system, which detects malicious network packets by matching them against predefined rules; it now includes more than 20,000 rules. Petri 22 is a knowledge-based intrusion detection system composed of directed bipartite graphs and colored Petri nets representing intrusion features. Naldurg et al. 23 proposed an IDS applying temporal logic specifications, in which attack patterns are formulated in a logic structure. Hung et al. 24 proposed an ontology-based method to establish an NADS based on the end user domain, in which a network anomaly detection system can be simply constructed. Knowledge-based methods have sufficient robustness and high accuracy to detect known attacks. However, they cannot identify rare or zero-day anomalies. Considering all types of anomalies or attacks, building the best, non-redundant and consistent rule set is a difficult task.
Combination learners-based methods combine different models at different levels, such as features, decisions and data. They use multiple mechanisms to effectively classify data points, and most are used for network anomaly detection systems based on ensemble or hybrid designs. Folino et al. 25 proposed a distributed data mining method based on genetic programming extended with ensemble learning to improve the accuracy of anomaly detection. Perdisci et al. 26 provided a payload-based network anomaly detection system based on a hybrid one-class SVM to improve the accuracy. Some researchers have used a combination of classification and clustering methods to take advantage of both techniques for network anomaly detection. Xiang et al. 27 combined a tree classifier and Bayesian clustering for network anomaly detection. Al-Yaseen et al. 28 provided a multi-level network anomaly detection model which applies modified k-means clustering with SVM classification and an extreme learning machine. Combination learners-based methods can achieve higher accuracy and detection rates than a single method, and can handle both known and unknown attacks. However, hybridizing more than one technique may lead to high computational costs and is generally not appropriate for real-time detection.
Each network anomaly detection approach can work well in certain situations. However, no single approach works well in all situations, because the nature of network traffic is constantly changing and the performance of a technique depends on its point of deployment in the network. Many researchers have therefore improved the density peaks clustering (DPC) algorithm for anomaly detection. The authors of 30 proposed an improved DPC algorithm called DPC-DLP, which employs the idea of KNN to calculate the cut-off distance and local density of points, and applies a graph-based method to assign the remaining points. Leung et al. 31 provided an improved DPC with a grid-based high-dimensional clustering algorithm for anomaly detection. Xu et al. 32 provided an improved grid-based density peaks clustering algorithm called DPCG to improve the efficiency. Ni et al. 3 utilized unsupervised feature selection and density peaks clustering to detect network anomalies. Yang et al. 33 provided an improved DPC algorithm called MDPCA to reduce the training scale and handle unbalanced samples. Li et al. 34 proposed a hybrid model combining KNN and DPC for network attack detection. Shi et al. 35 presented a malicious attack detection method, aiNet_DP, which combines an artificial immune network and density peaks clustering. Although DPC is a good algorithm for network anomaly detection, it still has some limitations. DPC computes the local density by measuring the distance between all pairs of points, which leads to very high computational complexity, especially for large-scale data. In response to these limitations, we propose a novel improved DPC algorithm based on grid screening and mutual neighborhood degree.

Anomaly detection module and dataset
This part provides a thorough description of the anomaly detection model and the experimental dataset, which are the most important aspects of network anomaly detection research.
Anomaly detection model. We design the network anomaly detection model as shown in Fig. 2. The network traffic anomaly detection process can be divided into five steps, including network traffic data collection, traffic data sampling, traffic dimension reduction, anomaly detection modeling, anomaly detection results and evaluation.
Network traffic collection: Collecting datasets that represent the problem that needs to be solved is the most important step in designing a good machine learning model. The dataset employed in this research is a 10% subset of the KDDCup99 dataset, which was processed from 4 GB of binary TCP dump data collected over 7 weeks of network traffic. Another dataset employed in this research is CIC-IDS-2017.
Traffic data sampling: Network traffic data sampling extracts the most representative examples from the original massive network traffic dataset, removes redundant and similar traffic data, and obtains a relatively small reduced traffic dataset to improve the detection performance of anomaly detection methods. Here, we sample a portion of the dataset, and downsample the three kinds "neptune", "normal" and "smurf" to ensure relative balance with the other kinds of data.
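For illustration, the per-class downsampling step above can be sketched as follows. This is a minimal sketch under our own assumptions (the function name and the fixed per-class cap are ours, not from the paper): over-represented classes are randomly capped at a fixed number of samples, while rare classes are kept whole.

```python
import random
from collections import defaultdict

def downsample(records, labels, cap, seed=0):
    """Randomly cap the number of records kept for each class label.

    records: list of samples; labels: parallel list of class labels;
    cap: maximum number of samples to keep for any one class (our assumption).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec, lab in zip(records, labels):
        by_class[lab].append(rec)
    kept, kept_labels = [], []
    for lab, recs in by_class.items():
        if len(recs) > cap:
            recs = rng.sample(recs, cap)  # randomly downsample the over-represented class
        kept.extend(recs)
        kept_labels.extend([lab] * len(recs))
    return kept, kept_labels
```

In practice the cap would be chosen relative to the sizes of the minority attack classes, so that "neptune", "normal" and "smurf" no longer dominate the reduced dataset.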
Traffic dimension reduction: Before performing anomaly detection modeling on massive high-dimensional network traffic data, it is necessary to perform feature selection and dimensionality reduction on the data. Here, we delete the features that can be calculated from other dimensions, and then utilize the Fisher score method and a deep graph feature learning approach to obtain the key features.
Network anomaly detection modeling: After performing numerical standardization and normalization to convert all features to a common scale between zero and one, we need to do network anomaly detection modeling. There are six categories of anomaly detection models: statistical-based, classification-based, clustering-based, soft computing-based, knowledge-based and combination learners-based. In this paper, we implement several unsupervised clustering algorithms that do not require a labelled dataset for network anomaly detection.
Anomaly detection results and evaluation: The clustering results and accuracy on the training set and the testing set are shown in this part. Most network anomaly detection results are evaluated with accuracy-related indicators, including false positive rate, precision, recall, overall accuracy and F-value. Here, we utilize the accuracy rate to evaluate the experimental results, and we compare DPC-GS-MND with three other challengers: DPCG, MDPCA and DPC-DLP.
Dataset and preprocessing. The KDDCup99 dataset 36 is the most commonly used dataset in the field of anomaly or intrusion detection and machine learning research. It contains about 5 million connection records. As shown in Table 1, the KDDCup99 10% dataset is imbalanced, in which "neptune", "normal" and "smurf" occur much more often than other types, so we downsample these three types to ensure relative balance. Not all features are useful for detection, and some even burden the memory. In the preprocessing, we first delete the features that can be calculated from others, and then utilize the Fisher score method and the deep graph feature learning algorithm in 11 to obtain the top 10 important features for anomaly detection, as shown in Table 2.
Since some features consist of characters, such as protocol_type, flag and label, we need to convert the corresponding characters into numerical values, and then perform numerical standardization. The normalization is performed according to Eq. (1) to convert the data into [0, 1], where x′_ij is the numerical standardized value of x_ij and x″_ij is the normalized value of x′_ij:

x″_ij = (x′_ij − min_j) / (max_j − min_j),   (1)

where min_j and max_j are the minimum and maximum standardized values of the j-th feature.
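The standardization followed by the Eq. (1) min-max normalization can be sketched in Python as follows (a minimal illustration; the function name is ours):

```python
import numpy as np

def standardize_normalize(X):
    """Z-score standardize each feature, then min-max scale it to [0, 1] (Eq. 1)."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0                       # avoid division by zero on constant features
    Xs = (X - X.mean(axis=0)) / std           # x'_ij: standardized value
    rng = Xs.max(axis=0) - Xs.min(axis=0)
    rng[rng == 0] = 1.0
    return (Xs - Xs.min(axis=0)) / rng        # x''_ij: normalized value in [0, 1]
```

Each column of the result spans exactly [0, 1], which puts all traffic features on a common scale before clustering.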

The DPC-GS-MND clustering algorithm
The key idea of the density peaks clustering algorithm is based on the following two assumptions: (1) the cluster center is surrounded by data points with density no higher than its own; (2) the distance between cluster centers and points of higher density is relatively large. The core of the density peaks clustering algorithm is the decision graph, that is, how to select the cluster centers more quickly and accurately 37. This paper proposes an improved density peaks clustering algorithm called DPC-GS-MND, which is based on grid screening and mutual neighborhood degree. In the DPC-GS-MND algorithm, grid screening, mutual neighborhood degree and automatic center selection techniques are introduced to optimize and improve the drawing of decision graphs and the selection of cluster centers.
Density peaks clustering. The density peaks clustering algorithm mainly includes three steps: (1) calculate the local density ρ_i of each data point x_i, and the minimum distance δ_i between x_i and all other data points with higher density; (2) obtain the cluster centers from the decision graph drawn according to ρ_i and δ_i; (3) assign each remaining data point to the same cluster as its nearest higher-density neighbor.
Definition 1 Local Density. Given a dataset containing n data points X (X = {x_1, x_2, ..., x_n}), for ∀x_i, x_j ∈ X, assuming that the local density of x_i is ρ_i, then ρ_i is given by Eq. (2):

ρ_i = Σ_{j≠i} χ(Eudist(x_i, x_j) − dist_cutoff),  with χ(x) = 1 if x < 0 and χ(x) = 0 otherwise,   (2)

where Eudist(x_i, x_j) represents the Euclidean distance between data point x_i and data point x_j, and dist_cutoff denotes a given cutoff distance.
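The cut-off kernel local density of Eq. (2), which simply counts how many points fall within the cutoff distance, can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def local_density(X, dist_cutoff):
    """Cut-off kernel local density (Eq. 2): rho_i counts the points lying
    closer to x_i than dist_cutoff. Returns the distance matrix too, since
    the later steps of DPC reuse it."""
    X = np.asarray(X, dtype=float)
    # pairwise Euclidean distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rho = (d < dist_cutoff).sum(axis=1) - 1   # subtract 1 to exclude the point itself
    return d, rho
```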

Definition 2
High-density Distance. The high-density distance is measured by computing the minimum distance between a data point and any other data point of higher density. For the highest-density data point, δ_i is calculated by Eq. (3), and for other data points, δ_i is computed by Eq. (4):

δ_i = max_j Eudist(x_i, x_j),   (3)

δ_i = min_{j: ρ_j > ρ_i} Eudist(x_i, x_j).   (4)
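Given the distance matrix and local densities, the high-density distances of Eqs. (3) and (4) can be computed as follows (a sketch; it assumes densities are distinct enough that the descending-density order is meaningful):

```python
import numpy as np

def high_density_distance(d, rho):
    """delta_i: minimum distance to any higher-density point (Eq. 4);
    the globally highest-density point instead gets its maximum distance
    to any other point (Eq. 3)."""
    n = len(rho)
    delta = np.zeros(n)
    order = np.argsort(-np.asarray(rho))      # indices sorted by decreasing density
    delta[order[0]] = d[order[0]].max()       # Eq. (3): the global density peak
    for pos in range(1, n):
        i = order[pos]
        higher = order[:pos]                  # all points with higher (or tied) density
        delta[i] = d[i, higher].min()         # Eq. (4)
    return delta
```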
Definition 3 Density Peak. A point whose distance δ_i is large and whose local density ρ_i is also large is defined as a density peak. For ∀x_i ∈ X, the pair (ρ_i, δ_i) can be obtained by computing the local density ρ_i and the distance to higher-density points δ_i, and then the density peaks decision graph can be drawn. The density peak points have high values of both ρ_i and δ_i.
The detail process of DPC algorithm is shown as Algorithm 1.
According to Algorithm 1, in Step 2, the space complexity increases significantly when calculating the distance matrix between all data points, which limits the speed on large-scale datasets. In Steps 3-5, the definition of local density does not consider the structural differences within the data, making it difficult to obtain a good clustering result. In Step 6, the determination of cluster centers requires human selection, which increases the uncertainty of clustering, especially when the human eye cannot accurately select the cluster centers.
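For completeness, the remaining assignment step of DPC (step (3) above) can be sketched as follows. This is an illustration under our assumptions (in particular, that the highest-density point is among the chosen centers, as it is for valid DPC center selections); names are ours:

```python
import numpy as np

def assign_points(d, rho, centers):
    """DPC assignment step: every non-center point inherits the cluster label
    of its nearest neighbor among the points of higher density, processed in
    order of decreasing density so those neighbors are already labelled."""
    n = len(rho)
    labels = np.full(n, -1)
    for c_id, c in enumerate(centers):
        labels[c] = c_id
    order = np.argsort(-np.asarray(rho))       # process points by decreasing density
    for pos, i in enumerate(order):
        if labels[i] == -1:
            higher = order[:pos]               # higher-density points, already labelled
            nn = higher[np.argmin(d[i, higher])]
            labels[i] = labels[nn]
    return labels
```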
Grid screening. The DPC algorithm can efficiently detect anomalies and find clusters of arbitrary shapes. However, its space complexity increases significantly when calculating the distance matrix between all data points, which limits the speed of DPC on large-scale datasets. This paper introduces the grid screening technique. First, it divides the whole data space into grid cells and maps the dataset to the grid cells; then it removes the sparse grid cells and focuses on the data points in the remaining dense grids. This greatly decreases the memory requirements and time complexity.

Definition 4 Grid Side Length
Given a d-dimensional dataset containing n data points X (X = {x_1, x_2, ..., x_n}), for ∀x_i ∈ X, let l_j and h_j denote the lower and upper bounds of the j-th dimension (j = 1, 2, ..., d); then S = [l_1, h_1) × [l_2, h_2) × ... × [l_d, h_d) represents the d-dimensional data space. Each dimension of the data space is divided into grid cells with equal side lengths and disjoint edges, and the grid side length gsl is defined as follows, where μ denotes the screening ratio, which is applied to adjust the size of the grid side length gsl.

Definition 5 Grid Cell Density
Given a d-dimensional dataset containing n data points X (X = {x_1, x_2, ..., x_n}), the data space is divided into grid cells {u_1, u_2, ..., u_n} with grid side length gsl, and X is mapped to the corresponding grid cells. The grid cell density ρ_{u_i} of u_i is defined as

ρ_{u_i} = count(G_{u_i}),   (6)

where count(G_{u_i}) denotes the number of points in the cell with grid number G_{u_i}.
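The grid screening procedure — mapping points to cells, counting cell densities, and discarding points in sparse cells — can be sketched as follows. The grid side length gsl and the density threshold are taken as inputs here, since the exact screening-ratio formula (Eq. 5) is not reproduced; names are ours:

```python
import numpy as np
from collections import Counter

def grid_screen(X, gsl, density_threshold):
    """Map each point to a grid cell of side gsl, count points per cell
    (the Eq.-(6)-style cell density), and keep only points that fall in
    cells at least as dense as density_threshold."""
    X = np.asarray(X, dtype=float)
    cells = np.floor((X - X.min(axis=0)) / gsl).astype(int)  # grid number per point
    keys = [tuple(c) for c in cells]
    density = Counter(keys)                                  # count(G_ui) per occupied cell
    keep = np.array([density[k] >= density_threshold for k in keys])
    return X[keep], keep
```

Only the retained points enter the subsequent density computations, which is what reduces the size of the distance matrix on large-scale traffic data.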
Mutual neighborhood degree. The local density defined in the DPC algorithm does not consider the structural differences within the data, so it is difficult to obtain a good clustering result. The relative density and the neighbors of a sample can more accurately and effectively determine whether it is a cluster center. Here, we compute the relative density in the local area of a sample, rather than the relative density over the whole area.
Definition 6 KNN Local Density. Given a dataset containing n data points X (X = {x_1, x_2, ..., x_n}), for ∀x_i ∈ X, knn(i) represents the set of k nearest neighbors of x_i, and x_j ∈ knn(i). The KNN local density of x_i is ρ_i, defined by Eq. (7), where Eudist(x_i, x_j) denotes the Euclidean distance between data point x_i and data point x_j, and k denotes the number of neighbor points.
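A sketch of a KNN local density follows. Since the exact form of Eq. (7) is not reproduced above, we use one common form from KNN-based DPC variants — an exponential of the mean squared distance to the k nearest neighbors — as a stated assumption; it captures the idea that a point with closer neighbors has a higher local density:

```python
import numpy as np

def knn_local_density(X, k):
    """KNN local density sketch (assumed form, not necessarily Eq. 7 verbatim):
    rho_i = exp(-(1/k) * sum of squared distances to the k nearest neighbors).
    Points in locally dense regions get values near 1; isolated points near 0."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # distances to the k nearest neighbors of each point, excluding itself
    knn_d = np.sort(d, axis=1)[:, 1:k + 1]
    return np.exp(-(knn_d ** 2).sum(axis=1) / k)
```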
Definition 7 Neighborhood Degree. Given a dataset containing n data points X (X = {x_1, x_2, ..., x_n}), for ∀x_i, x_j ∈ X, the neighborhood degree is defined by the distance between data points, and is computed by Eq. (8), where NDegree(x_i, x_j) represents the neighborhood degree between data point x_i and data point x_j. The greater the distance between data point x_i and data point x_j, the lower the similarity and the smaller the neighborhood degree.
Definition 8 Relative Neighborhood Degree. Given a dataset containing n data points X (X = {x_1, x_2, ..., x_n}), for ∀x_i, x_j ∈ X, we introduce the local neighborhood degree to compute the relative neighborhood degree of x_i to x_j, as given by Eq. (9), where knn(x_i) denotes the set of k nearest neighbors of x_i, and NDegree(x_v, x_j) denotes the neighborhood degree of data point x_v ∈ knn(x_i) to data point x_j.

Definition 9
Mutual Neighborhood Degree. Given a dataset containing n data points X (X = {x_1, x_2, ..., x_n}), for ∀x_i, x_j ∈ X, the mutual neighborhood degree is defined based on the relative neighborhood degree, as in Eq. (10), where MNDegree(x_i, x_j) represents the mutual neighborhood degree of data points x_i and x_j; RNDegree(x_i, x_j) is the relative neighborhood degree of data point x_i to data point x_j, and RNDegree(x_j, x_i) is the relative neighborhood degree of data point x_j to data point x_i.
Here, the novel measure of mutual neighborhood degree between data points improves the density peaks clustering algorithm, and solves the problem that the original algorithm ignores the structural differences within the data and therefore fails to find the true local density.
Automatic cluster center selection. In the DPC algorithm, the determination of cluster centers requires human selection, which increases the uncertainty of clustering, especially when the human eye cannot accurately select the cluster centers; in such cases the choice of centers becomes very difficult. By comprehensively considering the two decision parameters ρ and δ of a cluster center, a cluster center decision value is proposed:

γ_i = P_i × Δ_i,   (11)

where P_i and Δ_i represent the normalized values of ρ_i and δ_i, respectively, calculated as follows:

P_i = (ρ_i − ρ_min) / (ρ_max − ρ_min),   (12)

Δ_i = (δ_i − δ_min) / (δ_max − δ_min),   (13)

where ρ_min and δ_min represent the minimum values of ρ_i and δ_i, and ρ_max and δ_max are the maximum values of ρ_i and δ_i. Obviously, the larger the γ of a data point, the more likely it is to be a cluster center. We calculate the value of γ, arrange {γ_i}_{i=1}^{N} in descending order and plot them on the coordinate plane to obtain the density peaks decision graph, in which the γ values have obvious size boundaries. Therefore, the density peaks can be selected automatically by using a heuristic method to set a threshold: when the value of γ is greater than the threshold, the corresponding density peaks are determined to be the cluster centers.
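The automatic center selection described above can be sketched as follows; this assumes the decision value is the product of the normalized ρ and δ (a common choice for DPC variants), and takes the heuristic threshold as an input:

```python
import numpy as np

def cluster_center_decision(rho, delta, threshold):
    """Normalize rho and delta to [0, 1], combine them into the decision value
    gamma_i = P_i * D_i, and automatically select as centers all points whose
    gamma exceeds the given threshold."""
    rho = np.asarray(rho, dtype=float)
    delta = np.asarray(delta, dtype=float)
    P = (rho - rho.min()) / (rho.max() - rho.min())       # normalized density
    D = (delta - delta.min()) / (delta.max() - delta.min())  # normalized distance
    gamma = P * D
    centers = np.where(gamma > threshold)[0]              # indices of selected centers
    return gamma, centers
```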
DPC-GS-MND: an improved DPC clustering algorithm. First, the DPC-GS-MND algorithm utilizes the idea of the k-neighborhood to calculate the local density of data points and find the density peaks, and then assigns the k nearest neighbors to their corresponding clusters. Second, it computes the mutual neighborhood degree between data points, finds the closest unallocated data points according to the mutual neighborhood degree, and assigns them to the corresponding clusters. Finally, it repeats this operation until all data points are allocated. The DPC-GS-MND algorithm flowchart is shown in Fig. 3.
The detailed process of the proposed DPC-GS-MND algorithm is given in Algorithm 2. We utilize the accuracy rate to evaluate the experimental results in this paper, which is given by Eq. (14):

Accuracy = Σ_{i=1}^{m} n_ii / Σ_{i=1}^{m} Σ_{j=1}^{m} n_ij,   (14)
where m is the number of network anomaly types and n_ij denotes the number of i-type network anomalies clustered as type j. TP_i, FP_i, FN_i and TN_i are defined as: TP_i = n_ii, FP_i = Σ_{j≠i} n_ji, FN_i = Σ_{j≠i} n_ij and TN_i = Σ_{j≠i} n_jj.
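Given the confusion matrix n defined above, the overall accuracy is the fraction of correctly clustered samples, which can be computed as follows (a minimal sketch; the function name is ours):

```python
import numpy as np

def overall_accuracy(n):
    """Overall accuracy from a confusion matrix n, where n[i][j] is the number
    of i-type samples clustered as type j: correctly clustered / total."""
    n = np.asarray(n, dtype=float)
    return np.trace(n) / n.sum()   # sum of n_ii over the total sample count
```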
Experimental results. We have developed and studied the network anomaly detection model based on the DPC-GS-MND clustering algorithm, which uses only observable aspects of network traffic.

Experiments 1
Figure 4 shows the comparison of DPC-GS-MND and DPC in clustering accuracy and clustering time. Each time a different dataset is selected, the proposed DPC-GS-MND algorithm shows different detection accuracy and performance, but the fluctuations are within an acceptable range, because the algorithm has the ability to recognize different anomalies. By comparing the accuracy on different datasets, it is verified that the DPC-GS-MND algorithm is superior to the DPC algorithm in both anomaly detection accuracy and detection effect. The DPC-GS-MND algorithm can quickly and accurately identify anomalies.

Experiments 2
We conducted experiments on five single attack types to verify the detection effect on each individual type. The results are shown in Fig. 5. Since the feature values of the R2L attack pattern are changeable, it is very similar to normal data and not easy to detect. Apart from R2L, the other attack types are detected relatively well, with each detection accuracy exceeding 93%.

Experiments 3
In this paper, we introduce grid screening and mutual neighborhood degree techniques to improve the DPC clustering algorithm. In these experiments, we confirm the effectiveness of the grid screening technique and the mutual neighborhood degree technique by comparing the network anomaly detection performance of DPC-GS-MND, DPC-GS, DPC-MND and DPC. As shown in Figs. 6 and 7, the experimental results show that, compared with DPC-GS, DPC-MND and DPC, the DPC-GS-MND algorithm has much higher anomaly detection accuracy, and a much lower running time than DPC-MND and DPC. DPC-MND has better anomaly detection accuracy than DPC-GS, but DPC-GS has a shorter running time. This shows that the introduced grid screening technique improves the computational performance, and the introduced mutual neighborhood degree technique effectively improves the detection accuracy.

Experiments 4
The anomaly detection accuracy and running time of the four algorithms are shown in Table 3. Among the four clustering algorithms, network anomaly detection using the DPC-GS-MND algorithm provides the best accuracy and a relatively short running time. This shows that the DPC-GS-MND algorithm has better anomaly detection accuracy than MDPCA, DPC-DLP and DPCG. The DPC-GS-MND algorithm takes less time than MDPCA and DPC-DLP, but a little more time than DPCG. The accuracy of DPCG is lower than that of MDPCA and DPC-DLP, but its running time is shorter. This is because both DPC-GS-MND and DPCG utilize grids to greatly improve the running efficiency of the algorithm, while DPC-GS-MND takes some additional time to compute the mutual neighborhood degree.

Experiments 5
In order to further confirm the applicability of the DPC-GS-MND algorithm to network anomaly detection, we also performed experiments on CIC-IDS-2017, which is a recent network traffic dataset covering 11 common attacks including Bot, DoS, DDoS, SQL Injection, Brute Force, Infiltration, Port Scan and XSS.
As shown in Fig. 8, the experimental results show that the DPC-GS-MND algorithm has higher anomaly detection accuracy than DPCG, MDPCA and DPC-DLP on the CIC-IDS-2017 dataset. Based on all the above experiments, the DPC-GS-MND algorithm performs better on both real-world datasets, KDDCup99 and CIC-IDS-2017.
Through our experimental evaluation, we have shown that the network traffic anomaly detection model based on the DPC-GS-MND algorithm outperforms DPC, DPCG, MDPCA and DPC-DLP. The proposed DPC-GS-MND algorithm improves both the accuracy and efficiency of network traffic anomaly detection.

Conclusion and future work
In this paper, we propose and evaluate a novel network traffic anomaly detection method named DPC-GS-MND. In order to achieve efficient and accurate detection of abnormal traffic, an improved density peaks clustering algorithm based on grid screening and mutual neighborhood degree is proposed. Experimental results on well-known datasets show that the proposed DPC-GS-MND method identifies anomalies with higher detection accuracy and less running time.
In future research, we hope to extend the work in the following two major directions. (1) Adaptive k value selection: the DPC-GS-MND algorithm uses k nearest neighbors, but the value of the parameter k still needs to be manually determined; adaptive determination of the k value will be our next work. (2) Data reduction: data reduction includes network traffic sampling and important feature extraction. We need to further study malware traffic sampling and representative feature extraction to improve the efficiency and accuracy of anomaly detection. During network traffic sampling, more attention should be paid to the unbalanced nature of network traffic data.