Abstract
With the wide application of cluster analysis, the number of clusters is gradually increasing, as is the difficulty of selecting judgment indicators for the number of clusters. Moreover, small clusters are crucial to discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes connections between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and reduces the number of adjustable parameters to a minimum. Based on the robustness of the cluster number to noise, a denoising method suitable for BCALoD is proposed. A different cutoff distance and cutoff density are assigned to each data cluster, which improves clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and city-light satellite images.
Introduction
Cluster analysis is an unsupervised-learning data processing technique that has been widely used in machine learning and information recognition1. The main clustering methods used so far include K-means clustering, the Gaussian mixture model (GMM), hierarchical clustering, mean shift clustering, and density-based spatial clustering of applications with noise (DBSCAN)2,3. K-means clustering achieves good performance on spherical datasets and is one of the most widely used clustering methods4. However, it handles nonspherical data poorly and must be given the number of clusters in advance5,6. GMM can estimate the expectation and variance of Gaussian datasets7, but it is difficult to apply to data with poor Gaussian characteristics. Mean shift clustering8 can automatically determine the number of clusters, but for a dataset with small clusters the choice of sliding-window radius may affect the results. DBSCAN does not need the number of clusters in advance and can analyze data of arbitrary shape9. However, if the sample density is not uniform and the cluster spacings vary, the clustering quality is poor and too many parameters must be adjusted. Rodriguez and Laio proposed a density-based fast-search method10 that overcomes the drawbacks of conventional clustering methods and can quickly cluster nonspherical sets. Unlike mean shift, their procedure does not require embedding the data in a vector space; however, it requires manual selection of the number of clusters and easily ignores small clusters, making the results depend on experience and subjective judgment.
As cluster analysis is applied in more and more fields, the amount of sample data increases, and the number of clusters increases with it, making it harder for all clustering methods to judge the cluster number. Small clusters denote points in a data sample that lie far from the large clusters but still exhibit aggregation. They seldom appear, but in engineering structural safety analysis11,12, small clusters usually indicate extreme or unconsidered working conditions. In some cases, accurately identifying small-cluster information is more important than identifying large clusters. Unfortunately, existing methods focus mainly on forming large clusters, and there is little research on small clusters. Moreover, existing denoising in clustering relies mainly on an artificial threshold. Because the clustering characteristics of each cluster differ, using a unified denoising indicator increases the chance of discarding small clusters. These problems pose challenges for traditional clustering methods.
In this paper, a bidirectional clustering algorithm is proposed that works by combining the advantages of mean shift clustering and clustering by fast search and find of density peaks. The proposed algorithm is divided into an up process and a down process. In the up process, the local density of different data points is calculated to find high-local-density points nearest to data points, and then data chains are formed from data points ranging from low local density to high local density. In the down process, the highest local-density data points are treated as clustering centers, and then the data chains are merged, all data points are traversed, and finally the clustering operation is completed from high-local-density data to low-local-density data. By comparing K-means clustering, GMM, and BCALoD, it was found that the proposed clustering algorithm can quickly calculate the number of clusters and has good recognition performance for small clusters. In the clustering process, the number of parameters requiring adjustment is reduced to a minimum.
This study also addressed the noise problem in real-world data measurement13 by exploiting two characteristics of noise: its wide spatial range and its weak aggregation. A noise-reduction decision indicator is proposed that is suitable for BCALoD and assigns a different cutoff distance and cutoff density to each cluster so as to retain important data as much as possible. Clustering of random data examples and city-light photographs shows that the proposed algorithm has good denoising performance.
Bidirectional clustering algorithm based on local density (BCALoD)
Assume that the cluster centers are the data points with maximum local density, and calculate the local density \(\rho_i\) of each point in the dataset as

$$\rho_{i} = \sum_{j \ne i} \exp\left[ -\left( \frac{d_{ij}}{d_{c}} \right)^{2} \right] \quad (1)$$

where \(d_{ij}\) is the Euclidean distance between point j and point i, and \(d_c\) is the cutoff distance, chosen as roughly the 1% to 2% quantile of all pairwise distances in the dataset. Because a Gaussian kernel is used to represent the local density \(\rho\)8,14, the probability that the local density is identical at different points is low. Assuming that the local density \(\rho\) of each point is distinct, the data points are sorted by \(\rho\). Figure 1 shows the schematic diagram of BCALoD.
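As a sketch of this step (in Python with NumPy; the function names are ours, not from the paper), the Gaussian-kernel local density and a quantile-based cutoff distance can be computed as:

```python
import numpy as np

def cutoff_distance(X, pct=2.0):
    """Cutoff distance d_c: the pct-th percentile (1-2% in the text)
    of all pairwise Euclidean distances in the dataset."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    upper = d[np.triu_indices(len(X), k=1)]  # each pair counted once
    return np.percentile(upper, pct)

def local_density(X, dc):
    """Eq. (1): rho_i = sum over j != i of exp(-(d_ij / d_c)^2)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0  # drop the self term
```

On a toy set of three nearby points plus one outlier, the points inside the dense region receive a much larger density than the outlier, which is the property the up process relies on.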
In the up process, the condition for judging point k a cluster center is that no point within the cutoff distance has a larger local density than point k. Each point establishes a connection only with its upper layer and has no relation to the layer below; that is, the tops of the selected data chains are independent of the starting point. The up process starts from the point with the smallest local density \(\rho\). The addressing sequence reflects only the path from bottom to top and does not change the local density of the cluster centers. If no point within the cutoff distance of a small cluster has a higher local density, that cluster forms a category of its own. Overall, the up process can be regarded as the inverse of the point-assignment step in a density-peak clustering algorithm. Algorithms 1 and 2 show the pseudocode of the up and down processes, respectively.
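The two processes can be sketched as follows (Python/NumPy; a simplified re-implementation based on our reading of the text, not the authors' code). `local_density` implements Eq. (1); `bcalod` builds the chains upward and then propagates labels downward:

```python
import numpy as np

def local_density(X, dc):
    """Eq. (1): Gaussian-kernel local density."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0

def bcalod(X, dc):
    """Up process: link each point to its nearest higher-density point
    within dc, forming data chains; points with no such neighbour become
    cluster centres. Down process: propagate centre labels down the
    chains from high to low density."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rho = local_density(X, dc)
    order = np.argsort(rho)               # low density -> high density
    parent = np.full(n, -1)
    for i in order:                       # up process
        higher = np.where((rho > rho[i]) & (d[i] <= dc))[0]
        if higher.size:
            parent[i] = higher[np.argmin(d[i, higher])]
    centers = np.flatnonzero(parent == -1)
    labels = np.full(n, -1)
    labels[centers] = np.arange(len(centers))
    for i in order[::-1]:                 # down process: parents first
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers
```

Because every chain parent has strictly higher density than its child, traversing points in decreasing density guarantees each parent is already labelled when its children are visited, so one pass completes the down process.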
In clustering by fast search and find of density peaks, when large clusters are much bigger than small ones, the information of the small clusters is easily overwhelmed, making the choice of cluster number subjective. BCALoD properly identifies small clusters. The mean shift algorithm must embed random data points and iterate the sliding window continuously, which incurs a high computational cost and usually hampers the analysis of high-dimensional clusters. The proposed construction of data clusters through data chains reduces the operational cost and ensures that the selection of cluster centers is unrelated to the initial selection. BCALoD establishes the data chains, does not need to embed the data into a vector space, automatically determines the number of clusters, is more sensitive than other methods to small clusters, and reduces the number of adjustable parameters to a minimum. In summary, the method combines the advantages of clustering by fast search and find of density peaks and of mean shift clustering.
We used a Gaussian mixture distribution to generate a 2D dataset (Case 1), as shown in Fig. 2a. Figure 2b shows the results of the BCALoD clustering algorithm, and Table 1 shows the size of each cluster. The largest cluster contains 5,000 data points, while the smallest contains only 30. Case 1 contains three small clusters.
Noise recognition and cutoff distance optimization
The influence of noise is usually unavoidable in the real world15, and handling it reasonably is an important problem in cluster analysis16. Noise in cluster analysis is characterized by a wide spatial range and weak aggregation. In practice, empirical values can quickly narrow the cutoff distance to a relatively reasonable range. When the distance between two clusters is artificially increased or noise is added, the cutoff distances change, whereas the clustering properties within the clusters do not. Meanwhile, because the degree of aggregation differs from cluster to cluster, using a uniform cutoff distance for denoising is unreasonable.
We introduced noise points to Case 1, as shown in Fig. 3a. The BCALoD algorithm was used to do cluster analysis for all data. Figure 3b shows the clustering results.
If the number of noise points is much lower than the number of real clusters, denoising can be done simply by comparing cluster sizes (Fig. 2b). However, when there are many real small clusters or noise clusters (Fig. 3b), denoising by cluster size alone easily discards real small clusters. Because of the characteristics of noise, the local density of noise is much lower than that of actual cluster centers. Therefore, when the density threshold is raised gradually, the low-density points of both noise clusters and real clusters are deleted first, while the centers of the real clusters are not (Fig. 4a). In other words, the number of real cluster centers is strongly robust to changes in the density threshold.
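This robustness can be illustrated with a small NumPy experiment (the data and threshold below are our own synthetic choices, not from the paper): isolated noise points receive near-zero local density, so a moderate density threshold removes all noise while every point of the real clusters, centers included, survives.

```python
import numpy as np

def local_density(X, dc):
    """Eq. (1): Gaussian-kernel local density."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0

# two dense clusters plus two isolated noise points (synthetic toy data)
rng = np.random.default_rng(0)
A = rng.normal([0.0, 0.0], 0.1, size=(30, 2))
B = rng.normal([5.0, 5.0], 0.1, size=(30, 2))
noise = np.array([[2.5, 0.0], [0.0, 3.0]])
X = np.vstack([A, B, noise])

rho = local_density(X, dc=0.5)
kept = rho >= 1.0   # raising the threshold removes the noise first
```

With these parameters the cluster points all have densities far above 1 (each has roughly 29 close neighbours), while the two noise points sit several cutoff distances from everything and have densities near zero.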
Let \(h_{i}^{1}\) denote the range of the density threshold over which the number of clusters remains stable at i clusters, and let \(h_{i}^{2}\) denote the maximum local density of the i-th cluster. The decision indicator \(D_i\) is then set as

$$D_{i} = \frac{h_{i}^{1}}{h_{i}^{2}} \quad (2)$$
If \(h_{i}^{1}\) is high, the number of clusters selected as the i cluster is stable. If \(h_{i}^{2}\) is low, the number of clusters retained is high, which means that more data points are retained. Figure 4b shows the decision indicators of the noise-containing 2D Gaussian mixture distribution dataset. When the cluster number of the dataset is determined, the cutoff local densities and cutoff distances of each cluster can be calculated through data filtering (Fig. 4c), and then the clustering results can be obtained (Fig. 4d). Algorithm 3 presents the process of noise recognition.
The BCALoD algorithm retains its sensitivity to small clusters and preserves non-noise data to the greatest extent possible. It also uses the decision indicator to assign a different cutoff local density and cutoff distance to each cluster. Because the denoising process requires no repeated calculations on the data points, it keeps the computational cost low.
Results and discussion
We used a combination dataset to verify the BCALoD algorithm by comparing the results of different clustering algorithms, as shown in Fig. 5. The relative performance of BCALoD was determined by comparing it with K-means clustering and Gaussian mixture clustering. For K-means, the silhouette coefficient17 measures the cohesion and separation of a clustering, and the number of clusters that maximizes the silhouette coefficient is selected. For Gaussian mixture clustering, the Bayesian information criterion (BIC)18 is used, and the number of clusters that minimizes the BIC is selected.
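The two model-selection rules used for the baselines can be sketched with scikit-learn (the synthetic two-blob data and the search range 2–5 are our own choices for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# two well-separated spherical blobs (synthetic data of our own)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(60, 2)),
               rng.normal([8.0, 8.0], 0.5, size=(60, 2))])

# K-means baseline: pick k that maximises the silhouette coefficient
sil = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                     random_state=0).fit_predict(X))
       for k in range(2, 6)}
best_k_sil = max(sil, key=sil.get)

# GMM baseline: pick the component count that minimises the BIC
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 6)}
best_k_bic = min(bic, key=bic.get)
```

On data this clean, both criteria recover the true cluster count; the paper's point is that as the number of clusters grows, such exhaustive sweeps become increasingly expensive and ambiguous compared with BCALoD's automatic determination.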
Table 2 shows the clustering quality indexes (AR, ARI, NMI, HI) of BCALoD, DBSCAN, DPC, and K-means, where AR is the Rand index, ARI the adjusted Rand index, NMI the normalized mutual information, and HI the Hubert index. The closer a quality index is to 1, the closer the clustering result is to reality. Nc is the computed number of clusters; Ns is the set value. All quality indexes of BCALoD are higher than those of the other clustering algorithms, indicating that BCALoD has strong clustering ability and handles noise efficiently on these datasets. Figure 6 shows the clustering results for a variety of 2D and 3D datasets; BCALoD presents good clustering ability for different types of datasets.
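Two of these quality indexes are available directly in scikit-learn; a brief sketch (the toy labelings are our own) shows the key property that they are invariant to how clusters are numbered, comparing only the partitions:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

truth = [0, 0, 0, 1, 1, 2]
pred_perfect = [2, 2, 2, 0, 0, 1]   # same partition, clusters relabelled
pred_poor    = [0, 1, 2, 0, 1, 2]   # unrelated partition

ari_good = adjusted_rand_score(truth, pred_perfect)
nmi_good = normalized_mutual_info_score(truth, pred_perfect)
ari_poor = adjusted_rand_score(truth, pred_poor)
```

A perfect (relabelled) partition scores 1 on both indexes, while an unrelated one scores near zero, which is why values close to 1 in Table 2 indicate agreement with the ground truth.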
For image clustering, the pixel values of different data points differ. Because the pixel value \(L_i\) is an integer, different points can easily yield the same local density. Therefore, taking the pixel value as the main factor, the local density was redefined as

$$\rho_{i}' = L_{i} + \frac{\rho_{i}}{\max_{j} \rho_{j}} \quad (3)$$

where \(\rho_i\) is the local density calculated by Eq. (1). Because \(L_i\) is an integer, \(\rho_i\) is normalized so that the local densities of different data points differ; points with the same pixel value are then ordered by \(\rho_i\).
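A minimal sketch of this definition (our reconstruction of the idea: the integer pixel value dominates and the normalized Eq. (1) density breaks ties; the function name is ours):

```python
import numpy as np

def pixel_local_density(L, rho):
    """Pixel value L_i as the main factor, with the Eq. (1) density
    normalised to (0, 1] acting as a tie-breaker between points that
    share the same pixel value."""
    rho_n = rho / rho.max()
    return np.asarray(L, dtype=float) + rho_n
```

With this ordering, two points of equal brightness are ranked by their Eq. (1) density, while points of different brightness keep the brightness ordering.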
Figure 7 shows a city lighting satellite image19 and the clustering results by BCALoD cluster algorithm. The brightness of each point in the area was different, and the lights were widely distributed. There were many distribution centers in the image, and the distribution shapes were irregular. It would be very difficult to execute effective clustering calculations for these images using traditional methods.
As shown in the cases above, the proposed algorithm achieved good clustering performance on nonspherical clusters, spiral datasets, noise-containing datasets, and city-light image data. The only parameter requiring adjustment in the clustering calculation was the cutoff density, which minimizes the difficulty of parameter tuning. BCALoD automatically determined the number of clusters, was more sensitive to the information in small clusters, and assigned a different cutoff distance and cutoff local density to each data cluster.
Compared with other clustering methods, BCALoD is more sensitive to small clusters and has good denoising ability. Because it assumes the cluster center is the data point with maximum local density, BCALoD performs relatively worse on closed annular data than multi-parameter clustering methods do; post-processing of the cluster centers therefore needs further analysis. BCALoD is also relatively sensitive to the local density of the dataset and requires a sufficiently high data density: if the density within a cluster is not high enough, its points are easily judged as noise. The preliminary determination of the cutoff distance dc also needs further exploration.
Conclusions
In this paper a BCALoD algorithm based on local density is proposed that combines the merits of clustering by fast search and find of density peaks and of mean shift clustering. The algorithm forms data chains from low-local-density points to high-local-density points, treats the highest-density points as cluster centers, and then merges the data chains to complete the clustering. BCALoD automatically determines the number of clusters, is more sensitive to small clusters, and reduces the number of adjustable parameters to a minimum.
Using the characteristics of noise, a denoising method is proposed based on local density, which ensures a denoising effect and retains the sensitivity of the BCALoD algorithm to small clusters. The denoising method assigns a cutoff local density and cutoff distance for each cluster, retaining useful data as much as possible. The proposed algorithm can also reduce calculation cost.
The BCALoD algorithm can perform effective cluster analysis on nonspherical clusters, spiral datasets, noise-containing datasets, and image data, and achieves good clustering performance for clusters of various shapes. When the number of clusters is large, BCALoD can determine it quickly, remains sensitive to small clusters, and removes noise effectively.
References
Shirkhorshidi, A. S., Aghabozorgi, S., Wah, T. Y. & Herawan, T. Big data clustering: A review. ICCSA 2014, 707–720 (2014).
Dharni, C., & Bnasal, M. An improvement of DBSCAN Algorithm to analyze cluster for large datasets. In 2013 IEEE International Conference in MOOC, Innovation and Technology in Education (MITE). 42–46 (2013).
Zhu, Y., Ting, M. K. & Carman, J. C. Density-ratio based clustering for discovering clusters with varying densities. Pattern Recogn. 60, 983–997 (2016).
Liu, G. Y., Yang, J., Hao, Y. & Zhang, Y. Big data-informed energy efficiency assessment of china industry sectors based on k-means clustering. J. Clean. Prod. 183, 304–314 (2018).
Laszlo, M. & Mukherjee, S. A genetic algorithm that exchanges neighboring centers for k-means clustering. Pattern Recogn. Lett. 28, 2359–2366 (2007).
Jain, A. K. Data clustering: 50 years beyond K-means. Pattern Recogn. Lett. 31, 651–666 (2010).
Yang, M. S., Lai, C. Y. & Lin, C. Y. A robust em clustering algorithm for Gaussian mixture models. Pattern Recogn. 45, 3950–3961 (2012).
Cheng, Y. Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell. 17, 790–799 (1995).
Yang, K., Gao, Y., Ma, R., Chen, L., Wu, S., & Chen, G. DBSCAN-MS: Distributed density-based clustering in metric spaces. In IEEE 35th International Conference on Data Engineering (ICDE). 1346–1357 (2019).
Rodriguez, A. & Laio, A. Clustering by fast search and find of density peaks. Science 344, 1492–1496 (2014).
Kurasova, O., Marcinkevicius, V., Medvedev, V., Rapecka, A., & Stefanovic, P. Strategies for big data clustering. In IEEE 26th International Conference on Tools with Artificial Intelligence (ICTAI). 740–747 (2014).
Wu, W. H. et al. Design, implementation and analysis of full coupled monitoring system of FPSO with soft yoke mooring system. Ocean Eng. 113, 255–263 (2016).
Herranen, H., Kuusik, A., Saar, T., Reidla, M., & Majak, J. Acceleration data acquisition and processing system for structural health monitoring. In 2014 IEEE Metrology for Aerospace (MetroAeroSpace). 244–248 (2014).
Hinneburg, A., & Gabriel, H. H. DENCLUE 2.0: Fast clustering based on kernel density estimation. In International Symposium on Intelligent Data Analysis. 70–80 (2007).
Du, Y., Wu, W. H., & Yue, Q. J. Prototype measurement for deep water floating platforms based on monitoring technology. In International Conference on Ocean, Offshore and Arctic Engineering (2013).
Xie, J., Gao, H., Xie, W., Liu, X. & Grant, P. W. Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors. Inf. Sci. Int. J. 354, 19–40 (2016).
Errecalde, M. L., Cagnina, L. C. & Rosso, P. Silhouette + attraction: A simple and effective method for text clustering. Nat. Lang. Eng. 22, 687–726 (2016).
Nishida, M. & Kawahara, T. Speaker model selection based on the Bayesian information criterion applied to unsupervised speaker indexing. IEEE Trans. Speech Audio Process. 13, 583–592 (2005).
Joshua, S., Adam, V. It’s Valley Fog Season. NASA. https://www.nasa.gov/image-feature/it-s-valley-fog-season. (2018).
Acknowledgements
This research was financially supported by the National Key R&D Program of China (2017YFC0307203), the National Natural Science Foundation of China (No. U1906233), the Key R&D Program of Shandong Province (2019JZZY010801), and the Fundamental Research Funds for the Central Universities (DUT20ZD213, DUT20LAB308). These supports are gratefully acknowledged.
Author information
Contributions
B.L. prepared all figures. W.W. and Z.H. wrote the main manuscript text. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lyu, B., Wu, W. & Hu, Z. A novel bidirectional clustering algorithm based on local density. Sci Rep 11, 14214 (2021). https://doi.org/10.1038/s41598-021-93244-2