Introduction

Clustering is the task of grouping objects into sets. A good clustering method generates clusters with high intra-class similarity and low inter-class similarity1. Several classic and representative clustering methods are widely used in biological data analysis, including k-means clustering2,3, Partitioning Around Medoids (PAM)4, hierarchical clustering (Hcluster)5, Clustering Large Applications (CLARA)4, Agglomerative Nesting (AGNES)4,6,7, Divisive Analysis Clustering (DIANA)4, Clusterdp8,9 and DBSCAN10.

K-means clustering is a popular method of vector quantization in data mining. The term “k-means” was first used by MacQueen2 in 1967, and the standard algorithm was first proposed by Lloyd3 in 1957. K-means clustering partitions n observations into k clusters such that each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
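As a minimal illustration, here is a sketch of Lloyd-style k-means in Python (not the implementation benchmarked below; the data matrix X and parameters are placeholders):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd-style k-means: assign each point to the nearest
    mean, then recompute each cluster mean, until assignments stabilize."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # squared Euclidean distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```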

Partitioning Around Medoids (PAM) is a clustering algorithm related to k-means and to the medoid shift algorithm4. Both k-means and PAM are partitional (they break the dataset up into groups), and both attempt to minimize the distance between the points assigned to a cluster and a point designated as that cluster's center. In contrast to k-means, PAM chooses actual data points as centers (medoids) and works with a generalization of the Manhattan norm to define distances between data points. PAM was proposed in 1987 and is a classical partitioning technique that clusters a dataset of n objects into k clusters.
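For contrast with k-means, here is a simplified k-medoids iteration in the spirit of PAM (a sketch only: it alternates assignment and medoid updates rather than performing PAM's full BUILD/SWAP search, and assumes a precomputed distance matrix D):

```python
import numpy as np

def k_medoids(D, k, max_iter=100, seed=0):
    """Simplified k-medoids on a precomputed (n, n) distance matrix D:
    assign points to the nearest medoid, then move each medoid to the
    cluster member that minimizes total within-cluster distance."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        changed = False
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            within = D[np.ix_(members, members)].sum(axis=0)
            best = members[np.argmin(within)]
            if best != medoids[j]:
                medoids[j], changed = best, True
        if not changed:
            break
    return np.argmin(D[:, medoids], axis=1), medoids
```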

Hierarchical clustering (Hcluster)5 is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories: agglomerative and divisive1. In general, the merges and splits are decided in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
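A minimal usage sketch with SciPy's agglomerative implementation (toy data; average linkage is an arbitrary choice here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(30, 4))     # toy data
Z = linkage(X, method="average", metric="euclidean")  # agglomerative merge tree
labels = fcluster(Z, t=3, criterion="maxclust")       # cut the dendrogram into 3 clusters
```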

Clustering Large Applications (CLARA)4 is characterized by taking a small portion of the data as a sample rather than considering the entire dataset. It draws multiple samples from the dataset, applies PAM to each sample, and returns the best clustering as output; CLARA can therefore handle larger datasets than PAM. The Agglomerative Nesting (AGNES)4,6,7 algorithm is a hierarchical clustering method. AGNES initially treats each object as its own cluster, and clusters are then merged step by step according to a criterion such as single linkage, in which the similarity of two clusters is measured by the similarity of the closest pair of data points in the two clusters. Merging is repeated until the desired number of clusters is reached. The Divisive Analysis (DIANA)4 algorithm is a typical divisive clustering method: it first places all objects in one cluster and then subdivides them into smaller clusters until the desired number of clusters is obtained. Density-based methods include Clusterdp8,9 and DBSCAN10. Clusterdp8,9 is a recently developed method based on the idea that cluster centers are characterized by a higher local density than their neighbors and by a comparatively large distance from points of higher density.
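To illustrate the Clusterdp idea, here is a sketch computing its two per-point quantities (the cutoff-kernel density is one common choice; the cutoff distance dc is user-chosen):

```python
import numpy as np

def density_peaks_scores(D, dc):
    """For each point, compute Clusterdp's local density rho (number of
    neighbors within dc) and delta (distance to the nearest point of
    higher density). Candidate centers have both large rho and large delta."""
    n = D.shape[0]
    rho = (D < dc).sum(axis=1) - 1           # cutoff kernel; exclude self
    order = np.argsort(-rho)                 # points by decreasing density
    delta = np.full(n, D.max())              # densest point keeps the max distance
    for rank in range(1, n):
        i = order[rank]
        delta[i] = D[i, order[:rank]].min()  # nearest higher-density point
    return rho, delta
```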

Each clustering method has its own strengths and drawbacks: a method that works well on one dataset may give poor results on another. K-means is compromised when features are highly correlated and is extremely sensitive to outliers, because its distance measure is easily influenced by extreme values; the underlying optimization problem is also computationally difficult (NP-hard)11,12,13,14,15. The most time-consuming part of PAM is the calculation of distances between objects. CLARA relies on sampling to handle large datasets4, so the quality of its clustering results depends greatly on the sample size. AGNES cannot undo earlier merge steps, no objective function is directly minimized, and it is sometimes difficult to identify the correct number of clusters from the dendrogram. DIANA splits off the object with the maximum average dissimilarity and then moves into this new cluster all objects that are more similar to it than to the remainder.

We consider the objective of clustering to be minimizing the “residuals” within clusters. These residuals can be measured with norms applied to the singular values of the residual matrix, ranging from the L2 to the L0 norm16: the L2 error is the square (Frobenius) error, the L1 error is the nuclear norm, and the L0 error is the rank of the residual matrix. Minimizing the nuclear norm not only reduces the quantitative error (variance) but also reduces the qualitative error (rank), encouraging the residuals to be embedded in a low-dimensional space. To achieve this goal, we developed the Nuclear Norm Clustering (NNC) method (available at https://sourceforge.net/projects/nnc/), an accurate and robust algorithm designed to improve clustering accuracy. In this paper, we compared the performance of NNC with that of seven other methods using 15 publicly available datasets. We then tested the performance of NNC on two psoriasis genome-wide association study (GWAS) datasets17,18,19,20.
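A toy computation of these three error measures from the singular values of a residual matrix (random data, for illustration only):

```python
import numpy as np

R = np.random.default_rng(0).normal(size=(8, 5))   # a toy residual matrix
s = np.linalg.svd(R, compute_uv=False)             # singular values of R

frobenius = np.sqrt((s ** 2).sum())  # L2 on singular values: Frobenius (square error)
nuclear = s.sum()                    # L1 on singular values: nuclear norm
rank = int((s > 1e-10).sum())        # L0 on singular values: rank of R
```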

Methods

To apply our method to a specific dataset, users need to provide a data matrix M and the desired number of clusters K. The objective function to minimize is the nuclear norm of the pooled within-class residual matrix. The nuclear norm of a matrix is defined as the sum of its singular values.

Suppose we had a candidate class label vector A, where A[i] is an integer indicating that the ith sample belongs to the A[i]th cluster. We first calculated the mean/center of each class. Then, for each sample/row, we subtracted its corresponding class mean, forming a pooled residual matrix. Finally, we performed singular value decomposition (SVD)21 on this residual matrix and summed the singular values to obtain the nuclear norm. We denote this procedure NN(A).
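A minimal NumPy sketch of the NN(A) procedure just described (for illustration; not the released NNC implementation):

```python
import numpy as np

def nn(M, A):
    """Nuclear norm of the pooled within-class residual matrix.
    M: (n, p) data matrix; A: length-n integer vector of class labels."""
    R = M.astype(float).copy()
    for c in np.unique(A):
        members = (A == c)
        R[members] -= M[members].mean(axis=0)        # subtract the class mean
    return np.linalg.svd(R, compute_uv=False).sum()  # sum of singular values
```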

We used simulated annealing22 to choose an optimal A that minimizes NN(A). We first initialized A with random labels. We then repeatedly changed one randomly chosen sample's label to obtain a candidate A′ and tested whether it improved the nuclear norm: if Uniform(0, 1) < exp((NN(A) − NN(A′))/T), we accepted A = A′, where T is the annealing temperature. The algorithm is shown in Table 1.

Table 1 The pseudocode of Nuclear Norm Clustering.
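To complement the pseudocode in Table 1, here is a minimal sketch of the annealing loop around NN(A) (the 1/t cooling schedule and starting temperature T0 are our assumptions; the paper does not specify them):

```python
import numpy as np

def nnc(M, k, iters=20000, T0=1.0, seed=0):
    """Simulated-annealing search for a label vector A minimizing nn(M, A)."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    A = rng.integers(0, k, size=n)           # random initial guess for A
    cur = nn(M, A)
    best, best_A = cur, A.copy()
    for t in range(1, iters + 1):
        T = T0 / t                           # assumed cooling schedule
        i = rng.integers(n)
        old = A[i]
        A[i] = rng.integers(k)               # propose changing one sample's label
        new = nn(M, A)
        # accept if Uniform(0,1) < exp((NN - NN')/T); improvements always pass
        if new <= cur or rng.uniform() < np.exp((cur - new) / T):
            cur = new
            if new < best:
                best, best_A = new, A.copy()
        else:
            A[i] = old                       # reject the proposal: revert
    return best_A, best
```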

Benchmarking

We benchmarked eight methods: k-means clustering, Partitioning Around Medoids (PAM), hierarchical clustering (Hcluster, using the Euclidean metric to calculate dissimilarities), Clustering Large Applications (CLARA), Agglomerative Nesting (AGNES), Divisive Analysis Clustering (DIANA), Clusterdp (chosen as the representative of density-based methods) and Nuclear Norm Clustering (NNC). We used the NNC software available at https://sourceforge.net/projects/nnc/ and ran the other seven methods through the R packages factoextra23 and densityClust24. To evaluate the performance of the benchmarked clustering methods, we used the macro-averaged F-score25,26. Benchmarking was performed on a desktop PC equipped with an Intel Core i7-4790 CPU and 32 GB of memory. The parameters tested are listed in Supplemental Materials 1, 2 and 4.
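Because cluster ids are arbitrary, computing the macro-averaged F-score of a clustering requires matching clusters to classes first. One possible sketch (exhaustive permutation matching, which is our choice and is feasible only for small K; scikit-learn's f1_score does the per-class averaging):

```python
from itertools import permutations

import numpy as np
from sklearn.metrics import f1_score

def macro_f(y_true, clusters, k):
    """Macro-averaged F-score of a clustering, maximized over all
    mappings of cluster ids to class ids (labels assumed in 0..k-1)."""
    best = 0.0
    for perm in permutations(range(k)):
        mapped = np.array([perm[c] for c in clusters])
        best = max(best, f1_score(y_true, mapped, average="macro"))
    return best
```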

Benchmarking Public Datasets Study

Overall, 15 public datasets were included: spambase27, Indian liver patient28, blood transfusion service center29, Pima Indians diabetes30, Parkinsons31, QSAR biodegradation32, ionosphere27, pathbased33, mammographic mass34, breast cancer Wisconsin diagnostic35, seeds36, wine27, jain37, flame38 and iris27.

Applications on GWAS Dataset Study

We applied each of the aforementioned methods to two psoriasis genome-wide association study (GWAS) genetic datasets17,18,19,20. We obtained the dataset, a part of the Collaborative Association Study of Psoriasis (CASP), from the Genetic Association Information Network (GAIN) database, a partnership of the Foundation for the National Institutes of Health. The data are available at http://dbgap.ncbi.nlm.nih.gov through dbGaP accession number phs000019.v1.p1. All genotypes were filtered by checking for data quality18. We included 1590 subjects (915 cases, 675 controls) in the general research use (GRU) group and 1133 subjects (431 cases, 702 controls) in the autoimmune disease only (ADO) group. A dermatologist diagnosed all psoriasis cases. Each participant's DNA was genotyped with the Perlegen 500K array. All cases and controls signed consent forms, and controls (≥18 years old) had no confounding factors relative to a known diagnosis of psoriasis.

In our previous work18, we found that when the number of SNPs used as predictors was set to 50, the independent ADO (testing) dataset reached the maximum AUC39 (AUC = 0.7063) under a logistic regression prediction model. We therefore ranked SNPs by their allelic association p-values (computed on the GRU-group psoriasis GWAS dataset) to select the top 50 associated SNPs (in steps of 5: 5, 10, …, 50; see Supplementary Materials 4), and then compared the performance of the different clustering methods on the two psoriasis GWAS datasets (GRU and ADO groups); a sketch of the ranking step follows.
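Purely for illustration, a hypothetical sketch of such a ranking step (a per-SNP 2×2 allele-count chi-square test; the data layout, helper name and SciPy-based test are our assumptions, not the original analysis pipeline):

```python
import numpy as np
from scipy.stats import chi2_contingency

def rank_snps(genotypes, status):
    """Rank SNPs by allelic-association p-value. genotypes: (n, n_snps)
    minor-allele counts in {0, 1, 2}; status: 0/1 case-control labels."""
    pvals = []
    for j in range(genotypes.shape[1]):
        g = genotypes[:, j]
        # 2x2 table of minor vs major allele counts in cases vs controls
        table = np.array([
            [g[status == 1].sum(), (2 - g[status == 1]).sum()],
            [g[status == 0].sum(), (2 - g[status == 0]).sum()],
        ])
        pvals.append(chi2_contingency(table, correction=False)[1])
    return np.argsort(pvals)  # SNP indices, most significant first

# nested top-k subsets in steps of 5, as in the paper: k = 5, 10, ..., 50
# subsets = {k: rank_snps(G, y)[:k] for k in range(5, 55, 5)}
```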

Results

Results from public datasets

Table 2 summarizes the macro-averaged F-score of all methods on the 15 public datasets. NNC, Clusterdp and Hcluster each performed best on 4 datasets. PAM performed best on 2 datasets, followed by DIANA, which performed best on only one dataset. Furthermore, we observed that the datasets on which NNC performed better were linearly separable (especially the iris, seeds and wine datasets).

Table 2 Macro-averaged F-score of all methods on 15 datasets.

NNC also performed significantly better (Wilcoxon rank-sum test p-value < 0.05; Supplemental Materials 3) than k-means, PAM, CLARA, AGNES and DIANA in F-score on the 15 benchmarked datasets. NNC is thus a competitive method for clustering tasks.

Results from psoriasis dataset study

We benchmarked seven methods in the psoriasis dataset study: k-means, PAM, Hcluster, CLARA, AGNES, DIANA and NNC (Clusterdp was excluded because the psoriasis data were so large that tuning its parameters took too long).

Table 3 presents the mean and standard deviation of each method's performance on the 2 psoriasis GWAS datasets. The macro-averaged F-scores of the selected top 50 associated SNPs (in steps of 5) are shown in Supplemental Materials 4. In Table 3, NNC had the second largest mean F-score (mean = 0.5735) on the GRU-group psoriasis dataset and the largest mean F-score (mean = 0.6725) on the ADO-group psoriasis dataset; the mean differences between NNC and the next best performing method were 0.0860 and 0.0135, respectively. Additionally, on the GRU-group dataset, NNC clearly improved the F-score relative to the third best performing method (clustering accuracy improved by 18%), while its clustering accuracy was 5% lower than that of the best performing method. On the ADO-group dataset, the clustering accuracy of NNC was 2% higher than that of the second best performing method. The macro-averaged F-score curves of the seven methods on the two psoriasis datasets are shown in Figs 1 and 2, respectively. Interestingly, in Fig. 1 the F-scores of NNC and Hcluster over the top 50 SNPs were superior to those of the other methods, and in Fig. 2 the F-score of NNC was optimal. In conclusion, NNC performed well on both psoriasis datasets and appears to be superior to its competitor methods: k-means, PAM, Hcluster, CLARA, AGNES and DIANA. It is worth mentioning that NNC appeared to be more robust and less sensitive to potential outliers. Although NNC did not achieve the best F-score on every dataset, it was a top performer on both the public and the psoriasis datasets.

Table 3 Mean and SD of F-score on 2 psoriasis datasets.
Figure 1

The macro-averaged F-score of selected top 50 associated SNPs on the Psoriasis GWAS dataset of GRU group.

Figure 2

The macro-averaged F-score of selected top 50 associated SNPs on the Psoriasis GWAS dataset of ADO group.

Discussion

Clustering has been applied to identify groups among observations4. For example, in cancer research, clustering is used to classify patients into subgroups according to their gene expression profiles. This can be useful for identifying the molecular profiles of patients with good or poor prognosis, as well as for understanding the disease.

NNC outperforms k-means clustering by addressing its limitations: k-means attempts to minimize the total squared error, which is sensitive to outliers, and it did not perform well on datasets with strongly correlated features (such as Indian liver patient and Parkinsons; Table 2). To overcome these limitations, we employed the nuclear norm as the measure of clustering fitness. First, the nuclear norm40 is an L1 measure of error (on the singular values) and is therefore more robust than the squared error. Second, in the presence of correlated variables, the nuclear norm internally orthogonalizes the variables and penalizes/down-weights the correlated ones.

NNC, along with Clusterdp and Hcluster, achieved the best performance on the largest number of public datasets (Table 2). Notably, these three methods performed best on different datasets, so they can serve as complementary choices for different real datasets. As noted above, the datasets on which NNC performed better tended to be linearly separable (especially the iris, seeds and wine datasets).

NNC has two parameters: the desired number of clusters K and the number of iterations. The greater the number of iterations, the more precise the convergence, but an excessively large number of iterations hurts computational efficiency. For the psoriasis GWAS datasets, the parameters were chosen as K = 2 and 200,000 iterations. Generally, NNC performs well enough with 20,000 iterations (it is robust to this parameter). The computational complexity of NNC is O(sample number × feature number × min(sample number, feature number) × iterations). With 10k to 100k objects in a dataset it becomes rather slow, but NNC is fast enough (Table 4) to handle medium-sized datasets (below 10k objects) in practice, as the rough estimate below illustrates.
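A back-of-the-envelope illustration of this scaling with example sizes (constants ignored; purely illustrative):

```python
# Stated scaling: O(samples * features * min(samples, features) * iterations).
n, p, iters = 1590, 50, 200_000   # e.g., the GRU group clustered on 50 SNPs
per_iter = n * p * min(n, p)      # ~4.0e6 "operations" per SVD evaluation
total = per_iter * iters          # ~8.0e11 overall: manageable below 10k samples,
                                  # but prohibitive once n reaches 10k-100k
```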

Table 4 The detail running time comparison of all benchmarked methods.

In conclusion, we presented the Nuclear Norm Clustering (NNC) method, and our work demonstrates that NNC is a promising alternative for clustering medium-sized datasets.