Clustering is a popular technique for discovering groups of similar objects in large datasets. It is now applied in all areas of life sciences, from biomedicine to physics. However, designing a high-quality cluster analysis is a tedious and complicated task that involves many choices along the way. Because a cluster analysis is often the first step of a downstream analysis, the clustering must be reliable, reproducible, and of the highest quality. To address these challenges, we recently developed ClustEval, an integrated and extensible platform for the automated and standardized design and execution of complex cluster analyses. It allows researchers to design and carry out cluster analyses in which a large number of clustering methods are applied to many large datasets. ClustEval helps to shed light on all major aspects of cluster analysis, from choosing the right similarity function to using validity indices and data preprocessing protocols. Only this high degree of automation allows the researcher to easily run a clustering task with many different tools, parameters, and settings in order to obtain the best possible outcome. In this paper, we guide the user step by step through three fundamentally important and widely applicable use cases: (i) identification of the best clustering method for a new, user-given protein sequence similarity dataset; (ii) evaluation of the performance of a new, user-given clustering method (densityCut) against the state of the art; and (iii) prediction of the best method for a new protein sequence similarity dataset. This protocol guides the user through the most important features of ClustEval and takes ∼4 h to complete.
J.B. and C.W. received financial support from the VILLUM foundation (Young Investigator grant no. 13154) as well as the Vice Chancellor's research fund at the University of Southern Denmark (SDU2020 grant MeDA).
The authors declare no competing financial interests.
Astral SCOPe protein class k subset: BLAST all vs all. (TXT 387 kb)
Astral SCOPe protein class k subset: FASTA genetic domain sequences. (TXT 14 kb)
Astral SCOPe protein class k subset: protein family assignment. (TXT 4 kb)
Astral SCOPe protein class k subset: converted similarity file. (TXT 228 kb)
densityCut program: zipped Java wrapper class as a JAR file. (ZIP 4 kb)
Cite this article
Wiwie, C., Baumbach, J. & Röttger, R. Guiding biomedical clustering with ClustEval. Nat Protoc 13, 1429–1444 (2018). https://doi.org/10.1038/nprot.2018.038