Abstract
Clustering is a popular technique for discovering groups of similar objects in large datasets. It is now applied in all areas of life sciences, from biomedicine to physics. However, designing high-quality cluster analyses is a tedious and complicated task with manifold choices along the way. As a cluster analysis is often the first step of a downstream analysis, the clustering must be reliable, reproducible, and of the highest quality. To address these challenges, we recently developed ClustEval, an integrated and extensible platform for the automated and standardized design and execution of complex cluster analyses. It allows researchers to design and carry out cluster analyses involving a large number of clustering methods applied to many large datasets. ClustEval helps to shed light on all major aspects of cluster analysis, from choosing the right similarity function to using validity indices and data preprocessing protocols. Only this high degree of automation allows the researcher to easily run a clustering task with many different tools, parameters, and settings in order to obtain the best possible outcome. In this paper, we guide the user step by step through three fundamentally important and widely applicable use cases: (i) identification of the best clustering method for a new, user-given protein sequence similarity dataset; (ii) evaluation of the performance of a new, user-given clustering method (densityCut) against the state of the art; and (iii) prediction of the best method for a new protein sequence similarity dataset. This protocol guides the user through the most important features of ClustEval and takes ∼4 h to complete.
Relevant articles
Open Access articles citing this article.
- Distance-based clustering challenges for unbiased benchmarking studies. Scientific Reports, Open Access, 23 September 2021
- Causal Network Inference for Neural Ensemble Activity. Neuroinformatics, Open Access, 04 January 2021
Acknowledgements
J.B. and C.W. received financial support from the VILLUM foundation (Young Investigator grant no. 13154) as well as the Vice Chancellor's research fund at the University of Southern Denmark (SDU2020 grant MeDA).
Author information
Contributions
C.W. implemented ClustEval, its administration interface, and the prediction pipeline. C.W. designed and wrote the protocol. J.B. and R.R. jointly directed this work. All authors contributed to the proofreading of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Data 1
Astral SCOPe protein class k subset: BLAST all vs all. (TXT 387 kb)
Supplementary Data 2
Astral SCOPe protein class k subset: FASTA genetic domain sequences. (TXT 14 kb)
Supplementary Data 3
Astral SCOPe protein class k subset: protein family assignment. (TXT 4 kb)
Supplementary Data 4
Astral SCOPe protein class k subset: converted similarity file. (TXT 228 kb)
Supplementary Software
densityCut program: zipped Java wrapper class as a JAR file. (ZIP 4 kb)
About this article
Cite this article
Wiwie, C., Baumbach, J. & Röttger, R. Guiding biomedical clustering with ClustEval. Nat Protoc 13, 1429–1444 (2018). https://doi.org/10.1038/nprot.2018.038
DOI: https://doi.org/10.1038/nprot.2018.038