Guiding biomedical clustering with ClustEval

Wiwie, Christian; Baumbach, Jan; Röttger, Richard

doi:10.1038/nprot.2018.038

Protocol
Published: 24 May 2018

Christian Wiwie¹, Jan Baumbach¹,²,³ & Richard Röttger¹

Nature Protocols volume 13, pages 1429–1444 (2018)

Abstract

Clustering is a popular technique for discovering groups of similar objects in large datasets. It is nowadays applied in all areas of life sciences, from biomedicine to physics. However, designing high-quality cluster analyses is a tedious and complicated task with manifold choices along the way. As a cluster analysis is often the first step of a succeeding downstream analysis, the clustering must be reliable, reproducible, and of the highest quality. To address these challenges, we recently developed ClustEval, an integrated and extensible platform for the automated and standardized design and execution of complex cluster analyses. It allows researchers to design and carry out cluster analyses involving a large number of clustering methods applied to many, large datasets. ClustEval helps to shed light on all major aspects of cluster analysis, from choosing the right similarity function to using validity indices and data preprocessing protocols. Only this high degree of automation allows the researcher to easily run a clustering task with many different tools, parameters, and settings in order to gain the best possible outcome. In this paper, we guide the user step by step through three fundamentally important and widely applicable use cases: (i) identification of the best clustering method for a new, user-given protein sequence similarity dataset; (ii) evaluation of the performance of a new, user-given clustering method (densityCut) against the state of the art; and (iii) prediction of the best method for a new protein sequence similarity dataset. This protocol guides the user through the most important features of ClustEval and takes ∼4 h to complete.
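
As an illustration of the kind of analysis the abstract describes (trying several parameter settings and scoring each result against a gold standard), the following minimal Python sketch sweeps a similarity threshold for a toy connected-components clustering and ranks the results by pairwise F1 score. It does not use ClustEval or its API; the function names, the toy similarity matrix, the gold-standard labels, and the threshold grid are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical toy data: pairwise similarities for five objects and an
# assumed gold-standard grouping (objects 0-2 in one family, 3-4 in another).
SIM = [
    [1.0, 0.9, 0.8, 0.2, 0.1],
    [0.9, 1.0, 0.7, 0.3, 0.2],
    [0.8, 0.7, 1.0, 0.4, 0.1],
    [0.2, 0.3, 0.4, 1.0, 0.6],
    [0.1, 0.2, 0.1, 0.6, 1.0],
]
GOLD = [0, 0, 0, 1, 1]


def threshold_clusters(sim, threshold):
    """Cluster by connected components of the graph whose edges join
    pairs with similarity >= threshold (simple union-find)."""
    parent = list(range(len(sim)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(sim)), 2):
        if sim[i][j] >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(sim))]


def pairwise_f1(labels, gold):
    """F1 score over object pairs: a pair counts as positive when both
    objects share a cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_pred = labels[i] == labels[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep(sim, gold, thresholds):
    """Return (best F1, threshold) over the given parameter grid."""
    return max((pairwise_f1(threshold_clusters(sim, t), gold), t)
               for t in thresholds)


best_f1, best_threshold = sweep(SIM, GOLD, [0.3, 0.5, 0.85])
print(f"best threshold: {best_threshold}, pairwise F1: {best_f1:.2f}")
```

The sketch covers a single method and a single validity index; according to the abstract, ClustEval automates this kind of loop across many clustering methods, datasets, and settings.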

**Figure 1: Clustering of biological data.**

**Figure 2: Typical structure of cluster analyses in ClustEval.**

**Figure 3: Workflows of this protocol.**

**Figure 4: Design of a ClustEval parameter optimization run.**

**Figure 5: Visualizations of clustering qualities.**

**Figure 6: Clustering visualization for the gene expression dataset.**

Acknowledgements

J.B. and C.W. received financial support from the VILLUM foundation (Young Investigator grant no. 13154) as well as the Vice Chancellor's research fund at the University of Southern Denmark (SDU2020 grant MeDA).

Author information

Authors and Affiliations

Institute of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
Christian Wiwie, Jan Baumbach & Richard Röttger
Department of Experimental Bioinformatics, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
Jan Baumbach

Contributions

C.W. implemented ClustEval, its administration interface, and the prediction pipeline. C.W. designed and wrote the protocol. J.B. and R.R. jointly directed this work. All authors contributed to the proofreading of the manuscript.

Corresponding author

Correspondence to Jan Baumbach.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Data 1

Astral SCOPe protein class k subset: BLAST all vs all. (TXT 387 kb)

Supplementary Data 2

Astral SCOPe protein class k subset: FASTA genetic domain sequences. (TXT 14 kb)

Supplementary Data 3

Astral SCOPe protein class k subset: protein family assignment. (TXT 4 kb)

Supplementary Data 4

Astral SCOPe protein class k subset: converted similarity file. (TXT 228 kb)

Supplementary Software

densityCut program: zipped Java wrapper class as a JAR file. (ZIP 4 kb)

About this article

Cite this article

Wiwie, C., Baumbach, J. & Röttger, R. Guiding biomedical clustering with ClustEval. Nat Protoc 13, 1429–1444 (2018). https://doi.org/10.1038/nprot.2018.038

Issue Date: June 2018
