Guiding biomedical clustering with ClustEval

Abstract

Clustering is a popular technique for discovering groups of similar objects in large datasets. It is nowadays applied in all areas of life sciences, from biomedicine to physics. However, designing high-quality cluster analyses is a tedious and complicated task with manifold choices along the way. As a cluster analysis is often the first step of a succeeding downstream analysis, the clustering must be reliable, reproducible, and of the highest quality. To address these challenges, we recently developed ClustEval, an integrated and extensible platform for the automated and standardized design and execution of complex cluster analyses. It allows researchers to design and carry out cluster analyses involving a large number of clustering methods applied to many, large datasets. ClustEval helps to shed light on all major aspects of cluster analysis, from choosing the right similarity function to using validity indices and data preprocessing protocols. Only this high degree of automation allows the researcher to easily run a clustering task with many different tools, parameters, and settings in order to gain the best possible outcome. In this paper, we guide the user step by step through three fundamentally important and widely applicable use cases: (i) identification of the best clustering method for a new, user-given protein sequence similarity dataset; (ii) evaluation of the performance of a new, user-given clustering method (densityCut) against the state of the art; and (iii) prediction of the best method for a new protein sequence similarity dataset. This protocol guides the user through the most important features of ClustEval and takes 4 h to complete.

Access options

Rent or Buy article

Get time limited or full article access on ReadCube.

from$8.99

All prices are NET prices.

Figure 1: Clustering of biological data.
Figure 2: Typical structure of cluster analyses in ClustEval.
Figure 3: Workflows of this protocol.
Figure 4: Design of a ClustEval parameter optimization run.
Figure 5: Visualizations of clustering qualities.
Figure 6: Clustering visualization for the gene expression dataset.

References

  1. 1

    Wittkop, T. et al. Comprehensive cluster analysis with transitivity clustering. Nat. Protoc. 6, 285–295 (2011).

    CAS  Article  PubMed  Google Scholar 

  2. 2

    R&ttger, R. et al. Density parameter estimation for finding clusters of homologous proteins--tracing actinobacterial pathogenicity lifestyles. Bioinformatics 29, 215–222 (2013).

    Article  Google Scholar 

  3. 3

    King, A.D., Przulj, N. & Jurisica, I. Protein complex prediction via cost-based clustering. Bioinformatics 20, 3013–3020 (2004).

    CAS  Article  PubMed  Google Scholar 

  4. 4

    Nepusz, T., Yu, H. & Paccanaro, A. Detecting overlapping protein complexes in protein-protein interaction networks. Nat. Methods 9, 471–472 (2012).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  5. 5

    Wirapati, P. et al. Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Res. 10, R65 (2008).

    Article  PubMed  PubMed Central  Google Scholar 

  6. 6

    R&ttger, R. Clustering of biological datasets in the era of big data. J. Integr. Bioinform. 13, 300 (2016).

    Google Scholar 

  7. 7

    Wiwie, C., Baumbach, J. & Rottger, R. Comparing the performance of biomedical clustering methods. Nat. Methods 12, 1033–1038 (2015).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  8. 8

    Aggarwal, C.C. & Reddy, C.K. Data Clustering: Algorithms and Applications (CRC Press, 2013).

  9. 9

    Andreopoulos, B., An, A., Wang, X. & Schroeder, M. A roadmap of clustering algorithms: finding a match for a biomedical application. Brief Bioinform. 10, 297–314 (2009).

    CAS  Article  PubMed  Google Scholar 

  10. 10

    Powers, D.M.W. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. J. Mach. Learn. Tech. 2, 37–63 (2011).

    Google Scholar 

  11. 11

    Wiwie, C. & Röttger, R. in Biocomputing 39–50 (World Scientific, 2016).

  12. 12

    Fox, N.K., Brenner, S.E. & Chandonia, J.M. SCOPe: structural classification of proteins--extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 42, D304–D309 (2014).

    CAS  Article  PubMed  Google Scholar 

  13. 13

    Chandonia, J.M., Fox, N.K. & Brenner, S.E. SCOPe: manual curation and artifact removal in the structural classification of proteins - extended database. J. Mol. Biol. 429, 348–355 (2017).

    CAS  Article  PubMed  Google Scholar 

  14. 14

    Ding, J., Shah, S. & Condon, A. densityCut: an efficient and versatile topological approach for automatic clustering of biological data. Bioinformatics 32, 2567–2576 (2016).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  15. 15

    Monti, S., Tamayo, P., Mesirov, J. & Golub, T. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn. 52, 91–118 (2003).

    Article  Google Scholar 

  16. 16

    Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

    CAS  Article  Google Scholar 

  17. 17

    Davies, D.L. & Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1, 224–227 (1979).

    CAS  Article  PubMed  Google Scholar 

  18. 18

    Dunn, J.C. Well-separated clusters and optimal fuzzy partitions. J. Cybernetics 4, 95–104 (1974).

    Article  Google Scholar 

  19. 19

    Rousseeuw, P.J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).

    Article  Google Scholar 

  20. 20

    Powers, D.M.W. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. J. Mach. Learn. Tech. 2, 37–63 (2007).

    Google Scholar 

  21. 21

    Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Royal Stat. Soc. Ser. B (Methodological) 289–300 (1995).

  22. 22

    Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning. Springer, (2009).

    Google Scholar 

  23. 23

    Fowlkes, E.B. & Mallows, C.L. A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc. 78, 553–569 (1983).

    Article  Google Scholar 

  24. 24

    Jaccard, P. Etude Comparative de la Distribution Florale dans Une Portion des Alpes et du Jura (Impr. Corbaz, 1901).

  25. 25

    Rand, W.M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850 (1971).

    Article  Google Scholar 

  26. 26

    Rosenberg, A. & Hirschberg, J. V-Measure: a conditional entropy-based external cluster evaluation measure. in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) 410–420 (2007).

  27. 27

    Frey, B.J. & Dueck, D. Clustering by passing messages between data points. Science 315, 972–976 (2007).

    CAS  Article  PubMed  Google Scholar 

  28. 28

    Kaufman, L. & Rousseeuw, P.J., in Finding Groups in Data 199–252 (Wiley, 2008).

  29. 29

    Bezdek, J.C. in Pattern Recognition with Fuzzy Objective Function Algorithms 43–93 (Springer, 1981).

  30. 30

    Kaufman, L. & Rousseeuw, P.J. in Finding Groups in Data 126–163 (Wiley, 2008).

  31. 31

    Rodriguez, A. & Laio, A. Clustering by fast search and find of density peaks. Science 344, 1492–1496 (2014).

    CAS  Article  Google Scholar 

  32. 32

    Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. in KDD′96 Proceedings of the Second International Conference on Knowledge Discovery and Data Mining 96, 226–231 (1996).

    Google Scholar 

  33. 33

    Kaufman, L. & Rousseeuw, P.J. in Finding Groups in Data 253–279 (Wiley, 2008).

  34. 34

    Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M. & Hornik, K. Vol. 1.2 R Package Version 2.0.1. (R Foundation for Statistical Computing, 2015).

    Google Scholar 

  35. 35

    R Core Team R. A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2012).

  36. 36

    Van Dongen, S. Graph Clustering by Flow Simulation. Doctoral dissertation, University of Utrecht (2000).

  37. 37

    Bader, G.D. & Hogue, C.W.V. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinform. 4, 2 (2003).

    Article  Google Scholar 

  38. 38

    Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  39. 39

    Kaufman, L. & Rousseeuw, P.J. in Finding Groups in Data 68–125 (Wiley, 2008).

  40. 40

    Kohonen, T. Self-organized formation of topologically correct feature maps. Biol. Cybernetics 43, 59–69 (1982).

    Article  Google Scholar 

  41. 41

    Karatzoglou, A., Smola, A., Hornik, K. & Zeileis, A. — An S4 package for Kernel methods in R. J. Stat. Softw. 11, 1–20 (2004).

    Article  Google Scholar 

  42. 42

    Wittkop, T. et al. Partitioning biological data with transitivity clustering. Nat. Methods 7, 419–420 (2010).

    CAS  Article  PubMed  Google Scholar 

  43. 43

    Wittkop, T., Baumbach, J., Lobo, F.P. & Rahmann, S. Large scale clustering of protein sequences with FORCE-A layout based heuristic for weighted cluster editing. BMC Bioinform. 8, 396 (2007).

    Article  Google Scholar 

Download references

Acknowledgements

J.B. and C.W. received financial support from the VILLUM foundation (Young Investigator grant no. 13154) as well as the Vice Chancellor's research fund at the University of Southern Denmark (SDU2020 grant MeDA).

Author information

Affiliations

Authors

Contributions

C.W. implemented ClustEval, its administration interface, and the prediction pipeline. C.W. designed and wrote the protocol. J.B. and R.R. jointly directed this work. All authors contributed to the proofreading of the manuscript.

Corresponding author

Correspondence to Jan Baumbach.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Data 1

Astral SCOPe protein class k subset: BLAST all vs all. (TXT 387 kb)

Supplementary Data 2

Astral SCOPe protein class k subset: FASTA genetic domain sequences. (TXT 14 kb)

Supplementary Data 3

Astral SCOPe protein class k subset: protein family assignment. (TXT 4 kb)

Supplementary Data 4

Astral SCOPe protein class k subset: converted similarity file. (TXT 228 kb)

Supplementary Software

densityCut program: zipped Java wrapper class as a JAR file. (ZIP 4 kb)

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Wiwie, C., Baumbach, J. & Röttger, R. Guiding biomedical clustering with ClustEval. Nat Protoc 13, 1429–1444 (2018). https://doi.org/10.1038/nprot.2018.038

Download citation

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Search

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing