Protocol | Published:

Guiding biomedical clustering with ClustEval

Nature Protocols volume 13, pages 1429–1444 (2018)

Abstract

Clustering is a popular technique for discovering groups of similar objects in large datasets. It is nowadays applied in all areas of life sciences, from biomedicine to physics. However, designing high-quality cluster analyses is a tedious and complicated task with manifold choices along the way. As a cluster analysis is often the first step of a succeeding downstream analysis, the clustering must be reliable, reproducible, and of the highest quality. To address these challenges, we recently developed ClustEval, an integrated and extensible platform for the automated and standardized design and execution of complex cluster analyses. It allows researchers to design and carry out cluster analyses involving a large number of clustering methods applied to many, large datasets. ClustEval helps to shed light on all major aspects of cluster analysis, from choosing the right similarity function to using validity indices and data preprocessing protocols. Only this high degree of automation allows the researcher to easily run a clustering task with many different tools, parameters, and settings in order to gain the best possible outcome. In this paper, we guide the user step by step through three fundamentally important and widely applicable use cases: (i) identification of the best clustering method for a new, user-given protein sequence similarity dataset; (ii) evaluation of the performance of a new, user-given clustering method (densityCut) against the state of the art; and (iii) prediction of the best method for a new protein sequence similarity dataset. This protocol guides the user through the most important features of ClustEval and takes 4 h to complete.
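
The idea of running a clustering task over many parameter settings and keeping the best-scoring result can be illustrated with a minimal sketch. It is not ClustEval's interface: the function names (`threshold_clusters`, `silhouette`), the toy one-dimensional data, and the candidate thresholds are all assumptions made for illustration. The sketch clusters objects by linking every pair whose distance falls below a threshold, scores each resulting clustering with the silhouette value (one example of a validity index), and keeps the best-scoring threshold.

```python
from itertools import combinations


def threshold_clusters(dist, threshold):
    """Link all pairs with distance <= threshold; return one component label per object."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if dist[i][j] <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def silhouette(dist, labels):
    """Mean silhouette value; None if there are fewer than 2 clusters or only singletons."""
    n = len(dist)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2 or len(clusters) == n:
        return None
    total = 0.0
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # by convention a singleton contributes 0
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


# Hypothetical toy data: six objects on a line, forming two visible groups.
points = [0.0, 1.0, 2.5, 10.0, 11.0, 12.5]
dist = [[abs(p - q) for q in points] for p in points]

# Parameter sweep: evaluate every candidate threshold, keep the best-scoring one.
best_threshold, best_labels, best_score = None, None, float("-inf")
for threshold in (1.2, 2.0, 9.0):
    labels = threshold_clusters(dist, threshold)
    score = silhouette(dist, labels)
    if score is not None and score > best_score:
        best_threshold, best_labels, best_score = threshold, labels, score

print(best_threshold, best_labels)
```

On this toy data the sweep selects the threshold that separates the two groups. A real analysis would repeat the same loop over several clustering methods, validity indices, and datasets, which is the part the platform automates.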

Acknowledgements

J.B. and C.W. received financial support from the VILLUM foundation (Young Investigator grant no. 13154) as well as the Vice Chancellor's research fund at the University of Southern Denmark (SDU2020 grant MeDA).

Author information

Affiliations

  1. Institute of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark.

    • Christian Wiwie
    • Jan Baumbach
    • Richard Röttger
  2. Department of Experimental Bioinformatics, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany.

    • Jan Baumbach

Contributions

C.W. implemented ClustEval, its administration interface, and the prediction pipeline. C.W. designed and wrote the protocol. J.B. and R.R. jointly directed this work. All authors contributed to the proofreading of the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Jan Baumbach.

Supplementary information

Text files

  1. Supplementary Data 1

    Astral SCOPe protein class k subset: BLAST all vs all.

  2. Supplementary Data 2

    Astral SCOPe protein class k subset: FASTA genetic domain sequences.

  3. Supplementary Data 3

    Astral SCOPe protein class k subset: protein family assignment.

  4. Supplementary Data 4

    Astral SCOPe protein class k subset: converted similarity file.

Zip files

  1. Supplementary Software

    densityCut program: zipped Java wrapper class as a JAR file.

About this article

DOI

https://doi.org/10.1038/nprot.2018.038
