Topological methods for data modelling

Abstract

The analysis of large and complex data sets is one of the most important problems facing the scientific community, and physics in particular. One response to this challenge has been the development of topological data analysis (TDA), which models data by graphs or networks rather than by linear algebraic (matrix) methods or cluster analysis. TDA represents the shape of the data (suitably defined) in a combinatorial fashion. Methods for measuring shape have been developed within mathematics, providing a toolkit referred to as homology. In working with data, one can use this kind of modelling to obtain an understanding of the overall structure of the data set. There is a suite of methods for constructing vector representations of various kinds of unstructured data. In this Review, we sketch the basics of TDA and provide examples where this kind of analysis has been carried out.

Key points

  • The analysis of large and complex data sets is crucial to all areas of science and industry, and is needed to support artificial intelligence. Existing methods for data analysis are often inadequate to deal with data that exhibit a great deal of complexity, because they are unable to express complicated ‘data shapes’.

  • Topology (the mathematical study of shape) has been extended to topological data analysis to give systematic graph representations of data sets, which are informative in many different ways. Graphs can be thought of as encoding shape.

  • Graph representations of data permit systematic unsupervised analysis of data, with a variety of methods for the interrogation of the data. They constitute a compression of the data that nevertheless preserves salient features.

  • Because of the flexibility of graph representations, methods for measuring the corresponding shape are required. Homology is a family of such methods. It is useful both for overall understanding of data sets and for generation of numerical features for many kinds of unstructured data.

  • Topological data analysis has been applied in many different complex data situations.

Fig. 1: Statistical circle.
Fig. 2: Geometric realization.
Fig. 3: The Vietoris–Rips complex.
Fig. 4: The nerve construction of a covering.
Fig. 5: Persistence barcodes for dimensions 0 and 1.
Fig. 6: Functional persistence barcodes.
Fig. 7: Applications of topological data analysis.

Acknowledgements

This article has benefited greatly from discussions with J. Carlsson, P. Lum, S. Locklin and B. Mann.

Author information

Corresponding author

Correspondence to Gunnar Carlsson.

Ethics declarations

Competing interests

The author declares no competing interests.

Additional information

Peer review information

Nature Reviews Physics thanks Vanessa Robins and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Glossary

Features

In any data set, the features are the various numerical quantities attached to data points. In a data matrix, they are the columns of the matrix, and the rows are the data points.

Clustering decomposition

Any method that decomposes a data set into disjoint groups, called clusters.

Space

A set equipped with a notion of nearness. For any positive integer n, subsets of \({{\mathbb{R}}}^{n}\) are examples, and so are metric spaces.

Connected components

The decomposition of a space into disjoint pieces that are separated from each other, and which cannot be so decomposed further.

Metric spaces

An abstraction of the notion of distance in the plane. A metric space consists of a set X and a non-negative valued distance function d on pairs of points in X, satisfying certain conditions, such as symmetry and the triangle inequality d(x, z) ≤ d(x, y) + d(y, z).
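
As a minimal sketch of what these conditions mean in practice, the following Python snippet checks them by brute force on a small finite point set. The function name `is_metric`, the tolerance `tol` and the sample points are illustrative assumptions and do not come from the article; the condition that d(x, y) = 0 exactly when x = y is one of the standard axioms left implicit in the definition above.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def is_metric(points, d, tol=1e-12):
    """Brute-force check of the metric axioms for d on a finite point set."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                          # non-negativity
            return False
        if (d(x, y) <= tol) != (x == y):         # d(x, y) = 0 exactly when x = y
            return False
        if abs(d(x, y) - d(y, x)) > tol:         # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:    # triangle inequality
            return False
    return True


# Hypothetical sample points in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]

euclidean_ok = is_metric(points, dist)
# Squared Euclidean distance is symmetric but breaks the triangle inequality here.
squared_ok = is_metric(points, lambda p, q: dist(p, q) ** 2)
```

On these points the Euclidean distance passes all checks, whereas the squared Euclidean distance fails the triangle inequality (for example, going from (0, 0) to (2, 2) via (1, 0)).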

Covering

A covering of a set X is a collection of subsets of X whose union is all of X. The sets need not be disjoint.
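
The definition can be illustrated with a short Python sketch on finite sets. The helper `is_covering` and the example sets are hypothetical and only meant to make the union condition concrete.

```python
def is_covering(X, subsets):
    """Return True if the union of the given subsets is all of X."""
    union = set().union(*subsets)
    return union == set(X)


X = {1, 2, 3, 4, 5}
cover = [{1, 2, 3}, {3, 4}, {4, 5}]   # overlapping sets are allowed
not_cover = [{1, 2}, {4, 5}]          # the point 3 is missed

cover_ok = is_covering(X, cover)
not_cover_ok = is_covering(X, not_cover)
```

Note that the sets in `cover` overlap (3 and 4 each lie in two sets), which is permitted for a covering but not for a clustering decomposition.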

Homology

An invariant that counts occurrences of geometric patterns, such as loops, in a space.

Simplex

A subset of \({{\mathbb{R}}}^{n}\) that is the convex hull of k affinely independent points, where k ≤ n + 1. For k = 2, 3 and 4, simplices are intervals, triangles and tetrahedra, respectively.

Homotopy

For maps f and g between spaces X and Y, f, g : X → Y, f and g are homotopic if there is a continuous one-parameter family of maps beginning with f and ending at g.

Diameter

In any space where we have a notion of distance, the diameter is the maximum distance between any pair of points. For example, the diameter of the unit sphere is 2.
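
For a finite sample of points, the diameter can be computed directly as the largest pairwise distance. The sketch below assumes Euclidean distance and uses eight points sampled from a circle of radius 1; the function name `diameter` and the sample are illustrative choices, not taken from the article.

```python
from itertools import combinations
from math import cos, dist, pi, sin


def diameter(points, d=dist):
    """Largest pairwise distance in a finite point set (0.0 if fewer than two points)."""
    return max((d(p, q) for p, q in combinations(points, 2)), default=0.0)


# Eight evenly spaced points on a circle of radius 1; antipodal pairs are 2 apart.
circle = [(cos(2 * pi * k / 8), sin(2 * pi * k / 8)) for k in range(8)]
circle_diameter = diameter(circle)
```

Because the sample contains antipodal pairs, the computed value agrees with the diameter of the full circle up to floating-point rounding.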

L∞ distance

A notion of distance for \({{\mathbb{R}}}^{n}\) in which the distance between two points is the maximum of the absolute values of the differences between the coordinates of the two points.
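
A minimal Python sketch of this maximum-coordinate-difference distance (also known as the Chebyshev distance) follows. The function name `l_inf_distance` and the two sample points are assumptions made for illustration.

```python
def l_inf_distance(p, q):
    """Maximum absolute coordinate difference between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return max(abs(a - b) for a, b in zip(p, q))


# Coordinate differences are 3.0, 2.0 and 0.5, so the largest one determines the distance.
d = l_inf_distance((1.0, 5.0, -2.0), (4.0, 3.0, -2.5))
```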

Tropical

A tropical algebra is a version of algebra with addition and multiplication replaced by max or min and addition, respectively.
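
To make the replacement concrete, here is a small sketch in the standard max-plus convention, in which the tropical "sum" of two numbers is their maximum and the tropical "product" is their ordinary sum. The function names and the sample polynomial are hypothetical; the min-plus convention would work the same way with `min` in place of `max`.

```python
def trop_add(a, b):
    """Tropical addition (max-plus convention): the larger of the two numbers."""
    return max(a, b)


def trop_mul(a, b):
    """Tropical multiplication: ordinary addition."""
    return a + b


def trop_poly(coeffs, x):
    """Evaluate max_i (coeffs[i] + i * x), the tropical analogue of sum_i c_i * x**i."""
    return max(c + i * x for i, c in enumerate(coeffs))


# Tropical multiplication distributes over tropical addition:
lhs = trop_mul(2, trop_add(3, 5))                 # 2 + max(3, 5)
rhs = trop_add(trop_mul(2, 3), trop_mul(2, 5))    # max(2 + 3, 2 + 5)

# Tropical polynomial with coefficients (0, 1, -2) at x = 2: max(0, 1 + 2, -2 + 4).
value = trop_poly([0, 1, -2], 2)
```

Tropical polynomials are therefore piecewise-linear functions, which is what makes them usable as coordinates on combinatorial objects such as barcodes.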

About this article

Cite this article

Carlsson, G. Topological methods for data modelling. Nat Rev Phys 2, 697–708 (2020). https://doi.org/10.1038/s42254-020-00249-3
