# Topological methods for data modelling

## Abstract

The analysis of large and complex data sets is one of the most important problems facing the scientific community, and physics in particular. One response to this challenge has been the development of topological data analysis (TDA), which models data by graphs or networks rather than by linear algebraic (matrix) methods or cluster analysis. TDA represents the shape of the data (suitably defined) in a combinatorial fashion. Methods for measuring shape have been developed within mathematics, providing a toolkit referred to as homology. In working with data, one can use this kind of modelling to obtain an understanding of the overall structure of the data set. Homology also supports a suite of methods for constructing vector representations of various kinds of unstructured data. In this Review, we sketch the basics of TDA and provide examples where this kind of analysis has been carried out.

## Key points

• The analysis of large and complex data sets is crucial to all areas of science and industry, and is needed to support artificial intelligence. Existing methods for data analysis are often inadequate to deal with data that exhibit a great deal of complexity, because they are unable to express complicated ‘data shapes’.

• Topology (the mathematical study of shape) has been extended to topological data analysis to give systematic graph representations of data sets, which are informative in many different ways. Graphs can be thought of as encoding shape.

• Graph representations of data permit systematic unsupervised analysis of data, with a variety of methods for the interrogation of the data. They constitute a compression of the data that nevertheless preserves salient features.

• Because of the flexibility of graph representations, methods for measuring the corresponding shape are required. Homology is a family of such methods. It is useful both for overall understanding of data sets and for generation of numerical features for many kinds of unstructured data.

• Topological data analysis has been applied in many different complex data situations.

## Acknowledgements

This article has benefited greatly from discussions with J. Carlsson, P. Lum, S. Locklin and B. Mann.

## Author information

### Corresponding author

Correspondence to Gunnar Carlsson.

## Ethics declarations

### Competing interests

The author declares no competing interests.

### Peer review information

Nature Reviews Physics thanks Vanessa Robins and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

### Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Glossary

Features

In any data set, the features are the various numerical quantities attached to data points. In a data matrix, they are the columns of the matrix, and the rows are the data points.

Clustering decomposition

Any method that decomposes a data set into disjoint groups, called clusters.

Space

A set equipped with a notion of nearness. For any positive integer n, subsets of $\mathbb{R}^{n}$ are examples, and so are metric spaces.

Connected components

The decomposition of a space into disjoint pieces that are separated from each other, and which cannot be so decomposed further.

Metric spaces

An abstraction of the notion of distance in the plane. A metric space consists of a set X and a non-negative valued distance function d on pairs of points in X, satisfying certain conditions, such as symmetry and the triangle inequality d(x, z) ≤ d(x, y) + d(y, z).
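As a minimal sketch of the conditions named above, the following Python checks them on a finite sample of points. The function name `is_metric`, the tolerance `tol` and the sample points are illustrative assumptions; the check that d(x, y) = 0 exactly when x = y is one of the standard conditions that the entry leaves implicit.

```python
import math
from itertools import product

def is_metric(points, d, tol=1e-12):
    """Check the metric axioms for d on a finite list of points (hypothetical helper)."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                        # non-negativity
            return False
        if (d(x, y) <= tol) != (x == y):       # d(x, y) = 0 exactly when x = y
            return False
        if abs(d(x, y) - d(y, x)) > tol:       # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # triangle inequality
            return False
    return True

def squared_euclidean(p, q):
    """Squared Euclidean distance: symmetric, but violates the triangle inequality."""
    return math.dist(p, q) ** 2

sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
euclidean_ok = is_metric(sample, math.dist)
squared_ok = is_metric(sample, squared_euclidean)
```

On this sample the Euclidean distance passes every check, whereas the squared distance fails because 4 > 1 + 1 for the three collinear points.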

Covering

A covering of a set X is a collection of subsets of X whose union is all of X. The sets need not be disjoint.

Homology

An invariant that counts occurrences of geometric patterns, such as loops, in a space.
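For a graph, the simplest case, these counts can be computed directly: the zeroth Betti number b0 is the number of connected components, and the first Betti number b1 = E − V + b0 is the number of independent loops. The sketch below assumes vertices labelled 0, …, V − 1 and an undirected edge list; the function name `betti_numbers` is illustrative, and the union-find bookkeeping is one common way to count components rather than a method prescribed by the Review.

```python
def betti_numbers(num_vertices, edges):
    """Return (b0, b1) for an undirected graph given as a list of vertex pairs."""
    parent = list(range(num_vertices))

    def find(v):
        # Follow parent pointers to the component representative (path halving).
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = num_vertices
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:              # the edge joins two previously separate components
            parent[ru] = rv
            components -= 1

    b0 = components                        # number of connected components
    b1 = len(edges) - num_vertices + b0    # number of independent loops
    return b0, b1

# A 4-cycle plus one isolated vertex: two components, one loop.
square_plus_point = betti_numbers(5, [(0, 1), (1, 2), (2, 3), (3, 0)])
# A path on three vertices: one component, no loops.
path = betti_numbers(3, [(0, 1), (1, 2)])
```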

Simplex

A subset of $\mathbb{R}^{n}$ that is the convex hull of k affinely independent points, where k ≤ n + 1. For k = 2, 3 and 4, simplices are intervals, triangles and tetrahedra, respectively.

Homotopy

For maps f and g between spaces X and Y, f, g : X → Y, f and g are homotopic if there is a continuous one-parameter family of maps beginning with f and ending at g.

Diameter

In any space where we have a notion of distance, the diameter is the maximum distance between any pair of points. For example, the diameter of the unit sphere, measured with the Euclidean distance, is 2.

L∞ distance

A notion of distance for $\mathbb{R}^{n}$ in which the distance between two points is the maximum of the absolute values of the differences between the coordinates of the two points.
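The maximum-coordinate distance defined above (commonly written L∞, and also called the Chebyshev distance) and the diameter of a finite sample can be sketched in a few lines of Python. The function names and the four sample points on the unit circle are illustrative assumptions.

```python
import math
from itertools import combinations

def chebyshev(p, q):
    """Maximum of the absolute coordinate differences between p and q."""
    return max(abs(a - b) for a, b in zip(p, q))

def diameter(points, d):
    """Largest pairwise distance within a finite set of points, measured with d."""
    return max(d(p, q) for p, q in combinations(points, 2))

# Four points on the unit circle, including two antipodal pairs.
circle_sample = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

adjacent_linf = chebyshev((1.0, 0.0), (0.0, 1.0))
diam_linf = diameter(circle_sample, chebyshev)
diam_euclid = diameter(circle_sample, math.dist)
```

Both distances give a diameter of 2 on this sample, attained by an antipodal pair, although they disagree on adjacent points (1 for the maximum-coordinate distance, √2 for the Euclidean one).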

Tropical

A tropical algebra is a version of algebra with addition and multiplication replaced by max (or min) and addition, respectively.
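As a small illustration, the max-plus convention can be written out in Python: tropical "addition" is max and tropical "multiplication" is ordinary addition. The names `trop_add`, `trop_mul` and `trop_poly` are hypothetical, and the min-plus convention would work equally well with `min` and `+inf` in place of `max` and `-inf`.

```python
import math

TROP_ZERO = -math.inf   # identity for tropical addition (max)
TROP_ONE = 0.0          # identity for tropical multiplication (+)

def trop_add(a, b):
    """Tropical addition: the maximum of the two arguments."""
    return max(a, b)

def trop_mul(a, b):
    """Tropical multiplication: the ordinary sum of the two arguments."""
    return a + b

def trop_poly(coeffs, x):
    """Evaluate the tropical polynomial with coefficients coeffs at x.

    The ordinary expression sum_i coeffs[i] * x**i becomes max_i (coeffs[i] + i * x).
    """
    result = TROP_ZERO
    for i, c in enumerate(coeffs):
        result = trop_add(result, trop_mul(c, i * x))
    return result

# Multiplication still distributes over addition: a + max(b, c) = max(a + b, a + c).
lhs = trop_mul(2.0, trop_add(3.0, 5.0))
rhs = trop_add(trop_mul(2.0, 3.0), trop_mul(2.0, 5.0))

value = trop_poly([1.0, 0.0, -2.0], 3.0)   # max(1 + 0, 0 + 3, -2 + 6)
```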

## Rights and permissions

Carlsson, G. Topological methods for data modelling. Nat Rev Phys 2, 697–708 (2020). https://doi.org/10.1038/s42254-020-00249-3
