Topological data analysis of spatial patterning in heterogeneous cell populations: clustering and sorting with varying cell-cell adhesion

Different cell types aggregate and sort into hierarchical architectures during the formation of animal tissues. The resulting spatial organization depends (in part) on the strength of adhesion of one cell type to itself relative to other cell types. However, automated and unsupervised classification of these multicellular spatial patterns remains challenging, particularly given their structural diversity and biological variability. Recent developments based on topological data analysis are intriguing to reveal similarities in tissue architecture, but these methods remain computationally expensive. In this article, we show that multicellular patterns organized from two interacting cell types can be efficiently represented through persistence images. Our optimized combination of dimensionality reduction via autoencoders, combined with hierarchical clustering, achieved high classification accuracy for simulations with constant cell numbers. We further demonstrate that persistence images can be normalized to improve classification for simulations with varying cell numbers due to proliferation. Finally, we systematically consider the importance of incorporating different topological features as well as information about each cell type to improve classification accuracy. We envision that topological machine learning based on persistence images will enable versatile and robust classification of complex tissue architectures that occur in development and disease.


Introduction
Animal tissues are spatially organized into complex spatial patterns by varying adhesive interactions between different cell types [1,2]. For instance, mixtures of motile animal cells in a planar geometry can self-sort into their respective types, as well as aggregate into multicellular clusters [3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21]. Historical work by Steinberg attempted to explain these behaviors using a physical analogy with surface tension, where cells exhibit "differential adhesion" with one another [22], which was subsequently updated by Brodland to include contractility [23]. In particular, two different cell types that each exhibit strong homotypic adhesion (to self) but weak heterotypic adhesion (to the other) would eventually segregate into separate clusters, each consisting of a single cell type [24]. On the other hand, two cell types with strong heterotypic adhesion would randomly intermix within a single cluster. Between these limiting cases, intermediate homotypic and heterotypic adhesion would result in a core-shell organization, where a first cell type would aggregate at the interior, and the second cell type would organize as a spread layer around the periphery. More recent work using Drosophila melanogaster has addressed the role of cellular contractility [25] and proliferation [26] in tissue patterning, resulting in distinctive topologies including developmental compartment boundaries [27], hierarchical hexagonal patterning in the retina [28,29], and multicellular rosettes during germband extension [30]. Contractility-based sorting has also been characterized in germ-layer organization during zebrafish gastrulation [31]. "Checkerboard" cellular patterns of sensory hair cells and supporting cells have been observed in the auditory epithelium of the mouse cochlea [32]. More complicated finger-like or labyrinth-like tissue patterning have also been attributed to reaction-diffusion (Turing) mechanisms [33]. Further, Lim and coworkers have demonstrated synthetic cell-cell adhesions that are fully modular and tunable, enabling rational design of multicellular architecture [34]. Given this rich diversity of tissue architectures generated experimentally and in silico, an emerging challenge is to achieve an automated and unbiased classification of distinct spatial patterns [35]. An intriguing possibility is to implement an interpretable and computationally-efficient machine learning framework that can be generalized to classify spatial patterns comprised of multiple interacting cell types.
Topological data analysis (TDA) is a promising approach for machine learning of high-dimensional architectures that extracts the "shape" of a dataset based on spatial connectivity [36]. TDA considers how discrete data points may be connected pairwise (dimension 0 homology) or linked into closed loops around an empty region (dimension 1 homology). Persistent homology then treats the stability of these connected structures at varying spatial scales [37]. Topological features can then be represented using a characteristic persistence "barcode" or "diagram," and their relative "similarity" can be determined based on the "cost" of rearranging one diagram to resemble the other [38]. Such topological approaches have recently been used to visualize the spatial organization of a single motile species [39][40][41][42][43][44][45][46] and particulate systems [47]. These past investigations mostly utilized connected components (dimension 0 homology) to analyze populations of fixed size that were near confluency. However, a comparison of two populations with differing size will be biased since the number of connected components is not identical. Our recent work has shown that classification based on closed loops (dimension 1 homology) is a robust approach to classify spatial patterns with varying population size, particularly in the presence of empty regions [48].
Alternatively, persistence images represent topological features based on the weighted sum of Gaussian features, which yields a standardized finite vector representation that is amenable to machine learning [49]. These persistence images can be compressed into a fixed-length numerical representation using dimensionality reduction and manifold learning techniques such as Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP) [50], Potential of Heat Diffusion for Affinity-based Transition Embedding (PHATE) [51] and autoencoder (AE) [52,53]. In principle, unsupervised hierarchical clustering can then be applied to determine similar spatial structures and compared to some ground truth. Nevertheless, the effectiveness of persistence images for classifying spatial patterns comprised of multiple species has not been previously addressed.
In this article, we investigate the use of persistence images for unsupervised classification of multicellular spatial patterns that emerge from two interacting cell types. We systematically tune homotypic and heterotypic interactions between cell types to generate distinct architectures ranging from dispersed individuals to intermixed clusters to partially or wholly sorted clusters, as well as more complex striped phases, hierarchical hexagonal patterns, and rosettes. The spatial organization of these patterns is then represented by persistence images, persistence curves, or classical order parameters. Further, dimensionality reduction and hierarchical clustering was performed based on dimension 0 and/or dimension 1 homology for each cell type, as well as accounting for both cell types. As a case study, we first considered pattern formation in populations of fixed size. We then considered pattern formation in populations where cells can proliferate with contact inhibition. This work establishes the importance of topological features such as connected components (dimension 0 homology) and closed loops (dimension 1 homology) for classifying spatial patterns, as well as information about each cell type. Altogether, we envision this computational approach will enable new quantitative insights into the emergence of complex tissue architectures via spatiotemporal interactions between multiple cell types.

Materials and Methods
We investigated how two different types of interacting cells (discrete agents) self-organize into multicellular patterns as cell-cell adhesion was varied (Fig. 1a). For simplicity, cells were defined as either "blue" or "orange," and a total of 512 different combinations with varying adhesion between blue-blue, orange-orange, or blue-orange cell types were simulated at constant population size (i.e. no proliferation) (Fig. 1a,i). These simulations were then implemented for another 343 different conditions (3 replicates each) where cells were permitted to proliferate when there was unoccupied Steady-state multicellular patterns were then manually classified to define a "ground truth" (Fig. 1b,i). These patterns were further converted to persistence images, which can also be represented as feature descriptor vectors (Fig. 1b,ii). For comparison, these patterns were also converted to feature vectors based on conventional order parameters that represent radial or angular symmetries of blue, orange, or both sets of cells (Fig. 1b,iii). Finally, these multicellular patterns were converted to normalized persistence curves. Unsupervised classification of these feature descriptor vectors was then implemented using UMAP [50], PHATE [51], PCA and autoencoder (AE) [52,53] (Fig. 1c) for comparison with the ground truth classification (Fig. 1b,i).

Agent-Based Modeling of Two Interacting Cell Types
Cells were represented as rigid-body agents that undergo non-inertial motion (Eq. 1) due to a random self-propulsion force, P t i , and cell-cell interactions governed by the attraction-repulsion force, F t ij (Fig. 2a). Cell positions were initialized in a [−20, 20] × [−20, 20] simulation box with periodic boundaries on all sides. The equation of motion was given by: where x t i denotes the position of the ith cell at time t, ∆t denotes the time-step, and η denotes a friction coefficient. Two cells i and j in close proximity (within neighborhood distance r max = 1.5) would attract or repel each other in accordance with an the attraction-repulsion force, F ij , which is the gradient of some potential U (||x j − x i ||) (Eq. 2) (Fig. 2b):

a) Self-Propulsion and Cell-Cell Interactions b) Morse Potential c) Proliferation and Contact Inhibition
In particular, we define U in terms of a Morse potential with a first term that represents long-range attraction and a second term that represents short-range repulsion (Eq. 3): where the adhesion parameter J ij = J(T (i), T (j)) determines the magnitude of the attraction-repulsion potential, and depends on cell types T (i) and T (j) for cells i and j respectively, which we denote as blue or orange types (e.g. τ B and τ O , respectively). We further define the characteristic length scale of the first long-range attraction term as l A = 14.0, and the second short-range repulsion term as l R = 0.5, based on our previous work with epithelial cells [15,48]. Finally, the second short-range repulsion term was further scaled to be 1/4 of the first long-range attraction term.
Cell positions were initialized at t = 0 from a uniform distribution with rejection sampling to ensure a minimum separation of 1.0 between all cells. The magnitude of the polarization force, |P|, was fixed at 0.005 and the adhesion parameter, J, was varied between 0.01 and 0.25. As a consequence, cells moved directionally at a constant speed for some duration, followed by random reorientation. This behavior has been experimentally validated for epithelial cells [54], and is analogous to classical "run and tumble" models of bacteria.
The friction coefficient η = 1.0, and time-step ∆t = 0.02, were held constant throughout the simulation. Each simulation ran for 5, 000, 000 time-steps. Repolarization takes place every 2, 500 time-steps, with an initial offset to prevent synchronization. In scenarios with proliferation, cell division was modeled with a cell cycle duration of 80, 000 time-steps, and permitted to occur as long as cells have less than 4 neighbors (i.e. contact inhibition of proliferation) (Fig. 2c). This scenario with both proliferation and sorting occur is reminiscent of tissue repair after damage, such as the regeneration of zebrafish skin patterns after laser ablation [55].
Altogether, for scenarios at constant cell number (without proliferation), a total of 512 combinations of adhesion parameters were simulated (8 different values of J BB × 8 different values of J OO × 8 different values of J BO ), with 3 independently initialized replicates. However, for scenarios with proliferation, simulations with low adhesion were not included since they were too computationally expensive and did not reach steady state over the timescales considered for the other simulations (J ≈ 0.00, 0.001). Thus, only 216 combinations of adhesion parameters were simulated with proliferation (6 different values of J BB × 6 different values of J OO × 6 different values of J BO ), with 3 independently initialized replicates.

Computation of Persistence Diagrams, Persistence Images, and Order Parameters
Given a point cloud representing cell positions, we extracted topological features by constructing a chain of simplicial complexes based on a "proximity parameter" ϵ (separation distance between cell centroids) (Fig. 3A). Specifically, we used Euclidean distance between the cell positions to compute persistent homology via the Vietoris-Rips filtration. To avoid confusion, we denote these distances as "interval start" and "interval end" rather than "birth" and "death," which have different meanings in cell biology and topology literature. We used TDA to obtain a list of (start, end) pairs for topological features. These start and end pairs can be represented using a barcode diagram, or a persistence diagram for visualization (Fig. 3B). In the worst case scenario, this computation is performed in O(n 3 ) time, where n is the number of generators of the filtered complex [56], although this has been sped up using the standard reduction algorithm. In order to perform machine learning on simulation data, it was helpful to encode the topological information into vectors, so persistence images were introduced as a vectorized representation of topological features that are encoded in persistence diagrams.
Persistence images summarize the topological features using an intensity function over a measurement of length connected to the ϵ radius [49]. Generally, in order to compute a persistence image, we take the points from the persistence diagram and place Gaussians centered around each point, additionally weighting the Gaussians so that more significant points have larger weight. Then we take the sum over these Gaussians to produce an intensity. However, depending on the dimensionality of the feature we are interested in, this process, as well as the shape of the output, varies. For each simulation at "steady state", persistence images were computed, using only orange cell positions, only blue cell positions, and both cell type positions together. In each case, we calculated the persistence images in dimensions 0 and 1. As an example, we now discuss further details for the computation of persistence images for dimension 0 and dimension 1.
In dimension 0 homology, we consider only the connected component features and place Gaussians at each point weighted by (end -start) resulting in a 1-dimensional intensity function (Fig. 3C)  Persistence images (C, E) are generated by replacing interval start/end coordinates in the persistence diagram with a Gaussian weighted by distance from the diagonal. In dimension 0 homology (notated H 0 , the interval start is always 0 (each cell position is a connected component), resulting in a 1D image (C). In dimension 1 homology (notated H 1 ), the intervals represent topological loops arising at non-zero values of the filtration radius, resulting in a 2D image.
where D is the persistence diagram and g (start),σ is a Gaussian with mean start and variance σ 2 In dimension 1 homology, the presence/absence of topological loops is indicated by (start, end) coordinates. We transform the coordinate system to measure persistence along the y-axis, (start, end -start), and place Gaussians around these new points, weighted by the distance to the diagonal (Fig. 3D). In other words, weight each Gaussian The result is a 2-dimensional intensity function, with higher intensity for more persistent features (Fig. 3E). Mathematically, where g (start,end−start),Σ is a 2-dimensional Gaussian centered at (start, end-start) with covariance matrix Σ.
Persistence curves are an alternative method of summarizing the persistence diagram by defining a function along its diagonal. Consider a point (t, t) on the diagonal of the persistence diagram D, which defines a rectangular region where (start, end) ∈ [0, t] × [t, ∞). We apply a function ψ to all points (interval coordinates) within this region and compute a summary statistic T , resulting in a scalar value PC(t). More formally, we define the persistence curve, PC as a real-valued function over the diagonal, [57] PC(D, ψ, T )(t) = T (ψ(D; b, d, t)|(b, d) ∈ D t ), t ∈ R There are various options for choosing the functions ψ and T . One example is the Betti curve, where ψ is the indicator function and T is the summation operator. The Betti curve counts the number of points in the [0, t] × [t, ∞) rectangular region for all points t along the diagonal. Another example is the life curve, where ψ is (end -start) and T is the summation operator. Here, we implement the Gaussian persistence curve [58], where ψ places a Gaussian of fixed bandwidth at each point in the [0, t] × [t, ∞) region, weighted by distance from the diagonal, and T is the integral sum. One benefit of using Gaussian persistence curves is efficiency: they are easy to implement and fast to compute, with theoretical guarantees on stability and injectivity [58]. They have previously been applied to problems in image classification and neuroscience [57,[59][60][61]. Although we focus on persistence images in the main text, persistence curves are benchmarked in the supporting information and represent a promising avenue for future work.

Dimensionality Reduction and Classification
Persistence images computed from cell positions were concatenated and compressed to a low-dimensional feature vector. Dimensionality reduction was then performed on the feature vectors using principal component analysis (PCA), Uniform Manifold Approximation and Projection (UMAP) [50], Potential of Heat-diffusion for Affinity-based Trajectory Embedding (PHATE) [51], and an autoencoder (AE) [52], which facilitated visualization (in 2-D) and unsupervised classification using hierarchical clustering. Briefly, PCA projects data in low dimensions using a linear combination of features to maximize variance along each principal component. UMAP is a learned embedding based on ideas from manifold learning and TDA that constructs a high-dimensional graph representation of the data and optimizes a low-dimensional graph embedding to preserve structural similarities. PHATE computes an information distance based on a diffusion operator constructed from affinity between data points. It leverages multi-dimensional scaling (MDS) to perform dimensionality reduction using the information distance metric. An autoencoder learns the identity map by passing data, x, through an encoder network, E, to obtain a low-dimensional latent representation, z = E(x), which can be decoded to reconstruct the original data via the decoder network,x = D(z) = D(E(x)). The autoencoder is trained without supervision using the reconstruction loss penalty, L = ∥x −x∥. The classification results were determined by comparing cluster labels obtained using hierarchical clustering to ground truth phase classification. Agglomerative hierarchical clustering (ward linkage) [62] was performed on 20-dimensional representations obtained by dimensionality reduction, with a stopping criterion (i.e. dendrogram cutoff) to prevent merging when the number of clusters reached the total number of distinct phases in the ground truth.

Sorting and Clustering of Two Non-Proliferating Cell Types with Varying Adhesion
Differential adhesion simulations were performed using a self-propelled particle model with randomly initialized particle positions and periodic boundaries. Two cell types, labelled in blue (τ B ) and orange (τ O ), were simulated at constant population size (N = 200 total, 60% blue and 40% orange) with systematically varying parameters governing blue-blue adhesion (J BB ), orange-orange adhesion (J OO ), and orange-blue adhesion (J OB ), respectively. A random self-propulsion force of constant magnitude, |P| = 0.005, was applied to each particle throughout the simulation. The total simulation time was chosen to exceed the time taken to reach stable steady-state configurations for each set of parameter values (Fig. S1, S2).
At slightly increased blue-orange adhesion (J BO = 0.05), with comparable blue-blue (J BB = 0.05) or orange-orange adhesion (J OO = 0.05), sorting into separate blue and orange clusters was again observed (Fig. 4B,iv, S5, S6). When blue-blue or orange-orange adhesions were stronger than blue-orange (J BO ≤ J OO , J BO ≤ J BB ), a new type of cluster was observed with intermixed orange and blue cells (Fig. 4B,v). This organization is reminiscent of "checkerboard" cellular patterns of sensory hair cells and supporting cells have been observed in mouse auditory epithelium of the cochlea [32]. Moreover, for weak blue-blue adhesion (J BB < 0.03) and weak orange-orange adhesion (J OO < 0.05), blue and orange cells organized into clusters with alternating stripes that were roughly 1-2 cells thick (Fig. 4B,vi). These striped phases are somewhat reminiscent of those observed during skin patterning in zebrafish [63]. At stronger blue-blue adhesion (0.03 < J BB ) and weak orange-orange adhesion (J OO < 0.03), clusters consisted of orange cells dispersed in a hexagonal configuration, surrounded by blue cells. (Fig. 4B,vii). Conversely, at stronger orange-orange adhesion (0.03 < J OO ) and weak blue-blue adhesion (J BB ≈ 0), clusters consisted of blue cells dispersed in a hexagonal configuration, surrounded by orange cells. (Fig. 4B,viii). Interestingly, this hierarchical patterning has a arXiv Template  superficial resemblance to cone cells in ,the Drosophila retina [28]. At stronger orange-orange adhesion (0.03 < J OO ) and slightly stronger blue-blue adhesion (J BB ≈ 0.01), clusters consisted of a core of orange cells surrounded by blue cells at the periphery (Fig. 4B,ix).
At intermediate blue-orange adhesion (J BO = 0.13), with stronger blue-blue or orange-orange adhesions (J BO ≤ J OO , J BO ≤ J BB ), clusters again were comprised of intermixed orange and blue cells (Fig. 4C,v, S7, S8). For weaker blue-blue or orange-orange adhesions, (J OO ≈ J BB ≤ J BO ), clusters with alternating orange and blue stripes were observed (Fig. 4C,vi. At stronger blue-blue adhesion (0.03 < J BB ) and weak orange-orange adhesion (J OO < 0.03), clusters consisted of orange cells dispersed in a hexagonal configuration, surrounded by tightly packed blue cells (Fig. 4C,xi). When blue-blue adhesion and orange-orange adhesion were roughly comparable (J BB ≈ J OO ≈ 0.09), clusters consisted of tightly packed orange cells dispersed in a hexagonal configuration, surrounded by tightly packed blue cells (Fig. 4C,x). At the strongest blue-blue adhesion (J BB ≈ 0.20) and weak orange-orange adhesion (J OO < 0.03), spots of orange cells were again dispersed in a hexagonal configuration, surrounded by slightly sparser blue cells (Fig. 4C,vii). At stronger orange-orange adhesion (0.03 < J OO ) and weak blue-blue adhesion (J BB ≈ 0), clusters again consisted of blue cells dispersed in a hexagonal configuration, surrounded by orange cells. (Fig. 4C,vii). When orange-orange adhesion was comparable to blue-orange adhesion (J OO ≈ J B0 = 0.13) with weaker blue-blue adhesion (J BB = 0.07), clusters were observed with finger-like or labyrinth-like patterns of orange and blue cells (Fig. 4C,xii), analogous to those generated by reaction-diffusion (Turing) mechanisms [33]. At strong orange-orange adhesion (J OO ≈ 0.2) with weak blue-blue adhesion (J BB ∈ [0.03, 0.11]), clusters consisted of a core of orange cells surrounded by blue cells at the periphery (Fig. 4C,ix). Notably, a number of cluster configurations observed at J BO = 0.13 (Fig. 4C,v-ix) were qualitatively similar to those previously observed at J BO = 0.05 (Fig. 4B,v-ix), but offset to higher values of J BB or J OO .
At strong blue-orange adhesion (J BO = 0.25) with comparable blue-blue and orange-orange adhesion (J BB ≈ J OO ), blue and orange cells organized into clusters with alternating stripes that were roughly 1-2 cells thick (Fig. 4D,vi, S9, S10). At stronger blue-blue adhesion (0.03 < J BB ) and weak orange-orange adhesion (J OO < 0.03), clusters again consisted of spots of orange cells dispersed in a hexagonal configuration, surrounded by tightly packed blue cells (Fig. 4C,xi). Conversely, at stronger orange-orange adhesion (0.03 < J OO ) and weak blue-blue adhesion (J BB ≈ 0), clusters again consisted of blue cells dispersed in a hexagonal configuration, surrounded by orange cells. (Fig. 4D,viii). When orange-orange adhesion was comparable to blue-orange adhesion (J OO ≈ J BO = 0.25) with weaker blue-blue adhesion (J BB = 0.15), clusters were observed with finger-like patterns of orange and blue cells (Fig. 4D,xii).
When blue-blue adhesion and orange-orange adhesion were roughly comparable (J BB ≈ J OO ≈ 0.13), clusters consisted of tightly packed orange cells dispersed in a hexagonal configuration, surrounded by tightly packed blue cells (Fig. 4D,x). When all three adhesions were comparable (J BO ≈ J BB ≈ J OO = 0.25), clusters were observed with intermixed orange and blue cells (Fig. 4D,v). Thus, we then performed unsupervised classification on the multicellular patterns for varying adhesion parameters (Fig. 4) using persistence images (Fig. S11), normalized persistence curves (Fig. S12), and order parameters (Fig. S13).

Unsupervised Classification of Two Non-Proliferating Cell Types with Varying Adhesion
Unsupervised classification of multicellular patterns using PCA, PHATE, AE, and UMAP was compared based on our ground truth labeling, and also colored by increasing values of J BO , J OO , and J BB (Fig. 5). Ground truth labels (12 in total, denoted i -xii) were assigned based on manual inspection of particle configurations at the end of the simulation. For ease of visualization, axes were plotted so that the spatial configurations are more qualitatively consistent, as noted in the figure. In general, patterns with individually dispersed cells (e.g. i, ii, iii) were classified farther away from other patterns where both cell types were clustered (Fig. 5). Further, for the crescent-like grouping of the remaining clustered cell patterns, J BO increased from right to left, J OO increased from the inside out, and J BB increased from left to right. Proceeding clockwise from the top right, the right arm of the crescent represented high J BB and J OO with low J BO , and included sorted clusters (iv, green), partially sorted clusters (v, yellow), and orange clusters surrounded by blue cells (ix, brick red). Further, the center of the crescent represented high J BB , low J OO , and high J BO , which included hexagonal configurations of orange cells (vii, gray; x, fuchsia; xi, purple). The left arm of the crescent represented low J BB , low J OO , and high J BO , which included hexagonal configurations of blue cells (viii, gray blue). Finally, the upper left tip of the crescent represented low J BB , high J OO , and high J BO , which included striped configurations (vi, sky blue; xii, turquoise).
The accuracy of unsupervised AE classification ( Fig. 5D) relative to our manually annotated ground truth (Fig. 4) was determined based on persistence images that considered only dimension 0 homology, only dimension 1 homology or both dimension 0 and dimension 1 (Fig. 6, S14). Further, unsupervised classification was performed using persistence images that included information on both blue and orange cells, blue cells only, and orange cells only (Fig. 6).
Classification accuracy based on persistence images using only dimension 0 homology was moderately successful with only one cell type, ranging from 69% for orange cells only, to 86% for blue cells only, but increasing to 93% for blue and orange cells combined (Fig. 6A). In comparison, classification accuracy using only dimension 1 homology was worse when using only orange cells (61%) and blue cells (74%), but considerably improved when considering both orange and blue cells (97%) (Fig. 6B). Finally, classification accuracy using both dimension 0 and dimension 1 homology generally outperformed dimension 0 and dimension 1 only, ranging from 71% for orange cells only, up to 87% for blue cells only, and then 96% for both orange and blue cells (Fig. 6C). Overall, classification accuracy was considerably better when considering blue cells only relative to orange cells only, which could be explained by the 60: 40 ratio of blue cells to orange cells in the simulations. Interestingly, combining information from both blue and orange cells incrementally improved classification accuracy for dimension 0 homology only (Fig. 6A), but resulted in much larger improvements for dimension 1 (Fig. 6B) only as well as dimension 0 and dimension 1 (Fig. 6C). For comparison, classification using persistence curves was slightly worse, ranging from 62% to 83% (Fig. S15ABC, S16). Classification using radial order parameters was comparable to persistence images, ranging from 68% to 94% (Fig. S17ABC, S18), but considerably worse for angular order parameters. Overall, the best classification occurred with persistence images and AE, considering both blue and orange cells for dimension 1 homology only (97%), and comparable performance for dimension 0 and dimension 1 homology (96%) (Table S1, S2, S3).

Sorting and Clustering of Two Proliferating Cell Types with Varying Adhesion
Next, these differential adhesion simulations were then implemented with proliferation, so that a mother cell would divide into two daughter cells after some duration (e.g. 50,000 time-steps), randomly offset. Cell division events were implemented so that one daughter cell retained the velocity and direction of the mother cell, while the other daughter cell was placed nearby, but moving at equal velocity in the opposite direction. Moreover, a contact inhibition rule was included so that a cell was not permitted to divide if the local cell density was high (more than four nearest neighbors). Simulations were otherwise performed consistently with the previous scenario, starting with a 60:40 ratio of blue and orange cells while systematically varying blue-blue adhesion (J BB ), orange-orange adhesion (J OO ), and blue-orange adhesion (J BO ). Again, the total simulation time was chosen to exceed the time taken to reach stable steady-state configurations for each set of parameter values (Fig. S19, S20). For most of the representative simulations, the steady-state configuration was reached by 8-10 cell cycles (15% of the overall simulation duration), although certain scenarios involving the merging of clusters required up to 48 cell cycles (76% of the overall simulation duration). Nevertheless, the multicellular stripe or spot organization within these clusters was typically evident by ∼10 cell cycles. This scenario with both proliferation and sorting was inspired by tissue repair, such as the regeneration of zebrafish skin patterns after laser ablation [55]. For weak blue-orange adhesion J BO = 0.05 with roughly comparable blue-blue and orange-orange adhesion (J BB ≈ J OO ), clusters appeared intermixed with irregular domain sizes (Fig. 7A,v, S21). We utilized the same numbering convention as in Fig. 4 for ease of comparison. Nevertheless, for weak blue-blue adhesions J BB = 0.05 and varying orange-orange adhesions 0.07 ≤ J OO , clusters were partially sorted with more blue cells than orange cells (Fig. 7A,xiv), which was not previously observed. By analogy, for weak orange-orange adhesions J OO = 0.05 and varying blue-blue adhesions 0.07 ≥ J BB , clusters were partially sorted with more orange cells than blue cells (Fig. 7A,xv).
In comparison, for weak blue-blue and orange-orange adhesion (J BB ≈ J OO ≤ 0.09), clusters with alternating orange and blue stripes were observed (Fig. 7B,vi). For slightly increased blue-blue or orange-orange adhesions (J OO = 0.13, J BB = 0.13), clusters exhibited finger or labyrinth-like morphology (Fig. 7B,xii). For strong blue-blue adhesions (0.17 ≤ J BB ) and weak orange-orange adhesions (J OO = 0.05), clusters were sorted with blue cells in the interior and orange cells at the periphery (Fig. 7B,vii). In comparison, for strong orange-orange adhesions (0.17 ≤ J OO ) and weak blue-blue adhesions (J BB = 0.05), clusters were sorted with orange cells in the interior and blue cells at the periphery (Fig. 7B,ix).
At the strongest blue-orange adhesion (J BO = 0.25), intermixed clusters were only observed for comparably strong blue-blue and orange-orange adhesions (J BB ≈ J OO = 0.25) (Fig. 7D,v, S25, S26). When blue-blue and orangeorange adhesions were weaker but comparable (J BB ≈ J OO ≤ 0.21), clusters with alternating orange and blue stripes were observed (Fig. 7D,vi). At one location (J BB = 0.09, J OO = 0.05), clusters consisted of spots of orange cells dispersed in a hexagonal configuration, surrounded by tightly packed blue cells (Fig. 7D,xi). For stronger blue-blue adhesions relative to orange-orange adhesions (J OO < J BB ≤ 0.17, clusters consisted of circular spots of orange cells surrounded by blue cells (Fig. 7D,x). For weak blue-blue adhesions (J BB = 0.05), a new cluster morphology was observed with mixed stripes and spots (Fig. 7D,xiii). Finally, for stronger orange-orange adhesions relative to blue-blue adhesions (J BB < J OO ≤ 0.25), clusters exhibited a labyrinth or finger-like morphology (Fig. 7D,xii).

Unsupervised Classification of Two Proliferating Cell Types with Varying Adhesion
Unsupervised classification of these multicellular patterns with proliferation was implemented using PCA, PHATE, AE, and UMAP for comparison with ground truth labeling, then plotted by increasing values of J BB , J OO , and J BO ( Fig. 8). Ground truth labels (10 in total, denoted v -vii, ix -xv) were assigned based on manual inspection of the final particle configurations. For ease of visualization, axes were again plotted so that the spatial configurations are more qualitatively consistent. In general, it was apparent that similarly classified conditions were more widely dispersed after dimensionality reduction for these simulations with proliferation relative to no proliferation (Fig. 5), which can be attributed in part to differences in total cell numbers. Since cells continued to proliferate until contact inhibited "steady state," no simulations were observed with individually dispersed cells.
After 2D embedding, UMAP and PHATE yielded roughly similar distributions (Fig. 8AB), while PCA and AE also were comparable (Fig. 8CD). Nevertheless, conditions with different colors were poorly separated by UMAP and PHATE relative to PCA and AE (e.g. v, yellow and xv, tan). Roughly, J BO increased from top to bottom, J OO increased inward from the periphery, and J BB increased from left to right (Fig. 8). Proceeding from the bottom upwards, the bottom grouping of cells represented high J BB , low J BB , and high J BO , which included spots of orange cells in a hexagonal configuration (x, fuchsia and xi, purple) (Fig. 5). Further, the middle grouping of cells represented low J BB , high J OO , and high J BO , which included striped patterns (vi, sky blue; xii, turquoise; xiii, gray blue) (Fig. 8).
Finally, the top grouping of cells represented high J BB , high J OO , and low J BO , which included intermixed clusters (v, yellow), partially sorted clusters (xiv, brick red; xv, tan), and clusters with orange cells at the interior and blue cells at the periphery (ix) (Fig. 8).
For these simulations with contact-inhibited proliferation, the accuracy of unsupervised AE classification (Fig. 8) relative to our manually annotated ground truth (Fig. 7) was again determined based on persistence images that considered only dimension 0 homology, only dimension 1 homology or both dimension 0 and dimension 1 (Fig. S28). Further, unsupervised classification was performed using persistence images that included information on both blue and orange cells, blue cells only, and orange cells only. Notably, unsupervised AE classification for simulations with proliferation was considerably worse compared to simulations with constant cell number, particularly for dimension 1 homology or dimension 0 and 1 homology, which ranged from 37% to 78% (Fig. 9). Qualitatively similar trends were also observed where classification accuracy tended to be worse when considering orange cells only, with some improvement for blue cells, as well as orange and blue cells, respectively.
Persistence diagrams and images may be biased by different point cloud sizes, corresponding here to different numbers of cells, which likely affected the unsupervised classification. Thus, we normalized the persistence images across all simulations by dividing each persistence image by its maximum intensity. As a consequence, classification accuracy for normalized images improved considerably by 5-10%. For example, classification accuracy based on normalized persistence images improved to 61-84% for dimension 0, to 56-81% for dimension 1, and to 63-83% for dimension 0 and 1, respectively (Fig. 9). The general trend of improving classification from orange only to blue only to orange and blue remained consistent with normalization. For comparison, classification using persistence curves was comparable, ranging from 49% to 84% (Fig. S15DEF, S28). Classification using radial order parameters was also comparable to persistence images, ranging from 66% to 88% (Fig. S17DEF, S29), but considerably worse for angular order parameters. Further, classification of persistence curves and order parameters using PHATE was also associated with poorly separated groupings (Fig. S30). Overall, dimension 0 only classification tended to outperform dimension 1 only classification, and dimension 0 and dimension 1 together tended to give the best classification. Moreover, accounting for both orange and blue cells tended to give comparable classification for persistence images, persistence curves, and order parameters (Table S4, S5, S6). Figure 9: Unsupervised classification accuracy for persistence images at varying population size. Unsupervised classification by hierarchical clustering of 20-dimensional autoencoder embedding of persistence images. Darker bars represent persistence images without normalization, whereas lighter bars represent the additional improvement after normalization by peak intensity per image. Classification accuracy of images with and without normalization is computed by comparing cluster labels to ground truth.

Discussion and Conclusion
In this work, we investigated how two interacting and motile cell types ("blue" and "orange") self-organize into multicellular patterns when the strength of blue-blue, blue-orange, or orange-orange adhesions are varied systematically. Based on the relative positions of blue cells with respect to other blue cells and orange cells, as well as orange cells to other orange cells, these patterns could be represented using persistence images, persistence curves, and classical order parameters, then analyzed using dimensionality reduction and hierarchical clustering. After optimization, unsupervised classification showed excellent agreement with our manually annotated ground truth (85-95%) (Fig. 6, 9). Moreover, this classification is in good agreement with Steinberg's scaling arguments for differential adhesion [24]. For instance, the classifier reveals distinct regimes where both cell types are intermixed, which occurs when homotypic adhesions are weaker than heterotypic adhesion (J BB , J OO < J OB , Fig. 4v). Conversely, cells are sorted apart when heterotypic adhesions are much weaker than homotypic adhesion, (J OB << J BB , J OO , Fig. 4iv). We also classify a pattern where a core of orange cells surrounded by a shell of blue cells for mismatched homotypic adhesion with intermediate heterotypic adhesion (J BB < J OB < J BB , ix). Our computational analysis is more granular, identifying distinct patterns driven by more subtle differences in homotypic and heterotypic adhesion. For example, we observe discrete clusters of blue or orange cells of varying size arranged in checkerboard, stripe, or labyrinth patterns ( Fig. 4vi-xii). These patterns are very robust for identical adhesion parameters with different initial conditions (Fig. S31). More recent conceptual models are based on interfacial tension, which address both cell-cell adhesion as well as cortical tension [1]. Our classified patterns are also in qualitative agreement with scaling arguments for interfacial tension ( [2,23]), noting that interfacial tension decreases with increasing adhesion. However, we note that Brodland's work uses a vertex-based model of cell shape, which yields certain numerical prefactors for the scaling argument. Thus, there are some quantitative differences with our agent-based model which does not consider interfacial tension and polygonal cell shape.
Based on manual inspection of the remaining discrepancies (5-15%), we recognized that some conditions were located at the "phase boundary" between different regions, and could exhibit a mixture of different spatial patterns. Indeed, misclassified conditions were often far from the centroid of each grouping, which was also apparent from the 2D AE embedding (Fig. S32, S33, S34, S35, S36, S37). Although some caution is warranted when visually interpreting dimensionality reduced embeddings, we note that different colored groupings were relatively well separated by AE, particularly for the simulations with proliferation (Fig. 8,D). In comparison, UMAP biased towards discrete clusters with some loss of global structure, while PHATE "squeezed" data points together into continuous branch-like structures [51]. Thus, for simulations with proliferation, different colors are widely dispersed without clean separation in UMAP and PHATE embedding, including intermixed (v, yellow) and partially mixed particles (xv, brown), as well as stripes (vi, light blue) and spots (x, pink) (Fig. 8,A,B), which likely contributed to their markedly worse performance for classifying simulations with proliferation. For comparison, we have included supervised classification results obtained using a "softer" probabilistic algorithm where conditions can be classified by multiple adjacent regions. Briefly, we used a soft margin support vector machine (SVM), using the radial basis function for nonlinear transformation of the input, at various values of C and computed the accuracy using 5-fold cross-validation, which also exhibits excellent performance (Table S7, S8).
Moreover, the classification was occasionally confounded by multicellular patterns with similar spatial organization but where the positions of the "blue" and "orange" cells were switched. For instance, consider a scenario where hexagonal arrays of orange clusters are surrounded by blue cells occur at high J BB and low J OO , (Fig. 4B,vii).
Switching the relative positions of blue and orange cells in this scenario results in hexagonal arrays of blue clusters surrounded by orange cells at low J BB and high J OO (Fig. 4B,viii). The input to the classifier consists of feature vectors representing the persistence images of blue, orange, or blue and orange cells, which are weighted equally without explicitly specifying a "cell type." Thus, from the perspective of the unsupervised classifier, these two configurations appeared indistinguishable, especially given that the effective blue and orange cell sizes were identical based on the interaction potential (Eq. 3). Analogous issues occurred for unsupervised classification using only one cell type (e.g. orange) without consideration of the other cell type (e.g. blue). For example, if only orange cells were considered, hexagonal arrays of blue clusters surrounded by orange cells appeared relatively similar to clusters with blue cells at the interior and concentric rings of orange and blue, particularly in dimension 1 homology that only considers topological loops (Fig. S38A). Instead, if only blue cells were considered (and dimension 1 homology), hexagonal arrays of orange clusters would appear very similar to hexagonal arrays of orange individuals (Fig. S38B). Similarly, if the spatial configuration of orange cells was comparable in scenarios with very different blue cell configurations, consideration of only orange cells in dimension 0 homology would also result in misclassification (Fig. S39B). The use of feature vectors representing spatial information about both blue and orange cells typically yielded improved classification. Nevertheless, classification using dimension 0 or dimension 1 homology usually resulted in comparable accuracy (using both colors), without significant improvement when both dimension 0 and dimension 1 were considered.
These simulations were initialized with a 60:40 ratio of blue to orange cells, which indirectly encodes a difference between cell types into the classifier. To check the robustness of these results, we repeated these simulations with varying domain sizes and verified that multicellular patterns occurred consistently, particularly the organization of cells within clusters (Fig. S40, S41). We note that classification accuracy remains consistently high across different domain sizes, both for constant population size as well as with proliferation (Fig. S42). We also varied the cell type ratio from 90:10 to 10:90 over a comparable range of homotypic and heterotypic adhesion strengths (Fig. S43). Notably, increasing the fraction of one cell type (e.g. blue) relative to the other (e.g. orange) was equivalent to increasing the relative homotypic adhesion (e.g. strengthening blue-blue adhesion relative to orange-orange). For example, a 90:10 ratio of blue:orange cells typically resulted in sparse orange cells surrounded by blue cells (Fig. 4D-H). In comparison, a 10:90 ratio of blue:orange cells exhibited sparse blue cells surrounded by orange cells, also equivalent to switching the positions of blue and orange cells. Moreover, cells that could undergo contact-inhibited proliferation exhibited additional patterns with both stripes and spots (Fig. 7C,xiii), as well as partially sorted patterns (Fig. 7C,xiv, xv). We further verified that varying the proliferation rate with contact inhibition resulted in qualitatively similar spatial patterns when homotypic and heterotypic adhesions were held constant (Fig. S44). Overall, these artifacts of interchanging cells are unlikely to arise under more realistic conditions, since different cell types within a tissue exhibit appreciable differences in size or biomarker expression that would further inform unsupervised classification. We further considered simulations with three interacting cell types (50% blue, 30% orange, 20% green), which resulted in analogous pattern formation as with two interacting cell types (Fig. S45, S46, S47). This algorithm also demonstrated excellent classification of these patterns based on the positions of two out of the three cell types, especially with blue and orange cells which comprise 80% of the population (Fig. S48). We also classified blue-green and orange-green pairings, using the blue-orange ground truth phase classification as a reference. This resulted in a few misclassifications, primarily resulting from overlapping cell positions or blue and green cells (Fig. S46D), orange and green cells (Fig.  S46E), as well as previously unseen spatial arrangements that emerged due to competition between two cell types for maximizing interaction with the third cell type (Fig. S46F).
Although self-organization into multicellular patterns occurs relatively robustly in development, the motility and interactions of individual cells are stochastic. Thus, an unsupervised classifier must be stable against variability that arises from biological "noise." The transformation of coordinates to persistence diagrams in topological data analysis is provably stable with respect to bottleneck distance and Wasserstein distance [37]. However, direct comparison of cell positions via bottleneck distance is undesirable, since the L ∞ norm is entirely determined by a single topological feature. Comparisons based on the Wasserstein metric are computationally expensive, requiring the calculation of an optimal transport plan between pairs of persistence diagrams. Here, we have focused on persistence images as an underexplored approach for topological data analysis of multicellular pattern formation, relative to classical order parameters and persistence curves. Persistence images enable excellent unsupervised classification at intermediate computational cost with theoretical guarantees on stability. The Euclidean distance between persistence images based on the 2-D Gaussian kernel is bounded by the 1-Wasserstein distance between the corresponding persistence diagrams [49]. The kernel bandwidth and image resolution can be adjusted to achieve the desired trade-off between computational efficiency and stability guarantees. In comparison, persistence curves can be computed more efficiently but lead to worse classification. The loss of information by summing over the Gaussian kernel along the diagonal is offset by the greater computational efficiency of using a lookup table of cumulative distribution values in the calculation. In comparison, classical order parameters perform very well for classification, but are quite expensive computationally. One explanation for the relative success of classical order parameters, especially for simulations with proliferation, is that they have been normalized to account for different numbers of (identical) particles. In comparison, the optimum normalization for persistence images remains unresolved, and will be considered more thoroughly in a follow-up manuscript.
In conclusion, we show that combining persistence images with autoencoders enables the unsupervised, computationally efficient classification of spatial patterns associated with two interacting cell types. This approach represents topological features of the multicellular architecture as the weighted sum of Gaussian features, yielding a standardized finite vector representation. We show that optimized dimensionality reduction using AE and hierarchical clustering can reveal topologically similar simulation conditions, in excellent agreement with our manually annotated ground truth, particularly for populations of fixed size. However, persistence images of simulations with varying population size required normalization for unbiased comparison, and performed slightly worse. In a follow-up manuscript, we will apply this approach to analyze stripe and spot patterning in zebrafish development, as well as to gain deeper mechanistic insight into persistence images. Overall, we envision that topology-based machine learning represents a powerful and human-interpretable framework to explore the diversity of complex tissue shapes and patterns that emerge from self-sorting and collective cell migration.

Competing Interests
D.B. is currently supported by a Boehringer Ingelheim Fellowship at Yale University. All other authors declare no competing financial or non-financial interests.

Data Availability
All simulation data generated for this study are available via an Open Science Foundation repository at https: //osf.io/md86n/?view_only=66ac8655cf1842e4bfe52ca0b8b59a04 [64] under an MIT License.

Code Availability
All code used to perform the simulations, compute order parameters and persistent homology, and generate classification results is available on GitHub at https://github.com/dbhaskar92/Coculture-ABM-Model under an MIT license.                              Table S6: Unsupervised classification accuracy of order parameters at varying population size.
Note S1: We also considered classification using a softer probabilistic algorithm. Briefly, we used a soft margin support vector machine (SVM) where the soft-margin classifier: , · · · , n, ζ i ≥ 0 trained using the hinge-loss function, max{0, 1 − y i (w T x i + b)}, allows flexibility for misclassifications, via slack variables ζ i . The regularization parameter, C, controls the trade-off between maximizing the margin at the decision boundary and minimizing the loss. We trained the soft margin classifier, using the radial basis function for nonlinear transformation of the input, at various values of C and computed the accuracy using 5-fold cross-validation. Crucially, we modified the scoring function to ignore misclassification between adjacent phases in the accuracy computation.