Inferring spatial and signaling relationships between cells from single cell transcriptomic data

Single-cell RNA sequencing (scRNA-seq) provides details for individual cells; however, crucial spatial information is often lost. We present SpaOTsc, a method relying on structured optimal transport to recover spatial properties of scRNA-seq data by utilizing spatial measurements of a relatively small number of genes. A spatial metric for individual cells in scRNA-seq data is first established based on a map connecting it with the spatial measurements. The cell–cell communications are then obtained by “optimally transporting” signal senders to target signal receivers in space. Using partial information decomposition, we next compute the intercellular gene–gene information flow to estimate the spatial regulations between genes across cells. Four datasets are employed for cross-validation of spatial gene expression prediction and comparison to known cell–cell communications. SpaOTsc has broader applications, both in integrating non-spatial single-cell measurements with spatial data, and directly in spatial single-cell transcriptomics data to reconstruct spatial cellular dynamics in tissues.


.1 Optimal transport
Optimal transport theory finds an optimal (under predefined cost) transportation plans from a distribution to another.

The unbalanced structured optimal transport
Consider a problem where the two distributions to be mapped are discretely represented by vectors ω 1 ∈ R n and ω 2 ∈ R m . Let M ∈ R n×m , A 1 ∈ R n×n , and A 2 ∈ R m×m be dissimilarity measurements for bins across two distributions and within each distribution. A transport plan can be represented by γ ∈ R n×m + which specifies how much mass from each bin of one distribution is transported to each bin of the other distribution indicating a mapping between the two datasets. Measurements of different aspects are defined to assess the quality of transport plans. The major one is the transport cost defined as Since scRNA-seq data and spatial data are usually not from the same samples and there are cells lost in experiments, we should allow unbalanced transport, i.e. the transport desinations of a bin do not need to add up to the same mass of this bin. This is realized by using a penalty term for the unbalance in γ instead of strict constraint. [6] Kullback-Leibler (KL) divergence is used to measure the divergence of two measures The penalty term for unbalance in γ is defined based on KL divergence E unbalance = KL(γ1 m |ω 1 ) + KL(γ T 1 n |ω 2 ).
The Gromov-Wasserstein (GW) distance measures difference between metric spaces. [16] In other words, it focuses on how well the intra-dataset relationship is preserved by the mapping. The GW loss is defined as A linear combination of Eq. 1 and 4 has been proposed for structured data. [24]. In practice, a regularization term is often used to accelerate convergence of the algorithm. The entropy regularization term is defined as H(γ) = i,j γ i,j log(γ i,j ).
The three loss terms defined above assemble to the total loss for finding the mapping between scRNA-seq data and spatial data.

Wasserstein distances
The optimal transport framework naturally provides a measurement for difference between distributions. We use this distance to measure (1) distance between spatial distributions of individual cells as indicator of spatial distance between cells and (2) distance between spatial gene expression patterns. In the later application, we can both quantify this distance on the original geometry or directly on the single cell data paired with cell-cell spatial distances. The Wasserstein distance is defined as It is often solved with an entropy term defined in Eq. 5. The outcome of this regularized version is called Sinkhorn distance which is an approximation to Wasserstein distance. Note that when the cost matrix satisfies to be a metric, this distance is also a metric. In SpaOTsc, Wasserstein distance is used to compute distances between single cells based on distance matrix of spatial locations resulting in a distance matrix for single cells satisfying to be a metric. This distance matrix for single cells is further used as the cost matrix when characterizing differences between genes thus resulting in a metric for spatial expression patterns of genes.

Numerical approximation
We focus on the case L = L 2 which results in a quadratic optimization problem and follow the optimization scheme given in [24]. The transport part and structure part in Eq. 6 can be written as where L i,j,k,ℓ = L A 1 (i, j), A 2 (k, ℓ) and L ⊗ γ = k,ℓ L i,j,k,ℓ γ k,ℓ i,j . The proposition 1 in [16] suggests that if L can be written as for some functions f 1 , f 2 , h 1 , h 2 , then L ⊗ γ can be decomposed as where c A 1 ,A 2 is independent of γ. This property provides an efficient way of calculating L ⊗ γ. When the quadratic loss L 2 (a, b) = 1 2 (a − b) 2 is used, the functions can be constructed as Proposition 2 in [16] gives the gradient of the Gromov-Wasserstein part of the loss function, ∇ γ < L ⊗ γ, γ >= 2L ⊗ γ.
So the gradient of the loss function in Eq. 8 can be computed as 4 The Frank-Wolfe algorithm is used to find local minima. The directionfinding subproblem in the algorithm can be regarded as an optimal transport problem Here we substitute the usual strict restriction of mass conservation by the soft penalty term E unbalance to allow unbalance transport. This problem is solved following [6]. Then a step size τ is determined which minimizes where the function can be written as Some terms are combined assuming A 1 and A 2 are symmetric. Let τ * be the minimizer satisfying 0 ≤ τ * ≤ 1. The new transport plan is updated as γ k = τ * p k + (1 − τ * )γ k−1 .

Approximation of Wasserstein distance with landmark points
When computing spatial distance between individual cells, we accelerate the computation by approximating the spatial geometry with fewer landmark points. We follow a method proposed to approximate the topology induced by point clouds. [7]. When the spatial locations are regarded as a point cloud, the algorithm initializes by randomly selecting a point from the point cloud as a landmark point. We call the points in the point cloud that are not landmark points the candidate points. At each step, we first compute the shortest distance from each candidate point to the collection of landmark points. Then a candidate point with the largest shortest distance is selected as a landmark point. This process is iterated until the target number of landmark points is achieved. The information carried by the non-landmark points are assigned to the nearest landmark points. Optimal transport is computed on landmark points as an approximation for the distance computed on the original point cloud.

Approximation of geodesic distances
Tissues often have nontrivial shapes which motivates the use of geodesic distances for spatial locations. To efficiently approximate the geodesic distances between spatial locations of in situ data, we construct a graph based on k-nearest neighbors and compute the lengths of shortest paths. Given a collection of points with their spatial coordinates, we first build a k-nearest neighbor graph based on the Euclidean distance which is also used as edge weights. If there are separate connected subgraphs, an edge is added between two closest points from each pair of such subgraphs. Now we have a path-connected graph with nodes representing the points. The shortest path length between each two nodes on this graph is computed to approximate geodesic distances between points. This approximate geodesic distance matrix is denoted by D geo (X, k) where k is the number of nearest neighbors used and X is the point cloud.

Mapping between spatial data and single-cell data
Consider a single cell dataset with n cells and a spatial dataset with m spatial locations. To initialize the computation of optimal transport, we first need to construct the distance matrices M, A 1 , A 2 listed at the beginning of Section 1.1.1. To compute the similarity of two binary vectors x and y, we use Matthews correlation coefficient. Spearman's rank-order correlation coefficient is used for numerical valued vectors. Both correlation coefficients ranges from −1 for totally opposite relation to 1 for perfect match. When number of genes of moderate, we directly apply these correlation coefficients to the gene expression vectors. A pre-processing is carried out to reduce the dimensionality when a large number of genes are involved. Within a dataset, for example the scRNA-seq data, we use principal component analysis to map the data onto a lower dimensional space to robustly quantify differences between individual cells. When comparing across datasets, we use canonical correlation analysis to map both datasets onto a common low dimensional space following [4]. The correla-tion coefficients are transformed into a distance measurement by applying a exponential function (e −c − e −1 )/(e − e −1 ).
Once we have the distance matrices, we solve the optimization problem in Eq. 6. The optimal transport plan γ * ∈ R n×m + is a matrix serving as mapping between the data sets (n single cells and m spatial locations). The entry γ * i,j depicts the likelihood of cell i originated from location j. This map is used to reconstruct spatial gene expressions byγ * T * g where g ∈ R n,ng is a matrix recording gene expressions in single-cell data andγ * is column normalized (each column sums to 1) version of γ * . The location of the maximum entry of each row of γ * delivers a prediction of spatial origin of the single cell corresponding to this row.

Spatial distance between single cells
Using the cost matrix constructed in the previous section, we obtain a transport matrix γ * ∈ R n×m + connecting scRNAseq data with n cells and spatial data with m locations. Let D spa ∈ R m×m + be a distance matrix between spatial locations. The spatial distance between two single cells i and j is indicated by the Wasserstein distance: whereγ * i is the ith row of γ * normalized to have a sum of 1. The spatial distance matrix of single cellsD sc directly leads to several applications. (1) Spatially meaningful low dimensional visualization of single-cell data is obtained by feeding theD sc to dimension reduction algorithms that can work with distance matrices, such as tSNE and UMAP.
(2) Spatially localized sub-populations of cells are identified by performing sub-clustering based onD sc within cell clusters determined in previous single-cell data analysis.

Clustering of spatial gene expression patterns
In this section, we discuss the quantitative comparison of gene expression patterns using optimal transport. Since the setup of computational optimal transport does not explicitly require geometry, we can directly compare using single cell data in addition to comparing using predicted spatial expressions. In the case of comparing directly in single cell data, let g a , g b ∈ R n be the vectors of expression of these two genes in individual cells, and D sc ∈ R n×n be the spatial distance between in the single cells. The difference of spatial gene expression pattern is quantified by d W (g a , g b ,D sc ), the Wasserstein distance defined in Eq. 7. The inputs g a and g b are normalized first to have sum of 1. The case of comparing spatial expression in the original geometry is similar with the inputs being the normalized expression across spatial locations and the distance matrix explicitly constructed using geodesic distance in the original geometry. A python implementation [23] of louvain clustering algorithm [2] with the RBConfiguration method [18] is used to cluster the genes based on a neighborhood graph built with the quantified spatial pattern differences.

Weight kernels
Thresholding is often used in analysis of genes or spatial relationships. To avoid the unstability caused by hard thresholding, we use continuous kernel functions to assign weights that reflects spatial relations. These smooth weight functions have shown to outperform hard cutoffs in modeling protein structures [28]. We list two commonly used weight kernels here. The exponential kernel and the generalized Lorentz kernel are defined as where η is a scaling parameter analogous to the cutoff distance in hard thresholding approaches and the parameter ν controls the steepness of decay near η. These kernels can also be used to substitute hard thresholding of significance as φ(1/s; 1/η, ν) with s being a signal which is considered significant when exceeds the scaling parameter η. The weight kernels will be used throughout the following analysis methods.

Intercellular gene-gene regulatory information flow
Consider two variables depicting the expression of a gene in the spatial neighborhoods of cells and the expression of a gene in the cells. We quantify how much information about the second variable is provided by the first.
We call this intercellular gene-gene information flow in the sense of information theory. In the following sections, we call the genes corresponding to the variables the source gene and the target gene respectively. Note that the source gene and the target gene can be the same, in which case, we are measuring a non-linear analog of spatial autocorrelation.

Mutual information
The entropy of a random variable X measures how much information it carries and can be written as The base 2 of the logarithm results in the units of shannons. There are other choices that lead to other information units. The conditional entropy can be defined in a similar manner, Conditional entropy measures how much information is carried in X that cannot be told by observing Y . Also, the joint entropy can be written as Then, the mutual information between X and Y measures how much information does observing one variable provide about the other can be computed by The mutual information can also be equivalently expressed by the entropy terms, 9

Partial information decomposition
We are interested in the case with three variables X, Y, Z in this application. Specifically, we seek to inquire how much unique information is provided about Z by observing Y when X is already observed. Partial information decomposition (PID) is proposed by Paul and Randall [27]. When considering X and Y together, we can measure how much information is provided about Z by X and Y using mutual information I(Z; X, Y ). PID states that the information provided by X and Y about Z can be classified into 4 categories: (1-2) uniquely provided by X or Y that the other variable can not provide (unique information), (3) jointly provided by X and Y that they can not provide alone (synergy), and (4) both X and Y can provide alone (redundancy). PID seeks to decompose I(Z; X, Y ) into these four parts to decipher the detailed information flow among the three variables.
The decomposition can be expressed as We briefly go over one formulation of the decomposition by [27] which starts by computing redundancy. The mutual information between Z and X can be decomposed by each outcome of Z as The term I spec is called the specific information, Then the redundancy Rdn(Z; X, Y ) is quantified by This construction relies on the perspective that the minimum information provided by the sources reflects the common information and thus redundancy. Once redundancy is computed, the unique information can be computed as The implementation (PID WB) in the python package discrete information theory (dit) is used [11].

Information flow network construction
When applying PID to the inference of inter-cellular gene talks, we consider a target gene G tar , a source gene in the spatial neighborhoods G src , and a collection of background genes G that are related to G tar . The observation of G under a spatial distance η can be constructed with the weight kernels defined in Eq. 13, where W i,j = φ (Dsc (i, j); η, ν) for some positive ν. It is desirable to directly compute Unq G (G tar ; G src ) to assess the unique information about the target gene provided by the expression of source gene in spatial neighborhoods upon knowing the background genes. However, this decomposition can quickly become computationally intractable as the lattice becomes huge even with a moderate size of G. Fortunately, for a target gene and a set of candidate source genes, we are able to compare the source genes relatively. In other words, we can use the average unique information in the three variable case instead, As these analysis are carried out in the discrete setting of information theory, we pre-processed the gene expression levels into discrete bins using the implementation of Bayesian blocks [22] provided in the Astropy package [20,17]. The score defined in Eq. 25 requires the computation of O(n 3 ) partial information decomposition which could become very time consuming when the number of genes considered is large. We use a weighted spatial correlation coefficient [5] as a pre-screening of significant pairs before performing PID. The weighted spatial correlation coefficient is defined as where µ and σ are mean value and standard deviation and W is a spatial weight matrix satisfying W ii = 0 and ij W ij = 1.

Spatial distance for intercellular signaling
Intercellular signaling follows a general mechanism involving mainly ligands, receptors and downstream genes. Based on this explicit mechanism, we aim at inferring an effective spatial distance for known signaling.
To this end, we use ensemble of trees models which are interpretable and are able to address the signaling mechanism.

Feature importance in random forest
Random forest is an ensemble of independently trained decision trees [3]. It is assumed that individual trees make different mistakes and their consensus may achieve better performance. In practice, random forest is usually an efficient close match to state-of-the-art results in tasks with moderate sample and feature size. Here, we briefly review the feature importance measurement in random forest regression models. A decision tree is a tree with each node equipped with a criteria associated to a feature. When a sample is fed to the tree, it flows from the root to a leaf which then gives the predicted target value for this sample. An important score used to build a decision tree, the impurity score of a node can be calculated as where y i are the target values of the training samples that go through this node andȳ is the average of these values. The process of training a decision tree can be roughly understood as choosing features to add new nodes to the tree which mostly decrease the impurity. At the same time, impurity scores can also be used to assess the importance of features. The relative importance of a feature in a tree can be quantified by where N is the index set for nodes associated to this feature, C i is the index set of for the child nodes of node i, and N k is the number of training samples that flow through node j. Then, the normalized importance of feature i in a tree can be computed as where N f is the number of features in the tree. Then, the importance of a feature across the whole forest can be simply quantified by an average of its normalized feature importance in all the trees that assembles the forest. The random forest implementation in scikit-learn is used. [15]

Inference of spatial signaling range
Here, we build random forest regression models that predict the expression of a target gene A (a downstream gene in a signaling) based on a collection of genes within the same cell that are closely related to A and the expression of gene B (the ligand in this signaling) in the neighborhood. The samples are weighted by the expression level of the receptor(s) involved in the signaling. Specifically, the sample weights are constructed as where η r is a scaling parameter deciding whether the receptor is significantly expressed and ν is a negative integer. Given a spatial distance η, we first construct a weight matrix W ∈ R n×n for the cells, whereD(i, j) is the inferred spatial distance between cell i and j and φ is the kernel function defined in Eq. 13 with a positive ν. LetW be the The vector representing the expression of gene B in each cells' neighborhood can be approximated as where x is the column vector for expression of gene B. Similar to Eq. 25, the random forest model is built with additional features depicting the expression of a collection of background genes G. The signaling strength under this spatial distance is then indicated by the average of this feature 13 importance score over a collection of downstream genes. Computing this signaling strength for a sequence of spatial distances provides us a hint of the effective range of this particular signaling.

Cell-cell communication
With the inferred spatial cell-cell distance, we can analyze cell-cell communication by examining the intercellular signaling in a spatial setting.

Optimal transport cell-cell communication
For an intercellular signaling where we know the ligand, receptor(s) and downstream genes, we interpret the signaling process as a transport problem where the ligands are transported from their source distribution (described by the expression level of the ligand gene) to a destination distribution (depicted by the expression level of receptors and the downstream genes). Intuitively, if a cell highly expresses the receptor genes and the expression of downstream genes are consistent with the known up/down regulation relationship, this cell is likely to be a target cell in the signaling. For the source distribution in the transport problem, we simply use the expression level of ligands, where L i is the expression level of the ligand gene in cell i. A scoring is defined to quantify how consistent is the expression pattern of downstream genes according to the known up/down regulation relationships where n D is the number of downstream genes, D j,ℓ is the ℓth downstream gene in cell j, η ℓ is a scaling parameter for soft thresholding of gene expressions, and ν ℓ is positive for down-regulated genes and negative for up-regulated genes. A bigger value for |ν ℓ | is used if a steeper boundary is desired in the weight kernel defined in Eq. 13. If a strict consistency is desired, we can use multiplicative penaltŷ The destination distribution for the signaling is then described by Another working assumption is that it is more likely for ligands to get to their spatial neighbors than far away locations. This makes it natural to use our spatial cell-cell distance matrixD sc as our cost matrix for the optimal transport problem. Finally, we set up an optimal transport problem: arg min γ∈R n×n where KL is the KL divergence. Note that the last term (entropy regularization) in Eq. 37 was used to accelerate computation in previous applications. In this signaling application, this entropy term accounts for the stochasticity of ligand diffusion, i.e., it penalizes the case that all ligands from a cell go to the same closest cell. The optimal transport plan γ * S (i, j) is interpreted as a scoring for cell j receiving signal from cell i.
There are also cases where spatial distance for certain signaling is available by applying inference methods or from knowledge. In this case, we can tweak the transport cost in Eq. 37 to account for this constraint, where D S is the given spatial distance of signaling. The extremely high cost of transporting between cells with distance farther than the signaling distance prevents the mapping between these cells.
Compared to methods that analyze cell pairs independently, this approach naturally addresses an important property of ligand-receptor signaling, i.e. it is likely for a cell releasing signaling ligands to have shorter signaling range if it is surrounded by more consumers of this ligands (cells highly expressing receptors and show consistent behavior for downstream genes). Further more, this approach implicitly takes into account the spatial constraint and closely resembles the inferred cell-cell communication networks that are explicitly constrained by inferred spatial distance for signaling.

Additional spatial distance constraint in optimal transport cellcell communication
When prior knowledge about the effective range of a ligand is available, we can further constrain the optimal transport inference of signaling. We implement the extra spatial distance constraint by setting the transport cost between cells that exceeds this distance to an excessively large number. As a result, long connections will be eliminated and additional short connections might merge. We discuss the effect of adding a spatial distance constraint below. On the single-cell level, for illustration, we consider a simple system with three cells: a signal sender cell (cell 1) and two signal receiver cells (cell 2 and cell 3) (see Supplementary Fig. 22). Cell 3 is more likely a receiver cell than cell 2 (in other words, cell 3 has a greater mass than cell 2 as destinations in the optimal transport setup) but the distance between cell 3 and cell 1, d(cell 1, cell 3) is greater than d(cell 1, cell 2). Without the spatial constraint, we would identify a stronger connection between cell 1 and cell 3 than between cell 1 and cell 2 due to the conservation of mass. If a spatial constraint that is between d(cell 1, cell 3) and d(cell 1, cell 2) is applied, we would observe no mass going from cell 1 to cell 3 (thus the previous strong connection vanishes) but more mass going from cell 1 to cell 2 (the emergence of a shorter connection). In the optimal transport formulation, the transportation cost (a large cost if moving much mass along long distance) and the mass conservation violation (a penalty for unbalanced mass transport) are the two penalty terms in the objective function to be minimized. When a spatial constraint is applied, a connection whose distance is longer than this constraint will induce a huge penalty in the transportation cost while the elimination of this connection might lead to the emergence of new short connections which only induces a moderate penalty in mass conservation violation. Therefore, we may observe new connections when a spatial constraint is applied.
On the cluster level, there are also new connections of visually long distances when the spatial constraint is added (Supplementary Fig. 18). This is because we used a geometric average to represent the location of clusters. So there are clusters that are adjacent in space while their cluster centers are visually distant (e.g. cluster 4.1 and cluster 3.1 in Supplementary Fig. 18b). As a result, we may observe a new connection between clusters of visually long distance contributed by cell pairs of short distance located near the interfaces where the two clusters meet.

Communication index
We also give another simple approach by adding a spatial weight term to a scoring index similar to the one introduced in [26]. The scoring index quantifying the likelihood of cell i signaling to cell j is constructed as where α i,j measures the significance of ligand-(transporter)-receptor, where L i is the ligand in cell i, T i is the sum of transporter for this ligand in cell i, R j is the corresponding receptor in cell j. The index β j measuring the consistency of downstream genes in cell j is the same as constructed in Eq. 34 and Eq. 35.
expression of a gene is predicted using the rest of the spatial data and the entire scRNA-seq data. Several metrics were used depending on the type of spatial data (binary or continuous). Several established methods designated for this particular problem were considered: DistMap [13], Achim [1], and Seurat [21].

Impact of using unbalanced and structured optimal transport
We alter the parameters α and ρ in Eq. 6 to assess the usefulness of combining structured transport and allowing unbalanced transport. The parameter α ranges from 0 meaning regular optimal transport without structured information to 1 meaning the pure Gromov-Wasserstein distance. When the parameter ρ takes the value of +∞, this becomes balanced optimal transport and the balance constraint is relieved as ρ decreases. The exper-iment results suggest that both relaxing the balance constraint and adding the structured loss help improve the accuracy (Supplementary Fig. 1-4).

Intercellular gene-gene information flow
The concept of intercellular gene-gene information flow is in the perspective of information, i.e. how helpful observing a gene in the environment is for inferring the a gene in the cell. The pairs of genes we identify may be connected via various mechanisms of intercellular communications. Here we test our approach on known signalings. We anticipate that the information flow from a ligand gene to its target gene should be relatively higher. While we are not able to cover all target genes in a signaling, the score should still be higher on average if we examine a large unbiased collection of genes. We analyzed Dpp and Wg signaling and observed expected patterns (Supplementary Fig. 29).

Effect of number of background genes used in intercellular process analyses
Ideally, we would like to iterate over all the genes when computing the intercellular gene information flows or to use all other genes as background genes in the machine learning models when inferring distance for cell-cell signaling. However, this can become computationally intractable especially in the case of partial information decomposition. Therefore, we rank the genes by their correlation to the target gene and choose the top subset as the background genes. We altered the number of top genes used and the experiments shows that a moderate size of 50 genes is enough to capture the key features (Supplementary Fig. 24-28).

Cell-cell communication in drosophila embryo
Dpp signaling: • Ligands: dpp The downstream genes are taken from the supplementary table (S1 Dataset) of [8]. The genes marked as "Y" in the column called "Dpp targets" and are also present in scRNA-seq data are considered. Wg signaling: • Ligands: wg, Wnt5.
The genes CycD and Jra are taken from Kegg database [12] (Wnt signaling pathway -Drosophila melanogaster). Two other genes en [10] and dpp [29] are identified from literature. The genes for EGF signaling are taken from [14].
The genes of Wnt and BMP signaling are taken from Kegg database [12] (Wnt signaling pathway -Danio rerio and TGF-beta signaling pathway -Danio rerio). The genes for FGF signaling are taken from [9].

Spatial distance of signaling in drosophila embryo
We filtered out downstream genes whose variances are too small in the inference of signaling distance using random forest model. The following downstream genes are considered: • Dpp signaling: Ance, ush, Doc1, Doc2, Doc3, tup, egr • Wg signaling: CycD, Jra, en, dpp, Ubx In Wg signaling analysis, there is an additional gene Ubx [19] that was not considered in cell-cell communication because it is unknown whether it is up-or down-regulated by Wg.

Supplementary Figures
Supplementary Figure 1 Performance of structured and unbalanced optimal transport. The parameter is the weight for structured transport penalty and is the weight for the penalty term of unbalanced transport. (a) The auc of leave-one-out cross validation of drosophila embryo data with varying weights for unbalance penalty while the structured penalty weight =0. (b) The auc of leave-one-out cross validation of drosophila embryo data with varying weights for structured penalty while only balanced transport is allowed ( = ∞). (c, d) Same plots as (a) and (b) for zebrafish embryo data. Both scRNA-seq data and spatial data are taken from (Satija, R. et al., Nature Biotech., 2015).

Supplementary Figure 2
Leave-one-out cross validation on drosophila embryo data.
The ground truth (left columns) and prediction (right columns) of spatial gene expression in the leave-one-out cross validation of the 84 genes in spatial data of drosophila embryo. Additional drosophila embryo gene expression prediction.
The experimental data for these genes can be found in the Supplementary Figure  S4  The gene m4 is skipped since it is absent in the scRNA-seq data. The prediction is based on the scRNA-seq data, and the spatial data which does not include the above genes.

Supplementary Figure 4
Leave-one-out cross-validation zebrafish embryo data.
The ground truth (left columns) and prediction (right columns) of spatial gene expression in the leave-one-out cross-validation of the 47 genes in spatial data of zebrafish embryo. Both scRNA-seq data and spatial data are taken from ( Prediction accuracy with different preprocessing procedures. The similarity between scRNA-seq data and the spatial data is quantified by Spearman's correlation coefficient of the common genes. The numerical spatial data is used. The similarity within the scRNA-seq data is quantified by the Spearman's correlation coefficient of first 50 principal components. *Normalized data downloaded from the DREAM Single Cell Transcriptomics Challenge (Synapse ID: syn15665609). Spatial summary of spatially localized subclusters in drosophila embryo data.
(a) A visualization of spatial arrangement of subclusters. The average cell-cell distance between cells from two subclusters are used to describe distances between the subclusters. This distance between subclusters is fed to umap to generate coordinates for visualization. In the network, each subcluster is connected to its three closest neighbors where a darker connection means a shorter distance. (b) Spatial origin distributions of the subclusters. Spatial summary of spatially localized subclusters in zebrafish embryo data.
Both scRNA-seq data and spatial data are taken from (Satija, R. et al., Nature  Biotech., 2015). (a) A visualization of spatial arrangement of subclusters. The average cell-cell distance between cells from two subclusters are used to describe distances between the subclusters. This distance between subclusters is fed to umap to generate coordinates for visualization. In the network, each subcluster is connected to its three closest neighbors where a darker connection means a smaller distance. (b) Spatial origin distributions of the subclusters.   Clustering of genes based on spatial pattern in drosophila embryo data.
(a) A visualization of k-nearest-neighbor graphs (k=3) of genes where each node represents a gene and an edge means the two genes have similar spatial expression pattern. The differences between genes are quantified using Wasserstein distance with the spatial distance between single cells as the cost matrix for the optimal transport problem. Only highly variable genes are considered. (b) Average spatial patterns mapped to the original geometry for the gene clusters. Clustering of genes based on spatial pattern in mouse visual cortex data.
(a) A visualization of k-nearest-neighbor graphs (k=3) of genes where each node represents a gene and an edge means the two genes have similar spatial expression pattern. The differences between genes are quantified using Wasserstein distance computed using the spatial data.  Random forest inference for spatial range of Wg signaling with different number of background genes.
The inference of spatial range of Wg signaling in drosophila embryo data (ligand gene names: "wg", "Wnt5" in data) using different number of background genes (Ng) in the random forest model. The grey band represents the 95% confidence interval. The experiments were repeated three times with similar results. The effect of choosing different scaling parameters determining the significance of gene expression.
The scaling parameter can be regarded as a soft cutoff for deciding whether a gene is significantly expressed when studying the cell-cell communications. It can be observed for this Wg signaling analysis of the drosophila embryo that more significant signal senders and receivers can be prioritized by increasing this scaling parameter. Inside the violin plots are standard boxplots (median, 25th perceltile, 75th percentile, the bigger of minimum value and 25th percentile -1.5 interquartile range, and smaller of maximum value and 75th percentile + 1.5 interquartile range).

Supplementary Figure 17
Random forest inference for spatial range of Dpp signaling with different number of background genes.
The inference of spatial range of Dpp signaling in drosophila embryo data using different number of background genes (Ng) in the random forest model. The grey band represents the 95% confidence interval. The experiments were repeated three times with similar results.   Optimal transport communication with and without spatial distance cutoff.
The coordinates of cells in the 2d plots are obtained by UMAP map based on the inferred spatial distances between cells. The top 10000 cell-cell communications are plotted as curved lines. The color of the connections are same as the source cells. The colors of points correspond to subcluster labels. (a, b) Spatial cell-cell distance is used as transport cost matrix and no cutoff distance for signaling is used. (c, d) Similar to (a) and (b) with additional explicit spatial distance cutoffs (100 μm for Wg and 125 μm for Dpp).   The ith row and jth column quantifies the unique information provided about the jth gene (target gene) by the ith gene (source gene) in a spatial neighborhood of 25 μm. Different numbers of background genes (N g =10, 20, 50) are considered when computing the unique information. The genes with the highest intracellular correlation with the target genes are selected as background genes respectively for each target gene. The scatter plot shows comparison of the nonzero entries of the matrices quantified by the Pearson correlation coefficient.

Supplementary Figure 25
PID inference of intercellular gene regulation in drosophila data with different number of background genes under a spatial range of 50 μm.
The ith row and jth column quantifies the unique information provided about the jth gene (target gene) by the ith gene (source gene) in a spatial neighborhood of 50 μm. Different numbers of background genes (N g =10, 20, 50) are considered when computing the unique information. The genes with the highest intracellular correlation with the target genes are selected as background genes respectively for each target gene. The scatter plot shows comparison of the nonzero entries of the matrices quantified by the Pearson correlation coefficient.

Supplementary Figure 26
PID inference of intercellular gene regulation in drosophila data with different number of background genes under a spatial range of 75 μm.
The ith row and jth column quantifies the unique information provided about the jth gene (target gene) by the ith gene (source gene) in a spatial neighborhood of 75 μm. Different numbers of background genes (N g =10, 20, 50) are considered when computing the unique information. The genes with the highest intracellular correlation with the target genes are selected as background genes respectively for each target gene. The scatter plot shows comparison of the nonzero entries of the matrices quantified by the Pearson correlation coefficient. PID inference of intercellular gene regulation in drosophila data with different number of background genes compared to using 500 background genes.
The gene-gene information flow among the 20 most variable genes in the drosophila datasets were inferred using various numbers of background genes ranging from 10 to 500. All obtained gene-gene interactions were compared to using 500 background genes using the Pearson's correlation coefficients. PID inference of intercellular gene regulation in the mouse visual cortex data for with different number of background genes compared to using 300 background genes.
The gene-gene information flow among the 20 most variable genes in the drosophila datasets were inferred using various numbers of background genes ranging from 1 to 300. All obtained gene-gene interactions were compared to using 300 background genes using the Pearson's correlation coefficients. The distance scale is set to 1/10 of the longest spatial distance among the single cells.
Pearson's r

Supplementary Figure 29
Gene-gene information flows for signaling pathways.
We evaluate the PID approach to quantifying gene-gene information flow by applying it to known ligands and downstream genes involved in intercellular signaling. To this end, we compute the spatial gene-gene information flow from the ligand genes to a collection of highly variable genes. The flows to downstream target genes in the signaling pathway and to other genes are compared. The former is expected to be higher than the later. Spatial ranges of 125 μm and 100 μm (inferred by SpaOTsc) are used for Dpp signaling and Wg signaling respectively. Inside the violin plots are boxplots (median, 25th perceltile, 75th percentile, the bigger of minimum value and 25th percentile -1.5 interquartile range, and smaller of maximum value and 75th percentile + 1.5 interquartile range). The numbers of data points for the violin plots from left to right are 7, 941, 5, 943. The likelihood of signal sending and receiving cells of the RNA seqFISH+ data of mouse olfactory bulb.
The seven fields in the RNA seqFISH+ data are plotted separately.  The likelihood of signal sending and receiving cells of the scRNA-seq data of mouse olfactory bulb mapped to the space of Slide-seq data.
Each dot corresponds to a single cell in the scRNA-seq data and the position is determined based on the SpaOTsc mapping of this scRNA-seq data and the reference Slide-seq data. The likelihood of signal sending and receiving cells of the scRNA-seq data (WT1) of mouse olfactory bulb mapped to the space of RNA seqFISH+ data.
Each dot corresponds to a single cell in the scRNA-seq data and the position is determined based on the SpaOTsc mapping of this scRNA-seq data and the reference RNA seqFISH+ data.