DOT: a flexible multi-objective optimization framework for transferring features across single-cell and spatial omics

Single-cell transcriptomics and spatially-resolved imaging/sequencing technologies have revolutionized biomedical research. However, they suffer from lack of spatial information and a trade-off of resolution and gene coverage, respectively. We propose DOT, a multi-objective optimization framework for transferring cellular features across these data modalities, thus integrating their complementary information. DOT uses genes beyond those common to the data modalities, exploits the local spatial context, transfers spatial features beyond cell-type information, and infers absolute/relative abundance of cell populations at tissue locations. Thus, DOT bridges single-cell transcriptomics data with both high- and low-resolution spatially-resolved data. Moreover, DOT combines practical aspects related to cell composition, heterogeneity, technical effects, and integration of prior knowledge. Our fast implementation based on the Frank-Wolfe algorithm achieves state-of-the-art or improved performance in localizing cell features in high- and low-resolution spatial data and estimating the expression of unmeasured genes in low-coverage spatial data.


Main
The organization of cells within human tissues, their molecular programs and their response to perturbations are central to better understand physiology, disease progression and to eventual identification of targets for therapeutic intervention [1,2].
Single-cell RNA sequencing can profile a large part of the transcriptome of large portions of individual (single) cells.This has made these technologies (hereafter scRNA-seq) an essential tool for revealing distinct cell features (such as cell lineage and cell states) in complex tissues and has profoundly impacted our understanding of biological processes and the underlying mechanisms that control cellular functions [3][4][5].However, scRNA-seq requires dissociation of the tissue [6], losing the information about the spatial context and physical relationship between cells, that is critical to understand the functioning of tissues.
To overcome these limitations, there has been recent advancements in spatially resolved transcriptomics (SRT) methods [7][8][9].SRT methods measure gene expression in locations coupled with their two-or three-dimensional position.SRT methods vary in two axes: spatial resolution and gene coverage.On one hand, technologies such as Multiplexed Error-Robust Fluorescence In-Situ Hybridization (MERFISH) and In-Situ Sequencing (ISS), achieve cellular or even subcellular resolution [10], but are limited to measuring up to a couple of hundred pre-selected genes.On the other hand, spatially resolved RNA sequencing, such as Spatial Transcriptomics [11], commercially available as 10X's Visium, and Slide-seq [12], enable high-coverage gene profiling by capturing mRNAs in-situ but come at the cost of measuring these averaged within spots that include multiple cells.Thus, there is a trade-off between resolution and richness (gene coverage) of SRT data.
A natural strategy to provide a complete picture is to combine scRNA-seq data with high-resolution SRT to transfer dissociated cells to spatial locations or generally to combine scRNA-seq with low-resolution SRT is to estimate the composition of cell types in each spot.Alternatively, we can attempt to enrich the high-resolution SRT by predicting the expression of unmeasured genes.Integrating scRNA-seq and SRT is challenging for many reasons such as the limited number of genes shared across these modalities, differences in measurement sensitivities across technologies, and high computational cost for large-scale datasets.Recent methods mostly rely on the genes that are captured both by scRNA-seq and SRT without using the remaining genes captured in each modality, do not use the spatial relationships between cells in the spatial data, are limited to high or low resolution spatial data either in application or their underlying assumptions, and in many cases come with high computation cost for large instances [13].In Section 4.1 we discuss the related work in more details.
Neglecting the spatial context is equivalent to assuming random placement of spots in the space, which is in contrast to the established structure-function relationship of tissues [9].Considering only a subset of genes limits the applicability of these methods to cases where the two data sets share several informative genes, which might not be the case when different technologies are used for profiling, or when few genes are measured in the spatial data (e.g., in MERFISH).
In this article, we present DOT, a versatile and scalable optimization framework, to integrate scRNA-seq and SRT for localizing the cell features via a multi-criteria mathematical program.Our model does not require the expression profiles to be mRNA counts and is applicable to both high-and low-resolution SRT, in the form of inferring membership probabilities for the former and relative or absolute abundance of cell types in the latter.We adapt a generalization of Optimal Transport with a tailored objective to leverage spatial information and to go beyond the use of only genes that are expressed in both modalities at the same time.Our optimization model is novel in considering several practical aspects in a unified framework, including (i) spatial relations between different cell features, (ii) differences in measurement sensitivity of different technologies, (iii) heterogeneity of cell sub-populations, (iv) compositional sparsity and size of spatial locations at different spatial resolutions, and (v) incorporation of prior knowledge about expected abundance of cell features in situ.We present a very fast implementation for our model based on the Frank-Wolfe algorithm thereby ensuring scalability and efficient solvability in large-scale datasets.DOT has a broader application beyond cell type decomposition, including transferring continuous features such as expression of genes that are missing in SRT but present in scRNA-seq data.
DOT is freely available to facilitate its application and further development.

DOT is a versatile multi-objective optimization model for integrating spatial and single-cell omics
Given a reference scRNA-seq data (R for short), which is a collection of single cells each annotated with a categorical or continuous feature (such as cell type), and a target spatially resolved transcriptomics data (S for short), which consists of a set I of spots, associated with a location containing one or more cells, we wish to determine the abundances (in the case of multiple cells per spot) or single value (in the case of a single cell per spot) of the unobserved feature(s) in spots of S (see Fig. 1).In what

Observed Transferred
Sub populations Let X R c,g and X S i,g denote the expression profiles of sub-population c ∈ C and spot i ∈ I, respectively, for genes g ∈ G.We assume that X R c,g is the mean expression of gene g across the cells that belong to sub-population c ∈ C of R (see Section 4.2.2 for extension to heterogeneous sub-populations).Moreover, X S i,g is the aggregation of expression profiles of potentially several cells when S is low-resolution.A high-quality transfer should naturally match the expression of the common genes across R and S.

Multi-objective optimization framework
We ensure this by considering the following expression-focused criteria: (i) Matching expression profile of spots (Fig. 1b).Expression profile of each spot i ∈ I in S (i.e., X S i,: ) should match the expression profile transferred to that spot from R via Y (i.e, c∈C Y c,i X R c,: ).We penalize the dissimilarity of these vectors via: (ii) Matching expression profile of sub-populations (Fig. 1c).Expression profile of each sub-population c ∈ C in R should match the expression profile of spots assigned to this sub-population via Y : (iii) Matching gene expression maps (Fig. 1d).Expression map of each gene g ∈ G in S should be similar to the expression map of that gene as transferred from R via Y : In the above formulations, d cos is a scale-invariant metric based on cosine-similarity which measures the difference between two vectors regardless of their scales (Section 4.2.1).In addition to the expression-focused objectives, we may incorporate prior knowledge in the form of the spatial location of spots as well as the expected abundance of cell sub-populations using the following compositional criteria: (iv) Capturing spatial relations (Fig. 1e).Spots that occupy adjacent locations and have similar expression profiles are expected to be of similar compositions.Given P, the set of adjacent pairs of spots with similar expression profiles, we encourage similar composition profiles for these spots by penalizing where d JS is the Jensen-Shannon divergence and w ij captures similarity of expression profiles of spots i and j (Section 4.2.1).
(v) Matching expected abundances (Fig. 1f).If prior information about the expected abundance of cell categories in S is available (e.g., when R and S correspond to adjacent tissues or consecutive sections), then abundance of cell categories transferred to S should be consistent with the given abundances.We measure dissimilarity between the vector of expected abundances (denoted r) and abundance of cell categories in S via The expression-focused objectives naturally take precedence over the compositional objectives, especially when a large number of genes are common between R and S, but the compositional objectives are useful when the number of common genes is limited.
Note that objective (v) provides additional control over the abundance of cell types in S, but can be ignored if prior information about the abundance of cell types is not available.
We treat these criteria as objectives in a multi-objective optimization problem and to consider them simultaneously (i.e., produce a Pareto-optimal solution), we optimize Y against a linear combination of these objectives as formulated below, hereafter referred as the DOT model: Here, λ C , λ G , λ S , and λ A are the user-defined penalty weights, and n i is an upper bound on the expected size (number of cells) of spot i ∈ I (i.e., n i = 1 for high resolution SRT).For low-resolution SRT, we set n i = n for a pre-determined parameter n and let the model determine the size of the spots (see Section 4.4.1).
Next, we present an evaluation of the model, comparing its performance to the related work and highlight different aspects of DOT in different applications.Briefly, we evaluate the performance of DOT to transfer the cell type label of single-cell level spots in high-resolution SRT and decompose spots to cell type abundances in low-resolution SRT, and estimate the expression of genes that are missing in SRT but are measured in the reference scRNA-seq.Details of the datasets and performance metrics used for these experiments are presented in Appendix C and Section 4.4.2,respectively.

DOT locates cell types in high-resolution spatial data
Our goal with our first set of experiments is to evaluate the performance of different models in determining the abundance of cell types at each spot.We used the highresolution MERFISH spatial data of the primary motor cortex region (MOp) of the mouse brain [14], which contains the spatial information and cell type of 280,186 cells across 75 samples (Appendix C.1).Since the cell type represented in the spot is known in our high-resolution spatial data, we can use this information as ground truth when evaluating the performance of the different models.Details about the benchmark instances can be found in Section 4.4.3.
We compared performance of DOT against four models from the literature: RCTD [15], Tangram [16], Seurat [17], and SingleR [18] in transferring cell types from single-cell to high-resolution SRT.Given the multiclass classification nature of cell type prediction in high-resolution SRT, we also used RF [19] as a multiclass classifier baseline.
DOT dominates the three specialized decomposition methods and the base line classification methods in assigning correct cell types to the spots (Fig. 2a), and produces well-calibrated probabilities (Fig. 2b) and better captures the relationships between cell types in space (Fig. 2c), owing to its capacity to incorporate the spatial information through d S .We also observe that even with very few genes in common between SRT and the reference scRNA-seq data (e.g., |G| ≤ 75), DOT is able to reliably determine the cell type of spots in the space with high accuracy.In contrast, RCTD fails to produce results due to lack of shared information, and Seurat and Tangram produce results with low accuracy.The under-performance of Seurat is due to its over-fitting to the a prior distribution of cell types in the reference data, while Tangram struggles with the large number of cells in the reference data not being matched with the target spatial data.We also observe that DOT performs robustly under fluctuations in the gene expression.

DOT determines cell type abundances in low-resolution spatial data
Since there is no ground truth for real low-resolution spatial data such as Visium and Slide-seq, we produce ground truth low-resolution spatial data in an objective manner by reproducing measurements of low-resolution data by pooling adjacent cells in the   high-resolution spatial data of primary motor cortex of the mouse brain (MOp), primary somatosensory cortex of the mouse brain (SSp), and the developing human heart.We next used single-cell level spatial data coming from osmFISH technology [22] to produce multicell data for SSp (Section C.2). Subsequently, for the developing human heart, we used subcellular spatial data generated by the ISS platform [23] (Section C.3).We tested the performance of DOT against the five deconvolution methods on these two samples, results of which are illustrated in Fig. 3c.DOT outperforms other models in the human heart sample and is among the best-performing models in the mouse SSp sample.We also observe that DOT exhibits a uniform performance across different regions of the tissues, which implies that the performance of DOT is not sensitive to different regions/cell types of the tissue (compare to Tangram and Seurat in SSp and RCTD in human heart).These results further highlight the competitive performance of DOT and its robustness in identifying the cell type composition of spots across different tissues.

DOT estimates the expression of unmeasured genes in spatially resolved data accurately
Given that in high-resolution SRT typically only a few genes are measured, the expression of genes that were not measured in SRT can be estimated by transferring scRNA-seq to SRT.Therefore, we evaluate the performance of DOT in estimating the expression of missing genes in the high-resolution SRT using the spatial data from breast cancer tumor microenvironment [24] (see Appendix C.4).As the high-and lowresolution SRT in this dataset come from the same tissue section, we can use the gene expression maps in low-resolution SRT as a proxy for ground truth to evaluate the expression maps of the missing genes in the high-resolution SRT as estimated by DOT.
We started by evaluating the performance of DOT on genes that are present in the high-resolution spatial data as ground truth.In Fig. 4a we show a qualitative comparison of maps of eight genes related to breast cancer [25] produced by DOT with those of high-resolution (ground truth) and low-resolution data (approximate ground truth).The expression maps produced by DOT match almost perfectly with the ground truth expression maps.Both DOT and the ground truth high-resolution spatial data also match the low-resolution gene expression maps almost perfectly, which further validate the quality of the solution produced by DOT.Note that due to the single-cell resolution of the high-resolution spatial data colors are brighter.
Nonetheless, the spatial patterns match between all three rows.For a quantitative comparison of expression maps in the high-and low-resolution SRT, given that there is no one-to-one correspondence between single-cell spots in the highresolution and multicell spots in the low-resolution spatial data, we split the tissue into a 10 by 10 grid, and aggregated the expression of each gene within each tile.
Consequently, we obtained two 100 by 18,000 matrices, one for the ground truth lowresolution spatial data and another for DOT.Fig. 4c compares the column-wise cosine similarities across different genes.These results further confirm the ability of DOT in reliably estimating the expression of missing genes in high-resolution spatial data.

DOT is efficient and scalable
We designed the mathematical model and the solution method for DOT with particular attention to scalability and computational efficiency.In terms of algorithmic performance (Table 1), DOT takes on average 426 seconds to solve each instance of the high resolution spatial data, which is an order of magnitude faster than RCTD, Tangram, and RF, and is comparable to Seurat and SingleR.Similarly, DOT took on average 433 seconds to solve the low-resolution instances of MOp, which proved to be more than twice faster than Seurat, and orders of magnitude faster than RCTD, SPOTlight, C2L and Tangram, further highlighting the superiority of DOT in terms of both accuracy and computational efficiency.

Discussion
Single-cell RNA-seq and spatially-resolved imaging/sequencing technologies provide each a partial picture in understanding the organization of complex tissues.To obtain a full picture, computational methods are needed to combine these two data modalities.
We present DOT, a versatile, fast and scalable optimization framework for transferring cell sub-populations from a reference scRNA-seq data to tissue locations, thereby transferring categorical and continuous features from the reference data to the spatial data.DOT can help to improve our understanding of cellular functions and tissue architecture.Our optimization framework employs several alignment measures to assess the quality of transfer from different perspectives and determines the relative or absolute abundance of different sub-populations in situ by combining these metrics in a multi-objective optimization model.Our metrics are designed to account for potentially different gene expression scales across the two modalities.Moreover, based on the premise that nearby locations with similar expression profiles posses similar compositions, our model leverages the spatial information as well as both joint and dataset-specific genes in addition to matching the expression of common genes.
In addition, whenever prior information about the abundance of cell features in the spatial data is available (e.g., estimated from a similar tissue), our model gives the user the flexibility to match these abundances to a desired level.Our model also takes into account inherent heterogeneity of cell sub-populations through a pre-processing step to ensure that refined sub-clusters of the reference are transferred.
Our model is applicable to both high-resolution (such as MERFISH) and lowresolution (such as Visium) spatial data and can be used for gene intensity or expression count data.While we use the same optimization framework for both highand low-resolution spatial data, our model has specific features to account for the distinct features of these modalities.In particular, our model can determine the size of spots in low-resolution spatial data and accounts for sparsity of composition of spots.4 Methods

Related work
Several decomposition methods (also known as deconvolution methods) have been proposed in recent years [13].As cell type decomposition, particularly in the highresolution spatial data, is inherently a multiclass classification task, classification methods, such as Random Forests [19], can be used for tackling this problem.However, because of the domain-specific properties of this problem, including differences in gene coverage, resolution, measurement sensitivity, and modality-specific characteristics, there has been an increased interest in improvement and new method development to aggregate scRNA-seq and SRT since the initial efforts [28,29].
SPOTlight [20]  While the aforementioned methods are designed specifically for low-resolution spatial data, some are also applicable to high-resolution spatial data.Among the methods that are specialized for high-resolution spatial data, Tangram [16] incorporates a deep learning model to find the best placement of single cells in spots using a designed loss function and can thus carry cell type information as a byproduct.Seurat V3 workflow [17] is a widely-used toolkit for analyzing scRNA-seq data, which offers an "anchoring" technique based on mutual nearest neighbours classifier for aligning two modalities in the space of principal components.
From a methodological standpoint, our formulation generalizes Optimal Transport (OT) (see Appendix B), which is a way to match, with minimal cost, data points between two domains embedded in possibly different spaces using different variants of the Wasserstein distance [30][31][32][33].Over the past years, OT has been applied to various machine learning problems in a wide variety of contexts such as generative modeling [34], feature aggregation [35], dataset denoising [36], generalization error prediction [37], graph matching/classification [38], and domain adaptation [39].In particular, OT has been employed in computational biology with applications such as transporting entities from one cross sectional measurement to the next using unbalanced dynamic transport [40], studying developmental time courses and understanding the molecular programs that guide differentiation during development [41], reconstructing developmental trajectories from time courses with snapshots of cell states and lineages [42], reconstructing the organization of cells in the tissue [43,44] and alignment of spatial omics [45].In addition, computational pipelines with OT components have been developed to facilitate applications of OT in computational biology [26].
In Appendix B we establish the connections between our formulation and OT formulations, and highlight the distinct features of our model that make it more suitable for the task of transferring annotations from the reference sub-populations to high-or low-resolution spatial data.Briefly, we note that our distance functions d i and d S share elements with Fused Gromov-Wasserstein (FGW) [46], which is also implemented as part of moscot [26].Indeed, we present metrics for R and S for which the resulting FGW encourages similar compositions for adjacent spots with similar expression profiles, thereby its connection to our definition of set P and our distance function d S .
Besides the specialized distance functions included in the objective function of DOT that measure quality of the transport map from different practical perspectives, there are other substantial differences between the common components of our formulation and FGW.The first difference is that OT formulations, including FGW, construct their transportation cost matrix by assuming that each spot is assigned to exactly one subpopulation, discarding the fact that spots in low-resolution spatial data are composed of multiple cells coming from potentially different sub-populations.In contrast, our [46], making DOT more appealing from a computational view for large-scale datasets.

Deriving the distance functions
To assess dissimilarity between expression vectors a and b, we introduce the distance function where cos (a, b) = 1 ∥a∥∥b∥ ⟨a, b⟩.We note that, unlike cosine dissimilarity (i.e., 1 − cos(•, •)), d cos is a metric distance function.Moreover, d cos is quasi-convex for positive vectors a and b, and is scale-invariant, in the sense that it is indifferent to the magnitudes of the vectors.This is by design, since we want to assess dissimilarity between expression vectors regardless of the measurement sensitivities of different technologies.
When assessing the gene expression profiles, this also allows to measure the differences regardless of the size of spots and cell sub-populations.
With this distance metric, by minimizing d i (Y ) as defined in Eq. ( 1), we ensure that the vector of gene expressions in spot i ∈ I (i.e., X S i,: ) is most similar to the vector of gene expressions transferred to spot i through Y (i.e., c∈C Y c,i X R c,: ).Similarly, with d c (Y ) as defined in Eq. ( 2), we minimize dissimilarity between centroid of subpopulation c ∈ C in R (i.e., X R c,: ) and its centroid in S as determined via Y , i.e., , where ρ c = i∈I Y c,i is the total number of spots in S assigned to c.Given the scale-invariance property of d cos , we can drop 1/ρ c and derive Eq. ( 2) as .
We also note that d g (Y ) as defined in Eq. ( 3) measures the difference between the expression map of gene g ∈ G in S (i.e., X S :,g ) and the one transferred to S through Y (i.e., c∈C Y c,: X R c,g ) regardless of the scale of the expression of g in S and R up to a constant multiplicative factor.
Our goal with objective (iv) as defined in Eq. ( 4) is to leverage the spatial information and potentially features that are contained in S but not in R to encourage spots that are adjacent in the tissue and exhibit similar expression profiles to attain similar cell type compositions.(Note that we do not assume that all adjacent spots should attain similar cell type compositions.)To achieve this goal, we define P as to denote the set of pairs of spots (i, j) that are adjacent (∥x i − x j ∥ ≤ d) and have similar expression profiles (w i,j ≥ w), with x i denoting the spatial coordinates of spot i in R 2 or R 3 , and w ij = cos(X S i,: , X S j,: ) denoting the cosine similarity of spots i and j according to the full set of genes measured in S (i.e., G S ).Here, d is a given distance threshold and w is a cutoff value for cosine similarity.As a larger w results in a smaller set P, we can ensure that d S can be computed linearly in the number of spots |I| by choosing a proper value for w such that |P| = O(|I|) (see also Section 4.4.1).
We employ Jensen-Shannon divergence defined as to measure dissimilarity between distributions q and p, where D KL (p∥q) = j p j log(p j /q j ) is the Kullback-Leibler divergence [47].We remark that d JS (p, q) is strongly convex and does not require absolute continuity on distributions q and p [48].
Finally, if prior information about the expected abundance of cell types in S is available (e.g., estimated from a neighboring single-cell level tissue), we denote the expected abundance of cell type c ∈ C in S by r c .Note that abundance of cell type c ∈ C in S according to Y is ρ c := i∈I Y c,i .Since r and ρ need not be mutually continuous, we employ d JS (ρ, r) in Eq. ( 5) to measure the difference between r and ρ.

Cell heterogeneity
While the cell annotations such as cell types often correspond to distinct subpopulations of cells, significant variations may naturally exist within each subpopulation.This means a single vector X R c,: may not properly represent the distribution of cells within sub-population c.Consequently, transferring c solely based on the centroid of cells that belong to c may not capture these variations.To capture this intrinsic heterogeneity, we cluster each sub-population into predefined κ smaller groups using an unsupervised learning method, and produce a total of κ|C| centroids to replace the original |C| centroids.With this definition of centroids, we treat all terms as before, except d A , since prior information about sub-populations (and not their sub-clusters) are available.
Note that this approach can be extended to singleton sub-clusters, in which case DOT transfers the individual cells from the reference scRNA-seq data to the spatial data.However, transferring individual cells may be computationally expensive and prone to over-fitting, particularly when the reference data and the spatial data are not matched or when there is significant drop-out in the reference scRNA-seq data.
In general, we treat the sub-clusters with very few cells as outliers and remove them to obtain a set K c of sub-clusters for sub-population c ∈ C. Once Y is obtained, k∈Kc Y k,i determines the abundance of sub-population c in spot i.

Sparsity of composition
As previously discussed, spatial data are either high-resolution (single-cell level) or low-resolution (multicell level).In the case of high-resolution spatial data, given that each spot corresponds to an individual cell (i.e., n i = 1), we expect that spots are pure (as opposed to mixed), in the sense that we prefer Y c,i close to 0 or 1.In general, assuming that size of spot i is ni (i.e., ni = c∈C Y c,i ) and Y c,i ∈ {0, ni }, then Y c,i = ni for exactly one category c and is zero for all other categories.Consequently, for binary-valued Y we obtain which is linear in Y for fixed ni .As linear objectives promote sparse (or corner point) solutions, we may control the level of sparsity of the solution by introducing a parameter θ ∈ [0, 1] and redefining d i (Y ) as Note that a higher value for θ yields a sparser solution.Indeed, with θ = 1 and zero weights assigned to other objectives, the optimal solution will be completely binary.
Note that ni acts as a penalty weight and can be set to a fixed value (e.g., n i ).

A fast Frank-Wolfe implementation
We propose a solution to the DOT model based on the Frank-Wolfe (FW) algorithm [49,50], which is a first-order method for solving non-linear optimization problems of the form min x∈X f (x), where f : R n → R is a (potentially non-convex) continuously differentiable function over the convex and compact set X. FW operates by replacing the non-linear objective function f with its linear approximation f (x) = f (x (0) ) + ) at a trial point x (0) ∈ X, and solving a simpler problem x = arg min x∈X f (x) to produce an "atom" solution x.The algorithm then iterates by taking a convex combination of x (0) and x to produce the next trial point x (1) , which remains feasible thanks to convexity of X.The FW algorithm is described in Algorithm Algorithm 1: Frank-Wolfe algorithm for DOT Compute the atom solution Ŷ (t) : for each spot i ∈ I do 6 Find the current best category ĉ = arg min c∈C {∆ 1, in which f (Y ) is the objective function in Eq. ( 6).Implementation details can be found in Appendix A.
While the DOT model is not separable, its linear approximation can be decomposed to |I| independent subproblems, one for each spot i ∈ I.This is because, unlike conventional OT formulations, we do not require the marginal distribution of cell subpopulations (i.e., i∈I Y c,i ) to be equal to their expected distribution (i.e., r c ), but have penalized their deviations in the objective function using d A defined in Eq. ( 5).
The subproblem i then becomes min ⟨Y :,i , ∆ :,i ⟩ : which has a simple solution.Denoting the category with smallest coefficient by ĉ, if cost coefficient of ĉ is negative then Y ĉ,i = n i , otherwise Y ĉ,i = 1.Consequently, Y c,i = 0 for all other categories.This property of Algorithm 1 enables it to efficiently tackle problems with large number of spots in the spatial data.

Parameter setting
In its most general form, our multi-objective formulation for DOT involves the penalty weights λ C , λ G , λ S and λ A in Eq. ( 6), the upper bound on size of spots n in Eq. ( 8), and the spatial neighborhood parameters w and r that derive the definition of spatial pairs P in Eq. (10).Here, we show how all of these parameters can be inferred from the data, hence eliminating the need for the user to tune these parameters.
We set the penalty weights in such a way that all objectives contribute equally to the objective function.More specifically, we set For the low-resolution case, we employ a generalized linear regression model to estimate N (see Appendix A.2). We also set λ S = |I| n|P| as it is not difficult to verify that 0 ≤ d S (Y ) ≤ n|P| when Jensen-Shannon divergence is computed in base 2 logarithm.
Similarly, whenever prior information about the expected abundance of sub populations (i.e., r) is available, we scale r such that c∈C r c ≈ N and set λ A = |I| N =1 n .When such information is not available, we turn off this objective by setting λ A = 0.
We set the sparsity parameter θ = 1 for high-resolution SRT, and set θ = 0 for low-resolution SRT.To capture heterogeneity of sub-populations, we clustered each sub-population c ∈ C into κ = 10 clusters and filtered out the sub-clusters containing less than 1% of the total number of cells in c.To compute the distance threshold d, we computed the Euclidean distance of each spot to its 8 closest spots in space 1 , yielding 8|I| values.We then took d as the 90 th percentile of these values.Finally, we set w to the maximum of 0.6 and the largest value that maintains |P| ≤ |I| to ensure meaningful spatial neighborhoods and that d S scales linearly in the number of spots for the sake of computational efficiency.
For RCTD, SPOTlight, Tangram, and C2L we used the default parameters suggested by the authors with the following exceptions.For RCTD we set the parameter UMI min to 50 to prevent the model from removing too many cells from the data.Given the large number of cell types in the mouse MOp datasets, for SPOTlight we reduced the number of cells per cell type to 100 to enhance the computation time.Similarly, as Tangram was not able to produce results in a reasonable time for the MOp instances, we randomly selected 500 cells per cell type to reduce the computation time.For C2L, we used 20000 epochs to balance computation performance and accuracy.For Seurat and SingleR, we followed the package documentations, with functions used with default parameters.For RF we used the implementation provided in the R package ranger [51] with all parameters set at their default values.

Performance metrics
We used three metrics for comparing the performance of different models in predicting the composition of spots.In our high-resolution spatial data coming from the MOp region of mouse brain, we know the cell type of each single-cell spot given as P c,i = 1 if spot i is of type c, and P c,i = 0 otherwise.We can therefore treat the cell type prediction as a multiclass classification task.
Accuracy is the proportion of correctly classified spots (i.e., sum of the main diagonal in the confusion matrix) over all spots.We also use Brier Score, also known as mean squared error, to compare the accuracy of membership probabilities produced by each model: where Y c,i is the probability predicted by the model that spot i is of cell type c.As Brier Score is a strictly proper scoring rule for measuring the accuracy of probabilistic predictions [52], lower Brier Score implies better-calibrated probabilities.
Besides the cell type that each spot is annotated with, we can produce a cell type probability distribution for each spot by considering the cell type of its neighboring spots, using a Gaussian smoothing kernel of the form where K i,j = exp −∥x i − x j ∥ 2 /2σ 2 and σ is the kernel width parameter which we set to 0.5 d.Note that as spot j becomes closer to spot i, its label contributes more to the probability distribution at spot i.Using these probabilities, we also introduce the Spatial Jensen-Shannon (SJS) divergence to compare the probability distributions assigned to spots (i.e., Y ) with the smoothed probabilities (i.e., P ) where d JS (Y :,i , P:,i ) is the Jensen-Shannon divergence between probability distributions Y :,i and P:,i with base 2 logarithm as defined in Eq. (11).
Unlike the high-resolution spatial data, the ground truth P c,i in the low-resolution spatial data corresponds to relative abundance of cell type c in spot i.We can therefore assess the performance of each model by comparing the probability distributions P :,i and the estimated probabilities (i.e., Y :,i ) using Brier Score or Jensen-Shannon metrics.

Data preparation
For experiments on transferring cell types to high-resolution spatial data (Section 2.2), with each sample of the MERFISH MOp (see Appendix C.1), we created a reference single-cell data using all the 280,186 cells, except the cells contained in the sample, and the 254 genes to estimate the centroids of the 99 reference cell types.We further created 15 high-resolution spatial datasets for each sample (i.e., a total of 1125 spatial datasets) as follows.To simulate the effect of number of shared features between the spatial and scRNA-seq data, we assumed that only a subset of the 254 genes are available in the spatial data by selecting the first |G| genes, where |G| ∈ {50, 75, 100, 125, 150} (i.e., 20%, 30%, 40%, 50%, 60% of genes).Moreover, to simulate the effect of differences in measurement sensitivities of different technologies, we introduced random noise in the spatial data by multiplying the expression of gene g in spot i by 1 + β i,g , where For experiments on estimating the expression of unmeasured genes in low-coverage spatial data (Section 2.4), we matched the common capture areas of high-and lowresolution spatial data using the Hematoxylin-Eosin (H&E) images accompanying these spatial data (Supplementary Fig. A1), which corresponded to 134,664 cells in the high-resolution and 3,928 spots in the low-resolution spatial data.Given that the task at hand is to estimate the expression of missing genes in the high-resolution spatial data, we performed community detection on the graph of shared nearest neighbors of cells in scRNA-seq using the Leiden implementation in [17], which is common practice in single-cell analysis and is used as a first step towards cell sub-population identification (note that the reference scRNA-seq does not contain cell type annotations).
This resulted in 218 clusters; we then transferred the centroids of these clusters to the high-resolution spatial data.(We also tried as high as 1000 fine-grained clusters but got essentially the same results.)

Data availability
Publicly available single-cell RNA-seq and spatial data can be accessed via the following accession numbers or the links provided.MERFISH data of mouse MOp [14] can be accessed at the Brain Image Library: https://doi.org/10.35077/g.21.
Single-cell RNA-seq data of mouse MOp [53] and SSp [54] can be accessed at the NeMO Archive for the BRAIN Initiative Cell Census Network via https:// assets.nemoarchive.org/dat-ch1nqb7and https://assets.nemoarchive.org/dat-jb2f34y,respectively.osmFISH data of mouse SSp is available at http://linnarssonlab. org/osmFISH/.ISS and scRNA-seq data of the developing human heart [23] is available at the European Genome-phenome Archive via accession number EGAS00001003996.Xenium, Visium and scRNA-seq data of human breast cancer [24] can be accessed at https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast.More detailed description of these datasets can be found in Appendix C.

Code availability
The code is open source and freely available at https://github.com/saezlab/dot.and set the negative entries of Y to 0. Given that |C| is typically small, the matrix inversion in Eq. (A1) can be done easily.Moreover, given that X S and X R are count matrices, c i Y c,i gives an estimate on the total number of cells that can fit in S.
In the third solution, we simply set Y c,i = rc c ′ r c ′ n for each i and c.Note that this solution is optimal for d A .We then set the initial solution as the convex combination of these three solutions with weights 0.4, 0.4, 0.2, respectively.

A.3 Derivatives
To find the derivatives of d i (Y ) and d c (Y ), defined in Eq. ( 1) and Eq. ( 2), we introduce auxiliary quantities XS := Y ⊤ X R and XR := Y X S to denote the expressions transferred through Y to spots and cell sub-populations, respectively.Derivatives for d i (Y ) and d c (Y ) can then be calculated as: Finally, the derivatives for d A defined in Eq. ( 5) can be calculated as: (even a partially unbalanced Fused Gromov-Wasserstein formulation; see below) cannot distinguish between the size of different spots.
(iii) Finally, when reliable information about the abundance of sub-populations is not available, even a partially unbalanced OT formulation may not be appropriate and the OT formulation results in a trivial solution in which each spot gets assigned to its closest sub-population independently of other spots.Note that our centroid distance function d c and gene map distance function d g defined in Eq.
(2) and Eq. ( 3), respectively, prevent such a trivial solution even when no prior information about the abundance of sub-populations is available.
The second link between our formulation and variants of OT can be characterized via the Fused Gromov-Wasserstein (FGW) formulation, a variant of OT for matching structured data.In our application, given M R and M S as metrics in the space of R and S, which denote the pairwise dissimilarity between elements of R and S, respectively, FGW combines the linear cost c∈C i∈I C c,i Z c,i with the 2-Gromov-Wasserstein distance [60] and replaces the objective function in Eq. (B2) with for some α ∈ [0, 1].From this perspective, the GW distance component of (B5) can capture the spatial relations between spots.In the following, we show how our spatial distance function d S defined in Eq. ( 4) is related to this distance function for a particular choice of metrics M R and M S .Observe that ⟨Z :,i , Z :,j ⟩ measures similarity between composition of spots i and j.Consequently, for a discrete metric M R (i.e., when sub-populations are radically different), minimizing GW(Z) encourages spots i and j to acquire similar compositions when 2M S i,j − 1 > 0, discourages spots i and j from acquiring similar compositions when 2M S i,j − 1 < 0, and is indifferent to the composition of spots i and j when 2M S i,j − 1 = 0. To produce a metric M S that captures the dissimilarity of spots in terms of their locations and expressions, we define D 1 i,j and D 2 i,j to represent distance of spots (i, j) with respect to their locations and expressions, respectively i,j = d cos X S i,: , X S j,: , where d is a given distance threshold, and D 2 i,j is computed with respect to all genes in S (i.e., G S ).Finally, we take M S to be the average of D 1 and D 2 : Remark 1. M S is a metric in the domain of S, since both D 1 and D 2 are metrics.
Remark 2. With the definition of M S in Eq. (B6) and M R a discrete metric, GW(Z) encourages adjacent spots to attain similar compositions if their expressions are similar, (ii) discourages distant spots from attaining similar compositions if their expressions are different, and (iii) is indifferent to pair (i, j) when i and j are distant or different in expressions, but not both.
From this perspective, our spatial distance function d S defined in Eq. ( 4) specializes GW(Z) to encouraging adjacent spots to attain similar compositions if their expressions are similar.Note that our definition of set of spatial pairs P given in Eq.
(10) uses the same distance threshold d.However, given the non-convex and quadratic nature of GW(Z), our d S distance function is computationally more appealing as it is convex and scales linearly with the number of spots.

Fig. 1 :
Fig.1: Overview of inputs and outputs of DOT and its optimization framework.a) From left to right: DOT takes two inputs: (i) spatially resolved transcriptomics data, which contains spatial measurements of genes at either high or low resolution spots and their spatial coordinates, and (ii) reference singe-cell RNA-seq data, which contains single cells with categorical (e.g., cell type) or continuous (e.g., expression of genes that are missing in the spatial data) annotations.DOT employs several alignment objectives to locate the sub-populations and the annotations therein in the spatial data.The alignment objectives ensure a high quality transfer from different perspectives: b) the expression profile of each spot in the spatial data (left) must be similar to the expression profile transferred to that spot from the reference data (right), c) the expression profile of each sub population in the reference data (left) must be similar to the expression profile of that sub population inferred in the spatial data (right), d) expression map of each gene in the spatial data (left) must be similar to expression map of that gene as transferred from the reference data (right), e) spots that are both adjacent and have similar expression profiles should have similar compositions, and f ) if prior knowledge about the expected relative abundance of sub-populations is available, the transfer should retain the given abundances.

Fig. 2 :
Fig. 2: Performances of transfer of cell types in high-resolution spatial data as function of the gene coverage in the spatial data (x-axis) and as function of different amounts of noise in gene expression (φ).Points represent the median of 75 values, and the shaded areas correspond to their interquartile interval.SingleR does not produce probabilities and is compared based on Accuracy only.

Fig. 3 :
Fig. 3: (a) Synthetic low-resolution SRT from high-resolution SRT.Dots represent cells and tiles represent multicell spots.(b) Performance of the algorithms in the lowresolution spatial data across 75 samples of MOp.Each point denotes the average performance across all spots in the sample.(c) Distribution of performance of models on each individual spot in the low-resolution spatial data of Mouse SSp (top) and developing human heart (bottom).Each subplot shows the distribution of prediction error based on the Jensen-Shannon divergence at each spot in the spatial data, with the average value over all spots given on top of each plot.

Fig. 3a illustrates
Fig. 3a illustrates a sample low-resolution SRT obtained from the high-resolution MERFISH data of a MOp tissue.In Fig. 3b we show the comparison of the performance of DOT against RCTD, SPOTlight [20], cell2location (C2L) [21], Tangram and Seurat in determining the cell type composition of the multicell spots created based on the MOp dataset (see Section

Fig. 4 :
Fig. 4: (a) Expression map of eight breast cancer markers measured in both Xenium (ground truth; top) and Visium (low-resolution proxy; bottom), and as transferred from scRNA-seq to Xenium using DOT (estimated; middle).Brighter means higher expression.(b) Expression map of five breast cancer markers that are measured in Visium (bottom) but are missing in Xenium and are transferred from scRNA-seq using DOT (top).(c) Cosine similarity between expression maps of Visium and DOT for the genes that are not measured in Xenium.

Fig. 4b illustrates
Fig.4billustrates the expression maps of five genes associated with breast cancer that are not measured in the high-resolution spatial data but are estimated by DOT.

For
instance, in the context of inferring cell type composition of spots, this allows us to produce pure cell type compositions for high-resolution spatial data and mixed compositions for low-resolution spatial data.While our optimization model in its most general form involves several components, we have designed a solution method based on the Frank-Wolfe algorithm with special attention to scalability to large-scale reference and spatial data.Moreover, our implementation reduces involvement of the user in parameter tuning by estimating the objective weights and other hyper parameters of the model from the data, thereby facilitating application of DOT to different problems with minimal implementation effort.Given that our model theoretically generalizes optimal transport (see Section 4.1 and Appendix B), we envision that DOT can be integrated with OT-based computational frameworks such as moscot[26] in the future.Using experiments on data from mouse brain, human heart, and breast cancer, we showed that DOT predicts the cell type composition of spots and expression of genes in spatial data with high accuracy, achieving and often outperforming the state-of-the-art methods both in terms of predictive performance and computation time.Although we demonstrated the application of DOT in transferring cell type labels and inferring the expression of missing genes, our model can be used for transferring other features such as Transcription Factor and pathway activities inferred from the reference scRNA-seq data[27].Additionally, our optimization framework can potentially be extended to alignment of spatial multiomics by exploiting the spatial information of the different data types.As our formulation is hypothesis-free (i.e., does not rely on statistical assumptions based on mRNA counts), DOT naturally extends to applications in other omics technologies.
uses non-negative matrix factorization regression to factorize the scRNA-seq count matrix into topic profile and topic distribution matrices.SPOTlight then uses non-negative least squares regression to model the gene expressions in spots as a product of the topic profile matrix learned from scRNA-seq and a topic distribution matrix, which is then used to determine the cell type composition of spots.Robust cell type decomposition (RCTD)[15] fits a statistical model by maximum-likelihood estimation, assuming a Poisson distribution for the expression of each gene at each spot.Cell2location is another statistical model which assumes a two-step Bayesian model for inferring cell type composition of spots[21].In the first step, it estimates reference cell type centroids from single-cell profiles.In the second step, cell2location uses these reference centroids to decompose mRNA counts at individual spatial locations into reference cell types.
is in the range of 0 and |I|, while c∈C d c (Y ) and g∈G d g (Y ) are upperbounded by |C| and |G|, respectively.We set the upper bound on the size of spots to n = N |I| where N is the total number of cells that can fit the spatial data.Clearly, N = |I| in high-resolution SRT since each spot is at single-cell resolution, thus n = 1.
φ) with φ ∈ {0, 0.25, 0.5}.We produced ground truth for low-resolution MOp using the common subclass annotations between MERFISH MOp and scRNA-seq MOp[53] (see Appendix C.1) as follows.For each of the 75 MERFISH MOp samples, we randomly assigned each cell in the MERFISH MOp data to a cell in the scRNA-seq MOp data of the same subclass.Next, we lowered the resolution of spatial data by splitting each sample into regular grids of length 100µm and aggregated the expression profiles of cells within each tile as the expression profile of the respective spots.
∥ 3 ⟨X R c,: , XR c,: ⟩ .Similarly, we may derive the derivatives for d g (Y ) defined in Eq. (∥ 3 ⟨X S :,g , XS :,g ⟩The derivatives for d S defined in Eq. (4) can be computed as∂d S ∂Y c,i = 1 2 j∈I:(i,j)∈P or (j,i)∈P log 2Y c,i Y c,i + Y c,j.

Proposition 1 .= 2 Z
Let β = i∈I j∈I (1 − M S i,j ) 2 p i p j .Assuming that M R is a discrete metric so that M R c,c = 0 and M R c,k = 1, for c, k ∈ C, c ̸ = k, then GW(Z) = β + i∈I j∈I 2M S i,j − 1 ⟨Z :,i , Z :,j ⟩ Proof.Given M R c,k = 1 for c ̸ = k and M R c,c = 0, we obtain GW(Z) = i∈I j∈I c∈C M S i,j 2 Z c,i Z c,j + i∈I j∈I c∈C k∈C,k̸ =c 1 − M S i,j 2 Z c,i Z k,j c,i Z c,j + i∈I j∈I c∈C k∈C 1 − M S i,j 2 Z c,i Z k,j = i∈I j∈I 2M S i,j − 1 ⟨Z :,i , Z :,j ⟩ + β,where we have used β = i = p i and k∈C Z k,j = p j .

Table 1 :
Average computation times (in seconds) across different experiments.