Abstract
A pressing challenge in spatially resolved transcriptomics (SRT) is to benchmark the computational methods. A widelyused approach involves utilizing simulated data. However, biases exist in terms of the currently available simulated SRT data, which seriously affects the accuracy of method evaluation and validation. Herein, we present scCube (https://github.com/ZJUFanLab/scCube), a Python package for independent, reproducible, and technologydiverse simulation of SRT data. scCube not only enables the preservation of spatial expression patterns of genes in referencebased simulations, but also generates simulated data with different spatial variability (covering the spatial pattern type, the resolution, the spot arrangement, the targeted gene type, and the tissue slice dimension, etc.) in referencefree simulations. We comprehensively benchmark scCube with existing singlecell or SRT simulators, and demonstrate the utility of scCube in benchmarking spot deconvolution, gene imputation, and resolution enhancement methods in detail through three applications.
Similar content being viewed by others
Introduction
The emergence and rapid development of spatially resolved transcriptomics (SRT) technologies offer an unprecedented opportunity to understand the cell composition, molecular architecture, and functional details of tissues at spatial levels^{1}. Various emerging cuttingedge technologies, including imagingbased^{2,3,4}, spatial barcoding nextgeneration sequencing (NGS)based^{5,6,7}, and laser capture microdissection (LCM)based^{8,9} approaches, have been successfully employed across diverse biological fields, such as developmental biology^{10}, cancer^{11}, and neuroscience^{12,13}.
As SRT data have become available, there has been an increasing advancement of computational tools for downstream analysis. Currently, more than 300 software packages have been developed for various spatial transcriptomic data analysis^{14}, such as spatially variable gene detection^{15,16,17,18}, cell type deconvolution^{19,20,21,22,23,24,25,26}, unmeasured gene imputation^{27,28,29,30,31}, spatial domain identification^{18,32,33}, and spatial cellcell interaction inference^{34,35}. While these computational methods are often based on reasonable assumptions, it is difficult to benchmark and evaluate their performance without gold standards.
Currently, a wildlyused approach for assessing the performance of computational methods involves utilizing simulated SRT data constructed from scRNAseq data. However, variations exist in terms of the scRNAseq data selection and the approach used to generate simulated SRT data across different benchmarking experiments. In most cases, the simulated data are constructed specifically based on the assumptions that underlie the computational method being evaluated, which may introduce bias in the evaluation process. Additionally, the lack of reproducibility in the description of most simulation steps can impede the potential reuse of these methods by other researchers. Several methods have recently emerged to generate synthetic SRT data. For example, SRTsim simulates the gene expression counts with a count model inferred from the reference data and then allocates the simulated counts to spatial locations in the synthetic data^{36}. scDesign3 applies a probabilistic model which can incorporate diverse cell covariates to simulate the gene expression changes across spatial locations^{37}. However, both SRTsim and scDesign3 make the simulations based on the SRT data, leading to inherent limitations of the generated data, such as limited gene detection and low cellular resolution. Furthermore, scDesign3 cannot simulate the SRT data with new spatial locations specified by users, which greatly restricts its application; SRTsim, though, can assign simulated expression counts for each new location based on its \(k\) nearest neighboring locations measured in the reference data, this strategy may fail to preserve the spatial expression patterns of genes when there is a large discrepancy in tissue shape or cell type spatial distribution between the reference and synthetic data, such as the slices of different brain donors in the DLPFC datasets by Maynard et al.^{38}. Therefore, a general simulation framework that can simulate the independent, reproducible, and various SRT data to better facilitate the development of spatial transcriptomic data analysis methods is still in demand.
To this end, we introduce scCube, an SRT simulator for simulating multiple spatial variability in spatial resolved transcriptomics and generating unbiased simulated SRT data. Based on the variational autoencoder (VAE) framework, scCube can simulate the gene expression profiles of different cell (or spot) populations in scRNAseq (or SRT) data. Next, the spatial distribution patterns for specific populations can then be generated by the referencebased or referencefree strategy. We evaluated the simulation performance of scCube with existing singlecell or SRT simulators across various real SRT datasets, and demonstrated the utility of scCube in three benchmarking applications. The results indicated that scCube is a userfriendly framework to simulate unbiased SRT data, enabling researchers to benchmark and evaluate different computational methods more lightly and accurately.
Results
Design concept of scCube
The workflow of scCube consists of two components: 1) gene expression simulation, and 2) spatial pattern simulation (Fig. 1). In the gene expression simulation step, scCube applies a variational autoencoder (VAE) model^{39} to simulate the gene expression profiles of cell (or spot) populations within the characterized clustering spaces^{20} (Fig. 1a; Methods). One of the main advantages of VAE is the ability to generate new data points compared with classical autoencoders, which is highly in line with the design requirements for a simulator. In the current version, scCube includes about 300 trained models of various tissues derived from four human and mouse scRNAseq atlas (Tabula Muris^{40}, Tabula Sapiens^{41}, MCA^{42}, and HCL^{43}) as well as eight highquality SRT datasets (Supplementary Data 1). These models can be conveniently employed through the Python package by users to generate the new gene expression profiles of specific tissues. Additionally, scCube is also equipped to accommodate external scRNAseq or SRT datasets provided by users to train new models. Next, scCube will simulate unique spatial distribution patterns for the specific cell (or spot) populations set by users. In this step, scCube provide two strategies, referencebased and referencefree, to simulate the various spatial patterns in real SRT data. In the referencebased simulations, for each cell (or spot) population, scCube constructs a mapping between the cells (or spots) in generated data and the positions in the spatial reference using the optimal transport algorithm, and then maps the generated cells (or spots) to positions with the maximum likelihood of spatial origin (Fig. 1b; Methods). In the referencefree simulations, scCube further provides two strategies. Users can either choose to generate the random spatial patterns with the default spatial autocorrelation function, or flexibly simulate the highly customized complex spatial patterns by combining different types as well as numbers of basis patterns provided by scCube (Fig. 1c; Methods). By combining the simulated gene expression profiles and spatial patterns, scCube can generate unbiased SRT data corresponding to diverse spatial variations, such as the spatial pattern continuity of cell populations, the resolution for spotbased SRT data, the number and type of targeted genes for imagingbased SRT data, the dimension of the tissue slice, and so on (Supplementary Fig. S1). What’s more, scCube also builds in a set of visualization functions, which helps users investigate simulated datasets more intuitively.
Simulation performance evaluation of the referencebased strategy of scCube
We first comprehensively compared the simulation performance of the referencebased strategy of scCube with other existing simulators, including two SRT simulators, SRTsim^{36} and scDesign3^{37}, as well as six singlecell simulators, scDesign2^{44}, SymSim^{45}, ZINBWaVE^{46}, and three variations of Splatter^{47} (Splat, Splat Simple, and Kersplat). To achieve this, 29 real SRT datasets from seven different tissues generated by six sequencing technologies that include 10X Visium, ST, Stereoseq, MERFISH, STARmap, and 10X Xenium were utilized as the benchmark datasets (Supplementary Fig. S2 and Supplementary Data 2). The simulation performance was evaluated by the Pearson correlation coefficient (PCC_{GEV}) and mean absolute error (MAE_{GEV}) values between the gene expression vector for each gene across spatial positions in real and simulated data, as well as the Pearson correlation coefficient (PCC_{GBM}) values between the gene expression value for each gene from the simulated data’s spatial locations predicted by the two generalized boosted regression models (GBMs) trained on the real data and simulated data separately (Methods). As shown in Fig. 2a, b and Supplementary Figs. S34, scCube outperformed other SRT and singlecell simulators over most of the seven benchmark datasets from different tissues, with the highest average PCC_{GEV} and PCC_{GBM} values as well as the lowest average MAE_{GEV} values (except for the human DLPFC 10X Visium and zebrafish embryo Stereoseq datasets) between the real spatial expression patterns of all genes and the corresponding simulated spatial expression patterns. Consistently, scCube also exhibited excellent simulation performance over all tissue slices of the human DLPFC 10X Visium and mouse hypothalamus MERFISH datasets (Supplementary Fig. S5). We further demonstrated the spatial expression patterns of several representative marker genes in the real and simulated data. As illustrated in Fig. 2c and Supplementary Figs. S6–12, scCube accurately preserved the spatial expression patterns of these marker genes across different datasets. In contrast, although the two other SRT simulators, SRTsim and scDesign3, also achieved this relatively successfully, confirming the superiority of SRT simulators in simulating spatially resolved transcriptomics compared with singlecell simulation methods, scCube was superior to both of them with the highest PCC_{GEV} and PCC_{GBM} values.
We next evaluated the overfitting issues of scCube and other two SRT simulators in three scenarios (Supplementary Data 3). In the first two scenarios, we constructed the training and test data from single datasets using two different data splitting strategies (countsplit^{48} and random splitting), respectively; while in the third scenario, we considered a more general case where the training and test data come from different tissue slices of the same sample or experiment, rather than just being split from the same datasets. Each simulator was trained on the training data, and the simulated data generated by different simulators was subsequently compared with the test data for overfitting evaluation. Although all three SRT simulators accurately simulated the spatial expression patterns of genes in the test data in the first two scenarios and didn’t have significant overfitting issues (Supplementary Fig. S13–28), scCube exhibited a unique strength in the third scenario, i.e., its scalability to simulate SRT data with a large discrepancy in tissue shape or cell type spatial distribution from the spatial reference (Supplementary Fig. S29). To be specific, we selected two tissue slices with different spatial distributions of cell types from the mouse hypothalamus MERFISH dataset as an example, where the “Bregma +0.06” and “Bregma −0.29” slices are near the anterior and posterior positions, respectively (Fig. 2d). For both scCube and SRTsim, we simulated the gene expression over the coordinates of the “Bregma −0.29” slice using the “Bregma +0.06” data as the spatial reference. As illustrated in Fig. 2e and Supplementary Fig. S30, scCube can still accurately simulate the spatial expression patterns of marker genes on the “Bregma 0.29” slice when using the different tissue slice (the “Bregma +0.06”) as the spatial reference. In contrast, SRTsim suffered from significant overfitting and failed to preserve these spatial expression patterns well. One possible reason is that for each location on the “Bregma 0.29” slice, SRTsim assigned the simulated expression counts based on the expression counts of its \(k\) nearest neighboring locations on the “Bregma +0.06” slice. Since the spatial distribution of cell types between these two slices is quite different, this strategy may lead to the spatial expression patterns of marker genes on the simulated data being heavily influenced by the spatial reference. Similar results were reproduced on the human DLPFC 10X Visium dataset. We used the “Slice 151507” from donor 1 as the spatial reference and applied scCube and SRTsim to simulate the gene expression over the coordinates of the “Slice 151676” from donor 3 respectively (Supplementary Fig. S31a), and the results showed that SRTsim still failed to exactly preserve these spatial expression patterns due to the large discrepancy in tissue shape between slices (Supplementary Fig. S31b). All these results further demonstrated the superiority and general applicability of scCube in simulating the SRT data without significant overfitting issues.
In addition, we further evaluated the biologically interpretable of the scCubesimulated data by comparing the benchmark results of three types of SRT computational methods (spot deconvolution, gene imputation, and spatial domain identification) on the real data and scCubesimulated data. As shown in Supplementary Fig. S32, the deconvolution result of each spot deconvolution method was highly consistent when using the real or scCubesimulated data, as were the benchmark results of all methods. Conversely, this consistency was not observed when using randomsimulated data (Supplementary Fig. S33). Similar results were reproduced on gene imputation and spatial domain identification computational methods (Supplementary Figs. S34 and 35), indicating that the simulated SRT data generated by scCube preserves the biological characteristics of the real data.
We also surveyed the effect of utilizing different types of cell (or spot) annotations in the gene expression simulation step on the referencebased simulation performance of scCube. As shown in Supplementary Fig. S36, we selected four human breast cancer 10X Visium datasets and performed the referencebased simulation of scCube using the pathological annotation provided by the authors, three unsupervised clustering results with different clustering resolutions (implemented by Seurat^{29}, resolution = 0.3, 0.5, and 0.7), and one spatial domain clustering result (implemented by BayesSpace^{49}) as the spot annotations, respectively. As shown in Supplementary Figs. S37–39, scCube accurately simulated the spatial expression patterns of genes in all four SRT datasets, whether using the pathological labels or the clustering results as the spot annotations, and the average PCC values between the real spatial expression patterns of all genes and the corresponding simulated spatial expression patterns are both greater than 0.9 (Supplementary Fig. S40). What’s more, the cell type composition with each spot of the scCubesimulated data generated with different spot annotations was highly consistent with the real data, with the average PCC values greater than 0.98 (Supplementary Fig. S41). These results suggest that scCube is robust to the choice of cell (or spot) annotations in the gene expression simulation step.
Finally, we systematically provided the execution time of scCube when achieving relatively stable simulation performance across seven benchmark datasets of different sizes (Supplementary Fig. S42). Notably, as shown in Supplementary Fig. S43, the execution time of scCube mainly concentrated in the training step. Therefore, if there are trained models provided by scCube matching the target SRT data, users can directly download the corresponding models to generate the simulated data within a very short period of time without additional training steps.
Simulating multiple variability in SRT data by the referencefree strategy with scCube
In this section, we illustrated the high flexibility of scCube in simulating multiple variability in SRT data with the referencefree strategy. In brief, we further subdivided the variability in SRT data into two types: gene expression variability and spatial pattern variability. For the former, scCube can select a userspecified number or type of genes based on the simulated gene expression profiles to simulate the variability of targeted genes in imagingbased SRT data. For the latter, scCube can generate simulated spatial patterns of cell types varying in number, type, and dimension with the default spatial autocorrelation function, and further simulate the variability of spatial patterns within cell types (such as cell subtypes), as well as the variability of resolution and spot arrangement in spotbased SRT data. We demonstrated it in detail by the following three examples.
Simulation of the variability of spatial patterns of cell types
A basic function of scCube is its ability to generate the spatial distributions of cell types with diverse patterns. In scCube, there are two vital parameters, \(\lambda\) and \(\delta\), which control the fuzzy degree and the continuity of generated spatial patterns, respectively. The value of \(\lambda\) ranges from 0 to 1, and larger values will tend to form clearer spatial patterns; the value of \(\delta\) should be greater than 0, and larger values will tend to form spatial patterns with greater connectivity (Fig. 3a). Users thus can apply scCube to generate specific patterns by setting and combining different \(\lambda\) and \(\delta\) values for simulating the spatial architecture of cells in different tissues.
Another function of scCube is its ability to set the number of spatial patterns and simulate spatial patterns for only a specific set of cell types (Fig. 3c). This feature of scCube allows users to better simulate different situations in real tissues, such as spatial domains for specific cell types or the uniform colocalization distribution of several cell types^{50}. In the example, we applied scCube to generate spatial patterns for only three cell types (astrocyte, brain pericyte, and neuron) (Fig. 3b). As expected, the Moran’s I values for the spatial distribution of these three cell types were much higher than those of the remaining cell types (Fig. 3c), supporting the accuracy of scCube in spatial pattern simulation. We further examined the spatial distribution of each cell type as well as the spatial expression pattern of its top 50 marker genes, as illustrated in Fig. 3d and Supplementary Fig. S44a, b, astrocyte, brain pericyte, and neuron showed distinct spatial patterns, while the other four cell types without specific spatial patterns followed similar uniform distributions. Unsurprisingly, the top 50 marker genes of astrocyte, brain pericyte, and neuron also had significantly higher Moran’s I values, which was consistent with the results of the spatial distribution of cell types (Supplementary Fig. S44c). Moreover, users can apply this function to specify whether a specific cell type has the spatial distribution pattern, which in turn achieves the manual specification of the types of spatial expression patterns for the marker genes of this cell type. As shown in Fig. 3e, we used scCube to generate two simulated SRT datasets, in which astrocytes were manually specified to be present and absent spatial patterns, respectively. Compared with Dataset 2, the marker genes of astrocyte in Dataset 1 showed obvious spatial expression patterns and had significantly higher Moran’s I values (Fig. 3f and Supplementary Fig. S45).
The additional function of scCube is its ability to simulate SRT data of threedimensional tissues (Fig. 4a, b). Recently, computational methods for threedimensional structures of tissue construction by integrating multislices are gradually emerging, such as PASTE^{51}, PRECAST^{52}, and STalign^{53}. However, real threedimensional SRT datasets for benchmarking these methods are still scarce. A widely used tradeoff is to utilize the twodimensional SRT data from a series of continuous slices, and then compare the alignment results for the same domain between them. Although this strategy is applied in the performance evaluation of most methods, it may introduce too much noise since slices are not exactly adjacent to each other. In contrast, the simulated threedimensional SRT data generated by scCube is unbiased, and thus suitable enough as the ground truth in benchmarking datasets. Furthermore, scCube provides a “split” function that can split the threedimensional SRT data into a userdefined number of twodimensional SRT data along any axis of coordinates of the simulated data (Fig. 4c). With this function, users can investigate the spatial variability along a series of adjacent twodimensional slices, such as the spatial distribution of specific cell types (endothelial cell in Fig. 4c) and spatial expression pattern of its marker genes. In addition, by jointly utilizing simulated threedimensional SRT data and a series of derived twodimensional SRT data, users thus can evaluate the computational methods for threedimensional tissues reconstruction more accurately.
Simulation of the variability of spatial patterns within cell types
scCube also provides a separate function to consider the heterogeneity within cell types and generate the spatial patterns of cell subtypes flexibly. As shown in Fig. 5a, b, we first applied scCube to generate the spatial patterns for each cell type from a scRNAseq dataset, in which macrophages contain two subtypes. Subsequently, for these two macrophage subtypes, we simulated two different types of spatial patterns to disregard or consider the heterogeneity within cell types (Fig. 5c). Compared with the former, where the spatial distribution of each subtype followed a random distribution, the latter further simulated a specific spatial pattern for each subtype to better account for intracelltype gene expression variation in space (Fig. 5d).
Simulation of the variability of resolution and spot arrangement
scCube further provides the optional parameters for setting the number of cells per spot (\(n\)) and the arrangement and neighborhood structure of all spots, which makes it possible to simulate spotbased SRT data with different resolutions and even different technology platforms. As shown in Fig. 5e, the resolution of the simulated spatial patterns decreased with the value of the parameter \(n\) gradually increased. In contrast, the average number of cells per spot increased and approximated the setting value of \(n\). We next demonstrated the ability of scCube in simulating spotbased SRT data with different spot arrangements corresponding to the three current mainstream technology platforms, including Slideseq^{6}, 10X Visium, and ST^{5}. As shown in Supplementary Fig. S46a, the simulated Slideseq data followed a random neighborhood structure of spots, while the simulated 10X Visium and ST data were opposite, having regular hexagonal and square neighborhood structures of spots, respectively. Additionally, the average numbers of cells per spot in simulated Slideseq, 10X Visium, and ST data was 2.34, 10.5, and 34.01, respectively, which is consistent with the real data. Specifically, we examined the spatial distribution of astrocyte and the spatial expression pattern of its marker gene Slc6a11 and found that they were consistent well with each other in all three types of simulated data, further suggesting the robustness of scCube in simulation (Supplementary Fig. S46b).
Simulating biologically interpretable spatial patterns by the referencefree strategy with scCube
In the referencefree spatial pattern simulation, scCube further considers generating more interpretable spatial patterns in a customized manner. Users can first simulate a series of biologically interpretable spatial basis patterns, including unstructured mixed or clustered cell populations, cell rings encircling tissue structures, and some external structures such as vessels^{54}, and then flexibly generate highly customized complex spatial patterns by combining different types as well as numbers of these basis patterns (Fig. 6a, b).
We applied scCube to the simulation of the tumorimmune microenvironment of 3 triple negative breast cancer (TNBC) patients, which corresponded to three archetypical subtypes of tumorimmune interactions: cold, mixed, and compartmentalized, respectively as a detailed example (Fig. 6c). It has been reported that the cold subtype shows the low immune infiltration, the mixed subtype shows the high mixability between tumor and immune cells, and the compartmentalized subtype shows a series of regions comprised predominantly of either immune or tumor cells^{55}. As shown in Fig. 6c, the spatial patterns simulated by scCube were very similar to the corresponding tumorimmune microenvironments. Furthermore, unlike spaSim^{54}, which can only generate spatial pattern images, scCube also combines this customized referencefree spatial pattern simulation step with the gene expression simulation step to generate the corresponding gene expression profiles that match the simulated spatial patterns, which more comprehensively simulates the real SRT data (Fig. 6d, e).
Using scCube to benchmark spot deconvolution methods
We began by demonstrating the utility of scCube in benchmarking the spot deconvolution methods. To this end, we first generated the simulated SRT data with singlecell resolution from a mouse brain scRNAseq dataset, and then simulated six spotbased SRT datasets with different resolutions by merging the different numbers of cells into specific spots (Fig. 7a). The cell type composition of each spot in these six simulated datasets was known and can be used as the ground truth when evaluating the deconvolution performances of different methods. A total of nine spot deconvolution methods were benchmarked, including Cell2location^{19}, DestVI^{21}, DSTG^{22}, RCTD^{25}, Seurat^{29}, spatialDWLS^{24}, SPOTlight^{23}, Stereoscope^{26}, and Tangram^{27}. The accuracy of the cell type composition of each spot was evaluated. We first explored the deconvolution performance of each method on the simulated SRT data with the high resolution (n = 5), and as shown in Fig. 7b, Cell2location, DestVI, and spatialDWLS, had the top 1, 2, and 3 rankings average Pearson correlation coefficient (PCC) and Spearman rankorder correlation coefficient (SRCC) values (0.930/0.913, 0.900/0.870, and 0.889/0.869, respectively), followed by RCTD (0.814/0.771), Seurat (0.809/0.778), Tangram (0.750/0.652), SPOTlight (0.734/0.597), Stereoscope (0.574/0.523), and DSTG (0.533/0.431). Consistently, the average root mean squared error (RMSE) and JensenShannon divergence (JS) values of Cell2location, DestVI, and spatialDWLS were 0.063/0.166, 0.095/0.264, and 0.115/0.194, respectively, also lower than those of the other methods.
We further examined the effect of the resolution on the performance of spot deconvolution, and benchmarked these methods on a series of simulated datasets with decreasing resolutions (n = 10, 20, 30, 50, and 100) (Fig. 7c and Supplementary Fig. S47). Interestingly, we found that RCTD, Tangram, and Cell2location were relatively robust to the resolution of spotbased SRT data, the differences between the maximum and minimum of the average PCC and SRCC values were 0.034/0.025, 0.047/0.039, and 0.070/0.095 (Fig. 7c, d). In contrast, other methods showed a preference for specific resolutions. As the resolution decreased, the performance of DestVI, Seurat, and Spotlight degraded, while spatialDWLS performed the opposite (Fig. 7c, d). Additionally, Stereoscope achieved higher average PCC/SRCC values as well as lower average RMSE/JS values when n = 30 and 50 (Fig. 7c). We also used the accuracy score (AS) proposed by Li et al.^{56} to evaluate the performance of these methods on the six simulated datasets comprehensively, and found that Cell2location outperformed other methods with the highest average AS value (0.954), followed by spatialDWLS (0.778), RCTD (0.718), DestVI (0.676), Stereoscope (0.565), Tangram (0.472), Seurat (0.440), SPOTlight (0.255), and DSTG (0.144) (Fig. 7e).
Using scCube to benchmark gene imputation methods
We next demonstrated the utility of scCube in benchmarking the gene imputation methods. In this application, we focused on examining the effect of the quality of detected transcriptomes on the performance of gene imputation. Specifically, we first generated the simulated SRT data with whole transcriptomes from a mouse brain scRNAseq dataset, and then simulated three imagingbased SRT datasets with different types of genes by selecting 200 random genes (Dataset 1), 200 highly variable genes (Dataset 2), and 200 cell type marker genes (Dataset 3), respectively (Fig. 8a). The gene expression value of each gene in these three simulated datasets was known and can be used as the ground truth when evaluating the imputation performances of different methods. We benchmarked eight gene imputation methods, including SpaGE^{28}, Tangram^{27}, stPlus^{31}, gimVI^{57}, Seurat^{29}, SpaOTsc^{58}, novoSpaRc^{59}, and LIGER^{30} by calculating the PCC, SRCC, RMSE, and JS values between the expression vector of a gene in the ground truth and the expression vector for the same gene in the imputed results predicted by different methods. As illustrated in Fig. 8b, all methods performed well on Dataset 2 and 3 but poorly on Dataset 1. This is expected and reasonable, as a crucial step in all gene imputation methods is the integration of imagingbased SRT data and scRNAseq reference using the shared genes between these two types of data. Compared with randomly selected genes, highly variable genes and cell type marker genes cover more information and contain less noise, making it possible for these methods to perform better in gene imputation.
We further compared the performance of all methods on three simulated datasets, and found that SpaGE, stPlus, and Tangram performed better than other methods, with the average PCC and SRCC values ranking in the top 4 on all datasets (Supplementary Fig. S48). The performance of SpaOTsc was influenced by the Datasets, with the top 2 on Dataset 2 but the bottom 3 ranking on Dataset 1 (Supplementary Fig. S48). Moreover, the results of AS values were similar to those of PCC and SRCC values, where Tangram, SpaGE, and stPlus achieved the top 3 rankings on Dataset 1 (1, 0.875, and 0.75, respectively); SpaGE, SpaOTsc, and stPlus achieved the top 3 rankings on Dataset 2 (1, 0.781, and 0.719, respectively); SpaGE, stPlus, and Tangram achieved the top 3 rankings on Dataset 3 (1, 0.844, and 0.781, respectively) (Fig. 8c).
Using scCube to benchmark resolution enhancement methods
In the last application, we demonstrated the utility of scCube in benchmarking two existing resolution enhancement methods, BayesSpace^{49} and CARD^{60}, both of which constructing the highresolution spatial maps of gene expression for spotbased SRT data. Specifically, we utilized scCube to simulate two pairs of spotbased SRT data with low (121 spots) and high (1089 spots) resolution, where each spot in the lowresolution data was generated by merging nine nearby spots in the highresolution data (Fig. 9a). The cell type label of each cell in the simulated singlecell resolution SRT data and the gene expression profile of each spot in the simulated highresolution spotbased SRT data were known and can be used as the ground truth for spot deconvolution and resolution enhancement, respectively. We first evaluated the spot deconvolution performance of CARD on both low and highresolution SRT data. As shown in Fig. 9b, CARD performed better on the lowresolution than on the highresolution SRT data, the average PCC were 0.996 and 0.735, respectively. We further examined the predicted result of each spot of the highresolution ST data, and found that CARD failed to accurately deconvolute the cell type compositions of the spots in the region located between two different cell type domains (Fig. 9c). The poor deconvolution performance on spots located at the boundary may be due to the simplistic assumption of CARD. The key idea of CARD to impute and construct highresolution spatial maps for cell type composition and gene expression is modeling the spatial correlation structure as a multivariate normal distribution. This strategy leads to the cell type composition on the new locations being effectively represented as a weighted summation of the nonnegative celltype proportions on the original locations^{60}, forming a “transition” state at the boundary. However, in some cases, the cell type composition of spots at the boundary of two large cell type domains often differs greatly from the composition of spots within the two domains because of the presence of specific new cell types (such as astrocyte in the simulated data), which is contrary to the assumption of CARD.
We also compared the accuracy of the highresolution spatial maps of gene expression constructed by CARD and BayesSpace. Similar to the predicted results of cell type composition, the highresolution spatial maps of gene expression constructed by CARD also presented an obvious “diffusion” trend, which were inconsistent with the ground truth. In contrast, the maps constructed by BayesSpace were more accurate (Fig. 9d and Supplementary Fig. S49). Additionally, as one might expect, both CARD and BayesSpace were better at predicting the highresolution spatial maps of marker gene expression for cell types with large populations than for rare cell types. As shown in Fig. 9e, both CARD and BayesSpace achieved the highest average PCC values in predicting the highresolution spatial maps of top 50 marker genes of endothelial cell (0.772 and 0.793, respectively), followed by oligodendrocyte (0.694 and 0.736, respectively), neuron (0.670 and 0.651, respectively), astrocyte (0.586 and 0.589, respectively), brain pericyte (0.523 and 0.547, respectively), OPC (0.501 and 0.480, respectively), and Bergmann glial cell (0.473 and 0.438 respectively).
Discussion
In this study, we have presented a spatially resolved transcriptomics simulator, scCube, for simulating multiple spatial variability in spatial resolved transcriptomics and generating unbiased simulated SRT data. We have demonstrated the capability of scCube to preserve the spatial expression patterns of genes in real SRT data, and illustrated the utility of scCube in benchmarking diverse SRTanalysis computational methods, including spot deconvolution, gene imputation, and resolution enhancement.
scCube is designed to consist of two steps for SRT simulation: gene expression simulation and spatial pattern simulation. The former step is mainly based on the VAE model, which has become a popular tool for singlecell omics data modeling^{20,61,62,63}. We have also demonstrated that the VAE model can accurately simulate various characteristics of the original data, such as the sparsity of scRNAseq or SRT data (Supplementary Figs. S50–52). Meanwhile, we have provided about 300 trained models of various tissues, which are coupled with the scCube Python package, enabling users to simulate different spatial architectures of cells in real tissues more conveniently and quickly without the additional training steps. One potential limitation of scCube is that the ground truth provided by its simulated spotbased SRT data is restricted to the cell population level since it assigns each spot a cell population first when generating synthetic data. This may not be suitable for direct use in benchmarking studies of some SRTanalysis methods that operate at a finer resolution. However, benefiting from the robustness of scCube in cell population annotations selection (Supplementary Figs. S36–41), users can flexibly choose an appropriate granularity of cell population annotation during simulation. In addition, since the gene expression profiles generated by scCube remain at the spot level, users can also refine the annotation granularity of each spot through relevant downstream analysis (such as spot deconvolution).
In the spatial pattern simulation step, scCube provides two strategies: referencebased and referencefree. The referencebased strategy uses the SRT reference and aims to simulate the spatial expression pattern of genes across positions. The benchmark results across the real SRT data from diverse tissues generated by different sequencing platforms showed the superiority and broad scalability of scCube over other SRT or singlecell simulators. Moreover, we highlight the unique strength of scCube compared with SRTsim^{36}, a current stateoftheart SRT simulator. Specifically, scCube first constructs a mapping between the cells (or spots) in simulated data and the positions in the spatial reference by solving an optimal transport problem^{64} based on the gene expression, and then maps the cells (or spots) to positions with the maximum likelihood of spatial origin. This strategy effectively avoids the interference of external noise, such as tissue slice shape, cell type proportion, and distribution, etc., and thus can robustly simulate the spatial expression pattern of genes whether in the same or across different slices. In contrast, SRTsim relies heavily on the coordinate system (i.e., slice shape) of the spatial reference in simulation, which can indeed preserve the spatial expression pattern of genes when the shape of targeted simulation data is the same or sufficiently similar to that of the spatial reference. However, the accuracy of simulation drops dramatically when simulating data with a large discrepancy from the spatial reference (Fig. 2d, e and Supplementary Figs. S29–31).
For the referencefree simulations, scCube aims to generate random or customized spatial patterns for cell populations and combine them with the simulated gene expression profiles to de novo construct complete benchmarking datasets from scRNAseq data. Compared with simulations starting from real SRT data such as implemented in SRTsim, this strategy can generate simulated SRT data that feeds both singlecell resolution and whole transcriptomes, which is suitable as the ground truth for evaluating the performance of those integration methods, such as spot deconvolution, gene imputation, and spatial reconstruction^{59,65,66}. In this paper, we have demonstrated in detail the flexibility of the referencefree spatial pattern simulation of scCube by setting specified parameter values: including the simulations of multiple variability in SRT data based on a random manner, as well as the customized generations of more biologically interpretable spatial patterns.
Additionally, a desirable feature of scCube is its ability to simulate SRT data in a userspecified manner that meets diverse benchmarking purposes while providing the ground truth that is not present in real data. Specifically, scCube allows varying one specific variable when generating the SRT data, such as the number and type of genes or the number and area of spatial patterns (Supplementary Figs. S53 and 54), which allows users to investigate the impact of certain variables on the performance of different computational methods flexibly. For instance, we have utilized scCube to generate a series of simulated spotbased SRT data with different resolutions but the same other spatial variability (such as the number, proportion and spatial distribution of cell types) and benchmarked nine widely used spot deconvolution methods. Interestingly, we found that except for RCTD, Tangram, and Cell2location, the other six methods exhibited an obvious preference for specific resolutions, which was not discussed in detail in the previous studies^{19,21,22,23,24,56,67,68}. Similarly, in the other two applications, we also further discussed the spatial variability that may affect the performance of methods with the simulated data generated by scCube. Moreover, expect for the SRTanalysis methods demonstrated in the three applications, scCube should be applicable to benchmark other computational tools, such as spatial domain identification and spatial cellcell interaction inference methods. We therefore applied scCube to benchmark the above two methods, and demonstrated that the spatial information can help to improve the accuracy of spatial domain identification and spatially proximal LR pairs inference (Supplementary Figs. S35 and S55). In summary, these results suggested that scCube can provide scalable, reproducible, and realistic simulations, which helps users to evaluate diverse methods more lightly and accurately, and better facilitates the development of spatial transcriptomic data analysis methods.
Methods
Data collection and processing
Human dorsolateral prefrontal cortex (DLPFC) SRT dataset
The human DLPFC 10X Visium dataset^{38} was downloaded from http://spatial.libd.org/spatialLIBD/, containing 12 tissue slices from 3 donors.
Mouse hypothalamus SRT dataset
The mouse hypothalamus MERFISH dataset^{69} was downloaded from https://doi.org/10.5061/dryad.8t8s248. We filtered 5 “blank” barcodes as well as the Fos gene, whose expression value in all cells was ‘NA’. A total of 12 tissue slices from the naive female animal (Animal_ID: 1) were utilized.
Mouse neocortex V1 SRT dataset
The mouse neocortex V1 STARmap dataset^{3} was downloaded from https://zenodo.org/record/7830764#.ZDpObi1HUI, containing 2 tissue slices from one donor.
Human HER2 breast cancer SRT dataset
The human HER2 breast cancer ST dataset^{70} was downloaded from https://zenodo.org/record/5511763#.Y6kMduxBzUI, containing 36 tissue slices from 8 patients.
Human skin squamous cell carcinoma (SCC) SRT dataset
The human SCC ST dataset^{71} was downloaded from GEO (GSE144240), containing 8 tissue slices from 4 patients.
Human breast cancer SRT dataset
The human breast cancer 10X Xenium dataset^{72} was downloaded from https://www.10xgenomics.com/products/xeniuminsitu/previewdatasethumanbreast, containing 1 tissue slice from one patient.
Zebrafish embryo SRT dataset
The zebrafish embryo Stereoseq dataset^{73} was downloaded from Spatial Omics DataBase (SODB)^{74} (https://gene.ai.tencent.com/SpatialOmics/dataset?datasetID=83). The tissue slice from the 24 h postfertilization (hpf) embryo was utilized.
Human breast cancer SRT and scRNAseq datasets
The human breast cancer 10X Visium dataset and paired scRNAseq dataset^{75} were downloaded from https://doi.org/10.5281/zenodo.4739739. A total of 4 tissue slices from 4 patients (CID44971, CID4535, CID4465, and CID4290) were utilized.
Tabula muris dataset
The Tabula Muris scRNAseq dataset^{40} was downloaded from https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733, containing 96,953 cells from 31 tissues.
Tabula sapiens dataset
The Tabula Sapiens scRNAseq dataset^{41} was downloaded from https://figshare.com/projects/Tabula_Sapiens/100973, containing 952,796 cells from 27 tissues.
Mouse cell atlas (MCA) dataset
The MCA scRNAseq dataset^{42} was downloaded from https://figshare.com/articles/MCA_DGE_Data/5435866, containing 333,778 cells from 83 tissues.
Human cell landscape (HCL) dataset
The HCL scRNAseq dataset^{43} was downloaded from https://figshare.com/articles/HCL_DGE_Data/7235471, containing 575,745 cells from 102 tissues.
Human triple negative breast cancer (TNBC) image sets
The human TNBC Multiplexed Imaging (MIBI) channel images^{55} were downloaded from https://mibishare.ionpath.com. A total of 3 images from 3 patients (Patient 1, Patient 5, and Patient 24) were utilized.
All the above data except the MIBI images were prenormalized (implemented in the Python package SCANPY^{76}) prior to being input into scCube.
Design of scCube
The scCube model consists of two components, 1) gene expression simulation, and 2) spatial pattern simulation. scCube first simulates the gene expression profiles of different cell (or spot) populations (for example, the cell types, spatial domains, pathological regions, etc.) in gene expression simulation step. Notably, for each population, scCube can generate gene expression profiles with a customized number of cells (or spots) based on user settings. Subsequently, scCube performs either referencebased or referencefree strategies to generate spatial patterns for cell (or spot) populations in spatial pattern simulation step, which will be described in detail later.
Gene expression simulation
scCube applies a variational autoencoder (VAE) model^{39} consisting of an encoder and a decoder, to simulate the gene expression profiles of single cells of specific populations. Given the input scRNAseq (or SRT) data \({{{{{\boldsymbol{X}}}}}}={\left\{{{{{{{\boldsymbol{x}}}}}}}^{(i)}\right\}}_{i=1}^{N}\), where \({{{{{\boldsymbol{x}}}}}}{{{{{\boldsymbol{=}}}}}}{{{{{\boldsymbol{\{}}}}}}{{{{{{\boldsymbol{x}}}}}}}_{{{{{{\boldsymbol{1}}}}}}}{{{{{\boldsymbol{,}}}}}}\,{{{{{{\boldsymbol{x}}}}}}}_{{{{{{\boldsymbol{2}}}}}}}{{{{{\boldsymbol{\ldots }}}}}}{{{{{\boldsymbol{,}}}}}}\,{{{{{{\boldsymbol{x}}}}}}}_{{{{{{\boldsymbol{n}}}}}}}{{{{{\boldsymbol{\}}}}}}}\) represents a single cell (or spot) and its component \({x}_{n}\) represents the gene expression value of gene \(n\), the encoder utilizes the conditional distribution \({q}_{\phi }\left({{{{{\bf{z}}}}}}  {{{{{\bf{x}}}}}}\right)\) to generate the latent representation \({{{{{\bf{z}}}}}}\) from the input data \({{{{{\boldsymbol{x}}}}}}\), while the decoder leverages \({p}_{\theta }\left({{{{{\bf{x}}}}}}{{{{{\bf{z}}}}}}\right)\) to restore \({{{{{\boldsymbol{x}}}}}}\) from the latent variable \({{{{{\bf{z}}}}}}\). Here, \(\phi\) and \(\theta\) symbolize the parameters symbolize the encoder and decoder, respectively.
In practice, directly computing the posterior probability \({p}_{\theta }\left({{{{{\bf{z}}}}}}{{{{{\bf{x}}}}}}\right)\) might be challenging. To address this, we utilize the concept of variational inference, introducing an approximate distribution \({q}_{\phi }\left({{{{{\bf{z}}}}}}  {{{{{\bf{x}}}}}}\right)\) as an approximation to the true posterior probability. Specifically, we aim to maximize the Evidence Lower Bound (ELBO), which can be seen as minimizing the KL divergence between the variational approximate posterior \({q}_{\phi }\left({{{{{\bf{z}}}}}}  {{{{{\bf{x}}}}}}\right)\) and the true posterior \({p}_{\theta }\left({{{{{\bf{z}}}}}}{{{{{\bf{x}}}}}}\right)\):
By rearranging some terms, we obtain:
On the lefthand side of Eq. (2), we have the loglikelihood of the data \(\log \left({p}_{\theta }\left({{{{{\bf{x}}}}}}\right)\right)\) and the KL divergence \({{{{{{\mathcal{D}}}}}}}_{{KL}}\left({q}_{\phi }\left({{{{{\bf{z}}}}}}  {{{{{\bf{x}}}}}}\right){}{p}_{\theta }\left({{{{{\bf{z}}}}}}{{{{{\bf{x}}}}}}\right)\right)\) between the approximate and the true posterior. For VAE model, we want to maximize \(\log \left({p}_{\theta }\left({{{{{\bf{x}}}}}}\right)\right)\) and minimize \({{{{{{\mathcal{D}}}}}}}_{{KL}}\left({q}_{\phi }\left({{{{{\bf{z}}}}}}  {{{{{\bf{x}}}}}}\right){}{p}_{\theta }\left({{{{{\bf{z}}}}}}{{{{{\bf{x}}}}}}\right)\right)\), which is equivalent to maximizing the righthand side of the equation, i.e.:
Equation (3) is also known as the ELBO. To maximize the Eq. (3), we assume \({p}_{\theta }\left({{{{{\bf{z}}}}}}\right)\) takes on a multivariate Gaussian \({p}_{\theta }\left({{{{{\bf{z}}}}}}\right){{{{{\mathscr{=}}}}}}{{{{{\mathcal{N}}}}}}\left({{{{{\bf{z;}}}}}}{{{{{\bf{0}}}}}},{{{{{{\rm{I}}}}}}}\right)\) and the true (but computationally intractable) posterior \({p}_{\theta }\left({{{{{\bf{z}}}}}}{{{{{\bf{x}}}}}}\right)\) is approximated by a Gaussian form with a diagonal covariance. This assumption is reasonable as single cells from distinct cell types or domains have unique gene expression profiles that can be characterized by independent latent clustering spaces^{20}. And in this case, we can let the variational approximate posterior \({q}_{\phi }\left({{{{{\bf{z}}}}}}  {{{{{\bf{x}}}}}}\right)\) be a multivariate Gaussian with a diagonal covariance structure:
Where the mean and s.d. of the approximate posterior, \({{{{{{\boldsymbol{\mu }}}}}}}^{\left(i\right)}\) and \({{{{{{\boldsymbol{\sigma }}}}}}}^{(i)}\), can be implemented with the encoding MLP network. The KL term in Eq. (3) can be integrated analytically since both \({p}_{\theta }\left({{{{{\bf{z}}}}}}\right)\) and \({q}_{\phi }\left({{{{{\bf{z}}}}}}  {{{{{\bf{x}}}}}}\right)\) are multivariate Gaussian distributions, while the first term in Eq. (3), also termed as expected reconstruction error, requires estimation by sampling. To make sure the gradient descent algorithm can be used to train the model, we employ the reparameterization trick proposed by Kingma and Welling^{39} to make the optimization function differentiable. Specifically, we don’t directly sample \({{{{{\bf{z}}}}}}\) from the posterior but first sample \({{{{{\boldsymbol{\epsilon }}}}}}\) from the standard normal distribution and then compute \({{{{{\bf{z}}}}}}{{{{{\boldsymbol{=}}}}}}{{{{{\boldsymbol{\mu }}}}}}{{{{{\boldsymbol{+}}}}}}{{{{{\boldsymbol{\sigma }}}}}}\,{{\odot }}\,{{{{{\boldsymbol{\epsilon }}}}}}\). Thus, the gradient can be backward propagated for updating the model parameters.
Both the encoder and decoder apply a fourlayer perceptron in our model, where each layer is followed by a RELU activation except the last layer of the encoder. For the encoder, the number of neurons in each layer is 2048, 1024, 512 and 256 respectively. While for the decoder, the number of neutrons in each layer is 512, 1024, 2048 and k, respectively, where k represents the number of genes in the dataset. The dimensionality of the latent space learned by the VAE is 128.
Spatial pattern simulation
Referencebased spatial pattern simulation
In the referencebased simulation, scCube focuses on simulating the spatial expression patterns of genes over the spatial coordinates of the spatial reference data. Specifically, scCube first utilizes the VAE model trained in the gene expression simulation step to generate new cells (or spots) of different populations, whose numbers are determined by the groups in the spatial reference data. Significantly, the groups can be defined as the annotation information such as spatial domain or pathological annotation (if provided), or just as cluster labels obtained by the unsupervised or spatial domain clustering steps. Compared to the spatial reference, these generated cells (or spots) have sufficiently similar but not identical gene expression profiles. Then, scCube constructs a mapping between the cells (or spots) in generated data and the positions in the spatial reference by solving an optimal transport problem^{64}:
Where \({{{{{\bf{M}}}}}}\) is the metric cost matrix and \({{{{{\boldsymbol{a}}}}}}\), \({{{{{\boldsymbol{b}}}}}}\) are the sample weights. We assume \({{{{{\boldsymbol{a}}}}}}\) and \({{{{{\boldsymbol{b}}}}}}\) take uniform distributions \(a{{{{{\boldsymbol{=}}}}}}\frac{1}{n}{{{{{{\boldsymbol{1}}}}}}}_{n}\) and \(b{{{{{\boldsymbol{=}}}}}}\frac{1}{n}{{{{{{\boldsymbol{1}}}}}}}_{n}\) by default, and the metric cost matrix \({{{{{\bf{M}}}}}}\) is constructed by computing the sqeuclidean distance between the gene expression vectors of the generated cells (or spots) and the gene expression vectors of the positions in the spatial reference data.
Once we have the optimal transport plan \(\gamma\), we can map the generated cells to positions with the maximum likelihood of spatial origin, and thus the spatial patterns of cell (or spot) populations and the spatial expression patterns of genes are also simulated.
Referencefree spatial pattern simulation
In the referencefree simulation, scCube focuses on simulating specific spatial distribution patterns for different cell (or spots) populations based on the generated data and further provides two strategies. After randomly generating spatial positions with the same number as the cells (or spots) in the generated data, users can either choose to generate the random or customized spatial patterns.
1) Random spatial pattern generation
To be specific, scCube applies the default spatial autocorrelation function to generate random spatial patterns of populations. The spatial range containing all positions is first uniformly divided into \(n\times n\) grids. To encourage neighborhood similarity in the composition of cell type in grids, we assume that the overall distribution \(V\) of grids follows a multivariate normal distribution:
Where the \(n\times n\) covariance matrix \({{{{{\boldsymbol{\Sigma }}}}}}\) models the spatial correlation among the grids based on the Euclidean distance between their spatial coordinates:
Where \({{{{{\boldsymbol{D}}}}}}\) is the Euclidean distance matrix and \(\delta\) is the autocorrelation parameter. A larger \(\delta\) will generate spatial patterns with greater connectivity. Subsequently, for a target population, scCube randomly samples the same proportion of grids based on the proportion of this population in the generated data, and assigns all cells (or spots) which belongs to this population to the spatial positions in the sampled grids. This step repeats \(c\) times (\(c\) is set by users and does not exceed the total number of populations in the generated data), and cells (or spots) in the remaining population are randomly assigned to spatial positions in the remaining grids. Based on the above strategy, users can flexibly choose to simulate a unique spatial distribution pattern for each population, or simulate the same spatial colocalization distribution patterns for multiple populations simultaneously. Furthermore, scCube also provides an additional parameter, \(\lambda\), to generate spatial patterns with varying degrees of fuzziness by randomly swapping the spatial coordinates of \(1\lambda\) percent of the cells.
2) Customized spatial pattern generation
Inspired by spaSim^{54}, a recently published simulator of tissue spatial data, scCube provides four functions to optionally generate a series of biologically interpretable spatial basis patterns for each population, including unstructured mixed cell populations, clustered cell populations, cell rings encircling tissue structures, and some external structures such as vessels. Furthermore, scCube also provides an additional function to combine the different types and numbers of basis patterns mentioned above to generate more complex structures.
Unstructured mixed cell populations simulation
For unstructured mixed cell populations simulation, scCube randomly samples positions with a specific proportion and assigns designated cell populations to these positions. The number and proportion of cell populations can be set by users.
Clustered cell populations simulation
For clustered cell populations simulation, scCube applies geometric shapes (including circles, ovals, and other irregular shapes formed by combining circles or ovals) to delineate regions of clustered cell populations. Positions inside the margin of geometric shapes are defined as clustered structures and assigned designated cell populations, while positions outside the margin are defined as the background. Specifically, the number and parameters (such as center location, size, and direction) of geometric shapes can be customized. Given the center location \(({x}_{0},\,{y}_{0})\) and the direction \(\theta\) of the region for simulation, scCube first converts the coordinate \((x,{y})\) of each position as below:
Subsequently, scCube will apply circular or elliptical coordinate equations:
Where \(a\) and \(b\) are the major axis and minor axis, respectively, and \(a=b\) when the regions are circles. For each position, if \(\frac{{{x}_{{rotated}}}^{2}}{{a}^{2}}+\frac{{{y}_{{rotated}}}^{2}}{{b}^{2}} < \,1\), the position is inside the region, otherwise it’s outside the region. In addition, scCube further considers the presence of other infiltrating cells in the clustered cell populations, and the cell type and proportion of infiltrating cells can be specified by users.
Cell rings simulation
scCube applies concentric circles to simulate the cell single or double rings. Similar to clustered cell populations simulation, the number and parameter of concentric circles as well as the cell type and proportion of infiltrating cells can also be specified by users. Moreover, scCube provides an additional parameter \(d\) to control the widths of cell rings. Taking the cell single rings simulation as an example, for each position, if \(\frac{{{x}_{{rotated}}}^{2}}{{a}^{2}}+\frac{{{y}_{{rotated}}}^{2}}{{b}^{2}} < \,1\), the position is inside the inner cluster; if \(\frac{{{x}_{{rotated}}}^{2}}{{a}^{2}}+\frac{{{y}_{{rotated}}}^{2}}{{b}^{2}} > 1\) and \(\frac{{{x}_{{rotated}}}^{2}}{{(a+d)}^{2}}+\frac{{{y}_{{rotated}}}^{2}}{{(b+d)}^{2}} < \,1\), the position is inside the outer ring; otherwise it’s outside the outer ring and is defined as the background.
Vessels simulation
scCube applies stripes to simulate the blood or lymphatic vessels. The number and parameter (including length, width, and direction) of stripes as well as the cell type and proportion of infiltrating cells can be specified by users. Given two center endpoints \(({x}_{1},\,{y}_{1})\) and \(\left({x}_{2},\,{y}_{2}\right)\) and the width \(d\) of the stripe for simulation, scCube first calculates the normalized perpendicular vector \({{{{{\boldsymbol{p}}}}}}\):
Where \(l=\sqrt{{({x}_{2}{x}_{1})}^{2}+{({y}_{2}{y}_{1})}^{2}}\) is the length of the stripe, and the endpoints on the margin of stripe thus can be defined as \(({x}_{1}+d\frac{{{{{{\boldsymbol{p}}}}}}}{2},\,{y}_{1}+d\frac{{{{{{\boldsymbol{p}}}}}}}{2})\) or \(({x}_{1}d\frac{{{{{{\boldsymbol{p}}}}}}}{2},\,{y}_{1}d\frac{{{{{{\boldsymbol{p}}}}}}}{2})\). For each position \((x,{y})\), scCube further calculates its projection lengths \(u\) and \(v\) on the direction vector and the perpendicular vector of the stripe, respectively:
If 0\(\le u\le l\) and \(\leftv\right\le \frac{d}{2}\), the position is inside the stripe, otherwise it’s outside the stripe and is defined as the background.
Comparison with other simulation methods
We compared the simulation performance of scCube with other existing simulators using the real SRT datasets. In detail, we considered the following eight simulators, two of which are SRT simulators and six are singlecell simulators:

(1)
SRTsim^{36}: SRT simulator, implemented in the R package SRTsim, with the count matrix as inputs.

(2)
scDesign3^{37}: SRT simulator, implemented in the R package scDesign3, with the count matrix as inputs.

(3)
scDesign2^{44}: singlecell simulator, implemented in the R package scDesign2, with the count matrix as inputs.

(4)
SymSim^{45}: singlecell simulator, implemented in the R package SymSim, with the count matrix as inputs.

(5)
ZINBWaVE^{46}: singlecell simulator, implemented in the R package Splatter, with the count matrix as inputs.

(6)
Splat^{47}: singlecell simulator, implemented in the R package Splatter, with the count matrix as inputs.

(7)
Splat Simple^{47}: singlecell simulator, implemented in the R package Splatter, with the count matrix as inputs.

(8)
Kersplat^{47}: singlecell simulator, implemented in the R package Splatter, with the count matrix as inputs.
All compared simulation methods were set with default parameters and we did not perform any prenormalized step on the raw count matrices.
Evaluation metrics
PCC_{GEV}
The Pearson correlation coefficient (PCC) values between the normalized expression vector for each gene across spatial positions in real and simulated data. The PCC_{GEV} value was calculated using the following equation:
Where \(n\) is the number of spots/cells, \({x}_{{ij}}\) and \({y}_{{ij}}\) are the expression value of gene \(i\) in spot/cell \(j\) in the real data and the simulated data, respectively; \(\bar{x}\) and \(\bar{y}\) are the average expression value of gene \(i\) across all spots/cells in the real data and the simulated data, respectively.
MAE_{GEV}
The mean absolute error (MAE) values between the normalized expression vector for each gene across spatial positions in real and simulated data. The MAE_{GEV} value was calculated using the following equation:
Where \(n\) is the number of spots/cells, \({x}_{{ij}}\) and \({y}_{{ij}}\) are the expression value of gene \(i\) in spot/cell \(j\) in the real data and the simulated data, respectively.
PCC_{GBM}
Considering that direct comparison between the real and simulated gene expression vectors is overly stringent and leads to insufficient comparison results when the training data and the test data are different, according to Song et al.^{37}, we further calculated the Pearson correlation coefficient (PCC) values between the expression value for each gene from the simulated data’s spatial locations predicted by the two generalized boosted regression models (GBMs) trained on the real data and simulated data separately. The GBMs were implemented in the R package caret. Compared with PCC_{GEV}, this alternative evaluation metric is more robust and informative (Supplementary Fig. S56). The PCC_{GBM} value was calculated using the Eq. 12 above, where \(n\) is the number of spots/cells, \({x}_{{ij}}\) and \({y}_{{ij}}\) are the expression value of gene \(i\) in spot/cell \(j\) predicted by the GBMs trained on the real data and the simulated data, respectively; \(\bar{x}\) and \(\bar{y}\) are the average expression value of gene \(i\) across all spots/cells predicted by the GBMs trained on the real data and the simulated data, respectively.
Benchmarking spot deconvolution methods
For the spot deconvolution methods benchmarking, we first used scCube to generate the simulated SRT data with singlecell resolution from a mouse brain scRNAseq dataset, each cell type has a specific spatial distribution pattern. For this singlecell SRT data, all cells were split according to the fixed spatial distance which was determined by the setting average number of cells per spot \(n\) and then merged into simulated spots as the benchmarking datasets. We simulated six spotbased SRT datasets (n = 5, 10, 20, 30, 50, and 100) in total and benchmarked the following nine spot deconvolution methods with the default parameters: (1) Cell2location^{19} implemented in the Python package cell2location; (2) DestVI^{21} implemented in the Python package scvitools; (3) DSTG^{22} implemented in the Python package DSTG; (4) RCTD^{25} implemented in the R package spacexr; (5) Seurat^{29} implemented in the R package Seurat; (6) spatialDWLS^{24} implemented in the R package Giotto; (7) SPOTlight^{23} implemented in the R package SPOTlight; (8) Stereoscope^{26} implemented in the Python package Stereoscope; (9) Tangram^{27} implemented in the Python package Tangram.
Evaluation metrics
PCC
The Pearson correlation coefficient (PCC) was calculated using the Eq. 12 above, where \(n\) is the number of cell types, \({x}_{{ij}}\) and \({y}_{{ij}}\) are the proportion value of cell type \(j\) in spot \(i\) in the ground truth and the predicted result, respectively; \(\bar{x}\) and \(\bar{y}\) are the average proportion value across all cell types in spot \(i\) in the ground truth and the predicted result, respectively.
SRCC
The Spearman rankorder correlation coefficient (SRCC) was calculated using the following equation:
Where \(n\) is the number of cell types, \({d}_{{ij}}\) is the rank value of the proportion value of cell type \(i\) in spot \(i\) in the ground truth and the predicted result, respectively.
RMSE
The root mean squared error (RMSE) was calculated using the following equation:
Where \(n\) is the number of cell types, \({x}_{{ij}}\) and \({y}_{{ij}}\) are the proportion value of cell type \(j\) in spot \(i\) in the ground truth and the predicted result, respectively.
JS
The JensenShannon divergence (JS) was calculated using the following equation:
Where \(n\) is the number of cell types, \({x}_{{ij}}\) and \({y}_{{ij}}\) are the proportion value of cell type \(j\) in spot \(i\) in the ground truth and the predicted result, respectively; \({{{{{{\boldsymbol{P}}}}}}}_{i}\) and \({{{{{{\boldsymbol{Q}}}}}}}_{i}\) are the distribution probability vectors of cell type proportion in spot \(i\) in the ground truth and the predicted result, respectively.
AS
The accuracy score (AS) proposed by Li et al.^{56} was calculated using the following equation:
Benchmarking gene imputation methods
For the gene imputation methods benchmarking, we also used scCube to generate the new scRNAseq data with whole transcriptomes from the mouse brain scRNAseq dataset. Since none of the existing gene imputation methods make use of the spatial information of single cells, we thus did not simulate the spatial coordinates for each generate cells. On the contrary, we focused on the effect of the quality of detected transcriptomes in imagebased SRT data on the gene imputation methods. To do so, we utilized scCube to generate three imagebased SRT data with 200 randomly selected genes, 200 highly variable genes, and 200 cell type marker genes, respectively, and benchmarked the following eight gene imputation methods with the default parameters: (1) SpaGE^{28} implemented in the Python package SpaGE; (2) Tangram^{27} implemented in the Python package Tangram; (3) stPlus^{31} implemented in the Python package stPlus; (4) gimVI^{57} implemented in the Python package scvitools; (5) Seurat^{29} implemented in the R package Seurat; (6) SpaOTsc^{58} implemented in the Python package SpaOTsc; (7) novoSpaRc^{59} implemented in the Python package novoSpaRc; (8) LIGER^{30} implemented in the R package liger.
Evaluation metrics
The PCC, SRCC, RMSE, JS, and AS mentioned above were employed to evaluate the gene imputation performance, and the input \(n\) is the number of spots/cells, \({x}_{{ij}}\) and \({y}_{{ij}}\) are the expression value of gene \(i\) in spot/cell \(j\) in the in the ground truth and the predicted result, respectively.
Benchmarking resolution enhancement methods
For the resolution enhancement methods benchmarking, we simulated two pairs of spotbased SRT data with low (121 spots) and high (1089 spots) cellular resolution using scCube and benchmarked the following two existing methods with the default parameters: (1) BayesSpace^{49} implemented in the R package BayesSpace; (2) CARD^{60} implemented in the R package CARD.
Evaluation metrics
The PCC mentioned above were employed to evaluate the accuracy of highcellular resolution spatial maps of gene expression predicted by BayesSpace and CARD, and the input \(n\) is the number of spots, \({x}_{{ij}}\) and \({y}_{{ij}}\) are the expression value of gene \(i\) in spot \(j\) in the in the ground truth and the predicted result, respectively. Besides, since CARD also predicted the cell type composition of spots for the both low and highcellular resolution SRT data, we thus also evaluated the deconvolution result using the PCC mentioned above, and the input \(n\) is the number of cell types, \({x}_{{ij}}\) and \({y}_{{ij}}\) are the proportion value of cell type \(j\) in spot \(i\) in the ground truth and the predicted result, respectively.
Benchmarking spatial domain identification methods
For the spatial domain identification methods benchmarking, we selected a 10X Visium SRT dataset from the human dorsolateral prefrontal cortex (DLPFC) containing 12 tissue slices^{38} and generated the corresponding simulated SRT data for each slice using the referencebased strategy of scCube as the benchmarking datasets. The cortex layer label of each spot in these 12 simulated datasets was known and can be used as the ground truth. Five spatial domain identification methods (including (1) BayesSpace^{49} implemented in the R package BayesSpace; (2) SpaGCN^{18} implemented in the Python package SpaGCN; (3) STAGATE^{33} implemented in the Python package STAGATE; (4) DRSC^{32} implemented in the R package DR.SC; (5) stLearn^{77} implemented in the Python package stLearn), as well as one nonspatial unsupervised clustering method (Seurat^{29} implemented in the R package Seurat) were benchmarked with the default parameters.
Evaluation metrics
ARI
The adjusted rand index (ARI) was calculated using the following equation:
Where \(n\) is number of spots, \({a}_{i}\) and \({b}_{j}\) are the number of spots belonging to class \(i\) in the ground truth and cluster \(j\) in the predicted result, respectively, and \({n}_{{ij}}\) is number of spots belonging to class \(i\) and cluster \(j\) simultaneously.
NMI
The normalized mutual information (NMI) was calculated using the following equation:
Where the notation is the same as that in ARI.
HS
The homogeneity score (HS) was calculated using the following equation:
Where \(H({CK})\) is the conditional entropy of the classes given the cluster assignments and is given by:
and \(H(C)\) is the entropy of the classes and is given by:
Where \(n\) is number of spots, \({n}_{c}\) and \({n}_{k}\) are the number of spots belonging to class \(c\) in the ground truth and cluster \(k\) in the predicted result, respectively, \({n}_{c,k}\) is the number of spots from class \(c\) assigned to cluster \(k\).
FMI
The FowlkesMallows score (FMI) was calculated using the following equation:
Where \({Precision}=\frac{{TP}}{{TP}+{FP}}\) is proportion of spots that are correctly classified in the predicted result, and \(Recall=\frac{TP}{TP+FP}\) is proportion of spots that are correctly classified in the ground truth.
Benchmarking spatial cellcell interaction inference methods
For the spatial cellcell interaction inference benchmarking, we selected a STARmap SRT data from the mouse V1 neocortex^{3} and generated the corresponding simulated SRT data using the referencebased strategy of scCube as the benchmarking datasets. One recently published spatially resolved cellcell communication inference method (SpaTalk^{34} implemented in the R package SpaTalk) as well as other seven scRNAseq cellcell communication inference methods (including (1) CellCall^{78} implemented in the R package CellCall; (2) CellChat^{79} implemented in the R package CellChat; (3) CellPhoneDB^{80} implemented in the Python package cellphonedb; (4) CytoTalk^{81} implemented in the R package CytoTalk; (5) Giotto^{82} implemented in the R package Giotto; (6) NicheNet^{83} implemented in the R package nichenetr; (7) SpaOTsc^{58} implemented in the Python package SpaOTsc) were benchmarked with the default parameters.
Evaluation metrics
The spatial proximity significance and the coexpression level on the cellcell spatial graph network of the inferred ligandreceptor interactions (LRIs) proposed by Shao et al.^{34} were employed to evaluate the accuracy of all methods in spatial cellcell interaction inference.
Statistics & reproducibility
No statistical method was used to predetermine the sample size. There were no data exclusions except for the filtering of 5 “blank” barcodes and the Fos gene in the MERFISH data. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment. Python (version 3.8.5) and R (version 4.1.0) are used for the statistical analysis.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
No new data was generated for this study. All data used in this study is publicly available and can be accessed through the following links: (1) the human dorsolateral prefrontal cortex (DLPFC) 10X Visium dataset [http://spatial.libd.org/spatialLIBD/]^{38}; (2) the mouse hypothalamus MERFISH dataset [https://doi.org/10.5061/dryad.8t8s248]^{69}; (3) the mouse neocortex V1 STARmap dataset [https://zenodo.org/record/7830764#.ZDpObi1HUI]^{3}; (4) the human HER2 breast cancer “Spatial Transcriptomics (ST)” dataset [https://zenodo.org/record/5511763#.Y6kMduxBzUI]^{70}; (5) the human skin squamous cell carcinoma (SCC) “Spatial Transcriptomics (ST)” dataset [GSE144240]^{71}; (6) the human breast cancer 10X Xenium dataset [https://www.10xgenomics.com/products/xeniuminsitu/previewdatasethumanbreast]^{72}; (7) the zebrafish embryo Stereoseq dataset [https://gene.ai.tencent.com/SpatialOmics/dataset?datasetID=83]^{73}; (8) the human breast cancer 10X Visium and paired scRNAseq dataset [https://doi.org/10.5281/zenodo.4739739]^{75}; (9) the he Tabula Muris scRNAseq dataset [https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733]^{40}; (10) the Tabula Sapiens scRNAseq dataset [https://figshare.com/projects/Tabula_Sapiens/100973]^{41}; (11) the MCA scRNAseq dataset [https://figshare.com/articles/MCA_DGE_Data/5435866]^{42}; (12) the HCL scRNAseq dataset [https://figshare.com/articles/HCL_DGE_Data/7235471]^{43}; (13) the human TNBC Multiplexed Imaging (MIBI) channel images [https://mibishare.ionpath.com]^{55}. Source data are provided with this paper.
Code availability
scCube is an openaccess python package available in the GitHub repository (https://github.com/ZJUFanLab/scCube)^{84}.
References
Liao, J., Lu, X., Shao, X., Zhu, L. & Fan, X. Uncovering an organ’s molecular architecture at singlecell resolution by spatially resolved transcriptomics. Trends Biotechnol. 39, 43–58 (2021).
Lubeck, E., Coskun, A. F., Zhiyentayev, T., Ahmad, M. & Cai, L. Singlecell in situ RNA profiling by sequential hybridization. Nat. Methods 11, 360–361 (2014).
Wang, X. et al. Threedimensional intacttissue sequencing of singlecell transcriptional states. Science 361, eaat5691 (2018).
Eng, C.H. L. et al. Transcriptomescale superresolved imaging in tissues by RNA seqFISH. Nature 568, 235–239 (2019).
Ståhl, P. L. et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353, 78–82 (2016).
Rodriques, S. G. et al. Slideseq: A scalable technology for measuring genomewide expression at high spatial resolution. Science 363, 1463–1467 (2019).
Stickels, R. R. et al. Highly sensitive spatial transcriptomics at nearcellular resolution with SlideseqV2. Nat. Biotechnol. 39, 313–319 (2021).
Nichterwitz, S. et al. Laser capture microscopy coupled with Smartseq2 for precise spatial transcriptomic profiling. Nat. Commun. 7, 12139 (2016).
Chen, J. et al. Spatial transcriptomic analysis of cryosectioned tissue samples with Geoseq. Nat. Protoc. 12, 566–580 (2017).
Asp, M. et al. A spatiotemporal organwide gene expression and cell atlas of the developing human heart. Cell 179, 1647–1660.e19 (2019).
Thrane, K., Eriksson, H., Maaskola, J., Hansson, J. & Lundeberg, J. Spatially resolved transcriptomics enables dissection of genetic heterogeneity in stage iii cutaneous malignant melanoma. Cancer Res 78, 5970–5979 (2018).
Maniatis, S. et al. Spatiotemporal dynamics of molecular pathology in amyotrophic lateral sclerosis. Science 364, 89–93 (2019).
Navarro, J. F. et al. Spatial transcriptomics reveals genes associated with dysregulated mitochondrial functions and stress signaling in Alzheimer Disease. iScience 23, 101556 (2020).
Moses, L. & Pachter, L. Museum of spatial transcriptomics. Nat. Methods 19, 534–546 (2022).
Svensson, V., Teichmann, S. A. & Stegle, O. SpatialDE: identification of spatially variable genes. Nat. Methods 15, 343–346 (2018).
Edsgärd, D., Johnsson, P. & Sandberg, R. Identification of spatial expression trends in singlecell gene expression data. Nat. Methods 15, 339–342 (2018).
Sun, S., Zhu, J. & Zhou, X. Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nat. Methods 17, 193–200 (2020).
Hu, J. et al. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat. Methods 18, 1342–1351 (2021).
Kleshchevnikov, V. et al. Cell2location maps finegrained cell types in spatial transcriptomics. Nat. Biotechnol. 40, 661–671 (2022).
Liao, J. et al. De novo analysis of bulk RNAseq data at spatially resolved singlecell resolution. Nat. Commun. 13, 6498 (2022).
Lopez, R. et al. DestVI identifies continuums of cell types in spatial transcriptomics data. Nat. Biotechnol. 40, 1360–1369 (2022).
Song, Q. & Su, J. DSTG: deconvoluting spatial transcriptomics data through graphbased artificial intelligence. Brief. Bioinform 22, bbaa414 (2021).
ElosuaBayes, M., Nieto, P., Mereu, E., Gut, I. & Heyn, H. SPOTlight: seeded NMF regression to deconvolute spatial transcriptomics spots with singlecell transcriptomes. Nucleic Acids Res 49, e50 (2021).
Dong, R. & Yuan, G.C. SpatialDWLS: accurate deconvolution of spatial transcriptomic data. Genome Biol. 22, 145 (2021).
Cable, D. M. et al. Robust decomposition of cell type mixtures in spatial transcriptomics. Nat. Biotechnol. 40, 517–526 (2022).
Andersson, A. et al. Singlecell and spatial transcriptomics enables probabilistic inference of cell type topography. Commun. Biol. 3, 565 (2020).
Biancalani, T. et al. Deep learning and alignment of spatially resolved singlecell transcriptomes with Tangram. Nat. Methods 18, 1352–1362 (2021).
Abdelaal, T., Mourragui, S., Mahfouz, A. & Reinders, M. J. T. SpaGE: spatial gene enhancement using scRNAseq. Nucleic Acids Res 48, e107 (2020).
Stuart, T. et al. Comprehensive Integration of SingleCell Data. Cell 177, 1888–1902.e21 (2019).
Welch, J. D. et al. Singlecell multiomic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887.e17 (2019).
Shengquan, C., Boheng, Z., Xiaoyang, C., Xuegong, Z. & Rui, J. stPlus: a referencebased method for the accurate enhancement of spatial transcriptomics. Bioinformatics 37, i299–i307 (2021).
Liu, W. et al. Joint dimension reduction and clustering analysis of singlecell RNAseq and spatial transcriptomics data. Nucleic Acids Res 50, e72 (2022).
Dong, K. & Zhang, S. Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention autoencoder. Nat. Commun. 13, 1739 (2022).
Shao, X. et al. Knowledgegraphbased cellcell communication inference for spatially resolved transcriptomic data with SpaTalk. Nat. Commun. 13, 4429 (2022).
Li, R. & Yang, X. De novo reconstruction of cell interaction landscapes from singlecell spatial transcriptome data with DeepLinc. Genome Biol. 23, 124 (2022).
Zhu, J., Shang, L. & Zhou, X. SRTsim: spatial pattern preserving simulations for spatially resolved transcriptomics. Genome Biol. 24, 39 (2023).
Song, D. et al. scDesign3 generates realistic in silico data for multimodal singlecell and spatial omics. Nat. Biotechnol. 42, 247–252 (2023).
Maynard, K. R. et al. Transcriptomescale spatial gene expression in the human dorsolateral prefrontal cortex. Nat. Neurosci. 24, 425–436 (2021).
Kingma, D. P. & Welling, M. AutoEncoding Variational Bayes. In: 2nd International Conference on Learning Representations (ICLR), https://iclr.cc/archive/2014/conferenceproceedings/ (2014).
Tabula Muris Consortium et al. Singlecell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).
Tabula Sapiens Consortium* et al. The Tabula Sapiens: A multipleorgan, singlecell transcriptomic atlas of humans. Science 376, eabl4896 (2022).
Han, X. et al. Mapping the mouse cell atlas by microwellseq. Cell 172, 1091–1107.e17 (2018).
Han, X. et al. Construction of a human cell landscape at singlecell level. Nature 581, 303–309 (2020).
Sun, T., Song, D., Li, W. V. & Li, J. J. scDesign2: a transparent simulator that generates highfidelity singlecell gene expression count data with gene correlations captured. Genome Biol. 22, 163 (2021).
Zhang, X., Xu, C. & Yosef, N. Simulating multiple faceted variability in single cell RNA sequencing. Nat. Commun. 10, 2611 (2019).
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.P. A general and flexible method for signal extraction from singlecell RNAseq data. Nat. Commun. 9, 284 (2018).
Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of singlecell RNA sequencing data. Genome Biol. 18, 174 (2017).
Neufeld, A., Gao, L. L., Popp, J., Battle, A. & Witten, D. Inference after latent variable estimation for singlecell RNA sequencing data. Biostatistics 25, 270–287 (2023).
Zhao, E. et al. Spatial transcriptomics at subspot resolution with BayesSpace. Nat. Biotechnol. 39, 1375–1384 (2021).
Fang, R. et al. Conservation and divergence of cortical cell organization in human and mouse revealed by MERFISH. Science 377, 56–62 (2022).
Zeira, R., Land, M., Strzalkowski, A. & Raphael, B. J. Alignment and integration of spatial transcriptomics data. Nat. Methods 19, 567–575 (2022).
Liu, W. et al. Probabilistic embedding, clustering, and alignment for integrating spatial transcriptomics data with PRECAST. Nat. Commun. 14, 296 (2023).
Clifton, K. et al. STalign: Alignment of spatial transcriptomics data using diffeomorphic metric mapping. Nat Commun. 14, 8123 (2023).
Feng, Y. et al. Spatial analysis with SPIAT and spaSim to characterize and simulate tissue microenvironments. Nat. Commun. 14, 2697 (2023).
Keren, L. et al. A structured tumorimmune microenvironment in triple negative breast cancer revealed by multiplexed ion beam imaging. Cell 174, 1373–1387.e19 (2018).
Li, B. et al. Benchmarking spatial and singlecell transcriptomics integration methods for transcript distribution prediction and cell type deconvolution. Nat. Methods 19, 662–670 (2022).
Lopez, R. et al. A joint model of unpaired data from scRNAseq and spatial transcriptomics for imputing missing gene expression measurements. Preprint at http://arxiv.org/abs/1905.02269v1 (2019).
Cang, Z. & Nie, Q. Inferring spatial and signaling relationships between cells from single cell transcriptomic data. Nat. Commun. 11, 2084 (2020).
Nitzan, M., Karaiskos, N., Friedman, N. & Rajewsky, N. Gene expression cartography. Nature 576, 132–137 (2019).
Ma, Y. & Zhou, X. Spatially informed celltype deconvolution for spatial transcriptomics. Nat. Biotechnol. 40, 1349–1359 (2022).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for singlecell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts singlecell perturbation responses. Nat. Methods 16, 715–721 (2019).
Grønbech, C. H. et al. scVAE: variational autoencoders for singlecell gene expression data. Bioinformatics 36, 4415–4422 (2020).
Bonneel, N., Van De Panne, M., Paris, S. & Heidrich, W. Displacement interpolation using Lagrangian mass transport. ACM Trans. Graph. 30, 1–12 (2011).
Qian, J. et al. Reconstruction of the cell pseudospace from singlecell RNA sequencing data with scSpace. Nat. Commun. 14, 2484 (2023).
Ren, X. et al. Reconstruction of cell spatial organization from singlecell RNA sequencing data based on ligandreceptor mediated selfassembly. Cell Res 30, 763–778 (2020).
Li, H. et al. A comprehensive benchmarking with practical guidelines for cellular deconvolution of spatial transcriptomics. Nat. Commun. 14, 1548 (2023).
Chen, J. et al. A comprehensive comparison on celltype composition inference for spatial transcriptomics data. Brief. Bioinform 23, bbac245 (2022).
Moffitt, J. R. et al. Molecular, spatial, and functional singlecell profiling of the hypothalamic preoptic region. Science 362, eaau5324 (2018).
Andersson, A. et al. Spatial deconvolution of HER2positive breast cancer delineates tumorassociated cell type interactions. Nat. Commun. 12, 6012 (2021).
Ji, A. L. et al. Multimodal analysis of composition and spatial architecture in human squamous cell carcinoma. Cell 182, 1661–1662 (2020).
Janesick, A. et al. High resolution mapping of the tumor microenvironment using integrated singlecell, spatial and in situ analysis. Nat. Commun. 14, 8353 (2023).
Liu, C. et al. Spatiotemporal mapping of gene expression landscapes and developmental trajectories during zebrafish embryogenesis. Dev. Cell 57, 1284–1298.e5 (2022).
Yuan, Z. et al. SODB facilitates comprehensive exploration of spatial omics data. Nat. Methods 20, 387–399 (2023).
Wu, S. Z. et al. A singlecell and spatially resolved atlas of human breast cancers. Nat. Genet 53, 1334–1347 (2021).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: largescale singlecell gene expression data analysis. Genome Biol. 19, 15 (2018).
Pham, D. et al. Robust mapping of spatiotemporal trajectories and cellcell interactions in healthy and diseased tissues. Nat. Commun. 14, 7739 (2023).
Zhang, Y. et al. CellCall: integrating paired ligandreceptor and transcription factor activities for cellcell communication. Nucleic Acids Res 49, 8520–8534 (2021).
Jin, S. et al. Inference and analysis of cellcell communication using CellChat. Nat. Commun. 12, 1088 (2021).
VentoTormo, R. et al. Singlecell reconstruction of the early maternalfetal interface in humans. Nature 563, 347–353 (2018).
Hu, Y., Peng, T., Gao, L. & Tan, K. CytoTalk: De novo construction of signal transduction networks using singlecell transcriptomic data. Sci. Adv. 7, eabf1356 (2021).
Dries, R. et al. Giotto: a toolbox for integrative analysis and visualization of spatial expression data. Genome Biol. 22, 78 (2021).
Browaeys, R., Saelens, W. & Saeys, Y. NicheNet: modeling intercellular communication by linking ligands to target genes. Nat. Methods 17, 159–162 (2020).
Jingyang, Q., ProDong0512 & Dr. FAN, X. ZJUFanLab/scCube: scCube v2.0.1. https://doi.org/10.5281/zenodo.10686284 (2024).
Acknowledgements
This work is supported by the National Natural Science Foundation of China (U23A20513, X.F.), Ningbo Top Medical and Health Research Program (No. 2022030309, X.F.), the Fundamental Research Funds for the Central Universities (226202400001, X.F.). The authors thank the HighPerformance Computing Cluster of Zhejiang University Innovation Center of Yangtze River Delta for their technical support and thank spaSim for its inspiration in the development of referencefree spatial pattern simulation for this project.
Author information
Authors and Affiliations
Contributions
X.F. and Y.C. conceived the study. J.Q., Y.F., Z.C., and C.L. implemented the scCube model. J.Q., H.B., X.S., J.L., W.G., Y.H., A.L., and Y.Y. collected datasets involved in this article and performed benchmarking experiments. J.Q. and H.B. wrote the manuscript, and all authors edited and revised the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Yawei Li, Xun Xu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Qian, J., Bao, H., Shao, X. et al. Simulating multiple variability in spatially resolved transcriptomics with scCube. Nat Commun 15, 5021 (2024). https://doi.org/10.1038/s41467024494450
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467024494450
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.