Single-cell and spatial transcriptomics enables probabilistic inference of cell type topography

The field of spatial transcriptomics is rapidly expanding, and with it the repertoire of available technologies. However, several of the transcriptome-wide spatial assays do not operate on a single cell level, but rather produce data comprised of contributions from a – potentially heterogeneous – mixture of cells. Still, these techniques are attractive to use when examining complex tissue specimens with diverse cell populations, where complete expression profiles are required to properly capture their richness. Motivated by an interest to put gene expression into context and delineate the spatial arrangement of cell types within a tissue, we here present a model-based probabilistic method that uses single cell data to deconvolve the cell mixtures in spatial data. To illustrate the capacity of our method, we use data from different experimental platforms and spatially map cell types from the mouse brain and developmental heart, which arrange as expected.

: Comparison between visualization of marker genes' relative expression (top) and proportion estimate (middle) in section mb-ST1, with Allen Brain Atlas ISH images (bottom) as reference. The relative gene expression is obtained by dividing the number of observed transcripts at a given spot (x sg ) by the total number of observed transcripts in the given spot. The relative gene expression values are visualized according to the same procedure as the proportion values (Methods). 5 Supplementary Figure 5: Comparison between visualization of marker genes' relative expression (top) and proportion estimate (middle) in section mb-ST2, with Allen Brain Atlas ISH images (bottom) as reference. The relative gene expression is obtained by dividing the number of observed transcripts at a given spot (x sg ) by the total number of observed transcripts in the given spot. The relative gene expression values are visualized according to the same procedure as the proportion values (Methods). 6 Supplementary Figure 6: Comparison between visualization of marker genes' relative expression (top) and proportion estimate (middle) in section mb-V1, with Allen Brain Atlas ISH images (bottom) as reference. The relative gene expression is obtained by dividing the number of observed transcripts at a given spot (x sg ) by the total number of observed transcripts in the given spot. The relative gene expression values are visualized according to the same procedure as the proportion values (Methods).
Supplementary Figure 7: Visualization of proportion estimates for section dh-A (from a series of eight independent sections from the same developmental heart, named A-H), scaled within each cell type.
Supplementary Figure 8: Visualization of proportion estimates for section dh-B (from a series of eight independent sections from the same developmental heart, named A-H), scaled within each cell type.
Supplementary Figure 9: Visualization of proportion estimates for section dh-C (from a series of eight independent sections from the same developmental heart, named A-H), scaled within each cell type.
Supplementary Figure 10: Visualization of proportion estimates for section dh-D (from a series of eight independent sections from the same developmental heart, named A-H), scaled within each cell type.
Supplementary Figure 11: Visualization of proportion estimates for section dh-E (from a series of eight independent sections from the same developmental heart, named A-H), scaled within each cell type.
Supplementary Figure 12: Visualization of proportion estimates for section dh-F (from a series of eight independent sections from the same developmental heart, named A-H), scaled within each cell type.
Supplementary Figure 13: Visualization of proportion estimates for section dh-G (from a series of eight independent sections from the same developmental heart, named A-H), scaled within each cell type.
Supplementary Figure 14: Visualization of proportion estimates for section dh-H (from a series of eight independent sections from the same developmental heart, named A-H), scaled within each cell type.
Supplementary Figure 15: Visualization of the proportion estimates for the Slide-seq mouse brain hippocampus data. The alpha value is proportional to the estimated proportion value, meaning that dark blue areas correspond to regions where the given cell type constitutes a high proportion of the cells while the opposite is true for those of white color.

Characterization of Spatial Expression Data
Our method assumes that gene expression data can be sufficiently modeled by a Negative Binomial (NB) distribution. This idea is by no means novel; several methods designed for analysis of either bulk or single cell RNA-seq data relies on variations of the very same assumption (e.g., edgeR, DeSeq2, ZINB-WaVE and sctransform) [3][4][5][6]. Spatial data generated from capture based technologies, like those discussed within this work, share several features with single cell data; the most obvious being how both are constituted of positive integer values representing the number of transcripts associated with a given observation (cell alternatively capture location). The experimental methods by which such spatial data is obtained nevertheless differ substantially from those of single cell data, and thus caution should be payed to extrapolation of assumptions between the data modalities. Aware of this, we here seek to justify our assumptions and present support for our spatial data being NB-distributed, as to ensure we're not working on false premises.
To briefly recapitulate some of the terminology and properties of the NB distribution: We let x sgc denote the observable transcripts of a given gene (g), from a specific cell (c), at a given capture location (s). Our main assumption is that x sgc ∼ N B(., p g ), as a consequence of the additive property of NB distributions with shared second parameters (p g ), we therefore have that x sg = c∈Cs x sgc ∼ N B(., p g ) as well.
There are however two prominent issues that must be addressed in the context of this quest, namely that: (1) x sgc is not observed in any of the experimental platforms included in this study, and we are therefore limited to computational inferences of these values; (2) while we do observe x sg (being entries of the count matrices), we only have one such observation per distribution that we wish to characterize; this sparsity prevents us from making any general statement regarding the distributions. Due to (1) we will focus our efforts on characterization of x sg rather than x sgc , to then consider support for the former being NB distributed as strongly implicating the latter having the same property. In order to circumvent (2), we will cluster the capture locations based on their gene expression profiles and assume that members within the same cluster (k) can be taken to have approximately the same first parameter (r kg ); meaning that x sg |s ∈ k = x kg ∼ N B(r kg , p g ). As a result of this simplification, we have multiple observations from respective distribution (x kg ) and are able evaluate how well the NB distribution approximates this.
For the purpose of this inquiry, we use the mb-V1 section (Visium) and a set of genes (n = 12) listed as markers for cell types within the brain (taken from panglaodb.se), these genes also exhibit varying spatial distributions, see Supplementary Figure 20. [2] We used the Seurat package in R to normalize and cluster our data, from this we obtained 15 clusters, shown in Figure 21. See Supplementary Section 1.1.1 for more details regarding the clustering. Next, within each cluster and for each marker gene, we fitted 3 different parametric distributions (Negative Binomial, Normal and Poisson) to the observed expression (using the R package fitdistrplus). The empirical and fitted distributions for each cluster and marker gene are visualized in Supplementary Figures 22-36. As can be seen in the aforementioned figures, the NB distribution (red) tends to provide a good approximation for the observed empirical distribution (gray). Some deviations and ill-fits are observed, but these are mainly confined to clusters with few members or where the specific gene's expression do not overlap very well with the cluster (in the spatial domain). For a more quantitative assessment of how well the different distributions were able to describe the data, we compared their respective BIC (Bayesian information criterion) values, once the distributions had been fitted to the data. For all marker genes, the NB distribution outperformed the alternative distributions, as shown in Supplementary Figure 37.
We believe these results support and speak in favor of our assumption that x sg is well approximated by an NB distribution, hence also being affirmative of our model's design. Furthermore, from a theoretical standpoint it's also motivated to take the spatial data as NB-distributed; the NB distribution is often interpreted as an "overdisperssed" Poisson distribution, which represents the number of iid events that occur in given interval of time or space -a definition that translates well to the capture of mRNA at specific locations in the spatial assays. If we were to accept this as sufficient support for our assumption regarding the sum x sg , it's also motivated to assume that the respective constituents are NB-distributed, as this would indeed produce a new NB-distributed variable like that of x sg .
All results related to the discussion regarding the assumption of spatial capture-based data being Negative Bino-mial distribution, can be reproduced by running the script test-NB.R (found in the referenced github and Zenodo repositories), where the only input needed is the mb-V1 count file and a list of genes (the marker genes used) to be analyzed.

Seurat Clustering
To cluster the data, we followed the steps outlined in the Seurat Vignette [7], without any modifications (meaning, resolution = 0.8). In total 15 clusters were identified, as presented in Supplementary Figure 21.

Method Comparison
To compare and benchmark our method against alternative methods designed for deconvolution of bulk RNA-seq data using single cell RNA-seq data, we generated synthetic data according to the procedure outlined in Table  1 (Methods). The synthetic data generation produces sets with known proportion values, which can be used to quantitatively evaluate method performance. We conducted two comparative analyses: • Comparison 1, 10-30 cells per capture location: For this comparison; cell density (cells per capture location) was set to 10 − 30 cells during synthetic data generation. The intention was to generate data that resembled that obtained from the older ST (1k array) platform. Results are shown in Supplementary Figure  38 and Supplementary Table 6, where it can be seen how stereoscope outperforms the other methods.
• Comparison 2, 1-10 cells per capture location: For this comparison; cell density (cells per capture location) was set to 1 − 10 cells during synthetic data generation. For this set the intetion was to generate data that resembled that produced by the Visium platform. Results are shown in Supplementary Figure 39 and Supplementary Table 7, stereoscope performs better than the alternative methods in this comparison as well.

Data : Mouse Brain
Single Cell Single cell data was downloaded from mousebrain.org, where data from Hippocampus and Cerebellum were provided as loom-files containing a total of 29519 cells (hippocampus) and 27998 cells (cerebellum). As stated in the methods section: for the hippocampus data, the labels given as "Clusters" and "Class" were joined together in order to define more granular cell types; while for the cerebellum data we used the "Clusters" labels. Applying the subsampling scheme described in Methods, data sets consisting of 8449 (hippocampus) and 7506 (cerbellum) cells were assembled, the exact compositions of these sets are found in Supplementary Table 2 and 3.

ST/Visium
We analyzed two 100 micron array ST sections (mb-ST1 and mb-ST2), data can be accessed at the github page, original source in currently in press. We excluded all spots that were not covered by the tissue. We downloaded Visium (55 micron array) data from the website of 10x Genomics TM , listed under "Support", "Spatial Gene Expression" and "Datasets", selecting the set listed as "Mouse Brain Section (Coronal)".
[8] For the Visium data we only included spots under the tissue in our analysis.

Slide-seq
We analyzed two Slide-seq pucks: the puck visualized in Fig.1C (hippocampus) and one of the pucks used to generate Fig.2C (cerebellum) -figure numbers refer to the publication where the method was first presented. [1] The data was accessed from the "Single Cell Portal" provided by Broad Institute, pucks with ID 180413 7 (hippocampus) and 180819 11 (cerebellum) were downloaded. We used the files MappedDGEForR.csv and MappedLocationsForR.csv, associated with pucks of said IDs, to assemble an expression matrix with beads as rows and genes as columns. [9] We also replaced the original row names containing the barcode ids with each bead's spatial coordinates given as "[x coordinate]x[y coordinate]". All beads with non-zero total counts were used in the analysis.

Data : Developmental Heart
Both single cell and ST data for the analysis of the developmental heart are taken from the publication "A spatiotemporal organ-wide gene expression and cell atlas of the developing human heart.". [10] The complete set of single cells was used, while only the 8 ST sections from PCW (Post Conceptional Week) 6.5 were used as spatial data. Supplementary Table 4 gives the specifics of the single cell data set.

Single Cell
To generate synthetic single cell data, the data set originating from hippocampal tissue was downloaded from mousebrain.org (the same set as used for the analysis of the mouse brain) where the "Subclass" annotations were used as cell type identifiers. A generation and validation set of equal compositions (in terms of number of cells from each cell type) were generated, exact structure given in Supplementary Table 5.

Spatial Data
A total of 1000 spots were synthesized according to the procedure described in the Methods section. Only data from the top 500 highest expressed genes in the generation data set were used, hence a 1000 × 500 count matrix was generated. All synthetic data sets and the code used for the "synthesis" is available in the github repository of this paper, where also a tutorial to reproduce the results is presented.

Slide-seq analysis
With no histological image being provided for the Slide-seq data, we depict the obtained proportion estimates in a similar fashion to the ST data -each bead is represented by a circular marker where the alpha value is proportional to the estimated proportion for each cell type -but without a background tissue image. Due to the large number of beads present in the Slide-seq assays (and thus data points to visualize) the proportion estimates are multiplied by the scalar 0.6 in Fig.15, rendering a result which is easier to inspect and interpret. While Slide-seq provides high resolution, a one-to-one mapping between bead and cell is not guaranteed, given how beads tend to have between 1-3 cells contributing to them. [1] We therefore do not apply "hard" cluster assignments (i.e., assigning each bead to the type with highest associated proportion estimate) but rather use the proportion estimates. No scaling within cell type nor section is performed upon visualization.
Our main reason for including the cerebellum Slide-seq data was to assess whether we obtained the same number of cell types distributed across the beads as reported in the original Slide-seq study. The authors report that for the 7 Cerebellum pucks approximately 65.8 ± 1.4% of the beads are matched to a single cell type while 32.6 ± 1.2% mapped to two cell types (numbers reported as mean ± standard deviation).
To compare our results with those obtained in the Slide-seq study, we implemented an approach similar to what the Slide-seq authors used when calling the number of cell types assigned to a bead; in our case a cell type was "confidently assigned" to a bead if the proportion value of a cell type was greater than half the L2 norm of a bead's proportion vector. By doing so we could observe how 60.2% and 37.6% of beads had one respectively two cell types confidently assigned to them, see Figure 18. While these values both fall outside of the reported error range, it should be noted that we used a different single cell data set with a larger number of cell types; it's therefore plausible to suggest that what was called as contribution from one cell type in the Slide-seq study might in some cases (since we are operating at a higher granularity) be matched to two types in our data -explaining the slight discrepancies between the reported values.