Cell segmentation-free inference of cell types from in situ transcriptomics data

Multiplexed fluorescence in situ hybridization techniques have enabled cell-type identification, linking transcriptional heterogeneity with spatial heterogeneity of cells. However, inaccurate cell segmentation reduces the efficacy of cell-type identification and tissue characterization. Here, we present a method called Spot-based Spatial cell-type Analysis by Multidimensional mRNA density estimation (SSAM), a robust cell segmentation-free computational framework for identifying cell-types and tissue domains in 2D and 3D. SSAM is applicable to a variety of in situ transcriptomics techniques and capable of integrating prior knowledge of cell types. We apply SSAM to three mouse brain tissue images: the somatosensory cortex imaged by osmFISH, the hypothalamic preoptic region by MERFISH, and the visual cortex by multiplexed smFISH. Here, we show that SSAM detects regions occupied by known cell types that were previously missed and discovers new cell types.

bandwidth should be employed. We investigated the effect of bandwidth and the pixel resolution (lattice spacing) to which the gene expression is resolved (Supplementary Fig. 16-19). This showed that the cell-type signatures and cell type map are not highly dependent on pixel resolution/lattice spacing, that small deviations in the bandwidth have negligible effect, and large increases in the bandwidth can result in an overly smoothed cell type map while still capturing most cell-type signatures (Supplementary Fig. 20).

Downsampling of the vector field
SSAM downsamples gene expression vectors before performing cluster analysis. This step significantly reduces computation time compared to processing the entire tissue image, especially for graph-based clustering algorithms. We show that random sampling from the tissue image is effective (Supplementary Fig. 38); however, we achieve better representation of cell-type gene expression profiles by selecting local maxima of total gene expression (Supplementary Fig. 39), and we therefore made this the default downsampling method. The local maxima sampling strategy rests on two presumptions: 1) pixels with a higher mRNA density have a larger probability of being intracellular, and 2) Gaussian smoothing propagates the signal locally, sufficiently to average it over intracellular regions. From these presumptions, we reason that vectors at a peak of signal intensity are probably situated centrally within a cell, and hence have a high signal-to-noise ratio and capture a representative signature of the cell type by integrating the surrounding mRNA molecules.
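The local maxima strategy described above can be illustrated with a minimal sketch. It is not SSAM's actual implementation: the function names, the `window` and `min_total` parameters, and the use of Gaussian filtering of binned spot counts as a stand-in for kernel density estimation are all assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter


def kde_vector_field(spots, genes, n_genes, shape, bandwidth):
    """Smoothed expression per gene on a pixel lattice (illustrative stand-in for KDE).

    spots: integer array (n_spots, 2) of pixel coordinates (hypothetical input format).
    genes: integer array (n_spots,) with the gene index of every spot.
    """
    field = np.zeros(tuple(shape) + (n_genes,))
    # Bin mRNA spots into per-pixel, per-gene counts.
    np.add.at(field, (spots[:, 0], spots[:, 1], genes), 1.0)
    # Smooth each gene layer with a Gaussian kernel of width `bandwidth` (in pixels).
    for g in range(n_genes):
        field[..., g] = gaussian_filter(field[..., g], sigma=bandwidth, mode="constant")
    return field


def local_maxima(field, window, min_total):
    """Return coordinates and expression vectors at local maxima of total expression.

    window and min_total are assumed tuning parameters, not values from the paper.
    """
    total = field.sum(axis=-1)
    # A pixel is a peak if it equals the maximum of its neighbourhood and is not background.
    is_peak = (total == maximum_filter(total, size=window)) & (total > min_total)
    return np.argwhere(is_peak), field[is_peak]
```

Under these assumptions, each returned vector integrates the mRNA spots surrounding a density peak, which is the property the two presumptions above rely on.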

Clustering and cell-type gene expression profile analysis
An important step in determining the cell-type specific gene expression profiles is the clustering step. We initially implemented the Louvain algorithm, which is used routinely for cluster analysis of scRNA-seq data in packages such as Seurat 5; however, we found that it was not sensitive enough to detect all cell types, in particular the Sst Chodl type in the mouse VISp dataset. To address this, we employed an approach similar to a previous study that applied a second round of clustering on top of the Louvain results 6 to split potentially merged clusters. We applied HDBSCAN clustering to the Louvain clustering results and were then able to identify a distinct cluster for Sst Chodl. Despite this success, it remains unclear whether the same protocol would work equally well for data from different tissues and techniques, and we therefore encourage users to also try other clustering methods.
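The two-round idea (a coarse first pass followed by a second pass that splits potentially merged clusters) can be sketched as follows. To keep the example dependency-light, hierarchical clustering from SciPy is used as a stand-in for both Louvain and HDBSCAN; this swap, the function name, and the `n_coarse` and `split_distance` parameters are assumptions and do not reflect the protocol used in this study.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def two_round_clustering(vectors, n_coarse, split_distance):
    """Coarse clustering followed by re-clustering inside every coarse cluster."""
    # Round 1: coarse partition (Ward linkage as a stand-in for Louvain).
    coarse = fcluster(linkage(vectors, method="ward"), t=n_coarse, criterion="maxclust")
    final = np.zeros(len(vectors), dtype=int)
    next_label = 0
    for c in np.unique(coarse):
        idx = np.flatnonzero(coarse == c)
        if len(idx) < 2:
            sub = np.ones(len(idx), dtype=int)
        else:
            # Round 2: split the coarse cluster wherever sub-groups are further
            # apart than split_distance (stand-in for HDBSCAN).
            tree = linkage(vectors[idx], method="average")
            sub = fcluster(tree, t=split_distance, criterion="distance")
        final[idx] = next_label + sub - 1  # fcluster labels start at 1
        next_label += sub.max()
    return coarse, final
```

In this sketch a small, distinct population that the coarse pass merges with a larger neighbour can still receive its own label in the second pass, analogous to the separation of Sst Chodl described above.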

Cell type mapping
We used Pearson's correlation to classify pixels in the tissue image, generating the cell-type maps (Supplementary Fig. 2B). Correlation-based cell-type mapping has been shown to work effectively 7,8, and was also among the top performers for automatic cell-type annotation 9. Although our cell-type mapping worked well for the examples described in the paper, the SSAM framework allows the use of other classification strategies.
For example, we have implemented an adversarial autoencoder to assign each pixel to a cell type. This preliminary work is part of the development branch of SSAM on GitHub.
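The correlation-based mapping described above can be sketched as follows. This is a minimal illustration rather than SSAM's API: the function name and the `min_corr` threshold (below which a pixel is left unassigned) are assumptions, and every signature is assumed to be non-constant across genes.

```python
import numpy as np


def pearson_map(field, signatures, min_corr=0.6):
    """Label each pixel with its best-correlated cell-type signature.

    field: array (height, width, n_genes) of smoothed expression.
    signatures: array (n_types, n_genes), assumed non-constant across genes.
    Pixels whose best correlation is below min_corr (an assumed threshold) get -1.
    """
    h, w, g = field.shape
    x = field.reshape(-1, g).astype(float)
    s = np.asarray(signatures, dtype=float)
    # Centre each vector so that the normalised dot product equals Pearson's r.
    xc = x - x.mean(axis=1, keepdims=True)
    sc = s - s.mean(axis=1, keepdims=True)
    xn = np.linalg.norm(xc, axis=1, keepdims=True)
    sn = np.linalg.norm(sc, axis=1, keepdims=True)
    xn[xn == 0] = np.inf  # constant pixel vectors get correlation 0
    corr = (xc / xn) @ (sc / sn).T
    labels = corr.argmax(axis=1)
    labels[corr.max(axis=1) < min_corr] = -1
    return labels.reshape(h, w), corr.reshape(h, w, -1)
```

Because the classifier is a single function of the vector field and the signatures, it could be exchanged for another classification strategy without changing the earlier steps.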

Mixing of gene expression signals
In our clustering analysis of all three datasets, we observed that some clusters showed moderately mixed expression of signature genes from different cell types. This can be problematic for developmental tissues, where cells can be tightly packed and transitioning between progenitor and differentiated states. Technical causes of signal mixing include: (1) different cells can overlap at different z locations if the thickness of the section is comparable to the cell size; (2) clustering of the vectors might not be perfect and can include vectors belonging to nearby clusters; (3) the gene expression estimated by the KDE algorithm is smoothed, so the gene expression of one cell type can contaminate nearby cells of a different type. We found that (1) and (2) are of major importance for this phenomenon, but (3) also becomes noticeable for closely packed small cells. For example, we found that a relatively higher expression of the Foxj1 gene (a signature gene of ependymal cells) is detected in the choroid plexus signature, compared to that of the segmentation-based osmFISH centroid. Such signal 'contamination', caused by the KDE spreading signal into adjacent cells, can be controlled by the KDE bandwidth: the smaller the bandwidth, the lower the contamination. However, very low bandwidths break one of the primary assumptions of SSAM, namely that the smoothed KDE signal should represent cells and not subcellular features. Still, our work showed that this phenomenon is not so critical as to hinder detection of cell types, even in extreme cases such as bandwidths of 5 μm or 10 μm (Supplementary Fig. 20). Nevertheless, we recommend trying several bandwidths to investigate this mixing effect in different tissues.
Systematic evaluation of potential doublets arising from signal mixing revealed that the real effect of this phenomenon is marginal (Supplementary Table 8).
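The dependence of KDE-driven contamination on bandwidth, cause (3) above, can be explored with a toy one-dimensional sketch. The geometry (two adjacent cells, each with nine evenly spaced spots along a line), the function names, and the tested bandwidths are hypothetical and chosen only for illustration; they are not taken from the datasets in this study.

```python
import numpy as np


def kde_1d(positions, x, bandwidth):
    """Gaussian kernel density of mRNA spots evaluated at position x."""
    z = (x - np.asarray(positions, dtype=float)) / bandwidth
    return np.exp(-0.5 * z**2).sum() / (bandwidth * np.sqrt(2.0 * np.pi))


def contamination(own_spots, neighbor_spots, center, bandwidth):
    """Fraction of the smoothed signal at `center` that comes from the neighbouring cell."""
    own = kde_1d(own_spots, center, bandwidth)
    foreign = kde_1d(neighbor_spots, center, bandwidth)
    return foreign / (own + foreign)


# Hypothetical geometry: cell A spans 0-8 and cell B spans 10-18 (arbitrary length units).
cell_a = np.linspace(0.0, 8.0, 9)
cell_b = np.linspace(10.0, 18.0, 9)
bandwidths = [1.0, 2.5, 5.0, 10.0]
# Contamination of cell A's centre by cell B's marker gene, per bandwidth.
fractions = [contamination(cell_a, cell_b, 4.0, bw) for bw in bandwidths]
```

In this toy setting the foreign fraction at the centre of cell A grows with the bandwidth, which is why scanning several bandwidths is a reasonable way to gauge the mixing effect in a new tissue.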

Subcellular gene expression signals
SSAM identified a distinct cell-type signature of Aldoc-expressing astrocytes that also had low expression of Gfap and Mfge8. On closer inspection, the Aldoc signal exhibited subcellular compartmentalization patterns, consistent with astrocytes that express high levels of Mfge8. We believe that this could be due to localization of these mRNAs to subcellular compartments in astrocytes. Such an intracellular spatial organization of the transcriptome is often an important form of post-transcriptional regulation 10, and imaging-based methods can reveal this organization 11. While this work does not investigate this in detail, it indicates that SSAM has the potential to be used to identify and investigate the organization of mRNAs.

Applicability of SSAM to other spatial gene expression profiling techniques
Currently, the field of in situ transcriptomics is advancing rapidly and more than 10,000 genes can be simultaneously profiled using FISH-based methods 12,13. The high number of genes detected in large volumes opens up the potential for in situ transcriptomics methods to at least partially replace single-cell RNA sequencing, placing SSAM as the first generic and segmentation-free pipeline to rapidly and precisely reconstruct tissue structure independent of the underlying imaging technique. While the profiled genes for the datasets presented in this study were carefully curated to represent cell-type diversity, when a larger number of genes is used we suggest applying dimensionality reduction techniques before cluster analysis.
Moreover, since the only required input data for SSAM is mRNA locations, it is highly adaptable to spatially resolved transcriptomics technologies beyond FISH methods, e.g. in situ or intact tissue sequencing 14-16, and composite in situ imaging 17.
In addition to these single-molecule profiling techniques, there are those that are resolved to single cells or larger regions 18, and Spatial Transcriptomics 19,20. These come with the advantage and the disadvantage that mRNA is aggregated to single cells/spots: allocation of mRNA to cells becomes easier, but subcellular localization analysis becomes impossible.
These techniques have a resolution limited to the spotting profile of the technology, in contrast to the native organization of mRNA within tissues. For example, the Visium spots (55 μm) are very large, making the technique less appropriate for those interested in single-cell resolution, and inappropriate for tissues with very small structures, e.g. placental villi, which can be 60 μm in diameter, or for those interested in the localization of cells that are distributed over the tissue (e.g. delta cells in the pancreas).
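The dimensionality reduction suggested above for large gene panels could, for instance, be a principal component analysis applied to the sampled expression vectors before clustering. The sketch below is one possible choice under that assumption, not a step used in this study; the function name and `n_components` parameter are hypothetical.

```python
import numpy as np


def pca_reduce(vectors, n_components):
    """Project expression vectors (n_vectors, n_genes) onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal axes; singular values are sorted in decreasing order.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return centered @ vt[:n_components].T, explained[:n_components]
```

The returned explained-variance fractions give a rough guide to how many components to keep before passing the reduced vectors to the clustering step.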

SSAM as a segmentation-free analysis framework
We define SSAM as a framework, since it consists of various functions required for segmentation-free spatial cell-type detection. For example, one can choose 1) local maxima selection, or random sampling for downsampling, 2) sctransform 21  Thanks to the modular structure of the SSAM framework, it is also easy to add new functionality in response to user requests. For example, since making our package available on GitHub, we have received frequent requests to implement segmentation using the cell-type map as a prior, and we have therefore worked on a preliminary watershed segmentation feature. There have also been some concerns about the scalability of SSAM; we therefore added support for dask, which enables lazy computation on arrays without loading the whole dataset into memory, so that SSAM can be applied to extremely large datasets without memory concerns. Both experimental functionalities are available in the development branch of SSAM on GitHub.

Marker gene selection
The selection of genes to profile is important for successful cell-type detection, i.e. the genes must be carefully selected to be suitable cell-type markers prior to using the SSAM algorithm.
Although this has to be carefully considered on the experimental side, we benchmarked the cell-type detection rate as a function of the number of genes imaged, using several simulated settings in which a proportion of genes was removed from the MERFISH dataset (Supplementary Fig. 40).
Surprisingly, we found that even when removing 80% of the genes, we were still able to recapitulate more than 65% of the cell-type signatures obtained by de novo analysis using all genes. The high variance within each category shows that the number of cell types detected also depends strongly on which genes were selected for analysis, meaning that not only the number of genes but also their selection is important.