# Cell segmentation in imaging-based spatial transcriptomics

## Abstract

Single-molecule spatial transcriptomics protocols based on in situ sequencing or multiplexed RNA fluorescent hybridization can reveal detailed tissue organization. However, distinguishing the boundaries of individual cells in such data is challenging and can hamper downstream analysis. Current methods generally approximate cell positions using nuclei stains. We describe a segmentation method, Baysor, that optimizes two-dimensional (2D) or three-dimensional (3D) cell boundaries by considering the joint likelihood of transcriptional composition and cell morphology. While Baysor can take into account segmentation based on co-stains, it can also perform segmentation based on the detected transcripts alone. To evaluate performance, we extend multiplexed error-robust fluorescence in situ hybridization (MERFISH) to incorporate immunostaining of cell boundaries. Using this and other benchmarks, we show that Baysor segmentation can, in some cases, nearly double the number of cells compared to existing tools while reducing segmentation artifacts. We demonstrate that Baysor performs well on data acquired using five different protocols, making it a useful general tool for analysis of imaging-based spatial transcriptomics.


## Data availability

The following datasets were used in evaluating the developed methods:

1. osmFISH mouse somatosensory cortex (ref. 8), 35 genes: http://linnarssonlab.org/osmFISH/availability/.

2. MERFISH mouse preoptic hypothalamus (ref. 19), 140 genes: https://doi.org/10.5061/dryad.8t8s248.

3. ISS mouse CA1 region (ref. 16), 95 genes: https://doi.org/10.6084/m9.figshare.7150760.v1.

4. STARmap mouse VISp (ref. 18), 1,020 genes: https://www.starmapresources.com/data/ (visual_1020, 20180505_BY3_1kgenes).

5. STARmap mouse VISp (ref. 18), 160 genes: https://www.starmapresources.com/data/ (visual_160, 20171120_BF4_light).

6. seqFISH+ NIH/3T3 cells (ref. 7), 10,000 genes: https://doi.org/10.5281/zenodo.2669683.

7. seqFISH mouse embryo (ref. 45), 387 genes: https://marionilab.cruk.cam.ac.uk/SpatialMouseAtlas/.

8. Allen smFISH mouse VISp, 22 genes: https://github.com/spacetx-spacejam/data.

9. MERFISH mouse ileum, 241 genes: https://doi.org/10.5061/dryad.jm63xsjb2.

## Code availability

The Baysor package is available at https://github.com/kharchenkolab/Baysor. Baysor parameters for different datasets are reported in Supplementary Table 3. The code to reproduce the results is available at https://github.com/kharchenkolab/BaysorAnalysis/. This repository also contains the links to interactive visualization of the processed datasets using the Vitessce tool (http://vitessce.io/). MERFISH probe design and analysis software is available at https://github.com/ZhuangLab/MERFISH_analysis.

## References

1. Mereu, E. et al. Benchmarking single-cell RNA-sequencing protocols for cell atlas projects. Nat. Biotechnol. 38, 747–755 (2020).

2. Regev, A. et al. The human cell atlas. eLife 6, e27041 (2017).

3. HuBMAP Consortium. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature 574, 187–192 (2019).

4. Aldridge, S. & Teichmann, S. A. Single cell transcriptomics comes of age. Nat. Commun. 11, 4307 (2020).

5. Lee, J. H. et al. Highly multiplexed subcellular RNA sequencing in situ. Science 343, 1360–1363 (2014).

6. Ke, R. et al. In situ sequencing for RNA analysis in preserved tissue and cells. Nat. Methods 10, 857–860 (2013).

7. Eng, C.-H. L. et al. Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH. Nature 568, 235–239 (2019).

8. Codeluppi, S. et al. Spatial organization of the somatosensory cortex revealed by osmFISH. Nat. Methods 15, 932–935 (2018).

9. Xia, C., Fan, J., Emanuel, G., Hao, J. & Zhuang, X. Spatial transcriptome profiling by MERFISH reveals subcellular RNA compartmentalization and cell cycle-dependent gene expression. Proc. Natl Acad. Sci. USA 116, 19490–19499 (2019).

10. Rodriques, S. G. et al. Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019).

11. Vickovic, S. et al. High-definition spatial transcriptomics for in situ tissue profiling. Nat. Methods 16, 987–990 (2019).

12. Lein, E., Borm, L. E. & Linnarsson, S. The promise of spatial transcriptomics for neuroscience in the era of molecular cell typing. Science 358, 64–69 (2017).

13. Bingham, G. C., Lee, F., Naba, A. & Barker, T. H. Spatial-omics: novel approaches to probe cell heterogeneity and extracellular matrix biology. Matrix Biol. 91-92, 152–166 (2020).

14. Soldatov, R. et al. Spatiotemporal structure of cell fate decisions in murine neural crest. Science 364, eaas9536 (2019).

15. Chen, W.-T. et al. Spatial transcriptomics and in situ sequencing to study Alzheimer’s disease. Cell 182, 976–991 (2020).

16. Qian, X. et al. Probabilistic cell typing enables fine mapping of closely related cell types in situ. Nat. Methods 17, 101–106 (2020).

17. Chen, K. H., Boettiger, A. N., Moffitt, J. R., Wang, S. & Zhuang, X. RNA imaging. Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348, aaa6090 (2015).

18. Wang, X. et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 361, eaat5691 (2018).

19. Moffitt, J. R. et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362, eaau5324 (2018).

20. Wang, Z. Cell segmentation for image cytometry: advances, insufficiencies, and challenges. Cytometry A 95, 708–711 (2019).

21. Park, J. et al. Cell segmentation-free inference of cell types from in situ transcriptomics data. Nat. Commun. 12, 3545 (2021).

22. Dirmeier, S. & Beerenwinkel, N. Structured hierarchical models for probabilistic inference from perturbation screening data. Preprint at bioRxiv https://doi.org/10.1101/848234 (2019).

23. Zhu, Q., Shah, S., Dries, R., Cai, L. & Yuan, G.-C. Identification of spatially associated subpopulations by combining scRNAseq and sequential fluorescence in situ hybridization data. Nat. Biotechnol. 36, 1183–1190 (2018).

24. Rueden, C. T. et al. ImageJ2: ImageJ for the next generation of scientific image data. BMC Bioinformatics 18, 529 (2017).

25. Wang, G., Moffitt, J. R. & Zhuang, X. Multiplexed imaging of high-density libraries of RNAs with MERFISH and expansion microscopy. Sci. Rep. 8, 4847 (2018).

26. Moffitt, J. R. et al. High-performance multiplexed fluorescence in situ hybridization in culture and tissue with matrix imprinting and clearing. Proc. Natl Acad. Sci. USA 113, 14456–14461 (2016).

27. Stringer, C., Wang, T., Michaelos, M. & Pachitariu, M. Cellpose: a generalist algorithm for cellular segmentation. Nat. Methods 18, 100–106 (2021).

28. Yangel, B. & Vetrov, D. in Energy Minimization Methods in Computer Vision and Pattern Recognition (eds Heyden, A., Kahl, F., Olsson, C., Oskarsson, M. & Tai, X.-C.) 137–150 (Springer, 2013).

29. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at arXiv https://arxiv.org/abs/1802.03426v3 (2018).

30. Kanemura, A., Maeda, S. & Ishii, S. Superresolution with compound Markov random fields via the variational EM algorithm. Neural Netw. 22, 1025–1034 (2009).

31. Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017).

32. Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Series B Stat. Methodol. 39, 1–38 (1977).

33. Nielsen, S. F. The stochastic EM algorithm: estimation and asymptotic results. Bernoulli 6, 457–489 (2000).

34. Kimura, T. et al. Expectation–maximization algorithms for inference in Dirichlet processes mixture. Pattern Anal. Appl. 16, 55–67 (2013).

35. Wolf, F. A., Angerer, P. & Theis, F. J. Scanpy: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).

36. Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).

37. Harris, K. D. et al. Classes and continua of hippocampal CA1 inhibitory neurons revealed by single-cell transcriptomics. PLoS Biol. 16, e2006387 (2018).

38. Hodge, R. D. et al. Conserved cell types with divergent features in human versus mouse cortex. Nature 573, 61–68 (2019).

39. Haber, A. L. et al. A single-cell survey of the small intestinal epithelium. Nature 551, 333–339 (2017).

40. Gehart, H. et al. Identification of enteroendocrine regulators by real-time single-cell differentiation mapping. Cell 176, 1158–1173 (2019).

41. Tsoucas, D. et al. Accurate estimation of cell-type composition from gene expression data. Nat. Commun. 10, 2975 (2019).

42. Moffitt, J. R. et al. High-throughput single-cell gene-expression profiling with multiplexed error-robust fluorescence in situ hybridization. Proc. Natl Acad. Sci. USA 113, 11046–11051 (2016).

43. Stringer, C., Wang, T., Michaelos, M. & Pachitariu, M. Cellpose: a generalist algorithm for cellular segmentation. Nat. Methods 18, 100–106 (2021).

44. Traag, V. A., Waltman, L. & Van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).

45. Lohoff, T. et al. Highly multiplexed spatially resolved gene expression profiling of mouse organogenesis. Preprint at bioRxiv https://doi.org/10.1101/2020.11.20.391896 (2020).

## Acknowledgements

We thank B. Tasic and B. Long for sharing the unpublished Allen smFISH data and aiding in its interpretation, and the SpaceTx consortium for facilitating the collaborations. We also thank Y. Boykov (University of Waterloo) for the initial discussions and advice on an alternative segmentation approach based on graph cuts. We are also grateful to a number of colleagues who advised us on the published protocols, including N. Pierson and L. Cai (seqFISH+), S. Codeluppi, L. Borm and S. Linnarsson (osmFISH) and X. Qian, M. Hilscher and M. Nilsson (ISS). Additionally, we thank J. Miller for his input on segmentation benchmarks and B. Lelieveldt for his advice on NCV visualization. We express our gratitude to D. Molchanov and D. Vetrov (HSE, Moscow) for their input on the algorithm. J.R.M. acknowledges pilot funding from the Harvard Digestive Disease Center (P30 DK034854). V.P., P.V.K. and J.R.M. were supported by the Seed Network grant 2019-202743 from the Chan Zuckerberg Initiative. V.P. is funded through a cooperative agreement between University of Copenhagen and Harvard Medical School.

## Author information

### Contributions

P.V.K. and V.P. formulated the study and the overall approach. V.P. developed the detailed algorithms with advice from R.A.S. and K.K. V.P. implemented the Baysor package. J.R.M., R.J.X. and P.C. developed boundary immunostaining and performed MERFISH measurements. V.P. and P.V.K. drafted the manuscript, with contributions by J.R.M., R.A.S. and R.J.X. All authors provided suggestions and corrections on the manuscript text.

### Corresponding author

Correspondence to Peter V. Kharchenko.

## Ethics declarations

### Competing interests

P.V.K. serves on the Scientific Advisory Board to Celsius Therapeutics, Inc., and Biomage, Inc. J.R.M. is a cofounder and Scientific Advisory Board member of Vizgen, Inc. J.R.M. is an inventor on patents associated with MERFISH applied for on his behalf by Harvard University and Boston Children’s Hospital. The other authors declare no conflict of interest.

Peer review information: Nature Biotechnology thanks Kenneth Harris and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Extended data

### Extended Data Fig. 1 Graphical models of the segmentation process.

Graphical representations of the Bayesian models used for the general Markov random field (MRF) segmentation process (a) and the extended model for cell segmentation (b) are shown. Blue squares represent input parameters and data for the algorithm. The yellow circles represent the hidden parameters, fitted by the algorithm. Optional input and parameters are shown with dashed border lines. Round-corner boxes represent plate notation for a mixture of distributions, with the size of the mixture shown in the bottom right corner. $$N^{mols}$$ denotes the number of molecules in the dataset, and $$N^{comps}$$ is the specified number of mixture components. Arrow labels show the distributions used to model dependencies between the corresponding variables. Matrix variables are shown with capital letters and vector variables are designated with the overline. a, The general MRF model, where the MRF prior with weights W is used to account for the spatial dependency of the inferred labels $$\overrightarrow{z}\in 1:N^{comps}$$. Examples of the variables and distributions for different labelling problems are noted below the boxes. b, The detailed model for the cell segmentation problem. Here, Bayesian mixture models with a Dirichlet prior were used, so the possible number of mixture components is infinite, which allows the algorithm to estimate the number of components automatically. To ensure that the components correspond to actual cells, the global scale parameter s was introduced, which specifies the expected cell radius.

### Extended Data Fig. 2 Comparison of Baysor, pciSeq, and DAPI Watershed segmentations based on poly(A) signal.

a, Number of cells in different segmentations. b, Distribution of poly(A) brightness (x-axis) across background molecules (that is, molecules outside of the predicted segmentations) for different segmentations (color). Baysor shows the lowest number of transcripts in bright poly(A) regions, while Watershed has the heaviest tail. c, Number of cells overlapping with the segmented poly(A) regions is shown as a distribution for different segmentation methods. Baysor shows the highest frequency of one-to-one mapping with poly(A) segmentations. d, Mutual information (y-axis) of molecule assignment with poly(A) segmentation is shown for different segmentation methods. To account for local variation, we split the data over a 7×7 grid and show the mean and 95% CI, as well as individual values for n = 49 sub-regions. e, Size of the overlap for a best-matching cell in the poly(A) segmentation, normalized to the total size (in molecules) of the best-matching poly(A) cell, is shown as a distribution for different segmentation methods. A peak near 1.0 indicates that many poly(A) cells are fully covered by the best-matching target cells reported by a segmentation method. f, Similar to (e), the size of the overlap with best-matching poly(A) cells is shown as a fraction of the target cell size for different target segmentation methods. A peak near 1.0 indicates that many reported target cells are fully covered by the best-matching poly(A) cell. g-k, Example borders of Baysor (purple) and Watershed (blue) segmentations are shown. The left plots show poly(A) signal, while the right plots show molecules colored based on local expression patterns (NCVs). While in most cases there is a good correspondence between the two modalities (g-i), in some cases molecular composition clearly indicates the presence of distinct cells that are not easily separated using the poly(A) signal intensity (j,k).
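As an illustration of the mutual information comparison in panel d, the following is a minimal sketch of how agreement between two molecule-to-cell assignments could be scored. It is not taken from the Baysor code: the function name `mutual_information` is hypothetical, background molecules are assumed to carry an ordinary label like any cell, and the value is returned unnormalized in nats, whereas the caption does not state which normalization the authors used.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two per-molecule cell labelings.

    labels_a[i] and labels_b[i] are the cell labels given to molecule i by
    two different segmentations (hypothetical input format).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both assignments must cover the same molecules")
    n = len(labels_a)
    if n == 0:
        return 0.0

    joint = Counter(zip(labels_a, labels_b))  # co-occurrence counts
    count_a = Counter(labels_a)               # marginal counts, segmentation A
    count_b = Counter(labels_b)               # marginal counts, segmentation B

    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    return mi
```

To mirror the grid-based analysis described in the caption, one could apply such a function separately to the molecules falling in each sub-region and then summarize the resulting values.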

### Extended Data Fig. 3 Impact of the ‘prior segmentation confidence’ parameter on the difference between the prior and the posterior segmentations on the example of ISS CA1 region.

a, The mutual information between the Baysor and the Paper (published) segmentations (y-axis) is shown as a function of the prior segmentation confidence (x-axis). Mutual information does not reach the value of 1.0 because, even with prior confidence set to 1.0, Baysor is still allowed to re-assign molecules recognised as background in the Paper segmentation. b, For each cell of the source segmentation (shown with colour), the cell with the largest overlap was picked from the target segmentation. The overlap fraction is shown on the y-axis for the different values of prior segmentation confidence. The boxes represent distribution quartiles with the maximal length of whiskers equal to 1.5 times the inter-quartile range. For high values of the prior confidence, each Paper cell has a Baysor cell that covers it completely (confidence ≥0.9, Source=Paper). The opposite is not true, as Baysor is allowed to re-assign the background molecules from the Paper segmentation.
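The best-match overlap fraction in panel b can be sketched as follows. This is an illustrative reading of the caption rather than the authors' implementation: the function name `best_overlap_fractions`, the per-molecule label lists and the use of `0` as the background label are all assumptions, and overlap is measured in molecules.

```python
from collections import Counter, defaultdict

BACKGROUND = 0  # assumed label for molecules not assigned to any cell


def best_overlap_fractions(source, target):
    """For each source cell, the fraction of its molecules that fall inside
    the single target cell sharing the most molecules with it.

    source[i] and target[i] are the cell labels of molecule i in the two
    segmentations (hypothetical input format).
    """
    if len(source) != len(target):
        raise ValueError("both assignments must cover the same molecules")

    overlaps = defaultdict(Counter)  # source cell -> target cell -> shared molecules
    sizes = Counter()                # source cell -> total molecules
    for s, t in zip(source, target):
        if s == BACKGROUND:
            continue  # background molecules do not form a source cell
        sizes[s] += 1
        if t != BACKGROUND:
            overlaps[s][t] += 1

    fractions = {}
    for s, size in sizes.items():
        best = max(overlaps[s].values(), default=0)  # 0 if no target cell overlaps
        fractions[s] = best / size
    return fractions
```

Swapping the two arguments gives the reverse direction, which is why the caption can report full coverage for Source=Paper without the opposite holding.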

### Extended Data Fig. 4 Cell statistics for different segmentation methods.

The boxplots show distributions of the number of molecules per cell (a, log-scale y-axis) and the square root of the cell area, which is an approximation for the cell radius (b), for different protocols (x-axis) and segmentation methods (fill colours). The boxes represent quartiles with the maximal length of whiskers equal to 1.5 times the inter-quartile range. For all datasets, Baysor has approximately the same values as the published segmentations, which suggests that it is not biased towards over- or under-segmentation. The Watershed and pciSeq methods consistently show lower values, consistent with registering mostly nuclear molecules.

### Extended Data Fig. 5 Comparison of the Baysor and the published segmentation on the MERFISH Hypothalamus dataset.

The figure shows the comparison in the same format as Fig. 5. a, A joint UMAP embedding of the cells from both the Baysor and the paper segmentations. The colors correspond to the annotated cell types. b, The same embedding, colored by the segmentation that produced a specific cell. c, A heatmap showing expression patterns of marker genes (columns) for the different cell types (rows). The colors show expression levels, normalised for each gene. d, The frequency of different cell types is shown for the Baysor (brown bars) and the paper (blue bars) segmentations. The numbers on the top of the bars show the excess percentage for Baysor. The largest difference is observed for endothelial cells, where the paper segmentation has 42% fewer cells compared to Baysor. e,f, Examples of astrocytes (e) and endothelial cells (f) that were not segmented in the paper annotation but were distinguished by Baysor. The dots correspond to the measured molecules, colored by gene (only the three most abundant genes are shown). The grayscale background shows the DAPI signal, and the black contours show the determined cell boundary. g, Example of a region with ependymal cells, showing that in such regions molecules have homogeneous expression patterns. This results in Baysor slightly under-segmenting such cells, which causes the difference in the number of detected cells.

### Extended Data Fig. 6 Comparison of cell clusters in the MERFISH mouse ileum dataset recovered from different segmentation methods.

a,b,c, Leiden clusters, cell type spatial distributions, and marker gene expressions in the Na+K+-ATPase immunofluorescence (IF) MERFISH mouse ileum dataset, where cells are segmented by (a) Baysor with RNA information only, (b) Baysor with priors provided by Cellpose-derived IF boundaries, (c) Cellpose-derived IF boundaries. Left: UMAP of all identified cells colored based on Leiden clustering. Middle: Spatial distributions of all identified cell clusters colored as in the UMAP. Right: Expressions of marker genes in each of the identified cell clusters. The size of the dots represents the fraction of cells with at least one count of the indicated gene. The color of the dots represents the average expression of each gene across all cell types, log-transformed, and normalized to the cell type with the largest expression. DC: dendritic cells; ICC: interstitial cells of Cajal; TA: transit amplifying cells. d, The numbers of each cell type identified by each of the segmentation methods in a-c.

### Extended Data Fig. 7 Spatial distributions of all cell types identified by Baysor (with RNA information only).

Gray dots represent the location of all cells, and colored dots represent the location of the indicated cell type. DC: dendritic cells; ICC: interstitial cells of Cajal; TA: transit amplifying cells.

### Extended Data Fig. 8 Additional benchmarks against MERFISH membrane staining data.

a, Similar to Fig. 6i of the main manuscript, the distribution shows the overlap of different segmentations with the membrane IF segmentation. Baysor+DAPI and Baysor+IF correspond to Baysor run with DAPI and IF segmentations as priors, respectively. b, Size of the overlap of different target segmentations with IF segments is shown relative to the size of the predicted cell in the target segmentation. c, Distribution of the number of target cells matching cells of the membrane IF segmentation is shown for different segmentation results. d, Number of cells recovered by different segmentation methods. e,f, Number of molecules (e) and area (f) per cell, reported by different segmentation methods. The boxes represent distribution quartiles with the maximal length of whiskers equal to 1.5 times the inter-quartile range. g, Agreement between different segmentations and the membrane IF segmentation is assessed using mutual information across molecules for n = 5 central z-planes. The average and 95% confidence intervals across z-planes, as well as dots for individual values, are shown. Only molecules assigned to some cell in any of the methods are used.

### Extended Data Fig. 9 Outstanding challenges: intracellular compartmentalization and homotypic cells.

a, An example of intracellular compartmentalization, illustrated by polarized expression pattern of enterocytes in the mouse ileum, as captured by MERFISH. RNAs are colored by NCV. b, Example of a homotypic cell cluster from the mouse ileum. Three panels show the same region with membrane IF signal. The left panel shows NCV molecule coloring, whereas center and right panels color molecules assigned to each cell differently. Red arrows point at homotypic cells that Baysor was only able to segment with the help of IF prior.

### Extended Data Fig. 10 Outstanding segmentation challenges.

a, seqFISH+ fibroblast (ref. 7) data colored by NCVs with black contours showing the published segmentation borders. b, The same data, segmented by Baysor with colors showing cell assignment. c, Example of cells that are separable only in 3D in the Allen smFISH data. The two plots show 2D projections on the physical x-y and x-z axes, respectively. Each point represents a molecule, coloured by its gene of origin. Gad2 and Pvalb are markers of inhibitory neurons, while Sv2c and Satb2 are markers of excitatory neurons. These markers are mutually exclusive, and there should be no cell that expresses all four of these markers. d,e, seqFISH mouse embryo (ref. 45) data colored by cell type using the published cell assignment (d) and the Baysor cell segmentation (e), with black contours showing the published segmentation borders. The dataset captures cytoplasm-specific genes and lacks nuclear expression, which leads to the holes in the middle of cells. f, Example of a cell from the STARmap VISp 160 dataset (ref. 18). The black lines show the published cell boundaries. The plot shows colouring by gene for the 15 most expressed genes.

## Supplementary information

### Supplementary Information

Supplementary Figs. 1–10 and Supplementary Table captions.

### Supplementary Tables 1–3

Supplementary Tables

## Cite this article

Petukhov, V., Xu, R.J., Soldatov, R.A. et al. Cell segmentation in imaging-based spatial transcriptomics. Nat Biotechnol 40, 345–354 (2022). https://doi.org/10.1038/s41587-021-01044-w

