Abstract
Spatially resolved omics technologies are transforming our understanding of biological tissues. However, the handling of uni- and multimodal spatial omics datasets remains a challenge owing to large data volumes, heterogeneity of data types and the lack of flexible, spatially aware data structures. Here we introduce SpatialData, a framework that establishes a unified and extensible multiplatform file format, lazy representation of larger-than-memory data, transformations and alignment to common coordinate systems. SpatialData facilitates spatial annotations and cross-modal aggregation and analysis, the utility of which is illustrated in the context of multiple vignettes, including integrative analysis of a multimodal Xenium and Visium breast cancer study.
Main
The function of biological tissues is strongly linked to their composition and organization. Advances in imaging and spatial molecular profiling technologies enable these relationships to be addressed by interrogating tissue architectures with ever-growing comprehensiveness, resolution and sensitivity1,2. Existing spatial molecular profiling methods quantify DNA, RNA, protein and/or metabolite abundances in situ3,4. Several of these technologies employ light microscopy, providing spatial resolution of morphological features at length scales from the subcellular to entire organisms. Spatial omics technologies are advancing rapidly, and individual data modalities and methods feature distinct advantages and limitations, such as trade-offs in spatial resolution, the extent of molecular multiplexing and detection sensitivity. The ability to efficiently integrate and then operate on data from different spatial omics modalities promises to be instrumental for the construction of holistic views of biological systems.
While progress has been made in the analysis of individual spatial omics datasets, integration of uni- and multimodal spatial omics data entails important practical challenges not sufficiently addressed by existing solutions5,6,7 (Extended Data Table 1, Supplementary Note 1 and Supplementary Table 1). Even basic operations, such as loading datasets into analysis pipelines in a coherent manner, are hampered by the diversity of data types (for example, tabular data for sequencing and tens- to hundreds-of-gigabyte dense arrays for images) and file formats (for example, technology-specific vendor formats). In addition, individual spatial omics modalities can differ vastly in spatial resolution, and the spatial regions for data acquisition in a tissue are often not aligned. Thus, for integration, such data must be appropriately transformed and aligned to a common coordinate system (CCS), which is a prerequisite for the establishment of global common coordinate frameworks (CCFs)8. Finally, untangling the complexity of multimodal spatial omics datasets requires expert knowledge and motivates approaches that enable large-scale interactive data exploration and annotation. Thus, to unlock the full potential of emerging spatial multiomics studies2,9, there is a need for computational infrastructures to store, explore, analyze and annotate data across the full breadth of spatial omics technologies with a unified programmatic interface.
The SpatialData framework enables the findable, accessible, interoperable, reusable (FAIR)10 integration of multimodal spatial omics data. A language-independent storage format increases the interoperability of data sources, while the Python library standardizes access to, and operation across, different data types. The SpatialData format supports all major spatial omics technologies and derived quantities (Fig. 1a,c, Supplementary Note 2 and Supplementary Table 2). Briefly, spatial datasets are represented using five primitive elements: Images (raster images), Labels (for example, raster segmentation masks), Points (for example, molecular probes), Shapes (for example, polygon regions of interest, array capture locations and so on) and Tables (for example, molecular quantifications and annotations) (Supplementary Tables 2 and 3). The file format also tracks coordinate transformation or alignment steps applied to individual datasets. Dataset collections can be stored within a single SpatialData store, thereby facilitating joint integrative analyses. The SpatialData format builds on the Open Microscopy Environment–Next-Generation File Format (OME–NGFF) specifications and leverages the Zarr file format (Supplementary Fig. 1), thereby offering performant, interoperable access for both traditional file system- and cloud-based storage11,12 (Supplementary Note 3).
The SpatialData Python library represents this format as SpatialData objects in memory, which support lazy loading of larger-than-memory data (Fig. 1b). The library also provides reader functions for widely used spatial omics technologies (Fig. 1c and Supplementary Table 3), as well as versatile functionalities for manipulating and accessing SpatialData objects and for defining CCSs of biological tissues8. Briefly, each individual dataset is associated with a modality-specific coordinate transformation (Fig. 1b) that includes affine transformations and composite operations. Once aligned, a collection of datasets can be queried (Extended Data Fig. 1) and aggregated (Extended Data Fig. 2)—for example, using spatial annotations at diverse scales (cells, grids, anatomical regions), both within and across modalities. The query and aggregation interfaces also allow for the creation of new datasets grouped by biologically informed factors from large dataset collections, thereby facilitating exploration, selective data sharing and access.
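The coordinate transformations described above can be expressed as affine matrices in homogeneous coordinates; the following numpy sketch (with hypothetical scale and translation values, not code from the library) illustrates how a composite transformation maps element coordinates into a common coordinate system:

```python
import numpy as np

# A 2D affine transformation in homogeneous coordinates: here a
# composition of a scaling (e.g. microns-per-pixel) and a translation.
scale = np.array([[0.5, 0.0, 0.0],
                  [0.0, 0.5, 0.0],
                  [0.0, 0.0, 1.0]])
translation = np.array([[1.0, 0.0, 100.0],
                        [0.0, 1.0, 200.0],
                        [0.0, 0.0, 1.0]])
affine = translation @ scale  # composite: scale first, then translate

# Map element coordinates (x, y) into the common coordinate system.
points = np.array([[10.0, 20.0],
                   [40.0, 60.0]])
homogeneous = np.c_[points, np.ones(len(points))]
mapped = (affine @ homogeneous.T).T[:, :2]
print(mapped)  # (10, 20) -> (105, 210); (40, 60) -> (120, 230)
```

Representing transformations as matrices makes composition a simple matrix product, which is how chains of alignment steps can be collapsed into a single mapping per element.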
SpatialData has a napari plugin for interactive annotation (napari-spatialdata; Fig. 1d and Extended Data Fig. 3). The napari-spatialdata plugin can be used for the interactive definition of spatial annotations such as drawing regions of interest, or to define landmarks for guiding multidataset registration. Static figures and graphics can be created using the spatialdata-plot library (Extended Data Fig. 4).
The SpatialData library integrates seamlessly with the Python ecosystem by building on standard scientific Python data types. We have implemented a PyTorch Dataset class to effectively train deep learning models directly from SpatialData objects (Fig. 1e, Supplementary Note 4 and Extended Data Fig. 5). Further, thanks to the modular nature of the data representation, analysis packages in the scverse13 ecosystem such as Scanpy14, Squidpy15 and scvi-tools16 can be used for analysis of SpatialData objects (Fig. 1f and Supplementary Fig. 2). Taken together, the SpatialData framework provides infrastructure for the integration and analysis of spatial omics data.
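The PyTorch Dataset mentioned above yields image tiles centered on cells. The tiling logic behind such a dataset can be sketched without torch, using only numpy (hypothetical shapes and names; this is not the library's implementation):

```python
import numpy as np

class TileDataset:
    """Minimal map-style dataset yielding fixed-size image tiles centered
    on given (x, y) coordinates, mimicking the __len__/__getitem__
    protocol expected by torch.utils.data.DataLoader."""

    def __init__(self, image, centers, tile_size=32):
        self.image = image        # (channels, y, x) array
        self.centers = centers    # (n, 2) array of (x, y) cell centers
        self.half = tile_size // 2

    def __len__(self):
        return len(self.centers)

    def __getitem__(self, i):
        x, y = self.centers[i].astype(int)
        return self.image[:, y - self.half:y + self.half,
                          x - self.half:x + self.half]

# Toy 3-channel image and two cell centers (illustrative values).
image = np.zeros((3, 128, 128))
dataset = TileDataset(image, np.array([[40.0, 50.0], [90.0, 70.0]]))
tile = dataset[0]
print(tile.shape)  # (3, 32, 32)
```

In practice the tiles are cut lazily from the Zarr-backed arrays, so the full image is never loaded into memory.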
To illustrate the utility of SpatialData for multimodal integration and analysis, we used the framework to represent and process data from a breast cancer study that combines hematoxylin and eosin (H&E) images and 10x Genomics Visium and Xenium assays17. The study comprises two in situ sequencing datasets (Xenium) and one spatial transcriptomics dataset (10x Visium CytAssist) from consecutive sections of a breast cancer tumor. First we used napari-spatialdata to define landmark points present in all datasets, followed by alignment of all three datasets using transformations to define a CCS (Fig. 2a). As a result of the alignment, SpatialData enabled us to identify the common spatial area, which can be accessed using SpatialData queries across datasets.
Next we used the collective information from all three datasets to create a shared set of spatial annotations. Briefly, we selected four regions of interest (ROIs) based on histological features present in the H&E image using napari-spatialdata (Extended Data Fig. 6). We then used genome-wide transcriptome information in Visium to estimate copy number states (using CopyKat18) and to annotate major genetic subclones. Finally we annotated cell types in two Xenium replicates by transferring cell-type labels from an independent breast cancer single-cell RNA sequencing (scRNA-seq) atlas19 (ingest, implemented in scanpy14; Fig. 2b).
To exemplify how SpatialData can be used to transfer spatial annotations between datasets, we considered the masks from Visium capture locations and aggregated cell-type information from the overlapping Xenium cells to estimate cell-type fractions at each location. For comparison we also considered a deconvolution-based analysis of Visium counts (using cell2location20) with the same scRNA-seq-derived cell types19 as reference. We observed high concordance of cell-type abundance estimates between Xenium replicates (median Pearson’s R = 0.88 across Visium locations) and overall good agreement between Xenium- and deconvolution-based estimates (median Pearson’s R = 0.69).
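The concordance metric used here, Pearson's R per Visium location summarized by the median, can be reproduced with numpy on toy data (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(0)
n_locations, n_cell_types = 5, 9

# Toy cell-type fraction estimates at each Visium location, e.g. from
# two Xenium replicates (values are illustrative only).
est_a = rng.dirichlet(np.ones(n_cell_types), size=n_locations)
est_b = est_a + rng.normal(scale=0.02, size=est_a.shape)  # correlated noise

# Pearson's R per location, then the median across locations.
per_location_r = np.array([
    np.corrcoef(a, b)[0, 1] for a, b in zip(est_a, est_b)
])
median_r = np.median(per_location_r)
```

The median is preferred over the mean here because per-location correlations are bounded and often skewed.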
Analogous to the aggregation at Visium locations, we considered ROIs defined from H&E and areas defined by the union of subclone locations from Visium (Fig. 2d and Supplementary Fig. 3a). Again we quantified cell-type fractions within each region, either directly using cell count fractions from Xenium or via deconvolution of the corresponding Visium data. The two Xenium replicates showed high concordance of cell-type fractions, and Xenium and Visium were consistent.
As a second aggregation use case we compared expression estimates for individual genes at Visium capture locations using either Xenium or Visium data. We again transferred Visium capture locations to aggregate individual molecule counts from Xenium into the Visium masks (Fig. 2e and Supplementary Fig. 3b). As expected, the aggregated counts were highly concordant between Xenium replicates (median Pearson’s R = 0.62; Fig. 2e and Supplementary Fig. 3c–e) and, to a lesser extent, between Xenium and Visium counts (median Pearson’s R = 0.48; Supplementary Fig. 3c–e). We also noted a direct relationship between overall transcript abundance and the agreement between different tissue sections and technologies (Fig. 2e).
In sum, these examples illustrate the flexibility of the aggregation functionality that can be applied between SpatialElements of different kinds (points, circular capture locations, cells and larger anatomical ROIs) to transfer diverse types of spatial annotation (cell expression, cell-type fractions). Further examples and advanced-use cases of SpatialData aggregation operations are discussed in Extended Data Fig. 2.
SpatialData facilitates the processing of a wide range of uni- and multimodal datasets. The online documentation of SpatialData comes with vignettes that illustrate additional use cases. For example, we illustrate how SpatialData can serve as a backend to facilitate the training of deep learning models (Extended Data Fig. 5 and Supplementary Note 4), and to conduct downstream analysis using spatial interpretation tools such as Squidpy (Supplementary Fig. 2). As a starting point for using SpatialData in conjunction with different technologies, we also currently provide preformatted SpatialData objects from >40 datasets acquired by eight technologies (Supplementary Table 2). Interactive annotation can be performed on both single- and multimodality datasets. Finally we explored how SpatialData can align multiple fields of view into a global reference coordinate system by mapping 12 Visium slides to a large prostate section (Extended Data Fig. 7). Further information, including comprehensive documentation of the SpatialData Python library, tutorials, example datasets and a contributor guide, is available online (https://spatialdata.scverse.org).
Here we present SpatialData, a flexible, community standards-based framework for storage, processing and annotation of data from virtually any spatial omics technology available to date. The ability to flexibly create common coordinate systems by aligning datasets is a critical cornerstone to establishing comprehensive CCFs, which will unlock new analysis approaches that facilitate robust comparison and reuse of samples across studies. In conclusion, the flexibility and readily accessible solutions provided by the SpatialData framework enable new possibilities in analysis and enhance the reproducibility of integrated spatial analysis.
As the uptake of SpatialData continues to grow, its utility will increase further. Ongoing developments (discussed in Supplementary Notes 5 and 6) extend the interoperability of SpatialData with R/Bioconductor21, provide support for multiscale point and polygon representations—such as polygonal meshes and five-dimensional volumetric images (that is, czyx images with an additional time component)—and support cloud-based data access both programmatically and via the visualization tool Vitessce22. In summary, SpatialData provides an open and universal data framework for spatial omics.
Methods
SpatialData framework
The SpatialData framework comprises a core package, spatialdata, and the associated satellite packages napari-spatialdata, spatialdata-io and spatialdata-plot, all compatible with Python 3.9 and above. All code is available on GitHub as part of the scverse organization and is licensed under the permissive ‘BSD 3-Clause License’. The project structures inherit from the scverse cookiecutter and the napari plugin cookiecutter, thus implementing unit tests and precommit checks in a continuous integration setting. The documentation is built using Sphinx and hosted on Read the Docs. It includes application programming interface (API) descriptions, example notebooks and a table with links to downloadable spatial omics datasets. Each dataset can be downloaded in full (.zip) or accessed directly from the cloud (public S3 storage). Documentation, tutorials and sample data can be found at the links below.
- Documentation: https://spatialdata.scverse.org
- Installation instructions: https://spatialdata.scverse.org/en/latest/installation.html
- Tutorials: https://spatialdata.scverse.org/en/latest/tutorials/notebooks/notebooks.html
- Sample data: https://spatialdata.scverse.org/en/latest/tutorials/notebooks/datasets/README.html
We also provide a contribution guide and technical design document to encourage adoption. Users can reach out to the core development team via the GitHub Issues bug-tracking system. To encourage collaboration between the imaging and scverse communities we have created a public chat stream on the imagesc Zulip messaging platform: https://imagesc.zulipchat.com/#narrow/stream/329057-scverse.
SpatialData framework dependencies
The framework depends on routinely used Python libraries. In detail, the spatialdata package depends on networkx, numpy (scientific stack), anndata (single-cell data), dask-image, multiscale-spatial-image, ome-zarr-py, spatial-image, xarray, xarray-schema, xarray-spatial, zarr (raster spatial data), geopandas, pyarrow, pygeos, shapely (vector spatial data), fsspec, rich, tqdm, typing_extensions (utilities) and torch (deep learning, optional dependency).
The satellite packages spatialdata-io, spatialdata-plot and napari-spatialdata require additional dependencies; we refer the reader to the Reporting Summary for a complete list, and to the pyproject.toml and setup.cfg files of the corresponding GitHub repositories for the most up-to-date list, as the packages and their dependencies continuously evolve.
All packages in the SpatialData framework are routinely published to PyPI via GitHub Actions and, as such, pip can be used readily to install the software and all its dependent libraries. Conda support is in preparation.
Raw human breast cancer Xenium and Visium data
We downloaded the raw data from https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast.
Loading Xenium and Visium datasets into SpatialData
The 10x Xenium and Visium readers from spatialdata-io were used to read the data into SpatialData objects. For the Xenium datasets, the DAPI channel was stored as a multiscale Image, cell and nuclei segmentation masks and boundaries were stored as Shapes elements, whereas the transcripts were stored as Points. The metadata and count matrices were stored as a Table in the SpatialData object. For the Visium dataset, the H&E image was stored as a multiscale Image, the array capture areas (circles) were stored as Shapes and the count matrix and annotations were stored in the Table.
Cell-type annotation of Xenium replicates
We annotated cells from the Xenium replicates using a publicly available scRNA-seq breast cancer atlas19 comprising nine malignant and normal cell types and 29 subtypes. After subsetting the atlas to the 313 genes present in the Xenium panel, we applied the ingest method for label transfer, as implemented in the Scanpy package (v.1.9)14, to annotate cells from the Xenium replicates. We transferred major cell-type labels first (coarse grained) and then, within each class, mapped minor cell types (fine grained). In the current analysis only major cell types are shown. The nine major cell types are B cells, cancer-associated fibroblasts (CAFs), cancer epithelial, endothelial, normal epithelial, plasmablasts, perivascular-like (PVL) cells and T cells.
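Scanpy's ingest transfers labels by embedding query cells into the reference's latent space and assigning labels from neighboring reference cells. The following numpy sketch illustrates the nearest-neighbor label-transfer idea on toy data (it is not ingest's actual implementation, and all names and values here are illustrative):

```python
import numpy as np

def transfer_labels(ref_embedding, ref_labels, query_embedding, k=5):
    """Assign each query cell the majority label of its k nearest
    reference cells (Euclidean distance in a shared embedding)."""
    labels = []
    for q in query_embedding:
        dists = np.linalg.norm(ref_embedding - q, axis=1)
        nearest = np.argsort(dists)[:k]
        values, counts = np.unique(ref_labels[nearest], return_counts=True)
        labels.append(values[np.argmax(counts)])
    return np.array(labels)

# Toy example: two well-separated reference clusters in a 2D embedding.
ref = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
                 np.random.default_rng(2).normal(5, 0.1, (20, 2))])
ref_labels = np.array(["T cells"] * 20 + ["B cells"] * 20)
query = np.array([[0.05, -0.02], [5.1, 4.9]])
print(transfer_labels(ref, ref_labels, query))  # ['T cells' 'B cells']
```

In the real analysis the shared embedding is a PCA fit on the reference atlas restricted to the Xenium gene panel.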
Alignment to create common coordinate systems
We selected three landmark points from the images of the two Xenium replicates and the Visium dataset. Landmark points must be selected on each image in the same order, such that there is a one-to-one spatial correspondence between the sets of points. Xenium replicate 1 was used as the reference to which Xenium replicate 2 and Visium were aligned using the SpatialData function align_elements_using_landmarks. We used napari-spatialdata to annotate the landmark points and to view the results of the alignment. Internally, Dask’s lazy loading and Zarr’s multiscale representation made it possible to performantly explore and zoom the datasets, even on a low-memory device such as a standard laptop.
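With three non-collinear landmark pairs in 2D, an affine transformation is fully determined. The following numpy sketch illustrates the kind of least-squares fit performed by landmark-based alignment (toy coordinates; this is not the library's implementation of align_elements_using_landmarks):

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform mapping src points onto dst.
    Returns a 3x3 matrix in homogeneous coordinates."""
    src_h = np.c_[src, np.ones(len(src))]                 # (n, 3)
    coeffs, *_ = np.linalg.lstsq(src_h, dst, rcond=None)  # (3, 2)
    affine = np.eye(3)
    affine[:2, :] = coeffs.T
    return affine

# Three landmark pairs (illustrative coordinates):
# the target is the source scaled by (2, 3) and shifted by (10, 20).
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
dst = np.array([[10.0, 20.0], [12.0, 20.0], [10.0, 23.0]])
A = fit_affine(src, dst)

# Mapping the source landmarks reproduces the target landmarks.
mapped = (A @ np.c_[src, np.ones(3)].T).T[:, :2]
```

With more than three landmark pairs the same least-squares fit averages out small annotation errors instead of interpolating them exactly.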
Computation of cell-type fractions for Visium
Following alignment, the shared area between each cell from the Xenium replicates and each Visium location was computed. Cell-type fractions were then computed for each Visium location based on the surface fractions of the location covered by each cell type. This was done using the SpatialData aggregate function with fractions=True, and was performed separately for Xenium replicates 1 and 2.
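The surface-fraction computation can be sketched with shapely, which the SpatialData stack builds on for vector geometries (toy geometries; the library's exact normalization inside aggregate may differ):

```python
from shapely.geometry import Point, Polygon

# One Visium capture location, modeled as a circle (buffered point).
location = Point(5.0, 5.0).buffer(3.0)

# Toy cell polygons with a cell-type annotation (illustrative geometry).
cells = [
    (Polygon([(3, 3), (6, 3), (6, 6), (3, 6)]), "T cells"),
    (Polygon([(5, 5), (8, 5), (8, 8), (5, 8)]), "cancer epithelial"),
]

# Surface area of the location covered by each cell type,
# normalized here to fractions of the total covered area.
areas = {}
for geom, cell_type in cells:
    overlap = location.intersection(geom).area
    areas[cell_type] = areas.get(cell_type, 0.0) + overlap
total = sum(areas.values())
fractions = {t: a / total for t, a in areas.items()}
```

The same intersection-and-normalize pattern applies whether the cells are polygons (Xenium boundaries) or circles.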
Cell-type deconvolution using cell2location
We used cell2location (v.0.1.3)20 to estimate cell-type fractions at Visium locations, with the aforementioned breast cancer atlas as the reference. For this task we operated on the subset of 313 genes present in the Xenium replicates and subset the Visium dataset and breast cancer atlas to those genes. We used the default parameters as suggested in the cell2location tutorial (https://cell2location.readthedocs.io/en/latest/notebooks/cell2location_tutorial.html). The analysis can be found at https://github.com/scverse/spatialdata-notebooks/tree/main/notebooks/paper_reproducibility. For visualization, only cell types contributing at least 5% per Visium capture location were taken into account; the quantities at each location were then normalized to sum to 1.
ROI selection with napari-spatialdata
Following alignment, four ROIs were selected based on the H&E image from the Visium dataset using the napari-spatialdata plugin, and these ROIs were then added to the aligned Xenium replicates. Each ROI was selected based on its distinct microanatomical characteristics and then labeled manually based on the underlying cell-type composition from the Xenium replicates.
Clone detection on Visium using CopyKat
We used CopyKat (v.1.1.0)18 with default parameters to estimate copy number states from the Visium count matrix, followed by hierarchical clustering, which identified three major clusters among the locations labeled as ‘aneuploid’; these three clusters were used as genetic subclones. We also transferred clone labels to overlapping cells from the Xenium replicates; these labels were stored as a SpatialData Table element. This analysis was conducted separately in R (see the notebooks repository: https://github.com/scverse/spatialdata-notebooks/tree/main/notebooks/paper_reproducibility).
Visium’s AnnData table was saved in .h5ad AnnData format14,25 for loading and analysis in R, and clone labels were then transferred back to SpatialData via .h5ad. There are ongoing efforts in the Bioconductor community to enable direct loading of AnnData tables into R from Zarr, such as anndataR26, which, when completed, would obviate the need for exporting as .h5ad (HDF5 format).
ROI cell-type fractions
We next computed, for each ROI and clone, the fractions of cell types among the cells contained within them. The SpatialData aggregation APIs offer a convenient interface to compute these metrics, regardless of whether the entities being aggregated are a set of circles or polygons, and whether the target region is a polygonal ROI or a set of circles defining a particular clone.
Transcript aggregations
For each Visium capture location we aggregated the transcripts from the Xenium replicates falling within that location; we performed this analysis for Xenium replicates 1 and 2 separately. This yielded two aggregated count matrices that were saved as separate layers in the Table of Visium’s SpatialData object.
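The point-in-circle aggregation can be sketched in numpy (toy coordinates and gene names; the library performs this via its aggregate API on lazily loaded Points):

```python
import numpy as np

# Toy transcript coordinates and gene labels (illustrative values).
transcripts = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.8]])
genes = np.array(["ERBB2", "ESR1", "ERBB2", "ERBB2"])

# Visium capture locations as circle centers with a common radius.
centers = np.array([[1.0, 1.0], [5.0, 5.0]])
radius = 1.0
gene_names = np.unique(genes)  # sorted: ERBB2, ESR1

# Count transcripts of each gene falling within each location:
# one row per location, one column per gene.
counts = np.zeros((len(centers), len(gene_names)), dtype=int)
for (x, y), gene in zip(transcripts, genes):
    inside = np.linalg.norm(centers - [x, y], axis=1) <= radius
    counts[inside, np.searchsorted(gene_names, gene)] += 1
```

The resulting matrix has the same shape as the Visium count matrix restricted to the shared gene panel, which is what makes the layer-by-layer comparison in the main text possible.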
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
We converted several example datasets to Zarr using the SpatialData package. At the time of writing we included data from the following technologies: NanoString CosMx, 10x Genomics Xenium, 10x Genomics Visium, CyCIF, MERFISH, MIBI-TOF and Imaging Mass Cytometry. The scripts used to convert data, as well as the converted data, are accessible from https://spatialdata.scverse.org/en/latest/tutorials/notebooks/datasets/README.html. For an overview of the datasets and their respective source publication please refer to Supplementary Table 2.
Code availability
SpatialData is available as a Python package via pip, and comes with an extensive set of examples and tutorials that can be accessed from the documentation at https://spatialdata.scverse.org. Furthermore, the documentation also includes a contribution guide for researchers interested in participating in the design and implementation of the framework. All scripts used to reproduce the analyses included in this manuscript can be downloaded from the spatialdata-notebook repository: https://github.com/scverse/spatialdata-notebooks/tree/main/notebooks/paper_reproducibility.
References
Asp, M., Bergenstråhle, J. & Lundeberg, J. Spatially resolved transcriptomes-next generation tools for tissue exploration. Bioessays 42, e1900221 (2020).
Rao, A., Barkley, D., França, G. S. & Yanai, I. Exploring tissue architecture using spatial transcriptomics. Nature 596, 211–220 (2021).
Vandereyken, K., Sifrim, A., Thienpont, B. & Voet, T. Methods and applications for single-cell and spatial multi-omics. Nat. Rev. Genet. 24, 494–515 (2023).
Seferbekova, Z., Lomakin, A., Yates, L. R. & Gerstung, M. Spatial biology of cancer evolution. Nat. Rev. Genet. 24, 295–313 (2023).
Moses, L. & Pachter, L. Museum of spatial transcriptomics. Nat. Methods 19, 534–546 (2022).
Park, J. et al. Spatial omics technologies at multimodal and single cell/subcellular level. Genome Biol. 23, 256 (2022).
Moffitt, J. R., Lundberg, E. & Heyn, H. The emerging landscape of spatial profiling technologies. Nat. Rev. Genet. 23, 741–759 (2022).
Rood, J. E. et al. Toward a common coordinate framework for the human body. Cell 179, 1455–1467 (2019).
Palla, G., Fischer, D. S., Regev, A. & Theis, F. J. Spatial components of molecular tissue biology. Nat. Biotechnol. 40, 308–318 (2022).
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).
Moore, J. et al. OME-Zarr: a cloud-optimized bioimaging file format with international community support. Histochem. Cell Biol. 160, 223–251 (2023).
Moore, J. et al. OME-NGFF: a next-generation file format for expanding bioimaging data-access strategies. Nat. Methods 18, 1496–1498 (2021).
Virshup, I. et al. The scverse project provides a computational ecosystem for single-cell omics data analysis. Nat. Biotechnol. 41, 604–606 (2023).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Palla, G. et al. Squidpy: a scalable framework for spatial omics analysis. Nat. Methods 19, 171–178 (2022).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Janesick, A. et al. High resolution mapping of the breast cancer tumor microenvironment using integrated single cell, spatial and in situ analysis of FFPE tissue. Nat. Commun. 14, 8353 (2023).
Gao, R. et al. Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes. Nat. Biotechnol. 39, 599–608 (2021).
Wu, S. Z. et al. A single-cell and spatially resolved atlas of human breast cancers. Nat. Genet. 53, 1334–1347 (2021).
Kleshchevnikov, V. et al. Cell2location maps fine-grained cell types in spatial transcriptomics. Nat. Biotechnol. 40, 661–671 (2022).
Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).
Keller, M. S. et al. Vitessce: a framework for integrative visualization of multi-modal and spatially-resolved single-cell data. Preprint at https://doi.org/10.31219/osf.io/y8thv (2023).
The MONAI Consortium. Project MONAI. Zenodo https://doi.org/10.5281/zenodo.4323059 (2020).
Gayoso, A. et al. A Python library for probabilistic analysis of single-cell omics data. Nat. Biotechnol. 40, 163–166 (2022).
Virshup, I., Rybakov, S., Theis, F. J., Angerer, P. & Alexander Wolf, F. anndata: Annotated data. Preprint at bioRxiv https://doi.org/10.1101/2021.12.16.473007 (2021).
anndataR: AnnData interoperability in R. GitHub https://github.com/scverse/anndataR (2023).
Li, T. et al. WebAtlas pipeline for integrated single cell and spatial transcriptomic data. Preprint at bioRxiv https://doi.org/10.1101/2023.05.19.541329 (2023).
Moses, L., Jackson, K., Luebbert, L. & Pachter, L. Voyager: from geospatial to spatial omics. GitHub https://github.com/pachterlab/voyager (2023).
Righelli, D. et al. SpatialExperiment: infrastructure for spatially-resolved transcriptomics data in R using Bioconductor. Bioinformatics 38, 3128–3131 (2022).
Dries, R. et al. Giotto: a toolbox for integrative analysis and visualization of spatial expression data. Genome Biol. 22, 78 (2021).
Peters Couto, B. Z., Robertson, N., Patrick, E. & Ghazanfar, S. MoleculeExperiment enables consistent infrastructure for molecule-resolved spatial omics data in bioconductor. Bioinformatics 39, btad550 (2023).
Yuan, Z. et al. SODB facilitates comprehensive exploration of spatial omics data. Nat. Methods 20, 387–399 (2023).
Xu, Z. et al. STOmicsDB: a database of Spatial Transcriptomic data. Nucleic Acids Res. 52, D1053–D1061 (2024).
Baker, E. A. G. et al. emObject: domain specific data abstraction for spatial omics. Preprint at bioRxiv https://doi.org/10.1101/2023.06.07.543950 (2023).
Long, B., Miller, J. & The SpaceTx Consortium. SpaceTx: a roadmap for benchmarking spatial transcriptomics exploration of the brain. Preprint at https://doi.org/10.48550/arXiv.2301.08436 (2023).
Erickson, A. et al. Spatially resolved clonal copy number alterations in benign and malignant tissue. Nature 608, 360–367 (2022).
Acknowledgements
We thank the following individuals for their contributions: D. Bredikhin for participation in a hackathon in Basel (April 2022) focused on discussions on representations for multiple modalities and in the scverse ecosystem; B. Wadie, C. Tischer, S. Gonzalez Tirado and L. Hetzel for attending a hackathon in Heidelberg (June 2022); A. Lomakin for his contributions to discussions on alignment of clones and niches; O. Lazareva for contributing work on clonality for the breast cancer study during the de.NBI BioHackathon SpaceHack project in Lutherstadt-Wittenberg (December 2022); and organizers of and participants in the de.NBI BioHackathon SpaceHack project. We thank H. L. Crowell, C. Ahlmann-Eltze, M. Smith, N. Eiling, L. Meyer and L. Moses for valuable discussions on R interoperability and their prototype implementations of R readers for OME–Zarr and SpatialData objects. We also thank I. Gold, M. Keller, N. Gehlenborg, T. Li and O. Bayraktar for discussions and initial implementations concerning JavaScript interoperability for remote data visualization with Vitessce22, in particular as part of the WebAtlas pipeline27. In addition we thank M. Klein for his valuable contributions to napari-spatialdata during a hackathon in Heidelberg (April 2023); F. Wünnemann for his contributions on spatialdata-io in another hackathon in Heidelberg (July 2023); Q. Blampey for contributions to spatialdata-io; A. Shmatko for contribution to implementation of the napari lasso tool; A. Defauw for his work on the apply function for raster data; M. McCormick for discussions and support regarding usage of the packages SpatialImage and MultiscaleSpatialImage; W. Moore for discussions on OME–NGFF and technical support on OME–Zarr; J. Bogovic for developing the OME–NGFF transformation specification; J. Lüthi and C. Mah for discussions during SpatialData meetings; A. S. Eisenbarth for bug fixes and general feedback; and T. Graf for his work on a prototype involving nonlinear transformations.
In addition we thank group members of the Stegle and Theis laboratory for helpful discussions. Finally we acknowledge the respective funding programs of the authors. L.M. is supported by the EMBL International PhD Programme. G.P. is supported by the Helmholtz Association under the joint research school Munich School for Data Science and by the Joachim Herz Foundation. K.A.Y. was supported by the Open Research Data Program of the ETH Board and a Personalized Health and Related Technologies Transition Postdoc Fellowship (no. PHRT 2021–448). E.H. is supported by the DKFZ International PhD Programme. T.T. is supported by the Helmholtz Association under the joint research school Munich School for Data Science. W.-M.V. is supported by the EMBL International PhD Programme and by research funding from Cellzome, a GSK company. B.R. and Y.S. are supported by the Flanders AI Research Program. L.P. is supported by a PhD fellowship from The Research Foundation—Flanders (grant no. 11J7324N). Y.S. is supported by the FWO-EOS program and the BOF-GOA fund. H.V. is supported by a BMBF grant (SIMONA). S.K.S. acknowledges core funding from the European Molecular Biology Laboratory and research funding from Cellzome, a GSK company. J.M. was supported for work on OME–NGFF by grant nos. 2019-207272 and 2022-310144 and on Zarr by grant nos. 2019-207338 and 2021-237467 from Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation; and was funded by Deutsche Forschungsgemeinschaft (German Research Foundation, no. 501864659) as part of NFDI4BIOIMAGE. F.J.T. acknowledges support by the Helmholtz Association’s Initiative and Networking Fund through Helmholtz AI (grant no. ZT-I-PF-5-01), by Wellcome Leap as part of the ∆Tissue Program and by the Chan Zuckerberg Initiative DAF (advised fund of Silicon Valley Community Foundation, grant no. 2021-240328 (5022)). O.S. acknowledges support by Wellcome Leap as part of the ∆Tissue Program.
This project has been made possible in part by grant no. 2023-323350 from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation.
Funding
Open access funding provided by European Molecular Biology Laboratory (EMBL).
Author information
Authors and Affiliations
Contributions
L.M., G.P., K.A.Y. and I.V. contributed equally. E.H., T.T., and W.V. contributed equally. L.M., G.P., K.A.Y. and I.V. designed and authored the spatialdata library, with contributions from I.K. during early prototyping. L.M., G.P., K.A.Y., I.V. and J.M. authored the spatialdata storage specification. G.P., L.M., M.T., W.V. and R.B.S. wrote napari-spatialdata. H.V. designed and prototyped spatialdata-plot with input from T.T., G.P. and L.M. T.T., G.P., S.K.S. and H.V. implemented spatialdata-plot. G.P., L.M., L.L. and W.-M.V. implemented the spatialdata-io library. E.H. performed analysis on the (Xenium and Visium) breast cancer dataset with input from L.M., G.P. and K.A.Y. E.H., W.-M.V., B.R. and L.P. contributed to library improvement. O.S., F.J.T. and J.M. supervised the work.
Corresponding authors
Ethics declarations
Competing interests
J.M. holds equity in Glencoe Software, which builds products based on OME–NGFF. F.J.T. consults for Immunai, Inc., Singularity Bio B.V., CytoReason Ltd, Cellarity and Omniscope and has ownership interest in Dermagnostix GmbH and Cellarity. O.S. is a paid consultant of Insitro, Inc., S.K.S. is a consulting scientific cofounder for Digital Biology, Inc. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Nils Gehlenborg and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Rita Strack, in collaboration with the Nature Methods team. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Illustration of the SpatialData query function.
To facilitate analyses on large datasets, SpatialData enables the selection of distinct regions within a dataset. The spatial query interface allows users to request the data contained in a query region, which can be specified either as a bounding box or as a polygonal region, using any of the coordinate systems present in the SpatialData object. The query operator returns a derived SpatialData object that contains the data within the query region for all layers, including the corresponding table annotations. The bounding box spatial query can be performed in 2D for all elements or in 3D for raster elements (that is, Images and Labels) and points; an extended discussion of 3D queries is presented in Supplementary Note 6. Shown are code excerpts from the spatial query tutorial, which explains how a region of interest can be specified, either as a rectangular bounding box or via a polygonal shape, and how the data underlying the specified query region can be retrieved. The full example can be found in the ‘spatial query’ notebook in the online documentation (https://spatialdata.scverse.org/en/latest/tutorials/notebooks/notebooks/examples/spatial_query.html).
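The filtering concept behind the bounding box query can be sketched in a few lines. The following is a minimal, self-contained illustration using NumPy on toy point coordinates (the function name and data are hypothetical, chosen to mirror the query semantics described above); the actual SpatialData implementation additionally subsets images, labels, shapes and the associated table annotations, and operates lazily on larger-than-memory data.

```python
import numpy as np

def bounding_box_query(coords: np.ndarray, min_coordinate, max_coordinate) -> np.ndarray:
    """Return the points falling inside an axis-aligned bounding box.

    Illustrative sketch of the filtering performed by a spatial query on a
    Points element; coords is an (n, d) array of point coordinates.
    """
    min_c = np.asarray(min_coordinate)
    max_c = np.asarray(max_coordinate)
    # A point is kept only if it lies within the box along every axis.
    mask = np.all((coords >= min_c) & (coords <= max_c), axis=1)
    return coords[mask]

# Hypothetical single-molecule (x, y) coordinates.
points = np.array([[1.0, 1.0], [5.0, 5.0], [9.0, 2.0]])
inside = bounding_box_query(points, min_coordinate=[0, 0], max_coordinate=[6, 6])
print(len(inside))  # 2 points fall inside the box
```

The same masking idea extends to 3D by adding a third coordinate column, matching the 3D raster and point queries mentioned in the legend.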
Extended Data Fig. 2 Schematic representation of the SpatialData aggregation operations.
Aggregation operations are the foundation for flexibly transferring quantifications and annotations across modalities when conducting multimodal analyses. SpatialData enables the aggregation (also referred to as accumulation in image processing) of data stored in any SpatialElement into any set of target geometries or masks. Example applications include counting the number of single molecules for a specific gene within polygon geometries representing cells, aggregating molecule counts within image masks representing the cytoplasm of cells, or averaging cell gene expression within a given anatomical region (see also main text Fig. 2). Predefined aggregation operators (count, sum, mean, standard deviation) can be applied to any SpatialElement. Additionally, SpatialData supports the definition of user-specified aggregation operators. Leveraging common coordinate systems, aggregation operations can be applied to collections of datasets, including datasets with different spatial scales and/or partially overlapping datasets. Tutorials on how to use the aggregation system are available as part of the SpatialData online documentation (https://spatialdata.scverse.org/en/latest/tutorials/notebooks/notebooks/examples/aggregation.html).
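The mask-based variant of the aggregation described above can be sketched as follows. This is a minimal NumPy illustration of the "mean within each labelled region" operator on a toy image and label mask (function name and data are hypothetical); SpatialData's aggregation additionally supports points and shapes as values or targets, the other predefined reductions (count, sum, standard deviation) and user-defined operators.

```python
import numpy as np

def aggregate_mean(values: np.ndarray, labels: np.ndarray) -> dict:
    """Mean of `values` within each labelled region (label 0 = background).

    Both arrays must share the same shape, i.e. pixel-aligned image and mask.
    """
    out = {}
    for lab in np.unique(labels):
        if lab == 0:
            continue  # skip background
        out[int(lab)] = float(values[labels == lab].mean())
    return out

# Toy 2x3 intensity image and a label mask with two regions.
image = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
mask = np.array([[1, 1, 2],
                 [1, 2, 2]])
means = aggregate_mean(image, mask)
```

Replacing `.mean()` with `.sum()`, `.std()` or a count of nonzero entries yields the other predefined operators; in SpatialData the target regions may equally be polygons rasterized into such a mask via the common coordinate system.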
Extended Data Fig. 3 Example of using napari-spatialdata to visualize and annotate spatial datasets.
Napari-spatialdata enables the interactive visualization of SpatialElements (Images, Labels, Points, Shapes) together with associated annotations (such as gene expression or cluster labels). Embeddings of molecular profiles (for example, t-SNE, UMAP) can be interactively queried via the scatter plot widget. Spatial annotations can be interactively created by drawing regions in the napari viewer. The corresponding annotations are then exported into the underlying SpatialData object, facilitating their use in downstream analyses. a. NanoString CosMx dataset with interactive lasso selection in the UMAP plot, computed from the cell gene expression and colored by Leiden clusters. The lasso tool in the scatter plot window is used to annotate a set of cells; the annotation can be visualized in space and exported for downstream use. b. MERFISH mouse brain dataset (Allen Institute prototype MERFISH pipeline35) featuring gene expression, polygonal ROIs annotating anatomical regions and cell types assigned to single-molecule points.
Extended Data Fig. 4 Illustration of the static plotting library spatialdata-plot.
The spatialdata-plot library enables the streamlined visualization of complex multimodal data. The set of elements to be rendered (Images, Labels, Points, Shapes), as well as specific parameters for the plotted elements, can be specified by the user. For example, Shapes representing cells can be annotated by the expression level of a target gene. The plotting library automatically accounts for transformations and alignments in the underlying common coordinate system. Tutorials on how to use spatialdata-plot are available as part of the online documentation (Section ‘Visualizations’, https://spatialdata.scverse.org/en/latest/tutorials/notebooks/notebooks.html).
Extended Data Fig. 5 SpatialData facilitates the preparation of datasets for deep learning applications and it integrates with existing deep learning ecosystems.
(a) Building on the query interface, SpatialData can generate PyTorch datasets that represent tiles of the original SpatialData object. Shown is an example use case, using tiles centered on cells to train a DenseNet encoder model for supervised cell-type prediction. The specific model architecture, without weights, is provided by the MONAI framework, and this example shows how SpatialData can readily interface with existing deep learning ecosystems. (b) The definition of deep learning datasets can harness common coordinate systems to combine different spatially aligned elements. Shown are the aligned H&E image and Xenium replicate 1 datasets previously introduced in main text Fig. 2a. (c) Enlarged view of a subset of the two datasets, overlaying the cells from Xenium, colored by cell type, on the H&E image from Xenium. SpatialData can extract image tiles of the desired resolution (here 32x32 pixels) around the Xenium cells. (d) The tile extraction process takes advantage of the multiscale representation and the chunked Zarr storage for efficient memory usage. The former allows tiles to be extracted from the appropriate (downscaled) resolution level; the latter ensures that only the data chunk(s) containing the tiles are loaded from disk. Note: the 500x and 1000x downscaling factors and the size of the chunks have been chosen for illustrative purposes. (e) Visualization of cell-type labels predicted by the model. Note: because this example focuses on demonstrating the infrastructure, network training was limited to a small number of epochs and systematic hyperparameter optimization was omitted, which is reflected in the suboptimal accuracy of the predictions. The full example can be found in the online documentation (https://spatialdata.scverse.org/en/latest/tutorials/notebooks/notebooks/examples/densenet.html).
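The tile extraction step at the core of this workflow can be sketched as follows. This is a minimal NumPy illustration of cropping fixed-size tiles around cell centroids from an in-memory array (the function name and random image are hypothetical); the actual SpatialData tiling additionally resolves coordinate transformations, picks the appropriate multiscale level and loads only the relevant Zarr chunks lazily, as described in panel (d).

```python
import numpy as np

def extract_tile(image: np.ndarray, center_yx, size: int = 32) -> np.ndarray:
    """Crop a size x size tile centered on a cell centroid ((y, x) indexing).

    Assumes the tile lies fully within the image; a production version would
    pad or clip tiles that overlap the image border.
    """
    cy, cx = int(center_yx[0]), int(center_yx[1])
    half = size // 2
    return image[cy - half: cy + half, cx - half: cx + half]

# Toy 256x256 single-channel image and one hypothetical cell centroid.
rng = np.random.default_rng(0)
image = rng.random((256, 256))
tile = extract_tile(image, center_yx=(100, 120), size=32)
print(tile.shape)  # (32, 32)
```

Wrapping such per-cell crops in a PyTorch `Dataset` whose `__getitem__` returns one tile (and its cell-type label) is what allows the tiles to be fed directly into a standard training loop.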
Extended Data Fig. 6 Napari-based visualization of the Visium and the two Xenium datasets from the breast cancer study presented in main text.
(a) H&E image from the Visium dataset annotated with ROIs for anatomically relevant tissue compartments. (b) Multimodal visualization of the H&E image from the Visium data, the two immunofluorescence images associated with the Xenium data, the Visium array capture locations colored by gene expression (shown with transparency), the Xenium cells colored by cell type and the four manually annotated ROIs. (c) Visualization of the clone annotations estimated from the Visium count data. Dedicated tutorials on how to use napari-spatialdata to align different modalities via landmark-based annotation and how to manually draw regions of interest can be found in the online documentation (https://spatialdata.scverse.org/en/latest/tutorials/notebooks/notebooks/examples/alignment_using_landmarks.html, https://spatialdata.scverse.org/en/latest/tutorials/notebooks/notebooks/examples/napari_rois.html).
Extended Data Fig. 7 Example of using SpatialData to combine multiple datasets from a prostate cancer study into a common coordinate system.
Shown is a common coordinate system constructed using data from Erikson et al.36. The study comprises multiple Visium H&E and Spatial Transcriptomics36 datasets from multiple tissue samples, with partially overlapping fields of view distributed across the tissues. (a) Spatial layout of the 15 fields of view for the Visium experiments for one of the tissues. The coordinate transformations used to align the fields of view were derived with SpatialData (landmark-based alignment) by aligning each image to the global layout image available from the original publication. (b) Screenshot of the visualization of all Visium datasets for one of the tissue samples in the context of the whole-tissue coordinate system using napari-spatialdata. (c) Owing to the SpatialData multiscale image representation, napari-spatialdata allows all of the large images (15 images, ≈580 megapixels each) to be viewed and interactively explored together with the spatial gene expression. Multiple modalities can also be visualized together, for example by adding the Spatial Transcriptomics data to the view. The full example can be found at https://github.com/scverse/spatialdata-notebooks/blob/main/notebooks/paper_reproducibility/lundeberg.ipynb, and a dedicated tutorial on coordinate transformations can be found in the online documentation (https://spatialdata.scverse.org/en/latest/tutorials/notebooks/notebooks/examples/transformations.html). The layout image used in the background of panels a and b is reproduced from the original publication36 under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
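The landmark-based alignment used to build this common coordinate system amounts to fitting an affine transformation from corresponding landmark pairs. The following is a minimal least-squares sketch with NumPy on hypothetical landmark coordinates (function name and data are illustrative, not the SpatialData API); in SpatialData, the resulting transformation is stored alongside the element and composed with others to map every dataset into the shared coordinate system.

```python
import numpy as np

def fit_affine(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares 2D affine transform mapping `src` landmarks onto `dst`.

    src, dst: (n, 2) arrays of corresponding landmark coordinates, n >= 3.
    Returns a 3x3 homogeneous transformation matrix.
    """
    n = src.shape[0]
    # Homogeneous source coordinates: each row is (x, y, 1).
    A = np.hstack([src, np.ones((n, 1))])
    # Solve A @ coeffs ~= dst in the least-squares sense; coeffs is (3, 2).
    coeffs, *_ = np.linalg.lstsq(A, dst, rcond=None)
    affine = np.eye(3)
    affine[:2, :] = coeffs.T  # fill the 2x3 linear + translation part
    return affine

# Three hypothetical landmark pairs related by a translation of (10, 5).
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
dst = src + np.array([10.0, 5.0])
T = fit_affine(src, dst)
```

With at least three non-collinear landmark pairs the affine is fully determined; additional pairs are reconciled in the least-squares sense, which makes the alignment robust to small placement errors when clicking landmarks interactively.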
Supplementary information
Supplementary Information
Supplementary Tables 1–4, Figs. 1–3 and Notes 1–5.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Marconato, L., Palla, G., Yamauchi, K.A. et al. SpatialData: an open and universal data framework for spatial omics. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02212-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41592-024-02212-x