SpotClean adjusts for spot swapping in spatial transcriptomics data

Spatial transcriptomics is a powerful and widely used approach for profiling the gene expression landscape across a tissue, with emerging applications in molecular medicine and tumor diagnostics. Recent spatial transcriptomics experiments utilize slides containing thousands of spots with spot-specific barcodes that bind RNA. Ideally, unique molecular identifiers (UMIs) at a spot measure spot-specific expression, but this is often not the case in practice because of bleed from nearby spots, an artifact we refer to as spot swapping. To improve the power and precision of downstream analyses in spatial transcriptomics experiments, we propose SpotClean, a probabilistic model that adjusts for spot swapping to provide more accurate estimates of gene-specific UMI counts. SpotClean provides substantial improvements in marker gene analyses and in clustering, especially when tissue regions are not easily separated. As demonstrated in multiple studies of cancer, SpotClean improves tumor versus normal tissue delineation and tumor burden estimation, thus increasing the potential for clinical and diagnostic applications of spatial transcriptomics technologies.
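To make the notion of spot swapping concrete, the following is a toy sketch in Python and not the SpotClean model itself. It assumes a single row of spots, a hypothetical `bleed_rate` (the fraction of a spot's UMIs that leave it), and a Gaussian distance kernel with a hypothetical `bandwidth`; none of these names or values come from the text above.

```python
import math


def bleed_expected_counts(true_counts, bleed_rate=0.3, bandwidth=1.0):
    """Expected observed UMI counts on a 1-D row of spots under a toy bleed model.

    Assumption (illustrative only): each spot keeps (1 - bleed_rate) of its
    UMIs, and the remainder is spread over all spots, itself included, in
    proportion to a Gaussian kernel of the distance between spots.
    """
    n = len(true_counts)
    observed = [0.0] * n
    for src, count in enumerate(true_counts):
        # Gaussian weights from the source spot to every destination spot.
        weights = [
            math.exp(-((src - dst) ** 2) / (2 * bandwidth ** 2))
            for dst in range(n)
        ]
        total = sum(weights)
        # UMIs that stay at the source spot.
        observed[src] += (1 - bleed_rate) * count
        # UMIs that bleed out, distributed by normalized kernel weight.
        for dst in range(n):
            observed[dst] += bleed_rate * count * weights[dst] / total
    return observed


# One expressing spot surrounded by spots with no true expression.
true_counts = [0, 0, 100, 0, 0]
observed = bleed_expected_counts(true_counts)
```

In this toy setting the total count is preserved, but neighboring spots with no true expression receive a nonzero expected count, which is the kind of contamination that a decontamination method aims to reverse.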


nature research | reporting summary
April 2020

Data analysis

Data


Life sciences study design
The Seurat pipeline was applied to the case study datasets for variable gene selection, scaling, dimension reduction, clustering, and UMAP visualization. Differential expression (DE) analyses were conducted using two-sample two-sided t-tests. P-values were adjusted using the Benjamini-Hochberg correction.
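The Benjamini-Hochberg adjustment mentioned above can be sketched in a few lines of Python. This is a minimal illustration that takes raw p-values as given (the t-tests themselves are not reproduced here); the function name and the example p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values, one per gene.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
```

Each adjusted value is the raw p-value scaled by the number of tests over its rank, then capped so that adjusted values never decrease as raw p-values increase.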
Cell type compositions were estimated using SPOTlight. Spearman correlation was used to evaluate the similarity between spots and single cells. Spatial clustering was conducted using BayesSpace. SpotClean was benchmarked against the single-cell RNA-seq decontamination methods SoupX and DecontX.
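As an illustration of the Spearman correlation used to compare spots with single cells, the sketch below ranks two expression profiles (averaging ranks over ties) and computes the Pearson correlation of the ranks. The function names and the two example profiles are hypothetical; in practice a library routine would normally be used.

```python
import math


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression of five genes in one spot and one single cell.
spot_profile = [5, 0, 12, 3, 8]
cell_profile = [50, 2, 90, 10, 40]
rho = spearman(spot_profile, cell_profile)
```

Because only ranks enter the calculation, the measure is insensitive to differences in sequencing depth or scale between spot-level and cell-level counts, as long as the gene ordering is similar.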
Raw sequence data for the 3 human-mouse chimeric experiments are available at GEO (accession number: GSE178221). Links to 16 public spatial transcriptomics datasets are available in Supplementary Table 6. The human breast cancer single-cell RNA-seq data from Chung et al.2313 is available at GEO (accession number: GSE75688). The human colorectal cancer single-cell RNA-seq data from Li et al.2416 is available at GEO (accession number: GSE81861). Additional datasets used to investigate permeabilization times are available at GEO (accession numbers: GSE169749, GSE178361, GSE188888, GSE190595, and GSE193460). Processed data for reproducing results in our studies are available at Zenodo27. The GRCh38+mm10 reference genome is available at 10x Genomics (refdata-gex-GRCh38-and-mm10-2020-A).
No sample-size calculation was performed. Our study identifies and provides an approach to correct for a general technical artifact (spot swapping) in spatial transcriptomics experiments. We generated and then analyzed three chimeric samples to show the existence and quantify the extent of spot swapping in the experiments. Spot swapping was detected in all of our chimeric samples as well as many other publicly available datasets.
No data were excluded from the analyses.
We validated the technical artifact in 10x Visium spatial transcriptomics experiments using our in-house samples, and detected the artifact in 16 public datasets across multiple platforms. The effect is ubiquitous. Any existing or new data coming from these protocols can be used to validate our finding following similar analyses.
Randomization was not relevant to our study. Our study does not involve any comparison of experimental groups under different treatments.
Blinding was not relevant to our study. Our study does not involve any comparison of experimental groups under different treatments.