Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies

Abstract

Identifying genes that display spatial expression patterns in spatially resolved transcriptomic studies is an important first step toward characterizing the spatial transcriptomic landscape of complex tissues. Here we present a statistical method, SPARK, for identifying spatial expression patterns of genes in data generated from various spatially resolved transcriptomic techniques. SPARK directly models spatial count data through generalized linear spatial models. It relies on recently developed statistical formulas for hypothesis testing, providing effective control of type I errors and yielding high statistical power. With a computationally efficient algorithm, which is based on penalized quasi-likelihood, SPARK is also scalable to datasets with tens of thousands of genes measured on tens of thousands of samples. Analyzing four published spatially resolved transcriptomic datasets using SPARK, we show it can be up to ten times more powerful than existing methods and disclose biological discoveries that otherwise cannot be revealed by existing approaches.

Access options

Rent or Buy article

Get time limited or full article access on ReadCube.

from$8.99

All prices are NET prices.

Fig. 1: Method schematic of SPARK and simulation results.
Fig. 2: Analyzing the mouse olfactory bulb dataset (n = 260 spots).
Fig. 3: Analyzing human breast cancer data (n = 250 spots).
Fig. 4: Analyzing the mouse hypothalamus data (n = 4,975 cells).

Data availability

This study made use of four publicly available datasets. These include the mouse olfactory bulb and human breast cancer data http://www.spatialtranscriptomicsresearch.org), the MERFISH data (https://datadryad.org/stash/dataset/doi:10.5061/dryad.8t8s248) and the SeqFISH data (https://www.cell.com/cms/10.1016/j.neuron.2016.10.001/attachment/759be4dc-04a6-4a58-b6f6-9b52be2802db/mmc6.xlsx). In addition, all raw data and processed data used for analysis are also available at https://github.com/xzhoulab/SPARK.

Code availability

All source code used in our experiments have been deposited at http://www.xzlab.org/software.html.

References

  1. 1.

    Chen, K. H., Boettiger, A. N., Moffitt, J. R., Wang, S. & Zhuang, X. RNA imaging. Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348, aaa6090 (2015).

    PubMed  PubMed Central  Google Scholar 

  2. 2.

    Lubeck, E., Coskun, A. F., Zhiyentayev, T., Ahmad, M. & Cai, L. Single-cell in situ RNA profiling by sequential hybridization. Nat. Methods 11, 360–361 (2014).

    CAS  PubMed  PubMed Central  Google Scholar 

  3. 3.

    Femino, A. M., Fogarty, K., Lifshitz, L. M., Carrington, W. & Singer, R. H. Visualization of single molecules of mRNA in situ. Method Enzymol. 361, 245–304 (2003).

    CAS  Google Scholar 

  4. 4.

    Lovatt, D. et al. Transcriptome in vivo analysis (TIVA) of spatially defined single cells in live tissue. Nat. Methods 11, 190–196 (2014).

    CAS  PubMed  PubMed Central  Google Scholar 

  5. 5.

    Simone, N. L., Bonner, R. F., Gillespie, J. W., Emmert-Buck, M. R. & Liotta, L. A. Laser-capture microdissection: opening the microscopic frontier to molecular analysis. Trends Genet. 14, 272–276 (1998).

    CAS  PubMed  Google Scholar 

  6. 6.

    Junker, J. P. et al. Genome-wide RNA tomography in the zebrafish embryo. Cell 159, 662–675 (2014).

    CAS  PubMed  Google Scholar 

  7. 7.

    Stahl, P. L. et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353, 78–82 (2016).

    CAS  PubMed  Google Scholar 

  8. 8.

    Ke, R. Q. et al. In situ sequencing for RNA analysis in preserved tissue and cells. Nat. Methods 10, 857–860 (2013).

    CAS  PubMed  Google Scholar 

  9. 9.

    Lee, J. H. et al. Highly multiplexed subcellular RNA sequencing in situ. Science 343, 1360–1363 (2014).

    CAS  PubMed  PubMed Central  Google Scholar 

  10. 10.

    Crosetto, N., Bienko, M. & van Oudenaarden, A. Spatially resolved transcriptomics and beyond. Nat. Rev. Genet 16, 57–66 (2015).

    CAS  PubMed  Google Scholar 

  11. 11.

    Fan, X. et al. Spatial transcriptomic survey of human embryonic cerebral cortex by single-cell RNA-seq analysis. Cell Res. 28, 730–745 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  12. 12.

    Svensson, V., Teichmann, S. A. & Stegle, O. SpatialDE: identification of spatially variable genes. Nat. Methods 15, 343–346 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  13. 13.

    Edsgard, D., Johnsson, P. & Sandberg, R. Identification of spatial expression trends in single-cell gene expression data. Nat. Methods 15, 339–342 (2018).

    PubMed  PubMed Central  Google Scholar 

  14. 14.

    Lea, A. J., Alberts, S. C., Tung, J. & Zhou, X. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PloS Genet. 11, e1005650 (2015).

    PubMed  PubMed Central  Google Scholar 

  15. 15.

    Sun, S. Q. et al. Differential expression analysis for RNAseq using Poisson mixed models. Nucleic Acids Res. 45, e106 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  16. 16.

    Lun, A. Overcoming systematic errors caused by log-transformation of normalized single-cell RNA sequencing data. Preprint at BioRxiv https://doi.org/10.1101/404962 (2019).

  17. 17.

    Li, Y., Tang, H. C. & Lin, X. H. Spatial linear mixed models with covariate measurement errors. Stat. Sin. 19, 1077–1093 (2009).

    PubMed  PubMed Central  Google Scholar 

  18. 18.

    Ben-Ahmed, K., Bouratbine, A. & El-Aroui, M. A. Generalized linear spatial models in epidemiology: a case study of zoonotic cutaneous leishmaniasis in Tunisia. J. Appl. Stat. 37, 159–170 (2010).

    Google Scholar 

  19. 19.

    Breslow, N. E. & Lin, X. H. Bias correction in generalized linear mixed models with a single-component of dispersion. Biometrika 82, 81–91 (1995).

    Google Scholar 

  20. 20.

    Sun, S. Q. et al. Heritability estimation and differential analysis of count data with generalized linear mixed models in genomic sequencing studies. Bioinformatics 35, 487–496 (2019).

    CAS  PubMed  Google Scholar 

  21. 21.

    Liu, Y. W. et al. ACAT: a fast and powerful P value combination method for rare-variant analysis in sequencing studies. Am. J. Hum. Genet 104, 410–421 (2019).

    CAS  PubMed  PubMed Central  Google Scholar 

  22. 22.

    Pillai, N. S. & Meng, X. L. An unexpected encounter with Cauchy and Levy. Ann. Stat. 44, 2089–2097 (2016).

    Google Scholar 

  23. 23.

    Tepe, B. et al. Single-cell RNA-seq of mouse olfactory bulb reveals cellular heterogeneity and activity-dependent molecular census of adult-born neurons. Cell Rep. 25, 2689–2703 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  24. 24.

    Rouillard, A. D. et al. The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. Database 2016, baw100 (2016).

    PubMed  PubMed Central  Google Scholar 

  25. 25.

    Adan, R. A. H. et al. Rat oxytocin receptor in brain, pituitary, mammary-gland, and uterus—partial sequence and immunocytochemical localization. Endocrinology 136, 4022–4028 (1995).

    CAS  PubMed  Google Scholar 

  26. 26.

    Lever, J., Zhao, E. Y., Grewal, J., Jones, M. R. & Jones, S. J. M. CancerMine: a literature-mined resource for drivers, oncogenes and tumor suppressors in cancer. Nat. Methods 16, 505–507 (2019).

    CAS  PubMed  Google Scholar 

  27. 27.

    Moffitt, J. R. et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362, eaau5324 (2018).

    PubMed  PubMed Central  Google Scholar 

  28. 28.

    Fabio, K. et al. Synthesis and evaluation of potent and selective human V1a receptor antagonists as potential ligands for PET or SPECT imaging. Bioorgan. Med. Chem. 20, 1337–1345 (2012).

    CAS  Google Scholar 

  29. 29.

    Ozturk, A., DeKosky, S. T. & Kamboh, M. I. Genetic variation in the choline acetyltransferase (CHAT) gene may be associated with the risk of Alzheimer’s disease. Neurobiol. Aging 27, 1440–1444 (2006).

    CAS  PubMed  Google Scholar 

  30. 30.

    Kiaris, H., Schally, A. V. & Kalofoutis, A. Extrapituitary effects of the growth hormone-releasing hormone. Vitam. Horm. 70, 1–24 (2005).

    CAS  PubMed  Google Scholar 

  31. 31.

    Shah, S., Lubeck, E., Zhou, W. & Cai, L. In situ transcription profiling of single cells reveals spatial organization of cells in the mouse hippocampus. Neuron 92, 342–357 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  32. 32.

    Tasic, B. et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci. 19, 335–346 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  33. 33.

    Li, X. H., Polter, A. & Yang, S. FoxO transcription factors—regulation in brain and behavioral manifestation. Biol. Psychiat. 63, 150–159 (2008).

    Google Scholar 

  34. 34.

    Hoekman, M. F. M., Jacobs, F. M. J., Smidt, M. P. & Burbach, J. P. H. Spatial and temporal expression of FoxO transcription factors in the developing and adult murine brain. Gene Expr. Patterns 6, 134–140 (2006).

    CAS  PubMed  Google Scholar 

  35. 35.

    Cattaneo, A. et al. FoxO1, A2M, and TGF-β 1: three novel genes predicting depression in gene X environment interactions are identified using cross-species and cross-tissues transcriptomic and miRNomic analyses. Mol. Psychiatr. 23, 2192–2208 (2018).

    CAS  Google Scholar 

  36. 36.

    Shrestha, B. R. et al. Sensory neuron diversity in the inner ear is shaped by activity. Cell 174, 1229–1246 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  37. 37.

    Sun, Y. F. et al. A central role for Islet1 in sensory neuron development linking sensory and spinal gene regulatory programs. Nat. Neurosci. 11, 1283–1293 (2008).

    CAS  PubMed  PubMed Central  Google Scholar 

  38. 38.

    Wang, X. et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 361, eaat5691 (2018).

    PubMed  PubMed Central  Google Scholar 

  39. 39.

    Voss, S., Zimmermann, B. & Zimmermann, A. Detecting spatial structures in throughfall data: the effect of extent, sample size, sampling design, and variogram estimation method. J. Hydrol. 540, 527–537 (2016).

    Google Scholar 

  40. 40.

    Lark, R. M., Heuvelink, G. B. M., Bishop, T. F. A., Burgess, T. M. & Webster, R. 1980. Optimal interpolation and isarithmic mapping of soil properties. I. The semi-variogram and punctual kriging. Eur. J. Soil Sci. 31, 315–331 (2019).

    Google Scholar 

  41. 41.

    Li, H. F., Calder, C. A. & Cressie, N. Beyond Moran's I: testing for spatial dependence based on the spatial autoregressive model. Geogr. Anal. 39, 357–375 (2007).

    Google Scholar 

  42. 42.

    Radeloff, V. C., Miller, T. F., He, H. S. & Mladenoff, D. J. Periodicity in spatial data and geostatistical models: autocorrelation between patches. Ecography 23, 81–91 (2000).

    Google Scholar 

  43. 43.

    Rodriques, S. G. et al. Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019).

    CAS  PubMed  Google Scholar 

  44. 44.

    Diggle, P. J., Tawn, J. A. & Moyeed, R. A. Model-based geostatistics. J. R. Stat. Soc. Ser. C Appl. Stat. 47, 299–326 (1998).

    Google Scholar 

  45. 45.

    Christensen, O. F. & Waagepetersen, R. Bayesian prediction of spatial count data using generalized linear mixed models. Biometrics 58, 280–286 (2002).

    PubMed  Google Scholar 

  46. 46.

    Rousset, F. & Ferdy, J. B. Testing environmental and genetic effects in the presence of spatial autocorrelation. Ecography 37, 781–790 (2014).

    Google Scholar 

  47. 47.

    Vanhatalo, J., Pietilainen, V. & Vehtari, A. Approximate inference for disease mapping with sparse Gaussian processes. Stat. Med. 29, 1580–1607 (2010).

    PubMed  Google Scholar 

  48. 48.

    Lin, X. H. & Breslow, N. E. Bias correction in generalized linear mixed models with multiple components of dispersion. J. Am. Stat. Assoc. 91, 1007–1016 (1996).

    Google Scholar 

  49. 49.

    Satterthwaite, F. E. An approximate distribution of estimates of variance components. Biometrics Bull. 2, 110–114 (1946).

    CAS  Google Scholar 

  50. 50.

    Benjamini, Y. & Yekutieli, D. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29, 1165–1188 (2001).

    Google Scholar 

  51. 51.

    Yu, G. C., Wang, L. G., Han, Y. Y. & He, Q. Y. clusterProfiler: an R package for comparing biological themes among gene clusters. Omics 16, 284–287 (2012).

    CAS  PubMed  PubMed Central  Google Scholar 

  52. 52.

    Moffitt, J. R. et al. Data from: molecular, spatial and functional single-cell profiling of the hypothalamic preoptic region. Dryad Digital Repository https://doi.org/10.5061/dryad.8t8s248 (2018).

Download references

Acknowledgements

This study was supported by National Institutes of Health (NIH) grants R01HG009124 and R01GM126553, and National Science Foundation (NSF) grant DMS1712933. S.S. was supported by NIH grant R01HD088558 (PI Tung), the National Natural Science Foundation of China (grant 61902319) and the Natural Science Foundation of Shaanxi Province (grant 2019JQ127). J.Z. was supported by NIH grant U01HL137182 (PI Kang).

Author information

Affiliations

Authors

Contributions

X.Z. conceived the idea and provided funding support. S.S. and X.Z. designed the experiments. S.S. and J.Z. developed the method, implemented the software, performed simulations and analyzed real data. S.S., J.Z. and X.Z. wrote the manuscript.

Corresponding author

Correspondence to Xiang Zhou.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Peer review information Nicole Rusk and Lin Tang were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Comparison of different methods in the simulations based on mouse olfactory bulb data.

For all the simulation based on mouse olfactory bulb data, the number of spots are fixed at n=260. Each simulation replicate contains 10,000 independently simulated genes. (a)Quantile-quantile plot of the observed -log10 p-values from different methods against the expected -log10 p-values in null simulations with different noise levels. The p-values are combined across 10 simulation replicates. Simulations are performed under different nugget values that represent different noise levels: τ2=0.2, 0.35 and 0.6 (left to right). Compared methods include SPARK (pink), SpatialDE (purple), Trendsceek.E (light salmon) which is the Emark test of Trendsceek, Trendsceek.ρ (yellow-green) which is the Markcorr test of Trendsceek, Trendsceek.γ (light green) which is the Markvario test of Trendsceek, and Trendsceek.V (wheat) which is the Vmark test of Trendsceek. (b)Power comparison of different methods in alternative simulations under different noise levels. Power plots show the proportion of true positives (y-axis) detected by different methods at a range of false discovery rates (FDR; x-axis) in the alternative simulations. The proportion of true positives is averaged across 10 simulation replicates. Compared methods include SPARK (pink), SpatialDE (purple) and Trendsceek (sky-blue) which is the combined test of Trendsceek. Simulations were performed under different noise levels characterized by different nugget values: τ2=0.2, 0.35 and 0.6 (left to right) and different spatial expression patterns I-III as illustrated in the main Fig. 1c (top to bottom). (c)Power comparison of different methods in alternative simulations under different pattern strength. Power plots show the proportion of true positives (y-axis) detected by different methods at a range of false discovery rates (FDR; x-axis) in the alternative simulations. The proportion of true positives is averaged across 10 simulation replicates. Compared methods include SPARK (pink), SpatialDE (purple) and Trendsceek (sky-blue) which is the combined test of Trendsceek. Simulations were performed under different noise levels characterized by different spatial expression pattern strength: two-fold, three-fold and four-fold (left to right) and different spatial expression patterns I-III as illustrated in the main Fig. 1c (top to bottom).

Supplementary Figure 2 Power comparison of different methods in the simulations based on original Trendsceek paper.

Power plots show the proportion of true positives (y-axis) detected by different methods at a range of false discovery rates (FDR; x-axis) in the alternative simulations. The proportion of true positives is averaged across 10 simulation replicates. Each simulation replicate contains 1,000 independently simulated genes. Compared methods include SPARK (pink), SpatialDE (purple), Trendsceek (sky-blue) which is the combined test of Trendsceek. Pattern strength (SE strength) in hotspot and non-radial streak patterns is defined as the fold change between mean expression value in spiked cells and mean expression value in background cells. SE strength in linear gradient pattern is defined as the percentage of cells that display expression gradient. (a)Power comparison of different methods in the alternative simulations under different sample sizes (n=100, 200 and 500, left to right). Simulations were performed under moderate fraction of marked cells (20%) and moderate SE strength (2-fold) for the hotspot (top) and non-radial streak (bottom) patterns. (b)Power comparison of different methods in the alternative simulations under different SE strength (1.5-fold,2-fold and 2.5-fold, left to right). Simulations were performed under moderate fraction of marked cells (20%) and moderate sample size (n=200) for the hotspot (top) and non-radial streak (bottom) patterns. (c)Power comparison of different methods in the alternative simulations under different fraction of marked cells (0.1,0.2 and 0.3, left to right). Simulations were performed under moderate SE strength (2-fold) and moderate sample size (n=200) for the hotspot (top) and non-radial streak (bottom) patterns. (d)Power comparison of different methods in the alternative simulations under different sample sizes (n=100, 200 and 500, left to right) when the SE strength is extremely strong. Simulations were performed under moderate fraction of marked cells (20%) and extremely strong SE strength (5-fold) for the hotspot (top) and non-radial streak (bottom) patterns. (e)Power comparison of different methods in linear gradient simulations under different sample size and SE signal strength. Simulations were performed under different sample size: n= 100, 200 and 500 (left to right). Simulations were also performed under three SE signal strength levels (top to bottom): strong SE strength (50% cells are marked cells that display expression gradient), moderate SE strength (40%), and weak SE strength (30%).

Supplementary Figure 3 Analyzing the mouse olfactory bulb data (n=260).

(a)Quantile-quantile plot for the mouse olfactory bulb data. Quantile-quantile plot of the observed -log10 p-values from different methods against the expected -log10 p-values under the null in the real data analysis. Compared methods include SPARK (pink), SpatialDE (purple), Trendsceek.E (light salmon) which is the Emark test of Trendsceek, Trendsceek.ρ (yellow-green) which is the Markcorr test of Trendsceek, Trendsceek.γ (light green) which is the Markvario test of Trendsceek, and Trendsceek.V (wheat) which is the Vmark test of Trendsceek. (b)Boxplot displays expression level for the significant genes identified by different methods in the mouse olfactory bulb data. Results are shown for 710 genes that are detected by SPARK only (first column), 5 genes that are detected by SpatialDE only (middle column), and 62 genes that are detected by both methods. Each grey dot represents the mean expression value of one SE gene across all spots. For the SE genes identified only by SPARK, the mean gene expression values range from 0.16 to 6.46 with a median 2.04; for the SE genes identified only by SpatialDE, the mean gene expression values range from 0.01 to 5.71 with a median 0.01; and for the SE genes identified by both methods, the mean gene expression values range from 0.14 to 7.34 with a median 3.16. The gene expression level is computed on the log2 scale. (c)Visualization of significant SE genes identified by SPARK in the mouse olfactory data. Scatter plot of 772 significant SE genes identified by SPARK. We first used the UMAP (umap R package) to reduce the dimension into two dimensions. Then, we used the cell labels obtained from hierarchical agglomerative clustering to visualize the distribution of cells in each cluster. SE genes are detected based on an FDR cutoff of 0.05. (d)Spatial expression pattern for five genes that are only identified by SpatialDE in mouse olfactory bulb data. Genes are identified based on an FDR cutoff of 0.05. Color represents relative gene expression level (purple: high; green: low). Most of these genes are only expressed in one or two spatial locations and may be false signals. (e)Spatial expression patterns summarized based on the 772 SE genes that are identified by SPARK when the number of clustering is set to be three in the hierarchical agglomerative clustering. (f)Spatial expression patterns summarized based on the 772 SE genes that are identified by SPARK when the number of clustering is set to be five. The three main patterns summarized in (e) appear to be a subset of the patterns summarized in (f).

Supplementary Figure 4 Spatial expression pattern for SE genes identified only by SPARK in mouse olfactory bulb data (n=260).

SE Genes are identified based on an FDR cutoff of 0.05. Color represents relative gene expression level (purple: high; green: low). SPARK p-values are shown in parenthesis (Cauchy combination rule). (a)Spatial expression pattern for 20 pattern I genes identified only by SPARK. We ranked all 119 pattern I genes identified only by SPARK based on their p-values from the most significant to the least significant. We then obtained 20 genes with increasing p-values from ranked list. (b)Spatial expression pattern for 20 pattern II genes identified only by SPARK. We ranked all 270 pattern II genes identified only by SPARK based on their p-values from the most significant to the least significant. We then obtained 20 genes with increasing p-values from ranked list. (c)Spatial expression pattern for 20 pattern III genes identified only by SPARK. We ranked all 321 pattern III genes identified only by SPARK based on their p-values from the most significant to the least significant. We then obtained 20 genes with increasing p-values from ranked list.

Supplementary Figure 5 Validation of SE genes in the mouse olfactory blub data (n=260) identified by SPARK.

SE Genes are identified based on an FDR cutoff of 0.05. The enriched GO terms/KEGG pathways are determined based on a BH adjusted p-value cutoff = 0.05 (clusterProfiler). (a)Venn Diagram of enriched GO terms based on the SE genes identified by SPARK and SpatialDE. The SE genes identified by SPARK are enriched in 1023 GO terms, 64 of which are overlapped with SpatialDE. (b)Venn Diagram of enriched KEGG pathways based on the SE genes identified by SPARK and SpatialDE. The SE genes identified by SPARK are enriched in 79 KEGG pathways, which cover the two KEGG pathways enriched by SpatialDE identified SE genes. (c)Bubble plot shows –log10 p-values for gene sets uniquely identified by SPARK (959). P-values are from clusterProfiler analysis and the dashed line indicates a cutoff 0.05. Gene sets are colored by three categories: GO biological process (blue), GO molecular function (purple), GO cellular component (yellow). (d)Spatial expression pattern for eight genes identified by SPARK in the mouse olfactory bulb. SE Genes are identified based on an FDR cutoff of 0.05. Color represents relative gene expression level (purple: high; green: low). SPARK p-values are shown in parenthesis (Cauchy combination rule). These eight genes are previously known to be spatially expressed in the mitral cell layer. Five of these genes (Nmb, Reln, Rcan2, Shisa3 and Plcxd2) were not identified by SpatialDE. None of the eight genes were identified by Trendsceek. (e)Bubble plot shows –log10 p-values for gene sets in pattern I uniquely identified by SPARK (393) in the mouse olfactory bulb data. P-values are from clusterProfiler analysis and the dashed line indicates a cutoff 0.05. Venn Diagram of enriched GO terms based on the SE genes identified by SPARK and SpatialDE for pattern I is embedded inside the panel. Gene sets are colored by three categories: GO biological process (blue), GO molecular function (purple), GO cellular component (yellow). (f)Bubble plot shows –log10 p-values for gene sets in pattern II uniquely identified by SPARK (537) in the mouse olfactory bulb data. P-values are from clusterProfiler analysis and the dashed line indicates a cutoff 0.05. Venn Diagram of enriched GO terms based on the SE genes identified by SPARK and SpatialDE for pattern II is embedded inside the panel. Gene sets are colored by three categories: GO biological process (blue), GO molecular function (purple), GO cellular component (yellow). (g)Bubble plot shows –log10 p-values for gene sets in pattern III uniquely identified by SPARK (662) in the mouse olfactory bulb data. P-values are from clusterProfiler analysis and the dashed line indicates a cutoff 0.05. Venn Diagram of enriched GO terms based on the SE genes identified by SPARK and SpatialDE for pattern III is embedded inside the panel. Gene sets are colored by three categories: GO biological process (blue), GO molecular function (purple), GO cellular component (yellow).

Supplementary Figure 6 Analyzing the human breast cancer data (n=250).

(a)Quantile-Quantile plot for human breast cancer data. Quantile-quantile plot of the observed -log10 p-values from different methods against the expected -log10 p-values under the null in the real data analysis. p-value distribution suggests that SPARK detected more genes with spatial expression pattern than SpatialDE. In contrast, p-values from various Trendsceek tests all behave approximately like null and suggest a lack of power by Trendsceek tests. (b)Boxplot displays expression level of the significant genes identified by different methods. Results are shown for 202 genes that are detected by SPARK only (first column), 30 genes that are detected by SpatialDE only (second column), 2 genes that are detected by Trendsceek only (third column), and 10 genes that are detected by all methods. Each grey dot represents the mean expression value of one SE gene across all spots. For the SE genes identified only by SPARK, the mean gene expression values range from 0.11 to 3.53 with a median 0.85; for the SE genes identified only by SpatialDE, the mean gene expression values range from 0.01 to 0.8 with a median 0.08; for the SE genes identified only by Trendsceek, the mean gene expression values range from 0.56 to 1.56 with a median 1.06; and for the SE genes identified by all methods, the mean gene expression values range from 0.51 to 2.85 with a median 1.94. The gene expression level is computed on the log2 scale. (c)Venn Diagram of enriched GO terms based on the SE genes identified by SPARK and SpatialDE. The SE genes are detected based on an FDR cutoff of 0.05. The SE genes identified by SPARK are enriched in 542 GO terms, 266 of which are overlapped with SpatialDE. The enriched GO terms pathways are determined based on a BH adjusted p-value cutoff = 0.05 (clusterProfiler). (d)Venn Diagram of enriched KEGG pathways based on the SE genes identified by SPARK and SpatialDE. The SE genes are detected based on an FDR cutoff of 0.05. The SE genes identified by SPARK are enriched in 20 KEGG pathways, which cover the three KEGG pathways enriched by SpatialDE identified SE genes. The enriched KEGG pathways are determined based on a BH adjusted p-value cutoff = 0.05 (clusterProfiler). (e)Spatial expression pattern for ten SE genes identified by SPARK. Genes are identified based on an FDR cutoff of 0.05. Color represents relative gene expression level (purple: high; green: low). These ten genes are cancer relevant genes with known spatial localization highlighted in the original study. SPARK p-values are shown in parenthesis (Cauchy combination rule). SCGB2A2, KRT17 and MMP14 were not identified by both SpatialDE and Trendsceek. (f)Bubble plot shows –log10 p-values for gene sets uniquely identified by SPARK (351). The p-values are from clusterProfiler analysis and the dashed line indicates a cutoff 0.05. Gene sets are colored by three categories: GO biological process (blue), GO molecular function (purple), GO cellular component (yellow).

Supplementary Figure 7 Spatial expression pattern for 20 SE genes identified only by SPARK in the human breast cancer data (n=250).

Genes are identified based on an FDR cutoff of 0.05. Color represents relative gene expression level (purple: high; green: low). SPARK p-values are shown in parenthesis (Cauchy combination rule). We ranked all 202 SE genes identified only by SPARK based on their p-values from the most significant to the least significant. We then obtained 20 genes with increasing p-values from ranked list and show these genes here to provide an overview on the spatial expression pattern for genes that are only detected by SPARK.

Supplementary Figure 8 Analyzing the mouse hypothalamus data (n=4,975).

(a)Quantile-quantile plot of the observed -log10 p-values from different methods against the expected -log10 p-values under the null in the real data analysis. Compared methods include SPARK (pink), SpatialDE (purple), Trendsceek.E (light salmon) which is the Emark test of Trendsceek, Trendsceek.ρ (yellow-green) which is the Markcorr test of Trendsceek, Trendsceek.γ (light green) which is the Markvario test of Trendsceek, and Trendsceek.V (wheat) which is the Vmark test of Trendsceek. p-value distribution suggests both SPARK and SpatialDE are more powerful than various Trendsceek tests in detecting genes with spatial expression pattern. (b)Venn diagram shows the overlap of the significant genes detected by SPARK, SpatialDE and Trendsceek. SE genes are determined based on the p-value threshold that only one blank control gene is also detected. (c)Spatial distributions of five individual cell classes. Cells in each of the five cell classes are shown as colored dots (legends below), while the remaining cells are shown as gray dots. (d)Spatial distributions of cell class representative genes highlighted in the original study. Spatial expression pattern of five representative genes (Slc17a6, Pdgfra, Fn1, Selplg, Aqp4) for each of the five cells classes are shown. These five genes are identified as SE by both SPARK and SpatialDE. The p-values for the five genes from SPARK are shown inside parenthesis (Cauchy combination rule). Color represents relative gene expression level (purple: high; green: low).

Supplementary Figure 9 Analyng t mouse hippocampus data (n=131).

(a)Voronoi representative of the tissue structure in the mouse hippocampus data. Red dashed rectangle represents the area that is adjusted for border artifacts. (b)Quantile-quantile plot of the observed -log10 p-values from different methods are plotted against the expected -log10 p-values under the null in the permuted data. p-values are combined across 100 permutation replicates. Compared methods include SPARK (pink), SpatialDE (purple), Trendsceek.E (light salmon) which is the Emark test of Trendsceek, Trendsceek.ρ. (yellow-green) which is the Markcorr test of Trendsceek, Trendsceek.γ. (light green) which is the Markvario test of Trendsceek, and Trendsceek.V (wheat) which is the Vmark test of Trendsceek. Both SPARK Trendsceek test statistics produce reasonably calibrated p-values while SpatialDE yield slightly conservative p-values. (c)Quantile-quantile plot of the observed -log10 p-values from different methods against the expected -log10 p-values under the null in the real data analysis. p-value distribution suggests that SPARK detected more genes with spatial expression pattern than SpatialDE. In contrast, p-values from various Trendsceek tests all behave approximately like null and suggest a lack of power by Trendsceek tests. (d)Power plot shows the number of genes with spatial expression pattern (y-axis) identified by different methods at a range of false discovery rates (FDRs; x-axis). Across a range of FDRs, SPARK detected more genes with spatial expression pattern than SpatialDE and Trendsceek (sky-blue, which is combined test of Trendsceek). (e)Venn diagram shows the overlap of significant genes detected by SPARK, SpatialDE and Trendsceek. Genes are detected based on an FDR cutoff of 0.05. (f)Spatial expression patterns for eleven SE genes identified by both SPARK and SpatialDE. Genes are identified based on an FDR cutoff of 0.05. Color represents relative gene expression level (purple: high; green: low). SPARK p-values are shown in parenthesis (Cauchy combination rule). Among these genes, only Ctss gene was also identified by Trendsceek. (g)Spatial expression patterns for six SE genes identified only by SPARK in mouse hippocampus data. Genes are identified based on an FDR cutoff of 0.05. Color represents relative gene expression level (purple: high; green: low). SPARK p-values are shown in parenthesis (Cauchy combination rule). (h)Spatial expression patterns for three SE genes identified only by Trendsceek in mouse hippocampus data. Genes are identified based on an FDR cutoff of 0.05. Color represents relative gene expression level (purple: high; green: low).

Supplementary Figure 10 Comparison of Moran’s I test and SPARK.

The Moran’s I test was performed using the moran.test from R package spdep. Moran’s I test did not detect any significant genes in the human breast cancer data, which is thus not shown here. (a)Quantile-quantile plot of the observed -log10 p-values from different methods against the expected -log10 p-values under the null in the permuted mouse olfactory bulb data (n=260). The p-values are combined across 10 permutation replicates. (b)Venn diagram shows the overlap of the significant genes detected by SPARK and Moran’s I test in the mouse olfactory bulb data (n=260). Genes are detected based on an FDR cutoff of 0.05. (c)Venn diagram shows the overlap of the significant genes detected by SPARK and Moran’s I test in the mouse hypothalamus data (n=4,975). Genes are detected based on an FDR cutoff of 0.05. (d)Venn diagram shows the overlap of the significant genes detected by SPARK and Moran’s I test in the mouse hippocampus data (n=131). Genes are detected based on an FDR cutoff of 0.05.

Supplementary Figure 11 Comparison of Poisson version SPARK and Gaussian version SPARK.

(a)Mean and variance plots for the four real datasets: mouse olfactory bulb data (n=260), human breast cancer data (n=250), mouse hypothalamus data (n=4,975) and mouse hippocampus data (n=131). Each dot represents a gene, whose mean (x-axis) and variance (y-axis) were calculated. Loess curve is fitted for each dataset (red line). A clear mean-variance dependency pattern, with variance being larger than the mean, is clearly visualizable for all four data sets, regardless whether they are sequencing based or smFISH based. In each panel, the diagonal dashed line is where the variance equals the mean. (b)Quantile-quantile plot for p-values obtained from the Gaussian version of SPARK. Quantile-quantile plot of the observed -log10 p-values from gaussian version of SPARK are plotted against the expected -log10 p-values under the null either for the real data (steel blue) or for the permuted data (light pink). Results are shown for the mouse olfactory bulb data (n=260), the human breast cancer data (n=250), the mouse hypothalamus data (n=4,975); and the mouse hippocampus data (n=131), from left to right. The p-values are obtained through Cauchy combination rule. (c)Venn diagram shows the overlap of significant genes detected by (the Poisson version of) SPARK and the Gaussian version of SPARK in all four real data sets. Genes are detected based on an FDR cutoff of 0.05. Results are shown for the mouse olfactory bulb data (n=260), the human breast cancer data (n=250), the mouse hypothalamus data (n=4,975); and the mouse hippocampus data (n=131), from left to right.

Supplementary Figure 12 Computational time of different methods for analyzing data with different sample sizes in the simulations.

Plot shows computational time in minutes (y-axis; on log10-scale) for analyzing 100 genes with different sample sizes (x-axis) for different methods. Compared methods include SpatialDE (purple), Trendsceek (blue), SPARK (pink), SPARK with 10 threads (dotted pink line). Computation are carried out using Intel Xeon E5-2683 2.00GHz processors. Both SPARK and SpatialDE are much more computationally efficient than Trendsceek.

Supplementary Figure 13 Comparison of aggregated p-values and single kernel p-values.

(a)The observed -log10 p-values from each kernel of SPARK are plotted against the expected -log10 p-values under the null for the mouse olfactory bulb data (n=260). (b)The observed -log10 p-values from each kernel of SPARK are plotted against the expected -log10 p-values under the null for the human breast cancer data (n=250). (c)The observed -log10 p-values from each kernel of SPARK are plotted against the expected -log10 p-values under the null for the mouse hypothalamus data (n=4,975). (d)The observed -log10 p-values from each kernel of SPARK are plotted against the expected -log10 p-values under the null for the mouse hippocampus data (n=131). (e)Heatmaps illustrates five Gaussian kernels and five cosine kernels used in the mouse olfactory bulb data (n=260). Five Gaussian kernels (GSP1-GSP5) are used to capture a range of possible focal expression correlation patterns while five cosine kernels (COS1-COS5) are used to capture a range of periodic expression patterns. The detailed construction of these kernels is described in Supplementary Notes.

Supplementary information

Supplementary Information

Supplementary Figs. 1–13, Supplementary Table 3, Supplementary Notes and Supplementary Results.

Reporting Summary

Supplementary Table 1

Enrichment analysis of SE genes for the mouse olfactory bulb data.

Supplementary Table 2

Enrichment analysis of SE genes for the human breast cancer data.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Sun, S., Zhu, J. & Zhou, X. Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nat Methods 17, 193–200 (2020). https://doi.org/10.1038/s41592-019-0701-7

Download citation

Further reading

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing