Modeling Spatial Correlation of Transcripts with Application to Developing Pancreas

Recently high-throughput image-based transcriptomic methods were developed and enabled researchers to spatially resolve gene expression variation at the molecular level for the first time. In this work, we develop a general analysis tool to quantitatively study the spatial correlations of gene expression in fixed tissue sections. As an illustration, we analyze the spatial distribution of single mRNA molecules measured by in situ sequencing on human fetal pancreas at three developmental time points–80, 87 and 117 days post-fertilization. We develop a density profile-based method to capture the spatial relationship between gene expression and other morphological features of the tissue sample such as position of nuclei and endocrine cells of the pancreas. In addition, we build a statistical model to characterize correlations in the spatial distribution of the expression level among different genes. This model enables us to infer the inhibitory and clustering effects throughout different time points. Our analysis framework is applicable to a wide variety of spatially-resolved transcriptomic data to derive biological insights.


Supplementary Notes 1 In Situ RNA Sequencing Data of Pancreas
To demonstrate the use of our computational analysis tool, we study in situ RNA sequencing of human fetal pancreatic tissue. The study was performed on tissues from de-identified donors with informed consent, and was approved by the ethics committee of the Stanford University Institutional Review Board (IRB). All methods were performed in accordance with the relevant guidelines and regulations. The data are collected at three developmental ages -80, 87 and 117 days post-fertilization.
Experimental procedure. Anonymized fetal pancreas tissues were OCTembedded and immediately frozen in liquid nitrogen upon arrival and stored at -80 • C. Before tissue staining, blocks were cut in 10 µm-thick sections which were collected on superfrost plus slides and stored at 80 • C.
All the solution used for tissue staining were prepared using DEPC-treated or RNase-free reagents. Slides were let air dry before fixation in 4% (w/v) paraformaldehyde in PBS for 45 min at room temperature. After fixation sections were washed in PBS for 5 min at room temperature and permeabilized with 0.01% (w/v) pepsin in 0.1 M HCl for 5 min at 37 • C followed by wash in PBS (5 min room temperature) and a wash in EtOH 70, 85 and 100% for one min each at room temperature.
Slides were let air dry before mounting a silicon hybridization chamber (Secure-seal, SIGMA) of different size depending on the size of the tissue section. Tissues were washed once with PBST (1x PBS, 0.05% tween-20) at room temperature before incubation in 1x RT buffer (ThermoFisher) until reverse transcription.
In situ reverse transcription was carried out adding all the cDNA primers together (Supplementary Table 7) at a final concentration of 0.3 µM each, 1x RT buffer (ThermoFisher), 0.2 µg/µl BSA (NEB), 0.5 mM dNTPs (ThermoFisher), 1 U/µl RiboLock RNase Inhibitor (ThermoFisher) and 10 U/µl Revert AID H Minus Reverse Transcriptase (ThermoFisher) and incubating the tissue at 37 • C overnight. After reverse transcription, slides were washed twice in PBST and post-fixation was done with 4% (w/v) paraformaldehyde in PBS for 10 min at room temperature followed by two washes in PBST 5 min at room temperature and incubation in 1x Ampligase buffer.
Slides were washed twice in PBST before removing the hybridization chamber and washed in EtOH 70, 85 and 100% for 5 min each at room temperature. Stained sections were air-dried and mounted with SlowFade Antifade Gold mountant (ThermoFisher) and a coverslip before imaging.
Following staining cycles were done as follow: slides were dip in 70% EtOH until the coverslip fell off. Section were then washed in 85 and 100% EtOH 5 min each and air dried. UNG treatment was done adding 1x UNG buffer (ThermoFisher), 0.2 µg/µl BSA (NEB) and 0.05 U/µl UNG (ThermoFisher) for 30 min at 37 • C. Uracil cleavage was followed by was in pre-warmed 65% formamide in water for 5 min at 55 • C and two washes in PBST. Second and third staining cycles and nuclear staining were performed as described above.
Image acquisition and analysis. Imaging was carried out on a Zeiss Axioplan epifluorescence microscope equipped with an Axiocam 506 mono camera (Zeiss)and filter-cubes for DAPI, FITC, Cy3 and Cy5. Each sample was imaged after every round of hybridization with a 20x/0.8 Plan-Apochromat objective (Zeiss) for a total of three rounds. Multiple fields of view were acquired with 10% overlapping and 10-15 z-stacks (about 0.5 µm step size). The resulting images were projected (maximum intensity projection, MIP) and automatically stitched using DAPI staining. MIP images were shade corrected using the automatic function in Zen Acquisition software (Zeiss) feeding a minimum intensity projection. Images were exported as 16bit grayscale images. Background subtraction was done using ImageJ software by measuring the fluorescence intensity outside the signals and removing it from the corresponding image.
Image registration was done in ImageJ as follow: images from the same round of hybridization, except for the DAPI staining, were combined (MIP). A first registration between images from different rounds of hybridization was done using the nuclei staining (DAPI images). A second registration was done using the MIPs from the other channels. Image alignment was done using MultiStackReg plugin in ImageJ using one round of hybridization (typically the first) as a reference and aligning the other rounds to the first. For every alignment a rigid transformation was applied, and a transformation matrix was saved and applied to the rest of the images.
Pre-aligned and background-corrected images were analyzed with CellProfiler 2.1.1 (rev 6c2d896) to identify nuclei,cellsand fluorescent spots (RCA). The intensity and position of RCA products were measured using the same pipeline as in Mignardi et al. (2015) [1]. The barcode decoding was obtained using the same MATLAB script as described before (Ke et al., 2013) [2] and a quality threshold was applied to the detected barcodes as described above. All the raw images, CellProfiler pipelines and oligonucleotide sequences are accessible in Supplementary Table. 7, Supplementary Table 8 and Supplementary Table  9.

Algorithm for Identifying of Endocrine Islets
The endocrine islets are identified according to Algorithm 1. In this analysis, we set number of possible radius m = 10, maximum possible islet radius r max = 110 µm, minimum possible islet radius r min = 10 µm which is around the averaged nuclei spacing.

Algorithm 1 Identification of Endocrine Islets
Input: Nuclei positions (x i , y i ), total number of SST, GLUC and INS transcripts in ith cell n i (i = 1, .., N ), maximum possible radius r max , minimum possible radius r min and number of radius steps m. Initiate islets region set C = ∅ and exocrine nuclei set P exo = {i |1 i N, n i 1} for t = 0 to m do Islet radius r = (1 − t/m)r max + (t/m)r min . Get the number of neighboring nuclei n (neighbor) i within a radius r for ith nucleus, i ∈ P exo . The temporary islets set P t = {i |i ∈ P exo , n , remove i * from P t circle c * has center (x i * , y i * ) and radius r. if c * do not intersect with C then Add c * to islets set C.
We then identified clusters of endocrine cells for all three samples. Within each sample we identified clusters of different size and the distribution of cluster size for each sample is plotted in Supplementary Figure 2. The total number of reads inside identified endocrine regions is given in Supplementary Table 5.

Density Profile-based Analysis
The density profile-based analysis is carried out to capture the relation between transcripts and other morphological features of the tissue such as nuclei's position or developing pancreatic islets.
The density profiles are calculated based on kernel density estimation. Given the characteristic distance of each transcript x i (i = 0, ..., n), the density ρ at distance x is estimated as where n is the transcripts number, K is a kernel with bandwidth w. For example, when analyzing the relation between gene expression and endocrine regions, the characteristic distance x could be the distance between the transcript and the boundary of its closest endocrine islet. For a good approximation at short distance, we use a 1D kernel with linear combination correction [3]. The difference between two density profiles ρ 1 (x) and ρ 2 (x) is characterized by symmetric Kullback-Leibler (KL) divergence which averages the KL divergence from ρ 1 to ρ 2 and the divergence from ρ 2 to ρ 1 .

Islet shape analysis.
In the experiment, the boundaries for endocrine islets can be nicely approximated by circles and the model characterizes the spatial correlations in the circular areas. We note that the statistical model does not rely on particular boundary shape and can be readily applied to non-circular cases. To illustrate the model's performance when fitted on non-circular regions, we repeat the spatial correlation analysis on pancreatic islets approximated by squared shapes. The endocrine islets are identified as squares, following Algorithm 1 with the radius of circles replaced by the side length of squares. Supplementary Figure 5 shows that the squared boundaries do not capture the shapes of islets well and tend to include more exocrine pancreas compared to circular boundaries. We again fit the statistical model and summarize the typical results in Supplementary Table 2, in correspondence with Table 2 in the main text -most spatial correlations among genes are fitted to be close to 1 and three pairs of genes with positive spatial correlation are identified.

Evaluation on synthetic data
In this section, we evaluate our statistical model on synthetic datasets and demonstrate its power in characterizing the spatial distribution of the expression level among different genes.
Other methods. The computational analysis on the gene-gene spatial correlations has not been addressed in literature. Recent methods such as SpatialIDE identifies the spatial variation of individual genes on their own and the gene-gene spatial correlations are not discussed [4].
We propose two other methods as comparison -a baseline model with preliminary statistics and a pairwise Strauss process model.
In the baseline, the spatial correlation between type i and type j transcripts is defined as where ρ i (j) denotes the mean density of type j transcripts in the neighborhood of type i within radius r and ρ(j) is the mean density of type j transcripts in the whole analyzed region. The case γ ij,baseline > 1 indicates that type j transcripts are more likely to appear near type i, thus a clustering effect is expected. Similarly, 0 < γ ij,baseline < 1 represents an inhibition effect and γ ij,baseline = 1 corresponds to an independent relation. We further compare our multitype Strauss model with a pairwise Strauss model. The spatial correlation between type i and j are modeled with Strauss process for gene i and j only, ignoring the effects of other genes.
Synthetic datasets. We generate two toy spatial transcriptome datasets I and II with known gene-gene spatial correlations, as shown in Supplementary Figure 5. Here the interaction radius is set to be 1 unit and the analyzed region is a 20 units × 20 units square. The scale unit can be related to different length measure. Taking the pancreas transcriptomic dataset as a example, 1 unit can correspond to 20µm.
In dataset I, three genes A, B and C are correlated following Supplementary  Figure 5a. There is a clustering effect between pairs (A, B) and (B, C), while the generation of gene B and C are independent. In the experiment, gene A is randomly sampled from a Poisson process on the squared region. Gene B and C are randomly generated in the neighborhood of gene A following Poisson process independently. The intensities for A, B and C are 0.1, 0.09 and 0.08, respectively. An example of generated spatial transcriptomic data are plotted in Supplementary Figure 5b.
In dataset II, two more genes D and E are added, as shown in Supplementary  Figure 5c. There is a clustering effect between gene D and E, which are both independent on the rest three genes. Genes A, B and C are generated following the same procedure as in dataset I. We sample gene D from a Poisson process on the whole region with intensity 0.1 and sample gene E from a Poisson process in the neighborhood of gene D with intensity 0.09, as illustrated in Supplementary  Figure 5d.
Performance. Our model successfully captures the spatial correlations among genes and significantly outperforms the other two methods. The performance of fitting the baseline method, pairwise Strauss model and our proposed model on dataset I and II are given in Supplementary Table 3 and Supplementary Table  4, respectively.
Our statistical model is able to distinguish between spatial correlation and spatial co-occurrence, while the other two methods cannot. In the synthetic dataset I, gene B and C are independently generated, but they both correlate with gene A. Thus the two independent genes are also more likely to appear close to each other because of their common neighbor gene A. As shown in Supplementary Table 3, our statistical model correctly captures their independence as γ ij = 1.00 ± 0.02. However, the other two methods directly contribute the co-occurrence of B and C to their spatial correlation and yield γ ij > 1 indicating a clustering effect.
When more types of transcripts are considered in synthetic dataset II, our statistical model continues to learn the correct correlations, as indicated by Supplementary Table 4. The independence between the two groups (A, B, C) and (D, E) is characterized by γ ij = 1 with gene i and j from each group. The other two methods still mistakenly concludes that there is a clustering effect between the two independent genes B and C.

Full Results for Gene Correlation
In the experiment, the spatial correlation γ ij among genes within endocrine islets are fitted with our statistical model. The full results are summarized in Supplementary Table 6

Supplementary Tables
Supplementary Table 1 Table 3: Spatial correlation γ ij (mean ± std) for synthetic dataset I, fitted with three methods -the preliminary statistics baseline, the pairwise Strauss process model and our proposed statistical model. The true spatial correlationγ ij equals to 1 when gene i and j are independent and is larger than 1 when there is a clustering effect. We randomly generate dataset I for 20 times and average the results.
Supplementary Table 4: Spatial correlation γ ij (mean ± std) for synthetic dataset II with three methods. The true spatial correlationγ ij equals to 1 when gene i and j are independent and is larger than 1 when there is a clustering effect. We randomly generate dataset II for 20 times and average the results.