Date: 2023-02-27

Spectral template matching for scRNA-seq signal manipulation

We developed scPrisma, a spectral analysis framework that uses topological priors over underlying signals in single-cell data to allow for their inference, enhancement and filtering. The core of scPrisma is spectral template matching between the spectrum (the eigendecomposition of the covariance matrix) of a set of single-cell data (for example, scRNA-seq) and the expected analytical spectrum of the structure or process we aim to enhance or filter. To derive a theoretical covariance spectrum (its eigenvalues and eigenvectors), we need a reference model. Focusing first on cyclic signals, we propose a simple toy model of periodic biological signals (Methods). The covariance matrix of the gene expression matrix of this model is a circulant matrix of a special form that depends on the model parameters (Methods). Circulant matrices have closed-form expressions for their eigenvectors and eigenvalues12 (Fig. 1–1), which we used to estimate the ordering of cells along the cyclic topology. This was done by optimizing for a permutation matrix that maximizes the projection of the data onto the theoretical spectrum (Methods; Fig. 1–2). As input for the remainder of scPrisma’s workflow, cellular ordering can also be informed by prior knowledge of low-resolution pseudotime. Based on the reconstructed ordering, scPrisma infers topologically informative genes as the set of genes that maximizes the projection onto the theoretical spectrum (Methods; Fig. 1–3A). scPrisma can then enhance (filter) the signal related to the cyclic process by filtering out gene expression entries that do not maximize (minimize) the projection onto the theoretical spectrum (Methods; Fig. 1–4A and 3B). scPrisma can successfully reconstruct, filter and enhance periodic signals in simulated single-cell data (Supplementary Note A.1, Extended Data Fig. 1 and Supplementary Fig. 1).
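
The spectral idea described above can be illustrated with a minimal Python sketch. It is not scPrisma's implementation: the exponentially decaying kernel in `circulant_covariance`, the score in `projection_score` and the toy cosine data are all assumptions made for illustration. The sketch checks that the eigenvalues of a symmetric circulant matrix are given by the discrete Fourier transform of its first row, and that the projection of toy cyclic data onto the leading non-constant eigenvectors is larger for correctly ordered cells than for shuffled cells.

```python
import numpy as np

def circulant_covariance(n_cells, decay=0.9):
    """Toy circulant covariance (assumed kernel): entry (i, j) depends
    only on the circular distance between cells i and j."""
    idx = np.arange(n_cells)
    lag = np.abs(idx[:, None] - idx[None, :])
    circ_dist = np.minimum(lag, n_cells - lag)
    return decay ** circ_dist

def projection_score(X, eigvecs, eigvals):
    """Illustrative score: eigenvalue-weighted squared projection of the
    cells-by-genes matrix X onto the template eigenvectors."""
    proj = eigvecs.T @ X  # shape (n_modes, n_genes)
    return float(np.sum(eigvals[:, None] * proj ** 2))

rng = np.random.default_rng(0)
n_cells, n_genes = 64, 20
C = circulant_covariance(n_cells)

# Numerical spectrum, sorted from largest to smallest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Closed form: eigenvalues of a symmetric circulant matrix are the DFT
# of its first row (real-valued because the row is symmetric).
eigvals_fft = np.sort(np.fft.fft(C[0]).real)[::-1]

# Toy cyclic data: each gene is a cosine of the cell angle with a random phase.
theta = 2 * np.pi * np.arange(n_cells) / n_cells
phases = rng.uniform(0, 2 * np.pi, n_genes)
X_ordered = np.cos(theta[:, None] - phases[None, :])
X_ordered = X_ordered + 0.1 * rng.normal(size=(n_cells, n_genes))
X_shuffled = X_ordered[rng.permutation(n_cells)]

# Skip the constant mode (index 0) and use the first harmonic pair.
modes = slice(1, 3)
score_ordered = projection_score(X_ordered, eigvecs[:, modes], eigvals[modes])
score_shuffled = projection_score(X_shuffled, eigvecs[:, modes], eigvals[modes])
print(score_ordered, score_shuffled)
```

In this toy setting, searching over cell permutations for the one that maximizes such a score plays the role of the reconstruction step; the actual optimization used by scPrisma is described in Methods.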

scPrisma manipulates the cell cycle signal in HeLa cells

We first tested our approach on a scRNA-seq dataset of HeLa cells, unsynchronized across the cell cycle13 (Fig. 2). To assess the results, we used a list of approximately 400 genes, classified according to cell cycle phases13, where for each phase, the corresponding genes were summed and normalized and their circular mean and variance were calculated14 (Supplementary Note A.4). For an ordered reconstructed signal, the ordering of the circular means should correspond to the cell cycle phases and the circular variance should be less than 1, while for randomly ordered data, the circular variance is expected to be close to 1 (corresponding to a uniform signal along the cycle). Following standard preprocessing of the HeLa single-cell data (Methods; Supplementary Note A.4), the distributions of all phases were found to be nearly uniform (Fig. 2b; mean circular variance = 0.991). However, after cyclic ordering by scPrisma (Methods), different phases of the cell cycle became clearly separated (Fig. 2b; mean circular variance = 0.849) and peaked progressively according to the correct phase ordering.
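
The circular statistics used in this evaluation can be sketched as follows, using the standard definitions (circular mean as the angle of the weighted resultant vector, circular variance as one minus its length). The exact normalization applied in the paper is described in Supplementary Note A.4 and may differ; the function name `circular_stats` and the toy signals are assumptions made for illustration.

```python
import numpy as np

def circular_stats(weights):
    """Weighted circular mean and variance of a signal defined over cells
    that are placed at equally spaced angles along a cycle."""
    weights = np.asarray(weights, dtype=float)
    n = weights.size
    angles = 2 * np.pi * np.arange(n) / n
    # Weighted resultant vector in the complex plane.
    z = np.sum(weights * np.exp(1j * angles)) / np.sum(weights)
    mean_angle = np.angle(z) % (2 * np.pi)
    variance = 1.0 - np.abs(z)  # 0: fully concentrated, 1: uniform
    return mean_angle, variance

n = 200
angles = 2 * np.pi * np.arange(n) / n
uniform = np.ones(n)                             # no phase preference
peaked = np.exp(4 * np.cos(angles - np.pi / 2))  # bump centered at pi/2

_, var_uniform = circular_stats(uniform)
mean_peaked, var_peaked = circular_stats(peaked)
print(var_uniform, mean_peaked, var_peaked)
```

A uniform signal gives a circular variance of essentially 1, whereas a signal concentrated around one position gives a much smaller value, which is the behavior the evaluation above relies on.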

Fig. 2: scPrisma manipulates the cell cycle signal in HeLa cells. a, PCA representation of 683 cells in the unordered and ordered raw gene expression data, and data following spectral cyclic enhancement and filtering. The cells are colored according to the corresponding rows in the gene expression matrix. b, Smoothed polar plot of the normalized sum of the gene sets corresponding to different cell cycle phases13. The circular mean of each phase is marked by a correspondingly colored star. c,d, Expression of CDC20 (which peaks in the M phase33) and RRM2 (which peaks in the S phase33) as a function of cellular location in the gene expression matrix. e, PCA representation of cells along iterations of the cyclic enhancement algorithm. f, Violin plot of AUC scores for n = 50 experiments of sampling random gene subsets and inference of cyclic genes by scPrisma based on unordered, ordered, enhanced and filtered data. White dots mark the median, gray bars mark the interquartile range and thin gray lines mark the rest of the distribution, except for outliers.

To further enhance the cell cycle signal in the data, we employed scPrisma’s cyclic enhancement algorithm. Iterations of our algorithm gradually revealed a cyclic signal, which was apparent after filtering less than 7% of the total signal and was clearly revealed after removing 22% of the total signal (Fig. 2e). In addition, the reconstructed angular ordering was correlated with the cell cycle phases and the circular variance of phase-corresponding marker genes decreased substantially (Fig. 2b; mean circular variance = 0.666). Further, genes associated with different cell cycle phases peaked progressively in their expected order (Methods; Fig. 2c,d and Supplementary Fig. 2).

Next, we approached the reverse challenge: to filter out the cell cycle signal from the HeLa cells data. After applying scPrisma’s cyclic filtering algorithm to the ordered data, the separation between different cell cycle phases and their corresponding progressive peaks was lost (Fig. 2b; mean circular variance = 0.989).

Finally, we identified genes related to the cell cycle using the gene inference algorithm (Methods; Supplementary Note A.4). When using as input both randomly selected subsets of cell cycle-related genes13 and subsets of genes unrelated to the cell cycle, we found a substantial improvement in our ability to identify cell cycle-related genes after reordering the cells (mean AUC = 0.804) relative to the original data (mean AUC = 0.295). Moreover, relative to the ordered data, identification of cell cycle-related genes was further improved following spectral cyclic enhancement (mean AUC = 0.838, outperforming available baselines; Supplementary Fig. 2) and was diminished following cyclic filtering (mean AUC = 0.183; Fig. 2f).

scPrisma disentangles spatiotemporal signals in the liver

We next dissected spatiotemporal signals using an scRNA-seq dataset that captures gene expression variation of hepatocytes in the mammalian liver across both space (spatial zonation across the periportal to pericentral axis) and time (temporal variation across the circadian rhythm)15. Similarly to the cell cycle, the circadian rhythm was also expected to exhibit a cyclic structure in gene expression space. Here we used experimental prior knowledge in the form of low-resolution sampling time15 to order the cells in a cycle. In this setting, we still lacked information about temporally informative genes, and spatiotemporal information was still entangled (for example, Pck1 varied informatively across both space and time of day15). We leveraged scPrisma to disentangle these data and showed clear enhancement and filtering of the circadian rhythm relative to raw data (adjusted Rand index (ARI) of KMeans with K = 4: 0.96 after enhancement, 0.013 after filtering and 0.11 for raw data; Fig. 3a,e and Supplementary Fig. 3A; Methods). scPrisma outperformed available baselines for filtering the circadian rhythm (Supplementary Note A.5). These results were reflected in the behavior of individual genes; Pck1 is a rhythmic gene that was highly expressed at ZT06 and ZT12 (ref. 15), and indeed, following spectral cyclic enhancement, its resulting expression at ZT00 and ZT18 was diminished, while cyclic filtering resulted in a nearly constant temporal expression of Pck1 (Fig. 3c). Spatially, Pck1 is periportally zonated, and indeed, spectral cyclic enhancement flattened its spatial expression, while cyclic filtering retained its spatial variation (Fig. 3c). Similar behavior can be observed for additional spatiotemporally informative genes (Fig. 3d and Supplementary Fig. 4).

Fig. 3: Disentanglement of spatial and temporal signals in liver lobules. a,b, PCA representation of raw single-cell data of 4,000 cells (1,000 from each time point), and following spectral cyclic enhancement, cyclic filtering, linear enhancement and linear filtering. The cells are colored either according to their associated time points (ZT, sampled at four equally spaced time points along the circadian rhythm) (a), or by their respective spatial location (‘layer’, according to the zonation analysis in ref. 15) (b). c,d, Heatmaps of the expression of Pck1 (c) and Oat (d) following scPrisma analysis as a function of the sampling time and zonation layer. e, ARI of KMeans clustering with K = 4 (corresponding to four underlying time points) and K = 8 (corresponding to eight underlying zonation layers) after applying each of scPrisma’s spectral algorithms.

In a complementary manner, spectral analysis focusing on the characteristics of linear signals can be used to filter the spatial linear signal in the collective gene expression of hepatocytes (Methods). Indeed, following spectral linear filtering by scPrisma, the cyclic circadian signal was clearly revealed (Fig. 3a and Supplementary Fig. 3A) and the linear zonation signal was blurred out relative to the raw data (Fig. 3b and Supplementary Fig. 3B,E; ARI of KMeans with K = 8: 0.017 after filtering and 0.11 for raw data). Focusing again on Pck1 expression, linear spectral enhancement retained only the expression around the portal vein and reduced the temporal variance, while linear filtering reduced the zonation variance and yet retained the temporal variance (Fig. 3c).

Finally, we evaluated the dominant structure in the data following scPrisma analysis. As predicted, following spectral enhancement, the data clustered according to the enhanced signal, while following spectral filtering, the data clustered according to the unfiltered signal; for example, filtering the cyclic signal led to better identification and clustering of the spatial zonation signal (Fig. 3e).

scPrisma manipulates the diurnal cycle in Chlamydomonas

To demonstrate the use of scPrisma for more complex systems with diverse prior knowledge, we next turn our attention to scRNA-seq data collected for Chlamydomonas (green algae), grown under two contrasting conditions, iron replete (Fe+) and iron deficient (Fe−)16. In both conditions, an expression signal that reflects the 24 h diurnal cycle was previously detected16. To evaluate progression of cells along the diurnal cycle, we used marker genes corresponding to different cycle phases obtained from bulk RNA-sequencing17 (Supplementary Note A.6). Using scPrisma’s cyclic enhancement resulted in robust reconstruction of the diurnal cycle for each of the two conditions. This was done by splitting the 24 h cycle into six phases and validating that they are well separated following reconstruction and enhancement (Methods; Fig. 4a for Fe+ condition and Supplementary Fig. 5A for Fe− condition). Further, concatenating the enhanced cyclic signal of both experiments resulted in a reconstructed synchronized diurnal cycle (Supplementary Fig. 5B). We next focused on enhancing the biological differences between the Fe− and Fe+ conditions by spectrally filtering their shared diurnal cycle (Fig. 4). As expected, cyclic filtering increased the differences between the clusters of Fe− and Fe+ associated cells (Silhouette score before/after filtering = 0.088/0.136, Fig. 4b). scPrisma outperformed state-of-the-art cyclic filtering methods, including ccRemover7, Seurat10 and Cyclum9 (Silhouette scores = 9.815 × 10−6, 9.868 × 10−4 and 0.052, respectively; Supplementary Note A.8 and Fig. 4b).

Fig. 4: scPrisma detects and filters the diurnal cycle in Chlamydomonas. a, PCA representation of 3,000 cells in spectrally enhanced cyclic signal of the Fe+ experiment16. Each plot represents 1/6 of the 24 h cycle. The cells are colored according to the normalized sum of the marker genes associated with each phase17. b, PCA representation of 6,000 cells (3,000 of each condition) in Fe+ (blue) and Fe− (orange) conditions of raw gene expression data, data following scPrisma’s cyclic filtering (Silhouette score before/after filtering = 0.088/0.136, Calinski and Harabasz score before/after filtering = 600.802/1000.44) and enhancement, and data following filtering by Cyclum (Silhouette score = 0.052, Calinski and Harabasz score = 337.629), Seurat (Silhouette score = 9.868 × 10−4, Calinski and Harabasz score = 1.29 × 10−10) and ccRemover (Silhouette score = 9.815 × 10−6, Calinski and Harabasz score = 6.25 × 10−13).

scPrisma extracts SCN cell-type-specific temporal signals

We next focus on scRNA-seq data collected for the mouse SCN, the mammalian brain’s circadian pacemaker18. In this experiment, cells were sampled at 12 time points along two days. Again, we leveraged the cyclic nature of the circadian rhythm and explicitly used the prior knowledge regarding the experimental sampling times (instead of running the reconstruction algorithm). We first clustered the cells using the Louvain algorithm and mapped individual clusters to cell types using established marker genes18 (Supplementary Note A.7 and Supplementary Fig. 6). scPrisma’s cyclic enhancement over each cell type separately revealed a cyclic signal associated with the circadian rhythm for 5/8 of cell types (Fig. 5 and Supplementary Fig. 7; Methods). The three cell types that did not exhibit a clear cyclic signal (NG2, microglia and tanycytes) have the lowest fraction of rhythmic gene expression18. Moreover, we measured the separation of cells that were sampled at different time points, before and after cyclic filtering/enhancement, using the Calinski and Harabasz score19. Overall, as expected, separation increased substantially following cyclic enhancement and decreased following cyclic filtering; as above, the effect was least substantial for the three cell types exhibiting the lowest fraction of rhythmic genes (Fig. 5d). The cellular density varies with the rhythmic process and is correlated with the peak of temporal expression of rhythmic genes (Fig. 5a,b and Supplementary Note A.7). Focusing on gene expression, we found that spectral cyclic enhancement diminishes the expression of cell-type marker genes and retains the expression of rhythmic genes (core clock genes and protein folding genes, as characterized in ref. 18; Fig. 5c and Supplementary Fig. 7). Conversely, following cyclic filtering, cell-type marker gene expression was retained, while the resulting temporal expression of rhythmic and protein folding-related genes flattened (Fig. 5c and Supplementary Fig. 7).
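
For reference, the Calinski and Harabasz score used here to quantify the separation of cells by sampling time is the ratio of between-group to within-group dispersion, each normalized by its degrees of freedom. The following Python sketch implements this standard definition on hypothetical toy data; the simulated "raw", "enhanced" and "filtered" matrices are assumptions that only mimic a stronger or weaker time-point signal and are not outputs of scPrisma.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Standard Calinski and Harabasz score:
    [B / (k - 1)] / [W / (n - k)] for n cells in k groups."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n, k = X.shape[0], classes.size
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in classes:
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        between += Xc.shape[0] * np.sum((mu_c - overall_mean) ** 2)
        within += np.sum((Xc - mu_c) ** 2)
    return (between / (k - 1)) / (within / (n - k))

rng = np.random.default_rng(1)
n_per_group, n_groups = 100, 4
labels = np.repeat(np.arange(n_groups), n_per_group)

# Hypothetical toy data: four time points placed on a unit circle,
# with the same noise added for every signal strength.
group_angles = 2 * np.pi * labels / n_groups
centers = np.column_stack([np.cos(group_angles), np.sin(group_angles)])
noise = rng.normal(size=centers.shape)

score_raw = calinski_harabasz(1.0 * centers + noise, labels)
score_enhanced = calinski_harabasz(3.0 * centers + noise, labels)  # stronger cycle
score_filtered = calinski_harabasz(0.0 * centers + noise, labels)  # cycle removed
print(score_enhanced, score_raw, score_filtered)
```

A stronger group-specific signal raises the score and removing it lowers the score, which matches the qualitative pattern reported for cyclic enhancement and filtering.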

Fig. 5: scPrisma extracts cell-type specific circadian rhythm signals in the SCN. a, PCA representation of cells in raw SCN gene expression data18 and data following cyclic enhancement by scPrisma, colored according to CT, for several cell types: ependymal, SCN neurons, oligodendrocytes and astrocytes. b, PCA representation of ependymal cells colored by Tef expression and mean Tef expression as a function of circadian time, for raw data (top), data following scPrisma’s cyclic enhancement (middle) and filtering (bottom). c, Heatmaps of ependymal expression of (from left to right) cell-type marker genes, rhythmic genes and protein folding genes, for raw data (left) and data following scPrisma’s cyclic enhancement (middle) and filtering (right). d, Calinski and Harabasz score for cells sampled at different CT for raw data, and data following scPrisma’s cyclic enhancement and filtering. Cell types with low fraction of circadian genes18 exhibit lower scores. e, Pre- and post-filtering gene expression dotplots of the marker genes of SCN neurons (before filtering, clusters 1, 3 and 4 contain mixtures of two different subtypes), and pre- and post-filtering heatmaps showing the contribution of cells sampled at different CT to each neuronal cluster. f, Regulatory network structure of core clock genes21. g, Pre- and post-enhancement sum of scores of all regulatory interactions of the core clock network shown in f, scored by GRNBoost2, for each SCN cell type. h, Pre- and post-enhancement mean Rorc, Ahsa2 and Hsp90ab1 expression as a function of CT. Following cyclic enhancement, regulatory interactions between the transcription factor Rorc and Ahsa2, Hsp90ab1 were uncovered. i, Mean Avp and Plau expression as a function of CT. Following cyclic enhancement, cell–cell interactions between astrocytes and microglia cell types, mediated by Avp and Plau, were revealed.

scPrisma further enhanced cell-type classification, inference of gene regulatory interactions related to the circadian rhythm and underlying cell–cell interactions. When aiming to characterize cells by their type, additional biological signals can interfere with that task, as similarity between cells can arise due to multiple factors. For example, direct clustering of cells according to their gene expression profiles may capture similarities according to the circadian rhythm phase and not their type, which can substantially hinder our ability to distinguish different cell types. Clustering the neurons yielded 14 distinct clusters, three of which can be identified using either established marker genes or previous subtype classification18 as containing a mixture of neurons from both SCN neuronal subtypes N0 and N2 (clusters 1, 3, 4; Supplementary Note A.7 and Fig. 5e). We found that the circadian rhythm signal interferes with the proper classification of cell subtypes in this case. This is supported by the observation that in clusters 1 and 3 the majority of cells (79% and 66%, respectively) were sampled at circadian time (CT) 14/18/22, while in cluster 4, 93% of cells were sampled at CT 02/06/10, which suggests that the clustering of cells in this subpopulation is dominated by their distinct temporal signatures and not their types (Fig. 5e and Extended Data Fig. 2). We were able to overcome the cell-type misclassification by spectrally filtering the circadian rhythm signal using scPrisma, after which the clustering algorithm yielded a unique cluster for each of the two neuronal subtypes, N0 and N2 (Fig. 5e). Moreover, as expected, the distribution over CT within each cluster flattened following cyclic filtering (Fig. 5e and Extended Data Fig. 2; mean circular variance increased from 0.781 to 0.863 following filtering).

Mixed biological signals in single-cell data can also interfere with the inference of gene regulatory networks. Therefore, we used scPrisma to highlight a set of regulatory interactions related to the circadian rhythm that were difficult to identify in the original data. Specifically, we expected that regulatory interactions that can be revealed following cyclic enhancement would be enriched with interactions associated with the cyclic circadian process. Indeed, regulatory interactions between core clock genes, as inferred using the gene regulatory network inference algorithm GRNBoost2 (ref. 20), were more highly correlated with the established core clock interaction network21 (Fig. 5f) in the cyclically enhanced single-cell data, relative to the raw data, for 7/8 of the cell types (Fig. 5g and Supplementary Note A.7). Going beyond the core clock interaction network, using a list of known mouse transcription factors22, we searched for inferred interactions (based on GRNBoost2) that are substantially enhanced following scPrisma cyclic analysis (Methods), where the regulator is a core clock transcription factor (Nr1d1, Nr1d2, Rora, Rorb, Rorc, Dbp, Tef23; a full list of inferred interactions is available in Supplementary Table 1). For example, focusing on genes inferred to be highly regulated by Rorc in ependymal cells following spectral enhancement, we found that the genes that received the highest score were Ahsa2 (0 to 6.341), Kif21a (0 to 6.231), Hsp90ab1 (0.070 to 5.731) and Mt1 (0 to 5.002), and that the peaks of these genes along the circadian rhythm overlapped with the peak of Rorc following spectral enhancement (Fig. 5h and Supplementary Fig. 7E). These results are consistent with previous results showing the existence of a regulatory interaction between Rorc and Hsp90ab1 that is dependent on the time of day24.

Finally, we used scPrisma to infer hidden cell–cell interactions related to the circadian rhythm. We compared cell–cell communication patterns, using CellPhoneDB25, between different cell types at corresponding time points. Similarly to the regulatory network inference described above, we were able to recover interactions that were substantially enhanced following scPrisma’s cyclic analysis (Methods; Fig. 5i, Supplementary Note A.7 and Supplementary Table 2).

Generalized template matching by scPrisma

Beyond cyclic and linear topologies, scPrisma can be used to manipulate a variety of different, complex topological signals in single-cell data. This is possible because scPrisma can use the numerical spectrum of a given covariance matrix instead of the analytical spectrum used for the cyclic and linear topologies. We demonstrated the versatility of scPrisma on three additional types of templates as follows: (1) Clusters: we constructed a cluster-based template to enhance the separation, or maximize the variation in gene expression, between different cellular clusters and states (Supplementary Note A.13), specifically between cellular states of hepatocytes during day and night time (Extended Data Fig. 3A). Further, this topology can be used for data integration via the spectral filtering workflow, as we demonstrated for human pancreas scRNA-seq data that were collected from four different studies26,27,28,29, labeled and preprocessed as given in ref. 30. scPrisma can filter batch effects in this case, as demonstrated visually (Extended Data Fig. 3D,E) and quantitatively: following cluster-based filtering, the Calinski and Harabasz score between the different batches drops from 298.63 to 3.80 and the score between the annotated cell types rises from 90.97 to 107.48. scPrisma’s results for batch correction are competitive with or outperform state-of-the-art tailored methods for this task (Supplementary Note A.13). (2) Multiple cycling processes: scPrisma can reconstruct multiple cycling processes across SCN cell types both for a synchronized case (Extended Data Fig. 4) and an unsynchronized case (Supplementary Note A.13 and Extended Data Fig. 5). In the latter, more challenging case, circadian core clock genes are out of phase in the SCN neurons versus multiple other cell types18. scPrisma can enhance every periodic signal separately by designing a covariance matrix that is block circulant (Supplementary Note A.13). Furthermore, scPrisma can be used for the general, iterative inference of cyclic processes, although this task is more challenging and scPrisma is not optimized for it. As an example, we used a human embryonic stem cell (hESC) single-cell dataset31. We showed that scPrisma can be used (de novo, without prior knowledge of marker genes) to first infer a cyclic process corresponding to the cell cycle and then filter it out and use the filtered data to reconstruct a second cyclic process corresponding to an oscillatory pattern related to the experimental setup (Supplementary Note A.13 and Supplementary Fig. 8). The encoding of both oscillatory processes by the hESC population is consistent with the previous findings of ref. 31. (3) Two-dimensional (2D) tissue organization: last, we enhanced the spatial signal in a spatially informed (Slide-seqV2) 2D dataset of the mouse hippocampus32. We computed the shortest path matrix (calculated based on the spatial k-nearest neighbor graph over the data) and transformed it into an affinity matrix using a heat kernel (Supplementary Note A.13). In this case, the affinity matrix is used by scPrisma as the covariance matrix of the spatial signal. scPrisma enables flexible manipulation of the spatial signal, which we leveraged to extract, and then either enhance or filter, the spatial signal in the data by applying the enhancement algorithm based on numerical eigendecompositions of the affinity matrix (Supplementary Note A.13 and Extended Data Fig. 6).
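
The construction of a numerical template for 2D tissue organization can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure of Supplementary Note A.13: the Gaussian form of the heat kernel, the median-based bandwidth, the choice k = 6, the random toy coordinates and all function names are hypothetical.

```python
import numpy as np

def knn_graph_distances(coords, k):
    """Edge-weight matrix of a symmetrized k-nearest-neighbor graph
    (np.inf where two cells are not directly connected)."""
    diff = coords[:, None, :] - coords[None, :, :]
    eucl = np.sqrt((diff ** 2).sum(axis=-1))
    n = eucl.shape[0]
    W = np.full((n, n), np.inf)
    nn = np.argsort(eucl, axis=1)[:, 1:k + 1]  # position 0 is the cell itself
    rows = np.repeat(np.arange(n), k)
    W[rows, nn.ravel()] = eucl[rows, nn.ravel()]
    W = np.minimum(W, W.T)  # keep an edge if either cell lists the other
    np.fill_diagonal(W, 0.0)
    return W

def shortest_paths(W):
    """All-pairs shortest path lengths (Floyd-Warshall, vectorized per step)."""
    D = W.copy()
    for k in range(D.shape[0]):
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D

def heat_kernel_affinity(D, sigma):
    """Assumed kernel form: exp(-d^2 / (2 sigma^2)); unreachable pairs get 0."""
    return np.exp(-(D ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(2)
coords = rng.uniform(0, 1, size=(80, 2))  # toy 2D cell positions

D = shortest_paths(knn_graph_distances(coords, k=6))
sigma = np.median(D[np.isfinite(D) & (D > 0)])  # assumed bandwidth choice
A = heat_kernel_affinity(D, sigma)

# Numerical spectrum of the affinity matrix, used in place of an analytical one.
eigvals, eigvecs = np.linalg.eigh(A)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals[:5])
```

The leading eigenvectors of such an affinity matrix vary smoothly over the tissue, so they can serve as a numerical template in the same way that Fourier modes serve as the analytical template for a cyclic signal.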