Abstract
Single-cell RNA sequencing has been instrumental in uncovering cellular spatiotemporal context. This task is challenging as cells simultaneously encode multiple, potentially cross-interfering, biological signals. Here we propose scPrisma, a spectral computational method that uses topological priors to decouple, enhance and filter different classes of biological processes in single-cell data, such as periodic and linear signals. We apply scPrisma to the analysis of the cell cycle in HeLa cells, circadian rhythm and spatial zonation in liver lobules, diurnal cycle in Chlamydomonas and circadian rhythm in the suprachiasmatic nucleus in the brain. scPrisma can be used to distinguish mixed cellular populations by specific characteristics such as cell type and uncover regulatory networks and cell–cell interactions specific to predefined biological signals, such as the circadian rhythm. We show scPrisma’s flexibility in incorporating prior knowledge, inference of topologically informative genes and generalization to additional diverse templates and systems. scPrisma can be used as a standalone workflow for signal analysis and as a prior step for downstream single-cell analysis.
Main
In recent years, progress in single-cell RNA sequencing (scRNA-seq), which captures the gene expression profiles of the multitude of cells across tissues, has led to substantial improvement in our understanding of a variety of intracellular and intercellular processes^{1}. Recent computational advances have pushed forward the interpretation of these data to extract information about the heterogeneity of cell types and states and their collective structure and behavior, including spatial context^{2}, gene regulatory patterns^{3}, cell type^{4} and temporal processes such as lineage^{5} and cell cycle^{6}. Because scRNA-seq data contain multiple biological signals, it can be challenging to uncover a particular underlying signal, and in many cases prior information about the signal in question is needed^{6,7}. Recently, we have shown how topological priors about the hierarchical structure of lineage and differentiation in single-cell data can be leveraged for their identification using spectral approaches, as they exhibit power-law signatures in the covariance eigenvalue distribution^{8}. Relying on such global topological features can provide a more robust and generalizable approach than relying on specific features such as marker genes^{6,7}, as these are often unique to particular biological systems and processes and can be challenging to infer for new systems. Here we show how topological priors can be used beyond signal identification. We use priors about the periodicity and linearity of diverse biological processes, and later generalize to additional topologies, to either enhance or filter them out from single-cell data using a spectral projection approach.
Biological processes that are inherently periodic are abundant and have important roles in diverse contexts, such as the cell cycle and circadian rhythm. Multiple computational methods aim to infer periodic signals from single-cell data, with a particular focus on extracting information related to the cell cycle or removing its effect^{6,7,9,10}. However, the majority of these methods rely heavily on cell cycle marker gene information, including ccRemover^{7}, Seurat^{11} and reCAT^{6}, which makes them difficult to generalize across systems and across periodic signals beyond the cell cycle. In contrast, Cyclum^{9}, which does not rely on marker genes, is an autoencoder-based approach that optimizes a circular embedding for single-cell data to infer and remove cell cycle effects. However, fitting the data to a one-dimensional circle does not generally capture the variability of cyclic biological processes and lacks flexibility as it cannot easily incorporate additional prior knowledge, such as low-resolution temporal information, which may be necessary for weak cyclic signals.
Here we present scPrisma, a general spectral framework (Fig. 1) for the reconstruction, enhancement and filtering of signals in single-cell data based on their topology and inference of topologically informative genes. We benchmark scPrisma and demonstrate its performance over simulated data and seven scRNA-seq datasets. Specifically, we show how the cell cycle can be revealed or filtered in a population of HeLa cells, how circadian rhythm and spatial zonation can be decoupled in liver lobules, how differences in Chlamydomonas that were grown in different environments can be emphasized by filtering their diurnal cycle signal, and how the signature of the circadian rhythm can be revealed in multiple cell types in the suprachiasmatic nucleus (SCN) in the brain, the master circadian pacemaker in mammals. In addition, we show how using scPrisma allows us to better distinguish distinct cellular subtypes of SCN neurons following temporal filtering, and uncover signal-related gene regulatory networks and cell–cell interactions following enhancement of the circadian rhythm signal. Finally, beyond cyclic and linear templates, scPrisma can be used to manipulate diverse template types, enhance the separation between clusters, identify multiple cyclic processes and enhance spatial signals in spatial transcriptomics. scPrisma is versatile and enables topological signal manipulation without low-dimensional embedding, which renders the results useful for diverse types of downstream analyses. Furthermore, it is flexible as it enables integration of diverse types of prior knowledge (such as low-resolution temporal ordering) but does not rely on it and can be used for de novo analyses.
Results
Spectral template matching for scRNA-seq signal manipulation
We developed scPrisma, a spectral analysis framework that uses topological priors over underlying signals in single-cell data, to allow for their inference, enhancement and filtering. The core of scPrisma uses spectral template matching between the spectrum (the eigendecomposition of the covariance matrix) of a set of single-cell data (for example, scRNA-seq) and the expected analytical spectrum of a structure or process we aim to enhance or filter. To analyze a theoretical covariance spectrum (by analyzing its eigenvalues and eigenvectors), we need a reference model. Focusing first on cyclic signals, we propose a simple toy model of periodic biological signals (Methods). The covariance matrix of the gene expression matrix of this model is a circulant matrix of a special form that depends on the model parameters (Methods). Circulant matrices have closed-form formulas for their eigenvectors and eigenvalues^{12} (Fig. 1–1), which we used to estimate the ordering of cells along the cyclic topology. This was done by optimizing for a permutation matrix that maximizes the projection of the data over the theoretical spectrum (Methods; Fig. 1–2). As input for the remainder of scPrisma’s workflow, cellular ordering can also be informed by prior knowledge of low-resolution pseudotime. Based on the reconstructed ordering, scPrisma infers topologically informative genes, as the set of genes that maximizes the projection over the theoretical spectrum (Methods; Fig. 1–3A). scPrisma can then enhance (filter) the signal related to the cyclic process by filtering out gene expression entries that do not maximize (minimize) the projection over the theoretical spectrum (Methods; Fig. 1–4A and 3B). scPrisma can successfully reconstruct, filter and enhance periodic signals in simulated single-cell data (Supplementary Note A.1, Extended Data Fig. 1 and Supplementary Fig. 1).
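The template-matching idea can be illustrated with a minimal NumPy sketch; it is not the scPrisma implementation. It assumes a circulant template whose entries decay exponentially with the circular distance between cells (the rate `decay` is a hypothetical parameter) and scores an ordering by the eigenvalue-weighted squared projection of a cells-by-genes matrix onto the template eigenvectors, which equals trace(XᵀCX). The function names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def circulant_template(n, decay=0.1):
    """Circulant covariance whose entries decay with the circular distance between cells."""
    idx = np.arange(n)
    diff = np.abs(idx[:, None] - idx[None, :])
    circ_dist = np.minimum(diff, n - diff)
    return np.exp(-decay * circ_dist)

def projection_score(X, template):
    """Eigenvalue-weighted squared projection of X (cells x genes) onto the template eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(template)
    proj = eigvecs.T @ X                      # coordinates of every gene in the template eigenbasis
    return float(np.sum(eigvals[:, None] * proj ** 2))

rng = np.random.default_rng(0)
n_cells, n_genes = 60, 40
angles = 2 * np.pi * np.arange(n_cells) / n_cells
phases = rng.uniform(0, 2 * np.pi, n_genes)

# Synthetic periodic expression: each gene peaks at its own phase along the cycle.
X = np.cos(angles[:, None] - phases[None, :]) + 0.1 * rng.normal(size=(n_cells, n_genes))
X -= X.mean(axis=0)                              # remove the constant component
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit L2 norm per cell

template = circulant_template(n_cells)
score_ordered = projection_score(X, template)
score_shuffled = projection_score(X[rng.permutation(n_cells)], template)
print(score_ordered, score_shuffled)
```

In this toy setting the correctly ordered cells score higher than a random permutation of the same cells, which is the property a search over permutations would exploit.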
scPrisma manipulates the cell cycle signal in HeLa cells
We first tested our approach on a scRNA-seq dataset of HeLa cells, unsynchronized across the cell cycle^{13} (Fig. 2). To assess the results, we used a list of approximately 400 genes, classified according to cell cycle phases^{13}, where for each phase, the corresponding genes were summed and normalized and their circular mean and variance were calculated^{14} (Supplementary Note A.4). For an ordered reconstructed signal, the ordering of the circular means should correspond to the cell cycle phases and the circular variance should be less than 1, while for randomly ordered data, the circular variance is expected to be close to 1 (corresponding to a uniform signal along the cycle). Following standard preprocessing of the HeLa single-cell data (Methods; Supplementary Note A.4), the distributions of all phases were found to be nearly uniform (Fig. 2b; mean circular variance = 0.991). However, after cyclic ordering by scPrisma (Methods), different phases of the cell cycle became clearly separated (Fig. 2b; mean circular variance = 0.849) and peaked progressively according to the correct phase ordering.
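The circular statistics used for this assessment can be sketched as follows. This is an illustrative computation under the assumption that each cell is assigned an angle along the reconstructed cycle and that the summed, normalized phase signal acts as a weight. The exact procedure is described in Supplementary Note A.4, and `circular_stats` is a hypothetical helper name.

```python
import numpy as np

def circular_stats(angles, weights):
    """Weighted circular mean (radians) and circular variance (1 - mean resultant length)."""
    weights = np.asarray(weights, dtype=float)
    resultant = np.sum(weights * np.exp(1j * np.asarray(angles))) / np.sum(weights)
    mean_angle = np.angle(resultant) % (2 * np.pi)
    return mean_angle, 1.0 - np.abs(resultant)

n_cells = 200
angles = 2 * np.pi * np.arange(n_cells) / n_cells   # position of each cell along the cycle

# Hypothetical phase signal peaking a quarter of the way around the cycle.
peaked = np.exp(4 * np.cos(angles - np.pi / 2))
# A signal spread uniformly along the cycle, as expected for randomly ordered cells.
uniform = np.ones(n_cells)

mean_peaked, var_peaked = circular_stats(angles, peaked)
mean_uniform, var_uniform = circular_stats(angles, uniform)
print(mean_peaked, var_peaked, var_uniform)
```

A concentrated signal gives a circular variance well below 1, while the uniform signal gives a value close to 1, matching the interpretation given in the text.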
To further enhance the cell cycle signal in the data, we employed scPrisma’s cyclic enhancement algorithm. Iterations of our algorithm gradually revealed a cyclic signal, which was apparent after filtering less than 7% of the total signal and was clearly revealed after removing 22% of the total signal (Fig. 2e). In addition, the reconstructed angular ordering was correlated with the cell cycle phases and the circular variance of phase-corresponding marker genes decreased substantially (Fig. 2b; mean circular variance = 0.666). Further, genes associated with different cell cycle phases peaked progressively in their expected order (Methods; Fig. 2c,d and Supplementary Fig. 2).
Next, we approached the reverse challenge: to filter out the cell cycle signal from the HeLa cell data. After applying scPrisma’s cyclic filtering algorithm to the ordered data, the separation between different cell cycle phases and their corresponding progressive peaks was lost (Fig. 2b; mean circular variance = 0.989).
Finally, we identified genes related to the cell cycle using the gene inference algorithm (Methods; Supplementary Note A.4). When using as input both randomly selected subsets of cell cycle-related genes^{13} and subsets of genes unrelated to the cell cycle, we found a substantial improvement in our ability to identify cell cycle-related genes after reordering the cells (mean AUC = 0.804) relative to the original data (mean AUC = 0.295). Moreover, relative to the ordered data, identifying cell cycle-related genes was further improved following spectral cyclic enhancement (mean AUC = 0.838, outperforming available baselines; Supplementary Fig. 2) and was diminished following cyclic filtering (mean AUC = 0.183; Fig. 2f).
scPrisma disentangles spatiotemporal signals in the liver
We next dissected spatiotemporal signals via a scRNA-seq dataset which captures gene expression variation of hepatocytes in the mammalian liver across both space (spatial zonation across the periportal to pericentral axis) and time (temporal variation across the circadian rhythm)^{15}. Similarly to the cell cycle, the circadian rhythm was also expected to exhibit a cyclic structure in gene expression space. Here we used experimental prior knowledge in the form of low-resolution sampling time^{15} to order the cells in a cycle. In this setting, we still missed information about temporally informative genes, and spatiotemporal information was still entangled (for example, Pck1 varied informatively across both space and time of day^{15}). We leveraged scPrisma to disentangle these data and showed clear enhancement and filtering of the circadian rhythm, relative to raw data (adjusted Rand index (ARI) of k-means clustering with K = 4: 0.96, 0.013 and 0.11, respectively; Fig. 3a,e and Supplementary Fig. 3A; Methods). scPrisma outperformed available baselines for filtering the circadian rhythm (Supplementary Note A.5). These results were reflected in the behavior of individual genes; Pck1 is a rhythmic gene that was highly expressed at ZT06 and ZT12 (ref. ^{15}), and indeed, following spectral cyclic enhancement, its resulting expression in ZT00 and ZT18 was diminished, while cyclic filtering resulted in a nearly constant temporal expression of Pck1 (Fig. 3c). Spatially, Pck1 is periportally zonated, and indeed, spectral cyclic enhancement flattens its spatial expression, while cyclic filtering retains its spatial variation (Fig. 3c). Similar behavior can be observed for additional spatiotemporally informative genes (Fig. 3d and Supplementary Fig. 4).
In a complementary manner, spectral analysis focusing on the characteristics of linear signals can be used to filter the spatial linear signal in the collective gene expression of hepatocytes (Methods). Indeed, following spectral linear filtering by scPrisma, the cyclic circadian signal was clearly revealed (Fig. 3a and Supplementary Fig. 3A) and the linear zonation signal was blurred out relative to the raw data (Fig. 3b and Supplementary Fig. 3B,E; ARI of k-means clustering with K = 8: 0.017 and 0.11, respectively). Focusing again on Pck1 expression, linear spectral enhancement retained only the expression around the portal vein and reduced the temporal variance, while linear filtering reduced the zonation variance and yet retained the temporal variance (Fig. 3c).
Finally, we evaluated the dominant structure in the data following scPrisma analysis. As predicted, following spectral enhancement, the data clustered according to the enhanced signal, while following spectral filtering, the data clustered according to the unfiltered signal; for example, filtering the cyclic signal led to better identification and clustering of the spatial zonation signal (Fig. 3e).
scPrisma manipulates the diurnal cycle in Chlamydomonas
To demonstrate the use of scPrisma for more complex systems with diverse prior knowledge, we next turn our attention to scRNA-seq data collected for Chlamydomonas (green algae), grown under two contrasting conditions, iron replete (Fe^{+}) and iron deficient (Fe^{−})^{16}. In both conditions, an expression signal that reflects the 24-h diurnal cycle was previously detected^{16}. To evaluate progression of cells along the diurnal cycle, we used marker genes corresponding to different cycle phases obtained from bulk RNA-sequencing^{17} (Supplementary Note A.6). Using scPrisma’s cyclic enhancement resulted in robust reconstruction of the diurnal cycle for each of the two conditions. This was done by splitting the 24-h cycle into six phases and validating that they are well separated following reconstruction and enhancement (Methods; Fig. 4a for Fe^{+} condition and Supplementary Fig. 5A for Fe^{−} condition). Further, concatenating the enhanced cyclic signal of both experiments resulted in a reconstructed synchronized diurnal cycle (Supplementary Fig. 5B). We next focused on enhancing the biological differences between the Fe^{−} and Fe^{+} conditions by spectrally filtering their shared diurnal cycle (Fig. 4). As expected, cyclic filtering increased the differences between the clusters of Fe^{−} and Fe^{+} associated cells (Silhouette score before/after filtering = 0.088/0.136, Fig. 4b). scPrisma outperformed state-of-the-art cyclic filtering methods, including ccRemover^{7}, Seurat^{10} and Cyclum^{9} (Silhouette scores = 9.815 × 10^{−6}, 9.868 × 10^{−4} and 0.052, respectively; Supplementary Note A.8 and Fig. 4b).
scPrisma extracts SCN cell-type-specific temporal signals
We next focus on scRNA-seq data collected for the mouse SCN, the mammalian brain’s circadian pacemaker^{18}. In this experiment, cells were sampled at 12 time points over two days. Again, we leveraged the cyclic nature of the circadian rhythm and explicitly used the prior knowledge regarding the experimental sampling times (instead of running the reconstruction algorithm). We first clustered the cells using the Louvain algorithm and mapped individual clusters to cell types using established marker genes^{18} (Supplementary Note A.7 and Supplementary Fig. 6). scPrisma’s cyclic enhancement over each cell type separately revealed a cyclic signal associated with the circadian rhythm for five of the eight cell types (Fig. 5 and Supplementary Fig. 7; Methods). The three cell types that did not exhibit a clear cyclic signal (NG2, microglia and tanycytes) have the lowest fraction of rhythmic gene expression^{18}. Moreover, we measured the separation of cells that were sampled at different time points, before and after cyclic filtering/enhancement using the Calinski and Harabasz score^{19}. Overall, as expected, separation increased substantially following cyclic enhancement and decreased following cyclic filtering, which, as above, is least substantial for the three cell types exhibiting the lowest fraction of rhythmic genes (Fig. 5d). It can be observed that the cellular density varies with the rhythmic process and is correlated with the peak of temporal expression of rhythmic genes (Fig. 5a,b and Supplementary Note A.7). Focusing on gene expression, we found that spectral cyclic enhancement diminishes the expression of cell-type marker genes and retains the expression of rhythmic genes (core clock genes and protein folding genes, as characterized in ref. ^{18}; Fig. 5c and Supplementary Fig. 7). Conversely, following cyclic filtering, cell-type marker gene expression was retained, while the resulting temporal expression of rhythmic and protein folding-related genes flattened (Fig. 5c and Supplementary Fig. 7).
scPrisma further enhanced cell-type classification, inference of gene regulatory interactions related to the circadian rhythm and underlying cell–cell interactions. When aiming to characterize cells by their type, additional biological signals can interfere with that task as similarity between cells can arise due to multiple factors. For example, direct clustering of cells according to their gene expression profiles may capture similarities according to the circadian rhythm phase and not their type, which can substantially hinder our ability to distinguish different cell types. Clustering the neurons yielded 14 distinct clusters, three of which can be identified using either established marker genes or previous subtype classification^{18} as containing a mixture of neurons from both SCN neuronal subtypes N0 and N2 (clusters 1, 3, 4; Supplementary Note A.7 and Fig. 5e). We found that the circadian rhythm signal interferes with the proper classification of cell subtypes in this case, supported by the observation that in clusters 1 and 3 the majority of cells (79% and 66%, respectively) were sampled at circadian time points (CT) 14/18/22, while in cluster 4, 93% of cells were sampled at CT 02/06/10, which suggests that the clustering of cells in this subpopulation is dominated by their distinct temporal signatures and not their types (Fig. 5e and Extended Data Fig. 2). We were able to overcome the cell-type misclassification by spectrally filtering the circadian rhythm signal using scPrisma, after which the clustering algorithm yielded a unique cluster for each of the two neuronal subtypes, N0 and N2 (Fig. 5e). Moreover, as expected, the distribution over CT within each cluster flattened following cyclic filtering (Fig. 5e and Extended Data Fig. 2; mean circular variance increased from 0.781 to 0.863 following filtering).
Mixed biological signals in single-cell data can also interfere with the inference of gene regulatory networks. Therefore, we used scPrisma to highlight a set of regulatory interactions related to the circadian rhythm that were difficult to identify in the original data. Specifically, we expected that regulatory interactions that can be revealed following cyclic enhancement would be enriched with interactions associated with the cyclic circadian process. Indeed, regulatory interactions between core clock genes, as inferred using the gene regulatory network inference algorithm GRNBoost2 (ref. ^{20}), are more highly correlated to the established core clock interaction network^{21} (Fig. 5f) in the cyclically enhanced single-cell data, relative to the raw data, for seven of the eight cell types (Fig. 5g and Supplementary Note A.7). Going beyond the core clock interaction network, using a list of known mouse transcription factors^{22}, we searched for inferred interactions (based on GRNBoost2) which are substantially enhanced following scPrisma cyclic analysis (Methods), where the regulator is a core clock transcription factor (Nr1d1, Nr1d2, Rora, Rorb, Rorc, Dbp, Tef^{23}; a full list of inferred interactions is available in Supplementary Table 1). For example, focusing on genes inferred to be highly regulated by Rorc in ependymal cells following spectral enhancement, we found that the genes that received the highest score were Ahsa2 (0 to 6.341), Kif21a (0 to 6.231), Hsp90ab1 (0.070 to 5.731) and Mt1 (0 to 5.002) and the peaks of these genes along the circadian rhythm overlapped with the peak of Rorc, following spectral enhancement (Fig. 5h and Supplementary Fig. 7E). These results are consistent with previous results showing the existence of a regulatory interaction between Rorc and Hsp90ab1 which is dependent on the time of day^{24}.
Finally, we used scPrisma to infer hidden cell–cell interactions related to the circadian rhythm. We compared cell–cell communication patterns, using CellPhoneDB^{25}, between different cell types at corresponding time points. Similarly to the regulatory network inference described above, we were able to recover interactions that were substantially enhanced following scPrisma’s cyclic analysis (Methods; Fig. 5i, Supplementary Note A.7 and Supplementary Table 2).
Generalized template matching by scPrisma
Beyond cyclic and linear topologies, scPrisma can be used to manipulate a variety of different, complex topological signals in single-cell data. This is possible because scPrisma can use the numerical spectrum of a given covariance matrix, instead of the analytical spectrum, as was the case for the cyclic and linear topologies. We demonstrated the diversity of scPrisma on three additional types of templates as follows: (1) Clusters: we constructed a cluster-based template to enhance the separation, or maximize the variation in gene expression, between different cellular clusters and states (Supplementary Note A.13), specifically between cellular states of hepatocytes during day and night time (Extended Data Fig. 3A). Further, this topology can be used for data integration via the spectral filtering workflow, as we demonstrated for human pancreas scRNA-seq data which were collected from four different studies^{26,27,28,29}, labeled and preprocessed as given in ref. ^{30}. scPrisma can filter batch effects in this case, demonstrated visually (Extended Data Fig. 3D,E) and quantitatively, as following cluster-based filtering, the Calinski and Harabasz score between the different batches drops from 298.63 to 3.80 and the score between the annotated cell types rises from 90.97 to 107.48. scPrisma’s results for batch correction are competitive with or outperform state-of-the-art tailored methods for this task (Supplementary Note A.13). (2) Multiple cycling processes: scPrisma can reconstruct multiple cycling processes across SCN cell types both for a synchronized case (Extended Data Fig. 4) and an unsynchronized case (Supplementary Note A.13 and Extended Data Fig. 5). In the latter, more challenging case, circadian core clock genes are out-of-phase in the SCN neurons versus multiple other cell types^{18}. scPrisma can enhance every periodic signal separately by designing a covariance matrix that is block circulant (Supplementary Note A.13). 
Furthermore, scPrisma can be used for the general, iterative inference of cyclic processes, although this task is more challenging and scPrisma is not optimized for it. As an example, we used a human embryonic stem cell (hESC) single-cell dataset^{31}. We showed that scPrisma can be used (de novo, without prior knowledge on marker genes) to first infer a cyclic process corresponding to the cell cycle and then filter it out and use the filtered data to reconstruct a second cyclic process corresponding to an oscillatory pattern related to the experimental setup (Supplementary Note A.13 and Supplementary Fig. 8). The encoding of both oscillatory processes by the hESC population is consistent with previous findings of ref. ^{31}. (3) Two-dimensional (2D) tissue organization: last, we enhanced the spatial signal in a spatially informed (Slide-seqV2) 2D dataset of the mouse hippocampus^{32}. We computed the shortest path matrix (calculated based on the spatial k-nearest neighbor graph over the data) and transformed it into an affinity matrix using a heat kernel (Supplementary Note A.13). In this case, the affinity matrix is used by scPrisma as the covariance matrix of the spatial signal. scPrisma enables flexible manipulation of the spatial signal, which we leveraged to extract, and then either enhance or filter the spatial signal in the data by applying the enhancement algorithm based on numerical eigendecompositions of the affinity matrix (Supplementary Note A.13 and Extended Data Fig. 6).
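The construction of the spatial template described above can be sketched in NumPy. This is an illustration under stated assumptions, not the procedure of Supplementary Note A.13: the heat kernel is taken to be exp(−d²/(2σ²)) with σ set to the median shortest-path distance, the k-nearest neighbor graph is symmetrized, and shortest paths are computed with a plain Floyd–Warshall loop. The function name `spatial_affinity` and the grid coordinates are hypothetical.

```python
import numpy as np

def spatial_affinity(coords, k=4, sigma=None):
    """Shortest-path distances on a symmetrized kNN graph, mapped through a heat kernel."""
    n = coords.shape[0]
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    neighbors = np.argsort(dist, axis=1)[:, 1:k + 1]   # column 0 is the point itself
    rows = np.repeat(np.arange(n), k)
    cols = neighbors.ravel()
    graph[rows, cols] = dist[rows, cols]
    graph[cols, rows] = dist[rows, cols]                # make the graph undirected
    for mid in range(n):                                # Floyd-Warshall relaxation
        graph = np.minimum(graph, graph[:, [mid]] + graph[[mid], :])
    if sigma is None:
        finite = graph[np.isfinite(graph) & (graph > 0)]
        sigma = np.median(finite)                       # assumed bandwidth heuristic
    # Disconnected pairs have infinite distance and therefore zero affinity.
    return np.exp(-(graph ** 2) / (2 * sigma ** 2))

# Hypothetical spatial coordinates: a 6 x 6 grid of spots.
xs, ys = np.meshgrid(np.arange(6.0), np.arange(6.0), indexing="ij")
coords = np.column_stack([xs.ravel(), ys.ravel()])

affinity = spatial_affinity(coords)
eigvals, eigvecs = np.linalg.eigh(affinity)             # numerical spectrum used as the template
```

Note that an affinity built from graph distances is not guaranteed to be positive semidefinite, so a practical workflow might discard or clip negative eigenvalues before using the spectrum as a template.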
Discussion
In this study, we developed scPrisma, a spectral analysis workflow based on topological priors for reconstruction, informative genes inference, signal filtering and enhancement. While we focus on periodic signals (cell cycle in HeLa cells, diurnal cycle in Chlamydomonas and circadian rhythm in the SCN), we also demonstrate it for convoluted spatial and cyclic signals (spatial zonation and circadian rhythm in liver lobules) and a diversity of additional signals and topologies, such as clusters, multiple cycling processes and 2D spatial templates.
scPrisma presents three major contributions as follows: First, it embodies a full workflow for analyzing underlying topological signals based on an approach that can be performed either de novo or enhanced using prior knowledge (for example, low-resolution pseudotime or marker gene information). This flexibility allows scPrisma to uncover topological signals of varying strengths. Second, scPrisma enables both signal enhancement and filtering without embedding to lower dimensions, which makes it useful as a prior step for existing downstream analyses, such as inferring gene regulation networks and cell–cell interactions. This can accelerate biological discovery, as we exemplify for SCN neurons, by revealing gene regulation patterns and cell–cell interactions that are associated with a specific biological process such as the circadian rhythm. Third, the enhancement algorithm does not overfit to a circular topology, by applying the gene inference task before enhancement (thus retaining only genes related to the desired signal), controlling the level of filtering by regularization and restricting the range of entries in the filtering matrix. Future work can leverage scPrisma’s flexibility and robustness to optimize it for diverse tasks that arise in the context of single-cell and spatial omics analysis, such as generalized spatial analysis, data integration and iterative manipulation of signals, which is a promising, yet challenging, direction.
A computational challenge arises due to the nonconvexity and runtime complexity of the reconstruction task (Extended Data Fig. 7). This optimization challenge is relieved for cyclic signals, as the theoretical analysis of the eigenvectors does not depend on the specific values of the matrix but only on its circulant property. Moreover, the multiple solutions for the cyclic reconstruction task (every circular shift of a solution is a valid solution) ease the convergence to a feasible solution. In addition, the reconstruction step can be done either by scPrisma or by other pseudotime trajectory reconstruction algorithms. Another challenge is applying scPrisma when it is not clear whether a signal corresponding to the template exists or is strong enough to be detected in the data. This challenge is alleviated due to several reasons. First, we reason and provide empirical support on both synthetic and real single-cell data that scPrisma avoids overfitting to input template topologies. Therefore, in general, scPrisma does not converge to a topology that is not a reflection of a strong-enough signal in the data, as is demonstrated for the SCN, where the algorithm is limited in its convergence over cell types with weak periodic signal. Second, evaluating results obtained by scPrisma can be done by comparison to partial prior knowledge, such as marker genes known to be related (or unrelated) to recovered biological signals, low-resolution sampling times of cells relative to temporal signals, or by interpreting gene ontology enrichment analysis following signal manipulation. Additionally, we suggest a measure for the quality of convergence of scPrisma, the projection proportion score, and while it can be useful for exploring analysis options, it is not associated with a single threshold that can distinguish successful convergence, as it is affected by the signal and data characteristics (Supplementary Note A.12 and Supplementary Fig. 9).
While in this work we focused on periodic signals, which can be analyzed analytically, scPrisma can also be applied using a numerical eigendecomposition of a covariance matrix that is either inferred from the data or constructed numerically based on a topological model. We anticipate that scPrisma will accelerate single-cell-based research by enhancing target signals of interest and enabling their identification and analysis and providing a general workflow for single-cell signal disentanglement in diverse biological contexts.
Methods
Spectral analysis of cyclic signals
For theoretical analysis, we constructed three simple models for the cyclic signals. In the first model, illustrated in Supplementary Fig. 10, we receive as input the number of cellular variations (q), the number of genes (p) and the number of changes between neighboring cells (k). We start with a root cell whose expression profile is a binary vector with p entries ({1, 0}^{p}). Each gene is approximated to be either expressed (ON, 1) or not expressed (OFF, 0). Then, the next cell in the cycle is generated by duplicating the existing cell, choosing uniformly k genes and switching their state. This process is repeated q times. Then, within the last generated cell, k genes whose state differs from the root cell are chosen at random, their state is switched and the cell is duplicated. This process is then continued until the gene expression of the newest cell is identical to the root cell. For the analysis of the covariance matrix of the model, we will use a similar Markovian assumption to the assumption that was used in refs. ^{8,34}; the covariance between the expression profiles of two cells, separated by m state changes, where m is the minimum distance between the cells clockwise and counterclockwise (undirected cyclostationary assumption), is given by \(\alpha (m)=E[X(m)X(0)]=\exp (-2mk/p),\) where 0 ≤ m ≤ n/2, p is the number of genes and k is the number of changes between neighboring cells. More information about the model and the estimation of α from real data is described in Supplementary Note A.11. According to this assumption, the expected covariance matrix of the gene expression matrix is circulant:
\[
{\begin{pmatrix}
\alpha (0) & \alpha (1) & \alpha (2) & \cdots & \alpha (1)\\
\alpha (1) & \alpha (0) & \alpha (1) & \cdots & \alpha (2)\\
\alpha (2) & \alpha (1) & \alpha (0) & \cdots & \alpha (3)\\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\alpha (1) & \alpha (2) & \alpha (3) & \cdots & \alpha (0)
\end{pmatrix}}_{n\times n},
\]
where n is the number of cells. The first column (\(\overrightarrow{c}\)) of a circulant matrix specifies the entire matrix. The (k, j) entry of a general circulant matrix C is given by \({C}_{k,j}={\overrightarrow{c}}_{(j-k) \% n}\)^{35}. The spectrum of a circulant matrix has an analytical closed-form formula^{12}. Specifically, the eigenvalues are the discrete Fourier transform of the first row, and the eigenvectors are the normalized Fourier modes. Because a covariance matrix is symmetric and positive semidefinite, all its eigenvalues are real. Therefore, the eigenvalues are the discrete cosine transform of the first row^{36}:
\[
{\lambda }_{i}=\sum _{j=0}^{n-1}{\overrightarrow{c}}_{j}\cos \left(\frac{2\pi ij}{n}\right),\qquad i=0,\ldots ,n-1.
\]
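The sth entry of the ith eigenvector corresponding to the ith eigenvalue is^{35}
\[
{v}_{i}(s)=\frac{1}{\sqrt{n}}\exp \left(-\frac{2\pi \sqrt{-1}\,is}{n}\right),\qquad s=0,\ldots ,n-1.
\]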
To test our approach, we defined two additional models, described in Supplementary Note A.2, which we used in the simulated data section (Supplementary Note A.1).
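The first toy model and the spectral properties of its circulant covariance can be checked numerically. The sketch below is a reading of the model as described above, not the authors' code: it assumes that the final return step may flip fewer than k genes when fewer than k still differ from the root, and that the last generated cell (identical to the root) is dropped so that the cycle closes. It uses the decaying covariance α(m) = exp(−2mk/p), and all function names are hypothetical.

```python
import numpy as np

def simulate_cyclic_cells(p, q, k, rng):
    """Toy cyclic model: q forward steps flipping k random genes, then flip back toward the root."""
    root = rng.integers(0, 2, size=p)
    cells = [root.copy()]
    for _ in range(q):                                   # forward half of the cycle
        nxt = cells[-1].copy()
        nxt[rng.choice(p, size=k, replace=False)] ^= 1
        cells.append(nxt)
    while True:                                          # return half of the cycle
        differing = np.flatnonzero(cells[-1] != root)
        if differing.size == 0:
            break
        nxt = cells[-1].copy()
        # Assumption: flip fewer than k genes when fewer than k still differ from the root.
        nxt[rng.choice(differing, size=min(k, differing.size), replace=False)] ^= 1
        cells.append(nxt)
    return np.array(cells[:-1])                          # drop the final copy of the root

def cyclic_covariance(n, p, k):
    """Circulant covariance with alpha(m) = exp(-2mk/p), m being the circular distance."""
    idx = np.arange(n)
    diff = np.abs(idx[:, None] - idx[None, :])
    m = np.minimum(diff, n - diff)
    return np.exp(-2.0 * m * k / p)

rng = np.random.default_rng(1)
p, q, k = 200, 20, 5
cells = simulate_cyclic_cells(p, q, k, rng)
n = cells.shape[0]
C = cyclic_covariance(n, p, k)

first_row = C[0]
j = np.arange(n)
# Eigenvalues as the discrete cosine transform of the first row.
dct_eigvals = np.array([np.sum(first_row * np.cos(2 * np.pi * i * j / n)) for i in range(n)])
numeric_eigvals = np.linalg.eigvalsh(C)                  # ascending order

# A normalized Fourier mode (i = 1) should satisfy C v = lambda_1 v.
v1 = np.exp(-2j * np.pi * j / n) / np.sqrt(n)
residual = np.linalg.norm(C @ v1 - dct_eigvals[1] * v1)
```

The cosine-transform eigenvalues agree with a numerical eigendecomposition up to ordering, which is why the cyclic template needs no numerical decomposition in practice.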
Spectral analysis of linear signals
As in the analysis of cyclic signals, we first construct a simple model for linear signals, following a linear model similar to the one presented in ref. ^{8}. The model receives the same input as the cyclic model, and each cell is represented by a binary vector {1, 0}^{p}. We start from a root cell, and then over n iterations, a new cell is created in the linear chain by changing the state of k randomly chosen genes relative to the previous cell in the chain. As in the cyclic model, we assume that the covariance between the gene expression profiles of two cells, separated by m state changes, is given by \(\alpha (m)=E[X(m)X(0)]=\exp (-2m/p)\). Thus, the expected covariance matrix of the gene expression matrix is
\({C}_{i,j}=\alpha (|i-j|)=\exp (-2|i-j|/p).\)
This matrix is a special case of a Toeplitz matrix, known as a Kac–Murdock–Szego matrix^{37}. The eigenvalues of such a Kac–Murdock–Szego matrix can be approximated as^{37}:
The corresponding eigenvectors can either be estimated analytically^{38} or calculated by the numerical decomposition of the theoretical matrix.
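The numerical route mentioned above can be sketched as follows: build the theoretical covariance of the linear model as a matrix with entries ρ^{|i−j|}, where ρ = exp(−2/p), and decompose it directly. The function name and the sizes n and p are assumptions for illustration only.

```python
import numpy as np

def kms_covariance(n, p):
    """Theoretical covariance of the linear model: entries rho**|i-j|, rho = exp(-2/p)."""
    rho = np.exp(-2.0 / p)
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :]), rho

# Illustrative sizes (assumptions, not values from the paper).
n, p = 50, 100
K, rho = kms_covariance(n, p)

# eigh returns eigenvalues in ascending order; flip to put the largest first.
eigvals, eigvecs = np.linalg.eigh(K)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

print(eigvals[:3])
```

Because the matrix is real and symmetric, `np.linalg.eigh` returns real eigenvalues and an orthonormal set of eigenvectors, which is what the reconstruction and filtering steps require.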
Preprocessing
We used a standard preprocessing pipeline: we first removed genes that are not expressed in any of the cells in our data, then applied per-cell normalization by dividing each count by the total counts of that particular cell, applied a log transformation and retained only highly variable genes.
For the reconstruction algorithm, we scaled the L_{2} norm of each cell to 1 to ensure that the circulant matrix has a constant diagonal. For the gene inference algorithm, the L_{2} norm of each gene should be scaled to 1, as the score of each gene is relative to the rest of the genes. To estimate α, representing the correlation between neighbors according to the target topology, we search for the α value that best matches the spectrum of the given gene expression matrix. The results were improved by applying the algorithms after removing the theoretical covariance eigenvector associated with the largest eigenvalue. For the cyclic case, the values of this eigenvector are constant.
scPrisma general-case algorithm
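The preprocessing steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions: the log transformation is taken as log(1 + x), highly variable genes are approximated by ranking genes on variance (a stand-in for a dedicated selection method), and all function names and the `n_top_genes` parameter are hypothetical.

```python
import numpy as np

def preprocess(counts, n_top_genes=2000):
    """counts: cells x genes array of raw counts."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[:, counts.sum(axis=0) > 0]      # drop genes with no counts
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0                       # guard against empty cells
    logged = np.log1p(counts / totals)              # per-cell normalization, then log(1 + x)
    # Assumption: variance ranking as a simple proxy for highly variable genes.
    top = np.argsort(logged.var(axis=0))[::-1][:n_top_genes]
    return logged[:, np.sort(top)]

def scale_cells_l2(X):
    """Scale each cell (row) to unit L2 norm, as used before reconstruction."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms

def scale_genes_l2(X):
    """Scale each gene (column) to unit L2 norm, as used before gene inference."""
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms

# Toy example with synthetic counts.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(5.0, size=(30, 50))
toy_counts[:, 7] = 0                                # a gene with no expression
X = preprocess(toy_counts, n_top_genes=20)
X_cells = scale_cells_l2(X)
X_genes = scale_genes_l2(X)
```

Scaling cells to unit norm makes every diagonal entry of the cell-by-cell matrix X X^T equal to 1, which is the constant-diagonal property the reconstruction step relies on.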

1. Choose the desired topology (for example, periodic/linear). Calculate the theoretical covariance eigenvectors and eigenvalues.
2. Preprocess the data.
3. Reconstruct the signal by reordering the gene expression rows by solving Problem 2 (below) or by using prior knowledge.
Option 1: signal enhancement
(a) Infer informative genes by solving Problem 3 (below) or by using prior knowledge, and remove the rest of the genes.
(b) Enhance the desired signal by solving Problem 4 (below).
Option 2: signal filtering
(a) Filter out the desired signal by solving Problem 5 (below).
Signal reconstruction
With closed-form expressions for the spectrum (equations (2) and (3)), we can estimate the pseudotime of the underlying cyclic trajectory by finding the row reordering of the gene expression matrix that maximizes the projection over the theoretical spectrum. This problem can be formulated as a matrix permutation problem.
Problem 1
Matrix permutation problem for estimating a pseudotime that maximizes the projection over the theoretical spectrum:
where A is the original gene expression matrix, \({\overrightarrow{v}}_{i}\) and λ_{i} are the theoretical eigenvectors and corresponding eigenvalues, respectively, and P is the set of permutation matrices. Under the assumption that A has a permutation Ẽ such that the spectrum of \(({\tilde{\rm {E}}}*A)*({\tilde{\rm {E}}}*A)^{\rm{T}}\) matches the theoretical spectrum, the optimal solution is E = Ẽ (Supplementary Note A.15). Now, consider an equivalent formulation of the function we wish to maximize: \(\mathop{\sum }\nolimits_{i = 0}^{n-1}{\lambda }_{i}* {{\left\Vert {(E* A)}^{\rm{T}}* {\overrightarrow{v}}_{i}\right\Vert }_{2}^{2}}\). This objective rewards the squared inner product between each gene, reordered by the permutation matrix, and each theoretical eigenvector, weighted by the corresponding eigenvalue. Because these are the eigenvectors of the theoretical covariance matrix, they represent the variance along the theoretical topology. As a result, the permutation that maximizes this objective maximizes the variance along the theoretical topology.
Permutation problems are known to be NP-hard^{39}. We follow previous studies and solve a convex relaxation of this problem: instead of searching for a permutation matrix, we search for a doubly stochastic matrix (a point in the Birkhoff polytope)^{39,40}:
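To make the objective concrete, the sketch below evaluates \(\sum_i \lambda_i \Vert A^{\rm T} \overrightarrow{v}_i \Vert_2^2\) for a toy cyclic dataset in its true order and after shuffling the cells. The toy data (cosine-shaped genes with random phases), the neighbor correlation `alpha = 0.9` and the function names are assumptions for illustration. The eigenpairs are obtained numerically rather than from the closed-form expressions.

```python
import numpy as np

def cyclic_spectrum(n, alpha):
    """Eigenpairs of the circulant covariance with entries alpha**(cyclic distance)."""
    idx = np.arange(n)
    d = np.abs(idx[:, None] - idx[None, :])
    C = alpha ** np.minimum(d, n - d)
    lam, V = np.linalg.eigh(C)        # columns of V are the eigenvectors
    return lam, V

def spectral_objective(A, lam, V):
    """sum_i lam_i * ||A^T v_i||_2^2 for a cells x genes matrix A."""
    proj = A.T @ V                    # entry (g, i) is the inner product <gene g, v_i>
    return float(np.sum(lam * np.sum(proj ** 2, axis=0)))

rng = np.random.default_rng(0)
n, p = 100, 40
phase = rng.uniform(0, 2 * np.pi, size=p)
t = 2 * np.pi * np.arange(n) / n
A_true = np.cos(t[:, None] - phase[None, :])   # cells ordered along the cycle
A_shuffled = A_true[rng.permutation(n)]        # same cells, scrambled order

lam, V = cyclic_spectrum(n, alpha=0.9)
score_true = spectral_objective(A_true, lam, V)
score_shuffled = spectral_objective(A_shuffled, lam, V)
print(score_true > score_shuffled)
```

In this toy setting the correctly ordered matrix scores higher because each smooth gene aligns with a low-frequency eigenvector that carries a large eigenvalue, whereas a shuffled gene spreads its energy across all eigenvectors.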
Problem 2
Convex relaxation of Problem 1:
Here the objective function is quadratic and convex (Supplementary Note A.16). Although maximizing a convex function is not a convex optimization problem, previous studies have shown that such problems can be solved efficiently with stochastic gradient descent^{41,42}. To project onto the Birkhoff polytope, we used the Bregmanian bistochastication algorithm^{40}. Finally, to round the doubly stochastic matrix to a permutation matrix, we used a simple greedy algorithm. Specifically, the algorithm iterates over the rows; for each row, it sets to 1 the maximum entry among the columns that do not yet contain a 1 and sets the remaining entries to 0. The output of this algorithm is the permuted gene expression matrix: A_{ordered} = \(E*A\).
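The greedy rounding step can be sketched as below. To produce a doubly stochastic input, the example uses Sinkhorn-style alternating row and column normalization as a simple stand-in. This is not the Bregmanian bistochastication algorithm used in the paper, and the function names are hypothetical.

```python
import numpy as np

def sinkhorn_bistochastic(M, n_iter=200, eps=1e-12):
    """Stand-in projection: alternately normalize rows and columns of a non-negative matrix."""
    M = np.maximum(np.asarray(M, dtype=float), 0.0) + eps
    for _ in range(n_iter):
        M = M / M.sum(axis=1, keepdims=True)
        M = M / M.sum(axis=0, keepdims=True)
    return M

def greedy_round(E):
    """Round a doubly stochastic matrix to a permutation matrix, one row at a time."""
    n = E.shape[0]
    P = np.zeros_like(E)
    free = np.ones(n, dtype=bool)                  # columns that do not hold a 1 yet
    for r in range(n):
        masked = np.where(free, E[r], -np.inf)     # ignore columns already taken
        c = int(np.argmax(masked))
        P[r, c] = 1.0
        free[c] = False
    return P

rng = np.random.default_rng(0)
E = sinkhorn_bistochastic(rng.uniform(size=(6, 6)))
P = greedy_round(E)

A = rng.normal(size=(6, 4))                        # toy cells x genes matrix
A_ordered = P @ A                                  # rows of A reordered by P
```

Because each row claims a column that no earlier row has taken, the output always has exactly one 1 per row and per column, so it is a valid permutation matrix.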
Genes inference
Once the reconstructed signal is obtained, either by solving Problem 2 or from prior knowledge, the informative genes related to the desired signal can be identified. This is achieved by filtering out genes that do not maximize the projection over the theoretical spectrum. Because of convexity considerations, it is easier to infer the genes that are not related to the desired signal and then flip the results. These genes can be inferred by solving the following optimization problem:
Problem 3
Genes inference:
Because genes are represented by columns of A, each entry on the diagonal of D represents the influence of the respective gene on the spectrum. The number of filtered genes can be controlled by adding regularization. We can either increase the regularization coefficient, γ, to filter fewer genes or decrease it to filter more genes. The output of this algorithm is the gene expression matrix, after nullifying the genes that are not informative relative to the signal: D_{1} = I − D, A_{gene-inferred} = A_{ordered} \(*\) D_{1}.
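The output step, multiplying by D_{1} = I − D, can be illustrated with a toy example. The sketch below does not solve the regularized optimization of Problem 3. As a labeled simplification, it scores each unit-norm gene by its weighted projection on the theoretical eigenvectors and flags low-scoring genes with a median threshold. The toy data, the threshold and the function names are assumptions.

```python
import numpy as np

def gene_spectral_scores(A_ordered, lam, V):
    """Per-gene sum_i lam_i * <gene, v_i>^2 after scaling each gene to unit L2 norm."""
    norms = np.linalg.norm(A_ordered, axis=0, keepdims=True)
    norms[norms == 0] = 1.0
    G = A_ordered / norms
    return ((G.T @ V) ** 2) @ lam

def apply_gene_mask(A_ordered, flags):
    """A_ordered * (I - D), with D = diag(flags) and flags[g] = 1 for uninformative genes."""
    D1 = np.eye(len(flags)) - np.diag(flags)
    return A_ordered @ D1

rng = np.random.default_rng(1)
n = 100
idx = np.arange(n)
dist = np.abs(idx[:, None] - idx[None, :])
C = 0.9 ** np.minimum(dist, n - dist)              # cyclic covariance, alpha = 0.9 (assumed)
lam, V = np.linalg.eigh(C)

t = 2 * np.pi * idx / n
cyclic = np.cos(t[:, None] - rng.uniform(0, 2 * np.pi, 20)[None, :])
noise = rng.normal(size=(n, 20))
A_ordered = np.hstack([cyclic, noise])             # 20 cyclic genes, then 20 noise genes

scores = gene_spectral_scores(A_ordered, lam, V)
flags = (scores < np.median(scores)).astype(float) # hypothetical threshold, not Problem 3
A_gene_inferred = apply_gene_mask(A_ordered, flags)
```

In this example the flagged columns are set to zero while the cyclic genes pass through unchanged, mirroring the "nullifying" behavior described above.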
Filtering and enhancement
After inferring the set of informative genes (the genes related to the reconstructed signal), the next step is to remove from the expression of those genes any information that is not related to the signal. This can be achieved by removing the diagonal constraint from Problem 3 and replacing the matrix product with the Hadamard (elementwise) product. For every entry A_{i,j} of the expression matrix, this formulation introduces a corresponding optimization variable F_{i,j}. Therefore, the enhanced gene expression matrix contains only expression profiles that maximize the projection over the theoretical spectrum.
Problem 4
Signal enhancement:
Because this is not a convex optimization problem, we solve it using stochastic gradient ascent (adding noise at each iteration^{43}). The output of this algorithm is the gene expression matrix, after eliminating information that is unrelated to the signal of interest: A_{enhanced} = A_{gene-inferred} ⊙ F.
Another option, similar to that described for Problem 3, is to transform this problem into a minimization problem that filters the reconstructed signal, which turns it into a convex optimization problem (Supplementary Note A.16). Minimizing this objective eliminates the variance along the theoretical topology.
Problem 5
Signal filtering:
The output of this algorithm is the gene expression matrix, after eliminating the information that is related to the signal of interest: A_{filtered} = A_{ordered} ⊙ F.
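A minimal sketch of elementwise filtering is given below. The exact objective and constraints of Problem 5 are not reproduced here, so the form used, minimizing tr((A ⊙ F)^T C (A ⊙ F)) − γ ΣF over 0 ≤ F ≤ 1 with plain projected gradient descent, is an assumption made only to illustrate how a Hadamard mask can reduce variance along the theoretical topology. The function names, γ and the toy data are hypothetical.

```python
import numpy as np

def spectral_energy(X, C):
    """tr(X^T C X), equal to sum_i lam_i * ||X^T v_i||_2^2 for the eigenpairs of C."""
    return float(np.trace(X.T @ C @ X))

def filter_mask(A, C, gamma=1.0, n_iter=200):
    """Projected gradient descent on tr((A*F)^T C (A*F)) - gamma * sum(F), with 0 <= F <= 1."""
    F = np.ones_like(A)
    lam_max = np.linalg.eigvalsh(C)[-1]
    # Step size of 1 / (upper bound on the Lipschitz constant of the gradient).
    step = 1.0 / (2.0 * lam_max * np.max(A ** 2) + 1e-12)
    for _ in range(n_iter):
        grad = 2.0 * A * (C @ (A * F)) - gamma     # gradient with respect to F
        F = np.clip(F - step * grad, 0.0, 1.0)     # project back onto the box [0, 1]
    return F

rng = np.random.default_rng(2)
n = 60
idx = np.arange(n)
dist = np.abs(idx[:, None] - idx[None, :])
C = 0.9 ** np.minimum(dist, n - dist)              # cyclic covariance, alpha = 0.9 (assumed)

t = 2 * np.pi * idx / n
cyclic = np.cos(t[:, None] - rng.uniform(0, 2 * np.pi, 10)[None, :])
noise = 0.5 * rng.normal(size=(n, 10))
A_ordered = np.hstack([cyclic, noise])

F = filter_mask(A_ordered, C, gamma=1.0)
A_filtered = A_ordered * F                         # Hadamard product, as in A_ordered ⊙ F
print(spectral_energy(A_ordered, C), spectral_energy(A_filtered, C))
```

The penalty term rewards entries of F that stay close to 1, so without it the trivial solution F = 0 would remove all expression. This is why some form of regularization is needed in the filtering problem.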
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The scRNA-seq datasets used for this study were acquired from the Gene Expression Omnibus (GEO) database with the following accession numbers: HeLaS3 (GSM4224315), liver (GSE145197), Chlamydomonas (GSE157580), SCN (GSE117295) and hESC (GSE64016). The Slide-seqV2 dataset of the mouse hippocampus was generated in ref. ^{32} and downloaded using Squidpy^{44}. The pancreas datasets were generated in refs. ^{26,27,28,29} and downloaded using Scanpy^{45}.
Code availability
The code for scPrisma is publicly available at https://github.com/nitzanlab/scPrisma/
References
Wu, A. R. et al. Quantitative assessment of single-cell RNA-sequencing methods. Nat. Methods 11, 41–46 (2014).
Nitzan, M., Karaiskos, N., Friedman, N. & Rajewsky, N. Gene expression cartography. Nature 576, 132–137 (2019).
Jansen, C. et al. Building gene regulatory networks from scATAC-seq and scRNA-seq using linked self organizing maps. PLoS Comput. Biol. 15, e1006555 (2019).
Plass, M. et al. Cell type atlas and lineage tree of a whole complex animal by single-cell transcriptomics. Science 360, eaaq1723 (2018).
Forrow, A. & Schiebinger, G. LineageOT is a unified framework for lineage tracing and trajectory inference. Nat. Commun. 12, 4940 (2021).
Liu, Z. et al. Reconstructing cell cycle pseudo time-series via single-cell transcriptome data. Nat. Commun. 8, 22 (2017).
Barron, M. & Li, J. Identifying and removing the cell-cycle effect from single-cell RNA-sequencing data. Sci. Rep. 6, 33892 (2016).
Nitzan, M. & Brenner, M. P. Revealing lineage-related signals in single-cell gene expression using random matrix theory. Proc. Natl Acad. Sci. USA 118, e1913931118 (2021).
Liang, S., Wang, F., Han, J. & Chen, K. Latent periodic process inference from single-cell RNA-seq data. Nat. Commun. 11, 1441 (2020).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies and species. Nat. Biotechnol. 36, 411–420 (2018).
Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33, 495–502 (2015).
Rojo, O. & Rojo, H. Some results on symmetric circulant matrices and on symmetric centrosymmetric matrices. Linear Algebra Appl. 392, 211–233 (2004).
Schwabe, D., Formichetti, S., Junker, J. P., Falcke, M. & Rajewsky, N. The transcriptome dynamics of single cells during the cell cycle. Mol. Syst. Biol. 16, e9946 (2020).
Jammalamadaka, S. R. & Sengupta, A. Topics in Circular Statistics Vol. 5 (World Scientific, 2001).
Droin, C. et al. Space-time logic of liver gene expression at sub-lobular scale. Nat. Metab. 3, 43–58 (2021).
Ma, F., Salomé, P. A., Merchant, S. S. & Pellegrini, M. Single-cell RNA sequencing of batch Chlamydomonas cultures reveals heterogeneity in their diurnal cycle phase. Plant Cell 33, 1042–1057 (2021).
Strenkert, D. et al. Multiomics resolution of molecular events during a day in the life of Chlamydomonas. Proc. Natl Acad. Sci. USA 116, 2374–2383 (2019).
Ma, D. et al. Spatiotemporal single-cell analysis of gene expression in the mouse suprachiasmatic nucleus. Nat. Neurosci. 23, 456–467 (2020).
Caliński, T. & Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. 3, 1–27 (1974).
Moerman, T. et al. GRNBoost2 and Arboreto: efficient and scalable inference of gene regulatory networks. Bioinformatics 35, 2159–2161 (2019).
Pett, J. P., Kondoff, M., Bordyugov, G., Kramer, A. & Herzel, H. Coexisting feedback loops generate tissue-specific circadian rhythms. Life Sci. Alliance 1, e201800078 (2018).
Hu, H. et al. AnimalTFDB 3.0: a comprehensive resource for annotation and prediction of animal transcription factors. Nucleic Acids Res. 47, D33–D38 (2019).
Kim, Y. H. & Lazar, M. A. Transcriptional control of circadian rhythms and metabolism: a matter of time and space. Endocrine Rev. 41, 707–732 (2020).
Lee, Y. et al. Time-of-day specificity of anticancer drugs may be mediated by circadian regulation of the cell cycle. Sci. Adv. 7, eabd2645 (2021).
Efremova, M., Vento-Tormo, M., Teichmann, S. A. & Vento-Tormo, R. CellPhoneDB: inferring cell–cell communication from combined expression of multi-subunit ligand–receptor complexes. Nat. Protoc. 15, 1484–1506 (2020).
Segerstolpe, Å et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360 (2016).
Wang, Y. J. et al. Single-cell transcriptomics of the human endocrine pancreas. Diabetes 65, 3028–3038 (2016).
Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394 (2016).
Polański, K. et al. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics 36, 964–965 (2020).
Leng, N. et al. Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments. Nat. Methods 12, 947–950 (2015).
Stickels, R. R. et al. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nat. Biotechnol. 39, 313–319 (2021).
Santos, A., Wernersson, R. & Jensen, L. J. Cyclebase 3.0: a multi-organism database on cell-cycle regulation and phenotypes. Nucleic Acids Res. 43, D1140–D1144 (2015).
Qin, C. & Colwell, L. J. Power law tails in phylogenetic systems. Proc. Natl Acad. Sci. USA 115, 690–695 (2018).
Gray, R. M. Toeplitz and Circulant Matrices: A Review (Now, 2006).
Demidenko, E. Applications of symmetric circulant matrices to isotropic Markov chain models and electrical impedance tomography. Adv. Pure Math. 7, 188–198 (2017).
Grenander, U. & Szegö, G. Toeplitz Forms and Their Applications (Univ. California Press, 1958).
Trench, W. F. Spectral decomposition of Kac–Murdock–Szego matrices https://works.bepress.com/william_trench/133/ (2010).
Fogel, F., Jenatton, R., Bach, F. & d’Aspremont, A. Convex relaxations for permutation problems. In Proc. 26th International Conference on Neural Information Processing Systems (eds Burges, C. J. C. et al.) 1016–1024 (Curran Associates Inc., 2013).
Wang, F., Li, P. & Konig, A. C. Learning a bistochastic data similarity matrix. In Proc. 2010 IEEE International Conference on Data Mining 551–560 (IEEE, 2010).
Shamir, O. Convergence of stochastic gradient descent for PCA. In Proc. 33rd International Conference on Machine Learning (eds Balcan, M. F. & Weinberger, K. Q.) 257–265 (PMLR, 2016).
Zhang, L., Yang, T., Yi, J., Jin, R. & Zhou, Z.-H. Stochastic optimization for kernel PCA. In Proc. Thirtieth AAAI Conference on Artificial Intelligence 2316–2322 (AAAI Press, 2016).
Daneshmand, H., Kohler, J., Lucchi, A. & Hofmann, T. Escaping saddles with stochastic gradients. In Proc. 35th International Conference on Machine Learning (eds Dy, J. & Krause, A.) 1155–1164 (PMLR, 2018).
Palla, G. et al. Squidpy: a scalable framework for spatial omics analysis. Nat. Methods 19, 171–178 (2022).
Wolf, F. A., Angerer, P. & Theis, F. J. Scanpy: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Acknowledgements
We thank N. Moriel, Y. Constantini, Z. Piran, E. Memet and the rest of our group members, and R. Mintz, O. Karin, D. Shalev and I. Alon for meaningful discussions and feedback. We thank O. Mittelpunkt for assistance in the graphic design. This work was funded by the Center for Interdisciplinary Data Science Research at the Hebrew University of Jerusalem (J.K.), an Azrieli Foundation Early Career Faculty Fellowship, Israel Science Foundation Research Grant (1079/21) and the European Union (ERC, DecodeSC, 101040660) (M.N.). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.
Author information
Authors and Affiliations
Contributions
J.K. and M.N. conceived the study, designed the research and developed the framework; J.K. implemented the method and analyzed the data, with guidance from M.N.; Y.B. contributed to the theoretical analysis of linear signals; J.K. and M.N. wrote the paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks Ken Chen, Feng Bao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 scPrisma identifies and filters periodic signals in simulated data.
(A) PCA representation and gene expression covariance matrix before and after applying the reconstruction algorithm. This was done over a simulated gene expression matrix (100 cells, 500 genes, w=0.3) encoding a cyclic signal according to the spatial model (Supplementary Note A.2). The covariance matrix, after applying the reconstruction algorithm, is circulant. (B) Spearman correlation between the ground-truth permutation and the predicted permutation as a function of SNR over 300 simulations. A cyclic signal was simulated similarly to (A), Gaussian noise was added with varying variance, while the expression matrix was clipped to be positive. (C) AUC-ROC of the informative gene inference task as a function of SNR over 300 simulations, each consisting of a cyclic signal and a lineage signal (256 cells, 250 genes) that were concatenated to form a combined gene expression matrix (256 cells, 500 genes), supplemented by additive Gaussian noise. (D) Filtering and enhancement of cyclic signals in combined simulated data. The combined simulation consists of the sum of linear and cyclic signals (200 cells, 500 genes each), and a Gaussian noise matrix (variance = 0.1). (E) Boxplots comparing the filtering and enhancement algorithms of scPrisma with Cyclum over n=100 simulations of the combined signal as described in (D). For the box plots, the center line is the median, box limits are the 0.25 and 0.75 quantiles, vertical lines extend from the top of the box to indicate the maximum value, and from the bottom to indicate the minimum value.
Extended Data Fig. 2 Temporal filtering by scPrisma enhances cell type clustering of single-cell data.
Analysis based on single-cell data of suprachiasmatic nucleus neurons^{18}. (A) Before temporal filtering, clusters ‘1’ and ‘3’ mostly contain cells that were sampled in CT=14/18/22 while cluster ‘4’ mostly contains cells that were sampled in CT=06/10. (B) However, following filtering, the corresponding clusters (now labeled ‘0’ and ‘2’) are well mixed in terms of their temporal labels.
Extended Data Fig. 3 Enhancement and filtering by scPrisma of clustered structure in cellular populations.
(A–C) Enhancement of clustered structure in the liver lobule scRNA-seq data^{15}; (A) 2D PCA of raw (left) and cluster-based enhanced (right) data, highlighting the separation between samples that were collected during the night (ZT = 0, 18) and the day (ZT = 6, 12). (B) Mean (triangle) and variance (error bar) of Pck1 expression as a function of circadian time, for raw (left) and cluster-based enhanced (right) data. At each time point, the sample size is N=1000. (C) 2D PCA colored by Pck1 expression for raw (left) and cluster-based enhanced (right) data. (D,E) Filtering of clustered structure, leading to integration of pancreas scRNA-seq data collected from 4 different studies^{26,27,28,29}; UMAPs before and after data integration, colored by (D) batch and (E) cell type.
Extended Data Fig. 4 Enhancement by scPrisma of synchronized periodic processes.
Analysis based on SCN single-cell gene expression data^{18}. (A–C) 2D PCA of ependymal and endothelial cells, for (A) raw and (B,C) cyclic-enhanced data, colored by (A,B) circadian sample time (left) and by cell type (right), and by (C) Dbp (left) and Hsp90ab1 (right) expression.
Extended Data Fig. 5 Enhancement by scPrisma of unsynchronized periodic processes.
Analysis based on SCN single-cell gene expression data^{18}. (A,B) 2D PCA of ependymal cells and SCN neurons, for (A) raw and (B) cyclic-enhanced data, colored by circadian sample time (left) and by cell type (right). (C,D) 2D t-SNE of ependymal cells and SCN neurons, for cyclic-enhanced data, colored by circadian sample time (C; left), cell type (C; right), and Dbp expression (D). (E) Mean (triangle) and variance (error bar) of Dbp expression as a function of circadian time, which peaks at CT = 10 in ependymal cells (top), and peaks at CT = 2/6 in SCN neurons (bottom). The sample sizes at each time point (CT/N) for ependymal cells are: CT02/428, CT06/493, CT10/521, CT14/519, CT18/365 and CT22/480, and for SCN neurons they are: CT02/667, CT06/741, CT10/870, CT14/670, CT18/1375 and CT22/1029.
Extended Data Fig. 6 Spatial enhancement by scPrisma.
Analysis based on Slide-seqV2 mouse hippocampus data^{32} (subsampled to 29,250 cells). Raw (left) and spatially enhanced (right) gene expression of the top 3 spatially informative genes (based on Moran’s I score).
Extended Data Fig. 7 Runtime and memory consumption analysis of scPrisma algorithms.
Simulated cyclic signal of 2000 genes, varying number of cells, w = 0.3 and added Gaussian noise (variance = 0.1), using a single GPU (NVIDIA RTX A5000).
Supplementary information
Supplementary Information
Supplementary Notes A.1–A.16, Figs. 1–10, and References.
Supplementary Table 1
List of gene regulatory interactions related to the circadian rhythm signal inferred by GRNBoost2 after cyclic enhancement by scPrisma.
Supplementary Table 2
List of cell–cell interactions related to the circadian rhythm signal inferred by CellPhoneDB after cyclic enhancement by scPrisma.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Karin, J., Bornfeld, Y. & Nitzan, M. scPrisma infers, filters and enhances topological signals in single-cell data using spectral template matching. Nat Biotechnol (2023). https://doi.org/10.1038/s41587-023-01663-5
DOI: https://doi.org/10.1038/s41587-023-01663-5