Abstract
Single-cell mRNA sequencing, which permits whole transcriptional profiling of individual cells, has been widely applied to study growth and development of tissues and tumors. Resolving cell cycle for such groups of cells is significant, but may not be adequately achieved by commonly used approaches. Here we develop a traveling salesman problem and hidden Markov model-based computational method named reCAT, to recover cell cycle along time for unsynchronized single-cell transcriptome data. We independently test reCAT for accuracy and reliability using several data sets. We find that cell cycle genes cluster into two major waves of expression, which correspond to the two well-known checkpoints, G1 and G2. Moreover, we leverage reCAT to exhibit methylation variation along the recovered cell cycle. Thus, reCAT shows the potential to elucidate diverse profiles of cell cycle, as well as other cyclic or circadian processes (e.g., in liver), at single-cell resolution.
Introduction
Cell cycle studies, a long-standing research area in biology, are supported by transcriptome profiling with traditional technologies, such as qPCR^{1}, microarrays^{2}, and RNA-seq^{3}, which have been used to quantitate gene expression during cell cycle. However, these strategies either require a large number of synchronized cells (microarray and bulk RNA-seq) or do not observe the whole transcriptome (qPCR). Moreover, in the absence of elaborate and efficient cell cycle labeling methods, a high-resolution whole-transcriptome profile along an intact cell cycle remains unavailable.
Recently, single-cell RNA-sequencing (scRNA-seq) has become an efficient and reliable experimental technology for fast and low-cost transcriptome profiling at the single-cell level^{4, 5}. The technology is employed to efficiently extract mRNA molecules from single cells and amplify them to sufficient abundance for sequencing^{6}. Single-cell transcriptomes facilitate research to examine temporal, spatial and micro-scale variations of cells. This includes (1) exploring temporal progress of single cells and their relationship with cellular processes, for example, transcriptome profiling at different time phases after activation of dendritic cells^{7}, (2) characterizing spatial-functional associations at single-cell resolution, which is essential to understand tumors and complex tissues, such as space orientation of different brain cells^{8}, and (3) unraveling micro-scale differences among homogeneous cells, inferring, for example, axonal arborization and action potential amplitude of individual neurons^{9}.
One of the major challenges of scRNA-seq data analysis involves separating biological variations from high-level technical noise, and dissecting multiple intertwining factors contributing to biological variations. Among all these factors, determining cell cycle stages of single cells is critical and central to other analyses, such as determination of cell types and developmental stages, quantification of cell–cell difference, and stochasticity of gene expression^{10}. Related computational methods have been developed to analyze scRNA-seq data sets, including identifying oscillating genes and using them to order single cells for cell cycle (Oscope)^{11}, classifying single cells to specific cell cycle stages (Cyclone)^{12}, and scoring single cells in order to reconstruct a cell cycle time-series manually^{13}. In addition, several computational models have been proposed to reconstruct the time-series of differentiation processes, including principal curve analysis (SCUBA)^{14}, construction of minimum spanning trees (Monocle^{15} and TSCAN^{16}), nearest-neighbor graphs (Wanderlust^{17} and Wishbone^{18}) and diffusion maps (DPT)^{19}. In fact, even before scRNA-seq came into popular use, the reconstruction of cell cycle time-series was accomplished using, for example, a fluorescent reporter and DNA content signals (ERA)^{20}, and images of fixed cells (Cycler)^{21}. However, despite these efforts, accurate and robust methods to elucidate time-series of cell cycle transcriptome at single-cell resolution are still lacking.
Here we propose a computational method termed reCAT (recover cycle along time) to reconstruct cell cycle time-series using single-cell transcriptome data. reCAT can be used to analyze almost any kind of unsynchronized scRNA-seq data set to obtain a high-resolution cell cycle time-series. In the following, we first show that one marker gene is not sufficient to give reliable information about cell cycle stages in scRNA-seq data sets. Next, we give an overview of the design of reCAT, followed by an illustration of applying reCAT to a single-cell RNA-seq data set called mESC-SMARTer, and a demonstration of the robustness and accuracy of reCAT. At the end, we give detailed analyses of several applications of reCAT. All data sets used in this study are listed in Table 1.
Results
High variation of expression measures within cells
We found that the expression level of one marker gene was insufficient to reveal the cell cycle stage of a single cell as a result of high stochasticity of gene expression and heterogeneity of cell samples. Therefore, we propose to use a group of cell cycle marker genes, combined with proper computational models, to reconstruct pseudo cell cycles from scRNA-seq data with high accuracy.
Using a mouse embryonic stem cell (mESC) scRNA-seq data set developed by Buettner et al. (2015)^{22}, we showed that the expression of cell cycle marker genes has high stochasticity. The data set, termed mESC-SMARTer, consists of 232 eligible samples labeled according to cell cycle stages by Hoechst staining. We examined several high-confidence cell cycle marker genes, as shown in Fig. 1a. The cell cycle stages in which these genes have maximum mean relative expression levels are consistent with their existing records^{29}, but the distributions of expression levels between two cell cycle stages showed high overlap (Fig. 1a), indicating that a single marker gene is insufficient to determine the cell cycle stage for a single cell. In addition, we showed that mean gene expression levels, averaged over 20 cell samples, remain highly stochastic (Supplementary Fig. 1).
We further examined the consistency of cell cycle stages of maximum mean expression levels of cell cycle marker genes between different cell populations. We selected six single-cell transcriptome sample groups from different tissues and experimental conditions (Table 1), and performed four pairwise comparisons, showing the results in Fig. 1b. If the maximum mean expression levels of marker genes were consistent with their corresponding cell cycle stages, all counts would be located along the diagonal. In fact, however, many counts spread into off-diagonal entries, indicating relatively low consistency (Fig. 1b).
An overview of the reCAT approach
Given an scRNA-seq data set, reCAT reconstructs cell cycle time-series and predicts cell cycle stages along the time-series. The reconstructed time-series generally consists of multiple cell cycle phases (e.g., ≥10), each of which may contain one or multiple cells. Two fundamental assumptions underlie the cell cycle model: (1) different cell cycle phases form a cycle and (2) the transcriptome at a given cell cycle phase differs less from that of an adjacent phase than from that of a more distant phase. Hence, reCAT models the reconstruction of the time-series as a traveling salesman problem (TSP), which herein finds the shortest possible cycle by passing through each cell/cluster exactly once and returning to the start.
As shown in Fig. 1c, reCAT can be described as a process consisting of four steps. The first step is data processing, including quality control, normalization, and clustering of single cells using the Gaussian mixture model (GMM) according to a user-defined phase number k. We defined the distance between two clusters as the Euclidean distance between their means. In the second step, the order of the clusters was recovered by finding a traveling salesman cycle. Since TSP is a well-known NP-hard problem, we developed a novel and robust heuristic algorithm, termed consensus-TSP, to find the solution. For the third step, we designed two scoring methods, Bayes-scores and mean-scores, to discriminate among cycle stages (G0, G1, S, or G2/M). Finally, in the fourth step, we designed a hidden Markov model (HMM) based on these two scoring methods to segment the time-series into G0, G1, S and G2/M, and a Kalman smoother to estimate the underlying gene expression levels of the single-cell time-series (Methods).
An illustration of reCAT working principles
The mESC-SMARTer data set (Buettner et al. 2015) was used to illustrate the principles underlying the reCAT approach. Only the 378 cell cycle genes listed in Cyclebase^{31} were used in reCAT to build the expression matrix; other genes were excluded because they risk adding noise to the model. The samples were clustered into eight classes (k = 8), and the mean expression levels of these eight clusters were arranged into the optimal traveling salesman cycle. Fig. 2a displays all single cells and a cycle formed by eight cluster centers in a two-dimensional plot using principal component analysis (PCA), in which colors correspond to experimentally determined cell cycle stages. In Fig. 2b, we linearized the traveling salesman cycle into a pseudo time-series of eight phases and plotted the composition of single cells at each phase. The figure shows agreement between the predicted pseudo time-series and the experimentally determined cell cycle stage labels, thereby supporting the validity of the TSP model. In summary, both plots demonstrate a gradual and smooth transition of labeled single-cell components along the pseudo time-series. In the Supplementary Material, we showed that the expression trends of well-studied cell cycle marker genes (Supplementary Table 2) are coherent with the order of the clusters (Supplementary Fig. 2). Moreover, we converted the covariance matrices of each cluster into a vector (Methods) and computed a traveling salesman cycle using these cluster vectors. The generated time-series (Supplementary Fig. 3a) is also consistent with the above one (Fig. 2a), demonstrating that the traveling salesman cycle is inherent within the data.
Components of reCAT and their validation
At the center of reCAT is a novel heuristic algorithm, termed consensus-TSP (Methods), to solve TSP robustly. It should be noted that no known polynomial-time algorithm can solve the TSP for every case. Moreover, scRNA-seq data are highly noisy; even the optimal traveling salesman cycle may not represent the correct cell cycle order. To overcome these problems, we designed a two-step strategy. In the first step, consensus-TSP groups a set of n single cells into k clusters for various values of k ≤ n, and for each set of k clusters, it generates one TSP route using the arbitrary insertion algorithm^{32}. Then the second step of consensus-TSP integrates these routes to produce a consensus traveling salesman cycle (Supplementary Fig. 4, Supplementary Note 2).
Consensus-TSP was shown to outperform Oscope^{11}, the arbitrary insertion algorithm (Fig. 2c), and other well-known TSP algorithms (Supplementary Figs 5 and 6) according to the correlation-score, a Pearson correlation coefficient (PCC)-based scoring function that measures the agreement between a predicted pseudo time-series and experimentally determined cell cycle stage labels (Methods). In Fig. 2d, we demonstrated that consensus-TSP also outperformed current single-cell pseudo time reconstruction methods, including SCUBA^{14}, Monocle^{15}, TSCAN^{16}, Wanderlust^{17}/Wishbone^{18} and DPT^{19} (also in Supplementary Fig. 7 and Supplementary Note 4). The comparisons were based on the correlation-scores and change-index values (Methods). The latter index measures how frequently experimentally determined single-cell labels change along the time-series. Consensus-TSP is not only robust (Fig. 2c, Supplementary Note 4) but also scales up well for thousands of single cells (Supplementary Fig. 4f). We observed similar results using the cell cycle stage-labeled mouse embryonic stem cell Quartz-seq (mESC-Quartz) data set^{23} (the left panel of Fig. 3a) and the cell cycle stage-labeled human embryonic stem cell SMART-seq (hESC) data set^{11} (Supplementary Fig. 8a). Of course, the scoring methods of evaluation may have their own limitations. In addition, one point should be noted about data generation: if cells with the same cell cycle labels were processed and sequenced in the same batch, these cells can cluster together nicely because of batch effects, which leads to high scoring values even though cells within each cell cycle stage may not be properly ordered.
We designed two scoring methods, called ‘Bayes-scores’ and ‘mean-scores’, to discriminate among the cell cycle stages (Methods). The Bayes-score is a supervised learning method, which computes Naive Bayesian likelihood values using expression level comparisons of pre-selected gene pairs as input features. The model uses a training data set to determine a fixed number of informative gene pairs^{33}. This Naive Bayesian design is able to decrease the effect of stochasticity in scRNA-seq data (Supplementary Fig. 9, Supplementary Note 2). The mean-score is an unsupervised method, which computes the mean of log expression levels of a selected set of marker genes specific to each cell cycle stage. The values of these scores reveal membership of a cluster (or a cell) to a certain cell cycle stage.
We trained the Bayes-scores using the mESC-SMARTer data, and we tested both Bayes-scores and mean-scores on the mESC-Quartz, mESC-SMARTer (only mean-scores) and hESC data sets. The curves of these scores are shown in Fig. 3a, Supplementary Figs 7a and 8b,c, respectively. We observe clear cyclic variations of these curves along cell cycle. In practice, the Bayes-scores performed especially well in distinguishing G0/G1/S from G2/M. The peak for the G1/S mean-score values is usually near the start site of the S stage (Supplementary Figs 7a, 8b and 10), while the peak for the G2/M mean-score values is often near the late G2 stage. For each kind of mean-score, the values at the G0 stage are significantly lower than those at the other stages (Supplementary Note 3), which can be incorporated into the HMM to discriminate G0 from the other cell cycle stages.
Identification of cell cycle-related genes
The noise of gene expression measurements of single cells is high. Therefore, to better observe gene expression variation along the cell cycle time-series computed by reCAT, we designed a Kalman smoother to estimate the sequential expression levels for a gene (Methods). We employed two statistics, distance correlation (dCor)^{34} and K nearest neighbors (KNN)-mutual information (KNN-MI)^{35}, to test the significance of the associations between the sequential expression levels of a gene and the pseudo time-series, in order to identify cell cycle-related genes not listed in Cyclebase.
We applied the Kalman smoother to the multipotent progenitor cells from young mice (young-MPP) in the mouse hematopoietic stem cell SMART-seq (mHSC) data set (Table 1), which contains several groups of mouse hematopoietic stem cells, tested all genes and ranked them according to their significance scores (Supplementary Table 3, Supplementary Fig. 11). Afterwards, the sequential expression levels of the top five non-Cyclebase genes by dCor and by KNN-MI were plotted in Fig. 3b. Eight out of the ten genes were confirmed to be strongly related to cell cycle by published literature, although functions of the other two were not clearly recorded (Supplementary Table 4). For instance, Ncapd2 (non-SMC condensin I complex subunit D2), a protein-coding gene, has high expression levels at S and G2 stages (Fig. 3b). It belongs to a large protein complex involved in chromosome condensation, and it is annotated as a cell cycle-related gene by Gene Ontology^{36}. However, it was not included in Cyclebase.
Decomposing proportions of cell cycle stages for mHSCs
Leveraging Bayes-scores and mean-scores along the pseudo cell cycle time-series, reCAT applies an HMM to segment the time-series into cell cycle stages of G0, G1, S and G2/M (Methods, Supplementary Fig. 12). We applied reCAT to mHSC data, and at the G1 stage, results showed that young individuals had a higher proportion of long-term HSCs (LT-HSC), 41 out of 167 cells, when compared to old individuals with 10 out of 183 cells (Fig. 4a). This is an independent and quantitative confirmation of the original findings obtained using the staining approach.
High-resolution transcription atlas of cell cycle in mESCs
We next applied reCAT to the mESC samples, termed mESC-Cmp, which were cultured in serum, 2i and a2i media, respectively, for comparison (Kolodziejczyk et al. 2015)^{25}. Previously, Granovskaia et al.^{37} built a high-resolution transcription profile using synchronized budding yeast cells. Similarly, we obtained a high-resolution transcription atlas of the mitotic cell cycle in mESCs (Fig. 4b, Supplementary Fig. 13) from scRNA-seq data without synchronization through an in silico approach. Two adjacent cells on the recovered pseudo time-series have a time gap theoretically less than 5 min on average according to the doubling time of about 20 h, which shows a higher resolution than that produced by Granovskaia et al. for budding yeast. During the cell cycles, known cell cycle-related genes, arranged by their recorded peak time in Cyclebase (Supplementary Table 5), display two main types of expression waves (Fig. 4b, Supplementary Figs 2 and 13), which correspond to the two well-known checkpoints, G1 and G2. We can also observe decreased expression of cell cycle genes at the end of the cell cycle, which may be caused by degradation of mRNA molecules^{38}. We leveraged the decreased expression to estimate the doubling time of the 2i and serum samples and found it consistent with the values reported in the original paper (Supplementary Fig. 14).
Changes of stage proportions during differentiation
We examined scRNA-seq data of human myoblasts^{15}, developed by Trapnell et al. (2014) and termed hMyo, which consist of differentiating myoblasts sampled at the 0, 24, 48, and 72 h time points. We applied reCAT to reconstruct a pseudo cell cycle time-series for each of the four sample groups. Fig. 4c shows the proportions of different cell cycle stages estimated at each sampling time point using the HMM. A strong negative correlation is shown between differentiation progress and cell cycle activity, as a higher proportion of cells are found in cell cycle at the start of differentiation compared to later differentiation time points. The relatively low proportion of cells in cell cycle at the 72 h time point is also consistent with the reduced tendency of differentiated cells to divide, as previously documented (Fig. 4c, Supplementary Fig. 15). We obtained a similar result using the mouse distal lung epithelium SMART-seq data set^{26}, termed mDLM, which consists of four groups of cells sampled at four different developmental stages (Supplementary Fig. 16). In the absence of synchronization procedures during differentiation, each of the four cell groups contains slight internal heterogeneity, which suggests that reCAT is robust to this factor. Even in a cancer cell data set of human metastatic melanoma^{27}, termed hMel, with cancer cell heterogeneity in each sample group, reCAT clearly identified cell cycle status of single cells (Supplementary Fig. 17).
Recovery of methylation profile along cell cycle
Using a parallel single-cell genome-wide methylome and transcriptome sequencing data set^{28}, termed mESC-MT, we show that reCAT is able to recover a time-series epigenome along cell cycle via scRNA-seq data. The 61 mESCs were concurrently processed by both SMART-seq for scRNA-seq data and bisulfite sequencing (BS) for single-cell methylation data. We processed the scRNA-seq data first using reCAT to obtain the pseudo time-series (Supplementary Fig. 18) and associated the methylation data with the time-series. We scanned the whole-genome methylation levels along the cell cycle (Methods) and discovered that the methylation rate was higher at G1/S phase compared to other cell cycle stages (Fig. 4d). The observation agrees with and extends the conclusion by Brown et al. (2007)^{39}, but it contradicts the conclusion of Vandiver et al. (2015)^{40}. Furthermore, we calculated the mean methylation level for promoter regions of gene sets with peak gene expression levels in G1 and G2/M, respectively (Methods). The results imply that the methylation levels for promoter regions of the cell cycle genes vary along cell cycle (Fig. 4d).
Discussion
Aiming to obtain a high-resolution view of the transcriptomic change that occurs along cell cycle, we developed an scRNA-seq data analysis approach called reCAT. In basic cell cycle studies, reCAT can (1) recover transcriptome change without cell synchronization, which might otherwise alter the native processes, and (2) examine those cells in a developing population or tissue, e.g., during differentiation, that have entered G0 vs. those that continue to divide, thus linking transcriptional changes during development to cell cycle. Therefore, as a novel computational approach to reconstruct cycle along time for unsynchronized single-cell transcriptome data, reCAT is a promising tool with a number of merits. With higher quality and quantity^{41} of sequencing samples, more delicate time-series profiles can be modeled in general. Moreover, reCAT has the potential to observe various epigenomics^{42, 43} along cell cycle, leveraging parallel sequencing of RNA and DNA^{44}, which has been demonstrated in this work. Even further, the reCAT method can be used in research of other cyclic or circadian expression (e.g., in liver)^{45}.
reCAT could be refined in several ways. Instead of the pre-selected gene set (378 genes), we would prefer semi-supervised selection of cell cycle genes from the data, as this could lead to better performance in future analysis. The scoring metrics (i.e., Bayes-scores and mean-scores) to indicate cell cycle stages also need improvements to be less noisy and more informative. Additionally, in a given cell cycle, variation of cell cycle-related gene expression predominates over that of the corresponding differentiation. Accordingly, reCAT separates cell cycle analysis from differentiation, which may introduce some bias, but this, too, can be further improved by a combined model. Conversely, although some reported studies treated cell cycle as noise to be filtered out, cell cycle has considerable influence on the investigated biological processes, e.g., myogenesis and embryogenesis. Thus, a model is needed for considering multiple processes simultaneously.
Methods
Data set selection
Ten data sets were used for analysis (Table 1). Among them, four data sets have experimentally derived cell cycle stage labels: the mouse embryonic stem cell RNA-seq data (mESC-SMARTer), mESC-Quartz, hESC, and three cell lines, H9, MB and PC3, sequenced by qPCR. The hESC samples were labeled by fluorescent ubiquitination-based cell-cycle indicators (FUCCI)^{30}, while others were labeled by Hoechst staining.
The six unlabeled data sets include mHSC, mESC scRNA-seq samples from different culture conditions (mESC-Cmp), hMyo cells sampled at four different time points, mDLM cells sampled at four different time points, hMel scRNA-seq samples, and the mESCs processed by scRNA-seq and bisulfite sequencing in parallel (mESC-MT). The mHSC, mESC-Cmp and mESC-MT data sets consist of homogeneous cells within each group, while the hMyo, mDLM and hMel data sets were sampled from heterogeneous cells.
Quality control, normalization and preprocessing
We processed scRNA-seq data using the following procedure. For data with FPKM or TPM expression levels, we considered samples having more than 4000 genes with expression levels exceeding 2 as eligible. For data with counts for expression levels, we followed existing procedures^{22} for quality control. Then we deleted genes whose mean expression was excessively low, e.g., lower than 2 for mean TPM, in order to focus on informative genes. We used the normalization step developed in DESeq^{46} to obtain relative expression levels. After quality control and normalization, the expression levels of the 378 cell cycle genes, as defined in Cyclebase, were extracted for downstream analysis. Finally, all gene expression levels were transformed by log_{2}(Exp + 1) to prevent domination by highly expressed genes.
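The eligibility filter and the log transform described above can be sketched as follows. This is a minimal illustration only: the function names and the dictionary layout of `expr` are assumptions made for the example, and the DESeq normalization and gene-level filtering steps are not shown.

```python
import math

def eligible_samples(expr, min_genes=4000, min_level=2.0):
    """Return the cells having more than `min_genes` genes whose
    expression level exceeds `min_level` (FPKM/TPM-style input).
    `expr` maps a cell name to a list of expression levels (hypothetical layout)."""
    return [cell for cell, levels in expr.items()
            if sum(1 for v in levels if v > min_level) > min_genes]

def log_transform(levels):
    """Apply log2(Exp + 1) so that highly expressed genes do not dominate."""
    return [math.log2(v + 1.0) for v in levels]
```

In the toy check below, `min_genes` is lowered so that a five-gene example can be used; the thresholds in the text (4000 genes, level 2) are the defaults.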
For methylation data, the methylation status of a CpG site was considered a binary value in a single cell, unlike a rate in bulk BS. The binary value for single-cell BS data was determined by comparing methylated and unmethylated counts of a CpG site. We generated two results from methylation data of the mESC-MT data set in our analysis. The first result is the overall methylation level of the whole genome, which is the ratio of the number of methylated sites to the number of all measured sites. The second result is the mean methylation levels for promoter regions of two gene sets, which contain Cyclebase genes labeled with G1 and G2/M peak expression, respectively. A gene promoter region was defined as a ±3 kbp window centered on the transcriptional start site. After methylation levels were obtained, the curves of methylation levels along the pseudo time-series were drawn using an average smoother of nine points.
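A rough sketch of the methylation summary described above is given below. Several details are assumptions of this example rather than statements from the text: a site is called methylated when methylated reads outnumber unmethylated reads, ties and unmeasured sites are skipped, and the nine-point average uses a truncated window at both ends of the series. All function names are hypothetical.

```python
def call_site(meth_count, unmeth_count):
    """Binary call for one CpG site: 1 if methylated reads outnumber
    unmethylated reads, 0 if the reverse. Ties (including unmeasured
    sites with 0/0 reads) return None; this tie rule is an assumption."""
    if meth_count == unmeth_count:
        return None
    return 1 if meth_count > unmeth_count else 0

def methylation_level(sites):
    """Ratio of methylated sites to measured sites for one cell.
    `sites` is a list of (methylated_count, unmethylated_count) pairs."""
    calls = [call_site(m, u) for m, u in sites]
    calls = [c for c in calls if c is not None]
    return sum(calls) / len(calls) if calls else float("nan")

def moving_average(values, window=9):
    """Average smoother over `window` points along the pseudo time-series;
    the window is truncated at the two ends (an assumption)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

Applying `methylation_level` to each cell in pseudo time order and passing the resulting list to `moving_average` gives a smoothed curve of the kind described in the text.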
Definition of gene sets
We mainly use four gene sets correlated with cell cycle. (A) The first gene set was obtained from Cyclebase 3.0, which collected 378 genes from dozens of cell cycle-related papers. For genes in Cyclebase, expression peak time, significance and source organisms, for example, are documented. (B) The second set (Supplementary Table 1) consists of the 60 highest-ranked Cyclebase genes, with 20 having their maximum expression levels at each of three cell cycle stages (G1, S, and G2/M). (C) The third set (Supplementary Table 2) contains 15 high-confidence cell cycle-related genes selected according to published literature. (D) The fourth gene set (Supplementary Table 5) includes the 120 highest-ranked Cyclebase genes, with 20 having their maximum expression levels at each of six cell cycle stages (G1, G1/S, S, G2, G2/M and M).
Clustering method
Assume that we are given n single cells, each with an observed expression vector e_{i} = (e_{i1}, …, e_{im}) for m genes and i = 1, 2, …, n. Considering that the negative binomial distribution is widely used to model gene expression levels, we approximate the logarithm of the negative binomial distribution by a Gaussian distribution (log-normal). Thus, we used the GMM to model clusters of gene expression profiles of single cells. A GMM with k clusters can be described as:
\(p\left( {{\bf{e}}_i} \right) = \sum\nolimits_{r = 1}^k {\pi _r}\,{\cal N}\left( {{\bf{e}}_i|{\boldsymbol{\mu }}_r,{\bf{\Phi }}_r} \right)\)
where \({\cal N}\left( { \cdot |{\boldsymbol{\mu }},{\bf{\Phi }}} \right)\) denotes the Gaussian pdf with mean gene expression vector μ and covariance matrix Φ, and {π_{1}, …, π_{k}} are mixture weights satisfying \(\sum\nolimits_{r = 1}^k {\pi _r} = 1\) where 0 ≤ π_{r} ≤ 1, r ∈ {1, …, k}. The mixture model can be fitted by an expectation-maximization algorithm.
Modeling as a TSP
We cluster n single cells into K clusters through the GMM whose mean gene expression vectors are μ_{1}, …, μ_{K}, each representing a cell cycle phase. Using these K mean vectors, we construct an undirected weighted complete graph G, where nodes correspond to the K mean vectors, and the edges that connect every pair of nodes are weighted by the Euclidean distance between the two vectors. Our goal is to find a Hamiltonian cycle C_{K} in this graph such that every node appears in the cycle exactly once, and the total edge weight of the cycle is minimized. This describes the TSP, which is a classic NP-hard problem in the theory of algorithms.
In our case, the TSP is actually a Euclidean TSP because it satisfies three criteria: non-negative distances, symmetry of distances, and triangle inequality of distances. It should be noted that the Euclidean TSP is also an NP-hard problem, and no known polynomial-time algorithm can solve this problem for every case. We therefore designed a heuristic algorithm, called consensus-TSP, which is based on an arbitrary insertion algorithm, to solve the TSP^{32}. The arbitrary insertion algorithm is a randomized algorithm with O(n^{2}) running time for a graph with n nodes, and for the worst case, it gives a 2ln(n)-approximation. We chose this algorithm because it can produce a more robust solution than the greedy nearest neighbor algorithm.
Given the generated K clusters, there are two steps for the heuristic TSP algorithm. The first step is to compute traveling salesman cycles for different k (e.g., k = 7, 8, …, K), and the second step is to merge the cycles into a consensus cycle. In the first step, for each k, it takes the k clusters computed from the GMM as input, runs the arbitrary insertion algorithm n_{fold} · k times, and selects the shortest TSP cycle among these n_{fold} · k cycles. In the second step, it merges the K−6 shortest cycles generated in the first step into a consensus-TSP cycle (Supplementary Methods, Supplementary Fig. 4).
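The first step can be illustrated with a small sketch of arbitrary insertion with repeated random restarts. This is not the authors' implementation: `arbitrary_insertion`, `best_cycle`, `pts` and the default `n_fold=10` are names and values chosen for the example, and the second step (merging cycles for different k into a consensus cycle) is described in the Supplementary Methods and is not reproduced here.

```python
import math
import random

def cycle_length(cycle, pts):
    """Total Euclidean length of the closed tour visiting pts in `cycle` order."""
    return sum(math.dist(pts[cycle[i]], pts[cycle[(i + 1) % len(cycle)]])
               for i in range(len(cycle)))

def arbitrary_insertion(pts, rng):
    """One randomized run: take the nodes in random order and insert each
    one at the position that increases the tour length the least."""
    order = list(range(len(pts)))
    rng.shuffle(order)
    cycle = order[:1]
    for node in order[1:]:
        best_pos, best_cost = 0, float("inf")
        for i in range(len(cycle)):
            a, b = cycle[i], cycle[(i + 1) % len(cycle)]
            cost = (math.dist(pts[a], pts[node]) + math.dist(pts[node], pts[b])
                    - math.dist(pts[a], pts[b]))
            if cost < best_cost:
                best_pos, best_cost = i + 1, cost
        cycle.insert(best_pos, node)
    return cycle

def best_cycle(pts, n_fold=10, seed=0):
    """Run arbitrary insertion n_fold * k times (k = number of cluster means)
    and keep the shortest cycle found."""
    rng = random.Random(seed)
    runs = [arbitrary_insertion(pts, rng) for _ in range(n_fold * len(pts))]
    return min(runs, key=lambda c: cycle_length(c, pts))
```

Here `pts` stands for the list of cluster mean vectors; for example, `best_cycle([(0, 0), (1, 1), (1, 0), (0, 1)])` returns an ordering of the four corners of a unit square whose tour follows the perimeter.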
Time-series scoring metrics
The goal is to develop a quantitative measure of the accuracy of a computed TSP cycle C_{k} using known cell cycle stage labels. Our idea is to compute the PCC between C_{k} and experimentally determined cell cycle labels.
Let an n-dimensional vector \({\tilde{\bf l}} = ( {\tilde l}_1, \ldots ,{\tilde l}_n )\) denote the experimentally determined cell cycle labels for given n single cells, where \({\tilde l}_i \in \left\{ {1,2,3} \right\}\) with 1, 2, and 3 indicating the G0/G1, S, and G2/M cell cycle stages, respectively. If cells are labeled by other stages, e.g., G0 or M, the label numbers can be adjusted.
Then we transform the generated traveling salesman cycle C_{k} into an n-dimensional vector l as follows. Assume that C_{k} consists of a circle of k clusters, c_{1} − c_{2} − ⋯ − c_{k} − c_{1}. Without loss of generality, we cut the edge c_{k} − c_{1} to open the cycle and form a linear path, c_{1} − c_{2} − ⋯ − c_{k}, which represents a pseudo time-series with c_{1} and c_{k} as the start and the end of a cell cycle, respectively. We assign a sequential index j to every cell in the jth cluster: l_{i} = j if the ith single cell belongs to the jth cluster along the time-series. Thus we obtain a vector l = (l_{1}, …, l_{n}) where l_{i} ∈ {1, 2, …, k}. We then calculate the PCC between \({\tilde{\bf l}}\) and l to measure how well the linear path c_{1} − c_{2} − ⋯ − c_{k} fits with the experimental data.
Since C_{k} has k edges, it can be cut into k different linear paths: c_{1} − c_{2} − … − c_{k}, c_{2} − c_{3} − … − c_{k} − c_{1}, …, and c_{k} − c_{1} − … − c_{k−1}, and their k reverse paths: c_{k} − c_{k−1} − … − c_{1}, c_{1} − c_{k} − c_{k−1} − … − c_{2}, …, and c_{k−1} − c_{k−2} − … − c_{1} − c_{k}. For each of these 2k paths, we can compute a PCC score and select the maximum PCC score ρ to represent the correlation-score between the traveling salesman cycle C_{k} and the experimentally determined cell cycle labels \({\tilde{\bf l}}\).
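The search over the 2k linear paths can be written out as a short sketch. The function and argument names below are hypothetical, and returning 0 for a constant vector (where the PCC is undefined) is a choice made for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # PCC undefined for a constant vector; 0 is an assumption
    return sxy / math.sqrt(sxx * syy)

def correlation_score(cycle, cluster_of, labels):
    """Maximum PCC over the 2k linear paths obtained by cutting the cycle
    at each of its k edges, in both directions.
    cycle      : cluster ids in cycle order (c_1, ..., c_k)
    cluster_of : cluster id of each cell
    labels     : experimental stage label of each cell (e.g. 1, 2, 3)"""
    k = len(cycle)
    best = -1.0
    for direction in (cycle, cycle[::-1]):
        for start in range(k):
            path = direction[start:] + direction[:start]
            rank = {c: j + 1 for j, c in enumerate(path)}
            l = [rank[c] for c in cluster_of]
            best = max(best, pearson(l, labels))
    return best
```

For instance, a three-cluster cycle given as `[2, 0, 1]` with cells labeled in the order cluster 0 (G0/G1), cluster 1 (S), cluster 2 (G2/M) reaches a perfect score once the cycle is cut so that cluster 0 comes first.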
The second metric is called “change-index”, which measures how frequently the experimentally determined single-cell labels change along the time-series. Ideally, a perfect time-series would change labels twice, G1 to S and S to G2/M. Thus, we define the change-index as 1−(s_{c}−2)/(N−3), where N is the number of cells and s_{c} is the number of label changes between adjacent cells along the time-series. A perfect time-series would have a change-index value of 1, while the worst time-series, where s_{c} = N−1, would have a value of 0.
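As a worked illustration of the formula 1−(s_c−2)/(N−3), a minimal sketch follows; the function name is hypothetical and N is taken to be the number of cells in the ordered series.

```python
def change_index(labels):
    """Change-index 1 - (s_c - 2) / (N - 3) for stage labels listed in
    time-series order, where s_c counts label changes between adjacent cells."""
    n = len(labels)
    if n <= 3:
        raise ValueError("the change-index needs more than three cells")
    s_c = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return 1.0 - (s_c - 2) / (n - 3)
```

A series such as G1, G1, S, S, G2/M, G2/M changes label twice and scores 1, whereas a series whose label changes at every step (s_c = N−1) scores 0.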
Bayes-scores and mean-scores to assess cell cycle phases
Given a traveling salesman cycle C_{k} computed from single-cell data, we want to determine where the cell cycle stages are located. We designed two methods for this purpose: a supervised Naive Bayes model to compute the probability that a cluster belongs to each of three cell cycle stages, namely ‘G1’, ‘S’, and ‘G2/M’ (Bayes-scores), and an unsupervised method to compute the mean expression of a selected subset of cell cycle genes for each of six cell cycle stages, namely ‘G1’, ‘G1/S’, ‘S’, ‘G2’, ‘G2/M’, and ‘M’ (mean-scores) (Supplementary Methods). Thus Bayes-scores have three dimensions and mean-scores have six dimensions.
We used the cell cycle-labeled mESC-SMARTer data to train the Bayes-scores. Following the literature^{33}, we selected a set of informative gene pairs specific to each of the three cell cycle stages; the gene pairs selected for each stage were then unified into a set of N_{p} pairs (Supplementary Methods). Without loss of generality, we focus on the G1 stage and convert the expression of each cluster (or single cell) into a binary vector as follows. For the ith of the N_{p} pairs, i.e., gene a and gene b, we assign a value of −1 if their expression levels satisfy e_{a} < e_{b}, and 1 otherwise. Let the probability p_{i} be the fraction of G1 stage clusters with value 1 for the ith gene pair, and let 1 − p_{i} be the fraction with value −1. The Naive Bayes model can be expressed as follows: Let \({\bf{x}} = \left( {{x_1}, \ldots ,{x_{{N_{\rm{p}}}}}} \right)\) be the binary vector computed from the gene pairs for an unlabeled cluster. The posterior probability that x belongs to G1 can be expressed as
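The displayed equation is not present in this copy of the text. Under the definitions above (gene pairs treated as conditionally independent, p_{i} the fraction of G1 clusters with x_{i} = 1), the standard Naive Bayes posterior has the following form. The indicator-exponent notation is an assumption of this reconstruction and may differ from the authors' typesetting.

```latex
P(\mathrm{G1} \mid \mathbf{x})
  = \frac{P(\mathbf{x} \mid \mathrm{G1})\, P(\mathrm{G1})}{P(\mathbf{x})},
\qquad
P(\mathbf{x} \mid \mathrm{G1})
  = \prod_{i=1}^{N_{\rm p}} p_i^{\,\mathbb{1}[x_i = 1]}\,(1 - p_i)^{\,\mathbb{1}[x_i = -1]}
```

Because P(x) is the same for all three stages, comparing stages only requires the numerator P(x|G1)P(G1).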
Thus the Bayes-scores are log_{10}(P(x|G1)P(G1)), log_{10}(P(x|S)P(S)), and log_{10}(P(x|G2M)P(G2M)), respectively, with the prior P(G1) = P(S) = P(G2M). We also tested Lasso-Logistic regression (Supplementary Note 2, Supplementary Methods), but the Naive Bayes model performed better.
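A minimal sketch of how such a score could be computed for one stage is given below. The function names, the dictionary-based expression input, and in particular the pseudocount used to keep p_{i} away from 0 and 1 are assumptions of this example; the text does not say how zero fractions are handled.

```python
from math import log10

def binarize(expr, gene_pairs):
    """Binary vector x: -1 if e_a < e_b for the pair (a, b), and 1 otherwise."""
    return [-1 if expr[a] < expr[b] else 1 for a, b in gene_pairs]

def fit_stage(training_vectors, pseudo=1.0):
    """Estimate p_i, the fraction of training vectors of one stage with x_i = 1.

    The pseudocount is an assumption added here to avoid log10(0).
    """
    m = len(training_vectors)
    n_pairs = len(training_vectors[0])
    return [
        (sum(1 for v in training_vectors if v[i] == 1) + pseudo) / (m + 2 * pseudo)
        for i in range(n_pairs)
    ]

def bayes_score(x, p, prior):
    """log10(P(x | stage) P(stage)) under the naive independence assumption."""
    score = log10(prior)
    for xi, pi in zip(x, p):
        score += log10(pi if xi == 1 else 1 - pi)
    return score
```

Calling `bayes_score` once per stage, each with its own fitted `p` and a prior of 1/3, gives the three-dimensional Bayes-score vector of a cluster or cell.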
To determine the mean-scores of a cluster, which are based on the log_{2}(TPM + 1) values of cell cycle genes, we compute the mean expression of a selected subset of marker genes for each cell cycle stage. We selected six gene sets with recorded ‘Peaktime’ of ‘G1’, ‘G1/S’, ‘S’, ‘G2’, ‘G2/M’, and ‘M’ from the Cyclebase genes (378) and then computed the corresponding scores for each cluster (or single cell).
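The mean-score computation can be illustrated with the short sketch below. The gene names and stage sets in the usage note are placeholders, not Cyclebase content, and skipping marker genes that are absent from the expression profile is an assumption of the example.

```python
from math import log2

def mean_scores(tpm, stage_genes):
    """Mean log2(TPM + 1) of the marker genes of each stage for one cell or cluster.

    tpm         -- dict mapping gene name to TPM (hypothetical input layout)
    stage_genes -- dict mapping stage name to its list of marker genes
    """
    scores = {}
    for stage, genes in stage_genes.items():
        values = [log2(tpm[g] + 1) for g in genes if g in tpm]
        scores[stage] = sum(values) / len(values) if values else 0.0
    return scores
```

For instance, `mean_scores({"g1": 3.0, "g2": 0.0, "g3": 7.0}, {"G1": ["g1", "g2"], "S": ["g3"]})` averages log2(4) = 2 and log2(1) = 0 for the first set and takes log2(8) = 3 for the second.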
HMM for segmentation
Given a traveling salesman cycle of K clusters, we applied an HMM (Supplementary Fig. 12) to determine cell cycle stages. Let H = {G0,G1,S,G2/M} denote the set of hidden states (cell cycle stages) and A = (a_{ij})_{N × N} be the matrix of transition probabilities between the stages, where N = 4 denotes the number of stages. If no obvious sign indicates the existence of G0 cells, we only consider G1, S and G2/M. A state transition exists only from a cell cycle stage to itself or to the physiologically subsequent stage. Along the generated time-series, we characterize a cell i ∈ {1,2, … ,n} using a nine-dimensional scoring vector o_{i} = (o_{i1},o_{i2}, … ,o_{i9}), which comprises the three Bayes-scores and the six mean-scores and describes the membership of the cell in each cell cycle stage. Therefore, when a cell is at a stage h ∈ H, it emits a nine-dimensional scoring vector described by a multivariate Gaussian distribution \({\cal N}\left( {{\boldsymbol{\mu}_h},{{\bf{\Sigma }}_h}} \right)\).
Given this formulation, we first estimate the parameters Θ = (A,μ_{h},Σ_{h}) from the observed scores of cells O = (o_{1},o_{2}, … ,o_{n}) along the time-series using the Baum–Welch (BW) algorithm. To determine the starting point of the cell cycle, we tried each cell in the cycle as the starting point and selected the one with the highest likelihood of the observations. In the implementation of the BW algorithm^{47}, we applied a logarithmic transformation to the small intermediate probabilities to avoid underflow. We then apply the Viterbi algorithm to obtain the most likely stage assignment of the cells, thereby partitioning the time-series into cell cycle stages (Supplementary Methods).
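The decoding step can be sketched as a log-space Viterbi pass over a transition matrix that only allows staying in a stage or moving to the next one. This is an illustration under several assumptions that are not taken from the text: the parameters are supplied by hand rather than estimated with Baum–Welch, the covariance is taken to be diagonal, the last stage is absorbing, and all names are hypothetical.

```python
from math import log, pi, inf

STAGES = ["G1", "S", "G2/M"]

def left_to_right_log_transitions(stay=0.9):
    """Log transition matrix: stay in a stage or move to the next; last stage absorbing."""
    k = len(STAGES)
    log_a = [[-inf] * k for _ in range(k)]
    for i in range(k):
        if i + 1 < k:
            log_a[i][i] = log(stay)
            log_a[i][i + 1] = log(1 - stay)
        else:
            log_a[i][i] = 0.0
    return log_a

def log_gauss_diag(o, mu, var):
    """Log density of a Gaussian with diagonal covariance (a simplifying assumption)."""
    return sum(-0.5 * (log(2 * pi * v) + (x - m) ** 2 / v) for x, m, v in zip(o, mu, var))

def viterbi(obs, log_a, mus, variances, log_start):
    """Most likely stage sequence; everything stays in log space to avoid underflow."""
    k = len(log_a)
    delta = [log_start[h] + log_gauss_diag(obs[0], mus[h], variances[h]) for h in range(k)]
    back = []
    for o in obs[1:]:
        prev, ptr, delta = delta, [], []
        for h in range(k):
            best = max(range(k), key=lambda g: prev[g] + log_a[g][h])
            ptr.append(best)
            delta.append(prev[best] + log_a[best][h] + log_gauss_diag(o, mus[h], variances[h]))
        back.append(ptr)
    state = max(range(k), key=lambda h: delta[h])
    path = [state]
    for ptr in reversed(back):          # trace the best predecessors backwards
        state = ptr[state]
        path.append(state)
    return [STAGES[s] for s in reversed(path)]
```

Testing each cell as the starting point, as described above, would amount to rotating `obs` and keeping the rotation with the highest likelihood; that loop is omitted here.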
Kalman smoother and correlation detection
As scRNA-seq expression noise follows a negative binomial distribution^{48}, it can be regarded as approximately normally distributed after logarithmic transformation. Hence, the time-series expression of single cells can be modeled with a random walk plus noise (RWP) model, which is one of the simplest dynamic linear models. Each cell i has a time-series index t_{i} ∈ {1,2, … ,n}; hence, the cells can be arranged as (1,2, … ,T) with T = n. For a selected gene, cells have the observed expression e_{t} (t = 1,2, … ,T) and the real expression z_{t} (t = 1,2, … ,T) along the cell cycle time-series. Hence, the RWP model can be expressed as:
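The displayed model is not present in this copy of the text. The standard random walk plus noise (local level) model, written with the symbols e_{t} and z_{t} defined above, is given below; the variance symbols are assumptions of this reconstruction.

```latex
\begin{aligned}
e_t &= z_t + \varepsilon_t, & \varepsilon_t &\sim \mathcal{N}\left(0, \sigma_{\varepsilon}^{2}\right),\\
z_t &= z_{t-1} + \eta_t, & \eta_t &\sim \mathcal{N}\left(0, \sigma_{\eta}^{2}\right).
\end{aligned}
```

The first line is the observation equation (real expression plus zero-mean normal noise) and the second is the state equation (a first-order random walk between adjacent cells).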
In other words, two adjacent cells have a first-order Markov correlation along the time-series, and the observed expression is generated by adding normally distributed noise of zero mean to the real expression. In practice, we use the Kalman smoother equations, also known as the Rauch–Tung–Striebel equations (Rauch et al., 1965), to estimate the real expression \({\hat z_t}\).
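For the scalar random walk plus noise model, the forward Kalman filter and the backward Rauch–Tung–Striebel pass reduce to a few lines. The sketch below assumes the two noise variances `q` (state) and `r` (observation) are known and uses a diffuse initial variance; how the variances are actually estimated is not specified here, and the function name is hypothetical.

```python
def rts_smooth(e, q, r, z0=None, p0=1e6):
    """Smoothed estimates of z_t for e_t = z_t + noise, z_t = z_{t-1} + noise.

    e  -- observed expression along the time-series
    q  -- state (random walk) noise variance, assumed known
    r  -- observation noise variance, assumed known
    """
    n = len(e)
    z_pred, p_pred = [0.0] * n, [0.0] * n    # one-step-ahead predictions
    z_filt, p_filt = [0.0] * n, [0.0] * n    # filtered estimates
    z_prev = e[0] if z0 is None else z0
    p_prev = p0
    for t in range(n):                        # forward Kalman filter
        z_pred[t] = z_prev
        p_pred[t] = p_prev + q
        gain = p_pred[t] / (p_pred[t] + r)
        z_filt[t] = z_pred[t] + gain * (e[t] - z_pred[t])
        p_filt[t] = (1 - gain) * p_pred[t]
        z_prev, p_prev = z_filt[t], p_filt[t]
    z_smooth = z_filt[:]
    for t in range(n - 2, -1, -1):            # backward RTS pass
        g = p_filt[t] / p_pred[t + 1]
        z_smooth[t] = z_filt[t] + g * (z_smooth[t + 1] - z_pred[t + 1])
    return z_smooth
```

A small `q` relative to `r` yields a heavily smoothed curve, while a large `q` lets the estimate follow the observations closely.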
With the noise filtered out, we are able to determine whether the expression of a gene exhibits a time-series pattern along the cell cycle by correlating the estimated expression values \( {{{\hat z}_t}} \) with the time-series index t. Neither Pearson’s nor Spearman’s correlation coefficient is suitable here, because expression is non-monotonic along the time-series. Therefore, we adopted three statistical methods (dCor^{28}, KNN-MI^{29}, MIC^{49}) capable of detecting nonlinear relationships between two variables.
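To illustrate why a non-monotonic pattern calls for such measures, the sketch below computes the sample distance correlation (dCor) for two one-dimensional variables using the usual double-centered distance matrices. It is a plain O(n²) illustration with hypothetical function names, not the implementation used in reCAT, and the sample estimator is biased upward for small n.

```python
from math import sqrt, cos, pi

def _centered_distances(v):
    """Pairwise |v_i - v_j| matrix with row, column and grand means removed."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row = [sum(r) / n for r in d]            # matrix is symmetric: row mean = column mean
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def distance_correlation(x, y):
    """Sample distance correlation between two equally long 1-D sequences."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(a[i][j] ** 2 for i in range(n) for j in range(n)) / n ** 2
    dvar_y = sum(b[i][j] ** 2 for i in range(n) for j in range(n)) / n ** 2
    if dvar_x * dvar_y == 0:
        return 0.0
    return sqrt(max(dcov2, 0.0) / sqrt(dvar_x * dvar_y))

def pearson(x, y):
    """Pearson correlation coefficient, for comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / sqrt(sxx * syy)

# One full period of a cosine: periodic in t, yet linearly uncorrelated with t.
t = [i + 0.5 for i in range(40)]
z = [cos(2 * pi * ti / 40) for ti in t]
```

Here `pearson(t, z)` is zero up to rounding because the cosine is symmetric about the midpoint of the period, whereas `distance_correlation(t, z)` is clearly positive.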
Code availability
The open source implementation of reCAT in R is available on GitHub: https://github.com/tinglab/reCAT.
Data availability
No new data were generated in this study. All the data sets used can be found through the accession numbers provided in the original publications cited in Table 1.
References
Zhao, Y. et al. Dysregulation of cardiogenesis, cardiac conduction, and cell cycle in mice lacking miRNA-1-2. Cell 129, 303–317 (2007).
Spellman, P. T. et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9, 3273–3297 (1998).
Ly, T. et al. A proteomic chronology of gene expression through the cell cycle in human myeloid leukemia cells. Elife 3, e01630 (2014).
Wu, A. R. et al. Quantitative assessment of single-cell RNA-sequencing methods. Nat. Methods 11, 41–46 (2014).
Tang, F. et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 6, 377–382 (2009).
Kolodziejczyk, A. A., Kim, J. K., Svensson, V., Marioni, J. C. & Teichmann, S. A. The technology and biology of single-cell RNA sequencing. Mol. Cell 58, 610–620 (2015).
Shalek, A. K. et al. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature 510, 363–369 (2014).
Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33, 495–502 (2015).
Cadwell, C. R. et al. Electrophysiological, transcriptomic and morphologic profiling of single neurons using Patch-seq. Nat. Biotechnol. 34, 199–203 (2016).
Stegle, O., Teichmann, S. A. & Marioni, J. C. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 16, 133–145 (2015).
Leng, N. et al. Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments. Nat. Methods 12, 947–950 (2015).
Scialdone, A. et al. Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods 85, 54–61 (2015).
Kowalczyk, M. S. et al. Single-cell RNA-seq reveals changes in cell cycle and differentiation programs upon aging of hematopoietic stem cells. Genome Res. 25, 1860–1872 (2015).
Marco, E. et al. Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. Proc. Natl Acad. Sci. USA 111, E5643–E5650 (2014).
Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
Ji, Z. & Ji, H. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 44, e117 (2016).
Bendall, S. C. et al. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 157, 714–725 (2014).
Setty, M. et al. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat. Biotechnol. 34, 637–645 (2016).
Haghverdi, L., Büttner, M., Wolf, F. A., Buettner, F. & Theis, F. J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 13, 845–848 (2016).
Kafri, R. et al. Dynamics extracted from fixed cells reveal feedback linking cell growth to cell cycle. Nature 494, 480–483 (2013).
Gut, G., Tadmor, M. D., Pe’er, D., Pelkmans, L. & Liberali, P. Trajectories of cell-cycle progression from fixed cell populations. Nat. Methods 12, 951–954 (2015).
Buettner, F. et al. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat. Biotechnol. 33, 155–160 (2015).
Sasagawa, Y. et al. Quartz-Seq: a highly reproducible and sensitive single-cell RNA sequencing method, reveals non-genetic gene-expression heterogeneity. Genome Biol. 14, R31 (2013).
McDavid, A. et al. Modeling Bimodality improves characterization of cell cycle on gene expression in single cells. PLoS Comput. Biol. 10, e1003696 (2014).
Kolodziejczyk, A. A. et al. Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell 17, 471–485 (2015).
Treutlein, B. et al. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509, 371–378 (2014).
Tirosh, I. et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352, 189–196 (2016).
Angermueller, C. et al. Parallel single-cell sequencing links transcriptional and epigenetic heterogeneity. Nat. Methods 13, 229–232 (2016).
Whitfield, M. L. et al. Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol. Biol. Cell 13, 1977–2000 (2002).
Sakaue-Sawano, A. et al. Visualizing spatiotemporal dynamics of multicellular cell-cycle progression. Cell 132, 487–498 (2008).
Santos, A., Wernersson, R. & Jensen, L. J. Cyclebase 3.0: a multi-organism database on cell-cycle regulation and phenotypes. Nucleic Acids Res. 43, D1140–D1144 (2015).
Rosenkrantz, D. J., Stearns, R. E. & Lewis, P. M. II. An analysis of several heuristics for the traveling salesman problem. SIAM J. Comput. 6, 563–581 (1977).
Tan, A. C., Naiman, D. Q., Xu, L., Winslow, R. L. & Geman, D. Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics 21, 3896–3904 (2005).
Kosorok, M. R. On Brownian distance covariance and high dimensional data. Ann. Appl. Stat. 3, 1266–1269 (2009).
Kraskov, A., Stögbauer, H. & Grassberger, P. Estimating mutual information. Phys. Rev. E 69, 066138 (2004).
Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).
Granovskaia, M. V. et al. High-resolution transcription atlas of the mitotic cell cycle in budding yeast. Genome Biol. 11, R24 (2010).
Sharova, L. V. et al. Database for mRNA half-life of 19,977 genes obtained by DNA microarray analysis of pluripotent and differentiating mouse embryonic stem cells. DNA Res. 16, 45–58 (2009).
Brown, S. E., Fraga, M. F., Weaver, I. C., Berdasco, M. & Szyf, M. Variations in DNA methylation patterns during the cell cycle of HeLa cells. Epigenetics 2, 54–65 (2007).
Vandiver, A. R., Idrizi, A., Rizzardi, L., Feinberg, A. P. & Hansen, K. D. DNA methylation is stable during replication and cell cycle arrest. Sci. Rep. 5, 17911 (2015).
Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
Nagano, T. et al. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature 502, 59–64 (2013).
Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
Macaulay, I. C. et al. G&T-seq: parallel sequencing of single-cell genomes and transcriptomes. Nat. Methods 12, 519–522 (2015).
Zhang, R., Lahens, N. F., Ballance, H. I., Hughes, M. E. & Hogenesch, J. B. A circadian gene expression atlas in mammals: implications for biology and medicine. Proc. Natl Acad. Sci. USA 111, 16219–16224 (2014).
Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
Mann, T. P. Numerically stable hidden Markov model implementation. http://bozeman.genome.washington.edu/compbio/mbt599_2006/hmm_scaling_revised.pdf. (2006).
Grün, D., Kester, L. & van Oudenaarden, A. Validation of noise models for single-cell transcriptomics. Nat. Methods 11, 637–640 (2014).
Reshef, D. N. et al. Detecting novel associations in large data sets. Science 334, 1518–1524 (2011).
Acknowledgements
We thank Xuegong Zhang, Peter Kharchenko, Grace Xiao, Lin Wan and Jianyang Zeng for constructive criticism. We are grateful to Xiangyu Li, Kui Hua, Jun Li, Weilong Guo and Zhiyi Qin for fruitful discussion. We also thank Siqi Qu, Qiongye Dong, Aleksandra A. Kolodziejczyk and Florian Buettner for their technical support. This work was supported by the National Science Foundation of China [61673241, 61561146396]; National Basic Research Program of China [2012CB316504, 2012CB316503]; Hi-tech Research and Development Program of China [2012AA020401]; NSFC [61305066, 91010016, 91519326, 31361163004]; NIH/NHGRI [5U01HG00653103; 4R01HG006465]; and the Joint NSFC-ISF Research Program, jointly funded by the National Natural Science Foundation of China and the Israel Science Foundation.
Author information
Contributions
Z.L. conceived the main strategies and developed the method. Z.L., T.C. and M.Q.Z. designed the study. Z.L., H.L., K.X. and H.W. performed the analysis. Z.L., T.C., R.J., K.X., N.C. and O.M.A. wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, Z., Lou, H., Xie, K. et al. Reconstructing cell cycle pseudo time-series via single-cell transcriptome data. Nat Commun 8, 22 (2017). https://doi.org/10.1038/s41467-017-00039-z
DOI: https://doi.org/10.1038/s41467-017-00039-z