Reconstructing cell cycle pseudo time-series via single-cell transcriptome data

Single-cell mRNA sequencing, which permits whole transcriptional profiling of individual cells, has been widely applied to study growth and development of tissues and tumors. Resolving cell cycle for such groups of cells is significant, but may not be adequately achieved by commonly used approaches. Here we develop a traveling salesman problem and hidden Markov model-based computational method named reCAT, to recover cell cycle along time for unsynchronized single-cell transcriptome data. We independently test reCAT for accuracy and reliability using several data sets. We find that cell cycle genes cluster into two major waves of expression, which correspond to the two well-known checkpoints, G1 and G2. Moreover, we leverage reCAT to exhibit methylation variation along the recovered cell cycle. Thus, reCAT shows the potential to elucidate diverse profiles of cell cycle, as well as other cyclic or circadian processes (e.g., in liver), on single-cell resolution.


Supplementary Notes Supplementary Note 1: State-of-the-art in single-cell RNA-seq
Hundreds of thousands of single cells in multiple tissues, including embryonic tissues 5 , cancer tissue 6-8 , immune 9 , neuro 10 , and complex tissue 11 , have been assayed by scRNA-seq. It was also combined with simultaneous DNA sequencing 12 and methylation sequencing 13 . Several computational methods have been developed for data analysis, including normalization 14 , differential expression detection 15 , differentiation cascade construction 16 , removal of confounding factors 1 , oscillatory gene identification 17 , transcription dynamics modeling 18 and cell classification 19 . However, cell cycle is not taken into account in most singlecell differentiation studies 20 , even though cell cycle activities impact physiological function of cells in so many ways 21 .

Supplementary Note 2: Comparisons and selection of computational methods
Selection of TSP methods, and feature gene sets. Among TSP solutions generated by the arbitrary insertion algorithm, the cycle lengths show obvious variation when the cluster number is above 20. The lengths also have apparent negative correlation with the correlation-scores (Supplementary Figure 3c), proving the principle of our approach. It is theoretically impossible to solve a TSP when the cluster number is large, because it is an NP-hard problem. For the different existing heuristic methods 22 , when we tested the implementations in the R package 'TSP', nearest insertion show high accuracy and stability (Supplementary Figure 5).When clustering numbers are small, correlation-scores are generally high (Supplementary Figure 5). This could be attributed to greater noise reduction through clustering and closer proximity of the TSP solution to the global optimal for smaller . Therefore, besides reducing computational time for solving TSP, clustering improves accuracy of the generated time-series. For feature gene selection, both Cyclebase (378) and Buettner's (892) cell cycle gene sets gave the best results (Supplementary Figure 6). This shows that these known cell cycle genes are the most informative and that a proper number of cell cycle genes can yield more accurate time-series. Thus we chose the Cyclebase (378) gene set for reCAT.
Comparison between Bayes-scores and Lasso-Logistic based score. Using features generated by gene expression comparisons, we compared Naïve Bayes-based Bayes-scores with Lasso-Logistic regression scores (LLR, Supplementary Methods; Supplementary Figure 9). The LLR-based computes a probabilistic score for assigning a cell (or a cell group) to a specific cell cycle stage, but the scores show high variation. In contrast, the Naïve Bayes based method generates smoother scores. Therefore, we adopted the Bayesscores.

Supplementary Note 3: Complementary analysis
Additional notes for the scores and expression profiles along cell cycle. Bayes-scores performs well in distinguishing between G1 and G2/M stages. If G1 scores are higher than G2/M scores, the corresponding cells are likely to be in G0, G1, or S stages. On the other hand, if G2M scores are higher, the cells are very likely to be in G2/M stage. If there is a smooth crossover between G1 and G2/M, this crossover point is often at the end of the S stage. The highest point of G1/S scores is usually near the start of the S stage, and the highest point of G2/M often occurs before cell division 21 . G0 stage has similar Bayes-scores profiles as G1 stage, but the mean-scores are generally lower, with G1 scores a little higher than the other scores. Therefore, we design an HMM to incorporate these scores to determine each of the cell cycle stages (G0, G1, S and G2/M).

Discussion of experimental technologies to generate cell cycle stage labels.
Cell cycle stage labels generated from different experimental technologies can have different accuracies. Though the original paper, which revealed FUCCI, used Hoechst to measure the accuracy of FUCCI 23 , no literature reported accuracy comparisons between the two technologies. According to the principles of the two technologies, the resolution of FUCCI may be higher than Hoechst sometime. However, because the definition of cell cycle stage is based on DNA replication and division, we tended to believe Hoechst is more fundamental for cell cycle stage determination. Apparently, both of the technologies give incorrect cell labels, and the ones of the mESC-SMARTer data (Supplementary Figure 7a) may skew training of Bayes-scores in our analyses.

Supplementary Note 4: Further assessment for the pseudo time algorithm
We assessed the results using Bayes-scores and mean-scores, which were developed independently against the tested pseudo time-series construction algorithms. Using the cell cycle-labeled mESC-SMARTer dataset, we plotted mean-scores of the time-series produced by reCAT (Supplementary Figure 7a Since the pseudo time algorithm is based on TSP solving, people may care about its robustness. Therefore, we took the following tests to prove that. It is most straightforward to use labeled data sets. Therefore, for each 'K' on the horizontal axis, 200 trials were respectively implemented. For each trial, correlation-score was calculated to test the robustness, on the mESC-SMARTer data (Figure 2d), the hESC data (Supplementary Figure 8a) and mESC-Quartz data (Figure 3a).
In summary, the algorithm of reCAT is more reliable for cell cycle pseudo-time reconstruction, mainly because of its accuracy and robustness. The good results of reCAT can first be attributed to the fact that it is based on a circular model that brings in more prior information. Second, it merges many routes together to produce a robust result, similar to the Wanderlust approach 24 . Third, it is a nonlinear method able to fit nonlinear properties of data.

Supplementary Note 5: Further assessment for Bayes-scores and means-scores
As discussed at the beginning of the paper, the greatest challenge faced in delineating cell cycle stages is the high level of uncertainty of marker gene expression. Therefore, Bayes-scores and mean-scores were designed to overcome this uncertainty. The genes used by Bayes-scores and mean-scores are well known and with specific biological explanation 25 . Specifically, Bayes-scores is based on the Naïve Bayesian model that combines a well-behaved cell classification feature selection method 26 , which is about expression comparisons of thousands of gene pairs. Mean-scores is based on the means of expression levels of tens of marker genes at each cell cycle stage. Methods similar to mean-scores have been used as a reliable 21 .
We demonstrated that the Naïve Bayes model was more robust to noise than other computational methods such as the Logistic regression (Supplementary Note 2, Supplementary Figure 9). To evaluate the robustness of the Bayes-scores, we trained and tested the Bayes-score method on the labeled mESC-SMARTer dataset. Through ten-fold cross-validation, we found that 94.3% of the G1 cells had higher G1 Bayes-scores than the G2/M Bayes-scores, and that 92.9% of the G2/M cells had higher G2/M Bayes-scores than G1 Bayes-scores. The results demonstrate that the Bayes-scores are very accurate in discriminating between G1 and G2/M cells. We then applied the trained Bayes-score method to discriminate G1 and G2/M cells on the labeled hESC and mESC-Quartz dataset. The results showed that the error rates were 18.4% for the G1 cells and 36.8% for the G2/M cells. While all the G1 and G2/M cells in the mESC-Quartz were correctly discriminated. We did not test the model on the S stage because the S stage is a transitive stage, which can easily be confused with the G1 and G2/M stages.
For mean-scores, we compared G1/S scores with G2/M scores of the mESC-SMARTer data, and plotted these two scores on a 2-dimensional space in which a linear classifier (generated by SVM) achieved 88.9% accuracy to discriminate G1 and G2/M cells. We further showed the robustness of means-scores via comparison with housekeeping genes and random selected genes. For the different types of cells, compared with housekeeping genes and the random selected genes, the curves of G1/S and G2/M dimensions exhibit a clear phase gap along the time-series (Supplementary Figure 10).
Though the calculated values of the scoring methods have some noise and undesirable results, which might because of data qualities, gene selections and stochasticity, the HMM in reCAT can combine all Bayes-scores and mean-scores together to segment the pseudo time series into different cell cycle stages. Besides, the segmentation of HMM model has some robustness to noise and only sensitive to the variation trends of the values but not the values themselves.

Supplementary Methods
Information extraction from the covariance. Covariance matrices of each cluster were calculated as , and PCA (principle component analysis) was performed to extract information within each cluster. For PCA, = • • ′ where is the covariance of all samples, and is a symmetric matrix composed of eigenvalues of . can be estimated and columns of it are composed of homologous eigenvectors. For each covariance matrix corresponding to cluster , = ′ • • . The vector which represents can be obtained through combining the diagonal elements of , marked as .
Bayes-scores and mean-scores to assess cell cycle. Bayes-scores and mean-scores reveal the membership of a cell (cluster) for a certain cell cycle stage from two different aspects. Bayes-scores are based on comparison of the gene expression pairs in a cell and integrating the results of a certain number of comparison pairs. Mean-scores are based on comparing specific gene expressions in different samples and integrating the expression into one value by averaging.
We propose a Naïve Bayes model to calculate the likelihood that a cell (cluster) belongs to a specific cell cycle stage. The process is similar to classification of a cell via the Bayes decision rule; thus, we refer to the prediction scores obtained in this way as Bayes-score. Formally, let = { , … , } be the set of annotated cell cycle genes, with expression of denoted by . For each pair of genes and with < , we compare the expression level using a sign function, defined as There are two steps in the training phase: feature selection and likelihood estimation. We follow the literature 26  ( =̂) to be equal for different stages. Therefore, the Bayes-scores is defined as log (∏ (s =̂| = P = ̂) ). The training dataset is mESC-SMARTer.
Logistic regression compared with Bayes-score. Logistic regression is a classification method which can offer probabilistic measures. Because the dimension number of a feature vector is far larger than the sample number, Lasso is integrated into the logistic regression. Thus, the gene expression comparison result of a cell , i.e. = ( , … , P ), is the input, and the generated probability for a cell in a specific cycle stage ̂, which stands for the membership of the cell in that stage, is the output. Without loss of generality, we focus on the G1 stage; thus, the formula of the probability can be expressed as The parameters can be estimated by some optimization methods for ̆= ( ). mESC-SMARTer data were used to train for estimating the parameters.
Additional notes for HMM for segmentation. Procedures of parameter estimation and inference already exist 27 . Scaling is necessary in the implementation of the Baum-Welch re-estimation process. Otherwise, forward and backward probabilities are too small to store in a computer since the sequence is too long.
Before segmentation, we choose a group of initial values for the Baum-Welch algorithm. Because of the various data types, limited cell samples, high level of noise and high demand for accuracy, we choose a confidence interval of 4-10 samples for each stage by direct observation (Supplementary Note 3). Then we use maximum likelihood estimation (MLE) to estimate mean and variance of each dimension of the scores. The obtained means and variances are the required initial values.