Identifying and removing the cell-cycle effect from single-cell RNA-Sequencing data

Single-cell RNA-Sequencing (scRNA-Seq) is a revolutionary technique for discovering and describing cell types in heterogeneous tissues, yet its measurement of expression often suffers from large systematic bias. A major source of this bias is the cell cycle, which introduces large within-cell-type heterogeneity that can obscure the differences in expression between cell types. The current method for removing the cell-cycle effect is unable to effectively identify this effect and has a high risk of removing other biological components of interest, compromising downstream analysis. We present ccRemover, a new method that reliably identifies the cell-cycle effect and removes it. ccRemover preserves other biological signals of interest in the data and thus can serve as an important pre-processing step for many scRNA-Seq data analyses. The effectiveness of ccRemover is demonstrated using simulation data and three real scRNA-Seq datasets, where it boosts the performance of existing clustering algorithms in distinguishing between cell types.

remove the cell-cycle effect by simply excluding these cell-cycle genes from the analysis is not a viable strategy. This is because the cell cycle also affects the expression level of many genes which are thought to be unrelated to the cell cycle 12 , although usually to a lesser extent compared to the cell-cycle genes. For example, when considering a set of over 6,500 genes not previously associated with the cell cycle, Buettner et al. 12 found that 44% of the genes showed significant correlation with at least one cell-cycle gene.
The scLVM (single-cell latent variable model) algorithm first proposed the idea of estimating the cell-cycle effect and then removing this effect from scRNA-Seq data 12 . All genes are retained after applying scLVM, but the effect of the cell cycle will be removed from their expression levels. scLVM uses only the cell-cycle genes to identify the cell-cycle effect. It develops a sophisticated Bayesian latent variable model to reconstruct hidden factors in the expression profile of the cell-cycle genes. It declares that the leading K(K ≥ 1) factors are the cell-cycle effect and removes them from the whole dataset. No formal statistical methods have been proposed to choose K with the authors recommending using either the default value K = 1, or relying on a scree plot of the variance captured by each latent factor and using the elbow point, similar to choosing the number of significant components in a PCA. scLVM has shown its ability in removing the cell-cycle effect from real scRNA-Seq datasets, which are the first real data example we will show in the Results section of the main text and the real data example in the Supplementary Information Section S1. To date, scLVM is still the only available method for removing the cell-cycle effect. The key assumption that scLVM makes is that all the main effects in the expression of cell-cycle genes are cell-cycle effects. However, we have realized that this may not hold, making the application of scLVM hazardous. Cells of different cell types, even if they are in the same time point of their cell cycle, should have different expressions even on cell-cycle genes. In other words, the expression of cell-cycle genes is also influenced by the cell type. We call the expression change caused by the cell type "the cell-type effect". Likewise, there can be effects from experimental condition, disease state, etc. There is no guarantee that the cell-cycle effect is stronger than all other effects, even on the cell-cycle genes. Indeed even if the cell-cycle effect is the strongest, it is not likely that all its components (generally, more than one latent factor is needed to describe the cell-cycle effect) are stronger than the components of other effects. In other words, some of the leading K factors of the gene expression profile of the cell-cycle genes may not be generated by the cell cycle and instead may originate from biological features of interest such as differences in cell type. Removing all the leading K factors will remove these signals of interest from the data, compromising the downstream analysis of the data, such as clustering analysis for cell-type discovery, defeating the purpose of a scRNA-Seq experiment. For clearer illustration, we show four cases in Table 1 as examples. In case 1, the first leading factor in the cell-cycle genes represents the cell-cycle effect; scLVM will work when K = 1 is used. In case 2, the top two leading factors in the cell-cycle genes both represent the cell-cycle effect; scLVM will work when K = 2 is used, although the cell-cycle effect will not be removed completely when the default value K = 1 is used and other effect(s) may be removed along with the cell-cycle effect if K > 2 is used. In case 3, the first leading factor represents another effect of interest; scLVM will remove this effect no matter what K value is used, meaning that scLVM will always fail. In case 4, the first and third leading factors Here the data is clustered into six groups corresponding to the combinations of cell type and cell-cycle status. (b) scLVM corrected data (one latent factor removed). The data clusters into three groups corresponding to cell-cycle status. (c) scLVM corrected data (three latent factors removed). No distinct clusters are observed. (d) ccRemover corrected data. The data splits into two groups corresponding to the cell types.
Scientific RepoRts | 6:33892 | DOI: 10.1038/srep33892 represent the cell-cycle effect; scLVM will not remove the cell-cycle effect completely (when K < 3) and/or remove other effects as well (when K ≥ 2).
A better method should include a mechanism to check each factor and make a judgement as to whether the factor represents the cell-cycle effect. We propose a method called cell-cycle remover (ccRemover) that effectively identifies the components of the cell-cycle effect from scRNA-Seq data. It then removes them from the data while preserving the other components of the data. ccRemover identifies the cell-cycle effect using the expression profiles of all genes. For simplicity, we call genes that are not annotated as cell-cycle genes "control genes". The assumption that ccRemover makes is that the cell-cycle effect is stronger on average in the cell-cycle genes than control genes. ccRemover carries out a simple PCA on the expression profiles of control genes to capture the sources of variation/effects, represented by the loadings of the principal components. It then projects the expression of all genes on these loadings to get the component scores for each gene. The magnitude of the component scores measures the strength of the effects on the genes. For each effect (principal component), ccRemover compares the average magnitude of the component scores of the cell-cycle genes with the average magnitude of the component scores of control genes. It declares all effects whose average magnitude is larger on the cell-cycle genes than on the control genes as cell-cycle effects. A formal bootstrap-based statistical test is developed for this comparison. Then all effects declared as the cell-cycle effect are removed from the whole dataset by subtracting the projections of gene expression profiles on these effects. This identification and removal process is repeated until no more principal components are identified as the cell-cycle effect. For details of the algorithm, refer to the Methods section. ccRemover is implemented as an R package 36 and is available at http://www.nd.edu/~jli9/ccRemover.zip for review proposes. Once the paper is accepted, it will be made publicly available via Bioconductor 37 .
The performance of ccRemover is demonstrated using a simulated dataset and three real scRNA-Seq datasets, where ccRemover is able to successfully remove the effects of the cell cycle from the data while preserving the other components of the data. We show that ccRemover can aid in the identification of subpopulations of cells, improve the clustering analysis of single cells and performs favorably compared to scLVM.

Results
For the simulated data and each of the real scRNA-Seq datasets the analysis follows the same process. Firstly we apply scLVM and ccRemover to the original dataset, which we call "the original data". scLVM is applied using the python script available from the scLVM GitHub page. This gives us a scLVM corrected dataset and a ccRemover corrected dataset, which we refer to as the "scLVM corrected data" and the "ccRemover corrected data" respectively. Once we have acquired the three datasets the same clustering algorithms and statistical tests are applied to each of them allowing us to compare the performance of the methods.
We use the same set of cell-cycle genes when applying scLVM and ccRemover. The lists of cell-cycle genes are acquired by combining two sources. Firstly Biomart was used to download lists of genes that were annotated to the cell cycle 38 . In addition two R packages were used to retrieve gene annotation data from GO term 39 , and these were org.Mm.eg.db 40 and org.Hs.eg.db 41 for annotations for mouse and human respectively. For the choice of K (the number of leading factors to be removed) in scLVM, we try both the default value K = 1 and the value given by the scree plot.
Simulation Data. We simulate data matrix X that contains measurements for 50 cells and 2,000 genes, of these genes 400 are assigned as cell-cycle genes. The cells are randomly assigned to the two classes (cell types) and three cell-cycle stages. Suppose cell j is assigned to class t j and cell-cycle stage s j , t j ∈ {1, 2} and s j ∈ {1, 2, 3}. We simulate X ij , the expression of gene i in cell j by where Y is the cell-type effect, Z is the cell-cycle effect, and W is random noise. The cell-type effect is generated for control genes, and the random noise is generated by W ij ∼ N(0, 1). In Fig. 1, the data is plotted on the first two principal components with the shape of the points corresponding to their cell type and the color corresponding to their cell-cycle stage. Ideally, the data should be separated only by shape and not color, that is, by cell type and not cell-cycle stage. However, on the original data ( Fig. 1a), the cells are clustered into six different groups corresponding to the cell type and cell-cycle stage combinations demonstrating how the cell cycle can confound the analysis of scRNA-Seq data.
scLVM removes the first leading factor (K = 1, default choice) or the first three leading factors (K = 3, suggested by the scree plot). Figure 1b shows the results when the first leading factor is removed, where the cells are clustered into three groups according to the cell-cycle stage, and cells from different cell types are completely indistinguishable. scLVM has failed completely here by mistakenly removing the cell-type effect instead of the cell-cycle effect. Figure 1c shows the results when all three leading factors are removed. The cells exhibit no clear clusters, indicating that scLVM has removed both the cell-cycle effect and the cell-type effect. The data has effectively been rendered useless as it now contains just noise. Figure 1d shows the results of correcting the data using ccRemover, where the cells are well separated by the cell type and within each cluster cells with different cell-cycle stages are completely mixed. This means that the cell-cycle effect has been thoroughly removed, while the cell-type information has been preserved. ccRemover is able to correctly identify the second and third principal components as cell-cycle effects and removes them.
In our simulation study above, we made two simplifications in the data simulation. First, we simulated Gaussian data directly instead of simulating the raw count data, normalizing the counts by the sequencing depth, and then log-transforming them. Second, we simulate the cell cycle as three discrete stages. In reality, the cell-cycle status is more like a continuous variable, as even if two cells are in the same stage, they may still differ in how far they have progressed through that stage.
In our simulation, the leading latent factor of the expression profile of cell-cycle genes is not the cell-cycle effect. This corresponds to case 3 in Table 1, and scLVM fails as expected. Changing our simulation parameters can make the data represent other cases in Table 1, on many of which we should not expect such distinct results between scLVM and ccRemover. We will show a wider range of comparisons using real datasets.
Real dataset 1: T helper cell data. The first real dataset is the differentiating T-helper (T H ) cell dataset that was used to display the ability of scLVM to help reveal hidden subpopulations of cells by Buettner et al. 12 . We will demonstrate that ccRemover also has this ability, and improves on the performance of scLVM. The dataset was generated by Mahata et al. 42 to study the differentiation of T H cells and the steroids they synthesize to contribute to immune homeostasis. The data was created by polarizing naive T H cells in vitro towards a T H 2 subtype, leading to a population in which there are cells differentiating towards the T H 2 subtype and cells which are not. The objective for this dataset is to identify biologically meaningful clusters of cells. The original dataset was downloaded from the supplementary materials of Buettner et al. 12 and contains normalized and log transformed expression measurements for 81 cells and 7,073 genes, of which 532 were identified as cell-cycle genes. For this dataset, the scLVM corrected data along with cluster assignments for the corrected data are also available from the same source and were used to evaluate the performance of scLVM. When ccRemover is applied to the original data it identifies the first principal component to be a cell-cycle effect on the first iteration. Once this effect is removed from the data no other features are deemed to be cell-cycle related.
Both scLVM and ccRemover remove the cell-cycle effect efficiently on this data. To check this, in Fig. 2, we plot the density of the expression level of cell-cycle genes selected from the top ranked genes on Cyclebase 43 . On the original data (red lines), many genes display a bimodal density commonly seen in scRNA-Seq data indicating the on-off action of genes, in this case, controlled by the cell cycle 18,[44][45][46] . On the scLVM (green lines) or ccRemover (blue lines) corrected data, the bimodality of the densities largely disappears and most genes display a unimodal distribution indicating that the cell-cycle effect has been reduced or removed completely for these genes.
To determine if biologically meaningful clusters can be discovered from the data we avail of a criterion for measuring performance used by Buettner et al. during their analysis. There is a list of 122 known T H 2 signature genes curated by Buettner et al. If the cells are clustered into two clusters and genes that are differentially expressed between these two clusters are identified, these T H 2 signature genes should be over-represented in the set of differentially expressed genes if different clusters represent physiologically distinct subpopulations of cells. This over-representation can be summarized by an odds ratio of the percentage of T H 2 signature genes in the set of differentially expressed genes to that in all genes. A large odds ratio is favored.
To implement this criterion, we applied 2-means clustering and use a t-test with false discovery rate 0.01 to identify differentially expressed genes. Then the odds ratio, the 95% confidence interval of the odds ratio, and the p-value of the hypothesis of odds ratio > 1 were calculated by a hypergeometric test. The results are shown in Table 2. On the original data, the odds ratio is less than 1, indicating that the clustering of cells is unlikely to be physiologically meaningful. The true substructure of the data is completely obscured, and this could be due to the confounding effects of the cell cycle.
On the scLVM corrected data, the odds ratio is 2.382, with the lower confidence interval bound of 1.518 and p-value 1.542 × 10 −4 . On the ccRemover corrected data, the odds ratio is 3.439, with the lower confidence interval The densities are displayed for the original (red), scLVM corrected (green) and ccRemover corrected (blue) data. The genes were selected from among the top ranked genes on Cyclebase. The original data displays bimodal densities which are common in scRNA-Seq data indicating genes whose expression switches on and off. When the cell-cycle effect is removed using ccRemover or scLVM these bimodal densities disappear.
Scientific RepoRts | 6:33892 | DOI: 10.1038/srep33892 bound 2.297 and p-value 2.373 × 10 −9 . This indicates that both scLVM and ccRemover are able to remove the cell-cycle effect from the data so that the true substructure of the data can be revealed, and ccRemover removes the cell-cycle effect more thoroughly and/or keeps other biological features more intact compared to scLVM.
We also applied ccRemover to another dataset used by Buettner et al. 12 , the mouse embryonic stem cells (mESC) dataset. This dataset contains cells from the same cell type, but the cell-cycle stage of each cell is known a priori by the Hoechst staining. By applying two criteria proposed by Buettner et al. 12 , we find both scLVM and ccRemover are very successful in removing the cell-cycle effect from this dataset, and ccRemover outperforms scLVM slightly. Our full analysis of this dataset is presented in the Supplementary Information (Supplementary Section S1).
Real dataset 2: human glioblastomas data. This dataset contains cells from five human glioblastomas 47 .
It was created by Patel et al. by isolating individual cells from freshly resected and dissociated IDH1/2 wild-type primary human glioblastomas, MGH26, MGH28, MGH29, MGH30 and MGH31. This dataset has log transformed and centered TPM (Transcripts Per Million) measurements for 5,948 genes and 430 single cells with each tumor represented by 70 to 118 cells. Of the 5,948 genes 412 were identified as cell-cycle genes. It has been shown that the level of cell-cycle activity within this dataset is very low, with an average of only 8% of the cells per tumor showing cell-cycle activity 47 . For this dataset the objective is to cluster the cells by their tumor of origin.
When scLVM was applied, the scree plot suggests removing the first leading hidden factor, agreeing with its default choice. When ccRemover was applied to this dataset the 5 th , 6 th and 9 th components were identified as cell-cycle effects and removed on the first iteration. On the second iteration the 10 th component was identified as a cell-cycle effect. Once this effect was removed from the data there were no more cell-cycle effects detected.
Hierarchical clustering was applied to the (original, scLVM corrected, and ccRemover corrected) data, splitting the cells into five clusters, with each cluster being assigned the class of the majority of the cells contained within the cluster. The results are shown in Fig. 3. On the original data, 87.44% of the cells were clustered correctly. From the plot of the dendrogram (Fig. 3a) it is clear that the MGH31 (red) cluster contains cells from all the other tumors that have been incorrectly classified, the MGH28 (purple) and MGH30 (blue) clusters also display significant impurities. On the scLVM corrected data, 90.00% of the cells were classified correctly, an improvement of over 2.5% from the original data. On the ccRemover corrected data, 92.32% of the cells were classified correctly, an increase of nearly 5% from the original data. The purity of the clusters in the dendrogram (Fig. 3c) for the ccRemover corrected data show marked improvement over the original data, and especially the MGH28 and MGH31 clusters show convincing improvements in purity. This result is particularly striking when considering the very low levels of cell-cycle activity within this dataset and demonstrates that ccRemover can improve the downstream analysis of scRNA-Seq data even when the cell-cycle effect is not very strong.   The TPM values for 57,820 genes are available for each of the 176 cells. Prior to analysis any genes which had zero expression for over two thirds of the cells were removed from the data, leaving 10,977 genes of which 757 were annotated to the cell cycle. The data was transformed to a log-scale by adding 1 to each of the measurements and taking the natural log.
The scree plot from scLVM suggests removing the first leading hidden factor, agreeing with its default choice. Instead, ccRemover suggests removing six principal components in its four iterations, and interestingly, these six components do not include the very first principal component.
When using 3-means clustering on the original data, the three clusters represent the three cell types perfectly, and thus there is no room for improvement. Instead, we consider using 4-means clustering, in order to see whether the 77 cells of the first cell type can be clustered accordingly to the two sets of technical replicates, LC.PT and LC.PT_RE. Figure 4 shows the results. On both the original data (Fig. 4a) and the scLVM corrected data (Fig. 4b), the LC.PT and LC.PT_RE cells are split into two clusters (clusters 3 and 4) each containing roughly equal proportions of cells from each set, indicating that the technical replicates are non-separable. On the ccRemover corrected data (Fig. 4c), on the other hand, the majority (80%) of cluster 3 are cells from the LC.PT_RE group, while the majority (89%) of cluster 4 are cells from the LC.PT group. This means that cells from different sets of technical replicates are largely separated by the batch effect. This batch effect is present in all three of the original and corrected datasets, but it has a noticeable influence in the clustering results only on ccRemover corrected data. The reason could be that the batch effect is confounded by the stronger cell-cycle effect in the original data, and it stands out when the cell-cycle effect was removed by ccRemover. scLVM may have not removed the cell-cycle effect thoroughly enough to make a difference.
Further analysis was carried out to determine if this is the case. Figure 5 displays heat maps of the expression of the top ranked cell-cycle genes from Cyclebase 43 . The cell-cycle genes displayed in the heat map are ordered based on the time point of the cell cycle at which their expression peaks. If the cell-cycle effect exists, there should be blocks of similar expression levels, and these blocks should not occupy from the first row to the last row as the genes do not achieve their peak expressions at the same time point of the cell cycle. On the original data (Fig. 5a), there are clear such blocks, and the most prominent one is shown in a blue box. For the scLVM corrected data the blocks are less apparent but still present (Fig. 5b), indicating that the cell-cycle effect has been removed partially. For the ccRemover corrected data (Fig. 5c), there are no easily visible blocks left indicating that ccRemover has effectively removed the cell-cycle effect from this dataset. For both the scLVM and ccRemover corrected data the range of expression for the cell-cycle genes is reduced and so the heat map colors show less variation.
This example shows a feature of ccRemover: while it quite thoroughly removes the cell-cycle effect, it keeps all other effects, favorable (like the cell-type effect) or unfavorable (like the batch effect), intact. This is exactly what ccRemover is designed for. In this example, ccRemover makes the batch effect stand out, which may actually facilitate removing the batch effect. This can be done by using software specifically designed for removing the batch effect, and it is out of scope of this paper.

Discussion
ScRNA-Seq data suffer from a systematic bias which is introduced by the cell cycle. The cell cycle can have a confounding effect on the analysis of scRNA-Seq data, conceal the true biological features of interest and compromise the interpretation of scRNA-Seq experiments. In fact, we saw for the differentiating T-cell data that the true substructure of the data was undetectable unless the cell-cycle effect was removed. This can increase the difficulty of identifying new subtypes and subpopulations of cells in scRNA-Seq data. The current method developed for removing this effect, scLVM, does not inspect whether a leading factor represents the cell cycle, and thus it has a considerable risk of removing other important features of the data, as well as removing the cell-cycle effect incompletely. We developed a new method, ccRemover, that includes a formal statistical test to inspect whether an effect is a component of the cell-cycle effect or not. By using this test, ccRemover is able to remove the effects of the cell cycle from scRNA-Seq data quite thoroughly while preserving the other information that is contained within the data. Applying ccRemover to remove the cell-cycle effect can allow previously distorted signals of interest to emerge from the data and improve the analysis of scRNA-Seq data. This has been shown in both simulation data and real datasets.
The idea of using control genes for removing excess variation is not new 26,50,51 . The control genes from prior applications are genes which are known a priori to be "null" with regard to the biological factor of interest. Control genes in ccRemover are somewhat different: instead of using genes that are annotated not to be affected by the factor of interest (the cell cycle), we define them to be genes that are not annotated to be affected. We expect control genes in ccRemover to be affected by the cell-cycle effect, although the effect is weaker among control genes than it is among the cell-cycle genes. In practice, we recommend simply using all genes that are not annotated as cell-cycle genes as control genes; thus, once the set of cell-cycle genes is retrieved, the set of control genes is obtained freely. ccRemover relies on a known set of cell-cycle genes, which are often retrieved from annotation databases. In reality, the annotation databases are always incomplete and inaccurate. Using additional simulations, we have shown that ccRemover seems to be quite tolerant to incomplete and/or inaccurate annotations. See Section S3 in Supplementary Information for details.
As ccRemover utilizes the control genes to capture various effects in the data, if the cell-cycle effect only influences a few of the control genes, it may not be captured by the principal components and hence missed for detection by ccRemover. To explore how robust ccRemover is to the sparsity of the cell-cycle effect among the control genes we performed additional simulations (Section S2 in Supplementary Information). Using a modified version of the model from the simulation section we simulate data where the cell cycle affects differing proportions of the control genes. ccRemover is able to effectively identify and remove the cell-cycle effect from the data if at least 8% of the control genes are affected by the cell-cycle. This required proportion can be even lower when the data has a greater absolute number of genes or cells. This requirement is likely to be satisfied for most real datasets. For example, we have previously mentioned that Buettner et al. found that 44% of 6,500 genes previously considered unrelated to the cell-cycle were correlated with at least one cell-cycle gene 12 .
The cell cycle is often the main source of biological noise in scRNA-Seq data. When it is removed by ccRemover, other effects may stand out as the main confounding factors, as we have shown in our third real data example. ccRemover does not remove these effects as it is designed for removing the cell-cycle effect exclusively. However, if a set of genes are known to be more influenced by a particular effect to be removed, one can treat this set of genes as set A (the cell-cycle genes) and then ccRemover can be directly used to remove this effect.

Methods
We describe our ccRemover algorithm in this section. Denote the matrix of gene expression values as X, with element X i,j being the expression value for the i th gene and the j th cell, i = 1, … , p and j = 1, … , n. We recommend transforming the data to a log scale and centering it on a gene-by-gene basis. Let = − A i i { : gene is a cell cycle gene} and = B i i { : gene is a control gene}, with the corresponding data matrices X A and X B . The numbers of genes in A and B are represented as |A| and |B| respectively. Thus, the dimensions of X, X A and X B are p × n, |A| × n and |B| × n, respectively.
The ccRemover algorithm follows these steps: 1. Perform a PCA on the data matrix of control genes X B . Let the loadings be v 1 , … , v n , and the corresponding component scores be β 1  to extract the cell-cycle effect from the data matrix. Subtract these projections from X to remove them from the data. That is, the corrected data matrix with the cell-cycle effect removed is given by

j S j j T
Steps 1 to 4 are repeated until no more cell-cycle effects are identified (i.e. no statistically significant Δ j > 0 is found). We have found that usually no more than four repetitions are needed.
We use the following two-class bootstrap procedure 52 to test whether Δ j is significantly larger than 0: 1. Take a random sample with replacement of |A| rows from X A and another |B| rows from X B . This gives the resampled data matrices ⁎ X A and ⁎ X B . 2. Calculate ∆ ⁎ j a bootstrap replicate of Δ j , by applying steps 1 and 2 of the algorithm of ccRemover to ⁎ X B and For most datasets, we suggest using C Δ = 3, which roughly corresponds to a p-value of 0.01. We used this cutoff for all our simulations and real data examples except for the glioblastoma data, where it is known that the cell-cycle activity is at a very low level 47 . We use a smaller cutoff value C Δ = 2 for this data, and it roughly corresponds to a p-value of 0.05.