Abstract
Single-cell messenger RNA sequencing (scRNA-seq) has emerged as a powerful tool to study cellular heterogeneity within complex tissues. Subpopulations of cells with common gene expression profiles can be identified by applying unsupervised clustering algorithms. However, technical variance is a major confounding factor in scRNA-seq, not least because it is not possible to replicate measurements on the same cell. Here, we present BEARscc, a tool that uses RNA spike-in controls to simulate experiment-specific technical replicates. BEARscc works with a wide range of existing clustering algorithms to assess the robustness of clusters to technical variation. We demonstrate that the tool improves the unsupervised classification of cells and facilitates the biological interpretation of single-cell RNA-seq experiments.
Introduction
The gene expression landscape of single cells can reveal important biological insights into the processes driving development or disease. The development of techniques to sequence mRNA from individualized cells (scRNA-seq) has enabled researchers to study cell subpopulations, including rare cell types, at an unprecedented scale and resolution^{1,2,3}.
However, scRNA-seq has inherently high technical variability, and it is not possible to have true technical replicates for the same cell. This presents a major limitation for scRNA-seq analysis^{4, 5}. Specifically, read count measurements often vary considerably as a result of stochastic sampling effects, arising from the limited amount of starting material^{4, 5}. Also, false-negative observations frequently occur because expressed transcripts are not amplified during library preparation (the dropout effect)^{4, 5}. Another common problem is systematic variation due to minute changes in sample processing; these batch-dependent differences in cDNA conversion, library preparation and sequencing depth can easily mask biological differences among cells and might compromise many published scRNA-seq results^{2, 6}.
One widely adopted approach to adjust for technical variation between samples is the addition of known quantities of RNA spike-ins to each cell sample before cDNA conversion and library preparation^{7}. Several methods use spike-ins to normalize read counts per cell before further analysis^{8, 9}, but this use has been criticized because it exacerbates the effect of differences in RNA content per cell, e.g., due to variations in cell size^{2, 8}. Unfortunately, the limited volumes of starting material in single-cell transcriptomics inherently preclude the possibility of true technical replication.
To address this shortcoming of scRNA-seq analysis, we developed BEARscc (Bayesian ERCC Assessment of Robustness of single-cell clusters), an algorithm that uses spike-in measurements to model the distribution of experimental technical variation across samples to simulate realistic technical replicates. The simulated replicates can be used to quantitatively and qualitatively evaluate the effect of measurement variability and batch effects on analysis of any scRNA-seq experiment, facilitating biological interpretation. BEARscc represents a use for spike-in controls that is not subject to the same problems as per-sample normalization.
In many scRNA-seq studies, statistical clustering methods are used to identify cells with similar gene expression profiles that could represent distinct cell types^{1, 10, 11}. BEARscc was designed specifically with this application in mind. The simulated technical replicates generated by BEARscc can be fed into most existing clustering algorithms. The BEARscc package provides analysis tools to evaluate the resulting replicate clusters, and can thus reveal how robust the classification of cells into subtypes is to technical variation.
Results
Outline of BEARscc workflow
Conceptually, BEARscc addresses the lack of experimental technical replicates in single-cell studies by simulating technical replicates. These simulated technical replicates are based on RNA spike-ins included in the experiment. Because RNA spike-ins have undergone the same sequencing steps as the cellular RNA, they can be used to create an experiment-specific model of the technical variability. The simulated replicates can then be analyzed using almost any existing clustering method (to group cells with similar gene expression profiles) as a way of assessing how technical variation might influence the clusters identified in the real experimental data (i.e., how ‘robust’ the clusters are to technical variation). This helps in the identification of clusters that are most likely to represent real biological subpopulations of cells.
BEARscc consists of three steps (Fig. 1): modelling technical variance based on spike-ins (Step 1); simulating technical replicates (Step 2); and clustering simulated replicates (Step 3). In Step 1, an experiment-specific model of technical variability (noise) is estimated using observed spike-in read counts. This model consists of two parts. In the first part, expression-dependent variance (i.e., the expected variance of expression levels for a gene with a particular abundance) is approximated by fitting read counts of each spike-in across cells to a mixture model (see Methods). The second part addresses dropout effects (i.e., false-negative observations that occur if expressed transcripts are not amplified during library preparation). Based on the observed dropout rate for spike-ins of a given concentration, BEARscc generates a ‘dropout injection distribution’, which models the likelihood that a given transcript concentration will result in a dropout. Next, a ‘dropout recovery distribution’ is estimated from the dropout injection distribution using Bayes’ theorem; the dropout recovery distribution models the likelihood that a transcript that had no observed counts in a cell was a false negative.
In Step 2, BEARscc applies the model from Step 1 to produce simulated technical replicates. For every observed read count (in the real experimental data set) that lies within the range in which dropouts occurred amongst the spike-ins, BEARscc assesses whether to convert the count to zero (using the dropout injection distribution). For observations where the read count is zero, the dropout recovery distribution is used to estimate a new value, based on the overall dropout frequency for that gene. After this dropout processing, all non-zero read counts are substituted with a value generated by the model of expression-dependent variance (from the first part of Step 1), parameterized to the observed counts for each gene. Step 2 can be repeated any number of times to generate a collection of simulated technical replicates.
This set of simulated technical replicates can then be re-analyzed in the same way as the original experimental observations to assess the robustness of the results to modelled technical variation. Specifically, in Step 3 we focus on clustering analysis as this is a common approach for analyzing scRNA-seq data to identify groups of cells with similar gene expression profiles. Each simulated technical replicate is clustered using the same algorithm parameters as the original experimental observations. An association matrix is created in which each element indicates whether two cells share a cluster identity (1) or cluster apart from each other (0) in a particular replicate (Fig. 1, step 3). We provide a visual representation of the clustering variation on a cell-by-cell level by combining association matrices to form the ‘noise consensus matrix’. Each element of this matrix represents the fraction of simulated technical replicates in which two cells cluster together (the ‘association frequency’), after using a chosen clustering method.
To quantitatively evaluate the results, three metrics are calculated from the noise consensus matrix. ‘Stability’ is the average frequency with which cells within a cluster associate with each other across simulated replicates. A stability value above 0.5 indicates that more cells than expected by chance are grouped together irrespective of technical variance. ‘Promiscuity’ measures the association frequency between cells within a cluster and those outside of it. A promiscuity value above 0.5 signifies that some cells in the cluster are better placed in other clusters. ‘Score’ is the difference between stability and promiscuity and reflects the overall robustness of a cluster to technical variance. A value above 0 suggests that the grouping of cells in this cluster is not purely an artefact of technical variance (Supplementary Fig. 1).
Determining the optimal number of clusters, k, into which cells should be grouped is an inherently difficult problem in scRNA-seq analysis. Heuristics, such as the silhouette index or the gap statistic^{12, 13}, are commonly used (e.g., in RaceID2). Other tools, such as BackSPIN, employ custom algorithms to arrive at a fixed value of k, while others, such as SC3, leave the decision to the user. All of these approaches, however, fail to account for the expected technical variance between measurements of single cells. BEARscc’s score statistic can help to refine what k to use, given a clustering algorithm. By performing hierarchical clustering on the noise consensus matrix, BEARscc can split cells into any number of clusters between 1 and N (the total number of cells). The hierarchical clustering with a maximum score (within a biologically reasonable range) represents a ‘meta-clustering’ with an optimal trade-off between within-cluster stability and between-cluster variability (see Methods). This meta-clustering methodology enables a semi-automatic refinement of existing clustering results.
Evaluation of the BEARscc model of technical variance
Given the difficulty of generating true technical replicates from single-cell material, we generated a set of experimental replicates for which we diluted one RNA-seq library derived from bulk human brain tissue to single-cell RNA concentrations and sequenced 48 of these samples with ERCC spike-ins^{12}. Each of these 48 samples is a ‘real’ technical replicate to compare to the simulated technical replicates generated by BEARscc. The mean and variance of the simulated read counts produced by BEARscc closely matched the experimentally determined values (Fig. 1, step 1, top; Supplementary Fig. 2a, b). For 95% of the genes expressed in the library, the simulated dropout rate differed from the observed dropout rate by <9% (Fig. 1, step 1, bottom; Supplementary Fig. 2c). Together, these results suggest that technical variation simulated by BEARscc closely resembles technical variation observed experimentally. The simulated expression of genes with less than 1 observed count deviated slightly from the experimentally determined values (Supplementary Fig. 2a); however, such small expression differences are unlikely to be reproducible as they fall outside the dynamic range of any single-cell experiment.
Testing the utility of BEARscc in clustering analysis
To test whether BEARscc can improve the detection of true subpopulations of cells from single-cell transcriptome analysis, we performed a control experiment in which we sequenced 45 ‘blank’ samples alongside the diluted brain RNA, in two batches. The blanks only contained spike-ins and trace amounts of environmental contamination, producing sporadic read counts. We clustered the data from the brain samples and blanks using three widely used clustering algorithms (RaceID2^{10}, BackSPIN^{11}, and SC3^{14}), either alone or after simulating technical replicates using BEARscc. Correct clustering should give perfect separation of brain and blank samples. To avoid artifacts due to differences in amplification-dependent library size, we applied an adjusted cpm normalization (see Methods). Otherwise, standard parameters were used for all three clustering algorithms. As an alternative to BEARscc, we also tested a simple sampling approach where we repeatedly sampled half of all expressed genes and re-clustered the cells based on this subset (see Methods). Without BEARscc or this sampling approach, all three clustering algorithms created false-positive clusters (Fig. 2a, Supplementary Fig. 3a–c, top). BEARscc provided a clear improvement over the original clustering and the sampling approach (Fig. 2a). Overall, BEARscc separated brain tissue and blank samples correctly and eliminated spurious clusters that corresponded to batch effects (Supplementary Fig. 3a, c, coloured bars above matrices). In the case of using BEARscc with RaceID2, three outlier cells were incorrectly identified as robust clusters (Supplementary Fig. 3b, coloured bars above matrix); the libraries for these three samples contained fewer than 1000 observed transcripts, indicating that BEARscc is limited by RaceID2’s over-sensitivity to library size differences.
Applying BEARscc to a well-annotated biological data set
To further test the utility of BEARscc, we applied it to a previously published data set^{15} of scRNA-seq measurements from early C. elegans embryogenesis. In this study, mRNA from four biological replicates of cells from the 1-, 2-, 4-, 8- and 16-cell stages was sequenced. We made the assumption that cells from the 1-, 4- and 16-cell stages are more different between than within stages. We therefore expected to observe clusters that preferentially group cells from the same stage across biological replicates. To test this, we ran RaceID2^{10} and BackSPIN^{11} on the expression data from these three distinct stages and compared the resulting cluster assignments to the meta-clusters created by BEARscc, or by gene sampling as described above. We found that BEARscc outperformed gene sampling and either RaceID2 or BackSPIN alone, as measured by the adjusted Rand index (expecting three distinct clusters, see Fig. 2a). RaceID2 alone produced 10 clusters, which were reduced to two by BEARscc (Supplementary Fig. 4a and Supplementary Fig. 5a, b). BackSPIN alone resulted in 34 clusters, which BEARscc merged to 7 clusters (Supplementary Fig. 4b and Supplementary Fig. 5c, d). These data suggest that BEARscc reduces over-clustering by commonly used clustering algorithms and improves the interpretation of scRNA-seq data.
BEARscc enhances the interpretation of published data sets
To assess the robustness of computational cell-type detection based on real biological scRNA-seq data, we applied BEARscc to two previously published data sets. We first re-analyzed murine brain data (3005 cells) from Zeisel et al.^{11}, using BEARscc together with BackSPIN, which is the clustering algorithm used in the original publication. Based on the score statistic, BEARscc reduced the 24 clusters produced by BackSPIN alone into 11 clusters, which corresponded well with the manually curated cell types described in the original publication (adjusted Rand index 0.72 with BEARscc, and 0.55 for BackSPIN alone; Fig. 2a (right), Fig. 2b). Therefore, BEARscc provided an optimal grouping of cells without the effort of manual curation.
In a second evaluation, we re-analyzed murine intestinal data (291 cells) from Grün et al.^{10}, using BEARscc to generate simulated technical replicates and RaceID2 (as described in the original publication) for clustering. The score metric from BEARscc indicated that 219 out of 291 cells were robustly classified in the original work. However, the two largest clusters (‘cluster 1’ and ‘cluster 2’) exhibited low scores (−0.07 and 0.20, respectively) compared to the other non-outlier clusters 3, 4 and 5 (Supplementary Fig. 6a). The BEARscc noise consensus matrix reveals high variability in the clustering patterns of cells in clusters 1 and 2 (Supplementary Fig. 6b). Grün et al. suggest that clusters 1 and 2 reflect closely related, undifferentiated cell types (‘transit-amplifying’ and ‘stem-like’, respectively). Expression patterns of genes characteristic of the two clusters were highly similar (Supplementary Fig. 7a), compared to the expression differences between cluster 1 and the next-largest cluster (cluster 5) (Supplementary Fig. 7b). Expression fold-changes between clusters 1 and 2 were reduced in technical replicates, falling below the significance threshold for many genes. BEARscc shows that many cells in clusters 1 and 2 cannot be reliably classified into one cluster or the other. The sharp distinction between clusters 1 and 2 described in the original publication is therefore likely to be a result of technical variation, rather than a defining biological feature of these cells. Instead, cells in clusters 1 and 2 seem to lie on a gradient of differentiation between two cellular states. More work will be needed to fully determine how the differentiation state of stem-like cells is reflected by their transcriptome. Nevertheless, this example demonstrates how BEARscc can help to improve the biological interpretation of scRNA-seq data.
Performance of BEARscc
With single-cell experiments becoming larger and larger, program execution time can become a bottleneck for computational analyses. Using simulated data, we show that BEARscc’s runtime grows linearly with increasing numbers of cells, assuming a constant number of genes (Supplementary Fig. 8). On a desktop PC (Intel i7 with 2.9 GHz), a single BEARscc process required ~16 min and 19 min to generate a simulated technical replicate from the C. elegans (14,448 genes by 115 samples) and murine brain (19,972 genes by 3005 samples) data, respectively. Importantly, the generation of simulated replicates can be distributed across multiple independent processes on multiple machines. The real time requirement for generating replicate data sets is thus mostly limited by the available hardware. Once simulated technical replicates have been generated, the runtime of any downstream clustering analysis is dependent on the specifics of the respective clustering algorithm.
Discussion
BEARscc addresses the challenges posed by intrinsically high technical variability in single-cell transcriptome sequencing experiments and enables the evaluation of single-cell clustering results. Importantly, BEARscc is not a clustering algorithm in itself, but rather a tool to evaluate the results produced by any available clustering algorithm. To do so, it aggregates the information from exogenous control spike-ins across samples to create a model of both the expected variance of endogenous read counts and the likelihood of false-negative measurements (dropouts). This represents an alternative use of spike-in controls that is not subject to potential issues surrounding the use of spike-ins for per-sample normalisation. Our application of BEARscc to biological datasets demonstrates that BEARscc reduces over-clustering, is able to identify biologically relevant cell groups in an unsupervised way and provides additional insights for the interpretation of single-cell sequencing experiments.
We note that extreme batch effects with a multimodal distribution of variance could skew BEARscc’s noise model and lead to biased simulated replicates. We envision that future versions of BEARscc will attempt to detect and warn about such biases. Furthermore, while the dropout model calculated by BEARscc is accurate for genes with an average expression of more than one count, there is still scope for improvement. Future work will focus on more precise models of dropouts in the context of very low gene expression. As it stands, BEARscc enables users to identify the components of scRNA-seq clustering results that are robust to noise, thereby increasing confidence in those results for downstream analysis. Therefore, we recommend that future scRNA-seq analysis pipelines apply the best available clustering algorithm in conjunction with BEARscc in order to define the most biologically meaningful groups of cells for interpretation.
Methods
Public data analysis
Primary murine cortex and hippocampus single-cell measurements for 3005 cells from Zeisel et al.^{11} were retrieved from the publicly available Linnarsson laboratory data repository [http://linnarssonlab.org/cortex/]. Primary murine intestinal single-cell measurements of 260 cells from Grün et al.^{10} were downloaded from the van Oudenaarden GitHub repository [https://github.com/dgrun/StemID]. Primary C. elegans embryo single-cell measurements for 219 cells from Tintori et al.^{15} were obtained from the SRA^{16} sequencing database with project id SRP070155. Raw reads were mapped to PRJNA13758.WS259 using STAR^{17}. Exact position duplicates were removed. These samples were deeply sequenced (~2 million reads or more per cell), resulting in saturated ERCC spike-ins with limited spike-in dropouts. To allow BEARscc to build the noise models from spike-in data for simulating technical replicates, mapped libraries were downsampled to 20% of mapped reads. Features were counted using HTSeq^{17}. Samples identified by Tintori et al. as failing quality control were excluded, and counts were normalized to adjusted counts per million as described below.
Algorithmic generation of simulated technical replicates
Simulated technical replicates were generated from the noise mixture model and two dropout models. For each gene, the count value of each sample is systematically transformed using the mixture model, Z(c), and the dropout injection, Pr(X = 0 | Y = k), and recovery, Pr(Y_{j} = y | X_{j} = 0), distributions in order to generate simulated technical replicates as indicated by the following pseudocode:
FOR EACH gene, j
    FOR EACH count, c
        IF c = 0
            n ← SAMPLE one count, y, from Pr(Y_{j} = y | X_{j} = 0)
            IF n = 0
                c ← 0
            ELSE
                c ← SAMPLE one count from Z(n)
            ENDIF
        ELSE
            IF c ≤ k
                dropout ← TRUE with probability, Pr(X = 0 | Y = k)
                IF dropout = TRUE
                    c ← 0
                ELSE
                    c ← SAMPLE one count from Z(c)
                ENDIF
            ELSE
                c ← SAMPLE one count from Z(c)
            ENDIF
        ENDIF
        RETURN c
    DONE
DONE
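The pseudocode can be sketched in Python as follows. This is a minimal illustration and not the BEARscc implementation: the function names and the three callables (`injection_prob`, `sample_recovery`, `sample_noise`) are hypothetical stand-ins for the dropout injection distribution, the gene-specific dropout recovery distribution and the mixture model Z(c). The sketch also assumes that the injection probability is evaluated at the observed count c, which is one reading of the pseudocode.

```python
import random

def simulate_count(c, gene, k, injection_prob, sample_recovery, sample_noise, rng):
    """Transform one observed count c of one gene, following the pseudocode."""
    if c == 0:
        # Observed zero: draw a plausible true count from the recovery distribution.
        n = sample_recovery(gene, rng)
        return 0 if n == 0 else sample_noise(n, rng)
    if c <= k and rng.random() < injection_prob(c):
        return 0  # inject a dropout (assumption: probability evaluated at c)
    return sample_noise(c, rng)  # resample the count from the noise model

def simulate_replicate(counts, k, injection_prob, sample_recovery, sample_noise, seed=0):
    """counts[j] holds the observed counts of gene j across all samples."""
    rng = random.Random(seed)
    return [[simulate_count(c, j, k, injection_prob, sample_recovery, sample_noise, rng)
             for c in row]
            for j, row in enumerate(counts)]

# Toy stand-ins, chosen only so that the result is easy to follow.
toy = simulate_replicate(
    counts=[[0, 2, 50]], k=5,
    injection_prob=lambda c: 1.0,         # always drop counts <= k
    sample_recovery=lambda gene, rng: 0,  # observed zeros stay zero
    sample_noise=lambda c, rng: c + 100,  # deterministic placeholder for Z(c)
)
print(toy)  # [[0, 0, 150]]
```

Passing the three distributions as callables keeps the control flow of the pseudocode separate from how each distribution is estimated.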
Modelling noise from spike-ins
Technical variance was modelled by fitting a single-parameter mixture model, Z(c), to the spike-ins’ observed count distributions. The noise model was fit independently for each spike-in transcript and subsequently regressed onto spike-in mean expression to define a generalized noise model. This was accomplished in three steps:

1. Define a mixture model composed of Poisson and Negative Binomial random variables: Z ~ (1 − α)*Pois(μ) + α*NBin(μ, σ)

2. Empirically fit the parameter, α_{i}, in a spike-in-specific mixture model, Z_{i}, to the observed distribution of counts for each ERCC spike-in transcript, i, where μ_{i} and σ_{i} are the observed mean and variance of the given spike-in. The parameter, α_{i}, was chosen such that the error between the observed and mixture model was minimized.

3. Generalize the mixture model by regressing α_{i} parameters and the observed variance σ_{i} onto the observed spike-in mean expression, μ_{i}. Thus the mixture model describing the noise observed in ERCC transcripts was defined solely by μ, which was treated as the count transformation parameter, c, in the generation of simulated technical replicates.
In step 2, a mixture model distribution is defined for each spike-in, i: Z_{i}(α_{i}, μ_{i}, σ_{i}) ~ (1 − α_{i})*Pois(μ_{i}) + α_{i}*NBin(μ_{i}, σ_{i}). The distribution, Z_{i}, is fit to the observed counts of the respective spike-in, where α_{i} is an empirically fitted parameter, such that α_{i} minimizes the difference between the observed count distribution of the spike-in and the respective fitted model, Z_{i}. Specifically, for each spike-in transcript, μ_{i} and σ_{i} were taken to be the mean and standard deviation, respectively, of the observed counts for spike-in transcript, i. Then, α_{i} was computed by empirical parameter optimization; α_{i} was taken to be the α_{i,j} in the mixture model, Z_{i,j}(α_{i,j}, μ_{i}, σ_{i}) ~ (1 − α_{i,j})*Pois(μ_{i}) + α_{i,j}*NBin(μ_{i}, σ_{i}), found to have the least absolute total difference between the observed count density and the density of the fitted model, Z_{i}. In the case of ties, the minimum α_{i,j} was chosen.
In step 3, α(c) was then defined with a linear fit, α_{i} = a*log2(μ_{i}) + b. σ(c) was similarly defined with its own separately fitted coefficients, log2(σ_{i}) = a*log2(μ_{i}) + b. In this way, the observed distribution of counts in spike-in transcripts defined the single-parameter mixture model, Z(c), used to transform counts during generation of simulated technical replicates:

$$Z(c) \sim \bigl(1 - \alpha(c)\bigr)\,\mathrm{Pois}(c) + \alpha(c)\,\mathrm{NBin}\bigl(c, \sigma(c)\bigr)$$
During technical replicate simulation, the parameter c was set to the observed count value, a, and the transformed count in the simulated replicate was determined by sampling a single value from Z(c=a).
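As a rough illustration of sampling from the single-parameter mixture model, the sketch below draws one value from Z(c) using only the Python standard library. Everything here is an assumption made for illustration: the regression coefficients in `alpha_fit` and `sigma_fit` are invented placeholders, α(c) is clipped to [0, 1], σ(c) is treated as a standard deviation, and the negative binomial falls back to a Poisson draw when the variance does not exceed the mean.

```python
import math
import random

def sample_poisson(mu, rng):
    """Count unit-rate arrivals in [0, 1] of a process with rate mu."""
    if mu <= 0:
        return 0
    t, n = 0.0, 0
    while True:
        t += rng.expovariate(mu)
        if t > 1.0:
            return n
        n += 1

def sample_nbinom(mu, sigma, rng):
    """Negative binomial with mean mu and s.d. sigma via a gamma-Poisson mixture."""
    var = sigma ** 2
    if var <= mu:  # no overdispersion: fall back to Poisson (assumption)
        return sample_poisson(mu, rng)
    shape = mu * mu / (var - mu)
    scale = (var - mu) / mu
    return sample_poisson(rng.gammavariate(shape, scale), rng)

def sample_noise(c, rng, alpha_fit=(0.05, 0.1), sigma_fit=(0.8, 0.5)):
    """Draw one value from Z(c) for an observed count c > 0.

    alpha_fit and sigma_fit are hypothetical (slope, intercept) pairs for
    alpha = a*log2(c) + b and log2(sigma) = a*log2(c) + b.
    """
    log_c = math.log2(c)
    alpha = min(1.0, max(0.0, alpha_fit[0] * log_c + alpha_fit[1]))
    sigma = 2.0 ** (sigma_fit[0] * log_c + sigma_fit[1])
    if rng.random() < alpha:
        return sample_nbinom(c, sigma, rng)
    return sample_poisson(c, rng)

rng = random.Random(1)
draws = [sample_noise(20, rng) for _ in range(2000)]
mean_draw = sum(draws) / len(draws)  # both components have mean c, so this is near 20
```

Both mixture components are centred on c, so the simulated counts keep the observed expression level on average while their spread follows the spike-in-derived noise model.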
Inference of dropout distributions using spike-ins
A model of the dropouts was developed to inform the permutation of zeros during noise injection. The observed zeros in spike-in transcripts as a function of actual transcript concentration and Bayes’ theorem were used to define two models: the ‘dropout injection distribution’ and the ‘dropout recovery distribution’.
The dropout injection distribution was described by Pr(X = 0 | Y = y), where X is the distribution of observed counts and Y is the distribution of actual transcript counts; the density was computed by regressing the fraction of zeros observed in each sample, D_{i}, for a given spike-in, i, onto the expected number of spike-in molecules in the sample, y_{i}, e.g., D = a*y + b. Then, D describes the density of zero observations conditioned on actual transcript number, y, or Pr(X = 0 | Y = y). Notably, each gene was treated with an identical density distribution for dropout injection.
In contrast, the density of the dropout recovery distribution, Pr(Y_{j} = y | X_{j} = 0), is specific to each gene, j, where X_{j} is the distribution of the observed counts and Y_{j} is the distribution of actual transcript counts for a given gene. The gene-specific dropout recovery distribution was inferred from the dropout injection distribution using Bayes’ theorem and a prior. This was accomplished in three steps:
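A minimal sketch of the regression behind the dropout injection density is given below, assuming an ordinary least-squares line D = a*y + b. The function name, the toy spike-in values and the clipping of the fitted line to [0, 1] are assumptions made for illustration and are not taken from the BEARscc code.

```python
def fit_injection(molecules, zero_fractions):
    """Fit D = a*y + b by least squares and return a function y -> Pr(X = 0 | Y = y)."""
    n = len(molecules)
    mean_y = sum(molecules) / n
    mean_d = sum(zero_fractions) / n
    sxx = sum((y - mean_y) ** 2 for y in molecules)
    sxy = sum((y - mean_y) * (d - mean_d) for y, d in zip(molecules, zero_fractions))
    a = sxy / sxx
    b = mean_d - a * mean_y
    # Clip to a valid probability (assumption, since a straight line is unbounded).
    return lambda y: min(1.0, max(0.0, a * y + b))

# Toy spike-ins: expected molecules per sample and observed fraction of zeros.
injection = fit_injection([1, 2, 3, 4], [0.75, 0.5, 0.25, 0.0])
print(injection(2))   # 0.5
print(injection(10))  # 0.0, beyond the range where dropouts were observed
```

Under this sketch, the threshold k used elsewhere in the Methods would be the largest y for which the fitted density is still above zero.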

1. For the purpose of applying Bayes’ theorem, the gene-specific distribution, Pr(X_{j} = 0 | Y_{j} = y), was taken to be the dropout injection density for all genes, j.

2. The probability that a specific transcript count was present in the sample, Pr(Y_{j} = y), was a necessary, but empirically unknowable prior. Therefore, the prior was defined using the law of total probability, an assumption of uniformity, and the probability that a zero was observed in a given gene, Pr(X_{j} = 0). The probability, Pr(X_{j} = 0), was taken to be the fraction of observations that were zero for a given gene. This was done to better inform the density estimation of the gene-specific dropout recovery distribution.

3. The dropout recovery distribution density was then computed by applying Bayes’ theorem:

$$\Pr(Y_j = y \mid X_j = 0) = \frac{\Pr(X_j = 0 \mid Y_j = y)\,\Pr(Y_j = y)}{\Pr(X_j = 0)}$$
In the second step, the law of total probability, an assumption of uniformity, and the fraction of zero observations in a given gene were leveraged to define the prior, Pr(Y_{j} = y). First, a threshold of expected number of transcripts, k in Y, was chosen such that k was the maximum value for which the dropout injection density was non-zero. Next, uniformity was assumed for all expected number of transcript values, y, greater than zero and less than or equal to k; that is, Pr(Y_{j} = y) was defined to be some constant probability, n. Furthermore, Pr(Y_{j} = y) was defined to be 0 for all y > k. To inform Pr(Y_{j} = y) empirically, Pr(Y_{j} = 0) and n were derived by imposing the law of total probability (2) and unity (3), yielding a system of equations:

$$\Pr(X_j = 0) = \Pr(X_j = 0 \mid Y_j = 0)\,\Pr(Y_j = 0) + \sum_{y=1}^{k} \Pr(X_j = 0 \mid Y_j = y)\,n \tag{2}$$

$$\Pr(Y_j = 0) + \sum_{y=1}^{k} n = \Pr(Y_j = 0) + k\,n = 1 \tag{3}$$
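The probability that a zero is observed given there are no transcripts in the sample, Pr(X_{j} = 0 | Y_{j} = 0), was assumed to be 1. With the preceding assumption, solving for Pr(Y_{j} = 0) and n gives:

$$n = \frac{1 - \Pr(X_j = 0)}{k - \sum_{y=1}^{k} \Pr(X_j = 0 \mid Y_j = y)} \tag{4}$$

$$\Pr(Y_j = 0) = 1 - k\,n \tag{5}$$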
In this way, Pr(Y_{j} = y) was defined by Eq. (4) for y in Y_{j} less than or equal to k and greater than zero, and defined by Eq. (5) for y in Y_{j} equal to zero. For y in Y_{j} greater than k, the prior Pr(Y_{j} = y) was defined to be equal to zero.
In the third step, the previously computed prior, Pr(Y_{j} = y), the fraction of zero observations in a given gene, Pr(X_{j} = 0), and the dropout injection distribution, Pr(X_{j} = 0 | Y_{j} = y), were utilized to estimate with Bayes’ theorem the density of the dropout recovery distribution, Pr(Y_{j} = y | X_{j} = 0). During the generation of simulated technical replicates for zero observations and count observations less than or equal to k, values were sampled from the dropout recovery and injection distributions as described in the pseudocode of the algorithm.
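The three steps can be worked through numerically. The sketch below is an illustration under stated assumptions, not the package code: the function name is hypothetical, `injection[y - 1]` is assumed to hold Pr(X = 0 | Y = y) for y = 1..k, and Pr(X = 0 | Y = 0) is set to 1 as in the text. The sketch does not handle the case in which the derived prior Pr(Y = 0) would become negative.

```python
def dropout_recovery(injection, zero_fraction):
    """Return [Pr(Y = y | X = 0) for y = 0..k] for one gene.

    injection[y - 1] is Pr(X = 0 | Y = y) for y = 1..k (shared by all genes);
    zero_fraction is the gene's observed fraction of zeros, Pr(X_j = 0).
    """
    if zero_fraction <= 0:
        raise ValueError("recovery is only defined for genes with observed zeros")
    k = len(injection)
    n = (1.0 - zero_fraction) / (k - sum(injection))  # uniform prior for 0 < y <= k
    p_y0 = 1.0 - k * n                                # prior for y = 0
    prior = [p_y0] + [n] * k
    likelihood = [1.0] + list(injection)              # Pr(X = 0 | Y = 0) assumed 1
    # Bayes' theorem: posterior = likelihood * prior / evidence.
    return [lik * pri / zero_fraction for lik, pri in zip(likelihood, prior)]

posterior = dropout_recovery(injection=[0.5, 0.25], zero_fraction=0.5)
print(posterior)  # approximately [0.4, 0.4, 0.2]
```

In this toy case k = 2, so n = 0.5 / 1.25 = 0.4 and Pr(Y = 0) = 0.2; the posterior sums to one because the prior was constructed to satisfy the law of total probability.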
Observing real technical noise
Brain whole tissue total RNA (Agilent Technologies, cat. 540005) was diluted to 10 pg aliquots and added to 1 μL. cDNA conversion, library preparation, and sequencing were performed by the Wellcome Trust Center for Human Genomics Sequencing Core. Blank samples were identically prepared with nuclease-free water. Samples were pipetted into 96-well plates and treated as single cells using Smart-seq2 cDNA conversion as described by Picelli et al.^{19} with minor modifications. The library was prepared using Fluidigm’s recommendations for Illumina Nextera XT at ¼ volume with minor modifications, and sequenced on the Illumina HiSeq4000 platform. Raw reads were mapped to hg19 using STAR^{17}. Exact position duplicates were removed, and features were counted using HTSeq^{17}. Counts were normalized to adjusted counts per million as described below.
Normalization of counts
For the brain and blanks control experiment and C. elegans data, raw counts were normalized to adjusted counts per million to reduce batch effects: raw counts were divided by the total number of counts and multiplied by 10^{6}. For each sample library, a detection rate was computed by dividing the number of genes with at least one observed count by the total number of observed genes across all libraries. A scaling factor was then computed for each sample library by dividing each library’s detection rate by the median detection rate, as initially suggested by Hicks et al.^{20}.
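The quantities described above can be computed as in the following sketch. The function name and toy matrix are hypothetical, and libraries are assumed to be non-empty. Because the text does not spell out how the scaling factor is combined with the counts per million, the sketch returns both quantities separately rather than guessing at the final step.

```python
from statistics import median

def cpm_and_scaling(counts):
    """counts[g][s] is the raw count of gene g in sample library s."""
    n_genes, n_samples = len(counts), len(counts[0])
    totals = [sum(counts[g][s] for g in range(n_genes)) for s in range(n_samples)]
    cpm = [[counts[g][s] / totals[s] * 1e6 for s in range(n_samples)]
           for g in range(n_genes)]
    # Genes observed at least once in any library.
    observed = sum(1 for g in range(n_genes) if any(counts[g]))
    detection = [sum(1 for g in range(n_genes) if counts[g][s] > 0) / observed
                 for s in range(n_samples)]
    med = median(detection)
    scaling = [d / med for d in detection]
    return cpm, scaling

toy_counts = [[1, 0], [1, 2], [2, 2], [0, 0]]  # 4 genes x 2 libraries
cpm, scaling = cpm_and_scaling(toy_counts)
print(cpm[0][0])  # 250000.0
print(scaling)    # approximately [1.2, 0.8]
```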
Clustering of counts data
BackSPIN, SC3 and RaceID2 were run according to algorithm-specific recommendations^{10, 11, 14}. RaceID2 was allowed to identify cluster number under default parameters. For the brain and blanks control experiment and C. elegans data, RaceID2 was modified to skip normalization since scaled counts per million normalization had already been applied to the data set. The number of clusters, k, selected for SC3 clustering was determined empirically by selecting k with the optimal silhouette distribution across noise-injected counts matrices.
Computation of consensus matrix
One hundred simulated replicate matrices for n cells and m genes were clustered using the respective clustering algorithm (SC3, BackSPIN, RaceID2) as described above. Cluster labels were used to compute an n × n binary association matrix for each clustering. Each element of the association matrix represents a cell–cell interaction, where a value of 1 indicates that two cells share a cluster and a value of 0 indicates two cells do not share a cluster. An arithmetic mean was taken for each respective element across the resulting 100 association matrices to produce an n × n noise consensus matrix, where each element represents the fraction of noise-injected counts matrices that, upon clustering, resulted in two cells sharing a cluster.
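The construction of the noise consensus matrix from cluster labels can be sketched as follows. The function names are illustrative only, and the toy example uses two replicates rather than 100.

```python
def association_matrix(labels):
    """n x n binary matrix: 1 where two cells share a cluster label, else 0."""
    n = len(labels)
    return [[1 if labels[a] == labels[b] else 0 for b in range(n)] for a in range(n)]

def noise_consensus(label_sets):
    """Element-wise mean of the association matrices of all clustered replicates."""
    n = len(label_sets[0])
    total = [[0] * n for _ in range(n)]
    for labels in label_sets:
        assoc = association_matrix(labels)
        for a in range(n):
            for b in range(n):
                total[a][b] += assoc[a][b]
    return [[value / len(label_sets) for value in row] for row in total]

# Cluster labels of three cells in two simulated replicates.
consensus = noise_consensus([[0, 0, 1], [0, 1, 1]])
print(consensus)  # [[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]
```

Only co-membership matters, so cluster labels do not need to be matched between replicates.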
Computation of BEARscc cluster metrics
To calculate cluster stability, the noise consensus matrix was subset to cells assigned to the cluster. The cluster stability was then calculated as the arithmetic mean of the upper triangle of the subset noise consensus matrix. To calculate cluster promiscuity, the rows of the noise consensus matrix were subset to cells assigned to the cluster and the columns were subset to the cells not assigned to the cluster. For clusters with as many or more cells assigned to them than not assigned, the promiscuity was defined as the arithmetic mean of the elements in the subset matrix. Otherwise, the columns were further subset to the same number of cells as were assigned to the cluster, where the cells outside of the cluster with the strongest mean association with cells inside the cluster were chosen. The promiscuity was defined as the arithmetic mean of the elements in this further subset matrix. Each cluster’s promiscuity was subtracted from its stability to calculate cluster score.
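The cluster-level metrics can be illustrated with the short sketch below. It assumes that the upper triangle excludes the diagonal, that a single-cell cluster has undefined stability (returned as NaN), and that promiscuity is 0 when no cells lie outside the cluster; these edge-case choices and the function name are assumptions, not details taken from the text.

```python
def cluster_metrics(consensus, labels, cluster):
    """Return (stability, promiscuity, score) for one non-empty cluster."""
    inside = [i for i, lab in enumerate(labels) if lab == cluster]
    outside = [i for i, lab in enumerate(labels) if lab != cluster]

    # Stability: mean over unique within-cluster pairs (upper triangle, no diagonal).
    pairs = [consensus[a][b] for x, a in enumerate(inside) for b in inside[x + 1:]]
    stability = sum(pairs) / len(pairs) if pairs else float("nan")

    # Small clusters: keep only the outside cells most associated with the cluster.
    if len(inside) < len(outside):
        def mean_assoc(o):
            return sum(consensus[i][o] for i in inside) / len(inside)
        outside = sorted(outside, key=mean_assoc, reverse=True)[:len(inside)]

    cross = [consensus[i][o] for i in inside for o in outside]
    promiscuity = sum(cross) / len(cross) if cross else 0.0
    return stability, promiscuity, stability - promiscuity

toy_consensus = [
    [1.0, 0.9, 0.1, 0.3, 0.0],
    [0.9, 1.0, 0.3, 0.5, 0.0],
    [0.1, 0.3, 1.0, 0.8, 0.7],
    [0.3, 0.5, 0.8, 1.0, 0.6],
    [0.0, 0.0, 0.7, 0.6, 1.0],
]
toy_labels = ["a", "a", "b", "b", "b"]
metrics_a = cluster_metrics(toy_consensus, toy_labels, "a")  # about (0.9, 0.3, 0.6)
metrics_b = cluster_metrics(toy_consensus, toy_labels, "b")  # about (0.7, 0.2, 0.5)
```

For cluster "a" (two cells, three outside), only the two outside cells with the strongest mean association enter the promiscuity, which prevents large, unrelated clusters from diluting the value.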
Computation of BEARscc cell metrics
To calculate a cell’s stability, the arithmetic mean was taken of that cell’s association frequencies with the other cells within its cluster. A cell’s promiscuity was calculated in one of two ways. For cells in clusters with as many or more cells assigned to them than not assigned, the promiscuity was the arithmetic mean of that cell’s association frequencies with all cells not assigned to the relevant cluster. For cells in clusters of size n with fewer cells assigned to them than not assigned, the cell’s promiscuity was the arithmetic mean of its association frequencies with the n cells not assigned to the cluster that had the highest association frequencies. Each cell’s promiscuity was subtracted from its stability to calculate the cell score.
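The per-cell metrics can be sketched in the same spirit. This Python example is illustrative only: the function name cell_metrics and the toy matrix are assumptions, and the outside cells are ranked by their association with the cell in question, which is one reading of the description above.

```python
from statistics import mean


def cell_metrics(consensus, labels, cell):
    """Return (stability, promiscuity, score) for a single cell."""
    cluster = labels[cell]
    peers = [j for j, lab in enumerate(labels) if lab == cluster and j != cell]
    outside = [j for j, lab in enumerate(labels) if lab != cluster]

    # Stability: mean association with the other cells of the same cluster.
    # A cell alone in its cluster has no peers (undefined here, an assumption).
    stability = mean(consensus[cell][j] for j in peers) if peers else float("nan")

    # Promiscuity: association with cells outside the cluster, highest first.
    outside_freqs = sorted((consensus[cell][j] for j in outside), reverse=True)
    cluster_size = len(peers) + 1
    if cluster_size < len(outside):
        # Small cluster: use only the top `cluster_size` outside cells.
        outside_freqs = outside_freqs[:cluster_size]
    promiscuity = mean(outside_freqs) if outside_freqs else 0.0

    return stability, promiscuity, stability - promiscuity


# Hypothetical noise consensus matrix for 6 cells in two clusters.
toy_consensus = [
    [1.00, 0.75, 0.50, 0.25, 0.00, 0.00],
    [0.75, 1.00, 0.50, 0.25, 0.00, 0.00],
    [0.50, 0.50, 1.00, 0.75, 0.75, 0.75],
    [0.25, 0.25, 0.75, 1.00, 1.00, 1.00],
    [0.00, 0.00, 0.75, 1.00, 1.00, 1.00],
    [0.00, 0.00, 0.75, 1.00, 1.00, 1.00],
]
toy_labels = [0, 0, 1, 1, 1, 1]
cell0 = cell_metrics(toy_consensus, toy_labels, 0)
cell2 = cell_metrics(toy_consensus, toy_labels, 2)
```

Cell 2 in this toy matrix sits between the two clusters, so its score is lower than that of cell 0 even though both have the same stability.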
Estimating the background distribution of BEARscc metrics
To compute null distributions for the stability, promiscuity and score metrics, random clusters were generated for data sets with varying numbers of cells, m, in which cells were assigned to an arbitrary reference cluster of size n. We then computed 100 random association matrices by taking each possible cell-to-cell association and assigning a 0 or 1 with equal probability, i.e., a cell was equally likely to associate or not with any other cell. The noise consensus matrix was computed from these 100 random association matrices. From the noise consensus matrix, the score, stability and promiscuity metrics were calculated both per cell and per cluster. This computation was repeated 100 times for each set of m and n parameters to describe the null distribution of BEARscc metrics at the cell and cluster level.
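A small simulation along these lines is sketched below in Python. It is a simplified illustration and not the authors' code: only the cluster-level metrics are computed, the function names, the seed and the values m = 20 and n = 5 are assumptions chosen for the example, the reference cluster is taken to be the first n cells, and n is assumed to satisfy 2 ≤ n < m.

```python
import random
from statistics import mean


def random_consensus(m, n_matrices, rng):
    """Average n_matrices random symmetric 0/1 association matrices."""
    totals = [[0] * m for _ in range(m)]
    for _ in range(n_matrices):
        for i in range(m):
            for j in range(i + 1, m):
                bit = rng.randint(0, 1)  # associate or not, equal probability
                totals[i][j] += bit
                totals[j][i] += bit
    consensus = [[totals[i][j] / n_matrices for j in range(m)] for i in range(m)]
    for i in range(m):
        consensus[i][i] = 1.0  # a cell always shares a cluster with itself
    return consensus


def cluster_metrics(consensus, inside):
    """Cluster-level (stability, promiscuity, score); needs 2 <= n < m."""
    members = set(inside)
    outside = [j for j in range(len(consensus)) if j not in members]
    stability = mean(consensus[a][b]
                     for pos, a in enumerate(inside) for b in inside[pos + 1:])
    if len(inside) < len(outside):
        outside = sorted(outside,
                         key=lambda j: mean(consensus[i][j] for i in inside),
                         reverse=True)[:len(inside)]
    promiscuity = mean(consensus[i][j] for i in inside for j in outside)
    return stability, promiscuity, stability - promiscuity


def null_distribution(m, n, repeats=100, n_matrices=100, seed=0):
    """Repeat the random-consensus computation for one (m, n) setting."""
    rng = random.Random(seed)
    reference = list(range(n))  # arbitrary reference cluster of size n
    return [cluster_metrics(random_consensus(m, n_matrices, rng), reference)
            for _ in range(repeats)]


# Fewer repeats than the 100 described above, to keep the example quick.
null = null_distribution(m=20, n=5, repeats=20)
null_stabilities = [stability for stability, _, _ in null]
```

Under this null model the stability values concentrate around 0.5, which gives a baseline against which observed cluster metrics can be judged.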
Evaluation of BEARscc runtime
To estimate the runtime requirements of BEARscc, a random sampling of 2000 samples and 20,000 genes was taken with replacement from the brain whole tissue counts matrix. This large counts matrix was subset to an increasing number of genes (63, 504, 700, 1008, 1800 and 2016) and cells (6, 12, 25, 36, 50 and 72) to generate counts matrices with increasing numbers of elements. The BEARscc functions estimate_noiseparameters and simulate_replicates were run on each counts matrix on a standard desktop PC (Intel i7, 2.9 GHz). The microbenchmark R package was used to measure the execution time for each counts matrix subset.
Estimation of cluster number k
To determine the cluster number, k, from the hierarchical clustering of the noise consensus matrix, the resulting dendrogram was cut multiple times to form N clusterings with cluster numbers from k = 1 to k = N. The average score metric was computed for each clustering, and the k with the maximum average score metric was chosen. Evaluating all possible k from 1 to the number of cells in the experiment is computationally expensive and unlikely to be biologically meaningful. In this work, N was capped at 0.1 times the number of cells in the experiment: N = 10 for the brain and blanks control, N = 30 for the murine intestine experiment, and N = 300 for the murine brain data.
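The selection of k can be sketched as follows in Python. Several choices here are assumptions not stated in the text: average linkage on the distance 1 minus consensus is used for the hierarchical clustering, single-cell clusters are given a score of 0, the average score is taken over clusters, and all function names and the toy matrix are hypothetical.

```python
from statistics import mean


def average_linkage_partitions(consensus, max_k):
    """Agglomerate cells; record the partition for every k <= max_k."""
    clusters = [[i] for i in range(len(consensus))]
    partitions = {}
    while True:
        if len(clusters) <= max_k:
            partitions[len(clusters)] = [list(c) for c in clusters]
        if len(clusters) == 1:
            return partitions
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average-linkage distance, with distance = 1 - consensus.
                d = mean(1.0 - consensus[i][j]
                         for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]


def cluster_score(consensus, inside):
    """Stability minus promiscuity for one cluster."""
    if len(inside) < 2:
        return 0.0  # assumption: singletons carry no stability evidence
    members = set(inside)
    outside = [j for j in range(len(consensus)) if j not in members]
    stability = mean(consensus[a][b]
                     for pos, a in enumerate(inside) for b in inside[pos + 1:])
    if not outside:
        return stability  # assumption: promiscuity is 0 when k = 1
    if len(inside) < len(outside):
        outside = sorted(outside,
                         key=lambda j: mean(consensus[i][j] for i in inside),
                         reverse=True)[:len(inside)]
    return stability - mean(consensus[i][j] for i in inside for j in outside)


def choose_k(consensus, max_k):
    """Return the k with the highest mean cluster score, plus all scores."""
    partitions = average_linkage_partitions(consensus, max_k)
    scores = {k: mean(cluster_score(consensus, c) for c in part)
              for k, part in partitions.items()}
    return max(scores, key=scores.get), scores


# Hypothetical noise consensus matrix for 6 cells.
toy_consensus = [
    [1.00, 0.75, 0.50, 0.25, 0.00, 0.00],
    [0.75, 1.00, 0.50, 0.25, 0.00, 0.00],
    [0.50, 0.50, 1.00, 0.75, 0.75, 0.75],
    [0.25, 0.25, 0.75, 1.00, 1.00, 1.00],
    [0.00, 0.00, 0.75, 1.00, 1.00, 1.00],
    [0.00, 0.00, 0.75, 1.00, 1.00, 1.00],
]
best_k, scores = choose_k(toy_consensus, max_k=3)
```

Capping max_k keeps the number of evaluated clusterings small, mirroring the cap on N described above.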
Gene sampling
For comparison with BEARscc, 100 subsampling iteration matrices for n cells and m genes were generated by sampling one half of the expressed genes and were clustered using the respective clustering algorithm (SC3, BackSPIN, RaceID2). For each data set, genes with fewer than 25 total raw counts across all samples in the cohort were excluded. The remaining genes formed the sample space. In each subsampling iteration, one half of the genes were sampled without replacement, and their expression across cells was used as the counts matrix. As in the computation of the BEARscc noise consensus matrix, cluster labels were used to compute an n × n binary association matrix for each clustering, and an arithmetic mean was taken for each respective element across the resulting 100 association matrices to produce an n × n subsampling consensus matrix. As in the BEARscc analysis, the BEARscc score metric was used to determine the cluster number k, and the resulting cluster labels for each data set and algorithm were compared with those of BEARscc by computing the adjusted Rand index for each with respect to the relevant ground truth.
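The filtering and subsampling steps can be sketched in Python as shown below. The gene names, counts, seed and function names are hypothetical, and rounding down when the number of retained genes is odd is an assumption; the clustering of each subsampled matrix is left out.

```python
import random


def filter_genes(counts, min_total=25):
    """Drop genes with fewer than min_total raw counts across all cells."""
    return {gene: row for gene, row in counts.items() if sum(row) >= min_total}


def subsampled_matrices(counts, n_iter=100, seed=0):
    """Build n_iter counts matrices, each holding half of the retained genes."""
    rng = random.Random(seed)
    kept = filter_genes(counts)
    genes = sorted(kept)
    half = len(genes) // 2  # assumption: round down for odd gene counts
    matrices = []
    for _ in range(n_iter):
        chosen = rng.sample(genes, half)  # sampling without replacement
        matrices.append({gene: kept[gene] for gene in chosen})
    return matrices


# Hypothetical raw counts: gene -> counts in three cells.
toy_counts = {
    "geneA": [10, 20, 5],
    "geneB": [0, 1, 2],
    "geneC": [30, 0, 0],
    "geneD": [8, 9, 8],
    "geneE": [12, 12, 12],
    "geneF": [3, 3, 3],
}
retained = filter_genes(toy_counts)
matrices = subsampled_matrices(toy_counts)
```

Each subsampled matrix would then be clustered and its labels converted to an association matrix in the same way as for the simulated technical replicates.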
Code availability
BEARscc [https://doi.org/10.18129/B9.bioc.BEARscc] is freely available as an R package through Bioconductor^{18}.
Data availability
The single-cell RNA-sequencing counts from primary murine brain and intestine are available on the Linnarsson laboratory data repository [http://linnarssonlab.org/cortex/] and the van Oudenaarden laboratory GitHub repository [https://github.com/dgrun/StemID], respectively. The single-cell RNA-sequencing raw reads from primary C. elegans embryo are available on the SRA^{16} sequencing database through the GEO data repository with accession number GSE77944 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE77944]. The raw read counts for the brain and blanks experiment are available on the GEO data repository with accession number GSE95155 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE95155].
References
1. Grün, D. et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 525, 251–255 (2015).
2. Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. 34, 1145–1160 (2016).
3. Tirosh, I. et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352, 189–196 (2016).
4. Grün, D., Kester, L. & van Oudenaarden, A. Validation of noise models for single-cell transcriptomics. Nat. Methods 11, 637–640 (2014).
5. Kim, J. K., Kolodziejczyk, A. A., Illicic, T., Teichmann, S. A. & Marioni, J. C. Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression. Nat. Commun. 6, 8687–8688 (2015).
6. Hicks, S. C., Townes, F. W., Teng, M. & Irizarry, R. A. Missing data and technical variability in single-cell RNA-sequencing experiments. Preprint at: https://doi.org/10.1093/biostatistics/kxx053 (2017).
7. Jiang, L. et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res. 21, 1543–1551 (2011).
8. Vallejos, C. A., Marioni, J. C. & Richardson, S. BASiCS: Bayesian analysis of single-cell sequencing data. PLoS Comput. Biol. 11, e1004333–18 (2015).
9. Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1095 (2013).
10. Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Stem Cell 19, 266–277 (2016).
11. Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).
12. Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
13. Tibshirani, R., Walther, G. & Hastie, T. Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B Stat. Methodol. 63, 411–423 (2001).
14. Kiselev, V. Y. et al. SC3: Consensus clustering of single-cell RNA-seq data. Nat. Methods 14, 483–486 (2017).
15. Tintori, S. C., Osborne Nishimura, E., Golden, P., Lieb, J. D. & Goldstein, B. A transcriptional lineage of the early C. elegans embryo. Dev. Cell 38, 430–444 (2016).
16. Leinonen, R., Sugawara, H. & Shumway, M. International Nucleotide Sequence Database Collaboration. The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2011).
17. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
18. Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).
19. Picelli, S. et al. Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. 9, 171–181 (2014).
20. Hicks, S. C., Teng, M. & Irizarry, R. A. On the widespread and critical impact of systematic bias and batch effects in single-cell RNA-Seq data. Preprint at: https://doi.org/10.1101/025528 (2015).
Acknowledgements
We thank Mary Muers, Andy Roth and Chris Ponting for careful reading of the manuscript, and Rory Bowden, Amy Trebes and the High-Throughput Genomics team at the Wellcome Trust Centre for Human Genetics for assistance with sequencing. All authors acknowledge support from Ludwig Cancer Research. D.T.S. was supported by Nuffield Department of Clinical Medicine and the Clarendon Fund. M.J.W. was supported by Cancer Research UK. M.J.W. and R.P.O. received funding from the NIHR Biomedical Research Centre. R.P.O. received funding from Oxford Health Services Research Committee and Oxford University Clinical Academic Graduate School.
Author information
Contributions
D.T.S. conceived and implemented the computational approach under the supervision of B.S.B. M.J.W., R.P.O. and X.L. designed the initial experimental study that led to the development of the presented approach and included the sequencing of brain and blanks samples. B.S.B. and D.T.S. prepared the manuscript, with contributions from all authors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Severson, D.T., Owen, R.P., White, M.J. et al. BEARscc determines robustness of single-cell clusters using simulated technical replicates. Nat Commun 9, 1187 (2018). https://doi.org/10.1038/s41467-018-03608-y