Determining sequencing depth in a single-cell RNA-seq experiment

An underlying question for virtually all single-cell RNA sequencing experiments is how to allocate the limited sequencing budget: deep sequencing of a few cells or shallow sequencing of many cells? Here we present a mathematical framework which reveals that, for estimating many important gene properties, the optimal allocation is to sequence at a depth of around one read per cell per gene. Interestingly, the corresponding optimal estimator is not the widely-used plug-in estimator, but one developed via empirical Bayes.


Supplementary Figures
Supplementary Figure 1. Gene relative abundance curve. Panels a-c show the relative abundance curves for the brain datasets, 293T datasets, and 3T3 datasets, respectively. These panels, along with the upper-left panel of Figure 1c, imply that the relative abundance of genes is consistent within the same tissue across different experiments. We note that brain_2k appears to be an outlier in panel a, as well as in other experiments, e.g., Supplementary Figure 2.

Supplementary Figure 2. Optimal trade-off. Panels a-b show two other genes (MS4A1, CD4) that we consider for pbmc_4k. Panel c is for pbmc_8k, where we consider the gene CD4 as in panel b. Panels d-g are for the brain datasets (brain_1k, brain_2k, brain_9k, brain_1.3m), where we consider the gene S100a10 (see Supplementary Figure 1a for the location of this gene on the relative abundance curve). All panels show that the current datasets should have been sequenced slightly deeper to achieve the optimal budget allocation.

Supplementary Figure 3. Determining the shape parameter r for the gamma distribution. Computing the theoretical error curve (Figure 1b, Supplementary Figure 2) requires knowing the shape parameter r of the gamma distribution, which is determined by the mean-variance regression. Specifically, the overdispersion model assumes that Var = mean + (1/r) mean². We can then perform the log-scale regression log(Var − mean) = 2 log(mean) − log r, where r corresponds to the negative log-intercept. Panels a-f show the regression results on 6 different datasets, where the slopes are all very close to 2, indicating that the mean-variance model is appropriate. The estimated values of the shape parameter r lie between 1 and 2.5; hence we choose r = 1.5 for computing all theoretical error curves.

Supplementary Figure 4. Simulation for post-hoc analysis for PBMCs. We use the empirical distribution of marker genes in pbmc_4k as the true gene distribution, including IL7R, CD3G, CD3E, CD3D, LCK, NKG7, GZMA, CST7, CD79A, MS4A1, S100A8, S100A9, MNDA, FGL2. To provide a single-gene-level error characterization, the gene distribution is normalized so that each gene has the same mean expression level. Then the data is generated according to model (2). Without loss of generality, we choose k to be the number of genes for the inactive probability as well as the pairwise inactive probability. All experiments are repeated 20 times and 3-std confidence intervals are provided. The simulation results are used to generate the post-hoc guidance table (Figure 2b), where the smallest mean reads (per cell per gene) such that the relative error is smaller than 10% is selected as the reliable detection threshold. We note that 10% corresponds to −2 for the log10 relative error (squared) and −1 for the others (log10 W1, log10 cosine).

Supplementary Figure 8. Consistency plot for cv. Genes with mean read counts smaller than 0.1 are discarded prior to the experiment. We note that the variation of the EB estimates is higher when the estimated values are small. This is because the plug-in estimator introduces an artificial bias that reduces the variance. In other words, despite the higher variation, the EB estimates are still preferred since they are closer to the biological truth. See Variation of the EB estimates in Supplementary Note 4 for more details.

To compute the W1 distance, we match the two distributions to have the same mean and then compute the W1 distance according to the standard definition. We also match the means of the EB distribution and the original distribution for visualization purposes. In most cases, the biological relation for the gene pairs can be inferred directly from the corresponding gene names. a) EIF3F and EIF3L encode the F and L subunits of eukaryotic translation initiation factor 3. b) COX7B and COX5B encode the 7B and 5B subunits of cytochrome c oxidase. c) CDC42SE1 and CDC42SE2 are paralogous genes encoding CDC42 small effector proteins 1 and 2. d) PPP1CA and PPP1R18 encode subunits of protein phosphatase 1. e-f) PSMA7, PSME1, PSME2, and PSMB9 encode subunits of the proteasome.

Supplementary Figure 14. The sequencing budget trade-off for estimating cv (left) and inactive probability (right, k = 1, i.e., estimating the zero proportion at one read per cell) for the gene VGF. The relative error is evaluated against the gold-standard smFISH result. 3-std confidence intervals are provided.
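The mean-variance regression described in the caption of Supplementary Figure 3 can be sketched in a few lines of numpy. This is a minimal illustration on simulated gamma-Poisson counts, not the paper's datasets; the cell/gene numbers and the mean range are arbitrary choices, with ground truth r = 1.5 planted for the check:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate gamma-Poisson (negative binomial) counts for many genes with r = 1.5.
r_true = 1.5
n_cells, n_genes = 5000, 200
means = np.exp(rng.uniform(np.log(0.5), np.log(20.0), n_genes))  # per-gene mean reads
# X_cg ~ Gamma(shape=r, scale=mean/r), Y_cg | X_cg ~ Poi(X_cg)
lam = rng.gamma(r_true, means / r_true, size=(n_cells, n_genes))
Y = rng.poisson(lam)

m = Y.mean(axis=0)
v = Y.var(axis=0)
keep = v > m  # need positive overdispersion for the log transform
# log(Var - mean) = 2*log(mean) - log(r): read r off the negative log-intercept
slope, intercept = np.polyfit(np.log(m[keep]), np.log(v[keep] - m[keep]), 1)
r_hat = np.exp(-intercept)
print(round(slope, 2), round(r_hat, 2))  # slope near 2, r_hat near 1.5
```

On real data the same regression is run on the per-gene empirical means and variances of the UMI count matrix.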

(Panels: Zheng, Klein, and Svensson datasets; columns: CV and inactive probability.)
Supplementary Figure 15. Estimated mean, cv, and inactive probability (k = 5 n_reads) for three ERCC datasets. For ERCCs, the expected number of molecules can be estimated from the known ERCC concentration and the dilution factor. Both cv and inactive probability are expected to be close to zero since the number of ERCC molecules should be the same for all cells. The post-hoc thresholds (Figure 2b, Details of the ERCC experiments in Supplementary Note 6) are shown with a black dashed line. As we can see, beyond this threshold the EB estimates are very close to the ground truth. In contrast, the plug-in estimates are inflated due to the technical noise and become close to the EB estimates only for the ERCCs with more UMIs.

Supplementary Figure 16. Pure RNA controls (human lymphoblastoma, K562). The Klein dataset contains pure RNA controls that are also expected to display very small biological variability across cells. The estimated variance and cv for all genes are shown in the panel, where the EB estimates are very close to zero while the plug-in estimates are inflated due to the technical variation.

Supplementary Figure 17. Trade-off simulation for the Svensson data. The sequencing budget trade-off for estimating cv (left) and inactive probability (right), where the optimal depth is below 0.5 for estimating cv and 0.5-1 for estimating the inactive probability. We note that the flat area in the left panel arises because the estimated cv is 0 and is truncated to have a finite y-axis value. The details of the simulation are as follows. The trade-off curve is averaged over 6 ERCC spike-ins in the Svensson data that have enough UMIs to subsample from and have similar depths (average UMIs between 10 and 20). The error is averaged over the 6 ERCCs and is evaluated against the ground truth, i.e., 0 for both cv and inactive probability. For example, the error for cv is log10(mean(ĉv²)), where the ground truth is zero and the mean is over the 6 ERCCs. The inactive probability is estimated using k = 5 n_reads. All points are simulated with 100 repetitions and 3-std confidence intervals are provided.

Supplementary Figure 18. To assess the sensitivity of using reference data to estimate the detection limit p* for the proposed experimental design procedure, we consider four different types of reference data based on the biological sample and the sequencing technology (first row). Here, "in-sample" means that the reference data is from the same biological sample as the data for the current study, e.g., the reference data may come from the pilot experiment. Similarly, "out-of-sample" means that the reference data is from a different biological sample, e.g., obtained from an independent replicate or some other publicly available dataset. To simulate these four scenarios, we consider four additional PBMC datasets from a recent study 1 (rows 2-3). Specifically, 10x2A_pbmc1 and 10x2B_pbmc1 are two scRNA-seq datasets generated with the 10x v2 technology, the same as the other 10x datasets considered in the paper. bulk_pbmc1 and bulk_pbmc2 are two bulk RNA-Seq datasets.
The first three datasets, i.e., 10x2A_pbmc1, 10x2B_pbmc1, and bulk_pbmc1, are from the same biological sample. These four datasets, along with the pbmc_4k dataset considered earlier in the paper, cover all four reference data types considered in this simulation. In addition, to provide a fair comparison and to match the smaller scale of a pilot experiment, we subsample the two scRNA-seq reference datasets, i.e., 10x2B_pbmc1 and pbmc_4k, to 500 cells and 2000 reads per cell each.
The results are shown in Supplementary Figure 19 for the four reference data types, respectively. As shown in the left panels, there is good concordance of the relative abundance between the reference data and the data for the current study for each gene. Furthermore, in the right panels, we consider each gene separately as the gene of interest at the detection limit p* and assume the true value of p* is given by the value from the data for the current study. Using this value of p* in the proposed experimental design procedure, the corresponding ideal estimation error is computed using the closed-form formula (Figure 1b). Not knowing p* in reality before the experiment, we estimate it using the reference data and use the estimated value in the proposed experimental design procedure (Figure 1c) to determine the sequencing depth, resulting in a reference-based error which is greater than the ideal error. We show the ratio between the reference-based error and the ideal error in the right panels. As we can see, the ratio is small for most genes above the rare-gene boundary (7 transcripts per cell for a cell with 300k transcripts in total; see Experimental design in the Methods section). Specifically, of all the genes above the rare-gene boundary, 99%, 95%, 92%, and 90% have an error ratio < 1.5 for the four reference data types, respectively. We hence conclude that all four types of reference data can be used to accurately determine the optimal sequencing depth. We also note that, in terms of preference for the reference data, in-sample scRNA-seq is most preferable while out-of-sample bulk RNA-Seq is least preferable.

Consider a single-cell RNA-seq (scRNA-seq) experiment where we sequence n_cells cells with n_reads reads per cell on average. We note that, throughout the paper, by read counts we mean the unique molecular identifier (UMI) counts, and by sequencing depth we mean the total number of UMIs per cell. The total number of reads can be written as B = n_cells × n_reads and can be thought of as the budget we have for this experiment. This budget is directly related to the resources we can spend on this particular experiment; in many scientific studies, there is usually a constraint on this budget. Fixing the budget B, there are different ways to split it between n_cells and n_reads: we can sequence many cells with few reads per cell or a few cells with many reads per cell. Given some common statistical inference tasks, we are interested in the optimal allocation of the budget B between n_cells and n_reads that achieves the smallest estimation error, which we call the sequencing budget allocation problem. The data can be summarized by an n_cells × G cell-gene matrix, where G is the number of genes and the c-th row, Y_c ∈ Z^G, represents the observed read counts for cell c, c = 1, ..., n_cells. We assume that the data is generated via the following statistical model.

For cell c, let its relative gene expression level be represented as X_c = [X_c1, ..., X_cG] ∈ Δ^G, where Δ^G represents the probability simplex, which imposes the constraint that ∑_{g=1}^G X_cg = 1. We note that X_c is only associated with the biology and is independent of the sequencing process. Without loss of generality, we assume that the X_c's are samples drawn independently and identically distributed (i.i.d.) from some unknown cell distribution P_X, i.e., X_c ~ i.i.d. P_X for c ∈ [n_cells], where we use the standard notation [n] = {1, 2, ..., n}. P_X may have mixed components, accounting for different sub-types in the cell population. The variation among the X_c accounts for the biological variation of the data.

The gene expression level X_c is measured by the observed read counts Y_c via sequencing, for which the cell size factors γ_c, accounting for different sequencing depths across cells, are assumed to be generated i.i.d. from some unknown distribution P_γ with the additional constraint that E[γ_c] = 1. For each cell c, given the gene expression level X_c and the size factor γ_c, first the total number of reads of this cell follows a Poisson distribution, n_reads,c ~ Poi(γ_c n_reads)*; second, each read goes to gene g with probability X_cg. This hierarchical Poisson multinomial model can be equivalently written as an independent Poisson sampling model 4: Y_cg | X_cg, γ_c ~ Poi(γ_c n_reads X_cg), independently across genes. To summarize, the complete statistical model can be written as

X_c ~ i.i.d. P_X, γ_c ~ i.i.d. P_γ, Y_cg | X_cg, γ_c ~ Poi(γ_c n_reads X_cg). (6)
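The equivalence between the hierarchical Poisson-multinomial scheme and independent Poisson sampling can be checked numerically. The sketch below (with arbitrary toy values for γ_c, n_reads, and a 3-gene X) compares the first two moments of the two schemes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy check: "Poi total, then multinomial split" matches independent
# Poisson sampling Y_g ~ Poi(gamma_c * n_reads * X_g) in distribution.
gamma_c, n_reads = 1.2, 50.0
X = np.array([0.5, 0.3, 0.2])   # relative expression of 3 genes (sums to 1)
n_sim = 100_000

# Scheme A: hierarchical Poisson-multinomial
totals = rng.poisson(gamma_c * n_reads, size=n_sim)
A = np.array([rng.multinomial(t, X) for t in totals])

# Scheme B: independent Poissons
B = rng.poisson(gamma_c * n_reads * X, size=(n_sim, 3))

print(A.mean(axis=0), B.mean(axis=0))  # both close to gamma_c * n_reads * X
print(A.var(axis=0), A.mean(axis=0))   # Poisson marginals: variance = mean
```

The per-gene means agree and, consistent with Poisson marginals, the variance of each gene's count equals its mean under both schemes.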

* Note that here n_reads has a slightly different interpretation as the expected number of reads per cell rather than the actual average number of reads per cell. This does not make a big difference in practice since the two are very similar due to the concentration of the Poisson distribution.

The sequencing depth n_reads and the size factor γ_c can be estimated with much higher precision as compared to the other quantities. Therefore, unless otherwise specified, we treat them as known quantities in the analysis. The estimators for n_reads and γ_c can be written as

n̂_reads = (1/n_cells) ∑_c ∑_g Y_cg, γ̂_c = n_reads,c / n̂_reads.

Computing sequencing depth and size factor for a scRNA-seq dataset
Here, the sequencing depth n_reads is estimated by a simple average of the entire read count matrix over the cells, which is very accurate since it utilizes all the data. The size factor γ_c, in spite of being a random variable, can be pinned down with high accuracy given the data. Recall that the total number of reads for each cell follows a Poisson model, [n_reads,c | γ_c, n_reads] ~ Poi(γ_c n_reads). Hence, when γ_c n_reads is large, we know that n_reads,c ≈ γ_c n_reads. Since n_reads,c is observed and n_reads can be accurately estimated, γ_c can also be accurately estimated. In fact, as further elaborated in Remark 3, postponed to Supplementary Note 7, the estimation error is 1/√(n_cells ∨ n_reads) with high probability. To put this in context, consider a typical scRNA-seq experiment where n_cells ~ 5k, n_reads ~ 5k, and G ~ 30k. Since ∑_g X_cg = 1, the expression level for each gene is roughly X_cg ~ 1/G. Then the read count for an entry is Y_cg ~ X_cg n_reads ~ n_reads/G, which is usually a single-digit number and often zero. Hence, as compared to the estimation error of n_reads and γ_c, it is much more challenging to estimate the relative gene expression level X_cg.
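Concretely, the estimators n̂_reads and γ̂_c above amount to row sums and a global average of the count matrix. A minimal sketch (the tiny count matrix is made up for illustration):

```python
import numpy as np

def depth_and_size_factors(Y):
    """Estimate the sequencing depth n_reads and the per-cell size factors
    gamma_c from an (n_cells x G) UMI count matrix, as described above."""
    per_cell = Y.sum(axis=1)            # n_reads,c: total UMIs of each cell
    n_reads_hat = per_cell.mean()       # average of the whole matrix over cells
    gamma_hat = per_cell / n_reads_hat  # gamma_c ~= n_reads,c / n_reads
    return n_reads_hat, gamma_hat

# Tiny 3-cell, 4-gene illustration (row sums 4, 6, 2)
Y = np.array([[2, 0, 1, 1],
              [3, 1, 1, 1],
              [1, 0, 1, 0]])
n_reads_hat, gamma_hat = depth_and_size_factors(Y)
print(n_reads_hat, gamma_hat)  # n_reads ~ 4.0, gamma ~ [1.0, 1.5, 0.5]
```

Note that the estimated size factors average to 1 by construction, matching the model constraint E[γ_c] = 1.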
The gene moments often play an important role in data pre-processing, feature selection, and gene-type identification. For example, the first moment (mean) can be used to filter out lowly expressed genes.

3. The inactive probability for gene g, with the definition p_0,g(k) = E[exp(−k X_cg)]. The inactive probability is used to quantify the proportion of cells where gene g is inactive and can be viewed as a proxy of the zero probability as discussed in 19. In 19, the authors consider the probability that a gene has zero read count; our definition generalizes this to a range of quantities specified by different k's. As a special case, k = ∞ corresponds to the probability that X_cg is zero. The smoothing function exp(−kx) is used because, even for a cell population where a gene is not supposed to express, we often observe some low expression level.

4. The pairwise inactive probability for the gene pair g1, g2, with the definition p_0,g1g2(k) = E[exp(−k(X_cg1 + X_cg2))]. Similar to the previous case, the pairwise inactive probability quantifies how often the two genes are inactive (or, equivalently, active) at the same time. This quantity can be used to analyze the gene co-expression network, e.g., in 19.

5. The marginal gene distribution P X g for each gene. This quantity is considered in 11 .

For all quantities listed above, we study the optimal budget split between the number of cells n_cells and the sequencing depth n_reads that achieves the smallest error. Interestingly, we found that this optimal budget split is the same for all these quantities, which, in plain English, can be stated as follows. For any budget B, set n_reads to some constant value and then let n_cells = B/n_reads, where this constant may depend on the quantity we consider to estimate but is independent of the sequencing budget B. Mathematically,

Theorem 1. (Optimal budget allocation, informal) For estimating the gene moments, the covariance matrix, the inactive probability, the pairwise inactive probability, and the distribution, the optimal budget allocation is n_reads = Θ(1) and n_cells = Θ(B).
This optimality is in the minimax sense over a family of distributions P_X with mild assumptions, and the optimal error rate is achieved by the empirical Bayes (EB) estimators.

Supplementary Note 3 Experimental Design
Theorem 1 implies that the optimal depth n*_reads is a constant independent of the sequencing budget B. The exact value of this constant is further investigated in this section.

First we consider the widely-used overdispersion model, where for each gene g, the read counts Y_cg are assumed to follow a negative binomial distribution 7, 20, 21. Since the negative binomial distribution can be derived as a gamma-Poisson mixture, the resulting model can be viewed as a special case of the hierarchical model (6) in which the underlying gene expression marginal distribution is the gamma distribution†, i.e., X_cg ~ Gamma(r_g, θ_g). This is analyzed in detail in Supplementary Note 5, with the corresponding EB estimators θ̂_g^EB and r̂_g^EB given in (53) and the estimation error in Lemma 2. For the case of estimating the gamma parameters, the exact value of the optimal depth n*_reads can be derived from these error expressions. Specifically, first we have that c_g,1 ~ c_g,2 ~ 1 since the size factors are close to 1. Second, the shape parameter r_g can be estimated via the mean-variance regression (Supplementary Figure 3). Here we have defined the error to be the average of MSE(θ̂_g^EB) and MSE(r̂_g^EB) without loss of generality; using either one of the two terms yields a similar result. It can further be inferred that the optimal depth is ~1 read per cell per gene for estimating the gamma parameters. Here we note that the ~1 read refers to the quantity p·n_reads, i.e., the mean reads for the gene under consideration, rather than n_reads, the UMIs for all genes in a cell.

Then how about estimating other quantities? Lemma 2 also shows that the optimal depth is around 1 read per cell … for PBMCs, 25k for mouse brain, and 100k for cell lines.

An orthogonal calculation shows a similar result. Specifically, the maximum sequencing depth can be estimated as

maximum sequencing depth = total transcripts per cell × UMI efficiency. (16)

† The constraint X_cg < 1 can be neglected here without loss of generality. This is because the relative expression X_cg is of the order 1/G, which is much smaller than 1. With a mean much smaller than 1, the truncated gamma distribution with truncation at 1 is very close to the untruncated version.
For the total transcripts per cell, since there are roughly N = 200-400k transcripts (mRNAs) in a typical mammalian cell 22, and given that some cells (e.g., PBMCs) may have a much lower mRNA content, we can conservatively use N = 100-300k without loss of generality. For the UMI efficiency, it is reported to be 8% for the External RNA Controls Consortium synthetic RNAs (ERCCs) for the v1 chemistry (Figure 2d in 3). However, it is found that the UMI efficiency for endogenous genes is usually much higher than that for ERCCs, since endogenous genes have longer poly(A) tails, which makes cDNA conversion easier 23. Considering the fact that the v2 chemistry is believed to have a higher UMI efficiency, a conservative estimate yields a UMI efficiency of 10-15% for sequencing endogenous genes with the v2 chemistry (current), which translates to a maximum of 10-45k UMIs per cell. In addition, it is reported that other technologies can potentially sequence at a much higher depth than the 10x technology 23.
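The calculation in equation (16) is simple arithmetic; the sketch below tabulates it over the conservative ranges quoted above (N = 100-300k transcripts, 10-15% UMI efficiency):

```python
# maximum sequencing depth = total transcripts per cell x UMI efficiency (eq. (16))
for n_transcripts in (100_000, 300_000):   # conservative range of N from the text
    for efficiency in (0.10, 0.15):        # assumed v2 UMI efficiency range
        max_depth = int(n_transcripts * efficiency)
        print(f"N={n_transcripts}, eff={efficiency}: max {max_depth} UMIs/cell")
# The endpoints give 10k and 45k UMIs per cell, matching the 10-45k range above.
```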

Under the recommended sequencing depth (e.g., n_reads = 14k for brain_9k), the Poisson model (6) is a good approximation of the actual sequencing process. Consider a typical case where there are N = 200k mRNAs per cell and the UMI efficiency is η = 0.1. For a gene (e.g., gene 1) with N_1 transcripts in the cell, its corresponding UMI count, sequenced at the depth of n_reads UMIs per cell, follows a hypergeometric distribution, Y_cg1 ~ HGeom(N, N_1, n_reads), where Y_cg1 can be understood as the number of transcripts sequenced for gene 1 while randomly selecting n_reads transcripts from a population of N transcripts containing N_1 transcripts for gene 1. We note that this essentially assumes that each transcript molecule is equally likely to be sequenced; this assumption is also used in other works, e.g., 11, 24.

Since n_reads = 14k is roughly 7% of N, the hypergeometric distribution can be well approximated by the binomial distribution, which is in turn close to the Poisson model.

In this section we introduce the principles for designing empirical Bayes estimators for different distributional quantities. For simplicity, let us ignore the size factor γ_c and the sequencing depth n_reads for now by considering the following statistical model:

X_cg ~ P_X_g, Y_cg | X_cg ~ Poi(X_cg). (18)

We note that both quantities can be accurately estimated and thus can easily be added back later.
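The hypergeometric-to-binomial-to-Poisson approximation can be checked numerically with scipy. In the sketch below, N = 200k and n_reads = 14k come from the text, while the gene size N1 = 100 is an arbitrary illustrative choice:

```python
import numpy as np
from scipy import stats

# Compare the three sampling models on a common support.
N, n_reads, N1 = 200_000, 14_000, 100   # population, draws, transcripts of gene 1
k = np.arange(0, 30)                    # covers essentially all mass (mean = 7)

p_hyper = stats.hypergeom.pmf(k, N, N1, n_reads)
p_binom = stats.binom.pmf(k, n_reads, N1 / N)
p_pois = stats.poisson.pmf(k, n_reads * N1 / N)

# Total variation distance between successive approximations
tv_hb = 0.5 * np.abs(p_hyper - p_binom).sum()
tv_bp = 0.5 * np.abs(p_binom - p_pois).sum()
print(round(tv_hb, 4), round(tv_bp, 4))  # both very small
```

Both distances are a few percent or less, so at this depth the simpler Poisson model loses essentially nothing.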

As a general recipe, the plug-in estimator uses the scaled (relative) read counts Y_1/n_reads, Y_2/n_reads, ... as a proxy for the true relative gene expression levels X_1, X_2, ..., effectively estimating the corresponding distributional quantities by "plugging in" the observed values. For example, the plug-in estimator estimates the mean of the gene expression distribution P_X by that of P_{Y/n_reads}, the variance of P_X by that of P_{Y/n_reads}, etc.

The problem with this is that the estimated results are usually overly variable due to the presence of the Poisson noise. Taking variance estimation as an example, the plug-in estimated variance can be written as

Var̂_plugin(X_g) = (1/n_cells) ∑_c (Y_cg − Ȳ_cg)²,

where Ȳ_cg is the sample mean of Y_cg. How does this compare to the variance of X_cg? A simple calculation gives that

E[Var̂_plugin(X_g)] ≈ Var(Y_cg) = Var(X_cg) + E[X_cg],

from which we can see that the plug-in estimate is overly inflated by the Poisson noise.

The EB estimators in general refer to estimators that are aware of the noise model (which is Poisson here) and correct the noise introduced by it. In the case of estimating the variance, the bias can easily be corrected by simply subtracting the mean, so the corresponding EB variance estimator can be written as

Var̂_EB(X_g) = Var̂(Y_g) − mean̂(Y_g).

There is no general principle for designing EB estimators, and they are usually developed in the literature on a case-by-case basis. As an example, for the marginal distribution, the plug-in estimator uses the empirical distribution of Y_cg (scaled by 1/n_reads), while the EB estimator uses the distribution P̂_X_g that most likely gives the empirical distribution of Y_cg via model (6).

Supplementary Table 1. A comparison of the plug-in estimator and the EB estimator for estimating various quantities, where the two estimators are deliberately written in similar forms for ease of comparison.
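The variance correction above can be verified on simulated data. A minimal sketch under the simplified model (18), where the gamma choice for P_X (and its parameters) is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simplified model (18): X_cg ~ P_X (here a gamma), Y_cg | X_cg ~ Poi(X_cg).
n_cells = 20_000
X = rng.gamma(2.0, 1.5, size=n_cells)   # true Var(X) = 2 * 1.5^2 = 4.5, E[X] = 3
Y = rng.poisson(X)

var_plugin = Y.var()            # inflated: ~ Var(X) + E[X] = 7.5
var_eb = Y.var() - Y.mean()     # EB correction: subtract the Poisson noise E[X]
print(round(var_plugin, 2), round(var_eb, 2))
```

The plug-in value sits near Var(X) + E[X] while the EB value recovers Var(X), illustrating the inflation discussed above.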

The EB framework also encompasses the overdispersion model that is widely used in scRNA-seq. As a brief history, it is observed in practice that the variance is often greater than the mean, i.e., Var(Y_g) > mean(Y_g), which is referred to as the overdispersion phenomenon 20, 21. In order to address this issue, the overdispersion model assumes a negative binomial distribution for the read counts, Y_g ~ NB(r_g, p_g), which has the mean-variance relationship Var(Y_g) = mean(Y_g) + (1/r_g) mean²(Y_g), and α_g = 1/r_g is called the overdispersion parameter 33, 34. This model was then adopted for scRNA-seq 7. Later on, since the negative binomial (NB) model does not capture the dropouts well, a zero-inflated version was proposed that incorporates a zero component into the negative binomial model 9.
Both the NB model and the zero-inflated NB model can be thought of as special cases of the empirical Bayes model (18) considered in this paper. Let θ_g = p_g/(1 − p_g). Then we can write the negative binomial (NB) model and the zero-inflated negative binomial (ZINB) model in the framework of the EB model as

NB: Y_cg ~ NB(r_g, p_g) ⟺ X_cg ~ Gamma(r_g, θ_g), Y_cg | X_cg ~ Poi(X_cg),
ZINB: Y_cg ~ ZINB(π_g, r_g, p_g) ⟺ X_cg ~ π_g δ_0 + (1 − π_g) Gamma(r_g, θ_g), Y_cg | X_cg ~ Poi(X_cg),

where π_g denotes the zero-inflation proportion and δ_0 the point mass at zero. In other words, they correspond to the empirical Bayes models with P_X_g defined to be the gamma distribution and the zero-inflated gamma distribution, respectively. Both models make parametric assumptions on the gene distribution P_X_g in order to estimate distributional quantities like the mean and variance. However, from the analysis of the EB framework we know that, for estimating these quantities, such parametric assumptions are not necessary. To put it another way, the fundamental part of scRNA-seq modelling is not the negative binomial distribution but the Poisson noise inside it.
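The gamma-Poisson correspondence for the NB model can be checked by simulation; the shape r = 1.5 and mean below are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(5)

# NB as a gamma-Poisson mixture: X ~ Gamma(r, theta), Y | X ~ Poi(X).
r, theta = 1.5, 2.0
n = 200_000
X = rng.gamma(r, theta, size=n)
Y = rng.poisson(X)

m, v = Y.mean(), Y.var()
# NB mean-variance relation: Var(Y) = mean(Y) + (1/r) * mean(Y)^2
v_pred = m + m ** 2 / r
print(round(m, 2), round(v, 2), round(v_pred, 2))  # v close to v_pred
```

The simulated variance matches mean + mean²/r, i.e., the overdispersion relation quoted above, without ever sampling from an NB distribution directly.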

The overdispersion parameter α_g is often estimated via the method of moments (MOM) by

α̂_g = (Var̂(Y_cg) − mean̂(Y_cg)) / mean̂²(Y_cg),

where mean̂(Y_cg) and Var̂(Y_cg) are the empirical mean and variance of the read counts. Interestingly, the square root of this estimate corresponds to the cv of P_X_g: under model (18), cv²(X_cg) = (Var(Y_cg) − mean(Y_cg)) / mean²(Y_cg). Since this relation is independent of the negative binomial model, √α̂_g is still a good estimate of cv(X_cg) even when the gamma assumption does not hold.
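That √α̂_g estimates cv(X_g) beyond the gamma model can be illustrated by simulation; here we deliberately use a lognormal expression distribution (an arbitrary choice, not a gamma):

```python
import numpy as np

rng = np.random.default_rng(3)

# Non-gamma expression distribution, Poisson observation noise.
n_cells = 100_000
X = rng.lognormal(mean=1.0, sigma=0.5, size=n_cells)
Y = rng.poisson(X)

# MOM overdispersion estimate; its square root targets cv(X).
alpha_hat = (Y.var() - Y.mean()) / Y.mean() ** 2
cv_hat = np.sqrt(alpha_hat)
cv_true = X.std() / X.mean()
print(round(cv_hat, 3), round(cv_true, 3))  # the two values nearly coincide
```

The agreement holds because (Var(Y) − mean(Y))/mean²(Y) = Var(X)/E[X]² for any P_X under Poisson noise, with no gamma assumption involved.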

Variation of the EB estimates
As a general phenomenon, the plug-in estimator adds an artificial inflation to the estimate and therefore reduces the relative variation; e.g., see the points in the middle-left panel of Figure 3a and Supplementary Figures 8-9. Since the axes are in log scale and the EB estimates have smaller values, the absolute variations of the two estimates are not very different; both are proportional to 1/√n_cells. It is only that the plug-in estimates have smaller relative fluctuations. However, the plug-in estimates are actually further away from the ground truth as compared to EB. This can be understood by considering the simplified model (18) in the context of estimating cv. In this case, the plug-in and the EB estimators can be written as

ĉv_plugin = √(Var̂(Y_g)) / mean̂(Y_g), ĉv_EB = √(Var̂(Y_g) − mean̂(Y_g)) / mean̂(Y_g).

When the true cv is close to zero, Var(Y_g) ≈ mean(Y_g), so the plug-in estimate, approximately 1/√(mean̂(Y_g)), will be a large value that is very different from the ground truth.

In this section we formally state and prove Theorem 1 while providing details of the design of the EB estimators. As a brief introduction to the minimax formulation 35 for the budget allocation problem, suppose we are interested in estimating some generic distributional quantity θ (determined by P_X_g) under the squared error (θ̂ − θ)². Since we do not know the gene distribution a priori, we would like to guarantee that this error is small for a large family of distributions P_X_g ∈ 𝒫. Thus we should consider the worst-case error sup_{P_X_g ∈ 𝒫} E[(θ̂ − θ)²]. In order to make it small, we can choose the budget allocation (n_cells, n_reads) as well as the estimator θ̂, where the best choice yields the smallest error:

inf_{n_cells, n_reads} inf_θ̂ sup_{P_X_g ∈ 𝒫} E[(θ̂ − θ)²],

which we call the minimax error or the minimax rate. Note that this minimax error is achieved by having the optimal budget allocation (n*_cells, n*_reads) and the optimal estimator θ̂* at the same time. Hence the optimal budget allocation is always accompanied by the corresponding optimal estimator, and when discussing one of them, one should always keep this correspondence in mind.

Before stating the main theorem, we define the big-O notation 36 for asymptotic analysis in the regime where the budget B → ∞. The big-O notation is widely used in asymptotic analysis and can be briefly summarized as follows. For two sequences {x_n}, {y_n}, where n = 1, 2, ...,

x_n = Θ(y_n) ⟺ ∃ c_1, c_2, n_0 > 0 s.t. ∀ n > n_0, c_1 y_n ≤ x_n ≤ c_2 y_n,
x_n = Ω(y_n) ⟺ ∃ c_1, n_0 > 0 s.t. ∀ n > n_0, x_n ≥ c_1 y_n.
The formal statement of Theorem 1 is as follows.

Theorem 2. Assume that there exist constants $0 < c_0 < c_1 < \infty$ such that the size factors satisfy $c_0 < \gamma_c < c_1$ for all $c$. Then the minimax rates for estimating the moments, pairwise moments, inactive probability, pairwise inactive probability, and distribution are all $\Theta(\frac{1}{B})$. Namely,
$$\inf_{n_{\text{cells}},\, n_{\text{reads}}}\; \inf_{\hat M_{11,g_1g_2}}\; \sup_{P_{X_{g_1g_2}}} \mathbb{E}\Big[\big(\hat M_{11,g_1g_2} - M_{11,g_1g_2}\big)^2\Big] = \Theta\Big(\frac{1}{B}\Big),$$
$$\inf_{n_{\text{cells}},\, n_{\text{reads}}}\; \inf_{\hat p_{0,g_1g_2}(k)}\; \sup_{P_{X_{g_1g_2}}} \mathbb{E}\Big[\big(\hat p_{0,g_1g_2}(k) - p_{0,g_1g_2}(k)\big)^2\Big] = \Theta\Big(\frac{1}{B}\Big),$$
and similarly for the other quantities, where $\mathcal{P}_{\exp}$ is an exponential family parametrized by $\alpha$ as defined later in (67). The minimax rates are achieved by the EB estimators (Supplementary Table 1).

We follow the standard sandwich approach to prove Theorem 2. Specifically, suppose we want to show the minimax rate
$$\inf_{n_{\text{cells}},\, n_{\text{reads}}}\; \inf_{\hat q}\; \sup_{P_{X_g}} \mathbb{E}\big[(\hat q - q)^2\big] = \Theta\Big(\frac{1}{B}\Big).$$
We can do so in two steps. First we show that there exists an allocation $(n^*_{\text{cells}}, n^*_{\text{reads}})$ and an estimator $\hat q^*$ for which the worst-case error is $O(\frac{1}{B})$. Second we show that for any allocation $(n^a_{\text{cells}}, n^a_{\text{reads}})$ and any estimator $\hat q^a$, the worst-case error is $\Omega(\frac{1}{B})$. The minimax rate can then be proved by combining the two. We call the first step the upper bound and the second the lower bound.

Moments and covariance
The $k$th moment of a gene $g$ can be written as $M_{k,g} = \mathbb{E}_{X_{cg} \sim P_{X_g}}\big[X_{cg}^k\big]$.

Recall that the Poisson factorial moments have the following property.

Fact 1. If $Y \sim \mathrm{Poi}(\lambda)$, then $\mathbb{E}\big[Y(Y-1)\cdots(Y-k+1)\big] = \lambda^k$.
Since $Y_{cg} | X_{cg} \sim \mathrm{Poi}(\gamma_c n_{\text{reads}} X_{cg})$, Fact 1 gives
$$\mathbb{E}\big[Y_{cg}(Y_{cg}-1)\cdots(Y_{cg}-k+1)\,\big|\,X_{cg}\big] = (\gamma_c n_{\text{reads}} X_{cg})^k.$$
Then, naturally, an unbiased estimator for $M_{k,g}$ can be derived by averaging the normalized factorial moments over all cells:
$$\hat M^{\text{EB}}_{k,g} = \frac{1}{n_{\text{cells}}} \sum_{c=1}^{n_{\text{cells}}} \frac{Y_{cg}(Y_{cg}-1)\cdots(Y_{cg}-k+1)}{(\gamma_c n_{\text{reads}})^k}.$$
This also generalizes to the pairwise case. Suppose we are interested in estimating the first pairwise moment $M_{11,g_1g_2} = \mathbb{E}[X_{cg_1} X_{cg_2}]$. Then the corresponding unbiased estimator can be written as
$$\hat M^{\text{EB}}_{11,g_1g_2} = \frac{1}{n_{\text{cells}}} \sum_{c=1}^{n_{\text{cells}}} \frac{Y_{cg_1} Y_{cg_2}}{(\gamma_c n_{\text{reads}})^2}.$$
We call these estimators the EB moment estimators; their performance is characterized as follows.

Lemma 1. Assume that there exist constants $0 < c_0 < c_1 < \infty$ such that the size factors satisfy $c_0 < \gamma_c < c_1$ for all $c$. Then under the tradeoff (39), the worst-case mean squared errors (MSE) satisfy
$$\sup_{P_{X_g}} \mathbb{E}\Big[\big(\hat M^{\text{EB}}_{k,g} - M_{k,g}\big)^2\Big] = O\Big(\frac{1}{B}\Big), \qquad (47)$$
$$\sup_{P_{X_{g_1g_2}}} \mathbb{E}\Big[\big(\hat M^{\text{EB}}_{11,g_1g_2} - M_{11,g_1g_2}\big)^2\Big] = O\Big(\frac{1}{B}\Big). \qquad (48)$$

The moments give rise to many quantities that are important to scRNA-seq analysis. Some of them are
$$\mathrm{Var}(X_{cg}) = M_{2,g} - M_{1,g}^2, \qquad \mathrm{Cov}(X_{cg_1}, X_{cg_2}) = M_{11,g_1g_2} - M_{1,g_1} M_{1,g_2}, \qquad (51)$$
$$\rho(X_{cg_1}, X_{cg_2}) = \frac{\mathrm{Cov}(X_{cg_1}, X_{cg_2})}{\sqrt{\mathrm{Var}(X_{cg_1})\,\mathrm{Var}(X_{cg_2})}},$$
where $\rho$ is the Pearson correlation. In addition, since all the above quantities are smooth functions of the moments, their EB plug-in estimates inherit the same rate.
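As a concrete illustration, the EB moment estimator can be simulated under a gamma expression model. This is a sketch rather than the paper's released implementation; the budget split, gamma parameters, and size factors below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_reads = 20000, 2.0              # assumed budget split
r, theta = 2.0, 0.5                        # assumed expression model X ~ Gamma(r, theta)
gamma_c = rng.uniform(0.8, 1.2, n_cells)   # assumed per-cell size factors

X = rng.gamma(r, theta, n_cells)           # latent expression per cell
Y = rng.poisson(gamma_c * n_reads * X)     # observed counts: Y | X ~ Poi(gamma_c * n_reads * X)

def eb_moment(Y, gamma_c, n_reads, k):
    """EB estimator of M_k = E[X^k] via normalized Poisson factorial moments."""
    fact = np.ones_like(Y, dtype=float)
    for j in range(k):
        fact *= Y - j                      # Y (Y-1) ... (Y-k+1)
    return np.mean(fact / (gamma_c * n_reads) ** k)

M1_hat = eb_moment(Y, gamma_c, n_reads, 1)  # ground truth: r * theta = 1.0
M2_hat = eb_moment(Y, gamma_c, n_reads, 2)  # ground truth: r * (r+1) * theta^2 = 1.5
```

Because the factorial moment divides out the Poisson noise exactly, no depth-dependent bias correction is needed; a plug-in estimator that averages $(Y_{cg}/\gamma_c n_{\text{reads}})^k$ would instead be inflated by the Poisson noise for $k \ge 2$.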

Yet, one mysterious piece is the constants hidden inside the big-theta terms in (39), which vary case by case for different estimated quantities. Can they be unrealistically large in spite of being constants? It is hard to derive such constants for the original budget allocation problem and for estimating a general quantity. However, for the case of using EB estimators to estimate quantities like the first two moments or the gamma parameters, closed-form expressions for the optimal budget allocation are available.

As a recap, the EB estimators for the first and the second moment are given in (44). Assuming $P_{X_g}$ to be the gamma distribution with shape parameter $r_g$ and scale parameter $\theta_g$, the EB estimators are obtained by plugging in the EB moment estimates:
$$\hat\theta^{\text{EB}} = \frac{\hat M^{\text{EB}}_2 - (\hat M^{\text{EB}}_1)^2}{\hat M^{\text{EB}}_1}, \qquad \hat r^{\text{EB}} = \frac{(\hat M^{\text{EB}}_1)^2}{\hat M^{\text{EB}}_2 - (\hat M^{\text{EB}}_1)^2}. \qquad (53)$$
We note that they correspond to the common practice of first using the method of moments (MOM) to estimate the gamma parameters, except that the plug-in moments are replaced by the EB moments. The MSEs for estimating the first moment, the second moment, and the gamma parameters with the EB estimators are computed in closed form in the proofs below, and the optimal $n_{\text{reads}}$ minimizing each of these errors can be derived accordingly. We note that for the first moment, the error is $O(\frac{1}{B})$ as long as $n_{\text{cells}} = \Omega(B)$, which includes the case $n_{\text{cells}} = \Theta(B)$; hence the result is consistent with Theorem 2. Interestingly, the result for the first moment implies that the error always gets smaller as $n_{\text{cells}}$ gets larger. In the limiting case where the error is minimized, there are infinitely many cells, each with almost zero reads, which corresponds to bulk RNA-Seq. This is consistent with the observation that bulk RNA-Seq gives a good estimate of the mean gene expression level, but lacks resolution for the finer structures (say, the second moment).
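The plug-in step in (53) is a small computation. A minimal sketch using the method-of-moments identities; the numeric check uses exact gamma moments with assumed parameters $r=2$, $\theta=0.5$:

```python
def gamma_from_moments(m1, m2):
    """Gamma parameters from the first two moments (method of moments).

    For X ~ Gamma(r, theta): m1 = r * theta and m2 = r * (r+1) * theta**2,
    so var = m2 - m1**2 = r * theta**2, giving r = m1**2 / var and theta = var / m1.
    """
    var = m2 - m1 ** 2
    return m1 ** 2 / var, var / m1         # (r, theta)

# exact moments of Gamma(r=2, theta=0.5): m1 = 1.0, m2 = 1.5
r_hat, theta_hat = gamma_from_moments(1.0, 1.5)
```

In practice the inputs would be the EB moment estimates $\hat M^{\text{EB}}_1, \hat M^{\text{EB}}_2$ rather than exact moments.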

Since the size factors are usually close to 1, $c_{g,-1} \approx c_{g,-2} \approx 1$. Since there are $G$ genes, for a typical gene $X_{cg} \sim \frac{1}{G}$, and therefore $\theta \sim \frac{1}{G}$, $M_2 \sim \frac{1}{G^2}$, $\mathrm{Var}(X^2_{cg}) \sim \frac{1}{G^4}$; also, the shape parameter $r \sim 1$. Based on these approximations, the optimal allocations satisfy $n^*_{\text{reads}} \sim G$, so the mean number of reads for this typical gene is $n^*_{\text{reads}} M_{1,g} \sim n^*_{\text{reads}}/G \sim 1$. In other words, the optimal budget allocation for this gene is around 1 read per cell for the four quantities considered above.

Recall that for gene $g$, the inactive probability is defined as $p_{0,g}(k) = \mathbb{E}_{X_{cg} \sim P_{X_g}}\big[e^{-k X_{cg}}\big]$, the expected zero proportion if the data were sequenced at a depth of $k$ reads per cell. Define $t_c = \frac{k}{n_{\text{reads}} \gamma_c}$ and an ancillary random variable $L \sim \mathrm{Bin}\big(\big\lceil \frac{1}{2} \log_2 \big(n_{\text{cells}} (t_c - 1)^2 t_c^{-2}\big) \big\rceil, \frac{1}{t_c}\big)$. Then the estimator can be constructed by smoothing the plug-in series with $L$; in particular, when $t_c \le 2$ for all cells it simplifies to
$$\hat p^{\text{EB}}_{0,g}(k) = \frac{1}{n_{\text{cells}}} \sum_{c=1}^{n_{\text{cells}}} (1 - t_c)^{Y_{cg}}.$$
Such an estimator directly generalizes to the pairwise case, where we recall that the pairwise inactive probability is $p_{0,g_1g_2}(k) = \mathbb{E}\big[e^{-k(X_{cg_1} + X_{cg_2})}\big]$, with the EB estimator $\hat p^{\text{EB}}_{0,g_1g_2}(k)$ defined analogously.

Lemma 3. Assume that there exist constants $0 < c_0 < c_1 < \infty$ such that the size factors satisfy $c_0 < \gamma_c < c_1$ for all $c$. Then under the tradeoff (39), the worst-case mean squared errors (MSE) satisfy
$$\sup_{P_{X_g}} \mathbb{E}\Big[\big(\hat p^{\text{EB}}_{0,g}(k) - p_{0,g}(k)\big)^2\Big] = O\Big(\frac{1}{B}\Big), \qquad (65)$$
$$\sup_{P_{X_{g_1g_2}}} \mathbb{E}\Big[\big(\hat p^{\text{EB}}_{0,g_1g_2}(k) - p_{0,g_1g_2}(k)\big)^2\Big] = O\Big(\frac{1}{B}\Big). \qquad (66)$$
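A Monte Carlo sketch of the simplified estimator in the $t_c \le 2$ regime. Its unbiasedness follows from the Poisson probability generating function, $\mathbb{E}[s^Y] = e^{\lambda(s-1)}$; all parameter values below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_reads, k = 50000, 2.0, 1.0      # assumed budget and target depth
gamma_c = rng.uniform(0.9, 1.1, n_cells)   # assumed size factors
X = rng.gamma(2.0, 0.5, n_cells)           # assumed expression model
Y = rng.poisson(gamma_c * n_reads * X)     # observed counts

t_c = k / (n_reads * gamma_c)              # here t_c is about 0.5, well below 2
p0_eb = np.mean((1.0 - t_c) ** Y)          # EB estimate of E[exp(-k X)]
p0_true = np.mean(np.exp(-k * X))          # ground truth computed from the latent X
```

Setting $s = 1 - t_c$ and $\lambda = \gamma_c n_{\text{reads}} X_{cg}$ gives $\mathbb{E}[(1-t_c)^{Y_{cg}} \mid X_{cg}] = e^{-k X_{cg}}$, so the two quantities above agree up to Monte Carlo noise.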

Consider the distribution of gene $g$, $P_{X_g}$. Following the recipe in 10, 11, we assume a zero-inflated $p$-parameter exponential family model for the distribution $P_{X_g}$, i.e.,
$$P_{X_g}(x; \alpha) = \exp\big(Q(x)\,\alpha - \phi(\alpha)\big), \qquad (67)$$
where $Q(x) = [q_1(x), \cdots, q_p(x)]$ and $\alpha \in \mathbb{R}^p$. The first component $q_1(x) = I_{\{x=0\}}$ models the zero inflation, and $q_2(x), \cdots, q_p(x)$ are the spline basis functions, set to be 0 when $x = 0$, where we choose the degree to be 5. To make the problem computationally feasible, we assume that $X_{cg}$ can be discretized as $X \in \mathcal{X} = \{x_1, \cdots, x_m\}$. We differ from previous works 10, 11 in the way we choose the values in $\mathcal{X}$: instead of having points equally spaced between 0 and 1, we let them be spaced quadratically as $\mathcal{X} = \{\frac{1}{h^2}, \frac{2^2}{h^2}, \cdots, 1\}$ for some $h$. This is because the estimation becomes harder in the region close to 0 and we need a finer grid there, an idea inspired by 37. We observed a significant performance improvement from this modification. We infer the coefficients $\alpha$ via maximum likelihood, where the formulas for the derivatives are given in 30 and are included here for completeness. We denote the corresponding estimator by $\hat\alpha^{\text{EB}}$.

Remark 2. (Maximum likelihood inference) For simplicity we omit the subscript "cg" in this section and write $X, Y$ where we previously wrote $X_{cg}$ and $Y_{cg}$. Since we have discretized $X$, we can define the $m$-dimensional probability mass vector $p_X$: $[p_X]_j = P_X(x_j; \alpha)$. We can also define the $m \times p$ matrix $Q$: $Q_{ji} = q_i(x_j)$. Note that they respect the relationship
$$p_X(\alpha) = \exp\big(Q\alpha - \phi(\alpha)\big).$$
Furthermore, define the conditional probability matrix $P_{Y|X}$: $[P_{Y|X}]_{kj} = P(Y = k \mid X = x_j)$. Then the probability of $Y = k$ can be written as $[P_{Y|X}\, p_X(\alpha)]_k$. The log-likelihood function for a single observation $Y_c$ is $\ell_c(\alpha) = \log [P_{Y|X}\, p_X(\alpha)]_{Y_c}$, and the full log-likelihood function is $\ell(\alpha) = \sum_{c=1}^{n_{\text{cells}}} \ell_c(\alpha)$. The maximum likelihood estimate $\hat\alpha^{\text{EB}}$ is obtained by optimizing $\hat\alpha^{\text{EB}} = \arg\max_\alpha \ell(\alpha)$. Regarding this optimization, some useful quantities are the posterior weights
$$w_{kj} = \frac{[P_{Y|X}]_{kj}\, [p_X]_j}{[P_{Y|X}\, p_X]_k}, \qquad W_k = [w_{k1}, \cdots, w_{km}]^T,$$
with which the gradient takes the form $\nabla_\alpha \ell_c = Q^T\big(W_{Y_c} - p_X\big)$.
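The quantities in Remark 2 can be sketched numerically. The basis below (powers and a log term on a quadratically spaced grid, with the zero-inflation point omitted) is a simplifying assumption standing in for the paper's spline basis, and plain gradient ascent stands in for a proper optimizer; the gradient uses the posterior-weight form $Q^T(W_{Y_c} - p_X)$:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(2)
m, n_reads, K = 40, 2.0, 30
x = (np.arange(1, m + 1) / m) ** 2            # quadratically spaced support
Q = np.stack([x, x ** 2, np.log(x)], axis=1)  # assumed basis (not the paper's splines)

def p_X(alpha):
    """p_X(alpha) = exp(Q alpha - phi(alpha)), normalized over the grid."""
    logits = Q @ alpha
    w = np.exp(logits - logits.max())
    return w / w.sum()

# Poisson channel [P_{Y|X}]_{kj} = P(Y = k | X = x_j), truncated at K
ks = np.arange(K + 1)[:, None]
lam = n_reads * x[None, :]
P_YX = np.exp(ks * np.log(lam) - lam - gammaln(ks + 1))

alpha_true = np.array([3.0, -2.0, 0.5])       # assumed ground truth
pY = P_YX @ p_X(alpha_true)
counts = np.bincount(rng.choice(K + 1, 20000, p=pY / pY.sum()), minlength=K + 1)

def loglik(alpha):
    return counts @ np.log(P_YX @ p_X(alpha)) / counts.sum()

def grad(alpha):
    """Per-observation gradient Q^T (W_{Y_c} - p_X), aggregated via the counts."""
    pX = p_X(alpha)
    W = (P_YX * pX[None, :]) / (P_YX @ pX)[:, None]   # posterior weights w_{kj}
    return (counts[:, None] * (W - pX[None, :])).sum(0) @ Q / counts.sum()

alpha_hat = np.zeros(3)
for _ in range(2000):                          # small fixed step; L-BFGS would be better
    alpha_hat += 0.02 * grad(alpha_hat)
```

The analytic gradient can be verified against a finite-difference approximation of `loglik`, which is a useful sanity check whenever the basis or the channel matrix is changed.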

Lemma 4. (Distribution estimation) For any distribution $P_{X_g}$ in the exponential family (67), under the tradeoff (39) the EB distribution estimator $\hat\alpha^{\text{EB}}$ satisfies
$$\mathbb{E}\big[\|\hat\alpha^{\text{EB}} - \alpha\|_2^2\big] = O\Big(\frac{1}{B}\Big).$$

Lower bound
In this section we give lower bounds matching Lemmas 1, 3, and 4, hence establishing the optimality of the tradeoff (39) as well as the optimal rates of the EB estimators. The lower bound is proved via Le Cam's two-point method 38 and is stated as below:

Lemma 5. Under any tradeoff $(n_{\text{reads}}, n_{\text{cells}})$ with $n_{\text{reads}}\, n_{\text{cells}} = B$ and for any estimator, the worst-case error is $\Omega(\frac{1}{B})$, where the sup is over the exponential family $\mathcal{P}_{\exp}$ as defined in (67) for (74), and over all distributions for the others.

Also, we let $\gamma_c = 1$ for all $c$ without loss of generality. A careful inspection of (77) and (79) reveals that the estimation error of (7) is around $\frac{1}{\sqrt{n_{\text{cells}} \vee n_{\text{reads}}}}$.

Algorithm efficiency

We implement the EB estimators in an efficient way so that they can be applied to very-large-scale scRNA-seq datasets like the brain_1.3m dataset, which is computationally prohibitive for many available scRNA-seq algorithms. We benchmark the running time for estimating the moments, the pairwise moments, the inactive probability, and the pairwise inactive probability on three datasets of various sizes, i.e., pbmc_4k, brain_9k, and brain_1.3m. For these datasets, we use the top 4000 genes and estimate the related quantities, since most scRNA-seq analyses are performed using fewer than 4000 genes. We report the average running time over 5 repetitions along with the standard deviation.

The data loading time via scanpy 2, an efficient Python implementation of popular scRNA-seq algorithms, is also reported for comparison. The reported time is for loading the data a second time, after the cache file has been built, which is much faster than the first time. All algorithms are run on a single core (CPU model: AMD Opteron(tm) Processor 6378) and can be run on a laptop. We do not include distribution estimation since it is usually performed on a few genes rather than all of them, which presents a smaller computational challenge.

The results are shown in Supplementary Table 2, where we see that estimating the 1d quantities like the moments and the inactive probability is very efficient. Estimating the covariance and the pairwise inactive probability takes longer but is still within a feasible range, i.e., ~12 hours for the brain_1.3m dataset. In practice, to speed up the analysis, we suggest first estimating the 1d quantities to select a smaller subset of genes, and then estimating the more computationally demanding quantities using only this subset.

Definition of errors in simulations
The errors used in the simulations (Figure 2a, Supplementary Figures 4-6) are defined as follows. Suppose $x$ is the true parameter and $\hat x$ is the estimated value, which can be a scalar, vector, or matrix. Then the errors are
$$\text{log10 relative error} = \log_{10} \frac{\|\hat x - x\|}{\|x\|}, \qquad \text{log10 cosine distance} = \log_{10}\Big(1 - \frac{\langle \hat x, x \rangle}{\|\hat x\|\, \|x\|}\Big).$$

Since the number of UMIs per cell is small compared to the 10x datasets, we found that it is no longer a good assumption that the size factor $\gamma_c$ can be estimated with high accuracy. Hence, in these two experiments, we treat the size factors as random variables instead of given fixed numbers. Specifically, following model (6) and by Fact 1, we take expectations of both sides over the random size factors and estimate the $k$th moment accordingly. For the smFISH data, we assume an analogous model and estimate the $k$th moment in the same way. For the inactive probability, the smFISH ground truth is obtained by subsampling the smFISH data to have $k$ reads per cell on average and then computing the empirical zero proportion.

Proof. We first show (47). Consider the tradeoff (39), any distribution $P_{X_g}$, and any size factors $\{\gamma_c\}_{c=1}^{n_{\text{cells}}}$ satisfying the assumption in Lemma 1. By Fact 1, the estimator $\hat M^{\text{EB}}_{k,g}$ is unbiased. Its variance can be expanded as a polynomial in $\frac{1}{\gamma_c n_{\text{reads}}}$ whose coefficients $c_{2k,l} = S(2k, l)$ are the Stirling numbers of the second kind, constants depending only on $2k$ and $l$. Since $\gamma_c$ can be treated as a constant in the asymptotic analysis, and so can $n_{\text{reads}}$ due to (39), we conclude that the variance is $O(\frac{1}{n_{\text{cells}}})$; letting $n_{\text{cells}} = \Theta(B)$ yields (47).

We next show (48). Again consider the tradeoff (39), any distribution $P_{X_{g_1g_2}}$, and any size factors $\{\gamma_c\}_{c=1}^{n_{\text{cells}}}$ satisfying the assumption in Lemma 1. The bias and the variance can be computed as follows.
The bias $\mathrm{bias}[\hat M^{\text{EB}}_{11,g_1g_2}] = \mathbb{E}[\hat M^{\text{EB}}_{11,g_1g_2}] - M_{11,g_1g_2} = 0$, since, conditioned on $(X_{cg_1}, X_{cg_2})$, the reads of the two genes are independent Poisson variables and Fact 1 applies to each. The variance is again $O(\frac{1}{n_{\text{cells}}})$; by letting $n_{\text{cells}} = \Theta(B)$, the MSE of $\hat M^{\text{EB}}_{11,g_1g_2}$ is $O(\frac{1}{B})$.

Proof. We consider the first moment, the second moment, and the gamma parameters in order, omitting the gene subscript $g$ for simplicity.

The first moment. The MSE of the first moment is
$$\mathbb{E}\Big[\big(\hat M^{\text{EB}}_1 - M_1\big)^2\Big] = \frac{1}{n_{\text{cells}}}\Big(\mathrm{Var}(X_{cg}) + \frac{c_{g,-1}\, M_1}{n_{\text{reads}}}\Big),$$
where $c_{g,-1} = \frac{1}{n_{\text{cells}}} \sum_{c=1}^{n_{\text{cells}}} \frac{1}{\gamma_c}$. Under a fixed budget, it is not hard to see that the above decreases as $n_{\text{reads}}$ decreases. Hence $n^*_{\text{reads}}(\hat M^{\text{EB}}_1) \to 0$.
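The first-moment MSE expression can be checked by direct simulation. A sketch under an assumed gamma expression model, comparing the closed form $\frac{1}{n_{\text{cells}}}(\mathrm{Var}(X) + c_{g,-1} M_1 / n_{\text{reads}})$ against Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(4)
n_cells, n_reads, trials = 500, 2.0, 4000  # assumed budget split and trial count
r, theta = 2.0, 0.5                        # assumed gamma expression model
gamma_c = rng.uniform(0.8, 1.2, n_cells)   # assumed size factors
c_m1 = np.mean(1.0 / gamma_c)              # c_{g,-1}

M1, var_X = r * theta, r * theta ** 2
err = np.empty(trials)
for t in range(trials):
    X = rng.gamma(r, theta, n_cells)
    Y = rng.poisson(gamma_c * n_reads * X)
    err[t] = np.mean(Y / (gamma_c * n_reads)) - M1   # EB first-moment estimate minus truth

mse_mc = np.mean(err ** 2)
mse_theory = (var_X + c_m1 * M1 / n_reads) / n_cells
```

With 4000 trials the Monte Carlo MSE should match the closed form to within a few percent.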

The second moment. The MSE of the second moment is
$$\mathbb{E}\Big[\big(\hat M^{\text{EB}}_2 - M_2\big)^2\Big] = \frac{1}{n_{\text{cells}}}\Big(\mathrm{Var}(X^2_{cg}) + \frac{4\, c_{g,-1}\, M_3}{n_{\text{reads}}} + \frac{2\, c_{g,-2}\, M_2}{n^2_{\text{reads}}}\Big),$$
where $c_{g,-1} = \frac{1}{n_{\text{cells}}} \sum_{c=1}^{n_{\text{cells}}} \frac{1}{\gamma_c}$ and $c_{g,-2} = \frac{1}{n_{\text{cells}}} \sum_{c=1}^{n_{\text{cells}}} \frac{1}{\gamma_c^2}$.

The gamma parameters. Let $X_{cg} \sim \mathrm{Gamma}(r, \theta)$ with shape parameter $r$ and scale parameter $\theta$. Then the moments of this gamma distribution are $M_k = \theta^k \frac{(r+k-1)!}{(r-1)!}$, with special cases
$$M_2 = r(r+1)\theta^2, \qquad (93)$$
$$M_3 = r(r+1)(r+2)\theta^3, \qquad (94)$$
$$M_4 = r(r+1)(r+2)(r+3)\theta^4, \qquad (95)$$
$$\mathrm{Var}(X_{cg}) = r\theta^2, \qquad (96)$$
$$\mathrm{Var}(X^2_{cg}) = 2(2r+3)(r+1)r\theta^4. \qquad (97)$$
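The gamma moment identities (93)-(97) can be sanity-checked by Monte Carlo; the parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
r, theta = 2.0, 0.5                                     # assumed shape and scale
X = rng.gamma(r, theta, 2_000_000)

M2 = r * (r + 1) * theta ** 2                           # (93)
M3 = r * (r + 1) * (r + 2) * theta ** 3                 # (94)
M4 = r * (r + 1) * (r + 2) * (r + 3) * theta ** 4       # (95)
var_X = r * theta ** 2                                  # (96)
var_X2 = 2 * (2 * r + 3) * (r + 1) * r * theta ** 4     # (97)
# internal consistency: Var(X^2) = M4 - M2^2
```

Note that (97) is exactly $M_4 - M_2^2$ after expanding the products, which the test below also verifies.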
The EB estimator for the gamma parameters can be written as $(\hat\theta^{\text{EB}}, \hat r^{\text{EB}}) = h(\hat M^{\text{EB}}_1, \hat M^{\text{EB}}_2)$, with the function
$$h(m_1, m_2) = \Big(\frac{m_2 - m_1^2}{m_1},\; \frac{m_1^2}{m_2 - m_1^2}\Big)$$
and its Jacobian $J$. Let $\tilde S = J S J^T$, where $S$ is the asymptotic covariance of $(\hat M^{\text{EB}}_1, \hat M^{\text{EB}}_2)$. Using the delta method, the asymptotic variances of $\hat\theta^{\text{EB}}$ and $\hat r^{\text{EB}}$ are given by the diagonal elements of $\tilde S$:
$$\tilde S_{11} = \frac{2r+3}{r}\theta^2 + \frac{c_{g,-1}}{n_{\text{reads}}} \cdot \frac{4r+5}{r}\,\theta + \frac{c_{g,-2}}{n^2_{\text{reads}}} \cdot \frac{2(r+1)}{r}, \qquad (107)$$
$$\tilde S_{22} = 2r(r+1) + \frac{c_{g,-1}}{n_{\text{reads}}} \cdot \frac{4r(r+1)}{\theta} + \frac{c_{g,-2}}{n^2_{\text{reads}}} \cdot \frac{2r(r+1)}{\theta^2}.$$
The optimal $n_{\text{reads}}$ for $\hat\theta^{\text{EB}}$ and $\hat r^{\text{EB}}$ can then be derived by minimizing these expressions under the budget constraint.

Proof. We first show (65). Consider the tradeoff (39), any distribution $P_{X_g}$, and any size factors $\{\gamma_c\}_{c=1}^{n_{\text{cells}}}$ satisfying the assumption in Lemma 3. Let $n_{\text{reads}} = \frac{k}{2c_0} = O(1)$. Then $t_c \le 2$ for all cells $c$, and hence the estimator can be simplified as
$$\hat p^{\text{EB}}_{0,g}(k) = \frac{1}{n_{\text{cells}}} \sum_{c=1}^{n_{\text{cells}}} (1 - t_c)^{Y_{cg}}.$$
Next we compute the bias and the variance under this scenario.
The bias is
$$\mathrm{bias}\big[\hat p^{\text{EB}}_{0,g}(k)\big] = \frac{1}{n_{\text{cells}}} \sum_{c=1}^{n_{\text{cells}}} \mathbb{E}\Bigg[\sum_{y=0}^{\infty} (1-t_c)^y\, e^{-n_{\text{reads}} \gamma_c X_{cg}}\, \frac{(n_{\text{reads}} \gamma_c X_{cg})^y}{y!}\Bigg] - p_{0,g}(k) = 0, \qquad (114)$$
where the inner sum is the Poisson probability generating function and evaluates to $e^{-n_{\text{reads}} \gamma_c t_c X_{cg}} = e^{-k X_{cg}}$. The variance satisfies
$$\mathrm{Var}\big[\hat p^{\text{EB}}_{0,g}(k)\big] \le \frac{1}{n_{\text{cells}}},$$
where the last step is by noting that $(1-t_c)^{2Y_{cg}} \le 1$ since $t_c \le 2$. Then by letting $n_{\text{cells}} = \Theta(B)$, the MSE of $\hat p^{\text{EB}}_{0,g}(k)$ is $O(\frac{1}{B})$.

We next show (66). Again consider the tradeoff (39), any distribution $P_{X_{g_1g_2}}$, and any size factors $\{\gamma_c\}_{c=1}^{n_{\text{cells}}}$ satisfying the assumption in Lemma 3. Let $n_{\text{reads}} = \frac{k}{2c_0} = O(1)$, so that $t_c \le 2$ for all cells and the estimator simplifies analogously. Next we compute the bias and the variance under this scenario.
Letting $n_{\text{cells}} = \Theta(B)$, we reach the conclusion that the MSE of $\hat p^{\text{EB}}_{0,g_1g_2}(k)$ is $O(\frac{1}{B})$, which completes the proof.

Proof. We use Le Cam's two-point method (38, Chapter 2) to prove the lower bound, whose core theorem can be stated as follows:

Theorem 3. (Le Cam's method 38) Suppose a family $\mathcal{P}$ of distributions contains a pair $P_1, P_2 \in \mathcal{P}$ whose corresponding distributional quantities satisfy $|q(P_1) - q(P_2)| \ge \delta$. Then the worst-case error after $n$ observations has the lower bound
$$\inf_{\hat q}\; \sup_{P \in \mathcal{P}} \mathbb{E}\big[(\hat q - q(P))^2\big] \ge \frac{\delta^2}{8}\Big(1 - \big\| P_1^{\times n} - P_2^{\times n} \big\|_{TV}\Big),$$
where $P_1^{\times n}$ refers to the product distribution of $n$ samples.

In order to use Theorem 3, we need to find a pair of distributions $P_{X_g}$ and $P'_{X_g}$ whose corresponding quantities to estimate, be it the moments, the inactive probability, or the distribution parameters, differ by at least $\Omega(\frac{1}{\sqrt{B}})$, and yet whose induced observation distributions $P_{Y_g}$ and $P'_{Y_g}$ are almost indistinguishable, as quantified by their total variation distance $\|P_{Y_g}^{\times n_{\text{cells}}} - P'^{\times n_{\text{cells}}}_{Y_g}\|_{TV}$ being bounded from above by a constant smaller than 1.

We now construct such a distribution pair $P_{X_g}, P'_{X_g}$ and use it to prove all the results. In order to do so, this distribution pair should lie in the exponential family $\mathcal{P}_{\exp}$ as defined in (67). Following the recipe in Lemma 6, first let $\{x_i\}$ be the same support $\mathcal{X}$ used to discretize the exponential family distribution (67). Then there exists an $\alpha_0$ whose corresponding $\{p_i\}$, as specified by letting $P_{X_g}(x; \alpha_0) = P'_{X_g}(x)$, are all in $[0, 1]$ and thus satisfy the requirement of Lemma 6. Let $\delta$ be the one specified by Lemma 6. Since $\mathcal{P}_{\exp}$ has one free dimension for the zero probability, there exists an $\alpha$ such that $P_{X_g}(x = 0; \alpha) = P_{X_g}(0)$. In other words, the pair of distributions that we have just constructed satisfies (125) and lies in $\mathcal{P}_{\exp}$.

Since the second half of (125) already gives the indistinguishability condition, in order to use Theorem 3 to show the results we only need to verify that, for the distribution pair $P_{X_g}, P'_{X_g}$, the difference of each quantity to estimate is at least $\Omega(\frac{1}{\sqrt{B}})$.

First, for the $k$th moment (70), the moment difference between the pair is $\Omega(\delta)$ by construction. Second, for the pairwise moment (71), let us augment a second dimension by defining the distribution pair to be $P_{X_{g_1g_2}} = P_{X_g} \times P_{X_{g_2}}$ and $P'_{X_{g_1g_2}} = P'_{X_g} \times P_{X_{g_2}}$, where $P_{X_{g_2}}$ takes the value $x_1$ with probability 1. Since the second dimension is constant and independent of the first dimension, the indistinguishability condition still holds, while the pairwise moment difference remains $\Omega(\delta)$. Third, for the inactive probability (72), $|p_0(k)(P_{X_g}) - p_0(k)(P'_{X_g})| = \Omega(\delta)$. Fourth, for the pairwise inactive probability (73), augment the second dimension in the same way as for the pairwise moment; then $|p_0(k)(P_{X_{g_1g_2}}) - p_0(k)(P'_{X_{g_1g_2}})| = \Omega(\delta)$. Finally, for the distribution (74), since the zero probability is a continuous function of $\alpha$, the first elements of the corresponding parameters also differ by $\Omega(\delta)$.

Lemma 6. (Auxiliary lemma for the proof of Lemma 5) Let $\{p_i, x_i\}_{i=1}^m$ be any ensemble satisfying $\sum_{i=1}^m p_i = 1$ and, for all $i$, $0 \le p_i \le 1$, $x_i > 0$. For any tradeoff $(n_{\text{reads}}, n_{\text{cells}})$ such that $n_{\text{reads}}\, n_{\text{cells}} = B$, there exists $\delta \le \frac{1}{2}$ such that the distribution pair $P_{X_g}, P'_{X_g}$, with $P'_{X_g}$ obtained by perturbing $P_{X_g}$ by $\delta$, satisfies $\|P_{Y_g}^{\times n_{\text{cells}}} - P'^{\times n_{\text{cells}}}_{Y_g}\|_{TV} \le \frac{1}{2}$, where $P_{Y_g}$ and $P'_{Y_g}$ are the corresponding observation distributions.

Proof. (Proof of Lemma 6) The observation distributions can be written as mixtures of Poisson distributions, $P_{Y_g}(y) = \sum_{i=1}^m p_i\, \mathrm{poi}(y, n_{\text{reads}} x_i)$, where $\mathrm{poi}(y, \lambda)$ denotes the probability that $Y = y$ when $Y \sim \mathrm{Poi}(\lambda)$. The KL divergence between $P_{Y_g}$ and $P'_{Y_g}$ can then be computed as
$$D_{KL}\big(P_{Y_g} \,\|\, P'_{Y_g}\big) = \sum_{y=0}^{\infty} P_{Y_g}(y) \log \frac{P_{Y_g}(y)}{P'_{Y_g}(y)} = P_{Y_g}(0) \log \frac{P_{Y_g}(0)}{P'_{Y_g}(0)} + \cdots = O(\delta^2),$$
where the key step is the Taylor expansion $\log(1+x) = x - \frac{x^2}{2} + o(x^2)$ for $x = o(1)$, under which the first-order terms cancel. Next we consider the total variation distance of the product distribution, which by Pinsker's inequality is at most $\sqrt{\frac{1}{2}\, n_{\text{cells}}\, D_{KL}(P_{Y_g} \| P'_{Y_g})}$. In both cases this quantity is bounded by a constant, and hence there exists a $c_0$ such that $\|P_{Y_g}^{\times n_{\text{cells}}} - P'^{\times n_{\text{cells}}}_{Y_g}\|_{TV} \le \frac{1}{2}$. Hence, we have completed the proof.
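The two-point argument can be illustrated numerically: perturb the probability mass of a small discrete expression distribution by $\delta \sim 1/\sqrt{B}$ and check that the induced Poisson-mixture observation distributions stay indistinguishable under Pinsker's inequality. All the numbers below are assumptions for illustration:

```python
import numpy as np
from scipy.special import gammaln

def obs_dist(px, x, n_reads, K=50):
    """Observation distribution P_Y(y) = sum_i p_i * poi(y, n_reads * x_i), truncated at K."""
    ks = np.arange(K + 1)[:, None]
    lam = n_reads * x[None, :]
    return np.exp(ks * np.log(lam) - lam - gammaln(ks + 1)) @ px

n_cells, n_reads = 10_000, 1.0
B = n_cells * n_reads
x = np.array([0.05, 0.2, 1.0])             # assumed support
p = np.array([0.5, 0.3, 0.2])              # assumed expression distribution
delta = 0.5 / np.sqrt(B)                   # perturbation of order 1/sqrt(B)
p2 = p + np.array([delta, -delta, 0.0])    # move delta mass between two support points

pY, pY2 = obs_dist(p, x, n_reads), obs_dist(p2, x, n_reads)
tv_single = 0.5 * np.abs(pY - pY2).sum()         # per-sample TV is O(delta)
kl = pY @ np.log(pY / pY2)                       # KL is O(delta^2) = O(1/B)
tv_product_bound = np.sqrt(0.5 * n_cells * kl)   # Pinsker bound for the n_cells-fold product
```

A perturbation of order $1/\sqrt{B}$ yields a per-sample KL of order $1/B$, so the Pinsker bound on the $n_{\text{cells}}$-fold product stays well below 1 — exactly the regime Lemma 6 exploits.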