Abstract
Allele-specific expression (ASE) at single-cell resolution is a critical tool for understanding the stochastic and dynamic features of gene expression. However, low read coverage and high biological variability present challenges for analyzing ASE. We demonstrate that discarding multi-mapping reads leads to higher variability in estimates of allelic proportions and an increased frequency of sampling zeros, and can lead to spurious findings of dynamic and monoallelic gene expression. Here, we report a method for ASE analysis from single-cell RNA-Seq data that accurately classifies allelic expression states and improves estimation of allelic proportions by pooling information across cells. We further demonstrate that combining information across cells using a hierarchical mixture model reduces sampling variability without sacrificing cell-to-cell heterogeneity. We applied our approach to re-evaluate the statistical independence of allelic bursting and track changes in the allele-specific expression patterns of cells sampled over a developmental time course.
Introduction
Allelic imbalance of transcripts is common across many genes^{1}. It can range from a subtle imbalance to complete monoallelic expression as in imprinted genes^{2} or genes under dosage compensation by X chromosome inactivation^{3,4}. Single-cell RNA sequencing (scRNA-Seq) can reveal features of cellular gene expression that cannot be observed in bulk RNA sequencing^{5}. In single cells, allelic proportions often form U-shaped or W-shaped distributions due to the occurrence of monoallelic transcriptional bursts^{6}. However, our ability to discern gene expression dynamics is limited by low depth of sequencing coverage per cell^{7,8,9,10,11,12,13,14}, and thus it is critical to make full use of all information available in scRNA-Seq data.
We propose an approach to classification and estimation of allele-specific gene expression in single cells (Fig. 1 and Methods). We first count the allele-specific read alignments using one of two methods. In the unique-reads method, we exclude multi-mapping reads (multireads) and count only reads that map unambiguously to one allele of one gene. Alternatively, we can include multireads using an expectation-maximization (EM) algorithm to estimate counts by weighted allocation^{15,16,17}. Next, we compute a probabilistic classification of each gene in each cell into paternal monoallelic, biallelic, or maternal monoallelic expression states. Lastly, we can apply partial pooling to improve the individual cell-level estimation by combining information across cells that are in the same allelic expression state. The classification and partial pooling steps inform one another and are applied iteratively. Partial pooling can be applied to either of the read counting results, leading to four different methods for estimating allelic proportions: (a) unique reads, (b) weighted allocation, (c) unique reads with partial pooling, or (d) weighted allocation with partial pooling. These methods are implemented in our \({\mathtt{scBASE}}\) software.
In the following sections, we examine the effects of weighted allocation and partial pooling on classification and estimation of allelic proportions. We then apply \({\mathtt{scBASE}}\) methods to evaluate the statistical independence of allelic bursting. Finally, we illustrate the interpretive power of allelic expression analysis of scRNA-Seq using data from a developmental time course^{8}.
Results
Application of \({\mathtt{scBASE}}\)
We applied \({\mathtt{scBASE}}\) methods to scRNA-Seq data from 286 pre-implantation mouse embryo cells from an F1 hybrid mating between female CAST/EiJ (CAST) and male C57BL/6J (B6) mice^{8}. Cells were sampled along a time course from the zygote and early 2-cell stages through the late blastocyst stage of development. We created a diploid transcriptome from CAST- and B6-specific sequences of each annotated transcript (Ensembl Release 78)^{18} and aligned reads from each cell to obtain allele-specific alignments. In order to ensure that genes had sufficient polymorphic sites for ASE analysis, we restricted attention to 13,032 genes that had at least four allelic unique reads in at least 10% of cells. Where indicated below, we apply \({\mathtt{scBASE}}\) to only 122 cells from the blastocyst stages of development, or to only 60 cells in the mid-blastocyst stage.
Discarding multireads increases spurious ASE calls
A read that maps to one allele of one gene is a unique read. A read that maps uniquely to one gene but to both allelic copies is an allelic multiread. A read that maps to multiple genes but only to one allele at each is a genomic multiread. A read that maps to multiple genes and to both alleles of any of those genes is a complex multiread. Contrary to our intuition, complex multireads convey information about allele-specific expression (Supplementary Fig. 1). We obtained unique reads and weighted allocation counts for each of 286 cells. The sequence reads include 2.5% genomic multireads, 59.3% allelic multireads, and 23.3% complex multireads. Thus, the unique-reads method retains only 14.9% of the available reads for analysis. This substantial loss of information could lead to high variability of allelic proportions. As a result, we find that the unique-reads method finds more monoallelic expression (Fig. 2a and Supplementary Fig. 1), calling on average \(\sim\)66 more genes monoallelic in each cell. We also observed \(\sim\)1,908 genes where the unique-reads method fails to detect biallelic expression in some cells whereas weighted allocation counts are consistently biallelic, for example, Mtdh (Fig. 2b and Table 1). The high frequency of monoallelic expression calls from unique reads can be misinterpreted as allelic bursting, and gene expression can appear to be more dynamic.
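The four read categories defined above can be expressed as a small decision rule. The following is a minimal sketch, assuming each read is summarized as a list of (gene, allele) alignment targets; the function name `classify_read` and this input representation are illustrative assumptions and are not part of \({\mathtt{scBASE}}\).

```python
from typing import Iterable, Tuple


def classify_read(alignments: Iterable[Tuple[str, str]]) -> str:
    """Assign a read to one of the four categories from its (gene, allele) targets.

    Hypothetical helper for illustration only; not an scBASE function.
    """
    # Collect the set of alleles hit within each gene.
    alleles_by_gene = {}
    for gene, allele in alignments:
        alleles_by_gene.setdefault(gene, set()).add(allele)
    if not alleles_by_gene:
        raise ValueError("read has no alignments")

    multi_gene = len(alleles_by_gene) > 1
    both_alleles = any(len(alleles) > 1 for alleles in alleles_by_gene.values())

    if multi_gene and both_alleles:
        return "complex multiread"
    if multi_gene:
        return "genomic multiread"
    if both_alleles:
        return "allelic multiread"
    return "unique read"
```

Under this rule only reads in the last category are counted by the unique-reads method; the other three categories are what weighted allocation retains.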
Partial pooling improves the accuracy of allelic proportions
Over the course of the embryonic time series, the frequency of allelic expression varies dramatically, especially in the earliest stages where there is a very high rate of maternal monoallelic expression. In order to avoid mixing very disparate cell types in our evaluation of partial pooling, we restricted our analysis to the 122 mature blastocyst cells, the largest group in the Deng et al.^{8} data. These cells have average coverage of \(\sim\)14.8M reads per cell. This allowed us to downsample the data by randomly selecting 1% of reads to obtain an average coverage of 148k reads per cell, a depth of coverage similar to current scRNA-Seq applications. We estimated allelic proportions on both full and downsampled data using each of four methods implemented in \({\mathtt{scBASE}}\). We compared the estimated allelic proportions from the downsampled data to estimates obtained from the full data using the corresponding unique reads or weighted allocation estimates with no pooling. The full data are based on 100-fold more reads per sample and provide an approximate truth standard. A similar approach to evaluation of single-cell data analysis was employed by Huang et al.^{19}.
In order to assess the effects of partial pooling, we computed differences in the mean squared error (MSE) of estimated allelic proportions with and without partial pooling. Partial pooling applied to the unique-read counts improves estimation for the majority of genes (4,392 versus 1,367 out of 5,759 genes) with an average MSE difference of 0.018 (Fig. 3a). Partial pooling applied to the weighted allocation counts improves estimation for most genes (5,078 versus 1,673 out of 6,751 genes) with an average MSE difference of 0.016 (Fig. 3b). In both cases, the greatest gains are seen in the low expression range (\(<10\) unique reads per gene). For the most highly expressed genes, there is little or no reduction in MSE, which is consistent with our expectation that pooling of information across cells is most impactful when coverage is low.
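The per-gene comparison described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the sign convention (MSE without pooling minus MSE with pooling, so that positive values favor pooling) is an assumption rather than something stated in the text.

```python
def mse(estimates, truth):
    """Mean squared error of per-cell allelic proportions against a truth standard."""
    pairs = list(zip(estimates, truth))
    return sum((e - t) ** 2 for e, t in pairs) / len(pairs)


def mse_difference(no_pooling, pooled, truth):
    """MSE without pooling minus MSE with pooling for one gene.

    Assumed convention: a positive value means partial pooling reduced the error.
    """
    return mse(no_pooling, truth) - mse(pooled, truth)


# Toy example for one gene in two cells; "truth" stands in for full-data estimates.
truth = [0.5, 0.5]
downsampled_no_pooling = [0.0, 1.0]   # sampling zeros push estimates to the extremes
downsampled_pooled = [0.4, 0.6]
gain = mse_difference(downsampled_no_pooling, downsampled_pooled, truth)
```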
The timing of allelic bursting is coordinated
The timing of allelic bursting events is a defining feature of stochasticity in gene expression^{20}. One fundamental question is whether the occurrence of allelic bursts is coordinated or if bursts occur independently for each allele. Statistical independence of maternal and paternal bursting can be evaluated using a \(2\times 2\) table of counts of the numbers of cells for which a given gene is expressed only from the maternal allele, only from the paternal allele, from both, or not expressed (as in Table 1). If allelic bursts occur independently, the log-odds ratio (logOR) computed from this \(2\times 2\) table should be close to zero. In order to relate this standard approach^{21} for testing the independence hypothesis to alternative methods^{8,22} that have been proposed for scRNA-Seq data, it is helpful to consider a geometric representation of the proportions of cells in each allelic expression state (Fig. 4a). Proportions are numbers greater than or equal to zero that sum to one. They can be represented as a point in a 3-dimensional tetrahedron in 4-dimensional space (the 4D simplex^{23}). When maternal and paternal bursting events occur independently, the proportions should fall near the 3-dimensional surface within the simplex where the logOR is equal to zero (the point cloud region in Fig. 4a). The method of testing independence used in Deng et al.^{8} and Larsson et al.^{22} imposes an additional constraint on the \(2\times 2\) table proportions by assuming that the frequencies of maternal and paternal bursting events are equal (\({p}_{M}={p}_{P}\)). This constraint corresponds to a 2-dimensional cross-section of the simplex, indicated by the blue triangle in Fig. 4a. Projection of points in the 4D simplex onto this triangle produces the graphical representation used by Deng et al. (e.g., Fig. 4b). This illustrates how the Deng et al. method is a special case of the logOR test.
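As a concrete illustration of the logOR computed from the \(2\times 2\) table, the sketch below takes the four cell counts for one gene and returns the log-odds ratio. The function name and the optional pseudocount of 0.5 (a common continuity correction for tables with empty cells) are assumptions made for this example and are not taken from the analysis reported here.

```python
import math


def log_odds_ratio(n_both, n_maternal_only, n_paternal_only, n_neither,
                   pseudocount=0.5):
    """Log-odds ratio for maternal versus paternal expression across cells.

    The pseudocount is an illustrative assumption that keeps the ratio
    finite when a cell of the table is empty; set it to 0 to disable.
    """
    a = n_both + pseudocount
    b = n_maternal_only + pseudocount
    c = n_paternal_only + pseudocount
    d = n_neither + pseudocount
    return math.log((a * d) / (b * c))


# Independent bursting with P(maternal) = 0.5 and P(paternal) = 0.2 in 100 cells:
# both = 10, maternal only = 40, paternal only = 10, neither = 40.
independent = log_odds_ratio(10, 40, 10, 40, pseudocount=0)

# Bursts that tend to coincide give a positive logOR.
coordinated = log_odds_ratio(40, 10, 10, 40, pseudocount=0)
```

Here `independent` is zero because the cross-products are equal (10 × 40 = 40 × 10), whereas `coordinated` equals log(16).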
We evaluated independence of allelic expression on the 122 mature blastocyst cells, as was done in Jiang et al.^{24}. We first simulated data under the assumption of independent allelic bursting (Methods) and plotted the results to illustrate how points will be distributed in this diagram when the pure independence model is true (Fig. 4b). Next we estimated the \(2\times 2\) table of allelic expression states using counts obtained from each of the four methods implemented in \({\mathtt{scBASE}}\). The appearance of the data in Fig. 5 is qualitatively distinct from the simulated data (Fig. 4b). Moreover, the null hypothesis of independence is rejected for the majority of genes regardless of the method used to estimate the allelic states (Supplementary Fig. 2a, b, c, d).
We evaluated independence with \({\mathtt{SCALE}}\) software, as in Jiang et al.^{24}, using both unique reads and weighted allocation counts as input. We found significant non-independence for 3,381 genes using unique reads and for 4,815 genes using weighted allocation. We then applied partial pooling and found 6,068 significant genes using unique reads with partial pooling and 6,761 significant genes using weighted allocation with partial pooling. These results are reported using a false discovery rate of 5%. To assess the magnitude of the departure from independence, we note that 2,845 and 3,763 (out of 8,290) genes had \(| {\mathrm{logOR}}| \, > \, 2\) using unique reads and weighted allocation, respectively. After partial pooling, 5,622 and 6,209 genes have \(| {\mathrm{logOR}}| \, > \, 2\) for unique reads and weighted allocation, respectively. The majority of genes had positive logOR, indicating a tendency for bursting to occur more in synchrony than chance would predict (Supplementary Fig. 2e).
We repeated these analyses using three additional datasets^{6,22,25} and arrive at similar conclusions in each case (Supplementary Figs. 3, 4, 5, 6, 7, and 8). The evidence for statistical dependence of bursting is strong and application of weighted allocation and partial pooling strengthens this conclusion.
Characterizing allelic imbalance across a cell population
The \({\mathtt{scBASE}}\) classification step provides a way to characterize the distribution of allelic expression states for any gene across a population of cells. We first compute the posterior probability of allelic expression states P, B, and M, for each gene in each cell. This classification assumes that all genes are expressed at some level, which may be very low for some genes. Thus, there is no state representing the absence of expression. This allows us to classify the allelic expression of cells that may have zero read counts due to statistical sampling. For each gene, we then estimate the proportion of cells in each allelic expression state and represent these proportions as points in a triangular simplex diagram. (Note that this representation comes from projecting out the no-expression dimension in the 4D simplex, Fig. 4a.) To interpret the distribution of allelic expression across cells, we designate seven patterns of allelic expression (Fig. 6). Genes that are predominantly expressed as P, B, or M will appear near the corresponding vertex of the triangle (\({\bf{P}}\), \({\bf{B}}\), or \({\bf{M}}\) region). Genes with mixed allelic states will appear along the edges (\({\bf{PB}}\), \({\bf{BM}}\), or \({\bf{MP}}\) region) or near the center of the triangle (all three states, \({\bf{PBM}}\) region). For example, the gene Pacs2, which is expressed from either the maternal or the paternal allele but rarely both, is classified as an \({\bf{MP}}\) gene. The biallelic region (\({\bf{B}}\)) includes genes that are consistently expressed from both alleles, e.g., Mtdh. The \({\bf{PB}}\) and \({\bf{BM}}\) regions include genes that show a mixture of biallelic and monoallelic expression with a strong allelic imbalance, e.g., Timm23 and Tulp3. The majority of genes (56.9%) in the blastocyst stages of development are in the \({\bf{PBM}}\) region (Supplementary Fig. 9). These genes display a mix of mono- and biallelic expression states (e.g., Akr1b3) that is consistent with dynamic bursting of allele-specific gene expression.
We applied \({\mathtt{scBASE}}\) with weighted allocation and partial pooling to track changes in the ASE patterns of cells sampled over a developmental time course (Fig. 7a, Supplementary Figs. 9 and 10). Our aim is to classify allelic state distributions within subpopulations of cells defined by developmental stages. To achieve this, we first ran the \({\mathtt{scBASE}}\) MCMC algorithm on all 286 cells to estimate the prior parameters, \({\alpha }_{g}^{s}\) and \({\beta }_{g}^{s}\) (Fig. 1 and Supplementary Methods). These parameters describe the distribution of allelic proportions in each allelic state. We then ran the \({\mathtt{scBASE}}\) EM algorithm (with the prior parameters fixed) on each subpopulation of cells to estimate developmental stage-specific parameters (see Methods). In the zygote and early 2-cell stages, most genes show monoallelic maternal expression. At this stage, the hybrid embryo genome is not being transcribed and the mRNA present is derived from the mother (inbred CAST genome). At the mid 2-cell stage, the hybrid embryo genome is being transcribed and we start to see expression of the paternal allele for some genes. Many genes exhibit the \({\bf{M}}\) and \({\bf{BM}}\) patterns through the 8- or 16-cell stages, perhaps due to the persistence of long-lived mRNA species that were present at the 2-cell stage. The biallelic class \({\bf{B}}\) dominates the late 2-cell and 4-cell stages, indicating high levels of expression at rates that exceed the half-life of most mRNA species. In the later stages of development, 8-cell through late blastocyst, most genes transition into the \({\bf{PBM}}\) pattern.
There are \(\sim \! 400\) genes that make dramatic transitions across allelic expression states. For example, Akr1b3 (Fig. 7b) starts in the zygote and early 2-cell stage with only maternal alleles present. It transitions to biallelic expression by the mid 2-cell stage, indicating the onset of transcription of the paternal allele. It then transitions through the paternal monoallelic state. Our interpretation is that the early maternally derived transcripts were present prior to fertilization and these transcripts are still present when the paternal allele of the gene in the hybrid embryo starts to be expressed. The early maternal transcripts are largely degraded by the 4- to 8-cell stages, where we see only expression from the paternal allele. In the early blastocyst stages, we start to see embryonic expression of maternal alleles, resulting in a biallelic expression pattern by the late blastocyst stage.
Discussion
Allelic expression in single cells has provided new insights into the dynamic regulation of gene expression^{6}. However, estimates of allelic proportions can display high statistical variation due to low depth of sequencing coverage per cell. The common practice of discarding multi-mapping reads exacerbates this problem. The \({\mathtt{scBASE}}\) algorithm reduces statistical variability by retaining and disambiguating multiread data. It further improves estimation of allelic proportions by partial pooling of information across cells in the same ASE states. As a result, we can obtain a more precise and accurate picture of gene expression dynamics in which biological stochasticity is revealed by reducing statistical variation.
Weighted allocation has been demonstrated to improve gene expression estimation in whole-tissue RNA-Seq^{15,16,17}. When estimating total gene expression with weighted allocation, only genomic multireads need to be resolved and these typically represent a small proportion of all reads. When estimating allele-specific expression, however, depending on the levels of nucleotide heterozygosity, the majority of reads may lack distinguishing polymorphisms and will be allelic multireads. Complex multireads with ambiguity in both genomic and allelic alignment can carry useful information about allele-specific expression, as illustrated in Supplementary Fig. 1.
\({\mathtt{scBASE}}\) uses partial pooling in the context of a mixture model with three allelic expression states (paternal monoallelic, biallelic, and maternal monoallelic) to preserve cell-to-cell heterogeneity by pooling information across cells that are in the same state. Combining information across cells, therefore, does not weaken the signals of strong allelic imbalance. We applied \({\mathtt{scBASE}}\) to X chromosome genes in female cells of three different datasets^{6,8,25}. In the Reinius et al. fibroblast data, partial pooling corrected the allelic proportions of Xist gene expression towards either maternal or paternal monoallelic expression for both unique reads and weighted allocation counts (Supplementary Fig. 11). Looking at expression of all X chromosome genes in these same cells, we observe that partial pooling strengthens the expected pattern of expression due to X chromosome inactivation (XCI) consistent with Xist allele expression (Supplementary Fig. 12). We observe that XCI is often incomplete and not uniform across cells. In the Chen et al. and Deng et al. datasets, Xist is clearly in the biallelic expression state in many of the mouse embryo cells, epistem cells, or motor neuron cells, and this is preserved after partial pooling (Supplementary Figs. 13 and 14). We also observe that XCI is not fully established for these cells (Supplementary Figs. 15 and 16). In addition, we examined the allelic expression of genes that are reported to be imprinted^{26,27,28}. Irrespective of the estimation method applied, many of these genes do not appear to be fully imprinted in these three datasets (Supplementary Figs. 17, 18, 19, and 20). However, for those genes that do show evidence of imprinting, i.e., appear in the \({\bf{M}}\) or \({\bf{P}}\) class, partial pooling improves the evidence for monoallelic expression for both unique reads and weighted allocation counts.
The \({\mathtt{scBASE}}\) analysis incorporates statistical uncertainty in both the classification of allelic expression state and the estimated allelic proportions of a gene. To evaluate the precision of the estimated parameters, we have computed the posterior standard deviation of allele proportions across a range of total read counts and with varying numbers of cells (286 cells versus 60 cells). The trends are as expected: deeper read coverage or more cells improves the precision of estimation (Supplementary Fig. 21). Our probabilistic classification accounts for uncertainty and can estimate the allelic expression state of a gene even when few or no reads are sampled from a given cell, based on the behavior of other cells. The \({\mathtt{scBASE}}\) model remains reliable with degenerate inputs. For example, in the most extreme case of a single cell and a gene with zero total reads, the algorithm provides a sensible answer: the class probabilities are \((\frac{1}{3},\frac{1}{3},\frac{1}{3})\) and the allelic proportion has a nearly uniform distribution (mean at 0.5 with standard deviation of 0.2), indicating that the data do not contain any information. As the number of cells or the read depth increases, the class probabilities become more concentrated and the posterior distribution for the allelic proportion gets narrower. Partial pooling has the biggest impact when read coverage is low and the number of cells is large (Fig. 3 and Supplementary Fig. 21).
\({\mathtt{scBASE}}\) software can be implemented as part of a scRNA-Seq analysis pipeline. For example, we ran \({\mathtt{SCALE}}\) software, which analyzes the dynamics of gene expression, using \({\mathtt{scBASE}}\) estimated counts as input. We found that both weighted allocation and partial pooling counts identified many more genes as non-independent (Results and Supplementary Figs. 2a, b, c, d). Our findings suggest that running \({\mathtt{SCALE}}\) with \({\mathtt{scBASE}}\) estimated counts as input will result in more accurate estimates of bursting kinetics and reduced levels of monoallelic gene expression when compared to standard pipelines that rely on unique-read counts.
The statistical properties of allelic bursting shed light on the nature of gene expression regulation. If expression bursts are statistically independent, this would imply that the regulation of allelic expression is local and acting autonomously at each allele. Under the perfect independence model, there would be no shared regulation of expression across alleles and the counts of cells in each allelic state will satisfy statistical criteria for independence. Under an alternative model, perfect dependence, bursting would be precisely coordinated across alleles and bursts would occur synchronously. All cells would be in either the biallelic or not expressed states. Our analysis of published scRNA-Seq data from four different experiments^{6,8,22,25} indicates that neither of these extremes is true (Fig. 5 and Supplementary Figs. 2, 3, 4, 5, 6, 7, and 8). We observed that the pattern of bursting is statistically dependent and positively correlated (\({\mathrm{logOR}} \, > \, 0\)) for the majority of genes. It is neither statistically independent nor perfectly synchronous. This suggests that regulation of allelic expression has both shared and locally autonomous components. While our statistical analysis cannot identify the mechanisms of regulation, it seems plausible that diffusible transcription factors could be responsible for the coordinated component of regulation. Local control is likely to be cis-acting and may involve stochastic variation in the activation of the transcriptional machinery. Additional experimental work would be required to test these hypotheses and to identify the cis-acting molecular events that trigger bursting of gene expression. However, the available data are sufficient to reject both hypotheses of perfect independence and of perfect dependence of allelic bursting.
Weighted allocation of multireads captures information in scRNA-Seq that is lost when multireads are discarded. It results in fewer spurious monoallelic expression calls and improves the accuracy of estimated allele proportions. Partial pooling is a technique for leveraging the information across many cells to improve the precision of estimation at the individual cell level. Pooling must be accomplished without compromising the cell-to-cell heterogeneity that single-cell analysis aims to reveal. In \({\mathtt{scBASE}}\), this is achieved by pooling within classes of a mixture model of allelic expression states. Based on the evaluations presented here, we recommend weighted allocation with partial pooling as the best approach to estimate expected counts for scRNA-Seq data. However, the \({\mathtt{scBASE}}\) software implements alternative methods, which could be useful for further evaluation in diverse applications. The retention of multi-mapping sequence reads and partial pooling of information are approaches that can be applied to a wide range of sequencing applications, but they are especially critical in single-cell analysis where the number of cells is large and the number of reads available to quantify each gene in each cell may be very small.
Methods
Data
Deng et al.^{8} sampled 286 pre-implantation embryo cells from an F1 hybrid of CAST\(\times\)B6 along the stages of prenatal development. Embryos were manually dissociated into single cells using Invitrogen TrypLE, and single-end RNA-Seq sequencing was performed using Illumina HiSeq 2000 (Platform GPL12112). There were fastq-format read files for four single-cell samples from zygote stage, eight from early 2-cell, 12 from mid 2-cell, 10 from late 2-cell, 14 from 4-cell, 47 from 8-cell, 30 from 16-cell, 43 from early blastocyst, 60 from mid-blastocyst, and 58 from late blastocyst stage. The Reinius et al. data^{6} consist of primary mouse fibroblast cells from the F1 reciprocal crosses of CAST\(\times\)B6 (125 cells, sex-typed) and B6\(\times\)CAST (113 cells, sex-typed). The Chen et al. data^{25} are from mouse embryonic stem cells (mESCs) from an F1 hybrid of B6\(\times\)CAST: 111 mESCs cultured in 2i and LIF, 120 mESCs cultured in serum and LIF, 183 mouse Epistem cells (mEpiSCs), and 74 post-mitotic neuron cells. The samples are sex-typed. Larsson et al.^{22} generated 224 individual primary mouse fibroblast cells from the F1 hybrid of CAST\(\times\)B6. As the data are from a non-standard SMART-Seq2 platform, we downloaded the allele-specific UMI counts directly from their \({\mathtt{txburst}}\) GitHub repository (https://github.com/sandberglab/txburst), and did not apply weighted allocation to these data. See Data Availability below.
scRNA-Seq read alignment
For the F1 hybrid mouse we aligned reads to a phase-known diploid transcriptome, which is a best-case scenario for phasing. When dealing with more complex genomes, phasing should be performed beforehand if haplotype-specific transcriptomes are not available, and \({\mathtt{scphaser}}\)^{29} is one possible approach. We reconstructed the CAST genome by incorporating known SNPs and short indels (Sanger REL1505) into the reference mouse genome sequence (Genome Reference Consortium Mouse Reference 38) using \({\mathtt{g2gtools}}\) (http://churchilllab.github.io/g2gtools/). We lifted the reference gene annotation (Ensembl Release 78) over to the CAST genome coordinates, and derived a CAST-specific transcriptome. The B6 transcriptome is based on the mouse reference genome. We constructed a bowtie (v1.0.0) index to represent the diploid transcriptome with two alleles of each transcript. We aligned reads using bowtie with parameters ‘--all’, ‘--best’, and ‘--strata’, allowing for three mismatches (‘-v 3’). These settings enable us to find all of the best alignments for each read. For example, if there is a zero-mismatch alignment for a read, all alignments with zero mismatches will be accepted.
Overview of the scBASE model
The \({\mathtt{scBASE}}\) algorithm is composed of three steps: read counting, classification, and partial pooling (Fig. 1). The read counting step is applied first to resolve read mapping ambiguity due to multireads and to estimate expected read counts. The read counting step is not a requirement since the following steps are applicable to any allele-specific count estimates. The classification and partial pooling steps are executed iteratively to classify the allelic expression state and to estimate the allelic proportions for each gene in each cell using a hierarchical mixture model. We have implemented \({\mathtt{scBASE}}\) as a Markov chain Monte Carlo (MCMC) algorithm^{30}, which randomly samples parameter values from their conditional posterior distributions. We have also implemented the classification and partial pooling steps as an Expectation-Maximization (EM) algorithm^{31} that converges to the maximum a posteriori parameter estimates (Supplementary Methods). MCMC is flexible, and the sampling distributions and priors are easy to change in the MCMC code. MCMC provides the full posterior distribution of allelic proportions and thus provides useful information about the uncertainty of estimated parameters. We also found that MCMC is more stable when fitting the allelic proportions of monoallelic classes. The EM algorithm is much faster, but it provides only point estimates. We provide a brief description of the algorithm here and provide additional details in Supplementary Methods.
Read counting: In order to count all of the available sequence reads for each gene and allele, we have to resolve the read mapping ambiguity that occurs when aligning reads to a diploid genome. Genomic multireads align with equal quality to more than one gene. Allelic multireads align with equal quality to both alleles of a gene. In \({\mathtt{scBASE}}\), multireads are resolved by computing a weighted allocation based on the estimated probability of each alignment. We use an EM algorithm implemented in \({\mathtt{EMASE}}\) software for this step^{17}. Alternatively, read counting could be performed using similar methods implemented in \({\mathtt{RSEM}}\)^{15} or \({\mathtt{kallisto}}\)^{16} software. The estimated maternal read count (\({x}_{gk}\)) for each gene (\(g\)) in each cell (\(k\)) is the weighted sum of all reads that align to the maternal allele, where the weights are proportional to the probability of the read alignment. Similarly, the estimated paternal read count (\({y}_{gk}\)) is the weighted sum of all reads that align to the paternal allele. The total read count is the sum of the allele-specific counts (\({n}_{gk}={x}_{gk}+{y}_{gk}\)). A parameter of interest is the allelic proportion \({p}_{gk}\). The read counting step provides an initial estimate \({\hat{p}}_{gk}={x}_{gk}/{n}_{gk}\), which we refer to as the weighted allocation estimated counts (b).
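To make the idea of weighted allocation concrete, the following toy EM loop allocates each multiread across its compatible (gene, allele) targets in proportion to the current abundance estimates. It is a deliberately simplified sketch, not the \({\mathtt{EMASE}}\) implementation: it assumes a flat model with one abundance per target, equal alignment quality for all targets of a read, and no length correction. The function name `weighted_allocation` is hypothetical.

```python
def weighted_allocation(reads, n_iter=100):
    """Toy EM for expected read counts per (gene, allele) target.

    `reads` is a list of tuples, each listing the distinct targets a read
    aligns to with equal quality. Simplified illustration only.
    """
    targets = sorted({t for read in reads for t in read})
    # Start from uniform relative abundances.
    theta = {t: 1.0 / len(targets) for t in targets}
    counts = {t: 0.0 for t in targets}
    for _ in range(n_iter):
        # E-step: split each read across its targets in proportion to theta.
        counts = {t: 0.0 for t in targets}
        for read in reads:
            total = sum(theta[t] for t in read)
            for t in read:
                counts[t] += theta[t] / total
        # M-step: abundances are the normalized expected counts.
        n_reads = sum(counts.values())
        theta = {t: counts[t] / n_reads for t in targets}
    return counts


# One gene in one cell: 3 maternal unique reads, 1 paternal unique read,
# and 4 allelic multireads that align to both alleles.
M, P = ("GeneA", "maternal"), ("GeneA", "paternal")
reads = [(M,)] * 3 + [(P,)] * 1 + [(M, P)] * 4
counts = weighted_allocation(reads)
x_gk, y_gk = counts[M], counts[P]
p_hat = x_gk / (x_gk + y_gk)
```

In this toy case the allelic multireads are split 3:1, following the unique-read evidence, so the expected counts converge to about 6 maternal and 2 paternal reads and the initial estimate \({\hat{p}}_{gk}\) is about 0.75. The unique-reads method would use only the 4 unique reads.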
Classification: In the classification step, we estimate the allelic expression state (\({z}_{gk}\)) for each gene in each cell. The allelic expression state is a latent variable with three possible values \({z}_{gk}\in \{P,B,M\}\) representing paternal monoallelic, biallelic, and maternal monoallelic expression, respectively. Uncertainty about the allelic expression state derives from sampling variation that can produce zero counts for one or both alleles even when the allele-specific transcripts may be present in the cell. We account for this uncertainty by computing a probabilistic classification based on a mixture model in which the maternal read counts \({x}_{gk}\) are drawn from one of three beta-binomial distributions (given \({n}_{gk}\)) according to the allelic expression state \({z}_{gk}\). For a gene in the biallelic expression state, the maternal allelic proportion is denoted \({p}_{gk}^{B}\) and, as suggested by the notation, it may vary from cell to cell following a beta distribution. For a gene in the paternal monoallelic expression state, the allelic proportion \({p}_{g}^{P}\) follows a beta distribution with a high concentration of mass near zero. Similarly, for a gene in the maternal monoallelic expression state, we model \({p}_{g}^{M}\) using a beta distribution with the concentration of mass near one. The beta distribution parameters for the maternal and paternal states are gene-specific but are constant across cells.
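A probabilistic classification of this kind can be sketched with three beta-binomial likelihoods and Bayes' rule. The code below is an illustration under stated assumptions, not the \({\mathtt{scBASE}}\) implementation: the mixture weights and the beta shape parameters are fixed, hypothetical values chosen only so that the P, B, and M components concentrate near 0, 0.5, and 1, whereas in the model described here these quantities are estimated.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(x, n, alpha, beta):
    """Log beta-binomial probability of x maternal reads out of n total."""
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_choose + log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta)


def classify(x, n, weights, shapes):
    """Posterior probability of each allelic state given counts (x, n)."""
    log_post = {s: math.log(weights[s]) + betabinom_logpmf(x, n, *shapes[s])
                for s in weights}
    top = max(log_post.values())            # subtract the max for stability
    unnorm = {s: math.exp(v - top) for s, v in log_post.items()}
    total = sum(unnorm.values())
    return {s: u / total for s, u in unnorm.items()}


# Hypothetical parameters for illustration only.
weights = {"P": 1 / 3, "B": 1 / 3, "M": 1 / 3}
shapes = {"P": (1.0, 20.0), "B": (10.0, 10.0), "M": (20.0, 1.0)}

all_maternal = classify(10, 10, weights, shapes)   # favors M
balanced = classify(5, 10, weights, shapes)        # favors B
no_reads = classify(0, 0, weights, shapes)         # returns the mixture weights
```

With zero reads the likelihood is the same under every state, so the posterior falls back to the mixture weights. Because the log-gamma function accepts non-integer arguments, the same code also accepts the fractional expected counts produced by weighted allocation.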
Partial pooling: The classification step assumes that the mixture model parameters are known. This model describes gene-specific allelic proportions for each cell and thus it has a very large number of parameters. In the scRNA-Seq setting, where thousands of genes are measured but low read counts and sampling zeros are prevalent, we may have limited data to support their reliable estimation. Bayesian analysis using a hierarchical model is well suited for estimation in settings with large numbers of parameters. In this context, the hierarchical model improves the precision of estimation by borrowing information across cells for each gene, giving more weight to cells that are in the same allelic expression state. This estimation technique is referred to as partial pooling. Specifically, we sample the mixture weights \(({\pi }_{g\cdot }^{P},{\pi }_{g\cdot }^{B},{\pi }_{g\cdot }^{M})\) and the class-specific allele proportions \(({p}_{g}^{P},{p}_{gk}^{B},{p}_{g}^{M})\); generate classification probabilities \(({\pi }_{gk}^{P},{\pi }_{gk}^{B},{\pi }_{gk}^{M})\); and then estimate the allelic proportions as a weighted average:
\[{p}_{gk}={\pi }_{gk}^{P}\,{p}_{g}^{P}+{\pi }_{gk}^{B}\,{p}_{gk}^{B}+{\pi }_{gk}^{M}\,{p}_{g}^{M}.\]
The average value across many iterations is \({\tilde{p}}_{gk}\), the partial pooling estimator.
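As a small illustration of this averaging step, the sketch below assumes that each iteration supplies classification probabilities and class-specific proportions for one gene in one cell, and that the weighted average is the sum over states of probability times proportion. The data structure and the values are hypothetical.

```python
def pooled_proportion(draws):
    """Average, over sampler iterations, of the state-weighted allelic proportion.

    Each draw is a dict with two entries (hypothetical layout):
      "pi": classification probabilities for states P, B, M
      "p":  class-specific maternal proportions for states P, B, M
    """
    total = 0.0
    for draw in draws:
        total += sum(draw["pi"][s] * draw["p"][s] for s in ("P", "B", "M"))
    return total / len(draws)


draws = [
    {"pi": {"P": 0.1, "B": 0.8, "M": 0.1}, "p": {"P": 0.02, "B": 0.5, "M": 0.98}},
    {"pi": {"P": 0.2, "B": 0.7, "M": 0.1}, "p": {"P": 0.01, "B": 0.6, "M": 0.99}},
]
print(pooled_proportion(draws))
```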
Estimating allelic proportions in subsets of cells or genes
The \({\mathtt{scBASE}}\) algorithm is designed to model heterogeneous ASE states in any population of cells. In some experiments, we may be interested in the distribution of allelic expression states within predefined groups of cells. For example, in the data from Deng et al., the early stages of development are highly skewed toward maternal expression. If the groups of cells are large, we could apply \({\mathtt{scBASE}}\) separately for each group. However, if the number of cells in a group is small, we recommend a two-stage procedure. First, run MCMC with all of the available cells to estimate the prior parameters, \({\alpha }_{g}^{s}\) and \({\beta }_{g}^{s}\). These prior parameters are common across all cells, so they are estimated most accurately from the full set of cells. Then, holding \({\alpha }_{g}^{s}\) and \({\beta }_{g}^{s}\) fixed, re-estimate the group-specific parameters, \({\pi }_{g\cdot }^{s}\), \({\pi }_{gk}^{s}\), and \({p}_{gk}\), within each cell type using the EM algorithm version of \({\mathtt{scBASE}}\). We applied this approach to the Deng et al. data to generate Fig. 7a.
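The second stage can be sketched as an EM iteration over the cells of one group with the beta parameters held fixed. The sketch below is a simplification under stated assumptions: it updates only the group-level mixture weights and the per-cell classification probabilities, it does not re-estimate \({p}_{gk}\), and the fixed parameter values stand in for first-stage estimates. It is not the \({\mathtt{scBASE}}\) implementation.

```python
import math


def betabinom_logpmf(x, n, a, b):
    """Log beta-binomial mass, written with lgamma so real-valued counts work."""
    def log_beta(u, v):
        return math.lgamma(u) + math.lgamma(v) - math.lgamma(u + v)
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_choose + log_beta(x + a, n - x + b) - log_beta(a, b)


def em_group_weights(counts, params, n_iter=50):
    """Re-estimate mixture weights for one gene within one group of cells.

    counts: list of (x, n) pairs, maternal and total reads per cell
    params: fixed {state: (alpha, beta)} values, assumed to come from stage one
    Returns (group-level weights, per-cell classification probabilities).
    """
    states = ("P", "B", "M")
    weights = {s: 1.0 / 3.0 for s in states}
    post = []
    for _ in range(n_iter):
        # E-step: classification probabilities for each cell
        post = []
        for x, n in counts:
            lik = {s: weights[s] * math.exp(betabinom_logpmf(x, n, *params[s]))
                   for s in states}
            total = sum(lik.values())
            post.append({s: lik[s] / total for s in states})
        # M-step: group-level weights are the mean classification probabilities
        weights = {s: sum(p[s] for p in post) / len(post) for s in states}
    return weights, post


fixed = {"P": (1.0, 50.0), "B": (10.0, 10.0), "M": (50.0, 1.0)}  # illustrative
group = [(20, 20), (18, 20), (19, 20), (10, 20)]  # a maternally skewed group
weights, post = em_group_weights(group, fixed)
print(weights)
```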
When groups of genes are expected to have different distributions of allelic states, e.g., X chromosome genes, it makes sense to run \({\mathtt{scBASE}}\) separately for these genes. Our analyses of female X chromosome genes used this strategy (Supplementary Figs. 12, 15, 16).
Assigning allelic expression states from estimated counts
Unique-read counts are obtained directly from counting reads after discarding all genomic and allelic multireads. Weighted allocation counts are derived from the EM algorithm as described above. To estimate counts after partial pooling, we multiply \({\tilde{p}}_{gk}\) by the total gene expression counts. We note that estimated counts are not integers and may be nonzero but less than one. Classification of allelic expression states for each gene in each cell directly from observed or estimated counts requires setting a threshold for monoallelic expression. We regarded an allele as expressed if its estimated abundance was greater than one read (or one UMI, as in Larsson et al.^{22}).
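The thresholding rule can be written as a short sketch. The function names are hypothetical, and the label used when neither allele passes the threshold is an assumption made for illustration.

```python
def pooled_counts(p_tilde, total):
    """Estimated maternal and paternal counts from a pooled proportion and a total count."""
    return p_tilde * total, (1.0 - p_tilde) * total


def assign_state(maternal, paternal, threshold=1.0):
    """Assign an allelic expression state from (possibly non-integer) counts.

    An allele is regarded as expressed if its count is strictly greater than
    the threshold (one read or one UMI).
    """
    maternal_on = maternal > threshold
    paternal_on = paternal > threshold
    if maternal_on and paternal_on:
        return "B"
    if maternal_on:
        return "M"
    if paternal_on:
        return "P"
    return "silent"  # neither allele passes the threshold


print(assign_state(*pooled_counts(0.98, 20)))  # paternal count 0.4 is below threshold
print(assign_state(12.0, 7.5))
```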
Gene classification using its ASE profile across many cells
We classify a gene according to the proportion of cells in the P, B, and M states, \(({\pi }_{g\cdot }^{P},{\pi }_{g\cdot }^{B},{\pi }_{g\cdot }^{M})\), that are estimated by the partial pooling model. If a majority of cells (\({\pi }_{g\cdot }^{s}\, > \, {0.7}\)) are in a particular ASE state, \(s\in \{P,B,M\}\), then we assign the gene to the class \({\bf{P}}\) (monoallelic paternal; blue), \({\bf{B}}\) (biallelic; yellow), or \({\bf{M}}\) (monoallelic maternal; red), respectively. When a majority of cells are a mixture of two of those classes (\({\pi }_{g\cdot }^{{s}_{1}}+{\pi }_{g\cdot }^{{s}_{2}}\, > \, {0.9}\) where \({s}_{1},{s}_{2}\in \{P,B,M\}\)), we classify the gene as one of \({\bf{PB}}\) (mixture of monoallelic paternal and biallelic; green), \({\bf{BM}}\) (mixture of monoallelic maternal and biallelic; orange), or \({\bf{MP}}\) (mixture of monoallelic maternal and paternal; purple). Otherwise, genes that present all three ASE states are classified as \({\bf{PBM}}\) (mixture of all; gray). We specified these seven classes in a ternary simplex diagram^{32} (Fig. 6). The class boundaries are arbitrary, but the aim of this classification is to provide a simple descriptive summary of the gene expression states present in a population of cells.
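The seven-class rule can be expressed directly in code. The sketch below uses the thresholds stated above (0.7 for a single state and 0.9 for a pair of states); the function name is hypothetical.

```python
def classify_gene(pi_p, pi_b, pi_m, single=0.7, pair=0.9):
    """Assign a gene to one of seven classes from its mixture weights.

    The three weights are the estimated proportions of cells in the
    P, B, and M states and are expected to sum to one.
    """
    w = {"P": pi_p, "B": pi_b, "M": pi_m}
    # One dominant state
    for state in ("P", "B", "M"):
        if w[state] > single:
            return state
    # Two states jointly dominant. At most one pair can exceed 0.9 when no
    # single weight exceeds 0.7, so the order of the checks does not matter.
    for a, b, label in (("P", "B", "PB"), ("B", "M", "BM"), ("M", "P", "MP")):
        if w[a] + w[b] > pair:
            return label
    # All three states are present
    return "PBM"


print(classify_gene(0.8, 0.1, 0.1))
print(classify_gene(0.5, 0.45, 0.05))
print(classify_gene(0.4, 0.3, 0.3))
```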
Sampling reads
We randomly sampled 1% of reads in each of 122 cells at the early, mid, and late blastocyst stages to obtain an average read count of \(\sim\)148k reads per cell. We chose the blastocyst cell types because, unlike cells in earlier developmental stages, they show the widest range of different states of allelic expression. The original analysis of \({\mathtt{SCALE}}\)^{24} also used the same 122 cells. We applied the unique-reads method and the weighted allocation algorithm to the full set of \(\sim\)14.8M reads and also applied each of the four estimation methods (unique reads, weighted allocation counts, unique reads with partial pooling, and weighted allocation with partial pooling) to the downsampled data. We compared estimates obtained from the downsampled data to the full data estimates and computed the mean squared error of estimation across cells for each gene.
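The comparison can be sketched for one gene as follows. As a simplification, the sketch thins allele-specific counts directly rather than sampling raw reads before alignment, and the counts are invented for illustration.

```python
import random


def downsample(count, fraction, rng):
    """Binomial thinning: keep each of `count` reads with probability `fraction`."""
    return sum(1 for _ in range(count) if rng.random() < fraction)


def allelic_proportion(x, y):
    """Maternal proportion, or None when there are no reads."""
    n = x + y
    return x / n if n > 0 else None


def mse(estimates, reference):
    """Mean squared error across cells, skipping cells with an undefined value."""
    pairs = [(e, r) for e, r in zip(estimates, reference)
             if e is not None and r is not None]
    if not pairs:
        return None
    return sum((e - r) ** 2 for e, r in pairs) / len(pairs)


rng = random.Random(0)
full = [(3000, 1000), (500, 1500), (2000, 2000)]  # (maternal, paternal) per cell
down = [(downsample(x, 0.01, rng), downsample(y, 0.01, rng)) for x, y in full]
p_full = [allelic_proportion(x, y) for x, y in full]
p_down = [allelic_proportion(x, y) for x, y in down]
print(mse(p_down, p_full))
```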
Simulation of counts under perfect independence model
We randomly sampled the marginal probabilities of maternal and paternal allelic expression, \({p}_{M}\) and \({p}_{P}\), from a uniform distribution between 0 and 1. Then we generated 2\(\times\)2 tables by sampling counts from a multinomial distribution with probabilities \(\{{p}_{M}{p}_{P},\ \ \ {p}_{M}(1-{p}_{P}),\ \ \ (1-{p}_{M}){p}_{P},\ \ \ (1-{p}_{M})(1-{p}_{P})\}\) for biallelic, maternal monoallelic, paternal monoallelic, and silent cells, respectively.
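A minimal sketch of this simulation is given below. The number of cells per table and the function name are assumptions made for illustration.

```python
import random


def simulate_table(n_cells, rng):
    """Simulate one 2x2 table of cell counts under independent allelic expression.

    The marginal probabilities of maternal and paternal expression are drawn
    uniformly on [0, 1); cell categories are then drawn from the multinomial
    distribution implied by independence.
    """
    p_m = rng.random()
    p_p = rng.random()
    labels = ["biallelic", "maternal", "paternal", "silent"]
    probs = [p_m * p_p,
             p_m * (1 - p_p),
             (1 - p_m) * p_p,
             (1 - p_m) * (1 - p_p)]
    draws = rng.choices(labels, weights=probs, k=n_cells)
    counts = {label: draws.count(label) for label in labels}
    return counts, (p_m, p_p)


rng = random.Random(42)
counts, (p_m, p_p) = simulate_table(100, rng)
print(counts, p_m, p_p)
```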
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
We downloaded Deng et al.^{8} data, Series GSE45719, from Gene Expression Omnibus (GEO) at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE45719. Reinius et al.^{6} data are available from GEO at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE75659. We downloaded Chen et al.^{25} data (files in SRA format) available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE74155. For the analysis of Larsson et al.^{22} data, we downloaded the allele-specific UMI counts from https://github.com/sandberglab/txburst/tree/master/data (as of 19 April 2019). All other relevant data are available upon request.
Code availability
We have implemented our approach in extensible open-source software, scBASE, available at https://github.com/churchilllab/scBASE under the MIT licence.
References
1. Crowley, J. J. et al. Analyses of allele-specific gene expression in highly divergent mouse crosses identifies pervasive allelic imbalance. Nat. Genet. 47, 353–360 (2015).
2. Santoni, F. A. et al. Detection of imprinted genes by single-cell allele-specific gene expression. Am. J. Hum. Genet. 100, 444–453 (2017).
3. Tukiainen, T. et al. Landscape of X chromosome inactivation across human tissues. Nature 550, 244–248 (2017).
4. Garieri, M. et al. Extensive cellular heterogeneity of X inactivation revealed by single-cell allele-specific expression in human fibroblasts. Proc. Natl Acad. Sci. USA 115, 13015–13020 (2018).
5. Linnarsson, S. & Teichmann, S. A. Single-cell genomics: coming of age. Genome Biol. 17, 97 (2016).
6. Reinius, B. et al. Analysis of allelic expression patterns in clonal somatic cells by single-cell RNA-seq. Nat. Genet. 48, 1430–1435 (2016).
7. Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1095 (2013).
8. Deng, Q., Ramsköld, D., Reinius, B. & Sandberg, R. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 343, 193–196 (2014).
9. Kim, J. K. et al. Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression. Nat. Commun. 6, 8687 (2015).
10. Stegle, O., Teichmann, S. A. & Marioni, J. C. Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet. 16, 133–145 (2015).
11. Bacher, R. & Kendziorski, C. Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol. 17, 63 (2016).
12. Rostom, R., Svensson, V., Teichmann, S. A. & Kar, G. Computational approaches for interpreting scRNA-seq data. FEBS Lett. 591, 2213–2225 (2017).
13. Qiu, X. et al. Single-cell mRNA quantification and differential analysis with Census. Nat. Methods 14, 309–315 (2017).
14. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
15. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
16. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
17. Raghupathy, N. et al. Hierarchical analysis of RNA-seq reads improves the accuracy of allele-specific expression. Bioinformatics 34, 2177–2184 (2018).
18. Keane, T. M. et al. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477, 289–294 (2011).
19. Huang, M. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat. Methods 15, 539–542 (2018).
20. Reinius, B. & Sandberg, R. Random monoallelic expression of autosomal genes: stochastic transcription and allele-level regulation. Nat. Rev. Genet. 16, 653–664 (2015).
21. Agresti, A. Contingency Tables 2nd edn (John Wiley and Sons, 2007).
22. Larsson, A. J. M. et al. Genomic encoding of transcriptional burst kinetics. Nature 565, 251–254 (2019).
23. Slavković, A. & Fienberg, S. in Algebraic and Geometric Methods in Statistics Ch. 3 (eds. Gibilisco, P., Riccomagno, E., Rogantin, M. P. & Wynn, H. P.) 63–81 (Cambridge Univ. Press, 2009).
24. Jiang, Y., Zhang, N. R. & Li, M. SCALE: modeling allele-specific gene expression by single-cell RNA sequencing. Genome Biol. 18, 74 (2017).
25. Chen, G. et al. Single-cell analyses of X chromosome inactivation dynamics and pluripotency during differentiation. Genome Res. 26, 1342–1354 (2016).
26. Babak, T. et al. Global survey of genomic imprinting by transcriptome sequencing. Curr. Biol. 18, 1735–1741 (2008).
27. The Jackson Laboratory. Mouse genome informatics. Jackson Lab. http://www.informatics.jax.org/searchtool/Search.do?query=genetic+imprinting&submit=Quick%250D%250ASearch (2019).
28. Jirtle, R. L. Imprinted genes: by species. geneimprint. http://www.geneimprint.com/site/genesbyspecies.Mus+musculus (2012).
29. Edsgärd, D., Reinius, B. & Sandberg, R. scphaser: haplotype inference using single-cell RNA-seq data. Bioinformatics 32, 3038–3040 (2016).
30. Carpenter, B. et al. Stan: a probabilistic programming language. J. Stat. Softw. Art. 76, 1–32 (2017).
31. Kleinman, J. C. Proportions with extraneous variance: single and independent sample. J. Am. Stat. Assoc. 68, 46–54 (1973).
32. Harper, M. et al. python-ternary: Ternary plots in Python. GitHub. https://github.com/marcharper/pythonternary (2015).
Acknowledgements
This work has been supported by the National Institutes of Health (NIH) grant R01GM070683. We would also like to thank Steven C. Munger and Daniel A. Skelly for their helpful comments on this manuscript.
Author information
Contributions
K.C., N.R. and G.C. conceived and planned the study. K.C. performed the model implementation and analyses. K.C. and G.C. interpreted the scientific findings. K.C. and G.C. discussed and wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Choi, K., Raghupathy, N. & Churchill, G.A. A Bayesian mixture model for the analysis of allelic expression in single cells. Nat Commun 10, 5188 (2019). https://doi.org/10.1038/s41467019130990