Single-cell data provide a means to dissect the composition of complex tissues and specialized cellular environments. However, the analysis of such measurements is complicated by high levels of technical noise and intrinsic biological variability. We describe a probabilistic model of expression-magnitude distortions typical of single-cell RNA-sequencing measurements, which enables detection of differential expression signatures and identification of subpopulations of cells in a way that is more tolerant of noise.
At a glance
Modeling single-cell RNA-seq measurement.
(a) Smoothed scatter plot comparing gene-expression estimates from two mouse embryonic fibroblasts (MEFs), illustrating the types of cell-to-cell variability observed. RPM, reads per million. (b) Plots showing expression of Rnaseh2a and Bmp4, as examples of top differentially expressed genes, from CuffDiff2 (ref. 14) comparison of ten embryonic stem (ES) and ten MEF cells. Triangles show expression magnitudes observed in different cells, and whiskers span the range of observed expression magnitudes. (c) Plot showing a cross-comparison of single-cell measurements in cells of the same type, determining whether the transcript is likely to have been successfully amplified in both experiments (correlated component). (d) Plot showing read counts observed for a particular cell (y axis) relative to the expected expression magnitude (x axis; see c). The measurement is modeled as a mixture of dropout (red) and successful amplification processes (blue), with magnitude-dependent mixing of the two processes. (e, f) Probability of transcript-detection failures (dropout events) as a function of expression magnitude for individual ES and MEF cells (e; ref. 2) and for individual cells from 4-, 8- and 16-cell embryos (f; ref. 12).
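The mixture described in panel d can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the caption only states that a measurement is a magnitude-dependent mixture of a dropout process and a successful-amplification process. The choice of a low-rate Poisson for dropout, a negative binomial for amplification, a logistic curve for the mixing weight, and every parameter value and function name below are assumptions made for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing k reads at rate lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def neg_binom_pmf(k, mean, size):
    """Negative binomial pmf with variance = mean + mean**2 / size."""
    p = size / (size + mean)
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)


def dropout_probability(log10_magnitude, intercept=2.0, slope=-1.5):
    """Assumed logistic mixing weight: dropout becomes less likely
    as the expected expression magnitude grows (slope < 0)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * log10_magnitude)))


def read_count_likelihood(count, expected_magnitude,
                          background_rate=0.1, scale=1.0, size=2.0):
    """P(count | expected magnitude) under the two-component mixture.

    background_rate, scale and size are hypothetical per-cell parameters;
    in practice they would be fitted separately for each cell.
    """
    d = dropout_probability(math.log10(expected_magnitude))
    dropout_part = poisson_pmf(count, background_rate)
    amplified_part = neg_binom_pmf(count, scale * expected_magnitude, size)
    return d * dropout_part + (1.0 - d) * amplified_part
```

Under these assumed parameters, a zero read count is far more probable for a weakly expressed gene than for a strongly expressed one, which is the qualitative behavior shown by the dropout curves in panels e and f.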
Applying single-cell models for differential expression and subpopulation analyses.
(a) Expression differences of Sox2 between all ES and MEF cells, measured by Islam et al. (ref. 2). The plots show posterior probability (y axis, probability density) of expression magnitudes in mouse ES (mES, top) and MEF (bottom) cells. The model fitted for each single cell is used to estimate the likelihood that a gene is expressed at any particular level, given the observed data (red or blue curves). The black curve shows the estimated joint posterior distribution for the overall level for each cell type. The posterior probability of the fold-expression difference is shown in the middle plot with the associated raw P value (two-sided) of differential expression. (b) Expression differences of Dazl between cells of 8-cell and 16-cell mouse embryo stages (ref. 12), as in a. A regulatory factor expressed in mammalian embryos (refs. 19, 20), Dazl is expressed at earlier stages and shows a drop-off between 8- and 16-cell stages. (c) Receiver operating characteristic curves comparing the ability to detect differentially expressed genes, with bulk expression measurements as a benchmark (ref. 17). SCA, single-cell assay (ref. 15); AUC, area under curve. (d) Performance of error model–based transcriptional similarity measures in distinguishing ES and MEF cell types. The plot shows the fraction of correctly classified cells, assessed for increasingly difficult classification problems by iterative exclusion of up to 7,000 of the most informative genes (i.e., genes differentially expressed between ES and MEF, x axis). The 95% confidence bands (of the mean) are shown in light shading.
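The posterior calculation described in panels a and b can be illustrated with a small grid-based sketch. It is a simplified reading of the caption rather than the published procedure: the per-cell likelihood here is a hypothetical Poisson mixture, the prior over expression magnitude is assumed flat, and the joint posterior is taken as a plain product over cells, none of which is specified in the text above. All function names and parameter values are illustrative.

```python
import math


def log_poisson(k, lam):
    """Log of the Poisson pmf, kept in log space to avoid underflow."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def log_cell_likelihood(count, magnitude, dropout_midpoint=10.0):
    """Hypothetical per-cell model: a dropout component (Poisson, rate 0.1)
    mixed with an amplified component (Poisson, rate = magnitude)."""
    d = 1.0 / (1.0 + magnitude / dropout_midpoint)  # falls as magnitude rises
    a = math.log(d) + log_poisson(count, 0.1)
    b = math.log1p(-d) + log_poisson(count, magnitude)
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))  # log(exp(a) + exp(b))


def joint_posterior(counts, grid):
    """Posterior over grid magnitudes for one cell type, assuming a flat
    prior and independent cells (product of per-cell likelihoods)."""
    log_post = [sum(log_cell_likelihood(c, m) for c in counts) for m in grid]
    peak = max(log_post)
    weights = [math.exp(v - peak) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def fold_difference_posterior(post_a, post_b):
    """Posterior of the log-ratio a/b on a log-uniform grid.
    Entry k corresponds to a shift of k - (n - 1) grid steps."""
    n = len(post_a)
    diff = [0.0] * (2 * n - 1)
    for i, pa in enumerate(post_a):
        for j, pb in enumerate(post_b):
            diff[i - j + n - 1] += pa * pb
    return diff


# Log2-spaced grid from 1 to 1024 in quarter-log2 steps (illustrative).
grid = [2 ** (x / 4) for x in range(0, 41)]
```

With read counts such as `[50, 60, 0, 55]` for one cell type and `[1, 0, 2, 0]` for another, the zero in the first group is absorbed by the dropout component, so the joint posterior still concentrates near the nonzero observations, and the fold-difference posterior places most of its mass on a positive log-ratio.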