Validation of noise models for single-cell transcriptomics

Journal name: Nature Methods

Single-cell transcriptomics has recently emerged as a powerful technology to explore gene expression heterogeneity among single cells. Here we identify two major sources of technical variability: sampling noise and global cell-to-cell variation in sequencing efficiency. We propose noise models to correct for both sources, and we validate these models using single-molecule FISH (smFISH). We demonstrate that gene expression variability in mouse embryonic stem cells depends on the culture condition.
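
The interplay of the two technical noise sources named above can be illustrated with a short calculation. The following Python sketch is a minimal illustration, not the authors' models I–III: it assumes that sampling noise is Poissonian and that sequencing efficiency acts as an independent multiplicative factor, so that the law of total variance gives CV² = 1/mean + CV_efficiency². The function names and the example efficiency CV of 0.3 are hypothetical.

```python
import math


def poisson_cv(mean):
    """CV of a Poisson distribution: s.d. / mean = 1 / sqrt(mean)."""
    if mean <= 0:
        raise ValueError("mean must be positive")
    return 1.0 / math.sqrt(mean)


def technical_cv(mean, cv_efficiency):
    """CV of counts when Poisson sampling is combined with a random efficiency.

    Assumption (not taken from the paper): counts are Poisson distributed
    given the efficiency, and the efficiency varies between samples with
    coefficient of variation `cv_efficiency`. Then
        Var = mean + (mean * cv_efficiency)**2
    and therefore CV**2 = 1 / mean + cv_efficiency**2.
    """
    if mean <= 0:
        raise ValueError("mean must be positive")
    if cv_efficiency < 0:
        raise ValueError("cv_efficiency must be non-negative")
    return math.sqrt(1.0 / mean + cv_efficiency ** 2)


if __name__ == "__main__":
    # Hypothetical efficiency CV of 0.3, chosen only for illustration.
    for mean in (1, 10, 100, 1000):
        print(mean, round(poisson_cv(mean), 3), round(technical_cv(mean, 0.3), 3))
```

Under these assumptions, sampling noise dominates at low expression, whereas the CV of highly expressed genes levels off at the efficiency CV instead of continuing to fall as 1/sqrt(mean).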

At a glance


  1. Figure 1: Analysis of gene expression noise with single-cell mRNA sequencing.

    (a) Hand-picked single mouse embryonic stem cells (mESCs) were spiked with foreign RNA (left) and sequenced with a modified CEL-Seq protocol (ref. 3; Online Methods). To measure technical noise, RNA aliquots from bulk cells were treated likewise (right). (b) Coefficient of variation (CV) computed on the basis of transcripts per million (TPM) versus reads per million (RPM). The diagonal (red solid line) and twofold change intervals (red dashed lines) are indicated. (c) CV across control samples as a function of average expression. Blue line indicates CV for a hypothetical Poissonian distribution; dashed line represents CV computed from the s.d. of β, i.e., for global tube-to-tube variability. (d) Count distribution of Pou5f1 transcripts across cells and controls (dashed lines) fitted by negative binomials (solid lines). (e) Different functions were fitted to the count distribution in cells and controls. The goodness of fit was assessed by a χ² test. The bar plot shows the number of genes for which a given distribution was not rejected (χ² test P > 0.01).

  2. Figure 2: Modeling of technical variability and inference of biological noise.

    (a) Schematic of transcript normalization for model I. (b) Linear regression of the transcript counts on the predicted spike-in molecule numbers. Different slopes reflect varying efficiencies between samples. (c,d) CV as a function of average expression in cells and controls with (c) and without (d) count normalization. Predictions are indicated for model I in c and models II and III in d. The moving average and the CV for Poissonian noise are also shown. (e) Schematic of biological noise inference. Cell-to-cell noise (red) was fitted by a negative binomial, and a deconvolution of technical noise (gray) yields biological variability (turquoise). Example distributions are shown for Pou5f1. Noise distributions are similar for models II and III but narrower for model I owing to the elimination of global variability by normalization.

  3. Figure 3: Validation of predicted biological variability by smFISH.

    (a) Pou5f1 transcripts labeled with Cy5 using smFISH in mESCs (maximal z projections). Single molecules appear as diffraction-limited spots. Nuclei were stained with DAPI. Scale bar, 10 μm. (b) Count distribution for smFISH on Pou5f1 (>100 cells) and a negative binomial fit (black line) with uncertainty interval (dashed lines). The P value for rejecting a negative binomial was computed by a χ² test. (c) CV measured in cells, as inferred after deconvolution of technical noise and measured with smFISH. In the model I comparison, the CV after normalization of transcript counts without deconvolution of sampling noise is also shown. Error bars are derived from estimated standard errors of the numerical fits. (d) z score of deviations between models and smFISH-based CVs averaged across genes. The z score after normalization (Median norm.) of transcript counts without deconvolution of sampling noise is also shown. (e) Distribution of Fano factors as measured in cells and controls and as inferred for biological variability using model III. The distribution after normalization of transcript counts without deconvolution of sampling noise is also shown (Median norm.). The inset shows a histogram of log₂ fold changes between Fano factors before and after deconvolution of technical noise. (f) Scatter plot of Fano factors in the serum versus 2i conditions. Genes whose Fano factors differ between the two conditions by more than their error bars are colored. Error bars are based on standard errors of fitting parameters.
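
The summary statistics named in the figure captions (per-sample efficiency slopes, CV, Fano factor, negative binomial fits) can be made concrete with a small sketch. The Python code below is an illustration under stated assumptions, not the authors' fitting procedure: the regression is forced through the origin, which the caption does not specify, and the negative binomial size parameter is obtained by the method of moments rather than by a numerical fit. All function names are hypothetical.

```python
import math
from statistics import mean, variance


def efficiency_slope(expected_molecules, observed_counts):
    """Least-squares slope of observed counts on expected spike-in molecules.

    Assumption: the line passes through the origin, so the slope can be read
    as the fraction of molecules that is detected in one sample.
    """
    if len(expected_molecules) != len(observed_counts):
        raise ValueError("inputs must have the same length")
    sxx = sum(x * x for x in expected_molecules)
    if sxx == 0:
        raise ValueError("expected molecule numbers must not all be zero")
    sxy = sum(x * y for x, y in zip(expected_molecules, observed_counts))
    return sxy / sxx


def cv_and_fano(counts):
    """Return (CV, Fano factor) of transcript counts across samples.

    CV = s.d. / mean and Fano factor = variance / mean, both using the
    unbiased sample variance.
    """
    m = mean(counts)
    if m <= 0:
        raise ValueError("mean count must be positive")
    v = variance(counts)
    return math.sqrt(v) / m, v / m


def nb_size_from_moments(m, v):
    """Method-of-moments size parameter r of a negative binomial.

    Uses variance = mean + mean**2 / r. Returns None when the data are not
    overdispersed relative to a Poisson distribution (variance <= mean).
    """
    if v <= m:
        return None
    return m * m / (v - m)
```

For a Poisson distribution the Fano factor equals 1, so a Fano factor above 1 indicates overdispersion; in the negative binomial parameterization used here it equals 1 + mean/r.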

Accession codes

Primary accessions

Gene Expression Omnibus


  1. Munsky, B., Neuert, G. & van Oudenaarden, A. Science 336, 183–187 (2012).
  2. Eldar, A. & Elowitz, M.B. Nature 467, 167–173 (2010).
  3. Hashimshony, T., Wagner, F., Sher, N. & Yanai, I. Cell Rep. 2, 666–673 (2012).
  4. Sasagawa, Y. et al. Genome Biol. 14, R31 (2013).
  5. Tang, F. et al. Nat. Methods 6, 377–382 (2009).
  6. Ramsköld, D. et al. Nat. Biotechnol. 30, 777–782 (2012).
  7. Islam, S. et al. Genome Res. 21, 1160–1167 (2011).
  8. Picelli, S. et al. Nat. Methods 10, 1096–1098 (2013).
  9. Shapiro, E., Biezuner, T. & Linnarsson, S. Nat. Rev. Genet. 14, 618–630 (2013).
  10. Kivioja, T. et al. Nat. Methods 9, 72–74 (2012).
  11. Shiroguchi, K., Jia, T.Z., Sims, P.A. & Xie, X.S. Proc. Natl. Acad. Sci. USA 109, 1347–1352 (2012).
  12. Hug, H. & Schuler, R. J. Theor. Biol. 221, 615–624 (2003).
  13. Shalek, A.K. et al. Nature 498, 236–240 (2013).
  14. Islam, S. et al. Nat. Methods 11, 163–166 (2014).
  15. Jaitin, D.A. et al. Science 343, 776–779 (2014).
  16. Brennecke, P. et al. Nat. Methods 10, 1093–1095 (2013).
  17. Ying, Q.-L. et al. Nature 453, 519–523 (2008).
  18. The External RNA Controls Consortium. Nat. Methods 2, 731–734 (2005).
  19. Raj, A., Peskin, C.S., Tranchina, D., Vargas, D.Y. & Tyagi, S. PLoS Biol. 4, e309 (2006).
  20. Raj, A., van den Bogaard, P., Rifkin, S.A., van Oudenaarden, A. & Tyagi, S. Nat. Methods 5, 877–879 (2008).
  21. Li, H. & Durbin, R. Bioinformatics 26, 589–595 (2010).
  22. Meyer, L.R. et al. Nucleic Acids Res. 41, D64–D69 (2013).
  23. Anders, S. & Huber, W. Genome Biol. 11, R106 (2010).
  24. Robinson, M.D., McCarthy, D.J. & Smyth, G.K. Bioinformatics 26, 139–140 (2010).
  25. Byrd, R.H., Lu, P., Nocedal, J. & Zhu, C. SIAM J. Sci. Comput. 16, 1190–1208 (1995).


Author information

  1. These authors contributed equally to this work.

    • Dominic Grün &
    • Lennart Kester


  1. Hubrecht Institute-KNAW (Royal Netherlands Academy of Arts and Sciences), Utrecht, The Netherlands.

    • Dominic Grün,
    • Lennart Kester &
    • Alexander van Oudenaarden
  2. University Medical Center Utrecht, Cancer Genomics Netherlands, Utrecht, The Netherlands.

    • Dominic Grün,
    • Lennart Kester &
    • Alexander van Oudenaarden


D.G., L.K. and A.v.O. conceived the methods. D.G. developed the noise models, performed all computations and wrote the manuscript. L.K. performed all experiments and corrected the manuscript. A.v.O. guided experiments, data analysis and writing of the manuscript, and corrected the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to:


Supplementary information

PDF files

  1. Supplementary Text and Figures (11,966 KB)

    Supplementary Figures 1–15, Supplementary Table 1 and Supplementary Notes 1–4.

Excel files

  1. Supplementary Table 2 (82 KB)

    GO terms enriched among genes with increased expression variability in serum versus 2i culture condition. Enriched biological processes and enriched molecular functions are given as separate lists. Only significantly enriched GO-terms (P < 0.05) were included. The lists indicate the GO-term ID, the hypergeometric P-value, the odds ratio, the expected number of genes associated with each GO-term, the observed number of genes for each GO-term, the size of the GO-term (total number of genes associated) and a short description. For the inference of over-represented GO terms, the set of differentially variable genes was compared to the universe of all genes expressed in the two conditions. The GOstats package was used to compute GO enrichment in R.

  2. Supplementary Table 3 (56 KB)

    Probe set composition of smFISH probes used. Each column represents a probe set for the gene specified in the column header. All probes were labeled on the 3' end with TMR, Alexa594 or Cy5.
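
The hypergeometric P value reported for each GO term in Supplementary Table 2 can be illustrated with a short calculation. The Python sketch below shows the standard one-sided over-representation test; it is not the GOstats implementation, which runs in R, and the function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe, annotated, selected, overlap):
    """One-sided hypergeometric P value for over-representation.

    universe  : number of genes expressed in the two conditions
    annotated : number of universe genes associated with the GO term
    selected  : number of differentially variable genes
    overlap   : number of selected genes associated with the GO term

    Returns P(X >= overlap) when `selected` genes are drawn from the
    universe without replacement.
    """
    if not (0 <= annotated <= universe and 0 <= selected <= universe):
        raise ValueError("annotated and selected must lie within the universe")
    if not (0 <= overlap <= min(annotated, selected)):
        raise ValueError("overlap is inconsistent with the set sizes")
    total = comb(universe, selected)
    # math.comb returns 0 when k > n, so impossible draws contribute nothing.
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, min(annotated, selected) + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical example: 10 genes, 4 annotated, 3 selected, 2 in common.
    print(hypergeom_enrichment_p(10, 4, 3, 2))
```

The expected number of annotated genes in the selection, listed as a separate column in the table, corresponds to selected × annotated / universe under the same null model.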
