Article series: Single-cell omics

Computational and analytical challenges in single-cell transcriptomics

Journal name: Nature Reviews Genetics
Volume: 16
Pages: 133–145
DOI: doi:10.1038/nrg3833

Abstract

The development of high-throughput RNA sequencing (RNA-seq) at the single-cell level has already led to profound new discoveries in biology, ranging from the identification of novel cell types to the study of global patterns of stochastic gene expression. Alongside the technological breakthroughs that have facilitated the large-scale generation of single-cell transcriptomic data, it is important to consider the specific computational and analytical challenges that still have to be overcome. Although some tools for analysing RNA-seq data from bulk cell populations can be readily applied to single-cell RNA-seq data, many new computational strategies are required to fully exploit this data type and to enable a comprehensive yet detailed study of gene expression at the single-cell level.

Figures

  1. Comparison of bulk and scRNA-seq analytical strategies.

    A flow chart of the steps in analysis of high-throughput RNA sequencing (RNA-seq) data from bulk cell populations and from single cells is shown. Methods that are common to both approaches are shown in purple, whereas key differences in analysis methods between bulk-based RNA-seq and single-cell RNA-seq (scRNA-seq) are shown in blue and red, respectively. FPKM, fragments per kilobase of exon per million fragments mapped; PCA, principal component analysis.

  2. Quality control and normalization.

    a | Basic quality control steps are shown. After generating single-cell RNA sequencing (scRNA-seq) data, a key first step is to assess the quality of the data. In addition to quality metrics developed for bulk RNA-seq, it is important to determine whether cells have been captured efficiently and the mRNA fraction amplified faithfully. Two simple but important criteria are to compare the percentage of unmapped reads and the percentage of reads mapped to the external spike-in molecules across cells. Cells in which either of these values is high (grey) are of poor quality and should be discarded, leaving only the higher-quality cells (green) for downstream analyses. b | Spike-ins can be used to model technical variability and examine relative variability in cell size for non-unique molecular identifier (UMI)-based scRNA-seq data. If external spike-in molecules are added at the same volume to the RNA mixture from each cell before processing, they can be used to quantify the degree of technical variability across cells and to examine the relationship between technical variation and gene expression (upper panel). The x axis shows average expression levels across cells, and the y axis shows the squared coefficient of variation; blue points are extrinsic spike-in molecules. The red line indicates the fitted relationship between technical noise and gene expression strength. Additionally, by calculating the ratio between the numbers of reads mapped to the spike-in sequences and to the genes from the organism of interest, the relative amount of mRNA contained in each cell can be estimated (lower panel). c | Spike-ins can also be used to model technical variability and to examine relative variability in cell size for UMI-based scRNA-seq data. Similar to part b, the upper panel illustrates the relationship between technical noise and expression strength; the difference is that the expression level of each gene is now quantified as the number of unique cDNA molecules. Additionally, spike-ins can be used to quantify the capture efficiency and thus infer the number of mRNA molecules contained in the lysate of each cell (lower panel). Upper panels of parts b and c adapted from Ref. 33 and Ref. 40, respectively, Nature Publishing Group.
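
The two quality-control criteria in part a of Figure 2 and the squared coefficient of variation plotted in part b can be illustrated with a minimal sketch. The cell names, read counts, field names and both cut-offs below are hypothetical values chosen for illustration; the article does not prescribe specific thresholds.

```python
from statistics import mean, variance

# Hypothetical per-cell read counts (illustrative only, not from the article).
cells = {
    "cell_A": {"total": 1_000_000, "unmapped": 80_000, "spike_in": 50_000},
    "cell_B": {"total": 900_000, "unmapped": 540_000, "spike_in": 40_000},
    "cell_C": {"total": 1_200_000, "unmapped": 100_000, "spike_in": 700_000},
}

# Assumed cut-offs; in practice these are chosen by inspecting the
# distribution of both fractions across all cells.
MAX_UNMAPPED_FRACTION = 0.5
MAX_SPIKE_IN_FRACTION = 0.5


def qc_fractions(counts):
    """Return (fraction of reads unmapped, fraction of mapped reads on spike-ins)."""
    unmapped_fraction = counts["unmapped"] / counts["total"]
    mapped = counts["total"] - counts["unmapped"]
    spike_in_fraction = counts["spike_in"] / mapped if mapped else 1.0
    return unmapped_fraction, spike_in_fraction


def passes_qc(counts):
    """A cell is kept only if neither fraction is high."""
    unmapped_fraction, spike_in_fraction = qc_fractions(counts)
    return (unmapped_fraction <= MAX_UNMAPPED_FRACTION
            and spike_in_fraction <= MAX_SPIKE_IN_FRACTION)


def squared_cv(values):
    """Squared coefficient of variation: sample variance divided by mean squared."""
    m = mean(values)
    return variance(values) / (m * m)


kept = [name for name, counts in cells.items() if passes_qc(counts)]
print(kept)
print(squared_cv([10, 20, 30]))
```

Computing `squared_cv` for each spike-in across cells and plotting it against the mean gives the kind of scatter described for the upper panels; fitting the curve through those points is a separate modelling step that this sketch does not attempt.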

  3. Confounding variables and how to account for them.

    a | For each gene, the observed expression profile generated from single-cell RNA sequencing (scRNA-seq) is caused by a combination of factors. For example, if cells are being sampled randomly from a mixed population containing naive (that is, undifferentiated) cells and cells that are closer to being fully differentiated, then for each cell, the expression profile is a combination of a variety of factors (including position on the differentiation cascade, cell cycle state and apoptotic state). Factors such as the cell cycle or apoptotic state can be considered confounders that prevent the signal of interest (the differentiation state of a cell) from being uncovered. b | Confounding factors need to be identified and corrected for in downstream analyses. Latent-variable models, which are built on approaches applied in bulk RNA-seq studies to infer and correct for hidden factors that cause gene expression heterogeneity (Refs 56, 57, 59), can be used to deduce the correlation between cells due to factors such as the cell cycle or apoptotic state. Subsequently, the extent of variance in the expression of each gene across cells that is attributable to this factor (and other factors) can be inferred. Additionally, the scRNA-seq data can be corrected by using regression analyses to remove the confounding factor, thus facilitating downstream analyses such as clustering or network analyses. Figure from Ref. 61, Nature Publishing Group.
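
The regression step mentioned in part b of Figure 3 can be sketched for a single gene as follows. This is a simplified illustration, not the latent-variable model of Ref. 61: it assumes that a per-cell score for the confounder (for example, a cell-cycle score) has already been estimated, and it uses ordinary least squares with one covariate. The function name and the toy numbers are hypothetical.

```python
from statistics import mean


def regress_out(expression, factor):
    """Remove the linear effect of one confounding factor from one gene.

    Returns (corrected_expression, fraction_of_variance_explained).
    The corrected values keep the gene's original mean.
    """
    x_bar, y_bar = mean(factor), mean(expression)
    sxx = sum((x - x_bar) ** 2 for x in factor)
    syy = sum((y - y_bar) ** 2 for y in expression)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(factor, expression))

    # Least-squares slope; zero if the factor does not vary across cells.
    slope = sxy / sxx if sxx else 0.0

    # Subtract only the part of the expression predicted by the factor.
    corrected = [y - slope * (x - x_bar) for x, y in zip(factor, expression)]

    # R^2 of the simple regression: share of variance attributable to the factor.
    explained = (slope * sxy) / syy if syy else 0.0
    return corrected, explained


# Toy example: expression is driven entirely by the confounder.
factor_scores = [0.0, 1.0, 2.0, 3.0]
gene_expression = [1.0, 3.0, 5.0, 7.0]
corrected, explained = regress_out(gene_expression, factor_scores)
print(corrected, explained)
```

Applying the same function gene by gene yields both a corrected expression matrix for clustering and a per-gene estimate of how much variance the confounder accounts for.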

  4. Finding new cell types and allocating cells along a differentiation cascade.

    Unbiased clustering approaches based on principal component analysis (PCA)-like methods (Refs 62, 63, 66) can be used on a mixed population of cells, to either map them along a differentiation cascade or cluster them into new cell types (Refs 55, 58). Subsequently, the newly identified cascades or populations can be characterized, and new marker genes can be found by identifying genes or transcript isoforms that are differentially expressed between the populations.

  5. The kinetics of transcription.

    a | Single-cell RNA sequencing (scRNA-seq) can be used to study the kinetics of transcription. RNA labelling followed by pulse microscopy (left panel) can be used to track the expression of a gene over time (Ref. 82). scRNA-seq can be used to obtain an instantaneous snapshot of this distribution by measuring the expression of an individual gene across many cells (middle panel; Ref. 81). Subsequently, these data can be used to draw inferences about the kinetics of transcription. b | Allele-specific expression can be studied using scRNA-seq. Allele-specific expression can be assayed using single-nucleotide polymorphisms (SNPs) in the sequence of a transcript to allocate reads to alternative alleles. Subsequently, the number of cells in which both alleles are expressed and the numbers of cells in which allele 1 or allele 2 is exclusively expressed can be counted. This allows the identification of genes that display evidence of monoallelic expression (Ref. 4). One important challenge is to address technical issues, especially allelic dropout during sample preparation, which can bias the results.
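
The cell-counting step described in part b of Figure 5 can be sketched as below. It assumes that reads overlapping SNPs have already been assigned to allele 1 or allele 2 for one gene in each cell; the detection threshold, function names and read counts are hypothetical. The sketch makes no correction for allelic dropout, so the monoallelic counts it reports should be read as an upper bound.

```python
from collections import Counter

# Assumed minimum number of allele-specific reads to call an allele expressed.
MIN_READS = 1


def classify_cell(allele1_reads, allele2_reads, min_reads=MIN_READS):
    """Label one cell by which alleles of a gene are detected."""
    allele1_detected = allele1_reads >= min_reads
    allele2_detected = allele2_reads >= min_reads
    if allele1_detected and allele2_detected:
        return "biallelic"
    if allele1_detected:
        return "allele1_only"
    if allele2_detected:
        return "allele2_only"
    return "not_detected"


def summarize_gene(read_pairs):
    """Count cells in each category for one gene.

    read_pairs holds one (allele1_reads, allele2_reads) tuple per cell.
    """
    return Counter(classify_cell(a1, a2) for a1, a2 in read_pairs)


# Toy data: five cells, one gene.
summary = summarize_gene([(5, 3), (7, 0), (0, 4), (0, 0), (9, 0)])
print(summary)
```

Raising `min_reads` makes the calls more conservative, but it cannot separate true monoallelic expression from technical dropout of one allele; that requires a noise model such as those built from spike-ins.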

References

  1. Bernstein, B. E. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  2. Brawand, D. et al. The evolution of gene expression levels in mammalian organs. Nature 478, 343–348 (2011).
  3. Blekhman, R., Oshlack, A., Chabot, A. E., Smyth, G. K. & Gilad, Y. Gene regulation in primates evolves under tissue-specific selection pressures. PLoS Genet. 4, e1000271 (2008).
  4. Deng, Q., Ramskold, D., Reinius, B. & Sandberg, R. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 343, 193–196 (2014).
  5. Barreiro, L. B. et al. Deciphering the genetic architecture of variation in the immune response to Mycobacterium tuberculosis infection. Proc. Natl Acad. Sci. USA 109, 1204–1209 (2012).
  6. Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).
  7. Shapiro, E., Biezuner, T. & Linnarsson, S. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nature Rev. Genet. 14, 618–630 (2013).
    This is a related review discussing challenges and analysis opportunities of single-cell sequencing, for example, to reconstruct lineages in cancer.
  8. Perou, C. M. et al. Molecular portraits of human breast tumours. Nature 406, 747–752 (2000).
  9. Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M. & Gilad, Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509–1517 (2008).
  10. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods 5, 621–628 (2008).
  11. Nagalakshmi, U. et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320, 1344–1349 (2008).
  12. Perry, G. H. et al. Comparative RNA sequencing reveals substantial genetic variation in endangered primates. Genome Res. 22, 602–610 (2012).
  13. van 't Veer, L. J. et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536 (2002).
  14. Sandberg, R. Entering the era of single-cell transcriptomics in biology and medicine. Nature Methods 11, 22–24 (2014).
  15. Ohnishi, Y. et al. Cell-to-cell expression variability followed by signal reinforcement progressively segregates early mouse lineages. Nature Cell Biol. 16, 27–37 (2014).
  16. Skamagki, M., Wicher, K. B., Jedrusik, A., Ganguly, S. & Zernicka-Goetz, M. Asymmetric localization of Cdx2 mRNA during the first cell-fate decision in early mouse development. Cell Rep. 3, 442–457 (2013).
  17. Tang, F. et al. Tracing the derivation of embryonic stem cells from the inner cell mass by single-cell RNA-Seq analysis. Cell Stem Cell 6, 468–478 (2010).
  18. Diez-Roux, G. et al. A high-resolution anatomical atlas of the transcriptome in the mouse embryo. PLoS Biol. 9, e1000582 (2011).
  19. Munsky, B., Neuert, G. & van Oudenaarden, A. Using gene expression noise to understand gene regulation. Science 336, 183–187 (2012).
  20. Raj, A. & van Oudenaarden, A. Nature, nurture, or chance: stochastic gene expression and its consequences. Cell 135, 216–226 (2008).
  21. Chalfie, M., Tu, Y., Euskirchen, G., Ward, W. W. & Prasher, D. C. Green fluorescent protein as a marker for gene expression. Science 263, 802–805 (1994).
  22. Coons, A. H., Creech, H. J. & Jones, R. N. Immunological properties of an antibody containing a fluorescent group. Proc. Soc. Exp. Biol. Med. 47, 200–202 (1941).
  23. Taniguchi, K., Kajiyama, T. & Kambara, H. Quantitative analysis of gene expression in a single cell by qPCR. Nature Methods 6, 503–506 (2009).
  24. Raj, A., van den Bogaard, P., Rifkin, S. A., van Oudenaarden, A. & Tyagi, S. Imaging individual mRNA molecules using multiple singly labeled probes. Nature Methods 5, 877–879 (2008).
  25. Faddah, D. A. et al. Single-cell analysis reveals that expression of nanog is biallelic and equally variable as that of other pluripotency factors in mouse ESCs. Cell Stem Cell 13, 23–29 (2013).
  26. Tang, F. et al. mRNA-seq whole-transcriptome analysis of a single cell. Nature Methods 6, 377–382 (2009).
  27. Islam, S. et al. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res. 21, 1160–1167 (2011).
  28. Ramskold, D. et al. Full-length mRNA-seq from single-cell levels of RNA and individual circulating tumor cells. Nature Biotech. 30, 777–782 (2012).
  29. Sasagawa, Y. et al. Quartz-seq: a highly reproducible and sensitive single-cell RNA sequencing method, reveals non-genetic gene-expression heterogeneity. Genome Biol. 14, R31 (2013).
  30. Jaitin, D. A. et al. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776–779 (2014).
  31. Hashimshony, T., Wagner, F., Sher, N. & Yanai, I. CEL-seq: single-cell RNA-seq by multiplexed linear amplification. Cell Rep. 2, 666–673 (2012).
  32. Picelli, S. et al. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nature Methods 10, 1096–1098 (2013).
    Recent protocol developments, such as the development of Smart-seq2, have helped to substantially reduce biases and improved the sensitivity of scRNA-seq.
  33. Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nature Methods 10, 1093–1095 (2013).
    This paper reports a statistical approach that estimates and accounts for technical sources of variation in scRNA-seq experiments. This method exploits spike-ins to separate technical and biological variability of individual genes (see also reference 75).
  34. Wu, A. R. et al. Quantitative assessment of single-cell RNA-sequencing methods. Nature Methods 11, 41–46 (2014).
  35. Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401 (2014).
    This paper provides an example in which sequencing the transcriptomes of a large number of single cells provided important insights into intra- and inter-tumour heterogeneity.
  36. Shalek, A. K. et al. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature 509, 363–369 (2014).
  37. Wang, Z., Gerstein, M. & Snyder, M. RNA-seq: a revolutionary tool for transcriptomics. Nature Rev. Genet. 10, 57–63 (2009).
  38. Oshlack, A., Robinson, M. D. & Young, M. D. From RNA-seq reads to differential expression results. Genome Biol. 11, 220 (2010).
  39. Jiang, L. et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res. 21, 1543–1551 (2011).
  40. Islam, S. et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nature Methods 11, 163–166 (2014).
    UMIs allow individual molecules to be barcoded. This protocol enables the absolute number of transcribed molecules to be estimated independently of amplification biases.
  41. Fonseca, N. A., Rung, J., Brazma, A. & Marioni, J. C. Tools for mapping high-throughput sequencing data. Bioinformatics 28, 3169–3177 (2012).
  42. Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions with RNA-seq. Bioinformatics 25, 1105–1111 (2009).
  43. Wu, T. D. & Nacu, S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26, 873–881 (2010).
  44. Guttman, M. et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotech. 28, 503–510 (2010).
  45. Anders, S., Pyl, P. T. & Huber, W. HTseq — a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).
  46. Davis, M. P., van Dongen, S., Abreu-Goodger, C., Bartonicek, N. & Enright, A. J. Kraken: a set of tools for quality control and analysis of high-throughput sequence data. Methods 63, 41–49 (2013).
  47. Robinson, J. T. et al. Integrative genomics viewer. Nature Biotech. 29, 24–26 (2011).
  48. Thorvaldsdottir, H., Robinson, J. T. & Mesirov, J. P. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 14, 178–192 (2013).
  49. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
  50. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
    This seminal paper describes statistical methods to test for differential gene expression using RNA-seq data. Although developed in the context of RNA-seq studies on bulk cell populations, this work has laid the foundation for a large family of normalization procedures, including recent methods that are dedicated to scRNA-seq data (see reference 33).
  51. Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).
  52. Lin, C. Y. et al. Transcriptional amplification in tumor cells with elevated c-Myc. Cell 151, 56–67 (2012).
  53. Loven, J. et al. Revisiting global gene expression analysis. Cell 151, 476–482 (2012).
  54. Krebs, J. E., Goldstein, E. S. & Kilpatrick, S. T. Lewin's Genes XI (Jones & Bartlett Publishers, 2014).
  55. Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nature Methods 11, 740–742 (2014).
    This paper presents a Bayesian approach to test for differential gene expression in scRNA-seq studies. This approach extends methods for bulk RNA-seq (for example, reference 50) by accounting for single-cell-specific noise, such as dropout events and amplification biases.
  56. Pickrell, J. K. et al. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768–772 (2010).
  57. Leek, J. T. & Storey, J. D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, 1724–1735 (2007).
  58. Stegle, O., Parts, L., Durbin, R. & Winn, J. A. Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Comput. Biol. 6, e1000770 (2010).
  59. Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nature Protoc. 7, 500–507 (2012).
  60. Risso, D., Ngai, J., Speed, T. P. & Dudoit, S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nature Biotech. 32, 896–902 (2014).
  61. Buettner, F. et al. Accounting for cell-to-cell heterogeneity in single-cell RNA-seq data reveals novel structure between cells. Nature Biotech. http://dx.doi.org/10.1038/nbt.3102 (2015).
    Confounding factors such as the cell cycle can obscure biologically relevant molecular signatures in scRNA-seq data sets. This work describes a computational approach to account for confounding factors. Related methods developed for bulk RNA profiling experiments are described in references 57–60.
  62. Treutlein, B. et al. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509, 371–375 (2014).
  63. Durruthy-Durruthy, R. et al. Reconstruction of the mouse otocyst and early neuroblast lineage at single-cell resolution. Cell 157, 964–978 (2014).
  64. Moignard, V. et al. Characterization of transcriptional networks in blood stem and progenitor cells using high-throughput single-cell gene expression analysis. Nature Cell Biol. 15, 363–372 (2013).
  65. Mahata, B. et al. Single-cell RNA sequencing reveals T helper cells synthesizing steroids de novo to contribute to immune homeostasis. Cell Rep. 7, 1130–1142 (2014).
    This paper provides an example from T cell biology that shows how gene–gene correlations in scRNA-seq studies can be used to reveal novel mechanistic insights.
  66. Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature Biotech. 32, 381–386 (2014).
    This paper describes a computational approach to reconstruct a pseudotemporal order from multiple scRNA-seq snapshot experiments, for example, along a differentiation trajectory.
  67. Lee, J. H. et al. Highly multiplexed subcellular RNA sequencing in situ. Science 343, 1360–1363 (2014).
  68. Lovatt, D. et al. Transcriptome in vivo analysis (TIVA) of spatially defined single cells in live tissue. Nature Methods 11, 190–196 (2014).
  69. Pettit, J. B., Tomer, R., Achim, K., Azizi, L. & Marioni, J. C. Identifying cell types from spatially referenced single-cell expression datasets. PLoS Comput. Biol. 10, e1003824 (2014).
  70. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
  71. Hardcastle, T. J. & Kelly, K. A. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11, 422 (2010).
  72. Shalek, A. K. et al. Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature 498, 236–240 (2013).
  73. Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from RNA-seq data. Genome Res. 22, 2008–2017 (2012).
  74. Katz, Y., Wang, E. T., Airoldi, E. M. & Burge, C. B. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nature Methods 7, 1009–1015 (2010).
  75. Grun, D., Kester, L. & van Oudenaarden, A. Validation of noise models for single-cell transcriptomics. Nature Methods 11, 637–640 (2014).
  76. Friedman, J., Hastie, T. & Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441 (2008).
  77. Segal, E. et al. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genet. 34, 166–176 (2003).
  78. Liao, J. C. et al. Network component analysis: reconstruction of regulatory signals in biological systems. Proc. Natl Acad. Sci. USA 100, 15522–15527 (2003).
  79. Bansal, M., Belcastro, V., Ambesi-Impiombato, A. & di Bernardo, D. How to infer gene networks from expression profiles. Mol. Syst. Biol. 3, 78 (2007).
  80. Pe'er, D., Regev, A., Elidan, G. & Friedman, N. Inferring subnetworks from perturbed expression profiles. Bioinformatics 17, S215–S224 (2001).
  81. Kim, J. K. & Marioni, J. C. Inferring the kinetics of stochastic gene expression from single-cell RNA-sequencing data. Genome Biol. 14, R7 (2013).
  82. Raj, A., Peskin, C. S., Tranchina, D., Vargas, D. Y. & Tyagi, S. Stochastic mRNA synthesis in mammalian cells. PLoS Biol. 4, e309 (2006).
  83. Kaern, M., Elston, T. C., Blake, W. J. & Collins, J. J. Stochasticity in gene expression: from theories to phenotypes. Nature Rev. Genet. 6, 451–464 (2005).
  84. Larson, D. R. What do expression dynamics tell us about the mechanism of transcription? Curr. Opin. Genet. Dev. 21, 591–599 (2011).
  85. Schwanhausser, B. et al. Global quantification of mammalian gene expression control. Nature 473, 337–342 (2011).
  86. McManus, C. J. et al. Regulatory divergence in Drosophila revealed by mRNA-seq. Genome Res. 20, 816–825 (2010).

Author information

Affiliations

  1. European Molecular Biology Laboratory European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

    • Oliver Stegle,
    • Sarah A. Teichmann &
    • John C. Marioni
  2. Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    • Sarah A. Teichmann &
    • John C. Marioni

Competing interests statement

The authors declare no competing interests.

Author details

  • Oliver Stegle

    Oliver Stegle is a group leader at the European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI) in Cambridge, UK. His group develops statistical methods to analyse high-dimensional molecular traits in different contexts. He received his Ph.D. from the University of Cambridge, UK, in physics in 2009, working with David MacKay. After a period as a postdoctoral researcher at the Max Planck Campus in Tübingen, Germany, he moved to the EMBL-EBI in 2012 to establish his own research group.

  • Sarah A. Teichmann

    Sarah A. Teichmann is a group leader at the European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI) and Wellcome Trust Sanger Institute in Cambridge, UK. Her group studies global gene regulation with a focus on CD4+ T cells, and the assembly and evolution of protein complexes. She received her Ph.D. in computational genomics from the Medical Research Council (MRC) Laboratory of Molecular Biology, Cambridge, UK, where she worked with Cyrus Chothia, and was a Beit Memorial Fellow at University College London, UK, with Janet Thornton. From 2001 to 2012 she was an MRC programme leader at the MRC Laboratory of Molecular Biology. She moved to the EMBL-EBI and Wellcome Trust Sanger Institute in 2013.

  • John C. Marioni

    John C. Marioni is a group leader at the European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI) and an associate faculty member at the Wellcome Trust Sanger Institute in Cambridge, UK. His group develops computational methods for understanding the regulation of gene expression in the context of evolution and development, with a particular focus on investigating variability in expression (and other molecular traits) between individual cells. He received his Ph.D. from the University of Cambridge, UK, in applied mathematics, working with Simon Tavaré. After a period of postdoctoral research at the University of Chicago, Illinois, USA, under the supervision of Matthew Stephens, he moved to the EMBL-EBI in 2010 to establish his own research group.
