The development of an algorithm called RaceID enables the identification of rare cell types by single-cell RNA sequencing, even when they are part of a complex mixture of similar cells. See Letter p.251
How many cell types are there in the human body? Thanks to progress in single-cell sequencing technologies, scientists are now addressing this question in a systematic and non-biased way. On page 251 of this issue, Grün et al.1 take this research forward another step, describing an algorithm called RaceID that can identify rare cell types in a complex mixture of cells.
The transcriptome is the complete collection of RNA molecules present in a cell. Standard approaches to sequencing these assemblages provide an average view of the transcriptome across many cells, and so cannot provide information about differences between individual cells (heterogeneity), or about the characteristics of rare cell types within a heterogeneous population. Such analyses require single-cell transcriptome-sequencing technologies2,3,4, and in the past few years it has become possible to acquire transcriptome data for hundreds and even thousands of single cells5,6. Questions have been raised, however, about how reliably information can be mined from these huge data sets, particularly given that they produce considerable technical noise7 owing to inaccuracies in the techniques used.
The epithelial cells that line the intestine absorb nutrients and defend the body against microorganisms. The epithelium contains six mature cell types, which are continually renewed by a small population of adult stem cells8. This cell layer is one of the best-studied models of self-renewal and differentiation in adult stem cells, and many markers of distinct epithelial cell types have been characterized. This makes it an invaluable system for developing techniques and algorithms for single-cell analysis.
Grün et al. used a single-cell transcriptome-sequencing technique to analyse 238 epithelial cells obtained from mouse intestinal organoids — 'mini guts' grown in vitro from a single stem cell that contain every cell lineage of the intestinal epithelium. Using standard clustering algorithms, the authors distinguished three major cell populations (a rapidly dividing precursor population called transit amplifying cells, absorptive cells called enterocytes and secretory cells). One algorithm, K-means clustering, could distinguish several subgroups within the abundant enterocyte cell population, including early and late progenitors and mature cells. However, none of the algorithms could distinguish subgroups within the rare secretory-cell lineage, which was represented by only 20 cells in the sample.
The secretory-cell lineage contains at least three cell types, one of which — the hormone-producing enteroendocrine cells — can be further divided into more than ten different subtypes according to the hormones that they secrete9. Enteroendocrine cells have key roles in maintaining gut homeostasis, and so an ability to distinguish the different subgroups is desirable. But because of the similarity of their transcriptomes, the subgroups could not be discriminated by standard algorithms in the authors' initial analyses.
To get around this limitation, Grün et al. developed RaceID, a simple and clever algorithm that clearly distinguishes different secretory cell types. The algorithm assumes that a given cell type is likely to strongly express a certain number of cell-type-specific 'outlier' genes. Such genes can be identified if care is taken to exclude technical and biological noise (biological noise arises from differences in transcript expression between individual cells of the same type). RaceID identifies outlier cells in each cluster after a K-means clustering step. An outlier cell is defined as expressing a certain number of outlier genes at levels significantly exceeding the modelled noise. In this way, identification of a cell type will not depend on global cell–cell differences, as in standard clustering algorithms, but on only a few genes (Fig. 1).
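The outlier-cell idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: a simple Poisson distribution stands in here for the modelled noise, and the function names, the cutoff values (`p_cutoff`, `min_outlier_genes`) and the toy counts are all assumptions made for illustration. The input is taken to be the molecule counts for the cells of one cluster, as produced by an earlier K-means step.

```python
import math


def poisson_upper_tail(k, mean):
    """Return P(X >= k) for X ~ Poisson(mean)."""
    if k <= 0:
        return 1.0
    term = math.exp(-mean)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= mean / i  # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def find_outlier_cells(cluster, p_cutoff=1e-3, min_outlier_genes=2):
    """Flag cells that express several genes far above the cluster's noise level.

    cluster: {cell_id: {gene: molecule count}} for the cells of ONE cluster.
    Returns {cell_id: [outlier genes]} for the flagged cells only.
    Assumption: Poisson noise around the cluster mean of each gene.
    """
    genes = sorted({g for counts in cluster.values() for g in counts})
    n_cells = len(cluster)
    mean = {
        g: sum(counts.get(g, 0) for counts in cluster.values()) / n_cells
        for g in genes
    }
    outliers = {}
    for cell_id, counts in cluster.items():
        # A gene is an "outlier gene" in this cell if its count is very
        # unlikely under the noise model for the cluster.
        hits = [
            g for g in genes
            if poisson_upper_tail(counts.get(g, 0), mean[g]) < p_cutoff
        ]
        if len(hits) >= min_outlier_genes:
            outliers[cell_id] = hits
    return outliers


# Toy cluster (invented data): nine similar cells and one cell that
# strongly expresses two genes the others lack.
cluster = {
    f"cell_{i}": {"gene_A": 5, "gene_B": 0, "gene_C": 0} for i in range(1, 10)
}
cluster["cell_10"] = {"gene_A": 4, "gene_B": 30, "gene_C": 25}

outlier_cells = find_outlier_cells(cluster)
print(outlier_cells)
```

In this toy cluster only `cell_10` is flagged, on the strength of two genes rather than its overall profile, which is the point of the approach: a rare cell can sit inside a cluster of globally similar cells and still be picked out by a handful of strongly expressed genes.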
During single-cell sequencing, each RNA transcript must be amplified many times to provide enough material for accurate sequencing. But the amplification step can introduce technical noise, because small errors in measuring the number of transcripts produced from each gene in a cell are magnified during replication. The authors exclude this noise using a previously reported technique10 to add a unique molecular 'barcode' to each individual transcript before amplification. This enables the RaceID algorithm to determine whether high levels of gene expression are real or an artefact of amplification. Grün and colleagues demonstrated the effectiveness of this strategy using a pool-and-split experiment. They pooled transcripts from 93 cells and split the RNA into 93 equal samples, creating 93 'average' single cells, then amplified and sequenced each sample separately. No false-positive rare cell types were detected.
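How molecular barcodes remove amplification noise can be shown with a short sketch. It assumes that each sequenced read has already been assigned a cell, a gene and a barcode; the function name `count_molecules` and the toy reads are hypothetical. Real pipelines also have to handle sequencing errors in the barcodes and chance barcode collisions, which this sketch ignores.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count distinct barcodes per gene per cell.

    reads: iterable of (cell_id, gene, barcode) tuples, one per sequenced read.
    Returns {cell_id: {gene: number of distinct barcodes}}.
    Reads sharing a cell, gene and barcode are treated as amplified copies
    of one original transcript and are counted once.
    """
    barcodes_seen = defaultdict(set)
    for cell_id, gene, barcode in reads:
        barcodes_seen[(cell_id, gene)].add(barcode)

    counts = defaultdict(dict)
    for (cell_id, gene), barcodes in barcodes_seen.items():
        counts[cell_id][gene] = len(barcodes)
    return dict(counts)


# Toy reads (invented): one Reg4 transcript in cell_1 was amplified three
# times, so four Reg4 reads correspond to only two original molecules.
reads = [
    ("cell_1", "Reg4", "AACG"),
    ("cell_1", "Reg4", "AACG"),
    ("cell_1", "Reg4", "AACG"),
    ("cell_1", "Reg4", "TTGA"),
    ("cell_1", "Lgr5", "CCAT"),
    ("cell_2", "Reg4", "GGTA"),
]

molecule_counts = count_molecules(reads)
print(molecule_counts)
```

Counting barcodes rather than reads means that a transcript copied unevenly during amplification still contributes exactly one molecule to the expression estimate.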
RaceID identified the gene Reg4 as being highly expressed specifically in enteroendocrine cells. Grün et al. isolated and sequenced a population of 161 Reg4-expressing cells. Using RaceID, they identified new enteroendocrine subtypes and validated them in vivo at the level of both RNA and protein. This confirmed that RaceID can be used for the identification of rare cell types.
There has been much debate about whether the intestinal stem-cell population is heterogeneous. Can RaceID find subtypes in this population, which is marked by expression of the gene Lgr5? Grün and colleagues sequenced transcriptomes from 288 Lgr5-expressing cells. RaceID identified these cells as largely homogeneous — the stem-cell population — mixed with a population of rare Lgr5-expressing secretory cells. However, as the authors note, it remains possible that the stem-cell population is heterogeneous, but that differences are below a level detectable even by RaceID.
The major limiting factor for RaceID is the accuracy of single-cell sequencing. It is still not possible to measure low-level gene expression accurately in a single cell, and the technical noise for detection of such genes will be too high to identify outliers. The genes for the transcription factors that determine a given cell type, for instance, are generally not expressed as highly as those encoding hormones. This might prevent RaceID from discerning potentially functionally important rare cell types in which the differentially expressed genes mainly encode transcription factors. It may also explain why Grün et al. were unable to detect stem cells in the initial organoid analysis: these cells express Lgr5 at low levels.
The potential for falsely 'identifying' new rare cell types should also be considered. Care must be taken to avoid nucleic-acid cross-contamination or incomplete cell dissociation. It will be necessary to validate putative cell types at the RNA and even protein level.
In terms of sensitivity, accuracy and comprehensiveness, current single-cell sequencing techniques and bioinformatics tools are far from perfect. This is particularly true when it comes to discovering rare cell types. But through the unremitting efforts of Grün et al. and others, in the near future we may be able to chart a complete cell-lineage map of the human body.
Grün, D. et al. Nature 525, 251–255 (2015).
Tang, F., Lao, K. & Surani, M. A. Nature Methods 8, S6–S11 (2011).
Eberwine, J., Sul, J. Y., Bartfai, T. & Kim, J. Nature Methods 11, 25–27 (2014).
Treutlein, B. et al. Nature 509, 371–375 (2014).
Klein, A. M. et al. Cell 161, 1187–1201 (2015).
Macosko, E. Z. et al. Cell 161, 1202–1214 (2015).
Stegle, O., Teichmann, S. A. & Marioni, J. C. Nature Rev. Genet. 16, 133–145 (2015).
Clevers, H. Cell 154, 274–284 (2013).
May, C. L. & Kaestner, K. H. Mol. Cell. Endocrinol. 323, 70–75 (2010).
Jaitin, D. A. et al. Science 343, 776–779 (2014).