Abstract
Understanding the development and function of an organ requires the characterization of all of its cell types. Traditional methods for visualizing and isolating subpopulations of cells are based on messenger RNA or protein expression of only a few known marker genes. The unequivocal identification of a specific marker gene, however, poses a major challenge, particularly if this cell type is rare. Identifying rare cell types, such as stem cells, short-lived progenitors, cancer stem cells, or circulating tumour cells, is crucial to acquire a better understanding of normal or diseased tissue biology. To address this challenge we first sequenced the transcriptome of hundreds of randomly selected cells from mouse intestinal organoids1, cultured self-organizing epithelial structures that contain all cell lineages of the mammalian intestine. Organoid buds, like intestinal crypts, harbour stem cells that continuously differentiate into a variety of cell types, occurring at widely different abundances2. Since available computational methods can only resolve more abundant cell types, we developed RaceID, an algorithm for rare cell type identification in complex populations of single cells. We demonstrate that this algorithm can resolve cell types represented by only a single cell in a population of randomly sampled organoid cells. We use this algorithm to identify Reg4 as a novel marker for enteroendocrine cells, a rare population of hormone-producing intestinal cells3. Next, we use Reg4 expression to enrich for these rare cells and investigate the heterogeneity within this population. RaceID confirmed the existence of known enteroendocrine lineages, and moreover discovered novel subtypes, which we subsequently validated in vivo. Having validated RaceID we then applied the algorithm to ex vivo-isolated Lgr5-positive stem cells and their direct progeny. 
We find that Lgr5-positive cells represent a homogeneous, abundant population of stem cells mixed with a rare population of Lgr5-positive secretory cells. We envision broad applicability of our method for discovering rare cell types and the corresponding marker genes in healthy and diseased organs.
References
Sato, T. et al. Single Lgr5 stem cells build crypt-villus structures in vitro without a mesenchymal niche. Nature 459, 262–265 (2009)
van der Flier, L. G. & Clevers, H. Stem cells, self-renewal, and differentiation in the intestinal epithelium. Annu. Rev. Physiol. 71, 241–260 (2009)
Engelstoft, M. S., Egerod, K. L., Lund, M. L. & Schwartz, T. W. Enteroendocrine cell types revisited. Curr. Opin. Pharmacol. 13, 912–921 (2013)
Shapiro, E., Biezuner, T. & Linnarsson, S. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nature Rev. Genet. 14, 618–630 (2013)
Jaitin, D. A. et al. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776–779 (2014)
Treutlein, B. et al. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509, 371–375 (2014)
Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401 (2014)
Shalek, A. K. et al. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature 510, 363–369 (2014)
Pollen, A. A. et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nature Biotechnol. 32, 1053–1058 (2014)
Xue, Z. et al. Genetic programs in human and mouse early embryos revealed by single-cell RNA sequencing. Nature 500, 593–597 (2013)
Clevers, H. The intestinal crypt, a prototype stem cell compartment. Cell 154, 274–284 (2013)
Barker, N. Adult intestinal stem cells: critical drivers of epithelial homeostasis and regeneration. Nature Rev. Mol. Cell Biol. 15, 19–33 (2014)
Hashimshony, T., Wagner, F., Sher, N. & Yanai, I. CEL-Seq: single-cell RNA-Seq by multiplexed linear amplification. Cell Rep. 2, 666–673 (2012)
Grün, D., Kester, L. & van Oudenaarden, A. Validation of noise models for single-cell transcriptomics. Nature Methods 11, 637–640 (2014)
Tibshirani, R., Walther, G. & Hastie, T. Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. B. 63, 411–423 (2001)
Van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
Schonhoff, S. E., Giel-Moloney, M. & Leiter, A. B. Minireview: Development and differentiation of gut endocrine cells. Endocrinology 145, 2639–2644 (2004)
Habib, A. M. et al. Overlap of endocrine hormone expression in the mouse intestine revealed by transcriptional profiling and flow cytometry. Endocrinology 153, 3054–3065 (2012)
Egerod, K. L. et al. A major lineage of enteroendocrine cells coexpress CCK, secretin, GIP, GLP-1, PYY, and neurotensin but not somatostatin. Endocrinology 153, 5782–5795 (2012)
Raj, A., Van Den Bogaard, P., Rifkin, S. A., Van Oudenaarden, A. & Tyagi, S. Imaging individual mRNA molecules using multiple singly labeled probes. Nature Methods 5, 877–879 (2008)
Saruta, M. et al. Urocortin 3/stresscopin in human colon: possible modulators of gastrointestinal function during stressful conditions. Peptides 26, 1196–1206 (2005)
Baker, M. E. Albumin, steroid hormones and the origin of vertebrates. J. Endocrinol. 175, 121–127 (2002)
Tian, H. et al. A reserve stem cell population in small intestine renders Lgr5-positive cells dispensable. Nature 478, 255–259 (2011)
Barker, N. et al. Identification of stem cells in small intestine and colon by marker gene Lgr5. Nature 449, 1003–1007 (2007)
Roth, S. et al. Paneth cells in intestinal homeostasis and tissue injury. PLoS ONE 7, e38965 (2012)
Li, N. et al. Single-cell analysis of proxy reporter allele-marked epithelial cells establishes intestinal stem cell hierarchy. Stem Cell Rep. 3, 876–891 (2014)
Buczacki, S. J. A. et al. Intestinal label-retaining cells are secretory precursors expressing Lgr5. Nature 495, 65–69 (2013)
Yin, X. et al. Niche-independent high-purity cultures of Lgr5+ intestinal stem cells and their progeny. Nature Methods 11, 106–112 (2014)
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010)
Meyer, L. R. et al. The UCSC Genome Browser database: extensions and updates 2013. Nucleic Acids Res. 41, D64–D69 (2013)
The External RNA Controls Consortium: a progress report. Nature Methods 2, 731–734 (2005)
Acknowledgements
This work was supported by a European Research Council Advanced grant (ERC-AdG 294325-GeneNoiseControl) and a Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO) Vici award.
Author information
Contributions
D.G., A.L. and A.v.O. conceived the project. D.G. developed the algorithm, performed all computations and wrote the manuscript. A.L., L.K. and K.W. performed all sequencing experiments. A.L. performed the lineage tracing experiment and all imaging experiments. N.S. made the Reg4–dsRed mouse and was supervised by H.C.; O.B. contributed the Lgr5–GFP organoids. A.L., L.K., N.S. and H.C. edited the manuscript. A.v.O. guided experiments, data analysis and writing of the manuscript, and edited the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Extended data figures and tables
Extended Data Figure 1 Sequencing statistics.
a, Histogram of the number of sequenced transcripts per cell. The median (red line) is 8,559. The 288 cells were sequenced on two lanes to a total depth of 106,950,038 reads. Of those, 32,694,069 (31%) were mapped with a valid cell-specific barcode. b, Histogram of the total number of reads per cell divided by the total number of sequenced transcripts as counted with unique molecular identifiers. The average level of oversequencing across all genes is 6.9-fold (red line).
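The two summary numbers in this caption are simple ratios. A minimal Python sketch, assuming per-cell read and transcript totals are available as plain lists; the function name and the toy per-cell values are hypothetical, and only the two read totals come from the caption:

```python
def oversequencing_per_cell(reads, transcripts):
    """Reads per UMI-counted transcript for each cell; cells without transcripts are skipped."""
    return [r / t for r, t in zip(reads, transcripts) if t > 0]

# Read totals quoted in the caption.
total_reads = 106_950_038
mapped_reads = 32_694_069
mapped_percent = 100 * mapped_reads / total_reads  # roughly 31%

# Hypothetical toy values for three cells, for illustration only.
ratios = oversequencing_per_cell([6900, 13800, 500], [1000, 2000, 0])
```

The third toy cell has no counted transcripts, so it contributes no ratio.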
Extended Data Figure 2 Clustering reveals major transcriptome differences between intestinal cells.
a, Dendrogram obtained by hierarchical clustering, using the Euclidean distance metric, of the transcriptome correlation matrix of the 238 intestinal cells that survived all filtering steps. At least three distinct groups of cells can be recognized. b, Gap statistic of k-means clustering of the correlation matrix as a function of the cluster number. The gap statistic reflects the difference between the average within-cluster distance of points in uniformly distributed data and in the actual data. The first local maximum provides a good estimate for the number of clusters that achieves optimal separation of the data into clusters15. Data points and error bars represent the mean and standard deviation across 50 bootstrap samples. For the intestinal cells, six clusters were predicted on the basis of the gap statistic.
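The gap-statistic criterion described in b can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' implementation: it clusters plain coordinate tuples rather than rows of a correlation matrix, uses a deterministic farthest-point initialisation for k-means, and draws reference data uniformly from the bounding box of the input. All function names are hypothetical.

```python
import math
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=100):
    # Farthest-point initialisation (an assumption, chosen for reproducibility).
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centres[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centres[j] = tuple(sum(x) / len(members) for x in zip(*members))
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centres

def log_dispersion(points, k):
    """Log of the summed squared distance of each point to its cluster centre."""
    labels, centres = kmeans(points, k)
    w = sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels))
    return math.log(max(w, 1e-12))  # guard against log(0)

def gap_statistic(points, k_max, n_ref=10, seed=0):
    """Gap(k) = mean log dispersion of uniform reference data minus that of the data."""
    rng = random.Random(seed)
    lo = [min(c) for c in zip(*points)]
    hi = [max(c) for c in zip(*points)]
    gaps = []
    for k in range(1, k_max + 1):
        ref = []
        for _ in range(n_ref):
            sample = [tuple(rng.uniform(a, b) for a, b in zip(lo, hi)) for _p in points]
            ref.append(log_dispersion(sample, k))
        gaps.append(sum(ref) / n_ref - log_dispersion(points, k))
    return gaps

def first_local_maximum(gaps):
    """Cluster number (1-based) at the first local maximum of the gap curve."""
    for i in range(len(gaps) - 1):
        if gaps[i] > gaps[i + 1]:
            return i + 1
    return len(gaps)

# Toy one-dimensional data with two well-separated groups (illustration only).
toy = [(0.1 * i,) for i in range(10)] + [(10 + 0.1 * i,) for i in range(10)]
gaps = gap_statistic(toy, k_max=3)
```

For the toy data the gap at two clusters clearly exceeds the gap at one cluster; `first_local_maximum(gaps)` then applies the selection rule stated in the caption.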
Extended Data Figure 3 Marker gene expression reveals intestinal cell types.
a–f, In the t-SNE maps the transcript count of known marker genes is colour-coded. Shown are maps for the enterocyte specific gene Alpi (a), the enteroendocrine marker Chga (b), the goblet cell marker Muc2 (c), the Paneth cell marker Lyz1 (d), for transcript counts aggregated across all ribosomal genes (e) and for the stem cell marker Lgr5 (f). The latter two are upregulated in cells of cluster 2, for which no other specific markers could be identified. These cells most likely correspond to transit-amplifying cells. The inset shown in b depicts transcript counts of Chga on a logarithmic scale, since the dynamic range of this gene was very large. Dim, dimension.
Extended Data Figure 4 Identification of rare cell types with RaceID.
a, Variance of transcript count across the entire ensemble of sequenced intestinal cells as a function of mean expression. The red line represents a second order polynomial (upper left corner) that was fitted to the data. Assuming that a large number of genes follow a similar transcript count distribution across different cell types, this function can be used to estimate the parameters for a negative binomial that represents a background model for the expected transcript count variability at a given mean expression. The probability of observing a given transcript count in a particular cell of a cluster can be computed using this distribution with the average expression within this cluster as input. If the expression of at least two genes has a probability <10⁻⁴ after multiple testing correction, a cell is considered an outlier. b, Number of outlier cells as a function of the probability threshold. The threshold used in this study (10⁻⁴) is indicated (red broken line). c, d, RaceID of pool-and-split controls reflects a low false-positive rate (see Supplementary Note). c, t-SNE map of 93 pool-and-split controls. RaceID identifies only a single large cluster with no outliers. d, Outlier probability for all pool-and-split controls. The p-value of all cells is higher than the default threshold for outlier identification (10⁻⁴). e, RaceID on a defined mixture of cells demonstrates high specificity (see Supplementary Note). RaceID clusters for a mixture of cells comprising 75 random organoid cells, 5 mouse embryonic stem cells (mESCs) and 5 mouse embryonic fibroblasts (MEFs). Two out of five MEFs did not pass the filtering criteria (>3,000 transcripts per cell). EC, enterocytes; ECpr, enterocyte precursors; TA, transit amplifying cells; EE, enteroendocrine cells; EEpr, enteroendocrine precursors; P, Paneth cells. Dim, dimension.
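The outlier test described in a can be illustrated with a short Python sketch. Several choices here are assumptions rather than details given in the caption: the variance model is a quadratic in the mean on the linear scale with hypothetical coefficients, the probability is taken as the upper tail P(X ≥ count), and a Bonferroni correction stands in for the unspecified multiple testing correction. The function names are hypothetical.

```python
import math

def variance_model(mu, coeffs):
    """Second order polynomial var = a + b*mu + c*mu^2 (coefficients are hypothetical)."""
    a, b, c = coeffs
    return a + b * mu + c * mu * mu

def nb_pmf(k, mu, var):
    """Negative binomial probability of count k, parameterised by mean and variance."""
    if var <= mu:  # no overdispersion: fall back to a Poisson distribution
        return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))
    r = mu * mu / (var - mu)  # size parameter, from var = mu + mu^2 / r
    log_p = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
             + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))
    return math.exp(log_p)

def upper_tail(k, mu, var):
    """P(X >= k) under the background model (upper tail only, an assumption)."""
    return max(0.0, 1.0 - sum(nb_pmf(i, mu, var) for i in range(k)))

def is_outlier(counts, cluster_means, coeffs, threshold=1e-4, min_genes=2):
    """Flag a cell if at least min_genes genes fall below the corrected threshold."""
    n_genes = len(counts)
    hits = 0
    for k, mu in zip(counts, cluster_means):
        if mu <= 0:
            continue  # the model is undefined for a non-positive mean
        p = upper_tail(k, mu, variance_model(mu, coeffs))
        if min(1.0, p * n_genes) < threshold:  # Bonferroni, used here for simplicity
            hits += 1
    return hits >= min_genes

# Hypothetical coefficients and cluster means, for illustration only.
coeffs = (0.0, 1.0, 0.2)
means = [5.0, 5.0, 5.0]
typical_cell = is_outlier([5, 5, 5], means, coeffs)
unusual_cell = is_outlier([200, 200, 5], means, coeffs)
```

A cell with a single extreme gene is not flagged in this sketch, which mirrors the requirement of at least two genes below the threshold.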
Extended Data Figure 5 Benchmarking RaceID.
a, b, To benchmark the RaceID algorithm we compared it to a previously published method developed to distinguish cell types from the spleen5 in thousands of sequenced single cells. We implemented the method according to the specification provided in the original paper. A shortcoming of the method is that it has to be initialized with an expected number K of cell types. Running the algorithm with K = 6 (a) yields results very similar to k-means clustering with K = 6 (Fig. 1c). However, when the algorithm is run with a larger number of cell types, e.g. K = 20 (b), rare cells of the secretory lineage still cannot be separated, while clusters corresponding to relatively uniform cell types fall apart. We conclude that this algorithm performs well for more abundant cell types but cannot identify rare cell types. c, Principal component analysis (PCA) of the transcriptome similarities. The cell types identified by RaceID are highlighted. The first two principal components can only classify major groups of enterocytes, transit amplifying cells, and secretory cells. d, To demonstrate that our method is not affected by technical noise due to varying detection efficiency across individual cells, we downsampled the transcriptome of all cells with >3,000 transcripts to the same size, given by the minimum total transcript counts across all cells that passed the filtering, and repeated the outlier identification. The t-SNE map shows all cell types identified by this strategy and the results are highly consistent with the normalization-based approach. e, The t-SNE map shows the results of RaceID run with relaxed filtering constraints (>1,000 transcripts per cell and only genes with more than three transcripts in at least one cell) as used for the Reg4-positive organoid cells. All the rare secretory cell types identified with the original settings were recovered. The different stages of enterocyte differentiation are also apparent.
EC, enterocytes; EClpr, late enterocyte precursors; ECepr, early enterocyte precursors; TA, transit amplifying cells; G, goblet cells; EE, enteroendocrine cells; EElpr, late enteroendocrine precursors; EEepr, early enteroendocrine precursors; P, Paneth cells; T, tuft cells. f, Same as Extended Data Fig. 4a, but cell cycle related genes are highlighted as blue circles. This set of genes comprises all genes containing “cell cycle” within their associated “biological process” Gene Ontology terms. Cell cycle related genes do not show increased variability and are thus unlikely to lead to false positives in the outlier detection by RaceID. Dim, dimension.
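The downsampling control in d can be sketched as follows: cells above the transcript threshold are kept, and each is subsampled without replacement to the smallest total among the kept cells. This is a minimal Python illustration, assuming counts are stored as a dictionary per cell; the function names and data layout are hypothetical.

```python
import random
from collections import Counter

def downsample_cell(counts, target, rng):
    """Draw `target` transcripts without replacement from one cell's counts."""
    pool = [gene for gene, n in counts.items() for _ in range(n)]
    return Counter(rng.sample(pool, target))

def downsample_all(cells, min_total=3000, seed=0):
    """Keep cells with more than min_total transcripts and equalise their totals."""
    rng = random.Random(seed)
    kept = {cid: c for cid, c in cells.items() if sum(c.values()) > min_total}
    if not kept:
        return {}
    target = min(sum(c.values()) for c in kept.values())
    return {cid: downsample_cell(c, target, rng) for cid, c in kept.items()}

# Toy counts with a deliberately tiny threshold, for illustration only.
toy_cells = {
    "cell_a": {"gene1": 4, "gene2": 2},
    "cell_b": {"gene1": 1, "gene2": 3},
    "cell_c": {"gene1": 1},
}
equalised = downsample_all(toy_cells, min_total=3)
```

After this step every retained cell carries the same number of transcripts, so differences in detection efficiency between cells no longer affect the comparison.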
Extended Data Figure 6 Purification of Reg4-positive cells from reporter mouse organoids.
In total, 288 cells were sequenced. Ninety-six and 192 cells were analysed from two separate sorting experiments. a, Single small intestine cells derived from the wild-type (WT, upper panel) and Reg4–dsRed (lower panel) mice were sorted by FACS. Live cells were gated as DsRed-positive (lower panel, gate denoted by black rectangle, DsRed-High). SSC-W, side scatter width; DsRed-H, DsRed height. A median of 2,813 transcripts per cell was quantified. After filtering out cells with <1,000 transcripts and genes with <3 transcripts in all cells or >2,000 transcripts in a single cell, 161 cells remained for analysis. b, c, In the t-SNE maps of Reg4-positive cells the transcript count of dsRed, driven by a Reg4 promoter (b) and endogenous Reg4 (c) are colour-coded on a logarithmic (log2) scale. Assuming a previously estimated sensitivity of our sequencing protocol14, we measure ∼10% of all expressed transcripts. Reporter expression is about eightfold reduced in comparison to endogenous Reg4, but expression of both the reporter gene and Reg4 is strongest in the Tac1/Tph1 expressing enteroendocrine cells, while expression in Paneth-like cells is reduced. Notably, expression of Reg4 and the reporter gene is also reduced in the Cck-positive enteroendocrine cells, similar to Chga. d, The t-SNE map shows the results of RaceID with more stringent filtering constraints (>3,000 transcripts per cell and only genes with a minimum of five transcripts in at least one cell) as used for the random organoid cells. Although the total number of cells is reduced to 135, most subtypes and rare cells identified with the relaxed settings are still observed, including the Afp and Alb expressing sub-types, the Ucn3-positive cells, the Cck-positive cells, the contamination by enterocytes and transit amplifying cells as well as the different Paneth cell states. e–h, Marker gene expression reveals intestinal cell types among Reg4-positive cells.
In the t-SNE maps of Reg4-positive cells the transcript count of known marker genes is colour-coded on a logarithmic (log2) scale. Shown are maps for transcript counts aggregated across all defensin genes, which are highly expressed in Paneth cells (e), for transcript counts aggregated across all ribosomal genes (f), for the enterocyte marker Apoa1 (g) and for the enteroendocrine marker Chga (h). Dim, dimension.
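The cell and gene filters quoted in this caption can be written as a short Python sketch. It assumes raw counts stored as one dictionary per cell and applies the gene filter to the cells that survive the cell filter; the order of these steps, the function name and the data layout are assumptions made for illustration.

```python
def filter_counts(cells, min_total=1000, min_expr=3, max_expr=2000):
    """Drop cells below min_total transcripts, then drop genes that never reach
    min_expr in any kept cell or exceed max_expr in at least one kept cell."""
    kept_cells = {cid: c for cid, c in cells.items() if sum(c.values()) >= min_total}
    genes = {g for c in kept_cells.values() for g in c}
    kept_genes = set()
    for g in genes:
        top = max(c.get(g, 0) for c in kept_cells.values())
        if min_expr <= top <= max_expr:
            kept_genes.add(g)
    return {cid: {g: n for g, n in c.items() if g in kept_genes}
            for cid, c in kept_cells.items()}

# Toy counts with deliberately small thresholds, for illustration only.
toy_cells = {
    "cell_a": {"gene1": 5, "gene2": 1, "gene3": 40},
    "cell_b": {"gene1": 2, "gene2": 2, "gene3": 6},
    "cell_c": {"gene1": 1},
}
filtered = filter_counts(toy_cells, min_total=10, min_expr=3, max_expr=20)
```

With the defaults the sketch corresponds to the relaxed settings of this caption; the more stringent settings in d would be `min_total=3001, min_expr=5` under the same assumptions.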
Extended Data Figure 7 Expression of intestinal hormones in Reg4-positive cells.
In the t-SNE maps of Reg4-positive cells the transcript count of hormone encoding genes is colour-coded on a logarithmic (log2) scale. a–f, Shown are maps for cholecystokinin (Cck) (a), ghrelin (Ghrl) (b), secretin (Sct) (c), neurotensin (Nts) (d), proglucagon (Gcg) (e), and somatostatin (Sst) (f). Dim, dimension.
Extended Data Figure 8 Heterogeneity of substance P producing enteroendocrine cells.
In the t-SNE maps of Reg4-positive cells the transcript count of genes specifically expressed in subtypes of enteroendocrine cells is colour-coded on a logarithmic (log2) scale. Shown are maps for tachykinin (Tac1), which encodes substance P (a), tryptophan hydroxylase 1 (Tph1) (b), urocortin 3 (Ucn3) (c), alpha-fetoprotein (Afp) (d), and VGF nerve growth factor inducible (Vgf) (e). f, The heat map shows the average expression of inferred marker genes for the enterochromaffin subtypes (clusters 2, 3 and 7). To show all genes on the same scale, the sum of average expression levels in each of the three clusters was normalized to one. Expression levels are shown for six cluster 2 markers and for six cluster 7 markers. Cluster 3 is distinguished by the downregulation of both sets. Cluster 5 (not shown here) does not have specific markers and differs from the other clusters by lower expression of mature enterochromaffin markers (Chga, Chgb, Tac1, Tph1). This cluster likely comprises non-mature enterochromaffin cells. Dim, dimension.
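The normalization used for the heat map in f can be read as rescaling each gene so that its average expression values across the three clusters sum to one. A minimal Python sketch of that reading, with a hypothetical function name and made-up values:

```python
def normalise_across_clusters(avg_expr):
    """Rescale each gene's cluster averages so that they sum to one."""
    out = {}
    for gene, by_cluster in avg_expr.items():
        total = sum(by_cluster.values())
        out[gene] = {c: (v / total if total > 0 else 0.0) for c, v in by_cluster.items()}
    return out

# Hypothetical average expression of one gene in clusters 2, 3 and 7.
scaled = normalise_across_clusters({"geneA": {2: 6.0, 3: 1.0, 7: 3.0}})
```

This places strongly and weakly expressed marker genes on a common 0 to 1 scale, so that only their relative distribution across clusters is compared.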
Extended Data Figure 9 Single-molecule FISH and immunofluorescence experiments confirm expression of markers for enteroendocrine cell sub-populations in the mouse small intestine.
a, b, Small intestine cryosections were hybridized with smFISH probe libraries. Scale bar, 10 μm. a, The novel marker Ucn3 was found in a small number of Tac1-positive cells. Probes against Tac1, conjugated with TMR (upper panel, red), and against Ucn3, conjugated with Cy5 (lower panel, green), were hybridized to the tissue sections. Dashed lines indicate cell borders. A cell co-expressing the two markers (Tac1+/Ucn3+) is shown in the left column. A cell expressing Tac1, but not Ucn3 (Tac1+/Ucn3−), is shown in the right column. b, The novel marker Vgf is expressed by a subpopulation of Tph1-positive cells. Probes against Tph1, conjugated to TMR (upper panel, red), and against Vgf, conjugated to Cy5 (lower panel, green), were used for hybridization. Cell borders were stained by phalloidin-AlexaFluor 488 (not shown). Dashed lines demarcate cell borders. A Tph1-positive cell expressing Vgf (Tph1+/Vgf+) is shown in the left column. An example of a Tph1-positive cell expressing no Vgf (Tph1+/Vgf−) is shown in the right column. c, d, Immunostaining was performed on cryosections of mouse small intestinal tissue. Scale bar, 20 μm. c, Expression of UCN3 was observed in a few TAC1-positive cells within the jejunum. Frozen tissue sections were indirectly stained with anti-UCN3 (left panel, red), and anti-TAC1 (middle panel, light blue) antibodies. Nuclei were visualized with DAPI (dark blue). A cell expressing both markers (TAC1+/UCN3+) is shown in the upper row. A cell positive for TAC1, but not UCN3 (TAC1+/UCN3−), is shown in the lower row. The arrowhead points at the UCN3-negative cell. d, VGF is expressed by a subpopulation of TAC1-positive jejunal and ileal cells. VGF (left panel, red) and TAC1 (second panel from the left, grey) expression was visualized with indirect immunostaining. Expression of AFP was detected using a directly conjugated antibody (second panel from the right, green). Nuclei were counterstained with DAPI (blue).
A TAC1-positive cell expressing VGF and AFP (TAC1+/VGF+/AFP+) is shown in the upper panel. An example of a TAC1-positive cell expressing no VGF or AFP (TAC1+/VGF−/AFP−) is shown in the lower panel. Arrowheads point at the VGF- and AFP-negative cell.
Extended Data Figure 10 Purification of Lgr5–GFP+ cells from reporter mouse23 organoids.
Single small intestinal organoid cells, derived from Lgr5–GFP mice, were sorted by FACS. In total, 96 cells were sequenced from a single experiment on four lanes with 31,065,854 reads in total, of which 33% could be mapped to the transcriptome. Every UMI-derived transcript was sequenced on average 6.4 times. A median of 9,626 transcripts per cell was quantified. After filtering out cells with <3,000 transcripts and genes with <5 transcripts in all cells or >500 transcripts in a single cell, 74 cells remained for analysis. a, The t-SNE map shows the cell types identified by RaceID. Only a single predominant cell type and a few outliers were observed. Cluster 1 comprises intestinal stem cells while the few outliers represent Paneth and enteroendocrine cells. b–d, The t-SNE maps show expression of the stem cell marker Lgr5 (b), the stem cell marker Lrig1 (c), and the +4 niche marker Hopx (d). All markers are homogeneously expressed across all cells at low transcript counts. We only observed marginal expression of the stem cell marker Bmi1 in a few cells and we did not observe expression of Tert in any of the cells, which is likely owing to the limited sensitivity of the method. The RaceID results indicate that Lgr5-positive intestinal stem cells represent a uniform population of cells. e, f, Combined analysis of random organoid and Lgr5-positive cells using RaceID. The initial clusters of the random organoid cells are conserved. The Lgr5-positive cells give rise to a uniform group that merges with the CBC/TA cluster from the random organoid cells (cluster 2). Shown is a heat map representation (e) and a t-SNE map (f) of the cell-to-cell transcriptome distances. Clusters are indicated by the same colours along the axes of the heat map (e) and for individual data points in the t-SNE map (f). Dim, dimension.
Supplementary information
Supplementary Table 1
This file contains Supplementary Table 1. Differentially regulated genes within cell clusters derived for random organoid cells. For each cluster, the first column contains the gene identifier, composed of the official gene symbol and the chromosome separated by a double underscore. The second and third columns contain the average expression across all cells and across cells within the cluster, respectively, normalized to the median expression in the cluster. The fourth column indicates the fold change and the last column shows the p-value for the observed fold change (see Methods). (XLSX 255 kb)
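The gene identifier format described here (gene symbol and chromosome joined by a double underscore) can be split as in the following Python sketch. The function name and the example identifier are hypothetical, and the exact spelling of the chromosome part in the tables is an assumption.

```python
def parse_gene_id(identifier):
    """Split 'SYMBOL__CHROMOSOME' at the last double underscore."""
    symbol, sep, chromosome = identifier.rpartition("__")
    if not sep or not symbol or not chromosome:
        raise ValueError(f"not a SYMBOL__CHROMOSOME identifier: {identifier!r}")
    return symbol, chromosome

# Hypothetical identifier, for illustration only.
symbol, chromosome = parse_gene_id("GeneA__chr1")
```

Splitting at the last separator keeps any underscores inside the gene symbol intact.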
Supplementary Table 2
This file contains Supplementary Table 2. Differentially regulated genes within cell clusters derived from random Reg4-dsRed positive mouse intestinal cells. For each cluster, the first column contains the gene identifier, composed of the official gene symbol and the chromosome separated by a double underscore. The second and third columns contain the average expression across all cells and across cells within the cluster, respectively, normalized to the median expression in the cluster. The fourth column indicates the fold change and the last column shows the p-value for the observed fold change (see Methods). (XLSX 100 kb)
Supplementary Table 3
This file contains Supplementary Table 3, a list of 96 primers used for single cell sequencing and a list of smFISH probes. (XLSX 47 kb)
Supplementary Table 4
This file contains Supplementary Table 4, mRNA molecule count in intestinal cells measured by smFISH. (XLSX 14 kb)
Supplementary Table 5
This file contains Supplementary Table 5, read statistics for Lgr5-EGFP+ and YFP+ cells isolated from live intestine. See Extended Data Figures 1 and 10 for details of the organoid data. (XLSX 8 kb)
Supplementary Table 6 (transcript_counts_intestine)
This file contains Supplementary Table 6, sample data for applying the RaceID algorithm. The table contains the transcript count data for the 288 sequenced organoid cells presented in Figures 1 and 2. Primary and processed data for all experiments can be retrieved from Gene Expression Omnibus, accession no. GSE62270. This file was updated on 10 September 2015 to correct a file name. (ZIP 1375 kb)
Supplementary Information
This file contains a Supplementary Note, including benchmarking and experimental validation of the RaceID algorithm. (PDF 222 kb)
Supplementary Data 1 (RaceID_class.R)
This file contains the R code of the RaceID algorithm. This file was updated on 10 September 2015 to correct a file name. (ZIP 8 kb)
Supplementary Data 2 (RaceID_sample.R)
This file contains the R code of the RaceID sample commands. This file was updated on 10 September 2015 to correct a file name. (ZIP 1 kb)
Supplementary Data 3
This file contains a manual of the R code for the RaceID algorithm. (PDF 2666 kb)
About this article
Cite this article
Grün, D., Lyubimova, A., Kester, L. et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 525, 251–255 (2015). https://doi.org/10.1038/nature14966
DOI: https://doi.org/10.1038/nature14966