Blood formation is believed to occur through stepwise progression of haematopoietic stem cells (HSCs) following a tree-like hierarchy of oligo-, bi- and unipotent progenitors. However, this model is based on the analysis of predefined flow-sorted cell populations. Here we integrated flow cytometric, transcriptomic and functional data at single-cell resolution to quantitatively map early differentiation of human HSCs towards lineage commitment. During homeostasis, individual HSCs gradually acquire lineage biases along multiple directions without passing through discrete hierarchically organized progenitor populations. Instead, unilineage-restricted cells emerge directly from a ‘continuum of low-primed undifferentiated haematopoietic stem and progenitor cells’ (CLOUD-HSPCs). Distinct gene expression modules operate in a combinatorial manner to control stemness, early lineage priming and the subsequent progression into all major branches of haematopoiesis. These data reveal a continuous landscape of human steady-state haematopoiesis downstream of HSCs and provide a basis for the understanding of haematopoietic malignancies.
All mature blood and immune cells are thought to derive from self-renewing and multipotent HSCs. According to the current model, initiation of differentiation is associated with the loss of self-renewal and generation of discrete multipotent, oligopotent and subsequently unipotent progenitor cell stages1,2. These lineage-restricted progenitors are thought to be generated in a stepwise manner by several subsequent binary branching decisions leading to the classical hierarchical tree-like model of haematopoiesis1,2,3,4,5,6. However, this model is mainly based on analyses of fluorescence-activated cell sorting (FACS)-purified cell populations. Even if followed up by single-cell assays3,4,7, such analyses derive average properties of predefined cell populations and thereby miss both quantitative changes within gates as well as transition states falling between often subjectively set gates.
Moreover, the lineage contribution associated with each population is typically determined by assays such as colony formation or transplantation. While these assays read out lineage potential, the actual cell fate during homeostasis in vivo may be different8,9. Depending on the assays and markers used, partly conflicting branching points and hierarchies have been proposed10,11,12,13,14.
Recent studies based on novel single-cell approaches have challenged more fundamental aspects of this classical model. For instance, unipotent progenitors can derive directly from HSCs without proceeding through oligopotent progenitors14,15 and lineage commitment was observed in progenitors proposed to be oligopotent7,10,16. However, many of these studies focused on more differentiated compartments7,10,16 or used predefined subpopulations to investigate single-cell heterogeneity7,17, impeding the characterization of transitions between cell stages. Therefore, it remains unclear how individual HSCs enter lineage commitment during homeostasis in vivo. To establish a comprehensive model of haematopoiesis that can reconcile previous findings, a combined view of transcriptomic and functional changes along the developmental progression of individual cells is required. Here we developed an approach that integrates the reconstruction of developmental trajectories18,19 with the quantitative linkage between transcriptomic and functional single-cell data17 and thus provides a detailed view on lineage commitment of individual haematopoietic stem and progenitor cells (HSPCs) into all major branches of human haematopoiesis.
Healthy human bone marrow cells were labelled with a panel of up to 11 FACS surface markers commonly used to characterize human HSPCs5,6 (see Methods and Supplementary Table 1). All HSPCs, defined by the absence of lineage markers (Supplementary Table 1) and expression of CD34 (Lin−CD34+), were individually sorted and enriched for immature cells (see Methods). The surface marker fluorescence intensities of all markers were recorded to retrospectively reconstruct immunophenotypes (CD10, CD38, CD45RA, CD90, CD135, and depending on the experiment CD2, CD7, CD49f, CD71, CD130, FCER1A, ITGA5 and KEL, Supplementary Fig. 1a). Such index-sorted HSPCs derived from the bone marrow of two healthy individuals were subjected to RNA-seq analysis (‘index-omics’, 1,034 and 379 single cells; see Supplementary Fig. 1b for the distribution of cells within classically defined gates5,6 and Supplementary Fig. 2 for quality metrics of single-cell RNA-seq) to determine their transcriptomes or individually cultured ex vivo (‘index-culture’, 2,038 single cells) to quantify megakaryocytic, erythroid and myeloid lineage potential. Subsequently, the functional and transcriptomic data sets were integrated by regression models using commonly indexed surface marker expression to identify the molecular and cellular events associated with the differentiation of human HSCs at the single-cell level (Fig. 1). To make this data type accessible, we developed indeXplorer, a web-based platform that combines features of FACS software (for example, custom gating) with tools for single-cell transcriptomics data analysis (for example, differential expression analysis, clustering, principal component analysis) in a single graphical user interface (Supplementary Fig. 3 and http://steinmetzlab.embl.de/shiny/indexplorer/?launch=yes).
Early haematopoiesis is a continuous process
HSCs and their immediate progeny, such as multipotent progenitors (MPPs) or multilymphoid progenitors (MLPs), are located in the Lin−CD34+CD38− compartment, whereas more differentiated progenitors reside in the Lin−CD34+CD38+ compartment5,7. Global gene expression analysis of single cells within these two compartments revealed fundamentally different transcriptomic structures. In both individuals, the Lin−CD34+CD38+ progenitors could be separated into clusters corresponding to distinct progenitor cell types of all major branches of haematopoiesis (Fig. 2a and see below). In contrast, clustering within the Lin−CD34+CD38− compartment was largely unstable, as demonstrated by cluster stability analysis (Supplementary Fig. 4a), the absence of clusters according to Gap statistics (Supplementary Fig. 4b), and a recently published algorithm for the clustering of single cells20 (Supplementary Fig. 4c). A simulated series of random steps from an individual cell to one of its nearest neighbours (see Methods) revealed that the majority of Lin−CD34+CD38− cells were highly interconnected, contrasting the disconnected cell types from the Lin−CD34+CD38+ compartment (Fig. 2b). Unsupervised visualization of all individual cells irrespective of FACS markers by t-SNE confirmed that Lin−CD34+CD38− cells formed a single continuously connected entity. In contrast, Lin−CD34+CD38+ cells emerged into locally clustered cell populations, with the exception of some phenotypic common myeloid progenitors (CMPs) and CD10+ MLPs, suggesting that the classification based on differential CD38 expression is excellent, but not absolute (Fig. 2c).
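The random-step simulation described above can be illustrated with a minimal sketch. It assumes a k-nearest-neighbour graph built from Euclidean distances; the toy coordinates, the value of k, the number of steps and the function names are all hypothetical and are not taken from the Methods.

```python
import math
import random

def knn_graph(points, k):
    """For each point, list the indices of its k nearest neighbours (Euclidean)."""
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append([j for _, j in dists[:k]])
    return neighbours

def random_walk(neighbours, start, n_steps, rng):
    """Hop repeatedly from the current cell to one of its nearest neighbours."""
    visited = {start}
    current = start
    for _ in range(n_steps):
        current = rng.choice(neighbours[current])
        visited.add(current)
    return visited

# Hypothetical toy data: one continuous stretch of cells versus two separated clusters.
continuum = [(0.1 * i, 0.0) for i in range(30)]
clusters = ([(0.01 * i, 0.0) for i in range(15)]
            + [(100 + 0.01 * i, 0.0) for i in range(15)])

rng = random.Random(0)
reach_continuum = random_walk(knn_graph(continuum, 3), 0, 5000, rng)
reach_clusters = random_walk(knn_graph(clusters, 3), 0, 5000, rng)
```

In this toy setting, a walk started in one of the separated clusters can never leave it, whereas a walk on the continuous stretch is free to spread along it; this mirrors the contrast between disconnected Lin−CD34+CD38+ cell types and the interconnected Lin−CD34+CD38− compartment.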
Notably, the absence of hierarchical structures in the primitive Lin−CD34+CD38− compartment was due to the gradual nature of differences between cells in that compartment, and not due to insufficient data quality or a lack of transcriptomic heterogeneity: a principal component analysis of Lin−CD34+CD38− cells resolved more than 10 distinct, variable biological processes in this compartment, such as cell cycle activation and lineage priming (Supplementary Fig. 4d–f). These processes are tightly correlated to surface marker expression (Supplementary Fig. 4g).
Collectively, these observations are incompatible with the classical model of early haematopoiesis, which assumes a hierarchical tree-like structure of discrete progenitors downstream of HSCs. In contrast, our data suggest that HSCs and their immediate progeny are initially part of a continuum of low-primed undifferentiated (‘CLOUD’)-HSPCs within the Lin−CD34+CD38− compartment (see also below). Discrete populations are established only when differentiation has progressed to the level of restricted progenitors typically associated with the upregulation of CD38.
Lineage-restriction downstream of the HSPC continuum
To characterize the discrete populations in the Lin−CD34+CD38+ compartment, we performed gene expression and cell surface marker analyses as well as functional validations at the single-cell level. Our analyses revealed that these populations correspond to lineage-restricted progenitors of all major branches of bone marrow haematopoiesis, including B-cell progenitors of distinct stages, megakaryocyte/erythrocyte committed progenitors (ME, Ery, Mk), neutrophil-primed progenitors (Neutro), monocyte/dendritic cell (Mono/DC) progenitors, and eosinophil/basophil/mast cell progenitors (Eo/Baso/Mast), as well as immature myeloid progenitors (Fig. 3a and Supplementary Table 2). Importantly, populations cluster by cell type and not by individual in a cross-individual comparison (Fig. 3b). The comparison of the surface marker expression of these populations to the commonly applied gating scheme5 using our indexed data set showed that immunophenotypically defined oligopotent progenitor populations (megakaryocyte–erythroid progenitors, MEPs; granulocyte–monocyte progenitors, GMPs; B-cell–NK-cell progenitors, B–NKPs) were mainly comprised of cell types with unilineage-specific gene expression profiles (Fig. 3c) and functional unipotency (Fig. 4a, b).
Cells within the classic GMP compartment were separated into several neutrophil-primed progenitors (N-0 to N-3), as well as into monocyte/dendritic cell progenitors (Mono/DC). The distinct neutrophil-primed progenitors probably represent progenitors at different developmental stages and granule composition (Fig. 4c and Supplementary Fig. 4h)21,22. Immunophenotypically, all neutrophil-primed progenitors express the surface markers CD135 and CD45RA, which are progressively upregulated during maturation (Fig. 4c). In contrast to neutrophil-primed progenitors, Eo/Baso/Mast progenitors did not fall into the classical GMP gate but displayed a Lin−CD34+CD38+CD10−CD45RA−CD135mid immunophenotype (Fig. 3c), and expressed transcription factors important for early MEP commitment (GATA2 and TAL1) supporting a recent study suggesting that granulocyte subtypes might derive from distinct haematopoietic lineages12.
The MEP gate consisted of megakaryocytic (Mk) progenitors expressing typical Mk genes, of erythroid-committed (E-1, E-2) progenitors of distinct developmental stages, differing in haemoglobin and GATA1 expression, as well as of subpopulations showing combined expression of megakaryocytic and erythroid genes (M/E). Our single-cell transcriptome data suggested CD71 (TFRC) and the red blood cell antigen KEL to be highly indicative of erythroid fate, which was confirmed by single-cell culture assays using CD71 and KEL as indexing antibodies (Fig. 4d).
For individual 2, two CD10+ B-cell progenitor clusters (small pre-B-cells, sB and large pre-B-cells, lB) were observed. sB was characterized by high CD9 messenger RNA expression, high CD10 surface expression and small cell size (forward scatter (FSC)), whereas lB showed high expression of interleukin-7 receptor (IL7RA) mRNA, intermediate CD10 surface levels, expression of cell-cycle-related genes and large cell size (Fig. 4e and Supplementary Fig. 4i and Supplementary Table 2). This suggests that sB corresponds to small pre-B-cells, and lB to large pre-B-cells, progenitor populations that have been well characterized in the murine system, but to a lesser extent in the human system23. To validate and prospectively isolate large pre-B-cells and small pre-B-cells, we used IL7RA and CD9 FACS markers, which allowed us to recapitulate the levels of CD10 surface expression, cell size and cell cycle activity as predicted from the index-omics data (Fig. 4f and Supplementary Fig. 4j). In contrast to individual 2, in individual 1, only small pre-B-cells were observed (Fig. 3b).
For both individuals, we also observed CD38-positive HSPCs with a gene expression profile of rather immature cells (Im) (Fig. 3a). These clustered globally with the Lin−CD34+CD38− compartment in t-SNE analyses, and expressed lower levels of CD38 (Supplementary Fig. 4k). Most of these cells displayed an immunophenotype typical for CMPs (Lin−CD34+CD38+CD45RA−CD135+); however, the composition of the cell types present in the CMP gate depends strongly on the exact gating strategy applied (see below, Supplementary Fig. 5h, i).
Developmental trajectories of early human haematopoiesis
To obtain a detailed view on the transition from stem cells to lineage-restricted progenitors in the continuous HSPC landscape, we developed STEMNET, a new dimensionality reduction algorithm. STEMNET identifies genes specific to the six Lin−CD34+CD38+ restricted progenitor populations defined above (Neutro, Eo/Baso/Mast, B-cell, Mono/DC, Ery and Mk; see Supplementary Table 3 for a list of genes used by STEMNET) and then computes the probability that each primitive (‘CLOUD’) HSPC can be assigned to any of these classes. STEMNET thereby places the six developmental endpoints on the corners of a simplex. This resulted in the arrangement of the least-primed HSCs, such as CD49f+ HSCs, to the centre, and the remaining HSPCs localizing in between according to their degree of priming (Fig. 5a, and see Supplementary Fig. 5a, b for individual 2). To describe the position of each cell we computed the predominant direction of priming d as the developmental endpoint closest to the cell and the degree of lineage priming Srel as the (Kullback–Leibler) distance from the least-primed cell.
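The two summary statistics defined above, d and Srel, can be sketched from a table of per-cell class probabilities. This is a minimal illustration and not the STEMNET implementation: the probabilities are hypothetical, and the least-primed reference cell is assumed here to be the cell with the highest Shannon entropy, which the text does not specify. On a simplex, the corner closest to a cell is the class with the largest probability, so d reduces to an argmax.

```python
import math

LINEAGES = ["Neutro", "Eo/Baso/Mast", "B-cell", "Mono/DC", "Ery", "Mk"]

def shannon_entropy(p):
    """Entropy of a discrete distribution (natural log)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q):
    """KL(p || q); assumes q is strictly positive wherever p is positive."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def priming_summary(class_probs):
    """Return (direction d, degree of priming) for every cell."""
    # Assumption: the least-primed cell is the one with maximal entropy.
    reference = max(class_probs, key=shannon_entropy)
    summary = []
    for p in class_probs:
        closest = max(range(len(p)), key=lambda i: p[i])  # nearest simplex corner
        summary.append((LINEAGES[closest], kl_divergence(p, reference)))
    return summary

# Hypothetical class probabilities for three cells.
probs = [
    [1 / 6] * 6,                           # unprimed, sits at the centre
    [0.05, 0.05, 0.05, 0.05, 0.60, 0.20],  # erythroid-primed
    [0.70, 0.05, 0.05, 0.10, 0.05, 0.05],  # neutrophil-primed
]
summary = priming_summary(probs)
```

The reference cell receives a priming value of zero by construction, and values grow as a cell's probability mass concentrates on one endpoint.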
This analysis suggests that HSCs located in the centre of the ‘CLOUD’ gradually acquired continuous lineage priming into either of the major branches. While lympho/myeloid and megakaryocytic/erythroid priming formed major points of attraction, a clear separation into single lineages was not present at this stage (Fig. 5a). In contrast, lineages were clearly separated at the level of Lin−CD34+CD38+ progenitors, without further sub-branching in this compartment (Fig. 5a, see Supplementary Fig. 5c for CD38 expression). Importantly, these results are not due to limitations of the bioinformatics method, as STEMNET is able to detect both subsequent branching points and discrete intermediate populations on simulated data (Supplementary Fig. 6a–d). Moreover, applying diffusion pseudotime (DPT), a different recently published method for the inference of developmental trajectories24 to our data confirmed the absence of subsequent binary branch points and the direct lineage commitment from CLOUD-HSPCs along continuous trajectories (Supplementary Fig. 6e).
Within the differentiation continuum, STEMNET analysis located previously defined immunophenotypic populations according to their known lineage potential5 (Fig. 5b, see Supplementary Fig. 5b for individual 2). For example, GMPs were distributed to the neutrophil and monocytic/dendritic cell branches while MEPs were located to the megakaryocytic and erythroid branches (notice that the localization of CMPs critically depends on the exact CD38 and CD135 gating strategy, Supplementary Fig. 5h, i). In contrast, immunophenotypic MLPs were located close to the separation of lymphoid, neutrophil and monocytic/dendritic cell lineages (Fig. 5b and Supplementary Fig. 5b), with individual cells already primed towards specific lineages, in line with frequent functional commitment to single lineages in mouse LMPPs15. Together, these analyses suggest that developmental stages immediately downstream of HSCs such as MLPs and MPPs do not represent discrete cell types located at defined branching points, but should rather be considered as transitory states within the HSPC continuum with higher probability for commitment to particular lineages.
While undergoing lineage commitment only very few cells acquired a transcriptomic state of dual-lineage priming (Supplementary Fig. 5d, e), in accordance with a recent single-cell transcriptomic study on mouse GMPs20. Thus, our analyses suggest that a direct transition from a primed multi-lineage towards a unilineage transcriptomic state represents the main route of lineage commitment, whereas dual-lineage states (such as Gfi1+Irf8+ GMPs, Supplementary Fig. 5f) exist, but represent rare exceptions. Importantly, both transcriptomic and functional (Supplementary Fig. 5g) lineage combinations of bipotent cells were not restricted to the combinations predicted by the classical model, conflicting with a strictly ordered hierarchy of branching events. Along these lines, co-expression of opposing pairs of transcription factors, such as IRF8 and PU.1 (SPI1) that have been thought to establish an oligopotent state, occurred at much lower frequency than previously expected (see Fig. 8a(viii, xi))25.
Transcriptomic priming mediates lineage commitment
Single-cell RNA-seq protocols require cell lysis and therefore prohibit subsequent functional interrogation of the same single cell. However, the use of indexed FACS surface markers common to both single-cell ex vivo culture data and single-cell RNA-seq data allowed us to quantitatively link the amount and direction of transcriptomic priming to functional properties such as lineage potential and proliferative capacity. For example, the STEMNET-predicted dominant direction of transcriptional priming into the lympho/myeloid versus the megakaryocytic/erythroid direction was strongly correlated to the surface marker expression of CD135 and CD45RA (Fig. 6a(i, ii)), which could be used to qualitatively predict the predominant cell type in colonies of our single-cell cultures (note that lymphoid progenitors do not grow in these conditions, and that myeloid sublineages are not resolved) (Fig. 6a(ii)). Utilizing all recorded surface markers for linear models on the single-cell RNA-seq data allowed us to quantitatively predict the dominant cell type present in the single-cell cultures for the Lin−CD34+CD38+ (P = 3.7 × 10−23) and the Lin−CD34+CD38− compartment (P = 3.7 × 10−22, Fig. 6a(iii) and Supplementary Fig. 7a for the full specification of regression models). Moreover, predicting erythroid and megakaryocytic priming individually revealed that the amount of lineage-specific priming was linked to functional lineage commitment (Fig. 6b, c and Supplementary Fig. 7b, c). However, colonies derived from Mk-primed cells were frequently dominated by other cell types due to their lower proliferative capacity ex vivo (Supplementary Fig. 7b). STEMNET further predicted Lin−CD34+CD38−CD45RA−CD90−CD135− cells to be primed towards megakaryocytic differentiation (Fig. 6d, left panel). To functionally validate this prediction in vivo, we FACS-sorted these cells, transplanted them into sublethally irradiated NSG mice and quantified their lineage output 14 days post transplantation. 
As predicted, these cells, which we termed Mk-primed MPPs, predominantly generated thrombocytes if compared with MLPs and HSCs (Fig. 6d, right panel). Together, these analyses revealed that transcriptomic priming is linked to the restriction of lineage potential at an early stage in vitro and in vivo.
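The linkage between index-omics and index-culture data through shared surface markers can be sketched as a linear model that is fitted on sequenced cells and then applied to cultured cells. The sketch below uses ordinary least squares via the normal equations; the marker intensities, priming scores and function names are hypothetical, and the regression models actually used are specified in Supplementary Fig. 7a.

```python
def fit_linear_model(X, y):
    """Ordinary least squares with intercept, solved via the normal equations."""
    rows = [[1.0] + list(x) for x in X]
    n = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
           + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def predict(coef, x):
    """Apply the fitted model to one cell's surface marker intensities."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))

# Hypothetical index-omics cells: (CD135, CD45RA) intensities and a priming score
# inferred from their transcriptomes.
omics_markers = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
omics_priming = [1, 3, 4, 6, 8]

coef = fit_linear_model(omics_markers, omics_priming)

# Hypothetical index-culture cell: only its surface markers are known.
predicted_priming = predict(coef, (2, 2))
```

Because the same markers are recorded in both experiments, the predicted priming of a cultured cell can then be compared with its observed colony output.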
We next estimated the degree of transcriptomic lineage priming Srel for individual cells from the culture experiments (Fig. 7a, b). As expected, committed progenitors with a high degree of inferred transcriptomic lineage priming formed small colonies (Fig. 7a) of a single cell type (Fig. 7b). In contrast, primitive HSPCs (low inferred Srel) frequently displayed multi- or bilineage potential (Fig. 7b) and generated much larger colonies (Fig. 7a). However, not all of the primitive HSPCs displayed multipotency; many appeared to be lineage-restricted while typically retaining a high proliferative capacity comparable to their multipotent counterparts (Fig. 7c). These data suggest that proliferative capacity and lineage potency are not obligatorily linked.
To investigate the ability of cells with various amounts of priming to switch lineage potential, we cultured HSPCs in the absence and presence of erythropoietin (EPO). Progenitors that formed exclusively erythroid colonies in the presence of EPO were unable to give rise to alternative lineages in the absence of EPO (Fig. 7d). Moreover, we cultured single HSPCs for one week, split the colonies into four and determined the lineage outcome of the daughter colonies two weeks later. In line with the predictions of our model, the degree of transcriptomic priming was anticorrelated to the propensity of cells to generate daughters with variable lineage composition (Supplementary Fig. 7d, e). Together, these results support the hypothesis that early lineage priming of primitive HSPCs coincides with a loss of functional plasticity.
Molecular processes underlying HSC commitment
To characterize stemness, early lineage priming and transcriptional cell type manifestation on the molecular level, we identified co-expressed gene modules whose activities were associated with the direction and/or the degree of priming. We visualized the activity of these gene modules on the differentiation landscape established above (Fig. 8a(i)) and along the progression from HSCs to each of the six lineages (Fig. 8b and Supplementary Fig. 8a, b and Supplementary Table 4 for a complete list). Importantly, data from both individuals yielded highly comparable results (Supplementary Fig. 8). To gain additional information about biological processes associated with HSC differentiation, we determined the mean expression of genes for each gene ontology (GO) term, and selected representative examples that changed significantly during early lineage priming (Fig. 8c). Together, these analyses provide insights into the global molecular and cell biological processes HSCs encounter while undergoing continuous lineage priming, unilineage commitment and subsequent differentiation.
The least-primed state was characterized by expression of the HOXA3/PRDM16/HOXB6 module26,27,28 (Fig. 8a(ii), b and Supplementary Table 4) and associated with typical stem cell properties such as cell cycle quiescence, low expression of the entire gene expression machinery, low total RNA content (measured by mRNA reads per in vitro spike-in RNA read), low cellular respiration29, low CD38 and high CD90 surface expression5 (Fig. 8c). The expression of the HLF/ZFP36L2 module (which also contains the transcription factors MECOM/EVI1, HLF, GATA3) was highest in immature HSCs, but present in the entire ‘CLOUD’ (Fig. 8a(iii), b and Supplementary Table 4)30,31,32.
Intriguingly, stem cells also expressed genes from the earliest priming modules from both the lympho/myeloid (FLT3/SATB1 module) and the megakaryocyte/erythrocyte (GATA2/NFE2 module)33 lineages in a non-exclusive manner (Fig. 8a(iv–v)). These data suggest that the first transcriptional priming events into the predominantly lympho/myeloid or the megakaryocyte/erythrocyte direction are already present in most primitive HSCs, coinciding with the occurrence of first functional lineage biases already at this stage (Figs 6a, b, 7a Srel bin 1 and 2). A number of additional gene modules were activated in a combinatorial fashion between lineages, similar to previous observations from bulk RNA-seq34 (Fig. 8 and Supplementary Fig. 8a and Supplementary Table 4).
Following acquisition of lineage priming, HSCs upregulate their gene expression machinery, mRNA and protein biosynthesis, and respiration29,35, while cell cycle activity increases only marginally (Fig. 8c). At this stage, cells start to express lineage-specific gene modules, for example the SPI1/GFI1 module for the neutrophil lineage (Fig. 8a(viii)) or the IRF1/CASP1 module33 for the B-cell lineage (Fig. 8a(vi)). Other modules active at this stage, however, are shared between lineages; for example, the TAL1/HSF1 module is shared between the erythroid and the megakaryocytic lineage, whereas the EAF2/KLF4 module is shared between the neutrophil and the monocyte lineage. This coincides with the observation that most progenitors at this stage display narrow restriction in their developmental potential, whereas some progenitor cells remain oligopotent15 (Fig. 7b, Srel bin 3).
Manifestation of lineage-specific differentiation is accomplished by activation of gene modules such as the CEBPA/CEBPD module for the neutrophil lineage, the EBF1/ID3 module for the B-cell lineage, the IRF8 module for the monocytic/dendritic lineage, the GP1BB/PBX1 module for the megakaryocytic lineage and the GATA1/KLF1 module for the erythroid lineage33,36,37 (Fig. 8a(x–xv), b). In all cases, this step is accompanied by cell cycle activation, CD38 surface marker upregulation (Fig. 8c) and unipotency (Fig. 7b, Srel bin 4 and 5).
Together, our data suggest that HSCs are characterized by the expression of specific stem cell modules in combination with early, probably antagonizing priming modules. During the continuous priming and differentiation process the stem cell modules and certain (but not all) early priming modules already expressed in HSCs are turned off, while specific lineage modules become reinforced to drive differentiation towards lineage commitment and manifestation (Fig. 8a, b). Transcription factors from upstream modules may trigger expression of downstream modules, as in the case of GATA2, TAL1 and GATA133. In contrast, transcription factors from mutually exclusive downstream modules may inhibit each other; for example, IRF8 is known to repress CEBPA38. Such inhibitory interactions may render oligopotent progenitors unstable7,10,15, and thus less abundant than previously anticipated (Fig. 7b). In contrast, in cells with a low amount of priming, expression levels of mutually exclusive modules are sufficiently small to allow uni-, oligo- or multipotency.
In summary, we provide a global view of early human haematopoiesis during homeostasis. Our data set combines both information on the lineage potential of HSCs (index-culture) and insights into the unperturbed lineage commitment of HSCs during human haematopoiesis (reconstruction of developmental trajectories from static single-cell expression data), where lineage tracing approaches8,9 are not possible. Here, we rely on single-cell culture data and xenotransplantation for functional validation, which, unlike gene expression or cellular barcoding, measure developmental potential, not fate.
Our results are incompatible with fundamental aspects of the differentiation-tree model, in which HSCs are required to pass through discrete and definable intermediate progenitor cell stages by subsequent binary cell fate decisions made on branching points. Instead, we propose that early haematopoiesis is represented by a cellular continuum of low-primed undifferentiated (CLOUD)-HSPCs. This HSPC continuum contains phenotypic MPPs and MLPs, which do not constitute discrete progenitor cell types, but rather transitory states. CLOUD-HSPCs gradually acquire transcriptomic lineage priming in a combination of multiple directions, with some cell state transitions and lineage combinations more likely to occur than others. Distinct lineages emerge directly from CLOUD-HSPCs, earlier than previously anticipated and without passing through a series of discrete, stable progenitors. Our data suggest a multidimensional molecular and cellular landscape of steady-state human haematopoiesis defined by a continuous flow of differentiation and emergence of lineage trajectories independent of each other. This landscape can be visualized by using the classical Waddington’s landscape as a blueprint39,40,41, which more appropriately reflects the continuous nature of haematopoiesis than a ‘cell type tree’ (Fig. 8d). Haematopoietic stem cells reside in a flat valley at the top. Barriers separating individual lineages emerge early and deepen gradually, illustrating the acquisition of lineage biases driven by small differences in gene expression of early fate mediators. When barriers become insurmountable, cell type manifestation and lineage commitment are established.
While our study provides detailed insight into lineage commitment from HSCs into all branches of human bone marrow haematopoiesis, it does not cover lineage decisions occurring further downstream or outside the bone marrow, such as T-cell development. Given the low frequency of eosinophil/basophil/mast cell and monocyte/dendritic cell progenitors within the CD34+ bone marrow compartment, our study cannot fully resolve the separation and maturation of these lineages.
Together, our data determine a comprehensive continuum-based model of early human haematopoiesis, which will probably have important implications for the aetiology of haematologic disorders and which may serve as a paradigm for other adult stem cell systems.
Bone marrow aspirations.
Bone marrow aspirates from healthy individuals between 25 and 39 years of age were obtained at the University clinics in Heidelberg and Mannheim after written informed consent. The use of human samples for RNA-seq and functional studies was approved by the local ethics committees in accordance with the Declaration of Helsinki. Bone marrow mononuclear cells were isolated by gradient centrifugation using Histopaque-1077 (Sigma).
Bone marrow mononuclear cells were stained with surface markers for 30 min on ice according to standard protocols. For FACS sorting, BD FACS Aria II/III or Fusion flow cytometers (BD Biosciences) equipped with 405 nm, 488 nm, 561 nm and 633 nm (Aria)/642 nm (Fusion) lasers were used. For flow cytometric analyses, LSRII and LSRFortessa flow cytometers (BD Biosciences) equipped with 350 nm, 405 nm, 488 nm, 561 nm and 640 nm lasers were used. For Ki67-Hoechst cell cycle analysis, surface staining was performed as described previously43. Subsequently, cells were fixed and permeabilized using cytofix–cytoperm buffer (BD Biosciences), and incubated with Ki67 antibody overnight at 4 °C. Cells were stained with 2 μg ml−1 Hoechst 33342 (Invitrogen) and analysed. Data were analysed using FlowJo (TreeStar), indeXplorer or R.
Single-cell liquid cultures (‘index-cultures’).
Fresh human bone marrow mononuclear cells were stained as described above with fluorescence-labelled antibodies against CD2, CD34, CD38, CD45RA, CD71, CD90, CD130, CD135, CD238 (KEL), FcεRI and a lineage cocktail consisting of CD4, CD8, CD11b, CD14, CD19, CD20, CD56, CD235a and CD10. Single Lin−CD34+CD38+CD10− and Lin−CD34+CD38−CD10− HSPCs were sorted into ultralow attachment 96-well plates (Corning) containing 100 μl StemSpan SFEM media (Stem Cell Technologies), L-glutamine (100 ng ml−1), penicillin/streptomycin (100 ng ml−1) and the following human cytokines: SCF (20 ng ml−1, Peprotech), Flt3-L (20 ng ml−1, Peprotech), TPO (50 ng ml−1, Peprotech), IL-3 (20 ng ml−1, Peprotech), IL-6 (20 ng ml−1, Peprotech), G-CSF (20 ng ml−1, Peprotech), IL-5 (20 ng ml−1, Peprotech), M-CSF (20 ng ml−1, Peprotech), GM-CSF (20 ng ml−1, Peprotech) and EPO (4 U ml−1, R&D). For the experiment displayed in Fig. 7d, EPO was left out from the medium. Note that the CD38+ and CD38− gates were set to touch (see also Supplementary Fig. 1a).
Fluorescence intensities were recorded for every channel for each sorted cell and used to retrospectively reconstruct immunophenotypic populations. Cells were cultured for 21 days at 5% CO2 and 37 °C. To characterize clonal progeny, colonies were imaged by microscopy and subsequently analysed for CD15, CD33, CD41a and CD235a expression by flow cytometry. Note that under these conditions, only myeloid (CD33), erythroid (CD235a) and megakaryocytic (CD41a) colonies are efficiently generated. Colonies were judged on the basis of their visual morphology and expression of surface markers. Colony size and lineage output were based on flow cytometry and confirmed by microscopy. A colony was determined to be positive for a particular lineage if ≥10 cells of the respective cell type were detected.
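The colony scoring rule stated above (a lineage counts as present if at least 10 cells of that type are detected) can be written as a short sketch. The marker-to-lineage mapping follows the text; the potency labels, the function names and the example counts are hypothetical, and CD15 is omitted because the text does not assign it to a scored lineage.

```python
MIN_CELLS = 10  # threshold stated in the text

# Marker used to read out each lineage, as described in the text.
LINEAGE_MARKERS = {
    "myeloid": "CD33",
    "erythroid": "CD235a",
    "megakaryocytic": "CD41a",
}

def score_colony(marker_counts, min_cells=MIN_CELLS):
    """Return the lineages with at least min_cells marker-positive cells."""
    return sorted(
        lineage
        for lineage, marker in LINEAGE_MARKERS.items()
        if marker_counts.get(marker, 0) >= min_cells
    )

def potency(lineages):
    """Hypothetical label summarizing how many lineages a colony contains."""
    labels = {0: "none", 1: "unilineage", 2: "bilineage"}
    return labels.get(len(lineages), "multilineage")

# Hypothetical flow cytometry counts for one colony.
colony = {"CD33": 250, "CD235a": 9, "CD41a": 10}
lineages = score_colony(colony)
label = potency(lineages)
```

In this example the nine CD235a-positive cells fall just below the threshold, so the colony is scored as myeloid and megakaryocytic but not erythroid.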
For the ‘split-in-four’ experiment (Supplementary Fig. 7d, e), colonies were evaluated 7 days after seeding of single cells and colonies with more than 50 cells were equally split into 4 wells and cultured for an additional 14 days before colony size and lineage output were scored.
NSG mice were bred and housed under specific pathogen-free conditions at the central animal facility of the German Cancer Research Center. All animal experiments were approved by the Regierungspräsidium Karlsruhe under Tierversuchsantrag numbers G108/12 and G210/12.
A total of 17,000 FACS-sorted HSCs (Lin−CD34+CD38−CD90+CD45RA−), MLPs (Lin−CD34+CD38−CD45RA+) or Mk-primed MPPs (Lin−CD34+CD38−CD90−CD135−) from healthy bone marrow were injected into the femoral bone marrow cavity of female mice at 15 weeks of age that had been sublethally irradiated (200 cGy) 24 h before injection.
Two weeks after xenotransplantation, lineage-specific human engraftment in the injected femur was evaluated by flow cytometry using anti-human-CD45-PE, anti-human-CD235a-APC and anti-human-CD41a-FITC antibodies.
Single-cell transcriptome sequencing (‘index-omics’).
A 25-year-old male donor (individual 1) and a 29-year-old female donor (individual 2) were selected for single-cell RNA-seq. Fresh bone marrow mononuclear cells were stained as described above with fluorescence-labelled antibodies against CD34, CD38, CD45RA, CD90, CD49f, CD135, CD10, CD7 and a lineage cocktail consisting of CD4, CD8, CD11b, CD14, CD19, CD20, CD56 and CD235a. Fluorescence intensities were recorded for every channel for each sorted cell and used to reconstruct immunophenotypic populations subsequently.
While the frequently used smart-seq2 protocol44 failed to amplify transcriptomes from bone marrow-derived human HSPCs, both the QUARTZ-seq protocol45 and a modified smart-seq2 protocol (see below) yielded good-quality cDNA (Supplementary Fig. 2a). To avoid method-specific biases, data were generated using both QUARTZ-seq (individual 2) and smart-seq2.HSC (individual 1), and all findings were systematically compared between individuals (Figs 2 and 3b and Supplementary Figs 4a, b, 5a, b and 8c).
For individual 1, eight plates of Lin−CD34+CD38− and six plates of Lin−CD34+CD38+ HSPCs were sorted and whole transcriptome amplification was performed using the smart-seq2 protocol44, but using 5 μl of a modified RT buffer containing 1× SMART First Strand Buffer (Clontech), 1 mM dithiothreitol (Clontech), 1 μM template switching oligo (Exiqon), 10 U μl−1 SMARTScribe (Clontech) and 1 U μl−1 RNASin plus (Promega). ERCC spike-ins were included at a final dilution of 1:1,000,000. Libraries were constructed using a home-made Tn5 transposase (based on ref. 46). Note that the CD38+ and CD38− gates were set to touch (see also Supplementary Fig. 1a).
For individual 2, eight plates of Lin−CD34+CD38−, one plate of Lin−CD34+CD38−CD90+CD45RA− and four plates of Lin−CD34+CD38+ HSPCs were sorted and whole transcriptome amplification was performed using the QUARTZ-Seq protocol45. ERCC spike-ins were included in the lysis buffer at a final dilution of 1:2,000,000. Libraries were constructed using Nextera Tn5 (Illumina) following the protocol provided, but using 1/4 of all volumes. Libraries were then sequenced on an Illumina HiSeq 2500 platform.
Raw data processing and quality control.
Reads were demultiplexed and, where applicable, the remaining poly-A tail of the mRNA was trimmed off. Reads were then aligned to the Homo sapiens genome (build 37.68, also containing the ERCC spike in sequences) using GSNAP47, with the expected paired-end length set to 400 bp and the allowable deviation from the expected paired-end length set to 100 bp. Reads overlapping uniquely with mRNA genes were counted using HTSeq48. As a first filtering step, we retained all cells in which we observed more than 750 genes at a minimum of 10 reads each, and a total of at least 150,000 reads. We removed all genes from the data set that were not observed by at least 10 reads in at least 5 cells. Statistics on these filtering steps are displayed in Supplementary Fig. 2.
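The two filtering steps described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `filter_counts`, the nested-dictionary input format and the decision to count gene support only among retained cells are assumptions made for this example. The default thresholds are the ones stated in the text.

```python
def filter_counts(counts, min_genes=750, min_reads_per_gene=10,
                  min_total_reads=150_000, min_cells=5):
    """Apply cell and gene filters to a read-count table.

    counts: dict mapping cell id -> dict mapping gene id -> read count
            (hypothetical input format chosen for this sketch).
    """
    # Step 1: keep cells with more than `min_genes` genes detected at
    # >= `min_reads_per_gene` reads each and >= `min_total_reads` reads in total.
    kept_cells = {}
    for cell, genes in counts.items():
        n_detected = sum(1 for r in genes.values() if r >= min_reads_per_gene)
        total_reads = sum(genes.values())
        if n_detected > min_genes and total_reads >= min_total_reads:
            kept_cells[cell] = genes

    # Step 2: keep genes observed with >= `min_reads_per_gene` reads in
    # >= `min_cells` cells (assumption: counted among retained cells only).
    support = {}
    for genes in kept_cells.values():
        for gene, reads in genes.items():
            if reads >= min_reads_per_gene:
                support[gene] = support.get(gene, 0) + 1
    kept_genes = {g for g, n in support.items() if n >= min_cells}

    return {cell: {g: r for g, r in genes.items() if g in kept_genes}
            for cell, genes in kept_cells.items()}


# Toy usage with deliberately small thresholds.
toy = {
    "c1": {"g1": 12, "g2": 0, "g3": 30},
    "c2": {"g1": 15, "g2": 11, "g3": 1},
    "c3": {"g1": 1, "g2": 2, "g3": 3},
}
filtered = filter_counts(toy, min_genes=1, min_reads_per_gene=10,
                         min_total_reads=20, min_cells=2)
```

In the toy call, cell `c3` fails the cell filter and only `g1` is supported by two retained cells, so the result contains `g1` for `c1` and `c2`.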
We then fitted error models49 to the readcount data (see also below). In 35 cells of individual 2 and 1 cell of individual 1, we observed an extreme overdispersion of the genes classified as non-dropout events. These cells were removed. In individual 1, we further excluded 13 cells with an abnormal CD38−CD90high immunophenotype (Supplementary Fig. 1a). These cells were clear outliers also with regard to gene expression, as they mostly expressed genes associated with various types of mature immune cell (not shown).
Data normalization using posterior odds ratio.
We designed a normalization method to address the following two challenges: single-cell transcriptomics has large technical variability; and human haematopoietic stem and progenitor cells largely differ in RNA content (Supplementary Fig. 2h).
While lowly expressed genes are sometimes observed in cells with high total RNA content, they are almost never seen in cells with low total RNA content (Supplementary Fig. 2i). As this effect is the same for all genes of low expression level, it will induce some correlation structure on the data. In our data set, the first principal component was correlated to the library size and mRNA content, which may dominate over the effects of developmental transitions (Supplementary Fig. 2j, panel i). Normalization through division by total library size or harmonic mean estimator does not resolve this issue, as lowly expressed genes are still unobserved (zero) in cells of low mRNA content (Supplementary Fig. 2i, j panel ii). We and others have therefore used hierarchical models that assume that molecule counts are created by sampling from the true amount of mRNA molecules with cell-specific sampling efficiencies50,51. To adapt these approaches to the case where no molecular barcodes were used, we here use the error model of ref. 49, which describes the posterior probability of a gene expression level x in a cell c as
$$p(x \mid r_c, \Omega_c) = p_d(x)\, p_{\mathrm{Poisson}}(x) + \bigl(1 - p_d(x)\bigr)\, p_{\mathrm{NB}}(x \mid r_c)$$
where pd is the probability of a dropout event at gene expression x, pNB is the probability of observing rc reads in the case of no dropout and pPoisson(x) is the probability of observing rc spurious reads in the case of a dropout. Ωc is a vector of cell-specific and numerically optimized parameters: the slope and intercept of pd as a function of rc; the slope and intercept of x as a function of rc; the dispersion of the negative binomial distribution pNB(x|rc); and the background frequency λ of the Poisson distribution, which was fixed to 0.1.
The maximum posterior average expression across all cells is then given by
$$\bar{x} = \arg\max_x \prod_c p(x \mid r_c, \Omega_c)$$
While the mean of ∏c p(x|rc, Ωc) describes the expression magnitude of a gene in a given cell, its spread describes the uncertainty due to technical noise. To obtain a single number that weighs expression magnitude by confidence level, we compute a posterior odds ratio (POR). The POR can be interpreted as the evidence (in bits) that a specific gene in a specific cell is expressed more highly (or lowly) than in the average cell. The use of POR scores in principal component analysis solved the problems associated with the above-mentioned normalization strategies (Supplementary Fig. 2j panel iii). POR scores were used as the measure of gene expression for all analyses.
For hierarchical clustering, we selected the 1,000 most variable genes of each population. We then used Ward linkage on Euclidean distances. The gap statistic was computed on the same hierarchical clustering function using the R package cluster. Random walk analysis52 was performed by constructing a 5-nearest-neighbour graph on correlation distances, initializing at a random node, and then simulating a series of random steps on the 5-connected graph. The local clustering coefficient of a node in such a graph quantifies the extent to which the neighbours of a cell are themselves connected to each other. It was computed using the transitivity function of the igraph package53.
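To make the two graph quantities concrete, the following sketch computes a local clustering coefficient and simulates a random walk on a small undirected graph. It is an illustration only: the study used the igraph package in R, whereas the functions `local_clustering` and `random_walk`, the adjacency-dictionary format and the toy graph here are assumptions introduced for this example. Returning NaN for nodes with fewer than two neighbours is a convention chosen for the sketch.

```python
import random
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves connected.

    adj: dict mapping node -> set of neighbouring nodes (undirected graph).
    """
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return float("nan")  # undefined for fewer than two neighbours
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return linked / (k * (k - 1) / 2)


def random_walk(adj, steps, rng):
    """Start at a random node and take `steps` uniformly random steps."""
    node = rng.choice(sorted(adj))
    path = [node]
    for _ in range(steps):
        node = rng.choice(sorted(adj[node]))
        path.append(node)
    return path


# Toy graph: a triangle A-B-C with an extra node D attached to C.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
coefficient_a = local_clustering(graph, "A")   # both neighbours are linked
coefficient_c = local_clustering(graph, "C")   # one of three pairs is linked
walk = random_walk(graph, steps=20, rng=random.Random(1))
```

In a real analysis the adjacency sets would come from a symmetrized 5-nearest-neighbour graph on correlation distances, so that every node has at least one neighbour to step to.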
Basic set-up. To identify processes associated with the transition of HSCs to progenitor cell types, we sought a lower-dimensional representation of the HSPC data that reflects lineage priming. We therefore trained an elastic-net regularized generalized linear model (GLMNET) of the multinomial family on the most mature populations (N1-3, EBM, MD, spB1/2, E1/2 and Mk from Fig. 2a for individual 1, or lpB, EBM, N, ME and MD for individual 2), using class membership as the response variable. During this step, a number of population-specific genes were identified (Supplementary Table 3). The classifier then used the expression of these genes in all cells to estimate the probability pij that a cell i belongs to class j. From these probabilities, we compute the Kullback–Leibler distance from the average HSPC, which can be interpreted as the amount of lineage information a given cell has acquired:
$$S_{\mathrm{rel},i} = \sum_j p_{ij} \log \frac{p_{ij}}{\bar{p}_j}$$
where $\bar{p}_j$ is the average probability of a cell to belong to class j. We further assign each cell a predominant direction of priming, d. For displaying the six-dimensional vector pi in two dimensions, the developmental endpoints are arranged on the edge of a circle and all cells are placed in between. Each endpoint k is assigned an angle αk. The class probabilities pik are then transformed to Cartesian coordinates by
$$x_i = \sum_k p_{ik} \cos \alpha_k \quad \text{and} \quad y_i = \sum_k p_{ik} \sin \alpha_k$$
To find the optimal arrangement of the developmental endpoints on the circle, lineages with common precursor stages are placed next to each other. A proximity is computed for each pair of lineages l and k; all arrangements are tested and the arrangement with the highest proximity is chosen. This approach is based on a method termed ‘circular a posteriori projection’51.
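As a rough illustration of the two quantities derived from the class probabilities, the sketch below computes a Kullback–Leibler distance from the average cell and a circular projection. Several details are assumptions, since they are not fixed by the text: the standard Kullback–Leibler formula with a base-2 logarithm, a projection that weights the unit vector of each endpoint by its class probability, and the function names `average_probabilities`, `kl_from_average` and `circular_projection`.

```python
import math


def average_probabilities(p):
    """Column means of a list of per-cell class-probability vectors."""
    n_cells = len(p)
    return [sum(row[j] for row in p) / n_cells for j in range(len(p[0]))]


def kl_from_average(p_i, p_bar):
    """Kullback-Leibler distance of one cell from the average cell.

    Assumption: base-2 logarithm; terms with p_ij == 0 contribute zero.
    """
    return sum(p * math.log2(p / q) for p, q in zip(p_i, p_bar) if p > 0)


def circular_projection(p_i, angles):
    """Place a cell inside the unit circle from its class probabilities.

    Assumption: x = sum_k p_ik cos(a_k), y = sum_k p_ik sin(a_k).
    """
    x = sum(p * math.cos(a) for p, a in zip(p_i, angles))
    y = sum(p * math.sin(a) for p, a in zip(p_i, angles))
    return x, y


# Toy example with four endpoints spaced evenly on the circle.
angles = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
cells = [
    [0.25, 0.25, 0.25, 0.25],   # unprimed cell
    [1.0, 0.0, 0.0, 0.0],       # cell fully committed to endpoint 0
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.25, 0.25, 0.5],
]
p_bar = average_probabilities(cells)
priming = [kl_from_average(row, p_bar) for row in cells]
coords = [circular_projection(row, angles) for row in cells]
```

Under these assumptions an unprimed cell lands near the centre of the circle and a fully committed cell lands on its endpoint, which matches the qualitative description in the text.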
Data simulation. To test the ability of the STEMNET method to uncover binary branching events and discrete subpopulations, we quantitatively specified alternative models of cell fate specification and reshuffled our original data according to these models (Supplementary Fig. 6). In particular, we assumed that each cell is located on a binary tree, where nodes represent branching points and edges between nodes represent developmental trajectories. Each node Vi is specified by a tuple (E1, E2, p1, p2, h) with E1,2 pointing to the left and right child, p1,2 giving the probability that a cell adopts the fate associated with the left and right child (p1 + p2 = 1), and h ∈ [0,1] giving the height of the node (for developmental endpoints, h = 1, and for the root, h = 0). A cell is then defined by the tuple (h,E), where E points to the next node downstream of the cell.
For the scenario depicted in Supplementary Fig. 6a, cells were generated by randomly drawing values h from a Beta distribution with parameters (2,3). E was assigned by moving down a distance of h from the root and randomly choosing a branch according to p1,2 at each node that was passed. For the scenario depicted in Supplementary Fig. 6d cells were then scattered around the nearest node assuming an average distance of 0.01. The developmental distance D(ci, Vj) between a cell ci and a node Vj is then computed by traversing through the tree and summing all distances h that are passed along the way. For example, the distance between two developmental endpoints that diverge at a node with h = 0.6 is 0.8. To generate synthetic data from these cell fate specification models, we extracted the coefficients of the STEMNET classifier (Supplementary Table 3), and for each developmental endpoint j compiled lists of genes with nonzero coefficient. Gene expression values for these genes were then reordered across cells i to follow the developmental distance D(ci, Vj) (that is, assuming that gene expression of lineage-specific genes was entirely determined by developmental distance, Supplementary Fig. 6a). Alternatively, gene expression values were randomly reshuffled such that the correlation between developmental distance from Vj and gene expression equals the empirically observed correlation between gene expression and pj from the STEMNET classifier (Supplementary Fig. 6b–d).
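The tree model and the developmental distance can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary-based tree representation, the toy tree itself and the function names `sample_cell`, `ancestors` and `developmental_distance` are hypothetical and are not taken from the STEMNET package.

```python
import random

# Toy tree (hypothetical): the root splits into an inner node n1 and endpoint C;
# n1 (height 0.6) splits into endpoints A and B.
TREE = {
    "root": {"h": 0.0, "parent": None, "children": ("n1", "C"), "p_left": 0.5},
    "n1": {"h": 0.6, "parent": "root", "children": ("A", "B"), "p_left": 0.5},
    "A": {"h": 1.0, "parent": "n1", "children": None, "p_left": None},
    "B": {"h": 1.0, "parent": "n1", "children": None, "p_left": None},
    "C": {"h": 1.0, "parent": "root", "children": None, "p_left": None},
}


def sample_cell(tree, root, rng):
    """Draw a height from Beta(2, 3) and walk down the tree to that height."""
    h = rng.betavariate(2, 3)
    node = root
    while tree[node]["h"] < h:
        left, right = tree[node]["children"]
        node = left if rng.random() < tree[node]["p_left"] else right
    return h, node  # the cell (h, E): E is the next node downstream


def ancestors(tree, node):
    """Return the chain from `node` up to the root, starting with `node`."""
    chain = []
    while node is not None:
        chain.append(node)
        node = tree[node]["parent"]
    return chain


def developmental_distance(tree, cell, target):
    """Sum of height differences along the tree path from a cell to a node."""
    h_cell, edge_end = cell
    target_chain = ancestors(tree, target)
    if edge_end in target_chain:
        # The target is the downstream node of the cell or lies below it.
        return tree[target]["h"] - h_cell
    for node in ancestors(tree, edge_end)[1:]:
        if node in target_chain:
            # Go up to the last common branching point, then down to the target.
            h_split = tree[node]["h"]
            return (h_cell - h_split) + (tree[target]["h"] - h_split)
    raise ValueError("cell and target are not in the same tree")


rng = random.Random(0)
cell = sample_cell(TREE, "root", rng)
endpoint_distance = developmental_distance(TREE, (1.0, "A"), "B")
```

The last line reproduces the worked example from the text: two endpoints that diverge at a node with h = 0.6 are separated by (1 − 0.6) + (1 − 0.6) = 0.8.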
Quantitative link between single-cell transcriptomics and single-cell culture. To quantitatively link single-cell transcriptomic properties (such as the amount or direction of priming) to single-cell functional properties, we made use of FACS markers used in both experiments. In particular, for each transcriptomic property, we constructed a regression model with logicle transformed flow cytometry markers as explanatory variables and the property as a response variable. To achieve greater robustness than in standard linear regression, we applied GLMNET models of the normal family for this task, and used tenfold cross-validation to determine the regularization parameter λ. The regression coefficients of these models are shown in Supplementary Fig. 7a together with the R2 these models achieve in tenfold cross-validation if applied to the single-cell transcriptomic data. We then applied these classifiers to logicle transformed flow cytometry data from the single-cell culture experiment to estimate the magnitude of single-cell transcriptomic properties in that experiment. To further improve the classifier, we also included rank-transformed mRNA expression levels of TFRC (CD71) and KEL in the training data, and rank-transformed flow cytometry data of CD71 and KEL in the single-cell culture experiment.
Identification of gene clusters associated with lineage priming. We then identified genes whose expression depends on Srel, d, or both, by separately fitting four different linear models to the expression data of each gene. The first model describes gene expression as a function of the predominant direction d, which is a categorical variable. It best fits genes that are up- or downregulated early during developmental progression in a certain direction and stay unchanged until the end. The second model describes gene expression as a function of a third-degree polynomial through log10Srel. It best fits genes that are up- or downregulated at a specific stage of developmental progression, independent of the developmental direction. The third model describes gene expression as a function of d, a third-degree polynomial through log10Srel and the interaction of d and log10Srel. It best fits genes that are up- or downregulated at a specific stage of development in a specific direction. The fourth model describes gene expression as a constant. It best fits genes that do not change systematically during acquisition of lineage fate. For each gene, we identified the optimal model by comparing the models’ Bayesian Information Criteria (BIC). For each class of genes (dependent on log10Srel, d or both) separately, we identified subgroups of genes that display similar dependencies on log10Srel and d by performing hierarchical clustering using correlation distance and complete linkage on the fitted values from the preferred model.
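The model-selection step can be illustrated with a reduced sketch that compares only two of the four models, the constant model and the direction-only model, because both can be fitted with simple means. The Gaussian form of the BIC used here (n ln(RSS/n) + k ln n, which omits terms that are identical for all models fitted to the same gene) and the function names are assumptions for this example. The polynomial and interaction models would require a least-squares fit that is omitted here.

```python
import math


def bic(rss, n, k):
    """Gaussian BIC up to an additive constant: n*ln(RSS/n) + k*ln(n)."""
    rss = max(rss, 1e-12)  # guard against log(0) for perfect fits
    return n * math.log(rss / n) + k * math.log(n)


def rss_constant(y):
    """Residual sum of squares when expression is modelled as a constant."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


def rss_by_direction(y, d):
    """Residual sum of squares with one mean per direction of priming."""
    groups = {}
    for value, direction in zip(y, d):
        groups.setdefault(direction, []).append(value)
    return sum(rss_constant(values) for values in groups.values())


def preferred_model(y, d):
    """Return the name of the model with the lowest BIC and all scores."""
    n = len(y)
    scores = {
        "constant": bic(rss_constant(y), n, 1),
        "direction": bic(rss_by_direction(y, d), n, len(set(d))),
    }
    return min(scores, key=scores.get), scores


# Toy gene whose expression differs clearly between two directions.
directions = ["a", "a", "a", "b", "b", "b"]
best_primed, _ = preferred_model([1.0, 1.1, 0.9, 5.0, 5.1, 4.9], directions)
# Toy gene with no dependence on direction.
best_flat, _ = preferred_model([1.0, 2.0, 3.0, 1.0, 2.0, 3.0], directions)
```

The extra parameters of the direction model are only rewarded when they reduce the residuals enough to offset the k ln n penalty, which is why the second toy gene falls back to the constant model.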
Statistics and reproducibility.
Single-cell RNA-seq was performed on two different individuals. Totals of 1,034 (for I1) and 379 cells (for I2) were included in the study. Single-cell culture was performed for 2,038 cells. As indicated in the figure legends, P values are computed from the Pearson product moment correlation test, kernel-density-based global two-sample comparison test or two-tailed unpaired t-test.
For animal experiments, no statistical method was used to predetermine sample size. The experiments were not randomized. The investigators were not blinded to animal allocation during experiments and outcome assessment.
Most analyses were performed in indeXplorer, a custom-made software for the analysis of single-cell index-sorting/transcriptomic data sets. indeXplorer was written in R and relies on the package shiny; code is available from https://git.embl.de/velten/indeXplorer.
For analyses that were not performed in indeXplorer directly, we provide an R package containing all code at https://git.embl.de/velten/STEMNET.
RNA-seq data that support the findings of this study have been deposited in the Gene Expression Omnibus (GEO) under accession code GSE75478. Processed data are available at http://steinmetzlab.embl.de/shiny/indexplorer/?launch=yes for browsing. All other data supporting the findings of this study are available from the corresponding author on reasonable request.
We thank C. Drumm for help with 3D graphics, K. Hexel, S. Schmitt, C. Felbinger and M. Eich from the DKFZ flow cytometry facility for flow cytometry support, the EMBL Genomics Core Facility for sequencing and R. Aiyar, A. Jones, M. Milsom and all members of HI-STEM and the Steinmetz group for helpful discussions on the manuscript as well as T. Schroeder and D. Löffler for initial discussions. This work was supported by the SFB873 funded by the Deutsche Forschungsgemeinschaft (DFG) (to C.L., M.A.G.E. and A.T.), the Dietmar Hopp Foundation (to M.A.G.E. and A.T.) and the US National Institutes of Health (P01 HG000205 to L.M.S.).