
# Identification of leukemic and pre-leukemic stem cells by clonal tracking from single-cell transcriptomics

## Abstract

Cancer stem cells drive disease progression and relapse in many types of cancer. Despite this, a thorough characterization of these cells remains elusive and with it the ability to eradicate cancer at its source. In acute myeloid leukemia (AML), leukemic stem cells (LSCs) underlie mortality but are difficult to isolate due to their low abundance and high similarity to healthy hematopoietic stem cells (HSCs). Here, we demonstrate that LSCs, HSCs, and pre-leukemic stem cells can be identified and molecularly profiled by combining single-cell transcriptomics with lineage tracing using both nuclear and mitochondrial somatic variants. While mutational status discriminates between healthy and cancerous cells, gene expression distinguishes stem cells and progenitor cell populations. Our approach enables the identification of LSC-specific gene expression programs and the characterization of differentiation blocks induced by leukemic mutations. Taken together, we demonstrate the power of single-cell multi-omic approaches in characterizing cancer stem cells.

## Introduction

Tissues with high cellular turnover, such as the hematopoietic system or the intestine, depend on “professional” adult stem cells for their continuous regeneration1. Oncogenic mutations in these cells can cause cancers that maintain a hierarchical organization reminiscent of the tissue of origin. Only cancer stem cells (CSCs), residing at the top of the hierarchy, are able to fuel long-term cancer growth and drive relapse, whereas the bulk of the cancer consists of rapidly dividing cells with limited capacity for self-renewal, i.e., cells that exhaust their replicative potential after a finite number of divisions2,3,4. Owing to their stem cell-like properties, CSCs constitute an important driver of relapse, but their low division rates make them difficult to target therapeutically. Tools that permit the confident identification and characterization of CSCs are therefore urgently needed.

Acute myeloid leukemia (AML) serves as a paradigm for the study of cancer stem cells5. In 10–20% of healthy individuals over age 70, the acquisition of pre-leukemic mutations in hematopoietic stem cells (HSCs) results in the dominance of a small number of HSC-derived clones, a process termed Clonal Hematopoiesis of Indeterminate Potential (CHIP)6,7. While such pre-leukemic stem cells (pre-LSCs) are capable of giving rise to healthy blood and immune cells, additional mutations can cause a complete block in differentiation and thereby result in the malignant expansion of aberrant progenitor cells8. The accumulation of these so-called “blast” cells is ultimately fueled by the presence of leukemic stem cells (LSCs). Classic chemotherapy regimens primarily target actively cycling “blast” cells and initially lead to remission. Since quiescent or protected LSCs often avoid eradication, relapse rates are high with 5-year survival rates below 15% for patients over the age of 60. A key goal is, therefore, to identify therapeutic strategies for targeting LSCs, while sparing healthy HSCs9,10,11. Characterizing gene expression differences between HSCs, pre-LSCs and LSCs would be a valuable step towards that goal.

Previously, LSC-specific gene expression patterns were characterized by isolating cells positive for stem cell-specific surface markers used in the healthy hematopoietic system, such as CD34 (refs.12,13,14,15). More recently, leukemic engraftment rates in xenotransplant models were correlated with gene expression16,17,18,19. However, these approaches all measure impure populations of cells. Here, we propose that by measuring mutational status and gene expression in single cells simultaneously, cancer stem cells can be uniquely distinguished from both mature cancer cells (based on gene expression) and healthy stem cells (based on mutational status) (Fig. 1a). Finally, pre-LSCs are thought to typically carry mutations associated with CHIP (e.g., in DNMT3A) but not mutations associated with leukemia (e.g., in NPM1)7,20,21, potentially enabling their identification by profiling both known leukemic and known preleukemic mutations.

While we and others have demonstrated the utility of single-cell genomics for mapping hematopoietic differentiation hierarchies22,23,24, tracking mutations or clones in single-cell gene expression data remains difficult. Previous work has amplified somatic mutations from complementary DNA (cDNA)25,26,27,28, extracted mutational information from single-cell RNA-seq reads29, or processed both genomic DNA and RNA from single cells30,31,32. However, these protocols suffer from a lack of confidence in assigning cells to clones and/or require prior knowledge of genomic mutation sites. As an alternative tool for clonal tracking, the use of endogenous mitochondrial mutations as clonal markers has been proposed, obviating the need for prior knowledge of genomic mutations33,34. However, the application of these methods to characterize LSCs has not been demonstrated, and in particular requires the ability to reliably detect clonal expansion events, associate clinically relevant coding mutations to clones with high confidence, and draw statements on gene expression changes between clones.

Here, we introduce MutaSeq, a workflow that amplifies nuclear mutations from cDNA, and mitoClone, a computational tool that achieves high-confidence clonal assignments and de novo discovery of clones using mitochondrial marker mutations when available. MutaSeq data from four AML patients allows us to distinguish HSCs, pre-LSCs, LSCs, and progenitor/blast populations. Thereby, we identify transcriptomic consequences of leukemic and pre-leukemic mutations relevant to stem cells. Additionally, we characterize the contribution of different leukemic and pre-leukemic clones to healthy and disease-specific bone marrow populations with unprecedented detail. Altogether, our results demonstrate cancer stem cell identification and characterization by simultaneous mapping of genomic and mitochondrial mutations in single-cell transcriptomes.

## Results

### MutaSeq provides high coverage of genomic and mitochondrial mutations

To establish a robust experimental setup for the clonal tracking of human cells in single-cell transcriptomic data, we evaluated various modifications of the Smart-seq2 protocol aimed at increasing coverage at polymorphic genomic sites of interest (Supplementary Fig. 1a). We found that inclusion of targeting primers during reverse transcription frequently resulted in the formation of undesired byproducts, especially when targeting a higher number of sites (Supplementary Fig. 1a–d). By contrast, when sites of interest were targeted during cDNA amplification, we obtained high-quality transcriptome data while increasing the average number of target sites captured per cell by 2–4-fold compared to a non-targeted approach (Fig. 1b, c and Supplementary Fig. 1a, b). An automated pipeline for primer design that minimizes off-target sites and potential primer-dimer formation is available at https://github.com/veltenlab/PrimerDesign (see also the Methods section). MutaSeq stably works with up to 30–40 primer pairs (targets); when higher numbers of primers are included, library quality progressively decreases (Supplementary Fig. 1e). In a test with only highly expressed target genes, target amplicons are created from all primer pairs in virtually all single cells (Supplementary Fig. 1f, g).

To evaluate MutaSeq, we performed deep exome sequencing of an AML patient (P1) and designed primers targeting 14 nuclear mutations (Supplementary Data 1 and 4). We then systematically compared the performance of MutaSeq and non-targeted Smart-seq2 on CD34+ cells from this patient. MutaSeq increased the number of target sites covered per single cell from a median of 1 to a median of 4 (Fig. 1c–e) and recapitulated the variant allele frequencies estimated by exome sequencing with higher accuracy than Smart-seq2 (Fig. 1f), while maintaining comparable transcriptome data quality (Supplementary Fig. 1h, i). Both methods underestimated the abundance of frameshift mutations, possibly as a consequence of nonsense mediated decay35 (Fig. 1e, f). While our data do not provide statistical evidence for an effect of target gene length or sequence complexity on dropout, we cannot exclude such effects.

Importantly, both methods provide an excellent coverage of the mitochondrial genome (mean ~100X mitochondrial coverage given a mean sequencing depth of ~788,000 total reads per cell, Fig. 1g), unlike most other single-cell RNA-seq protocols applied in the context of clonal tracking27,28,29,31 (Supplementary Fig. 1j). Altogether, these results demonstrate that MutaSeq efficiently covers the mitochondrial genome in single-cell RNA-sequencing experiments and provides improved coverage of genomic target sites compared to Smart-seq2. Importantly, it requires no changes to existing Smart-seq2 pipelines, except for the addition of targeting primers during cDNA amplification.

### Simultaneous mapping of mitochondrial and genomic mutations permits high-confidence tracking of leukemic, pre-leukemic, and healthy clones

To investigate if MutaSeq can distinguish leukemic, pre-leukemic and residual healthy clones, we generated data from four AML patients with heterogeneous genotypes and phenotypes (Figs. 2a, 3b and Supplementary Fig. 2). To allow for a better characterization of stem cells in each patient, cells were sorted such that putative stem and progenitor cells (CD34+) and putatively more mature cells (CD34−) were covered in approximately equal proportions (Supplementary Fig. 2). Of note, two of the patients (P2, P4) exhibited bone marrow consisting of >99% CD34− cells (Supplementary Fig. 2a). Owing to different capture rates across individuals and gates, the final data set consisted of between 618 and 1430 cells per patient, of which between 190 and 968 were CD34+ (Supplementary Fig. 2b, c). We therefore avoid statements on quantitative shifts in population size between patients throughout this manuscript.

We then called nuclear genomic mutations, as well as mitochondrial mutations, at the single-cell level in order to cluster cells into clonal hierarchies. Bulk exome sequencing of the patients had identified known pre-leukemic mutations present at high allele frequency and known leukemic mutations present at a somewhat lower allele frequency (Fig. 2a and Supplementary Data 1; Patient 1: mutations in SRSF2, TET2/CEBPA and SRSF2, TET2/KLF7; Patient 2: mutations in DNMT3A/NPM1; Patient 3: mutations in SRSF2/IDH2; Patient 4: leukemic Trisomy 8 and BRAF mutations). While some statements on clonal hierarchies could be drawn solely based on calls of these nuclear somatic mutations (Supplementary Fig. 3a), the relatively high dropout of these sites impeded robust assignments of cells to clones (Supplementary Fig. 3b–d). Moreover, the result was biased by the expression levels of the mutated genes of interest: in cells with low expression, dropout was higher, leading to a higher fraction of false-negative calls, i.e., false classifications of mutant cells as reference (Fig. 1e and Supplementary Fig. 3c, d). Similar issues were faced by other methods using related approaches of mutation amplification from cDNA27,28.

To overcome these limitations, we next determined if mitochondrial mutations can be used to refine clonal hierarchies jointly with the nuclear mutations. Since mitochondrial RNA is extensively edited, it is important to cluster cells based solely on mutations, and not on seemingly polymorphic sites that are the result of post-transcriptional events with no relationship to clonal structure. Here, we developed and tested a filtering strategy that only makes use of mitochondrial mutations for clonal tracking if they uniquely occur in individual patients (Supplementary Fig. 4a, see also Methods). This idea assumes that RNA editing events are typically shared between individuals, whereas somatic mutations are not. Using the whole-exome sequencing data, we validated that this approach, at the level of genomic sites, correctly distinguishes mutations and RNA editing events with a precision of 97% (Supplementary Fig. 4b, c). We further analyzed various control data sets with known associations between mitochondrial mutations and clones33 to validate that this approach enables the detection of relevant mitochondrial mutations without a need for a DNA-based reference, and further enables the unsupervised identification of clones (Supplementary Fig. 4d–f). We have implemented all the required filtering and blacklisting routines required for the identification of high-confidence somatic mitochondrial variants in the mitoClone R package (https://github.com/veltenlab/mitoClone).
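The filtering idea described above (keep a mitochondrial variant only if it is detected in a single patient) can be illustrated with a minimal Python sketch. This is not the mitoClone implementation, which is an R package: the function name, the input layout, and both thresholds (`min_cells`, `min_vaf`) are assumptions chosen for illustration, and the blacklisting step is omitted.

```python
from collections import defaultdict

def patient_specific_variants(calls, min_cells=10, min_vaf=0.05):
    """Return, per patient, the variants detected in that patient only.

    calls: dict patient -> dict variant -> list of (alt_reads, total_reads),
           one tuple per cell. Layout and thresholds are hypothetical.
    """
    detected_in = defaultdict(set)  # variant -> patients where it is detected
    for patient, variants in calls.items():
        for variant, cells in variants.items():
            # Count cells in which the variant exceeds the allele-frequency cutoff.
            n_positive = sum(
                1 for alt, total in cells if total > 0 and alt / total >= min_vaf
            )
            if n_positive >= min_cells:
                detected_in[variant].add(patient)

    unique = {patient: set() for patient in calls}
    for variant, patients in detected_in.items():
        # Sites shared between individuals are treated as likely RNA editing
        # (or artifacts) and dropped; patient-unique sites are kept.
        if len(patients) == 1:
            unique[next(iter(patients))].add(variant)
    return unique
```

In this sketch a site seen in two or more patients is discarded regardless of its allele frequency, which mirrors the stated assumption that RNA editing events are typically shared between individuals whereas somatic mutations are not.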

We then computed clonal hierarchies from both nuclear and mitochondrial somatic variants using a mathematical model that accounts for allelic dropout36 (see Methods for details; all required tools are contained in the mitoClone package). In patients P1 and P2, pre-leukemic as well as sub-clonal leukemic mutations were significantly associated with distinct sets of well-covered mitochondrial variants (Supplementary Fig. 5a, b), such that clonal hierarchies could be delineated, and a confident assignment of cells to clones became possible (Fig. 2b, c and see Supplementary Fig. 3b for a quantitative analysis). Unlike genomic mutation calling from cDNA, identification of clonal identities from mitochondrial mutations is largely unaffected (and in one case only weakly affected) by gene expression levels or library quality (Supplementary Figs. 3c, d, 5c), and is possible at lower sequencing depths (Supplementary Fig. 5d–f), since mitochondrial genes are consistently highly expressed.
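To make the notion of a dropout-aware assignment concrete, the following Python sketch scores a single cell against a set of candidate clone genotypes. It is a simplified illustration and not the model of ref. 36 or the mitoClone code: the error rates `fn_rate` (allelic dropout, i.e., a mutant site read as reference) and `fp_rate`, the uniform prior over clones, and the toy genotypes are all assumptions.

```python
import math

def clone_log_likelihood(observed, clone_genotype, fn_rate=0.3, fp_rate=0.01):
    """Log-likelihood of one cell's calls given a clone's true genotype.

    observed: per-site calls, 1 = mutant, 0 = reference, None = not covered.
    clone_genotype: per-site truth for the clone, 1 = mutant, 0 = reference.
    """
    ll = 0.0
    for obs, truth in zip(observed, clone_genotype):
        if obs is None:
            continue  # uncovered sites carry no information
        if truth == 1:
            p = 1 - fn_rate if obs == 1 else fn_rate  # dropout hides the mutant allele
        else:
            p = fp_rate if obs == 1 else 1 - fp_rate
        ll += math.log(p)
    return ll

def assign_clone(observed, clones, **rates):
    """Return (best_clone, posterior) under a uniform prior over clones."""
    lls = {name: clone_log_likelihood(observed, g, **rates) for name, g in clones.items()}
    top = max(lls.values())
    weights = {name: math.exp(v - top) for name, v in lls.items()}
    norm = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / norm

# Toy example with three sites (pre-leukemic, mitochondrial marker, leukemic).
clones = {"healthy": [0, 0, 0], "pre-leukemic": [1, 1, 0], "leukemic": [1, 1, 1]}
print(assign_clone([1, None, 0], clones))
```

Because a reference call at a truly mutant site is penalized only by the dropout rate, a single uncovered or dropped-out site lowers the posterior rather than forcing a wrong assignment; adding well-covered marker sites is what sharpens the posterior.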

Importantly, the clonal structure was validated by targeted DNA-sequencing from single-cell derived colonies (Patient 1, Fig. 2d). Across the patients, we identified clones carrying known pre-leukemic mutations (e.g., in SRSF2, DNMT3A) and sub-clones carrying known leukemic mutations (e.g., CEBPA, NPM1) (Fig. 2e, f). Below we functionally characterize these clones as leukemic or pre-leukemic based on their contribution to healthy blood production (Fig. 4).

In patient P3, allele frequencies of leukemic and pre-leukemic mutations were near 50%, indicating the presence of a single leukemic clone (Supplementary Data 1). In patient P4, an 18-year-old individual with a leukemia driven by triplication of chromosome 8, no mitochondrial markers were identified, even though exome sequencing suggested the presence of sub-clonal variants (Supplementary Data 1). In this case, MutaSeq still permits a qualitative analysis using genomic mutations alone (see below). Taken together, our approach allows for the identification of putatively leukemic, pre-leukemic, and healthy clones and can assign cells to clones with high confidence if mitochondrial somatic variation is present.

### Identification and characterization of clones de novo

We next investigated if the use of mitochondrial somatic variants enables the identification and characterization of clones without prior knowledge of nuclear mutations. To that end, we made use of a data set from patient P1 generated without amplification of nuclear sites (i.e., standard Smart-seq2). A clear clonal structure was identified in an unsupervised manner based solely on mitochondrial variants (Supplementary Fig. 6a). In order to examine whether the presence of somatic genetic variability is associated with the different clones, we then queried the mutational status of 13,797 genomic sites annotated as mutated in AML in the COSMIC database37 using a beta-binomial model (see Methods). This unsupervised analysis revealed a highly significant association of the SRSF2 P95H mutation with the leukemic and pre-leukemic sub-clones (Supplementary Fig. 6b, c). Their malignant nature was further evidenced by a markedly reduced ability to contribute to the T cell lineage (Supplementary Fig. 6d and see also below).

To further demonstrate our ability to identify clones de novo, we highlight a clonal expansion of non-leukemic cells in P2 (Fig. 2f). This clone would have been missed by approaches relying on genomic mutations alone27,28,31. Interestingly, these cells were not associated with the pre-leukemic DNMT3A mutation. By again querying sites from the COSMIC database37 using a beta-binomial model, we identified that they had uniquely acquired a mutation in the RPL3 gene (Supplementary Fig. 6b, e). These results suggest that this clonal expansion event is independent of the leukemia and associated with the acquisition of unrelated nuclear mutations.
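As a rough illustration of how a beta-binomial model can relate read counts at a candidate site to clone membership, the Python sketch below compares two fixed beta-binomial distributions for the cells of one clone. The exact model used in this study is described in the Methods and is not reproduced here: the "mutant" and "reference" shape parameters, the function names, and the use of a summed log-likelihood ratio are assumptions made for illustration only.

```python
import math

def log_betabinom_pmf(k, n, a, b):
    """Log probability of k alternative reads out of n under BetaBinomial(n, a, b)."""
    def log_beta(x, y):
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)

def mutant_log_odds(cells, mutant=(1.0, 1.0), reference=(0.1, 10.0)):
    """Summed log-likelihood ratio (mutant vs. reference) over the cells of one clone.

    cells: iterable of (alt_reads, total_reads) at one genomic site.
    mutant / reference: hypothetical Beta shape parameters; the broad mutant
    prior allows for variable allelic expression, the reference prior keeps
    the expected alternative-read fraction near the error rate.
    """
    return sum(
        log_betabinom_pmf(k, n, *mutant) - log_betabinom_pmf(k, n, *reference)
        for k, n in cells
        if n > 0  # cells without coverage are uninformative
    )
```

A site would then be a candidate clone marker if the score is clearly positive within one clone and negative in the others; a real analysis over thousands of COSMIC sites would additionally need parameters fitted to the data and a correction for multiple testing.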

We also take note of a putative non-leukemic clone in P1 marked by a single mitochondrial variant (5492T>C). With one exception, all cells carrying this variant are positive for the T-cell marker CD3 (Figs. 2b, 4a). Hence, this variant was likely acquired in a T-cell precursor or T-cell clone, although we cannot formally exclude that it corresponds to a T-cell-specific RNA editing event.

Taken together, these results demonstrate that our approach can identify and characterize clones de novo without prior knowledge of nuclear genomic mutations. The mitoClone package implements all routines for clonal clustering and mutation calling.

### A map of cell states in leukemic bone marrow identifies HSC-like cells

We next used single-cell transcriptome data to distinguish stem cells, progenitor cells and leukemic blasts. To define cell populations in our samples, we integrated the gene expression data from all four patients along with data from CD34+ bone marrow cells of a healthy individual22 into a two-dimensional representation, and characterized cell types based on marker gene expression (Fig. 3a, Supplementary Fig. 7a–c, f, Supplementary Data 2).

All four bone marrow samples contained cells that clustered with HSCs and multipotent progenitor cells (HSC/MPP) from the healthy individual (Fig. 3b–d) and displayed a CD34+CD38low, FSC/SSClow phenotype (Fig. 3e). Unsupervised analysis separated these cells into quiescent immature HSC-like cells (HLF+), more proliferative erythromyeloid primed progenitors (GATA2+), and lymphomyeloid primed progenitors (FLT3+) (Fig. 3f and Supplementary Fig. 7f). Cells resembling healthy erythroid progenitors and MEPs were also identified, alongside various types of B-cell precursors and T/NK cells (Supplementary Fig. 7b, c).

In contrast to non-leukemic populations, the majority of cells from leukemic bone marrow samples (“blasts”) differed markedly between the patients, in line with previous observations27 (Fig. 3d). We observed (a) differentiated blasts (CD34−SSC/FSChi) that expressed neutrophil genes such as calprotectin (S100A8, S100A9) and AZU1 (Fig. 3e and Supplementary Fig. 7a); (b) CD34+ blasts that were highly mitotic and expressed the fetal hemoglobin HBZ; and (c) CD34+CD38low blasts expressing markers typical of hematopoietic stem and progenitor cells (HSPCs), such as PROM1 (CD133) and MEIS1 (Supplementary Fig. 7a, d, e). The latter population appeared to be connected to the HSC/MPP population across a continuum of states that gradually upregulate AP1 transcription factor expression (FOS, JUN, FOSB, JUNB, JUND) while down-regulating MHC class II (Fig. 3g, h and Supplementary Data 3). A global analysis of highly expressed genes across all populations revealed that high expression levels of AP1, as well as several Krüppel-like factors and poly-A binding proteins, were common to all different blast populations from the patients and distinguished them from healthy progenitors (Fig. 3h, i). Of note, high AP1 and KLF activity have recently been identified as a hallmark conserved across genetically distinct types of AML by bulk-sequencing studies that only investigated blast populations38. Our results indicate that these transcription factors indeed appear to be relevant in all blast populations, despite their very different phenotypes.

Taken together, all four leukemia samples could be stratified into stem cells, progenitors, and blasts. Furthermore, all patients retain cells highly similar to healthy HSCs, although in variable abundance.

### Clonal tracking identifies cellular differentiation states and gene expression patterns associated with pre-leukemic and leukemic mutations

Gene expression information alone was insufficient to distinguish cancerous from non-cancerous HSPCs (Supplementary Fig. 7f). To definitively characterize cells as (pre-)LSCs or residual healthy cells, we therefore integrated single-cell gene expression data and clonal tracking results (Fig. 4a–d). If mitochondrial somatic variability was present, we were able to assign clonal identities with high confidence, allowing us to draw quantitative statements (Patients P1 and P2). In the absence of mitochondrial somatic variability, we used nuclear mutation calls in SRSF2, IDH2 and NUP188 for purely qualitative statements (Patients P3 and P4). The capture rates of these marker sites ranged from 70% (SRSF2) to 11% (NUP188) (Supplementary Fig. 8a).

Clones associated with leukemic mutations were most prevalent in the blast compartments and were also detected in the HSPC compartment, but were almost absent in lymphoid (B, NK, T) lineages. By contrast, clones associated with pre-leukemic mutations were found in all lineages, but mostly displayed a decreased prevalence in lymphoid lineages (Fig. 4e and Supplementary Fig. 8b–d). These observations confirm the designation of these clones as “leukemic” and “pre-leukemic”. Furthermore, these results highlight that the leukemic mutations may initiate differentiation blocks at various levels, as previously reported4,27,39. For example, in patients P1 and P3, leukemic cells had retained the ability to contribute to the erythroid lineage, while in patient P2, this activity was restricted to the pre-leukemic and non-leukemic clones. Importantly, the leukemic cells in P1, P2, and P3 were observed in a cell state that is highly reminiscent of healthy HSCs/MPPs on a molecular level, and retains the ability to contribute to various lineages, i.e., is functionally multipotent.
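The kind of tabulation behind such statements, i.e., the share of each clone within each cell type, can be sketched in a few lines of Python. The function name and the toy input are hypothetical and the numbers are not taken from the study; the sketch assumes that every cell has already been assigned both a cell type and a clone.

```python
from collections import Counter, defaultdict

def clone_fractions(cells):
    """Fraction of each clone within each cell type.

    cells: iterable of (cell_type, clone) pairs, one per cell.
    Returns dict cell_type -> dict clone -> fraction of that cell type.
    """
    counts = defaultdict(Counter)
    for cell_type, clone in cells:
        counts[cell_type][clone] += 1

    fractions = {}
    for cell_type, per_clone in counts.items():
        total = sum(per_clone.values())
        fractions[cell_type] = {clone: n / total for clone, n in per_clone.items()}
    return fractions

# Made-up toy input, for illustration only.
toy = [
    ("HSC/MPP", "leukemic"), ("HSC/MPP", "healthy"),
    ("T cell", "healthy"), ("T cell", "healthy"),
    ("Blast", "leukemic"),
]
print(clone_fractions(toy))
```

Fractions are normalized within each cell type rather than across the whole sample, so they remain interpretable even when the sorting strategy changes the relative abundance of cell types.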

Next, we investigated the molecular effects of distinct mutations in detail. Previously, the consequences of specific leukemic mutations were commonly studied in mouse models. MutaSeq data allows us to compare gene expression between clones differing only in a single mutation, thereby elucidating the specific effects of that mutation on human hematopoiesis.

Mutations in the de novo DNA methyl transferase DNMT3A are the most common cause of benign clonal expansions of HSCs in individuals of advanced age6,7. In patient P2, DNMT3A-mutated pre-leukemic HSPCs were rather stem-like (with relatively high expression of HLF) or primed into the erythromyeloid direction (with relatively high expression of GATA2) (Figs. 4b and 5a). These results are in line with recent findings from DNMT3A knock-out mouse models40. Interestingly, independent of cell state, this coincided with an upregulation of MLLT3, a gene whose enforced expression promotes erythroid-megakaryocytic output from HSCs41 (Fig. 5b, c and Supplementary Data 3).

Mutations in the multifunctional ribonucleoprotein NPM1 are identified as drivers for acute myeloid leukemia in 30% of patients and frequently co-occur with pre-leukemic DNMT3A mutations21. In patient P2, the erythromyeloid bias of the DNMT3A clone was lost upon acquiring the leukemic NPM1 mutation. NPM1 mutated cells upregulated HOXB3, again in line with data from mouse models42,43,44 (Fig. 5b, c). Independent of cell state, these cells further exhibited upregulation of CD96 RNA expression, which has previously been identified as a leukemia stem cell-specific marker12. CD96 was also highly expressed on leukemic HSC/MPP-like cells of patient P3, but not in patient P1, further illustrating the patient-specific nature of LSC markers (Supplementary Fig. 8e).

Mutations in KLF7 are not commonly observed in leukemia; however, Krüppel-like factors are highly expressed by genetically and phenotypically different blasts38 (see above). In patient P1, the KLF7-mutated clone displayed a higher proportion of cells in G2/M phase (Fig. 5d). Based on a reanalysis of ATAC-seq data from human CD34+ cells45, we found that enhancers containing KLF7-binding sites were enriched near tumor suppressor genes46, including CDK6, PTEN, RUNX1, and FLT3, compared to active enhancers not containing KLF7-binding sites (p = 0.002, Supplementary Fig. 8f).

In sum, we have used a small heterogeneous patient cohort to demonstrate, as a proof-of-concept, that the acquisition of specific mutations is frequently linked to an altered gene expression program, which is consistent with data obtained from mouse models40,42,43,44.

Finally, we compared gene expression between all (pre-)leukemic and non-leukemic cells, with the goal of identifying potential markers or drug targets present in all (pre-)leukemic cells, but not in residual healthy cells. In patient P2, our analysis identified a small number of hits (Fig. 5e and Supplementary Data 3), most notably FOS, which was expressed in all (pre-)leukemic cells across cell types, but not by non-leukemic HSC/MPPs (Fig. 5f), and the MHC class I genes HLA-L and HLA-H, which were expressed by all healthy cells, but not by (pre-)leukemic cells. Across all patients, FOS was consistently overexpressed in cells carrying (pre-)leukemic lesions, both in HSCs/MPPs, and in other cell types (Fig. 5g).

Taken together, these results demonstrate the ability of MutaSeq and mitoClone to delineate developmental and molecular effects of clonal evolution caused by leukemic and pre-leukemic mutations.

## Discussion

Herein, we have described a joint single-cell transcriptomics and clonal tracking approach (MutaSeq and mitoClone) for characterizing LSCs, charting their differentiation capabilities, and mapping the molecular consequences of oncogenic mutations. While single-cell gene expression profiling permits the identification of cells with a stem-cell signature, clonal tracking using genomic and mitochondrial mutations allows for a clean separation between healthy and cancerous clones. Thereby we distinguish LSCs, HSCs, pre-LSCs, healthy progenitors, and blasts. We have demonstrated this approach in the context of acute myeloid leukemia, and we propose that similar approaches may be applied to other types of cancers.

By applying our approach to bone marrow samples from four AML patients, we have demonstrated its capabilities:

Charting the differentiation capacities of hematopoietic clones. By quantifying the contribution of (pre-)leukemic cells to blood lineages, we have shown that in patients P1–P3, leukemic clones not only form blasts but also exist in an HSC-like state. Pre-leukemic clones additionally contribute to erythroid and lymphoid lineages. Multipotent HSCs are, therefore, the likely cell of origin in these patients4. In patient P4, with one exception, only cells with a mature phenotype displayed leukemic mutations, illustrating that the disease can also be fueled by cells with a committed phenotype39. Alternatively, the LSC population in this patient might be rare among CD34+ cells.

Identification of clones de novo. In the presence of mitochondrial somatic variability, our approach does not rely on previously known nuclear mutations to detect clones. In the example of patient 2, we have thereby identified an expanded clone present at very low frequency in total bone marrow, but highly abundant in the CD34+ fraction. We have demonstrated that this clone is associated with a nuclear mutation in the RPL3 gene, which we discovered de novo. This result illustrates the low clonal complexity of hematopoiesis in individuals of advanced age, which is often not associated with candidate driver mutations20.

Identification and characterization of LSCs. Importantly, our approach was capable of identifying leukemic cells highly reminiscent of healthy HSCs. These cells will be important to target therapeutically without ablating their healthy counterparts9,10,11. While stemming from a study cohort of limited size, our data suggest that FOS might constitute a potential target for further investigation, as it is expressed throughout the disparate leukemic populations including the most HSC-like cells, pre-LSCs, and blasts, but not in healthy HSCs. Further, we have confirmed that CD96 is a specific marker for LSCs in some patients12. Studies in larger cohorts will be required to assess how generally applicable these findings are.

Taken together, our study expands upon earlier work on the molecular phenotypes of AML blasts27,38 and constitutes the first detailed characterization of LSCs by single-cell transcriptomics. This advance is owed to three crucial aspects of experimental design.

Enrichment of relevant starting populations. LSCs are exceedingly rare and generally present at below 0.1% of total bone marrow16. In order to characterize these cells by single-cell genomics, a prior enrichment, e.g., by sorting for CD34 expression, is essential. In future studies, this approach can be complemented with markers that also label CD34− LSCs, such as GPR56 (ref. 18).

Deep transcriptome sequencing. In some cases, the bulk of leukemic cells displays gene expression signatures highly similar to stem cells, as observed here for the CD34+ blasts of patient P1. The differences between LSCs and residual healthy HSCs are even more subtle. Previous work using shallow, microwell-based sequencing of AML cells27 has not identified differences between LSCs, CD34+ blasts and residual healthy HSCs.

High-confidence clonal tracking. When available, the use of mitochondrial variants enables the confident assignment of cells to clones, and thereby, a quantitative analysis of gene expression. In two out of four AML patients, we identified high-confidence, specific mitochondrial genetic markers for pre-leukemic and leukemic sub-clones. In the third patient, allele frequencies of leukemic and pre-leukemic mutations were around 50%, indicating dominance of a single leukemic clone. In the final patient, sub-clonal genomic mutations were observed, but not associated with mitochondrial variability. Interestingly, this patient was only 18 years old and exhibited a leukemia possibly driven by a “catastrophic” triplication of chromosome 8. The length of the pre-leukemic phase, the buildup of mitochondrial variants accumulated during normal ageing, as well as unknown factors affecting mitochondrial mutation rates might all contribute to the presence of mitochondrial marker mutations.

In the context of recent methods for clonal tracking within single-cell transcriptomics27,28,29,31,33, the use of the MutaSeq protocol and mitoClone computational pipeline combines the strengths of previous approaches relying on either nuclear or mitochondrial variants, but also has limitations. Specifically, droplet- or microwell-based protocols for single-cell RNA-seq and clonal tracking27,28,29 provide low coverage of mitochondrial genomes and high dropout of nuclear genomic sites, and therefore do not enable a confident assignment of cells to clones. Single-cell ATAC-sequencing33,34,47 and Smart-seq2 efficiently cover mitochondrial genomes, but their coverage of nuclear genomic sites is absent or low, which hampers interpretation of the data. Finally, while TARGET-seq addresses the limitation of dropout of nuclear sites, depending on the implementation it either does not offer sufficient mitochondrial coverage or is of very limited throughput. Of all methods available to date, the approach presented here converges on crucial aspects, specifically: (a) high mitochondrial coverage, allowing us to identify benign expanded clones de novo, even in the absence of known genetic markers, (b) decreased dropout of relevant genomic mutations, permitting the association of clones with genomic mutations, if present, and (c) a highly confident assignment of cells to clones, enabling quantitative analyses of clone-specific gene expression (Supplementary Fig. 9). These capabilities expand the potential applications of our approach to the study of clonal dynamics during ageing and oncogenesis beyond the hematopoietic system.

### Limitations

The major limitation of our pipeline is that it requires natural somatic variability to resolve clones at high confidence. In the absence of mitochondrial somatic variation, the MutaSeq protocol can be used to make qualitative statements on clonal differentiation capacities (similar to ref. 28, and with an improvement over Smart-seq2), but due to dropout it enables neither the statistically confident assignment of cells to clones nor differential expression testing between clones. In the presence of mitochondrial somatic variation, an association between clones and nuclear mutations is only possible for mutations in highly expressed genes, and can be limited by the nature of the mutation (e.g., frameshift mutations) and possibly other factors such as sequence complexity. The third limitation of MutaSeq is its relatively low throughput. This limitation is currently shared with Smart-seq2 and alternative single-cell RNA-seq methods allowing high-confidence assignment of cells to clones31,33. Future work will focus on the inclusion of full-length coverage of the mitochondrial genome in droplet-based single-cell RNA-seq platforms.

## Methods

### Patient and sample collection

The AML samples were collected from diagnostic bone marrow aspirations at the University hospitals in Heidelberg, Germany, and Mannheim, Germany after obtaining informed written consent. Bone marrow mononuclear cells were isolated by density gradient centrifugation and stored in liquid nitrogen until further use. All experiments involving human samples were conducted in compliance with the Declaration of Helsinki and all relevant ethical regulations and were approved by the ethics committees of the medical faculties Heidelberg and Heidelberg-Mannheim of the University of Heidelberg, the Bioethics Internal Advisory Committee (BIAC) at EMBL and the CRG bioethics committee (CEIC-Parc de Salut Mar).

### Deep exome sequencing and target selection

For exome sequencing, DNA was extracted from 9 × 10³ flow-sorted CD34+ cells (for CD34+ leukemias: P1, P3) or total bone marrow. As healthy controls, we used a buccal swab (P1), FACS-sorted CD45−CD105+ MSCs (P3) or in vitro expanded MSCs (P2 and P4)48. Sequencing libraries were constructed using the SureSelect HS XT Target Enrichment System v6 (Agilent), and a mean on-exon sequencing coverage greater than 70× was obtained for each patient. Genomic alignments were performed using BWA MEM v0.7.15 (ref. 49) and cancer variants were identified using Mutect2 v3.8 (P1 and P3) and v4.0.9 (P2 and P4)50, following the GATK best practice recommendations. Variants were annotated using ANNOVAR51. Output from Mutect2 was filtered to remove variants that did not overlap with known genes. The final list of candidate variants included only those with allele frequencies (AF) >4% in the cancer exome sample and with an AF fourfold larger than in the healthy exome sample (Supplementary Data 1). Finally, the candidates for targeting were hand-selected from this list with a focus on cancer-relevant genes, highly expressed genes, and potential sub-clonal markers.
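The allele-frequency filter described above can be illustrated with a minimal Python sketch. The `Variant` record, the function name, and the example values are hypothetical and are not taken from the original pipeline; treating "fourfold larger" as "at least four times the control AF" is also an assumption.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str          # hypothetical identifier, for illustration only
    af_tumor: float    # allele frequency in the cancer exome sample
    af_normal: float   # allele frequency in the healthy control sample

def is_candidate(v, min_af=0.04, min_fold=4.0):
    """AF > 4% in the cancer sample and at least fourfold the control AF."""
    return v.af_tumor > min_af and v.af_tumor >= min_fold * v.af_normal

variants = [
    Variant("var_a", af_tumor=0.45, af_normal=0.00),  # passes both criteria
    Variant("var_b", af_tumor=0.03, af_normal=0.00),  # fails: AF not above 4%
    Variant("var_c", af_tumor=0.30, af_normal=0.10),  # fails: only threefold
]
candidates = [v.name for v in variants if is_candidate(v)]
```

A variant that is absent from the control (AF of 0) passes the fold-change criterion automatically under this reading.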

### FACS sorting

Bone marrow mononuclear cells were stained according to standard protocols. In brief, cells were thawed, washed once in medium (IMDM, 10% FCS, 20 U/mL DNAse I) and 3 million cells were resuspended in 100 µL medium containing antibodies diluted as described in Supplementary Table 1. Following incubation for 30 min on ice, cells were resuspended in phosphate-buffered saline with 2% FCS. For single-cell liquid cultures and MutaSeq, cells were stained with fluorescently labeled antibodies against lineage markers (CD4, CD8, CD19, CD20, CD41a, CD235a) and additional markers (CD45RA, CD135, GPR56, CD34, CD38, CD90, CD33, Tim3), and sorted according to the gating scheme illustrated in Supplementary Fig. 2. A BD FACS Fusion (BD Biosciences) equipped with 405, 488, 561, and 640 nm lasers was used. Of note, in P1 and P4, Lin+ cells could not be efficiently processed into libraries for unknown reasons and are, therefore, excluded. A list of all antibodies used can be found in Supplementary Table 1.

### Cell culture

K562 cells were purchased from ATCC (catalog number CCL-243) and cultivated in RPMI-1640 (Thermo 21875034) supplemented with 10% FBS and P/S.

### Primer design

Primers for MutaSeq, for other single-cell targeting protocols tested (Supplementary Fig. 1), as well as for targeted DNA sequencing were designed using the computational pipeline available at http://git.embl.de/velten/PrimerDesign. For MutaSeq, the refgene transcripts spanning each genomic site of interest were selected as template; if multiple refgene transcripts were found for one site, a consensus transcript containing only exonic sequences present in all variants was created. We then used primer3 (ref. 52) to design five possible pairs of primers for each intended target, with an amplicon length of 90–145 bp and a melting temperature of (nominally) 60 °C. BLAST was used to remove primer pairs that potentially form off-target amplicons. Then, the pair complementarity (i.e., potential to form dimers) was computed for each possible combination of primers across all target sites (forward-reverse, forward-forward and reverse-reverse). In order to identify a set of primers that covered the maximal number of genes while strictly forbidding primers with high complementarity scores, a graph was constructed that connected all primers with different targets to each other if their complementarity score was lower than 15. A maximum clique-finding algorithm53 was then used to identify the largest mutually connected component in the graph. Thereby, the largest set of targets that efficiently avoids dimer formation was selected.
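The graph-based selection can be sketched in a few lines of Python. This is a simplified illustration, not the published pipeline: each node stands for one candidate primer (pair) of a target, the complementarity scores are made-up inputs, and the clique search uses a plain Bron–Kerbosch recursion with pivoting rather than the algorithm of ref. 53. All function and variable names are hypothetical.

```python
from itertools import combinations

def build_graph(primers, score, max_score=15):
    """primers: name -> target; score: frozenset({a, b}) -> complementarity score."""
    adj = {p: set() for p in primers}
    for a, b in combinations(primers, 2):
        # connect primers of different targets whose score is below the cutoff
        if primers[a] != primers[b] and score[frozenset((a, b))] < max_score:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def max_clique(adj):
    """Largest set of mutually connected nodes (Bron-Kerbosch with pivoting)."""
    best = set()

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = set(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best

# Toy input: two candidates for target A, one each for targets B and C.
primers = {"A1": "A", "A2": "A", "B1": "B", "C1": "C"}
score = {
    frozenset(("A1", "A2")): 0, frozenset(("A1", "B1")): 20,
    frozenset(("A1", "C1")): 5, frozenset(("A2", "B1")): 3,
    frozenset(("A2", "C1")): 4, frozenset(("B1", "C1")): 6,
}
selected = max_clique(build_graph(primers, score))
```

Because candidates for the same target are never connected, a clique contains at most one candidate per target, so the largest clique corresponds to the largest set of targets that can be amplified together.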

Nextera adapters were added to all primers designed accordingly (fwd: GTCGTCGGCAGCGTCAGATGTGTATAAGAGACAG, rev: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG).

For targeted DNA-seq experiments, the genomic sequence surrounding the target was used as template and nested PCR primers were designed. Inner primers were designed as in the case of MutaSeq, and outer primers surrounding the inner PCR product with an amplicon length of 200–350 bp and a nominal annealing temperature of 58 °C were added. A list of all primers used for this study is included in Supplementary Data 4.

### Single-cell RNA sequencing with targeting of genomic sites of interest (MutaSeq)

MutaSeq is based on the Smart-seq2 protocol54,55 with the modifications introduced by ref. 22. For lysis, we used 5 µL of a buffer containing 0.1 µL RNAsin+ (Promega), 0.04 µL 10% Triton X-100 (Sigma-Aldrich), 0.1 µL of 100 µM Smart-seq2 Oligo-dT primer (Sigma-Aldrich) and 1 µL dNTP mix (10 mM each, NEB). In P1, we additionally included 0.075 µL of a 1:1,000,000 dilution of ERCC spike-in mix 1 (Ambion), as well as a control spike-in to quantify the false-positive detection rate of mutations (Supplementary Fig. 10). Plates were snap frozen directly after sorting and later thawed at 10 °C in a PCR machine for 5 min and denatured at 72 °C for 3 min. Then, 5 µL of a buffer containing 0.25 µL RNAsin+, 2 µL 5x SMART FS buffer, 0.5 µL DTT 20 mM, 1 µL SmartScribe enzyme (all TaKaRa) and 0.2 µL 50 µM Smart-seq2 TSO (Exiqon) were added and RT was performed for 90 min at 42 °C, followed by 10 cycles of [50 °C, 2 min and 42 °C, 2 min], and enzyme inactivation at 70 °C for 15 min. Then, we added 15 µL PCR mix containing 12.5 µL KAPA HiFi HS mastermix (Merck), 0.25 µL 10 µM Smart-seq2 ISPCR primer (Sigma-Aldrich) and 0.5 µL of a pool of all targeting primers, present at 1 µM each. cDNA amplification was performed by 98 °C 3 min, 21 cycles of [98 °C, 20 sec, 67 °C, 60 sec, 72 °C 6 min], and 72 °C, 5 min. cDNA was then cleaned up using an equal volume (25 µL) of CleanPCR beads (CleanNA) and tagmented using homemade Tn5 (ref. 56). In brief, cDNA was diluted to ~150 pg/µL, Tn5 was diluted 1:10–1:100 and combined with a buffer (20 mM Tris-HCl pH 7.5, 20 mM MgCl2, 50% DMF) and diluted DNA in a 1:2:1 volume ratio at a total volume of 5 µL. The mix was incubated at 55 °C for 3 min and afterwards shifted to ice and inactivated by the addition of 1.25 µL 0.2% SDS. PCR was performed by adding 6.75 µL of KAPA HiFi HS mastermix, 0.75 µL of DMSO and 1.25 µL each of the forward i5 and reverse i7 library primer (Supplementary Data 4) at 10 µM. The PCR program was 72 °C 3 min, 95 °C 30 sec, 12 cycles of [98 °C, 20 sec, 58 °C, 15 sec, 72 °C 30 sec], and 72 °C, 3 min.

### Single-cell cultures

Bone marrow mononuclear cells from patient P1 were stained and Lin− or Lin−CD34+ single cells were index-sorted into ultra-low attachment 96-well plates (Corning) containing 100 µL StemSpan SFEM media (Stem Cell Technologies). Media was supplemented with penicillin/streptomycin (100 ng/mL), l-glutamine (100 ng/mL) and the following human cytokines (all from Peprotech): SCF (20 ng/mL), Flt3-L (20 ng/mL), TPO (50 ng/mL), IL-3 (20 ng/mL), IL-6 (20 ng/mL), G-CSF (20 ng/mL), EPO (40 ng/mL), IL-5 (20 ng/mL), M-CSF (20 ng/mL), GM-CSF (50 ng/mL). After 21 days at 5% CO2 and 37 °C, colonies were imaged by microscopy and processed as detailed in the following section.

### Targeted DNA sequencing by nested PCR amplification

Single-cell derived colonies were transferred into 50 µL buffer RLT (Qiagen). Cleanup was performed using CleanPCR beads (CleanNA) at a 1.8x volume ratio and eluted in 20 µL 10 mM Tris-HCl pH 7.8. 4.5 µL were transferred to a PCR plate containing 7.5 µL Kapa HiFi HS mastermix and 3 µL of a pool of all outer primers (Supplementary Data 4, each primer at 0.5 µM) were added, followed by a PCR program of 98 °C 3 min, 30 cycles of [98 °C, 20 sec, 63 °C, 60 sec, 72 °C 10 sec] and 72 °C, 5 min and subsequent enzymatic cleanup with 2.5 µL 10x ExoI buffer, 0.4 µL ExoI (NEB) and 0.4 µL FastAP (ThermoFisher), 30 min incubation at 37 °C and 5 min inactivation at 95 °C. Afterwards, 1 µL was transferred to a PCR tube containing 5.9 µL water, 7.5 µL Kapa HiFi HS mastermix and 0.6 µL of a pool of all inner primers (Supplementary Data 4, each primer at 0.5 µM), followed by a PCR program of 98 °C 3 min, 15 cycles of [98 °C, 20 sec, 65 °C, 15 sec, 72 °C 30 sec], 72 °C, 5 min and enzymatic cleanup as above. One microliter was then transferred into a PCR with Nextera indexing primers (Supplementary Data 4) and amplified with 98 °C 3 min, 10 cycles of [98 °C, 20 sec, 60 °C, 15 sec, 72 °C 30 sec] and 72 °C, 5 min.

### Processing of next generation sequencing data

Raw sequencing reads from MutaSeq and Smart-seq2 experiments were processed using the BBDuk software to trim both the standard Illumina Nextera adapters and the ISPCR adapter. Reads were then mapped to the hg38 human genome (Ensembl release 95) using STAR v2.6 (ref. 57), with the outFilterMismatchNmax parameter set to 5. Exonic gene counts were tabulated, keeping only reads that did not overlap with targeted regions, overlapped with only one annotated gene, and with lengths greater than 30 nt. For the colony DNA sequencing experiment, reads were mapped to the hg38 human genome using bwa mem (v0.7.17)49.

### Analysis of mitochondrial mutations and reconstruction of clonal hierarchies (mitoClone package)

The following set of routines is implemented in the mitoClone package available from https://github.com/veltenlab/mitoClone, and documented further in the package vignettes.

#### Construction of allele count tables

For each cell and position of mitochondrial genome as well as other genomic sites of interest, count tables for all nucleotides (A/C/G/T), and deletions were created from the BAM files. To this end, the mitoClone package implements the baseCountsFromBamList function, which essentially serves as a wrapper to the bam2R function from the deepSNV R package58.

#### Filtering of mitochondrial variants

To identify relevant somatic variants, we implemented the mutationCallsFromCohort function. In short, we select coordinates in the mitochondrial genome covered by at least five reads each in at least 20 cells. To distinguish RNA editing events from true mitochondrial mutations at the level of genomic sites, we then identify mitochondrial variants that occur in several individuals. For this purpose only, individual cells are called as “mutant” at a given site of the mitochondrial genome if at least 10% of the reads from that cell were from a minor allele (i.e., distinct from the reference). Mutations present in at least 1% of cells in a given patient, but in no more than 10 cells in any other individual, are then included in the final data set, and counts supporting the reference and mutant alleles are computed as for sites of interest in the nuclear genome. Mutations present in several individuals are stored as a blacklist and were used further for filtering some of the data analyzed in Supplementary Fig. 4. Importantly, the result from this step is simply a list of genomic sites that are likely to display genetic variability across single cells. The vignette “Variant calling and blacklist creation” of the mitoClone package provides further recommendations for the choice of filtering parameters.
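The cohort-level filter described above can be sketched as follows. This is a minimal Python illustration, not the mutationCallsFromCohort implementation (which is an R function): the data layout, function names, and the choice to call cells as mutant only when they are covered by at least five reads are assumptions made for the example.

```python
def count_cells(cells, site, min_reads=5, min_vaf=0.10):
    """Return (covered, mutant) cell counts for one site in one individual.

    cells: list of dicts mapping site -> (reference_reads, minor_allele_reads).
    """
    covered = mutant = 0
    for counts in cells:
        ref, alt = counts.get(site, (0, 0))
        depth = ref + alt
        if depth >= min_reads:
            covered += 1
            if alt / depth >= min_vaf:
                mutant += 1
    return covered, mutant

def select_sites(cohort, patient, sites, min_cells=20, min_frac=0.01, max_other=10):
    """Keep sites variable in `patient` but (nearly) absent from all others."""
    selected = []
    for site in sites:
        covered, mutant = count_cells(cohort[patient], site)
        if covered < min_cells or mutant < min_frac * len(cohort[patient]):
            continue
        in_others = max(
            (count_cells(c, site)[1] for p, c in cohort.items() if p != patient),
            default=0,
        )
        if in_others <= max_other:
            selected.append(site)
    return selected

# Toy cohort: site 100 is private to P1, site 200 recurs in P2 (editing-like),
# site 300 is covered in too few cells.
p1 = []
for i in range(100):
    cell = {100: (5, 5), 200: (8, 2)} if i < 30 else {100: (10, 0), 200: (10, 0)}
    if i < 10:
        cell[300] = (0, 6)
    p1.append(cell)
p2 = [{200: (6, 4)} for _ in range(40)] + [{200: (10, 0)} for _ in range(10)]
kept = select_sites({"P1": p1, "P2": p2}, "P1", [100, 200, 300])
```

Sites that recur in several individuals, such as site 200 in this toy example, would be the ones stored in the blacklist.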

#### Construction of clonal hierarchies

To construct a basic clonal hierarchy, we implemented the muta_cluster function. In short, all nuclear and mitochondrial variants with coverage in at least 20% of cells are selected. Observed variant allele frequencies (VAF) are then computed for each cell. From these values, we compute a ternary matrix of observed variant calls N_cg; for each cell c and genomic site g, cells were classified as mutant (1) if the VAF was above 5%, reference (0) if it was below 5%, or dropout (NA) if the site was not covered. We thereby assign a genotype “mutant” or “non-mutant” to single cells at all genomic sites selected in the filtering step, using a less stringent cutoff for calling the site as mutant than in the filtering step. Then, PhISCS36 is used to reconstruct a clonal tree. Unlike conventional algorithms for the reconstruction of phylogenetic hierarchies, PhISCS is very robust with regard to noise. For the figures presented in the main text, we ran PhISCS assuming a false-positive rate (FPR) of 3% and an allelic dropout rate (AD) of 10% across all genes. We additionally estimated the dropout rate on a per-gene level from the number of complete dropouts, and varied the resulting parameter vector using Latin hypercube sampling around the means using a beta distribution with a concentration parameter of 10. Across 80 sampling runs, the same tree was consistently obtained except that the order of nodes within clones was swapped (Supplementary Data 5). PhISCS results are, therefore, robust to variations in the parameters of the statistical model. The false-positive rate of MutaSeq had been empirically estimated using spike-in controls (Supplementary Fig. 10).
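As an illustration of the ternary call matrix, the following Python sketch converts per-cell read counts into mutant (1), reference (0), or dropout (`None`) calls and keeps only sites covered in at least 20% of cells. The function names and the data layout are hypothetical, and a VAF of exactly 5% is treated as reference here because the text does not specify that case.

```python
def call_genotype(ref, alt, vaf_cutoff=0.05):
    """Return 1 (mutant), 0 (reference) or None (dropout) for one cell and site."""
    depth = ref + alt
    if depth == 0:
        return None
    return 1 if alt / depth > vaf_cutoff else 0

def ternary_matrix(counts, min_cov_frac=0.2):
    """counts: site -> list of (ref_reads, alt_reads), one entry per cell."""
    calls_by_site = {}
    for site, per_cell in counts.items():
        calls = [call_genotype(ref, alt) for ref, alt in per_cell]
        covered = sum(c is not None for c in calls)
        if covered >= min_cov_frac * len(per_cell):
            calls_by_site[site] = calls
    return calls_by_site

counts = {
    "site_A": [(10, 0), (5, 5), (0, 0), (97, 3), (0, 0)],
    "site_B": [(0, 0), (0, 0), (0, 0), (0, 0), (0, 0)],  # never covered
}
calls = ternary_matrix(counts)
```

The resulting per-site call vectors correspond to the columns of the matrix passed to the tree reconstruction step.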

#### Clustering of mutations into clones and assignment of cells to clones

PhISCS provides a maximum likelihood phylogenetic tree, enforcing an ordering of mutations. In reality, however, not all intermediate evolutionary steps are represented by cells present in the biological sample (for example, in P1, there are few or no cells displaying the SRSF2 mutation but not the mt:11559G>A mutation). Hence, the order of the nodes in the maximum likelihood tree is to some extent arbitrary and driven by noise; even if there is some statistical support for a specific order, it may be attractive in practice to merge mutations into clones so as to obtain a biologically meaningful, interpretable analysis. We therefore implemented the clusterMetaclones function, which employs a likelihood-based approach. The maximum likelihood tree is split into contiguous linear branches (e.g., for P1, nuc:SRSF2, mt:11559G>A, mt:7527DEL and nuc:EAPP constitute one such branch). Within each branch, all nodes are then swapped with each other and the likelihood of the data given the altered structure is calculated using the PhISCS model:

$$\mathcal{L} = \prod_c \prod_g \begin{cases} \alpha & \mathrm{if}\ M_{cg} = 1\ \&\ N_{cg} = 0 \\ \beta & \mathrm{if}\ M_{cg} = 0\ \&\ N_{cg} = 1 \\ 1 & \mathrm{if}\ N_{cg} = \mathrm{NA} \\ (1 - \alpha)(1 - \beta) & \mathrm{else} \end{cases}$$
(1)

Here, M_cg indicates whether, according to the model, cell c is mutant at genomic site g, and N_cg is the observed variant call for that cell and site. α is the allelic dropout rate and β is the false-positive rate.
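Equation (1) can be written down directly in Python. The sketch below uses the rates quoted in the text (allelic dropout 10%, false-positive rate 3%) as defaults; the function names are hypothetical, and the normalization of per-clone likelihoods into assignment probabilities under a uniform prior is an assumption made for illustration.

```python
import math

def log_likelihood(model, observed, alpha=0.10, beta=0.03):
    """Log of Eq. (1) for one cell: model genotypes (0/1) vs observed calls (0/1/None)."""
    total = 0.0
    for m, n in zip(model, observed):
        if n is None:
            continue                      # dropout contributes a factor of 1
        if m == 1 and n == 0:
            total += math.log(alpha)      # allelic dropout
        elif m == 0 and n == 1:
            total += math.log(beta)       # false positive
        else:
            total += math.log((1 - alpha) * (1 - beta))
    return total

def clone_probabilities(observed, clone_genotypes):
    """Normalized likelihood of each clone for one cell (uniform prior assumed)."""
    lik = {c: math.exp(log_likelihood(g, observed)) for c, g in clone_genotypes.items()}
    norm = sum(lik.values())
    return {c: v / norm for c, v in lik.items()}

# Hypothetical linear hierarchy over two sites: A (no mutation) -> B -> C.
clones = {"A": [0, 0], "B": [1, 0], "C": [1, 1]}
probs = clone_probabilities([1, 0], clones)
```

In this toy example, a cell observed as mutant at the first site and reference at the second is assigned to clone B with a probability of about 0.87.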

The branch is then split into clones such that within each clone, the average difference in log-likelihood incurred by swapping nodes was smaller than 1 per cell. This threshold is set arbitrarily to obtain a practically useful grouping of mutations into clones. The vignette “Computation of clonal hierarchies and clustering of mutations” of the mitoClone package provides further practical recommendations.

The same model was used to compute the likelihood of clonal assignments for each cell. For the analysis in Supplementary Fig. 3b, this estimate was then transformed to bits of information using the Kullback–Leibler distance:

$$D_{\mathrm{KL}} = \sum_c \mathcal{L}_c \log_2\left( \mathcal{L}_c \cdot \left| C \right| \right)$$
(2)

where |C| is the total number of clones.
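Equation (2) measures, in bits, how far a cell's clonal assignment is from a uniform guess over all clones. The following Python sketch assumes that the likelihoods entering the sum are normalized to sum to one for each cell; the function name is hypothetical.

```python
import math

def assignment_information_bits(likelihoods):
    """Eq. (2): divergence of normalized clone likelihoods from a uniform distribution."""
    n_clones = len(likelihoods)
    # terms with zero likelihood contribute nothing (limit of x * log x as x -> 0)
    return sum(p * math.log2(p * n_clones) for p in likelihoods if p > 0)

uninformative = assignment_information_bits([0.25, 0.25, 0.25, 0.25])
certain = assignment_information_bits([1.0, 0.0, 0.0, 0.0])
```

With four clones, a completely uninformative assignment yields 0 bits and a certain assignment yields log2(4) = 2 bits.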

#### Quantitative analyses

For all quantitative analyses of clones (differential expression testing, contribution of clones to cell types), cells with a likelihood of clonal assignment of <0.8 were removed.

#### De novo mutation calling

To identify nuclear mutations associated with the clones, we performed variant calling at a list of candidate sites from COSMIC37. We focused on a subset of COSMIC including putatively pathogenic SNVs and small InDels, variants in expressed genes (mean > 20 reads per cell), and variants associated with a primary site “haematopoietic_and_lymphoid_tissue”, obtaining a list of 13,797 sites of interest. SNVs commonly observed as germline variants in the 1000 genomes project data set were removed59. Allele count tables were created for each site as described above. Finally, a beta-binomial model with the same probability of being mutant in all cells was compared to a beta-binomial model with a different probability of being mutant in each clone using Akaike’s Information Criterion.
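The model comparison can be illustrated with a small Python sketch. It is a simplified stand-in for the actual analysis: the beta-binomial mean is fitted by a coarse grid search, the concentration parameter is fixed at an arbitrary illustrative value instead of being estimated, and all names and read counts are hypothetical.

```python
import math

def log_betabinom(k, n, a, b):
    """Log probability of k mutant reads out of n under a beta-binomial(a, b)."""
    def lbeta(x, y):
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + lbeta(k + a, n - k + b) - lbeta(a, b)

def max_loglik(cells, concentration=10.0):
    """Maximize over the mean mutant fraction; cells: list of (mutant, total) reads."""
    grid = [i / 100 for i in range(1, 100)]
    return max(
        sum(log_betabinom(k, n, mu * concentration, (1 - mu) * concentration)
            for k, n in cells)
        for mu in grid
    )

def aic_shared_vs_clonal(cells_by_clone):
    """AIC = 2 * (number of fitted means) - 2 * log-likelihood, for both models."""
    all_cells = [cell for cells in cells_by_clone.values() for cell in cells]
    aic_shared = 2 * 1 - 2 * max_loglik(all_cells)
    aic_clonal = 2 * len(cells_by_clone) - 2 * sum(
        max_loglik(cells) for cells in cells_by_clone.values()
    )
    return aic_shared, aic_clonal

# Toy data: the variant is absent in clone1 and heterozygous-like in clone2.
data = {"clone1": [(0, 20)] * 10, "clone2": [(10, 20)] * 10}
aic_shared, aic_clonal = aic_shared_vs_clonal(data)
```

A lower AIC for the clone-specific model, as in this toy example, indicates that the variant is associated with clonal identity.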

### Single-cell gene expression data analysis

Cells with <500 distinct genes observed and genes that appeared in <5 cells were removed. Additionally, data from a healthy individual (“H1”) was downloaded from the NCBI Gene Expression Omnibus (GSE75478). Data from all individuals was then integrated using scanorama according to the workflow described by the authors60. The low-dimensional data representation obtained by scanorama was then loaded into Seurat61, and default Seurat implementations of UMAP62,63 and graph-based clustering were used for data visualization and clustering, respectively, using the first 15 scanorama components.
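The quality filter in the first sentence of this paragraph can be sketched as follows in Python. The order of the two steps (cells first, then genes) and the data layout are assumptions, and the function name is hypothetical.

```python
def qc_filter(matrix, min_genes=500, min_cells=5):
    """matrix: cell -> {gene: count}. Drop low-complexity cells, then rare genes."""
    cells = {
        cell: genes for cell, genes in matrix.items()
        if sum(count > 0 for count in genes.values()) >= min_genes
    }
    cells_per_gene = {}
    for genes in cells.values():
        for gene, count in genes.items():
            if count > 0:
                cells_per_gene[gene] = cells_per_gene.get(gene, 0) + 1
    keep = {gene for gene, n in cells_per_gene.items() if n >= min_cells}
    return {
        cell: {gene: count for gene, count in genes.items() if gene in keep}
        for cell, genes in cells.items()
    }

# Toy example with lowered thresholds so the effect is visible.
toy = {
    "c1": {"g1": 3, "g2": 1, "g3": 0},
    "c2": {"g1": 2, "g2": 0, "g3": 0},
    "c3": {"g1": 1, "g2": 4, "g3": 2},
}
filtered = qc_filter(toy, min_genes=2, min_cells=2)
```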

For more detailed analyses of the T/NK-cell and HSC/MPP populations, cells with these identities were selected and the MNN64 data integration workflow was repeated using raw expression counts of these specific cell populations as input.

Differential expression testing was performed using MAST65, using a linear model containing the variable of interest (e.g., clonal identity), a library quality covariate (i.e., the number of genes observed per cell), and, when applicable, additional covariates accounting for patient and cell type.

For the display of gene expression values only, data were normalized according to the Seurat defaults (i.e., counts were divided by the total RNA count of the cell, multiplied by a scale factor of 10,000 and log-transformed).
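For a single cell, this normalization amounts to the following minimal Python sketch, assuming the natural-log `log1p` transform used by Seurat's default "LogNormalize" method. The function name is hypothetical.

```python
import math

def log_normalize(counts, scale_factor=10_000):
    """Per-cell normalization: log1p(count / total_count * scale_factor)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]

normalized = log_normalize([1, 0, 3])
```

A gene with a count of zero stays at zero, and a gene accounting for a quarter of a cell's counts is displayed as log(1 + 2500), which is approximately 7.82.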

### Data visualization

All plots were generated using the ggplot2 (v. 3.2.1) and pheatmap (v. 1.0.12) packages in R 3.6.2. Boxplots are defined as follows: the middle line corresponds to the median; lower and upper hinges correspond to first and third quartiles. The upper whisker extends from the hinge to the largest value no further than 1.5 * IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles). The lower whisker extends from the hinge to the smallest value at most 1.5 * IQR of the hinge. Data beyond the end of the whiskers are called “outlying” points and are plotted individually66.

### Statistics and reproducibility

Statistical analyses were performed using R 3.6.2. Statistical details for each experiment are provided in the figure legends. FlowJo v10 (TreeStar) was used for the analysis of flow cytometry data.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

## Data availability

Count tables and other processed data necessary to reproduce all analyses from the manuscript are deposited in figshare at https://doi.org/10.6084/m9.figshare.12382685.v1 (ref. 67). Raw sequencing data are deposited under a Data Access Agreement to protect patient privacy in the European Genome-Phenome Archive with the accession id EGAS00001003414. Requests for data access should be addressed to LMS (larsms@embl.de). The following restrictions apply: Research with the goal of identifying characteristics of the patient not related to the leukemia (such as surname inference and ancestry research) is excluded. The use of the data for projects not related to cancer research is excluded; exceptions may apply in the context of research aiming to develop new bioinformatics methods. Sequencing data from the healthy individuals are deposited in GEO with the accession id GSE75478. The source data underlying Figs. 1f, 5d, 5h, and Supplementary Fig. 8c–e are provided as a Source Data file. All the other data supporting the findings of this study are available within the article and its supplementary information files and from the corresponding author upon reasonable request. Source data are provided with this paper.

## Code availability

Code for MutaSeq primer design is available at https://github.com/veltenlab/PrimerDesign68. A package for data analysis is available at https://github.com/veltenlab/mitoClone69.

## References

1. 1.

Clevers, H. What is an adult stem cell? Science 350, 1319–1320 (2015).

2. 2.

Meacham, C. E. & Morrison, S. J. Tumour heterogeneity and cancer cell plasticity. Nature 501, 328–337 (2013).

3. 3.

Kreso, A. & Dick, J. E. Evolution of the cancer stem cell model. Cell Stem Cell 14, 275–291 (2014).

4. 4.

Shlush, L. I. et al. Identification of pre-leukaemic haematopoietic stem cells in acute leukaemia. Nature 506, 328–333 (2014).

5. 5.

Greaves, M. Leukaemia ‘firsts’ in cancer research and treatment. Nat. Rev. Cancer 16, 163–172 (2016).

6. 6.

Jaiswal, S. et al. Age-related clonal hematopoiesis associated with adverse outcomes. N. Engl. J. Med. 371, 2488–2498 (2014).

7. 7.

Genovese, G. et al. Clonal hematopoiesis and blood-cancer risk inferred from blood DNA sequence. N. Engl. J. Med. 371, 2477–2487 (2014).

8. 8.

Bowman, R. L., Busque, L. & Levine, R. L. Clonal hematopoiesis and evolution to hematopoietic malignancies. Cell Stem Cell 22, 157–170 (2018).

9. 9.

Trumpp, A. & Wiestler, O. D. Mechanisms of disease: cancer stem cells–targeting the evil twin. Nat. Clin. Pract. Oncol. 5, 337–347 (2008).

10. 10.

Essers, M. A. G. & Trumpp, A. Targeting leukemic stem cells by breaking their dormancy. Mol. Oncol. 4, 443–450 (2010).

11. 11.

Pollyea, D. A. & Jordan, C. T. Therapeutic targeting of acute myeloid leukemia stem cells. Blood 129, 1627–1635 (2017).

12. 12.

Hosen, N. et al. CD96 is a leukemic stem cell-specific marker in human acute myeloid leukemia. Proc. Natl Acad. Sci. USA 104, 11008–11013 (2007).

13. 13.

Majeti, R. et al. Dysregulated gene expression networks in human acute myelogenous leukemia stem cells. Proc. Natl Acad. Sci. 106, 3396–3401 (2009).

14. 14.

Majeti, R. et al. CD47 is an adverse prognostic factor and therapeutic antibody target on human acute myeloid leukemia stem cells. Cell 138, 286–299 (2009).

15. 15.

Jan, M. et al. Prospective separation of normal and leukemic stem cells based on differential expression of TIM3, a human acute myeloid leukemia stem cell marker. Proc. Natl Acad. Sci. USA 108, 5009–5014 (2011).

16. 16.

Eppert, K. et al. Stem cell gene expression programs influence clinical outcome in human leukemia. Nat. Med. 17, 1086–1093 (2011).

17. 17.

Ng, S. W. K. et al. A 17-gene stemness score for rapid determination of risk in acute leukaemia. Nature 540, 433–437 (2016).

18. 18.

Pabst, C. et al. GPR56 identifies primary human acute myeloid leukemia cells with high repopulating potential in vivo. Blood 127, 2018–2027 (2016).

19. 19.

Raffel, S. et al. BCAT1 restricts αKG levels in AML stem cells leading to IDHmut-like DNA hypermethylation. Nature 551, 384–388 (2017).

20. 20.

Zink, F. et al. Clonal hematopoiesis, with and without candidate driver mutations, is common in the elderly. Blood 130, 742–752 (2017).

21. 21.

Papaemmanuil, E. et al. Genomic classification and prognosis in acute myeloid leukemia. N. Engl. J. Med. 374, 2209–2221 (2016).

22. 22.

Velten, L. et al. Human haematopoietic stem cell lineage commitment is a continuous process. Nat. Cell Biol. 19, 271–281 (2017).

23. 23.

Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).

24. 24.

Tusi, B. K. et al. Population snapshots predict early haematopoietic and erythroid hierarchies. Nature 555, 54–60 (2018).

25. 25.

Giustacchini, A. et al. Single-cell transcriptomics uncovers distinct molecular signatures of stem cells in chronic myeloid leukemia. Nat. Med. 23, 692–702 (2017).

26. 26.

Filbin, M. G. et al. Developmental and oncogenic programs in H3K27M gliomas dissected by single-cell RNA-seq. Science 360, 331–335 (2018).

27. 27.

van Galen, P. et al. Single-cell RNA-seq reveals AML hierarchies relevant to disease progression and immunity. Cell 176, 1265–1281.e24 (2019).

28. 28.

Nam, A. S. et al. Somatic mutations and cell identity linked by genotyping of transcriptomes. Nature 571, 355–360 (2019).

29. 29.

Petti, A. A. et al. A general approach for detecting expressed mutations in AML cells using single cell RNA-sequencing. Nat. Commun. 10, 3660 (2019).

30. 30.

Macaulay, I. C. et al. G&T-seq: parallel sequencing of single-cell genomes and transcriptomes. Nat. Methods 12, 519–522 (2015).

31. 31.

Rodriguez-Meira, A. et al. Unravelling intratumoral heterogeneity through high-sensitivity single-cell mutational analysis and parallel RNA sequencing. Mol. Cell 73, 1292–1305.e8 (2019).

32. 32.

Dey, S. S., Kester, L., Spanjaard, B., Bienko, M. & van Oudenaarden, A. Integrated genome and transcriptome sequencing of the same cell. Nat. Biotechnol. 33, 285–289 (2015).

33. 33.

Ludwig, L. S. et al. Lineage tracing in humans enabled by mitochondrial mutations and single-cell genomics. Cell 176, 1325–1339.e22 (2019).

34. 34.

Xu, J. et al. Single-cell lineage tracing by endogenous mutations enriched in transposase accessible mitochondrial DNA. eLife 8, 1–14 (2019).

35. 35.

Coban-Akdemir, Z. et al. Identifying genes whose mutant transcripts cause dominant disease traits by potential gain-of-function alleles. Am. J. Hum. Genet. 103, 171–187 (2018).

36. 36.

Malikic, S. et al. PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data. Genome Res. 29, 1860–1877 (2019).

37. 37.

Tate, J. G. et al. COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids Res. 47, D941–D947 (2019).

38. 38.

Assi, S. A. et al. Subtype-specific regulatory network rewiring in acute myeloid leukemia. Nat. Genet. 51, 151–162 (2019).

39. 39.

Shlush, L. I. et al. Tracing the origins of relapse in acute myeloid leukaemia to stem cells. Nature 547, 104–108 (2017).

40. 40.

Izzo, F. et al. DNA methylation disruption reshapes the hematopoietic differentiation landscape. Nat. Genet. 52, 378–387 (2020).

41. 41.

Pina, C., May, G., Soneji, S., Hong, D. & Enver, T. MLLT3 regulates early human erythroid and megakaryocytic cell fate. Cell Stem Cell 2, 264–273 (2008).

42. 42.

Uckelmann, H. J. et al. Therapeutic targeting of preleukemia cells in a mouse model of NPM1 mutant acute myeloid leukemia. Science 367, 586–590 (2020).

43. 43.

Brunetti, L. et al. Mutant NPM1 maintains the leukemic state through HOX expression. Cancer Cell 34, 499–512.e9 (2018).

44. 44.

Vassiliou, G. S. et al. Mutant nucleophosmin and cooperating pathways drive leukemia initiation and progression in mice. Nat. Genet. 43, 470–475 (2011).

45. 45.

Satpathy, A. T. et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 925–936 (2019).

46. 46.

Zhao, M., Kim, P., Mitra, R., Zhao, J. & Zhao, Z. TSGene 2.0: an updated literature-based knowledgebase for tumor suppressor genes. Nucleic Acids Res. 44, D1023–D1031 (2016).

47. 47.

Lareau, C. A. et al. Massively parallel single-cell mitochondrial DNA genotyping and chromatin profiling. Nat. Biotechnol. https://doi.org/10.1038/s41587-020-0645-6 (2020).

48. 48.

Schallmoser, K. et al. Rapid large-scale expansion of functional mesenchymal stem cells from unmanipulated bone marrow without animal serum. Tissue Eng. Part C. Methods 14, 185–196 (2008).

49. 49.

Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at arXiv https://arxiv.org/abs/1303.3997 (2013).

50. 50.

Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).

51. 51.

Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).

52. 52.

Untergasser, A. et al. Primer3 — new capabilities and interfaces. Nucleic Acids Res. 40, e115 (2012).

53. 53.

Östergård, P. R. J. A fast algorithm for the maximum clique problem. Discret. Appl. Math. 120, 197–207 (2002).

54. 54.

Picelli, S. et al. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat. Methods 10, 1096–1098 (2013).

55. 55.

Picelli, S. et al. Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. 9, 171–181 (2014).

56. 56.

Hennig, B. P. et al. Large-scale low-cost NGS library preparation using a robust Tn5 purification and tagmentation protocol. G3 8, 79–89 (2018).

57. 57.

Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

58. 58.

Gerstung, M. et al. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nat. Commun. 3, 811 (2012).

59. 59.

Consortium, T. 1000 G. P. A global reference for human genetic variation. Nature 526, 68–74 (2015).

60. 60.

Hie, B., Bryson, B. & Berger, B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37, 685–691 (2019).

61. 61.

Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).

62. 62.

Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–47 (2019).

63. 63.

McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at arXiv https://arxiv.org/abs/1802.03426 (2018).

64. 64.

Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).

65. 65.

Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278 (2015).

66. 66.

Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2009).

67.

Velten, L., Story, B. A. & Steinmetz, L. M. Dataset for the manuscript “Identification of leukemic and pre-leukemic stem cells by clonal tracking from single-cell transcriptomics”. figshare https://doi.org/10.6084/m9.figshare.12382685.v1 (2021).

68.

Velten, L., Story, B. A. & Steinmetz, L. M. PrimerDesign package v1.0. https://doi.org/10.5281/zenodo.4443088 (2021).

69.

Velten, L., Story, B. A. & Steinmetz, L. M. mitoClone package v1.0. https://doi.org/10.5281/zenodo.4443074 (2021).

## Acknowledgements

We thank Niko Beerenwinkel for discussions, and Andreas Gschwind for contributing code for primer design. We thank the members of the Steinmetz and Haas labs for discussions and support, and the EMBL and DKFZ Genomics core facilities and the EMBL flow cytometry core facility for technical support. This project was financially supported by the Deutsche José Carreras Leukämie Stiftung grant DJCLS 20R/2017 (to L.V., S.H., L.M.S., and A.T.), the Emerson foundation grant 643577 (to L.V. and L.M.S.), and the German Bundesministerium für Bildung und Forschung (BMBF) through the Juniorverbund in der Systemmedizin “LeukoSyStem” (FKZ 01ZX1911D to L.V., S.H., and S.R.). Contributions by S.R. were further supported by Emmy Noether Fellowship RA 3166/1-1 (DFG). Contributions by C.P. were supported by a Max-Eder Grant (German Cancer Aid 70111531). Contributions by D.N., J.C.J., W.K.H., and T.B. were supported by the Gutermuth Foundation, the H.W. & J. Hector fund, Baden-Württemberg, and the Dr. Rolf M. Schwiete Fund, Mannheim. D.N. is an endowed professor of the Deutsche José Carreras Leukämie Stiftung (DJCLS H 03/01).

## Author information

### Contributions

L.V. developed MutaSeq and performed single-cell RNA-seq experiments with assistance by J.M., D.R.L., and C.S.T. P.H.M. performed sample characterizations, single-cell culture experiments, and FACS sorting with support by M.P., A.D., S.R., and supervision by S.H. B.A.S. and L.V. developed mitoClone and analyzed the data with contributions from R.F. L.V., S.H., A.T., and L.M.S. conceived the study. L.V. and B.A.S. wrote the manuscript with contributions from P.H.M., S.H., S.R., and L.M.S. C.L., D.N., J.C.J., C.P., T.B., W.K.H., and C.M.T. collected samples and performed initial sample characterization. All authors have read and commented on the manuscript.

### Corresponding authors

Correspondence to Lars Velten or Lars M. Steinmetz.

## Ethics declarations

### Competing interests

L.M.S. is co-founder of Sophia Genetics and Levitas Bio and consultant for several companies on genetic analysis. All other authors declare no competing interests.

Peer review information: Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Velten, L., Story, B.A., Hernández-Malmierca, P. et al. Identification of leukemic and pre-leukemic stem cells by clonal tracking from single-cell transcriptomics. Nat Commun 12, 1366 (2021). https://doi.org/10.1038/s41467-021-21650-1
