The incidence of acute myeloid leukaemia (AML) increases with age and mortality exceeds 90% when diagnosed after age 65. Most cases arise without any detectable early symptoms and patients usually present with the acute complications of bone marrow failure [1]. The onset of such de novo AML cases is typically preceded by the accumulation of somatic mutations in preleukaemic haematopoietic stem and progenitor cells (HSPCs) that undergo clonal expansion [2,3]. However, recurrent AML mutations also accumulate in HSPCs during ageing of healthy individuals who do not develop AML, a phenomenon referred to as age-related clonal haematopoiesis (ARCH) [4–8]. Here we use deep sequencing to analyse genes that are recurrently mutated in AML to distinguish between individuals who have a high risk of developing AML and those with benign ARCH. We analysed peripheral blood cells from 95 individuals that were obtained on average 6.3 years before AML diagnosis (pre-AML group), together with 414 unselected age- and gender-matched individuals (control group). Pre-AML cases were distinct from controls: they had more mutations per sample, higher variant allele frequencies (indicating greater clonal expansion) and enrichment of mutations in specific genes. Genetic parameters were used to derive a model that accurately predicted AML-free survival; this model was validated in an independent cohort of 29 pre-AML cases and 262 controls. Because AML is rare, we also developed an AML predictive model using a large electronic health record database that identified individuals at greater risk. Collectively, our findings provide proof-of-concept that it is possible to discriminate ARCH from pre-AML many years before malignant transformation. This could in future enable earlier detection and monitoring, and may help to inform intervention.
Deschler, B. & Lübbert, M. Acute myeloid leukemia: epidemiology and etiology. Cancer 107, 2099–2107 (2006).
Corces-Zimmerman, M. R., Hong, W. J., Weissman, I. L., Medeiros, B. C. & Majeti, R. Preleukemic mutations in human acute myeloid leukemia affect epigenetic regulators and persist in remission. Proc. Natl Acad. Sci. USA 111, 2548–2553 (2014).
Shlush, L. I. et al. Identification of pre-leukaemic haematopoietic stem cells in acute leukaemia. Nature 506, 328–333 (2014).
Genovese, G. et al. Clonal hematopoiesis and blood-cancer risk inferred from blood DNA sequence. N. Engl. J. Med. 371, 2477–2487 (2014).
Jaiswal, S. et al. Age-related clonal hematopoiesis associated with adverse outcomes. N. Engl. J. Med. 371, 2488–2498 (2014).
Xie, M. et al. Age-related mutations associated with clonal hematopoietic expansion and malignancies. Nat. Med. 20, 1472–1478 (2014).
Busque, L. et al. Nonrandom X-inactivation patterns in normal females: lyonization ratios vary with age. Blood 88, 59–65 (1996).
Shlush, L. I. Age-related clonal hematopoiesis. Blood 131, 496–504 (2018).
Acuna-Hidalgo, R. et al. Ultra-sensitive sequencing identifies high prevalence of clonal hematopoiesis-associated mutations throughout adult life. Am. J. Hum. Genet. 101, 50–64 (2017).
McKerrell, T. et al. Leukemia-associated somatic mutations drive distinct patterns of age-related clonal hemopoiesis. Cell Rep. 10, 1239–1245 (2015).
Wong, T. N. et al. Role of TP53 mutations in the origin and evolution of therapy-related acute myeloid leukaemia. Nature 518, 552–555 (2015).
Yoshizato, T. et al. Somatic mutations and clonal hematopoiesis in aplastic anemia. N. Engl. J. Med. 373, 35–47 (2015).
Krönke, J. et al. Clonal evolution in relapsed NPM1-mutated acute myeloid leukemia. Blood 122, 100–108 (2013).
Papaemmanuil, E. et al. Genomic classification and prognosis in acute myeloid leukemia. N. Engl. J. Med. 374, 2209–2221 (2016).
Forbes, S. A. et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 45, D777–D783 (2017).
Shlush, L. I. et al. Tracing the origins of relapse in acute myeloid leukaemia to stem cells. Nature 547, 104–108 (2017).
Buscarlet, M. et al. DNMT3A and TET2 dominate clonal hematopoiesis and demonstrate benign phenotypes and different genetic predispositions. Blood 130, 753–762 (2017).
Arber, D. A. et al. The 2016 revision to the World Health Organization classification of myeloid neoplasms and acute leukemia. Blood 127, 2391–2405 (2016).
Hu, L. et al. Prognostic value of RDW in cancers: a systematic review and meta-analysis. Oncotarget 8, 16027–16035 (2017).
Balicer, R. D. & Afek, A. Digital health nation: Israel’s global big data innovation hub. Lancet 389, 2451–2453 (2017).
Dagan, N., Cohen-Stavi, C., Leventer-Roberts, M. & Balicer, R. D. External validation and comparison of three prediction tools for risk of osteoporotic fractures using data from population based electronic health records: retrospective cohort study. Br. Med. J. 356, i6755 (2017).
McKerrell, T. & Vassiliou, G. S. Aging as a driver of leukemogenesis. Sci. Transl. Med. 7, 306fs38 (2015).
Vickers, A. J. Prediction models in cancer care. CA Cancer J. Clin. 61, 315–326 (2011).
Cassidy, A. et al. The LLP risk model: an individual risk prediction model for lung cancer. Br. J. Cancer 98, 270–276 (2008).
Wang, X., Oldani, M. J., Zhao, X., Huang, X. & Qian, D. A review of cancer risk prediction models with genetic variants. Cancer Inform. 13, 19–28 (2014).
Fuster, J. J. et al. Clonal hematopoiesis associated with TET2 deficiency accelerates atherosclerosis development in mice. Science 355, 842–847 (2017).
Jaiswal, S. et al. Clonal hematopoiesis and risk of atherosclerotic cardiovascular disease. N. Engl. J. Med. 377, 111–121 (2017).
Martincorena, I. & Campbell, P. J. Somatic mutation in cancer and normal cells. Science 349, 1483–1489 (2015).
Riboli, E. et al. European Prospective Investigation into Cancer and Nutrition (EPIC): study populations and data collection. Public Health Nutr. 5, 1113–1124 (2002).
Newman, A. M. et al. An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nat. Med. 20, 548–554 (2014).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Kennedy, S. R. et al. Detecting ultralow-frequency mutations by Duplex Sequencing. Nat. Protoc. 9, 2586–2606 (2014).
Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012).
Yang, H. & Wang, K. Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR. Nat. Protoc. 10, 1556–1566 (2015).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Gerstung, M. et al. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nat. Commun. 3, 811 (2012).
Gerstung, M. et al. Precision oncology for acute myeloid leukemia using a knowledge bank approach. Nat. Genet. 49, 332–340 (2017).
Martincorena, I. et al. High burden and pervasive positive selection of somatic mutations in normal human skin. Science 348, 880–886 (2015).
Buels, R. et al. JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol. 17, 66 (2016).
Stephens, P. J. et al. The landscape of cancer genes and mutational processes in breast cancer. Nature 486, 400–404 (2012).
Raine, K. M. et al. cgpPindel: identifying somatically acquired insertion and deletion events from paired end sequencing. Curr. Protoc. Bioinformatics 52, 15.17.1–15.7.12 (2015).
Menzies, A. et al. VAGrENT: Variation Annotation Generator. Curr. Protoc. Bioinformatics 52, 15.18.1–15.18.11 (2015).
Antoniou, A. C. et al. A weighted cohort approach for analysing factors modifying disease risks in carriers of high-risk susceptibility genes. Genet. Epidemiol. 29, 1–11 (2005).
Therneau, T. & Grambsch, P. M. Modeling Survival Data: Extending the Cox Model 1st edn (Springer-Verlag, New York, 2000).
Harrell, F. E. Jr, Lee, K. L. & Mark, D. B. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat. Med. 15, 361–387 (1996).
O’Quigley, J., Xu, R. & Stare, J. Explained randomness in proportional hazards models. Stat. Med. 24, 479–489 (2005).
This work was supported by a Quest for Cure grant to L.I.S., J.C.Y.W. and M.D.M. from the Leukemia and Lymphoma Society, and the following grants to L.I.S.: ERC Horizon 2020 MAMLE, Abisch-Frenkel foundation and an American Society of Hematology Scholar Award. Further funding to J.E.D. was provided by the Canada Research Chair Program, Ontario Institute for Cancer Research, the province of Ontario, Canadian Cancer Society, the Canadian Institutes for Health Research and the Ontario Ministry of Health and Long Term Care to UHN, whose views are not expressed here. Work conducted at the Sanger Institute was supported by the Wellcome Trust and UK Medical Research Council. S.A. was personally funded by the Benjamin Pearl fellowship from the McEwen Centre for Regenerative Medicine; G.C. by a Wellcome Trust Clinical PhD Fellowship (WT098051); and G.S.V. by a Wellcome Trust Senior Fellowship in Clinical Science (WT095663MA) and a Cancer Research UK Senior Cancer Research Fellowship (C22324/A23015). G.S.V.'s laboratory is also funded by the Kay Kendall Leukaemia Fund and Bloodwise. We thank A. Mitchell and all members of the Dick and Shlush laboratories for comments, T. Hudson for early study planning and G. Barabash for organising the Clalit dataset collaboration. The EPIC study centres were supported by the Hellenic Health Foundation, Regional Government of Asturias, the Regional Government of Murcia (no. 6236), the Spanish Ministry of Health network RTICCC (ISCIII RD12/0036/0018), FEDER funds/European Regional Development Fund (ERDF), “a way to build Europe”, Generalitat de Catalunya, AGAUR 2014SGR726; EPIC Ragusa in Italy-Aire-Onlus Ragusa; EPIC Italy-Associazione Italiana per la Ricerca sul Cancro (AIRC) Milan, Italy. S.V.B. and T.J.P. are supported by the Gattuso-Slaight Personalized Cancer Medicine Fund at the Princess Margaret Cancer Centre.
Nature thanks R. Levine, P. Van Loo and the other anonymous reviewer(s) for their contribution to the peer review of this work.
The authors declare no competing interests.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Red and blue lines represent the proportion of pre-AML cases and controls, respectively, that had ARCH-PD mutations with VAF ≥ 10%.
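As a rough illustration of the quantity plotted here, the sketch below computes a variant allele frequency (VAF) as the fraction of reads supporting the variant and then the proportion of individuals carrying at least one mutation with VAF ≥ 10%. The `VariantCall` record, the function names and the read counts are hypothetical and are not taken from the study's pipeline.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical record for one mutation call in one blood sample."""
    sample_id: str
    alt_reads: int    # reads supporting the variant allele
    total_reads: int  # all reads covering the position


def vaf(call: VariantCall) -> float:
    """Variant allele frequency: variant-supporting reads / total reads."""
    if call.total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return call.alt_reads / call.total_reads


def proportion_with_high_vaf(calls, all_sample_ids, threshold=0.10):
    """Fraction of samples with at least one call at or above the threshold."""
    sample_ids = set(all_sample_ids)
    if not sample_ids:
        raise ValueError("all_sample_ids must not be empty")
    flagged = {c.sample_id for c in calls if vaf(c) >= threshold}
    return len(flagged & sample_ids) / len(sample_ids)


# Made-up example: four individuals, two of whom carry a clone at VAF >= 10%.
example_calls = [
    VariantCall("s1", 12, 100),
    VariantCall("s2", 5, 100),
    VariantCall("s2", 30, 200),
]
example_fraction = proportion_with_high_vaf(example_calls, ["s1", "s2", "s3", "s4"])
```

Applied separately to pre-AML cases and to controls, a calculation of this kind would yield the two proportions that the red and blue lines compare.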
Extended Data Fig. 2 Serial sampling supports long-lived HSPCs as the cell of origin for most ARCH-PD clones.
a, b, VAF trajectory of persistent clones carrying putative driver mutations in controls (a) and pre-AML cases (b). Age is indicated on the x axis. Top, VAF is shown on the y axis and each persistent mutation is shown in a different colour, with circles denoting individual serial samples and solid lines representing the growth trajectory between serial samples. Bottom, dashed lines indicate the time interval between the last sampling and the end of follow-up (controls) or AML diagnosis (cases). c, Clonal growth rates (α) are shown for 27 control clones corresponding to 54 time points and 13 pre-AML clones corresponding to 15 time points. Box plot centres, hinges and whiskers represent the median, first and third quartiles and 1.5 × interquartile range, respectively.
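The box plot convention described in panel c can be sketched as follows. This is a minimal illustration that assumes the common Tukey convention, in which whiskers extend to the most extreme observation lying within 1.5 × interquartile range of the hinges; the function name and the input values are hypothetical and the quartile method is an assumption, not necessarily the one used to draw the figure.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Median, hinges (quartiles) and Tukey whiskers for a list of numbers."""
    data = sorted(values)
    # Inclusive quartiles treat the data as the full population of interest.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Made-up growth-rate-like values with one extreme observation.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

Different plotting libraries use slightly different quartile definitions, so hinge positions can shift a little for small samples such as the 13 pre-AML clones.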
a, Receiver operating characteristic curve for prediction of AML development using model 1 (see Methods). The red dot indicates the point on the curve with the highest positive predictive value with sensitivity of 41.9% and specificity of 95.7%. b, c, Kaplan–Meier estimates of time to AML diagnosis for individuals predicted to develop AML (red) and not develop AML (blue) using model 1 (b; hazard ratio, 10.38; P = 4.2 × 10−10, Wald test) and model 2 (c; hazard ratio, 10.75; P = 1.75 × 10−8, Wald test), from the point of enrolment until the end of follow-up for patients enrolled in the EPIC study.
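To make the operating point in panel a concrete, the sketch below shows how sensitivity and specificity are computed from confusion-matrix counts and how positive predictive value follows from them through Bayes' rule. The sensitivity (41.9%) and specificity (95.7%) come from the caption; the prevalence of 1 in 1,000 is a purely hypothetical figure chosen for illustration, and the function names are not from the study's code.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of individuals who develop AML that the model flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of individuals who stay AML-free that the model clears."""
    return true_neg / (true_neg + false_pos)


def positive_predictive_value(sens: float, spec: float, prevalence: float) -> float:
    """Probability that a flagged individual is a true case (Bayes' rule)."""
    true_pos_rate = sens * prevalence
    false_pos_rate = (1.0 - spec) * (1.0 - prevalence)
    return true_pos_rate / (true_pos_rate + false_pos_rate)


# Operating point from the caption; prevalence is an ASSUMED illustrative value.
ppv_at_assumed_prevalence = positive_predictive_value(
    sens=0.419, spec=0.957, prevalence=0.001
)
```

Because the false-positive term scales with the much larger unaffected population, positive predictive value falls sharply as prevalence drops, which is why rarity of the disease matters when interpreting a fixed sensitivity and specificity.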
a–c, Time-dependent receiver operating characteristic curve for Cox proportional hazards model trained on the discovery cohort (n = 505 unique individuals, 91 pre-AML and 414 controls) (a), validation cohort (n = 291 unique individuals, 29 pre-AML and 262 controls) (b) and combined cohorts (c). d–f, Dynamic AUC for Cox proportional hazards models trained on the discovery cohort (d), validation cohort (e) or combined cohort (f). g, h, Red and blue bars indicate the observed and expected VAF (g) and driver frequency (h) of pre-AML cases and controls for each gene indicated on the x axis.
a, Kaplan–Meier curves of AML-free survival, defined as the time between sample collection and AML diagnosis, death or last follow-up. Survival curves are stratified according to mutation status in genes mutated in at least three samples across the combined validation and discovery cohorts. n = 796 unique individuals. b, Kaplan–Meier curve of AML-free survival stratified according to RDW value >14 or ≤14. Plot represents data for n = 128 biologically independent individuals who had RDW measurements, including all pre-AML cases regardless of ARCH-PD status, and controls with ARCH-PD (controls without detectable mutations were omitted).
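The curves in this figure are Kaplan–Meier estimates. The following minimal sketch shows the product-limit calculation and a split by the RDW threshold used in panel b. It assumes that AML diagnosis is the event and that death or last follow-up is treated as censoring, which is one reading of the caption rather than a confirmed detail; the record layout and all values are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, survival) at each event time.

    times  -- follow-up time for each individual
    events -- 1 if the event (here assumed: AML diagnosis) occurred, 0 if censored
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        n_events = 0
        n_leaving = 0
        # Group all individuals whose follow-up ends at the same time t.
        while i < len(pairs) and pairs[i][0] == t:
            n_events += pairs[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


def stratify_by_rdw(records, threshold=14.0):
    """Split (time, event, rdw) records into RDW > threshold and <= threshold."""
    high = [(t, e) for t, e, rdw in records if rdw > threshold]
    low = [(t, e) for t, e, rdw in records if rdw <= threshold]
    return high, low


# Made-up follow-up data: (years, event flag, RDW value).
records = [(1, 1, 15.2), (2, 1, 14.8), (2, 0, 13.1), (3, 1, 15.0), (4, 0, 12.9)]
high, low = stratify_by_rdw(records)
curve_all = kaplan_meier([r[0] for r in records], [r[1] for r in records])
```

In practice a survival analysis library would also provide confidence intervals and a log-rank or Wald test for comparing the strata.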
a, Kaplan–Meier curves showing age stratified survival rates for 875 individuals who developed AML. b, Line plot representation of the number of cases per 100,000 control individuals in the EHR database. The centre values and error bars define the mean and s.d., respectively.
Normalized laboratory measurements for pre-AMLs (red) and controls (blue) (middle) and their association (bottom) with higher risk of AML are shown. The grey bars indicate the percentage of pre-AML cases with laboratory results either below the 1st percentile or above the 99th percentile. Box plot centres, hinges and whiskers represent the median, first and third quartiles and 1.5 × interquartile range, respectively.
The relative contribution of the top 50 features incorporated into the EHR prediction model, ranked according to their predictive value (gain). 1Y, one year before AML diagnosis; 2Y, two years before AML diagnosis; 3Y, three years before AML diagnosis; BASO%, percentage of basophils; BMI, body mass index; EOS.abs, absolute eosinophil count; EOS%, percentage of eosinophils; HYPO%, percentage of hypochromia; LUC, large unstained cells; LYM%, percentage of lymphocytes; LYMPH.abs, absolute lymphocyte count; MACRO%, percentage of macrocytosis; MCH, mean corpuscular haemoglobin; MCV, mean corpuscular volume; MON%, percentage of monocytes; MONO.abs, absolute monocyte count; MPV, mean platelet volume; NEUT.abs, absolute neutrophil count; NEUT%, percentage of neutrophils; PLT, platelet count; RBC, red blood cell count; RDW, red cell distribution width; WBC, white blood cell count.
Heat map illustrating absolute values of clinical measurements. Blue, white and red indicate low, intermediate and high values, respectively. Light grey indicates missing data. False-negative and true-positive annotations are indicated at the bottom as dark-grey and yellow colour bars, respectively. 1Y, one year before AML diagnosis; 2Y, two years before AML diagnosis; 3Y, three years before AML diagnosis; BASO%, percentage of basophils; EOS%, percentage of eosinophils; EOS.abs, absolute eosinophil count; HCT, haematocrit; HDL, high density lipoprotein; HGB, haemoglobin; Hyper%, percentage of hyperchromia; Hypo%, percentage of hypochromia; LDL, low density lipoprotein; LUC, large unstained cells; LYM%, percentage of lymphocytes; LYMP.abs, absolute lymphocyte count; MACRO%, percentage of macrocytosis; MCH, mean corpuscular haemoglobin; MCHC, mean corpuscular haemoglobin concentration; MCV, mean corpuscular volume; MICR%, percentage of microcytosis; MON%, percentage of monocytes; MONO.abs, absolute monocyte count; MPV, mean platelet volume; PLT, platelet count; NEUT%, percentage of neutrophils; NEUT.abs, absolute neutrophil count; RBC, red blood cell count; RDW, red cell distribution width; Transamina, transaminase; Transpeptid., transpeptidase; TSH, thyroid stimulating hormone; WBC, white blood cell count.
Supplementary Note - Genetic model related code. Code for the derivation of the genetic AML prediction model.
Clinical characteristics of the discovery and validation cohorts: This table contains survival and other available clinical metadata for the study cohorts.
ARCH-PD mutations: This table lists putative oncogenic mutations.
Genetic models performance and coefficients: This table contains genetic AML prediction model coefficients and performance metrics.
Features and parameters of the EHR based model: This table details the criteria for AML case ascertainment for the clinical AML prediction model as well as clinical features included and parameters used for machine learning.
About this article
Cite this article
Abelson, S., Collord, G., Ng, S.W.K. et al. Prediction of acute myeloid leukaemia risk in healthy individuals. Nature 559, 400–404 (2018). https://doi.org/10.1038/s41586-018-0317-6