Prediction of acute myeloid leukaemia risk in healthy individuals


The incidence of acute myeloid leukaemia (AML) increases with age and mortality exceeds 90% when diagnosed after age 65. Most cases arise without any detectable early symptoms and patients usually present with the acute complications of bone marrow failure [1]. The onset of such de novo AML cases is typically preceded by the accumulation of somatic mutations in preleukaemic haematopoietic stem and progenitor cells (HSPCs) that undergo clonal expansion [2,3]. However, recurrent AML mutations also accumulate in HSPCs during ageing of healthy individuals who do not develop AML, a phenomenon referred to as age-related clonal haematopoiesis (ARCH) [4–8]. Here we use deep sequencing to analyse genes that are recurrently mutated in AML to distinguish between individuals who have a high risk of developing AML and those with benign ARCH. We analysed peripheral blood cells from 95 individuals that were obtained on average 6.3 years before AML diagnosis (pre-AML group), together with 414 unselected age- and gender-matched individuals (control group). Pre-AML cases were distinct from controls and had more mutations per sample, higher variant allele frequencies, indicating greater clonal expansion, and showed enrichment of mutations in specific genes. Genetic parameters were used to derive a model that accurately predicted AML-free survival; this model was validated in an independent cohort of 29 pre-AML cases and 262 controls. Because AML is rare, we also developed an AML predictive model using a large electronic health record database that identified individuals at greater risk. Collectively our findings provide proof-of-concept that it is possible to discriminate ARCH from pre-AML many years before malignant transformation. This could in future enable earlier detection and monitoring, and may help to inform intervention.

Fig. 1: Prevalence of ARCH, number of mutations and clone size in individuals who developed AML.
Fig. 2: Accumulation of specific recurrent AML mutations in healthy individuals at a young age is associated with progression to AML.
Fig. 3: Model of future risk of AML.
Fig. 4: Increased risk of AML development inferred from electronic health records.


  1. Deschler, B. & Lübbert, M. Acute myeloid leukemia: epidemiology and etiology. Cancer 107, 2099–2107 (2006).

  2. Corces-Zimmerman, M. R., Hong, W. J., Weissman, I. L., Medeiros, B. C. & Majeti, R. Preleukemic mutations in human acute myeloid leukemia affect epigenetic regulators and persist in remission. Proc. Natl Acad. Sci. USA 111, 2548–2553 (2014).

  3. Shlush, L. I. et al. Identification of pre-leukaemic haematopoietic stem cells in acute leukaemia. Nature 506, 328–333 (2014).

  4. Genovese, G. et al. Clonal hematopoiesis and blood-cancer risk inferred from blood DNA sequence. N. Engl. J. Med. 371, 2477–2487 (2014).

  5. Jaiswal, S. et al. Age-related clonal hematopoiesis associated with adverse outcomes. N. Engl. J. Med. 371, 2488–2498 (2014).

  6. Xie, M. et al. Age-related mutations associated with clonal hematopoietic expansion and malignancies. Nat. Med. 20, 1472–1478 (2014).

  7. Busque, L. et al. Nonrandom X-inactivation patterns in normal females: lyonization ratios vary with age. Blood 88, 59–65 (1996).

  8. Shlush, L. I. Age-related clonal hematopoiesis. Blood 131, 496–504 (2018).

  9. Acuna-Hidalgo, R. et al. Ultra-sensitive sequencing identifies high prevalence of clonal hematopoiesis-associated mutations throughout adult life. Am. J. Hum. Genet. 101, 50–64 (2017).

  10. McKerrell, T. et al. Leukemia-associated somatic mutations drive distinct patterns of age-related clonal hemopoiesis. Cell Rep. 10, 1239–1245 (2015).

  11. Wong, T. N. et al. Role of TP53 mutations in the origin and evolution of therapy-related acute myeloid leukaemia. Nature 518, 552–555 (2015).

  12. Yoshizato, T. et al. Somatic mutations and clonal hematopoiesis in aplastic anemia. N. Engl. J. Med. 373, 35–47 (2015).

  13. Krönke, J. et al. Clonal evolution in relapsed NPM1-mutated acute myeloid leukemia. Blood 122, 100–108 (2013).

  14. Papaemmanuil, E. et al. Genomic classification and prognosis in acute myeloid leukemia. N. Engl. J. Med. 374, 2209–2221 (2016).

  15. Forbes, S. A. et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 45, D777–D783 (2017).

  16. Shlush, L. I. et al. Tracing the origins of relapse in acute myeloid leukaemia to stem cells. Nature 547, 104–108 (2017).

  17. Buscarlet, M. et al. DNMT3A and TET2 dominate clonal hematopoiesis and demonstrate benign phenotypes and different genetic predispositions. Blood 130, 753–762 (2017).

  18. Arber, D. A. et al. The 2016 revision to the World Health Organization classification of myeloid neoplasms and acute leukemia. Blood 127, 2391–2405 (2016).

  19. Hu, L. et al. Prognostic value of RDW in cancers: a systematic review and meta-analysis. Oncotarget 8, 16027–16035 (2017).

  20. Balicer, R. D. & Afek, A. Digital health nation: Israel’s global big data innovation hub. Lancet 389, 2451–2453 (2017).

  21. Dagan, N., Cohen-Stavi, C., Leventer-Roberts, M. & Balicer, R. D. External validation and comparison of three prediction tools for risk of osteoporotic fractures using data from population based electronic health records: retrospective cohort study. Br. Med. J. 356, i6755 (2017).

  22. McKerrell, T. & Vassiliou, G. S. Aging as a driver of leukemogenesis. Sci. Transl. Med. 7, 306fs38 (2015).

  23. Vickers, A. J. Prediction models in cancer care. CA Cancer J. Clin. 61, 315–326 (2011).

  24. Cassidy, A. et al. The LLP risk model: an individual risk prediction model for lung cancer. Br. J. Cancer 98, 270–276 (2008).

  25. Wang, X., Oldani, M. J., Zhao, X., Huang, X. & Qian, D. A review of cancer risk prediction models with genetic variants. Cancer Inform. 13, 19–28 (2014).

  26. Fuster, J. J. et al. Clonal hematopoiesis associated with TET2 deficiency accelerates atherosclerosis development in mice. Science 355, 842–847 (2017).

  27. Jaiswal, S. et al. Clonal hematopoiesis and risk of atherosclerotic cardiovascular disease. N. Engl. J. Med. 377, 111–121 (2017).

  28. Martincorena, I. & Campbell, P. J. Somatic mutation in cancer and normal cells. Science 349, 1483–1489 (2015).

  29. Riboli, E. et al. European Prospective Investigation into Cancer and Nutrition (EPIC): study populations and data collection. Public Health Nutr. 5, 1113–1124 (2002).

  30. Newman, A. M. et al. An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nat. Med. 20, 548–554 (2014).

  31. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).

  32. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

  33. Kennedy, S. R. et al. Detecting ultralow-frequency mutations by Duplex Sequencing. Nat. Protoc. 9, 2586–2606 (2014).

  34. Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012).

  35. Yang, H. & Wang, K. Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR. Nat. Protoc. 10, 1556–1566 (2015).

  36. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

  37. Gerstung, M. et al. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nat. Commun. 3, 811 (2012).

  38. Gerstung, M. et al. Precision oncology for acute myeloid leukemia using a knowledge bank approach. Nat. Genet. 49, 332–340 (2017).

  39. Martincorena, I. et al. High burden and pervasive positive selection of somatic mutations in normal human skin. Science 348, 880–886 (2015).

  40. Buels, R. et al. JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol. 17, 66 (2016).

  41. Stephens, P. J. et al. The landscape of cancer genes and mutational processes in breast cancer. Nature 486, 400–404 (2012).

  42. Raine, K. M. et al. cgpPindel: identifying somatically acquired insertion and deletion events from paired end sequencing. Curr. Protoc. Bioinformatics 52, 15.17.1–15.7.12 (2015).

  43. Menzies, A. et al. VAGrENT: Variation Annotation Generator. Curr. Protoc. Bioinformatics 52, 15.18.1–15.18.11 (2015).

  44. Antoniou, A. C. et al. A weighted cohort approach for analysing factors modifying disease risks in carriers of high-risk susceptibility genes. Genet. Epidemiol. 29, 1–11 (2005).

  45. Therneau, T. & Grambsch, P. M. Modeling Survival Data: Extending the Cox Model 1st edn (Springer-Verlag, New York, 2000).

  46. Harrell, F. E. Jr, Lee, K. L. & Mark, D. B. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat. Med. 15, 361–387 (1996).

  47. O’Quigley, J., Xu, R. & Stare, J. Explained randomness in proportional hazards models. Stat. Med. 24, 479–489 (2005).


This work was supported by a Quest for Cure grant to L.I.S., J.C.Y.W. and M.D.M. from the Leukemia and Lymphoma Society, and the following grants to L.I.S. from: ERC Horizon 2020 MAMLE, Abisch-Frenkel foundation and an American Society of Hematology Scholar Award. Further funding to J.E.D. was provided by the Canada Research Chair Program, Ontario Institute for Cancer Research, the province of Ontario, Canadian Cancer Society, the Canadian Institutes for Health Research and the Ontario Ministry of Health and Long Term Care to UHN, whose views are not expressed here. Work conducted at the Sanger Institute was supported by the Wellcome Trust and UK Medical Research Council. S.A. was personally funded by the Benjamin Pearl fellowship from the McEwen Centre for Regenerative Medicine; G.C. by a Wellcome Trust Clinical PhD Fellowship (WT098051); G.S.V. by a Wellcome Trust Senior Fellowship in Clinical Science (WT095663MA) and a Cancer Research UK Senior Cancer Research Fellowship (C22324/A23015). G.S.V.'s laboratory is also funded by the Kay Kendall Leukaemia Fund and Bloodwise. We thank A. Mitchell and all members of the Dick and Shlush laboratories for comments, T. Hudson for early study planning and G. Barabash for organising the Clalit dataset collaboration. The EPIC study centres were supported by the Hellenic Health Foundation, Regional Government of Asturias, the Regional Government of Murcia (no. 6236), the Spanish Ministry of Health network RTICCC (ISCIII RD12/0036/0018), FEDER funds/European Regional Development Fund (ERDF), “a way to build Europe”, Generalitat de Catalunya, AGAUR 2014SGR726; EPIC Ragusa in Italy-Aire-Onlus Ragusa; EPIC Italy-Associazione Italiana per la Ricerca sul Cancro (AIRC) Milan, Italy. S.V.B. and T.J.P. are supported by the Gattuso-Slaight Personalized Cancer Medicine Fund at the Princess Margaret Cancer Centre.

Reviewer information

Nature thanks R. Levine, P. Van Loo and the other anonymous reviewer(s) for their contribution to the peer review of this work.

Author information

S.W.K.N., O.W., N.M.C. and E.N. contributed equally to the work. S.A. performed error-corrected sequencing, analysed sequencing data, performed statistical analyses, contributed to genetic predictive model derivation and wrote the manuscript. G.C. performed variant calling, statistical analyses, derived genetic predictive models and wrote the manuscript. M.G., S.W.K.N., O.W. and R.C. derived genetic predictive models. N.M.C., E.N. and N.B. derived the clinical prediction model. P.C.Z., Z.Z., I.C., K.N., C.L., C.H., D.H., F.M., J.E., J.K.M., D.P., L.T., P.K., S.V.B., A.Br. and A.Ba. provided sequencing and technical support and enabled sample acquisition. L.H., Y.S., T.T.W., T.J.P., K.R. and D.J. provided bioinformatics support. R.L., S.H., M.J., K.M.B., A.Kr. and N.J.W. enabled sample acquisition, clinical data curation and/or provided clinical expertise. D.S., J.D.M., P.A., E.S., S.B., P.Be., M.D.M. and I.M. contributed to data analysis and interpretation. P.J.C. and E.P. contributed to data interpretation and designed the targeted sequencing assay for the validation cohort. J.C.Y.W. revised the manuscript. J.R.Q., A.Ka., C.L.V., A.T., E.S.-F., J.M.H., R.C.T., R.T., G.M., H.B., S.Pa., R.K., S.S., S.Po., N.J.W., N.S., K.-T.K., M.F., J.M.K., E.R., P.V. and R.V. enabled sample acquisition (EPIC). A.T. and R.D.B. analysed Clalit data and derived the clinical prediction model. M.G. derived predictive genetic models, contributed to sequencing data analysis and manuscript writing. J.E.D. contributed to funding applications, study supervision and manuscript writing. P.Br. supervised sample acquisition from all EPIC centres. G.S.V. and L.I.S. designed and supervised all aspects of the study and wrote the manuscript.

Corresponding authors

Correspondence to Moritz Gerstung, John E. Dick, Paul Brennan, George S. Vassiliou or Liran I. Shlush.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Prevalence of ARCH-PD mutations with VAF ≥ 10% according to age.

Red and blue lines represent the proportion of pre-AML cases and controls, respectively, that had ARCH-PD mutations with VAF ≥ 10%.

Extended Data Fig. 2 Serially collected sampling supports long-lived HSPCs as the cell of origin for most ARCH-PD clones.

a, b, VAF trajectory of persistent clones carrying putative driver mutations in controls (a) and pre-AML cases (b). Age is indicated on the x axis. Top, VAF is shown on the y axis and each persistent mutation is shown in a different colour, with circles denoting individual serial samples and solid lines representing the growth trajectory between serial samples. Bottom, dashed lines indicate the time interval between the last sampling and the end of follow-up (controls) or AML diagnosis (cases). c, Clonal growth rates (α) are shown for 27 control clones corresponding to 54 time points and 13 pre-AML clones corresponding to 15 time points. Box plot centres, hinges and whiskers represent the median, first and third quartiles and 1.5 × interquartile range, respectively.
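The box plot convention described in this caption (median, first and third quartiles, whiskers at 1.5 × interquartile range) can be illustrated with a short Python sketch. The input values and the function name `box_plot_summary` are hypothetical. The sketch assumes the common Tukey-style reading, in which each whisker extends to the most extreme observation lying within 1.5 × IQR of its hinge; the quartile interpolation method used by the authors is not stated in the caption, so the `inclusive` method below is also an assumption.

```python
from statistics import quantiles
from typing import Dict, Sequence


def box_plot_summary(values: Sequence[float]) -> Dict[str, object]:
    """Summarise values the way a Tukey-style box plot would draw them."""
    data = sorted(values)
    # Quartiles by linear interpolation between observations (assumed method).
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "median": median,
        "hinges": (q1, q3),
        "whiskers": (inside[0], inside[-1]),
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Hypothetical values for illustration only; not data from the study.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

Under this reading, any observation beyond the fences would be drawn as an individual point rather than being covered by a whisker.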

Extended Data Fig. 3 Performance of the combined model in predicting progression to AML.

a, Receiver operating characteristic curve for prediction of AML development using model 1 (see Methods). The red dot indicates the point on the curve with the highest positive predictive value with sensitivity of 41.9% and specificity of 95.7%. b, c, Kaplan–Meier estimates of time to AML diagnosis for individuals predicted to develop AML (red) and not develop AML (blue) using model 1 (b; hazard ratio, 10.38; P = 4.2 × 10⁻¹⁰, Wald test) and model 2 (c; hazard ratio, 10.75; P = 1.75 × 10⁻⁸, Wald test), from the point of enrolment until the end of follow-up for patients enrolled in the EPIC study.
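For readers less familiar with the operating-point metrics quoted in this caption, the following Python sketch shows how sensitivity, specificity and positive predictive value (PPV) follow from a confusion matrix, and how PPV depends on disease prevalence. The confusion-matrix counts and the prevalence value are hypothetical and are not taken from the study; only the sensitivity (41.9%) and specificity (95.7%) passed to `ppv_from_rates` come from the caption. All function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # cases predicted to progress
    fp: int  # controls predicted to progress
    tn: int  # controls predicted not to progress
    fn: int  # cases predicted not to progress


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of true cases that the model flags."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of true controls that the model leaves unflagged."""
    return c.tn / (c.tn + c.fp)


def ppv(c: ConfusionCounts) -> float:
    """Fraction of flagged individuals who are true cases."""
    return c.tp / (c.tp + c.fp)


def ppv_from_rates(sens: float, spec: float, prevalence: float) -> float:
    """PPV via Bayes' rule for a population with the given prevalence."""
    true_pos = sens * prevalence
    false_pos = (1.0 - spec) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical counts for illustration only.
example = ConfusionCounts(tp=40, fp=20, tn=380, fn=60)
print(sensitivity(example), specificity(example), ppv(example))

# Caption's operating point applied to a hypothetical prevalence of 0.1%.
rare_disease_ppv = ppv_from_rates(sens=0.419, spec=0.957, prevalence=0.001)
print(rare_disease_ppv)
```

The second calculation illustrates why a rare disease is hard to screen for: even at high specificity, most flagged individuals in a low-prevalence population would be false positives.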

Extended Data Fig. 4 AML predictive models.

a–c, Time-dependent receiver operating characteristic curve for Cox proportional hazards model trained on the discovery cohort (n = 505 unique individuals, 91 pre-AML and 414 controls) (a), validation cohort (n = 291 unique individuals, 29 pre-AML and 262 controls) (b) and combined cohorts (c). d–f, Dynamic AUC for Cox proportional hazards models trained on the discovery cohort (d), validation cohort (e) or combined cohort (f). g, h, Red and blue bars indicate the observed and expected VAF (g) and driver frequency (h) of pre-AML cases and controls for each gene indicated on the x axis.

Extended Data Fig. 5 AML-free survival based on mutation status and RDW.

a, Kaplan–Meier curves of AML-free survival, defined as the time between sample collection and AML diagnosis, death or last follow-up. Survival curves are stratified according to mutation status in genes mutated in at least three samples across the combined validation and discovery cohorts. n = 796 unique individuals. b, Kaplan–Meier curve of AML-free survival stratified according to RDW value >14 or ≤14. Plot represents data for n = 128 biologically independent individuals who had RDW measurements, including all pre-AML cases regardless of ARCH-PD status, and controls with ARCH-PD (controls without detectable mutations were omitted).
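The Kaplan–Meier curves referred to in this caption are built from follow-up times and a flag marking whether the event was observed or the individual was censored. A minimal pure-Python sketch of the product-limit estimator is given below. The follow-up times, event flags and the function name `kaplan_meier` are hypothetical, and the sketch does not reproduce the authors' analysis or their exact event definition.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(
    times: Sequence[float], events: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    times  : follow-up duration for each individual
    events : True if the event was observed at that time, False if censored
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    event_times = sorted({t for t, e in zip(times, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Individuals still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Events occurring exactly at time t.
        n_events = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - n_events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up in years; False marks censoring at last follow-up.
times = [2, 3, 3, 5, 8, 9]
events = [True, False, True, True, False, False]
curve = kaplan_meier(times, events)
print(curve)
```

Stratified curves, such as those split by mutation status or by RDW above or below 14, would be obtained by applying the same estimator separately to each subgroup.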

Extended Data Fig. 6 Description of the cohort and the EHR-derived measurements.

a, Kaplan–Meier curves showing age-stratified survival rates for 875 individuals who developed AML. b, Line plot representation of the number of cases per 100,000 control individuals in the EHR database. The centre values and error bars define the mean and s.d., respectively.

Extended Data Fig. 7 Laboratory measurements contributing to the EHR model.

Normalized laboratory measurements for pre-AMLs (red) and controls (blue) (middle) and their association (bottom) with higher risk of AML are shown. The grey bars indicate the percentage of pre-AML cases with laboratory results either below the 1st percentile or above the 99th percentile. Box plot centres, hinges and whiskers represent the median, first and third quartiles and 1.5 × interquartile range, respectively.

Extended Data Fig. 8 Top 50 parameters for the EHR model.

The relative contribution of the top 50 features incorporated into the EHR prediction model, ranked according to their predictive value (gain). 1Y, one year before AML diagnosis; 2Y, two years before AML diagnosis; 3Y, three years before AML diagnosis; BASO%, percentage of basophils; BMI, body mass index; EOS.abs, absolute eosinophil count; EOS%, percentage of eosinophils; HYPO%, percentage of hypochromia; LUC, large unstained cells; LYM%, percentage of lymphocytes; LYMPH.abs, absolute lymphocyte count; MACRO%, percentage of macrocytosis; MCH, mean corpuscular haemoglobin; MCV, mean corpuscular volume; MON%, percentage of monocytes; MONO.abs, absolute monocyte count; MPV, mean platelet volume; NEUT.abs, absolute neutrophil count; NEUT%, percentage of neutrophils; PLT, platelet count; RBC, red blood cell count; RDW, red cell distribution width; WBC, white blood cell count.

Extended Data Fig. 9 Distribution of EHR model parameters.

Heat map illustrating absolute values of clinical measurements. Blue, white and red indicate low, intermediate and high values, respectively. Light grey indicates missing data. False-negative and true-positive annotations are indicated at the bottom as dark-grey and yellow colour bars, respectively. 1Y, one year before AML diagnosis; 2Y, two years before AML diagnosis; 3Y, three years before AML diagnosis; BASO%, percentage of basophils; EOS%, percentage of eosinophils; EOS.abs, absolute eosinophil count; HCT, haematocrit; HDL, high density lipoprotein; HGB, haemoglobin; Hyper%, percentage of hyperchromia; Hypo%, percentage of hypochromia; LDL, low density lipoprotein; LUC, large unstained cells; LYM%, percentage of lymphocytes; LYMP.abs, absolute lymphocyte count; MACRO%, percentage of macrocytosis; MCH, mean corpuscular haemoglobin; MCHC, mean corpuscular haemoglobin concentration; MCV, mean corpuscular volume; MICR%, percentage of microcytosis; MON%, percentage of monocytes; MONO.abs, absolute monocyte count; MPV, mean platelet volume; PLT, platelet count; NEUT%, percentage of neutrophils; NEUT.abs, absolute neutrophil count; RBC, red blood cell count; RDW, red cell distribution width; Transamina, transaminase; Transpeptid., transpeptidase; TSH, thyroid stimulating hormone; WBC, white blood cell count.

Extended Data Table 1 Genes sequenced by cRNA bait pull-down in the validation cohort

Supplementary Information

Supplementary Note - Genetic model related code. Code for the derivation of the genetic AML prediction model.

Reporting Summary

Supplementary Table 1

Clinical characteristics of the discovery and validation cohorts: This table contains survival and other available clinical metadata for the study cohorts.

Supplementary Table 2

ARCH-PD mutations: This table lists putative oncogenic mutations.

Supplementary Table 3

Genetic models performance and coefficients: This table contains genetic AML prediction model coefficients and performance metrics.

Supplementary Table 4

Features and parameters of the EHR based model: This table details the criteria for AML case ascertainment for the clinical AML prediction model as well as clinical features included and parameters used for machine learning.

Cite this article

Abelson, S., Collord, G., Ng, S.W.K. et al. Prediction of acute myeloid leukaemia risk in healthy individuals. Nature 559, 400–404 (2018).
