Letter | Published:

Prediction of acute myeloid leukaemia risk in healthy individuals

Nature volume 559, pages 400–404 (2018)


The incidence of acute myeloid leukaemia (AML) increases with age, and mortality exceeds 90% when the disease is diagnosed after age 65. Most cases arise without any detectable early symptoms and patients usually present with the acute complications of bone marrow failure [1]. The onset of such de novo AML cases is typically preceded by the accumulation of somatic mutations in preleukaemic haematopoietic stem and progenitor cells (HSPCs) that undergo clonal expansion [2,3]. However, recurrent AML mutations also accumulate in HSPCs during ageing of healthy individuals who do not develop AML, a phenomenon referred to as age-related clonal haematopoiesis (ARCH) [4–8]. Here we use deep sequencing to analyse genes that are recurrently mutated in AML to distinguish between individuals who have a high risk of developing AML and those with benign ARCH. We analysed peripheral blood cells from 95 individuals that were obtained on average 6.3 years before AML diagnosis (pre-AML group), together with 414 unselected age- and gender-matched individuals (control group). Pre-AML cases were distinct from controls: they had more mutations per sample, higher variant allele frequencies (indicating greater clonal expansion) and enrichment of mutations in specific genes. Genetic parameters were used to derive a model that accurately predicted AML-free survival; this model was validated in an independent cohort of 29 pre-AML cases and 262 controls. Because AML is rare, we also developed an AML predictive model using a large electronic health record database that identified individuals at greater risk. Collectively, our findings provide proof-of-concept that it is possible to discriminate ARCH from pre-AML many years before malignant transformation. This could in future enable earlier detection and monitoring, and may help to inform intervention.
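As a rough illustration of why a higher variant allele frequency (VAF) is read as greater clonal expansion, the sketch below computes a VAF from read counts and converts it to an approximate clone size. The function names and the read counts are hypothetical, and the conversion assumes a heterozygous mutation at a diploid locus; it is not taken from the study's own pipeline.

```python
def variant_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Fraction of sequencing reads at a locus that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must lie between 0 and total_reads")
    return alt_reads / total_reads


def clone_cell_fraction(vaf: float) -> float:
    """Approximate fraction of cells belonging to the mutant clone.

    Assumption (not from the source): the mutation is heterozygous and the
    locus is diploid, so each mutant cell contributes one variant allele
    out of two. The result is capped at 1.0.
    """
    return min(1.0, 2.0 * vaf)


# Hypothetical example: 150 variant reads out of 1,500 reads at a locus.
vaf = variant_allele_frequency(alt_reads=150, total_reads=1500)
print(f"VAF = {vaf:.1%}, approximate clone size = {clone_cell_fraction(vaf):.1%}")
```

Under these assumptions, a VAF at the 10% threshold used in Extended Data Fig. 1 would correspond to roughly one in five blood cells carrying the mutation.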

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


References

  1. Deschler, B. & Lübbert, M. Acute myeloid leukemia: epidemiology and etiology. Cancer 107, 2099–2107 (2006).

  2. Corces-Zimmerman, M. R., Hong, W. J., Weissman, I. L., Medeiros, B. C. & Majeti, R. Preleukemic mutations in human acute myeloid leukemia affect epigenetic regulators and persist in remission. Proc. Natl Acad. Sci. USA 111, 2548–2553 (2014).

  3. Shlush, L. I. et al. Identification of pre-leukaemic haematopoietic stem cells in acute leukaemia. Nature 506, 328–333 (2014).

  4. Genovese, G. et al. Clonal hematopoiesis and blood-cancer risk inferred from blood DNA sequence. N. Engl. J. Med. 371, 2477–2487 (2014).

  5. Jaiswal, S. et al. Age-related clonal hematopoiesis associated with adverse outcomes. N. Engl. J. Med. 371, 2488–2498 (2014).

  6. Xie, M. et al. Age-related mutations associated with clonal hematopoietic expansion and malignancies. Nat. Med. 20, 1472–1478 (2014).

  7. Busque, L. et al. Nonrandom X-inactivation patterns in normal females: lyonization ratios vary with age. Blood 88, 59–65 (1996).

  8. Shlush, L. I. Age-related clonal hematopoiesis. Blood 131, 496–504 (2018).

  9. Acuna-Hidalgo, R. et al. Ultra-sensitive sequencing identifies high prevalence of clonal hematopoiesis-associated mutations throughout adult life. Am. J. Hum. Genet. 101, 50–64 (2017).

  10. McKerrell, T. et al. Leukemia-associated somatic mutations drive distinct patterns of age-related clonal hemopoiesis. Cell Rep. 10, 1239–1245 (2015).

  11. Wong, T. N. et al. Role of TP53 mutations in the origin and evolution of therapy-related acute myeloid leukaemia. Nature 518, 552–555 (2015).

  12. Yoshizato, T. et al. Somatic mutations and clonal hematopoiesis in aplastic anemia. N. Engl. J. Med. 373, 35–47 (2015).

  13. Krönke, J. et al. Clonal evolution in relapsed NPM1-mutated acute myeloid leukemia. Blood 122, 100–108 (2013).

  14. Papaemmanuil, E. et al. Genomic classification and prognosis in acute myeloid leukemia. N. Engl. J. Med. 374, 2209–2221 (2016).

  15. Forbes, S. A. et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 45, D777–D783 (2017).

  16. Shlush, L. I. et al. Tracing the origins of relapse in acute myeloid leukaemia to stem cells. Nature 547, 104–108 (2017).

  17. Buscarlet, M. et al. DNMT3A and TET2 dominate clonal hematopoiesis and demonstrate benign phenotypes and different genetic predispositions. Blood 130, 753–762 (2017).

  18. Arber, D. A. et al. The 2016 revision to the World Health Organization classification of myeloid neoplasms and acute leukemia. Blood 127, 2391–2405 (2016).

  19. Hu, L. et al. Prognostic value of RDW in cancers: a systematic review and meta-analysis. Oncotarget 8, 16027–16035 (2017).

  20. Balicer, R. D. & Afek, A. Digital health nation: Israel’s global big data innovation hub. Lancet 389, 2451–2453 (2017).

  21. Dagan, N., Cohen-Stavi, C., Leventer-Roberts, M. & Balicer, R. D. External validation and comparison of three prediction tools for risk of osteoporotic fractures using data from population based electronic health records: retrospective cohort study. Br. Med. J. 356, i6755 (2017).

  22. McKerrell, T. & Vassiliou, G. S. Aging as a driver of leukemogenesis. Sci. Transl. Med. 7, 306fs38 (2015).

  23. Vickers, A. J. Prediction models in cancer care. CA Cancer J. Clin. 61, 315–326 (2011).

  24. Cassidy, A. et al. The LLP risk model: an individual risk prediction model for lung cancer. Br. J. Cancer 98, 270–276 (2008).

  25. Wang, X., Oldani, M. J., Zhao, X., Huang, X. & Qian, D. A review of cancer risk prediction models with genetic variants. Cancer Inform. 13, 19–28 (2014).

  26. Fuster, J. J. et al. Clonal hematopoiesis associated with TET2 deficiency accelerates atherosclerosis development in mice. Science 355, 842–847 (2017).

  27. Jaiswal, S. et al. Clonal hematopoiesis and risk of atherosclerotic cardiovascular disease. N. Engl. J. Med. 377, 111–121 (2017).

  28. Martincorena, I. & Campbell, P. J. Somatic mutation in cancer and normal cells. Science 349, 1483–1489 (2015).

  29. Riboli, E. et al. European Prospective Investigation into Cancer and Nutrition (EPIC): study populations and data collection. Public Health Nutr. 5, 1113–1124 (2002).

  30. Newman, A. M. et al. An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nat. Med. 20, 548–554 (2014).

  31. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).

  32. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

  33. Kennedy, S. R. et al. Detecting ultralow-frequency mutations by Duplex Sequencing. Nat. Protoc. 9, 2586–2606 (2014).

  34. Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012).

  35. Yang, H. & Wang, K. Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR. Nat. Protoc. 10, 1556–1566 (2015).

  36. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

  37. Gerstung, M. et al. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nat. Commun. 3, 811 (2012).

  38. Gerstung, M. et al. Precision oncology for acute myeloid leukemia using a knowledge bank approach. Nat. Genet. 49, 332–340 (2017).

  39. Martincorena, I. et al. High burden and pervasive positive selection of somatic mutations in normal human skin. Science 348, 880–886 (2015).

  40. Buels, R. et al. JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol. 17, 66 (2016).

  41. Stephens, P. J. et al. The landscape of cancer genes and mutational processes in breast cancer. Nature 486, 400–404 (2012).

  42. Raine, K. M. et al. cgpPindel: identifying somatically acquired insertion and deletion events from paired end sequencing. Curr. Protoc. Bioinformatics 52, 15.17.1–15.7.12 (2015).

  43. Menzies, A. et al. VAGrENT: Variation Annotation Generator. Curr. Protoc. Bioinformatics 52, 15.18.1–15.18.11 (2015).

  44. Antoniou, A. C. et al. A weighted cohort approach for analysing factors modifying disease risks in carriers of high-risk susceptibility genes. Genet. Epidemiol. 29, 1–11 (2005).

  45. Therneau, T. & Grambsch, P. M. Modeling Survival Data: Extending the Cox Model 1st edn (Springer-Verlag, New York, 2000).

  46. Harrell, F. E. Jr, Lee, K. L. & Mark, D. B. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat. Med. 15, 361–387 (1996).

  47. O’Quigley, J., Xu, R. & Stare, J. Explained randomness in proportional hazards models. Stat. Med. 24, 479–489 (2005).


Acknowledgements

This work was supported by a Quest for Cure grant to L.I.S., J.C.Y.W. and M.D.M. from the Leukemia and Lymphoma Society, and by the following grants to L.I.S.: ERC Horizon 2020 MAMLE, Abisch-Frenkel foundation and an American Society of Hematology Scholar Award. Further funding to J.E.D. was provided by the Canada Research Chair Program, Ontario Institute for Cancer Research, the province of Ontario, Canadian Cancer Society, the Canadian Institutes for Health Research and the Ontario Ministry of Health and Long Term Care to UHN, whose views are not expressed here. Work conducted at the Sanger Institute was supported by the Wellcome Trust and UK Medical Research Council. S.A. was personally funded by the Benjamin Pearl fellowship from the McEwen Centre for Regenerative Medicine; G.C. by a Wellcome Trust Clinical PhD Fellowship (WT098051); and G.S.V. by a Wellcome Trust Senior Fellowship in Clinical Science (WT095663MA) and a Cancer Research UK Senior Cancer Research Fellowship (C22324/A23015). G.S.V.'s laboratory is also funded by the Kay Kendall Leukaemia Fund and Bloodwise. We thank A. Mitchell and all members of the Dick and Shlush laboratories for comments, T. Hudson for early study planning and G. Barabash for organising the Clalit dataset collaboration. The EPIC study centres were supported by the Hellenic Health Foundation, Regional Government of Asturias, the Regional Government of Murcia (no. 6236), the Spanish Ministry of Health network RTICCC (ISCIII RD12/0036/0018), FEDER funds/European Regional Development Fund (ERDF), “a way to build Europe”, Generalitat de Catalunya, AGAUR 2014SGR726; EPIC Ragusa in Italy-Aire-Onlus Ragusa; EPIC Italy-Associazione Italiana per la Ricerca sul Cancro (AIRC) Milan, Italy. S.V.B. and T.J.P. are supported by the Gattuso-Slaight Personalized Cancer Medicine Fund at the Princess Margaret Cancer Centre.

Reviewer information

Nature thanks R. Levine, P. Van Loo and the other anonymous reviewer(s) for their contribution to the peer review of this work.

Author information

Author notes

  1. These authors contributed equally: Sagi Abelson, Grace Collord

  2. These authors jointly supervised this work: Moritz Gerstung, John E. Dick, Paul Brennan, George S. Vassiliou, Liran I. Shlush


  1. Princess Margaret Cancer Centre, University Health Network (UHN), Toronto, Ontario, Canada

    • Sagi Abelson
    • Ting Ting Wang
    • Zhen Zhao
    • Iulia Cirlan
    • Trevor J. Pugh
    • Scott V. Bratman
    • Jean C. Y. Wang
    • Mark D. Minden
    • John E. Dick
    • Liran I. Shlush
  2. Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK

    • Grace Collord
    • Calli Latimer
    • Claire Hardy
    • Keiran Raine
    • David Jones
    • Philip Beer
    • Sam Behjati
    • Inigo Martincorena
    • Peter J. Campbell
    • Elli Papaemmanuil
    • Moritz Gerstung
    • George S. Vassiliou
  3. Department of Paediatrics, University of Cambridge, Cambridge, UK

    • Grace Collord
    • Sam Behjati
  4. Institute of Biomaterials and Biomedical Engineering, University of Toronto, Toronto, Ontario, Canada

    • Stanley W. K. Ng
  5. Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel

    • Omer Weissbrod
    • Netta Mendelson Cohen
    • Eran Segal
    • Amos Tanay
  6. Department of Immunology, Weizmann Institute of Science, Rehovot, Israel

    • Elisabeth Niemeyer
    • Liran I. Shlush
  7. Clalit Research Institute, Tel Aviv, Israel

    • Noam Barda
    • Ran D. Balicer
  8. Ontario Institute for Cancer Research, Toronto, Ontario, Canada

    • Philip C. Zuzarte
    • Lawrence Heisler
    • Yogi Sundaravadanam
    • Trevor J. Pugh
    • David Soave
    • Karen Ng
    • John D. McPherson
    • Faridah Mbabaali
    • Jenna Eagles
    • Jessica K. Miller
    • Danielle Pasternack
    • Lee Timms
    • Paul Krzyzanowski
    • Philip Awadalla
    • Scott V. Bratman
  9. Department of Public Health and Primary Care, Institute of Public Health, University of Cambridge School of Clinical Medicine, Cambridge, UK

    • Robert Luben
    • Shabina Hayat
  10. Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada

    • Ting Ting Wang
    • Trevor J. Pugh
    • Mark D. Minden
  11. MRC Epidemiology Unit, University of Cambridge, Cambridge, UK

    • Diana Hoult
    • Abigail Britten
    • Nicholas J. Wareham
  12. International Agency for Research on Cancer, World Health Organization, Lyon, France

    • Mattias Johansson
    • Matthieu Foll
    • James McKay
    • Paul Brennan
  13. European Molecular Biology Laboratory, European Bioinformatics Institute EMBL-EBI, Wellcome Genome Campus, Hinxton, UK

    • Rui Costa
    • Moritz Gerstung
  14. Department of Radiation Oncology, University of Toronto, Toronto, Ontario, Canada

    • Scott V. Bratman
  15. Department of Medicine, University of Toronto, Toronto, Ontario, Canada

    • Jean C. Y. Wang
    • Mark D. Minden
  16. Division of Medical Oncology and Hematology, University Health Network, Toronto, Ontario, Canada

    • Jean C. Y. Wang
    • Mark D. Minden
  17. Department of Molecular Haematology, Norwich Medical School, The University of East Anglia, Norwich, UK

    • Kristian M. Bowles
  18. Department of Haematology, Norfolk and Norwich University Hospitals NHS Trust, Norwich, UK

    • Kristian M. Bowles
  19. Public Health Directorate, Asturias, Spain

    • J. Ramón Quirós
  20. Hellenic Health Foundation, Athens, Greece

    • Anna Karakatsani
    • Carlo La Vecchia
    • Antonia Trichopoulou
  21. 2nd Pulmonary Medicine Department, School of Medicine, National and Kapodistrian University of Athens, “ATTIKON” University Hospital, Haidari, Athens, Greece

    • Anna Karakatsani
  22. Department of Clinical Sciences and Community Health, Università degli Studi di Milano, Milan, Italy

    • Carlo La Vecchia
  23. Escuela Andaluza de Salud Pública, Instituto de Investigación Biosanitaria ibs.GRANADA, Hospitales Universitarios de Granada/Universidad de Granada, Granada, Spain

    • Elena Salamanca-Fernández
  24. CIBER Epidemiology and Public Health CIBERESP, Madrid, Spain

    • Elena Salamanca-Fernández
    • José M. Huerta
    • Aurelio Barricarte
  25. Department of Epidemiology, Murcia Regional Health Council, IMIB-Arrixaca, Murcia, Spain

    • José M. Huerta
  26. Navarra Public Health Institute, Pamplona, Spain

    • Aurelio Barricarte
  27. Navarra Institute for Health Research, Pamplona, Spain

    • Aurelio Barricarte
  28. Cancer Epidemiology Unit, Nuffield Department of Population Health, University of Oxford, Oxford, UK

    • Ruth C. Travis
  29. Cancer Registry and Histopathology Department, Civic-M. P. Arezzo Hospital, Azienda Sanitaria Provinciale, Ragusa, Italy

    • Rosario Tumino
  30. Cancer Risk Factors and Life-Style Epidemiology Unit, Cancer Research and Prevention Institute – ISPO, Florence, Italy

    • Giovanna Masala
  31. Department of Epidemiology, German Institute of Human Nutrition (DIfE), Potsdam-Rehbrücke, Germany

    • Heiner Boeing
  32. Dipartimento Di Medicina Clinica E Chirurgia, Federico II University, Naples, Italy

    • Salvatore Panico
  33. Division of Cancer Epidemiology, German Cancer Research Center (DKFZ), Heidelberg, Germany

    • Rudolf Kaaks
  34. Clinical Cooperation Unit Molecular Hematology/Oncology, German Cancer Research Center (DKFZ) and Department of Internal Medicine V, University of Heidelberg, Heidelberg, Germany

    • Alwin Krämer
  35. Epidemiology and Prevention Unit, Fondazione IRCCS Istituto Nazionale dei Tumori, Milano, Italy

    • Sabina Sieri
  36. Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, London, UK

    • Elio Riboli
    • Paolo Vineis
  37. Italian Institute for Genomic Medicine, Torino, Italy

    • Silvia Polidoro
  38. Unit of Nutrition and Cancer, Cancer Epidemiology Research Program and Translational Research Laboratory, Catalan Institute of Oncology, ICO-IDIBELL, Barcelona, Spain

    • Núria Sala
  39. University of Cambridge, Cambridge, UK

    • Kay-Tee Khaw
  40. Division of Environmental Epidemiology and Veterinary Public Health, Institute for Risk Assessment Sciences, Utrecht University, Utrecht, The Netherlands

    • Roel Vermeulen
  41. Department of Haematology, University of Cambridge, Cambridge, UK

    • Peter J. Campbell
    • George S. Vassiliou
  42. Center for Molecular Oncology and Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA

    • Elli Papaemmanuil
  43. Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada

    • John E. Dick
  44. Wellcome Trust–Medical Research Council Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK

    • George S. Vassiliou
  45. Division of Hematology, Rambam Healthcare Campus, Haifa, Israel

    • Liran I. Shlush



Contributions

S.W.K.N., O.W., N.M.C. and E.N. contributed equally to the work. S.A. performed error-corrected sequencing, analysed sequencing data, performed statistical analyses, contributed to genetic predictive model derivation and wrote the manuscript. G.C. performed variant calling, statistical analyses, derived genetic predictive models and wrote the manuscript. M.G., S.W.K.N., O.W. and R.C. derived genetic predictive models. N.M.C., E.N. and N.B. derived the clinical prediction model. P.C.Z., Z.Z., I.C., K.N., C.L., C.H., D.H., F.M., J.E., J.K.M., D.P., L.T., P.K., S.V.B., A.Br. and A.Ba. provided sequencing and technical support and enabled sample acquisition. L.H., Y.S., T.T.W., T.J.P., K.R. and D.J. provided bioinformatics support. R.L., S.H., M.J., K.M.B., A.Kr. and N.J.W. enabled sample acquisition, clinical data curation and/or provided clinical expertise. D.S., J.D.M., P.A., E.S., S.B., P.Be., M.D.M. and I.M. contributed to data analysis and interpretation. P.J.C. and E.P. contributed to data interpretation and designed the targeted sequencing assay for the validation cohort. J.C.Y.W. revised the manuscript. J.R.Q., A.Ka., C.L.V., A.T., E.S.-F., J.M.H., R.C.T., R.T., G.M., H.B., S.Pa., R.K., S.S., S.Po., N.J.W., N.S., K.-T.K., M.F., J.M.K., E.R., P.V. and R.V. enabled sample acquisition (EPIC). A.T. and R.D.B. analysed Clalit data and derived the clinical prediction model. M.G. derived predictive genetic models, contributed to sequencing data analysis and manuscript writing. J.E.D. contributed to funding applications, study supervision and manuscript writing. P.Br. supervised sample acquisition from all EPIC centres. G.S.V. and L.I.S. designed and supervised all aspects of the study and wrote the manuscript.

Competing interests

The authors declare no competing interests.

Corresponding authors

Correspondence to Moritz Gerstung or John E. Dick or Paul Brennan or George S. Vassiliou or Liran I. Shlush.

Extended data figures and tables

  1. Extended Data Fig. 1 Prevalence of ARCH-PD mutations with VAF ≥ 10% according to age.

    Red and blue lines represent the proportion of pre-AML cases and controls, respectively, that had ARCH-PD mutations with VAF ≥ 10%.

  2. Extended Data Fig. 2 Serially collected sampling supports a long-lived HSPC as the cell of origin for most ARCH-PD clones.

    a, b, VAF trajectory of persistent clones carrying putative driver mutations in controls (a) and pre-AML cases (b). Age is indicated on the x axis. Top, VAF is shown on the y axis and each persistent mutation is shown in a different colour, with circles denoting individual serial samples and solid lines representing the growth trajectory between serial samples. Bottom, dashed lines indicate the time interval between the last sampling and the end of follow-up (controls) or AML diagnosis (cases). c, Clonal growth rates (α) are shown for 27 control clones corresponding to 54 time points and 13 pre-AML clones corresponding to 15 time points. Box plot centres, hinges and whiskers represent the median, first and third quartiles and 1.5 × interquartile range, respectively.

  3. Extended Data Fig. 3 Performance of the combined model in predicting progression to AML.

    a, Receiver operating characteristic curve for prediction of AML development using model 1 (see Methods). The red dot indicates the point on the curve with the highest positive predictive value, with a sensitivity of 41.9% and a specificity of 95.7%. b, c, Kaplan–Meier estimates of time to AML diagnosis for individuals predicted to develop AML (red) and not develop AML (blue) using model 1 (b; hazard ratio, 10.38; P = 4.2 × 10⁻¹⁰, Wald test) and model 2 (c; hazard ratio, 10.75; P = 1.75 × 10⁻⁸, Wald test), from the point of enrolment until the end of follow-up for patients enrolled in the EPIC study.

  4. Extended Data Fig. 4 AML predictive models.

    a–c, Time-dependent receiver operating characteristic curve for Cox proportional hazards model trained on the discovery cohort (n = 505 unique individuals, 91 pre-AML and 414 controls) (a), validation cohort (n = 291 unique individuals, 29 pre-AML and 262 controls) (b) and combined cohorts (c). d–f, Dynamic AUC for Cox proportional hazards models trained on the discovery cohort (d), validation cohort (e) or combined cohort (f). g, h, Red and blue bars indicate the observed and expected VAF (g) and driver frequency (h) of pre-AML cases and controls for each gene indicated on the x axis.

  5. Extended Data Fig. 5 AML-free survival based on mutation status and RDW.

    a, Kaplan–Meier curves of AML-free survival, defined as the time between sample collection and AML diagnosis, death or last follow-up. Survival curves are stratified according to mutation status in genes mutated in at least three samples across the combined validation and discovery cohorts. n = 796 unique individuals. b, Kaplan–Meier curve of AML-free survival stratified according to RDW value >14 or ≤14. Plot represents data for n = 128 biologically independent individuals who had RDW measurements, including all pre-AML cases regardless of ARCH-PD status, and controls with ARCH-PD (controls without detectable mutations were omitted).

  6. Extended Data Fig. 6 Description of the cohort and the EHR-derived measurements.

    a, Kaplan–Meier curves showing age stratified survival rates for 875 individuals who developed AML. b, Line plot representation of the number of cases per 100,000 control individuals in the EHR database. The centre values and error bars define the mean and s.d., respectively.

  7. Extended Data Fig. 7 Laboratory measurements contributing to the EHR model.

    Normalized laboratory measurements for pre-AMLs (red) and controls (blue) (middle) and their association (bottom) with higher risk of AML are shown. The grey bars indicate the percentage of pre-AML cases with laboratory results either below the 1st percentile or above the 99th percentile. Box plot centres, hinges and whiskers represent the median, first and third quartiles and 1.5 × interquartile range, respectively.

  8. Extended Data Fig. 8 Top 50 parameters for the EHR model.

    The relative contribution of the top 50 features incorporated into the EHR prediction model, ranked according to their predictive value (gain). 1Y, one year before AML diagnosis; 2Y, two years before AML diagnosis; 3Y, three years before AML diagnosis; BASO%, percentage of basophils; BMI, body mass index; EOS.abs, absolute eosinophil count; EOS%, percentage of eosinophils; HYPO%, percentage of hypochromia; LUC, large unstained cells; LYM%, percentage of lymphocytes; LYMPH.abs, absolute lymphocyte count; MACRO%, percentage of macrocytosis; MCH, mean corpuscular haemoglobin; MCV, mean corpuscular volume; MON%, percentage of monocytes; MONO.abs, absolute monocyte count; MPV, mean platelet volume; NEUT.abs, absolute neutrophil count; NEUT%, percentage of neutrophils; PLT, platelet count; RBC, red blood cell count; RDW, red cell distribution width; WBC, white blood cell count.

  9. Extended Data Fig. 9 Distribution of EHR model parameters.

    Heat map illustrating absolute values of clinical measurements. Blue, white and red indicate low, intermediate and high values, respectively. Light grey indicates missing data. False-negative and true-positive annotations are indicated at the bottom as dark-grey and yellow colour bars, respectively. 1Y, one year before AML diagnosis; 2Y, two years before AML diagnosis; 3Y, three years before AML diagnosis; BASO%, percentage of basophils; EOS%, percentage of eosinophils; EOS.abs, absolute eosinophil count; HCT, haematocrit; HDL, high density lipoprotein; HGB, haemoglobin; Hyper%, percentage of hyperchromia; Hypo%, percentage of hypochromia; LDL, low density lipoprotein; LUC, large unstained cells; LYM%, percentage of lymphocytes; LYMP.abs, absolute lymphocyte count; MACRO%, percentage of macrocytosis; MCH, mean corpuscular haemoglobin; MCHC, mean corpuscular haemoglobin concentration; MCV, mean corpuscular volume; MICR%, percentage of microcytosis; MON%, percentage of monocytes; MONO.abs, absolute monocyte count; MPV, mean platelet volume; PLT, platelet count; NEUT%, percentage of neutrophils; NEUT.abs, absolute neutrophil count; RBC, red blood cell count; RDW, red cell distribution width; Transamina, transaminase; Transpeptid., transpeptidase; TSH, thyroid stimulating hormone; WBC, white blood cell count.

  10. Extended Data Table 1 Genes sequenced by cRNA bait pull-down in the validation cohort
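Several of the captions above report Kaplan–Meier estimates, including AML-free survival stratified by RDW value >14 or ≤14. The following is a minimal sketch of the standard Kaplan–Meier product-limit estimator applied to such a split. The records are invented for illustration only, the helper names are hypothetical, and treating AML diagnosis as the event (with death or last follow-up as censoring) is an assumption made here, not a statement of how the study coded its data.

```python
from typing import Dict, List, Tuple

RDW_THRESHOLD = 14.0  # cut-off quoted in the Extended Data Fig. 5b caption


def kaplan_meier(times: List[float], events: List[bool]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) pairs at each observed event time.

    times  -- follow-up time for each individual
    events -- True if the event was observed at that time, False if censored
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all individuals whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                n_events += 1
            n_leaving += 1
            i += 1
        if n_events:
            # Product-limit step: multiply by the fraction surviving past t.
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


def stratify_by_rdw(
    records: List[Tuple[float, float, bool]]
) -> Dict[str, List[Tuple[float, float]]]:
    """Split (rdw, years_of_follow_up, aml_diagnosed) records at the RDW cut-off."""
    groups: Dict[str, Tuple[List[float], List[bool]]] = {
        "RDW > 14": ([], []),
        "RDW <= 14": ([], []),
    }
    for rdw, years, diagnosed in records:
        key = "RDW > 14" if rdw > RDW_THRESHOLD else "RDW <= 14"
        groups[key][0].append(years)
        groups[key][1].append(diagnosed)
    return {key: kaplan_meier(t, e) for key, (t, e) in groups.items()}


# Hypothetical records: (RDW, years of follow-up, AML diagnosed?)
records = [
    (15.2, 1.0, True), (14.8, 2.0, False), (16.1, 3.0, True), (14.5, 4.0, False),
    (13.0, 2.0, False), (12.5, 5.0, True), (13.8, 6.0, False), (13.2, 7.0, False),
]
for label, curve in stratify_by_rdw(records).items():
    print(label, curve)
```

Censored individuals leave the risk set without lowering the curve, which is why the estimator can use participants who were still AML-free at their last follow-up.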

Supplementary Information

  1. Supplementary Information

    Supplementary Note - Genetic model related code. Code for the derivation of the genetic AML prediction model.

  2. Reporting Summary

  3. Supplementary Table 1

    Clinical characteristics of the discovery and validation cohorts: This table contains survival and other available clinical metadata for the study cohorts.

  4. Supplementary Table 2

    ARCH-PD mutations: This table lists putative oncogenic mutations.

  5. Supplementary Table 3

    Genetic models performance and coefficients: This table contains genetic AML prediction model coefficients and performance metrics.

  6. Supplementary Table 4

    Features and parameters of the EHR based model: This table details the criteria for AML case ascertainment for the clinical AML prediction model as well as clinical features included and parameters used for machine learning.
