Genome-wide association study identifies susceptibility loci for acute myeloid leukemia

Acute myeloid leukemia (AML) is a hematological malignancy with an undefined heritable risk. Here we perform a meta-analysis of three genome-wide association studies, with replication in a fourth study, incorporating a total of 4018 AML cases and 10488 controls. We identify a genome-wide significant risk locus for AML at 11q13.2 (rs4930561; P = 2.15 × 10 −8 ; KMT5B). We also identify a genome-wide significant risk locus for the cytogenetically normal AML sub-group (N = 1287) at 6p21.32 (rs3916765; P = 1.51 × 10 −10 ; HLA). Our results inform on AML etiology and identify putative functional genes operating in histone methylation (KMT5B) and immune function (HLA).

Acute myeloid leukemia (AML) is the most common acute leukemia in adults of European ancestry 1 comprising distinct sub-groups characterized by unique somatic genetic alterations, etiologies, and outcomes 2 . Rare germline variants in transcription factors and other genes regulating hematopoietic cell differentiation and proliferation are highly penetrant for AML 3 , with some causative for debilitating human syndromes that include AML as a component, while others cause familial AML with high penetrance. Although such variants demonstrate a role for genetics in AML susceptibility, they are rare and do not make a major contribution to population disease burden 4 . A modest increased risk of myeloid malignancy for first-degree relatives of non-familial AML patients further supports a role for genetics in disease susceptibility 5 , which for the majority of cases is likely determined by co-inheritance of common low-penetrance variants consistent with the multigenic model for complex human diseases 6,7 .
Outside of highly penetrant syndromic or familial disease, genetic susceptibility to AML remains largely unexplained. To identify AML risk loci we conducted three genome-wide association studies with participants of European ancestry, with replication in a fourth study.
Here we report the identification of risk loci for AML, including pan-AML irrespective of disease sub-type and for cytogenetically normal AML. These data inform on disease etiology and demonstrate the existence of common, low-penetrance susceptibility alleles for AML with heterogeneity in risk across sub-types.
Validation GWAS and meta-analysis. To replicate the associations at the loci identified in the discovery GWAS meta-analysis we conducted a fourth genome-wide association study of European cases and controls (GWAS 4). Following the application of SNP and sample quality control metrics (Supplementary Figs. 9 and 10), data on >7.6 million SNPs from 977 AML cases and 3728 controls of European ancestry were available for analysis, which included 465 cytogenetically normal AML cases (Supplementary Table 1). Quantile-quantile plots of observed versus expected P values (MAF > 0.01) showed minimal inflation of test statistics (λ GC = 1.012 for all AML cases and λ GC = 1.001 for cytogenetically normal AML) (Supplementary Fig. 11). Analysis of data from GWAS 4 validated (P < 0.05) two of the associations identified in the discovery GWAS meta-analysis, with consistent direction and magnitude of effect sizes across all four studies, including 11q13.2 (rs4930561, KMT5B; P = 2.15 × 10 −8 ) for all AML irrespective of sub-type (N = 4018) and 6p21.32 (rs3916765, HLA-DQA2; P = 1.51 × 10 −10 ) for cytogenetically normal AML (N = 1287). Meta-analysis of SNPs common to all four GWAS (N = 6661818 and N = 6496414 for all AML and cytogenetically normal AML, respectively) also revealed additional borderline significant susceptibility loci at 1p31.1 (rs10789158, CACHD1, P = 2.25 × 10 −7 ) for all AML and at 7q33 (rs17773014, AKR1B1, P = 4.09 × 10 −7 ) for cytogenetically normal AML, both with consistent direction and magnitude of effect across all four studies (Figs. 1 and 2). To test for any potential effects of residual population substructure we re-examined the top signals using only cases and controls of UK origin in GWAS 1, GWAS 2, and GWAS 4, and using only cases and controls of German origin in GWAS 3 (Supplementary Fig. 12). The direction and magnitude of the associations in these sub-group analyses (Supplementary Fig. 13) are very similar to the analyses including all cases and controls (Fig. 2). As such, we report genome-wide significant susceptibility loci for all AML and cytogenetically normal AML at 11q13.2 (rs4930561) and 6p21.32 (rs3916765), respectively, and borderline significant susceptibility loci for all AML and cytogenetically normal AML at 1p31.1 (rs10789158) and 7q33 (rs17773014).
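The genomic inflation factor quoted above (λ GC) is conventionally the median observed 1-degree-of-freedom chi-square statistic divided by its expected median under the null. The following is a minimal sketch of that calculation, not the pipeline used in the study; the function name and the synthetic P values are assumptions made for illustration.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Return lambda_GC for a list of two-sided association P values."""
    norm = NormalDist()
    # A two-sided P value maps to |z| = inv_cdf(1 - p/2); z squared is a 1-df chi-square.
    chi2 = [norm.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


# Hypothetical input: evenly spaced P values mimic a perfectly null (uniform) scan.
null_p = [(i + 0.5) / 10000 for i in range(10000)]
lambda_gc = genomic_inflation(null_p)
print(round(lambda_gc, 3))
```

Values close to 1, such as those reported for GWAS 4, indicate little systematic inflation of test statistics.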
AML risk variants at 11q13.2, 6p21.32, 1p31, and 7q33 have Bayesian false-discovery probabilities (BFDP) 9 of 7, <1, 3, and 5%, respectively. There was no evidence of significant heterogeneity (P < 0.05) for association with AML for any of the risk variants across the 4 GWAS studies (Fig. 2). Analysis conditioning on the top variant at each susceptibility locus did not identify any evidence of additional associations (P < 10 −4 ) within 500 kb of the lead variant (Supplementary Figs. 14-17).
There was no significant heterogeneity in AML risk for any of the four variants when cases and controls were stratified by age (<55 years; ≥55 years) (Supplementary Tables 3 and 4).
The relationship between AML risk variants and survival was evaluated in 767 AML patients (excluding acute promyelocytic leukemia) from the UK, Germany, and Hungary. However, none of the 4 AML susceptibility variants identified here were significantly associated with either relapse-free or overall survival in univariate analysis that included all AML patients (N = 767) or those with cytogenetically normal AML (N = 369) (Supplementary Figs. 18-21).
A previously reported susceptibility variant for AML in the BICRA gene (rs75797233) 10 was not significantly associated with AML risk in our study (GWAS meta-analysis odds ratio (OR) 1.06, 95% CI 0.87-1.29). However, this variant was imputed to sufficient quality (INFO > 0.6) in only three GWAS studies and statistical power to detect an association with AML for this relatively uncommon variant (MAF 0.02) is compromised.
Inference of risk loci and biological function. We identified a genome-wide significant association for rs4930561 with risk of AML irrespective of sub-type (OR 1.17, 95% CI 1.11-1.24; P = 2.15 × 10 −8 ), which maps to the KMT5B gene on 11q13.2 (Fig. 3a). To identify putative risk loci we interrogated data from a meta-analysis of 31624 blood samples collated by the eQTLGen consortium 11 for evidence of cis-regulated genes. Forty-seven genes are annotated to within 500 Kb of the association signal, and the sentinel variant is a significant eQTL for 12 of these, including MRPL21 (Benjamini-Hochberg corrected P- Table 5).
Given the identification of a major AML risk allele at the HLA locus on chromosome 6p21.32, and that cancer cells acquire somatic mutations that can function as neo-antigens for immune recognition, we performed a case-control analysis stratified by mutation status for NPM1 and FLT3, two genes commonly mutated and clinically significant in cytogenetically normal AML 2 . Specifically, data on NPM1 and FLT3 somatic mutation status were available for 653 and 865 AML cases, respectively. There was no significant heterogeneity in AML risk when cases and controls were stratified by either NPM1 or FLT3 mutation status, although there was a trend towards higher risk for NPM1-mutated AML (OR 1.96, 95% CI 1.29-2.98; P = 1.7 × 10 −3 ) and FLT3-mutated AML (OR 1.52, 95% CI 1.07-2.16; P = 0.02) compared to NPM1-wildtype AML (OR 1.28, 95% CI 0.97-1.68; P = 0.08) or FLT3-wildtype AML (OR 1.26, 95% CI 1.01-1.58; P = 0.04) (Supplementary Table 6).
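To illustrate why the difference between mutated and wildtype strata is described as a trend rather than significant heterogeneity, the sketch below compares the two log odds ratios of each pair with a simple z-test, recovering standard errors from the reported 95% confidence intervals. This is an illustrative approximation only: it treats the two strata as independent (an assumption, since controls may be shared between strata), it is not necessarily the test used in the study, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def log_or_and_se(odds_ratio, ci_low, ci_high):
    """Recover log(OR) and its standard error from an OR and its 95% CI."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
    return math.log(odds_ratio), se


def stratum_difference_p(or_a, ci_a, or_b, ci_b):
    """Two-sided P value for a difference between two log ORs (independence assumed)."""
    beta_a, se_a = log_or_and_se(or_a, *ci_a)
    beta_b, se_b = log_or_and_se(or_b, *ci_b)
    z = (beta_a - beta_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Odds ratios and confidence intervals as reported in the text above.
p_npm1 = stratum_difference_p(1.96, (1.29, 2.98), 1.28, (0.97, 1.68))
p_flt3 = stratum_difference_p(1.52, (1.07, 2.16), 1.26, (1.01, 1.58))
print(f"NPM1 mutated vs wildtype: P = {p_npm1:.3f}")
print(f"FLT3 mutated vs wildtype: P = {p_flt3:.3f}")
```

Under these assumptions neither comparison falls below 0.05, in line with the statement that heterogeneity by mutation status was not significant.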
We also identified a borderline significant association for rs10789158 with AML irrespective of sub-type (OR 1.22, 95% CI 1.13-1.31; P = 2.25 × 10 −7 ), which maps to a block of linkage disequilibrium upstream of the CACHD1 gene on chromosome 1p31.3 (Fig. 3c). Of the 11 genes annotated to within 500 Kb of the association signal, the sentinel SNP (rs10789158) is an eQTL for RAVER2 (P BH = 1.26 × 10 −2 ) and AK4 (P BH = 1.26 × 10 −2 ), where the AML risk variant is associated with higher expression of AK4 and lower expression of RAVER2 (Supplementary Table 7).
We also identified a borderline statistically significant association with cytogenetically normal AML for rs17773014 (OR 1.26, 95% CI 1.15-1.37; P = 4.09 × 10 −7 ), which maps close to the AKR1B1 gene on chromosome 7q33 (Fig. 3d). Nine genes were annotated to within 500 Kb of the association signal and the sentinel SNP (rs17773014) is a significant eQTL for AKR1B1 in whole blood with the AML risk allele associated with higher transcript levels (P BH = 5.32 × 10 −23 ) (Supplementary Table 8).
Fig. 2 Forest plots for 4 new loci associated with acute myeloid leukemia. Study cohorts, sample sizes (case and controls (con)), imputation (info) score, effect allele, effect allele frequencies (EAF), and estimated odds ratios (OR) for rs4930561 (a), rs3916765 (b), rs10789158 (c), and rs17773014 (d). The vertical line corresponds to the null hypothesis (OR = 1). The horizontal lines and square brackets indicate 95% confidence intervals (95% CI). Areas of the boxes are proportional to the weight of the study. Diamonds represent combined estimates for fixed-effect and random-effect analysis. Cochran's Q statistic was used to test for heterogeneity such that P HET > 0.05 indicates the presence of non-significant heterogeneity. The heterogeneity index, I 2 (0-100), was also measured, which quantifies the proportion of the total variation due to heterogeneity. All statistical tests were two-sided and no adjustments were made for multiple comparisons.
chromosome 7q33 but which was not annotated in the eQTLGen consortium dataset.

Discussion
By conducting a meta-analysis of three large genome-wide studies with validation in a fourth study, we identify four susceptibility loci for AML, demonstrating the existence of common, low-penetrance susceptibility alleles for this genetically complex disease. Specifically, our data identify a major susceptibility locus for AML at the 11q13.2 KMT5B gene. KMT5B (SUV420H1) encodes a lysine methyltransferase that is frequently mutated in human cancers, with gene amplifications being particularly common 13-15 . KMT5B is implicated in AML pathogenesis where mutation has been associated with transformation from precursor myelodysplastic syndrome to AML 16 . Mutations in other lysine methyltransferases such as KMT2A (MLL1) occur with high frequency in AML 17 . Although KMT5B is a strong candidate for an AML susceptibility gene a priori, we cannot exclude mechanisms involving other local genes. For example, the AML risk variant at the 11q13.2 susceptibility locus is significantly associated with lower expression of CHKA, which encodes a protein involved in phosphatidylcholine biosynthesis. CHKA is significantly upregulated in mouse haematopoietic stem cells and human leukemia cell lines upon restoration of TET2 function 18 , a tumor suppressor which blocks aberrant self-renewal and which is frequently mutated in AML resulting in loss of function 19 .
We also report a putative pan-AML susceptibility locus at 1p31.3 that is a cis-eQTL for RAVER2 and AK4. RAVER2 encodes a ribonucleoprotein involved in RNA splicing where expression associates with transformation from myelodysplastic syndrome to AML 20 . In both human and mouse hematopoietic stem cells RAVER2 is identified as a target gene for miR-99 21 , which regulates normal and malignant hematopoietic stem cell self-renewal 21,22 and expression is significantly associated with prognosis in AML 22 . AK4 belongs to the adenylate kinase (AK) family of proteins that catalyze the phosphorylation of nucleotide monophosphate precursors to their di- and triphosphate forms. AK4 is localized to the mitochondria and is a target gene for hypoxia-inducible factor 1 alpha (HIF-1α) 23 , an established tumor suppressor in human and murine AML 24,25 .
We also identify a major susceptibility locus for cytogenetically normal AML at the 6p21.32 HLA gene. This region carries susceptibility alleles for numerous human cancers including hematological malignancies such as chronic lymphocytic leukemia 26,27 and Hodgkin lymphoma 28 , where risk is mediated via differential antigenic presentation/T-cell receptor recognition or altered risk of oncogenic infection. Somatic dysregulation of antigenic presentation through altered gene expression, loss of heterozygosity or genomic deletion is a common feature of many human solid cancers leading to failed immune surveillance 29-31 . Somatic loss of HLA alleles is rare in AML at disease presentation although it has been reported as a mechanism of immune escape after bone marrow transplant leading to relapse 32-35 . Our data identify the HLA-DQB1*03:02 and HLA-DQA1*03:01 alleles as significantly under-represented in cytogenetically normal AML cases compared to controls. Cytogenetically normal AML is characterized by somatic mutation in genes such as NPM1 and FLT3 2 which can function as neo-antigens 36-40 . For example, NPM1 mutation is reported in up to 60% of cytogenetically normal AML 41,42 where mutation leads to aberrant cytoplasmic expression that is postulated to lead to more efficient HLA presentation 43 . Our data suggest that rs3916765 affects AML risk by modulating immune recognition of mutated cells, although further work is required to determine which leukemia-specific neo-antigens are involved. The DQB1*03:02-DQA1*03:01 haplotype is associated with increased risk of autoimmune diseases including celiac disease 44 and type 1 diabetes 45,46 . Reports of concomitant celiac disease and AML are very rare 47 , consistent with the DQB1*03:02-DQA1*03:01 haplotype having pleiotropic and opposing effects on risk of these two diseases.
Taken together, these data implicate dysregulated immune function as a risk modifier for cytogenetically normal AML.
Our data suggest that differential expression of cis-regulated AKR1B1 (or related superfamily member AKR1B10) at the 7q33 locus modulates risk of cytogenetically normal AML. AKR1B1 and AKR1B10 encode members of the aldo-ketoreductase superfamily which catalyze the reduction of numerous aldehydes, including the aldehyde form of glucose to generate fructose via the polyol pathway 48 . Altered glucose metabolism is a hallmark of cancer, where malignant cells switch energy production from mitochondrial oxidative phosphorylation to glycolysis 49 . In AML, a switch to glycolytic metabolism is associated with disease progression and poor outcome 50 . Unlike normal monocytes, AML cells can also utilize fructose as an alternative substrate for glycolysis with expression of the GLUT5 fructose transporter as a major regulator of fructose metabolism in leukemic blast cells 51,52 , and where high expression is associated with increased proliferation, clonogenicity, migration, and invasion of AML cells 51 . Interrogation of three independent datasets identified a consistent association between elevated AKR1B1 transcript levels and shorter overall survival in AML 53 and serum fructose levels are prognostic in AML 51 , further implicating this pathway in AML disease progression.
In summary, our study identifies common susceptibility alleles at four genomic locations that modify AML risk, with evidence of sub-type specific risk loci reflecting the existence of multiple etiological pathways to disease development. Further work is required to decipher the functional basis of these risk loci although our data build on existing evidence demonstrating a role for aberrant histone modification and altered fructose metabolism in AML pathogenesis 16,17,51,52 . Furthermore, the identification of a major AML risk variant at the HLA locus on chromosome 6 implicates altered immune function as etiologically important in AML. Our data support existing evidence of genetic and biological heterogeneity in AML 2 and confirm the need for large collaborative studies to improve statistical power and aid the discovery of sub-type specific genetic risk loci.
Collection of patient samples and associated clinico-pathological information was undertaken with written informed consent. All studies were conducted in accordance with the Declaration of Helsinki and received local institutional review board or national research ethics approval (Supplementary Table 9). Specifically, this research has been conducted using the UK Biobank Resource (Application #16583, James Allan). MRC/NCRI AML 11 trial, AML 12 trial and the UK Leukaemia Research Fund (LLR) population-based case-control study of adult acute leukemia received multicenter research ethics committee approval 54,55,75 . Research ethics committee approval was given to the Newcastle Haematology Biobank (07/H0906/109 + 5) and the AML genome-wide association study in the UK (06/q1108/92, BH136664 (7078)). AML cases and controls for
Samples from the Hungarian AML patients were obtained during the standard diagnostic workup at the Hematology
Genotyping and genome-wide quality-control procedures. Genotype calling was performed using Illumina GenomeStudio software or Affymetrix Genotyping Console software v4.2.0.26.
Data handling and analysis was performed using R v3.5.1, PLINK v1.9b4.4, and SNPTEST v2.5.2. Rigorous SNP and sample quality control metrics were applied to all four GWAS (Supplementary Fig. 1). Specifically, we excluded SNPs with extreme departure from Hardy-Weinberg equilibrium (HWE; P < 10 −3 in either cases or controls) and with a low call rate (<95%). We also excluded SNPs that showed significant differences (P < 10 −3 ) between genotype batches and with significant differences (P < 0.05) in missingness between cases and controls. Individual samples with a call rate of <95% or with extreme heterozygosity rates (±3 standard deviations from the mean) were also excluded from each GWAS. Individuals were removed such that there were no two individuals with estimated relatedness pihat >0.1875, both within and across GWAS. The individual with the higher call rate was retained unless relatedness was identified between a case and a control, where the case was preferentially retained. Ancestry was assessed using principal component analysis and super-populations from the 1000 Genomes Project as a reference, with individuals of non-European ancestry excluded based on the first two principal components. To minimize any impact of population stratification among the European population we excluded outlying cases and controls identified using principal components 1 and 2 for each GWAS (Supplementary Figs. 1, 3, 4, 5, and 10).
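The sample-level filters described above (call rate below 95%, heterozygosity more than 3 standard deviations from the mean) can be sketched as follows. This is an illustration only, not the PLINK-based procedure used in the study; the function name, the input layout, and the toy data are all hypothetical.

```python
from statistics import mean, stdev


def sample_qc(samples, min_call_rate=0.95, het_sd_limit=3.0):
    """Return IDs of samples passing call-rate and heterozygosity filters.

    `samples` maps a sample ID to a (call_rate, heterozygosity_rate) tuple.
    """
    # Mean and SD are taken over all supplied samples (an assumption of this sketch).
    het = [h for _, h in samples.values()]
    het_mean, het_sd = mean(het), stdev(het)
    keep = []
    for sample_id, (call_rate, het_rate) in samples.items():
        if call_rate < min_call_rate:
            continue  # too many missing genotypes
        if abs(het_rate - het_mean) > het_sd_limit * het_sd:
            continue  # extreme heterozygosity
        keep.append(sample_id)
    return keep


# Hypothetical toy data: 29 typical samples, one heterozygosity outlier,
# and one sample with a low call rate.
toy = {f"s{i}": (0.99, 0.30 + 0.001 * (i % 5 - 2)) for i in range(29)}
toy["outlier"] = (0.99, 0.45)
toy["low_call"] = (0.90, 0.30)
passed = sample_qc(toy)
print(len(passed))
```

SNP-level filters (HWE, call rate, batch and missingness differences) would follow the same pattern, applied per variant rather than per sample.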
For each GWAS, association tests were performed for all cases and cytogenetically normal AML assuming an additive genetic model, with nominally significant principal components included in the analysis as covariates. Association summary statistics were combined for variants common to GWAS 1, GWAS 2, and GWAS 3, and then for variants common to all four GWAS, in fixed effects models using PLINK v1.9b4.4. Cochran's Q statistic was used to test for heterogeneity and the I 2 statistic was used to quantify variation due to heterogeneity.
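As an illustration of the fixed-effects combination and heterogeneity statistics described above, the sketch below applies standard inverse-variance weighting, Cochran's Q, and I 2 to per-study log odds ratios. It is not PLINK's implementation, and the per-study estimates are hypothetical values chosen only for demonstration.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance fixed-effect meta-analysis of per-study log odds ratios.

    Returns (pooled_beta, pooled_se, cochran_q, i_squared_percent).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_w = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    se = math.sqrt(1.0 / total_w)
    # Cochran's Q: weighted squared deviations from the pooled estimate,
    # compared against a chi-square distribution with k - 1 degrees of freedom.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2: share of total variation attributable to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return beta, se, q, i2


# Hypothetical per-study log ORs and standard errors for four GWAS.
study_betas = [0.18, 0.14, 0.16, 0.15]
study_ses = [0.05, 0.06, 0.07, 0.05]
pooled_beta, pooled_se, q_stat, i_squared = fixed_effect_meta(study_betas, study_ses)
print(f"pooled OR = {math.exp(pooled_beta):.3f}, Q = {q_stat:.3f}, I2 = {i_squared:.1f}%")
```

With estimates this similar across studies, Q is smaller than its degrees of freedom and I 2 is reported as 0.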
The Bayesian false discovery probability was calculated using a prior probability of association of 0.0001 and a plausible OR of 1.3 9 .
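A minimal sketch of a Bayesian false discovery probability calculation is given below, assuming Wakefield's approximate Bayes factor formulation with the prior variance set so that the plausible OR sits at the upper 97.5% point of the prior. The function name is hypothetical, and the standard error is recovered from rounded published confidence limits, so the output should not be expected to reproduce the percentages reported in the text.

```python
import math


def bfdp(odds_ratio, ci_low, ci_high, prior=1e-4, plausible_or=1.3):
    """Approximate Bayesian false discovery probability for one variant."""
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
    v = se ** 2                                  # variance of the estimate
    w = (math.log(plausible_or) / 1.96) ** 2     # prior variance of log(OR)
    z2 = beta ** 2 / v
    # Approximate Bayes factor in favour of the null hypothesis.
    abf = math.sqrt((v + w) / v) * math.exp(-0.5 * z2 * w / (v + w))
    prior_odds_null = (1 - prior) / prior
    return abf * prior_odds_null / (abf * prior_odds_null + 1)


# rs4930561 estimate as reported in the text, versus a hypothetical weak signal.
strong = bfdp(1.17, 1.11, 1.24)
weak = bfdp(1.05, 0.99, 1.11)
print(f"{strong:.3f} {weak:.3f}")
```

Lower values indicate that an association is less likely to be a false discovery under the stated prior.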
Case-control analyses were also performed stratified by sex and age in all 4 GWAS. For age, cases and controls were stratified into those <55 years and ≥55 years. GWAS 1 was not included in the meta-analysis for the ≥55 age group because the controls were recruited to the 1958 Birth Cohort and were all genotyped at the age of 45 years. Case-control analyses were performed stratified by NPM1 and FLT3 mutation status (mutation-positive and mutation-negative) in GWAS 2 and GWAS 4. Data on NPM1 and FLT3 somatic mutation status was available for 653 and 865 AML cases, respectively, including 411 and 528 cases of cytogenetically normal AML, respectively. PCR mutation analysis was performed as part of routine diagnostics for NPM1 exon 12 and FLT3 exons 14-15 (Supplementary Table 10) 78,79 .
Technical validation of AML susceptibility variants. All four AML risk variants reported here were either directly genotyped or imputed to high quality. Specifically, rs4930561 was directly genotyped in GWAS 1 and GWAS 2 and imputed in GWAS 3 and GWAS 4 (info score 0.974-0.988); rs3916765 was genotyped in GWAS 4 and imputed in GWAS 1, GWAS 2, and GWAS 3 (info score 0.901-0.995); rs10789158 was imputed in all 4 GWAS studies (info score 0.946-0.9775); and rs17773014 was directly genotyped in GWAS 3 and GWAS 4 and imputed in GWAS 1 and GWAS 2 (info score 0.985-0.993). Fidelity of array genotyping and imputed dosages was confirmed using Sanger sequencing in a subset of AML samples (including samples genotyped on both Illumina and Affymetrix platforms) for each sentinel variant, with perfect or very high concordance for all four variants.
The majority of AML cases were genotyped using DNA extracted from cell/tissue samples (blood and bone marrow) taken during AML remission. A minority of AML cases were genotyped using DNA extracted from tissue samples that include leukemic AML cells. As such, we employed a stringent HWE cut-off (Supplementary Fig. 1) in order to eliminate SNPs potentially affected by somatic copy number alterations. Furthermore, we also used Nexus Copy Number v10 (BioDiscovery, California) to interrogate B allele frequency and Log R ratio values at loci associated with AML following genotyping of DNA extracted from leukemic AML cells. For rs4930561 (chromosome 11q13.2) we interrogated data from 352 AML cases using samples with high somatic cell content and found one case with a large deletion capturing the KMT5B locus. We also identified 12 cases with evidence of trisomy 11 or large gains affecting chromosome 11, consistent with reports of trisomy 11 in approximately 1% of AML cases 80 . For rs10789158 (chromosome 1p31.3) we identified 1 case with evidence of copy number gain. The susceptibility locus at chromosome 1 does not fall within a region reported to be recurrently somatically deleted or amplified in AML. The association signals at 6p21.32 (rs3916765) and 7q33 (rs17773014) were specific to cytogenetically normal AML and evidence of somatic copy number alterations was visible in 0 and 3 cases, respectively (based on Nexus Copy Number analysis of 127 cytogenetically normal AML cases). Specifically, there were three cases with evidence of deletions affecting the chromosome 7 risk locus that were not visible cytogenetically. Furthermore, there was no evidence of copy neutral loss of heterozygosity (>2 Mb) at any of the four AML susceptibility loci reported here. Taken together, these data limit the possibility of differential genotyping in cases and controls due to somatically acquired allelic imbalance.
HLA imputation, expression quantitative trait loci (eQTL) analysis, and functional annotation. Imputation of classical HLA alleles was performed with the SNP2HLA v1.0.3 tool, using 5225 Europeans from the Type I Diabetes Genetics Consortium as a reference panel 12 . To examine the relationship between SNP genotype and gene expression and identify cis expression quantitative trait loci (eQTLs) we made use of data from the eQTLGen Consortium (http://www.eqtlgen.org/cis-eqtls.html) for whole blood. Benjamini-Hochberg (BH)-adjusted P values were estimated for each gene annotated to within 1 Mb of the sentinel SNP at each AML association signal. Regions with AML susceptibility variants were annotated for putative functional motifs using data from the ENCODE project 81 .
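The Benjamini-Hochberg adjustment mentioned above can be sketched as the standard step-up procedure shown below. The function name and the example P values are hypothetical; this is not the code used to generate the P BH values reported in the text.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest, scaling each by m / rank
    # and enforcing that adjusted values never increase as rank decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw eQTL P values for four genes near a sentinel SNP.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
print(adj_p)
```

Each adjusted value is at least as large as the raw P value it replaces, and the ordering of genes by significance is preserved.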
Relationship between SNP genotype and patient survival. The relationship between AML risk variants and survival was evaluated in a total of 767 AML patients (excluding acute promyelocytic leukemia) from the UK 54,55 , Germany 60,61 , and Hungary 74 . Briefly, patients were treated with conventional intensive AML therapy including ara-C, daunorubicin, and best supportive care. A subset of high-risk patients in the German cohort were treated with stem cell transplantation 60 . Overall survival was defined as the time from diagnosis to the date of last follow-up or death from any cause. Data on relapse-free survival was available on 358 AML patients, which was defined as the time from date of first remission to the date of last follow-up in remission or date of AML relapse. Cox regression analysis was used to estimate allele-specific hazard ratios and 95% confidence intervals for each study in analyses that included all AML cases (N = 767) and cytogenetically normal AML (N = 358).
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.