Regional evaluation of childhood acute lymphoblastic leukemia genetic susceptibility loci among Japanese

Genome-wide association studies (GWAS) performed mostly in populations of European and Hispanic ancestry have confirmed an inherited genetic basis for childhood acute lymphoblastic leukemia (ALL), but these associations are less clear in other races/ethnicities. DNA samples from ALL patients (aged 0–19 years) previously enrolled onto a Tokyo Children’s Cancer Study Group trial were collected during 2013–2015, and underwent single nucleotide polymorphism (SNP) microarray genotyping resulting in 527 B-cell ALL for analysis. Cases and control data for 3,882 samples from the Nagahama Study Group and Aichi Cancer Center Study were combined, and association analyses across 10 previous GWAS-identified regions were performed after targeted SNP imputation. Linkage disequilibrium (LD) patterns in Japanese and other populations were evaluated using the varLD score based on 1000 Genomes data. Risk associations for ARID5B (rs10821936, OR = 1.84, P = 6 × 10−17) and PIP4K2A (rs7088318, OR = 0.76, P = 2 × 10−4) directly transferred to Japanese, and the IKZF1 association was detected by an alternate SNP (rs1451367, OR = 1.52, P = 2 × 10−6). Marked regional LD differences between Japanese and Europeans was observed for most of the remaining loci for which associations did not transfer, including CEBPE, CDKN2A, CDKN2B, and ELK3. This study represents a first step towards characterizing the role of genetic susceptibility in childhood ALL risk in Japanese.

Scrutiny of the human genome through evaluation of common genetic variants has revealed hundreds of disease susceptibility loci. In childhood acute lymphoblastic leukemia (ALL), six regions that have replicated in several populations are now considered known susceptibility loci (likely representing associations with ARID5B, IKZF1, CEBPE, CDKN2A, PIP4K2A, and GATA3), with the majority of the evidence supported through studies conducted in populations of European and Hispanic descent 1,2 . Gains in statistical power achieved by recent meta-analyses of childhood ALL genome-wide association studies (GWAS) have resulted in the identification of risk-associated single nucleotide polymorphisms (SNPs) of comparatively lower allele frequencies and estimated magnitude of effects including those tagging the CDKN2B, LHPP, and ELK3 genes 3,4 . Furthermore, the concept that genetic susceptibility studies may potentially reveal race/ethnicity-specific associations was demonstrated by a recent GWAS conducted in a Chinese population which implicated a role for the WWOX gene that had not been observed in the numerous previous studies conducted in populations of European and Hispanic ancestry 5 .
Success of GWAS in identifying true disease-associated loci have largely been due to the consistently high standards in methodological rigor of the approach including, strict quality control for genotype data, attention to issues of statistical power and sample size, criteria for genome-wide significance, and integrating components of independent validation and/or functional evaluation of loci 6 . However, as is the case with other complex diseases, it is well-recognized that the known GWAS 'hits' in childhood ALL account for only a small proportion of the total estimated heritability 7 . Based on data from populations of European ancestry, it has been estimated that the currently known childhood ALL associated risk loci account for about 19 percent of the additive heritable risk, not accounting for potential impact of epistasis or gene-environment interactions 4 . Adding to this issue of 'missing heritability' is the realization that we currently know even less about the nature of childhood ALL genetic susceptibility in other populations, particularly Asians 8 .
The effects of known genetic susceptibility loci have yet to be fully confirmed in populations of non-European ancestry. Targeted validation attempts based on the same SNPs originally identified in mostly non-Hispanic whites and Hispanics have been performed in Chinese and other Asian populations, but findings have been inconsistent 9,10 . Assuming the same causal variant is operative across populations, a lack of association in Asians can be attributed to study flaws, but more likely due to reduced statistical power as a result of differences in allele frequency, strength of linkage disequilibrium (LD) with the causal variant(s), and/or the role of environmental exposures in affecting risk. Thus, a comprehensive characterization of genetic variation across the targeted genetic loci is required for an appropriate validation attempt in different populations.
To address this current gap in knowledge, we initiated an effort through the Tokyo Children's Cancer Study Group (TCCSG) to assemble resources for genomic investigation of germline contributions to childhood ALL susceptibility and outcomes. Here, we report results of our targeted analysis of previous GWAS-identified childhood ALL risk loci in a large Japanese population. We evaluated the transferability of risk associations of specific SNP loci in Japanese and interpreted the finding in the context of quantified differences in LD within those loci across populations. Furthermore, we analyzed regional SNP data in order to identify alternate SNPs which may potentially confer stronger association in Japanese.

Materials and Methods
Study Population. In collaboration with a large network of 23 hospitals participating in the Tokyo Children's Cancer Study Group (TCCSG), previously diagnosed childhood ALL patients visiting for a routine follow-up between 2013 and 2015 were invited to participate in this study. The TCCSG network includes nearly all clinical centers that diagnose and treat childhood ALL within the seven prefectures that comprise the Kanto and immediately surrounding regions 11,12 . Patients were considered eligible if they were 19 years of age or younger at the time of ALL diagnosis, enrolled onto a TCCSG treatment protocol, and self-identified as Japanese. Due to the nature of this sampling scheme, the study population comprised a survivorship population of childhood ALL patients.
Upon obtaining written informed consent, saliva samples using the Oragene Saliva DNA Self-Collection Kit (4 years of age and older) or Assisted Collection Kit (less than 4 years of age) (DNA Genotek, Ottawa, Canada) were collected from the patients with instruction by the attending physician or nurse during the follow-up outpatient visit. The collected samples were shipped at room temperature to a central laboratory (Tokyo Medical and Dental University) for processing, DNA extraction, and storage.
Controls comprised a subset of adult participants enrolled in two ongoing epidemiological studies of lifestyle-related chronic diseases in Japan, the Nagahama Study Group 13 and Aichi Cancer Center Study 14 , in which large-scale genome-wide SNP genotyping had already been performed. The Nagahama Study is a community-based prospective cohort study comprising a representative sample of residents of Nagahama City in Shiga Japan 15 . The Aichi Cancer Center Study comprised a hospital-based cohort of non-cancer outpatient visitors 16 . Despite the name, the Aichi Cancer Center resembles a general hospital that does not require physician referral in which the majority of outpatients present with no abnormal findings by clinical examination. Population substructure across regions of Japan does exist; most notably between populations of Okinawa and the other main islands of Japan collectively. Although cases and controls were recruited from different regions of the main island, simulation studies have shown only minimal genomic inflation potential when considering these two subpopulations 17 . A history of childhood leukemia was not assessed in controls; however, the rarity of this disease suggests that any previous diagnosis of childhood leukemia in controls would have a minimal effect on the results of this study.
This study protocol was approved by the institutional review boards of St. Luke's International Hospital, Tokyo Medical and Dental University, Kyoto University, and all collaborating hospitals involved in patient recruitment. Written informed consent was obtained from the parents of each participant together with a written assent by the child where possible. Patients aged 16 to 19 years were asked to provide written informed consent together with parental consent; those aged 20 or older did not require parental consent. This study was conducted in accordance with the Declaration of Helsinki.
Genotyping and Quality Control. DNA extraction from childhood ALL patients' saliva samples were performed using the Oragene prepIT DNA Extraction Kit (DNA Genotek) based on the manufacturer's instruction. The approximately 2 mL saliva samples obtained from the Oragene Self-Collection Kits yielded, on average, a total of about 50 ug of genomic DNA.
Genome-wide SNP genotyping was attempted on 621 patient samples using the Illumina HumanCoreExome-12 v1.1 BeadChip (San Diego, CA) which contained probes for approximately 550,000 SNPs. Existing control data were genotyped previously using variable versions of the same Illumina HumanCoreExome BeadChip. Quality control steps were conducted within cases and each of the two different control sample series separately. SNPs were excluded if the genotype call rate was below 99%, the distribution of genotypes clearly deviated from that expected by Hardy-Weinberg equilibrium (HWE) (P < 1 × 10 −6 ), or the minor allele frequency was less than 0.01. Samples were excluded if showed a genotyping success rate of less than 95% (51 cases and 4 controls) and relatedness based on an identity-by-descent analysis (1 case and 119 controls). In addition, principal components analysis (PCA) based on a genome-wide subset of SNPs in low LD (pruned at r 2 < 0.1) that passed quality control steps was performed on a known ethnically homogeneous population of Japanese ancestry (International HapMap Project) together with cases and controls. The PCA was conducted using the EIGENSTRAT 2.0 software package and outlier samples were excluded (2 cases and 5 controls) 18 . In result, after quality control steps and excluding 40 T-cell ALL patients, the final population for analysis included a total of 527 Japanese B-cell ALL cases and 3,882 controls with data available for 171,547 SNPs that were overlapping across the genotyped case and control series.
Targeted SNP imputation was performed on the combined case-control dataset for 10 genomic regions reported in previous childhood ALL GWAS (Table 1) using ShapeIT2 19 and Minimac3 20 , and the 1000 Genomes Project Phase III Version 5 as the reference population 21 . Poorly imputed SNPs defined by an R 2 < 0.5 were excluded from the analyses. Considering the gene and its broad surrounding region (about 100-kb flanking) for each locus, a total of 113 SNPs were excluded among 14,457 total SNPs imputed across the 10 regions. On average, about 0.8 percent of SNPs per locus were excluded based on this quality control threshold. Due to restrictions stipulated by the institutional review board approvals, data were not be made publicly available, but may be available on request in compliance with the policies and procedures of the TCCSG.
Statistical Analysis. We first tested the association between childhood ALL and 16 SNPs across the 10 genes (Table 1) identified in previous GWAS. SNPs for evaluation were selected based on the strongest result reported from the first study to report the association. Multiple SNPs tagging the same genomic region were selected if the SNP was examined across several studies. We examined the role of additional genetic variation across the entire span of the 10 targeted genes, including a 10-kb flanking region on both ends. The association between each genetic variant and risk of childhood ALL was estimated by the odds ratio (OR) per allele and 95% confidence intervals (CI) using multiple logistic regression assuming a log-additive genetic model. Genome-wide association analysis of the 171,547 SNPs showed evidence of genomic inflation (λ > 1.10); all analyses were adjusted for 10 PCA eigenvectors (λ = 1.05). For the test of specific previously reported GWAS SNPs, a nominal p-value of less than 0.05 was considered statistically significant. For the examination of other potentially associated SNPs across the genomic regions, to account for multiple comparisons in the presence of LD between SNPs, we calculated adjusted p-values based on 10,000 permutations of case-control status and considered p-values below a family-wise type I error rate threshold of 0.05 to be statistically significant. Analyses were conducted using PLINK 22 and SAS software version 9 (SAS, Cary, NC). The LocusZoom web-based resource was used to generate plots of association results by genomic region 23 .
Differences across race/ethnic populations in regional patterns of LD flanking a 10-kb region on both ends of the SNPs were quantified using the variation in LD (varLD) score applied to the 1000 Genomes Phase 3 data 21 . The varLD score is an algorithm based on comparing regional patterns of correlation previously developed by Teo et al. to quantify differences in LD within defined regions 24,25 . With the exception of the WWOX locus, the Japanese (JPT) population was compared to the combined population of European ancestry (EUR); for the WWOX locus, JPT was compared to the combined Han Chinese and Southern Han Chinese (CHB-CHS) representing the population in which the locus was originally identified. Permutation procedures were performed to determine Monte Carlo statistical significance by comparing the estimated varLD score to the null distribution of varLD scores after successive re-sampling of the two populations from the combined data 25 ; 10,000 iterations were performed. Since 9 genomic loci were tested (CDKN2A-CDKN2B were evaluated as one region), an empirical p-value of less than 0.006 was considered statistically significant for the varLD evaluation. All statistical tests were two-sided.
Risk allele frequencies and association estimates across various races/ethnicities are presented in Table 2. Among the loci identified through GWAS, only ARID5B SNPs showed a consistent association across the race/ ethnic populations despite marked differences in allele frequencies. Although only marginally significant in Chinese (rs7088318, OR = 1.23, P = 0.047), the PIP4K2A association also showed consistency across populations. Primary SNPs first reported in populations of European ancestry for IKZF1 (rs4132601 and rs11978267), then subsequently replicated in Hispanics and African Americans, showed no association in both Chinese and Japanese. The risk allele frequencies for the SNPs in Japanese (approximately 0.10) are markedly lower than frequencies observed in the original GWAS populations (approximately 0.20-0.30). Risk-associated SNPs recently identified in LHPP and ELK3 in Europeans have not yet been reported in other populations. In Japanese, rs35837782 in LHPP and rs4762284 in ELK3 did not show an association.
Using available SNP data across all 10 genetic loci including 10-kb flanking regions on both ends of the target genes, B-cell ALL risk associations were identified for alternate SNPs in IKZF1 (rs1451367, OR = 1.52, 95% CI = 1.28-1.80, P = 1.9 × 10 −6 ) ( Table 3   a nominal p-values of less than 0.05 were identified, but were not statistically significant after adjustment for the number of SNPs tested across the respective regions. To examine whether the non-transferability of association may be due to population differences in regional LD structure, varLD scores were calculated using 1000 Genomes Project Phase 3 data for Japanese, Han Chinese, and populations of European ancestry ( Table 3). The regions surrounding the ARID5B and PIP4K2A SNPs, loci that directly replicated in Japanese, did not show statistically significant evidence of regional LD differences between populations of European ancestry and Japanese. Regions surrounding the IKZF1 SNPs also showed minimal evidence of regional LD differences, but the previous GWAS-identified SNP associations did not directly transfer to Japanese. However, alternate statistically significant SNPs within IKZF1 were identified (described above). With the exception of LHPP and WWOX, the four additional genetic loci in which the association did not transfer to Japanese showed strong evidence of regional LD structure differences based on varLD evaluations.

Discussion
Aided by successful validation across multiple populations, the genome-wide association analysis approach has led to the identification of several genetic loci involved in childhood ALL risk 1,2 . However, there is still uncertainty about the role of these loci and consistency of specific SNP associations in East Asians with the majority of robust studies being performed primarily in populations of European and Hispanic ancestries. In our targeted evaluation of 16 previous GWAS-reported SNPs, we observed that the risk associations of those in ARID5B and PIP4K2A directly transfer to the Japanese population. The involvement of IKZF1 is also supported by the identification of alternate associated SNPs in proximity to the originally reported loci, and the GATA3 locus appears suggestive. However, this leaves the associations in the six remaining genes without clear evidence for a role in childhood ALL risk in East Asians. Examination of regional varLD scores showed that significant differences in LD between Japanese and the population in which the association was first reported were commonly observed in genes where the risk association did not transfer. Rather than concluding that the association is not present in Japanese, the varLD observations suggest that the associations may be obscured by differences in LD patterns and that other strategies are necessary to further clarify the role of the remaining six loci that did not transfer to this population.
Childhood ALL SNP associations in ARID5B first reported concurrently in studies performed in populations of European ancestry in the United States 27 and the United Kingdom 28 have been widely validated across multiple race/ethnic population 29 , now including Japanese. The risk-conferring minor allele frequency of rs10821936  in Japanese is similar to that of Europeans (MAF ~ 0.35), but is significantly higher in Hispanics (MAF ~ 0.45) and lower in populations of African ancestry (MAF ~ 0.20). Interestingly, this pattern is similar to the relative population differences in incidence of childhood ALL and evidence supports a role for this locus in partially explaining this difference. Based on available data from St. Jude Children's Research Hospital and descriptive statistics from the Surveillance, Epidemiology, and End Results Program in the US, it was estimated that about 30% of the observed racial differences in ALL incidence may be attributable to the higher frequency of the rs10821936 risk allele in non-Hispanic whites compared to blacks 30 . Characterization of genetic ancestry of the Children's Oncology Group Hispanic population showed increasing rs10821936 risk allele frequencies with increasing percentages of Native American ancestry 31 . Building on this observation, the California Childhood Leukemia Study reported increasing proportions of Native American ancestry to be associated with increasing risk of childhood ALL and showed that ARID5B contributes directly to the higher incidence in Hispanics compared to non-Hispanic whites 32 . However, the contribution of ARID5B is less clear in relation to Japanese given that this SNP has similar frequency and magnitude of effect as non-Hispanic whites despite known differences in incidence between the two populations. Although consistently replicating in populations of European ancestry 9 , similar to studies performed in Chinese 5 , the IKZF1 SNP association did not transfer to the Japanese population. Comparison of LD patterns based on varLD score between Europeans and Japanese did not show evidence of marked difference across an approximately 25-kb region comprising the previously reported SNPs. However, the allele frequency of the SNPs are considerably lower in East Asians at about 0.10 or less compared to close to 0.30 in Europeans and Hispanics. The ability to analyze the effect of other SNPs across the flanking regions led to the identification of an alternate associated SNP (rs1451367) located within about 10-kb that is common in East Asians (MAF ~ 0.20), but rare in Europeans (MAF < 0.01). This suggests that variation in IKZF1 is also associated with risk of childhood ALL in Japanese; however, it cannot be concluded yet whether the SNP associations are representing the same causal locus across the populations.
Based on the results of the current analysis, evidence for childhood ALL risk associations with GWAS-identified SNPs in CEBPE, CDKN2A, CDKN2B, LHPP, ELK, and WWOX is lacking. Associations represented by other SNPs potentially tagging a causal locus within these genes were also not apparent. While the evidence is still limited, results could be influenced by differences in a gene-environment effect across populations not appropriately captured, or it may be possible that certain common SNPs identified in GWAS may be representing associations with rare causal variant(s) on the same haplotype background of the GWAS-identified tag SNP 33 . If rare causal variants are at play, even modest differences in haplotype structure of the regions may significantly affect detection potential, or it is possible that the variants may not be present in Japanese. As an example, the CDKN2A risk association originally identified through GWAS based on the common variant rs3731217 34 Table 3. Comparison of linkage disequilibrium of ALL-associated genomic regions between populations of European and Japanese ancestry and alternate SNP associations in Japanese. Abbreviations: Ca, cases; Co, controls; EUR, European ancestry; GWAS, genome-wide association study; JPN, Japanese ancestry; kb, kilobase; MAF, minor allele frequency; OR, odds ratio; SNP, single nucleotide polymorphism; varLD, variation in linkage disequilibrium. a Using 1000 Genomes Project Phase 3 data, regional LD patterns in Japanese were compared to patterns in the population for which the SNP association was first reported. WWOX varLD evaluations were performed compared to the Han Chinese population. b Regions included the evaluated SNPs and an additional 10-kb span flanking both ends. Genomic positions are based on the human genome assembly GRCh37 coordinates. c Adjusted p-values (P adj ) were based on 10,000 permutations of the combined data comprising the two populations being compared. An adjusted p-value of less than 0.006 based on a Bonferroni correction was considered statistically significant since 9 genomic regions were being tested. d Odds ratios indicate the risk associated with the each additive increase in minor allele. e Adjusted p-values (P adj ) were based on 10,000 permutations of case-control status and considered p-values below a family-wise type I error rate threshold of 0.05 to be statistically significant. recently shown to be explained by a rare high-impact coding variant (rs3731249) [35][36][37] . This variant is present in about two percent of Europeans, but is not present in Japanese. With the exception of LHPP, all loci for which the associations did not transfer showed evidence of differences in genetic architecture between Japanese and Europeans based on varLD score, whereas those that transferred did not show marked differences. In line with the common disease/common variant hypothesis 38 , if the GWAS associations are instead tagging a common causal variant and assuming this variant is operative as a risk locus in Japanese as well, we would have expected the regional SNP coverage and statistical power of the current study to be sufficient to detect the association signal. The lack of association suggests a need for future studies to consider characterization of rare variants in order to fully understand the nature of these GWAS loci in Japanese.
Certain limitations inherent to this study may have also affected the results. Although our study was limited to B-cell lineage ALL similar to most previous GWAS, availability of molecular subtype data was incomplete for a large proportion of the patients. While heterogeneity by subtype in the magnitude of risk has been observed for several of the loci, effects exclusive to a specific subtype have not been clearly demonstrated and is likely not the reason for the lack of association observed. One exception may be the GATA3 risk locus identified in a GWAS of Ph-like ALL 39 and another study that observed the association specifically in non-hyperdiploid B-cell ALL that lack the ETV6-RUNX1 fusion 40 . Results for the GATA3 variant (rs3824662) in the current study were suggestive of an association among the total B-cell ALL series, but requires further evaluation in a subtype specific analysis for confirmation. Also, access to patients for recruitment into this study was through the outpatient mechanism which resulted in a study population of surviving patients. This may have led to over-representation of patients of certain disease profiles; however a 80 to 85 percent survival rate of childhood ALL, as reported by the TCCSG 41 , suggests that the effect may have been minimal since our objectives focused on validating known GWAS hits, those of which were originally identified using general ALL patient populations that comprised of the most common ALL subtypes (versus a sequencing-based design targeting rare subtypes of poor prognosis). Finally, our data included imputed genotypes to enhance the coverage of genetic variation across the targeted genomic regions. Despite stringent quality control measures and advances in imputation methodologies, uncertainty still exists and may have introduced non-differential misclassification of genotypes and a reduction in statistical power to detect associations.
In this targeted evaluation of SNPs across regions previously identified in GWAS of childhood ALL, we showed that variation in ARID5B, IKZF1, PIP4K2A, and possibly GATA3 contribute to the genetic susceptibility of childhood B-cell ALL in Japanese. There is a need to account for population-specificity in producing accurate risk prediction estimates based on inherited genetic variation. Thus, this analysis serves as the first step towards characterizing the role of genetic variation in the susceptibility to childhood ALL in the Japanese population. Identification of potential novel loci, perhaps specific to the East Asian population or those more detectable due to enhanced LD with a causal locus and/or allele frequency differences, may be possible through a genome-wide association analysis after expansion of this population for increased statistical power.