Identification of novel breast cancer susceptibility loci in meta-analyses conducted among Asian and European descendants

Known risk variants explain only a small proportion of breast cancer heritability, particularly in Asian women. To search for additional genetic susceptibility loci for breast cancer, here we perform a meta-analysis of data from genome-wide association studies (GWAS) conducted in Asians (24,206 cases and 24,775 controls) and European descendants (122,977 cases and 105,974 controls). We identified 31 potential novel loci with the lead variant showing an association with breast cancer risk at P < 5 × 10−8. The associations for 10 of these loci were replicated in an independent sample of 16,787 cases and 16,680 controls of Asian women (P < 0.05). In addition, we replicated the associations for 78 of the 166 known risk variants at P < 0.05 in Asians. These findings improve our understanding of breast cancer genetics and etiology and extend previous findings from studies of European descendants to Asian women.

B reast cancer is the most commonly diagnosed malignancy and the leading cause of cancer-related deaths in women worldwide 1 . Genetic linkage studies and family-based studies have identified many high-and moderate-penetrance mutations in breast cancer predisposition genes, including BRCA1, BRCA2, PTEN, ATM, PALB2, and CHEK2 2 . In addition, large-scale genome-wide association studies (GWAS), conducted primarily in Asian and European women, have identified more than 180 susceptibility loci for breast cancer risk [3][4][5][6][7][8] . These identified loci explain a relatively small proportion of familial relative risk of breast cancer 8 .
The Asia Breast Cancer Consortium (ABCC) is the largest breast cancer GWAS consortium conducted in Asian-ancestry populations. We have shown previously that GWAS conducted in Asians could uncover cancer genetic risk variants that are either unique to the Asian population or more difficult to identify in studies conducted in European women 3,4,[9][10][11][12][13][14][15][16] . It also has been shown that a large proportion of common susceptibility loci are shared between Asian and European populations, although the lead variants in many loci may differ between these two populations 6,8 . To search for novel breast cancer susceptibility loci, we conducted Asian-specific and cross-ancestry meta-analyses combining the data of the ABCC and the Breast Cancer Association Consortium (BCAC) with a total sample size of approximately 310,000 women (~82,000 Asians and~228,000 Europeans). We herein report the discovery of 31 potential novel risk loci for breast cancer and the replication of a large number of known breast cancer susceptibility loci in Asian women.

Results
Overall associations for newly associated loci. We identified 28 loci with at least one common variant at each locus showing a significant association with breast cancer risk in the cross-ancestry meta-analysis (i.e., P < 5 × 10 −8 ) ( Table 1). None of these lead risk variants reside within a 500 Kb region flanked by any of the 183 previously reported breast cancer risk variants. No obvious inflation in statistical estimates was observed for either Asian-specific or cross-ancestry meta-analysis after excluding known loci (sample size-adjusted λ 1000 = 1.012 and 1.001, respectively). No evidence of heterogeneity in associations was observed between the two racial populations except for rs2758598 and rs142360995 (Table 1, P heterogeneity < 0.05, consistent in direction). The OR estimates for these 28 SNPs by study within the ABCC and BCAC consortia are presented in Supplementary Data 1 and 2. We explored pleiotropic effects by assessing the newly identified lead variants and their correlated SNPs (in LD with r 2 > 0.4 in either Asians or Europeans) from the online catalog of published GWAS (GWAS catalog). Pleiotropy was found for seven of the 28 newlyassociated SNPs (Supplementary Table 2).
All of the 28 SNPs showed a nominally significant association (P < 0.05) with ER-positive breast cancer risk ( Table 2). Fourteen of the 28 risk SNPs were also associated with ER-negative breast cancer risk in the cross-ancestry meta-analysis (P < 0.05). Heterogeneity between ER+ and ER-breast cancer risk (P heterogeneity < 0.05) was observed for rs73006998, rs7765429, rs144145984, rs78588049, and rs12481286.
To uncover possible secondary association signals in newly identified breast cancer susceptibility loci, we performed analyses for SNPs within flanking 500 kb of each lead SNP, with adjustment for the lead SNPs within each dataset. We then conduced meta-analyses to combine the results across studies of Asian women. Six potential secondary associations were identified (conditional P < 1 × 10 −4 ), and all correlated (r 2 > 0.1 in 1000 Genome East Asians) except for rs7693779, at 4p12 (Supplementary Table 4).
Of the 28 SNPs newly identified to be associated with breast cancer risk, 13 SNPs are intronic, one in UTR, and 14 in intergenic regions. Using data from ENCODE and Roadmap, we found that the majority of these 28 overlapped with genomic functional biofeatures that were indicative of promoters or enhancers ( Supplementary Data 3 and 4). The enrichment analysis supported this observation (Supplementary Fig. 2A). Of particular note is a strikingly strong enrichment signal of transcribed chromatin states that was found for the newly associated loci when compared to all risk loci ( Supplementary Fig. 2B). Enrichment signals of multiple histone modifications were also observed for both newly identified and overall association loci ( Supplementary Fig. 2C, D). The newly identified loci were enriched particularly for H4K78me2 and H4K20me1. These results indicated that the newly identified loci are tightly involved in active gene transcription events. Of the 28 lead SNPs, four (rs3790585 at 1p34.1, rs6756513 at 2p13.3, rs10820600 at 9q31.1, and rs78588049 at 12q15) intersected with chromosomal segments annotated as strong enhancers or active promoters in breast tissue-originated cell lines. When all SNPs that were in LD with the lead SNPs with r 2 > 0.8 in either Asians or Europeans were evaluated, evidence of regulatory function was found for an additional seven (i.e., 1q22-rs2758598, 3q25.1-rs73006998, 3q25.31-rs11281251, 8q22.2-rs2849506, 14q24.3-rs75004998, 15q24.2-rs8027365, and 21q22.3-rs35418111).
eQTL and gene-based analyses. To identify target genes of the 28 newly identified lead SNPs, we conducted cis-eQTL analyses in four independent datasets in breast tissue. Nine eQTL associations were identified with a P < 0.05 with same association direction in two or more independent sets (Supplementary Table 5). Potential candidate genes identified in this analysis included LINC00886, ybeY metallopeptidase (YBEY), snurportin 1 (SNUPN), mannosidase alpha class 2 C member 1 (MAN2C1), T-Box 1 (TBX1), MutY DNA glycosylase (MUTYH), lysyl oxidase like 2 (LOXL2), stanniocalcin 1 (STC1), and semaphorin 4 A (SEMA4A). SNP rs144145984 was the eQTL for both LOXL2 and STC1 genes, but the association for STC1 is much stronger. Similarly, SNP rs8027365 was associated with expression levels of two genes, MAN2C1 and SNUPN.
With the exception of TBX1 and LOXL2, we were able to build breast-tissue and/or cross-tissue models for all other eQTLidentified candidate genes with a prediction R 2 > 0.01 (Supplementary Table 6). Expressions of LINC00886, YBEY, MAN2C1 and SEMA4A could be predicted with a high accuracy by both breast tissue and cross tissue models (R 2 > 0.09). We imputed expressions of seven genes other than TBX1 and LOXL2 and showed that these genes were associated with breast cancer risk in either the ABCC or BCAC data at P < 0.05 (Supplementary Table 6). Of these, genes hypothesized to have a tumorsuppressor function included LINC00886, MAN2C1, SNUPN, and STC1, while YBEY, SEMA4A, and MUTYH may have an oncogenic role in breast carcinogenesis based on their associations with breast cancer risk (Supplementary Table 7).  Associations of previously reported risk variants in Asians. Of the 183 risk variants of breast cancer reported previously, 11 and 172 were originally discovered in studies conducted in Asians and European-ancestry populations, respectively. We were able to investigate 166 variants because 15 variants originally discovered in European populations were (nearly) monomorphic in Asians and two in high LD with rs2747652 (ESR1, 6q25.1) were removed. Of the 166 SNPs, 78 were found to be associated with breast cancer risk at P < 0.05, while 131 showed associations that were consistent in direction with those originally reported (Supplementary Data 5). Associations for five variants achieved genomewide significance (P < 5 × 10 −8 , Asians), with two at 6q25.1 (ESR1 and TAB2), and one each at 15q26.1 (PRC1), 16q12.1 (TOX3), and 21q22.12 (LINC00160). Additionally, borderline genomewide significant associations were found in seven loci including 2q14.1, 2q35, 3p24.1, 5q33.3, 9q33.3, 12p13.1 and 17q22 (P < 1 × 10 −6 in Asians).
Polygenic risk scores. We evaluated the association between PRS and breast cancer risk among SWHS participants, a subset of samples included in the Asia Breast Cancer Consortium. The PRS was generated using the weights (βs) obtained from Asian-specific meta-analysis. Women with a high estimated PRS had a 3.6-fold higher risk of breast cancer compared to those who had a low PRS (highest decile vs. lowest decile, Supplementary Table 10).

Discussion
This large-scale meta-analysis, including approximately 310,000 women of Asian and European ancestry and represents the largest GWAS to identify genetic determinants for breast cancer. In addition to identifying 31 potential novel risk loci for breast cancer (Table 1, Supplementary Table 8, and Statistical Methods), we replicated in Asian women 78 of the GWAS-identified risk variants for breast cancer. Since the risk variants initially reported in European populations might not be the lead SNPs in Asians, we performed further analyses to show that 21 known susceptibility loci may harbor additional independent signals, of which 16 showed at least one stronger association than the originally reported risk SNP. Our study has generated substantial novel information to improve the understanding of breast cancer genetics and etiology and provides clues for future studies to functionally characterize the risk variants and candidate genes identified in our study.
Similar to other GWAS, nearly all of the newly identified risk variants mapped to intergenic regions or introns of genes. One exception was rs10820600, which is located in the 5′-UTR region of the SMC2 gene. SMC2 encodes the structural maintenance of chromosomes protein-2, an essential subunit of the condensin complex I and II. The protein is critically involved in chromosome condensation and segregation during cell cycles 17 . Emerging evidence shows that SMC2 mutations and dysregulated expression are associated with multiple cancers 18 .
Of the thirteen lead risk variants located in the introns of genes, six showed strong evidence of cis-regulation for seven genes nearby, including YBEY, SNUPN, MAN2C1, LINC00886, TBX1, SEMA4A, and MUTYH. For example, the locus at 21q22.3 (rs35418111) showed compelling evidence of influencing expression of YBEY, a gene that encodes a highly conserved metalloprotein. Our gene-based analysis indicated a potential oncogenic role of YBEY in breast cancer development. Although the function of YBEY has not been fully elucidated, dysregulation of its expressions caused by copy number variation has been found in familial and early-onset breast cancer 19 , as well as colorectal cancer 20 . Further, we showed that MAN2C1 may play a protective role against breast carcinogenesis in the gene-based analysis. However, another study found that MAN2C1 promotes cancer growth via a negative regulation of phosphatase and tensin homolog (PTEN) function in prostate and breast cancer cell lines 21 . These results suggested that MAN2C1 may have distinct functional impact on cancer initiation compared to that on tumor progression. Few studies have investigated the mechanistic roles of LINC00886, SNUPN and SEMA4A in cancer initiation. Germline mutations in SEMA4A have been linked to the predisposition of familial colorectal cancer type X 22 . Our study provides the first evidence linking these two genes to breast cancer susceptibility.
Potential candidate genes were also revealed by the newly associated variants lying in the intergenic regions between coding genes. LOXL2 and STC1 were pinpointed as cis targets of rs144145984 at 8p21.2. LOXL2 is a member of the lysyl oxidase family of amine oxidases and STC1 belongs to the glycoprotein hormones family. Research regarding the functions of LOXL2 and STC1 in cancer development is limited. However, pre-clinical studies have implicated LOXL2 and STC1 in the progression of breast cancer 23,24 . Inhibiting LOXL2 activity shows a 55-75% decrease in primary tumor volume in female athymic nude mice, which were implanted with MDA-MB-231 human breast cancer cells 23 . The reduction in tumor burden was suspected to be mediated by the inhibition of angiogenesis. A recent study suggested the role of STC1 played in the breast tumorigenesis could be subtype-dependent 24  The pleiotropy of rs855596 at 12q23.2 provided a plausible mechanistic link for the observed genetic association with breast cancer risk. The minor (T) allele of rs855596 is associated with decreased breast cancer risk and is linked to the minor allele G of the nearby rs703556 (r 2 = 0.94 in EA and 0.43 in East Asians). The G allele of rs703556 is associated with lower mammographic dense area in women 25 . Mammographic density, an established risk factor for breast cancer 26 , is a measure based on the radiographic appearance of the breast by mammography. Several loci were related to other cancers or benign tumors. SNPs in 22q11.21, 1q22 and 4q12 were found to be associated with risk of prostate cancer 27 , testicular germ cell tumor 28 and leiomyoma, respectively 29 . We hypothesize potential underlying mechanisms via hormone metabolism for these loci. Variants at 10p12.2 (PIP4K2A) indicated an association with risk of acute lymphoblastic leukemia 30 and 6p22.3 (CASC15) with endometrial cancer 31 , lung cancer 32 , and neuroblastoma 33 . These regions implicated in genetic susceptibility across different types of cancers may serve as prioritized target of interest for future finemapping studies. For some of the phenotypes like blood cell counts and sodium levels, we currently lack the proper knowledge to decipher the likely mechanisms that link them to breast cancer development.
Notable racial heterogeneity was found for the loci at 1q22 (rs2758598) and 8q24.11 (rs142360995), which may reflect the differential regional LD structures and allele frequency between the two populations at these loci. The effect sizes in Asians are larger than those in European populations for both SNPs, over four times for rs142360995 and two times for rs2758598. The association at 3q25.1 (rs73006998) was dominant by estimates in Asians (ABCC: 2.4 × 10 −9 ; in BCAC, P = 5.8 × 10 −3 ), although no heterogeneity was observed. Previously, the same locus was reported to be associated with hormonal receptor-positive breast cancer, with a borderline genome-wide significance in a Japanese population (rs6788895, LD r 2 = 0.76 in East Asians) 34 . We found significant heterogeneity by ER status for this locus and the association was primarily driven by ER-positive cancer. Racial heterogeneity was also observed for many known risk variants initially reported in European populations. It may be attributable to multiple factors including the Winner's curse 35 , populationspecific LD structure, and false positives in the original GWAS.
Sixty-seven of the 155 index SNPs originally reported in European-ancestry women were replicated in women of Asian descent at P < 0.05. For those not replicated in our analysis, possible explanations include differences in local LD structure and genetic architecture for the disease between these two populations and a relatively small sample size of Asian studies. In summary, in this large GWAS including 147,183 breast cancer cases and 130,749 unaffected controls, we identified 31 potential novel breast cancer susceptibility loci by meta-analyzing data of two large consortia conducted in Asian and European women.
Using an independent set of 16,787 cases and 16,680 controls of Asian ancestry, we evaluated 22 lead variants and found that all variants showed the same direction of the association, although only ten of them were statistically significant. As many of the associations were driven by GWAS of European women and the sample size of our replication set was small, the low replication rate is not unexpected. Nevertheless, our study reveals many novel loci and potential targeted genes that may influence breast cancer susceptibility, although the possibility of false-positives for some loci cannot be completely ruled out. Future investigations are warranted to replicate our findings.

Methods
Study population. The overall cross-ancestry meta-analysis was conducted using data from two large consortia, the ABCC and BCAC. Detailed descriptions of participating studies are included in Supplementary Note 1. Briefly, in the ABCC, genome-wide SNP data were generated from 24,206 breast cancer cases and 24,775 unaffected controls recruited from studies conducted in mainland China, South Korea, and Japan (Supplementary Table 1 Genotyping and quality control. All of the genotyping and quality control procedures for GWAS, except for the expanded MEGA EX chip, have been described elsewhere 3,4,[6][7][8][9][10][11][12]34,36,37 (Supplementary Table 1). The MEGA EX chip contains approximately 2.04 million variants with an excellent genomic coverage of common variants (a minor allele frequency of 0.01 or higher) across multi-racial populations. We added to the MEGA EX chip~80k variants selected from our GWAS of breast and colorectal cancers and exome sequencing data for breast cancer cases in Asian-ancestry populations. In total, 2.1 million variants were included on this array. Quality control (QC) procedure include: samples were excluded if they (i) had genotyping call rate <95%; (ii) were male based on genotype data; (ii) had a close relationship with a Pi-HAT estimate >0.25; (iii) were heterozygosity outliers; (iv) were ancestry outliers. SNPs were excluded if they had (i) a call rate <95%; (ii) no clear genotyping clusters; (iii) a minor allele frequency <0.001; (iv) a Hardy-Weinberg equilibrium test of P < 1 × 10 −6 ; (v) genotyping concordance < 95% among the duplicated QC samples 3,4,[6][7][8][9][10][11][12]34,36,37 . All of the datasets were imputed using the 1000 Genomes Project Phase 3 mixed populations as the reference panel, except for the BioBank Japan (BBJ1) study, in which the HapMap Phase II (release 22) was used. Only SNPs with an imputation R 2 > 0.3 were included in the further analyses.
Genotyping of the replication set of cases and controls was completed using the iPLEX Sequenom MassArray platform (Agena Bioscience Inc., San Diego, California, USA). One negative control (water), two blinded duplicates and two samples from the HapMap project were included as QC samples in each 96-well plate. Samples or SNPs that had a genotyping call rate of <95% were excluded. We also excluded SNPs that had a concordance with the QC samples of <95% or an unclear genotype call. If the assay could not be designed for the lead SNP, a surrogate SNP which is in LD with the lead SNP with r 2 > 0.8 in Asians (1000 Genome) was selected. Of the 28 newly identified risk variants, 22 were successfully genotyped by Sequenom and evaluated in the association analysis, while six failed in the probe designing stage. Additional 11,642 independent samples from MYBRCA and SGBCC studies (Supplementary Note 1) were also included in the replication stage in evaluation of the 22 newly identified risk variants.
Statistical methods. Logistic regression analysis was performed within each study of Asian women to obtain a per-allele odds ratio (OR) for each SNP using PLINK2.0 38 . Principal components analyses were conducted within each GWAS dataset. Age and the top two PCs were included as covariates for in all regression models. Study (COGS) or country/region (OncoArray) was also included in the analyses of BCAC data 8 . The number of PCs to be included in the regression was determined by evaluation of Scree plot. Sensitivity analyses were conducted to include top 10 PCs, which showed very similar ORs as those derived from analyses adjusted for two PCs (Supplementary Table 11). A meta-analysis was performed using METAL 39 with a fixed-effects model to generate Asian-specific and crossancestry estimates. Heterogeneity was assessed by the Cochran's Q statistic and I 2 .
For the cross-ancestry meta-analysis, we were mainly interested in evaluating variants that were associated with breast cancer risk at P < 0.01 in the Asian-specific analysis (n snp = 244,746). However, three additional lead SNPs that did not meet this criterion can also be found in Supplementary Table 8. One representative SNP with the lowest p value was reported as the index SNP for each of the newly identified loci after variant pruning (LD r 2 < 0.1). The significant locus is considered novel if it is located 500 kb away from the 183 known risk loci for breast cancer The LD with known risk SNPs was also checked to verify the independence. Among the newly associated loci, we further applied the method implemented in MR-MEGA 40 to account for the population heterogeneity for two loci showing significant heterogeneity in the cross-ancestry fixed-effect meta-analysis. The results were shown in the Supplementary Table 9. The association was slightly more significant than the original fixed-effect meta-analysis for these two loci. Inflation of the test statistics (λ) was estimated by dividing the 50th percentile of the test statistic by 0.455 (the 50th percentile for a χ 2 distribution on 1 degree of freedom) 41 . We standardized the inflation statistic to account for the large size of our study by calculating λ 1000 (λ for an equivalent study with 1000 cases and 1000 controls) 8 . For the replication stage, analyses were conducted with an adjustment for age and study.
For each of the Asian studies with GWAS data (Supplementary Table 1), we searched for independent secondary association signals within a flanking +/− 500 kb region of the lead variant in each of the newly identified breast cancer risk loci using conditional analysis, with an adjustment for the newly identified lead risk SNPs when individual-level data was available log P We used GCTA software (option -COJO) 42 to perform the conditional analysis for the BBJ1, Seoul Breast Cancer Study (SeBCS), and BCAC European GWAS, for which only summary statistics data were available. MEGA array genotyping data was used as reference panel for LD estimation. The results of individual study were combined by a fixed-effect metaanalysis using METAL. SNPs showing an association with breast cancer risk at P conditional < 1 × 10 −4 were considered independent secondary association signals. The analysis was also performed within known susceptibility loci. All statistical tests were two-sided.
Statistical power. For the cross-ancestry meta-analysis (sample size shown in the Supplementary Table 1, alpha set to 5.0 × 10 −8 ), we had >80% power to detect the association between SNP and breast cancer risk with an OR of >1.06, 1.07, and 1.11 and EAF of 0.10 in the analysis of ER-positive, ER-negative cancer and all cancer combined, respectively (Supplementary Table 18).
Functional annotation and enrichment analysis. Novel risk loci were defined as those ±500 Kb away from the lead risk variant reported by previous GWAS conducted in populations of Asian or European-ancestry for breast cancer. The lead risk SNPs newly identified in our study were defined as the variant showing an association with breast cancer risk with the lowest P-value in a given locus in the meta-analysis. Functional annotations of the lead SNPs and their correlated variants (r 2 > 0.8 in 1000 Genomes Project, East Asian or European populations) were performed using HaploReg v4.1 43 . The functional annotation of chromatin states from chromHMM, DNase I hypersensitive and histone modifications such as H3K4, H3K9 and H3K27, were based on the epigenetic data in human breast mammary epithelial cells (HMEC), MCF-7 cells, and other cell lines from the Encyclopedia of DNA Elements (ENCODE) Project and Roadmap Epigenetics Project. We further applied GARFIELD 44 to assess functional enrichment for all risk loci identified to date for breast cancer risk and those newly reported in the current study. According to GARFIELD, the significance level for the enrichment analysis was set to 9.7 × 10 −5 . Known risk loci (±500 kb) were removed when evaluating functional enrichment for the newly identified loci.
Expression quantitative loci (eQTL) analysis. To identify target genes, we performed eQTL analysis utilized four independent sets of gene expression data derived from normal breast (N = 85, GTEx, women of European ancestry), breast tumor (women of European ancestry, TCGA, N = 672; METABRIC, N = 1904) and adjacent normal tissues (women of Asian ancestry, SBCGS, N = 151). We focused on cis-eQTL analyses for genes residing ±500 Kb flanking each newly associated leading SNP. The details of data processing were described in Supplementary Note 2.
A linear regression model was used to perform eQTL analyses to estimate the additive effect for each leading SNP identified on gene expression levels. We additionally adjusted for somatic copy number alteration and methylation levels in the regression model for the analysis of TCGA data. We only adjusted for somatic copy number alteration in the analysis for the METABRIC set.
Gene-based analysis. We recently conducted a transcriptome wide association study (TWAS) to investigate associations of genetically predicted gene expression with the risk of breast cancer 45 . We utilized the same approach to examine the associations with breast cancer risk of genes located within flanking 500 kb of each newly associated leading SNP. The breast-specific prediction model was generated using the elastic net method as implemented in the glmnet R package (α = 0.5), with tenfold cross-validation 45 . To further increase statistical power, we also utilized 6,124 samples across 39 tissue types from 369 unique European individuals who had genome-wide genotype data available to build cross-tissue models 46,47 . The expression of a gene for individual i in tissue t, Y i;t , is modeled as represents the cross-tissue component of expression levels for a given gene. The mixed effect model parameters were estimated using the lme4 package in R. The predicted gene expressions b Y i in the breast-specific models and d Y CT i in the cross-tissue models then were evaluated for their associations with breast cancer risk in the ABCC and BCAC, using methods implemented in MetaXcan 48 .
Polygenic risk score. We used the 11 risk SNPs originally reported in Asian populations, 28 newly identified SNPs from the current analysis (Table 1), and 28 risk SNPs originally identified in European populations that were replicated in the Asian populations in this current study (Supplementary Data 5, P < 0.05/166) to generate polygenetic risk score (PRS). PRS were calculated as PRS ¼ P β i SNP i .
The weights, βs, used to generate the score were obtained from Asian-specific meta-analysis. The association between the score and breast cancer risk was tested in the samples from Shanghai Women's Health Study (SWHS, N total = 2427, N case = 368, N control = 2059), which were also contributed to the Asian MEGA project. The PRS was tested in both continuous (1 SD change) and categorical forms (deciles in controls). The area under the curve was also calculated to show its discriminatory ability. Overfitting is less a concern as SWHS participants only accounted for a very small proportion in the Asian-specific meta-analysis (~8%).
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.