Evaluation of the association of heterozygous germline variants in NTHL1 with breast cancer predisposition: an international multi-center study of 47,180 subjects

Bi-allelic loss-of-function (LoF) variants in the base excision repair (BER) gene NTHL1 cause a high-risk hereditary multi-tumor syndrome that includes breast cancer, but the contribution of heterozygous variants to hereditary breast cancer is unknown. An analysis of 4985 women with breast cancer, enriched for familial features, and 4786 cancer-free women revealed significant enrichment for NTHL1 LoF variants. Immunohistochemistry confirmed reduced NTHL1 expression in tumors from heterozygous carriers but the NTHL1 bi-allelic loss characteristic mutational signature (SBS 30) was not present. The analysis was extended to 27,421 breast cancer cases and 19,759 controls from 10 international studies revealing 138 cases and 93 controls with a heterozygous LoF variant (OR 1.06, 95% CI: 0.82–1.39) and 316 cases and 179 controls with a missense variant (OR 1.31, 95% CI: 1.09–1.57). Missense variants selected for deleterious features by a number of in silico bioinformatic prediction tools or located within the endonuclease III functional domain showed a stronger association with breast cancer. Somatic sequencing of breast cancers from carriers indicated that the risk associated with NTHL1 appears to operate through haploinsufficiency, consistent with other described low-penetrance breast cancer genes. Data from this very large international multicenter study suggests that heterozygous pathogenic germline coding variants in NTHL1 may be associated with low- to moderate- increased risk of breast cancer.


INTRODUCTION
NTHL1 encodes a DNA glycosylase that is a critical component of the DNA base excision repair (BER) pathway involved in the repair of oxidatively damaged DNA. It has recently been shown that carriers of bi-allelic loss-of-function (LoF) variants in NTHL1 are predisposed to colorectal adenomatous polyposis and colorectal cancer 1 , and to a multi-tumor syndrome that includes a high incidence of breast cancer in female carriers [2][3][4][5] . Grolleman et al. 4 described the largest set of carriers of bi-allelic germline NTHL1 variants (29 carriers from 17 families), and reported that 9 of 15 female carriers (60%) were diagnosed with breast cancer at an earlier age than observed in the general population (48.5 years 1 Cancer Genetics Laboratory, Peter MacCallum Cancer Centre, Melbourne, Vic, Australia. 2  compared to 62 years). In contrast to the previously described BER defect caused by MUYTH deficiency, multiple tumor types from carriers of germline bi-allelic NTHL1 LoF variants exhibit a distinctive somatic mutation pattern (Single Base Substitution Signature 30 [SBS30] in the COSMIC database 1,6 ) characterized by an abundance of C > T transitions at non-CpG sites, indicating that an NTHL1-driven BER defect was the predominant mutational process driving the development of these tumors. These data demonstrated that germline bi-allelic inactivation of NTHL1 predisposes to breast cancer, although individuals with two LoF variants are very rare in the population 4 . In contrast, 0.37% of noncancer participants in gnomAD are carriers of monoallelic LoF variants (gnomAD V2.1.1, 134,187 participants), but whether carriers are also predisposed to breast cancer has not been evaluated. To address this question, this study sequenced all exons and exon-intron boundaries of NTHL1 in 9,771 subjects in the hereditary BrEAst Case CONtrol (BEACCON) study, comprising index cases from hereditary breast cancer families who tested negative for germline pathogenic variants in BRCA1 and BRCA2 and cancer-free older female controls (average 49.7 years vs. 65.6 years) in the same population. In addition, whole-genome and targeted sequencing was performed on formalin-fixed, paraffinembedded (FFPE)-derived DNA from the breast cancers of 20 germline NTHL1 LoF variant carriers. Further NTHL1 sequencing data were analyzed from nine additional case-control studies, to give a combined analysis of 47 Table 1). A case-case analysis of cancer incidence in the families of individuals harboring heterozygous NTHL1 LoF variants found no statistically significantly elevated incidence of colorectal cancer, female breast cancer, male breast cancer or ovarian cancer when compared to the NTHL1 wild-type families, although the number of available NTHL1 families was small and the statistical power was limited (Table 2).
Missense variants in NTHL1 were also significantly enriched in the cases compared to the controls (75, 1.50% vs. 47, 0.98%, OR 1.54, 95% CI: 1.05-2.27, P = 0.02). All the variants were individually "rare" with the highest minor allele frequency [MAF] detected in the controls being 0.0015 in BEACCON and 0.0018 in gnomAD (Supplementary Table 2). Consistent with the frequencies reported in gnomAD, p.(Arg100Cys) and p.(IIe176Thr) were the most common missense variants in both the case and control cohorts. The association of NTHL1 missense variants with hereditary breast cancer remained when applying a population rarity filter or in silico prediction tools, Condel, PolyPhen2, SIFT, CADD, and REVEL to enrich for likely pathogenic variants, with the strongest effect observed for the missense variants that were selected for rarity (MAF ≤ 0.001, 39 versus 20, OR 1.88, 95% CI: 1.07-3.41) (Supplementary Table 3).
In contrast to the high penetrance reported for the multi-tumor syndrome associated with bi-allelic LoF in NTHL1 4 , the OR for heterozygous LoF or missense variants suggests only a moderateto low-penetrance effect. In this context, the background genetic risk contributed by common, low-penetrance single nucleotide polymorphisms (SNPs) may have an important risk-modifying effect as described for other low-moderate penetrance genes 7 . To assess this possibility, 70 SNPs with well-established, significant associations with breast cancer were used to calculate a polygenic risk score (PRS) for each carrier as described previously 8 . A multivariable logistic regression model was used to simultaneously evaluate the association of NTHL1 LoF variants, missense variants, and the PRS with breast cancer in the case-control cohort. The ORs observed for LoF variants, missense variants, and the OR per unit standard deviation for the PRS were 2.41 (95% CI:  Supplementary Fig. 1.

Co-segregation analysis in families with germline NTHL1 variants
Co-segregation analysis of breast cancer in families was performed in 16 multi-case families from the BEACCON study segregating NTHL1 LoF variants where detailed pedigree information was available. Sanger sequencing was performed to determine the genotype of 38 additional family members. Analysis of cosegregation using a full-likelihood method 10 calculated a maximum likelihood ratio of 1.18 at an odds ratio for breast cancer of 1.69-insufficient to either support or reject an association with breast cancer predisposition. To assess the occurrence of somatic bi-allelic inactivation in NTHL1 and characteristic mutational signatures in NTHL1-associated tumors, targeted sequencing was performed on tumor DNA from cases with a germline NTHL1 LoF variant using a customdesigned panel of 259 genes (total targeted region of 1.337 Mb) that included all exons and exon-intron boundaries of NTHL1 and 27 breast-cancer driver genes 11 . Fourteen breast cancers and ovarian cancer from individuals with heterozygous NTHL1 germline variants, together with breast cancer and colorectal cancer from the individual homozygous for germline LoF variants, were sequenced. The alternative allele frequency of the germline variants, reconstructed copy number profile, and tumor purity estimation was used to determine loss of heterozygosity in the NHTL1-associated tumors as described previously 12,13 . The biallelic mutation was confirmed in the breast and colorectal cancers from the homozygous carrier, however, there was no evidence of loss of the wild type allele or a second point mutation in NTHL1 in any cancers from heterozygous germline variant carriers. Since CpG island methylation in the promoter region is a possible alternative mechanism of gene silencing, bisulfite sequencing was performed across the NTHL1 promoter region for 13 breast cancers with sufficient DNA available. No evidence of hypermethylation was observed in any of the cancers tested ( Fig. 1a, Supplementary Fig. 2).
To investigate whether NTHL1-associated breast cancers are driven by the same mutagenesis mechanism as the colorectal cancers from carriers of homozygous mutations 1,6 , whole-genome sequencing was performed on the breast cancers from the NTHL1null and NTHL1-het carriers of three different germlines LoF variants (p.(Gln90Ter), p.(Gln287Ter), and p.(Ser22AlafsTer5)), together with nine sporadic breast cancers with no known germline cancer predisposition gene mutations as controls. A predominant mutational signature SBS30 was observed in the NTHL1-null breast cancer (Fig. 1b), consistent with the previously reported NTHL1 bi-allelic loss driving tumorigenesis mechanism. In line with the absence of bi-allelic inactivation, the three NTHL1-het breast cancers each exhibited a mixture of mutational signatures, with the top contributing signatures including COSMIC signatures SBS3, SBS5, and SBS16, similar to the sporadic breast cancers. While one of the three NTHL1-het cancers showed a minor proportion of SBS30 (C37112, p.(Gln287Ter), Fig. 1b), this was also observed in one of the nine sporadic cancers, suggesting no major difference in mutational processes between NTHL1-het and sporadic control cancers (BC3, Supplementary Fig. 3).  Fraction of genome alteration and HRD scores in NTHL1 associated breast cancers Genomic instability and HRD were measured using genome-wide copy number data from the NTHL1 associated cancers. A genomic instability index, a fraction of the genome altered by copy number (FGA) 15 , and an HRD score 16,17 were generated for 14 NTHL1-het cancers and one NTHL1-null cancer. A comparison cohort of breast cancers with expected HR defects from PALB2 (n = 15) and RAD51C (n = 9) germline LoF variant carriers, and 115 sporadic breast cancers sequenced using the same platform, were also evaluated. The NTHL1-het cancers exhibited a broad range of FGA scores (Fig. 1c) (Fig. 1d). The one NTHL1-null breast cancer also showed a very low FGA and HRD score (0.07 and 7).

NTHL1 protein expression in NTHL1 associated breast cancers
To investigate whether heterozygous NTHL1 LoF variants were associated with reduced protein expression and/or altered cellular location in breast cancers as has been observed in gastric tumors 18 , fluorescent immunohistochemistry was used to measure NTHL1 protein levels, along with an epithelial cell marker (Cytokeratin AE1/AE3), in 8 NTHL1-het breast cancers and 21 sporadic breast cancers. NTHL1 had a predominantly nuclear localization in both wild-type and NTHL1-het cancers. Compared to sporadic breast cancers, NTHL1-het cancers as a group showed a 51% reduction in NTHL1 staining in the AE1/AE3-positive cancer cells (average staining intensity 24.51 vs. 49.83, P < 0.001 by unpaired t test) (Fig. 2a, b) and a 40% reduction in the AE1/AE3negative non-cancer cells (35.62 vs. 59.65, P < 0.001). In addition, for 5 of the 8 NTHL1-het cancers, the expression of NTHL1 in the breast cancer cells (AE1/AE3 positive cells) was reduced by >30% compared to the surrounding non-cancer cells (AE1/AE3 negative cells) ( Supplementary Fig. 4), while only a small proportion of control cancers showed the same phenomenon (5 of 21), suggesting that NTHL1 expression may be attenuated further specifically in the cancer cells of NTHL1 LoF variant carriers. When considering the pathological subtypes individually the average NTHL1 expression level was lower in NTHL1-het tumors than in the controls in all three major subtypes: ER+, HER2+, and triplenegative cancers, but only reached statistical significance in the triple-negative cancers (Fig. 2c).  Table 5.
No additional bi-allelic LoF variants were identified in subjects of the validation dataset, indicating germline bi-allelic loss of NTHL1 is extremely rare as a cause of breast cancer (1/27,421 breast cancer cases and 0/19,759 controls). In the heterozygous state, the overall effect observed in the validation dataset (OR = 0.84, 95%CI = 0.62-1.13) did not support the finding in the BEACCON dataset, with 4/9 studies showing a weak positive association while the others showed no effect or a weak negative association, although the sample sizes of most studies were small (Fig. 3a). The frequency of LoF variants observed in each study varied greatly among both the cases (0.29-0.76%) and the controls (0.24-0.68%), largely driven by differences in frequency of the predominant variant p.(Gln90Ter) (Fig. 4a; Supplementary Table 6). The remaining LoF variants identified were all rare in gnomAD (MAF < 0.001), with an overall statistically significant enrichment, observed in the cases in the combined data for these variants from all 10 studies (37 vs. 12, OR 2.22, 95% CI: 1.16-4.26, Fig. 4b).
In contrast to the LoF variants, the frequency of missense variants was higher among the cases in the majority of studies in the validation dataset (6/9 studies), and was statistically significant in the overall analysis of the validation dataset (OR 1.24, 95% CI: 1.00-1.53) and the combined BEACCON and validation datasets (OR 1.31, 95% CI: 1.09-1.57) (Fig. 3b). While the detected missense variants were distributed across the whole gene (Supplementary Table 6), there was an enrichment of missense variants in the cases compared to controls in the Endonuclease III domain of NTHL1 (187 vs. 95, OR 1.42, 95% CI: 1.11-1.82) (Fig. 4c, d). When a number of in silico prediction tools and population rarity filters were applied to enrich for likely pathogenic variants in the combined data from the ten studies, the predicted missense variants were observed in excess in the cases compared to the controls and reached statistical significance in all the in silico prediction groups: Condel, PolyPhen2, SIFT, CADD, and REVEL ( Table 3). The strongest enrichment was observed for variants with a high REVEL score of pathogenicity (REVEL > 0.75, OR 1.44, 95% CI: 1.08-1.94), and interestingly these likely pathogenic variants according to REVEL were also predominantly located in the conserved functional domain of NTHL1 protein, Endonuclease III (COG0177, a member of ENDO3c Superfamily) ( Supplementary  Fig. 5).

DISCUSSION
NTHL1 is a DNA glycosylase involved in the earliest steps of the repair of oxidative DNA damage through the BER pathway. Through its endonuclease activity it acts to excise damaged nucleotides, predominantly pyrimidines, and loss of NTHL1 activity would be expected to impact the effectiveness of normal DNA repair. This effect was shown to be clinically significant by the discovery of a specific cancer predisposition syndrome involving complete loss of normal NTHL1 function through bi-allelic pathogenic germline LoF variants. The resulting condition not only demonstrates high risk for a range of different malignancies, but it was possible to show that specific mutational processes arising from the loss of NTHL1-mediated BER dominated the genetic pathology of the resulting cancers, reflected in the distinctive somatic mutational signature, COSMIC signature SBS30. Although the recessive condition has proved to be very rare, the analogy with other tumor suppressor genes raised the possibility that inheritance of a single germline pathogenic variant followed by somatic inactivation of the wildtype allele could achieve the same outcome through a "two-hit" mechanism with the result that heterozygous variants would also be associated with inherited cancer risk. We have examined this possibility specifically in regard to the risk of breast cancer through sequencing of a large number of cases and controls and examination of the somatic landscape of cancers occurring in women carrying a single NTHL1 LoF variant. The results have shown that germline bi-allelic loss is not a major contributor to breast cancer in the populations studied and there is equally no evidence that heterozygous variants act through a traditional two-hit pathway to cause breast cancer. However, the data do support a possible low-risk effect for at least some heterozygous LoF variants and rare, deleterious missense variants. Our data reflects the recent findings that heterozygous LoF variants in NTHL1 do not confer any substantial risk for colorectal cancer and do not undergo bi-allelic inactivation 20 .
The difficulty in providing clear evidence of a breast cancer risk association for LoF variants in the cases and controls appears to reflect, in part, a higher degree of heterogeneity among the ten analyzed studies compared to little variability between studies for missense variants. The statistically significant two-fold excess in the BEACCON study was not observed in the validation dataset and this discrepancy was largely driven by the variable frequency of LoF variants in control populations across cohorts which ranged from 0.24 to 0.68%. In particular, the frequency of the p.(Gln90Ter) variant, which accounted for the large majority of LoF variants, was highly variable among the 10 studies, as it is across different ethnic groups reported in gnomAD; ranging from~1 in 280 in the Finnish population to less than 1 in 3000 in African and Asian populations. Differences in the ethnic background and ascertainment methods among the different studies may explain the diverse results in the sub-studies. This variability was, however, limited to the p.(Gln90Ter) variant. If the analysis of the data from the ten studies in combination was restricted to the remaining rarer LoF variants, a statistically significant two-fold excess was observed in cases compared to the controls (37, 0.13% of 27,421 cases compared to 12, 0.06% of 19,759 controls, OR 2.22, 95% CI: 1. 16-4.26). With the exclusion of the most common LoF variant, the numbers are small and the result requires validation in further studies. Stronger evidence for a low-risk predisposition effect was found for missense variants, which were more consistently enriched in cases across sub-studies but also showed further enrichment when selected for features most likely to be associated with disruption of normal gene function: rarity, deleterious in silico properties or location in a major functional domain.
Tumor sequencing and promoter methylation analyses demonstrated that NTHL1 does not undergo bi-allelic inactivation in breast cancers, which is supported by the fact that NTHL1-het tumors retain approximately 50% of the wildtype protein expression. Although Drost et al. 6 reported loss of the wild-type allele in a single breast cancer carrying an NTHL1 LoF variant (p.(Gln287Ter)), our data suggest this is not a common event in NTHL1-associated breast tumors. The reduced, but not complete loss of NTHL1 expression in NTHL1-het cancers is consistent with the fact that, while the mutational spectrum of the NTHL1-null breast cancer strongly resembled COSMIC mutational signature SBS30, only minor components of SBS30 were observed in one of three NTHL1-het tumors examined. Although the total number of tumors analyzed remains small, the NTHL1-het tumors appeared similar to sporadic breast cancers in regard to the somatic mutational features, and in other respects (HRD scores, histopathology, and FGA).
The absence of a second hit in NTHL1 may be a generic feature of low-moderate penetrance alleles. While many high-risk breast cancer genes BRCA1 21 , BRCA2 22 , PALB2 23 , and RAD51C 12 follow the Knudson two-hit model, where a germline mono-allelic variant in  Fig. 4) was used to identify cancer cells in the breast cancer tissue in NTHL1 expression quantitation in (b) and (c). Scale bar = 100 μm. b The average intensity of NTHL1 in sporadic cancer group compared to NTHL1-het group. c The average intensity of NTHL1 in sporadic cancer group compared to NTHL1-het group according to ER, PR, and HER2 status. BC breast cancer, TNBC triple-negative breast cancer. the gene is accompanied by somatic loss of the wild-type allele, the limited data currently available for alleles which confer low-(RR 1-2) to moderate-(RR 2-4) risk suggest they do not always undergo a second hit. For example, a second hit rarely occurs in breast cancers from carriers of the lower penetrance CHEK2 p. (Ile157Thr) variant 24 or the BRCA2 p.(Lys3326Ter) variant 22 , while limited reports suggest that it does occur in breast cancers from women carrying germline CHEK2 LoF variants 24,25 . It is possible that alleles associated with low or moderate risk mediate cancer predisposition through pathways that do not require a second hit, such as haploinsufficiency where reduced protein levels rather than the complete loss are sufficient to increase the risk of disease [24][25][26][27][28][29][30][31][32][33] . Reduced NTHL1 expression has been previously described as a feature of specific malignancies 34,35 and recent Fig. 3 Frequency of heterozygous germline variants in NTHL1 and odds ratios for breast cancer observed in case-control data from ten multicenter international cohorts. a Heterozygous loss-of-function variants in NTHL1 and odds ratios in ten case-control cohorts. b Heterozygous missense variants in NTHL1 and odds ratios in ten case-control cohorts. The overall effect of odds ratios were computed based on a fixed-effect model. BEACCON  reports have found that NTHL1 missense variants can induce cellular transformation and genomic instability in vitro while retaining normal cellular location and enzymatic function 36,37 , raising the possibility that a non-canonical function may be involved. Alternatively, the distinction between high-risk cancer susceptibility variants that undergo a somatic second hit and lowrisk alleles that do not-even where bi-allelic loss appears to convey a clear oncogenic advantage, as demonstrated in the case of NTHL1 by the recessive cancer syndrome-directly reflects the fact that these alleles are less prone to obtain a second hit leading to a complete loss of the function, and always retain some activity in the tumor.
In summary, this is the first study to investigate the role of heterozygous NTHL1 LoF and missense variants in breast cancer predisposition, which included 47,180 subjects from ten international case-control studies. The data suggest that NTHL1 may be associated with a modest increase in breast cancer risk that would not be considered clinically actionable in isolation under current clinical guidelines but could be relevant when combined with additional risk factors. Molecular analyses of breast cancers from carriers indicate that NTHL1 may be included in the growing list of low-penetrance breast cancer genes that appear to function via haploinsufficiency rather than the bi-allelic inactivation mechanism almost universally observed for high-risk breast cancer predisposition genes.

Case-control Subjects
All exons and exon-intron boundaries of NTHL1 were analyzed in the index cases of 4985 hereditary breast cancer families and in 4786 cancerfree women in the hereditary BEACCON study. The cases were female breast and/or ovarian-cancer affected patients (>95% breast cancer affected) in the Variants in Practice (ViP) Study that were ascertained from the combined Victorian and Tasmanian Familial Cancer Centres, and Pathology North, NSW Health Pathology, Newcastle, Australia. The controls were cancer-free women who were greater than or equal to 40 years old from the same population from the Lifepool study as described previously 38 . A hereditary breast cancer family is defined as those assessed by a specialist Familial Cancer Clinic where most of the affected family members meet family history criteria or have individual risk factors that predict a greater than 10% chance of having a BRCA1 or BRCA2 pathogenic variant (a detailed guide is in https://www.eviq.org.au/ cancer-genetics/adult/genetic-testing-for-heritable-mutations/620-brca1and-brca2-genetic-testing). All cases tested negative for pathogenic variants in BRCA1 and BRCA2 including large scale genome rearrangements. The average breast cancer diagnosis age of the cases was 49.7 years (range 19.0-94.8). The average age of controls was 65.6 years (range 40.0-97.5). The study was approved by the Human Research Ethics Committee at the Peter MacCallum Cancer Centre (Approval # 09/29) and all participating centers. All participants provided informed consent for genetic analysis of their germline DNA.
Nine international case-control studies of diverse countries of origin and sample sizes in which all exons of NTHL1 gene were analyzed were used as validation cohorts. The studies were; UK Population-based Breast Cancer Study (SEARCH), German Consortium for Hereditary Breast and Ovarian Cancer (GC-HBOC), French familial BRCAx study (GENESIS, some data were previously published 19 Table 5).

NTHL1 sequencing using germline DNA
Germline DNA from the cases in the BEACCON study was obtained from blood and was extracted in clinical laboratories, and the DNA from controls were obtained from blood (87%) and saliva (13%) and were extracted by Lifepool researchers. All exons and 10 bp into each exon-intron boundary of the NTHL1 gene were sequenced using a customized targeted HaloPlex HS Targeted Enrichment Assay panel (Agilent Technologies, Santa Clara, CA) which was designed using Agilent's SureDesign tool at https://earray. chem.agilent.com/suredesign/. A set of 74 Ancestry Informative Markers (AIMs) [39][40][41] was included in the sequencing of 3409 subjects (1747 cases and 1662 controls) to determine the ethnicity background of study subjects. Furthermore, a total of 70 SNPs that were reported to be associated with breast cancer in GWAS studies 8 were sequenced in all subjects to calculate breast cancer PRSs. Library preparation was performed using the Agilent NGS Bravo automation system (Agilent Technologies) according to the manufacturer's protocol (Agilent Technologies, HaloPlex HS Target Enrichment System Automation Protocol For Illumina Sequencing. https://www.agilent.com/cs/library/usermanuals/ public/G9931-90010.pdf). Sequencing was performed by the Australian Genome Research Facility (North Melbourne, VIC, Australia) on an Illumina Hiseq2500 sequencer. Library pools of 96 samples were sequenced on a HiSeq2500 Genome Analyzer using 100 bp paired-end reads (Illumina, San Diego, CA) with an average read depth target of >250X. The average sequencing depth yield for NTHL1 gene was 396.13 and 358.64 for the cases and controls, respectively, with 99.32% and 99.19% of the target bases sequenced ≥10 reads.

Germline DNA sequencing alignment and variant calling
Sequencing data from the BEACCON study were processed, aligned, and analyzed through a pipeline constructed using Seqliner v0.1a (http:// bioinformatics.petermac.org/seqliner) by Bioinformatic Core Facility of Peter MacCallum Cancer Centre as described in detail elsewhere 42

Variant filters and validation
Filters were applied to the sequencing data from BEACCON to remove sequencing artifacts and included a quality score over 30, a minimum of 10 reads, and at least 5 reads supporting the alternative alleles and a variant Co-segregation analysis in families with germline NTHL1 variants Pedigree information for 16 families with an NTHL1 LoF variant were obtained and Sanger sequencing was performed to determine the genotype for a total of 38 additional family members from these families. A full-likelihood method was used for co-segregation analysis as described previously 10 . A full pedigree likelihood was calculated as a means of assessing the linkage between the variant and disease based on all available genotype information from the family, including any unaffected individuals who have been tested.
Tumor microdissection and DNA extraction FFPE breast or ovarian tumor blocks were obtained from carriers of germline variants in the ViP study, and cancer cells were collected through needle microdissection under a dissecting microscope. Hematoxylin and eosin (H&E) stained slide was reviewed for each case to identify cancer cellenriched regions to guide the microdissection and achieve high tumor purity (aimed to achieve >70% tumor cells). Between 15 and 30 slides per block (depending on the size of the tumor and the proportion of cancer cells) of 8-10 μM thickness was needle microdissected, and the DNA of the collected cancer cells was extracted and purified using the QIAamp DNA FFPE Tissue Kit (Qiagen, Valencia, CA, USA). DNA was quantified using Qubit dsDNA high-sensitivity Assay kit (ThermoFisher Scientific, MA, USA). A multiplex polymerase chain reaction (PCR)-based quality-control method reported by van Beers et al. 46 , was used to identify DNA samples with sufficient quality for molecular analysis.

Whole-genome sequencing and targeted sequencing of tumor DNA
Whole-genome sequencing (30× coverage, 100 bp paired-end) of four tumor-normal pairs was performed on the BGI-Seq platform using 500 ng of DNA extracted from FFPE tumors or from blood. Targeted sequencing of tumor DNA was performed using an Agilent SureSelect XT Custom Panel that targeted all exons of a total of 259 genes (total targeted region of 1.337 Mb) including all the candidate genes in Halo3 design, and an additional 27 breast cancer driver genes 11 . Sequencing libraries were generated using 300 ng tumor DNA and were sequenced on an Illumina Next Seq 500 (75 bp paired-end reads).

Tumor DNA sequencing alignment and variant calling
Paired-end sequence reads from tumor DNA were aligned to the GRCh37 human reference genome using BWA-MEM 15 . PCR and optical duplicate reads were removed using Picard (http://broadinstitute.github.io/picard/) and then local realignment around indels and base quality score recalibration were performed using the Genome Analysis Tool Kit (GATK). SNP and indel variants were called using GATK Unified Genotyper 43 , Platypus 45 , and Varscan2 47 . A quality score over 150, a minimum of 10 reads, and an alternate allele frequency of more than 10% were used to rule out sequencing artifacts. In paired tumor-normal sequencing data, the somatic mutations were identified by removing the germline variants. In tumor sequencing data without matched germline data, somatic mutations were identified by applying stringent filters on the population frequency (minor allele frequency, MAF ≤ 0.0001 in ExAC and gnomAD 48 ), the frequency in the inhouse sequencing cohort (<20% of 166 breast tumors with the exception of variants in PIK3CA and TP53), and removing the variants identified in germline sequencing data.

Genome-wide copy number analysis
Genome-wide copy number data were generated from off-target sequencing reads using CopywriteR v2.10 with 50 kb bins 49 . NEXUS Copy Number TM (software v8.0 with build version 9169, BioDiscovery Inc.) was used for CN segmentation using the SNPFASST2 algorithm with default parameters. Copy number gains and losses were called with log 2 ratio thresholds of >0.2 and < −0.2, respectively.
Fraction of genome altered (FGA) and homologous recombination deficiency (HRD) scores Using the genome-wide copy number data, the FGA was calculated with adjustment according to chromosome sizes as described 50,51 . An HRD score, combined from three HRD score components: number of telomeric allelic imbalances 52 , HRD-loss of heterozygosity 53 , and large-scale state transitions 54 was calculated as described in detail elsewhere 13 .

Mutational signature analysis
Mutational signatures were performed and plotted using deconstruct-Sigs 55 package in R v3.3.2 56 based on the somatic mutations identified by the whole genome or targeted sequencing of the tumors of interest. For each sample, the somatic substitutions were categorized into six basic base substitutions (C > A, C > G, C > T, T > A, T > C, and T > G) and subcategorized into 96 subcategories according to the trinucleotide context. Mutational signatures were determined referring to the COSMIC mutational signature database.

NTHL1 promoter methylation analysis
The promoter regions of NTHL1 were examined for methylation status by Sanger sequencing the promoter region using bisulfite-treated DNA. Tumor DNA was treated by the EpiTect Fast DNA Bisulfite Kit (Qiagen, Valencia, CA, USA) according to the manufacturer's instructions, followed by PCR amplification and Sanger sequencing of the promoter regions. The primer pairs were designed using the default settings of the Bisulfite Primer Seeker tool (https://www.zymoresearch.eu/bisulfite-primer-seeker, Supplementary Table 7), and/or from previously published study 57 . CpGenome Human Methylated DNA Standard (Millipore, USA) served as a methylation positive control and Female Genomic Reference DNA (Promega, Madison, WI, USA) as a negative control to assess the methylation status of tumor samples.

NTHL1 protein expression analysis
Opal multiplex fluorescent immunohistochemistry method (PerkinElmer) was used to evaluate the NTHL1 protein expression in cancer and noncancer cells in FFPE breast cancer tissue. NTHL1 antibody (Abcam, Branford, Connecticut, USA; 1:1000 dilution) and AE1/AE3 antibody (multi-cytokeratin antibody, Leica Biosystems, Wetzlar, Germany; 1:1000 dilution) were used as primary antibodies. HRP-labeled anti-rabbit antibody (PerkinElmer Waltham, Massachusetts, USA; 1:1000 dilution) and anti-mouse antibody (PerkinElmer, Waltham, Massachusetts, USA; 1:1000 dilution) were used as secondary antibodies. Spectral DAPI (PerkinElmer, Waltham, Massachusetts, USA) was used to label the nuclear. The epithelial marker AE1/AE3 was used to identify cancer cells in the breast cancer tissue, and the average intensity of nuclear expression of NTHL1 was quantitated using PerkinElmer Vectra Automated Multispectral Imaging System.

Statistical analysis
OR and two-tailed p value by Fisher's exact test for the case and control study were calculated in R 56 . The conditional Maximum Likelihood Estimate was used for OR. Fisher's exact test (two-sided) was used for the comparisons of case-control data, and a p value of ≤0.05 was considered as statistically significant. PRS was calculated based on 70 low penetrance breast cancer predisposition SNPs following a multiplicative risk model (calculated by the sum of the minor alleles weighted by the perallele log OR) described by Mavaddat et al. 8 . Mann-Whitney U test was performed for FGA and HRD score comparisons between groups of tumors in GraphPad Prism version 7.00 (California, USA). Unpaired t test was used for comparing NTHL1 expression in different comparison groups. The meta-analysis for multi-center international cohorts was performed using meta R package 58 under a fixed-effect model for analysis within subgroups or analyzing all ten cohorts. N. Li et al.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY
The data generated and analyzed during this study are described in the following data record: https://doi.org/10.6084/m9.figshare.14208293 59 . The sequencing data have been deposited in the European Genotype-phenotype Archive under the following accession: https://identifiers.org/ega.dataset:EGAD00001007025 60 (study ID: EGAS00001005043). These data include NTHL1 sequencing using germline DNA, Alignment and variant calling, Whole-genome sequencing and targeted sequencing of tumor DNA, Genome-wide copy number analysis, Mutational signature analysis. In addition, the following data are not openly available to protect patient privacy: NTHL1 protein expression analysis, FCC patient database, cohorts summary, NTHL1 promoter methylation analysis. Data requests for these data should be made to the corresponding author.