Nasopharyngeal carcinoma MHC region deep sequencing identifies HLA and novel non-HLA TRIM31 and TRIM39 loci

Despite pronounced associations of major histocompatibility complex (MHC) regions with nasopharyngeal carcinoma (NPC), causal variants underlying NPC pathogenesis remain elusive. Our large-scale comprehensive MHC region deep sequencing study of 5689 Hong Kong Chinese identifies eight independent NPC-associated signals and provides mechanistic insight for disrupted transcription factor binding, altering target gene transcription. Two novel protective variants, rs2517664 (Trs2517664 = 4.6%, P = 6.38 × 10−21) and rs117495548 (Grs117495548 = 3.0%, P = 4.53 × 10−13), map near TRIM31 and TRIM39/TRIM39-RPP21; multiple independent protective signals map near HLA-B including a previously unreported variant, rs2523589 (P = 1.77 × 10−36). The rare HLA-B*07:05 allele (OR < 0.015, P = 5.83 × 10−21) is absent in NPC, but present in controls. The most prevalent haplotype lacks seven independent protective alleles (OR = 1.56) and the one with additional Asian-specific susceptibility rs9391681 allele (OR = 2.66) significantly increased NPC risk. Importantly, this study provides new evidence implicating two non-human leukocyte antigen (HLA) genes, E3 ubiquitin ligases, TRIM31 and TRIM39, impacting innate immune responses, with NPC risk reduction, independent of classical HLA class I/II alleles. Here the authors report a major histocompatibility complex (MHC) association analysis for nasopharyngeal carcinoma in Chinese individuals from Hong Kong, finding 8 independent associated loci associated with lower risk for developing nasopharyngeal carcinoma. Two non-human leukocyte antigen (HLA) genes are E3 ubiquitin ligases, TRIM31 and TRIM39, having a role in the innate immune response and implicating the importance of host Epstein-Barr virus interactions in this cancer.

N PC, an epithelial tumor arising in the nasopharynx, has distinct geographic and ethnic distributions 1 . Global cancer statistics by GLOBOCAN estimated 129,079 new cases and 72,987 deaths in 2018 2 . It is endemic in Southeast Asia, Southern China, particularly in Guangdong and Hong Kong, Taiwan, Northern Africa, and Alaska 1,3 . Locally, NPC poses considerable health burdens and is the most common cancer among males aged 20-44 3 . NPC etiology is multifactorial, including host genetics, Epstein-Barr virus (EBV) infection, and environmental factors 1,[4][5][6][7][8][9] . Utilizing various genomic approaches, we previously demonstrated the contribution of multiple loci including 6p21, TERT, and MST1R in NPC [5][6][7] . Numerous NPC genetic susceptibility association studies identified the 6p21.3 human leukocyte antigen (HLA) locus 4,7,8,[10][11][12][13][14][15][16][17][18][19][20] , supporting the hypothesis that HLA alleles or nearby non-HLA genes are major genetic determinants for NPC pathogenesis. However, most of these studies were limited by a small sample size 4,8,[11][12][13][14][15][16][17][18][19][20] . This gene-dense region is highly polymorphic; the extensive linkage disequilibrium of the HLA pattern poses challenges and requires large cohort studies to resolve the underlying genetic causal factors for NPC. Only one genome-wide association study (GWAS) reported three independent signals associated with NPC risk from the MHC region 7 ; this array-based approach may be inadequate for fine-mapping of all the independent signals within the MHC region. Hence, we used comprehensive deep sequencing of the entire 5 Mb MHC region in a large cohort of 5689 age-, genderand ethnicity-matched Hong Kong Chinese in a two-phase study to dissect in detail the genetic causal factors underlying NPC pathogenesis. To the best of our knowledge, the current study is the largest comprehensive fine-mapping analysis at the singlebase resolution to systematically resolve eight independent loci within the MHC region associated with NPC. Our data validated earlier well-known NPC GWAS findings on the HLA-A alleles and amino acids polymorphisms and, additionally, detected six independent loci at the HLA-A/B region and HLA-DQ alpha chain 1 (HLA-DQA1) 7,10 . This is the first report of the association of Tripartite motif-containing 31 (TRIM31) with NPC risk and survival. The SNP rs117495548, mapped to a region with two naturally occurring transcripts at Tripartite motif-containing 39 (TRIM39) and a read-through transcript TRIM39-Ribonuclease P/ MRP 21 kDa subunit (RPP21), associated with reduced NPC risk. This study now provides important novel insights of putative functional SNPs disrupting transcription factor binding, altering target gene transcription for NPC pathogenesis, highlighting the possible involvement of E3 ubiquitin ligases, TRIM31, TRIM39, and TRIM39-RPP21 fusion protein impacting the host innate immunity antiviral defense and for this EBV-associated cancer.
Our findings on the haplotype information and genetic heterogeneity of MHC NPC susceptibility loci across East Asian and other populations partially explain the fact that despite ubiquitous EBV infection, worldwide NPC incidence is low, while it is especially high among East Asians and Southern Chinese.

Results
Association of common variants, HLA class I alleles, and amino acids with NPC. Owing to the genetic complexity in the MHC region, earlier studies recognized how challenging it was to identify all the independent loci. Hence, we performed finemapping by a target-capture sequencing approach with 5689 genetic-and geographically matched Hong Kong Chinese for association analyses after stringent quality control steps. Demographic information of study populations used is detailed in Supplementary Table 1. Our study workflow and analysis pipeline are detailed in Supplementary Fig. 1. The discovery and validation phases included 3047 participants (1431 NPC cases and 1616 controls) and 2642 participants (1321 NPC cases and 1321 controls). The MHC region was sequenced to a mean coverage of 150×. Logistic regression association identified 1971 significant common SNPs/Indels in discovery, validation, and combined phases from 37,931 common variants (P < 5 × 10 −8 ). Forward logistic regression association analysis of the SNPs/Indels identified five independent signals within the MHC class I region that reached genome-wide significance (Supplementary Table 2). The most significant and the only susceptibility-associated signal was rs9391681 (OR = 2.11, P = 1.50 × 10 −46 ), located in the intergenic region between RPP21 and HLA-E, followed by four protective signals near HLA-A, HLA-B, TRIM31, and HCG27 loci. The novel independent signal near TRIM31 represented by rs2517664 (OR = 0.29, P = 6.38 × 10 −21 ) was interesting, as it remained independent from HLA class I alleles and amino acid variants in the subsequent conditional logistic analysis.
Multiple novel independent signals in the MHC region confer NPC risk reduction. Overall, 2967 significant common variants including 2899 SNPs/Indels, 62 amino acid polymorphisms, and 6 HLA alleles are associated with NPC after the Bonferroni correction (P < 1.33 × 10 −6 ) (Supplementary Data 1). Forward conditional logistic regression analysis was performed to resolve independent representative signals. This fine-mapping approach now clearly demonstrates eight independent signals from the MHC region associated with NPC risk (Table 1). After LD pruning, forward selection stepwise regression analysis with 226 significant SNPs (r 2 > 0.8) identified similar findings of eight independent signals associated with NPC (Supplementary  Table 8). Our study is the first to report five independent signals near HLA-A/B. Figure 1 illustrates the Manhattan plots of each stepwise analysis. The strongest susceptibility signal, rs9391681 (intergenic RPP21*HLA-E) (P = 1.50 × 10 −46 , OR = 2.11), may indirectly represent the causal signals driven by the HLA-A*02:07 allele, which is in tight linkage disequilibrium with rs9391681 (r 2 = 0.85). The effect of rs9391681 is markedly reduced (P = 7.59 × 10 −3 ), after adjusting with the covariant of HLA-A aa-C99 segregating with the HLA-A*02:07 allele (r 2 = 1) (Supplementary  Tables 6 and 7). Figure 1a shows ten additional SNPs in tight linkage disequilibrium with rs9391681 (r 2 ≥ 0.8). The proxy variant analysis in Fig. 2a further indicates the marked reduction of their effect to a very weak association (P = 3.63 × 10 −2 -7.41 × 10 −5 , Supplementary Data 1). Although it is biologically plausible to suggest HLA-A aa-C99 is the causal factor marked by rs9391681, we cannot rule out the possibilities of these ten tight linkage disequilibrium variants exerting their effects on NPC. Functional genomics data based on the transcription factor chromatin immunoprecipitation and sequencing (ChIP-Seq) Clusters from the ENCODE database prioritized four candidate functional variants (rs143982339 near ABCF1, rs9348841 near PPP1R10, rs9380181 intergenic between ABCF1*PPP1R10, and rs9380182 near FLOT1) (Supplementary Data 1). Noteworthy, POLR2A was the common transcription factor involved in all four of these. Recent GWAS consistently reported the HLA-A aa-Q62 marked by HLA-A*11:01, rather than HLA-A*02:07, is the causal factor 7,10,11 . We now confirm HLA-A aa-Q62 (P HLA-A-aa-Q62 = 6.27 × 10 −44 , OR = 0.56) as the second independent protective signal (Table 1), which is in tight linkage disequilibrium (r 2 = 0.86) with our previously reported GWAS SNP rs2860580~3 kb apart from HLA-A (Figs. 1b, 2b and Supplementary Table 6) 7 . The effect of rs2860580 disappears (unadjusted P = 1.64 × 10 −37 to adjusted P = 0.14) after adjustment with HLA-A aa-Q62 (Supplementary Table 7).
Among the three independent signals related to HLA-B locus, the most interesting is the novel independent protective signal, rs2523589 (chr6:31,359,557, hg38) located~2.4 kb upstream of HLA-B (P = 1.77 × 10 −36 , OR = 0.59), as it provides mechanistic insights for NPC genetic epidemiology. The effect of rs2523589 remains significant after adjusting with all significant variants (Figs. 1c and 2c). The effect of rs2523589 remains independent of rs2894207, previous GWAS independent signal at HLA-B/C loci, 64 kb from rs2523589 (r 2 = 0.19) (adjusted P = 4.47 × 10 −19 ) (Supplementary Table 7). However, its effect diminishes to a very weak association (P = 1.58 × 10 −3 , Supplementary Data 1) after adjusting with rs9405084, which is~1.1 kb apart from rs2523589 (r 2 = 0.91). Functional annotation suggests rs9405084 may influence epigenetic regulation as a transcriptional enhancer with high H3K27Ac histone modification and DNase I enzyme hypersensitivity. Transcription factor (ChIP-Seq) Clusters from the ENCODE database suggest rs9405084 may regulate transcription of target genes through disruption of transcription factor binding. The eQTL evidence according to GTEx suggests the possible impact of rs2523589 on expression levels of multiple candidate genes including MIR6891, HCG27, HLA-B, HLA-C, and MICA, etc. (Supplementary Table 9).
When we conditioned on the rs9391681, HLA-A aa-Q62, rs2523589, and HLA-B*55:02, we further identified the novel independent intronic signal, rs2517664, mapped to a non-HLA gene, TRIM31 (P = 6.38 × 10 −21 , OR = 0.29, Table 1). The association of rs2517664 is independent of HLA-A and HLA-B signals since the adjusted P-value of this index variant remained significant after conditional analysis adjustment with all the significant variants (Figs. 1e, 2e and Supplementary Table 7). Its effect diminishes to a very weak association with two tightly linked putative functional SNPs, rs2844796 (r 2 = 0.62) and rs1116222 (r 2 = 0.58) (P = 5.01 × 10 −6 or 3.16 × 10 −5 , Supplementary Data 1). Hence, this may be driven by two nearby functional epigenetically regulated SNPs, rs1116222 (~2.4 kb apart from rs2517664 located within a CpG island) and rs2844796 (0.75 kb apart from rs2517664 mapping to a CTCF-  Conditioned on the first six independent signals, we identified the seventh independent signal at intronic SNP rs9461780 near HLA-DQA1 at HLA class II region (P = 4.38 × 10 −17 ; OR = 0.52) ( Table 1 and Fig. 1g). Our current study did not replicate the  Table 10) 7 . The rs24821666 is~6.3 kb apart from rs9461780, but they are not tightly linked (r 2 = 0.013). The effect of rs9461780 remains strong after adjusted with other significant variants (Fig. 2g).
The last independent signal identified after conditioning on the top seven independent signals, rs117495548, is novel and maps near TRIM39 or TRIM39-RPP21 (P = 4.53 × 10 −13 , OR = 0.31, Table 1, Fig. 1h). The effect of rs117495548 remains strong after adjusted with other top significant variants (Fig. 2h). However, there is no significant association after controlling with rs148603250 near HCG17/HCG18, in tight linkage disequilibrium (r 2 = 0.99) with rs117495548. Transcription factor (ChIP-Seq) Clusters from the ENCODE database further suggest rs148603250 may be involved in transcription factor binding of RNA Binding Fox-1 Homolog 2 (RBFOX2). RBFOX2 is a key regulator of alternative exon splicing. Interestingly, this locus is annotated with two overlapping transcripts. The first is TRIM39, which encodes an E3 ubiquitin ligase involved in apoptosis or cell cycle regulation 21,22 . TRIM39-RPP21 is a read-through transcript encoded by a naturally occurring fusion product of neighboring TRIM39 and RPP21 that regulates type I interferon response and antiviral defense 23 . Independent validation of the genetic susceptibility role of TRIM39 or TRIM39-RPP21 loci was performed by the imputation of rs117495548 in our previous NPC GWAS with another Southern Chinese cohort of 3040 controls and 1583 NPC from Guangdong 7 . Logistic regression analysis indicates that rs117495548 significantly associates with NPC (MAF Control = 0.022, MAF Case = 0.014, P = 6.83 × 10 −3 , OR 95% CI = 0.62 0.44-0.88).
Gene-based association test. The gene-based association test by SKAT analysis was performed with all variants including 5% (5506) exonic variants and 95% intronic or intergenic variants. The parameters for weighting more on rare variants by Rvtests are detailed in the methods section. The top 15 candidate genes associated with NPC are summarized in Supplementary Table 11. The role of TRIM31 (P = 6.75 × 10 −29 ), HLA-B (P = 1.60 × 10 −27 ), and TRIM39 (P = 1.47 × 10 −16 ) in genetic susceptibility was further supported by gene-level association tests for NPC risk.
Haplotype analysis defined by the eight NPC independent signals. Haplotype association defined by the eight independent signals including six SNPs, HLA-A aa-Q62 and HLA-B*55:02, identifies 24/111 haplotypes significantly (OR2, normalized to H1 haplotype) associated with NPC in Hong Kong Chinese (Table 2). Noteworthy, Hong Kong Chinese individuals carrying the most common H1 haplotype, which lacks all seven protective alleles (0/ 7), is associated significantly with a higher risk of NPC (OR1 = 1.56, P = 2.90 × 10 −27 ). An additional susceptibility haplotype, H2 with the risk allele of rs9391681 (OR1 = 2.66, P = 1.76 × 10 −53 ), also associates with higher NPC risk. The haplotype association (OR2) identified 22 protective haplotypes that confer significant protection for NPC to variable extents depending upon the combinations of protective alleles ( Table 2). Ten strong effect haplotypes (H9-H11, H13-H14, H17-H18, H20, and H22-H23) carrying from one to four protective alleles with at least one allele from HLA-B*55:02 or rs2517664 or rs117495548 conferring 5.6-to 16.7-fold reduction of NPC risk relative to H1. The majority of the remaining twelve moderate effect haplotypes carrying one to three MHC class I/II protective alleles of HLA-A aa-Q62 or HLA-B loci (rs9265975/rs2523589) or HLA-DQA1 locus (rs9461780) confer 1.6-to 3.5-fold reduction of NPC risk relative to H1.
Genetic heterogeneity of MHC NPC susceptibility loci across East Asian and other populations. The eight independent NPC variant frequency distributions were compared across different populations using public database gnomAD 24 , Allele frequency net database (http://www.allelefrequencies.net), and a large general Chinese population from a psoriasis study utilizing MHCtarget sequencing (Fig. 3) 25 . Importantly, the most significant NPC susceptibility C-allele of rs9391681 is specific to and only observed in East Asian populations. Hong Kong NPC cases have the highest risk allele frequency of 23%, compared to MAF of 12% of Hong Kong healthy, 11.4% of East Asians, dropping to 6.9% of Northern Chinese and further decreasing to 0% of other non-Asian populations including African, American and European. The protective T-allele of rs2517664 of TRIM31 locus was only observed in 1.4% NPC cases compared to the 15.1% − 36.2% in populations with low-NPC incidence. The T-allele is~10 to 26fold higher among Northern Chinese, African, American, and European populations with low-NPC incidence compared to that of Hong Kong NPC cases. The other five protective alleles of rs28749142 tagging SNP of HLA-A aa-Q62 , rs2523589 upstream of HLA-B, rs9265975 downstream of HLA-B, rs9461780 near TRIM39, and rs117495548 near HLA-DQA1 are also more frequent in other low-NPC incidence populations. The Hong Kong Southern Chinese controls further demonstrate a lower frequency of protective alleles of rs28749142, rs2523589, rs2517664, rs9265975, and rs117495548 compared to the general Chinese population 25 ; this data set has a large proportion of Northern Chinese, for whom NPC incidence is much lower than in Hong Kong. The differential allelic data between Hong Kong Chinese and East Asians compared to non-Asian populations are concordant with the endemically high NPC incidence in both Hong Kong and East Asia.
Clinical relevance of TRIM31 in NPC. NPC is closely associated with EBV infection and TRIM31 deficiency attenuates innate antiviral responses to infections involves MAVS signaling 26 . Hence, we compared and detected induction of TRIM31 expression in EBV-infected normal immortalized nasopharyngeal epithelial cell cline NPC361 before and after EBV infection (Fig. 4a). Our previous methylome analysis reported TRIM31 association with de novo methylation and reduced expression levels in NPC 27 . High TRIM31 expression associates with poor prognosis and drug resistance in various cancers including colorectal, pancreatic, and liver cancers [28][29][30][31] . Immunohistochemical staining of 133 NPC primary tissues showed significantly higher TRIM31 expression clinically associated with poorer survival by Kaplan-Meier analysis in a subset with positive TRIM31 expression in inflammatory cells (Fig. 4b, P = 0 (Fig. 4c). Further mechanistic studies are needed for the functional role of TRIM31 in NPC pathogenesis.

Discussion
Using MHC region two-phase deep sequencing of the largest cohort reported to-date, we identified eight independent common loci associated with NPC as compared to only three previously reported independent loci by NPC GWAS studies (Supplementary Table 10) 7,10,11,14,32 . The majority of these independent signals (7/8) at the MHC region confer protection. The protective signals for NPC are not generally observed for other cancers. Further MHC sequencing investigation in other cancer types in the Hong Kong population, together with ESCC cases from Henan in Mainland China, did not detect similar results (Supplementary Table 14). Similar target-capture studies in other cancers associated with the MHC region, such as lung cancer and the viral-associated cancers, liver and cervical cancers, may help to address whether these seven protective signals identified or other SNPs in strong LD with these are specific for NPC. In summary, this current study identified two independent opposing signals (susceptibility rs9391681 and protective HLA-A aa-Q62 ) related to the HLA-A locus, three signals related to HLA-B locus, one near HLA-DQA1, and two novel non-HLA loci, TRIM31 and TRIM39 or TRIM39-RPP21. Three of the five independent signals at HLA-A/B are previously unreported SNPs rs9391681, rs2523589, and rs9265975 in tight linkage with potentially functional SNPs residing on regulatory elements to possibly confer higher risk or protection for NPC by modulating gene expression.
Our data now provide novel clues linking reduction of NPC risk with non-HLA genes near TRIM31 and TRIM39 or TRIM39-RPP21 loci independent of the HLA Class I and II alleles and HLA-A/B/C amino acid variants. This is of importance since the TRIM proteins are a big family of 80 distinct ubiquitin E3 ligases involved in many cellular processes that have central roles in antiviral host defenses 33 . TRIM proteins regulate innate immune sensing, direct restriction of viral infection, and autophagymediated antiviral immune responses in mammalian cells 33 . The read-through transcript TRIM39-RPP21 is known to mediate interferon response, a major host immune cytokine response against viral infection 34 . The intimate interaction between the host immune system and EBV infection at the nasopharyngeal

2.44e−04
Twenty-four out of 111 haplotypes were significantly associated with NPC passing Bonferroni correction (p < 4.50 × 10 −4 ). H1 (bolded) indicates individuals predisposed with higher NPC risk when carrying this susceptibility haplotype lacking all seven protective alleles.
OR1 is the raw odds ratio of each haplotype. OR2 (bolded) is the normalized odds ratio relative to H1, which is the most frequent haplotype in Hong Kong Chinese. epithelium appears to be essential for NPC development. Several key EBV proteins such as BDLF3, EBNA-1, EBNA-3C, LMP-1, affect the ubiquitin-proteasomal system with E3 ligases impacting host immune evasion, apoptosis, and cell cycle deregulation through degradation of substrates such as MHC I, MHC II, p53, Rb, p21, and NF-κB pathways 35 . Our findings now provide evidence and unique insight on the genetic linkage of host immunity and the development of viral latency with its consequential impact on NPC development risk. This is the first report linking TRIM31 to reduced NPC risk and poor survival. Recent studies suggest that TRIM31 regulates MAVS aggregation upon viral infection to activate the innate immune response as the first-line defense against invading pathogens 26 . Mechanistically, TRIM31 regulates various molecular pathways including NLRP3, MAVS, NFκB, and p53 signaling, through its ubiquitin ligase activity to induce lysine-linked polyubiquitination of protein substrates affecting key cancer-related biological processes and regulating inflammation, antiviral infection, innate immunity, drug resistance, and metastasis 26,28,29,36 . These findings support the clinical relevance of TRIM31, as a potential independent poor prognostic marker in NPC due to altered NFkB signaling critical for NPC development 37 . Its role in promoting tumor aggressiveness was previously reported in various solid tumors including pancreatic, colorectal, and liver cancers 28,29,31 . We now show up-regulation of TRIM31 in a near-normal immortalized non-tumorigenic nasopharyngeal epithelial cell line infected with EBV. We speculate that TRIM31 may be involved in viral stimulation of innate immune responses, which is an early event preceding cancer initiation to maintain tissue homeostasis. These seemingly contradictory observations may be related to the different timing and versatile roles that TRIM31 has in cancer initiation, tumor progression, and the development of drug resistance 38 . Targeting TRIM31 signaling in pancreatic cancer was suggested to have therapeutic potential, as overexpression of TRIM31 activated NFkB resulting in gemcitabine resistance 28 . Our findings suggest the potential therapeutic option for targeting TRIM31 signaling. TRIM39 is a RING domain-containing E3 ubiquitin ligase that regulates the p53, APC/C, and NFκB pathways controlling apoptosis, cell proliferation, and inflammatory signaling by ubiquitylation 21,22,39,40 . In NPC, TRIM39 functionally affects cell proliferation, epithelial-mesenchymal transition, and metastasis and clinically correlates significantly with tumor size, stage, and metastasis 41 . The TRIM39 or TRIM39-RPP21 locus is associated with Behcet's disease, a chronic inflammatory autoimmune disease 42 . GWAS reports indicate a linkage to cutaneous lupus erythematosus 23 and behavioral risk tolerance 43 and identified rs2517664 near TRIM31 as one of the independent loci associated with general cognitive function and immune regulatory loci for mouth ulcers 44,45 . Further in-depth functional characterization of the genetic causal factors linked with TRIM31, TRIM39, and TRIM39-RPP21 loci in NPC pathogenesis is warranted, but this is beyond the scope of the present study. Our findings provide new insights for NPC pathogenesis regarding TRIM-mediated anti-EBV host innate immune defense 35,46 . Previous NPC GWAS suggest the only independent diseasedriven HLA-A signal coming from rs2860580 in tight linkage disequilibrium with the HLA-A*11:01 7,8,10,11 . The protective role of HLA-A*11:01 allele is supported by the efficient presentation of EBNA4 antigens to elicit an immune response by the cytotoxic T-lymphocytes upon EBV infection 47 . Our fine-mapping study comprehensively demonstrates two disease-driven independent HLA-A signals with opposing effects reaching genome-wide   significance, the Asian-specific rs9391681 and the HLA-A aa-Q62 , which is carried not only by HLA-A*11:01, but 15 other HLA-A alleles (Supplementary Table 6). Our proxy variant analysis and functional genomics annotation prioritize four putative functional variants, rs143982339, rs9380181, rs9348841, and rs9380182, also tightly linked with rs9391681, suggesting these candidate functional variants may affect the binding of multiple transcription factors; further investigation of their regulatory mechanisms is of interest.
The highly diversified HLA-B locus in the Han Chinese population was previously suggested to have critical contribution in environmental adaptation 25 Fig. 4). Near the HLA-B regions, there are multiple independent protective signals that reached genomewide significance; two SNPs (rs2523589 and rs9265975) with moderate effect are located upstream and downstream of HLA-B and there is a strong protective effect of HLA-B*55:02 and HLA-B*07:05 alleles (Table 1 and Supplementary Table 3). GWAS identified rs2523589 as one of the immune regulatory loci associated with mouth ulcers and medication use in the UKbiobank 44,48 . Our data provide novel mechanistic insights for the involvement of putative functional regulatory variants affecting transcription factor binding and transcription of other non-HLA genes, additional to the regulation of its closest gene HLA-B expression with NPC risk. Further functional studies are required to address the causal factors in the same proxies associated with rs2523589 and rs9265975. The HLA-B amino acids associated with NPC across studies are inconsistent due to the disease signals being in tight linkage disequilibrium with HLA-B alleles. The great diversity of HLA-B alleles may provide an explanation for the discrepancy of different spectra of HLA-B amino acid polymorphisms between our Hong Kong and the Guangdong and Guangxi study 10 . Our findings further suggest multiple common and rare independent variants near HLA-B, rather than only a single independent signal rs2894207 reported by earlier GWAS 7 . Our study provides additional evidence and suggests an epigenetic regulatory function of the rare intronic SNPs rs77803816 of HCP5 and rs2596540 of MICA (Supplementary Table 13). We cannot distinguish the protective effects between rs77803816 and HLA-B*07:05 (D′ = 0.46). Further studies are required to elucidate if the novel association of HLA-B*07:05 allele is attributed to HLA-B aa-R156 or other rare functional variants such as rs2596540 at MICA and rs77803816 in HCP5. HCP5 is a regulatory lncRNA 100 kb centromeric from HLA-B involved in adaptive and innate immune responses, multiple autoimmune and viral-associated diseases such as HIV-1-associated AIDS and HCV-associated hepatocellular carcinoma [49][50][51][52][53] . MICA is a stress-induced ligand for NKG2D (Natural-killer group 2, member D) receptor recognized by NK cells and T cells.
SKAT gene association findings reinforce our observations on novel loci near TRIM31 and TRIM39 and the HLA-B identified from single SNP associations (Table 1 and Supplementary  Table 11). The gene-level evidence (MICA P = 2.44 × 10 −26 , Supplementary Table 13) supports a possible role of the rare putative functional SNP, rs2596540 located in the 5′ UTR of MICA. Earlier NPC MICA association studies were limited by a small sample size; the reported signal is not an independent signal in our current study 13,15,54 .
Our current study and earlier GWAS consistently detected only one independent signal near the HLA-DQA1 of the MHC class II region and suggested dysregulated antigen presentation. However, the detected signal at rs9461780 near HLA-DQA1 was not in the same proxy with rs28421666 7 , while others reported HLA-DQB1 and HLA-DRB1 loci 4,13,17,55,56 . The discrepancy may be partially due to studies of different populations. The underlying causal factor in the MHC class II region for NPC remains unclear.
NPC is rare in most populations worldwide. The absence of the rs9391681 risk allele and the greater abundance of the multiple protective alleles of rs2523589, rs9265975, rs2517644, and rs9260033 tagging HLA-A aa-Q62 in other ethnic populations including African, Caucasian, Ashkenazi Jews and a general Chinese cohort in contrast to NPC endemic regions is consistent with host genetic factors being critical determinants in NPC. The most common susceptibility H1 haplotype in Hong Kong Chinese suggests individuals lacking all seven protective alleles have a higher NPC risk (OR1 = 1.56). NPC risk (OR2 = 1.77) further increased for individuals carrying the H2 haplotype of the susceptibility rs9391681 C-allele. The control population with lower NPC risk carry moderate and strong protective haplotypes with different combinations of protective alleles including HLA-A aa-Q62 , SNPs near HLA-B, HLA-DQA1, B*55:02, and novel loci at TRIM31 and TRIM39 in the non-classical HLA genes ( Table 2). Our data implicate host genetic factors act as important components interacting with environmental and EBV exposure, affecting the efficiency of EBV antigen presentation to the host immune system and inhibition of innate immunity. The haplotype information and genetic heterogeneity of MHC NPC susceptibility loci across East Asian and other populations provide clues to partially explain the fact that despite ubiquitous EBV infection, worldwide NPC incidence is low, while it is especially high amongst East Asians and Southern Chinese. Further studies in NPC endemic regions from different ethnicities are warranted to validate our observation in Southern Chinese NPC patients.
In conclusion, this study of the MHC region identified seven independent signals within the MHC class I region and one within the MHC class II region for NPC risk. Findings enhance our understanding of the genetic basis for NPC predisposition in individuals carrying the only two susceptibility haplotypes lacking all the protective alleles involved in both HLA and non-HLA genes. The timing for EBV exposure and infection, host responses to the ubiquitous tumor virus infections, and the development of viral latency are complex contributors to the development of NPC in high-risk populations. The detection of an association of novel TRIM31, TRIM39, or TRIM39-RPP21 signals with NPC highlight the expected importance of host innate immune responses impacting NPC genetic susceptibility.

Methods
Study participants of Hong Kong cohorts. Two independent cohorts of 5698 unrelated individuals, including 3056 participants in the discovery phase (1438 NPC cases and 1618 matched healthy controls) and 2642 participants for the validation phase (1321 NPC cases and 1321 matched healthy controls), were utilized in this study ( Supplementary Fig. 1a). The demographic characteristics of patients and controls are summarized in Supplementary Table 1. The overall design of this study is shown in Supplementary Fig. 1. The NPC cases and controls were collected by the Tissue Bank established under the Area of Excellence (AoE) program for the period of specimen collection from 2010 to 2017 and then randomly assigned as the discovery phase and validation phase of this study.
All patients and healthy individuals were matched Han Chinese from Hong Kong (Supplementary Fig. 6). NPC patients were recruited from five Hong Kong public hospitals including Queen Mary Hospital (QMH), Queen Elizabeth Hospital Study participants of Guangdong cohort. Regional genotype data for 1583 NPC cases and 3040 controls were retrieved from two NPC GWAS projects reported previously 7,57 . Briefly, all cases were patients diagnosed with primary NPC and recruited at Sun Yat-sen University Cancer Center (SYSUCC), Guangdong, China, from October 2005 to May 2012. The diagnosis was histopathologically confirmed by two pathologists at SYSUCC according to the World Health Organization (WHO) classification. Healthy controls were recruited from physical examination centers of several large comprehensive hospitals in local communities in Guangdong. All the participants were self-reported as Southern Han Chinese ancestry. All the controls were self-reported with no history of malignancy at the time of enrollment. All the study subjects signed informed consent and studies were approved by institutional ethical committees of Sun Yat-sen University.
MHC-target sequencing, variant calling, and HLA typing. The libraries for MHC-target sequencing were prepared and sequenced in the Illumina HiSeq platform according to manufacturer instructions 5 . Briefly, the pre-capture sequencing libraries were prepared with 100-200 ng genomic DNA after fragmentation to 300-400 bp by Covaris according to the KAPA HyperPrep Kits 58 and hybridized to the SeqCap EZ Choice Enrichment Kit (Roche). Each capture included 24 indexed libraries of blood DNAs. The capture kits were designed using NimbleDesign (https://design.nimblegen.com/nimbledesign, Roche Sequencing Solutions, Inc) to capture the whole MHC region (chr6: 28,510,120-33,480,577, hg38) and other non-MHC regions. The quality of libraries was evaluated by bioanalyzer analysis (Agilent Technologies). The quantity of libraries was assessed by qubit measurement and Q-PCR quantification. The post-capture libraries were sequenced using 150 bp paired-end reads on the Illumina HiSeq platform. Sequencing reads were aligned to the human reference genome hg38 (hg38 + Alt + decay) using Bwa.kit (https://github.com/lh3/bwa/tree/master/bwakit) 59 (Supplementary Fig. S1b). The PCR duplicates were removed from the aligned file (bam) using SAMtools (version 1.3.1) 60 . Picard (version 2.2.1) was used to calculate the HsMetrics (including the coverage of the targeted region, off-target ratio) from bam files to assess the sample quality. The germline variants (SNP/InDels) were called using mpileup2snp from VarScan (version 2.3.8) 61 at base resolution in the target region (pileup2cns) with parameters (minimum coverage: 8; minimum supported alternative reads: 2; minimum alternative allele frequency: 0.2; minimum average quality: 15; P-threshold:0.01; minimum mapping quality 15). The function of the SNPs/InDels were annotated using ANNOVAR 62 and functional trackers from UCSC Genome Browser 63 .
At the sample level, we filtered out samples with insufficient sequencing coverage (<15×), a high off-target ratio, and a high missing ratio (>10%). At the variant level, we filtered variants with a high missing ratio (>10%). The variants that violated the Hardy-Weinberg equilibrium in the control samples were also filtered out (Hardy-Weinberg equilibrium P < 1e−06). The InDels with length larger than four base pairs or multiple InDels in the same location are filtered out. The identity-by-descent test was performed using PLINK 64 for variants in the MHC region to remove some potentially related samples (PI_HAT > 0.3). The principal components analysis and multi-dimensional scaling were used to check whether there were any population stratification between cases and controls. The alignments of top variants and significantly associated InDels were manually checked in the Integrative Genomics Viewer 65 for some randomly selected samples.
The classic HLA class I alleles were typed at the 4-digit resolution from cleaned FASTQ files using OptiType 66 . The protein sequences of each allele were generated according to the IMGT/HLA database (version 3.31.0) 67 based on the typed HLA alleles. The consensus sequence of each HLA gene (HLA-A, HLA-B, HLA-C) was generated from multiple sequence alignment using Clustal Omega 68 .
The final variant sets (SNPs/InDels/HLA class I alleles/ HLA class I amino acid polymorphism) were imputed and phased using Beagle (version5.0) 69 . Linkage disequilibrium statistics (D′ and r 2 ) were calculated by PLINK.
Association analysis. The common variants MAF ≥ 1% were analyzed by logistic regression model adjusted for sex and age to test the association between NPC and controls using PLINK 64 . The forward stepwise conditional logistic regression analysis was used to identify the independent signals. Pairwise conditional logistic regression was performed between all significantly associated variants to check their independence. For the rare variants (MAF < 1%), the chi-square test or Fishers' exact test were used to test the association between NPC and controls. The Fishers' exact test was used when any of the expected values of the 2 × 2 contingency table was below 5. The gene-level burden test was performed with all the variants using the SNP-set kernel association test (SKAT) from Rvtests 70 based on the NCBI RefSeq annotation and a set of parameters with a higher weight on rare variants (nPerm = 10,000, alpha = 0.001, beta1 = 1, beta2 = 25).
A classical genome-wide significance cut-off (5 × 10 −8 ) was used as the threshold for significance for both common and rare variants. HLA alleles passing the multiple test adjustment with Bonferroni correction (P < 2.63 × 10 −4 ) were considered significant.
Analysis of public MHC-target sequencing data. A large data set sequencing the MHC region in more than 20,000 individual Chinese 25 was analyzed using the same pipeline for the SNPs.
Assessment of the accuracy of typing. The SNPs and HLA alleles were validated using Sanger sequencing. For SNPs, we randomly selected samples from candidate variants for SNP validations, as well as loci validation. The overall validation rate was 98.9% (86/87). For HLA alleles, primers covering exons 2 and 3 of the HLA-A gene were designed. The Sanger sequencing results were compared with the typed HLA-A alleles. The overall validation rate was 94.4% (17/18). The sequences of primers are listed in Supplementary Table 15. For HLA alleles, in addition to the direct sequencing, we further genotyped the HLA alleles using another HLA typing software Kourami 73 . The concordance rate between these two software was 98.5% (97.5% for HLA-A, 99.6% for HLA-B, and 98.5% for HLA-C).
IHC analysis was performed to assess the expression of TRIM31 in the NPC TMA with anti-TRIM31 antibody (Proteintech, Cat#215711-AP) using the DAKO LSAB2 System-HRP (horseradish peroxidase) (DAKO) and the DAB substrate kit (ThermoFisher) according to the manufacturers' instructions 37 . Antigen retrieval was performed by boiling slides in a microwave oven for 10-14 min in 10 mM citrate buffer (pH 6.0). Briefly, endogenous peroxidase block in the deparaffinized tissue sections was performed by incubation of 3% H 2 O 2 for 30 minutes. Tissue sections were blocked in 10% goat serum diluted in PBS, followed by overnight incubation with the anti-TRIM31 antibody at 4°C and a final concentration of 2 µg/ml. The slide was further incubated with biotinylated link antibody, streptavidin-HRP, substrate chromogen solution, DAB (3,3′-diaminobenzidine) and counterstained with hematoxylin for analysis using a bright field microscope. All staining was scored by a trained pathologist (LHT) and a score between 1 and 4 was given to tumor cells and stromal inflammatory cells (1 = negative staining, 2 = weakly positive, 3 = moderately positive, and 4 = strongly positive). Survival was determined from the date of diagnosis to the date of the last disease evaluation or until death/recurrence. Kaplan-Meier methods were used to analyze survival data. Statistical analyses were performed by SPSS version 26 and a P < 0.05 was considered statistically significant.
Cloning and enhancer assay of rs77803816. Genomic sequences spanning ±250bp of SNP rs77803816 at HCP5 were amplified by the primer set (F: TTCCACCTTTCCCAACCTGT; R: ACACCAAAGCAGGACCATTC) using corresponding patient samples as templates and were cloned into the Firefly luciferasebased lentiviral enhancer assay vector pLS-mP-Luc (Addgene #106253) at the SbfI site before the minimal promoter. The vectors were then expressed in an NPC43 cell line 74 pre-labeled with ubiquitously-expressed Renilla luciferase. Luciferase activity was measured by adding D-luciferin potassium salt (BioVision, Inc., Milpitas, CA) and Enduren (Promega Corporation, Fitchburg, WI) for Firefly and Renilla luciferases, respectively, and quantified using an IVIS Spectrum In Vivo Imaging System (PerkinElmer, Waltham, MA). The transcriptional activity index was calculated as Firefly luciferase signal intensity normalized to Renilla luciferase signal intensity, followed by normalizing to the fragments of the wildtype (WT) G-allele in the same orientation of HCP5 (Forward).
Western blot analysis of TRIM31 protein expression in NP361 cells. Protein lysates of NP361 cells 75 with and without EBV infection were run and separated by 10% SDS-PAGE. Proteins were transferred to PVDF membranes, which were incubated at 4°C overnight with the primary antibodies against TRIM31 (1:1000) or p84 (1:1000) after blocking with 5% non-fat milk in TBST. The membranes were then incubated for 1 h with 1:5000 anti-mouse Alexa Fluor 800 or anti-rabbit Alexa Fluro 680. TRIM31 and p84 proteins were detected and visualized by GE Typhoon 5 Molecular Imager.
Statistics and reproducibility. Statistical tests for TRIM31 TMA survival analysis included Kaplan-Meier survival curve, log-rank test, and multivariate Cox regression were generally considered significant with P < 0.05 (SPSS version 26). Measurements in enhancer assay of rs77803816 were taken from three independent biological replicates and Student's two-tailed t-test analyzed using Prism, P < 0.05 was considered significant (GraphPad Software, San Diego, CA). Western blot images were representative from two biological replicates.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.