Fine-mapping of genetic loci driving spontaneous clearance of hepatitis C virus infection

Approximately three quarters of acute hepatitis C (HCV) infections evolve to a chronic state, while one quarter are spontaneously cleared. Genetic predispositions strongly contribute to the development of chronicity. We have conducted a genome-wide association study to identify genomic variants underlying HCV spontaneous clearance using ImmunoChip in European and African ancestries. We confirmed two previously reported significant associations, in the IL28B/IFNL4 and the major histocompatibility complex (MHC) regions, with spontaneous clearance in the European population. We further fine-mapped the association in the MHC to a region of about 50 kilo base pairs, down from 1 mega base pairs in the previous study. Additional analyses suggested that the association in MHC is stronger in samples from North America than those from Europe.

coverage of the MHC region can be gained using an imputation algorithm that takes into account the long-range linkage disequilibrium in the MHC, and a large customized reference panel with improved coverage of the MHC region 4 . Thirdly, fine-mapping algorithms 5,6 , designed to resolve known genetic associations to smaller sets of variants, can be applied to the high-density genomic data to further improve the precision of the genetic associations.
We therefore conducted an analysis of a large pool of spontaneous resolvers and patients with chronic HCV infection using the ImmunoChip platform, the SNP2HLA algorithm with the T1DGC MHC imputation reference panel 4 , and a recently developed fine-mapping algorithm 6,7 to (1) more precisely define the susceptible variant within the known associated loci; and (2) identify additional loci associated with clearance. Similar successes have been achieved in other conditions such as inflammatory bowel diseases 6,8 . Additionally, we explored the hypothesis that there are shared mechanisms that define a "brisk" immunity able to confer both susceptibility to autoimmune disease and improved control of pathogens. We also examined the influence of region (North America versus Europe) upon associations with HLA within the European ancestry, as previous studies have shown variability of results, especially for the class I locus [9][10][11][12] .

Results
The final dataset after QC has 166,537 variants for 527 cases/828 controls of European ancestry; and 171,161 variants for 75 cases/171 controls of African ancestry (Table 1). For each ancestry, we performed logistic regression under the additive model using the first two principal components as covariates. The QQ plot (Fig. 1, using common variants with >2% minor allele frequency) and the genomic control (GC) factors (0.98 for the European ancestry and 0.92 for the African ancestry using designated null variants) indicate the effective control of the population stratification.
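The genomic control factor quoted above can be illustrated with a minimal sketch. It assumes the conventional definition (median observed 1-df chi-square statistic divided by the expected median under the null) and takes a plain list of p-values from the designated null variants; the function name and input format are illustrative, not taken from the study's pipeline.

```python
from statistics import NormalDist, median

# Expected median of a 1-df chi-square: square of the 75th percentile
# of the standard normal (about 0.455).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control_lambda(p_values):
    """Genomic control factor from two-sided 1-df test p-values.

    Assumes every p-value lies strictly between 0 and 1 or equals 1.
    """
    nd = NormalDist()
    # Convert each p-value back to a 1-df chi-square statistic:
    # chi2 = z^2, with z the normal quantile at 1 - p/2.
    chi2_stats = [nd.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    return median(chi2_stats) / CHI2_1DF_MEDIAN
```

A value close to 1 indicates little inflation of the test statistics; values well above 1 would suggest residual stratification.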
For European samples, we identified 8 genome-wide significant variants (p-value < 5E-8) in two loci ( Fig. 2 and Table 2  to spontaneously clear the virus compared to those with two copies of the C allele. This variant is roughly 7,000 base pairs upstream of the IL28B gene, and has been previously reported to be associated with HCV spontaneous clearance 2 and the response to chronic HCV therapy in Asian populations 13 . Previous studies have also shown an association between IL28B and interferon-based clearance of HCV 14 , and an association between a frameshift variant upstream of IL28B and impaired clearance of hepatitis C virus 15 . Because the IL28B/IFNL4 region was not designed as a high-density locus in ImmunoChip, we could not test other variants in this region for their association with HCV spontaneous clearance, and were unable to provide a better resolution in this locus.
The other genome-wide significant locus for the European samples is the major histocompatibility complex (MHC) locus. Genome-wide significant variants in this region are reported in Table 2 (before imputation). We used SNP2HLA 4 and a customized reference panel from a T1D study to impute missing variants, HLA alleles and amino acid residues for this region. We identified 12 SNPs and 5 amino acids that are genome-wide significant ( Fig. 3 and Table 3, boldfaced). No secondary signal in this region exceeded the suggestive significance threshold (1E-5) after conditioning on the primary signal. Therefore, all variants reported in Table 3 account for the same association signal. Using a fine-mapping algorithm described in another study 6,7 , we constructed the 99% credible set, which is a set of variants that has a 99% probability of containing the causal variant in this locus (Table 3, full). Compared with the previous study 2 , which localized this association to a region of more than 1 mega base pairs, we mapped this association to a much smaller region of 50,562 base pairs.
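The idea of a 99% credible set can be sketched as follows. This is a simplified illustration, not the cited fine-mapping algorithm: it assumes that a posterior probability of causality is already available for each variant and that the locus contains a single causal variant. The variant identifiers in the usage note are hypothetical.

```python
def credible_set(posteriors, coverage=0.99):
    """Smallest set of variants whose posterior probabilities sum to `coverage`.

    posteriors: dict mapping variant id -> posterior probability of being
    the causal variant (values are renormalised to sum to 1).
    """
    total = sum(posteriors.values())
    # Rank variants from most to least likely to be causal.
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for variant, prob in ranked:
        selected.append(variant)
        cumulative += prob / total
        if cumulative >= coverage:
            break
    return selected
```

For example, `credible_set({"snp_a": 0.90, "snp_b": 0.08, "snp_c": 0.015, "snp_d": 0.005})` keeps the three highest-ranked variants, because the first two together reach only 98%.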
Neither the MHC nor the IL28B locus was genome-wide significant in the African ancestry. Using the heterogeneity test (fixed-effect, implemented in the R metafor package), we found that neither the MHC locus nor the IL28B locus has a significantly different effect size (p-values = 0.47 and 0.29, respectively) across the two populations. Therefore, the difference in the significance is likely driven by the sample size and/or the allele frequency differences.
Table 2. Genome-wide significant associations. List of variants that have genome-wide significant association with HCV spontaneous clearance (before imputation). The genomic position is in HG18.
In addition to the genome-wide significant loci, we examined genes outside the HLA that have been previously associated with HCV spontaneous clearance 16 . Only genes IFNG-AS1 (p-value = 6E-4) and STAT1   IFNG-AS1 is a long noncoding RNA that is expressed in CD4 T cells and promotes Th1 responses 17 . STAT1 is one of the key mediators of the type I, II and III interferon responses.
Since HCV is particularly diverse, with up to a 30% difference at the amino acid level between major viral genotypes, the strain of infecting virus may influence HLA-mediated clearance 11,18 . Unfortunately, information regarding the virus genotype or subtype was not available in this study, so a direct comparison is not possible. However, an indirect comparison is possible by taking advantage of the observation that North American patients are much more likely to be infected with the 1a virus and European patients are much more likely to be infected with the 1b virus 19 . We observed that the association in the class II MHC locus, after accounting for the sample size, age, sex and exposure (Methods), is stronger in North American samples than in European samples ( Fig. 4) with marginal significance (p = 0.044). This suggests that viral subtype may have influenced the genetic mechanism underlying the clearance of HCV. Meta-analysis by cohorts confirms this observation (Fig. 5).
We also interrogated the potentially protective effect of certain SNPs associated with HLA class I alleles previously implicated in spontaneous clearance. No SNP tagging a class I allele reached genome-wide significance, including those tagging HLA-B*27 subtypes (p-values > 0.05).
The strength of association with the SNP most closely linked with HLA-B*57 and control of HIV-1 (rs2395029) was not genome-wide significant but showed a marked difference by continent (North America p-value = 8.6E-4, Europe p-value = 0.078, overall p-value = 1.0E-4), suggesting that any protective effect of this class I allele differs by region.
Autoimmune disorders have been reported to have shared genetic susceptibility loci 20,21 . For each of 5 major autoimmune diseases, including inflammatory bowel disease, systemic lupus erythematosus, rheumatoid arthritis, celiac disease and multiple sclerosis, we listed all variants that reached p-value < 0.001 (or the best variant) in this analysis. We found no shared variant after considering multiple testing. A full exploration of the hypothesis that susceptibility to autoimmunity also confers ability to clear HCV will require a larger sample size. This analysis was only performed in the European cohort because the African cohort has even less power due to the sample size, and GWAS results in samples of African ancestry for other autoimmune disorders are more limited.
An alternate approach, taken by the International Genetics of Ankylosing Spondylitis Consortium 22 , is to search for the reported associations with other diseases in loci having suggestive evidence (p-value < 1E-5), i.e., the MHC and the IL28B loci in this study. We only performed the search in IL28B because the MHC has already been implicated in many autoimmune disorders. We searched within 0.5 Mb around the lead SNP (rs8099917) in IL28B for associations with other diseases that have been reported in the NHGRI GWAS catalog (https://www.ebi.ac.uk/gwas, accessed on July 1, 2017). This catalog hosts published associations between genetic variants and thousands of diseases/traits, including autoimmune, inflammatory, cardiovascular, metabolic and brain diseases. Three SNPs were found to be in partial linkage disequilibrium (R² > 0.4) with our lead SNP in IL28B, including rs12980275 (R² = 0.41) associated with lipid levels in hepatitis C treatment 23 , rs12979860 (R² = 0.42) associated with chronic hepatitis C infection/response to hepatitis C treatment 14 (discussed in the previous sections), and rs688187 (R² = 0.40) associated with mucinous ovarian carcinoma 24 .

Discussion
We have conducted a genome-wide association study to identify genomic variants underlying HCV spontaneous clearance using ImmunoChip. Consistent with previous reports 2 , two loci were found to be significantly associated with HCV spontaneous clearance in the European cohort. The ImmunoChip design, the imputation pipeline specifically designed for the MHC region and the novel fine-mapping algorithm facilitated the accurate characterization of classical HLA types and allowed us to achieve a higher resolution in the MHC region. Twelve SNPs and 5 amino acids in the MHC region were found to be significantly associated, and no secondary signal remained after conditioning on the best SNP. Fine-mapping narrowed this association to a region of about 50 kilo base pairs, down from 1 mega base pairs in the previous study. This fine-mapping analysis was conducted in the European population. We note that if the MHC association is shared across populations, these fine-mapping results will also be generalizable to other populations.
We found no associated variants in the African cohort, probably due to different genetic background (in the case of the IL28B locus) and limited sample size (in the case of the MHC locus). Previous studies 18 suggest that spontaneous clearance can be more common with one virus genotype than another 25 . We noted that the association in the class II MHC locus might be stronger in samples from North America than in those from Europe. While viral subtyping was not available with sufficient numbers in this cohort, the virus subtype 1a is more prevalent in North America than in Europe, where subtype 1b predominates. Previous studies showed that key polymorphisms between viral subtypes may have influenced HLA-restricted genetic associations underlying the clearance of HCV 11,26 . In HIV-1, viral mutational escape over the first decades of the epidemic reduced the protective effect of key HLA alleles on a population level 27 . For HCV, additional evidence, such as virus typing, is needed to confirm this finding.
Limitations of this study include the inability to dissect SNPs near the IL28B/IFNL4 region, as this locus had not been previously implicated in autoimmune GWAS. While the ImmunoChip did include rs8099917 as a surrogate for this region, additional information regarding associations with rs12979860 and ss469415590 is not available 15 . Also, this study was a fine-mapping exercise that narrowed the MHC significantly but was not fully independent due to considerable overlap with the previous GWAS.
Previous studies of GWAS data revealed that there are SNPs and loci with evidence of association across multiple immune-mediated diseases 20 . We found several variants that have suggestive and plausible evidence of associations with both HCV spontaneous clearance and another autoimmune disorder. Despite the observation that none of these variants are significant after the strict Bonferroni correction, they jointly confirm the concept that shared genetic mechanisms underlie autoimmune disorders and suggest the hypothesis that susceptibility to autoimmunity may also confer ability to clear HCV. Fuller exploration of this hypothesis will require further analyses with larger sample sizes.

Methods
Overview of samples. 1,944 samples from 13 cohorts (ALIVE, BBAASH, HGDS, MHCS, Rosen and colleagues, REVELL, BAHSTION, SWAN, Toulouse, Cramp and colleagues, Hencore, Mangia and colleagues, UK Drug Use Cohort) were genotyped in this study, as previously described 2 . Self-clearance of HCV was coded as cases (718 samples) and persistence of HCV was coded as controls (1,180 samples). Samples with unidentified clearance status were not used (46 samples). All samples were genotyped using Illumina's ImmunoChip, a custom Infinium chip with 196,524 SNPs and small indels. A large number of these variants are in 187 high-density regions known to be associated with twelve autoimmune disorders and inflammatory diseases. Variants in these high-density regions include 289 established associations, variants from the 1000 Genomes Project low-coverage pilot 1 study 28 , and variants discovered in re-sequencing 29 . In addition, roughly 25,000 variants were included as replication of unrelated diseases as part of the WTCCC2 project, with the purpose of serving as null SNPs in analyses.

Sample ethnicities.
To identify the sample ethnicities, we first constructed the principal component axes using Hapmap samples. 988 founders from Hapmap phase 3 (draft release 2) 30 , including samples from ethnicities ASW, CEU, CHB, CHD, GIH, JPT, LWK, MEX, MKK, TSI and YRI were used. To calculate the principal components, only common variants that are also present in the ImmunoChip were used, and AT/GC SNPs were excluded to avoid ambiguous strand alignment. We performed LD pruning of the variants, resulting in a total of 15,525 variants used to create the principal components. The study samples were then projected to the principal component axes and assigned the ethnicities based on their distance to the Hapmap samples. Out of 1,898 samples, 1,416 samples were mapped to European ancestry, 225 samples were mapped to African ancestry and 227 samples were admixtures and were not used in this study.
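The projection-and-assignment step can be illustrated with a minimal sketch. The text does not specify the exact distance rule, so the nearest-centroid rule, the use of only two principal components, the `max_distance` cutoff and the population labels below are all assumptions made for illustration.

```python
from math import dist
from statistics import fmean


def centroids(reference):
    """Mean position of each reference population in PC space.

    reference: dict mapping population label -> list of (pc1, pc2) tuples.
    """
    return {
        label: tuple(fmean(axis) for axis in zip(*points))
        for label, points in reference.items()
    }


def assign_ancestry(sample_pcs, cents, max_distance):
    """Assign a projected sample to the nearest reference centroid.

    Samples farther than `max_distance` (a hypothetical cutoff) from every
    centroid are labelled "admixed".
    """
    label, d = min(
        ((lab, dist(sample_pcs, c)) for lab, c in cents.items()),
        key=lambda pair: pair[1],
    )
    return label if d <= max_distance else "admixed"
```

With reference clusters around (0, 0) and (1, 0), a sample projected at (0.05, 0.01) would be assigned to the first cluster, while a sample midway between the two would be labelled admixed under a cutoff of 0.2.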

Quality control. QC was performed separately on samples of European and African ancestries.
Variants that failed the Hardy-Weinberg equilibrium test in controls (p-value ≤ 1E-5) or had low call rate (≤95%) were identified, and 24,820 variants were removed in European samples and 20,196 variants were removed in African samples. The remaining variants were used to perform QC in samples. Samples were cleaned for having low call rate (≤95%) or having high heterozygosity rate (>3 standard deviations from the mean).
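As an illustration of the Hardy-Weinberg filter, the sketch below applies a 1-df chi-square goodness-of-fit test to genotype counts in controls. This is an assumption for illustration: the text does not state which form of the test was used, and an exact test is often preferred in practice. The threshold follows the text (p-value ≤ 1E-5).

```python
from math import erfc, sqrt


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic variant: nothing to test
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2.0))


def fails_hwe(n_aa, n_ab, n_bb, threshold=1e-5):
    """True if the variant would be removed by the HWE filter."""
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) <= threshold
```

Genotype counts of 25/50/25 match the expected proportions exactly and pass, whereas 50/0/50 (no heterozygotes at all) fail by a wide margin.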
We then created an LD-pruned dataset for calculating the identity by state (IBS) matrix and the principal components. We pruned the variants using a sliding window of 50 variants, a step size of 5 variants and a variance inflation factor threshold of 1.25. There were 20,782 variants in European samples, and 21,778 variants in African samples after the pruning. The IBS matrix was calculated using this LD-pruned dataset and checked for sample relatedness. 28 duplicated samples in European cohorts and 9 duplicated samples in cohorts of African ancestry were identified and removed (pi_hat > 0.9). The final dataset has 527 cases and 828 controls for European cohorts, and 75 cases and 171 controls for African cohorts.
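To make the pruning threshold concrete, the variance inflation factor can be related to the multiple correlation coefficient. Assuming the conventional definition, in which R² is obtained by regressing a variant on all other variants in the window, a threshold of 1.25 corresponds to removing variants with R² above 0.2:

```latex
\mathrm{VIF} = \frac{1}{1 - R^{2}},
\qquad
\mathrm{VIF} > 1.25
\;\Longleftrightarrow\;
1 - R^{2} < 0.8
\;\Longleftrightarrow\;
R^{2} > 0.2
```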
To correct for within-European and within-African population stratification, we calculated the principal components for samples of European ancestry and African ancestry, respectively. The first two principal components sufficiently control the population stratification in both ancestries (results not shown) and were used in the association analysis as covariates.
Imputation. Imputation of the MHC region was performed on QC cleaned data using SNP2HLA 4 . This software package takes advantage of the long-range linkage disequilibrium between HLA loci and SNP markers across the MHC region and can perform accurate imputation of classical HLA types starting from SNP genotype data. The reference panel was created using the Type 1 Diabetes Genetics Consortium's high quality HLA reference panel (roughly 5,000 European samples), which includes classical HLA alleles and amino acids at class I (HLA-A, -B, -C) and class II (-DPA1, -DPB1, -DQA1, -DQB1, and -DRB1) loci.
Association test. All association tests were performed in PLINK 1.07 31 using the logistic regression. We assumed additive models and used the first two principal components as covariates in the regression. HCV spontaneous clearance was coded as case so an odds ratio >1 indicates the tested allele increases the probability of spontaneous clearance.
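To make the direction of effect explicit, the sketch below converts a per-allele log odds ratio and its standard error, as reported by an additive logistic regression, into an odds ratio with a Wald confidence interval. The function name and the example numbers are hypothetical; only the coding convention (clearance as case, OR > 1 favouring clearance) comes from the text.

```python
from math import exp
from statistics import NormalDist


def odds_ratio_with_ci(beta, se, level=0.95):
    """Odds ratio and Wald confidence interval from a logistic regression.

    beta: per-allele log odds ratio under the additive model.
    se:   standard error of beta.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    return exp(beta), exp(beta - z * se), exp(beta + z * se)


# Hypothetical example: beta = 0.4 per copy of the tested allele.
or_one_copy, lower, upper = odds_ratio_with_ci(0.4, 0.1)
# Under the additive model the log odds are linear in allele count,
# so carrying two copies multiplies the odds by the square of the OR.
or_two_copies = or_one_copy ** 2
```

Because clearance is coded as the case status, an interval lying entirely above 1 would indicate that the tested allele favours spontaneous clearance.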

Test of heterogeneity across North America and Europe samples.
To evaluate whether the effect of the MHC association is consistent across samples from North America and Europe, we conducted the association test with age, gender and the HCV exposure (IDU vs. non-IDU) as covariates to control for potential confounding. We only used samples that have non-missing measurements in these variables. For North America samples, there were 173 cases (spontaneous clearance) and 298 controls; for Europe samples, there were 144 cases and 266 controls. The heterogeneity test was conducted using the odds ratio and standard error from the association test in a fixed-effect model implemented in the R metafor package.
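A fixed-effect heterogeneity test on two regional estimates can be sketched with Cochran's Q statistic. This is an illustrative stand-in written in Python and is not the authors' metafor call; it takes log odds ratios and standard errors as inputs and is restricted to two groups, so that Q has one degree of freedom.

```python
from math import erfc, sqrt


def fixed_effect_heterogeneity(betas, ses):
    """Inverse-variance pooled estimate and Cochran's Q test for two groups.

    betas: log odds ratios (e.g. North America, Europe).
    ses:   their standard errors.
    Returns (pooled_beta, q_statistic, p_value).
    """
    if len(betas) != 2 or len(ses) != 2:
        raise ValueError("this sketch handles exactly two groups (1 df)")
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    # Q sums the weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    # Upper tail of a 1-df chi-square: P(X > q) = erfc(sqrt(q / 2)).
    return pooled, q, erfc(sqrt(q / 2.0))
```

A small p-value would indicate that the effect sizes differ between the two regions by more than their sampling error allows.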
Use of experimental animals, and human participants. No experimental animals were used in this study. The study protocols were approved by the institutional review board (IRB) at each center involved with recruitment (listed at the end). Informed consent and permission to share the data were obtained from all subjects, in compliance with the guidelines specified by the recruiting center's IRB. All experiments were performed in accordance with relevant guidelines and regulations.