International genome-wide meta-analysis identifies new primary biliary cirrhosis risk loci and targetable pathogenic pathways

.

Discovery of candidate causal disease variants.In functional annotation of risk loci, we identified 199 candidate variants across 28 non-HLA risk loci with probabilistic identification of causal SNPs (PICS) probability 40.0275 (ref.9).At each risk locus, the most-likely causal variant was the index variant, with median PICS probability of 0.224 and values up to 0.998 for rs2546890 at 5q33.3 (Supplementary Data 2).Looking at all candidate variants across all risk loci, the majority were intronic, upstream or downstream gene variants with no predicted functional consequence (99/199, 40%).However, a substantial proportion (59/199, 30%) were regulatory region variants, defined as SNPs located within regulatory features, including enhancers, promoters, transcription factor-binding sites and open chromatin regions (Supplementary Data 3).Notably, candidate variants at 18 (64%) of the 28 annotated risk loci included at least one regulatory region variant.In contrast, only 5 of 199 candidates were missense variants (2.5%) (Supplementary Table 3a).However, these included rs2297067 in EXOC3L4 at 14q32.32 and rs2304256 in TYK2 at 19p13.2, both predicted by SIFT and/or PolyPhen to be deleterious or potentially damaging 10,11 .Candidate variants included a single splice region variant, that is, rs17641524 at 1q31.3 that is predicted to affect splicing of DENND1B (Supplementary Table 3b).
We found that candidate variants at several risk loci are methylation quantitative trait loci (mQTLs), including mQTLs for DENNDIB, PLCL2, IRF5 and TNFRSF1A, all genes that are implicated in risk for other autoimmune diseases (Supplementary Data 4).We also found that candidate variants at several risk loci are expression quantitative trait loci (eQTLs) in lymphoblastoid and other cell lineages, including eQTLs for CCL20, IL12A, IRF5 and TYK2 (Supplementary Data 5).
At many risk loci, functional annotation highlighted a single candidate gene (Supplementary Data 2).However, most risk loci contained multiple compelling candidate variants.This complexity is well illustrated by the composite of candidate variants at the PLCL2 gene and MANBA gene loci, which include multiple eQTL and mQTL SNPs.Thus, despite the presence of many candidate variants with regulatory or epigenetic roles within PBC risk loci, more direct biological experimental approaches are required to pinpoint the disease-causal variants at these loci.
We also applied functional GWAS (FGWAS) and its associated annotation file 12 to our full set of discovery GWMA results and thereby identified 75 annotations with enrichment (Po0.01 from FGWAS) of GWMA association signals (Supplementary Data 6).After a stepwise selection approach similar to that of Pickrell 12 , the best-fitting model included six annotations highlighting negative enrichment of repressed chromatin regions in a lymphoblastoid cell line, and positive enrichment of DNase-I-hypersensitive sites in a variety of cell types, in particular CD20 þ and Th1 T cells (Supplementary Table 4).

Identification of candidate targetable biological pathways.
To identify biological pathways involved in development of PBC, we conducted pathway analysis using GCTA 13 followed by i-GSEA4GWAS 14 .We identified several immunoregulatory pathways associated with PBC, in particular, IL-12 and other cytokines as well as T-cell signalling pathways.To account for bias that might result from the strong HLA association with PBC, we repeated this analysis with SNPs/genes in the HLA region excluded.Notably, IL-12, IL-27 and JAK-STAT signalling pathways were still associated with PBC, even after their HLA contribution had been removed (Table 2).
We identified molecules that targeted these pathways by overlaying the Drug Gene Interaction database 15 and calculating a pathway specificity score and Jaccard index of each drug for each of the pathways that remained associated with PBC after the HLA contribution had been removed (Table 2, Supplementary Data 7).This combined analysis identified pathways and immunomodulatory agents that represent promising leads for further study in models of PBC.

Discussion
The current study adds to our knowledge of the genetic architecture of PBC.Notably, our data identify CCL20 as a candidate risk gene for PBC.Chemokine ligand 20 (CCL20) and its chemokine receptor CCR6 contribute to the formation and function of mucosal lymphoid tissues and are notably, in the context of the immune-mediated lymphocytic cholangitis characteristic of PBC, involved in the localization of Th17 cells and CD8 effector T cells to cholangiocytes and the periductal area in portal tracts 16 .This study also reinforces the importance of IL-12 and JAK-STAT signalling in this disease.
The functional annotation of risk loci has helped to assign priority to the candidate genes at newly identified and established risk loci.Furthermore, the identification of disease-associated regulatory variants at multiple risk loci emphasizes the potential importance of gene regulation in the pathogenesis of PBC (and presumably other complex disorders).This possibility is corroborated by the finding of numerous risk loci wherein the index and/or closely related SNPs that appear to represent regulatory, mQTL and/or eQTLs variants related to the nearby gene.Via the FGWAS analysis, this study also suggests particular importance of CD20 þ B cells and Th1 cells in the pathogenesis of PBC.However, both the cell types and the specific gene variants most relevant to PBC require further investigation and in particular exploration of the tissue-specific functional effects of the disease-associated variants.
By looking for drug-gene interactions, we have identified candidate drugs targeting specific, PBC-associated pathways, creating new opportunities to re-purpose available drugs for targeted immune therapy.Despite the speculative nature of this analysis, the data provide a start point in the search for novel therapies that are urgently needed to improve outcomes for PBC patients.

Methods
Study samples and genotyping.The use of human subjects for this study was approved by the University Health Network Research Ethics Board, The Mayo Clinic Institutional Review Board, Etico Indipendente IRCCS Istituto Clinico Humanitas, UC Davis Institutional Review Board and the Oxford Research Ethics Committee.
All PBC cases included in the Canadian-US, Italian and UK discovery and validation cohorts fulfilled the American Association for the Study of Liver Diseases criteria for PBC.
The Canadian-US discovery cohort included 499 PBC cases who were self-reported whites of European descent and 390 healthy Canadian controls, all genotyped using the Illumina HumanHap370 BeadChip.Additional controls included in this cohort were 1,094 control subjects provided from the Prostate Cancer Genetics Markers Susceptibility (CGEMS), 1,142 controls from the Breast CGEMS studies and 1,748 controls from the New York Cancer Project, all of whom who were genotyped on an Illumina 550 K bead array 4 .Following all quality control (QC) procedures, the final Canadian-US discovery set included 499 PBC cases and 4,374 controls.
The PBC cases included in the Italian discovery cohort were self-reported whites of Italian descent genotyped using the Illumina Human610-Quad BeadChip.Controls in this cohort were healthy Italians genotyped using the Illumina 1M-duo PBC risk loci identified in the current study.SNPs were taken forward for validation based on having a discovery P value o2 Â 10 À 5 (or, in the case of rs526231 and rs2434360, based on acting as a proxy for a SNP with a P value o2 Â 10 À 5 ).Discovery P values were calculated using logistic regression of individual discovery data sets in ProbABEL followed by genomic control correction of individual discovery data sets in R and fixed-effects meta-analysis in META; validation P values were calculated using logistic regression of individual data sets in PLINK followed by fixed-effect meta-analysis in META; joint P values were calculated using fixed-effect meta-analysis of discovery and validation data sets in META; see Methods.Autoimmune overlap refers to overlap between risk loci for PBC and those of other autoimmune conditions.*Functional annotation.wRegulatory variants: The index SNP or variants in strong linkage disequilibrium (LD, r 2 Z0.8) with the index SNP at this locus overlap regulatory elements that are related to the annotated gene (Supplementary Table 3).zmQTLs: The index SNP or variants in strong LD are correlated to methylation related to the annotated gene (Supplementary Data 4).yeQTLs: The index SNP or variants in strong LD are correlated to expression of the annotated gene (see Supplementary Data 3).
array.Following QC procedures, the final Italian discovery set comprised 449 cases and 940 controls.
The PBC cases included in the UK discovery cohort were self-reported whites of British descent, genotyped using the Illumina Human-660 W Quad array.Controls in this cohort were 5,163 population controls genotyped on the Illumina 1M-Duo array as part of the Wellcome Trust Case Control Consortium 2 project.Following QC procedures, the UK discovery set comprised 1,816 cases and 5,161 controls.
The 'Canadian' 903 PBC cases and 834 controls included in the validation studies were self-reported whites of European descent recruited from Canada, Europe and the United States to an ongoing PBC genetics study based in Toronto.The 721 'US' PBC cases and 294 controls included in the validation studies were self-reported whites of European descent enroled in the Mayo Clinic PBC Genetic Epidemiology registry and biorepository based at the Mayo Clinic in Rochester (https://clinicaltrials.gov/ct2/show/NCT01161953?term=pbc&rank=5).The Italian PBC cases and controls included in the validation studies were self-reported whites of Italian descent recruited to the Italian PBC Genetics study based at Instituto Humanitas in Milan.The Italian controls were obtained from Ospedale Alessandro Manzoni, Lecco, Italy and were unrelated healthy volunteers with no known non-Italian heritage.Cases and controls from the Canadian, Italian and the US cohorts were genotyped at the University Health Network/Mount Sinai Hospital Clinical Genomics Centre using a Sequenom iPLEX Gold assay.Following QC procedures, the final validation set included 903 cases and 834 controls from Canada; 300 cases and 618 controls from Italy; and 721 cases and 294 controls from the United States (Supplementary Table 1).
The 'UK' PBC cases included in the validation studies were self-reported whites of British descent recruited to the UK-PBC project via the UK-PBC Consortium (http://www.uk-pbc.com/).Cases were genotyped using Sequenom iPLEX Gold assay at the Wellcome Trust Sanger Institute Genotyping Facility (http:// www.sanger.ac.uk/).The UK validation control data were obtained from the TwinsUK resource, an adult twin registry comprising 12,000 (predominantly female) British twins.Genotype data for 3,512 twin individuals (genotyped using the Illumina HumanHap610 array) were obtained from the Department of Twin Research and Genetic Epidemiology at King's College London.One twin from each genotyped pair was included in the current study, amounting to 2,603 unrelated individuals.Following QC procedures, the final UK validation set comprised 1,792 PBC cases and 2,515 TwinsUK controls (Supplementary Table 1).
Quality control.We implemented a standard QC pipeline across all three discovery data sets, over-and-above QC procedures carried out in the respective primary analyses 2,4,5 .QC checks were carried out using the software package PLINK 17 .Within each discovery data set we removed SNPs with a genotype call rate o95%; minor allele frequency o0.05; significant deviation from Hardy Weinberg Equilibrium in controls (Po10 À 5 ) or a large difference (45%) in the proportion of missing genotypes in cases versus controls.We removed samples showing high rates of missing data (490%); whole-genome heterozygosity 4six s.d.from the mean; estimated proportion of identity by descent (IBD) sharing with another sample 40.1, or apparent gender discrepancies (based on X-chromosomal heterozygosity 40.2 for men and o0.2 for women).Principal component analysis (based on a subset of 32,000 highly informative SNPs) was carried out using the 'smartpca' routine of the EIGENSOFT package 18 to identify population outliers for exclusion and to identify principal components that differed between cases and controls; these principal components were used as covariates in subsequent association analyses.
Genome-wide imputation.We used the SNPs and samples passing QC to carry out genome-wide imputation within each of our cohorts using the software package MaCH 19 with HapMap3 CEU þ TSI samples as reference data sets.Within each cohort we used approximately the same set of genotyped SNPs in cases and controls to ensure similar levels of informativity.Following imputation, we retained only those SNPs displaying minor allele frequency 40.005 and imputation quality score R 2 40.5 in all three cohorts.Statistical analysis of discovery cohorts.Within each cohort we carried out association analysis of the genome-wide imputed data allowing for imputation uncertainty using the software package ProbABEL 20 .We performed logistic regression of disease phenotype on allele dosage; principal components that differed between cases and controls were included as covariates to help correct for population stratification.Quantile-quantile plots of the genome-wide set of test statistics were examined and genomic control correction was carried out within each cohort by multiplying the standard error of the estimated log odds ratio for each SNP by the square root of the genomic control inflation factor l (ref.21).The resulting log odds ratios and adjusted standard errors from all three cohorts were meta-analysed using the software package META to produce the final set of genome-wide discovery results 22 .
Validation analysis.We selected loci for validation if they achieved suggestive level of significance in the discovery analysis (minimum Po2 Â 10 À 5 ) and were not already known to be associated with PBC.We also selected loci for validation if they had achieved genome-wide significant association in one previous study but had never been validated in an independent cohort.We selected approximately two validation SNPs per locus; for loci displaying extended patterns of linkage disequilibrium or harbouring several putative independent association signals we attempted to select two validation SNPs within each subregion.Within each locus chosen for validation we assigned priority to SNPs according to whether they had been genotyped in the TwinsUK cohort (which was used as a validation cohort for the UK validation cases).One SNP selected for validation (rs2297067) did not have genotype data available in TwinsUK and was therefore imputed within TwinsUK based on genotyped SNPs in the surrounding 5-Mb region using the software packages SHAPEIT 23 and IMPUTE2 (ref.24), with 1,000 Genomes (Phase I version 3 integrated data, released on March 2012) used as a reference sample.The TwinsUK cohort was subjected to a variety of additional QC checks as described previously 25 ; the 2,515 controls used here correspond to the 2,520 controls used previously with an additional five exclusions due to discrepant gender 25 .
Within each validation cohort we carried out case/control association analysis of those SNPs that were successfully genotyped using logistic regression in PLINK.Results from the four validation cohorts (or from the combined discovery and validation cohorts) were combined via meta-analysis in META.
Imputation to 1,000 Genomes within validated loci.Imputation within the discovery cohorts was carried out at the six validated loci using the software packages SHAPEIT 23 and IMPUTE2 (ref.24) with the 1,000 Genomes (Phase I integrated variant set, release December and June 2013) used as a reference panel.The same genotyped SNPs that had been used to inform HapMap3 imputation for the discovery analysis were used for the 1,000 Genomes imputation within these targeted regions.Association analysis of SNPs passing post-imputation QC ('info' score 40.5) was carried out separately within each cohort, the results were genomic control corrected by multiplying the standard error of the estimated log odds ratio for each SNP by the square root of the previously estimated genomic control inflation factor l for each cohort, and results were combined across the cohorts via meta-analysis in META.This confirmed the findings from our original (HapMap3) imputation experiment but did not identify any substantially stronger associations or candidate causal variants than we had already found.
Functional annotation of validated loci.Left and right boundaries for each associated region were defined by finding a 0.1-cM interval either side of the most strongly associated SNP where no SNP has Po1 Â 10 À 5 .We looked for overlap between PBC risk loci and confirmed risk loci for other autoimmune conditions using ImmunoBase, a web-based resource focused on the genetics and genomics of immunologically related human diseases (http://www.immunobase.org/).To assign priority to candidate genes and candidate variants at risk loci, we used the online PICS (Probabilistic Identification of Causal SNPs) algorithm to identify candidate variants at each risk locus with a PICS probability 40.0275 (http://www.broadinstitute.org/pubs/finemapping/?q=pics) 9 .We adopted this threshold to be consistent with Farh et al. 9 in their paper describing the approach.Given an index SNP corresponding to the most associated SNP in a locus, the PICS algorithm calculates (based on the known linkage disequilibrium pattern in the region, as measured in a large Immunochip or 1000 Genomes reference sample) a score for each SNP in the region, representing the extent to which that SNP could, in fact, be the true causal SNP, allowing for statistical sampling variation.
We also used the FGWAS software and its associated annotation file (containing 450 genomic annotations of various types), applied to our full set of GWMA results, to investigate the extent to which genetic variants associated with PBC were enriched within specific annotation categories 12 .Testing each annotation individually, we found 75 annotations that showed enrichment (Po0.01) of GWMA association signals; as many of these annotations are correlated with one another we used a stepwise selection approach followed by cross-validation to mitigate overfitting (similar to the procedure performed by Pickrell 12 ) on these 75 annotations to identify a final best-fitting model that included 6 annotations.Annotation information used by FGWAS was derived from a variety of sources including Maurano et al. 29 , Thurman et al. 30 and Hffman et al. 31 (see Appendix of Pickrell 12 for details).
Pathway analysis.Using summary results from the GWMA (effect size, standard error and allele frequency) along with SNP linkage disequilibrium estimated from the Italian GWAS individual-level genotype data, we performed approximate conditional analysis using the software GCTA 13 .Only the independently associated signals with conditional P value and P GWMA both o0.001 were retained for further consideration.We submitted the rsIDs and P GWMA of these SNPs as well as gene sets from BioCarta, KEGG, PID and Reactome curated by MSigDB (as of 26 March 2014) to the i-GSEA4GWAS web server 14 .This programme identified genes within 20 kb of the SNPs and represented each gene by the greatest -log P GWMA of the SNP(s) mapped to it.Gene sets were then assessed for enrichment with significant genes while SNP label permutations were conducted to correct for bias from variations in gene size and gene set size.False discovery rate was used to correct for multiple testing based on the distributions of enrichment scores generated by permutation.
Drug-pathway analysis.To identify drugs that affected the pathways associated with PBC (when the HLA locus was excluded), we first identified the genes participating in each pathway.We then downloaded drug-gene associations from the Drug Gene Interaction database 15 and scored each drug by the proportion of each its targets that were in each pathway, which we termed as the drug's pathway specificity.As a secondary scoring metric, we evaluated the proportion of each pathway affected by the drug using the Jaccard index on the respective sets of pathway genes and targeted genes.To identify promising drug candidates, we ranked drugs first by our primary specificity metric and then by the secondary Jaccard index.

Italian PBC Genetics Study Group
Piero L. Almasio 34