Prostate cancer is a polygenic disease with a large heritable component. A number of common, low-penetrance prostate cancer risk loci have been identified through GWAS. Here we apply the Bayesian multivariate variable selection algorithm JAM to fine-map 84 prostate cancer susceptibility loci, using summary data from a large European ancestry meta-analysis. We observe evidence for multiple independent signals at 12 regions and 99 risk signals overall. Only 15 original GWAS tag SNPs remain among the catalogue of candidate variants identified; the remainder are replaced by more likely candidates. Biological annotation of our credible set of variants indicates significant enrichment within promoter and enhancer elements, and transcription factor-binding sites, including AR, ERG and FOXA1. In 40 regions at least one variant is colocalised with an eQTL in prostate cancer tissue. The refined set of candidate variants substantially increase the proportion of familial relative risk explained by these known susceptibility regions, which highlights the importance of fine-mapping studies and has implications for clinical risk profiling.
Prostate cancer (PrCa) is the most common cancer among males in developed countries. As there is evidence for a large heritable component for PrCa, the identification of genetic variation that increases susceptibility may help to inform screening strategies and clinical management of patients in the future. Currently, only a handful of rare genetic variants with larger effect sizes have been reported that increase the risk of PrCa (e.g., BRCA2 and ATM)1,2. By comparison, genome-wide association studies (GWAS) have reported >100 low-penetrance PrCa risk signals with small odds ratios (ORs)3. Individually, these GWAS loci only modestly influence risk. However, because the risk alleles are relatively common within the general population their cumulative impact is substantial.
When an initial GWAS identifies a susceptibility locus, any one (or more) of a large number of variants within the region may underlie the molecular mechanism that modulates risk. This includes correlated variants in linkage disequilibrium (LD) that may capture the same association signal and additional variants with independent associations. Genotyping a denser set of variants in the region facilitates characterisation of the underlying genetic architecture and makes subsequent imputation more precise and complete. Although forward stepwise selection is frequently used for fine-mapping, it has severe limitations, particularly the way LD can lead to misleading results. In this manuscript, we report the findings of a PrCa fine-mapping study in a European ancestry meta-analysis sample set that is the largest to date and utilise the well-established stochastic search and model selection framework, which more accurately represents the uncertainty in determining both the number of signals and the set of single-nucleotide polymorphisms (SNPs) that best describe the association in each region4,5,6,7. To leverage the large sample size from the overall meta-analysis, we use a novel multivariate Bayesian variable selection approach, which takes marginal SNP summary statistics as input and accounts for LD, to jointly analyse all SNPs in a region. We identify a catalogue of variants and further prioritise within this set through functional annotation, to assist identification of putative causal variants. This refined credible set of variants explains a substantially larger proportion of the estimated familial relative risk (FRR) of PrCa compared with the original GWAS tags.
Replication of reported associations prior to fine-mapping
In this study, we examined 92 PrCa GWAS risk associations within 85 distinct genomic regions reported prior to the recent meta-analysis using the OncoArray experiment8; due to their complexity, two regions (Chr8q24 and Chr6p21/MHC) were excluded and are subject to separate studies. Some regions contained more than one signal due to close proximity between the reported index SNPs. Summary results from the large European ancestry meta-analysis comprising 82,591 PrCa cases and 61,213 controls from eight GWAS sub-cohorts (OncoArray, iCOGS, UK stage 1 and 2, CaPS 1 and 2, BPC3 and NCI PEGASUS), imputed to the 1000 Genomes phase 3 reference panel, were used for our fine-mapping analysis.
We first assessed whether all 92 original associations had replicated with at least one variant in the region at a genome-wide significant level (marginal P-value <5 × 10−8). Five regions had not replicated and were excluded from downstream fine-mapping analyses accordingly (Supplementary Table 1). An additional 3 associations previously reported in different ancestral populations also had not replicated in our European sample set; however, these original lead variants were each situated within the region boundary of another replicated GWAS association and therefore the expanded region boundary was retained during fine-mapping for logistical purposes, although only the associations replicated in Europeans were considered as index variants. Fine-mapping was therefore conducted for 84 replicated, previously reported GWAS signals, within 80 distinct regions (Fig. 1). This included the region encompassing the moderate penetrance risk SNP rs138213197 in HOXB13, which although originally identified through sequencing9 was included due to its relatively close proximity to the GWAS association rs11650494. The HOXB13 region therefore also served as a useful positive control during mapping, since the known causal variant exerts a relatively large effect size (OR 3.85) and has low minor allele frequency (MAF), but the signal is also detectable through a cluster of more common variants as a ‘synthetic association’10.
The eight signals that did not replicate in our European meta-analysis may remain risk loci for PrCa in other ancestral populations or specific disease phenotypes rather than overall PrCa risk, although we cannot completely exclude the possibility that some were false positives. Two of these variants were originally reported in a multi-ethnic meta-analysis (rs7153648 and rs12051443), one failed quality control (QC) due to strongly discordant MAF between individual sub-studies within the meta-analysis (rs6625711) and is also reported as having extremely discordant MAF between 1000 Genomes phase 1 and phase 3 cohorts (MAF in EUR 0.45 vs. 0.16), one was associated with young-onset disease only (rs636291), one only for aggressive PrCa (rs1571801) and the final three were reported in populations of Chinese (rs103294), Japanese (rs2055109) or African (rs7210100) ancestry and had not been confirmed in Europeans to date11,12,13,14,15.
Multivariate fine-mapping from univariate summary statistics
We utilised Joint Analysis of Marginal summary statistics (JAM)16, a novel fine-mapping framework that uses summary statistics and explores multi-SNP models while accounting for LD. JAM provides inference of two important measures; (1) the most likely number of independent risk variants in the region and (2) a 95% credible set of variants that drive these signal(s). This credible set includes all variants from regression models that cumulatively reach at least 95% posterior probability in JAM’s stochastic search. Prior to running JAM, the variants were pruned to eliminate high LD (initially set at r2 > 0.9, decreased in r2 = 0.05 increments if required, Fig. 1). JAM was run twice for each region using independent seeds of 10 million iterations each. Final credible sets for each region included the set of tag variants identified by JAM and the pruned SNPs in high LD with these tags. Region-wide Bayes factors were used to provide evidence for the minimum number of independent signals. For 75 regions JAM successfully inferred credible sets of associated variants from the meta-analysis summary statistics, with 91% concordance of variants selected between two independent runs. For the final 5 regions, JAM did not infer a strong posterior probability for any variant, therefore was unable to select candidate variants.
Overall, we identified 99 independent PrCa risk signals within the 80 replicated regions (Tables 1–3). In all, 68 regions contained a single PrCa risk association, whilst we detected evidence for multiple independent risk signals within 12 regions (15% of replicated loci). In the initial meta-analysis data set, the 80 replicated regions contained a total of 213,728 SNPs, of which 14,463 were genome-wide significant and 25,186 marginally associated with PrCa at P < 5 × 10−5. From this variant set, JAM identified a catalogue of 3700 SNPs as the final 95% credible set of candidate causal variants for the 75 regions successfully fine-mapped (Supplementary Data 1), whilst in the 5 regions in which JAM could not identify candidate variants, a total of 175 variants had reached genome-wide significance in the univariate meta-GWAS results, including a novel more strongly associated lead variant in 4 of the 5 regions (Supplementary Data 1). The majority of variants within the JAM credible set were common (Supplementary Fig. 1a), with only 2 variants having MAF < 1% and 48 variants MAF < 5%; lower MAF variants do however represent the most likely candidate causal variants within certain regions. We also observed a slight increase in the distribution of univariate ORs for the novel lead variants we have identified in comparison to the original GWAS tag SNPs (Supplementary Fig. 1b). Only 15 original GWAS tag SNPs remained within the catalogue of candidate variants, with all other signals being replaced by more likely candidates. As expected, fine-mapping performance varied by region, with 95% credible set sizes ranging from 1 to 606 variants. We did however observe strong refinement of variants within the majority of regions (median 24 variants per region overall and 21 for single-signal regions). Indeed, among the 63 single-signal regions, 30 returned a 95% credible set containing ≤20 variants, of which 20 comprised ≤10 variants and 4 returned a credible set containing a single variant. These represent the putative causal PrCa susceptibility variant within that locus and include the well-established HOXB13 causal variant rs138213197 at Chr17q219, as well as rs10993994 in the promoter of MSMB, which modulates gene expression in prostate tissue17,18,19. These two regions serve as proof of principle; our methodology selected the presumed causal variants and therefore the remaining two single candidate variants are very likely to be causal and are strong candidates to test in functional studies. These two variants are an intronic SNP in TBX1, and a low MAF frameshift insertion in the final exon of FAM111A; which confirms for the first time in Europeans the GWAS hit at this locus previously reported in Japanese11, although the European and Japanese variants are not in LD. The 12 regions with multiple independent risk signals contained 31 independent signals in total, represented by a 95% credible set of 626 variants (median 33.5 variants per region, average 20.2 variants per association signal). Prioritisation also performed well in these complex regions. In the TERT region at Chr5p15 we observed the highest number of independent signals, 5, and the credible set comprised only 30 SNPs. Similarly, 3 regions each containing 3 signals (Chr2q37:FARP2/ANO7, Chr17q12:HNF1B and Chr19q13:KLK3) returned a combined credible set of 61 variants representing these 9 PrCa associations. Notably, we observed that the regions found to contain multiple independent signals generally had P-values and marginal ORs towards the upper end of the distribution of original GWAS hits in the univariate meta-GWAS (Supplementary Fig. 2).
Integration of annotation
We annotated variants for indicators of putative biological functionality using data from publically available databases. Intragenic variants were ascribed to genes relative to GENCODEv19, miRNA variants using MirBasev20 and variants situated within segments of the genome under evolutionary conservation were annotated using conserved element outputs generated by four algorithms (GERP++, SiPhy Omega, SiPhy Pi and Phastcons)20,21,22. For information derived from tissue-based experimental data sets, we focused primarily on those conducted in prostate cell lines; specifically DNaseI hypersensitivity sites in three prostate cell types from seven experiments in the ENCODE project, chromatin-state characterisations by ChromHMM from Taberlay et al.23, ChIP-seq peak locations for a variety of transcription factor (AR, CTCF, ERG, FOXA1, GABPA, GATA2, HOXB13 and NKX3.1) and histone mark (H3K27Ac, H3K27Me3 and H3K4Me3) data sets retrieved through the Cistrome Data Browser24, and expression quantitative trait loci (eQTLs) from a set of 359 PrCa samples in the Cancer Genome Atlas (TCGA).
To formally incorporate these annotations into the prioritisation of SNPs, for the 75 regions in which JAM selected candidate variants, we investigated posterior estimates from JAM for all 37 863 pruned tags against annotation features using a conditional quantile regression (QR) analysis25,26 at multiple quantiles (99.2, 99.4, 99.6, 99.8 and 99.95%). These correspond to posterior probabilities ranging from 0.01 to 0.99, with the exact values conditional on the linear combination of the annotations. At each quantile, we used the fitted model to calculate a predicted posterior probability given the SNP’s annotation features. A single expected posterior probability was then calculated from a weighted average of these quantile-specific expected posterior probabilities with the weight reflecting both the fit (i.e., a function of the likelihood) and variance of the predicted values from the quantile-specific model to the data. We selected a single data set for each annotation category for the QR analysis to minimise correlation between variables. Whilst the majority of tag probabilities were not notably adjusted during QR, an appreciable subset of variants were up- or downgraded based upon their annotations (ΔPosterior probabilityQR ranged between −0.304 and 0.254; 63 of the 37,863 tags had a ΔPosterior probabilityQR of magnitude ±0.005 or greater) (Supplementary Fig. 3). The conditional QR also facilitates identification of the annotations that demonstrate an association across the extreme quantiles of the posterior probabilities. Specifically, several annotations (eQTLs within TCGA PrCa tissue, AR and GATA2 transcription factor-binding sites, LNCaP DNase1, H3K27Ac and H3K4Me3 histone marks, enhancer and repressed chromatin states by ChromHMM, conservation according to GERP++, higher CADD scores and protein altering variants) had statistically significant associations (P < 1.0 × 10−3) for at least one quantile (Supplementary Data 2). That is, the upper quantiles of the posterior probability distribution for variants with any of these annotations were larger when compared with SNPs without those annotations.
For comparison to the conditional QR approach, we also used Fisher’s exact test to examine the representation of individual annotation features across variants included in the 95% credible set of prospective PrCa causal variants relative to variants not selected. Independent tests were conducted for each annotation upon the set of 37,863 tag variants analysed by JAM, of which 343 tags represented the 95% credible set of 3700 SNPs and annotations for all proxy SNPs were inherited by the tag variant. We observed significant enrichment of a number of annotations among variants in the credible set (Fig. 2, Supplementary Data 2). In particular, enrichment was found for eQTLs in the TCGA data set (P = 1.15 × 10−23); intragenic variants within protein-coding genes (P = 8.15 × 10−11; P = 6.03 × 10−5 for protein altering variants exclusively) but not non-coding transcripts (P = 0.29); promoter (P = 1.66 × 10−8), enhancer (P = 3.42 × 10−6) and transcribed (P = 3.07 × 10−7) ChromHMM states in prostate epithelial cells; DNaseI hypersensitivity sites from all seven ENCODE prostate data sets (P = 1.28 × 10−7 to 7.61 × 10−17); for AR (P = 2.33 × 10−15 to 2.86 × 10−20), ERG (P = 5.33 × 10−12 to 1.00 × 10−20), FOXA1 (P = 9.18 × 10−18 to 1.14 × 10−18), GABPA (P = 8.53 × 10−12), GATA2 (P = 1.24 × 10−12), HOXB13 (P = 8.25 × 10−9) and NKX3.1 (P = 9.44 × 10−5 to 1.43 × 10−15) transcription factor-binding sites from one or more experimental data set; for H3K27Ac (P = 5.34 × 10−19 to 1.39 × 10−21) and H3K4Me3 (P = 1.30 × 10−9 to 8.27 × 10−14) histone marks; and conserved elements within the human genome according to all four algorithms (P = 1.89 × 10−7 to 4.04 × 10−11). Of particular interest, in over half of the regions fine-mapped, at least one variant within our credible set intersected a significantly associated eQTL with a colocalisation score >0.9 (overlap between eQTL and GWAS signal) in the TCGA PrCa data set. In all, 40 of the 75 regions contained an eQTL variant among the credible set, with 91 distinct genes represented (Tables 1–3, Supplementary Data 3). In total, 127 of the 343 tags representing the credible set inherited an eQTL annotation (37%), compared with 5711 of the total 37,863 tags within these regions (17.8%). This corresponds to 1027 prostate eQTL variants among the 3700 credible set variants represented by the 343 JAM tags (27.8%), compared with 37,331 eQTLs from the 203,211 total variants within these 75 regions (18.4%).
Intuitively, some degree of correlation between the annotation features we examined would be expected, since regulatory regions of DNA may be indicated through various experimental techniques. Although annotations were jointly modelled in QR, any partial correlation could potentially inflate the extent of enrichment observed during independent Fisher’s tests. To preclude this outcome, we examined the level of correlation between separate annotations. Correlation between replicate data sets representing the same annotation category was usually moderate to high as would be expected, with more modest levels of correlation observed between different markers and information types (Supplementary Fig. 4). The level of correlation increased slightly when individual SNP annotations were collapsed onto tags, as the tag variants can inherit different annotations from separate SNPs. We performed logistic regression of the annotations used in the QR analysis in a single model, to evaluate their informativeness after adjustment for other annotation categories. In this regression, the TCGA eQTL, coding transcript and ERG transcription factor annotations were all highly significant after adjusting for multiple testing, whilst the AR transcription factor annotation was also nominally significant (Supplementary Fig. 5). The remaining annotations were not significant after adjustment for other annotations; however, within the range of information types selected, separate data sets represent broader or greater resolution functional information relative to one another and therefore may partially overlap with other markers whilst remaining instructive individually.
At several regions our catalogue of variants highlighted putative biological mechanisms that may be responsible for the differential risk of PrCa development, as well as credible sets sufficiently small to enable subsequent laboratory follow-up. One example is the Chr2q37 region described by rs3771570 in the original publication27. The original lead variant is intronic in FARP2, but multiple genes are located within the region. During fine-mapping, we observed evidence for three independent signals, one more than we previously detected28. These signals are represented by a credible set of 14 variants from 7 tags, demonstrating highly successful refinement of the original signal (Fig. 3a, Tables 1–3, Supplementary Data 1). The majority of these prospective causal variants are centred on the ANO7 gene, approximately 100 kb centromeric of FARP2. ANO7 is expressed predominantly in the prostate (http://www.proteinatlas.org/ENSG00000146205-ANO7/tissue), unlike FARP2, which is ubiquitously expressed across tissue types. Within the credible set 3 tags are selected with particularly high confidence (posterior probabilities 0.72–1); all 3 represent only themselves with no additional proxy variants to consider, and are therefore the most likely causal variants underlying the 3 signals detected. Two of these 3 candidate causal variants (rs77559646 and rs77482050) are non-synonymous SNPs in ANO7 that are uncommon among European ancestry populations, whilst the third (rs62187431) is intronic in ANO7. The 11 remaining variants in the credible set include one more missense SNP within ANO7 (rs76832527), 2 intronic variants in ANO7 (rs111770284 and rs56091437), a synonymous variant in ANO7 (rs2074840) and 7 variants that are all intronic within other genes (FARP2, PPP1R7, HDLBP and SEPT2). Our fine-mapping results therefore strongly implicate the ANO7 gene as a prospective biological effector modulating susceptibility for PrCa.
The region at Chr6q22 described by rs339331 in the original publication29 presents a good example of how variant annotations can assist further prioritisation of the most likely candidate variants even within regions where the credible set remains comparatively large after fine-mapping (Fig. 3b, Tables 1–3, Supplementary Data 1). rs339331 is intronic in RFX6, a member of the regulatory factor X transcription factor family. We observed a single signal during fine-mapping, but due to high LD between variants the credible set comprises 102 variants from 3 tags (the top tag with posterior probability 0.76 tagging 35 proxy SNPs, another with posterior probability 0.15 tagging 40 SNPs and the last with posterior probability 0.08 tagging 27 SNPs). Only 14 of these variants demonstrate any plausible biological evidence however, therefore the credible set can be filtered to prioritise this subset of variants. Four of these are proxies of the tag with the greatest statistical evidence, including the variant that demonstrates the greatest biological evidence for functionality; the original index SNP rs339331, which resides within a DNaseI peak, intersects binding sites for multiple transcription factors, including AR, FOXA1, GATA2, HOXB13 and NKX3.1, and is situated within a conserved element. rs339331 would therefore be ranked highest for follow-up based on combined statistical information and biological annotations, and has been demonstrated to alter HOXB13 transcription factor binding and RFX6 transcription during a previous functional investigation of this region30.
At the TMPRSS2 region on Chr21q22, we detected a single PrCa risk signal with a credible set of 31 SNPs from 8 tags, all of which are situated within the promoter region or first intron of TMPRSS2 (Fig. 3c, Tables 1–3, Supplementary Data 1). In all, 20 of these variants are eQTLs for TMPRSS2 in prostate tissue, whilst 2 variants intersect transcription factor-binding sites in multiple data sets, including for AR, ERG, FOXA1, GABPA, GATA2, HOXB13 and NKX3.1. In this region, the tag selected by JAM with the highest posterior probability is substantially downgraded after QR (ΔPosterior probabilityQR −0.18) due to lack of overlap with informative biological annotations, therefore it and its proxies may not in fact represent the most likely candidate causal variants. An early and common event in prostate tumour development involves a translocation that forms a TMPRSS2:ERG fusion, bringing the ERG transcription factor under transcriptional control of the more active TMPRSS2 promoter. Our fine-mapping results and biological annotations therefore allude to the possibility that subtle, heritable differences in TMPRSS2 expression could potentially operate in conjunction with a common somatic alteration to influence development of PrCa. Intriguingly, we also observed significant enrichment for variants intersecting ERG transcription factor-binding sites among our combined credible set of candidate variants across all regions using Fisher’s exact test (Supplementary Data 2, Fig. 2).
Comparison with African Ancestry meta-analysis results
Since LD patterns and allele frequencies of variants frequently differ among ancestral populations, as an additional prioritisation strategy we cross-checked meta-analysis results for variants in our 95% credible set against data from a meta-analysis of 10,202 cases and 10,810 controls with African Ancestry (AA)31. A total of 3633 of the 3700 SNPs in our credible set were available in the AA cohort, 1155 (31.8%) of which were nominally significant at P < 0.05 in the AA meta-analysis. In addition, of the 175 variants that reached genome-wide significance within the five regions in which JAM did not resolve candidate variants, 111 were nominally significant in the AA data. We would hypothesise that variants demonstrating no evidence of association in the AA data set would generally represent less likely candidate causal variants than any nominally significant variants within their region specific credible set and should be assigned lower priority when considering variants for functional confirmation studies. This extra prioritisation step does not enable us to formally exclude any variants from our credible set however, as the AA analysis may be underpowered to detect association with PrCa at specific SNPs, and additional variants within the regions fine-mapped in Europeans but not included in our credible set were not examined for association in AA data.
Estimating the GWAS loci contribution to FRR of PrCa
The proportion of FRR of PrCa explained by these risk loci before and after fine-mapping were calculated using conditional effect estimates and standard errors derived from the OncoArray sample sub-cohort. The post fine-mapping calculation was performed separately for the full set of 99 signals identified and a restricted subset of 84 variants (matching the number of original associations), in order to investigate the relative importance between replacement of GWAS tag SNPs and addition of extra novel signals. Single lead variants representing the independent signals were selected for this calculation. In regions containing a single signal, the JAM tag in the credible set with the highest Bayes factor was designated as the new lead variant, or for the five regions in which JAM did not resolve candidates the most strongly associated SNP in the meta-GWAS was taken instead. Within regions containing multiple independent hits, signals were represented by the combination of tags given the greatest posterior support by JAM. Our FRR calculations use conditional risk estimates incorporating uncertainty for each variant, plus a correction for potential bias due to risk estimation in the same sample as discovery and uncertainty in the specification of the FRR. This novel but more conservative method of risk calculation estimated that: (1) inclusion of only single ‘best’ replacement variants for each tag SNP contributes 26.5% (95% credible interval, CI, 22.7–31.5) of the known FRR of PrCa compared to 23.2% (95% CI 19.4–27.9) for the 84 previously known GWAS tag SNPs; and (2) inclusion of lead SNPs representing all of the 99 independent signals contributes 30.3% (95% CI 26.0–35.9) (Supplementary Data 4). This substantial enhancement demonstrates that the variant catalogue identified through fine-mapping explains a greater proportion of the FRR of PrCa compared to the original GWAS index SNPs, with replacement of the 84 original GWAS tag SNPs conferring a similar magnitude of increase as addition of the 15 novel independent signals we identified. We additionally calculated the contribution to FRR of PrCa for each region individually, to highlight regions that make the greatest contributions towards PrCa susceptibility (Tables 1–3). Whilst the majority of the fine-mapped GWAS loci individually contribute a small proportion towards the FRR, six regions confer in excess of 1% each. These include the moderate penetrance HOXB13 rs138213197 variant, which demonstrated the greatest contribution at 6.87%, and the multi-signal TERT locus, which explained the next highest level at 2.57%. Each of the remaining regions of higher FRR contribution contained multiple independent signals, with the exception of the single-signal MSMB locus. The magnitude of increase in proportion of FRR explained by each locus after fine-mapping was also generally greater for regions where additional independent signals were identified; for example, the ANO7 region increased 6.5 fold (from 0.1% for the original GWAS tag SNP to 0.65% after fine-mapping) and the KLK region 1.9 fold (from 0.45 to 0.86%), partly due the identification of 2 novel signals within each.
Prior to the recent OncoArray study, approximately 100 PrCa susceptibility loci identified through GWAS had been reported. Limited information was however known about the precise identity of the causal variants and functional mechanisms behind these loci despite several having been fine-mapped individually or collectively using logistic regression28,32,33,34,35. Here we present the largest genetic fine-mapping study for PrCa to date based on a meta-analysis of 82,591 cases and 61,213 controls of European ancestry, and employ a state-of-the-art multivariate Bayesian variable selection technique to prioritise candidate variants. We further refined results by incorporating functional annotation information using a novel QR approach, to assist prioritisation of candidate causal variants for downstream functional validation.
Since the meta-analysis comprised marginal summary effect estimates, we applied JAM, a joint Bayesian fine-mapping algorithm that accounts for LD in a multivariate analysis of univariate summary statistics, to identify credible candidate PrCa susceptibility variants. A stochastic variable selection approach provided posterior probabilities of association for each variant and combinations of variants within each region, as determined by a set of best models. This framework is preferred over alternative approaches, such as forward stepwise selection, which tend to underrepresent the uncertainty in the analysis and yield false levels of confidence for the final set of SNPs and number of signals represented by the single ‘best’ model. JAM also has advantages over similar Bayesian variable selection algorithms as it incorporates an extremely computationally efficient formal reversible jump Markov Chain Monte Carlo (MCMC) stochastic model search, which allows application to very large regions and does not require a prior assumption on the maximum number of causal SNPs within each region, making it more applicable to regions with larger or unknown numbers of causal variants. Linear model-based summary data methods such as JAM represent the current state of the art and have demonstrated good performance when applied to transformed logistic ORs from binary traits as opposed to linear effects for continuous traits36,37. The effectiveness of logistic/linear mapping will however vary between different genomic architectures and is dependent on factors including the number of variants and correlation structure between them within each region. In general however, the approximation should work well provided no individual variants exert large effects, as expected for GWAS loci. For 5 of the 80 regions that had replicated at genome-wide significance, JAM was unable to fit a model to the summary data and consequently we could not resolve candidate variants beyond the catalogue of genome-wide significant variants within these regions. Four of these regions were not densely genotyped on the OncoArray genotyping chip, as their discovery in a multi-ethnic meta-analysis occurred only late during chip design. In addition, the top hit within these 5 regions ranked towards the weaker end of the P-value and effect size distributions in the univariate meta-analysis prior to fine-mapping. The inability of JAM to resolve candidate causal variants within these regions therefore most likely results from mismatch between the reference correlation structure and meta-GWAS effect patterns, issues with the logistic/linear mapping in the presence of complex correlation structure, or possibly simply low signal to noise ratio within the data.
Use of multivariate models prioritised a 95% credible set of 3700 candidate variants from the 203,211 variants analysed within the 75 regions in which candidate variants were resolved; thereby markedly reducing the number of variants for further consideration. In addition, previous reports of multiple independent signals at several PrCa risk loci were confirmed, with evidence for multiple signals at 12 regions; of which 7 regions contained 2 signals, 4 demonstrated evidence for 3 signals and 5 signals were observed at the Chr5p15 TERT gene locus, which is known to contain susceptibility variants for many cancer types38. We observed no consistent pattern of LD relationship between the original GWAS tag SNPs and the independent signals identified through fine-mapping in the regions containing multiple independent signals (Supplementary Fig. 6). For example, at the ANO7 locus, the original index SNP (rs3771570) is not selected in the credible set and correlated with only 1 of the 3 independent signals detected (rs62187431, r2 = 0.61). In contrast, at the TERT region, the original index SNP (rs2242652) is in moderate or modest LD (r2 0.08–0.43) with 4 of the variants selected by JAM as representative of the 5 independent signals. Previous smaller fine-mapping studies using stepwise selection approaches had also identified evidence for independent association signals within several regions. However, these are potentially more sensitive towards subjective measures such as the P-value threshold chosen for secondary signal inclusion and LD level used to define the final list of candidate variants represented by the selected marker(s). Due to our substantially larger sample size and variant density available and the well-established superiority of Bayesian search procedures over stepwise selection in high-dimensional settings, we therefore consider this the most detailed fine-mapping study to date for variant prioritisation. Comparing our results to the previous iCOGS fine-mapping study28, in which refinement of 64 GWAS loci was attempted in a smaller European ancestry cohort of 25,723 PrCa cases and 26,274 controls, 48 regions corresponding to 52 original index SNPs replicated at genome-wide significance in both studies, of which only 21 regions had been densely genotyped on the iCOGS chip (Supplementary Data 5). Within these comparable regions, 70% of the ‘best candidate SNPs’ established using the iCOGS sample set were also included in the credible set we have identified in this study. This indicates broad stability of the results from fine-mapping studies conducted in the same ancestral population. The additional power and more dense genotyping across all regions in this study has however facilitated further refinement of potential candidate variants, identification of additional candidate variants within several regions and refinement for the first time of a number of regions in which fine-mapping had not previously been performed or had been unsuccessful. We have confirmed the existence of multiple independent risk signals at 10 loci previously reported, including identifying extra signals at the TERT (Chr5p15), ANO7 (Chr2q37) and SLC22A3 (Chr6q25) loci, and identified multiple independent association signals for the first time at two further loci, including KLK3 (Chr19q13). Eight regions demonstrating evidence for multiple independent signals in the iCOGS fine-mapping study were however not corroborated in this larger study. Notably, the conditional P-values for these secondary signals in the iCOGS fine-mapping study were below genome-wide significance in all but one of these regions. This may suggest that contrary to general assumptions that a lower burden of evidence is valid for uncorrelated variants in loci for which a genome-wide significant association has previously been observed, instead equally stringent significance thresholds should be applied for both secondary signals and initial primary signals. It is also notable that in this well-powered study, the vast majority of regions containing multiple independent signals were first reported as associated with PrCa in early GWAS using relatively modest sample sizes. This may indicate that regions with lower effect sizes and weaker evidence for association, which require larger sample sizes for their detection, are less likely to contain additional independent risk variants. Alternatively however, it could reflect lower power for the detection of additional independently associated variants within the regions that contain weaker signals, despite the large sample cohort utilised in this study.
As would be expected, refinement of putative causal risk variants varied between regions, with credible sets ranging from a single variant or handful of variants to >100 variants for a small number of regions. The regions retaining large credible set sizes appear to result primarily from large numbers of variants in high LD with the actual causal variant as opposed to low power within the region however, rendering further refinement of these signals to facilitate functional validation studies more complicated. One approach to further prioritise candidate variants could be to leverage the different LD patterns among different ancestral populations, provided that the underlying casual variants are shared and present at sufficient frequency between populations. Cross-referencing the 3700 variants within our 95% credible set with data for an African American PrCa meta-analysis from the African Ancestry Prostate Cancer GWAS Consortium highlighted a subset of 1155 variants with nominal or genome-wide significant evidence for association in this additional population. An alternative prioritisation approach is to consider pre-existing biological information, as we have described for the RFX6 (Chr6q22) region. We annotated variants against a number of publically available data sets, observing enrichment of several plausible markers of biological function active in prostate cell lines within our credible set, including intersection with prospective promoter and enhancer elements, DNaseI hypersensitivity sites, histone modification or transcription factor-binding peaks, and variants residing within protein-coding transcripts and conserved regions of the genome. Of particular interest, more than a quarter of the variants within our credible set were also eQTLs within the TCGA prostate adenocarcinoma data set. Given their statistical selection independent of this annotation and demonstrated effects upon gene expression, these eQTL variants should be considered high priority when selecting candidate causal variants for functional confirmation, alongside variants that modify the coding sequence of genes, or appear to reside within reliably annotated promoters or enhancers. Another important discovery of this study is that an appreciable number of highly ranked variants within the credible set are non-synonymous SNPs. This provides evidence that subtle alterations to structure and activity of specific proteins may give rise to the functional mechanisms behind a proportion of GWAS associations.
Some alternative fine-mapping algorithms integrate functional annotations during the statistical analysis when considering evidence for causality for each variant39,40,41,42. These methods can prove useful for enhancing variant prioritisation, provided that the annotation information is reliably indicative of causal variants. We preferred to perform statistical analysis separately from annotation and compare statistical and functional evidence for causality afterwards using conditional QR. We believe this more clearly allows the most informative annotations, and the variants that are characterised by those annotations to be highlighted within the data set, whilst also reducing the potential for penalisation of strong candidate variants due to localised artefacts or cell line-specific effects within the whole-genome biological data sets used for annotation. Our conditional QR analysis resulted in adjustment of posterior probability for a small proportion of variants and may further assist prioritisation of the most likely functional variants among the credible set selected for each region.
Fine-mapping studies are important to reveal information on the biological mechanisms underlying disease predisposition by pinpointing potential candidate genes, signalling pathways and networks that account for differences in disease risk between individuals. In addition, these studies may help to refine the contribution of GWAS loci to PrCa risk by incorporating more likely candidate variants. This study evaluated almost all previously reported PrCa GWAS regions, apart from the highly complex Chr8q24 and major histocompatibility complex (MHC) regions and associations that did not replicate in the largest European meta-analysis to date. We then subsequently re-evaluated the contribution to FRR of these known PrCa risk loci using an enhanced method in which the overall FRR of PrCa was revised upwards from 2.0 to 2.5 to reflect the most recent estimates and we also accounted for uncertainty of various estimates that can introduce bias in these calculations. Our approach therefore provides more conservative estimates than in previous publications. We demonstrated a substantial increase in the proportion of FRR explained through fine-mapping these GWAS regions (from 23.2 to 30.3%), with detailed investigation showing that a similar proportion of this enhancement was conferred by replacement of the original tag SNPs and discovery of secondary signals. It is also noteworthy that the 7.1% magnitude of increase in FRR explained after fine-mapping known loci is substantially greater than the 4.4% increase achieved through identification of 62 novel PrCa loci8. This highlights the invaluable importance of fine-mapping studies for risk prediction and their potential utility in helping to inform clinical screening studies.
Fine-mapping of GWAS loci requires comprehensive examination of variation within the region. Logistical constraints generally preclude resequencing of disease-associated loci to achieve complete variant coverage in large sample cohorts and instead mandate the use of genotype array data followed by imputation, in order to achieve sufficient sample sizes. To ensure the accuracy of downstream fine-mapping analyses, stringent variant QC must be applied to imputed data, to exclude low-quality variants that may be indicative of imputation artefacts. In this study, initial pre-imputation QC of the meta-analysis data set was first performed to exclude potential genotyping errors, followed by post-imputation QC in which variants with low MAF or imputation information score, or divergent MAF consistency between dosage and ‘best guess’-derived MAF estimates were excluded. The MAF estimate consistency check was performed to highlight additional variants for which reliability of imputation may be reduced and evaluation of variants excluded in this step revealed that the majority were situated within segments of the genome flagged as repetitive or otherwise ambiguous. Whilst we cannot guarantee that no causal variants at GWAS loci would be located within repetitive elements, we believe that the high proportion of variants filtered during QC that are located within potentially difficult to impute segments indicates an appropriate balance between controlling against both type I and II errors during the subsequent fine-mapping analyses. The inability to directly interrogate this category of variants during this study could however reflect a potential limitation.
The multivariate fine-mapping strategy we employed enabled identification of small numbers of prospective causal variants amenable to functional follow-up at many known PrCa susceptibility regions. Within this credible set of variants, we found evidence of enrichment for a number of biologically plausible mechanisms through which PrCa risk could potentially be modulated. We observed multiple independent PrCa associations at 15% of the loci fine-mapped, and several candidate genes were indicated for consideration through functional annotation. As rare variants with MAF < 0.005 were not included in our analyses, we cannot exclude a contribution of rare casual variants exerting a greater effect size giving rise to synthetic associations at any GWAS loci, although our findings indicate that these are unlikely to be widespread. Importantly, replacement of the original GWAS tag SNPs with more likely candidate variants and identification of additional independent signals resulted in a substantial increase in the proportion of the FRR of PrCa explained by these loci. This finding accounts for a portion of the ‘missing heritability’ of PrCa and has important implications for clinical risk profiling and management of patients.
Identification of PrCa risk loci to fine-map
We identified 101 independent PrCa GWAS risk associations within the literature that had been reported at genome-wide significance prior to the start of this study, the majority of which had previously been replicated within a European ancestry population3,12. Six of these lead variants were located within the Chr8q24 region that is associated with multiple cancer types in a highly complex manner, and three within the MHC Chr6p21 region. Due to the large numbers of variants, high levels of correlation and greater complexity within these regions, they are the subject of separate fine-mapping and risk stratification studies and were excluded from consideration in this analysis; the remaining 92 previously reported GWAS SNPs were selected for fine-mapping in this study. For 5 of these originally reported GWAS SNPs, no variant within ±500 kb replicated at genome-wide significance in our larger European meta-analysis and these loci were subsequently excluded from downstream Bayesian analyses and FRR calculations. An additional 2 GWAS SNPs originally reported in non-European ancestral populations and 1 reported in a previous meta-analysis did not replicate, but were situated <500 kb from an independent, replicated European risk association and were therefore still considered within the region boundaries of signals that were fine-mapped.
Selection of SNPs for fine-mapping on the OncoArray
A total of 78 PrCa risk associations that had been reported prior to the design of the OncoArray genotyping platform43 were densely genotyped within the OncoArray sample cohort. Region boundaries for dense genotyping were defined as the greater of ±500 kb from the index SNPs or the maximum distance of any variant with r2 > 0.3 to the index SNP in 1KG (phase 1 version 3, March 2012 release). All SNPs within these regions with MAF > 0.01 in any ancestral population were extracted and then we obtained Illumina Design Scores for all variants from the 1000 Genomes Project (phase I version 3, March 2012 release). From designable variants with a Design Score ≥ 0.8, we used Snagger44 to select (a) all variants correlated with the known hits at r2 > 0.6 and P < 0.05 in the iCOGS study, (b) all variants from lists of potentially functional variants, defined through ENCODE and RegulomeDB and (c) a set of SNPs to tag all remaining variants at r2 > 0.9. The 23 risk loci reported in a recent multi-ethnic meta-analysis study12 were not densely genotyped as these loci were reported after the OncoArray design; however, these regions were also fine-mapped in this study.
Meta-analysis and imputation
Genotype data for a combined 82,591 PrCa cases and 61,213 controls of European ancestry from eight GWAS (OncoArray, iCOGS, UK stage 1 and 2, CaPS 1 and 2, BPC3 and NCI PEGASUS) were used for the meta-analysis8. Per-allele ORs and standard errors were generated for the OncoArray and each GWAS, adjusting for principal components (PCs) and study relevant covariates using logistic regression. The OncoArray and iCOGS analyses were additionally stratified by country and study, respectively. We used the first seven PCs for OncoArray and first eight PCs for iCOGS samples, as additional components did not further reduce inflation in the test statistics. OR estimates were derived using either SNPTEST (https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html) or an in-house software C++ programme. OR estimates and standard errors were combined by a fixed effects inverse variance meta-analysis using METAL45. All statistical tests conducted were two-sided.
IMPUTE2 was used to impute non-genotyped SNPs within a boundary flank of ±500 kb or the maximum distance of any variant with r2 > 0.3 to the index SNP in 1KG phase 1 from the originally reported GWAS index SNP in the meta-analysis cohort. For the OncoArray data, un-phased imputation was carried out for all the fine-mapping regions. Where the boundaries of adjacent associations to fine-map overlapped, these were merged for imputation; therefore, imputation was performed as 82 discrete chunks. Within 3 of these chunks the separate signals to analyse were sufficiently dispersed to enable clear demarcation of the individual signals and retention of an appropriate flank distance; these 3 imputation chunks were therefore split prior to statistical analysis and the 92 original index SNPs analysed were fine-mapped as 85 separate regions.
We conducted a two-stage post-imputation QC process. During basic QC, imputed genotype data were filtered to retain variants with INFO ≥ 0.4 and MAF ≥ 0.005. We subsequently instituted an additional QC measure to remove imputed variants with greater genotype uncertainty in which separate MAFs were calculated based on ‘dosage’ and ‘best guess’ genotypes. Large deviations between these MAF estimates for a variant would indicate unreliable imputation performance; variants for which these differed by ≥10% were excluded from analysis. An additional benefit of this methodology is that inherently applies progressively greater stringency of QC filtering the rarer a variant is within the study population. During the post-imputation QC process, 288,033 rare variants were excluded, whilst a further 146,088 variants were removed due to low INFO score or divergent MAF consistency. This resulted in a final post-QC set for analysis of 213,728 SNPs within the 80 fine-map regions that had replicated in the initial meta-GWAS, with a minimum variant INFO score within the final data set of 0.63, and the vast majority of variants having INFO > 0.9 (Supplementary Fig. 7).
As an additional safeguard, we investigated the proportion of common variants (MAF ≥ 0.05) in 1000 Genomes European samples that were retained or excluded during our QC procedure. In total, 186,907 of 227,793 common 1000 Genomes European variants (82.1%) were included in our final post-QC data set for analysis. The vast majority of common variants excluded during QC, 37,830, were removed in the MAF consistency check step. In all, 27,070 (71.5%) of these were situated within segments of the genome flagged as repetitive or otherwise ambiguous (either masked as low complexity by RepeatMarker, or excluded by the 1000 Genomes phase 3 Strict Mask), whilst a further 4460 (11.8%) had intermediate INFO score values (0.4–0.8).
Multivariate fine-mapping towards putative causal variants
JAM16 is a novel Bayesian algorithm that searches multi-SNP models in summary data by imputing the correlation structure according to a reference panel. JAM provides inference on the number of independent signals, as well as the set of potential SNPs driving those signals. Under a standard multivariate linear regression, the vector of trait values y are regressed on a matrix of genotypes, X, under the following model
where σ2 represents the residual variance, β represents a vector of effects, which are all adjusted for one another, and I N is the N × N identity matrix. Multiplying the standard model above through by the transpose of the genotype matrix, JAM makes inference under the resulting multivariate normal (MVN) model:
The motivation for using the model in (2) rather than (1) is that individual-level data are no longer required; X′y can be derived from one-at-a-time univariate effect estimates of each variant46,47 and X′X from an estimate of the genetic correlation matrix. Note that in the case of the PrCa summary statistics, we derive X′y after first mapping the univariate log ORs to approximate linear effects via their z-scores, a strategy adopted for binary traits in other linear model-based summary statistic frameworks37,48. Consequently, the model residuals have the same interpretation as in a linear regression of a binary outcome; they cannot exceed 1 and, under the null model, their variance σ2 equals the trait variance, p(1 − p) where p is the proportion of cases. Since each region is unlikely to explain much heritability individually, we specify an inverse gamma (Γ−1) prior that loosely targets the PrCa variance in the meta-GWAS:
This corresponds to a prior expectation for σ2 equal to the PrCa variance in the meta-GWAS, 0.24, and 95% weight over the range (0.05, 0.69). The JAM model is completed by specifying a so-called ‘g-prior’ over the genetic effects, β:
The conjugate g-prior supports effects inversely proportional to the corresponding genetic co-variances and variances, as estimated from the reference matrix X′X, and has been shown to help when modelling highly correlated predictors49. There is a substantial literature on choices for the hyper-parameter, τ; we follow recommendations to set a value equal to the maximum of N and P2, where P is the number of variants in the region50,51.
Crucially, both (1) and (2) are parameterised by the same vector of multivariate (i.e., correlation adjusted) effect estimates, β; JAM is therefore able to approximate inference from a multivariate analysis of individual-level data. Optimal performance is achieved when the correlation structure X′X is taken from the original GWAS population, rather than an external reference population. We applied JAM to summary statistics from the meta-analysis data set using LD estimated according to imputed individual-level data from the OncoArray sub-cohort of 53,449 cases and 36,225 controls in which these regions had been densely genotyped.
Similar to other Bayesian stochastic variable selection approaches, JAM models a latent vector of binary indicators, γ, for whether each variant should be included (γ v = 1 if variant v is associated and included in the model, or 0 otherwise). Any specific configuration of indicators then specifies a specific model, M. Using a Bayesian stochastic search, specifically a Reversible Jump MCMC (RJMCMC) algorithm16,52, JAM searches over different possible models. By specifying a prior on the probability of including any combination of variants, we induce a prior over the ‘model space’, γ. More formally, JAM’s prior over γ induces sparsity and accounts for the multiple testing burden through use of a ‘beta-binomial’ prior on the number of associated variants or variants included in any given model, which consists of a Beta distribution over the proportion of associated variants in a particular region, conditional on which prior probabilities for each possible number of associated variants follow a binomial distribution. All configurations or combinations, including the same number of variants are given identical prior probabilities. For each region we used a beta-binomial (1, P r ) prior, where P r is the total number of variants in a region r. This places a constant prior probability for any effect in each region (i.e., one or more causal variants) of 0.5, which is split up over all possible models according to the beta-binomial distribution. Since these are previously discovered regions, this is far more generous than our prior belief would be that a random region of the genome is associated with PrCa but is more conservative for the regions in this analysis, where we estimate the false discovery rate is <10%. The marginal prior odds of any particular SNP being selected is 1/P r , and decreases with the total number of variants in the region, providing an intrinsic multiplicity correction as a function of region size53,54,55,56. The prior probabilities for ≥2, or ≥3 associated variants and so on are weakly effected by P r , however, for all regions in this analysis they are equal to the second decimal place at 0.25 and 0.12, respectively (Supplementary Table 2). More detail on the JAM model and RJMCMC algorithm can be found in the original paper16. For this analysis, each region was analysed independently, and by running two independent JAM seeds for 10 million iterations each. The JAM output provides posterior probabilities for each variant, Pr(γ v = 1|data), and for each combination of variants, Pr(M = 1|data). To determine statistical significance for individual variants, combinations of variants and for the possible number of independent signals we use Bayes factors57, the ratio of the posterior odds to the prior odds. Specifically, we used the inference of the minimum number of independent signals in the model at a regional Bayes factor threshold of 3 to define the evidence for multiple signals.
Before running JAM, Priority Pruner v0.1.3 (http://prioritypruner.sourceforge.net) was used to LD prune the imputed meta-analysis variant set at a threshold of r2 = 0.9 for the 80 regions replicated at genome-wide significance. Pruning was performed agnostic of additional prioritisation criteria (association or annotation data) to ensure unbiased Bayesian model selection. Additional pruning at lower LD levels was performed upon any regions in which the overall Bayes factor for association with PrCa fell below 1. A regional Bayes factor below 1 directly conflicts with our knowledge that these regions are robustly associated with PrCa, and was taken as an indication of collinearity; numerical instability that can occur when fitting multivariate models to highly correlated variables. Where required, the pruning threshold was lowered in r2 = 0.05 increments, to a cut-off level of r2 = 0.6. The pruned data set used in the final Bayesian analyses comprised a total of 38,745 selected tags.
The pruning thresholds used in the final results are listed in Tables 1–3. For each independent JAM analysis, the top models or combinations of included SNPs within each region as determined by the posterior probabilities of the models, Pr(M = 1|data) that summed to a cumulative posterior probability of 0.95, were used to define a run specific 95% credible set. To filter out any low confidence variants, final 95% credible sets for each region were defined according to the intersection between two independent runs of JAM, with any variants with variant-specific BF < 1 additionally removed from the amalgamated variant list due to having greater standalone evidence against association. Overall, 3761 of the 4142 unique SNPs selected by either JAM run were retained in the combined top models from both runs (90.8%), with a further 61 variants with BF < 1 removed to achieve the final 95% credible set.
Annotation of variants for functional features
Variants were annotated for a number of putative indicators of biological functionality or importance, using a range of publically available data sources. These annotations focussed on either the likely consequence or relevance of the variant resulting from its primary genomic context, or the proximity to annotated regulatory features within cell lines derived from normal prostate or PrCa tissues.
Gene-based annotation of variants was performed using wANNOVAR in relation to GENCODEv19 transcripts58. Variants residing within miRNA transcripts were subsequently added in relation to miRBase release 20 (ftp://mirbase.org/pub/mirbase/20/genomes/hsa.gff3)59. Annotation of variants that reside within genomic elements demonstrating evidence for evolutionary constraint was performed against conserved element peak outputs from comparative genomics analyses by four algorithms; GERP++ (http://mendel.stanford.edu/SidowLab/downloads/gerp/)20, SiPhy_Omega, SiPhy_Pi (https://www.broadinstitute.org/scientific-community/science/projects/mammals-models/29-mammals-project-supplementary-info)21 and PhastCons (ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons100way/)22. Variants were also scored for likelihood of prospective pathogenicity using CADDv1.3 (http://cadd.gs.washington.edu/score)60.
For annotation against prospective regulatory elements within the genome, which frequently operate in a tissue-specific context, these data sets were primarily retrieved from experiments using prostate-derived cell lines. We annotated variants that intersected DNaseI peaks data in seven individual ENCODE prostate data sets from three cell lines (LNCaP, PrEC and RWPE1; GSM816637, GSM816634, GSM1008595, GSM736565, GSM736603, GSM4742 and GSM4743)61,62. Peak data from ChIP-seq experiments for transcription factor-binding sites and histone modifications in the LNCaP, PC3, PrEC and VCaP cell lines and human prostate tumour tissue was downloaded from the Cistrome Data Browser (http://cistrome.org/db/); a resource that accumulates publically available ChIP-seq data sets and re-analyses their raw data through a standardised pipeline and QC procedure24. Downloaded CistromeDB data were converted from GRCh38 to GRCh37/hg19 reference assembly co-ordinates for compatibility with our variant data set using the UCSC Genome Browser LiftOver tool (https://genome.ucsc.edu/cgi-bin/hgLiftOver). Transcription factor-binding site data were obtained for the Androgen Receptor (GSM1236922, GSM1328945 and GSM1576447), CTCF (GSM1006874 and GSM1383877), ERG (GSM1193657 and GSM1328978), FOXA1 (GSM1068136, GSM1274873 and GSM1716762), GABPA (GSM1193660), GATA2 (GSM1600544), HOXB13 (GSM1716763 and GSM1716764) and NKX3.1 (GSM699633 and GSM989640). Histone modification data were obtained for H3K27Ac (GSM1249447 and GSM1249448), H3K27me3 (GSM1383866 and GSM1383872) and H3K4me3 (GSM1383874 and GSM945240). Finally, to facilitate deeper categorisation of the genomic context of variants within prospective regulatory features, they were annotated with their chromatin state categorisations by ChromHMM from two prostate cell lines (PrEC and PC3; GSE57498), alongside three ENCODE tier 1&2 cell lines (GM12878, H1HESC and HUVEC) to enable comparison of tissue specificity for prospective regulatory elements23,63,64.
Genotype and gene expression data for 494 samples with PrCa were downloaded from TCGA (https://gdc-portal.nci.nih.gov). For the genotype data set, QC was performed according to the protocol suggested by Anderson et al.65, removing samples with heterozygosity >2 standard deviations from the mean, individuals with low genotype call rate (<95%), non-male samples and related or duplicated samples (individuals with identity-by-descent >0.185). Variants with call rate <95% were also excluded from analysis. PC analysis was performed to induce the ancestry of the TCGA samples, using the 494 TCGA samples plus 2504 samples from the 1000 Genomes Project phase 3, with non-European or Finnish samples removed from the analysis. In total, 108 samples and 106 SNPs were removed after performing QC on genotype data. For the expression data set, we observed that samples from two plates (A31K and A30D) exhibited values substantially higher than samples on the remainder of plates, therefore samples on these plates were also excluded (27 additional samples). Out of the 494 samples, 359 therefore passed QC. Genotypes for samples passing QC were subsequently imputed to the 1000 Genomes Project phase 3 reference panel within the region boundaries applied to the fine-mapping data set using IMPUTE2. In all, 227,773 variants within the fine-mapping data set passed QC thresholds in the TCGA imputed data and therefore were available for eQTL analysis. Genes with mean expression across samples of ≤6 counts or with expression variance = 0 were also excluded (4123 and 370 genes removed, respectively). Finally, expression values were quantile-normalised by samples and rank-transformed by genes. In total, 16,038 genes passed QC out of the initial 20,531.
For the eQTL analysis, 35 PEER factors66 for the top 10,000 expressed genes were used as covariates, plus 3 genotyping PCs. eQTL analysis was performed for each region individually using FastQTL67 with 1000 permutations and a window of 1 megabase from the transcription start site of each gene. Colocalisation tests between the eQTLs and GWAS SNPs were then performed following the approach suggested by Nica et al.68. First, for each significant eQTL, we added the imputed SNP to the linear regression to assess if the inclusion better explains the change in expression of the gene.
We retrieved the P-value of this new linear regression, assigning P-value of 1 if the eQTL and imputed SNP are the same variant. Second, we ranked the P-values in descending order for each eQTL. Finally, we calculated the colocalisation score for each pair of eQTL and imputed SNPs as:
where N is the total number of imputed SNPs in that region and rank is the rank of the imputed SNP we are including. In general, if an eQTL and an imputed SNP represent the same signal, this will be reflected by the imputed SNP having a high P-value, a low rank and consequently a high colocalisation score.
Conditional QR across variant annotations was performed for the 75 regions successfully fine-mapped using JAM, with the 5 regions in which JAM was unable to resolve candidate variants excluded from this analysis. To minimise correlation between annotations, single data sets for each transcription factor, histone mark, DNaseI, conserved element and chromatin state by ChromHMM category were selected for investigation with conditional QR. Specifically, the GSM736603 DNaseI, GSM1328945 AR, GSM1383877 CTCF, GSM1193657 ERG, GSM1068136 FOXA1, GSM1193660 GABPA, GSM1600544 GATA2, GSM1716763 HOXB13, GSM989640 NKX3.1, GSM1249447 H3K27Ac, GSM1383872 H3K27me3, GSM945240 H3K4me3 and GERP++ conserved element annotation fields were selected, as these were observed to be most informative for variants within the 95% credible set, whilst the PrEC cell line ChromHMM annotation was selected over the PC3 data set due to its origin from normal prostate rather than cancerous tissue. Similarly, CADD RawScore was selected, with CADD PHRED score excluded prior to the analysis. Information on whether variants were an eQTL in the TCGA data set was included. Finally, new categories were computed to ascertain whether a variant was situated within a protein-coding transcript (intronic, exonic or untranslated region), within a non-coding transcript, and whether the variant altered protein structure (non-synonymous, non-sense, frameshift or non-frameshift insertion/deletion coding variants).
All annotations were converted to binary format for the QR analysis, with the exception of CADD RawScore, which was retained as a continuous variable. Separate variables were created for each possible ChromHMM state during conversion from categorical to binary format. QR analysis was performed upon the priority pruner tag variants that were analysed by JAM, using the statistical results from those analyses. Annotations for all proxy SNPs represented by the tag variant were therefore subsequently inherited by the priority pruner tag. For the binary annotation categories, this meant that if one or more proxies had received a given annotation then the tag would also receive that annotation, whilst for the continuous CADD RawScore, the tag inherited the highest value from all associated proxies.
For a specified quantile, τ, we first fit a conditional QR model to the estimated posterior probabilities from the JAM analysis for each variant. Second, we use the fitted model to calculate an expected posterior probability for each SNP given the annotation profile for that SNP. Since there is uncertainty in the choice of τ, we analyse the data across a range of τ = (99.2, 99.4, 99.6, 99.8 and 99.95%) and calculate a weighted average of these expected posterior probabilities to yield a final estimate. Specifically, the posterior probability from JAM, P, is modelled with a conditional QR with annotation, Z. The model is defined by an asymmetric laplace distribution (ALD):
Notice v is affected by both P and Zθ, and it suggests a weighted average of P and fitted regression quantiles can approximate v. The density function of ALD distribution is
For λ, the maximum is achieved at
We fix λ to be
where is coefficient estimates from classical conditional QR. With λ fixed, only the exponential parts of the Gaussian distribution and ALD involve v, to which we assign weight to P and . Specifically classical QR yields a prediction for each SNP, i as at ; the larger the penalty and , the less influence should have on P i . We normalise the weight for P i to be 1. For we assign weight
which is approximately the penalty at . Our approximate value for v is then
Proportion of familial risk explained
The contribution and comparison of the newly identified SNPs and the previously known variants to the familial risk, under a multiplicative model, was computed using the formula
Where λ0 is the observed FRR to first degree relatives of cases and λ m is the FRR due to locus m, calculated assuming a per-allele effect:
where p m is the frequency of the risk allele for locus m, q m = 1 − p m and r m is the estimated per-allele OR.
This calculation was performed using a Bayesian framework, which allows us to attenuate any ‘winners curse’ bias, and incorporates the uncertainty in estimating the variant-specific per-allele OR, r k , and the value of the observed familial risk, λ0. To correct for potential bias in effect estimation from using the same sample to determine the credible set of SNPs (the so-called ‘winner’s curse’), we implemented a hierarchical model similar in spirit to Zhong and Prentice69 by placing a normal prior distribution on effect estimates of the form . Here β m is the log OR from the conditional model within each region, and τ is a pre-specified variance of the effect distribution reflecting our prior beliefs. For all variants, we used a conservative value of τ = 0.05, reflecting a 95% prior probability density for a per-allele OR in the range of [0.91, 1.10]. For the FRR calculation, we specified a prior distribution as , which places a 95% prior density in the range [2.22, 2.78] on the FRR of PrCa. This calculation was performed using the JAGS software70.
To collapse our catalogue of credible variants identified through fine-mapping into a parsimonious set of SNPs matching the observed number of independent signals, we selected single representative lead variants to represent each signal. For the 63 regions in which JAM identified only a single signal, these were designated as the tag with the highest posterior evidence for association, whereas for the 5 regions in which JAM did not resolve candidate variants the variant most strongly associated with PrCa in the original meta-GWAS was designated as the novel lead variant. For the 12 regions containing multiple independent risk signals, to facilitate unbiased selection of variants representing different signals, JAM exhaustively fitted all possible multi-SNP models for the specified number of signals, and the combinations of SNPs with the highest posterior probability were selected to represent the independent signals. Separate models were run to derive the variant list for the full 99 signals identified and also a reduced set of 84 signals, matching the number of original index variants fine-mapped, to enable comparison between the contributions of replacement of the GWAS tag SNPs and addition of novel signals identified. To yield adjusted effect estimates for each lead variant in regions containing multiple signals, conditional effect estimates and standard errors for the selected ‘representative’ variants used for the FRR calculations were derived from the OncoArray sub-cohort of 53,449 cases and 36,225 controls, for which individual-level data were available.
The meta-analysis summary data used in this fine-mapping project are available from the PRACTICAL Consortium (http://practical.icr.ac.uk/blog/?page_id=8164) or GitHub (https://github.com/oncogenetics/LocusExplorer/tree/master/Data/ProstateData). Results from the fine-mapping analyses may be explored interactively through Locus Explorer71 (http://www.oncogenetics.icr.ac.uk/LocusExplorer/).
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We would particularly like to thank all the patients and control men who took part in all the studies involved in this work, as well as all the researchers, clinicians, technicians and administrative staff who have enabled this work to be carried out, and the collaborators in the PRACTICAL consortium. We wish to thank all GWAS study groups contributing to the meta-analysis data set from which these fine-mapping analyses were conducted: BPC3 (Breast and Prostate Cancer Cohort Consortium); CAPS (Cancer of the Prostate in Sweden); PEGASUS (Prostate Cancer Genome-wide Association Study of Uncommon Susceptibility Loci); The PRACTICAL (Prostate Cancer Association Group to Investigate Cancer-Associated Alterations in the Genome) Consortium; and The GAME-ON/ELLIPSE Consortium. Detailed acknowledgements and funding information for all GWAS study groups and from all the individual studies involved in the PRACTICAL Consortium are included in Supplementary Note 1.