Main

Epithelial ovarian cancer (OC) is the most common cause of gynaecological cancer death in the United Kingdom (Cancer Research UK, 2016). The high mortality associated with the disease is in part because it is often diagnosed at an advanced stage, and a better understanding of germline genetic predisposition to OC may eventually lead to precision screening and earlier diagnosis (Bowtell et al, 2015). Genome-wide association studies (GWAS) have so far identified 18 loci associated with susceptibility to all invasive OC or to its most common histological subtype, serous OC (SOC), which accounts for 70% of all cases (Song et al, 2009; Bolton et al, 2010; Goode et al, 2010; Bojesen et al, 2013; Couch et al, 2013; Permuth-Wey et al, 2013; Pharoah et al, 2013; Kuchenbaecker et al, 2015). Post-GWAS studies that integrate molecular phenotypes with GWAS findings are essential to elucidate the function of the known loci in SOC development and to unravel the potential role of loci that just fail to reach the threshold for genome-wide statistical significance (P<5 × 10−8; Freedman et al, 2011; Kar et al, 2015; Lawrenson et al, 2015).

The vast majority of single-nucleotide polymorphisms (SNPs) associated with cancer susceptibility lie in non-coding regions of the genome and so do not have any impact on protein structure and function. A growing body of evidence suggests that many inherited common risk variants instead fall into non-coding regulatory elements, such as enhancers or transcription factor (TF)-binding sites (Sur et al, 2013). Different alleles of these SNPs impact the biological activity of the regulatory elements and thus modify expression of a local (cis-acting) target gene or genes.

Expression of many TFs occurs in a tissue-specific manner, and binding sites and transcriptional target genes for such lineage-specific TF drivers of cancer can be enriched at risk loci, also in a tissue-specific manner. For example, breast cancer risk SNPs are enriched for binding sites of the TFs ESR1 and FOXA1 in breast cancer cells while prostate cancer risk variants are enriched for androgen receptor-binding sites in prostate cells (Cowper-Sal·lari et al, 2012; Lu et al, 2012; Jiang et al, 2013; Chen et al, 2015). However, for SOC, similar links between TFs and genetic risk have not been evaluated. This is partly because the TF-target gene networks active in SOC and SOC precursor cells are poorly characterised. Moreover, genome-wide TF-binding sites have not been profiled by chromatin immunoprecipitation combined with sequencing (ChIP-Seq) in SOC precursor and SOC tissues by initiatives such as the Encyclopedia of DNA Elements and the Nuclear Receptor Cistrome projects that enabled the corresponding studies for breast and prostate cancers (Tang et al, 2011; ENCODE Project Consortium, 2012).

In the absence of such data, we searched for an in silico resource that would allow an agnostic evaluation of association between putative target genes of many different TFs and susceptibility to SOC. The Molecular Signatures Database (MSigDB) is a compendium of annotated functional pathways that includes 615 TF-target gene sets (Subramanian et al, 2005). All genes in each set share the same upstream cis-regulatory motif that is a predicted binding site for a particular TF and they thus represent the inferred target genes of that TF. The motifs themselves are regulatory motifs of mammalian TFs derived from the TRANSFAC database (Matys et al, 2006). In this study, we undertook pathway analysis using gene set enrichment (Subramanian et al, 2005) to test for overrepresentation of signals associated with SOC risk in these 615 TF-target gene sets using the two largest SOC GWAS data sets currently available for discovery and for independent replication. We further confirmed our top replicated gene set – targets of the TF PAX8 – using an alternative pathway analysis approach and used in vitro transcriptomic modelling to demonstrate perturbation of this gene set in the cellular context of SOC.

Materials and methods

Discovery, replication, and combined study populations

The discovery pathway analysis was performed on a meta-analysis of a North American and UK phase 1 GWAS of 2196 SOC cases and 4396 controls. The replication pathway analysis used data from 7035 SOC cases and 21 693 controls that were independent of the discovery participants and obtained from 43 case-control studies genotyped under the Collaborative Oncological Gene-environment Study (COGS) project. The two GWAS and the COGS studies have been described previously (Song et al, 2009; Permuth-Wey et al, 2011; Pharoah et al, 2013). The combined pathway analysis was based on a total of 9627 SOC cases and 30 845 controls from a meta-analysis that included the North American and UK GWAS, the COGS, and additional cases and controls from the Ovarian Cancer Association Consortium (OCAC) as reported previously (Kuchenbaecker et al, 2015). All participants were of European ancestry, provided informed consent, and had been recruited under protocols approved by a local ethics committee.

Single-nucleotide polymorphism data

The discovery, replication, and combined pathway analyses used summary findings (P-values) for association between SNP germline genotype and SOC susceptibility in the respective study populations. The discovery stage included 2 508 744 SNPs that had either been genotyped or imputed with imputation accuracy, r2>0.3 and had a minor allele frequency (MAF)>1% in both the North American and the UK GWAS. Samples were genotyped on Illumina (San Diego, CA, USA) platforms (317K/550K/610K) and imputed into the HapMap II (release 22) Utah residents with Northern and Western European ancestry (CEU) reference panel. As with most gene-based common variant association tests (Petersen et al, 2013), the gene-ranking procedure described below (Saccone et al, 2007; Christoforou et al, 2012) had been developed for HapMap-imputed GWAS and this guided our choice of HapMap-imputed SNP data over the more heavily correlated 1000 Genomes-imputed SNP data, which were also available. The replication stage was based on summary findings from COGS for a subset of 2 421 023 SNPs out of the 2.5 million SNPs from the discovery stage that had either been genotyped on the Illumina iCOGS custom array or imputed into the 1000 Genomes (March 2012) European reference panel with r2>0.3 and had a MAF>1% in the COGS studies. The combined pathway scan was also based on data for the same subset of SNPs but from association analysis in the combined study population. Sample and genotyping quality control, imputation, association- and meta-analysis steps for generating these three data sets have been described previously (Song et al, 2009; Permuth-Wey et al, 2011; Pharoah et al, 2013; Kuchenbaecker et al, 2015).

Gene-set enrichment analysis

Pathway analysis was conducted using the Preranked tool in the GSEA software (version 2.2.1; (Subramanian et al, 2005)) with default settings, 1000 permutations (unless otherwise specified), and no restrictions imposed on the size of gene sets that could be included. GSEA requires a list of genes ranked by any metric and a collection of annotated biological pathways or gene sets.
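As an illustration only, the running-sum statistic that underlies a pre-ranked enrichment analysis can be sketched in Python. This is a minimal sketch and not the GSEA software's implementation: the function names are hypothetical, the weighting exponent of 1 is assumed to correspond to the software's default scoring scheme, and the null distribution is assumed to come from random gene sets of matching size.

```python
import random


def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Signed maximum deviation from zero of the weighted running sum."""
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_total = sum(abs(s) ** weight for s, h in zip(scores, hits) if h)
    if n_miss == 0 or hit_total == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        # Step up (weighted by the ranking metric) at a set member, down otherwise.
        running += abs(s) ** weight / hit_total if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def gene_set_permutation_p(ranked_genes, scores, gene_set, n_perm=1000, seed=0):
    """Fraction of random same-size gene sets scoring at least as extremely."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, scores, gene_set)
    size = sum(g in gene_set for g in ranked_genes)
    as_extreme = 0
    for _ in range(n_perm):
        null_set = set(rng.sample(ranked_genes, size))
        null_es = enrichment_score(ranked_genes, scores, null_set)
        if (observed >= 0 and null_es >= observed) or (observed < 0 and null_es <= observed):
            as_extreme += 1
    return as_extreme / n_perm
```

With 1000 permutations the smallest non-zero value this sketch can return is 0.001, so an observed score that no permuted set matches would be reported as P<0.001.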

All 615 TF-target gene sets (containing between 5 and 2657 genes; median=219 genes) annotated in the Molecular Signatures Database (MSigDB version 5.0-C3; www.broadinstitute.org/gsea/msigdb) were tested in the GSEA. Each of these gene sets represents a group of genes that share a single TF-binding site motif defined in the TRANSFAC database (version 7.4; www.gene-regulation.com; (Matys et al, 2006)). The gene sets are named after the corresponding TRANSFAC TF-binding site matrix identifier and additional details of their curation and nomenclature are available online (www.broadinstitute.org/cancer/software/gsea/wiki/index.php/MSigDB_collections#Transcription_factor_targets_.28TFT.29).

The ranked list of genes for GSEA was generated from the genome-wide SNP data by the following steps: (1) start and end positions for 23 161 genes that were unambiguously mapped were downloaded using R version 3.0.3 (Vienna, Austria; TxDb.Hsapiens.UCSC.hg19.knownGene: Annotation package for TxDb object(s). Version 2.8.0); (2) all SNPs that lay between these start and end positions were assigned to the corresponding genes; (3) the most significant P-value among all SNPs within the boundaries of each gene was adjusted for the number of SNPs in the gene by a modification of Sidak’s correction (Saccone et al, 2007; Christoforou et al, 2012) that has been shown to reduce the effect of gene size on the P-value and account for correlations due to linkage disequilibrium between SNPs (Segrè et al, 2010); (4) the genes were ranked in descending order of the negative logarithm (base 10) of the most significant P-value (after adjustment). These steps were applied to SNPs and their P-values from the discovery (19 540 genes containing ≥1 SNP), replication, and combined data (19 364 genes containing ≥1 SNP). Quantile–quantile plots of these gene-level P-values were plotted for each data set (Supplementary Figure S1).
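Steps (2) to (4) above can be sketched in Python as follows. This is a hedged illustration rather than the code used in the study: the function names and input layout are hypothetical, and the exponent (N+1)/2 is an assumption about the form of the modified Sidak correction, which the text cites but does not write out.

```python
import math
from collections import defaultdict


def adjusted_min_p(p_values):
    """Adjust a gene's smallest SNP P-value for the number of SNPs in the gene."""
    n = len(p_values)
    p_min = min(p_values)
    if p_min >= 1.0:
        return 1.0
    # ASSUMPTION: modified Sidak form 1 - (1 - p_min) ** ((n + 1) / 2).
    exponent = (n + 1) / 2
    # expm1/log1p keep precision when p_min is very small.
    return -math.expm1(exponent * math.log1p(-p_min))


def rank_genes(genes, snps):
    """genes: {name: (chrom, start, end)}; snps: iterable of (chrom, pos, p).

    Returns (gene, -log10 adjusted P) pairs, most significant gene first.
    Genes with no SNP inside their boundaries are left out of the ranking.
    """
    per_gene = defaultdict(list)
    for chrom, pos, p in snps:
        # Nested scan for clarity; genome-wide data would need an interval index.
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                per_gene[name].append(p)
    scored = {name: -math.log10(adjusted_min_p(ps)) for name, ps in per_gene.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

Under this assumed form, a gene with a single SNP keeps its P-value unchanged, while a gene with many SNPs is penalised less severely than under a full Sidak correction, reflecting correlation between neighbouring SNPs.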

The MSigDB gene set for putative targets of the TF PAX8, termed V$PAX8_B, contained 106 genes (Supplementary Table S1). These genes were grouped together because their promoter regions (±2 kb around transcription start site) contained at least one instance of the TRANSFAC motif NCNNTNNTGCRTGANNNN that matches annotation for a PAX8-binding site. Four of these genes were open reading frames and excluded from pathway analyses (Supplementary Table S1). We used Qiagen’s Ingenuity Target Explorer (targetexplorer.ingenuity.com) to identify 55 additional genes (Supplementary Table S2) that were known to interact with PAX8 according to the literature, though not necessarily by binding PAX8 as a TF (description and citation for each interaction in Supplementary Table S3). We refer to the 157 genes (102 from MSigDB and 55 from Ingenuity) collectively as the PAX8 pathway.

PAX8 pathway analysis by interval enrichment

The INRICH tool (Lee et al, 2012) was also used to test for enrichment of genes from the PAX8 pathway within genomic intervals associated with SOC susceptibility. The 45 intervals for INRICH were generated by taking all (n=47; Supplementary Table S4) linkage-disequilibrium independent SNPs with P<10−5 for association with SOC risk in the combined data set, adding 1 Mb on either side of each SNP to capture genes potentially cis-regulated by each SNP (van Heyningen and Bickmore, 2013), and merging overlapping intervals. INRICH was used to generate 5000 sets of 45 intervals of the same width and reasonably matched to these observed intervals in terms of gene and SNP density in each interval. The number of PAX8 pathway genes overlapping the observed and permuted intervals was compared, with multiple PAX8 pathway genes that overlapped a single interval counted separately.
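The interval construction described above (a 1 Mb flank on either side of each index SNP, followed by merging of overlapping intervals) can be sketched as below. This is a minimal sketch under stated assumptions: the function names are hypothetical, coordinates are treated as closed intervals, and the permutation step performed by INRICH itself is not reproduced.

```python
def build_intervals(index_snps, flank=1_000_000):
    """index_snps: iterable of (chrom, pos). Returns merged (chrom, start, end)."""
    raw = sorted((chrom, max(0, pos - flank), pos + flank) for chrom, pos in index_snps)
    merged = []
    for chrom, start, end in raw:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous interval on the same chromosome: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]


def genes_in_intervals(intervals, genes):
    """genes: {name: (chrom, start, end)}. Each overlapping gene is counted once."""
    return sorted(
        name
        for name, (chrom, start, end) in genes.items()
        if any(chrom == c and start <= e and end >= s for c, s, e in intervals)
    )
```

Applied to a list of index SNPs in which some lie within 2 Mb of one another, merging yields fewer intervals than SNPs, which is consistent with 47 SNPs collapsing to 45 intervals in the text.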

Cell culture and cell lines

IGROV1 and HeyA8 cells were cultured in Dulbecco’s Modified Eagle’s medium (Caisson, Smithfield, UT, USA) containing 10% fetal bovine serum (FBS; Seradigm, Radnor, PA, USA) and Roswell Park Memorial Institute medium (Lonza, Basel, Switzerland) containing 10% FBS, respectively. IGROV1 cells were labelled with firefly luciferase and a neomycin resistance cassette by lentiviral transduction (supernatants from Children’s Hospital Los Angeles Vector Core) and selected for by supplementing the culture media with 300 μg ml−1 G418 (Sigma-Aldrich, St Louis, MO, USA). PAX8 was silenced using individual short hairpin RNAs (shRNAs) expressed from the pLKO.1 vector (Sigma-Aldrich) and delivered by lentiviral transduction. Negative control cells were infected with a non-targeted (scrambled) hairpin. Infected clones were selected using 200 (IGROV1) and 800 (HeyA8) ng/ml puromycin (Sigma-Aldrich) and PAX8 silencing confirmed using gene expression analysis performed using TaqMan probes (Life Technologies, Carlsbad, CA, USA; Supplementary Figure S2). Cell line authentication was performed on knockdown and control lines by profiling short tandem repeats using the Promega Powerplex 16HS Assay (performed at the University of Arizona Genetics Core facility). All cultures were confirmed to be free of Mycoplasma using a Mycoplasma-specific PCR.

Microarray profiling and data analysis

RNA was extracted from the knockdown models (n=2 per cell line), cells expressing a scrambled shRNA, and parental (untreated) cells in triplicate, at independent passages. We tested five PAX8 targeting shRNAs and measured PAX8 expression levels using targeted real-time quantitative PCR performed using TaqMan gene expression probes. We then performed whole transcriptomic profiling on the samples with the greatest knockdown.
Microarray profiling was performed using the Illumina HumanHT-12 v4 Expression BeadChips at the University of Southern California Epigenome Core and University of California at Los Angeles Neuroscience Genomics Core, using standard manufacturer protocols. GenePattern (version 3.9.5; Reich et al, 2006) was used to extract signal intensity data, for cubic spline normalisation of probe expression levels (Schmid et al, 2010), and for differential expression and microarray GSEA. For differential expression, P-values from two-tailed t-tests, false discovery rate (FDR) by the Benjamini-Hochberg method, and fold changes were calculated for the following comparisons: HeyA8 untreated plus scrambled controls vs HeyA8 treated with PAX8 shRNA-1 and shRNA-2 and luciferase-labelled IGROV1 plus scrambled controls vs IGROV1 treated with PAX8 shRNA-3 and shRNA-4 (all experiments in triplicate). For microarray-GSEA, phenotype labels for these comparisons were permuted 1000 times and the standard signal-to-noise ratio was used to rank genes. Of the 157 PAX8 pathway genes, 154 were profiled on the Illumina HT-12 (three genes not profiled: LUC7L3, PKM, TMA16). Two-sided exact binomial test P-values calculated using the binom.test function in R version 3.0.3 were used to evaluate the proportion of PAX8 pathway genes that were differentially expressed.

PAX8-binding sites from chromatin immunoprecipitation with sequencing data

While we were completing our study, genome-wide maps of PAX8-binding sites compiled from ChIP-Seq of three immortalised fallopian tube secretory epithelial cell (FTSEC) lines (FT33, FT194, and FT246) and three ovarian cancer cell lines (OVSAHO, Kuramochi, and JHOS4) were published (Elias et al, 2016).
We downloaded the ChIP-Seq peaks and their absolute summits called by Elias et al at FDR<0.01 from the Gene Expression Omnibus (accession number GSE79893) and defined PAX8-binding sites by extending each narrow peak to include the 500 base pair flanking sequence on either side (Heinz et al, 2010; Bailey and Machanick, 2012). We intersected SNPs at P<10−5 for association with SOC risk in the combined study population described above (n=930 genotyped or 1000 Genomes-imputed SNPs within 1 Mb of the eight unique index SNPs listed in Table 2 representing the eight intervals identified by INRICH) with these binding sites using BEDOPS version 2.4.20 (Neph et al, 2012).

Results

We began by testing the association between each of the 615 TF-target gene sets in MSigDB/TRANSFAC and SOC susceptibility by GSEA using P-values for 2.5 million SNPs from a meta-analysis of a North American and UK phase 1 GWAS of 2196 SOC cases and 4396 controls of European ancestry (Song et al, 2009; Permuth-Wey et al, 2011). SNPs were mapped to genes and the most significant SNP P-value for association with SOC risk in each gene was used to rank genes genome-wide for the GSEA. In this discovery pathway scan, 77 of the 615 TF-target gene sets were associated with SOC risk at PGSEA<0.05 (Supplementary Table S5) and putative target genes of the TF PAX8 (designated ‘V$PAX8_B’) emerged as the top ranked set (P<0.001; FDR=0.21; Table 1). Next, we sought to replicate these findings using genetic association results from independent samples not included in the discovery step. We performed a second GSEA for all 615 TF-target gene sets using association P-values for 2.4 million SNPs genome-wide in 7035 SOC cases and 21 693 controls from the Collaborative Oncological Gene-environment Study.
In this replication pathway scan, 54 of the 615 TF-target gene sets were associated with SOC risk at PGSEA<0.05 (Supplementary Table S6), including 22 gene sets identified at the same significance level in the discovery analysis. The gene set containing targets of PAX8 was ranked 7th out of 615 (P=0.004; FDR=0.37; Table 1) and none of the six gene sets with a higher rank than the PAX8 targets were among the top 10 TF-target gene sets identified by the discovery step (Table 1). In order to obtain a consensus ranking of all 615 TF-target gene sets, we conducted a third GSEA using P-values for association with SOC risk from a total of 9627 SOC cases and 30 845 controls that included all discovery and replication samples and additional cases and controls from the Ovarian Cancer Association Consortium. In this combined study population, the PAX8 target gene set was once again ranked top (P<0.001; FDR=0.21; top 10 gene sets in Table 1 and full results in Supplementary Table S7).

The MSigDB/TRANSFAC PAX8 target gene set contained 102 genes (after excluding four open reading frames; Supplementary Table S1), 92 of which were covered by at least one SNP that was assessed in the genetic association studies. We expanded this gene set into what we term a PAX8 pathway by adding all genes (n=55; Supplementary Table S2) known to interact with PAX8 according to the literature, though not necessarily by binding PAX8 as a TF (description and citation for each interaction in Supplementary Table S3). This expanded 157-gene PAX8 pathway (137 genes of which were overlapped by at least one SNP) was also strongly associated with SOC risk in the combined study population (PGSEA=10−4 after 10 000 permutations). Next, we confirmed the association between the PAX8 pathway and SOC susceptibility using an alternative pathway analysis method called interval-based enrichment (INRICH; (Lee et al, 2012)) and used it to pinpoint specific PAX8 target genes likely driving the pathway-level signal. We identified all uncorrelated SNPs (n=47; Supplementary Table S4) associated with SOC risk at P<10−5 in the combined study population, generated two-megabase-wide intervals centred on each SNP, and merged overlapping intervals to yield 45 intervals. Fifteen of the 157 genes from the PAX8 pathway were located in eight of these 45 intervals (PINRICH=0.006 compared to 5000 permuted sets of 45 two-megabase-wide intervals). The P<10−5 index SNP at the centre of five of these eight intervals was in fact genome-wide significant (P<5 × 10−8; Table 2). The SNP anchoring a sixth interval (rs2268177), though just short of genome-wide significance in the combined study population (P=9.5 × 10−7), has previously achieved this threshold in a meta-analysis of samples from OCAC with samples from the Consortium of Investigators of Modifiers of BRCA1/2 that were not included in this combined study population (Kuchenbaecker et al, 2015).
Although we used a megabase flanking each P<10−5 SNP to define these intervals (to capture long-range SNP-gene cis-regulatory effects (van Heyningen and Bickmore, 2013)), in five of the eight intervals the nearest PAX8 pathway gene was less than 100 kb from the central SNP and only for two intervals did this distance extend beyond 200 kb (Table 2).

Finally, we tested whether the PAX8 pathway that had thus far been defined by combining annotations from MSigDB/TRANSFAC and curation of the published literature (that included experiments conducted in non-ovarian cell types) was cell- and cancer-type relevant. PAX8 expression was stably silenced by shRNAs in the ovarian cancer cell lines HeyA8 and IGROV1 (Supplementary Figure S2) and gene expression microarray profiling was performed in knockdown and control cultures. The 157-gene PAX8 pathway, of which 154 genes were profiled on the microarray, was significantly associated with differential gene expression after PAX8 silencing in both cell line models (Pmicroarray-GSEA=0.03 for HeyA8; Pmicroarray-GSEA=0.004 for IGROV1; Supplementary Figure S3). In HeyA8 cells, 45 of these 154 genes from the PAX8 pathway were differentially expressed at P<0.05 (corresponding to a FDR<0.31; 14/154 at FDR<0.05; Pexact binomial<2.2 × 10−16 for 45/154 against 7.7/154 expected under the null hypothesis at the 5% α-level; Supplementary Table S8). In IGROV1 cells, 41 of the 154 PAX8 pathway genes were differentially expressed at P<0.05 (FDR<0.28; 17/154 at FDR<0.05; Pexact binomial<2.2 × 10−16; Supplementary Table S9). Overall, 69 of the 154 genes were differentially expressed at P<0.05 in at least one cell line, and 17 genes in both cell lines, after PAX8 silencing (Pexact binomial=0.002 for 17/154 against 7.7/154 expected). On cross-examining results from the differential expression and INRICH analyses, we observed that of the 15 PAX8 pathway genes in eight intervals associated with SOC risk at P<10−5 (Table 2), BNC2, TNF, and NCL were differentially expressed at P<0.05 in both cell lines, HOXB5, HOXB7, HOXB8, and SP6 in IGROV1 cells only, and TERT and OSBPL7 in HeyA8 cells only (Table 3).
Notably, BNC2 and HOXB7 were PAX8 target genes within 200 kb of a genome-wide significant index SNP at the 9p22.2 and 17q21.32 SOC risk loci, respectively, and differentially expressed in IGROV1 cells at FDR<0.05 (Table 3). While we were completing our study, genome-wide ChIP-Seq maps of PAX8 binding in three FTSEC and three additional ovarian cancer cell lines were published (Elias et al, 2016). We intersected these PAX8-binding sites with all SNPs at P<10−5 for association with SOC risk in the combined data set within the eight intervals identified by INRICH (+/− 1 Mb of the eight unique index SNPs listed in Table 2). SNPs at the 9p22.2, 17q21.32, and 19p13.11 genome-wide significant SOC risk loci overlapped PAX8 ChIP-binding sites in at least two of the three ovarian cancer cell lines with the most significant binding site SNP at each locus having P<7.1 × 10−8 for association with SOC risk (Supplementary Table S10).
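The overlap of risk SNPs with PAX8-binding sites described above can be illustrated with a short Python sketch. It is not the BEDOPS workflow used in the study: the function names, cell line labels, and SNP identifiers are hypothetical placeholders, and coordinates are simplified to closed intervals rather than the half-open convention of BED files.

```python
def extend_peaks(peaks, flank=500):
    """Widen each (chrom, start, end) peak by `flank` base pairs on both sides."""
    return [(chrom, max(0, start - flank), end + flank) for chrom, start, end in peaks]


def snps_in_sites(snps, sites):
    """snps: list of (snp_id, chrom, pos). Returns ids falling inside any site."""
    return [
        snp_id
        for snp_id, chrom, pos in snps
        if any(c == chrom and s <= pos <= e for c, s, e in sites)
    ]


def snps_supported(snps, peaks_by_line, min_lines=2, flank=500):
    """Ids of SNPs lying in an extended binding site in at least `min_lines` cell lines."""
    counts = {snp_id: 0 for snp_id, _, _ in snps}
    for peaks in peaks_by_line.values():
        for snp_id in snps_in_sites(snps, extend_peaks(peaks, flank)):
            counts[snp_id] += 1
    return sorted(snp_id for snp_id, n in counts.items() if n >= min_lines)
```

Requiring support in at least two cell lines, as in the text, guards against binding sites that are specific to a single line.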

Discussion

This is the first analysis to demonstrate that genes potentially targeted by the TF PAX8 are enriched for common genetic variation associated with SOC risk, suggesting that PAX8 may be a master transcriptional regulator of susceptibility to SOC. The emergence of the PAX8 pathway from our agnostic genome-wide approach evaluating overrepresentation of SOC risk SNPs in 615 gene sets each containing putative targets of a different TF followed by replication in independent samples and methodological replication is highly significant given recent calls for an improved understanding of the role of PAX8 in SOC biology (Bowtell et al, 2015). PAX8 encodes a member of the paired box family of TFs that contains a partial homeodomain (Chi and Epstein, 2002). It is essential for normal embryonic development of the Müllerian ducts (Mittag et al, 2007). A systematic, genome-wide RNA interference screen that included 25 ovarian cancer cell lines previously demonstrated that PAX8 is an ovarian cancer lineage-specific dependency, that is, it was the only gene that met three criteria: (a) essential for the survival and proliferation of ovarian cancer cell lines; (b) focally amplified in primary high-grade serous ovarian tumours; and (c) differentially overexpressed in ovarian cancer cell lines (Cheung et al, 2011). PAX8 has also been shown to be a key player in the proliferation, migration, and invasion of ovarian cancer cells and silencing this gene significantly inhibited anchorage-independent growth in vitro and tumour formation in a nude mouse xenograft model in vivo (Di Palma et al, 2014). Furthermore, PAX8 has been shown to drive murine SOCs originating in the fallopian tube (Perets et al, 2013). 
PAX8 is routinely used clinically as an epithelial marker to identify primary or metastatic tumours of Müllerian origin (Ozcan et al, 2011a, 2011b) because the TF is lineage-specific for FTSECs and ovarian surface epithelial cells (OSECs), both of which are contenders for being the cell of origin of SOC (Ozcan et al, 2011b; Adler et al, 2015).

The FDR for the PAX8-target gene set in the discovery, replication, and combined GSEA was 0.21, 0.37, and 0.21, respectively, suggesting that the result may be valid approximately four out of five times (FDR=0.21) or three out of five times (FDR=0.37). However, the convergence of orthogonal pieces of evidence strongly supports the observed association between this gene set and SOC risk: independent replication of the PAX8 GSEA findings across two of the largest genetic association studies and by INRICH; confirmation of the pathway-level association signal on including additional genes known to interact with PAX8 curated from published experiments; and significant perturbation of several of the putative PAX8 target genes near SOC risk loci in the transcriptomes of two ovarian cancer cell line models after abrogating PAX8 expression. This convergence of evidence from integration of GWAS and cellular models also provides new insight into the potential transcriptional regulatory network of PAX8 and highlights 15 genes that are likely to be driving the association between the PAX8 pathway and risk of developing SOC (Table 3). Among these, BNC2 and HOXB7 are particularly noteworthy. BNC2 lies at 9p22.2, the first SOC risk locus to be identified (Song et al, 2009), and several functional SNPs in this region reside in enhancer elements and in a scaffold/matrix attachment region sequence that targets BNC2 (Buckley et al, under review). We have previously implicated HOXB7 at the 17q21.32 SOC risk locus in network analysis combining SOC GWAS data with SOC gene expression profiles from The Cancer Genome Atlas (Kar et al, 2015). HOXB7 overexpression in OSECs is associated with increased proliferation via fibroblast growth factor signalling (Naora et al, 2001). Our identification of SOC risk SNPs in PAX8 ChIP-binding sites at the BNC2 and HOXB7 loci further supports involvement of these two genes in SOC susceptibility, potentially through a PAX8-regulated mechanism.
A critical next step after this pathway-based study will involve systematic genome-wide discovery of additional specific SOC risk alleles in PAX8-binding sites and/or variants that are involved in the regulation of genes that are also regulated by PAX8 (Freedman et al, 2011; Sur et al, 2013). This link between SOC risk alleles and the PAX8-target gene network may be established through dynamic expression quantitative trait locus analyses undertaken against a background of PAX8 knockdown in the relevant cell types (Califano et al, 2012; Fletcher et al, 2013).

Overall, consistent with recent pathway-level results in other cancers (Hung et al, 2015; Qian et al, 2015), the present study suggests that the genetic architecture of SOC susceptibility may be underpinned by a complex interplay between genes at the known genome-wide significant risk loci and at as yet unidentified loci that just fail to reach genome-wide significance but are functionally related to the known loci. We have demonstrated that genes interacting up- and downstream of PAX8 harbour SNPs strongly associated with SOC risk and a more comprehensive exploration of these targets may eventually open up opportunities for rational pathway-guided biomarker and therapeutic development to combat this lethal disease in its earliest stages.