Search for new loci and low-frequency variants influencing glioma risk by exome-array analysis

To identify protein-altering variants (PAVs) for glioma, we analysed Illumina HumanExome BeadChip exome-array data on 1882 glioma cases and 8079 controls from three independent European populations. In addition to single-variant tests we incorporated information on the predicted functional consequences of PAVs and analysed sets of genes with a higher likelihood of having a role in glioma on the basis of the profile of somatic mutations documented by large-scale sequencing initiatives. Globally there was a strong relationship between effect size and PAVs predicted to be damaging (P=2.29 × 10−49); however, these variants which are most likely to impact on risk, are rare (MAF<5%). Although no single variant showed an association which was statistically significant at the genome-wide threshold a number represented promising associations – BRCA2:c.9976A>T, p.(Lys3326Ter), which has been shown to influence breast and lung cancer risk (odds ratio (OR)=2.3, P=4.00 × 10−4 for glioblastoma (GBM)) and IDH2:c.782G>A, p.(Arg261His) (OR=3.21, P=7.67 × 10−3, for non-GBM). Additionally, gene burden tests revealed a statistically significant association for HARS2 and risk of GBM (P=2.20 × 10−6). Genome scans of low-frequency PAVs represent a complementary strategy to identify disease-causing variants compared with scans based on tagSNPs. Strategies to lessen the multiple testing burden by restricting analysis to PAVs with higher priors affords an opportunity to maximise study power.


INTRODUCTION
Gliomas account for~40% of all primary brain tumours and are diagnosed in around 26 000 individuals in Europe each year. 1,2 Gliomas are typically classified as being either glioblastoma (GBM) or non-GBM tumours (diffuse 'low-grade' glioma WHO grade I/II and anaplastic glioma WHO grade III tumours). 3 Most gliomas carry a poor prognosis, with the most common type, GBM, typically having a median survival of 15 months. 2 The only environmental factor consistently shown to influence glioma risk is exposure to ionising radiation, 2 which accounts for only a very small number of cases. Evidence for genetic predisposition to glioma is provided by rare inherited cancer syndromes including Turcot's and Li-Fraumeni syndromes, and neurofibromatosis. 2,4 Collectively however they account for little of the 2-fold increased risk of glioma seen in relatives of patients. 5 Much of the variation in genetic risk of glioma appears to be polygenic. Support for this proposal has come from genome-wide association studies (GWAS) which have identified common singlenucleotide polymorphisms (SNPs) at six loci influencing risk -5p15. 33 (TERT), 7p11.2 (EGFR, two regions), 8q24.21 (CCDC26), 9p21.3 (CDKN2A/CDKN2B), 11q23.3 (PHLDB1) and 20q13.33 (RTEL1). [6][7][8] Despite the success of GWAS such studies are not optimally configured to identify low-frequency variants with stronger effects. Protein altering variants (PAVs), which alter the encoded amino acid sequence, are proportionally less prevalent than synonymous variants; however, such variants are a priori more likely to have a functional impact. Coupled with the observation that Mendelian disease susceptibility is generally caused by coding sequence changes 9 suggests that association studies formulated around a gene-centric approach may be a powerful strategy for identifying disease-causing associations.
Although no rare recurrent PAV has thus far been shown to influence glioma risk the low-frequency variants NM_007194. The advent of next generation sequencing is allowing the cataloguing of recurrent coding variation, making the search for disease-causing PAVs on a genome-wide basis a viable proposition. Here we have investigated the contribution of recurrent coding variants to glioma by analysing 1882 cases and 8079 controls genotyped using the Illumina HumanExome BeadChip. To increase our power to identify disease-causing variants, we jointly tested groups of variants in a gene and incorporated information on the predicted functional consequences of PAVs. In addition we restricted our analysis to sets of genes with a higher likelihood of having a role in glioma on the basis of somatic mutation profile.

MATERIALS AND METHODS Subjects
We analysed three non-overlapping case-control series of Northern European ancestry: the UK series comprised 605 glioma cases (63% male; mean age at diagnosis 46 years) ascertained through the INTERPHONE Study 14 6 with 2400 healthy individuals from the Heinz-Nixdorf Recall study serving as controls. 17 The study was conducted with ethical review board approval. Written informed consent was obtained from all subjects. DNA was extracted from EDTA-venous bloods using conventional methodologies and quantified using PicoGreen (Invitrogen Corp., Carlsbad, CA, USA).

Genotyping and quality control
Genotyping was conducted using Illumina HumanExome-12v1_A Beadchips in accordance with the manufacturer's recommendations (Illumina). Calling of genotypes was performed using Illumina GenomeStudio version 2011.1 software. Cluster boundaries were determined by calling study samples simultaneously. Probes were excluded if monomorphic in all datasets, had a call rate o0.99 in cases/controls in a series, the difference in uncalled genotypes between cases and controls was statistically significant (Po0.05), if Hardy-Weinberg in controls Po0.001, or if non-autosomal (Supplementary Table 1). Samples were excluded if the call rate was o0.99, outlying heterozygosity (43 SD), or if a discrepancy was observed between manifest sex and X-chromosome genotype. To assess the fidelity of genotyping we examined the concordance in 493 individuals from the 1958BC, 15 which had also been sequenced 18 using TruSeq capture in conjunction with Illumina HiSeq2000 technology, and a GATK2 ref. 19 pipeline according to best practices. 20,21 Genotypes were compared at genomic positions for which allele codings could be unambiguously assigned, excluding 257A/T and C/G SNPs with MAF40.40.

Statistical and bioinformatic analysis
The main statistical and bioinformatics analyses were performed using PLINK v1.07 (ref. 22) (Cambridge, MA, USA) and R v3.0 software (Vienna, Austria). Using the EIGENSOFT v4.2 smartpca package 23,24 (Cambridge, MA, USA) we performed PCA to ensure comparability of case and controls. Individuals with non-Western European ancestry were identified and excluded by merging case and control data with 1000 Genomes project data. 100 000 ld-pruned post-QC probes were used to compute eigenvectors in each cohort. Samples exhibiting significant deviations (6 SD) from the main case/control cluster up to the first 10 eigenvectors were classified as outliers and flagged for exclusion. Outlying population structure on the pruned data set was examined using fastSTRUCTURE 25 if subsequent non-comparability was apparent between cases and controls. For first-degree relative pairs, the control from a casecontrol pair was removed; otherwise, the individual with the lower call rate was excluded. Associations were tested under an additive model. The adequacy of the case-control matching in each series and the possibility of differential genotyping of cases and controls was evaluated using quantile-quantile (Q-Q) plots of test statistics, restricting to variants with MAF40.005 to derive reasonable inflation estimates. Meta-analysis P-values and odds ratios (ORs) were calculated from per-study logistic regression beta values, under a fixedeffects model. We used Cochran's Q statistic to test for heterogeneity; restricting the reporting of novel associations to those with P het 40.05. We visually inspected genotype cluster plots for all reported variants. To explore variability in associations according to tumour histology, we derived ORs for all glioma, GBM and non-GBM. For the gene-based analysis, in addition to using the burden test which counts the number of minor alleles per gene per individual summed for all cases and controls, the sequence kernel association test (SKAT) was applied. 26 Burden and SKAT gene-based tests were based on all post-QC non-monomorphic probes mapping to RefSeq genes imposing default weights and MAFo0.05. Tests were implemented in plink-seq v0.09, and adjusted for study-specific effects by incorporating study as a covariate (using covar option). A single-variant association was declared significant if Po1.40 × 10 − 7 (Bonferroni correction for 118,815 PAVs, three tumour types). Gene-based association tests were considered significant if Po2.49 × 10 − 6 (10 045 genes, two tumour types). The power of our study to demonstrate an association for alleles with different MAFs was calculated assuming a multiplicative model. In all analyses a P-value of 0.05 was considered as representing statistical significance, after adjustment for multiple testing. Gene-set enrichment analysis (GSEA) of pre-ranked SKAT P-values, was performed on gene sets catalogued by the MSigDB v4.0 database (updated 31 May 2013) using GSEA software 27 adopting default settings. Linkage disequilibrium (LD) r 2 metrics were estimated from UK10K whole-genome data. To restrict our analysis to genes with a higher likelihood of having a role in glioma on the basis of somatic mutation profile in tumours, we used MutSigCV version 1.4 ref. 28 to identify genes harbouring more non-synonymous mutations than expected by chance given gene size, sequence context and mutation rate. Thresholding at false discovery rate Qo0.1 as advocated, 28 MutSig scores were obtained for GBM and non-GBM tumours by interrogation of TCGA (The Cancer Genome Atlas) provisional data sets using cBioPortal. 29 The Variant Effect Predictor (VEP; version 74) 30 was used to predict impact of variants on canonical Ensembl gene transcripts and functional consequences of missense variants according to SIFT, 31 PolyPhen-2 ref. 32 and CONDEL. 33 Computational modelling of the effect of amino acid changes on protein structure was carried out using the project HOPE server. 34 To assess sequence conservation we used GERP 35 and Phast_cons 36 metrics.

Quality control and array characteristics and performance
We submitted 2413 cases and 3099 controls for genotyping. Twelve cases and eight controls failed genotyping (call rateo0.95). Five hundred and nineteen cases and 807 controls were excluded for the following reasons: outlying heterozygosity in rare (47 cases, 28 controls) and common ( Figure 1B) λ was 1.058 ensuring subsequent analysis was less biased by any ancestral discordance between cases and controls (Supplementary Figure 2D). Post-QC data on 1882 cases and 8079 controls were available for analysis.
In the combined analysis of all PAVs the strongest association for risk of glioma was provided by rs593818 responsible for the XM_006722850.1(CYP4F12):c.1117A4G, p.(Ser373Gly) amino acid change (P = 1.24 × 10 − 5 ), albeit non-significant on a genome-wide basis (Supplementary Table 4). Similarly in the stratified analysis no single variant showed a globally significant association with either GBM or non-GBM tumours (Supplementary Table 4).   Figure 1 shows the relationship between effect size (measured by OR, taking the reciprocal or ORs o1.0) and MAF for 118,815 PAVs, those SNPs characterized by low MAF tending to have a higher probability of conferring more substantive risks.
To restrict our analytical space, we analysed the data set incorporating information on the predicted functional consequences of these PAVs. Of the 104 321 PAVs genotyped by the exome array for which CONDEL annotations could be obtained, the majority (64.1%) are predicted to be neutral (n = 66 841), and 35.9% deleterious (n = 37-480). Fifteen PAVs predicted to be deleterious showed an association with glioma risk at the Po10 -3 threshold (Table 1). To investigate whether PAVs predicted to be functionally deleterious were enriched for stronger effects on glioma risk, we compared the distribution of effect size (as measured by ORs) in the two CONDEL prediction categories ( Table 2). There was strong evidence of a relationship between increasing effect size and prediction of the PAV being deleterious. For PAVs with control MAF40.005 predicted to be deleterious there was an OR increase of 1.22 compared with neutral PAVs (95% confidence interval (CI): 1.19-1.26, P trend = 2.29 × 10 -49 , Table 2). Overall, PAVs classified as damaging by CONDEL were 1.43fold more likely to be associated with effect sizes ≥ 1.5 than PAVs classified as neutral (P = 4.59 × 10 -4 , OR = 1.43, 95% CI = 1.17-1.74).
A number of rare variants recognised to have pleiotropic effects on cancer risk are featured on the Illumina Exome Array (Table 4).  39 Given that such variants are a priori strong candidates for influencing the development of cancer, we examined the relationship between rs11571833, rs17879961 and rs28903098 and glioma (Table 5). For all glioma, BRCA2 p.(Lys3326Ter) carrier status conferred an OR of 1.76 (P = 0.0026), principally associated with GBM (OR = 2.3, P = 4.0x10 − 4 ). Although no association was shown for CHEK2 p.(Ile157Thr), BRIP1 p.(Pro47Ala) carrier status conferred an OR of 3.83 (P = 0.048) ( Table 4).
Although not attaining exome-wide statistical significance, further gene-based tests revealed a number of genes that were both significantly mutated in glioma tumours as well as possessing a germline variant burden (Table 5).
To gain further insight into the nature of the biological pathways impacting on glioma susceptibility, we performed GSEA using SKAT association P-values (Supplementary Table 5). This revealed a number of gene sets that were positively or negatively enriched for genes associated with glioma (ie, P GSEA o0.05). GBM glioma showed positive enrichment for genes involved in amino acid and nucleotide metabolism, and non-GBM glioma showed positive enrichment for genes involved in cell growth and development, however the majority of gene sets had an FDR Q40.25.

DISCUSSION
GWAS have become a powerful tool to identify susceptibility variants for cancer. However since the tagSNPs used in GWAS are generally not themselves candidates for causality, identification of the functional variant at a locus generally poses a significant challenge. An alternative approach is to target sequence variation, which a priori, is more likely to impact on disease status. Alleles that are functionally deleterious will tend to be selected against and thus underrepresented at high frequencies, an assertion supported by the observation of a relationship between putative functionality and MAF. Hence, it can be argued that at least some of the variants impacting on cancer risk including glioma will be rare. Although the association between the rare variant BRCA2:c.9976A4T, p.(Lys3326Ter) and glioma did not attain statistical significance such an assertion is supported by the established relationship between CHEK2:c.1100delC, p.(Thr367Metfs) and MUTYH:c.536A4G, p.(Tyr179Cys) and MUTYH:c.1187G4A, p.(Gly396Asp) variants which influence the risk of breast and CRC respectively. 12,13 To our knowledge we have conducted the largest study of the relationship between recurrent PAVs and glioma risk to date. Population stratification is a source of bias in association studies, and although adjustment of test statistics for principal components generated on common SNPs can be applied to genome scans, confounding of rare variants in spatially structured populations is not necessarily corrected by such methods. 40 Hence a major strength of our study is that it is based on three independent case-control series, thereby minimising biases as a consequence of spatial differences within one data set impacting on conclusions.
No single-variant associations with glioma attained statistical significance after correction for multiple testing. However, we did observe a significant association between variant effect size and predicted functional effect. In this study we have been limited to detecting alleles conferring ORs of 1.6 provided MAF 40.05 (80% power stipulating Po10 -7 ) or those with frequencies of~0.01 conferring ORs 42.5. Hence it is possible that PAVs do have an appreciable contribution to glioma risk but at lower individual effect sizes than previously anticipated, therefore requiring much larger case-control sample sets than we have used herein to identify them.
Testing for a burden of PAVs across genes revealed a significant association between HARS2 and GBM. HARS2 encodes a mitochondrial histidyl tRNA synthetase, mutation of which causes ovarian dysgenesis and sensorineural hearing loss. 41 Although not attaining an exome-wide significant burden of germline variants, additionally of note is CDH18 and GBM risk. CDH18 is also significantly mutated in GBM tumours and encodes a cadherin protein involved in cell-cell adhesion. The gene is expressed specifically in the nervous system and has been proposed to regulate neural morphogenesis. 42 By restricting our analysis to genes implicated in glioma by virtue of somatic mutation or variants recognised to increase risk of other cancers, we constrained the multiple testing problem and upweighted the prior probability for association with glioma. From these analyses we have provided evidence to implicate BRCA2 p.(Lys3326Ter) as well as IDH2 p.(Arg261His) as determinants of glioma risk. IDH2 encodes for the mitochondrial NAD(+)-dependent isocitrate dehydrogenase which is involved in the citric acid cycle. 43 While IDH2 p.(Arg261His) is not mutated in glioma, IDH1 or IDH2 are commonly mutated in glioma tumours and always involve the arginine residue. 44 IDH2 is in chromosome 15q26.1, the location of a previously reported glioma linkage peak. 45 Modelling of the IDH2 p.(Arg261His) change is shown in Supplementary Figure 4. This amino acid change is predicted to disrupt several salt bridge interactions, which may affect protein activity.
In our study, none of the PAVs genotyped in any of the previously identified glioma GWAS regions showed evidence of association with glioma (n = 240; P41.37 × 10 − 3 ). While accepting that we are constrained by the content of PAVs on the array, this argues against a rare coding variant that is tagged by a SNP contributing significantly to any of the GWAS signals identified.
While aiming to provide a comprehensive survey of recurrent PAVs it is apparent from our analysis that there are a number of issues that will impact on the utility of the Illumina Exome Array. Firstly, a high proportion of the featured SNPs are either monomorphic in Europeans or have a MAF o0.005. Secondly, as illustrated by comparison with data from the UK10K sequencing project, 22% of missense variants with allele counts45 are not featured on the array (11 894 of 54 463 variants; Supplementary Table 6). Additionally, only~36% of PAVs on the array are predicted to be functionally deleterious. Finally, indels are not well represented on the array. Collectively, these observations cast doubt on the ability of the array to provide a comprehensive assessment of the contribution of PAVs to disease risk, highlighting the value of sequence-based approaches to discover new disease variants.
In conclusion, there is increasing evidence that cancer susceptibility is in part mediated through low-frequency variants affecting the amino acid sequence of expressed proteins. Hence genome scans of PAVs represent a complementary strategy to identify disease-causing variants compared to scans based on tagSNPs. Strategies to lessen the multiple testing burden by restricting analysis to PAVs with higher priors affords an opportunity to maximise study power. hospital staff, cancer registries, study staff and funders who contributed to the blood sample and data collection for this study as listed in Hepworth et al (BMJ 2006, 332, 883). For the German study, we are indebted to B Harzheim (Bonn), S Ott and Dr A Müller-Erkwoh (Bonn) for help with the acquisition of clinical data and R Mahlberg (Bonn) who provided technical support. The UK study made use of control genotyping data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from The 1958 Birth Cohort exome chip data was QCd by Kathy Stirrups. Data sharing was organised by the UK Exome-chip consortium. French controls were taken from the SU.VI.MAX study. The German GWA study made use of genotyping data from the Heinz-Nixdorf RECALL study. The HNR cohort was established with the support of the