Genetic correlation between multiple myeloma and chronic lymphocytic leukaemia provides evidence for shared aetiology

The clustering of different types of B-cell malignancies in families raises the possibility of shared aetiology. To examine this, we performed cross-trait linkage disequilibrium (LD)-score regression of multiple myeloma (MM) and chronic lymphocytic leukaemia (CLL) genome-wide association study (GWAS) data sets, totalling 11,734 cases and 29,468 controls. A significant genetic correlation between these two B-cell malignancies was shown (Rg = 0.4, P = 0.0046). Furthermore, four of the 45 known CLL risk loci were shown to associate with MM risk and five of the 23 known MM risk loci associate with CLL risk. By integrating eQTL, Hi-C and ChIP-seq data, we show that these pleiotropic risk loci are enriched for B-cell regulatory elements and implicate B-cell developmental genes. These data identify shared biological pathways influencing the development of CLL and, MM and further our understanding of the aetiological basis of these B-cell malignancies.


Introduction
Chronic lymphocytic leukaemia (CLL) and multiple myeloma (MM) are both B-cell malignancies, which arise from the clonal expansion of progenitor cells at different stages of B-cell maturity [1][2][3] . Evidence for inherited predisposition to CLL and MM comes from the six-and twofold increased risk of the respective diseases seen in relatives of patients 4 .
Epidemiological observations on familial cancer risks across the different B-cell malignancies suggest an element of shared inherited susceptibility, especially between CLL and MM 4 .
Linkage disequilibrium (LD) score regression is a method which exploits the feature of a test statistic for a given single nucleotide polymorphism (SNP), whereby that test statistic will incorporate the effects of correlated SNPs 15 . Conventional LD score regression regresses trait χ 2 statistics against the LD score for a given SNP, with the coefficient of the regression line providing an estimate of trait heritability. This method can be modified by instead regressing the product of SNP Z-scores from two traits against the SNP LD score, with the slope providing an estimate of genetic covariance between the two traits 16 . This method can be applied to summary statistics, is not biased by sample overlap and does not require multiple traits to be measured for each individual.
By analysis of GWAS data for MM and CLL and applying cross-trait LD score regression, we have been able to demonstrate a positive genetic correlation between CLL and MM. We find evidence of shared genetic susceptibility at 10 known risk loci and by integrating promoter capture Hi-C (PCHi-C) data, ChIP-seq and expression data we provide insight into the shared biological basis of CLL and MM.

Ethics
Collection of patient samples and associated clinicopathological information was undertaken with written informed consent and relevant ethical review board approval at respective study centres in accordance with the tenets of the Declaration of Helsinki.

Quality control
Standard quality-control measures were applied to the GWAS 17 . Specifically, individuals with low SNP call rate (<95%) as well as individuals evaluated to be of non-European ancestry (using the HapMap version 2 CEU, JPT/CHB and YRI populations as a reference) were excluded. For apparent first-degree relative pairs, we excluded the control from a case-control pair; otherwise, we excluded the individual with the lower call rate. SNPs with a call rate <95% were excluded as were those with a MAF <0.01 or displaying significant deviation from Hardy-Weinberg equilibrium (P < 10 −5 ). GWAS data were imputed to >10 million SNPs using IMPUTE2 v4 (for CLL) and IMPUTE2 v2.3 (for MM) software in conjunction with a merged reference panel consisting of data from 1000 Genomes Project 18 (phase 1 integrated release 3 March 2012) and UK10K 19 . Genotypes were aligned to the positive strand in both imputation and genotyping. We imposed predefined thresholds for imputation quality to retain potential risk variants with MAF >0.01 for validation. Poorly imputed SNPs with an information measure <0.80 were excluded. Tests of association between imputed SNPs and MM were performed under an additive model in SNPTESTv2.5 20 . The adequacy of the case-control matching and possibility of differential genotyping of cases and controls was evaluated using a Q-Q plot of test statistics. The inflation λ was based on the 90% least-significant SNPs and assessment of λ 1000 . Details of SNP QC are provided in Supplementary Table 3 and 4. Four principal components, generated using common SNPs, were included to limit the effects of cryptic population stratification in the US-CLL data set. Eigenvectors for the GWAS data sets were inferred using smartpca (part of EIGENSOFT) by merging cases and controls with phase II HapMap samples.

Meta-analysis
Meta-analyses were performed using the fixed-effects inverse-variance method using META v1.6 21 . Cochran's Q-statistic to test for heterogeneity and the I 2 statistic to quantify the proportion of the total variation due to heterogeneity was calculated.

LD score regression
To investigate genetic correlation between MM and CLL, we implemented cross-trait LD score regression by Bulik-Sullivan et al. 16 . Using summary statistics from the GWAS meta-analysis we implemented filters as recommended by the authors 16 . Specifically, filtering SNPs to INFO >0.9, MAF >0.01, and harmonizing to Hap Map3 SNPs with 1000 Genomes EUR MAF >0.05, removing indels and structural variants, removing strandambiguous SNPs and removing SNPs where alleles did not match those in 1000 Genomes. This was performed by running the munge-sumstats.pr script included with ldsc. We ran ldsc.py, part of the ldsc package, excluding the HLA region. We report heritability estimates on the observed scale. There is no distinction between observed and liability scale genetic correlation for case/control traits 16 .

Shared risk loci
To identify pleiotropic risk loci, that is genetic loci that influence two traits, we identified SNPs previously reported to be associated with each disease at genomewide significance (P < 5 × 10 −8 ), as well as highly correlated variants (r 2 > 0.8) at the 45 and 23 known risk loci for CLL and MM, respectively. Within these correlated variant sets at each locus, we determined how many of the CLL susceptibility loci were associated with MM at region-wide significance after Bonferroni correction for multiple testing (i.e. P adj < 0.05/45). We then repeated the process, examining MM susceptibility SNPs in CLL, applying a significance level of P adj < 0.05/23. A full list of results is summarized in Supplementary Data File 1 and 2.

Partitioned heritability
A variation of LD score regression, namely stratified LD score regression, can be used to partition heritability according to different genomic categories. For both MM and CLL we applied stratified LD score regression across the baseline model used in Finucane et al. 22 . We plotted the enrichment of functional categories for each diseasethis is defined as proportion heritability divided by the total heritability. We excluded from our plot additional flanking regions around each functional category, which authors designed to allow observation of enrichment of SNP heritability in intermediary regions. A plot of the results is found in Supplementary Figure 1.

Variant set enrichment
To examine enrichment in specific histone mark binding across shared risk loci, we adapted the method of Cowper-Sal lari et al. 23 . Briefly, for each risk locus, a region of strong LD (defined as r 2 > 0.8 and D′ > 0.8) was determined, and these SNPs were considered the associated variant set (AVS). Publically available ChIP-seq data for 6 histone marks from naive B cells was downloaded from Blueprint Epigenome Project 24 . For each mark, the overlap of the SNPs in the AVS and the binding sites was assessed to generate a mapping tally. A null distribution was produced by randomly selecting SNPs with the same characteristics as the risk-associated SNPs, and the null mapping tally calculated. This process was repeated 10,000 times, and P-values calculated as the proportion of permutations where null mapping tally was greater or equal to the AVS mapping tally. An enrichment score was calculated by normalizing the tallies to the median of the null distribution. Thus, the enrichment score is the number of standard deviations of the AVS mapping tally from the median of the null distribution tallies. An enrichment plot for naive B cells is shown in Supplementary Figure 2.

Cell-type-specific analyses
We considered chromatin mark overlap enrichment for genome-wide significant loci in different cell types using the methodology of Trynka et al. 25 . This approach scores GWAS SNPs based on proximity to chromatin mark and fold-enrichment of respective chromatin mark, assessing significance using a tissue-specific permutation method. We obtained chip-seq data for H3K4me3 from primary blood cells and CLL samples downloaded from Blueprint Epigenome project 24 . In addition, we included in our analysis 4 MM cell lines-KMS11, JJN3, MM1-S and L363 processed as previously described 26 . A heat map of results is shown in Supplementary Figure 3. eQTL eQTL analyses were performed using publicly available whole-blood data downloaded from GTeX 27 . The relationship between SNP genotype and gene expression we carried out using Summary-data-based Mendelian Randomization (SMR) analysis as per Zhu et al. 28 . Briefly, if b xy is the effect size of x (gene expression) on y (slope of y regressed on the genetic value of x), b zx is the effect of z on x, and b zy be the effect of z on y, b xy (b zy /b zx ) is the effect of x on y. To distinguish pleiotropy from linkage where the top associated cis-eQTL is in LD with two causal variants, one affecting gene expression the other affecting trait we tested for heterogeneity in dependent instruments (HEIDI), using multiple SNPs in each cis-eQTL region. Under the hypothesis of pleiotropy b xy values for SNPs in LD with the causal variant should be identical. For each probe that passed significance threshold for the SMR test, we tested the heterogeneity in the b xy values estimated for multiple SNPs in the cis-eQTL region using HEIDI.
GWAS summary statistics files were generated from the meta-analysis. For the disease discovery GWAS, we set a threshold for the SMR test of P SMR < 2.5 × 10 −5 corresponding to a Bonferroni correction for the number of probes which demonstrated an association in the SMR test. For all genes passing this threshold we generated plots of the eQTL and GWAS associations at the locus, as well as plots of GWAS and eQTL effect sizes (i.e. input for the HEIDI heterogeneity test). HEIDI test P-values <0.05 were considered as reflective of heterogeneity. This threshold is, however, conservative for gene discovery because it retains fewer genes than when correcting for multiple testing. SMR plots for significant eQTLs are shown in Supplementary Figures 4, 5 and a summary of results are shown in Supplementary Table 5.

Genetic correlation and heritability
We performed cross-trait LD-score regression using summary statistics from two recent GWAS meta-analyses based on 7717 MM cases and 21,587 controls and 4017 CLL cases and 7881 controls (Fig. 1, Supplementary Table  1-4). While these data sets have been previously subject to quality control (QC) [5][6][7][9][10][11][12] for the current analysis we implemented additional filtering steps as per Bulik-Sullivan et al. 16 , resulting in 1,055,728 harmonized SNPs between the two data sets. Heritability estimates from cross-trait LD score regression of 9.2 (±1.8%) and 22 (±5.9%) were comparable with previous estimates for MM 14 and CLL 6 . LD-score regression revealed a significant-positive genetic correlation between MM and CLL with an R g value of 0.44 (P = 4.6 × 10 −3 ).

Identification of pleiotropic risk loci
We identified SNPs previously reported to be associated with each disease at genome-wide significance (P < 5 × 10 −8 ), as well as highly correlated variants (r 2 > 0.8) at the 45 and 23 known risk loci for CLL and MM, respectively. To identify pleiotropic risk loci, that is genetic loci that influence two traits, we determined how many of the CLL susceptibility loci were associated with MM at regionwide significance after Bonferroni correction for multiple testing (i.e. P adj < 0.05/45). We then repeated the process, examining MM susceptibility SNPs in CLL, applying a significance level of P adj < 0.05/23. Of the 45 CLL risk loci, four were associated with MM (P adj < 0.0011) while, of 23 MM risk loci, five were significantly associated in CLL (P adj < 0.0022) (Table 1, Fig. 2). Correlated SNPs (r 2 > 0.8) Fig. 1 Schematic outlining the processing of data sets used in the genetic correlation at 3q26.2 are associated with both CLL and MM at genome-wide significance (Fig. 2), bringing the total number of pleiotropic loci to 10.

Biological inference
Trynka et al. have recently shown that chromatin marks highlighting active regulatory regions overlap with phenotype-associated variants in a cell-type-specific manner 25 . As H3K4me3 was shown to be the most phenotypically cell-type-specific chromatin mark, we examined cell-type specificity of the 10 pleiotropic risk loci by analysing H3K4me3 chromatin marks in normal haematopoietic cells and CLL patient samples from Blueprint  Figure 1). Furthermore, we found significant enrichment of SNPs in the shared loci within regions of active chromatin, as indicated by the presence of H3K27ac and H3K4Me3 marks in naive B cells, supporting the principle that SNPs in shared loci influence risk through regulatory effects (Supplementary Figure 2). To identify target genes we analysed PCHi-C data on naive B cells from Blueprint 24 . We also sought to gain insight into the possible biological mechanisms for associations by performing an expression quantitative trait locus (eQTL) analysis using mRNA  expression data on blood from GTEx. Applying Summary data-based Mendelian Randomization (SMR) methodology, we tested for pleiotropy between GWAS signal and cis-eQTL for genes to identify a causal relationship. Broadly, our analysis of the shared loci groups them into those which act on a B-cell regulation and differentiation and those which underpin the distinctive biology of cancer; specifically, loci relating to genome instability, angiogenesis and dysregulated apoptosis (Supplementary Table 6).
Of the shared loci, three were related to B-cell regulation. This included composite evidence at 10q23.31, from looping interactions in naive B cells and correlation in GWAS effect size and expression, which provide evidence for two candidate genes ACTA2, encoding smooth muscle (α)-2 actin, a protein involved in cell movement and contraction of muscles 33 and FAS, a member of the TNFreceptor superfamily. FAS, has a central role in regulating the immune response through apoptosis of B cells 34,35 . At 2q31.1, looping interactions implicated transcription factor SP3, which has been shown to influence expression of germinal centre genes, 36,37 . Variants at 6p25.3 reside in the 3′-UTR of IRF4, which has an established role in Bcell regulation 38,39 and MM oncogenesis 40,41 .
Three of the 10 loci contain genes with roles in maintenance of genomic stability. Specifically, evidence from expression and PCHi-C data implicated RFWD3 at 16q23.1. This gene encodes an E3 ubiquitin-protein ligase, which has been shown to promote progression to late stage homologous recombination through ubiquitination and timely removal of RAD51 and RPA at sites of DNA damage 42 and is necessary for replication fork restart 43 . Variants in this locus demonstrated enrichment of H3K4me3 marks in two samples of naive B cells, which represents a plausible cell of disease origin. rs58618031 (7q31.33) maps 5′ of POT1, the protection of telomeres 1 gene, which is part of the shelterin complex and functions to maintain chromosomal stability 44,45 . Variant rs1317082 at 3q26.2 is located proximal to TERC, a gene which has been shown to influence telomere length 46 . Additionally, we observed looping interactions to a number of genes at 3q26.2 including SEC62, which has been proposed as a cancer biomarker [47][48][49][50] . Intriguingly, variants at 3q26.2 this locus have been implicated in colorectal 51 , thyroid 52 and bladder 53 cancer.
Several genes were implicated at 22q13.33 by looping interactions for SCO2, LMF2, ODF3B, TYMP/ECGF1, NCAPH2, SYCE3 and ARSA, with TYMP/ECGF1 and SCO2 demonstrating evidence of correlation in GWAS and eQTL effect size, albeit not significant after multiple testing (P SMR = 2.38 × 10 −4 and 3.19 × 10 −4 ). Variants within this locus were enriched in H3K4me3 chromatin marks in both CD38-B cells and inflammatory macrophages. TYMP (alias ECGF1) encodes thymidine phosphorylase, which is often overexpressed in tumours and has been linked to angiogenesis 54,55 . A detailed study on this gene has implicated TYMP in the development of lytic bone lesions in MM, via a mechanism involving activation of PI3K/Akt signalling and increased DNMT3A expression resulting in hypermethylation of RUNX2, osterix, and IRF8 56 . Furthermore, SCO2 (synthesis of cytochrome c oxidase), also mapping to this locus, has been implicated in the development of breast 57,58 , gastric 59 and leukaemia 60 , through glucose metabolism reprogramming 61 , a hallmark of cancer 62 . Tumour suppressor, p53, regulates metabolic pathways, p53transactivated TP53-induced glycolysis (TIGAR), and regulation of apoptosis in part through SCO2 58,59,61 .
Finally, whereas these data were indifferent to decipher 8q24.21, this locus has also been shown to harbour risk SNPs for other cancers, which localize within distinct LD blocks and likely reflect tissue-specific effects on cancer risk through regulation of MYC 30 .

Discussion
Our analysis provides evidence of a genetic correlation between MM and CLL. Furthermore, we have identified shared genetic susceptibility at 10 known risk loci. While requiring biological validation, integration of data from PCHi-C, chromatin mark enrichment and eQTL at shared loci has provided insight into how these loci may confer susceptibility to both CLL and MM. Applying a working hypothesis that the loci may act in pleiotropic fashion, we selected relevant cells representing a common tissue of disease origin; namely naive B cells.
A significant genetic correlation between MM and CLL, as well as the discovery of risk loci shared between them, Fig. 2 Overlap of loci in multiple myeloma and chronic lymphocytic leukaemia. *correlated variants at 3q26.2 had been previously published as genome wide significant in each data set prior to this analysis supports epidemiological data demonstrating elevated familial risks between these B-cell malignancies 4 . Furthermore, the shared loci we identified could be broadly grouped into those containing genes related to B-cell regulation and differentiation and those containing genes involved in angiogenesis, genome stability and apoptosis, supporting the tenet that these alleles can influence aetiology of either disease. With the expansion of GWAS of the B-cell malignancies, more detailed characterisation of common underlying risk alleles and affected pathways can inform the biology of B-cell oncogenesis.
Data availability SNP genotyping data that support the findings of this study have been deposited in Gene Expression Omnibus with accession codes GSE21349, GSE19784, GSE24080, GSE2658 and GSE15695; in the European Genomephenome Archive (EGA) with accession code EGAS00000000001; in the European Bioinformatics Institute (Part of the European Molecular Biology Laboratory) (EMBL-EBI) with accession code E-MTAB-362 and E-TABM-1138; and in the database of Genotypes and Phenotypes (dbGaP) with accession code phs000207.v1.p1. The remaining data are contained within the paper and Supplementary Files or available from the author upon request. Naive B-cell HiC data used in this work is publicly available from Blueprint Epigenome Project [https://osf.io/u8tzp/]. ChIP-seq data for H3K27ac, H3K4Me1, H3K27Me3, H3K9Me3, H3K36Me3 and H3K27Me3 from naive B cells are publicly available and was obtained from Blueprint Epigenome Project [http:// www.blueprint-epigenome.eu/]. and other individuals who participated in the project. The Genome-Wide Association Study (GWAS) of Non-Hodgkin Lymphoma (NHL) project, from which US-CLL samples were obtained, was supported by the intramural program of the Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH). The data sets have been accessed through the NIH database for Genotypes and Phenotypes (dbGaP) under accession # phs000801. A full list of acknowledgements can be found in supplementary note (Berndt SI et al., Nature Genet., 2013, PMID: 23770605).