Germline variants at SOHLH2 influence multiple myeloma risk

Multiple myeloma (MM) is caused by the uncontrolled, clonal expansion of plasma cells. While there is epidemiological evidence for inherited susceptibility, the molecular basis remains incompletely understood. We report a genome-wide association study totalling 5,320 cases and 422,289 controls from four Nordic populations, and find a novel MM risk variant at SOHLH2 at 13q13.3 (risk allele frequency = 3.5%; odds ratio = 1.38; P = 2.2 × 10−14). This gene encodes a transcription factor involved in gametogenesis that is normally only weakly expressed in plasma cells. The association is represented by 14 variants in linkage disequilibrium. Among these, rs75712673 maps to a genomic region with open chromatin in plasma cells, and upregulates SOHLH2 in this cell type. Moreover, rs75712673 influences transcriptional activity in luciferase assays, and shows a chromatin looping interaction with the SOHLH2 promoter. Our work provides novel insight into MM susceptibility.


Introduction
Multiple myeloma (MM) is characterized by an uncontrolled, clonal expansion of plasma cells in the bone marrow, producing a monoclonal immunoglobulin ("M protein"). Clinically, MM is complicated by bone marrow and kidney failure, hypercalcemia, and lytic bone lesions 1 . MM is preceded by monoclonal gammopathy of undetermined significance (MGUS) 2,3 , detectable in 3% of individuals older than 50 years 4 , which progresses to MM at an annual rate of about 1% 5 .
Epidemiological studies have shown that first-degree relatives of patients with MM have 2-4 times higher risk of MM and MGUS [6][7][8] , as well as an increased risk of chronic lymphocytic leukemia, lymphomas, and certain solid tumours [9][10][11] . Genome-wide association studies (GWAS) have identified inherited sequence variants at 24 loci that influence MM risk and sequencing studies of familial cases have implicated candidate genes where rare MM-predisposing variants might reside 12 . Still, identified loci explain less than 20% of the estimated heritability 13 .
To advance our understanding of MM genetics, we carried out a GWAS based on cases and controls from four Nordic populations, with follow-up in additional data sets of European ancestry. We detect a novel MM risk locus at 13q13.3, spanning the SOHLH2 gene. Dissecting the functional architecture, we identify a putative causal variant within the linkage disequilibrium (LD) block that likely acts by upregulating SOHLH2 in plasma cells.

Sample sets
We carried out a GWAS totalling 5,320 MM and non-IgM MGUS cases and 422,293 controls recruited from Denmark, Iceland, Norway and Sweden (Tables 1 and 2); Swedish series 2,338 MM cases from the Swedish National MM Biobank (Skåne University Hospital, Lund) and 11,971 Swedish controls (random blood donors and primary care patients) genotyped within an ongoing GWAS on blood cell traits (Lopez de Lapuente Portilla

Genotyping and imputation
Swedish, Danish and Norwegian samples were genotyped with Illumina single-nucleotide polymorphism microarrays and phased together with 442,737 samples from North-Western Europe using Eagle2 17 . Samples and variants with <98% yield were excluded. We used the same methods as used for the Icelandic data 15,18 to create a haplotype reference panel by phasing the whole-genome sequence (WGS) genotypes for 15,575 individuals from North-Western Europe, including 3,012 Swedish, 8,429 Danish and 2,550 Norwegian samples, together with the phased microarray data, and to impute the genotypes from the haplotype reference panel into the phased microarray data.
Sample preparation and WGS of 49,962 Icelanders are previously reported 15,19 . Briefly, 37.6 million sequence variants were identified by WGS in 49,962 Icelanders using Illumina technology to a mean depth of at least 18×. SNPs and indels were identified and their genotypes were called jointly using Graphtyper 20 . In addition, over 165,000 Icelanders, including all those with WGS data, have been microarray-genotyped and long-range phased 18 , improving genotype calls using information based on haplotype sharing. The genotypes of the highquality sequence variants were imputed into the microarray-typed Icelanders 21 . To increase the sample size and power to detect associations, the sequence variants were also imputed into relatives of the microarraytyped using genealogic information. All tested variants had imputation information > 0.8. Variants were mapped to hg38 and matched on position and alleles to harmonize the four data sets. rs145374408, rs78351393 and rs17202418 were genotyped in the Swedish follow-up sample set.

Ancestry analysis
Genetic ancestry analysis was done in two stages for the Danish, Swedish and Norwegian sample sets separately. Firstly, ADMIXTURE v1.23 22 was run in supervised mode with 1000 Genomes populations CEU, CHB, and YRI 23 as training samples and Danish, Swedish or Norwegian individuals as test samples. Input data for ADMIXTURE had long-range LD regions removed 24 and was then LDpruned with PLINK v.190b3a 25 using the-indep-pairwise 200 25 0.3 option. Samples with <0.9 CEU ancestry were excluded. Secondly, remaining samples were projected onto a principal component analysis (PCA), calculated with an in-house European reference panel to calculate the 20 first principal components for each population. UMAP 26 was used to reduce the coordinates of test samples on 20 principal components to two dimensions. Additional European samples not in the original reference set were also projected onto the PCA and UMAP components to identify ancestries represented in the clusters, and samples with Swedish, Danish and Norwegian ancestries were identified.

Association testing
We performed logistic regression in the Icelandic, Swedish, Danish and Norwegian data set separately to test for association between MM and genotypes using deCODE software 15 . In the Danish, Swedish and Norwegian association analysis, we adjusted for gender, whether the individual had been microarray-typed and/or sequenced, and the first 20 principal components. In the Icelandic association analysis, we adjusted for gender, county-of-origin, current age or age at death, blood sample availability for the individual, and an indicator function for the overlap of the lifetime of the individual with the time span of phenotype collection. We used LD score regression to account for distribution inflation due to cryptic relatedness and population stratification 27 .
For the meta-analysis, we used a fixed-effects inverse variance method 28 . Of note, using a large number of controls, primarily in the Icelandic and Danish data sets, will not bias the results for individual data sets as it only provides more accurate estimate of the allelic frequency in the control group and hence increases power. The inverse variance method used to combine effect size estimates, in essence, weights effects by sample size through the use of corresponding standard errors. This meta-analysis method is well recognized and will not bias results when the ratio of cases to controls is unequal 29 .
Genome-wide significance was determined using classbased Bonferroni significance thresholds for about 33 million variants. Sequence variants were split into five classes based on their genome annotation, and the significance threshold for each class was based on the number of variants in that class 30 . The adjusted significance thresholds used are 2.49 × 10 −7 for variants with high impact (including stop-gained and stop-loss, frameshift, splice acceptor or donor and initiator codon variants), 4.97 × 10 −8 for variants with moderate impacts (including missense, splice-region variants, inframe deletions and insertions), 4.52 × 10 −9 for low-impact variants (including synonymous, 3′ and 5′ UTR, and upstream and downstream variants), 2.26 × 10 −9 for deep intronic and intergenic variants in DNase I hypersensitivity sites and 7.53 × 10 −10 for all other variants, including those in intergenic regions.

eQTL analysis
To identify expression quantitative locus (eQTL) effects in plasma cells, we used previously published RNA-seq data for CD138 + immunomagnetic bead-enriched MM plasma cells 31 . To test for association, we used linear regression with and without 10 principal components as covariates. To identify eQTLs in blood, we used wholeblood RNA-seq from 13,127 Icelanders 32 . Gene expression was quantified based on transcript abundances estimates using Kallisto 33 . Association between sequence variants and gene expression was calculated using generalized linear regression 34 . The additive genetic effect was assumed and quantile-normalized gene expression estimates were calculated while adjusting for sequencing artefacts, demography variables, and hidden factors 35 . Finally, as a complement to the Icelandic data, we used data from the eQTLGen database (www.eqtlgen.org) 36 .

ATAC-seq data for blood cell populations
Sequencing reads for published ATAC-seq libraries from sorted hematopoietic cell types were downloaded from https://atac-blood-hg38.s3.amazonaws.com/hg38/ using the rtracklayer R package 37 , and processed using hg38 as a reference genome.

Luciferase assays
Luciferase constructs representing the reference and alternative allele of rs75712673 were made by cloning 120-bp genomic sequences (Integrated DNA Technologies) centered on the variant into the pGL3-basic vector. Using electroporation (Neon Transfection system; Life technologies, USA) constructs were co-transfected with renilla plasmid into L363 and OPM2 cells (DSMZ). Twenty hours after electroporation, luciferase and renilla activity was measured using DualGlo Luciferase (Promega no. E1960) on a GLOMAX 20/20 Luminometer (Promega, USA). Based on luciferase/renilla readings, we calculated log 2 scores for each variant.

Identification of a novel MM risk locus at 13q13
To find new MM risk loci, we carried out a GWAS based on four case-control data sets from Iceland, Denmark, Norway and Sweden totalling 5,320 MM patients and 422,289 controls (Table 1). We performed association testing in the four data sets separately, and combined the resulting statistics for 33 million variants that passed quality filtering. Two versions of the Icelandic case-control data were used for meta-analysis: one with MM patients only, and one that was expanded with non-IgM MGUS patients to increase power ( Table  1). The latter is motivated because MM evolves from MGUS, relatives of MGUS patients have increased MM risk, and several studies support pleiotropy between MM and MGUS 44 . Our analysis identified genome-wide significant association signals at 10 loci, and all previously reported MM lead variants were nominally significant with effects in the same direction as in the discovery studies (Supplementary Table  1). Nine of the genome-wide significant signals correspond to signals correspond to known MM risk loci. In addition, a previously unreported low-frequency variant at 13q13.3 (lead variant rs200203825; RAF ̴ 3,5%; combined P = 2.65 × 10 −10 with Icelandic MGUS cases; Table 1) showed significant association. This signal is represented by a haplotype of 14 non-coding variants in high LD (r 2 > 0.8) spanning the SOHLH2 gene (spermatogenesis and oogenesis specific basic helix-loop-helix 2; Supplementary Table  2). The detected variants showed comparable effects in the same direction in all four discovery sets ( Table 1). The conditional analysis did not reveal any additional independent signals. For follow-up, we genotyped an additional 473 MM cases and 3,430 controls from Sweden, and also looked for association in published MM association data sets from the United Kingdom, Germany, Holland and USA ( Table 1). The association with MM was nominally significant in three of the six follow-up data sets, including in the two largest series from the UK and Germany. For all the series the effects were in the same direction as in our discovery GWAS.

Functional annotation of the 13q13.3 association
Because non-coding variants generally act by altering the regulation of gene expression, we sought to identify candidate causal variants responsible for the 13q13.3 association based on epigenetic features associated with regulatory activity.
Firstly, ATAC-seq data from 17 blood cell subpopulations showed that rs75712673 (r 2 = 0.985 with rs200203825 in Swedes) maps to a genomic region selectively open in plasma cells (Fig. 1d). Secondly, consistent with this, chromatin immunoprecipitation and sequencing (ChIP-seq) data for the MM plasma cell line L363, showed that rs75712673 maps to a H3K4me3 histone mark (Fig. 1e). Collectively, these data are consistent with the 13q13.3 association affecting regulatory activity at rs75712673 in plasma cells.
To identify a putative target gene, we carried out eQTL analysis using RNA-seq data for CD138 + plasma cell isolated from the bone marrow of 188 MM patients 31 (Fig. 2a). We identified an association between rs75712673 and SOHLH2 expression (Wilcoxon rank sum test P = 0.0081 for heterozygotes versus major allele homozygotes without principal components; P = 0.043 with 10 expression and 10 genotype principal component covariates). By contrast, we did not detect a plasma cell-specific eQTL for any of the nearby genes SPG20, DLCK1, CCDC169, or CCNA1, nor a SOHLH2 eQTL in peripheral blood. SOHLH2 is normally expressed in hematopoietic stem and progenitor cells, with its expression in B-cells and plasma cells being low ( Supplementary Fig. 1). Further, consistent with the eQTL, luciferase analysis of rs75712673 in MM plasma cell lines L363 and OPM2 showed higher luciferase activity for the rs75712673-G risk allele relative to the reference allele (Fig. 2b). Computational motif analysis predicted that rs75712673-G alters multiple transcription factor motifs, including multiple members of the FOX and SOX families, IKZF2, and EN1 (Supplementary Table 3). Finally, using GeneHancer 45 promoter-capture HiC data, we detected a chromatin looping interaction between rs75712673 and the SOHLH2 promoter (Fig. 1c).

Discussion
In conclusion, we have identified a new genetic association for MM at SOHLH2, increasing the number of risk loci to 25. This locus was not significant in our recent six-center meta-analysis totalling 9,974 cases. The likely reason was that the Went et al. study combined data from different geographic populations, and, moreover, that the imputation was done using nonpopulation-matched reference genomes. By contrast, the present study was done in homogenous populations of Nordic ancestry and the imputation was done using population-matched reference genomes, increasing the power to detect low-frequency variants such as the one at SOHLH2.
Interestingly, SOHLH2 encodes a transcription factor with a basic helix-loop-helix domain that has previously been implicated in spermatogenesis 46 and development of breast and ovarian cancer 47 , but is normally expressed only at a low lever in plasma cells. To explore the mechanism underlying the SOHLH2 association, we functionally fine-mapped the 13q13.3 signal, and identified rs75712673 as a likely causal variant that upregulates SOHLH2 in plasma cells. Our findings provide novel insight into the molecular basis of inherited MM susceptibility.