A cross-disease meta-GWAS identifies four new susceptibility loci shared between systemic sclerosis and Crohn’s disease

Genome-wide association studies (GWASs) have identified a number of genetic risk loci associated with systemic sclerosis (SSc) and Crohn’s disease (CD), some of which confer susceptibility to both diseases. In order to identify new risk loci shared between these two immune-mediated disorders, we performed a cross-disease meta-analysis including GWAS data from 5,734 SSc patients, 4,588 CD patients and 14,568 controls of European origin. We identified 4 new loci shared between SSc and CD, IL12RB2, IRF1/SLC22A5, STAT3 and an intergenic locus at 6p21.31. Pleiotropic variants within these loci showed opposite allelic effects in the two analysed diseases and all of them showed a significant effect on gene expression. In addition, an enrichment in the IL-12 family and type I interferon signaling pathways was observed among the set of SSc-CD common genetic risk loci. In conclusion, through the first cross-disease meta-analysis of SSc and CD, we identified genetic variants with pleiotropic effects on two clinically distinct immune-mediated disorders. The fact that all these pleiotropic SNPs have opposite allelic effects in SSc and CD reveals the complexity of the molecular mechanisms by which polymorphisms affect diseases.

Quality control and imputation. All GWAS data were quality control (QC) filtered prior imputation.
Single-nucleotide polymorphisms (SNPs) and subjects with success call rates lower than 95% were removed using PLINK V.1.9 (www.cog-genomics.org/plink/1.9/) 29 . SNPs showing a deviation from the Hardy-Weinberg equilibrium (P-value < 0.001) and minor allele frequencies <1% were also excluded. In addition, one subject per duplicate pair and per pair of first-degree relatives was also removed via the Genome function in PLINK V.1.9 with a Pi-HAT threshold of 0.4. Principal component analysis (PCA) was performed in order to identify and exclude outliers based on their ethnicity by using PLINK V.1.9 and the GCTA64 and R-base under GNU Public license V.2. We estimated the first five PCs using ~100.000 quality-filtered independent SNPs (r 2 < 0.15). Outliers were defined as individuals who deviated more than six standard deviations from the centroid of their population. The number of SNPs before and after QC for each cohort is summarized in Supplementary Table S1.
Imputation was performed using the Michigan Imputation Server 30 . The software SHAPEIT 31 was used in order to estimate haplotypes, and the European panel of the Haplotype Reference Consortium r1.1 32 was used as the reference panel for both SSc and CD genotype data in the discovery phase. Individual chunks of 50.000 Mb were used to carry out the imputation, covering whole-genome regions with a probability threshold for merging genotypes of 0.9, thus maximizing the quality of the imputed variants. Imputed data were also subjected to the above-mentioned QC filters in PLINK V.1.9. The total number of SNPs imputed for each cohort is summarized in Supplementary Table S1. Statistical analysis. Statistical analyses were performed with PLINK V.1.9.
Discovery phase. Each GWAS case/control cohort was independently analysed by logistic regression assuming an additive model with the first five PCs as covariates, as a correcting method for population stratification. Odds ratios (ORs) and 95% confidence intervals (CIs) were calculated according to Woolf 's method. Subsequently, SSc datasets were meta-analysed by the inverse variance-weighted method. Sex chromosomes were excluded from the analysis.
In order to detect common signals for SSc and CD with the same effect, either risk or protection, we selected SNPs that showed a P-value < 1 × 10 −5 in the SSc-CD meta-analysis and showed nominal significance (P-value < 0.01) with each disease separately, as well as no significant heterogeneity in the SSc meta-analysis (Cochran's Q test > 0.05 and heterogeneity index I 2 < 50%). To identify common signals for SSc and CD with opposite effect, the direction of association was flipped in the CD dataset (1/OR instead of OR). Again, we selected SNPs that showed a P-value < 1 × 10 −5 in the SSc and CD meta-analysis and that were associated with each disease separately at a P-value < 0.01.
The strongest associated SNP within each locus was selected for the replication phase. Genetic variants were annotated using variant effect predictor (VEP) 33 and their previous association with SSc and/or CD was explored using Immunobase (http://www.immunobase.org) and the GWAS catalog 34 .
Replication phase. Replication cohorts were analysed by logistic regression for the previously selected SNPs. Finally, combined analysis of the SSc and CD discovery and replication cohorts was performed using the inverse variance method. After the replication phase, we considered as statistically significant those signals that showed a P-value < 0.05 in each disease separately in the replication phase and a P-value < 5 × 10 −8 in the SSc-CD cross-disease meta-analysis including both discovery and replication datasets.
The statistical power of the SSc-CD combined meta-analyses (both discovery and discovery + replication) was determined as described by Skol et al. 35 . In the discovery cross-disease meta-analysis, the statistical power to detect an association at a P-value of 1 × 10 −5 (MAF = 20% and OR = 1.2) was 80%. In the discovery + replication meta-analysis, the statistical power to detect an association at a P-value of 5 × 10 −8 (MAF = 20% and OR = 1.2) was 100%.
Independence analysis. For those SSc-CD common loci identified for which an association with any of the analysed diseases was already reported, we evaluated the independence between pleiotropic signals and genetic variants previously associated with SSc and/or CD at the genome-wide significance level according to Immunobase and the GWAS Catalog. For this purpose, we used LDlink 36 , a tool that provides linkage disequilibrium (LD) data between polymorphisms across a variety of ancestral populations. Only the European ancestry was taken into account for the LD analysis.
In addition, since one of the shared genetic risk loci was located close to the extended major histocompatibility complex (MHC) region, we decided to test the independence between our new common signal and the main SSc and CD HLA associations. For this, we imputed SNPs, classical HLA alleles and amino acids across the extended MHC region (29,000,000 to 34,000,000 bp in chromosome 6) using the SNP2HLA method with the Beagle software package 37 and the Type 1 Diabetes Genetics Consortium reference panel, composed of 5,255 individuals of European origin 38 . HLA imputation of the CD discovery cohort was not possible due to the low coverage of this region included in the platform used for the genotyping of this dataset. For the SSc discovery cohort, the presence of independent effects within the extended MHC region was examined using a stepwise logistic regression by conditioning on the top independent signals.

Functional annotation.
We assessed the potential regulatory function of the SSc-CD common susceptibility variants identified by means of in silico expression quantitative trait locus (eQTL) analysis using Haploreg v4.1. Haploreg v4.1 is a tool for exploring annotations at variants on haplotype blocks, providing a large collection of regulatory information, capable of the functional assignment onto any set of variants derived from GWAS or sequencing studies 39 . We only included eQTLs found in tissues with relevance in SSc and/or CD.

Protein-protein interaction and gene set enrichment analyses.
In order to identify interactions among proteins encoded by SSc and CD common risk loci, we decided to construct a protein-protein interaction (PPI) network using the STRING database V.11.0 40 . This software provides a critical assessment and integration of PPI, including functional (indirect) as well as physical (direct) associations.
Gene ontology (GO) was applied to perform an enrichment analysis in order to determine whether certain biological processes are overrepresented in the set of SSc-CD common genes.

Results
Meta-analysis and replication. Following QC and imputation, we performed a meta-analysis considering both diseases as a single phenotype. A total of 5,994,231 SNPs overlapped between all GWAS datasets in the discovery phase.
When we combined GWAS data from SSc and CD under the assumption that alleles had the same effect in both diseases, genetic variants at 13 loci fulfilled the replication criteria (p-value < 1 × 10 −5 in the SSc-CD meta-GWAS and p-value < 0.01 in each disease-specific analysis) ( Fig. 2A and Supplementary Table S2). One of these common signals was located within the IRF8 region, a known genetic risk locus shared between SSc and CD, and, therefore, it was not considered in subsequent analyses. On the other hand, we performed the analysis under the assumption that alleles had opposite directions in both diseases, identifying 12 loci that fulfilled all criteria for the replication phase ( Fig. 2B and Supplementary Table S3).
To confirm these associations, the strongest associated SNP within each locus was selected for validation in additional sample sets. According to the criteria established for the replication analysis (genome-wide significance in the combined analysis including both discovery and replication sets, and nominal statistical significance in each disease-specific replication analysis), we identified a total of 4 genetic variants showing a pleiotropic effect in SSc and CD: two intronic variants located within IL12RB2 and STAT3, a SNP close to IRF1, and an intergenic variant at 6p21.31 located between ZBTB9 and BAK1 (Table 1). It is remarkable that an opposite allelic effect in both disorders was observed for all these new common signals.
Three of these shared risk loci have been previously associated with one of the analysed diseases, IL12RB2 with SSc and IRF1 and STAT3 with CD. Shared genetic variants at the IRF1 and STAT3 loci identified in our study Scientific RepoRtS | (2020) 10:1862 | https://doi.org/10.1038/s41598-020-58741-w www.nature.com/scientificreports www.nature.com/scientificreports/ were linked to those polymorphisms previously associated with CD (r 2 > 0.40). In the case of IL12RB2, it is an established genetic risk locus for SSc but, in addition, the IL23R gene, located within this same genomic region, is a known susceptibility gene for CD. However, LD analysis evidenced that the pleiotropic variant identified in our study (rs6659932) was independent of the IL23R SNPs previously associated with CD (Supplementary Table S4).
On the other hand, the intergenic variant at 6p21.31 (rs68191) is located close to the extended MHC region. Considering this, we decided to test the independence between our new common signal and the main HLA associations observed in the SSc and CD discovery cohorts. In the case of CD, independence between signals could not be checked due to the low coverage of the HLA region. Regarding SSc, two independent signals were observed after conditional regression analysis, HLA-DPB1*1301 (p = 1.77 × 10 −19 , OR = 2.79) and HLA-DRB1*1104 (p = 1.21 × 10 −12 , OR = 1.83). After controlling for these two classical alleles, the SSc-CD common signal remained significant in the SSc discovery cohort (p-value = 8.15 × 10 −3 ; conditioned p-value = 2.78 × 10 −2 ). Functional effect on gene expression. Subsequently, we used the HaploReg database to explor wether the most strogly associated polymorphism of each shared locus acted as an eQTL. As shown in Supplementary  Table S5, all the pleiotopic SNPs identified in our study appeared to affect gene expression levels. Shared genetic variants at the IL12RB2 (rs6659932) and STAT3 (rs4796791) loci affected expression levels of IL12RB2 and STAT3, respectively, whereas the pleiotropic SNP of the IRF1 locus (rs2548998) acted as an eQTL for IRF1 and SLC22A5. Interestingly, the intergenic polymorphism at the MHC extended region (rs68191) affected gene expression levels of TAPBP.
Protein-protein interaction and enrichment analysis. Finally, we also evaluated the connectivity at the protein interaction level among the genetic risk loci shared between SSc and CD, including genes whose expression levels were affected by the pleiotopic polymorphisms identified in our study, that is IRF1, SLC22A5, STAT3, IL12RB2 and TAPBP, as well as loci associated in previous studies with both SSc and CD, including STAT4, TYK2, IRF8, GSDMA and IKZF3. GSDMA and IKZF3 belong to the same LD block, however GSDMA has been set as the most probable candidate gene of this locus in SSc and IKZF3 for CD 41,42 . Thus, we decided to keep both genes for PPI and enrichment analyses.
The PPI network involved 9 of the 10 common proteins included in the analysis, except for SLC22A5 (Fig. 3). We observed a strongly significant PPI enrichment (p-value < 1 × 10 −6 ), indicating that these proteins have more interactions than would be expected for a random set of proteins of similar size. www.nature.com/scientificreports www.nature.com/scientificreports/ To further evaluate this connection, we performed a gene ontology enrichment analysis in biological processes. In this regard, we observed 29 statistically significant over-represented biological processes (p-value < 0.05). The most significantly over-represented pathways were related to interleukin-mediated signaling, especially those related with the IL-12 family and the type I interferon signaling pathway ( Table 2).

Discussion
Through the first comprehensive study of the genetic component shared between SSc and CD, we have identified four loci that contribute to suceptibility to both disorders. Of these, one had not been previously associated with any of the diseases under study (an intergenic locus at 6p21.31), whereas the remaining three represent established genetic risk loci for one but not the other condition.
Although all these pleiotropic SNPs are located in non-coding regions, functional annotation indicated that they act as regulatory variants affecting expression levels of either the gene were they mapped or close genes in cell types or tissues of relevance in the pathogenesis of SSc and/or CD. In this regard, pleiotropic variants appeared to influence expression levels of the IL12RB2, IRF1, SLC22A5, STAT3, and TAPBP genes (Supplementary Table S5). Most of these genes are key players of the immune response: IL12RB2 encodes a subunit of the IL-12 receptor complex implicated in Th1 differentiation; STAT3 encodes a transcription factor that is essential for the differentiation of Th17 cells; IRF1 encodes a transcriptional regulator of type I interferon (IFN) and IFN-inducible genes;  Table 1. Loci associated with a genome-wide significant threshold after the cross-disease meta-analysis of systemic sclerosis and Crohn's disease. Results of the discovery and replication analysis for each individual disease and of the combined meta-analysis (discovery + replication) are shown. SNP, single-nucleotide polymorphism; SSc, systemic sclerosis; CD, Crohn's disease. www.nature.com/scientificreports www.nature.com/scientificreports/ and TAPBP is crucial for optimal peptide loading on the MHC class I molecule. In addition, the pleiotropic variant affecting IRF1 levels also regulates the expression of SLC22A5, which encodes an organic cation transporter involved in the active cellular uptake of carnitine.
Interestingly, PPI analysis evidenced a number of non-random connections among the SSc-CD common genes, including both shared risk loci previously described and comon genes identified in our study, which indicates overlap among the pathways involved in the pathogenesis of these two disorders. Specifically, the IL-12 family signaling pathways, including IL-35, IL-23, IL-12, IL-21, and IL-27-mediated signaling, were particularly compelling. This family of cytokines plays a crucial role in shaping immune responses, differentiation of naïve T cells towards different types of effector cells, as well as in the regulation of effector cell functions 43 . Moreover, the type I interferon signaling pathway was also enriched among the set of SSc-CD common genes. An increased expression and activation of IFN-inducible genes, known as interferon signature, has been reported in SSc 44 and several interferon regulatory factors (IRFs), including IRF5, IRF4, and IRF8, have been involved in its susceptibility 14,45 , thus supporting the role of IRF1, previously associated with CD but not with SSc, as a new susceptibility gene for this last condition.
Considering these results, both IL-12 family and type I interferon signaling pathways could represent interesting therapeutic targets for both SSc and CD. Indeed, ustekinumab, a monoclonal antibody to the p40 subunit common to IL-12 and IL-23, has been recently approved in the EU and the USA to treat patients with CD and, therefore, this drug could be repositioned to treat SSc. However, it should be advised that all the pleiotropic variants identified in our study showed opposite allelic effects in the two analysed disorders, thus highlighting the complex effects that shared associations have on disease outcomes. This could be due to the fact that consequences of genetic variants are influenced by the cell type. For example, as previously indicated, the shared genetic variant at IL12RB2 influenced IL12RB2 gene expression levels; however, whereas the minor allele (which conferred risk to SSc in our study) correlated with an increased gene expression in whole blood, the major allele (which conferred risk to CD) had the same effect (increased IL12RB2 expression) in fibroblasts, according to GTEx data. In addition, the effect on gene expression of the pleiotropic SNP located within the 5q31.1 region was also cell type specific, influencing IRF1 expression levels in lymphoblastoid cells and SLC22A5 levels in other tissues, and, therefore, this SNP could have a different biological implication in both diseases. Indeed, higher expression levels of OCTN2, the protein encoded by SLC22A5, have been found in inflamed regions of the intestinal epithelium compared with non-inflamed areas, and a role of this protein in the intestinal homeostasis has also been reported 46 ; whereas, given the relevance of the type 1 interferon signaling pathway in SSc, the IRF1 gene seems a more plausible candidate to be involved in SSc susceptibility. Considering this, it is possible that an effective treatment for SSc could have a detrimental effect on CD, and conversely. As previously mentioned, we observed discordant associations for variants located in genes implicated in IL-23 and Th1 differentiation pathways. In this context, IL-17-specific antibody therapy, effective in psoriasis and with promising effects on SSc 47,48 , has been proven to exacerbate CD 49 . This could be due to a deficient Th17 activation in CD owing to mutations in STAT3, which could lead to hyper-IgE syndrome, typically associated with extracellular fungal and bacterial infections 50 . Interestingly, according to our results, the STAT3 rs4796791 variant confers protection to CD and risk to SSc, which could lead to an exacerbate reaction in CD patients carrying this variant when treated with anti-IL17 therapy.
Interestingly, it has been reported a reduced incidence of CD in patients with SSc 51,52 . Although the causes of this phenomenon are not clear, our results suggest that identical genetic risk factors could have different or even opposite functional effects in both diseases. These 'flip-flop' associations have been extensively observed across different comparative analyses 53 . In this regard, a cross-disease meta-analysis including CD and type 1 diabetes 54 identified two variants, such as IL27 rs4788084 and IL10 rs3024505, with opposite effects in these two conditions. Furthermore, a meta-analysis of 6 different immune-mediated disorders showed that 14% of overlapped variants were discordant regarding the risk allele across diseases 55 . These results suggest that predisposition to related diseases may be