Network-assisted analysis of primary Sjögren’s syndrome GWAS data in Han Chinese

Primary Sjögren’s syndrome (pSS) is a complex autoimmune disorder. So far, genetic research in pSS has lagged far behind and the underlying biological mechanism is unclear. Further exploring existing genome-wide association study (GWAS) data is urgently expected to uncover disease-related gene combination patterns. Herein, we conducted a network-based analysis by integrating pSS GWAS in Han Chinese with a protein-protein interactions network to identify pSS candidate genes. After module detection and evaluation, 8 dense modules covering 40 genes were obtained for further functional annotation. Additional 31 MHC genes with significant gene-level P-values (sigMHC-gene) were also remained. The combined module genes and sigMHC-genes, a total of 71 genes, were denoted as pSS candidate genes. Of these pSS candidates, 14 genes had been reported to be associated with any of pSS, RA, and SLE, including STAT4, GTF2I, HLA-DPB1, HLA-DRB1, PTTG1, HLA-DQB1, MBL2, TAP2, CFLAR, NFKBIE, HLA-DRA, APOM, HLA-DQA2 and NOTCH4. This is the first report of the network-assisted analysis for pSS GWAS data to explore combined gene patterns associated with pSS. Our study suggests that network-assisted analysis is a useful approach to gaining further insights into the biology of associated genes and providing important clues for future research into pSS etiology.

Biological annotation for the identified module genes. To better understand the biological functions of the DMS-identified module genes, we conducted a Gene Ontology (GO) enrichment analysis by using DAVID. The enriched biological processes were related with negative regulation of protein metabolic process and proteasomal protein catabolic process. The enriched molecular functions were about transcription regulator activity and transcription factor activity. Further information of GO enrichment analysis of module genes was summarized in Supplementary Table S3.
We also computed the tissue specificity of module genes by using the Gene Enrichment Profiler. In the transcript expression heatmap (Supplementary Figure S2), approximately two-thirds of these genes were highly expressed in immune-related cell types (specifically, B, T and myeloid cells). In the transcript enrichment heatmap (Supplementary Figure S3), we found module genes were preferentially expressed in the immune cell types.
Module genes and sigMHC-genes as candidates for pSS. To explore whether exist interactions between the DMS-identified module genes and sigMHC-genes, we extracted these genes and according interactions from the node-weighted pSS network, resulting in a subnetwork as shown in Fig. 1b. Most of sigMHC-genes directly or indirectly connected with module genes except 6 singletons and two isolated PPI pairs (HLA-DPB1 vs. HLA-DPA1, and HLA-DQB1 vs. HLA-DQA2). The combined module genes and sigMHC-genes were defined as candidate genes (71 genes) for pSS. Of these candidates, four genes (STAT4, GTF2I, HLA-DPB1, and HLA-DRB1) had been reported their Figure 1. (a) The subnetwork formed by identified module genes; (b) the subnetwork formed by sigMHCgenes and identified module genes. The triangle-shaped nodes represent sigMHC-genes and circular-shaped nodes represent DMS-identified module genes. The color of the node was proportioned with the gene P-value. The most significant gene P-value was red color and the most non-significant gene P-value was yellow color. association with pSS in the original GWAS dataset 6 , and gene PTTG1 (pituitary tumor-transforming 1) had been reported as suggestive association (rs2431098, allelic meta P-value = 2.28 × 10 −7 ; rs2431697, allelic meta P-value = 3.76 × 10 −6 ) with SS in another GWAS study of SS in European descent 5 . In addition, three genes (HLA-DQB1 14 , MBL2 15 , and TAP2 16 ) had been previously reported their association with SS/pSS by candidate gene studies.

Discussion
Unlike the extensive GWAS experiment in other AIDs, such as RA and SLE, there have been only two GWAS studies in SS/pSS until now 5,6 . Genetic studies of pSS have lagged behind. To further mining the existing genetic data, network-assisted analysis of pSS GWAS in Han Chinese was performed in order to explore the joint effects of multiple genetic association signals on pSS and discover additional candidate genes associated with pSS pathogenesis. First, all SNPs were first mapped into genes and gene-based association was performed by using VEGAS. Then, dense modules were dynamically searched in the context of the node-weighted pSS network via DMS, yielding thousands of functional modules. To avoid false positive results and the topology bias, a strict criterion was applied to select modules with significant genetic signals. After two stepwise significance tests (see Methods), 8 modules covering 40 genes were screened out. These 40 module genes had a high proportion of significant genes (60%) and preferentially expressed in immune-related cell types. The proteins encoded by the DMS-identified module genes were more closely interconnected than what would be expected by random cases, as suggested by DAPPLE analysis. In addition, there were 31 MHC genes with significant gene P-values, defined as sigMHC-genes. To avoid the results of module search focusing on MHC region, the sigMHC-genes were not set as "seed genes" (see methods) and directly remained as part of final results. By merging all module genes and sigMHC-genes, a total of 71 genes were involved and denoted as candidate pSS genes.
Of the 71 candidate pSS genes, eight genes had been previously reported their association with SS/pSS, as well as another six genes that had positive evidence associated with SLE or RA (Fig. 2). All the 14 reported genes had significant gene P-values (see Supplementary Table S2). Considering that pSS might share genetic signatures with SLE and RA to some extent, these six SLE or RA susceptibility genes were very likely associated with pSS. For example, one of reported RA susceptibility genes was NFKBIE (nuclear factor of kappa light polypeptide gene enhancer in B-cells inhibitor, epsilon) [20][21][22] , which is a non-MHC gene with gene-level P-value of 0.00801. NFKBIE, also known as IκBε, is part of the Iκ B family of proteins that regulates NF-κ B-dependent transcription by inhibiting DNA binding and localizing these factors to the cell cytoplasm 32 . It has been demonstrated that IκBα (another Iκ B family of protein) promoter polymorphisms are associated with susceptibility to SS 33 . In addition, NF-κ B plays an important role in inflammatory diseases and in the development of autoimmunity 34 . Experimental studies have shown an activation of NF-κ B in pSS. These clues implied that NFKBIE might be also associated with pSS.
Due to disease genes execute their functions are not alone, the interactions among the candidate pSS genes might play an important role. It has been observed that causal genes for the same Mendelian disease often physically interact 35,36 . There were two direct interactions among the reported genes, i.e. (HLA-DRB1 vs. HLA-DRA), and (HLA-DQB1 vs. HLA-DQA2). It was worth noting that most of 14 reported genes were closely connected by gene UBC (ubiquitin C) except for some sigMHC-genes. As the theory of "guilt by association", it is possible that a few highly connected nodes (hub genes) bring together several disease-associated genes, even though the hubs themselves are not relevant 37 . In this study, although the gene-level P-value of UBC was not significant (gene P-value = 0.463), it was involved by all the DMS-identified modules. As shown in Fig. 1a,b, UBC was centered both in the subnetwork formed by DMS-identified genes and in the subnetwork formed by the combined candidate genes. UBC directly or indirectly (via one or two nodes) connected with most of pSS candidate genes.
Of the rest of 57 genes, some genes with non-significant gene P-values might act as connectors to connect with other disease genes. For example, the gene P-value of JUN (jun proto-oncogene) was not significant (gene P-value = 0.918), but it interacted with STAT4 (signal transducer and activator of transcription 4), which was the most significant gene in this study (gene P-value = 1E-07). In addition, STAT4 was the only gene that had been confirmed its association with SS/pSS in both Han Chinese and European descent 5,6 . In addition, other novel candidate pSS genes with significant gene P-values were also valuable, such as STAT1 (signal transducer and activator of transcription 1, 91kDa). STAT1 (gene P-value = 0.00267) directly interacted with GTF2I, which was a reported pSS associated gene in the Chinese cohort 6 . STAT1 phosphorylation at serine 708 is a key event in the interferon signalling pathway 38 . In many SS patient, interferon activation plays an important role in the immune attack and destruction of salivary and lacrimal glands at some stage in the course of the disease. These evidence suggested that STAT1 was likely related with pSS.
The present study has some limitations that require consideration. First, only one pSS GWAS data in Han Chinese was available for this study. It would be more valuable if we could make a comparison between multiple GWAS datasets from different population. In our study, only gene PTTG1 could be cross evaluated between two SS/ pSS GWAS. PTTG1 was only reported as suggestive association with SS in European descent 5 and not found to be a risk factor in the Chinese cohort 6 . However, this gene was significant at gene-level (gene P-value = 0.00143) and identified by module search, and it was also reported to be associated with SLE 24 . These lines of evidence implied that PTTG1 might be associated with pSS in Han Chinese. Second, calculation of gene-level P-value is a key step in the network-assisted analysis of GWAS. VEGAS is comparable with other tools to compute gene P-values, however, it could only deal with autosomal SNPs. Due to women are nine times more likely than men to be affected with SS, it would be interesting to evaluate SNPs located on the sex chromosomes (X and Y).
In summary, this is the first use of network analysis of pSS GWAS data to further mine genetic signals at a molecular level instead of analyzing each of single locus (SNP). Complementary to the traditional GWAS analysis, it was more powerful that gene-level P-value was considered by calculating the combined effect of all SNPs within a gene and subsequently integrated with a pSS-specific PPI network to search for gene combination patterns contributed to pSS. Our findings included 40 non-MHC genes identified by DMS algorithm and 31MHC-region genes with significant gene-level P-values. These candidates and interactions among them were more likely to be associated with pSS.
Deciphering the mechanism of pSS pathogenesis is still challenging, although certain progress has been made, much remains to be understood. Deriving a pSS-specific PPI network and identification of dense module genes and sigMHC-genes, as described herein, offers new targets for further functional assessment for this chronic and complex condition.

Methods
The workflow of the network-assisted analysis for pSS GWAS data was shown in Supplementary Figure S4, and the sections below were labeled in correspondence with this figure. pSS GWAS dataset. The pSS GWAS data is composed of samples of Han Chinese 6 . There are 642,832 SNPs in 597 pSS cases and 1,090 controls genotyped with the Affymetrix Axiom Genome-Wide CHB 1 Array Plate. Details of this pSS GWAS data and process of quality control are provided in ref. 6 . After quality control filtering, a total of 542 cases, 1,050 controls and 556,134 autosomal SNPs were remained for subsequent analysis.

Computing gene-level P-values.
To perform network analysis and examine functional correlation between genes, gene-level P-value representing the significance of association with phenotype for each gene was needed to be considered. The gene-level P-value was calculated with VEGAS 39 , which can incorporate information from all SNPs mapped to a gene and take into account the linkage disequilibrium (LD) patterns between SNPs for the specific samples (a custom set of individuals) or ethnic background (HapMap data). In VEGAS, all SNPs were mapped to human protein-coding genes according to positions on the UCSC Genome Browser. In this study, an off-line version of VEGAS was applied and improved in some aspects. First, gene position was updated from original hg18 to hg19 downloaded from Ensembl (GRCh37.P11) 40 . Second, in order to capture SNPs in regulatory region and simultaneously avoid too many genes with overlapped SNPs, gene boundary was extended to 20kb upstream/downstream of the gene coordinates instead of 50kb by default. Third, when estimating the LD patterns, the pSS GWAS data was used.
Building a node-weighted pSS interactome. A consolidated human protein-protein interaction (PPI) network data was obtained from iRefIndex database (version 13.0) 41 , which collected nine interaction databases Scientific RepoRts | 5:18855 | DOI: 10.1038/srep18855 and computed the union of data sets. Among this large network, many interactions were either predicted or supported by a single experiment. In order to reduce the rate of false positives, we included only those interactions supported by at least two publications for all subsequent analyses, resulting in a highly reliable network of 10,163 nodes (genes) and 36,680 interactions. Then, the nodes involved in this network was annotated with gene P-value as a node attribute, and extracted to derive a node-weighted pSS network. sigMHC-genes. Currently, the reported susceptibility genes associated with pSS mainly focused on immune-related genes and the MHC region 5,6,42,43 . To unravel more risk genes outside MHC region and avoid the complexity of the MHC region 44 , we did not assign gene P-values for nodes in pSS network if the corresponding genes located in MHC region and their gene P-values < 0.05, named as significant MHC genes (sigMHC-genes). Since sigMHC-genes might interplay with other significant genes and play an important role in the disease-related biological functions, these genes were still left in the pSS network to maintain the integrity of the network.

Module detection.
A dense module search (DMS) method implemented in dmGWAS was applied to search for modules that were enriched with significant P-value genes in the context of the node-weighted pSS network 45 . DMS starts by transferring each gene P-value into a Z score (z i ) by using the inverse normal cumulative distribution function 46 . For a module with k genes, the module score ( ) Z m was computed by summing the z i over all genes in the module, i.e.
. The detailed process of module search can be found in this study 45 . Briefly, for a given "seed gene", module grows by adding the neighboring nodes that can generate the maximum increment of a module score ( ) Z m . Module growth will stop if the increment is not greater than × . Z 0 1 m . The process of module searching was conducted taking each node in the pSS interactome as the seed gene except for sigMHC-genes.

Module evaluation.
For the proper capture of the connection between genetic association and network topology, two steps of tests were performed.
First, to assess the significance of the resultant modules, module P-values were calculated based on the module scores by empirically estimating the null distribution, which is assumed to be a normal distribution 47 . Specifically, module scores were median-centered, and then the parameters of mean δ and standard deviation σ were estimated for the empirical null distribution using the R package locfdr. The standardized module scores Z s were computed and converted to P-values by using the normal cumulative density function. The modules with P-values ( ) P Z m < 0.05 were selected for the further estimation.
Second, to avoid the bias that nodes with many interactors in the PPI are more probably to be chose by DMS, the topology of resultant modules were evaluated as suggested in 48 . All nodes in the network were divided into four groups according to nodes degree, i.e. 0-2 2 , 2 2 -2 4 , 2 4 -2 6 , and > 2 6 . For a given module with k genes, 10,000 modules with the same number of genes were generated by considering which group each gene located in and then randomly picking one gene from the corresponding group (i.e. structurally equivalent random networks). An empirical P-value was calculated by π = ( ) ≥ = , where π ( ) Z m is the score of the random module for the π th resample. The modules with P-values P topo < 0.05 were selected as the final results.
Assessment of the significance of connectivity between module genes. To evaluate whether module genes were densely connected via PPI network, a permutation test was performed to assess the significance of connectivity between module genes by using DAPPLE (Disease Association Protein-Protein Link Evaluator) algorithm 13 . Briefly, this approach first generated a random network that has nearly the exact same structure as the original one that is derived from the InWeb database 36 . The node labels (i.e. the protein names) were then randomly re-assigned to nodes of equal binding degree. DAPPLE assumes a null distribution of connectivity that is entirely a function of the binding degree of individual proteins. We built 10,000 different random networks and each of them had the same number of proteins, connectivity and per-protein binding degree as InWeb. The significance of our real PPI network formed by module genes was then assessed through permutation. For more details, please refer to the article 13 .
GO enrichment and cell-specific expression of module genes. In order to discern biological attributes of the identified module genes, we performed Gene Ontology (GO) enrichment analysis by using DAVID 49 . DAVID bioinformatics resources consist of an integrated biological knowledgebase and analytic tools aimed at systematically extracting biologically meaning from a list of genes or proteins. Fisher Exact tests were conducted in DAVID to compute the P-value for each GO term. In this case study, only GO terms with an adjusted P-value (Benjamini & Hochberg) of less than 0.25 were selected. Cell-specific expression was assessed with an online tool Gene Enrichment Profiler 50 . This tool computes the expression and enrichment of any set of query genes on the basis of a reference set obtained from 126 normal tissues and cell types (represented by 557 microarrays).