Phenome-wide search for pleiotropic loci highlights key genes and molecular pathways for human complex traits

Over recent decades, genome-wide association studies (GWAS) have dramatically changed the understanding of human genetics. A recent genetic data release by UK Biobank has allowed many researchers worldwide to have comprehensive look into the genetic architecture of thousands of human phenotypes. In this study, we developed a novel statistical framework to assess phenome-wide significance and genetic pleiotropy across the human phenome based on GWAS summary statistics. We demonstrate widespread sharing of genetic architecture components between distinct groups of traits. Apart from known multiple associations inside the MHC locus, we discover high degree of pleiotropy for genes involved in immune system function, apoptosis, hemostasis cascades, as well as lipid and xenobiotic metabolism. We find several notable examples of novel pleiotropic loci (e.g., the MIR2113 microRNA broadly associated with cognition), and provide several possible mechanisms for these association signals. Our results allow for a functional phenome-wide look into the shared components of genetic architecture of human complex traits, and highlight crucial genes and pathways for their development.


Introduction
Over the recent decades, many advances have been made in uncovering the genetic architecture of human complex traits. Genome-wide association studies (GWAS), emerging at the onset of the millennium, have facilitated the discovery of susceptibility loci for both complex disease and quantitative traits (reviewed in Mills and Rahal 1 ). Most recent GWAS for common phenotypes such as type 2 diabetes include up to 900,000 individuals and allow to identify numerous novel risk loci 2 . Many methods have been developed to interpret the GWAS signal on both single-variant and genome scale, including a group of specialized gene set enrichment tests [3][4][5] aimed at discovery of biological pathways and mechanisms related to complex traits. Integration of GWAS and expression quantitative trait loci (eQTL) data has also shed light onto the role of gene expression in complex traits, and allowed for identification and validation of causal genes 6 .
Advances in biobanking, genetic data acquisition, and analysis have allowed first glimpse into the genetic architecture of multiple traits at once, with the UK Biobank (UKB) genetic dataset of 500,000 individuals being extensively used for diverse studies in statistical genetics 7 . These included, for example, estimation of natural selection effects on complex trait-associated variation 8 , and development of trait-specific risk scores for identification of individuals at high risk for complex disease 9 . Genetic associations in the UK Biobank data have attracted the attention of researchers worldwide, and several efforts have been made to aggregate genetic association data across all traits [10][11][12] . These studies highlighted the complexity of the human phenome, and pinpointed notable loci with multiple associations (e.g., the MHC locus).
Several attempts have been made to assess the association of individual variants or genes across the whole phenome (phenome-wide association studies, PheWAS) (e.g., Schmidt et al 13 ), though no unified statistical framework to interpret such phenome-wide associations has been proposed. Pleiotropy of trait-associated variants in the human genome has also attracted lots of attention in the field; and Mendelian randomization based approaches have been proposed to detect pleiotropy in GWAS data 14 . At the same time, UK Biobank dataset offers an excellent opportunity to analyze shared components of the genetic architecture of human phenotypes on the phenomescale, with global overviews suggesting widespread signals of pleiotropy in the UK Biobank data 15,16 . In this study, we developed a statistical framework to explore the landscape of phenome-wide associations in GWAS summary statistics derived from UK Biobank dataset, and identified multiple shared blocks of genetic architecture of diverse human complex traits.

Complexity reduction by clustering of similar traits
To conduct a genome-wide scan for pleiotropic loci, we first obtained sets of significantly associated SNPs for all phenotypes in the UK Biobank data using pre-calculated GWAS summary statistics provided by Benjamin Neale's lab (release 1, downloaded 2018-02-25). The dataset included both standard GWAS loci as well as imputed variants, totaling 10,894,597 variants. We focused our analysis only on 543 complex traits that have significant non-zero partitioned heritability estimates (h 2 ). A total of 469,013 (4.27%) SNPs had at least one phenotype associated at genome-wide significant level, with an average of 4.34 phenotypes associated with each locus.
Interestingly, we observed numerous multiple associations across the dataset, with 230296 (49.21%) of SNPs having > 1 associated phenotype, and more than 10 associated phenotypes for 57,856 (12.34%) SNPs. While many of these multiple associations could be true pleiotropic variants, much of the signal likely arises from statistical artifacts or multiple highly correlated phenotypes. Hence, we went on to cluster traits that share a significant proportion of their genetic architecture. We applied the Jaccard distance metric as a simple and efficient way that can group traits based on pre-selected significant SNPs. Hierarchical clustering based on Jaccard distance between phenotypes identified 307 independent trait clusters with 1.76 phenotypes on average. Among these, 29 clusters comprised 3 or more phenotypes (see Extended Data Figure 1). The clustering procedure has dramatically decreased the amount of SNPs having multiple associations (92,224 (19.7%) SNPs with more than one association compared to 230,296 in non-clustered data) (see Extended Data Figure 2); as well as the average number of associated trait clusters (1.33 cluster per variant). Notably the number of pleiotropic SNPs with more than 10 associations has dramatically dropped to 0.014% (64 variants) after performing clusterization procedure ( Figure   1d-e). Interestingly, variants that have > 5 associated clusters also have higher minor allele frequencies, i.e. most rare variants are not pleiotropic (Figure 1e). Similar results were obtained when comparing SNPs with less or more than 15 associated individual phenotypes (see Extended Data Figure 3), indicating that the clustering procedure didn't affect the relationship between MAF and pleiotropy.

Investigation of molecular pathways relevant to major trait clusters
We then focused on investigation of the genetic architecture of the major trait clusters. Many of the identified clusters comprised biologically similar complex traits or same phenotypes in different encoding (ICD-10 and self-reported disease status, etc.). As a way to decipher the molecular basis of traits and clusters, we used a modified version of the gene set enrichment analysis we call Locus Set Enrichment Analysis (LSEA) which accounts for linkage disequilibrium (LD) between SNPs and can work with few significantly associated SNPs for each trait (see Online methods for description of the method).
To focus solely on shared genetic architecture components in each cluster, we considered loci which were significantly associated with 2 or more phenotypes for each cluster. In contrast to other published methods 3,5 , LSEA approach sensitively identified enrichment for coagulation hallmark genes and specific coagulation pathways (e.g., intrinsic pathway, extrinsic pathway) for the cluster 35 comprising blood clotting and other related phenotypes (Figure 2a Interestingly, we also observed highly significant enrichment for GPCR signaling pathway, which is concordant with investigation of GPCR signaling regulation in hypertension 17 . Yet another cluster comprised phenotypes characterized by high cholesterol level, with enriched gene sets predominantly related to lipid digestion, mobilization and transport, and small molecule metabolism. Genes from these gene sets were also overrepresented in the association signal of cluster 34, which included most heart pathologies, such as heart attack, angina, and acute diseases such as asthma, hypo-and hyperthyroidism, and psoriasis, was expectedly enriched for immune system signaling pathways, e.g. cytokine-related ones (see Extended Data Figure 9).
Overall, we observed substantial overlap between gene sets enriched in major trait clusters.
We identified 8393 gene sets that showed significant enrichment for at least 2 different phenotypes or clusters (760 of which belonged to the hallmark gene sets or canonical pathways set defined by Molecular Signatures Database (MSigDB) 18,19 . Of these, more than half had 50 or more phenotypes with enrichment signal (Figure 2c). We also investigated the enrichment statistics for another approach based on selection of gene sets enriched for at least 2 phenotypes in each cluster (Extended Data Figure 10). The results of such analysis were highly concordant with the ones obtained by using shared associated loci.
We then turned our attention to the analysis of gene sets which were highly enriched in the major trait clusters. To this end, we selected top-100 enriched gene sets for each cluster, and focused on gene sets having 2 or more enrichments after such filtering. Quite expectedly, we observed that most of the shared enriched gene sets belonged to the immune system, metabolism, and extracellular matrix ( Figure 2d). Of these, hemostasis cascade elements, matrisome components, and xenobiotic metabolism genes were enriched in the largest number of trait clusters.
We also assessed the enrichment of association signal inside two large regions of chromosome 6, the MHC region and the histone gene cluster 1, which is known to harbor multiple binding sites for the master regulator of MHC genes, CIITA 20 . We found 12 trait clusters to be significantly associated with variation inside the MHC locus, supporting its pivotal role in complex traits.

Analysis of highly pleiotropic loci
As we observed a large amount of shared gene sets in our functional annotation studies ( Figure 2d), we went on to systematically identify specific variants and loci involved in multiple complex traits and trait clusters. To this end, we first constructed a statistical model based on Poisson distribution that accounts for the randomly expected number of associated clusters for SNPs with different minor allele frequency (MAF), with such model accurately predicting the proportion of signal-harboring SNPs for each MAF bin. We used the constructed model to obtain the empirical p-value of multiple associations for each SNP given its MAF, and selected variants that showed genome-wide significance (P < 5 * 10 -9 ); i.e. the ones involved in many more associations than expected by a random coincidence. Such filtering resulted in 2,000 genome-wide significant pleiotropic loci corresponding to 53 independent chromosomal regions. Pathway analysis of such pleiotropic loci identified strong enrichment for diverse immunity-related molecular pathways including complement system, chromosome maintenance, and hemostatic cascades.
We then went on to analyze the functional impact of the identified pleiotropic variants. We We then set off to characterize the specific sets of loci shared by different clusters. Firstly, we investigated the overall structure of the pleiotropy network represented as weighted graph with nodes corresponding to trait clusters and edges corresponding to shared pleiotropic loci defined above. 62 trait clusters shared at least one pleiotropic locus with the others (see Extended Data One of the most notable loci included the FTO gene, which was found to be associated with 14 clusters. This finding is concordant with previous reports suggesting this gene to be connected to obesity and many obesity-related comorbidities 23 . Variants in the FTO gene affecting body composition and metabolic rates, are likely to increase risk of diabetes (clusters 14, 34), vascular problems (33), and lipid metabolism disorders (73). Another interesting single-gene pleiotropic locus comprised the MSRA gene associated with trait clusters related to anxiety (6), neuroticism (11), and sleep duration (7). This gene has already been linked to neuroticism 24 . We also found the LPA gene to be connected to clusters comprising diverse heart problems and increased cholesterol level, including myocardial infarction and chronic ischemic heart disease (belonging to cluster 34), high cholesterol level (73), and ezetimibe intake (24). The relationship between SNPs in lipoprotein(a) [Lp(a)] and major adverse cardiovascular (MACE) events has also previously been reported 25 .
Among the other genes, ABO was found to be associated with trait clusters related to a wide variety of vascular phenotypes such as warfarin treatment (72), blood pressure dysregulation (33), embolism (35), and heel bone mineral density (195). While the role of ABO in thrombosis has been excessively studied 26 , the overall significance of this locus to the majority of cardiovascular phenotypes has not been previously investigated. Yet another group of loci included APOE and TOMM40 genes associated with body mass index (21), diverse lipid-metabolism associated phenotypes (73), Alzheimer's disease (25). Observed data is quite convergent with studies of APOE and TOMM40 alleles causing Alzheimer's disease 27 , and further support a solid link between lipid metabolism and diverse complex phenotypes.
One of the most interesting pleiotropic loci included the MIR2113 gene encoding a microRNA of unknown function. This locus is associated with several phenotypes related to cognitive characteristics, including college or university degree qualifications (cluster 11) and fluid intelligence score (cluster 153), as well as ion concentrations (sodium in urine (181), potassium in urine (153)). Previously, MIR2113 was shown to play a role in memory processes 28 .
However, the exact mechanism of its function has never been reported. Our data on its involvement in maintenance of ionic balance, together with a strong enrichment for potassium channel genes among its predicted targets according to miRBase (Extended Data Figure 15), suggest that MIR2113 regulates cognitive processes through regulation of ion channel gene expression, which is important for understanding the development of cognition-related traits.

Discussion
Large-scale genetic datasets, such as the UK Biobank or the Genome Aggregation Database (gnomAD) 29 allow the first ever look into the 'Holy Grail' of human genetics research, i.e. the map of genotype-phenotype relationships. In our study, we aimed at identification of highly pleiotropic loci in the human genome and dissecting the key genetic architecture components relevant to human complex traits. The UKB genetic dataset contains a massive amount of similar and/or related traits which are commonly separated into several major domains 16,30 . To overcome this limitation and group traits that share a significant proportion of their genetic architecture independently of their arbitrary classification, we used a hierarchical clustering approach using the simple Jaccard distance metric calculated using sets of significantly associated variants. While such an approach could misclassify phenotypes with low signal-to-noise ratio, it efficiently clusters similar and identical phenotypes. The results of such clustering approach are concordant with the other efforts to computationally classify traits in the UKB dataset 12,16 . Notably, our clustering procedure identified close relationships between traits that have been demonstrated as comorbid to each other. For instance, deep venous thrombosis and pulmonary embolism combine in one cluster (35) with diverse thromboembolisms. It was clearly shown that deep venous thrombosis significantly increases the risk of pulmonary embolism 25 . Grouping of asthma, diabetes, and increased risk of nasal polyps (cluster 14) is also consistent with known data 31,32 . Therefore, our results are in good concordance with existing biological and clinical knowledge.
One of the major results presented in this work is the comprehensive functional annotation of GWAS signal for individual traits and trait clusters. To perform such functional annotation using a limited set of significant variants, we've constructed an LD-aware algorithm for GSEA (LSEA), which allowed us to identify crucial molecular pathways involved in multiple groups of phenotypes. For example, we discovered multiple associations for genes involved in innate and adaptive immunity (Figure 2d), as well as an overwhelming enrichment for variants inside the HLA locus. These results provide additional evidence that variation in genes and proteins involved in immune system functioning affects a large number of developmental and physiological processes. Similar observations have been made previously using site-level PheWAS 33 .
Interestingly, we also discover high degree of pleiotropy for the loci inside the HIST1 gene cluster, which is known to harbor multiple binding sites for the CIITA transactivator protein that regulates MHC gene expression 34 . We also observe stronger association with traits from the smoking-related cluster (24) for variants inside the HIST1 gene cluster than for the MHC genes themselves. This association signal provides new mechanistic insights into the previously reported association of the histone genes with increased risk of squamous cell lung carcinoma 35  For example, we report much higher multiplicity of genetic associations for the FTO locus, comprising breast cancer, heel bone mineral density, anthropometric traits and sleep duration. We report additional associations for the diabetes-related GCKR gene, such as gout, high cholesterol levels and narcolepsy. We also expand the list of associated phenotypes for the APOE gene, adding vascular/heart problems (chronic ischemic heart disease and angina) and pulse rate. Among the sets of pleiotropic loci we found the SLC39A8 gene associated with osteoporosis, as also reported previously 39 . The same locus is reported to be associated with another osteoporosis-and osteoarthritis-related traits that were also investigated in a recent meta-analysis of UKB data 40 .
These include wait-hip ratio, BMI, body fat percentage, strenuous sport, alcohol intake, and prospective memory result. The exact relationships between these traits, though, are still to be investigated.
We identify several key properties of the pleiotropic variation. First, SNPs with high amount of associated traits and clusters tend to be common (Figure 1e, Extended Data Figure 3). This is unexpected given that variants with larger effect sizes tend to be rare due to negative selection 8 . However, low degree of pleiotropy for rare variants might be explained by sample size limitations. Secondly, pleiotropic loci identified in this study are significantly enriched in open chromatin regions bearing H3K4Me3 and H3K27Ac histone modifications as identified by ChIP-Atlas search 22 (Figure 3d). This is consistent with previous observations made for immune system 41 . Third, variants with high degree of pleiotropy have higher functional impact: these are enriched for both protein-altering missense variants and broader range eQTLs (Figure 3c). All of the aforementioned properties of pleiotropic loci are in very good concordance with the results of other global overviews of pleiotropy made using UKB data 15,16 .
Overall, our approach allowed us to detect multiple novel interrelations between complex traits, which can shed light on common molecular patterns driving the human phenome. Further investigations would help to determine causal relationships between key genes and molecular pathways that give rise to multigenic disease. Investigation of all possible genetic interactions between common and specific risk factors should help to decipher sophisticated molecular mechanisms behind complex traits.
We acquired data from the Benjamin Neale's lab website (UK Biobank GWAS results imputed v2, downloaded at 2017-10-04; http://www.nealelab.is/uk-biobank). We focused our attention only on phenotypes with significant non-zero partitioned heritability estimates (p < 0.05, estimates provided by the authors of the dataset). After obtaining the list of heritable phenotypes we retained genome-wide significant variants for each phenotype (p-value < 5 * 10 -9 ). Association summary statistics for each variant for each trait were then merged into a singly matrix for further analysis.

Clustering of traits
To group similar traits in space of pre-selected associated variants, we utilized the Jaccard distance metric calculated for each pair of traits as follows: where D is the distance between two phenotypes, and T1 and T2 are sets of associated SNPs for two traits. We then performed hierarchical clustering together with the analysis of silhouettes to determine the most appropriate number of clusters. We then calculated the number of associated trait clusters for each variant. To this end, we considered an SNP to be associated with a cluster if this SNP is associated with at least one trait inside this cluster.

Description of the LSEA method
We To construct such universe, we merged all loci that were identified for the whole UKB dataset. We then transformed all of the curated gene sets obtained from the MsigDB database (http://software.broadinstitute.org/gsea/msigdb/annotate.jsp) the following way: for each interval i in the intervals universe, we assigned i to a gene set if at least one gene in the gene set overlaps i. Such method efficiently corrects for functionally related genes that are genetically linked, as intervals spanning several genes belonging to each gene set are counted only once.
The enrichment statistic is computed for each gene set as follows: , where n is the number of intervals in the query set, = 1 if interval j belongs to a gene set of interest, and = 0 otherwise. Enrichment p-value is then computed from hypergeometric distribution. The resulting p-values are then adjusted for multiple comparisons using the Holm FDR correction.

Gene set enrichment analysis of individual traits and clusters
We used the LSEA method to analyze gene set enrichment for all individual phenotypes and trait clusters. We used two complementary approaches to make functional annotations of multi-trait clusters. The first approach was based on using variants that are significantly associated with at least 2 of the traits in each cluster (termed "union of SNPs" on Figure 2c). For the second approach, we used LSEA results for all individual phenotypes in each cluster, and retained only gene sets that were significantly enriched for 2 or more traits in a cluster (termed "union of gene sets" on Figure 2c.
As the HLA locus and the HIST1 gene cluster were omitted during gene set analysis with LSEA, we analyzed the overrepresentation of variants inside these loci separately. To this end, we created sets of intervals for the HIST1 and HLA regions in a way analogous to the LSEA method.
We then obtained the overrepresentation p-values using hypergeometric distribution, with the test statistic being equal to the number of intervals in the query that overlap the regions of interest.

Identification of pleiotropic genetic variants
To assess the phenome-wide significance and identify key pleiotropic variants, we , where k is the number of traits or trait clusters associated with a given variant. Variants with p-value < 5 * 10 -9 were selected as pleiotropic SNPs and used for further analyses.

Overrepresentation analysis for functional variant classes
To analyze the overrepresentation of cis-eQTLs, we used the GTEx data (REF). Sets of variants of interest (i.e., pleiotropic variants and variants with at least 1 association) were overlapped with significant cis-eQTLs (p < 5 * 10 -8 ). Overrepresentation of cis-eQTLs in the set of pleiotropic SNPs was assessed by comparison of the resulting proportion of significant eQTLs using the hypergeometric distribution.
We also annotated all SNPs from the UKB dataset with SnpEff 44 and separated the variants by their functional type (e.g., missense, intronic, etc. (the distributions are shown in Extended Data Figure 14)). We then compared the proportions of certain functional classes of variants in the sets of pleiotropic SNPs and SNPs bearing at least one association. The proportions were compared using the hypergeometric test.
To analyze the enrichment of epigenetic marks and chromatin states in the set of pleiotropic loci, we used the previously described ChIP-Atlas server 22 . For this analysis, sets of pleiotropic variants were converted to an interval list, and the resulting intervals were used as the input for ChIP-Atlas. To narrow down the search space, we narrowed the analysis down to a set of bloodderived genomic tracks. For each type of epigenetic mark, we retained the minimal enrichment pvalue for further analysis.

Network analysis of pleiotropic loci
To group loci that share sets of associated clusters, we constructed a network of genetic associations for the identified pleiotropic loci. To this end, we created a graph representation of the data, with nodes corresponding to individual clusters and edge weight between a pair of adjacent nodes corresponding to the number of shared independent loci. To search for individual fully connected subgraphs of this network (i.e., nodes sharing the same sets of loci) we used a simple deterministic algorithm. At the first stage, we identified all possible large combinations of nodes connected by an individual pleiotropic locus by iterating over edges of the graph. We then merged loci that share the same combinations of associated nodes.
Data and source code availability