Identifying interpretable gene-biomarker associations with functionally informed kernel-based tests in 190,000 exomes

Here we present an exome-wide rare genetic variant association study for 30 blood biomarkers in 191,971 individuals in the UK Biobank. We compare gene-based association tests for separate functional variant categories to increase interpretability and identify 193 significant gene-biomarker associations. Genes associated with biomarkers were ~ 4.5-fold enriched for conferring Mendelian disorders. In addition to performing weighted gene-based variant collapsing tests, we design and apply variant-category-specific kernel-based tests that integrate quantitative functional variant effect predictions for missense variants, splicing and the binding of RNA-binding proteins. For these tests, we present a computationally efficient combination of the likelihood-ratio and score tests that found 36% more associations than the score test alone while also controlling the type-1 error. Kernel-based tests identified 13% more associations than their gene-based collapsing counterparts and had advantages in the presence of gain of function missense variants. We introduce local collapsing by amino acid position for missense variants and use it to interpret associations and identify potential novel gain of function variants in PIEZO1. Our results show the benefits of investigating different functional mechanisms when performing rare-variant association tests, and demonstrate pervasive rare-variant contribution to biomarker variability.

= 1.01). Of the 117 distinct genes, 43 (37%) were associated with more than one biomarker, and a few genes 121 had five or more significant associations: ANGPTL3, APOB, JAK2, GIGYF1 and G6PC. Many of the genes 122 we found to be associated with biomarkers had either been implicated in diseases related to those biomarkers 123 (e.g., LRP2 with renal markers [31]) or are mechanistically related to the biomarkers themselves (e.g., cystatin 124 C with its own gene, CST ).

125
80% of genes (±5kb) contained single variants associated with the same biomarkers identified by a study using 126 imputed genotypes and a larger sample size on the same data [23]. Yet, 97% of gene-phenotype associations

Clinical relevance of biomarker associations
Genes identified by biomarker associations were highly enriched for those involved in Mendelian disorders (odds 140 ratio 4.47, p-value 3.08 × 10 −16 , one-sided Fisher's exact test), as determined by querying the Online Mendelian 141 Inheritance in Man database (OMIM, https://omim.org/). 74 out of 125 (59.2%) distinct (i.e., non-overlapping) 142 genes identified had at least one OMIM entry. Our results show that changes at the biomarker level are detectable 143 even in the 27 genes for which only recessive disorders were listed. 144 We further examined whether rare variants in these genes were associated with binary or continuous traits  Table 4). 21 out of 125 genes associated with biomarkers were also associated with one or more binary traits, 147 whereas 14 genes were associated with non-hematological quantitative traits in either study (Table 2). 148 Often such cases could be placed in a plausible causal context. For example, we found variants in SLC22A12,

149
ABCG2 and ALDH16A1 to be associated with increased levels of urate, a causal factor for gout, which they 150 were also associated with. Another example are genes associated with growth hormone IGF-1 (IGFALS, GH1, 151 GHRH, PAPPA2 ) which are also associated with anthropometric traits (e.g., sitting height). We identified 152 patterns of biomarker associations for GIGYF1 consistent with Type II diabetes (T2D), and an association 153 with T2D was confirmed at larger sample sizes [15,16] and other independent studies [32,33]. Six out of seven 154 biomarker associations for JAK2 were only identified by kernel-based tests, and the gain of function variant 155 driving these associations (V617F or rs77375493 [34]) was also associated with myeloproliferative neoplasms.

156
In other cases, trait-biomarker relationships were not readily apparent, such as the association of variants in 157 SYNJ2 with lower levels of γ-glutamyltransferase and increased risk of hearing loss.

158
Separating variant effect categories during association testing and comparing kernel-based to gene-based 159 collapsing tests allowed us to further interpret our results, as illustrated with several examples below. 160 2.6 Combining variant annotations yields six associations for G6PC 161 We found five associations for G6PC, three of which were only found when combining protein LOF and missense 162 variants with gene-based variant collapsing tests (alkaline phosphatase, SHBG, urate). Variants in G6PC cause 163 Glucose-6-phosphatase deficiency type Ia [35,36], an autosomal recessive disease categorized by growth retar-164 dation, enlarged kidneys and liver, low blood glucose, and high blood lipid and uric acid levels. Consistent with 165 signs of inflammation and impaired kidney and liver function, we found elevated levels of alkaline phosphatase 166 (p = 1.15 × 10 −9 ), γ-glutamyltransferase (p = 2.64 × 10 −11 ), and C-reactive protein (p = 1.62 × 10 −13 ) in in-167 dividuals with predicted high-impact missense or pLOF mutations in G6PC. We further identified a significant 168 association with decreased levels of sex hormone-binding globulin (SHBG, p = 2.88 × 10 −9 ), which is primar-    .    Figure 5), which was predicted to increase the binding probability of QKI.

252
The same variant on the opposite strand was also responsible for the association of DOCK7 with triglycerides.

253
However, we determined that ANGPTL3 was more likely the causal gene, given the associations of ANGPTL3 254 with triglycerides we had independently found with other variant categories and the close proximity of an 255 ANGPTL3 exon. 256 We further investigated this variant by assessing binding probabilities of RBPs beyond the six RBPs in focus 257 here that are represented in DeepRiPe. We found that the variant was also predicted to decrease the binding 258 probability of BCLAF1, a factor related to mRNA processing [59] and increase binding of splice-regulator 259 HNRNPL ( Figure 5). Using attribution maps [25], we found that instead of strengthening or inserting a QKI 260 binding motif, this variant weakens a splice donor signal in the presence of upstream binding motifs for the 261 splicing regulators QKI and HNRNPL (Supplementary Figure S5). SpliceAI predicted only a weak upstream 262 donor loss (0.02) for this variant, which was well below the threshold of 0.1 we used to identify splice-altering 263 variants, but indicative of the same trend.

264
The remaining five associations exclusively identified by association tests incorporating RBP-binding pre-265 dictions were those of ATG16L1 with total bilirubin, SOD2 with Lipoprotein A, SLC39A4 with Alanine 266 aminotransferase and SHARPIN with Alanine aminotransferase.

267
Both SHARPIN and SLC39A4 lie within half a megabase of the Alanine aminotranferase gene (GPT ), there-268 fore we could not exclude potential transcriptional cis-regulatory effects as the true cause for these associations.

269
Furthermore, single variants carried most of the signal for both genes.

270
Common intronic variants of ATG16L1 were previously found to be associated with Crohn's disease, in- A predicted splice-variant 16:67270978:G:A is negatively associated with HDL cholesterol (p = 7.83 × 10 −12 , score test), whereas intronic 1:62598067:T:C is negatively associated with Triglycerides (p = 1.37 × 10 −25 , score test). The lower and upper hinges indicate first and third quartiles, whiskers extend to the largest/lowest values no further than 1.5 × IQR away from the upper/lower hinges and black points denote outliers. (b) DeepRiPe binding probabilities for 1:62598067:T:C for three RBPs in HepG2 cells. While predicted probabilities for the reference sequence are ambiguous, the alternative allele shifts binding probabilities in favor of QKI and HNRNPL. All RBPs with absolute predicted variant effects above 0.2 and binding probabilities greater than 0.5 for either reference or alternative alleles are shown.

Combined likelihood ratio and score tests (sLRT)
In order to benefit from both the speed of the score test [8] and higher power of the LRT in the presence of 275 large effect sizes [11, 12] we investigated the use of a combination of these tests, we call the sLRT (score-LRT).

276
The sLRT is a likelihood ratio test that is performed only when initial score tests reach nominal significance at 277 a given cutoff t. If this threshold is not reached, it returns the score test p-value. Throughout our analysis, we 278 used t = 0.1, and found that it was unlikely that a larger threshold would have identified many more associations 279 (Supplementary Figure S6). As the run time is dominated by the cost of computing the LRT, this test can 280 achieve a computational speedup factor of roughly 1/t = 10 over the LRT under the null hypothesis. 281 We found that the sLRT was able to identify more significant associations than the score test for missense 282 and splice-variants when performing kernel-based association tests (Supplementary Figure S7) increase with larger and datasets. 328 We explored the use of deep-learning-derived variant effect predictions for splicing and the binding of RBPs.

329
The restriction to exon-proximal regions meant we only observed a fraction of the variants potentially acting 330 through these mechanisms. Associations found by incorporating splice-predictions largely overlapped with those 331 identified with pLOF variants (which included simple splice donor/acceptor variants). While we found some 332 associations exclusively with splice-predictions, these were mostly due to single variants and would need further 333 validation (e.g., SLC9A5 ). Similar reasoning holds for the associations found with predictions for RBP-binding. 334 We anticipate that deep-learning-based predictions will become more valuable for non-coding regions in 335 whole-genome sequencing studies, for which the approaches we developed will also be applicable.
where X is the N × q covariate design matrix (fixed effect) and α is the vector of fixed-effect parameters, 404 which together determine the mean values of y. The variance-covariance matrix of y is composed of the 405 independently distributed residual variance (I N scaled by σ 2 e ) and the kernel-matrix K g (scaled by σ 2 g ), which 406 captures the genetic similarity between individuals. K g is a function of the N × m matrix of mean-centered 407 minor allele counts G (random effect) of the genetic variants we wish to test.

408
Any valid variance-covariance matrix can be substituted for K g . In order to use efficient algorithms for 409 estimating the parameters σ 2 e and σ 2 g and performing association tests, we require K g to be factored as a quadratic form [9, 11]: in Supplementary Table S1).

450
For predicted splice-variants we followed the same strategy as for missense variants, however, we used the

455
Because some of the genes in the Ensembl 97 release share exons, we encountered cases in which these genes 456 shared associations caused by the same variants. We do not report these as distinct genes in the main text or 457 abstract, but include the full list of 212 associations in Supplementary Table S1.
Where X is the N × q covariate design matrix (fixed effect) and α is the vector of fixed-effect parameters.

690
The variance-covariance matrix of y is composed of the independently distributed residual variance (I N scaled 691 by σ 2 e ) and the kernel-matrix K g (scaled by σ 2 g ), which captures the genetic similarity between individuals. K g 692 is a function of the N × m matrix of mean-centered minor allele counts G (random effect) of the genetic variants 693 we wish to test.

694
In order to use efficient algorithms for estimating the parameters σ 2 e and σ 2 g and performing association tests, 695 we require K g to be factored as a quadratic form [9, 11]: Where the function ϕ transforms G into intermediate variables before performing the test.

697
The test statistic of the score test approximates the change of the log likelihood of a model when including K g 698 over the null model, which does not include K g (σ 2 g = 0)[8]. We calculated test statistics using fast algorithms we sampled 100 test statistics for every LR test, using our own port of RLRsim [77] in Python. Finally, we fit 706 a parametric null distribution πχ 2 0 + (1 − π)aχ 2 d with free parameters π, a and d to the pooled simulated test 707 statistics using log-quantile regression on the 10% of largest test statistics [11], and used this distribution to 708 calculate p-values as described in [9]. We used separate null distributions for all variant-type to test-type and 709 phenotype combinations (see details below).

710
The pooled null distribution is a uniform mixture of gene-specific null distributions. If we were able to obtain 711 more than a single test statistic per gene, this distribution would not be guaranteed to control the gene-specific 712 type-1 error rate. However, as it represents an average mixture across all tests, the average type-1 error rate 713 across all genes remains controlled. 714 We compared this approach to using gene-specific null distributions (i.e., fitting a separate null distribution 715 for every test and gene, similar to the method described in [12]), and found that they produced highly similar 716 results ( Supplementary Figures S9-S11).

717
To this end, we performed two analyses: First, we looked at genes close to or below the genome-wide

Functionally informed kernel-based tests
where W is an m × m diagonal matrix containing the square roots of variant weights on the diagonal and defined, local collapsing can be expressed as a matrix multiplication: S = CC T and the kernel (7) becomes: Here C is the m-variants by g-groups collapsing matrix. Therefore GW C is the n × g weighted locally

791
We used the locally collapsing kernel in the kernel-based association tests for missense variants, as it had given 792 more unique associations and overall slightly lower p-values for the most significant genes in initial experiments 793 on the 50k WES release, and was more interpretable compared to other approaches.

794
Splicing For splice-variants we performed score tests using gene-based variant collapsing and the linear 795 weighted kernel for all genes. Again, if either of the two score tests were nominally significant (p < 0.1), 796 we performed likelihood ratio tests (sLRT). As we did for missense variants, we then also performed combined  B" (PhenoScanner trait), "Cystatin C" to PhenoScanner traits "log eGFR cystatin C", "Serum cystatin c 834 estimated glomerular filtration rate eGFR" and "Cystatin C in serum", and "Urea" to "Renal function related 835 traits urea". 836 5.9 PIEZO1 L2277M association tests 837 We used the ancestry classifications described above to define a group of indidivuals of SAS ancestry, and one 838 of extended EUR ancestry (both using a cutoff of > 0.5 in the ancestry classification model). This is a less 839 stringent cutoff than that used in the EUR-analysis for all biomarkers and increased the number of observed 840 carriers in the EUR-ancestry group from 5 to 21, all heterozygous. We used the same covariates as in the 841 all-ancestry analysis.  Rows and columns are clustered using complete linkage on the Euclidean distances of the correlation matrix between phenotypes (dendrogram). While some even weakly (anti-)correlated biomarkers share significant associations (e.g., Cholesterol and Glucose, gene: GIGYF1 ), other highly correlated markers do not share significant associations (e.g., GGT, ALT, AST). Supplementary Figure S3: Kernel-based tests results overview. We calculated the average effect size for *variants with single-variant p-values below 10 −5 (score test) within significant genes found by kernel-based tests if the cumulative minor allele count across these variants was at least 5. Bubble plots showing these average effect sizes (y-axis) for each biomaker (x-axis). The four genes with largest average effect sizes are labeled for each biomaker. Larger bubble size indicates higher significance of the gene-based test. P-values and average effect sizes are those given by the most significant variant effect category (lead annotation). Effect sizes are calculated on covariate-corrected quantile transformed phenotypes. Supplementary Figure S5: Attribution maps. The variant 1:62598067:T:C at one-based position chr1:62598067 makes DeepRiPe predict increased binding probabilities for HNRNPL and QKI, and decreased probability for BCLAF1. Attribution maps for reference (ref) and alternative (alt) sequences as described in [25] highlight important nucleotides proximal to an ANGPTL3 exon boundary. The predictions for QKI depend positively on an upstream QKI binding motif (ACUAAC), and negatively on the splice donor signal (GUAAGU). The pattern is inverted for BCLAF1. Weakening of the splice signal by the alternative variant increases predicted binding probabilities for QKI and HNRNPL.  Figure S6: sLRT number of significant gene-biomarker associations vs. score test cutoff. We consider the scenario of performing five score tests per gene and biomarker (one collapsing and one kernel-based test for missense and splice variants, and one kernel-based test for RBP-variants), resulting in 2,540,128 tests genome-wide and a Bonferroni corrected p-value threshold of 1.96 × 10 −8 (α = 0.05). The plots show the number of significant associations found by the sLRT depending on the nominal significance cutoff chosen for the score tests (which determine whether the LRT is performed). We used the cutoff of 0.1 (gray vertical line) in our analysis.     Figure S12: RLRT pooled null distribution vs empirical gene-specific quantiles example. For the kernel-based RLRT with missense variants and phenotype triglycerides, we sampled 250,000 test statistics each for 52 genes and fit a χ 2 -mixture distribution (parametric null) to the pooled test statistics (0.59 × χ 2 0 + (1 − 0.59) × 0.96 × χ 2 0.98 ). The top two panels show the empirical inverse cumulative distribution function of the pooled test statistics (y; blue line), against the value of the test statistic (x). The survival function of the non-zero mixture component of the parametric null is overlaid in orange. In the lower panel, the box plot shows the gene-specific quantiles of test statistics (x) corresponding to gene-specific emirical p-value thresholds of 10 −1 , 10 −2 , 10 −3 , 10 −4 and 10 −5 (y) for the 52 genes. Red vertical lines denote the median values, and are extended to the plot above, showing that the mixture distribution accurately captures the median gene-specific quantiles in the tail.