Genetic associations of protein-coding variants in human disease

Genome-wide association studies (GWAS) have identified thousands of genetic variants linked to the risk of human disease. However, GWAS have so far remained largely underpowered in relation to identifying associations in the rare and low-frequency allelic spectrum and have lacked the resolution to trace causal mechanisms to underlying genes1. Here we combined whole-exome sequencing in 392,814 UK Biobank participants with imputed genotypes from 260,405 FinnGen participants (653,219 total individuals) to conduct association meta-analyses for 744 disease endpoints across the protein-coding allelic frequency spectrum, bridging the gap between common and rare variant studies. We identified 975 associations, with more than one-third being previously unreported. We demonstrate population-level relevance for mutations previously ascribed to causing single-gene disorders, map GWAS associations to likely causal genes, explain disease mechanisms, and systematically relate disease associations to levels of 117 biomarkers and clinical-stage drug targets. Combining sequencing and genotyping in two population biobanks enabled us to benefit from increased power to detect and explain disease associations, validate findings through replication and propose medical actionability for rare genetic variants. Our study provides a compendium of protein-coding variant associations for future insights into disease biology and drug discovery.

To define significance, we used a combination of (1) multiple testing corrected threshold of 103 p<2x10 -9 , 0.05/(~26.8x10 6 ) [sum (mean number of variants tested per disease cluster)], to 104 account for the fact that some traits are highly correlated disease subtypes, (2) concordant 105 direction of effect between UKB and FG associations, and (3) p<0.05 in both UKB and FG 106 consistent with approaches in previous large-scale GWAS 51 . Here we assume all coding 107 variants tested are independent, which is a stringent assumption as we ignore the LD, and we 108 tested only the small subset of coding variants rather than the complete genome-wide. Our Let the parameter ! denote the effect of genotype ! on disease endpoint ! in the th study, 135 for a total of = 1, 2, … , ∶ ≥ 2 studies. The meta-analysis inverse variance weighted Z-136 score, which aggregates information across all studies, is given by 137 where ! is the Z-score from the th study, i.e., 139 with 4 ! denoting the sample-based estimate of ! and 0 $ ! being its corresponding standard 141 error. We re-write the above to illustrate the impact of integrating additional independent study 142 information relative to a reference study, which we arbitrarily take to be study = 1. That is, 143 we write !"# as: .

148
( 1 ) 149 Hence, denotes the increase/decrease in Z-score computed on aggregating results across 150 studies relative to the reference study. When > 1 the aggregated !"# score is larger than the 151 reference Z-score ( ' ) -a scenario we refer to as 'IVW uplift'. Our goal is to assess changes 152 in as a function of MAF and MAF-enrichment between = 2 studies. To fix ideas, we are 153 interested in assessing uplift when (i) the reference study is the largest study and (ii) MAF is 154 enriched in the smaller study, i.e., study 2. Hence, ' ≥ ( and ( ≥ ' . The scenario 155 in which the reference study has larger sample size and also enriched , i.e., ' ≥ ( and 156 ' ≥ ( , is considered later. 157 158

Theoretical description of uplift 159
In this section we present a formula which relates uplift to the core parameters underpinning 160 computation of the test statistic !"# , i.e., ! , disease prevalence ( ! ) and sample size ( ! ), 161 for each of the ∈ {1, 2} studies. In doing so, we will (a) illustrate complexity in the 162 relationship between and the core parameters and (b) highlight the expected utility of a MAF-163 enriched study design, in particular how the probability of detecting novel (rare variant) 164 associations increases as a function of enrichment. 165

166
Let the probability of disease status follow a logistic model, i.e., 167 where denotes a genotype putatively associated with disease risk, ! is the population 170 baseline effect and ! is the effect of genotype on disease risk. Given observed data { ! , ! }, 171 estimates of the effect parameters, denoted { 0 ! , 4 ! }, are typically derived by maximizing the 172 log-likelihood function 173 That is, values for { ! , ! } which satisfy: 176 Assuming that the effect of genotype on disease risk is small, in the sense that | ! | < 1, it 179 follows that 180 181 where we have made the transformation of variables: 187 Hence, when | ! | < 1 the conditional probability of disease is well approximated by the 190 linear predictor ! a. As it will be important later, the relationship between parameters in the 191 logistic and the linear predictor is 192 Combining equations (3) and (4), it follows that the score of the logistic model satisfies 195 which is approximately zero when 197 199 ( 6 ) 200 note that these are the familiar ordinary least-squares estimates of main effect parameters in a 201 linearized model. When the vector of genotypes has been centered, i.e., k = 0 , it is 202 straightforward to show that: 203 . 205 We have used ! * to denote the number of cases in the th study, thus 0 ! * is a measure of 207 population baseline disease prevalence. The linearized genetic effect in equation (4), however, 208 can be interpreted as 209 where ! * is an estimate of the ! in the cases only: 212 We will use equations (8) and (9) later, when aiming to interpret uplift via core model 215

parameters. 216 217
Recall the relationship between the parameters in the logistic and linearized models presented 218 in equation (5). After an application of the delta method, the variance of 4 ! in the logistic model 219 (1) can be written as: 220 On combining equations (5) and (7), we reveal that the Z-score of the genetic effect in the 224 logistic model (1) is well approximated by the estimated Z-score from the linearized effects 225 (equation (7)), i.e., 226 where we have used the fact that sgnX ! * (1 − ! * )Z = 1 from equation (7). Thus, 228 and it follows therefore, that the uplift can be written as: 231 Explicit description of MAF enrichment, disease prevalence, sample size and log odds-235 ratio on IVW uplift ( ) 236 In this section our goal is to gain some intuition as to how uplift can vary in terms of core 237 parameters, e.g., ! , Supplementary Files 2a-c), however these core parameters are implicitly defined in the 240 approximation. To help provide some guidance on their explicit role, we replace 0 ! * , 0 $ ! * and ! * 241 in equation (12) with large sample (population) estimates, i.e., on assuming ! ≫ 1 for ∈ 242 {1,2}. In addition, we make the assumption of weak genetic effects (i.e., | ! | < 1), low disease 243 prevalence ( ! ≪ 1) and rare/low MAF (in both studies), so that: 244 Recall that ! * denote the MAF in the cases, i.e., 251 then, in combination with equations (4) and (7), it follows that 253 , ! ≫ 1 and ∈ {1,2}. 274 In an ideal scenario, one would also be able to compare the Z-score boost for enriched variants 300 using an additional UKB like cohort matched for the sample size of FG which is not currently 301 readily available for all diseases. However, we are able to subset the larger UKB cohort into a 302 FG sample size (i.e., N=260405) matched cohort and use the remaining (N=132409) samples 303 as a "base" analysis cohort and compare the Z-scores of base UKB meta-analysed with FG 304 (UKBxFG) against base UKB meta-analysed with FG sample size matched UKB (UKBxUKB). 305 306 Focusing on associations with MAF<0.1 from Supplementary Table 3, we randomly 307 subsetted UKB into a base UKB and a FG sample size matched UKB cohort (n=4 random 308 subsets). We performed association testing and meta-analysis using the same approach and 309 calculated the ratio of Z-scores (ZUKBxFG/ZUKBxUKB), log10 transformed so that >0 means Z-310 score higher in UKBxFG than UKBxUKB, and <0 means Z-score higher in UKBxUKB than 311 UKBxFG (Extended Data Figure 5d). 312 313 Analysis of Z-score boost due to MAF enrichment and increased sample size in simulated 314 data 315 We further investigate the relative gain in IVW Z-scores across a broader range of enrichment 316 values. For this, we modified our simulation strategy detailed in section Simulations of MAF 317 enrichment effect on inverse-variance weighted meta-analysis Z-scores to match the cohort 318 analyses above. Specifically, we followed the identical simulation protocol, yet now simulated 319 results from a UKBxFG and separately UKBxUKB meta-analysis, followed by computing the where 349  Table 3). 355

551
The HCN4 Asp364Asn variant associated with AF 552 Another AF GWAS locus is tagged by the common intergenic sentinel variant rs74022964 553 between HCN4 and REC11432,33. We identified a rare, FG enriched variant in HCN4 554 Consistently, in addition to the AF signal, we also found an association with decreased heart 560 rate (beta=-0.49, p=3.8x10 -21 ).The Asp364Asn variant lies in a poorly conserved region of the 561 protein and its effects on the HCN4 channel have not been explored (ClinVar). The association 562 with decreased heart rate and increased AF risk potentially points to a LOF effect of HCN4, 563 although further functional studies are required to elucidate this further.  (Supplementary Figure 3). The 665 propositus is a 65-years old female. She was first diagnosed with paroxysmal AF without heart 666 disease when she was 48 years old. 667 668

Reporter gene assay 669
We verified that the increased transactivation effects did not result from a higher protein 670 expression of the variant (see Western blot images in Supplementary Figure 4) or from a 671 change in the intracellular distribution of the recombinant proteins since the nuclear 672 localization was similar to that of the WT (Supplementary Figure 5a). Moreover, they were 673 not linked to the ability of WT and mutated PITX2c proteins to interact with the DNA bicoid 674 binding site as tested by EMSA assay (Supplementary Figure 5b). The full images of the 675 EMSA gel and corresponding Western blot are shown in Supplementary Figure 6. 676 677

qRT-PCR determination of connexins and ion channels 678
Control experiments show the differences in expression of PITX2c (Supplementary Table  679 14) that were taken into account to normalize the mRNA level measurements. Further, it was 680 verified that cells transfected with an empty vector had mRNA levels similar to those in non-681 transfected cells (Supplementary Table 14).

851
Other column names as defined above.