Main

Rare genetic variants with large effects on disease can have direct implications for the development of potential treatments1; however, studying their effects comprehensively requires large, broadly phenotyped samples4. Identifying variants that influence disease risk only in the homozygous state (recessive inheritance) is particularly challenging, as the square of variant frequencies means that the homozygous state is often exceedingly rare. By contrast, in populations that have encountered a recent reduction in population size, certain founder diseases with recessive inheritance are present at higher frequencies. The Finnish population has experienced such bottleneck events and has been, historically, relatively isolated from other European populations5. As a result, the Finnish population is characterized by higher rates of DNA stretches with a common origin6,7 carrying particular sets of genetic variants. This leads to higher rates of homozygosity, and increases the chance occurrence of pathogenic variants in a homozygous state that lead to diseases with recessive inheritance. In consequence, there is an enrichment of 36 specific Mendelian genetic diseases such as congenital nephrotic syndrome, Finnish type (CNF)8 in certain areas of Finland today that show mostly recessive inheritance. These 36 (at present) ‘founder diseases’ are referred to as the ‘Finnish disease heritage’2,9. Analogous founder diseases in other populations include Tay-Sachs disease and Gaucher disease in Ashkenazi Jewish individuals10; Hermansky–Pudlak syndrome in Puerto Rican individuals11; and autosomal recessive spastic ataxia of Charlevoix–Saguenay and Leigh syndrome, French Canadian type in French Canadian individuals12,13. Populations that have undergone recent bottlenecks are also characterized by an excess of mildly deleterious variants, which are derived from rare variants that stochastically increased in frequency after a bottleneck event4. 
This has been previously shown in Finland through an excess of potentially deleterious probable loss-of-function (pLoF) variants at lower to intermediate frequencies (around 0.5%–5%)3,14. The higher allele frequencies of deleterious founder variants increase the statistical power for detecting disease associations. Isolated populations have thus been successfully used to map disease genes for decades5,15,16.
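To make the frequency argument concrete, the following minimal sketch computes the expected number of homozygous individuals under Hardy–Weinberg proportions (homozygote frequency equal to the square of the allele frequency). The cohort size is taken from the text; the allele frequencies and the function name are illustrative assumptions, and the calculation ignores the excess homozygosity by descent that the text describes for the Finnish population.

```python
def expected_homozygotes(allele_freq: float, n_individuals: int) -> float:
    """Expected homozygote count under Hardy-Weinberg proportions: q^2 * N."""
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be between 0 and 1")
    return allele_freq ** 2 * n_individuals


# Cohort size from the text; the allele frequencies below are illustrative only.
COHORT_SIZE = 176_899
for q in (0.0005, 0.005, 0.05):
    n_hom = expected_homozygotes(q, COHORT_SIZE)
    print(f"allele frequency {q}: about {n_hom:.2f} expected homozygotes")
```

Under these assumptions, a tenfold drop in allele frequency reduces the expected number of homozygotes a hundredfold, which is why founder variants that have drifted to higher frequencies are so much easier to study in the homozygous state.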

Genetic variants can have different effects on disease when in a monoallelic state (only one allele carries the variant; heterozygous) or a biallelic state (both alleles carry the variant; homozygous) (see Fig. 1). For common genetic variants, early genome-wide association studies (GWASs) found that additive models captured most genotype–phenotype associations, including those with non-additive (also called dominance) effects17. The vast majority of GWASs were therefore conducted with additive models only. By contrast, Mendelian disease variants are rarely described as additive (or equivalently, as semidominantly inherited18,19). Even for well-known examples of semidominant inheritance, such as the LDLR gene, which is associated with familial hypercholesteremia, diseases caused by monoallelic and biallelic variants are listed as separate recessive and dominant conditions in standard databases20 (for example, ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) and the Online Mendelian Inheritance in Man (OMIM) database (https://www.omim.org/)). Although this nomenclature is limiting, for many Mendelian disease variants, monoallelic and biallelic phenotypes do have distinct features. For example, pathogenic variants in ATM cause cancer predisposition in a monoallelic and in a biallelic state, but ataxia telangiectasia only in a biallelic state. Genetic variants can also have distinct molecular effects that are differently inherited. In SERPINA1, for example, the same variant causes a dominantly inherited liver phenotype with a gain-of-function mechanism but a recessively inherited lung phenotype with a loss-of-function mechanism21.

Fig. 1: Schema of different effect sizes of monoallelic (heterozygous) versus biallelic variant states.

A is the wild-type and B is the mutant allele. We distinguish five main scenarios that are associated with different modes of inheritance used in rare-disease genetics (first row of the table at the bottom). In rare-disease genetics, the phenotypes associated with the mono- and the biallelic state in scenarios 2, 3 and 4 are usually viewed as distinct disease entities, with the monoallelic phenotype regarded as dominantly inherited and the biallelic phenotype, which is usually more severe, regarded as recessively inherited. In the schema, we focus on autosomal inheritance and do not show overdominant or underdominant inheritance (rare outside the HLA region). A perfectly linear additive genetic architecture (scenario 3) is also described, in which no dominance effect contributes to phenotypic variation.

In this study, we analyse the effects of coding variants on 2,444 disease phenotypes using data from nationwide electronic health records of 176,899 Finnish individuals from the FinnGen project22. Participants are largely recruited through hospital biobanks and the data are thus enriched for individuals with diseases across the clinical spectrum. The phenotypes are derived from national healthcare registries collected over more than 50 years. In addition to the standard additive GWAS model23, we systematically search for recessive associations. This enables us to examine two related questions of interest to both Mendelian and quantitative genetics communities. First, we investigate the potential benefit of searching for recessive associations in GWASs. Second, we consider the broad phenotypic consequences of Mendelian disease variants in the heterozygous as well as the homozygous state, and highlight how the current nomenclature could be improved to more precisely describe the complex inheritance of Mendelian variants.

Recessive disease associations

In light of the global enrichment of deleterious variants in Finland3,14 and the well-described Finnish disease heritage2,9, we analysed data from the large population cohort gnomAD24 and found that a set of variants that are known to cause disease with recessive inheritance are enriched in the Finnish population (Extended Data Fig. 1 and Supplementary Note 1). We thus saw the opportunity to identify novel recessive associations in our Finnish dataset. We performed a phenome-wide association study (pheWAS) to search for the effects of coding variants on 2,444 disease phenotypes in 176,899 Finnish individuals. When using the conventional GWAS model assuming additive genotype effects (see also scenario 3 in Fig. 1) in 82,647 variants, we found 1,788 significant associations (P < 5 × 10−8) for 445 coding variants in 305 genes22. For 44,370 coding variants with 5 or more homozygous individuals, we also investigated whether any of them had disease effects in the homozygous state (recessive GWAS model; see also scenario 5 in Fig. 1 and Supplementary Tables 1 and 2). We identified 124 associations—involving 39 unique variants—where the recessive model fitted substantially better than the additive model (recessive P value two orders of magnitude smaller than the additive GWAS P value; Fig. 2), corresponding to 31 unique loci (Fig. 2b, Supplementary Table 3 and Supplementary Fig. 1). Our simulations (Supplementary Note 2, Extended Data Fig. 2 and Supplementary Table 4) supported this as a suitable way to identify recessive associations. The simulations also showed how a recessive model can uncover associations that are missed by a conventional GWAS model—particularly for rare variants (minor allele frequency (MAF) < 0.05). This is consistent with more than half of our recessive associations not being genome-wide significant in the additive GWAS model (Table 1). 
Another 93 associations (of 73 unique variants) were identified in which the recessive P value was smaller than the additive P value (Supplementary Table 1). Those likely included non-additive effects as a simulated additive effect only showed a smaller recessive P value than additive P value in around 2 in 100,000 simulations.
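The selection rule described above can be sketched as a small classifier. This is an illustration under stated assumptions rather than the study's pipeline: the function name and the category labels are hypothetical, and requiring genome-wide significance in the recessive model is our reading of the text.

```python
import math

GENOME_WIDE_P = 5e-8   # significance threshold stated in the text
MIN_LOG10_GAP = 2.0    # "two orders of magnitude" from the text


def classify_association(p_additive: float, p_recessive: float) -> str:
    """Label one variant-phenotype pair by comparing the two model P values."""
    if p_additive <= 0 or p_recessive <= 0:
        raise ValueError("P values must be positive")
    # Assumption: only recessive-model hits that are genome-wide significant are kept.
    if p_recessive >= GENOME_WIDE_P:
        return "not significant in recessive model"
    gap = math.log10(p_additive) - math.log10(p_recessive)
    if gap >= MIN_LOG10_GAP:
        return "recessive"
    if gap > 0:
        return "recessive P smaller"
    return "additive"


print(classify_association(p_additive=1e-6, p_recessive=1e-10))
```

The middle category corresponds to the further associations in which the recessive P value was smaller than the additive P value without reaching the two-orders-of-magnitude gap.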

Fig. 2: P values of the additive versus the recessive GWAS model of all genome-wide-significant variant–disease associations.

Associations in which the P value of the recessive model is two orders of magnitude lower than that of the additive model are shown in blue (category: ‘recessive’); all other associations are shown in black. a, All associations. b, Recessive associations, broken down by known inheritance modes of the respective disease gene (source: OMIM). Independent loci are shown, treating adjacent variants with r² > 0.25 that are associated with the same (parent) trait as one locus. For clarity, −log10 P values are capped at 50.

Table 1 Recessive associations

Of the 31 unique loci, 13 were in known OMIM genes (Fig. 2), all of which were previously described with recessive disease inheritance. Of these, 9 of the 13 variants were known disease-causing variants20 (6 pathogenic or likely to be pathogenic (‘likely pathogenic’); 3 likely pathogenic by at least one submitter) and had large effects on disease, as expected. Among the remaining 18 variants in non-OMIM genes, we highlight CASP7 (adult-onset cataract) and EBAG9 (female infertility) (Fig. 3 and Supplementary Note 3). We investigated the effect of the EBAG9 variant on fertility in 106,732 women born between 1925 and 1975 (thus with an approximately complete reproductive period) in our FinnGen data. We found that individuals homozygous for the EBAG9 variant had fewer children (1.7 ± 1.3 (mean ± s.d.)) than did wild-type individuals (2.0 ± 1.3, Wilcoxon rank test P = 0.0002), and had their first child at a later age (Extended Data Fig. 3), supporting the association of this variant with infertility. The six common (allele frequency greater than 10%) variants with recessive associations all overlapped known associations in the GWAS catalogue; however, for four of the six, a non-additive model of association had not been previously documented (Supplementary Table 5; variants in linkage disequilibrium with r2 > 0.1; ref. 25). Among the less-common variants, genes GJB2 and EYS each contained two significantly associated likely pathogenic variants. For those and other genes, we thus identified compound heterozygous effects; that is, biallelic effects of two different pathogenic variants on separate alleles (see Extended Data Fig. 4 and Supplementary Note 4).

Fig. 3: Age at first diagnosis of variants with recessive disease associations.

Data are shown as survival plots. a, Missense variant in CASP7 associated with cataract; not previously described (P = 2.5 × 10−16); n = 176,899. b, Missense variant in C10orf90 associated with hearing loss (P = 2.2 × 10−12); only recently described53; n = 176,899. c, Intronic variant in EBAG9 associated with female infertility (P = 1.6 × 10−11); not previously described; n = 110,361 female individuals. Survival curves of wild-type individuals are coloured in blue, heterozygotes in yellow and homozygotes in red. The 95% confidence intervals of the point estimates are shaded in light blue, light yellow or light red.

We next sought to validate any recessive associations in the UK Biobank (UKBB)26 and in the FinnGen data freeze R6, an updated larger version of our original dataset. Owing to the isolation of the Finnish population2, most of our variants with outsized homozygous effects were Finnish-enriched (allele frequency more than two times higher than that in Europeans who are not Finnish, Swedish or Estonian). Thus, in total, only 13 out of 31 variants were suitable for validation in the UKBB, 8 of which had nominally significant recessive associations to related phenotypes in the UKBB (Extended Data Fig. 5 and Supplementary Tables 6 and 7). We excluded seven variants with lower homozygote case counts in R6 compared to R4 as those may be due to technical differences between data freezes. Of the remaining 24 variants, 18 remained genome-wide significant in FinnGen R6. In total, we validated 20 recessive associations in FinnGen R6 and/or the UKBB (Table 1), including 4 novel associations. Recessive associations that could not be validated are listed in Supplementary Tables 1 and 3. Most homozygous individuals also had a substantially earlier onset of disease than did wild-type individuals affected with the same diseases, providing additional evidence for the effect of these variants on disease (Extended Data Fig. 6).

We next investigated whether any of the variants with outsized homozygous effects also had subtle heterozygous effects on the same diseases, by excluding homozygous and compound heterozygous individuals from the analysis. We found variants with nominally significant heterozygous effects in SERPINA1, NPHS1 and CASP7 (see Table 1). In addition, we confirmed a previously hypothesized27 heterozygous effect of a pLoF variant in GJB2 on hearing loss in the larger R6 replication dataset (Fig. 4a; P = 0.02, β = 0.11). We thus classified the inheritance of these variants as recessive, with rare expressing heterozygotes (Fig. 1).

Fig. 4: Age at first diagnosis of known disease-associated variants.

Data are shown as survival plots. a, Known likely pathogenic variant (known recessive inheritance) in GJB2 associated with hearing loss also in a heterozygous state (P = 0.02). The y axis is cut at 0.9 for clarity. b, Known likely pathogenic variant in XPA associated with skin cancer (P = 8 × 10−11). In a homozygous state, this variant causes xeroderma pigmentosum with childhood-onset skin cancer40. c, Likely benign missense variant in DBH protects from hypertension (P = 5.2 × 10−13). (DBH is associated with the recessively inherited disease dopamine beta-hydroxylase deficiency, which is characterized by severe hypotension31). a and b show R4 data (n = 176,899); c shows R6 data (n = 234,553). Survival curves of wild-type individuals are coloured in blue, heterozygous individuals in yellow and homozygous individuals in red. The 95% confidence intervals of the point estimates are shaded in light blue, light yellow or light red.

Effects of known disease variants

Public databases of variants such as ClinVar20, the largest, are central for routine clinical genetics but—as with many research community efforts—can include errors. We investigated whether pheWASs of 2,444 disease phenotypes could potentially enhance the interpretation of some ClinVar variants. Specifically, such efforts could support ClinVar submissions that are seen in individual patients with statistically robust associations. We first cross-validated our own results, and found the disease associations for the most frequent likely pathogenic variants across a wide range of phenotypes as expected (see Supplementary Note 5, Supplementary Fig. 1 and Supplementary Table 8). In addition, we provide examples in which our data can help to verify or falsify previously described disease associations or identify novel associations (Fig. 4b). We then examined the collective association of several groups of ClinVar variants, as for many rare variants we were only powered to find their effects on disease with moderate significance (Methods and Extended Data Fig. 7). As anticipated, we found global disease associations for likely pathogenic variants in genes that are described to cause disease with dominant inheritance (classification: OMIM). However, we also found global disease associations for variants that are listed as benign or likely to be benign (‘likely benign’). Likely benign variants are defined as ‘not implicated in monogenic disease’28, and are often considered to be neutral29. However, 16 likely benign variants were likely causally associated with disease in our data (using statistical fine-mapping30; see Supplementary Table 9 and Supplementary Note 6). None of them had Mendelian effect sizes and would thus not cause monogenic disease. Rather, they moderately increased disease risk or protected from phenotypes that are mostly similar to the Mendelian phenotypes that are associated with the same gene. 
Of these benign variants, we highlight a variant in the gene DBH. DBH is a gene associated with dopamine beta-hydroxylase deficiency (inheritance: recessive), which is characterized by severe hypotension31. We observed a likely benign missense variant in DBH that conveyed protection from hypertension (see Fig. 4c), a plausible finding given the association of DBH with hypotension. The variant is also an example for a Finnish-enriched variant (allele frequency of 0.05, which is 22 times higher than that in Europeans who are not Finnish, Swedish or Estonian).

Variants with monoallelic and biallelic effects

Because we observed unexpected disease effects from heterozygous variants in genes with reported recessive inheritance, we sought to further contrast the effects of variants with their previously described modes of inheritance. We found multiple coding variants with effects that did not match their annotated inheritance. These include known variants in CHEK2, JAK2, TYR, OCA2 and MC1R (previously described as dominant inheritance) that have additive effects on cancer and cancer-related phenotypes (Supplementary Fig. 2). Similarly, we highlight one variant in SCN5A that was previously associated with severe cardiac arrhythmia such as sick sinus syndrome32 in a biallelic state, which we confirm in our data (Fisher’s exact test, P = 9 × 10−4; odds ratio (OR) = 48 (95% confidence interval: 6–319)). In a heterozygous state, however, that same SCN5A variant protects from cardiac arrhythmia in FinnGen (β = −0.48, P = 2 × 10−8, posterior inclusion probability = 0.996 (ref. 30), indicating probable causality), including atrial fibrillation (β = −0.62, P = 7 × 10−7). We could replicate this association in the UKBB26 (β = −0.39, P = 0.04). In line with the subtle heterozygous effects of individual variants, we found global disease effects of 203 likely pathogenic variants in disease genes with recessive inheritance (see Extended Data Fig. 7 and Methods). Both simulations and reanalysis after excluding variants with homozygotes from the global analysis indicated that modest heterozygous effects are likely to contribute to this signal. In disease genes with known recessive and/or dominant inheritance, we found 79 additional coding variants (not likely pathogenic) with effects on disease that are likely to be additive (additive P < recessive P) and causal30. Of these 79 variants, 11 had Mendelian effect sizes (OR > 3). In summary, we identify several variants that have disease effects in a heterozygous and a homozygous state in known Mendelian disease genes.
The modes of inheritance of these variants cannot be described with the usual labels of recessive and dominant. Our data thus indicate the need for a nomenclature that integrates Mendelian and complex genetic effects on diseases. We outline a suggestion in Fig. 1.

Discussion

A subset of disease-causing variants is enriched to unusually high frequencies in populations with a history of recent bottlenecks, such as the Finnish population. This applies particularly to homozygous variants that cause disease with recessive inheritance. The FinnGen cohort, given its size and its inclusion of broad medical phenotypic data over the lifespan of an individual, is well-powered for discovering novel alleles with large effects in homozygous individuals and for studying the disease architecture of Mendelian variants.

We found multiple variants in known Mendelian disease genes20 with large effects in homozygous individuals that have weaker but significant effects in heterozygous individuals for the same or closely related diseases, highlighting that describing their inheritance with only dominant and recessive labels does not adequately describe disease biology. Terms beyond recessive and dominant are, however, rarely used in clinical genetics. Although it has been estimated that semidominant inheritance is much more common than true dominant inheritance33, phenotypes of homozygotes are frequently unknown because they are rare or too severe for an individual with the phenotype to survive to birth. The term semidominant inheritance is more frequently used in animal and plant genetics34, in which mono- and biallelic effects can be more easily quantified and systematically studied. Aggregating the evidence for pathogenicity across biallelic and monoallelic observations, rather than viewing these as separate disease entities, could benefit the interpretation of clinical variants, as well as providing a more accurate description of biology. Furthermore, we found that several variants that were previously described as likely to be benign are associated with disease. This is a noteworthy reminder that such likely benign variants are generally defined as not causing monogenic disease and should not necessarily be regarded as neutral29.

Adding to an increased complexity of Mendelian inheritance, we detected modest heterozygous effects on disease in variants that are known to cause disease only in a homozygous state (recessive inheritance). This is in line with previous studies that found subtle heterozygous fitness effects of variants that are described as causing disease with recessive inheritance in mice35, Drosophila36 and humans37. We found heterozygous effects of such variants in the genes NPHS1 (previously debated38), SERPINA1 (known21) and GJB2 (previously hypothesized27). As expected for recessive inheritance, the variants had an order-of-magnitude-larger effect size and, often, an earlier onset of disease in a homozygous than in a heterozygous state, thereby exceeding a linear additive model. We also found a heterozygous pLoF variant in XPA that increases susceptibility to adult-onset skin cancer. The effect was far larger than is usually found by GWASs and could thus provide valuable information on personal risk. Although a nominally significant association (P = 0.01) of a heterozygous pLoF variant in XPA with skin cancer was found in a previous study39, we provide here, to our knowledge, the first definitive evidence. Homozygous pLoF variants in the gene XPA are known to cause a related phenotype, xeroderma pigmentosum, which is characterized by extreme vulnerability to UV radiation and childhood-onset skin cancer40. Long before these large-scale data became available, small heterozygous effects were found in variants that cause Mendelian disease with recessive inheritance41, in some cases conferring an advantage against certain infectious diseases42,43,44. Similarly, we find that one variant in SCN5A, which was previously associated with severe cardiac arrhythmia such as sick sinus syndrome32 in a biallelic state, protected from mild cardiac arrhythmia diseases in a heterozygous state in our data. Previous experimental data found a mild loss-of-function effect of this variant32,45.
This is in line with a potential slowing of electrical conduction in the heart in a few individuals who are heterozygous for the SCN5A variant32, which could thus provide a protective mechanism against cardiac arrhythmia.

Of course, it is possible that this and other heterozygous effects we observe come from low-frequency variants in a compound heterozygous state that were not captured by our genotyping array. However, a different age of disease onset in heterozygous than in homozygous individuals suggests that that scenario is unlikely to explain most of the observed heterozygous effect. In addition, with population-specific exome sequencing and imputation we could account for the presence of additional pathogenic variants at frequencies greater than 0.2%. Additional limitations are the lack of more in-depth phenotypes, including symptoms that are not captured by International Classification of Diseases (ICD) codes, or serological or diagnostic tests missing subtle physiological differences between heterozygous and wild-type individuals. Furthermore, the Finnish population bottleneck leads to a lower number of rare variants, which limits new discoveries. We thus cannot investigate relatively common European Mendelian disease variants; for example, variants in CFTR, which cause cystic fibrosis46.

We systematically investigated recessive associations of coding variants genome-wide with pheWAS at biobank scale. We could validate 20 loci (4 of them novel) that had large biallelic effects without, or with only nominally significant, heterozygous effects. Novel associations included complex non-syndromic diseases with few or no previously described large-effect variants, such as adult-onset cataract (new disease gene: CASP7) and female infertility (new disease gene: EBAG9). Intuitively, these novel findings appeared in phenotypes (deafness, cataracts and infertility) for which Mendelian subtypes would not obviously be clinically distinguished from other common presentations, although, as in previous examples, an earlier age of onset is seen in the cases of common diseases of ageing. Biallelic associations of rare coding variants have been found in other population biobanks47,48,49,50,51, but have not (to our knowledge) been investigated in a broad phenotype context in most studies, which lack the scale, the advantage of isolation and/or the high rates of homozygosity by descent. We suggest that searching for homozygous effects is most meaningful for coding and structural variants as opposed to regulatory variants with only weak effects. This is consistent with the observations that dominance effects are very modest in GWASs of common variants in complex human traits52, but that recessive inheritance in Mendelian disease is widespread.

In summary, our biobank-scale additive and recessive pheWAS of coding variants shows the benefit of including recessive scans in GWASs. We find known and novel biallelic associations across a broad spectrum of phenotypes such as retinal dystrophy, adult-onset cataract and female infertility that are missed by the standard additive GWAS model. As a related point, we find an underappreciated complexity of inheritance patterns of multiple Mendelian variants. Our study could thus provide a starting point for reconciling the variant-effect nomenclature of the conventionally separate but more-and-more overlapping fields of Mendelian and complex genetics.

Methods

Ethics and data access approvals

The Coordinating Ethics Committee of the Hospital District of Helsinki and Uusimaa (HUS) approved the FinnGen study protocol HUS/990/2017. The FinnGen project is approved by the Finnish Institute for Health and Welfare (THL) (approval number THL/2031/6.02.00/2017; amendments THL/1101/5.05.00/2017, THL/341/6.02.00/2018, THL/2222/6.02.00/2018 and THL/283/6.02.00/2019), the Digital and Population Data Services Agency (VRK43431/2017-3 and VRK/6909/2018-3), the Social Insurance Institution (KELA) (KELA 58/522/2017, KELA 131/522/2018 and KELA 70/522/2019) and Statistics Finland TK-53-1041-17. The Biobank access decisions for FinnGen samples and data used in FinnGen data freeze 4 include: THL Biobank BB2017_55, BB2017_111, BB2018_19, BB_2018_34, BB_2018_67, BB2018_71 and BB2019_7, Finnish Red Cross Blood Service Biobank 7.12.2017, Helsinki Biobank HUS/359/2017, Auria Biobank AB17-5154, Biobank Borealis of Northern Finland_2017_1013, Biobank of Eastern Finland 1186/2018, Finnish Clinical Biobank Tampere MH0004, Central Finland Biobank 1–2017 and Terveystalo Biobank STB 2018001.

Patients and control individuals in FinnGen provided informed consent for biobank research, according to the Finnish Biobank Act. Alternatively, separate research cohorts, collected before the Finnish Biobank Act came into effect (in September 2013) and before the start of FinnGen (August 2017) were collected on the basis of study-specific consents and later transferred to the Finnish biobanks after approval by Fimea (Finnish Medicines Agency), the National Supervisory Authority for Welfare and Health. Recruitment protocols followed the biobank protocols approved by Fimea. The Coordinating Ethics Committee of the Hospital District of Helsinki and Uusimaa (HUS) statement number for the FinnGen study is HUS/990/2017. UK biobank data were accessed under protocol 31063.

Funding and partners

We acknowledge the participants and investigators of the FinnGen study. The FinnGen project is funded by two grants from Business Finland (HUS 4685/31/2016 and UH 4386/31/2016) and the following industry partners: AbbVie, AstraZeneca UK, Biogen MA, Bristol Myers Squibb (and Celgene Corporation & Celgene International II Sàrl), Genentech, Merck Sharp & Dohme, Pfizer, GlaxoSmithKline Intellectual Property Development, Sanofi US Services, Maze Therapeutics, Janssen Biotech, Novartis and Boehringer Ingelheim International. The following biobanks are acknowledged for delivering biobank samples to FinnGen: Arctic Biobank (https://www.oulu.fi/medicine/node/207208), Auria Biobank (https://www.auria.fi/biopankki), THL Biobank (https://thl.fi/en/web/thl-biobank), Helsinki Biobank (https://www.helsinginbiopankki.fi), Biobank Borealis of Northern Finland (https://www.ppshp.fi/Tutkimus-ja-opetus/Biopankki/Pages/Biobank-Borealis-briefly-in-English.aspx), Finnish Clinical Biobank Tampere (https://www.tays.fi/en-US/Research_and_development/Finnish_Clinical_Biobank_Tampere), Biobank of Eastern Finland (https://ita-suomenbiopankki.fi/en/), Central Finland Biobank (https://www.sairaalanova.fi/en-US), Finnish Red Cross Blood Service Biobank (https://www.veripalvelu.fi/verenluovutus/biopankkitoiminta), Terveystalo Biobank (https://www.terveystalo.com/fi/yhtio/biopankki) and the Finnish Hematology Registry and Clinical Biobank (https://www.fhrb.fi). All Finnish Biobanks are members of the BBMRI infrastructure (https://www.bbmri-eric.eu/national-nodes/finland/). Finnish Biobank Cooperative (FINBB; https://finbb.fi) is the coordinator of BBMRI-ERIC operations in Finland.

Coding variants in FinnGen, release 4

Greater haplotype sharing in the Finnish population facilitates the imputation of lower-frequency variants in array data with a population-specific reference panel down to frequencies below 0.0005 (ref. 54). Genotypes were thus generated with arrays thereby enabling the large scale of the FinnGen research project. In summary, we investigated 82,647 coding variants (2,634 pLoF, 76,884 missense and 3,129 others) in the FinnGen project, release 4, 8/2019 in 176,899 Finnish individuals. We excluded the HLA region (chr. 6, 25 Mb–35 Mb). In 110,361 individuals, sex was imputed as female. We filtered to variants with INFO > 0.8 at a median allele frequency of 0.004 (minimum allele frequency 3 × 10−5). Around half of the samples came from existing legacy collections, and the other half was from participants who were newly recruited to the FinnGen project. Samples were genotyped on custom microarrays and rare variants were imputed using a population-specific reference panel54. To calculate variant enrichment in Finnish individuals after a bottleneck event, we used as a general European reference point exomes from European samples in gnomAD 2.1.1, excluding those from Finland, Sweden and Estonia. Owing to large-scale migrations from Finland to Sweden in the 20th century, a substantial fraction of the genetic ancestry in Sweden is of recent Finnish origin, and the linguistically (and geographically) close population of Estonia is likely to share elements of the same ancestral founder effect. See ref. 22 or https://finngen.gitbook.io/documentation/ for a detailed description of data production and analysis.
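The variant filters and the enrichment comparison described above can be illustrated with a short sketch. The thresholds (INFO > 0.8, HLA region on chromosome 6 between 25 Mb and 35 Mb) come from the text; the `Variant` record, its field names and the helper functions are hypothetical and do not reflect the FinnGen data format.

```python
from dataclasses import dataclass

HLA_CHROM, HLA_START, HLA_END = "6", 25_000_000, 35_000_000  # excluded region (from the text)
MIN_INFO = 0.8                                                # imputation INFO threshold (from the text)


@dataclass(frozen=True)
class Variant:
    chrom: str     # chromosome name without a "chr" prefix (assumed convention)
    pos: int       # base-pair position
    info: float    # imputation INFO score
    af_fin: float  # allele frequency in Finnish samples
    af_ref: float  # allele frequency in the European reference (excluding Finland, Sweden, Estonia)


def passes_filters(v: Variant) -> bool:
    """Keep well-imputed variants that lie outside the HLA region."""
    in_hla = v.chrom == HLA_CHROM and HLA_START <= v.pos <= HLA_END
    return v.info > MIN_INFO and not in_hla


def enrichment_ratio(v: Variant) -> float:
    """Finnish allele frequency relative to the reference; infinite if absent from the reference."""
    return float("inf") if v.af_ref == 0 else v.af_fin / v.af_ref


example = Variant(chrom="9", pos=1_000_000, info=0.95, af_fin=0.05, af_ref=0.05 / 22)
print(passes_filters(example), round(enrichment_ratio(example), 1))
```

A variant would count as Finnish-enriched in the sense used in the main text when this ratio exceeds 2.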

GWAS searching for additive and recessive associations

We performed a GWAS on 2,444 disease end-points, investigating the effects of 82,647 coding variants with an additive and recessive model using the method SAIGE23. Covariates in FinnGen were age, sex, genotyping batch and the first 10 principal components of genotypes. To identify heterozygous effects, we performed a GWAS with an additive model after excluding homozygous and compound heterozygous individuals, where possible (see Supplementary Note 4). In the recessive GWAS model we analysed the effects of homozygous alleles on disease phenotypes in comparison to wild-type and heterozygous alleles. The wild type is defined as the reference allele in FinnGen. The docker container finngen/saige:0.39.1.fg with all necessary software used to run SAIGE in additive or recessive mode can be found at the docker container library hub.docker.com. We replicated our genome-wide significant recessive associations in 234,553 individuals of FinnGen data freeze R6 and in 420,531 individuals with European ancestry in the UKBB using the same recessive model in SAIGE. GWAS covariates in UKBB were age, sex, age × sex, age², age² × sex and the first 10 principal components of genotypes. The genotype and phenotype files along with ancestry definitions, phenotype definitions and SAIGE null models were taken from the PAN UKBB project and are further described at https://pan.ukbb.broadinstitute.org. For additional information on and results from FinnGen data freeze R6, see the websites listed in the ‘Data availability’ section.
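The difference between the models lies in how genotypes are encoded as a predictor. The sketch below shows that encoding only; it is not SAIGE's interface, it assumes hard-called genotypes (0, 1 or 2 alternate alleles) rather than imputed dosages, and all function names are hypothetical.

```python
from typing import List


def _check(alt_alleles: int) -> None:
    if alt_alleles not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2 alternate alleles")


def additive_code(alt_alleles: int) -> int:
    """Additive model: the predictor is the count of alternate alleles."""
    _check(alt_alleles)
    return alt_alleles


def recessive_code(alt_alleles: int) -> int:
    """Recessive model: homozygous alternate (1) versus wild type or heterozygous (0)."""
    _check(alt_alleles)
    return 1 if alt_alleles == 2 else 0


def heterozygous_only(genotypes: List[int]) -> List[int]:
    """Drop homozygous-alternate individuals before testing for heterozygous effects."""
    return [g for g in genotypes if g != 2]


genotypes = [0, 1, 2, 1, 0, 2]
print([additive_code(g) for g in genotypes])
print([recessive_code(g) for g in genotypes])
print(heterozygous_only(genotypes))
```

Excluding compound heterozygous individuals, as described above, would additionally require phased or gene-level information that this sketch does not model.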

Annotating variant effects from ClinVar

We annotated variants using the 25 March 2020 release of ClinVar20. For any variant included in the main tables, we rechecked current classifications in ClinVar and OMIM on 2 November 2021. We grouped variants into categories according to their ‘ClinVar_ReviewStatus’. Our main categories were likely pathogenic (likely to be pathogenic or pathogenic; 311 variants), conflicting evidence (at least one submitter labelled a variant as likely pathogenic but at least one other submitter labelled it differently; 298 variants) and likely benign (likely to be benign or benign; 10,948 variants). The ClinVar annotations of the likely benign top causal variants were of above-average quality: a mean of 1.6 stars (7 of the 16 likely benign top causal variants had a one-star and 9 of 16 a two-star review status in ClinVar), compared with an average of 1.2 stars for all likely benign variants in ClinVar (range: zero to three stars). Other categories into which we grouped variants, but which we do not explicitly discuss in the manuscript, were ‘association’ (26 variants), ‘drug response’ (59 variants), ‘not provided’ (141 variants), ‘protective’ (14 variants), ‘risk factor’ (74 variants) and ‘VUS’ (variant of unknown significance; 3,269 variants); see Supplementary Table 8.
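The three main categories can be expressed as a simple rule over submitter labels. The sketch below is hypothetical: it assumes a list of per-submitter clinical significance strings for each variant, and it treats ‘pathogenic’ and ‘likely pathogenic’ as agreeing with each other. The actual grouping was derived from the ClinVar release fields and may differ in detail.

```python
# Label sets used by this sketch (lower-cased ClinVar-style strings).
PATHOGENIC = {"pathogenic", "likely pathogenic"}
BENIGN = {"benign", "likely benign"}


def group_variant(submitter_labels):
    """Assign a variant to one of the main categories from its submitter labels."""
    labels = {label.strip().lower() for label in submitter_labels}
    if labels and labels <= PATHOGENIC:
        return "likely pathogenic"      # every submitter says (likely) pathogenic
    if labels & PATHOGENIC:
        return "conflicting evidence"   # some say (likely) pathogenic, others differ
    if labels and labels <= BENIGN:
        return "likely benign"          # every submitter says (likely) benign
    return "other"                      # e.g. VUS, risk factor, not provided


# Made-up examples.
examples = {
    "variant_a": ["Pathogenic", "Likely pathogenic"],
    "variant_b": ["Likely pathogenic", "Benign"],
    "variant_c": ["Benign"],
    "variant_d": ["Uncertain significance"],
}
groups = {name: group_variant(labels) for name, labels in examples.items()}
```

The reported mean review status is consistent with the stated counts: (7 × 1 + 9 × 2)/16 = 25/16 ≈ 1.56, which rounds to 1.6 stars.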

Annotating inheritance mode from OMIM

We downloaded the OMIM catalogue of human genetic diseases (https://www.omim.org/), version 06/2019. From OMIM, we annotated genes implicated in causing disease with a recessive or dominant inheritance mode.

Global phenotype associations of ClinVar variant categories

We compared global disease phenotype associations of different ClinVar variant categories (likely benign, likely pathogenic or conflicting variants in genes with dominant or recessive inheritance) with phenotype associations of random intergenic variants. For a given variant category, we counted how many variants had at least one significant GWAS hit (across 2,444 phenotypes) below a given P value threshold. We then compared these counts with the number of top GWAS loci below the same P value threshold in 1,000 random samples of intergenic variants. We used empirical P values to test whether any ClinVar variant category had significantly more disease associations than random intergenic variants at the respective P value thresholds. Allele frequency influences the power with which significant associations are identified. Therefore, we adjusted for allele frequency by sampling intergenic variants in 15 equal-sized bins that corresponded to the allele frequency of the variants under investigation. To account for linkage disequilibrium, we sampled intergenic variants from the same 3-Mb windows as variants in the respective gene set.
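A minimal sketch of the allele-frequency-matched permutation scheme follows. Several details are assumptions rather than the authors' implementation: bins are quantiles of the target variants' allele frequencies, one intergenic variant is drawn (with replacement) per target variant from the matching bin, the empirical P value uses the common (r + 1)/(n + 1) estimator, and the 3-Mb window matching is omitted. All names and example data are hypothetical.

```python
import random
from bisect import bisect_right


def bin_edges(afs, n_bins):
    """Inner edges that split the sorted allele frequencies into n_bins similar-sized groups."""
    s = sorted(afs)
    return [s[len(s) * i // n_bins] for i in range(1, n_bins)]


def bin_index(af, edges):
    """Index of the allele-frequency bin that af falls into."""
    return bisect_right(edges, af)


def count_hits(min_pvals, threshold):
    """Number of variants whose best P value across phenotypes is below threshold."""
    return sum(p < threshold for p in min_pvals)


def empirical_p(target, background, threshold, n_bins=15, n_samples=1000, seed=0):
    """target and background are lists of (allele_frequency, min_p_across_phenotypes)."""
    rng = random.Random(seed)
    edges = bin_edges([af for af, _ in target], n_bins)
    pools = {}
    for af, p in background:
        pools.setdefault(bin_index(af, edges), []).append(p)
    target_bins = [bin_index(af, edges) for af, _ in target]
    observed = count_hits([p for _, p in target], threshold)
    at_least = 0
    for _ in range(n_samples):
        # One frequency-matched intergenic variant per target variant.
        sampled = [rng.choice(pools[b]) for b in target_bins]
        if count_hits(sampled, threshold) >= observed:
            at_least += 1
    return (at_least + 1) / (n_samples + 1)


# Made-up data: 30 target variants that all have a hit, and a null background.
afs = [(i + 1) / 1000 for i in range(30)]
target_hits = [(af, 1e-10) for af in afs]
target_null = [(af, 0.5) for af in afs]
background = [(af, 0.5) for af in afs]
p_enriched = empirical_p(target_hits, background, threshold=5e-8)
p_not_enriched = empirical_p(target_null, background, threshold=5e-8)
```

The sketch raises a `KeyError` if a target bin has no intergenic variant to draw from, which makes an inadequate background set visible instead of silently biasing the null distribution.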

Age at first diagnosis

We compared age at first diagnosis in homozygous and in heterozygous individuals, each against wild-type individuals, using Wilcoxon rank tests. For a few compound heterozygous variants (as indicated in the paper), we also performed survival analyses with age at first disease diagnosis as the outcome, using a Cox proportional hazards model with the same covariates that were used in the GWAS (sex, age, genotyping batch and first 10 principal components).
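For two independent groups, a rank-based comparison of this kind is commonly the Wilcoxon rank-sum (Mann-Whitney U) test; that this specific variant was used is an assumption. The sketch below implements it with the standard library only, using midranks for ties and a normal approximation without continuity correction. The ages are made up, and the function would fail if all observations were identical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test; returns (U statistic for x, approximate P value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both groups, remembering group membership (0 = x, 1 = y).
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        midrank = (i + 1 + j) / 2      # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


# Made-up ages at first diagnosis.
ages_homozygous = [40, 45, 50]
ages_wild_type = [60, 65, 70, 75]
u_stat, p_value = rank_sum_test(ages_homozygous, ages_wild_type)
```

With samples this small an exact test would be preferable to the normal approximation; the approximation is used here only to keep the sketch short.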

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.