Abstract
Genotypes causing pregnancy loss and perinatal mortality are depleted among living individuals and are therefore difficult to find. To explore genetic causes of recessive lethality, we searched for sequence variants with deficit of homozygosity among 1.52 million individuals from six European populations. In this study, we identified 25 genes harboring protein-altering sequence variants with a strong deficit of homozygosity (10% or less of predicted homozygotes). Sequence variants in 12 of the genes cause Mendelian disease under a recessive mode of inheritance, two under a dominant mode, but variants in the remaining 11 have not been reported to cause disease. Sequence variants with a strong deficit of homozygosity are over-represented among genes essential for growth of human cell lines and genes orthologous to mouse genes known to affect viability. The function of these genes gives insight into the genetics of intrauterine lethality. We also identified 1077 genes with homozygous predicted loss-of-function genotypes not previously described, bringing the total set of genes completely knocked out in humans to 4785.
Introduction
The development of whole-genome sequencing technologies has led to a surge in the discovery of sequence variants causing Mendelian diseases1. However, the genetic causes of intrauterine lethality remain poorly understood2 as our current understanding of sequence variation that causes death of humans is limited to variants where some carriers survive past the early stages of development3. There are limited data available on causes of intrauterine lethality4, and these often go unnoticed5. Genetic causes of loss of blastocyst development, pregnancy loss, and perinatal mortality remain to be thoroughly investigated. A proportion of these pregnancy losses are revealed clinically as miscarriages, while others are unrecognized implantation failures or early pregnancy losses5.
Embryonic lethality has been studied in model organisms, and mouse studies suggest that a quarter of homozygous gene knockouts result in embryonic lethality6,7. Half of the lethal homozygous mouse knockouts die during early gestation6,8 and the majority are estimated to succumb between implantation and gastrulation9.
To date, four studies have reported 3527 autosomal genes with rare biallelic predicted loss-of-function (pLOF) sequence variants (i.e. genes knocked out in humans) that are valuable for assessing physiological and pathological consequences of gene loss-of-function10,11,12,13. Two of these involved populations of Pakistani origin with a high rate of parental relatedness10,11, which reduces the number of individuals that need to be sequenced to detect homozygous genotypes of rare variants. In combined data from these two studies, a total of 13,725 exome sequenced individuals had 1829 genes completely knocked out, of which the majority (>68%) were knocked out in just one individual and where the mean frequency of the pLOF variants was ~0.2%. The remaining two studies involved more outbred populations12,13 where the minority (<34%) of knocked-out genes were observed in just one individual and the mean frequency of the pLOF variants was ~0.5%. In the gnomAD database, 1825 genes are knocked out among 15,708 whole-genome and 125,748 exome sequenced individuals, primarily of European origin13. Finally, in a previous study of 104,220 Icelanders, we observed 6795 pLOF sequence variants in 4924 autosomal genes, detected through whole-genome sequencing of 2636 individuals, and identified 1151 genes with homozygous pLOF genotypes12. There, we also reported a deficit of both double transmissions of pLOFs12,14 from pairs of heterozygous parents and a deficit of homozygosity of pLOF variants relative to their allele frequency in the population, where the greatest deficit was observed for a splice acceptor variant in DHCR7 in the Icelandic population12,14. Cataloging genes with a strong deficit of homozygosity for protein-altering variants in human populations provides insights into potential causes of embryonic and fetal death, stillbirth, death in infancy, or under-sampling because of morbidity15.
In a randomly-mating population, a rare variant present in one per five hundred individuals is expected to be present in one per million in a homozygous state. Consequently, detection of rare homozygous genotypes requires large sample sizes. To date, studies have been limited by sample sizes on the order of 100 thousand individuals, which are not well powered to detect rare homozygous genotypes through testing for deviation from Hardy-Weinberg equilibrium (HWE) expectations.
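The arithmetic behind this expectation can be sketched as follows. This is a minimal illustration, not part of the study's pipeline: the function name is hypothetical, and it assumes that "one per five hundred individuals" refers to the carrier (heterozygote) frequency, which for a rare variant is approximately twice the allele frequency.

```python
def expected_homozygotes(carrier_frequency: float, n_individuals: int) -> float:
    """Expected number of homozygotes under HWE for a rare variant.

    Assumption: for a rare variant the carrier frequency 2pq is
    approximately 2p, so the allele frequency p is half of it.
    """
    allele_frequency = carrier_frequency / 2
    # Under HWE the homozygote frequency is p squared.
    return allele_frequency ** 2 * n_individuals


# A variant carried by 1 in 500 individuals, at three sample sizes.
for n in (100_000, 1_000_000, 1_520_000):
    print(n, expected_homozygotes(1 / 500, n))
```

Under these assumptions a sample of 100 thousand individuals is expected to contain roughly 0.1 homozygotes for such a variant, whereas 1.52 million individuals are expected to contain about 1.5.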
In this work, we identified sequence variants with a strong deficit of homozygosity when taking into account the number of heterozygotes and assuming HWE in a set of 1.52 million North-Western Europeans, an order of magnitude more than in our previous study12. We examine and report genotype counts for both moderate impact (missense, in-frame indels, splice region sequence variants) and pLOF variants (stop-gained, frameshift, essential splice donor, and acceptor sequence variants), and we also combined the pLOF variants in a gene test. To determine whether sequence variants with a strong deficit of homozygosity resulted from losses early or late in pregnancy, we examined the reproductive history of heterozygous carrier couples, looking for an excess of miscarriages or early death among their offspring. Also, we assessed the effects of such variants on RNA and protein levels in heterozygous carriers to provide experimental validation of their functional effect. Finally, we compared the set of genes with a deficit of pLOF homozygosity to experimental data on the viability of mouse knockouts and the critical role of these genes in cell growth.
Results
Deficit of homozygosity
We looked for a strong deficit of homozygosity among protein-altering sequence variants in a meta-analysis of 1.52 million individuals from six populations (Denmark, Iceland, Norway, Sweden, Finland, and the UK). This was based on the imputation of variants detected by whole-genome sequencing of individuals from all of the populations (Fig. 1 and Supplementary Data 1). Of the study participants, 197,146 were whole-genome sequenced (Supplementary Data 1).
We tested 75,178 moderate-impact variants in 14,453 genes and 3024 pLOF variants in 2353 genes (Fig. 1, Supplementary Data 2, and Supplementary Data 3). Of the 3024 pLOF variants, 730 were rated as low-confidence pLOFs by the LOFTEE algorithm (Loss-Of-Function Transcript Effect Estimator)13, leaving 2294 pLOF variants in 1837 genes. A summary of the 75,178 moderate-impact and 3024 pLOF single variants tested is provided as supplementary data (Supplementary Data 4). Additionally, we performed a gene-based test for the deficit of homozygosity, where we created a single biallelic genotype for each gene, indicating whether 0, 1 or both haplotypes in an individual are affected by at least one pLOF variant with a MAF under 2%, excluding the variants flagged as low-confidence by LOFTEE. We refer to such genotypes as geneLOF, and in this way, we were able to test 2757 genes for deficit of homozygosity (Supplementary Data 5 and Supplementary Data 6).
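The construction of a geneLOF genotype described above can be sketched as follows. This is an illustrative sketch only: the class and function names are hypothetical, and it assumes phased data in which each haplotype is represented as the set of pLOF variant identifiers it carries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlofVariant:
    variant_id: str
    gene: str
    maf: float
    loftee_high_confidence: bool  # False if flagged low-confidence by LOFTEE


def gene_lof_genotype(gene, haplotype_a, haplotype_b, variants, maf_threshold=0.02):
    """Return 0, 1 or 2: the number of haplotypes carrying at least one
    qualifying pLOF variant (MAF below threshold, not low-confidence)."""
    qualifying = {
        v.variant_id
        for v in variants
        if v.gene == gene and v.maf < maf_threshold and v.loftee_high_confidence
    }
    return int(bool(qualifying & haplotype_a)) + int(bool(qualifying & haplotype_b))


# Toy variants (hypothetical identifiers and frequencies).
variants = [
    PlofVariant("v1", "GENE1", 0.004, True),
    PlofVariant("v2", "GENE1", 0.001, True),
    PlofVariant("v3", "GENE1", 0.003, False),  # low-confidence, excluded
]

# Compound heterozygote: a different qualifying pLOF on each haplotype.
print(gene_lof_genotype("GENE1", {"v1"}, {"v2"}, variants))
```

In this representation a compound heterozygote for two different qualifying pLOF variants is scored as 2, the same as a homozygote for a single variant, while two pLOF variants on the same haplotype count only once.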
It is well established that deviations from random mating within a population (such as inbreeding or stratification) tend to increase the number of homozygotes16. For sequence variants that increase the risk of deleterious phenotypes among homozygotes, these factors will therefore tend to increase the number of individuals who are exposed to negative selection. In our data, there is an excess of observed homozygotes compared to the expected number of homozygotes under HWE, with rare variants showing the greatest relative excess (Supplementary Fig. 1 and Supplementary Data 2). Based on the genomic inbreeding coefficient, less than 0.71% of study participants are first cousins, or more closely related (Supplementary Data 1). We identified 70,721 individuals (5.4%) who had homozygous geneLOFs (i.e. both parental chromosomes harbor a pLOF variant in the same gene with MAF < 2%) in a combined set of 1.30 million genotyped individuals (excluding the Finnish data set, where individual genotype data was not available). Of the 70,721 individuals with a knockout, 66,727 (94.4%) were predicted to have just one gene knocked out. A total of 2671 genes were knocked out based on geneLOFs in the meta-analysis of all 1.52 million individuals (Supplementary Data 6, and 7). We observed two or more knockouts for 1722 of these 2671 genes (66.3%). In total 1077 of the identified genes have not been reported in previous publications10,11,12,13. Combining the data on knockouts from the current and previous studies10,11,12,13, yields 4785 knocked-out genes, of which 42 are observed in all datasets (Supplementary Data 7).
We considered a variant to have a strong deficit of homozygosity if we observed 10% or less of predicted homozygotes17,18 based on observed heterozygote frequency and the assumption of HWE within populations (Supplementary Fig. 2, Supplementary Data 2, and 5). Variants with a less marked deficit are presented in the section “Incomplete homozygous deficit” in the Supplementary Discussion. pLOF and moderate impact sequence variants have the greatest predicted functional impact and are most likely to affect health and viability19. At the other end of the spectrum are intergenic variants, which have the lowest predicted functional impact19. Therefore, to increase power to detect a deficit of homozygotes, we calibrated our expectation of homozygous protein-altering variants under neutrality and compared the deficit of homozygous genotypes of protein-altering variants to that of intergenic variants. After binning variants based on the expected number of homozygotes under HWE and functional impact, we compared the fraction of protein-altering variants (f_pav) with a strong deficit of homozygosity in each bin to the corresponding fraction of intergenic variants (f_intergenic) to derive a false discovery rate (FDR = f_intergenic/f_pav) (Fig. 2, Supplementary Data 2, and 5). One minus the FDR estimates the fraction of homozygous deficit variants within each bin that is due to negative selection rather than chance, under the assumption that homozygosity for intergenic variants is effectively neutral (1 - FDR = positive predictive value (PPV) = 1 - f_intergenic/f_pav).
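The FDR calculation for a single bin can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the toy counts are hypothetical, each variant is represented as a pair of observed and expected homozygote counts, and all variants passed in are assumed to belong to the same bin of expected homozygote count.

```python
def deficit_fraction(variants, max_ratio=0.10):
    """Fraction of variants with a strong deficit of homozygosity, i.e.
    observed homozygotes at or below 10% of the HWE expectation."""
    flagged = sum(1 for observed, expected in variants if observed <= max_ratio * expected)
    return flagged / len(variants)


def fdr_and_ppv(protein_altering, intergenic):
    """FDR = f_intergenic / f_pav and PPV = 1 - FDR for one bin."""
    f_pav = deficit_fraction(protein_altering)
    f_intergenic = deficit_fraction(intergenic)
    if f_pav == 0:
        raise ValueError("no protein-altering variant with a strong deficit in this bin")
    fdr = f_intergenic / f_pav
    return fdr, 1 - fdr


# Toy bin (hypothetical counts): 2 of 4 protein-altering variants and
# 1 of 20 intergenic variants show a strong deficit.
pav_bin = [(0, 6.0), (0, 7.5), (5, 6.2), (8, 7.0)]
intergenic_bin = [(0, 6.5)] + [(6, 6.5)] * 19
print(fdr_and_ppv(pav_bin, intergenic_bin))
```

With these toy counts, f_pav = 0.5 and f_intergenic = 0.05, giving an FDR of about 0.1 and a PPV of about 0.9.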
pLOF variants with five or more expected homozygotes had an FDR under 3% (Fig. 2, and Supplementary Data 2). Five or more homozygotes were expected for 1736 pLOF variants in 1425 genes. Of these, 16 variants in as many genes were deemed to have a strong deficit of homozygosity (Table 1). The FDR for moderate impact variants with eight or more expected homozygotes was under 6%, and of these, six variants had a strong deficit of homozygosity (Fig. 2, Table 1, and Supplementary Data 2). In comparison, using Bonferroni correction for multiple testing, five variants had a significant deficit for homozygosity for pLOF (P < 0.05/1736 = 2.9 × 10−5, assuming Poisson distribution) and two for moderate impact variants (P < 0.05/47,429 = 1.1 × 10−6, assuming Poisson distribution) (Supplementary Data 2). No low-impact variants had a significant strong deficit of homozygosity after accounting for multiple testing. No deficit of homozygosity was observed for variants with an expected homozygote count above 250, or minor allele frequency (MAF) above ~1.4% (Supplementary Data 2).
geneLOFs with five or more expected homozygotes had an FDR under 4% (Fig. 2 and Supplementary Data 5). Five or more homozygous individuals were expected for 1258 geneLOFs and nineteen of these genes had a strong deficit of homozygosity (Table 1). If we determined significance based on deviation from HWE and use Bonferroni correction for multiple testing (P < 0.05/1258 = 4 × 10−5, assuming Poisson distribution), ten genes had a significant deficit of homozygosity (Supplementary Data 5).
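The Bonferroni-corrected comparison can be illustrated with a short sketch. The text states only that a Poisson distribution was assumed; the one-sided lower-tail test used here, and the function names, are assumptions made for illustration.

```python
import math


def poisson_lower_tail(observed: int, expected: float) -> float:
    """P(X <= observed) for X ~ Poisson(expected): the probability of seeing
    this few homozygotes or fewer when `expected` are predicted under HWE."""
    term = math.exp(-expected)  # P(X = 0)
    total = term
    for i in range(1, observed + 1):
        term *= expected / i  # P(X = i) from P(X = i - 1)
        total += term
    return total


def is_bonferroni_significant(observed, expected, n_tests, alpha=0.05):
    return poisson_lower_tail(observed, expected) < alpha / n_tests


# With 1258 geneLOFs tested, the threshold is 0.05/1258, about 4e-5.
print(poisson_lower_tail(0, 5.0), is_bonferroni_significant(0, 5.0, 1258))
print(poisson_lower_tail(0, 12.0), is_bonferroni_significant(0, 12.0, 1258))
```

Under these assumptions, a gene with five expected and no observed homozygotes (P of about 0.007) meets the 10% criterion for a strong deficit but does not pass the Bonferroni threshold, whereas a gene with twelve expected and none observed (P of about 6 × 10−6) does. This is one way to see why fewer genes are significant after Bonferroni correction than are flagged by the FDR-based approach.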
In total, we identified 25 genes with protein-altering variants with a strong deficit of homozygosity; nineteen involving pLOF variants, and six involving moderate impact variants (Table 1 and Supplementary Data 8). The allele frequencies of the underlying pLOF and missense variants range from <0.001% to 1.4% across the six populations, and the variants are detectable but rarer in publicly available exome and genome sequence databases (Table 1, Supplementary Fig. 3, and Supplementary Data 9) (see Supplementary Discussion for details). Among the 25 genes harboring variants with a strong deficit of homozygosity, 11 have not been reported to cause a Mendelian condition (Table 1). The remaining 14 genes are reported to have variants causing a Mendelian condition (12 under a recessive mode of inheritance, two under a dominant mode), and in ten instances the variant in question has been observed in genotypes classified as pathogenic or likely pathogenic in the ClinVar database20 (Supplementary Data 4, and 10) (see Supplementary Discussion for details).
Effect of variants with a strong deficit of homozygosity on gene expression
We assessed the impact of variants with a significant deficit of homozygosity on RNA splicing (sQTL), mRNA levels (eQTL), and protein levels (pQTL) in the Icelandic population, based on RNA sequencing of blood samples from 17,848 individuals and plasma protein levels measured with 4907 aptamers (SOMAscan) in 35,559 individuals21. We found that the variants in ten of the genes with a strong deficit of homozygosity were in high LD (r2 from 0.8 to 1.0) with five lead sQTLs, six lead cis-eQTLs, and three lead cis-pQTLs (Supplementary Data 11, 12, and 13).
In ATP5PB, the stop gained variant p.Arg185Ter is the lead eQTL for ATP5PB, and is associated with reduced blood mRNA levels (P < 1 × 10−300, effect = −2.5 SD), consistent with nonsense-mediated decay (Supplementary Fig. 4). The splice donor variant c.561_564+4delACAAGTAA in CCDC59 causes a skipping of the third exon of this gene (effect = 2.7 SD, P = 3.0 × 10−229) inducing a frameshift (Supplementary Fig. 4). The start loss variant in GTF2H3 associates with reduced expression (P < 1.3 × 10−30, effect = −1.3 SD) over all exons consistent with a loss-of-function effect (Supplementary Fig. 4).
In our data, the splice region variant c.70+5 G > A associated with reduced mRNA levels of MVD (encoding Diphosphomevalonate decarboxylase; ERG19) in blood (effect = −0.56 SD, P = 7.9 × 10−7), and was a lead cis-pQTL for MVD in plasma (effect = −0.77 SD, P = 5.0 × 10−22) (Supplementary Fig. 5). Heterozygosity of this variant is associated with a high risk of congenital malformations of skin in the UK Biobank (ICD10 code Q82; 1464 cases and 429,474 controls) (MAFUK = 0.41%, OR = 6.8, P = 1.2 × 10−36). This association is consistent with autosomal dominant form of porokeratosis reported in OMIM (OMIM:614714). Diphosphomevalonate decarboxylase is an enzyme involved in cholesterol biosynthesis that catalyzes the conversion of mevalonate pyrophosphate into isopentenyl pyrophosphate. Thus, among heterozygotes, reduced dosage increases the risk of malformations of the skin but does not impact life expectancy. On the other hand, homozygosity for the MVD splice region variant likely reduces enzymatic activity to levels not compatible with life.
We also confirmed the previously described effects of four homozygous deficit variants reported as disease-causing on RNA and protein levels: c.964-1 G > C in DHCR7 activates a cryptic splice-site resulting in a 134 base pair intron retention that leads to a frameshift22, c.691+2 T > C in GBE1 leads to skipping of exon five23, p.Arg141His in PMM2 leads to reduced levels of Phosphomannomutase 2 encoded by PMM224, and c.1029+2 T > C in PNKP25 introduces a retained intron resulting in skipping of exon 10 (Supplementary Fig. 5).
Gene set over-representation analysis
Experimental data on the viability of mouse knockouts, and the essentiality of genes for the growth of human cell lines is valuable to infer the gestational timing of pregnancy loss26,27. To gain a better understanding of the biology behind a strong deficit of homozygosity, we performed a gene set over-representation analysis using three different data sets: genes harboring variants reported to cause recessive Mendelian disease, genes essential for growth of human cell lines identified through genome-wide screens, and orthologous mouse genes known to affect viability (Table 2, Supplementary Data 14, 15, and 16).
Among the 1258 genes with geneLOFs expected to have five or more homozygotes, 96 are essential for cell growth, and 192 are lethal when knocked out in mice (Table 2). The fraction of genes with a homozygous deficit among those essential for cell growth was 11.5% (11/96), and those that are mouse lethal was 6.8% (13/192). Compared to geneLOFs that did not show a homozygous deficit, those with a homozygous deficit are 6.6-fold more likely to be linked to autosomal recessive disease (P = 1.9 × 10−4), 15.1-fold more likely to be essential for viability in human cell lines (P = 9.1 × 10−8), and 19.5-fold more likely to result in lethality when knocked out in mice (P = 1.2 × 10−6) (Table 2). Thus, pLOF variants in genes with a strong deficit of homozygosity may cause pre-natal lethality rather than a post-natal disorder. Furthermore, based on being essential for growth of human cell lines, 13 genes with a strong deficit of homozygosity are candidates for harboring variants that lead to early pregnancy loss (see Supplementary Discussion for details).
geneLOFs with an expected homozygote count between one and five were also enriched in these datasets, although not to the same extent (Table 2, Supplementary Data 17 and 18). This shows that we only have statistical power to detect the subset of such variants in the combined set of 1.52 million individuals with a MAF of at least 0.2% (pLOF: MAF ≥ 0.18% corresponding to an expected homozygous count of 5, moderate impact variants: MAF ≥ 0.23% corresponding to an expected homozygous count of 8) (Supplementary Fig. 6). It has been suggested that the majority of recessive lethal variants are very rare and likely rarer than those identified in the current study15.
Effect of variants with a strong deficit of homozygosity on pregnancy loss in the Icelandic population
To determine whether a strong deficit of homozygosity is the result of early infant death or an increased rate of miscarriage, we identified 140 Icelandic couples who are carriers of pLOF variants in 15 of the homozygous deficit genes, restricting to genes where the sum of pregnancies (miscarriage or registered birth) across all carrier couples is at least two. These couples have a one-in-four chance of producing a zygote that is homozygous for the pLOF they carry. Carrier mothers were at increased risk of ever experiencing a miscarriage if the father was also a carrier, compared to mothers from non-carrier couples matched on year of birth and number of pregnancies (OR = 1.93 [95% CI: 1.35–2.74], P = 2.4 × 10−4, N couples = 140, N miscarriage = 57) (Table 3 and Supplementary Data 19). Consistent with a recessive inheritance pattern, couples in which only one partner was a carrier were not more likely to experience a miscarriage (OR = 1.0 [95% CI: 0.96–1.05], P = 0.92, N couples = 12,915, N miscarriage = 3398) (Supplementary Data 19). The most significant effect on miscarriage was observed for couples carrying pLOF variants in DHCR7 and was significant after correcting for 15 genes being tested (OR = 5.3 [95% CI: 2.0–16], P = 1.9 × 10−4 < 0.05/15), although we could not show an excess of miscarriage for any other gene individually (Table 3). Couples carrying pLOF variants in the remaining 14 genes also had an excess of miscarriages (OR = 1.6 [95% CI: 1.3–2.7], P = 0.012, N couples = 119, N miscarriage = 43) (Table 2). We came to the same conclusion by comparing the number of pregnancies that result in miscarriage between mothers from carrier couples and controls (Supplementary Data 19).
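For readers unfamiliar with the statistic, an odds ratio and its confidence interval can be computed from a 2 × 2 table as sketched below. This is a generic illustration and not the analysis performed here: it uses an unmatched Wald interval, whereas the comparison above used controls matched on year of birth and number of pregnancies, and the counts are hypothetical placeholders rather than study data.

```python
import math


def odds_ratio_wald_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a: carrier-couple mothers with a miscarriage
    b: carrier-couple mothers without a miscarriage
    c: control mothers with a miscarriage
    d: control mothers without a miscarriage
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR).
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical placeholder counts, not taken from the study.
print(odds_ratio_wald_ci(30, 70, 20, 80))
```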
For BRIP1, one of the 15 genes tested for excess miscarriage, the stop gained variant p.Arg798Ter (MAFIceland = 0.21%), and the frameshift variant p.Leu680PhefsTer9 (MAFIceland = 0.46%) account for the large majority of pLOF carriers. The p.Leu680PhefsTer9 variant is absent from most population databases13 and is likely an Icelandic founder mutation. Homozygous and compound heterozygous mutations in BRIP1 have been reported as a cause of Fanconi anemia, complementation group J (OMIM:607039). A compound heterozygous genotype consisting of p.Arg798Ter and the missense mutation p.Ala349Pro has been reported in a stillborn fetus at a gestational age of 22 weeks, who was diagnosed with Fanconi anemia complementation group J28. Frameshift variants at the Leu680 position are reported to cause Fanconi anemia (VCV000128166), and p.Leu680PhefsTer9 is associated with a high risk of ovarian cancer in Iceland among heterozygotes29. Interestingly, a BRIP1 compound heterozygous genotype consisting of the p.Arg798Ter stop-gain and p.Leu680PhefsTer9 frameshift variants was deemed causative in a clinical sequencing setting in Iceland in a fetus diagnosed with radial dysplasia in utero.
For c.964-1 G > C in DHCR7, which has the most prominent homozygous deficit and miscarriage excess in the current study, homozygosity in the few reported cases leads to either early miscarriage and intrauterine fetal demise or severe Smith-Lemli-Opitz syndrome and death before three months of age30,31. Our results confirm a recent observation in the Israeli population of excess miscarriage in carrier couples of the c.964-1 G > C variant in DHCR731. As we previously reported, two children of heterozygous couples died in their first year12. Importantly, couples were not more likely to experience a miscarriage if only one parent was a carrier (OR = 1.04 [95% CI: 0.94–1.15], P = 0.45, N couples = 2034, N miscarriage = 554) (Supplementary Data 19). This indicates that the effect of the c.964-1 G > C variant in DHCR7 on miscarriage is consistent with a recessive model.
Discussion
We identified 25 genes with protein-altering variants for which there was a significant deficit of homozygosity in a set of 1.52 million individuals. Nineteen of those involve pLOF variants expected to disrupt the protein and six involve moderate impact variants (five missense and one splice region). Sequence variants in 12 of the 25 genes cause Mendelian disease under a recessive mode of inheritance, two under a dominant mode, but variants in the remaining 11 genes have not been reported as disease-causing.
We demonstrate that when comparing the 1239 genes without a homozygous deficit based on geneLOFs to the 19 genes with such a deficit, the latter are more likely to be linked to autosomal recessive disease, to result in embryonic lethality when knocked out in mice, and to be essential for the viability of human cell lines. Interestingly, there is evidence of lethality in animal models of orthologous genes in addition to mice. Mutations in PNKP, and RPAP2 orthologs are linked to recessive lethality in the OMIA database (Online Mendelian Inheritance in Animals)32 in purebred cattle and pig populations, respectively. A splice acceptor variant in RPAP2 with a carrier frequency of 21% in a purebred cattle population shows a complete homozygous deficit due to early embryonic lethality33. A missense variant p.Gln96Arg in PNKP with a carrier frequency of 4.7% has a complete homozygous deficit in purebred pig populations34. In addition, inactivation of ATP5PB, PMM2, and WARS2 orthologs causes embryonic lethality in zebrafish, fruit-flies, and worm35,36,37,38,39 (Supplementary Data 20).
Thirteen genes with a strong deficit of homozygosity are most likely crucial in early development, based on the fact that they are essential for the growth of human cell lines or lethal if knocked out in mice (Supplementary Data 21). Importantly, eight of those genes are not currently linked to Mendelian disease in humans15. If a mutation in a gene is not known to cause human disease but exhibits a strong deficit of homozygosity, the deficit can, in theory, be due to any event from early embryonic selection to sickness in adults that prevents them from participating in research. If variants with a strong deficit of homozygosity led to disease after birth, then they would likely have been recognized in OMIM already. Consequently, we postulate that variants with a strong deficit of homozygosity in these unreported genes confer their effect early in development. Among the eight genes not currently linked to Mendelian disease in humans, the p.Ile233Arg variant in the mitoribosomal protein40 MRPS30 has the most prominent deficit of 48 homozygotes. This variant is present in all of the European populations considered with an allelic frequency ranging from 0.3% to 1%, indicating that it is ancient in origin. Assuming a generation time of 25 years, the age of the G allele of rs72756207 resulting in the Ile233Arg missense variation of MRPS30 is estimated to be 16,000 years (637 generations) (95% CI: 380–923 generations, 9500–23,000 years)41. In comparison, the homozygous deficit observed for p.Ile233Arg in MRPS30 is on par with p.Arg141His in PMM2, which is the most frequently reported pathogenic variant for congenital disorder of glycosylation42,43,44 (OMIM:601785.0001, ClinVar Variation ID:7706) with an allelic frequency ranging from 0.5% to 0.7%. MRPS30 is essential for the growth of human cell lines but a knockout in mice has not been reported. Further studies are required to understand the biological impact of p.Ile233Arg in MRPS30.
Known disease-causing sequence variants with an established loss-of-function effect that have a homozygous deficit in our data (i.e. DHCR7, GBE1, GLE1, PMM2, PNKP, and TSFM) have almost exclusively been reported in compound heterozygous cases in combination with a hypomorphic allele (resulting in only partial loss-of-function as cataloged in OMIM and ClinVar). This suggests that the variants that we describe are at least partial loss-of-function variants and that some minimum level of activity is required for successful embryonic development. By assessing RNA and protein levels in heterozygous carriers we are able to provide experimental validation of the effect of variants in ten of the genes with a strong deficit of homozygosity. This includes six variants not reported as disease-causing in ATP5PB, CCDC59, GTF2H3, MVD, PUM3, and RPAP2 in addition to the abovementioned known disease-causing loss-of-function variants in DHCR7, GBE1, PMM2, and PNKP.
In addition to the genes for which we observe a significant deficit, the results presented here also include information about the genes that do not reach significance (Supplementary Data 4, 6, 17, 18, and 21). Whereas we determined the cutoff for the significance of deficit at five or more expected homozygotes of pLOF variants, we noted that the group of genes with one to five expected homozygotes and a deficit is also enriched for recessive Mendelian disease, lethality when knocked out in mice, and essentiality in cell lines. This information, despite not reaching significance, may help in the interpretation of clinical sequencing and study of Mendelian diseases, including cases of neuropsychiatric disease as previously demonstrated45.
In addition to detecting genes with a deficit of homozygotes, we identified 2671 genes with observed homozygotes for pLOFs, most of which involve two or more individuals (1722/2671 = 66.3%) in the set of 1.52 M individuals. Some of the annotated pLOF variants for which we observe homozygotes may not be true loss-of-function variants, meaning that true loss-of-function homozygotes could still be non-viable. Also, our analysis can only identify a deficit for genes whose loss-of-function homozygotes are absent from the general population, and the detection of homozygotes for pLOFs suggests that biallelic loss-of-function mutations of these genes are not lethal before adult age. However, we cannot exclude the possibility that some of these genotypes would have severe phenotypic effects (Supplementary Discussion).
The approach employed in this study allows for the detection of genes with a strong deficit of homozygosity, resulting from the impact of homozygous genotypes on early stages of development. Homozygous deficit variants that have previously been unnoticed can now be detected in data sets derived from a combination of whole-genome sequencing and genotype imputation into large population sets. The overall burden of homozygous deficit variants at the population level is notable, where the combined deficit of significant protein-altering variants amounts to 444 individuals who were not born in our combined population set of 1.52 million (~3/10,000 individuals). We have identified recessive alleles that decrease reproductive success in the general population. Furthermore, they shed light on the genetic causes of pregnancy loss and add to the understanding of the function of genes that are essential for successful development of a human.
Methods
Study samples and ethics declarations
For Iceland, this study is based on whole-genome sequence data from the white blood cells of 49,708 Icelanders participating in various disease projects at deCODE Genetics14. In addition, a total of 155,250 Icelanders have been genotyped using Illumina SNP chips. All participating individuals who donated blood or buccal tissue samples, or their guardians, provided written informed consent. All sample identifiers were encrypted in accordance with the regulations of the Icelandic Data Protection Authority. Personal identities of the participants and biological samples were encrypted by a third-party system approved and monitored by the Icelandic Data Protection Authority. The study was approved by the Data Protection Authority (ref. 2013030423/ÞS/−, with amendments) and the National Bioethics Committee (ref. VSN-19-023, VSNb2019010015/03.01), which also reviewed and approved the protocol, methodology, and all documents presented to the participants. All methods were performed in accordance with the relevant guidelines and regulations.
The UK Biobank resource is a large-scale prospective study that includes data from 500,000 volunteer participants who were recruited between the ages of 40 and 69 years in 2006–2011 across the United Kingdom (https://www.ukbiobank.ac.uk/). Various health records and health-related information are available and regularly updated for these 500,000 participants. The UK Biobank phenotype and genotype data were collected following informed consent, and the study is overseen by The North West Research Ethics Committee, which reviewed and approved UK Biobank's scientific protocol and operational procedures (REC Reference Number: 06/MRE08/65).
Danish samples were obtained through collaboration with the Danish Blood Donor Study (DBDS) and the Copenhagen Hospital Biobank (CHB). The Danish Blood Donor Study (DBDS) GWAS study is a large prospective cohort study of ~110,000 blood donors across Denmark46. The Danish Data Protection Agency (P-2019-99) and the Danish National Committee on Health Research Ethics (NVK-1700704) approved the studies under which genetic data on DBDS participants were obtained. CHB is a research sample repository, which contains left-over samples obtained from diagnostic procedures on hospitalized and outpatient patients in the Danish Capital Region hospitals47,48. Genotypic data from the CHB were included as part of the study.
Norwegian genotype data were obtained from both hospital and population-based samples. Clinical samples included data from the DemGene and TOP studies which consist of case control samples of neuropsychiatric disorders. Written informed consent was obtained, and the Regional Committee for Medical and Health Research Ethics (REC) South East (#2009/2485) and Mid Norway (#2014/631) approved the studies. Population-based samples included data from the Norwegian Mother, Father and Child cohort study (Mor og Barn; MoBa) and the Hordaland Health Study (HUSK). MoBa is a population-based pregnancy cohort study conducted by the Norwegian Institute of Public Health. Participants were recruited from all over Norway from 1999–2008. The women provided consent to participation in 41% of the pregnancies. The cohort includes approximately 114,500 children, 95,200 mothers and 75,200 fathers. Blood samples were obtained from both parents during pregnancy and from mothers and children (umbilical cord) at birth. For a more detailed description of the MoBa sample see Magnus et al.49,50. The current study included genotype data from 168,000 mothers, fathers and offspring. The establishment of MoBa and initial data collection was based on a license from the Norwegian Data Protection Agency and approval from the REC. The MoBa cohort is currently regulated by the Norwegian Health Registry Act. Written informed consent was obtained from all mothers and fathers participating in MoBa. The current study was approved by REC South East (#2016/1226). MoBa is supported by the Norwegian Ministry of Health and Care Services and the Ministry of Education and Research. We are grateful to all the participating families in Norway who take part in this on-going cohort study. The HUSK Study is a community-based prospective study conducted in Hordaland County in Norway (http://husk.b.uib.no). 
The project was approved by REC (Western Norway 2018/915), and written informed consent was obtained from all participants. Genotypic data were provided by the HARVEST collaboration (supported by the Research Council of Norway (RCN) (#229624), the NORMENT Centre (RCN #223273), South East Norway Health Authorities and Stiftelsen Kristian Gerhard Jebsen), in collaboration with deCODE Genetics and the Center for Diabetes Research at the University of Bergen (funded by the ERC AdG project SELECTionPREDISPOSED, Stiftelsen Kristian Gerhard Jebsen, Trond Mohn Foundation, the RCN, the Novo Nordisk Foundation, the University of Bergen, and the Western Norway Health Authorities).
Genotypic data from Sweden was primarily retrieved from disease-specific population-based case-control studies on chronic inflammatory diseases, including studies on multiple sclerosis (EIMS)51,52 (04/252 1-4 & 2019-00639) and STOPMS2 (2009/2107-31/2 & 2020-0712), approved by National Ethical review board, GEMS53, IMSE54, and IMSE2 (2011/641-31/4), STOPMS55 (02-548), and COMBATMS56 (2017/32-31/4) approved by The Stockholm Regional Ethical Review Board, and rheumatoid arthritis (EIRA, Umea)57,58. The original rheumatoid arthritis studies were approved by the Swedish Ethical Review Authority and all data have been de-identified prior to analyses. Furthermore, genotypic data from the Swedish National Myeloma Biobank59,60 (Swedish Ethical Review Authority; Dnr 2019-06386), Skåne University Hospital, Lund, and from Swedish blood donors and primary care patients aged 18 to 71 years from Skane county61 (Lund University Ethics Review Board; Dnr 2018/2) were also included. The original studies were approved by the Lund University Ethical Review Board, and all data have been de-identified prior to analyses.
The Finnish data on genotype counts were obtained from the FinnGen project (https://www.finngen.fi/en), which gathers samples and phenotype data from a nationwide network of Finnish biobanks and national health registers. The Coordinating Ethics Committee of the Helsinki and Uusimaa Hospital District evaluated and approved the FinnGen research project which complies with existing legislation (in particular the Biobank Law and the Personal Data Act). The official data controller of the study is the University of Helsinki. The genotype data were imported on May 11th, 2021 from a source available to consortium partners (version 5; http://r5.finngen.fi).
Genotyping
The 155 K Icelanders had 27.2 million imputed sequence variants discovered through whole-genome sequencing (WGS) of 50 K Icelanders21. Our approach to WGS, genotyping, long-range phasing, and imputation of a substantial fraction of the Icelandic population has been described in detail in previous publications14,62. In brief, 56,959 Icelanders were whole-genome sequenced using standard TruSeq methodology (Illumina) to a median depth of 37X and genotyped with Illumina microarrays (chip-genotyped). An additional 96,095 Icelanders were chip-genotyped but not whole-genome sequenced. Genotypes of sequence variants identified through sequencing (SNPs and indels) were imputed into all chip-typed Icelanders, resulting in a set of 153,054 chip-genotyped and imputed Icelanders. We report carrier status among imputed samples if genotype probability exceeds 0.9. Samples and variants with less than 98% yield were excluded. For the purpose of this study, individuals with one or both parents of foreign ancestry and individuals whole-genome sequenced for clinical diagnostics were removed from the set.
The 432 K participants in the UK Biobank in this study had 57.7 million imputed sequence variants discovered through whole-genome sequencing of 150,119 individuals from UKB63. We report carrier status among imputed samples if genotype probability exceeds 0.9. Samples and variants with less than 98% yield were excluded. For the purpose of this study, our analysis was limited to individuals with British-Irish ancestry (XBI) as defined elsewhere63.
Samples from Denmark, Norway, and Sweden were genotyped using Illumina Global Screening Array chips and long-range phased together with other genotyped samples from North-western Europe using Eagle264. For the purpose of this study, individuals of non-European ancestry were removed from the set using principal component analysis of genotypes in the set of North-western Europeans.
We report carrier status among imputed samples if genotype probability exceeds 0.9. Samples and variants with less than 98% yield were excluded. A haplotype reference panel was prepared in the same manner as for the Icelandic and UK data14,65 by phasing whole-genome sequence genotypes of 15,576 individuals from Scandinavia, the Netherlands, and Ireland using the phased chip data. Graphtyper was used to call the genotypes which were subsequently imputed into the phased chip data.
Whole-genome sequencing, chip-typing, quality control, long-range phasing, and imputation, from which the data for this analysis were generated, were performed at deCODE genetics.
A custom-made FinnGen ThermoFisher Axiom array (>650,000 SNPs) was used to genotype ~177,000 FinnGen samples at the Thermo Fisher genotyping service facility in San Diego. Genotype calls were made with the AxiomGT1 algorithm. Individuals with ambiguous gender, high genotype missingness (>5%), excess heterozygosity (±4 SD), and non-Finnish ancestry were excluded. Variants with high missingness (>2%), a low Hardy-Weinberg equilibrium (HWE) P-value (\(P < 1\times {10}^{-6}\)), and low minor allele count (<3) were excluded. High coverage (25–30×) WGS data were used to develop the Finnish population-specific SISu v3 imputation reference panel with Beagle 4.1. More than 16 million variants have been imputed (https://finngen.gitbook.io/documentation/methods/genotype-imputation).
We manually assessed BAM files of different regions of variants with homozygous deficit, with particular interest in those with indels. These included the AGK chr7:141649323 TAAC duplication, the MVD chr16:88663006 C to T substitution, the CCDC59 chr12:82354490 TTACTTGT deletion, and the RPAP2 chr1:92333464 GAGTA deletion (Supplementary Figs. 8–11). We examined the BAM files of more than 20 individuals of each genotype, including heterozygotes and non-carriers, to confirm that the data in the BAM files corresponded to the reported genotypes in all cases. The reference allele was observed to have multiple copies in heterozygotes in all cases.
Imputation
Samples chip-typed and whole-genome sequenced at deCODE genetics from Denmark, Iceland, UK, Norway, and Sweden were long-range phased65, and the variants identified in the whole-genome sequencing were imputed into the chip-typed individuals, as has been described in detail elsewhere14,63. We restrict our analysis to variants that are reliably imputed, with a leave-one-out r-squared (L1oR2) score greater than 0.5 and imputation info above 0.9 (refs. 14,63). Because our imputations are based on haplotypes rather than genotypes, we are less likely to encounter artificial deficits in homozygotes as a result of genotyping or imputation errors14,63. Importantly, given the two phased haplotypes of each individual, the imputation of the individual’s two haplotypes was performed independently, which leads to less dependence between the imputed alleles than when genotypes are imputed from genotypic data.
For samples from Finland imputation was done with the population-specific SISu v3 reference panel66 with Beagle 4.1 (version 08Jun17.d8b) as described in the following protocol: dx.doi.org/10.17504/protocols.io.nmndc5e. We restrict our analysis to variants with INFO score greater than 0.9.
Identification of a deficit in the number of observed homozygotes
We tested for a deficit of observed homozygotes among variants with an expected homozygote count over 0.5. This corresponds to an allelic frequency >0.1% in the set of 1.5 million. Given the allele frequency (p) in a population and assuming random mating, the frequency of homozygotes is expected to be \({p}^{2}\) under HWE, so the expected number of homozygotes among n individuals is \(n{p}^{2}\). The combined expected number of homozygotes in the six populations is the sum of the expected number of homozygotes from each population.
We used Variant Effect Predictor (VEP)19 to assess the functional impact of sequence variants. We assessed homozygote count for intergenic variants (located in intergenic regions more than 5 kb from a RefSeq annotated genic region), low-impact variants (intronic variants, synonymous variants, and 3’UTR/5’UTR variants within 5 of an exon), moderate-impact variants (missense, inframe indel, splice region), and high-impact variants, also known as predicted loss-of-function variants (stop-gained, frameshift, essential splice donor and acceptor). We restricted our analysis to autosomal variants that fall within Tier 1 high confidence regions based on the Genome in a Bottle consortium (GiaB)67, and excluded variants located in segmental duplications, centromeres, telomeres, and low mappability regions that are difficult to map with short-read sequencing technologies67.
For each sequence variant, we derived an estimate of the allele frequency of the variant in each population i from the genotyped individuals as
\(\hat{{p}_{i}}=\frac{{{\rm{het}}}_{i}+2\,{{\rm{hom}}}_{i}}{2{n}_{i}}\)
where \({{\rm{het}}}_{i}\) and \({{\rm{hom}}}_{i}\) denote the observed numbers of heterozygotes and homozygotes in population i, and \({n}_{i}\) denotes the number of individuals in population i that were genotyped for the variant. Since here we are primarily interested in rare sequence variants, the estimated allele frequency is driven by the number of observed non-carriers and heterozygotes, and only slightly affected by the number of homozygotes. Under HWE, \({n}_{i}{\hat{{p}_{i}}}^{2}\) is the expected number of homozygotes within population i. Under HWE within each population, the expected total number of homozygotes is then \(\lambda={\sum }_{i}{n}_{i}{\hat{{p}_{i}}}^{2}\). We considered a variant to have a strong deficit of homozygosity if the observed number of homozygotes was 10% or less of the expected number of homozygotes under HWE, i.e. if the observed number of homozygotes was at most 0.1λ. This criterion was used instead of 0% to allow for some deviation from a total deficit as used in animal models17,18.
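The calculation of λ and the strong-deficit criterion can be illustrated with a minimal Python sketch. The container `PopulationCounts`, the function names, and the example counts are hypothetical; the allele frequency estimate assumes the standard allele-count estimator (heterozygotes plus twice the homozygotes, divided by twice the number of genotyped individuals).

```python
from dataclasses import dataclass


@dataclass
class PopulationCounts:
    """Genotype counts for one variant in one population (illustrative container)."""
    n_genotyped: int  # n_i: individuals genotyped for the variant
    n_het: int        # observed heterozygotes
    n_hom: int        # observed homozygotes


def allele_frequency(pop):
    # Assumed estimator: alleles carried divided by alleles sampled.
    return (pop.n_het + 2 * pop.n_hom) / (2 * pop.n_genotyped)


def expected_homozygotes(pops):
    # lambda = sum over populations of n_i * p_i^2 (HWE within each population).
    return sum(p.n_genotyped * allele_frequency(p) ** 2 for p in pops)


def has_strong_deficit(pops, threshold=0.1, min_expected=0.5):
    """True/False for tested variants; None if the expected count is 0.5 or less."""
    lam = expected_homozygotes(pops)
    if lam <= min_expected:
        return None  # variant not tested
    observed = sum(p.n_hom for p in pops)
    return observed <= threshold * lam


# Made-up counts for two populations.
pops = [PopulationCounts(100_000, 2_000, 0), PopulationCounts(400_000, 4_000, 1)]
print(expected_homozygotes(pops), has_strong_deficit(pops))
```

Summing the expectation population by population, rather than pooling allele counts first, keeps differences in allele frequency between populations from inflating the expected number of homozygotes.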
Since we are focusing on rare variants, the observed number of homozygotes then approximately follows a Poisson distribution with mean λ. This allows us to calculate a P-value for deviation from HWE which can then be corrected using Bonferroni correction to obtain a significance threshold for each set of variants. However, deviations from random mating within each population tend to increase the number of homozygotes. We therefore used the intergenic variants, which are the sequence variants with the lowest predicted functional impact, to estimate the probability that a sequence variant has a strong deficit of homozygosity in the absence of HWE. We grouped variants based on their expected number of homozygotes under HWE and calculated the fraction of variants with a strong deficit of homozygosity. The groupings of expected number of homozygotes we used were: [0.5, 1), [1, 2), [2, 3), [3, 5), [5, 8), [8, 13), [13, 250), [250, ∞). Within one of these ranges of expected number of homozygotes under HWE, let \({f}_{{\rm{intergenic}}}\) and \({f}_{{\rm{pav}}}\) denote the fraction of variants with a strong deficit of homozygosity among intergenic sequence variants and protein-altering sequence variants, respectively. A false discovery rate (FDR) was estimated by dividing the fraction of intergenic sequence variants with a strong deficit of homozygosity by the fraction of protein-altering sequence variants with a strong deficit of homozygosity:
\({\rm{FDR}}=\frac{{f}_{{\rm{intergenic}}}}{{f}_{{\rm{pav}}}}\)
Using the fraction of variants with a deficit of homozygosity among intergenic variants as a reference addresses the issue of artificial deficits of homozygotes caused by genotyping or imputation artifacts, since such artifacts should not preferentially affect protein-altering variants over intergenic variants. FDR confidence intervals were calculated using the ad-hoc approximate-estimate CI (AECI) method, which estimates a confidence interval for the ratio of two independent Poisson rates68.
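As a rough illustration of the two quantities described above, the following Python sketch computes a Poisson lower-tail P-value for an observed homozygote count and a per-bin FDR as the ratio of deficit fractions. The function names and the input format (pairs of expected and observed homozygote counts) are assumptions made for this example, and the AECI confidence interval is not reproduced.

```python
import math
from bisect import bisect_right

# Lower edges of the expected-homozygote bins; the last bin is open-ended.
BIN_EDGES = [0.5, 1, 2, 3, 5, 8, 13, 250]


def poisson_lower_tail(observed, lam):
    """P(X <= observed) for X ~ Poisson(lam), summed in log space per term."""
    return sum(
        math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        for k in range(observed + 1)
    )


def bin_index(lam):
    """Index of the bin containing lam, or None if lam is below 0.5."""
    i = bisect_right(BIN_EDGES, lam) - 1
    return i if i >= 0 else None


def deficit_fraction_by_bin(variants):
    """variants: iterable of (expected_homozygotes, observed_homozygotes)."""
    totals = [0] * len(BIN_EDGES)
    deficits = [0] * len(BIN_EDGES)
    for lam, obs in variants:
        i = bin_index(lam)
        if i is None:
            continue
        totals[i] += 1
        deficits[i] += obs <= 0.1 * lam  # strong deficit: at most 10% of expected
    return [d / t if t else None for d, t in zip(deficits, totals)]


def fdr_by_bin(intergenic, protein_altering):
    """Per-bin ratio f_intergenic / f_pav; None where it is undefined."""
    f_int = deficit_fraction_by_bin(intergenic)
    f_pav = deficit_fraction_by_bin(protein_altering)
    return [fi / fp if fi is not None and fp else None
            for fi, fp in zip(f_int, f_pav)]


# A variant with 5 expected and 0 observed homozygotes: P = exp(-5).
print(poisson_lower_tail(0, 5.0))
```

Bins in which no protein-altering variant shows a strong deficit return `None` rather than a ratio, since the FDR is undefined there.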
To account for hitchhiking effects due to linked selection, we excluded highly correlated variants between impact classes and additionally defined sets of intergenic variants with different exclusion regions outside of RefSeq annotated genes to calibrate the FDR. Specifically, moderate-impact variants highly correlated (R2 > 0.8) with high-impact variants were removed from the moderate-impact class, low-impact variants highly correlated with moderate or high-impact variants were removed from the low-impact class, and intergenic variants highly correlated with moderate, high, or low-impact variants were removed from the intergenic class. Additionally, we defined sets of intergenic variants located 5 kb, 50 kb, 100 kb, 250 kb, and 500 kb outside of annotated genic regions (Supplementary Data 22). There were no substantial fluctuations in the FDR as a result of the choice of intergenic variant sets (Supplementary Fig. 7). For further analysis we used intergenic variants located 5 kb outside of annotated genic regions which is the definition used by VEP19. As the number of intergenic variants 500 kb outside annotated genic regions is lower than the number of low-impact variants (875,258 compared to 877,296), it is likely that an exclusion region of such a size is excessive (Supplementary Data 22).
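The exclusion of correlated variants between impact classes can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a hypothetical list of pairwise R2 values, the class ordering (intergenic < low < moderate < high) is taken from the description above, and a variant is dropped whenever it is highly correlated with any variant of a higher impact class in the original, unpruned set.

```python
# Assumed ordering of impact classes, from lowest to highest predicted impact.
IMPACT_RANK = {"intergenic": 0, "low": 1, "moderate": 2, "high": 3}


def prune_correlated(impact_by_variant, r2_pairs, threshold=0.8):
    """Drop variants correlated (R2 > threshold) with a higher-impact variant.

    impact_by_variant: {variant_id: impact class}
    r2_pairs: iterable of (variant_a, variant_b, r2)
    Returns the retained {variant_id: impact class} mapping.
    """
    removed = set()
    for a, b, r2 in r2_pairs:
        if r2 <= threshold:
            continue
        rank_a = IMPACT_RANK[impact_by_variant[a]]
        rank_b = IMPACT_RANK[impact_by_variant[b]]
        if rank_a < rank_b:
            removed.add(a)  # a hitchhikes on the higher-impact variant b
        elif rank_b < rank_a:
            removed.add(b)
        # Pairs within the same class are left untouched.
    return {v: c for v, c in impact_by_variant.items() if v not in removed}


impacts = {"v1": "high", "v2": "moderate", "v3": "low",
           "v4": "intergenic", "v5": "moderate"}
pairs = [("v1", "v2", 0.95), ("v3", "v4", 0.90),
         ("v2", "v5", 0.99), ("v1", "v3", 0.50)]
print(sorted(prune_correlated(impacts, pairs)))
```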
geneLOFs
We collapsed rare and low frequency (<2% minor allele frequency) predicted loss-of-function variants by autosomal genes for the geneLOF tests69,70. Assuming that all loss-of-function variants have the same phenotypic effect, collapsing genotypes across the variants maximizes the power to detect association71. We excluded sequence variants deemed as low-confidence by the LoFtee (Loss-Of-Function Transcript Effect Estimator) algorithm, and variants labeled “likely not LoF” and “not LoF” after manual curation of pLOF variants that have passed all LoFtee filters13. Loss-of-function burden tests have used frequency thresholds from 0.1% to 5% MAF72,73 to attenuate the probability of false-positive loss-of-function variants in the burden test. Here, we filtered on loss-of-function MAF below 2% because pathogenic variants can be of higher allele frequencies in populations with founder effects, such as in Iceland and Finland74,75,76.
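One way to picture the collapsing step is the sketch below, which sums predicted loss-of-function allele counts per gene for each individual. The data layout and function name are hypothetical. Capping the collapsed count at two treats carriers of two different heterozygous variants in the same gene like homozygotes, which is an assumption of this example: it holds only when the two variants lie on different haplotypes, so phased data would be needed to confirm it.

```python
def collapse_gene_lof(genotypes, variant_gene, variant_maf, max_maf=0.02):
    """Collapse pLOF allele counts by gene.

    genotypes: {individual: {variant: alternate allele count (0, 1, or 2)}}
    variant_gene: {variant: gene symbol}
    variant_maf: {variant: minor allele frequency}
    Returns {individual: {gene: collapsed allele count, capped at 2}}.
    """
    collapsed = {}
    for person, calls in genotypes.items():
        per_gene = {}
        for variant, count in calls.items():
            if variant_maf[variant] >= max_maf:
                continue  # keep only rare and low-frequency variants (<2% MAF)
            gene = variant_gene[variant]
            per_gene[gene] = min(2, per_gene.get(gene, 0) + count)
        collapsed[person] = per_gene
    return collapsed


genotypes = {"A": {"v1": 1, "v2": 1}, "B": {"v1": 2, "v3": 1}, "C": {"v3": 2}}
variant_gene = {"v1": "G1", "v2": "G1", "v3": "G2"}
variant_maf = {"v1": 0.001, "v2": 0.01, "v3": 0.05}
print(collapse_gene_lof(genotypes, variant_gene, variant_maf))
```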
Gene expression analysis
We sequenced RNA from whole blood from 17,848 Icelanders, described in detail elsewhere77. We computed gene expression based on personalized transcript abundances using kallisto78. We quantile normalized the gene expression estimates and adjusted for measurements of sequencing artifacts, demographic variables, blood composition, and hidden covariates79. We then tested for association with sequence variants.
We used the SomaLogic® SOMAscan proteomics assay to measure protein levels in plasma21. The assay scanned 4907 aptamers that measure 4719 proteins in samples from 35,559 Icelanders with genetic information available at deCODE genetics. We quantile standardized the plasma protein levels and adjusted for year of birth, sex, and year of sample collection (2000–2019). We performed a proteome-wide association study and evaluated whether sequence variants associated with protein levels (pQTL).
Miscarriage among carrier couples
We identified couples where both partners carry variants with a strong deficit of homozygosity in a heterozygous state. In each pregnancy, these couples have a one-in-four chance of transmitting two copies of the variant with a strong deficit of homozygotes. We looked for records of miscarriage among 61,848 genotyped couples from Iceland where the female partner completed a pregnancy history questionnaire at the Cancer Detection Clinic of the Icelandic Cancer Society, carried out in connection with routine screening for cancers of the cervix and breast between 1964 and 1994 (Supplementary Data 23). Participants were asked if they had experienced a miscarriage, and if so, how many times. Differences in miscarriage risk between carrier couples (carrier mother + carrier father, and where one partner is a carrier) versus control couples (non-carrier mother + non-carrier father) were evaluated using Fisher’s exact test. In this study, we assess excess miscarriage both in terms of the number of mothers experiencing at least one miscarriage, and the number of pregnancies resulting in miscarriage between mothers from carrier couples and control couples. Non-carrier control couples were randomly drawn from the group of 61,848 genotyped couples from Iceland where the female partner answered a routine pregnancy history questionnaire and matched on age and number of pregnancies (1:100 nearest neighbor matching with replacement).
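The selection of matched controls can be illustrated with a short Python sketch of nearest-neighbour matching with replacement. The record fields, the function names, and the distance (the sum of absolute differences in age and in number of pregnancies) are assumptions made for this example. The matching metric is not specified above.

```python
import heapq


def distance(a, b):
    # Assumed metric: absolute age difference plus absolute pregnancy difference.
    return abs(a["age"] - b["age"]) + abs(a["pregnancies"] - b["pregnancies"])


def match_with_replacement(cases, controls, k=100):
    """Return {case id: ids of the k nearest controls}.

    Matching is with replacement: every case draws from the full control
    pool, so the same control couple may be matched to several cases.
    """
    return {
        case["id"]: [
            c["id"]
            for c in heapq.nsmallest(k, controls, key=lambda c: distance(case, c))
        ]
        for case in cases
    }


controls = [
    {"id": "c1", "age": 30, "pregnancies": 2},
    {"id": "c2", "age": 45, "pregnancies": 2},
    {"id": "c3", "age": 31, "pregnancies": 3},
    {"id": "c4", "age": 60, "pregnancies": 1},
]
cases = [
    {"id": "m1", "age": 30, "pregnancies": 2},
    {"id": "m2", "age": 32, "pregnancies": 3},
]
print(match_with_replacement(cases, controls, k=2))
```

With k = 100 and the full pool of non-carrier couples, each carrier couple would receive 100 controls of similar age and pregnancy count.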
Gene set over-representation analysis
We performed a gene over-representation analysis using three sets of data: (1) genes harboring variants reported to cause recessive Mendelian disease, (2) genes essential for the growth of human cell lines identified through genome-wide screens, and (3) orthologous mouse genes known to affect viability. Gene set over-representation was estimated by a two-sided Fisher exact test. As the unit of the test is the gene, we used the 1258 geneLOFs with five or more expected homozygotes in the meta-analysis of all 1.52 million individuals.
(1) Information on the mode of inheritance of Mendelian disease and linked genes was extracted from the Inheritance subontology of The Human Phenotype Ontology (HPO)80 (http://purl.obolibrary.org/obo/hp/hpoa/phenotype.hpoa) (see Supplementary Data 14).
(2) Data on genes essential for the growth of human cell lines, derived from genome-wide screens, were downloaded from the Project Achilles81,82 website (https://depmap.org/portal/download). A unified list of common essential genes from three gene sets was used (Achilles_common_essentials.csv, CRISPR_common_essentials.csv, and Common_essentials.csv) (see Supplementary Data 15).
(3) Data on mouse lethal phenotypes were retrieved from the Mouse Genome Informatics (MGI) database (http://www.informatics.jax.org/downloads/reports/MGI_GenePheno.rpt) and the International Mouse Phenotyping Consortium (IMPC). The 15th release of IMPC mouse phenotype data was downloaded from the IMPC ftp site (http://ftp.ebi.ac.uk/pub/databases/impc/all-data-releases/release-15.1/results/viability.csv.gz). A unified list of ‘embryonic lethal’ genes was identified through query of the Mammalian Phenotype Ontology (MP) terms83 associated with viability among the joint MGI and IMPC dataset (see Supplementary Data 16).
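A gene-level over-representation test of this kind can be sketched in Python with a two-sided Fisher exact test computed directly from the hypergeometric distribution. The function names and the toy gene identifiers are hypothetical; the two-sided P-value sums the probabilities of all tables that are no more likely than the observed one, which is a common convention but an assumption about the exact definition used here.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # Sum tables that are at most as likely as the observed one (float tolerance).
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def over_representation(deficit_genes, gene_set, background):
    """Odds ratio and P-value for enrichment of gene_set among deficit_genes."""
    background = set(background)
    deficit = set(deficit_genes) & background
    in_set = set(gene_set) & background
    a = len(deficit & in_set)          # deficit genes in the set
    b = len(deficit - in_set)          # deficit genes outside the set
    c = len(in_set - deficit)          # other background genes in the set
    d = len(background) - a - b - c    # other background genes outside the set
    odds_ratio = (a * d) / (b * c) if b and c else float("inf")
    return odds_ratio, fisher_exact_two_sided(a, b, c, d)


background = {f"g{i}" for i in range(1, 9)}
print(over_representation({"g1", "g2", "g3", "g4"},
                          {"g1", "g2", "g3", "g5"}, background))
```

In the setting described above, the background would be the 1258 geneLOFs with five or more expected homozygotes, so that only genes with a chance of being detected enter the comparison.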
Variant age estimation
To estimate the age of selected variants, the Human Genome Dating database was used (https://human.genome.dating/snp/rs72756207). Using the reference allele as the ancestral state, age was estimated for the alternate allele, and the generation time was assumed to be 25 years41.
Power analysis
For the power analysis, we used a two-sample test of proportions. We assumed that the true homozygote frequency in the population was 10% of its expected frequency. We estimated the sample size required to detect a strong deficit of homozygosity with 80% power (significance level = 0.05), as well as the power to detect a strong deficit of homozygosity at minor allele frequencies between 0 and 1.6%. We used the R function stats::power.prop.test to perform the power analysis (sig.level = 0.05, power = 0.80, p1 = expected frequency of homozygous genotype, p2 = 0.1*p1).
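For readers without R at hand, the calculation can be approximated with the standard normal-approximation formula for comparing two proportions, sketched below in Python. This is an illustration under stated assumptions, not a reimplementation of stats::power.prop.test: the function names are hypothetical, the test is taken to be two-sided, and the returned sample size is per group.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def _standard_deviations(p1, p2):
    # Pooled term (under the null) and unpooled term (under the alternative).
    pooled = math.sqrt((p1 + p2) * (1 - (p1 + p2) / 2))
    unpooled = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return pooled, unpooled


def sample_size_two_proportions(p1, p2, power=0.80, sig_level=0.05):
    """Approximate per-group sample size for a two-sided test of p1 vs p2."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - sig_level / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    pooled, unpooled = _standard_deviations(p1, p2)
    return ((z_alpha * pooled + z_beta * unpooled) / abs(p1 - p2)) ** 2


def power_two_proportions(n, p1, p2, sig_level=0.05):
    """Approximate power with n individuals per group."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - sig_level / 2)
    pooled, unpooled = _standard_deviations(p1, p2)
    return _STD_NORMAL.cdf(
        (math.sqrt(n) * abs(p1 - p2) - z_alpha * pooled) / unpooled
    )


# Setup described above: p1 is the HWE homozygote frequency for a given
# minor allele frequency, and p2 is 10% of p1.
maf = 0.005
p1 = maf ** 2
print(sample_size_two_proportions(p1, 0.1 * p1))
```

Because p1 scales with the square of the allele frequency, the required sample size grows steeply as variants become rarer.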
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All data generated during this study are included in this published article and its supplementary files. Genotype data for protein-altering variants for the combined set of 1.52 million individuals generated for this study are publicly available and tabulated in Supplementary Data 4 and Supplementary Data 6. Figshare https://figshare.com/s/c498d3df17cb04189135 (2023). This study made use of publicly available datasets. This research has been conducted using the FinnGen resource. The FinnGen GWAS summary statistics, variant annotation, and genotype counts are publicly accessible following registration at https://www.finngen.fi/en/access_results. To gain access to FinnGen data, an online form needs to be filled out at https://elomake.helsinki.fi/lomakkeet/102575/lomake.html. Instructions on how to download data from FinnGen are then sent by e-mail. This research has been conducted using the UK Biobank Resource under application number 56270. Data from the UK Biobank are available by application to all bona fide researchers in the public interest at https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access. Additional information about registration for access to the data is available at www.ukbiobank.ac.uk/register-apply/. 
Data access for approved applications requires a data transfer agreement between the researcher’s institution and UK Biobank, the terms of which are available on the UK Biobank website (www.ukbiobank.ac.uk/media/ezrderzw/applicant-mta.pdf); GWAS summary statistics for RNA splicing (sQTL), mRNA levels (eQTL), and protein levels (pQTL) in the Icelandic population, based on RNA sequencing of blood samples from 17,848 individuals and plasma protein levels measured with 4907 aptamers (SOMAscan) in 35,559 individuals21 used in this study are publicly accessible following registration at https://www.decode.com/summarydata/ (https://download.decode.is/form/folder/proteomics); Information on the mode of inheritance of Mendelian disease and linked genes, extracted from the Inheritance subontology of The Human Phenotype Ontology (HPO), is freely available at http://purl.obolibrary.org/obo/hp/hpoa/phenotype.hpoa, and tabulated in Supplementary Data 14; Data on genes essential for the growth of human cell lines, derived from genome-wide screens, were downloaded from the Project Achilles website (22Q2) and are freely available at https://depmap.org/portal/download. 
A unified list of common essential genes from three gene sets was used (https://depmap.org/portal/download/all/?releasename=DepMap+Public+22Q2&filename=Achilles_common_essentials.csv, https://depmap.org/portal/download/all/?releasename=DepMap+Public+22Q2&filename=CRISPR_common_essentials.csv, and https://depmap.org/portal/download/all/?releasename=DepMap+Public+22Q2&filename=common_essentials.csv), and is tabulated in Supplementary Data 15; Data on mouse lethal phenotypes are freely available and were retrieved from the Mouse Genome Informatics (MGI) database (http://www.informatics.jax.org/downloads/reports/MGI_GenePheno.rpt) and the International Mouse Phenotyping Consortium (IMPC); the 15th release of IMPC mouse phenotype data was downloaded from the IMPC ftp site at http://ftp.ebi.ac.uk/pub/databases/impc/all-data-releases/release-15.1/results/viability.csv.gz. These data are tabulated in Supplementary Data 16; To estimate the age of selected variants, the Human Genome Dating database was used, which is freely available (https://human.genome.dating); Data from the OMIA database are freely available. A list of genes for which mutations have been shown to result in Mendelian traits in non‐laboratory animals is available for download at https://www.omia.org/download/causal_mutations/?format=X2.
Change history
03 July 2023
A Correction to this paper has been published: https://doi.org/10.1038/s41467-023-39492-4
References
Bamshad, M. J., Nickerson, D. A. & Chong, J. X. Mendelian Gene Discovery: Fast and Furious with No End in Sight. Am. J. Hum. Genet. 105, 448–455 (2019).
Bick, D., Jones, M., Taylor, S. L., Taft, R. J. & Belmont, J. Case for genome sequencing in infants and children with rare, undiagnosed or genetic diseases. J. Med. Genet. 56, 783–791 (2019).
Chong, J. X., Ouwenga, R., Anderson, R. L., Waggoner, D. J. & Ober, C. A population-based study of autosomal-recessive disease-causing mutations in a founder population. Am. J. Hum. Genet. 91, 608–620 (2012).
Gao, Z., Waggoner, D., Stephens, M., Ober, C. & Przeworski, M. An estimate of the average number of recessive lethal mutations carried by humans. Genetics 199, 1243–1254 (2015).
Macklon, N. S., Geraedts, J. P. M. & Fauser, B. C. J. M. Conception to ongoing pregnancy: the ‘black box’ of early pregnancy loss. Hum. Reprod. Update 8, 333–343 (2002).
Dickinson, M. E. et al. High-throughput discovery of novel developmental phenotypes. Nature 537, 508–514 (2016).
Bult, C. J. et al. Mouse Genome Database (MGD) 2019. Nucleic Acids Res. 47, D801–D806 (2019).
White, J. K. et al. Genome-wide generation and systematic phenotyping of knockout mice reveals new roles for many genes. Cell 154, 452–464 (2013).
Yoon, Y., Riley, J., Gallant, J., Xu, P. & Rivera-Pérez, J. A. Implantation and Gastrulation Abnormalities Characterize Early Embryonic Lethal Mouse Lines. bioRxiv https://doi.org/10.1101/2020.10.08.331587 (2020).
Saleheen, D. et al. Human knockouts and phenotypic analysis in a cohort with a high rate of consanguinity. Nature 544, 235–239 (2017).
Narasimhan, V. M. et al. Health and population effects of rare gene knockouts in adult humans with related parents. Science 352, 474–477 (2016).
Sulem, P. et al. Identification of a large set of rare complete human knockouts. Nat. Genet. 47, 448–452 (2015).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Gudbjartsson, D. F. et al. Large-scale whole-genome sequencing of the Icelandic population. Nat. Genet. 47, 435–444 (2015).
Amorim, C. E. G. et al. The population genetics of human disease: the case of recessive, lethal mutations. PLoS Genet 13, 1–23 (2017).
Wright, S. Evolution in Mendelian Populations. Genetics 16, 97–159 (1931).
Mukai, T., Chigusa, S. I., Mettler, L. E. & Crow, J. F. Mutation rate and dominance of genes affecting viability in Drosophila melanogaster. Genetics 72, 335–355 (1972).
Greenberg, R. & Crow, J. F. A Comparison of the Effect of Lethal and Detrimental Chromosomes from Drosophila Populations. Genetics 45, 1153–1168 (1960).
Sveinbjornsson, G. et al. Weighting sequence variants based on their annotation increases power of whole-genome association studies. Nat. Genet. 48, 314–317 (2016).
Landrum, M. J. et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44, D862–D868 (2016).
Ferkingstad, E. et al. Large-scale integration of the plasma proteome with genetics and disease. Nat. Genet. 53, 1712–1721 (2021).
Waterham, H. R. & Hennekam, R. C. M. Mutational spectrum of Smith-Lemli-Opitz syndrome. Am. J. Med. Genet. C. Semin. Med. Genet. 160C, 263–284 (2012).
Ravenscroft, G. et al. Whole exome sequencing in foetal akinesia expands the genotype-phenotype spectrum of GBE1 glycogen storage disease mutations. Neuromuscul. Disord. 23, 165–169 (2013).
Matthijs, G., Schollen, E., Heykants, L. & Grünewald, S. Phosphomannomutase deficiency: the molecular basis of the classical Jaeken syndrome (CDGS type Ia). Mol. Genet. Metab. 68, 220–226 (1999).
Neuser, S. et al. Prenatal phenotype of PNKP-related primary microcephaly associated with variants affecting both the FHA and phosphatase domain. Eur. J. Hum. Genet. https://doi.org/10.1038/s41431-021-00982-y (2021).
Dawes, R., Lek, M. & Cooper, S. T. Gene discovery informatics toolkit defines candidate genes for unexplained infertility and prenatal or infantile mortality. NPJ Genom. Med. 4, 8 (2019).
Cacheiro, P. et al. Human and mouse essentiality screens as a resource for disease gene discovery. Nat. Commun. 678250 https://doi.org/10.1038/s41467-020-14284-2 (2020).
Levran, O. et al. The BRCA1-interacting helicase BRIP1 is deficient in Fanconi anemia. Nat. Genet. 37, 931–933 (2005).
Rafnar, T. et al. Mutations in BRIP1 confer high risk of ovarian cancer. Nat. Genet. 43, 1104–1107, https://doi.org/10.1038/ng.955 (2011).
Nowaczyk, M. J. M., Waye, J. S. & Douketis, J. D. DHCR7 mutation carrier rates and prevalence of the RSH/Smith-Lemli-Opitz syndrome: where are the patients? Am. J. Med. Genet. A 140, 2057–2062 (2006).
Daum, H. et al. Smith-Lemli-Opitz syndrome: what is the actual risk for couples carriers of the DHCR7:c.964-1G>C variant? Eur. J. Hum. Genet. 28, 938–942 (2020).
Nicholas, F. W. Online Mendelian Inheritance in Animals (OMIA): a record of advances in animal genetics, freely available on the Internet for 25 years. Anim. Genet. 52, 3–9 (2021).
Guarini, A. R. et al. Estimating the effect of the deleterious recessive haplotypes AH1 and AH2 on reproduction performance of Ayrshire cattle. J. Dairy Sci. 102, 5315–5322 (2019).
Derks, M. F. L. et al. Loss of function mutations in essential genes cause embryonic lethality in pigs. PLoS Genet 15, e1008055 (2019).
Clark, K. J. et al. In vivo protein trapping produces a functional expression codex of the vertebrate proteome. Nat. Methods 8, 506–515 (2011).
Mummery-Widmer, J. L. et al. Genome-wide analysis of Notch signalling in Drosophila by transgenic RNAi. Nature 458, 987–992 (2009).
Gönczy, P. et al. Functional genomic analysis of cell division in C. elegans using RNAi of genes on chromosome III. Nature 408, 331–336 (2000).
Colaiácovo, M. P. et al. A targeted RNAi screen for genes involved in chromosome morphogenesis and nuclear organization in the Caenorhabditis elegans germline. Genetics 162, 113–128 (2002).
Simmer, F. et al. Genome-wide RNAi of C. elegans using the hypersensitive rrf-3 strain reveals novel gene functions. PLoS Biol. 1, E12 (2003).
Cheong, A. et al. Nuclear-encoded mitochondrial ribosomal proteins are required to initiate gastrulation. Development 147, dev188714 (2020).
Albers, P. K. & McVean, G. Dating genomic variants and shared ancestry in population-scale sequencing data. PLoS Biol. 18, e3000586 (2020).
Kjaergaard, S., Skovby, F. & Schwartz, M. Absence of homozygosity for predominant mutations in PMM2 in Danish patients with carbohydrate-deficient glycoprotein syndrome type 1. Eur. J. Hum. Genet. 6, 331–336 (1998).
Jaeken, J., Lefeber, D. & Matthijs, G. Clinical utility gene card for: Phosphomannomutase 2 deficiency. Eur. J. Hum. Genet. 22, 1054 (2014).
Erlandson, A. et al. Scandinavian CDG-Ia patients: genotype/phenotype correlation and geographic origin of founder mutations. Hum. Genet. 108, 359–367 (2001).
Arnadottir, G. A. et al. Population-level deficit of homozygosity unveils CPSF3 as an intellectual disability syndrome gene. Nat. Commun. 13, 1–9 (2022).
Hansen, T. F. et al. DBDS Genomic Cohort, a prospective and comprehensive resource for integrative and temporal analysis of genetic, environmental and lifestyle factors affecting health of blood donors. BMJ Open 9, e028401 (2019).
Laursen, I. H. et al. Cohort profile: Copenhagen Hospital Biobank—Cardiovascular Disease Cohort (CHB-CVDC): Construction of a large-scale genetic cohort to facilitate a better understanding of heart diseases. BMJ Open 11, e049709 (2021).
Sørensen, E. et al. Data Resource Profile: The Copenhagen Hospital Biobank (CHB). Int. J. Epidemiol. 50, 719–720, https://doi.org/10.1093/ije/dyaa157 (2021).
Shanahan, M. J., Mortimer, J. T. & Johnson, M. K. Handbook of the Life Course : Volume II. (Springer, 2015).
Magnus, P. et al. Cohort Profile Update: The Norwegian Mother and Child Cohort Study (MoBa). Int. J. Epidemiol. 45, 382–388 (2016).
Hedström, A. K. et al. High Levels of Epstein-Barr Virus Nuclear Antigen-1-Specific Antibodies and Infectious Mononucleosis Act Both Independently and Synergistically to Increase Multiple Sclerosis Risk. Front. Neurol. 10, 1368 (2019).
Hedström, A. K. et al. Organic solvents and MS susceptibility: Interaction with MS risk HLA genes. Neurology 91, e455–e462 (2018).
Rhead, B. et al. Mendelian randomization shows a causal effect of low vitamin D on multiple sclerosis risk. Neurol. Genet 2, e97 (2016).
Piehl, F., Holmén, C., Hillert, J. & Olsson, T. Swedish natalizumab (Tysabri) multiple sclerosis surveillance study. Neurol. Sci. 31, 289–293 (2011).
Khademi, M. et al. Cerebrospinal fluid CXCL13 in multiple sclerosis: a suggestive prognostic marker for the disease course. Mult. Scler. 17, 335–343 (2011).
Alping, P., Piehl, F., Langer-Gould, A., Frisell, T. & COMBAT-MS Study Group. Validation of the Swedish Multiple Sclerosis Register: Further Improving a Resource for Pharmacoepidemiologic Evaluations. Epidemiology 30, 230–233 (2019).
Hallmans, G. et al. Cardiovascular disease and diabetes in the Northern Sweden Health and Disease Study Cohort - evaluation of risk factors and their interactions. Scand. J. Public Health Suppl. 61, 18–24 (2003).
Boman, A. et al. Antibodies against citrullinated peptides are associated with clinical and radiological outcomes in patients with early rheumatoid arthritis: a prospective longitudinal inception cohort study. RMD Open 5, e000946 (2019).
Swaminathan, B. et al. Variants in ELL2 influencing immunoglobulin levels associate with multiple myeloma. Nat. Commun. 6, 7213 (2015).
Duran-Lozano, L. et al. Germline variants at SOHLH2 influence multiple myeloma risk. Blood Cancer J. 11, 76 (2021).
Jonsson, S. et al. Identification of sequence variants influencing immunoglobulin levels. Nat. Genet. 49, 1182–1191 (2017).
Jónsson, H. et al. Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature 549, 519–522 (2017).
Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK biobank. bioRxiv https://doi.org/10.1101/2021.11.16.468246 (2021).
Loh, P.-R. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 48, 1443–1448 (2016).
Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat. Genet. 40, 1068–1075 (2008).
Kals, M. et al. Advantages of genotype imputation with ethnically matched reference panel for rare variant association analyses. bioRxiv 579201 https://doi.org/10.1101/579201 (2019).
Wagner, J. et al. Towards a Comprehensive Variation Benchmark for Challenging Medically-Relevant Autosomal Genes. bioRxiv https://doi.org/10.1101/2021.06.07.444885 (2021).
Kharrati-Kopaei, M. & Dorosti-Motlagh, R. Confidence intervals for the ratio of two independent Poisson rates: Parametric bootstrap, modified asymptotic, and approximate-estimate approaches. Stat. Methods Med. Res. 29, 2140–2150 (2020).
Helgason, H. et al. Loss-of-function variants in ATM confer risk of gastric cancer. Nat. Genet. 47, 906–910 (2015).
Lee, S., Abecasis, G. R., Boehnke, M. & Lin, X. Rare-variant association analysis: study designs and statistical tests. Am. J. Hum. Genet. 95, 5–23 (2014).
Li, B. & Leal, S. M. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet. 83, 311–321 (2008).
Stitziel, N. O., Kiezun, A. & Sunyaev, S. Computational and statistical approaches to analyzing variants identified by exome sequencing. Genome Biol. 12, 227 (2011).
Cirulli, E. T. et al. Genome-wide rare variant analysis for thousands of phenotypes in over 70,000 exomes from two cohorts. Nat. Commun. 11, 542 (2020).
Rafnar, T. et al. Association of BRCA2 K3326* With Small Cell Lung Cancer and Squamous Cell Cancer of the Skin. J. Natl Cancer Inst. 110, 967–974 (2018).
Levy-Lahad, E. et al. Founder BRCA1 and BRCA2 mutations in Ashkenazi Jews in Israel: frequency and differential penetrance in ovarian cancer and in breast-ovarian cancer families. Am. J. Hum. Genet. 60, 1059–1067 (1997).
Norio, R. Finnish Disease Heritage II: population prehistory and genetic roots of Finns. Hum. Genet. 112, 457–469 (2003).
Mikaelsdottir, E. et al. Genetic variants associated with platelet count are predictive of human disease and physiological markers. Commun. Biol. 4, 1132 (2021).
Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc. 7, 500–507 (2012).
Köhler, S. et al. The Human Phenotype Ontology in 2021. Nucleic Acids Res. https://doi.org/10.1093/nar/gkaa1043 (2020).
Tsherniak, A. et al. Defining a Cancer Dependency Map. Cell 170, 564–576.e16 (2017).
Dempster, J. M. et al. Extracting Biological Insights from the Project Achilles Genome-Scale CRISPR Screens in Cancer Cell Lines. bioRxiv 720243 https://doi.org/10.1101/720243 (2019).
Smith, C. L. & Eppig, J. T. The mammalian phenotype ontology: enabling robust annotation and comparative analysis. Wiley Interdiscip. Rev. Syst. Biol. Med. 1, 390–399 (2009).
Acknowledgements
We thank the individuals who participated in this study and whose contributions made this work possible. We also thank our valued colleagues at the Icelandic Patient Recruitment Center and the deCODE genetics core facilities who contributed to the data collection and phenotypic characterization of clinical samples as well as to the genotyping and analysis of the whole-genome association data. We want to acknowledge the FinnGen study (https://www.finngen.fi/en) and the UK Biobank for providing genotypic data. We want to acknowledge the participants and investigators of DBDS which is a part of the Bio and Genome Bank Denmark funded by the Danish Regions and has received a grant from the Independent Research Fund Denmark (271-08-0640). We want to acknowledge the participants and investigators of MoBa which is supported by the Norwegian Ministry of Health and Care Services and the Ministry of Education and Research. We are grateful to all the participating families in Norway who take part in this ongoing cohort study. Financial support from the Research Council of Norway (223273, 273291, 324252, 274611), South-Eastern Norway Regional Health Authority (#2020060, #2020022), European Union’s Horizon2020 Research and Innovation Programme (CoMorMent project; Grant #847776), Kristian Gerhard Jebsen Stiftelsen (SKGJ-MED-021), and candy’s Foundation is acknowledged.
Author information
Contributions
A.O., Pa.S., K.S., and D.F.G. designed the study and interpreted the results. A.O., Pa.S., D.F.G., A.H., and K.S. drafted the manuscript. A.O. implemented the analysis pipelines with input from Pa.S., G.S., G.A.A., G.H.H., B.A.A., G.R.O., H.H., H.K., R.F., B.O.J., H.B.T., S.R.D., B.V.H., A.H., and D.F.G. A.O., G.H.H., E.F., and Pa.S. performed expression analyses. A.O., Pa.S., G.S., G.H.H., V.T., E.F., H.J., S.A.G., D.B., K.H.M., S.K., O.A.S., B.V.H., and D.F.G. performed the statistical and bioinformatics analyses. Subject recruitment and the biological material collection were organized and carried out by J.H., V.S., H.S.N., D.We., J.M.K., O.F., G.B.W., I.K., H.Hj., T.A.O., Ge.S., M.N., C.E., T.B., S.S., T.O., K.N., As.H., M.D., T.F.H., T.S., R.L.J., R.T.L., S.D., L.A., A.L.P., Pe.S., I.E.S., L.T., M.T.B., S.B., P.M., B.V.H., J.S., O.T.M., D.B.D.S., L.P., K.B., T.R., J.A., L.K., O.B.P., G.M., A.l.H., B.N., O.A.A., M.D., S.R.O., I.J., H.S., H.Ho., and U.T. T.A.O., As.H., T.S., I.J., H.Ho., U.T., and K.S. were responsible for phenotype data acquisition. Sequencing and genotyping were supervised by O.T.M. and J.S. All authors contributed to the final version of the paper.
Ethics declarations
Competing interests
Authors affiliated with deCODE genetics/Amgen Inc., A.O., Pa.S., G.S., G.A.A., V.S., G.H.H., B.A.A., G.R.O., H.Ho., H.K., R.F., B.O.J., V.T., E.F., H.J., S.A.G., D.B., K.H.M., H.B.T., S.K., O.A.S., S.S., P.M., B.V.H., J.S., A.H., O.T.M., I.J., H.S., H.Ho., U.T., D.F.G., and K.S. declare competing interests as employees. O.A.A. is a consultant to HealthLytix. G.S. participated in advisory board meetings for Biogen. The remaining authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Oddsson, A., Sulem, P., Sveinbjornsson, G. et al. Deficit of homozygosity among 1.52 million individuals and genetic causes of recessive lethality. Nat Commun 14, 3453 (2023). https://doi.org/10.1038/s41467-023-38951-2
DOI: https://doi.org/10.1038/s41467-023-38951-2