Evaluating potential drug targets through human loss-of-function genetic variation

Naturally occurring human genetic variants predicted to cause loss of function of protein-coding genes provide an in vivo model of human gene inactivation that complements cell and model organism knockout studies. Here we investigate the application of human loss-of-function variants to assess genes as candidate drug targets, with three key findings. First, even essential genes, where loss-of-function variants are not tolerated, can be highly successful as targets of inhibitory drugs. Second, in most genes, loss-of-function variants are sufficiently rare that genotype-based ascertainment of homozygous or compound heterozygous “knockout” humans will await sample sizes ~1,000 times those available at present. Third, automated variant annotation and filtering are powerful, but manual curation remains critical for removing artifacts and making biological inferences, and is a prerequisite for recall-by-genotype efforts. Our results provide a roadmap for human “knockout” studies and should guide interpretation of loss-of-function variants in drug development.


SUMMARY
Naturally occurring human genetic variants predicted to cause loss of function of protein-coding genes provide an in vivo model of human gene inactivation that complements cell and model organism knockout studies. Here we investigate the application of human loss-of-function variants to assess genes as candidate drug targets, with three key findings. First, even essential genes, where loss-of-function variants are not tolerated, can be highly successful as targets of inhibitory drugs. Second, in most genes, loss-of-function variants are sufficiently rare that genotype-based ascertainment of homozygous or compound heterozygous "knockout" humans will await sample sizes ~1,000 times those available at present. Third, automated variant annotation and filtering are powerful, but manual curation remains critical for removing artifacts and making biological inferences, and is a prerequisite for recall-by-genotype efforts. Our results provide a roadmap for human "knockout" studies and should guide interpretation of loss-offunction variants in drug development.

MAIN TEXT
Human genetics is an increasingly critical source of evidence guiding the selection of new targets for drug discovery 1 . Most drug candidates that enter clinical trials eventually fail for lack of efficacy 2 , and while in vitro, cell culture, and animal model systems can provide preclinical evidence that the compound engages its target, too often the target itself is not causally related to human disease 1 . Candidates targeting genes with human genetic evidence for disease causality are more likely to become approved drugs 3,4 , and identification of humans with loss-offunction (LoF) variants, particularly two-hit (homozygous or compound heterozygous) genotypes, has, for several genes, correctly predicted the safety and phenotypic effect of pharmacologically inhibiting the drug's target 5 . While these examples demonstrate the value of human genetics in drug development, important questions remain regarding strategies for identifying individuals with LoF variants in a gene of interest, interpretation of the frequency -or lack -of such individuals, and whether it is wise to pharmacologically target a gene in which LoF variants are associated with a deleterious phenotype.
Public databases of human genetic variation have provided catalogs of predicted loss-offunction (pLoF) variants -nonsense, essential splice site, and frameshift variants expected to result in a non-functional allele -in humans. Such databases present a new opportunity to study the effects of pLoF variation in genes of interest and to identify individuals with pLoF genotypes in order to understand gene function or disease biology, or to assess potential for therapeutic targeting. While many variants initially annotated as pLoF do not, in fact, cause a loss of function 6 , rigorous automated filtering can remove common error modes 7 . True LoF variants are generally rare, and show important differences between outbred, bottlenecked 8 , and consanguineous 9 populations 6,10 . Counting the number of distinct pLoF variants in each gene in a population sample allows quantification of gene essentiality in humans through a metric known as constraint [10][11][12][13] . Specifically, the rate at which de novo pLoF mutations arise in each gene is predicted based on DNA mutation rates 10,12 , and the ratio of the count of pLoF variants observed in a database to the number expected based on mutation rates -obs/exp or simply constraint score -measures how strongly purifying natural selection has removed such variants from the population. Annotation of pLoF variants remains imperfect, and continued improvements are being made 14 , but the fact that constraint usefully measures gene essentiality is demonstrated by agreement with cell culture and mouse knockout experiments 7 , by overlap with human disease genes 7,10 and genes depleted for structural variation 15 , and by the power of constraint to enrich for deleterious variants in neurodevelopmental disorders 7,16 .
Building on these insights, here we leverage pLoF variation in the Genome Aggregation Database (gnomAD) 7 v2 dataset of 141,456 individuals to answer open questions in the interpretation of human pLoF variation in disease biology and drug development.

Constraint in human drug targets
We compared constraint in the targets of approved drugs extracted from DrugBank 17 (N=383) versus all protein-coding genes (N=17,604). Drug targets were, on average, just slightly more constrained than all genes (mean 44% vs. 52%, P=0.00028), but the two gene sets had a qualitatively similar distribution of scores, ranging from intensely constrained (0% obs/exp) to not at all constrained (≥100% obs/exp; Fig. 1a). Constraint scores showed clear divergence between categories of genes (Extended Data Table 1) expected to be more or less tolerant of inactivation (Fig. 1b), as previously reported 7,10 , validating the usefulness of constraint as a measure of gene essentiality. Nonetheless, when drug targets were stratified by drug modality, indication, or effect, no statistically significant differences between subsets of drug targets were observed.
The slightly but significantly lower obs/exp value among drug targets may superficially appear to provide evidence that constrained genes make superior drug targets. Stratification of drug targets by protein family, human disease association, and tissue expression, however, argues against this interpretation. Drug targets are strongly enriched for a few canonically "druggable" protein families, for genes known to be involved in human disease, and for genes with tissuerestricted expression; each of these properties is in turn correlated with either significantly stronger or weaker constraint (Extended Data Fig. 1). Although controlling for these correlations does not abolish the trend of stronger constraint among drug targets, the correlation of so many observed variables with a gene's status as drug target argues that many unobserved variables likely also confound interpretation of the lower mean obs/exp value among drug targets.
The overall constraint distribution of drug targets (Fig. 1a) also argues against the view that a gene in which LoF is associated with a deleterious phenotype cannot be successfully targeted. Indeed, 19% of drug targets (N=73), including 52 targets of inhibitors, antagonists or other "negative" drugs, have obs/exp values lower than the average (12.8%) for genes known to cause severe diseases of haploinsufficiency 18 (ClinGen Level 3). To determine whether this finding could be explained by particular class or subset of drugs, we examined constraint in several well-known example drug targets (Table 1). Some heavily constrained genes are targets of cytotoxic chemotherapy agents such as topoisomerase inhibitors or cytoskeleton disruptors, a set of drugs intuitively expected to target essential genes. However, genes with near-complete selection against pLoF variants also include HMGCR and PTGS2, the targets of highly successful, chronically used inhibitors -statins and aspirin.
These human in vivo data further the evidence from other species and models that essential genes can be good drug targets. Homozygous knockout of Hmgcr and Ptgs2 are lethal in mice [19][20][21] . Drug targets exhibit higher inter-species conservation than other genes 22 . Targets of negative drugs include 14 genes with lethal heterozygous knockout mouse phenotypes reported 23 and 6 reported as essential in human cell culture 24 .

Prospects for finding "knockout" individuals
While constraint alone is not adequate to nominate or exclude drug targets, the study of individuals with single hit (heterozygous) or two-hit ("knockout") LoF genotypes in a gene of interest can be highly informative about the biological effect of engaging that target 5 . To assess prospects for ascertaining "knockout" individuals, we computed the cumulative allele frequency (CAF) of pLoF variants in each gene (Online Methods), and then used this to estimate the expected frequency of two-hit individuals under different population structures (Fig. 2) absent natural selection.
Whereas gnomAD is now large enough to include at least one pLoF heterozygote for the majority (15,317/19,194; 79.8%) of genes, ascertainment of total "knockout" individuals in outbred populations will require 1,000-fold larger sample sizes for most genes: the median gene has an expected two-hit frequency of just 6 per billion (Fig. 2a). Even if every human on Earth were sequenced, there are 4,728 genes (25%) for which identification of even one two-hit individual would not be expected in outbred populations. Intuitively, because today's gnomAD sample size is larger than the square root of the world population, variants so far seen in zero or only a few heterozygous individuals are not likely to ever be seen in a homozygous state in outbred populations, except where variants prove common in populations not yet well-sampled by gnomAD.
Because population bottlenecks can result in very rare variants present in a founder rising to an unusually high frequency, we also considered the utility of bottlenecked populations for knockout discovery, using Finnish individuals in gnomAD as an example 8 . Although this population structure can enable well-powered association studies for the small fraction of genes in which pLoF variants drifted to high frequency due to the bottleneck, overall, identification of two-hit pLoF individuals for a pre-specified gene of interest appears equally or more difficult in Finns than in outbred populations ( Fig. 2b and Extended Data Fig. 2), because rare variants not present in a founder have been effectively removed from the population.
Finally, we considered consanguineous individuals, where parental relatedness greatly increases the frequency of homozygous pLoF genotypes. The N=2,912 individuals in the East London Genes & Health (ELGH) cohort 25 who report having parents who are second cousins or closer have on average 5.8% of their genomes autozygous. Here, the expected frequency of two-hit individuals is many times higher than in outbred populations, at 5 per million for the median gene (Fig. 2c).
These projections allow us to draft a roadmap for discovery of human knockouts ( Fig. 2d-e). Of 19,194 genes, 3,367 (18%) already have a human disease association annotated in OMIM, although we note that the discovery of LoF individuals in population databases will still be valuable for assessing penetrance and for identifying LoF syndromes associated with known GoF genes. There are 3,421 genes (18%) without known human disease association that have two-hit pLoF genotypes reported in gnomAD 7 , ELGH 26 , PROMIS 27 , DeCODE 28 , or UK BioBank 29 , suggesting this genotype may be tolerated. An additional 2,190 genes (11%) can be inferred likely intolerant of heterozygous inactivation (pLI > 0.9) in gnomAD, and would be expected to be enriched for genes with severe heterozygous and lethal homozygous LoF phenotypes. Another 2,781 genes (14%) have no pLoF variants yet observed in gnomAD, but our sample size is not yet large enough to robustly infer LoF intolerance. For these genes, observation of outbred two-hit individuals is not expected, and we cannot yet assess the feasibility of identifying consanguineous two-hit individuals because we lack an estimate of pLoF allele frequency. This leaves 7,435 genes (39%) for which one or more pLoFs are observed in gnomAD, but strong LoF intolerance cannot be inferred, nor have two-hit genotypes been observed, nor is a human disease phenotype known. We projected the sample sizes required to identify "knockout" individuals for these genes (Fig. 2e). In outbred populations, current sample size would need to be increased by approximately 1,000-fold before ascertainment of a single two-hit LoF individual would be expected for the typical gene. In contrast, a ~10-to 100-fold increase from current consanguineous sample size, meaning hundreds of thousands of individuals in absolute terms, would identify at least one two-hit LoF individual for the typical gene.
These calculations are based on variants annotated as predicted LoF in gnomAD. Structural and non-coding variation resulting in a loss of function may be missed in exomes, and missense variants resulting in a loss of function cannot be rigorously annotated, leading to underestimation of cumulative LoF allele frequency. Overall, however, our calculations likely represent an upper bound on the total frequency of two-hit individuals in the population. The variants included in this analysis are filtered but have not been manually curated or functionally validated, so some will ultimately prove not to be true LoF. These false positives tend to be more common and will have disproportionately contributed to the cumulative LoF allele frequency. More importantly, for some genes, complete knockout will not be tolerated. When only one or a few two-hit individuals are expected in a dataset, the absence of any such individuals can be due to either early lethality, a severe clinical phenotype incompatible with inclusion in gnomAD, or simply chance. Thus, the ability to infer lethality of this genotype based on statistical evidence will lag behind the identification of two-hit individuals where they do exist (Fig. 2e). For some genes, inference of lethality will always remain impossible in outbred populations, though it may be feasible in consanguineous individuals.

Curation of pLoF variants
Where pLoF variants can be identified, they are a valuable resource for assessing the impact of lifelong reduction in gene dosage. To highlight the challenges and opportunities of identifying such variants, we manually curated gnomAD data and the scientific literature for six genes associated with gain-of-function (GoF) neurodegenerative diseases, for which inhibitors or suppressors are under development [30][31][32][33][34][35] : HTT (Huntington disease), MAPT (tauopathies), PRNP (prion disease), SOD1 (amyotrophic lateral sclerosis), and LRRK2 and SNCA (Parkinson disease). The results ( Table 2  First, other things being equal, genes with longer coding sequences offer more opportunity for LoF variants to arise, and so tend to have a higher cumulative frequency of LoF variants, unless they are heavily constrained. Ascertainment of LoF individuals is thus harder for shorter and/or more constrained genes, even though these may be good targets (Table 1).
Second, many variants annotated as pLoF are false positives 6 , and these are enriched for higher allele frequencies, so that both filtering and curation have an outsized impact on the cumulative allele frequency of LoF. Studies of human pLoF variants lacking stringent curation can therefore easily dilute results with false pLoF carriers.
Third, after careful curation, cumulative LoF allele frequency is sometimes sufficiently high to place certain bounds on what heterozygote phenotype might exist. For example, GoF mutations causing genetic prion disease have a ~1 in 50,000 genetic prevalence 36 and have been known for three decades, with thousands of cases identified, making it unlikely that a comparably severe and penetrant haploinsufficiency syndrome associated with PRNP would have gone unnoticed to the present day despite being more than twice as common (~1 in 18,000). Similar arguments can be made for HTT, LRRK2, and SOD1. Of course, this does not rule out the possibility that heterozygous loss-of-function in these genes could be associated with less severe or less penetrant phenotypes.
Finally, careful inspection of the distributions of pLoF variants can reveal important error modes or disease biology. HTT, MAPT, and PRNP each have different non-random positional distributions of pLoF variants (Fig. 3). High-frequency HTT pLoF variants cluster in the polyglutamine/polyproline repeat region of exon 1 and appear to be alignment artifacts (Fig. 3a). True HTT LoF variants are rare and the gene is highly constrained, which might suggest some fitness effect in a heterozygous state in addition to the known severe homozygous phenotype 37,38 , although the frequency of LoF carriers still argues against a penetrant syndromic illness, consistent with the lack of phenotype reported in heterozygotes identified to date 38,39 . High-frequency MAPT pLoF variants cluster in exons not expressed in the brain in GTEx data 14,40 , and all remaining pLoFs appear to be alignment or annotation errors (Fig 3b). No true LoFs are observed in MAPT, although our sample size is insufficient to prove that MAPT LoF is not tolerated -among constitutive brain-expressed exons, we expect 12.6 LoFs and observe 0, giving a 95% confidence interval upper bound of 23.7% for obs/exp. PRNP truncating variants in gnomAD cluster in the N terminus; the sole C-terminal truncating variant in gnomAD is a dementia case (Extended Data Table 2), consistent with variants at codon ≥145 causing a pathogenic gain-of-function through change in localization (Fig. 3c). Within codon 1-144, PRNP is unconstrained, and no neurological phenotype has been identified in individuals with truncating variants to date, consistent with the hypothesis that N-terminal truncating variants are true LoF and are tolerated in a heterozygous state 41 .

Discussion
The study of gene inactivation through human genetic databases can illuminate human biology and guide drug target selection, complementing mouse knockout studies 42 , but analysis of any one gene requires genome-wide context to set expectations and guide inferences. Here we have used gnomAD data to provide context to aid in the interpretation of human LoF variants.
Targets of approved drugs span a spectrum from highly constrained to completely unconstrained. Why do some genes apparently tolerate pharmacological inhibition but not genetic inactivation? LoF variants, at least in constitutive exons, should affect all tissues for life, whereas drugs differ in tissue distribution and timing and duration of use. Many drugs known or suspected to cause fetal harm are tolerated in adults 43 , and might target developmentally important genes. Constraint is believed to primarily reflect selection against heterozygotes 13 , whose effective gene dosage may differ from that achieved by a drug. Constraint measures natural selection over centuries or millennia; our ancestors' environment presented different selective pressures than what we face today. Finally, the actions of small molecule drugs do not always map one-to-one onto genes 44-47 . Regardless, these human in vivo data show that even a highly deleterious knockout phenotype is compatible with a gene being a viable drug target.
For most genes, the lack of total "knockout" individuals identified to date does not yet provide statistical evidence that this genotype is not tolerated, and, for many genes, such evidence may never be attainable in outbred populations. Bottlenecked populations, individually, are unlikely to yield two-hit individuals for a pre-specified gene of interest, though the sequencing of many different, diverse bottlenecked populations will certainly expand the set of genes accessible by this approach. Identification of two-hit individuals will be most greatly aided by increased investment in the ascertainment and characterization of consanguineous cohorts, where the sample size required for any given gene is often orders of magnitude lower than in outbred populations. Our analysis is limited by sample size, insufficient diversity of sampled populations, and simplifying assumptions about population structure and distribution of LoF variants, so our calculations should be taken as rough, order-of-magnitude estimates. Nonetheless, this strategic roadmap for the identification of human "knockouts" should inform future research investments and rationalize the interpretation of existing data.
Recall-by-genotype efforts to characterize humans with genotypes of interest are only valuable if the variants in question are true LoF. Automated filtering 7 and transcript expression-aware annotation 14 are powerful tools, but we demonstrate the continued value of manual curation for excluding further false positives, assessing and interpreting the cumulative allele frequency of true LoF variants, and identifying error modes or biological phenomena that give rise to nonrandom distributions of pLoF variants across a gene. Such curation is essential prior to any recontact efforts, and indeed, establishing methods for high-throughput functional validation 48 of LoF variants should be a high priority. Our curation of pLoF variants in neurodegenerative disease genes is limited by a lack of functional validation and detailed phenotyping; in a companion paper we demonstrate a deeper investigation of the effects of LoF variants in LRRK2 49 .
As the value of human genetics for drug discovery has been demonstrated repeatedly, we expect that drug development projects will increasingly be accompanied by efforts to study the phenotypes of human carriers of LoF variants. Because the cost of drug discovery is driven overwhelmingly by failure 50 , successful interpretation of LoF data to select the right targets and the right clinical pathways will yield an outsize benefit for research productivity and, ultimately, human health.      Table 3.

Calculation of pLoF constraint
The calculation of constraint values for genes has been described in general elsewhere 10,12 and for this dataset specifically by Karczewski et al 7 . Constraint calculations used LOFTEE-filtered ("high confidence") single-nucleotide variants (which for pLoF means nonsense and essential splice site mutations) found in gnomAD exomes with minor allele frequency < 0.1%. Only unique canonical transcripts for protein-coding genes were considered, yielding 17,604 genes with available constraint values. For curated genes (Table 2), the number of observed variants passing curation was divided by the expected number of variants to yield a curated constraint value. For PRNP, the expected number of variants was adjusted by multiplying by the ratio of the sum of mutation frequencies for all possible pLoF variants in codons 1-144 to the sum of mutation frequencies for all possible pLoF variants in the entire transcript, yielding 6 observed out of 6.06 expected. For MAPT, the expected number of variants was taken from Ensembl transcript ENST00000334239, which includes only the exons identified as constitutively brainexpressed in Fig. 3b.

Calculation of pLoF heterozygote and homozygote/compound heterozygote frequencies
LOFTEE-filtered high-confidence pLoF variants with minor allele frequency <5% in 125,748 gnomAD exomes were used to compute the proportion of individuals without a loss-of-function variant (q); the CAF was computed as p = 1-sqrt(q). This approach conservatively assumes that, if an individual has two different pLoF variants, they are in cis to each other and count as only one pLoF allele.
For outbred populations (Fig. 2a), we used the value of p from all 125,748 gnomAD exomes, as this allows the largest possible sample size. This includes some individuals from bottlenecked populations, for which the distribution of p does differ from outbred populations, but these individuals are a small proportion of gnomAD exomes (12.6%). This also includes some consanguineous individuals, but these are an even smaller proportion of gnomAD exomes (2.3%), and any difference in the value of p between consanguineous and outbred populations is expected to be very small. Heterozygote frequency was calculated as 2p(1-p) and homozygote and compound heterozygote frequency was calculated as p 2 . Lines indicate the size of gnomAD (141,456 individuals) and the world populaton (6.69 billion).
For bottlenecked populations (Fig. 2b), we used the value of p from the 10,824 Finnish exomes only. Lines indicate the number of Finns in gnomAD (12,526) and the population of Finland (5.5 million).
For consanguineous individuals (Fig. 2c), we again used the value of p from all gnomAD exomes, because p is not expected to differ greatly in consanguineous versus outbred populations. We used the mean proportion of the genome in runs of autozygosity (a) from individuals self-reporting second cousin or closer parents in East London Genes & Health, a = 0.05766 (rounded to 5.8%). Heterozygote frequency was calculated as 2p(1-p) and homozygote and compound heterozygote frequency was calculated as (1-a)p 2 + ap. Lines indicate the number of consanguineous South Asian individuals in gnomAD (N=2,912, by coincidence the same number as report second cousin or closer parents in ELGH) based on F > 0.05 (a conservative estimate, since second cousin parents are expected to yield F = 0.015625), and the estimated number of individuals in the world with second cousin or closer parents (10.4% of the world population) 9 .
Several caveats apply to our CAF analysis. Our approach naively treats genes with no pLoFs observed as having p=0, even though pLoFs might be discovered at a larger sample size. It also naively treats genes with one pLoF allele observed as having p=1/(2*125748), even though on average singleton variants have a true allele frequency lower than their nominal allele frequency 10 . We naively group all populations together, even though the distribution of populations sampled in gnomAD does not reflect the world population 7 ; we believe this is reasonable because CAF for many genes is driven by singletons and other ultra-rare variants for which frequency is not expected to differ appreciably by continental population 10 . It is important to note that the histograms shown in Fig. 2 reflect the expected frequency of heterozygotes and homozygotes/compound heterozygotes, based on gnomAD allele frequency, rather than the actual observed frequency of individuals with these genotypes in gnomAD. Finally, the sample size for all gnomAD exomes ( Fig. 2a and 2c) is larger than for only Finnish exomes (Fig. 2b). For a version of Fig. 2 with the global gnomAD population downsampled to the same sample size as the gnomAD Finnish population, see Extended Data Fig. 2.

Knockout roadmap
For the knockout "roadmap" (Fig. 2d-e) we classified genes according to the current status of human disease association and LoF ascertainment. Genes were classified as having a Mendelian disease association if they were present in OMIM with the filters described in Extended Data Table 1.
Remaining genes were classified as "2-hit LoF reported" based on presence in one or more of the following gene lists: homozygous LoF genotypes in gnomAD curated as described 7 ; filtered homozygous LoF genotypes in runs of autozygosity with minor allele frequency <1% in canonical transcripts in the Bradford, Birmingham, and ELGH 25 cohorts (total N=8,925); observed number of imputed homozygotes >1 or number of compound heterozygous carriers where minor allele frequency <2% (for both variants) in DeCODE 28 ; homozygous LoF reported in PROMIS 27 ; homozygous LoF with minor allele frequency <1% in UK Biobank 29 .

Genetic prevalence estimation
Here, we define "genetic prevalence" for a given gene as the proportion of individuals in the general population at birth who harbor a pathogenic variant in that gene that will cause them to later develop disease. Genetic prevalence has not been well-studied or estimated for most disease genes.
In principle, it should be possible to estimate genetic prevalence simply by examining the allele frequency of reported pathogenic variants in gnomAD. In practice, three considerations usually preclude this approach. First, the present gnomAD sample size of 141,456 exomes and genomes is still too small to permit accurate estimates for very rare diseases. Second, the mean age of gnomAD individuals is ~55, above the age of onset for many rare genetic diseases, and individuals with known Mendelian disease are deliberately excluded, so pathogenic variants will be depleted in this sample relative to the whole birth population. Third and most importantly, a large fraction of reported pathogenic variants lack strong evidence for pathogenicity and are either benign or low penetrance 10,41 , so without careful curation of pathogenicity assertions, summing the frequency of reported pathogenic variants in gnomAD will in most cases vastly overestimate the true genetic prevalence of a disease.
Instead, we searched the literature and very roughly estimated genetic prevalence based on available data. In most cases, we took disease incidence (new cases per year per population), multiplied by proportion of cases due to variants in a gene of interest, multiplied by average age at death in cases. In some cases, estimates of at-risk population or direct measures of genetic prevalence were available. Details of the calculations undertaken for each gene are provided in Extended Data Table 2.

Data and source code availability
Analyses utilized Python 2.7.10 and R 3.5.1. Data and code sufficient to produce the plots and analyses in this paper are available at https://github.com/ericminikel/drug_target_lof providing information on individuals with PRNP truncating variants. We thank Jeffrey B. Carroll, Karl Heilbron, J. Fah Sathirapongsasuti, and Laurent C. Francioli for comments and suggestions. A subset of the analyses reported here originally appeared as a blog post on CureFFI.org (http://www.cureffi.org/2018/09/12/lof-and-drug-safety/).  79 . The GWASimplicated common variants were not included in the case series on which our estimate is based 64 , but G2019S does account for the majority of cases in that series. Because the 0.03% estimate here is based on counting symptomatic cases rather than asymptomatic individuals, it will appropriately underestimate the number of G2019S carriers. In essence, in this calculation each G2019S carrier in the population only counts as 1/3 of a person, because they have only a 1/3 probability of developing a disease. It is therefore appropriate that our estimate of genetic prevalence (0.03%) is actually lower than double the allele frequency of G2019S in gnomAD (0.1%). Fig. 3c is the total number of definite or probable cases with sequencing performed in the studies cited in this table. The L234Pfs7X variant changes PrP's C-terminal GPI signal from SMVLFSSPPVILLISFLIFLIVGX to SMVPSPLHLX. This novel sequence does not adhere to the known rules of GPI anchor attachment 80 : GPI signals must contain a 5-10 polar residue spacer followed by 15-20 hydrophobic residues. Thus, this frameshifted PrP would be predicted to be secreted and thus may be pathogenic, explaining the Alzheimer disease diagnosis in this individual. However, it is also possible that the novel C-terminal sequence found here interferes with prion formation, and/or that this variant is incompletely penetrant, and that the diagnosis of Alzheimer's disease in this individual is merely a coincidence.

Extended Data Figure 2 | Expected frequency of individuals with one or two null alleles for every protein-coding gene across different population models, with sample size held
constant. This is identical to Fig. 2 except as follows. As noted in Online Methods, one caveat about Fig. 2 is that the sample size is larger for the plots using all gnomAD exomes (Fig. 2a and  2c) than for Finnish exomes (Fig. 2b). This figure shows the same analysis, but with the global gnomAD population downsampled to 10,824 randomly chosen exomes so that the same size is identical to that of Finnish exomes. Computation of p = 1-sqrt(q) as described in Methods is computationally expensive for downsampled datasets because it requires individual-level genotypes. Instead, this analysis uses "classic" CAF, which is simply the sum of allele frequencies of all high-confidence pLoF variants each at allele frequency <5%, capped at a total of 100%, for both global and Finnish exomes. The results show that even when sample size is held constant, the number of genes with zero pLoF variants observed is higher in a bottlenecked population than in a mostly outbred population. A constant y axis with no axis breaks is used in this figure to make this difference more clearly visible.