Introduction

An analysis of counts of homozygous versus heterozygous among the 201k single nucleotide, missense variants from the 1000 Genome Project (1 kG)1 shows a general increasing proportion of homozygotes (Fig. 1a) that also agrees with simulated datasets (Fig. 1b, see Methods). Mendelian inheritance suggests that the degree of homozygosity should be 8–29%, which in turn dictates that positions with variants in ≥41 genomes (in 1 kG) would be unlikely to have zero homozygous (hom) counts. Indeed, in simulations we see very few exclusively homozygous positions above this value and none at all above 259. After removing positions where 1 kG and gnomAD1,2 datasets wildly disagreed in terms of minor allele frequencies, we had 167k variants of which 1943 have zero hom counts and total variant count ≥41 in the 1 kG dataset, of which 38 also satisfy the same requirement within gnomAD. We reasoned that errors were more likely in the more heterogeneous gnomAD dataset, so we tolerated a hom count ≤ 5 to derive a set of 353 variants. Lowering this threshold gives values of 223, 156, 112, 75 and 50 (for values 4 through 0). This threshold is, of course, somewhat arbitrary as we do not have any benchmark (or indeed knowledge) of the phenomenon we are investigating. More conservative sets can be extracted from the dataset in Supplementary Table 1.

Fig. 1: Exclusively heterozygous variants in 1 kG.
figure 1

a Plots of homozygous vs heterozygous counts for the 1 kG dataset. The preponderance of values on the X axis (i.e. zero homozygous counts) are indicated. b As in a) but with shuffled 1 kG data. c How the distribution of functional impact scores changes as homozygous counts decrease. d How the distribution of functional impact scores changes for sites where homozygous counts are zero with increasing heterozygous counts.

We also excluded genes that are highly repeat-prone, unusually subject to mutations or that showed any hints of the existence of pseudogenes that might obscure the signal (specifically Filaggrin, Mucins, Olfactory receptors and Rootletin). This then gave a final set of 286 exclusively heterozygous variants that we considered further. A summary of how we defined and filtered the data to arrive at these variants is given in Fig. 2.

Fig. 2: Filtering and data processing overview.
figure 2

Overview showing the processing and filtering of 201k missense variants based on exclusive heterozygosity.

We studied these 286 exclusively heterozygous variants to test if the phenomenon might be explained by various gene features. They show no obvious pattern in terms of chromosomal location and there are no clear clusters around any loci. We believe this rules out a haplotype-based explanation for their occurrence. The genes also show no increased tendency (compared to others) in their tolerance of loss-of-function variants (as defined using pLI or pRec from gnomAD). We also found no significant difference between these genes, 3586 disease genes (those with at least one Mendelian variant defined in UniProt/OMIM) or all other genes when comparing variant profiles, defined by the number of individuals (1 kG) that contained homozygous, heterozygous or compound heterozygous variants at any position.

For comparison, we also considered other variant types than missense. The 2k frame-shift variants, in-frame insertions/deletions and stop-gains (Supplementary Figure 1a-c) have a number of exclusively heterozygous variants as might be expected as they are likely to alter/abate protein function3. Reassuringly however, we see significantly fewer exclusively heterozygous splice (Wilcoxon rank sum test, p = 2.69×10−12), intronic (p = 2.78×10−12), UTR (p = 9.1×10−8), synonymous (p < 2.2×10−16) or non-coding (p = 2.55×10−11, Supplementary Figure 1d-h) variants compared to missense. Overall, these observations support the notion that many of our observed exclusively heterozygous missense variants could be functional.

Interestingly 30 of these exclusively heterozygous missense variants lie in 28 known disease genes4, but are currently not known to be causative (see examples below). However, the majority of variants (256) are in genes not currently associated with disease.

We used several metrics to assess the structural and functional impact of the above variants (Fig. 2). These included conservation across sets of orthologs or human paralogs, structural measurements derived from Alphafold2 structures5,6, functional information from UniProt7, haplotype insufficiency, previous observations of gene-disease association and others (see Methods). The Bayesian combination of scores gave a performance on a benchmark set roughly in line with the best previously published predictors (Supplementary Figure 2b). We avoided including the existing methods8,9,10 into the integrated score as several of our candidate variants (i.e. common SNPs) are included in the negatives used to benchmark the respective methods.

The degree of heterozygous exclusivity appears to enrich for functional impact. For example, for all 1 kG variants with zero homozygous counts, the fraction of those having a predicted functional impact increases with the heterozygous count (Fig. 1d – red line). In addition, increasing the homozygous count above zero diminishes the fraction that have a predicted impact (Fig. 1c – black line). Both of these observations suggest that the degree of exclusive heterozygosity enriches for functionally disruptive variants thus implying a fraction of these positions are likely to disrupt or modify protein function. Both of these results are also seen when using previously published variant impact predictors (e.g. PMUT, Supplementary Figure 2d), though to a lesser degree. Of the 286 exclusively heterozygous variants, 87 (30.4%) have a functional impact score (≥11) with a false-discovery rate < 1% and a false positive rate < 5%, also suggesting a substantial enrichment of functionally relevant changes.

As might be expected, these 87 genes show little coherence in terms of function (no significant enrichment via DAVID11 or GetGo12), though certain broad groups are apparent in (albeit insignificant) gene-enrichment output, protein function (Uniprot) and the literature. Interestingly, these are conditions that might lead to symptoms in single tissues later in life (e.g. eyes or kidneys in the ciliopathies) or those that might only manifest under certain circumstances (autoimmunity or obesity). Below, we discuss six examples in more detail we found convincing after inspection of existing information on gene/protein function, associated diseases and/or other disease variants was available. The full list of 286 variants is given in Supplementary Table 1.

Among variants potentially affecting ciliary processes is p.Gln200Leu in coiled-coil domain-containing protein 8 (CCDC8; Fig. 3a), which is exclusively heterozygous in 44 (1.8%) 1 kG participants as well as in 694 (0.26 %) gnomAD individuals (though with 6 homozygous instances). CCDC8, together with CUL7 and OBSL1 forms the 3 M complex involved in regulating microtubule dynamics and genome integrity13. Mutually exclusive, homozygous or compound heterozygous mutations in these three genes are causative of 3 M syndrome13, an autosomal recessive growth disorder with prenatal growth restriction and the failure of postnatal catch-up, resulting in short stature and skeletal abnormalities14 and a likely ciliopathy12. Gln200 is largely conserved in vertebrates and lies within a short ordered segment15. CCDC8-null mice showed defects in trophoblast motility known to result in complications during pregnancy such as placentation failures or even fetal death16 Other known CCDC8 3 M syndrome mutations are stop-gains or frameshifts, though there are 3 M missense mutations in CUL717. It has been proposed that these CCDC8 3 M mutations disrupt the binding of ANKRA2, which has been shown to recognize a C-terminal motif in CCDC8 (Fig. 3a)18. Gln200 lies in a putative WW domain region that is phosphorylated and indeed high-throughput studies have identified putative phosphorylation events at Tyr197 and Ser20219. Nearby phosphorylations are thought to mediate interactions with other 3 M proteins16; indicating that p.Gln200Leu might disrupt folding or interactions involving this region of CCDC8.

Fig. 3: Examples of exclusively heterozygous variants.
figure 3

Examples of exclusively heterozygous variants showing hints of a possible structural/functional consequence. a Top: Jalview78 alignment of selected mammalian orthologs around Gln200 (Arrow) in CCDC8. Conserved residues are shown in ClustalX colours. Bottom: domain diagram superimposed on top of a IUPred plot of protein disorder15. Locations of phosphorylated serines and other mutations associated with disease are labelled in addition to p.Gln200Leu. b Left: as for a). Right: VMD79 representation of the Alphafold26 NLRP12 model showing the location of Asn394. The zoomed view highlights (ball-and-stick representation) sidechains (Tyr390) or mainchain (Ser427 Arg429) atoms in contact with Asn394 (spheres). c Left: alignment as for a) but with PANK1-3 paralogs from Uniprot Sprot. Right: VMD representation of location of PANK3 Ile301 on the crystal structure (RCSB PDB:6pe6). The zoomed image shows how PANK3 (cyan spheres) packs tightly against hydrophobic sidechains (brown). d Left: alignment as for c). Right: VMD representation of a superimposition80 of the Alphafold2 structure of RHD superimposed with two copies of RHCG (using RCSB PDB:3hd6). The location of Tyr311 is shown (cyan/red spheres) as are the Ca atoms of residues harbouring weak D mutations (magenta).

Another candidate variant affecting ciliary function is p.Pro420Leu in the transcription factor GLIS2, which is exclusively heterozygous in 44 (1.7%) 1 kG participants and in 255 (0.1%) in gnomAD individuals (with 3 homozygous instances). There is no known or confidently predicted structure for this region of GLIS2, though Pro420 is largely conserved across homologs. GLIS2 has recently been reported to have important functions in cellular reprogramming20,21 and a fusion of CBFA2T3 to GLIS2 is frequently observed in early onset acute leukaemia, often associated with poor prognosis22,23. Additional members of this protein family are also causative of diseases, for example GLI1 is causative of postaxial Polydactyly24, ZIC1 in Craniosynostosis25 and ZIC2 in Holoprosencephaly26. Interestingly, variants in GLIS2 are causative of Nephronophthisis, an end-stage kidney disease in children and young adults27,28.

Among the variants affecting proteins in the immune system is p.Asn394Lys in NLRP12 (Fig. 3b), a protein expressed in dendritic cells and macrophages29. This variant is exclusively heterozygous in 72 (2.9%) 1 kG and 70 (0.03%) gnomAD individuals. Asn394 lies within the NACHT domain of NLRP12 and is highly conserved in homologs (Fig. 3b). NLRP12 acts as a negative regulator of various inflammatory processes30. Loss-of-function, through deletions or frame-shift variants are associated with Familial cold autoinflammatory syndrome 231, a disease triggered by exposure to cold with typical inflammatory symptoms (i.e. fever, rashes, myalgia and headaches).

Another candidate variant is p.Ala270Thr in Beta-1-Adrenergic Receptor 1 (ADRB1) which is exclusively heterozygous in 44 (1.8%) 1 kG and 353 (0.15%) gnomAD individuals (although with 2 homozygous instances). At least one variant in this protein (p.Arg389Gly) is associated with congestive heart failure32. ADRB1 plays roles in stimulate brown adipose tissue, suggesting that loss-of-function variants could progress to obesity and insulin resistance33. Both variants are located on the intracellular portion of the receptor (Ala270 is in the third intracellular loop, Arg389 is at the C-terminus), suggesting roles in G-protein recognition and signalling34. Ala270 is mostly conserved in mammals, though certain species have other residues, including at least one that has Thr at this position (Bottlenose dolphins).

The missense variant p.Ile301Phe from Pantothenate kinase 3 (PANK3; Fig. 3c) is exclusively heterozygous in 1 kG with 528 (21 %) and in gnomAD with 1820 (0.7%) carriers. PANK3 is one of three kinases essential in coenzyme A biosynthesis35 and are largely preserved across all three kingdoms of life. Loss-of-function variants in S. cerevisae and D. melanogaster are not viable36. Ile301 lies in the fumble domain37 buried in the core of the protein just under the active site of the enzyme. This position is either Isoleucine or Valine in all homologs of PANK1-3, suggesting that even the seemingly conservative change to Phenylalanine would not be tolerated (Fig. 3c). Defects in coenzyme A biosynthesis, particularly mutations in the close PANK3 paralog PANK2, are associated with Neurodegeneration with Brain Iron Accumulation (NBIA) or Hallervorden-Spatz syndrome, a recessive neurological disorder. Notably, known variants include PANK2 p.Ile501Thr which is at the equivalent position to PANK3 Ile30138. Introducing human wild-type PANK3 to PANK2 equivalent knockout (fbl-/-) Drosophila can partly rescue the WT phenotype39 suggesting some equivalency of these close paralogs. The absence of homozygous individuals despite so many heterozygous carriers makes it tempting to suggest that this PANK3 mutation could cause a similar recessive condition.

Blood group Rh(D) polypeptide (RHD) variant p.Tyr311Ser is never homozygous despite occurring in 623 and 1080 times in 1 kG and gnomAD respectively. RHD is a non-transporting homolog of other transporters (e.g. RHCG) with which it forms heterotrimers that together are involved in ammonium transport between erythrocytes and kidney/liver40. The crystal structure of RHCG41 shows that the equivalent of Tyr311 (Tyr323) lies at the protein-membrane interface, with several intramolecular hydrophobic contacts to other protein residues (Fig. 3d). In orthologs and indeed wider homologs this position is nearly always hydrophobic, is never Serine and curiously whether the position is Tyr or Cys appears to indicate whether the protein is RHD or the close paralog RHCE, respectively. Serine is disfavoured at membrane interfaces, so replacing Tyr with Ser at this position could alter the membrane position or the trimer structures41. At least 14 variants in RHD (magenta in Fig. 3d) have been previously associated with the “weak D antigen”42 of which 8 are concentrated in a region around position 311 (residues 270-339). It is plausible that two copies of this mildly or fully dysfunctional RHD subunit could lead to a disease phenotype.

Our variant impact score filtering is stringent, meaning there could well be additional promising candidates in the excluded variants. For instance, the filtered variant p.Ser244Gly in placental Alkaline Phosphatase (ALPP) is exclusively heterozygous (counts of 42 and 361 in 1 kG/gnomAD) and lies close in sequence and structure to Hypophosphatasia (HOPS) variant positions in the close paralog ALPL. These variants are associated with reduced enzyme activity43,44. Elsewhere, the sperm head protein Zonadhesin (ZAN) has an exclusively heterozygous variant p.Leu871Pro (43/1155 het counts and zero hom counts in 1 kG/gnomAD) that lies in an unusual repeat region in extracellular portion of this long protein. Defects in this protein might be affect to prevent haploid sperm from penetrating the egg, making it then impossible for a homozygous individual to arise.

It is possible that homozygous variants are lacking for these examples because of clear illness or disability in living persons (hence precluding inclusion in a healthy/control dataset). For certain conditions (e.g. ciliopathies only affecting certain tissues in later life), the disorder might simply not be known, which might be the case for cases such as CCDC8 where there are possibly a small number of homozygous individuals. Equally possible is that the homozygous variants are so severe that individuals are entirely unviable. This possibility is easier to argue for variants such as those in PANK3 or RHD that have more than 1000 heterozygous despite zero homozygous counts. Another possibility is that some of these variants might be beneficial when heterozygous despite being detrimental when homozygous (as with the sickle cell homozygous β-globin variant p.Glu7Val that provides malaria resistance when heterozygous45,46).

Evolutionary population genetics argues that deleterious variants are eventually removed from a population by purifying selection. If indeed homozygous variants are not viable or diseased, then necessarily carriers will have a lower fitness. Why are these variants still in the population? It has been argued that humans are indeed undergoing purifying selection47,48,49 and certain recessive diseases are probably examples50,51,52 of those in the process of being removed (e.g. SMA153,54, IMD31B55,56, NSHPT57,58). It is difficult to argue that these candidate genes are in this category, though several of them (41/286, 14.3%) show enrichment in human sub-populations (Supplementary Table 1). For instance, NLRP12 p.Asn394Lys is twice as frequent in American, East-Asian and European populations compared with African populations, PANK3 p.Ile301Phe has a frequency of 11-14% in non-African populations i.e. 2-3 times that of African populations (5%). It is plausible that these variants were fortuitously enriched in original migratory populations.

It remains a possibility that experimental or technical artefacts are the reason why these variants are in the current databases. We managed to rule out most obvious genome features that could give rise to our observations, though there could well be oddities of epigenetics/genome modifications of which we are currently unware. We believe that requiring consistency between gnomAD and 1 kG rules out individual database biases, though this does not eliminate systematic methodological problems that might arise during variant calling and other processing steps.

There are clear experiments that could test the validity of these hypotheses. For instance, enzymes, such as PANK3, have been probed biochemically, including a number of site-directed mutants59. Our observations would suggest that p.Ile301Phe would alter the active site of the enzyme with effects likely observable through enzymology. It would also be interested to test cells engineered to be homozygous for these variants using CRISPR/Cas9 or similar genome tools; we would predict that many would have observable phenotypes.

Our findings, like several others60,61,62, demonstrate the power of exploiting healthy genomes to identify potentially new insights into diseases and molecular function. Our few variants above are a subset of a bigger number of variants occurring in ostensibly healthy people that drastically alter protein function. Their existence raises wider questions about gene/protein function and evolution and additional investigations will likely be highly illuminating. As databases continue to grow, more variants like those described here will be uncovered providing a potentially powerful resource to diagnose and understand human genetic diseases.

Methods

Data and processing

We extracted missense SNVs from variant call files (VCFs) from the 1 kG (Phase 3, 2,504 individuals) and gnomAD (V2.1.1, 125,748 exomes). We considered only those 1 kG variants where data were available in gnomAD and where the difference in Minor Allele Frequency (MAF) was ≤10%. We also converted MAF > 50% by inverting them (100-MAF). Within 1 kG we also considered other variant types (synonymous, in-frame indels, frameshifts, stop-gains, UTR, splicing, intronic or non-coding) for comparison. For easier visual comparison with missense counts in Fig. 1, the plots in Supplementary Figure 1 were created by randomly selecting up to 11k variants showing homozygous vs heterozygous counts for each variant.

Defining exclusively nonhomozygous variants

The parental genotype in 1 kG data is unavailable. If it is assumed that homozygous variant carriers are viable and able to reproduce, the chance of homozygous offspring is approximately 0.29 after combination of all possible G0 constellations (AA x AA, Aa x Aa, AA x aa, aa x aa, Aa x AA and Aa x aa with a = WT trait and A = variant trait). When the assumption is that homozygous variant carriers are non-viable and unable to reproduce, the chance of homozygous offspring shrinks to 0.083 after the combination of all remaining G0 constellations (Aa x Aa, aa x aa, Aa x aa). The distribution of heterozygous and homozygous counts present in 1 kG generally follows this regimen (Fig. 1a). We used the Mendelian laws of inheritance-based likelihood for homozygous offspring to subject each 1 kG variant to binomial testing, defining the minimum requirement of ≥ 41 heterozygotes with 0 homozygotes for variants to be further considered.

We removed genes and their associated variants if the proteins were repeat prone (based on literature reviews; mucins, filaggrin) or more prone to mutation (olfactory receptors) or if they had any match (E ≤ 0.001 and ≥80% protein sequence identity by TBLASTN63 to the set of 22334 (of 204563 cDNAs) pseudogenes annotated in Ensembl64 version 106.

Orthologs and alignments

We computed the orthologs for all proteins in the Uniprot proteome of Human (Proteome ID - UP000005640; retrieved April 2021) using the Orthofinder program65. Briefly, we used the canonical proteomes of Human and 507 other organisms from across the tree of life to compute the orthologs. In the Orthofinder program, we used the option of computing multiple sequence alignments to build gene trees and supplied an in-house species tree (will be published elsewhere) to infer the orthologs for each species pair. Next, for every protein in the Human proteome, we gathered all the orthologs across species allowing for one-to-one, one-to-many and many-to-many relationships. As mentioned above, the Orthofinder program also calculated the multiple sequence alignments for each Orthogroup (homologous group containing orthologs and paralogs). The alignments were calculated using the MAFFT L-INS-i method when there were <500 sequences in a group and the native MAFFT method66 for larger groups. To obtain the alignments for orthologs, we subset the Orthogroup alignments for each Human protein and its respective orthologs, and removed any positions that contained all gaps.

Shuffling the 1 kG variants

A random value between 0 and 1 was chosen based on a uniform pseudo-random number generating algorithm67 and compared with the observed allele frequency for a given variant, with a decimal smaller/equal to the observed allele frequency determined to yield a mutated allele. For example: an allele with a variant frequency of 19 % would be considered mutated when the random number would lie between 0 and 0.19. Simulated individual genotypes would then consist of two consecutively shuffled alleles. A total of 2504 individuals (5008 alleles) were subjected to this shuffling and each of the 201k variants would undergo 100 simulation cycles. Then, the average simulated genotype would be calculated from the heterozygous and homozygous counts for each variant from each cycle.

An integrated score for variant impact on protein function

We used alignments of orthologs (feature name ortho) and all homologs (homo) to compute HMMer profiles68 which provided log odds scores for each amino acid and each position. The score for any mutated position was taken as the difference between the mutated value in these profiles and the wild-type. We also used scores from the BLOSUM62 matrix for each variant (blosum).

We used structures for all human proteins constructed by Alphafold5 to define a variety of structural parameters. We first computed secondary structure (sec), main-chain dihedral (psi/phi) angles, and accessibility (acc) using DSSP69,70. We also computed burial (bur) as the accessibility of a Gly-X-Gly tripeptide minus the DSSP accessibility value. Note that burial and accessibility are thus not direct equivalents as the amino acid size affects them differently. For these commodities we then studied amino acids in representatives (fourth level of the hierarchy) of the ECOD database71 (v281) to first define divisions into zones: secondary structure: helical (characters H,G) strand (E,B), or coil (others); dihedral angles: a 12 × 12 grid with phi and psi (−180 – 180) in increments of 30. accessibility: low (0-15), medium (16-59), (≥60); burial: low (0-114), medium (115-164), high (≥165). We then computed log-odds scores of observed counts versus expected (based on the abundance of amino acids and the totals in each zone). For every variant we then computed the score for each commodity as log-odds mutant – log-odds wild-type) where negative values indicate a poorer fit for the mutant and vice versa. For structural parameters using Alphafold data we did not consider how the confidence scores and quality will affect wild-type and mutants equally. We also used the impact score (mech) from Mechismo72 for each variant and devised an equivalent (mech-intra) score using residue pair-potentials for intramolecular (in contrast to intermolecular) contacts across the ECOD dataset.

Information about approved drug targets was retrieved from the U.S. Food & Drug Administration73, scoring genes where medications were already approved for and when the gene was listed as disease causing in the Online Mendelian Inheritance in Man database4 with a (FDA or OMIM) score of 1.

For each gene we considered existing annotations on haplotype insufficiency, retrieved from ClinGen74,75. Genes that were associated with an autosomal recessive phenotype received a (haplotype) score of 1 and decreased to 0.75 when sufficient information was available or to 0.5 and 0.25 when some or only minimal information was available. Absence of information or unlikeliness for dosage sensitivity scored 0.

Lastly, we collected information about post-translation modifications, active centers and known variants from UniProt7 (uniprot-function).

We combined all of these scores (ortho, homo, blosum, sec, acc, bur, phi/psi, mech, mech-intra, FDA, OMIM, haplotype, uniprot-function) into a combined functional impact score using Bayesian integration76,77,

$$log_2\left( {O_{prior}} \right) + \mathop {\sum}\limits_{i = 1}^N {log} _2\left( {\frac{{\left( {D_i|P_{true}} \right.}}{{D_i|P_{false}}}} \right)$$
(1)

where \(D_i|P_{true}\) and \(D_i|P_{false}\) correspond to the true and false positive rates (TPR and FPR), which were obtained from ROC curves considering 26767 known disease causing variants from ClinVar as positives and a 4103 as negatives. We set \(O_{prior} = 1\), arbitrarily, as we were only interested in the ranking of values and not the absolute number.

We also compared this new value to values for PMUT9, PolyPhen28 and SIFT10 on the same dataset (Supplementary Figure 2b).

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.