Introduction

Ectodysplasin A1 receptor (EDAR) is a cell surface receptor involved in the development of ectodermal structures including hair, teeth and glands [1]. Upon activation by its ligand ectodysplasin (EDA), EDAR signals through its cytoplasmic adapter protein EDARADD to trigger the activation of the transcription factor NF-κB [2], this signalling sequence being essential for its function [3]. Disruption of this highly conserved signalling pathway via loss-of-function mutations of any of EDA, EDAR, or EDARADD causes hypohidrotic ectodermal dysplasia (HED) [2, 4, 5], a condition characterised by sparse hair, the loss or reduction of many skin-associated glands, and tooth agenesis [6]. Selective absence of teeth can occur as a result of milder function-reducing mutations in the EDA–EDAR pathway, without eliciting the complete set of clinical HED phenotypes [7].

The death domain of the EDAR protein is essential for the recruitment of the EDARADD protein and thus for EDAR function [2]. This domain is present in many proteins, the majority of which are involved in cell death and inflammation [8]. The death domain is ~80 amino acids in length and is composed of six alpha helices, these forming a surface that is capable of self-association and of binding to other specific death domain containing proteins [8, 9].

A non-synonymous single-nucleotide variant (SNV), rs3827760 (NM_022336.4:c.1109T>C), hereafter referred to as EDAR:c.1109T>C, encodes a valine-to-alanine substitution within the death domain of EDAR at amino acid position 370 (NP_071731.1:p.(Val370Ala)), hereafter referred to as EDAR:p.(Val370Ala). The derived allele is at very high frequency in northern East Asian and Native American populations, with allele frequencies of up to 90% in some groups [10]. The EDAR:c.1109T>C allele displays clear evidence of positive selection in both haplotype-based and allele frequency spectrum-based analyses [11,12,13]; at least some of this selection presumably occurred in the common ancestors of modern East Asian and Native American populations. EDAR:p.(Val370Ala) has been shown to increase the activation of NF-κB compared with that of the protein encoded by the ancestral allele (EDAR370Val) in vitro using reporter assays [14, 15], and to ameliorate the clinical signs of HED caused by hypomorphic EDA mutations in heterozygous carriers of EDAR:p.(Val370Ala) [16], strongly indicating that the derived allele is a gain-of-function allele. The physiological consequences of this increased signalling have been assessed in mouse models, either with multiple copies of EDAR to increase expression level and signalling, or through engineering of the EDAR:c.1109T>C variant in mice [15, 17, 18]. Both of these models were observed to have thicker hair fibres, and human association studies have shown that EDAR:c.1109T>C is associated with thicker, straighter scalp hair, along with other traits such as shovelling of incisors, altered ear and chin shape, and increased fingertip sweat gland density [18,19,20,21,22,23].

Here we identify another SNV in EDAR (rs146567337, NM_022336.4:c.1138A>C), hereafter referred to as EDAR:c.1138A>C, which causes a serine-to-arginine substitution at amino acid position 380 (NP_071731.1:p.(Ser380Arg)), hereafter referred to as EDAR:p.(Ser380Arg). The geographic distribution of the derived allele of this SNV partly overlaps that of the previously characterised EDAR:c.1109T>C (encoding EDAR:p.(Val370Ala)), though at lower frequency and with a more southerly prevalence. The EDAR:p.(Ser380Arg) substitution increases the signalling function of EDAR to a degree similar to that of the EDAR:p.(Val370Ala) substitution, but its genomic context does not show the same signs of strong positive selection in human populations, despite both alleles having approximately the same age [24]. These findings suggest that EDAR:c.1138A>C (EDAR:p.(Ser380Arg)) may influence the same human traits as those associated with EDAR:c.1109T>C (EDAR:p.(Val370Ala)), and that these traits may have been under different selective pressures in different regions of Asia.

Materials and methods

Generation of phylogeographic maps

Maps of the world and of Southeast Asia were generated using MapChart (https://mapchart.net/). The rs3827760 and rs146567337 allele frequencies were gathered from publicly available datasets [10, 25, 26] and plotted for each population.

Determination of archaic human genotypes

We used high coverage Altai Neanderthal (http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/VCF/) [27] and Altai Denisovan (http://cdna.eva.mpg.de/denisova/VCF/hg19_1000g/) [28] genomes to determine rs146567337 state.

Haplotype analysis

VCF files from publicly available datasets [10, 29] were reduced to a 20 kb window surrounding rs146567337. Files were viewed using inPHAP (v1.1) [30] and variants that disagreed with the human reference genome (GRCh37) were mapped for samples containing EDAR:c.1138A>C or EDAR:c.1109T>C. The variants thus mapped combined to give a total of 50 SNVs used for haplotype construction.
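The windowing step can be sketched as follows. This is a minimal illustration in Python, not the procedure used for the analysis: the function name `vcf_window`, the toy records and the coordinates are all hypothetical, and a 20 kb window is taken here to mean 10 kb on either side of the SNV.

```python
from typing import Iterable, Iterator


def vcf_window(lines: Iterable[str], chrom: str, centre: int,
               half_width: int = 10_000) -> Iterator[str]:
    """Yield VCF header lines plus records on `chrom` within centre +/- half_width."""
    lo, hi = centre - half_width, centre + half_width
    for line in lines:
        if line.startswith("#"):
            # Meta-information and column header lines are always kept.
            yield line
            continue
        # CHROM and POS are the first two tab-separated VCF columns.
        fields = line.split("\t", 2)
        if fields[0] == chrom and lo <= int(fields[1]) <= hi:
            yield line


# Toy input with made-up coordinates (not real EDAR positions).
toy_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\n",
    "2\t1000\t.\n",
    "2\t15000\tsnv_of_interest\n",
    "2\t30001\t.\n",
    "3\t15000\t.\n",
]
kept = list(vcf_window(toy_vcf, chrom="2", centre=20_000))
```

For real, indexed VCF files a dedicated tool would normally be preferable to line-by-line filtering; the sketch only makes the window definition explicit.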

A median-joining haplotype network was constructed using NETWORK 5.0 (http://www.fluxus-engineering.com). We used the region ± 10 kb from the SNVs of interest among the individuals included in the Estonian Biocentre Human Genome Diversity Panel dataset [25].

Generation of extended haplotype homozygosity (EHH) and bifurcation plots

EHH plots and bifurcation plots were constructed from a dataset of 10,640 individuals from the Han Chinese population [31] using R package rehh (v2.0.2) [12, 32]. The EHH plot was generated by separating the haplotypes into three categories: ancestral—where the ancestral alleles of both rs3827760 and rs146567337 are present, EDAR:c.1138A>C—where the derived allele of rs146567337 and the ancestral allele of rs3827760 are present, and EDAR:c.1109T>C—where the ancestral allele of rs146567337 and the derived allele of rs3827760 are present. There were 554 double ancestral haplotypes, 433 EDAR:c.1138A>C haplotypes and 20,293 EDAR:c.1109T>C haplotypes.
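The three-way split of haplotypes and the EHH statistic itself can be illustrated with a short sketch. It assumes alleles coded as 0 (ancestral) and 1 (derived) and uses the standard definition of EHH as the probability that two randomly chosen haplotypes carrying the same core allele are identical over the interval from the core to a given marker. The function names and toy haplotypes are hypothetical, and rehh's own implementation may differ in detail.

```python
from collections import Counter
from math import comb
from typing import Sequence


def classify(rs3827760: int, rs146567337: int) -> str:
    """Assign a haplotype to a category from its alleles (0 = ancestral, 1 = derived)."""
    if rs3827760 == 0 and rs146567337 == 0:
        return "ancestral"
    if rs3827760 == 0 and rs146567337 == 1:
        return "EDAR:c.1138A>C"
    if rs3827760 == 1 and rs146567337 == 0:
        return "EDAR:c.1109T>C"
    return "double derived"


def ehh(haplotypes: Sequence[str], core: int, marker: int) -> float:
    """EHH over the markers between index `core` and index `marker` (inclusive)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("EHH needs at least two haplotypes")
    lo, hi = sorted((core, marker))
    # Group haplotypes that are identical across the whole interval.
    groups = Counter(h[lo:hi + 1] for h in haplotypes)
    # Fraction of haplotype pairs that fall in the same group.
    return sum(comb(count, 2) for count in groups.values()) / comb(n, 2)


# Toy haplotypes that all carry the same core allele at index 1.
toy = ["0101", "0101", "0111", "0100"]
ehh_at_core = ehh(toy, core=1, marker=1)
ehh_one_step = ehh(toy, core=1, marker=2)
ehh_two_steps = ehh(toy, core=1, marker=3)
```

As a consistency check on the counts above, 554 + 433 + 20,293 = 21,280 haplotypes, which equals two haplotypes for each of the 10,640 individuals.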

Genome-wide distributions of EHH were generated from analysis of 2 Mb windows seeded using SNVs with derived allele frequencies comparable with rs146567337 and rs3827760 (i.e. transversions ranging between >0.00 and 0.04 and transitions ranging between 0.93 and 0.97, respectively). In total, 1533 transversions and 151 transitions were analysed. EHH distance was calculated from the EHH x-intercepts, with a lower limit of 0.05. The Bash and R scripts used to sample the CONVERGE data [31] are available here: https://doi.org/10.7488/ds/2798.
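One possible reading of the EHH distance measure is sketched below: walk outwards from the core SNV in both directions until EHH falls below the 0.05 limit, and report the physical distance spanned. This interpretation, the function name `ehh_span` and the toy values are assumptions made for illustration; the deposited Bash and R scripts remain the authoritative definition.

```python
from typing import Sequence


def ehh_span(positions: Sequence[int], ehh_values: Sequence[float],
             core_index: int, threshold: float = 0.05) -> int:
    """Distance between the outermost markers, contiguous with the core,
    at which EHH is still at or above `threshold`."""
    left = core_index
    while left > 0 and ehh_values[left - 1] >= threshold:
        left -= 1
    right = core_index
    while right < len(positions) - 1 and ehh_values[right + 1] >= threshold:
        right += 1
    return positions[right] - positions[left]


# Toy data: invented positions (bp) and EHH values, core SNV at index 3.
toy_positions = [100, 200, 300, 400, 500, 600, 700]
toy_ehh = [0.01, 0.04, 0.30, 1.00, 0.60, 0.05, 0.02]
span = ehh_span(toy_positions, toy_ehh, core_index=3)
```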

Sequence alignment and protein structure modelling

EDAR death domain sequence alignment was generated with the T-coffee alignment tool [33] using peptide sequences gathered from the NCBI protein database (GenInfo Identifiers: human, 11641231; mouse, 6753714; chicken, 60302666; zebrafish, 924859488; Xenopus, 55742031). Sequence conservation was visualised using BOXSHADE v3.21. The EDAR death domain structure was generated in the intensive mode of Phyre2 v2.0 [34] using positions 345-431 of human EDAR.

Transfection of cells and luciferase assays

HEK293T and HaCaT cells were maintained at 37 °C in 5% CO2 in high glucose Dulbecco’s modified Eagle’s medium (DMEM) (Sigma-Aldrich, St. Louis, MO, USA) supplemented with 10% foetal bovine serum (FBS), 50 µg/ml streptomycin and 100 U/ml penicillin (Thermo Fisher Scientific, Waltham, MA, USA). Transfections of HEK293T and HaCaT cells were performed using Lipofectamine 3000 (Thermo Fisher Scientific) in 24-well plates (well surface area: 1.9 cm²). Cells were seeded at a density of 5 × 10⁴ 24 h prior to transfection. Each well was transfected with plasmid DNA mix in Opti-MEM (Thermo Fisher Scientific), consisting of 125 ng pNFκB-luc, 62.5 ng pRLTK, 10 ng pCR3::EDAR expression vector (different variants) and made up to a total of 500 ng with empty pCR3.1 vector. Transfections were performed according to the manufacturer’s instructions in DMEM supplemented with 10% FBS, 50 µg/ml streptomycin and 100 U/ml penicillin. Luciferase assays were performed 18 h post transfection using the Dual-Luciferase Assay System (Promega, Madison, WI, USA) according to the manufacturer’s instructions.
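The amount of empty vector implied by the DNA mix above follows from simple arithmetic, shown here as a small Python sketch. The helper name `empty_vector_ng` is hypothetical; the component masses are those stated in the text.

```python
def empty_vector_ng(components_ng: dict[str, float], total_ng: float = 500.0) -> float:
    """Mass of empty vector needed to bring the DNA mix up to `total_ng`."""
    used = sum(components_ng.values())
    if used > total_ng:
        raise ValueError("components already exceed the target total")
    return total_ng - used


# Per-well plasmid masses from the text (ng).
mix = {"pNFkB-luc": 125.0, "pRLTK": 62.5, "pCR3::EDAR": 10.0}
filler = empty_vector_ng(mix)  # 500 - 197.5 = 302.5 ng of empty pCR3.1
```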

Results

The death domain of EDAR is a highly conserved region of the protein, with variants altering this domain commonly leading to a loss of function and thus clinically diagnosed HED [35], presumably due to altered or abrogated EDAR interaction with EDARADD [36]. We identified SNV rs146567337 (EDAR:c.1138A>C) in EDAR in the gnomAD database (https://gnomad.broadinstitute.org/) [37]. The derived allele encodes a serine-to-arginine substitution at the highly conserved amino acid 380 (EDAR:p.(Ser380Arg)), only ten amino acids from the alteration in the well-characterised EDAR:p.(Val370Ala) variant (Fig. 1a). In the gnomAD database [37], the frequency of the derived allele at rs146567337 was 1.85%. Using publicly available datasets [10, 25, 26, 29], we found EDAR:c.1138A>C only in East and Southeast Asian populations, at highest frequency in southern China, Vietnam, the Philippines, Malaysia and Indonesia. However, the distribution of this allele did not extend further south and east into New Guinean populations (Fig. 1b, c). Since EDAR:c.1109T>C is at very high frequency in many populations with appreciable frequencies of EDAR:c.1138A>C, we assessed whether EDAR:c.1138A>C and EDAR:c.1109T>C appear on the same haplotype. Using the same datasets, we analysed haplotypes spanning a 20 kb window surrounding rs146567337, on which EDAR:c.1138A>C is present, and found only one occurrence, out of 33 assessed EDAR:c.1138A>C haplotypes from 5608 evaluated chromosomes, where EDAR:c.1109T>C and EDAR:c.1138A>C co-existed on the same haplotype. This singular occurrence, possibly a genotyping or phasing error, suggests that the derived allele of rs146567337 arose on a different haplotype to that of EDAR:c.1109T>C (Fig. 1d). We also found the entire haplotype context of the EDAR:c.1138A>C allele, but with the ancestral allele at this SNV, likely representing the immediately ancestral haplotype to EDAR:c.1138A>C. This immediately ancestral haplotype has a worldwide population distribution. 
The variant was also found to be ancestral in both the Altai Neanderthal [27] and Altai Denisovan [28] genomes, and was not inferred to be in archaic introgressed haplotypes identified in either the Simons Genome Diversity Project [29] or Indonesian Genome Diversity Project data [26]. Hence, we infer that the EDAR:c.1138A>C variant arose in modern humans rather than through introgression from an archaic human population.

Fig. 1: Conservation, distribution and haplotype structure of EDAR variants.

a Multiple sequence alignment of vertebrate EDAR death domains. Amino acid positions within the EDAR protein are numbered at the start and end of the sequence for each species. The position of 370 is indicated by a blue triangle, the position of 380 by a red triangle. The positions of known recessive and dominant mutations causing hypohidrotic ectodermal dysplasia in humans are indicated by black and orange squares, respectively, above the alignment. Purple bars below the alignment indicate the positions of the predicted alpha helices. b Worldwide allele frequencies for EDAR:c.1109T>C and EDAR:c.1138A>C in the 1000 Genomes dataset plotted as pie charts for each population. The remaining allele frequency was depicted as ancestral. c EDAR:c.1109T>C and EDAR:c.1138A>C allele frequencies in the Southeast Asian Island populations were gathered from publicly available datasets [25, 26] and plotted on a map of Southeast Asia as in b. The area of each chart is proportional to the sample number of each population. d Diagram of the EDAR:c.1138A>C haplotypes from the 1000 Genomes Project and Simons Genome Diversity Project datasets plotted for a region 10 kb upstream and 10 kb downstream of rs146567337. White boxes indicate alleles matching the reference human genome (GRCh37) and grey boxes indicate the presence of the alternate allele. Red shading indicates the EDAR 1138C allele and blue shading indicates EDAR 1109C. In total, 33 EDAR:c.1138A>C haplotypes were present in these datasets, with five unique haplotype structures identified. These were ranked in order of frequency, as shown by the percentages to the right of each haplotype, with the total number of each individual haplotype in the dataset indicated in brackets. The scale bar indicates position on chromosome 2.

To further define the distribution of EDAR haplotypes, we constructed a median-joining haplotype network consisting of 142 SNPs spanning about ±10 kb around SNVs of interest (rs3827760 and rs146567337) using a publicly available dataset (Fig. 2) [25]. We used the HapMap combined genetic map [38] to confirm that the window did not have especially fast recombination likely to disrupt the network reconstruction (0.0797 cM; 44th percentile of total genetic map distance in non-overlapping genome wide 114.9 kb windows). The network identified 88 haplotypes and further supports the independent origins of EDAR:c.1138A>C and EDAR:c.1109T>C. The haplotype associated with EDAR:c.1109T>C is mainly composed of individuals from East Asia, Siberia, Southeast Asia Island and mainland populations and the Americas (Fig. 2). Individuals from Siberia (denoted by cyan) represent almost 50% of this haplotype in this population sample. The dataset used for the construction of this network sampled fewer East Asian individuals (n = 11) than Siberians (n = 108), thus explaining the greater proportion of the latter with the associated haplotype [25]. We also observed that the EDAR:c.1109T>C associated haplotype demonstrates a star-like pattern, suggestive of a demographic expansion and corroborating earlier evidence of positive selection at this locus. In contrast, the haplotype associated with EDAR:c.1138A>C was found to be distant from EDAR:c.1109T>C and showed more restricted geographic distribution, confined to individuals mainly from the islands of Southeast Asia and one individual from South Asia.

Fig. 2: Median-joining haplotype network of EDAR.

Median-joining haplotype network spanning ±10 kb around rs3827760 and rs146567337 showing the relationship of the haplotypes. The network is based on 446 individuals included in the Estonian Biocentre Human Genome Diversity Panel dataset [25]. Each pie chart represents a unique haplotype and the size of the chart is proportional to the number of chromosomes carrying it. Colours represent the geographic location of populations where each haplotype was found. Lines represent variants, with greater branch length indicating a greater number of distinguishing variants. The associated haplotypes of interest (for EDAR:c.1109T>C and EDAR:c.1138A>C) have been labelled. The black arrows represent locations where an event resulting in the EDAR:c.1109T>C variant was inferred. The multiple arrows likely reflect ambiguity in the network reconstruction.

As EDAR:c.1109T>C displays very strong evidence for positive selection [11,12,13], we tested for indications of selection on EDAR:c.1138A>C. Using a large-scale whole genome sequence dataset of the Han Chinese population [31], we constructed EHH plots of 433 EDAR:c.1138A>C haplotypes (derived rs146567337, ancestral rs3827760) and 20,293 EDAR:c.1109T>C haplotypes (derived rs3827760, ancestral rs146567337) against 554 double ancestral haplotypes (haplotypes bearing the ancestral alleles for both rs146567337 and rs3827760) (Fig. 3a). No double derived allele haplotypes were found in this Han Chinese dataset. As expected for loci that underwent selection, and as demonstrated previously [11, 12], EDAR:c.1109T>C shows a broad region of haplotype homozygosity compared with the double ancestral haplotype. EDAR:c.1138A>C exhibits much less EHH than EDAR:c.1109T>C, suggesting that EDAR:c.1138A>C has not been subjected to the same pressures or degree of selection as EDAR:c.1109T>C. The Han Chinese dataset included 433 EDAR:c.1138A>C (EDAR:p.(Ser380Arg)) haplotypes, therefore we constructed EHH bifurcation plots by random subsampling of 433 haplotypes from double ancestral allele and EDAR:c.1109T>C (EDAR:p.(Val370Ala)) haplotypes. The bifurcation plots confirmed that the EDAR:c.1138A>C haplotype had been reduced by recombination less frequently than the double ancestral haplotype, but more frequently than the EDAR:c.1109T>C haplotype (Fig. 3b).

Fig. 3: Extended haplotype homozygosity (EHH) and EHH bifurcation plots surrounding EDAR variants.

a EHH plot showing the length of conserved haplotype on either side of rs146567337. An EHH value of 1 indicates that haplotypes are identical at this position. Double ancestral haplotypes are represented by the black dotted line, EDAR:c.1138A>C (EDAR:p.(Ser380Arg)) haplotypes are represented by the red line, and EDAR:c.1109T>C (EDAR:p.(Val370Ala)) haplotypes are represented by the blue line. b Bifurcation plot showing the branching of each haplotype. Thicker lines indicate more common haplotypes. Double ancestral haplotypes are represented by the black line, EDAR:c.1138A>C haplotypes are represented by the red line and EDAR:c.1109T>C haplotypes are represented by the blue line. c Distribution of autosomal EHH x-intercept distances of 1533 alleles with a derived allele frequency ranging from >0.00 to 0.04. The red dashed line indicates the x-intercept value of EDAR:c.1138A>C, which lies at the 70th percentile of EHH values. d Distribution of autosomal EHH x-intercept distances of 151 alleles with a derived allele frequency ranging from 0.93 to 0.97. The blue dashed line indicates the x-intercept value of EDAR:c.1109T>C, which lies at the 100th percentile of EHH values.

We also compared the EHH scores of EDAR:c.1138A>C and EDAR:c.1109T>C with those of alleles elsewhere in the genome, matched for derived allele frequency and mutation type. In total, 1533 transversions (allele frequency > 0.00–0.04) were selected to compare against EDAR:c.1138A>C. Similarly, 151 transitions (allele frequency = 0.93–0.97) were selected to compare with EDAR:c.1109T>C. From this analysis, EDAR:c.1138A>C ranked slightly to the right of the middle of the EHH distribution (70th percentile) (Fig. 3c). In contrast, EHH associated with the EDAR:c.1109T>C allele ranked highest of all SNVs assessed (Fig. 3d). These findings further support the idea that EDAR:c.1138A>C has not been under detectable selection, while EDAR:c.1109T>C has undergone strong positive selection.
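Ranking a variant against a frequency-matched background amounts to a percentile calculation, sketched below. The function name `percentile_rank` and the background values are invented for illustration, and the convention used here (share of background values less than or equal to the focal value) is an assumption, since the exact convention is not stated.

```python
from typing import Sequence


def percentile_rank(value: float, background: Sequence[float]) -> float:
    """Percentage of background values that are less than or equal to `value`."""
    if not background:
        raise ValueError("background distribution is empty")
    return 100.0 * sum(b <= value for b in background) / len(background)


# Invented EHH distances for ten background SNVs (arbitrary units).
toy_background = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mid_rank = percentile_rank(7, toy_background)
top_rank = percentile_rank(10, toy_background)
```

Under this convention a variant with the largest EHH distance in its matched set sits at the 100th percentile, as reported for EDAR:c.1109T>C.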

After determining the global distribution and genomic context of EDAR:c.1138A>C, we next investigated the effect of the substitution on the encoded protein. To map the position of EDAR:p.(Ser380Arg) and identify any predicted structural effects of this amino acid substitution, we modelled the variant EDAR death domain structures [34]. The resulting predicted protein structure positioned amino acid EDAR380 within an alpha helix (Fig. 4a), a structural feature known to be important for the protein–protein interactions mediated by death domains [9]. However, the alternate amino acid variants did not alter the predicted structure of this helix or any other part of the death domain. The protein structure also remained unaltered when we modelled the EDAR:p.(Val370Ala) substitution (Fig. 4a). Based on the conservation of the serine residue at position EDAR380 among vertebrates (Fig. 1a), introduction of a positive charge through its substitution to arginine and strong evidence of functional alteration reflected from SIFT (score 0) [39] and PolyPhen (score 0.999) [40], we predicted that the EDAR:p.(Ser380Arg) substitution would alter EDAR protein function. To test this, we transfected HEK293T cells, a human cell line derived from embryonic kidney, with EDAR cDNAs encoding either the ancestral EDAR, EDAR:p.(Val370Ala), EDAR:p.(Ser380Arg) or the double substituted EDAR:p.[(Val370Ala;Ser380Arg)] protein. We also included the known loss-of-function variant EDAR:p.(Glu379Lys), a mutation that is dominant for selective tooth agenesis in humans and recessive for HED in mice [1, 7], as a control in these experiments. Each form was assayed for its ability to activate a co-transfected NF-κB luciferase reporter. We found that EDAR:p.(Ser380Arg) activated NF-κB in these cells to a greater degree than ancestral EDAR, and to the same extent as EDAR:p.(Val370Ala). 
Generation of an EDAR variant carrying both the Val370Ala and Ser380Arg amino acid substitutions led to a level of NF-κB activity comparable with or slightly greater than that caused by either single substitution (Fig. 4b). These effects on signalling activity were broadly confirmed in the human HaCaT cell line, derived from the skin’s epidermis (Fig. 4c). The different fold changes observed in EDAR-stimulated NF-κB activity between HEK293T cells and HaCaT cells may result from different basal levels of NF-κB activity in these cell lines, and from different levels of expression of components of the EDAR signal transduction pathway.
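Fold changes in a dual-luciferase experiment are conventionally obtained by normalising the firefly reporter signal to the co-transfected Renilla control and then dividing by a reference condition. The sketch below illustrates that convention only: the choice of reference condition is an assumption, the function names are hypothetical and the readings are invented numbers, not data from this study.

```python
from statistics import mean
from typing import Sequence


def normalised(firefly: Sequence[float], renilla: Sequence[float]) -> list[float]:
    """Per-well firefly/Renilla ratios, correcting for transfection efficiency."""
    if len(firefly) != len(renilla):
        raise ValueError("firefly and Renilla readings must be paired per well")
    return [f / r for f, r in zip(firefly, renilla)]


def fold_change(sample: Sequence[float], reference: Sequence[float]) -> float:
    """Mean normalised signal of a sample relative to a reference condition."""
    return mean(sample) / mean(reference)


# Invented luminometer readings for two wells per condition.
reference_ratios = normalised([100.0, 120.0], [50.0, 60.0])   # [2.0, 2.0]
variant_ratios = normalised([400.0, 360.0], [50.0, 60.0])     # [8.0, 6.0]
fold = fold_change(variant_ratios, reference_ratios)          # 7.0 / 2.0 = 3.5
```

Because the ratio depends on the basal signal of the reference wells, the same receptor variant can yield different fold changes in different cell lines.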

Fig. 4: Functional effects of EDAR variants.

a The predicted protein structures of the ancestral EDAR, EDAR:p.(Val370Ala) and EDAR:p.(Ser380Arg) death domains were modelled using Phyre2. The locations of amino acid positions 370 and 380 are indicated by arrows. Amino acid position 380 is located towards the end of an alpha helix. b HEK293T and c HaCaT cells were transfected to express EDAR variants and the resulting NF-κB luciferase reporter activity was determined. Error bars represent the standard error of the mean from experiments performed in quadruplicate and repeated independently six times. Statistical significance was calculated using a Student’s unpaired t test (**P < 0.005, ***P < 0.0005).

Discussion

We characterised a novel functional variant in EDAR. We find that EDAR:c.1138A>C is at highest frequency in Southeast Asia and appears to have arisen on a different haplotype to that of the more common and previously characterised EDAR:c.1109T>C substitution. We find that EDAR:c.1138A>C does not show the same signs of having been under strong positive selection as EDAR:c.1109T>C, but that the encoded protein increases NF-κB activation in vitro, to approximately the same extent as the EDAR:c.1109T>C substitution. As there are no known splice variants of EDAR that lack the death domain-encoding exon, and as the EDAR:c.1138A>C variant lies within the death domain, which is essential for protein function, the change in receptor activity detected in this assay suggests that EDAR:c.1138A>C is likely to have phenotypic effects in vivo in the same direction as those observed with the EDAR:c.1109T>C allele.

Several theories have been advanced as to the selective advantage conferred by EDAR:p.(Val370Ala). Chang et al. suggested that EDAR:p.(Val370Ala) was selected for in the ancestors of East Asians and Native Americans for adaptation to a cold and dry climate, in which increased skin-associated glands and resulting glandular secretions, perhaps together with straighter hair, could be advantageous in producing a functional barrier to the environment [17]. Hlusko et al. suggested a latitude-based adaptive scenario, in which altered transfer of nutrients, particularly vitamin D, through breast milk in far northeast Asia [41] was caused by the mammary gland alterations enacted by enhanced EDAR signalling [17, 18]. Kamberov et al. placed the origin of the EDAR:p.(Val370Ala) encoding allele in central China at greater than 30,000 years ago, and suggested increased eccrine sweat gland number, associated with this variant in mouse and human in their study, as one of the potential selective forces that would have been advantageous in the hot and humid climate there [18].

A recent genealogical estimation of allele ages in the human genome assessed the derived alleles EDAR:c.1109T>C and EDAR:c.1138A>C as having a similar date of origin, at ~1400 generations ago [24]. The geographic distribution of these alleles is somewhat similar, and, though at much lower frequency than EDAR:c.1109T>C in all regions, the highest frequencies of EDAR:c.1138A>C overlap the more southerly regions in which EDAR:c.1109T>C is prevalent. The EDAR:c.1138A>C variant is notably absent from the Americas, where in native populations EDAR:c.1109T>C is essentially at fixation [14]. We found that EDAR:c.1138A>C, unlike EDAR:c.1109T>C, does not show a strong signal of positive selection in human populations, despite cell culture experiments predicting similar outcomes resulting from these two substitutions. The frequency of the EDAR:c.1138A>C variant peaks in Southeast Asia and it is thus most likely to have arisen in that region. The EDAR:c.1109T>C variant appears to have arisen further north, based on its present-day population distribution and ancient DNA analyses [42, 43], suggesting that phenotypes associated with an EDAR-dependent increase in NF-κB activation have been preferentially selected for in more northern regions of Asia.

The possibility that EDAR:p.(Val370Ala) and EDAR:p.(Ser380Arg) may have similar phenotypic effects should be considered in future gene or genome-wide association studies, particularly in populations in which the derived allele of rs3827760 is at high frequency. In these populations, only a small fraction of ancestral rs3827760 alleles exist and a sizeable proportion of these haplotypes will carry the derived rs146567337 allele, which could obscure the phenotypic associations with the derived allele of rs3827760. The EDAR:c.1138A>C allele has been identified as one of many candidate alleles in people exhibiting tooth agenesis in East Asian populations [44, 45]. However, our data suggest that this allele is unlikely to be causative for the condition due to its increased, rather than decreased, activity, as observed for EDAR:c.1109T>C. Tooth agenesis has also been reported in some patients carrying the EDAR:c.1109T>C allele [46], demonstrating that increased EDAR activity is not necessarily sufficient to mitigate the effects of mutations in other genes or pathways that cause this condition.

The discovery of a second SNV in EDAR that increases NF-κB activation to the same extent as EDAR:c.1109T>C raises questions as to how many routes there are to achieving the same molecular effect of increased EDAR activity. Multiple variants have also been identified in an enhancer region upstream of the LCT gene that have the same molecular effect of increasing LCT transcription [47]. However, some of these LCT SNVs exhibited clear EHH, suggesting that this molecular effect was selected for in each of the populations containing these variants. In this case, EDAR:c.1138A>C does not show the same extent of EHH as EDAR:c.1109T>C, even though the alleles are predicted to be of a similar age, and both therefore should have had the opportunity to be selected. This indicates that either the molecular consequences of EDAR:c.1109T>C are more complex in vivo than reflected in cell signalling assays or that the phenotypic consequences of enhanced EDAR signalling have only been strongly selected for in northern East Asian populations. This work highlights that exploring the population genetics of variants with similar molecular phenotypes as known selected variants could prove beneficial in the future for refining the features of those variants, and the relevant environments, that led to their selection.