Although pigmentation varies globally, it has been more thoroughly studied and is therefore better understood in European populations. This has led to a research gap, especially in East-Asian populations. The OCA2 gene, which is thought to be responsible for maintaining pH levels within melanosomes,1 has been shown to be under positive selection in both European and East-Asian populations.2,3 However, the variants and haplotypes favored by selection are different in each population.2,4–6 For example, a variant located within the HERC2 gene is known to affect the expression of the nearby OCA2 gene, and it is strongly associated with blue eyes in European populations.7–9 The HERC2 rs12913832 allele associated with blue eyes has a high frequency in Europe but is not present in East-Asian populations.7–9 In addition, two non-synonymous polymorphisms, rs1800414 and rs74653330, have been associated with pigmentation in East Asians5,10,11 and are not found at high frequencies in any population outside of East Asia.12 It has been suggested that the phenotype of lighter skin is a result of convergent evolution in Europe and East Asia.2,6,13

Available population data indicate that the rs1800414 and rs74653330 polymorphisms show a distinct geographical distribution. The highest frequencies of the derived rs1800414 G allele are found in Japan, China and Korea, whereas the derived rs74653330 A allele has the highest frequencies in northern East Asia, including Mongolia.12,14

In this report, we provide further data on the global distribution of rs1800414 and rs74653330, with a primary focus on the allelic frequencies observed in East Asia. Briefly, the two polymorphisms were genotyped in the Human Genome Diversity Project–Centre d’Étude du Polymorphisme Humain (HGDP–CEPH) samples (http://www.cephb.fr/en/hgdp_panel.php) by LGC Genomics (Beverly, MA, USA) using KASP genotyping technology. The HGDP–CEPH panel includes samples from more than 1,000 individuals from 52 populations around the world. Supplementary Table 1 shows the allelic frequencies of both markers in the HGDP–CEPH panel. In agreement with previous data, both polymorphisms are primarily restricted to East-Asian populations. The derived rs1800414 G allele has a broad distribution in East Asia, with the highest frequencies observed in the Japanese population (79%) and several populations from China (Dai, Miaozu, Han, Hezhen, Tujia and Xibo, with frequencies between 50 and 65%). In contrast, the distribution of the derived rs74653330 A allele is more restricted, with the highest frequencies found in Altaic-speaking populations from northern East Asia and Mongolia, such as the Yakut from Siberia (36%), the Daur (33%), the Oroqen (28%), the Hezhen (22%) and the Mongola (20%). Figure 1 shows a map of East Asia with the frequencies of both polymorphisms. The derived rs1800414 G and rs74653330 A alleles are not present in any of the samples from Africa, the Middle East or Oceania. In the Americas, the rs1800414 G allele is also absent, and one Maya individual is heterozygous for rs74653330. Both derived alleles are present at very low frequencies in Central–South Asia (rs1800414 G: 4.4%; rs74653330 A: 2.1%) and Europe (rs1800414 G: 0.3%; rs74653330 A: 1%). Within Central–South Asia, the derived alleles are primarily present in the Hazara (Pakistan) and Uygur (China). Within Europe, the derived alleles are observed only in Russia. The presence of the two derived alleles in some of the populations from Central–South Asia and Europe seems to be the consequence of gene flow from East-Asian groups.
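As an illustration of how a derived allele frequency such as those reported above can be obtained from genotype calls, the following minimal Python sketch counts alleles across individuals. The function name `derived_allele_frequency` and the genotype list are hypothetical and are not data from this study; missing calls are assumed to be encoded as `None`.

```python
from collections import Counter


def derived_allele_frequency(genotypes, derived):
    """Return the frequency of `derived` among all called alleles.

    `genotypes` is a list of two-character strings such as "AG";
    missing calls (None) are skipped.
    """
    counts = Counter()
    for genotype in genotypes:
        if genotype is None:
            continue
        counts.update(genotype)  # adds one count per allele character
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no called genotypes")
    return counts[derived] / total


# Hypothetical genotype calls for ten individuals (not data from the study).
example = ["GG", "AG", "GG", "AG", "AA", "GG", None, "AG", "GG", "GG"]
freq = derived_allele_frequency(example, "G")
print(f"derived G allele frequency: {freq:.3f}")
```

In this toy example, nine individuals have called genotypes, so the denominator is 18 alleles rather than 20.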

Figure 1

Distribution of allele frequencies for SNPs rs1800414 (blue) and rs74653330 (orange) in East-Asian populations: (1) Dai; (2) Daur; (3) Han; (4) Hezhen; (5) Japanese; (6) Lahu; (7) Miaozu; (8) Mongola; (9) Naxi; (10) Oroqen; (11) She; (12) Tu; (13) Tujia; (14) Uygur; (15) Xibo; (16) Yakut; (17) Yizu; and (18) Cambodia.

It is interesting to note that the frequency distribution of the rs74653330 A allele reflects the present genome-wide genetic structure of East Asia. We used the program PLINK15 to perform principal component analysis (PCA) of the East-Asian HGDP–CEPH populations using genome-wide data (Affymetrix Axiom Human Origins Array) available on the HGDP–CEPH website (http://www.cephb.fr/en/hgdp_panel.php). We pruned SNPs based on linkage disequilibrium (LD) and removed five known regions of long-range LD. Figure 2 shows the first two axes of the PCA, visualized with the program PAST (http://folk.uio.no/ohammer/past/). There is a clear geographic pattern, with the northern populations (Yakut, Oroqen, Mongola, Daur and Hezhen) located on the left side of the plot. As described above, it is precisely in these populations that the highest frequencies of the derived rs74653330 A allele are observed.
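The idea behind LD-based pruning before PCA can be sketched as follows. This is a simplified, all-pairs illustration in plain Python and not the PLINK procedure used here, which operates on sliding windows of SNPs. The r² threshold of 0.5, the SNP names and the dosage vectors are all assumptions made for the example, since the text does not report the pruning parameters.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as 0
    return sxy * sxy / (sxx * syy)


def ld_prune(dosages, threshold=0.5):
    """Greedily keep SNPs whose r^2 with every kept SNP is <= threshold."""
    kept = []
    for name, vector in dosages.items():
        if all(r_squared(vector, dosages[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical dosages (0, 1 or 2 copies of one allele) for six individuals.
toy = {
    "snp_a": [0, 1, 2, 2, 1, 0],
    "snp_b": [0, 1, 2, 2, 1, 0],  # identical to snp_a, so r^2 = 1
    "snp_c": [2, 0, 1, 0, 2, 1],
}
pruned = ld_prune(toy)
print(pruned)
```

Here `snp_b` is dropped because it is in complete LD with `snp_a`, whereas `snp_c` is retained. Removing such redundant markers keeps blocks of correlated SNPs from dominating the principal components.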

Figure 2

PCA (axes 1 and 2) showing population structure of East-Asian populations from the HGDP–CEPH panel.

We explored the haplotype structure of the OCA2 region in East Asia in detail. To do this, we merged the genotype data of the two markers of interest with the Affymetrix Human Origins data set for chromosome 15 plus the Illumina (San Diego, CA, USA) 650K data set for chromosome 15. The OCA2 region was extracted from this data set by selecting markers on chromosome 15 between positions 25 and 26.5 Mb. On the basis of the north–south geographical gradient observed in the PCA output as well as the geographic distribution of the two polymorphisms, the haplotype analysis of East Asia was carried out separately in northern East Asia and the rest of East Asia. The northern grouping comprised the Yakut from Siberia and the Oroqen, Mongola, Daur and Hezhen from northern China. The haplotype analyses were performed with the program Haploview.16 Figure 3 shows the haplotype structure surrounding the rs1800414 and rs74653330 polymorphisms. The two non-synonymous polymorphisms are located in the same LD block, but they are always found in different haplotypes. The haplotype analysis suggests that the haplotypes carrying the derived alleles for each polymorphism arose independently from the same ancestral haplotype. Using the markers rs7170451–rs1800414–rs728405–rs728404–rs4778214–rs1448488–rs12903382–rs74653330–rs12910433–rs3794609–rs730502 to define the haplotype block (rs1800414 and rs74653330 are the two non-synonymous polymorphisms of interest), our results indicate that, from the ancestral haplotype ‘AAGAGCAGGTT’, a non-synonymous mutation at rs1800414 gave rise to the haplotype ‘AGGAGCAGGTT’, and a second non-synonymous mutation, at rs74653330, independently gave rise to the haplotype ‘AAGAGCAAGTT’. Both derived haplotypes then increased in frequency in different regions of East Asia. The haplotype ‘AGGAGCAGGTT’ is now the most common haplotype in a broad region of East Asia, whereas the haplotype ‘AAGAGCAAGTT’ has become the most prevalent in northern East Asia. 
Several lines of evidence indicate that this increase in frequency may have been the result of positive selection favoring light skin in high-latitude regions. Both derived alleles are non-synonymous variants predicted to have a functional effect,11 and both have been associated with lighter skin pigmentation in East-Asian populations.5,10,11 In addition, several studies have identified signatures of positive selection in the OCA2 region in genome-wide scans in East-Asian populations.2,3 The geographic distribution of the variants strongly suggests that these two mutations arose after the separation of European and East-Asian populations. This is supported by a recent study that dated the derived G allele of the OCA2 rs1800414 polymorphism to ~10,000 years ago.17 To our knowledge, there has been no attempt to date the polymorphism rs74653330. We used the dense, genome-wide SNP data available for the HGDP–CEPH panel to estimate the ages of the derived alleles at rs1800414 and rs74653330 in East-Asian populations. We used a method18 that relies on the decay of haplotype sharing of the ancestral genomic segment on which the derived mutations occurred. Before the analysis, we removed individuals with pi-hat values exceeding 0.05 to minimize potential problems with cryptic relatedness. To account for the possibility that members of individual populations may have a most recent common ancestor (MRCA) that is more recent than the MRCA of the entire East-Asian sample, we calculated these age estimates assuming a correlated genealogy.18 Under these conditions, and assuming a generation time of 29 years,19 we estimated the age of the derived allele at rs74653330 to be 6,835 years (95% confidence interval (CI): 1,070–12,798). The estimated age of the derived allele at rs1800414 is quite similar at 6,397 years (95% CI: 1,183–11,446 years). 
Our estimate for rs1800414 is slightly younger than a previous estimate of the age of the derived allele obtained with a different method (10,660 years; 95% CI: 8,070–15,780),17 although the CI of our estimate includes Chen’s point estimate. The discrepancy in age may be explained by differences between the two methods as well as by differences in the East-Asian populations and the data sets used in each study.
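The relationship between the age estimates above can be checked with simple arithmetic. The short Python sketch below converts the reported ages into generations, using the 29-year generation time stated in the text, and tests whether the earlier point estimate for rs1800414 lies inside our 95% CI. It does not implement the dating method itself; the variable names are chosen for illustration only.

```python
GENERATION_YEARS = 29  # generation time assumed in the text

# Point estimates and 95% CI bounds in years, as reported in the text.
estimates = {
    "rs74653330": (6835, 1070, 12798),
    "rs1800414": (6397, 1183, 11446),
}
previous_rs1800414 = 10660  # earlier point estimate cited in the text


def to_generations(years, generation_years=GENERATION_YEARS):
    """Convert an age in years into a number of generations."""
    return years / generation_years


generations = {
    snp: round(to_generations(point), 1)
    for snp, (point, _low, _high) in estimates.items()
}

_point, low, high = estimates["rs1800414"]
previous_within_ci = low <= previous_rs1800414 <= high

print(generations)
print(previous_within_ci)
```

Under these figures, both alleles are roughly 220 to 236 generations old, and the earlier point estimate of 10,660 years falls within the 1,183–11,446 year interval.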

Figure 3

Haplotype block structure and pattern of LD of the OCA2 region including markers rs1800414 (marker 424) and rs74653330 (marker 432). (a) Northern East Asia; (b) the rest of East Asia.
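The relationship between the ancestral haplotype and the two derived haplotypes described in the text can be verified position by position. The following Python sketch aligns each 11-marker haplotype string with the marker order given in the text and reports where it differs from the ancestral haplotype. The helper name `differences` is hypothetical.

```python
# Marker order defining the haplotype block, as listed in the text.
MARKERS = [
    "rs7170451", "rs1800414", "rs728405", "rs728404", "rs4778214",
    "rs1448488", "rs12903382", "rs74653330", "rs12910433", "rs3794609",
    "rs730502",
]

ANCESTRAL = "AAGAGCAGGTT"


def differences(haplotype, reference=ANCESTRAL, markers=MARKERS):
    """List (marker, reference allele, haplotype allele) where they differ."""
    if not (len(haplotype) == len(reference) == len(markers)):
        raise ValueError("haplotype, reference and marker list must align")
    return [
        (marker, ref, alt)
        for marker, ref, alt in zip(markers, reference, haplotype)
        if ref != alt
    ]


diff_common = differences("AGGAGCAGGTT")    # most common in broad East Asia
diff_northern = differences("AAGAGCAAGTT")  # most prevalent in the north

print(diff_common)
print(diff_northern)
```

Each derived haplotype differs from the ancestral one at a single position: an A-to-G change at rs1800414 and a G-to-A change at rs74653330, respectively. This is consistent with two independent mutations on the same ancestral background.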

Recent ancient DNA studies, which have characterized dense genomic data in Eurasian individuals spanning a broad archaeological period (e.g., from hunter-gatherers to individuals living in the Bronze Age), have provided important information about the temporal distribution of genetic markers associated with pigmentation variation in Europe and have strengthened the case for selection operating on pigmentation-related genes in this region.20–22 Similar studies in East Asia have the potential to clarify the major events that have shaped the interesting distribution of the two non-synonymous variants of the OCA2 gene in this vast area. In this respect, it will be important to consider not only potential selective effects but also the major population movements that have taken place in this region during the past 15,000 years.