Introduction

The Japanese population has a relatively low genetic diversity,1 which was one of the reasons for the early success of genome-wide association studies (GWASs).2, 3 However, the results of several other studies on genetic variations in the Japanese population, which examined mitochondrial DNA-sequence variation,4, 5 polymorphic markers on the Y chromosome5 or some polymorphic loci in autosomes,6, 7 support the hypothesis that the Japanese population has a ‘dual structure’ and that immigrants came to Japan in at least two major migrations.8 These studies revealed that the Japanese population has three main genetic groups: the Hondo Japanese, who live on the main islands of Japan; the Ryukyu Japanese, who live on the Ryukyu Islands; and the Ainu Japanese, who are the indigenous people of Hokkaido, although their current population in Hokkaido is small (0.5%). Furthermore, geographic clines of haplotype frequencies were found in the human leukocyte antigen (HLA) region and several loci for blood types.8

In a principal component analysis of genome-wide single-nucleotide polymorphism (SNP) genotypes in the Japanese population, we showed that most Japanese individuals fell into two main clusters (the Hondo and Ryukyu clusters).9 Furthermore, genetic differentiation was observed among different regions in the Hondo people. Although the SNPs that are most differentiated between the Hondo and Ryukyu peoples have been identified,9, 10 a more detailed study of genetic differentiation between the two main clusters is desired for three reasons. First, understanding population structure is essential for the design of GWASs,11, 12, 13 which are powerful tools for identifying disease-causing genes. To conduct more accurate GWASs of the Japanese population, it is important to know whether the population has a dual structure10 and to ensure that the genetic backgrounds of the case and control samples are not biased. In addition, differentiated SNPs can be used as ancestry-informative markers to determine to which subpopulation each individual belongs.14 Second, differentiated genomic regions are the regions where spurious associations are likely to occur. Therefore, knowledge of the differentiated regions would help to make GWASs more robust. Third, differentiated genes are more likely to be involved in phenotypic variations15 because some of them rapidly change in allele frequency by adaptive evolution. Therefore, highly differentiated genomic regions could be good places to look for phenotype-associated genes. The Hondo and Okinawa peoples have some phenotypic differences, such as in ear wax type and hair thickness.8 We previously showed that a SNP in ABCC11 (ref. 16) and another SNP in EDAR (ref. 17) were the most differentiated nonsynonymous SNPs between the Hondo and Ryukyu clusters known so far.9 Although these results may depend on the set of SNPs selected for genotyping, they suggest that highly differentiated genes are likely to be involved in phenotypic differences. In fact, the EDAR gene was recently shown to also be involved in the morphology of front teeth.18

Another advantage of genome-wide SNP genotype data is that they can be used for haplotype inference.19 Understanding haplotype structure and frequency is important for associating genetic polymorphisms with a given trait and for inferring the genetic genealogy of alleles in a population.20, 21 If a haplotype catalog can be created from genome-wide SNP genotypes, it would be useful for looking at haplotypes at the genomic regions of interest. Furthermore, genome-wide haplotypes would be useful for evaluating genomic diversity of the population and differences between subpopulations. Use of haplotypes, as well as SNP genotypes, may be well suited for identifying genetic differences between closely related subpopulations because a recent recombination may have created new haplotypes that may result in a genetic difference between the two subpopulations. In contrast, analyses of common SNPs are based on only two alleles whose origins are relatively old. Therefore, comparison of haplotype frequencies can be used in addition to comparison of SNP allele frequencies to find genetic differentiation.

We previously showed that haplotype structure and frequency can be estimated from SNP homozygotes by the use of genotype data from 3397 individuals from the Japanese population.22 In the present study, we applied this approach to genotype data of autosomal SNPs from 18 379 individuals from the Hondo cluster of the Japanese population, determined haplotypes and estimated haplotype frequencies. The haplotypes of the Ryukyu cluster were analyzed separately with genotype data from 504 individuals. Our analysis revealed genomic regions with dissimilar haplotype frequencies. In addition to the HLA region in chromosome 6, many other genomic regions showed genetic differentiation between the two clusters. These differentiated regions would be good candidate regions in which to look for genes that are involved in phenotypic differences between the Hondo and Ryukyu populations.

Materials and methods

Subjects and genotype data

In this study, we used the same 19 170 Japanese subjects that were analyzed in our previous study.10 These individuals consisted of healthy controls from the Midosuji Rotary Club and case individuals from the BioBank Japan Project.23 All the DNA samples were genotyped for 529 412 SNPs with Illumina 550K or 610K arrays (Illumina, San Diego, CA, USA).

Selection of individuals for the two main clusters

Principal component analysis of the 19 170 Japanese individuals10 generated two relatively distinct clusters for the Hondo and Ryukyu populations (Supplementary Figure S1). Using each individual's value for principal component 1, we selected 18 379 individuals for the Hondo cluster (principal component 1: −0.012 to 0.012) and 504 individuals for the Ryukyu cluster (principal component 1: −0.046 to −0.023).
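The selection step above amounts to an interval test on each individual's principal component 1 value. The following is a minimal sketch, not the authors' code: the function name and the example values are hypothetical, and treating the interval bounds as inclusive is an assumption. Only the two intervals come from the text.

```python
# Intervals on principal component 1, as given in the text.
HONDO_RANGE = (-0.012, 0.012)
RYUKYU_RANGE = (-0.046, -0.023)


def assign_cluster(pc1):
    """Return 'Hondo', 'Ryukyu' or None for one individual's PC1 value.

    Inclusive bounds are an assumption; the text gives only the ranges.
    """
    if HONDO_RANGE[0] <= pc1 <= HONDO_RANGE[1]:
        return "Hondo"
    if RYUKYU_RANGE[0] <= pc1 <= RYUKYU_RANGE[1]:
        return "Ryukyu"
    return None  # individuals outside both intervals are not used


# Hypothetical PC1 values for three individuals.
pc1_values = {"ind1": 0.003, "ind2": -0.030, "ind3": -0.018}
clusters = {name: assign_cluster(v) for name, v in pc1_values.items()}
```

Individuals whose value falls between the two intervals are left unassigned, which is consistent with the two clusters together containing fewer than the 19 170 genotyped subjects.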

Use of genotype data for haplotype analysis

Genotyped SNPs in autosomes (chromosomes 1–22) were selected for haplotype analyses if they satisfied the following three criteria: (1) the call rate was at least 99%, (2) genotype frequencies did not drastically depart from the Hardy–Weinberg equilibrium (P≥10⁻⁷) and (3) the minor allele frequency was at least 0.05. After this filtering, the genotype data for 388 591 SNPs were selected and alleles in the SNP genotype data were converted into the corresponding alleles in the top strand with the genomic coordinate for each chromosome.
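The three filtering criteria can be sketched from per-SNP genotype counts as follows. This is an illustrative sketch under stated assumptions: the text does not say which Hardy–Weinberg test was used, so a one-degree-of-freedom chi-square test stands in for it here, and the P-value threshold is read as a lower bound. The function names are hypothetical.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-d.f. chi-square test for Hardy-Weinberg proportions.

    Assumption: the paper does not name its test; an exact test may
    have been used instead.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of the chi-square distribution with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_filter(n_aa, n_ab, n_bb, n_missing,
                  min_call_rate=0.99, min_hwe_p=1e-7, min_maf=0.05):
    """Apply the call-rate, Hardy-Weinberg and minor-allele-frequency criteria."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    call_rate = called / (called + n_missing)
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    maf = min(freq_a, 1.0 - freq_a)
    if call_rate < min_call_rate or maf < min_maf:
        return False
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p
```

The thresholds are keyword arguments so that the same function can express the looser criteria used for the allele-frequency comparison.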

The genomic regions for all the autosomes were divided into non-overlapping bins having a fixed number of SNPs (4, 6 or 10 in this study). A small fraction of bins spanned regions longer than 1 Mbp where SNPs are very sparse; these were excluded from the haplotype analysis because haplotype inference would be inaccurate for them.
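The binning step can be sketched as below. This is a minimal sketch with hypothetical names: it assumes that SNP positions are sorted along one chromosome, that leftover SNPs at the chromosome end that do not fill a bin are dropped, and that the 1 Mbp limit applies to the span between the first and last SNP of a bin. None of these details is stated in the text.

```python
def make_bins(positions, snps_per_bin=4, max_span=1_000_000):
    """Split sorted SNP coordinates into non-overlapping fixed-size bins.

    Bins whose first and last SNPs lie more than `max_span` base pairs
    apart are skipped (assumed reading of the 1 Mbp exclusion rule).
    """
    bins = []
    last_start = len(positions) - snps_per_bin
    for start in range(0, last_start + 1, snps_per_bin):
        chunk = positions[start:start + snps_per_bin]
        if chunk[-1] - chunk[0] > max_span:
            continue  # SNPs too sparse for reliable haplotype inference
        bins.append(chunk)
    return bins


# Hypothetical coordinates: the second bin crosses a large gap.
example = [100, 200, 300, 400, 500, 600, 700, 2_000_000, 2_000_100]
example_bins = make_bins(example)
```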

Haplotype analysis

We previously examined the efficiency of haplotype determination and frequency estimation based on SNP homozygotes,22 and applied this approach to the genome-wide SNP genotype data from 18 379 individuals from the Hondo cluster. We evaluated the efficiency of the haplotype analysis for the Hondo cluster, because the reliability of the haplotype analysis based on SNP homozygotes depends on several factors (for example, the length of the region and the level of linkage disequilibrium).22 Therefore, we examined the fraction and number of homozygotes to see whether they were enough for detection of haplotypes and estimation of haplotype frequencies. In addition, total frequencies of the haplotypes were examined to see whether undetected haplotypes were negligible in terms of frequency and whether the estimated haplotype frequencies were reliable. Haplotypes for the Ryukyu cluster were inferred and their frequencies were estimated using the computer program SNPHAP (www.gene.cimr.cam.ac.uk/clayton/software/) because the small sample size (504) may result in an inaccurate estimation of haplotype frequency based on SNP homozygotes.
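The idea of reading haplotypes from SNP homozygotes can be illustrated as follows. An individual who is homozygous at every SNP in a region carries two copies of one unambiguous haplotype. The frequency estimator below, the square root of the homozygote fraction under Hardy–Weinberg proportions, is an assumption made for this sketch; the text defers the details of the estimator to ref. 22. The function name and the genotype encoding are hypothetical.

```python
import math
from collections import Counter


def haplotype_freqs_from_homozygotes(genotypes):
    """Estimate haplotype frequencies from fully homozygous individuals.

    `genotypes` holds one tuple per individual, with one two-letter
    genotype per SNP, e.g. ("CC", "TT", "GG", "TT").
    Returns (raw frequencies, their total, normalized frequencies).
    """
    n = len(genotypes)
    counts = Counter()
    for genotype in genotypes:
        if all(a == b for a, b in genotype):
            counts["".join(a for a, _ in genotype)] += 1
    # Assumed estimator: a haplotype with frequency p is expected to
    # appear as a homozygote with frequency p squared.
    raw = {h: math.sqrt(c / n) for h, c in counts.items()}
    total = sum(raw.values())
    normalized = {h: f / total for h, f in raw.items()} if total > 0 else {}
    return raw, total, normalized


# Hypothetical two-SNP region with 100 individuals.
sample = [("CC", "TT")] * 64 + [("TT", "CC")] * 4 + [("CT", "CT")] * 32
raw, total, normalized = haplotype_freqs_from_homozygotes(sample)
```

Because each frequency is estimated independently, the total need not equal 1.0, which is why the total frequency serves as a check on reliability and why the frequencies are normalized before comparison.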

To examine genetic differentiation between the Hondo and Ryukyu clusters, the FST value, as originally defined by Wright,24 between the Hondo and Ryukyu clusters was calculated from the normalized haplotype frequencies. The haplotype frequencies estimated by counting SNP homozygotes were normalized so that the sum of frequencies was 1.0.
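As an illustration of the calculation, Wright's FST for two subpopulations can be written as (HT − HS)/HT, where HS is the mean expected heterozygosity within the clusters and HT is the expected heterozygosity computed from the mean haplotype frequencies. The sketch below assumes equal weighting of the two clusters, which the text does not specify, and uses hypothetical function names. The example takes the two haplotype frequencies quoted in the Results for the chromosome 3 region and lumps all remaining haplotypes into a single "other" class, which is a simplification made only for this example.

```python
def heterozygosity(freqs):
    """Expected heterozygosity, 1 minus the sum of squared frequencies."""
    return 1.0 - sum(f * f for f in freqs.values())


def fst(freqs_a, freqs_b):
    """Wright's FST between two clusters from normalized frequencies.

    Assumption: both clusters receive equal weight.
    """
    haplotypes = set(freqs_a) | set(freqs_b)
    mean = {h: (freqs_a.get(h, 0.0) + freqs_b.get(h, 0.0)) / 2.0
            for h in haplotypes}
    h_t = heterozygosity(mean)
    h_s = (heterozygosity(freqs_a) + heterozygosity(freqs_b)) / 2.0
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


# CTGT and TCAT frequencies are from the text; "other" is a lumped remainder.
hondo = {"CTGT": 0.882, "TCAT": 0.062, "other": 0.056}
ryukyu = {"CTGT": 0.708, "TCAT": 0.214, "other": 0.078}
example_fst = fst(hondo, ryukyu)
```

The same function applies to single SNP sites by passing the two allele frequencies at each site in place of haplotype frequencies.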

Comparison of SNP allele frequencies between the Hondo and Ryukyu clusters

Genotyped autosomal SNPs were selected for comparison of allele frequencies in the two clusters if they satisfied the following three criteria: (1) the SNPs were polymorphic in the Japanese sample, (2) the genotype frequency did not drastically depart from the Hardy–Weinberg equilibrium (P≥10⁻⁶) and (3) the call rate was at least 0.99. We selected 437 697 SNPs (discarding 65 202 SNPs) for comparison of allele frequencies. For each SNP site, we calculated FST, as originally defined,24 between the Hondo and Ryukyu clusters.

Results

Determination of haplotypes of the Hondo and Ryukyu clusters

Haplotypes for the Hondo cluster were determined by detecting SNP homozygotes.22 To find the appropriate condition for haplotype analysis from SNP homozygotes, we conducted a genome-wide haplotype analysis with different numbers of SNPs (4, 6 and 10) and inspected the results by (1) the fraction and number of homozygotes and (2) the total frequencies of haplotypes (Supplementary Table S1). The fraction and number of homozygotes depended on the haplotype lengths.22 When the genomic regions were divided into regions having four SNPs, the proportion of SNP homozygotes was about 35% for the Hondo cluster (Supplementary Table S1). As the haplotypes became longer, the fraction and number of homozygotes tended to decrease. When the genomic regions were divided into regions having four SNPs, the total frequency of haplotypes was 1.011, which was closer to 1.0 than the total frequencies obtained with the other conditions. Based on these results, we decided to use the haplotypes with four SNPs. After discarding long haplotypes that may contain large gaps (19 regions, see Materials and methods), 97 119 genomic regions having four SNPs were used for further analysis, and each analyzed region had 6.28 haplotypes on average. Haplotypes for the Ryukyu cluster were inferred and haplotype frequencies were estimated for the 97 119 regions by using the SNPHAP program. On average, 7.62 haplotypes were detected from the 504 Ryukyu individuals.

Differentiation between the Hondo and Ryukyu clusters by haplotype frequency

To evaluate genetic difference between the Hondo and Ryukyu clusters at the haplotype level, we compared haplotype frequencies between the two clusters. Generally, the haplotype frequencies in the two clusters were highly correlated (correlation coefficient was 0.983, Figure 1). The level of genetic differentiation between the two clusters was evaluated by FST with haplotype frequencies. The value of haplotype FST ranged from 0.0 to 0.039 among the 97 119 genomic regions covering all the autosomes and the average value of FST was 0.0028 (the empirical distribution of FST for all the analyzed regions is shown in Supplementary Figure S2). In spite of the low level of differentiation between the two clusters, a substantial proportion of regions were located in the tail of the distribution: 2030 of 97 119 regions have FST≥0.01. Therefore, we searched for genomic regions that showed relatively higher differentiation by the FST values.

Figure 1

Comparison of haplotype frequencies between the Hondo and Ryukyu clusters. The frequencies of haplotypes for the Hondo cluster (X-axis) and the Ryukyu cluster (Y-axis) are shown in a scatter plot. All the genomic regions in autosomes were divided into non-overlapping bins having four SNPs, and 97 119 regions were analyzed. Haplotypes of the Hondo cluster were determined by detection of SNP homozygotes and the normalized haplotype frequencies (sum=1.0) were used. Haplotypes of the Ryukyu cluster were inferred by use of the SNPHAP program. The correlation coefficient of haplotype frequency between the two clusters was 0.983.

To see whether some specific regions show relatively higher genetic differentiation, we examined the chromosomal distribution of haplotype FST (Figure 2). These plots show that each chromosome has substantial variations in FST values.25 Some local genomic regions show high FST values. In particular, the short arm of chromosome 6 had a long stretch of high FST haplotypes in the HLA region (approximate genomic positions 28 500 000–33 000 000). High FST regions were also found in other genomic regions, as seen on the short arm of chromosome 9 (genomic position: around 35 081 154, proximal to the centromere) where the PIGO gene is located and on the long arm of chromosome 7 (genomic position: around 146 600 000) where the CNTNAP2 gene (contactin-associated protein-like 2) is located.

Figure 2

Chromosomal distribution of haplotype FST between the Hondo and Ryukyu clusters. FST values calculated with haplotype frequencies are shown along the genomic coordinate (Mbp) for each chromosome. Horizontal orange lines show the average value of haplotype FST for each chromosome.

By comparing haplotype frequencies, we detected the genomic regions that differed most in haplotype frequency between the two clusters (Table 1). A genomic region in chromosome 3 (genomic position: 188 873 942–188 884 675) showed the highest value of haplotype FST (0.039) among all the autosomal regions (Table 1). Although this region does not contain any protein-coding gene, SST and RTP2 are located nearby. Chromosome 7 had three genomic regions (genomic position: around 146 600 000) adjacent to each other, which showed high values of FST. These regions contain CNTNAP2, whose polymorphism is associated with autism26 and Pitt–Hopkins-like syndrome 1.27

Table 1 Genomic regions showing the highest differentiation between the Hondo and Ryukyu clusters based on haplotype frequency

We then looked at haplotype frequencies at the most differentiated regions to see whether any haplotypes show marked differences in frequency between the two subpopulations (Table 2). At many of the most differentiated regions, we detected a few haplotypes whose frequency differed by >0.1 between the clusters. The most differentiated region in chromosome 3 (approximate genomic position: 188 880 000) had a few haplotypes whose frequency differences were remarkable. The major haplotype in the Hondo cluster was CTGT (0.882), and its frequency was lower in the Ryukyu cluster (0.708). In contrast, the haplotype TCAT was present at a frequency of 0.062 in the Hondo cluster, whereas its frequency was much higher in the Ryukyu cluster (0.214).

Table 2 Haplotype frequencies in the most differentiated genomic regions

To identify any functional bias in genes located at highly differentiated genomic regions, we examined overrepresented biological functions in these genes. The top 1% of highly differentiated genomic regions (971) were selected by the FST value and found to contain 379 genes. We divided the highly differentiated genes into two groups: genes in the HLA region (54) and genes in the non-HLA region (325), and conducted a gene-set enrichment analysis of each group. The HLA region was analyzed separately as it is known to be highly differentiated among populations, which may bias or obscure differences in other regions. We examined overrepresented biological functions in the differentiated regions using the PANTHER Classification System (http://www.pantherdb.org/). For the 54 differentiated genes in the HLA region, the molecular functions that are most overrepresented included antigen processing and presentation (Table 3a). On the other hand, for 325 differentiated genes in non-HLA regions, the molecular functions that are most overrepresented included cell–cell adhesion and intracellular-signaling cascade functions (Table 3b).

Table 3a Overrepresented functions of highly differentiated genes in the HLA region
Table 3b Overrepresented functions of highly differentiated genes in the non-HLA region

Differentiation by haplotype frequencies and allele frequencies

To determine to what extent differences in haplotype frequencies are correlated with differences in allele frequencies at single SNP sites, we calculated FST at all the SNP sites and examined the relationship between haplotype FST and FST at SNP sites. We used 437 697 autosomal SNPs to calculate FST by allele frequencies between the Hondo and Ryukyu clusters. The most differentiated autosomal SNPs were found in the MOG gene of the HLA region in chromosome 6 (Supplementary Table S2), in agreement with the results of a previous study.9 Examination of the most differentiated SNPs in gene regions (Supplementary Table S3) detected differentiated nonsynonymous SNPs (Supplementary Table S4). The most differentiated SNPs were found in the following annotated genes: FMN2, FBXL21, GEIN6 and ZNF96. Our previous study identified the most differentiated nonsynonymous SNPs in EDAR and ABCC11. This discrepancy in the most differentiated nonsynonymous SNPs may be due to the differences in the SNPs that were selected for genotyping in the two studies.

Next, to examine the relationship between haplotype FST and FST at single SNP sites, we compared FST in two ways. First, we calculated the average value of FST at SNP sites within each region and examined the relationship with the haplotype FST. Second, we chose the largest FST for any SNP in each region and examined the relationship with the value of haplotype FST. We found that the haplotype FST was significantly correlated with these values of FST for each region (correlation coefficient was 0.837 for Figure 3a and 0.811 for Figure 3b).
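The two comparisons described above can be sketched as follows: each region's SNP-level FST values are summarized by their mean and by their maximum, and each summary is then correlated with the haplotype FST across regions. This is a minimal sketch with hypothetical function names and made-up input values; it assumes that the reported coefficient is a Pearson correlation, which the text does not state.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def region_mean_and_max(snp_fst_by_region):
    """Summarize the SNP-level FST values of each region."""
    means = [sum(values) / len(values) for values in snp_fst_by_region]
    maxima = [max(values) for values in snp_fst_by_region]
    return means, maxima


# Hypothetical values for four regions of four SNPs each.
haplotype_fst = [0.001, 0.004, 0.013, 0.039]
snp_fst = [
    [0.000, 0.001, 0.002, 0.001],
    [0.002, 0.005, 0.001, 0.003],
    [0.004, 0.007, 0.003, 0.006],
    [0.020, 0.035, 0.010, 0.030],
]
means, maxima = region_mean_and_max(snp_fst)
r_mean = pearson(haplotype_fst, means)
r_max = pearson(haplotype_fst, maxima)
```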

Figure 3

Relationships between differentiation in haplotype frequency and differentiation at SNP sites. FST at each SNP site between the Hondo and Ryukyu clusters was calculated with allele frequencies. As each region had four analyzed SNPs, the average or maximum value of FST in each region was used for comparison. (a) Haplotype FST and average values of FST at SNP sites within the region are shown in a scatter plot. The correlation coefficient was 0.837 (0.835–0.839). (b) Haplotype FST and the maximum value of FST at SNP sites within the region are shown in a scatter plot. The correlation coefficient was 0.811 (0.809–0.814).

However, the correlation between haplotype FST and the largest FST based on allele frequency was not very strong. To check the dissimilarity between the two measures, we selected the 971 genomic regions showing the highest values of haplotype FST (top 1%) and examined how many of them also had the highest FST values at single SNP sites. By comparing them with the top 1% of genomic regions (971) having the highest FST at single SNP sites, we found that only 392 of the 971 genomic regions had the largest differences in both haplotype and allele frequencies. These results show that the Hondo and Ryukyu clusters have genomic regions that are highly differentiated in haplotype frequency without a drastic difference in allele frequency at single SNP sites. Conversely, some genomic regions are highly differentiated in allele frequency at single SNP sites but do not show drastic differences in haplotype frequency. We considered the former cases more worthy of investigation than the latter for two reasons. First, the latter cases may be explained by a weaker linkage of polymorphisms between SNP sites. Second, the merits of haplotype analysis may be found in the former cases. One example of a genomic region where haplotype frequencies showed drastic differences without any highly differentiated SNP is in chromosome 1 (genomic position: 235 499 862–235 513 179), where the haplotype FST was 0.0130 (in the top 1%). This region contains the RYR2 (ryanodine receptor 2) gene, whose mutations are associated with ventricular tachycardia and arrhythmogenic right-ventricular dysplasia. In this region, the frequency of haplotype TATC was 0.070 for the Hondo cluster and 0.190 for the Ryukyu cluster. However, no strongly differentiated SNP was observed in this region, the largest SNP FST being 0.0072. Another example is a genomic region in chromosome 4 (genomic position: 463 935–487 138) that contains ZNF721 and PIGG. The haplotype FST was 0.0130 for this region, whereas the largest SNP FST value based on allele frequency was 0.0091. So far PIGG has not been related to any phenotype or disease, whereas other genes involved in phosphatidylinositol glycan anchor biosynthesis are known to be related to various kinds of diseases (for example, PIGA is known to be involved in paroxysmal nocturnal hemoglobinuria).
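The top 1% comparison can be sketched as a set intersection of region indices. The function names are hypothetical, and truncating the 1% cut-off to a whole number of regions is an assumption, although it does give 971 regions for the 97 119 analyzed here.

```python
def top_regions(values, fraction=0.01):
    """Indices of the regions with the highest values (top `fraction`)."""
    k = max(1, int(len(values) * fraction))
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(ranked[:k])


def shared_top_regions(haplotype_fst, max_snp_fst, fraction=0.01):
    """Regions in the top fraction by both haplotype FST and SNP FST."""
    return (top_regions(haplotype_fst, fraction)
            & top_regions(max_snp_fst, fraction))


# Hypothetical values for 200 regions: the two rankings agree on one
# of the two top regions.
hap = [i / 1000 for i in range(200)]
snp = list(hap)
snp[0], snp[198] = snp[198], snp[0]
shared = shared_top_regions(hap, snp)
```

Regions that appear in the haplotype-based set but not in the shared set correspond to the cases of main interest above, where haplotype frequencies differ without a strongly differentiated single SNP.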

Discussion

The present study examined the genetic differentiation between the Hondo and Ryukyu clusters in the Japanese population with SNP genotype data from about 400 000 autosomal SNP sites. Population differentiation between the two clusters was examined at both the allele frequency and haplotype levels. This is the first time that differentiation between these clusters was examined by genome-wide haplotypes. We identified many non-HLA regions with haplotype frequencies that were dissimilar between the Hondo and Ryukyu clusters. Previous studies have shown regional differences in haplotype frequency in the Japanese population mainly in the HLA region.28,29 The present results suggest that geographic clines of haplotype frequencies exist in genomic regions other than the HLA region.

Jomon and Yayoi peoples differ in skeletal and cranial morphology. According to the dual structure model of the Japanese population,8 regional differences in the phenotypic variation of the Japanese may be explained by the varying rates of intermixture of the peoples in the second migration from Northeast Asia. In fact, there are morphological differences among different geographical regions in modern Japanese in skeletal, dental, cranial and facial morphology. However, the genetic determinants of these morphological differences have not been fully elucidated. The differentiated genomic regions found in the present study may be good candidates in which to search for the genetic determinants of the phenotypic differences between these peoples, with the caveat that variations in the X and Y chromosomes and mitochondrial DNA were not investigated in this work.

Understanding the differentiation between subpopulations, in addition to being useful for avoiding false positive results in association studies, is also important for medical population genomics when disease prevalence varies among the populations. For example, the prevalence of closed-angle glaucoma is higher in Okinawa than on the main islands of Japan.30 The Hondo and Okinawa peoples differ slightly in morphology, and some genetic factors may contribute to the phenotypic differences between them. Environmental factors may also contribute to the higher prevalence of glaucoma in Okinawa. Further studies are needed to clarify which genetic factors contribute to the higher prevalence of glaucoma in Okinawa, and to what extent.

Differentiated genomic regions should be examined carefully in GWASs because spurious associations are likely to occur there. On the other hand, spurious associations are less likely to occur in most other regions, which show little differentiation. In addition, some of the differentiated SNPs identified in this study can be used as ancestry-informative markers. A set of SNPs as ancestry-informative markers would be useful for identifying the subpopulation to which each individual belongs. The catalog of real haplotypes with their estimated frequencies, as we created in this study, will be useful for identifying causative polymorphisms for a trait, which are linked to the most associated SNPs in a GWAS. In particular, the haplotypes for the Hondo cluster were determined by SNP homozygotes without ambiguity, and the estimated haplotype frequencies were very similar to the frequencies estimated by the SNPHAP program (correlation coefficient was 0.9995). The genome-wide haplotype catalog created in this study could be improved by investigating the haplotype block structure, which varies between genomic regions, because the strength of linkage of polymorphisms between SNP sites differs among regions. A genome-wide haplotype analysis is one of the ways to uncover a rough sketch of genome-sequence variations with a large number of samples, although sequencing individual genomes is becoming more convenient and less expensive. Through an appropriately designed haplotype analysis of many individuals, we may be able to identify the most variable, conserved or differentiated regions in human populations of interest.