Introduction

In a genome-wide association study (GWAS), one tries to find the genomic loci responsible for a given trait. This is done by finding polymorphic markers that are associated with the trait. Recently, single-nucleotide polymorphism (SNP) markers (landmark SNPs) have been used extensively for this purpose. In such studies, much effort has been directed at understanding haplotype structure and haplotype frequency.1, 2, 3, 4

Haploid human genomes differ by one SNP in every 1200–1700 bp.5, 6 Even though significant progress has been made in sequencing technologies, determining haplotypes by sequencing is expensive and time consuming. Another approach is to infer haplotypes from genotype data by statistical methods. A pioneering algorithm by Clark7 for inferring haplotypes from genotype data starts with the detection of homozygous individuals and then uses a parsimonious algorithm with a heuristic search. Many statistical methods for inferring haplotypes or estimating haplotype frequencies were subsequently developed,8, 9, 10, 11, 12, 13, 14, 15, 16 and some of them are widely used to infer haplotypes from SNP genotype data.

After a dense SNP map on the human genome became available,5, 17, 18, 19 several studies examined the extent of linkage disequilibrium.1, 2, 4 The SNP discovery project, one of the Japan Millennium Genome Projects, identified over 170 000 SNPs by sequencing gene regions in the human genome of DNA samples from 24 individuals.19, 20 Subsequently, genotype data were obtained for a larger number of individuals from the Japanese population, and allele frequencies for common SNPs were estimated (http://snp.ims.u-tokyo.ac.jp/). We built a genome-wide map of linkage disequilibrium for the gene-based SNPs,2 and this led to GWASs as early as 2002–2003.21, 22 The International HapMap project23, 24 provided genome-wide SNP genotype data for several selected populations. In that study, haplotypes were inferred from the genotype data by statistical methods.23 The HapMap project also provided guidelines for selecting tag SNPs for association studies and has led to an increase in the number of GWASs.25

Statistical approaches for inferring haplotypes are quite accurate for haplotypes that exist at high frequencies. However, such methods usually predict too many haplotypes, some of them at very low frequencies, and it is usually difficult to tell which of these low-frequency haplotypes really exist. Furthermore, the accuracy of haplotype inference from data of unrelated individuals was found to be lower than that of haplotype inference from family data.26 Another approach to determining haplotypes in a population is to use hemizygous tissues, in which haplotypes are unambiguously determined. For example, complete hydatidiform moles (CHMs), which have chromosomes from a single sperm, have been used to determine the haplotypes of the Japanese population.27

Clark7 showed that homozygous segments without multiple heterozygous sites in an individual can be used to define haplotypes. Searching for runs of SNP homozygosity to identify autozygous genomic regions is useful for mapping recessive diseases.28, 29, 30, 31, 32 Using genotype data from thousands of individuals in the BioBank Japan Project,33 we attempted to (1) determine haplotypes by detecting SNP homozygotes for genomic regions of interest and (2) estimate haplotype frequencies based on the proportion of homozygous individuals in the sample by assuming that genotype frequencies are in Hardy–Weinberg equilibrium. To determine the accuracy and efficiency of estimating haplotype frequencies from homozygous individuals, we analyzed SNP genotype data for 3397 individuals from the Japanese population.

First, we conducted a test analysis with ‘definitive haplotypes’ from 74 complete hydatidiform mole samples in the Kyushu University Definitive Haplotype Database (D-haploDB),27, 34 and compared the estimated haplotype frequencies with those obtained by two statistical methods, PHASE and SNPHAP, to examine the reliability of haplotype determination and haplotype frequency determination. Second, we applied this approach to the genomic regions of all human genes to evaluate the efficiency of the method in various conditions.

Materials and methods

Determination of haplotypes and estimation of haplotype frequencies

Using SNP genotype data from a sufficient number of individuals, we examined the efficiency of determining haplotypes from homozygous individuals for a given genomic region and of estimating haplotype frequencies from the proportion of homozygous individuals in the sample. First, we selected individuals whose genotypes were homozygous for all the SNPs in a given genomic region. Then, we determined haplotypes by assuming that each such individual carries two copies of the same haplotype. We estimated the frequency of each haplotype from the proportion of individuals in the sample who were homozygous for that haplotype. Suppose we have a sample of $N$ diploid individuals from a population. Let $f = (f_1, \ldots, f_M)$ denote the set of haplotype frequencies for the observed haplotypes (arbitrarily labeled $1, \ldots, M$). Assuming Hardy–Weinberg equilibrium, the expected number of homozygotes for a particular haplotype $i$, $n_i$, is $N f_i^2$. The frequency of haplotype $i$ can therefore be estimated as

$$f_i = \sqrt{p_i},$$

where $p_i$ ($= n_i/N$) denotes the proportion of homozygotes for haplotype $i$ in the sample.
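The procedure described above (select fully homozygous individuals, read off their haplotype, and take the square root of the homozygote proportion) can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name, the genotype representation (one pair of alleles per SNP) and the toy sample are assumptions, and missing genotype calls are not handled.

```python
from collections import Counter
from math import sqrt


def estimate_haplotype_frequencies(genotypes):
    """Estimate haplotype frequencies from fully homozygous individuals.

    `genotypes` is a list of individuals; each individual is a list of
    (allele_1, allele_2) tuples, one tuple per SNP in the region.
    Returns a dict mapping haplotype string -> sqrt(n_i / N).
    """
    n_individuals = len(genotypes)
    homozygote_counts = Counter()
    for individual in genotypes:
        # Keep only individuals that are homozygous at every SNP in the region.
        if all(a == b for a, b in individual):
            haplotype = "".join(a for a, _ in individual)
            homozygote_counts[haplotype] += 1
    # Under Hardy-Weinberg equilibrium E[n_i] = N * f_i^2, so f_i ~ sqrt(n_i / N).
    return {hap: sqrt(n / n_individuals) for hap, n in homozygote_counts.items()}


# Hypothetical toy sample: 4 individuals, 3 SNPs.
sample = [
    [("G", "G"), ("G", "G"), ("C", "C")],  # homozygous for GGC
    [("G", "G"), ("G", "G"), ("C", "C")],  # homozygous for GGC
    [("G", "A"), ("G", "G"), ("C", "C")],  # heterozygous at SNP 1, ignored
    [("A", "A"), ("G", "G"), ("T", "T")],  # homozygous for AGT
]
freqs = estimate_haplotype_frequencies(sample)
```

Note that heterozygous individuals still count toward N in the denominator; only the numerator is restricted to homozygotes.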

Subjects and genotype data

Genotype data were obtained from 3397 self-identified Japanese individuals. These individuals consist of healthy controls from Midosuji Rotary Club and case individuals for 13 of the 47 diseases that are studied in the BioBank Japan Project.33 All the patients provided written informed consent to participate in the BioBank Japan Project. The BioBank Japan Project was approved by the ethical committees at The Institute of Medical Science, The University of Tokyo, and the Center for Genomic Medicine (formerly, SNP Research Center), Institutes of Physical and Chemical Research (RIKEN).

All the Japanese DNA samples were genotyped for 568 666 SNPs by using Illumina 550K arrays (Illumina, San Diego, CA, USA). Genotyped SNPs in autosomes (chromosomes 1–22) were selected for further analyses if they satisfied both of the following criteria: (1) the call rate was high enough (≥99%), and (2) no abnormality was detected by visual inspection of the raw genotyping data when genotype frequencies departed from Hardy–Weinberg equilibrium (P<10^−6; 10^−6≤P<10^−3 and call rate <0.9998). After the selection of SNPs, the genotype data for 547 458 SNPs were used in further analyses.

Test analysis with the definitive haplotypes

To compare the haplotypes and haplotype frequencies determined by different approaches, we selected the 79 007 SNPs that were shared between the 547 458 SNPs genotyped with the Illumina 550K arrays and the 81 250 SNPs genotyped with the Perlegen platform for the definitive haplotypes from 74 CHMs in D-haploDB (http://orca.gen.kyushu-u.ac.jp/)27, 34 (Supplementary Figure 1). After discarding SNPs that were monomorphic in the 3397 individuals, 79 005 SNPs were used for the following analyses.

Data of genomic locations for the RefSeq genes (transcripts) were retrieved from the Entrez Gene website (http://www.ncbi.nlm.nih.gov/gene) in the NCBI bioinformatics resources. Genomic regions between the start and end positions for each transcript were selected for analysis. We selected genomic regions having at least three SNPs for the haplotype analysis.

Comparison of haplotype frequencies with other approaches

Haplotypes for the analyzed regions were inferred and their frequencies were estimated by using two computer programs, PHASE (http://stephenslab.uchicago.edu/software.html)9 and SNPHAP (http://www-gene.cimr.cam.ac.uk/clayton/software/). All of the 3397 individuals were included in the analysis. The haplotype frequencies estimated from homozygotes were compared with the haplotype frequencies estimated by each of the two statistical methods, and the correlation coefficient between the haplotype frequencies from the two approaches was calculated. We also normalized the haplotype frequencies from the homozygotes so that the sum of frequencies of the observed haplotypes equaled 1.0, and compared the normalized frequencies with those obtained by the two statistical methods in the same way.
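The normalization and correlation steps can be illustrated with a short Python sketch. It is only an assumption-laden example: the function names and the toy frequencies are hypothetical, and the choice to correlate only over haplotypes reported by both approaches is an assumption, since the text does not state how haplotypes found by a single approach were handled.

```python
from math import sqrt


def normalize(freqs):
    """Rescale haplotype frequencies so that they sum to 1.0."""
    total = sum(freqs.values())
    return {hap: f / total for hap, f in freqs.items()}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def compare(freqs_a, freqs_b):
    """Correlate frequencies over haplotypes present in both results
    (assumption: haplotypes seen by only one approach are skipped)."""
    shared = sorted(set(freqs_a) & set(freqs_b))
    return pearson_r([freqs_a[h] for h in shared], [freqs_b[h] for h in shared])


# Hypothetical frequencies for one region.
from_homozygotes = {"GGC": 0.60, "AGT": 0.30, "GAC": 0.05}
from_statistics = {"GGC": 0.62, "AGT": 0.29, "GAC": 0.06, "AAT": 0.03}

r_raw = compare(from_homozygotes, from_statistics)
r_normalized = compare(normalize(from_homozygotes), from_statistics)
```

Because normalization within a single region multiplies every frequency by the same constant, it leaves the within-region correlation unchanged; differences only appear when haplotypes from many regions, each with its own scaling factor, are pooled.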

Similarly, the haplotype frequencies were also compared with those of the ‘definitive haplotypes’ in 74 CHMs in D-haploDB.27, 34 Genotype data for the definitive haplotypes from the 74 CHMs were downloaded from D-haploDB,27, 34 and the list of haplotypes for the 1955 genomic regions was generated from the genotype data.

Application to genomic regions of all human genes

We used the genome-wide genotype data for the 547 458 SNPs that satisfied the quality control filters (see above), and selected 404 758 SNPs by discarding SNPs whose minor allele frequencies were less than 0.05 (Supplementary Figure 3). The genomic region for each transcript included an additional 3000 bp on both the 5′ and 3′ sides of the transcript. We selected 11 351 genomic regions that had at least three SNPs for the haplotype analysis.
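The region definition above (transcript span plus 3000 bp on each side, minor allele frequency of at least 0.05, at least three SNPs) can be sketched as follows. The `Snp` record, the function names and the example coordinates are hypothetical and only illustrate the selection logic.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    snp_id: str      # hypothetical identifier
    chrom: str
    position: int    # base-pair coordinate
    maf: float       # minor allele frequency in the sample


def snps_in_region(snps, chrom, start, end, flank=3000, min_maf=0.05):
    """Return SNPs inside [start - flank, end + flank] with MAF >= min_maf."""
    lo, hi = start - flank, end + flank
    return [
        s for s in snps
        if s.chrom == chrom and lo <= s.position <= hi and s.maf >= min_maf
    ]


def is_analyzable(region_snps, min_snps=3):
    """A region is analyzed only if it carries at least `min_snps` SNPs."""
    return len(region_snps) >= min_snps


# Hypothetical SNPs around a transcript spanning 10 000-20 000 on chromosome 1.
snps = [
    Snp("snp1", "1", 8_000, 0.27),    # inside the 5' flank
    Snp("snp2", "1", 15_000, 0.02),   # dropped: MAF below 0.05
    Snp("snp3", "1", 16_000, 0.40),
    Snp("snp4", "1", 22_500, 0.10),   # inside the 3' flank
    Snp("snp5", "1", 30_000, 0.30),   # outside the region
    Snp("snp6", "2", 15_000, 0.30),   # different chromosome
]
selected = snps_in_region(snps, "1", 10_000, 20_000)
```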

Results

Test analysis with statistical approaches and the definitive haplotypes

To examine the efficiency of the haplotype analysis from SNP homozygotes by comparing with different approaches, we selected 79 005 SNPs, which were genotyped in both the 3397 Japanese individuals in the BioBank Japan Project33 and the 74 CHM samples.27 Using genomic locations for human transcripts, we selected the genomic regions for genes that had at least three of these SNPs. For the three or more SNPs in each genomic region, the average distance between SNPs was 20 215 bp. The minor allele frequencies ranged from 0.02 to 0.50, and the average was 0.271.

We selected 1955 genomic regions with 3–10 analyzed SNPs for further analyses. In these genomic regions, we detected homozygous individuals in which genotypes were homozygous for all the analyzed SNPs. The proportion of homozygotes in the sample was 0.37 on average for the 741 regions having three SNPs, and decreased with increasing number of SNPs (Table 1). In total, we detected 17 739 haplotypes for the 1955 genomic regions by detecting individuals in which genotypes were homozygous for all the analyzed SNPs. The numbers of haplotypes whose frequencies are higher than 0.01 were very similar to those obtained by the two statistical methods (PHASE and SNPHAP) (Table 2). However, the statistical methods predicted a large number of haplotypes whose frequencies are less than 0.01. Similarly, the numbers of haplotypes in the 74 CHMs in D-haploDB were very similar to those detected in homozygotes for genomic regions having 3–5 SNPs (Supplementary Table 1). However, as the number of the SNPs in the regions increased, the number of haplotypes in the 74 CHMs became larger than the number of haplotypes detected in homozygotes in the 3397 individuals (Supplementary Table 1).

Table 1 Detection of homozygous individuals in the 3397 individuals for the 1955 genomic regions
Table 2 Number of haplotypes detected in homozygous individuals and predicted by statistical approaches

Frequencies of the detected haplotypes were estimated from the proportion of homozygotes in the sample. The lowest haplotype frequency was 0.0172 (the square root of 1/3397), a frequency that was observed for 3633 haplotypes in 1118 genes. The highest haplotype frequency (0.946) was observed in the INVS gene for haplotype GGC (rs2787366, rs1999877, rs2787390), for which 3041 of 3397 individuals were homozygous.
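As a quick arithmetic check of the two extremes quoted above, the square-root estimator can be evaluated directly with the sample size and homozygote counts given in the text. This is only a worked illustration; the variable names are arbitrary.

```python
from math import sqrt

N = 3397  # individuals in the sample

# A haplotype seen in a single homozygous individual.
lowest = sqrt(1 / N)
# INVS haplotype GGC: 3041 of 3397 individuals were homozygous.
highest = sqrt(3041 / N)

print(round(lowest, 4), round(highest, 3))  # 0.0172 0.946
```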

The haplotype frequencies estimated from the proportions of homozygotes were highly correlated with the results obtained by the two statistical methods: the correlation coefficients were 0.9986 (P<2.2 × 10^−16) for PHASE (Figure 1a) and 0.9985 (P<2.2 × 10^−16) for SNPHAP (Figure 1c). The correlation coefficient of the haplotype frequencies with the definitive haplotypes from 74 CHMs was 0.9691 (P<2.2 × 10^−16) (Figure 1e). Although this is a comparison of haplotype frequencies between the SNP homozygotes in the 3397 individuals and the 74 CHMs from the Japanese population, the haplotype frequencies in the two samples were very similar. When the haplotype frequencies were normalized so that their sum equaled 1.0, they were still highly correlated (r=0.9983, P<2.2 × 10^−16) with the frequencies obtained by the statistical methods (Figure 1b and d) and with those in D-haploDB (r=0.9688, P<2.2 × 10^−16, Figure 1f), although the correlations were slightly weaker than those obtained without the normalization.

Figure 1

Comparison of haplotype frequencies from different approaches. (a–f) Haplotype frequencies obtained by two approaches are shown in scatter plots. The x axis is the haplotype frequency estimated from the proportion of homozygotes. (a–d) Comparison of haplotype frequencies with those estimated by two statistical methods, the PHASE program (a, b) and the SNPHAP program (c, d). Genomic regions having more than seven SNPs were not analyzed by those programs because of the long computation time they require. (e, f) Comparison of haplotype frequencies with those in the 74 CHMs from D-haploDB. (b, d, f) The haplotype frequencies estimated from the proportion of homozygotes were normalized so that the sum of frequencies of the observed haplotypes for each region equaled 1.0. The correlation coefficients of haplotype frequencies for each plot are as follows: (a) 0.9986, (b) 0.9983, (c) 0.9985, (d) 0.9981, (e) 0.9691 and (f) 0.9688.

For the genomic regions having three polymorphic SNPs, the average number of haplotypes identified by our method that were also among the definitive haplotypes from the 74 CHMs in D-haploDB was 4.7 (Supplementary Figure 2). This suggests that this study independently identified most of the definitive haplotypes in the 74 CHMs. On the other hand, 280 haplotypes in 741 genomic regions were present only in the 74 CHMs in D-haploDB, and 142 haplotypes were present only in our result (Supplementary Figure 2).

Application to genomic regions for all human genes

The efficiency of detection of haplotypes in a region from SNP homozygotes depends on several factors such as the number of SNPs in the region, the length of the region, the level of linkage disequilibrium and the selection of SNPs. In particular, as the number of SNPs increases, both the number of possible haplotypes and the number of actual haplotypes would increase, and the actual haplotypes may contain haplotypes that exist at very low frequencies. On the other hand, the proportion of SNP homozygotes would decrease as the number of SNPs increases. Generating a list of haplotypes from SNP homozygotes with a larger number of SNPs may not include the haplotypes that exist at very low frequencies. Therefore, we applied our approach to a larger number of genomic regions and evaluated the results according to the number of SNPs and the level of linkage disequilibrium.

We focused on the genomic regions for all the human transcripts (see Materials and methods and Supplementary Figure 3) and used genotype data for 404 758 SNPs after discarding SNPs whose minor allele frequencies were less than 0.05 (see Materials and methods). Then, we selected 11 351 genomic regions that had at least three SNPs. Although the number of analyzed SNPs in those regions ranged from 3 to 706, it was less than 20 for a majority of the regions analyzed (9335/11 351, data not shown). For the three or more SNPs in each genomic region, the average distance between SNPs was 5290 bp, and the average minor allele frequency was 0.27. The estimated frequencies of haplotypes were compared with those based on SNPHAP (Supplementary Figure 4). The haplotype frequencies estimated from the proportions of homozygotes were highly correlated (r=0.9991, P<2.2 × 10^−16) with the results obtained by the SNPHAP program.

The proportions of SNP homozygotes (+ marks in Figure 2) decreased with increasing number of analyzed SNPs. However, the expected proportions of SNP homozygotes in linkage equilibrium with the average allele frequency (green dots in Figure 2) were much lower than the observed proportions of SNP homozygotes. This may be because the analyzed SNPs within the regions are in linkage disequilibrium. We measured the levels of linkage disequilibrium in the analyzed regions by a multi-locus linkage disequilibrium parameter, ɛ.35 The average values of ɛ for each number of SNPs were compared with those in the test analysis with the definitive haplotypes. The average ɛ values were always larger than those in the test analysis for various numbers of SNPs (see Supplementary Table 2). This may be due to a higher level of linkage disequilibrium between the analyzed SNPs that were more densely distributed in the regions compared with the test analysis.
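The expected proportion of SNP homozygotes under linkage equilibrium can be sketched as below. The formula is an assumption about how such a baseline would be computed: every SNP is taken to have the average minor allele frequency (0.27), to be in Hardy–Weinberg equilibrium, and to be independent of the other SNPs, so the per-SNP homozygosity q² + (1 − q)² is simply multiplied across SNPs. The function name is hypothetical.

```python
def expected_homozygote_proportion(n_snps, maf=0.27):
    """Expected fraction of individuals homozygous at all `n_snps` SNPs,
    assuming Hardy-Weinberg equilibrium at each SNP and linkage
    equilibrium (independence) between SNPs."""
    per_snp_homozygosity = maf ** 2 + (1 - maf) ** 2
    return per_snp_homozygosity ** n_snps


# Expected proportions for a few region sizes.
for k in (3, 10, 20):
    print(k, expected_homozygote_proportion(k))
```

Under these assumptions the expectation falls geometrically with the number of SNPs, which is why observed proportions that decline more slowly point to linkage disequilibrium among the SNPs in a region.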

Figure 2

Proportion of SNP homozygotes in the 3397 individuals. The observed proportion of SNP homozygotes (+) for each analyzed region was plotted (y axis in log-scale) according to the number of analyzed SNPs (x axis). Green dots indicate the expected proportions of SNP homozygotes in linkage equilibrium, which were calculated with the average minor allele frequency (0.27).

If there are no unobserved haplotypes, or if unobserved haplotypes are negligible in terms of frequency, the sum of the estimated frequencies of the haplotypes would be 1.0 or very close to 1.0. However, as the number of analyzed SNPs in the region increases, the number of actual haplotypes increases and some haplotypes in the sample may not be detected as SNP homozygotes. Therefore, we examined how much the sum of the estimated frequencies of the haplotypes deviated from 1.0. The sum of the haplotype frequencies was nearly 1.0 for the regions in which the number of analyzed SNPs was less than 20 (Table 3). However, as the number of analyzed SNPs increased above 20, the average sum of the haplotype frequencies became much smaller than 1.0 (for example, the average was 0.742 for the regions with 30–39 SNPs, Table 3). This suggests that the sum of frequencies of unobserved haplotypes becomes substantial when the number of SNPs is larger than 30 (150 kb in this analysis). The total frequency of the observed haplotypes was positively correlated with the level of linkage disequilibrium for the genomic regions that had 10–19 analyzed SNPs (Figure 3). This may be because genomic regions in higher linkage disequilibrium have fewer haplotypes, each of which is less likely to be missed when detecting homozygotes.

Table 3 Total frequency of observed haplotypes from SNP homozygotes
Figure 3

Relationship between the total frequency of the observed haplotypes and linkage disequilibrium. For the genomic regions with 10–19 analyzed SNPs, the total frequency of the observed haplotypes in SNP homozygotes for each region was calculated and plotted against the level of linkage disequilibrium (x axis) measured by ɛ (Nothnagel et al.35). The horizontal red line marks a total frequency of 1.0. (a) Genomic regions with 10–14 analyzed SNPs. The correlation coefficient was 0.327 (P<2.2 × 10^−16). (b) Genomic regions with 15–19 analyzed SNPs. The correlation coefficient was 0.497 (P<2.2 × 10^−16).

Discussion

Although our approach to determining haplotypes based on SNP homozygotes uses only homozygous individuals in a sample, the haplotype frequencies estimated from the proportion of homozygotes were quite similar to those obtained by conventional statistical approaches (PHASE and SNPHAP).9 As mentioned above, some low-frequency haplotypes identified by statistical approaches may not be real. Two advantages of our approach, which is based on genotype data from thousands of individuals, are that it can detect real haplotypes without ambiguity and can estimate their frequencies.

The estimated haplotype frequencies were also similar to the frequencies of the ‘definitive haplotypes’ in 74 CHMs,27, 34 although the sample used in the previous studies was different from ours. Although the haplotypes determined from the genotype data of the 74 CHMs in D-haploDB are also real, the chromosomes in the CHMs might have accumulated mutations during their abnormal development. Our study identified 2808 additional haplotypes for the 1955 genomic regions that were not detected in the 74 CHMs. This was expected because their sample is different from ours and also much smaller than ours. Knowing haplotypes without ambiguity should improve haplotype estimation by statistical approaches.36 Therefore, the haplotypes determined by our approach appear to be useful for haplotype inference with new data.

Our method, when applied to genotype data from about 3000 individuals, can detect haplotypes with frequencies as low as 0.03. Because the size of the data for genome-wide SNPs is increasing as more people are genotyped, our approach would be useful for identifying many real haplotypes including low-frequency haplotypes in a population. For example, in a sample of 5000 individuals, the probability of finding a haplotype whose frequency is 0.03 in a homozygous individual would be 98.9%.
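The detection probability quoted above can be reproduced with a short calculation. The sketch assumes Hardy–Weinberg equilibrium and independent individuals, so that each individual is homozygous for a haplotype of frequency f with probability f², and the haplotype is detected if at least one of the N individuals is such a homozygote. The function name is hypothetical.

```python
def detection_probability(freq, n_individuals):
    """Probability that at least one of `n_individuals` is homozygous for a
    haplotype of frequency `freq`, assuming Hardy-Weinberg equilibrium and
    independently sampled individuals."""
    p_homozygous = freq ** 2
    return 1 - (1 - p_homozygous) ** n_individuals


# Haplotype frequency 0.03 in a sample of 5000 individuals.
p_5000 = detection_probability(0.03, 5000)
print(round(p_5000, 3))  # 0.989
```

The same function can be evaluated for other sample sizes or frequencies to judge how many individuals would be needed before a haplotype of a given frequency is likely to appear in homozygous form.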

To use our approach, the number and type of SNPs to be analyzed in a genomic region should be chosen with care because the efficiency of this approach depends on the nature of the SNP genotype data. As the number of analyzed SNPs increases, our method can obtain haplotypes on a finer scale. However, the proportion of homozygotes in a sample decreases with increasing number of analyzed SNPs, even when the number of existing haplotypes in the population is large. Therefore, our approach may fail to detect a large number of low-frequency haplotypes. When we applied this approach to a large number of genomic regions using genotype data from the Japanese population, the sum of frequencies of unobserved haplotypes appeared to be negligible when the analyzed region was less than 100 kb. It is also important to discard SNPs whose minor allele frequencies are low (for example, <0.05) before conducting an analysis in order to avoid generating many low-frequency haplotypes that do not need to be distinguished in association studies. The criteria for designing an analysis may be modified depending on the level of diversity and history of the target population. The criteria can also be modified according to the different levels of linkage disequilibrium among local genomic regions, even in an analysis focusing on one population.

Another reason for limiting the length of the target region is that longer target regions are more likely to have a recent recombination that created new haplotypes. These new and younger haplotypes may exist only in limited local areas in the population, or their distributions among local geographic regions may not be uniform. In such a situation, combinations of haplotypes in individuals in the entire population may depart from Hardy–Weinberg equilibrium. On the other hand, the genotype frequencies at each SNP could be in Hardy–Weinberg equilibrium when there is little difference in SNP allele frequency among local geographic regions.

Although additional haplotypes can be identified by including individuals with only one heterozygous SNP, we did not use additional haplotypes for the following reasons. First, the contribution of these additional haplotypes would be very low in terms of frequency when thousands of individuals are analyzed. In fact, we detected additional haplotypes by allowing only one heterozygous SNP, and examined total frequencies of these haplotypes using the result obtained by the SNPHAP program. We found that the total frequencies of these haplotypes were very low (0.014 on average, shown in Supplementary Figure 5). Thus, a list of haplotypes detected in homozygotes would be enough to create a useful haplotype catalog for a GWAS, if a sufficient number of individuals were included in the analysis. Second, additional haplotypes that are detected on only one chromosome should be treated with caution, because some of them may be the result of a genotyping error in a homozygous individual and not really exist. In contrast, determining haplotypes by using only homozygotes would be robust against such a problem. Third, including such a limited proportion of heterozygotes would complicate estimation of haplotype frequencies.

Our approach can determine haplotypes without ambiguity and can estimate haplotype frequencies, and the results can serve as a catalog of haplotypes for the population. Another advantage of determining haplotypes from the proportions of homozygotes is that it does not take much computation time even if there is a large number of SNPs in the genomic region of interest. The catalog of real haplotypes with their estimated frequencies will be useful for identifying causative polymorphisms for a trait, which are linked to the most associated SNPs in a GWAS. Furthermore, the catalog of haplotypes will be useful for haplotype-based GWAS37, 38 and detection of shared haplotypes that contain multiple variants39, 40 that affect the trait.