A powerful score test to detect positive selection in genome-wide scans

Zhong, Ming; Lange, Kenneth; Papp, Jeanette C; Fan, Ruzong

doi:10.1038/ejhg.2010.60

Download PDF

Article
Published: 12 May 2010

A powerful score test to detect positive selection in genome-wide scans

Ming Zhong¹,
Kenneth Lange²,
Jeanette C Papp² &
…
Ruzong Fan^1,3

European Journal of Human Genetics volume 18, pages 1148–1159 (2010)Cite this article

1078 Accesses
10 Citations
Metrics details

Subjects

Abstract

One of the surest signatures of recent positive selection is a local elevation of advantageous allele frequency and linkage disequilibrium (LD). We proposed to detect such hitchhiking effects by using extended stretches of homozygosity as a surrogate indicator of recent positive selection. An extended haplotype-based homozygosity score test (EHHST) was developed to detect excess homozygosity. The EHHST conditioned on existing LD and it tested the haplotype version of the Hardy–Weinberg equilibrium. Compared with existing popular tests, which usually lack clear distribution, the EHHST is asymptotically normal, which makes analysis and applications easier. In particular, the EHHST facilitates the computation of an asymptotic P-value instead of an empirical P-value, using simulations. We evaluated by simulation that the EHHST led to appropriate false-positive rates, and it had higher or similar power as the existing popular methods. The method was applied to HapMap Phase II data. We were able to replicate previous findings of strong positive selection in 17 autosome genomic regions out of 20 reported candidates. On the basis of high EHHST values and population differentiations, we identified 15 new candidate regions that could undergo recent selection.

A test for deviations from expected genotype frequencies on the X chromosome for sex-biased admixed populations

Article 17 May 2019

Daniel Backenroth & Shai Carmi

Biobank-scale inference of ancestral recombination graphs enables genealogical analysis of complex traits

Article Open access 01 May 2023

Brian C. Zhang, Arjun Biddanda, … Pier Francesco Palamara

Extremely sparse models of linkage disequilibrium in ancestrally diverse association studies

Article 28 August 2023

Pouria Salehi Nowbandegani, Anthony Wilder Wohns, … Luke J. O’Connor

INTRODUCTION

In recent years tremendous interest has been generated in association study or linkage disequilibrium (LD) mapping of complex diseases. The success of LD mapping depends heavily on the levels of LD between markers and genetic traits.¹ Positive selection can lead to an increase in the frequency of an advantageous allele and to high levels of LD in the vicinity of the trait gene.^{2, 3, 4, 5, 6} Many forces determine disequilibrium on a genomic scale. Migration, nonrandom mating, variation in mutation rates, nonuniform recombination rates, and genetic drift, all immediately come to mind. Positive selection is one major force that increases LD locally rather than globally across the genome. A favorable mutation increases the frequency of the chromosome segment on which it occurs, and until that segment shrinks by recombination, neutral alleles at nearby loci will hitchhike to success. If the favorable mutation reaches fixation, then a selective sweep is declared.^{7, 8, 9} Therefore, selection may shed light on complex diseases and human evolution. There is considerable interest in developing statistical methods to detect recent positive selection.^{10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}

Statistical geneticists studying positive selection have focused their efforts on two strategies: targeted examination and testing of candidate genes,^{21, 22} and data mining of genome association scans.^{23, 24, 25} Methods to detect recent positive selection are either genotype-based or haplotype-based. The latter methods merit a brief summary.^{21, 23, 26, 27} Hanchard et al²⁶ suggested screening for positive selection by passing a sliding window across a genomic region. At each position of the window, a test of haplotype similarity was evaluated. The extended haplotype homozygosity test proposed by Sabeti et al²¹ also assessed the age of a core haplotype. In view of the computational burdens encountered in genome-wide scans, there has recently been a swing back to genotype-based methods. Tang et al²⁸ proposed a homozygote counting method for genome scan data. This approach captured the decay of genotype homozygosity around a central single-nucleotide polymorphism (SNP). In contrast, the Sabeti et al²¹ statistic was designed to highlight the decay of extended haplotype homozygosity among extended haplotypes of a core haplotype. The counting method of Tang et al²⁸ encouraged comparison of the homozygosity profiles of different populations. Although it is computationally fast, the counting method suffers from information loss, particularly with phase-known data.

Positive selection might cause LD and extended stretches of homozygosity.²⁹ These hitchhiking effects are most pronounced in genomic regions of low recombination. Although extended stretches of homozygosity also occur in inbreeding, the stretches occur randomly rather than systematically across the genome. In fact, none of the other forces that disrupt genetic equilibrium function in the targeted way of positive selection. For this reason, geneticists have been anxious to study the homozygous tracts of the human genome. Gibson et al³⁰ examined the length, number, and distribution of homozygous tracts of SNPs among the HapMap reference populations without much theoretical analysis. In this article, we developed homozygosity score statistics to detect positive selection. These statistics were similar to the Tang et al²⁸ statistic in that they rely on the length of homozygosity around each core SNP. We went beyond the analysis of Tang et al²⁸ and calculated the mean and variance of each statistic under the appropriate null hypothesis. This facilitated computation of P-values by a normal approximation.

Our three tests included (a) an extended genotype-based homozygosity score test (EGHST), (b) a hidden Markov model score test (HMMST), and (c) an extended haplotype-based homozygosity score test (EHHST). The null hypothesis of EGHST unrealistically postulated both Hardy–Weinberg equilibrium (HWE) and linkage equilibrium. The EHHST explicitly took into account multilocus LD. The HMMST occupied the intermediate ground of allowing for pairwise LD. In short, the EHHST was the most conservative test. We derived the tests and investigated their type I errors by simulation. We then focused on EHHST as it is the most robust. Under several demographic population models, we evaluated, by simulation, the fact that EHHST leads to appropriate false-positive rates. We investigated the power of EHHST by comparing it with popular methods.^{3, 11, 12, 17, 21, 26} It was found that EHHST has a higher or similar power as the existing popular methods. We also applied the tests to the previously studied HapMap Phase II data. Our results were consistent with previous findings across the genome and within specific candidate regions. We identified new candidate regions that were not reported before and were close to those reported previously.

Materials and methods

Consider a random sample of n unrelated individuals typed on a large number of SNPs. Assume that the core SNP is SNP 0, which is the central SNP. Hence, the SNPs around the core SNP 0 can be denoted as k=…, −2, −1, 0, 1, 2, …. Here, one may need to truncate if the core SNP 0 is on the boundary or is close to the boundary. Let M be the indicator of whether SNP 0 is homozygous, let L be the number of consecutive homozygous SNPs flanking SNP 0 on the left, and let R be the number of consecutive homozygous SNPs flanking SNP 0 on the right. If the core SNP 0 is heterozygous (M=0), then we define L=R=0. The extent of homozygosity is measured by the total, T=L+M+R. The quantities L, M, R, and T are random variables that vary from person to person. If we can find the mean μ and variance σ² of T, then we can conduct a test for excess homozygosity. More precisely, let T_j be the value of T for person j in the random sample. On the basis of the central limit theorem, the score statistic

should be approximately standard normal. In this article, we consider three tests: EGHST, HMMST, and EHHST. As we are concerned with excess homozygosity, a one-sided test is appropriate.

Before we calculate the mean and variance of each test statistic, let us consider their corresponding null hypotheses. In each instance, the null hypothesis includes random mating and hence global HWE. Thus, the frequency of phased haplotypes H₁/H₂ is 2h₁h₂ when H₁≠H₂, and h₁² when H₁=H₂. Here, h₁ and h₂ are frequencies of haplotypes H₁ and H₂, respectively. Only the null hypothesis of EGHST invokes the further assumption of linkage equilibrium; here, h₁ and h₂ equal the product of the underlying allele frequencies. Under the null hypothesis of HMMST, SNPs exhibit pairwise but not higher-order LD. For EHHST, arbitrary LD is allowed. In summary, the null hypotheses of the three tests are the following:

Null hypothesis of EGHST: HWE and linkage equilibrium;
Null hypothesis of HMMST: HWE and pairwise LD, but no higher-order disequilibrium interactions;
Null hypothesis of EHHST: HWE and arbitrary multilocus LD.

In human genome, LD tends to extend the stretch of homozygosity surrounding a central marker, given high-density SNPs such as the HapMap Phase II data. The mean μ calculated for the EGHST is consequently too small, and the EGHST is anticonservative. In other words, there are too many false positives favoring selection. As the other extreme, EHHST tends to condition on existing haplotype diversity and is very conservative. The HMMST stands between these extremes and conditions on pairwise LD. Given the ubiquity of pairwise disequilibrium, this seems to be a reasonable compromise. Regardless of the test, one can decompose the theoretical mean of T as μ=E(L)+E(M)+E(R). Because M is an indicator random variable, E(M)=Pr(M=1) and Var(M)=Pr(M)[1−Pr(M)]. If we let X_k to be the unordered genotype of SNP k, then it is natural to calculate E(L) as E[E(L∣X₀)]. As X₀ takes only three possible values, the outer expectation in E[E(L∣X₀)] is trivial to compute. The case X₀=1/2 is easiest of all, because L=0 when X₀=1/2 and M=0. Similar comments apply to E(R). The most natural route to calculate variance σ² follows the formula

Again, it is productive to condition on X₀. For instance,

and, assuming L and R are independent given X₀,

It is also worth pointing out that E(LM)=E(L) and E(RM)=E(R), because L and R equal 0 when M does, and when M=1, LM equals L and RM equals R. Thus, one has

These considerations emphasize the importance of finding the distributions of L and R conditional on X₀=1/1 and X₀=2/2. The next few sections tackle this issue.

The distribution of L and R under the null hypothesis of EGHST

Under the dual assumptions of HWE and linkage equilibrium, the conditional distributions of the random variables L and R depend only on M and not on the particular value of X₀. Let p_k1 and p_k2 be the frequencies of the two alleles at SNP k. In this notation, one can readily deduce that

where the products are empty when r=0 or l=0. In practice, one can either estimate the allele frequencies p_k1 and p_k2 from the sample or substitute known values for them. To compute the conditional means and variances of L and R numerically, we recommend the right-tail sums

valid for any nonnegative random variable Y with integer values. The sums defining E(Y) and E(Y²) can be truncated as soon as they stabilize.

The distribution of L and R under the null hypothesis of HMMST

To find the conditional distributions of L and R under this scenario, we run a Markov chain, the states of which are the three unordered SNP genotypes 1/1, 1/2, and 2/2 and the epochs of which are SNPs. If we again hypothesize that SNP 0 is the central SNP, then the genotype sequence …, X₋₁, X₀, X₁, … constitutes the chain. Every SNP emits a signal, either a 1 for a homozygote or a 0 for a heterozygote. Assuming pairwise LD but no higher-order linkage interactions, the two sections of the chain to the left and right of the central SNP are independent, conditional on the state X₀ at that SNP. The only nontrivial states X₀ that come into effect at SNP 0 are 1/1 and 2/2, and these occur with the Hardy–Weinberg probabilities p₀₁² and p₀₂².

To compute the conditional mean and variance of R, it suffices to compute the probabilities Pr(R≥r∣X₀). This can be achieved by running Baum's forward algorithm for an infinite sequence of emitted 1s. One pass of the algorithm is adequate. When SNP r is visited, Pr(R≥r∣X₀) becomes available. This description omits the mention of transition probabilities. Along either haplotype, the transition from allele j at SNP r to allele k at SNP r+1 is governed by the known conditional probabilities that explicitly account for pairwise LD. These conditional probabilities can be readily estimated from sample data. We traverse the left and right sections in opposite direction. Hence, their transition probabilities must take this into account.

Under the assumption of no genotyping error, the complexities of the hidden Markov chain can be replaced by simple recurrence relations. Let p_{r, j → k} be the LD probability that allele j at locus r is followed by allele k at locus r+1 on a chromosome segment containing both loci. If we also let

then we can conclude that Pr(R≥r∣X₀=1/1)=a_r1+a_r2. Thus, computing the conditional mean and variance of R reduces the problem to computation of a_r1 and a_r2. By convention, we consider a₀₁=1 and a₀₂=0. These choices lead to the recurrences

Computation of the vector a_r=(a_r1, a_r2) should continue until

for ɛ>0 suitably small. To compute Pr(R≥r∣X₀=2/2), we similarly define

The b_rj satisfies exactly the same recurrences as a_rj, but differ in the initial conditions b₀₁=0 and b₀₂=1.

As just stated, the distribution of L is independent of R, given X₀. Let c_lj be the probability that X_−l=j/j and L≥l, given X₀=1/1. The conventions c₀₁=1 and c₀₂=0 are consistent with the formula Pr(L≥l∣X₀=1/1)=c_l1+c_l2. Furthermore, we have recurrences

If we let d_lj be the probability that L≥l and X_−l=j/j, given X₀=2/2, then the same recurrences as c_lj hold for d_lj, but the initial conditions are given by d₀₁=0 and d₀₂=1.

The distribution of L and R under the null hypothesis of EHHST

In the presence of arbitrary LD, the fast recurrences (3) and (4) for a_rj and b_rj no longer apply. However, if we define $h_{i_{0}, \dots, i_{r}}$ to be the population frequency of the haplotype (i₀, …, i_r) extending from SNP 0 to SNP r, then the formula

delivers the required right-tail probabilities. When all conceivable haplotypes are possible, there are 2^r terms in the multiple sum, and the formula as it stands is cumbersome. On the other hand, if only a few haplotypes are possible, then the sum is straightforward to evaluate. Moment formulas (2) are still applicable. The haplotype frequencies $h_{i_{0}, \dots, i_{r}}$ can be estimated from genotype data by the EM algorithm.^{31, 32}

RESULTS

Type I error rates

By construction, EHHST was the most conservative among the three tests of EGHST, HMMST, and EHHST. For further confirmation, we performed false-positive (type I) error comparison by simulating genotype data under the null of EGHST. The results were reported in Supplementary I, in which we showed that the type I error rates of EHHST were the smallest. Thus, EHHST was the most robust among the three. Hereafter, we focused our attention on evaluating the performance of EHHST.

We first used SelSim to simulate data under the neutral model.³³ In a genomic region, a few fixed numbers 51, 61, 71, 81, 91, and 101 of SNPs were simulated to evaluate type I error rates. In addition, uniform recombination rates of ρ=1.5, 3, 6, and 9 between SNPs were assumed. To calculate an empirical type I error rate, 5000 random samples of n=60 or 100 individuals were generated. For each sample, an empirical EHHST value for the central SNP was calculated. The type I error rates at two nominal levels α=0.05 and 0.01 were reported in Table 1, which were the proportion of the EHHST values that exceeded the 95th and 99th percentiles of the standard normal. When the number of SNPs is 51, the type I error was much bigger than the nominal levels for any of the four recombination rates. Interestingly, type I error rates deceased when the number of SNPs increased to 61 and then to 71, and the trend continued until the number of SNPs increased to 71. Once the number of SNPs reached 71, the type I error rates stabilized. Hence, truncation at the boundary SNPs caused a problem of high false positives. Fortunately, almost all contemporary genomic data comprised a large number of SNPs. On the basis of the results of Table 1, the type I error rates were lower than or around the nominal level, except for the recombination rate ρ=1.5 when the number of SNPs was larger or equal to 71. When ρ=1.5, type I error rates were generally higher than the nominal levels but not very high. Therefore, EHHST had appropriate type I error rates when it was used to calculate the test score of SNPs that are reasonably far away from the boundary (≥35).

Table 1 Type I error rates of the extended haplotype-based homozygosity score test (EHHST)

Full size table

To investigate the impact of demographic population history on EHHST, we performed coalescent simulations using ms.³⁴ We evaluated the type I error rates of EHHST under a few plausible population genetic demographic models. Specifically, we considered four demographic models that are similar to those considered in Hanchard et al.²⁶

1
Population structure: Two equal-sized sub-populations were simulated, which exchanged migrants with a probability 0.1;
2
Population expansion: A rapid population growth was simulated with a current population size of 10 000, and the population had a constant population size until 500 generations ago when it expanded exponentially by a factor of 100 to reach the current day population size of 10 000;
3
Population bottleneck 150/300: A panmictic population was simulated, which had a constant size of 10 000 until T₁=300 generations ago, when it underwent an instantaneous size reduction to 5000, followed by a period of 150 generations of constant size, followed by a rapid exponential population expansion in the last T₂=150 generations to reach a current day size of 20 000;
4
Population bottleneck 250/500: A population similar to the above Population bottleneck 150/300: Except that T₁=500 and T₂=250.

Again, a genomic region of 101 SNPs was simulated with four recombination fractions ρ=1.5, 3, 6, and 9. In addition, 5000 samples of n=60 or 100 were generated to calculate the empirical type I error rates one by one. The results were reported in Table 2. When the recombination fractions are 3, 6, or 9, type I error rates were lower or around the nominal levels. Similar to the results of Table 1, the type I error rates were generally higher than the nominal levels when ρ=1.5. For the four models, the type I error rates of our EHHST were lower than those of Hanchard's HS reported in Hanchard et al,²⁶ p155, second paragraph of the left column. The EHHST was reasonably robust for the four simple demographic models.

Table 2 Type I error rates of the extended haplotype-based homozygosity score test (EHHST)

Full size table

Power of EHHST

To perform power comparisons with the existing methods, we focused our attention on the results reported in Figure 1 of Hanchard et al.²⁶ The figure contained a comparison of Hanchard's HS, Sabeti's EHH, Tajima's D-test, Fu and Li's D-test, Fay and Wu's H-test, and Hudson's haplotype-partition method.^{3, 11, 12, 17, 21, 26} By a comprehensive and careful comparison, Hanchard et al²⁶ concluded that Hanchard's HS and Sabeti's EHH are the two best tests. Thus, we mainly compared the performance of our EHHST with Hanchard's HS and Sabeti's EHH.

Similar to the study by Hanchard et al,²⁶ we performed coalescent simulations for power comparison by SelSim.³³ First, all parameters were taken as exactly those of Figure 1 of Hanchard et al,²⁶ with one exception: we simulated a genomic region comprising 101 SNPs instead of 50, to avoid a potential problem caused by truncation at the boundary (refer to type I error rates). For readers’ convenience, let us briefly describe the models and parameters as follows. In Hanchard et al,²⁶ three different uniform recombination rates, ρ=4N₀r=1.5, 3, and 6, between SNPs were used in the simulation, and three different allele frequencies (0.1, 0.2, and 0.4) were used for the minor allele of the central SNP. Here, N₀ is the diploid population size and r is the probability of crossover per generation between SNPs. In our simulations, we used four recombination rates ρ=1.5, 3, 6, and 9, and six present day population frequencies of the derived allele for the central SNP (0.1, 0.2, 0.4, 0.6, 0.8, and 0.9). As in Hanchard et al,²⁶ a partial selective sweep was assumed for the central SNP by using a selection coefficient s=500.

To calculate an empirical power level, we simulated 5000 samples of 200 chromosomes or n=100 individuals. For each sample, we calculated an empirical EHHST value for the central SNP. Thereafter, empirical power was calculated as the proportion of the 5000 EHHST values that exceeded the 95th and 99th percentiles of the standard normal. The results are reported in Table 3. At the nominal level α=0.05, the empirical power of EHHST was higher than 0.9410, irrespective of the four recombination rates and the five present day population frequencies 0.1, 0.2, 0.4, 0.6, and 0.8 of the derived allele of the central SNP. Most of the EHHST empirical power levels were around 0.98 at the nominal level α=0.05. For the present day population frequency 0.9 of the derived allele of the central SNP, the empirical power of the EHHST was higher than 0.76. To compare the performance of our EHHST with Sabeti's EHH and Hanchard's HS, we showed the results of the power comparison in Figure 1. The two plots on the top of Figure 1, ie, EHH and HS plots, were taken from Hanchard et al,²⁶ Figure 1. The results of Figure 1 and Table 3 clearly show that EHHST performed just as well as or even better than Hanchard's HS and Sabeti's EHH.²⁶ One may want to notice that the power levels of Hanchard's HS and Sabeti's EHH reported in Figure 1 of Hanchard et al²⁶ can be lower than or around 0.80 when allele frequency was 0.1 at the nominal level α=0.05, although the rest of the power levels were larger than 0.90. The empirical power of EHHST, on the other hand, was high at the nominal level α=0.05, with a minimum 0.9718 when ρ=6, and population frequency of the derived allele was equal to 0.1 for the five present day population frequencies 0.1, 0.2, 0.4, 0.6, and 0.8 of the derived allele of the central SNP, and ρ=1.5, 3, and 6. The empirical power of EHHST was high at the nominal level α=0.01, with a minimum 0.8898 for the five derived allele frequencies 0.1, 0.2, 0.4, 0.6, and 0.8 of the central SNP, and ρ≤6.

Table 3 Power of the extended haplotype-based homozygosity score test (EHHST)

Full size table

In addition, we calculated empirical power by simulating 5000 samples of 120 chromosomes or n=60 individuals. The HapMap data contained samples of size 60, and our results provided some insight into the samples. The results are reported in Table 3. The EHHST provided reasonably high power in this case for the five present day population frequencies 0.1, 0.2, 0.4, 0.6, and 0.8 of the derived allele of the central SNP. For the present day population frequency 0.9 of the derived allele, the empirical power of EHHST could be low.

To perform power comparison for the three tests proposed, we simulated data under the same conditions as those of Table 3 by SelSim to calculate the empirical powers of EGHST and HMMST. The results were reported in Tables 4 and 5 in Supplementary I. Again, 5000 samples of 200 chromosomes (ie, n=100 individuals) or 120 chromosomes (ie, n=60 individuals) were simulated. As expected, the power of EGHST was higher than that of HMMST, which was generally more powerful than EHHST. Hence, EHHST was the most conservative.

HapMap Phase II data

We applied the proposed score test statistics to the whole-genome SNP data of HapMap Phase II.³⁵ Data sets include 3.1 million SNP genotypes from population samples of three continents: 60 CEPH Utah residents with ancestry from Northern and Western Europe (CEU); 60 Yorubas from Ibadan (YRI), Nigeria in Africa; and 45 Han Chinese from Beijing (CHB) and 45 Japanese from Tokyo (JPT), Japan, of Asia. The samples were downloaded from http://www.hapmap.org/downloads/phasing/2007-08_rel22/phased/. The two Asian samples were combined into one, referred to hereafter as CHB+JPT, as instructed by the HapMap Consortium. We used only the unrelated individuals from the three samples, omitting the children in the trio families from the CEU and YRI samples.

Results in the candidate regions

To evaluate the performance of our proposed test statistics, we applied them to the HapMap Phase II data in the 20 autosome candidate regions that show strong signals in Table 1 of Sabeti et al.³⁶ Note that our tests were designed for autosome data (and we actually could not download sex-linked X and Y chromosome data in the above-mentioned HapMap website). Before we discuss our results in detail, let us give a rough summary: In 17 out of 20 candidates, EHHST values showed peaks for the selected population samples in column 2, Table 1 of Sabeti et al;³⁶ hence, there were extended stretches of homozygosity in these 17 regions, as a result of which positive selection could lead to excess homozygosity in the human genome. The three exceptions can be found in Figure 3 of Supplementary II: (a) a region around 78.3 Mb on chromosome 12, (b) the BCAS3 gene region on chromosome 17, (c) the gene region of CHST5, ADAT1, and KARS on chromosome 16. In the following paragraphs, we limited our discussion to the candidate regions on chromosomes 2 and 15. In particular, we considered the regions containing lactose tolerance gene LCT on chromosome 2 and the pigmentation gene SLC24A5 on chromosome 15. For the sake of brevity, we treated the remaining candidate regions in Supplementary II.

Figure 2a explains an interesting fact about the SLC24A5 gene on chromosome 15. The gene occurred between the two vertical dotted-dashed lines from 46.20 to 46.22 Mb in Figure 2a. The highest peak of EHHST occurred around 46.4 Mb, which was reported in Table 1 of Sabeti et al;³⁶ the EHHST values of CHB+JPT and YRI samples were very low and the test scores of the YRI sample were uniformly the lowest. Our results were consistent with those of Sabeti et al³⁶ and Lamason et al,³⁷ who argued for positive selection on the basis of a striking reduction in heterozygosity in the CEU sample. In a 200 kb region around gene HERC1 on chromosome 15, the CHB+JPT sample showed signs of positive selection (Table 1, Sabeti et al³⁶). The EHHST values were plotted in Figure 2b. Again, the gene was located between the vertical dotted-dashed lines, from 61.69 to 61.91 Mb on the Figure 2b. The EHHST values of CHB+JPT were clearly highest within most parts of the HERC1 gene. Hence, the CHB+JPT sample showed long extended haplotype homozygosity in the gene region.

The LCT gene was sited between 136.26 and 136.32 Mb on chromosome 2, and LD extended about 3.2 Mb around it in the CEU sample.^{38, 39, 40} Two other genes were located in the same region, RAB3GAP1 between 135.53 and 135.64 Mb and R3HDM1 between 136.01 and 136.20 Mb. Our EHHST values plotted in Figure 2c were noticeably higher in the CEU sample than in YRI and CHB+JPT samples, confirming the previous results. Most striking of all was that the EHHST statistic spiked directly over gene R3HDM1 right next to gene LCT. Although this did not prove positive selection, the fact that a mutation deregulating the LCT gene occurred on the conserved haplotype strongly favors this interpretation. Because of the high-density SNPs of HapMap data, high-degree LD may not necessarily be the selection signal. Long extended haplotype homozygosity, however, could lead to high EHHST values and interesting signals for further investigations.

Two other regions on chromosome 2, a 1.0 Mb region around gene EDAR and an 800 kb region around 72.6 Mb, showed strong evidence of selection in the CHB+JPT sample (Table 1, Sabeti et al³⁶). The EHHST values plotted in Figure 2d–e confirmed the previous findings. The sharp EHHST peak for the CHB+JPT sample located very close to the EDAR region between 108.88 and 108.97 Mb (Figure 2d). In short, the EHHST statistic provided evidence of selection signal of the CHB+JPT sample in the region of the EDAR gene. In comparison, our EHHST values plotted in Figure 2e confirmed that the CHB+JPT sample has a strong selection signal in the 800 kb region around 72.6 Mb. Interestingly, the EHHST values reached the highest in the region (Figure 2e and Table 2 of Supplementary III). In a 1.2-Mb region around gene PDE11A, both CHB+JPT and CEU samples were reported to have a strong selection signal (Table 1, Sabeti et al³⁶). The EHHST peaks of CHB+JPT and CEU samples overlapped the PDE11A region in Figure 2f.

New candidate regions for further investigation based on the high EHHST values

Among the three proposed test statistics, EHHST was the most conservative. High EHHST values in a region indicated that there were long stretches of homozygosity. In the 20 candidate regions reported previously, we found that EHHST values show peaks in 17 of them. All these features encouraged us to use EHHST in search of new candidate regions for further investigations. Before selecting a candidate region, we first selected SNPs for natural selection as follows: (1) the selected SNP had a high EHHST value in the top one percentile, ie, the EHHST value of the SNP is in the top one percentile of all SNPs of a chromosome in which the SNP is located; (2) the selected SNP had an allele that is likely to be newly derived by using the data from http://www.hg-wen.uchicago.edu/selection/frontpage.html of the University of Chicago;²³ (3) the derived allele of the selected SNP had a high frequency that was larger than 0.5 in the tested population; (4) the derived allele of the selected SNP was likely to be highly differentiated among the three populations of CHB+JPT, CEU, and YRI, ie, the F_st score of the SNP was in the top one percentile of all F_st scores of SNPs on a chromosome.^{41, 42, 43} A candidate region was selected if there was a long list of SNPs that satisfied the four selection criteria.

On the basis of the four criteria described above, 21 candidate regions were found for natural selection (Supplementary III). In the 21 candidate regions, 3 were close to regions reported in Sabeti et al,³⁶ and 12 were not reported; we counted these 15 regions as new candidates. The remaining six regions were within regions reported in Sabeti et al.³⁶ A brief description of the 15 new candidates is presented in Table 4. For the three regions that were close to regions reported in Sabeti et al,³⁶ we plotted the EHHST values in Figure 3. The region containing the least number of SNPs satisfying the criteria (seven SNPs) was located at chromosome 10:23.9 Mb. It was reported because it was close to one candidate chr10:22.7 Mb of Sabeti et al.³⁶ Figure 3b shows the EHHST values of the three samples. It is clear that CEU sample has the highest EHHST values, which is consistent with the result in Table 4.

Table 4 New candidate regions for natural selection identified by the four criteria described in the main text and Supplementary II

Full size table

Other regions of Table 4 contain 9–69 SNPs that satisfy the four criteria. It was interesting that the region containing the most number of 69 SNPs was located on chromosome 4, from 41 521 093 to 41 849 931 bp, which overlapped with the candidate region chr4:42 Mb reported in Sabeti et al³⁶ for natural selection in the CHB+JPT sample. Figure 3a shows that the EHHST values of the CHB+JPT sample were much higher than those of CEU and YRI samples in the gene region of SLC30A9, TMEM33, BEND4, and WDR21B, and the result is actually the same as that in Figure 1a of Supplementary II in the overlapped region. A region on chromosome 17 from 55 588 298 to 55 698 601 bp, which was close to the candidate region chr17:56.4 Mb in Sabeti et al,³⁶ was identified by our four criteria for natural selection in the CEU sample. Figure 3c shows that the EHHST values of the CEU sample were much higher than those of the other two samples. However, we failed to confirm the result of Sabeti et al³⁶ in the chr17:56.4 Mb region of size 0.4 Mb by our three tests (Figure 3b of Supplementary II). One possible reason for this discrepancy is that the distance/position of SNPs used in Sabeti et al³⁶ was different from ours; if this is the case, then the number of new candidate regions that were not reported before is 12. In any case, it was encouraging to observe strong selection signals in 3+6=9 neighboring regions out of 21 by our methods, which were reported (or close to those reported) in Sabeti et al.³⁶

It was interesting to study the 12 new regions in Table 4 for further dissection, which were not reported before and which were not close to the ones reported. In Figures 4 and 5, the EHHST values were plotted for a comparison in the 12 new regions. Figures 4a–e and 5e, f show that EHHST values of the CHB+JPT sample were either high or spiked over the regions on chromosomes 1, 2, 3, 5, 12, and 13 reported in Table 4. In particular, the EHHST values of the CHB+JPT sample spiked directly over the region of LHX8 and SLC44A5 genes in Figure 4a, and over the region of the TMEM117 gene in Figure 5e. For the regions in Table 4 with natural selection signals in the CEU sample, the EHHST values of the CEU sample spiked over some parts of the regions on chromosomes 7 and 8 in Figure 4f and b, whereas the scores of CHB+JPT and YRI samples were very low. In addition, the EHHST values of the CEU sample spiked over the region of the PCMTD1 gene in Figure 5b. The EHHST values of the CHB+JPT sample are generally the highest in Figure 5c, and the EHHST values of the CEU sample are generally the highest in Figure 5d, except in short parts of the regions. In short, the results of Figure 5c and d are consistent with those of chromosome 11 in Table 4. In Figure 5a, the EHHST values of both the CHB+JPT and CEU samples are high, whereas those of YRI are very low in the region between 50 635 491 and 50 943 725 bp, although CHB+JPT was not selected in Table 4 for the region.

Genome-wide scans of HapMap Phase II data

Given the encouraging results with the candidate regions, we performed a genome-wide scan of the HapMap Phase II data. The scan generated several results. In Supplementary IV, we present the results of chromosome 2 data. The features of the results of the remaining chromosomes were similar. The results of chromosome 2 are reported as three figures for the three tests – EHHST, HMMST, and EGHST. Each figure plots the test scores of one test for the CEU sample versus the YRI sample, one for the CHB+JPT sample versus the YRI sample, and one for the CEU sample versus the CHB+JPT sample. For the EGHST and HMMST statistics, scores were the highest for the CHB+JPT sample and lowest for the YRI sample, with the CEU sample having intermediate scores (Figures 2 and 3 of Supplementary IV). This result was consistent with the finding by Gibson et al³⁰ that the YRI sample had the fewest long tracts of homozygosity. It was also consistent with current thinking about the demographic history of the three populations.

As a reflection of LD, EGHST and HMMST values were high across the genome (Figures 2 and 3 of Supplementary IV). Because the HMMST values adjust for pairwise LD, they were roughly half as high as EGHST values. By contrast, the EHHST values were generally low, with sharp spikes in just a few regions (Figure 1 of Supplementary IV). Hence, HWE was valid for most part of the genome; the high EGHST and HMMST values were most likely due to LD among SNP markers. From the plot of the high EHHST values, one could easily spot these narrow chromosome regions where HWE broke down.

Software and computational performance

Our C++ code for the proposed statistics is freely available on request to Dr Fan. The EGHST and HMMST are very fast computationally, taking only minutes to analyze a typical chromosome of the HapMap Phase II data. In contrast, it requires hours per chromosome to compute the EHHST values. Hence, the EHHST seems to be the most suited for fine mapping in candidate gene regions.

Discussion

In this article, we proposed score test statistics for genome-wide screening of the extended homozygosity of the human genome. We considered three testing cases: EGHST, HMMST, and EHHST. Intuitively, EGHST might provide high values as long as either HWE or linkage equilibrium was invalid, HMMST could do so if either HWE was invalid or there existed higher order LD interaction than pairwise ones among SNPs, and EHHST might provide high scores only when the haplotype version of HWE was invalid in a chromosome region. Roughly speaking, EGHST and EHHST were two extremes: EGHST was the most aggressive one that might give many positive signals, given the high density of SNP data; hence, the presence of LD was a fact of ubiquity. EHHST was the most conservative one as it might give high scores in the presence of excess homozygosity. We started from a measure T of the extent of homozygosity, and then provided the distribution of T and its mean and variance under the null hypothesis of each test case. This facilitated the calculation of our test statistics.

By simulating data under the null hypothesis of the EGHST, we evaluated the robustness of the three tests through type I error calculations and confirmed that the EHHST was the most robust (Supplementary I). We then used coalescent programs SelSim and ms to simulate data under the neutral model. We showed that EHHST led to appropriate false-positive rates and it was robust in the presence of simple demographic population history. By comparing with the results reported in Hanchard et al,²⁶ we showed that the EHHST had higher or similar power as the existing popular methods. One might want to notice that the existing popular tests usually did not follow a distribution. The EHHST, however, is asymptotically normal, which makes analysis and applications easier. We applied the tests to Hapmap Phase II data for genome-wide screening, for comparison with previously reported candidate regions, and to search for new candidate regions on the basis of high EHHST values and population differentiations. It was encouraging that our EHHST values confirmed 17 regions of excess homozygosity out of 20 candidates reported by Sabeti et al.³⁶ The statistics also validated the relative demographic history of African, European, and East Asian populations. Our plots suggested multiple regions of excess homozygosity. Given our ignorance about the function of many genes, it would take a long time to sort through these hints.

In summary, the main contributions are the following: we showed that the EHHST could be used to detect regions of excess homozygosity, which could be candidates of recent selection for further investigations by additional requirement, such as the criteria used in Sabeti et al,³⁶ namely, selected alleles were newly arisen, were likely to be highly differentiated among populations, and had biological effects. The EHHST was conservative and robust. Compared with the existing popular methods, EHHST performed just as well or even better. Moreover, EHHST was straightforward and was asymptotically normal. In addition to EHHST, we showed that EGHST and HMMST were useful in genome-wide scans for a general picture of the strength of LD and violation of HWE by comparing test scores of different population samples. For candidate regions that had selection signals, the comparison of the three test scores might provide clues of either LD or violation of HWE or both, which lead to high test scores.

Because of the conservative nature of EHHST, one might miss some candidate regions in which HWE is roughly valid, but LD exists. Thus, high EHHST values were not a sufficient and necessary condition for detection of selection signal. Notwithstanding, EHHST could be a new tool in addition to existing methods of detecting selection. Population geneticists have proposed several tests for inferring a selective sweep. Jensen et al⁴⁴ summarized the most important tests, including ones based on increased LD.^{4, 45} We liked the current statistics because they exploited dense SNP genotyping and depended on minimal assumptions. Of course, the lack of a detailed model had its disadvantages. For example, our tests said nothing about the age of a favorable mutation. This issue was obviously intertwined with variations in recombination rates across the genome.

References

Slatkin M : Disequilibrium mapping of a quantitative-trait locus in an expanding population. Am J Hum Genet 1999; 64: 1765–1773.
Article CAS Google Scholar
Bamshad M, Wooding SP : Signatures of natural selection in the human genome. Nat Rev Genet 2003; 4: 99–111.
Article CAS PubMed Google Scholar
Hudson RR, Bailey K, Skarecky D, Kwiatowski J, Ayala FJ : Evidence for positive selection in the Superoxide Dismutase (Sod) region of Drosophila melanogaster. Genetics 1994; 136: 1329–1340.
CAS PubMed PubMed Central Google Scholar
Kim Y, Nielsen R : Linkage disequilibrium as a signature of selective sweeps. Genetics 2004; 167: 1513–1524.
Article PubMed PubMed Central Google Scholar
Ronald J, Akey JM : Genome-wide scans for loci under selection in humans. Hum Genomics 2005; 2: 113–125.
Article CAS PubMed PubMed Central Google Scholar
Vallender EJ, Lahn BT : Positive selection on the human genome. Hum Mol Genet 2004; 13 (Spec. No. 2): R245–R254.
Article CAS PubMed Google Scholar
Maynard SJ, Haigh J : The hitch-hiking effect of a favorable gene. Genet Res 1974; 23: 23–25.
Article Google Scholar
Kaplan NL, Hudson RR, Langley CH : The “hitchhiking effect” revisited. Genetics 1989; 123: 887–899.
CAS PubMed PubMed Central Google Scholar
Stephan W, Wiehe THE, Lenz MW : The effect of strongly selected substitutions on neutral polymorphism: analytical results based on diffusion theory. Theor Popul Biol 1992; 41: 237–254.
Article Google Scholar
Ewens WJ : Mathematical Population Genetics. I. Theoretical Introduction. New York, USA: Springer, 2004.
Book Google Scholar
Fay JC, Wu CI : Hitchhiking under positive Darwinian selection. Genetics 2000; 155: 1405–1413.
CAS PubMed PubMed Central Google Scholar
Fu YX, Li WH : Statistical tests of neutrality of mutations. Genetics 1993; 133: 693–709.
CAS PubMed PubMed Central Google Scholar
Hudson RR, Kreitman K, Aguadé M : A test of neutral molecular evolution based on nucleotide data. Genetics 1987; 120: 831–840.
Google Scholar
Kreitman M : Methods to detect selection in populations with applications to the human. Annu Rev Genomics Hum Genet 2000; 1: 539–559.
Article CAS PubMed Google Scholar
Nielsen R : Molecular signatures of natural selection. Annu Rev Genet 2005; 39: 197–218.
Article CAS PubMed Google Scholar
Sabeti PC, Schaffner SF, Fry B et al: Positive natural selection in the human lineage. Science 2006; 312: 1614–1620.
Article CAS PubMed Google Scholar
Tajima F : Statistical methods for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 1989; 123: 585–595.
CAS PubMed PubMed Central Google Scholar
Teshima KM, Coop G, Przeworski M : How reliable are empirical genomic scans for selective sweeps? Genome Res 2006; 16: 702–712.
Article CAS PubMed PubMed Central Google Scholar
Watterson GA : On the number of segregating sites in genetic models without recombination. Theor Popul Biol 1975; 7: 256–276.
Article CAS PubMed Google Scholar
Williamson SH, Hubisz MJ, Clark AG, Payseur BA, Bustamante CD, Nielsen R : Localizing recent adaptive evolution in the human genome. PLoS Genet 2007; 3: e90.
Article PubMed PubMed Central Google Scholar
Sabeti PC, Reich DE, Higgins JM et al: Detecting recent positive selection in the human genome from haplotype structure. Nature 2002; 419: 832–837.
Article CAS PubMed Google Scholar
Toomajian C, Ajioka RS, Jorde LB, Kushner JP, Kreitman M : A method for detecting recent selection in the human genome from allele age estimates. Genetics 2003; 165: 287–297.
CAS PubMed PubMed Central Google Scholar
Voight BF, Kudaravalli S, Wen X, Pritchard JK : A map of recent positive selection in the human genome. PLoS Biol 2006; 4: e72.
Article PubMed PubMed Central Google Scholar
Wang ET, Kodama G, Baldi P, Moyzis RK : Global landscape of recent inferred Darwinian selection for Homo sapiens. Proc Natl Acad Sci USA 2006; 103: 135–140.
Article CAS PubMed Google Scholar
Nielsen R, Williamson S, Kim Y, Hubisz MJ, Clark AG, Bustamante C : Genomic scans for selective sweeps using SNP data. Genome Res 2005; 15: 1566–1575.
Article CAS PubMed PubMed Central Google Scholar
Hanchard NA, Rockett KA, Spencer C et al: Screening for recently selected alleles by analysis of human haplotype similarity. Am J Hum Genet 2006; 78: 153–159.
Article CAS PubMed Google Scholar
Sabeti PC : The case for selection at CCR5-Delta32. PLoS Biol 2005; 3: e378.
Article PubMed PubMed Central Google Scholar
Tang K, Thornton KR, Stoneking M : A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol 2007; 5: e171.
Article PubMed PubMed Central Google Scholar
McVean G : The structure of linkage disequilibrium around a selective sweep. Genetics 2007; 175: 1395–1406.
Article PubMed PubMed Central Google Scholar
Gibson J, Morton NE, Collins A : Extended tracts of homozygosity in outbred human populations. Hum Mol Genet 2006; 15: 789–795.
Article CAS PubMed Google Scholar
Ayers KL, Lange K : Penalized estimation of haplotype frequencies. Bioinformatics 2008; 24: 1596–1602.
Article CAS PubMed Google Scholar
Long JC, Williams RC, Urbanek M : An E-M algorithm and testing strategy for multiple-locus haplotypes. Am J Hum Genet 1995; 56: 799–810.
CAS PubMed PubMed Central Google Scholar
Spencer CCA, Coop G : SelSim: a program to simulate population genetic data with natural selection and recombination. Bioinformatics 2004; 20: 3673–3675.
Article CAS PubMed Google Scholar
Hudson RR : Generating samples under a Wright-Fisher neutral model. Bioinformatics 2002; 18: 337–338.
Article CAS PubMed Google Scholar
The International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs. Nature 2007; 449: 851–861.
Article PubMed Central Google Scholar
Sabeti PC, Varilly P, Fry B et al: Genome-wide detection and characterization of positive selection in human populations. Nature 2007; 449: 913–918.
Article CAS PubMed PubMed Central Google Scholar
Lamason RL, Mohideen MPK, Mest JR et al: SLC24A5, a putative cation exchanger, affects pigmentation in Zebrafish and humans. Science 2005; 310: 1782–1786.
Article CAS PubMed Google Scholar
Bersaglieri T, Sabeti PC, Patterson N et al: Genetic signatures of strong recent positive selection at the lactase gene. Am J Hum Genet 2004; 74: 1111–1120.
Article CAS PubMed PubMed Central Google Scholar
Enattah NS, Sahi T, Savilahti E, Terwilliger JD, Peltonen L, Järvelä I : Identification of a variant associated with adult-type hypolactasia. Nat Genet 2002; 30: 233–237.
Article CAS PubMed Google Scholar
Poulter M : The causal element for the lactase persistence/non-persistence polymorphism is located in a 1 Mb region of linkage disequilibrium in Europeans. Ann Hum Genet 2003; 67: 298–311.
Article CAS PubMed Google Scholar
Akey JM, Zhang G, Zhang K, Jin L, Shriver MD : Interrogating a high-density SNP map for signature of natural selection. Genome Res 2002; 12: 1805–1814.
Article CAS PubMed PubMed Central Google Scholar
Akey JM, Eberle MA, Rieder MJ et al: Population history and natural selection shape patterns of genetic variation in 132 genes. PLoS Biol 2004; 2: e286.
Article PubMed PubMed Central Google Scholar
Weir BS, Cockerham CC : Estimating F-statistics for the analysis of population structure. Evolution 1984; 38: 1358–1370.
CAS PubMed Google Scholar
Jensen JD, Kim Y, Dumont VB, Aquadro CF, Bustamante CD : Distinguishing between selective sweeps and demography using DNA polymorphism data. Genetics 2005; 170: 1401–1410.
Article CAS PubMed PubMed Central Google Scholar
Przeworski M : The signature of positive selection at randomly chosen loci. Genetics 2002; 160: 1179–1189.
PubMed PubMed Central Google Scholar

Download references

Acknowledgements

Dr Gill introduced us to the topic of positive selection and stimulated our interest in the research. We thank Dr Pritchard's group for sharing their iHS C++ codes, which facilitated our research. This research was supported by the NIH Grant R01CA133996 from the MD Anderson Cancer Center for Fan R, and the NIH grants R01GM53275/R01MH59490 from UCLA for Lange K.

Author information

Authors and Affiliations

Department of Statistics, The Texas A&M University, College Station, TX, USA
Ming Zhong & Ruzong Fan
Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
Kenneth Lange & Jeanette C Papp
Department of Epidemiology, MD Anderson Cancer Center, University of Texas, Houston, TX, USA
Ruzong Fan

Authors

Ming Zhong
View author publications
You can also search for this author in PubMed Google Scholar
Kenneth Lange
View author publications
You can also search for this author in PubMed Google Scholar
Jeanette C Papp
View author publications
You can also search for this author in PubMed Google Scholar
Ruzong Fan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ruzong Fan.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

Additional information

Supplementary Information accompanies the paper on European Journal of Human Genetics website

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhong, M., Lange, K., Papp, J. et al. A powerful score test to detect positive selection in genome-wide scans. Eur J Hum Genet 18, 1148–1159 (2010). https://doi.org/10.1038/ejhg.2010.60

Download citation

Received: 23 June 2009
Revised: 04 January 2010
Accepted: 26 March 2010
Published: 12 May 2010
Issue Date: October 2010
DOI: https://doi.org/10.1038/ejhg.2010.60