Introduction

Since Kimura (1968) and King & Jukes (1969) first suggested that most polymorphisms are selectively neutral, testing the neutral hypothesis has been one of the prime objectives of molecular population genetics. The objective of studies testing neutrality has been to make general inferences about the causes of molecular evolution. However, there has been a shift in focus in the last decade to using the neutral theory as a null model against which specific occurrences of selection can be detected. There has especially been interest in providing evidence for positive selection and selective sweeps. Positive selection occurs when a new selectively advantageous mutation is segregating in a population. This type of selection is of particular interest because it may provide evidence for adaptation at the molecular level and help elucidate genotype–phenotype relationships. Selective sweeps refer to the elimination of variation at neutral sites as a linked positively selected allele goes to fixation in a population. Much of the interest in selective sweeps is spurred by the observation that the rate of recombination is correlated with the level of polymorphism in organisms such as Drosophila melanogaster (e.g. Begun & Aquadro, 1992). Since the size of the region affected by a selective sweep is determined by the recombination rate, recurrent selective sweeps provide one possible explanation for this correlation.

The new availability of large genomic data sets has invigorated the field of molecular population genetics and spurred new controversies regarding the causes of molecular evolution. Large samples of Single Nucleotide Polymorphisms (SNPs), microsatellites and DNA sequence data are currently being obtained in humans and other organisms. Using these data and appropriate statistical methodologies, it is in theory possible to identify regions that have undergone selective sweeps or positive selection. By finding genomic regions in which selection has been acting, we can identify the causes for species-specific phenotypic differences. For example, we might be able to identify those parts of the genome that have been undergoing selection in the evolution of humans to their modern form. Likewise, it might be possible to identify regions currently under selection, for example, because of the presence of disease-causing mutations. Tests of neutrality provide us with a powerful tool for developing hypotheses regarding function from genomic data. An important question is therefore how to extract the information regarding natural selection from genomic data and how best to identify regions, loci or specific nucleotide sites which have been targeted by selection.

The problem of testing the neutral hypothesis from molecular data has taken up much of the theoretical literature in population genetics in the last three decades. I will here provide a brief, perhaps opinionated, review of some of this literature. Because of space limitations, the review will not be comprehensive but will focus on some of the classical examples pertinent to the analysis of genomic data and on selected recent developments. I will divide tests of neutrality into two categories: (1) tests based on the allelic distribution and/or level of variability; and (2) tests based on comparisons of divergence/variability between different classes of mutations within a locus, such as nonsynonymous and synonymous mutations. Not all tests fall naturally into one of these categories. For example, tests based on the molecular clock (e.g. Langley & Fitch, 1974) may not belong to either of them. However, this categorization is useful for highlighting the following point: although much of the literature has concentrated on tests of type (1), they have had very limited success in providing unambiguous evidence for selection, mostly because they rely on strong assumptions regarding the demographics of the populations. In contrast, tests of type (2) have been very successful in providing clear evidence for selection.

I will here argue that it might be difficult to construct neutrality tests applicable to genomic data based on allelic distributions alone that are robust to the demographic assumptions. In contrast, robust inferences can easily be made by comparing variability in nonsynonymous and synonymous sites or between other categories of mutations, in the same genomic region. In particular, comparisons of the rates and distributions of nonsynonymous and synonymous substitutions are useful for providing robust inferences regarding the presence of selection.

Tests based on the allelic distribution or levels of variability

One locus

One of the milestones of population genetical theory was the discovery of the Ewens sampling formula (Ewens, 1972). This formula provides an analytical expression for the sampling probability under the infinite allele model, whereby every mutation is to a new allelic type, for a sample obtained from a single population of constant size with no population structure. Using Ewens's sampling formula, one of the most famous tests of neutrality, the Ewens–Watterson test (Watterson, 1977) was developed. In this test the expected homozygosity, given the observed number of alleles, is compared to the observed homozygosity. If the difference between the observed and expected homozygosity is larger than some critical value, the neutral null hypothesis can be rejected. This test is applicable to data for which the infinite-alleles model might be reasonable, such as allozyme data.
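The logic of the Ewens–Watterson test can be illustrated with a small simulation. The sketch below is not Watterson's original procedure (which relies on the exact conditional distribution); it approximates that distribution by drawing neutral samples with Hoppe's urn and keeping only those with the observed number of alleles. Because the number of alleles is a sufficient statistic for θ under the Ewens sampling formula, the value of θ used for the simulation affects only the efficiency of the rejection step. All function names are hypothetical.

```python
import random

def homozygosity(counts):
    """Sample homozygosity: the sum of squared allele frequencies."""
    n = sum(counts)
    return sum((c / n) ** 2 for c in counts)

def expected_alleles(n, theta):
    """Expected number of alleles in a sample of n (infinite-alleles model)."""
    return sum(theta / (theta + i) for i in range(n))

def theta_matching_k(n, k):
    """Bisection on a log scale for the theta whose expected allele count is k."""
    lo, hi = 1e-6, 1e6
    for _ in range(100):
        mid = (lo * hi) ** 0.5
        if expected_alleles(n, mid) < k:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5

def neutral_sample(n, theta, rng):
    """Allele counts for n genes drawn with Hoppe's urn."""
    counts = []
    for i in range(n):
        if rng.random() < theta / (theta + i):
            counts.append(1)                      # gene i carries a new allele
        else:
            j = rng.choices(range(len(counts)), weights=counts)[0]
            counts[j] += 1                        # gene i copies an existing allele
    return counts

def ewens_watterson(counts, reps=1000, seed=1):
    """Return observed F, simulated mean F, and the two one-sided tail fractions."""
    n, k = sum(counts), len(counts)
    if not 1 < k < n:
        raise ValueError("the test needs 1 < k < n")
    rng = random.Random(seed)
    theta = theta_matching_k(n, k)
    f_obs = homozygosity(counts)
    sims = []
    while len(sims) < reps:
        sample = neutral_sample(n, theta, rng)
        if len(sample) == k:                      # condition on the observed k
            sims.append(homozygosity(sample))
    f_exp = sum(sims) / reps
    p_low = sum(f <= f_obs for f in sims) / reps
    p_high = sum(f >= f_obs for f in sims) / reps
    return f_obs, f_exp, p_low, p_high
```

For example, `ewens_watterson([10, 10, 10, 10, 10])` describes a sample of 50 genes with five equally frequent alleles, a configuration with much lower homozygosity than is typical of neutral samples with five alleles.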

For nucleotide data, one of the most popular tests is Tajima's D-test (Tajima, 1989). Tajima's D is the scaled difference between two estimates of θ=4Neμ (Ne=effective population size, μ=mutation rate per generation), one based on the number of pairwise differences and the other on the number of segregating sites in a sample of nucleotide sequences. It is defined as

$$D = \frac{\hat{\theta}_{\pi} - \hat{\theta}_{\omega}}{S_{\hat{\theta}_{\pi} - \hat{\theta}_{\omega}}}$$

where $\hat{\theta}_{\pi}$ is an estimator of θ based on the average number of pairwise differences, $\hat{\theta}_{\omega}$ is an estimator of θ based on the number of segregating sites and $S_{\hat{\theta}_{\pi} - \hat{\theta}_{\omega}}$ is an estimate of the standard error of the difference of the two estimates. If the value of D is too large or too small the neutral null hypothesis is rejected. The critical values are obtained by simulations if mutational rate variation and recombination are taken into account. There are several similar tests based on slightly different test statistics such as the tests by Fu & Li (1993), Simonsen et al. (1995) and Fay & Wu (2000). A likelihood ratio test of a similar problem was described in Galtier et al. (2000).
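As an illustration, the statistic can be computed from a set of aligned sequences in a few lines. The sketch below uses the standard variance constants from Tajima (1989); the function name and the input format (equal-length strings without gaps or missing data) are assumptions made for the example, and every variable site is counted as a single segregating site.

```python
from itertools import combinations
from math import sqrt

def tajimas_d(seqs):
    """Tajima's D for a list of aligned, equal-length sequences."""
    n = len(seqs)
    if n < 4:
        raise ValueError("the variance term is zero for fewer than four sequences")
    # Number of segregating sites: alignment columns with more than one state.
    seg_sites = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if seg_sites == 0:
        return float("nan")          # D is undefined without variation
    # Estimator based on the average number of pairwise differences.
    n_pairs = n * (n - 1) / 2
    theta_pi = sum(
        sum(a != b for a, b in zip(x, y)) for x, y in combinations(seqs, 2)
    ) / n_pairs
    # Estimator based on the number of segregating sites.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_s = seg_sites / a1
    # Constants for the estimated variance of the difference.
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    std_err = sqrt(e1 * seg_sites + e2 * seg_sites * (seg_sites - 1))
    return (theta_pi - theta_s) / std_err
```

A sample in which every polymorphism is a singleton, such as `["AAAA", "TAAA", "ATAA", "AATA"]`, gives a negative value, whereas polymorphisms at intermediate frequency, as in `["AA", "AA", "TT", "TT"]`, give a positive value.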

These tests have had great success in many applications in testing the neutral equilibrium model. However, the interpretation of significant results is not always clear. The null hypothesis is a composite hypothesis that includes assumptions regarding the demographics of the populations, such as constant population size and no population structure. There is wide awareness in the field of this fact. For example, Simonsen et al. (1995) examined the power of Tajima's D-test against both demographic and selection alternatives. They found that Tajima's D had reasonable power to detect population bottlenecks and population subdivision in addition to selective sweeps. The term 'neutrality test' has therefore to some degree become synonymous with tests of the equilibrium neutral population model. Significant deviations from the neutral equilibrium model alone do not provide evidence against selective neutrality.

Some insights into the problems associated with these tests have been gained by considering the genealogical structure of the data. For example, a complete selective sweep tends to produce genealogies similar to those generated by a severe bottleneck (Fig. 1b). In both cases, the lineages in the genealogy are forced to coalesce at the time of the selective sweep or the bottleneck. The average number of pairwise differences is decreased compared to the number of segregating sites, leading to negative values of Tajima's D. The fundamental problem is that both the demographic process and selection can have very similar effects on the genealogy. It is therefore quite difficult to distinguish these effects when a single locus is considered. For the case of weak selection, it may be even more difficult to use allelic distributions to distinguish selection from demographic processes. Neuhauser & Krone (1997) and Golding (1997) have argued that weak selection may at best have only a slight effect on the genealogy. Neutrality tests based on allelic distribution might therefore often have much less power against the common models of selection than against demographic deviations from the neutral equilibrium model.

Figure 1

Genealogies simulated under (a) the standard neutral equilibrium model and (b) a model with a severe bottleneck or a complete selective sweep t generations in the past. The effect of a severe bottleneck or a complete selective sweep is to force all lineages in the genealogy to coalesce at the time of the bottleneck/sweep.

Multiple loci

Several statistical tests have been proposed for employing data from multiple loci. One of the most famous is the Lewontin–Krakauer test (Lewontin & Krakauer, 1973). In its original form, this test considers data at diallelic loci from multiple populations. For each locus,

$$F = \frac{\sigma_p^2}{\bar{p}(1 - \bar{p})}$$

is calculated, where $\bar{p}$ and $\sigma_p^2$ are the mean and variance in allele frequency, respectively, across populations. If the variance in F is too large among loci, the neutral null model can be rejected. The problem with this test is how to determine when the variance in F is too large. In its original form, critical values were calculated assuming independence among populations, a condition that is violated by shared common ancestry or migration between populations (Robertson, 1975). The test relies on very strong, and in many cases arguably unrealistic, demographic assumptions.
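The statistic itself is simple to compute; the difficulty discussed above lies entirely in judging its variance. A minimal sketch follows, assuming that the allele frequencies at each locus are supplied as one number per population and that the population (rather than sample) variance is intended. The function names are hypothetical, and no critical value is computed.

```python
from statistics import mean, pvariance, variance

def lewontin_krakauer_f(freqs):
    """F for one diallelic locus, given the allele frequency in each population."""
    p_bar = mean(freqs)
    if not 0 < p_bar < 1:
        raise ValueError("the locus is monomorphic across populations")
    return pvariance(freqs) / (p_bar * (1 - p_bar))

def f_among_loci(freqs_by_locus):
    """Per-locus F values and their variance among loci."""
    f_values = [lewontin_krakauer_f(freqs) for freqs in freqs_by_locus]
    return f_values, variance(f_values)
```

A locus fixed for alternative alleles in two populations, `[0.0, 1.0]`, gives F = 1, and a locus with identical frequencies in all populations gives F = 0.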

The most popular test applicable to DNA sequence data obtained from multiple loci is the HKA test (Hudson et al., 1987). In this test variability within and between species is compared for two or more loci. The idea is that in the absence of selection, the expected number of segregating sites within species (polymorphisms) and the expected number of fixed differences between species (divergence) are both proportional to the mutation rate, and the ratio of the two expectations should be constant among loci. Selection is inferred when the variance among loci of the ratio of divergence to polymorphism is too high. One problem that is often ignored in interpreting results of this test is that the variance in the number of segregating sites depends strongly on the demographic model. For example, we can consider the realistic case in which we have sampled DNA sequences from a population that exchanges migrants with another unobserved population. The coefficient of variation (standard deviation divided by mean) in the number of segregating sites under this model is shown in Fig. 2. Notice that the coefficient of variation approaches infinity as the migration rate goes to zero. This implies, paradoxically, that as there is less and less chance of observing evidence for genetic exchange between populations, it is more and more likely that tests based on comparing levels of variability in a single population in different regions will give falsely significant results due to migration. The reason is that for low migration rates, the probability that an ancestral lineage visits the other unobserved population is very small. However, if a lineage happens to visit the other population it will tend to stay there for a very long time. The effect is a very high variance in the coalescence time among different loci.
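The effect of migration on the variance can be explored with a small coalescent simulation. The sketch below is an illustration of the two-population model described above, not the code used for the figures. It assumes that time is measured in units of 2N generations, that M = 4Nm so that each ancestral lineage migrates at rate M/2, and that mutations fall on the genealogy as a Poisson process with rate θ/2 per lineage (infinite sites model). Only the number of lineages in each population is tracked, which is sufficient for the number of segregating sites. All names are hypothetical.

```python
import random
from statistics import mean, pstdev

def poisson(lam, rng):
    """Poisson draw: count unit-rate exponential arrivals before time lam."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def segregating_sites(n, theta, M, rng):
    """One draw of S for n genes sampled from population 0 of two populations."""
    k = [n, 0]                       # ancestral lineages in each population
    total_length = 0.0               # total branch length of the genealogy
    while k[0] + k[1] > 1:
        rates = [
            k[0] * (k[0] - 1) / 2,   # coalescence in population 0
            k[1] * (k[1] - 1) / 2,   # coalescence in population 1
            k[0] * M / 2,            # a lineage moves from 0 to 1
            k[1] * M / 2,            # a lineage moves from 1 to 0
        ]
        total_length += (k[0] + k[1]) * rng.expovariate(sum(rates))
        event = rng.choices(range(4), weights=rates)[0]
        if event == 0:
            k[0] -= 1
        elif event == 1:
            k[1] -= 1
        elif event == 2:
            k[0] -= 1
            k[1] += 1
        else:
            k[1] -= 1
            k[0] += 1
    return poisson(theta / 2 * total_length, rng)

def coefficient_of_variation(n, theta, M, reps, seed=1):
    """Simulated CV (standard deviation / mean) of the number of segregating sites."""
    rng = random.Random(seed)
    draws = [segregating_sites(n, theta, M, rng) for _ in range(reps)]
    return pstdev(draws) / mean(draws)
```

Comparing, for instance, `coefficient_of_variation(25, 10, 0.05, 2000)` with `coefficient_of_variation(25, 10, 10.0, 2000)` shows the increase in the coefficient of variation at low migration rates. With M = 0 the code reduces to the standard neutral coalescent for a single population.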

Figure 2

The coefficient of variation (standard deviation divided by the mean) of the number of segregating sites. The coefficient of variation (CV) was evaluated by simulating samples of 25 genes from a single neutrally evolving population under the infinite sites model (Watterson, 1975). It was assumed that θ=10 (θ is four times the effective population size times the mutation rate) and that the population on average exchanges M migrants per generation symmetrically with another unobserved population of the same size. Ten thousand simulations were performed for each point in the graph. For more details on coalescence simulations and on the model see pp. 19–24 in Hudson (1990).

Demographic factors affect all loci in the genome of an organism. Selection will in contrast target specific loci or nucleotide sites. Common sense would therefore dictate that it is possible to detect selection by comparing multiple loci. If there is strong statistical evidence against the neutral equilibrium model for a particular locus, but the model fits the data at other loci quite well, this will usually be interpreted as evidence for selection at that locus. For example, one can imagine searching for genomic regions of low variability and/or small values of Tajima's D as a method for identifying regions that have undergone a recent selective sweep. We readily realize that searches for regions with low levels of variability might be difficult to perform robustly, because the variance in measures of variability is strongly dependent on the demographic models (e.g. Fig. 2). Unfortunately, we face a similar problem when searching for genomic regions with extreme values of Tajima's D or other related statistics; not only the expectation but also the variance of Tajima's D depends on the demographic model. For example, we can consider the previously described demographic model, in which there is a low level of migration between the sampled population and another unobserved population (Fig. 3). In such a model the mean value of Tajima's D is approximately zero, independently of migration, but the variance in Tajima's D is increased. When the average number of migrants per generation is 0.1 it is 6–7 times as likely to observe an extreme value of |D| > 2 as when there is no migration. Variation in the observed value of Tajima's D or other similar summary statistics along a chromosome may therefore only in extreme cases be interpreted as evidence for selection.

Figure 3

The distribution of Tajima's D evaluated using 10 000 simulations under the same assumptions as in Fig. 2. The case of M=0 corresponds to the standard neutral equilibrium model without migration.

As more genomic data are collected, there will be an increased demand for robust and general tests for identifying regions that have experienced selection. In constructing such tests, we face the challenge that most observations based on a single summary statistic can easily be explained by demographic factors. However, it may be possible to construct more robust tests by using methods that capture more of the information in the data.

Comparing variability in different classes of mutations

McDonald–Kreitman type tests

Tests based on allelic distribution or variability alone are, as just argued, quite sensitive to the underlying demographic assumptions, mostly because the structure of the gene genealogy is a product of the demographic processes in the populations. However, it is possible to establish tests of neutrality based on statistics with distributions that are independent of the genealogy or only depend on the genealogy through a nuisance parameter that can be eliminated. A famous example is the McDonald–Kreitman test (McDonald & Kreitman, 1991). In this test, the ratio of nonsynonymous to synonymous polymorphisms within species is compared to the ratio of the number of nonsynonymous and synonymous fixed differences between species in a 2 × 2 contingency table. The justification of this test is very similar to that of the HKA test. If polymorphism and divergence are driven only by mutation and genetic drift, the ratio of the number of fixations to polymorphisms should be the same for both nonsynonymous and synonymous mutations. In statistics, parameters that are of no interest to the researcher but cannot be ignored are labelled 'nuisance parameters'. A common approach is to eliminate such parameters by conditioning on a sufficient statistic, i.e. a statistic that contains all the relevant information in the data regarding the parameter. In the case of the McDonald–Kreitman test, the total tree length is the nuisance parameter and the total number of substitutions is a sufficient statistic for this parameter. By conditioning on the total number of substitutions in the 2 × 2 table, the total tree length parameter is eliminated. In this manner a test of neutrality is established that is valid for any possible demographic model. The McDonald–Kreitman test has been very useful for detecting selection. For example, Eanes et al. (1993) found very strong evidence for selection in the G6pd gene in Drosophila melanogaster and D. simulans.
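A 2 × 2 table of this kind can be analysed with any standard test of independence. The sketch below uses Fisher's exact test, which conditions on the margins of the table; this choice is an assumption of the example rather than a method prescribed above, and the counts in the usage note are hypothetical, chosen only to illustrate the calculation.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value for the table [[a, b], [c, d]].

    Rows: fixed differences, polymorphisms.
    Columns: nonsynonymous, synonymous.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)
```

With hypothetical counts of 12 nonsynonymous and 15 synonymous fixed differences, and 3 nonsynonymous and 40 synonymous polymorphisms, `fisher_exact_two_sided(12, 15, 3, 40)` returns a small P-value, indicating that the ratio of fixations to polymorphisms differs between the two classes of mutations.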

Although the McDonald–Kreitman test does provide unambiguous evidence for selection, it is not always clear which type of selection is acting on the gene. For example, changes in the population size combined with weak selection against slightly deleterious mutations may either increase or decrease the number of nonsynonymous polymorphisms. An increase in the population size will lead to a deficiency of nonsynonymous polymorphisms and a decrease in population size will lead to an excess of nonsynonymous polymorphisms. Significant results from the McDonald–Kreitman test therefore cannot be interpreted directly as evidence for positive selection.

A related test was applied by Akashi (1994) to examine if there is selection for optimal codon usage in Drosophila. In the Drosophila genome, some codons occur at a higher frequency than others coding for the same amino acid. The common codons are usually referred to as 'preferred codons' and the rare codons are named 'unpreferred codons'. Akashi (1995) developed a test to examine if the presence of preferred codons could be attributed to selection or, alternatively, to mutational biases. He demonstrated that changes to unpreferred codons showed a significantly higher ratio of polymorphism to divergence than preferred changes in the Drosophila simulans lineage, providing evidence for the action of selection at silent sites.

These types of test do not rely on assumptions regarding the demographics of the populations because they are constructed by comparing different types of variability within the same locus, or genomic region. Since nonsynonymous and synonymous sites, for example, are interspersed among each other in a coding region, the effect of the demographic model is the same for both types of site.

Tests based on allelic distribution in nonsynonymous and synonymous sites

Other robust tests of neutrality can be constructed by comparing the allelic distribution in different types of sites. For example, differences in the allelic distributions (frequency spectra) between synonymous and nonsynonymous polymorphisms provide quite unambiguous evidence for selection. Such tests are particularly relevant for genomic data sets in which large numbers of polymorphisms can be obtained. Akashi (1999) suggested comparing the frequency distribution in nonsynonymous sites to the frequency distribution in synonymous sites using a test of homogeneity. If selection is of no importance, the frequency distributions of synonymous and nonsynonymous sites should be the same. For example, Cargill et al. (1999) and Sunyaev et al. (2000) demonstrated that the overall frequency spectra in the human genome of nonsynonymous and synonymous mutations differ, providing evidence for selection on segregating mutations. Similar information was used in the test by Nielsen & Weinreich (1999) in which the ages of nonsynonymous and synonymous mutations were estimated. Differences in the average age of nonsynonymous and synonymous mutations provided evidence for selection.

Tests based on the dN/dS ratio

The most direct method for showing the presence of positive selection is to demonstrate that the number of nonsynonymous substitutions per nonsynonymous site (dN) is significantly larger than the number of synonymous substitutions per synonymous site (dS). For example, Hughes & Nei (1988) showed that dN > dS in the antigen binding cleft of the Major Histocompatibility Complex. This observation provided unambiguous evidence for positive selection in the region, presumably overdominant or frequency dependent selection. A value of dN > dS implies that nonsynonymous mutations are fixed with a higher probability than neutral ones due to positive selection.

A statistical framework for making inferences regarding dN and dS was developed by Goldman & Yang (1994) and Muse & Gaut (1994). In this framework the evolution of a nucleotide sequence is modelled as a continuous-time Markov chain with state space on the 61 possible codons in the universal genetic code. In one parameterization, the instantaneous rate matrix of the process Q={qij}, is given by

$$q_{ij} = \begin{cases} 0, & \text{if codons } i \text{ and } j \text{ differ at more than one position,} \\ \pi_j, & \text{for a synonymous transversion,} \\ \kappa\pi_j, & \text{for a synonymous transition,} \\ \omega\pi_j, & \text{for a nonsynonymous transversion,} \\ \omega\kappa\pi_j, & \text{for a nonsynonymous transition,} \end{cases}$$

where πj is the stationary frequency of codon j, κ is the transition/transversion rate ratio and ω (=dN/dS) is the nonsynonymous/synonymous rate ratio. Using this model, it is possible to calculate the likelihood function for ω and for other parameters using the general algorithm of Felsenstein (1981). It is thereby possible to obtain maximum likelihood estimates of these parameters, and hypotheses such as H0: ω ≤ 1 can be tested using likelihood ratio tests. This maximum likelihood method has several advantages over previous methods in that it correctly accounts for the structure of the genetic code, it can incorporate complex mutational models and it is applicable directly to multiple sequences, taking the structure of the underlying genealogical tree into account.
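To make the structure of the codon model concrete, the rate matrix can be assembled directly from the universal genetic code. The sketch below assumes the common parameterization in which a substitution rate is πj multiplied by κ for transitions and by ω for nonsynonymous changes, with codons differing at more than one position given rate zero. The diagonal is set so that rows sum to zero, and the matrix is not rescaled to a particular time unit. The function and variable names are hypothetical, and equal codon frequencies are used when none are supplied.

```python
from itertools import product

BASES = "TCAG"
# Universal genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
CODONS = [codon for codon, aa in GENETIC_CODE.items() if aa != "*"]  # 61 sense codons
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def rate_matrix(kappa, omega, pi=None):
    """Instantaneous rate matrix Q over the 61 sense codons."""
    m = len(CODONS)
    pi = pi or [1.0 / m] * m
    Q = [[0.0] * m for _ in range(m)]
    for i, ci in enumerate(CODONS):
        for j, cj in enumerate(CODONS):
            diffs = [(x, y) for x, y in zip(ci, cj) if x != y]
            if len(diffs) != 1:
                continue                      # zero or several positions differ
            rate = pi[j]
            if frozenset(diffs[0]) in TRANSITIONS:
                rate *= kappa                 # transition
            if GENETIC_CODE[ci] != GENETIC_CODE[cj]:
                rate *= omega                 # nonsynonymous change
            Q[i][j] = rate
        Q[i][i] = -sum(Q[i])                  # rows of a rate matrix sum to zero
    return Q
```

For example, with `Q = rate_matrix(2.0, 0.5)` the entry for TTT to TTC (a synonymous transition) is 2/61, whereas the entry for TTT to TTA (a nonsynonymous transversion) is 0.5/61.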

In general, testing if ω ≤ 1 (dN ≤ dS) for an entire gene is a very conservative test of neutrality. Purifying selection must occur quite frequently in functional genes to preserve function. For this reason, the average dN is expected to be much less than the average dS for most genes, even if positive selection is occurring in some sites quite frequently. However, when multiple divergent sequences are available it is possible to detect the presence of positively selected sites, even when most sites are under negative selection, by allowing variation in ω among codon sites. Nielsen & Yang (1998) developed a model in which there are three categories of sites: invariable sites (ω=0), neutral sites (ω=1) and positively selected sites (ω > 1). By comparing the maximum likelihood calculated under a constrained model in which the frequency of positively selected sites is set to zero (neutral model), to the maximum likelihood calculated under the general model (positive selection model), a likelihood ratio test of the hypothesis H0: ω ≤ 1, for all sites, can be performed. In other words, we can test if all of the sites in the sequence have values of ω ≤ 1. Tests based on more realistic models for the distribution of ω were also considered in Yang et al. (2000a). These tests have reasonable power, even when the majority of sites are constrained or are evolving neutrally. In fact, it has been possible in several cases to detect selection even when the majority of sites were constrained and only a few percent of sites were evolving under positive selection (Yang et al., 2000a). The test has provided evidence for positive selection in many viral systems including HIV-1 (Nielsen & Yang, 1998; Zanotto et al., 1999), in reproductive proteins (Swanson et al., 2000), in abalone sperm lysin (Yang et al., 2000b), plant chitinases (Bishop et al., 2000) and for a variety of other genes including beta-globin (Yang et al., 2000a).
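Once the two maximized log-likelihoods are available, the likelihood ratio test itself is a short calculation. The sketch below assumes that twice the log-likelihood difference is referred to a χ² distribution whose degrees of freedom equal the number of additional free parameters in the general model. This is the usual approximation, but it is only approximate here because the null hypothesis lies on the boundary of the parameter space. The function names are hypothetical, and only one and two degrees of freedom are supported so that the example needs nothing beyond the standard library.

```python
from math import erfc, exp, sqrt

def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution for df = 1 or 2."""
    if x <= 0:
        return 1.0
    if df == 1:
        return erfc(sqrt(x / 2))
    if df == 2:
        return exp(-x / 2)
    raise ValueError("this sketch supports only df = 1 or df = 2")

def likelihood_ratio_test(lnl_null, lnl_general, df):
    """Return the test statistic 2(lnL1 - lnL0) and its approximate P-value."""
    statistic = max(0.0, 2 * (lnl_general - lnl_null))
    return statistic, chi2_sf(statistic, df)
```

With hypothetical log-likelihoods of -1000.0 under the neutral model and -997.0 under the positive selection model, `likelihood_ratio_test(-1000.0, -997.0, 2)` gives a statistic of 6.0 and a P-value just below 0.05.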

When positive selection has been detected, sites undergoing positive selection can be identified using an empirical Bayes method. Swanson et al. (2000) showed that this method correctly identifies the positively selected sites in known test cases. It is therefore in many cases possible to identify the exact location of sites targeted by selection.
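The empirical Bayes calculation amounts to applying Bayes' rule at each site, with the estimated proportions of the site categories acting as prior probabilities. A minimal sketch, assuming that the likelihood of the data at a site under each category has already been computed with the maximum likelihood parameter estimates; the function name and the numbers in the usage note are hypothetical.

```python
def site_class_posteriors(proportions, site_likelihoods):
    """Posterior probability of each site category for a single codon site.

    proportions      -- estimated frequency of each category (sums to 1)
    site_likelihoods -- likelihood of the site's data under each category
    """
    weighted = [p * lik for p, lik in zip(proportions, site_likelihoods)]
    total = sum(weighted)
    return [w / total for w in weighted]
```

For example, with category proportions `[0.6, 0.3, 0.1]` for invariable, neutral and positively selected sites and site likelihoods `[1e-6, 2e-6, 2e-5]`, the posterior probability that the site belongs to the positively selected category is 0.625.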

It is also possible to detect selection occurring on a particular lineage of a phylogeny using similar methods. By allowing ω to vary among lineages, hypotheses such as H0: ω(j) ≤ 1 can be tested, where ω(j) is the value of ω on a particular lineage of a phylogeny (Yang, 1998). This type of test has been used in detecting selection, for example, in the human BRCA1 gene (Huttley et al., 2000).

The tests of neutrality based on testing H0: ω ≤ 1 differ from other neutrality tests, such as the McDonald–Kreitman test, by providing direct evidence for positive selection. Detecting values of ω > 1 is to date the only direct method available for detecting positive selection from DNA sequence data. However, the tests also have some limitations. In particular, they assume no recombination between the sequences and are therefore in many cases not applicable to intraspecific data. Also, the effect of a strong codon bias on these methods has not been systematically explored.

Tests of neutrality in the genomic future

I have argued that robust tests of neutrality based solely on simple summary statistics of allelic distributions and/or levels of variability are difficult to establish. The reason is that the distribution of genealogies is highly dependent on the demographics of the populations. To detect selection, more information is needed than a single summary statistic evaluated along a sequence. Tests based on comparing the pattern of synonymous and nonsynonymous mutations, in contrast, are relatively robust because parameters relating to the genealogy can be eliminated as nuisance parameters.

Several genomic sequencing projects have been recently completed or are close to completion (e.g. humans, mouse and Drosophila). Assuming that the sequencing projects do not stop here, there will soon be an abundance of comparative data. Such data are perfectly suited for scanning the genome for sites at which positive selection has occurred. Several authors have argued that positive selection might be frequent in the genomes of humans and other organisms (Kreitman & Akashi, 1995; Schmid et al., 1999). If this is true, we have the necessary statistical methods for identifying which sites have undergone selection based on comparative data. It will be possible to make systematic searches for genes that have undergone positive selection in the lineage leading to humans and identify the adaptive changes at the molecular level that were important in the evolution of modern humans. Identifying selection in the genome might very well become one of our most powerful tools for identifying causes for species-specific differences and for identifying genomic regions of functional, and perhaps, medical importance.