Introduction

An active area of research for the past decade has been the exploration of evidence for population structure from genome-wide genetic data. Accounting for population structure is critical in association studies, in which population stratification or cryptic relatedness can lead to inferential errors. Inferences about population structure are also critical in understanding group differences in studies of evolutionary and demographic histories.

Principal components analysis is widely used for identifying population structure in genetic data. EIGENSOFT implements principal components analysis for the purposes of detecting and correcting for population stratification in genome-wide association studies (Price et al., 2006) and detecting structure in population genetic studies (Patterson et al., 2006). EIGENSOFT is based on principal components analysis of the normalized sample covariance matrix, in which the normalization assumes Hardy–Weinberg equilibrium (Price et al., 2006). Formal hypothesis testing for significance of eigenvalues using Tracy–Widom statistics permits a determination of the number of significant principal components, or, equivalently, the number of covariates necessary to control for population stratification (Patterson et al., 2006).

Many stopping rules have been developed to determine the number of significant principal components (for a comparative study of 20 stopping rules, see Peres-Neto et al. (2005)). In this study, I explore Velicer's minimum average partial test (Velicer, 1976; O’Connor, 2000) as an alternative to Tracy–Widom statistics. Rather than performing formal hypothesis testing using an external reference distribution and subjective significance thresholds, Velicer's minimum average partial test is based on an objective minimization function of partial correlations (Velicer, 1976).

The motivations of this study are twofold. One, in their original description of EIGENSOFT, Patterson et al. (2006) noted an overestimation of significant principal components for admixed data. Two, analysis of an empirical admixed African-American data set by EIGENSOFT yielded 16 significant principal components, whereas the expectation for a two-way admixed sample was one significant principal component. Herein, by computer simulation, Velicer's minimum average partial test estimated the number of principal components to retain with a smaller mean squared error than EIGENSOFT, with EIGENSOFT's estimates being biased upward. Computer simulation also revealed that EIGENSOFT yielded even more highly upwardly biased estimates for admixed samples than for unadmixed samples, whereas Velicer's minimum average partial test yielded a low mean squared error for both unadmixed and admixed samples. For the empirical data, Velicer's minimum average partial test indicated retention of only one principal component, matching the expectation for a two-way admixed sample.

Materials and methods

Simulations

All work was performed in R (R Development Core Team, 2009). R code is available upon request.

Two populations: Under a coalescent model of vicariance (McVean, 2009), suppose A and B represent two populations that diverged at some time t in the past. To mimic an admixed African-American population, suppose A represents individuals of West African ancestry and suppose B represents individuals of Western European ancestry. A sample of 216 haplotypes (108 diploid individuals, see real data analysis below) from population A and 218 haplotypes (109 diploid individuals) from population B were simulated with divergence times t={0, 0.001, 0.01, 0.1, 1.0} in units of 2Ne generations (Figure 1a). Each data set consisted of 10 000 unlinked sites, that is, each site had an independently realized coalescent genealogy. As a result, there was neither background linkage disequilibrium nor extended linkage disequilibrium due to admixture. Mutations were placed on each genealogy in proportion to branch lengths; the effective population size Ne canceled out in this step. As a consequence of this mutational scheme, sites were ascertained to be polymorphic, but were not ascertained to be ancestrally informative. Haplotypes were randomly paired within each population to generate diploid individuals. These data sets were used to test population stratification.
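The study's own per-site coalescent simulation was written in R and is available on request. As a simplified, illustrative stand-in (not the paper's code), the sketch below uses the Balding–Nichols beta model, in which each population's allele frequency drifts from a shared ancestral frequency with differentiation parameter F (roughly F ≈ 1 − e^(−t) for divergence time t in units of 2Ne generations); the function name and parameter choices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_two_pops(n_sites, n_hap_a, n_hap_b, F):
    """Stand-in for the coalescent simulation of two diverged populations.

    Each site gets an ancestral frequency; populations A and B then draw
    their own frequencies from the Balding-Nichols beta distribution with
    differentiation F (must lie strictly between 0 and 1), and haplotypes
    are sampled as independent Bernoulli draws (unlinked sites, so no
    linkage disequilibrium).
    """
    anc = rng.uniform(0.1, 0.9, n_sites)              # ancestral frequencies
    a = anc * (1 - F) / F
    b = (1 - anc) * (1 - F) / F
    freq_a = rng.beta(a, b)                           # population A frequencies
    freq_b = rng.beta(a, b)                           # population B frequencies
    haps_a = rng.random((n_hap_a, n_sites)) < freq_a  # 0/1 haplotypes
    haps_b = rng.random((n_hap_b, n_sites)) < freq_b
    return haps_a, haps_b
```

Diploid individuals then follow by randomly pairing rows, for example `haps_a[0::2] + haps_a[1::2]`.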

Figure 1
figure 1

Genealogical representation of the coalescent simulations. (a) Two populations with a single divergence event 2tNe generations ago. (b) Three populations with the first divergence event 2t1Ne generations ago and the second divergence event 2t2Ne generations ago.

I extended this model of vicariance to include admixture. A sample of 1018 admixed individuals was generated instantaneously using parental populations A and B. For each admixed individual, the average genome-wide admixture proportion p was determined by drawing a random deviate from the beta-distribution β(10.18508,2.837815), yielding an expected genome-wide admixture proportion p̄ = 0.782. The parameters of the beta-distribution were chosen such that the first two moments matched the first two moments of the empirical distribution of individual admixture proportions for the African-American data set described below in the real data analysis subsection. For each site independently, the individual was assigned the state of a randomly selected haplotype from population A if a random deviate from the uniform distribution U(0,1) was less than p, and assigned the state of a randomly selected haplotype from population B otherwise. For each divergence time t, 1000 independent replicate data sets were generated.
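This two-way admixture scheme can be sketched as follows (illustrative Python, not the study's R code; one reading of the text is that each of the two allele copies of a diploid site is drawn independently, and the function name is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def admixed_genotypes(haps_a, haps_b, n_ind, a=10.18508, b=2.837815):
    """Two-way admixture sketch: each individual's genome-wide proportion p
    is drawn from Beta(a, b) (mean ~0.782); each allele copy at each site
    is then copied from a randomly selected population-A haplotype with
    probability p, and from a random population-B haplotype otherwise."""
    n_sites = haps_a.shape[1]
    sites = np.arange(n_sites)
    G = np.zeros((n_ind, n_sites), dtype=int)
    for i in range(n_ind):
        p = rng.beta(a, b)                 # individual admixture proportion
        for _ in range(2):                 # two allele copies per diploid site
            from_a = rng.random(n_sites) < p
            donor_a = haps_a[rng.integers(haps_a.shape[0], size=n_sites), sites]
            donor_b = haps_b[rng.integers(haps_b.shape[0], size=n_sites), sites]
            G[i] += np.where(from_a, donor_a, donor_b)
    return G
```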

Three populations: I also extended the model of vicariance to three populations. In the same coalescent framework, suppose A, B and C represent three ancestral populations that diverged at two times in the past. Populations B and C diverged at time t1={0.0001, 0.001, 0.01, 0.1, 1.0} in units of 2Ne generations, and population A diverged at time t2=10t1 in the same units (Figure 1b). Samples of 200 haplotypes from each of the three populations were simulated. Each data set consisted of 10 000 unlinked sites. Haplotypes were randomly paired within each population to generate diploid individuals. These data sets were used to test population stratification.

A sample of 1018 admixed individuals was generated instantaneously using parental populations A, B and C. For each admixed individual, p1 was a random deviate from β(0.8,7.2) and p2 was a random deviate from β(12,12). For each site independently, the individual was assigned the state of a randomly selected haplotype from population A if a random deviate from U(0,1) was less than p1. If the random uniform deviate was >p1, then the individual was assigned the state of a randomly selected haplotype from population B if a random deviate from U(0,1) was less than p2, and assigned the state of a randomly selected haplotype from population C otherwise. The expected genome-wide proportions of haplotypes from populations A, B and C were pA=p1=0.10, pB=(1−p1)p2=0.45 and pC=1−p1−(1−p1)p2=0.45, respectively, intended to mimic African, European and Native American admixture proportions in Latino populations (Martinez-Marignac et al., 2007; Price et al., 2007). For each divergence time t1, 1000 independent replicate data sets were generated.
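The expected contributions follow directly from the beta means, E[β(a,b)] = a/(a+b). A quick arithmetic check (illustrative Python):

```python
# Expected genome-wide contributions implied by the two-stage assignment.
p1 = 0.8 / (0.8 + 7.2)     # mean of Beta(0.8, 7.2) = 0.10
p2 = 12 / (12 + 12)        # mean of Beta(12, 12) = 0.50

pA = p1                    # assigned from A with probability p1
pB = (1 - p1) * p2         # else from B with probability p2
pC = 1 - pA - pB           # otherwise from C
```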

EIGENSOFT

Let G be the M × N matrix of genotypes for i=1 to M SNPs and j=1 to N individuals, with genotypes coded as 0, 1 or 2 copies of the minor allele. The rows of G were centered by subtracting the row mean

μi = (1/N) Σj gij
from each entry in row i (Price et al., 2006). Eigendecomposition was performed on the standardized N × N sample covariance matrix. Significance of the leading eigenvalue was determined by a formal hypothesis test on the basis of the Tracy–Widom distribution (Johnstone, 2001; Patterson et al., 2006). The rank of the sample covariance matrix was one less than the number of individuals due to centering. The number of eigenvalues/eigenvectors equaled the rank of the covariance matrix. Given orthogonality of eigenvectors, P-values were Bonferroni corrected for the number of eigenvectors. Thus, the significance level was 0.05/(N−1).
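The centering, normalization and eigendecomposition steps can be sketched as follows (an illustrative Python sketch, not EIGENSOFT's own code; the Tracy–Widom testing step is omitted):

```python
import numpy as np

def eigensoft_style_pca(G):
    """PCA on genotypes normalized in the style of Price et al. (2006).

    G is the M x N genotype matrix (SNPs x individuals) coded 0/1/2.
    Each row is centered by its mean and scaled by the binomial standard
    deviation sqrt(p(1-p)) implied by Hardy-Weinberg equilibrium, where
    p = mu/2 is the estimated allele frequency."""
    mu = G.mean(axis=1, keepdims=True)         # row means
    p = mu / 2.0                               # estimated allele frequencies
    X = (G - mu) / np.sqrt(p * (1.0 - p))      # normalize assuming HWE
    C = X.T @ X / G.shape[0]                   # N x N sample covariance
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues descending
    return eigvals[order], eigvecs[:, order]
```

Because each row is centered, the covariance matrix annihilates the all-ones vector, so its smallest eigenvalue is zero and its rank is N−1, as noted above.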

Velicer's minimum average partial test

Let G be the M × N matrix of genotypes for i=1 to M SNPs and j=1 to N individuals, with genotypes coded as 0, 1 or 2 copies of the minor allele. First, center the rows of G by subtracting the row mean

μi = (1/N) Σj gij
from each entry in row i (Price et al., 2006). Next, compute the N × N sample correlation matrix R, in which the elements are Pearson product-moment correlation coefficients. Let R(m)* be the N × N matrix of partial correlations after the first m principal components have been partialed out. Let rkl* be the element in the kth row and lth column of R(m)*. Velicer (1976) proposed the summary statistic

fm = Σk≠l (rkl*)² / [N(N−1)].

The summary statistic fm is the average of the squared partial correlations after the first m components are partialed out, with m=0 to N−1 (see the Appendix for computer code) (Velicer, 1976). The stopping point is the value of m for which fm is minimum (Velicer, 1976).
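The procedure above can be sketched in a few lines (an illustrative Python translation; the study's own implementation, in R, is given in the Appendix). Components are partialed out of R by subtracting the outer product of the first m principal-component loadings and rescaling the residual matrix to unit diagonal:

```python
import numpy as np

def velicer_map(R):
    """Velicer's minimum average partial (MAP) test.

    R is the N x N sample correlation matrix. Returns the number of
    principal components to retain (the m minimizing fm) together with
    the list of fm values, where fm is the average squared off-diagonal
    partial correlation after partialing out the first m components."""
    n = R.shape[0]
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]              # eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    off = ~np.eye(n, dtype=bool)                   # off-diagonal mask
    fm = [np.mean(R[off] ** 2)]                    # f0: raw squared correlations
    for m in range(1, n):
        A = eigvecs[:, :m] * np.sqrt(eigvals[:m])  # loadings of the first m PCs
        resid = R - A @ A.T                        # partial covariance matrix
        dvar = np.diag(resid)                      # residual variances
        if np.any(dvar < 1e-12):                   # variance exhausted; stop
            break
        d = np.sqrt(dvar)
        partial = resid / np.outer(d, d)           # partial correlation matrix
        fm.append(np.mean(partial[off] ** 2))      # fm for this m
    return int(np.argmin(fm)), fm
```

For data generated by a single common factor, the minimum typically falls at m=1, matching the one-component expectation discussed in the text.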

Real data analysis

To illustrate application to real world data, the number of significant principal components was estimated using both EIGENSOFT and Velicer's minimum average partial test for a previously described sample of 1018 unrelated African Americans, genotyped as part of the Howard University Family Study (Adeyemo et al., 2009). The 808 465 autosomal SNPs that passed quality control were pruned for linkage disequilibrium at r2 > 0.3 using PLINK, version 1.06 (Purcell et al., 2007). Of the SNPs that remained after pruning, 10 000 were randomly selected. HapMap phase III CEU and YRI samples were used to represent the presumed parental populations (The International HapMap 3 Consortium, 2010). After quality control, 108 YRI and 109 CEU individuals remained (Chen et al., 2010).

Results

Simulations

For the simulation of two populations separated by one divergence event (Figure 1a), the expected number of significant principal components was one if the divergence event occurred in the distant past, or zero if the divergence event occurred in the recent past (Figure 2). These expectations also hold for analysis of admixed individuals, because the expected allele frequencies for an admixed individual are linear mixes of the parental allele frequencies (Patterson et al., 2006; McVean, 2009). Consequently, admixed individuals will have coordinates along the axis defined by the two parental populations. This behavior holds regardless of whether the parental populations are included in the projection, which is important because sizes of proxy or reference samples are typically much smaller than sizes of admixed samples, and unequal sample sizes can severely distort the projection (Novembre et al., 2008; McVean, 2009).

Figure 2
figure 2

Representative projections of simulated data for two populations. (a–c) The divergence event between populations A (red circles) and B (blue circles) occurred 0 generations ago. (d–f) The divergence event occurred 2Ne generations ago. (a, d) Analysis of populations A and B. (b, e) Analysis of admixed individuals (gray circles) with average individual admixture proportions 78.2% population A and 21.8% population B. (c, f) Combined analysis of admixed individuals, population A and population B.

EIGENSOFT with Tracy–Widom statistics consistently overestimated the number of significant principal components in analysis of the two populations and performed worse in analysis of admixed individuals (Table 1). Velicer's minimum average partial test showed neither bias nor variance (Table 1). Both methods were more sensitive to lower levels of divergence in analysis of stratified populations than in the analysis of admixed individuals, despite the former analysis comprising on the order of 10² individuals and the latter on the order of 10³ individuals (Table 1).

Table 1 Number of significant principal components in the simulation of two populations

For the simulation of three populations separated by two divergence events (Figure 1b), the expected number of significant principal components was 0, 1 or 2, depending on the time since the divergence events (Figure 3). EIGENSOFT with Tracy–Widom statistics showed a larger upward bias for analysis of three populations (Table 2) than for analysis of two populations (Table 1), and again performed worse in analysis of admixed individuals (Table 2). Velicer's minimum average partial test yielded zero to very small biases and variances in both the analysis of three populations and the analysis of admixed individuals.

Figure 3
figure 3

Representative projections of simulated data for three populations. (a–c) The divergence event between populations B (blue circles) and C (black circles) occurred 0.0002Ne generations ago and the divergence of population A (red circles) occurred 0.002Ne generations ago. (d–f) The divergence event between populations B and C occurred 2Ne generations ago and the divergence of population A occurred 20Ne generations ago. (a, d) Analysis of populations A, B and C. (b, e) Analysis of admixed individuals (gray circles) with average individual admixture proportions 10% population A, 45% population B and 45% population C. (c, f) Combined analysis of admixed individuals, population A, population B and population C.

Table 2 Number of significant principal components in the simulation of three populations

Real data analysis

In analysis of 1018 unrelated African Americans, Adeyemo et al. (2009) retained two principal components on the basis of visual inspection of the scree plot (that is, eigenvalues sorted in descending order as a function of the eigenvalue index) from EIGENSOFT. Using Tracy–Widom statistics, EIGENSOFT identified 16 significant principal components (Figure 4). Using Velicer's minimum average partial test, only one principal component should be retained (Figure 5), one being the minimum number of principal components necessary to explain admixture between two parental populations in the absence of residual substructure.

Figure 4
figure 4

Top 16 principal components for the Howard University Family Study data using EIGENSOFT. All 16 principal components are statistically significant according to Tracy–Widom statistics. The bottom right panel shows the scree plot.

Figure 5
figure 5

Top 16 principal components for the Howard University Family Study data using Velicer's minimum average partial test. Only the top principal component is statistically significant. The bottom right panel shows the scree plot.

Discussion

Detecting population structure is important in both medical genetics and population genetics. EIGENSOFT is a widely used implementation of principal components analysis supplemented with statistics for formal hypothesis testing on the basis of the expected distribution of the largest eigenvalue (Johnstone, 2001; Patterson et al., 2006). In this study, I compared the test based on the Tracy–Widom distribution in EIGENSOFT to Velicer's minimum average partial test (Velicer, 1976) for determining the number of principal components to retain. Computer simulation under a coalescent model of vicariance (that is, divergence with no subsequent gene flow) revealed three trends regarding the estimation of the number of principal components to retain. One, EIGENSOFT increasingly overestimated the number of significant principal components for increasingly distant divergence events in unadmixed samples. Two, EIGENSOFT overestimated the number of significant principal components even more for samples of admixed individuals than for samples of unadmixed individuals. Three, EIGENSOFT increasingly overestimated the number of significant principal components as the number of populations increased. In contrast, Velicer's minimum average partial test showed almost no tendency to overestimate the number of principal components to retain in any of these scenarios.

Velicer's minimum average partial test is one of many possible stopping rules for determining the number of principal components to retain (Peres-Neto et al., 2005). Another possibility is sparse factor analysis (Engelhardt and Stephens, 2010), in which constraints encourage loadings to shrink toward zero. Whereas sparse factor analysis involves numerical optimization, which is stochastic, and can therefore yield different results from run to run, Velicer's minimum average partial test is deterministic and need only be run once. Also, sparse factor analysis does not provide a rule for determining the number of principal components to retain, whereas both Velicer's minimum average partial test and EIGENSOFT implement stopping rules.

EIGENSOFT was designed to work with genome-wide genotype data while accounting for linkage disequilibrium. The model of admixture used in the simulations assumes free recombination. Furthermore, the model assumes that haplotypes in the admixed individuals are inherited identical by descent from the parental populations, regardless of the number of generations since admixture. Taken together, these modeling assumptions eliminate the possibility that the observed overestimation of significant principal components by EIGENSOFT resulted from a failure to properly account for linkage disequilibrium. My procedure works well with 10 000 unlinked random markers (Gao and Martin, 2009; McVean, 2009), so that researchers can afford to aggressively prune genome-wide data sets to obtain a set of markers in linkage equilibrium and not have to account further for linkage disequilibrium (either background linkage disequilibrium or extended linkage disequilibrium due to admixture, both of which can induce spurious clustering (Falush et al., 2003; Kaeuffer et al., 2007)). At the same time, this number of markers is much greater than the number of individuals in the vast majority of genome-wide studies, such that sensitivity to detect population structure is limited by sample size (Patterson et al., 2006).

Unequal sample sizes can severely distort projections in principal component analysis (Novembre et al., 2008; McVean, 2009). This is a problem for analysis of admixed individuals if smaller reference samples of parental populations are included in the data set. On the one hand, this problem can be circumvented because analysis of admixed individuals does not require reference samples of parental populations (see Figures 2e and 3e). On the other hand, my procedure is less sensitive for detecting admixture without reference samples of parental populations than it is for detecting population stratification in samples of unadmixed individuals.

EIGENSOFT is known to overestimate the number of significant principal components for samples of admixed individuals (Patterson et al., 2006). This overestimation is not solely due to the use of ancestrally informative markers nor to extended linkage disequilibrium induced by admixture as previously suggested (Patterson et al., 2006), since a similar overestimation was observed in this study using random, unlinked markers. There are three major differences between EIGENSOFT and my procedure that might explain this problem. One, EIGENSOFT standardizes the sample covariance matrix, estimated from the genotype matrix, using the binomial variance of estimated allele frequencies, whereas Velicer's minimum average partial test uses the correlation matrix estimated from the genotype matrix. Thus, EIGENSOFT assumes Hardy–Weinberg equilibrium whereas my procedure does not. The assumption of Hardy–Weinberg equilibrium does not hold under either population stratification or admixture. Violation of Hardy–Weinberg equilibrium may partially explain why EIGENSOFT systematically overestimates the number of significant principal components.

Two, EIGENSOFT and my procedure use the results of eigendecomposition differently. EIGENSOFT performs formal hypothesis testing to determine the number of significant principal components by testing each eigenvalue using approximations to an external reference distribution, viz. the Tracy–Widom distribution (Patterson et al., 2006). When assessing a given eigenvalue for significance, EIGENSOFT discards all leading eigenvalues; consequently, EIGENSOFT systematically overestimates the proportion of variance explained for every eigenvalue after the first one. In contrast, Velicer's minimum average partial test does not rely on any external reference distribution. Rather, it determines the number of principal components to retain using an objective minimization function, in which the stopping point is based on data-dependent estimation of systematic noise (Velicer, 1976). Also, Velicer's minimum average partial test partials out the cumulative effect of all leading eigenvalues and eigenvectors from the original correlation matrix when it assesses a given eigenvalue and eigenvector for retention.

Three, EIGENSOFT estimates the effective number of markers on the basis of the empirical distribution of eigenvalues (Patterson et al., 2006). This value is used as a plug-in when estimating the mean and s.d. for the Tracy–Widom test statistic (Patterson et al., 2006). This moment estimator was derived under the null hypothesis of no structure (Patterson et al., 2006); its validity under the alternative hypothesis is unknown. These three differences may contribute to EIGENSOFT's overestimation of significance. Regardless, the principal components that are retained by Velicer's minimum average partial test can be used in the same way as those deemed significant by EIGENSOFT, for example, as covariates in association testing (Price et al., 2006).