Introduction

Understanding human population genetic structure remains important for gaining insights into human history and demography, as well as for investigating genetic diseases in relation to geography and ancestry1,2,3. Historically, human population divergence has been assessed using approaches from a variety of disciplines including archaeology, palaeontology, linguistics, climatology and genetics. Early studies of genetic divergence were conducted by investigating the genetic variability of mitochondrial DNA and Y chromosomes4,5. Currently, genome-wide SNP sites are used to measure population differentiation (e.g., the HapMap genotype data)6,7,8, and to search for outlier regions that are potentially associated with geographically restricted genetic diseases9. Different classes of genetic markers vary widely in many important characteristics, such as their mode of inheritance (paternal, maternal, or biparental), mutation rate, and degree of selective neutrality. As a result, population genetic divergence can vary depending on the class of genetic marker investigated. Copy-number-variable (CNV) loci are an important cause of genetic variation in human genomes, and give rise to differences of 4.8–9.5% in the overall length of human genomes10,11. However, population genetic divergence at genome-wide CNV loci has not been investigated in detail12,13, nor has genome-wide divergence at CNV loci been compared with that at SNP sites.

Genetic variation at CNV loci in Homo sapiens and other species has been extensively reviewed from a number of perspectives12,14. Topics covered include the mechanisms for generating copy number variation, natural selection on duplication and deletion variants, the impacts of demographic changes on CNV loci, associations with SNP loci, and the role of CNV loci in causing diseases10,11,12,15,16,17. At the population level, the evolutionary dynamics of CNV loci can be studied within the framework of population genetics12,14. Emerson et al.18 used an infinite-site model to investigate purifying selection on copy number variation in specific gene regions in Drosophila melanogaster. Sjödin and Jakobsson12 suggested the use of a K-allele model19 or a stepwise mutational model20 to describe the mutation process at CNV loci. The neutrality of CNV loci has also been analyzed21,22. We recently developed a three-allele model to test neutrality at CNV loci, and demonstrated selective neutrality at 856 CNV loci scored in 1184 healthy individuals from the HapMap genotype data set23. The evolution of these CNV loci can be essentially explained by a mutation-drift process23. Here, we proceed with the same dataset to investigate population genetic divergence at the genome-wide CNV loci.

In comparison with variation at SNP sites, variants at CNV loci have several distinct features. First, CNV variants often differ in length by 1 kbp or more24,25, whereas SNP variants differ by a single base pair. Thus, although CNV loci (~4.8–9.5% of human genomes) are much less abundant than SNP sites in human genomes, they represent an important type of chromosomal structural variation11. Second, more complex processes are involved in generating copy number variants, including non-allelic homologous recombination (NAHR)26, non-homologous end joining (NHEJ), and insertion of transposable elements (TEs)27,28. These differ dramatically from the mechanisms generating point mutations (transitions and transversions) at SNP sites. Third, the average mutation rate at CNV loci is expected to be much higher than the point mutation rates at SNP sites29, resulting in a much younger average age of alleles for CNV than for SNP loci in natural populations30. Given these differences in the properties of CNV and SNP markers, we anticipate that they will vary in their degree of population genetic divergence.

To test this hypothesis, we employ genotype data at CNV loci from the HapMap Phase III populations. This has two advantages. The first is that genetic divergence among these populations has been fully investigated at genome-wide SNP sites31, providing the opportunity for direct comparison with results for the genome-wide CNV loci. Analysis of CNV loci has so far only been conducted with a subset of the HapMap Phase III populations31 or at a particular gene site32. Our results should differ from existing analyses because they include more populations with a wider range of ancestry. Increasing the number of individuals will affect both the genetic divergence and the number of common CNV loci. The second advantage of using the HapMap dataset is that exact discrete copy numbers are available for each diploid genotype at each CNV locus31. Although techniques for detecting CNV loci have recently been improved, discrete copy-number genotypes at each CNV locus, which are also essential for accurate case-control association testing with CNV loci33, are rarely archived in publicly accessible data. Furthermore, the sample sizes in previous studies at CNV loci are often too small, and hence are inappropriate for population genetic structure analysis18,34,35. The large sample sizes in HapMap Phase III populations mean that the probabilities of making either false-positive or false-negative CNV calls are negligible23.

In this study we analyze genetic divergence at the genome-wide CNV loci and compare it with that at the genome-wide SNP sites in exactly the same populations. To further address the population genetic properties of CNV loci and reinforce our explanations of evolution at CNV loci, we test LDs at both gametic and zygotic levels among all pairs of CNV loci. We compare the patterns of gametic and zygotic LDs at CNV loci with those previously reported at SNP sites36,37. Recent theoretical studies indicate that zygotic LD is more informative than gametic LD for inferring the effects of different evolutionary forces (mating system, gene flow, selection, and genetic drift)38,39. In the absence of functional epistatic selective effects among loci, gametic LD (lower order) is always greater than the maximum zygotic LD in value. Other processes, including mating system, gene flow and genetic drift, do not change this pattern although they can generate LD (statistical associations between loci)38,39. The difference between the values of gametic LD and maximum zygotic LD can be used to infer whether epistasis exists between loci. Such differences tested previously at the genome-wide SNP sites with the HapMap Phase III populations37, have shown the existence of epistases among many SNP sites. Here, we also investigate this property at the genome-wide CNV loci by presuming that individual CNV loci are directly/indirectly or equally involved in fitness changes. Information from LD analyses among CNV loci helps us to view the difference in population genetic divergence between SNP and CNV loci from a different perspective. Overall our objective is to infer the roles of mutation and migration in producing human population genetic divergence at the genome-wide CNV loci by comparing the single and multilocus population genetic structure of SNP and CNV loci.

Results

Population genetic divergence

Maximum likelihood estimates (MLEs) of allele frequencies are summarized in Table S1. Although all CNV loci are polymorphic in the pooled population, they exhibit various levels of polymorphisms among populations (Table 1). More than 80% of CNV loci are polymorphic in African populations (ASW, LWK, MKK, and YRI), but less than 60% in non-African populations except MEX (62.38%). Three Asian populations (CHB, CHD, and JPT) have about 45% polymorphic CNV loci.

Table 1 Sample sizes and polymorphisms at the genome-wide CNV loci in 11 human populations.

African populations have 1.84–1.90 alleles per CNV locus while Asian populations have about 1.50 alleles per CNV locus. The rest of the 11 populations have intermediate numbers of alleles per locus (Na = 1.6–1.66). Similarly, African populations have high gene diversity over all CNV loci (He = ~0.13) and small standard deviations (~0.15); while Asian populations have low gene diversity (~0.11) but large standard deviations (~0.16) over all CNV loci. The rest of the 11 populations have intermediate gene diversity and standard deviations (Table 1).

Genetic differentiation measured by Gst is 0.0498 ± 0.0491 among all CNV loci, and most individual Gst values are around 0.05, with a few CNV loci having relatively large Gst values (Fig. 1). Substantial variations exist among chromosomes, especially for the small Gst values that are outside the 95% CIs (Fig. 2). The proportions of CNV loci exhibiting a significantly low level of population genetic divergence are 72.72% on Chr 1, 51.35% on Chr 4, 76.6% on Chr 5, 84.48% on Chr 6, 76.67% on Chr 7, 56.26% on Chr 9, 62.8% on Chr 11, 52.63% on Chr 17, 60.87% on Chr 19, and 90.91% on Chr 22. The rest of the chromosomes have less than 50% of CNV loci with a significantly low level of population differentiation. None of the chromosomes has any CNV locus exhibiting a significantly high level of population differentiation (Fig. 2).

Figure 1: A histogram of Gst distribution at 856 CNV loci.

The abscissa axis is the Gst values. The curve is based on the kernel-smoothed density function.

Figure 2: Gst values across chromosomes at CNV loci.

The observed Gst values are in red, and their 95% CIs are derived from 1000 bootstrapping samples on each chromosome. The lines with opened and closed circles are the lower and upper Gst values of 95% CIs, respectively. The abscissa axis is the positions for CNV loci on each chromosome in Mb, and the ordinate axis is the Gst values.

The average pairwise multilocus Gst ranges from 0.0038 ± 0.00001 (CHB-CHD) to 0.0421 ± 0.0001 (JPT-LWK), with the mean of 0.0255 ± 0.0114 over all pairs (Table 2). The average pairwise multilocus Gst in African populations ranges from 0.0059 ± 0.00001 (LWK-YRI) to 0.0128 ± 0.00002 (MKK-YRI), with the mean of 0.0081 ± 0.0025 over population pairs. The average pairwise multilocus Gst in non-African populations ranges from 0.0038 ± 0.00001 (CHB-CHD) to 0.0352 ± 0.0001 (TSI-JPT), with the mean of 0.0212 ± 0.0109 over population pairs. The average pairwise multilocus Gst among African and non-African populations ranges from 0.0206 ± 0.00004 (MKK-TSI) to 0.0421 ± 0.0001 (JPT-LWK), with the mean of 0.0324 ± 0.0064 over population pairs.

Table 2 Comparison of the pairwise Gst(CNV) at the genome-wide CNV loci with the pairwise Fst(SNP) at the genome-wide SNP sites7.

Compared with the pairwise multilocus Fst previously reported at the genome-wide SNP sites7, the pairwise multilocus Gst at the genome-wide CNV loci is generally more than three times lower (average ratio of Fst(SNP)/Gst(CNV) = 3.3081 ± 1.1837; Table 2). The ratios of Fst(SNP)/Gst(CNV) range from 1.3481 ± 0.0171 (LWK-YRI) to 2.1023 ± 0.0087 (MKK-LWK) in African populations, with the mean of 1.6849 ± 0.3294 over population pairs; from 0.2649 ± 0.0265 (CHB-CHD) to 3.6545 ± 0.0253 (CEU-CHD) in non-African populations, with the mean of 2.5048 ± 0.9240 over population pairs; and from 3.5497 ± 0.0200 (ASW-GIH) to 4.8624 ± 0.0258 (CHD-MKK) among African and non-African populations, with the mean of 4.2584 ± 0.3548 over population pairs (Table 2).

Inter-chromosomal variations in pairwise Gst values are substantial among different population pairs (Figure S1a), indicating the presence of differential divergences among chromosomes during the formation of populations. The pairs among African and non-African populations have large variations among chromosomes, especially on Chrs 9, 10, 16, 20, and 22 (Figure S1a), while the pairs among African populations or among non-African populations exhibit relatively stable divergences among chromosomes (e.g., CHB-JPT and CEU-CHB; Figure S1b).

Pairwise Nei’s genetic distances at multiple CNV loci range from 0.001 ± 0.000004 (CHB-CHD) to 0.0241 ± 0.0001 (CHD-YRI), with a mean of 0.0124 ± 0.0067 over all pairs (Table S2). The average genetic distance is 0.0029 ± 0.0010 among African populations, 0.0085 ± 0.0049 among non-African populations, and 0.0174 ± 0.0040 among African and non-African populations. Cluster analysis with the unweighted pair group method with arithmetic mean (UPGMA) shows that the three subgroups (African, Asian, and the rest of the populations) are clearly distinguished (Fig. 3). Bootstrapping resample trees (1000) using PHYLIP40 indicate that African and non-African populations can be separated with a probability of 100% (data not shown here).

Figure 3: Cluster analysis of 11 human populations.

The plot is based on Nei’s genetic distance by using UPGMA for hierarchical clustering.

Assume an average mutation rate of the order of $10^{-5}$ at a CNV locus29, equal effective population sizes among the 11 populations, and 25 years per generation. From the average distance and its approximate variance V(t), and with $1/(2\mu) = 5 \times 10^{4}$, the population isolation time is about $t = 0.0124 \times 5 \times 10^{4} \times 25 \pm 0.0067 \times 5 \times 10^{4} \times 25 = 15500 \pm 8375$ years among all populations, about t = 3625 ± 1250 years among African populations, about t = 10625 ± 6125 years among non-African populations, and about t = 21750 ± 5000 years among African and non-African populations.
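The arithmetic behind these time estimates can be retraced with a short sketch. It assumes the relationship t = D/(2μ) in generations, a mutation rate of exactly 1e-5, and 25 years per generation; the function name and the dictionary of group labels are illustrative only and are not part of the original analysis.

```python
def divergence_time_years(distance, mutation_rate=1e-5, years_per_generation=25):
    """Convert Nei's genetic distance to years, assuming t = D / (2 * mu)."""
    generations = distance / (2.0 * mutation_rate)
    return generations * years_per_generation


# Mean distance and standard deviation for each group, as quoted in the text.
groups = {
    "all pairs": (0.0124, 0.0067),
    "African": (0.0029, 0.0010),
    "non-African": (0.0085, 0.0049),
    "African vs non-African": (0.0174, 0.0040),
}

for name, (mean_d, sd_d) in groups.items():
    t = divergence_time_years(mean_d)
    sd_t = divergence_time_years(sd_d)  # the SD scales by the same factor 1/(2*mu)
    print(f"{name}: {t:.0f} +/- {sd_t:.0f} years")
```

Because t is linear in D, the standard deviation of t is the standard deviation of D multiplied by the same constant, which matches V(t) = V(D)/4μ².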

Gametic and zygotic LDs

Statistical tests indicate that very few pairs of CNV loci, 0.027–0.073%, exhibit significant gametic LDs in the 11 populations (Table 3; Table S3 for details). Most pairs of CNV loci have insignificant gametic LDs in each population. Among the significant gametic LDs, African populations generally have a lower proportion of CNV locus pairs with significant gametic LDs than do most non-African populations (Table 3). The average significant r-squares are higher for CNV loci from the same chromosome (~0.76) than from different chromosomes (~0.16). Among the significant gametic LDs on the same chromosomes, more pairs come from partially overlapped CNV loci in each population (Table 3).

Table 3 Means and standard deviations of significant gametic LDs (r-squares) in 11 human populations * .

Patterns of gametic LDs are different among populations. African populations have more significant gametic LDs from different chromosomes than from the same chromosomes, while non-African populations except CEU and MEX have more significant gametic LDs from the same chromosomes than from different chromosomes. No common pairs of CNV loci have significant gametic LDs on different chromosomes among 11 populations, but twelve common pairs from overlapped CNV loci (except one on Chr 7) exist, with 1 on each of Chrs 1, 7, 9, 11, and 12, 3 on Chr 5, and 4 on Chr 6 (Table S4).

Tests of zygotic LDs also indicate that very few pairs of CNV loci have significant zygotic LDs, 0–0.0359% (Table 4), which is generally less than the proportion of significant gametic LDs (Table 3). Most CNV loci with significant zygotic LDs are partially overlapped on the same chromosomes (Table S5). African populations have fewer significant zygotic LDs than do most non-African populations in significant Dij (i, j = 0, 1, 2) except D 3j (j = 0, 1, 2, 3). There are twenty-two common pairs of CNV loci (mostly overlapped) with significant zygotic LDs in 11 populations, with 1 pair on Chr 1, 3 on Chr 5, 7 on Chr 6, 3 on Chr 7, 1 on Chr 9, 2 on Chr 10, 2 on Chr 11, 1 on Chr 12, 1 on Chr 13, and 1 on Chr 20 (Table 4). These locus pairs also have significant gametic LDs, while some CNV loci with significant zygotic LDs have no significant gametic LDs in 11 populations (Table S4).

Table 4 Percentages of the pairs of CNV loci with significant zygotic LDs in 11 human populations * .

For all CNV loci the maximum zygotic LD is smaller than the gametic LD in value, indicating that no epistatic effects exist between CNV loci. Both gametic and zygotic LD analyses indicate that these CNV loci are essentially in linkage equilibrium except for a few overlapped loci in each population.

Joint migration and mutation rates

From the pairwise multilocus Gst(CNV) (Table 2) and the pairwise multilocus Fst(SNP)7,31, the ratios of the joint migration and mutation rates at CNV loci (mc + 3μc/2) to those at SNP sites (ms + 2μs) are estimated according to equations (9) and (12) (Table 5). The ratios range from 0.2624 ± 0.0263 (CHB-CHD) to 5.7238 ± 0.0375 (CHD-MKK), with the mean of 3.6600 ± 1.4188 over all pairs. The ratios range from 1.4126 ± 0.0144 (ASW-LWK) to 2.1402 ± 0.0088 (MKK-YRI) in African populations, with the mean of 1.6988 ± 0.3411 over population pairs; from 0.2624 ± 0.0263 (CHB-CHD) to 3.9942 ± 0.0311 (CEU-CHD) in non-African populations, with the mean of 2.6796 ± 1.0224 over population pairs; and from 3.8132 ± 0.0266 (ASW-GIH) to 5.7238 ± 0.0375 (CHD-MKK) among African and non-African populations, with the mean of 4.8157 ± 0.4929 over population pairs.

Table 5 Ratios of the joint mutation and migration rates at CNV loci to those at SNP sites (above diagonal), and the ratios of the mutation rate to the migration rate at CNV loci (below diagonal).

Using the average pairwise Fst(SNP)7,31 = 0.0956 ± 0.0567 and the average pairwise Gst(CNV) = 0.0255 ± 0.0114 across all population pairs, we obtain (mc + 3μc/2)/(ms + 2μs) = 4.0396 ± 3.2341, where a large standard deviation arises from the variation among populations. The above estimates indicate that the joint migration and mutation rates are generally much greater at the genome-wide CNV loci than at the genome-wide SNP sites.
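Equations (9) and (12) are not reproduced in this section, so the following is only a sketch under an assumed form: the usual island-model equilibrium approximation F ≈ 1/(1 + 4N(m + kμ)), with k = 2 for SNP sites, k = 3/2 for three-allele CNV loci, and the same effective size N for both marker types. Under that assumption N cancels and the ratio of joint rates depends only on the two differentiation estimates. The function name is hypothetical.

```python
def joint_rate_ratio(fst_snp, gst_cnv):
    """Ratio (mc + 3*mu_c/2) / (ms + 2*mu_s), ASSUMING F = 1 / (1 + 4N * joint_rate).

    Solving the assumed equilibrium for the joint rate gives
    4N * joint_rate = 1/F - 1, so N cancels in the ratio.
    """
    return (1.0 / gst_cnv - 1.0) / (1.0 / fst_snp - 1.0)


# Average pairwise values quoted in the text for the 11 populations.
ratio = joint_rate_ratio(fst_snp=0.0956, gst_cnv=0.0255)
print(round(ratio, 4))
```

With the averages quoted above, this assumed form gives a ratio of about 4.04, in line with the reported 4.0396.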

The ratio of the mutation rate to the migration rate at CNV loci can be approximately quantified. According to equations (13) and (14), estimates of μc/m are summarised in Table 5; they range from 0.0352 ± 0.0177 (TSI-CEU) to 3.1492 ± 0.0250 (CHB-MEX), with a mean of 1.8153 ± 0.9016 over population pairs (except for a negative value for the CHB-CHD pair). The mutation rate is generally smaller than the migration rate among African populations (0.2392 ± 0.0115 to 0.7601 ± 0.0055; Table 5), but is greater than the migration rate among non-African populations (1.2036 ± 0.5881) or among African and non-African populations (2.4655 ± 0.4384).

The estimate of μc/m is 2.0264 ± 2.1561 from the ratio (mc + 3μc/2)/(ms + 2μs) = 4.0396 ± 3.2341 in the 11 populations, and 2.0352 ± 2.0909 from (mc + 3μc/2)/(ms + 2μs) = 4.0529 ± 3.1364 in four populations (CEU, YRI, CHB, and JPT)7,8 (average pairwise Fst(SNP) = 0.1265 ± 0.0675; average pairwise Gst(CNV) = 0.0345 ± 0.0158 in the present study). These estimates indicate that the mutation rate at CNV loci is generally about twice as large as the migration rate.
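Equations (13) and (14) are not shown in this section. One reading that is numerically consistent with the figures above is to assume the same migration rate m for both marker types and a SNP mutation rate that is negligible relative to m, so that the joint-rate ratio R ≈ (m + 3μc/2)/m and hence μc/m ≈ (R − 1)/1.5. The sketch below encodes that assumption only; the function name is hypothetical.

```python
def mutation_to_migration_ratio(joint_ratio):
    """mu_c / m, ASSUMING joint_ratio = (m + 1.5 * mu_c) / m.

    This presumes equal migration rates at CNV and SNP loci and a
    negligible SNP mutation rate relative to the migration rate.
    """
    return (joint_ratio - 1.0) / 1.5


print(round(mutation_to_migration_ratio(4.0396), 4))  # 11 populations
print(round(mutation_to_migration_ratio(0.2624), 4))  # CHB-CHD pair: ratio below 1
```

Under this reading, any pair whose joint-rate ratio falls below 1, such as CHB-CHD (0.2624), yields a negative estimate, which agrees with the negative value noted for that pair.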

Discussion

Our results indicate a closer population genetic relationship at CNV loci than at SNP sites among 11 HapMap Phase III populations. Previous reports indicate a similar pattern at specific loci among African, European and East Asian populations (HapMap Phase II data)41, or among HapMap Phase II populations (Fst = ~0.11 at the genome-wide SNP sites)42. A general similarity in relative population genetic structure at CNV loci and SNP sites is also reported with more populations (29) and fewer CNV loci (396) and individuals (405 in total), but the difference is not quantified13. LD analyses indicate that these CNV loci are essentially in linkage equilibrium except for a few overlapped loci. Epistasis does not exist for any pair of CNV loci, presuming that these CNV loci are not selectively neutral or equally additive in influencing fitness. This result is different from those at the genome-wide SNP sites where epistasis occurs among many intron SNPs37. The results provide additional support for a recent report indicating that the 856 CNV loci are selectively neutral in each population23. The evolutionary processes for the low level of population divergences are different from those at the nonsynonymous SNP sites with Fst < 5% where negative selection is thought to be involved31.

Note that our analyses are based on the three-allele system for describing the evolution at a CNV locus because the maximum number of allele copies is four in a diploid genotype. These 856 CNV loci are shown to exhibit neutrality among 1184 healthy individuals23. A system of more than three alleles is needed when more than four allele copies occur in a genotype at any CNV locus. This could likely occur when fewer individuals are surveyed or when unhealthy individuals are included because the number of common CNV loci could become fewer with smaller sample sizes. Under this situation, a neutrality test at CNV loci is needed for small sample sizes, and the extent of population genetic divergence could be different from the results reported here. This needs further verification.

Nei’s genetic distance at the genome-wide CNV loci is generally comparable to those between human populations at the common protein or blood group loci43. However, African populations have even smaller genetic divergence at CNV loci. In the process of mutation-drift at the 856 CNV loci23, population differentiation is expected to occur more recently owing to the high mutation rates at CNV loci. The time estimates since divergence are much shorter than those for general population genetic divergence in humans estimated from common protein loci (~120 Kyrs between human populations43), or than the postulated time (>100 Kyrs) for modern humans to leave Africa and colonize the rest of the world. Because the assumption for $D = 2\mu t$43,44 is violated due to the unequal effective population sizes among populations45,46, the varying mutation rates among loci, and the finite number of alleles at a CNV locus (not the infinite-allele model)23, the preceding estimates might provide a reference for the minimum divergence times.

Patterns of genetic divergence at CNV loci may reflect the historical divergence in forming modern human origins. The common pattern at both CNV and SNP loci is that the smallest genetic divergence is present among African populations, followed by among non-African populations, and then among African and non-African populations. Polymorphisms at CNV loci decrease from African to non-African populations. More alleles per CNV locus in African populations suggest a longer-term accumulation of mutants. These patterns are consistent with the Out of Africa model rather than with the multiregional model for modern human origins47,48. Genetic drift effects reduce genetic diversity in non-African populations. Further inferences on the evolutionary processes occurring among non-African populations would require additional information besides the comparison of polymorphisms at CNV loci. Nevertheless, the genetic relationships among non-African populations show a clear separation of Asian populations from non-Asian populations. Evidence at genome-wide CNV loci supports the hypothesis that CHB and CHD have a very close genetic relationship. This is slightly different from the genetic relationships revealed by the patterns of zygotic and gametic LDs at the genome-wide SNP sites where JPT and CHD have a very close genetic relationship37. Genetic drift effects could explain the relatively small differentiation in polymorphism at CNV loci in Asian and European populations. Both CHB and CHD have relatively smaller genetic drift effects than JPT45, and hence have higher polymorphisms (1.50 vs 1.48 alleles per CNV locus). CEU probably has relatively smaller genetic drift effects than do CHB and JPT45, and hence has more alleles per CNV locus (1.66 alleles per CNV locus). A relatively high level of polymorphisms in MEX among non-African populations probably arises from an admixture of individuals with multiple distinct ancestries, which is consistent with previous explanations37,49.

Because both mutation and migration reduce population genetic divergence50, the combined patterns of genetic divergence at CNV and SNP loci provide us with an opportunity to address their relative roles. Previous reports51 indicate that the mutation rates at CNV loci are about $1.7 \times 10^{-6}$ to $1.0 \times 10^{-4}$, about 100–10000 times the point mutation rate at SNP sites ($1.8$–$2.5 \times 10^{-8}$). Fu et al.29 indicate that the mutation rate for most CNV loci is of the order of $10^{-5}$ per CNV locus per generation. On average, a mutation rate of the order of $10^{-5}$ at the 856 CNV loci could be inferred from the estimate of the population-scaled mutation rate θ (= 4Nμ) = 0.1415 ± 0.014423, given N ~ 300045. Patterns of the μc/m estimates suggest a dominant role that the mutation process plays in shaping population genetic divergence at CNV loci, especially in the non-African populations (Table 5). The low μc/m in African populations could likely arise from their closer genetic relationships where the inter-population gene exchanges are historically more frequent or from natural evolutionary convergence where their genetic compositions become similar since ancestral populations. However, statistical tests indicate that the mutation-drift process can explain the variation at CNV loci in African populations, implying that the latter process could be the main reason for low genetic divergence23.

In comparison with the previous results (Gst ~ 0.11) at a few CNV loci10 (67 CNV loci and n = 270 in total) or at the locus of a specific gene CCL4L32 in four HapMap populations (YRI, CEU, and CHB + JPT), our investigation shows much lower population genetic divergence at the 856 CNV loci among these four populations (mean Gst = 0.0345 ± 0.0158; Table 2). This result indicates that the CNV loci shared among 1184 healthy individuals exhibit smaller population genetic divergence. Also, compared with the pairwise Fst across chromosomes at the genome-wide SNP sites (Fig. 2 in Baye8), a similarity in pattern at the genome-wide CNV loci exists (Figure S1). The difference is the presence of low population genetic divergence at CNV loci.

A caveat in the above inferences is that they are based on the assumption of equilibrium among the processes of mutation, drift, and migration at CNV and SNP loci in human populations. As with conventional population genetics analyses in different organisms, such an equilibrium might not be attained in reality, and a dynamic model of evolution is more realistic for further investigation. However, concerning the estimates of μc/m, the qualitative conclusion about the major effects of mutation on population genetic divergence cannot be rejected at the genome-wide CNV loci29, especially in non-African populations.

Although small LDs are difficult to detect owing to the statistical power, very few CNV loci exhibit significant gametic and zygotic LDs from either the same or different chromosomes. This is different from the patterns at the genome-wide SNP sites (Hu and Hu37 for zygotic LDs with the recombination rate <10%, Reich et al.36 for gametic LDs with the recombination rate <16%, and Koch et al.52 for gametic LDs with the recombination rate >25%). The CNV loci on the same chromosomes (except a few overlapped loci) are distributed over a wide range of distances, with an average recombination rate of 3.3% (0~35%). The significant correlations among CNV loci do not exist across populations53. The generally concordant pattern of no significant gametic and zygotic LDs provides no evidence for the presence of functionally epistatic CNV loci26,27, different from the results at genome-wide SNP sites37.

Patterns of LDs also suggest that the effects of mutation on reducing LDs are stronger than the effects of migration that increases LDs. The gametic LDs at CNV loci gradually decay with time in African populations, and the same is the case for the zygotic LDs at CNV loci53, except for the overlapped CNV loci (but not for one pair of CNV loci on Chr 7 with a physical distance of 2658 bp that requires a longer time to decay). The gametic LDs at CNV loci initially formed by the founder effects in non-African populations also decay with time due to the mutation and recombination effects. The same is the case for the zygotic LDs38. If recombination is the dominant process in eroding LDs, a certain proportion of CNV loci could maintain significant LD within very short distances except for overlapped loci. Such an expected pattern is not observed (Tables S4 and S5). High mutation rates causing low LDs between CNV and SNP loci are also discussed54. Thus, the mutation effects could be greater than the recombination effects in eroding both gametic and zygotic LDs although recombination and mutation effects are both involved in reducing LDs55.

Finally, our investigation suggests differential evolutionary processes at CNV and SNP loci along chromosomes. Although mosaic patterns occur in genome architecture in terms of different measures of genetic diversity or from different perspectives53, the DNA segments with CNV loci themselves display individual blocks each with a small level of population genetic divergence. These blocks are different from the gametic or zygotic LD blocks at SNP sites since recombination within CNV loci should rarely occur. The LD blocks between CNV loci cannot be maintained due to the effects of the high mutation rates.

Methods

Genotype data at CNV loci

Genotype data in 11 HapMap Phase III populations, released by The International HapMap 3 Consortium, were downloaded from ftp://ftp.ncbi.nlm.nih.gov/hapmap/cnv_data/hm3_cnv_submission.txt. The data differ from most accessible data sets in that they provide the discrete copy numbers per CNV locus. The copy numbers at a CNV locus are derived through a two-step process according to Altshuler et al.31 The first step is to detect copy number variation on each chromosome by analyzing the probe-level intensity data from both the Affymetrix and Illumina arrays. QuantiSNP56 and Birdseye57 algorithms are used to identify CNV loci separately. Common CNV loci are further identified, and refined to ensure qualified copy number variant calls. The second step is to determine the discrete copy numbers for each CNV locus from the probe-level intensity data. CNVtools33 and a two-dimensional model (Gaussian mixture)31 are used to infer the copy numbers from the maximum posterior likelihood function. A meta-approach combining the two algorithms and other criteria is used to further refine the discrete copy number classes to ensure reliable copy number estimates per diploid genome. This second step for estimating the copy number per CNV locus is not conducted in most archived CNV data sets although later techniques for CNV locus detection are now more advanced.

Diploid genotypes were recorded in integers (0, 1, 2, 3, and 4): 0 for the genotype without any allele copy in both gametes, 1 for the genotype with one allele copy in one gamete but without any copy in the other gamete, 2 for the genotype with one allele copy in each gamete, 3 for the genotype with one allele copy in one gamete and two allele copies in the other gamete, and 4 for the genotype with two allele copies in each gamete. From the individual IDs in the HapMap project, eleven populations were extracted from the pooled data (hm3_cnv_submission.txt): ASW (African ancestry in Southwest USA), CEU (Utah residents with Northern and Western European ancestry from the CEPH collection), CHB (Han Chinese in Beijing, China), CHD (Chinese in Metropolitan Denver, Colorado), GIH (Gujarati Indians in Houston, Texas), JPT (Japanese in Tokyo, Japan), LWK (Luhya in Webuye, Kenya), MEX (Mexican ancestry in Los Angeles, California), MKK (Maasai in Kinyawa, Kenya), TSI (Toscans in Italy), and YRI (Yoruba in Ibadan, Nigeria). Sample size for each population is shown in Table 1. The number of CNV loci per Chr ranges from 11 on Chr 22 to 68 on Chr 2, with 856 common CNV loci in total. Mean size of CNV loci per Chr is ~0.02 Mb, ranging from 26 to 456897 bp. The physical distance between adjacent CNV loci per Chr is ~3.3 Mb on average, ranging from 0 (partially overlapped loci) to 34804235 bp. There are 29 CNV loci that are partially overlapped on chromosomes.

Allele frequency

Because the maximum number of allele copies at a CNV locus is four in the diploid genotype dataset of HapMap Phase III populations, a three-allele system is used to describe the genotype composition. Note that a system of more than three alleles is needed if the number of allele copies exceeds 4 in a diploid genotype23,58. Let A0, A1, and A2 be the alleles with 0, 1, and 2 copies at a CNV locus, respectively. Allele A1 may be the most abundant variant in a population (the segment on the reference genome), while alleles A0 and A2 are likely less abundant at a CNV locus. Because distinct genotypes with the same diploid copy number cannot be separated (e.g., A1A1 and A0A2 both carry two copies), allele frequencies under Hardy-Weinberg equilibrium (HWE) were estimated using the expectation-maximization (EM) algorithm23,29,59,60. Polymorphism was measured in terms of the number of observed alleles per CNV locus (Na), the percentage of polymorphic loci, P(99%), and the genetic diversity in a population ($1 - \sum_u p_u^2$, where $p_u$ is the uth allele frequency), which is equal to the expected heterozygosity (He) under HWE.
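The EM estimation of three-allele frequencies can be illustrated with a minimal Python sketch. It assumes HWE and that only the two-copy class is ambiguous (A1A1 versus A0A2); the function names, the starting values, and the example counts are hypothetical and are not taken from the HapMap data or from the original implementation.

```python
def em_allele_freqs(counts, tol=1e-10, max_iter=1000):
    """Estimate (p0, p1, p2) from counts of diploid copy numbers 0..4 under HWE.

    counts[c] is the number of individuals with diploid copy number c.
    Copy number 2 is the only ambiguous class: A1A1 or A0A2.
    """
    n0, n1, n2, n3, n4 = counts
    n = sum(counts)
    p0 = p1 = p2 = 1.0 / 3.0  # arbitrary starting point (assumption)
    for _ in range(max_iter):
        # E-step: share of the copy-number-2 class expected to be A1A1.
        f11 = p1 * p1
        f02 = 2.0 * p0 * p2
        w11 = f11 / (f11 + f02) if (f11 + f02) > 0 else 0.0
        # M-step: expected allele counts divided by the 2n gene copies.
        a0 = 2 * n0 + n1 + n2 * (1.0 - w11)
        a1 = n1 + 2 * n2 * w11 + n3
        a2 = n2 * (1.0 - w11) + n3 + 2 * n4
        new = (a0 / (2 * n), a1 / (2 * n), a2 / (2 * n))
        change = max(abs(new[0] - p0), abs(new[1] - p1), abs(new[2] - p2))
        p0, p1, p2 = new
        if change < tol:
            break
    return p0, p1, p2


def expected_heterozygosity(freqs):
    """Genetic diversity 1 - sum(p_u^2), equal to He under HWE."""
    return 1.0 - sum(p * p for p in freqs)


# Hypothetical counts for copy numbers 0, 1, 2, 3, 4 in 10,000 individuals.
counts = [400, 2800, 5300, 1400, 100]
freqs = em_allele_freqs(counts)
he = expected_heterozygosity(freqs)
```

The example counts were constructed to match HWE proportions for allele frequencies (0.2, 0.7, 0.1), so the iteration should return values close to these.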

Genetic divergence

Population genetic differentiation was measured by Gst44: Gst = 1 − Hs/Ht, where Hs is the mean of the expected heterozygosity (He) per locus over all populations and Ht is the expected heterozygosity per locus in the pooled population. The 95% confidence intervals (CIs) for Gst were derived using the bootstrapping approach. To relate population genetic differentiation to the time since the populations diverged from a single ancestral population, genetic distance was measured46. This distance develops under a specific evolutionary process. Nei’s genetic distance44 was used to measure population genetic divergence: D = −ln(I), where $I = \sum_l \sum_u p_{lu1} p_{lu2} / \sqrt{\sum_l \sum_u p_{lu1}^2 \, \sum_l \sum_u p_{lu2}^2}$, in which $p_{lu1}$ and $p_{lu2}$ are the frequencies of the uth allele at the lth locus in populations 1 and 2, respectively. Under the neutral process (mutation and genetic drift), Nei’s genetic distance is linearly related to the time since divergence (t), i.e. D = 2μt (refs 44, 46), and its approximate variance is V(t) = V(D)/(4μ²), given a mutation rate μ. Standard deviations for Gst and Nei’s genetic distance were calculated using the jackknife method46.
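As an illustration, the two divergence measures and a leave-one-locus-out jackknife can be sketched in Python. The sketch assumes unweighted pooling of population allele frequencies for Ht and uses Nei's standard distance; the function names and the example frequencies are hypothetical.

```python
import math


def expected_het(freqs):
    """Expected heterozygosity 1 - sum(p_u^2) at one locus."""
    return 1.0 - sum(p * p for p in freqs)


def gst(pop_freqs):
    """Gst = 1 - Hs/Ht at one locus.

    pop_freqs is a list (one entry per population) of allele-frequency lists.
    Ht uses the unweighted mean frequency across populations (assumption).
    """
    k = len(pop_freqs)
    hs = sum(expected_het(f) for f in pop_freqs) / k
    pooled = [sum(f[u] for f in pop_freqs) / k for u in range(len(pop_freqs[0]))]
    ht = expected_het(pooled)
    return 1.0 - hs / ht if ht > 0 else 0.0


def nei_distance(pop1, pop2):
    """Nei's distance D = -ln(I); pop1 and pop2 are lists over loci."""
    j12 = sum(a * b for f1, f2 in zip(pop1, pop2) for a, b in zip(f1, f2))
    j1 = sum(a * a for f1 in pop1 for a in f1)
    j2 = sum(b * b for f2 in pop2 for b in f2)
    return -math.log(j12 / math.sqrt(j1 * j2))


def jackknife_sd(pop1, pop2):
    """Leave-one-locus-out jackknife standard deviation of Nei's distance."""
    n_loci = len(pop1)
    est = [nei_distance(pop1[:l] + pop1[l + 1:], pop2[:l] + pop2[l + 1:])
           for l in range(n_loci)]
    mean = sum(est) / n_loci
    return math.sqrt((n_loci - 1) / n_loci * sum((e - mean) ** 2 for e in est))


# Hypothetical three-allele frequencies at two loci in two populations.
pop_a = [[0.2, 0.7, 0.1], [0.1, 0.8, 0.1]]
pop_b = [[0.3, 0.6, 0.1], [0.1, 0.7, 0.2]]
d_ab = nei_distance(pop_a, pop_b)
gst_locus1 = gst([pop_a[0], pop_b[0]])
sd_ab = jackknife_sd(pop_a, pop_b)
```

Given a mutation rate μ, a divergence time would then follow as t = D/(2μ) under the neutral model described above.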

LD tests

To assess the properties of CNV loci relevant for interpreting population genetic divergence, both gametic and zygotic LDs were tested in each population. Assuming that CNV loci are involved in fitness, a comparison of the gametic LD with the maximum zygotic LD can be used to determine whether epistasis occurs among loci37,38,39. If the maximum zygotic LD (high-order LD) exceeds the gametic LD (low-order LD), epistasis exists between loci; otherwise it does not (additive or neutral effects). This relationship has been applied to the analysis of genome-wide SNP sites37, providing evidence of epistasis among many intron SNP sites in each of the 11 populations. For a pair of CNV loci each with three alleles, there are nine types of two-locus gametes. Let dij (i, j = 0, 1, 2) be the gametic LD between allele i at the first locus and allele j at the second locus, and pij be the gametic frequency in the population. The MLE of the frequency of a genotype pair, $\hat{P}_{st}$ (s, t = 0, 1, 2, 3, 4), can be obtained using the direct counting method. An EM method is used to estimate the gametic frequency through an iterative calculation, which is described below:

where δij, a Kronecker delta variable, is equal to 1 when i = j, and 0 when i ≠ j. Note that the E- and M-steps are combined into one formula in equation (1). Thus, given the initial gametic frequency pij (i, j = 0, 1, 2), the gametic frequency at the next step puv can be calculated using equation (1). Then, replace pij in equation (1) with puv and recalculate puv at the next step. This iterative calculation is repeated until the convergence of gametic frequencies is attained.
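A generic EM iteration of this kind can be sketched in Python. The sketch is not a transcription of equation (1): it enumerates, for each observed pair of diploid copy numbers, all ordered gamete pairs that could produce it and weights them by the current gametic frequencies, assuming random union of gametes. The names, the uniform starting point, and the example counts are hypothetical.

```python
from itertools import product

# Each gamete is (allele copies at locus 1, allele copies at locus 2).
GAMETES = list(product(range(3), repeat=2))


def em_gametic_freqs(pair_counts, tol=1e-10, max_iter=10000):
    """pair_counts maps (s, t), the diploid copy numbers at loci 1 and 2, to counts."""
    p = {g: 1.0 / len(GAMETES) for g in GAMETES}  # uniform start (assumption)
    # Ordered gamete pairs compatible with each observed genotype pair.
    compat = {
        st: [(g1, g2) for g1 in GAMETES for g2 in GAMETES
             if g1[0] + g2[0] == st[0] and g1[1] + g2[1] == st[1]]
        for st in pair_counts
    }
    for _ in range(max_iter):
        expected = {g: 0.0 for g in GAMETES}
        for st, cnt in pair_counts.items():
            # E-step: posterior weight of each compatible gamete pair.
            weights = [p[g1] * p[g2] for g1, g2 in compat[st]]
            total = sum(weights)
            if total == 0.0:
                continue
            for (g1, g2), w in zip(compat[st], weights):
                share = cnt * w / total
                expected[g1] += share
                expected[g2] += share
        # M-step: normalize expected gamete counts to frequencies.
        norm = sum(expected.values())
        new_p = {g: expected[g] / norm for g in GAMETES}
        change = max(abs(new_p[g] - p[g]) for g in GAMETES)
        p = new_p
        if change < tol:
            break
    return p


# Hypothetical sample: only gametes (0, 0) and (1, 1) exist, each at frequency 0.5.
pair_counts = {(0, 0): 25, (1, 1): 50, (2, 2): 25}
gam = em_gametic_freqs(pair_counts)
```

In this constructed example the two loci are in complete association, so the estimated frequencies of gametes (0, 0) and (1, 1) should each approach 0.5 while the others approach zero.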

The gametic LD, dij, is then estimated as $\hat{d}_{ij} = \hat{p}_{ij} - \hat{p}_i \hat{q}_j$, where $\hat{p}_i$ and $\hat{q}_j$ are the MLEs of the frequencies of allele i at the first locus and allele j at the second locus, respectively. A chi-square statistic with 1 degree of freedom (df) is used to test H0: dij = 0 46, i.e.

R-square, $r_{ij}^2 = \hat{d}_{ij}^2 / [\hat{p}_i (1 - \hat{p}_i) \hat{q}_j (1 - \hat{q}_j)]$, is used to measure gametic LD, which ranges from 0 to 1. Appendix S1 gives the power calculation for the gametic LD test. The power follows a concave-upward curve as the allele frequency increases, because the variance under H0 or under H1 (dij ≠ 0) is greatest at intermediate allele frequencies. A large variance increases the uncertainty and hence reduces the power, given a sample size (n), a significance level (α), and gametic LD. The power also increases as the sample size or the gametic LD increases.
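For illustration, the gametic LD and its r-square can be computed from a table of gametic frequencies as in the Python sketch below. It uses the standard definitions (LD as the gamete frequency minus the product of the marginal allele frequencies, normalized by the product of the allele-frequency variances); the function name and the example frequencies are hypothetical, and the chi-square test is not included.

```python
def gametic_ld(gamete_freqs, i, j):
    """Return (d_ij, r2_ij) for allele i at locus 1 and allele j at locus 2.

    gamete_freqs maps (allele at locus 1, allele at locus 2) to a frequency.
    """
    # Marginal allele frequencies at each locus.
    p_i = sum(f for (a, _), f in gamete_freqs.items() if a == i)
    q_j = sum(f for (_, b), f in gamete_freqs.items() if b == j)
    d = gamete_freqs.get((i, j), 0.0) - p_i * q_j
    denom = p_i * (1.0 - p_i) * q_j * (1.0 - q_j)
    # r-square is undefined when either allele is fixed or absent.
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Hypothetical gametic frequencies: complete association between the two loci.
linked = {(0, 0): 0.5, (1, 1): 0.5}
d00, r2_00 = gametic_ld(linked, 0, 0)

# Hypothetical gametic frequencies: independence between the two loci.
unlinked = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
d00_ind, r2_00_ind = gametic_ld(unlinked, 0, 0)
```

The input could be the frequencies returned by an EM routine; here it is typed in directly so that the block stands alone.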

Let Dij be the zygotic LD between genotypes i at the first locus and j at the second locus (i, j = 0, 1, 2, 3, 4) in the population. The MLE of zygotic LD, $\hat{D}_{ij}$, from a sample of size n can be obtained by $\hat{D}_{ij} = \hat{P}_{ij} - \hat{P}_{i\cdot} \hat{P}_{\cdot j}$, where $\hat{P}_{ij}$ is the MLE of the joint frequency of genotypes i at the first locus and j at the second locus, and $\hat{P}_{i\cdot}$ (or $\hat{P}_{\cdot j}$) is the frequency of genotype i (or j). To test H0: Dij = 0, a chi-square statistic with 1 df is set as

The normalized r-square is set as $r_{ij}^2 = \hat{D}_{ij}^2 / [\hat{P}_{i\cdot} (1 - \hat{P}_{i\cdot}) \hat{P}_{\cdot j} (1 - \hat{P}_{\cdot j})]$, which ranges from 0 to 1 (refs 37, 39, 61). Appendix S2 derives the power calculation for the zygotic LD test. Similarly, the power increases as the sample size or the zygotic LD increases. The power may be relatively lower for testing zygotic LD than for testing gametic LD because the sample size is doubled in gametic LD tests.
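Because genotype pairs are observed directly, the zygotic LD can be obtained by counting. The Python sketch below assumes the usual definition (joint genotype frequency minus the product of the marginal genotype frequencies) and the analogous normalization for r-square; the function name and the counts are hypothetical.

```python
def zygotic_ld(pair_counts, i, j):
    """Return (D_ij, r2_ij) for genotype i at locus 1 and genotype j at locus 2.

    pair_counts maps (copy number at locus 1, copy number at locus 2) to a count.
    """
    n = sum(pair_counts.values())
    # Joint and marginal genotype frequencies by direct counting.
    p_ij = pair_counts.get((i, j), 0) / n
    p_i = sum(c for (s, _), c in pair_counts.items() if s == i) / n
    p_j = sum(c for (_, t), c in pair_counts.items() if t == j) / n
    d = p_ij - p_i * p_j
    denom = p_i * (1.0 - p_i) * p_j * (1.0 - p_j)
    # r-square is undefined when a genotype is fixed or absent at either locus.
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Hypothetical counts of genotype pairs in 100 individuals.
pair_counts = {(0, 0): 25, (1, 1): 50, (2, 2): 25}
D11, r2_11 = zygotic_ld(pair_counts, 1, 1)
D00, r2_00 = zygotic_ld(pair_counts, 0, 0)
```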

The significance tests of gametic and zygotic LDs were conducted at the genome-wide level in each population, and hence the Bonferroni-adjusted significance threshold was set as 0.05 divided by the number of all pairs of CNV loci across 22 chromosomes, ranging from 1.88 × 10⁻⁷ to 6.91 × 10⁻⁷ owing to different numbers of polymorphic loci in the 11 populations. To minimize the impact of minor allele frequency (MAF) on inflating the gametic LD test or on increasing false-positive errors, alleles with frequencies outside the range [0.05, 0.95] in the samples were excluded from gametic LD tests. For the same reason, genotypes with frequencies outside the range [0.05, 0.95] in the samples were excluded from zygotic LD tests. Sample sizes ranging from 77 to 171 can provide appropriate statistical power for genotypic frequencies within the range [0.05, 0.95] (Appendix S2). Since the constraints $\sum_i d_{ij} = \sum_j d_{ij} = 0$ and $\sum_i D_{ij} = \sum_j D_{ij} = 0$ hold, only four gametic LDs and sixteen zygotic LDs were tested for each pair of CNV loci. Note that CNV loci were not filtered out by frequency except in this LD analysis.
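The multiple-testing threshold, the frequency filter, and the number of independent LD coefficients can be written out as a short Python sketch. The locus count of 730 is a hypothetical value chosen for illustration, not a figure reported in the text, and the function names are likewise assumptions.

```python
def bonferroni_threshold(n_loci, alpha=0.05):
    """alpha divided by the number of unordered pairs among n_loci loci."""
    n_pairs = n_loci * (n_loci - 1) // 2
    return alpha / n_pairs


def within_range(freq, low=0.05, high=0.95):
    """True if an allele or genotype frequency is retained for LD testing."""
    return low <= freq <= high


def n_independent_lds(n_classes):
    """Independent LD coefficients for a pair of loci.

    Row and column sums of the LD table are zero, so only
    (n_classes - 1)^2 coefficients are free.
    """
    return (n_classes - 1) ** 2


# Hypothetical number of polymorphic loci in one population.
threshold = bonferroni_threshold(730)
n_gametic = n_independent_lds(3)   # three alleles per locus
n_zygotic = n_independent_lds(5)   # five genotype classes per locus
```

With three alleles and five genotype classes per locus, the last function gives the four gametic and sixteen zygotic LDs stated above.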

Joint mutation and migration rates

Consider a neutral CNV locus with three alleles. Let μc be the mutation rate of one allele to any of the other two alleles at a CNV locus. The probability density function (pdf) of the allele frequency under an equilibrium among genetic drift, mutation, and migration effects can be approximated by synthesizing Kimura’s19 and Wright’s50 work, i.e.

where N is the effective population size, mc is the migration rate per generation for an allele at a CNV locus, Q is the migrant allele frequency, and θc (aka “population diversity”) is the population-scaled mutation rate (= 4Nμc). Fst per locus is derived as

In practice, population differentiation in terms of Fst62 is measured by Gst44 for a three-allele locus.

Similarly, the pdf of allele frequency at a bi-allelic SNP locus under an equilibrium among genetic drift, mutation, and migration effects can be approximated by synthesizing Kimura’s19 and Wright’s50 work,

where ms is the migration rate per generation, Q is the migrant allele frequency, and θs is equal to 4Nμs, in which μs is the mutation rate at an SNP locus. Fst per locus is derived as

The relative extent of genetic divergence at the genome-wide SNP sites versus the genome-wide CNV loci is measured by the ratio Fst(SNP)/Gst(CNV), and its standard deviation can be estimated from the variance approximation:

$$V\left(\frac{F_{st(SNP)}}{G_{st(CNV)}}\right) \approx \left(\frac{\bar{F}_{st(SNP)}}{\bar{G}_{st(CNV)}}\right)^2 \left[\frac{V(F_{st(SNP)})}{\bar{F}_{st(SNP)}^2} + \frac{V(G_{st(CNV)})}{\bar{G}_{st(CNV)}^2} - \frac{2\,\mathrm{cov}(F_{st(SNP)}, G_{st(CNV)})}{\bar{F}_{st(SNP)}\,\bar{G}_{st(CNV)}}\right]$$

where $\bar{F}_{st(SNP)}$ and $\bar{G}_{st(CNV)}$ are the means of Fst(SNP) and Gst(CNV), respectively, and cov(Fst(SNP), Gst(CNV)) is the covariance between Fst(SNP) and Gst(CNV). The above expression is derived by the delta method63. The variance of the ratio can be approximated by assuming that the covariance, cov(Fst(SNP), Gst(CNV)), is negligible at the genome-wide scale. Correlations between CNV and SNP loci are weak, which could arise from the effects of transposition events, recurrent mutations/reversions, or the preferential occurrence of CNV loci in regions of low SNP density on chromosomes12,54.
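The first-order delta-method approximation for the variance of a ratio of two means can be sketched in Python as follows. This is the standard textbook form, with the covariance defaulting to zero as assumed at the genome-wide scale; the function name and the numerical inputs are hypothetical and are not results from the study.

```python
import math


def ratio_variance(mean_a, mean_b, var_a, var_b, cov_ab=0.0):
    """First-order delta-method variance of the ratio mean_a / mean_b."""
    ratio = mean_a / mean_b
    return ratio ** 2 * (var_a / mean_a ** 2
                         + var_b / mean_b ** 2
                         - 2.0 * cov_ab / (mean_a * mean_b))


# Hypothetical means and variances of Fst(SNP) and Gst(CNV).
fst_mean, gst_mean = 0.10, 0.05
fst_var, gst_var = 0.0004, 0.0001
v_ratio = ratio_variance(fst_mean, gst_mean, fst_var, gst_var)
sd_ratio = math.sqrt(v_ratio)
```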

From equations (5) and (7), the ratio of the joint migration and mutation rates at CNV loci to those at SNP sites is estimated as

Similarly, the variance of this ratio can be estimated using the delta method51. Let X = Fst(SNP)(1 − Gst(CNV)) and Y = Gst(CNV)(1 − Fst(SNP)). Again, assume that cov(Fst(SNP), Gst(CNV)) is negligible at the genome-wide scale. The variance V(X) is given by

V(Y) can be obtained by replacing Fst(SNP) and 1 − Gst(CNV) in equation (10) with Gst(CNV) and 1 − Fst(SNP), respectively. Similarly, V(XY) can be obtained by replacing 1 − Gst(CNV) in equation (10) with Gst(CNV). The covariance cov(X, Y) is given by

The variance of the ratio V(X/Y) can be estimated from the following expression,

The variance can be appropriately estimated by V(X/Y) in equation (12), especially when the sample sizes are large.
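The ratio X/Y defined above can be evaluated directly from estimates of Fst(SNP) and Gst(CNV), as in the Python sketch below. The variance function is an alternative first-order delta-method calculation applied to the ratio as a function of Fst(SNP) and Gst(CNV) with zero covariance; it is not a reproduction of equations (10) to (12). The function names and input values are hypothetical.

```python
def rate_ratio(fst_snp, gst_cnv):
    """X / Y with X = Fst(SNP)(1 - Gst(CNV)) and Y = Gst(CNV)(1 - Fst(SNP))."""
    x = fst_snp * (1.0 - gst_cnv)
    y = gst_cnv * (1.0 - fst_snp)
    return x / y


def rate_ratio_variance(fst_snp, gst_cnv, var_fst, var_gst):
    """First-order delta method on g(F, G) = F(1 - G) / (G(1 - F)).

    Assumes cov(F, G) = 0; uses the partial derivatives of g with
    respect to F and G evaluated at the point estimates.
    """
    dg_df = (1.0 - gst_cnv) / (gst_cnv * (1.0 - fst_snp) ** 2)
    dg_dg = -fst_snp / ((1.0 - fst_snp) * gst_cnv ** 2)
    return dg_df ** 2 * var_fst + dg_dg ** 2 * var_gst


# Hypothetical estimates and variances.
ratio_xy = rate_ratio(0.10, 0.05)
var_xy = rate_ratio_variance(0.10, 0.05, 0.0004, 0.0001)
```

A ratio above one in this sketch corresponds to lower divergence at CNV loci than at SNP sites, that is, to a larger joint migration and mutation rate at CNV loci.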

It is appropriate to assume that the migration rate is the same, on average, at the neutral CNV and SNP loci (mc = ms = m) although local variation might occur among loci (e.g., due to the genetic hitchhiking effects). Also, compared with the migration rate, the point mutation rate at the SNP sites can be neglected. Thus, the ratio of the mutation rate to the migration rate at CNV loci can be estimated:

The standard deviation of the μc/m estimate can be obtained according to equation (12), i.e.

Additional Information

How to cite this article: Hu, X.-S. et al. High mutation rates explain low population genetic divergence at copy-number-variable loci in Homo sapiens. Sci. Rep. 7, 43178; doi: 10.1038/srep43178 (2017).
