Inferring the Dynamics of Effective Population Size Using Autosomal Genomes

Next-generation sequencing technology has provided a great opportunity for inferring human demographic history by investigating changes in the effective population size (Ne). In this report, we introduce a strategy for estimating Ne dynamics, allowing the exploration of large multi-locus SNP datasets. We applied this strategy to the Phase 1 Han Chinese samples from the 1000 Genomes Project. The Han Chinese population has undergone a continuous expansion since 25,000 years ago, at first slowly from about 7,300 to 9,800 (at the end of the last glacial maximum about 15,000 YBP), then more quickly to about 46,000 (at the beginning of the Neolithic about 8,000 YBP), and then even more quickly to reach a population size of about 140,000 (recently).

The dynamics of human population size provide important information for understanding the processes underlying human evolutionary history. Important events in the course of human evolution, such as the development of technological innovations and climatic changes, often led to changes in human population size, and in turn have left footprints on extant genetic polymorphism 1,2 . The next-generation whole-genome sequencing technology has provided a great opportunity for interrogating the human population demography 3,4 through investigations of changes in effective population size (N e ). A number of methods for estimating N e from genetic data have been developed since Kingman introduced the coalescent theory 5 , and recent developments have made it possible to study changes in N e using large-scale sequencing data 3,4,6-8 . Gronau et al. 3 and Heled et al. 6 estimate inter-population variation in N e , which provides no information on the dynamics within a population. These two methods and as well as Li et al. 4 's method also have limitations in estimating the dynamics of N e when dealing with whole-genome sequencing data of medium to large numbers of sampled individuals, due to either the complexity of the algorithm itself 4 or the substantial amount of computational calculations required 3,6 which is very difficult to handle using current computational resources. On the other hand, inferring the dynamics of N e from the site frequency spectrum 7,8 requires the assumption of independence of sites, which wastes the linkage information in the data.
East Asia, being the crossroads of human migrations and human activities, is one of the most important regions for studying both the evolution and the genetic diversity of human populations 9 . In particular, the details of the population demographic history in this region since the Last Glacial Maximum (LGM) have scarcely been investigated. Furthermore, present research is still confined to mtDNA, Y-chromosome, or only a few autosomal loci with conflicting results [10][11][12][13] , especially with regard to the start time and the extent of expansions.
In this report, we develop a strategy for estimating the changes in N e , allowing the exploration of whole-genome sequences, we call ENUMS (Estimation of N e Using Multiple Segments). This strategy includes three steps: (1) identification of haplotype blocks ( hereafter referred to as blocks), following the definition of Wang et al. 14 , (2) estimation of N e dynamics through time for each block (hereafter referred to as block N e dynamics) using the Bayesian Skyline Plot (BSP) method 15 , and (3) estimation of population N e changes over time by taking the weighted average value of N e of all blocks at each time point, which minimizes the Euclidean distance to all block N e dynamics (see Methods for details). Based on the coalescence theory and by employing the standard Markov Chain Monte Carlo (MCMC) procedure, the Bayesian Skyline Plot (BSP) method 15 can co-estimate the evolutionary rate, substitution model parameters, phylogeny and ancestral population dynamics within a single analysis directly by sampling DNA sequences. In order to reduce the noise associated with short coalescent intervals, the method allows multiple coalescent intervals to be grouped, assuming the population sizes in successive coalescent intervals are correlated. The population sizes in these grouped intervals are allowed to change linearly or remain constant. The resulting estimation of N e dynamics over time is gained from the posterior sampling of the MCMC procedure, including credibility intervals that represent both phylogenetic and coalescent uncertainty. By applying the BSP method directly to each block and then integrating the information from all blocks, our ENUMS strategy not only has inherited the advantages of the BSP method, but is also able to circumvent the limitations of dealing with large samples of whole-genome sequences.
We applied this strategy to the samples of Han Chinese autosomal genomes in Phase 1 taken from the 1000 Genomes Project (1KGP) 16 to investigate how N e changed in three periods separated by two important events (i.e., the end of the LGM and introduction of agriculture). The three periods are: (1) the LGM period (25,000 YBP -15,000 YBP); (2) from the end of the LGM period to the end of the Paleolithic era (15,000 YBP -8,000 YBP) and (3) from the beginning of the Neolithic era to recent (8,000 YBP -recent).

Results
Genome-wide autosomal SNP (Single Nucleotide Polymorphism) sites of the 197 Han Chinese individuals (CHB & CHS) from the low coverage data set (the Phase 1 data) in the 1KPG 16 were used in this study.
To circumvent the unspecified influence of crossovers on the BSP method 15 which was used as part of the estimation algorithm in the following analysis, we partitioned the sequence data of each chromosome into blocks free of traces of crossover events by employing the four-gamete test (FGT) algorithm 14 on all of the individuals under study. Given the knowledge of poor sequencing quality in the 1KGP data set, we selected 844 haplotype blocks of higher quality, 5,516,675 bp in length totally, from all the autosomal blocks (see Supplementary Results  & Supplementary Table 1 for details). 332 of these blocks overlapped with at least one gene while 512 did not. Although this strategy may cause biased results since only part of each chromosome were used, it uses at least some of the linkage information by assuming no recombination within each block for inference of N e (also see Methods & Discussion for details).
The BSP method 15 was applied to individual blocks to estimate block N e dynamics. Blocks with the age of the most recent common ancestor (MRCA) younger than 25,000 years were removed, leaving 801 blocks (hereafter referred to as Best Set), 315 of which overlap with at least one gene. To further simplify the description of the dynamics of each block N e , we denoted a vector of 26 elements corresponding to N e values that were taken at a series of time points from present to 25,000 YBP with an interval of 1,000 years (also see Methods & Supplementary Figure 2).
To characterize the dynamics of the N e of all blocks studied, we estimated the population N e by calculating the weighted average values over the 801 blocks for every element of the aforementioned vectors (see Methods for details). The result revealed a continuous expansion from 25,000 YBP to recent, resulting in an approximately 18-fold increase in size. The initial N e was ~7,300, while the recent N e is ~140,000 ( Fig. 1). We investigated the change of N e for three time periods: (1) from 25,000 YBP to the end of the LGM (15,000 YBP), (2) from 15,000 YBP to the beginning of Neolithic (8,000 YBP), and (3) from 8,000 YBP to recent. The population increased by only 33% throughout the LGM period, with slight fluctuations. It reached ~46,000 by the end of the Paleolithic era (8,000 YBP), and then expanded to ~140,000 in size from 8,000 YBP to recent (i.e., since the invention of the  Table 2 for details) while two of them have been reported in East Asian populations (see Supplementary Table 3 for details). After removing all these 25 blocks from the Best Set, the remaining blocks show highly similar trends of population N e dynamics (also see Fig. 2). We also partitioned the Best Set into two subsets according to whether the whole or part of the segment overlapped with any of the genes. Both subsets yielded highly similar N e trends (Fig. 3).
Although our interest was in the Han Chinese N e dynamics from present to 25,000 YBP, we also investigated the trend of N e changes from 25,000 YBP to 300,000 YBP (see Supplementary Figure 5 for details). The results suggest that N e increased very slowly from 50,000 YBP to 25,000 YBP (less than 3-fold) while remaining nearly the same from 300,000 YBP to 50,000 YBP.

Discussion
In this report, we developed a strategy for estimating the changes in N e from the recent to the past, allowing for the exploration of large multi-locus SNP datasets, and even whole-genome sequences. The results of the application of this new strategy on the recently released Phase 1 Han Chinese samples from the 1KGP 16 suggest the following: a slight population expansion in East Asia during the LGM; an increase of N e continuing during the post-LGM period and a population expansion escalation with the agriculture in the recent millennia.
The two sampling populations (CHB and CHS) used by this study, although collected from the extant Chinese population, constitute an effective representation of East Asia 10,21,22 . Xu et al. 21 showed that both CHB and CHS are highly admixed. Furthermore, Zheng et al. 10 showed that CHB and CHS have experienced similar population size changes since 25,000 YBP based on the analyses of mitochondrial genomes. Therefore, the current analyses based on CHB and CHS capture the overall picture of the demographic dynamics of human populations in East Asia. To explore how the differentiation between CHB and CHS would influence the estimation of the population N e dynamics, we calculated the fixation index (F st ) 23,24 value of each block in the Best Set and the range of the resulted F st values is 0 to 4.5%. We then divided the Best Set into two sub-sets according to the whole-genome average F st value between CHB and CHS (0.12%) 16 . 289 blocks are with an F st value more the 0.12% while 512 are not. Both sub-sets reveal highly similar patterns of population N e dynamics (Fig. 4), which indicates that the influence of the population structure is very slight.
Constructing coalescent trees with recombination over a whole chromosome is an NP-hard problem when the sample size reaches tens of hundreds, therefore, it is reasonable to choose segments free of traces of crossover events in order to accomplish the estimation. However, even the FGT method, the most sensitive one for detecting currently known crossover events 14,25,26 , may fail to find all crossover events (also see Supplementary Results). Thus, how residual crossover events, although very rare (also see Supplementary Results), would influence our results is worth further investigation. We excluded any blocks shorter than 5,000 bp mainly because genealogical information may be insufficient in extremely short blocks (data not shown), but also to avoid a substantial amount of computational calculation. Generally, a 6,000 bp block consumes nearly 156 hours using one core of an XEON E5650 CPU and 10GB for hard disk of a 600,000,000-step MCMC iteration calculation and 20 GB of memory for dealing with the results generated by the MCMC iteration calculation (also see Methods for details). As a result, ~132,000 core hours of CPU and ~6TB of hard disk were consumed for all the segments in the Best Set for just one set of parameters. Meanwhile, there has not been any evidence supporting that inference is biased without short blocks 15 .
The properties of the BSP method were well studied using simulated data under different patterns of demography 15,27 , supporting the applicability of the method. Since the estimation of population expansion based on the ENUMS strategy is solely dependent on the BSP method (also see Methods), these simulation studies could reflect the properties and accuracy of ENUMS. Furthermore, the information on the uncertainty of the estimation of N e , as provided by the BSP, has been taken into consideration during the third step in ENUMS.
The times estimated by the BSP method 15 are in the unit of mutations per site. To rescale them into the unit of years, we need to know the mutation rate per site per year. There are two approaches to estimate the mutation rate 28 and we considered the one calculating pairwise substitution rates between closely related species as more suitable for our study (see Supplementary Discussion for details). So, we set the mutation rate as 2.5 × 10 −8 per site per generation. However, if we choose the one counting mutation rates that occur between generations in present-day individuals, assuming the mutation rate to be 1.25 × 10 −8 per site per generation, the results still support that the Han Chinese population has experienced a continuous expansion since 25,000 years ago (also see Supplementary Figure 6). Therefore, our results revealed autosomal expansion at least 12,000 years earlier than mtDNA expansion 10 and 19,000 years earlier than Y-chromosomal expansion 13 (see Supplementary Results for details).
The time interval by which N e values are selected for each block (also see Methods) might also have an impact on the trend of estimated population N e . Nonetheless, nearly the same results could be obtained after transforming the time interval from 1000 years to 500 years (Fig. 5). In addition, the quality of the dataset used in this study was not quite adequate due to the low-coverage sequencing strategy used, and thereby there may be a bias in the estimation of coalescent times as well as effective population sizes. The SNPs with ≤1% frequency were insufficient due to the poor sequencing quality 16 , especially with the inference of recent population dynamics of N e , since the signature of a recent population expansion will mainly be found in the singletons. A better estimation would be expected when high-quality data is available.
How recent positive selection might influence the previous estimates was investigated in two ways. Firstly, we partitioned the Best Set into two subsets according to whether the whole or part of the segment overlapped with any of the genes. Both subsets yielded highly similar N e trends (Fig. 3), which also demonstrates that the amount of blocks does not influence the estimation of the population N e when it is not too small. Secondly, we detected recent positive selection in each block in the Best Set by the modified CMS 19,20 test (see Supplementary Methods) and 25 blocks appear to be subject to recent positive selection in total. Nearly the same trend of population N e dynamics has been obtained after removing these 25 blocks from the Best Set (Fig. 2). Both the above observations suggest that recent positive selection has very limited effects on the N e estimation in this study.

Framework of ENUMS.
We have developed a strategy, denoted as ENUMS, for estimating the changes of N e from the recent to the past within a population, allowing the exploration of large multi-locus SNP datasets, including whole-genome sequences. It functions by identifying haplotype blocks, estimating block N e dynamics through time of each block and estimating population N e changes over time by integrating information from all block N e dynamics. Based on the BSP method, this strategy circumvents the limitations of large sample sizes and minimizes the Euclidean distance while integrating information from all block N e dynamics.  15 is based on a run of 600,000,000 generations, sampled every 2,000 generations, with the first 50,000,000 generations discarded as burn-in. Specifically, the steps we took in the MCMC process were nearly 10 times longer than the ones used in other studies 10,31 . In order to plot the dynamics of block N e with respect to time, a strict clock and a neutral mutation rate of 2.5 × 10 −8 per generation per site 4,32 and 25 years per generation are also assumed. For the convenience of analyzing all blocks together, N e values were selected at a series of time points from present to 25,000 YBP with an interval of 1,000 years to describe the dynamics of each block N e , which can be denoted by a vector V T j , where j is an integer denoting the index of the block and T is a vector denoting the index of the time points selected, i.e., T = (0, 1000, …, 25000)′ .

Estimation of population N e dynamics using all blocks under survey. Because estimation of N e
changes from any block in the Best Set may be biased by the influence of many factors such as genetic drift, positive selection, we take the vector V T that can minimize function (1) as the estimation of the population N e changes, i.e., the vector describing the population N e dynamics is the one whose sum of squared Euclidean distance to all block N e vectors in the Best Set is minimal (see function (2)). i.e., the estimation of the population N e .