Fine population structure analysis method for genomes of many

Fine population structure can be examined through the clustering of individuals into subpopulations. The clustering of individuals in large sequence datasets into subpopulations makes the calculation of subpopulation specific allele frequency possible, which may shed light on selection of candidate variants for rare diseases. However, as the magnitude of the data increases, computational burden becomes a challenge in fine population structure analysis. To address this issue, we propose fine population structure analysis (FIPSA), which is an individual-based non-parametric method for dissecting fine population structure. FIPSA maximizes the likelihood ratio of the contingency table of the allele counts multiplied by the group. We demonstrated that its speed and accuracy were superior to existing non-parametric methods when the simulated sample size was up to 5,000 individuals. When applied to real data, the method showed high resolution on the Human Genome Diversity Project (HGDP) East Asian dataset. FIPSA was independently validated on 11,257 human genomes. The group assignment given by FIPSA was 99.1% similar to those assigned based on supervised learning. Thus, FIPSA provides high resolution and is compatible with a real dataset of more than ten thousand individuals.


Testing existence of population structure
We used the same demographical model proposed by Lawson et al. 1 to assess the fine population structure scenario. Three populations (A, B, C) were set to split simultaneously 3000 years ago, followed by a split from C to C1 and C2 2000 years ago, and a split from B to B1 and B2 1000 years ago ( Supplementary Fig. 2). Linkage disequilibrium was not considered in the simulations. The procedure was as follows.
Step 1: independently simulate tens of thousands of segment using SFS_CODE 2  Step 2: randomly draw a single SNV from each segment to generate a dataset of certain counts of independent SNV.
Step 3: For each SNV count, step 2 is repeated 20 times to generate 20 simulated datasets.
Step 4: For each simulated dataset, 20 permutated datasets are generated to calculate pseudo pvalue.

Results
Simulation details of demographic model of Supplementary Fig. 2 With a demographical model where there were three starting populations, we simulated unlinked SNVs using SFS_CODE for testing performance. The procedure was as follows.
Step 1: independently simulate tens of thousands of segment: Step 2: randomly draw a single SNV from each segment to generate a dataset of certain counts of independent SNV.
Step 3: For each SNV count, step 2 is repeated 5 times to generate 5 simulated datasets.

Details of the three methods evaluated on simulation data
ChromoPainter unlinked model and fineSTRUCTURE 3 , K-means 4 were compared with FIPSA. ChromoPainter and fineSTRUCTURE are frameworks for dissecting fine population structure comprised of two steps. The first step is to construct a pairwise individual co-ancestry matrix using ChromoPainter, and the second step is the clustering of individuals based on the co-ancestry matrix using fineSTRUCTURE. ChromoPainter has two versions, the linked version and the unlinked version. Although the unlinked version is not as sensitive as the linked version, which gains a substantial advantage by modeling linkage disequilibrium following Li.
et al. 5 , the unlinked version is more representative because of its close relationship with classical methods such as Principal Component Analysis 6 , structure 7 and ADMIXTURE 8 .
Besides, the unlinked version of fineSTRUCTURE is non-parametric and assumes independence among the SNVs; thus, it is more comparable with FIPSA. We chose K-means as a representative of classical clustering methods because it is parameter-free and assumes loci independence as well. The Adjusted Random Index (ARI) was used to measure clustering accuracy for the three methods. The choice of K for the K-means is Calinski criteria, _ for FIPSA and not fixing K (default) for fineSTRUCTURE.

Parameters for speed evaluation
The parameter set for K-means: given K from 2 to 10, with each K having 100 iterations using the cascadeKM function in the vegan library, the Calinski criteria for K selection; choromopainter unlinked and fineSTRUCTURE: 100,000 burn-in iterations and 100,000 sample iterations for fineSTRUCTURE; FIPSA: given K from 2 to 10, each K with 100 iterations per individual, 10 replicates for each K, maximum informative K for choice of K.
The evaluation of the speed was run on a desktop computer with Intel i7 3770 and 32 GB RAM.

Simulated data containing two subpopulations
We used the ms 9 software to simulate a simple fine population structure scenario to  Supplementary Fig. 2), we found that the permutation approach had much more power to detect population structure even when the simulated dataset only contained 1000 loci from 500 individuals. The LR for the permutated datasets was 2,369 31 (twenty replicates), while the LR for the original dataset was 5,638, which was much larger. The method had increased power as the SNV count, or the individual count, increased. Thus, for real datasets with hundreds of thousands of loci or thousands of individuals, this gain in power would be even larger.
However, this approach would result in some false positive results when used on real datasets. As linkage among loci is unavoidable in real situations, the permutation process would break down the linkage disequilibrium as well as the population structure, thus making a pseudo p value more significant for the linked dataset. Using chromosome 22 of the HGDP 10 East Asian dataset, which contains 9,660 SNVs from 140 individuals, as a representative real data of fine population structure, the permutation approach also proved to have more power, whereas the LR for the permutated datasets was 10,663 46 (twenty replicates), while the LR for the original dataset was 17,258.

Choice of K and _
There is no perfect way to choose K. One possible solution is to use _ . The _ criteria is based on the likelihood ratio (LR) of each K. Thus, there is a chance that the LR for a certain K does not reach the maximum, which will lead to an inappropriate choice of _ . To reduce this chance, the restart time for FIPSA should not be too small.
Practically, for the HGDP East Asian dataset, the restart time was set to 60. Still, we did not exclude the possibility of an inappropriate _ . Even so, the _ criteria gives a suggestive rank for all of the calculated K, and the top ranked K is more likely to be better than the rest. Another drawback of using _ is the setting of . The could not be set too large for a huge dataset, primarily because of the computational burden.
Besides _ , we define _ (K with strongest structure) as the K in which the second derivative of the LR on K (SOD(K)) reaches the maximum.
The definition of _ was inspired by simulated data. When we ran FIPSA on simulated data with no population structure, we found LR to be approximately proportional to K. On the contrary, when we ran FIPSA on HGDP data, LR was not proportional to K, which was due to the presence of population structure. Thus, the K which reflects the strongest population structure ( _ ) is determined by second order difference of the LR over K (denoted as SOD(K)). And SOD(K) is chosen as the statistic to assess _ . For HGDP dataset, the largest SOD(K) corresponded with _ 3 (Supplementary Fig. 4).
The choice of K is based on heuristics. We strongly recommend that users specify K based on biological knowledge.