Clustering of 770,000 genomes reveals post-colonial population structure of North America

Despite strides in characterizing human history from genetic polymorphism data, progress in identifying genetic signatures of recent demography has been limited. Here we identify very recent fine-scale population structure in North America from a network of over 500 million genetic (identity-by-descent, IBD) connections among 770,000 genotyped individuals of US origin. We detect densely connected clusters within the network and annotate these clusters using a database of over 20 million genealogical records. Recent population patterns captured by IBD clustering include immigrants such as Scandinavians and French Canadians; groups with continental admixture such as Puerto Ricans; settlers such as the Amish and Appalachians who experienced geographic or cultural isolation; and broad historical trends, including reduced north-south gene flow. Our results yield a detailed historical portrait of North America after European settlement and support substantial genetic heterogeneity in the United States beyond that uncovered by previous studies.


Genetic differentiation between clusters
To assess the connection between IBD clustering and global populations, we measured differentiation in common genetic variation (FST) between the 6 largest clusters detected in the IBD network (Supplementary Table 8), clusters detected in the sub-networks (Supplementary Table 5), and the clusters ("stable subsets") identified in the spectral analysis (Supplementary Table 4). F ST between Jewish, Irish, Scandinavian, Finnish, Hawaiian and African American stable subsets (Supplementary Table 4) closely matches FST estimated from comparable worldwide populations. For example, compare FST = 0.078-0.081 between African American and European-origin stable subsets (Irish, Scandinavian, Finnish) to European-African FST of 0.096 10 . Also, FST = 0.005-0.007 between Jewish and European-origin stable subsets are similar to FST estimates of 0.008-0.011 between Italian/Ashkenazi Jews and French/North Italians 11 . Our FST estimates are typically slightly lower than others'; this is expected because our samples were not recruited to specifically capture genetic variation of ancestral populations, as in 1000 Genomes 13 or People of the British Isles 14 , and the clusters likely contain more admixed individuals than the comparable cohorts. Here, we use FST calculated from the stable subsets because they appear to be more representative of global populations than the hierarchical clustering; Jewish and African Americans are striking examples of this (Supplementary Data 2). This illustrates that the hierarchical clustering, as opposed to the stable subsets, contain individuals who may not neatly "fit" within a single subgroup, and are potentially more admixed than the individuals within the stable subsets.

Discussion on interpretation of clusters using genealogical data
Although the approach we have described is useful for characterizing most clusters, here we discuss some of the limitations we encountered in using the genealogical data to interpret the clusters. Perhaps most obviously, groups that do not show a particularly unusual concentration of ancestral birth locations in any regions will be difficult to interpret from these data alone. The most conspicuous example of this is the Jewish cluster; most areas of the US and Europe featuring large past or present Jewish communities are also ancestral locations of other large ethnic groups that immigrated to the US. In this case, the estimated admixture proportions allow for a clear interpretation of the Jewish cluster, but these data did not allow for an unambiguous interpretation of the three Jewish sub-network clusters, labeled "European Jewish A, B and C" (Supplementary Data 2). Several other groups do not exhibit a particularly unusual geographic concentration in the US, including Irish, Colombians and Caribbeans, because they have primarily migrated to areas inhabited by many other groups. For similar reasons, large cities such as Chicago, Detroit and New York are important ancestral birth locations for several groups (e.g., African Americans), but do not feature prominently in the maps because these cities are also attributed to other clusters.
Another consideration is that the extent to which these data reflect US-wide trends hinges on the composition of the AncestryDNA database, and the availability of genealogical records. Clearly, this database captures much of the genetic diversity of the US, but in Supplementary  Fig. 17, for example, we might expect a greater density of ancestors from Poland given that it is one of the largest ethnic groups in the US. This discrepancy could be partly attributed to lower availability of Polish genealogical records, and a predisposition in the AncestryDNA database toward individuals with Western European origins. locations in this cluster closely correspond to the settlement pattern of Norwegians in rural Minnesota, North Dakota and Wisconsin, with large numbers moving to Minneapolis and Chicago 19,20 (Supplementary Fig. 20). The subdivision of the Midwest immigrants cluster also delineates two clusters, one corresponding to Swedish and Danish, and the other to Norwegians (Supplementary Fig. 26).
Irish. In the spectral analysis, we identify a cluster which likely corresponds to descendants of Irish immigrants. Six million Irish settled in the US in the 19th century, with immigration peaking in 1852 during the Irish famine 21 . Irish migration was historically characterized by a highly localized pattern of chain migration, as migrants followed family members and neighbors to the same towns and cities in the U.S. 21 . In fact, almost half (46%) of Irish immigrants migrated to just 10 U.S. counties 22 . Once in the US, the Irish may have tended to marry within their ethnic group since their migration was relatively gender-balanced compared to other European groups, which were male-dominated. Many young Irish women emigrated to the U.S. in order to find a spouse, have families, and achieve economic independence, as the famine reduced and delayed opportunities for marriage in Ireland for decades 23 . Interestingly, in the third level of the hierarchical clustering, we identify 3 clusters corresponding to North, South and West geography within Ireland ( Supplementary Fig. 26).
Other European immigrant groups. In addition to the results already discussed, we identify substructure corresponding to other European immigrant groups when we subdivide the "Italians, Irish and Scottish" cluster (Supplementary Data 2, Fig. 26). For example, we identify a cluster whose ancestral birth locations are disproportionately concentrated in Scotland, Atlantic Canada and Ontario ( Supplementary Fig. 26), corresponding to migration of large numbers of Scottish to Canada. Furthermore, we also find an Italian cluster ( Supplementary Fig. 26). The ancestral birth locations for the Italian cluster are particularly concentrated in southern Italy, reflecting the predominant source of Italian immigration to the US.

Polynesians, East Asians and Hawaiians.
In the first level of hierarchical clustering, we identify a cluster corresponding to Polynesians and East Asians (Supplementary Data 2) with only a small number of individuals, and thus we do not subdivided it further. The Hawaiian stable subset identified in this cluster is representative of the Polynesian population with some East Asian admixture. Reflecting this finding, Hawaiians have likely remained genetically isolated due to the large distance from the continental US, while Hawaii has a complicated history with recent, abrupt population changes and rapid growth 24 .

Continental admixed groups
Mexico clusters. In the main text, we discuss the connection between the Northeast and West Mexico clusters and Mexico-US migration patterns. We further note that areas that have not traditionally seen a large influx of migrants to the US, such as southern Mexican states, are poorly represented in the Mexican clusters (Supplementary Figs. 17,18), reflecting our USbiased sample.
New Mexicans. We identify a cluster corresponding to New Mexicans in both the clustering of sub-networks as well as in the spectral analysis (Fig. 3, Table 1, Supplementary Data 2). This cluster most likely represents descendants of the Nuevomexicanos, some of the earliest European colonial settlers that migrated northward from New Spain along the El Camino Real de Tierra Adentro trail 25 . Supporting this hypothesis, ancestral birth locations are disproportionately found in parts of Mexico and New Mexico near this trail (Supplementary Fig. 20).
Puerto Ricans. In our discussion in the main text, although we do not single out Puerto Ricans from other Caribbean Islands peoples, Puerto Ricans are by far the predominant Caribbean group in our sample. Puerto Ricans typically have a mixture of European and African ancestry, with smaller amounts of Native American admixture 26 , and this is reflected in our data Supplementary Data 2). In the second level of the hierarchical clustering, we identify fine-scale structure on the island of Puerto Rico in the 3 largest clusters of the Caribbean sub-network (Supplementary Data 2). These 3 clusters are clearly correlated with geography of the island of Puerto Rico, as they roughly subdivide the island into 3 regionsnorthwest, southwest and east ( Supplementary Fig. 21). These 3 Puerto Rican clusters show small differences in Native American, West African and European admixture proportions ( Supplementary Fig. 11) that only partially reproduce the findings of genetic variation across the island of Puerto Rico 26 , perhaps due to differences in the composition of our database and their sample. Interestingly, the Puerto Rican cluster has an enrichment of ancestral birth locations on the island of Hawaii. This likely reflects the arrival of sugar cane plantation laborers to Hawaii in the early 1900's; Puerto Ricans have been documented as a separate ethnic group in Hawaii as early as 1910, and they have constituted over 2% of the population up until 1950 27 .

Assimilated immigrant groups
Pennsylvania. In the second-level hierarchical clustering, we identify a cluster with birth locations concentrated in Pennsylvania (Table 1, Fig. 3, Supplementary Fig. 19). Roughly 80,000 Germans immigrated to the colonies in North America between 1717 and 1775, with the majority settling in Philadelphia, southeast Pennsylvania, and New Jersey. By 1760, 50,000 Germans had settled in southeast Pennsylvania alone. At the time, Germans constituted the largest ethnic group in the colony with a distinct language, religious culture, and identity 28 . German immigrants tended to marry within their ethnic group and remained geographically stable for many generations 28 . While this demography may not encompass the entire history of the Pennsylvania cluster, it provides some background for the genealogical data associated with individuals assigned to this cluster.
Southern US: Alabama, and North and South Carolina. Here, we discuss further substructure identified in third-level hierarchical clustering of the "Lower South" cluster (Supplementary Figs. 25,27). The predominant migration pattern from North and South Carolina to Alabama observed in the genealogical data of the "Alabama, and North and South Carolina" cluster ( Supplementary Fig. 27) may be explained by the westward expansion of the cotton industry between 1820 and 1860. "Alabama fever" gripped South Carolinians after the opening of the territory to European settlement following the expulsion of the Creek people in 1814 29 . Since soil quality had declined in both South Carolina and Georgia by 1820 30 , migrants passed through Georgia and moved directly into Alabama, where the nutrient rich soil yielded 3 times more cotton per acre 31 . By 1850, there were an estimated 45,000 migrants from South Carolina in Alabama 31 , accounting for approximately 30% of all incoming migrants 30 . South Carolina migration to Alabama began to decrease in 1860 due to the opening of new migration routes further west into Texas, Arkansas, and Florida 30 .
Southern US: Florida, Georgia, and South Carolina. Another cluster identified from third-level hierarchical clustering of the "Lower South" cluster has enriched ancestral birth locations in Florida, Georgia, and South Carolina (Supplementary Figs. 25,27). The southward movement of people in Georgia and South Carolina ( Supplementary Fig. 27) is also consistent with documented historical migration patterns. Settlers moved into the rich coastal plains of Georgia and South Carolina-the best agricultural land in the South-between 1780 and 1810 32 . Then, between 1825 and 1840, migrants poured into "middle Florida" (near present-day Tallahassee) prompted by the acquisition of the Florida territory from Spain in 1821, and the removal of native Seminole, Miccosukee and Red Stick Creek people in 1824 33 . The end of the in-migration to Florida coincided with the collapse of cotton prices in 1840 33 .
Southern US: additional substructure. In the third-level hierarchical clustering of the Lower South and Upland South clusters (Supplementary Figs. 25,27), we identify further fine-scale structure that corresponds in part to geography in the Southern US, and possibly other historical migration patterns. Again, we emphasize that the demographic interpretation of many third-level clusters is more speculative.

Post-migration isolated groups
Appalachians. In the main text, we discuss the Appalachians stable subset. The Appalachians cluster is particularly concentrated in southeastern Kentucky near the Cumberland Mountains, which was more geographically and economically isolated relative to other parts of Appalachia. Moore 34 claims that the problematic stereotype of Appalachia as remote and homogenous is based on this particular sub-region. The coal industry developed first in northern and western Kentucky, then only gradually moved into the southeastern part of the state. Railroads did not reach eastern Kentucky until the 1880s, and only came to Harlan County, where a dense number of ancestral birth locations are concentrated ( Supplementary Fig. 20), during World War I 34,35 . This history, as discussed in the main text, provides an explanation for the identification of this cluster in the IBD network.

Mennonites.
We identify what we hypothesize to be a sub-network cluster corresponding to Mennonites, although this label is less clear than the others. Mennonite families homesteaded in different parts of the Great Plains; districts with concentrated Mennonite settlement in the include Marion, McPherson, Harvey, Butler and Reno counties in Kansas 36 , and near Korn (later Corn), Fairview, North Enid, and in the Meno-Ringwood-Goltry in Oklahoma 37 . In Supplementary Fig. 19, we observe that many enriched ancestral birth locations of this cluster occur at or near these settlements.
Utah. Here we include some additional details about the Utah stable subset that we did not have room to discuss in the main text. Using genealogical annotations, we are able to trace the migration patterns of this cluster with remarkable detail (Supplementary Fig. 24). Large numbers of Scandinavians migrated to the Northeast in the 1700's, descendants of whom later moved to Utah; this is well captured by the large concentration of ancestral birth locations for the Utah cluster in Scandinavia, especially Denmark. Areas in the west outside Utah (Mexico, Arizona and British Columbia) also appear as over-represented ancestral birth locations in this cluster, and correspond to known Mormon settlements. Over-represented ancestral birth locations in and near Iowa correspond to important settlements along the Mormon trail (Nauvoo, Illinois and Omaha, Nebraska).

Discussion of IBD network analysis
We sometimes find that the clusters identified by recursively maximizing the modularity closely overlap with the stable subsets (examples include European Jewish and New Mexicans; see Supplementary Data 2). In such cases, the spectral analysis provides additional support for the clusters, and narrows the range of likely demographic hypotheses underlying the hierarchical clustering. Since the spectral analysis only delineates the most disconnected subgraphs, an additional benefit is that it filters out "admixed" individuals that might be arbitrarily assigned to clusters-e.g., a child of parents from two genetically isolated populations. This explains why stable subsets are more representative of the global ancestral populations than their corresponding clusters (compare, in Supplementary Data 2 and Supplementary Figs. 9, 10, the admixture proportions in the clusters and stable subsets corresponding to African Americans and European Jewish).
In other cases, the stable subset contains only a small fraction of the cluster, or no stable subset is identified in the cluster (Table 1, Supplementary Data 2). Such clusters are typically much less modular than the clusters that closely overlap stable subsets; this can be seen by comparing the internal edge density of the cluster (Win) to the density of edges to non-cluster members (Wout). Although any population structure underlying this clustering is therefore subtler, we find that the clustering in several cases corresponds to unambiguous demographic patterns. In this circumstance, the likely interpretation of the hierarchical clustering is that it is the discretization of some unknown, continuous feature of IBD variation (e.g., geographic distance). That being said, we often find that this network structure is often more difficult to interpret, and the exact boundaries of the clusters may be partly influenced by synthetic factors, such as the frequent failure of modularity-maximizing methods to partition small modules [38][39][40] .
Laplacian eigenmaps and related spectral clustering methods have been previously proposed for inferring population structure from genetic data, and there are several published works on this topic [41][42][43] . The most closely related work is by Lee et al. 41,42 ; they use spectral dimensionality reduction methods, as we do here, to uncover population structure in the POPRES data set. The key differences are: (1) we define similarity between data points using pairwise IBD estimates, whereas they take a dot product of the genotypes; (2) we uncover population structure at much finer scale; and (3) we apply spectral methods to data on a much larger scale.
Also, we are not the first to combine hierarchical clustering with spectral methods, and it is possible that other algorithms could provide a more systematic implementation of our approach to identifying modular structure informative of population demography in the IBD network; see, for example, the HQcut method developed by Ruan and Zhang 44 . However, an important consideration is the massive scale of our network, which prohibits the application of this method and other more computationally complex algorithms developed for community detection. Additionally, visualizing the individual dimensions of the spectral embedding generated from the IBD network yields additional information about structure in the population ( Supplementary Fig.  22).
An unresolved issue common to both the hierarchical clustering and spectral analysis is that the stopping criterion is unclear: in the hierarchical clustering, there is the question of when to stop subdividing clusters; in the spectral analysis, there is the question how many dimensions of the spectral embedding to use to delineate stable subsets. Although researchers have proposed stopping criteria for these methods (e.g., 39,45 ), our experience as well as the unique properties of the IBD network suggest that these criteria either do not apply or do not work well in our setting. In particular, the question of assessing statistical significance in detected modules has received considerable attention in the literature; see Berry et al. 38 and Fortunato 39 for some recent reviews of the topic. However, this unresolved question is even further complicated by the difficulty of determining an appropriate "null model" for a network reflecting IBD in a modernday human populations. As a result, we have not attempted to make any claims about statistical significance of clusters detected in the IBD network. For the hierarchical clustering, we have used an ad hoc stopping criterion based on the size of the cluster (see above), and this is an aspect we hope to address more systematically in future work.
An open question in the spectral analysis is whether the order in which stable subsets are discovered in the spectral embedding yields an approximate ranking of how closely the clusters resemble disconnected components, and therefore provides a measure of genetic isolation. Anecdotally, there is evidence for this interpretation-Jewish and Hispanic/Latino groups cluster in the first dimensions of the spectral embedding, whereas the last stable subsets we identify correspond to Utah settlers and Irish immigrants which show little genetic differentiation from most European-origin individuals in the US. However, except in idealized settings 45,46 , there is no theory supporting the ordering of the eigenvectors of the Laplacian to rank the clusters according to the "isolation" in the network.
Finally, we consider a general limitation to the proposed approach: the structure of the IBD network hinges on parameter choices and assumptions made. For example, while the edge weights are defined from simulations intended to reflect real biological relationships (as described), they do not account for factors such as population-specific excess IBD sharing (e.g., due to founder events or rapid population growth). As a result, network structure corresponding to subpopulations exhibiting sharp deviations from random mating may reflect more distant demography compared to subpopulations that are more closely modeled by our simulations. A model-based approach that adjusts the demographic model parameters to fit the IBD data could potentially address this limitation. Recent work has demonstrated the power of coalescent model-based methods to infer population demographic parameters from IBD [47][48][49] . The distribution of detected IBD in each of the clusters could be used to reconstruct more detailed histories of the underlying historical groups. However, for discovering population structure, an important yet unresolved question is how to develop a model-based method that can jointly learn both the population parameters and population assignments. In our model-free approach, we take the pragmatic view that sensitivity of the inferred network structure to assumptions could be a feature of our method-that is, in future work, alternative choices could reveal population structure arising from different time-scales.

Sample collection
All DNA samples included in this study were collected from AncestryDNA customers (except for samples obtained from external sample collections that are included only in the admixture reference panel-see below). The typical sample collection process for an AncestryDNA customer is as follows: a customer orders an AncestryDNA kit through the AncestryDNA website; the customer collects saliva at home, and returns the saliva sample in stabilizing solution; DNA from the saliva is extracted; finally, genotypes are called for a dense panel of single nucleotide polymorphisms (SNPs) across the genome.
A DNA sample is only processed at the laboratory once the customer has activated the kit through the website. As part of the activation step, the customer provides basic personal information, including age and/or year of birth, first and last name, and gender. (Some of this information, such as gender, is used for subsequent quality control steps.) During or after the activation, the customer is able to associate ("link") his or her DNA sample with a node in a pedigree. These pedigrees are accessed through the user's online account, and they are generated either by the customer or by other users (more details are given below).
During activation, each customer is given the option of consenting to the AncestryDNA Human Genetic Diversity research project. Following sample quality control steps described in below, we obtain a final panel of 774,516 genotyped samples consented to participate in research. This is the number of samples available for our subsequent analyses.

Genealogical data
Many customers have provided detailed information about their family history. Online pedigrees are created by individual users or, occasionally, by professional genealogists. An individual can associate a DNA sample to a node in any pedigree that is accessible through their Ancestry user account. Pedigrees are viewable by other users unless a user has marked a pedigree as "private". This hides the information in the pedigree from public view. For DNA samples linked to pedigrees, we use the pedigree data in aggregate to better understand the historical and geographical significance of the clusters that are identified from the IBD data.
We take a few basic steps to remove associations between DNA samples and pedigrees that are unlikely to be correct. We exclude all pedigrees linked to DNA samples that do not satisfy all of the following criteria: (1) recorded death date for the linked pedigree node (when this death date is available) occurs after AncestryDNA began; (2) the gender is the same as the gender recorded during DNA activation; and (3) the birth date is within 3 years of the birth date recorded during DNA activation. (DNA samples that do not satisfy these criteria are still included in our analyses, but the associated pedigree data is not used.) We note that while more stringent tests for reliability could be used, they are of unclear value given the size and complexity of these data. Finally, we exclude all pedigrees that users marked as "private". After taking these filtering steps, we obtain a final set of 432,611 DNA samples linked to non-private pedigrees.
In all subsequent analyses of the genealogical data, we only include pedigree nodes corresponding to ancestors of the tested individuals; that is, we only retain pedigree nodes x such that the DNA sample is a descendant of x. (These are the only pedigree members that could have passed down genetic material to the associated user.) We exclude all ancestors more than 9 generations back in users' pedigrees since we have found that the reliability of the pedigree information diminishes considerably after 9 generations.
We include two types of information in our analyses of associated pedigree data: birth year and birth location (map coordinates in longitude and latitude). Other available pedigree information, such as place names, surnames, death dates, and evidence in the form of documents attached to pedigree nodes, are not used in this study. We only use birth locations that include state or province information, and we only use US birth locations that include county or city information.
Based on these pedigree data, the vast majority of the DNA samples are from individuals born in the US; out of the DNA samples linked to pedigrees, 322,683 (96% of reported birth locations) were born in the US, 13,748 (4% of reported birth locations) were born outside the US, and an additional 96,180 (22% of all DNA samples linked to pedigrees) have unreported birth locations. Based on these reported birth locations, we have a reasonably good representation of DNA from individuals born in all US states, with the largest proportion from California (Supplementary Table 1, Supplementary Fig. 1).
The user-generated pedigrees associated with DNA samples exhibit wide variation in size (number of pedigree nodes), completeness (proportion of an individual's ancestors added to the pedigree), and depth (number of generations represented in pedigree); see Supplementary  Figs. 12, 13. The average size of the pedigrees also appears to vary somewhat by US state ( Supplementary Fig. 14 About 76% of all pedigree nodes include birth location, birth year and surname. Reassuringly, the proportion of pedigree nodes with these three annotations does not appear to vary by generation ( Supplementary Fig. 15). In other words, if an ancestral node is present in a pedigree, how well that node is annotated does not appear to be strongly affected by the depth of that ancestor in the pedigree. Annotation completeness varies only slightly by US state (Supplementary Table 1). Variability in tree and annotation completeness could be partly a result of access to historical records, either in the US, or for different ethnic groups.
We caution that pedigree information from different users is not always independent. For example, multiple DNA samples may be linked to similar or identical pedigrees. This may even be the case for individuals that are more distantly related to each other. (With Ancestry's "hinting" system, new relatives can be suggested for an individual's pedigree based on similarity with other online pedigrees. Although this system has been successful for helping users expand their family trees, it can also perpetuate errors in pedigrees.) Adoption is another source of error, although this is expected to have a limited impact. (Users can mark some lines as biological or adoptive, but many users are unaware of this feature.) For the purposes of this study, we assume that inaccurate pedigree data has a negligible impact on the accuracy of the summary statistics when they are compiled from thousands of pedigrees.

Phasing reference panel
The reference panel is based on 217,722 genotype samples that were available in the AncestryDNA database at the time when the panel was constructed. We use a subset of 633,299 autosomal SNPs for all steps of the phasing analysis. Before phasing these samples, we first impute the small proportion of missing genotypes using Beagle. This is accomplished in batches of 200-1,000 samples. Each of these batches also includes 558 phased samples that were downloaded from the Beagle website 50 , and originally collected as part of Phase 1 of the 1,000 Genomes Project 13, 51 .
Next, we combine these imputed samples into larger batches of approximately 50,000 samples each. Each of these batches are phased separately using HAPI-UR version 1.01 6 . Once we have phased all the reference samples using HAPI-UR, we learn and store the haplotype models using our Beagle-like algorithm. Phasing new genotype samples using this reference panel takes only a few seconds to complete for each sample.

Genotype phasing algorithm
Beagle defines a probability distribution over haplotypes across each chromosome region, or "window", using a Markov model 55 . The accuracy of the haplotype models increase with sample size 6 . However, Beagle was not designed to scale to hundreds of thousands of samples. An alternative method is HAPI-UR 6 , which is able to simultaneously phase tens of thousands of samples. As of this writing, HAPI-UR can handle larger numbers of samples than Beagle (we are using HAPI-UR version 1.01 and Beagle version 3.3.2). Still, it can take several days for HAPI-UR to complete its computation for the large samples used in this study.
Our algorithm extends Beagle in two ways to accommodate the size of our data set. First, when estimating the transition likelihoods in the haploid models, we add a "pseudocount" (10 -4 ) to each haplotype count. This allows for the possibility that a new genotype sample contains a haplotype that was never previously encountered. Without this modification, the probability of a new haplotype is zero.
Second, we modify the criterion for deciding whether two haplotype clusters (i.e. nodes of the haploid Markov model) should be merged during model learning. Since the standard method is overly confident for frequencies that are close to 0 or 1, we regularize the estimates using a symmetric beta distribution as a prior. Specifically, haplotype clusters x and y are not merged unless the following condition is satisfied for some haplotype h: where nx and ny are the sizes of clusters x and y. The posterior allele frequency estimates in this formula are where nx(h) and ny(h) are the numbers of haplotypes that begin with haplotype h. We set the parameters of the Beta prior (the prior counts), α and β, to 0.5. Compare this criterion to 55 , which merges two clusters unless the following relation holds for some h: is the proportion of haplotypes in cluster x with that begin with haplotype h, and ̂( ℎ) is the proportion of haplotypes in cluster y that begin with h. We evaluated the phasing accuracy of the algorithm using a few different values for constant C and settled on C = 20.

Evaluation of genotype phasing method
Supplementary Table 7 compares the phasing accuracy of BEAGLE applied to datasets of different size against our phasing method. We evaluated phasing accuracy on a test set of 1,188 unrelated individuals from our database that have been trio-phased. This experiment shows that our implementation infers the phase of new genotype samples more accurately than BEAGLE-and with much lower computational cost-provided we are able to make use of a very large panel of phased genotypes.

Constructing the global ancestry reference panel
Predicting the ethnic origins of an individual's ancestors from their DNA is a central feature of the AncestryDNA product. We use these admixture estimates here to interpret the clusters in relation to worldwide regions (regions in Europe, Africa, and so on). Based on an individual's genotypes, we estimate the proportion of their genome that is attributed to different ancestral populations. For additional details on these methods, refer to the AncestryDNA Ethnicity Estimate White Paper 11 .
For estimating admixture proportions, we curate a reference panel of 3,000 "labeled" genotype samples for which we can reasonably trace their origins to one of the 26 ancestral populations.
We begin with an initial set of 4,657 labeled samples compiled from multiple sources: 855 samples from 52 worldwide populations collected as part of the Human Genome Diversity Project 2,56 ; over 1,200 samples from a proprietary AncestryDNA reference collection; and over 1,600 putatively single-origin samples from AncestryDNA customers that consented to participate in research. To identify AncestryDNA candidates for inclusion in the reference panel, we consult user-generated pedigrees, and we select a customer sample if all lineages trace back to the same geographic region. (More precisely, we check the birth locations of all available grandparents and great-grandparents in the customer's pedigree.) For samples from the proprietary reference collection, we also check birth locations of the most distant ancestors that were provided in the pedigree. We take an additional step to confirm single-origin ancestry of the candidate reference samples in the admixture analysis, detailed below.
The samples from the proprietary collection are genotyped using the same Illumina OmniExpress array (described above). To ensure high-quality genotype data, we follow identical quality control steps to the AncestryDNA samples. The HGDP samples are genotyped on the Illumina 650K platform 2,56 . Therefore, for the admixture analysis we use only the ~300,000 SNPs common to both OmniExpress and 650K arrays.
Population structure analysis can be sensitive to inclusion of genetically related samples. Therefore, we take steps to discard samples so that no two individuals in the reference panel have an unusually high amount of DNA sharing. To quantify DNA sharing, we estimate the probability that 1 and 2 alleles are identical-by-descent (IBD). We use PLINK 57 to compute these probabilities for each pair of individuals. When assessing outlying IBD proportions, we compare against members assigned to the same region only, since different populations can exhibit markedly different IBD distributions (for example, due to historical bottlenecks or rapid population expansion). This filtering step removes about 100 samples from consideration for the reference panel.
Next, we take an iterative process to gradually refine the reference panel, determine the final populations for admixture estimation, and eliminate samples in which the genetic estimates disagree with the provided labels. This process involves iterating two separate analyses: (1) We visually inspect the labeled genotype samples projected onto 2 two principal components (PCs). We use this projection to suggest groupings and identify outliers; groups that do not separate as well as others based on this projection are merged. We also use this projection to identify "outliers" that are far from the majority of the samples assigned to the same group. Because the top 2 PCs correspond to different axes of variation depending on the samples included, we repeat this analysis at different "scales", typically by geography; e.g., global samples, then only samples from Europe, then samples from Scandinavia, then samples from Norway only.
(2) We run additional analyses using ADMIXTURE to further identify outliers and groups that are not well delineated genetically. Specifically, we perform a leave-one-out validation, in which we remove the label of one candidate sample from the reference panel, and attempt to estimate the admixture proportions of this sample using the remaining (labeled) reference samples. We repeat this procedure for each sample. Samples that are assigned high admixture proportions to the wrong populations are considered outliers, and removed from the panel. This leave-one-out validation step is also useful for informing population boundaries; groups that are commonly confounded by ADMIXTURE are combined into one population.
After completing this process, the final reference panel consists of 3,000 samples partitioned into K = 26 ancestral populations ( Supplementary Fig. 7, Supplementary Table 2). This reference panel is used to estimate admixture proportions in all unlabeled (customer) samples.

Estimating ancestral admixture proportions in unlabeled genotype samples using ADMIXTURE
We use the program ADMIXTURE 58,59 to jointly estimate admixture proportions in the labeled reference panel and unlabeled customer samples from their genotypes. One reason to use ADMIXTURE over other software (e.g., STRUCTURE 60 ) is that it can easily incorporate information from labeled samples. Another important reason is that the computation scales well to large data sets, so we can deliver accurate admixture estimates for hundreds or thousands of individuals in a reasonable amount of time.
We iteratively run ADMIXTURE jointly on the 3,000 labeled samples and small batches of unlabeled samples. The size of the batch, as well as the composition of samples included in the batch, can impact the final admixture prediction for a given individual since the population allele frequencies are also adjusted to reflect the unlabeled samples. However, in practice we find that variation in the predictions for different batches is small so long as the number of unlabeled samples included in a single batch is small relative to the size of the reference panel. Also, since admixture estimates can be sensitive to closely related individuals, we take steps to ensure that no closely related samples (customer samples or reference samples) are included in the same batch.
We set K = 26, and run ADMIXTURE in "supervised" mode (more accurately, it is semisupervised). We provide population labels (in a .pop file) for the 3,000 reference samples only.
Since the model assumes independent markers, we use PLINK 57 to prune SNPs that are in high linkage disequilibrium. Because a few ancestral populations represent a large proportion of our reference panel, and therefore dominate the correlation observed in the panel, we calculate the correlation coefficient (r 2 ) between the same pair of SNPs in all 26 populations separately, then we define a new correlation coefficient by taking the average of the 26 values. We repeat this calculation for all pairs of SNPs within each 50-SNP window. We then prune SNPs until no pair of SNPs within a window has an "averaged" r 2 greater than 0.2. We repeat this process for each window, starting at every 5th marker on a chromosome. After this pruning procedure, we arrive at a set of 112,909 SNPs-these are the SNPs used to estimate the admixture proportions for all unlabeled samples.

Overview of IBD network analysis
Rather than infer population characteristics from IBD patterns in known, or assumed, populations, here we aim to discover underlying population structure from IBD. We use a modelfree approach, turning to the well-studied problem in machine learning and statistical physics of learning structure in a network. Intuitively, our approach is analogous to the way that principal components analysis (PCA) has been widely used to infer structure from genetic polymorphism data without specifying a demographic process 60 . (Although we note that recent work has formally related PCA to underlying demographic processes; e.g., 61 .) The key idea is to transform the problem of inferring population structure from IBD to the well-studied problem of learning structure in a network-that is, an undirected graph with weighted edges 39,45,62 . This is similar in some respects to the approach described by Gusev et al. 63 . Our hypothesis is that some of the structural features we identify in the IBD-based network can be related to population demography. (Note this approach is unrelated to reconstruction of haplotype networks 64 .) Our method involves three key steps. First, we estimate the total length of IBD shared between each pair of samples in our database (this step is described above). Second, using the estimated IBD, we construct an IBD network-a graph in which vertices correspond to genotyped individuals, and edges are a function of the estimated IBD between each pair (details are given in next section). Third, we build on methods developed in machine learning and statistical physics to study the structural properties of the IBD network.
We take a simple approach to inferring structure in the IBD network by identifying subgraphs with a relatively high density of internal edges-these are commonly called either modules or communities. (In our paper, we have deliberately avoided using the term "community" in this way because it can be confused with its usage in population studies, although we do refer to the method as "community detection".) This is the most widely used strategy for inferring network structure, and many algorithms have been developed that can quickly and accurately approximate the modular structure of a network 39,62 .
In contrast to PCA applied to genetic polymorphism data, here we do not have the benefit of previous work supporting the interpretation of modular network structure as an underlying demographic process. Another underlying concern is that we cannot guarantee that standard theory and practice of inferring modular network structure is applicable because the IBD network has an unusual combination of properties that distinguish it from other types of networks commonly studied in the literature: (1) The network is very sparse; for example, if we assign nonzero edges to all pairs with total IBD > 12 cM in our sample (see next section), only 0.2% of pairs are connected in the network.
(2) The edge weights are noisy; IBD as a predictor of familial relationships has high variance, even when detection of IBD is very accurate.
(3) Most community detection methods assign each vertex to a module, but here we expect that many, if not most, individuals do not neatly "fit" within any single module-consider an individual with grandparents from different populations.
(4) In addition, some individuals may not belong to any module; for example, individuals from groups that are poorly represented in our sample.
With these points in mind, we have developed a network analysis based on two complementary methods for inferring modular network structure: (1) a hierarchical clustering method that recursively maximizes the modularity of the network; (2) a spectral analysis method that generates a low-dimension representation of the network structure, which we use to extract "unusually disconnected" subsets of the network. In the remainder of the Supplementary Methods, we use "cluster" to refer exclusively to a subgraph identified by recursive modularity maximization, and a "stable subset" to mean a subgraph identified via the spectral analysis.
(Elsewhere in the description of the results, for brevity we also use "cluster" to refer to stable subsets when the distinction between cluster and stable subset is not relevant to the result.) We use the term "stable subset" to contrast with degenerate network clusterings 40 that may underlie continuous variation in IBD due to, for example, isolation-by-distance. These stable subsets tend to isolate the more discontinuous portion of variation in IBD that putatively reflects IBD patterns from distinct subgroups-see Supplementary Fig. 6 for an illustration of this using simulated data. Unlike the hierarchical clustering, in the spectral analysis we do not attempt to assign every individual to a cluster-we only use it to identify subsets that are unusually disconnected from the rest of the network. Since we have found that this type of population structure is typically more straightforward to interpret, much of our presentation focuses on annotating and interpreting the network structure captured by the spectral analysis. For additional discussion of the relationship between the hierarchical clustering and the spectral analysis, see "Discussion of network analysis" in the Supplementary Text.

Constructing the IBD network
There are two key considerations that constrain our choice of edge weight function w[e(i, j)]. First, only a very small proportion of estimated IBD segment lengths are suggestive of close relationships. Therefore, if we place most of the weight on close relationships, the graph will be extremely sparse and disconnected, and there will be little population demographic structure that can be inferred from the graph. A second consideration is that while GERMLINE can infer longer IBD segments with high accuracy, it cannot reliably distinguish between shorter tracts that are truly inherited from a common ancestor and false positive IBD 65 . Therefore, if we place substantial weight on shorter shared IBD arising from more distant familial relationships (e.g., IBD less than 4 cM in length), there is a good chance that the "noise" in the network structure will overwhelm the pattern of genetic connections that we are ultimately interested in investigating.
Within these constraints, we still have considerable flexibility for defining edge weights from IBD. Our strategy is as follows. First, we choose a target range of ancestral generations. Second, we empirically assess the distribution of IBD lengths via simulation. Third, we place most weight on IBD lengths arising from familial relationships corresponding to the target generations. The defined edge weights necessarily hinge on the assumptions made in the simulations. Therefore, we make these assumptions, and their rationale, clear.
We compile estimated IBD from a variety of familial relationships by simulating reproductive events from a subset of the sample genotypes. This simulation is intended to capture the correspondence between familial relationship (specifically, number of separating meioses) and IBD segment length for an "idealized" (random mating) population with a population distribution that reflects our customer database. We begin with a subset of 24,362 samples selected so that no pair of samples share a 20-cM IBD segment (as detected using the procedures described above). We draw samples at random without replacement to simulate familial relationships as close as parent-child and as distant as 10th cousins. Supplementary Fig. 29 gives an example simulation of two individuals (labeled S3 and S4) with a first-cousin relationship. All other familial relationships are simulated in a similar way. Note that we do not simulate relationships, such as half-sibs, that do not follow this pedigree pattern. Recombination events during meiosis are simulated according to interpolated HapMap genetic distances 66 . The final product of the simulations is 4,412 genomes with known familial relationships.
As discussed above, we note that this simulation does not capture population-specific excess IBD sharing due to founder events, non-random mating, rapid population expansion, or both. This means that for some subpopulations, most of the edge weights will be more correctly concentrated on IBD due to ancestors a specified number of generations back, whereas for other subpopulations with sharp deviations from our simulation assumptions (e.g., European Jewish), the IBD corresponds in expectation to slightly more distant common ancestors. As a result, clustering of the IBD network could reveal population structure that is further back from the target generations.
The distribution of total IBD from this simulation experiment is summarized in Supplementary  Fig. 30. Note that the proportions for a given amount of total IBD depend on the relative number of relationships simulated of a given type. We ensured that the number of relationships doubles for every increase of 2 to the number of separating meioses (e.g., there are twice as many pairs separated by 6 meioses as there are separated by 4 meioses).
Finally, we define the edge weights in the network as the observed proportion of total IBD lengths that are due to relationships separated by at most 8 meioses (corresponding to common ancestors at most 4 generations back). This empirical distribution is fit to the Beta cumulative density function, and this fitted distribution (with scale parameters α = 2, β = 200) defines the weights for all edges in the network (refer to Supplementary Fig. 2 for more details). Furthermore, we remove all edges corresponding to pairs with total IBD less than 12 cM since they signal the target familial relationships less than 6% of the time, and therefore contribute little weight to the network. Intuitively, pruning edges representing small amounts of genomic sharing allows us to focus on IBD that corresponds to more recent demography, while also removing significant amounts of spuriously identified IBD 65 . This reduces the chance of having edges corresponding to false-positive IBD, while allowing for a large number of edges in the graph.

Hierarchical clustering of IBD network
In this phase of our analysis, we employ a simple and fast heuristic algorithm, the multi-level or Louvain method 67 , to identify network modules. After running this multi-level community detection algorithm, 99.9% of the IBD network (768,758 out of 769,444 vertices) is subdivided into 6 clusters (Supplementary Data 2). The rest of the network (0.1% of the vertices) is assigned to many extremely small clusters with at most 101 members. Many of these small clusters likely correspond to subpopulations that have poor representation in our database, or to unusually large, tight-knit families. They are difficult to interpret based on the available genealogical data, so we do not examine them further.
To investigate more fine-scale clustering, and potentially fine-scale population structure, we split the original network into 5 sub-networks corresponding to the largest 5 clusters, then we partition these sub-networks into additional clusters using the same multi-level community detection algorithm. Although the smallest clusters are more likely to contain additional modular structure than larger clusters 68 , to safeguard the clustering against over-fitting due to noise in the observed edges and the sparse number of pairs with IBD > 12 cM, we only run this second round of community detection on the 5 clusters with at least 10,000 members. (Of the 6 initial clusters, the smallest contains only 3,845 samples; see Supplementary Data 2.) Applying the multi-level algorithm to the 5 sub-networks partitions the network into a total of 112 clusters, the majority of which are very small; 71 out of the 112 have less than 100 members.
Again, since small clusters are more difficult to reliably interpret with the available genealogical data, in the results we restrict in our attention to clusters with at least 2,000 members. A total of 22 second-level clusters have at least 2,000 members (Supplementary Data 2). These 22 clusters, and the 6th-largest top-level cluster that was not subdivided further, account for 98.8% of the network (759,925 out of 769,444 vertices). We observe wide variation in size among the 22 largest second-level clusters; the largest has 108,786 members, while others have just over 2,000 members.
To generate the third level of the cluster hierarchy, we apply the multi-level method to the 12 second-level (sub-network) clusters with more than 10,000 members. Many of these clusters are likely informative of additional, more subtle trends in population structure, but are often difficult to interpret unambiguously from our data. In total, we identify 164 clusters within the 12 second-level clusters. As before, many of these clusters are small; 64 have less than 100 members. Still, we identify a total of 55 clusters with more than 2,000 members, the largest of which includes 65,551 samples (this is a subset of the "Lower Midwest and Appalachians" cluster; see Supplementary Data 2).
Subsequent inspection of the genealogical data in the third-level clustering strongly indicates that many of these clusters are highly informative of fine-scale population structure-some of the more unmistakable examples are clusters corresponding to Italians, Irish, Scottish, Finnish, Norwegians and Puerto Ricans (see Supplementary Figs.

25-27 and Supplementary
Discussion). However, out of concern for reporting network structure attributed to synthetic factors (either false positive clusters, or clustering that can be strongly biased by properties of the modularity-maximization approach), as well as the difficulty of characterizing the subtle population structure underlying many of these clusters, we focus on the first and second levels of the hierarchical clustering in the main presentation of the results. Although the third-level clustering is not the focus of our main presentation, we do treat these clusters as candidates for the spectral analysis (see below). Some of these clusters are indeed supported by the spectral analysis; that is, some of the clusters detected in the largest second-level clusters closely coincide with "stable subsets" identified in the spectral analysis (Supplementary Data 2).

Spectral analysis of IBD network
We complement the hierarchical clustering using a spectral dimensionality reduction technique for network data. Our spectral analysis approach is based on the Laplacian eigenmaps method 69 , which has close connections to spectral clustering 45,46 . Spectral methods have been previously used to infer population structure from genetic data 42,43 .
The Laplacian eigenmaps method 69 is derived from a spectral decomposition of the (normalized) Laplacian matrix. The intuition behind this approach is that the eigenvectors associated with the largest eigenvalues of the Laplacian matrix (outside the largest eigenvalue, which is always 1, or nearly 1) separate disconnected components, or weakly connected components, of the graph.
We define the spectral embedding as the first m eigenvectors of the normalized Laplacian. (To the best of our knowledge, there are no theoretical results guiding the choice of m in this setting, and this is a point we return to below.) We efficiently solve for the largest m eigenvectors of the sparse, symmetric n × n Laplacian matrix using the Lanczos iterative algorithm 70 , implemented in ARPACK 71 , and interfaced to R through the rARPACK library. To justify the interpretation of the spectral embedding as the projection of samples onto a Euclidean space, we note that the first m eigenvectors and eigenvalues can be used to formally define a projection operator 72 .
A key feature of the spectral analysis is that provides a continuous representation of network structure, potentially overcoming the unnatural assumption that each sample belongs to a single cluster, or population. However, it is currently unknown how to generalize interpretation of this representation beyond the few examples we have seen where some dimensions of the spectral embedding correlate strongly with admixture proportions estimated using ADMIXTURE ( Supplementary Fig. 22). Therefore, we take a simple approach to infer population structure from the spectral embedding by projecting the hierarchical clustering onto this embedding, then using this projection to extract clusters. These clusters, which we refer to as "stable subsets," represent unusually disconnected portions of the network.
Before describing our procedures for identifying stable subsets more formally, we first give an example to illustrate how the spectral embedding can be used to identify these highly disconnected subsets. Supplementary Fig. 31 shows the projection of all genotyped individuals (i.e., vertices in the IBD network) onto the first two dimensions of the spectral embedding. The vast majority of samples are concentrated near the origin. Labeling the samples according to their membership to the first-level IBD network clusters, we find that many of the samples projecting away from the origin along the first dimension are assigned to the cluster labeled as "Caribbeans"; these are drawn as blue circles and crosses in the figure. Many of the samples projected away from the origin along the second dimension belong to the cluster labeled as "European Jewish"; these are drawn as red circles and crosses in the figure. (Later, we explain how we interpret these clusters as European Jewish and Caribbean Islanders.) To define a stable subset for the Caribbeans cluster, we specify a classification rule of the form yi,1 > b1 ∧ yi,2 > b2 ⇒ "i is a member of Caribbeans stable subset". Here, yi,1 and yi,2 are the projection of sample i onto dimensions 1 and 2 of the spectral embedding, rotated by r degrees. The choice of this rule-that is, the choice of parameters b1, b2 and r-is subject to the following condition: most of the samples satisfying this classification rule must also be assigned to the Caribbeans cluster (again, as identified by hierarchical clustering). In other words, the rate of "false positives" must be low, in which we define "true positives" and "false positives" using hierarchical clustering membership as the "ground-truth." Setting b1 = 0.001, b2 = -0.0005 and r = 8 degrees counter-clockwise, 9,315 out of 11,807 Caribbean cluster members are assigned to the stable subset (79% recall). These are the blue circles in Supplementary Fig. 31. The 2,492 cluster members that are not assigned to the stable subset are shown as blue crosses. There are an additional 60 samples that satisfy this classification rule, but are not members of the Caribbean cluster (and so are not represented in the figure). Therefore, we obtain a false positive rate of only 0.6%. The final stable subset we report is the set of 9,315 samples that satisfy this classification rule and are assigned to the Caribbeans cluster in the hierarchical clustering (see Supplementary Data 2).
Following the same procedure, we define the stable subset for the European Jewish cluster. In this case, we specify the classification rule as yi,1 < 0.001 ∧ yi,2 < -0.001 ⇒ "i is a member of European Jewish stable subset". Applying this classifier, 26,547 out of 32,708 cluster members are included in the stable subset (81% recall). The selected samples are the red circles in the figure, and the remaining cluster members are shown as red crosses. In this case, only 13 individuals that are not members of the European Jewish cluster satisfy this classification rule, for a very low false positive rate of 0.05%. The final stable subset we report is the set of 26,547 samples that satisfy this classification rule and are assigned to the cluster. Below, we describe the desired thresholds for recall and false positive rate in the identification of stable subsets, as well as our rationale for using the hierarchical clustering as validation.
Note that many of the members of the two clusters that are not included in their respective stable subsets-the blue and red crosses-lie somewhere in between the two stable subsets. Our conjecture is that this is an example of how the community detection method arbitrarily assigns "admixed" individuals to a single cluster-these are putatively individuals of both Jewish and Caribbean descent-whereas the mapping onto the spectral embedding allows us to distinguish such individuals.
Once the eigenvectors corresponding to the largest m eigenvalues have been computed, a common strategy is to use a simple algorithm such as k-means to estimate clusters in the spectral embedding (e.g., 46 ). Although this approach has been successful in many domains, the "local scaling problem" in spectral clustering, which has been studied in other types of data 73 , and here reflects the relative density of the IBD connections, complicates this enormously. This is readily appreciated-in two dimensions, at least-by observing the wide variation in dispersal of the clusters identified in the spectral embedding (Supplementary Figs. 4,5). To circumvent the local scaling problem, we use the projection of the hierarchical clustering onto the spectral embedding to generate a set of "candidate clusters" for the spectral analysis.
Our formal procedure for identifying stable subsets is as follows. The first step is to identify a cluster that projects away from the origin in the spectral embedding. This is accomplished simply by visually inspecting the projection of the candidate clusters in the spectral embedding, in which we label the clusters with different colors and symbol shapes (see Supplementary Figs.

4, 5 for examples of this).
Obviously, this can be only realistically done in two dimensions at a time. Further, to make this process more tractable, we only inspect pairs of consecutive eigenvectors; that is, j = i + 1. It is conceivable that some clusters are better delineated by nonconsecutive pairs of eigenvectors, or even more than two eigenvectors 74 , but we did not investigate this. Thus, this procedure does not guarantee identification of all stable subsets in the spectral embedding.
Once we have selected a cluster that projects away from the origin, we attempt to identify a subset satisfying the following definition: a subset of a hierarchical cluster is defined as a stable subset if it is possible to specify a simple linear classification rule (detailed above) using eigenvectors i and j to classify some of the individuals to the same cluster with a low rate of false positives (again, a "false positive" is a sample that is incorrectly classified according to the cluster assignment from the hierarchical clustering). This definition does not uniquely determine the stable sets, as there is still considerable flexibility in how these stable subsets can be chosen from the spectral embedding. Our approach is to specify the linear decision rule that recovers the largest number of cluster members (that is, high recall), while keeping the false positive rate acceptably low (as a guideline, we use 10%). All stable subsets we detect in the spectral embedding have low false positive rates; the largest false positive rate, in the Central American cluster, is 12% (Supplementary Data 2). Finally, to verify that the selected cluster is the most appropriate one for defining the stable subset, we check that the correlation between the linear classifier and cluster assignment is highest for the selected hierarchical cluster. Linear decision rules, false positive rates and other details for all identified stable subsets are given in Supplementary Data 2.
Beyond the local scaling problem, another challenge with identifying clusters in the spectral embedding is that in most dimensions of the spectral embedding, only a small number of samples project away from the origin-typically far less than 100. These correspond to very small subgraphs of the IBD network that are the least connected with the rest of the network. (Note that these highly disconnected subgraphs do not necessarily correspond to the very small clusters identified in the first, second and third rounds of the hierarchical clustering.) Furthermore, many of the same clusters appear in multiple dimensions of the spectral embedding. In short, the spectral embedding captures the clusters and small numbers of samples that exhibit the most dominant modular structure in the network, possibly obscuring other, more subtly disconnected subsets. (In some respects, this is the opposite of the "resolution limit" problem in modularity-maximizing methods 68 , in which strong modular structure in small subgraphs can be obscured by more subtle modular structure in large portions of the network.) We address this issue by isolating the subgraph for which we have not identified any substructure, analogous to the approach of recursively subdividing clusters in the hierarchical clustering. Initially, we compute the spectral embedding from the completely connected graph with 769,444 vertices. Once we have completed the spectral analysis of this graph, we compute the spectral embedding from a subgraph with 586,147 vertices that is obtained by first removing the small sets of individuals and the clusters that project away from the origin in the initial spectral embedding (i.e., the clusters for which we are able to identify stable subsets in the first phase). Therefore, we use two sets of eigenvectors in the spectral analysis: eigenvectors of the Laplacian defined by the completely connected graph with 769,444 vertices; and eigenvectors of the Laplacian defined by the subgraph on 586,147 vertices.
Finally, an important aspect to this procedure that remains unresolved is the number of eigenvectors that are used to define the spectral embedding. One commonly proposed criterion is the "eigengap" heuristic 42 , which is based on theory showing that a relatively large difference in consecutive eigenvalues of the Laplacian matrix is suggestive of the number of disconnected components of the graph. However, this heuristic only works well if the network contains very well-pronounced modules 45 , which is not the case here. Here, we limit the spectral embedding to the top m = 40 eigenvectors, primarily for manageability of the analysis procedure. It is possible that inspecting more eigenvectors (with smaller eigenvalues) could provide support for additional substructure in the IBD network.

Projecting 1000 Genomes samples onto the spectral embedding
Since the spectral embedding defines a projection operator 63 , we can map new genotype samples onto the manifold defined by this embedding. This allows us to validate against data not used to construct the embedding. Here, we use the 1000 Genomes data 31 . The SNPs genotyped in the AncestryDNA samples (using the OmniExpress chip) were also genotyped in 1,816 unrelated 1000 Genomes samples (using the Illumina OMNI 2.5M chip), so we can follow the steps above to estimate IBD between all pairs (i, j), in which i is a 1000 Genomes sample and j is an AncestryDNA sample. (Note that we do not need to estimate IBD shared by 1000 Genomes samples as it is not needed to compute the projection, below.) These IBD detection results are summarized in Supplementary Table 3 Genomes data (e.g., Fig. 6).

Supplementary figures
Supplementary Figure 1    connected uniformly at random if they aren't already connected by an edge. Panel b shows the (symmetric) adjacency matrix of the undirected graph for 500 randomly chosen vertices of this graph. (Note that the diagonal of this adjacency matrix is set to 1 for the Laplacian eigenmaps method.) After generating these data, we apply the community detection (Panel d) and spectral analysis methods (Panel c) to these data. The community detection algorithm subdivides the network into 15 communities, or clusters; the assignment of samples to these 15 clusters is depicted by different colors and shapes in Panel a, c and d. These 15 clusters accurately capture the 3 populations because there are few connections between these populations, but it also subdivides each of the 3 populations in such a way that samples nearby each other are usually included in the same cluster. Panel d shows the sample adjacency matrix, again for a random subset of 500 vertices, in which the vertices are arranged by assignment to the 15 clusters. Despite the apparent arbitrariness of this clustering, it is consistent with the aim of maximizing the modularity of the network, as the density of connections between the detected clusters is relatively small. In Panel c, the dominant structure captured by the spectral analysis-specifically, the first 2 eigenvectors of the Laplacian-is the separation of the points into the 3 populations. Although the spectral embedding also captures structure within population 3, specifically the location in population 3 along the horizontal (x) axis, this is a less dominant feature in the embedding. In summary, this example illustrates that the community detection method identifies clusters that capture both discrete and continuous population structure (e.g., isolation-by-distance), whereas the spectral analysis can be used to isolate discrete population structure.