Neural ADMIXTURE for rapid genomic clustering

Characterizing the genetic structure of large cohorts has become increasingly important as genetic studies extend to massive, increasingly diverse biobanks. Popular methods decompose individual genomes into fractional cluster assignments with each cluster representing a vector of DNA variant frequencies. However, with rapidly increasing biobank sizes, these methods have become computationally intractable. Here we present Neural ADMIXTURE, a neural network autoencoder that follows the same modeling assumptions as the current standard algorithm, ADMIXTURE, while reducing the compute time by orders of magnitude, surpassing even the fastest alternatives. One month of continuous compute using ADMIXTURE can be reduced to just hours with Neural ADMIXTURE. A multi-head approach allows Neural ADMIXTURE to offer even further acceleration by computing multiple cluster numbers in a single run. Furthermore, the models can be stored, allowing cluster assignment to be performed on new data in linear time without needing to share the training samples.

The rapid growth in sequenced human genomes and the proliferation of population-scale biobanks have enabled the creation of increasingly accurate models to predict traits and disease risk using an individual's genome. However, different predictive models can be required depending on an individual's genetic ancestry, and this necessitates accurately characterizing genetic cluster composition at the individual level 1 . Such characterization is also an essential part of most modern population genetics studies and national biobanking efforts 2,3 . However, many existing algorithms for this task struggle with next-generation sequencing datasets, where both the number of samples and the number of sequenced positions along the genome are much greater than earlier case-control genotyping studies. Scalable algorithms to characterize the population structure of genetic sequences are especially important for more diverse biobanks, themselves needed to correct the extreme imbalance towards European-descent samples in existing studies in order to avoid a new divide in healthcare arising through omitting most of the world's population from precision health research 4 .
A common approach for characterizing the population structure within a genetic dataset is to describe each sample as a set of fractional assignments to each cluster. These clusters are centroids found via an unsupervised algorithm in a space spanning the frequencies of each variant. By avoiding the culture-specific labels and subjective constructs (for example, ethnicity) of supervised classification methods 5 , these unsupervised approaches can better reflect the spectrum of genetic structure across samples. Generally, the input variants are the individual's sequence of single nucleotide polymorphisms (SNPs), that is, single positions along the genome known to vary between individuals. Smaller datasets of less numerous variants, such as microsatellites, have also been used. There are millions of SNPs in the human genome and most are biallelic (two variants), permitting a binary encoding. For instance, zero could be used to encode the most common (or reference) variant at an SNP position on the genome and one to encode the minority (or alternate) variant. The frequency distribution of these variants will vary between populations due to differing histories: founder events, migration, isolation, and drift.

Article https://doi.org/10.1038/s43588-023-00482-7

We present an autoencoder that expands on the clustering method for genomes: ADMIXTURE 6,7 . ADMIXTURE was developed as a computationally efficient alternative to STRUCTURE 8 , and we take this pursuit of efficiency now to the next generation of datasets. Our proposed method, Neural ADMIXTURE, follows the same modeling assumptions as ADMIXTURE, but reframes the task as a neural-network-based autoencoder, providing faster computational times, both on graphics processing units (GPUs) and on central processing units (CPUs), while maintaining high-quality assignments.

Model overview
Neural ADMIXTURE (Fig. 1a) is an interpretable autoencoder with two main components: (1) an encoder, composed of two linear layers with a Gaussian error linear unit (GELU) activation 9 in-between, then a softmax activation, which projects a genotype sequence onto a vector representing fractional ancestry assignments for each individual ($Q$); and (2) a decoder, which is a single linear layer whose weights are restricted to lie between 0 and 1, leading to an interpretable projection matrix that learns the cluster centroids, or equivalently, the average variant frequency at each site for each population ($F$). Additionally, we introduce Multi-head Neural ADMIXTURE (Fig. 1b), which computes cluster assignments for several different numbers of clusters, while needing only a single training for all numbers of clusters desired.
Neural ADMIXTURE was trained with a standard binary cross-entropy, leading to an equivalence with the traditional ADMIXTURE model's objective function (Methods). Two initialization techniques, one based on principal component analysis 10-12 and the other on archetypal analysis 13 , were used as an alternative to common network initializations to speed up the training process and improve results (Supplementary section 'Decoder initialization'). Furthermore, two mechanisms are available to incorporate prior knowledge about the amount of admixture in a dataset by controlling the softness of the cluster assignments: applying L2 regularization during training (Methods) and softmax tempering (Supplementary section 'Softmax tempering'). Both single-head and multi-head approaches can be adapted to a supervised version that performs regular classification given known training labels (Supplementary section 'Supervised training'). The proposed method is fully compatible with the original ADMIXTURE framework, allowing the use of ADMIXTURE results as an initialization for Neural ADMIXTURE parameters (Supplementary section 'Pretrained mode'), and vice versa. We performed an in-depth evaluation of the proposed method and compared it with competing approaches across multiple datasets, including using simulations from a variety of systems 14-17 and using samples from large-scale, real-world biobanks (Methods, Supplementary Table 1, Supplementary Table 2, and Supplementary section 'Dataset description').

Fig. 1: a, The input sequence ($x$) is projected into 64 dimensions using a linear layer ($\theta_1$) and processed by a GELU non-linearity ($\sigma_1$). The cluster assignment estimates $Q$ are computed by feeding the 64-dimensional sequence to a $K$-neuron layer (parametrized by $\theta_2$) activated with a softmax ($\sigma_2$). Finally, the decoder outputs a reconstruction of the input using a linear layer with weights $F$. Note that the decoder is restricted to this linear architecture to ensure interpretability. b, Simple multi-head example with $H = 3$. The 64-dimensional hidden vector is copied and processed independently by different sets of weights ($\theta_{2_h}$), which yield vectors of different dimensions, corresponding to the different $K$ values. Each different $Q_{K_h}$ matrix is processed independently by different decoder matrices $F_{K_h}$, yielding $H$ different reconstructions. All parameters are optimized jointly in an end-to-end fashion.
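The single-head architecture described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and variable names (`forward`, `W1`, `W2`, `HIDDEN`) are hypothetical, the weights are random rather than trained, the tanh approximation of GELU is used, and the batch normalization applied to the input is omitted for brevity.

```python
import numpy as np

def gelu(z):
    # tanh approximation of the GELU activation
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def softmax(z):
    # row-wise softmax, shifted for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2, F):
    """Encoder (linear -> GELU -> linear -> softmax) followed by a linear decoder."""
    hidden = gelu(X @ W1 + b1)      # (N, 64) shared representation
    Q = softmax(hidden @ W2 + b2)   # (N, K) fractional cluster assignments
    X_rec = Q @ F                   # (N, M) reconstruction; F holds frequencies in [0, 1]
    return Q, X_rec

rng = np.random.default_rng(0)
N, M, K, HIDDEN = 8, 100, 3, 64             # toy sizes (assumed for illustration)
X = rng.integers(0, 3, size=(N, M)) / 2.0   # genotype counts {0, 1, 2} scaled to {0, 0.5, 1}
W1 = rng.normal(scale=0.1, size=(M, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, K))
b2 = np.zeros(K)
F = rng.uniform(0.0, 1.0, size=(K, M))      # decoder weights constrained to [0, 1]

Q, X_rec = forward(X, W1, b1, W2, b2, F)
```

Because each row of `Q` is a convex combination and every entry of `F` lies in [0, 1], every reconstructed value also lies in [0, 1], which is what allows the rows of `F` to be read as per-cluster variant frequencies.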

Single-head and multi-head results
Neural ADMIXTURE is systematically faster than alternative algorithms, both on CPU and GPU (Table 1, Supplementary Fig. 1). This speedup is further enhanced when using the Multi-head Neural ADMIXTURE architecture, which can perform clusterings for different K values simultaneously. For example, in the All-Chms dataset, we observed that Neural ADMIXTURE trained in less than 2 min, whereas ADMIXTURE required more than a day. Neural ADMIXTURE performs at least as well as existing algorithms on both predicting the ancestry assignments (Q) and the allele frequencies (F). On average, Neural ADMIXTURE's Q estimates appear to be more similar to the matrix of known labels than the Q estimates from previous methods (Extended Data Fig. 1). Table 2 shows the accuracy and time performance of ADMIXTURE and Neural ADMIXTURE on the test data for three different datasets. Both ADMIXTURE and Neural ADMIXTURE are able to generalize and produce consistent assignments on unseen data. However, Neural ADMIXTURE is much faster than ADMIXTURE on both CPU and GPU, because ADMIXTURE must optimize the objective with a fixed F to find Q for unseen data, whereas Neural ADMIXTURE directly learns a function that estimates Q. We note that inference on GPU is extremely fast (generally less than a second for a forward pass); the computational bottleneck comes simply from reading and processing of the data, which could be further addressed.
We visualized the Q estimates of ADMIXTURE and Neural ADMIXTURE on the Chm-22-Sim dataset using pong 18 (Fig. 2a-d). The SNP frequencies (the entries in the F matrix) from both models can be observed as projections onto the first two principal components of the training data (Fig. 2e). Neural ADMIXTURE provides harder cluster predictions, with many samples being assigned only to a single population, whereas ADMIXTURE provides softer cluster predictions with partial assignments to multiple clusters. On this dataset, ADMIXTURE does not assign different clusters to Native Americans (AMR) and East Asians (EAS); instead, it partitions Africans (AFR) into two different ancestry clusters (Fig. 2a,b). Neural ADMIXTURE, however, does split AMR and EAS populations (Fig. 2c-e). Depictions of the cluster assignments (Q) of all algorithms on several datasets can be found in Supplementary Figs. 2-5. We applied Neural ADMIXTURE, trained on Chm-22-Sim, to admixed populations that were not present in the training data: Mexican Ancestry in Los Angeles, California (MXL, 118), and Puerto Ricans in Puerto Rico (PUR, 104) (Fig. 2f).
We evaluated Multi-head Neural ADMIXTURE with Chm-22-Sim (Extended Data Fig. 2) and showed that as the number of clusters increases, each population group gets assigned its own cluster. Furthermore, we showed that Multi-head Neural ADMIXTURE can be successfully applied to closely related populations (Extended Data Fig. 3). Finally, we showed that the proposed method can be applied on real, admixed datasets (Extended Data Fig. 4).

Table 1 notes: Metrics reported from the training data. Root mean squared error (RMSE) ($F$, $F_{GT}$), as defined in the Methods section, for fastSTRUCTURE, TeraStructure, and HaploNet was not computed because the first two lack an allele frequency matrix and the third lacks interpretability. HaploNet was not run on CPU because its resource and time requirements exceed system capabilities. Runtime format is HH:MM:SS and denotes wall-clock time. A runtime longer than a day denotes that the algorithm could not finish on the described hardware within 24 h, requiring it to be run on alternative hardware for longer. The best performing method for a given metric is highlighted in bold.

UK Biobank computational analysis
To assess the clustering speed on a very large dataset, we ran Neural ADMIXTURE in its multi-head mode on the entire UK Biobank (a total of 488,377 samples), using 147,604 SNPs obtained by pruning the full set to remove linkage disequilibrium (LD) 19 . Neural ADMIXTURE was able to process the complete dataset within 11 h, providing results from K = 2 to K = 6, whereas ADMIXTURE would take about a month to do the same, given that it took 5.5 days to provide results for K = 2. Traditional techniques such as ADMIXTURE are thus too slow for such large biobanks, particularly because multiple additional runs with different parameters and subsets of data are generally needed in a study. Neural ADMIXTURE was trained without regularization ($\lambda = 0$, Methods) and using the PCK-means initialization (Supplementary Algorithm 1). During inference, the temperature was set to $\tau = 3/2$ (Supplementary section 'Softmax tempering'). Figure 3 displays these cluster assignments for the UK Biobank genomes. We group the individuals by their reported country of birth; those with missing or non-existent country-of-birth labels were excluded from the plots.

Table 2 notes: ADMIXTURE results were computed using the Projection analysis mode, which reuses the F matrix computed during the fitting stage using the training data. Neural ADMIXTURE results were computed by simply feeding the sequences to the trained encoder, hence the extremely fast execution time. AlStructure, TeraStructure, and HaploNet lack the ability to compute ancestry assignments on data they were not trained on and so are not taken into account. Runtime format is HH:MM:SS and denotes wall-clock time. The best performing method for a given metric is highlighted in bold.

Fig. 2: a, Q estimates of ADMIXTURE on training data. b, Q estimates of ADMIXTURE on test data. c, Q estimates of Neural ADMIXTURE on training data. d, Q estimates of Neural ADMIXTURE on test data. e, Two-dimensional principal component analysis (PCA) projection of the training data and the matrix F learnt by both ADMIXTURE and Neural ADMIXTURE, which correspond to the cluster centroids. The color of each individual in the PCA represents its ground truth regional label. f, Q estimates of Neural ADMIXTURE on admixed populations not present in the training data. Among the MXL samples, we observe mainly an orange AMR component with a red and yellow component (West Asians (WAS) and Europeans (EUR), respectively). These latter components likely originate from the immigration of Spanish, Morisco, and Sephardic Jewish individuals into Mexico during the colonial period. The PUR samples exhibit EUR, WAS, AMR, and AFR ancestry clusters. The additional AFR component is likely linked to the introduction of enslaved West Africans during the colonial period. In the barplots (used to visualize Q), each vertical bar represents an individual sample and bar color lengths represent the proportion of the sample's ancestry assigned to that colored cluster. OCE, Oceanians; SAS, South Asians.

Scalability analysis
To assess the scalability of different methods, we simulated multiple datasets with various numbers of variants and samples using the software reported previously 17 . The datasets consist of combinations of N ∈ {1,000, 5,000, 10,000, 20,000, 50,000} and M ∈ {1,000, 10,000, 50,000, 100,000}, where N and M are the number of samples and SNPs, respectively.
We compared the training times of ADMIXTURE, AlStructure, TeraStructure, and Neural ADMIXTURE, both on CPU and GPU, across different dataset sizes (Fig. 4).

Discussion
Many unsupervised clustering methods for genotype sequences have been introduced 8,20-25 , including the most commonly used, ADMIXTURE 6,7 . These methods, which resemble a non-negative matrix factorization, decompose each input sequence into a set of cluster assignments and compute a centroid for each cluster. The cluster assignments give the proportion of each genetic ancestry cluster for an individual, whereas the cluster centroids give the SNP variant frequencies at each genetic position corresponding to each cluster. Humans are diploid, so most individuals carry a paternal and a maternal copy of each non-sex chromosome. Therefore, for a given individual at each genomic position, there are four possible combinations of biallelic SNPs (0/0, 0/1, 1/0, 1/1). It is common practice to sum the maternal and paternal variants, obtaining a count sequence $n_{ij}$. In this scenario, an individual $i$ has $n_{ij} \in \{0, 1, 2\}$ copies of the minority variant at SNP $j$. ADMIXTURE models each individual's count sequence, given a fixed number of population groups $K$, as $n_{ij} \sim \mathrm{Bin}(2, p_{ij})$, where $p_{ij} = \sum_k q_{ik} f_{kj}$, with $q_{ik}$ denoting the fraction of population $k$ assigned to individual $i$, and $f_{kj}$ denoting the frequency of the variant encoded as '1' at SNP $j$ in population $k$. ADMIXTURE applies block relaxation to find the parameters $Q$ and $F$ that minimize the negative log-likelihood function shown in equation (1). The value of $K$ (number of clusters) is typically chosen by using an ad hoc cross-validation procedure 7 , necessitating runs across a range of values. The block relaxation optimization in ADMIXTURE runs much faster than other approaches used by its main competitors, namely FRAPPE 21 and STRUCTURE 8 . Although ADMIXTURE can be run in multi-threading mode, which greatly reduces the execution time, this is insufficient when dealing with either a large number of samples or a large number of SNPs.
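The binomial model above can be made concrete with a short NumPy sketch that simulates genotype counts from given $Q$ and $F$ and evaluates the corresponding negative log-likelihood. The function name `admixture_nll` and the toy dimensions are assumptions for illustration; the constant binomial-coefficient term is dropped because it does not depend on $Q$ or $F$.

```python
import numpy as np

def admixture_nll(counts, Q, F, eps=1e-12):
    """Negative log-likelihood of counts n_ij ~ Bin(2, p_ij) with p = Q F,
    omitting the constant binomial coefficient."""
    P = np.clip(Q @ F, eps, 1.0 - eps)   # p_ij = sum_k q_ik f_kj
    return -np.sum(counts * np.log(P) + (2 - counts) * np.log(1.0 - P))

rng = np.random.default_rng(1)
N, M, K = 5, 20, 2                        # toy sizes (assumed)
Q = rng.dirichlet(np.ones(K), size=N)     # each row is non-negative and sums to 1
F = rng.uniform(0.05, 0.95, size=(K, M))  # per-cluster variant frequencies
counts = rng.binomial(2, Q @ F)           # simulated genotype counts in {0, 1, 2}

nll = admixture_nll(counts, Q, F)
```

ADMIXTURE searches for the `Q` and `F` that make this quantity as small as possible, subject to the constraints that rows of `Q` sum to one and entries of `F` lie in [0, 1].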
Here we instead use neural networks, whose architectures have begun to be explored for several other genetic structure tasks including haplotype segmentation, dimensionality reduction, and classification 26-35 (Supplementary section 'Related work').
An important caveat when using soft-clustering techniques, such as Neural ADMIXTURE or ADMIXTURE, is that these techniques follow a modeling assumption that there are some 'prototype' populations and that each individual can be placed within the convex hull of such prototypes. Note that this model might not reflect the underlying structure of real-world populations particularly when independent genetic drift has occurred in each population following admixture events. This limitation is particularly acute in the case of ancient admixture events, and in such cases, other complementary techniques should also be used. Future experiments to quantify these effects using simulations would be valuable. Combining unsupervised clustering with tree-based methods to account for this drift would also be a useful direction. This could complement the progress being made in ancestral recombination graphs.
Although the computational times of Neural ADMIXTURE enable practitioners to obtain rapid results with multiple hyperparameters and different values of K, properly selecting the best results still involves a subjective element, and additional experiments and new quantitative measures are needed. Further, unsupervised clustering methods, and more generally dimensionality-reduction techniques, are affected by sampling imbalances between population groups, which can alter population structure detection and prioritization 36,37 . Additionally, even if structure is not present within the data, these techniques can indicate otherwise 38 .

Single-head Neural ADMIXTURE
As described in the Discussion, the existing ADMIXTURE algorithm minimizes the negative log-likelihood:

$$\min_{Q,F}\ \mathcal{L}_C(Q,F) = -\sum_{i,j}\left[ n_{ij}\log\Big(\sum_k q_{ik} f_{kj}\Big) + (2 - n_{ij})\log\Big(1 - \sum_k q_{ik} f_{kj}\Big)\right] \quad \text{subject to } 0 \le f_{kj} \le 1,\ \sum_k q_{ik} = 1,\ q_{ik} \ge 0, \tag{1}$$

with $Q = (q_{ik})$ and $F = (f_{kj})$. This can be formulated as a non-negative matrix factorization problem. Let $X$ denote the training samples, where the features are the alternate allele normalized counts per position and the $j$th SNP of the $i$th individual is represented as $x_{ij} = n_{ij}/2 \in \{0, 0.5, 1\}$. Then, $X \approx QF$, where $Q$ is the assignments, $F$ is the alternate allele frequencies per SNP and population, and the negative log-likelihood in equation (1) is a distance between $X$ and $QF$. This can be translated into a neural network as an autoencoder with $Q = \Psi(X)$ being the bottleneck computed by the encoder function $\Psi$ and $F$ being the decoder weights themselves (Fig. 1a). Because $Q$ is estimated at every forward pass and not learnt as a whole for the training data, to retrieve $Q$ assignments on previously unseen data, we can perform a simple forward pass instead of running the optimization process with $F$ fixed, as ADMIXTURE must.
Note that the restrictions in the optimization problem (equation (1)) impose restrictions in the architecture. Those relating to $Q$ ($\sum_k q_{ik} = 1$ and $q_{ik} \ge 0$) can be enforced by applying a softmax activation at the encoder output, making the bottleneck equivalent to the cluster assignments. Although the decoder restriction ($0 \le f_{kj} \le 1$) could be enforced by applying the sigmoid function to the decoder weights, we found that it suffices to project the weights of the decoder to the interval [0, 1] after every optimization step, one of the most common forms of projected gradient descent 40 .
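The projection step can be illustrated with a minimal NumPy sketch: take an ordinary gradient step on the decoder weights, then clip them back into [0, 1]. The helper names (`bce_grad_wrt_F`, `projected_step`), the plain gradient-descent update, and the learning rate are assumptions made for illustration; the text does not specify which optimizer is used.

```python
import numpy as np

def bce_grad_wrt_F(X, Q, F, eps=1e-6):
    """Gradient of the summed binary cross-entropy between X and P = Q F with respect to F."""
    P = np.clip(Q @ F, eps, 1.0 - eps)
    dP = (P - X) / (P * (1.0 - P))   # d/dp of -[x log p + (1 - x) log(1 - p)]
    return Q.T @ dP                  # chain rule through P = Q F

def projected_step(F, grad_F, lr):
    """One gradient step followed by projection onto the box [0, 1]."""
    return np.clip(F - lr * grad_F, 0.0, 1.0)

rng = np.random.default_rng(4)
N, M, K = 10, 30, 3                         # toy sizes (assumed)
X = rng.integers(0, 3, size=(N, M)) / 2.0   # normalized genotype counts
Q = rng.dirichlet(np.ones(K), size=N)       # fixed assignments for this illustration
F = rng.uniform(0.0, 1.0, size=(K, M))

grad = bce_grad_wrt_F(X, Q, F)
F_new = projected_step(F, grad, lr=0.05)    # entries pushed outside [0, 1] are clipped back
```

Clipping is the Euclidean projection onto the box constraint, so the decoder weights remain valid frequencies after every update without reparametrizing them through a sigmoid.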
The decoder must be linear and cannot be followed by a nonlinearity, as this would break the interpretability of the $F$ matrix; the equivalence between the decoder weights and cluster centroids would be lost. On the other hand, the encoder architecture is free from constraints, and it may be composed of several layers. The proposed architecture includes a 64-dimensional, non-linear layer with a GELU activation before the bottleneck and batch normalization acting directly on the input. The latter re-scales the data to have zero mean and unit variance. Since the mean for each SNP is its frequency $p$, and the standard deviation is $\sigma = \sqrt{p(1-p)}$, the $\{0, 1\}$ input gets encoded as $\{-\sqrt{p/(1-p)},\ \sqrt{(1-p)/p}\}$, thereby supplying more explicitly the information of the allele frequencies to the network.
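The effect of standardizing a binary SNP column can be checked numerically. The sketch below is an illustration under simplifying assumptions: it uses a single hand-written column with frequency $p = 0.2$ and plain standardization, ignoring the small stabilizing constant and learnable scale and shift that batch normalization layers normally include.

```python
import numpy as np

p = 0.2
# Ten samples at one SNP; exactly two carry the variant, so the frequency is 0.2
column = np.array([1, 0, 0, 0, 0, 1, 0, 0, 0, 0], dtype=float)

mean = column.mean()    # equals p
std = column.std()      # population standard deviation, sqrt(p * (1 - p)) = 0.4
standardized = (column - mean) / std

low = -np.sqrt(p / (1 - p))    # value taken by inputs equal to 0
high = np.sqrt((1 - p) / p)    # value taken by inputs equal to 1
```

A rare variant (small $p$) therefore maps to a large positive value when present, so the standardized input carries the allele frequency implicitly.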
The ADMIXTURE model does not precisely reconstruct the input data as a regular autoencoder would do, because the input SNP genotype sequences, $n_{ij} \in \{0, 1, 2\}$, and the reconstructions, $p_{ij} \in [0, 1]$, do not have matching ranges. This can easily be remedied by dividing the genotype counts by two, so that the input data are $x_{ij} = n_{ij}/2 \in \{0, 0.5, 1\}$.
Moreover, instead of minimizing $\mathcal{L}_C$ (equation (1)), we propose minimizing the binary cross-entropy, with a penalty term on the Frobenius norm of the encoder weights, $\theta$ (equation (2)). This regularization term avoids hard assignments in the bottleneck, which helps during the training process and reduces overfitting. Equation (3) shows that the proposed optimization problem and the ADMIXTURE one are equivalent (excluding the regularization term), using equations (1) and (2).
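The equivalence follows from substituting $n_{ij} = 2x_{ij}$ into the binomial negative log-likelihood, which gives $\mathcal{L}_C = -2\sum_{i,j}\left[x_{ij}\log p_{ij} + (1 - x_{ij})\log(1 - p_{ij})\right]$, that is, twice the summed binary cross-entropy between $X$ and $QF$. The following numerical check is a minimal sketch of that identity on random toy data; the variable names are hypothetical, and the regularization term is left out, as in the text.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 6, 15, 3                        # toy sizes (assumed)
Q = rng.dirichlet(np.ones(K), size=N)
F = rng.uniform(0.05, 0.95, size=(K, M))
P = Q @ F                                 # p_ij, strictly inside (0, 1)
n = rng.binomial(2, P)                    # genotype counts in {0, 1, 2}
x = n / 2.0                               # normalized input in {0, 0.5, 1}

# ADMIXTURE objective (binomial negative log-likelihood without the constant term)
loss_admixture = -np.sum(n * np.log(P) + (2 - n) * np.log(1 - P))

# Binary cross-entropy between x and P, summed and averaged over all entries
bce_sum = -np.sum(x * np.log(P) + (1 - x) * np.log(1 - P))
bce_mean = bce_sum / (N * M)
```

Since the two objectives differ only by a positive constant factor (2 for the summed form, $2NM$ for the averaged form), they share the same minimizers in $Q$ and $F$.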
A perfect reconstruction can of course be obtained by setting the number of clusters (K) equal to the number of training samples or to the dimension of the input (number of SNPs). However, the bottleneck should ideally capture elementary information about the population structure of the given sequences; therefore, we make use of low-dimensional bottlenecks.

Multi-head Neural ADMIXTURE
In ADMIXTURE, cross-validation must be performed to choose the number of population clusters ($K$), unless specific prior information about the number of population ancestries is known. Furthermore, in many applications, practitioners desire to observe how cluster assignments change as the number of clusters increases. As the number of both sequenced individuals and variants increases, the feasible number of different cluster numbers that can be run for cross-validation rapidly decreases due to the additional computational cost. As a solution, Multi-head Neural ADMIXTURE allows all cluster numbers to be run simultaneously by taking advantage of the 64-dimensional latent representation computed by the encoder. This shared representation is jointly learnt for the different values of $K$, $\{K_1, \ldots, K_H\}$. Figure 1b shows how the shared representation is split into $H$ different heads in the multi-head architecture. The $i$th head consists of a non-linear projection to a $K_i$-dimensional vector, which corresponds to an assignment that assumes there are $K_i$ different genetic clusters in the data. Although every head could be concatenated and fed through a decoder, this would cause the decoder weights $F$ to not be interpretable. Therefore, every head needs to have its own decoder and, thus, $H$ different reconstructions of the input are retrieved.
As we have $H$ reconstructions, we now have $H$ different loss values. We can train this architecture by minimizing equation (4), where $Q_{K_h}$ and $F_{K_h}$ are, respectively, the cluster assignments and the SNP frequencies per population for the $h$th head. The restrictions of the ADMIXTURE optimization problem (equation (1)) must be satisfied by $Q_{K_h}$ and $F_{K_h}$ for every head. The multi-head architecture allows computation of $H$ different cluster assignments corresponding to $H$ different values for $K$, efficiently, in a single forward pass. Results can then be quantitatively and qualitatively analyzed by the practitioner to decide which value of $K$ is the most suitable for the data.
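A multi-head forward pass and a combined loss can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the combined objective is taken to be the unweighted sum of the per-head binary cross-entropies, the shared 64-dimensional representation is replaced by random values, and the names `multi_head_forward`, `head_weights`, and `decoders` are hypothetical.

```python
import numpy as np

def softmax(z):
    # row-wise softmax, shifted for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def bce(X, P, eps=1e-7):
    # mean binary cross-entropy between inputs X and reconstructions P
    P = np.clip(P, eps, 1.0 - eps)
    return -np.mean(X * np.log(P) + (1.0 - X) * np.log(1.0 - P))

def multi_head_forward(hidden, head_weights, decoders):
    """Apply every head to the shared representation; each head has its own decoder."""
    outputs = []
    for W2_h, F_h in zip(head_weights, decoders):
        Q_h = softmax(hidden @ W2_h)      # (N, K_h) assignments for this head
        outputs.append((Q_h, Q_h @ F_h))  # (N, M) reconstruction from this head's F
    return outputs

rng = np.random.default_rng(3)
N, M, HIDDEN = 8, 50, 64
Ks = [2, 3, 4]                                  # H = 3 heads with different K
X = rng.integers(0, 3, size=(N, M)) / 2.0
hidden = rng.normal(size=(N, HIDDEN))           # stand-in for the shared encoder output
head_weights = [rng.normal(scale=0.1, size=(HIDDEN, K)) for K in Ks]
decoders = [rng.uniform(size=(K, M)) for K in Ks]

outputs = multi_head_forward(hidden, head_weights, decoders)
total_loss = sum(bce(X, X_rec) for _, X_rec in outputs)  # assumed: unweighted sum over heads
```

Keeping one decoder per head is what preserves interpretability: each `F_h` has exactly `K_h` rows, one per cluster, so its rows can still be read as cluster centroids.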

Evaluation setup
Let $N$ denote the number of samples and $M$ the number of variants (SNPs). To assess the performance of the $Q$ estimates, we match the assignments with the known labels and report the RMSE between them (equation (5)), as well as the RMSE between the known allele frequencies ($F_{GT}$) and the estimated frequencies ($F$). We also use a new metric, $\Delta$, which is equivalent to the mean squared difference between the covariance matrices of the estimated and the target populations. If the $Q$ estimates completely agree with $Q_{GT}$ (up to permutation), $\Delta$ will be zero. The larger the disagreement, the higher the value of $\Delta$. We are interested in these metrics, as they are more easily interpreted than the loss function value itself. We are aware that these pseudo-supervised metrics, when applied to datasets simulated from real individuals, do not yield the true quality of the predictions of the models, since the biogeographic labels assigned to the real sequences used to simulate datasets might not reflect the true genomic clusters and variation within the populations. To further investigate this issue, we also used fully simulated population clusters to evaluate the methods.
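These metrics can be sketched in NumPy as follows. The exact normalizations and the matching procedure are assumptions: RMSE is taken as the root of the mean squared entry-wise difference, cluster matching is done by brute force over column permutations (feasible only for small $K$), and $\Delta$ is computed as the mean squared entry of $QQ^\top - Q_{GT}Q_{GT}^\top$. The function names are hypothetical.

```python
import itertools
import numpy as np

def rmse(A, B):
    # root mean squared entry-wise difference
    return np.sqrt(np.mean((A - B) ** 2))

def rmse_best_permutation(Q, Q_gt):
    """RMSE after matching estimated clusters to known labels (brute force over permutations)."""
    K = Q.shape[1]
    return min(rmse(Q[:, list(perm)], Q_gt)
               for perm in itertools.permutations(range(K)))

def delta(Q, Q_gt):
    """Mean squared difference between the N x N matrices Q Q^T and Q_gt Q_gt^T."""
    N = Q.shape[0]
    diff = Q @ Q.T - Q_gt @ Q_gt.T
    return np.sum(diff ** 2) / N**2

Q_gt = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.5, 0.5, 0.0]])
Q_perm = Q_gt[:, [2, 0, 1]]          # same clustering with relabeled clusters
Q_flat = np.full((4, 3), 1.0 / 3.0)  # uninformative assignments

rmse_perm = rmse_best_permutation(Q_perm, Q_gt)
delta_perm = delta(Q_perm, Q_gt)
delta_flat = delta(Q_flat, Q_gt)
```

Because $QQ^\top$ is unchanged when the columns of $Q$ are permuted, `delta` needs no cluster matching, whereas the RMSE does.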
Dataset preparation. For reproducibility we have used a comprehensive set of publicly available, labeled human whole-genome sequences from diverse populations across the world, combining the 1000 Genomes Project 41 , the Simons Genome Diversity Project 42 , and the Human Genome Diversity Project 43 , as well as data simulated from these samples using PyAdmix 14 and data simulated de novo using the Balding-Nichols Pritchard-Stephens-Donnelly model 8,23 . The populations within the combined real datasets can be found in Supplementary Table 2. Each subpopulation is aggregated into a continental-level label according to its geographical location (Supplementary section 'Dataset description'). Additionally, we used the entire UK Biobank genotype dataset.
Benchmarking setup. We compared Neural ADMIXTURE computational time and clustering quality with ADMIXTURE, fastSTRUCTURE 24 , AlStructure 22 , and TeraStructure 23 . fastSTRUCTURE assumes the STRUCTURE model but uses accelerated variational methods instead of MCMC, yielding speedups of more than two orders of magnitude over STRUCTURE. TeraStructure iteratively computes Q and F while avoiding a high computational load by subsampling SNPs at every iteration, which makes the algorithm faster. AlStructure first estimates a low-dimensional linear subspace of the admixture components and then searches for a model in the latter subspace that satisfies the modeling constraints, yielding a fast alternative to the iterative or maximum likelihood schemes followed by most algorithms. Furthermore, we also compared against HaploNet 26 , a variational autoencoder that maps parts of the sequence (windows) to a low-dimensional latent space, on which clustering is then performed using Gaussian mixture priors. Although the global structure of the data is preserved in the low-dimensional space, direct interpretability of the allele frequencies (available in Neural ADMIXTURE) is not preserved. All models were optimized using 16 threads on a 64-core AMD EPYC 7742 (x86_64) processor with 512 GB of RAM. We restricted the number of threads to 16, even though more cores were available, so that several executions could run in parallel. To assess GPU performance of Neural ADMIXTURE, all networks were trained on an NVIDIA Tesla V100 SXM2 with 32 GB of memory. The same GPUs were used to run inference on the trained models.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
The samples used in the 'Experiments' section were compiled from public datasets: 1000 Genomes Project (https://www.internationalgenome.org/data/) 41 , the Simons Genome Diversity Project (https://www.simonsfoundation.org/simons-genome-diversity-project/) 42 , and the Human Genome Diversity Project (https://www.internationalgenome.org/data-portal/data-collection/hgdp) 43 . The compiled datasets (All-Chms, Chm-22 and Chm-22-Sim) are available on figshare 44 . The UK Biobank has approval from the North West Multi-centre Research Ethics Committee as a Research Tissue Bank. This dataset is available to researchers through an open application via https://www.ukbiobank.ac.uk/register-apply/. The entire dataset of genotypes available to download from the UK Biobank portal was used. Source data are provided with this paper.

Code availability
The software is available as an installable package in the PyPI repository under the name 'neural-admixture'. The source code can be found at https://github.com/ai-sandbox/neural-admixture (ref. 45).

Extended Data Fig. 2 (caption text): These results reflect the genetic similarity between the respective groups due to their Out-of-Africa migration patterns and subsequent gene flow. After increasing to K=5, OCE obtains its own cluster, reflecting the ancient divergence from the others of that population, consisting in our study of the Australo-Papuan groups: Native Australian (SGDP), Papuan Highlands (HGDP), Papuan Sepik (HGDP), Bougainville (HGDP), and Dusun (HGDP). As more clusters are incorporated, American (AMR) and EAS obtain their own clusters and OCE is divided between a component found predominantly in OCE and a component characteristic of EAS. The latter likely reflects the later migration of Austronesian speakers from East Asia out into the Pacific Islands, where they contributed their ancestry to the Oceanian inhabitants. A shared component between EUR, SAS and WAS is maintained, independent of the cluster number K. This could be linked to early farmer expansions out of West Asia and into both Europe and South Asia following the birth of agriculture, or to the much later expansion of the Indo-European language family across all of these regions. Other genetic exchanges between these neighboring regions doubtlessly played a role. With a sufficiently high number of clusters, a shared component between WAS and some AFR populations appears, perhaps reflecting North African gene flow.

Fig. 3 (caption text): K=6. Because the majority of the dataset is composed of individuals with white British ancestry, we only plot the cluster assignments of individuals that reported a country of birth outside the British and Irish Isles. We can observe that K=2 approximately divides samples between European and non-European populations. With K=3, European, South-and-East Asian, and African ancestry clusters emerge. When K=4, a fine-grained clustering emerges dividing East and South Asian populations. K=5 adds a fifth cluster shared in common (with different proportions) between Southern European (Mediterranean) and West Asian (Near Eastern) populations. Finally, K=6 seems to introduce a cluster mostly present in Northern and Eastern European populations.

Corresponding author(s): Alexander Ioannidis
Last updated by author(s): 05/11/2023


Software and code
Data collection: No software was used for data collection.

Data analysis
The software is available as an installable package in the PyPI repository under the name neural-admixture. The source code from this paper is available from the address https://github.com/ai-sandbox/neural-admixture listed in the paper. In addition version 1.

Data
The datasets used in the Experiments section of this article have been compiled from the publicly available 1000 Genomes Project (https://www.internationalgenome.org/data/), the Simons Genome Diversity Project (https://www.simonsfoundation.org/simons-genome-diversity-project/), and the