Main

The rapid growth in sequenced human genomes and the proliferation of population-scale biobanks have enabled the creation of increasingly accurate models to predict traits and disease risk using an individual’s genome. However, different predictive models can be required depending on an individual’s genetic ancestry, and this necessitates accurately characterizing genetic cluster composition at the individual level1. Such characterization is also an essential part of most modern population genetics studies and national biobanking efforts2,3. Yet many existing algorithms for this task struggle with next-generation sequencing datasets, where both the number of samples and the number of sequenced positions along the genome are much greater than in earlier case–control genotyping studies. Scalable algorithms for characterizing the population structure of genetic sequences are especially important for more diverse biobanks, which are themselves needed to correct the extreme imbalance towards European-descent samples in existing studies; omitting most of the world’s population from precision health research risks creating a new divide in healthcare4.

A common approach to characterizing the population structure within a genetic dataset is to describe each sample as a set of fractional assignments to each cluster. These clusters are centroids found via an unsupervised algorithm in a space spanned by the frequencies of each variant. By avoiding the culture-specific labels and subjective constructs (for example, ethnicity) of supervised classification methods5, these unsupervised approaches can better reflect the spectrum of genetic structure across samples. Generally, the input variants are the individual’s sequence of single nucleotide polymorphisms (SNPs), that is, single positions along the genome known to vary between individuals. Smaller datasets with fewer variants, such as microsatellites, have also been used. There are millions of SNPs in the human genome and most are biallelic (two variants), permitting a binary encoding. For instance, zero could be used to encode the most common (or reference) variant at an SNP position and one to encode the minority (or alternate) variant. The frequency distribution of these variants varies between populations owing to their differing histories: founder events, migration, isolation, and drift.
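To make the encoding concrete, the following toy sketch (all values invented) encodes two small populations of three haplotypes each and computes their per-position alternate-variant frequencies, whose divergence is exactly what clustering methods exploit:

```python
import numpy as np

# Toy binary SNP encoding: 0 = reference variant, 1 = alternate variant,
# for five SNP positions across three haplotypes per population.
population_a = np.array([[0, 1, 0, 0, 1],
                         [0, 1, 1, 0, 1],
                         [0, 1, 0, 0, 0]])
population_b = np.array([[1, 0, 0, 1, 1],
                         [1, 0, 1, 1, 0],
                         [1, 1, 0, 1, 0]])

# The alternate-variant frequency at each position differs between the
# two populations, reflecting their differing histories.
print(population_a.mean(axis=0))  # approx. [0.   1.   0.33 0.   0.67]
print(population_b.mean(axis=0))  # approx. [1.   0.33 0.33 1.   0.33]
```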

We present an autoencoder that expands on the clustering method for genomes: ADMIXTURE6,7. ADMIXTURE was developed as a computationally efficient alternative to STRUCTURE8, and we now take this pursuit of efficiency to the next generation of datasets. Our proposed method, Neural ADMIXTURE, follows the same modeling assumptions as ADMIXTURE, but reframes the task as a neural-network-based autoencoder, providing faster computational times on both graphics processing units (GPUs) and central processing units (CPUs) while maintaining high-quality assignments.

Results

Model overview

Neural ADMIXTURE (Fig. 1a) is an interpretable autoencoder with two main components: (1) an encoder, composed of two linear layers with a Gaussian error linear unit (GELU) activation9 in between, followed by a softmax activation, which projects a genotype sequence onto a vector of fractional ancestry assignments for each individual (Q); and (2) a decoder, a single linear layer whose weights are restricted to lie between 0 and 1, yielding an interpretable projection matrix that learns the cluster centroids or, equivalently, the average variant frequency at each site for each population (F). Additionally, we introduce Multi-head Neural ADMIXTURE (Fig. 1b), which includes multiple decoders in a single network to obtain results analogous to training ADMIXTURE repeatedly for different numbers of clusters, but requiring only a single training run for all desired numbers of clusters.
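A minimal PyTorch sketch of this architecture follows; it is our illustrative re-implementation (layer and variable names are ours), not the authors’ released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn  # aliased to avoid clashing with the F matrix

class NeuralAdmixtureSketch(nn.Module):
    """Single-head sketch of Fig. 1a: batch normalization on the input,
    a 64-dimensional GELU layer, a softmax bottleneck of size K, and a
    linear decoder whose weights play the role of the F matrix."""
    def __init__(self, num_snps: int, k: int):
        super().__init__()
        self.batch_norm = nn.BatchNorm1d(num_snps)
        self.hidden = nn.Linear(num_snps, 64)              # theta_1
        self.assign = nn.Linear(64, k)                     # theta_2
        self.decoder = nn.Linear(k, num_snps, bias=False)  # weight.T plays the role of F

    def forward(self, x):
        h = F_nn.gelu(self.hidden(self.batch_norm(x)))
        q = torch.softmax(self.assign(h), dim=1)  # fractional assignments Q
        return self.decoder(q), q                 # reconstruction and Q
```

The [0, 1] constraint on the decoder weights is enforced separately during training, as described in the Methods.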

Fig. 1: Neural ADMIXTURE model architecture.
figure 1

a, Single-head architecture. The input sequence (x) is projected into 64 dimensions using a linear layer (θ1) and processed by a GELU non-linearity (σ1). The cluster assignment estimates Q are computed by feeding the 64-dimensional sequence to a K-neuron layer (parametrized by θ2) activated with a softmax (σ2). Finally, the decoder outputs a reconstruction of the input (\(\tilde{x}\)) using a linear layer with weights F. Note that the decoder is restricted to this linear architecture to ensure interpretability. b, Simple multi-head example with H = 3. The 64-dimensional hidden vector is copied and processed independently by different sets of weights (\({\theta }_{{2}_{h}}\)), which yield vectors of different dimensions, corresponding to the different K values. Each different \({Q}_{{K}_{h}}\) matrix is processed independently by different decoder matrices \({F}_{{K}_{h}}\) yielding H different reconstructions. All parameters are optimized jointly in an end-to-end fashion.

Neural ADMIXTURE was trained with a standard binary cross-entropy, leading to an equivalence with the traditional ADMIXTURE model’s objective function (Methods). Two initialization techniques, one based on principal component analysis10,11,12 and the other on archetypal analysis13, were used as an alternative to common network initializations to speed up the training process and improve results (Supplementary section ‘Decoder initialization’). Furthermore, two mechanisms are available to incorporate prior knowledge about the amount of admixture in a dataset by controlling the softness of the cluster assignments: applying L2 regularization during training (Methods) and softmax tempering (Supplementary section ‘Softmax tempering’). Both single-head and multi-head approaches can be adapted to a supervised version that performs regular classification given known training labels (Supplementary section ‘Supervised training’). The proposed method is fully compatible with the original ADMIXTURE framework, allowing the use of ADMIXTURE results as an initialization for Neural ADMIXTURE parameters (Supplementary section ‘Pretrained mode’), and vice versa. We performed an in-depth evaluation of the proposed method and compared it with competing approaches across multiple datasets, including simulations from a variety of systems14,15,16,17 and samples from large-scale, real-world biobanks (Methods, Supplementary Table 1, Supplementary Table 2, and Supplementary section ‘Dataset description’).
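As an illustration of softmax tempering, one common formulation, which we assume here (the method’s exact variant is described in the Supplementary section), divides the bottleneck logits by a temperature τ before the softmax:

```python
import torch

def tempered_softmax(logits: torch.Tensor, tau: float = 1.5) -> torch.Tensor:
    # Dividing the logits by tau > 1 softens the cluster assignments
    # (more admixture); tau < 1 sharpens them towards hard assignments.
    return torch.softmax(logits / tau, dim=-1)
```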

Single-head and multi-head results

Neural ADMIXTURE is systematically faster than alternative algorithms, both on CPU and GPU (Table 1, Supplementary Fig. 1). This speedup is further enhanced when using the Multi-head Neural ADMIXTURE architecture, which can perform clusterings for different K values simultaneously. For example, on the All-Chms dataset, Neural ADMIXTURE trained in less than 2 min, whereas ADMIXTURE required more than a day. Neural ADMIXTURE performs at least as well as existing algorithms at predicting both the ancestry assignments (Q) and the allele frequencies (F). On average, Neural ADMIXTURE’s Q estimates appear to be more similar to the matrix of known labels than the Q estimates from previous methods (Extended Data Fig. 1).

Table 1 Performance comparison of several global ancestry inference algorithms

Table 2 shows the accuracy and time performance of ADMIXTURE and Neural ADMIXTURE on the test data for three different datasets. Both ADMIXTURE and Neural ADMIXTURE are able to generalize and produce consistent assignments on unseen data. However, Neural ADMIXTURE is much faster than ADMIXTURE on both CPU and GPU, because ADMIXTURE must optimize the objective with a fixed F to find Q for unseen data, whereas Neural ADMIXTURE directly learns a function that estimates Q. We note that inference on GPU is extremely fast (generally less than a second for a forward pass); the computational bottleneck comes simply from reading and preprocessing the data, a step that could be further optimized.

Table 2 Performance comparison of ADMIXTURE and Neural ADMIXTURE on test data

We visualized the Q estimates of ADMIXTURE and Neural ADMIXTURE on the Chm-22-Sim dataset using pong18 (Fig. 2a–d). The SNP frequencies (the entries of the F matrix) from both models can be observed as projections onto the first two principal components of the training data (Fig. 2e). Neural ADMIXTURE provides harder cluster predictions, with many samples being assigned to only a single population, whereas ADMIXTURE provides softer cluster predictions with partial assignments to multiple clusters. On this dataset, ADMIXTURE does not assign different clusters to Native Americans (AMR) and East Asians (EAS); instead, it partitions Africans (AFR) into two different ancestry clusters (Fig. 2a,b). Neural ADMIXTURE, however, does split the AMR and EAS populations (Fig. 2c–e). Depictions of the cluster assignments (Q) of all algorithms on several datasets can be found in Supplementary Figs. 2–5.

Fig. 2: Visualization of several results of ADMIXTURE and Neural ADMIXTURE trained on the dataset Chm-22-Sim (K = 7).
figure 2

a, Q estimates of ADMIXTURE on training data. b, Q estimates of ADMIXTURE on test data. c, Q estimates of Neural ADMIXTURE on training data. d, Q estimates of Neural ADMIXTURE on test data. e, Two-dimensional principal component analysis (PCA) projection of the training data and the matrix F learnt by both ADMIXTURE and Neural ADMIXTURE, which correspond to the cluster centroids. The color of each individual in the PCA represents its ground-truth regional label. f, Q estimates of Neural ADMIXTURE on admixed populations not present in the training data. Among the MXL samples, we observe mainly an orange AMR component along with red and yellow components (West Asians (WAS) and Europeans (EUR), respectively). These latter components likely originate from the immigration of Spanish, Morisco, and Sephardic Jewish individuals into Mexico during the colonial period. The PUR samples exhibit EUR, WAS, AMR, and AFR ancestry clusters. The additional AFR component is likely linked to the introduction of enslaved West Africans during the colonial period. In the barplots (used to visualize Q), each vertical bar represents an individual sample and the colored bar lengths represent the proportion of the sample’s ancestry assigned to that colored cluster. OCE, Oceanians; SAS, South Asians.

Source data

We applied Neural ADMIXTURE, trained on Chm-22-Sim, to admixed populations that were not present in the training data: Mexican Ancestry in Los Angeles, California (MXL, n = 118), and Puerto Ricans in Puerto Rico (PUR, n = 104) (Fig. 2f).

We evaluated Multi-head Neural ADMIXTURE with Chm-22-Sim (Extended Data Fig. 2) and showed that as the number of clusters increases, each population group gets assigned its own cluster. Furthermore, we showed that Multi-head Neural ADMIXTURE can be successfully applied to closely related populations (Extended Data Fig. 3). Finally, we showed that the proposed method can be applied on real, admixed datasets (Extended Data Fig. 4).

UK Biobank computational analysis

To assess clustering speed on a very large dataset, we ran Neural ADMIXTURE in its multi-head mode on the entire UK Biobank (a total of 488,377 samples), using 147,604 SNPs obtained by pruning the full set to remove linkage disequilibrium (LD)19. Neural ADMIXTURE was able to process the complete dataset within 11 h, providing results from K = 2 to K = 6, whereas ADMIXTURE would take about a month to do the same, given that it took 5.5 days to provide results for K = 2 alone. Traditional techniques such as ADMIXTURE are thus too slow for such large biobanks, particularly because a study generally requires multiple additional runs with different parameters and subsets of the data. Neural ADMIXTURE was trained without regularization (λ = 0, Methods) and using the PCK-means initialization (Supplementary Algorithm 1). During inference, the temperature was set to \(\tau =\frac{3}{2}\) (Supplementary section ‘Softmax tempering’). Figure 3 displays these cluster assignments for the UK Biobank genomes. We grouped the individuals by their reported country of birth; those with missing or non-existent country-of-birth labels were excluded from the plots.

Fig. 3: Q fractional genetic cluster estimates across the entire UK Biobank dataset (N = 488,377) obtained using Multi-head Neural ADMIXTURE (K = 6 displayed).
figure 3

Although results are only displayed for K = 6, the multi-head architecture was trained for K = 2 to K = 6 simultaneously in approximately 11 h. In the barplots (used to visualize Q), each vertical bar represents an individual sample and stacked bar color heights represent the proportion of the sample’s ancestry assigned to that colored genetic cluster. Since they result from unsupervised clustering, interpretation of the cluster colors is left open. a, Q estimates of all the samples. Although many samples are clustered together (blue cluster, representing a northern European/British ancestry component), other clusters emerge, reflecting the diverse modern populations now living within the United Kingdom. b, Q estimates of individuals born in the British and Irish Isles and territories. Samples from Gibraltar and the Channel Islands are excluded as they contain a very small number of individuals. c, Q estimates for individuals born outside of the British and Irish Isles, labeled by their country or region of birth, showcasing clusters representing Africans, East Asians, South Asians, Northern Europeans, and West Asians (sharing a cluster in part with Southern Europeans). Despite the large ancestry imbalance, Neural ADMIXTURE characterizes the globally diverse genetic variation found in the UK Biobank. Many UK residents born in other countries appear to have northern European (British) ancestry. These likely represent children born abroad to British parents, who later repatriated. We also note a sizeable South-Asian-like genetic ancestry cluster seen in many individuals born in East Africa. This likely stems from the decolonization-era exodus out of East Africa of South Asians, who had settled there during the British Empire. The predicted cluster assignments for K = 2 to K = 6 for individuals born outside of the British and Irish Isles can be found in Extended Data Fig. 5.

Source data

Scalability analysis

To assess the scalability of different methods, we simulated multiple datasets with various numbers of variants and samples using the software reported previously17. The datasets consist of combinations of N ∈ {1,000, 5,000, 10,000, 20,000, 50,000} and M ∈ {1,000, 10,000, 50,000, 100,000}, where N and M are the number of samples and SNPs, respectively.

We compared the training times of ADMIXTURE, AlStructure, TeraStructure, and Neural ADMIXTURE, on both CPU and GPU, across different dataset sizes (Fig. 4). Neural ADMIXTURE is consistently faster than the alternatives. Moreover, unlike the other methods, Neural ADMIXTURE benefits substantially from GPU acceleration. The hyperparameters used are described in Supplementary Table 3.

Fig. 4: Evolution of execution time when increasing number of samples and number of variants.
figure 4

Neural ADMIXTURE has clearly faster execution times than the other benchmarked methods on both CPU and GPU. AlStructure results are not reported for the 50,000-sample datasets because its execution times are prohibitively slow.

Source data

Discussion

Many unsupervised clustering methods for genotype sequences have been introduced8,20,21,22,23,24,25, including the most commonly used, ADMIXTURE6,7. These methods, which resemble a non-negative matrix factorization, decompose each input sequence into a set of cluster assignments and compute a centroid for each cluster. The cluster assignments give the proportion of each genetic ancestry cluster for an individual, whereas the cluster centroids give the SNP variant frequencies at each genomic position for each cluster. Humans are diploid organisms, so most individuals have a paternal and a maternal copy of each non-sex chromosome. Therefore, for a given individual at each genomic position, there are four possible combinations of biallelic SNPs (0/0, 0/1, 1/0, 1/1). It is common practice to sum the maternal and paternal variants, obtaining a count sequence in which individual i has nij ∈ {0, 1, 2} copies of the minority variant at SNP j. ADMIXTURE models each individual’s count sequence, given a fixed number of population groups K, as nij ~ Bin(2, pij), where pij = ∑k qik fkj, with qik denoting the fraction of population k assigned to individual i, and fkj denoting the frequency of the ‘1’ variant at SNP j in population k. ADMIXTURE applies block relaxation to find the parameters Q and F that minimize the negative log-likelihood function shown in equation (1). The value of K (the number of clusters) is typically chosen with an ad hoc cross-validation procedure7, necessitating runs across a range of values.
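As a concrete sketch, the negative log-likelihood under this binomial model can be written in a few lines of NumPy (constant binomial coefficients are omitted, as in equation (1) of the Methods; the clipping constant is ours, added for numerical stability):

```python
import numpy as np

def admixture_nll(n, Q, F):
    """Negative log-likelihood of the ADMIXTURE model.

    n: (N, M) alternate-allele counts in {0, 1, 2}
    Q: (N, K) fractional assignments, rows summing to 1
    F: (K, M) per-cluster alternate-allele frequencies in [0, 1]
    """
    P = np.clip(Q @ F, 1e-10, 1 - 1e-10)  # p_ij = sum_k q_ik * f_kj
    return -np.sum(n * np.log(P) + (2 - n) * np.log(1 - P))
```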

The block relaxation optimization in ADMIXTURE runs much faster than the approaches used by its main competitors, namely FRAPPE21 and STRUCTURE8. Although ADMIXTURE can be run in multi-threaded mode, greatly reducing execution time, this remains insufficient when dealing with either a large number of samples or a large number of SNPs. Here we instead use neural networks, whose architectures have begun to be explored for several other genetic structure tasks, including haplotype segmentation, dimensionality reduction, and classification26,27,28,29,30,31,32,33,34,35 (Supplementary section ‘Related work’).

An important caveat when using soft-clustering techniques such as Neural ADMIXTURE or ADMIXTURE is that they follow a modeling assumption that there exist some ‘prototype’ populations and that each individual can be placed within the convex hull of those prototypes. Note that this model might not reflect the underlying structure of real-world populations, particularly when independent genetic drift has occurred in each population following admixture events. This limitation is particularly acute in the case of ancient admixture events, and in such cases other complementary techniques should also be used. Future experiments quantifying these effects using simulations would be valuable. Combining unsupervised clustering with tree-based methods to account for this drift would also be a useful direction, and could complement the progress being made in ancestral recombination graphs.

Although the computational times of Neural ADMIXTURE enable practitioners to obtain rapid results with multiple hyperparameters and different values of K, properly selecting the best results still involves a subjective element, and additional experiments and new quantitative measures are needed. Further, unsupervised clustering methods, and more generally dimensionality-reduction techniques, are affected by sampling imbalances between population groups, which can alter population structure detection and prioritization36,37. Additionally, even if structure is not present within the data, these techniques can indicate otherwise38,39.

Methods

Single-head Neural ADMIXTURE

As described in the Discussion, the existing ADMIXTURE algorithm minimizes the negative log-likelihood:

$$\begin{array}{ll}\mathop{\min}\limits_{Q,F} & {\mathcal{L}}_{\mathrm{C}}(Q,F)=-\mathop{\sum}\limits_{i,j}{n}_{ij}\log\left(\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)+(2-{n}_{ij})\log\left(1-\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)\\ \text{subject to} & 0\le {f}_{kj}\le 1\\ & \mathop{\sum}\limits_{k}{q}_{ik}=1\\ & {q}_{ik}\ge 0\end{array}$$
(1)

with Q = (qik) and F = (fkj).

This can be formulated as a non-negative matrix factorization problem. Let X denote the training samples, where the features are the normalized alternate-allele counts per position, so that the jth SNP of the ith individual is represented as \({x}_{ij}=\frac{{n}_{ij}}{2}\in \{0,0.5,1\}\). Then X ≈ QF, where Q contains the cluster assignments, F contains the alternate-allele frequencies per SNP and population, and the negative log-likelihood in equation (1) is a distance between X and QF. This translates into a neural network as an autoencoder with Q = Ψ(X) being the bottleneck computed by the encoder function Ψ and F being the decoder weights themselves (Fig. 1a). Because Q is estimated at every forward pass rather than learnt as a fixed matrix for the training data, Q assignments for previously unseen data can be retrieved with a simple forward pass, whereas ADMIXTURE must re-run the optimization process with F fixed.
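With the illustrative module sketched in the ‘Model overview’ section, inference on unseen data is therefore a single forward pass (x_test is a hypothetical tensor of shape (number of test samples, M) with values in {0, 0.5, 1}):

```python
model.eval()                   # the single-head sketch defined earlier
with torch.no_grad():
    _, q_test = model(x_test)  # fractional assignments for unseen samples
```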

Note that the restrictions in the optimization problem (equation (1)) impose restrictions on the architecture. Those relating to Q (∑k qik = 1 and qik ≥ 0) can be enforced by applying a softmax activation at the encoder output, making the bottleneck equivalent to the cluster assignments. Although the decoder restriction (0 ≤ fkj ≤ 1) could be enforced by applying the sigmoid function to the decoder weights, we found that it suffices to project the weights of the decoder onto the interval [0, 1] after every optimization step, one of the most common forms of projected gradient descent40.
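A minimal sketch of a training loop with this projection step (the optimizer choice, learning rate, loss function, and data loader are illustrative):

```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for x_batch in loader:              # `loader` yields (batch_size, M) tensors
    optimizer.zero_grad()
    x_rec, q = model(x_batch)
    loss = loss_fn(x_rec, x_batch)  # e.g. the loss in equation (2)
    loss.backward()
    optimizer.step()
    with torch.no_grad():           # projection: keep the F weights in [0, 1]
        model.decoder.weight.clamp_(0.0, 1.0)
```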

The decoder must be linear and cannot be followed by a non-linearity, as this would break the interpretability of the F matrix: the equivalence between the decoder weights and the cluster centroids would be lost. The encoder architecture, on the other hand, is free from constraints and may be composed of several layers. The proposed architecture includes a 64-dimensional non-linear layer with a GELU activation before the bottleneck, and batch normalization acting directly on the input. The latter rescales the data to have zero mean and unit variance. Since the mean of each SNP is its frequency p, and its standard deviation σ is \(\sqrt{p(1-p)}\), the {0, 1} input gets encoded as \(\left\{-\sqrt{\frac{p}{1-p}},\sqrt{\frac{1-p}{p}}\right\}\), thereby supplying the allele frequency information more explicitly to the network.
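A quick numerical check of this standardization (p = 0.2 is an arbitrary toy frequency):

```python
import numpy as np

p = 0.2
x = np.random.binomial(1, p, size=100_000).astype(float)
z = (x - x.mean()) / x.std()
print(np.unique(np.round(z, 2)))                    # approx. [-0.5, 2.0]
print(-np.sqrt(p / (1 - p)), np.sqrt((1 - p) / p))  # exactly -0.5 and 2.0
```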

The ADMIXTURE model does not precisely reconstruct the input data as a regular autoencoder would, because the input SNP genotype sequences, nij ∈ {0, 1, 2}, and the reconstructions, pij ∈ [0, 1], do not have matching ranges. This can easily be remedied by dividing the genotype counts by two, so that the input data are \({x}_{ij}=\frac{{n}_{ij}}{2}\in \{0,0.5,1\}\). Moreover, instead of minimizing \({{{{\mathcal{L}}}}}_\mathrm{C}\) (equation (1)), we propose minimizing the binary cross-entropy with a penalty term on the Frobenius norm of the encoder weights, θ:

$${\mathcal{L}}_{\mathrm{N}}(Q,F)=-\mathop{\sum}\limits_{i,j}{x}_{ij}\log\left(\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)+(1-{x}_{ij})\log\left(1-\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)+\lambda{\Vert\theta\Vert}_{F}^{2}.$$
(2)

This regularization term avoids hard assignments in the bottleneck, which helps during the training process and reduces overfitting. Using equations (1) and (2), we show in equation (3) that the proposed optimization problem is equivalent to that of ADMIXTURE (excluding the regularization term):

$$\begin{array}{rl}{\mathcal{L}}_{\mathrm{N}}^{\lambda=0}(Q,F)&=-\mathop{\sum}\limits_{i,j}{x}_{ij}\log\left(\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)+(1-{x}_{ij})\log\left(1-\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)\\ &=-\mathop{\sum}\limits_{i,j}\frac{{n}_{ij}}{2}\log\left(\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)+\left(1-\frac{{n}_{ij}}{2}\right)\log\left(1-\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)\\ &=-\frac{1}{2}\mathop{\sum}\limits_{i,j}{n}_{ij}\log\left(\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)+(2-{n}_{ij})\log\left(1-\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)\\ &=\frac{1}{2}{\mathcal{L}}_{\mathrm{C}}(Q,F).\end{array}$$
(3)
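A minimal PyTorch sketch of the loss in equation (2) (the function name and default λ value are ours):

```python
import torch.nn.functional as F_nn

def neural_admixture_loss(x_rec, x, encoder_weights, lam=0.01):
    """Binary cross-entropy plus an L2 penalty on the encoder weights.

    x and x_rec are (N, M) tensors with values in [0, 1];
    encoder_weights is an iterable of encoder weight tensors.
    """
    bce = F_nn.binary_cross_entropy(x_rec, x, reduction='sum')
    penalty = sum(w.pow(2).sum() for w in encoder_weights)  # squared Frobenius norms
    return bce + lam * penalty
```

In the single-head sketch above, encoder_weights would be, for example, [model.hidden.weight, model.assign.weight].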

A perfect reconstruction can of course be obtained by setting the number of clusters (K) equal to the number of training samples or to the dimension of the input (number of SNPs). However, the bottleneck should ideally capture elementary information about the population structure of the given sequences; therefore, we make use of low-dimensional bottlenecks.

Multi-head Neural ADMIXTURE

In ADMIXTURE, cross-validation must be performed to choose the number of population clusters (K), unless specific prior information about the number of population ancestries is known. Furthermore, in many applications, practitioners want to observe how cluster assignments change as the number of clusters increases. As the numbers of sequenced individuals and variants grow, the number of different values of K that can feasibly be evaluated by cross-validation rapidly shrinks because of the additional computational cost. As a solution, Multi-head Neural ADMIXTURE runs all cluster numbers simultaneously by taking advantage of the 64-dimensional latent representation computed by the encoder. This shared representation is jointly learnt for the different values of K, {K1, …, KH}.

Figure 1b shows how the shared representation is split into H different heads in the multi-head architecture. The ith head consists of a non-linear projection to a Ki-dimensional vector, corresponding to an assignment that assumes there are Ki different genetic clusters in the data. Although the head outputs could be concatenated and fed through a single decoder, this would render the decoder weights F uninterpretable. Therefore, every head has its own decoder and, thus, H different reconstructions of the input are retrieved.

With H reconstructions, we obtain H different loss values and train the architecture by minimizing equation (4):

$${\mathcal{L}}_{\mathrm{MNA}}({Q}_{{K}_{1,\ldots,H}},{F}_{{K}_{1,\ldots,H}})=\mathop{\sum}\limits_{h=1}^{H}{\mathcal{L}}_{\mathrm{N}}\left({Q}_{{K}_{h}},{F}_{{K}_{h}}\right),$$
(4)

where \({Q}_{{K}_{h}}\) and \({F}_{{K}_{h}}\) are, respectively, the cluster assignments and the SNP frequencies per population for the hth head. The restrictions of the ADMIXTURE optimization problem (equation (1)) must be satisfied by \({Q}_{{K}_{h}}\) and \({F}_{{K}_{h}}\) for all \(h\in \{1,\ldots ,H\}\).

The multi-head architecture allows H different cluster assignments, corresponding to H different values of K, to be computed efficiently in a single forward pass. Results can then be quantitatively and qualitatively analyzed by the practitioner to decide which value of K is the most suitable for the data.
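A minimal sketch of the multi-head variant (ours, following Fig. 1b; the set of K values is illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadSketch(nn.Module):
    """One assignment head and one decoder per value of K, all sharing
    the 64-dimensional representation computed by the encoder."""
    def __init__(self, num_snps: int, ks=(2, 3, 4, 5, 6)):
        super().__init__()
        self.batch_norm = nn.BatchNorm1d(num_snps)
        self.hidden = nn.Linear(num_snps, 64)
        self.heads = nn.ModuleList(nn.Linear(64, k) for k in ks)
        self.decoders = nn.ModuleList(nn.Linear(k, num_snps, bias=False) for k in ks)

    def forward(self, x):
        h = torch.nn.functional.gelu(self.hidden(self.batch_norm(x)))
        qs = [torch.softmax(head(h), dim=1) for head in self.heads]
        recs = [dec(q) for dec, q in zip(self.decoders, qs)]
        return recs, qs  # the loss of equation (4) sums the per-head losses
```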

Evaluation setup

Let N denote the number of samples and M the number of variants (SNPs). To assess the performance of the Q estimates, we match the estimated clusters to the known labels and report the RMSE between them,

$${{{\rm{RMSE}}}}\left(Q,{Q}_\mathrm{GT}\right)=\frac{1}{\sqrt{NK}}{\left\Vert Q-{Q}_\mathrm{GT}\right\Vert}_{F}$$
(5)

and the RMSE between the known allele frequencies (FGT) and the estimated frequencies (F),

$${\mathrm{RMSE}}(F,{F}_{\mathrm{GT}})=\frac{1}{\sqrt{KM}}{\left\Vert F-{F}_{\mathrm{GT}}\right\Vert}_{F}.$$
(6)

We also use a new metric, Δ, defined as

$${{\Delta }}(Q,{Q}_\mathrm{GT})=\frac{1}{{N}^{2}}{\left\Vert Q{Q}^\mathrm{T}-{Q}_\mathrm{GT}{Q}_\mathrm{GT}^\mathrm{T}\right\Vert}_{F}^{2},$$
(7)

which is equivalent to the mean squared difference between the covariance matrices of the estimated and the target populations. If the Q estimates completely agree with QGT (up to permutation), Δ will be zero; the larger the disagreement, the higher the value of Δ. We are interested in these metrics as they are more easily interpreted than the loss function value itself. We are aware that these pseudo-supervised metrics, when applied to datasets simulated from real individuals, do not yield the true quality of the models’ predictions, since the biogeographic labels assigned to the real sequences used to simulate datasets might not reflect the true genomic clusters and variation within the populations. To further investigate this issue, we also used fully simulated population clusters to evaluate the methods.
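Minimal NumPy implementations of these metrics (for the RMSE we assume the columns of Q have already been matched to those of QGT, for example via a Hungarian assignment; Δ needs no matching because QQ^T is invariant to column permutations):

```python
import numpy as np

def rmse_q(Q, Q_gt):
    # Equation (5); columns of Q must already be aligned with Q_gt.
    n, k = Q.shape
    return np.linalg.norm(Q - Q_gt, 'fro') / np.sqrt(n * k)

def delta(Q, Q_gt):
    # Equation (7); permutation-invariant via the covariance-like QQ^T.
    n = Q.shape[0]
    return np.linalg.norm(Q @ Q.T - Q_gt @ Q_gt.T, 'fro') ** 2 / n ** 2
```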

Dataset preparation

For reproducibility, we used a comprehensive set of publicly available, labeled human whole-genome sequences from diverse populations across the world, combining the 1000 Genomes Project41, the Simons Genome Diversity Project42, and the Human Genome Diversity Project43, as well as data simulated from these samples using PyAdmix14 and data simulated de novo using the Balding–Nichols Pritchard–Stephens–Donnelly model8,23. The populations within the combined real datasets are listed in Supplementary Table 2. Each subpopulation is aggregated into a continental-level label according to its geographical location (Supplementary section ‘Dataset description’). Additionally, we used the entire UK Biobank genotype dataset.

Benchmarking setup

We compared the computational time and clustering quality of Neural ADMIXTURE with those of ADMIXTURE, fastSTRUCTURE24, AlStructure22, and TeraStructure23. fastSTRUCTURE assumes the STRUCTURE model but uses accelerated variational methods instead of MCMC, yielding speedups of more than two orders of magnitude over STRUCTURE. TeraStructure iteratively computes Q and F while avoiding a high computational load by subsampling SNPs at every iteration, making the algorithm faster. AlStructure first estimates a low-dimensional linear subspace of the admixture components and then searches that subspace for a model that satisfies the modeling constraints, providing a fast alternative to the iterative or maximum-likelihood schemes followed by most algorithms. Furthermore, we also compared against HaploNet26, a variational autoencoder that maps parts of the sequence (windows) to a low-dimensional latent space, on which clustering is then performed using Gaussian mixture priors. Although the global structure of the data is preserved in the low-dimensional space, the direct interpretability of the allele frequencies (available in Neural ADMIXTURE) is lost.

All models were optimized using 16 threads on a machine with an AMD EPYC 7742 (x86_64) processor (64 cores) and 512 GB of RAM. Although more cores were available, we restricted each run to 16 threads so that several executions could be run in parallel. To assess the GPU performance of Neural ADMIXTURE, all networks were trained on an NVIDIA Tesla V100 SXM2 GPU with 32 GB of memory. The same GPUs were used to run inference on the trained models.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.