Abstract
Characterizing the genetic structure of large cohorts has become increasingly important as genetic studies extend to massive, increasingly diverse biobanks. Popular methods decompose individual genomes into fractional cluster assignments with each cluster representing a vector of DNA variant frequencies. However, with rapidly increasing biobank sizes, these methods have become computationally intractable. Here we present Neural ADMIXTURE, a neural network autoencoder that follows the same modeling assumptions as the current standard algorithm, ADMIXTURE, while reducing the compute time by orders of magnitude surpassing even the fastest alternatives. One month of continuous compute using ADMIXTURE can be reduced to just hours with Neural ADMIXTURE. A multihead approach allows Neural ADMIXTURE to offer even further acceleration by computing multiple cluster numbers in a single run. Furthermore, the models can be stored, allowing cluster assignment to be performed on new data in linear time without needing to share the training samples.
Main
The rapid growth in sequenced human genomes and the proliferation of population-scale biobanks have enabled the creation of increasingly accurate models to predict traits and disease risk using an individual’s genome. However, different predictive models can be required depending on an individual’s genetic ancestry, and this necessitates accurately characterizing genetic cluster composition at the individual level^{1}. Such characterization is also an essential part of most modern population genetics studies and national biobanking efforts^{2,3}. However, many existing algorithms for this task struggle with next-generation sequencing datasets, where both the number of samples and the number of sequenced positions along the genome are much greater than in earlier case–control genotyping studies. Scalable algorithms to characterize the population structure of genetic sequences are especially important for more diverse biobanks, which are themselves needed to correct the extreme imbalance towards European-descent samples in existing studies and so avoid a new divide in healthcare arising from the omission of most of the world’s population from precision health research^{4}.
A common approach for characterizing the population structure within a genetic dataset is to describe each sample as a set of fractional assignments to each cluster. These clusters are centroids found via an unsupervised algorithm in a space spanning the frequencies of each variant. By avoiding the culture-specific labels and subjective constructs (for example, ethnicity) of supervised classification methods^{5}, these unsupervised approaches can better reflect the spectrum of genetic structure across samples. Generally, the input variants are the individual’s sequence of single nucleotide polymorphisms (SNPs), that is, single positions along the genome known to vary between individuals. Smaller datasets of less numerous variants, such as microsatellites, have also been used. There are millions of SNPs in the human genome and most are biallelic (two variants), permitting a binary encoding. For instance, zero could be used to encode the most common (or reference) variant at an SNP position on the genome and one to encode the minority (or alternate) variant. The frequency distribution of these variants will vary between populations due to differing histories: founder events, migration, isolation, and drift.
We present an autoencoder that expands on the standard clustering method for genomes, ADMIXTURE^{6,7}. ADMIXTURE was developed as a computationally efficient alternative to STRUCTURE^{8}, and we now take this pursuit of efficiency to the next generation of datasets. Our proposed method, Neural ADMIXTURE, follows the same modeling assumptions as ADMIXTURE but reframes the task as a neural-network-based autoencoder, providing faster computation on both graphics processing units (GPUs) and central processing units (CPUs) while maintaining high-quality assignments.
Results
Model overview
Neural ADMIXTURE (Fig. 1a) is an interpretable autoencoder with two main components: (1) an encoder, composed of two linear layers with a Gaussian error linear unit (GELU) activation^{9} in between, followed by a softmax activation, which projects a genotype sequence onto a vector representing fractional ancestry assignments for each individual (Q); and (2) a decoder, which is a single linear layer whose weights are restricted to lie between 0 and 1, leading to an interpretable projection matrix that learns the cluster centroids, or equivalently, the average variant frequency at each site for each population (F). Additionally, we introduce Multihead Neural ADMIXTURE (Fig. 1b), which includes multiple decoders in a single network to obtain results analogous to training ADMIXTURE repeatedly for different numbers of clusters, while needing only a single training run for all desired numbers of clusters.
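The encoder and decoder described above can be illustrated with a small forward pass. This is a minimal sketch in plain Python, not the released implementation: the function and parameter names (`forward`, `init_params`, `W1`, `W2`), the random initialization, and the tiny layer sizes are assumptions chosen for readability, and the batch normalization on the input is omitted.

```python
import math
import random

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(z):
    # Numerically stable softmax; the output is non-negative and sums to 1.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def linear(x, W, b):
    # W has shape (out, in); returns W @ x + b.
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(x, params):
    """Map one genotype vector x (length M) to (q, x_hat)."""
    h = [gelu(v) for v in linear(x, params["W1"], params["b1"])]
    q = softmax(linear(h, params["W2"], params["b2"]))  # fractional assignments
    F = params["F"]                                     # K x M, entries in [0, 1]
    # Linear decoder without bias: x_hat_j = sum_k q_k * f_kj.
    x_hat = [sum(q[k] * F[k][j] for k in range(len(q))) for j in range(len(F[0]))]
    return q, x_hat

def init_params(M, hidden, K, seed=0):
    # Hypothetical random initialization, for illustration only.
    rng = random.Random(seed)
    rand = lambda r, c: [[rng.uniform(-0.1, 0.1) for _ in range(c)] for _ in range(r)]
    return {
        "W1": rand(hidden, M), "b1": [0.0] * hidden,
        "W2": rand(K, hidden), "b2": [0.0] * K,
        "F": [[rng.random() for _ in range(M)] for _ in range(K)],
    }

params = init_params(M=6, hidden=4, K=3)
q, x_hat = forward([0.0, 0.5, 1.0, 1.0, 0.0, 0.5], params)
```

Because `q` lies on the simplex and every entry of `F` lies in [0, 1], each reconstructed value is a convex combination of cluster frequencies and therefore also lies in [0, 1].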
Neural ADMIXTURE was trained with a standard binary cross-entropy, leading to an equivalence with the traditional ADMIXTURE model’s objective function (Methods). Two initialization techniques, one based on principal component analysis^{10,11,12} and the other on archetypal analysis^{13}, were used as an alternative to common network initializations to speed up the training process and improve results (Supplementary section ‘Decoder initialization’). Furthermore, two mechanisms are available to incorporate prior knowledge about the amount of admixture in a dataset by controlling the softness of the cluster assignments: applying L2 regularization during training (Methods) and softmax tempering (Supplementary section ‘Softmax tempering’). Both singlehead and multihead approaches can be adapted to a supervised version that performs regular classification given known training labels (Supplementary section ‘Supervised training’). The proposed method is fully compatible with the original ADMIXTURE framework, allowing the use of ADMIXTURE results as an initialization for Neural ADMIXTURE parameters (Supplementary section ‘Pretrained mode’), and vice versa. We performed an in-depth evaluation of the proposed method and compared it with competing approaches across multiple datasets, including simulations from a variety of systems^{14,15,16,17} and samples from large-scale, real-world biobanks (Methods, Supplementary Table 1, Supplementary Table 2, and Supplementary section ‘Dataset description’).
Singlehead and multihead results
Neural ADMIXTURE is systematically faster than alternative algorithms, both on CPU and GPU (Table 1, Supplementary Fig. 1). This speedup is further enhanced when using the Multihead Neural ADMIXTURE architecture, which can perform clusterings for different K values simultaneously. For example, in the AllChms dataset, we observed that Neural ADMIXTURE trained in less than 2 min, whereas ADMIXTURE required more than a day. Neural ADMIXTURE performs at least as well as existing algorithms on both predicting the ancestry assignments (Q) and the allele frequencies (F). On average, Neural ADMIXTURE’s Q estimates appear to be more similar to the matrix of known labels than the Q estimates from previous methods (Extended Data Fig. 1).
Table 2 shows the accuracy and time performance of ADMIXTURE and Neural ADMIXTURE on the test data for three different datasets. Both ADMIXTURE and Neural ADMIXTURE are able to generalize and produce consistent assignments on unseen data. However, Neural ADMIXTURE is much faster than ADMIXTURE on both CPU and GPU, because ADMIXTURE must optimize the objective with a fixed F to find Q for unseen data, whereas Neural ADMIXTURE directly learns a function that estimates Q. We note that inference on GPU is extremely fast (generally less than a second for a forward pass); the computational bottleneck comes simply from reading and processing of the data, which could be further addressed.
We visualized the Q estimates of ADMIXTURE and Neural ADMIXTURE on the Chm22Sim dataset using pong^{18} (Fig. 2a–d). The SNP frequencies (the entries in the F matrix) from both models can be observed as projections onto the first two principal components of the training data (Fig. 2e). Neural ADMIXTURE provides harder cluster predictions, with many samples being assigned only to a single population, whereas ADMIXTURE provides softer cluster predictions with partial assignments to multiple clusters. On this dataset, ADMIXTURE does not assign different clusters to Native Americans (AMR) and East Asians (EAS); instead, it partitions Africans (AFR) into two different ancestry clusters (Fig. 2a,b). Neural ADMIXTURE, however, does split AMR and EAS populations (Fig. 2c–e). Depictions of the cluster assignments (Q) of all algorithms on several datasets can be found in Supplementary Figs. 2–5.
We applied Neural ADMIXTURE, trained on Chm22Sim, to admixed populations that were not present in the training data: Mexican Ancestry in Los Angeles, California (MXL, 118), and Puerto Ricans in Puerto Rico (PUR, 104) (Fig. 2f).
We evaluated Multihead Neural ADMIXTURE with Chm22Sim (Extended Data Fig. 2) and showed that as the number of clusters increases, each population group gets assigned its own cluster. Furthermore, we showed that Multihead Neural ADMIXTURE can be successfully applied to closely related populations (Extended Data Fig. 3). Finally, we showed that the proposed method can be applied on real, admixed datasets (Extended Data Fig. 4).
UK Biobank computational analysis
To assess the clustering speed on a very large dataset, we ran Neural ADMIXTURE in its multihead mode on the entire UK Biobank (488,377 samples in total), using 147,604 SNPs obtained by pruning the full set to remove linkage disequilibrium (LD)^{19}. Neural ADMIXTURE processed the complete dataset within 11 h, providing results from K = 2 to K = 6, whereas ADMIXTURE would take about a month to do the same, given that it took 5.5 days to provide results for K = 2 alone. Traditional techniques such as ADMIXTURE are thus too slow for such large biobanks, particularly because multiple additional runs with different parameters and subsets of data are generally needed in a study. Neural ADMIXTURE was trained without regularization (λ = 0, Methods) and using the PCK-means initialization (Supplementary Algorithm 1). During inference, the temperature was set to \(\tau =\frac{3}{2}\) (Supplementary section ‘Softmax tempering’). Figure 3 displays these cluster assignments for the UK Biobank genomes. We group the individuals by their reported country of birth; those with missing or nonexistent country-of-birth labels were excluded from the plots.
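The effect of the inference temperature can be illustrated with a short example. The exact formulation is given in the supplementary section and is not reproduced here; the sketch below assumes the common form in which the encoder logits are divided by τ before the softmax, and the function name `tempered_softmax` and the example logits are hypothetical.

```python
import math

def tempered_softmax(logits, tau=1.0):
    # Assumed form: softmax(z / tau). tau > 1 softens the assignments,
    # tau < 1 hardens them, and tau = 1 recovers the plain softmax.
    scaled = [z / tau for z in logits]
    m = max(scaled)
    e = [math.exp(s - m) for s in scaled]
    total = sum(e)
    return [v / total for v in e]

logits = [2.0, 0.5, -1.0]                     # hypothetical encoder outputs
q_plain = tempered_softmax(logits, tau=1.0)
q_soft = tempered_softmax(logits, tau=1.5)    # tau = 3/2, as in the text
```

Under this assumption, τ = 3/2 spreads some of the mass of the dominant cluster onto the remaining ones without changing their ranking.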
Scalability analysis
To assess the scalability of different methods, we simulated multiple datasets with various numbers of variants and samples using the software reported previously^{17}. The datasets consist of combinations of N ∈ {1,000, 5,000, 10,000, 20,000, 50,000} and M ∈ {1,000, 10,000, 50,000, 100,000}, where N and M are the number of samples and SNPs, respectively.
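The benchmark grid can be enumerated in a few lines. The sketch assumes that every combination of N and M was simulated (the full Cartesian product), which the text does not state explicitly; the dictionary keys are illustrative names.

```python
from itertools import product

N_VALUES = [1_000, 5_000, 10_000, 20_000, 50_000]   # numbers of samples
M_VALUES = [1_000, 10_000, 50_000, 100_000]         # numbers of SNPs

# One entry per (N, M) pair, with the size of the resulting genotype matrix.
grid = [
    {"n_samples": n, "n_snps": m, "n_genotypes": n * m}
    for n, m in product(N_VALUES, M_VALUES)
]
largest = max(grid, key=lambda g: g["n_genotypes"])
```

Under that assumption, the grid holds 20 configurations, and the largest genotype matrix has 50,000 × 100,000 entries.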
We compared the training times of ADMIXTURE, AlStructure, TeraStructure, and Neural ADMIXTURE, both on CPU and GPU, across different dataset sizes (Fig. 4). Neural ADMIXTURE is consistently faster than the alternatives. Moreover, Neural ADMIXTURE accelerates substantially using GPUs in contrast to the other methods. The hyperparameters used are described in Supplementary Table 3.
Discussion
Many unsupervised clustering methods for genotype sequences have been introduced^{8,20,21,22,23,24,25}, including the most commonly used, ADMIXTURE^{6,7}. These methods, which resemble a nonnegative matrix factorization, decompose each input sequence into a set of cluster assignments and compute a centroid for each cluster. The cluster assignments give the proportion of each genetic ancestry cluster for an individual, whereas the cluster centroids give the SNP variant frequencies at each genetic position corresponding to each cluster. Because humans are diploid, most individuals have a paternal and a maternal copy of each non-sex chromosome. Therefore, for a given individual at each genomic position, there are four possible combinations of biallelic SNPs (0/0, 0/1, 1/0, 1/1). It is common practice to sum the maternal and paternal variants, obtaining a count sequence n_{ij}. In this scenario, an individual i has n_{ij} ∈ {0, 1, 2} copies of the minority variant at SNP j. ADMIXTURE models each individual’s count sequence, given a fixed number of population groups K, as n_{ij} ~ Bin(2, p_{ij}), where p_{ij} = ∑_{k}q_{ik}f_{kj}, with q_{ik} denoting the fraction of population k assigned to i, and f_{kj} denoting the frequency of the variant encoded as ‘1’ at SNP j in population k. ADMIXTURE applies block relaxation to find the parameters Q and F that minimize the negative log-likelihood function shown in equation (1). The value of K (number of clusters) is typically chosen by using an ad hoc cross-validation procedure^{7}, necessitating runs across a range of values.
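The generative model in this paragraph can be made concrete with a toy simulation. The sketch below is illustrative only: the matrices `Q` and `F` are made-up values, the helper names are hypothetical, and the negative log-likelihood omits the binomial coefficient, which does not depend on Q or F.

```python
import math
import random

def mix(Q, F):
    """p_ij = sum_k q_ik * f_kj for Q (N x K) and F (K x M)."""
    K, M = len(F), len(F[0])
    return [[sum(q[k] * F[k][j] for k in range(K)) for j in range(M)] for q in Q]

def sample_counts(P, rng):
    """Draw n_ij ~ Bin(2, p_ij) as the sum of two Bernoulli(p_ij) draws."""
    return [[(rng.random() < p) + (rng.random() < p) for p in row] for row in P]

def neg_log_likelihood(counts, P, eps=1e-12):
    """Binomial negative log-likelihood without the constant coefficient term."""
    total = 0.0
    for n_row, p_row in zip(counts, P):
        for n, p in zip(n_row, p_row):
            p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
            total -= n * math.log(p) + (2 - n) * math.log(1.0 - p)
    return total

# Two unadmixed individuals and one 50/50 admixed individual (toy values).
Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Variant frequencies for K = 2 populations at M = 3 SNPs (toy values).
F = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.5]]

P = mix(Q, F)
counts = sample_counts(P, random.Random(0))
nll = neg_log_likelihood(counts, P)
```

For the admixed individual, the expected variant frequency at the first SNP is the average of the two population frequencies, 0.5 × 0.9 + 0.5 × 0.2 = 0.55.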
The block relaxation optimization in ADMIXTURE runs much faster than other approaches used by its main competitors, namely FRAPPE^{21} and STRUCTURE^{8}. Although it can be run in multithreading mode, which greatly reduces the execution time, this is insufficient when dealing with either a large number of samples or a large number of SNPs. Here we instead use neural networks, whose architectures have begun to be explored for several other genetic structure tasks including haplotype segmentation, dimensionality reduction, and classification^{26,27,28,29,30,31,32,33,34,35} (Supplementary section ‘Related work’).
An important caveat when using soft-clustering techniques, such as Neural ADMIXTURE or ADMIXTURE, is that these techniques follow a modeling assumption that there are some ‘prototype’ populations and that each individual can be placed within the convex hull of such prototypes. Note that this model might not reflect the underlying structure of real-world populations, particularly when independent genetic drift has occurred in each population following admixture events. This limitation is particularly acute in the case of ancient admixture events, and in such cases, other complementary techniques should also be used. Future experiments to quantify these effects using simulations would be valuable. Combining unsupervised clustering with tree-based methods to account for this drift would also be a useful direction. This could complement the progress being made in ancestral recombination graphs.
Although the computational times of Neural ADMIXTURE enable practitioners to obtain rapid results with multiple hyperparameters and different values of K, properly selecting the best results still involves a subjective element, and additional experiments and new quantitative measures are needed. Further, unsupervised clustering methods, and more generally dimensionality-reduction techniques, are affected by sampling imbalances between population groups, which can alter population structure detection and prioritization^{36,37}. Additionally, even if structure is not present within the data, these techniques can indicate otherwise^{38,39}.
Methods
Singlehead Neural ADMIXTURE
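As described in the Discussion, the existing ADMIXTURE algorithm minimizes the negative log-likelihood:
\[\begin{aligned}\mathop{\min }\limits_{Q,F}\ {\mathcal{L}}_{\mathrm{C}}(Q,F)=&-\sum_{i,j}\left[{n}_{ij}\log \left(\sum_{k}{q}_{ik}{f}_{kj}\right)+(2-{n}_{ij})\log \left(1-\sum_{k}{q}_{ik}{f}_{kj}\right)\right]\\ &\text{subject to}\quad 0\le {f}_{kj}\le 1,\quad \sum_{k}{q}_{ik}=1,\quad {q}_{ik}\ge 0,\end{aligned}\tag{1}\]
with Q = (q_{ik}) and F = (f_{kj}).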
This can be formulated as a nonnegative matrix factorization problem. Let X denote the training samples, where the features are the alternate allele normalized counts per position and the jth SNP of the ith individual is represented as \({x}_{ij}=\frac{{n}_{ij}}{2}\in \{0,0.5,1\}\). Then, X ≈ QF, where Q is the assignments, F is the alternate allele frequencies per SNP and population, and the negative loglikelihood in equation (1) is a distance between X and QF. This can be translated into a neural network as an autoencoder with Q = Ψ(X) being the bottleneck computed by the encoder function Ψ and F being the decoder weights themselves (Fig. 1a). Because Q is estimated at every forward pass and not learnt as a whole for the training data, to retrieve Q assignments on previously unseen data, we can perform a simple forward pass instead of running the optimization process fixing F, unlike with ADMIXTURE.
Note that the restrictions in the optimization problem (equation (1)) impose restrictions in the architecture. Those relating to Q (∑_{k}q_{ik} = 1 and q_{ik} ≥ 0) can be enforced by applying a softmax activation at the encoder output, making the bottleneck equivalent to the cluster assignments. Although the decoder restriction (0 ≤ f_{kj} ≤ 1) could be enforced by applying the sigmoid function to the decoder weights, we found that it suffices to project the weights of the decoder to the interval [0, 1] after every optimization step, one of the most common forms of projected gradient descent^{40}.
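The projection step described above amounts to clipping the decoder weights after each update. The following minimal sketch uses a plain gradient step as a stand-in for whichever optimizer is used in practice; the function names, the toy weights, and the toy gradient are assumptions for illustration.

```python
def project_unit_interval(F):
    # Euclidean projection of every weight onto [0, 1], i.e. clipping.
    return [[min(1.0, max(0.0, f)) for f in row] for row in F]

def projected_step(F, grad, lr):
    # Unconstrained update first, then projection back onto the feasible set.
    stepped = [
        [f - lr * g for f, g in zip(f_row, g_row)]
        for f_row, g_row in zip(F, grad)
    ]
    return project_unit_interval(stepped)

F = [[0.95, 0.02], [0.50, 0.40]]      # toy decoder weights (K = 2, M = 2)
grad = [[-1.0, 1.0], [0.5, -0.5]]     # toy gradient of the loss w.r.t. F
F_new = projected_step(F, grad, lr=0.1)
```

In this toy case the first row leaves the interval after the update (1.05 and −0.08) and is clipped to 1 and 0, while the second row is already feasible and is left unchanged by the projection.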
The decoder must be linear and cannot be followed by a nonlinearity, as this would break the interpretability of the F matrix; the equivalence between the decoder weights and cluster centroids would be lost. On the other hand, the encoder architecture is free from constraints, and it may be composed of several layers. The proposed architecture includes a 64-dimensional, nonlinear layer with a GELU activation before the bottleneck and batch normalization acting directly on the input. The latter rescales the data to have zero mean and unit variance. Since the mean for each SNP is its frequency p, and the standard deviation σ is \(\sqrt{p(1-p)}\), the {0, 1} input gets encoded as \(\left\{-\sqrt{\frac{p}{1-p}},\sqrt{\frac{1-p}{p}}\right\}\), thereby supplying more explicitly the information of the allele frequencies to the network.
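The encoding produced by standardizing a binary SNP column can be checked numerically. This sketch mimics only the zero-mean, unit-variance rescaling; a real batch normalization layer also adds a small epsilon and learnable scale and shift parameters, which are left out here, and the helper name `standardize_snp` is hypothetical.

```python
import math

def standardize_snp(column):
    # For a {0, 1} column, the mean is the variant frequency p and the
    # (population) standard deviation is sqrt(p * (1 - p)).
    p = sum(column) / len(column)
    sigma = math.sqrt(p * (1.0 - p))   # requires 0 < p < 1
    return [(x - p) / sigma for x in column], p

column = [0, 0, 0, 1]                  # toy SNP with frequency p = 0.25
z, p = standardize_snp(column)
```

With p = 0.25, a value of 0 maps to −√(1/3) ≈ −0.577 and a value of 1 maps to √3 ≈ 1.732, so rarer variants receive encodings of larger magnitude.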
The ADMIXTURE model does not precisely reconstruct the input data as a regular autoencoder would do, because the input SNP genotype sequences, n_{ij} ∈ {0, 1, 2}, and the reconstructions, p_{ij} ∈ [0, 1], do not have matching ranges. This can easily be remedied by dividing the genotype counts by two, so that the input data are \({x}_{ij}=\frac{{n}_{ij}}{2}\in \{0,0.5,1\}\). Moreover, instead of minimizing \({{{{\mathcal{L}}}}}_\mathrm{C}\) (equation (1)), we propose minimizing the binary cross-entropy, with a penalty term on the Frobenius norm of the encoder weights, θ:
\[{\mathcal{L}}_{\mathrm{N}}(Q,F)=-\sum_{i,j}\left[{x}_{ij}\log {p}_{ij}+(1-{x}_{ij})\log (1-{p}_{ij})\right]+\lambda {\left\Vert \theta \right\Vert }_{F}^{2},\qquad {p}_{ij}=\sum_{k}{q}_{ik}{f}_{kj}.\tag{2}\]
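This regularization term avoids hard assignments in the bottleneck, which helps during the training process and reduces overfitting. In equation (3) we show that the proposed optimization problem and the ADMIXTURE one are equivalent (excluding the regularization term) by using equations (1) and (2):
\[\begin{aligned}\mathop{\arg \min }\limits_{Q,F}\ {\mathcal{L}}_{\mathrm{C}}(Q,F)&=\mathop{\arg \min }\limits_{Q,F}\ -\sum_{i,j}\left[{n}_{ij}\log {p}_{ij}+(2-{n}_{ij})\log (1-{p}_{ij})\right]\\ &=\mathop{\arg \min }\limits_{Q,F}\ -2\sum_{i,j}\left[{x}_{ij}\log {p}_{ij}+(1-{x}_{ij})\log (1-{p}_{ij})\right]\\ &=\mathop{\arg \min }\limits_{Q,F}\ -\sum_{i,j}\left[{x}_{ij}\log {p}_{ij}+(1-{x}_{ij})\log (1-{p}_{ij})\right],\end{aligned}\tag{3}\]
where \({p}_{ij}=\sum_{k}{q}_{ik}{f}_{kj}\) and \({x}_{ij}=\frac{{n}_{ij}}{2}\); the last expression is the binary cross-entropy without the regularization term.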
A perfect reconstruction can of course be obtained by setting the number of clusters (K) equal to the number of training samples or to the dimension of the input (number of SNPs). However, the bottleneck should ideally capture elementary information about the population structure of the given sequences; therefore, we make use of low-dimensional bottlenecks.
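The equivalence between the binomial negative log-likelihood on counts and the binary cross-entropy on halved counts, stated earlier in this section, can be verified numerically: the former is exactly twice the latter, so both share the same minimizers. The sketch below uses made-up counts and reconstructions, and the function names are illustrative.

```python
import math

def binomial_nll(N, P):
    # Negative log-likelihood on counts n in {0, 1, 2}, constant term dropped.
    return -sum(
        n * math.log(p) + (2 - n) * math.log(1.0 - p)
        for n_row, p_row in zip(N, P)
        for n, p in zip(n_row, p_row)
    )

def bce(X, P):
    # Binary cross-entropy on normalized counts x in {0, 0.5, 1}.
    return -sum(
        x * math.log(p) + (1.0 - x) * math.log(1.0 - p)
        for x_row, p_row in zip(X, P)
        for x, p in zip(x_row, p_row)
    )

N = [[0, 1, 2], [2, 2, 0]]                      # toy genotype counts
X = [[n / 2 for n in row] for row in N]         # x_ij = n_ij / 2
P_a = [[0.1, 0.5, 0.9], [0.8, 0.7, 0.2]]        # a good reconstruction
P_b = [[0.3, 0.3, 0.3], [0.6, 0.6, 0.6]]        # a poorer reconstruction
```

Since the two objectives differ only by a factor of two, any pair of candidate reconstructions is ranked identically by both, which is what the test at the end of this example checks.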
Multihead Neural ADMIXTURE
In ADMIXTURE, cross-validation must be performed to choose the number of population clusters (K), unless specific prior information about the number of population ancestries is known. Furthermore, in many applications, practitioners wish to observe how cluster assignments change as the number of clusters increases. As the number of both sequenced individuals and variants increases, the number of different cluster numbers that can feasibly be run for cross-validation rapidly decreases due to the additional computational cost. As a solution, Multihead Neural ADMIXTURE allows all cluster numbers to be run simultaneously by taking advantage of the 64-dimensional latent representation computed by the encoder. This shared representation is jointly learnt for the different values of K, {K_{1}, …, K_{H}}.
Figure 1b shows how the shared representation is split into H different heads in the multihead architecture. The ith head consists of a nonlinear projection to a K_{i}-dimensional vector, which corresponds to an assignment that assumes there are K_{i} different genetic clusters in the data. Although every head could be concatenated and fed through a decoder, this would cause the decoder weights F to not be interpretable. Therefore, every head needs to have its own decoder and, thus, H different reconstructions of the input are retrieved.
As we have H reconstructions, we will now have H different loss values. We can train this architecture by minimizing equation (4):
\[{\mathcal{L}}_{\mathrm{MNA}}=\sum_{h=1}^{H}{\mathcal{L}}_{\mathrm{N}}\left({Q}_{K_{h}},{F}_{K_{h}}\right),\tag{4}\]
where \({\mathcal{L}}_{\mathrm{N}}\) is the singlehead objective of equation (2), and \({Q}_{K_{h}}\) and \({F}_{K_{h}}\) are, respectively, the cluster assignments and the SNP frequencies per population for the hth head. The restrictions of the ADMIXTURE optimization problem (equation (1)) must be satisfied by \({Q}_{K_{h}}\) and \({F}_{K_{h}}\) for all \(h\in \{1,\ldots ,H\}\).
The multihead architecture allows computation of H different cluster assignments corresponding to H different values for K, efficiently, in a single forward pass. Results can then be quantitatively and qualitatively analyzed by the practitioner to decide which value of K is the most suitable for the data.
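The multihead computation can be sketched as one shared hidden vector feeding several heads, each with its own projection and its own decoder, with the per-head reconstruction losses summed. This is an illustrative sketch rather than the released code: the names (`multihead_loss`, `heads`), the random weights, and the tiny dimensions are assumptions, and the regularization term is omitted.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def bce(x, x_hat, eps=1e-12):
    # Binary cross-entropy between one input vector and its reconstruction.
    total = 0.0
    for xi, pi in zip(x, x_hat):
        pi = min(max(pi, eps), 1.0 - eps)
        total -= xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total

def multihead_loss(h, x, heads):
    """heads maps K -> (W, F), with W of shape K x hidden and F of shape K x M."""
    total, assignments = 0.0, {}
    for K, (W, F) in heads.items():
        q = softmax([sum(w * v for w, v in zip(row, h)) for row in W])
        x_hat = [sum(q[k] * F[k][j] for k in range(K)) for j in range(len(x))]
        assignments[K] = q
        total += bce(x, x_hat)      # one loss value per head, summed
    return total, assignments

rng = random.Random(0)
hidden, M = 4, 5
h = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]   # shared representation
x = [0.0, 0.5, 1.0, 0.5, 0.0]                         # normalized genotype
heads = {
    K: (
        [[rng.uniform(-1.0, 1.0) for _ in range(hidden)] for _ in range(K)],
        [[rng.uniform(0.05, 0.95) for _ in range(M)] for _ in range(K)],
    )
    for K in (2, 3, 4)
}
loss, assignments = multihead_loss(h, x, heads)
```

A single pass over `heads` yields one assignment vector per value of K, which is the property that lets a practitioner compare several cluster numbers from one trained network.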
Evaluation setup
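Let N denote the number of samples and M the number of variants (SNPs). To assess the performance of the Q estimates, we match the assignments with the known labels and report the RMSE between them,
\[{\mathrm{RMSE}}_{Q}=\frac{1}{\sqrt{NK}}{\left\Vert Q-{Q}_{\mathrm{GT}}\right\Vert }_{F},\]
and the RMSE between the known allele frequencies (F_{GT}) and the estimated frequencies (F),
\[{\mathrm{RMSE}}_{F}=\frac{1}{\sqrt{MK}}{\left\Vert F-{F}_{\mathrm{GT}}\right\Vert }_{F}.\]
We also use a new metric, Δ, defined as
\[\Delta =\frac{1}{{N}^{2}}{\left\Vert Q{Q}^{\top }-{Q}_{\mathrm{GT}}{Q}_{\mathrm{GT}}^{\top }\right\Vert }_{F}^{2},\]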
which is equivalent to the mean squared difference between the covariance matrices of the estimated and the target populations. If the Q estimates completely agree with Q_{GT} (up to permutation), Δ will be zero. The larger the disagreement, the higher the value of Δ. We are interested in these metrics, as they are more easily interpreted than the loss function value itself. We are aware that these pseudo-supervised metrics, when applied to datasets simulated from real individuals, do not yield the true quality of the predictions of the models, since the biogeographic labels assigned to the real sequences used to simulate datasets might not reflect the true genomic clusters and variation within the populations. To further investigate this issue, we also used fully simulated population clusters to evaluate the methods.
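The difference between the RMSE, which requires the clusters to be matched to the labels first, and Δ, which is invariant to cluster permutations, can be seen in a small example. The sketch assumes Δ is the mean squared entry-wise difference between the N × N matrices QQᵀ and Q_GT Q_GTᵀ, as the description above suggests; the function names and toy matrices are illustrative.

```python
import math

def rmse(A, B):
    # Root mean squared entry-wise difference between two equally shaped matrices.
    diffs = [(a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)]
    return math.sqrt(sum(diffs) / len(diffs))

def gram(Q):
    # N x N matrix Q Q^T; unchanged when the columns of Q are permuted.
    return [[sum(a * b for a, b in zip(qi, qj)) for qj in Q] for qi in Q]

def delta(Q, Q_gt):
    G, G_gt = gram(Q), gram(Q_gt)
    n = len(Q)
    return sum((G[i][j] - G_gt[i][j]) ** 2 for i in range(n) for j in range(n)) / n ** 2

Q_gt = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]      # toy ground-truth assignments
Q_swapped = [row[::-1] for row in Q_gt]          # same clustering, labels swapped
Q_noisy = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]   # slightly perturbed estimate
```

Swapping the cluster labels leaves Δ at zero but inflates the unmatched RMSE, which is why the assignments are matched to the known labels before the RMSE is reported.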
Dataset preparation
For reproducibility we have used a comprehensive set of publicly available, labeled human whole-genome sequences from diverse populations across the world, combining the 1000 Genomes Project^{41}, the Simons Genome Diversity Project^{42}, and the Human Genome Diversity Project^{43}, as well as data simulated from these samples using PyAdmix^{14} and data simulated de novo using the Balding–Nichols Pritchard–Stephens–Donnelly model^{8,23}. The populations within the combined real datasets can be found in Supplementary Table 2. Each subpopulation is aggregated into a continental-level label according to its geographical location (Supplementary section ‘Dataset description’). Additionally, we used the entire UK Biobank genotype dataset.
Benchmarking setup
We compared the computational time and clustering quality of Neural ADMIXTURE with those of ADMIXTURE, fastSTRUCTURE^{24}, AlStructure^{22}, and TeraStructure^{23}. fastSTRUCTURE assumes the STRUCTURE model but uses accelerated variational methods instead of MCMC, yielding speedups of more than two orders of magnitude against STRUCTURE. TeraStructure iteratively computes Q and F while avoiding a high computational load by subsampling SNPs at every iteration, which makes the algorithm faster. AlStructure first estimates a low-dimensional linear subspace of the admixture components and then searches for a model in the latter subspace that satisfies the modeling constraints, yielding a fast alternative to the iterative or maximum likelihood schemes followed by most algorithms. Furthermore, we also compared against HaploNet^{26}, a variational autoencoder that maps parts of the sequence (windows) to a low-dimensional latent space, on which clustering is then performed using Gaussian mixture priors. Although the global structure of the data is preserved in the low-dimensional space, direct interpretability of the allele frequencies (available in Neural ADMIXTURE) is not preserved.
All models were optimized using 16 threads on a machine with an AMD EPYC 7742 (x86_64) processor (64 cores) and 512 GB of RAM. We restricted the number of threads to 16, even though more cores were available, so that several executions could run in parallel. To assess GPU performance of Neural ADMIXTURE, all networks were trained on an NVIDIA Tesla V100 SXM2 with 32 GB of memory. The same GPUs were used to run inference on the trained models.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The samples used in the ‘Experiments’ section were compiled from public datasets: the 1000 Genomes Project (https://www.internationalgenome.org/data/)^{41}, the Simons Genome Diversity Project (https://www.simonsfoundation.org/simonsgenomediversityproject/)^{42}, and the Human Genome Diversity Project (https://www.internationalgenome.org/dataportal/datacollection/hgdp)^{43}. The compiled datasets (AllChms, Chm22 and Chm22Sim) are available on figshare^{44}. The UK Biobank has approval from the North West Multicentre Research Ethics Committee as a Research Tissue Bank. This dataset is available to researchers through an open application via https://www.ukbiobank.ac.uk/registerapply/. The entire dataset of genotypes available for download from the UK Biobank portal was used. Source data are provided with this paper.
Code availability
The software is available as an installable package in the PyPI repository under the name ‘neuraladmixture’. The source code can be found at https://github.com/aisandbox/neuraladmixture (ref. 45).
References
Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584–591 (2019).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Privé, F. Using the UK Biobank as a global reference of worldwide populations: application to measuring ancestry diversity from GWAS summary statistics. Bioinformatics 38, 3477–3480 (2022).
Morales, J. et al. A standardized framework for representation of ancestry data in genomics studies, with application to the NHGRI-EBI GWAS Catalog. Genome Biol. 19, 1–10 (2018).
Mathieson, I. & Scally, A. What is ancestry? PLoS Genet. 16, e1008624 (2020).
Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
Alexander, D. H. & Lange, K. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinform. 12, 246 (2011).
Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
Hendrycks, D. & Gimpel, K. Gaussian error linear units (GELUs). Preprint at https://doi.org/10.48550/arXiv.1606.08415 (2020).
Novembre, J. et al. Genes mirror geography within Europe. Nature 456, 98–101 (2008).
Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
Cutler, A. & Breiman, L. Archetypal analysis. Technometrics 36, 338–347 (1994).
Kumar, A., Montserrat, D. M., Bustamante, C. & Ioannidis, A. XGMix: local-ancestry inference with stacked XGBoost. Preprint at bioRxiv https://doi.org/10.1101/2020.04.21.053876 (2020).
Maples, B. K., Gravel, S., Kenny, E. E. & Bustamante, C. D. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 93, 278–288 (2013).
Karavani, E. et al. Screening human embryos for polygenic traits has limited utility. Cell 179, 1424–1435.e8 (2019).
Chiu, A., Molloy, E., Tan, Z., Talwalkar, A. & Sankararaman, S. Inferring population structure in biobank-scale genomic data. Am. J. Hum. Genet. 109, 727–737 (2022).
Behr, A. A., Liu, K. Z., Liu-Fang, G., Nakka, P. & Ramachandran, S. Pong: fast analysis and visualization of latent clusters in population genetic data. Bioinformatics 32, 2817–2823 (2016).
Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
Bradburd, G. S., Coop, G. M. & Ralph, P. L. Inferring continuous and discrete population genetic structure across space. Genetics 210, 33–52 (2018).
Tang, H., Peng, J., Wang, P. & Risch, N. J. Estimation of individual admixture: analytical and study design considerations. Genet. Epidemiol. 28, 289–301 (2005).
Cabreros, I. & Storey, J. D. A likelihoodfree estimator of population structure bridging admixture models and principal components analysis. Genetics 212, 1009–1029 (2019).
Gopalan, P., Hao, W., Blei, D. & Storey, J. Scaling probabilistic models of genetic variation to millions of humans. Nat. Genet. 48, 1587–1590 (2016).
Raj, A., Stephens, M. & Pritchard, J. K. fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics 197, 573–589 (2014).
Gimbernat-Mayol, J., Dominguez Mantes, A., Bustamante, C. D., Mas Montserrat, D. & Ioannidis, A. G. Archetypal analysis for population genetics. PLoS Comput. Biol. 18, e1010301 (2022).
Meisner, J. & Albrechtsen, A. Haplotype and population structure inference using neural networks in whole-genome sequencing data. Genome Res. 32, 1542–1552 (2022).
Joo, W., Lee, W., Park, S. & Moon, I.-C. Dirichlet variational autoencoder. Pattern Recognit. 107, 107514 (2020).
Keller, S. M., Samarin, M., Torres, F. A., Wieser, M. & Roth, V. Learning extremal representations with deep archetypal analysis. Int. J. Comput. Vis. 129, 805–820 (2021).
Ausmees, K. & Nettelblad, C. A deep learning framework for characterization of genotype data. G3 12, jkac020 (2022).
Svensson, V., Gayoso, A., Yosef, N. & Pachter, L. Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics 36, 3418–3421 (2020).
Battey, C., Coffing, G. C. & Kern, A. D. Visualizing population structure with variational autoencoders. G3 11, jkaa036 (2021).
Montserrat, D. M., Bustamante, C. & Ioannidis, A. LAI-Net: local-ancestry inference with neural networks. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing 1314–1318 (IEEE, 2020).
Oriol Sabat, B., Mas Montserrat, D., Giro-i-Nieto, X. & Ioannidis, A. G. SALAI-Net: species-agnostic local ancestry inference network. Bioinformatics 38, ii27–ii33 (2022).
Romero, A. et al. Diet networks: thin parameters for fat genomics. In 5th International Conference on Learning Representations (OpenReview.net, 2017).
Battey, C. J., Ralph, P. L. & Kern, A. D. Predicting geographic location from genetic variation with deep neural networks. eLife 9, e54507 (2020).
Toyama, K. S., Crochet, P.-A. & Leblois, R. Sampling schemes and drift can bias admixture proportions inferred by structure. Mol. Ecol. Resour. 20, 1769–1785 (2020).
Elhaik, E. Principal component analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated. Sci. Rep. 12, 14683 (2022).
Chari, T., Banerjee, J. & Pachter, L. The specious art of singlecell genomics. Preprint at bioRxiv https://doi.org/10.1101/2021.08.25.457696 (2021).
Montserrat, D. M. & Ioannidis, A. G. Adversarial attacks on genotype sequences. In 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE, 2023).
Lin, C.-J. Projected gradient methods for nonnegative matrix factorization. Neural Comput. 19, 2756–2779 (2007).
1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).
Bergström, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012 (2020).
Dominguez Mantes, A. et al. Neural ADMIXTURE - datasets. figshare https://doi.org/10.6084/m9.figshare.19387538.v1 (2022).
Dominguez Mantes, A., Ioannidis, A. G. & Montserrat, D. M. AI-sandbox/neural-admixture: stable release. Zenodo https://doi.org/10.5281/zenodo.7938892 (2023).
Acknowledgements
This work was partially supported by a grant from the Stanford Institute for Human-Centered Artificial Intelligence (HAI), NIH grants 7U01HG009080 and R01HG010140, and project PID2020-117142GB-I00 funded by MCIN/AEI/10.13039/501100011033. This research was conducted using the UK Biobank Resource under Application Number 89006.
Author information
Contributions
A.G.I. and D.M.M. designed the research. A.D.M. performed the research and wrote the software. A.D.M., D.M.M., X.G.N., and A.G.I. interpreted the results. C.D.B. contributed data. A.D.M., D.M.M., and A.G.I. wrote the manuscript.
Ethics declarations
Competing interests
C.D.B. is the chief executive officer of Galatea Bio, and A.G.I. also holds shares. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 2D visualization of Q estimates using multidimensional scaling (MDS)
Algorithms appearing closer in the MDS projection have more similar estimates than those farther away. To apply MDS, we computed a distance matrix over the Q estimates of the different algorithms (including the ground-truth matrix), using the Frobenius norm of the difference between each pair of Q matrices. The normalized distances were then averaged across all datasets to obtain a single distance matrix.
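The procedure in this caption can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the caption does not say how distances were normalized or which MDS variant was used, so the sketch scales each dataset's distance matrix by its maximum entry and uses classical (Torgerson) MDS. It also assumes that the cluster columns of every Q matrix have already been matched to those of the ground truth. The function names, method names and toy data are hypothetical.

```python
import numpy as np


def frobenius_distance_matrix(q_by_method):
    """Pairwise Frobenius distances between the Q matrices of one dataset."""
    names = list(q_by_method)
    n = len(names)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            diff = q_by_method[names[i]] - q_by_method[names[j]]
            dist[i, j] = dist[j, i] = np.linalg.norm(diff, ord="fro")
    return names, dist


def average_normalized_distances(datasets):
    """Average the per-dataset distance matrices after normalizing each one.

    Assumption: 'normalized' means dividing by the largest distance in
    that dataset; the caption does not specify the normalization.
    """
    names, total = None, None
    for q_by_method in datasets:
        current_names, dist = frobenius_distance_matrix(q_by_method)
        if names is None:
            names, total = current_names, np.zeros_like(dist)
        elif current_names != names:
            raise ValueError("every dataset must list the same methods in the same order")
        max_dist = dist.max()
        total += dist / max_dist if max_dist > 0 else dist
    return names, total / len(datasets)


def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS: embed points so that Euclidean distances
    approximate the entries of `dist`."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues can occur for non-Euclidean inputs; clip them to zero.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Toy data (hypothetical): a ground-truth Q and two noisy "methods" per dataset.
rng = np.random.default_rng(0)


def toy_dataset(n_samples=50, k=3):
    truth = rng.dirichlet(np.ones(k), size=n_samples)

    def perturb(noise_scale):
        noisy = np.clip(truth + rng.normal(0.0, noise_scale, truth.shape), 1e-9, None)
        return noisy / noisy.sum(axis=1, keepdims=True)  # rows sum to 1 again

    return {"ground_truth": truth, "method_a": perturb(0.02), "method_b": perturb(0.2)}


names, avg_dist = average_normalized_distances([toy_dataset() for _ in range(3)])
coords = classical_mds(avg_dist)
for name, (x, y) in zip(names, coords):
    print(f"{name}: ({x:.3f}, {y:.3f})")
```

In this toy setup the less noisy method should land closer to the ground truth than the noisier one, which is the reading the caption gives to proximity in the projection.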
Extended Data Fig. 2 Results from Multihead Neural ADMIXTURE (K=3 to K=8) on the test set of Chm22Sim
For K=3, European (EUR), West Asian (WAS) and South Asian (SAS) are combined within the same cluster, while American (AMR), Oceanian (OCE), and East Asian (EAS) are clustered together, and African (AFR) has its own cluster. These results reflect the genetic similarity between the respective groups due to their Out-of-Africa migration patterns and subsequent gene flow. After increasing to K=5, OCE obtains its own cluster, reflecting the ancient divergence of that population from the others; in our study it consists of the Australo-Papuan groups: Native Australian (SGDP), Papuan Highlands (HGDP), Papuan Sepik (HGDP), Bougainville (HGDP), and Dusun (HGDP). As more clusters are incorporated, AMR and EAS obtain their own clusters and OCE is divided between a component found predominantly in OCE and a component characteristic of EAS. The latter likely reflects the later migration of Austronesian speakers from East Asia out into the Pacific Islands, where they contributed their ancestry to the Oceanian inhabitants. A shared component between EUR, SAS and WAS is maintained, independent of the cluster number K. This could be linked to early farmer expansions out of West Asia and into both Europe and South Asia following the birth of agriculture, or to the much later expansion of the Indo-European language family across all of these regions. Other genetic exchanges between these neighboring regions doubtless played a role. With a sufficiently high number of clusters, a shared component between WAS and some AFR populations appears, perhaps reflecting North African gene flow.
Extended Data Fig. 3 Multihead Neural ADMIXTURE results on a dataset consisting of closely related groups.
To qualitatively assess the performance of Neural ADMIXTURE on related groups, we ran multihead Neural ADMIXTURE on a subset of the dataset AllChms containing 504 East Asian (EAS) individuals from neighboring regions. The self-reported ancestries of these individuals are Chinese Dai in Xishuangbanna, China (CDX, 93), Han Chinese in Beijing, China (CHB, 103), Han Chinese South (CHS, 105), Japanese in Tokyo, Japan (JPT, 104) and Kinh in Ho Chi Minh City, Vietnam (KHV, 99). The network was trained in its multihead version from K=3 to K=7 using the PCK-Means initialization. The Japanese samples (JPT) are differentiated and clearly assigned their own cluster (blue), which is present only marginally in other populations. CDX (Chinese Dai) and KHV (Vietnamese Kinh) initially share the same cluster (K=3, green), reflecting their common Southeast Asian lineage, but are split into different groups at K=4 (purple and green). As expected, CHB (Han Chinese in Beijing) and CHS (Han Chinese from South China) samples share the same cluster at first (red) and are only differentiated last (at K=5, red and orange). Further structure (yellow and brown) is seen within some populations at higher K.
Extended Data Fig. 4 Q estimates of multihead Neural ADMIXTURE on a dataset consisting of only admixed samples.
To assess the performance of the model on real admixed samples, we trained a multihead Neural ADMIXTURE model (from K=2 to K=5) with samples whose self-reported ancestries are African Caribbean in Barbados (ACB, 96), African Ancestry in Southwest US (ASW, 61), Colombian in Medellin, Colombia (CLM, 94), Mexican Ancestry in Los Angeles, California (MXL, 64), Peruvian in Lima, Peru (PEL, 85) and Puerto Rican in Puerto Rico (PUR, 104). The groups were selected from the 1000 Genomes Project. The variants used (839,629) are the same as in the dataset AllChms. The network was trained using the PCK-Means initialization (Supplementary Text ‘Decoder initialization’). At K=2, ACB and ASW are assigned predominantly to their own cluster, separating their mostly African origins from the remaining out-of-Africa components. When the next cluster is introduced (K=3), admixed individuals in CLM, MXL and PEL are assigned some fraction to it, differentiating an Indigenous American component in them from their European component. At K=4, the individuals in the PUR population are assigned some fraction of the new cluster, and this cluster is also present in small amounts in CLM and in smaller amounts in some MXL. This component, which does not decrease the Indigenous American component fraction in the samples, likely represents an early colonial-era Spanish (European-ancestry) founder effect on the island of Puerto Rico; its presence in CLM and MXL perhaps reflects the subsequent early colonial expansion from the Spanish Caribbean to coastal Colombia and Mexico. Structure in the European component appears at K=5.
Extended Data Fig. 5 Cluster assignments computed by Neural ADMIXTURE for individuals born outside the British and Irish Isles in the UK Biobank training data.
(a) K=2, (b) K=3, (c) K=4, (d) K=5, (e) K=6. Because the majority of the dataset is composed of individuals with white British ancestry, we only plot the cluster assignments of individuals who reported a country of birth outside the British and Irish Isles. K=2 approximately divides samples between European and non-European populations. With K=3, European, South-and-East Asian, and African ancestry clusters emerge. With K=4, a finer-grained clustering emerges that divides East and South Asian populations. K=5 adds a fifth cluster shared (in different proportions) between Southern European (Mediterranean) and West Asian (Near Eastern) populations. Finally, K=6 seems to introduce a cluster mostly present in Northern and Eastern European populations.
Supplementary information
Supplementary Information
Supplementary Text 1–7, Figs. 1–5 and Tables 1–3.
Supplementary Data 1
Q estimates of different methods on benchmarking training datasets.
Supplementary Data 2
Q estimates of ADMIXTURE and Neural ADMIXTURE on benchmarking test datasets.
Source data
Source Data Fig. 2
Q estimates for ADMIXTURE and Q and F estimates for Neural ADMIXTURE on Chm22Sim and on admixed datasets.
Source Data Fig. 3
Q estimates of Multihead Neural ADMIXTURE (K = 2 to K = 6) on the UK Biobank dataset.
Source Data Fig. 4
Runtimes of different methods on datasets of different numbers of samples and variants.
Source Data Extended Data Fig. 2
Q estimates of Multihead Neural ADMIXTURE (K = 3 to K = 8) on the test data of Chm22Sim.
Source Data Extended Data Fig. 3
Q estimates of Multihead Neural ADMIXTURE (K = 3 to K = 7) trained on samples from East Asia.
Source Data Extended Data Fig. 4
Q estimates of Multihead Neural ADMIXTURE (K = 2 to K = 5) trained only on admixed samples.
Source Data Extended Data Fig. 5
Q estimates of Multihead Neural ADMIXTURE (K = 2 to K = 6) trained on the UK Biobank dataset.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Dominguez Mantes, A., Mas Montserrat, D., Bustamante, C.D. et al. Neural ADMIXTURE for rapid genomic clustering. Nat Comput Sci 3, 621–629 (2023). https://doi.org/10.1038/s43588-023-00482-7
DOI: https://doi.org/10.1038/s43588-023-00482-7