Main

The rapid growth in sequenced human genomes and the proliferation of population-scale biobanks have enabled the creation of increasingly accurate models to predict traits and disease risk using an individual’s genome. However, different predictive models can be required depending on an individual’s genetic ancestry, and this necessitates accurately characterizing genetic cluster composition at the individual level1. Such characterization is also an essential part of most modern population genetics studies and national biobanking efforts2,3. Yet many existing algorithms for this task struggle with next-generation sequencing datasets, where both the number of samples and the number of sequenced positions along the genome are much greater than in earlier case–control genotyping studies. Scalable algorithms for characterizing the population structure of genetic sequences are especially important for more diverse biobanks, which are themselves needed to correct the extreme imbalance towards European-descent samples in existing studies; omitting most of the world’s population from precision health research risks creating a new divide in healthcare4.

A common approach to characterizing the population structure within a genetic dataset is to describe each sample as a set of fractional assignments to each cluster. These clusters are centroids found via an unsupervised algorithm in a space spanned by the frequencies of each variant. By avoiding the culture-specific labels and subjective constructs (for example, ethnicity) of supervised classification methods5, these unsupervised approaches can better reflect the spectrum of genetic structure across samples. Generally, the input variants are the individual’s sequence of single nucleotide polymorphisms (SNPs), that is, single positions along the genome known to vary between individuals. Smaller datasets with fewer variants, such as microsatellites, have also been used. There are millions of SNPs in the human genome and most are biallelic (two variants), permitting a binary encoding. For instance, zero could be used to encode the most common (or reference) variant at an SNP position and one to encode the minority (or alternate) variant. The frequency distribution of these variants varies between populations owing to their differing histories: founder events, migration, isolation, and drift.
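To make the encoding concrete, the following toy sketch (all values invented) encodes two small populations of three haplotypes each and computes their per-position alternate-variant frequencies, whose divergence is exactly what clustering methods exploit:

```python
import numpy as np

# Toy binary SNP encoding: 0 = reference variant, 1 = alternate variant,
# for five SNP positions across three haplotypes per population.
population_a = np.array([[0, 1, 0, 0, 1],
                         [0, 1, 1, 0, 1],
                         [0, 1, 0, 0, 0]])
population_b = np.array([[1, 0, 0, 1, 1],
                         [1, 0, 1, 1, 0],
                         [1, 1, 0, 1, 0]])

# The alternate-variant frequency at each position differs between the
# two populations, reflecting their differing histories.
print(population_a.mean(axis=0))  # approx. [0.   1.   0.33 0.   0.67]
print(population_b.mean(axis=0))  # approx. [1.   0.33 0.33 1.   0.33]
```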

We present an autoencoder that expands on the clustering method for genomes: ADMIXTURE6,7. ADMIXTURE was developed as a computationally efficient alternative to STRUCTURE8, and we now take this pursuit of efficiency to the next generation of datasets. Our proposed method, Neural ADMIXTURE, follows the same modeling assumptions as ADMIXTURE, but reframes the task as a neural-network-based autoencoder, providing faster computational times on both graphics processing units (GPUs) and central processing units (CPUs) while maintaining high-quality assignments.

Results

Model overview

Neural ADMIXTURE (Fig. 1a) is an interpretable autoencoder with two main components: (1) an encoder, composed of two linear layers with a Gaussian error linear unit (GELU) activation9 in between, followed by a softmax activation, which projects a genotype sequence onto a vector of fractional ancestry assignments for each individual (Q); and (2) a decoder, a single linear layer whose weights are restricted to lie between 0 and 1, yielding an interpretable projection matrix that learns the cluster centroids or, equivalently, the average variant frequency at each site for each population (F). Additionally, we introduce Multi-head Neural ADMIXTURE (Fig. 1b), which includes multiple decoders in a single network to obtain results analogous to training ADMIXTURE repeatedly for different numbers of clusters, but requiring only a single training run for all desired numbers of clusters.
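A minimal PyTorch sketch of this architecture follows; it is our illustrative re-implementation (layer and variable names are ours), not the authors’ released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn  # aliased to avoid clashing with the F matrix

class NeuralAdmixtureSketch(nn.Module):
    """Single-head sketch of Fig. 1a: batch normalization on the input,
    a 64-dimensional GELU layer, a softmax bottleneck of size K, and a
    linear decoder whose weights play the role of the F matrix."""
    def __init__(self, num_snps: int, k: int):
        super().__init__()
        self.batch_norm = nn.BatchNorm1d(num_snps)
        self.hidden = nn.Linear(num_snps, 64)              # theta_1
        self.assign = nn.Linear(64, k)                     # theta_2
        self.decoder = nn.Linear(k, num_snps, bias=False)  # weight.T plays the role of F

    def forward(self, x):
        h = F_nn.gelu(self.hidden(self.batch_norm(x)))
        q = torch.softmax(self.assign(h), dim=1)  # fractional assignments Q
        return self.decoder(q), q                 # reconstruction and Q
```

The [0, 1] constraint on the decoder weights is enforced separately during training, as described in the Methods.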

Fig. 1: Neural ADMIXTURE model architecture.
figure 1

a, Single-head architecture. The input sequence (x) is projected into 64 dimensions using a linear layer (θ1) and processed by a GELU non-linearity (σ1). The cluster assignment estimates Q are computed by feeding the 64-dimensional sequence to a K-neuron layer (parametrized by θ2) activated with a softmax (σ2). Finally, the decoder outputs a reconstruction of the input (\(\tilde{x}\)) using a linear layer with weights F. Note that the decoder is restricted to this linear architecture to ensure interpretability. b, Simple multi-head example with H = 3. The 64-dimensional hidden vector is copied and processed independently by different sets of weights (\({\theta }_{{2}_{h}}\)), which yield vectors of different dimensions, corresponding to the different K values. Each different \({Q}_{{K}_{h}}\) matrix is processed independently by different decoder matrices \({F}_{{K}_{h}}\) yielding H different reconstructions. All parameters are optimized jointly in an end-to-end fashion.

Neural ADMIXTURE was trained with a standard binary cross-entropy, leading to an equivalence with the traditional ADMIXTURE model’s objective function (Methods). Two initialization techniques, one based on principal component analysis10,11,12 and the other on archetypal analysis13, were used as an alternative to common network initializations to speed up the training process and improve results (Supplementary section ‘Decoder initialization’). Furthermore, two mechanisms are available to incorporate prior knowledge about the amount of admixture in a dataset by controlling the softness of the cluster assignments: applying L2 regularization during training (Methods) and softmax tempering (Supplementary section ‘Softmax tempering’). Both single-head and multi-head approaches can be adapted to a supervised version that performs regular classification given known training labels (Supplementary section ‘Supervised training’). The proposed method is fully compatible with the original ADMIXTURE framework, allowing the use of ADMIXTURE results as an initialization for Neural ADMIXTURE parameters (Supplementary section ‘Pretrained mode’), and vice versa. We performed an in-depth evaluation of the proposed method and compared it with competing approaches across multiple datasets, including simulations from a variety of systems14,15,16,17 and samples from large-scale, real-world biobanks (Methods, Supplementary Table 1, Supplementary Table 2, and Supplementary section ‘Dataset description’).
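As an illustration of softmax tempering, one common formulation, which we assume here (the method’s exact variant is described in the Supplementary section), divides the bottleneck logits by a temperature τ before the softmax:

```python
import torch

def tempered_softmax(logits: torch.Tensor, tau: float = 1.5) -> torch.Tensor:
    # Dividing the logits by tau > 1 softens the cluster assignments
    # (more admixture); tau < 1 sharpens them towards hard assignments.
    return torch.softmax(logits / tau, dim=-1)
```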

Single-head and multi-head results

Neural ADMIXTURE is systematically faster than alternative algorithms, both on CPU and GPU (Table 1, Supplementary Fig. 1). This speedup is further enhanced when using the Multi-head Neural ADMIXTURE architecture, which can perform clusterings for different K values simultaneously. For example, on the All-Chms dataset, Neural ADMIXTURE trained in less than 2 min, whereas ADMIXTURE required more than a day. Neural ADMIXTURE performs at least as well as existing algorithms at predicting both the ancestry assignments (Q) and the allele frequencies (F). On average, Neural ADMIXTURE’s Q estimates appear to be more similar to the matrix of known labels than the Q estimates from previous methods (Extended Data Fig. 1).

Table 1 Performance comparison of several global ancestry inference algorithms

Table 2 shows the accuracy and time performance of ADMIXTURE and Neural ADMIXTURE on the test data for three different datasets. Both ADMIXTURE and Neural ADMIXTURE are able to generalize and produce consistent assignments on unseen data. However, Neural ADMIXTURE is much faster than ADMIXTURE on both CPU and GPU, because ADMIXTURE must optimize the objective with a fixed F to find Q for unseen data, whereas Neural ADMIXTURE directly learns a function that estimates Q. We note that inference on GPU is extremely fast (generally less than a second for a forward pass); the computational bottleneck comes simply from reading and preprocessing the data, a step that could be further optimized.

Table 2 Performance comparison of ADMIXTURE and Neural ADMIXTURE on test data

We visualized the Q estimates of ADMIXTURE and Neural ADMIXTURE on the Chm-22-Sim dataset using pong18 (Fig. 2a–d). The SNP frequencies (the entries of the F matrix) from both models can be observed as projections onto the first two principal components of the training data (Fig. 2e). Neural ADMIXTURE provides harder cluster predictions, with many samples being assigned to only a single population, whereas ADMIXTURE provides softer cluster predictions with partial assignments to multiple clusters. On this dataset, ADMIXTURE does not assign different clusters to Native Americans (AMR) and East Asians (EAS); instead, it partitions Africans (AFR) into two different ancestry clusters (Fig. 2a,b). Neural ADMIXTURE, however, does split the AMR and EAS populations (Fig. 2c–e). Depictions of the cluster assignments (Q) of all algorithms on several datasets can be found in Supplementary Figs. 2–5.

Fig. 2: Visualization of several results of ADMIXTURE and Neural ADMIXTURE trained on the dataset Chm-22-Sim (K = 7).
figure 2

a, Q estimates of ADMIXTURE on training data. b, Q estimates of ADMIXTURE on test data. c, Q estimates of Neural ADMIXTURE on training data. d, Q estimates of Neural ADMIXTURE on test data. e, Two-dimensional principal component analysis (PCA) projection of the training data and the matrix F learnt by both ADMIXTURE and Neural ADMIXTURE, which correspond to the cluster centroids. The color of each individual in the PCA represents its ground-truth regional label. f, Q estimates of Neural ADMIXTURE on admixed populations not present in the training data. Among the MXL samples, we observe mainly an orange AMR component along with red and yellow components (West Asians (WAS) and Europeans (EUR), respectively). These latter components likely originate from the immigration of Spanish, Morisco, and Sephardic Jewish individuals into Mexico during the colonial period. The PUR samples exhibit EUR, WAS, AMR, and AFR ancestry clusters. The additional AFR component is likely linked to the introduction of enslaved West Africans during the colonial period. In the barplots (used to visualize Q), each vertical bar represents an individual sample and the colored bar lengths represent the proportion of the sample’s ancestry assigned to that colored cluster. OCE, Oceanians; SAS, South Asians.

Source data

We applied Neural ADMIXTURE, trained on Chm-22-Sim, to admixed populations that were not present in the training data: Mexican Ancestry in Los Angeles, California (MXL, n = 118), and Puerto Ricans in Puerto Rico (PUR, n = 104) (Fig. 2f).

We evaluated Multi-head Neural ADMIXTURE with Chm-22-Sim (Extended Data Fig. 2) and showed that as the number of clusters increases, each population group gets assigned its own cluster. Furthermore, we showed that Multi-head Neural ADMIXTURE can be successfully applied to closely related populations (Extended Data Fig. 3). Finally, we showed that the proposed method can be applied on real, admixed datasets (Extended Data Fig. 4).

UK Biobank computational analysis

To assess clustering speed on a very large dataset, we ran Neural ADMIXTURE in its multi-head mode on the entire UK Biobank (a total of 488,377 samples), using 147,604 SNPs obtained by pruning the full set to remove linkage disequilibrium (LD)19. Neural ADMIXTURE was able to process the complete dataset within 11 h, providing results from K = 2 to K = 6, whereas ADMIXTURE would take about a month to do the same, given that it took 5.5 days to provide results for K = 2 alone. Traditional techniques such as ADMIXTURE are thus too slow for such large biobanks, particularly because a study generally requires multiple additional runs with different parameters and subsets of the data. Neural ADMIXTURE was trained without regularization (λ = 0, Methods) and using the PCK-means initialization (Supplementary Algorithm 1). During inference, the temperature was set to \(\tau =\frac{3}{2}\) (Supplementary section ‘Softmax tempering’). Figure 3 displays these cluster assignments for the UK Biobank genomes. We grouped the individuals by their reported country of birth; those with missing or non-existent country-of-birth labels were excluded from the plots.

Fig. 3: Q fractional genetic cluster estimates across the entire UK Biobank dataset (N = 488,377) obtained using Multi-head Neural ADMIXTURE (K = 6 displayed).
figure 3

Although results are only displayed for K = 6, the multi-head architecture was trained for K = 2 to K = 6 simultaneously in approximately 11 h. In the barplots (used to visualize Q), each vertical bar represents an individual sample and stacked bar color heights represent the proportion of the sample’s ancestry assigned to that colored genetic cluster. Since they result from unsupervised clustering, interpretation of the cluster colors is left open. a, Q estimates of all the samples. Although many samples are clustered together (blue cluster, representing a northern European/British ancestry component), other clusters emerge, reflecting the diverse modern populations now living within the United Kingdom. b, Q estimates of individuals born in the British and Irish Isles and territories. Samples from Gibraltar and the Channel Islands are excluded as they contain a very small number of individuals. c, Q estimates for individuals born outside of the British and Irish Isles, labeled by their country or region of birth, showcasing clusters representing Africans, East Asians, South Asians, Northern Europeans, and West Asians (sharing a cluster in part with Southern Europeans). Despite the large ancestry imbalance, Neural ADMIXTURE characterizes the globally diverse genetic variation found in the UK Biobank. Many UK residents born in other countries appear to have northern European (British) ancestry. These likely represent children born abroad to British parents, who later repatriated. We also note a sizeable South-Asian-like genetic ancestry cluster seen in many individuals born in East Africa. This likely stems from the decolonization-era exodus out of East Africa of South Asians, who had settled there during the British Empire. The predicted cluster assignments for K = 2 to K = 6 for individuals born outside of the British and Irish Isles can be found in Extended Data Fig. 5.

Source data

Scalability analysis

To assess the scalability of different methods, we simulated multiple datasets with various numbers of variants and samples using the software reported previously17. The datasets consist of combinations of N ∈ {1,000, 5,000, 10,000, 20,000, 50,000} and M ∈ {1,000, 10,000, 50,000, 100,000}, where N and M are the number of samples and SNPs, respectively.

We compared the training times of ADMIXTURE, AlStructure, TeraStructure, and Neural ADMIXTURE, on both CPU and GPU, across different dataset sizes (Fig. 4). Neural ADMIXTURE is consistently faster than the alternatives. Moreover, unlike the other methods, Neural ADMIXTURE benefits substantially from GPU acceleration. The hyperparameters used are described in Supplementary Table 3.

Fig. 4: Evolution of execution time when increasing number of samples and number of variants.
figure 4

Neural ADMIXTURE has clearly faster execution times than the other benchmarked methods on both CPU and GPU. AlStructure results are not reported for the 50,000-sample datasets because its execution times are prohibitively slow.

Source data

Discussion

Many unsupervised clustering methods for genotype sequences have been introduced8,20,21,22,23,24,25, including the most commonly used, ADMIXTURE6,7. These methods, which resemble a non-negative matrix factorization, decompose each input sequence into a set of cluster assignments and compute a centroid for each cluster. The cluster assignments give the proportion of each genetic ancestry cluster for an individual, whereas the cluster centroids give the SNP variant frequencies at each genomic position for each cluster. Humans are diploid organisms, so most individuals have a paternal and a maternal copy of each non-sex chromosome. Therefore, for a given individual at each genomic position, there are four possible combinations of biallelic SNPs (0/0, 0/1, 1/0, 1/1). It is common practice to sum the maternal and paternal variants, obtaining a count sequence in which individual i has nij ∈ {0, 1, 2} copies of the minority variant at SNP j. ADMIXTURE models each individual’s count sequence, given a fixed number of population groups K, as nij ~ Bin(2, pij), where pij = ∑k qik fkj, with qik denoting the fraction of population k assigned to individual i, and fkj denoting the frequency of the ‘1’ variant at SNP j in population k. ADMIXTURE applies block relaxation to find the parameters Q and F that minimize the negative log-likelihood function shown in equation (1). The value of K (the number of clusters) is typically chosen with an ad hoc cross-validation procedure7, necessitating runs across a range of values.
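As a concrete sketch, the negative log-likelihood under this binomial model can be written in a few lines of NumPy (constant binomial coefficients are omitted, as in equation (1) of the Methods; the clipping constant is ours, added for numerical stability):

```python
import numpy as np

def admixture_nll(n, Q, F):
    """Negative log-likelihood of the ADMIXTURE model.

    n: (N, M) alternate-allele counts in {0, 1, 2}
    Q: (N, K) fractional assignments, rows summing to 1
    F: (K, M) per-cluster alternate-allele frequencies in [0, 1]
    """
    P = np.clip(Q @ F, 1e-10, 1 - 1e-10)  # p_ij = sum_k q_ik * f_kj
    return -np.sum(n * np.log(P) + (2 - n) * np.log(1 - P))
```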

The block relaxation optimization in ADMIXTURE runs much faster than the approaches used by its main competitors, namely FRAPPE21 and STRUCTURE8. Although ADMIXTURE can be run in multi-threaded mode, greatly reducing execution time, this remains insufficient when dealing with either a large number of samples or a large number of SNPs. Here we instead use neural networks, whose architectures have begun to be explored for several other genetic structure tasks, including haplotype segmentation, dimensionality reduction, and classification26,27,28,29,30,31,32,33,34,35 (Supplementary section ‘Related work’).

An important caveat when using soft-clustering techniques such as Neural ADMIXTURE or ADMIXTURE is that they follow a modeling assumption that there exist some ‘prototype’ populations and that each individual can be placed within the convex hull of those prototypes. Note that this model might not reflect the underlying structure of real-world populations, particularly when independent genetic drift has occurred in each population following admixture events. This limitation is particularly acute in the case of ancient admixture events, and in such cases other complementary techniques should also be used. Future experiments quantifying these effects using simulations would be valuable. Combining unsupervised clustering with tree-based methods to account for this drift would also be a useful direction, and could complement the progress being made in ancestral recombination graphs.

Although the computational times of Neural ADMIXTURE enable practitioners to obtain rapid results with multiple hyperparameters and different values of K, properly selecting the best results still involves a subjective element, and additional experiments and new quantitative measures are needed. Further, unsupervised clustering methods, and more generally dimensionality-reduction techniques, are affected by sampling imbalances between population groups, which can alter population structure detection and prioritization36,37. Additionally, even if structure is not present within the data, these techniques can indicate otherwise38,39.

Methods

Single-head Neural ADMIXTURE

As described in the Discussion, the existing ADMIXTURE algorithm minimizes the negative log-likelihood:

$$\begin{array}{ll}\mathop{\min}\limits_{Q,F} & {\mathcal{L}}_{\mathrm{C}}(Q,F)=-\mathop{\sum}\limits_{i,j}{n}_{ij}\log\left(\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)+(2-{n}_{ij})\log\left(1-\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)\\ \text{subject to} & 0\le {f}_{kj}\le 1\\ & \mathop{\sum}\limits_{k}{q}_{ik}=1\\ & {q}_{ik}\ge 0\end{array}$$
(1)

with Q = (qik) and F = (fkj).

This can be formulated as a non-negative matrix factorization problem. Let X denote the training samples, where the features are the normalized alternate-allele counts per position, so that the jth SNP of the ith individual is represented as \({x}_{ij}=\frac{{n}_{ij}}{2}\in \{0,0.5,1\}\). Then X ≈ QF, where Q contains the cluster assignments, F contains the alternate-allele frequencies per SNP and population, and the negative log-likelihood in equation (1) is a distance between X and QF. This translates into a neural network as an autoencoder with Q = Ψ(X) being the bottleneck computed by the encoder function Ψ and F being the decoder weights themselves (Fig. 1a). Because Q is estimated at every forward pass rather than learnt as a fixed matrix for the training data, Q assignments for previously unseen data can be retrieved with a simple forward pass, whereas ADMIXTURE must re-run the optimization process with F fixed.
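With the illustrative module sketched in the ‘Model overview’ section, inference on unseen data is therefore a single forward pass (x_test is a hypothetical tensor of shape (number of test samples, M) with values in {0, 0.5, 1}):

```python
model.eval()                   # the single-head sketch defined earlier
with torch.no_grad():
    _, q_test = model(x_test)  # fractional assignments for unseen samples
```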

Note that the restrictions in the optimization problem (equation (1)) impose restrictions on the architecture. Those relating to Q (∑k qik = 1 and qik ≥ 0) can be enforced by applying a softmax activation at the encoder output, making the bottleneck equivalent to the cluster assignments. Although the decoder restriction (0 ≤ fkj ≤ 1) could be enforced by applying the sigmoid function to the decoder weights, we found that it suffices to project the weights of the decoder onto the interval [0, 1] after every optimization step, one of the most common forms of projected gradient descent40.
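A minimal sketch of a training loop with this projection step (the optimizer choice, learning rate, loss function, and data loader are illustrative):

```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for x_batch in loader:              # `loader` yields (batch_size, M) tensors
    optimizer.zero_grad()
    x_rec, q = model(x_batch)
    loss = loss_fn(x_rec, x_batch)  # e.g. the loss in equation (2)
    loss.backward()
    optimizer.step()
    with torch.no_grad():           # projection: keep the F weights in [0, 1]
        model.decoder.weight.clamp_(0.0, 1.0)
```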

The decoder must be linear and cannot be followed by a non-linearity, as this would break the interpretability of the F matrix: the equivalence between the decoder weights and the cluster centroids would be lost. The encoder architecture, on the other hand, is free from constraints and may be composed of several layers. The proposed architecture includes a 64-dimensional non-linear layer with a GELU activation before the bottleneck, and batch normalization acting directly on the input. The latter rescales the data to have zero mean and unit variance. Since the mean of each SNP is its frequency p, and its standard deviation σ is \(\sqrt{p(1-p)}\), the {0, 1} input gets encoded as \(\left\{-\sqrt{\frac{p}{1-p}},\sqrt{\frac{1-p}{p}}\right\}\), thereby supplying the allele frequency information more explicitly to the network.
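A quick numerical check of this standardization (p = 0.2 is an arbitrary toy frequency):

```python
import numpy as np

p = 0.2
x = np.random.binomial(1, p, size=100_000).astype(float)
z = (x - x.mean()) / x.std()
print(np.unique(np.round(z, 2)))                    # approx. [-0.5, 2.0]
print(-np.sqrt(p / (1 - p)), np.sqrt((1 - p) / p))  # exactly -0.5 and 2.0
```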

The ADMIXTURE model does not precisely reconstruct the input data as a regular autoencoder would, because the input SNP genotype sequences, nij ∈ {0, 1, 2}, and the reconstructions, pij ∈ [0, 1], do not have matching ranges. This can easily be remedied by dividing the genotype counts by two, so that the input data are \({x}_{ij}=\frac{{n}_{ij}}{2}\in \{0,0.5,1\}\). Moreover, instead of minimizing \({{{{\mathcal{L}}}}}_\mathrm{C}\) (equation (1)), we propose minimizing the binary cross-entropy with a penalty term on the Frobenius norm of the encoder weights, θ:

$${\mathcal{L}}_{\mathrm{N}}(Q,F)=-\mathop{\sum}\limits_{i,j}{x}_{ij}\log\left(\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)+(1-{x}_{ij})\log\left(1-\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)+\lambda{\Vert\theta\Vert}_{F}^{2}.$$
(2)

This regularization term avoids hard assignments in the bottleneck, which helps during the training process and reduces overfitting. Using equations (1) and (2), we show in equation (3) that the proposed optimization problem is equivalent to that of ADMIXTURE (excluding the regularization term):

$$\begin{array}{rl}{\mathcal{L}}_{\mathrm{N}}^{\lambda=0}(Q,F)&=-\mathop{\sum}\limits_{i,j}{x}_{ij}\log\left(\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)+(1-{x}_{ij})\log\left(1-\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)\\ &=-\mathop{\sum}\limits_{i,j}\frac{{n}_{ij}}{2}\log\left(\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)+\left(1-\frac{{n}_{ij}}{2}\right)\log\left(1-\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)\\ &=-\frac{1}{2}\mathop{\sum}\limits_{i,j}{n}_{ij}\log\left(\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)+(2-{n}_{ij})\log\left(1-\mathop{\sum}\limits_{k}{q}_{ik}{f}_{kj}\right)\\ &=\frac{1}{2}{\mathcal{L}}_{\mathrm{C}}(Q,F).\end{array}$$
(3)
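A minimal PyTorch sketch of the loss in equation (2) (the function name and default λ value are ours):

```python
import torch.nn.functional as F_nn

def neural_admixture_loss(x_rec, x, encoder_weights, lam=0.01):
    """Binary cross-entropy plus an L2 penalty on the encoder weights.

    x and x_rec are (N, M) tensors with values in [0, 1];
    encoder_weights is an iterable of encoder weight tensors.
    """
    bce = F_nn.binary_cross_entropy(x_rec, x, reduction='sum')
    penalty = sum(w.pow(2).sum() for w in encoder_weights)  # squared Frobenius norms
    return bce + lam * penalty
```

In the single-head sketch above, encoder_weights would be, for example, [model.hidden.weight, model.assign.weight].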

A perfect reconstruction can of course be obtained by setting the number of clusters (K) equal to the number of training samples or to the dimension of the input (number of SNPs). However, the bottleneck should ideally capture elementary information about the population structure of the given sequences; therefore, we make use of low-dimensional bottlenecks.

Multi-head Neural ADMIXTURE

In ADMIXTURE, cross-validation must be performed to choose the number of population clusters (K), unless specific prior information about the number of population ancestries is known. Furthermore, in many applications, practitioners want to observe how cluster assignments change as the number of clusters increases. As the numbers of sequenced individuals and variants grow, the number of different values of K that can feasibly be evaluated by cross-validation rapidly shrinks because of the additional computational cost. As a solution, Multi-head Neural ADMIXTURE runs all cluster numbers simultaneously by taking advantage of the 64-dimensional latent representation computed by the encoder. This shared representation is jointly learnt for the different values of K, {K1, …, KH}.

Figure 1b shows how the shared representation is split into H different heads in the multi-head architecture. The ith head consists of a non-linear projection to a Ki-dimensional vector, corresponding to an assignment that assumes there are Ki different genetic clusters in the data. Although the head outputs could be concatenated and fed through a single decoder, this would render the decoder weights F uninterpretable. Therefore, every head has its own decoder and, thus, H different reconstructions of the input are retrieved.

With H reconstructions, we obtain H different loss values and train the architecture by minimizing equation (4):

$${\mathcal{L}}_{\mathrm{MNA}}({Q}_{{K}_{1,\ldots,H}},{F}_{{K}_{1,\ldots,H}})=\mathop{\sum}\limits_{h=1}^{H}{\mathcal{L}}_{\mathrm{N}}\left({Q}_{{K}_{h}},{F}_{{K}_{h}}\right),$$
(4)

where \({Q}_{{K}_{h}}\) and \({F}_{{K}_{h}}\) are, respectively, the cluster assignments and the SNP frequencies per population for the hth head. The restrictions of the ADMIXTURE optimization problem (equation (1)) must be satisfied by \({Q}_{{K}_{h}}\) and \({F}_{{K}_{h}}\) for all \(h\in \{1,\ldots ,H\}\).

The multi-head architecture allows H different cluster assignments, corresponding to H different values of K, to be computed efficiently in a single forward pass. Results can then be quantitatively and qualitatively analyzed by the practitioner to decide which value of K is the most suitable for the data.
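A minimal sketch of the multi-head variant (ours, following Fig. 1b; the set of K values is illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadSketch(nn.Module):
    """One assignment head and one decoder per value of K, all sharing
    the 64-dimensional representation computed by the encoder."""
    def __init__(self, num_snps: int, ks=(2, 3, 4, 5, 6)):
        super().__init__()
        self.batch_norm = nn.BatchNorm1d(num_snps)
        self.hidden = nn.Linear(num_snps, 64)
        self.heads = nn.ModuleList(nn.Linear(64, k) for k in ks)
        self.decoders = nn.ModuleList(nn.Linear(k, num_snps, bias=False) for k in ks)

    def forward(self, x):
        h = torch.nn.functional.gelu(self.hidden(self.batch_norm(x)))
        qs = [torch.softmax(head(h), dim=1) for head in self.heads]
        recs = [dec(q) for dec, q in zip(self.decoders, qs)]
        return recs, qs  # the loss of equation (4) sums the per-head losses
```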

Evaluation setup

Let N denote the number of samples and M the number of variants (SNPs). To assess the performance of the Q estimates, we match the estimated clusters to the known labels and report the RMSE between them,

$${{{\rm{RMSE}}}}\left(Q,{Q}_\mathrm{GT}\right)=\frac{1}{\sqrt{NK}}{\left\Vert Q-{Q}_\mathrm{GT}\right\Vert}_{F}$$
(5)

and the RMSE between the known allele frequencies (FGT) and the estimated frequencies (F),

$${\mathrm{RMSE}}(F,{F}_{\mathrm{GT}})=\frac{1}{\sqrt{KM}}{\left\Vert F-{F}_{\mathrm{GT}}\right\Vert}_{F}.$$
(6)

We also use a new metric, Δ, defined as

$${{\Delta }}(Q,{Q}_\mathrm{GT})=\frac{1}{{N}^{2}}{\left\Vert Q{Q}^\mathrm{T}-{Q}_\mathrm{GT}{Q}_\mathrm{GT}^\mathrm{T}\right\Vert}_{F}^{2},$$
(7)

which is equivalent to the mean squared difference between the covariance matrices of the estimated and the target populations. If the Q estimates completely agree with QGT (up to permutation), Δ will be zero; the larger the disagreement, the higher the value of Δ. We are interested in these metrics as they are more easily interpreted than the loss function value itself. We are aware that these pseudo-supervised metrics, when applied to datasets simulated from real individuals, do not yield the true quality of the models’ predictions, since the biogeographic labels assigned to the real sequences used to simulate datasets might not reflect the true genomic clusters and variation within the populations. To further investigate this issue, we also used fully simulated population clusters to evaluate the methods.
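Minimal NumPy implementations of these metrics (for the RMSE we assume the columns of Q have already been matched to those of QGT, for example via a Hungarian assignment; Δ needs no matching because QQ^T is invariant to column permutations):

```python
import numpy as np

def rmse_q(Q, Q_gt):
    # Equation (5); columns of Q must already be aligned with Q_gt.
    n, k = Q.shape
    return np.linalg.norm(Q - Q_gt, 'fro') / np.sqrt(n * k)

def delta(Q, Q_gt):
    # Equation (7); permutation-invariant via the covariance-like QQ^T.
    n = Q.shape[0]
    return np.linalg.norm(Q @ Q.T - Q_gt @ Q_gt.T, 'fro') ** 2 / n ** 2
```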

Dataset preparation

For reproducibility, we used a comprehensive set of publicly available, labeled human whole-genome sequences from diverse populations across the world, combining the 1000 Genomes Project41, the Simons Genome Diversity Project42, and the Human Genome Diversity Project43, as well as data simulated from these samples using PyAdmix14 and data simulated de novo using the Balding–Nichols Pritchard–Stephens–Donnelly model8,23. The populations within the combined real datasets are listed in Supplementary Table 2. Each subpopulation is aggregated into a continental-level label according to its geographical location (Supplementary section ‘Dataset description’). Additionally, we used the entire UK Biobank genotype dataset.

Benchmarking setup

We compared the computational time and clustering quality of Neural ADMIXTURE with those of ADMIXTURE, fastSTRUCTURE24, AlStructure22, and TeraStructure23. fastSTRUCTURE assumes the STRUCTURE model but uses accelerated variational methods instead of MCMC, yielding speedups of more than two orders of magnitude over STRUCTURE. TeraStructure iteratively computes Q and F while avoiding a high computational load by subsampling SNPs at every iteration, making the algorithm faster. AlStructure first estimates a low-dimensional linear subspace of the admixture components and then searches that subspace for a model that satisfies the modeling constraints, providing a fast alternative to the iterative or maximum-likelihood schemes followed by most algorithms. Furthermore, we also compared against HaploNet26, a variational autoencoder that maps parts of the sequence (windows) to a low-dimensional latent space, on which clustering is then performed using Gaussian mixture priors. Although the global structure of the data is preserved in the low-dimensional space, the direct interpretability of the allele frequencies (available in Neural ADMIXTURE) is lost.

All models were optimized using 16 threads on a machine with an AMD EPYC 7742 (x86_64) processor (64 cores) and 512 GB of RAM. Although more cores were available, we restricted each run to 16 threads so that several executions could be run in parallel. To assess the GPU performance of Neural ADMIXTURE, all networks were trained on an NVIDIA Tesla V100 SXM2 GPU with 32 GB of memory. The same GPUs were used to run inference on the trained models.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.