Hybrid autoencoder with orthogonal latent space for robust population structure inference

Analysis of population structure and genomic ancestry remains an important topic in human genetics and bioinformatics. Commonly used methods require high-quality genotype data to ensure accurate inference; in practice, however, laboratory artifacts and outliers are often present, and existing methods are typically affected by the presence of related individuals in the dataset. In this work, we propose a novel hybrid method, called SAE-IBS, which combines the strengths of traditional matrix decomposition-based solutions (e.g., principal component analysis) and more recent neural network-based ones (e.g., autoencoders): it yields an orthogonal latent space, which enhances dimensionality selection, while learning non-linear transformations. The proposed approach achieves higher accuracy than existing methods when projecting poor-quality target samples (with genotyping errors and missing data) onto a reference ancestry space, and it generates a robust ancestry space in the presence of relatedness. We introduce this new approach and an accompanying open-source program for robust ancestry inference in the presence of missing data, genotyping errors, and relatedness. The obtained ancestry space allows for non-linear projections and exhibits orthogonality, with clearly separable population groups.

Principal Component Analysis
The genotype matrix $X \in \mathbb{R}^{n_r \times m}$, holding $n_r$ reference individuals and $m$ SNPs, is first normalized per SNP as $x_{ij}^{\text{norm}} = (x_{ij} - 2f_j)/\sqrt{2f_j(1 - f_j)}$, where $x_{ij}$ and $x_{ij}^{\text{norm}}$ are the unnormalized genotype and normalized genotype at SNP $j$ for individual $i$, respectively, and $f_j$ is the sample allele frequency of SNP $j$. SVD then takes the normalized genotype matrix $X^{\text{norm}} \in \mathbb{R}^{n_r \times m}$ as input and decomposes it into a product of three matrices, $X^{\text{norm}} = U \Sigma V^{T}$, where $\Sigma \in \mathbb{R}^{n_r \times m}$ is a diagonal matrix containing the singular values, and the orthogonal matrices $U \in \mathbb{R}^{n_r \times n_r}$ and $V \in \mathbb{R}^{m \times m}$ contain the left and right singular vectors, respectively. The dimension of the input data is then reduced by projecting it onto a space spanned by the top singular vectors. Let $U_k \in \mathbb{R}^{n_r \times k}$ and $\Sigma_k \in \mathbb{R}^{k \times k}$ denote the left singular vectors and the singular values of the first $k$ principal components; the input data in its lower-dimensional representation is then given by $U_k \Sigma_k$, and the corresponding loading matrix is denoted by $V_k \in \mathbb{R}^{m \times k}$. The projected scores of unseen data can be obtained by multiplying the normalized genotype matrix with $V_k$.
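As an illustration, the following NumPy sketch implements this normalization, decomposition, and projection. Function and variable names are ours, genotypes are assumed to be coded 0/1/2 at polymorphic SNPs, and this is not the original implementation:

```python
import numpy as np

def pca_ancestry_space(X_ref, k):
    """Build a PCA ancestry space from a reference genotype matrix (n_r x m)."""
    f = X_ref.mean(axis=0) / 2.0                  # sample allele frequency per SNP

    def normalize(X):                             # (x_ij - 2 f_j) / sqrt(2 f_j (1 - f_j))
        return (X - 2 * f) / np.sqrt(2 * f * (1 - f))

    U, s, Vt = np.linalg.svd(normalize(X_ref), full_matrices=False)
    scores = U[:, :k] * s[:k]                     # U_k Sigma_k: reference scores
    V_k = Vt[:k].T                                # m x k loading matrix

    def project(X_new):                           # projected scores of unseen samples
        return normalize(X_new) @ V_k

    return scores, project
```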

Unnormalized Principal Component Analysis
UPCA works similarly to PCA, except that SVD takes the unnormalized genotype matrix as input. Interpopulation variation is captured from the second PC onwards, while the first PC represents an average SNP pattern, as is common for PCA on non-centered data. Therefore, the first PC in UPCA can be omitted.

Spectral Decomposition Generalized by Identity-by-State Matrix
SUGIBS was previously proposed 14 as an alternative that is robust against laboratory artifacts and outliers; it applies SVD to the identity-by-state (IBS) generalized genotype matrix, where the IBS information corrects for potential artifacts due to errors and missingness.
Let $I \in \mathbb{R}^{n_r \times n_r}$ denote the pairwise IBS similarity matrix of the unnormalized genotype matrix $X$, calculated following the rules in Table S1. The similarity degree of an individual is defined as $d_{ii} = \sum_{j=1}^{n_r} I_{ij}$, i.e., the total IBS similarity between individual $i$ and any other individual in the reference dataset. The similarity degree matrix is a diagonal matrix defined as $D = \operatorname{diag}\{d_{11}, \ldots, d_{n_r n_r}\}$. SUGIBS works similarly to PCA, except that the IBS-generalized genotype matrix $D^{-1}X$ is used as input for performing SVD, i.e., $D^{-1}X = U \Sigma V^{T}$. Like UPCA, the first component of SUGIBS aggregates the average SNP pattern and can therefore be omitted. For the projection of unseen samples, we use the second component onwards, $V_k^{(2)} = \{v_2, \ldots, v_{k+1}\}$, where $v_i$ is the $i$-th right singular vector.
Given an unseen dataset with $n_u$ individuals and the same set of SNPs as the reference dataset, let $X^{u} \in \mathbb{R}^{n_u \times m}$ denote its unnormalized genotype matrix. The reference similarity degree is defined as $\hat{d}_{ii} = \sum_{j=1}^{n_r} \tilde{I}_{ij}$, where $\tilde{I}_{ij}$ is the IBS similarity between the $i$-th individual in the unseen dataset and the $j$-th individual in the reference dataset. The reference similarity degree matrix is defined as $\hat{D} = \operatorname{diag}\{\hat{d}_{11}, \ldots, \hat{d}_{n_u n_u}\}$. The unseen dataset can then be projected onto the reference space as $\hat{D}^{-1} X^{u} V_k^{(2)}$.
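A schematic NumPy sketch of the SUGIBS decomposition and projection described above; the IBS similarity computation (Table S1) is abstracted into precomputed matrices, and all names are illustrative rather than taken from the released program:

```python
import numpy as np

def sugibs_space(X_ref, ibs_ref, k):
    """SVD of the IBS-generalized genotype matrix D^{-1} X."""
    d = ibs_ref.sum(axis=1)                   # similarity degrees d_ii
    U, s, Vt = np.linalg.svd(X_ref / d[:, None], full_matrices=False)
    return Vt[1:k + 1].T                      # V_k^(2): right singular vectors v_2..v_{k+1}

def sugibs_project(X_new, ibs_new_vs_ref, V_k2):
    """Project unseen samples via D_hat^{-1} X^u V_k^(2)."""
    d_hat = ibs_new_vs_ref.sum(axis=1)        # reference similarity degrees
    return (X_new / d_hat[:, None]) @ V_k2
```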

Autoencoder
An autoencoder consists of two parts: an encoder network and a decoder network. The encoder maps the input data to a latent representation $z = f(Wx + b)$; the decoder maps the latent representation back to a reconstruction $\hat{x} = g(W'z + b')$, where $f(\cdot)$ and $g(\cdot)$ are nonlinear functions, $W$ and $W'$ are the weight matrices, $b$ and $b'$ are the bias vectors, and $x$, $\hat{x}$, and $z$ are the input data, the reconstructed data, and the latent representation, respectively. The network is then trained to minimize the reconstruction error. The objective function takes the form $\mathcal{L}_{\text{AE}} = \sum_{i=1}^{n} L\big(x_i, g(f(x_i))\big)$, where $L$ is the reconstruction error.
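A minimal PyTorch sketch of such an autoencoder; the two-layer depth and hidden size of 512 are illustrative, not the tuned configuration of Tables S2-S3:

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_snps, latent_dim=2):
        super().__init__()
        # encoder: z = f(Wx + b)
        self.encoder = nn.Sequential(
            nn.Linear(n_snps, 512), nn.LeakyReLU(), nn.Linear(512, latent_dim))
        # decoder: x_hat = g(W'z + b')
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.LeakyReLU(), nn.Linear(512, n_snps), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z             # reconstruction and latent representation
```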

Regularized Autoencoder
To reduce overfitting of the model and improve its performance, regularization-based methods are often used. One widely used regularization is weight decay 43, which favors small weights by optimizing the following regularized objective function: $\mathcal{L}_{\text{AE-reg}} = \sum_{i=1}^{n} L\big(x_i, g(f(x_i))\big) + \lambda \sum_{w} w^2$, where the hyperparameter $\lambda$ controls the strength of the regularization. This penalizes large entries in the weight matrices and thus reduces redundancy in the model.
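In PyTorch, this L2 penalty is available through the optimizer's `weight_decay` argument; a sketch reusing the `Autoencoder` above, with an illustrative value of $\lambda$:

```python
import torch

model = Autoencoder(n_snps=1000)              # Autoencoder as sketched above
# weight_decay applies an L2 penalty on the weights (lambda = 1e-4 is illustrative)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
```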

Denoising Autoencoder
In a denoising autoencoder 25,26, the input is partially corrupted before being sent through the network, and the network is trained to predict the original, uncorrupted data from the encoding and decoding of the corrupted input. This yields the following objective function: $\mathcal{L}_{\text{DAE}} = \sum_{i=1}^{n} L\big(x_i, g(f(\tilde{x}_i))\big)$, where the corrupted version $\tilde{x}$ of the original input $x$ is obtained through a stochastic corruption process $q(\tilde{x} \mid x)$.
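One simple corruption process $q(\tilde{x} \mid x)$ for genotype data is random masking of entries; the masking rate below is an assumption for illustration:

```python
import torch

def corrupt(x, rate=0.1):
    """q(x_tilde | x): randomly zero out a fraction of the genotype entries."""
    mask = (torch.rand_like(x) > rate).float()
    return x * mask

# training step: reconstruct the clean x from the corrupted input, e.g.
# x_hat, _ = model(corrupt(x)); loss = torch.nn.functional.mse_loss(x_hat, x)
```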

Denoising Autoencoder with Modified Loss
An additional term favoring a robust mapping at the bottleneck/latent space is included in the original objective function of DAE, yielding the following loss function: $\mathcal{L}_{\text{DAE-L}} = \sum_{i=1}^{n} \big[ L\big(x_i, g(f(\tilde{x}_i))\big) + \beta\, L\big(f(x_i), f(\tilde{x}_i)\big) \big]$, where the hyperparameter $\beta$ controls the emphasis on noise-free projections. The objective is now to learn latent representations that are not only robust for reconstruction but, at the same time, robust for projection.
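Under this reading of the loss, a training step could be sketched as follows; using MSE as the reconstruction error $L$ and the value of $\beta$ are assumptions, and `model` is the `Autoencoder` sketched above:

```python
import torch.nn.functional as F

def dae_l_loss(model, x, x_tilde, beta=0.1):
    """L(x, g(f(x_tilde))) + beta * L(f(x), f(x_tilde))."""
    x_hat, z_noisy = model(x_tilde)           # encode and decode the corrupted input
    z_clean = model.encoder(x)                # latent representation of the clean input
    return F.mse_loss(x_hat, x) + beta * F.mse_loss(z_noisy, z_clean)
```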

Implementation Details
The encoder and decoder networks are fully connected feed-forward networks with Leaky ReLU [48] activation functions connecting each layer, except for the last layer of the decoder, where a sigmoid activation is used to ensure the output values are bounded in [0, 1]. We used the Adam optimizer [49] with an initial learning rate of 0.001. To allow the optimizer to take smaller steps as training approaches convergence, we applied a learning rate scheduler that multiplies the learning rate by a factor of 0.9999 after every epoch. To fit in the available GPU memory (11,019 MiB), we trained the networks in mini-batches of 256 samples. The models were implemented and trained on an NVIDIA GeForce RTX 2080 Ti using PyTorch 1.7.
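The described setup corresponds to the following PyTorch configuration; the data loader is mocked with random tensors, the epoch budget is shortened, and `Autoencoder` is the sketch from the Autoencoder section:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = Autoencoder(n_snps=1000)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# multiply the learning rate by 0.9999 after every epoch
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9999)
loader = DataLoader(TensorDataset(torch.rand(2048, 1000)), batch_size=256, shuffle=True)

for epoch in range(10):                       # shortened for illustration
    for (x,) in loader:                       # mini-batches of 256 samples
        optimizer.zero_grad()
        x_hat, _ = model(x)
        loss = torch.nn.functional.mse_loss(x_hat, x)
        loss.backward()
        optimizer.step()
    scheduler.step()
```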
To implement the early stopping mechanism, we track whether the validation loss keeps improving; a decrease of the validation loss between two epochs of less than 0.1 counts as no improvement. The early-stopping patience was set to 300 epochs, and the maximum number of epochs was 3,000 when training AE and SAE-IBS. For the denoising extensions, we generated a different simulated noisy dataset every 25 epochs and fed it to the model; we therefore relaxed the maximum number of epochs to 5,000 when training DAE, DAE-L, D-SAE-IBS, and D-SAE-IBS-L. To speed up the learning of SAE-IBS (and its denoising extensions) and to provide a well-initialized embedding from the encoder on which to apply SVD, we first pre-trained an AE for up to 1,000 epochs and continued training SAE-IBS afterward.
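A minimal sketch of this early-stopping rule, assuming improvement is measured against the best validation loss seen so far (the text does not specify the exact reference point):

```python
def make_early_stopper(threshold=0.1, patience=300):
    """Stop when validation loss has not dropped by `threshold` for `patience` epochs."""
    best, wait = float("inf"), 0

    def should_stop(val_loss):
        nonlocal best, wait
        if best - val_loss > threshold:       # counts as an improvement
            best, wait = val_loss, 0
        else:
            wait += 1
        return wait >= patience

    return should_stop
```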
Following the suggestions of [50], we experimented with several parameter configurations in two steps: the first involves the number of layers and the number of hidden units; the second investigates the emphasis on the projection loss β. Unless explicitly stated otherwise, the default values of PyTorch 1.7 [51] were used for all other hyperparameters (amsgrad: False, betas: (0.9, 0.999), eps: 1e-08).
First, the final hyperparameter configuration of the AE model with a latent space dimension of 2 was determined. As shown in Table S2, the configuration in bold was selected as the final setting for the robust projection experiments because it resulted in the smallest validation loss and NRMSD for the simulated missingness experiments, and a relatively small NRMSD for the simulated erroneousness experiments. The same procedure was conducted for the other tasks, and their final settings are listed in Table S3. To ensure a fair comparison, the same settings were then used when training AE with higher latent space dimensions, the denoising variants of AE, and the hybrid models. Furthermore, for the robust projection experiments using DAE-L, we fine-tuned the hyperparameter β defining the emphasis on the projection loss based on NRMSD (Tables S4 and S5). Similarly, this parameter was tuned for D-SAE-IBS-L, and the final settings are displayed in Tables S6 and S7.