Robust genome-wide ancestry inference for heterogeneous datasets: illustrated using the 1,000 genome project with 3D facial images

Estimates of individual-level genomic ancestry are routinely used in human genetics and related fields. The analysis of population structure and genomic ancestry can yield insights into modern and ancient populations, allowing us to address questions regarding admixture and the numbers and identities of the parental source populations. Unrecognized population structure is also an important confounder to correct for in genome-wide association studies. However, it remains challenging to work with heterogeneous datasets from multiple studies collected by different laboratories with diverse genotyping and imputation protocols. This work presents a new approach and an accompanying open-source toolbox that facilitate a robust integrative analysis of population structure and genomic ancestry estimates for heterogeneous datasets. We show robustness against individual outliers and different protocols for the projection of new samples into a reference ancestry space, and the ability to reveal and adjust for population structure in a simulated case–control admixed population. Given that visually evident and easily recognizable patterns of human facial characteristics co-vary with genomic ancestry, and based on the integration of three different sources of genome data, we generate average 3D faces to illustrate genomic ancestry variations within the 1,000 Genome project and for eight ancient-DNA profiles.


Supplementary

Supplementary Note S1: Determination of the number of relevant SUGIBS components
A key question for any lower-dimensional embedding of data into a latent space is how many latent components are relevant. In previous work 11 , we used PCA to obtain lower-dimensional facial shape representations in combination with a technique referred to as Parallel Analysis 12,13 . A Parallel Analysis determines the number of eigenvalues (and thus the number of principal components (PCs)) from the observed data that are significantly different from eigenvalues computed from permuted versions of the original data. By running multiple permutations, a null distribution of noisy eigenvalues is obtained, against which the significance of the original eigenvalues can be tested (whilst taking the properties of the data itself into account). Similar to a Parallel Analysis in PCA 12 , our preliminary suggestion for selecting the number of components for SUGIBS is to compare the spectrum of eigenvalues from an observed, potentially heterogeneous dataset (HED) with that of simulated homogeneous datasets (HOD). The simulated datasets use the same number of SNPs and samples as the observed dataset.
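The permutation logic of a Parallel Analysis for PCA, as described above, can be sketched as follows. This is a minimal illustration and not the implementation used in the cited work: the function name `parallel_analysis`, the column standardization, the per-column permutation scheme and the 95% quantile threshold are all assumptions made for the example.

```python
import numpy as np


def parallel_analysis(data, n_perm=100, alpha=0.05, seed=0):
    """Count leading PCs whose eigenvalue exceeds a permutation null.

    Hypothetical helper: `data` is (samples x variables) with no constant
    columns. Returns (n_keep, observed eigenvalues, null thresholds).
    """
    rng = np.random.default_rng(seed)
    # Assumption: standardize columns so eigenvalues are on the correlation scale.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    dof = z.shape[0] - 1
    observed = np.linalg.svd(z, compute_uv=False) ** 2 / dof

    null = np.empty((n_perm, observed.size))
    for p in range(n_perm):
        # Permuting each column independently removes the correlation between
        # variables while preserving each variable's marginal distribution.
        permuted = np.column_stack([rng.permutation(col) for col in z.T])
        null[p] = np.linalg.svd(permuted, compute_uv=False) ** 2 / dof

    # Rank-wise threshold: the (1 - alpha) quantile of the null eigenvalues.
    threshold = np.quantile(null, 1.0 - alpha, axis=0)
    significant = observed > threshold
    # Keep the leading run of significant components only.
    n_keep = observed.size if significant.all() else int(np.argmin(significant))
    return n_keep, observed, threshold
```

Calling `parallel_analysis(data)` on a samples-by-variables array returns the number of retained components together with the observed eigenvalues and the null thresholds, so the two spectra can also be plotted against each other.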
For the HODs, we generate the genotypes of each SNP independently according to the allele frequency calculated from the observed data. This implies that each SNP is in HWE but is not in LD with any other SNP. For each simulated HOD and the HED, we calculate the eigenvalues of $D^{-1/2} G D^{-1/2}$, where the unnormalized genomic relationship matrix is defined as $G = XX^{T}$ (with $X$ the genotype matrix) and $D$ is a similarity degree matrix defined by the IBS similarity. Comparing the eigenvalues of the HED with the eigenvalues from the simulated HODs indicates whether the observed dataset deviates from a single homogeneous population. However, we observed that the LD between the SNPs in a sample does affect the slope of the eigenvalue spectrum. To illustrate this, we simulated three datasets, each with 10,000 SNPs and 1,000 samples, assuming homogeneity but with different levels of LD between SNPs (no LD, $r^2 \le 0.2$ and $r^2 \le 0.8$). The results in Figure S3 show that the higher the LD level, the steeper the eigenvalue spectrum becomes. In other words, the first eigenvalues explain more of the total variance due to correlation in the data, which is expected given the increased levels of LD.
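The HOD simulation and the eigenvalue computation described above can be sketched as follows. This is an illustration under stated assumptions rather than the toolbox code: genotypes are assumed to be coded 0/1/2 without missing values, the IBS similarity between two individuals is assumed to be the sum over SNPs of 2 minus the absolute genotype difference, the degree matrix is assumed to hold the row sums of that similarity matrix (self-similarity included), and all function names are hypothetical.

```python
import numpy as np


def simulate_hod(allele_freq, n_samples, rng):
    """Draw 0/1/2 genotypes independently per SNP (HWE, no LD)."""
    return rng.binomial(2, allele_freq, size=(n_samples, allele_freq.size))


def ibs_similarity(X):
    """Assumed IBS similarity: S[i, j] = sum over SNPs of 2 - |x_i - x_j|."""
    n_snps = X.shape[1]
    ge1 = (X >= 1).astype(float)
    ge2 = (X >= 2).astype(float)
    # For genotypes in {0, 1, 2}: min(a, b) = [a>=1][b>=1] + [a>=2][b>=2].
    min_sum = ge1 @ ge1.T + ge2 @ ge2.T
    row_sum = X.sum(axis=1).astype(float)
    # |a - b| = a + b - 2 * min(a, b), summed over SNPs.
    abs_diff = row_sum[:, None] + row_sum[None, :] - 2.0 * min_sum
    return 2.0 * n_snps - abs_diff


def sugibs_spectrum(X, n_eig=30):
    """Leading eigenvalues of D^{-1/2} X X^T D^{-1/2}, in descending order."""
    X = np.asarray(X, dtype=float)
    degree = ibs_similarity(X).sum(axis=1)       # diagonal of D (assumed)
    scaled = X / np.sqrt(degree)[:, None]        # D^{-1/2} X
    # The eigenvalues equal the squared singular values of D^{-1/2} X.
    singular_values = np.linalg.svd(scaled, compute_uv=False)
    return (singular_values ** 2)[:n_eig]


def simulated_hod_spectra(X_obs, n_sim=100, n_eig=30, seed=0):
    """Spectra of HODs simulated from the observed allele frequencies."""
    rng = np.random.default_rng(seed)
    allele_freq = np.asarray(X_obs).mean(axis=0) / 2.0
    n_samples = X_obs.shape[0]
    return np.array([
        sugibs_spectrum(simulate_hod(allele_freq, n_samples, rng), n_eig)
        for _ in range(n_sim)
    ])
```

Because $X$ is not centered in this sketch, the leading eigenvalue mostly reflects the mean genotype; whether it should be excluded before comparing spectra is left open here.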

Figure S3: Eigenvalue spectrum in descending order as a function of LD level. The y-axis shows the eigenvalues.
To adjust for the different slopes of the eigenvalue spectrum caused by different levels of LD, we fit a robust regression (robustfit in MATLAB) between the observed eigenvalue spectrum and the simulated eigenvalue spectrum. A robust regression was chosen because it is not influenced by the first few large eigenvalues, which are expected for highly heterogeneous population samples. In practice, we run the simulation 100 times and robustly fit the observed eigenvalue spectrum to the median of the 100 simulated eigenvalue spectra. Subsequently, we plot the observed eigenvalue spectrum against the adjusted simulated eigenvalue spectra.
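The robust adjustment step can be sketched in Python as follows. This is a simplified stand-in for MATLAB's robustfit, not a reproduction of it: it uses iteratively reweighted least squares with Tukey bisquare weights and a MAD-based scale, but omits the leverage correction. The regression direction (observed spectrum as response, median simulated spectrum as predictor) and the function names are assumptions.

```python
import numpy as np


def bisquare_fit(x, y, tune=4.685, n_iter=50, tol=1e-10):
    """Robust fit of y ~ a + b * x by IRLS with Tukey bisquare weights."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    # Start from the ordinary least-squares solution.
    coef = np.linalg.lstsq(design, y, rcond=None)[0]
    for _ in range(n_iter):
        resid = y - design @ coef
        # Robust residual scale from the median absolute deviation.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        if scale < 1e-12:
            break
        u = resid / (tune * scale)
        # Points with |u| >= 1 (e.g. the first large eigenvalues) get weight 0.
        weights = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
        root_w = np.sqrt(weights)
        new_coef = np.linalg.lstsq(design * root_w[:, None], y * root_w,
                                   rcond=None)[0]
        converged = np.max(np.abs(new_coef - coef)) < tol
        coef = new_coef
        if converged:
            break
    return coef


def adjust_simulated(observed, simulated):
    """Map simulated spectra onto the scale of the observed spectrum.

    `observed` has shape (n_eig,); `simulated` has shape (n_sim, n_eig).
    """
    median_sim = np.median(simulated, axis=0)
    intercept, slope = bisquare_fit(median_sim, observed)
    return intercept + slope * np.asarray(simulated, dtype=float)
```

The adjusted spectra returned by `adjust_simulated` can then be plotted together with the observed spectrum; components that stand clearly above the adjusted simulated curve are the candidates for relevant components.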
Results for simulated heterogeneous population samples with admixture from three, six and nine different ancestries at different levels of Fst (0.1, 0.01 and 0.001) are shown in Figure S4. The simulated HOD eigenvalue spectrum is consistently lower than the observed HED eigenvalue spectrum for all 30 eigenvalues plotted. Therefore, in contrast to Parallel Analysis, the simulated HOD eigenvalue spectrum cannot be used as a direct indicator of the number of significant components, since all the observed eigenvalues remain larger than (and thus significant compared to) the simulated ones. However, an indication of the number of relevant (rather than significant) components that represent admixture can still be observed. For larger values of Fst (0.1, 0.01), the correct number of relevant components (2 for three ancestries, 5 for six ancestries and 8 for nine ancestries) is visually distinct in magnitude from the simulated HOD eigenvalue spectrum, and this distinction is larger than for the subsequent (non-relevant) components. For the lowest value of Fst (0.001) and admixture from more than three ancestries, this visual distinction is lost.

Ancestry facial predictions are valuable in a range of applications. In archeology, ancestry faces reconstructed from ancient DNA profiles, as done in this work, are of strong interest. Ancient DNA profiles generally contain abundant missing data, which makes SUGIBS a valuable technique to use. Note that the ancestry faces are limited to modern facial constructs because contemporary facial data were used. However, they can help to bring ancient DNA profiles into the context of present-day populations for which facial images (e.g. open-source facial databases, Google images, etc.) are available but DNA is not. Furthermore, there is a good relationship between the face and the skull 14,15 , such that ancestry faces can be compared against skeletal remains.
In the future, it is of interest to deploy our work on datasets of 3D skeletal craniofacial surfaces extracted from Computed Tomography (CT) or Magnetic Resonance Imaging (MRI). In medicine, and more particularly in oral and maxillofacial surgery, the surgical reconstruction of a patient's face benefits from a proper notion of normal facial shape 16 . In the next 5 to 20 years, whole genome sequencing will become the standard of care in clinics, and a patient-specific ancestry face would then provide a personalized norm of facial shape, supporting precision medicine in surgical planning.
In forensics, an ancestry facial prediction circumvents the often legally debated reporting of ancestry proportions of a probe DNA profile in a criminal investigation. In France, for example, DNA phenotyping of externally visible traits is legally allowed, since such traits are considered to be public. In contrast, genomic ancestry proportions, as typically reported in forensic DNA testing, are considered to be private information and cannot be used during criminal investigations. We agree that ancestry proportions are not an externally visible characteristic of an individual. The construction of ancestry proportions is also inherently flawed because it labels the individual in terms of so-called parental populations. Furthermore, such numeric information is hard for a forensic investigator to interpret and use. The reconstruction of an ancestry face, on the other hand, avoids the need to explicitly label a DNA profile in terms of parental populations and provides visual feedback to an investigator that is perceptually useful, even in admixed cases. However, a strong limitation is that the ancestry projection and face creation are only as good as the data used to create them. If the background face data do not match the ancestry of the test data, the estimated face will remain poor. The challenge in forensics also involves reconstructing ancestry faces from DNA material that is often limited and contaminated. Another strong limitation is ethical: concerns about the misuse of DNA and facial recognition technology in general, beyond the positive implications of solving crime, warrant caution 17 .