Highlighting nonlinear patterns in population genetics datasets

Detecting structure in population genetics and case-control studies is important, as it exposes phenomena such as ecoclines, admixture and stratification. Principal Component Analysis (PCA) is a linear dimension-reduction technique commonly used for this purpose, but it struggles to reveal complex, nonlinear data patterns. In this paper we introduce non-centred Minimum Curvilinear Embedding (ncMCE), a nonlinear method to overcome this problem. Our analyses show that ncMCE can separate individuals into ethnic groups in cases in which PCA fails to reveal any clear structure. This increased discrimination power arises from ncMCE's ability to better capture the phylogenetic signal in the samples, whereas PCA better reflects their geographic relation. We also demonstrate how ncMCE can discover interesting patterns, even when the data has been poorly pre-processed. The juxtaposition of PCA and ncMCE visualisations provides a new standard of analysis with utility for discovering and validating significant linear/nonlinear complementary patterns in genetic data.

Nonlinear dimensionality reduction techniques.
The genotype matrix, with individuals as rows and genetic variants (SNPs) as columns, can be seen as a cloud of points (in this case individuals) lying on or near a low-dimensional manifold embedded in a high-dimensional feature space (in this case the space of SNPs). To represent the topological properties of such a manifold in low dimensions, several nonlinear dimensionality reduction techniques have been proposed. Most of them first construct a proximity graph by connecting each point in the dataset with its nearest neighbours, and then project the points to a space of reduced dimensions by exploiting the structural properties of the graph.
In this paper, we used two different nonlinear dimensionality reduction approaches to compare their results with ncMCE's: Isomap 1 and Laplacian Eigenmaps 2 . The former constructs a distance matrix by measuring shortest-paths between points over the proximity graph and projects the points to low dimensions by multidimensional scaling of this distance matrix. The latter extracts the Laplacian from the proximity graph and recovers the low dimensional coordinates of the points by solving a generalised eigenvalue problem.
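The two baseline techniques described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names, the unweighted adjacency used for the Laplacian and the symmetrisation rule of the proximity graph are assumptions made for illustration, with `k` the number of nearest neighbours and `d` the embedding dimension.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform


def knn_graph(X, k):
    """Symmetric k-nearest-neighbour graph as a dense matrix (0 = no edge)."""
    D = squareform(pdist(X))              # pairwise Euclidean distances
    G = np.zeros_like(D)
    for i in range(D.shape[0]):
        # Position 0 of the argsort is the point itself (distance 0).
        nbrs = np.argsort(D[i])[1:k + 1]
        G[i, nbrs] = D[i, nbrs]
    return np.maximum(G, G.T)             # keep an edge if either end chose it


def isomap_sketch(X, k, d=2):
    """Isomap-style embedding: graph shortest paths, then classical MDS."""
    geo = shortest_path(knn_graph(X, k), method="D", directed=False)
    if np.isinf(geo).any():
        raise ValueError("proximity graph is disconnected; increase k")
    n = geo.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    B = -0.5 * J @ (geo ** 2) @ J         # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:d]      # indices of the d largest eigenvalues
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))


def laplacian_eigenmaps_sketch(X, k, d=2):
    """Laplacian-Eigenmaps-style embedding from an unweighted k-NN graph."""
    W = (knn_graph(X, k) > 0).astype(float)   # assumption: binary edge weights
    Dg = np.diag(W.sum(axis=1))               # degree matrix
    L = Dg - W                                # graph Laplacian
    vals, vecs = eigh(L, Dg)                  # generalised problem L y = lambda Dg y
    return vecs[:, 1:d + 1]                   # drop the constant eigenvector


# Toy usage: 20 points on a half circle, embedded in two dimensions.
t = np.linspace(0.0, np.pi, 20)
X_toy = np.c_[np.cos(t), np.sin(t)]
Y_iso = isomap_sketch(X_toy, k=2)
Y_le = laplacian_eigenmaps_sketch(X_toy, k=2)
```

Both functions assume that the proximity graph is connected and that the data contain no duplicated points; real genotype data may need extra handling for either case.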
Note that these two techniques have two free parameters: k, the number of nearest neighbours used to construct the proximity graph, and the dimension of the embedding. We fixed the latter to 2 but analysed the behaviour of different proximity graphs constructed with k = 2 to 31 neighbours (see Fig. S1).
We quantified the success of the resulting two-dimensional projections by computing their Cscore (see below for details and Fig. S2 for results).

Additional comments about ncMCE:
As mentioned in the main article, ncMCE embeds the sample dissimilarities measured over their minimum spanning tree (MST). This novel MST-derived nonlinear measure, which we refer to as minimum curvilinearity (MC), gives rise to the MC-kernel. The MST is an acyclic graph with all the samples in a population as nodes, connected to each other by paths of minimum length. As a consequence, measuring distances over this graph emphasises the separation between nodes that are far apart in the manifold and maintains or reduces the distances between nearby nodes, which produces a sort of gradual denoising and reveals nonlinear patterns hidden in the high-dimensional feature space 4 .
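The MST-based dissimilarity described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Euclidean distance is used as the starting dissimilarity, the function name `mc_distances` is hypothetical, and the subsequent embedding step of ncMCE is not shown.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from scipy.spatial.distance import pdist, squareform


def mc_distances(X):
    """Pairwise distances measured along the minimum spanning tree of X.

    Assumption: Euclidean distance between samples (rows of X). Duplicated
    samples would need special care, because zero entries in a dense matrix
    are read as missing edges by scipy.sparse.csgraph.
    """
    D = squareform(pdist(X))              # dense pairwise distance matrix
    mst = minimum_spanning_tree(D)        # sparse matrix with n - 1 edges
    # Path lengths over the tree, treating its edges as undirected.
    return shortest_path(mst, method="D", directed=False)


# Toy usage: three samples forming a right angle.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
mc = mc_distances(pts)
print(mc[0, 2])   # 2.0 along the tree, versus about 1.414 in a straight line
```

In the toy example the two end samples are pushed further apart than their straight-line distance, while the distances between neighbouring samples are unchanged, which is the behaviour described in the paragraph above.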
The fact that the MC-kernel is not centred in ncMCE poses the risk that this matrix is not always positive semi-definite and that some of its eigenvalues are negative (with consequences for the projection to low dimensions). However, in practice, kernels that do not satisfy Mercer's condition (positive semi-definiteness of the kernel) can still be used as long as they convey the intuitive idea of similarity 5 , as in the case of Isomap's kernel 1,6 and some kernels used in kernel PCA, like the well-known sigmoid kernel 7 .

Notice that the best Cscore (red circle) is attained for a very low k, which means that a proximity graph with a tree-like structure, like the one underlying ncMCE, is preferred for good discrimination between the two clusters.

Figure S3. PCA applied to the Japanese population without substitution of missing values. PCA is unable to find the two clusters that ncMCE found on the original and unadjusted data matrix.

genetic differences (a). The heat map shows log10(1 + SNP value), in which the SNP values can be 0 (homozygous wild-type), 1 (heterozygous wild-type), 2 (homozygous variant type) or 3 (missing data). The SNPs are subdivided into a first set with high average values, in the top-left corner of the heat map, which characterises the first cluster of individuals, and a second set, in the bottom-right corner, which also has high average values and characterises the other cluster. Note that the genetic variants in the first or the second set of SNPs make the two groups genetically different. Interestingly, the PCA projection of the Japanese individuals that considered only the significant SNPs extracted from the original genotype matrix revealed the two groups that ncMCE identified (b). PCA could not detect these groups when applied to the original dataset (Fig. 4a in the main article).

Example downstream analysis of the SNPs that most significantly differentiate between Japanese ethnic groups.
We explored the role of the genes to which the significant SNPs (those responsible for the separation amongst the Japanese populations) mapped (see the description of the SNP-to-gene mapping below). After performing a functional enrichment analysis of the respective gene list (see Supplementary File 1 and the SNP-to-gene mapping details below), we found that the genes are significantly involved in pathways associated with diseases, such as Alzheimer's (p = 1.23E-8, Benjamini; p = 1.59E-7, Bonferroni; see the Methods for details on the meaning of these p-values), certain cardiomyopathies (p < 2.228E-9, Benjamini; p < 2.74E-8, Bonferroni) and certain types of cancer (p < 2.3E-4, Benjamini; p < 0.008, Bonferroni), or with neuronal activity (p < 2.26E-10, Benjamini; p < 2.26E-9, Bonferroni) and melanogenesis (p = 3.74E-6, Benjamini; p = 8.59E-5, Bonferroni). The genes are also significantly involved in the bioprocesses that govern neurogenesis (p < 3.45E-4, Benjamini; p < 0.006, Bonferroni) and cell proliferation (p = 3.06E-4, Benjamini; p = 0.005, Bonferroni). If the p-values used to select the most significant SNPs are Benjamini-corrected, the results are the same, although the number of genes is reduced (see Supplementary File 2 and the SNP-to-gene mapping details below).
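As a reminder of how the two kinds of corrected p-values quoted above differ, the following Python sketch implements the textbook Bonferroni and Benjamini-Hochberg adjustments. It is a generic illustration with made-up input values; the exact procedure applied by the enrichment analysis is the one described in the Methods, and the function names here are hypothetical.

```python
import numpy as np


def bonferroni(p):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    p = np.asarray(p, dtype=float)
    return np.minimum(p * p.size, 1.0)


def benjamini_hochberg(p):
    """Benjamini-Hochberg adjustment (controls the false discovery rate)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)                         # ranks from smallest p upwards
    scaled = p[order] * n / np.arange(1, n + 1)   # p * n / rank
    # Enforce monotonicity, starting from the largest p-value.
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(n)
    adj[order] = np.minimum(adj_sorted, 1.0)      # back to the input order
    return adj


# Toy usage with illustrative (not real) raw p-values.
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))           # [0.04 0.16 0.12 0.02]
print(benjamini_hochberg(raw))   # [0.02 0.04 0.04 0.02]
```

The Benjamini-Hochberg values are never larger than the Bonferroni ones, which is consistent with the ordering of the two corrected p-values reported for each pathway.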
The fact that these diseases and pathways tend to be more prevalent in the elderly and are related to aging processes readily drew our attention to the Okinawa Centenarian Study 8,9 , a research project based on reliable age verification data, with the goal of understanding why Okinawans show such exceptional longevity and represent the ethnic group with the world's highest ratio of centenarians (40-50 per 100,000 persons 8 ). To our surprise, one of the findings of this study was that Okinawan elders experience a slower age-related decline and a delay or complete avoidance of the diseases associated with aging, such as Alzheimer's, cardiovascular disorders and cancer, compared to other Japanese ethnic groups 8,10 . In addition, Okinawan centenarians possess HLA alleles that lower their risk of developing inflammatory and autoimmune disorders 10 . Two genes from the HLA family are part of the list of significant genes identified in this study (see the genes highlighted in yellow in the Supplementary File S1).
Moreover, a recent study 11 may explain why we found pathways and processes associated with neuronal activity, neurogenesis, melanogenesis and cell proliferation in general. Katsimpardi and colleagues found that restoring the functionality of age-related processes like blood flow and neural stem cell production counteracted the negative effects of aging in mice 11 . They also discovered that administration to old mice of a member of the TGF-β protein family, a family of factors that decreases with aging, reversed the age-related decline of neurogenesis and contributed to vascular remodelling 11 . Two members of this family of proteins are part of the list of significant genes identified in this study (see the genes highlighted in green in the Supplementary File S1).