Enlightening discriminative network functional modules behind Principal Component Analysis separation in differential-omic science studies

Ciucci, Sara; Ge, Yan; Durán, Claudio; Palladini, Alessandra; Jiménez-Jiménez, Víctor; Martínez-Sánchez, Luisa María; Wang, Yuting; Sales, Susanne; Shevchenko, Andrej; Poser, Steven W.; Herbig, Maik; Otto, Oliver; Androutsellis-Theotokis, Andreas; Guck, Jochen; Gerl, Mathias J.; Cannistraci, Carlo Vittorio

doi:10.1038/srep43946

Download PDF

Article
Open access
Published: 13 March 2017

Enlightening discriminative network functional modules behind Principal Component Analysis separation in differential-omic science studies

Sara Ciucci^1,2^na1,
Yan Ge¹^na1,
Claudio Durán¹^na1,
Alessandra Palladini^1,2,3^na1,
Víctor Jiménez-Jiménez⁴,
Luisa María Martínez-Sánchez¹,
Yuting Wang^5,6,
Susanne Sales⁵,
Andrej Shevchenko⁵,
Steven W. Poser⁷,
Maik Herbig⁸,
Oliver Otto⁸,
Andreas Androutsellis-Theotokis^6,7,9,
Jochen Guck⁸,
Mathias J. Gerl² &
…
Carlo Vittorio Cannistraci¹

Scientific Reports volume 7, Article number: 43946 (2017) Cite this article

6252 Accesses
40 Citations
10 Altmetric
Metrics details

Subjects

Abstract

Omic science is rapidly growing and one of the most employed techniques to explore differential patterns in omic datasets is principal component analysis (PCA). However, a method to enlighten the network of omic features that mostly contribute to the sample separation obtained by PCA is missing. An alternative is to build correlation networks between univariately-selected significant omic features, but this neglects the multivariate unsupervised feature compression responsible for the PCA sample segregation. Biologists and medical researchers often prefer effective methods that offer an immediate interpretation to complicated algorithms that in principle promise an improvement but in practice are difficult to be applied and interpreted. Here we present PC-corr: a simple algorithm that associates to any PCA segregation a discriminative network of features. Such network can be inspected in search of functional modules useful in the definition of combinatorial and multiscale biomarkers from multifaceted omic data in systems and precision biomedicine. We offer proofs of PC-corr efficacy on lipidomic, metagenomic, developmental genomic, population genetic, cancer promoteromic and cancer stem-cell mechanomic data. Finally, PC-corr is a general functional network inference approach that can be easily adopted for big data exploration in computer science and analysis of complex systems in physics.

Systems biology in cardiovascular disease: a multiomics approach

Article 18 December 2020

A python library for the fast and scalable computation of biologically meaningful individual specific networks

Article Open access 06 August 2024

Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer

Article Open access 05 January 2021

Introduction

Omic sciences are contributing to revolutionize current biomedicine towards a precision and patient-tailored approach. Indeed, recent advances in high-throughput technologies have led to a growing amount of omic data in several branches of biomedicine, and consequently boosted the development of a large number of network-inference methods (also called reverse engineering methods), in order to seize the behaviour of the underlying biosystems¹.

Inferring or ‘reverse-engineering’ biological networks can be defined as the process of probing interactions between molecular components from experimental data through computational analysis². Reverse-engineering algorithms divide into many subtypes, however two of them play an important role in biomedicine³. Physical interaction approaches aim at predicting biophysical interactions among molecules, for instance structural protein-protein interactions^4,5,6,7,8. Influence or functional interaction approaches aim at predicting associations such as correlation or causality among molecules. A correlation network is a widely used representation paradigm of the associations between the parts that compose a complex system. Each edge in this network is undirected and its weight indicates the level of correlation between the trends of two variables that are symbolically represented by two connected nodes. However, correlation does not imply causality and the correlation network does not represent relations of dependency between the variables (nodes) in the network. Such type of information would imply the inference of directionality on the edges leading to a directed graph representation of the system, which is also called regulatory network. Correlation networks are a very general representation paradigm, which is a point of strength for computational systems biology applications, in fact they can be extracted from any type of omic data regardless of the hypothesis of variable dependency, yet offering a very important information on the internal functional association and organization of the system parts. Their clear point of weakness is that they cannot describe the internal causality (conditioning or dependency) between the parts of the system. On the other hand, although regulatory networks offer a deeper characterization of the mechanisms from which complexity arises, they are a more specific representation paradigm that cannot be applied to all omic data tout court. The inference of regulatory networks is much more complicated - in many cases it requires more samples and higher data quality - to get a good estimation of the variable dependency (edge directionality).

In computational systems biology, the gene expression profiles are used to build gene co-expression networks, which are correlation networks where the nodes correspond to genes, and the edges between nodes represent significant co-expression relationships. Co-expression networks were found extremely useful for analysing microarray and RNA-Sequencing data, since they can identify functional gene modules, i.e. densely connected subnetworks of genes sharing similar patterns of expression⁹. Usually in co-expression networks all the genes present in a dataset are used to construct the network. However, for computational reasons, the network can be restricted to the most varying genes¹⁰, and genes with low variance can be filtered out. In the same way, a correlation network can be constructed from any omic dataset by using the most varying features. Nonetheless, it will not point out what features are more important for discriminating two or multiple conditions, for instance control versus disease samples in case-control studies. To this aim, a further development of this approach consists in using a univariate statistical test (for instance Mann-Whitney test) singularly for each omic feature, followed by multiple testing correction (for instance Benjamini-Hochberg adjustment). This procedure selects the features that are significantly different between two or more conditions in a univariate way (considering each feature one at a time), and then constructs the correlation network between the significant omic features that are discriminative. In practice, a linear univariate-discriminative correlation network is obtained. For simplicity, we will call the outcome of this method: P-value (correlation) network.

As a matter of fact, here we will focus our attention on universal methods for inference of correlation networks in omic data in general, whereas the development of methods for inference of regulatory networks is out of the scope of this study. For this interesting and different subject we refer to the DREAM project¹¹, where the performance of regulatory network reverse-engineering was shown to vary both across species, and within the same category of inference methods. A plethora of methods are available for gene network reverse-engineering, but few methods were developed and extensively tested for revealing discriminative associations in omic systems in general. Among them, the P-value correlation network is the most employed for the analysis of omic data, because of its simple and fast application and the straightforward interpretation of the results. A first aim of this study is to offer a valid (but still easy to adopt and interpret) multivariate alternative to the P-value networks. Therefore, we will present a method for linear multivariate-discriminative correlation network reverse engineering that, thanks to its multivariate nature can help to stress and squeeze out the underlying combinatorial and multifactorial mechanisms that generate the differences between the studied conditions. Notably, the P-value network approach is hypothesis-driven, however our goal is to exploit a data-driven and unsupervised approach, because this can also reveal interesting but unknown sample patterns and can efficiently deal with a small sample size. Between the data-driven methods, the Principal Component Analysis (PCA) is without doubt one of the most used unsupervised linear multivariate algorithms for data exploration and visualization in omic science^12,13. It reduces the data dimensionality while retaining most of the variation in the data, and can be used to detect hidden/unknown patterns of sample discrimination, that emerge according to new inferred variables (the principal components) that are linear combinations of the original variables¹⁴. Unfortunately, to the best of our knowledge, no algorithm available in literature is able to provide a network representation of the internal relations between the discriminative features that are associated to a PCA visualization in which a sample separation emerges. This could be extremely useful to juxtapose to a PCA plot also a diagram that accounts for the feature associations that contribute to generate the visualized sample segregation. Therefore, here, for the first time, we want to propose a method - which we named PC-corr – that uses the PCA loadings to perform unsupervised inference of a linear multivariate-discriminative correlation network.

To summarize, the goal to introduce the use of PC-corr networks is not to infer causality, but to offer a multivariate alternative to the P-value networks in order to explore the functional associations of the omic features at the systems’ level. The collection of the inferred associations between selected features creates a network where discriminative modules can be investigated by imposing different cut-offs to the interaction values. This process allows to explore the network structure and organization at different levels. Hence, the PC-corr networks can be – like P-value networks – thresholded at different levels in order to enlighten discriminative network modules of omic features that can be useful in gaining biomedical understanding and insight in support of the sample discrimination obtained using PCA. In practice, the PC-corr network can be a valid tool to add in many studies that use PCA to unveil the relational organization of the features that participate to the sample segregation in the projected space.

Results

The reasons to build a data-driven method based on unsupervised learning and PCA

For a general omic dataset, we developed a data-driven method to construct discriminative correlation networks relying on a prior unsupervised analysis and sample projection in a visualization space by PCA. The advantages to design our proposed data-driven approach based on an unsupervised (the labels are not used for learning the data projection) and parameter-free (the algorithm does not require tuning of any parameter for learning the projection) method such as PCA are many. First, regardless of the label availability or reliability, this can help to reveal the presence of hidden/unknown sample patterns in the data, with the additional benefit that the selected features are not biased by any supervised hypothesis, and, instead, they emerge naturally from the hidden multidimensional structure of the data. Second, unlike any supervised feature selection method, PCA also does not overfit the data¹¹, a clear advantage for small sample-size dataset analysis – very frequent in biomedicine - or pilot studies in which knowledge uncertainty represents a significant gap. Third, conceptually it might not be clear how an unsupervised method could be better suited to discriminate between conditions compared to a suitable supervised approach, however a vast amount of recent literature proves and discusses this evidence. Smialowski et al.¹² offered a solid in-silico proof of how PCA-loading-based feature selection can perform well preventing overfitting in contrast to supervised strategies. Cannistraci et al. proved how unsupervised techniques outperform support vector machine on proteomic and genomic datasets¹⁴. In a second study, Cannistraci et al. showed how unsupervised techniques could largely outperform supervised techniques for network-based prediction of protein interactions⁵. Zagar et al. in a very extensive study on stage prediction of embryonic stem cell differentiation from genome-wide expression data demonstrated on even 10 different datasets the supremacy of unsupervised on supervised techniques¹⁵. In particular and surprisingly, an unsupervised method such as PCA could outperform supervised methods such as partial least squares and support vector machine. The reason behind this apparent paradox finds an explanation in one outlandish expression coined by Richard E. Bellman in dynamic programming¹⁶: curse of dimensionality. Omics datasets are high-dimensional and present a larger number of features (variables) than samples. The fact that the multidimensional volume (in which the samples lie) grows very fast as a function of the number of features is called curse of dimensionality and represents an important problem in big data analysis because - roughly speaking - it means that the more features I have, the more samples I need for a ‘good’ analysis. To be more technical, the volume of the hyperspace increases so fast, as a function of the number of dimensions/features/variables, that the available data become sparse. This sparsity is problematic, in particular for supervised methods that strongly suffer of overfitting in a sparsity scenario, because the amount of data they need to support the prediction often grows exponentially with the dimensionality. In fact, model and parameter estimation becomes very difficult when the number of samples are much lower than the number of features, and this worsen because the data should be divided in a training test and validation set. In practice, the supervised learning is applied even on a smaller cohort of samples that constitute the training set. The curse of dimensionality effectively prevents supervised learning to be more efficient than unsupervised learning on many omics datasets where the sample to feature unbalance is often significant. On the contrary, it is empirically proven that PCA works fine also for big omic datasets (like the one we mentioned few lines above) even when the sample to feature ratio is larger than 1:1000, for instance Ringner¹³ provided a demonstration of this effectiveness on a dataset of 27,648 genes expressions in 105 breast tumour samples. Finally, as a matter of fact, at the basis of the current machine learning revolution initiated by means of deep learning there is unsupervised feature learning, as well as the advantages offered by unsupervised pre-training¹⁷.

Overview of PC-corr network construction and its evaluation

PC-corr is a method for unsupervised and parameter-free inference of a linear multivariate-discriminative correlation network. The method is unsupervised, parameter-free, linear and multivariate, because it is based on the loadings of PCA, which is a linear multivariate unsupervised and parameter-free transformation. Furthermore, starting from the PCA loadings there are not additional requirements of parameter tuning to generate the PC-corr networks. In brief, the PC-corr network is a weighted graph, whose values are obtained by combining the loadings of the nodes that are adjacent to an edge with the Pearson correlation of the same nodes. The PC-corr value of an edge is high if and only if both the nodes’ loadings and the nodes’ Pearson correlation are simultaneously high, and this is achieved using a minimum operator (for details refer to the Methods section). It is important to mention that, in general, we can associate a PC-corr network to any principal component (PC), because the PC-corr network is based on the loadings of the selected PC. In particular, our algorithm is based on the fact that the PCA suggests, along one of its PCs, a sample discrimination that can be unknown (a new pattern is discovered) or known (the presence of a known pattern, for which the labels are given, is confirmed). In the absence of labels, the presence of a pattern based on a discrimination along one of the PCs can be detected directly by visual inspection (not automated procedure) or using clustering algorithms (automated procedure). This generates a hypothesis on the presence of a possible unknown sample separation (label inference) but it requires to be validated with external information. In this case, the method to project the data remains unsupervised and therefore also the PC-corr network associated to the discriminative PC is still the result of an unsupervised transformation. The only, but important, problem that remains open is to make sense of the inferred ‘potential’ sample labels. As a common practice, many researchers visually inspect the patterns that unsupervisedly emerge in the PC1-PC2 projection of PCA, and they exploit the presence of unknown sample segregations in order to generate new hypothesis on the origin of their patient differentiation¹⁴. The new labels, which coincide with the discovered sample segregation, are compared with the information available in biomedical or clinical databases on the patient’s conditions or, very often, are employed to design novel experiments to verify or validate the new hypotheses and intuitions that the sample pattern suggested to the researcher. A clear example of this process is provided in the section below, dedicated to the lipidomic dataset. On the other hand, in the presence of known labels, PCA is very useful to reduce overfitting in the feature selection and to have an unbiased confirmation of the presence of the sample discrimination¹². Finally, to verify that the pattern discrimination along the selected PC is statistically significant, we suggest adopting a statistical test according to the inferred or the known labels. In the absence of any sample discrimination - suggested via PCA projection and confirmed by a statistical test - our method cannot be used. In addition, if the sample discrimination along a PC is statistically significant for inferred labels that do not match with any known meaningful biomedical or clinical information, the related PC-corr represents a last resort to speculate about the origin of the sample segregation and to generate hypothesis for new validation experiments.

The presence of a module in a network representation appears as a highly interconnected group of nodes, and in case of many modules hierarchically organized, the definition of modules is reserved for potentially overlapping dense subgraphs. The PC-corr network can be used to enlighten the presence of discriminative functional modules by increasing a threshold to remove weak links (connectivity with values lower than the threshold) from the network. Using this process, modules of nodes that are characterized by a strong discriminative correlation emerge in the PC-corr network. These modules are discriminative because they originate from the loadings associated to the discriminative PC, and since the nodes involved in each module have a common correlation, they potentially share a similar function.

In the explanatory Fig. 1, we give a general idea of the PC-corr method without entering in the algorithm details that, instead, are given in Supplementary Table S1. Figure 1 reports the case study where PCA is performed on the omic dataset and unsupervisedly separates the samples in two groups along PC1. We normally compute centered and non-centered PCA and we choose the one that offered better visualization result in sample discrimination. In case of similar performance, we opted for the centered PCA that is the most common employed in omic studies. There is no universal rule for when centering transformation should be used^18,19. Nevertheless, non-centering has been shown to offer several advantages^18,19. This is particularly evident in visualization tasks, when the set of points that form each cluster is distributed around the centre of the mass in the high-dimensional space⁵. Using the embedding in two dimensions after the centering transformation, the points tend to overlap around the origin of the first two dimensions. However, in most cases, executing the embedding without centering can reduce this crowding issue. As we said, the selection of the centering procedure is evidence based.

After that the presence of two groups is detected, they are attributable to two diverse (potential or already known) sample states or experimental conditions, which are accordingly visualized adopting two diverse colours, for instance red or black dots. Here, the separation in two groups is employed in order to provide a simple example, but more in general the unsupervised analysis could also provide a multiple group separation. Then the PC1-loadings - which represent the weights (importance) that each feature has in contributing to the sample separation on PC1 - are combined together with the features’ Pearson correlations (see Supplementary Table S1 for details on the mathematical formula) in order to generate the PC-corr edge weights. Hence, each pairwise feature interaction assumes a PC-corr weight: the higher is this value (in a scale between 0 and 1), the more the two features coparticipate in discriminating the two groups along PC1. The collection of the PC-corr weights, characterizing all pairwise feature interactions, is in practice a PC-corr weighted network having the features as nodes. We invite the reader to refer to Fig. 1 for a general overview of the method.

Summarizing for a general case: this easy, fast and novel method infers the discriminative linear associations existing between features, by combining the result of a preliminary unsupervised analysis (PCA), with the pairwise Pearson correlations between the features, to ultimately build and explore the discriminative network functional modules. The resulting network, that we called PC-corr network, pinpoints the discriminative correlations: correlations between those features that explain the discrimination of the samples along a principal component PCn. This is achieved by combining the PCn loadings (eigenvectors) and the pairwise Pearson correlation (for more details, see Methods).

Moreover, the structure of the interaction network is probed by applying different cut-offs, which allow to progressively reveal different discriminative network modules, that otherwise appear connected and continuous. There is not yet an automatic cut-off selection procedure, but different options are currently under investigation as a follow-up study. Notably, the free selection of the cut-off is the current most employed strategy in P-value network inspection. Hence, we suggest also in PC-corr networks that the operator tunes the cut-off in relation to his/her needs to explore the network structure and its organization at different levels. Nevertheless, a clear strategy to select one or multiple cut-offs in case of PC-corr networks is proposed in the next section entitled: How to choose the cut-off in PC-corr networks.

We compare our approach to the P-value network, i.e. a variation of the co-expression network that selects and represents only the Pearson correlations between significantly different features (as we already discussed in the Introduction). Additionally, we tested PC-corr network robustness by leave-one-out cross validation (LOOCV; the procedure is explained in the Methods section and results are shown in Supplementary information).

We decided to show the effectiveness of PC-corr networks on six datasets from different omic fields, ranging from lipidomics, to metagenomics, developmental genomics, population genetics, carcinoma cell-wide promoteromics and cancer stem cell mechanomics. As a final remark, in the first analysed dataset (coming from lipidomics) we compared PC-corr network with P-value mutual information (MI) network. Since MI is a measure of nonlinear association between two features, this analysis was oriented to provide a further comparison also versus nonlinear association networks.

How to choose the cut-off in PC-corr networks

In this section, we discuss a ‘rule of thumb’ for selecting a cut-off. The optimal cutoff is indeed quite dataset dependent, but we suggest the general cut-off range to be from 0.5 to 1. The higher the cutoff, the more discriminative the features are and the higher (stronger) the correlation between them will be. We suggest the exploration of the network structure and organization from cut-off 0.5 since this requires the features’ Pearson correlation higher than 0.5, while below this threshold the features are considered weakly correlated. However, this does not indicate that applying a cut-off lower than 0.5 will not get meaningful outcomes, but it has to be kept in mind that there could be many relatively low discriminative features and weakly interactions could be found and represented in the network. In addition, we should clarify the concept of frustration. In the PC-corr network, a red edge links nodes that display the same behaviour (positive correlation) and have therefore the same colour, while a black edge connects nodes that are negatively correlated, hence nodes with antithetical colours. Frustration occurs when some edges in the network do not follow this pattern, i.e. they represent positive correlations between nodes with opposite behaviour (opposite colours) or negative correlations between nodes with identical behaviour (same colours).

Usually, in our analyses on six datasets from different omic sciences, we selected cut-off greater or equal than 0.5, and no frustration appeared. However if any frustration occurs, we suggest two possible solutions:

(a) discard the links if the percentage of frustration is less that <5%, i.e. the edges under frustration represent less than 5% of the total number of edges in the represented PC-corr network;

(b) increase the cut-off till no more frustration is present, i.e. the percentage of frustration is 0%.

In the next studies, we will adopt the degree of freedom offered by the cut-off to perform a percolation-based investigation of the network, the detection of weak links and the automatic inference of the hierarchical organization between the discriminative network modules.

Lipidomics

Lipidomics is a rapidly growing omic field²⁰. Absolute quantification of lipid abundance by shotgun mass spectrometry is generating high-throughput datasets²⁰, which contain unveiled lipid signatures in a plethora of conditions and organisms. In particular, high-throughput human blood plasma lipidomics represents a new important tool for precision medicine, which can help to understand the causes and predict effective combinatorial biomarkers in metabolic disorders and beyond²¹. However, discovering and investigating the peculiarities of a healthy plasma lipidome is still an open task. Here we considered a lipidomic dataset composed by the systematic quantification of 281 lipid species from 27 major lipid classes in the human plasma samples of 71 healthy young Caucasians (36 males and 35 females)²². Since we did not have any verified a-priori information, but only a gender-based hypothesis on the possible segregation of the samples, we performed an unsupervised analysis of the whole dataset by PCA. This would allow finding hidden patterns that might visually emerge in the representation space composed by the first two-dimensions of embedding. Surprisingly, the unsupervised analysis revealed a dichotomy inside the female cohort, which was split in two distinct sub-populations (Fig. 2a). Since plasma lipids are strongly related to sex hormones, we postulated that the evident separation might be connected to the use of contraceptive drugs. Therefore, we asked to recover clinical information on the use of contraceptives by the female patients. This information proved that the significant separation on PC2 was contraceptive-driven (Fig. 2b). In fact, hormonal contraceptives may affect the overall change in lipid levels. However, in our analysis we focused on revealing the gender difference from the lipidomic point of view, excluding any bias related with the drug intake. Hence, we confronted the male cohort and the subpopulation of contraceptive-free females (denoted by FnoCC).

In this case also, significant separation was proven along the PC2 (Fig. 2c). This discrimination is explained by the network constructed with our approach (Fig. 3a, cut-off = 0.6), where the red lipids are more abundant (higher) in the FnoCC cohort while the black ones are higher in the male group. The PC-corr network could distinctly reveal discriminative lipid modules and interconnected sub-modules related to the gender separation. Lipids, which are highly connected in modules, belong mainly to the same lipid class. This is expected as these lipids share biosynthetic pathways and regulatory mechanisms. More interesting are the connections revealed by the network, which involve lipids of unrelated classes. While some of these connections can be explained (LPC O- 16:0 may be the lyso-product of PC O- 36:4), others are less obvious and may provide hints for future research directions.

**Figure 3: Lipidomics dataset: networks.**

Nonetheless, comparing (at the same cut-off level) the original PC-corr network (Fig. 3a) to the P-value network (which is the correlation network of the lipids that most significantly differentiate between the two groups, Fig. 3b) a noticeable difference emerges in the number of nodes and edges (Fig. 3d). In fact, the majority of the lipids in the PC-corr network are absent in the P-value network, while few lipids present in the P-value network are missing in the PC-corr network. This can also be observed in the modules, which appear ‘depleted’ in the P-value network in comparison to the PC-corr one. For instance, there is a clear difference in the common highlighted modules (circled in both networks with a dashed coloured line), as well as there are modules (circled in the network with a filled black line) which are present only in one of the networks. Substantial differences between the two networks consist in the presence of few triacylglyceride (TAG), the total absence of the Cholesteryl esters (CE lipids), the Sphinganine- Sphingosine and the Phosphatidylcholines modules in the P-value network (Fig. 3b). On the other hand, the PC-corr network lacks the presence of few isolated lipids such as the GD1, GD3, GM2, GT1. Even comparing (at the same 0.6 cut-off) the PC-corr network to the P-value MI network (i.e. the mutual information (MI) network of the lipids that most significantly differentiate between the two groups, similarly to the P-value network, Fig. 3c) (for details see the Methods section: Construction of the P-value MI network), no additional differences (to the ones described before) emerge, since all the interactions in the P-value MI network are contained in the P-value network (Fig. 3d).

All in all, our approach shows more diversity and completeness in the lipid module composition, and reveals the most discriminative functional lipid modules. While the P-value MI network contains single/double interactions and the P-Value network shows rather isolated modules, the PC-corr network also contains multiple connections between these modules. These connections are interesting points of complex metabolite interactivity which should be further investigated. It is a promising sign for a functional discriminative network to also include a major signalling sphingolipid (Sphingosine-1-phosphate).

Furthermore, we represented the possible/putative relation between the lipid chemical structural information and the connectivity of the nodes in the PC-corr network. In Supplementary Fig. S1, node colour diversity reflects the different lipid classes. This shows that the inferred discriminative modules reflect the known correlation between lipids of the same lipid class. However, also less obvious connections, as the degree of fatty acid unsaturation (caused by fatty acid desaturases), are conserved in the network. In Supplementary Fig. S2, node colours reflect the number of double bonds in each lipid species and show aggregations of similarly unsaturated lipids.

In a similar way, in Supplementary Fig. S3 the node colour gradient reflects the number of carbon atoms in the fatty acid chains and it appears that each module is internally characterized by a homogeneous colour gradient, hence by a certain similarity in the number of carbon atoms.

Since for this dataset the approach based on mutual information held a highly broken network with no interesting results, in the next datasets we continued to discuss merely the differences between the PC-corr and P-value networks.

Metagenomics

Dyspepsia is a common gastric problem mainly characterized by upper abdominal pain or discomfort and heartburn²³. The primary medicaments for the management of dyspeptic symptoms, which usually involves their long-term use, are over-the-counter antacid drugs known as Proton Pump Inhibitors (PPIs), that inhibit acid secretion in the stomach’s parietal cells, resulting in a higher intragastric pH. Although PPIs are widely prescribed, they are related to several adverse situations, in particular the development of corpus predominant atrophic gastritis in Helicobacter pylori positive patients, that is a precursor of gastric cancer, and a higher risk of enteric infections^24,25. These last two conditions are related to the alteration of the gastrointestinal microbiota, therefore this issue has been progressively addressed from a metagenomic perspective^26,27. The first question of clinical relevance is whether PPIs exert any effect on the gastric bacterial communities, and this was investigated in a study by Williams and collaborators that was still solely based on culturable gastric bacteria²⁸. More recently, next-generation-sequencing-based metagenomic studies further confirmed that PPIs modify the gastroenteric human microbiota^{29,30,31,32,33}.

Here, we re-analyzed a dataset by Amir and collaborators²⁹, in order to verify whether the PC-corr method could enhance, in comparison to Amir et al.²⁹, our understanding of the PPI-induced microbial perturbation. In particular, we were interested in promoting a systems biology perspective, pointing out the possible presence of discriminative network microbial alterations. Therefore, we considered the gastric fluid dataset, which comprises eight patients who were sampled at two different time points, that is before and after eight weeks of PPI treatment, for a total of 16 samples. We selected this specific dataset because it clearly shows that PPI treatment alters the gastric fluid microbial communities. Metagenomes were obtained by pyrosequencing fragments of the 16S r-RNA gene on the GS FLX system (Roche). Pyrosequencing has been widely used for genomic and metagenomic sequencing, because it is one of the first next generation sequencing platforms developed for the production of high throughput data and it can produce long reads (up to 800 bp).

PCA analysis of the dataset revealed that samples are separated in two groups according to treatment and their difference is significant (Mann-Whitney test p-value, Supplementary Fig. S4). Since non-centered PCA (ncPCA) offered a better discrimination than centered PCA (cPCA), we built the PC-corr networks using the ncPCA loadings of the second dimension (PC2). Nevertheless, it was not possible to build the P-value network, since the pairwise comparisons for each bacterium yielded no significant results after the Benjamini-Hochberg correction.

This finding represents an emblematic example of the need to complement and integrate the outcomes of P-value correlation networks by introducing the PC-corr method. As a matter of fact, in this very case the P-value network approach delivered no results, although there is clearly a difference between the two conditions being tested.

Figure 4 shows the networks obtained by considering two different cut-offs, that is 0.55 and 0.6. Both networks are represented with black and red nodes in the upper part of the figure (Fig. 4a,b), while at the bottom each network appears with its nodes colored according to class level (Fig. 4d,e). Red nodes are bacteria that are more abundant after treatment, while black nodes are bacteria that are more abundant in the samples collected before the treatment.

**Figure 4: PC-corr networks at two different cut-offs.**

The bacteria with higher abundance in the untreated samples (black nodes) are split in two submodules which correlate negatively with the group of red nodes (bacteria that display higher abundance in treated samples). A black submodule is made of Gammaproteobacteria (lilac dashed line), represented by Enterobacteriaceae, while the other black submodule contains a group of Betaproteobacteria (light blue dashed line) with members from the Comamonadaceae (Massilia) and Oxalobacteraceae (Acidovorax) families, and a group of Alphaproteobacteria (violet dashed line), and also Betaproteobacteria.

Although clinical metagenomic studies usually consider the entire phylum of Proteobacteria and its overall abundance instead of exploring their class-level behavior, environmental studies generally analyze how the different classes of Proteobacteria change according to the type of habitat or environmental perturbation. In a study on acid mine drainage microbial diversity³⁴ the authors found that Betaproteobacteria dominated in moderate pH conditions, while Alphaproteobacteria and Gammaproteobacteria had a strong preference for lower pH environments. Class-related behavior should be investigated also in biological systems, as our network suggests there may well be a pH-related trend in the bacterial composition of the sub-modules. Indeed, it is the main purpose of PPI treatment to increase the stomach pH, and the higher pH of treated patients is known to favor the growth of bacteria that usually reside in the mouth and esophagus and are not adapted to survive the normal gastric acidity^26,35.

Interestingly, when comparing ours to Amir’s findings, the families that result to be significantly decreased following PPI treatment (Moraxellaceae, Flavobacteriaceae, Comamonadaceae and Methylobacteriaceae) actually belong to the black sub-module featuring the Gammaproteobacteria, that is to the group of nodes that have higher abundance before treatment. The Erysipelotrichaceae are instead significantly increased after PPI treatment and in the PC-corr network they accordingly appear in the red sub-module, represented by the genus Bulleidia. Other important hubs in the red sub-module are represented by the genera Prevotella, Granulicatella, Oribacterium, Leptotrichia and Rothia, which belong to the gastric and oral microbiota^36,37,38,39.

Taken together these results confirm that, although univariate analysis (P-value network) cannot spot any microbial signature, the PC-corr discriminative network method is able to explain the clear and significant PPI-related sample discrimination detected by multivariate analysis (PCA).

Developmental genomics

One of the key questions in developmental biology is to understand all the mechanisms that are responsible for tissue differentiation⁴⁰. To develop a multicellular organism, cells must obviously differentiate during multiple stages to specialize for different functions.

Tissue specificity depends on spatiotemporal patterns of gene expression which in turn are driven by transcriptional regulatory networks⁴¹, that means interactions in which a given transcription factor (TF) regulates other TFs, or itself ⁴². Hence, combinatorial interactions among TFs would be critical to understand tissue-specific expression patterns.

To identify TF interaction networks involved in tissue development, we decided to reanalyse the dataset by Ravasi and collaborators⁴¹, that contains quantitative (normalized) mRNA profiles of 1,983 Human TFs using quantitative Reverse Transcription PCR (qRT-PCR) across 32 tissues (excluding monoblasts).

Unsupervised analysis of the dataset pointed out two main different groups of tissues that were found consistent with the differences among the tissues at the embryo level (Fig. 5a). In fact, PCA could separate the tissues derived from the ectoderm from the other tissues that arise from the endoderm and mesoderm along PC1, suggesting that expression differences between endoderm and mesoderm are slighter than between these two germ-layers and the ectoderm (for germ layer derivations, see section ‘Germ layer attribution’ in Method). Likewise, PC2 could not differentiate the endoderm and mesoderm-derived tissues and their overlapping is confirmed by the fact that, during development, endoderm patterning is controlled by a series of reciprocal interactions with nearby mesoderm tissues⁴³. However, in another study we showed that more advanced machine learning methods for nonlinear dimension reduction (such as minimum curvilinear embedding) could significantly differentiate also mesoderm from endoderm, confirming the presence of a complex and nonlinear relation between these two germ-layers that is difficult to untangle by classical linear transformations¹⁴.

**Figure 5: PCA analysis of human tissues according to their embryonic origin and the corresponding TF PC-corr network under cut-off 0.8.**

Then the PC-corr network was constructed from PC1 loadings at 0.8 cut-off level to explain the discrimination of ectodermal tissues from endodermal and mesodermal tissue (Fig. 5c), and to identify co-expressed TF-encoding genes (TFs) in a particular layer. In fact, the red nodes represent TFs that are highly expressed in ectodermal tissues, while the black ones are highly expressed in endodermal and mesodermal tissues.

Using DAVID functional annotation tool we found a significantly enriched pathway (PANTHER pathway P00057, p-value after Benjamini-Hochberg correction <0.01), a Wnt signalling pathway that is crucial for embryonic development in all animal species⁴⁴, and significantly enriched TF-encoding genes that play a specific role in the ectoderm development (PANTHER_BP_ALL, BP00246: ectoderm development, p-value after Benjamini- Hochberg correction <0.01) (Fig. 5c).

Additionally, it is important to notice the role of GATA3, that is highly expressed in mesodermal and endodermal tissues and interacts negatively with SCRT1 and the big hub NEUROD6, which are instead highly expressed in the ectodermal tissues. Previous studies revealed that the encoded protein by SCTR1 may promote neural differentiation (GeneCards database, http://www.genecards.org/), while the encoded protein by NEUROD6 may be involved in the development and differentiation of the nervous system (GeneCards database, http://www.genecards.org/), that is in concordance with the fact that the ectodermal tissues come from different part of the nervous system. Moreover, other middle-size TF-encoded genes share the same trend like PAX6, that is a homeobox gene and key regulatory gene of eye and brain development⁴⁵, and TBR1, a T-box gene involved in the differentiation and migration of neurons and is required for normal brain development^46,47.

Furthermore, the PC-corr network could identify in a triangular module three TF-encoding genes (FOXG1A, FOXG1B, FOXG1C) present in the dataset that actually represent three old names of the same gene, FOXG1⁴⁸.

Comparing it to the P-value network at the same threshold (figure not shown), we note that our approach is able to focus on the main TFs that play an important role for the discrimination shown in Fig. 5a, whereas the P-value network is highly-interconnected (Fig. 5b).

Even if we cross-match the PC-corr network with the P-value network (approximately the same number of nodes), the P-value network (Supplementary Fig. S5) loses its connectivity and starts to break in single interactions, in addition to the absence of the above mentioned NEUROD6, SCRT1 and GATA3.

Population genetics

Genetic diversity across human populations has been studied widely for many years and the field has progressed rapidly in the last few years^49,50. Complex patterns of genetic variation in populations are the product of demographic and evolutionary events acting at different time scales, including populations migration, expansion, colonization, mutation, genetic drift and selection^50,51.

Understanding human genetic variation is important to discover and describe the genetic contribution to many human diseases and understand the genetic differences within and among populations. In Asia the geographic structure of genetic variation remains enigmatic⁵². For this reason the HUGO Pan-Asian SNP consortium conducted the largest survey to date of human genetic diversity among Asians^52,53.

In our analysis, we focused on the Japanese population, represented by 164 individuals who were genotyped at 54,794 autosomal SNPs loci.

Replacing the missing values with the mode of each SNP⁵⁴ allowed to remove nonlinear perturbations and linearize the data, revealing the separation of the samples into two clear subgroups: Japanese from Tokyo (JP-Tk, including the JPT and JP-ML ethnic groups) and Japanese Ryukyuan from Okinawa (JP-RK) along the first principal component (PC1) (Fig. 6a).

**Figure 6: Genetic structure of the Japanese population based on autosomal SNP genotyping.**

The PC-corr network was constructed at cut-off 0.8 (Fig. 6c), combining the PC1 loadings with the pairwise correlations between SNPs, to find the different patterns of genetic variation in these two Japanese subpopulations.

The red nodes are the SNPs with a higher frequency in the JP-RK, while the black ones have a higher frequency in the JP-Tk. In Fig. 6c the SNPs names are changed to the name of the gene that contains the mutation when it is an intron, missense or 3′ UTR variant. Otherwise, it is left unchanged.

The PC-corr network contains many single/double interactions, a few separated modules, and autocorrelating nodes when the SNPs map to the same gene.

It is important to note that all the modules and single/double interactions not involving SNPs mapped to genes, contain interacting SNPs that lie in the same region between two genes (Supplementary Fig. S6). In particular, the first red module which appears in Fig. 6c is composed of 9 SNPs belonging to the same linkage block. These mutations span over about 2,500 bases in chromosome 2 and exhibit a pairwise disequilibrium value D’ of 1, coupled with Pearson correlation coefficients r² ranging between 0,978 and 1 (calculations were made using the web-based service provided by the Broad Institute SNAP v. 2.2; https://www.broadinstitute.org/mpg/snap/ldsearchpw.php). Disequilibrium and r² values both indicate that these SNPs are in strong linkage disequilibrium, that is to say they co-occur on the same haplotype more often than is expected by chance and is a powerful signal of the population genetic processes⁵⁵. This means there is a non-random association among those SNPs and they are inherited all together.

Moreover, the network identified phenotype-associated SNPs (checked in Ensembl variant database), including disease-related variants (Fig. 6c). For example, the second module on the right contains black nodes (they are related to the JP-Tk group) corresponding to SNPs that are significantly associated to cholesterol levels and lipoproteins. Other black SNPs also mapping to genes (NAA25, ECTD4; bottom right) were significantly associated to the same phenotypes, as well as to mortality and platelet aggregation.

SNPs rs10509225 and rs341295 too are associated to cholesterol, lipoproteins and disease state, while association to stroke and neuroblastoma was shown by two SNPs related respectively to WDR2 and MAMDC2 genes.

Interestingly, some SNPs were instead associated to two phenotype traits, that are height (JP-Tk group), and hip (JP-RK group). In fact, height is a trait that distinguishes the two sub-populations, since the Okinawans are shorter in average than mainland Japanese people⁵⁶. In addition, Okinawans have about 20% fewer hip fractures than mainland Japanese do⁵⁷.

The Japanese began to lose significantly more calcium from their bones than the Okinawans, suggesting the Okinawans preserve their bone density at healthy levels for longer periods of time than the other Japanese, due to protective lifestyle factors (high calcium intake in the food and water, high vitamin D from sun exposure, increased physical activities at older ages, high intake of dietary flavonoids from plant food) (Okinawa Centenarian Study, http://www.okicent.org/study.html).

All in all, the PC-corr network findings confirm that cholesterol levels and lipoproteins affect the JP-Tk population. Indeed, the Okinawan population, characterized by a shorter height than most Japanese, includes the largest proportion of centenarians⁵⁸, who were found to have impressively young, clean arteries, low cholesterol, and low homocysteine levels when compared to Westerners (Okinawa Centenarian Study, http://www.okicent.org/study.html) and less prevalent cardiovascular disease risk factors (such as hypertension, diabetes, hypercholesterolemia and smoking) than most populations^58,59,60. These factors help reduce their risk for coronary heart disease by up to 80% and keep stroke levels low (Okinawa Centenarian Study, http://www.okicent.org/study.html). This longest life expectancy was associated to “human longevity genes⁶¹”, like APOE and FOXO3 genes^58,62,63, and lifestyle factors⁵⁸, i.e. reduced energy intake, different composition of diet whose characteristics are low levels of saturated fat, high antioxidant intake, and low glycemic load⁶⁴, level of physical activity and philosophy of life^65,66.

However, we compared our approach to the P-value network at the same high cut-off (figure not shown), and the latter has many more SNPs and interactions between SNPs (Fig. 6b), and does not focus on the significant discriminative SNPs. Also, some SNPs of the PC-corr network are missing (indicated in the box of Fig. 6b by their corresponding accepting genes), like MAMDC2, SMC5, NAA25, HECTD4. Similarly, cross-matching the PC-corr network with the P-value network of approximately the same number of nodes (Supplementary Fig. S7) revealed that the P-value network cannot catch the SNPs that are significantly associated to hip, cholesterol, lipoproteins and disease.

Carcinoma cell-wide promoteromic dataset

Here, we decided to focus our attention on Carcinoma⁶⁷: an important and frequently occurring type of cancer that arises from epithelial cells originating from the ectoderm or the endoderm⁶⁸. It can be classified into two broadly differentiated classes: small-cell (SCC) and non-small cell carcinoma (NSCC). The latter can be subdivided into many other subtypes, being adenocarcinoma (ADC) and squamous-cell carcinoma (SqCC) the ones with the highest incidence. In particular, SCC is characterized by small undifferentiated cells with a big nucleus and almost no cytoplasm. This type of carcinoma is highly aggressive and, in spite of its initial positive response to chemo and radiotherapy, it is recurrent and poorly prognosed^69,70. SqCC is a type of cancer that begins in the squamous cells, which are a kind of epithelial cells, and can occur in a variety of tissues throughout the body, although it most often develops in sun-exposed areas⁷¹.

The dataset to which we had access is a unique source of information because it is based on a particular type of sequencing called Cap analysis of gene expression (CAGE)⁷². CAGE is the method for genome-wide identification of transcription start sites (TSSs) adopted by the Functional Annotation of the Mammalian Genome (FANTOM) consortium, and it is the first technology developed to profile the activity of gene transcription at each promoter site. In particular it concurrently identifies the TSS and from these their promoters, and determines the expression level at each promoter, thus allowing the identification of regulatory elements⁷³. The gene expression data we analysed in this work were generated in the FANTOM5 project⁷⁴. In this huge project thousands of mouse and human samples from different cell lines were sequenced using CAGE⁷² with the final aim of understanding transcriptional regulation of gene expression and annotating its promoterome⁷⁴. In our analysis we focused on human small-cell and squamous-cell carcinoma-derived cell line samples (respectively 7 and 9 samples), since these two types of cancers have the poorest prognosis according to latest research^75,76. These samples have 184,827 ‘robust’ CAGE peaks which were obtained by decomposition-based peak identification (DPI)⁷⁴.The peaks without explicit gene annotations were removed and we ended with a total of 71,189 gene features in the dataset. Then, performing dimensional reduction by PCA, we found that SqCC and SCC are clearly separated along the first principal component (PC1) regardless of the tissue the carcinoma lies in (Fig. 7a). This component was used to build the PC-corr network, and find the different patterns of gene expression in these two kinds of carcinoma.

**Figure 7: PCA separation of squamous cell carcinoma and small cell carcinoma, and the corresponding discriminative gene network using PC-corr under cut-off 0.8.**

In the network at 0.8 cut-off (Fig. 7c), most of the genes (the black nodes) are highly expressed in SqCC samples, while the remaining genes (the red nodes) are highly expressed in SCC samples, and some of them are transcription factors (TFs, highlighted in green). The EGFR is highly expressed in SqCC, which actually is one of the main targets of SqCC treatment^77,78. We can see that most of the genes that are up-regulated in SCC are markers of neuroendocrine differentiation, for example, VGF^77,78, SYP⁷⁹, CHGB⁸⁰, INSM1^81,82, while HMP19, TUBB2B, CACNA2D4 and GNB3 are highly expressed in neuroendocrine tissues (GeneCards database, http://www.genecards.org/). In contrast, many of the up-regulated genes in SqCC are transmembrane proteins in the plasma membrane, which are responsible for the way the cell interacts with the extracellular matrix, and could play a role in the different migration behavior of both carcinoma subtypes.

With the aid of DAVID^83,84 for functional annotation, we found a group of enriched genes that are annotated as regulating cell proliferation (biological process, GO:0042127, p-value after Benjamini-Hochberg correction <0.05). We also found two more groups of enriched genes, annotated as cell junction (cell component, GO:0030054) and plasma membrane (cell component, GO:0005886), and a protein family characterized as glycoprotein (p-values after Benjamini-Hochberg correction <0.05). These enriched genes were shown to be highly correlated with cancer progression^85,86, together with the genes that regulate cell proliferation. Furthermore, the TF-encoding gene TP63 is highly expressed in SqCC, while the TF-encoding gene INSM1 is highly expressed in SCC (Fig. 7c), which is confirmed by previous studies^82,87. In literature INSM1 is expressed predominantly in small cell lung carcinoma and is highly expressed in tumors of neuroendocrine origin⁸². We further investigated the target genes of each TF, to verify whether they interacted in the network with their corresponding TF. Interestingly, TUBB2B and KLK11 are the target genes of TP63, more specifically, TP63 activates KLK11 and represses TUBB2B⁸⁸. Moreover, TP63 and CSTA play an important role in epithelial tissue maintenance and development, and they are markers of squamous differentiation^89,90,91. We know that GATA6 is a TF particular important in mammalian cell differentiation and molecular programs that mediate normal cell differentiation are required for oncogenesis and tumor cell survival in certain cancers⁹². Interestingly, GATA6 emerges as an important hub at the centre of our PC-corr network and it is highly expressed in SCC, hence we speculate that its role can represent a new finding suggested by the PC-corr method for the identification of key factors that explain the strong malignancy of SCC. Our hypothesis seems also supported by the current literature⁹³ on the GATA6 oncogenic properties. Indeed GATA6, which we found to be overexpressed in small cell carcinoma cell lines, is also frequently amplified or overexpressed in gastric, esophageal and pancreatic tumor cell lines⁹³ that are very aggressive types of cancer.

In general, the PC-corr network provides valuable information in terms of clear expression patterns and functional discriminative modules identification. Instead, the P-value network at the same cut-off (figure not shown) contains too much information and is extremely difficult to investigate, due to the presence of a large amount of nodes and edges (see Venn diagrams in Fig. 7b), and TP63 is absent and consequently its TF module.

However, we compared the P-value and PC-corr networks at cut-offs that allow having approximately the same number of nodes, and we observed that the P-value network (Supplementary Fig. S8) is mainly characterized by single and double interactions, thus losing its module interconnectivity. Moreover, it doesn’t contain CST and the genes related to neuroendocrine differentiation (excluded INSM1), in addition to TP63 and its targeting gene TUBB2B. Therefore, this dataset is an instance where our approach performs better in finding discriminative genes and functional modules. Finally, the PC-corr method in this application showed the power to reverse engineering discriminative network modules, which are clearly functional related, and that represent potential reference for defining new combinatorial network-based multigene markers for carcinoma subtypes categorization.

Cancer stem cell mechanomics

The regenerative properties of tumors are mainly accounted by cancer stem cells (CSCs), which manifest as recurrence and metastasis. CSCs appear to have multiple growth options, which allow them to adopt different states and, subsequently, contribute to their ability to evade therapy. Therefore, it is crucial to concurrently characterize the molecular and mechanical differences of CSCs under different culture conditions, as a strategy towards precision medicine. As a matter of fact, patient-derived primary cell cultures can represent a reliable benchmark for in vitro testing of tailored-to-patient targeted therapies.

Here, cancer stem cell lines derived from the same patients with glioblastoma multiforme were cultured in three different states (FGFJI, EGF and Serum) with the aim to maintain them in different signal transduction states⁹⁴. The dataset consists of 9 samples for each cell culture condition for a total of 27, and nearly 40.000 genes whose expression was assessed by means of genome-wide RNA-seq.

Then PCA analysis was performed and, as shown in suppl. Fig. S9, the three culture conditions were patently distanced from each other along the PC1. Notably, the Serum group (red circles in Fig. S9) is less homogeneous (more unstable conditions) than the two others. To investigate the cause of the discrimination from the systems molecular point of view and detect discriminative network modules related with the PC1, we applied the PC-corr method. Figure 8a shows the PC-corr network (cut-off 0.65), where the red genes are highly expressed in Serum, while the gradient from grey to black indicates genes highly expressed in EGF and FGFJI.

**Figure 8: Cancer stem cells gene expression pattern and correlation with cell mechanics.**

To validate the biological reliability of the gene network modules emerged from the PC-corr analysis, we performed functional annotation analysis (DAVID functional annotation tool, p-value after Benjamini correction <0.05 to detect significant categories) of the genes involved in the modules. The result of this validation was the identification of significantly enriched genes in GO terms and in three KEGG pathways: oxidative phosphorylation, glycolysis/gluconeogenesis and focal adhesion that is known to be associated with cell mechanics. In addition, many cell mechanics enriched genes related with cytoskeleton activity were present (genes labels with yellow triangles in Fig. 8a).

Inspired by the enrichment of the inferred discriminative module for cell mechanics related genes, the morphological variations of the cells under mechanical stress were also measured by means of real-time deformation cytometry (RT-DC) technique⁹⁵. The cells elastic moduli were calculated and in Fig. 8b we investigated the level of linear (Pearson) correlation of PC1 (representing the dimension of discrimination that generated the discriminative network modules) and elastic modulus of the cells (representing cell mechanic changes). Since the mechanical assessment was not done for all the biological replicates and to have generalized indicators, we plotted the PC1 score (x-axis) and elastic modulus (y-axis) as average values (+/− standard error) of all the samples for each culture condition. The same operation was repeated separately for each of the three culture conditions, hence we got three average indicators with standard errors in the Fig. 8b. The final values were z-score-transformed to have a scale-invariant representation. The high correlation (0.99 with p-value = 0.05) demonstrates that the network we extracted accounts well for the cell morpho-mechanics variations, and we suggest that in future experiments it could be proven (by gene silencing or overexpression) to which extent the detected network modules participate to determine the mechanical differences of these cells in the three culture conditions.

In addition, the comparison (using the same cut-off value for both the methods) of the P-value (figure not shown) and PC-corr networks was performed, and the node and edge Venn diagrams show that the P-value network cannot catch two genes (SPARC and VIM) involved in cell motion and extracellular matrix and three genes (MT-CO2, MT-CYB, MT-ND1) of the oxidative phosphorylation pathway (Fig. 8c). On the other hand, although the P-value network is large (11,287 nodes) and highly interconnected (983,509 interactions), it cannot catch almost all the gene interactions present in the PC-corr network. However, if we cross-match the PC-corr network versus a P-value network with a comparable number of genes (only the first 38 genes with the highest p-value selected), the P-value network loses interconnectivity due to a majority of single interactions between couple of genes (Fig. S10). Taken together these results obtained in cancer systems mechanomics show how the PC-corr method can help to unveil molecular discriminative modules that set a connection across different scales in the cell: form the underlying molecular interactions to the macroscopic mechanical phenotypes.

Discussion

Networks are the method of choice to explore the interactions among several kinds of features (genes, bacteria, metabolites, etc.). Here we proposed a new approach for network inference that merges the commonly used correlation-based association of the features with an unsupervised multivariate analysis, i.e. PCA. Exploiting an unsupervised approach allowed to discover hidden patterns in the data without any a priori information on the samples, or to provide an unbiased and unsupervised confirmation of the groups’ differences. Hence, we proposed to clarify the groups’ differences visualized by PCA by means of the PC-corr network inference and the analysis of the functional modules emerging from the PC-corr network structure.

Considering different datasets should be regarded as a point of strength in a study that is addressed to the large community of omic science. We believe that, in order to test how a new approach performs on real omic data, we have to verify its outcome in different fields. Therefore, each section of this study was dedicated to a different omic dataset, and we opted to compare the P-value and PC-corr networks. For the lipidomics dataset, we compared them also to the P-value Mutual Information (MI) network, that appeared highly broken (nonlinear measures are more difficult to be inferred in case of small sample datasets) and characterized by single/double interactions (Fig. 3c), already spotted by the P-value network (Fig. 3d). Since for the first analysed dataset the MI-based inference method did not work in identifying intra-connected or inter-connected lipid sub-modules/modules like in the PC-corr and P-value networks (Fig. 3a,b), we continued comparing the P-value and PC-corr approaches, in relation to the systems biology evidences achieved by the methods’ results in multiple datasets.

We adopted this evidence-based approach on multiple real datasets because unfortunately, to the best of our knowledge, at the state-of-the-art there are not available theoretical models able to simulate different types of artificial omic datasets generated by different technologies. This could be a very interesting project for another study by itself, but it is off topic for this article. Different technological platforms are used to generate omics datasets from different fields (for instance shotgun mass spectrometry for lipidomics, next generation sequencing for metagenomics and SNP data, etc.) and it was crucial to verify if our approach can give meaningful biological results in all these different fields. Finally, the goal of the article is also to represent a point of reference for biologists that want to find a clear example of how to complement their P-value correlation network analysis. We suggest a multivariate method that can extend the PCA sample visualization they frequently adopt, and for which they want to gain a meaningful explanation from the omic feature perspective. We really hope that biologists, interested in fast and easily interpretable methods that clarify the functional omic modules to associate to their PCA sample separation, could use our approach. For example, in the population genetics dataset, PCA could reveal a substructure in the Japanese population that was in reality split in the Tokyotas and Okinawans. It is this ability of PCA to spot the features that play a major role in the discrimination of the samples in the reduced two-dimensional (2D) space (which are responsible for the discrimination in the data) that was transferred in the network structure (PC-corr) and combined with the Pearson correlation-based interactions. Thus, in the PC-corr network, the undirected graph displays edges that account for two levels of meaning, discrimination and correlation, while the nodes represent the features that cause the separation of the samples. This is why we have called it discriminative correlation network, that was proven by LOOCV to be a robust approach because of its unsupervised nature.

Our goal was to test how our discriminative network approach would perform on different omic datasets, ranging from lipidomics, to transcriptomics, to metagenomics and genomics. We found modules that can have a biological explanation. Some of them could be expected in the light of the hypothesis that was being tested: in the lipidomic dataset, for example, most of the lipid species are arranged in modules according to their lipid class, and each module presents a specific range in the number of carbon atoms in the fatty acid chains. In some cases modules can be unexpected and become a starting point for new insights, and therefore be a support for decision making in designing new experiments or addressing pilot studies.

The PC-corr networks were always compared to P-value networks. The first strategical difference lies in the way features are selected: while the PC-corr adopts a multivariate approach, i.e. it uses a combination of features that are responsible for the sample discrimination, in the P-value network the discriminating features are singly selected (one by one) with each Mann-Whitney test (followed by Benjamini-Hochberg procedure). The second strategical difference lies in the generation of the correlation weights in the network. PC-corr combines in parallel and at the same time in a unique formula the discrimination power of the PC-loadings and the association power of the Pearson correlation, directly providing in output discriminative omic associations. These are generated using a robust (because we use as merging factor the minimum operator, which is a very penalizing operator) mathematical trade-off between two important factors: multivariate discriminative significance and correlation association. In addition, as mentioned above, the minimum operator works as an AND logical gate in a digital circuit, therefore in order to have a high link weight in the PC-corr network, both the discrimination (the PC-loadings) and the association (the Pearson correlations) of the nodes adjacent to the link should be simultaneously high. Instead, the P-value procedure begins with the pre-selection of the significant omic features and, only in a second separated step, computes the associations between these features. Therefore, in P-value networks, the interaction weights are the result neither of multivariate discriminative significance, nor of a discrimination/association interplay. Considering the different rationale behind these two approaches, we consider them complementary. Sometimes the PC-corr network was able to achieve results when the P-value network was not, like in the carcinoma cell-wide promoteromic dataset. There we discovered a known TF module, that was absent in the other network. Interestingly in the metagenomic example, the pairwise statistical test performed on each bacterium (univariate statistics) gave no significant results, therefore the P-value network by definition did not exist. On the contrary, and surprisingly, a well-defined PC-corr network emerged from the same data. The reason lies also in the ‘multivariate core’ of the PCA that was able to extract a collection of features, which linearly combined all together in the PC2. This discriminative set of feature associations spotted significant differences between the samples before and after PPI treatment. In this case too, like in the lipidomic dataset, we could identify some sub-modules where the nodes (here bacteria) belonged to the same class. Likewise the P-value network sometimes can highlight groups of interacting features that are missing in the PC-corr network, as in the case of the lipids GD1, GD3, GM2, GT1 (Fig. 2f). Interestingly, across the different case studies, in some of them PC-corr networks include much more nodes, in others, P-value networks includes much more nodes. The factor that influences this difference lies in the different methods (univariate against multivariate) that are used to build the correlation networks. In addition, the P-value networks consider the features that are significantly different after Benjamini correction. This multiple hypothesis adjustment depends from the number of features and is an additional factor that generates instability in the number of selected nodes appearing in the P-value networks.

As a strategy to mine the network structure, the operator can display the different PC-corr network modules emerging for different cut-offs and adopt the network representations that better fit the needs and the aim of his/her biological questions. However, like for P-value network thresholding, there is not a general rule to choose the optimal cut-off. For instance, choosing a lower threshold (=0.5) offers a general overview of the discriminative (native) network organization, while a high cut-off (>0.6) allows a better resolution of the discriminative network functional modules. That is to say a focus on the most discriminative associations of features that act as main factors in the generation of the different states of the system under analysis.

The PC-corr network is pruned at a threshold that is specifically informative of the biomedical concept behind the network organization. This is both a weakness and a strength: on one hand, we do not have an impartial criterion for cut-off selection, on the other hand this is the starting point for the development of a new tool. As a matter of fact, the next stage will be to use this tuning parameter (the cut-off) to infer the hierarchy of the modules in the network structure, that is exploring how modules rearrange and/or disaggregate and/or disappear at different cut-offs, following a strategy very similar to percolation analysis in complex network theory. Finally, our datasets did not demonstrate particular problems with sample outliers and in general the pre-processing, using data normalization, can significantly help to reduce the impact of outliers on the PCA projection. However, a future interesting development could be also to consider the option to adopt robust PCA⁹⁶ instead of standard PCA.

The core advantage of our algorithm is to be the first and innovative tool to provide a multivariate-discriminative correlation network associated to PCA analysis. Nowadays, there are many algorithms tailored to reverse engineer a functional network from genomic data but not for omics data in general such as PC-corr does. Network reconstruction techniques can be generally divided into two broad classes: biological knowledge-driven (such as STRING^97,98) or data-driven (such as the above-mentioned minet⁹⁹, which for instance can be applied to expression data). However, both these methods only focus on the features’ interactions, neglecting their role in discriminating two or multiple conditions, for instance control versus disease samples in case-control studies. The P-value network approach illustrated in the paper can be used to address this question by building correlation networks between significant features, but it lacks the multivariate selection of discriminating features and, from the results shown in this paper, proved to miss interesting functional modules pointed by PC-corr. In contrast, our proposed approach integrates the discriminative significance of the commonly applied multivariate analysis named PCA with the widely used correlation association by Pearson correlation, to make PC-corr unique in inferring multivariate-discriminative correlation networks. As a result, the obtained network can be of great aid for the user to enlighten the most discriminative and functionally relevant features’ associations, linked to the PCA discrimination of groups of samples. Its main disadvantage is that it only investigates the features’ linear relationships (associations) and neglects the discovery of non-linear relationships between the features. We are currently designing a new algorithm that should overcome this limitation by decoupling linear and nonlinear discriminative associations.

In conclusion, the PC-corr approach results particularly useful for small datasets or pilot studies, where supervised approaches are unfeasible. It proposes to shift the strategy of analysis from the univariate selection of single discriminative features and their correlation, to the multivariate identification of discriminative and collective associations between omic features. PC-corr can thus represent a new tool for the definition of combinatorial and multiscale biomarkers in precision medicine¹⁰⁰ and for the investigation of complex omic data in systems biomedicine. For instance, in a mechanomic pilot study composed of a small sample-size dataset, PC-corr allowed to detect important functional modules of genes directly correlated to variations of the elastic moduli of cancer cells. This result associates functional modules that emerge at the molecular systems biology micro-scale-level with the cellular mechanic properties that arise at phenotypical meso-scale-level, and represents a significant innovation in computational multiscale systems biology. Likewise, given its machine-learning based computational layout, PC-corr is a general inference-based approach that can be easily adopted for big data exploration in computer science and analysis of complex systems in physics.

Methods