Abstract
Multiomics data are increasingly being gathered for investigations of complex diseases such as cancer. However, high dimensionality, small sample size, and heterogeneity of different omics types pose major challenges to integrated analysis. In this paper, we evaluate two network-based approaches for integration of multiomics data in an application of clinical outcome prediction of neuroblastoma. As the first step, we derive Patient Similarity Networks (PSN) for individual omics data by computing distances among patients from omics features. The fusion of different omics can be investigated in two ways: network-level fusion is achieved by using the Similarity Network Fusion algorithm to fuse the PSNs derived for individual omics types, and feature-level fusion is achieved by fusing the network features obtained from individual PSNs. We demonstrate our methods on two high-risk neuroblastoma datasets from the SEQC project and the TARGET project. We propose Deep Neural Network and Machine Learning methods with Recursive Feature Elimination as predictors of the survival status of neuroblastoma patients. Our results indicate that network-level fusion outperformed feature-level fusion for integration of different omics data, whereas feature-level fusion is more suitable for incorporating different feature types derived from the same omics type. We conclude that network-based methods are capable of handling heterogeneity and high dimensionality well in the integration of multiomics.
Introduction
Omics refers to the measurements of different molecular entities (e.g., transcriptomics, proteomics, epigenomics, etc.), corresponding to various molecular mechanisms (e.g., genetic, epigenetic, etc.) of a single organism or tissue sample^{1}. High-throughput ‘omics’ technologies are increasingly being used to decipher underlying molecular mechanisms and to develop novel therapeutic strategies for complex diseases such as cancer^{2}. Cancer is a microevolutionary process that is contingent upon the localised tissue environment within subjects^{3}. The benefits of a personalised medicine approach to the treatment of cancer are increasingly apparent, and the approach has in turn been adopted by more healthcare practitioners^{4}. Initiatives such as ‘The Cancer Genome Atlas (TCGA)’ have curated large amounts of multiomics data from thousands of human subjects^{5}. The technological race for high-throughput biology has led to a rapid increase in multiomics data of high dimensions, which requires novel strategies for data analysis. High dimensionality and heterogeneity pose considerable challenges for integration and analysis of multiomics.
Techniques for multiomics data analysis can be broadly classified into supervised and unsupervised techniques. Recently proposed supervised techniques include multivariate techniques^{6}, group-regularized ridge regression^{7}, network-smoothed t-statistic Support Vector Machines (SVM)^{8}, generalized elastic net^{9}, and deep neural networks^{10,11}. However, such approaches either require initial filtering and feature selection to reduce data dimensionality or use simple feature integration techniques such as concatenation. They fail to consider interactions among multiple molecular layers measured by different omics technologies. DIABLO, a multiomics method that extends generalized canonical correlation analysis and explicitly takes into account correlations among different datasets, has been proposed in the supervised framework^{12}.
Unsupervised techniques for multiomics data integration can be broadly classified into joint dimensionality reduction (JDR) techniques and networkbased approaches. JDR techniques include sparse Multiblock Partial Least Squares^{13}, joint Nonnegative Matrix Factorization^{14}, Joint and Individual Variation Explained (JIVE)^{15}, Multiomics Factor Analysis (MOFA)^{16}, Regularized Generalized Canonical Correlation Analysis (RGCCA)^{17}, and tensorial Independent Component Analysis (tICA)^{18}. These techniques convert multiple omics datasets jointly into a latent space of lower dimension, capturing biological and technical sources of common variability and disentangling heterogeneities across different omics types. Latent features or factors learned by JDR techniques can be used for a variety of downstream applications such as identification of disease subtypes and patient subgroups, or prediction of clinical outcomes and end points.
Network-based methods infer relationships between samples/patients or omics features and rely on the networks built using those relations. Examples of networks derived from omics data include Patient Similarity Networks (PSN)^{1,19} and molecular networks such as gene or protein networks. Multiple molecular or patient networks are combined using network integration techniques such as Similarity Network Fusion^{20}. Features of integrated networks are then used as inputs to machine learning models for downstream clustering or prediction tasks. Network-based methods rely on network features that are of much lower dimension than omics data, and they transform heterogeneous omics types into homogeneous networks. The aim of this study is to demonstrate and evaluate two fusion strategies for network-based approaches to multiomics data integration. Our methods are illustrated in Figure 1. We demonstrate our methods on multiomics data in an application of clinical outcome prediction in neuroblastoma. This work extends our earlier work on clinical outcome prediction in neuroblastoma, which used Support Vector Machines (SVM) and Random Forests^{1}, and DNN for single omics data^{21}.
Neuroblastoma is a malignancy of the developing sympathetic nervous system, which is often accompanied by fatal metastatic disease, resulting in survival rates of less than one in two^{22}. Treatment of cancer patients depends upon clinical variables such as the patient’s risk of disease progression or death from disease. Therefore, it is extremely important to predict clinical outcomes of neuroblastoma patients when deciding the course of treatment. By using two multiomics datasets of neuroblastoma from the Therapeutically Applicable Research to Generate Effective Treatment (TARGET) project^{22} and the Sequencing Quality Control (SEQC) project^{23}, we compare two network-based approaches for integration of multiomics: network-level fusion and feature-level fusion. For feature-level fusion, multiple network features are extracted from each individual PSN and then features from different networks are fused; for network-level fusion, different omics networks are integrated using Similarity Network Fusion (SNF) and then features from the integrated network are used for prediction. Features extracted from PSNs are of much lower dimension than the omics features, and their number is smaller than the sample size.
The features extracted after fusion of the omics data are then used as inputs to Deep Neural Network (DNN) classifiers^{24} and to Machine Learning classifiers with Recursive Feature Elimination (RFE)^{25}. We demonstrate that network-level fusion of multiomics data outperforms the commonly used feature-level fusion on multiomics datasets and that DNN relevance propagation identifies salient network features better than RFE. In this research, we compare the DNN method with four Machine Learning classifiers, i.e. SVM (with linear kernel), Random Forests (RF), Logistic Regression (LR), and Decision Trees (DT), used as estimators of RFE.
Materials and methods
Datasets
We used neuroblastoma multiomics datasets from the TARGET project^{22} and the SEQC project^{23} to demonstrate applications of our methods. We confirm that all methods were performed in accordance with the relevant guidelines and regulations. Each dataset consists of samples gathered from two omics data types.

SEQC dataset: The SEQC cohort^{26} had a total of 498 neuroblastoma samples, including 176 high-risk and 322 low- and intermediate-risk samples. Microarray and RNA-seq datasets for the 498 neuroblastoma patients from the SEQC project were downloaded from the NCBI GEO database (https://www.ncbi.nlm.nih.gov/gds) with accession numbers GSE49710 and GSE62564, respectively. Both measure gene expression levels, but with different omics technologies.

TARGET dataset: The TARGET cohort^{22} comprised 157 high-risk neuroblastoma samples with gene expression data and DNA methylation data. The RNA-seq expression dataset from the TARGET project was downloaded from the project website (https://ocg.cancer.gov/programs/target/projects/neuroblastoma), and the DNA methylation dataset was downloaded from the NIH GDC portal (https://portal.gdc.cancer.gov/projects/TARGETNBL). Gene expression data quantify the transcriptome in neuroblastoma patients, while DNA methylation data (measuring the addition of methyl groups to DNA) capture epigenomic variations in those patients.
Patient similarity networks (PSN)
A PSN is a graph that represents patients as nodes and similarities between patients as edges. It is denoted by \(G^m = (V, A^m)\), where V denotes the set of subjects and \(A^m = \left( a^m_{uv} \right)\) denotes the affinity matrix (the similarity matrix), in which \(a^m_{uv}\) is the similarity of measurements of omics type m between subjects \(u \in V\) and \(v \in V\). If \(\phi ^m_v\) denotes the measurement of omics type m for subject v, then
\[ a^m_{uv} = \texttt{sim}\left( \phi^m_u, \phi^m_v \right), \]
where \(\texttt {sim}\) is a similarity measure.
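The similarity between patients in each individual omics dataset was determined by the Pearson’s correlation coefficient between their omics features:
\[ \texttt{sim}\left( \phi^m_u, \phi^m_v \right) = \frac{\sum_{i=1}^{N} \left( \phi^m_{u,i} - \bar{\phi}^m_u \right) \left( \phi^m_{v,i} - \bar{\phi}^m_v \right)}{\sqrt{\sum_{i=1}^{N} \left( \phi^m_{u,i} - \bar{\phi}^m_u \right)^2} \, \sqrt{\sum_{i=1}^{N} \left( \phi^m_{v,i} - \bar{\phi}^m_v \right)^2}}, \]
where N denotes the total number of features, i refers to the \(i{\rm th}\) feature in the dataset of omics m, and \(\bar{\phi}^m_u\) denotes the mean of the N features of subject u.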
These correlation values were normalized and rescaled to represent positive edge weights by using the Weighted Correlation Network Analysis (WGCNA) algorithm^{27}. WGCNA enforces scale-freeness of the PSN by making its nodal degree distribution follow a power law, at least asymptotically, and thereby makes our analysis more robust to noise and errors.
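The construction of a PSN from one omics matrix can be sketched as follows. This is a minimal illustration, not the in-house implementation: it assumes the signed WGCNA-style soft threshold \((0.5 + 0.5\,r)^{\beta}\), and the function name `build_psn` and the value of `beta` are hypothetical. The search for the smallest beta that yields a scale-free topology is not shown.

```python
import numpy as np

def build_psn(features, beta=6):
    """Return a patient-by-patient affinity matrix for one omics type.

    features: array of shape (n_patients, n_features).
    beta: soft-threshold power (hypothetical value, normally tuned).
    """
    # Pearson correlation between patients (rows of the matrix).
    corr = np.clip(np.corrcoef(features), -1.0, 1.0)
    # Assumed signed rescaling: maps correlations in [-1, 1] to weights in [0, 1].
    affinity = (0.5 + 0.5 * corr) ** beta
    # Remove self-similarity so the graph has no self-loops.
    np.fill_diagonal(affinity, 0.0)
    return affinity

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))   # 20 synthetic patients, 100 synthetic features
A = build_psn(X, beta=6)
```

Larger values of `beta` suppress weak correlations more strongly, which is what pushes the degree distribution towards a power law.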
Network features
From a PSN, we computed two types of features: centrality features and modularity features. Centrality features assign high scores to the most important nodes of the network^{28}. We computed 12 centrality features for nodes: weighted degree, closeness centrality, current-flow closeness centrality, current-flow betweenness centrality, eigenvector centrality^{29}, Katz centrality^{30}, HITS centrality^{31} (authority values and hub values), PageRank centrality^{32}, load centrality^{33}, local clustering coefficient, iterative weighted degree and iterative local clustering coefficient.
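As an illustration of how such node-level features can be computed, the sketch below extracts four of the centrality measures with NetworkX from a synthetic affinity matrix. The helper name `centrality_features` and the conversion of edge weights to distances (`1 / weight`) for the shortest-path measure are assumptions made for this example, not details taken from the study.

```python
import networkx as nx
import numpy as np

def centrality_features(affinity):
    """Return an (n_nodes, 4) matrix of centrality features."""
    G = nx.from_numpy_array(affinity)
    # Shortest-path measures need distances, so strong similarity = short distance.
    for _, _, data in G.edges(data=True):
        data["distance"] = 1.0 / data["weight"]
    per_measure = [
        dict(G.degree(weight="weight")),                      # weighted degree
        nx.closeness_centrality(G, distance="distance"),      # closeness
        nx.eigenvector_centrality(G, weight="weight", max_iter=1000),
        nx.clustering(G, weight="weight"),                    # local clustering
    ]
    nodes = sorted(G.nodes())
    return np.array([[m[n] for m in per_measure] for n in nodes])

rng = np.random.default_rng(0)
M = rng.uniform(0.1, 1.0, size=(10, 10))
affinity = (M + M.T) / 2.0          # synthetic symmetric similarity matrix
np.fill_diagonal(affinity, 0.0)
F = centrality_features(affinity)
```

Each row of `F` describes one patient, so the matrix can be passed directly to a classifier or combined with modularity features.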
Modularity features were obtained by detecting network modules through clustering of the nodes. We used spectral clustering^{34} and Stochastic Block Model (SBM) clustering^{35} to find network modules, and the optimal number of modules was determined by the silhouette score. The membership of each node in the modules was represented by one-hot vectors, and the sum of these vectors over all the modules was taken as the modular feature vector for a given node. The centrality features and modular features were concatenated to obtain the network features that were used as the inputs to the classifiers.
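A minimal sketch of the spectral-clustering branch of this step is given below. It uses scikit-learn rather than the tooling of the study, selects the number of modules by silhouette score over a small hypothetical candidate list, and encodes memberships as one-hot vectors. Using `1 - affinity` as the distance for the silhouette score is an assumption of this example, and the SBM branch is not shown.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

def modularity_features(affinity, candidate_k=(2, 3, 4, 5), seed=0):
    """One-hot module memberships for the silhouette-optimal clustering."""
    distance = 1.0 - affinity            # assumed distance for the silhouette score
    np.fill_diagonal(distance, 0.0)
    best_score, best_labels = -np.inf, None
    for k in candidate_k:
        labels = SpectralClustering(
            n_clusters=k, affinity="precomputed", random_state=seed
        ).fit_predict(affinity)
        if len(set(labels)) < 2:         # silhouette needs at least two modules
            continue
        score = silhouette_score(distance, labels, metric="precomputed")
        if score > best_score:
            best_score, best_labels = score, labels
    n_modules = best_labels.max() + 1
    return np.eye(n_modules)[best_labels]   # row i is the one-hot vector of node i

# Synthetic affinity with three planted modules of ten patients each.
rng = np.random.default_rng(0)
blocks = np.repeat(np.arange(3), 10)
A = np.where(blocks[:, None] == blocks[None, :], 0.9, 0.1)
noise = rng.uniform(0.0, 0.05, size=A.shape)
A = A + (noise + noise.T) / 2.0
np.fill_diagonal(A, 1.0)
M = modularity_features(A)
```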
Featurelevel fusion
For each PSN obtained from omics dataset m, we extracted a feature vector \(x^m\) for each node (subject). Using feature-level fusion, we combined the feature vectors of the individual omics datasets into a single multiomics feature vector.
Feature-level fusion was achieved by concatenating the modularity features and computing the mean of the centrality features from individual datasets.
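The fusion rule described above can be written in a few lines. The sketch below assumes that every omics type provides the same centrality measures in the same column order; the function name and the toy dimensions are hypothetical.

```python
import numpy as np

def feature_level_fusion(centrality_list, modularity_list):
    """Fuse per-omics network features for the same set of patients.

    centrality_list: list of (n_patients, n_centrality) arrays, one per omics type.
    modularity_list: list of (n_patients, n_modules_m) arrays, one per omics type.
    """
    # Centrality features share the same meaning across omics types: average them.
    fused_centrality = np.mean(np.stack(centrality_list), axis=0)
    # Modules differ between networks: keep all memberships side by side.
    fused_modularity = np.concatenate(modularity_list, axis=1)
    return np.concatenate([fused_centrality, fused_modularity], axis=1)

rng = np.random.default_rng(0)
cent = [rng.normal(size=(5, 12)), rng.normal(size=(5, 12))]   # 12 centralities each
mod = [np.eye(3)[rng.integers(0, 3, 5)], np.eye(4)[rng.integers(0, 4, 5)]]
fused = feature_level_fusion(cent, mod)
```

With 12 centrality features and 3 and 4 modules in the two toy networks, the fused vector has 12 + 3 + 4 = 19 entries per patient.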
Networklevel fusion
In network-level fusion, the PSN for multiomics data, G, is obtained by combining the PSNs, \(G^m\), of the individual omics data. We derive the similarity matrix of the multiomics PSN by fusing the similarity matrices \(A^m\) of the single-omics PSNs.
We achieved the network-level fusion of single-omics PSNs using the Similarity Network Fusion (SNF) algorithm^{20}.
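To make the idea of network-level fusion concrete, the following is a simplified NumPy sketch of the cross-diffusion scheme behind SNF. It is not the SNFtool implementation used in the study: the function names, the neighbourhood size `k`, and the number of iterations are illustrative assumptions, and the input is assumed to contain at least two symmetric, positive affinity matrices.

```python
import numpy as np

def _full_kernel(W):
    """Row-normalised global kernel: off-diagonal mass 1/2, diagonal 1/2."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def _local_kernel(W, k):
    """Row-normalised kernel restricted to the k most similar neighbours."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    S = np.zeros_like(W)
    idx = np.argsort(-W, axis=1)[:, :k]
    rows = np.arange(W.shape[0])[:, None]
    S[rows, idx] = W[rows, idx]
    return S / S.sum(axis=1, keepdims=True)

def snf_sketch(affinities, k=5, n_iter=20):
    """Fuse two or more affinity matrices by iterative cross-diffusion."""
    P = [_full_kernel(W) for W in affinities]
    S = [_local_kernel(W, k) for W in affinities]
    for _ in range(n_iter):
        P_next = []
        for v in range(len(P)):
            # Diffuse the average of the other networks through the local
            # structure of network v, then renormalise.
            others = np.mean([P[u] for u in range(len(P)) if u != v], axis=0)
            P_next.append(_full_kernel(S[v] @ others @ S[v].T))
        P = P_next
    fused = np.mean(P, axis=0)
    return (fused + fused.T) / 2.0       # symmetrise the fused network

rng = np.random.default_rng(0)

def random_affinity(n):
    M = rng.uniform(0.1, 1.0, size=(n, n))
    return (M + M.T) / 2.0

fused = snf_sketch([random_affinity(12), random_affinity(12)], k=4, n_iter=10)
```

The fused matrix can then be treated like any single-omics affinity matrix when extracting centrality and modularity features.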
Deep neural networks (DNN)
Let us consider an \(L+1\) layer DNN (feedforward network) for prediction of clinical outcomes, with layers \(l = 0, 1, \ldots, L\), where \(l=0\) and \(l=L\) denote the input and output layers of the DNN. Let the output, weights, and biases for layer l be denoted as \(h^l\), \(W^l\), and \(b^l\). The input layer receives features x from each subject, so \(h^0 = x\). For layers \(l=1, \ldots, L-1\),
\[ h^l = f\left( W^l h^{l-1} + b^l \right), \]
where f denotes the activation function of layer l.
The output y of the softmax output layer L is given by
\[ y = \mathrm{softmax}\left( W^L h^{L-1} + b^L \right). \]
The output class label \(k^*\) is the class k receiving the maximum output activation \(y_k\):
\[ k^* = \arg\max_{k} \, y_k. \]
The network parameters are learned by minimizing the cross-entropy loss using a gradient descent approach. In our experiments, we used the Adam optimizer to learn the weights and biases of the network.
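The forward pass of such a feedforward classifier can be sketched in NumPy as below. The layer sizes and random weights are placeholders chosen for illustration, training is omitted, and the code uses the batch convention `h @ W` (one subject per row) rather than the column-vector convention of the text.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, weights, biases):
    """Hidden layers use ReLU; the last layer is a softmax output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return softmax(h @ weights[-1] + biases[-1])

rng = np.random.default_rng(0)
sizes = [19, 4, 4, 4, 2]                       # placeholder architecture
weights = [rng.normal(scale=0.5, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

x = rng.normal(size=(6, 19))                   # six synthetic subjects
y = forward(x, weights, biases)                # class probabilities per subject
k_star = y.argmax(axis=1)                      # predicted class labels
```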
Relevance propagation
In order to explore the utility of network features extracted from PSN and the interpretability of our DNN models, relevance propagation was applied. Relevance propagation is an approach to studying the relevance or attribution of each input feature to the output of a neural network. According to the unified framework for comparing existing approaches proposed by M. Ancona et al. (2018)^{36}, relevance propagation methods can be classified into perturbation-based and gradient-based methods. Perturbation-based methods compute the relevance of an input feature by removing, masking or altering it and comparing the resulting output with the original output. While this kind of method is conceptually straightforward, its drawbacks include: (1) slow running time, especially with a large input feature set; and (2) unstable results when the number of features removed in each iteration varies, due to the nonlinearity of DNN. In contrast, gradient-based methods compute the relevance in a single forward and backward propagation through the DNN, which is stable and not time-consuming. Therefore, we adopted a gradient-based method to analyze our DNN model.
Popular gradient-based methods include Gradient * Input^{37}, Integrated Gradients^{38}, Layer-wise Relevance Propagation (LRP)^{39}, and DeepLIFT^{40}. Notably, Integrated Gradients satisfies two desirable properties, i.e. sensitivity and implementation invariance, while the other methods break at least one of them. Sensitivity is satisfied if a feature is given a nonzero attribution whenever the input and the baseline differ in that feature and generate different output values. Plain gradients can readily violate sensitivity: when the prediction function is flat around the input, the gradient with respect to a feature is zero even though changing that feature from its baseline value changes the output, so a relevant feature may receive no attribution while attribution is spread over less relevant features, which is the condition we attempt to avoid. In addition, implementation invariance means that the attributions should be identical for two networks whose outputs are equal for all inputs, even if the two networks have different implementations. Consequently, we applied Integrated Gradients rather than the other approaches because it satisfies both sensitivity and implementation invariance.
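Specifically, the integrated gradients of the \(i{\rm th}\) dimension of an input x and a baseline \(x'\) can be formulated as^{38}
\[ \mathrm{IG}_i(x) = \left( x_i - x'_i \right) \int_{0}^{1} \frac{\partial F\left( x' + \alpha \left( x - x' \right) \right)}{\partial x_i} \, d\alpha, \]
where F denotes the function of a DNN, \(\frac{\partial F(x)}{\partial x_i}\) denotes the gradient of F(x) along the \(i{\rm th}\) dimension, and \(\alpha\) parameterizes the straight-line path from the baseline \(x'\) to the input x.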
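In practice the path integral is approximated by a finite sum. The sketch below is a generic numerical illustration, not the DeepExplain implementation: it applies a midpoint Riemann sum to a toy logistic model whose gradient is known analytically, and the model, weights and step count are all hypothetical.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, n_steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    # Points on the straight line from the baseline to the input.
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.array([grad_fn(p) for p in path])
    return (x - baseline) * grads.mean(axis=0)

# Toy differentiable model: F(x) = sigmoid(w . x); the third input is unused.
w = np.array([1.5, -2.0, 0.0])

def F(x):
    return 1.0 / (1.0 + np.exp(-(w @ x)))

def grad_F(x):
    return F(x) * (1.0 - F(x)) * w

x = np.array([1.0, 0.5, 3.0])
baseline = np.zeros(3)
attr = integrated_gradients(grad_F, x, baseline)
```

Two properties can be checked on this toy example: the attributions sum to \(F(x) - F(x')\) up to the approximation error, and the input that the model ignores receives zero attribution.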
Moreover, several studies have shown that removing insignificant input features can improve performance^{41}. Thus, after computing the attributions of the input features, we recursively removed the features one at a time according to their attribution ranks and tracked how the performance changed.
Recursive feature elimination (RFE)
In addition to DNN, we used Recursive Feature Elimination (RFE) with other classifiers for comparison with DNN. RFE is a feature selection method proposed by I. Guyon et al. (2002)^{25}, designed for identifying salient genes in microarray gene expression data. RFE utilizes an estimator to rank the features according to a certain criterion (e.g. linear coefficients) and recursively removes the feature with the smallest ranking criterion until a desired feature subset is obtained. The iterative procedure of the RFE algorithm can be described as:

1. Train the estimator with the current feature set
2. Rank all the features according to the ranking criterion
3. Remove the feature with the smallest criterion
In this research, we applied four Machine Learning classifiers, i.e. SVM (with linear kernel), Random Forests (RF), Logistic Regression (LR), and Decision Trees (DT), as estimators of RFE. The classifiers were iteratively trained on the network feature set with the RFE algorithm to select the most important features. Our network feature set consists of centrality features and modularity features. We expect RFE to help discover the centrality features computed by the most suitable algorithms and the modularity features representing membership of the most important modules. The RFE algorithm implemented by Scikit-learn^{42} requires an estimator that exposes feature weights, either as linear coefficients or as feature importances, which all four classifiers provide.
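The procedure above maps directly onto the `RFE` class of scikit-learn. The following sketch runs it on synthetic data with a Logistic Regression estimator; the dataset, the number of features to keep, and the solver settings are placeholders rather than the configuration used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a network feature matrix and binary outcome labels.
X, y = make_classification(
    n_samples=120, n_features=30, n_informative=5, random_state=0
)

# step=1 removes one feature per iteration, as in the three steps listed above.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10, step=1)
selector.fit(X, y)

selected = selector.support_     # boolean mask of the retained features
ranking = selector.ranking_      # 1 for retained features, larger = removed earlier
X_reduced = selector.transform(X)
```

To avoid leaking information, the selector should be fitted on the training folds only and then applied to the held-out data.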
Experiments
We analyzed two multiomics neuroblastoma datasets: (i) microarray and RNA-seq expression datasets from 498 neuroblastoma samples from the SEQC project^{26} and (ii) RNA-seq expression and DNA methylation datasets from 157 neuroblastoma samples from the TARGET project^{22}. The downloaded datasets were processed by removing any missing or duplicate values using the pandas and NumPy libraries in Python.
The clinical descriptor used as the label for training DNN classifiers was the binary label ‘death from disease’. After excluding the samples with missing descriptors, we performed binary classification on both datasets: ‘death from disease’ or ‘not’. Both datasets were evaluated using nested 3-fold cross-validation because of the relatively limited number of samples.
Data preprocessing
The Wilcoxon signed-rank test^{43,44} was performed on individual omics datasets to identify the most relevant input features. Because the input features are high-dimensional, the Benjamini-Hochberg correction for multiple testing^{45} was applied to control the false discovery rate. The features most strongly associated with the clinical outcome were then identified at a significance level (p-value) of 0.001. This effectively reduced the number of input features for each omics dataset. Since gene expression and DNA methylation data are noisy and not all genes/features may be relevant to the disease, the Wilcoxon analysis allowed us to identify and eliminate irrelevant features early, making our models simpler and more accurate.
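A filtering step of this kind can be sketched as follows. Two points are assumptions of this example rather than details confirmed by the text: it uses the unpaired rank-sum form of the Wilcoxon test (Mann-Whitney U, via SciPy) to compare each feature between the two outcome classes, and it applies the 0.001 threshold to the Benjamini-Hochberg adjusted p-values. The data are synthetic.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

def select_features(X, y, alpha=0.001):
    """Indices of features that differ between the two outcome classes."""
    pvals = np.array([
        mannwhitneyu(X[y == 1, j], X[y == 0, j], alternative="two-sided").pvalue
        for j in range(X.shape[1])
    ])
    return np.where(benjamini_hochberg(pvals) < alpha)[0]

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 50))
X[y == 1, :5] += 3.0             # only the first five features carry signal
kept = select_features(X, y)
```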
Building PSN and feature extraction
Distances between patients were obtained by computing Pearson’s correlation coefficients among omics features, and PSNs for each omics dataset were built from them. The correlation weights were normalized and rescaled to be positive by using the WGCNA algorithm^{27}, making the PSNs behave as scale-free networks. We used the smallest beta value that achieved 90% of the truncated scale-free index. The WGCNA algorithm was implemented in-house in Python: we applied its formula to rescale the edges of the PSN and tried different hyperparameters to test whether the resulting edges made the PSN scale-free. For network-level fusion, we combined the PSNs derived from individual omics datasets via the SNF algorithm^{20}, which was implemented using the SNFtool library in R.
Network features of PSN were extracted using the NetworkX package in Python. Twelve centrality features and the modular features were extracted as input features for the classifiers. The number of modules detected differed between omics datasets. To discover network modules, spectral clustering was applied using NetworkX and the Stochastic Block Model (SBM) was applied using the graph-tool package in Python. We extracted 204 modules for microarray and 16 modules for RNA-seq expressions of the SEQC dataset, and 60 modules for RNA-seq expressions and 34 modules for DNA methylation of the TARGET dataset. For the combined networks generated by network-level fusion, 109 and 44 modules were extracted for the SEQC and TARGET datasets, respectively. Before being fed into the neural network, the extracted features were normalized to have zero mean and unit variance. To achieve feature-level fusion, we computed the means of the centrality features extracted from the PSNs of individual omics data and concatenated the modularity features.
Training DNN and RFE models
We applied a feedforward DNN for predicting clinical outcomes with features extracted from multiomics PSNs, using the Tensorflow V1 framework (https://www.tensorflow.org/versions/r1.15/api_docs/python/tf). The weights and biases of the DNN were trained by minimizing the cross-entropy loss function with the Adam optimizer^{24}. Notably, the SEQC dataset is highly imbalanced, with around 77% of the samples belonging to the majority class, “alive”, while the TARGET dataset does not suffer from this imbalance. To handle the data imbalance, we applied a weighted cross-entropy loss function^{46} on the SEQC dataset, whereas we used the standard softmax cross-entropy loss function on the balanced TARGET dataset. The rationale of a weighted cross-entropy function is that it assigns different weights to the majority and minority classes to compensate for the imbalance. The weight for class i is defined as
where \(n_i\) denotes the number of samples belonging to class i.
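For illustration, the sketch below shows one common way to build a weighted cross-entropy loss. The weighting \(w_i = N / (K\, n_i)\), with N samples and K classes, is an assumption of this example and is not necessarily the definition used in the study; the function names and the toy class counts are hypothetical as well.

```python
import numpy as np

def class_weights(labels, n_classes):
    """Assumed inverse-frequency weights: w_i = N / (K * n_i)."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts.sum() / (n_classes * counts)

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of the negative log-likelihood of the true class."""
    per_sample = -np.log(probs[np.arange(len(labels)), labels])
    w = weights[labels]
    return np.sum(w * per_sample) / np.sum(w)

# Toy labels with a 77% / 23% split between the two classes.
labels = np.array([0] * 77 + [1] * 23)
weights = class_weights(labels, n_classes=2)

# Uninformative predictions: probability 0.5 for both classes.
probs = np.full((100, 2), 0.5)
loss = weighted_cross_entropy(probs, labels, weights)
```

With these weights, the 23 minority samples contribute as much to the loss in total as the 77 majority samples, so misclassifying the minority class is penalised more heavily per sample.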
We used the rectified linear unit (ReLU) activation function and dropout in the hidden layers. We experimented with batch sizes of 8 and 32. An early stopping criterion was implemented to determine the convergence of learning and to avoid overfitting.
Nested cross-validation (CV)^{47} was employed for tuning the hyperparameters and for model selection. Since the number of samples in our datasets is insufficient to create a standalone testing set, using typical cross-validation may lead to overfitting and data leakage, whereas nested CV is designed to address these issues. The algorithm of nested CV is illustrated in Algorithm 1.
The nested cross-validation procedure is composed of an outer CV loop and an inner CV loop. The training fold of the outer CV is further split into k folds for the inner CV. The inner CV loop is similar to typical CV and is used for tuning hyperparameters such as hidden sizes, learning rates, batch size, etc. The average score of each hyperparameter set is calculated across all the inner CV folds to find the best hyperparameters. The outer CV loop is then used for model selection, where each model with the best hyperparameters chosen by the inner CV is tested and compared. This strategy ensures that the testing data for the final evaluation are excluded from hyperparameter tuning, which leads to a more robust evaluation.
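The two loops can be sketched with scikit-learn as shown below. A Logistic Regression classifier and a small grid over its regularization strength stand in for the DNN and its hyperparameters, and the data are synthetic; only the nesting structure is meant to correspond to the procedure described above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=150, n_features=20, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: choose hyperparameters using only the outer training fold.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="f1",
)

# Outer loop: each outer test fold is never seen during tuning.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="f1")
mean_f1, std_f1 = outer_scores.mean(), outer_scores.std()
```

Reporting the mean and standard deviation of `outer_scores` gives the mean ± standard deviation format used for the results.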
After tuning the parameters and evaluating the models, we obtained the DNN model with the best performance and then applied Integrated Gradients^{38} to the model to compute saliency scores for the input features. The implementation of Integrated Gradients was taken from the DeepExplain framework^{36}, because DeepExplain supports Tensorflow V1, which we used to develop our DNN model. The input features were then ranked by their saliency scores and removed one by one in search of performance improvement.
Besides DNN, RFE was also explored for predicting clinical outcomes with PSN network features. We implemented RFE with four machine learning models (i.e. linear kernel SVM, Random Forests, Logistic Regression, and Decision Trees). Network features were evaluated and ranked on the training set, and only the salient features were selected for clinical outcome prediction on the testing set. The details of the RFE procedure are explained in Algorithm 2.
Comparing with existing methods
To further demonstrate the utility of our approach to integrating multiomics data and extracting network features, we compared our models with several popular approaches, i.e. RGCCA^{17}, MOFA^{16}, and DIABLO^{12}. In our method, the high-dimensional and heterogeneous multiomics data are converted into Patient Similarity Networks (PSN). Topological features are then extracted from the networks, which achieves dimensionality reduction. Two techniques, i.e. feature-level fusion and network-level fusion, are proposed to integrate PSNs or topological features from various omics data. Thus, we compared our method with other approaches that handle feature reduction and multiomics integration in different ways. RGCCA and MOFA are unsupervised JDR techniques for discovering latent salient factors in omics features that can be readily fed into a downstream analysis. DIABLO is a supervised extension of sparse RGCCA that can be applied to a classification task on its own.
Results
The results of this research consist of DNN performances, relevance propagation, Recursive Feature Elimination on Machine Learning classifiers, and a comparison with existing multiomics integration techniques.
DNN performances
Survival status predictions with DNN were conducted on the SEQC and TARGET datasets. The results are shown in Tables 1 and 2, respectively. Since accuracy fails to measure performance fairly when the class distribution is imbalanced, we also recorded the F1 score and the ROC-AUC score, which work better on imbalanced datasets. The results are shown in the format of mean ± standard deviation obtained over different random splits for cross-validation.
Firstly, we performed single-omics analyses on both datasets as a reference for the model performance after multiomics integration. Then, for the multiomics approaches, we recorded results of both feature-level and network-level fusion. For network-level fusion, PSNs of individual omics datasets were fused with the SNF algorithm, and for feature-level fusion, the features of single-omics PSNs were averaged or concatenated and fed into the DNN. To study the contribution of centrality and modularity features, we also separated the two kinds of network features from the whole feature set and fed them into DNN models individually. These results are also shown in Tables 1 and 2, in the rows where the feature type is centrality or modularity.
Moreover, relevance propagation was applied on optimum models trained on all the datasets to compute the saliency scores of input features. Thereafter, the input features were removed one by one to track the performance variation, and best performances achieved by the abridged feature set are recorded in Tables 1 and 2 together with the feature dimensionality.
In the SEQC dataset, over 80% of the samples belong to the “alive” class, rendering the data distribution highly imbalanced, while in the TARGET dataset, the “alive” and “death” classes each take up around 50% of the samples. Since the F1 score balances precision and recall on the positive class and therefore measures performance on imbalanced datasets better, we gave priority to the F1 score when analysing the results on the SEQC dataset and focused on accuracy for the TARGET dataset.
To optimize the performance of DNN, we applied grid search on the hyperparameters to find the best results under specific configurations. The number of hidden layers, the number of neurons in the layers, the batch size, and the learning rate were fixed by experimenting with the validation set. The highest F1 score (0.54±0.09) was achieved on the SEQC dataset when the DNN architecture was [8, 64, 4, 8], the learning rate was 0.01, and the batch size was 8. On the TARGET dataset, the best accuracy (65.1±4.7%) was obtained with the structure [4, 4, 4], learning rate 0.01, and batch size 32.
As seen from Tables 1 and 2, the experiments on multiomics datasets with fusion generally achieved a higher F1 score or accuracy than prediction based on single-omics datasets. On the SEQC dataset, the highest accuracy (about 80%) and F1 score (around 0.54) were obtained with the feature-level fusion technique. Although the F1 score of the RNA-seq prediction was slightly better, its accuracy was not as good as that of feature-level fusion. On the TARGET dataset, the best accuracy (around 65.1%) was achieved by network-level fusion, which is better than the other techniques. This demonstrates the potential of our approach to integrating multiomics datasets.
However, network-level and feature-level fusion behaved differently on the SEQC and TARGET datasets. Notably, the two subsets used for fusion in the SEQC dataset, i.e. RNA-seq and microarray, both belong to gene expression omics but use different technologies to measure gene expression profiles. In the TARGET dataset, by contrast, RNA-seq and DNA methylation data belong to transcriptomics and epigenomics, respectively. We observed that feature-level fusion is preferred on the SEQC dataset, whereas network-level fusion performs better on the TARGET dataset. The underlying reason could be that, with homogeneous subsets as in the SEQC dataset, the features might be redundant, rendering the constructed PSNs unsuitable for integration by the SNF algorithm. In that case, combining the topological features extracted from individual PSNs, by averaging the centrality features and concatenating the modularity features, is more suitable than network-level fusion, and the machine learning algorithms can then select among the features by learning proper weights for them. In contrast, network-level fusion with SNF performs very well on the multiomics subsets of the TARGET dataset. Consequently, we claim that network-level fusion is generally better suited to integrating multiomics datasets, while feature-level fusion is more suitable for combining two homogeneous datasets.
Moreover, on both SEQC and TARGET datasets, models with only modularity features outperformed models with centrality features. This illustrates that a sample’s membership of the modules clustered in the PSN contributed more than a sample’s importance in the whole PSN to the clinical outcome prediction. Nevertheless, a better performance is obtained when both centrality and modularity features are involved in most cases.
Relevance propagation results
We applied Integrated Gradients as implemented by DeepExplain to compute attributions of the input features, because DeepExplain can be readily applied to Tensorflow V1 models. An attribution vector was generated for each sample, thus forming an attribution matrix of size \((n_{sample}, n_{feature})\). To obtain a single saliency score for each input feature, we calculated the magnitude of its attribution vector across all the samples. The input features were then removed one at a time according to their saliency ranks.
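The aggregation step can be illustrated with a few lines of NumPy. The sketch assumes that "magnitude" means the Euclidean (L2) norm of each feature's column of the attribution matrix; the function name and the toy matrix are hypothetical.

```python
import numpy as np

def rank_features(attributions):
    """Aggregate per-sample attributions into one saliency score per feature.

    attributions: array of shape (n_sample, n_feature).
    Returns the saliency scores and the feature indices ordered from the
    least to the most salient, i.e. the order in which features are removed.
    """
    saliency = np.linalg.norm(attributions, axis=0)   # assumed L2 magnitude
    removal_order = np.argsort(saliency)
    return saliency, removal_order

attributions = np.array([[3.0, 0.0, 1.0],
                         [4.0, 0.0, -1.0]])
saliency, removal_order = rank_features(attributions)
```

Using a norm rather than a plain sum prevents positive and negative attributions of the same feature from cancelling each other across samples.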
In Tables 1 and 2, the Abridged Feature type rows show the performance of DNN after eliminating insignificant features. We found that, in the process of removing the input features one by one, the DNN performance did not drop until most features had been eliminated. Specifically, for feature-level fusion on the SEQC dataset, when only 12 out of 233 features were preserved, the performance remained almost the same as that of the original models. For network-level fusion on the TARGET dataset, the performance was maintained until 11 out of 57 features were left. In terms of F1 score, eliminating the irrelevant features even enhanced the performance slightly on both datasets.
We then investigated the remaining features. The indices of the remaining features in the SEQC dataset are [25, 171, 125, 202, 51, 211, 118, 87, 14, 13, 15, 18], all of which represent memberships of modules clustered by the spectral or SBM algorithms. In the TARGET dataset, the indices are [35, 22, 31, 12, 46, 47, 33, 44, 5, 27, 14]. Two of the remaining features are centrality features, representing load centrality and iterative local clustering coefficient, and the other 9 features are modularity features.
RFE performances
In addition to DNN, we also explored the classification capabilities of several machine learning techniques with the network features. In the previous section, we compared the performance of DNN models with the original input feature set and the reduced feature set. In this section, similarly, we compare the performance of several classical classifiers before and after Recursive Feature Elimination (RFE). RFE is a feature selection technique and is explained in Algorithm 2. The feature set extracted after the proposed network-level or feature-level fusion is used for fitting the classifiers, and the results with and without RFE feature selection are presented in Tables 3 and 4.
Table 3 shows the results on the SEQC dataset. In terms of F1 score, feature-level fusion clearly outperformed network-level fusion. The highest F1 scores, achieved by the Logistic Regression estimator when 43 out of 234 features were selected, are around 0.71, close to the DNN's results. In Table 4, which presents the results on the TARGET dataset, the accuracy of network-level fusion is superior to that of feature-level fusion, in accordance with the results of the DNN models. The best accuracy, around 70.1%, is also achieved by the Logistic Regression classifier.
As shown in Tables 3 and 4, RFE did not always improve the accuracy and F1 score. In most cases, RFE produced results comparable to the baseline models after removing redundant features, while in some cases it lowered performance significantly (e.g. Logistic Regression with network-level fusion on the TARGET dataset). These results indicate that the feature saliency ranks given by the linear classifiers are less reliable than those given by relevance propagation in DNN models.
Moreover, the linear classifiers sometimes surpassed the DNN. The best accuracy of Logistic Regression on the TARGET dataset is about 70.1%, whereas the DNN's accuracy reaches at most 65.1%. On the SEQC dataset, however, the DNN still slightly outperformed the linear models with RFE.
Existing methods
Table 5 compares the performance of our best model with that of other popular approaches on the SEQC dataset. The three approaches, i.e. RGCCA, MOFA and DIABLO, all aim at reducing the feature dimensionality, discovering latent factors, and integrating multiomics data. Notably, RGCCA and MOFA are unsupervised methods, so we implemented a simple Logistic Regression model for the downstream classification analysis. DIABLO is a supervised method that provides its own function for evaluating performance. As shown, RGCCA tends to predict a single arbitrary label for all samples, which leads to unstable predictions with a high standard deviation and a low F1 score. A potential reason for its poor performance is that the components to select are difficult to determine when the input feature dimensionality is very large. MOFA achieves a higher F1 score than our method, but its accuracy is more than ten percent lower than ours. DIABLO also yields a lower accuracy than ours and, unfortunately, does not report the F1 score, which makes it difficult to evaluate its performance on an imbalanced dataset. Overall, our method outperformed the other currently popular approaches on the SEQC dataset.
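The reason accuracy alone can be misleading on an imbalanced dataset can be seen with a small worked example. The class ratio below (70 negatives, 30 positives) is hypothetical and is not the class distribution of SEQC or TARGET; the helper function is likewise only illustrative.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 of the positive class) for two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 is taken as 0 when undefined
    return accuracy, f1

# Hypothetical cohort: 70 negative and 30 positive patients.
y_true = [0] * 70 + [1] * 30

# A degenerate classifier that assigns every patient to the majority class.
all_negative = [0] * 100
acc, f1 = accuracy_and_f1(y_true, all_negative)
```

Here the degenerate classifier reaches 70% accuracy while its F1 score for the positive class is 0, which is why both metrics are needed to compare methods.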
Table 6 shows the comparison of results on the TARGET dataset. The best performance of our method, given by the Logistic Regression model, significantly exceeded that of the three existing approaches in terms of both accuracy and F1 score. This demonstrates that our method is effective at distilling the most important features for clinical outcome prediction in neuroblastoma.
Discussion and conclusion
We addressed two challenges in the integration and analysis of multiomics data, heterogeneity and high dimensionality, by using network-based methods. The multiomics data were used to build PSNs in which nodes represent patients, and the nodal features of the PSNs were used as the patients' features. This enables a huge reduction in feature dimensionality, from tens of thousands to tens. Multiomics data are heterogeneous, but the PSNs built from different omics data are homogeneous and can be readily combined since they have similar configurations. Building PSNs from multiomics data therefore allows both dimensionality reduction and conversion of different omics types into homogeneous networks. One current limitation of this research is that only two omics types are available in our datasets. In the future, we plan to apply our approach to datasets with more types of omics data and evaluate the performance of their integration.
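How a PSN turns a high-dimensional omics profile into a handful of nodal features can be sketched in a few lines. This is only an illustration under stated assumptions: Euclidean distance converted to a similarity by 1 / (1 + d), and weighted degree as the single nodal feature, are choices made for the example and are not the distance measure or the centrality set used in the paper.

```python
import math

def patient_similarity(profiles):
    """Build a dense patient-by-patient similarity matrix from omics
    profiles (one list of feature values per patient)."""
    n = len(profiles)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(profiles[i], profiles[j])    # Euclidean distance
            sim[i][j] = sim[j][i] = 1.0 / (1.0 + d)    # assumed conversion
    return sim

def weighted_degree(sim):
    """One nodal feature per patient: the sum of its edge weights."""
    return [sum(row) for row in sim]

# Toy profiles: 3 patients, 2 omics features each (made-up values).
toy_profiles = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
toy_sim = patient_similarity(toy_profiles)
toy_degree = weighted_degree(toy_sim)
```

Whatever the number of omics features per patient, each patient ends up described by as many values as there are nodal features computed on the network, which is the source of the dimensionality reduction.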
We achieved accuracies of about 79% and 70% on the SEQC and TARGET datasets, respectively, for clinical endpoint prediction in neuroblastoma, which are significant improvements over the accuracies obtained with only one omics data type. These results have practical implications, as the survival rate of neuroblastoma is about 50%. Although our experiments used only two omics types, our methods generalize to any number of omics datasets. Our experiments showed that network-level fusion, in which multiomics datasets are integrated by fusing homogeneous networks, performed better than simply combining the features of different PSNs. We used SNF for combining PSNs, but other techniques such as tensor-based integration may also be used. However, when the two subsets belong to the same omics type, e.g. RNA-Seq and microarray in the SEQC dataset, feature-level fusion, in which the centrality features are averaged and the modularity features are concatenated, tends to give better results.
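The feature-level fusion rule described above (averaging centrality features and concatenating modularity features) can be sketched as follows. The function name, the one-hot encoding of module membership, and the toy values are assumptions made for illustration and may differ from the released implementation.

```python
def fuse_feature_level(centrality_a, centrality_b, modularity_a, modularity_b):
    """Fuse per-patient features from two PSNs: centrality features are
    averaged element-wise, modularity features are concatenated."""
    fused = []
    for ca, cb, ma, mb in zip(centrality_a, centrality_b,
                              modularity_a, modularity_b):
        averaged = [(x + y) / 2 for x, y in zip(ca, cb)]
        fused.append(averaged + list(ma) + list(mb))
    return fused

# Toy example: 2 patients, 2 centrality measures per PSN (made-up values).
centrality_rnaseq     = [[0.25, 0.50], [0.50, 1.00]]
centrality_microarray = [[0.75, 0.00], [0.00, 0.50]]

# Module memberships, assumed one-hot: 2 modules in one PSN, 3 in the other.
modularity_rnaseq     = [[1, 0], [0, 1]]
modularity_microarray = [[0, 1, 0], [0, 0, 1]]

fused = fuse_feature_level(centrality_rnaseq, centrality_microarray,
                           modularity_rnaseq, modularity_microarray)
```

Averaging requires that both PSNs provide the same centrality measures in the same order, whereas concatenation places no such constraint on the number of modules found in each network.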
The aim of the DNN relevance propagation and RFE experiments was to identify the most important network features in the input set. For centrality features, we identified the algorithm most suitable for representing a node's importance in the network. For modularity features, we identified the essential modules that play key roles in clinical outcome prediction. Comparing the baseline feature set with the abridged feature set on the DNN, we found that although performance is comparable, the majority of the extracted features are insignificant for clinical outcome prediction in neuroblastoma, and only a fraction of the network features is highly related to the task. In the future, we shall investigate how these salient features can be extracted and identified in one shot, and how they serve to predict clinical endpoints.
However, on closer inspection of the RFE results, we found that the RFE strategy failed to improve performance and that the selected features varied dramatically across random splits, which prevented us from examining the selected features in a generalized way. Consequently, we believe that relevance propagation in DNN models is a more sound way than RFE to investigate the network features extracted from PSNs.
As the collection of multiomics data becomes more affordable, novel approaches to omics data integration and analysis are needed. By comparing our approach with several existing methods, we demonstrated the potential of network-based approaches in an application to neuroblastoma clinical outcome prediction. Our methods can be further explored on other cancers or complex diseases.
Data availability
The datasets used and analysed during the current study are from the SEQC and TARGET projects. The SEQC project data are available in the NCBI GEO database (https://www.ncbi.nlm.nih.gov/gds) under accession numbers GSE49710 and GSE62564, and the TARGET project data are available in the NIH GDC data portal (https://portal.gdc.cancer.gov/projects/TARGETNBL). The implementation of the proposed approaches is available in our GitHub repository (https://github.com/nomoresomethingwentwrong/PSN_Fusion).
References
Tranchevent, L.C. et al. Predicting clinical outcome of neuroblastoma patients using an integrative networkbased approach. Biol. Dir. 13, 1–13 (2018).
Ferlay, J. et al. Cancer incidence and mortality worldwide: Sources, methods and major patterns in globocan 2012. Int. J. Cancer 136, E359–E386 (2015).
Baghban, R. et al. Tumor microenvironment complexity and therapeutic implications at a glance. Cell Commun. Signal. 18, 1–19 (2020).
Krzyszczyk, P. et al. The growing role of precision and personalized medicine for cancer treatment. Technology 6, 79–100 (2018).
Wang, Z., Jensen, M. A. & Zenklusen, J. C. A practical guide to the cancer genome atlas (TCGA). in Statistical Genomics. 111–141 (Springer, 2016).
Rohart, F., Gautier, B., Singh, A. & Lê Cao, K.A. mixomics: An r package for ‘omics feature selection and multiple data integration. PLoS Comput. Biol. 13, e1005752 (2017).
Van De Wiel, M. A., Lien, T. G., Verlaat, W., van Wieringen, W. N. & Wilting, S. M. Better prediction by use of codata: Adaptive groupregularized ridge regression. Stat. Med. 35, 368–381 (2016).
Cun, Y. & Fröhlich, H. Network and data integration for biomarker signature discovery via network smoothed tstatistics. PloS One 8, e73074 (2013).
Sokolov, A., Carlin, D. E., Paull, E. O., Baertsch, R. & Stuart, J. M. Pathwaybased genomics prediction using generalized elastic net. PLoS Comput. Biol. 12, e1004790 (2016).
Zhang, L. et al. Deep learningbased multiomics data integration reveals two prognostic subtypes in highrisk neuroblastoma. Front. Genet. 9, 477 (2018).
Huang, Z. et al. Salmon: survival analysis learning with multiomics neural networks on breast cancer. Front. Genet. 10, 166 (2019).
Singh, A. et al. Diablo: An integrative approach for identifying key molecular drivers from multiomics assays. Bioinformatics 35, 3055–3062 (2019).
Li, W., Zhang, S., Liu, C.C. & Zhou, X. J. Identifying multilayer gene regulatory modules from multidimensional genomic data. Bioinformatics 28, 2458–2466 (2012).
Zhang, S. et al. Discovery of multidimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res. 40, 9379–9391 (2012).
Lock, E. F., Hoadley, K. A., Marron, J. S. & Nobel, A. B. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann. Appl. Stat. 7, 523 (2013).
Argelaguet, R. et al. Multiomics factor analysis—A framework for unsupervised integration of multiomics data sets. Mol. Syst. Biol. 14, e8124 (2018).
Tenenhaus, M., Tenenhaus, A. & Groenen, P. J. Regularized generalized canonical correlation analysis: A framework for sequential multiblock component methods. Psychometrika 82, 737–777 (2017).
Teschendorff, A. E., Jing, H., Paul, D. S., Virta, J. & Nordhausen, K. Tensorial blind source separation for improved analysis of multiomic data. Genome Biol. 19, 1–18 (2018).
Pai, S. & Bader, G. D. Patient similarity networks for precision medicine. J. Mol. Biol. 430, 2924–2938 (2018).
Wang, B. et al. Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11, 333 (2014).
Tranchevent, L.C., Azuaje, F. & Rajapakse, J. C. A deep neural network approach to predicting clinical outcomes of neuroblastoma patients. BMC Med. Genomics 12, 1–11 (2019).
Pugh, T. J. et al. The genetic landscape of highrisk neuroblastoma. Nat. Genet. 45, 279–284 (2013).
Zhang, W. et al. Comparison of RNAseq and microarraybased models for clinical endpoint prediction. Genome Biol. 16, 1–12 (2015).
Goodfellow, I., Bengio, Y., Courville, A. & Bengio, Y. Deep Learning. Vol. 1. (MIT Press, 2016).
Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002).
Consortium, S. et al. A comprehensive assessment of rnaseq accuracy, reproducibility and information content by the sequencing quality control consortium. Nat. Biotechnol. 32, 903 (2014).
Zhang, B. & Horvath, S. A general framework for weighted gene coexpression network analysis. in Statistical Applications in Genetics and Molecular Biology. Vol. 4. (2005).
Newman, M. Networks. (Oxford University Press, 2018).
Negre, C. F. et al. Eigenvector centrality for characterization of protein allosteric pathways. Proc. Natl. Acad. Sci. 115, E12201–E12208 (2018).
Katz, L. A new status index derived from sociometric analysis. Psychometrika 18, 39–43 (1953).
Schütze, H., Manning, C. D. & Raghavan, P. Introduction to Information Retrieval. Vol. 39 (Cambridge University Press, 2008).
Sullivan, D. What is google pagerank? A guide for searchers & webmasters. Search Engine Land (2007).
Goh, K.I., Kahng, B. & Kim, D. Universal behavior of load distribution in scalefree networks. Phys. Rev. Lett. 87, 278701 (2001).
Von Luxburg, U. A tutorial on spectral clustering. Stat. Comput. 17, 395–416 (2007).
Peixoto, T. P. Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models. Phys. Rev. E 89, 012804 (2014).
Ancona, M., Ceolini, E., Öztireli, C. & Gross, M. Towards better understanding of gradientbased attribution methods for deep neural networks. arXiv preprint arXiv:1711.06104 (2017).
Shrikumar, A., Greenside, P., Shcherbina, A. & Kundaje, A. Not just a black box: Learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713 (2016).
Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. in International Conference on Machine Learning. 3319–3328 (PMLR, 2017).
Bach, S. et al. On pixelwise explanations for nonlinear classifier decisions by layerwise relevance propagation. PloS One 10, e0130140 (2015).
Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. in International Conference on Machine Learning. 3145–3153 (PMLR, 2017).
Gupta, S. et al. Obtaining leaner deep neural networks for decoding brain functional connectome in a single shot. Neurocomputing 453, 326–336 (2021).
Pedregosa, F. et al. Scikitlearn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
McKinney, W. et al. Data structures for statistical computing in python. in Proceedings of the 9th Python in Science Conference. Vol. 445. 51–56 (2010).
Wilcoxon, F. Individual comparisons by ranking methods. in Breakthroughs in Statistics. 196–202 (Springer, 1992).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (Methodological) 57, 289–300 (1995).
King, G. & Zeng, L. Logistic regression in rare events data. Polit. Anal. 9, 137–163 (2001).
Cawley, G. C. & Talbot, N. L. On overfitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 11, 2079–2107 (2010).
Acknowledgements
This research was partially supported by AcRF Tier1 2019T1002057 grant by the Ministry of Education, Singapore.
Contributions
J.C.R. and C.W. conceived the ideas; C.W. and W.L. conducted the experiments; C.W. analysed the results; R.K. and P.K. shared their expertise and guided experiments; J.C.R. supervised and oversaw the project; All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, C., Lue, W., Kaalia, R. et al. Networkbased integration of multiomics data for clinical outcome prediction in neuroblastoma. Sci Rep 12, 15425 (2022). https://doi.org/10.1038/s41598022190195
DOI: https://doi.org/10.1038/s41598022190195