A novel liver cancer diagnosis method based on patient similarity network and DenseGCN

Liver cancer is one of the leading malignancies in terms of mortality rate, and accurate diagnosis can improve its treatment outcome. The patient similarity network is an important source of information for cancer diagnosis, yet recent works rarely take patient similarity into consideration. To address this issue, we constructed a patient similarity network from three types of liver cancer omics data and proposed a novel liver cancer diagnosis method consisting of similarity network fusion, a denoising autoencoder and a dense graph convolutional neural network, which capitalizes on both the patient similarity network and multi-omics data. We compared the proposed method with state-of-the-art methods and classical machine learning methods on the TCGA-LIHC dataset to evaluate its performance. The results show that the proposed method surpasses the comparison methods on all metrics, attaining an accuracy of up to 0.9857.

www.nature.com/scientificreports/
used for cancer diagnosis. Lyu and Haque 21 transformed gene expression data into 2-D images and fed the images into a convolutional neural network to classify 33 tumor types. Gao et al. 22 introduced a cancer diagnosis method named DeepCC, which performs gene enrichment analysis to transform gene expression data into functional spectra and then feeds the resulting functional spectra into a multilayer neural network for training. For both colorectal and breast cancer, DeepCC outperforms random forest (RF) and SVM in cancer subtype classification. However, previous deep learning-based models mainly use single omics data, which cannot describe all the features of cancer 23 and therefore limits the performance of deep learning in cancer diagnosis. Accordingly, cancer diagnosis methods based on multiple omics data are increasingly adopted 24-26. Sun et al. 27 proposed a deep learning method based on model fusion, named MDNNMD, for breast cancer prognosis. They used two types of omics data, gene expression data and copy number variation (CNV), as well as clinical data, constructed a separate deep neural network (DNN) model for each of the three types of data, and fused the prediction scores of the three independent models into the final prediction result. Zhang et al. 28 used a variational autoencoder (VAE) to integrate methylation data and gene expression data for cancer diagnosis. They evaluated their method with ten-fold cross-validation on 33 types of cancer and obtained a final accuracy of 97.49%. Copy number variation, gene expression and methylation data were used in these studies, which indicates that all three bring useful information to cancer diagnosis. Thus, all three types of omics data were selected in this work.
Previous studies have often used only genomics data. Interpretability is particularly required in genomics, both because sample sizes are relatively small and because understanding the molecular causes of disease is needed to design targeted therapies 29. A patient similarity network (PSN) can address these problems: it is well suited to integrating multi-omics data and generating interpretable models 30. However, previous works rarely took patient similarity into account. To address this issue, we integrated three types of liver cancer omics data and calculated the similarity between patients. Because the similarity network is non-Euclidean data, conventional neural networks such as CNNs have difficulty handling it 31,32. Thus, a graph convolutional network (GCN), which is well suited to processing non-Euclidean data, is used in this work. Meanwhile, since omics data have small sample sizes, we need a deeper network to fit the data and thus avoid the disadvantages associated with the small sample size 33. However, GCNs rarely have more than four layers because of the vanishing gradient problem 34. To deal with this challenge, we selected the dense graph convolutional neural network (DenseGCN) 35. DenseGCN improves information flow in the network by densely connecting different layers, which allows it to overcome the vanishing gradient problem and make the GCN architecture deeper, thus enabling better utilization of the patient similarity network and multi-omics data for cancer diagnosis. To the best of our knowledge, this is the first effort to employ DenseGCN in the field of cancer diagnosis.
In this work, a novel liver cancer diagnosis method (pDenseGCN) based on a patient similarity network and DenseGCN is proposed. We first used similarity network fusion (SNF) to construct the patient similarity network from three types of liver cancer omics data. Then, we extracted latent embedding representations of the omics data using a denoising autoencoder (DAE), which provides a more precise representation of liver cancer. Finally, we adopted DenseGCN for liver cancer diagnosis based on the patient similarity network and the latent representations of the omics data. By incorporating the PSN as supplemental information, the model obtains a more comprehensive view of cancer and achieves better performance on liver cancer diagnosis. In our experiments, pDenseGCN achieved an accuracy of 0.9857 and performed better than five state-of-the-art methods and four machine learning methods.
The main contributions of this paper are as follows.
• A novel deep learning method, named pDenseGCN, is proposed for effective liver cancer diagnosis.
• pDenseGCN utilizes SNF to construct a patient similarity network based on multi-omics data and thus captures the similarity information between patients, which helps in liver cancer diagnosis.
• pDenseGCN adopts DenseGCN as the classifier. DenseGCN connects different layers densely to improve information flow in the network, which can overcome the vanishing gradient problem. This brings better results in liver cancer diagnosis.

Methods
Proposed method. The proposed method pDenseGCN has three components. The first component generates the patient similarity network from the omics datasets: three omics datasets are used as the input of the similarity network fusion method to produce the patient similarity network. The second component extracts features with denoising autoencoders: RNA-Seq, DNA methylation and CNV data are each fed into a separate denoising autoencoder to obtain low-dimensional features. The third component feeds the resulting patient similarity network and feature matrix into a dense graph convolutional network (DenseGCN) for training and prediction, which completes the cancer prediction framework. Figure 1 describes the overall workflow of pDenseGCN.
Construction of patient similarity network. To construct the patient similarity network (PSN), we employed similarity network fusion (SNF), which can make full use of multi-omics data 36. SNF is applied to combine RNA-seq, DNA methylation and CNV data to generate a patient similarity network. Assume that there are n patients, each of whom has m types of data (such as RNA-Seq and DNA methylation). We represent the PSN as a graph G = (V, E), where V represents the set of patients {x_1, x_2, x_3, ..., x_n} and the edges E correspond to the similarity between vertices v ∈ V. The edge weights are represented by an n × n similarity matrix W, which is computed by Eq. (1).
where α is a hyperparameter, φ(x_i, x_j) is the Euclidean distance between patients x_i and x_j, and γ_{i,j} is used to eliminate the scaling problem. To compute the fused matrix from multiple types of data, the similarity matrix is normalized as Eq. (2).
Let N_i be the set of x_i's neighbors. The local affinity matrix S is then calculated by Eq. (3).
where S^(h) represents the local affinity matrix of the h-th type of data. Through this process of iterative fusion, a patient similarity network that contains complementary information from the three omics datasets is finally obtained. The fused network can be used for classification or clustering; in this work, it is taken as the input of DenseGCN for cancer diagnosis.
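The fusion procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the commonly used SNF formulation (a scaled exponential kernel for W, a row normalization with 1/2 on the diagonal for the full kernel, a k-nearest-neighbor kernel for S, and cross-diffusion between data types); the function names, the neighborhood size `k`, the value of `alpha` and the number of iterations are illustrative assumptions, not the settings used for pDenseGCN.

```python
import numpy as np


def affinity_matrix(X, k=3, alpha=0.5):
    """Scaled exponential similarity between the rows (patients) of X."""
    # Pairwise Euclidean distances phi(x_i, x_j).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Mean distance from each patient to its k nearest neighbours (self excluded).
    knn_mean = np.sort(dist, axis=1)[:, 1:k + 1].mean(axis=1)
    # gamma_ij averages the two neighbourhood scales and the pair distance.
    gamma = (knn_mean[:, None] + knn_mean[None, :] + dist) / 3.0
    return np.exp(-dist ** 2 / (alpha * gamma + 1e-12))


def full_kernel(W):
    """Normalised similarity: 1/2 on the diagonal, off-diagonal entries of a row sum to 1/2."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P


def local_kernel(W, k=3):
    """Local affinity: keep each patient's k most similar neighbours, row-normalised."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    idx = np.argsort(-W, axis=1)[:, :k]
    rows = np.arange(W.shape[0])[:, None]
    S = np.zeros_like(W)
    S[rows, idx] = W[rows, idx]
    return S / S.sum(axis=1, keepdims=True)


def snf(views, k=3, alpha=0.5, iterations=20):
    """Fuse several patient-by-feature matrices (at least two) into one similarity matrix."""
    Ws = [affinity_matrix(X, k, alpha) for X in views]
    Ps = [full_kernel(W) for W in Ws]
    Ss = [local_kernel(W, k) for W in Ws]
    m = len(views)
    for _ in range(iterations):
        updated = []
        for h in range(m):
            # Diffuse the average of the other data types through the local kernel of type h.
            others = sum(Ps[j] for j in range(m) if j != h) / (m - 1)
            updated.append(full_kernel(Ss[h] @ others @ Ss[h].T))
        Ps = updated
    fused = sum(Ps) / m
    return (fused + fused.T) / 2.0  # enforce symmetry
```

Calling `snf([rna, methylation, cnv])` on three matrices with the same patients in the same row order returns a single n × n matrix that can serve as the weighted adjacency matrix of the PSN.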

Feature extraction by denoising autoencoder.
To reduce the noise in the raw omics data and the computational cost, we constructed three independent denoising autoencoders to extract latent embedding representations from the three omics datasets, respectively. The autoencoder (AE) is a neural network that typically contains two networks: an encoder network and a decoder network. The encoder network takes a feature vector x ∈ ℜ^d as input and encodes it into a low-dimensional representation y ∈ ℜ^q, defined as f_e : x → y. The decoder network maps the low-dimensional representation y back to the input space, defined as f_d : y → z. The autoencoder is optimised by minimizing the reconstruction loss L between the original input x and the reconstructed input z as Eq. (5).
where f_e and f_d represent the parameters of the encoder network and the decoder network, respectively. In this work, a denoising autoencoder (DAE) 37 is applied to extract the latent embedding representation. The architecture of a DAE is the same as that of an AE, but the training procedure is different. The DAE first corrupts the input data by adding noise, and the corrupted input x_noise is then fed to the autoencoder. By recovering the original input from the corrupted data, the DAE extracts a robust latent embedding representation. We use the mean squared error as the loss function to train the DAE. The latent embedding representations extracted by the three independent DAEs are concatenated and then passed, together with the patient similarity network, to the next stage.
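The training idea behind the DAE (corrupt the input, reconstruct the clean input, minimise the mean squared error) can be illustrated with a small NumPy sketch. It assumes a single tanh encoder layer, a linear decoder, additive Gaussian noise and full-batch gradient descent; these choices, the function name `train_dae` and all hyperparameter values are assumptions made for illustration and do not reproduce the PyTorch architecture used in this work.

```python
import numpy as np


def train_dae(X, latent_dim=8, noise_std=0.1, lr=0.5, epochs=200, seed=0):
    """Train a one-hidden-layer denoising autoencoder; return (encode, losses)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_e = rng.normal(0.0, 0.1, size=(d, latent_dim))  # encoder weights (f_e)
    b_e = np.zeros(latent_dim)
    W_d = rng.normal(0.0, 0.1, size=(latent_dim, d))  # decoder weights (f_d)
    b_d = np.zeros(d)
    losses = []
    for _ in range(epochs):
        # Corrupt the input with Gaussian noise (x_noise).
        x_noise = X + rng.normal(0.0, noise_std, size=X.shape)
        y = np.tanh(x_noise @ W_e + b_e)  # latent representation
        z = y @ W_d + b_d                 # reconstruction
        err = z - X                       # compared against the CLEAN input
        losses.append(float((err ** 2).mean()))
        # Backpropagate the mean squared error.
        grad_z = 2.0 * err / err.size
        grad_W_d = y.T @ grad_z
        grad_b_d = grad_z.sum(axis=0)
        grad_pre = (grad_z @ W_d.T) * (1.0 - y ** 2)  # derivative of tanh
        grad_W_e = x_noise.T @ grad_pre
        grad_b_e = grad_pre.sum(axis=0)
        W_d -= lr * grad_W_d
        b_d -= lr * grad_b_d
        W_e -= lr * grad_W_e
        b_e -= lr * grad_b_e

    def encode(X_new):
        """Map clean inputs to their latent embedding."""
        return np.tanh(X_new @ W_e + b_e)

    return encode, losses


def extract_features(views, latent_dim=8):
    """Train one DAE per omics matrix and concatenate the latent embeddings."""
    return np.hstack([train_dae(X, latent_dim)[0](X) for X in views])
```

With three omics matrices as `views`, `extract_features` returns one feature matrix whose width is three times `latent_dim`, mirroring the concatenation step described above.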
DenseGCN. The patient similarity network constructed by SNF is non-Euclidean data that CNNs fail to handle 32, so GCN is considered in this work because of its advantages in processing non-Euclidean data 38. However, the original GCN model is usually very shallow due to the vanishing gradient problem, which limits the ability of GCN to fit the data 35. Therefore, an improved GCN model named DenseGCN is used in this work.
GCN takes as input a feature matrix X, which describes every node in the graph, and an adjacency matrix A, which describes the structure of the graph, and generates a node-level output matrix Z. The layer-wise propagation rule of GCN can be formulated as Eq. (6).
where H^(N) is the output of the N-th layer and W^(N−1) is the weight matrix of the (N−1)-th layer. f(·) represents the graph convolution operation, and σ(·) is an activation function, which is usually non-linear. This rule is valid but still has some limitations. First, the feature vectors of all neighboring nodes are taken into consideration, but the node itself is ignored. This limitation can be fixed by adding self-connections to the adjacency matrix A, defined as Â = A + E, where E represents the identity matrix. Second, A is usually not normalized, which means that the scale of the feature vectors changes when they are multiplied by A. To remove this limitation, symmetric normalization, defined as D^(−1/2) A D^(−1/2), is applied to standardize A, where D is the diagonal node degree matrix. Thus, the propagation rule is reformulated as Eq. (7).
Theoretically, deeper networks are able to learn more abstract representations and require less data for training than shallow neural networks 33,39, while omics data are characterized by high dimensionality and few samples. This indicates that deep networks are more applicable to omics data. However, GCN is usually very shallow because of the vanishing gradient problem 35, and most state-of-the-art GCNs have fewer than 4 layers 34. Inspired by the dense connectivity of DenseNet 40, a similar idea is adapted to GCN to improve information flow in the network and avoid the vanishing gradient problem 35. This dense model, named DenseGCN, has a new propagation rule, which is defined as Eq. (8).
where H^(0) is the input feature matrix X and T(·) represents a vertex-wise concatenation function. The structure of DenseGCN is shown in Fig. 2.
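For reference, the three propagation rules discussed in this section are commonly written as below. This is a sketch that assumes the standard GCN formulation with self-connections, where \hat{A} = A + E and \hat{D} is the diagonal degree matrix of \hat{A}, together with the dense connectivity pattern of ref. 35; the exact notation of Eqs. (6)-(8) may differ.

```latex
% Basic layer-wise propagation rule (cf. Eq. (6))
H^{(N)} = f\left(H^{(N-1)}, A\right) = \sigma\left(A\, H^{(N-1)}\, W^{(N-1)}\right)

% With self-connections and symmetric normalization (cf. Eq. (7))
H^{(N)} = \sigma\left(\hat{D}^{-1/2}\, \hat{A}\, \hat{D}^{-1/2}\, H^{(N-1)}\, W^{(N-1)}\right),
\qquad \hat{A} = A + E

% Dense connectivity (cf. Eq. (8)): each layer output is concatenated with all earlier outputs
H^{(N)} = T\left(f\left(H^{(N-1)}, A\right),\, H^{(N-1)}\right)
        = T\left(f\left(H^{(N-1)}, A\right),\, \ldots,\, f\left(H^{(0)}, A\right),\, H^{(0)}\right)
```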
In summary, the original GCN is limited by the vanishing gradient problem, which makes it difficult to build a deep network architecture. In contrast, DenseGCN improves the flow of information by connecting layers densely to address the vanishing gradient problem, and can therefore have a deeper network architecture than the original GCN. Thus, DenseGCN is more suitable for omics datasets.
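A forward pass through densely connected graph convolution layers can be sketched as follows. The sketch assumes ReLU activations, a fixed number of new channels per layer (`growth`) and random weights; these names and values are illustrative assumptions, and the classifier head, training loop and hyperparameters of pDenseGCN are not shown.

```python
import numpy as np


def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def init_weights(in_dim, growth, n_layers, rng):
    """Layer i receives in_dim + i * growth input columns and emits `growth` new ones."""
    return [rng.normal(0.0, 0.1, size=(in_dim + i * growth, growth))
            for i in range(n_layers)]


def dense_gcn_forward(X, A, weights):
    """Each layer sees the concatenation of the input and all earlier layer outputs."""
    A_norm = normalize_adjacency(A)
    H = X  # H^(0)
    for W in weights:
        new = np.maximum(A_norm @ H @ W, 0.0)  # one graph convolution with ReLU
        H = np.concatenate([new, H], axis=1)   # vertex-wise concatenation T(.)
    return H
```

Because the earlier outputs are carried forward unchanged, every layer has a direct path to the input features, which is the property that lets gradients reach shallow layers in a deep stack. The returned matrix would typically be passed to a final classification layer.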

Results
A series of experiments was conducted to evaluate the performance of the proposed method pDenseGCN. First, pDenseGCN was compared with five state-of-the-art methods, namely ASVM 41, Xgboost-AD 42, MGRFE-GaRFE 43, ET-SVM 44 and XOmiVAE 45, and four machine learning methods, namely Linear Discriminant Analysis (LDA), Naïve Bayes (NB), Random Forest (RF) and Decision Tree (DT). Then we investigated the influence of the patient similarity network, the types of omics data, the number of DenseGCN layers and the number of features.
Datasets and data preprocessing. We evaluated pDenseGCN on the Liver Hepatocellular Carcinoma (LIHC) omics datasets acquired from the TCGA portal (https://www.cancer.gov/tcga). An R package named TCGA-assembler 46 was used to obtain the DNA methylation, RNA-seq and CNV data of LIHC. The details of these three datasets are described in Table 1. Similar to the previous literature 47, the three datasets were preprocessed in the following steps. The first step is outlier removal: we deleted the features with more than 20% missing values and, similarly, removed the samples with more than 20% missing features. After this step, 404 common samples remained. The next step is missing-data imputation: we used the mean of the remaining features to impute the missing values with the Python package sklearn 48. Finally, the three datasets were normalized according to Eq. (1).
where X is any column in the omics dataset, X_nor is the corresponding column after normalization, X_max is the maximum value in X and X_min is the minimum value in X.
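The three preprocessing steps can be sketched with NumPy as below. The sketch assumes that missing values are encoded as NaN, that imputation uses the per-feature (column) mean, and that the normalization is the min-max form implied by the definitions above, X_nor = (X − X_min) / (X_max − X_min); the function name `preprocess` is hypothetical. In scikit-learn, `SimpleImputer(strategy="mean")` and `MinMaxScaler` provide the last two steps.

```python
import numpy as np


def preprocess(X, max_missing=0.2):
    """Filter, impute and min-max scale a samples-by-features matrix containing NaNs."""
    X = np.asarray(X, dtype=float)
    # 1. Drop features (columns) with more than 20% missing values.
    keep_cols = np.isnan(X).mean(axis=0) <= max_missing
    X = X[:, keep_cols]
    # 2. Drop samples (rows) with more than 20% missing features.
    keep_rows = np.isnan(X).mean(axis=1) <= max_missing
    X = X[keep_rows]
    # 3. Impute the remaining missing values with the column mean.
    col_mean = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_mean, X)
    # 4. Min-max scale each column to [0, 1]; constant columns map to 0.
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span, keep_rows, keep_cols
```

The returned row mask makes it possible to intersect the retained samples across the three omics datasets so that only common patients are kept.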

Evaluation metrics.
To fully evaluate different methods, accuracy, precision, recall, F1-score 49 , and AUC 50 were used as the metrics. All of them are defined as follows.
Accuracy: The ratio of correct predictions. Accuracy can be calculated as Eq.
Experiment and parameter settings. For each omics dataset, 60% of the data was randomly selected to train the models, 20% was randomly selected as the validation set, and the remaining 20% was used for testing. To reduce the deviation, we repeated the experiments five times and took the average of the five runs as the final result. All of our models were implemented using PyTorch. The experiments were executed on a PC with an Intel Core i7-10700 processor at 2.90 GHz and 32.0 GB of RAM. The relevant parameters of the methods are listed in this part. For pDenseGCN, we determined the optimal learning rate (Lr) and batch size by grid search. For the comparison algorithms, the parameters given in the original papers were slightly modified to make them more suitable for our dataset. Table 2 describes the detailed parameters.
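The 60/20/20 split and the five evaluation metrics can be sketched as follows. The sketch assumes binary labels (1 for the positive class), uses the usual confusion-matrix definitions of accuracy, precision, recall and F1-score, and computes AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one; the function names are hypothetical.

```python
import numpy as np


def split_indices(n, rng, train=0.6, val=0.2):
    """Randomly partition range(n) into train, validation and test index arrays."""
    perm = rng.permutation(n)
    n_train, n_val = int(round(train * n)), int(round(val * n))
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]


def binary_metrics(y_true, y_pred, y_score):
    """Accuracy, precision, recall, F1-score and AUC for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    y_score = np.asarray(y_score, dtype=float)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # AUC via pairwise comparison of positive and negative scores (ties count one half).
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    auc = float((greater + 0.5 * ties) / (len(pos) * len(neg)))
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1, "auc": auc}
```

Repeating the split with five different random seeds and averaging the resulting dictionaries corresponds to the averaging over five runs described above.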
Comparison with other methods. To validate the performance of pDenseGCN, we compared it with five state-of-the-art methods and four machine learning methods. We replicated them according to their publications or using publicly available programs. The details of the five state-of-the-art methods are described below.
• ASVM 41 is a novel multilayer recursive feature elimination algorithm based on an embedded variable-length encoding genetic algorithm, aimed at cancer classification. It utilizes the Shuffled Frog Leaping algorithm to adaptively adjust the parameters of the Support Vector Machine based on data attributes to classify early-stage cancers.
• Xgboost-AD 42 is a novel cancer classification method. It integrates multi-omics data with an autoencoder and utilizes extreme gradient boosting to classify cancer accurately.
• MGRFE-GaRFE 43 aims to use fewer genes for better cancer classification results. It applies a multilayer recursive feature elimination method based on an embedded genetic algorithm to obtain a better feature subset for cancer classification.
• ET-SVM 44 adopts extra trees and a variance threshold to select features from gene expression data, and uses these important features to classify cancer with an SVM.
• XOmiVAE 45 is an interpretable deep learning model for cancer diagnosis based on a variational autoencoder. It uses the variational autoencoder to extract low-dimensional representations from genomics data, which are then fed into a multilayer perceptron for cancer classification.
The results are displayed in Table 3. As seen in Table 3, pDenseGCN performs better than the other methods on all metrics on the LIHC dataset. pDenseGCN achieves 98.57% accuracy, which is 1.31% better than the best remaining method, Xgboost-AD, and up to 23.9% better than the other comparison methods. On the other four metrics, pDenseGCN also achieves the best performance, with improvements of up to 26.03%, 35.49%, 22.46% and 24.09% over the other methods in precision, recall, F1-score and AUC, respectively. This indicates that, by introducing the patient similarity network, the proposed method is more advantageous in cancer diagnosis and more applicable to the LIHC dataset.
The influence of patient similarity network. Constructing the patient similarity network is an important component of pDenseGCN, since the patient similarity network allows DenseGCN to gain information from neighboring patients. To investigate the influence of the patient similarity network on cancer diagnosis, we designed two experiments: one took the patient similarity network as the input, and the other took an identity matrix as the input. The results are presented in Fig. 3. As Fig. 3 shows, the model trained with the patient similarity network outperforms the model trained without it. In terms of precision, recall, F1-score, accuracy and AUC, the model trained with the patient similarity network is 9.29%, 15.6%, 12.6%, 12.2% and 12.1% higher, respectively, than the model trained without it. This demonstrates that, by introducing a patient similarity network, pDenseGCN takes the information from neighboring patients into consideration when predicting the label of a patient, which effectively improves the classification results.
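Why an identity matrix acts as a "no similarity" baseline can be seen from a short sketch. It assumes the symmetric normalization with self-connections described in the DenseGCN section and a ReLU activation; the helper names are illustrative. With an identity adjacency matrix the normalized matrix is again the identity, so a graph convolution reduces to an ordinary fully connected layer applied to each patient independently.

```python
import numpy as np


def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(H, A, W):
    """One graph convolution with ReLU on the normalized adjacency matrix."""
    return np.maximum(normalize_adjacency(A) @ H @ W, 0.0)


rng = np.random.default_rng(0)
X = rng.random((4, 3))            # 4 patients, 3 features
W = rng.normal(size=(3, 2))       # layer weights

# Identity input: no information is exchanged between patients.
no_psn = gcn_layer(X, np.eye(4), W)
per_patient = np.maximum(X @ W, 0.0)

# Similarity input: each patient's output mixes in its neighbours' features.
similarity = np.array([[0.0, 0.8, 0.1, 0.0],
                       [0.8, 0.0, 0.2, 0.0],
                       [0.1, 0.2, 0.0, 0.5],
                       [0.0, 0.0, 0.5, 0.0]])
with_psn = gcn_layer(X, similarity, W)
```

Here `no_psn` equals `per_patient`, whereas `with_psn` generally differs because each row aggregates the features of similar patients before the weights are applied.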
Effectiveness of different omics data. We carried out experiments with various types of data to confirm the effectiveness of the individual omics data and the effect of combining multi-omics data. The results are displayed in Table 4. In Table 4, RNA-seq, DNA Methylation and CNV represent the three single omics data; RNA-Seq+DNAMethy, RNASeq+CNV and DNAMethy+CNV represent the three pairwise combinations; and Multi-Omics represents our proposed method with all three omics data. Table 4 shows that the performance of the proposed method rises as the number of data types increases. The models trained with single omics data have an accuracy of at most 0.8286, whereas the lowest accuracy among the models trained with two omics data is 0.9. The optimal performance, an accuracy of 0.9857, is attained when the model is trained with all three kinds of omics data. This confirms that multiple omics outperform single omics and that the performance improves progressively as the number of omics data types increases. This indicates that different omics data contain complementary information, which provides a comprehensive view of cancer and improves the result of cancer diagnosis. In addition, the model trained with DNA methylation performs best among the three single omics data, which may indicate that DNA methylation contains more information that facilitates cancer diagnosis.
The effect of DenseGCN layer numbers. To explore the effect of the number of DenseGCN layers on the final result, we designed several models with different numbers of layers. The results are shown in Table 5. As seen in Table 5, unlike conventional GCN models, pDenseGCN still performs well when the number of layers is more than three. Moreover, excluding the 7-layer and 8-layer models, the performance of pDenseGCN increases gradually with the number of layers. This illustrates that a deep network is more suitable for fitting omics data and can achieve better performance than a shallow network in cancer diagnosis. When the number of pDenseGCN layers reaches 10, multiple metrics, namely recall, F1-score, accuracy and AUC, are at their best. However, as the number of layers keeps increasing, these metrics gradually decline. This is probably because although DenseGCN overcomes gradient vanishing by densely connecting layers to ...
... Fig. 4. As can be seen from Fig. 4, when the number of features ranges from 100 to 300, the three metrics F1-score, accuracy and AUC exhibit an obvious rising trend. This may be because the useful information that the model can learn gradually increases as the number of features increases. The proposed method reaches its best performance when the number of features is set to 300, with an accuracy of 0.9857. As the number of features continues to grow, the performance of the proposed method begins to decrease gradually. This indicates that when the number of features is large, irrelevant or redundant information may be incorporated, which harms the performance of the model. Thus, 300 is selected as the number of features in this work.

Conclusion
Liver cancer is one of the most common malignant tumors worldwide and has a poor prognosis. Since effective diagnosis helps to improve the treatment of liver cancer, there is an urgent need for a method that can diagnose liver cancer accurately. In this work, we establish a novel method, pDenseGCN, which consists of similarity network fusion, a denoising autoencoder and a dense graph convolutional network for liver cancer diagnosis. pDenseGCN uses multi-omics data to construct a patient similarity network, which brings more patient information to cancer diagnosis. We explore the differences between the results of pDenseGCN trained with and without the patient similarity network, and the results indicate that the similarity information does contribute to cancer diagnosis. In addition, since the patient similarity network is non-Euclidean data and the omics data are characterized by high dimensionality and few samples, pDenseGCN utilizes a densely connected graph convolutional neural network to fit them better. Compared with state-of-the-art methods, pDenseGCN achieves better results on the final prediction performance metrics, which demonstrates that pDenseGCN is a promising method for liver cancer diagnosis. In future work, we plan to extend the proposed method to multi-class classification tasks, such as cancer subtype classification and pan-cancer classification.