Identification of a gene signature for discriminating metastatic from primary melanoma using a molecular interaction network approach

Understanding the biological factors that are characteristic of metastasis in melanoma remains a key approach to improving treatment. In this study, we seek to identify a gene signature of metastatic melanoma. We configured a new network-based computational pipeline, combined with a machine learning method, to mine publicly available transcriptomic data from melanoma patient samples. Our method is unbiased and scans a genome-wide protein-protein interaction network using a novel formulation for network scoring. Using this, we identify the most influential, differentially expressed nodes in metastatic as compared to primary melanoma. We evaluated the shortlisted genes by a machine learning method to rank them by their discriminatory capacities. From this, we identified a panel of 6 genes, ALDH1A1, HSP90AB1, KIT, KRT16, SPRR3 and TMEM45B whose expression values discriminated metastatic from primary melanoma (87% classification accuracy). In an independent transcriptomic data set derived from 703 primary melanomas, we showed that all six genes were significant in predicting melanoma specific survival (MSS) in a univariate analysis, which was also consistent with AJCC staging. Further, 3 of these genes, HSP90AB1, SPRR3 and KRT16 remained significant predictors of MSS in a joint analysis (HR = 2.3, P = 0.03) although, HSP90AB1 (HR = 1.9, P = 2 × 10−4) alone remained predictive after adjusting for clinical predictors.

: Identification of response networks. The unweighted directed melanoma network is weighted using gene expression values of different conditions. Networks of 2 conditions A and B are shown here. Shortest-path computation on A and B identifies top-activity paths from node 1 to 11 and node 6 to 9. Comparisons of these paths between two conditions identifies the paths with maximum difference that are referred to as perturbed paths. The paths with highest perturbations are result of systems' response to progression from one condition to other. The network from such paths form response network. Nodes colors in blue represent low expression levels, red shades represent high expression levels and grey is no expression. Edges width is based on expression weights. High activity paths traverse through highly expressed nodes.
To identify the difference between paths of two conditions, the paths were treated as strings and string similarity was computed. The Jaro-Winkler distance is a string similarity matching metric, originally used to study record-linkage.
Where is the number of matching characters and is the half of the number of transpositions while 1 , 2 are the two strings that are being matched. The criterion for a match between the two strings is given by ⌊ (| 1 |,| 1 |) 2 ⌋ − 1. The Jaro distance is given by where is a scaling factor (0.1) which gives extra weightage for strings that match from the beginning for a set prefix length .

Influence score
Degree conserved (DC) is a measure where a ratio is computed between the degree of node v in the top-perturbed path network to degree of node v in preliminary melanoma network. Higher ratio signifies that the node and its connections are important for the specific disease or conditions and minimum ratio means the node and neighbors have a low role to play.

Eccentricity (E)
is a node centrality index. The eccentricity of node v is calculated by computing shortest paths between node v and all other nodes in the network, then choosing the longest shortest path. Suppose w is the farthest node from v and the length is dist(v,w), eccentricity of the node v is (1/ dist(v,w)). If the eccentricity is high that means all other nodes are in the proximity of the node v and if eccentricity is low then the nodes are far from the vicinity of this node.
Nodes with high eccentricity will have the longest shortest paths and hence will be linked to proteins that have the highest 'reach' in the network. In biological networks, such nodes are most likely to exert an influence on the highest number of nodes in the network. In contrast, a protein with low eccentricity will have fewer proteins to influence and the overall effect may be minimal.
Betweenness Centrality (BC) is a node centrality index. The betweenness of node v is calculated as the ratio of shortest paths between nodes (m,n) which pass through node v to the total number of shortest paths between the nodes m and n (m,n) In biological networks, high betweenness centrality reflects a measure of importance of that node in reaching other distant communicating proteins together.

Machine Learning Methods
Extra Trees algorithm is based on based on randomized decision trees. The algorithm uses the perturb-and-combine technique which is specifically designed for trees. In this method, a diverse set of classifiers are initiated and the creation of these classifiers is achieved by introducing randomness. The prediction accuracy is given as the average prediction of the individual classifiers 1 . This method is similar to the random forests, as in, a random subset of candidate features is used, but the thresholds are picked randomly and the best of these randomly generated thresholds are used as the rule for splitting. The advantage of this method is that it allows in the reduction of variance of the model but on the downside it results in the slight increase in bias 2 .

K-Fold
Stratified K-fold 3 is the name given to the method when the stratification process or the rearrangement of data is performed to ensure that the selected fold has approximately similar mean response values this is done to ensure that the stratified data is a representative sample of the whole data. For instance, in a binary classification such as the present problem, each class comprises 50% of the data; hence during the process of rearrangement, it is best to ensure that each class finds almost equal representation is every fold.

Benchmarking of marker identified by networks method compared to machine learning method.
To benchmark our pipeline compared to the standard machine learning approach alone to identify signatures, we carried out following exercise.
Normalized signal intensities of all the genes were used to identify optimal signature set by recursive feature elimination step using the Extra Trees algorithm (see Methods). We considered the top 6 genes (same size as our MM vs. PM signature), which are TMPRSS11B, DPT, LGALS2, ASAP1, ALPI and NEUROG3 as the features. The classification accuracy estimation results of this 6-gene panel using random forest algorithm (see Methods) is given below CA F1 Precision Recall 0.737 0.1 0.125 0.083 Table S1: The classification report using our ML algorithm against the top 6 genes identified by feature elimination. Figure S2. The ROC curve with our ML algorithm using top 6 genes The classification accuracy of these 6 genes was 73% which is lower than 87% accuracy achieved by signature derived using networks approach. This shows that our pipeline has the highest accuracy in achieving the classification between primary and metastatic melanoma samples.
In addition, an analysis of the functional significance of these genes indicated that only one gene (ASAP1) has an established role in melanoma and the rest are not directly related 4 . This additionally shows the advantage of networks method to obtain the features of biological importance.