A Robust Method for Inferring Network Structures

Inferring the network structure from limited observable data is significant in molecular biology, communication and many other areas. It is challenging, primarily because the observable data are sparse, finite and noisy. The development of machine learning and network structure study provides a great chance to solve the problem. In this paper, we propose an iterative smoothing algorithm with structure sparsity (ISSS) method. The elastic penalty in the model is introduced for the sparse solution, identifying group features and avoiding over-fitting, and the total variation (TV) penalty in the model can effectively utilize the structure information to identify the neighborhood of the vertices. Due to the non-smoothness of the elastic and structural TV penalties, an efficient algorithm with the Nesterov’s smoothing optimization technique is proposed to solve the non-smooth problem. The experimental results on both synthetic and real-world networks show that the proposed model is robust against insufficient data and high noise. In addition, we investigate many factors that play important roles in identifying the performance of ISSS.


Problem Definition
Let's take the ultimatum game (UG) on networks 33 as an example. The vertices in the network represent players. If two people play the ultimatum game, there is an edge between them. Everyone in the game will play two times with his neighbors using strategy (p, q), both as a proposer and a responder. p ij ∈ [0, 1] is the amount proposed by player i to offer his neighbor j, while q ij is the minimum amount that i responds to his neighbor j. For instance, player i has a strategy [0.8, 0.65], and one of his neighbors j has a strategy [0.6, 0.5]. Since i proposes to offer p ij = 0.8 to j, which is larger than the minimum acceptance amount of j, i.e., q ji = 0.5, then player i will get a 1−0.8 = 0.2 payoff. At the same time, j offers i with p ji = 0.6, which is lower than the minimum amount of i accepts, i.e., q ij = 0.65, then player i rejects j's offer, and both of them get nothing. The game lasts for M rounds. For round μ ∈ [1, M] with time stamp t μ , the corresponding strategy of player i is denoted as (p j (t μ ), q j (t μ )). The payoff of player i with player j at round μ can be summarized as follows:

t p t p t q t and p t q t p t p t q t and p t q t p t p t q t and p t q t p t q t and p t q t
( ) Hence, the overall payoff of player i with all its neighbors is defined as where E i is the neighbors of player i. The strategy (p i (t μ+1 ), q i (t μ+1 )) changes from time to time with a random number δ ∈ [−ε, ε]. ε is set to 0.05 and the strategy p and q are set to range [0, 1]. The reconstruction problem is formulated as Y i = Φ i × X i as shown in Fig. 1  denotes the neighboring vector of vertex i, which is the network structure we want to infer . t μ ∈ [1, M] is the accessible time instance in time series. Our task is to infer all the neighbors of vertex i, i.e., X i , given noisy, sparse and incomplete Y i and Φ i in Equation (2). Here the UG on networks is just an example. There are also many other applications, such as the evolutionary games, transportation, communication processes, sociology and biology in papers 16,34,35 . For instance, the construction of biological interaction networks with the goal of uncovering causal relationships between genotype and phenotype constitutes a major research topic in systems biology 35 . With the increased availability of DNA microarray time-series data, it is possible to infer gene regulatory networks (GRNs) from gene expression data to understand dynamic GRNs 36 . The information-geometric network inference to use a real radio emulation testbed to infer the end-to-end rate distributions of stochastic network flows from link rate measurements 37 . G Zahoranszky-Kohalmi 38 employs network inference method in investigating the drug-target and target-target interactions in order to design new drugs. In these real world applications, the observations from the network, i.e., Φ i and Y i have noises, and Φ i is generally sparse and incomplete, which are key factors in identifying the performance of the network inference method. In fact, the essence of the inferring network structure (network reconstruction) problem can be formulated by Equation (2). The performance of the model is determined by the network structure and the expressing ability of model itself. Thus if the networks are built using ER network model with the same parameters, then the performance of any methods should be very similar. In the next section, we will discuss the advantages of ISSS and the key factors in identifying the performance of the ISSS and Lasso models.

Results
In this section, we apply our model on many different types of synthetic and real world networks. Then we compare the experimental results of our model with the state-of-art method Lasso 16 on ER, BA and WS networks. The experimental results show the effectiveness of our model, and it outperforms the baseline method significantly. In addition, we also investigate the key factors in identifying the performance of the model, such as the variance Scientific RepoRts | 7: 5221 | DOI:10.1038/s41598-017-04725-2 of the elements in measurement Φ and the properties of many real world networks. Finally, we compare the clustering coefficients of the network derived from the Lasso and ISSS model, respectively. And we also analyze the effectiveness of three penalties in the model.

Experiments on three typical networks.
We take the network inference as a binary classification task, i.e. the edge exists or not. Hence, Area under the Receiver Operating Characteristic (AUROC) 39 is employed to evaluate the performance of the models. The Area under the Receiver Operating Characteristic is a common summary statistic for the goodness of a predictor in a binary classification task. It is equal to the probability that a predictor will rank a randomly chosen positive instance higher than a randomly chosen negative one. According to the experimental results in Table 1, the ISSS method outperforms Lasso on all the ER, BA and WS networks with different training data size, which is defined as Data = M/N, where M is the number of accessible time instances in the time series and the N is the number of vertices in the network. In the experiment, the ER, BA and WS networks all have N = 100 vertices, M = 100 time instances and the edges of networks are constructed randomly according to the parameters. All the experimental results are averaged over at least 10 independent trials. The experiment setup in this paper is the same with the sate-of-art Lasso. With more than 40% of the training Figure 1. The network structure inference problem. The colorful squares in Y i and Φ i represent different float values. The colorful squares in X i represent the connection between vertex i and other vertices. If the square of vertex j is blank, it means that there is no connection between vertex i and j. Hence, the color of squares in X i is the same with the corresponding color of nodes in the graph. In many real world applications, the training data is insufficient, which indicates that some rows of Y i and Φ i are missing, such as the j ∈ E missing row in Y i and Φ i .  Table 1. Comparison between ISSS and Lasso. We use AUROC 39 (Area under the Receiver Operating Characteristic) index to depict the performance of the inference task. Area is the area under the curve. It is a common summary statistic for the goodness of a predictor in a binary classification task. It is equal to the probability that a predictor will rank a randomly chosen positive instance higher than a randomly chosen negative one. data, i.e. Data = 0.4, both ISSS and Lasso can successfully identify the network structure. As shown in Table 1, when the training data size is very small, the performance of ISSS is much better than Lasso. Take Data = 0.05 as an example, the AUCROC of ISSS is at least 0.1 larger than that of Lasso. It indicates that the ISSS improves the performance significantly with very few training data. It is worth mentioning that variance of measurement Φ's elements in Equation (2) play an important role in inferring the network structure. The Φ's elements Φ ij ∈ [0, 1] on ER, BA and WS in Table 1 has default variance = 10 −1 . Here we will analyze the situation that Φ has many other variances, s.t. variance = 10 −2 , 10 −3 and 10 −4 in Fig. 2. We plot the performance of ISSS and Lasso with different variances of Φ's elements on ER, BA and WS networks.
On the networks with variance = 10 −4 in Fig. 2(a,d,g), i.e. the network structure is hard to reconstruct, Lasso fails to infer the network structure. The performance of Lasso is the same with the random guess. Compared with the Lasso, ISSS method greatly outperforms the lasso significantly. As the size of training data goes up, the performance of ISSS gets better. However, the AUROC of ISSS is still lower than 0.8 with all the training data, i.e., Data = 1. On the networks with variance = 10 −3 , as shown in Fig. 2(b,e,h), there is a big gap between two curves, which indicates that ISSS greatly outperforms the Lasso. With variance = 10 −2 as shown in Fig. 2(c,f,i), the performance of ISSS is greatly improved especially when we have only 5% training data. As the size of training data rises, both the performances of ISSS and Lasso are improved. An interesting phenomenon is that ISSS and Lasso perform similarly among 25-40% training data with variance = 10 −2 . The cause of the phenomena is hard to interpret from theory and experiment aspects. In addition, low variance leads to faster convergence of the With very small variance = 10 −4 , the performance of Lasso is equal to random guess, i.e., the blue line in Fig. 2(a,d,g). With very large variance, the inference or reconstruction task is much easier. To show the whole curves, the upper bound of the y axis is set to 1.2 for all the figures in this paper.
model. For instance, to infer a small network, the training time with variance = 10 −3 is 10% shorter than that with variance = 10 −1 . That's the reason that many systems prefer low variance measurements and outputs. In summary, the ISSS model performs well with low variance measurements and few training data.
To further investigate the relation between variance and training data size, as shown in Fig. 3(a,b,c), we fix the training data size, i.e., Data = 0.1, 0.5 and 1, then find out the performance of ISSS and Lasso as variance scale goes up from (0, 1] with step 0.1. The experimental results on ER, BA and WS network are similar. Hence, we take ER network as an example in the figure. When Data = 1 and 0.5, ISSS can successfully infer the network structure with very small variance scale. With Data = 0.1, the AUROC keeps its value as the variance goes up. The performance of Lasso is just a little better than the random guess. However, the AUROC value of ISSS is larger than 0.70. It indicates that training data size is an important factor in identifying the upper bound of the performance. To test the robustness of the ISSS against the noise, we add some Gaussian noises  µ δ ∼ ( , ) 2 to the output Y and measurement Φ. Then we apply the ISSS and Lasso model on the noisy data. The results are shown in Fig. 3(d,e,f). For low noise with δ 2 = 10 −3 , the ISSS outperforms the Lasso, especially when the elements of Φ have very small variance. For very high noise that follows the normal distribution with δ 2 = 10 −1 , the performance of ISSS and Lasso are similar. For the noise with δ 2 = 10 −2 , the performance of ISSS is better than Lasso for all variance scales.
Experiments on real world networks. We apply the ISSS and Lasso on many real world networks to evaluate the performance. Table 2 denotes the minimum data for achieving at least 0.95 AUROC in combination with several real world networks. Lasso cannot achieve 0.95 AUROC on all networks with default variance = 10 −1 . ISSS greatly outperforms the Lasso. On karate and Les Miserables networks, ISSS achieves 0.95 AUROC with only Data = 0.2 and 0.4, respectively. By analyzing many properties of the network, we find that it is much easier to  infer the network structure with higher centralization index with very few training data. The network with both higher heterogeneity and density can also be easily inferred. To evaluate the performance of ISSS and Lasso on large networks, we do some experiments on the power grid (4941 vertices, 6594 edges), roget (1022 vertices, 5075 edges), Gnutella peer-to-peer file sharing network (6301 vertices, 20777 edges) and collaboration network in computational geometry (7343 vertices, 11898 edges). As shown in Fig. 4, ISSS outperforms the Lasso on all the networks. The Gnutella peer-to-peer network is hard to infer. Both ISSS and Lasso require more training data to infer the Gnutella peer-to-peer network structure. We compare the clustering coefficient of the network inferred by ISSS and Lasso with different training data sizes in this part. In Fig. 5, the horizontal axis represents the training data size, and the vertical axis is the clustering coefficient of the network inferred by ISSS and Lasso. The red solid line lies above the blue line all the time. It indicates that the ISSS model is inclined to infer the networks with stronger community structures than that of Lasso. As the training data size increases in Fig. 5, the clustering coefficient of the network inferred by ISSS approaches the ground truth, i.e., the yellow line. These results show that the TV penalty in ISSS model works well. Here is an example to demonstrate why the TV penalty works. For instance, a network has 50 vertices. As shown in Fig. 6, the horizontal axis represents all the 50 vertices in the network. If vertex j is a neighbor of vertex i, it is 1, otherwise it is 0 in the figure. Lasso tends to predict the neighbors of vertex i discretely as shown in Fig. 6(a), while the ISSS is opt to infer the network structure in a continuous way, as shown in Fig. 6(b). The reason for choosing TV penalty is that almost all of the vertices in many networks have similar natural number. Take the karate network as an example. Almost all the vertices' labels that are less than 20 are in the same community. For the dolphin network, the vertices' labels under 30 are connected with each other. For the networks don't have natural number labels or the vertices don't have continuous labels as shown in Fig. 6(a). We can remove the TV penalty from the model. The experimental results are shown in Table 3. The experiment is done on a BA network with continuous natural number labels. Without TV penalty, the ISSS model also works well. However, the ISSS with all the three penalties is still the best.
As we mentioned above, TV penalty is a factor in inferring the network structure. Hence, how to choose the value of parameter γ is important as shown in Equation (4). In this part, we investigate the relationship between the most common used hyper parameter coefficients γ and the training data size, i.e. Data. According to the results in Table 4, γ = 10 −3 and 10 −4 have the best performance on almost all training data. Given Data = 0.1, γ = 10 −6 is the best choice. The hyper parameter β, λ of L 1 , L 2 are empirically set as 0.001 and 0.0001, respectively.

Discussion
In this paper, we propose the robust ISSS model in solving the network reconstruction problem. Compared with the state-of-art Lasso, the ISSS model is more robust, which can successfully infer the network structure with insufficient training data against noise. The sparse regularization term L 1 , L 2 and the structural TV penalties make  the ISSS model reconstruct the network with strong community structure. The experiments on the WS, ER and BA networks show that the ISSS outperforms Lasso even when the network system has very few observations with great noises. We also do some experiments to analyze the factors in identifying the performance of the ISSS, i.e. the variance of Φ, training data size and the clustering coefficient of the network.
In the future, on the one hand, we will continue to find out all the factors in identifying the upper and lower bound of ISSS's performance, and prove that from both theory and experiment perspectives. On the other hand, the community structure of a network defines the characteristic of a network, which identifies its functionality and should be preserved during the inferring process. Though ISSS model can identify the network with community structure, there could be room for improvement. For instance, the group Lasso and the tree group Lasso are the promising directions in network reconstruction task. Furthermore, we will optimize ISSS and apply the network inference method to solve the drug prediction and breast cancer prediction problems.
At present, it is still very challenging to infer many large-scale networks. Hence, it is a future trend with promising application field. And the evaluation criteria on directed and undirected networks are the same. As the development of network inference and compressed sensing research, more reasonable evaluation methods will be proposed to evaluate the experimental results on undirected networks.

Methods
The main goal of complex network reconstruction is to recover the connection X i by the payoff information y i and virtual-payoff matrix Φ i of vertex i from the time series of strategies. Note that the number of the neighbors of vertex i is often much less than the total number of vertices in the entire complex network, i.e., X i is sparse. Han et al. 40 adopted to use Lasso model to identify the neighbors of vertex i. By using Lasso model, the problem of network reconstruction is formulated as where λ is a non-negative regularization parameter. All nonzero elements of the reconstructed X i correspond to the neighbors of i vertex. However, the Lasso-based model (3) only considers the sparse information.
In practice, to estimate a true signal in noise, the most frequently used methods are based on the total variation (TV) norm and the  1 norm. TV penalty forces sparsity on the spatial derivative of the weight map. TV norms are essential  1 norms of derivatives, hence  1 estimation procedures are more appropriate for the subject of TV estimation. The space of functions of bounded total variation plays an important role when accurate estimation of discontinuities in solutions is required 26 . By contrast, total variation is remarkably effective at simultaneously preserving edges whilst smoothing away noise in flat regions for image denoising. Therefore, we consider to use ridge regression model with the TV regularization to preserve the structural information of networks and enhance the robustness performance of network reconstruction. After imposed the TV regularization, the objective of our method is formulated as   where TV(·) represents the TV regularization, which is defined as . It is worth noting that the Lasso-based model 40 is a special case of our model. When β = 0 and γ = 0, our model will reduce to the Lasso-based model 40 (4) is able to utilize the structural information to help recover the connection of network. However, the TV regularization is a complex and non-smoothed sparsity-inducing penalties, and its proximal operator is unknown and difficultly computed. The traditional convex optimization methods, such as iterative soft-thresholding algorithm (ISTA 41 , fast or accelerated ISTA (FISTA) 42 and first-Order primal-dual(Primal-Dual) algorithm 43 and second-order methods [44][45][46] , require only one non-smoothed term alone or access to the unknown proximal operator of the non-smooth part of the objective function. In general, the proximal operator of the sum of two non-smooth functions, i.e.  1 and TV, is unknown and computing expensive. Thus, we can not use them solve this problem (4) directly. Motivated by Nesterov's smoothing technique 27 , we first smooth the TV penalty with the unknown proximal operator, while keeping the exact  1 constraint and then use accelerated proximal gradient descent to minimize the whole function as shown in Algorithm 1.
Time-complexity of iterative convex optimization is kind of tricky to analyze, as it depends on a convergence criterion. Here we give a brief estimation of the time complexity.

Nesterov's smoothing technique for TV penalty. Denote
Using the duality concept 47 , we can establish the dual form of TV(X i ) Scientific RepoRts | 7: 5221| DOI:10.1038/s41598-017-04725-2 Theorem 1 (Nesterov's Theorem 27 ). Let the function s μ be Nesterov's smooth transform of a convex function TV penalty. We can obtain the following results: (1) s μ is convex and differentiable with gradient The proof of Theorem 1 is obtained according to Lemma 1. According to the projection theorem 44 , we know that μ ⁎ z X ( ) i is the projection of μ AX i 1 onto the compact and convex space , that is where Π is the Cartesian product of a set and  ⋅ P ( ) G is the projection onto each compact set  G defined as Optimization. After smoothing by Nesterov's technique, the proximal operator of s μ becomes easily computed. We obtain a new optimization problem by using s μ (X i ) to replace the TV penalty of the problem (4): . By using this iterative method to minimize the smoothed model in (10), it will converge to ⁎ X i , which is the minimum of F μ (X i ). Moreover, X i 1 is a strong convex function, the convergence rate is guaranteed by