Information Spread and Topic Diffusion in Heterogeneous Information Networks

Diffusion of information in complex networks largely depends on the network structure. Recent studies have mainly addressed information diffusion in homogeneous networks, where there is only a single type of node and edge. However, some real-world networks consist of heterogeneous types of nodes and edges. In this manuscript, we model information diffusion in heterogeneous information networks and use interactions of different meta-paths to predict the diffusion process. A meta-path is a path between nodes across different layers of a heterogeneous network. The most important feature of the proposed method is that it can determine the influence of all meta-paths on the diffusion process. A conditional probability is used, assuming interdependent relations between the nodes, to calculate the activation probability of each node. As diffusion models, we consider the linear threshold and independent cascade models. Applying the proposed method to two real heterogeneous networks reveals its effectiveness and superior performance over state-of-the-art methods.

Many real systems can be modeled by networks where a number of individuals interact through a connection graph. Examples of networked systems include the Internet, the World Wide Web, the human brain, power grids, online social networks, and transportation and water distribution networks. Various dynamical phenomena have been studied on complex networks, including synchronisation 1 , consensus 2 , opinion formation 3,4 and information spread 5 . Network topology plays a major role in how such dynamical processes evolve on networks. Certain topologies might facilitate synchronisation or information spread, while other network structures might disrupt such activities 6,7 .
Information diffusion is one of the most widely studied dynamical processes on networks and has potential applications in various fields. Information such as news, an innovation, a virus or malware starts from a set of seed nodes and propagates throughout the network. There is a rich literature on information diffusion on complex networks, where different models and their interplay with network topology have been studied 1 . Previous research works have mainly considered homogeneous networks. An information network G = (V, E), with V as the set of nodes and E as the set of edges, is a homogeneous network if the edges and nodes are of the same type. Networks with nodes and/or edges of more than one type are called heterogeneous networks [8][9][10] . For example, in the DBLP network, which is a major bibliography provider in computer science, the nodes are authors, papers and venues (journals/conferences). In this network, an edge can be an author-author relationship when two authors co-author a paper, or an author-venue relationship when an author participates in a conference.
Here we model information diffusion or more specifically topic diffusion in heterogeneous information networks. To this end, we use the concept of meta-path, which is defined in heterogeneous networks. A meta-path P is a path defined over the general schema of the network T G = (A, R), where A and R denote the nodes and their relations, respectively. The meta-path is denoted by ...
, where l is an index indicating the corresponding meta-path. The aggregated relationship is obtained as $R = R_1 \circ R_2 \circ \dots \circ R_l$ between the node types $A_1$ and $A_{l+1}$, where $\circ$ is the composition operator. For instance, in the DBLP network, each of the author-paper-author and author-conference-author relations is considered to be an individual meta-path. Figure 1 shows an example of the propagation of the "Data mining" topic, in which authors are connected to one another through different meta-paths in the DBLP network.
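The composition of relations along a meta-path can be illustrated with a small counting sketch. The code below is only an illustration under stated assumptions: the dictionary-based relation format, the helper names `compose` and `invert`, and the toy author-paper data are all hypothetical and do not come from the datasets used in this work. It counts path instances for the APA and APAPA meta-paths by composing the author-paper relation with its inverse.

```python
from collections import defaultdict


def compose(rel_a, rel_b):
    """Compose two relations stored as {source: {target: path_count}}.

    The result counts, for every (source, target) pair, the number of
    two-step path instances that go through an intermediate node.
    """
    out = defaultdict(lambda: defaultdict(int))
    for src, mids in rel_a.items():
        for mid, c1 in mids.items():
            for dst, c2 in rel_b.get(mid, {}).items():
                out[src][dst] += c1 * c2
    return {s: dict(t) for s, t in out.items()}


def invert(rel):
    """Reverse the direction of a relation (e.g. author-paper to paper-author)."""
    inv = defaultdict(dict)
    for src, targets in rel.items():
        for dst, count in targets.items():
            inv[dst][src] = count
    return dict(inv)


# Hypothetical toy data: which author wrote which paper.
author_paper = {
    "a1": {"p1": 1, "p2": 1},
    "a2": {"p1": 1},
    "a3": {"p2": 1},
}
paper_author = invert(author_paper)

# Author-Paper-Author: composition of the two relations.
apa = compose(author_paper, paper_author)
# Author-Paper-Author-Paper-Author: APA composed with itself.
apapa = compose(apa, apa)
```

In this toy network a2 and a3 share no paper, so they have no APA instance, but they are connected by one APAPA instance through a1. Note that the counts include paths that return to the starting author; whether such instances should be kept is a modeling choice that this sketch leaves open.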
Recently, much attention has been given to employing non-homogeneous networks in classification and ranking tasks. For instance, sentiment classification of product reviews using heterogeneous networks was addressed by Zhou et al. 11 . In this process, a heterogeneous network connects the users, products, and words, based on which the learning process is conducted using sentiment classification. In this regard, Zhou et al. 11 proposed a co-ranking method which classifies the authors and documents separately based on random walks. Angelova et al. 12 presented a new classification method for the DBLP heterogeneous network. Mining of heterogeneous networks was addressed in a number of studies [13][14][15] . For example, Boccaletti et al. 16 studied mining of heterogeneous information networks through their decomposition into multiple homogeneous networks. The idea of citation recommendation using mining in heterogeneous networks was proposed by Liu et al. 17 . Heterogeneous networks have also been employed in healthcare. Some papers 18-20 focused on epidemic spreading on heterogeneous networks. Considering an epidemic threshold, Wang and Dai 21 addressed virus spreading in heterogeneous networks based on the well-known susceptible-infected-susceptible model. Moreover, it was shown by Yang et al. 22 that by considering heterogeneity between people, a heterogeneous network is created which is resistant against the epidemic spread of a virus. Epidemic spreading is an important issue that has also been considered in other networks, such as time-varying networks 23 and adaptive networks 24 . Nadini et al. 23 used SIR and SIS models and investigated the effects of modular and temporal connectivity patterns on epidemic spreading.
1 Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran. 2 School of Computer Science, Institute for Research in Fundamental Science (IPM), Tehran, Iran. 3 School of Engineering RMIT University, Melbourne, Australia. Correspondence and requests for materials should be addressed to M.S. (email: mostafa_salehi@ut.ac.ir)
Link prediction in heterogeneous networks has also been addressed. Shakibian and Moghadam Charkari 25 used meta-paths for prediction, and Jalili and Orouskhani 26 formulated drug response prediction as a link prediction problem using a kernelised Bayesian multitask learning algorithm. Some works have considered information diffusion on these networks. Sermpezis et al. 27 used the degree distribution for the process of information diffusion, assuming that diffusion takes place between two nodes at random times. Zhou and Liu 28 presented a social-influence-based clustering framework for analyzing heterogeneous information networks. Moreover, a heterogeneous network model was proposed by Li and Jin 29 for new product diffusion in two stages; the first stage is the transfer of information concerning new products to customers through advertisement, and the second stage is changing customer priorities through persuasive advertisements.
As another definition, heterogeneous networks are referred to as multilayer networks, where the nodes and/ or edges can be of different types. In many studies in this field, the concept of heterogeneous networks has been used to present a different definition for the infrastructure networks, based on which the concepts of diffusion are explained. Multilayer networks with all nodes from the same type are often called multiplex networks; a number of works have considered link prediction problem in multiplex networks 16,26 .
Some works have studied topic diffusion in heterogeneous networks. Gui et al. 30 utilised the concept of meta-path-based similarity between each pair of nodes (known as Pathsim) and made predictions by generalising the Linear Threshold (LT) model. In this method, Pathsim was considered as a weight between each pair of nodes, through which predictions were conducted 31,32 . In our proposed method, each meta-path instance is considered as a path by considering different meta-paths, and the conditional probability model is used to calculate the activation probability of each node. Also, two different diffusion models are used, namely Independent Cascade (IC) and LT. In these models, all nodes are first considered to be inactive. Then, an initial set of seed nodes is activated and LT/IC is used to activate the subsequent nodes. In the IC model, an inactive node is activated under the influence of the active node with the highest probability of influence 33 . In this model, a probability is assigned to each active node for activating its neighbors; the probability of activation of node w triggered by node v is denoted as P(w|v). Every newly activated node v attempts to trigger its inactive neighbors. If successfully triggered, node w is activated in the next step and triggers its inactive neighbors. Once a node is activated, it has a single chance to independently influence each of its neighbors. In the LT model, each inactive node is activated if the portion of its activated neighbors is more than a threshold θ ∈ [0, 1] 34 . Indeed, an inactive node u is activated if and only if the total weight of all its activated neighbors exceeds a given threshold $\theta_u$, as in equation (1):
$$\sum_{v \in \varepsilon_u} W_{u,v} > \theta_u \tag{1}$$
where $\varepsilon_u$ is the set of active neighbors of node u and $W_{u,v}$ represents the weight of the link between nodes u and v. Watts 35
The main contributions of this work are as follows:
1. We propose two novel topic diffusion models in heterogeneous networks considering different meta-paths, meaning that the influence of each relation is individually learned.
2. The dependency of inactive nodes on active ones is considered, and conditional probability is employed to obtain the probability of activation of each inactive node.
3. Two frequently used models (LT and IC) are studied in heterogeneous networks and their behavior is compared on two real datasets. We show that the IC model is more accurate than the LT model in modeling topic diffusion in heterogeneous networks.

Methods
This study incorporates conditional probability for calculating the probability that an inactive node is activated by its neighboring active nodes. This is known as the information propagation probability, which defines the probability that an active node activates an inactive neighbor. This propagation probability is calculated considering meta-paths and using a Bayesian framework. It is assumed that inactive nodes are dependent on the active ones. The IC and LT models are employed for the process of information diffusion. The stages involved in the proposed method are briefly presented in Algorithm 1, with every stage being explained separately in the following subsections.

Datasets. A time-stamp of a year is defined for both datasets, based on which the training set and the test set are created as explained in the following:
• DBLP (computer science bibliography) 36 : Objects indicate authors in this network. Different meta-paths such as APA (Author-Paper-Author), ACA (Author-Conference-Author), APAPA (Author-Paper-Author-Paper-Author), and ACACA (Author-Conference-Author-Conference-Author) are considered. Different topics are extracted from this dataset, and information diffusion about a specific topic is investigated. This dataset includes information from 1954 to 2016.
• PubMed Dataset 37,38 : In this network, the authors are represented by objects and the meta-paths APA and APAPA are used. The dataset consists of information from 1950 to 2013.
Information on both datasets is given in Table 1.
Evaluation criteria. All nodes with published papers on our particular topic of interest are tagged as active and the rest as inactive. Assuming the nodes to be predicted at time t, the training and test sets are considered as follows:
Training set: Those within the time period from t − 4 to t − 2 are considered as the training set.
Test set: Those within the time period from t − 1 to t are considered as the test set.
Additionally, the nodes tagged as active up to time t − 2 are considered as the seed nodes that are activated initially at the start of the diffusion process.
We use the Precision, Recall, and F-score criteria to assess the performance. These metrics are defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2}$$
where True Positive (TP) is the number of active nodes that are correctly tagged as active by the algorithm, True Negative (TN) is the number of inactive nodes that are correctly tagged as inactive by the algorithm, False Positive (FP) is the number of inactive nodes that are falsely tagged as active by the algorithm, and False Negative (FN) is the number of active nodes that are falsely tagged as inactive by the algorithm. In the IC model, let $S_t \subseteq V$ be the set of nodes that are activated at step t ≥ 0, with $S_0 = S$. At step t + 1, every node $u \in S_t$ may activate its out-neighbors $v \in V$ with a propagation probability of P(v|u). One should also consider the activation threshold for the LT model. We study how the diffusion process depends on the threshold value. Initially, the optimal threshold is calculated from the training set in order to obtain the evaluation criteria, according to the third step of Algorithm 1. Figure 2 shows the F-scores as a function of the threshold value when considering diffusion of the selected topics in the DBLP dataset. As can be seen, one can often obtain an optimal value for the threshold for which the F-score is the highest. Note that the F-score lies in the range [0, 1], where 1 indicates the best performance. This optimal threshold varies across different topics, which indicates that different topics have different propagation mechanisms in this dataset. The obtained optimal threshold value is then applied to the test set to assess the performance. We also use the recall measure to obtain the optimal threshold, and the results are similar to those obtained based on the F-score (results not shown here).
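The evaluation criteria and the threshold selection on the training set can be sketched as follows. This is a minimal illustration, not the implementation used for the reported results: the function names, the dictionary of activation scores and the list of candidate thresholds are hypothetical, and the rule "predict active when the score is at least the threshold" is an assumption made for the sketch.

```python
def precision_recall_f(predicted, actual):
    """Return (precision, recall, f_score) for sets of node identifiers.

    `predicted` holds the nodes tagged as active by the model and
    `actual` holds the nodes that really became active.
    """
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # active and predicted active
    fp = len(predicted - actual)   # inactive but predicted active
    fn = len(actual - predicted)   # active but predicted inactive
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def best_threshold(scores, actual, candidates):
    """Pick the candidate threshold with the highest F-score.

    `scores` maps each node to an activation score (assumed format);
    a node is predicted active when its score is at least the threshold.
    """
    best_th, best_f = None, -1.0
    for th in candidates:
        predicted = {node for node, s in scores.items() if s >= th}
        f = precision_recall_f(predicted, actual)[2]
        if f > best_f:
            best_th, best_f = th, f
    return best_th, best_f
```

The threshold returned for the training period would then be reused unchanged on the test period, mirroring the procedure described above.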
Calculating propagation probability of Nodes. Propagation probabilities for all edges and nodes are calculated in this stage. In order to calculate the activation probability of each node according to its neighboring nodes, the influence probabilities of each node and edge are calculated considering meta-paths.
Edge Propagation Probability: In heterogeneous networks, different routes are available for meta-paths. Hence, for every pair of nodes $v_1$ and $v_2$ in meta-path k, the edge probability is equal to the number of path instances between the two nodes divided by all the existing path instances between them, as shown in equation (3). This fraction can be considered as the information propagation probability between nodes $v_1$ and $v_2$. In equation (3), $P^k(v_1, v_2)$ denotes the probability of the edge connecting nodes $v_1$ and $v_2$ in meta-path k, $n_u$ is the total number of existing nodes and $n^k_{v_1 \to v_2}$ represents the path instances between these nodes in meta-path k.
Node Propagation Probability: The strength of each node, i.e. the amount of information propagation the node is capable of, according to each meta-path is expressed by equation (4). For instance, an author with a higher number of published papers should be assigned a higher influence strength for information spread. The probability of propagation from node $v_1$ to node $v_2$ in meta-path k is expressed using equation (5).
Algorithm 1.
1. Calculate probability for nodes:
• Find P for each pair of nodes $v_1$ and $v_2$
• For t from $T_0$ to $T_f$:
  • Insert edges from active nodes to inactive ones in the main graph
  • Delete any edges between active nodes (there is no dependency between active nodes)
  • Delete any edges between inactive nodes (there is no dependency between inactive nodes)
  • For each inactive node do:
    • Based on the flow graph and $\alpha_k$, find
If node $v_i$ was activated by one of its active neighbors:
• For each inactive node do:
  • Based on the flow graph and $\alpha_k$, find
Calculate F-score and Recall measures as in equation (2)
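The edge propagation probability described above can be sketched from path-instance counts. Since equation (3) is not reproduced here, the normalisation used below is an assumption: the count of instances from $v_1$ to $v_2$ is divided by the total number of instances leaving $v_1$ under the same meta-path, with self-paths ignored. The function name and the input format are hypothetical.

```python
def edge_probabilities(path_counts):
    """Turn path-instance counts into edge propagation probabilities.

    `path_counts` maps v1 -> {v2: number of path instances} for a single
    meta-path. ASSUMPTION: each count is normalised by the total number
    of instances that start at v1 (self-paths excluded); the exact
    denominator of equation (3) may differ.
    """
    probs = {}
    for v1, targets in path_counts.items():
        total = sum(c for v2, c in targets.items() if v2 != v1)
        probs[v1] = {
            v2: c / total
            for v2, c in targets.items()
            if v2 != v1 and total > 0
        }
    return probs


# Hypothetical APA counts for one author.
apa_counts = {"a1": {"a1": 2, "a2": 1, "a3": 3}}
apa_probs = edge_probabilities(apa_counts)
```

With these toy counts, a1 reaches a3 through three of its four outgoing APA instances, so that edge receives the larger probability.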
Propagation flow graph. The activation probability is assumed to be conditional, as only an active node is capable of activating an inactive one, meaning that the direction of flow is always from the active node to the inactive one. Hence, the network is considered to be of Bayesian type. Additionally, we assume that active nodes are independent, as an inactive node can only be activated by an active neighboring node and no flow may occur between two active nodes; hence no edge is considered between them. An implicit graph, known as the Propagation Flow Graph (PFG), is considered in this work; an example is shown in Fig. 3. It should be noted that in order to calculate the node and edge propagation probabilities, the relationships between all nodes, both active and inactive, are taken into account. In each state, if a node is activated, it is added to the PFG. In the example shown in Fig. 3, nodes V2 and V4 can activate V1, as there are links from V2 and V4 to V1 on the PFG. However, V3 can only be activated by V4, as there is no link from V1 to V3 on the PFG. Once V1 is activated in the first step, it can also affect V3 in the next step.
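The construction of the PFG can be sketched as a filter over the underlying relations: only links that point from an active node to an inactive node are kept. The function name and the edge list below are hypothetical; the edge list is an assumption chosen to match the textual description of the example, not data read from Fig. 3.

```python
def propagation_flow_graph(edges, active):
    """Build a PFG as {active_node: set of inactive neighbors}.

    `edges` is a list of undirected (u, v) relations (assumed format).
    Edges between two active nodes or two inactive nodes are dropped,
    and every kept edge is directed from the active to the inactive node.
    """
    active = set(active)
    pfg = {}
    for u, v in edges:
        for src, dst in ((u, v), (v, u)):
            if src in active and dst not in active:
                pfg.setdefault(src, set()).add(dst)
    return pfg


# Hypothetical relations consistent with the description of the example.
edges = [("V1", "V2"), ("V1", "V4"), ("V3", "V4"), ("V4", "V5"),
         ("V4", "V6"), ("V2", "V4"), ("V1", "V3")]

step0 = propagation_flow_graph(edges, {"V2", "V4"})
# After V1 becomes active it is added to the PFG and can reach V3.
step1 = propagation_flow_graph(edges, {"V1", "V2", "V4"})
```

Rebuilding the graph after each step is the simplest way to express the rule; an incremental update would be equivalent.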
Propagation Probability. In this section, the activation probability for each node is calculated according to the IC and LT diffusion models.
IC Model. In the IC model, each inactive node has a single chance to be activated by one of its active neighbors. In other words, if an inactive node is not activated by a recently activated neighbor node, it will not be considered in the next steps for being activated. Here, among the neighboring nodes of an inactive node that activated this node, the one with the maximum probability is selected as the activating node. Otherwise, if the state of an inactive node does not change, we select the maximum probability of its neighbors as the probability of this inactive node. The propagation probability from the active neighboring nodes ($\varepsilon_{v_i}$) to an inactive node $v_i$ is first obtained for each meta-path k. As mentioned before, we assume that active nodes are independent since no flow may occur between two active nodes. Since the overall probability is obtained as the sum over meta-paths, the overall activation probability of node $v_i$ can be obtained as:
$$P(v_i \mid \varepsilon_{v_i}) = \sum_{k=1}^{m} \alpha_k \, P^k(v_i \mid \varepsilon_{v_i}), \qquad m = \text{number of meta-paths}$$
which means that a coefficient $\alpha_k$ is assigned to each meta-path to obtain the overall probability.
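One IC step with meta-path coefficients can be sketched as below. This is a simplified reading of the description above rather than the authors' implementation: the per-meta-path probabilities are combined as a weighted sum with coefficients $\alpha_k$, the active neighbor with the maximum combined probability is taken as the candidate activator, and a single random draw decides the activation. The function names, the data layout and the single-draw rule are assumptions.

```python
import random


def combined_probability(per_path_probs, alphas, u, v):
    """Weighted sum over meta-paths of the probability that u activates v."""
    return sum(
        alphas[k] * per_path_probs[k].get(u, {}).get(v, 0.0)
        for k in alphas
    )


def ic_step(pfg, per_path_probs, alphas, newly_active, active, rng):
    """Run one step and return (activated_nodes, best_probability_per_node).

    For every inactive node reachable from a newly active node, the
    strongest active neighbor is selected (maximum combined probability)
    and one random draw decides whether the node becomes active.
    """
    best = {}
    for u in newly_active:
        for v in pfg.get(u, ()):
            if v in active:
                continue
            p = combined_probability(per_path_probs, alphas, u, v)
            best[v] = max(best.get(v, 0.0), p)
    activated = {v for v, p in best.items() if rng.random() < p}
    return activated, best
```

A possible call, with hypothetical probabilities, is `ic_step(pfg, probs, {"APA": 0.6, "APAPA": 0.4}, ["V4"], {"V2", "V4"}, random.Random(0))`. Nodes that are not activated keep their maximum probability in the second return value, which corresponds to the case where the state of an inactive node does not change.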
LT Model. As a more intuitive assumption that is closer to the real world, the LT model assumes that a node is activated if at least a certain percentage of its neighbors have already been activated. In the DBLP network, for example, this means that the total number of papers studied from different authors can influence an author to publish a paper on a particular topic. The general form of the LT model is given in equation (1). On the other hand, owing to the conditional probability assumption, we can obtain the probability of each inactive node. In this section, we keep the properties of the LT model and the conditional probability together. In this case, calculations of propagation probability through active neighboring nodes of node $v_i$ are as follows: for obtaining more influence. This means that with higher probability, the neighbors of node $v_i$ will have more influence on it, which leads to: We can infer that if the product of the probabilities of the neighbors of an inactive node $v_i$ becomes more than the threshold $\lambda_{v_i}$, the inactive node is more likely to be activated. Let us multiply both sides of equation (10) by a constant value υ, which does not change the final result: By taking the logarithm of both sides of the above equation, we have equation (12) as: Equation (12) shows that by considering W i as υ ε | log P v ( ( )) n i viq , the LT conditions are kept while the conditional probability is also used.
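The logarithm step in the argument above can be checked numerically: comparing a product of probabilities with a threshold is equivalent to comparing a sum of log-probabilities with the logarithm of that threshold, which gives the additive form required by the LT model. The sketch below illustrates only this equivalence; it does not reproduce the constant υ or the exact weight definition of equation (12), and the function names are hypothetical. It assumes all probabilities and the threshold are strictly positive.

```python
import math


def lt_product_rule(neighbor_probs, lam):
    """Activate when the product of neighbor probabilities exceeds lam."""
    return math.prod(neighbor_probs) > lam


def lt_log_sum_rule(neighbor_probs, lam):
    """Same decision in additive form: sum of logs against log(lam).

    Requires every probability and lam to be strictly positive, since
    log(0) is undefined.
    """
    return sum(math.log(p) for p in neighbor_probs) > math.log(lam)
```

Because the logarithm is strictly increasing, both rules always agree; multiplying both sides of the product inequality by a positive constant likewise leaves the decision unchanged.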
Learning model. Since information diffuses from active to inactive nodes, the flow of propagation is considered as a directed graph from active to inactive nodes. Moreover, due to their active state, no edge is considered between active nodes. Hence, according to the PFG, the probability of all nodes is obtained by multiplying the individual probabilities of the active and inactive nodes. In the following, we explain the learning process used for the IC and LT models.
IC model. If U t is the set of all graph nodes, V t the set of active nodes and R t the set of inactive nodes at time t, the propagation probability for nodes is obtained by: The objective is to maximise P(U t ); the probability of active nodes (P(V t )) as well as that of unity minus the probability of inactive nodes (1 − P(r|{ε r })) should be maximised to obtain the best results. For convenience, the function can be converted to log-likelihood function as:  Ultimately, both models use equation (16) for calculating the coefficient of each meta-path (α k ).
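The idea of learning the meta-path coefficients by maximising a log-likelihood can be sketched as follows. This is not equation (16): a simple grid search is swapped in for illustration, only two meta-paths are handled, and the coefficients are assumed to be non-negative and to sum to one. The log-likelihood adds log P for nodes that became active and log(1 − P) for nodes that stayed inactive, following the description above. All names and the sample format are hypothetical.

```python
import math


def log_likelihood(alphas, samples, eps=1e-9):
    """Log-likelihood of observed activations for given coefficients.

    `samples` is a list of (per_path_probs, became_active) pairs, where
    per_path_probs holds one probability per meta-path (assumed format).
    """
    total = 0.0
    for probs, became_active in samples:
        p = sum(a * q for a, q in zip(alphas, probs))
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithm finite
        total += math.log(p) if became_active else math.log(1.0 - p)
    return total


def fit_two_alphas(samples, steps=100):
    """Grid search over alpha_1 in [0, 1] with alpha_2 = 1 - alpha_1."""
    best_alphas, best_ll = None, -math.inf
    for i in range(steps + 1):
        a = i / steps
        ll = log_likelihood((a, 1.0 - a), samples)
        if ll > best_ll:
            best_alphas, best_ll = (a, 1.0 - a), ll
    return best_alphas, best_ll
```

For more than two meta-paths, a gradient-based optimiser would be a more practical choice than a grid; the objective stays the same.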
Example. In this section, we provide the above analysis on a sample network shown in Fig. 3. In this network, nodes V2 and V4 are active nodes, and thus can influence the inactive nodes V1, V3, V5 and V6, and activate them. Considering two meta-paths APA and APAPA, the probability of activation for each node can be calculated as follows:  where the values of α APA and α APAPA are learned for the corresponding meta-paths. Assuming the learned values for α APA and α APAPA as 0.6 and 0.4, respectively, the final activation probability is obtained as:

Experimental Results
As mentioned for the DBLP example, the authors are connected to one another according to the specified meta-path. For topic diffusion in such a graph, we initially need to select a specific topic such as "data mining". The authors with papers related to the selected topic are considered as active nodes. These authors affect their neighbors in such a way that an inactive node (author) might be encouraged by the active author(s) to write a paper in this field. If a neighbor writes a paper in this field, they become active and then affect their own neighbors. Figure 4 shows an example in which, in the first step, nodes V2 and V4 are activated by the "data mining" topic. Node V2 can activate node V1, while node V4 can activate nodes V3, V5 and V6. In this example, node V4 activates node V5; hence node V5 is persuaded to write a paper on the "data mining" topic. In this section, we apply the proposed model to two real datasets and discuss the results. We consider two popular datasets, DBLP and PubMed, which include information on authors, papers and venues. We also consider several topics, including data mining, machine learning, social networks, healthcare, DNA, and infectious disease, for which the diffusion process is modeled. These topics were selected mainly because they occur frequently enough in the datasets and provide a considerable amount of data for comparison and conclusion.

DBLP.
In this dataset, information diffusion is investigated on the selected topics. The results of the proposed method are compared to those of the state-of-the-art method introduced by Gui et al. 30 , known as MLTM-R. Figures 5 and 6 compare the performance of the proposed model, the Heterogeneous Probability Model (HPM), with MLTM-R in terms of F-score and Recall, respectively. Note that the original MLTM-R method is based on the LT model for diffusion, while HPM works for both the LT and IC models. As can be seen, HPM significantly outperforms MLTM-R by providing much better F-score and Recall when the IC model is used. An improvement of about 30-50% is obtained with HPM as compared to MLTM-R. Furthermore, these results show that one can obtain much better performance when the IC model is used rather than LT. This indicates that the IC model is better able to model topic diffusion in this dataset. Figure 7 compares the TP, i.e., the number of correctly predicted active authors, of the methods. It also includes the actual TPs for different years; the closer the predicted value is to these actual values, the better the performance of the method. As can be seen, the proposed method with the IC model (HPM-IC) has the closest predicted values to the actual ones, followed by HPM-LT and then MLTM-R. This performance is observed across all the selected topics and all years. Figure 8 shows the number of authors who have been incorrectly identified as active or inactive, where HPM-IC has the lowest values (i.e., the best performance) while MLTM-R has the worst performance.
PubMed. We apply the methods to the PubMed dataset with the same selected topics. Figures 9 and 10 show the F-score and Recall of the methods, respectively. Similar to the other dataset, HPM significantly outperforms MLTM-R in all topics. Also, HPM-IC performs better than HPM-LT. Figures 11 and 12 show the correctly identified active authors (TP) and the incorrectly identified active and inactive authors (FN and FP), respectively. As can be seen, similar to the other dataset, HPM-IC has the best performance.
Analysis. Compared to the MLTM-R method 30 , the HPM-LT and HPM-IC methods significantly improve the F-score and Recall of the prediction, which is mainly due to the following reasons. MLTM-R uses Pathsim to calculate the weight of each edge. Pathsim is not accurate in some cases 31,32 , as it does not obtain a similarity value (or obtains low similarity scores) between two similar nodes in certain circumstances. However, in our proposed method, each meta-path instance is considered as a path by considering different routes between nodes, which eliminates the problems of Pathsim as there is no need to calculate the similarity for weights. The proposed method instead uses the conditional probability model to calculate the activation probability of each node. The inactive nodes are considered to be dependent on the active neighboring nodes. This is a realistic scenario: if an author decides to write a paper about an issue, they should already be aware of the existing papers written by others (active nodes). Unlike the other method, in the proposed algorithm we consider the node and edge influence separately. The node influence is considered by having IC and LT models in which the activation of inactive nodes is based on neighboring active nodes. The edge influence is considered as the extent to which the relation between two nodes is important for the diffusion process, i.e. a relation is more influential if a larger number of paths are found between two nodes. Topological properties of networks have a significant influence on the way information propagates on them. DBLP has a larger average degree than PubMed, and having more connections facilitates spread. Our results also confirm this, as the performance of the methods is better for DBLP than for PubMed.
The better performance of the proposed strategy over the previous model is due to considering information extracted from meta-paths. A meta-path is a path between any two nodes from different layers of a heterogeneous network. As a meta-path traverses different types of objects, it can extract useful information on the structure of the network. Methods based on meta-paths have already been used for network analysis tasks such as link prediction 5 . Our experiments show that meta-paths are also important in the way information spreads across layers and different object types. We also consider the importance of the nodes by taking into account the paths passing through them (equation 4).
Our proposed method uses meta-paths with different lengths. Two non-adjacent nodes of the same type, e.g. two authors in the DBLP example, might be connected through meta-paths of length two or three. For example, in the DBLP network, when two authors who do not have any co-authored papers both have papers with another author, there is a meta-path of length two between these two authors. Considering such meta-paths allows one to account for such indirect connections between the nodes while taking into account the cross-layer information at the same time.

Conclusion
This paper studied information spread and diffusion of scientific topics in heterogeneous networks. To this end, a novel method called HPM was developed based on meta-paths and conditional probability. Moreover, a propagation flow graph was defined to illustrate the diffusion flow from active to inactive nodes. The propagation probability was then calculated based on this graph, and the coefficients of the meta-paths were learned using the log-likelihood function. We considered two well-known diffusion models: the Linear Threshold (LT) and Independent Cascade (IC) models. In the LT model, inactive nodes are activated if the portion of their active neighbors is higher than a certain threshold. In the IC model, a recently activated node activates its inactive neighbors with a certain probability. We considered the problem of topic diffusion in two real-world networks: DBLP and PubMed. The performance of the proposed model was compared with a state-of-the-art method, and our experimental results showed that the proposed method significantly outperforms the other one. Also, using IC as the diffusion model led to better performance than the LT model.