Introduction

Complex networks are common in the real world and can be used to represent complex systems in many fields. A growing number of complex networks carry attributes on their nodes and are called attributed networks1. These networks contain not only topological structure but also rich node attribute information, such as text descriptions of nodes and comments related to nodes. Influence maximization (IM) is a classic optimization problem in network science, which aims to find a set of vital nodes such that diffusion originating from these nodes produces the maximum influence spread in the network. Vital node identification for IM has been widely used in applications such as viral marketing2, information propagation3 and rumor analysis4.

Many IM algorithms have been proposed for complex networks, including diffusion-based algorithms5,6,7 and heuristic-based algorithms8,9,10,11,12. Diffusion-based algorithms provide a good performance guarantee relative to the optimal solution, at the cost of enormous computation. Heuristic-based methods improve efficiency to some extent, but they either ignore the propagation model or do not optimize a global influence function. Recently, community-based methods13,14,15 have come to play an important role in the IM problem. A community is defined as a group of nodes with dense internal connections and relatively sparse connections to the rest of the network. It can effectively represent the organization and structure of the network16. Because different communities are sparsely connected, the propagation overlap between seed nodes selected from different communities can be effectively reduced.

Due to the benefits of community-based influence maximization algorithms, many previous studies have focused on them in complex networks. The first and foremost step of a community-based algorithm is community detection. Numerous community detection methods based on matrix factorization17,18, label propagation19,20, percolation21 and random walks22,23 have been proposed, each with certain limitations and scalability issues. Moreover, these community detection methods only use information about the graph topology and fail to correlate node features with the community structure24. Recently, graph-embedding based community detection methods25,26 have attracted tremendous attention, since they can learn a representation for each node that combines topology and attributes. Given the good performance of graph-embedding methods in community detection, we apply them to the influence maximization problem.

Although many community-based methods have been proposed for the IM problem, few of them are suitable for attributed networks. Almost all graph clustering or community detection methods for attributed networks stop short of studying influence maximization, since there are no suitable information propagation models for attributed networks. Moreover, community-based influence maximization algorithms avoid the propagation overlap between seed nodes selected from different communities, but the propagation overlap between seed nodes selected from the same community may still exist, which may reduce the influence spread. To solve the above problems, we propose an information propagation model and a novel community-based influence maximization algorithm for attributed networks. The main contributions are summarized as follows:

  • An extension of the classic linear threshold (LT) information propagation model, named LTPlus, is proposed, which considers not only the topology of the network but also the attributes of nodes.

  • To solve the influence maximization problem in attributed networks, we propose a community-based influence maximization algorithm using graph embedding. To the best of our knowledge, this is the first time that a graph-embedding based community detection method is applied to the influence maximization problem.

  • The proposed method alleviates the propagation overlap between seed nodes selected from the same community by recalculating the influence of seed nodes’ predecessors during the seed nodes selection process.

  • Extensive analysis is performed on six datasets, and experimental results show that the proposed method has a good performance.

Related work

The IM algorithms related to this paper fall into three categories: diffusion-based methods, heuristic-based methods and community-based methods. They are discussed in more detail below.

Kempe et al.5 proposed the diffusion-based method, Greedy, which provides a \((1-1/e-\varepsilon )\) approximation performance guarantee to the optimal solution. However, its computation cost is expensive since it needs to perform Monte-Carlo simulations on all possible combinations of the current seed set and remaining nodes. Leskovec et al.6 proposed the CELF algorithm which employed the principle of diminishing marginal utility to avoid a lot of Monte-Carlo simulations. It significantly reduces the time complexity but it is still not scalable to large scale networks.

To improve efficiency, some heuristic centrality measures, such as degree centrality27, K-Shell9, betweenness centrality28 and closeness centrality29, were proposed to evaluate node influence. Moreover, Li et al.3,30 proposed to identify influential nodes by novel gravity models. LENC12 identified influential nodes by the entropy of a node based on the weight distribution of the edges connected to it. However, these methods may lead to the rich-club effect when solving the IM problem. VoteRank31 was proposed to reduce the rich-club effect by selecting seed nodes with a voting scheme, where every node has the same voting ability and each node receives votes from its neighbors. NCVoteRank32 argued that the voting ability of each node should differ and depend on its topological position. A fast and accurate IM algorithm, LMP33, was proposed that labels nodes through local traversal based on their influence power. This method achieves linear time complexity while maintaining good performance. HGD34 presented a heuristic group discovery method to reduce the influence overlap, which uses K-Shell and degree centrality to cluster nodes. However, HGD is a locally optimal clustering algorithm that cannot guarantee globally optimal performance. Overall, heuristic-based methods are relatively time efficient but may lack performance guarantees on some networks.

As community detection is an appropriate approach for understanding the structure and hidden information in complex networks35, many community-based IM methods have been proposed. Li et al.36 pointed out that higher community diversity can reduce the risk of marketing campaigns and prolong the effect of a marketing campaign in future promotion. OASNET37 used the Clauset-Newman-Moore community detection method, selected candidate nodes from each community by the classic greedy-based algorithm, and then selected seed nodes from the candidates by dynamic programming. However, the efficiency of this method still needs to be improved. A fast overlapping community-based IM method, FIP33, was proposed that removes insignificant communities to shrink the search space for choosing seed nodes, which makes the method time efficient. It also considers the probability coefficient of global diffusion to improve seed node selection. CoFIM38 used the Louvain algorithm39 for community detection and defined node expansion and intra-community propagation under the weighted cascade model, which avoids thousands of Monte-Carlo simulations. This method performs well on many large-scale datasets and has high time efficiency.

However, the aforementioned methods focus only on network topology and fail to measure the importance of node attributes in attributed networks, even though attributes are as essential an indicator as topology. Some studies40,41 dealt with node attributes and studied the target-aware IM problem, but their optimization objectives differ from that of traditional IM. Besides, the continued growth of network scale and high-dimensional node attributes places higher demands on the efficiency and scalability of community detection algorithms for attributed networks. Inspired by the significant progress in graph embedding42, graph-embedding based community detection has emerged in recent years. AANE43 computed the attribute similarity matrix between nodes, calculated vector representations associated with structural information, and designed the joint learning process in a distributed manner. He et al.44 cast MRFasGCN as an encoder for unsupervised community detection in attributed networks. AGC45, an adaptive graph convolution method, exploited high-order graph convolution to capture global cluster structure and adaptively selected the appropriate order for different networks. These graph-embedding methods only complete the community detection task and do not solve the IM problem. Therefore, vital node identification for IM in attributed networks remains a challenging open problem.

Preliminaries

Attributed networks

Consider a directed and attributed network \(G=(V,E,X)\), where \(V=\{v_1,v_2,\ldots ,v_N\}\) is the set of nodes and \(|V|=N\). E is the set of edges, which can be represented as an adjacency matrix \(A=\{a_{ij}\}\in {\mathbb {R}}^{N\times N}\), where \(a_{ij}=1\) if node \(v_{i}\) connects to node \(v_{j}\) and \(a_{ij}=0\) otherwise. \(X=[x_1,x_2,\ldots ,x_N]^{T}\) is the attribute matrix of all nodes, where \(x_i\in {\mathbb {R}}^d\) is the real-valued attribute vector of node \(v_i\) and d is the attribute dimension.

Linear threshold (LT) model

The LT model5 is a widely used information diffusion model. In the LT model, nodes are divided into two states: active and inactive. In a directed network, the activation of node \(v_i\) depends on its in-neighbors \(N_{in}(v_i)\). If \(v_j\in N_{in}(v_i)\) is active, it has an influence on \(v_i\), denoted as \(b_{v_j,v_i}\). In the LT model, \(b_{v_j,v_i}\) is set as:

$$\begin{aligned} b_{v_j,v_i} = \frac{1}{k_{in}(v_i)} , \end{aligned}$$
(1)

where \(k_{in}(v_i)\) represents the in-degree of node \(v_i\). Each node in \(N_{in}(v_i)\) has an influence value on \(v_i\), and the sum of these values must be no more than 1, that is, \(\sum _{v_j\in N_{in}(v_i)}b_{v_j,v_i}\le 1\). Each node \(v_i\) has an activation threshold \(\theta _{v_i}\) between 0 and 1. Node \(v_i\) is activated once the total influence of its active in-neighbors reaches this threshold, that is, once \(\sum _{v_j\in N_{in}(v_i),\, v_j\ \text {active}}b_{v_j,v_i}\ge \theta _{v_i}\). The diffusion process continues until no more nodes can be activated.
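The LT diffusion described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `simulate_lt`, the dictionary-based graph layout and the toy graph are all assumptions made for this example. Each in-neighbor contributes \(1/k_{in}(v_i)\) as in Eq. (1), and a node is activated when the contribution of its active in-neighbors reaches its threshold.

```python
import random


def simulate_lt(in_neighbors, seeds, thresholds):
    """Run one LT diffusion and return the final set of active nodes.

    in_neighbors: dict mapping every node to the list of its in-neighbors
    thresholds:   dict mapping every node to a threshold in [0, 1]
    (Both layouts are assumptions of this sketch.)
    """
    active = set(seeds)
    changed = True
    while changed:  # stop when a full sweep activates nobody
        changed = False
        for v, preds in in_neighbors.items():
            if v in active or not preds:
                continue
            # Eq. (1): every in-neighbor has weight 1 / k_in(v).
            active_preds = sum(1 for u in preds if u in active)
            if active_preds and active_preds / len(preds) >= thresholds[v]:
                active.add(v)
                changed = True
    return active


# Hypothetical toy graph with edges 0->1, 0->2, 1->2, 2->3.
in_neighbors = {0: [], 1: [0], 2: [0, 1], 3: [2]}
rng = random.Random(0)
thresholds = {v: rng.random() for v in in_neighbors}  # random thresholds
active = simulate_lt(in_neighbors, seeds={0}, thresholds=thresholds)
```

Averaging `len(active)` over many runs with freshly sampled thresholds gives a Monte-Carlo estimate of the expected spread of a seed set.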

Independent cascade (IC) model

Another well-known information diffusion model is the IC model46. In the IC model, each edge has a probability p to measure the social influence of this edge. Nodes are also divided into active and inactive states. If a node \(v_i\) is activated, then it has a chance with probability p to activate its inactive out-neighbor \(v_j\) in a directed network.
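As a rough illustration of the IC process, the sketch below runs a single cascade with a uniform edge probability p. The name `simulate_ic` and the adjacency-dictionary layout are assumptions of this example. It also follows the usual IC convention, not spelled out in the text above, that a newly activated node gets exactly one attempt on each inactive out-neighbor.

```python
import random


def simulate_ic(out_neighbors, seeds, p, rng):
    """Run one IC cascade and return the final set of active nodes."""
    active = set(seeds)
    frontier = list(seeds)  # nodes activated in the previous step
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in out_neighbors.get(u, []):
                # One activation attempt per (newly active u, inactive v).
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


# Hypothetical toy graph with edges 0->1, 0->2, 1->2, 2->3.
out_neighbors = {0: [1, 2], 1: [2], 2: [3], 3: []}
spread = simulate_ic(out_neighbors, {0}, p=0.1, rng=random.Random(42))
```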

Influence maximization

Influence maximization47 aims to find a node subset \(S\subseteq V\) with \(|S|=m\) such that the expected influence spread is maximal:

$$\begin{aligned} S^* = \mathop {\arg \max }\limits _{S\subseteq V,\,|S|=m} \phi (S), \end{aligned}$$
(2)

where \(\phi (S)\) is an objective function used to evaluate the expected number of active nodes after the diffusion process.
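To make the objective in Eq. (2) concrete, the following sketch solves it by exhaustive search on a tiny graph. It is only an illustration under stated assumptions: the stand-in objective `phi` counts the nodes reachable from S (a deterministic proxy, not the expected spread under LT or IC), and the helper names are hypothetical. Exhaustive search is feasible only for very small N and m, which is why practical algorithms rely on greedy or heuristic selection.

```python
from itertools import combinations


def reachable(out_neighbors, seeds):
    """Return all nodes reachable from the seed set (including the seeds)."""
    seen, stack = set(seeds), list(seeds)
    while stack:
        u = stack.pop()
        for v in out_neighbors.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def brute_force_im(nodes, m, phi):
    """Return the size-m subset maximizing phi, by trying every subset."""
    return max(combinations(nodes, m), key=phi)


# Hypothetical toy graph: a chain 0->1->2 and a separate edge 3->4.
out_neighbors = {0: [1], 1: [2], 2: [], 3: [4], 4: []}
phi = lambda S: len(reachable(out_neighbors, S))  # stand-in objective
best = brute_force_im(sorted(out_neighbors), 2, phi)
```

On this toy graph the best pair takes one seed from each component, which mirrors the intuition behind spreading seeds across communities.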

Well-known state-of-the-art methods

Six state-of-the-art IM methods are introduced in this paper. These algorithms have been shown48,49 to perform well on many datasets.

  • CELF6: a much faster greedy-based algorithm based on the submodularity of the spread function. By using the principle of diminishing marginal utility, CELF achieves up to a 700-fold improvement in running time while maintaining similar practical performance compared with the simple greedy-based algorithm. However, the running time of CELF is still prohibitive on large-scale datasets, which limits its use in practical applications. Thus, we do not run it on the Synthetic dataset in this paper.

  • IMM50: a martingale-based algorithm which utilizes reverse influence sets51. It computes a lower bound of the maximum expected spread of m nodes and derives the number of random reverse reachable (RR) sets that need to be sampled. The m nodes that appear most frequently in the RR sets are selected as seeds.

  • CoFIM38: a community-based framework for influence maximization assuming that influence propagates from seed nodes to their neighbors and then from these neighbors to other nodes within the same community. Based on this assumption, an incremental greedy algorithm is developed to select seed nodes. In contrast to other community-based algorithms, CoFIM has high time efficiency.

  • HGD34: a heuristic group discovery algorithm using centrality metrics and the strong community rule to cluster cohesive nodes into one group. Compared with other heuristic-based algorithms, HGD is more efficient and performs well, especially when m is small, since it is a locally optimal algorithm.

  • NCVoteRank32: a neighborhood coreness based voting approach designed to find spreaders by taking the coreness value of neighbors into consideration for the voting of node influence. NCVoteRank is also a heuristic-based algorithm, which outperforms many existing popular algorithms and is competitive in time complexity.

  • K-Shell9: in this method, nodes located within the core of the network are identified as more important by K-Shell decomposition analysis. The m nodes with the largest K-Shell values are selected as seeds.

Methods

The proposed LTPlus propagation model

For a given directed and attributed network G, the LTPlus model considers both the topology influence and the attribute influence between nodes. To allow a direct comparison with the LT model, we do not change how the topology influence is evaluated in the classical LT model. Thus, the incoming topology influence of \(v_i\) is the same as in Eq. (1), and here it is denoted as \(TI_{in}(v_j,v_i)\):

$$\begin{aligned} TI_{in}(v_j,v_i)=\frac{1}{k_{in}(v_i)}, \end{aligned}$$
(3)

where \(v_j\) is an in-neighbor of \(v_i\).

Since node attributes represent common characteristics among nodes, which play essential roles in information diffusion, the incoming influence from in-neighbors in the LTPlus model is decided jointly by the incoming topology influence and the incoming attribute influence. Similar attribute vectors mean that the nodes are homogeneous, and information propagates more easily between them. That is to say, the attribute influence is greater if the attribute vectors of two nodes are similar. We simply use the cosine similarity52 to measure the similarity of attribute vectors:

$$\begin{aligned} s_a(v_j,v_i) = \frac{x_i\cdot x_j}{\Vert x_i\Vert \cdot \Vert x_j\Vert }. \end{aligned}$$
(4)

To bring the topology influence and the attribute influence to the same order of magnitude, we adopt the edge-softmax53 method to normalize \(s_a(v_j,v_i)\) for each node and obtain the incoming attribute influence of \(v_i\):

$$\begin{aligned} AI_{in}(v_j,v_i) = \frac{s_a(v_j,v_i)}{\sum _{v_l \in N_{in}(v_i)} s_a(v_l,v_i)}, \end{aligned}$$
(5)

where \(v_j\) is an in-neighbor of \(v_i\), and \(N_{in}(v_i)\) represents the set of in-neighbors of \(v_i\).

To sum up, the incoming influence of node \(v_i\) from its in-neighbor \(v_j\) is calculated as a linear combination of the incoming topology influence \(TI_{in}(v_j,v_i)\) and the incoming attribute influence \(AI_{in}(v_j,v_i)\). Thus, the incoming influence \({\hat{b}}_{v_j,v_i}\) in the LTPlus model is defined as:

$$\begin{aligned} {\hat{b}}_{v_j,v_i} = \alpha _1 \cdot TI_{in}(v_j,v_i) + \alpha _2 \cdot AI_{in}(v_j,v_i), \end{aligned}$$
(6)

where \(\alpha _1\) and \(\alpha _2\) indicate the weight coefficients of topology and attribute influence, \(\alpha _1, \alpha _2 \in [0,1]\) and \(\alpha _1 + \alpha _2 = 1\).

The LTPlus propagation model thus takes into account both the topology structure and the attribute similarity between nodes. Besides, it allows different in-neighbors to contribute different attribute influence, which is more in line with real information propagation. When \(\alpha _1 = 1\), the LTPlus model degenerates into the LT model, while \(\alpha _1 = 0\) means that only node attributes are considered in the diffusion process. By default, we treat the topology and attribute influence on an equal basis and set \(\alpha _1 = \alpha _2 = 0.5\).
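Equations (3) to (6) can be combined into one small routine that returns the LTPlus weight of every in-neighbor of a node. The sketch below is illustrative only: the names `cosine` and `ltplus_weights` and the dictionary layout are assumptions, and the fallbacks for zero-norm vectors and for a zero similarity sum are choices made here, since the text does not specify those corner cases.

```python
import math


def cosine(x, y):
    """Cosine similarity of Eq. (4); returns 0.0 for a zero vector (assumption)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def ltplus_weights(in_neighbors, attrs, v, alpha1=0.5):
    """Return {u: b_hat(u, v)} for every in-neighbor u of v, following Eq. (6)."""
    preds = in_neighbors[v]
    alpha2 = 1.0 - alpha1
    sims = {u: cosine(attrs[u], attrs[v]) for u in preds}
    total = sum(sims.values())
    weights = {}
    for u in preds:
        ti = 1.0 / len(preds)  # Eq. (3)
        # Eq. (5); uniform fallback when all similarities are zero (assumption).
        ai = sims[u] / total if total else 1.0 / len(preds)
        weights[u] = alpha1 * ti + alpha2 * ai
    return weights


# Hypothetical example: node 2 has in-neighbors 0 and 1.
in_neighbors = {2: [0, 1]}
attrs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}
w = ltplus_weights(in_neighbors, attrs, 2)  # node 0 shares node 2's attributes
```

With non-negative attributes the weights of a node's in-neighbors sum to 1, so the LT constraint on the total incoming influence still holds; setting `alpha1=1.0` recovers the uniform LT weights.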

The graph-embedding based community detection method

The goal of graph-embedding based community detection is to partition nodes in the network G into l clusters \(C=\{C_1,C_2,\ldots ,C_l\}\). As mentioned above, an adaptive graph convolution (AGC) method45 is used in this paper as the community detection method. A low-pass graph filter F45 is designed in AGC:

$$\begin{aligned} F = I - \frac{1}{2}L_s, \end{aligned}$$
(7)

where \(L_s = I-D^{-\frac{1}{2}}AD^{-\frac{1}{2}}\) is the symmetrically normalized graph Laplacian operator, I is the identity matrix and D is the degree matrix. To capture global graph structures and facilitate clustering, AGC defined k-order graph convolution45 as:

$$\begin{aligned} {\bar{X}}=(I-\frac{1}{2}L_s)^k X, \end{aligned}$$
(8)

where k is a positive integer. After convolution, AGC employed the linear kernel \(K={\bar{X}}{\bar{X}}^T\) to learn pairwise similarity between nodes and then performed spectral clustering on \(W=\frac{1}{2}(|K|+|K^T|)\) to obtain clustering results.
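The filter of Eq. (7), the k-order convolution of Eq. (8) and the similarity matrix W can be sketched with NumPy as follows. This is a dense-matrix illustration for small graphs, not the AGC reference code: the function names are hypothetical, isolated nodes are given a zero normalization factor as an assumption, and the final spectral clustering step on W is left out.

```python
import numpy as np


def agc_convolve(A, X, k):
    """Return X_bar = (I - L_s / 2)^k X for a symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    X = np.asarray(X, dtype=float)
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0  # isolated nodes keep a zero factor (assumption)
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    identity = np.eye(len(A))
    L_s = identity - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    F = identity - 0.5 * L_s  # low-pass filter of Eq. (7)
    return np.linalg.matrix_power(F, k) @ X  # Eq. (8)


def similarity_matrix(X_bar):
    """Return W = (|K| + |K^T|) / 2 with the linear kernel K = X_bar X_bar^T."""
    K = X_bar @ X_bar.T
    return 0.5 * (np.abs(K) + np.abs(K.T))


# Hypothetical toy input: a 3-node path graph with one-hot attributes.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
X_bar = agc_convolve(A, np.eye(3), k=2)
W = similarity_matrix(X_bar)
```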

k-order graph convolution will produce smoother attributes as k increases, but too large k may lead to over-smoothing, i.e., the attributes of nodes in different clusters are mixed and become indistinguishable. To adaptively select the order k, the intra-cluster distance intra(C)45 is computed to measure clustering performance:

$$\begin{aligned} intra(C)= \frac{1}{|C|}\sum _{C_i\in C} \frac{1}{|C_i|(|C_i|-1)}\sum _{v_i,v_j\in C_i,v_i\ne v_j}\Vert \bar{x_i}-\bar{x_j}\Vert , \end{aligned}$$
(9)

where |C| is the number of communities and \(|C_i|\) is the number of nodes in community \(C_i\). This graph convolution network is trained iteratively until intra(C) converges.
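Equation (9) translates directly into code. In the sketch below, the function name and the input layout (a list of clusters, each a list of node ids, plus a mapping from node id to its convolved attribute vector) are assumptions. Singleton clusters, for which the factor \(|C_i|(|C_i|-1)\) is zero, are treated as contributing zero distance, which is a choice made for this example.

```python
import math
from itertools import permutations


def intra_cluster_distance(clusters, x_bar):
    """Average pairwise distance inside clusters, following Eq. (9)."""
    total = 0.0
    for members in clusters:
        n = len(members)
        if n < 2:
            continue  # singleton cluster contributes 0 (assumption)
        # Sum over ordered pairs (v_i, v_j) with v_i != v_j.
        pair_sum = sum(
            math.dist(x_bar[i], x_bar[j]) for i, j in permutations(members, 2)
        )
        total += pair_sum / (n * (n - 1))
    return total / len(clusters)


# Hypothetical example with two clusters of two nodes each.
x_bar = {0: [0.0, 0.0], 1: [3.0, 4.0], 2: [1.0, 1.0], 3: [1.0, 1.0]}
score = intra_cluster_distance([[0, 1], [2, 3]], x_bar)
```

Here the first cluster has mean distance 5 and the second 0, so the score is 2.5; lower values indicate tighter clusters.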

However, AGC is designed for undirected networks. The symmetric operator \(L_s\) cannot be directly used for directed networks, since adjacency matrices of directed networks are asymmetric. A simple but effective method is to construct a symmetric matrix \(A_s\)54:

$$\begin{aligned} A_s=A+A^T. \end{aligned}$$
(10)

Then, a degree matrix \(D_s\) is built from \(A_s\) and the Laplacian operator is \(L_{sd} = I-D_s^{-\frac{1}{2}}A_sD_s^{-\frac{1}{2}}\). That is, the graph Laplacian operator \(L_s\) in AGC is replaced by \(L_{sd}\) in this paper. For notational convenience, the modified AGC method applicable to directed networks is denoted as DAGC.
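The symmetrization of Eq. (10) and the resulting operator \(L_{sd}\) can be sketched as below. The function name is hypothetical, and giving isolated nodes a zero normalization factor is an assumption of this example. Note that a pair of reciprocal edges yields an entry of 2 in \(A_s\) under Eq. (10).

```python
import numpy as np


def symmetrized_laplacian(A):
    """Return (A_s, L_sd) for a directed adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    A_s = A + A.T  # Eq. (10)
    deg = A_s.sum(axis=1)  # diagonal of D_s
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0  # isolated nodes keep a zero factor (assumption)
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    L_sd = np.eye(len(A)) - d_inv_sqrt[:, None] * A_s * d_inv_sqrt[None, :]
    return A_s, L_sd


# Hypothetical example: a single directed edge 0 -> 1.
A_s, L_sd = symmetrized_laplacian([[0, 1], [0, 0]])
```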

The seed nodes selection method

After community detection, nodes with strong influence are selected from different communities by measuring both topology and attribute influence. There are two key issues in the seed nodes selection phase: (1) how many nodes should be selected from each community, and (2) how the seed nodes should be selected.

To address the first problem, we empirically find that communities of different sizes should not be treated the same, since placing seed nodes in a large community could trigger more nodes than in a small community. According to this, a quota-based approach is adopted and \(m_{C_i}\) nodes are selected from each community:

$$\begin{aligned} m_{C_i} = round(m \times \frac{|C_i|}{N}), \end{aligned}$$
(11)

where the round() function rounds its argument to the nearest integer, and m is the total number of seed nodes. Thus, \(m_{C_i}\) nodes are selected from community \(C_i\) and added to the seed node sequence. If the length of the seed node sequence reaches or exceeds m, the iteration stops. Otherwise, if the sequence is shorter than m, the node with the maximum influence in the current network is selected as the next seed node.
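The quota rule of Eq. (11) is easy to check numerically. The sketch below assumes round-half-up behavior, implemented with `math.floor(x + 0.5)`, because Python's built-in `round` rounds exact halves to the nearest even integer; which convention the original implementation uses is not stated. The function name is hypothetical.

```python
import math


def community_quotas(community_sizes, m):
    """Return the per-community seed quotas m_{C_i} of Eq. (11)."""
    n = sum(community_sizes)  # N, the total number of nodes
    # Round half up (assumption); built-in round() would round halves to even.
    return [math.floor(m * size / n + 0.5) for size in community_sizes]


# Community sizes of the Synthetic dataset described in this paper, m = 50.
quotas = community_quotas([30000, 35000, 40000], 50)

# Hypothetical skewed case: the two small communities receive no seeds.
skewed = community_quotas([10, 10, 80], 2)
```

Because of rounding, the quotas need not sum to m exactly, which is why the procedure above truncates the sequence when it is too long and tops it up with the most influential remaining nodes when it is too short.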

Algorithm 1 The seed nodes selection method.

For the second key problem, when selecting influential nodes in directed networks, we pay more attention to how many nodes can be affected by one node. The more nodes it points to, the more nodes it can affect. Thus, the out-degree of each node is used to measure its topology influence, which can be formulated by:

$$\begin{aligned} TI_{out}(v_i)=k_{out}(v_i). \end{aligned}$$
(12)

The more similar the attributes of two nodes, the more likely information is to propagate successfully between them. Thus, the attribute influence of a node is measured by its attribute similarities to its out-neighbors. The attributes after graph convolution, \({\bar{X}}\), are used to compute cosine similarities between nodes, since they integrate topology and attributes well. Note that, unlike Eq. (4), the attribute similarity after convolution, denoted as \(\overline{s_a}(v_i,v_k)\), is calculated between node \(v_i\) and its out-neighbor \(v_k\):

$$\begin{aligned} \overline{s_a}(v_i,v_k) = \frac{\bar{x_i}\cdot \bar{x_k}}{\Vert \bar{x_i}\Vert \cdot \Vert \bar{x_k}\Vert }. \end{aligned}$$
(13)

The attribute influence of a node is calculated by summing its attribute similarities to all of its out-neighbors:

$$\begin{aligned} AI_{out}(v_i) = \sum _{v_k \in N_{out}(v_i)} \overline{s_a}(v_i,v_k), \end{aligned}$$
(14)

where \(N_{out}(v_i)\) is the out-neighbors set of node \(v_i\).

To ensure that the influence of each node lies in the range [0, 1], the topology and attribute influence of each node are normalized by Min-Max scaling. The normalized values of \(TI_{out}(v_i)\) and \(AI_{out}(v_i)\), denoted as \(NTI(v_i)\) and \(NAI(v_i)\) respectively, are calculated as follows:

$$\begin{aligned} \left\{ \begin{aligned} NTI(v_i)= & {} \frac{TI_{out}(v_i)-min(TI_{out})}{max(TI_{out})-min(TI_{out})} \\ NAI(v_i)= & {} \frac{AI_{out}(v_i)-min(AI_{out})}{max(AI_{out})-min(AI_{out})}, \end{aligned} \right. \end{aligned}$$
(15)

where \(max(TI_{out})\) and \(min(TI_{out})\) are the maximal and minimal values of the nodes’ topology influence, and similarly \(max(AI_{out})\) and \(min(AI_{out})\) are the maximal and minimal values of the nodes’ attribute influence. The topology influence and the attribute influence are treated on an equal basis. Thus, the total outgoing influence of each node is:

$$\begin{aligned} INF(v_i) = NTI(v_i) + NAI(v_i). \end{aligned}$$
(16)
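Equations (12) to (16) can be sketched as one scoring function. The names below are hypothetical, and two corner cases are handled by assumption since the text does not cover them: zero-norm vectors get similarity 0, and when all values are equal the Min-Max scaling returns 0 for every node.

```python
import math


def cosine(x, y):
    """Cosine similarity of Eq. (13); 0.0 for a zero vector (assumption)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def min_max(values):
    """Min-Max scaling of Eq. (15); all zeros if the range is empty (assumption)."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {v: 0.0 for v in values}
    return {v: (x - lo) / (hi - lo) for v, x in values.items()}


def influence_scores(out_neighbors, x_bar):
    """Return {v: INF(v)} following Eqs. (12), (14), (15) and (16)."""
    ti = {v: len(nbrs) for v, nbrs in out_neighbors.items()}  # Eq. (12)
    ai = {
        v: sum(cosine(x_bar[v], x_bar[k]) for k in nbrs)  # Eq. (14)
        for v, nbrs in out_neighbors.items()
    }
    nti, nai = min_max(ti), min_max(ai)
    return {v: nti[v] + nai[v] for v in out_neighbors}  # Eq. (16)


# Hypothetical toy graph with edges 0->1, 0->2, 1->2.
out_neighbors = {0: [1, 2], 1: [2], 2: []}
x_bar = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
inf = influence_scores(out_neighbors, x_bar)
```

In this example node 0 has both the largest out-degree and the most similar out-neighbor, so it receives the maximum score of 2.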

For each community with \(m_{C_i}>0\), the INF values of the nodes in this community are calculated and the node with the maximum INF value is selected as the seed node. To reduce the propagation overlap between seed nodes selected from the same community, a node is removed from the network when it is selected as a seed node, and the influence of its in-neighbors is weakened. Suppose that node \(v_j\) is an in-neighbor of node \(v_i\); the topology and attribute influence of \(v_j\) are reduced if node \(v_i\) is selected as a seed node. The updated topology influence \(TI_{out}^{'}(v_j)\) and attribute influence \(AI_{out}^{'}(v_j)\) are calculated as:

$$\begin{aligned} \left\{ \begin{aligned} TI_{out}^{'}(v_j)&= TI_{out}(v_j)-1 \\ AI_{out}^{'}(v_j)&= AI_{out}(v_j)- \overline{s_a}(v_j,v_i). \end{aligned} \right. \end{aligned}$$
(17)

Then the normalized topology and attribute influence of \(v_j\) are updated by substituting Eq. (17) into Eq. (15). Finally, \(INF(v_j)\) is updated by recalculating Eq. (16). The node with the maximum INF is selected as the seed node in each iteration. The proposed seed nodes selection method is summarized in Algorithm 1.
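The selection-and-update loop can be sketched as follows for a single candidate set. This is not a transcription of Algorithm 1: the function names are hypothetical, the Min-Max normalization is recomputed over the remaining candidates in every iteration, and ties are broken by the smallest node id, all of which are assumptions of this example.

```python
import math


def cosine(x, y):
    """Cosine similarity of Eq. (13); 0.0 for a zero vector (assumption)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def min_max(values):
    """Min-Max scaling of Eq. (15); all zeros if the range is empty (assumption)."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {v: 0.0 for v in values}
    return {v: (x - lo) / (hi - lo) for v, x in values.items()}


def select_seeds(out_neighbors, x_bar, candidates, quota):
    """Pick up to `quota` seeds, weakening in-neighbors as in Eq. (17)."""
    in_neighbors = {v: [] for v in out_neighbors}
    for u, nbrs in out_neighbors.items():
        for v in nbrs:
            in_neighbors[v].append(u)
    ti = {v: float(len(nbrs)) for v, nbrs in out_neighbors.items()}
    ai = {
        v: sum(cosine(x_bar[v], x_bar[k]) for k in nbrs)
        for v, nbrs in out_neighbors.items()
    }
    remaining, seeds = set(candidates), []
    while remaining and len(seeds) < quota:
        pool = sorted(remaining)  # sorted so ties go to the smallest id
        nti = min_max({v: ti[v] for v in pool})
        nai = min_max({v: ai[v] for v in pool})
        best = max(pool, key=lambda v: nti[v] + nai[v])  # Eq. (16)
        seeds.append(best)
        remaining.discard(best)  # the seed leaves the network
        for u in in_neighbors[best]:  # Eq. (17)
            ti[u] -= 1.0
            ai[u] -= cosine(x_bar[u], x_bar[best])
    return seeds


# Hypothetical graph: hub 2, node 0 pointing at the hub, node 1 elsewhere.
out_neighbors = {0: [2, 6], 1: [7, 8], 2: [3, 4, 5],
                 3: [], 4: [], 5: [], 6: [], 7: [], 8: []}
x_bar = {v: [1.0] for v in out_neighbors}  # identical attributes
seeds = select_seeds(out_neighbors, x_bar, out_neighbors.keys(), quota=2)
```

Nodes 0 and 1 start with the same out-degree, but once hub 2 is chosen, node 0 loses the edge pointing at the hub and node 1 becomes the second seed. Without the update step, the tie would have gone to node 0, whose spread overlaps with that of the hub.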

Complexity analysis

We also analyze the time complexity of the proposed algorithm. Firstly, if the DAGC method iterates t times, the time complexity of DAGC community detection is \(O(N^2dt+ndt^2)\), where N is the number of nodes, d is the number of attributes and n is the number of nonzero entries of the adjacency matrix A45. Secondly, influence values for nodes in communities with \(m_{C_i}>0\) are calculated in the seed nodes selection phase (as described in the 3rd to 9th rows of Algorithm 1), which has \(O(l\cdot m_{C_i}\cdot |C_i|)\) complexity. Since \(|C_i|\) can be approximated by the average value \(\frac{N}{l}\) and \(m_{C_i}\) is a constant, \(O(l\cdot m_{C_i}\cdot |C_i|)\approx O(N)\). The complexity of recalculating the influence of the selected node’s in-neighbors (as described in the 12th row of Algorithm 1) is \(O(l\cdot m_{C_i}\cdot |N_{in}(v_i^*)|)\). Since \(|N_{in}(v_i^*)|\ll |C_i|\), we have \(O(l\cdot m_{C_i}\cdot |N_{in}(v_i^*)|)\ll O(N)\), so the complexity of the seed nodes selection method is O(N). Overall, the total complexity of the proposed influence maximization algorithm is \(O(N^2dt+ndt^2+N)\).

Results

Data description

We evaluate the performance of the proposed algorithm on five real-world datasets and a large-scale synthetic dataset. Details of these datasets are given in Table 1. The five real-world datasets are Pubmed, Cora, Cornell, Texas and Washington. The Pubmed dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. Its citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary of 500 unique words. The Cora dataset consists of 2708 scientific publications and 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from a dictionary of 1433 unique words. The Cornell, Texas and Washington datasets are gathered from three different universities. Each line of these datasets contains two webpage IDs. The first entry is the ID of the webpage being cited and the second is the ID of the webpage that contains the citation. The large synthetic dataset, named ‘Synthetic’, is constructed with 105,000 nodes and 830,159 edges. To generate it, the function \(random\_partition\_graph()\) in the networkx package of Python is used. More specifically, the number of communities is set to 3 and the community sizes are set to \([3\times 10^4, 3.5\times 10^4, 4\times 10^4]\). Nodes in the same community are connected with probability \(2.5\times 10^{-4}\) and nodes of different communities are connected with probability \(1\times 10^{-4}\). The attribute of each node is a vector of size 100. Initially, each bit of the vector is randomly assigned 0 or 1. When all neighbors of a node have attributes, the attribute of this node is set to the rounded average of its neighbors’ attribute values.

Table 1 Details of six datasets used in this paper.

Performance metrics

Three metrics are employed to evaluate the performance of our proposed algorithm in this paper:

  • Influence spread \(\sigma (S)\): for a given seed set S, the expected number of active nodes when the diffusion under the propagation model reaches a steady state is denoted as \(\phi (S)\). In the following experiments, \(\phi (S)\) is the average over 1000 Monte-Carlo simulations. To facilitate comparison across datasets of different scales, influence spread is defined as the ratio between \(\phi (S)\) and the total number of nodes in the dataset:

    $$\begin{aligned} \sigma (S)=\frac{\phi (S)}{N}. \end{aligned}$$
    (18)

    Influence spread is used to evaluate the effectiveness of an influence maximization algorithm. Higher \(\sigma (S)\) value indicates that the algorithm is more effective.

  • Running time: running time is defined as the time for selecting m seed nodes. In the previous community-based influence maximization study38, only the time of seed nodes selection phase is considered. To analyze the running time of the whole influence maximization algorithm in more detail, we report the running time of community detection, attribute similarity calculation (or K-Shell calculation for HGD and NCVoteRank) and seed nodes selection respectively, as shown in Table . The running time is measured in seconds.

  • Speedup: the speedup is measured for influence spread of the proposed method over baseline methods with \(m=30\), 40 and 50 seed nodes. The speedup55 is computed as:

    $$\begin{aligned} \text {speedup}=((A-B)/A)\times 100, \end{aligned}$$
    (19)

    where A and B are the influence spread of two compared methods. For example, if the influence spread of Ours and K-Shell methods are 0.4475 and 0.2328, respectively, the speedup of Ours compared to K-Shell is calculated as: \(\text {speedup}_{Ours\rightarrow K-Shell}=((0.4475-0.2328)\div 0.4475\times 100)=47.98\). Similarly, the speedup of K-Shell compared to ours is calculated as: \(\text {speedup}_{K-Shell\rightarrow Ours}=((0.2328-0.4475)\div 0.2328\times 100)=-92.23\).
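The worked example for Eq. (19) can be reproduced with a two-line helper. The function name `speedup` is an assumption of this sketch; the input values are the ones quoted in the example above.

```python
def speedup(a, b):
    """Relative difference of Eq. (19), in percent, of spread a over spread b."""
    return (a - b) / a * 100


ours, k_shell = 0.4475, 0.2328  # influence spreads from the example above
ours_vs_k_shell = round(speedup(ours, k_shell), 2)
k_shell_vs_ours = round(speedup(k_shell, ours), 2)
```

Because the denominator is the first argument, the measure is not antisymmetric: swapping the two methods changes the magnitude as well as the sign.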

Experimental results

On the above networks, the benchmark algorithms CELF6, IMM50, CoFIM38, HGD34, NCVoteRank32 and K-Shell9 are compared with our proposed method. To evaluate the effectiveness of our proposed method, we compare the influence spread \(\sigma (S)\) of the different algorithms for different numbers of seed nodes m under the LTPlus model, with the activation threshold of each node randomly sampled. Results on the six datasets are shown in Fig. 1, where the x-axis represents the number of seed nodes m and the y-axis represents the influence spread \(\sigma (S)\). From the results, we can see that our method outperforms the community-based method (CoFIM) and the heuristic-based methods (HGD, NCVoteRank and K-Shell) on all datasets. Besides, our proposed method surpasses CELF on the Pubmed dataset in some scenarios. CELF and IMM have similar influence spread on the six datasets. On the four small datasets (Fig. 1b–e), our method performs similarly to CELF and IMM, which have theoretical guarantees. However, CELF cannot be executed on the Synthetic dataset since its running time is prohibitive. Methods with no theoretical guarantees may perform well on some datasets but poorly on others. For example, NCVoteRank and CoFIM perform well on Pubmed and Synthetic but poorly on Washington. Since both topology and attribute influence are considered in the seed nodes selection process, our method is more stable than the other methods without theoretical guarantees. Overall, the influence spread results on the six datasets show the effectiveness and robustness of our proposed algorithm in finding influential seed nodes and achieving influence maximization.

Figure 1 The influence spread \(\sigma (S)\) of different algorithms on six datasets with different numbers of initial spreaders m under the proposed LTPlus model. The activation threshold of each node is randomly sampled.

Since the independent cascade (IC) model is also a classic propagation model, experiments are carried out on the IC model to evaluate the performance of the proposed method. In the IC model, a uniform probability p is assigned to each edge of the graph, so a node \(v_i\) has a chance p of activating each of its out-neighbors. The probability p in our experiments is set to 0.1, following the previous study5, and the number of seeds m ranges from 5 to 50. From Fig. 2, we can see that our proposed method still performs well in most cases. In addition, because our node selection method does not depend on the propagation model, we do not need to re-select seeds when the propagation model changes. This indicates the generality of our method.

Figure 2 The influence spread \(\sigma (S)\) of different algorithms on six datasets with different numbers of initial spreaders m under the IC model.

The speedup experiments based on the LTPlus and IC models are shown in Tables 3 and 4, respectively. Three seed set sizes, 30, 40 and 50, are used. Table 3 reveals that the proposed method achieves a positive speedup over CoFIM, HGD, NCVoteRank and K-Shell on all datasets. The proposed method also achieves a positive speedup over CELF and IMM on the Pubmed and Washington datasets. Although its speedup relative to CELF and IMM is negative on the Cornell and Texas datasets, the absolute value is very small, which means that the difference in influence spread between our method and these two baselines is small. In Table 4, the proposed method achieves a positive speedup over the baseline methods on almost all datasets. These experimental results show the effectiveness of our proposed method.
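The exact speedup formula is not restated in this section. A common reading, assumed here, is the relative gain in influence spread of Ours over a baseline, expressed as a percentage. The helper name `speedup_percent` is hypothetical.

```python
def speedup_percent(spread_ours, spread_baseline):
    """Relative gain in influence spread of Ours over a baseline, in percent.

    Assumed definition: 100 * (sigma_ours - sigma_baseline) / sigma_baseline.
    A positive value means Ours reaches more nodes than the baseline.
    """
    if spread_baseline <= 0:
        raise ValueError("baseline spread must be positive")
    return 100.0 * (spread_ours - spread_baseline) / spread_baseline
```

Under this reading, a small negative value (for example against CELF or IMM) corresponds to a small absolute gap in influence spread rather than a large loss.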

In the seed node selection phase, we propose to recalculate the current influence of the seed nodes' in-neighbors (as shown in the 12th row of Algorithm 1) to reduce the propagation overlap between seed nodes selected from the same community. To verify the effectiveness of this step, we compare the influence spread of our proposed algorithm with and without recalculating the INF of the seed nodes' in-neighbors. As shown in Table 2, the first row for each dataset is the influence spread of Ours on the LTPlus model, and the second row is the influence spread of our proposed method without recalculating the INF of the seed nodes' in-neighbors in the seed node selection phase, that is, without the 12th row of Algorithm 1. Compared with the variant without recalculation, Ours improves the influence spread to some extent. In particular, on the Washington network with \(m=5\), the value in the first row is significantly higher than that in the second row. A possible reason is that the nodes in this network are concentrated in the same community and the number of initial seed nodes is small. Most seed nodes are then selected from the same community and may be connected to each other, so they share a large number of common neighbors, which eventually leads to a small influence spread. Therefore, it is necessary to recalculate the influence of the seed nodes' in-neighbors in the seed node selection process.

Table 2 Ablation experiments that analyze the impact of recalculating the INF of seed nodes’ in-neighbors.
Table 3 Speedup % (in terms of influence spread) for Ours versus other baseline methods on six datasets. The propagation is simulated on the LTPlus model.
Table 4 Speedup % (in terms of influence spread) for Ours versus other baseline methods on six datasets. The propagation is simulated on the IC model.

Time efficiency is a key indicator of concern to many researchers. Therefore, the running times of our proposed algorithm and the baseline algorithms are analyzed in stages. Experiments are carried out on a computer with a 2.30 GHz Intel i7-10875H CPU and 32 GB of memory. Table 5 shows the running time of the various algorithms on the six datasets. Here, the running time of seed node selection is the time taken to select 25 seed nodes. As can be seen from this table, the time efficiency of our proposed method is very competitive in the seed node selection phase. Although CELF performs well in influence spread, its running time is too long. IMM shows high time efficiency on all datasets. However, seed selection in both CELF and IMM depends on the propagation model, so they must reselect seeds when the propagation model changes. CoFIM has relatively high time efficiency in the seed node selection process on large-scale datasets. The running time of K-Shell is low, but its influence spread is unsatisfactory. HGD and NCVoteRank show high time efficiency on some datasets but are inefficient on others, and their influence spread performance is also unstable.

Table 5 Running time (in seconds) for different algorithms on six datasets.

In addition to the time of the seed node selection phase, the community detection time of Ours and CoFIM is also analyzed. Compared with CoFIM, the graph-embedding based community detection method used in Ours requires more time to find proper communities. Although the community detection phase is time-consuming, it only needs to be carried out once for each dataset, no matter how many groups of experiments are run on that dataset. The time for calculating attribute similarities in CELF and Ours under the LTPlus model is reported. Similarly, the time for calculating K-Shell values in HGD and NCVoteRank is also reported. It should be noted that attribute similarities and K-Shell values are computed and saved in advance for the convenience of multiple experiments. That is, they are computed only once for each dataset.

Discussion

In summary, we propose an extension of the LT information propagation model, named LTPlus, that considers both the topology and the attributes of nodes in propagation simulations. This model is more suitable for attributed networks than previous information propagation models. In addition, we propose a novel community-based method to identify a set of vital nodes to achieve influence maximization in attributed networks. To the best of our knowledge, the proposed method is the first effort to combine influence maximization with a graph-embedding community detection method. Compared with well-known state-of-the-art methods, empirical analyses on five real-world networks and a large-scale synthetic network under the LTPlus model suggest that our proposed method consistently performs very competitively, as shown in Fig. 1. Experimental results in Fig. 2 show the universality of our proposed method under the IC model. We believe our work can shed some light on future studies of the influence maximization problem. For example, the graph-embedding community detection method can be further improved for directed attributed networks. In addition, an end-to-end method that considers the properties of propagation models can be explored in future work.