Social influence, news, as well as infectious diseases diffuse in society, following links drawn between participants by frequent contact, mutual interests, collaboration, communication, or transportation. The influence of a single node in such a network measures the extent to which the node, acting as the seed of a multi-hop diffusion process, will activate the rest of the network (this is the cascade size in the domain of online social networks, and the attack rate or outbreak size in epidemiology). Even assuming that the network links are known and the process of diffusion can be modelled or measured, predicting the top influential nodes when knowing their nodes’ topological centrality indicators remains difficult, also because of the diversity and size of social contact topologies. This study shows, on real-world social networks, that in many networks the joint values of two or more (dissimilar) node centrality indicators are predictive for the influence of the node, and that good combinations are between one local centrality which measures the size of the node’s neighbourhood and one global centrality: a variant of the eigenvector centrality, closeness, or the node’s core number. We illustrate with examples how the addition of such a second centrality to the prediction process is beneficial on some networks, and show that simple, interpretable statistical models can be machine-learnt in a supervised fashion on two or more centrality indicators, with almost universally good results across many real networks and network categories.

Most prior studies predict top influencers by a ranking method1: the nodes in a network are ranked according to a single centrality, with the top assumed to be the best influencers. No single centrality is consistent in performance across realistic case studies. The degree centrality was found a weak predictor in early studies, over both simulated and measured diffusion2,3. With the susceptible-infectious-recovered (SIR)4 diffusion model and also with measured diffusion, the top \(f=5{\%}\) spreaders in a small number of networks were better predicted by their core numbers than by degree or betweenness centrality5,6. The predictive power of the core number was later shown to not generalise, for SIR influence at or above the epidemic threshold. In road networks, the core number correlated little with the spreading ability of a node, while in social networks the degree and core number were either equally predictive7, or variably predictive with f8. Over a test suite of ten networks, the eigenvector centrality was on average better than the core number9. While refinements of classical centrality indicators were designed7,10,11,12,13,14,15,16,17,18,19,20, also alternative ideas to combine classical centrality indicators into a predictor of influence started in 2011. A metric equal to the betweenness centrality of a node, divided by a power of its degree21 was used to recognise the seed of a diffusion process, but was not successful on a real-world topology. By 2020 (the time of this writing), some methods22,23,24,25,26 were not applied beyond relatively small or few networks, and also provide no explanation or intuition for the results. A scalable method based on graph neural networks27 was black-box and cannot explain its decision. More interpretable approaches28,29 aggregated the individual rankings or values of two or more centralities, with coefficients based on the correlations between the rankings, or the information entropy of a centrality. This obtained recognition rates above 0.7 in 16 networks, with 9–18% improvement over the best single ranking in five of these networks (a lower 1–5% in the rest) and drew the conclusion that the same set of centralities suit networks with similar Laplacian spectra, but making a stronger conclusion on the connection between network topologies and centralities requires more network samples28.

Two recent studies gained more detailed insight. Over all non-isomorphic small networks (up to 10 nodes), one normalized spectral centrality (PageRank or Katz centrality) together with degree (or another measure of network density) predicted well the exact expected SIR spread sizes30. For the related problem of maximising collective influence, PageRank plus metrics related to the node’s degree and neighbourhood brought 2–5% improvement compared to the baseline greedy heuristic in real networks31. Here, we aim for more general answers: are there other good combinations of classical centralities? can one explain the added value of a centrality? does the predictive power of a combination of centralities generalise across many topologies?

We give an early example in Fig. 1, for the 4158-node coauthorship network Arxiv GRQC. The top \(f=5{\%}\) of nodes by the size of their neighbourhood (the sum of degrees of nearest neighbours), encircled on the left in Fig. 1, form clusters distributed in the network. The top nodes by the eigenvector centrality (centre) are instead local to one cluster. Neither of these solutions entirely coincides with the correct set of top spreaders, but reasoning with both sets of data leads to a good prediction. The true top spreaders by the SIR diffusion model at the epidemic threshold are shown on the right: these are located in and around only that subset of the clusters with a large neighbourhood which have also marginally higher eigencentrality values due to being in or close to the high-eigencentrality cluster. (Fig. 6 will provide more detail).

Figure 1
figure 1

Comparing the location of the top nodes as ranked by (left) neighbourhood size, (centre) eigenvector centrality, and (right) SIR spread size at the epidemic threshold for the coauthorship network Arxiv GRQC. The network layout is force-directed. The colour of the nodes in each panel shows the value of that metric: darker nodes have higher centrality values or spread size. The top \(f=5{\%}\) of the nodes in each case are encircled.

We study a large and diverse set of real-world test networks of sizes between 1000 and 70,000 nodes, assuming complete knowledge of the links in the network. The predictive power of two or more centrality indicators is measured by training a supervised statistical classifier on sample nodes from each network. The ground truth for the influence of any node is estimated accurately via the simulation of the SIR diffusion model with that node as the seed of diffusion—possible here since there is one seed, unlike in studies on collective influence, where an approximate greedy heuristic must instead be used as a baseline31. The target of the classification is then a binary variable which shows whether the node is in the true top \(f{\%}\) of spreaders. While the results are diverse across the set of networks, we find six universally good pairs between one local centrality which measures the density of the node’s extended neighbourhood and one global centrality (eigencentrality or PageRank, closeness, core number), and give an intuition for why they complement each other well. With all seven classical centralities, the average precision function is close to perfect (0.995) and the average recognition rate is 0.921.

The practical use of these results is twofold. The method of supervised classification can be ported to any new network where the assumption of complete knowledge about the links is satisfied. For a more realistic estimation of node influence, empirical diffusion data3,6, when available, can replace the mathematical model of diffusion. More importantly, the basic principles of centrality pairing can help with the design of more effective centrality indicators or ranking algorithms, and can improve the understanding of diffusion outcomes in social networks.


We run an empirical study over 60 real-world examples of static network topologies (listed in Table 1 in Methods). The networks are directed, unweighted, and fall into six categories: human social networks (separately, online or offline), human networks formed by professional coauthorship or online communication, computer networks, and physical infrastructure. The influence of a node is the SIR spread size when the node is the seed of diffusion, estimated via Monte Carlo simulation (see Methods). Analyses are shown in this section for the SIR influence at the epidemic threshold \(\lambda _c\) for every network; they hold also above the epidemic threshold, at \(1.5\cdot \lambda _c\) (with numerical results for these shown in the Supplementary Information).

We study seven classical centrality indicators and their combinations, as follows.

  • Local metrics, simple to compute, reflect the density of a node’s neighbourhood: the degree, neighbourhood (the sum of the degrees of direct neighbours), and two-hop neighbourhood (the sum of the degrees of neighbours exactly two hops away).

  • The core number results from k-shell decomposition.

  • Distance-based centralities, such as closeness and betweenness, reflect the importance of nodes by their link distances in the network. Of these two popular centralities, in prior studies on the SIR model, betweenness showed weak predictiveness both as a ranker of nodes in large networks5,7 and also in combinations with other centralities on small networks30. We thus study here the closeness centrality.

  • Normalised spectral centralities: PageRank and eigenvector centrality.

The predictive power of single centralities is inconsistent across networks

We first show that the ability of any one centrality indicator to predict the top spreaders across a large number of network cases is too variable to be of universal practical use. Take a network of N nodes, f a fraction, and the task of selecting the best fN spreaders in the network. The standard ranking method has each centrality rank the nodes in this network; the top fN nodes by this ranking are put forward as best spreaders5,6,7,8,9 (see Methods). The predictive power of the degree centrality is shown in Fig. 2, across all networks, at the epidemic threshold. This is measured via the recognition rate (also called recall) r(f): the fraction of correctly identified top spreaders (Eq. 1 in Methods); the 95% confidence interval around r(f) is shown as a shaded area. In Fig. 2, for each of the three categories of networks with lowest recognition rates at \(f=20{\%}\), the worst-case network is named. The degree-influence scatterplots, also in Fig. 2, show the reason: a correlation between degree and influence does exist even in these worst cases, but with too wide a variance of influence per degree for accurate ranking.

Figure 2
figure 2

(left) The recognition rate by degree, across f, for all networks, at the SIR epidemic threshold. Each data line corresponds to a network, with the 95% confidence interval shown as a shaded, partly transparent area. The network categories are Ca (Coauthorship, 6 networks), Cm (Communication between people, 11 networks), Cp (Computer, 11 networks), HS (offline Human Social, 5 networks), In (Infrastructure, 4 networks), S (online Social, 23 networks). (right) Degree-influence scatterplots for three of the worst-case networks.

Compared to the degree, the performance of the core number as a ranker is much less consistent across networks (Fig. 3). The same cause holds for the three worst-case networks marked in the figure: all have few k-shells (between 1 and 5), so the core number by itself it not a discriminative variable for a ranking task. In the very worst case (as in the case of Gnutella25), the network has a single k-shell, so predicting the top spreaders by ranking the nodes in the network is the same as doing a random draw. In Fig. 3, three more networks are marked, for which ranking by core number gives good recognition rates at \(f=20{\%}\), but poor rates when \(f<5{\%}\). The scatterplots between core number and influence show the cause. The nodes with the highest core number in the Twitter Stanford network are poor spreaders; a topological reason for this was found in a prior study focused on the core number8: the most effective core in the network depends not only on its core number, but also on its connectivity to other cores. Even in other topologies, in which high core numbers do correlate with wide spreading (as is the case for Twitch ES and US Airports), the highest core contains many nodes of very variable influence, so the core number alone is not a sufficiently discriminative variable when f is low.

Figure 3
figure 3

As Fig. 2, but with the core number as the ranker.

Neither the degree centrality nor the core number are universally better than the other across the network space. If the core number can be a more accurate ranker in some cases (Fig. 3 shows values of r(f) closer to 1 for the core number, as was also found in prior studies on selected topologies5,6), it is also a poor predictor in absolute terms when \(f<5{\%}\) for many networks, and also across all f values when the network doesn’t have a strong core structure. For online human networks (categories Ca, Cm, and S in this study), and with \(f>5{\%}\), Figs. 2 and 3 show the two centralities to be comparable, with the core number marginally better. In general, as recognised before7,8,9, the predictive power of the core number is not consistently better than the degree centrality for SIR influence.

Another popular ranker, the eigenvector centrality was previously found (on average across a set of networks) more predictive than the core number9. By the summary in Fig. 4, this is the case for low values of f, but there is still a wide variance between networks. In some cases (such as Gnutella24 and Euroroad, marked in the figure), the distribution of centrality values is such that ranking is not better than a random draw; in others, such as Adolescent40, there is little correlation between the centrality and influence, so the ranking remains poor. In the best of cases (for two of which scatterplots are shown in the figure), this correlation is strong, which explains why the eigenvector centrality can be a very good predictor across the range of f.

Figure 4
figure 4

As Fig. 2, but with the eigenvector centrality as the ranker.

A second performance metric is also of interest: the precision function p(f) (Eq. 1 in Methods), which compares the SIR influence of the predicted nodes with the SIR influence of the correct top spreaders. A p(f) value close to 1 for a prediction task means that, regardless whether or not the exact top spreaders were identified, the influence of the nodes which were identified is close to that of the set of top spreaders—so p(f) does not penalise node substitutions, if the substitutes are similar in terms of influence. For ranking by single centralities, the results for both the recognition rate and the precision function are shown in Fig. 5. Each data point marks the performance of a ranking task, over a given network, for a value of f in \(1, 2, \ldots 20{\%}\). (To make the data points visible despite many partial overlaps, each data point is a horizontal line; this line does not denote the uncertainty of the data, but is of fixed size.) The centroid of each data cloud summarises the performance of that centrality over this set of networks. Overall, the neighbourhood centrality makes for the best single ranker, with an average recognition rate of 0.804 and an average precision function of 0.962. The two-hop neighbourhood (not shown in the figure) is only slightly worse (on average 0.781 and 0.942, respectively). PageRank is the least accurate, with an average recognition rate of 0.487, and an average precision function of 0.727. This latter result is not entirely surprising: although widely used for ranking nodes in network structures32, PageRank was found before to not be a competitive predictor for measured diffusion in various networks6,9.

Figure 5
figure 5

The success of single-centrality ranking at predicting spreaders, across all networks and values of f, at the SIR epidemic threshold. The scales are quadratic. Each data point (a horizontal line of fixed size) denotes a prediction task, and the colour shows the category of the network (listed in Table 1 in Methods). The centroid of the point cluster and the standard deviation on both axes are marked with a solid dot and lines. The point of perfect scores (1,1) is also marked with a half circle. The neighbourhood centrality is the best overall single ranker, with an average precision function of 0.962 and an average recognition rate of 0.804.

Next, we show that certain pairs of centrality indicators have, together, sufficient topological information about network nodes to improve the accuracy of the prediction tasks.

Pairs of centralities combine into better predictors

A statistical classifier is now trained with multi-variate data from part of the nodes in each network. The result is one trained classifier per network and fraction f. For training, a centrality is one input feature. The target variable (or class) is binary, and it shows whether or not a node is in the top fraction f in the network by spread size. The two performance metrics for the classifiers are the same as for ranking tasks, with the difference that the recall r(f) is now improved as the F1 score, which is the harmonic mean between the precision of classification and the recall (for motivation, see Methods, Eq. 2).

Parsimonious statistical models are beneficial to gain clear intuition about the results. We report here the most interpretable statistical models which have good performance: support-vector machine (SVM) with second-degree polynomials as kernels (see Methods), whose decision boundaries between classes are simple to understand. We verified that other, higher-variance statistical models based on decision trees have similar performance (with numerical results for Random Forests shown in the Supplementary Information). We start with training SVM classifiers with two centralities, and show that, for certain network examples, certain pairs of centralities build on each other’s strengths and obtain predictive models that are significantly better than either centrality alone.

Combinations with the eigenvector centrality

We show four network examples in Fig. 6. For each network, the left panel maps the distribution of the spread size at the epidemic threshold for all the nodes in the network, against the pairing of the eigencentrality with a neighbourhood indicator. The right panel notes a value for f, and colours the nodes according to their true class: the red nodes are the top f by spread size. Also in the right panel, two dotted lines show the decision boundaries made by the corresponding single-centrality rankers. If \(f=1{\%}\), these boundaries are the 99th percentiles for either centrality; a ranker will predict as top spreaders all nodes above this boundary. These ranking boundaries are improved upon by the classifier, whose decision boundary is shown as the transition between background colours, with a blue (or darker) background showing the centrality space where the top spreaders are predicted to be. (Note that only part of this centrality space may be occupied by nodes; in other words, not every combination of centrality values may be physically possible.) The optimal decision boundary would leave no nodes misclassified and would lead to values of 1 for both the precision function and the recall or F1 score.

Figure 6
figure 6

Network examples for which eigenvector centrality combined with another centrality improves the predictions of single-centrality rankers. In every left panel, a scatterplot of node centralities versus spread size. In every right panel, the top spreaders are coloured in red (or darker), the decision boundaries for rankers using either centrality are dotted lines, and the background colour shows the decision boundaries for the classifiers: a blue (or darker) background denotes the area predicted for top spreaders.

There are clear commonalities among the improved decision boundaries in Fig. 6: for Facebook Artists, Brightkite, and Arxiv GRQC, the joint increase in the values of both centralities in the pair is what determines an effective spreader. For Facebook Artists and Brightkite (both relatively large networks of over 50,000 nodes), ranking the nodes by only one centrality would place some nodes in the wrong class; unlike this, the two-centrality classifier (F1 scores of 0.920 and 0.924, respectively) draws a decision boundary that is much closer to optimal. We illustrated the intuition behind the Arxiv GRQC result (F1 score 0.900) in Fig. 1: the size of the local neighbourhood does affect the spreading ability of nodes, but proximity to the ‘hub’ of high eigencentrality also helps.

There are also exceptions from this. The US Power Grid network (4941 nodes) shown in the same figure has an outlying cluster of low-eigencentrality nodes as top spreaders, while the lesser spreaders instead follow the expected trend described above. Supplementary Figure S1 shows the cause: a small hub of high eigencentrality values lies at a periphery of the network, while a larger region of nodes with large neighbourhoods (but low eigencentrality) is located far apart. It is the latter, larger region which enables the top 1% of the spreaders, and the classifier is able to learn this pattern slightly better, with a 0.162 increase (F1 score 0.509) compared to the r(f) of ranking by the two-hop neighbourhood alone.

Combinations with the core number

A similar intuition holds when pairing the core number with eigenvector centrality, and also with neighbourhood centralities. (Other pairings with the core number are less effective.) We show two examples in Fig. 7. Again it is the joint increase in both centralities which enables superspreading. For Facebook Politicians (F1 score 0.894), Fig. 7 (bottom) also illustrates the intuition. A number of dense cores are distributed in the network, with the highest core numbers not in close proximity, but isolated by regions of low density. On the other hand, a single region of high eigencentrality exists, and the top 5% of spreaders are located exactly in those cores of highest eigencentrality. Interestingly, pairing the core number with a neighbourhood centrality (GooglePlus, F1 score 0.968) also shows that not all the nodes in dense cores are equally good spreaders, and that their neighbourhood size can help to make a selection.

Figure 7
figure 7

As Fig. 6, for core number combined with another centrality.

Combinations with closeness

Closeness also plays a role similar to the eigencentrality—that of guiding the selection of nodes away from more peripheral nodes with dense neighbourhoods, towards the centre of the network, with an increase in performance. Figure 8 shows two examples. In the Adolescent41 offline social network (1,640 nodes), the best ranker is that by neighbourhood (\(r(f)=0.469\)), but when considering also closeness, the F1 score rises to 0.598. On the topology of the network (at the bottom of the same figure), closeness values identify only very few of the top spreaders, while the neighbourhood size identifies more; the correct top spreaders, however, again lie in a region where both centralities jointly have high values. In the Gnutella05 computer network, for a similar reason, the best ranker is instead closeness (\(r(f)=0.594\)), but when considering also the two-hop neighbourhood, the F1 score rises to 0.725.

Figure 8
figure 8

As Fig. 6, for closeness combined with another centrality.

In the examples from Figs. 6, 7 and 8, each classifier’s decision boundary improves upon the decision boundary of the best ranker such that r(f) is raised by between 0.090 and 0.213. Among our 60 test cases, we also found other examples of networks, combined with certain values for f, for which the single-centrality rankers could not be improved by any classifier. For example, only when \(f=1{\%}\), none of the five Adolescent networks is resolved any better by using two centralities—but also there the performance improves when f increases.

From all pairs of centralities, the combination of two-hop neighbourhood and core number has the best average F1 score (0.865) across all the network cases in this study, and across the range of f. On the other hand, the combination of two-hop neighbourhood and eigenvector has the best average precision function (0.992). Figure 9 is a summary for the averages of both performance scores across all single centralities (on the diagonal) and pairs of centralities (the rest of the matrix). All possible pairs of centralities are studied, except for the redundant combinations between degree and neighbourhood, and between the two types of neighbourhood centralities. The six pairs which improve significantly on the most predictive ranker are all composed of one of the neighbourhood centralities, and one of: core number, eigenvector centrality, closeness, or PageRank. These six pairs improve on both recall and precision function.

Figure 9
figure 9

The success of single and pairs of centralities at predicting spreaders: for each pair of centralities, the average performance score across all networks and values of f. The diagonal is the result of ranking by a single centrality and it is scored by the recognition rate and the precision function. The rest of the matrix is the result of classification by two centralities and is scored by the F1 score and the precision function.

Multi-centrality predictors and summary of results

While the previous subsection demonstrated that centrality indicators can play on each others’ strengths and improve the prediction of top spreaders by the SIR diffusion model at the critical threshold, we now show that classifiers using all seven centralities as features give near-perfect prediction on most network examples. One exception is that of offline human social networks (the HS network category) and only at very low fractions f. This category contains networks that are not structurally unusual, but are some of the smallest networks in the study, which leads to very few training data points, thus lower classification performance.

We train a seven-centrality SVM classifier for each prediction task, and summarise the results in Fig. 10. The centroid of all prediction scores (Fig. 10, left) is an average recognition rate of 0.921, and an average precision function of 0.995. While the precision function was almost as high (0.992) when training the classifier using only the eigenvector centrality and the two-hop neighbourhood as features (Fig. 9), the average recognition rate is now further improved by adding more features to the statistical model. Not all six network categories are equal: a breakdown of the scores by network category and by the value of the fraction f (Fig. 10, right) shows that recognising the top 1% of spreaders in the Adolescent networks (the HS network category) remains difficult. All other prediction tasks are resolved well, particularly when performance is measured by the precision function, which ranges between 0.969 and 1.

Figure 10
figure 10

The success of classifiers using all centralities at predicting spreaders, across all networks and values of f, at the SIR epidemic threshold. (left) Each data point denotes a prediction task, and the colour shows the category of the network (listed in Table 1). The centroid of the point cluster and the standard deviation on both axes are marked (counterpart to Fig. 5). (right) The average performance scores across all networks in one of six network categories, and across all values of f (counterpart to Fig. 9).

These conclusion hold also above the epidemic threshold, at \(1.5\cdot \lambda _c\); numerical results showing very similar prediction scores are in Supplementary Fig. S2. They are also not an artefact of the type of statistical model used in the classifier. When training nonlinear Random Forest classifiers, which are high-variance so—in general—are able to obtain better performance than the polynomial SVM, a similar conclusion emerges (Supplementary Fig. S3), so there no significant advantage to using higher-variance classifiers.


Insights gained

We showed that two or more classical centrality indicators can contain sufficient statistical information about the nodes in a real-world network to train an accurate supervised predictor of SIR influence, and outperform node rankers. The decision boundaries between the two classes, as learnt by classifiers, demonstrate where the advantage of multi-variate prediction comes from: certain centrality indicators are particularly good complements to others. Notably, there are multiple answers to the question: what is a good pair of centralities? For the degree centrality, the best complement is the eigenvector centrality. For the neighbourhood centrality (the best overall single ranker), three other centralities make good complements: the eigenvector centrality, closeness, and core number (with PageRank also close). For those network cases where multi-variate prediction has an advantage, the joint distribution of the centralities and the SIR influence is such that one centrality (or, a one-dimensional decision boundary) is insufficient to classify the nodes accurately, but a multi-dimensional decision boundary is able to refine the decision in the most important region of centrality values. When the entire set of classical centralities are used, the prediction performance is close to optimal (to an average recognition rate of 0.921, and an average precision function of 0.995).

We showed the topological intuition behind this improvement in the prediction of superspreaders. Often, when a subset of the top nodes by local centrality indicators are located in more peripheral regions of the network, global centrality indicators step in and act as a selector and guide towards the effective centre of the network, so that the nodes selected jointly maximise the values of both centralities. In exceptional topologies, when the global centrality has high values at a peripheral location (such as US Power Grid, in Supplementary Fig. S1), the roles reverse: the local centrality becomes the selector, and the statistical model learns that high global centrality values are not beneficial.

Practical use, assumptions, and limitations

The basic insight of jointly maximising the values of two or more centralities can help improve existing, unsupervised node ranking methods. The advantage of ranking algorithms is that they are unsupervised, i.e., require no ground truth; their disadvantage is lower recall and precision.

Network practitioners can also use supervised classification as presented here, and train a new classifier on a new network. While this method delivers good predictions, it assumes (a) complete knowledge of the network links, and (b) means to estimate the spread size for a fraction of the network nodes. If historical diffusion data is available (such as the number of retweets on Twitter), this data replaces the need to simulate a theoretical diffusion model in order to obtain ground truth for the spread size. Only a fraction of nodes need ground truth data, since the statistical classifier is trained on a random sample of the nodes in the network, and will predict the class for the others. The size of the training data necessary to obtain good predictions depends on the network and on the distributions of centrality and influence values, but is expected to be small. In Supplementary Fig. S4, we measure the required training set size from the learning curves of three of the largest networks in this study. These show that, to obtain maximum performance, some networks only require a training data size of 1% of the network size, while others need around 10%. The set of centralities to use as features can be tailored to the computational budget available. The type of statistical model can also be tailored with the network size: heuristic training algorithms, such as those training Random Forest classifiers, scale better with large networks.

Future work

There are follow-ups to explore as continuations of this study, at the intersection between real-world network dynamics and machine learning. A method to train a single statistical model for predicting superspreaders across networks is desirable, as long as its performance remains good; this was previously achieved only for small networks30. An unsupervised or semi-supervised learning method (for example, based on clustering nodes using the same centrality indicators as features, such as in the related work33 from the domain of natural-language processing) would lower the computational load required to estimate the spread size of many nodes. Other directions include the prediction of other measures of node influence (such as the measured diffusion of information in large online social networks6) and of node importance (such as the ability of a node to block the diffusion of information), and the study of other types of networks (such as different network categories, networks with node and link attributes, and networks with dynamic structure).


Networks, centrality indicators, and the estimation of node influence

Most of our network case studies (see Table 1 for the overview) model entire communities at a specific point in time. This is the case for the high-school friendships in the Adolescent networks, the daily Gnutella peer-to-peer file sharing networks, the five sets of institutional email exchanges, or the networks of mutual likes between verified Facebook pages. A minority of the networks (such as the Facebook Stanford friendships, collected from survey participants) are instead bounded samples from a larger community. All are (transformed into) directed, strongly connected, and unweighted networks; when the original version in the repository had timestamp, attribute, or weight annotations, these were removed. The direction of the edges is reversed when needed, to model information flow—so the degree centrality of interest is the out-degree. To be able to study the closeness centrality34 which computes the lengths of shortest paths, only the largest strongly connected component (SCC) was kept. These networks were selected from public repositories such that (a) they fit into these six categories, and (b) have the size of their SCC above 1,000 nodes. The upper bound on network size is simply imposed by finite computing resources.

Table 1 The 60 case studies.

The following centrality indicators were computed for every node in every network: its degree, neighbourhood (i.e., the sum of the degrees of the nearest neighbours, previously denoted \(k_{ sum }\) and found to be a competitive predictor in a previous study6), two-hop neighbourhood (as before6 for nearest neighbours exactly two hops away and previously denoted \(k_{ 2sum }\)), PageRank34 with a 0.85 damping factor, eigenvector centrality34, closeness centrality34, and core number5. An additional set of indicators that we tried, the link strength of a node towards upper, equal, or lower shells8, denoted \(r^u, r^e\), or \(r^l\), did not provide notable results.

The ultimate influence of a node in a network is estimated numerically, as the average among \(10^4\) runs of the susceptible-infectious-recovered (SIR)4 diffusion model for infectious diseases. In SIR, an infectious node infects a susceptible neighbour at a rate \(\beta \) (meaning the number of infection events per time unit, so can be higher than 1). An infectious node recovers at a rate \(\mu \). The effective transmission rate is \(\lambda =\beta /\mu \). Here, we take \(\mu =1\) and study the normalized rate \(\lambda \).

As \(\lambda \) increases in SIR simulations, the size of the outbreaks increase from an infinitesimal fraction to a finite fraction of the network size. The regime of interest is neither very low \(\lambda \) values (in which case, the diffusion remains localised to the neighbourhood of the seed node) nor very high (in which case, all nodes should reach a large fraction of the network). Since our test cases are both finite in size, and diverse (a scenario studied previously39), we estimate the epidemic threshold \(\lambda _c\) numerically by identifying it with the variability measure39 \(\Delta = \frac{\sqrt{\langle \rho ^2\rangle - \langle \rho \rangle ^2}}{\langle \rho \rangle }\). Here, \(\rho \) denotes the random variable of outbreak size from different seed nodes, and \(\langle \cdot \rangle \) denotes the mean. Given a value for \(\lambda \), \(\Delta \) is estimated by setting seed nodes from a random sample of \(10^4\) of the nodes in a network (or the entire network size, if this is smaller). After estimating \(\Delta \) for a range of \(\lambda \) values at regularly spaced intervals, we take \(\lambda _c\) to be the position of the peak of \(\Delta \). The resulting values are noted in Table 1. The maximum spread size (influence) at \(\lambda _c\) in any network is between 0.7% and 6% of the network size (with two exceptions among the smallest infrastructure networks, where this reaches 8% and 11%).

Ranking by a single centrality


We first predict superspreaders using the single-centrality ranking method common in prior studies5,6,7,8,9, and also carry forward the performance metrics defined in these studies. This ranking method builds the assumption that higher centrality values for a node will also indicate higher node influence. Given a centrality C, first all the nodes have their values for C computed. The top fraction f of spreaders is then predicted to be the fraction f of nodes with the highest values for C. At ties between nodes (which occur for discrete-valued centralities such as degree and core number) a random subset of the tied nodes are selected. This random sampling is then repeated \(10^2\) times for a bootstrap technique (described below), which averages among the scores of these individual random choices.

Performance metrics

In prior studies, this ranking is evaluated via two metrics. Denote by \(I_f\) the set of the top fraction f of nodes as ranked by their SIR influence, and by \(C_f\) the set of top fraction f of nodes as ranked by their centrality values; the sizes of these sets are equal for a given f, \(\left| I_f\right| = \left| C_f\right| \). Also denote by \(\rho _i\) the spread size when setting node i as seed. The recognition rate r(f) measures the extent to which the identities of the predicted superspreaders match the true identities6. A synonym for the recognition rate is recall. The precision function p(f) is a weaker, but more practically useful performance measure comparing the spread of the predicted superspreaders to that of the true top spreaders:

$$\begin{aligned} r(f) = \frac{\left| I_f \cap C_f \right| }{\left| I_f \right| } \qquad \mathrm{and} \qquad p(f) = \frac{\mathrm{avg}_{i\in C_f} \rho _i}{\mathrm{avg}_{i\in I_f} \rho _i} \end{aligned}$$

Both metrics take values in the interval [0, 1]. An imprecision function \(\epsilon (f)\) was defined previously5, such that lower values of \(\epsilon (f)\) are better. Here, to present the two metrics in a unified fashion, we use instead \(p(f) = 1-\epsilon (f)\), such that higher values are better for both r(f) and p(f). A confidence interval was originally provided for r(f) by bootstrap6. Here, we apply a bootstrap technique when estimating both metrics. Given a network of N nodes, \(10^2\) times, we draw a random sample of the N nodes uniformly with replacement. Among these nodes, the ranking method is applied and a prediction is made and evaluated via either r(f) or p(f), as needed. The final value for each performance metric is the average, together with the 95% confidence interval among these samples.

Classification by a combination of centralities


A multi-centrality method learns a discriminative statistical model able to classify network nodes into superspreaders or not. For this, a dataset is formed for every network; a record describes a node via its centrality values (the predictors). When training the model to recognise the top fraction f of the nodes, the nodes are ranked by their true SIR spread size, and each node is assigned one of two target classes based on whether or not they are in the top fraction f. The model is trained and tuned on a training fraction \(t=0.5\) of the nodes (sampled randomly without replacement), and tested on the remaining nodes.

A binary statistical classifier learns a decision boundary between the classes. We use a support-vector machine (SVM)40, which learns optimal separating hyperplanes in the multi-dimensional predictor space, including in cases where the classes overlap in this space. Here, the optimal decision boundary is that which leaves the largest margin in space between the classes, with still allowing some data points to fall on the wrong side of the boundary. SVMs have advantages: (a) they are optimal learners rather than heuristics, and (b) the kernel function K and the regularisation parameter C, which ultimately give the shape and variance of the boundary41, are tunable hyperparameters.

We aim to obtain the simplest, most interpretable classifier with good performance; higher-variance classifiers bring little performance advantages for this problem, and may lose in interpretability. The results presented are for second-degree polynomials K (which gives a low-variance model, less prone to overfitting), C tuned in the range [1, 100] with five-fold cross-validation, and a fixed tolerance for the stopping criterion42 of 5e-4. No class weights are added to balance the classes artificially. (We tested other, higher-variance statistical models: SVMs with third-degree polynomials for K, and nonlinear models based on decision trees, either boosted or in ensembles43; since they had similar performance to the SVM with a second-degree polynomial for kernel, we retain and present the results for the latter.) We show the decision boundaries learnt by two-centrality models via plotting them in the predictor space.

Performance metrics

For a network of size N and the fraction f, a classifier produces a guess for the class of each network node in the test set. We port the same notation \(C_f\) to mean here the set of nodes classified as top spreaders. The number of superspreaders predicted in this way is decided by the classifier, and may not equal fN. We measure the overlap between the classifier prediction and the ground truth with metrics similar to Eq. 1. In binary classification, the measure r(f) as defined in Eq. 1 is called recall or sensitivity. It is a useful metric, but insufficient to characterise the classifier: alongside making many correct choices (giving a high true positive rate, \(\left| I_f \cap C_f \right| \)), the classifier may also add many false positives. The precision metric helps to quantify the false positives, and a classical metric is the combination of recall and precision in their harmonic mean, the F1 score44:

$$\begin{aligned} \mathrm{recall}(f) = r(f) = \frac{\left| I_f \cap C_f \right| }{\left| I_f \right| } \qquad \mathrm{precision}(f) = \frac{\left| I_f \cap C_f \right| }{\left| C_f \right| } \qquad \mathrm{F1\,score}(f) = \frac{2}{\mathrm{recall}(f)^{-1} + \mathrm{precision}(f)^{-1}} \end{aligned}$$

Note that precision is an established name in the area of Information Retrieval44, while the imprecision function \(\epsilon (f)\) which gave the precision function p(f) was defined recently5 for analysing networks. Although the names are unfortunately too similar, their meaning is different and should not be confused.

The F1 score takes values in the interval [0, 1]. We apply to the classifier the second metric, the precision function p(f), exactly as it is defined in Eq. 1. Its values can exceed 1.0, in cases when the classifier predicts fewer than fN superspreaders, and they are on average better than the true fN superspreaders; we cap higher values to 1.0. We estimate both F1 score and p(f) by randomly drawing different training sets for the classifier (the same training fractions t of the nodes) \(10^2\) times, then training and testing the classifier on each draw. The final value for each performance metric is the average of the individual scores.