Sublinear domination and core–periphery networks

In this paper we devise a generative random network model with core–periphery properties whose core nodes act as sublinear dominators, that is, if the network has n nodes, the core has size o(n) and dominates the entire network. We show that instances generated by this model exhibit power law degree distributions, and incorporates small-world phenomena. We also fit our model in a variety of real-world networks.

Basic definitions. We say that a subset S of the vertex set of a graph is a κ almost dominating set ( κ-ADS) if the set S dominates at least κn of the nodes present in the graph. In other words, at least κn of the nodes of the graph have a neighbor in S.
We say that an event E(n) holds with high probability

Sublinear domination and core-periphery structure
We show that the core of the network consists, as one should expect, from a sublinear number of nodes located at the top levels of the tree. To observe this phenomenon, we calculate the probability q hτ of a node at height h not being dominated by any node between levels 0 and τ where τ ≤ h , which equals where a b denotes that there exists a constant C > 0 independent of b such that a ≤ C · b (i.e. inequality up to a constant factor), and the first inequality holds since 1 − t ≤ e −t for all t ∈ R . Note that q hτ does not depend on the height of the node in question, as long as τ ≤ h . Now the probability that there is at least one node uncovered below level τ + 1 is given by Markov's Inequality and is at most To assert an w.h.p. guarantee we force the above probability to be �(b −H ) , therefore, solving for τ we arrive at a dominating set of size with probability at least 1 − �(b −H ) . Consequently, a sublinear fraction of nodes C = {v : h(v) ≤ τ } located on a logarithmic height τ from the root of the skeleton tree T dominate the whole periphery P = {v : h(v) ≥ τ + 1} with τ = �(log H).

Degree distribution
To fit the model to real-world data, we infer the degree distribution of IGAM. The average degree of a node u at level h is and the total expected number of edges at level h is m h = b hd h = �(b h+H+1 /c h+1 ) . The asymptotics of the previous equation yield a power law with exponent If the rank of u, with h(u) = h is given as r h = c h , which is an increasing function of h, then the expected degree depends on the inverse rank 1/r h , yielding a Zipfian power law. The trials for connecting every node are independent Bernoulli variables, and therefore by the multiplicative Chernoff bound with probability at least 1 − �(b −H ) we have that the average degree at height h, d h is �(1/r h ) ± O( H log b/(2b h )) . By a union bound, we have that Thus, with probability 1 − O(Hb −H ) (i.e. w.h.p.) the degree histogram follows Zipf 's Law. The exponent of the degree distribution can be altered, if the same model is generated with parameters b ′ = b α , c ′ = c α for some α ≥ 0 . The expected number of edges m is given as and is superlinear with respect to the number of nodes.

Insights from data
We describe a fitting algorithm for the IGAM model (Algorithm 1). The fitting process considers of being given a sample of m edges D = {e i } 1≤i≤m on a network of n nodes where n is known. Our goal is to find the optimal fanout b ⋆ , the optimal height function h ⋆ and the optimal scale factor c ⋆ that maximize the log-likelihood of the model, that is where the likelihood equals Directly optimizing the likelihood is very hard since there are O(n) possible fanouts, each fanout can generate an exponential number of possible trees, and thus height functions, and given the fanout and the height function the remaining problem consists of finding the optimal c that explains ℓ(c|D, b, h).
1{u ∈ e i } , and then order the nodes on decreasing order of their sample degrees. After that, we fix a fanout b from the interval {2, . . . , n − 1} , and according to that fanout we start by attributing heights of a hypothetical b-ary tree on the nodes according to their descending order. For example, for b = 2 the first node gets a height of 0, the next two a height of 1, and so on. Then, for each height 0 ≤ h ≤ ⌈log n/ log b⌉ , we form the logdegrees z h = log u:h(u)=hȳ u , and fit linear-least-squares with x-values being the range of heights and y-values being the log-degrees z h . The optimal slope a yields c to be c = b · e −a . If c ≥ b then the current fit is rejected. We can then calculate the likelihood function ℓ and keep the best parameters (b ⋆ , h ⋆ , c ⋆ ) . Each step is dominated by the calculation of the likelihood that costs O(n 2 ) time, since exactly computing the log-likelihood requires summing over all pairs of nodes (regardless of whether an edge exists or not), and thus the total complexity is O(n 3 ) . Note that since the values of f(u, v) are small (i.e. close to 0) and real-world networks are sparse (i.e. m is of the order of n or n log n ) the log-likelihood can be approximated in time O(m) by ignoring the networkindependent term, i.e. the term that sums on V × V , which yields an algorithm with runtime O(nm) instead of O(n 3 ) . If a full b-ary tree does not cover the network, we allow the last level to be incomplete.
We fit the IGAM model to networks examined in Ref. 9 . More specifically, we examine the world-trade network from Ref. 24 Figure 1 presents the (total) degree distribution fits for the IGAM model, where the parameters b, c and the height function have been determined. Observe, that the total degree at each constructed level is linearly correlated ( R 2 ≥ 0.93 except for the airports dataset) with the coreness value of each group of nodes (per level). Moreover, in Fig. 2 we do a log-log plot between the construction of the dominating set as in Section 3 and the construction of the dominating set using the maximum coverage greedy algorithm. The former algorithm treats the nodes as IGAM would do in the construction of the dominating set, i.e. by traversing the levels of the hierarchy from top to bottom. The latter algorithm picks the node with the largest active degree at each step, adds it to the set, and removes itself and all the nodes connected to it from the network up to a certain number of steps or if there are no more nodes left. Markers on the plot represent subsequent iterations of both algorithms. We observe almost perfect correlation between the two algorithms and slightly superlinear relations of the form y ∝ x γ for γ ∈ [1, 1.21] , which is a phenomenon that we should not expect in more general networks, since choosing the nodes with the highest degrees shall not yield good coverage in general. Moreover, note that a sublinear number of iterations, denoted by the number of x markers outside the [1.9, 2.0] 2 box (the mapping is increasing), suffices to dominate 10 1.9 % ≈ 80% of the nodes. A visualization of the IGAM fitting process can be found in Fig. 7  www.nature.com/scientificreports/ Qualitative insights. In this Section, we highlight the following structures that emerge from fitting the IGAM model to the real world data. We examine the first three levels of the hierarchy, devised by the height function h, for the datasets that contain labeled nodes. The analytical form of the core nodes can be found in the "Methods" section.
1. Faculty networks In the faculty networks of each one of the three disciplines (computer science, history, and business) the core consisted of nodes referring to highly ranked universities in the United States (in each discipline), as well as an (aggregate) node referring to faculty coming from all non-US academic institutions.
To elaborate, the cs-faculty network contains MIT, CMU, Stanford, UT Austin, Purdue, and UIUC in its core, together with the aggregate node. The history-faculty core consists, for instance, of Harvard, Yale, University of Chicago, Columbia, Stanford, Johns Hopkins, and Cornell. Finally, the business-faculty network has, for instance, the University of Michigan, UT Austin, Penn State, and the University of Pennsylvania at its core. These findings are consistent with the body of research on faculty hiring networks 25,29 where it is stated that, for the computer science discipline, a very small percentage (9%) of departments is responsible for 50% of academic hires in faculty position.

Open-airlines
The open-airlines network has a core that consists of very large and central international airports such as AMS, FRA, CDG, IST, MUC, ATL, and PEK. 3. World-trade The world-trade dataset contains data about the trade of metals among 80 countries in 1994.
The nodes represent countries who have available entries in the Commodity Trade Statistics released by the United Nations. In this network the core consists of, for instance, from Finland, Hungary, Slovenia, Singapore, Chile, and so on. 4. London-underground In the London-underground dataset, we recover a core that consists of busy train stations such as Bank, Baker Street, Canning Town, and so on, all of which are cardinal to the British underground system.
Relation to logistic core-periphery models. We compare the IGAM model with two logistic models introduced by Jia and Benson 8 and Tudisco and Higham 19 . In detail, we fit both models and give empirical answers to the following question: Are the logistic core-periphery models able to explain the domination structure of core-periphery networks? Intuitively what this model describes is that a node with θ v ≥ 0 is considered to be a core node and a node with θ v < 0 to be a peripheral node. That is, for a pair (u, v) if both nodes are peripheral, i.e. have θ u < 0 and θ v < 0 , then the link probability ρ(u, v) is less than the case when one of θ u , θ v is non-negative that represents a core-periphery link. Similarly, when both θ u ≥ 0 and θ v ≥ 0 , which corresponds to a core-core node, then the edge creation law attributes a larger connection probability. When spatial features x : V → R d are provided, as well as a kernel function K(u, v) (for example, K(u, v) = �x u − x v � 2 ), and a hyperparameter ε , then (Logistic-JB) is generalized to an edge law  18 (x-axis) and selecting nodes according to their hierarchy, i.e. in order of descending initial degree (y-axis). The slope γ and R 2 of linear fits are reported. The rule that selects nodes based on their prestige h yields very close results to the greedy maximum coverage algorithm. In general instances, these two algorithms are expected to have different results, since the former algorithm may select prestigious nodes whose neighborhoods have large overlaps which may not yield good coverage in general. However, specifically in core-periphery networks, high prestige nodes seem to have small overlaps, which justifies the good performance of the prestige-based algorithm. Source code to reproduce the Figure can be found in Ref. 33  www.nature.com/scientificreports/ The model of Tudisco and Higham 19 is based on a logistic probability law determined by a ranking π of the nodes. The more prestigious a node v is the higher the value π v is. The edge creation law is given by where σ s,t = 1/(1 + e −s(x−t) ) is the smooth approximation of the Heaviside step function H t (x) that is 1 if x ≥ t and 0 otherwise. We use s = 10 , and t = 1/2 . Again, the model intuitively says that nodes tend to be associated with more prestigious nodes rather with less prestigious nodes. Finally, the authors propose an iterative method to infer the ranking π which has an O(m) per-step cost.
We evaluate how well can Logistic-CP, Logistic-JB and Logistic-TH capture the domination properties of the core-periphery structure compared to IGAM. For the logistic models of Jia and Benson we fit the Logistic-CP model when there are no spatial data available and the Logistic-JB when spatial data are available (i.e. in the c-elegans, open-airlines, and London-underground datasets). We use the optimal parameters θ * v of the logistic models to build a ranking for the nodes by sorting them in decreasing order of the scores θ * v . For the Logistic-TH model, we use the iterative method provided in their paper to infer the ranking by sorting the entries of the fixed point that their algorithm produces. Then, for all models, we report the domination curves in Figs. 2, 5 and 4. To give better visual insights on how the models perform, we visualize the outcome of fitting the models for the c-elegans dataset on Fig. 6 for a core set of size ⌊n 0.7 ⌋ . For each dataset and figure we report the exponent p ∈ [0, 1] of a set that dominates 80% of the network (i.e. and 0.8-ADS). Namely, if a fraction ̟ ∈ [0, 1] suffices to cover at least 80% of the network, then p = log(̟ · n)/ log n.
Key takeaway. The IGAM model can better explain the sublinear domination phenomenon in core-periphery networks than Logistic-JB, Logistic-CP, and Logistic-TH. Also Logistic-CP and Logistic-JB achieve better coverage compared to Logistic-TH. Perhaps the most characteristic are the faculty (cs-faculty, history-faculty, business-faculty) and the world-trade datasets where IGAM produces an almost dominating set with an exponent p ≤ 0.16 whereas Logistic-TH finds a similar set with p ≥ 0.54 , and Logistic-CP finds an 0.8-ADS with p = 0.15 in the case of business-faculty and with p ≥ 0.32 in the rest of the datasets. In the polblogs dataset, IGAM is able to find an 0.8-ADS with p = 0.27 whereas Logistic-CP finds one with p = 0.64 and Logistic-TH finds a much larger one with p = 0.81 . In the open-airlines dataset the 0.8-ADS corresponds to p = 0.61 for IGAM and to p ≥ 0.82 for the logistic methods. Finally, the smallest variation between the methods exhibits the London-underground dataset where p ranges from p = 0.75 (IGAM) to p = 0.85 (Logistic-TH). Concluding, the ADS constructed by IGAM are consistently smaller than the ones produced by Logistic-CP and Logistic-JB which are smaller than the ones produced with Logistic-TH, which suggests that IGAM is able to explain the sublinear domination phenomenon where other logistic models fail to do so.

Miscellaneous properties
Small-world behaviour. To determine the diameter (the diameter of a disconnected network is taken to be the diameter of its giant connected component) of the network, we build an Erdös-Renyi (ER) network W with n nodes and edge probability f * = min u,v f (u, v) = c −H−1 . It follows from a standard coupling argument, i.e. a "toss-by-toss" comparison, that we can relate the two networks as one being subgraph of the other, in our case the ER network W being subgraph of the IGAM network, say G. The coupling is constructed as Then it follows that the diameter of the IGAM network is at most the diameter of W. Using a result from Ref. 30,31 , we have that since the average degree of W is �((b/c) H ) → ∞ as H → ∞ , the diameter of W is close to log n/ log(nf * ) = �(log b/ log(b/c)) = O(1) a.a.s. From that we can deduce that G has a diameter close to O(log b/ log(b/c)) = O(1) a.a.s. This result can follow from intuition also, since all the nodes at a logarithmic height of the root dominate the periphery and a worst case path should roughly be between two peripheral nodes which are connected via a node at the core, with this node being a common dominator of them.

thus the expected total number of closed triangles T C is
The calculation has been deferred to the "Methods" section (Eq. 1). The probability γ uvw of uvw being a triplet (open or closed) is given as

Model generalizations
IGAM2. We fully align with the stochastic blockmodel definition of core-periphery networks presented in Ref. 3 by defining the following generalization of IGAM, which we call IGAM2, parametrized by b > c 2 > c 1 > 1 .
In this context, we start with the same skeleton tree of fanout b and then the law g(u, v) for generating the edges is where 0 < H 0 < H is the core's threshold. The probability g(u, v) of an edge between two nodes with max{h(u), h(v)} ≤ H 0 (i.e. core-core edges) is greater than the probability between two nodes whose heights satisfy min{h(u), h(v)} ≤ H 0 and max{h(u), h(v)} > H 0 (core-periphery edges), which is greater than the probability of the case that min{h(u), h(v)} > H 0 (periphery-periphery edges). Figure 3 presents the adjacency matrix of a sampled IGAM2 network with parameters c = 1.5, c 2 = 2.5, b = 3, H 0 = 2 , and H = 6. We analyze the mathematical properties of IGAM2, which are similar to the properties of IGAM, in the "Methods" section. Most of our proofs are based on a construction of a coupling of an IGAM2 network with two (simple) IGAM networks with parameters (b, c 1 , H) and (b, c 2 , H) . The coupling is constructed such that the three graphs form an ordering based on the subgraph relation.
Directed and continuous versions. The IGAM model has a natural directed extension: for two nodes u, v with heights h(u) and h(v) with h(u) ≤ h(v) we create an edge from u to v with probability ξ(u, v) = c −1−h(v) and a directed edge from v to u with probability ξ(v, u) = c −1−h(u) . This edge creation law corresponds to the following philosophy: a non-famous node wants to connect to a prestigious node and a famous node does not want to connect to a non-famous one but it has better affinity for the nodes near its prestige h(·) . This version of IGAM has also a sublinear dominating set, that is every node in the periphery has a directed edge to at least one node in the core w.h.p. The proof of this fact is identical to the case of the simple model.
In the continuous version of IGAM the height of a node v is allowed to be any real number h(v) ∈ [0, H] and the edge creation law remains the same as the simple version of IGAM. Moreover, similarly to Ref. 19 the edge creation law f(u, v) can be approximated by the limit as δ → −∞ of a law f δ (u, v) that involves the generalized mean of h(u) and h(v), i.e. The model given by ( δ-IGAM) can be treated as the scale-free version of the logistic model of 19 where the reverse ranking is replaced by the height function. The network has n = b H − 1 nodes. If the heights h are latent variables drawn independently from a distribution with Cumulative Density Function (CDF) equal to then we can easily show that the continuous model has a sublinear dominating set by partitioning [0, H] to intervals of the form [t i , t i + t] and generalizing the analysis of the discrete model as t becomes infinitesimal.
Finally, a mathematical and empirical study and a comparison between the extensions of IGAM and logistic core-periphery models are interesting questions to be addressed in future work, and lie beyond the scope and length of the current paper.

Conclusions
The present paper observes a connection between the core-periphery structure of networks and dominating sets. We devise a simple generative model which facilitates this connection and fit it to real-world data validating our observations. We believe it is worthwhile to explore the algorithmic implications of this connection further.

Methods
Reproducibility. Code and data needed to exactly reproduce are provided in the form of a Jupyter notebook and is available here 33  Qualitative results addendum. The analytical results which are briefly presented in "5.1" section can be found below for the first three levels of the hierarchy. Groups enclosed in parentheses correspond to separate levels. In the faculty hiring networks the "All others" node represents all non-US institutions: • World-trade (Finland) (Hungary, Slovenia, Singapore, Chile) (Salvador, Iceland, Kuwait, Rep., Belgium, Poland, Moldava., Austria, Germany, Indonesia, Guatemala, Bolivia, Paraguay, Australia, Africa, Of).
Properties of IGAM2. We describe the mathematical properties of IGAM2. First, we construct a coupling between IGAM2 and IGAM which we can use a proxy for the behaviour of IGAM2. (2) Coupling construction. We consider a randomly generated network G ∼ IGAM2(b, c 1 , c 2 , H 0 , H) ≡ IGAM(b, c 1 , H) for 1 < c 1 ≤ c 2 < b and 0 ≤ H 0 ≤ H with edge law g. We also consider a network G ′ ∼ IGAM2(b, c 2 , c 2 , 0, H) ≡ IGAM(b, c 2 , H) and a network G ′′ ∼ IGAM2(b, c 1 , c 1 , 0, H) with edges law g ′ and g ′′ coupled with G as follows: www.nature.com/scientificreports/ Un d e r t h i s c oup l i n g , w h i c h w e d e n ot e a s ν , w e h av e t h at Moreover, G is a subgraph of G ′′ ( G ⊆ G ′′ ) since every edge of G belongs to the edge set of G ′′ .
Sublinear dominating set. We let (G ′ , G, G ′′ ) ∼ ν . We know that G ′ is generated from a simple IGAM model therefore it has a dominating set   To bound the average number of edges we refer to the coupling ν and deduce that the average number of edges m of G is at most the average number of edges of G ′ , say m ′ = �(b 2H /c H 2 ) (as we showed in the main part of the paper) due to the subgraph relationship. Therefore, the average number of edges is m = O(b 2H /c H 2 ) . A better bound can be obtained by calculating the expected value analytically using the form of d h we derived above. Namely, We still observe that m = O(b 2H /c H 2 ) . Moreover, using the fact that the edges m ′′ of G ′′ are �(b 2H /c H 1 ) we get, in the same logic, that m ≥m ′′ , and, thus m = �(b 2H /c H 1 ) . Note that setting c 1 = c 2 and H 0 = 0 recovers the result for the simple IGAM model.
Small-world behaviour. We let (G ′ , G, G ′′ ) ∼ ν . Since G ′ ⊆ G , the diameter of G is at most the diameter of G ′ because every path between two nodes in G ′ is a path in G. Since the diameter of G ′ is close to �(log b/ log(b/c 2 )) = O(1) a.a.s., then the diameter of G is also close to O(1) a.a.s.
. Therefore, the number of triangles T C (respectively T ′ C for G ′ and T ′′ The probability γ uvw of uvw being a triplet in G (respectively γ ′ uvw in G ′ and γ ′′ uvw in G ′′ ) satisfies 3c . The expected number of triplets is denoted by for G ′′ ) can be found by using Eq. 2. If we execute the sum mutatis mutandis, we arrive at the fact t h a t �(b 3H /c   www.nature.com/scientificreports/ Core-periphery conductance. Let (G ′ , G, G ′′ ) ∼ ν . Let the partition (S τ ,S τ ) be at level τ , i.e. all nodes with height h ≤ τ and the periphery S with h ≥ τ . From the subgraph relationship we get that e ′ (S τ ,S τ ) ≤ e(S τ ,S τ ) ≤ e ′′ (S τ ,S τ ) , and subsequently E[e ′ (S τ ,S τ )] ≤ E[e(S τ ,S τ ) ≤ E[e ′′ (S τ ,S τ )] . Thus φ ′ (S τ ) ≤φ(S τ ) ≤φ ′′ (S τ ) . Using the fact about the core periphery conductance we proved for the simple IGAM model, since G ′ , G ′′ are equivalently produced from the simple IGAM model, we get that, on expectation, If we take τ = H 0 = O(log H) to be the core, we can deduce that the core conductance is �(b H /H) as in the case of the simple IGAM model. Data preprocessing. We have ignored directionality in the examined networks and have removed nodes with degree less than or equal to 4 (except in the London-underground network where almost all degrees are very small). The removal of nodes with degree less than or equal to 4 is done (i) to remove outlier nodes and, (ii) to refer to the removal of non-engaged nodes (Supplementary Information).
Ethical standard. The current paper proposes a theoretical model, its properties and fits it to real-world data. Thus, there are no ethical concerns.