Monitoring supply networks from mobile phone data for estimating the systemic risk of an economy

Remarkably little is known about the structure, formation, and dynamics of supply- and production networks that form one foundation of society. Neither the resilience of these networks is known, nor do we have ways to systematically monitor their ongoing change. Systemic risk contributions of individual companies were hitherto not quantifiable since data on supply networks on the firm-level do not exist with the exception of a very few countries. Here we use telecommunication meta data to reconstruct nationwide firm-level supply networks in almost real-time. We find the probability of observing a supply-link, given the existence of a strong communication-link between two companies, to be about 90%. The so reconstructed supply networks allow us to reliably quantify the systemic risk of individual companies and thus obtain an estimate for a country’s economic resilience. We identify about 65 companies, from a broad range of company sizes and from 22 different industry sectors, that could potentially cause massive damages. The method can be used for objectively monitoring change in production processes which might become essential during the green transition.


SI Text 1: Firm sample description
To protect the anonymity of the data providing company and the investigated firms, we cannot disclose the exact number of firms in our firm communication dataset. Here, we compare the composition of our sample with respect to number of firms per sector and aggregate turnover per sector with official numbers retrieved from the national statistics office.
We calculate the share of firms, n s , in a sector, s, as n s = N s /N, where N s is the number of firms in s and N is the total number of firms. Supplementary Figure S1a plots the firm share per sector as reported by the statistics office, n official s , against the share in our sample, n sample s . The two quantities are highly correlated, as can be confirmed by Pearson's correlation coefficient r = 0.84 (p < 10 −17 ) and a Spearman rank correlation of ρ = 0.88 (p < 10 −21 ). We repeat the procedure for the aggregate turnovers, T s , for each sector, s, with T denoting the total turnover and t s = T s /T the turnover share. In SI Fig. S1 we plot the turnover share reported by the statistics office, t official s , against the turnover share in our sample, t sample s . The quantities are still highly correlated, albeit less than the number of firms per sector, with a Pearson correlation of r = 0.74 (p < 10 −11 ) and a Spearman rank correlation of ρ = 0.75 (p < 10 −12 ). . Both the number of firms and the aggregate turnover per sector of our sample correlate strongly with official numbers published by the statistics office. SI Text 2: Calculating the conditional probabilities p(s|c) and p(c|s) To calculate conditional probabilities describing the overlap of the communication and supply layer, we use data from a survey conducted in the country of the FCN that was conducted between April 8 and 20, 2020. The survey asked questions on the general business results of the past half year, the outlook for the near future and how the firms expected to be affected by the COVID-19 crisis in the near future. Additionally, the survey contained a part asking for the ten most critical suppliers and ten most critical customers. From this survey we construct a supply network, that we can compare with the FCN. The survey was conducted in cooperation with a large business representation organization, that has very high coverage and represents firms across all sectors except agriculture. The business representation organization sent survey invitation links to all its members and provided the online-tool for the survey as well. The survey was sent out to 102,386 companies; more than 5,955 firms replied to the supply network part, declaring more than 17,393 customer-supplier relations. To keep the privacy of the companies, the data is co-anonymized. This means that the metadata is made available to all parties prior to the data collection process and, subsequently, only anonymized data is made available to the researchers.
We quantify the overlap to find a link x i j , x ∈ {c, s}, in one layer when a link y i j , y ∈ {s, c}, in the other layer is present as the conditional probability p(x|y), When comparing the results of the supply chain survey and the firm communication network there is an additional distinction between firms reporting in the survey and firms reported in the survey. Let's denote the buyer-supplier network as the set S containing all links s i j from reporting firm i to reported firm j. This means where R is the set of reporting nodes, M the set of reported (mentioned) nodes and S the set of all nodes in the network. Please note that the sets R and M are not mutually exclusive; a reporting node can also be mentioned by another firm and, hence, be part of both sets. Let's denote the communication network as the set C of links c i j from firm i to firm j, where i, j ∈ C , C being the set of firms in the communication network.
We are interested in the conditional probability of finding a buyer-supplier relationship where there is a communication link. Formally p(s i j |c i j ). To estimate this, we need to perform a fair comparison and stratify for the fact that only a subset of all buyer-supplier relationships is known. Figure SI Fig. S2 illustrates the problem. The central (black) node has reported in the Figure S2. Schematic visualization of how to calculate conditional probabilities using a supply chain survey. By construction only supply links (solid black line) between the reporting (black) and reported (grey) nodes are observable. This means the communication links (broken red lines) need to be limited to links between reporting and reported nodes. survey and mentioned 6 other nodes (grey). We can only observe links between black and grey nodes, not between two grey nodes, so to perform a fair comparison communication links between two grey nodes need to be excluded from the analysis. We defineC in analogy to the buyer-supplier network as the set of communication links from reporting nodes to mentioned nodes, thereby excluding links between i.e. mentioned nodes Now, following Eq. (1) we can write

2/12
To establish indicators for the error of the survey we perform a bootstrap-like simulation. Our simulations corrects for the fact that the distribution of call durations d i j for the subsample in the survey is not representative of the full network. We start by sampling a synthetic supply network based on the empirical communication network using p(s|c) as found in the survey. We use averages of p(s|c) on bins of d i j . Then we draw subsamples of the sample size of the survey (N ∝ 200). After repeating the process 1500 times, we calculate mean and quartiles and report them in main text Fig. 2b.
For p(c|s) we lack the true distribution of supply link strengths, so we perform a classic bootstrap. For a given conditional probability bin we draw samples of the same size with replacement. We repeat the process 1500 times, calculate mean and quartiles and report them in SI Fig

SI Text 3: Finding the optimal threshold using Kullback-Leibler divergence
The overlap probabilities p(s|c) and p(c|s) increase when thresholded for call duration and/or firm sizes. Of course on the one hand, when thresholding, the information contained in low-intensity contacts (low call duration, small trade volumes) is lost, while on the other hand, link correlations between the communication and supply layers increase. To balance these two effects, we choose the threshold combination where the topology of the thresholded communication network is most similar to a real production network.
To determine when the topology of the thresholded FCN most resembles a real production network, we calculate the Kullback-Leibler divergence between the thresholded FCN and the HSN, We systematically try threshold combinations for the average call duration per week d i j and the number of devices per firm N i , (d i j , N i ). As shown in SI Fig. S4a and S4b, the minimal Kullback-Leibler divergence is found for (30s/d, 0).

SI Text 4 Comparing the network topologies of the FCN, HSN and HCN
Here we compare the network topologies of the FCN, HSN and HCN we discuss the behavior of p(k), k nn (k) and c(k) in greater detail. We begin by characterizing the topology of the firm communication network (FCN). For reference we compare the results to a more directly observed supply network and to a human communication network and show that the topology of the FCN is similar to the known supply network and dissimilar to the social network. Here we describe the network thresholded to only links between firms that have an average interaction duration of more than 30 seconds per day. For the comparison with a real supply network, we compare with the national supply network of Hungary that is obtained through VAT data for 2017 1, 2 (henceforth HSN, short for "Hungarian supply network"). The network shows a link if the tax content of the goods exchanged between two firms exceeds 1,000,000 Forint (approx. 3,000 Euro) and if the link occurs in at least two distinct quarters. For details on the supply network data, see Materials and Methods and 1 . To investigate the difference between the inter-firm communication network and a social communication network (HCN) between humans we use a dataset on calls of individual devices by the same mobile phone provider. The data is for one day during the studied period because the IDs of individual devices are re-anonymized daily, preventing us from studying the communication network for longer than 24h. For details on the HCN we refer to Materials and Methods.

4/12
In SI Fig. S5a we show the degree distribution p(k) of the three networks. In complex networks the degree distribution is often skewed to the right with a fat tail, corresponding to the fact that a few hubs are interacting with many nodes, while the majority of nodes interacts with only a few nodes. The FCN has an average degree of k FCN = 4.79. Its degree distribution has a maximum at k FCN = 2 and has a fat tail that can asymptotically be described by a power-law which we fit with a maximum likelihood estimator using Python's powerlaw package 3 . We find α FCN k = 2.18(12) for k FCN > 30. The PNW does not show the increase for small k but also exhibits a fat tail that can be well fitted by the power law from Eq. 6 with α HSN The mixing patterns of a network have a strong influence on its structure and function. If high-degree nodes tend to interact with other high-degree nodes, the network is called assortative, if high-degree nodes are more likely to interact with low-degree nodes, the network is called disassortative. For supply networks disassortative mixing has been reported 4 ; for social networks we expect assortative mixing 5,6 . To investigate the degree-degree correlations we plot the average nearest neighbor degree where N i is the set of neighbor nodes of i. In SI Fig. S5b we plot k nn as a function of k for the FCN (orange), PNW (blue) and HCN (green). For the FCN k nn (k) shows an increase for small values below k < 10 and then shows decreasing trend for larger k. For very small values the network is assortative, for intermediate and large k the firm communication network is disassortative. For the PNW we find k nn to be relatively flat for small k < 10 and then decrease for large k, thereby showing that the PNW is disassortative for k > 10. This result agrees well with previous studies on the Japanese supply network, which was also shown to be disassortative. The HCN increases to values around k ≈ 30 and then decreases quickly, suggesting the existence of two regimes; a low-degree regime, where assortative mixing patterns dominate, and a high-degree regime, where disassortative mixing is predominant. For the mobile phone communication network of Shanghai very similar results were found. There the authors associate the assortative mixing for nodes with 'reasonable degree' to the social network of calls and the disassortative mixing of high-degree nodes with hotlines or robots 5 . This suggests, that for 'human' callers, the network is assortative.
The local cohesiveness around one node is typically measured by the local clustering coefficient c i . It is defined as the number of closed triangles of node i with its neighbors, t i , divided by the number of possible triangles .
The average local clustering coefficients c , along with their expected value for a random network p is shown in SI Tab. S1. All clustering coefficients are comparably small, but still large compared to random networks. Supplementary Figure S5c shows average c i as a function of degree c(k). For the FCN (orange) and the HSN (blue) c(k) shows a very similar decay. The HCN (green) does not show such a decay, but a decrease for values below k ≈ 20 and an increase for values between k ≈ 20 and k ≈ 40. This weak, peaked dependence of c on k for a mobile phone network was also reported in 5 .

SI Text 5: Calculating Economic Systemic Risk
Here we describe how the ESRI is calculated. We keep our notation closely to the one of 2 . Given a supply network W , where W i j describes the value of products, of type p i , delivered from firm i to firm j. The vector p with element p i ∈ {1, 2, . . . , m} indicates the product type produced by firm i. We identify the product p i with a firm's industry affiliation. The amount of input k firm j uses is Π jk = ∑ n i=1 W i j δ p i ,k , here δ i j denotes Kronecker's delta. A firm's production function is an increasing function f i (Π i1 , . . . , Π im ) that describes how much firm i can produce with a given set of inputs Π i1 , . . . , Π im . Conversely it also allows to assess how much production drops if the amount Π ik of input type k is reduced.
The generalized Leontief production function is a generalization of the regular Leontief production function -with functional form -and a linear production function -with functional form x i = 1 α i1 Π i1 + 1 α i2 Π i2 -and is defined as where the set, I es i , denotes all input types k, that are deemed essential for production of firm i and thus entering the production in a Leontief way, the set I ne i denotes all input types k that enter the production of firm i in a linear way. In our study, firms in sectors up to NACE category F43 are assumed to have a physical production process that is sensitive to the lack of physical inputs. We denote is the overall fraction of output that is spend on all inputs and β i is another parameter inferred from the supply network and defined as the attainable production level if only essential inputs k ∈ I es i are available, i.e., We list the necessary equations to compute the ESRI. First, the following objects have to be defined. The downstream impact matrix Λ d defined by and the elements of Λ d1 and Λ d2 are defined as

6/12
Similarly, the upstream impact matrix is defined by Second, to calculate the ESRI of firm i the initial exogenous shock parameter is chosen to be ψ i = 0 and ψ j = 1 for the other firms j = i.
Third, the following equations are iteratively computed to update the downstream and upstream impeded production levels of firms at time point t. We define the relative output level of firm i at time t as h i (t) = x i (t) x i (0) . 1. Compute the dynamic intraindustry market share for each firm j σ j (t) = min 2. Compute, for essential inputs k ∈ I es i of firm i the fraction available as 3. Compute, for non-essential inputs, k ∈ I ne i , of firm i the fraction available as 4. Update for each firm i the relative production level reduced by downstream shocks 5. Update for each firm i the relative production level reduced by upstream shocks The iteration continues until the algorithm reaches a stable state at time with ε = 10 −2 as convergence threshold. Then the ESRI i of firm i is computed as where s i denotes the size of firm i. The quantity can be interpreted as the fraction of production in the network that is (temporarily) impeded if firm i fails (temporarily). For details of the derivation see 2 Appendix G. Note that the calculation is computationally intensive and scales badly, because with growing network size the number of ESRI i to calculate grows linearly and the convergence times grows by 2 times matrix multiplication costs. For this reason we only show the 1000 most risky firms in Figure 3 (d).

SI Text 6: Extended analysis of the ESRI profiles
We are not only interested in the median damage a company can do, but also in the scenario where the damage is largest. In SI Fig. S6 we plot the maximal ESRI per node of 100 reconstructed supply networks. Reconstructing 100 supply networks and calculating ESRI yields a distribution for every node. Supplementary Figure S6 shows the maximum of each distribution. Note that the maxima are not all from the same configuration. The maximal damage is max(ESRI) = 0.53 and the high systemic risk core consists of around 100 firms, with the majority of nodes having an ESRI around ESRI ≈ 0.25. Figure S6. Maximal worst-case economic systemic risk. We plot the maximal ESRI per node found after simulating 100 configurations of the RSNW ordered from highest to lowest. The shown values are the maximal damage a node does in 100 scenarios, they do not occur all together in one configuration.

SI Text 7: Limitations for systemic risk calculation
Our study is subject to several limitations, in particular (i) the error due to faulty direction/weight estimation, (ii) the limited market coverage of the phone provider (resulting in limited agreement even if p(s|c) = p(c|s) = 1) and (iii) the imperfect overlap of the two networks limiting the possible accuracy. In the following, we address all of these shortcomings one by one and discuss the size of the introduced biases and errors. We end with a simulation to estimate the error introduced by all shortcomings combined. We study the error introduced by sampling the link directions by simulating the reconstruction process on a real supply network topology and then validating it by comparison with the true network. We use the Hungarian supply network and start by aggregating the trade volumes to an input-output table. Then we remove the direction information from the network and sample new link directions according to the probabilities obtained from the input-output table and Eq. (1). We compare the true and simulated link directions and calculate which fraction was guessed correctly. Supplementary Figure S7a shows results of repeating this experiment 1000 times. The mean overlap is 51.5% with a standard deviation of 0.1%. This result, although significantly better than what would be expected from random chance (red line in SI Fig. S7b), is surprisingly low. It can be explained by investigating the probabilities associated with the links in Hungary. If most links were between sectors with a direction as polarized as the relationship between agriculture and the food industry, more links would be guessed correctly. However, as shown in SI Fig. S7b the majority of links has probabilities between 42% and 60% (the lower and upper quartile, shown as red lines). Nevertheless, even though many direct links are guessed incorrectly, on the sector level the proportion of links from sector i to sector j p i j = L i j /(L i j + L ji ) are captured well. We find an average Pearson correlation of the true, empirical sector wise link directions p emp i j and the simulated sector wise link directions p sim i j of r(p emp i j , p sim i j ) = 0.91(1), see also SI Fig. S7c.
As in most countries, there is more than one mobile phone provider in the country where the mobile phone data is from. This results in a market share m less than one. As is schematically shown in SI Fig. S8a, this leads to a large fraction of links that are not accounted for, since we only consider calls between companies who are customers of the mobile phone provider. The graph containing only the links between a set of nodes in a network is called the induced subgraph of the respective set of nodes. To quantify the error introduced by limited coverage we use the real supply network of Hungary and compare the systemic risk as calculated on the full network with the systemic risk calculated on an induced subgraph. We investigate the effect of a market coverage of m = 1/4, m = 1/3 and m = 1/2 using the following steps.  (1). (e) On top of the induced subgraph we limit the overlap of the communication and supply layer by generating a synthetic communication layer using p(s|c) = 0.21. The correlation coefficients of 100 simulations with m = 1/3 are shown in the blue histogram, the mean correlation coefficient is ρ = 0.564(5) (red vertical line). (f) As an additional step we add the reconstruction of the link directions as described in the main text. The correlation coefficients of 100 simulations are shown in the blue histogram, the mean correlation coefficient is ρ = 0.563(6) (red vertical line). Arguably the largest contribution to the overall error is caused by the limited market share, followed by limited overlap p(c|s) < 1. The error introduced due to the link direction estimation is only marginal.