Introduction

As critical infrastructure systems have increased in size and complexity, so has the interdependence between them: communication systems require electricity from the power grid, whose operation and maintenance rely, in turn, on communication systems. Both networks rely on the transportation system for repairs, and transportation in turn needs both electrical power and a functioning communication system. These multiple interdependencies, and their consequences for resilience, have inspired an extensive literature on coupled multilayer networks, crossing disciplinary boundaries1,2,3,4,5,6,7,8,9,10.

The common hypothesis behind the current multilayer network modeling framework is one of hard interdependence, where a node or link failure in one network causes node or link failures in another network, which in turn may induce additional failures in the original network, resulting in a domino-like cascade of failures across multiple systems3 (Fig. 1a). Despite the many modeling insights this framework has offered, evidence of such hard cascading failures remains limited in real systems. For example, while communications and some transit networks do depend directly on electricity, failures in these networks rarely cause electrical failures8. Furthermore, while cascading failures within the electric grid are well documented11,12,13,14,15,16, and despite a decade-long body of literature on the subject, we continue to lack convincing empirical evidence of these cascades inducing cascades of failures in other infrastructure systems.

Fig. 1: Damage and recovery in interdependent networks.

a Under the hard coupling model, when node x1 fails, it causes a cascade across both networks that disables the entire system. b With recovery coupling on the same network, when node x1 fails it is repaired using resources from network Y delivered through node y1. Failures in network Y will impair that repair process.

While direct evidence of hard coupling across multiple networks is limited, there are multiple accounts of interdependencies not considered by the current modeling frameworks: those induced by the recovery process8,17,18,19. The repair and recovery of a network following a local or global failure require resources typically supplied by other networks. For example, restoring failed power components requires that the repair crews have access to transportation (road networks) and coordination through communications (cellular networks and internet). If the support networks are not fully functional, the delivery of resources critical for recovery will be slowed or impaired (Fig. 1b). Indeed, while a blocked road or an internet outage in a given location will not cause a power outage, it may delay the repair of power outages in the affected area. And because the damage may continue regardless of the system’s ability to recover, impaired recovery could eventually lead to a system’s collapse. Such recovery-based interdependencies were well documented in the aftermath of Hurricane Sandy: at least 85 incidents of recovery interdependence were reported, including the dependency of the power grid’s recovery on other networks17.

Results

Here we show how recovery coupling affects a network’s functionality, finding that its signatures and dynamics are different from the much-studied multilayer cascading failures, as well as from interdependent networks with coupling20,21,22,23,24. To empirically test the developed framework, we collected data on millions of power grid failures in the contiguous United States, finding evidence of striking nonlinear behavior in recovery following large perturbations, consistent with the model predictions.

Network damage and recovery at constant rates

Consider two infrastructure systems X and Y, each composed of N elements (nodes). Each network is described by its adjacency matrix, Xij and Yij, and we label the nodes geographically so that co-located nodes xi and yi have the same index i. At any moment, each node can either be functional (xi = 1, yi = 1) or non-functional (xi = 0, yi = 0). A non-functional node can cause secondary damage either by isolating its neighbors from the rest of the network, or via cascading mechanisms16. Though a single node or link failure can render other parts of the network non-functional, once the initial failure is repaired, the secondary failures will typically also return to functionality25. For example, though a downed power line may cut off power to many homes, once the line is repaired, the power will be restored to each home without needing the individual repair of each component.

Assuming a constant damage rate \({\gamma }_{\mu }^{d}\) and a constant repair rate \({\gamma }_{\mu }^{r}\), the fraction of primary failed nodes in each network fμ evolves in time as

$$\dot{{f}_{\mu }}={\gamma }_{\mu }^{d}(1-{f}_{\mu })-{\gamma }_{\mu }^{r}{f}_{\mu },$$
(1)

reaching the equilibrium damage fraction

$$\langle {f}_{\mu }\rangle =\frac{1}{1+{\gamma }_{\mu }^{r}/{\gamma }_{\mu }^{d}}.$$
(2)

The damage rate \({\gamma }_{\mu }^{d}\) is largely exogenous and determined by weather, accidents, or component failures. The repair rate \({\gamma }_{\mu }^{r}\), in contrast, is determined by the resources available for repair, such as crew and supplies.
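As an illustration of Eqs. (1) and (2), the following minimal Python sketch integrates Eq. (1) with a forward-Euler scheme and compares the result with the equilibrium of Eq. (2). The function names, the step size, and the rate values are assumptions chosen for illustration; they are not fitted to any data.

```python
def equilibrium_fraction(gamma_d: float, gamma_r: float) -> float:
    """Eq. (2): steady-state fraction of primary failed nodes."""
    return 1.0 / (1.0 + gamma_r / gamma_d)


def integrate_damage(gamma_d, gamma_r, f0=0.0, dt=0.01, steps=20000):
    """Forward-Euler integration of Eq. (1): df/dt = gamma_d (1 - f) - gamma_r f."""
    f = f0
    for _ in range(steps):
        f += dt * (gamma_d * (1.0 - f) - gamma_r * f)
    return f


if __name__ == "__main__":
    gamma_d, gamma_r = 0.1, 0.9  # illustrative rates only
    print(integrate_damage(gamma_d, gamma_r, f0=0.0))  # start fully functional
    print(integrate_damage(gamma_d, gamma_r, f0=1.0))  # start fully failed
    print(equilibrium_fraction(gamma_d, gamma_r))      # prediction of Eq. (2)
```

Under these assumptions, both starting points approach the same value, which reflects the single stable solution of the constant-rate model.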

Equations (1)–(2) predict a linear relationship between the number of damaged nodes and the number of repairs executed within a given time window, analogous to the elastic balance between displacement and restoring forces in stress-strain relationships in materials science26,27. A constant damage rate \({\gamma }_{\mu }^{d}\) leads to \({\gamma }_{\mu }^{d}N\) sites being damaged at any time, and temporal variability can be modeled by replacing the constant \({\gamma }_{\mu }^{d}\) with a stochastic variable from a representative distribution (see supplementary note 3).

Observed outage recovery behavior

To empirically test the validity of elastic recovery, we built an Outage Observatory, a suite of continually running web crawlers that record live-updating outage maps28,29,30 from electrical utilities around the United States (Fig. 2a). During 2019 the Observatory recorded over 5 million power outages, capturing the geographic location and time of each outage and the repair time for each incident (Fig. 2e). By comparing the number of repairs and outages occurring in a utility at any time, we can construct the damage-repair curves for each utility (Fig. 2b and c), finding that for most utilities the recovery follows the linear response of Eq. (1) 95% of the time, with the slope providing the repair rate (see supplementary note 5 for details). However, we also observed multiple large disruptions, for which the number of repairs systematically and significantly deviates from the linear pattern characterizing the elastic behavior (Fig. 2d). We have been able to link many of these to large events such as severe winds, rainfall, snowfall, and fires. For example, a derecho system that struck the Northern Midwest on 19 July 2019 (ref. 31) caused over 55,000 outages, resulting in over 60 million lost customer hours. Each perturbation impacts the power grid and its support systems in different ways, hence the precise deviation from linearity cannot be inferred from the number of outages alone. Though each large failure has its unique cause and recovery dynamics, when we place all perturbations on the same graph we observe a remarkable universality, finding that all large events display similar nonlinear behavior (Fig. 2d).
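The elastic residual used in this comparison is the observed number of repairs minus the repairs predicted by a linear response. The sketch below shows one way such a residual could be computed in Python. It is not the fitting procedure of supplementary note 5: the least-squares fit through the origin, the `max_elastic_outages` threshold, and the sample numbers are all assumptions made for illustration.

```python
def fit_repair_rate(outages, repairs, max_elastic_outages):
    """Slope of a least-squares line through the origin, repairs ~ rate * outages.

    Only windows with at most `max_elastic_outages` outages are used, on the
    assumption (made for this sketch) that small events sample the elastic regime.
    """
    pairs = [(o, r) for o, r in zip(outages, repairs) if o <= max_elastic_outages]
    return sum(o * r for o, r in pairs) / sum(o * o for o, _ in pairs)


def elastic_residuals(outages, repairs, rate):
    """Observed repairs minus the elastic (linear) prediction, per time window."""
    return [r - rate * o for o, r in zip(outages, repairs)]


if __name__ == "__main__":
    # Hypothetical counts per 2-h window, not data from the Observatory.
    outages = [10, 20, 40, 80, 400, 800]
    repairs = [5, 10, 20, 40, 120, 160]
    rate = fit_repair_rate(outages, repairs, max_elastic_outages=100)
    print(rate)
    print(elastic_residuals(outages, repairs, rate))
```

In this made-up example the small windows lie on a straight line, while the two large windows fall below it and so yield negative residuals, the downward deviation described in the text.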

Fig. 2: Elastic and inelastic recovery in the power grid.

a Locations of outages recorded by the Outage Observatory, colored by the utility serving that area. b, c Repairs executed vs total outages recorded for each 2-h window for DTE Energy (b) and Consumers Energy (c), two large utilities in Michigan. An elastic response implies that a constant fraction of outages are repaired at any given time. When the number of outages is small, the response is elastic, but when the system experiences a large number of outages it can become increasingly inelastic. Red and blue indicate the deviation from elastic response in the downward and upward direction, respectively. d The elastic residual is the difference between the observed repair and the predicted repair based on an elastic response. Comparing the 30 utilities with the most outages, we find a universal downward deviation (more red points). e The number of outages observed at each moment for DTE Energy and Consumers Energy. Because the deviation from elasticity can be quantified for each time window, we can use the color map of panels (b, c) to indicate system elasticity over time. Measurements of elastic and inelastic recovery for more utilities can be found in Supplementary Figs. 3 and 4.

Modeling recovery coupling

The loss of elasticity during extreme perturbations indicates that the hypothesis of a constant repair rate is not sufficient to explain the system’s behavior. Given that the repair process requires resources from other networks, we hypothesize that a multi-network approach could explain the observed deviation. To model the observed dependency, we allow the repair rate \({\gamma }_{X,i}^{r}\) of the primary network X (e.g., power grid) at node i to depend on the state of the support network Y (e.g., road or communication network) at the same location (Fig. 1b), obtaining

$${\gamma }_{X,i}^{r}(t)=g({\langle y\rangle }_{i}(t))=g(1)-g^{\prime} (1)(1-{\langle y\rangle }_{i}(t))+O({(1-{\langle y\rangle }_{i}(t))}^{2}),$$
(3)

where g(x) is an unknown function that represents the functional dependence of the repair rate of system X on the state of network Y around site i, which we assess with the network average

$${\langle y\rangle }_{i}(t)=\frac{1}{k+1}\mathop{\sum}\limits_{j}{Y}_{ij}{y}_{j}(t),$$
(4)

to capture the fact that repair resources are drawn from the neighborhood of the failure and are affected by the networks which supply them. Here k is the number of neighbors of node i in Y, and the average also counts node i itself (see Methods). Thus \({\langle y\rangle }_{i}\) may represent the dynamically evolving accessibility or the availability of electricity. In Eq. (4), the variable \({\langle y\rangle }_{i}(t)\) captures the temporally evolving local state of network Y, which may itself co-evolve with the state of the nodes in network X if dependencies exist between the two systems.

Denoting with \({\gamma }_{X}^{r,0}=g(1)\) the elastic repair rate and with \(\alpha =g^{\prime} (1)/g(1)\), we obtain

$${\gamma }_{X,i}^{r}(t)={\gamma }_{X}^{r,0}(1-\alpha (1-{\langle y\rangle }_{i}(t))),$$
(5)

enabling us to describe the expected behavior of g(x) to first order with the assumption that \(\alpha \in [0,1]\). Specifically, we assume that damage in Y will not improve repair in X (\(g^{\prime} (1)\ge 0\to \alpha \ge 0\)) and that the repair rate must remain positive (\(| g^{\prime} (1)| \le | g(1)| \to \alpha \le 1\)).
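To make Eqs. (4) and (5) concrete, the following Python sketch computes the local support average and the resulting repair rate for one node. The adjacency-list representation, the function names, and the toy four-node path are assumptions made for illustration; the local average counts node i together with its neighbors, following the description in Methods.

```python
def local_support(neighbors, y, i):
    """Eq. (4): functional fraction among node i and its neighbors in network Y."""
    group = [i] + list(neighbors[i])
    return sum(y[j] for j in group) / len(group)


def repair_rate(gamma_r0, alpha, support):
    """Eq. (5): gamma_r0 * (1 - alpha * (1 - <y>_i))."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return gamma_r0 * (1.0 - alpha * (1.0 - support))


if __name__ == "__main__":
    # Hypothetical support network Y: a path 0 - 1 - 2 - 3 with node 1 failed.
    neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    y = [1, 0, 1, 1]
    support = local_support(neighbors, y, 1)  # two of three nodes functional
    print(support)
    print(repair_rate(gamma_r0=1.0, alpha=0.5, support=support))
    print(repair_rate(gamma_r0=1.0, alpha=1.0, support=0.0))  # no support, full coupling
```

With α = 0 the repair rate is independent of the support network, while with α = 1 it vanishes when the whole neighborhood in Y has failed.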

If damage is sporadic and uncorrelated across both systems, the simultaneous failure of xi and yi for a given i is rare, and when the failures are limited to a single network, recovery is not impaired (Fig. 1b). However, if damage in X and Y is correlated in time or space, simultaneous damage of nearby sites in X and Y will occur with higher frequency, and based on Eq. (5) we expect a reduction in the repair rate. Such correlations are often caused by severe weather events, the main source of disruptions to all infrastructure systems in the United States32,33,34. These events are highly localized in time and space, simultaneously damaging the electric, communications, and transportation networks. Hurricane Sandy, for example, induced failures across the power grid and communications networks (downed lines, flooded control centers) and transportation networks (flooded roads). These simultaneous failures led to recovery delays, as power outages could not be repaired because roads were flooded. At times, the coupling was bi-directional: some flooded roads had pumping systems for drainage, which could not be operated without electricity17.

Recovery coupling case study: Tropical Storm Imelda

When there are many outages at once, the repair time can also be affected by resource limitations, like a limited number of repair crew members and trucks. Yet resource limitations are expected to impact the whole service area equally. If, however, the slowdown is limited to regions where the support infrastructure is damaged, recovery coupling is the main driving factor. To distinguish between these two mechanisms, we relied on natural experiments, when exogenous shocks simultaneously affected the electrical network and its support networks. In September 2019, Tropical Depression Imelda caused widespread power outages and flooding in Houston, Texas, and the surrounding area (Fig. 3a). We analyzed the duration of all power outages in the vicinity of flooded roads, using areas without flooding as control, allowing us to test whether the slowdown in outage repairs was due to system-wide drains on resources or to the dependence of the repair rate on road networks. We also considered a temporal control, inspecting the repair times of outages reported over the previous 60 days in the same area (Fig. 3b, e). We find that the slowdown in outage restoration is heavily localized in both space and time around the flooded roads: while more than 95% of the outages located more than 30 km from the flooded roads were repaired within 10 h, 40% of the failures occurring within 5 km of a flooded road remained unrepaired after 10 h. Furthermore, even during the storm, outages far from flooded roads were repaired at the same rate as without a storm (spatial control, Fig. 3e). The observed separation of outage survival curves at different distances from flooded roads offers direct evidence of multilayer recovery coupling, illustrating how damage in a non-electrical infrastructure impacts the functionality of the electrical infrastructure.
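The fraction of outages still unrepaired after a given time is a survival curve, and Fig. 3e reports such curves using the Kaplan-Meier estimator. The sketch below is a minimal, standard-library implementation of that estimator, included only to show how such a curve is built. The durations, the censoring flags, and the function name are hypothetical, and how unresolved outages were treated in the actual analysis is not specified here.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier survival curve as a list of (time, fraction unrepaired).

    durations: hours each outage was followed
    observed:  True if the repair was seen, False if the record is censored
               (for example, the outage was still active when tracking stopped)
    """
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival, curve, idx = 1.0, [], 0
    while idx < len(events):
        t = events[idx][0]
        repaired = censored = 0
        # Group all records that share the same time t.
        while idx < len(events) and events[idx][0] == t:
            if events[idx][1]:
                repaired += 1
            else:
                censored += 1
            idx += 1
        if repaired:
            survival *= 1.0 - repaired / at_risk
            curve.append((t, survival))
        at_risk -= repaired + censored
    return curve


if __name__ == "__main__":
    # Hypothetical outage durations in hours; the third record is censored.
    durations = [2, 3, 3, 5, 8]
    observed = [True, True, False, True, True]
    for t, s in kaplan_meier(durations, observed):
        print(t, round(s, 3))
```

Computing one such curve per distance band from the nearest flooded road, and comparing the bands, would reproduce the structure of the comparison described above.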

Fig. 3: Empirical evidence for recovery coupling.

a Average outage duration and location of flooded roads during Tropical Depression Imelda. Delayed restoration occurred primarily around Beaumont and Northeast Houston, where most flooded roads were located. b Number of outages (orange) and flooded roads (blue) during Imelda. The shaded orange curve shows the middle quartiles of outages for the same times (hour and day) with no storm, offering a time control. c The recovery of outages during Imelda that were far (≥10 km) from flooded roads is well approximated by an elastic (linear) response. d Outages near (<10 km) flooded roads show substantial deviation from elastic behavior. The coloring encodes deviation from elasticity as in Fig. 2. e Fraction of unrepaired outages grouped according to their distance to the nearest flooded road. Using the Kaplan-Meier estimator, we find statistically significantly longer outage durations for outages closer to flooded roads. The spatial control of outages far from the storm and the temporal control of outages from the storm area in the 60 days before the storm are comparable. In this case, we have taken proximity to damage as a proxy for the network effects of a road closure.

Further evidence of the proposed phenomenon is provided by the coexistence of elastic behavior far from the flooded roads with inelastic behavior near them (Fig. 3c, d). We note that the repair amounts are not only below the elastic prediction, but decrease with increased damage, in line with the prediction that the deviation from elasticity is not due to resource constraints, which tend toward saturation of repair per unit time (Supplementary note 2 and ref. 35).

Recovery coupling simulations and phase space

To understand the implications of recovery coupling for multilayer network resilience, we consider the symmetric case in which the network structure, damage, and recovery parameters are the same in both systems. Since the two systems support each other, we let the repair rate of Y be influenced by the state of X in the same manner as Eq. (5): \({\gamma }_{Y,i}^{r}(t)=g({\langle x\rangle }_{i}(t))\). In the symmetric case fx = fy = f, leading to a single equation that governs the state of the system. If the failures are uniformly distributed, we can use percolation theory36,37 to analytically derive the equation that governs the expected fraction of primary failures in the coupled system,

$$f=\frac{1}{1+\frac{{\gamma }^{r,0}}{{\gamma }^{d}}(1-\alpha (1-u(1-f)))},$$
(6)

where u(x) is the probability that a link does not lead to the largest connected component when a random fraction 1 − x of the nodes are removed, and is determined by the network topology. Equation (6) has one or two stable solutions depending on the value of the control parameter \(\frac{{\gamma }^{r,0}}{{\gamma }^{d}}\). The non-symmetric case has similar results, as we show in supplementary note 1 and Supplementary Fig. 1. In contrast, the uncoupled case of Eq. (2), which we recover from Eq. (6) for α = 0, has a single stable solution. The new solution describes a stable fixed point at f = 1 (all nodes failed), which persists even for high recovery rates \(\frac{{\gamma }^{r,0}}{{\gamma }^{d}}\) (see Fig. 4a). The existence of two stable solutions for f for the same recovery rate \(\frac{{\gamma }^{r,0}}{{\gamma }^{d}}\) indicates that for a wide range of conditions, recovery coupled networks are resilient: they display functionality comparable to the uncoupled case and return to full functionality following small perturbations38,39. However, a sufficiently large perturbation can force the system to cross the unstable branch, pushing it into a dynamically stable non-functional state (Fig. 4a). This is more likely with correlated perturbations across layers, as we show in Supplementary Fig. 2. The existence of this behavior analytically predicts a “catch 22” phase that follows a sufficiently large disaster: infrastructure system X cannot be repaired because it requires resources from Y, and Y cannot be repaired because it requires resources from X. The fact that the collapsed state persists even for high repair rates and low damage rates predicts that it is harder to bootstrap a broken system than it is to maintain the functionality of one that is damaged but still working. Synthesizing elastic residual curves (Fig. 4c) like the observations in Fig. 2d, we find that the full coupling α = 1 reproduces the shape of the curve, while lower values of α do not, providing further evidence that the general deviation from elasticity is consistent with recovery coupling.

Fig. 4: Recovery coupling in multilayer networks.

a Comparison between the functionality of uncoupled networks and recovery coupled networks. The uncoupled case (blue line) has a single solution for any repair to damage ratio γr,0/γd, implying that it can recover its functionality after an arbitrarily large perturbation. With recovery coupling the system can function at levels similar to the uncoupled case (orange line) but the non-functional collapsed state persists as an attractor (red line), implying that for sufficiently large damage, the system can reach a permanently collapsed state. b If we inspect the recovery per unit time as a function of concurrent damage amount, we observe a behavior similar to elasticity in materials science. Recovery coupling leads to a sublinear or inelastic behavior, predicting the loss of resilience under heavy damage. c The elastic residual plot from bidirectionally coupled random networks shows the same pattern as observed for the real data in Fig. 2d.

Discussion

The 27 September 2003 blackout in Italy is often used to illustrate how the interdependence of communications and electrical infrastructure can cause cascading failures3,40. However, a closer look at the sequence of events indicates that though transmission network overload cascades triggered the power outage41, the dependence of repair activities on the communication network, which was itself disabled, prolonged the recovery process42. Here we demonstrated that such recovery coupling can by itself lead to a collapse of functionality. More importantly, we have shown that the signatures of recovery coupling are directly observable during severe weather events, indicating that the proposed mechanisms have direct relevance to real multilayer networks. Domino-like dependencies, which could co-occur, further amplify this danger, though some interdependencies have been shown to reduce cascading43.

The data-driven approach presented here enables a more precise understanding of infrastructure interdependence. For example, we find that while the set of flooded roads as a whole caused slowdowns in power outage repairs, some impaired roads had much stronger effects than others. The roads in downtown Houston caused only minor delays when flooded, while in Beaumont and Northeast Houston flooded roads caused severe delays (Fig. 3a). Improving the precision with which we measure infrastructure vulnerability is particularly important in light of aging infrastructure and climate change.

Our findings reinforce the importance of engineering for resilience not only by strengthening critical infrastructure, but also by focusing on the socio-technological layers needed to restore the infrastructure when damaged. These recovering systems may supply electricity, fuel or access, or they may provide human connections through social networks, which have also been shown to play a powerful role in disaster recovery44.

Recovery coupling has relevance for other systems affected by multiple networks. A pertinent example is the impact of loss of healing ability during aging. Living organisms display a fundamental asymmetry between damage and repair, similar to what we observe in infrastructure networks: damage is typically caused by external factors (oxidants, pathogens, shocks, etc.) while repair is endogenous and is governed by multiple coupled networks (regulatory, metabolic and signaling) requiring diverse resources (nutrients, oxygen, immune cells, etc.). From this perspective our work complements recent network-based modeling of the relation between repair and aging45,46, illustrating how the well-documented loss of healing ability in individual systems47 can lead to systemic frailty, where the organism can lose its ability to respond to shocks that it could tolerate under normal conditions48.

Methods

Data

Outage data were collected by taking regular snapshots (several per hour) of the outage maps published by electric utilities around the United States. Each snapshot contains a geotagged list of all outages active at that time, including transmission and distribution outages. A single outage is reported for each incident, even if many customers are affected. By comparing snapshots from moment to moment, and noting the first and last time that the outage appeared, we can identify the outage’s location and duration. For more detail about the data collection, see supplementary note 4 and Supplementary Table 1. To download the data used in this analysis, visit https://github.com/mmdanziger/recovery-coupling.
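The snapshot comparison described above can be sketched as follows. This is an illustrative Python example, not the collection code itself: it assumes that each snapshot is a pair of a timestamp and a collection of outage identifiers, and that an identifier stays the same across snapshots. Both the data layout and the function name are hypothetical.

```python
def outage_durations(snapshots):
    """Estimate outage durations from periodic snapshots.

    snapshots: iterable of (timestamp, iterable of outage ids active at that time)
    Returns {outage id: last time seen minus first time seen}.
    """
    first_seen, last_seen = {}, {}
    for timestamp, active_ids in sorted(snapshots, key=lambda s: s[0]):
        for outage_id in active_ids:
            first_seen.setdefault(outage_id, timestamp)  # keep the earliest sighting
            last_seen[outage_id] = timestamp             # overwrite with the latest
    return {oid: last_seen[oid] - first_seen[oid] for oid in first_seen}


if __name__ == "__main__":
    # Hypothetical snapshots taken every 15 minutes (timestamps in minutes).
    snapshots = [(0, {"a"}), (15, {"a", "b"}), (30, {"b"}), (45, set())]
    print(outage_durations(snapshots))
```

Because an outage can begin or end between two snapshots, a duration estimated this way is only resolved to within roughly the snapshot interval.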

Simulation

We performed discrete-time simulations where at each time point we scan all nodes in each network. For every operational node, we switch it to non-operational with probability ~ γd, and for each non-operational node, we switch it to operational with probability ~ γr. The value of γr is calculated as

$${\gamma }_{X,i}^{r}(t)={\gamma }^{r,0}(1-\alpha (1-{\langle y\rangle }_{i}(t-1))),$$

where \({\langle y\rangle }_{i}(t-1)\) is defined as #(operational nodes among yi and its neighbors at the previous iteration) / (#(neighbors of yi) + 1). In Fig. 4c, for every value of γr,0/γd, we simulate the networks until they converge to a fixed point. We simulate transient behavior in the same manner, as shown in Supplementary Fig. 2.
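The update rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the simulation code used for the figures: it uses the symmetric case, gives both layers the same ring-lattice topology (a simplification chosen here for brevity), and the function names, parameter values, and seed are hypothetical.

```python
import random


def ring_neighbors(n, k=2):
    """Ring lattice: each node is linked to its k nearest nodes on either side."""
    return [[(i + d) % n for d in range(-k, k + 1) if d != 0] for i in range(n)]


def local_fraction(state, neighbors, i):
    """Operational fraction among node i and its neighbors."""
    group = neighbors[i]
    return (state[i] + sum(state[j] for j in group)) / (len(group) + 1)


def step(x, y, neighbors, gamma_d, gamma_r0, alpha, rng):
    """One synchronous update of both layers, using the previous-step states."""
    def update(own, support):
        new = []
        for i, s in enumerate(own):
            if s == 1:  # operational node: may fail
                new.append(0 if rng.random() < gamma_d else 1)
            else:       # failed node: repair depends on the other layer
                local = local_fraction(support, neighbors, i)
                rate = gamma_r0 * (1.0 - alpha * (1.0 - local))
                new.append(1 if rng.random() < rate else 0)
        return new
    return update(x, y), update(y, x)


def simulate(n, gamma_d, gamma_r0, alpha, start_state, steps=300, seed=0):
    """Return the final failed fraction, averaged over both layers."""
    rng = random.Random(seed)
    neighbors = ring_neighbors(n)
    x, y = [start_state] * n, [start_state] * n
    for _ in range(steps):
        x, y = step(x, y, neighbors, gamma_d, gamma_r0, alpha, rng)
    return 1.0 - (sum(x) + sum(y)) / (2 * n)


if __name__ == "__main__":
    print(simulate(1000, 0.01, 0.1, alpha=0.0, start_state=1))  # uncoupled
    print(simulate(1000, 0.01, 0.1, alpha=1.0, start_state=0))  # collapsed start
```

In this sketch, with α = 1 and both layers fully failed, every local average is zero, so the repair probability vanishes and the system stays at f = 1. With α = 0 the failed fraction settles near the uncoupled equilibrium of Eq. (2).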