Introduction

Maintaining the daily functioning of modern society is a demanding task, and traditional risk-based approaches to managing critical infrastructure are often criticized for their inability to address largely unknown and uncertain threats1,2,3. Risk-based approaches require developing threat scenarios, evaluating system vulnerabilities and quantifying the consequences associated with specific failures of system components. When the threat space is unknown, developing realistic scenarios becomes an additional challenge. Moreover, it may be difficult to justify investing in hardening system components on the basis of hypothetical and uncertain threats4. A number of researchers have pointed out the weaknesses and potentially misleading nature of risk quantification approaches, for example in cyber systems5,6.

Building resilience into infrastructure networks7 has been proposed as the key to protecting against the deleterious effects of system disruption due to natural disasters8,9 as well as failures of infrastructure and engineering systems10,11,12,13. Not surprisingly, numerous interpretations of resilience have emerged, testifying to the richness of the concept but also presenting challenges for its measurement and application1. Yet recent publications and guidance documents14,15 coalesce around the definition of resilience provided by the National Academy of Sciences (NAS)16: resilience of a system is its ability “to plan and prepare for, absorb, respond to and recover from disasters and adapt to new conditions”. An important feature of resilience captured in this definition is its temporal dimension: the ability to recover and retain critical system functionality in response to a wide range of threats, both known and unknown. The assessment of resilience should therefore identify the critical functionality of a system and evaluate the temporal profile of system recovery in response to adverse events. Resilience management should comparatively evaluate cross-domain alternatives designed to enhance the system’s ability to (i) plan for adverse events, (ii) absorb stress, (iii) recover and (iv) predict and prepare for future stressors in order to adapt to the threats they pose.

Even though definitions of resilience as a system property are commonly reported in the literature (see details in Supplement S1, table S1.1), resilience assessment has been implemented in structured but largely qualitative or semi-quantitative ways17. Insightful studies quantify resilience with metrics associated with different domains and subsequently integrate them into a risk-based evaluation or resilience index. For example, Bruneau et al.18 identified four dimensions of resilience of interest to the seismic community: technical, organizational, social and economic. Measures of resilience (robustness, rapidity, resourcefulness and redundancy) were then aggregated in order to minimize a function of the probability of system failures, the consequences arising from such failures and the recovery time19. Another qualitative but quantifiable approach sets forth a taxonomy for metrics that accommodates both change and interaction among the physical, informational and human domains20; this taxonomy has been applied to cyber threats, energy systems and coastal infrastructure9,20,21. Such work provides insight and guidance for developing quantitative resilience measures that correspond to the qualitative identification of systemic issues and gaps. Unfortunately, it provides only limited insight into the management and control of the interconnected networks that constitute the entire system. Meanwhile, the field of network science has focused on the challenge of understanding the structure, dynamics and vulnerability of systems composed of multiple interdependent network layers22,23,24,25,26.

This paper proposes a methodology for quantifying a system’s resilience that captures the concept of engineering resilience advanced by the NAS27,28 and stated above. We make use of the critical functionality (CF), previously referred to as the functionality function8, performance29,30 or quality18,31,32, defined as a metric of system performance set by the stakeholders, to derive an integrated measure of resilience. One example of a CF, among many possible ones, is the percentage of nodes that are functioning. Another is the ratio of a network’s actual flow to its maximum capacity.

We note that, in addition to resilience, the CF is rich in valuable information and can be the source of many quantitative performance metrics, such as robustness, which we also briefly discuss. As application domains, our focus in this paper is on the following two classes of models: i) multi-level directed acyclic graphs (DAGs)33 and ii) interdependent coupled networks34. While the second class of networks is the subject of intense interest35, the first class, often overlooked by analysts, is of interest in many fields, from biology (gene regulatory networks) to computer science (software dependencies)36,37,38. Infrastructure systems, to a certain extent, can also be represented by DAGs. For instance, in food supply chains the top-level nodes may model the original manufacturers, while the bottom-level nodes correspond to the final demand. Other infrastructure examples include certain sequential electrical circuits for combinational logic39, data processing graphs40, etc. In this paper we approximate the Linux (specifically Ubuntu 12.04) software package system as a DAG and estimate its resilience. We selected these examples to test and illustrate the approach because they are realistic as well as easily characterized, especially given the availability of data for Ubuntu. The approach is equally applicable to more general networks that are characterized in flexible ways while still being amenable to this resilience analysis methodology.

Because obtaining analytical results for most realistic cases is an intractable task, even for homogeneous networks, our approach is simulation based. We do, however, obtain analytical results in the Methods section for a simple yet illuminating special case of the first model, where only nodes without redundant active supply links may be unable to supply service. This case sheds light on the relationship between redundancy and system resilience.

Resilience: An Analytical Definition

Before considering specific network models, we describe the proposed concept of resilience for complex networks in general terms. A network is modeled as a graph G(N, L) with a set of nodes N connected by links L. The specification of N and L includes characteristics relevant to resilience, such as the capacity, location and weight of each node and link. Let C be the set of temporal decision rules and strategies to be developed in order to improve the resilience of the system during its operation. From a computational viewpoint, the parameters and algorithms defined by C depend on the particular model being implemented.

Ultimately, the system must maintain its critical functionality Κ at each time step t, where Κ maps the system's states or parameters to a real value between 0 and 1. This mapping may, for instance, be linear:

Κ(t; N, L, C) = Σi∈{N,L} wi(t; C) πi(t; C),  (1)

where {N, L} is the set of all nodes and links, wi(t; C) ∈ [0, 1] is a measure of the relative importance of node or link i at time t (with the weights normalized so that Κ remains in [0, 1]) and πi(t; C) ∈ [0, 1] is the degree to which node or link i is still active in the presence of an attack. An alternate interpretation defines πi(t; C) as the probability that node or link i is fully functional. More complex, nonlinear and detailed definitions of critical functionality mappings are also possible. Finally, we introduce the class of adverse events (or potential attacks on targeted nodes) E. For instance, in the case of a random attack on two nodes, E is the set of all attacks on all possible node pairs.
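As a concrete illustration of equation (1), the following minimal Python sketch (ours, not part of the original study; the input lists are hypothetical) computes Κ from per-component importance weights and activity levels, normalizing the weights so that Κ stays in [0, 1]:

    # Sketch: critical functionality K as the weighted sum of equation (1).
    # `weights` (w_i) and `activity` (pi_i) are illustrative inputs.
    def critical_functionality(weights, activity):
        total = sum(weights)  # normalize weights so that K is in [0, 1]
        return sum(w * p for w, p in zip(weights, activity)) / total

    # Four components of equal importance, one degraded to half capacity.
    print(critical_functionality([1, 1, 1, 1], [1.0, 1.0, 1.0, 0.5]))  # 0.875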

Resilience, denoted by R, is a composite function of the network topological properties and their temporal evolution parameters, defined for a certain critical functionality and a class of adverse events E:

R = R(Κ(t; N, L, C), E).  (2)

Note that not all targeted nodes are necessarily afflicted. For a non-afflicted attacked node, we thus have πi = 1 over the entire time interval of interest. We evaluate R over a certain time interval [0, TC], where TC is the control time41, which can be set a priori, for instance, by stakeholders, or estimated as the mean time between adverse events. In continuous time, we define R as

R = (1/|E|) Σe∈E (1/TC) ∫[0,TC] Κe(t)/Κnominal(t) dt,  (3)

where |E| is the cardinality of the set E and Κnominal(t) is the critical functionality of the system in the case where no external events occur (Fig. 1). Equation (3) allows evaluation of the normalized dynamical performance of the system before (plan/prepare), during (absorption) and after an attack (recovery and adaptation); it is intended to capture the definition advanced by the NAS given in the introduction. For computational convenience, the above equation is given in discrete time by

R = (1/|E|) Σe∈E (1/TC) Σt∈[0,TC] Κe(t)/Κnominal(t).  (4)

In most cases, we normalize to Κnominal(t) = 1. Consequently, a normalized measure of resilience may be given by

R = (1/(|E| TC)) Σe∈E Σt∈[0,TC] Κe(t).  (5)

We note that equation (1) embraces a large class of performance measures found in the literature. For instance, in addition to equation (5), we can also consider the measure

M = min{Κ(t): t ∈ [0, TC]}.  (6)

The above measure is referred to as the robustness8,42. Alternatively, one may define (1 – M) as the risk1.
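For concreteness, here is a short sketch (ours) of how equations (5) and (6) would be evaluated from simulated critical-functionality profiles, assuming Κnominal(t) = 1 and one profile per event in E:

    # Sketch: discrete-time resilience (equation 5) and robustness
    # (equation 6) from simulated K(t) profiles, one per adverse event in E.
    def resilience(profiles):
        # Average K over the control time, then over all events in E.
        return sum(sum(k) / len(k) for k in profiles) / len(profiles)

    def robustness(profile):
        # M: the lowest critical functionality reached over [0, T_C].
        return min(profile)

    k = [1.0, 0.8, 0.9, 1.0, 1.0]          # K drops to 0.8, then recovers
    print(resilience([k]), robustness(k))  # 0.94 0.8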

Figure 1

Resilience and critical functionality concepts as advanced by the NAS.

The system’s resilience is evaluated as the integral of the critical functionality Κ over time.

Due to the very complex nature of networked systems and the large number of variables defining their states, it is not possible to consider all events in the set E and obtain a closed-form expression for R, even if all design parameters are made homogeneous across nodes, links and time. We therefore rely on a simulation-based approach. Each simulation represents a possible scenario of the networked system’s evolution. For each simulation, we calculate the average value of the critical functionality Κ(t, N, L, C) at every time step (equation 1) and, from there, the resilience (equation 4 or 5) over the interval of interest.

The proposed approach builds upon and extends the work of others8,18,29,30,32. The main issue in estimating resilience from simulated system performance curves is that, in general, those curves vary with the adverse events modeled. The usual way to resolve this issue is to extend the techniques of probabilistic risk analysis to resilience analysis, yielding a weighted average performance curve whose weights represent the probabilities of the adverse events.

By contrast, we argue that the resilience of the system should not be tied to the probabilities that adverse events occur. Again, according to the NAS, resilience is the ability to plan, absorb, recover and adapt16. Inspired by this definition, we instead simulate the damage to the system from a certain adverse event (regardless of the probability that the event occurs) and define resilience for that particular damage. For simplicity of decision making, however, we suggest considering a certain class of adverse events. For instance, in a networked system we might define one class of adverse events as the case in which the functionality of 4 to 5 nodes is reduced by 40–50% (instead of defining a particular adverse event that reduces the functionality of four specific nodes by 50%).
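As a sketch of what sampling such a class might look like (our illustration; the function and its arguments are hypothetical), one can draw events uniformly from the class and evaluate resilience for each draw:

    import random

    # Sketch: sample one adverse event from the class "functionality of
    # 4 to 5 nodes is reduced by 40-50%". Returns a map node -> new pi_i.
    def sample_adverse_event(nodes):
        targets = random.sample(nodes, random.choice([4, 5]))
        return {n: 1.0 - random.uniform(0.4, 0.5) for n in targets}

    event = sample_adverse_event(list(range(100)))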

We illustrate the approach with two simple models: multi-level DAGs and interdependent coupled networks. We assume that the homogeneous nodes and links comprising the network have only two possible states, active and inactive, meaning wi(t), πi(t) ∈ {0, 1} in equation (1). To simplify the explanation, we focus on node failures, though the concept may be extended to include links. If we denote the number of active nodes in the system at time t by A(t) and the total number of nodes in the system by N, then the critical functionality simplifies to

Κ(t) = A(t)/N.  (7)

We first consider a hierarchical multi-level DAG model (Fig. 2) with Λ levels of nodes43,44. We investigate how the redundancy probability pm, the switching probability ps and the recovery time TR (tradeoff parameters at the disposal of the system designer) influence the resilience of a supply-demand multi-level DAG across levels, nodes, links and time; how they affect the absorption and recovery phases of a network’s resilience profile; and how they bear on the optimization of network design for a variety of attack scenarios. We also distinguish between cases where switching is instantaneous and where it is delayed by one time step. Further description of the model is provided in the Methods section (subsection 1).

Figure 2

Network generation and modeling of an adverse event.

The hierarchical area is first defined; links are then established according to a Bernoulli trial with parameter pm. During operation, if a node with one or more redundant links is made inactive, it can switch with probability ps at each time step following the attack. After the repair period of TR steps has elapsed following the attack, the initially destroyed nodes are restored.

Two applications of the DAG model serve to illustrate the quantitative resilience measure introduced, as well as the method for evaluating it: 1) synthetic random hierarchical multi-level supply-demand directed acyclic graphs and 2) the Linux (specifically Ubuntu 12.04) software package network. The first is a useful, if approximate, representation of networks found in many applications36,37,38,45. The second realistically represents an existing and widely used network.

The second model is derived from the model introduced by Buldyrev et al.34 and developed by Parshani et al.46. They consider a system composed of two coupled undirected networks (A and B). A certain fraction of nodes in network A depends on nodes in network B (qA) and vice versa (qB). If node n in network A depends on node m in network B, then node m can only depend on node n (or not depend on nodes in network A at all) (see Methods, subsection 3). Without loss of generality, we consider scenarios where networks A and B have the same node degree distribution. We present results for Erdos-Renyi and scale-free random networks with N = 800000 nodes, average degree <k> = 2.5 and, in the scale-free case, a slope factor of 2.25. Networks are generated following the algorithm presented by Catanzaro et al.47.
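A sketch (ours; sizes reduced for illustration) of one way to generate comparable test networks in Python with networkx, using the configuration model with a structural cutoff and removal of self-loops and parallel edges as a stand-in for the algorithm of Catanzaro et al.47:

    import networkx as nx

    N, k_avg, gamma = 10_000, 2.5, 2.25     # reduced from N = 800000

    # Erdos-Renyi network with the desired average degree.
    er = nx.gnm_random_graph(N, int(k_avg * N / 2))

    # Scale-free network: power-law degree sequence with a structural
    # cutoff of sqrt(N), as in the uncorrelated configuration model.
    cutoff = int(N ** 0.5)
    degrees = [min(int(round(d)), cutoff)
               for d in nx.utils.powerlaw_sequence(N, gamma)]
    if sum(degrees) % 2:                    # degree sum must be even
        degrees[0] += 1
    sf = nx.Graph(nx.configuration_model(degrees))  # collapse multi-edges
    sf.remove_edges_from(nx.selfloop_edges(sf))

Note that the realized average degree of the scale-free graph only approximates the target; a production implementation would follow the reference algorithm exactly.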

We consider a case with a single adverse event that destroys a number of nodes in the network. For simplicity, the adverse event happens at time step t = 0. We shall refer to the result of the adverse event as the initial damage. In the case of the DAG model, we count the number of nodes that become inactive (i.e., are deactivated) in level i between time steps t and t + 1; the values at t = 0 represent the number of nodes made inactive upon the occurrence of the adverse event. In the case of the coupled networks model, we denote the fraction of nodes rendered inactive in network A by pdestr and assume that the adverse event does not affect network B.

Results

Model 1 – Directed acyclic graphs. Synthetic graphs

We consider a network composed of N = 1000 nodes in four levels: N0 = 32, N1 = 87, N2 = 237, N3 = 644. We first look at the special case where switching is instantaneous with probability ps = 1. Assuming the overall damage to each level is small compared to the total number of nodes in the level, the approximation derived in the Methods section (subsection 2) may be used. Figure 3 provides comparisons between the analytical calculations based on that approximation and simulation results averaged over 2000 samples. We note that this case provides insight into the impact of link redundancy on both critical functionality and resilience over the time interval of interest, as we show in the first two scenarios of Fig. 4.

Figure 3

Special case: analytical (solid curve) vs computational (cylinder and extension bars) solution.

Comparison between the analytical solution and the computational experiments in the limiting but insightful special case in which switching is instantaneous when an additional link is available, meaning that ps =1, under three different scenarios. Initial damage numbers for each layer are ordered as follows: . For instance, the initial damage in scenario 1 is: , . This special case reveals the impact on resilience of redundancy levels, as represented by the probability pm. The cylinder represents the 25–75 percentile range.

Figure 4

Resilience profiles for different scenarios in synthetic graphs.

Results are shown for the redundancy probability parameter pm = 0.01. Initial damage numbers for each layer are ordered as follows: . For instance, the initial damage in scenario 1 is: , . The robustness (M) value for each of scenarios 1–4 is the minimum value of its curve: 0.966, 0.787, 0.453 and 0.395, respectively.

Examples of resilience profiles for different cases that vary in their initial damage, switching probability and recovery time are given in Fig. 4. Case 1 is a scenario where only one node in the upper level is initially disabled; this scenario represents, for instance, an accident at a power plant. It follows that the event set E consists of all possible one-node attacks on the upper level. Critical functionality suffers minimally, dropping from 1 to 0.97 at its lowest; its integral, resilience, consequently suffers minimally as well: R = 0.983. By contrast, for a more serious attack, such as in case 2, in which five nodes at every level are disabled (such an attack might represent a large earthquake in a certain area), both critical functionality and resilience suffer. Critical functionality can be as low as 0.8 (a considerably less robust system) for a protracted number of time steps and resilience is reduced to 0.893.

In case 3, 10 nodes are disabled, all from the top level, and the switching probability is reduced to ps = 0.25. The robustness, that is, the critical functionality at its lowest point, is more drastically reduced to 0.4, yielding an overall resilience value of 0.728. Case 4 is similar to case 3, except that the switching is delayed: if node i becomes disabled at time t*, then the first attempt to switch is made at time t* in case 3 and at time t* + 1 in case 4.

The dependency of resilience on the parameters pm and ps is shown in Fig. 5(a), with the recovery time held constant at TR = 0.5TC. The figure shows that the two parameters are compatible and combinable; they can be smoothly traded off to maintain a desired level of resilience, giving the designer the opportunity to select the least costly combination of pm and ps. Additionally, increasing pm and ps simultaneously has an observable additive effect on resilience. Beyond a certain level, however, investment in redundancy yields minimal return. For instance, Fig. 5(a) shows that doubling the probability pm from 0.1 to 0.2 leaves the resilience unchanged for ps > 0.3.

Figure 5

Resilience as a function of design parameters.

(a) Resilience (values shown on the curves, σ ∈ [3.1E-3; 2.0E-2]) as a function of the per-time-step switching probability ps and the redundancy parameter pm, for a four-level hierarchical network where the initial number of destroyed nodes at each level is , respectively, and the recovery time is held constant at TR = 0.5TC; (b) resilience (σ ∈ [5.9E-4; 4.3E-2]) as a function of ps and TR with pm held constant at 0.01 (the color bar indicates the value of the resilience).

In addition, there is strong synergy between pm and ps; increasing both factors together produces a rapid increase in resilience, but increasing only one or the other variable will cause the resilience metric to plateau. This can be observed in Fig. 5(a) by reading the resilience values shown across the phase diagram curves.

Figure 5(b) illustrates that similar tradeoffs are possible between the maximum node recovery time TR and the switching probability ps, with the redundancy parameter pm held constant at 0.01. When the recovery time is relatively short, TR < 0.1TC, resilience values close to 1 may be obtained even for values of ps as small as 0.05. Resilience is strongly affected by the recovery time TR (Fig. 5(b)). This temporal factor determines the characteristics of the recovery phase and has a greater impact on the calculated resilience than does a potential increase in redundancy, particularly when the switching probability ps is low, as Fig. 5(b) demonstrates.

Supplement S2 (Figures S2.1–S2.6) displays additional results for both types of parameter dependencies. Cost and speed of design and implementation can now guide the ultimate choice from among the infinite possibilities of parameter combinations.

Model 1 – Directed acyclic graphs. Linux software network

The Linux software network exemplifies the structure of complex multi-level software systems and is also important in its own right. This software operates in an estimated 95% of all supercomputing systems48 and in the majority of smartphones in use (in the form of the Android operating system). Packages in Linux are linked in a formally defined hierarchy of dependencies between individual software units: a package can only be installed if all required higher-level packages have previously been installed. Some redundancy is possible when multiple packages provide the same functionality. Figure 6 shows a subnetwork of the package network consisting of 117 of the 36,902 nodes in the entire network. The graph data were obtained using the Advanced Packaging Tool49 on a standard installation of the Ubuntu 12.04 system.
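As an illustration of how such dependency data can be gathered (our sketch, not the authors' exact pipeline; the seed package and node budget are arbitrary), the Advanced Packaging Tool's apt-cache command can be queried on a Debian/Ubuntu system and its "Depends:" lines walked breadth-first:

    import subprocess
    from collections import deque

    # Sketch: collect supplier -> demander edges of the package-dependency
    # DAG by walking `apt-cache depends` output from a seed package.
    def dependency_edges(seed, limit=200):
        edges, seen, queue = [], {seed}, deque([seed])
        while queue and len(seen) < limit:
            pkg = queue.popleft()
            out = subprocess.run(["apt-cache", "depends", pkg],
                                 capture_output=True, text=True).stdout
            for line in out.splitlines():
                line = line.strip()
                if line.startswith("Depends:"):
                    dep = line.split(":", 1)[1].strip().strip("<>")
                    edges.append((dep, pkg))   # supplier -> demander
                    if dep not in seen:
                        seen.add(dep)
                        queue.append(dep)
        return edges

    # Example: edges = dependency_edges("apache2")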

Figure 6

Subnetwork of the Linux hierarchical packages network.

Many modern cyber threats exploit vulnerabilities in software packages. Disabling a targeted software package leads to the failure of the many services that depend on it. Even worse, recovery might be protracted as a result of corrupted user data requiring manual repair and cleanup. For example, an attack on the Apache web server might cause it to fail and subsequently send corrupted or maliciously designed data to backend databases50. Consequently, services dependent on Apache would experience data corruption, and if Apache crashed, it would be disabled as well. While the damaged server might be restarted relatively quickly, recovery from such an attack would involve checking the data, causing serious additional delays.

We evaluate the resilience of the Linux package network in the presence of both random and guided attacks. Critical functionality and resilience profiles for guided attacks on several particularly important packages, namely xauth, libstdc++6, libc6 and gcc-4.6-base, are given in Fig. 7(a). In these four cases there are four sets of adverse events E, each containing a single event that destroys a particular node. The level of damage clearly depends on which package is targeted.

Figure 7

Resilience profiles for the Linux network.

(a) Guided attacks and (b) random attacks. It is clear that guided attacks are considerably more damaging. Moreover, not all guided attacks are equally damaging; as shown in (a), attacks on xauth are less damaging than attacks on libstdc++6, while attacks on libc6 and gcc-4.6-base are the most damaging. The robustness (M) values for scenarios 1–4 in panel (a) are 0.982, 0.655, 0.130 and 0.129, respectively, while for the case in panel (b) M = 0.999.

For random attacks (Fig. 7(b)), we consider another set of adverse events, consisting of 36,902 events. In this case, resilience is significantly higher than in the case of guided attacks, owing to the low importance of many of the failed nodes: R = 0.99975 and M = 0.999.

Model 2 – Interdependent networks. Synthetic graphs

We summarize the results for the second model in Fig. 8. Panels (a) and (b) show the time dependency of the critical functionality of a system of two interdependent Erdos-Renyi (ER) networks for two distinct levels of recovery resources, expressed as the number of backup agents (Nb). As Fig. 8 shows, the value of Nb sharply separates two cases. In Fig. 8(a), the system is unable to recover and the critical functionality oscillates between 0 and about 0.5; owing to the random duration of the cascading recovery and failure processes, the amplitude of the oscillations of the mean value of the CF eventually decreases. In Fig. 8(b), the backup supply to 0.4N nodes in network A allows the system to reach a stable state (after removal of the backup supply) and subsequently to recover further.

Figure 8

Representative profiles of the dynamics of K and resilience in networks with N = 800000.

Panels (a,b) show results for ER networks with pdestr = 0.5 and Nb = 0.35N and Nb = 0.4N, respectively. Panels (d,e) show results for SF networks with pdestr = 0.5 and Nb = 0.1N and Nb = 0.62N, respectively. In panels (a,b,d,e) the solid line corresponds to the mean value of Κ over 100 simulations, and the gray area corresponds to the region Κ ± σ(Κ) (where σ is the standard deviation). The plot in panel (d) also shows simulations in which the critical functionality is restored to 1. The success of the restoration algorithm thus depends mostly on the outcome of the cascading failure (rather than on the random selection of the nodes receiving backup supply): when the important hubs remain active after the cascading failure, recovery is more likely. Finally, panels (c,f) display phase diagrams of the dependency of resilience on the parameters Nb and pdestr. In both ER and SF networks the recovery process is stochastic and very sensitive to the available backup supply.

In scale-free (SF) networks (Fig. 8(d,e)), the results for <k> = 2.5 and N = 800000 show a much larger dispersion than those for Erdos-Renyi networks. In particular, some of these networks suffer a much smaller drop in critical functionality in response to the adverse events modeled. This is a consequence of the divergent variance of the degree distribution in SF networks (although in our case the variance is finite due to the finite number of nodes). Another distinct trait of SF networks with a small average degree is that whether the network fully recovers depends strongly on the stochastic nature of the cascading failure process. In particular, Fig. 8(d) shows that the success of recovery is determined by whether the most important hubs are affected during the deactivation process. If those hubs are unaffected, the damage is relatively small; otherwise, the damage causes a large drop in critical functionality and recovery within the control time is not possible.

Finally, panels (c) and (f) in Fig. 8 show phase diagrams of the dependency of resilience (calculated as the integral of the CF over the control time) on the model parameters pdestr and Nb. Notably, the critical functionality drops practically to zero when as few as 20% of the nodes in network A are initially destroyed. Parshani et al.46 demonstrated that if pdestr exceeds 0.2545 the network experiences a first-order transition leading to a state with almost no active nodes. We have reproduced their results for the Erdos-Renyi networks and confirmed that if pdestr is below the threshold of 0.2545 the transition does not occur. However, owing to minor modifications we made to the network generation algorithms, aimed at connecting all the nodes into a single giant component (GC) at the beginning of the process, we observe a decrease in the threshold value of pdestr causing the first-order transition to about 0.15–0.2. After the drop of the critical functionality (due to the cascading failure), the recovery process starts at step TR. The recovery is successful only if Nb is about 0.4N or higher. Finally, if the whole of network A is destroyed by the adverse event (pdestr = 1), then recovery cannot start due to the absence of GC(A). Results for scale-free networks show similar tendencies, although they are notably more dispersed in part of the parameter region. We interpret this as a consequence of the divergence of the standard deviation of the degree distribution in scale-free networks with a slope factor below 3.

Conclusion

We have presented a detailed approach for implementing the National Academy of Sciences definition of resilience as a function of design tradeoff parameters. The approach allows evaluation of resilience across time, not just as a single quantity; designers can thus analyze the effect of parameter choices and design changes on overall network resilience and robustness. Focusing on multi-level directed acyclic graphs and interdependent networks, we have demonstrated how network parameters can be traded off to achieve a desired level of resilience and of other performance measures. Future work will extend the approach to multiplex systems and other real-life networks. An important long-term challenge is to model adaptation, the part of the response cycle that follows restoration and includes all activities that enable the system to better resist similar adverse events in the future.

Methods

Absorption and recovery algorithms in the DAG model

A hierarchical multi-level DAG (Fig. 2) has Λ levels of nodes43,44. Each level comprises Ni nodes (i = 0, …, Λ – 1). The links represent a supply-demand relationship: a link starts at a supplier node and ends at a demander node, so every link in the network is directed. For every level, we identify a set of services that all nodes in that level supply. Then, for every service supplied in the network, we define whether or not it depends on other services, together with a (possibly empty) list of the dependencies. The levels are ordered in such a way that links only go from a higher level i to a lower level j. With this convention, i < j; that is, the higher the level, the smaller the index. Additionally, no links can be formed between nodes in the same level. We also disallow cycles in the network by imposing the following constraint: a node cannot depend on any of the services provided by the other nodes in its level or on any of the services provided in the lower levels. Initially, all dependencies are resolved and every node has one incoming link from each of the upper levels on which it depends for the supply of its services. Furthermore, for every dependent node and each of its required services, we introduce a list of potential suppliers. The probability that a node has a link from each of the potential suppliers is pm. Said another way, a node may have many links supplying a given service, but only one of those links is enabled initially (known as real), while the others, should they exist, are contingent backup links (known as virtual).
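A minimal sketch (ours; the data layout and one reading of the assignment rule are assumptions) of this link-assignment step, in which one real supplier is guaranteed per required service and each remaining potential supplier becomes a virtual backup link via a Bernoulli trial with parameter pm:

    import random

    # Sketch: choose the initially enabled (real) supplier for one service
    # and draw virtual backup links from the remaining potential suppliers.
    def assign_links(potential_suppliers, pm):
        real = random.choice(potential_suppliers)
        virtual = [s for s in potential_suppliers
                   if s != real and random.random() < pm]  # Bernoulli trials
        return real, virtual

    real, virtual = assign_links(list(range(10)), pm=0.01)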

To model an adverse event, we introduce the ability to destroy a node for a time period TR, as was recently done by Majdandzic et al.41. A destroyed node is inactive and therefore unable to supply services until it recovers. Another possible cause of deactivation is an unresolved dependency, that is, the absence of a real link to a node supplying a required service. This can happen when the only supply node available for a service is itself destroyed, or when that supplier's upstream supplier is destroyed. We shall refer to nodes with an unresolved dependency as disabled nodes. Note that a node can be disabled, destroyed, or both at a given time. Once a node becomes inactive, all of the nodes that depend on it through real links are subject to deactivation unless they have other real links providing all of their required services (Fig. 2).

We assume that a node is eligible to switch links, that is, to turn a virtual (contingent) link into a real one, if virtual links exist for all of the node's unresolved dependencies and the node is not destroyed. At every time step during which the node is both disabled and eligible to switch, switching happens with probability ps. Switching can be either instant (the first attempt to switch is made at the same time step at which the node becomes disabled) or delayed (a node with an unresolved dependency remains disabled for at least one time step).
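A sketch (ours; the node record is a hypothetical data structure) of the per-time-step update implied by these rules, for the instant-switching variant:

    import random

    # Sketch: one time step for a single node in the DAG model. Destroyed
    # nodes wait out the repair period T_R; a disabled node that has active
    # backups for all unresolved dependencies switches with probability ps.
    def step_node(node, t, ps, TR):
        if node["destroyed_at"] is not None:       # still under repair
            if t - node["destroyed_at"] >= TR:
                node["destroyed_at"] = None        # node is rebuilt
            return
        unresolved = [s for s, sup in node["real"].items()
                      if sup is None or not sup["active"]]
        backups = {s: [v for v in node["virtual"][s] if v["active"]]
                   for s in unresolved}
        eligible = unresolved and all(backups[s] for s in unresolved)
        if eligible and random.random() < ps:      # attempt to switch
            for s in unresolved:
                node["real"][s] = backups[s][0]    # virtual becomes real

    supplier, backup = {"active": False}, {"active": True}
    node = {"destroyed_at": None,
            "real": {"power": supplier},
            "virtual": {"power": [backup]}}
    step_node(node, t=1, ps=1.0, TR=3)             # switches to the backup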

Analytical approximation for the special case of the DAG model

In this section we derive equations describing the number of active nodes in the special case where switching is instantaneous with probability ps = 1 and the initial damage is small compared to the total number of nodes in the network.

Let us denote by Λ the number of levels in the network and by Ni the number of nodes in level i (i = 0, …, Λ – 1). We can find the probability that a node in level i has only one service provider in level j as follows:

For the case where the number of deactivated nodes at each time step is sufficiently small, or where pm = 0, we may assume that only nodes with a single link for a relevant service are disabled as a result of the deactivation of their supplier (thus neglecting cases in which a node has more than one supplier of a service and all of those suppliers are deactivated).

The average number of active nodes in level i at time step t is given by the formula:

These nodes have suppliers in each level j < i, some of which may have been disabled between steps t − 1 and t. The probability for a node in level i to become disabled between time steps t and t + 1 is therefore:

We approximate the distribution of 1/ to be linear in , although the dependency itself is not linear:

where (under the assumption that the overall damage is small enough):

Considering the Taylor expansion of 1/(1 – x), we have for small values of x: 1/(1 – x) ≈ 1 + x.

Then, on average between steps t and t + 1, we disable the following number of nodes in level i:

After the recovery period of TR time steps has elapsed, the initially destroyed nodes are rebuilt and become active unless they still lack sufficient supplies from the upper levels. Thus, assuming that TR > Λ,  = 0 for i ∈ {0, …, Λ – 1} and t ∈ {Λ, …, TR – 1}.

The total number of nodes restored at step TR in level i is given by the expression:

Here, represents the total number of nodes disabled in level i at time step t due to the fact that they do not have sufficient supplies from the upper levels:

And for the next steps, the formula is as follows:

Using the formulae above, we may evaluate the average approximated resilience profiles and find the values of resilience.

Absorption and recovery algorithms in the coupled networks model

The failure propagation algorithm follows the original model of Buldyrev et al.34. The initial damage deactivates a certain fraction of the nodes in network A, fracturing it into clusters; nodes that do not belong to the largest cluster of network A are also considered deactivated. All nodes in network B that depend on deactivated nodes in network A are then deactivated, which fractures network B, and nodes outside its largest cluster are likewise considered deactivated. In the next step, nodes in network A that depend on deactivated nodes in network B are deactivated, and the process propagates in the same fashion until there are no more nodes to deactivate in either network.

Recovery is accomplished by backup supply agents replacing the unresolved dependencies of nodes in the first network (A). The number of these agents is denoted Nb, and each backup agent can serve only one node at a time. The nodes to receive backup supply are chosen randomly from those nodes in network A that depend on a currently inactive node in network B. Thus, backup is provided either to all nodes in network A with an unresolved dependency (in which case full recovery is guaranteed) or to Nb nodes only. If a node has backup supply and is connected to its network’s GC, it is activated; once activated, it is included in the network’s GC. This causes gradual growth of the giant component of network A. Subsequently, the nodes in network B that depend on activated nodes in network A and are connected to the GC of network B are also activated, and the process propagates in a similar fashion. Once the process is complete, the recovery phase finishes and the backup supply is removed, meaning that all nodes whose supplier in network B is not active (after the recovery phase) are deactivated. This leads to a cascading failure that propagates as described above. Once the failure phase finishes, the recovery phase is repeated until full network recovery is established or the control time is reached.
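A sketch (ours, using networkx; the node sets and dependency maps are hypothetical inputs) of the cascading-failure half of this process; the recovery half would re-grow the giant components in the mirrored way described above:

    import networkx as nx

    # Sketch: mutual-percolation cascade in two coupled networks.
    # dep_ab[n] is the supplier in B of node n in A (None if independent);
    # dep_ba is the reverse map; alive_a/alive_b are surviving node sets.
    def giant_component(g, alive):
        comps = list(nx.connected_components(g.subgraph(alive)))
        return max(comps, key=len) if comps else set()

    def cascade(A, B, dep_ab, dep_ba, alive_a, alive_b):
        while True:
            alive_a = giant_component(A, alive_a)
            new_b = {n for n in alive_b
                     if dep_ba.get(n) is None or dep_ba[n] in alive_a}
            new_b = giant_component(B, new_b)
            new_a = {n for n in alive_a
                     if dep_ab.get(n) is None or dep_ab[n] in new_b}
            if new_a == alive_a and new_b == alive_b:
                return alive_a, alive_b            # fixed point reached
            alive_a, alive_b = new_a, new_b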

Let us consider a simple two-network system (Fig. 9). At the beginning of the simulation (time 0), two nodes (A1 and A3) are designated as initially destroyed. The cascading process finishes before the recovery time TR (that is, the time to repair a node exceeds the duration of the cascading failure), so at time TR the network is in the state it reached after the cascading failure. After TR steps have passed, the recovery process starts. The case Nb = 0 is shown in panel (a) of Fig. 9. Here the only recoverable node is A3; after its recovery, node B3 is also recovered, but further recovery is not possible. Nodes A1, A2, B1 and B2 cannot be recovered because they have an unresolved dependency. Moreover, even if nodes A1 and B1 were independent, they would still be unrecoverable because they are not connected to their respective networks’ GCs.

Figure 9

Recovery process in coupled networks.

At time 0, nodes A1 and A3 are deactivated for TR steps. The failure process finishes before TR, so at time TR the network is in its post-cascade state. Panel (a) illustrates the case Nb = 0: only the independent node A3 can be recovered, as it is connected to the GC of network A. Once node A3 is activated, its dependent node B3 is also activated. Panels (b,c) illustrate the stochastic nature of the recovery process; in these cases Nb = 1. At time TR + 1, either node A1 (b) or node A2 (c) may receive backup supply. In case (b), the recovery phase does not start; the backup is then removed and a cascading failure occurs. At the next time step, when backup is applied again to a randomly selected node, a recovery cascade is possible if the node chosen is A2. This case is the same as case (c), with time TR + 1 (in case (c)) corresponding to time TR + 3 (in case (b)).

Now consider the case Nb = 1. In this case two scenarios are possible:

  1. The node chosen for backup supply is A1 (Fig. 9(b)). Then no recovery can happen, as this node is not connected to the GC of network A (GC(A)), and the recovery phase ends in 0 steps;

  2. The node chosen for backup supply is A2 (Fig. 9(c)). Then this node recovers and node B2 recovers in turn. During the second step of the recovery phase, node A1 recovers and node B1 recovers in turn.

Additional Information

How to cite this article: Ganin, A. A. et al. Operational resilience: concepts, design and analysis. Sci. Rep. 6, 19540; doi: 10.1038/srep19540 (2016).