Abstract
Building resilience into today’s complex infrastructures is critical to the daily functioning of society and its ability to withstand and recover from natural disasters, epidemics and cyberthreats. This study proposes quantitative measures that capture and implement the definition of engineering resilience advanced by the National Academy of Sciences. The approach is applicable across physical, information and social domains. It evaluates the critical functionality, defined as a performance function of time set by the stakeholders. Critical functionality is a source of valuable information, such as the integrated system resilience over a time interval and its robustness. The paper demonstrates the formulation on two classes of models: 1) multilevel directed acyclic graphs and 2) interdependent coupled networks. For both models synthetic case studies are used to explore trends. For the first class, the approach is also applied to the Linux operating system. Results indicate that desired resilience and robustness levels are achievable by trading off different design parameters, such as redundancy, node recovery time and backup supply available. The nonlinear relationship between network parameters and resilience levels confirms the utility of the proposed approach, which is of benefit to analysts and designers of complex systems and networks.
Introduction
The daily functioning of modern society is necessarily challenging, and traditional risk-based approaches to managing critical infrastructure are often criticized for their inability to address largely unknown and uncertain threats^{1,2,3}. Risk-based approaches require developing threat scenarios, evaluating system vulnerabilities and quantifying the consequences associated with specific failures of system components. When the threat space is unknown, developing realistic scenarios is an additional challenge. Moreover, it may be difficult to justify investing in hardening system components on the basis of hypothetical and uncertain threats^{4}. The weaknesses and potentially misleading nature of risk quantification approaches, for example in cyber systems, have been pointed out by a number of researchers^{5,6}.
Building resilience into infrastructure networks^{7} has been proposed as the key to protecting against the deleterious effect of system disruption due to natural disasters^{8,9} as well as infrastructure and engineering systems’ failures^{10,11,12,13}. Not surprisingly, numerous interpretations of resilience have sprouted, testifying to the richness of the concept but also presenting challenges for its measurement and application^{1}. Yet recent publications and guidance documents^{14,15} coalesce around the definition of resilience provided by the National Academy of Sciences (NAS)^{16}: Resilience of a system is its ability “to plan and prepare for, absorb, respond to and recover from disasters and adapt to new conditions”. An important feature of resilience captured in this definition is the temporal dimension: the ability to recover and retain critical system functionality in response to a wide range of threats, both known and unknown. The assessment of resilience should therefore identify the critical functionality of a system and evaluate the temporal profile of system recovery in response to adverse events. Resilience management should comparatively evaluate cross-domain alternatives designed to enhance the system’s ability to (i) plan for adverse events, (ii) absorb stress, (iii) recover and (iv) predict and prepare for future stressors in order to adapt to their potential threats.
Even though definitions of resilience as a system property are commonly reported in the literature (see details in Supplement S1, table S1.1), resilience assessment has been implemented in structured but largely qualitative or semi-quantitative ways^{17}. Insightful studies quantify resilience with metrics associated with different domains and subsequently integrate them into a risk-based evaluation or resilience index. For example, Bruneau et al.^{18} identified four dimensions of resilience of interest to the seismic community: technical, organizational, social and economic. Measures of resilience (robustness, rapidity, resourcefulness and redundancy) were then aggregated in order to minimize a function of the probability of system failures, the consequences arising from such failures and recovery time^{19}. Another qualitative but quantifiable approach sets forth a taxonomy for metrics that accommodates both change and interaction among physical, informational and human domains^{20}. The approach applies the taxonomy to cyber threats, energy systems and coastal infrastructure^{9,20,21}. Such work provides insight and guidance for developing quantitative resilience measures that correspond to the qualitative identification of systemic issues and gaps. Unfortunately, it provides only limited insight into the management and control of the interconnected networks that constitute the entire system. Simultaneously, the field of network science has focused on the challenge of understanding the structure, dynamics and vulnerability of multilayer networks^{22,23,24,25,26}.
This paper proposes a methodology for quantifying a system’s resilience that captures the very concept of engineering resilience advanced by the NAS^{27,28} stated above. We make use of the critical functionality (CF) (which has been referred to before as functionality function^{8}, performance^{29,30}, quality^{18,31,32}), defined as a metric of system performance set by the stakeholders, to derive an integrated measure of resilience. One example for CF, among many possible ones, is the percentage of nodes that are functioning. Another is the ratio of a network’s actual flow to its maximum capacity.
We note that, in addition to resilience, CF is rich in valuable information and can be the source of many quantitative performance metrics, such as robustness, which we also briefly discuss. As application domains, we focus in this paper on the following two classes of models: i) multilevel directed acyclic graphs (DAG)^{33} and ii) interdependent coupled networks^{34}. While the second class of networks is the subject of intense interest^{35}, the first class of networks, often overlooked by analysts, is of interest in many fields, from biology (gene regulatory networks) to computer science (software dependencies)^{36,37,38}. Infrastructure systems, to a certain extent, can also be represented by DAGs. For instance, in food supply chains the top level nodes may model the original manufacturers, while the bottom level nodes correspond to the final demand. Other infrastructure examples include certain sequential electrical circuits for combinational logic^{39}, data processing graphs^{40}, etc. In this paper we approximate the software package system of Linux, specifically Ubuntu 12.04, as a DAG and estimate its resilience. We selected these examples to test and illustrate the approach because they are realistic as well as easily characterized, especially given the availability of data for Ubuntu. The approach is equally applicable to more general networks that are characterized in flexible ways while still being amenable to such resilience analysis methodology.
Because obtaining analytical results for most realistic cases is an intractable task, even for homogeneous networks, our approach is simulation based. We do, however, obtain analytical results in the Methods section for a simple yet illuminating special case of the first model, where only nodes without redundant active supply links may be unable to supply service. This case sheds light on the relationship between redundancy and system resilience.
Resilience: An Analytical Definition
A network is modeled as a graph G(N, L) with a set of nodes N connected by links L. Before considering network models, the proposed concept of resilience for complex networks is described generally. The specification of N and L includes characteristics relevant to resilience, such as capacity, location and weight of each node and link. Let C be the set of temporal decision rules and strategies to be developed in order to improve the resilience of the system during its operation. From a computational viewpoint, the parameters and algorithms defined by C depend on the particular model being implemented.
Ultimately, the system must maintain its critical functionality Κ at each time step t, where Κ maps its states or parameters to a real value between 0 and 1. This mapping may, for instance, be linear:
$$K(t; N, L, C) = \sum_{i \in \{N, L\}} w_i(t; C)\, \pi_i(t; C), \qquad (1)$$
where {N, L} is the set of all nodes and links, w_{i} (t; C) ∈ [0, 1] is a measure of the relative importance of node or link i at time t and π_{i} (t; C) ∈ [0, 1] is the degree to which a node is still active in the presence of an attack. An alternate interpretation defines π_{i} (t; C) as the probability that node or link i is fully functional. More complex, nonlinear and detailed definitions of critical functionality mappings are also possible. Finally, we introduce the class of adverse events (or potential attacks on targeted nodes) E. For instance, in the case of a random attack on two nodes, E is the set of all attacks on all possible node pairs.
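As a minimal sketch of the linear mapping described above, the weighted sum can be computed directly. The function name `critical_functionality` and the normalization by the sum of the weights (used here to keep the value in [0, 1]) are assumptions of this illustration, not part of the paper.

```python
from typing import Sequence


def critical_functionality(weights: Sequence[float], activity: Sequence[float]) -> float:
    """Linear critical functionality: weighted sum of per-element activity.

    weights  -- relative importance w_i of each node or link
    activity -- degree pi_i in [0, 1] to which each element is still active
    """
    if len(weights) != len(activity):
        raise ValueError("weights and activity must have the same length")
    if any(not 0.0 <= p <= 1.0 for p in activity):
        raise ValueError("activity values must lie in [0, 1]")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # Assumption of this sketch: dividing by the weight sum keeps the result in [0, 1].
    return sum(w * p for w, p in zip(weights, activity)) / total


# Four equally important nodes, one of them knocked out.
k = critical_functionality([0.25, 0.25, 0.25, 0.25], [1, 1, 0, 1])
print(k)
```

With binary activity values and equal weights, the result is simply the fraction of elements that remain active.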
Resilience, denoted by R, is a composite function of the network topological properties and their temporal evolution parameters defined for a certain critical functionality and a class of adverse events E:
$$R = f(N, L, C, E). \qquad (2)$$
Note that not all targeted nodes are necessarily afflicted. For a nonafflicted attacked node, we thus have π_{i} = 1 over the entire time interval of interest. We evaluate R over a certain time interval [0, T_{C}], where T_{C} is the control time^{41}, which can be set a priori, for instance, by stakeholders or estimated as the mean time between adverse events. In continuous time, we define R as
$$R = \frac{1}{|E|} \sum_{E} \frac{\int_{0}^{T_C} K(t; N, L, C)\, dt}{\int_{0}^{T_C} K^{nominal}(t)\, dt}, \qquad (3)$$
where |E| is the cardinality of set E and Κ^{nominal}(t) is the critical functionality of the system in the case where no external events occur (Fig. 1). Equation (3) allows evaluation of the normalized dynamical performance of the system before (plan/prepare), during (absorption) and after an attack (recovery and adaptation); it intends to capture the definition advanced by the NAS given in the introduction. For computational convenience, the above equation is given in discrete time by
$$R = \frac{1}{|E|} \sum_{E} \frac{\sum_{t=0}^{T_C} K(t; N, L, C)}{\sum_{t=0}^{T_C} K^{nominal}(t)}. \qquad (4)$$
In most cases, we normalize to Κ^{nominal}(t) = 1. Consequently, a normalized measure of resilience may be given by
$$R = \frac{1}{|E|\,(T_C + 1)} \sum_{E} \sum_{t=0}^{T_C} K(t; N, L, C). \qquad (5)$$
We note that equation (1) embraces a large class of performance measures found in the literature. For instance, in addition to equation (5), we can also consider the measure
$$M = \min_{t \in [0, T_C]} K(t; N, L, C). \qquad (6)$$
The above measure is referred to as the Robustness^{8,42}. Alternatively, one may define (1 – M) as the Risk^{1}.
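Both measures can be read off simulated critical functionality profiles. The sketch below assumes a nominal critical functionality of 1, profiles of equal length sampled at the same discrete time steps, and one profile per adverse event. Taking the robustness as the minimum of the event-averaged profile is an assumption about the order of averaging, and the function names are illustrative.

```python
from typing import Sequence


def resilience(profiles: Sequence[Sequence[float]]) -> float:
    """Average of the critical functionality over time steps and over events."""
    per_event = [sum(profile) / len(profile) for profile in profiles]
    return sum(per_event) / len(per_event)


def robustness(profiles: Sequence[Sequence[float]]) -> float:
    """Lowest value reached by the event-averaged critical functionality."""
    steps = len(profiles[0])
    mean_profile = [sum(p[t] for p in profiles) / len(profiles) for t in range(steps)]
    return min(mean_profile)


# Two hypothetical events, four time steps each.
profiles = [
    [1.0, 0.5, 0.5, 1.0],
    [1.0, 1.0, 0.5, 1.0],
]
print(resilience(profiles), robustness(profiles))
```

The two numbers answer different questions: the first summarizes the whole profile over the control time, the second only its worst moment.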
Due to the very complex nature of networked systems and the large number of variables defining their states, it is not possible to consider all events in the set E and obtain a closed-form expression for R, even if all design parameters are made homogeneous across nodes, links and time. We therefore rely on a simulation-based approach. Each simulation represents a possible scenario of the networked system’s evolution. For each simulation, we calculate the average value of the critical functionality Κ(t, N, L, C) at every time step (equation 1) and from there, the resilience (equation 4 or equation 5) over the interval of interest.
The approach proposed builds upon and extends the works of others^{8,18,29,30,32}. The main issue encountered when dealing with the estimation of resilience based on the simulation of the system performance curves is that those curves in the general case vary depending on the adverse events modeled. The current approach to resolve this issue is to extend the techniques of probabilistic risk analysis to resilience analysis. This extension provides the weighted average performance curve with weights representing the probabilities of the adverse events.
By contrast, we argue that the resilience of a system should not be tied to the probabilities that the adverse events occur. Again, according to the NAS, resilience is the ability to plan, absorb, recover and adapt^{16}. Inspired by this definition, we instead simulate the damage to the system from a certain adverse event (regardless of the probability that the event occurs) and define resilience for that particular damage. For simplicity of decision making, however, we suggest considering a certain class of adverse events. For instance, in a networked system we might define one class of adverse events as the case in which the functionality of 4 to 5 nodes is reduced by 40–50% (instead of defining a particular adverse event that reduces the functionality of 4 specific nodes by 50%).
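A class of adverse events of this kind can be sampled rather than enumerated. The following sketch draws events in which 4 to 5 randomly chosen nodes lose 40–50% of their functionality. The uniform sampling of targets and of the loss, as well as the name `sample_event`, are assumptions made for illustration.

```python
import random


def sample_event(num_nodes, rng, min_hit=4, max_hit=5, min_loss=0.40, max_loss=0.50):
    """Draw one adverse event as a dict: node index -> remaining functionality pi_i."""
    hit_count = rng.randint(min_hit, max_hit)          # number of afflicted nodes
    targets = rng.sample(range(num_nodes), hit_count)  # distinct nodes, chosen uniformly
    return {node: 1.0 - rng.uniform(min_loss, max_loss) for node in targets}


rng = random.Random(0)  # fixed seed so the sampled class is reproducible
events = [sample_event(100, rng) for _ in range(1000)]
print(len(events), sorted(events[0]))
```

Each sampled event can then be fed to a simulation, and the resulting profiles averaged with equal weight, which is what decouples the resilience value from event probabilities.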
We illustrate the approach with two simple models: multilevel DAGs and interdependent coupled networks. We assume that homogeneous nodes and links comprising the network have only two possible states: active and inactive, meaning w_{i} (t), π_{i} (t) ∈ {0, 1} in equation (1). To simplify the explanation, we focus on node failures, though the concept may be extended to include links. If we denote the number of active nodes in the system at time t by A(t) and the total number of nodes in the system by N, then the critical functionality simplifies to
$$K(t) = \frac{A(t)}{N}. \qquad (7)$$
We first consider a hierarchical multilevel DAG model (Fig. 2) with Λ levels of nodes^{43,44}. The redundancy probability p_{m}, the switching probability p_{s} and the recovery time T_{R} are trade-off parameters at the disposal of the system designer. We investigate how they influence the resilience of a supply-demand multilevel DAG across levels, nodes, links and time, how they affect the absorption and recovery phases of a network’s resilience profile and how they bear on the optimization of network design for a variety of attack scenarios. We also distinguish between cases where switching is instantaneous and cases where it is delayed by one time step. Further description of the model is provided in the Methods section (subsection 1).
Two applications of the DAG model will serve to illustrate the quantitative resilience measure introduced, as well as the method for evaluating it: 1) synthetic random hierarchical multilevel supply-demand directed acyclic graphs and 2) the software network of the Linux system, specifically Ubuntu 12.04. The first is a useful, if approximate, representation of networks found in many applications^{36,37,38,45}. The second realistically represents an existing and widely used network.
The second model is derived from the model introduced by Buldyrev et al.^{34} and developed by Parshani et al.^{46}. They consider a system comprised of two coupled undirected networks (A and B). A certain fraction of nodes in network A depends on nodes in network B (q_{A}) and vice versa (q_{B}). If node n in network A depends on a node m in network B, then node m can only depend on node n (or not depend on nodes in network A at all) (see Methods, subsection 3). Without loss of generality we consider scenarios where networks A and B have the same node degree distribution. We present results for Erdős–Rényi and scale-free random networks with 800,000 nodes (N), an average degree (<k>) of 2.5 and a slope factor (in the scale-free case) of 2.25. Networks are generated following the algorithm presented by Catanzaro et al.^{47}.
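For orientation, a small Erdős–Rényi network with a prescribed average degree can be generated as below. This is the textbook G(n, p) construction with p = <k>/(n − 1), not the algorithm of Catanzaro et al. used in the study, and its quadratic cost makes it suitable only for sizes far below the 800,000 nodes considered here.

```python
import random


def erdos_renyi(n, mean_degree, seed=0):
    """Return an undirected G(n, p) graph as an adjacency dict of sets."""
    rng = random.Random(seed)
    p = mean_degree / (n - 1)  # each of the n - 1 possible neighbours is linked with probability p
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


adj = erdos_renyi(200, 2.5)
print(sum(len(neighbours) for neighbours in adj.values()) / len(adj))
```

At an average degree of 2.5 such a graph is typically not fully connected, which is relevant to the giant-component bookkeeping used in the coupled model.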
We consider a case with a single adverse event that destroys a number of nodes in the network. For simplicity, the adverse event happens at time step t = 0. We shall refer to the result of the adverse event as the initial damage. In the DAG model, we track the number of nodes that become inactive (i.e., are deactivated) in level i between time steps t and t + 1; its values at t = 0 represent the number of nodes made inactive upon the occurrence of the adverse event. In the coupled networks model, we denote the fraction of nodes rendered inactive in network A as p_{destr}, with the assumption that the adverse event does not affect network B.
Results
Model 1 – Directed acyclic graphs. Synthetic graphs
We consider a network composed of N = 1000 nodes in four levels: N_{0} = 32, N_{1} = 87, N_{2} = 237, N_{3} = 644. We first look at the special case where the switching is instantaneous with probability p_{s} = 1. Assuming the overall damage to each level is small compared to the total number of nodes in the level, the approximation derived in the Methods section (subsection 2) may be used. Figure 3 provides comparisons between the analytical calculations based on that approximation for this case and simulation results averaged over 2000 samples. We note that this case provides insight into the impact of link redundancy on both critical functionality and resilience over the time interval of interest, as we show in the first two scenarios of Fig. 4.
Examples of resilience profiles for different cases that vary in their initial damage, switching probability and recovery time are given in Fig. 4. Case 1 is a scenario where only one node in the upper level is initially disabled. This scenario represents, for instance, an accident at a power plant. It follows that the event set E consists of all possible one-node attacks in the upper level. Critical functionality suffers minimally; it drops from 1 to 0.97 at its lowest. Its integral, resilience, consequently suffers minimally as well: R = 0.983. By contrast, for a more serious attack, such as in case 2, in which five nodes at every level are disabled (such an attack might represent a large earthquake in a certain area), both critical functionality and resilience suffer. Critical functionality can be as low as 0.8 (a considerably less robust system) for a protracted number of time steps and resilience is reduced to 0.893.
For case 3, 10 nodes are disabled, all from the top level and the switching probability is reduced to p_{s} =0.25. The robustness, or the critical functionality at its lowest, is more drastically reduced to 0.4, yielding an overall resilience value of 0.728. Case 4 is similar to case 3, except that the switching is delayed, i.e., if node i has become disabled at time t*, then the first attempt to switch is made at time t* for case 3 and at time t* + 1 for case 4.
The dependency of resilience on parameters p_{m} and p_{s} is given in Fig. 5(a) with the recovery time held constant at T_{R} = 0.5T_{C}. The figure shows that the two parameters are compatible and combinable; they can be smoothly traded off to maintain a desired level of resilience. The designer here has the opportunity to select the combination of p_{m} and p_{s} that is least costly. Additionally, increasing p_{m} and p_{s} simultaneously has an observable additive effect on resilience. Beyond a certain level, however, investment in redundancy yields minimal return. For instance, Fig. 5(a) shows that doubling the probability p_{m} from 0.1 to 0.2 leaves the resilience unchanged for p_{s} > 0.3.
In addition, there is strong synergy between p_{m} and p_{s}; increasing both factors together produces a rapid increase in resilience, but increasing only one or the other variable will cause the resilience metric to plateau. This can be observed in Fig. 5(a) by reading the resilience values shown across the phase diagram curves.
Figure 5(b) illustrates that similar trade-offs are possible between the maximum node recovery time T_{R} and the switching probability p_{s}. The redundancy parameter p_{m} is held constant at 0.01. When the recovery time is relatively short, T_{R} < 0.1T_{C}, resilience values close to 1 may be obtained even for values of p_{s} as small as 0.05. Resilience is strongly affected by the recovery time, T_{R} (Fig. 5(b)). This temporal factor determines the characteristics of the recovery phase and has a greater impact on the calculated resilience than does the potential increase in redundancy. This is particularly true when the switching probability p_{s} is low, as Fig. 5(b) demonstrates.
Supplement S2 (Figures S2.1–S2.6) displays additional results for both types of parameter dependencies. Cost and speed of design and implementation can now guide the ultimate choice from among the infinite possibilities of parameter combinations.
Model 1 – Directed acyclic graphs. Linux software network
The Linux software network exemplifies the structure of complex multilevel software systems and is also important in its own right. This software operates in an estimated 95% of all supercomputing systems^{48} and the majority of the smartphones in use (in the form of the Android operating system). Packages in Linux are linked in a formally defined hierarchy of dependencies between individual software units. In this hierarchy, a package can only be installed if all required higher level packages have previously been installed. Some redundancy is possible when multiple packages provide the same functionality. Figure 6 shows a subnetwork of the packages network consisting of 117 nodes out of 36,902 possible nodes in the entire network. The graph data were obtained using Advanced Packaging Tool^{49} on a standard installation of the Ubuntu 12.04 system.
Many modern cyber threats exploit vulnerabilities in software packages. Disabling a targeted software package leads to the failure of many services that are dependent on it. Even worse, the recovery might be protracted as a result of corrupted user data, thus requiring manual repair and cleanup. For example, an attack on the Apache web server might cause it to fail and subsequently send corrupted or maliciously designed data to backend databases^{50}. Consequently, services dependent on Apache would experience data corruption and, if Apache crashes, they would be disabled as well. While the damaged server might be restarted relatively quickly, recovery from such an attack would involve checking the data, causing serious additional delays.
We evaluate the resilience of the Linux packages network in the presence of both random and guided attacks. Critical functionality and resilience profiles for guided attacks on several particularly important packages are given in Fig. 7(a). These packages are: xauth, libstdc++6, libc6 and gcc-4.6-base. Notably, these four cases correspond to four sets of adverse events E, each containing only one event, namely the successful destruction of a particular node. The level of damage depends on which package is targeted.
For random attacks (Fig. 7(b)), we consider another set of adverse events, consisting of 36,902 events. In this case, resilience is significantly higher than in the case of guided attacks because many of the nodes that fail in these attacks are of low importance, yielding R = 0.99975 and M = 0.999.
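The difference between guided and random attacks can be illustrated on a toy dependency graph. In the sketch below, a package keeps working only if every one of its requirement groups still has a working provider; a group with several alternatives models packages that provide the same functionality. The package names and the graph are hypothetical and are not taken from the Ubuntu data.

```python
def failed_packages(requirements, attacked):
    """Return the set of packages that fail once the `attacked` ones are disabled.

    requirements maps package -> list of requirement groups; each group is a
    non-empty set of alternative providers for one required functionality.
    """
    failed = set(attacked)
    changed = True
    while changed:  # repeat until no further package fails
        changed = False
        for package, groups in requirements.items():
            if package in failed:
                continue
            # The package fails if some requirement has lost all of its providers.
            if any(group <= failed for group in groups):
                failed.add(package)
                changed = True
    return failed


# Hypothetical six-package system; "mta-1" and "mta-2" are interchangeable providers.
requirements = {
    "base": [],
    "libA": [{"base"}],
    "mta-1": [{"libA"}],
    "mta-2": [{"base"}],
    "app": [{"libA"}, {"mta-1", "mta-2"}],
    "tool": [{"mta-1", "mta-2"}],
}
for target in ("base", "libA", "mta-1"):
    down = failed_packages(requirements, {target})
    print(target, 1 - len(down) / len(requirements))
```

Attacking a widely required package disables most of the toy system, whereas attacking a package with an alternative provider leaves everything else running, which mirrors the contrast reported for guided and random attacks.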
Model 2 – Interdependent networks. Synthetic graphs
We summarize the results for the second model in Fig. 8. Panels (a) and (b) show the critical functionality of a system of two interdependent Erdős–Rényi (ER) networks as a function of time for two distinct levels of available recovery resources, expressed as the number of backup agents (N_{b}). As Fig. 8 shows, the value of N_{b} sharply separates two cases. In Fig. 8(a) the system is unable to recover and the critical functionality oscillates between 0 and about 0.5; owing to the random duration of the cascading recovery and failure processes, the amplitude of the oscillations of the mean value of CF eventually decreases. In Fig. 8(b) the backup supply to 0.4 N nodes in network A allows the system to reach a stable state (after removal of the backup supply) and to continue recovering.
In scale-free (SF) networks (Fig. 8(d,e)) the results for <k> = 2.5 and N = 800,000 show a much larger dispersion than in Erdős–Rényi networks. In particular, some of those networks suffer a much smaller drop in critical functionality in response to the adverse events modeled. This is a consequence of the infinite dispersion of the degree distribution in SF networks (though in our case the dispersion is finite due to the finite number of nodes). Another distinctive trait of SF networks with a small average degree is that whether the network fully recovers depends strongly on the stochastic nature of the cascading failure process. In particular, Fig. 8(d) shows that the success of recovery is determined by whether the most important hubs were affected during the deactivation process. If those hubs are not affected, the damage is relatively small; otherwise, the damage causes a large drop in critical functionality and recovery within the control time is not possible.
Finally, panels (c) and (f) in Fig. 8 show the phase diagrams of the dependency of the resilience (calculated as the integral of the CF over the control time) on the model parameters p_{destr} and N_{b}. Notably, the critical functionality practically drops to zero when only 20% of the nodes in network A are initially destroyed (p_{destr} = 0.2). Parshani et al.^{46} demonstrated that if p_{destr} is more than 0.2545 the network experiences a first-order transition leading to a state with almost no active nodes. We have reproduced their results for the Erdős–Rényi networks and confirmed that if p_{destr} is less than the threshold of 0.2545 the transition does not occur. However, owing to minor modifications we made to the network generation algorithms, aimed at connecting all the nodes in a single giant component (GC) at the beginning of the process, we observe a decrease in the threshold value of p_{destr} that causes the first-order transition, to about 0.15–0.2. After the critical functionality drops (due to the cascading failure), the recovery process starts at step T_{R}. The recovery is successful only if N_{b} is about 0.4 N or higher. Finally, if the whole of network A is destroyed as a result of the adverse event (p_{destr} = 1), the recovery cannot start due to the absence of the GC(A). Results for scale-free networks show similar tendencies, although they are notably much more dispersed in part of the parameter region. We interpret this as a consequence of the divergence of the standard deviation of the degree distribution in scale-free networks whose slope factor is below 3, as is the case here (2.25).
Conclusion
We have presented a detailed approach for implementing the National Academy of Sciences definition of resilience as a function of design trade-off parameters, as illustrated in the study with multilevel directed acyclic graphs and interdependent networks. The approach allows evaluation of resilience across time and not just as a single quantity. Designers can thus analyze the effect of parameter choice and design emendations on overall network resilience and robustness. Focusing on multilevel directed acyclic graphs and interdependent networks, we have demonstrated how network parameters can be traded off to obtain desired levels of resilience and of other performance measures. Future work will extend to multiplex systems and other real-life networks. An important long-term challenge is to model adaptation, which is part of the response cycle that follows restoration and includes all activities that enable the system to better resist similar adverse events in the future.
Methods
Absorption and recovery algorithms in the DAG model
A hierarchical multilevel DAG (Fig. 2) has Λ levels of nodes^{43,44}. Each layer is comprised of N_{i} nodes (i =0, …, Λ – 1). The links represent a supply–demand relationship. A link starts at a supplier node and ends at a demander node. Thus, every link in the network is directed. For every level, we identify a set of services that all nodes in a particular level supply. Then, for every service supplied in the network, we define whether or not it depends on other services and a (possibly empty) list of the dependencies. The levels are ordered in such a way that links only go from a higher level i to a lower level j. With this convention, i < j, or, the higher the level, the smaller the index. Additionally, no links can be formed between nodes in the same level. We also disallow cycles in the network by imposing the following constraint: a node cannot depend on any of the services provided by any of the other nodes in its level or on any of the services provided in any of the lower levels. Initially, all dependencies are resolved and every node has one incoming link from one or more upper levels on which it depends for the supply of its services. Furthermore, for every dependent node and each of its required services, we introduce a list of potential suppliers. The probability that a node has a link from each of the potential suppliers is p_{m}. Said another way, a node has many links supplying a given service but only one of those links is enabled initially (known as real), while the others are contingent backup links (known as virtual), should they exist.
To model an adverse event, we introduce an ability to destruct a node for a time period T_{R}, as was recently done by Majdandzic et al.^{41}. A destructed node is inactive and is therefore unable to supply services until it recovers. Another possible cause of deactivation is an unresolved dependency, that is, the absence of a real link to a node supplying a required service. This can happen when the only supply node available for a service is either destructed or its upstream supplier is destructed. We shall refer to the nodes with an unresolved dependency as disabled nodes. Note that a node can be disabled, destructed, or both at a given time. Once a node becomes inactive, all of its dependencies, connected through their real links, are subject to deactivation unless they have other real links providing all of their required services (Fig. 2).
We assume that a node is eligible to switch links, that is, to turn a virtual or contingent link into a real one, if virtual links for all of the node’s unresolved dependencies exist and the node is not destructed. At every time step during which the node is both disabled and eligible to switch, switching happens with probability p_{s}. Switching can be either instant (the first attempt to switch is made at the same time step the node has become disabled) or delayed (the node with an unresolved dependency remains disabled for at least one time step).
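The absorption and recovery rules above can be sketched as a small time-stepped simulation. This is a simplified illustration, not the authors’ implementation. It makes several assumptions: each node requires a single service supplied by the level directly above it, states are updated synchronously from the previous time step, a switch goes to a randomly chosen backup supplier that was active at the previous step, switching is instant, and all destructed nodes are rebuilt exactly at step T_R. The function names are hypothetical.

```python
import random


def build_dag(level_sizes, p_m, rng):
    """Build a layered supply-demand DAG; returns (real, backups) link maps."""
    levels, start = [], 0
    for size in level_sizes:
        levels.append(list(range(start, start + size)))
        start += size
    real, backups = {}, {}
    for upper, lower in zip(levels, levels[1:]):
        for node in lower:
            real[node] = rng.choice(upper)  # the single enabled ("real") link
            # every other upper-level node is a "virtual" backup with probability p_m
            backups[node] = [s for s in upper
                             if s != real[node] and rng.random() < p_m]
    return real, backups


def simulate(level_sizes, attacked, p_m, p_s, t_r, t_c, seed=0):
    """Return the critical functionality A(t)/N for t = 0, ..., t_c."""
    rng = random.Random(seed)
    real, backups = build_dag(level_sizes, p_m, rng)
    total = sum(level_sizes)
    destructed = set(attacked)
    active = set(range(total)) - destructed
    profile = [len(active) / total]
    for t in range(1, t_c + 1):
        if t >= t_r:
            destructed = set()  # destructed nodes are rebuilt once T_R steps have passed
        new_active = set()
        for node in range(total):
            if node in destructed:
                continue
            supplier = real.get(node)  # None for top-level nodes
            if supplier is not None and supplier not in active:
                # Unresolved dependency: instant switching attempt with probability p_s,
                # restricted to backups that were active at the previous step.
                options = [s for s in backups[node] if s in active]
                if not options or rng.random() >= p_s:
                    continue  # the node stays disabled during this step
                new_supplier = rng.choice(options)
                backups[node].remove(new_supplier)
                backups[node].append(supplier)
                real[node] = new_supplier
            new_active.add(node)
        active = new_active
        profile.append(len(active) / total)
    return profile


# Top level fully destroyed, no redundancy: the failure cascades one level per step.
print(simulate([2, 4, 8], attacked=[0, 1], p_m=0.0, p_s=1.0, t_r=5, t_c=8))
```

Averaging the returned profile over time and over many attacked sets gives a resilience estimate; rerunning with larger p_m or p_s shows how redundancy and switching flatten the drop.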
Analytical approximation for the special case of the DAG model
In this section we derive equations describing the number of active nodes in the special case where the switching is instantaneous with probability p_{s} =1 while the initial damage is small compared to the total number of nodes in the network.
Let us denote by Λ the number of levels in the network and by N_{i} the number of nodes in level i (i =0, …, Λ – 1). We can find the probability that a node in level i has only one service provider in level j as follows:
For the case where the number of deactivated nodes at each time step is small enough or in which p_{m} =0, we may assume that only the nodes with one link for a relevant service are disabled as a result of the inactivation of their supplier (thus neglecting the cases in which the node has more than one supplier of a service and all of them are deactivated).
The average number of active nodes in level i at time step t () is given by the formula:
These nodes have suppliers in each level j < i. We disabled suppliers of the nodes in level j between steps t − 1 and t. The probability for a node in level i to become disabled between time steps t and t + 1 is therefore:
We approximate the distribution of 1/ to be linear in , although the dependency itself is not linear:
where (under the assumption that the overall damage is small enough):
Considering the Taylor expansion of 1/(1 – x), we have for small values of x: 1/(1 – x) ≈ 1 + x.
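The size of the neglected terms can be made explicit. For |x| < 1 the geometric series gives the identity below, so the error of the linear approximation is x²/(1 − x). For x = 0.05 (a value chosen here only for illustration) the error is about 0.0026.

```latex
\frac{1}{1-x} \;=\; \sum_{k=0}^{\infty} x^{k} \;=\; 1 + x + \frac{x^{2}}{1-x}, \qquad |x| < 1 .
```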
Then, on average between steps t and t + 1, we disable the following number of nodes in level i:
After the recovery period of T_{R} time steps has transpired, the initially destroyed nodes are rebuilt and become active unless they still lack sufficient supplies from the upper levels. Thus, assuming that T_{R} > Λ, = 0 for i = {0, …, Λ} and t ={Λ, …, T_{R} – 1}.
The total number of nodes restored at step T_{R} in level i is given by the expression:
Here, represents the total number of nodes disabled in level i at time step t due to the fact that they do not have sufficient supplies from the upper levels:
And for the next steps, the formula is as follows:
Using the formulae above, we may evaluate the average approximated resilience profiles and find the values of resilience.
Absorption and recovery algorithms in the coupled networks model
The failure propagation algorithm follows the original model of Buldyrev et al.^{34}. Initial damage deactivates a certain fraction of the nodes in network A. Once those nodes are deactivated, network A fractures into clusters, and nodes that do not belong to its largest cluster are also assumed to be deactivated. Then all nodes in network B that depend on deactivated nodes in network A are deactivated as well. This fractures network B, and the nodes outside its largest cluster are likewise assumed to be deactivated. In the second step of the process, nodes in network A that depend on deactivated nodes in network B are deactivated, and the process propagates in the same fashion until no more nodes can be deactivated in either network.
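The cascade described above can be sketched as follows. This is an illustrative reading only, under a simplifying assumption that is not stated in this section: the two networks have the same node labels, and node i in A and node i in B depend on each other. The function names and the adjacency-dictionary representation are hypothetical.

```python
from collections import deque


def largest_component(adj, active):
    """Return the largest connected cluster among the active nodes (BFS)."""
    best, seen = set(), set()
    for start in active:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in active and v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        if len(comp) > len(best):
            best = comp
    return best


def cascade(adj_a, adj_b, initially_failed_a):
    """Propagate failures until no more nodes can be deactivated.

    Assumption: node i in A and node i in B depend on each other.
    Returns the sets of nodes that remain active in A and in B.
    """
    active_a = set(adj_a) - set(initially_failed_a)
    active_b = set(adj_b)
    while True:
        # Nodes outside the largest cluster of A are deactivated.
        active_a = largest_component(adj_a, active_a)
        # B nodes whose partner in A failed are deactivated, then B is pruned.
        new_b = largest_component(adj_b, active_b & active_a)
        # A nodes whose partner in B failed are deactivated in turn.
        new_a = active_a & new_b
        if new_a == active_a and new_b == active_b:
            return active_a, active_b
        active_a, active_b = new_a, new_b
```

For instance, with A a path 0-1-2-3-4 and B given by the edges 0-2, 0-3, 3-4 and 1-4, destroying node 1 in A first isolates node 0 in A, then node 2 in B loses its only link to the rest of B, which in turn deactivates node 2 in A; only nodes 3 and 4 stay active in both networks.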
Recovery is accomplished by backup supply agents that replace the unresolved dependencies of nodes in the first network (A). The number of these agents is denoted N_{b}, and each agent can serve only one node at a time. The nodes that receive backup supply are chosen randomly from those nodes in network A that depend on a currently inactive node in network B. Thus backup is provided either to all nodes in network A with an unresolved dependency (in which case full recovery is guaranteed) or to N_{b} nodes only. A node that has backup supply and is connected to its network’s giant component (GC) is activated, and once activated it is included in that GC, so the giant component of network A grows. Next, the nodes in network B that depend on the activated nodes in network A and are connected to the GC of network B are also activated, and the process propagates in a similar fashion. Once this process is complete, the recovery phase finishes. The backup supply is then removed, meaning that all nodes whose supplier in network B is not active after the recovery phase are deactivated. This leads to a cascading failure that propagates as described in the introduction section. Once the failure phase finishes, the recovery phase is repeated until full network recovery is established or the control time is reached.
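The selection of backup targets at the start of a recovery phase can be sketched as below. This covers only the random choice of nodes, not the subsequent activation and propagation; the function name and the `depends_on` mapping (node in A to its supplier in B) are hypothetical representations chosen for illustration.

```python
import random


def choose_backup_targets(depends_on, active_b, n_b, rng):
    """Pick the nodes of network A that receive backup supply.

    depends_on: maps a node in A to its supplier in B (assumed representation).
    active_b:   set of currently active nodes in B.
    n_b:        number of backup agents; each serves one node at a time.
    """
    # Candidates are A nodes whose supplier in B is currently inactive.
    candidates = sorted(a for a, b in depends_on.items() if b not in active_b)
    if n_b >= len(candidates):
        # Enough agents for every unresolved dependency.
        return set(candidates)
    # Otherwise choose n_b of the candidates uniformly at random.
    return set(rng.sample(candidates, n_b))
```

With three dependent nodes of which two have inactive suppliers, `n_b = 0` selects no node, `n_b = 1` selects one of the two at random, and any `n_b >= 2` selects both.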
Let us consider a simple two-network system (Fig. 9). At the beginning of the simulation (time 0), two nodes (A1 and A3) are assigned to be initially destroyed. The cascading process finishes before the recovery time T_{R} (that is, the time to repair a node exceeds the duration of the cascading failure). Thus at time T_{R} the network is in the state it reached after the cascading failure. After T_{R} steps have passed, the recovery process starts. The case N_{b} = 0 is shown in panel (a) of Fig. 9. In this case the only recoverable node is A3. After its recovery, node B3 is also recovered, but further recovery is not possible. Nodes A1, A2, B1 and B2 cannot be recovered because they have an unresolved dependency. In addition, even if nodes A1 and B1 were independent, they still would not be recoverable because they are not connected to the GCs of their respective networks.
Now consider the case N_{b} = 1. In this case two scenarios are possible:
(a) The node chosen for backup supply is A1 (Fig. 9(b)). Then no recovery can happen, as this node is not connected to the GC of network A (GC(A)), and the recovery phase ends in 0 steps.
(b) The node chosen for backup supply is A2 (Fig. 9(c)). Then this node recovers and node B2 recovers in turn. During the second step of the recovery phase, node A1 recovers and node B1 recovers in turn.
Additional Information
How to cite this article: Ganin, A. A. et al. Operational resilience: concepts, design and analysis. Sci. Rep. 6, 19540; doi: 10.1038/srep19540 (2016).
References
Linkov, I. et al. Changing the resilience paradigm. Nat. Clim. Change 4, 407–409, doi: 10.1038/nclimate2227 (2014).
Vespignani, A. Complex networks: the fragility of interdependency. Nature 464, 984–985, doi: 10.1038/464984a (2010).
World Economic Forum, Global Risks 2015. Technical report. (2015) Available at: http://www3.weforum.org/docs/WEF_Global_Risks_2015_Report15.pdf. (Accessed: 9^{th} April 2015).
Park, J., Seager, T. P., Rao, P. S. C., Convertino, M. & Linkov, I. Integrating risk and resilience approaches to catastrophe management in engineering systems: perspective. Risk Anal. 33, 356–367, doi: 10.1111/j.1539-6924.2012.01885.x (2013).
Jansen, W. Directions in Security Metrics Research (National Institute of Standards and Technology, 2009).
Bartol, N., Bates, B., Goertzel, K. M. & Winograd, T. Measuring Cyber Security and Information Assurance (Information Assurance Technology Analysis Center, 2009).
Holling, C. S. Resilience and stability of ecological systems. Annu. Rev. Ecol. Syst. 4, 1–23, doi: 10.1146/annurev.es.04.110173.000245 (1973).
Cimellaro, G. P., Reinhorn, A. M. & Bruneau, M. Framework for analytical quantification of disaster resilience. Eng. Struct. 32, 3639–3649, doi: 10.1016/j.engstruct.2010.08.008 (2010).
Adger, W. N. Social-ecological resilience to coastal disasters. Science 309, 1036–1039, doi: 10.1126/science.1112122 (2005).
Ouyang, M., Dueñas-Osorio, L. & Min, X. A three-stage resilience analysis framework for urban infrastructure systems. Struct. Saf. 36–37, 23–31, doi: 10.1016/j.strusafe.2011.12.004 (2012).
Kahan, J. H., Allen, A. C. & George, J. K. An operational framework for resilience. J. Homel. Secur. Emerg. Manag. 6, doi: 10.2202/1547-7355.1675 (2009).
Como, G., Savla, K., Acemoglu, D., Dahleh, M. A. & Frazzoli, E. Robust distributed routing in dynamical networks – part II: strong resilience, equilibrium selection and cascaded failures. IEEE Trans. Autom. Control 58, 333–348, doi: 10.1109/TAC.2012.2209975 (2013).
Vugrin, E. D., Warren, D. E., Ehlen, M. A. & Camphouse, C. R. In Sustainable and Resilient Critical Infrastructure Systems Simulation, Modeling and Intelligent Engineering (eds. Gopalakrishnan, K. & Peeta, S.) 77–116 (Springer, 2010).
U.S. Department of Homeland Security, National infrastructure protection plan. Technical report. (2009) Available at: http://www.dhs.gov/xlibrary/assets/NIPP_Plan.pdf. (Accessed: 9^{th} April 2015).
Obama, B. Presidential Proclamation for National Preparedness Month (The White House, 2009).
Disaster Resilience: a National Imperative (The National Academies Press, 2012).
Barrett, C. B. & Constas, M. A. Toward a theory of resilience for international development applications. Proc. Natl. Acad. Sci. 111, 14625–14630, doi: 10.1073/pnas.1320880111 (2014).
Bruneau, M. et al. A framework to quantitatively assess and enhance the seismic resilience of communities. Earthq. Spectra 19, 733–752, doi: 10.1193/1.1623497 (2003).
Linkov, I. et al. Measurable resilience for actionable policy. Environ. Sci. Technol. 47, 10108–10110, doi: 10.1021/es403443n (2013).
Linkov, I. et al. Resilience metrics for cyber systems. Environ. Syst. Decis. 33, 471–476, doi: 10.1007/s10669-013-9485-y (2013).
Carvalho, R. et al. Resilience of natural gas networks during conflicts, crises and disruptions. PLoS ONE 9, doi: 10.1371/journal.pone.0090265 (2014).
Havlin, S., Kenett, D. Y., Bashan, A., Gao, J. & Stanley, H. E. Vulnerability of network of networks. Eur. Phys. J. Spec. Top. 223, 2087–2106, doi: 10.1140/epjst/e2014-02251-6 (2014).
De Domenico, M., Lancichinetti, A., Arenas, A. & Rosvall, M. Identifying modular flows on multilayer networks reveals highly overlapping organization in interconnected systems. Phys. Rev. X 5, doi: 10.1103/PhysRevX.5.011027 (2015).
Brummitt, C. D., Lee, K.-M. & Goh, K.-I. Multiplexity-facilitated cascades in networks. Phys. Rev. E 85, doi: 10.1103/PhysRevE.85.045102 (2012).
Massaro, E. & Bagnoli, F. Epidemic spreading and risk perception in multiplex networks: a self-organized percolation method. Phys. Rev. E 90, doi: 10.1103/PhysRevE.90.052817 (2014).
Boccaletti, S. et al. The structure and dynamics of multilayer networks. Phys. Rep. 544, 1–122, doi: 10.1016/j.physrep.2014.07.001 (2014).
Holling, C. S. In Engineering within Ecological Constraints (ed. Schulze, P. C.) 31–44 (National Academy Press, 1996).
Pimm, S. L. The complexity and stability of ecosystems. Nature 307, 321–326, doi: 10.1038/307321a0 (1984).
Ouyang, M. & Wang, Z. Resilience assessment of interdependent infrastructure systems: with a focus on joint restoration modeling and analysis. Reliab. Eng. Syst. Saf. 141, 74–82, doi: 10.1016/j.ress.2015.03.011 (2015).
Ouyang, M. & Dueñas-Osorio, L. Time-dependent resilience assessment and improvement of urban infrastructure systems. Chaos Interdiscip. J. Nonlinear Sci. 22, doi: 10.1063/1.4737204 (2012).
Reed, D. A., Kapur, K. C. & Christie, R. D. Methodology for assessing the resilience of networked infrastructure. IEEE Syst. J. 3, 174–180, doi: 10.1109/JSYST.2009.2017396 (2009).
Bocchini, P., Frangopol, D. M., Ummenhofer, T. & Zinke, T. Resilience and sustainability of civil infrastructure: toward a unified approach. J. Infrastruct. Syst. 20, doi: 10.1061/(ASCE)IS.1943-555X.0000177 (2014).
Thulasiraman, K. Graphs: Theory and Algorithms (Wiley, 1992).
Buldyrev, S. V., Parshani, R., Paul, G., Stanley, H. E. & Havlin, S. Catastrophic cascade of failures in interdependent networks. Nature 464, 1025–1028, doi: 10.1038/nature08932 (2010).
D’Agostino, G. & Scala, A. Networks of Networks: the Last Frontier of Complexity (Springer, 2014).
Shapiro, J. F. Modeling the Supply Chain (Thomson-Brooks/Cole, 2007).
Yu, H. & Gerstein, M. Genomic analysis of the hierarchical structure of regulatory networks. Proc. Natl. Acad. Sci. 103, 14724–14731, doi: 10.1073/pnas.0508637103 (2006).
Corominas-Murtra, B., Goni, J., Sole, R. V. & Rodriguez-Caso, C. On the origins of hierarchy in complex networks. Proc. Natl. Acad. Sci. 110, 13316–13321, doi: 10.1073/pnas.1300832110 (2013).
Sapatnekar, S. S. Timing (Kluwer Academic Publishers, 2004).
Frasconi, P., Gori, M. & Sperduti, A. A general framework for adaptive processing of data structures. IEEE Trans. Neural Netw. 9, 768–786, doi: 10.1109/72.712151 (1998).
Majdandzic, A. et al. Spontaneous recovery in dynamical networks. Nat. Phys. 10, 34–38, doi: 10.1038/nphys2819 (2013).
Callaway, D. S., Newman, M. E. J., Strogatz, S. H. & Watts, D. J. Network robustness and fragility: percolation on random graphs. Phys. Rev. Lett. 85, 5468–5471, doi: 10.1103/PhysRevLett.85.5468 (2000).
Suzuki, J., Hirao, T., Sasaki, Y. & Maeda, E. Hierarchical directed acyclic graph kernel: methods for structured natural language data in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics 1, 32–39, doi: 10.3115/1075096.1075101 (Association for Computational Linguistics, 2003).
Cho, S., Elhourani, T. & Ramasubramanian, S. Independent directed acyclic graphs for resilient multipath routing. IEEE/ACM Trans. Netw. 20, 153–162, doi: 10.1109/TNET.2011.2161329 (2012).
Yan, K.-K., Fang, G., Bhardwaj, N., Alexander, R. P. & Gerstein, M. Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks. Proc. Natl. Acad. Sci. 107, 9186–9191, doi: 10.1073/pnas.0914771107 (2010).
Parshani, R., Buldyrev, S. V. & Havlin, S. Interdependent networks: reducing the coupling strength leads to a change from a first to second order percolation transition. Phys. Rev. Lett. 105, doi: 10.1103/PhysRevLett.105.048701 (2010).
Catanzaro, M., Boguñá, M. & Pastor-Satorras, R. Generation of uncorrelated random scale-free networks. Phys. Rev. E 71, doi: 10.1103/PhysRevE.71.027103 (2005).
Top500.org, Operating system family / Linux. Technical report. (2014) Available at: http://www.top500.org/statistics/details/osfam/1. (Accessed: 9^{th} April 2015).
Nussbaum, L., Debian packaging tutorial. (2014) Available at: https://www.debian.org/doc/manuals/packagingtutorial/packagingtutorial.en.pdf. (Accessed: 9^{th} April 2015).
Kargl, F., Maier, J. & Weber, M. Protecting web servers from distributed denial of service attacks in Proceedings of the 10th international conference on World Wide Web 514–524, doi: 10.1145/371920.372148 (ACM Press, 2001).
Acknowledgements
The authors would like to thank Dr. Maksim Kitsak (Northeastern University) for his insightful comments.
Author information
Contributions
All authors developed the concept and the model, A.G. developed the software and conducted the experiment, A.G., E.M., A.G. and I.L. analyzed the results, I.L. provided overall guidance. N.S., J.K., A.K. and R.M. reviewed the manuscript. This study was funded by the US Army. The views and opinions expressed in this paper are those of the individual authors and not those of the US Army or other sponsor organizations.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/