Operational resilience: concepts, design and analysis

Building resilience into today’s complex infrastructures is critical to the daily functioning of society and to its ability to withstand and recover from natural disasters, epidemics, and cyber-threats. This study proposes quantitative measures that capture and implement the definition of engineering resilience advanced by the National Academy of Sciences. The approach is applicable across physical, information, and social domains. It evaluates the critical functionality, defined as a performance function of time set by the stakeholders. Critical functionality yields valuable information, such as the system's integrated resilience over a time interval and its robustness. The paper demonstrates the formulation on two classes of models: 1) multi-level directed acyclic graphs, and 2) interdependent coupled networks. For both models, synthetic case studies are used to explore trends. For the first class, the approach is also applied to the Linux operating system. Results indicate that desired resilience and robustness levels are achievable by trading off different design parameters, such as redundancy, node recovery time, and available backup supply. The nonlinear relationship between network parameters and resilience levels confirms the utility of the proposed approach, which is of benefit to analysts and designers of complex systems and networks.


S1. Survey of Existing Approaches to Resilience Quantification
In this section, we focus on definitions and applications of resilience, especially as it pertains to disaster management and engineering. For example, Bruneau et al. [1] identified four dimensions of seismic community resilience: technical, organizational, social, and economic. They integrated different measures of resilience (robustness, rapidity, resourcefulness, and redundancy) in order to minimize the probability of system failures, the consequences arising from such failures, and the time it takes for recovery. The definition proposed by the US Department of Defense [2, 3] is useful, except that it presupposes that the metrics are independent of one another and does not contain an explicit consideration of time (though it can be extended to do so).
The resilience of complex systems is often quantified as the probability of failure under a specific threat scenario at a specified time. This approach essentially adds a temporal dimension to traditional risk assessment. For example, Cimellaro et al. [4] defined resilience quantitatively as the normalized area underneath a function $Q(t)$ that expresses the system's critical functionality as a function of time. Furthermore, they enhanced the definition of the resilience properties proposed by Bruneau et al. [1] by introducing the quantitative measures of control time, $T_{LC}$, and recovery time, $T_{RE}$.
Utilizing nonlinear loss and recovery functions, they quantify the factors of $Q(t)$ and derive resilience as a dimensionless measure that expresses the system's functionality over time. The use of fragility curves in the loss-estimation models is another feature of this quantification.
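The normalized-area definition is straightforward to compute numerically. The sketch below (the linear recovery shape, variable names, and all parameter values are ours, not Cimellaro et al.'s) evaluates resilience as the area under $Q(t)$ divided by the control time $T_{LC}$:

```python
import numpy as np

def trapz(y, x):
    """Trapezoidal integration, written out to avoid NumPy version differences."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

def functionality(t, t0, loss, t_re):
    """Q(t): full functionality, an instantaneous drop of `loss` at t0,
    then linear recovery over the recovery time t_re (illustrative shape)."""
    q = np.ones_like(t)
    during = (t >= t0) & (t < t0 + t_re)
    q[during] = (1.0 - loss) + loss * (t[during] - t0) / t_re
    return q

# Resilience as the normalized area under Q(t) over the control time T_LC.
t_lc = 10.0
t = np.linspace(0.0, t_lc, 2001)
q = functionality(t, t0=2.0, loss=0.6, t_re=4.0)
R = trapz(q, t) / t_lc  # close to 0.88 for this profile
```

A deeper drop or a slower recovery shrinks the area under $Q(t)$ and hence the resilience value, which is the behavior the definition is meant to capture.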
Ouyang et al. [5] proposed an expected, time-dependent, annual resilience metric that measures the system's preparedness and capacity to confront and recover from the occurrence of different types of hazards. These system properties are collectively regarded as the system's performance; the metric therefore provides a time-dependent performance curve, the area under which represents the system's resilience. The metric is conceptually similar to other proposals, since it is based on stochastic modeling of an iterative "hazard occurrence - restoration actions - recovery" process; however, it differs from other proposals in that it introduces the notion of quantifying a system's resilience under multiple hazards.
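As a rough illustration of the multiple-hazard idea, the sketch below Monte Carlo-averages one year of Poisson hazard arrivals, each followed by a linear restoration; the arrival model, loss shape, and every parameter value are our assumptions, not Ouyang et al.'s formulation:

```python
import random

def annual_resilience(rate=2.0, loss=0.5, t_rec=20.0, horizon=365.0,
                      n_runs=5000, seed=1):
    """Monte Carlo estimate of expected annual resilience: the ratio of the
    area under actual performance P(t) to the area under target performance,
    with hazards arriving as a Poisson process (`rate` events/year) and a
    linear restoration over t_rec days after each event (toy model)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_runs):
        t, lost = 0.0, 0.0
        while True:
            t += rng.expovariate(rate / horizon)  # next hazard occurrence
            if t >= horizon:
                break
            # Triangular performance deficit: drop by `loss`, then linear
            # recovery over t_rec days, truncated at the end of the year.
            dt = min(t_rec, horizon - t)
            lost += loss * dt * (1.0 - dt / (2.0 * t_rec))
        acc += 1.0 - lost / horizon
    return acc / n_runs
```

With the defaults, each hazard costs on average `loss * t_rec / 2` performance-days, so the estimate sits a few percent below 1; raising the hazard rate lowers it, which is the multiple-hazard dependence the metric is designed to expose.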
Bocchini and Frangopol [6] devise a model of recovery planning for networks of highway bridges that have been damaged by an earthquake. In their model, optimal bridge restoration activities are identified so as to maximize resilience, minimize the time required to return the network to a targeted functionality level, and minimize the cost of these activities. The equation they employ to measure resilience under these constraints is:

$$R = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} F(t)\,dt$$

Note that $F$ represents the system performance measure. The time at which the system is disrupted is $t_1$, and the time at which the system achieves a targeted level of performance is $t_2$, at which time recovery is considered complete. This approach is similar to Bruneau's resilience loss calculation, with one noteworthy difference: Bocchini and Frangopol measure the area between $F(t)$ and 0 (as opposed to measuring the area between $F_0$ and $F(t)$) and then normalize over the recovery time period. This difference is important because it increases the value of resilience simply by increasing the value of $t_2$. Bruneau's approach has the opposite outcome. We generalize and expand the scope of this definition. For instance, for networks, we incorporate such factors as the relative importance of nodes and links through the weights $w_i(t)$, and the level of damage done to a component, $\pi_i(t)$.
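The contrast between the two area measures can be checked numerically. In this sketch (the recovery profile and all numbers are ours), extending the observation window past the recovery point inflates the normalized-area measure but leaves the Bruneau-style resilience loss unchanged:

```python
import numpy as np

def trapz(y, x):
    """Trapezoidal integration helper."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

f0, t1, t2 = 1.0, 0.0, 8.0      # baseline, disruption time, recovery end
t = np.linspace(t1, 12.0, 2401)  # grid extending past the recovery point
f = np.where(t < t2, 0.3 + 0.7 * (t - t1) / (t2 - t1), f0)  # linear recovery

def normalized_area(t_end):
    """Bocchini/Frangopol-style: area between F(t) and 0 over [t1, t_end],
    normalized by the window length."""
    m = t <= t_end
    return trapz(f[m], t[m]) / (t_end - t1)

def resilience_loss(t_end):
    """Bruneau-style: area between the baseline F0 and F(t)."""
    m = t <= t_end
    return trapz(f0 - f[m], t[m])
```

Here `normalized_area(8.0)` is 0.65, and moving the window end to 12.0 raises it even though nothing about the disruption changed, while `resilience_loss` stays at 2.8 once recovery is complete.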
Vugrin et al. [7] proposed a resilience framework that depends upon the calculation of two key quantities: the systemic impact (SI), that is, the cumulative impact of decreased system performance after a disruptive event, and the total recovery effort (TRE), that is, the cumulative resources expended in recovery activities. These authors use the two quantities to devise a composite resilience metric:

$$RC = SI + \alpha \, TRE$$

Note that $\alpha$ is a weighting factor that accounts both for unit conversion and for the relative weighting between SI and TRE in the overall evaluation. This framework can be applied to transportation networks, in addition to continuous, dynamic systems and agent-based models. We also attempt to generalize this definition. In network science, the focus is often on multilayer networks [8-10], on which it is intuitive to emulate socio-technical systems, such as human-information networks [11, 12], and to comprehend the interactions among interdependent infrastructures [13-17]. In many of these studies, the focus is on robustness or the percolation process during or after the occurrence of an adverse event [18], which constitutes the first phase of our resilience definition, that is, when the critical functionality in Figure 1 is decreasing. Recently, some authors have been focusing on the recovery phase and on self-healing processes in complex networks [19, 20], which constitutes the second, or recovery, phase of our definition of resilience, that is, when the critical functionality is increasing. Our definition, therefore, embodies both the concept of robustness and the concept of recovery.
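A discrete-time sketch of the composite combination is below; the summation scheme, the time series, and the $\alpha$ value are ours (Vugrin et al.'s full framework also involves normalization steps omitted here):

```python
def resilience_cost(perf_target, perf_actual, recovery_effort, dt, alpha):
    """Composite metric: systemic impact SI (cumulative performance
    shortfall relative to the target) plus the alpha-weighted total
    recovery effort TRE. Lower values indicate a more resilient response."""
    si = sum((pt - pa) * dt for pt, pa in zip(perf_target, perf_actual))
    tre = sum(r * dt for r in recovery_effort)
    return si + alpha * tre

# Hypothetical disruption: performance dips to 0.4 and recovers in 4 steps,
# while recovery resources taper off as the system comes back.
target = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
actual = [1.0, 0.4, 0.55, 0.7, 0.85, 1.0]
effort = [0.0, 0.8, 0.6, 0.4, 0.2, 0.0]
cost = resilience_cost(target, actual, effort, dt=1.0, alpha=0.5)  # SI=1.5, TRE=2.0
```

The weighting factor $\alpha$ lets an analyst trade off performance shortfall against recovery expenditure: a fast but resource-hungry recovery and a slow but cheap one can land on the same composite value.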
In Ref. [21], a different metric for quantifying the resilience or efficiency of interconnected networks is outlined. The authors introduce a mechanism to perform such exploration, using random walks on multilayer networks, and they show how the topological structure, together with the navigation strategy, influences the efficiency of exploring the whole structure. They quantify the efficiency of the system as the number of sites visited by a random walker during a certain temporal window. They define the coverage $\rho(t)$ as the average fraction of distinct vertices visited at least once in a period of time less than or equal to $t$. Roughly speaking, the resilience of an interconnected system is a function of the fraction $\varphi$ of random failures. In particular, they define the resilience $r(\varphi)$ as

$$r(\varphi) = \frac{\langle \rho_{\varphi}(\tau) \rangle}{\langle \rho_{0}(\tau) \rangle},$$

where $\rho_{\varphi}(\tau)$ is the coverage at time $\tau$ of the network subject to a fraction $\varphi$ of failures, and $\rho_{0}(\tau)$ is the coverage of the intact network. The averages are calculated over several random realizations of the failures. However, another area of research is more heavily concentrated on defining all of the temporal aspects of resilience without consideration of the complex interactions among the different layers. Scholars in engineering resilience and, in particular, transportation systems are presently examining whether graph theory is a viable way to quantify resilience [22-26].
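The coverage-based measure can be illustrated with a single-layer stand-in; the multilayer machinery of Ref. [21] is omitted, and the ring-lattice graph, failure model, and all parameters below are ours:

```python
import random

def coverage(adj, steps, trials=200, seed=0):
    """Average fraction of distinct nodes a random walker visits within
    `steps` steps -- the coverage rho(t) described in the text."""
    rng = random.Random(seed)
    alive = [u for u in range(len(adj)) if adj[u]]
    acc = 0.0
    for _ in range(trials):
        u = rng.choice(alive)
        seen = {u}
        for _ in range(steps):
            if not adj[u]:          # dead end: no surviving neighbours
                break
            u = rng.choice(adj[u])
            seen.add(u)
        acc += len(seen) / len(adj)
    return acc / trials

def fail_nodes(adj, phi, seed=1):
    """Remove a random fraction phi of nodes and their incident links."""
    rng = random.Random(seed)
    failed = set(rng.sample(range(len(adj)), int(phi * len(adj))))
    return [[] if u in failed else [v for v in nbrs if v not in failed]
            for u, nbrs in enumerate(adj)]

# Ring lattice with nearest and next-nearest neighbours; r(phi) as the
# ratio of the damaged network's coverage to the intact network's at
# the same time horizon tau.
n, tau, phi = 60, 120, 0.2
adj = [[(u + d) % n for d in (-2, -1, 1, 2)] for u in range(n)]
r_phi = coverage(fail_nodes(adj, phi), tau) / coverage(adj, tau)
```

Because failed nodes both become unreachable and fragment the walker's paths, the damaged network's coverage falls below the intact network's, giving a ratio below one that shrinks as $\varphi$ grows.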

S2. Dependency of the Resilience on the Redundancy and Switching Probabilities for Selected Classes of Attacks
In this section, we provide additional results that reinforce the observations in the Results and Conclusion section. Specifically, with a fixed recovery time, the redundancy and switching probabilities, $p_m$ and $p_s$ respectively, can be traded off to maintain a desired level of resilience. Moreover, increasing both parameters has an additive effect on resilience. Figures S2.1-S2.4 illustrate this point. In Figure S2.5, the switching probability is held constant, and we see that $T_R$ and $p_m$ can also be traded off to maintain a desired resilience level. The additive effect of this parameter pair is also evident.