Infection-induced Cascading Failures – Impact and Mitigation

In the context of epidemic spreading, many intricate dynamical patterns can emerge due to the cooperation of diﬀerent types of pathogens or the interaction between the disease spread and other failure propagation mechanism. To unravel such patterns, simulation frameworks are usually adopted, but they are computationally demanding on big networks and subject to large statistical uncertainty. Here, we study the two-layer spreading processes on unidirectionally dependent networks, where the spreading infection of diseases or malware in one layer can trigger cascading failures in another layer and lead to secondary disasters, e.g., disrupting public services, supply chains, or power distribution. We utilize a dynamic message-passing method to devise eﬃcient algorithms for inferring the system states, which allows one to investigate systematically the nature of complex intertwined spreading processes and evaluate their impact. Based on such dynamic message-passing framework and optimal control, we further develop an eﬀective optimization algorithm for mitigating network failures.


INTRODUCTION
Epidemic outbreaks do not only possess a direct threat to public health but also, indirectly, impact other sectors [1][2][3].For instance, when many infected individuals have to rest, be hospitalized or quarantined in order to slow down the epidemic spread, this could severely disrupt public services, causing disutility even to those who are not infected.For instance, the highly interdependent supply chains can be easily disrupted due to epidemic outbreaks [4,5].Similar concerns apply to cyber security.The spread of malware is not merely detrimental to computer networks, but can also cause failures to power grids or urban transportation networks which rely on modern communication systems [6,7].What is even worse is that the failures of certain components of technological networks can by themselves trigger a cascade of secondary failures, which can eventually lead to largescale outages [8].Therefore, it is vital to understand the nature of epidemic (or malware) spreading and failure propagation on interacting networks, based on which further mitigation and control measures can be devised.
A number of previous papers address the scenario of interacting spreading processes.In the context of epidemic spreading, two types of pathogens can cooperate or compete with each other, creating many intricate patterns of disease propagation [9][10][11][12][13][14][15].For interacting technological networks (e.g., communication and power networks), the failure of components in one network layer will not only affect neighboring parts within the same network, but will also influence the second network layer through the cross-layer connections.Macroscopic analyses based on simplified models show that such a spreading mecha-nism can easily result in a catastrophic breakdown of the whole system [16][17][18].
Most existing research in the area of multi-layer spreading processes employs macroscopic approaches, such as the degree-distribution-based mean-field methods and asymptotic percolation analysis, in order to obtain the global picture of the models' behavior [19].Such methods typically do not consider specific network instances and lack the ability to treat the interplay between the spreading dynamics and the fine-grained network topology [19].For stochastic spreading processes with specific system conditions (e.g., topology initial conditions and individual node properties), it is common to apply extensive Monte Carlo (MC) simulations to observe the evolution of the spread, based on which important policy decisions are made [20].However, such simulations are computationally demanding on big networks and can be subject to large statistical uncertainty; as a result, they are difficult to be used for downstream analysis or optimization tasks.Therefore, researchers have been pursuing tractable and accurate theoretical methods to tackle the complex stochastic dynamics on networks [19,21].
Among the various developed theoretical approaches used, dynamic message-passing (DMP) is based on ideas from statistical physics offering a desirable algorithmic framework for approximate inference while it remains computationally efficient [22][23][24].The DMP method has been shown to be more accurate than the widely adopted individual-based mean-field method, especially in sparse networks [25,26].Moreover, the DMP approach yields a set of closed-form equations, which is very convenient for additional parameter estimation and optimization tasks [14,27,28].
In this work, we study a scenario where the epidemic or malware spreading on one network can trigger cascading failures on another.This is relevant in the cases where epidemic outbreaks cause disruption in public services or economic activities.Similarly, it can also be applied to study the effect of malware spread on computer networks causing the breakdown of other technological networks such as the power grid.The latter phenomenon is gaining more and more attention due to the increasing interactions among various engineering networks [7].We explore the dynamics and consequences of such infectioninduced cascading failures across two-layer networks using the DMP method.Our results reveal that even relatively low infection rates can induce large-scale cascading failures, leading to widespread network disruptions.We characterized these phenomena through the derivation and analysis of DMP equations, achieving a comprehensive understanding by linking the process to combined bond and bootstrap percolation models analytically.Leveraging the analytical tractability of the DMP model, we also developed optimization algorithms that effectively mitigate these network failures.By adjusting control parameters based on the back-propagation of final state impacts, these algorithms help minimize the size of system failure.

Model and Framework
The Model To study the impact of infection spread of diseases or malware and their secondary effects, we consider multiplex networks comprising two layers [29], which are denoted as layers a and b, and are represented by two graphs G a (V a , E a ) and G b (V b , E b ).For convenience, we assume that the nodes in both layers correspond to the same set of individuals, denoted as V = V a = V b .This can be extended to more general settings.Denote ∂ a i and ∂ b i as the sets of nodes adjacent to node i in layers a and b, respectively.We also define ∂ i = ∂ a i ∪ ∂ b i .See Fig. 1 for an example of the network model under consideration.
Each node has states on both layers a and b.In layer a, each node assumes one of four states, susceptible (S), infected (I), recovered (R), and protected (P ) at any particular time step.The infection spreading process occurs in layer a only, which is dictated by the stochastic discrete-time SIR model [19] augmented with a protection mechanism, which we term the SIRP model.Stochastic models are commonly employed for modeling the spreads of epidemics or malware [20,30,31].The stochastic SIR model is commonly used for representing the spread of infections, wherein a susceptible individual (in state S) may become infected through contact with infected neighbors, and an infected individual (in state I) can recover, transitioning to the recovered state (R) after a certain period.The process we consider is based on the SIR model but includes one more state, P , in layer a; it admits the following state-transition rule where β ji is the probability that node j being in the infected state transmits the infection to its susceptible neighboring node i at a certain time step.At each time step, an existing infected node i recovers with probability µ i ; the recovery process is assumed to occur after possible transmission activities.At time t, an existing susceptible node i turns into state P if it receives protection at time t− 1, which occurs with probability γ i (t− 1).
The protection can be achieved by vaccination in the epidemic setting or special protection measures in the malware spread setting, which is usually subject to certain budget constraints.The protection probabilities {γ i (t)} will be the major control variables for mitigating the outbreaks.Note that when no protection is provided, i.e., all {γ i (t)} are zero, the SIRP model reduces to the traditional SIR model.At initial time t = 0, we assume that node i has a probability P i S (0) to be in state S, and probability P i I (0) = 1 − P i S (0) to be in state I.In layer b, each node i can either be in the normal state (N ) or the failed state (F ), indicated by a binary state variable x i where x i = 1 (0) denotes the 'fail' ('normal') state at a particular time step.A node i in layer b fails if (i) it has been infected, i.e., node i is in state I or R in layer a; (ii) there exists certain neighboring failed nodes such that j∈∂ b i x j b ji ≥ Θ i , where Θ i is a threshold and the influence parameter b ji measures the importance of the failure of node j on node i.The latter case indicates that node i can fail due to the failures of its neighbors which it relies on, even though node i itself is not infected.In summary, the failure propagation process in layer b can be expressed as or (ii) The whole process is simulated for T time steps.As we are interested in the time scale of infection spread which is usually very fast, we do not consider any repair rule in layer b.Therefore, a failed node cannot return to normality within the time window under consideration.Such a failure propagation mechanism is equivalent to the linear threshold model (LTM), which is commonly used in studying social contagion and other cascade processes [19,32,33].The LTM model also offers a straightforward yet effective framework for understanding cascading failures in various systems, as it effectively encapsulates the pivotal dynamics where a component can become dysfunctional if a significant number of its dependent components fail [18,32].Other popular models for cascading failures incorporate more details of the system functionalities [34][35][36]; these models require theoretical analyses specific to each case, which fall outside the scope of the current study.
Fig. 1 illustrates the infection-induced cascades of our model in a simple network of 4 nodes.Node 1 is the initial infected node (or the seed) in layer a, which transmits the infection to node 2 at a certain time step.Now that node 2 is in the infected state in layer a, it also fails to function in layer b.If b 24 ≥ Θ 4 , then node 4 will also fail as it loses the support from node 2, even though node 4 itself has not been infected.Such additional cascade propagation needs extra care when infections spread out.Similar interacting SIR (without a protection mechanism) and LTM processes have also been considered in the social contagion setting [37].
We reiterate that the infection-spreading process (described by the SIRP model) occurs in layer a only and not the entire network, while the cascade process (described by the LTM model) occurs in layer b.Typically, a holistic treatment of the combined two-layer processes is needed to understand their impact and develop mitigation strategies.We also remark that our model differs from the traditional settings of interdependent networks, which typically includes reciprocal dependency.

The DMP Framework
We aim to use the DMP approach to investigate the two-layer spreading processes described above.The DMP equations of the usual SIR and the LTM model have been derived, based on the microscopic dynamic belief propagation equations [24,38].As in generic belief propagation methods [39], the DMP method is exact for tree graphs, while it can constitute a good approximation for loopy graphs, particularly when short loops, such as those spanning 3 or 4 nodes, are scarce.The twolayer spreading processes combining the SIR and LTM model appear more involved, where approximations relying on uncorrelated multiplex networks were used [37].Such approximations become less adequate when the two network layers are correlated, e.g., both layers share the same network topology.

Dynamic Belief Propagation
To devise more accurate DMP equations for general network models and accommodate the protection mechanism for mitigation, we start from the principled dynamic belief propagation equations of the two-layer processes.One important characteristic of our model is that state transition is unidirectional, which can only take the direction S → I → R or S → P in layer a, and N → F in layer b.Note that layer b does not influence layer a.As a result, our model admits a reduced representation of the system's dynamical trajectories that subsequently facilitates a drastic simplification of the derivation of the DMP equations, which are exact on tree networks [24].Nevertheless, we emphasize that the exactness of the DMP formalism for tree networks is conditioned on the unidirectional nature of the model, which no longer holds if layer b also influences layer a. Introducing reciprocal interactions between both model layers requires additional theoretical tools, which are interesting by themselves but are beyond the scope of the current study.
Following previous works [24,38], we parametrize the dynamical trajectory of each node by its state transition times.In layer a, we denote τ a i , ω a i and ε a i as the first time at which node i turns into state I, R and P , respectively.In layer b, we denote τ b i as the first time at which node i turns into state F .The quantity of interest is the probability of the trajectory of node i considered in the entire graph comprising layers a and b but having a cavity where node j is absent, denoted as m i→j (τ a i , ω a i , ε a i , τ b i ).Throughout the manuscript, we will refer to probabilities defined within a cavity graph as cavity probabilities.It is computed by the following dynamic belief propagation equations where W i SIRP (•) and W i LTM (•) are the transition kernels dictated by the dynamical rules of the SIRP and LTM model, respectively (for details see Supplementary Note 1).The marginal probability of the trajectory of node i, denoted as m i (τ a i , ω a i , ε a i , τ b i ), can be computed in a similar way as Eq.(3), by replacing the product k∈∂i\j in the last line of Eq. (3) by k∈∂i .That is, the marginal probability m i (•) is calculated using the entire graph, in contrast to the cavity probability m i→j (•) which is determined with a cavity graph where node j is absent.
The probability of node i in a certain state can be computed by summing the trajectory-level probability, which will be described in the next section.

Full Node-level DMP Equations
Consider the cavity probability of node i being in state S in layer a at time t (assuming node j is absent -the cavity), it is obtained by tracing over the corresponding probabilities of trajectories m i→j (•) in the cavity graph (assuming node j is removed) where I(•) is the indicator function enforcing the order of state transitions.Similarly, we denote the cavity probability of node i in state F in layer b (in the absence of node j) as P i→j F (t); it is obtained by The marginal probabilities P i S (t) and P i F (t) can be computed in a similar manner, by replacing m i→j (•) in Eq. (4) and Eq. ( 5) with m i (•).

DMP Equations in Layer a
We note that infection spread in layer a is not influenced by cascades in layer b, while the failure time in layer b depends on the infection time and the protection time of the corresponding node in layer a.Hence, we can decompose the message m i→j (•) to the respective components as where m i→j Summing m i→j a (•) over τ a i , ω a i , ε a i up to a certain time yields the normal DMP equations of node-level probabilities for the infection spread in layer a (see details in Supplementary Note 1).They admit the following expressions for t > 0 where θ k→i (t) is the cavity probability that node k has not transmitted the infection signal to node i up to time t, and φ k→i (t) is the cavity probability that k is in state I but has not transmitted the infection signal to node i up to time t.Note that the messages {P k→i S (t), θ k→i (t), φ k→i (t)} are only needed for edges belonging to layer a where the SIRP model is defined.
At time t = 0, as we consider that each node i is either in state S with probability P i S (0) or in state I with probability 1 − P i S (0), we have the following initial conditions for the messages Upon iterating the above messages ( 7)-( 9) starting from the initial conditions (10), the node-level marginal probabilities can be computed as The above DMP equations ( 11)-( 14) bear similarity to those of SIR model [23], except for the protection mechanism with control parameters {γ i (t)}.The computational complexity for obtaining the messages for the SIRP process in layer a over a total time where |E a | denotes the number of edges in layer a.

DMP Equations in Layer b
As for the cascade process in layer b, whether node i will turn into state F (fail) also depends on the state in layer a, making it more challenging to derive the corresponding DMP equations.The key to obtaining nodelevel DMP equations for P i→j F (t) in Eq. ( 5) (and the corresponding marginal probability P i F (t)) is to introduce several intermediate quantities to facilitate the calculation; the details are outlined in Supplementary Note 1.
To summarize, the node-level failure probability P i F (t) can be decomposed as where P i SF (t) and P i P F (t) are the probabilities that node i is in state F in layer b, while it is in state S or state P in layer a, respectively.For these two cases, the failure of node i is triggered by the failure propagation of its neighbors from layer b.A similar relation holds for the cavity probability P i→j F (t).The probability P i SF (t) admits the following iteration where χ k→i (t) is the cavity probability that node k is in state F at time t − 1, and it has not sent the infection signal to node i up to time t.
The cavity probability χ k→i (t) can be decomposed into where ψ k→i (t) is the cavity probability that node k is in state I or R at time t − 1, but has not transmitted the infection signal to node i up to time t.The cavity probability ψ k→i (t) can be computed as Similarly, the probability P i P F (t) admits the following iteration where the dummy variable ε indicates the time at which node i receives the protection signal.
In Eq. ( 19), χk→i (t, ε) is the cavity probability that node k is in state F at time t−1, but has not transmitted the infection signal to node i up to time ε.It can be decomposed into where ψk→i (t, ε) is the cavity probability that node k is in state I or R at time t − 1, but has not transmitted the infection signal to node i up to time ε − 1.The cavity probability ψk→i (t) can be computed as Note that the cavity probabilities P i→j SF (t) and P i→j P F (t) are computed using the similar formula as in Eq. ( 16) and Eq. ( 19), but in the cavity graph where node j is removed.This closes the loop for the DMP equations in layer b.We also observe in the above equations that the node-level messages for the SIRP process only enter into the DMP equations for the LTM process through the overlapping neighbors ∂ a i ∩ ∂ b i .The initial conditions for the corresponding messages are given by + P k→i SF (t − 1) + P k→i P F (t − 1).(28) We remark that for a total time T , the computational complexity for obtaining the messages of the cascade process in layer b is O(|E b |T 2 ) where |E b | denotes the number of edges in layer b, unlike the O(|E a |T ) complexity for the SIRP process in layer a.This is due to the dependency of layer b on layer a, as well as the protection mechanism in layer a.The summation of the dummy state {x k } k∈∂ b i in Eq. ( 16) and Eq. ( 19) also implies a high computational demand of networks with high-degree nodes.One way to alleviate this complexity is to use the dynamic programming techniques introduced in by Torrisi et al. [40].
These DMP equations are exact if both layers are tree networks, while they are approximate solutions when there are loops in the underlying networks.

Simplification under Small Inter-layer Overlap
If there are no overlaps between the neighbors of node i in layer a and those in layer b, i.e., ∂ a i ∩ ∂ b i = ∅, the messages χ k→i , ψ k→i , χk→i and ψk→i are not needed, and the node-level probabilities P i SF (t) and P i P F (t) can be much simplified as × This is also a reasonable approximation if the two layers a and b have little correlation, which has been exploited by previous work [37].We remark that the computational complexity of obtaining messages for the cascade process in layer b using this approximated method is O(|E b |T ).
In this work, we will employ this approximation when we consider the dynamics in the large time limit and devise an optimization algorithm for mitigating the cascading failures, in order to reduce computing time.In situations where inter-layer overlaps are significant and accuracy is important [41,42], one can always use the complete formulations of the DMP equations as detailed in the "Full Node-level DMP Equations" subsection above.

Effectiveness of the DMP Method
We firstly test the efficacy of the complete DMP equations derived in "Full Node-level DMP Equations" subsection in the Methods section, by comparing the nodelevel probabilities P i S (t) and P i F (t) to those obtained by Monte Carlo simulations.The DMP theory produces exact marginal probabilities for node activities in tree networks; this is verified in Fig. 2(a) and (b) where both layers a and b are the same binary tree network of size N = 63.For random regular graphs (RRG) where there are many loops, the DMP method also yields reasonably accurate solutions; this is demonstrated in Fig. 2(c) and (d) where both layers a and b are the same RRG of size N = 100 and degree K = 5.We also validate the effectiveness of the non-overlapping approximation applied to the DMP equations for the process in layer b introduced in the subsection "Simplification under Small Inter-layer Overlap" in Methods; the results are shown in Supplementary Note 2.

Impact of Infection-induced Cascades
The obtained DMP equations of the two-layer spreading processes allow us to examine the impact of the infection-induced cascading failures, on either a specific instance of a multiplex network or an ensemble of networks following a certain degree distribution.In this section, we do not consider the protection of nodes by setting γ i (t) = 0, where the process in layer a is essentially a discrete-time SIR model.

Impact on a Specific Network
For the process in layer a, we define the outbreak size at time t as the fraction of nodes that have been infected at that time For the process in layer b, we define the cascade size at time t as the fraction of nodes that have failed at that time By definition, we have ρ In Fig. 3, we demonstrate the time evolution of the infection outbreak size and the cascade size in a multiplex network where both layers are random regular graphs with size N = 1600.It can be observed that ρ F is much larger than ρ I + ρ R asymptotically, which suggests that the failure propagation mechanism in layer b significantly amplifies the impact of the infection outbreaks in layer a.In particular, the failure can eventually propagate to the whole network even though less than 70% of the population gets infected when the spread of the infection saturates.Compare to Monte Carlo simulations, the DMP method systematically overestimates the outbreak sizes due to the effect of mutual infection, but it has been shown to offer a significant improvement over the individual-based mean-field method [25,26,43].

Asymptotic Properties
In the above example, the system converges to a steady state in the large time limit.The DMP approach allows us to systematically investigate the asymptotic behavior of the two-layer spreading processes.
For the process in layer a, we define an auxiliary probability Then the messages in layer a admit the following expres- Details of the derivation can be found in Supplementary Note 3. The above asymptotic equations ( 34) suggest a well-known relationship between epidemic spreading and bond percolation [19,22,44].The bond percolation problem involves a network where the bonds (or edges) between nodes are randomly occupied with a certain probability (denoted as λ).The main focus is to understand the formation of a giant cluster comprising connected occupied edges in the network; in large systems, this typically occurs when λ is greater than a transition point λ c [45].
As mentioned above, it is well established that the asymptotic properties of many stochastic epidemic spreading models can be mapped to certain bond percolation problems [44,46]; we refer interested readers to two recent reviews for more details on the subject [19,45].In the SIR model studied here (where γ i (t) = 0), the quantity p ij defined in Eq. ( 33) can be interpreted as the probability that an infection transmission on edge (i, j) has been realized in the long run, corresponding to an edge occupation probability in bond percolation.When the transmission probabilities {β ij } are large ({p ij } will also be large), a few initially infected seeds can eventually infect a significant proportion of the population and lead to a pandemic, which corresponds to the formation of a giant cluster in percolation theory.We refer readers to Supplementary Note 3 for more details of the correspondence between our model and bond percolation.Note that the edge occupation probability p ij in this discrete-time SIR model differs from the continuoustime counterpart [22,44] with an additional term β ij µ i in the denominator.The term β ij µ i accounts for the simultaneous events that node i infects node j and recovers within the same time step [25].
For the process in layer b, we assume that layers a and b are weakly correlated due to their different topologies and adopt the approximation made in the subsection "Simplification under Small Inter-layer Overlap" in Methods.As no protection is applied, we have P i P F (t) = 0. Then the messages in layer b admit the following expression in the limit T → ∞ where a similar expression holds for P i F (∞) by replacing ∂ b i \j with ∂ b i in Eq. (35).The asymptotic equations for layer b suggest a relationship between the LTM model and bootstrap percolation [38].

Two-layer Percolation in Large Homogeneous Networks
The large-time behaviors of the two processes correspond to two types of percolation problems.To further examine the macroscopic critical behaviors of the two-layer percolation models, it is convenient to consider large-size random regular graphs of degree K (which have a homogeneous network topology), and homogeneous system parameters with We further assume that each node i has a vanishingly small probability of being infected at time t = 0 with P i I (0) = 1−P i S (0) ∝ 1/N .In the large size limit N → ∞, we have P i S (0) → 1. Due to the homogeneity of the system, one can assume that all messages and marginal probabilities are identical, It leads to the self-consistent equations in the large size limit (N → ∞), where p = β β+µ−βµ and ⌈x⌉ is the smallest integer greater than or equal to x.
We observe that 40)- (43), which corresponds to vanishing outbreak sizes.When the infection probability β is larger than a critical point β a c , this fixed point solution becomes unstable and another fixed point with finite outbreak sizes develops.
As a concrete example, we consider random regular graphs of degree K = 5 and fix µ = 0.5, b = 1, Θ = 3.By solving Eqs. ( 40)- (43) for different β, we obtain outbreak sizes for both layers a and b under different infection strengths.The result is shown in Fig. 4, where the asymptotic theory accurately predicts the behavior of a large-size system (N = 1600) in the large-time limit.It is also observed that the outbreak sizes in both layers become non-zero when β is larger than a critical point β a c = 1 7 .Furthermore, the outbreak size ρ ∞ F in layer b exhibits a discontinuous jump to a complete breakdown (ρ ∞ F = 1) when β increases and surpasses another transition point β b c ≈ 0.159.However, at the transition point β b c , only about 28.6% of the population has been infected in layer a.
This example again indicates that the cascading failure propagation in layer b can drastically amplify the impact of the epidemic outbreaks in layer a. Lastly, we remark that whether layer b will exhibit a discontinuous transition or not depends on the values of K and Θ [38], as predicted by the bootstrap percolation theory [47].

Mitigation of Infection-induced Cascades
The Optimization Framework The catastrophic breakdown can be mitigated if timely protections are provided to stop the infection's spread.In our model, this is implemented by assigning a non-zero protection probability γ i (t) to node i, after which it is immune from infection from layer a.To minimize the size of final failures, it would be more effective to take into account the spreading processes in both layers a and b when deciding which nodes to prioritize for protection.
Here, we develop mitigation strategies by solving the following constrained optimization problems where the constraint in Eq. ( 45) ensures that γ i (t) is a probability, and Eq. ( 46) represents the global budget constraint on the protection resources.As the objective function O(γ) (the size of final failures) depends on the evolution of the two-layer spreading processes, the optimization problem is challenging.Lokhov and Saad introduced the optimal control framework to tackle similar problems, by estimating the marginal probabilities of individuals with the DMP methods [28].The success of the optimal control approach highlights another advantage of the theoretical methods over numerical simulations [14,28,48].
In this work, we adopt a similar strategy to solve the optimization problem defined in Eqs. ( 44)-( 46), where P i F (T ) is estimated by the DMP equations derived in Methods.As the expressions of the DMP equations have been explicitly given and only involve elementary arithmetic operations, we leverage tools of automatic differentiation to compute the gradient of the objective function ∇ γ O(γ) in a back-propagation fashion [49].It allows us to derive gradient-based algorithms for solving the optimization problem.We remark that such a backpropagation algorithm is equivalent to optimal control with gradient descent update on the control parameters [50].To save computing time, we adopt the approximation made in the subsection "Simplification under Small Inter-layer Overlap" in Methods for conducting the optimization; but we always use the full DMP formulations developed in the subsection "Full Node-level DMP Equations" in Methods for the evaluation of the outcomes.This is particularly suitable for networks having little inter-layer overlaps.In scenarios where significant inter-layer overlaps exist and precision is crucial, it is always possible to resort to the complete version of the DMP equations.
To handle the box constraint in Eq. ( 45), we adopt the mirror descent method, which performs the gradientbased update in the dual (or mirror) space rather than the primal space where {γ i (t)} live [51,52].In our case, we use the logit function Ψ(x) = log( x 1−x ) to map the primal control variable γ i (t) to the dual space as h i (t) = ψ(γ i (t)) ∈ R, where the gradient descent updates are performed.The primal variable can be recovered through the inverse mapping of Ψ(•), which is The elementary mirror descent update step is where n is an index keeping track of the optimization process and s is the step size of the gradient update.
In general, the above optimization process tends to increase the total resources i,t γ i (t).To prevent the violation of the constraint in Eq. ( 46) during the updates, we suppress the gradient component which increases the total resources when i,t γ i (t) ≥ (1 − ǫ)γ tot , by shifting the gradient g n in Eq. ( 48) with a magnitude b n The rationale for the choice of b n is explained in Supplementary Note 4. In our implementation of the algorithm, we choose ǫ = 0.02.Even though the shifted gradient method is used, it does not strictly forbid the violation of the constraint in Eq. ( 46).If the resource capacity constraint is violated, we project the control variables to the feasible region through the simple rescaling Finally, the resource capacity constraint Eq. ( 46) implies that a γ tot amount of protection resources can be distributed in different time steps.In some scenarios, the resources arrive in an online fashion, e.g., a limited number of vaccines can be produced every day.In these cases, there is a resource capacity constraint at each time step.Some results of such a scenario are discussed in Supplementary Note 5.

Case Study in a Tree Network
We first verify the effectiveness of the optimization method by considering a simple problem on a binary tree network of size N = 63, where both layers have the same topology.The results are shown in Fig. 5, where three individuals are chosen to be the infected seeds at time t = 0, and the outbreak is simulated for T = 50 time steps.Without any mitigation strategy, more than half of the population fail at the end of the process.
We then protect some vital nodes to mitigate the system failure, by using the optimization method proposed above.In Fig. 5(a)(b)(c), we restrict the total resources to be γ tot = 5.Fig. 5(a) shows that the optimization algorithm successfully reduces the final failure rate, which demonstrates the effectiveness of the method.We found that the optimal protection resource distribution {γ * i (t)} mostly concentrates on a few nodes at a certain time step (as shown in Fig. 5(b)), which implies that we can confidently select which nodes to protect.All the nodes with high γ * i (t) receive protection at time t = 0, which implies that the best mitigation strategy in this example is to distribute all γ tot resources as early as possible to stop the infection spread.Fig. 5(c) shows the optimal placement of resources, which can completely block the infection spread, hence minimizing the network failure.In this example, both layers a and b have the same network structure, which is depicted in Fig. 5(c).Similar phenomena are observed in the case with γ tot = 4 as shown in Fig. 5(d)(e)(f), except that the protections are not sufficient to completely block the infection spread.The optimization algorithm sacrifices only two nodes in the vicinity of the infected node in the lower right corner of Fig. 5(f) (indicated by a black arrow), leaving other parts of the network in the normal state.In Fig. 6, we further examine the influence of the total resource availability, i.e., γ tot , on the final failure size N • ρ F (T ) determined at the optimal solution γ * i (t).It is observed that when γ tot increases, the failure size (at the optimum) firstly decreases monotonically, and then saturate when γ tot reaches a certain value such that there are enough protection resources to completely block the infection transmission.Another interesting observation is that for the cases with more initially infected seeds, introducing additional units of protection resource yields a less effective reduction in failure size compared to the cases with fewer initial infected seeds.
The good performance of the optimization is based on the fact that there are enough protection resources (i.e., having a large γ tot ) as well as being aware of the origins of the outbreak.In some cases, whether a node was infected at the initial time is not fully determined but follows a probability distribution.Such cases can be easily accommodated in the DMP framework which is intrinsically probabilistic.We investigated such a scenario with probabilistic seeding in Supplementary Note 6, and found that the optimization method can still effectively reduce the sizes of network failures.

Case Study in a Synthetic Network
To further showcase the applicability of the optimization algorithm for failure mitigation, we consider a synthetic technological multiplex network where layer a represents a communication network and layer b represents a power network.We consider the scenario that the communication network can be attacked by malware but can also be protected by technicians, which is modeled by the proposed SIRP model.The infection of a node in the communication network causes the breakdown of the corresponding node in the power network.The breakdown of components in a power network can trigger further failures and form a cascade, which is modeled by the proposed LTM model.We have neglected the details of the power flow dynamics in order to obtain a tractable model and an insightful simple example. .Final failure size N • ρF (T ) of a binary tree network evaluated at the optimal solution {γ * i (t)}, as a function of the amount of total resources γ tot .Different curves correspond to different number of initially infected seeds.The network topology and the system parameters are the same as those in Fig. 5.
Here, we extract the network topology from the IEEE 118-bus test case to form layer b [53], which has N = 118 nodes.We then obtain layer a by rewiring a regular graph of the same size with degree K = 4 using a rewiring probability p rewire = 0.3, which creates a Watts-Strogatz small-world network and mimics the topology of communication networks [54].The resulting multiplex network is plotted in Fig. 7(a).
As the failures in layer b are initially induced by the infections in layer a, one may wonder whether deploying the protection resources by minimizing the size of infections, i.e., minimizing ρ I (T ) + ρ R (T ) instead of minimizing ρ F (T ), is already sufficient to mitigate the final failures.To investigate this effect, we replace the objective function in Eq. ( 44) by O a (γ) = ρ I (T ) + ρ R (T ) and solve the optimization problem using the same techniques in the subsection "The Optimization Framework".The result is shown in Fig. 7(b), which suggests that blocking the infection is as good as minimizing the original objective function in Eq. ( 44) for the purpose of minimizing the total failure size.Minimizing either objective function constitutes a much better improvement over the random deployment of the same amount of protection resources in this case.
The results in Fig. 7(b) point to the conventional wisdom that one should try best to stop the epidemic or malware spread (in layer a) for mitigating system failure.The situation will be different if there are vital components in layer b, which should be protected to prevent the failure cascade.This is typically manifested in the heterogeneity of the network connectivity or the system parameters.To showcase this effect, we manually plant a vulnerable connected cluster in layer b by setting the influence parameters b ji for an edge (i, j) in this cluster as b ji > Θ i , so that the failure of node j itself is already sufficient to trigger the failure of node i.Such a set-up is relevant for commercial, industrial and engineering networks, among others; e.g., supply chain networks evolve to enhance their throughput and efficiency but may operate with little redundancy and low robustness.In this case, we found that minimizing ρ F (T ) yields a much better improvement over minimizing ρ I (T ) + ρ R (T ) for the purpose of mitigating the system failure, as shown in Fig. 7(c).

Case Studies in a Real-world Social Networks
Lastly, we examine the Kapferer's tailor shop network, a well-known social network dataset gathered by B. Kapferer in Zambia, documenting interactions among workers in a tailor shop [55,56].This dataset records two types of interactions across two different time frames.The first interaction type is termed "sociational", which encapsulates friendship and socioemotional relationship among the workers.The second interaction type is termed "instrumental", which reflects work-and assistance-related connections among them.For our analysis, the "sociational" network observed in the initial time frame is assigned to layer a, acting as the substrate for infection transmission, while the corresponding "instrumental" network is assigned to layer b, where the failure (in terms of work accomplishment) of a node can be triggered by the malfunctioning of its neighboring nodes.These networks are treated as undirected graphs for simplicity.The resulting two-layer network is depicted in Fig. 8(a).
We assign homogeneous values to the majority of system parameters without deliberately introducing any vulnerable component in the network; the set-up closely aligns with the scenario depicted in Fig. 7(b) and presents a stark contrast to the scenario in Fig. 7(c).We select five nodes that possess the highest degrees to serve as the initially infected individuals, which can be viewed as super-spreaders in the network.We then protect the vital nodes to mitigate the system failures by using the optimization method as above, where the result is shown in Fig. 8(b).Interestingly, minimizing the size of failures (i.e., ρ F (T )) is evidently better than minimizing the size of infections (i.e., ρ I (T ) + ρ F (T )) for the purpose of failure mitigation.It suggests that in this realistic and natural scenario, simply blocking the infection transmission is sub-optimal and one needs to take a holistic view of the two-layer model for optimizing the network's utility.

CONCLUSION
We investigate the nature of a type of two-layer spreading processes in unidirectionally dependent networks, comprising two interacting layers a and b.Disease or malware spreads in layer a, which can trigger cascading failures in layer b, leading to secondary disasters.The spreading processes in the two layers are modeled by the SIRP and LTM models, respectively.To tackle the complex stochastic dynamics in the two-layer networks, we utilized the dynamic message-passing method by working out the dynamic belief propagation equations.The resulting DMP algorithms have low computational complexity in sparse networks and allow us to perform accurate and efficient inference of the system states.
Based on the DMP method, we systematically studied and evaluated the impact of the infection-induced cascading failures.The cascade process in layer b can lead to large-scale network failures, even when the infection rate in layer a remains at a relatively low level.By considering a homogeneous network topology and homoge- neous system parameters, we derive the asymptotic and large-size limits of the DMP equations.The asymptotic limit of the two-layer spreading processes corresponds to the coupling between a bond percolation model and a bootstrap percolation model, which can be analytically solved.The infection outbreak size in layer a changes continuously from zero to non-zero as the infection probability β surpasses a transition point β a c , while the failure size in layer b can exhibit a discontinuous jump to the completely failed state when β surpasses another transition point β b c under certain conditions.All these results highlight the observation that cascading failure propagation in layer b can drastically amplify the impact of the epidemic outbreaks in layer a, which requires special attention.
Another advantage of the DMP method is that it yields a set of closed-form equations, which can be very useful for other downstream analyses and tasks.We exploited this property to devise optimization algorithms for mitigating network failure.The optimization method works by back-propagating the impact at the final time to adjust the control parameters (i.e., the protection probabilities).The mirror descent method and a heuristic gradient shift method were also used to handle the constraints on the control parameter.We show that the resulting algorithm can effectively minimize the size of system failures.We believe that our dedicated analyses provide valuable insights and a deeper understanding of the impact the infection-induced cascading failures on networks, and the obtained optimization algorithms will be useful for practical applications in systems of this kind.

DATA AVAILABILITY
Datasets cited in this study are publicly accessible and have been referenced accordingly in the manuscript.

SUPPLEMENTARY NOTE 1: DERIVING THE DMP EQUATIONS FROM DYNAMIC BELIEF PROPAGATION
In this section, we supplement some technical details of the DMP equations based on dynamic belief propagation.Firstly, we provide a summary of the main notations used throughout the study in Supplementary We assume that at the initial time t = 0, node i is either in state S or state I, occurring with probabilities P i S (0) and P i I (0) (with P i S (0) + P i I (0) = 1), respectively.According to dynamical rule of the SIRP model defined in the main text, the transition kernel W i SIRP (•) of the spreading process in layer a admits the following form The transition kernel W i LTM (•) of the cascade process in layer b admits the following form where the first term corresponds to the case where node i fails (in layer b) due to infection (from layer a), while the second term corresponds to the case where node i failed due to losing supports from neighboring nodes ∂ b i in layer b.For infection spread in layer a, the probability of node i in a certain state is computed by tracing over the corresponding probabilities of trajectories m i→j a (•) (note that the process in layer b does not exert any influence on layer a) The computation of these probabilities is similar to that of the SIR model, where we refer readers to Ref. [1] for the details.
For activities in layer b, the probability of node i in state F (failed) is computed as which appears much more difficult to treat due to the dependence on the activities in layer a.In particular, the failure of node i can be attributed to the infection from one of its neighbors from layer a, or to the failures of its neighbors from layer b.For the latter case, node i can be either in state S or in state P at time t, which depends on the infection time τ a i and protection time ε a i .For this reason, we introduce the following conditional failure probability where we have defined which is the cavity probability of node i not in state I or R but being failed at time t due to the failures of neighbors from layer b, while it follows the specific trajectory {τ a i , ω a i , ε a k } in layer a.The marginal probability of node i in state F at time t is obtained by tracing over all the possible trajectories of layer a as where the summation in the last term can be further decomposed into P i→j SF (t) and P i→j P F (t), depending on whether the protection on node i (given at time ε a i ) occurs after time t or before time t In summary, we can decompose P i→j F (t) into four terms where a similar form holds for P i F (t) as stated in the main text.To obtain node-level iteration equations of P i→j SF (t) and P i→j P F (t), the key is to further introduce the auxiliary probabilities χ k→i (t), ψ k→i (t), χk→i (t, ε) and ψk→i (t, ε), defined as These auxiliary probabilities are linked to P i→j SF (t) and P i→j P F (t) through the transition kernel W i SIRP and W i LTM , respectively, and their iteration equations can be mechanistically derived (e.g., by relating χ k→i (t) to χ k→i (t − 1)).The explicit forms of the iteration equations and the physical interpretation of the auxiliary probabilities are stated in the main text.The trajectory-level and node-level messages employed for the two-layer processes are summarized in Fig. S1.

SUPPLEMENTARY NOTE 2: TESTING THE NON-OVERLAPPING APPROXIMATION
In this section, we test the efficacy of the non-overlapping approximation applied to the DMP equations in layer b as introduced in the subsection "Simplification under Small Inter-layer Overlap" in the Methods section of the main text.In particular, we compare the evolution of the expected cascade sizes ρ  approaches; the results are shown in Fig. S2.In Fig. S2(c), both layers are random regular graphs (RRG) which are constructed independently, having distinct network topologies.The estimations given by the DMP method under the non-overlapping approximation matches with those from the full DMP method, as expected.In Fig. S2(a) and (b), both layers have the same network topology, therefore having the maximal neighbor overlap and violating the non-overlapping assumption.In these cases, the non-overlapping approximation for the DMP equations yields less accurate results compared to the complete treatment.Nonetheless, it still maintains the fundamental qualitative characteristics of the cascade dynamics.

SUPPLEMENTARY NOTE 3: DERIVING THE LARGE-TIME LIMIT OF THE DISCRETE-TIME SIR MODEL
Here, we derive the DMP equations of the discrete-time SIR model in the large-time limit, which differs from the continuous-time counterpart [2].The mirror descent algorithm introduced in the main text can be readily applied to solve this optimization problem, except that the shifted gradient g n t = ∇ γ(t) O(γ n (t)) − b n t has a time-dependent shift b n t in this case.The shift magnitude b n t at time t can be derived in the same spirit as in Supplementary Note 4, which admits the following expression .

(S40)
If the resource capacity constraint for γ tot t is still violated, we project the control variables at time t to the feasible region through a simple rescaling γ n (t) ← γ tot t i γ n i (t) γ n (t). (S41) As a concrete example, we consider the synthetic two-layer network and the planted influence parameters {b ji } in the subsection "Case Study in a Synthetic Network" in Results of the main text.The results of online resource supply in this case are shown in Fig. S3, which illustrates the effectiveness of the optimization algorithms for the scenario with the online supply of resources.

SUPPLEMENTARY NOTE 6: PROBABILISTIC SEEDING
There are some cases where the initial infection status of a node i is not fully determined but follows a probability distribution P i I (0), which may be obtained after some inference.In the DMP framework, we simply use the available {P i I (0)} as the initial condition to iterate the DMP equations, and further optimize the evolution by deploying the protection resources.
We consider a similar setting as in the subsection "Case Study in a Tree Network" of Results in the main text, where there are 3 seeds, each of which has P i I (0) = 1.In this section, we consider 6 seeds, each of which has P i I (0) = 1 2 .The results of optimization are shown in Fig. S4.It can be observed that the optimization algorithm can still successfully reduce the final failure size, as in the cases with deterministic seeding.Interestingly, the optimal protection resource distribution {γ * i (t)} is also concentrated among a few nodes at a certain time step (as shown in Fig. S4(b) and (e)), even though one cannot be sure about which nodes are initially infected.This may be due to the simple network topology, where there exists a clear optimal deployment strategy to block the infections coming from the 6 probabilistic seeds altogether, as shown in Fig. S4(c) and (f).

Figure 1 .
Figure 1.An example of the two-layer spreading process considered in this work.A node is in state I if it is infected in layer a, and a node is in state F if it fails in layer b.In this example, node 2 is infected by node 1 in layer a, therefore it turns into state F in layer b.If b24 ≥ Θ4, then node 4 will also fail as it loses the support from node 2, even though node 4 itself has not been infected.
denote the trajectory-level probabilities of the processes in layer a and b, respectively.Note that the messages {m i→j (•)} live in the entire network comprising layers a and b, which implies that {m i→j a (•), m i→j b (•)} are also defined on the entire network.
Figure 2. Comparison of node-level probabilities.The nodelevel probabilities P i S (t) and P i F (t) are obtained by the DMP theory and Monte Carlo (MC) simulation (averaged over 10 5 realizations).Panels (a) and (b) correspond to a binary tree network of size N = 63 for both layers.Panels (c) and (d) correspond to a random regular graph (RRG) of size N = 100 and degree K = 5 for both layers.The system parameters are T = 50, βji = 0.2, µi = 0.5, bji = 1, Θi = 0.6|∂ b i |, γi(t) = 0.

Figure 3 .
Figure 3. Evolution of the sizes of the infection outbreak in layer a and total failures in layer b.The size of infection outbreak is measured by ρI +ρR (green lines), while the size of total failures is measured by ρF (orange lines).Both the DMP method (solid line) and MC simulation (dashed-dotted line) are considered.Layer a and layer b have different network topologies, but both are realizations of random regular graphs of size N = 1600 and degree K = 5.At time t = 0, there are 5 infected nodes.The system parameters are βji = 0.2, µi = 0.5, bji = 1, Θi = 0.6|∂ b i |, γi(t) = 0.

Figure 4 .
Figure 4. Size of infection outbreak and total failures as a function of the infection probability β.The size of infection outbreak is measured by ρI +ρR (green lines), while the size of total failures is measured by ρF (orange lines).The limits of large system size and large time are considered.(a) Random regular graphs with N = 1600, K = 5 are considered.The spreading processes are iterated for T = 100 steps, where stationary states are attained.Both the DMP method (solid line) and MC simulation (dashed-dotted line) are considered.(b) Random regular graphs with K = 5 in the asymptotic limit T → ∞, N → ∞ are considered by analyzing the largetime behaviors of the DMP equations.The triangle and the square markers indicate the locations of the two transition points β a c and β b c , respectively.The system parameters are homogeneous, with µ = 0.5, b = 1, Θ = 3, γi(t) = 0.

Figure 5 .
Figure 5. Mitigation of the network failures in a binary tree network of size N = 63, where both layers have the same topology.Panels (a)(b)(c) correspond to the case with γ tot = 5, while Panels (d)(e)(f) correspond to the case with γ tot = 4. Panels (a) and (d) depict how the final failure size changes during the optimization process.Specifically, the control parameters {γ n i (t)} for each optimization step n were recorded, which were fed to the DMP equations for computing ρF (T ) at step n.Panels (b) and (e) plot the histogram of the optimal decision variables {γ * i (t)}.Panels (c) and (f) show the optimal placement of resources on layer a, where green square nodes receive protection (having a high γ * i (t) at time t = 0).The three red triangle nodes are the initially infected individuals.The system parameters are set as βji = 0.5, µi = 0.5, bji = 1, Θi = 0.6|∂ b i |.

6 Figure 6
Figure 6.Final failure size N • ρF (T ) of a binary tree network evaluated at the optimal solution {γ * i (t)}, as a function of the amount of total resources γ tot .Different curves correspond to different number of initially infected seeds.The network topology and the system parameters are the same as those in Fig.5.

Figure 7 .
Figure 7.A synthetic two-layer network and the evolution of its failure rate.(a) The structure of the two-layer network, where each layer has N = 118 nodes.Layer a is a Watts-Strogatz small-world network, which mimics the topology of communication networks; it is obtained by rewiring a regular graph of degree 4 with rewiring probability prewire = 0.3.Layer b is a power network extracted from the IEEE 118-bus test case.(b) Evolution of the failure rate ρF (t) under various mitigation strategies under homogeneous {bji}.The curve labeled by "random γ" corresponds to the random deployment of a γ tot amount of protection resources at time t = 0; 20 different random realizations are considered and the error bar indicates one standard deviation of the sample fluctuations.The time window is set as T = 50.Most system parameters are homogeneous with βji = 0.2, µ = 0.5, bji = 1, while Θi = 0.6|∂ b i |.Five nodes are randomly chosen as the in1itially infected individuals, and γ tot = 10 is considered.(c) Evolution of the failure rate ρF (t) under various mitigation strategies under planted {bji}.The system parameters are βji = 0.17, µ = 0.5, Θi = 0.6|∂ b i |.Planted influence parameters {bji} are considered.Three nodes are randomly chosen as the initially infected individuals, and γ tot = 9 is considered.Other experiment set-ups are identical to those in Panel (b).

Figure 8 .
Figure 8.The Kapferer's tailor shop network and the evolution of its failure rate.(a) The structure of the Kapferer's tailor shop network, which involves interactions among 39 workers in a tailor shop in Zambia during a period of one month.Layer a represents the "sociational" relations, while layer b represents the "instrumental" relations.(b) Evolution of the failure rate ρF (t) of the tailor shop network up to time T = 10 under various mitigation strategies.The "random γ" strategy has the same set-up as the one in Fig. 7(b); 20 different random realizations are considered and the error bar indicates one standard deviation of the sample fluctuations.. Most system parameters are homogeneous with βji = 0.07, µ = 0.5, bji = 1, while Θi = 0.6|∂ b i |.Five nodes with the highest degrees in layer a are selected as the initially infected individuals, and γ tot = 10 is considered.

Figure S1 .Figure S2 .
Figure S1.Messages employed for the two-layer processes considered.(a) The trajectory-level messages {m i→j (•)} live in the entire network comprising layers a and b, which implies that {m i→j a (•), m i→j b (•)} are also defined on the entire network.(b)The messages {P k→i S (t), θ k→i (t), φ k→i (t)} are only needed for edges belonging to layer a where the SIRP model is defined.The node-level messages for the SIRP process only enter into the DMP equations for the LTM process through the overlapping neighbors ∂ a i ∩ ∂ b i (i.e., {k2, k3} in this example).

Figure S3 .= 2
Figure S3.Evolution of the size of failures ρF (t) under various mitigation strategies.A synthetic two-layer network (mimicing communication network and power grid) is considered.The time window is set as T = 50.The system parameters are βji = 0.17, µ = 0.5, Θi = 0.6|∂ b i |.Planted influence parameters {bji} are considered.Three nodes are randomly chosen as the initially infected individuals, and γ tot t = 2 is considered.

Figure S4 .
Figure S4.Mitigation of the network failures in a binary tree network of size N = 63 for both layers.Panels (a)(b)(c) correspond to the case with γ tot = 5, while Panels (d)(e)(f) correspond to the case with γ tot = 4. Panels (a) and (d) depict how the final failure size changes during the optimization process.Panels (b) and (e) plot the histogram of the optimal decision variables {γ * i (t)}.Panels (c) and (f) show the optimal placement of resources, where green square nodes receive protection (having a high γ * i (t) at time t = 0).The six red triangle nodes are the potential seeds for infection, each of which has an initial infection probability P i I (0) = 1 2 .

Table 1 .
Table. 1. denotes that node i is in state N , while xi = 1 denotes that node i is in state F βji the probability of infection transmission from node j (in state I) to node i (in state S) γi(t) the probability that a susceptible node i turns into the protected state at time t + 1 Θi a threshold parameter of node i below which the node fails bji the influence parameter which measures the importance of the failure of node j on node i ∂ Summary of the main notations.