## Introduction

The study of interdependent networked systems has recently received considerable attention1,2,3,4,5,6,7,8,9,10,11, with the majority of work so far focussed on the interactions between different ‘critical infrastructures’12,13,14,15,16. We argue that critical infrastructures should themselves be viewed as a special class of interdependent systems, due to the presence of in-built monitoring and control mechanisms12,17,18. The type of control most prevalent in such systems is so-called ‘supervisory’ control (as distinguished from, say, controllability19), which typically involves monitoring an underlying process, with the option of a pre-defined intervention once a critical state is reached. Here, in keeping with the picture of interdependent networks, both monitoring and intervention are local processes, associated with specific points on the underlying network. Furthermore, we are interested in the case where the control is ‘distributed’, that is, where the local interventions are coordinated via communications between sensors. At the most general level, we are interested in building a physics-like model of such systems: complicated enough to capture the interesting behaviour, but sufficiently idealized that the mechanisms at play can be easily identified and understood.

For modelling the electrical network we adopt a straightforward approach which has been proposed and analysed elsewhere22. The approach assumes a set of producers and consumers linked by power lines, where the resulting load carried by each line, or edge, may be represented by a random variable drawn from a uniform distribution U. Since U is properly normalized, its upper and lower bounds are tied to the average load: the average load is the midpoint of the two bounds.
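As a minimal sketch of this load model, the snippet below draws edge loads from a uniform distribution whose bounds are fixed by the average load. The function names are illustrative, and taking the upper bound equal to the line capacity is an assumption of the sketch; the text only states that the bounds are related to the average.

```python
import random

def load_bounds(mean_load, upper=1.0):
    """Bounds of a uniform distribution with the given mean.

    For U(lower, upper) the mean is (lower + upper) / 2, so fixing the
    mean and the upper bound fixes the lower bound. Taking upper = 1.0
    (the line capacity) is an assumption of this sketch.
    """
    lower = 2.0 * mean_load - upper
    if lower < 0.0 or lower > upper:
        raise ValueError("mean_load must lie in [upper / 2, upper]")
    return lower, upper

def sample_loads(n_edges, mean_load, rng, upper=1.0):
    """Draw one initial load per edge from U(lower, upper)."""
    lower, upper = load_bounds(mean_load, upper)
    return [rng.uniform(lower, upper) for _ in range(n_edges)]

rng = random.Random(0)
loads = sample_loads(1500, mean_load=0.7, rng=rng)
print(min(loads), max(loads), sum(loads) / len(loads))
```

With these bounds no edge starts above capacity, so any overload in the sketch can only arise from redistribution after a failure.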

In keeping with the above, it is also assumed that the transmission lines have an intrinsic carrying capacity (assumed here, without loss of generality, to be one) which, if exceeded, causes the line to fail and the load to be redistributed evenly amongst its nearest neighbours23. The crucial departure from Ref. 22 is in our choice of network topology. Since many critical infrastructures are, to a good approximation, planar subdivisions24, we use the well-known Delaunay triangulation25, which is a simple, reasonable model for planar networks such as power grids.
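The failure rule can be sketched as follows. This is an illustration rather than the authors' implementation: it assumes that the ‘nearest neighbours’ of a line are the lines sharing an endpoint with it, and the helper names are hypothetical.

```python
from collections import defaultdict

CAPACITY = 1.0  # carrying capacity of every line, as in the text

def neighbouring_edges(edges):
    """Map each edge to the edges that share an endpoint with it (assumed
    meaning of 'nearest neighbours')."""
    by_node = defaultdict(set)
    for u, v in edges:
        by_node[u].add((u, v))
        by_node[v].add((u, v))
    return {e: (by_node[e[0]] | by_node[e[1]]) - {e} for e in edges}

def fail_edge(edge, load, neighbours):
    """Remove `edge` from `load` and share its load evenly among the
    neighbouring edges that are still present.

    Returns the neighbours that are now above capacity.
    """
    alive = [e for e in neighbours[edge] if e in load]
    shed = load.pop(edge)
    for e in alive:
        load[e] += shed / len(alive)
    return [e for e in alive if load[e] > CAPACITY]

edges = [(0, 1), (1, 2), (1, 3)]
load = {(0, 1): 0.8, (1, 2): 0.7, (1, 3): 0.3}
overloaded = fail_edge((0, 1), load, neighbouring_edges(edges))
print(overloaded)
```

In this small example the failed line sheds 0.4 onto each neighbour, which pushes only the line (1, 2) above capacity.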

## Results

We test the vulnerability of our model against failure cascades by using computer simulations (see Methods section for details). For given values of the parameters p, q, μ and the average load, we repeatedly generate instances of the ensemble, each time initiating a cascade according to a ‘fallen tree’ approach: an unspecified external event removes an edge and, if it is supervised, the associated control device. Following each cascade, Nlcc, the size of the remaining largest connected component of the underlying electricity network, is recorded. We assume that administrators/designers of real systems are interested in ensuring that cascades are bounded by a certain size. To this end, we consider

$$P_\varepsilon = \mathrm{Prob}\left(1 - \frac{N_{lcc}}{N} < \varepsilon\right),$$

the probability that, following a cascade, the number of nodes disconnected from the largest connected component (the effective cascade ‘size’, 1 − Nlcc/N) is less than a fraction ε ∈ (0, 1] of the original nodes.
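In practice this probability is estimated as a frequency over many simulated cascades. A minimal sketch of such an estimator is given below; the function names are illustrative and the input values are made up for the example.

```python
def cascade_size(n_lcc, n_nodes):
    """Effective cascade size: fraction of nodes cut off from the
    largest connected component."""
    return 1.0 - n_lcc / n_nodes

def bounded_probability(lcc_sizes, n_nodes, eps):
    """Fraction of cascades whose size is strictly below eps.

    lcc_sizes holds one recorded N_lcc value per simulated cascade.
    """
    if not 0.0 < eps <= 1.0:
        raise ValueError("eps must lie in (0, 1]")
    bounded = sum(1 for n_lcc in lcc_sizes
                  if cascade_size(n_lcc, n_nodes) < eps)
    return bounded / len(lcc_sizes)

# Four illustrative cascades on a 500-node system.
print(bounded_probability([500, 400, 200, 100], n_nodes=500, eps=0.5))
```

Here the cascade sizes are 0, 0.2, 0.6 and 0.8, so two of the four cascades are bounded by ε = 1/2.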

In general, as one would expect, the larger the average load carried by the system, the smaller the probability that the cascade size is bounded (see Fig. 2a). However, we also observe another feature of this type of cascading model, first identified in Ref. 22: for each value of p, there is a non-zero critical value of the average load, which corresponds to the maximum average load below which cascade sizes are bounded with probability one (within a given accuracy, here 1 part in 5 × 10³). Plotting this critical value against p, a sharp transition can be observed at some point p* (see Fig. 2b). Above this value, the fraction of disconnected nodes is always bounded by ε = 1/2, regardless of how much load the system is carrying. In the completely reliable case (q = 1), p* just corresponds to the percolation threshold pc (~0.33 for Delaunay triangulations26). The cascades are then ‘percolation controlled’ due to the formation of a giant component connected by supervised edges, coined here the giant supervised component (GSC). The upper bound on cascade size that is enforced by the GSC can be lowered by employing more control devices, i.e., by increasing p (see Fig. 2c). For p ≥ 1 − pc, most nodes are connected by supervised edges and cascades cannot disconnect nodes from the giant component.
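To make the notion of a giant supervised component concrete, the sketch below places a control device on each edge with probability p and measures the largest cluster of nodes joined by supervised edges, using a union-find structure. It works on any edge list; the small path graph in the usage lines is purely illustrative and is not a Delaunay triangulation.

```python
import random

def place_devices(edges, p, rng):
    """Flag each edge as supervised independently with probability p."""
    return [rng.random() < p for _ in edges]

def largest_supervised_component(n_nodes, edges, supervised):
    """Size of the largest set of nodes connected by supervised edges."""
    parent = list(range(n_nodes))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (u, v), flag in zip(edges, supervised):
        if flag:
            parent[find(u)] = find(v)

    sizes = {}
    for node in range(n_nodes):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values())

path = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(largest_supervised_component(5, path, [True, True, False, True]))
```

Sweeping p and plotting the largest cluster size divided by the number of nodes is one simple way to see the percolation transition referred to above.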

Whilst q = 1, the only impact of decreasing μ is to increase the number of devices disconnected by the initial external shock. Disregarding the correlation induced by starting the cascade at the point of disconnection, this effect corresponds to a small shift

in the positive x-direction of Figs. 2b and 2c. Here, 〈s〉 is the average sub-tree size associated with a randomly chosen node (see Fig. 2c inset). Figure 2d shows the effects of this shift when p > p*, for both large and small ε. Here, it is natural to characterize changes in μ by a normalized cost

where L(μ) is the total length of the supervisory network. The message of Fig. 2d is that increasing the number of direct CPU connections, at the cost of increased network length, is only beneficial if the suppression of small cascades is desired.

If, in contrast to above, the control devices have an inherent rate-of-failure (q < 1), then a GSC may be either disconnected or reduced in size as control devices fail. In the best case scenario, when the supervising network is mono-centric and q is close to one, the picture is one of ‘effective percolation’ with (see Methods)

$$p^{*} = p_c\, q^{-\alpha}, \qquad (6)$$

where α is determined by the topology of the underlying network (~2.4 for a Delaunay triangulation, see Methods section for details). This simple form shows good agreement with direct estimates of the value of p* (see Fig. 3b and Methods for details). For lower values of q, percolation-like descriptions are no longer appropriate: regardless of the number of control devices, it is not possible to bound cascade sizes in a way that is independent of the average load carried by the system. Indeed, if control devices are both unreliable (q < 1) and the control network is tree-like (μ < 1), the system is very susceptible to large failure cascades, with little impact made by increasing p (see Fig. 4). In this case, we see that for both large and small cascades, the topology of the control network is very relevant and can induce extreme fragility in the control system (see Fig. 5).
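A short numerical illustration of the effective-percolation picture follows. It assumes, following the Methods argument, that the effective probability of an edge resisting failure is p multiplied by q raised to the power α, so that the threshold is reached when this product equals pc. The function name is illustrative; the default values pc ≈ 0.33 and α ≈ 2.4 are those quoted in the text.

```python
def effective_threshold(q, p_c=0.33, alpha=2.4):
    """Density of control devices at which p * q**alpha reaches p_c.

    Assumes the effective-percolation form discussed in the text; a
    returned value above 1 means no density of devices is sufficient.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    return p_c * q ** (-alpha)

for q in (1.0, 0.95, 0.9, 0.8):
    print(q, round(effective_threshold(q), 3))
```

For q = 1 the function returns pc itself, and for q = 0.95 it gives a threshold of roughly 0.37, so a modest loss of reliability already requires noticeably more devices.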

## Discussion

In conclusion, we have introduced a minimal model which incorporates the salient features of many real-world control systems. Firstly, the control devices are simple: they only have so-called ‘supervisory’ functions of monitoring and intervention. Secondly, the system is ‘distributed’, that is, not only are the devices positioned in space but they require coordination— in this case, by connection to a CPU. Thirdly, we also incorporate the effects of devices having an inherent rate-of-failure. With only these simple characteristics, the resulting behaviour is very rich. The primary feature concerns the fragility of such control systems: a small reduction of control device reliability leads to a regime where the ability to suppress cascades is dramatically affected by the topology of the control network. Our results suggest that it is much more cost-effective to try to improve the reliability of control devices rather than working on the stability of the supervisory control network. We believe that these results make a first step in understanding distributed supervisory control, whilst also providing helpful guidelines to designers and administrators of real systems. We welcome further work in the area.

## Methods

### Simulations

To simulate the system, N nodes are placed in the plane at random, the Delaunay triangulation is then formed, and loads are allocated to the resulting edges according to the uniform distribution U. The supervisory network is incorporated by first adding a control device to each edge with probability p, then forming the network according to the rewiring procedure described in the main text (dependent on parameter μ). Cascades are initiated by an external event that removes an edge at random and redistributes its load amongst its nearest neighbours. If the failing edge was supervised, then the control device is also removed. During the ensuing cascade, we stipulate that for a control device to work, it must be connected to the CPU, a special node that cannot be removed. A device that is not connected cannot work and is of no use. However, if a control device is connected and it is supervising an edge that is about to fail (i.e., the edge carrying the largest excess load in the system), then there is a probability q that the excess load is dissipated and the load of the edge is reset to . The quantity q can be thought of as the inherent reliability of a device.

Simulations were written in C++ and implemented using the Boost Graph Library27 where possible. Delaunay triangulations were produced using an iterative algorithm25.
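The original C++ code is not reproduced here. The Python sketch below illustrates the cascade loop under several simplifying assumptions that are not taken from the paper: every device is wired directly to the CPU (so a device works as long as its edge is present), an edge's neighbours are the edges sharing an endpoint with it, and the load after a successful intervention is a free parameter, here called `reset_load`. The graph is passed in as an edge list, so the Delaunay construction and the μ-dependent rewiring are omitted.

```python
import random
from collections import defaultdict

CAPACITY = 1.0  # carrying capacity of every line

def run_cascade(edges, load, supervised, q, reset_load, rng):
    """Run one 'fallen tree' cascade and return the set of surviving edges.

    supervised : edges carrying a device (all assumed connected to the CPU)
    q          : probability that a solicited device dissipates the excess
    reset_load : load after a successful intervention (hypothetical parameter)
    """
    if reset_load > CAPACITY:
        raise ValueError("reset_load must not exceed the capacity")
    load = dict(load)
    supervised = set(supervised)
    by_node = defaultdict(set)
    for e in edges:
        by_node[e[0]].add(e)
        by_node[e[1]].add(e)

    def shed(edge):
        # Remove the edge (and its device) and share its load evenly
        # among the surviving edges that touch either endpoint.
        alive = [e for e in (by_node[edge[0]] | by_node[edge[1]])
                 if e != edge and e in load]
        amount = load.pop(edge)
        supervised.discard(edge)
        for e in alive:
            load[e] += amount / len(alive)

    shed(rng.choice(sorted(load)))  # external event removes a random edge
    while load:
        worst = max(load, key=load.get)  # largest load, hence largest excess
        if load[worst] <= CAPACITY:
            break
        if worst in supervised and rng.random() < q:
            load[worst] = reset_load  # device dissipates the excess
        else:
            shed(worst)  # unsupervised edge, or the device failed
    return set(load)

def largest_component(n_nodes, edges):
    """Size of the largest connected component (union-find)."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    sizes = defaultdict(int)
    for node in range(n_nodes):
        sizes[find(node)] += 1
    return max(sizes.values())

ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
heavy = {e: 0.99 for e in ring}
left = run_cascade(ring, heavy, supervised=ring, q=1.0,
                   reset_load=0.5, rng=random.Random(1))
print(len(left), largest_component(4, left))  # prints "3 4"
```

With perfectly reliable devices on every edge, only the initially removed edge is lost and the four nodes stay connected. Running the same ring with no devices removes every edge, which is the kind of contrast the ensemble statistics quantify.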

Results are presented for systems of size N = 500 (~3 × 10³ edges) and statistics are calculated over 5 × 10³ instances of each ensemble (defined by the parameters p, q, μ and the average load). The critical average load and p* are accurate up to an error of approximately ±0.02, since they are identified by varying the underlying parameter by finite increments. In Figs. 4c and 5, the critical average load corresponds to Pε > 0.99 in order to accommodate the noise associated with different control network structures.

### Formation of an effective GSC

Labelling each supervised edge by i = 1, 2, …, Es, the probability that a supervised edge survives a cascade is $q^{n_i}$, where ni is the number of times a device is solicited, i.e., the number of times it tries to dissipate its excess load, succeeding each time with probability q. Here, for large enough systems the number of supervised edges is given by Es = pE. (Since the average degree of a Delaunay triangulation is close to six, the total number of edges E is well approximated by E ~ 3N.) Using a bar to denote the system average over supervised edges, we know that if Var [n] is small, then $\overline{q^{n}} \approx q^{\bar{n}}$. Approximating a large system average with an ensemble average 〈…〉 over many smaller systems, the results are given in Table 1. Here it is clear that the average 〈n〉 is well approximated by the value 2.4, regardless of p and q, and that the variance is always very small compared to the average. We can then write the effective probability that a generic edge resists failure as

$$p_{\mathrm{eff}} = p\, q^{\alpha},$$

with $\alpha = \langle n \rangle$. The threshold of resilience is then reached when peff = pc, which implies Eq. (6).
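For readers who want the intermediate steps, the argument can be sketched as follows. The notation is assumed for this sketch: $\bar{n}$ is the system average of the number of solicitations, identified with α, and the expansion is to second order in the fluctuations of n. The first-order term vanishes because the fluctuations average to zero, which is why a small variance is sufficient.

```latex
\overline{q^{n}}
  = q^{\bar{n}}\;\overline{e^{(n-\bar{n})\ln q}}
  \approx q^{\bar{n}}\left[1 + \tfrac{1}{2}(\ln q)^{2}\,\mathrm{Var}[n]\right]
  \approx q^{\bar{n}},
\qquad
p_{\mathrm{eff}} = p\,q^{\alpha} = p_c
\;\Longrightarrow\;
p^{*} = p_c\,q^{-\alpha}
\approx p_c\left[1 + \alpha\,(1-q)\right] \quad (q \to 1).
```

The final linearization shows that, close to perfect reliability, each percentage point lost in q raises the required density of devices by roughly α percent of pc.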

Equation (6) may be contrasted with a direct approximation of when an effective GSC forms. From simulation results, we associate each transition with the value pmid, defined as halfway between pc and the lowest value of p for which the critical average load is maximal (i.e., the midpoint of the transition).