arising from S. Flaxman et al. Nature (2020)

Flaxman et al.1 took on the challenge of estimating the effectiveness of five categories of non-pharmaceutical intervention (NPI)—social distancing encouraged, self isolation, school closures, public events banned, and complete lockdown—on the spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). On the basis of mortality data collected between January and early May 2020, they concluded that only one of these, the lockdown, had been effective in 10 out of the 11 European countries that were studied. However, here we use simulations with the original model code to suggest that the conclusions of Flaxman et al. with regard to the effectiveness of individual NPIs are not justified. Although the NPIs that were considered have indisputably contributed to reducing the spread of the virus, our analysis indicates that the individual effectiveness of these NPIs cannot be reliably quantified.

Flaxman et al.1 presented a method to estimate the effects of NPIs on the time-varying reproduction number (Rt) of SARS-CoV-2 infection. Data from 11 European countries were pooled on the basis of the assumption that the effects of NPIs on Rt are not country-specific: the factor of relative change in Rt resulting from a particular NPI was assumed to be independent of the country in which the NPI was implemented.

Some country-specific flexibility was, however, provided through the basic reproduction number (R0) being country-specific. More notably, additional flexibility was introduced by ascribing a country-specific effect to the NPI that was introduced last in each country. This replaced the parameterization in a preprint version (Imperial College Report 13)2, in which a country-specific effect was instead assigned to the lockdown NPI.
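
As a sketch of the parameterization described above, and assuming that each NPI acts on Rt through a multiplicative factor, the structure can be written as follows. The symbols \phi_k and \psi_m and the indicator names are illustrative notation, not taken from Flaxman et al., and the exact functional form of the published model may differ.

```latex
R_{t,m} \;=\; R_{0,m}\,
  \Bigl(\prod_{k} \phi_k^{\,I_{k,t,m}}\Bigr)\,
  \psi_m^{\,I^{\mathrm{last}}_{t,m}}
```

Here m indexes countries and k the NPI categories; I_{k,t,m} equals 1 once NPI k is in place in country m at time t and 0 otherwise; \phi_k is the pooled factor shared by all countries; and \psi_m is the country-specific factor attached to the last NPI. Under the preprint parameterization, the last-NPI indicator would be replaced by the lockdown indicator, which is 0 throughout for Sweden, so \psi_m would have no effect there.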

Our criticism concerns the final published version of the model1,3. Previous iterations of the model are not explicitly considered, but we reference them for two purposes: (1) to demonstrate the sensitivity of the final published model to subtle and realistic alterations in parameter values; (2) to illustrate how the modelling choices appear to lack motivation other than to introduce flexibility, which masks sensitivity issues pertaining to the fundamental structure of the model. As made evident below, we believe the core problem is that the death data are not descriptive enough to support the conclusions of Flaxman et al., which were based on simulation results obtained using an over-flexible model.

Of the 11 modelled countries, Sweden is worthy of particular attention, given that it was the only country in which no lockdown took place. As we have previously shown4, the estimated effects of NPIs change markedly when the model is not allowed to give the Swedish data the special treatment that the country-specific last NPI parameter enables. The country-specific last NPI parameter is needed to explain the decrease of Rt supported by the Swedish death data, and to provide a good model fit despite the absence of a lockdown in Sweden.

Figure 1 shows the outcome for Sweden when executing the model3,5 either with (Fig. 1a) or without (Fig. 1b) the last NPI adjustment in place. With the last NPI adjustment in place, the public events ban results in a mean reduction of Rt of 71% (95% credible interval: 59–81%) in Sweden, which contrasts with the negligible effect of the public events ban in the other 10 countries (less than 2% mean reduction of Rt and less than 15% with 95% credibility). Notably, the estimated effectiveness of the public events ban in Sweden is comparable to that of lockdown in the 10 countries in which one was implemented. As lockdown was the last intervention in most countries, its estimated effect comprises a pooled effect (82% mean reduction of Rt) and a separate country-specific ‘last NPI’ effect (mean change in Rt of between −24% and 18% for the countries considered).

Fig. 1: Estimated effectiveness of the public events ban in Sweden.

Top, posterior credible intervals for the reproduction number Rt in Sweden. Bottom, effectiveness of the pooled interventions in the 11 modelled European countries. a, Reproduced results from Flaxman et al.1, using the original model code3,5, including a country-specific effectiveness parameter for the last NPI to be implemented in each country. This corresponds to a country-specific effectiveness for the public events ban in Sweden and for lockdown in the other 10 countries. b, Results using the same code, but with the ‘last NPI’ parameters replaced with country-specific parameters for the lockdown NPI, as in the preceding report2. This change does not affect the 10 countries for which lockdown was the last NPI, but for Sweden it removes the flexibility of a last NPI parameter, which is needed to explain the Rt value supported by the Swedish death data.

The result above (that is, the public events ban in Sweden appearing about as effective as the lockdown in the 10 other European countries) was not addressed by Flaxman et al., which is noteworthy because it undermines the conclusion that lockdown was especially effective. Furthermore, had the last-intervention parameter not been introduced after the publication of the preprint2, the inconsistency would have been readily visible in the reported plots (Fig. 1b).

It seems unlikely to be a coincidence that lockdown was implemented in the 10 countries in which it had a large effect on Rt, and omitted in the single country in which the public events ban instead had a similar effect (sufficient to drive Rt below 1). An alternative hypothesis is that the infection-to-death distribution used by the model, combined with the death data that were available by early May, makes the model ascribe almost all of the reduction in Rt to the last intervention that was implemented in each country. This hypothesis is supported by executing the model code3,5 with different interventions being defined as having occurred last in the country in which no lockdown occurred (Sweden), as shown in Fig. 2.

Fig. 2: The effects of interventions on virus spread in Sweden, with slightly varying definitions of the interventions.

a, School closure defined to have taken place on 18 March 2020; public events ban defined to have taken place on 29 March 2020. b, Same as a, but with the public events ban moved back to 12 March 2020. c, Same as b, but with school closure defined not to have taken place. As expected, the visual appearance of the plots is similar, with the last intervention contributing most to the reduction of virus spread. This is problematic, as the last intervention differs between a, b and c, with each relying on equally motivated NPI implementation dates that were introduced by Flaxman et al.1,2 in different versions of the model code3. The conclusion is that subtle changes in the definitions of NPIs result in a great deal of variation in the estimated effectiveness of the NPI categories considered.

Exchanging the last intervention for a different one is not merely interesting from a theoretical perspective. For example, it is hard to judge whether transitioning to online teaching at high school and university levels, while keeping elementary schools and preschools open, constitutes a school closure or not. Similarly, the crowd-size limit associated with the public events ban NPI remains a parameter to be decided by the modeller. Early versions of the model defined the public events ban to have taken place in Sweden on 12 March 2020, when gatherings exceeding 500 persons were prohibited. This was later changed to 29 March 2020, when gatherings exceeding 50 persons were prohibited. These subtle alterations of the definitions alter which NPI, of school closure, public events ban, or social distancing encouraged, was the last to be implemented in Sweden. In each case, the model uses the last intervention to explain the majority of the drop of Rt to below 1, which is needed to stay consistent with the decrease in reported deaths.
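
The dependence of the 'last NPI' on these definitions can be made concrete with a minimal sketch. It uses the dates given for the three scenarios of Fig. 2; the date for 'social distancing encouraged' is not stated here, so the value below is a placeholder chosen only to fall between 12 and 18 March 2020, and the scenario and key names are illustrative.

```python
from datetime import date

# Placeholder (assumption, not stated in the text): any date between
# 12 and 18 March 2020 gives the same ordering in all three scenarios.
SOCIAL_DISTANCING = date(2020, 3, 16)

# NPI dates for Sweden under the three definitions of Fig. 2.
scenarios = {
    "a": {
        "school_closure": date(2020, 3, 18),
        "public_events_ban": date(2020, 3, 29),
        "social_distancing_encouraged": SOCIAL_DISTANCING,
    },
    "b": {
        "school_closure": date(2020, 3, 18),
        "public_events_ban": date(2020, 3, 12),
        "social_distancing_encouraged": SOCIAL_DISTANCING,
    },
    "c": {  # school closure defined not to have taken place
        "public_events_ban": date(2020, 3, 12),
        "social_distancing_encouraged": SOCIAL_DISTANCING,
    },
}


def last_npi(npi_dates):
    """Return the name of the NPI with the latest implementation date."""
    return max(npi_dates, key=npi_dates.get)


last_by_scenario = {name: last_npi(d) for name, d in scenarios.items()}
for name, npi in last_by_scenario.items():
    print(name, npi)
```

Under these assumptions, each scenario yields a different last NPI, and therefore a different NPI that receives the country-specific parameter.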

As mentioned above, our analyses were conducted using the original model implementation3,5 referenced from the final published paper1, and we have considered the definitions of NPIs reported in the preceding versions of the model1,2,3 solely to highlight how small and plausible perturbations of these definitions can result in a lack of practical identifiability, in the statistical sense. Identifiability issues have to some extent been acknowledged by the authors; Flaxman et al. state that “The close spacing of interventions in time [...] means that the individual effects of the other interventions are not identifiable”1. However, this is overshadowed by the subsequent presentation of credible intervals for the effects of the different NPIs, and the claim that “Lockdown has an identifiable large effect on transmission (81% (75–87%) reduction)”1. We believe that the basis of this claim is unclear. As seen in the supplementary videos of the Nature article1, the credible intervals narrow as more data become available, further hiding the identifiability problems of the underlying model and potentially giving the results a false sense of reliability.

Our point here is not to argue whether or not a school closure took place in Sweden, or what the most appropriate crowd-size limit is. Instead, our findings highlight that the model presented by Flaxman et al. is very sensitive to reasonable, minor changes in the input data. As indicated by our simulation examples, and further supported by our previous analyses4, there is a fundamental problem with the identifiability of the effectiveness of individual NPIs, including the lockdown. This problem is caused by the close temporal spacing between the implementation of these NPIs throughout Europe. In particular, we note in relation to the lockdown NPI that an estimated value that is considerably larger than zero should not be confused with statistical identifiability of the corresponding parameter.
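
The effect of close temporal spacing can be illustrated with a toy calculation. This is not the model of Flaxman et al.: it assumes a simple multiplicative step change in Rt, two hypothetical NPIs introduced two days apart, and made-up values for R0 and the total reduction. It omits the infection-to-death distribution, through which Rt is only indirectly related to the observed deaths.

```python
def rt_trajectory(r0, n_days, npis):
    """Rt per day, where each (start_day, factor) pair in `npis`
    multiplies Rt by `factor` from `start_day` onward."""
    trajectory = []
    for t in range(n_days):
        rt = r0
        for start_day, factor in npis:
            if t >= start_day:
                rt *= factor
        trajectory.append(rt)
    return trajectory


R0 = 3.0      # hypothetical basic reproduction number
N_DAYS = 60   # hypothetical observation window

# Same total reduction (75%), attributed entirely to one NPI or the other.
attrib_first = rt_trajectory(R0, N_DAYS, [(10, 0.25), (12, 1.0)])
attrib_second = rt_trajectory(R0, N_DAYS, [(10, 1.0), (12, 0.25)])

# Days on which the two attributions imply a different Rt.
days_differing = [t for t in range(N_DAYS) if attrib_first[t] != attrib_second[t]]
print(days_differing)
```

In this sketch, the two opposite attributions imply the same Rt on all but two of the 60 days, so data that constrain Rt only loosely in time would struggle to tell them apart.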

Although we fully support the ambition of Flaxman et al.1 to estimate the effectiveness of different NPIs from the available data, we find the underlying modelling approach problematic. Flexible parameterization leads to issues with identifiability, which are masked by model assumptions. In particular, we find it questionable to assign a country-specific effectiveness parameter to the last NPI that was introduced in each country. Besides the problems illustrated in Fig. 2, with large variations in the estimated effectiveness of NPIs, this prohibits prospective use of the model, as it is unknown at any given time whether the latest NPI will also be the last to be implemented in a particular country.

We conclude that the model1,3 is in effect too flexible, and therefore allows the data to be explained in various ways. This has led the authors to go beyond the data in reporting that particular interventions are especially effective. This kind of error—mistaking assumptions for conclusions—is easy to make, and not especially easy to catch, in Bayesian analysis. As NPIs are revoked, and possibly reintroduced over an extended period of time, more data will become available and practical identifiability of the separate effects of NPIs may be obtained. Until then, we suggest that the model1,3, and its conclusion that all NPIs apart from lockdown have been of low effectiveness, should be treated with caution with regard to policy-making decisions.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.