Replying to K. Soltesz et al. Nature https://doi.org/10.1038/s41586-020-3025-y (2020)

The accompanying Comment1 concerns our original paper, Flaxman et al.2, in which we introduced a Bayesian hierarchical model to estimate the transmission intensity (in terms of the time-varying reproduction number, Rt) of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) from observed counts of coronavirus disease 2019 (COVID-19)-related deaths. We parameterized Rt in terms of a binary set of government-mandated non-pharmaceutical interventions (NPIs), with the motivation of examining how effective NPIs were at controlling the transmission of SARS-CoV-2. We concluded that the NPIs that were widely used across Europe successfully drove Rt below 1, thus controlling the epidemic. However, we were unable to disentangle the effect sizes of the NPIs we considered, except for concluding that lockdown had a stronger effect than the other NPIs.

We start by giving some background on the evolution of the paper. Our first preprint, released as Imperial College Report 13 (ref. 3), was based on data up to 28 March 2020 and used a simpler model in which the effect size of each intervention on transmission is the same across countries (here referred to as a full pool model; in our published paper2 we use this model for the single-country analyses reported in supplementary discussion 8 of the paper). As more data became available (Flaxman et al.2 uses data up to 4 May 2020), more heterogeneity between countries became evident and we therefore extended the full pool model.

This extended model, which is the one used in Flaxman et al.2, includes a random effect, with the aim of capturing country-specific variation in the effectiveness of the last government-mandated intervention or interventions; for example, lockdown in Italy, lockdown and a ban on public events in the UK, and a ban on public events in Sweden (see Extended Data Fig. 1). Random effects are common components of statistical models to account for heterogeneity not explained by covariates4,5,6.

The focus of Soltesz et al.1 is the size of the random effect assigned by our model to the last intervention in Sweden. Specifically, a large random effect is needed to explain the Swedish data, and this could have been more explicitly stated in our original paper. Soltesz et al.1 claim that the difference between effect sizes in a full pool model and in Flaxman et al.2 points to our model having little practical statistical identifiability. On this basis, Soltesz et al.1 question whether the effectiveness of lockdown can be resolved to the degree our paper stated.

The main goal of our paper was to examine multiple countries to see what worked in most places, not to explain the trajectory of the epidemic in each individual country. Although we feel that Soltesz et al.1 raise an interesting point, we stand by our assessment that the effectiveness of NPIs can in principle be identified when looking at what worked in most countries, subject of course to the available data.

Here we present further analyses that support our finding reported in Flaxman et al.2 that lockdown was an identifiable intervention with a major effect. We accept that additional covariates beyond the timing of mandatory measures are likely to be needed to provide a fully satisfactory explanation of the trajectory of the epidemic in Sweden, as that country relied on voluntary social distancing measures rather than government-mandated interventions.

Because our goal was to estimate which NPIs worked consistently in most countries, we argue that an analysis of the effectiveness of NPIs should be robust to leaving any one country out. In Extended Data Fig. 1 of this Reply we compare results from the full pool model (used by Soltesz et al.1), the model used in Flaxman et al.2, and a partial pool model, removing one country at a time from the input data. In the partial pool model4,6,7, all NPIs have both a random effect component shared between all countries and a country-specific random effect (via a Gaussian shrinkage prior).

In the full pool model, results for effect sizes are dependent on whether Sweden is included, hence Sweden has a very high statistical influence8. As seen in Extended Data Fig. 1, when Sweden is left out of the full pool model, we recover the results from Flaxman et al.2, but when Sweden is included the estimates change markedly. This happens because the full pool model attributes a large effect size to the ban on public events to explain the Swedish death data.
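The notion of statistical influence used here can be illustrated with a toy leave-one-out loop. This is only a sketch under loud assumptions: the "fit" is a plain average standing in for the full Bayesian model, and the per-country numbers and function names are hypothetical, not estimates or code from either paper.

```python
from statistics import mean

# Hypothetical per-country effect sizes (relative reduction in Rt) for one NPI.
# These numbers are made up for illustration only.
country_effect = {
    "Italy": 0.10,
    "UK": 0.12,
    "France": 0.08,
    "Sweden": 0.70,  # an outlying country
}


def full_pool_estimate(effects):
    """Stand-in for a full pool fit: one shared effect for all countries."""
    return mean(effects.values())


def leave_one_out(effects):
    """Recompute the pooled estimate with each country removed in turn."""
    results = {}
    for held_out in effects:
        kept = {c: v for c, v in effects.items() if c != held_out}
        results[held_out] = full_pool_estimate(kept)
    return results


baseline = full_pool_estimate(country_effect)
loo = leave_one_out(country_effect)
# Influence: how far the pooled estimate moves when a country is dropped.
influence = {c: abs(loo[c] - baseline) for c in loo}
most_influential = max(influence, key=influence.get)
```

In this toy setting, dropping the outlying country moves the pooled estimate far more than dropping any other, which is the pattern a leave-one-out comparison is designed to expose.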

However, both the model we considered in Flaxman et al.2 and the partial pool model discussed here show consistent effect sizes across all ‘held-out’ (that is, excluding a given country from fitting) countries. (For space, only the UK, Italy and Sweden are shown in Extended Data Fig. 1.) This explains our choice to move from a full pool model, which is the one used by Soltesz et al.1, to the model used in Flaxman et al.2.

The partial pool model is what we recommend (and are currently adopting) for such analyses in future. Partial pooling allows all interventions to have a shared effect and an effect specific to each country for each intervention. Thus, it stands somewhere between a full pool model and 11 separate models, with the data informing this location. These choices mean the partial pool model has no specific affinity towards a country or a specific intervention.
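The structure of partial pooling can be sketched as a shared effect per NPI plus a Gaussian country-specific deviation. The sketch below makes several assumptions that are not taken from the text: a log-linear link (Rt = R0 × exp(−sum of active effects)), a fixed shrinkage standard deviation (in a fitted model this would be inferred from the data), and hypothetical parameter values and function names.

```python
import math
import random


def draw_country_effects(shared_alpha, countries, shrinkage_sd, rng):
    """Country effect = shared effect + Gaussian deviation (shrinkage prior)."""
    return {
        country: {
            npi: alpha + rng.gauss(0.0, shrinkage_sd)
            for npi, alpha in shared_alpha.items()
        }
        for country in countries
    }


def relative_reduction(alphas, active):
    """Relative reduction in Rt, assuming Rt = R0 * exp(-sum of active alphas)."""
    total = sum(alphas[npi] for npi in active)
    return 1.0 - math.exp(-total)


rng = random.Random(0)
# Hypothetical shared effects on the log scale.
shared_alpha = {"lockdown": 1.0, "public_events": 0.1, "school_closure": 0.1}
effects = draw_country_effects(shared_alpha, ["Italy", "UK", "Sweden"], 0.05, rng)
uk_reduction = relative_reduction(effects["UK"], ["lockdown", "public_events"])
```

Setting `shrinkage_sd` to zero reproduces a full pool model, and letting it grow without bound approaches separate models per country; partial pooling sits between these extremes.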

To further explore issues around identifiability at an individual country level versus across countries, in Fig. 1 we present the effects of NPIs for each country from separate country-specific models, a meta-analysis of these effects, and the estimates from our various joint models. In summary, we see that although the overall mean effect for lockdown is lower in the meta-analysis, it is still the only NPI with an identifiable effect size. The individual country fits provide insight into why this occurs; the only intervention that is consistently significant is lockdown (and the banning of public events in Sweden, as discussed in the legend of Extended Data Fig. 1).

Fig. 1: Inferred intervention effect sizes.

The x axis shows the relative reduction in transmission. Rows show model predictions for our published model (Flaxman et al.2), the model from Soltesz et al.1 (full pool), a generalized version of our published model (partial pool) and fits to individual countries (reported in supplementary discussion 8 of our original paper2; the model is the same one considered by Soltesz et al.1). We also include the mean effect size derived from a meta-analysis (mean across countries for each individual sample) of the individual country effects.
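One reading of "mean across countries for each individual sample" is sketched below: the i-th posterior draw is averaged across countries, giving a distribution for the cross-country mean effect. This is an interpretation, not the authors' code; it assumes every country has the same number of draws, and the draws shown are hypothetical.

```python
from statistics import mean


def meta_analysis_samples(country_samples):
    """Average the i-th posterior draw across countries, for every i."""
    per_draw = zip(*country_samples.values())
    return [mean(draw) for draw in per_draw]


# Hypothetical posterior draws of one NPI's effect size, three per country.
country_samples = {
    "Italy": [0.50, 0.60, 0.70],
    "UK": [0.40, 0.50, 0.60],
    "Sweden": [0.00, 0.10, 0.20],
}
pooled = meta_analysis_samples(country_samples)
```

The resulting list can be summarised like any other set of posterior draws, for example by its mean and credible interval.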

Considering the single-country models, we see that the effectiveness of lockdown is not merely the result of a modelling choice on our part. In countries such as Italy, no intervention is estimated to be significantly more effective than any other. The lack of identifiability is not a feature inherent to our model, but a limitation of the data available at the time, as we noted in our paper2. In particular, although we noted the close spacing of interventions in time, in a few countries lockdowns and the banning of public events coincided exactly (for example, in the UK). The result is that in the separate country analyses and full pooling (Soltesz et al.1), there is a strong posterior correlation between the effects of these two NPIs (Pearson correlation of −0.59 in separate country analyses; −0.67 in full pooling analysis): when one has a large effect, the other by necessity has a small effect.
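The origin of this negative correlation can be reproduced in a toy setting. When two NPIs start on the same day, the data constrain only their combined effect, so any draw that gives more to one must give less to the other. The sketch assumes a hypothetical, well-determined combined effect and an unconstrained split between the two NPIs; it does not use the actual posterior.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)
lockdown, public_events = [], []
for _ in range(5000):
    combined = rng.gauss(1.0, 0.1)  # joint effect is well determined (hypothetical)
    share = rng.random()            # split between the two NPIs is unconstrained
    lockdown.append(combined * share)
    public_events.append(combined * (1.0 - share))

rho = pearson(lockdown, public_events)  # strongly negative in this toy setting
```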

It is crucial to note here that Soltesz et al.1 are correct that the relative effect of different interventions cannot be disentangled for a single country treated in isolation. This probably reflects the limitation of using time series of deaths to infer transmission changes, given the high mean and variance of the distribution of the delay from infections to deaths. However, when looking across multiple countries, all aggregate models suggest that the lockdown intervention has an identifiable effect. This is true for all models considered, including the full pool model of Soltesz et al.1, in which the posterior probability that lockdown is the most effective intervention is 76%, as compared with 96% in the meta-analysis and 100% in both partial pooling models. Therefore, by simultaneously analysing trends in multiple countries, our model has the ability to resolve an identifiable signal of the effect of lockdown.
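A posterior probability of this kind can be computed as the fraction of joint posterior draws in which one intervention has the largest effect. The following is a minimal sketch with hypothetical draws and a hypothetical function name; it assumes that draws sharing an index come from the same joint sample.

```python
def prob_most_effective(samples, target):
    """Share of joint posterior draws in which `target` has the largest effect."""
    names = list(samples)
    n_draws = len(samples[target])
    wins = 0
    for i in range(n_draws):
        best = max(names, key=lambda name: samples[name][i])
        if best == target:
            wins += 1
    return wins / n_draws


# Hypothetical joint posterior draws (same index = same draw).
samples = {
    "lockdown": [0.60, 0.50, 0.20, 0.70],
    "public_events": [0.10, 0.30, 0.40, 0.20],
    "school_closure": [0.05, 0.10, 0.10, 0.30],
}
p_lockdown = prob_most_effective(samples, "lockdown")
```

Because the comparison is made draw by draw, this quantity accounts for the posterior correlation between effects, which a comparison of marginal means would ignore.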

To further reinforce this point, we also undertook a simulation study examining the extent to which the timing and ordering of the interventions used fundamentally limit the ability to infer effect sizes reliably.

We used our model to simulate synthetic epidemics for all 11 countries, keeping the original timing and ordering of interventions and the same initialization priors, but assigning hypothetical effect sizes to each intervention. We assigned small effect sizes (5% with a tight prior) to all but one NPI, giving the remaining one an effect size with a mean of 59%, also with a tight prior, across countries. In addition, to better reflect reality, we simulated another, country-varying NPI at a random time, which we treated as unobserved when fitting our model. This unobserved NPI had a diffuse prior bounded between 0% and 100%, with a mean of 27%, and was included to assess whether an omitted variable (for example, representing spontaneous behaviour change in response to government messaging) could bias the effect-size estimates of our modelled NPIs. We kept the dates for NPIs the same as those in the real data to account for concerns raised about the possible effects of coincident timing on the identifiability of effect sizes.
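The data-generating side of such a simulation can be sketched as follows. This is a simplified illustration, not the model itself: it uses a deterministic renewal equation for infections and stops before the infection-to-death step, and the basic reproduction number, generation-interval weights, NPI dates and the Beta shape for the unobserved NPI (chosen only so that its mean is 27%) are all assumptions.

```python
import random


def rt_path(r0, n_days, start_day, effect):
    """Daily Rt: R0 scaled by (1 - effect) for every NPI already in force."""
    path = []
    for t in range(n_days):
        rt = r0
        for npi, size in effect.items():
            if t >= start_day[npi]:
                rt *= 1.0 - size
        path.append(rt)
    return path


def simulate_infections(rt, generation_pmf, seed):
    """Deterministic renewal equation: I_t = Rt * sum_s g_s * I_{t-s}."""
    infections = [float(seed)]
    for t in range(1, len(rt)):
        pressure = sum(
            g * infections[t - s]
            for s, g in enumerate(generation_pmf, start=1)
            if t - s >= 0
        )
        infections.append(rt[t] * pressure)
    return infections


rng = random.Random(1)
n_days = 60
# Hypothetical NPI dates (day index); the study kept each country's real dates.
start_day = {"school_closure": 15, "public_events": 18, "lockdown": 25}
effect = {"school_closure": 0.05, "public_events": 0.05, "lockdown": 0.59}
# Unobserved NPI: random start, effect from a Beta with mean 0.27 (assumed shape).
start_day["unobserved"] = rng.randrange(10, 40)
effect["unobserved"] = rng.betavariate(0.54, 1.46)

rt = rt_path(3.5, n_days, start_day, effect)
infections = simulate_infections(rt, [0.2, 0.4, 0.3, 0.1], seed=10)
```

A fitting step would then be given the three observed NPIs and the simulated outcome, but not the `unobserved` entry, to check whether the 59% effect is still recovered.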

Next, we fitted the Flaxman et al. model2 to these simulated datasets (20 different simulations for each setting). As shown in Fig. 2, the estimates from the Flaxman et al. fitted model2 (without any information about the unobserved NPI) are in agreement with the NPI effect sizes that were used to generate the data. This analysis provides further evidence that the results we found were not merely artefacts of the modelling approach; if there is a strong signal in the data for a specific NPI, our model can recover it.
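Summarising repeated fits in this way can be sketched as below: one point estimate per simulated dataset, reduced to a mean and a central 95% interval. The estimates listed are hypothetical placeholders, and treating the interval as the 2.5% and 97.5% quantiles of the per-run point estimates is an assumption about how the summary was formed.

```python
from statistics import mean, quantiles


def summarise_runs(point_estimates):
    """Mean and central 95% interval of point estimates across repeated runs."""
    # n=40 yields cut points at 2.5%, 5%, ..., 97.5%.
    cuts = quantiles(point_estimates, n=40, method="inclusive")
    return mean(point_estimates), cuts[0], cuts[-1]


# Hypothetical point estimates of one NPI's effect size from 20 fits.
estimates = [0.50 + 0.01 * i for i in range(20)]
centre, low, high = summarise_runs(estimates)
```

If the value used to generate the data (59% in the setting described above) lies inside the interval, the fit has recovered the simulated effect.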

Fig. 2: Estimated effect sizes from simulated data.

Top to bottom, three separate simulations, in which lockdown, school closure and self-isolation in turn were given a mean effect size of 59% (grey lines), were repeated 20 times each. In each panel, effect sizes from the Flaxman et al. model2 fitted to the 20 simulations are plotted as the mean point estimate with 95% intervals from the 20 runs.

However, this does not on its own show that the converse is necessarily true: a large estimated effect for an NPI does not by itself prove that the NPI caused the observed reduction in transmission. To evaluate competing explanations for the observed dynamics of transmission, additional empirical evidence (such as NPI efficacy or alternative epidemiological explanations) is needed.

In summary, we believe that the additional evidence we present here confirms that the key conclusion from our paper is robust: within our model we can conclude that all NPIs together brought the epidemic under control; lockdown had the strongest effect and was identifiable; and the precise effect of the other NPIs was not identifiable. Although our work shows that lockdowns had the largest effect size, we did not and do not claim that they were the only path to controlling the virus; merely that among the NPIs we considered, lockdown was the most effective single measure. We of course acknowledge that improvements could be made to our model, such as including random processes, partial pooling (see above) or more prior analysis. Improved models and more granular information on NPIs and population behaviour will in future hopefully give a more nuanced understanding of which measures—whether mandatory or voluntary—contributed most to reductions in transmission.