RETRACTED ARTICLE: Stay-at-home policy is a case of exception fallacy: an internet-based ecological study

A recent mathematical model has suggested that staying at home did not play a dominant role in reducing COVID-19 transmission. The second wave of cases in Europe, in regions that were considered as COVID-19 controlled, may raise some concerns. Our objective was to assess the association between staying at home (%) and the reduction/increase in the number of deaths due to COVID-19 in several regions in the world. In this ecological study, data from www.google.com/covid19/mobility/, ourworldindata.org and covid.saude.gov.br were combined. Countries with > 100 deaths and with a Healthcare Access and Quality Index of ≥ 67 were included. Data were preprocessed and analyzed using the difference between number of deaths/million between 2 regions and the difference between the percentage of staying at home. The analysis was performed using linear regression with special attention to residual analysis. After preprocessing the data, 87 regions around the world were included, yielding 3741 pairwise comparisons for linear regression analysis. Only 63 (1.6%) comparisons were significant. With our results, we were not able to explain if COVID-19 mortality is reduced by staying at home in ~ 98% of the comparisons after epidemiological weeks 9 to 34.


Results
A flowchart of the data manipulation is depicted in Fig. 1. Briefly, Google Community Mobility Reports data between February 16th and August 21st, 2020, yielded 138 separate countries and their regions. The website Our World in Data provided data on 212 countries (between December 31st, 2019, and August 26th, 2020), and the Brazilian Health Ministry website provided data on all states (n = 27) and cities (n = 5,570) in Brazil (February 25th to August 26th, 2020).
Characteristics of these 87 regions are presented in Table 1 (further details are in Supplemental Material-Characteristics of Regions).
Comparisons. The restrictive analysis between controlled and not controlled areas yielded 33 appropriate comparisons, as shown in Table 2. Only one comparison out of 33 (3%)-state of Roraima (Brazil) versus state of Rondonia (Brazil)-was significant (p-value = 0.04). After correction for residual analysis, it did not pass the autocorrelation test (Lagrange Multiplier test = 0.04). (Further details are in Supplemental Material-Restrictive Analysis).
The global comparison yielded 3,741 combinations; from these, 184 (4.9%) had a p-value < 0.05, after correcting for False Discovery Rate (Table S1). After performing the residual analysis, by testing for cointegration between response and covariate, normality of the residuals, presence of residual autocorrelation, homoscedasticity, and functional specification, only 63 (1.6%) of models passed all tests (Table S2). Closer inspection of several cases where the model did not pass all the tests revealed a common factor: the presence of outliers, mostly due to differences in the epidemiological week in which deaths started to be reported. A heat map showing the comparison between the 87 regions is presented in Fig. 2.
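As a quick arithmetic check on the counts reported above, the number of unordered pairs among 87 regions can be computed directly. This is only an illustrative sketch; the region labels are hypothetical placeholders, not the names used in the study.

```python
from itertools import combinations
from math import comb

n_regions = 87

# Number of unordered pairs of distinct regions: C(87, 2)
n_pairs = comb(n_regions, 2)

# Enumerating hypothetical region labels gives the same count.
regions = [f"region_{i}" for i in range(n_regions)]
pairs = list(combinations(regions, 2))

# Shares of comparisons reported in the Results
share_fdr = 184 / n_pairs        # p-value < 0.05 after FDR correction
share_all_tests = 63 / n_pairs   # additionally passed all residual tests

print(n_pairs, len(pairs))       # 3741 3741
print(f"{share_fdr:.2%} {share_all_tests:.2%}")
```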

Discussion
We were not able to explain the variation of deaths/million in different regions in the world by social isolation, herein analyzed as differences in staying at home, compared to baseline. In the restrictive and global comparisons, only 3% and 1.6% of the comparisons were significantly different, respectively. These findings are in accordance with those found by Klein et al. 46 . These authors explain why lockdown was the least probable cause for Sweden's high death rate from COVID-19 46 . Likewise, Chaudry et al. made a country-level exploratory analysis, using a variety of socioeconomic and health-related characteristics, similar to what we have done here, and reported that full lockdowns and wide-spread testing were not associated with COVID-19 mortality per million people 47 . Unlike Chaudry et al., in our dataset, after 25 epidemiological weeks (counting from the 9th epidemiological week onwards in 2020), we included regions and countries with a "plateau" and a downslope phase in their epidemiological curves. Our findings are in accordance with the dataset of daily confirmed COVID-19 deaths/million in the UK. Pubs, restaurants, and barbershops were open in Ireland on June 29th and masks were not mandatory 48 ; after more than 2 months, no spike was observed; indeed, death rates kept falling 49 . Peru has been considered to be the strictest lockdown country in the world 30 ; nevertheless, by September 20th, it had the highest number of deaths/million 50 . Of note, differences were also observed between regions that were considered to be COVID-19 controlled, e.g., Sweden versus Macedonia. Possible explanations for these significant differences may be related to the magnitude of deaths in these countries. After October 2020, when our study was published in a preprint server for Health Sciences, new articles were published with similar results 51-54 .
Our results are different from those published by Flaxman et al. The authors used a very complex calculation to estimate that NPIs would prevent 3.1 million deaths across 11 European countries 44 . The discrepant results can be explained by different approaches to the data. While Flaxman et al. assumed a constant reproduction number (R t ) to calculate the total number of deaths, which eventually did not occur, we calculated the difference between the actual number of deaths between 2 countries/regions. The projections published by Flaxman et al. 44 have been disputed by other authors. Kuhbandner and Homburg described the circular logic that this study involved. Flaxman et al. estimated the R t from daily deaths associated with SARS-CoV-2 using an a priori restriction that R t may only change on those dates when interventions become effective. However, in the case of a finite population, the effective reproduction number falls automatically and necessarily over time since the number of infections would otherwise diverge 55 . A recent preprint report from Chin et al. 56 explored the two models proposed by the Imperial College 44 by expanding the scope to 14 European countries from the 11 countries studied in the original paper. They added a third model that considered banning public events as the only covariate. The authors concluded that the claimed benefits of lockdown appear grossly exaggerated since inferences drawn from effects of NPIs are non-robust and highly sensitive to model specification 56 . The same explanation for the discrepancy can be applied to other publications where mathematical models were created to predict outcomes 14-18 . Most of these studies dealt with COVID-19 cases 33,34 and not observed deaths. Despite their limitations, reported deaths are likely to be more reliable than new case data. Further explanations for different results in the literature, besides methodological aspects, could be justified by the complexity of the virus dynamic, by its interaction with the environment, or they may be related to a seasonal pattern that was, by coincidence, established at the same time when infection rates started to decrease due to seasonal dynamics 57 . It is unwise to try to explain a complex and multifactorial condition, with the inherent constant changes, using a single variable. An initial approach would employ a linear regression to verify the influence of one factor over an outcome. Herein we were not able to identify this association. Our study was not designed to explain why the stay-at-home measures do not contain the spread of the virus SARS-CoV-2. However, possible explanations that need further analysis may involve genetic factors 58 , the increment of viral load, and transmission in households and in close quarters where ventilation is reduced.
Table 2. Comparisons using the 4-point criteria. Comparability was considered if at least 3 out of 4 of the following conditions were similar: a) population density, b) percentage of the urban population, c) Human Development Index and d) total area of the region. Similarity was considered adequate when a variation in conditions a), b) and c) was within 30%, while, for condition d), a variation of 50% was considered adequate (Further details are in Auxiliary Supplementary Material-4 point criteria). *Linear regression.
This study has a few limitations. Different from the established paradigm of randomized clinical trial, this is an ecological study. An ecological study observes findings at the population level and generates hypotheses 59 . Population-level studies play an essential part in defining the most important public health problems to be tackled 59 , which is the case here. Another limitation was the use of Google Community Mobility Reports as a surrogate marker for staying at home. This may underestimate the real value: for instance, if a user's cell phone is switched off while at home, the observation will be absent from the database. Furthermore, the sample does not represent 100% of the population. This tool, nevertheless, has been used by other authors to demonstrate the efficacy in reducing the number of new cases after NPI 60,61 . Using different methodologies for measuring mobility may introduce bias and would prevent comparisons between different countries. The number of deaths may be another issue. Death figures may be underestimated; however, reported deaths may be more relevant than new case data. The arbitrary criteria used for including countries and regions, the restrictive comparisons, and our definition of an area as COVID-19 controlled are open for criticism. Nonetheless, these arbitrary criteria were created a priori to the selection of the countries. With these criteria, we expected to obtain representative regions of the world, compare similar regions, and obtain accurate data. By using a HAQI of ≥ 67, we assumed that data from these countries would be accurate and reliable, and that health conditions were generally good. Nevertheless, the global analysis of the regions (n = 3741 comparisons) overcame any issue of the restrictive comparison. Indeed, the global comparison confirmed the results found in the restrictive one; only 1.6% of the death rates could be explained by staying at home.
Also, our effective sample size in all studies is only 25 epidemiological weeks, which is a very small sample size for a time series regression. The small sample size and the non-stationary nature of COVID-19 data are challenges for statistical models, but our analysis, with 25 epidemiological weeks, is relatively larger than previous publications which used only 7 weeks 62 . A short interval of observation between the introduction of an NPI and the observed effect on death rates yields no sound conclusion, and is a case where the follow-up period is not long enough to capture the outcome, as seen in previous publications 44,45 . The effects of small samples in this case are related to possible large type II errors and also affect the consistency of the ordinary least square estimates. Nevertheless, given the importance of social isolation promoted by world authorities 63 , we expected a higher incidence of significant comparisons, even though it could be an ecological fallacy. The low number of significant associations between regions for mortality rate and the percentage of staying at home may be a case of exception fallacy, which is a generalization of individual characteristics applied at the group-level characteristics 64 .
There are strengths to highlight. Inclusion criteria and the Healthcare Access and Quality Index were incorporated. We obtained representative regions throughout the world, including major cities from 4 different continents. Special attention was given to compiling and analyzing the dataset. We also devised a tailored approach to deal with challenges presented by the data. To our knowledge, our modeling approach is unique in pooling information from multiple countries all at once using up-to-date data. Some criteria, such as population density, percentage of urban population, HDI, and HAQI, were established to compare similar regions. Finally, we gave special attention to the residual analysis in the linear regression, an absolutely essential aspect of studies using small samples.
In conclusion, using this methodology and current data, in ~ 98% of the comparisons using 87 different regions of the world we found no evidence that the number of deaths/million is reduced by staying at home. Regional differences in treatment methods and the natural course of the virus may also be major factors in this pandemic, and further studies are necessary to better understand it.

Methods
Rationale and approach for analyzing the time series data. The proposed approach was tailored to present a way to evaluate the influence of time spent at home and the number of deaths between two countries/ regions while avoiding common problems of other models presented in the literature. We focused on detecting the variation of the differences between the number of deaths and how much people followed stay-at-home orders in two regions in each epidemiological week.
For instance, let us consider two similar regions we shall call 'Stay In county' and 'Go Out county'. Both regions started with the same number of cases. After the first 1000 cases were recorded, Stay In county declared that all people should stay at home, while Go Out county allowed people to circulate freely. After a few epidemiological weeks, we examine the data collected on the number of deaths in both counties and how much time people stayed at home by using geolocation software. If the difference between the number of deaths in Stay In county and Go Out county (variable A) is affected by the difference of the percentage of time people stayed at home in these two areas (variable B), then we can consider that the difference in the number of deaths by COVID-19 is influenced by the difference in the percentage of time people stayed at home. Both effects can be detected using linear regression and careful examination of the problem.
Time series on COVID-19 mortality (deaths/million) display a non-stationary pattern. The daily data present a very distinct seasonal behavior on the weekends, with valleys on Saturdays and Sundays followed by peaks on Mondays (Figure S1). To account for seasonality, one may introduce dummy variables for Saturdays, Sundays, and Mondays, regress the number of deaths on these dummy variables, and then analyze the residuals. However, in most cases, the residuals are still non-stationary, and special treatment would be required in each case. Although this approach may be feasible for a few series, we are interested in analyzing hundreds of time series from different countries and regions. Hence, we need a more efficient way to deal with this amount of data. The covariates present another issue in regressing the daily time series of deaths/staying at home. The covariates are typically correlated with error terms due to public policies adopted by regions/countries. Mechanisms controlling social isolation are intrinsically related to the number of deaths/cases in each location. An increase in the death rate may cause more stringent policies to be adopted, which increases the percentage of people staying at home. This change causes an imbalance between the observed number of deaths and staying at home levels. In a regression model, this discrepancy is accounted for in the error term. Hence, the error term will change in accordance with staying at home levels.
Data aggregation by epidemiological week is a plausible alternative (Figure S2). In this way, artificial seasonality, imposed by work scheduled during weekends and the effect of governmental control over social interaction, in a regression framework, are mitigated. The drawback is that the sample size is significantly reduced from 187 days (Figure S1) to 26 epidemiological weeks (Figure S2).
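The weekly aggregation can be illustrated with a minimal sketch. Following the preprocessing summary in the Methods, deaths/million are summed and the staying at home percentage is averaged within each week. The week convention is an assumption here: the sketch uses ISO weeks (Monday to Sunday), which may differ from the epidemiological week definition used in the study, and the function name and toy data are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def aggregate_by_week(daily):
    """Aggregate daily records into weekly records.

    daily: list of (date, deaths_per_million, stay_home_pct) tuples.
    Returns {(iso_year, iso_week): (sum_of_deaths, mean_stay_home)}.
    ISO weeks are an assumption made for illustration only.
    """
    deaths = defaultdict(float)
    stay_home = defaultdict(list)
    for day, d, s in daily:
        iso = day.isocalendar()
        key = (iso[0], iso[1])          # (ISO year, ISO week number)
        deaths[key] += d
        stay_home[key].append(s)
    return {k: (deaths[k], mean(stay_home[k])) for k in sorted(deaths)}


# Toy data: 14 consecutive days starting on Monday, March 2nd, 2020
start = date(2020, 3, 2)
daily = [(start + timedelta(days=i), 1.0, 10.0 + i) for i in range(14)]

weekly = aggregate_by_week(daily)
for week, (death_sum, stay_mean) in weekly.items():
    print(week, death_sum, stay_mean)
```

Aggregating in this way removes the within-week pattern (weekend valleys, Monday peaks) at the cost of a roughly sevenfold reduction in the number of observations.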
Aggregation by epidemiological week, however, still yields non-stationary time series in most cases. To overcome this problem, we differenced each time series. Recall that if $Z_t$ denotes the number of deaths in the t-th epidemiological week, we define the first difference of $Z_t$ as
$$\Delta Z_t = Z_t - Z_{t-1}.$$
Intuitively, $\Delta Z_t$ denotes the variation of deaths between weeks t and t-1, also known as the flux of deaths. The same is valid for the staying at home time series. This simple operation yielded, in most cases, stationary time series, verified with the so-called Phillips-Perron stationarity test 65 . In the few cases where the resulting time series did not reject the null hypothesis of non-stationarity (technically, the existence of a unit root in the time series characteristic polynomial), this was due to the presence of one or two outliers combined with the small sample size. These outliers were usually related to the very low incidence of COVID-19 deaths by the 9th epidemiological week when paired with countries with a significant number of deaths in that same week, thus resulting in an outlier which cannot be accounted for by linear regression.
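First differencing of a weekly series can be sketched in a few lines. The function name and the toy values are hypothetical and serve only to illustrate the operation of subtracting the previous week's value from the current one.

```python
def first_difference(series):
    """Return the week-to-week changes z_t - z_(t-1) of a series.

    The result has one element fewer than the input.
    """
    return [curr - prev for prev, curr in zip(series, series[1:])]


# Toy weekly deaths/million for six consecutive epidemiological weeks
weekly_deaths = [0.0, 2.0, 5.0, 9.0, 12.0, 13.0]

flux = first_difference(weekly_deaths)
print(flux)  # [2.0, 3.0, 4.0, 3.0, 1.0]
```

Note that differencing shortens the series by one observation, which matters when only about 26 weekly values are available.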
To investigate pairwise behavior, we propose a method to assess the relationship between deaths and staying at home data between various countries and regions. For two countries/regions, say A and B, let $Y_t^A$ and $Y_t^B$ denote the number of deaths per million at epidemiological week t for A and B, respectively, while $X_t^A$ and $X_t^B$ denote the staying at home values at epidemiological week t for A and B, respectively. The idea is to regress the difference $\Delta Y_t^A - \Delta Y_t^B$ on the difference $\Delta X_t^A - \Delta X_t^B$. Formally, we perform the regression
$$\Delta Y_t^A - \Delta Y_t^B = \beta_0 + \beta_1 \left( \Delta X_t^A - \Delta X_t^B \right) + \varepsilon_t,$$
where $\beta_0$ and $\beta_1$ are unknown coefficients and $\varepsilon_t$ denotes an error term. Estimation of $\beta_0$ and $\beta_1$ is carried out through ordinary least squares. The interpretation of the model is important. We are regressing the difference in the variation of deaths between locations A and B on the difference in the variation of staying at home values between the same locations. If the number of deaths in locations A and B have a similar functional behavior over time, then $Y_t^A - Y_t^B$ tends to be near-constant, and $\Delta Y_t^A - \Delta Y_t^B$ tends to oscillate around zero. If the same applies to $X_t^A - X_t^B$, then we expect $\beta_1 \neq 0$; consequently, we conclude that the behavior, between A and B, is similar and the number of deaths and the percentage of staying at home are associated in these regions. The other non-spurious situation implying $\beta_1 \neq 0$ occurs when the variation in the number of deaths in locations A and B increases/decreases over time following a certain pattern, while the variation in the percentage of "staying at home" values also increases/decreases following the same pattern (apart from the direction). In this situation, locations A and B show different epidemiological patterns, since the variation in the number of deaths and the variation in the staying at home values are on opposite trends. However, if these patterns are similar (proportional), this is captured in the difference and, as a consequence, in the regression. This means that the different trends were near proportional and, hence, the variation in staying at home is associated with the variation in deaths.
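The pairwise regression described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: it computes only the ordinary least squares point estimates for a single covariate in plain Python (the study reports using statsmodels for estimation and testing), and all function names and toy data are hypothetical.

```python
def first_difference(series):
    """Week-to-week changes z_t - z_(t-1)."""
    return [b - a for a, b in zip(series, series[1:])]


def ols_simple(x, y):
    """Ordinary least squares for y = b0 + b1 * x + error.

    Returns the point estimates (b0, b1); no standard errors or p-values.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def pairwise_regression(y_a, y_b, x_a, x_b):
    """Regress the difference of differenced deaths on the difference of
    differenced staying at home values for regions A and B."""
    dy = [a - b for a, b in zip(first_difference(y_a), first_difference(y_b))]
    dx = [a - b for a, b in zip(first_difference(x_a), first_difference(x_b))]
    return ols_simple(dx, dy)


# Toy weekly series (deaths/million and % staying at home) for two regions
y_a = [20.0, 19.5, 17.0, 12.5, 6.0]
y_b = [1.0, 2.0, 3.0, 4.0, 5.0]
x_a = [5.0, 6.0, 8.0, 11.0, 15.0]
x_b = [5.0, 5.0, 5.0, 5.0, 5.0]

b0, b1 = pairwise_regression(y_a, y_b, x_a, x_b)
print(b0, b1)
```

The toy series were constructed so that the response is an exact linear function of the covariate, which makes the estimates easy to verify by hand; real data would include an error term and require the residual checks described in the statistical analysis.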
In the section below "Definition of areas with and without controlled cases of COVID-19", each country/region was classified into a binary class: either controlled or not controlled areas for COVID-19. The proposed method allows for insights regarding the association of the number of deaths and staying at home levels between countries/regions with similar/different degrees of COVID-19 control. Assumptions related to consistency, efficiency, and asymptotic normality of the ordinary least squares, in the context of time series regression, can be found in 66 . Since we are comparing many time series, to avoid any problem with spurious regression, we performed a cointegration test between the response and covariates. In this context, this is equivalent to testing the stationarity of $\varepsilon_t$, which was done by performing the Phillips-Perron test. Residual analysis is of utmost importance in linear regression, especially in the context of small samples. The steps and tests performed in the residual analysis are described in the statistical analysis section.
Study design. This is an ecological study using data available on the Internet.
Setting-data collection on mobility. Google Community Mobility Reports 31 provided data on mobility from 138 countries 67,68 and regions between February 15th and August 21st, 2020. Data regarding the average time spent at home were generated in comparison to the baseline. Baseline was considered to be the median value between January 3rd and February 6th, 2020. Data obtained between February 15th and August 21st, 2020, were divided into epidemiological weeks (epi-weeks) and the mean percentage of time spent staying at home per week was obtained.
Inclusion criteria for analysis. Only regions with mobility data and with more than 100 deaths, by August 26th, 2020, were included in this study. This criterion was chosen since the majority of epidemiological studies start when 100 cases are reached 69,70 . For data quality, only countries with a Healthcare Access and Quality Index (HAQI) of ≥ 67 71 were included. The HAQI has been divided into 10 subgroups. The median class is 63.4-69.7. The average in this median class is 66.55 (rounding up to 67). By choosing a HAQI of ≥ 67, we assumed that data from these countries were reliable and healthcare was of high quality. For Brazilian regions, the HAQI was replaced by the Human Development Index (HDI), and those with an HDI < 0.549 (low) were excluded.
Three major cities with > 100 deaths and well-established results (Tokyo, Japan; Berlin, Germany, and New York, USA) were selected as controlled areas.
Dataset of COVID-19 cases and associated data to reduce bias. After inclusion of the countries/regions, further data were obtained to reduce comparison bias, including population density (people/km²), percentage of the urban population, HDI, and the total area of the region in square kilometers. All data were obtained from open databases 72-74 .
Definition of areas with and without controlled cases of COVID-19. Regions were classified as controlled for cases of COVID-19 if they presented at least 2 out of the 3 following conditions: a) type of transmission classified as "clusters of cases", b) a downward curve of newly reported deaths in the last 7 days, and c) a flat curve in the cumulative total number of deaths in the last 7 days (variation of 5%) according to the World Health Organization 75 . An example is shown in Figure S3. Data from the cities (Tokyo, Berlin, New York, Fortaleza, Belo Horizonte, Manaus, Rio de Janeiro, São Paulo, and Porto Alegre) were obtained from official government sites 76-79 . Tokyo, Berlin and New York were chosen for having controlled the COVID-19 dissemination, for representing 3 different continents, and for similarity to major Brazilian cities (Fortaleza, Belo Horizonte, Manaus, Rio de Janeiro, São Paulo, and Porto Alegre).

Merged database. Different databases from the sites mentioned above were merged using Microsoft Excel Power Query (Microsoft Office 2010 for Windows Version 14.0.7232.5000) and manually inspected for consistency.
Processing the data-cleaning. Data collected from multiple regions were processed using Python 3.7.3 in the Jupyter Notebook 80 environment through the use of the Python Data Analysis Library in Google Colab Research 81 . Details of preprocessing are described in the Python script (Supplement). Briefly, after taking the sum of deaths/million per epi-week, and the average of the variable "staying at home" per epi-week, non-stationary patterns were mitigated by subtracting the value at week t-1 from the value at week t.
Time series data setup and variables. Details regarding the pre-processing and methodology are presented in the section "Rationale and approach for analyzing the time series data". Our variables were the difference in the variation of deaths between locations A and B (dependent variable-outcome), and the difference in the variation of staying at home values between the same locations (independent variable).
Comparison between areas. Direct comparison, between regions with and without controlled COVID-19 cases, was considered in two scenarios: 1) Restrictive if, at least 3 out of 4 of the following conditions were similar: a) population density, b) percentage of the urban population, c) HDI and d) total area of the region. Similarity was considered adequate when a variation in conditions a), b), and c) was within 30%, while, for condition d), a variation of 50% was considered adequate. 2) Global: all regions and countries were compared to each other.
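The restrictive criterion can be expressed as a small check. The text does not state how a "variation" between two regions is computed, so the sketch below assumes a relative difference measured against the larger of the two values; this definition, the field names, and the example regions are all hypothetical.

```python
# Maximum allowed relative variation per condition (from the text above)
THRESHOLDS = {
    "density": 0.30,    # a) population density
    "urban_pct": 0.30,  # b) percentage of the urban population
    "hdi": 0.30,        # c) Human Development Index
    "area_km2": 0.50,   # d) total area of the region
}


def relative_variation(a, b):
    """Assumed definition: absolute difference relative to the larger value.

    Both values are expected to be positive.
    """
    return abs(a - b) / max(a, b)


def is_comparable(region_a, region_b, min_similar=3):
    """True if at least `min_similar` of the 4 conditions are similar."""
    similar = sum(
        relative_variation(region_a[key], region_b[key]) <= limit
        for key, limit in THRESHOLDS.items()
    )
    return similar >= min_similar


# Hypothetical regions
a = {"density": 100.0, "urban_pct": 80.0, "hdi": 0.80, "area_km2": 1000.0}
b = {"density": 120.0, "urban_pct": 85.0, "hdi": 0.75, "area_km2": 5000.0}
c = {"density": 400.0, "urban_pct": 40.0, "hdi": 0.78, "area_km2": 900.0}

print(is_comparable(a, b))  # True: only the area differs by more than 50%
print(is_comparable(a, c))  # False: density and urban share both differ
```

A different definition of variation (for example, relative to the mean of the two values) would change which pairs qualify, so the choice should be stated explicitly when applying the criterion.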
The restrictive comparison used parameters related to how close people may have made physical contact. The major route of transmission for COVID-19 is from person-to-person via respiratory droplets and direct personal and physical contact within a community setting 82,83 .
Statistical analysis. After data preprocessing, the association between the number of deaths and staying at home was verified using a linear regression approach. Data were analyzed using the Python module statsmodels.api v0.12.0 (statsmodels.regression.linear_model.OLS; statsmodels.org), and double-checked using R version 3.6.1 84 . The False Discovery Rate procedure proposed by Benjamini and Hochberg (FDR-BH) was used to correct for multiple testing 85 .
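The Benjamini-Hochberg step-up procedure mentioned above can be illustrated with a short pure-Python sketch. It is written from the standard definition of the procedure rather than taken from the authors' script, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, in the original order of `p_values`,
    marking which null hypotheses are rejected at FDR level `alpha`.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank

    # Reject every hypothesis whose rank is at most k_max
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from eight pairwise regressions
p = [0.041, 0.001, 0.205, 0.008, 0.06, 0.039, 0.074, 0.042]

rejected = benjamini_hochberg(p)
print(rejected)
```

In this toy example five p-values are below 0.05, but only the two smallest survive the correction, which shows why the number of significant comparisons shrinks once multiple testing is taken into account.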
We checked the residuals for heteroskedasticity using White's test 86 ; for the presence of autocorrelation using the Lagrange Multiplier test 87 ; for normality using the Shapiro-Wilk's normality test 88 ; and for functional specification using the Ramsey's RESET test 89 . All tests were performed with a 5% significance level and the analysis was performed with R version 3.6.1 84 .
Data from 30 restrictive comparisons were manually inspected and checked a third time using Microsoft Excel (Microsoft). A heat map was designed using GraphPad Prism version 8.4.3 for Mac (GraphPad Software, San Diego, California, USA). Graphs plotting the number of deaths/million and staying at home over epidemiological weeks were obtained from Google Sheets 90 .