Early-stage spatial disease surveillance of novel SARS-CoV-2 variants of concern in Germany with crowdsourced data

The emergence and rapid spread of novel variants of concern (VOC) of the coronavirus 2 constitute a major challenge for spatial disease surveillance. We explore the possibility to use close to real-time crowdsourced data on reported VOC cases (mainly the Alpha variant) at the local area level in Germany. The aim is to use these data for early-stage estimates of the statistical association between VOC reporting and the overall COVID-19 epidemiological development. For the first weeks in 2021 after international importation of VOC to Germany, our findings point to significant increases of up to 35–40% in the 7-day incidence rate and the hospitalization rate in regions with confirmed VOC cases compared to those without such cases. This is in line with simultaneously produced international evidence. We evaluate the sensitivity of our estimates to sampling errors associated with the collection of crowdsourced data. Overall, we find no statistical evidence for an over- or underestimation of effects once we account for differences in data representativeness at the regional level. This points to the potential use of crowdsourced data for spatial disease surveillance, local outbreak monitoring and public health decisions if no other data on new virus developments are available.


2021)
. This time lag between the start of the treatment and the reporting of the first VOC case should ensure that latent transmission effects are captured in the estimation. For instance, in the case of Flensburg, VOC infections could be traced back to illegal parties on December 31, 2020 [7]. Considering a median incubation time of 5 days for SARS-CoV-2 infections [8], we can thus expect that first VOC effects become visible in the data from January 5, 2021 onwards. This is before the first VOC case was confirmed through genome sequencing on January 24, 2021 for Flensburg.
Data on reported SARS-CoV-2 infections are taken from the Robert Koch Institute [9]. For our empirical analysis we use aggregate case numbers for each NUTS-3 region and day tracked on the basis of symptom onset for individual cases rather than the reporting date by local health authorities. This allows us to estimate the transmission timing of SARS-CoV-2 at the regional population level more precisely. We aggregate the data across age groups. Data on confirmed cases of the three novel VOC (B.1.1.7., B.1.351, P.1) together with their reporting dates are gathered from a public crowd-sourcing project [10], which bases case documentation on newspaper and public health reports. We have cross-checked data consistency by retracing individual cases and their timing from the documented source information and have conducted additional online searches for selected cases.
A relevant concern against SARS-CoV-2 case numbers is that they may grow with growing test intensity. Test intensity may rise in regions in regions with confirmed VOC cases. To rule out this effect, we use the hospitalization rate as alternative outcome. Specifically, we use daily information on the number of COVID-19 patients in intensive care (with and without artificial ventilation) per 100,000 population [11].
In the specification of SCM estimation, we account for the autoregressive dynamics of infections and the hospitalization rate by including the 7-day incidence rate and the absolute number of cumulative SARS-CoV-2 infections during the last 3 weeks before treatment start as time-varying predictors. Other time-varying predictors are the average daily temperature for each region during the last 2 weeks and changes in average daily mobility during the last 2 weeks before treatment start. Changes in average daily mobility per region are measured relative to a 2019 (pre-COVID) benchmark period. We use data on daily temperatures from Deutscher Wetterdienst [12] and data on mobility changes from [13].
We further include time-constant cross-sectional predictors characterizing regional demographic structures and the regional health care system as in [1] based on data from the INKAR online database of the Federal Institute for Research on Building, Urban Affairs and Spatial Development [14]. We use the latest year available in the database, which is 2017. Employed cross-sectional predictor variables include population density (Population/km2), regional settlement structure (categorial dummy), the share of highly educated population (in %), the share of female in population (in %), the average age of female and male population (in years), old-and young-age dependency ratios (in %), the number of physicians per 10,000 of population and pharmacies per 100,000 of population.
We conduct all SCM estimations in STATA using the SYNTH [15] and SYNTH_RUNNER [16] packages. Confidence intervals (CIs) are calculated from one-sided pseudo p-values obtained on the basis of comprehensive placebo-in-space tests. The latter tests calculate pseudo-treatment effects for all regions in the donor pool treating each of the regions as if it would have received the treatment of a confirmed VOC case by or after January 5, 2021. One-sided pseudo p-values are then calculated of the share of placebo-treatment effects that are larger than the observed treatment effects for treated regions and thus indicate the probability that the increase in the number of SARS-CoV-2 infections was observed by chance given the distribution of pseudo-treatment effects in the donor pool. To account for differences in pre-treatment match quality of the pseudo-treatment effects, only donors with a good fit in the pre-treatment period are considered for inference. Specifically, we do not include placebo effects in the pool for inference if the match quality of the control region, measured in terms of the pre-treatment root mean squared prediction error (RMSPE), is greater than 10 times the match quality of the treated unit [6]. Based on the obtained pseudo p-values we calculate confidence intervals as described in [17].
Robustness.-We mainly perform robustness tests by changing the composition of the donor pool. First, we exclude donor regions that were selected in the baseline SCM estimation. The idea behind this analysis is to preclude unintended selection effects resulting from latent VOC transmissions captured in the overall infection dynamics of donor regions in the pre-treatment period. Second, we reduce the pool of donor regions to those NUTS-3 regions ii/vii which are located in the same federal state as the treated regions (Schleswig-Holstein for Flensburg and North Rhine-Westphalia for Cologne, Leverkusen and Düren). This approach should minimize differences in public health measures, which are mainly decided under the authority of individual federal states in Germany. While mentioned but not reported in the manuscript in detail, we include codes to run all robustness tests in the replication files.

Difference-in-difference estimation (DiD)
To investigate average and dynamic treatment effects for the entire group of treated regions with at least one confirmed VOC case, we additionally run a series of panel regressions in a DiD and Panel event study (PES) framework. The sample period for the panel regressions includes the time period between November 15, 2020 and February 4, 2021. This ensures that we cover all confirmed VOC cases (in the currently best possible way) together with a sufficient pre-treatment period for each region of at least 3 weeks. As for the case of SCM, we use the 7-day incidence rate and the hospitalization rate as key outcome variables. For the incidence rate we measure the timing of infections in terms of symptom onset rather than reporting by local health authorities. We also use the same set of time-varying predictor variables as described above; cross-sectional predictors for the set of NUTS-3 regions are not included as we account for NUTS-3 region fixed effects in the panel regressions.
Specifically, we run DiD regressions as two-way fixed effects model of the following general form In equation 1, y i,t is the epidemiological outcome of interest (7-day incidence rate, hospitalization rate) observed for NUTS-3 region i and day t. The variable VOC i,t is our main treatment indicator, which takes values of 1 from the day onwards for which the first VOC case was confirmed in the region. The coefficient δ measures the direction and strength of the correlation between VOC reporting and outcome variables. We additionally test for latent transmission effects prior to the first reporting of a VOC (given that genome sequencing to identify SARS-CoV-2 variants may take up to 2 weeks). This is done by moving forward the date when the treatment dummy VOC i,t takes values of 1 for treated regions by 7, by 14 and by 21 days, respectively. Importantly, these extended treatment specifications do not test for early anticipation effects caused by latent confounding factors (this is done in the Panel Event Study), but averages estimated effects over a longer treatment period to capture potential latent transmissions prior to the first VOC confirmation as, for instance, identified for Flensburg in the SCM estimations.
It is important to control for factors that potentially confound the link between VOC and the overall SARS-CoV-2 incidence rate at the regional level. The set of confounding factors, which we can directly account for, is included in the variable vector X i,t− j . Specifically, similar to the SCM application, we control for the number of SARS-CoV-2 cases in region i during the last, the second last and the third last week. We also include a spatially lagged variable covering the number of SARS-CoV-2 cases in region i's (direct) spatial neighbors during the last, the second last and the third last week. A spatial lag is important because infections can easily spread from one region to another region nearby, e.g., due to commuting or general mobility [18]. Spatial association between regions is measured through first-order contiguity, i.e. whether regions share a common border or not. We row standardize the resulting spatial weights matrix.
Further, we control for the average temperature [19] and the relative change in average daily mobility at t − 1, at t − 7 and at t − 14 in region i. Controlling for mobility is important because lower mobility can be disease mitigating [20,21,22]. Including mobility effectively controls for lockdown measures implemented during the sample period and how people follow the rules. In addition, we include linear and quadratic time trends for four different region types RegionType r(i) classified on the basis of the region's settlement structure including region type 1 (large district-free cities, kreisfreie Städte), type 2 (urban regions, Landkreise), type 3 (rural regions, Landkreise) and type 4 (sparsely populated regions, Landkreise), i.e. R = 4. The classification of region types follows the definition of the Federal Institute for Research on Building, Urban Affairs and Spatial Development [14]. Table 1 shows descriptive statistics.
iii/vii  τ t are time fixed effects for each day in the sample, which for instance control for daily changes in infection levels similar across regions. µ i controls for time-constant region fixed effects, which could, e.g., be caused by region-specific SARS-CoV-2 testing intensities. ε i,t is the model's error term. We cluster standard errors at the NUTS-2 level (each of the 401 NUTS-3 regions belongs to one of the 38 NUTS-2 regions in Germany). We estimate δ and Γ using the REGHDFE package [23] in STATA, which allows us to control for τ t , µ i , γ r and ψ r .
Robustness.-Besides the full sample covering all treated regions, we also conduct estimations with sub-samples. First, we focus on those experiencing treatment early on (before January 22, 2021) to observe at least 14 days of treatment after the first confirmed VOC case for each treated region. Second, we study regions with a relatively high number of VOC reported cases. Here, we restrict the treated regions to the top-10 percentile of VOC counts, which corresponds to at least 9 VOC cases per region. The idea of this subsample is to test for treatment effect difference associated with confirmed VOC counts rather than the presence of at least one VOC case. Finally, we run estimations by variant type. In a first sub-sample, we focus on the British variant but allow for other reported variants. In a second sub-sample, we only consider the British variant. Similarly, we investigate effects for regions where the South-African variant was reported (together with potential other variants) and for regions for which only the South African variant was reported.

Panel event study (PES)
The estimation of the PES differs from the two-way fixed effects DiD specification mainly in the way that it accounts for the staggered emergence of a VOC in treated regions throughout the sample period. This allows us to identify dynamic treatment effects over time. Dynamic treatment effects may arise for different reasons: First, they could reflect early anticipation effects prior to the treatment start due to latent VOC transmissions or, second, they could result from other unobserved confounding factors systematically affecting incidence rates in treated regions around the treatment start. Thus, it is important to test for such early anticipation effects. The absence of statistically significant estimates for the latter but significant treatment effects could, accordingly, be interpreted in favor of our empirical identification strategy.
Moreover, we may expect that infection and hospitalizations effects do not immediately occur after the reporting of the first VOC case but potentially build up over time at the regional population level. This may particularly be the case if public health authorities can only imperfectly trace and mitigate VOC-related disease spread. In this case, the estimation of average treatment effects on treated (ATTs) as shown in equation 1 may potentially underestimate the dynamics of SARS-CoV-2 infections and hospitalizations subject to confirmed VOC cases. By including sufficient lag and lead terms in the estimation framework for the timing of treatment start, we can identify dynamic treatment effects.

iv/vii
The staggered nature of treatment start in different treated regions can be incorporated into the panel regression approach by translating the model from a specification in absolute time t (as shown in equation 1) to a specification that measures time for each region relative to treatment start. Together with the recognition of potential heterogeneity in the strength of treatment effects over time, the PES setup allows us to precisely estimate the impact of the passage of a treatment (here: VOC reporting in a region) that occurs at different times in different spatial units. A more formal presentation of the PES approach together with estimation challenges is given in [24,25] among others. Prior COVID-related PES applications have dealt, for instance, with the infection effects from school re-opening in Germany [26] and [27], university students traveling during the U.S. spring break [28] or mass protests from the Black Lives Matter movement [29].
In the implementation of the PES approach, we include the same set of covariates and fixed effects as in the case of the two-way fixed effects DiD estimation. We also cluster standard errors at the NUTS-2 level. We set the maximum number of pre-treatment leads to 10 days and the maximum number of lags after the treatment start to 20 days. Further effects from leads/lags before (after) this range are accumulated to a single coefficient. To allow for an easy comparison of the SCM and PES results, we express all reported effects as percentage change relative to the observed 7-day incidence rate in the last pre-treatment period. Robustness.
-Besides the full sample covering all treated regions, we estimate the effects for sub-samples of regions. First, we focus on those experiencing treatment early on (before January 22, 2021 to observe at least 14 days of treatment after the first confirmed VOC case). Estimation results for this sub-sample are only shown in the replication files. Second, we study regions with a relatively high number of VOC reported cases. We restrict the treated regions to the top-10 percentile of VOC counts, which corresponds to at least 9 VOC cases per region.
We conduct the PES estimations in STATA. The analysis builds on the EVENTDD package [30]. We document the full analyses in the replication files, particularly the mentioned robustness tests.

Data availability
All study data and codes to replicate the near-time estimation results (including robustness tests) and the expost validity checks are stored in a publicly available data repository accessible through the following DOI: 10.6084/m9.figshare.13946903.