National and subnational short-term forecasting of COVID-19 in Germany and Poland, early 2021



Introduction
Short-term forecasts of infectious diseases and longer-term scenario projections provide complementary perspectives to inform public health decision making. Both have received considerable attention during the COVID-19 pandemic and are increasingly embraced by public health agencies. This is illustrated by the US COVID-19 Forecast (1; 2) and Scenario Modelling Hubs (3), supported by the US Centers for Disease Control and Prevention, as well as the more recent European COVID-19 Forecast Hub (4), supported by the European Centre for Disease Prevention and Control (ECDC). The Forecast Hub concept, building on pre-pandemic collaborative disease forecasting projects like FluSight (5), the DARPA Chikungunya Challenge (6) or the Dengue Forecasting Project (7), aims to provide a broad picture of existing short-term projections in real time, making the agreement or disagreement between different models visible. It also forms the basis for a systematic evaluation of performance, a prerequisite for model consolidation and improvement and a need that has been expressed repeatedly (8). In the German-speaking public discourse, the need for well-designed prospective and pre-registered studies in the field of disease modelling has been highlighted (9).
We here report on the second part of such a study, pre-registered on 8 October 2020 (10) and including forecasts made between 11 January 2021 and 29 March 2021 (with last observed values running through April; twelve weeks of forecasting). It is based on the German and Polish COVID-19 Forecast Hub (https://kitmetricslab.github.io/forecasthub/), which gathers and stores forecasts in real time. This platform was launched in close exchange with the US COVID-19 Forecast Hub in June 2020. In April 2021 it was largely merged into the European COVID-19 Forecast Hub, shortly after the latter had been launched by ECDC. During our study period, fifteen independent modelling teams provided forecasts of cases and deaths by reporting date, based on data either from national health authorities (Robert Koch Institute, RKI, or the Polish Ministry of Health, MZ; the primary data source) or the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE; (11)). As in the first part of our study ((12), October-December 2020), we focus on forecasts one and two weeks ahead. As non-pharmaceutical interventions were more stable in the second than in the first period, we give more attention to the evaluation of three- and four-week-ahead forecasts, but acknowledge that forecasts (as opposed to scenarios) are most meaningful for short time horizons.
The time series of cases and deaths in both countries are displayed in panels (a) and (b) of Figure 1.
The study period covered in this paper is marked in dark grey, while the light grey area represents the time span addressed in (12). Our study period contains the transition from the original wild-type variant of the virus to the B.1.1.7 variant (later called Alpha); panel (c) of Figure 1 shows the weekly percentage of all cases due to the B.1.1.7 variant in the two countries in calendar weeks 4-12 (taken from (13) for Germany and (14; 15) for Poland). Panel (d) shows the Oxford Coronavirus Government Response Tracker (OxCGRT) Stringency Index (16). Compared to the first part of our study, the level of non-pharmaceutical interventions was rather stable at a high level during the second period. We note, however, that on 27 March a new set of restrictions was introduced in Poland (closure of daycare centers, hair salons and sports facilities, among others), which is not reflected very strongly in the Stringency Index.
The start of vaccination rollout in both countries coincides with the start of our study period.However, by its end only roughly one sixth of the population of both countries had received a first dose, and roughly one twentieth had received two doses (with the role of the one-dose Johnson and Johnson vaccine negligible in both countries); see panel (e).
We find that, averaged over the second evaluation period, most though not all of the compared models were able to outperform a naïve baseline model. Heterogeneity between forecasts from different models was considerable. Combined ensemble forecasts achieved very good performance relative to single-model forecasts. However, most models, including the ensemble, did not anticipate changes in trend well, in particular for cases. Pooling results over both evaluation periods, we find that ensemble forecasts for deaths were well calibrated even at longer prediction horizons and clearly outperformed baseline and individual models, while for cases this was only true for one- and, to a lesser degree, two-week-ahead forecasts.

Results
Figures 2 and 3 show the forecasts made by the median ensemble (KIT-median ensemble; our pre-specified main ensemble approach), a naïve last-observation-carried-forward model (KIT-baseline) and seven contributed models with above-average overall performance across locations and targets. The forecasts are probabilistic, and we illustrate the 50% and 95% prediction intervals (PIs) along with the respective medians.
Forecasts by the remaining teams are illustrated in Supplementary Figures 8 and 9, and forecasts at horizons of three and four weeks are shown in Supplementary Figures 10-13. In the following, we discuss the performance of these forecasts, starting with a formal statistical evaluation before directing attention to the behaviour at inflection points. Additional information on the submitted, baseline and ensemble models can be found in Sections 4.3 and 4.4.

Formal evaluation, January-April 2021
Table 1 and Figure 4 summarize the performance of the submitted, baseline and ensemble models over the twelve-week study period. Performance is measured via the average weighted interval score (WIS; (17)), an error measure for probabilistic forecasts, and the absolute error of the predictive median. For both measures lower values indicate better predictive performance, and the WIS can be decomposed into components representing underprediction, forecast spread and overprediction; see Section 4.2. Detailed results in tabular form at horizons of three and four weeks can be found in Supplementary Table 4.
As specified in the study protocol, we also provide results for cumulative cases and deaths (Supplementary Tables 6 and 7) and based on JHU rather than RKI/MZ data (Supplementary Tables 8 and 9; evaluation against JHU data leads to slightly higher WIS and absolute errors, but quite similar relative performance of models). Both for incident cases and deaths, a majority, but not all, models outperformed the naïve baseline model KIT-baseline (a model outperforms the baseline for a given target whenever its stacked bar in Figure 4 does not reach into the grey area). As one would expect, the performance of all models deteriorated considerably at longer forecast horizons. The pre-specified median ensemble (see Materials and Methods) was consistently among the best-performing methods, outperforming most individual forecasts for all targets. The KIT-extrapolation baseline model shows quite reasonable relative performance for cases in both countries. This model extrapolates exponential growth or decline if the last three observations are monotonically increasing or decreasing, with a weekly growth rate equal to the one observed between the second-to-last and last week; if the last three observations are not ordered, it predicts a plateau. Predictive quantiles are obtained by assuming a negative binomial observation model with a dispersion parameter estimated via maximum likelihood from five recent observations (see Supplementary Note 2 of (12) for a detailed description). Given the relatively long stretches of continued upward or downward trends, this simple heuristic was not easy to beat. For deaths, on the other hand, the ensemble approaches achieved quite substantial improvements relative to this baseline.
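To make the mechanics of such an extrapolation baseline concrete, the following is a minimal sketch. It is not the authors' implementation: the dispersion is fitted here by a crude method-of-moments step rather than maximum likelihood, and all function and variable names are ours.

```python
import numpy as np
from scipy import stats

def extrapolation_baseline(obs, horizons=(1, 2, 3, 4),
                           levels=(0.025, 0.25, 0.5, 0.75, 0.975)):
    """Sketch of a multiplicative extrapolation baseline for weekly counts.

    If the last three observations are monotone, the last weekly growth
    factor is carried forward; otherwise a plateau is predicted.
    """
    obs = np.asarray(obs, dtype=float)
    diffs = np.diff(obs[-3:])
    monotone = np.all(diffs > 0) or np.all(diffs < 0)
    growth = obs[-1] / obs[-2] if monotone else 1.0

    recent = obs[-5:]
    var = recent.var(ddof=1)  # crude moment-based dispersion from 5 recent weeks

    forecasts = {}
    for h in horizons:
        mean = obs[-1] * growth ** h
        if var > mean:
            # negative binomial with matching mean and variance
            n = mean ** 2 / (var - mean)
            p = mean / var
            forecasts[h] = {q: stats.nbinom.ppf(q, n, p) for q in levels}
        else:
            # fall back to Poisson when the recent counts are underdispersed
            forecasts[h] = {q: stats.poisson.ppf(q, mean) for q in levels}
    return forecasts
```

On a monotone stretch the predictive median grows geometrically with the horizon, while an unordered recent history yields a flat forecast, mirroring the plateau rule described above.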
The most striking cases of individual models outperforming the ensemble occurred for longer-range case forecasts in Poland. Here, the two microsimulation models MOCOS-agent1 and ICM-agentModel performed considerably better. These two models were arguably among those most meticulously tuned to the specific national context. It seems that this yielded benefits at longer horizons, while at shorter horizons the ensemble and some considerably simpler models were at least on par (the best performance at the one-week horizon being achieved by the compartmental model MIMUW-StochSEIR).
There were considerable differences in the forecast uncertainty of the different models. This can be seen from the quite variable forecast interval widths in Figures 2 and 3, and resulted in large differences in the empirical coverage rates of 50% and 95% prediction intervals (Table 1). The ensemble methods performed quite favourably in terms of coverage, typically with slight undercoverage for cases and slight overcoverage for deaths. The differences in forecast dispersion are also reflected by the components of the weighted interval score shown in Figure 4 (see Materials and Methods for an explanation of the decomposition). Some models, most strikingly ITWW-county repro, issued very sharp predictions, leading to very small dispersion components.

A subset of models also provided forecasts at the subnational level (states in Germany, voivodeships in Poland). Table 2 provides a summary of the respective results at the one- and two-week horizons (results for three and four weeks can be found in Supplementary Table 5). Despite the rather low number of available models, the ensembles generally achieved improvements over the individual models and, with exceptions for case forecasts in Germany, clearly outperformed the baseline model KIT-baseline. The mean WIS values are lower for the regional forecasts than for the national-level forecasts in Table 1 primarily because the numbers to be predicted are lower at the regional level; the WIS, like the absolute error, scales with the order of magnitude of the predicted quantity and cannot be compared directly across different forecasting tasks. Coverage of the ensemble forecasts was close to the nominal level for deaths and somewhat lower for cases. Note that in this comparison part of the forecasts from the FIAS FZJ-epi1Ger model were created retrospectively, as the team only started issuing forecasts for all German federal states on 22 February 2021.
As specified in the study protocol (10), we also report evaluation results at the national level pooled across the two study periods for those models which covered both. These are summarized in Supplementary Tables 10 and 11. For deaths, ensemble forecasts clearly outperformed individual models, the four-week-ahead horizon in Poland being the only one at which an individual model (epiforecasts-EpiExpert) meaningfully outperformed the pre-specified median ensemble. While most contributed and baseline models were somewhat overconfident, the ensemble showed close to nominal coverage even at the four-week-ahead horizon. For cases, the median ensemble achieved good relative performance (comparable to the best individual models) one and two weeks ahead, but was outperformed by a number of other models at three and four weeks.
Notably, it failed to beat the naïve last-observation-carried-forward model KIT-baseline. Its coverage of prediction intervals was acceptable one week ahead, but substantially below nominal at longer horizons (e.g., 13/19 and 10/19 four weeks ahead in Germany and Poland, respectively, at the 0.95 level), which reflects the severe difficulties in predicting cases in fall 2020, as discussed in (12).

Behaviour at inflection points
From a public health perspective, there is often a specific interest in how well models anticipated major inflection points. We therefore specifically discuss these instances, while noting that, as will be detailed in the Discussion, post-hoc conditioning of evaluation results on the occurrence of unusual events comes with important conceptual challenges.

Given earlier observations on the spread of the B.1.1.7 variant in the UK (18) and Denmark, there was public discussion about the likelihood of a resurgence, but considerable uncertainty about its timing and strength (see (19) for a German newspaper article from early February 2021). This was largely due to the limited availability of representative sequencing data. In certain regions of Germany, specifically the city of Cologne (20) and the state of Baden-Württemberg (21), large-scale sequencing had been adopted by late January, but results were considered difficult to extrapolate to the whole of Germany. An updated RKI report on virus variants from 10 February 2021 (22) described a "continuous increase in the share of the VOC B.1.1.7", but cautioned that the data were "subject to biases, e.g., with respect to the selection of samples to sequence" (our translation).
Given the limited available data, and the fact that many approaches had not been designed to accommodate multiple variants, only two of the teams submitting forecasts for Germany opted to account for this aspect (a question which was repeatedly discussed during coordination calls): the Karlen-pypm and LeipzigIMISE-SECIR models, which from 1 March 2021 explicitly accounted for the presence of two variants. For Poland, the ICM-agentModel and MOCOS-agent1 models likewise took this aspect into account to the degree possible. Again, the ITWW-county repro model was the first to predict a change in overall trends (in this case without having predicted turning points already in the preceding weeks; see Supplementary Figure 8).
Peak of the third wave (cases)  In Poland, the third wave reached its peak in the week ending on 3 April 2021. Despite the fact that this week coincided with the Easter weekend, and thus with somewhat unclear data quality, the turnaround was predicted quite well by two Poland-based teams, MOCOS-agent1 and ICM-agentModel.
As can be seen from Figure 6, the trajectories of these two models differed substantially from those of most other models, including the ensemble, which predicted a sustained increase. This successful prediction of the turning point was in large part responsible for the good relative performance of MOCOS-agent1 and ICM-agentModel at longer horizons reported in Section 2.1. In retrospective discussions, the respective teams noted that the tightening of non-pharmaceutical interventions (NPIs) on 27 March (which they had anticipated), in combination with possible seasonal effects, had led them to expect a downward turn.
For Germany, the peak of the third wave occurred only after the end of our pre-specified study period, but we note that numerous models showed strong overshoot as they expected the upward trend to continue.The exact mechanisms underlying the turnaround remain poorly understood (a new set of restrictions referred to as the Bundesnotbremse was introduced too late to explain the change on its own).
Changes in trend of deaths  In Germany, the study period coincided almost perfectly with a prolonged period of decline in deaths. In Figure 7, panels (a) and (b) show the behaviour of the median ensemble at the beginning and end of this phase. The ensemble had already anticipated a downward turn on 4 January, two weeks before it actually occurred. Following the unexpectedly strong increase in the following week, it switched to extending the upward tendency, before reverting to predicting a turnaround. It seems likely that the irregular pattern in late December and early January was partly due to holiday effects in reporting, and forecast models may have been disturbed by this aspect.
At the end of the downward trend in late March, the ensemble again anticipated the turnaround to arrive earlier than it did, and predicted a more prolonged rise than was observed.Nonetheless, in both cases the ensemble to some degree anticipated qualitative change, and the observed trajectories were well inside the respective 95% prediction intervals (with the exception of the forecast from 4 January; however, this forecast had prospectively been excluded from the analysis as we anticipated reporting irregularities).
In Poland, deaths started to increase in early March after a prolonged period of decay. As can be seen in panel (c) of Figure 7, the median ensemble had anticipated this change (22 February 2021), but in terms of its point forecast did not initially expect a prolonged upward trend as later observed. Nonetheless, the observed trajectory was contained in the relatively wide 95% prediction intervals (Figures 2 and 3).

Discussion
We presented results from the second and final part of a pre-registered forecast evaluation study conducted in Germany and Poland (January-April 2021). During the period covered in this paper, ensemble approaches yielded very good performance relative to contributed individual models and baseline models. The majority of contributed models were able to outperform a simple last-observation-carried-forward model for most targets and forecast horizons up to four weeks.
The results in this manuscript differ in important aspects from those for our first evaluation period (October-December 2020), when most models struggled to meaningfully outperform the KIT-baseline model for cases. Fall 2020 was characterized by rapidly changing non-pharmaceutical intervention measures, making it hard for models to anticipate the case trajectory. Pooled across both study periods, we found ensemble forecasts of deaths to yield satisfactory reliability and clear improvements over baseline models.
For cases, however, coverage was clearly below nominal from the two-week horizon onward, and in terms of mean weighted interval scores the ensemble failed to outperform the KIT-baseline model three and four weeks ahead. This strengthens our previous conclusion (12) that meaningful case forecasts are only feasible at very short horizons. It also agrees with recent results from the US COVID-19 Forecast Hub (23), which led the organizers to suspend ensemble case forecasts beyond the one-week horizon.
The differences between our two study periods illustrate that performance relative to simple baseline models depends strongly on how good a fit these are for a given period. Cases in Germany plateaued during November and early December 2020, making the last-observation-carried-forward strategy of KIT-baseline difficult to beat. The second evaluation period was characterized by longer stretches of continued upward or downward trends, making it much easier to beat that baseline. In this situation, however, many models did not achieve strong improvements over the extrapolation approach KIT-extrapolation baseline. Ideally, one would wish complex forecast models to outperform each of these different baseline models. However, there are many ways of specifying a "simple" baseline (24), and post hoc at least one of them will likely be in acceptable agreement with the observed trajectory. While the choice of the most meaningful reference remains subject to debate, we believe that the use of a small set of pre-specified baselines, as in the present study, is a reasonable approach.
An observation made in both the first and the second part of our study is that predicting changing trends in cases is very challenging, while turnarounds in death counts are less difficult to anticipate. This finding is shared by other work on short-term forecasts of COVID-19 in real time; see (25) for the UK and (26) for the US.
To interpret these insights we note that, in principle, there are two ways of forecasting epidemiological time series: (i) extrapolating recent trends in the series of interest itself, and (ii) establishing a statistical relationship (often with a mechanistic motivation) to a leading indicator, i.e., a data stream which is informative on the trajectory of the quantity of interest, but available earlier. Changes in the trend of the leading indicator can then help anticipate future turning points in the time series of interest.
Death forecasts belong to category (ii), with cases and hospitalizations serving as leading indicators. This prediction task has been addressed with considerable success. Case forecasts, on the other hand, are typically based on approach (i), which largely reduces to trend extrapolation unless models are carefully tuned to changing NPIs (see Table 3). Theoretical arguments on the limited predictability of turning points in such curves have been brought forward (27; 28), and empirical work including ours confirms that this is a very difficult task. The success of the two microsimulation models MOCOS-agent1 and ICM-agentModel in anticipating the downward turn in cases in Poland is encouraging, but remains a rather rare exception. Potential leading indicators to improve case forecasts could be trajectories in other countries (29) or additional data streams on, e.g., mobility, insurance claims or web searches. However, the benefits of such data for short-term forecasting have thus far been found to be modest (30). Changes in dominant variants may make changes in overall trends predictable, as these arise from the superposition of opposing but stable trends for the different variants. The availability of sequencing data has improved considerably since our study period, but in practice the associated delays may still limit predictability in crucial periods.
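To illustrate the distinction, the following is a deliberately simplistic sketch of approach (ii), not any team's actual model: deaths are predicted from already-observed cases via an estimated lagged ratio, so that turning points in cases propagate into the death forecast. Names and the two-week lag are illustrative assumptions.

```python
import numpy as np

def deaths_from_lagged_cases(weekly_cases, weekly_deaths, lag=2, horizon=2):
    """Toy leading-indicator forecast: deaths ~ ratio * cases lagged by `lag` weeks.

    Estimates the case-to-death ratio from the overlapping history, then
    applies it to already-observed cases to predict deaths up to `lag`
    weeks ahead.
    """
    cases = np.asarray(weekly_cases, dtype=float)
    deaths = np.asarray(weekly_deaths, dtype=float)
    # ratio of deaths in week t to cases in week t - lag, averaged over history
    ratios = deaths[lag:] / cases[:-lag]
    cfr = ratios.mean()
    preds = []
    for h in range(1, horizon + 1):
        if h <= lag:
            # the relevant case count is already observed
            preds.append(cfr * cases[len(cases) - lag + h - 1])
        else:
            # beyond the lag, cases themselves would have to be forecast first
            preds.append(np.nan)
    return preds
```

The `nan` branch makes the structural limitation explicit: the leading indicator only buys `lag` weeks of predictability before the problem reduces to case forecasting again.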
We have extensively discussed the difficulties models encountered at turning points, both upward and downward. In the aftermath of such events, epidemic forecasts typically receive increased attention in the general media (see, e.g., (31) for coverage of the rapid downward turn in cases in Germany in May 2021). While important from a subject-matter perspective, this is not without problems from a formal forecast evaluation standpoint. Major turning points are rare events and as such difficult to forecast. Focusing evaluation solely on these instances will benefit models with a strong tendency to predict change, and adapting scoring rules to emphasize such events in a principled way is not straightforward. This problem is known in the literature as the forecaster's dilemma (32) and likewise occurs in, e.g., economics and meteorology (see illustrations in Table 1 of (32)).
The present paper marks the end of the German and Polish COVID-19 Forecast Hub as an independently run platform. In April 2021, the European Centre for Disease Prevention and Control (ECDC) announced the launch of a European COVID-19 Forecast Hub (4), which has since attracted submissions from more than 30 independent teams. The German and Polish COVID-19 Forecast Hub has been synchronized with this larger effort, meaning that all forecasts submitted to our platform are forwarded to the European repository, while forecasts submitted there are mirrored in our dashboard. In addition, we still collect regional-level forecasts, which are not currently covered in the European Forecast Hub. The adoption of the Forecast Hub concept by ECDC underscores the potential of collaborative forecasting systems with combined ensemble predictions as a key output, along with continuous monitoring of forecast performance. We anticipate that this closer link to public health policy making will enhance the usefulness of this system to decision makers.
An important step will be the inclusion of hospitalization forecasts. Due to unclear data access, these had not been tackled in the framework of the German and Polish COVID-19 Forecast Hub, but they have recently been added to the new European version.

Materials and Methods
The methods described in the following are largely identical to those in the first part of our study (12), but are presented in abridged form to keep the present work self-contained.

Targets and submission system
Teams submitted forecasts for weekly incident and cumulative confirmed cases and deaths from COVID-19 via a dedicated public GitHub repository (https://github.com/KITmetricslab/covid19-forecast-hub-de).
For certain teams running public dashboards, software scripts were put in place to transfer forecasts to the Forecast Hub repository. Weeks were defined to run from Sunday through Saturday. Each week, teams were asked to submit forecasts using data available up to Monday, with submission possible until Tuesday, 3 pm Berlin/Warsaw time (the first two daily observations of a week were thus already available at the time of forecasting).
Forecasts could either refer to the time series provided by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE; (11)) or to those from Robert Koch Institute and the Polish Ministry of Health. All data streams were aggregated by time of reporting; see Supplementary Note 4 of (12) for details. Submissions consisted of a point forecast and 23 predictive quantiles (1%, 2.5%, 5%, 10%, ..., 90%, 95%, 97.5%, 99%) for the incident and cumulative weekly quantities. As in (12), we focus on the targets on the incidence scale, which are easier to compare across the different data sources than cumulative numbers, which sometimes show systematic shifts.

Evaluation metrics
As forecasts were reported in the form of 11 nested central prediction intervals (plus the predictive median), a natural choice for evaluation is the interval score (33). For a central prediction interval $[l, u]$ at the level $1 - \alpha$, thus reaching from the $\alpha/2$ to the $1 - \alpha/2$ quantile, it is defined as
\[
\mathrm{IS}_\alpha(F, y) = (u - l) + \frac{2}{\alpha} (l - y) \chi(y < l) + \frac{2}{\alpha} (y - u) \chi(y > u),
\]
where $\chi$ is the indicator function and $y$ is the realized value. Here, the first term characterizes the spread of the forecast distribution, the second penalizes overprediction (observations fall below the prediction interval) and the third penalizes underprediction. To assess the full predictive distribution we use the weighted interval score (WIS; (17)). The WIS is a weighted average of interval scores at different nominal levels and the absolute error. For eleven prediction intervals at levels $1 - \alpha_1, \dots, 1 - \alpha_{11}$ it is defined as
\[
\mathrm{WIS}(F, y) = \frac{1}{11.5} \times \left( \frac{1}{2} |y - m| + \sum_{k=1}^{11} \frac{\alpha_k}{2} \mathrm{IS}_{\alpha_k}(F, y) \right),
\]
where $m$ is the predictive median. The WIS is a well-known approximation of the continuous ranked probability score (CRPS; (33)) and generalizes the absolute error to probabilistic forecasts. Its values can be interpreted on the natural scale of the data and measure how far the observed value $y$ is from the predictive distribution (lower values are thus better). For deterministic one-point forecasts the WIS reduces to the absolute error. A useful property of the WIS is that it inherits the decomposition of the interval score into forecast spread, overprediction and underprediction, which makes average scores more interpretable. As secondary measures of forecast quality we use the absolute error to assess the central tendency of forecasts and the coverage rates of 50% and 95% prediction intervals to assess calibration.
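For concreteness, the interval score and WIS can be transcribed into code as follows. This is a sketch under the hub's quantile format (23 levels forming 11 nested intervals plus the median); function and variable names are ours.

```python
# Nominal alpha values of the 11 central PIs formed from the 23 hub quantiles
ALPHAS = [0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def interval_score(l, u, y, alpha):
    """Interval score: spread plus over-/underprediction penalties."""
    spread = u - l
    overprediction = (2 / alpha) * (l - y) if y < l else 0.0   # y below the PI
    underprediction = (2 / alpha) * (y - u) if y > u else 0.0  # y above the PI
    return spread + overprediction + underprediction

def wis(quantiles, y, alphas=ALPHAS):
    """Weighted interval score from a dict mapping quantile level -> value."""
    m = quantiles[0.5]                      # predictive median
    total = 0.5 * abs(y - m)
    for a in alphas:
        l = quantiles[round(a / 2, 3)]
        u = quantiles[round(1 - a / 2, 3)]
        total += (a / 2) * interval_score(l, u, y, a)
    return total / (len(alphas) + 0.5)      # the 1/11.5 normalization
```

As a sanity check on the reduction property mentioned above, a degenerate forecast with all 23 quantiles equal to a single point value yields a WIS equal to the absolute error of that point.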
As specified in our study protocol, whenever forecasts from a model were missing for a given week, we imputed the score with the worst (largest) value achieved by any other model for the respective week and target. However, almost all teams provided complete sets of forecasts, and very few scores needed imputation.

Individual models
During the evaluation period, forecasts from fifteen different models run by fourteen independent teams of researchers were collected. Thirteen of these were already available during the first part of our pre-registered study; see Table 3 and Supplementary Note 3 of (12) for detailed descriptions. Table 3 provides a slightly extended summary of model properties, including the two new models, itwm-dSEIR and Karlen-pypm; a more detailed description of the latter can be found in Supplement A.
During the evaluation period, only the ICM-agentModel explicitly accounted for vaccinations (given the low realized vaccination coverage by the end of the study period, this aspect likely had limited impact). Only four models (ICM-agentModel, Karlen-pypm, LeipzigIMISE-SECIR and MOCOS-agent1, each only for certain weeks) explicitly accounted for the presence of multiple variants.
To put the results achieved by the submitted models into perspective, the Forecast Hub team generated forecasts from three simple reference models: a last-observation-carried-forward model (KIT-baseline), a multiplicative extrapolation model which continues exponential growth or decline based on the last three observations (KIT-extrapolation baseline) and an exponential smoothing time series model (KIT-time series baseline) taken from (34). Detailed descriptions can be found in Supplementary Note 2 of (12). As a further external comparison we added the publicly available death forecasts by the Institute for Health Metrics and Evaluation (IHME, University of Washington; (35); available under the CC BY-NC 4.0 license). Here, we always used the most recent prediction available on a given forecast date.

Forecast ensembles
The Forecast Hub team used the submitted forecasts to generate three different ensemble forecasts.

KITCOVIDhub-median ensemble  The α-quantile of the ensemble forecast is obtained as the median of the α-quantiles of the member forecasts.

KITCOVIDhub-mean ensemble
The α-quantile of the ensemble forecast is obtained as the mean of the α-quantiles of the member forecasts.

KITCOVIDhub-inverse wis ensemble
The α-quantile of the ensemble forecast is a convex combination of the α-quantiles of the member forecasts. The weights are chosen inversely proportional to the mean WIS value obtained by the member models over the last six evaluated forecasts (the last three one-week-ahead, the last two two-week-ahead and the last three-week-ahead forecast; missing scores are imputed by the worst score achieved by any model for the respective target). This is done separately for each time series to be predicted.
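The three quantile-wise constructions above can be sketched as follows. This is a simplified illustration, not the hub's actual code: it assumes the member forecasts are already aligned on the same quantile levels, and omits the per-target bookkeeping and missing-score imputation described above.

```python
import numpy as np

def quantile_ensemble(member_quantiles, method="median", weights=None):
    """Combine member forecasts quantile by quantile.

    member_quantiles: array of shape (n_models, n_quantiles), rows aligned
    to the same nominal levels. 'median' and 'mean' mirror the first two
    KITCOVIDhub ensembles; 'weighted' takes a convex combination, e.g. with
    inverse-WIS weights as in the third.
    """
    q = np.asarray(member_quantiles, dtype=float)
    if method == "median":
        return np.median(q, axis=0)
    if method == "mean":
        return q.mean(axis=0)
    if method == "weighted":
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                 # normalize to a convex combination
        return w @ q
    raise ValueError(f"unknown method: {method}")

def inverse_wis_weights(mean_wis):
    """Weights inversely proportional to each model's mean recent WIS."""
    inv = 1.0 / np.asarray(mean_wis, dtype=float)
    return inv / inv.sum()
```

Because each output quantile is a monotone combination of member quantiles at the same level, the resulting ensemble quantiles remain ordered whenever the member quantiles are.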
Inverse score weighting has recently also been employed by (36), who found it to perform well in a re-analysis of forecasts from the US COVID-19 Forecast Hub. In the study protocol, the median ensemble was defined as our primary ensemble approach (10), which is why we display this version in all figures and focus our discussion on it. We have previously discussed advantages and disadvantages of the different ensemble approaches in (12).
There were no formal inclusion criteria other than completeness of the submitted set of 23 quantiles. The Forecast Hub team did, however, occasionally exclude forecasts with highly implausible central tendency or degree of dispersion. These exclusions have been documented in the Forecast Hub platform.

Figure 1 :
Figure 1: Reported cases (a) and deaths (b) in Germany and Poland according to Robert Koch Institute, the Polish Ministry of Health (MZ) and Johns Hopkins CSSE. Additional panels show (c) the share of cases due to the B.1.1.7 (Alpha) variant, (d) the overall level of non-pharmaceutical interventions as measured by the Oxford Coronavirus Government Response Tracker (OxCGRT) Stringency Index, and (e) the population shares having received at least one vaccination dose and complete vaccination. The dark grey area indicates the period addressed in the present manuscript, the light grey area the one from (12).

Figure 2 :
Figure 2: One-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland. The figure shows forecasts from a baseline model, the median ensemble of all submissions and a subset of submitted models with above-average performance. Asterisks mark prediction intervals exceeding the upper plot limit. The remaining submitted models are displayed in Supplementary Figure 8.

Figure 3: Two-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland. The figure shows forecasts from a baseline model, the median ensemble of all submissions and a subset of submitted models. Asterisks mark prediction intervals exceeding the upper plot limit. The remaining submitted models are displayed in Supplementary Figure 9.
nents of the weighted interval score (the darkest block in the middle of the stacked bar). In turn, this model received rather large penalties for both over- and underprediction. Other models, like LANL-GrowthRate, epiforecasts-EpiNow2 and ICM-agentModel, issued comparatively wide forecasts, leading to WIS values with large dispersion components. While there is no clear rule on what the score decomposition of an "ideal" forecast should look like, comparisons of the components provide useful indications of how to improve a model (e.g., the ITWW-county repro model might benefit from widening its uncertainty intervals).
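A minimal sketch of the WIS and its decomposition into dispersion, overprediction and underprediction components, following the interval-score formulation of the WIS as used in (12); the function signature and example numbers are ours:

```python
import numpy as np

def wis_decomposed(alphas, lower, upper, median, y):
    """Weighted interval score of one forecast, split into dispersion,
    overprediction and underprediction components.

    alphas: nominal exceedance levels of the central (1 - alpha) prediction
    intervals (the Hub's 23 quantiles correspond to 11 such intervals plus
    the median); lower/upper: the interval bounds; y: observed value.
    """
    alphas = np.asarray(alphas, float)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)

    dispersion = np.sum(alphas / 2.0 * (upper - lower))
    over = np.sum(np.maximum(lower - y, 0.0))   # observation below the PIs
    under = np.sum(np.maximum(y - upper, 0.0))  # observation above the PIs

    # The median enters as an absolute-error term with weight 1/2.
    over += 0.5 * max(median - y, 0.0)
    under += 0.5 * max(y - median, 0.0)

    norm = len(alphas) + 0.5
    return tuple(v / norm for v in
                 (dispersion + over + under, dispersion, over, under))

# Illustrative single-interval example (alpha = 0.5, i.e. a 50% PI):
wis, disp, over, under = wis_decomposed([0.5], [10.0], [30.0], 20.0, 40.0)
# wis ≈ 16.67; the underprediction component dominates, as the observation
# of 40 lies well above both the upper bound and the median.
```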

Figure 4: Average weighted interval score and absolute error achieved by models across countries, targets and forecast horizons. The grey area represents the performance of the baseline model KIT-baseline. WIS values are decomposed into components for forecast spread and penalties for overprediction and underprediction.
Shift from wild type to B.1.1.7 variant
The renewed increase in cases in both Germany and Poland ("third wave") in late February 2021 was due to the shift from the wild-type variant of the virus to the B.1.1.7 (or Alpha) variant; see Figure 1, panel (c), for estimated shares of the new variant over time. Given earlier observations about the spread of the B.1.1.7 variant in the UK, such a renewed increase could in principle be anticipated; among the submitted models, however, only the Karlen-pypm and LeipzigIMISE-SECIR models explicitly accounted for the presence of two variants, starting from 1 March 2021. As a result, most models did not anticipate the change in trend well and only reacted implicitly once the change became apparent in the data on 27 February 2021. Figure 5 shows the case forecasts of all submitted models and the median ensemble from 15 February, 22 February and 1 March 2021. We also show the two short time series of shares of the B.1.1.7 variant available from Robert Koch Institute at the respective prediction time points. The ITWW-county repro model was the only one to anticipate a change in trend on 15 February (though a slower one than observed), and adapted quickly to the upward trend in the following week. This model extrapolates recently observed growth or decline at the county level and aggregates these fine-grained forecasts to the state or national level. It may therefore have been able to catch a signal of renewed growth, as a handful of German states had already experienced a slight increase in cases in the previous week (e.g., Thuringia and Saxony-Anhalt, see panel (b) of Supplementary Figure 14). However, as illustrated in panel (a) of the same figure, the ITWW model had also predicted turning points earlier during the same phase of declining cases, and might generally have a tendency to produce such patterns. Another noteworthy observation in this context is the change in the predictions of the Karlen-pypm model. After the extension of the model to account for the B.1.1.7 variant on 1 March, its forecasts changed from the most optimistic to the most pessimistic among all included models (panels b and c of Figure 5). In Poland, availability of sequencing data was very limited during our study period; as indicated in (14), the GISAID database (15) contained only 2271 sequenced samples for Poland by 29 March 2021.

Figure 5: Panels (a)-(c): Point forecasts of cases in Germany, as issued on 15 February, 22 February and 1 March 2021. These dates mark the start of a renewed increase in overall case counts due to the new variant of concern B.1.1.7. Panel (d): Data by RKI on the share of the B.1.1.7 variant as available on the different forecast dates (the next data release by RKI occurred on 3 March).

Figure 6: Point forecasts of cases in Poland from 22 March, 29 March and 5 April 2021, surrounding the peak week.

Figure 7: Point forecasts of the median ensemble during changing trends in deaths. Panel (a): Downward turn in Germany, January 2021. Panel (b): Upward turn in Germany, March 2021. Panel (c): Upward turn in Poland, February/March 2021. Different colours represent forecasts made at distinct time points.
(i) Applying a mechanistic model to project future spread based on recent trends and other relevant factors like NPIs, population behaviour or vaccination. Models can then predict trend changes based on classical epidemiological mechanisms (depletion of susceptibles) or observed/anticipated changes in surrounding factors, which, depending on the model, may be treated as exogenous or endogenous.
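As a stylised illustration of mechanism (i) (not any submitted team's model), a simple two-variant growth model can produce a trend change endogenously: a declining wild type is gradually outweighed by a more transmissible variant. All growth factors and initial counts below are assumed purely for illustration:

```python
import numpy as np

# Stylised two-variant model: the wild type declines (weekly factor < 1)
# while the more transmissible B.1.1.7 variant grows (factor > 1).
r_wt, r_alpha = 0.85, 1.35       # assumed weekly multiplicative growth factors
wt, alpha = 10_000.0, 500.0      # assumed initial weekly case counts

total = []
for week in range(12):
    total.append(wt + alpha)
    wt *= r_wt
    alpha *= r_alpha

trend = np.sign(np.diff(total))  # -1 while total cases fall, +1 after the turn
```

Total cases first decline while the wild type dominates, then turn upward once the variant's share is large enough, mirroring the qualitative pattern observed in late February 2021.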

Figure 8: One-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland. Asterisks mark prediction intervals exceeding the upper plot limit. The figure shows forecasts from models not displayed in Figure 2.

Figure 9: Two-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland, same models as displayed in Figure 8. Asterisks mark prediction intervals exceeding the upper plot limit.

Figure 10: Three-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland, same models as displayed in Figure 2. Asterisks mark prediction intervals exceeding the upper plot limit.

Figure 11: Four-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland, same models as displayed in Figure 2. Asterisks mark prediction intervals exceeding the upper plot limit.

Figure 12: Three-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland, same models as displayed in Figure 8. Asterisks mark prediction intervals exceeding the upper plot limit.

Figure 13: Four-week-ahead forecasts of confirmed cases and deaths from COVID-19 in Germany and Poland, same models as displayed in Figure 8. Asterisks mark prediction intervals exceeding the upper plot limit.

Figure 14: (a) Forecasts of cases in Germany by the ITWW-county repro model, 25 January to 22 February 2021. (b) Forecasts of cases in selected German states by the ITWW-county repro model, 22 February 2021.

Table 1: Forecast evaluation for Germany and Poland (incidence scale, based on RKI/MZ data). C0.5 and C0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score.
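The coverage rates and absolute errors reported in this and the following tables can be computed straightforwardly; a minimal sketch for one prediction interval level, where the function name and array contents are illustrative:

```python
import numpy as np

def coverage_and_ae(lower, upper, median, y):
    """Empirical coverage rate of one prediction interval and the mean
    absolute error of the median forecast, across several forecast weeks."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    median, y = np.asarray(median, float), np.asarray(y, float)
    coverage = np.mean((y >= lower) & (y <= upper))
    ae = np.mean(np.abs(y - median))
    return coverage, ae

# Illustrative intervals over two weeks: the first observation is covered,
# the second is not, giving a coverage rate of 0.5.
cov, ae = coverage_and_ae([0.0, 10.0], [20.0, 30.0], [10.0, 20.0], [15.0, 35.0])
```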

Table 2: Forecast evaluation at the regional level, Germany and Poland (incidence scale, based on RKI/MZ data). Results are averaged over the different regions (states in Germany, voivodeships in Poland). C0.5 and C0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score.

Table 3: Forecast models contributed by independent external research teams. Abbreviations: NPI: Does the forecast model explicitly account for non-pharmaceutical interventions? Test: Does the model account for changing testing strategies? Variants: Does the model accommodate multiple variants? Age: Is the model age-structured? DE, PL: Are forecasts issued for Germany and Poland, respectively? Regional: Were regional-level forecasts submitted for at least one country? Truth: Which truth data source does the model use? Pr: Are forecasts probabilistic (23 quantiles)? Detailed descriptions of the different models can be found in (12), Supplementary Note 3 and Supplement A of this article.

Table 4: Forecast evaluation for Germany and Poland, 3 and 4 weeks ahead (incidence scale, based on RKI/MZ data). C0.5 and C0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score.

Table 5: Forecast evaluation at the regional level, Germany and Poland, 3 and 4 weeks ahead (incidence scale, based on RKI/MZ data). Results are averaged over the different regions (states in Germany, voivodeships in Poland). C0.5 and C0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score.

Table 6: Forecast evaluation for Germany and Poland, 1 and 2 weeks ahead (cumulative scale, based on RKI/MZ data). C0.5 and C0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score. Asterisks mark entries where scores were imputed for at least one week. Weighted interval scores and absolute errors were imputed with the worst (largest) score achieved by any other forecast for the respective target and week, so that models marked thus received a pessimistic assessment of their performance. If a model covered less than two thirds of the evaluation period, results are omitted.

Table 7: Forecast evaluation for Germany and Poland, 3 and 4 weeks ahead (cumulative scale, based on RKI/MZ data). C0.5 and C0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score. Asterisks mark entries where scores were imputed for at least one week. Weighted interval scores and absolute errors were imputed with the worst (largest) score achieved by any other forecast for the respective target and week, so that models marked thus received a pessimistic assessment of their performance. If a model covered less than two thirds of the evaluation period, results are omitted.

Table 8: Forecast evaluation for Germany and Poland, 1 and 2 weeks ahead (incidence scale, based on JHU data). C0.5 and C0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score.

Table 9: Forecast evaluation for Germany and Poland, 3 and 4 weeks ahead (incidence scale, based on JHU data). C0.5 and C0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score.

Table 10: Forecast evaluation for Germany and Poland, pooled across evaluation periods, 1 and 2 weeks ahead (incidence scale, based on RKI/MZ data). C0.5 and C0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score.

Table 11: Forecast evaluation for Germany and Poland, pooled across evaluation periods, 3 and 4 weeks ahead (incidence scale, based on RKI/MZ data). C0.5 and C0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score.