Collaborative efforts to forecast seasonal influenza in the United States, 2015–2016

Since 2013, the Centers for Disease Control and Prevention (CDC) has hosted an annual influenza season forecasting challenge. The 2015–2016 challenge consisted of weekly probabilistic forecasts of multiple targets, including fourteen models submitted by eleven teams. Forecast skill was evaluated using a modified logarithmic score. We averaged submitted forecasts into a mean ensemble model and compared them against predictions based on historical trends. Forecast skill was highest for seasonal peak intensity and short-term forecasts, while forecast skill for timing of season onset and peak week was generally low. Higher forecast skill was associated with team participation in previous influenza forecasting challenges and utilization of ensemble forecasting techniques. The mean ensemble consistently performed well and outperformed historical trend predictions. CDC and contributing teams will continue to advance influenza forecasting and work to improve the accuracy and reliability of forecasts to facilitate increased incorporation into public health response efforts.

for influenza-like illness (ILI), hospitalizations, and mortality 5 . While these systems collect valuable data, they are intrinsically describing activity that occurred in the past and require data processing time, limiting their utility for real-time public health decision making. Accurate and timely forecasts of influenza activity could assist in the public health response to both seasonal epidemics and future pandemics.
Since the 2013-2014 influenza season, CDC has hosted collaborative challenges to forecast the timing, intensity, and short-term trajectory of ILI activity in the United States using data from the US Outpatient Influenza-like Illness Surveillance Network (ILINet), a robust and geographically broad surveillance system, as its benchmark 6,7 . While ILI can capture both influenza and non-influenza illnesses, it is of high public health value as it correlates strongly with laboratory confirmed influenza and its magnitude correlates well with other measures of seasonal influenza severity 8 .
To continue the advancement of forecasting science, the application of forecasts for public health decision-making, and the development of best practices, CDC and challenge participants update challenge guidelines each year. For example, after the first challenge, several improvements were made including standardizing forecast submission formats, requiring specification of probabilistic forecasts rather than point forecasts, and implementing fully quantitative forecast evaluation 9 . Additional changes were made for the 2015-2016 season to improve the public health utility of the forecasts. First, challenge participants provided forecasts with increased resolution for peak intensity and trajectory predictions, which allows for a more detailed interpretation of forecasts and flexibility in scoring forecast accuracy. In addition, the evaluation methodology was modified to allow for a pre-specified number of preceding and proceeding values to be considered correct to reduce the effect of revisions to ILINet on forecast scores. To help communicate forecasts in real-time, a public webpage to host predictions was created 10 .
In the present analysis, we summarized the results and insights gained from the 2015-2016 challenge and identified areas for improvement moving forwards. We also evaluated the performance of a simple average ensemble of the submitted influenza forecasts since ensemble forecasts have demonstrated several advantages over single forecast models in both weather and infectious disease forecasting [11][12][13][14][15] . Finally, we used gamma regression to investigate characteristics of both forecast models and influenza seasons that may be associated with increased forecast accuracy. Figure 1 shows the national ILINet curve for the 2015-2016 season in comparison to the 2009-2010 through 2014-2015 seasons. Compared to earlier seasons, the 2015-2016 season started later and had a later peak. The peak intensity was 3.5%, well below the high value of 7.7% set in the 2009-2010 pandemic season and below the peak of 6.0% in 2014-2015. Seasonal forecast targets and evaluation periods for short-term forecasts for each region are shown in Table 1. The evaluation period for each target reflected the Morbidity and Mortality Weekly Report (MMWR) calendar weeks when forecasts for that target have the most utility (see Methods).

Results
Eleven teams submitted fourteen separate forecasts (models A-N, Table 2).  Forecast Accuracy. Overall forecast accuracy was assessed using a metric of forecast skill, where 1 is a perfect forecast and 0 means the forecast assigned a <1% probability to the correct outcome. Average forecast skill for national targets over their respective evaluation periods are shown in Table 3. At the national level, median team forecast skill was highest for short-term forecasts of ILINet 1 week in advance (0.66) and decreased for the 2, 3, and 4 week ahead targets (median skill 0.41, 0.31, and 0.29, respectively). Median forecast skill for peak intensity (0.30) was comparable to that of short-term forecasts at 3-4 weeks. Median forecast skill for peak week and onset week are not directly comparable to the ILINet intensity forecasts because the scales and bins were different, but were both low (0.03 and 0.04, respectively). At the national level, Model F had the highest skill for forecasts of ILINet 1 and 3 weeks ahead (0.89 and 0.66, respectively), as well as the highest average skill across all short-term targets (0.70). Model L had the highest skill for forecasts of ILINet 2 weeks ahead (0.76), while Model G had the highest skill for forecasts of ILINet 4 weeks ahead (0.58). Model B had the highest skill for both season onset and peak week (0.18 and 0.20, respectively), as well as the highest average skill across all seasonal targets (0.27). Model E had the highest skill for maximum intensity (0.66). Models E, F, G, and L had skills greater than the median team skill for all targets, while Model B had skills greater than the median team skill for all targets except forecasts of ILINet 1 week ahead.
As the season progressed, forecast skill for season onset, peak week, and peak intensity at the national level generally increased, though individual team skill varied considerably, especially for peak intensity (Fig. 2). For all seasonal targets, skill improved noticeably once the target of interest had passed according to observed ILINet data. For example, no models assigned a >50% probability to week 10 as the peak on week 6, while at week 10, 36% of submitted models assigned a >50% probability to week 10 as the peak. Week-ahead forecasts at the national level also showed considerable variability (Fig. 3), especially near the peak intensity of the influenza season when week-to-week variability in the ILINet value was the highest. All short-term forecasts had noticeable dips in accuracy around MMWR weeks 50 and 10, corresponding to inflection points in the ILINet data ( Fig. 3). At the regional level, median model forecast skill generally followed the same trend as the national level across the short-term forecasts, with 1 week ahead forecasts having the highest score and 4 week ahead forecasts having the lowest (Supplementary Tables S1-S10). Median forecast skill for peak intensity and onset week varied considerably across the regions, with scores ranging from 0.06 to 0.51 and 0.003 to 0.46, respectively. Median forecast skill for peak week was low across all regions, ranging from 0.006 to 0.15. Across regions, median model forecast skill for short-term targets was lowest in HHS Regions 6 and 9 and highest in HHS Regions 8 and 10, while median skill for season targets was highest in HHS Regions 9 and 10 and lowest in HHS Regions 2 and 4 ( Table 4).

FluSight Ensemble and Historical Average Comparisons.
Nine models outperformed the model based on the historical average of ILINet data at the national level for both peak intensity and 1 week ahead forecasts, while seven models outperformed the historical average for 2 weeks ahead and six models outperformed the historical average for both peak week and 3 weeks ahead (Table 3). Only three models outperformed the historical average for forecasts of ILINet 4 weeks ahead and only one model outperformed the historical average for onset week. For all targets at the national level, a model consisting of the unweighted mean of submitted models, which we refer to as the FluSight Ensemble model, outperformed the majority of submitted models (Table 3). Similar performance was seen for forecast targets at the HHS Regional level (Supplementary Tables S1-S10).

Gamma Regression by Model and Influenza Season
Characteristics. Seven models were submitted by four teams that participated in the 2014-2015 CDC influenza forecasting challenges, though model specifications were updated between seasons. On average, these models were significantly more accurate than the seven forecasts submitted by first-time participating teams (Fig. 4). Five models utilized mechanistic models, encompassing compartmental modelling strategies such as Susceptible-Infected-Recovered (SIR) models. These methods attempted to model the underlying disease transmission dynamics and translated that into forecasts. Nine models utilized statistical methods that did not attempt to model disease transmission, but instead directly estimated the ILINet curve or target of interest using approaches such as time-series analysis or generalized linear models. The statistical models generally outperformed mechanistic models, with significant differences for peak week and 2, 3, and 4 week ahead forecasts. During the period between MMWR weeks 50-1, when there is a historic rise and dip in ILI values ( Fig. 1), statistical models generally outperformed mechanistic models for 1-4 week ahead forecasts (Supplementary Table S11). Five models used only ILINet data to inform their predictions and nine models used additional data sources beyond those available in ILINet. The models using only ILINet data generally outperformed models incorporating additional data, with significant differences for all targets except peak week. Finally, six models used an ensemble approach combining predictions from multiple "in-house" models, and these models were associated with significantly higher skill for all forecasting targets compared to single models. We compared forecasts across all forecast locations (10 HHS regions and the entire United States) to assess how seasonal characteristics (timing of season onset and peak, level of peak intensity relative to baseline, number of weeks above baseline, revisions to initial published wILI% values) affected forecast skill for those targets. Forecasts of peak week and onset were less accurate and forecasts for 1 and 2 weeks ahead were slightly more accurate for locations with a later peak, though forecast skill for other targets was unaffected (Fig. 5). Similarly, forecasts of season onset were less accurate for locations with a later onset. Relative peak intensity, defined as the peak intensity for a location divided by that location's baseline ILINet value, had a significant but small association with increased accuracy for forecasts of 1 week ahead, but was not associated with forecast accuracy for any other targets examined. Short-term forecasts were generally less accurate in locations with longer influenza seasons, as  measured in the number of weeks wILI% was above baseline. Forecasts of season onset were significantly more accurate in locations with longer influenza seasons, while accuracy of peak week and peak intensity forecasts were not associated with the length of the influenza season. Forecasts of short-term targets were also less accurate for forecasts based on weeks with larger differences between the initial and final published ILINet values than weeks with smaller revisions.

Discussion
The 2015-2016 influenza season was the third consecutive influenza season that CDC hosted an influenza forecasting challenge. This accumulating body of real-time forecast data provides new insights on forecast accuracy, relative model performance, the value of ensemble approaches, and the challenges of influenza forecasting. Forecast skill varied as the season progressed. Short-term forecast skill was generally highest at during the shoulders of the season when ILINet values were low and relatively constant and lowest around the peak week, an inflection point of the ILINet curve and a period during the influenza season when forecasts likely have the highest value from a public health perspective (Fig. 3). Forecast accuracy for seasonal targets, on the other hand, generally improved throughout the season as models incorporated new data. Skill for the seasonal targets generally began to improve substantially between 2 and 4 weeks prior to the predicted event (Fig. 2). While this improved accuracy may in part reflect more accurate short-term forecasts, the identification of the change from increasing to decreasing incidence (i.e. the peak) is a critical milestone for decision-makers. Even a lead time of only a few weeks is helpful for situational awareness, especially with a reporting delay of 1 to 2 weeks for initial surveillance data, subsequent revisions to those data as reporting is completed, and week-to-week variation that may occur even in the complete surveillance data. These forecasts can therefore provide public health officials with some level of confidence that the event has occurred.
A comparison of forecast skill across the forecast locations revealed additional characteristics of forecast performance. Forecasts for onset week and peak week generally had lower skill in locations with later onset weeks and peak weeks (Fig. 5). Seasonal targets that occur particularly early or late in a flu season are likely harder to predict simply because they are atypical, possibly with respect to other locations in the same season, with respect to previous seasons in the same location, or both. For short-term forecasts, these effects were not as strong (Fig. 5), indicating that late seasons have less of an effect on short-term forecasts. Conversely, short-term forecasts  based on weeks with large subsequent revisions to the originally published ILINet values were less accurate than forecasts based on weeks that had minimal revisions to the final ILINet values. This is supported by the low median scores for short-term targets seen in HHS Regions 6 and 9 ( Table 4), both of which had among the highest levels of backfill during the 2015-2016 season ( Supplementary Fig. S1). Compared to the previous season (2014-2015), average forecast skill in 2015-2016 was higher for peak intensity and lower for onset week and peak week ( Table 5). The higher skill for peak intensity may reflect that 2014-2015 was an abnormally intense season while the peak intensity for 2015-2016 was more in line with typical past seasons (Fig. 1). Meanwhile, the onset and peak occurred later than typical in 2015-2016, possibly leading to the lower forecast skill for onset week and peak week. Notably, this agrees with our finding that forecast skill was lower for locations with later onset and peak weeks. The short-term forecasts had higher average forecast skill in 2015-2016 compared to 2014-2015. This may reflect short-term dynamics that were easier to predict, but more likely indicates higher model accuracy as this improvement was seen across locations where dynamics were quite different (Table 5).
Overall, there was no single best model across all targets; eleven of the fourteen participating models had the highest average score for at least one of the 77 short-term and seasonal targets across the 10 HHS regions and the United States. Nonetheless, Model B and Models E, F, G, and L consistently outperformed the FluSight Ensemble, other models, and the historical average for the seasonal and short-term targets at the national level, respectively. Also of note, the FluSight Ensemble outperformed the majority of individual forecast models for all targets and the historical average for all seven targets at the national level, showing that the combined forecasts provided more reliable information than most specific forecasts and more information than historical data alone. As the FluSight Ensemble was a simple average of received forecasts, the application of more sophisticated ensemble methods offers an opportunity for further improvements. The intention of the FluSight Ensemble was to evaluate a simple a priori ensemble approach that could be used during the season to combine information from multiple models, and as such we did not evaluate a posteriori approaches that could not be applied in real-time.
The variation in accuracy between models and the wide variety of forecasting approaches also provides insight into the characteristics of more accurate models. Comparisons of these approaches are not generalizable because they only reflect the combination of characteristics included in the submitted models, nothing close to the full spectrum of possible approaches 16 . They nonetheless provide notable insights. Models submitted by teams who had competed in the CDC forecasting challenge in previous years generally outperformed models submitted by new teams (Fig. 4). This may reflect self-selection of high-performing teams deciding to continue participating or it may indicate the value of participating in previous challenges. Making and submitting updated probabilistic forecasts on a weekly basis is a substantial technical challenge and those with experience doing that may be in a better position to identify and implement changes to improve accuracy. Models using ensemble approaches to generate their forecasts also outperformed single models, providing additional evidence of the value of ensemble forecasting approaches. Models that used data in addition to ILINet were less accurate than those only using ILINet data for six of seven targets, indicating that including auxiliary data does not necessarily lead to more accurate forecasts. Comparisons between statistical and mechanistic approaches indicated that performance varied by target, with statistical models outperforming mechanistic models for four of seven targets. However, the  five models that consistently outperformed the median skill for all targets were all statistical models, illustrating the potential for this forecasting method. Additionally, during MMWR weeks 50 -1, when there historically is a peak and dip in ILI values (Fig. 1), statistical models outperformed mechanistic models (Supplementary Table S11), illustrating that the statistical approaches may be more resilient to predictable patterns in ILI. As more forecasting approaches are applied over more seasons, more locations, and more diseases, more substantive analyses of these differences will be possible.
As an open, standardized, real-time, forecasting challenge, the CDC influenza forecasting challenge has provided unique insight into epidemic forecasting. The results highlight the continuing challenge of improving forecast accuracy for more seasons and at lead times of several weeks or more, forecasts that would be of even more utility for public health officials. To improve future forecasts, we found evidence that experience may help, that there is room for improving the use of external data, and that combining forecasts from multiple models in ensembles improved accuracy. Despite remaining challenges, both the top models and the FluSight Ensemble provided more accurate forecasts than historical data alone. Moreover, the accuracy for more typical seasons and for nearer targets (e.g. 1-week vs. 4-week ahead forecasts or peak forecasts early in the season vs. as the peak approaches) indicates that the models are producing valuable information as is. Because these forecasts are available in real time, they can actively improve situational awareness and be used to directly address immediate public health needs such as planning for hospital staffing and bed availability, outbreak preparedness, and stocking of antivirals.
Interest in infectious disease forecasting has increased in recent years, with challenges to predict epidemics of both chikungunya 17 and dengue fever 18 in addition to influenza. As the only ongoing infectious disease forecasting challenge in the United States, the CDC influenza forecasting challenge sets a model for other infectious diseases by identifying data and resource constraints that limit model development, establishing best practices for forecast submission and evaluation, identifying areas where forecasts can be improved, tying forecasting efforts to real public health needs, and assessing their performance related to those needs.

Methods
Challenge Structure. Teams from the previous challenge as well as research groups with experience in influenza or infectious disease forecasting worked with CDC to define the structure for the 2015-2016 challenge. Teams submitted weekly forecasts from November 2, 2015, to May 16, 2016. Forecasting targets were based on data from ILINet, a syndromic surveillance system consisting of more than 2,000 outpatient providers 5 . These providers send CDC weekly reports consisting of the number of patients with ILI and the total number of patients seen. These reports are weighted based on state population to determine a weighted percentage of patient visits due to ILI (wILI%). ILINet data use the Morbidity and Mortality Weekly Report (MMWR) week system, where a week starts on a Sunday and ends on a Saturday. Data for a given MMWR week are usually released the following Friday in CDC's weekly FluView publication 19 . Each week's publication includes initial values for the most recent week as well as potential revisions of prior published values, and the difference between initial and final published value varies by week and region ( Supplementary Fig. S1). Forecasting targets included seasonal and short-term targets. To participate in the challenge, teams were required to submit predictions on each target for the United States as a whole and had an option to submit predictions for each of the ten HHS Regions. The seasonal targets were onset week, defined as the first week where wILI% was at or above the location-specific baseline and remained at or above for at least two additional weeks; peak week, defined as the MMWR week during which the wILI% was highest; and peak intensity, defined as the season-wide maximum wILI%. Short-term targets included forecasts of wILI% one-, two-, three-, and four-weeks in advance of FluView publication. Due to the delay in reporting surveillance data (e.g. data for MMWR week 50 is published on MMWR week 51), the short-term targets provide a forecast for ILINet activity that occurred in the past week (1-week ahead), the present week (2-weeks ahead), and 1 (3-weeks ahead) and 2 weeks (4-weeks ahead) in the future.
As in the 2014-2015 season, participants submitted forecasts weekly as point estimates and probability distributions in a series of bins categorized across possible values for each target. For onset week and peak week, there was a bin for each single week of the season, with an additional bin for onset week corresponding to no onset. For Each sub-panel is divided by seasonal characteristics including observed timing of onset week, observed timing of peak week, relative intensity of the peak wILI% value to the baseline value, the number of weeks ILINet remained above baseline, and the absolute difference between the initial published wILI% value for the week a forecast is based on and the week's final wILI% value. Bold diamonds represent the average log score across models for each target in each category. Solid lines indicate statistically significant differences determined by multivariable gamma regression controlling for forecast week.  peak intensity and short-term forecasts, semi-open 0.5% intervals (e.g. 1.0% ≤ wILI% < 1.5%) were used from 0% up to 13%, with the final bin representing the probability for all values greater than or equal to 13%. Teams submitted a written narrative of their forecasting methods for each model. Changes to the methods and narrative description were permitted during the season. This study did not involve human participants and institutional review board approval was not required.

Historical Average Forecast.
To provide a benchmark to compare submitted forecasts to, we created a historical average forecast using ILINet data from the 1997-1998 flu season through the 2014-2015 flu season, excluding the 2009-2010 H1N1 pandemic year as its dynamics were atypical compared to seasonal epidemics. A Gaussian kernel density estimate using bandwidths estimated by the Sheather-Jones method 20 was fitted to each MMWR week's previous observed ILINet values, and approximate probabilities for each prediction bin were calculated by integrating the kernel density 9 . The point estimate was the median of the estimated distribution. Forecasts for onset week, peak week and peak intensity were calculated in the same way. Onset week forecast probabilities were adjusted to reflect the probability of no onset week based on the percentage of prior years in which ILI values did not cross the region-specific baseline. As CDC only began publishing regional baselines with the 2007-08 flu season, only seasons from that point onwards were used to calculate the onset week forecasts.

Unweighted FluSight Ensemble.
To evaluate the utility of a simple ensemble of influenza forecasts, we constructed an unweighted average of the individual forecasts received, which we refer to as the FluSight Ensemble. The estimated distribution of the FluSight Ensemble was created by taking the arithmetic mean of all submitted distributions for a given target/location combination during a particular week. As with the historical average forecast, we used the median of each distribution as the point estimate.
Forecast Evaluation. We compared the forecasts, including the historical average and FluSight Ensemble forecasts, to weighted ILINet data published on MMWR week 28 (ending July 16, 2016), which was chosen a priori to represent final ILINet values for the season. We scored the forecasts using a forecast skill metric derived from the logarithmic scoring rule 21,22 . Let p be the binned probabilities submitted for a given forecast target, with p i the probability assigned to the bin containing the observed outcome i. For all targets, we included the bin above (i + 1) and below (i -1) the observed outcome, and calculated the logarithmic score as 1 For example, if the peak week was MMWR week 10, the logarithmic score would be calculated by summing the probabilities assigned to MMWR weeks 9-11 and taking the natural logarithm of that sum. In the case of multiple weeks having the same maximum wILI% and therefore being peak weeks, both peak weeks were considered as observed outcomes and the bins surrounding each peak were also included in the calculated score. Scores below −10, missing forecasts, or forecasts that summed to probabilities less than 0.9 or greater than 1.1 were all assigned scores of −10. Scores were averaged across different combinations of locations, targets, and time periods. As in 2014-2015, the averaged log scores were exponentiated to create a forecast skill on a scale of 0 to 1. Perfect forecasts would receive a skill of 1, while forecasts that assign very low probabilities to the observed outcome would receive a skill close to 0.
Evaluation periods varied by target and were chosen at the end of the season to include the weeks in which forecasts would have had the most utility to public health decision makers. The evaluation period for each seasonal target began with the first forecast submitted (MMWR week 42) and ended a target-specific number of weeks after each outcome had occurred. For onset week, the evaluation period ended six weeks after the season onset. For peak week and peak intensity, the evaluation period extended until one week after wILI% went below baseline and stayed below baseline for the remainder of the season (Table 1). For short-term forecasts, the evaluation period for each location began with forecasts received four weeks prior to season onset in that location and extended to 4 weeks after ILINet returned below baseline for that location.
We utilized gamma regression to analyse the effect of model type, data sources, targets, absolute change between initial and final published wILI% in the week each forecast was based on, and season types (e.g., late vs. early defined continuously by season onset and peak week) on forecast accuracy characterized as the negative log score. Gamma regression is restricted to outcome values greater than or equal to zero and is well-suited for analysing right-skewed data. For all regression models, we analysed across all weekly forecasts, targets and locations, excluding week-target-location forecasts that were not submitted. For comparisons of model characteristics, we controlled for location and the week a forecast was received in the regression analysis. For comparisons of seasonal characteristics across regions, we controlled for the week a forecast was received.
To compare forecasts across seasons, we summarized the 2015-2016 forecasts received into the larger, 1% wide bins utilized in the 2014-2015 challenge and scored the forecasts using the 2014-2015 log scoring rules. Forecasts for onset week and peak week were scored the same way during the 2014-2015 season, while for peak intensity and the short-term targets, only the probability assigned to the bin containing the observed value p i was used.
Analyses were conducted using R version 3.4.3 23 and significance was assessed using a cutoff of p < 0.05.

Data Availability
The received forecasts that support the findings of this study are publicly available on the CDC Epidemic Prediction Initiative GitHub page at https://github.com/cdcepi/FluSight-forecasts.