A new time-varying coefficient regression approach for analyzing infectious disease data

Since the beginning of the global pandemic of Coronavirus (SARS-COV-2), there has been many studies devoted to predicting the COVID-19 related deaths/hospitalizations. The aim of our work is to (1) explore the lagged dependence between the time series of case counts and the time series of death counts; and (2) utilize such a relationship for prediction. The proposed approach can also be applied to other infectious diseases or wherever dynamics in lagged dependence are of primary interest. Different from the previous studies, we focus on time-varying coefficient models to account for the evolution of the coronavirus. Using two different types of time-varying coefficient models, local polynomial regression models and piecewise linear regression models, we analyze the province-level data in Canada as well as country-level data using cumulative counts. We use out-of-sample prediction to evaluate the model performance. Based on our data analyses, both time-varying coefficient modeling strategies work well. Local polynomial regression models generally work better than piecewise linear regression models, especially when the pattern of the relationship between the two time series of counts gets more complicated (e.g., more segments are needed to portray the pattern). Our proposed methods can be easily and quickly implemented via existing R packages.

Since the WHO declared the novel coronavirus (COVID-19) outbreak a global pandemic on March 11, 2020, impacts of the pandemic on people's daily life have been profound in many different aspects (e.g., physical health, mental health, social impacts).Despite the strong global desire to end the pandemic, the evolving variants and subvariants of SARS-CoV-2 have posed challenges for predicting what is ahead.Under the pressure of co-circulation of viral infections, such as the tripledemic (COVID-19, seasonal influenza, RSV) in the 2022-2023 influenza season, healthcare systems can be easily over-burdened due to the quickly rising number of cases with severe symptoms in need of medical care.Though the currently dominant Omicron subvariants can be less severe than the original variant, in early 2023 the Canadian healthcare system remained under the risk of crisis.This was due to the risk that the faster transmission/spread of the dominant variants may lead to a larger number of people who need to seek medical care within a short time period.
There is a large body of existing literature on predicting/forecasting COVID-19-related deaths and hospitalizations.According to Avery et al. 1 , two primary types of modeling are dominant.One type of model is mechanistic and focuses on the underlying process of the disease spread.For example, system dynamics models with different formulations of state variables have seen numerous applications in COVID-19 modeling.The Centers for Disease Control and prevention (CDC) has featured a set of different prediction models for COVID-19 death forecasting 2 .These models are generally complex and rely on assumptions that are often violated (e.g.homogeneity) or hard to verify, as discussed by Avery et al. 1 and Li et al. 3 .
The other primary type of model is phenomenological, usually parameterized through curve-fitting based on reported data.This is the type of modeling our proposed work follows.The focus is not the transmission dynamics, but rather the relationship between the reported cases and deaths.The majority of the existing work on phenomenological modeling considers a single time series of interest (e.g., time series of case counts).To name a few, Cascon and Shadwick 4 and Harvey and Kattuman 5 use the Gompertz Function to model cumulative pandemic case counts, and Dash et al. 6 use a logistic growth model with accommodations for nonlinear trend and seasonality.Additionally, Petropoulos et al. 7 applied a non-seasonal multiplicative error to a multiplicative trend exponential smoothing model.Change-point models have received a substantial amount of attention to capture abrupt changes in a single time series (e.g.Jiang et al. 8 and 9 ).Clearly which time series model to use should depend on the pattern exhibited by the data to be analyzed.
It is worth noting that artificial intelligence (AI) methods have seen some successful applications in COVID-19 related predictions 10,11 .Nonetheless, it is still not clear when AI methods can be applied successfully 12 or which AI methods are best.In contrast, the aim of our proposed modeling is to find a more general and flexible tool that can accommodate various kinds of relationships between one time series and another.
There has been quite sparse literature that makes use of the relationship between the different time series for prediction.To our best knowledge, Hierro et al. 13 is the only paper of this kind.In their work, the so-called delayed elasticity method (hereafter referred to as DEM) is used to characterize the relationship between cumulative death counts and cumulative case counts.Intuitively, their method, which is essentially classical linear regression models, may work well at the beginning of the pandemic (limited available data) but may not be able to fully capture the evolving relationships for a longer study period.We compare our proposed methods to theirs in "Data analysis" section and the supplementary document.
Distinct from the existing literature on COVID-19, we aim to build a general and flexible modeling approach that can capture the dynamic nature of the relationships between different COVID-19 data series of interest.With such a relationship, we can predict future deaths/hospitalizations based on the case counts up to present.It is worth noting that our approach can also be used to analyze other infectious disease data or wherever dynamics in a lagged dependence relationship is of interest.With this aim in mind, time-varying coefficient regression models 14 are a natural choice.Different from the classical linear regression models, the regression coefficients are not fixed as constant but rather functions of some other covariate(s) (e.g., time).The fact that Canada, for example, has experienced several different waves driven by different coronavirus variants suggests that it would be more appropriate to consider the time-varying relationship between case counts and death counts (or hospitalization counts).
To summarize, the novelty of our work is two-fold.First, we introduce an explicit way to account for the lagged dependence between the predictor and response variables in the context of varying-coefficient models.Moreover, statistical learning is successfully combined with a machine learner that selects the optimal lag based on out-of-sample predictability.Second, our method produces inferences for the out-of-sample predictions, while most of the existing literature on time-varying coefficient models focuses on regression coefficients.
As explained in Section 2 of the seminal paper by Hastie and Tibshirani 14 , time-varying coefficient regression models have a broad general form and thus include several commonly used models as special cases (e.g., generalized linear models, generalized additive models, piecewise linear regression models).In this study, we consider two different techniques: local polynomial regression and piecewise linear regression.Influenced by the extensive literature on the kernel estimation in time-varying coefficient models, we embarked with local polynomial regressions 15,16 .Then we realized the smoothness assumption of the regression coefficients may not always be valid, especially when there are abrupt changes.For example, when Omicron's subvariants quickly took over the dominance of Omicron in some major cities (or travel hubs), the relationship between case counts and deaths/hospitalizations may have changed quickly accordingly.To address this possibility, we also consider piecewise linear regression models that (1) bear simple parametric forms and (2) can capture abrupt changes.
We acknowledge the fact that neither the reported case counts nor the reported death counts are truly reflecting the underlying true variables, respectively.As such, our objective is to capture the dynamic relationship between reported COVID-19 data.One key novelty in our approach is to identify the lag between different reported data series (e.g., death counts and case counts).For details, please see "Models and notation" section.

Models and notation
Modeling the daily counts is a natural choice and has been considered in many studies.Nonetheless, we noticed the pattern between cumulative counts is much cleaner.Consider, for example, Fig. 1 for the scatter plots for Ontario data between 2021-10-31 and 2022-02-10.As shown in the bottom panel in Fig. 1, there is a very neat pattern between cumulative counts.But there is no such clean pattern in the daily counts plot (top panel in Fig. 1).The noise level seems to be fairly high so even a time-varying coefficient model may not lead to a good fit.It turns out the signal may get strengthened by balancing out the noise in the daily counts when adding them up.Therefore, we consider cumulative counts in our modeling but utilize daily counts for model assessment/ comparison.After all, the future trend of daily counts is of our primary interest.We remark that the fitted/ predicted values in cumulative counts can be easily converted to fitted/predicted values in daily new counts by taking the difference between any two consecutive cumulative counts.
In the following, we will present two modeling strategies to capture the lagged dynamic relationship, that is, local polynomial regression and piecewise linear regression.
Local polynomial regression with lagged dependence.Different from the classical regression models, the time varying coefficient regression model allows the regression coefficients β 's to change over some other covariate (called smoothing variable).Suppose the data are in form of (Y i , X i ) n i=1 .The time-varying coefficient regression model is where To reflect the lagged dependence between the death count time series and other time series (e.g., case count time series), the outcome variable (Y) in the above model is for time i + L conditional on the value of the predic- tor (X) at time i.The optimal choice of L is selected based on the criterion of minimizing mean squared prediction errors for out-of-sample daily counts predictions.More details will be given in "Selection of lag" section.
The local constant (also called the Nadaraya-Watson method) and local linear estimation methods are the commonly used local polynomial methods for time-varying coefficient regression models.The local constant estimates are obtained by minimizing the following objective function, that is, ) is the kernel and b is a bandwidth, and Obviously the resulting estimator depends on the choice of bandwidth b.As stated in 17 , the bandwidth is selected by cross validation (leave-oneout cross-validation by default in tvReg).The triweight kernel is the default choice in tvLM, an R function in the R package tvReg 17 .
The local constant estimator can be written in the following matrix form that resembles the weighted least squares estimator.Let Then we have the local constant estimator, denoted by β, Similarly, local linear estimators can be obtained by minimizing where β (1) (t) is the first order derivative of β(t).
Let U t = diag{1/n − t, . . ., i/n − t, . . ., 1 − t} and Ŵ t = (X, U t X) .Therefore, the local linear estimator can be expressed in the following matrix form Piecewise linear regression model with lagged dependence.As explained in "Introduction" section, the smoothness imposed for the time-varying coefficients may not be able to capture some abrupt changes.Therefore, we consider piecewise linear regression models as an alternative.
where with k defined as a breakpoint.We also assume E( In our data analysis, the R package "segmented" 18 is used to implement the model estimation for piecewise linear regression models.The restarting bootstrap method 19 is implemented in "segmented" to handle spurious local minima (e.g., flat segments).The segmented package also provides an automatic option 20 for determining the number and location of the breakpoints.We found the automatic option over-estimated the number of breakpoints, which led to worse performance for out-of-sample prediction.Therefore, we don't recommend the use of the automatic option because out-of-sample prediction is of our primary interest.In the following data analysis with the piecewise linear regression method, we selected the starting values of breakpoints by examining the scatter plot.

Selection of lag
In this subsection, we discuss how to determine the lag for the dependence between the response variable and the covariate.For the response variable, we consider Y t+l with l = L min , L min + 1, . . ., L max with varying lag l.In our data analysis, we use L min = 5 and L max = 21.The optimal lag that best captures the lagged dependence, denoted by L, is selected to be the one that gives smallest mean squared prediction error for out-of-sample daily counts prediction.
To explain, for each l ∈ {L min , . . ., L max } , we fit the model (either local polynomial regression or piecewise linear regression) for the same training data.Suppose the maximum date of the data is T m , then the training data set consists of data up to date T m − 2L max .Then we predict the cumulative death counts for the time window We refer to the next section for details about how to conduct out-of-sample (2) prediction.By subtracting any two consecutive cumulative counts, we get the predicted daily counts for the time window . Therefore, we can calculate the mean squared prediction error (MSPE) for daily death counts reported during [n − 2L max + 2, n − 2L max + l] for each l.The optimal lag is the value of l that yields the smallest MSPE.

Prediction
One immediate potential application of the lagged dependence structure discussed in "Models and notation" section is for prediction.In piecewise linear regression models, the prediction is fairly straightforward conditional on the estimated breakpoints.Basically we make use of the simple linear regression model for the segment that the new observations of the predictor fall into.
In the local polynomial regression setting, we consider the Direct-recursive hybrid multi-step forecast 21 and can be implemented in the function forecast() in the R package tvReg.Here is the outline of how the prediction can be done.
Step 1. Apply the local polynomial regression method to the data until time point n.
Step 2. Predict Y n+1 by using the estimate of the regression coefficient β n from Step 1, that is, Ŷn+1 = β′ n x n+1 .
Step 3. Predict Y n+2 by treating Ŷn+1 as if it were the actual observation at time n + 1 and implementing the local polynomial regression method to the augmented data (x i , y i ), i = 1, . . ., n + 1 where y n+1 = Ŷn+1 .Then we repeat Steps 2 and 3 until L future predictions is done, where L is the lag discussed in the previous section.
The potential problem with this prediction strategy is that it uses the predicted values as if they were the real observations.If the regression coefficients change very slowly over a short prediction time window, such a prediction strategy may not be a big problem.As shown in all the figures except Fig. 2, local constant regression models seem to be the winner for the out-of-sample daily counts prediction.It is worth noting that the propagated error in using the predicted values may lead to unreliable results.As shown in Fig. 5, the predicted daily deaths based on local linear method tends to deviate more from the reported counts near the end of the out-of-sample prediction window.
In the following, we present bootstrap methods for calculating point-wise and simultaneous confidence bands for out-of-sample predictions in time-varying coefficient regression models.The assumption is i.i.d.random error terms in time-varying coefficient regression models.where D = (z 1 , z 2 , . . ., z n , X 1 , X 2 , . . ., X n ) .Please note that z i = i/n, Y (z i ) = Y i , and X(z i ) = X i .Using a similar rational as that in 22 for constructing confidence intervals for regression coefficients, we implement the following steps to construct the confidence intervals for the predicted values.

Point-wise confidence bands for out-of-sample predictions. The
Step 1.For available data (say up to time n), we fit a time-varying regression model and produce the fitted values ŷi and residuals e i = y i − ŷi , i = 1, . . ., n.
Step 2. Generate synthetic data y * i = ŷi + e * i , where e * i = η i ẽi , and ẽi = e i 1 n n i=1 e i , that is, the centred resid- uals.Re-fit the time-varying regression model based on the synthetic data and use the built-in R function forecast() to predict the future L observations, denoted by ŷ * n+1 , . . ., ŷ * n+L , or equivalently, ŷ * ( Step 3. Repeat Step 2 for B times and obtain B bootstrap predicted values for the future L observations.For , the estimate of sd( Ŷ (u)|D) is the sample standard deviation of bootstrap samples {ŷ * (b) (u) : b = 1, . . ., B} and is denoted by sd * ( Ŷ (u)).
Step 4. For each b = 1, . . ., B , we calculate Simultaneous confidence bands for out-of-sample predictions.Since the point-wise confidence bands only provide interval estimates for each given future time point, we here discuss simultaneous confidence bands that allow us to infer all future time points simultaneously.As such, one can make inference about the trend of future predictions based on such confidence bands.Following the bootstrap methods proposed by 23 , we construct the simultaneous confidence bands as outlined below.Let where D = (u 1 , . . ., u n , X(u 1 ), . . ., X(u n )) with u i = i n .
The 100(1 − α)% simultaneous confidence band for {E(Y (u)|D)} for u ∈ [ The last two terms after the ± can be estimated from the bootstrap methods.
Step 1. Fit a time-varying coefficient regression model.Denote the predicted values by ŷ(u), u ∈ [ The predicted values can be directly calculated by using the built-in R function forecast() in the R package tvReg.Step 2. For each i = 1, . . ., n , generate a bootstrap sample where ẽi = e i − 1 n n i=1 e i , that is, centred residuals; η i i.i.d.
∼ N(0, 1). Step .  predictions.Local linear regression tends to be more sensitive to aberrant data points (such as negative values of some daily new deaths due to retrospective re-assessment), as suggested by Fig. 4.Moreover, local linear regression did not work for some provinces (e.g., Saskatchewan) due to some singular fits.It is likely related to the sparsity in the data due to rareness of deaths in such provinces.It is also worth noting that the starting values for the breakpoints affects the predictive performance of the piecewise linear models.In Fig. 3, the piecewise linear regression performed the worst.Because we did not set up the last breakpoint properly; the segmented package

10 Figure 1 .
Figure 1.Scatter plots for Ontario case counts vs death counts (daily counts in the top panel vs cumulative counts in the bottom panel) between Oct 31, 2021 and February 10, 2022.
2, . . ., M. Vol.:(0123456789) Scientific Reports | (2023) 13:14687 | https://doi.org/10.1038/s41598-023-41551-1www.nature.com/scientificreports/Data analysis Provincial data in Canada.We use the publicly available data resource maintained by the COVID-19 Canada Open Data Working Group 24 .We consider the time window near to the first Omicron wave for provinces in Canada.Please note that the time window is different for different provinces.For each of Figs. 2, 3, 4, we plotted out-of-sample predictions for daily new deaths using local polynomial and piecewise linear regression models.The actual reported deaths were overlayed in each plot so the prediction accuracy can be easily visualized.The general impression is that local constant regression and piecewise linear regression produced better daily Out-of-sample Prediction for daily deaths in Japan based on the input data from January 10, 2022 to December 31, 2022.Mean Squared Prediction Errors (MSPE) for the out-of-sample predictions and selected lags (in brackets) are listed as follows.Local Constant: 46,277 (5 days), Local Linear: 23,073 (6 days), Piecewise Linear: 34,834 (5 days), DEM: 29,644 (6 days).
Figure 6.Out-of-sample Prediction for daily deaths in South Korea based on the input data from January 30, 2022 to December 31, 2022.The Mean Squared Prediction Errors (MSPE) for the out-of-sample predictions and selected lags (in brackets) are listed as follows.Local Constant: 142 (6 days), Local Linear: 997 (9 days), Piecewise Linear: 175 (20 days), DEM: 258 (6 days).