Some developments on seasonal INAR processes with application to influenza data

Influenza epidemic data are seasonal in nature. Zero-inflation, zero-deflation, overdispersion, and underdispersion are frequently seen in such number of cases of disease (count) data. To explain these counts’ features, this paper introduces a flexible model for nonnegative integer-valued time series with a seasonal autoregressive structure. Some probabilistic properties of the model are discussed for general seasonal INAR(p) model and three estimation methods are used to estimate the model parameters for its special case seasonal INAR(1) model. The performance of the estimation procedures has been studied using simulation. The proposed model is applied to analyze weekly influenza data from the Breisgau- Hochschwarzwald county of Baden–Württemberg state, Germany. The empirical findings show that the suggested model performs better than existing models.

Influenza epidemic data are seasonal in nature.Zero-inflation, zero-deflation, overdispersion, and underdispersion are frequently seen in such number of cases of disease (count) data.To explain these counts' features, this paper introduces a flexible model for nonnegative integer-valued time series with a seasonal autoregressive structure.Some probabilistic properties of the model are discussed for general seasonal INAR(p) model and three estimation methods are used to estimate the model parameters for its special case seasonal INAR(1) model.The performance of the estimation procedures has been studied using simulation.The proposed model is applied to analyze weekly influenza data from the Breisgau-Hochschwarzwald county of Baden-Württemberg state, Germany.The empirical findings show that the suggested model performs better than existing models.
Many epidemic diseases follow a seasonal pattern and so is the Influenza.Influenza shows a seasonal pattern 1 in temperate regions.The weekly/monthly/yearly epidemic data form a time series of counts.Analyzing and forecasting time series of counts remain a useful technique of getting information needed for successful policy making and management of epidemics.Before modeling a count time series, an analyst should be familiar with specific characteristics of the series in order to select a good model for the data.Dispersion characteristic, stationarity, autocorrelation structure and seasonality remain factors that help a researcher to determine the model that should be fitted to a count time series.Seasonality refers to the tendency of a time series (including low count time series) to exhibit patterns or movements that are completed within a year and repeat themselves every year.Seasonality deals with regular and predictable patterns in a time series.For example, a weekly time series of low counts is said to be seasonal with seasonal period s = 52 if its associated autocorrelation function has spikes at multiples of lag 52.
New INAR processes have been constructed and used to model stationary nonseasonal count time series from a variety of fields since the independent works of Al-Osh and Alzaid 2 , and McKenzie 3 , which paved the way for research on the applications and construction of nonnegative integer-valued autoregressive (INAR) processes.Several models were constructed for stationary, overdispersed nonseasonal series based on the binomial thinning operator and first-order autoregressive correlation structure 4,5 .Authors who used other thinning operators to build models for the count time series include Liu and Zhu 6 , and Ristić et al. 7 .There is also evidence of stationary, first-order integer-valued autoregressive processes, which were constructed using innovation distributions, that can exhibit any of the underdispersion, equidispersion and overdispersion phenomena [8][9][10] .
An INAR (1) process with an innovation distribution that is basically a standard discrete or compound Poisson distribution has been described as inappropriate for modeling any of underdispersed and overdispersed series with either inflation or a deflation of zeros 11 .Zero-modified distributions are useful in modeling count data with an inflation or deflation of zeros.The zero-modified versions of some discrete distributions have been introduced and used as innovation distributions to construct INAR(1) models by some authors.For example, Barreto-Souza 11 constructed a zero-modified geometric INAR(1) process using the negative binomial thinning operator and illustrated empirically the applicability of the model using two practical data sets.In particular, the model may be used to describe an underdispersed or overdispersed series with deflationary or inflationary zeros.Sharafi et al. 12 suggested an INAR(1) process using the zero-modified Poisson-Lindley (ZMPL) distributed

The general seasonal INAR model
In response to the comment made by one of the anonymous reviewers, and considering that the literature lacks a general seasonal INAR model, we introduce the general seasonal INAR model in this section.Let " * " be the nega- tive binomial thinning (NBT) operator and * W = W i=1 Z i , where Z i refers to the sequence of independent and identically distributed (i.i.d.) geometric variables having the probability mass function (PMF) An elaborate discussion of the properties of the NBT operator is available in Ristić et al. 7 .The general seasonal INAR model of order P and seasonal period s (INAR(P) s model) is defined by In (1), the counting series are mutually independent, l ∈ [0, 1], l = 1, 2, . . ., P and {ν t } are independent of all the counting series.
Theorem 2.1 contains the condition for the existence of a unique stationary INAR(P) s model.
and * Y has the geometric 1+ distribution.The remaining part of the proof can be deduced from the proof of Theorem 2.1 in Jin-Guan and Yuan 23 .
Taking expectation of both sides of (1) and assuming stationarity, the mean of the INAR(P) s model becomes (1) www.nature.com/scientificreports/Ne x t , w e s t u d y t h e aut o c or re l at i on s t r u c t u re of t h e I NA R(P) s m o d e l .L e t W t = W t , W t−s , W t−2s , . . ., W t−(P+1)s ′ .
Then (1) can be written as where The vector of the autocovariances is defined by

It follows that the autocovariance function at lag k is
The associated autocorrelation coefficient has the form: In view of (6), it is obvious that the model under consideration and the Gaussian AR model of order P and seasonal period 's' have similar autocorrelation structures.Hence, the identification of the model can be done using the autocorrelation function (ACF) and partial autocorrelation function (PACF).Theoretically speaking, the ACF of the INAR(P) s model has nonzero values at the seasonal lags.The nonzero values at the seasonal lags tail off while the related partial ACF cuts off after lag P. When we have a stationary seasonal time series with sample ACF and sample partial ACF that have patterns that are akin to the theoretical ACF and sample partial ACF, we are expected to fit the INAR(P) s model to the data.In situations where, the sample ACF and sample partial ACF do not look exactly like their theoretical counterparts, a variety of models can be fitted to the given data and the best model is determined using model selection criteria 24 .
The parameters l , l = 1, 2, . . ., P of the model can be estimated using any of the Yule-Walker and conditional least squares approaches.For k = s, 2s, . . ., Ps , we derive P equations from (6).These equations are written in matrix form as where The Yule-Walker estimator ˆ of is given as For example, for a second order monthly count time series, s = 12 and the estimates ˆ 1 and ˆ 1 based on sample autocorrelation coefficients at lags 12 and 24 are given as. (2) . . .
www.nature.com/scientificreports/In (7), Furthermore, the Yule-Walker estimate of µ ν is The corresponding estimate of σ 2 ν is Notably, and The CLS estimator ˆ of is obtained by minimizing Q( ) .In this case, we solve the following equations simultaneously: Certain properties of CLS estimators are well-known (see, Klimko and Nelson 25 ).The minimum mean squared error (MMSE) predictor of W n+1 is In general, the MMSE of W n+m becomes We have carried out a simulation study to assess the performance of the Yule-Walker (YW) and conditional least squares(CLS) estimates.For the sake of brevity we consider INAR(2) S model and its parameter estimation.We simulated 1000 series from the proposed model for various parameter combinations and for various sample sizes as given in Table 1.We have computed the mean estimates and their related MSEs based on 1000 simulations.In this study, we estimated two autoregressive parameters ( 1 and 2 ) and the mean of (8) Vol.:(0123456789) www.nature.com/scientificreports/ the innovation distribution ( µ ν ).For simulation from an innovation distribution, we use the parameter com- binations: α = (4, 1, 2, 0.5) and δ = (−2, 0.5, −1, 0.7), which give the mean of the innovation distribution as µ ν = (0.9, 0.75, 1.3333, 1.00) .The estimation of AR coefficient and all the parameters of innovation distribution have been studied in detail for INAR(1) S model in "Construction of the proposed seasonal INAR(1) process and its properties".From Table 1, it can be seen that the estimates perform well, and their mean squared errors (MSEs) decrease as the sample size increases.It also can be seen that the CLS estimation performs much better than the YW estimation in terms of the MSE.As the joint distribution of 1 * W t−s and 2 * W t−2s is not tractable, we can not find the conditional distribution and hence maximum likelihood estimation cannot be attempted, as it has been done for INAR(1) S model in "Construction of the proposed seasonal INAR(1) process and its properties".

Construction of the proposed seasonal INAR(1) process and its properties
In this section, the definition of the first-order seasonal INAR process based on zero-modified Poisson-Lindley (ZMPL) innovations 12 and NBT operation is given, with an extensive study.Using the NBT operator, we define the proposed seasonal INAR(1) process as follows.
Definition 3.1 Let {W t } be a discrete-time nonnegative integer-valued process.Then, the process is a seasonal INAR(1) process with zero-modified Poisson-Lindley innovations if where 's' represents the seasonal period such that s ∈ N + , N + is the set of positive integers, {W t } stands for a sequence of identically distributed nonnegative random variables, {ν t } is an innovation sequence of i.i.d.ZMPL (12) In the sequel, the notation INAR(1) s ZMPL process is used to represent the model defined in Eq. (12).A random variable X follows the ZMPL distribution with parameters α and δ if its PMF is where α > 0 and − α 2 (α+2) α 2 +3α+1 ≤ δ ≤ 1.Thus, the unconditional mean and variance of the random variable {ν t } are, respectively, given by and Following 26 , the distribution of the random variable (RV) ν t is overdispersed if δ ∈ [0, 1) , while it is underd- ispersed when α > √ 2 and δ ∈ −α 2 (α+2) α 2 +3α+1 , 0 .The INAR(1) s ZMPL process comprises 's' mutually independent INAR(1) processes with ZMPL innovations, which are constructed using the NBT operator such that they have the autoregressive parameter and an innovation distribution.The simulated sample paths, sample ACFs and sample PACFs of the process for various parameter combinations are given in the Fig. 1.The seasonality in the ( 13) t } and the counting processes that define the requisite thinning operators.Additionally, the process defined in Eq. ( 12) is an s-step Markov chain, implying that The conditional mean of the process is Hence, the unconditional expectation of the process is Assuming stationarity and using Eq. ( 17), the unconditional mean is found to be The conditional variance satisfies the equation: The unconditional variance is obtained as Hence, it follows that Suppose that W (k+h)s+j and W ks+i satisfy the process in Eq. ( 12), h ∈ N + and k ∈ N 0 , i, j = 1, 2, . . ., s .In order to gain insight into conditional moments based on the seasonal period as well as the autocorrelation function of the process, we proceed to give the following propositions: 1− , which is the unconditional expectation of W t .If i = j , we have k+h and W (i) k are independent.Thus, the conditional variance of W (k+h)s+j given W ks+i is , the unconditional variance of W t .
where Notably, Var W (j) k W (j) k = 0 .Also, after some algebra with the earlier discussions, we can find that, In the light of the above, we have Hence,

Proposition 3.3 The covariance of W (k+h)s+j and W ks+i has the form
Proof Once i = j , the variables W (k+h)s+j and W ks+i are uncorrelated.In this case, Covar W (k+h)s+j , W ks+i = 0.
Dividing the autocovariance function by Var(W t ) , the autocorrelation function (ACF) is obtained as , if i = j.
On the other side,

Techniques for estimating model parameters
Three popular and widely used methods of point estimation for the parameters of the INAR(1) S ZMPL are adopted in this section.These are, the Yule-Walker, conditional least squares and conditional maximum likelihood methods.In each of the methods, we assume a realization of the seasonal count time series.

Yule-Walker approach
To estimate the parameters , δ and α of the INAR(1) s ZMPL process, we form three equations by equating ρ(s) , E(W t ) and Var(W t ) to γ (s) γ (0) , W and γ (0) , respectively.Notably, and n are, respectively, the sample mean, sample autocorrelation function and length of the time series.If ˆ YW , δYW and αYW are the Yule-Walker (YW) estimators of , δ and α in that order, then the following equations are needed: where From Eq. ( 20), we obtain Using Eqs. ( 21) and ( 22), the following cubic equation is obtained: where R packages for solving polynomial equations, particularly the cubic equation, abound.In "Yule-Walker approach" of this study, polyroot R function is used to obtain the roots of Eq. ( 23) after the coefficients have been calculated from the given time series.Though the equation has three roots, only the root that gives an acceptable value of δYW will be used for further computations.

Conditional maximum likelihood estimation
In view of the s-step Markovian property of the seasonal INAR(1) s ZMPL process, we define the transition probabilities as Using this we can write the detailed transition probabilities as Let ξ CML be the conditional maximum likelihood(CML) estimator of ξ .To find ξ CML , maximize the condi- tional log-likelihood function below: Obviously, ξ CML has no closed form.Furthermore, the required estimators can only be found through a numerical technique.(29)

Simulation study for the INAR(1) s ZMPL
In this section, we have carried out a simulation study to assess the performance of the proposed estimation methods for INAR(1) s ZMPL process, paying attention to the cases of zero-inflation, overdispersion, zero-deflation and underdispersion baesd on the INAR(1) s ZMPL process.We have simulated 1000 series of the sample sizes 100, 300, 500 and 1000 for various parameter combinations and for seasonal period s = 52 .We have computed the mean and the mean squared errors (MSEs) of the estimates over all the 1000 simulations.All the quantities are given in the Table 2, the first row for each parameter and sample size (n) combination represents the mean estimate and second row represents the MSE.From the table, it can be seen that the MSE is decreasing with the increase in sample size, indicating the consistency of the estimates.The numerical optimization of the likelihood function to obtain CML estimates is done using the function 'constrOptim' in R software.In our setup, constraints are imposed on all the parameters.In particular, the upper limit of δ is 1 while its lower limit is a negative quantity that is a function of α .These constraints are considered while numerical optimization using 'constrOptim' .In all, CML estimates correspond to minimum MSEs compared to the YW and CLS estimates.

Model-based forecasting
Time series models are usually constructed with a view to forecasting future values of time series data.In the case of INAR(1) s ZMPL process, we handle the problem of forecasting a future observation W n+h , h ∈ N using the information F n available upto time n.From Eq. ( 12), we deduce using induction and properties of the NBT that where d = implies equality in distribution, q = ⌈ h s ⌉ is the integer part of h s .That is ⌈y⌉ = min n ∈ N + y ≤ n .We round off this section with the following proposition.

Proposition 3.4 Consider the INAR(1)
s ZMPL process in Eq. (12).Then It is not difficult to prove Proposition 3.4 using knowledge of the proofs of Propositions 3.1 to 3.2 and Eq.(30).Therefore, we omit the proof of Proposition 3.4.The h-step ahead forecast can be written as

Model comparison with seasonal ARIMA
In this section, we have compared the forecast performance of the proposed model and its counterpart SARIMA(0, 0, 0)(1, 0, 0) s model.Here, we have simulated 1000 series each of size 500 from the INAR(1) s ZMPL model with various parameter combinations given in Table 3 and the seasonal period s = 52 .For each series, the first 495 observations are used for the estimation of the parameters and last five observations are used to check the forecast accuracy.We have used three forecast accuracy measures to assess the forecast performance of the model.These measures include prediction root mean squared error (PRMSE), prediction mean absolute error (PMAE) and percentage of true prediction (PTP).Notably, where X(k) (t+i) = mean(X t+i |X t−k+i ) is the k-step ahead conditional mean of the fitted process, (30) Ŵn+h = ˆ q Ŵn+h−qs − μW + μν .
where X(k) (t+i) = median(X t+i |X t−k+i ) is the k-step ahead conditional median of the fitted process, and where I(•) denotes the indicator function.Here, X (t+i) is the actual ith observation at time point (t + i) and X(k) is the k-step ahead forecast value at the same time point.
We have obtained the conditional mean of the process as traditional forecast while the median and mode from one-step ahead forecast distribution serve as coherent forecasts.For the computation of PMAE in the SARIMA model, we have used mean forecast while in the INAR(1) s ZMPL model, we have used the median of the one-step ahead forecast distribution.In the SARIMA model, we can have only mean as the forecast, for computation of percentage of true prediction (PTP) based on mean we have considered rounded mean for both models.We have computed median and mode from the forecast distribution for the computation of PTP-Med and PTP-Mode.In INAR literature, the median and mode of the forecast distribution are called the coherent forecasts.From Table 3, it can be seen that in terms of PRMSE both models perform equally well, as their forecasts are the same.However, the PMAE for the SARIMA model is higher than that of the PMAE for the proposed model, showing the better performance of the proposed model than the SARIMA model.This may be because median is used in the proposed model and mean is used in the SARIMA model for the computation of the PMAE.But when it comes to the coherent forecast performance, the proposed model outperforms the SARIMA model, which can be seen from the PTP for median and mode.When data are overdispersed, as it is the case in the top panel in the table, the coherent forecast performs five times better than the traditional forecast.Both models perform equally well when the data are underdispersed.

Application of INAR(1) s ZMPL model
We have analyzed weekly Influenza data from the Breisgau-Hochschwarzwald county of Baden-Württemberg state, Germany for the years 2001 to 2008 (https:// survs tat.rki.de).The data have mean 0.4735 and variance 2.5969.Clearly the data are overdispersed.From the relative frequency plot in Fig. 2, we can see that the number of zeros in the data is excessively high.Hence, this series can be modeled using the proposed zero-modified model.The seasonality of the data can be seen from the ACF and PACF plot in Fig. 3, as it is characterized by the sinusoidal autocorrelation pattern.Also, the data are seasonal because the corresponding ACF plot has significant peaks at multiples of 54.From Fig. 4, it can be seen that the AIC and BIC values are small for the period s = 54 .Hence the seasonal period of the series ( s = 54 ) is confirmed.4, it can be seen that the proposed model is the best model for the data.We have obtained point forecast using the INAR(1) s ZMPL model for the given data.The data has 416 observations.We have used 411 observations (training data) for estimating the parameters and last five observations are predicted using the fitted model.We have used the conditional 'h' step ahead mean as the forecast function.The parameter estimates (CML) based on training data are ˆ = 0.2497 , α = 0.5289 , δ = 0.8619 .The mean forecast Xt+h and the rounded mean forecast [ Xt+h ] are given in the Table 5.From this table it can be seen that the model predicts the future observations with good accuracy.

Conclusion
In this paper, we have introduced a seasonal, nonnegative, integer-valued autoregressive model, with multiple features, using the negative binomial thinning operator and zero-modified Poisson-Lindley distributed innovations.General seasonal INAR model of order P has been discussed in the paper and it has been illustrated in detail for order one.This model, denoted by INAR(1) s ZMPL model, is the first of its kind for modeling zerodeflated or zero-inflated seasonal time series of counts.Some theoretical results are determined for the model.Specifically, means, variances and autocorrelation function are obtained.In estimating the parameters of the model, the Yule-Walker, conditional least squares and conditional maximum likelihood approaches are given due consideration.The simulation results obtained in this study demonstrate the superiority of the conditional maximum likelihood method over the other two point estimation methods of estimating the model parameters.
A real-life application of the model was investigated by analyzing a zero-inflated overdispersed seasonal count time series.The model fit was compared to the fits of three competing models.In the final analysis, the proposed model outperforms the other models.

Figure 2 .Figure 3 .
Figure 2. Relative frequency plot for the influenza data.

Figure 4 .
Figure 4. Akaike information criterion (AIC) and Bayesian information criterion (BIC) for influenza data for various seasonal periods 's' .Green dashed line is for the BIC and Blue solid line is for AIC.

Table 1 .
Parameter estimates and their mean squared errors for INAR(2) S model.

Table 2 .
Parameter estimates and their mean squared errors.

Table 5 .
Point forecasts for the influenza data.