Abstract
We are concerned with the issue of detecting changes and their signs from a data stream. For example, when given a time series of COVID-19 cases in a region, we may raise early warning signals of an epidemic by detecting signs of changes in the data. We propose a novel methodology to address this issue. The key idea is to employ a new information-theoretic notion, which we call the differential minimum description length change statistics (DMDL), for measuring the scores of change signs. We first give a fundamental theory for DMDL. We then demonstrate its effectiveness using synthetic datasets. We apply it to detecting early warning signals of the COVID-19 epidemic using time series of the cases for individual countries. We empirically demonstrate that DMDL is able to raise early warning signals of events such as a significant increase/decrease of cases. Remarkably, for about \(64\%\) of the events of significant increase of cases in the studied countries, our method can detect warning signals nearly six days on average before the events, buying considerable time for making responses. We further relate the warning signals to the dynamics of the basic reproduction number R0 and the timing of social distancing. The results show that our method is a promising approach to epidemic analysis from a data science viewpoint.
Introduction
Motivation
We address the issue of detecting changes and their signs in a data stream. For example, when given a time series of the number of COVID-19 cases in a region, we may expect to warn of the beginning of an epidemic by detecting changes and their signs.
Although change detection^{1,2,3} is a classical issue, it has remained open how signs of changes can be found. In principle, the degree of change at a given time point has been evaluated in terms of a discrepancy measure (e.g., the Kullback–Leibler (KL) divergence) between the probability distributions of data before and after that time point^{1,4}. It is reasonable to think that the differentials of the KL divergence may be related to signs of change, because the first differential of the KL divergence is the velocity of change while its second differential is the acceleration of change.
The problem here is that in real cases, the KL divergence and its differentials cannot be calculated exactly since the true distribution is unknown in advance. The question is how we can estimate the discrepancy measure and its differentials from data when the parameter values are unknown.
The purpose of this paper is to answer the above question from an information-theoretic viewpoint based on the minimum description length (MDL) principle^{5} (see also studies^{6,7} for its recent advances). The MDL principle gives a strategy for evaluating the goodness of a probabilistic model in terms of the codelength required for encoding the data, where a shorter codelength indicates a better model. We apply this principle to change detection, where a shorter codelength indicates a more significant change. Following this idea, we introduce the notion of the differential MDL change statistics (DMDL) as a measure of change signs. We theoretically and empirically justify this notion, and then apply it to the COVID-19 pandemic analysis using open datasets.
Related work
There is plenty of work on change detection^{1,2,3,4,8,9,10,11}. In much of it, the degree of change has been related to a discrepancy measure between the two distributions before and after a time point, such as the likelihood ratio or the KL divergence. However, there is no work relating differential information, such as the velocity of the change, to change sign detection.
Most previous studies in change detection are concerned with detecting abrupt changes^{3}. In the scenario of concept drift^{12}, the issues of detecting various types of changes, including incremental changes and gradual changes, have been addressed. How to find signs of changes has been addressed in the scenarios of volatility shift detection^{13}, gradual change detection^{14} and clustering change detection^{15,16,17}. However, the notion of differential information has never been related to change sign detection.
The MDL change statistics has been proposed as a test statistic in hypothesis testing for change detection^{14,18}. It is defined as the difference between the total codelength required for encoding data for the non-change case and that for the change case at a specific time point t. A number of data-compression-based change statistics similar to it have also been proposed in data mining^{19,20,21}. However, no differential variation of the compression-based change statistics has been proposed.
Significance of this paper
The significance of this paper is summarized as follows:

(1)
Proposal of DMDL and its use for change sign detection. We introduce a novel notion of DMDL as an approximation of the KL divergence of change and its differentials. We then propose practical algorithms for online detection of change signs on the basis of DMDL.

(2)
Theoretical and empirical justification of DMDL. We theoretically justify DMDL in the hypothesis testing of change detection. We consider the hypothesis tests which are equivalent to DMDL scoring. We derive upper bounds on the error probabilities for these tests to show that they converge exponentially to zero as the sample size increases. The bounds on the error probabilities are used to determine a threshold for raising an alarm with DMDL. We also empirically justify DMDL using synthetic datasets. We demonstrate that DMDL outperforms existing change detection methods in terms of AUC for detecting the starting point of a gradual change.

(3)
Applications to COVID-19 pandemic analysis. On the basis of the theoretical and empirical advantages of DMDL, we apply it to the COVID-19 pandemic analysis. We are mainly concerned with how early we are able to detect signs of outbreaks or of the contraction of the epidemic for individual countries. The results showed that for about \(64\%\) of outbreaks in the studied countries, our method can detect signs about 6 days on average before the outbreaks. Considering the rapid spread, 6 days can buy considerable time for making responses, e.g., implementing control measures^{22,23,24}. The time gained is especially precious given the considerably long incubation period of COVID-19^{25,26,27}. Moreover, we analyze relations between the change detection results and social distancing events. One finding is that, for individual countries, an average of about four changes/change signs detected before the implementation of social distancing correlates with a significant decline from the peak of daily new cases by the end of April 2020.
The change analysis is a pure data science methodology, which detects changes using only statistical models, without differential equations describing the time evolution. Meanwhile, the SIR (Susceptible-Infected-Recovered) model^{28} is a typical simulation method which predicts the time evolution of the infected population with physics-model-based differential equations. Although the fitness of the SIR model or its variants to COVID-19 data was argued^{29,30}, the complicated situation of COVID-19 due to virus mutations^{31,32,33}, international interactions, highly variable responses from authorities^{34}, environmental effects^{35,36}, etc. means that no simulation model is necessarily perfect. Therefore, the basic reproduction number R0^{37} (a term in epidemiology, representing the average number of people who will contract a contagious disease from one person with that disease) estimated from the SIR model may not be precise. We empirically demonstrate that, as a byproduct, the dynamics of R0 can be monitored by our methodology, which only requires the information of daily new cases. The data science approach may then form a complementary relation with the simulation approach and give new insights into epidemic analysis. The effect of social distancing in Germany has been evaluated using the framework of change point analysis together with the SIR model^{38}. However, there is no work on machine learning approaches to detecting signs of outbreaks for COVID-19.
The software for the experiments is available at https://github.com/IbarakikenYukishi/differentialmdlchangestatistics. An online detection system is available at https://ibarakikenyukishi.github.io/dmdlhtml/index.html
The rest of this paper is organized as follows: “Methods” introduces DMDL and gives a theory of its use in the context of change sign detection. “Result I: experiments with synthetic data” gives empirical justification of DMDL using synthetic datasets. “Result II: applications to COVID-19 pandemic analysis” gives applications of DMDL to the COVID-19 pandemic analysis. “Conclusion” gives concluding remarks.
Methods
Definitions of changes and their signs
Let \({{\mathcal {X}}}\) be a domain, which is either discrete or continuous. Hereafter we assume that \({{\mathcal {X}}}\) is discrete without loss of generality. For a random variable \({{\varvec{x}}}\in {{\mathcal {X}}}\), let \(p({{\varvec{x}}};\theta )=p_{_{\theta }}({{\varvec{x}}})\) be the probability mass function (or the probability density function in the continuous case) specified by a parameter \(\theta\). Suppose that \(\theta\) changes over time. In the case when \(\theta\) gradually changes over time, we define the sign of change as the starting point of that change.
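Let us consider discrete time t. Let \(\theta _{t}\) be the value of the parameter \(\theta\) at time t. Let \(D(p\Vert q)\) denote the Kullback–Leibler (KL) divergence between two probability mass functions p and q:
\[D(p\Vert q)=\sum _{{{\varvec{x}}}\in {{\mathcal {X}}}}p({{\varvec{x}}})\log \frac{p({{\varvec{x}}})}{q({{\varvec{x}}})}.\]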
We define the 0th, 1st, 2nd change degrees at time t as
When the parameter sequence \(\{\theta _{t}: t\in {{\mathbb {Z}}}\}\) is known, we can define the degree of changes at any given time point. We can think of \(\Phi _{t}^{(0)}\) as the degree of change of the parameter value itself at time t. We can think of \(\Phi _{t}^{(1)}, \Phi _{t}^{(2)}\) as the velocity of change and the acceleration of change of the parameter at time t, respectively. All of them quantify the signs of change. However, the parameter values are not known in advance for general cases. The problem is how we can define the degree of changes for such cases.
Differential MDL change statistics
In the case where the true parameter values are unknown, the MDL change statistics has been proposed to measure the change degree^{14,18} from a given data sequence. Below we denote \(x_{a},\dots , x_{b}=x_{a}^{b}\). In the case of \(a=1\), we may drop a and write it as \(x^{b}\).
When the parameter \(\theta\) is unknown, we may estimate it as \({\hat{\theta }}\) using the maximum likelihood estimation method from a given sequence \(x^{n}\). I.e., \({\hat{\theta }}= \text {argmax} _{\theta }p(x^{n};\theta ).\) Note that the maximum likelihood function \(p(x^{n};{\hat{\theta }})\) does not form a probability distribution of \(x^{n}\) because \(\sum _{x^{n}}p(x^{n};{\hat{\theta }})>1\). Thus we construct a normalized maximum likelihood (NML) distribution^{40} by
and consider the logarithmic loss for \(x^{n}\) relative to this distribution by
which we call the NML codelength, where log means the natural logarithm and \(C_{n}\) is called the parametric complexity defined as
It is known^{39} that Eq. (1) is the optimal codelength that achieves the Shtarkov’s minimax regret in the case where the parameter value is unknown. It is known^{40} that under some regularity condition for the model class, \(C_{n}\) is asymptotically expanded as follows:
where \(I(\theta )\) is the Fisher information matrix defined by \(I(\theta )=\lim _{n\rightarrow \infty }(1/n)E_{\theta }[-\partial ^{2}\log p(X^{n}; \theta )/\partial \theta \partial \theta ^{\top }]\), d is the dimensionality of \(\theta\), and \(\lim _{n\rightarrow \infty }o(1)=0\).
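As an illustration of the NML codelength and the parametric complexity, the following is a minimal sketch for the Bernoulli model, which is not a model used in this paper and is chosen only because its complexity can be summed exactly. It assumes the standard definitions \(C_{n}=\sum _{x^{n}}p(x^{n};{\hat{\theta }}(x^{n}))\) and \(L_{_{\rm{NML}}}(x^{n})=-\log p(x^{n};{\hat{\theta }})+\log C_{n}\); the function names are hypothetical. The last function evaluates the leading terms of the asymptotic expansion with \(d=1\), using the fact that \(\int _{0}^{1}\sqrt{I(\theta )}\,d\theta =\pi\) for the Bernoulli model.

```python
import math


def bernoulli_max_likelihood(k: int, n: int) -> float:
    """Maximized likelihood p(x^n; theta_hat) of a binary sequence with k ones."""
    if k == 0 or k == n:
        return 1.0
    p = k / n  # maximum likelihood estimate of the success probability
    return p ** k * (1 - p) ** (n - k)


def parametric_complexity(n: int) -> float:
    """C_n: sum of the maximized likelihood over all 2^n binary sequences.

    Sequences with the same number of ones share the same maximized
    likelihood, so the sum is grouped by k with a binomial coefficient.
    """
    return sum(math.comb(n, k) * bernoulli_max_likelihood(k, n)
               for k in range(n + 1))


def nml_codelength(x: list) -> float:
    """NML codelength in nats: -log p(x^n; theta_hat) + log C_n."""
    n, k = len(x), sum(x)
    return -math.log(bernoulli_max_likelihood(k, n)) + math.log(parametric_complexity(n))


def asymptotic_log_complexity(n: int) -> float:
    """(d/2) log(n / 2 pi) + log(integral of sqrt(I)) with d = 1, o(1) dropped."""
    return 0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi)


if __name__ == "__main__":
    for n in (10, 100):
        print(n, math.log(parametric_complexity(n)), asymptotic_log_complexity(n))
```

Because \(C_{n}>1\) for every \(n\ge 1\), the maximized likelihood alone is not a probability distribution, which is why the normalization is needed.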
According to the study^{14}, the MDL change statistics at time point t is defined as follows:
The MDL change statistics is the difference between the NML codelength of a given data sequence for non-change and that for change at time t. It is a generalization of the likelihood ratio test^{1,41}.
Therefore, by extending the change degrees \(\Phi _{t}^{(0)}, \Phi _{t}^{(1)}, \Phi _{t}^{(2)},\dots\) to the cases where the true parameters are unknown, we may consider the following statistics:
\(\Psi _{t}^{(\alpha )}\) corresponds to \(\Phi _{t}^{(\alpha )}\). We call \(\Psi _{t}^{(\alpha )}\) the \(\alpha\)th differential MDL change statistics, which we abbreviate as the \(\alpha\)th DMDL (\(\alpha =0,1,2,\dots )\). The 0th DMDL is the original MDL change statistics as in the study^{14}.
For example, let us consider the univariate Gaussian distribution:
where \(x\in {{\mathbb {R}}}\) and \(\theta =(\mu , \sigma )\). We assume \(\mu < \mu _{\max }\) and \(\sigma _{\min }<\sigma <\sigma _{\max }\) where \(\mu _{\max }<\infty\), \(0<\sigma _{\min }, \sigma _{\max }<\infty\) are hyper parameters. The 0th DMDL at time t is calculated as
where \({\hat{\sigma }}_{0}, {\hat{\sigma }}_{1}\) and \({\hat{\sigma }}_{2}\) denote the maximum likelihood (ML) estimators of \(\sigma\) calculated for \(x_{1}^{n}, x_{1}^{t}\) and \(x_{t+1}^{n}\), respectively. \(C_n\) is the parametric complexity, which is calculated according to the study^{14}, as
The 1st and 2nd DMDL are calculated according to Eqs. (5) and (6) on the basis of Eq. (8).
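The following is a minimal sketch of how the three scores could be computed for the univariate Gaussian. It rests on several assumptions that are not taken from the equations of this paper: the 0th DMDL is read as the per-sample difference between the codelength without a change and the codelength with a change at t, the 1st and 2nd DMDL are read as its first and second finite differences in t, and the exact Gaussian parametric complexity is replaced by the leading term \((d/2)\log (n/2\pi )\) of the asymptotic expansion with \(d=2\). Dropping the constant term shifts the 0th score by a constant divided by n but leaves the 1st and 2nd scores unchanged. All function names are hypothetical.

```python
import math


def gaussian_codelength(x):
    """Approximate NML codelength of x under a Gaussian with unknown mean and variance.

    Assumption: the parametric complexity is approximated by (d/2) log(n / 2 pi)
    with d = 2; the constant term of the asymptotic expansion is dropped.
    """
    n = len(x)
    mean = sum(x) / n
    # Floor the ML variance so that constant segments do not produce log(0).
    var = max(sum((v - mean) ** 2 for v in x) / n, 1e-12)
    neg_log_lik = 0.5 * n * (math.log(2 * math.pi * var) + 1.0)
    return neg_log_lik + math.log(n / (2 * math.pi))


def psi0(x, t):
    """0th score: (1/n) * (codelength without change - codelength with a change at t)."""
    n = len(x)
    return (gaussian_codelength(x)
            - gaussian_codelength(x[:t])
            - gaussian_codelength(x[t:])) / n


def psi1(x, t):
    """1st score, assumed to be the forward difference of the 0th score."""
    return psi0(x, t + 1) - psi0(x, t)


def psi2(x, t):
    """2nd score, assumed to be the second difference of the 0th score."""
    return psi0(x, t + 1) - 2 * psi0(x, t) + psi0(x, t - 1)
```

With this approximation each segment should contain at least two points, so t is restricted to \(2\le t\le n-2\) for the 0th score and to \(3\le t\le n-3\) for the 2nd.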
Hypothesis testing for change detection
The 0th DMDL test
We give a rationale for DMDL using the framework of hypothesis testing for change detection. First, suppose that either a change point exists at time t or no change point exists. Let us consider the following hypothesis testing framework: the null hypothesis \(H_{0}\) is that there is no change point, while the alternative hypothesis \(H_{1}\) is that t is the only change point.
where \(\theta _{0},\theta _{1},\theta _{2}\ (\theta _{1}\ne \theta _{2})\) are all unknown.
With the MDL principle, the test statistic is given as follows: For an accuracy parameter \(\epsilon >0\),
where \(\Psi _{t}^{(0)}\) is the 0th DMDL as in equation (4). \(H_{1}\) is accepted if \(h_{0}(x^{n}; t, \epsilon )>0\), otherwise \(H_{0}\) is accepted. We call this test the 0th DMDL test.
We define Type I error probability as the probability that the test accepts \(H_1\) although \(H_{0}\) is true (false alarm rate) while Type II error probability as the one that the test accepts \(H_{0}\) although \(H_{1}\) is true (overlooking rate). The following theorem justifies the use of the 0th DMDL in change detection.
Theorem 2.1^{14}
Type I and II error probabilities for the 0th DMDL test are upper bounded as follows:
where \(C_{n}\) is the parametric complexity as in Eq. (2) and
d(p, q) in Eq. (12) is the Bhattacharyya distance between p and q.
This theorem shows that Type I and II error probabilities in Eqs. (10) and (11) converge to zero exponentially in n as n increases for some appropriate \(\epsilon\) when \(d(p_{_{\rm{NML}}},p_{_{\theta _{1}*\theta _{2}}})\) is large. We see that the error exponents are governed by the parametric complexity (2) of the model class. In this sense the 0th DMDL test is effective in change point detection.
The 1st DMDL test
Next we give a hypothesis testing setting equivalent to the 1st DMDL scoring. We consider the situation where a change point exists at time either t or \(t+1\). Let us consider the following hypotheses: The null hypothesis \(H_{0}\) is that the change point is t while the alternative one \(H_{1}\) is that it is \(t+1\).
where \(\theta _{0},\theta _{1},\theta _{2},\theta _{3}\ (\theta _{0}\ne \theta _{1},\ \theta _{2}\ne \theta _{3})\) are all unknown.
We consider the following test statistic: For an accuracy parameter \(\epsilon >0\),
which compares the NML codelength for \(H_{0}\) with that for \(H_{1}.\) We accept \(H_{1}\) if \(h_{1}(x^{n}; t, \epsilon )>0\), otherwise we accept \(H_{0}\). We call this test the 1st DMDL test. We easily see
where \(\Psi _{t}^{(1)}\) is the 1st DMDL. This implies that the 1st DMDL test is equivalent to testing whether the 1st DMDL is larger than \(\epsilon\) or not. Hence this test is also equivalent to comparing the degree of change at time \(t+1\) with that at time t. Intuitively, if the degree of change increases significantly as time goes by, then \(H_{1}\) is accepted. Thus the basic performance of discrimination of the 1st DMDL can be reduced to that of the 1st DMDL test.
The following theorem shows the basic property of the 1st DMDL test.
Theorem 2.2
Type I and II error probabilities for the 1st DMDL test are upper bounded as follows:
where \(C_{n}\) is the parametric complexity as in Eq. (2), d is the Bhattacharyya distance as in Eq. (12) and
(The proof is in Sec. 1 of the supplementary information.)
This theorem shows that for some appropriate \(\epsilon\), Type I and II error probabilities in Eqs. (15) and (16) converge to zero exponentially in n as n increases, where the error exponents are related to the parametric complexities for the hypotheses as well as the Bhattacharyya distance between the null and alternative hypotheses. In this sense the 1st DMDL test is effective. Type I error probability in Eq. (15) will be used for determining a threshold of the alarm.
The 2nd DMDL test
Next we consider a hypothesis testing setting equivalent to the 2nd DMDL scoring. Suppose that change points exist either at time t or at times \(t-1\) and \(t+1\).
where \(\theta _{0},\theta _{1},\theta _{2},\theta _{3},\theta _{4}\ (\theta _{0}\ne \theta _{1},\ \theta _{2}\ne \theta _{3}\ne \theta _{4})\) are all unknown. \(H_{0}\) is the hypothesis that a change happens at time t while \(H_{1}\) is the hypothesis that two changes happen at times \(t-1\) and \(t+1\). In \(H_{0}\), t is a single change point while in \(H_{1},\) t is a transition point between two close change points. Thus this hypothesis testing evaluates whether time t is a change point or a transition point of close changes.
The test statistic is: For an accuracy parameter \(\epsilon >0\),
We accept \(H_{1}\) if \(h_{2}(x^{n}; t, \epsilon )>0\), otherwise we accept \(H_{0}\). We call this test the 2nd DMDL test.
Under the assumption \((1/n)L_{_{\rm{NML}}}(x^{t+1}_{1})\approx (1/n)( L_{_{\rm{NML}}}(x_{1}^{t-1})+ L_{_{\rm{NML}}}(x_{t}x_{t+1}))\) and \((1/n)L_{_{\rm{NML}}}(x^{n}_{t})\approx (1/n)(L_{_{\rm{NML}}}(x_{t}x_{t+1})+L_{_{\rm{NML}}}(x_{t+2}^{n})),\) we have
This implies that the 2nd DMDL test is equivalent to testing whether the 2nd DMDL is larger than \(2\epsilon\) or not. Intuitively, if the degree of two-step change significantly exceeds that of one-step change as time increases, then \(H_{1}\) is accepted. Thus the basic performance of discrimination of the 2nd DMDL can be reduced to that of the 2nd DMDL test.
The following theorem shows the basic property of the 2nd DMDL test.
Theorem 2.3
Type I and II error probabilities for the 2nd DMDL test are upper bounded as follows:
where \(C_{n}\) is the parametric complexity as in Eq. (2), d is the Bhattacharyya distance as in Eq. (12) and
This theorem can be proven similarly to Theorem 2.2. Type I error probability in Eq. (19) will be used for determining the threshold in “Sequential change sign detection with DMDL”.
Sequential change sign detection with DMDL
In previous sections, we considered how to measure the change sign score at a specific time point t. In order to detect change signs sequentially when there exist multiple change points, we can conduct sequential change sign detection using DMDL in a manner similar to the study^{14}. We give two variants of the sequential algorithm. One is the sequential DMDL algorithm with fixed windowing while the other is that with adaptive windowing. In the former, we prepare a local window of fixed size to calculate DMDL at the center of the window. We then slide the window to obtain a sequence of DMDL change scores as in the study^{14} (see also the study^{42} for local windowing). We raise an alarm when the score exceeds the predetermined threshold \(\beta\). The algorithm is summarized as follows:
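A minimal sketch of the fixed-window variant, under stated assumptions: the window holds 2h points and is scored at its center, the Gaussian codelength uses the leading term of the asymptotic complexity with \(d=2\) rather than the exact expression, and the 1st and 2nd scores are taken as finite differences of the 0th. The names `sequential_dmdl`, `h`, and `order` are hypothetical.

```python
import math


def codelength(x):
    """Approximate Gaussian NML codelength (asymptotic complexity, d = 2; assumption)."""
    n = len(x)
    mean = sum(x) / n
    var = max(sum((v - mean) ** 2 for v in x) / n, 1e-12)  # avoid log(0)
    return 0.5 * n * (math.log(2 * math.pi * var) + 1.0) + math.log(n / (2 * math.pi))


def psi0(window, t):
    """Per-sample codelength saved by placing a change at position t of the window."""
    return (codelength(window) - codelength(window[:t]) - codelength(window[t:])) / len(window)


def sequential_dmdl(x, h, beta, order=0):
    """Slide a window of 2h points over x and score its center.

    Returns (scores, alarms): scores maps the time index t (the number of
    points before the split) to the DMDL score of the chosen order, and
    alarms lists every t whose score exceeds beta. Use h >= 3 for order 2
    so that every segment keeps at least two points.
    """
    scores, alarms = {}, []
    for end in range(2 * h, len(x) + 1):
        window = x[end - 2 * h:end]
        if order == 0:
            s = psi0(window, h)
        elif order == 1:
            s = psi0(window, h + 1) - psi0(window, h)
        else:
            s = psi0(window, h + 1) - 2 * psi0(window, h) + psi0(window, h - 1)
        center = end - h
        scores[center] = s
        if s > beta:
            alarms.append(center)
    return scores, alarms
```

Each window is scored from scratch here for clarity; an incremental update of the sufficient statistics would reduce the cost per step.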
In the study^{43}, the sequential algorithm with adaptive windowing (SCAW) was proposed by combining the 0th DMDL with the ADWIN algorithm^{9} (see also the study^{44} for adaptive windowing), where the window grows until the maximum of the MDL change statistics in the window exceeds a threshold. Once it exceeds the threshold, we drop the data earlier than the time point where the maximum is achieved and the window shrinks. Then the process restarts. It outputs the size of the window whenever a change point is detected.
According to the study^{43}, for the window size w, the threshold \(\epsilon _{w}\) for \(w \Psi ^{(0 )}\) is set so that the total number of false alarms is finite. It is set as follows: For some parameter \(\delta >0\), when the parameter is d-dimensional,
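The grow-and-shrink behavior described above can be sketched as follows. This is an illustration under assumptions, not the SCAW implementation: the threshold is a caller-supplied function of the window size, the Gaussian codelength uses the leading term of the asymptotic complexity with \(d=2\), and every candidate split is rescored at each step for simplicity. The names `adaptive_window_detect` and `min_seg` are hypothetical.

```python
import math


def codelength(x):
    """Approximate Gaussian NML codelength (asymptotic complexity, d = 2; assumption)."""
    n = len(x)
    mean = sum(x) / n
    var = max(sum((v - mean) ** 2 for v in x) / n, 1e-12)  # avoid log(0)
    return 0.5 * n * (math.log(2 * math.pi * var) + 1.0) + math.log(n / (2 * math.pi))


def window_statistic(window, t):
    """w * Psi^(0): total codelength saved by splitting the window at position t."""
    return codelength(window) - codelength(window[:t]) - codelength(window[t:])


def adaptive_window_detect(x, threshold, min_seg=2):
    """Grow a window point by point; shrink it when the best split exceeds threshold(w).

    Returns a list of (time index, window size at detection). After a
    detection, the data before the maximizing split are dropped.
    """
    window, detections = [], []
    for time, value in enumerate(x):
        window.append(value)
        w = len(window)
        if w < 2 * min_seg:
            continue  # too short to split into two scorable segments
        splits = range(min_seg, w - min_seg + 1)
        best_t = max(splits, key=lambda t: window_statistic(window, t))
        if window_statistic(window, best_t) > threshold(w):
            detections.append((time, w))
            window = window[best_t:]
    return detections


if __name__ == "__main__":
    base = [((i * 7) % 5 - 2) * 0.5 for i in range(60)]
    data = base + [5 + v for v in base]
    # Hypothetical threshold, used only to make the example run.
    print(adaptive_window_detect(data, threshold=lambda w: 10.0 + math.log(w)))
```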
Hierarchical sequential DMDL algorithm
Practically, we combine the algorithm with adaptive windowing for the 0th DMDL and the algorithms with fixed windowing for the 1st and 2nd DMDL. We call this algorithm the hierarchical sequential DMDL algorithm. It is designed as follows. We first run the 0th DMDL with adaptive windowing, which outputs both the 0th DMDL score and a window size, and raise an alarm when the window shrinks, i.e., when Eq. (21) is satisfied for some time point in the window. We then compute the 1st and 2nd DMDL scores using the window produced by the 0th DMDL and raise alarms when, for some time point in the window, the 1st or 2nd DMDL exceeds its threshold; the expectation is that the 1st and 2nd DMDL detect change signs before the window shrinks. Note that the window shrinks only with the 0th DMDL, not with the 1st or 2nd DMDL.
In this algorithm, for the window size w, the threshold for the 1st DMDL score \(w\Psi ^{(1)}_{t}\) is determined so that Type I error probability in Eq. (15) is less than the confidence parameter \(\delta _{1}\). That is, from Eqs. (15) and (3), letting the threshold be \(\epsilon _{w}^{(1)}=\epsilon w ,\) we use Eq. (3) ignoring O(1) term to obtain
This yields
We employ the right-hand side of Eq. (22) as the threshold of an alert of the 1st DMDL.
The threshold \(\epsilon ^{(2)}_{w}\) for the 2nd DMDL score \(w\Psi ^{(2)}_{t}\) can be derived similarly to the 1st one. Note that by Eq. (18), the threshold is 2 times the accuracy parameter for the hypothesis testing. Letting \(\delta _{2}\) be the confidence parameter, we have
We employ the right-hand side of Eq. (23) as the threshold of an alert of the 2nd DMDL. In practice, \(\delta _{1}\) and \(\delta _{2}\) are estimated from data (see “Data modeling”).
The hierarchical sequential DMDL algorithm is summarized as follows:
Result I: experiments with synthetic data
Datasets
To evaluate how well DMDL performs for abrupt/gradual change detection, we consider two cases: multiple mean change detection and multiple variance change detection.
In the case of multiple mean change detection, we constructed synthetic datasets as follows: Each datum was independently drawn from the Gaussian distribution \(\mathcal {N}(\mu _{t}, 1)\) where the mean \(\mu _{t}\) abruptly/gradually changed over time according to the following rule: In the case of abrupt changes,
where H(x) is the Heaviside step function that takes 1 if \(x> 0\) otherwise 0. In the case of gradual changes, H is replaced with the following continuous function:
In the case of multiple variance change detection, each datum was independently drawn from the Gaussian distribution \(\mathcal {N}(0, \sigma _{t}^{2})\) where the variance \(\sigma _{t}^{2}\) abruptly/gradually changed over time according to the following rule: In the case of abrupt changes,
In the case of gradual changes, H is replaced with S as with the multiple mean changes.
We define a sign of a gradual change as the starting point of that change. In all the datasets, change points for abrupt changes and change signs for gradual changes were set at nine points: \(t=1000\), 2000, \(\dots\), 9000.
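A minimal sketch of such a generator for the mean-change case. The change times \(t=1000, 2000, \dots , 9000\) and the unit variance follow the description above, whereas the jump size, the ramp width, and the linear shape of the continuous function are placeholder assumptions and not the values used in the experiments. All names are hypothetical.

```python
import random

CHANGE_TIMES = range(1000, 10000, 1000)  # t = 1000, 2000, ..., 9000


def heaviside(x):
    """Step function: 1 if x > 0, otherwise 0."""
    return 1.0 if x > 0 else 0.0


def ramp(x, width):
    """Continuous replacement for the step: rises linearly from 0 to 1 over `width`.

    The linear shape is an assumption made for this sketch.
    """
    return min(max(x / width, 0.0), 1.0)


def mean_schedule(t, gradual, jump=1.0, width=300):
    """Mean at time t: one increment of size `jump` per change time (placeholders)."""
    if gradual:
        return sum(jump * ramp(t - c, width) for c in CHANGE_TIMES)
    return sum(jump * heaviside(t - c) for c in CHANGE_TIMES)


def generate_mean_changes(n=10000, gradual=False, seed=0):
    """Draw x_t independently from N(mu_t, 1) with the schedule above."""
    rng = random.Random(seed)
    return [rng.gauss(mean_schedule(t, gradual), 1.0) for t in range(n)]
```

For a gradual change the sign is the time at which the ramp starts, so the same list of change times serves as ground truth in both settings.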
Evaluation metric
For any change detection algorithm that outputs change scores for all time points, letting \(\beta\) be a threshold parameter, we convert changepoint scores \(\{ s_{t} \}\) into binary alarms \(\{ a_{t} \}\) as follows:
By varying \(\beta\), we evaluate the change detection algorithms in terms of benefit and false alarm rate defined as follows: Let T be a maximum tolerant delay of change detection. When the change truly starts from \(t^{*}\), we define benefit of an alarm at time t as
where \(t^{*}\) is a change point for abrupt change, while it is a sign for gradual change.
The total benefit of alarm sequence \(a_{0}^{n-1}\) is calculated as
The number of false alarms is calculated as
where \(\Theta (t)\) takes 1 if and only if t is true, otherwise 0. We evaluate the performance of any algorithm in terms of AUC (Area under curve) of the graph of the total benefit \(B / \sup _{\beta } B\), against the false alarm rate (FAR) \(N / \sup _{\beta } N\), with \(\beta\) varying.
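The evaluation procedure can be sketched as follows. Several choices here are assumptions made only to obtain a runnable example: the benefit decays linearly from 1 to 0 over the tolerated delay T, the total benefit credits each true change with its best alarm, an alarm is counted as false when it earns no benefit for any true change, and the curve is integrated with the trapezoidal rule. The function names are hypothetical.

```python
def benefit(t, t_star, T):
    """Benefit of an alarm at time t for a change starting at t_star (assumed linear decay)."""
    delay = t - t_star
    return 1.0 - delay / T if 0 <= delay < T else 0.0


def evaluate(scores, change_points, beta, T):
    """Return (total benefit B, number of false alarms N) at threshold beta."""
    alarm_times = [t for t, s in enumerate(scores) if s > beta]
    total_benefit = sum(
        max((benefit(t, c, T) for t in alarm_times), default=0.0)
        for c in change_points
    )
    false_alarms = sum(
        1 for t in alarm_times
        if all(benefit(t, c, T) == 0.0 for c in change_points)
    )
    return total_benefit, false_alarms


def auc(scores, change_points, T):
    """Area under the curve of B / max B against N / max N as beta varies."""
    thresholds = [min(scores) - 1.0] + sorted(set(scores))
    points = [evaluate(scores, change_points, beta, T) for beta in thresholds]
    max_b = max(b for b, _ in points) or 1.0  # guard against division by zero
    max_n = max(n for _, n in points) or 1.0
    curve = sorted((n / max_n, b / max_b) for b, n in points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoidal rule
    return area
```

A score sequence that peaks exactly at the true change and is flat elsewhere attains an AUC of 1 under these definitions.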
Methods for comparison
In order to conduct the sequential DMDL algorithm, we employed the univariate Gaussian distribution whose probability density function is given by Eq. (7).
We employed three sequential change detection methods for comparison:

(1)
Bayesian online change point detection (BOCPD)^{11}: A retrospective Bayesian online change detection method. It originally calculates the posterior of run length. We modified it to compute a change score by taking the expectation of the reciprocal of run length with respect to the posterior.

(2)
ChangeFinder (CF)^{4}: A state-of-the-art method of abrupt change detection.

(3)
ADWIN2^{9}: A change detection method with adaptive windowing.
We conducted the sequential DMDL algorithms with fixed window size in order to investigate their most basic performance in terms of the AUC metric. The sequential DMDL algorithm with adaptive windowing outputs the window size rather than the DMDL values themselves, hence in order to evaluate the effectiveness of the magnitude of DMDL, the sequential DMDL with fixed windowing is a better target for the comparison. All of CF, BOCPD, and ADWIN2 had some parameters, which we determined from five training sequences drawn from the data generation mechanism so that the AUC scores were made the largest.
Results
The performance comparison is summarized in Table 1. We see that for both datasets, in the case of abrupt changes, the 0th DMDL performs best, while in the case of gradual changes, the 1st DMDL performs best and the 2nd DMDL performs worse than the 1st but better than the 0th. This matches our intuition, because the 0th DMDL was designed to detect abrupt changes while the 1st was designed to detect the starting points of gradual changes.
Result II: applications to COVID-19 pandemic analysis
Since the beginning of 2020, many regions/countries have suffered from the epidemic of COVID-19. The purpose of our analysis is to demonstrate the importance of monitoring the dynamics of the epidemic through detecting the occurrence of drastic outbreaks and their signs. We define an outbreak as a significant increase in the number of cases in a region/country. We note that to contain the spread of COVID-19, many countries have enacted social distancing policies, e.g., stay-at-home orders, closing non-essential services, and limiting travel. We thus also relate the results of our analysis to social distancing events.
In particular, we are mainly concerned with the following two problems:

1.
How early are the outbreak signs detected prior to outbreaks?

2.
How are the outbreaks/outbreak signs related to the social distancing events?
As a byproduct, the analysis of the dynamics of the basic reproduction number R0^{37} is conducted, which can serve as supplementary information to the particular value estimated from the SIR model^{45}.
Data source
We studied the data provided by the European Centre for Disease Prevention and Control (ECDC), which can be accessed through the link https://www.ecdc.europa.eu/en/publicationsdata/downloadtodaysdatageographicdistributioncovid19casesworldwide. In this paper, we focused on the first wave because various factors made the situation very complicated in later waves, e.g., virus mutations^{31,32,33}, people being tired of social distancing, and the mixture of two waves in the transition period. In particular, we studied 37 countries with no fewer than 10,000 cumulative cases by Apr. 30, 2020, since some countries started to ease social distancing around that date. More details about these countries can be found in Sec. 2 of the supplementary information. It is worth mentioning that the proposed method can be applied to any region/country where there is a COVID-19 epidemic because the input to the method is only the number of cases. In practice, we suggest starting to run our algorithm when the spread of the virus into the region of concern through local infections begins, not when the cases are merely imported.
Data modeling
We studied two data models by considering the value of R0, which by definition is the product of transmissibility, the average contact rate between susceptible and infected individuals, and the duration of infectiousness^{45}. At the initial phase of an epidemic, R0 is larger than one^{37}, and the cumulative cases may grow exponentially^{46,47,48,49}. We thus employed the Malthusian growth model^{50} because it is widely used for characterizing the early phase of an epidemic^{48,49}. In particular, the cumulative cases at time t, C(t), grows according to the following equation:
\[C(t)=C(0)e^{rt},\]
where C(0) is the number of cases at the start of an epidemic, and r is the growth rate of daily new cases. In the experiments, we took the logarithm of C(t) to obtain the linear regression of the logarithm growth with respect to time as follows:
\[\log C(t)=\log C(0)+rt.\]
We modeled the residual error of the linear regression using the univariate Gaussian. See Sec. 3 in the supplementary file for the details of the calculation of the MDL change statistics for this model. When a change is detected in the modeling of the residual error, we examine the increase/decrease in the coefficient of the linear regression, i.e., r. We expect to detect changes in the parameter of the exponential modeling to monitor the increase/decrease of R0 because \(R0-1\) is proportional to r^{47}.
In later phases, the exponential growth pattern may not hold. For instance, when \(R0 < 1\), daily new cases would continue to decline and cease to exist^{37}. Considering the complicated real scenarios, epidemic models with certain assumptions on the growth rate or R0 may not fit an epidemic at a given time. Therefore, we employed the univariate Gaussian distribution as in Eq. (7) to directly model the number of daily new cases, without assuming any pattern of growth. A change in the parameter of the Gaussian modeling may reveal how R0 compares with one, i.e., \(R0 > 1\) when daily new cases increase significantly or \(R0 < 1\) when daily new cases decrease significantly.
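The regression step can be sketched with ordinary least squares on the logarithm of the cumulative counts. This is an illustration of the log-linear fit only, assuming strictly positive cumulative counts and daily time indices starting at zero; it does not reproduce the codelength calculation of the supplementary information, and the function name is hypothetical.

```python
import math


def fit_log_linear(cumulative):
    """Least-squares fit of log C(t) = log C(0) + r * t for t = 0, 1, 2, ...

    Returns (r, intercept, residuals), where the residuals are the quantities
    that would be modeled with the univariate Gaussian.
    """
    ts = list(range(len(cumulative)))
    ys = [math.log(c) for c in cumulative]  # requires every count to be > 0
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    r = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) / sxx
    intercept = y_mean - r * t_mean
    residuals = [y - (intercept + r * t) for t, y in zip(ts, ys)]
    return r, intercept, residuals


if __name__ == "__main__":
    # Synthetic counts following exact exponential growth with r = 0.2.
    cases = [100 * math.exp(0.2 * t) for t in range(10)]
    r, intercept, residuals = fit_log_linear(cases)
    print(r, math.exp(intercept), max(abs(e) for e in residuals))
```

Comparing the fitted r before and after a detected change indicates whether the growth rate increased or decreased.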
We conducted the hierarchical sequential DMDL algorithm as in “Hierarchical sequential DMDL algorithm”. The confidence parameter \(\delta\) for the 0th DMDL as in Eq. (21) was set to 0.05. Those for the 1st and 2nd DMDL, i.e., \(\delta _{1},\delta _{2}\) as in Eqs. (22), (23), were determined as follows: We calculated the DMDL scores around the time when the initial warning was announced by an authority; we determined \(\delta _{1},\delta _{2}\) so that the score equaled the threshold. For example, the initial warning for Japan was set on Feb. 27, when the government required closing elementary, junior high and high schools. If the resulting \(\delta _{1}\) or \(\delta _{2}\) was larger than 1, it was set to 0.99, because a confidence parameter bounds an error probability and must therefore be less than 1. More details about the implementation are provided in Sec. 4 of the supplementary information.
Case study
We present a representative case study of Japan because of space limitations. For the results of all the studied countries, please refer to Sec. 5 of the supplementary information. In Japan, a state of emergency, which we take as the social distancing event, was issued on Apr. 7. The results are presented in Fig. 1 and Fig. 2 for the Gaussian modeling and the exponential modeling, respectively. Change scores were normalized into the range [0, 1]. The data of Japan did not include the confirmed cases from the ‘Diamond Princess’.
With the Gaussian modeling, several alarms were raised before the social distancing event. Each alarm raised by the 0th DMDL can be interpreted as a statistically significant increase in cases, as seen in Fig. 1a. Hereafter, a change that was detected by the 0th DMDL and that corresponded to an increase of cases is regarded as an outbreak, which instantiates our definition of outbreak. Outbreak detection is thus classic change detection. We further relate it to R0. Around the dates of the alarms, we consider that \(R0 > 1\) held, since the new infections can be confirmed to have resulted from community transmission. Consistently, R0 was estimated to be around 2.5 in early March by an epidemiological study^{51}. When the 0th DMDL raised an alarm, the window size shrank to zero. Before that, both the 1st and the 2nd DMDL raised alarms, which are interpreted as changes in the velocity and the acceleration of the increase of cases, respectively. We conclude that the 1st and the 2nd DMDL were able to detect the signs of the outbreak by examining the velocity and the acceleration of the spread. Sign detection is the new concept with which we propose to supplement classic change detection. The 0th DMDL raised no alarms about outbreaks after the event. We think that social distancing played a critical role in containing the spread because it can significantly suppress R0 by reducing the contact rate. The 1st DMDL still raised alarms after the event, which were signs of decreases in the cases.
As for the exponential modeling, alarms were raised by the 0th DMDL both before and after the social distancing event. From the growth pattern of local cumulative cases in Fig. 2a, we can see that all the alarms concerned cessations of the exponential growth. Moreover, we checked that the alarms were associated with decreases in the coefficient of the linear regression. Therefore, we concluded that all the alarms indicated significant decreases in R0. Although the last two alarms were raised on Mar. 26 and Apr. 28, the corresponding change points lay within the windows as of those dates and were identified as Mar. 12 and Apr. 18, respectively. An epidemiological study^{51} showed the effectiveness of the initial warning announced on Feb. 27 at reducing R0. This supports the view that our method effectively identified the decrease in R0 around Mar. 12. Our method identified another decrease in R0 around Apr. 18, which we think was mainly due to the social distancing event on Apr. 7. Therefore, our method based on the exponential modeling also confirmed that social distancing was very effective at containing the spread. The alarms raised by the 1st and 2nd DMDL demonstrated the capability of the sign detection.
In comparison, the Gaussian modeling was effective at estimating whether R0 was above or below one, while the exponential modeling was able to monitor changes in the value of R0. The two models are thus complementary in monitoring the dynamics of R0. For instance, for Japan, the Gaussian modeling showed that R0 remained larger than one, and the exponential modeling showed that its value decreased during the studied period. Because of the difference in the modeling, the changes detected by the 0th DMDL fell on different dates for the Gaussian modeling and the exponential modeling. In terms of sign detection, both the Gaussian modeling and the exponential modeling were effective.
Summary of results for individual countries
This section summarizes several statistics about the change detection results in Table 2 and presents two observations. The first concerns how early the signs can be detected prior to changes. For the countries studied, 106 and 54 changes in total were detected by the Gaussian modeling and the exponential modeling, respectively. More changes were detected by the Gaussian modeling because daily cases can change significantly with either \(R0>1\) or \(R0<1\), whereas it may take relatively longer for R0 itself to change significantly. The number of changes whose signs were detected by either the 1st or the 2nd DMDL was 68 and 26 for the Gaussian modeling and the exponential modeling, respectively, i.e., detection rates of about 64% and 48%. For each change whose signs were detected, we measured the time difference between the earliest sign alarm and the change alarm. For the Gaussian modeling, which can detect outbreaks, the time difference was 6.25 (mean) ± 6.04 (standard deviation) days. Given the fast spread, six days buys considerable time to prepare for an outbreak, and even to avoid a potential one.
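The lead-time measurement described above can be sketched as follows. The helper `lead_times` and the dates are made up for illustration and are not taken from the data of this study; the use of the sample standard deviation is likewise an assumption, since the text does not state which estimator was used.

```python
from datetime import date
from statistics import mean, stdev

def lead_times(events):
    """Days between the earliest sign alarm and the change alarm.

    `events` is a list of (change_alarm_date, [sign_alarm_dates]) pairs.
    Changes with no sign alarm on or before the change alarm are skipped.
    """
    leads = []
    for change_alarm, sign_alarms in events:
        earlier = [d for d in sign_alarms if d <= change_alarm]
        if earlier:
            leads.append((change_alarm - min(earlier)).days)
    return leads

# Hypothetical alarm dates, for illustration only.
events = [
    (date(2020, 3, 20), [date(2020, 3, 12), date(2020, 3, 16)]),
    (date(2020, 4, 2), [date(2020, 3, 30)]),
    (date(2020, 4, 15), []),  # change without any detected sign
]
leads = lead_times(events)
print(leads, mean(leads), stdev(leads))
```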
In particular, with the Gaussian modeling, the 1st DMDL detected signs for 65 changes and the 2nd DMDL detected signs for 27 changes. The smaller number for the 2nd DMDL might be because the 1st DMDL is better at detecting starting points of gradual changes, which is consistent with the results on the synthetic datasets in Table 1. The 1st DMDL detected signs 6.35 ± 5.91 days before the changes, and the 2nd DMDL detected them 5.56 ± 6.50 days before. Note that not all the changes allowed for sign detection, since sign detection with the 1st DMDL and the 2nd DMDL requires one more and two more data points in the window than the 0th DMDL, respectively. The number of changes allowing for a 1st DMDL sign was 88 while the number for a 2nd DMDL sign was 81. Hence, some changes occurred too quickly for signs to be detected. The analysis of the results obtained by the exponential modeling is similar and is omitted for space reasons.
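For reference, the counts above can be turned into detection rates among the changes that allowed for sign detection. These ratios are not reported in the text; the short sketch below only derives them from the stated counts, and the dictionary names are illustrative.

```python
# Counts reported in the text for the Gaussian modeling.
detected = {"1st DMDL": 65, "2nd DMDL": 27}   # changes whose signs were detected
eligible = {"1st DMDL": 88, "2nd DMDL": 81}   # changes allowing for a sign

# Fraction of eligible changes for which a sign was detected.
rates = {name: detected[name] / eligible[name] for name in detected}
for name, rate in rates.items():
    print(f"{name}: {detected[name]}/{eligible[name]} = {rate:.1%}")
```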
Second, we observed that, on average, countries that responded faster, in the sense that fewer alarms had been raised by the Gaussian modeling before the social distancing event, saw a quicker contraction of daily cases. As of Apr. 30, the curve of daily cases in many countries had flattened and had even started to turn downward. Accordingly, alarms for declines in the number of daily cases from the global peak were raised for ten countries: Austria, China, Germany, Iran, Italy, Netherlands, South Korea, Spain, Switzerland, and Turkey. These countries are referred to as downward countries. The total number of alarms of all kinds raised before the event was 4.30 ± 2.79 for downward countries while it was 5.96 ± 4.22 for other countries. Therefore, if social distancing is a viable option, these results suggest that the action should be taken before it is too late, e.g., before more than four alarms have been raised. We further measured that it took an average of 30 days to suppress the spread when prompt social distancing policies were enacted. By contrast, the average number of days from the date of the social distancing event to Apr. 30 was nearly 37 for non-downward countries, which was considerably more than the time needed to suppress the spread in downward countries. The results of the exponential modeling confirmed the above observation. In particular, there were more changes and signs corresponding to decreases in R0 for the downward countries than for the non-downward countries.
Limitations and challenges of the COVID-19 analysis
Since the proposed method only examines the number of COVID-19 cases, the analysis can only give an overall estimation of the dynamics of the pandemic, which result from the joint effects of various physical factors, including the characteristics of the virus, human mobility patterns, mask usage, vaccine coverage, and environmental factors. When any one of these factors changes, e.g., through virus mutations or the entry of the virus into sewage^{52}, the number of cases may change. Accordingly, the major limitation of the proposed method is that, by itself, it cannot associate the detected changes, whether outbreaks or their signs, with a particular physical factor.
We were concerned with detecting signs of the first wave of COVID-19. Although we employed the Gaussian model and the exponential growth model in computing DMDL, such models might not necessarily be the most appropriate for dealing with later waves, since a number of waves are mixed in the transition periods. One challenge is to consider more sophisticated models, such as latent variable models, in dealing with later waves.
Conclusion
This paper has proposed a novel methodology for detecting signs of changes from a data stream. The key idea is to use the differential MDL change statistics (DMDL) as a sign score. This score can be thought of as a natural extension of the differentials of the Kullback–Leibler divergence for measuring the degree of changes to the case where the true mechanism for generating data is unknown. We have theoretically justified DMDL using the hypothesis testing framework and have empirically justified the sequential DMDL algorithm using synthetic data. On the basis of the theory of DMDL, we have applied it to the COVID-19 pandemic analysis. We have observed that the 0th DMDL found change points related to outbreaks and that the 1st and 2nd DMDL were able to detect their signs several days before those change points. We have further related the change points to the dynamics of the basic reproduction number R0. We have also found that the countries with no more than five changes/change signs before the implementation of social distancing tended to experience the decrease in the number of cases considerably earlier. This analysis is a promising new approach to pandemic analysis from the viewpoint of data science.
Change detection, which aims to detect points in a sequence of random variables at which the probability distribution changes, has been studied for decades and has wide applications, such as event detection, failure detection, and malware detection^{4,14,43}. Change sign detection as proposed in this paper aims to detect early warning signals of such changes by identifying the speed and acceleration of changes in the probability distribution, and therefore has the same range of applicability as change detection.
Future work includes studying how change analysis such as our methodology can be integrated with conventional simulation studies such as the SIR model. We expect that our data science approach is complementary to the simulation approach and gives new insights into epidemiology. Moreover, we plan to study later waves, which are more complicated situations than the first wave.
References
Page, E. S. Continuous inspection schemes. Biometrika 41(1/2), 100–115 (1954).
Hinkley, D. V. Inference about the change-point in a sequence of random variables. Biometrika 57(1), 1–17 (1970).
Basseville, M. & Nikiforov, I. V. Detection of Abrupt Changes: Theory and Application (Prentice-Hall Inc., 1993).
Takeuchi, J. & Yamanishi, K. A unifying framework for detecting outliers and changepoints from time series. IEEE Trans. Knowl. Data Eng. 18(4), 482–492 (2006).
Rissanen, J. Modeling by shortest description length. Automatica 14(5), 465–471 (1978).
Grünwald, P. D. The Minimum Description Length Principle (MIT Press, 2007).
Rissanen, J. Optimal Estimation of Parameters (Cambridge University Press, 2012).
Guralnik, V. & Srivastava, J. Event detection from time series data. in Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD1999). 33–42 (1999).
Bifet, A. & Gavalda, R. Learning from time-changing data with adaptive windowing. in Proceedings of SIAM International Conference on Data Mining (SDM2007). 443–448 (2007).
Fearnhead, P. & Liu, Z. Online inference for multiple change point problem. J. R. Stat. Soc. Ser. B 69(4), 589–605 (2007).
Adams, R. P. & MacKay, D. J. C. Bayesian online change point detection. Preprint at https://arxiv.org/pdf/0710.3742.eps (2007).
Gama, J., Žliobaite, I., Bifet, A., Pechenizkiy, M. & Bouchachia, A. A survey on concept drift adaptation. ACM Comput. Surv. 46(4), 1–37 (2014).
Huang, D. T. J., Koh, Y. S., Dobbie, G. & Pears, R. Detecting volatility shift in data streams. in Proceedings of 2014 IEEE International Conference on Data Mining (ICDM2014). 863–868 (2014).
Yamanishi, K. & Miyaguchi, K. Detecting gradual changes from data stream using MDL change statistics. in Proceedings of 2016 IEEE International Conference on BigData (BigData2016). 156–163 (2016).
Hirai, S. & Yamanishi, K. Detecting latent structure uncertainty with structural entropy. in Proceedings of 2018 IEEE International Conference on BigData (BigData2018). 26–35 (2018).
Ohsawa, Y. Graph-based entropy for detecting explanatory signs of changes in market. Rev. Soc. Netw. Strateg. 12, 183–203 (2018).
Hirai, S. & Yamanishi, K. Detecting model changes and their early warning signals using MDL change statistics. in Proceedings of 2019 IEEE International Conference on BigData (BigData2019). 84–93 (2019).
Yamanishi, K. & Fukushima, S. Model change detection with the MDL principle. IEEE Trans. Inform. Theory 64(9), 6115–6126 (2018).
Keogh, E., Lonardi, S. & Ratanamahatana, C. Toward parameter-free data mining. in Proceedings of 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2004). 206–215 (2004).
Vreeken, J., Van Leeuwen, M. & Siebes, A. Krimp: Mining itemsets that compress. Data Min. Knowl. Discov. 23(1), 169–214 (2011).
van Leeuwen, M. & Siebes, A. Streamkrimp: Detecting change in data streams. Mach. Learn. Knowl. Disc. Databases Lect. Notes Comput. Sci. 52(11), 672–687 (2008).
Bi, Q. et al. Epidemiology and transmission of COVID-19 in 391 cases and 1286 of their close contacts in Shenzhen, China: A retrospective cohort study. Lancet Infect. Dis. https://doi.org/10.1016/S14733099(20)302875 (2020).
Kraemer, M. U. et al. The effect of human mobility and control measures on the COVID-19 epidemic in China. Science 368(6490), 493–497 (2020).
Kucharski, A. J. et al. Early dynamics of transmission and control of COVID-19: A mathematical modelling study. Lancet Infect. Dis. 20(5), 553–558 (2020).
Backer, J. A., Klinkenberg, D. & Wallinga, J. Incubation period of 2019 novel coronavirus (2019-nCoV) infections among travellers from Wuhan, China, 20–28 January 2020. Eurosurveillance 25(5). https://doi.org/10.2807/15607917.ES.2020.25.5.2000062 (2020).
Linton, N. M. et al. Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: A statistical analysis of publicly available case data. J. Clin. Med. 9(2). https://doi.org/10.3390/jcm9020538 (2020).
Lauer, S. A. et al. The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: Estimation and application. Ann. Intern. Med. 172(9), 577–582 (2020).
Kermack, W. O. & McKendrick, A. G. A contribution to the mathematical theory of epidemics. Proc. R. Soc. Lond. Ser. A 115(772), 700–721 (1927).
Lourenco, J. et al. Fundamental principles of epidemic spread highlight the immediate need for large-scale serological surveys to assess the stage of the SARS-CoV-2 epidemic. Preprint at https://www.medrxiv.org/content/10.1101/2020.03.24.20042291v1 (2020).
Zou, D. et al. Epidemic model guided machine learning for COVID-19 forecasts in the United States. Preprint at https://www.medrxiv.org/content/10.1101/2020.05.24.20111989v1 (2020).
Korber, B. et al. Tracking changes in SARS-CoV-2 spike: Evidence that D614G increases infectivity of the COVID-19 virus. Cell 182(4), 812–827 (2020).
Wise, J. Covid-19: New coronavirus variant is identified in UK. BMJ 371, M4857 (2020).
Starr, T. N., Greaney, A. J., Dingens, A. S. & Bloom, J. D. Complete map of SARS-CoV-2 RBD mutations that escape the monoclonal antibody LY-CoV555 and its cocktail with LY-CoV016. Cell Rep. Med. 2(4), 100255 (2021).
Carroll, W. D. et al. European and United Kingdom COVID-19 pandemic experience: The same but different. Paediatr. Respir. Rev. 35, 50–56 (2020).
Yao, Y. et al. No association of COVID-19 transmission with temperature or UV radiation in Chinese cities. Eur. Respir. J. https://doi.org/10.1183/13993003.005172020 (2020).
Huang, Z. et al. Optimal temperature zone for the dispersal of COVID-19. Sci. Total Environ. 736, 139487. https://doi.org/10.1016/j.scitotenv.2020.139487 (2020).
Diekmann, O., Heesterbeek, J. A. P. & Metz, J. A. J. On the definition and the computation of the basic reproduction ratio R0 in models for infectious diseases in heterogeneous populations. J. Math. Biol. 28, 365–382 (1990).
Dehning, J., Zierenberg, J., Spitzner, F. P., Wibral, M., Neto, J. P., Wilczek, M. & Priesemann, V. Inferring change points in the spread of COVID-19 reveals the effectiveness of interventions. Science 369, 10 (2020).
Shtarkov, Y. M. Universal sequential coding of single messages. Probl. Peredachi Inf. 23(3), 3–17 (1987).
Rissanen, J. Fisher information and stochastic complexity. IEEE Trans. Inform. Theory 42(1), 40–47 (1996).
Moustakides, G. V. Optimal stopping times for detecting changes in distributions. Ann. Stat. 14(4), 1379–1387 (1986).
Moskvina, V. & Zhigljavsky, A. An algorithm based on singular spectrum analysis for change-point detection. Commun. Stat. Simul. C. 32(2), 319–352 (2003).
Kaneko, R., Miyaguchi, K. & Yamanishi, K. Detecting changes in streaming data with information-theoretic windowing. in Proceedings of 2017 International Conference on BigData (BigData2017). 646–655 (2017).
Killick, R., Fearnhead, P. & Eckley, I. A. Optimal detection of change points with a linear computational cost. J. Am. Stat. Assoc. 107(500), 1590–1598 (2012).
Jones, J. H. Notes on R0. in California: Department of Anthropological Sciences. https://web.stanford.edu/~jhj1/teachingdocs/JonesonR0.eps (2007).
Kermack, W. O. & McKendrick, A. G. Contributions to the mathematical theory of epidemics IV. Analysis of experimental epidemics of the virus disease mouse ectromelia. Epidemiol. Infect. 37(2), 172–187 (1937).
Anderson, R. M. & May, R. M. Infectious Diseases of Humans: Dynamics and Control (Oxford University Press, 1992).
Viboud, C., Simonsen, L. & Chowell, G. A generalized-growth model to characterize the early ascending phase of infectious disease outbreaks. Epidemics 15, 27–37 (2016).
Chowell, G., Sattenspiel, L., Bansal, S. & Viboud, C. Mathematical models to characterize early epidemic growth: A review. Phys. Life Rev. 18, 66–97 (2016).
Malthus, T. R., Winch, D. & James, P. Malthus: An Essay on the Principle of Population (Cambridge University Press, 1992).
Sugishita, Y., Kurita, J., Sugawara, T. & Ohkusa, Y. Preliminary evaluation of voluntary event cancellation as a countermeasure against the COVID-19 outbreak in Japan as of 11 March. medRxiv (2020).
Petala, M. et al. A physicochemical model for rationalizing SARS-CoV-2 concentration in sewage. Case study: The city of Thessaloniki in Greece. Sci. Total Environ. 755, 142855 (2021).
Acknowledgements
This work was partially supported by JST KAKENHI JP19H01114 and JST-AIP JPMJCR19U4.
Author information
Contributions
Conceptualization, K.Y.; methodology, K.Y., L.X., R.Y. and S.F.; software, L.X., R.Y., S.F. and C.L.; validation, K.Y., L.X., R.Y., S.F. and C.L.; formal analysis, K.Y., L.X., R.Y. and S.F.; investigation, K.Y. and L.X.; resources, K.Y.; data curation, L.X. and R.Y.; writing (original draft preparation), K.Y. and L.X.; writing (review and editing), K.Y., L.X., R.Y. and S.F.; visualization, L.X. and R.Y.; supervision, K.Y. and L.X.; project administration, K.Y.; funding acquisition, K.Y.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yamanishi, K., Xu, L., Yuki, R. et al. Change sign detection with differential MDL change statistics and its applications to COVID-19 pandemic analysis. Sci Rep 11, 19795 (2021). https://doi.org/10.1038/s41598021987814
DOI: https://doi.org/10.1038/s41598021987814