Forecasting the long-term trend of COVID-19 epidemic using a dynamic model

The current outbreak of coronavirus disease 2019 (COVID-19) has recently been declared a pandemic and has spread to over 200 countries and territories. Forecasting the long-term trend of the COVID-19 epidemic can help health authorities determine the transmission characteristics of the virus and take appropriate prevention and control strategies beforehand. Previous studies that solely applied traditional epidemic models or machine learning models were subject to underfitting or overfitting problems. We propose a new model named Dynamic-Susceptible-Exposed-Infective-Quarantined (D-SEIQ), by making appropriate modifications to the Susceptible-Exposed-Infective-Recovered (SEIR) model and integrating machine learning based parameter optimization under epidemiologically rational constraints. We used the model to predict the long-term reported cumulative numbers of COVID-19 cases in China from January 27, 2020. We evaluated our model on officially reported confirmed cases from three different regions in China, and the results demonstrated the effectiveness of our model in simulating and predicting the trend of the COVID-19 outbreak. In the China-excluding-Hubei area, within 7 days after the first public report, our model accurately predicted the long-term trend up to 40 days ahead and the date of the outbreak peak. The predicted cumulative number (12,506) by March 10, 2020, was only 3.8% different from the actual number (13,005). The parameters obtained by our model supported the effectiveness of prevention and intervention strategies on epidemic control in China. The prediction results for five other countries suggested the external validity of our model. The integrated approach of epidemic and machine learning models could accurately forecast the long-term trend of the COVID-19 outbreak. The model parameters also provided insights into the analysis of COVID-19 transmission and the effectiveness of interventions in China.

Coronavirus disease 2019 (COVID-19) is an infectious pneumonia caused by severe acute respiratory syndrome coronavirus 2 1 . The disease was first reported in December 2019 in Wuhan city, the capital of Hubei province in China, and has since spread across China and globally 2 . As of 19 August, a total of 22 million COVID-19 cases and 773,067 deaths have been reported in more than 200 countries and territories 3 . The World Health Organization (WHO) has declared the COVID-19 outbreak a Public Health Emergency of International Concern and, more recently, a pandemic 4 .
Forecasting the long-term trend of the epidemic can help health authorities determine the transmission characteristics of the virus and develop appropriate prevention and containment strategies beforehand. Recently, some researchers applied traditional epidemic models like Susceptible-Exposed-Infective-Recovered (SEIR) or machine learning models like logistic regression to predict the trend of COVID-19 5,6 . To the best of our knowledge, most of those studies were performed retrospectively or were subject to overfitting or underfitting problems. The validity of the SEIR model depends on accurate estimation of virus transmission characteristics such as the basic reproduction number R 0 , incubation period, and infectious period. In a real scenario, those parameters are not easy to estimate. For example, Wu et al. estimated the basic reproduction number using cases exported from China to other countries and estimated that 75,815 individuals had been infected in Wuhan as of 25 January 6 , which significantly overestimated the figure. On the other hand, due to insufficient training data and valid features, machine learning models were subject to overfitting, restricted to retrospective analysis, or limited to forecasting short-term trends 5,7-10 .

Scientific Reports | (2020) 10:21122 | https://doi.org/10.1038/s41598-020-78084-w www.nature.com/scientificreports/
To address the aforementioned issues, we propose a novel model named Dynamic-Susceptible-Exposed-Infective-Quarantined (D-SEIQ), by making appropriate modifications to the SEIR model and integrating machine learning based parameter optimization under reasonable constraints. Our D-SEIQ model effectively improves the performance of long-term trend forecasting for the COVID-19 outbreak in China and five other countries. In addition, the model parameters, such as the dynamic reproduction number, could provide insights into the analysis of COVID-19 transmission characteristics and the effectiveness of interventions.

Methods
D-SEIQ model. The primary differences between our D-SEIQ model and the SEIR model are (1) replacing recovered individuals R with quarantined individuals Q , and (2) introducing time-dependent dynamics to the estimation of the effective reproduction number R t .
The SEIR model is a classic compartmental model that was initially used to simulate the spread of flu. Some previous work employed the SEIR model to predict the trend of COVID-19, assuming that exposed individuals (who were infected but displayed no symptoms) were not infective 2 . However, it has been reported that COVID-19 might be transmissible by exposed individuals 11 . Moreover, the R compartment in the traditional SEIR model indicates recovered cases or, more precisely, removed cases, who were removed from the total population and lost their infective or susceptible properties. Unlike flu, from which patients recovered soon with or without treatment, COVID-19 had no specialized treatment, and patients were usually quarantined quickly by health workers and thereby lost their infective or susceptible properties. In this scenario, the counterpart of the R compartment in the traditional SEIR model is a quarantined compartment (Q). Accordingly, the infectious period, which was the time between the infectious (I) and recovered (R) states in the traditional SEIR model, corresponded to the time between the infectious (I) and quarantined (Q) states in the COVID-19 epidemic. We therefore replaced the recovered individuals R with the quarantined individuals Q, and the model became the SEIQ model.
The quarantined individuals Q indicated the confirmed cases who were detected and centrally quarantined. They eventually either recovered (R Q ) or died (D Q ). Meanwhile, some infected cases recovered or died without being detected and diagnosed. We defined those cases as undetected recovered (R u ) and undetected death (D u ) cases. The epidemic spreading model for the SEIQ model is illustrated in Fig. 1.
The transmission dynamics are governed by the following system of equations:
where N = S(t) + E(t) + I(t) + Q(t) + R u (t) + D u (t) is the total population, which is assumed to be constant. As in the SEIR model, parameter β indicates the infectious rate, with β = R t / TE, where R t is the dynamic effective reproduction number and TE is the average duration of incubation; parameter σ indicates the incubation rate, with σ = 1 / TE. In our model, however, parameter γ indicates the quarantine rate, with γ = 1 / TI, where TI is the average time taken for an infectious individual to be detected and quarantined. The parameter TI reflects the timeliness of patient detection and admission and usually varies across regions. Parameters ε and μ indicate the undetected recovery rate and the undetected death rate, respectively.
The basic reproduction number R 0 is the most important parameter for determining the intrinsic transmissibility of COVID-19. It is defined as the average number of infections one infectious agent can generate over the course of the infectious period without any interventions. In previous work, R 0 was assumed to be constant or was arbitrarily modified at specific points for forecasting 12,13 . In real-world scenarios, however, more and more interventions are taken to control the spread as the epidemic develops, which gradually reduces the reproduction number. In this work, the basic reproduction number R 0 is generalized to a dynamic value R t , defined as the average number of secondary infectious cases generated by an infectious individual at time t. After the worldwide outbreak of COVID-19, many governments took considerable measures to contain the spread of the virus. In our preliminary analysis and in some previous work 14 , the infectious rate β was shown to decrease exponentially with time. Because parameter TE is constant and β = R t / TE, the effective reproduction number R t should likewise decrease exponentially with time.
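To make the compartment flows concrete, the block below writes out one plausible form of the system. It is a sketch under stated assumptions, not necessarily the authors' exact equations: it assumes a standard SEIR-type force of infection in which only the I compartment transmits, and that undetected recovery and death draw from I at rates ε and μ. The text notes that exposed individuals may also transmit, which would add an E term to the infection rate.

```latex
% Assumed form of the SEIQ system (a sketch, not confirmed by the source).
% Rates follow the definitions in the text:
%   \beta = R_t / T_E, \quad \sigma = 1 / T_E, \quad \gamma = 1 / T_I.
\begin{aligned}
\frac{dS}{dt}   &= -\beta \frac{S(t)\, I(t)}{N}, \\
\frac{dE}{dt}   &= \beta \frac{S(t)\, I(t)}{N} - \sigma E(t), \\
\frac{dI}{dt}   &= \sigma E(t) - (\gamma + \varepsilon + \mu)\, I(t), \\
\frac{dQ}{dt}   &= \gamma I(t), \\
\frac{dR_u}{dt} &= \varepsilon I(t), \\
\frac{dD_u}{dt} &= \mu I(t).
\end{aligned}
```

The right-hand sides sum to zero, so N = S + E + I + Q + R_u + D_u stays constant, in line with the constant-population assumption stated above.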
Thus, we introduced time-dependent dynamics to the estimation of R t for better simulation of real-world transmission:
where R ∞ is the final reproduction number at the end of the pandemic and θ is the decrease ratio of the reproduction number, which is associated with the corresponding interventions. At the very beginning, when t = 0 , R t = R 0 , and it gradually falls to R ∞ as t increases. The epidemic is considered to be under control when R t < 1 , and the reasonable range of R ∞ was taken from previous analyses of coronaviruses 15 .
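One expression consistent with the description above (exponential decay, R t = R 0 at t = 0, and R t approaching R ∞ for large t) is given below. It is an assumption chosen to match those three properties, not necessarily the authors' exact formula.

```latex
% Assumed time-dependent reproduction number (a sketch, not confirmed by the source).
R_t = R_\infty + (R_0 - R_\infty)\, e^{-\theta t}, \qquad \theta > 0 .

% Boundary behaviour:
%   t = 0:        R_t = R_\infty + (R_0 - R_\infty) = R_0
%   t \to \infty: R_t \to R_\infty

% The infectious rate then inherits the same time dependence:
\beta(t) = \frac{R_t}{T_E} .
```

Under this form, a larger θ means R t approaches R ∞ faster, which matches the reading of θ as a measure of how quickly interventions take effect.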
Parameter constraints and optimization. The simulation and prediction of the D-SEIQ model require the determination of parameters R 0 , R ∞ , TE, TI, and θ . Although we incorporated machine learning techniques to fit the reported data, the parameter ranges need to be pre-set carefully and to conform to epidemiological rationality. For instance, Wu et al. applied an adjusted SEIR model to estimate R 0 ( R 0 = 2.68 ) in major cities of China by analyzing the number of cases exported from Wuhan internationally 6 . Some work concluded that the daily reproduction number varied between 2 and 7 16 . Therefore, we set a reasonable range for parameter R 0 ∈ [2,7] . Likewise, after reviewing previous work on the analysis of COVID-19 2,11 , we summarized the ranges for the parameters in our model in Table 1. We also set TE > TI as an additional constraint. Therefore, the parameter optimization process is as follows:
Moreover, we adjusted the number of newly confirmed cases in Wuhan between 12 and 14 February, because clinically confirmed cases without a coronavirus test were included in the reported numbers on those days. The clinically confirmed cases between 12 and 14 February were assumed to have been suspected cases in the preceding 7 days. Specifically, we redistributed the clinically confirmed cases according to the distribution of suspected cases over the past 7 days.
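The constrained fitting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a plain random search in place of whatever optimizer they used, a simplified SEIQ simulator with undetected recovery and death set to zero, and an assumed exponential form for R t. Only the R 0 range [2, 7] and the constraint TE > TI come from the text; all other bounds, the function names, and the synthetic data are hypothetical.

```python
import math
import random

# Search bounds. Only the r0 range is taken from the text; the rest are hypothetical.
BOUNDS = {
    "r0": (2.0, 7.0),
    "r_inf": (0.0, 1.0),
    "te": (2.0, 7.0),
    "ti": (1.0, 5.0),
    "theta": (0.01, 0.5),
}


def simulate(p, n_days, population, i0, steps_per_day=10):
    """Euler-integrate a simplified SEIQ system; return cumulative Q at the end of each day."""
    s, e, i, q = population - i0, 0.0, float(i0), 0.0
    sigma, gamma = 1.0 / p["te"], 1.0 / p["ti"]
    dt = 1.0 / steps_per_day
    curve = []
    for day in range(n_days):
        for k in range(steps_per_day):
            t = day + k * dt
            # Assumed form: R_t decays exponentially from r0 towards r_inf.
            r_t = p["r_inf"] + (p["r0"] - p["r_inf"]) * math.exp(-p["theta"] * t)
            beta = r_t / p["te"]  # beta = R_t / TE, as stated in the text
            new_e = beta * s * i / population * dt
            new_i = sigma * e * dt
            new_q = gamma * i * dt
            s, e, i, q = s - new_e, e + new_e - new_i, i + new_i - new_q, q + new_q
        curve.append(q)
    return curve


def sample(rng):
    """Draw parameters uniformly within BOUNDS, rejecting draws that violate TE > TI."""
    while True:
        p = {name: rng.uniform(lo, hi) for name, (lo, hi) in BOUNDS.items()}
        if p["te"] > p["ti"]:
            return p


def loss(p, observed, population, i0):
    """Mean squared error between simulated and observed cumulative case counts."""
    predicted = simulate(p, len(observed), population, i0)
    return sum((a - b) ** 2 for a, b in zip(predicted, observed)) / len(observed)


def fit(observed, population, i0, n_trials=300, seed=0):
    """Random search: keep the feasible parameter set with the lowest loss."""
    rng = random.Random(seed)
    best, best_loss = None, float("inf")
    for _ in range(n_trials):
        p = sample(rng)
        current = loss(p, observed, population, i0)
        if current < best_loss:
            best, best_loss = p, current
    return best, best_loss


# Synthetic check: generate a curve from known (hypothetical) parameters, then refit it.
truth = {"r0": 5.0, "r_inf": 0.3, "te": 3.0, "ti": 2.0, "theta": 0.2}
observed = simulate(truth, n_days=14, population=1_000_000, i0=10)
best, best_loss = fit(observed, population=1_000_000, i0=10)
print(best, best_loss)
```

Rejection sampling keeps every candidate inside the pre-set ranges and enforces TE > TI by construction, so the search cannot return an epidemiologically implausible fit even when such a fit would match the data more closely.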
Forecasting long-term trends of confirmed case numbers. Because China's National Health Commission (NHC) publicly reported case numbers starting from 20 January, we set this date as the starting point of our training data. By 10 March, the daily increase in case numbers had declined to single digits across most areas in China, so we set this date as the ending point of our model.
Table 1. The constrained range for parameters with epidemic rationality. R 0 denotes the basic reproduction number; TE denotes the incubation period; TI denotes the infectious period; R ∞ denotes the final value of R t ; θ denotes the decrease ratio of R t .
Parameters | Reasonable ranges
We updated our models dynamically from the 7th day following the starting point (i.e., 27 January). In this article, we present the predictions of our models at five roughly weekly time points, namely 27 January, 4 February, 11 February, 18 February, and 25 February.
For example, the model for the first week (as of 27 January) used the data from 20 to 26 January for model construction and forecasted the daily increased and cumulative case numbers from 27 January to 10 March.
As of 27 April, the date on which the manuscript was finished, we used the same model to make a one-month prediction for the five countries with the worst outbreaks, namely the United States, Italy, Spain, Germany, and France, to test the external validity of our models.

Results
The simulation and prediction of our D-SEIQ models are illustrated for three different regions: China excluding Hubei, Hubei excluding Wuhan, and Wuhan.
China excluding Hubei. The D-SEIQ model with the prediction date of 26 January showed that the cumulative number would reach 65,282 (red dotted line in Fig. 2) on 10 March. In retrospect, our model greatly overestimated the development of the epidemic, possibly because at the early stage of the epidemic, when interventions had not yet taken effect, the number of cases increased sharply and did not reveal the potential decline of R t . The overestimation also illustrated the effectiveness of the subsequent containment measures.
The D-SEIQ model trained on 27 January showed that the cumulative number would reach 12,506 on 10 March, and that the daily number would peak on 1 February. In retrospect, the prediction was quite close to the real scenario. The real cumulative number on 10 March was 13,005, which was only 3.8% different from the predicted value. The outbreak peak predicted by our model also matched the actual peak (around 1 February to 3 February). Therefore, in the region of China excluding Hubei, the D-SEIQ model was shown to successfully estimate the trend for up to 40 days, with one week of data after the first public report.
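The 3.8% figure above follows from a simple relative-error calculation, sketched here with the two numbers quoted in the text. The function name is illustrative, and the choice of the actual value as the denominator is an assumption about how the error was measured.

```python
def relative_error(predicted, actual):
    """Absolute difference between prediction and truth, as a fraction of the truth."""
    return abs(predicted - actual) / actual


# Cumulative cases on 10 March for China excluding Hubei, as reported in the text.
error = relative_error(predicted=12_506, actual=13_005)
print(f"{error:.1%}")  # 3.8%
```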
At the late stage of epidemic spread, the model was capable of both fitting previous data and predicting the epidemic development. For example, on 11 February, we predicted that the cumulative number at the endpoint would be 13,006, while the true value was 13,005.
The parameters learned at the late stage could accurately reflect the intrinsic characteristics of COVID-19. Thus, the parameters on 25 February were used as the estimates of the true values. In the region of China excluding Hubei, the basic reproduction number R 0 was estimated to be 6.3, the decrease ratio θ to be 0.2, the incubation period TE to be 3 days, and the infectious period TI to be 2 days. The final reproduction number R ∞ ultimately dropped to around 0.3.

Hubei excluding Wuhan. The number of confirmed cases grew rapidly in the region of Hubei excluding Wuhan in the first week, which biased our model of 27 January to enormously overestimate the peak value. Our model predicted that the cumulative number would reach 65,763 by 10 March. On the other hand, the overestimation also indicates that, without containment, the epidemic would show explosive growth, as the influence of containment measures remained unseen at the early stage of the epidemic.
After the clinically confirmed cases between 12 and 14 February were adjusted by redistribution, we re-trained our model with adjusted values (Fig. 3). The model on 14 February after adjustment showed that the cumulative number would reach 18,844 with an error of 6% compared with the real number.
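The redistribution step, described in Methods as spreading the clinically confirmed cases over the preceding 7 days in proportion to suspected cases, can be sketched as below. The function name and all counts are hypothetical placeholders, not reported data.

```python
def redistribute(clinical_total, suspected_by_day):
    """Spread clinical_total across days in proportion to each day's suspected cases."""
    total_suspected = sum(suspected_by_day)
    if total_suspected <= 0:
        raise ValueError("suspected_by_day must contain at least one positive count")
    return [clinical_total * s / total_suspected for s in suspected_by_day]


# Hypothetical suspected-case counts for the 7 days before the reporting change.
suspected = [300, 250, 200, 150, 50, 30, 20]
# Hypothetical number of clinically confirmed cases to reassign.
clinical = 1000

extra_per_day = redistribute(clinical, suspected)
print(extra_per_day)
```

The reassigned counts are added to the originally reported daily numbers, so the cumulative total is unchanged while the one-off spike in the series is smoothed out.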
Similarly, based on the model of the late stage of the epidemic (25 February), the transmission parameters of the virus were estimated as follows: the basic reproduction number R 0 was 6.3; the decrease ratio θ was 0.15; the final reproduction number R ∞ was 0.2; the incubation period TE was 3 days; and the infectious period TI was 2 days.
Wuhan. In the early days of the epidemic outbreak in Wuhan, due to insufficient detection capacity and limited medical resources, the reported numbers were far below the true incidence. During the first week, the daily increase in case numbers even showed a declining trend, and the D-SEIQ model of 27 January consequently underestimated the epidemic development. There was a large increase in clinically confirmed cases between 12 and 14 February. We adjusted the numbers on 14 February, and the prediction showed that the cumulative number would reach 54,492 at the endpoint, with an error of 9% from the actual number of 49,980. On 18 February, the D-SEIQ model showed a convincing simulation of the overall trend, and the overall predicted curve fitted the adjusted values quite well (grey dashed line in Fig. 4).
The estimated parameters of the COVID-19 transmission were as follows: the basic reproduction number R 0 was estimated to be 4.63; the decrease ratio θ was 0.1; the final reproduction number R ∞ was 0.15; the incubation period TE was 3 days; and the infectious period TI was 2.5 days.
Analysis of reproduction number R t . We further analyzed the reproduction number R t estimated by our D-SEIQ models, using the R t learned at the late stage of the simulation. We plotted the R t curve from 20 January to 10 March in Fig. 5 to compare the reproduction numbers in the three regions. At the initial time, R 0 was 6.3 in both China excluding Hubei and Hubei excluding Wuhan, which was larger than that in Wuhan ( R 0 = 4.63 ). However, the decrease ratio θ for R t was largest in China excluding Hubei (0.20), followed by Hubei excluding Wuhan (0.15) and then Wuhan (0.10). Therefore, R t in China excluding Hubei dropped below 1 the earliest, meaning that COVID-19 was under control in other provinces sooner than in Hubei province. The final R ∞ of all three regions approached zero, demonstrating a great achievement in epidemic containment and interventions.
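The ordering described above can be checked numerically with the late-stage parameters reported for each region. The sketch below assumes an exponential form R_t = R_inf + (R_0 - R_inf) * exp(-theta * t), which matches the verbal description of R t but is not confirmed as the authors' exact formula; the function name is illustrative.

```python
import math


def days_until_below_one(r0, r_inf, theta):
    """Solve R_inf + (R_0 - R_inf) * exp(-theta * t) = 1 for t (assumed exponential form)."""
    if not (r_inf < 1.0 < r0):
        raise ValueError("requires r_inf < 1 < r0 for a crossing to exist")
    return math.log((r0 - r_inf) / (1.0 - r_inf)) / theta


# (R_0, R_inf, theta) as reported in the text for the late-stage models.
REGIONS = {
    "China excluding Hubei": (6.3, 0.3, 0.20),
    "Hubei excluding Wuhan": (6.3, 0.2, 0.15),
    "Wuhan": (4.63, 0.15, 0.10),
}

crossing = {name: days_until_below_one(*params) for name, params in REGIONS.items()}
for name, days in sorted(crossing.items(), key=lambda item: item[1]):
    print(f"{name}: R_t falls below 1 about {days:.1f} days after the starting point")
```

Under this assumed form, the crossing time depends mostly on θ, so Wuhan reaches R t < 1 last despite its lower R 0 , consistent with the comparison in the text.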

Discussion
We proposed a new model named D-SEIQ, which applies appropriate modifications to the SEIR model and combines them with machine learning based parameter optimization. We evaluated our model on officially reported data from three different regions in China, and the results demonstrated the effectiveness of our model in simulating and predicting the trend of the COVID-19 outbreak and its regional spread. In particular, in the China-excluding-Hubei area, within 7 days after the first public report, our model accurately predicted the long-term trend up to 40 days ahead and the date of the outbreak peak.
Traditional epidemic transmission models like SEIR need an accurate estimation of model parameters such as the basic reproduction number, incubation period, and infectious period through epidemiological investigation. However, in terms of a new epidemic, due to the rapid outbreak, insufficient sample size, and the deviation of
On the other hand, machine learning methods, such as logistic regression models, were subject to overfitting problems 17 , which means they could fit the training data well but fail to predict unseen data. The likely reasons include the limited epidemic rationality of the models and the insufficiency of data and salient features. Deep neural networks like long short-term memory (LSTM) were shown to be incapable of predicting the long-term trends and the outbreak peak 18 .
Our model takes advantage of both epidemic and machine learning models, combining the explainability of the epidemic model with the data-fitting ability of machine learning. In the process of machine learning, we set the parameters within a reasonable range and exploit mutual constraints between the parameters.
Meanwhile, we introduced a dynamic R t , which can reflect the time-dependent influence of intervention measures on the reproduction number. Overall, our approach could more accurately simulate the real-world scenario of the COVID-19 spread, thus making better predictions.
Furthermore, the parameters learned by our D-SEIQ model could provide some insights into the assessment of the prevention and containment measures for COVID-19. Firstly, the basic reproduction number was relatively large (4-6), larger than that of SARS-CoV, whose R 0 ranged from 1.6 to 3.7 15,19,20 . Without strong and effective intervention measures, including city lockdown, travel containment, mask-wearing, quarantine, and screening, the epidemic could lead to catastrophic consequences for society. The final reproduction number in different areas of China gradually dropped to around 0.2, illustrating the considerable effect and importance of interventions by governments and the public. Secondly, R t decreased more slowly in Wuhan, which indicates a shortage of medical resources and delayed patient admission there. This conclusion is also supported by the estimated infectious period ( TI ), which was larger in Wuhan than in other regions of China. Moreover, our model obtained the same incubation period ( TE ) of 3 days across the three regions, which was consistent with the Chinese CDC official report 11 .
The D-SEIQ model is applicable only when the following conditions are satisfied: adequate medical capacities, consistency of containment measures and ascertainment criteria, and timely case detection and reporting. This explains why our model performed better in the China-excluding-Hubei region. Therefore, caution needs to be taken when applying our model to other countries. Detection and reporting were not timely in some countries, such as the United States, in the early phase, and subsequent containment measures were introduced and lifted at different time points, which might influence the prediction results. Another limitation is that our model can only predict the trend of a single epidemic wave. Recently, China as well as some other countries have seen a second wave of the epidemic due to imported cases or relaxed containment. Mathematical models are currently not available to predict the possibility of a second wave.

Conclusion
We have proposed a new approach for forecasting the COVID-19 long-term trend. The model has accurately predicted the long-term trend of the epidemic in China, and the parameters learned from the model suggested the effectiveness of the intervention measures that have been conducted in China, which can help us analyze and fight against the new epidemic.

Data availability
The data sets used in this study are freely available to the public on the webpage: https://ourworldindata.org/coronavirus. The codes and processed data for different regions of China are available on GitHub: https://github.com/jichaosun001/covid_forecast.git.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.