The effect of population size for pathogen transmission on prediction of COVID-19 spread

Extreme public health interventions play a critical role in mitigating the local and global prevalence and pandemic potential. Here, we use population size for pathogen transmission to measure the intensity of public health interventions, which is a key characteristic variable for nowcasting and forecasting of COVID-19. By formulating a hidden Markov dynamic system and using nonlinear filtering theory, we have developed a stochastic epidemic dynamic model under public health interventions. The model parameters and states are estimated in time from internationally available public data by combining an unscented filter and an interacting multiple model filter. Moreover, we consider the computability of the population size and provide its selection criterion. With applications to COVID-19, we estimate the mean of the effective reproductive number of China and the rest of the globe except China (GEC) to be 2.4626 (95% CI: 2.4142–2.5111) and 3.0979 (95% CI: 3.0968–3.0990), respectively. The prediction results show the effectiveness of the stochastic epidemic dynamic model with nonlinear filtering. The hidden Markov dynamic system with nonlinear filtering can be used to make analysis, nowcasting and forecasting for other contagious diseases in the future since it helps to understand the mechanism of disease transmission and to estimate the population size for pathogen transmission and the number of hidden infections, which is a valid tool for decision-making by policy makers for epidemic control.

population size of pathogen transmission may decrease to a new population size of a small control region. Therefore, the population size should be carefully considered in an epidemic model and it is not appropriate to keep in a fixed population size of the city or country for making accurate predictions. Here, we assume that there exists a closed subsystem in a city or country when the uninfected people are in home quarantine or some cities are locked down. The population size of the subsystem is considered as the population size for pathogen transmission, which is usually much smaller than the total population of a city or county when deploying intervention strategies. The total population is divided into the population for pathogen transmission and the isolated uninfected population (see Fig. 1).
In this paper, we use the population size N c for pathogen transmission to measure the intensity of public health interventions, discuss the identifiability of the population size N c and propose an intuitive and efficient criterion to select the unknown population size. Furthermore, we concentrate on three specific aims: (i) Modeling: develop mathematical differential equation dynamic models that account for the random variations in pathogen, society, and public health intervention variables etc.; (ii) Nowcasting: estimate the states and parameters as a function of time through nonlinear filtering with publicly available international data of COVID-19. Moreover, we analyze the key factor, the number of latent infections, which determines the size of the newly confirmed infections in the next seven to fourteen days in terms of the incubation period 4,23 ; (iii) Forecasting: predict local and global spread trends (e.g. scales, peaks and confidence intervals) of the infections under different control intensities, i.e., different population sizes for pathogen transmission.
Our contributions mainly lie in three aspects. First of all, we develop a stochastic epidemic dynamic model under public health interventions. By formulating a hidden Markov dynamic system, the model parameters and states are estimated in time from publicly available international data by combining an unscented filter and an interacting multiple model filter. Secondly, we consider the computability of the population size for pathogen transmission and provide a selection criterion. Here, we use it to measure the intensity of public health interventions, which is a key characteristic variable for estimation and prediction of the epidemic. Finally, we provide comprehensive nowcasting and forecasting results of some countries, which show the effectiveness of the stochastic epidemic dynamic model with nonlinear filtering. The mean of the effective reproductive number of China was estimated to be 2.4626 (95% CI: 2.4142-2.5111), which is consistent with what was reported in the literature.

Methods
Mathematical epidemic modeling. It is well recognized that random effects are often unavoidable in reality and a population is often affected by random disturbances. Due to factors such as climate change, population density, public health intervention, and medical care capability, the parameters of epidemiological models may also experience random changes during transmission. Some researchers have devoted themselves to studying stochastic epidemic models through stochastic differential equation models and methods 16 . In this section, we introduce a stochastic susceptible-exposed-infectious-recovered (SEIR) model for the free propagation stage of the epidemic, where random variables and processes are included in the epidemiological process model to characterize the uncertainty associated with the propagation of the epidemic. Moreover, we develop a stochastic SEIR model under public health interventions.
The stochastic SEIR model. Based on the classical SEIR model proposed by Aron and Schwartz 15 that is deterministic in nature. In order to account for system uncertainties, we modify the SEIR model by introducing random noises and random parameters in the SEIR model as follows:  www.nature.com/scientificreports/ where S(t), E(t), I(t), R(t) and D(t) are the number of the susceptible, exposed, infectious, recovered and diseasecaused death cases at time t, respectively. The parameter vector p(t) = (α(t), β(t), γ 1 (t), γ 2 (t)) T includes the mean contact rate, transfer rate from exposed to infective, recovery rate and disease-caused mortality. Since these model parameters may experience random changes in the spreading process due to different climate changes, population densities, public health interventions, and medical cares in different regions, we model the uncertainty associated with the parameters as Brownian motions, which could be explained by a random walk with a small noise due to the fact that the parameters represent average random characteristics of a large number of members of the total population 24 , and w p (t) = (w α (t), w β (t), w γ 1 (t), w γ 2 (t)) T is a white noise process with spectral density Q p (t) . Moreover, the basic reproductive number R 0 = α/(γ 1 + γ 2 ) is defined as the average number of secondary infections caused by an infected individual in a completely susceptible population, and the effective reproduction number R e is defined as the average number of secondary infections caused by an infected individual in the population after there is some immunity or interventions have been put in place 25,26 . Besides, the birth rate, natural death rate and other uncertainties are considered as a zero mean white noise process w s (t) with spectral density Q w s (t) , which is defined as follows: Since the epidemic happens in a closed community, we further assumed that the total population size remains constant in a short epidemic period, which is commonly used in stochastic epidemic models 16,27,28 , and The stochastic model (1)-(6) is suitable for the free propagation of epidemic without control measures. It also characterizes the uncertainties of parameters and states in the propagation process. where S c and N c are the susceptible and the population size under public health interventions, respectively. E c (t) and I c (t) are redefined as the number of latent infections, and confirmed infections at time t, respectively. The reason is that the confirmed infections can be observed each day and it is convenient to estimate the model parameters and forecast the number of confirmed and latent infections by nonlinear filtering methods. Moreover, the infectious I(t) in (1) are replaced by E c (t) in (8), which means that the susceptible are not infected by isolated infections. The parameters w s c (t) = (w S c (t), w E c (t), w I c (t), w R c (t), w D c (t)) T , w p c (t) and p c (t) are similarly defined as that in the stochastic model (1)- (6).
It is worth noting that under public health interventions, we assume that there exists a closed subsystem in a city or country when the uninfected people are in home quarantine or some cities are locked down. The www.nature.com/scientificreports/ population size of the subsystem is considered as the population size for pathogen transmission N c , which is usually much smaller than the urban or national population when deploying intervention strategies. The total population is divided into the population for pathogen transmission and the isolated uninfected population (see Fig. 1). In fact, the population size N c plays a key role in nowcasting and forecasting epidemic trends and measures of control intensity. Herein, we assumed that N c satisfies (7) as well. This paper aims to analyze the data of COVID-19 epidemic when the population size N c is unknown. For simplicity, the subscript c of N c , S c , E c , I c , R c , D c , p c , w p c and w s c is omitted when there is no confusion. Specifically, the population size N refers to the population size for pathogen transmission in the rest of this paper.
Nowcasting and forecasting methods. In this section, we provide the observation model for the stochastic SEIR dynamic system (8)-(13) as well as methods for nowcasting, forecasting and the selection criterion of the population size for pathogen transmission N.
Continuous-discrete dynamic system. It is worth to note that the stochastic SEIR model (8)-(13) satisfies the linear equality constraint (7) as well, which can be used to reduce the system model parameterization. By sub- (8)-(13), we have the following equivalent reduction model: where S(t) is determined from the equality constraint (7). Moreover, we denote the state vector and process noise as where w(t) is a white noise process with spectral density Q(t) . On the basis of the state x(t) defined in the model (14)- (18) and the prevalence data that can be obtained, we model the observation equation as follows: where z k is the measurement of the system state x(t k ) , which includes the number of the cumulative confirmed infections M c k , cumulative recovered cases R c k and cumulative disease-caused death cases D c k at time t k . The statistical error associated with the measurements is modeled as a Gaussian white measurement noise v k with a small covariance R k , and H is a linear measurement matrix defined as follows: As a consequence, we have the following unconstrained continuous-discrete dynamic system: where f (·) is a simplified expression of the drift function corresponding to (14)- (18). The process noise w(t) , measurement noise v k , and initial state are assumed to be mutually independent.
Nowcasting and forecasting. For the nonlinear system (20)- (21), the optimal Bayesian state estimation is usually intractable. There are several strategies to approximate the optimal estimation, such as the unscented Kalman filter (UKF) 29,30 , particle filter and cubature Kalman filter 31,32 etc. For the system under consideration, we employ the continuous-discrete UKF 29,30 to estimate the posterior expectation of the random state and parameters recursively by the daily data z k . Specific steps of the continuous-discrete UKF are shown in supplementary information, where the continuous state model is implemented by the fourth order Runge-Kutta method 33 . www.nature.com/scientificreports/ Remark 1 Since there are sudden changes in the states and parameters, e.g., the scheme to designate the confirmed infections in China was revised on February 12, 2020 so that the number of cumulative confirmed infections had a drastic increase of 15152 persons 34 . The transfer rate β(t) should be modeled differently before and after February 12, 2020. Moreover, the classical interacting multiple model (IMM) filter in maneuvering target tracking 35 can be used to enhance the stability of the algorithm and improve the performance of the nowcasting and forecasting methods.
Furthermore, the prediction probability density of the state x(t) is derived by the model prediction step of UKF through the model (14)- (18) with posterior parameters. In this paper, we adopt the expectation of the predicted state as the prediction result. Let m i+j|i = (I i+j|i , R i+j|i , D i+j|i ) T denote the vector consisting of the expectation of the confirmed infections, recovered cases and disease-caused death cases predicted from time t i to t i+j , where M c i+j|i = I i+j|i + R i+j|i + D i+j|i represents the expectation of the cumulative confirmed infections predicted from time t i to t i+j , and σ i+j|i is the standard deviation of the error of M c i+j|i . Then, the approximate 95% confidence intervals (CIs) are calculated as follows: Based on the proposed stochastic SEIR model under public health interventions (14)-(18), the nowcasting and forecasting methods described above, the local and global spread trends of the epidemic are analyzed under public health interventions and different control intensities, respectively.
Selection criterion of the population size N. When the epidemic spreads freely in a city, the urban population can be chosen as the population size N. However, under public health interventions, the value of N should be selected carefully, and we use N to measure the intensity of interventions. In this section, we investigate the identifiability of population size and provide the selection criterion. (1)-(5) Based on the recursive method, Propositions 1-2 show that N cannot be estimated since the state estimates are the same almost everywhere for different N when they are large. Thus, it is unnecessary to consider a very large N, especially under strict control measures or early on in an outbreak. On the other hand, it is not difficult to show that N can be estimated when it is not too large since E(E+I+R+D) N is not infinitesimal (see the analysis in supplementary information). Moreover, we provide a selection criterion for the value of N by solving the following optimization problem, i.e., we select the population size by minimizing the average relative prediction error of the cumulative confirmed infections in the next J time steps:

Lemma 1 Considering the deterministic SEIR model (i.e., neglecting the random noises and randomness of the deterministic parameters in
where J is a constant, e.g., 7 days. N is the variable over which optimization is to be carried out, N u may be half of the regional population or other values based on prior data, which depends on the intensity of the control measures etc., and the e j (N) is calculated as follows: In order to make predictions of epidemic spread, we need to estimate the parameters α(t), β(t), γ 1 (t), γ 2 (t) accurately. By implementing continuous-discrete UKF, the posterior expectation of the parameters in the model (14)-(18) are derived. α(t) is the contact rate representing the average rate of transmission for the susceptible to latent infections, and we choose the initial value of α(t) between 0.1-0.2 6 . β(t) is the average transfer rate from exposed to infective and 1 β(t) represents the incubation period. Because the incubation period of the COVID-19 has been reported to be between 0 to 24 days 4 , we choose 7-14 days, i.e., the initial value of β(t) is between 1 7 and 1 14 . And γ 1 (t) is the average rate of recovery (i.e., γ 1 (t) = 1 D , where D is the average duration of the infection), we choose the average duration of the infection between 10 days to 20 days, i.e., the initial value of γ 1 is between 0.05 and 0.1. Here we use the disease-caused mortality 0.1%-3% 38 . Additionally, the initial values of I, R, D are set as the number of confirmed cases, recovery cases and disease-caused death cases reported on the day of the first confirmed infections was found, respectively. Figure 2a shows the prediction trends of the confirmed infections I(t) based on the data through January 20 to February 24 of Hubei Province, China under different intensities of public health interventions, i.e, maintaining the control intensity and the population size N = 1, 000, 000 , relaxing the control intensity and the population sizes N = 2, 000, 000 , and N = 3, 000, 000 , respectively. The results indicate that relaxing the Hubei quarantine would lead to a second epidemic peak, and the larger the population size N, the higher the peak, which will lead to the selection of different control measures. It also illustrates the effect of the population size N, due to the prediction trends of confirmed infections in hospitals are completely different when using different population sizes in the model. Thus, the selection of the population size for pathogen transmission is a key factor for the forecasting of the epidemic and N is identifiable for this case. Figure 2b shows the prediction for Beijing City. It indicates that when N is greater than 100,000, the predicted values are the same, i.e, N is unidentifiable. This is consistent with Propositions 1-2. In fact, early on in the epidemic, Beijing took very strict control measures so that the effective reproductive number R e is very small, the small number of infections cannot spread to a larger population.

The selection of the population size N and estimation of the effective reproductive number.
Therefore, for global prediction, it is required to select the population size carefully. Through the selection criterion (22) and setting J = 7 , we select the population size for pathogen transmission of China and the globe except China (GEC) based on the prevalence data of COVID-19 epidemic 36,37 . The selected population size N of China is about 1,220,000 derived by using data from January 21, 2020 to March 18, 2020 when there were no local newly confirmed cases 36 . And the selected population size N of GEC is about 224,000,000 derived by using data from January 21, 2020 to December 8, 2020. It indicates that both the selected population sizes of China and GEC are far less than the total population. Figure 2. The forecasting of the epidemic trend of I(t) with different population sizes. www.nature.com/scientificreports/ Figure 3a shows the estimation of the effective reproductive number R e of China from January 21, 2020 to March 18, 2020 using the selected population size N = 1, 220, 000 . It shows that the effective reproductive number R e increases monotonically early on in the outbreak, then monotonically decreases with a sharp infection point. This is consistent with the policy decision in which Chinese government implemented unprecedented public health interventions to prevent the spread of the epidemic and the incubation period is about 7-14 days 4 . The reason for monotonic increase in the early stages of the epidemic may be that, in a limited local city, the density of infections increases as a result of the infections increasing exponentially. It results in a quick growth of the contact rate α(t) . The recovery rate is very low due to the recovery cycle. The reason for monotonic decrease is that the public health interventions (e.g. scales of quarantine, strict controls on travel, hand hygiene, and use of face masks etc. 39 ) significantly decrease the contact rate α(t) and the recovery rate gradually increases, which results in a monotonic decrease of the effective reproductive number R e . In Fig. 3b, it can be seen that the curve of the effective reproductive number R e of GEC reached its first peak later than that of China, due to the fact that the coronavirus has been detected in many countries since February, 2020 and the response measures of these countries were later than China. Due to the implementation of strict public health interventions in European countries and USA in March to contain the pandemic, the spread of the virus has been curbed to a certain extent so that the effective reproductive number R e of GEC declined. However, the effective reproductive number R e had a second peak. The reasons may be that the isolation control measures have been relaxed in many countries, and the epidemic in some other countries such as Brazil, India, Russia became serious from May. The current effective reproductive number R e of GEC is still high and fluctuates around 3.
We also present the estimated mean values and 95% CI of the key epidemiological parameters of GEC and China in Table 1. The mean of the effective reproductive number of COVID-19 in China is 2.4626 (95% CI: 2.4142-2.5111). This finding aligns with other recent estimates of the effective reproductive number for this time period 5,[40][41][42][43][44] .
Moreover, we select the population size of several countries such as India, Iran, Russia and USA based on the prevalence data of COVID-19 epidemic 37 from the date of the first confirmed infection to December 8, 2020 under the same selection criterion. Table 2 gives the selected values of N and national populations 45 of several major epidemic affected countries. Here, we restate that N is the population size for pathogen transmission. The total national population is divided into the population for pathogen transmission and the isolated uninfected population. It shows that the selected N under public health interventions are far less than the national population.    www.nature.com/scientificreports/ To validate the selected population sizes N, Fig. 4 shows the reported confirmed infections from the date of the first confirmed case to December 8, 2020, daily prediction of one time step, the corresponding 95% CIs and the relative error between daily prediction of one time step and reported confirmed infections for several major epidemic affected countries, respectively. Due to the large amount of data, in order to show the results more clearly, we only draw the data with an interval of seven days in Fig. 4 and the following pictures. It indicates that the 95% CIs can cover the true confirmed infections, even in the outbreak phase (see Fig. 4a-d). Figure 4e-h shows that all of the relative error curves converge after several days and have a small error, which show that our model, selected population size N and the parameters estimated through nonlinear filtering can accurately predict the real data in a short period, qualifying it for making nowcasting and forecasting. Furthermore, more prediction results of 1-7 days are updated and available on https://github.com/SCU-Bigdata/2019-nCoV-Forecast in time since February 7, 2020.
The nowcasting and forecasting results. Under the selected population size N, the number of latent infections E(t) can be estimated in time again by UKF. Figure 5 shows the reported cumulative confirmed infections, the estimate of the actual infections and corresponding 95% CIs, where the difference between the two lines represents E(t). It indicates that the situation in USA is the most serious one and it has 2,295,026 latent infections on December 8, 2020. The latent infections account for a large fraction 12.99% of the total number of infections and based on the incubation period of 7-14 days 4 , we infer that the pandemic will continue for a long time in USA, unless more stringent public health interventions are taken to reduce the spread of COVID-19. Similar results are anticipated for India, Russia with 327,922 and 273,583 latent infections, respectively. Compared with the other three countries, Iran has relatively few hidden infections, about 121,817 cases. Figure 6 shows that the reported number of confirmed infections in hospitals of some major countries, the fitted number using the selected population size N through nonlinear filtering, and the forecast results including the number, corresponding 95% CIs, peak and arrival time of the infections in hospitals under different control measures by changing the population size for pathogen transmission and for fixed contact rate, e.g., 1) maintaining the selected population size N for pathogen transmission, 2) relaxing the control intensity and the population size N has a 50% increase, 3) strengthening the control intensity and the population size N has a 50% reduction. As shown in Fig. 6, the fitted values are accurate and the 95% CIs can cover reported number of confirmed infections in hospitals, even if the reported number of I(t) has many fluctuations. We find that except India and Iran, by decreasing of the population size, i.e., strict control measures, the peaks would have a decrease, especially in USA and Russia, and the time of peaks would arrive earlier. It is worth noting that this is different from the results of Wu et al. 5 . The reason is that Wu et al. 5 assume that the population size is fixed and strengthen the control intensity by reducing the contact rate α(t) . Thus, it is not contradictory. In this experiment, there Number of infections I(t) in hospitals 10 5 Reported number of I(t) Fitted I(t) using N Predicted I(t) using N Predicted I(t) using 1.5N Predicted I(t) using 0.5N  www.nature.com/scientificreports/ are no effects on the peak of different population sizes in India and Iran, but the decline rates of the infections in hospitals are different. If the population size continues to grow, there could be a second peak. The prediction results in Fig. 6 indicate that the use of different population sizes N to make prediction will have different trends and peaks. Therefore, it is necessary to take public health intervention measures to avoid the spread of COVID-19 to a larger subsystem, in other words, to avoid an increase in the population size N for pathogen transmission.

Discussion
Nowcasting and forecasting of epidemics play a vital role in predicting geographic disease spread as well as case counts so as to better inform public health interventions when outbreaks occur 46 . In this study, we developed a stochastic SEIR model under public health interventions, the nonlinear filtering and prediction methods and selection criterion for the population size for pathogen transmission, which could be valuable to national and international agencies for public health situation perception and design of intervention strategies. Our technical contributions are made with a backdrop of the COVID-19 pandemic that has engulfed the world. We infer that USA has the most serious situation as it has 2,295,026 latent infections as of December 8, 2020. India and Russia are with 327,922 and 273,583 latent infections, respectively. Based on the present trajectory, the latent infections may cause further outbreaks in the case of weak public health interventions. Therefore, it is essential to strengthen control intensity such as large-scale quarantine, strict controls on travel and extensive monitoring of suspected cases to reduce the contact rate and the population size of the susceptible. Here we regard the population size N for pathogen transmission as one of the measures of the control intensity, which is conducive to assess the effectiveness of public health interventions. Moreover, it is valuable to obtain the more accurate number of the latent infections, which is helpful to design control strategies (e.g., scales of quarantine, strict controls on travel, contact isolation, hand hygiene, and use of face masks) in the future. However, on the population size N, we selected it by minimizing the prediction error based only on the current data, which may be prone to overfitting of the data. One possible improvement approach is to include the prediction covariance in the selection criterion. Besides, the stochastic model (14)- (18) ignored the impact of the imported cases, which are considered as random noises. If the airline transportation data can be obtained, by combining the networked dynamic metapopulation model 14,19,47 and selection of the local population sizes in time, more accurate prediction methods and results for the local and global trend may be obtained. We believe that it is extremely important to predict further spread of COVID-19 globally and evaluate the effectiveness of different mitigation strategies and to try to mitigate the impact of this pandemic on the entire global population.