State-specific projection of COVID-19 infection in the United States and evaluation of three major control measures

Most models of the COVID-19 pandemic in the United States do not consider geographic variation and spatial interaction. In this research, we developed a travel-network-based susceptible-exposed-infectious-removed (SEIR) mathematical compartmental model system that characterizes infections by state and incorporates inflows and outflows of interstate travelers. Modeling reveals that curbing interstate travel when the disease is already widespread will make little difference. Meanwhile, increased testing capacity (facilitating early identification of infected people and quick isolation) and strict social-distancing and self-quarantine rules are most effective in abating the outbreak. The modeling has also produced state-specific information. For example, for New York and Michigan, isolation of persons exposed to the virus needs to be imposed within 2 days to prevent a broad outbreak, whereas for other states this period can be 3.6 days. This model could be used to determine resources needed before safely lifting state policies on social distancing.


Mathematical model and configuration
The mathematical model used in this article is derived from the susceptible-exposed-infectiousremoved (SEIR) [7,6,2,8] augmented to incorporate human movement, and separate the infected compartment into reported cases and unreported cases. The model is illustrated in the flowchat below: In the model, for each state, the population is divided into six compartments: S i (susceptible), E i (latent), I i (reported infections), U i (unreported infections) and R i (removed: either recovered or deceased individuals). The subindex i stands for the index for the state. Among the six compartments, S, E and U are "free" people and can move from state to state, while I, and R are monitored and isolated.
The model is composed of 51 coupled ordinary differential equation ( Figure S1: Illustration of the travel flow-network augmented susceptible-exposed-infectiousremoved model.

S-1
state i, the model writes as: (1) The ODE system is equipped with the following initial data (t = 0 standing for March 1, 2020): In the equation, the unit for t is one day. N i (t) is the total population of state i at time t, and P i = S i + E i + U i is the free population. n ij is the number of inflow from state j to state i. b i and r i are the transmission rate and reporting rate of state i. c I (c U , resp.) is the proportion of positive cases that show critical condition for I (unreported cases U , resp.). D e is the latent period. D c and D l are the infectious periods of critical cases and mild cases. α t is a parameter to tune the traffic flow.
We emphasize two main differences in modeling compared with literature. In [10], the authors study the inter-city traffic and its impact on the spreading of COVID-19 in China. The situation in China and that in the US are very different. In China, the epicenter is clear: the city of Wuhan, Hubei province, and the outbreak starts mid-January, 2020. The COVID-19 outbreak in the US, however, is multi-sourced. The consequence is that in the model in [10], the initial condition for cities excepts Wuhan is clear: the latent, the reported and the unreported cases are all zero. In this model, however, the initial conditions E i0 are unclear for all states; Another big difference is, according to clinical findings, the latent cases also have the potential of transmitting the virus, and thus we add the interaction of E i with S i into the increment of E i [9, 10, 1].
The unknown parameters and state variables in the equation set are D l : the average duration of infection for mild cases. We assume D l = 6 days.
α t : the ratio of interstate travel volume compared to that of 2019 during the same period. The travel flow information n ij was extracted from the SafeGraph mobility data, and we set α t = 0.5 to represent the travel reduction situation observed in the year of 2020.
c I : proportion of critical cases among all reported cases. We assume c I = 0.1.
c U : proportion of critical cases among all unreported cases. We assume c U = 0.2.
There is an essential assumption made in the model: the homogeneity in the population. It means that the traffic flow is a good representation of the total population without considering their demographic and socioeconomic characteristics. The susceptible, exposed, and unreported move in and out of states at the same rate. This explains the S i The effective reproductive number R e could be computed as R e depends on time due to the time dependence of E and U .

Population and Human Mobility Data
We downloaded the total population data by state in 2019 from the US Census Bureau. In addition, we collected over 3.6 million points of interest (POIs) with travel patterns in the United States from the SafeGraph business venue database 2 . The SafeGraph's data sampling correlated highly with the United States Census populations 3 . These mobile location data consist of "pings" identifying the coordinates of a smartphone at a moment in time. To enhance privacy, SafeGraph excludes census block group (CBG) information if fewer than five devices visited a place in a month from a given CBG. For each POI, the records of aggregated visitor patterns illustrate the number of unique visitors and the number of total visits to each venue during the specified time window (i.e., March 1st to March 31st 2019 in our dataset), which could reflect the attractiveness of each venue and the national spatial interaction patterns during the last March travel. According to the ODE modeling needs, we further aggregated the travel patterns to the state-to-state spatial scale as shown in Figure S2. In the model, we set the parameter α t = 0.5 to represent the travel reduction situation observed in the year of 2020 [14].

Parameter fitting methodology
Each state has its own S −E −I −U −R data. We assume all the states have their own transmission rate b, and the reporting rate r. In this section we discuss the method we apply to recover these parameters.
To identify the parameters is a typical data assimilation problem: one has the knowledge from an underlying ODE model, and the access to evolution data. The goal is to build a probability density function that reveals the possible value and the probability of the state variable. Two main ingredients in DA are ODE simulation, and the Bayesian analysis. The ODE system serves as the prior information, and the Bayesian formula blends such dynamical system with the newly fetched data to generate a posterior distribution of the state variables and the parameters. The general flow chart of data assimilation is found in Figure S3.
In the flowchart, P n|n−1 is the probability density of u upon the evolution step, and P n|n is the probability density obtained through the analysis step.
In our case, u is the augmented state variable that includes both the unknown parameters (b, r) and the state variable (I, E, S, U, R) for every state: where k is the number of states/regions used in the fitting. We transpose it to make it into a column vector. We are aiming at building a distribution density function of u over the R 7k space.
analysis: evolution: Figure S3: Flowchart of data assimilation: u is the state variable, G n−1,n (u) is the forward map by running the ODE from time step t n−1 to t n . q is the likelihood function that measures the probability of the error term d − Mu where M is the measuring operator.
, and thus we denote meaning M is a 7k length vector that has k nontrivial entries that pick up all I i information.
We further assume the collected data has a Gaussian perturbation from the true measuring operator: and thus naturally the likelihood function is: Then together with the two formula for the evolution step and the analysis step: one can iteratively update P n|n (u), giving P n−1|n−1 , the probability density of u at time t n−1 . In the equation ∝ is the proportional sign: one needs to normalize P n|n to make it a probability density function so that P n|n (u)du = 1.
There are many choices of data assimilation methods. We choose to utilize Ensemble Kalman Filter that is steered towards analyzing systems having high dimensional state variables. It is a technique, derived from the classical Kalman Filter for the application in atmospheric science, with the analytical covariance matrix replaced by the ensemble version, eliminating the computation of the the Kalman gain matrix and the Riccati Equation that are typically expensive in high dimensional space. It is proved to the effective in the Gaussian case for linear forward model S-5 and the measuring operator [4,11,12]. The idea is to sample a fixed number of particles on the state variable space according to the initial distribution, and move these particles around at every discrete time step with certain dynamics, to represent the newly adjusted distribution. Denote the number of particles by N , and the j-th particle after the evolution step at time t n by u j,E n , and the j-th particle after the analysis step at t n by u j,A n , we now summarize the algorithm: where F E is the Forward-Euler discretization applied on ODE (1). -Analysis: where ξ j n is a k-length vector with each entry being random variable i.i.d. drawn from N (0, σ 2 ), d n is the k-length vector collecting the reported infected data on k counties. To compute the covariance matrices, we set: Then naturally Cov up n = Cov uu n · M , Cov up n = M · Cov uu n · M . are matrices of size 7k × k and k × k respectively.

Results
The results are divided into two categories: 1. parameter fitting; 2. COVID-19 infection prediction. The computation is done on the state level. Data from 50 states and D.C. (thus k = 51) in the United States are used. The model and data assimilation analyses were ran from March 1 to March 20, 2020, and we predict the future infectious cases in different states from March 21 to April 29, 2020 with different parameter setting scenarios.

State results
For parameter fitting, we utilize the method discussed in Section 3. Total 2000 samples with noninformative prior are adopted to determine 7 × 51 = 357 state variables. The standard deviation of noise is set to be σ = 10. In Figure S4-S5, we plot the susceptible, exposed, unreported infections and removed, in time, for the top 10 states with most total confirmed cases as of March 20, respectively. In Figure S6-S7, we plot the reported infections in time, for these 10 states. For most states, the number of reported case grows essentially exponentially fast. Figure S8-S11 show the inferred transmission rate b and reported rate r. Figure S12-S13 show the time series of effective reproductive number R e for different states. The signal in the data is rather weak, and for some states, the number of E and U cannot be inferred at the early stage of the breakout, leading to R e = 0 for a small period of time for some states.

Prediction
The first part of our prediction includes the case study for different transmission rate b and reporting rate r. In the following, we define the ratio between transmission rate in prediction step and data assimilation step (March 20) by α b , and define the ratio between unreporting rate 1−r in prediction step and data assimilation step (March 20) by α r , namely Fig S15 shows the effective reproductive number R e on April 29 as a function of α r and α b for five states.
We further take a proactive approach by directly identify and quarantine the exposed population. The time in days used to discover and isolate them is, on average D q . The model is changed to: As shown in Fig. 3c, the timely self-quarantine and strict isolation right away or starting at most about 3.6 days (the median value) after exposed to the SARS-CoV-2 virus is found most effective in containing the COVID-19 outbreak for most contagious states in the US.

Limitation
For the scope of this paper, we do not consider the hospital capacity, and thus we assume the current or timely built facility can accommodate all reported infected cases. However, if one can have access to the hospital capacity data in each state, the capacity information can be fed to the model as well. Furthermore, we do not distinguish mild and severe symptoms in the model.  Figure S15: R e on April 29 for different α b and α r . The red line is the level set R e = 1. It can be seen that increasing the reported rate helps to diminish the reproductive number. S-19