Examining the COVID-19 case growth rate due to visitor vs. local mobility in the United States using machine learning

Travel patterns and mobility affect the spread of infectious diseases like COVID-19. However, we do not know to what extent local vs. visitor mobility affects the growth in the number of cases. This study evaluates the impact of state-level local vs. visitor mobility in understanding the growth with respect to the number of cases for COVID spread in the United States between March 1, 2020, and December 31, 2020. Two metrics, namely local and visitor transmission risk, were extracted from mobility data to capture the transmission potential of COVID-19 through mobility. A combination of the three factors: the current number of cases, local transmission risk, and the visitor transmission risk, are used to model the future number of cases using various machine learning models. The factors that contribute to better forecast performance are the ones that impact the number of cases. The statistical significance of the forecasts is also evaluated using the Diebold–Mariano test. Finally, the performance of models is compared for three waves across all 50 states. The results show that visitor mobility significantly impacts the case growth by improving the prediction accuracy by 33.78%. We also observe that the impact of visitor mobility is more pronounced during the first peak, i.e., March–June 2020.

www.nature.com/scientificreports/ analysis also showed that travel restrictions for inbound traffic and the local mitigation strategies reduced the transmission of the virus. In a study evaluating the effect of mobility restriction in limiting COVID-19 spread, using zip code data for Atlanta, Boston, Chicago, New York (NYC), and Philadelphia, the authors estimated that total COVID-19 cases per capita decreased on average by approximately 20% for every 10% fall in mobility between February and May 2020 17 . In another study, the correlation between the COVID-19 growth rate and travel distance decay rate and dwell time at home change rate was − 0.586 (95% CI − 0.742 to − 0.370) and 0.526 (95% CI 0.293-0.700), respectively. Increases in state-specific doubling time of total cases ranged from 6.86 to 30.29 days after social distancing orders were put in place 4 . Another analysis across counties in the US showed that the adoption of government-imposed social distancing measures reduced the daily growth rate of confirmed COVID-19 cases by 5.4 percentage points after 1-5 days, 6.8 percentage points after 6-10 days, 8.2 percentage points after 11-15 days, and 9.1 percentage points after 16-20 days 13 . IA recent European study showed that internal mobility is more important than mobility across provinces to control COVID-19, and the typical lagged positive effect of reduced human mobility in reducing excess deaths is around 14-20 days. Similarly, Linka et al. evaluated the impact of global mobility using air travel and local mobility across various countries in the European Union. Their results show a maximal correlation between driving mobility and disease dynamics with a time lag of 14.6 ± 5.6 days.
Similarly, in the United States, various states, counties, and cities imposed restrictions on travel locally and from people traveling outside. For example, early in the pandemic, non-essential businesses were closed to curb the spread of the virus. Later in the pandemic, restrictions were imposed for visitors traveling from high-risk regions. For example, the United States restricted travel from various countries with a high number of COVID cases. Similarly, states like Illinois and New York mandated vaccination proofs for travelers from states with an increased number of cases during the Delta variant. A question that has not been explored is whether the impact of visitor mobility is different than local mobility and which of the either contributes more to the spread of the virus.
In this study, we examine the impact on the daily number of COVID cases resulting from local mobility and visitor mobility for all the states in the United States. In the United States, most public health decisions are made at the State level. In practice, the impact of mobility across state borders is limited (i.e., the number of people crossing state boundaries) compared to mobility between counties, especially in counties in the same metro region. Also, given the number of counties in the United States and their variability in size, population, and socioeconomic factors, it is challenging to arrive at generalizable conclusions. Therefore, we consider the statelevel granularity for this analysis. We consider two different variables to capture the infection propagation risk from infected people traveling and transmitting the virus within and across state boundaries, namely the local transmission risk (due to local mobility) and the visitor transmission risk (due to visitor mobility). We evaluate the impact of these variables to predict the case growth using various machine learning models. Assuming all the other variables like social distancing measure, mask mandates, and the local number of cases are similar, we evaluate which combination of the cases, local, and visitor mobility are more accurate at predicting the future number of cases. The more accurate the models are, the higher the impact of the variables included in the model on the target variable (the case growth) 27 . This study analyzes the mobility data both within and among all 50 states and the number of new cases per day in the United States.

Methods
Data collection. Infection data. The confirmed cases data was retrieved from the Corona Data Scraper open-source project 28 , which provides county-level data on the number of new cases per day. We aggregated the number of daily new cases to a state-level between March 1, 2020, and December 31, 2020. Mobility data. State-level mobility datasets and metrics were provided by SafeGraph 29 . SafeGraph provides aggregated trip information obtained from anonymized mobile device locations at a census tract level. The intrastate (or local) trips represent the trips taken by individuals starting and ending within the same state (i.e., state boundary). The inter-state (or visitor) trips are those where the origin and destination are in different states (i.e., origin in one state and destination in a different state). This data was collected for all the trips made between March 1, 2020, and December 31, 2020.

Approach.
To measure the impact of mobility (both local and visitor), we model the number of cases at a particular location based on the historical number of cases and the transmission of infection based on mobility. The risk of mobility associated with COVID transmission is calculated as a product of traffic flow and the number of cases per capita at origin 30 . Therefore, the local and visitor transmission risk reflects the number of infected individuals traveling from origin to destination, assuming uniform transmission. It has been established in the literature that a higher accuracy when a new feature is introduced into a machine learning model indicates that the particular feature is an important predictor for the target variable 27 . For a state i, we introduce three features, the current number of cases ( C i ), local transmission risk ( LT i ), and visitor transmission risk ( VT i ), to predict the future number of cases in a particular state. Higher forecasting accuracy when using visitor transmission risk means that it impacts the number of future cases. The prediction model is built using both linear and non-linear regression-based machine learning approaches and a combination of the three features noted earlier.
More information about the features and the machine learning methods is provided below: www.nature.com/scientificreports/ Local transmission risk. The local transmission risk represents the transmission potential of the virus based on the recent number of cases per capita (which represents local case incidence) and the mobility both at the local level. The local transmission coefficient LT for a spatial region i is calculated using the formula: where M i,i represents the number of trips where the origin and destination of the trips fall within the region i. The cases per 100,000 people at location i, which we denote as C i .
Visitor transmission risk. The visitor transmission risk represents the transmission potential of the virus based on the recent number of cases per capita at the visitor origin. The visitor transmission risk VT at a location i can be calculated using: where M j,i represents the number of trips that originate at j and end at location i and j = i . The cases per capita at location j are represented by C j . These three measures are illustrated in Fig. 1.
(1) www.nature.com/scientificreports/ Relationship between number of cases and transmission risk. The relationship between the number of cases and mobility is formulated using the below regression equations, where number of cases is the dependent variable, and the mobility related risks are the independent variables.
where C i−14 , LT i−14 , VT i−14 refer to the inputs (or independent variables) to the model, the current number of cases, local transmission risk, and the visitor transmission risk lagged by 14 days, respectively. The dependent or the predicted variable is the current number of cases. The three parameters θ C i−14 , θ LT i−14 , θ VT i−14 are parameters to be estimated. The function h is deduced using a machine learning approach, and the impact of historical cases and no transmission risk, historical cases and local transmission risk, historical cases and visitor transmission risk, and finally, historical cases, local and visitor transmission risk, are used to understand the impact on the future number of cases. Comparing the accuracy of different models and evaluating the statistical significance provides insights into which variables have a higher impact on the case growth.
Machine learning methods. Earlier work on studying mobility and its impact on the number of COVID cases identifies the relationship as non-linear 18 . Machine learning models can capture both linear and non-linear relationships between different variables. The abundance of the COVID data-case data and mobility patterns, enables us to identify complex relationship patterns. This study used popular machine learning methods: Linear Regression, Support Vector Regression, K-Nearest Neighbor Regression, Multilayer Perceptron, Random Forest Regression, and eXtreme Gradient Boost (XGBoost) Regression to forecast the number of cases. The linear regression model is linear, the other models are non-linear, where the Support Vector Regression is a support vector machine based approach, K-Nearest Neighbor is a similarity based approach, Multilayer perceptron is a neural network based approach, and finally, the random forest and XGBoost are decision tree based. These models consider the historical number of cases, local transmission risk, and visitor transmission risk when forecasting the future number of cases.
Evaluation criteria. The predictive performance of the proposed approach for each of the stations is compared using the following two metrics: mean absolute percentage error (MAPE) measures the average percent of absolute deviation between actual and forecasted values.
Root mean squared error (RMSE) captures the square root of the average of squares of the difference between actual and forecasted values.
where N is the number of test samples, A is the actual value, and P is its predicted value. For each technique, we evaluate the accuracy of prediction with and without the visitor transmission risk.
Diebold-Mariano test (DM-test) is used to evaluate the significance of the predictions of the two models 32 . The models use the forecasted number of cases generated with and without using the visitor transmission for each of the three machine learning approaches. The null hypothesis of the DM-test is that the two forecasts have similar forecast accuracy. The alternative or rejection of the null hypothesis is that the two forecasts have significantly different forecasting accuracy, i.e., the forecasts are not similar using the two models.
The results for each of the models are evaluated using 10-fold cross-validation at a state level to ensure that the data is not overfitting. The data for each state is extracted and split into 10 equal subsets. The data is trained using the 9 subsets and evaluated on the 10 for each of the subsets. The reported results are the average MAPE and RMSE for all the 10 folds.

Results
Tables 1 and 2 compare the machine learning forecasts with and without the inclusion of visitor mobility and local mobility. We compare the performance of the three machine learning models (XGBoost, Linear Regression, and Random Forecast) using the MAPE and RMSE. The results show that the MAPE of the forecasted cases is higher when both the local and visitor transmission risk is taken into account for the top 4 of the 6 models, and the RMSE of the forecasted number of cases is higher for all the models. We also evaluated the forecasting capacity of the models when the mask mandates and the social distancing guidelines were also used as features in the machine learning models to forecast the number of cases; we did not notice a significant improvement in the forecasting capability. In addition, a DM test was performed to evaluate the significance of forecasts when visitor mobility is included in the model. Table 1 shows the MAPE of machine learning models for the complete duration (i.e., March 2020-December 2020). We observe that the MAPE of the best performing model (i.e., XGBoost) has a MAPE of 16.8% when using visitor mobility compared to 22.2% when just the local transmission risk is taken into account. The linear regression model performs better when using local transmission risk than combined local and visitor transmission risk. However, Table 2 shows that the RMSE is lower when using the combination of local and visitor transmission risk than local transmission risk. In addition, the RMSE is lower when local and visitor mobility is used for all three models.
(3) www.nature.com/scientificreports/ The significance of the results is evaluated using the Diebold-Mariano score that evaluates the null hypothesis that both the forecasts are the same. The p value shows that the null hypothesis is rejected, and the difference in forecasts is statistically significant. The DM scores for the XGBoost, linear regression, and random forest are 8.7 (p = 0.2), 2.66 (p = 0.04), and 8.26 (p = 0.04) respectively. The MAPE and RMSE on the state-level data in Table 5 show that the inclusion of external mobility leads to better forecasts for all 50 states in the United States.
We make similar observations when the data is separated into three waves (Tables 3, 4) to show the performance of the model using the top two machine learning approaches along with linear regression as a baseline. The MAPE and RMSE for all the three approaches report lower MAPE and RMSE when visitor transmission risk is included in the model generation compared to just the local mobility risk. The inclusion of the visitor transmission risk improves the MAPE of the forecast by about 57.19% on average across all states for the three waves. During the first wave, the external mobility decreases MAPE by 110%, with the percentage error at 17.14 ± 0.28% and 30.81 ± 0.48% with external and only local transmission coefficients, respectively. Similarly, the improvement for MAPE for the second and third waves is 34.23% and 26.17%, respectively. Table 5 shows the importance of each feature when forecasting the number of future cases using a random forest regressor. The importance of each feature is calculated using an estimator based on the increase in error  www.nature.com/scientificreports/ when the particular feature is not considered. The results show that the current number of cases has the highest impact on the number of cases. However, the visitor transmission potential is more important than the local transmission risk for all the waves of the pandemic. The visitor transmission risk is considered twice as important as the local transmission risk. These results show that the non-linear models can accurately predict the number of new cases in the future with high accuracy when considering the visitor mobility risk along with the local mobility risk and the current number of cases. With the rest of the factors like social distancing, mask mandates, and vaccination status constant, visitor mobility is a significant factor in determining the number of cases. We also note that the impact of local and visitor mobility risk is not consistent across the three waves of the pandemic. During the first and the third waves, mobility has a higher impact on the number of cases compared to the second wave. The forecasting performance for each state using XGBoost for the three waves of the pandemic in 2020 is presented in Table 6. Figure 2 shows the cumulative number of cases per capita, local transmission risk, and visitor transmission risk for each of the states in the United States for the entire pandemic and the three phases of the pandemic. We observe that certain states have a lower local transmission risk and a higher visitor transmission risk. For example, during the second phase of the pandemic, in states like Illinois and Georgia, the local transmission risk is much lower than the transmission risk posed by travelers from other states to Illinois. Similarly, for states like New York in phase 1, the local transmission risk is higher than the visitor transmission risk. There are also variations between the interplay of local and visitor transmission risks for different phases of the pandemic. The first phase is primarily driven by local mobility, and the other two phases are a combination of local and visitor mobility.

Discussion of results
While it is apparent that the majority of the visitor transmission risk is due to travelers crossing state boundaries from neighboring cities, there is also considerable transmission risk due to long-distance travel. For example, for Louisiana, the majority of the risk for its second peak is contributed by Mississippi, Texas, and Florida, which is higher than the Arkansas that borders the state in the north. The states of Mississippi and Florida contributed more to the second peak, whereas travelers from Texas contributed to the second and third phases of the pandemic. For the states like North Dakota, most of the visitor transmission risk is attributed to their neighboring states. New York, on the other hand, has a huge increase in visitor transmission risk from Florida during the late winter when the state of Florida opened up to travelers compared to the rest of the country. These trends are presented in Figs. 3, 4 and 5. Figure 6 also shows how the pandemic spread across the United States during the first year of the pandemic. The first set of states that saw a significant increase in the number of cases are primarily located in the northeast   Fig. 4, this risk is imported from the neighboring states, i.e., New Jersey and Connecticut, in the case of New York. While all the states restricted travel during this time period, we noticed that interstate travel still contributed a significant risk across these states compared to the rest of the states in the country. During the second wave of the pandemic during the summer, the majority of the states in the southern United States and the West Coast were impacted. The mobility of the individuals is comparatively higher than during the first wave. We also observe that these states had relaxed travel restrictions compared to the states that had a peak during the first wave. Finally, the states that had an increase in the number of cases during the third wave did not have an earlier peak and had a significant increase in mobility, both locally and from outside the states. We would also point out that these states had a higher number of cases

Limitations and future work
In this study, we explored the impact of local and visitor mobility on the transmission of COVID-19 in all 50 states in the United States. However, here are some areas where this work can be potentially extended. First, the primary objective of this work is to evaluate the impact of internal and external mobility, and their effects on disease incidence. This study does not consider the many mitigating factors imposed by local authorities to curb the spread of the virus; these include: mask usage, lockdowns, social distancing guidelines, and public compliance with health regulations that could have had an effect on the number of cases. The impact of these measures are studied in other works on COVID 34 . Second, the models have been generated on the data aggregated at a daily level of granularity. There have been several issues with the data reported by state and local authorities, that include less testing over the weekend and bulk reporting of missed cases. We handle this problem by smoothening the data over 7 days. Third, the mobility data considers the number of individuals traveling from one state to another but does not capture the distance traveled by individuals during the trip. Incorporating the distance traveled might help enhance the relationship between the number of cases and the mobility of individuals. Fourth, the goal of our excercise is to evaluate the impact of local and visitor mobility on the number of cases; the provided solution is not a forecasting solution to predict the future number of COVID cases. There have been various studies that developed models that take into account: the number of cases, mobility, socio-demographic information, serological impacts etc. to predict the number of cases. The COVID Hub Ensemble model had a MAPE of 11.8% for a 2 week ahead forecast during the first year of the pandemic, DeepCOVID model from Rodriguez et al. had an accuracy of 9.2% during the same time period, compared to 16.8% in our analysis 35,36 . Finally, we consider the state as a single unit to measure the mobility and the number of cases. We do not consider the population density at origin and destination and the number of people traveling to a particular city in a state. For example, the first wave (March-June) was dominated by cases from metropolitan areas, whereas the cases during the third wave were primarily in the rural areas of the state. In the future, we would like to extend this model to various metropolitan areas in the county for analysis at a more refined level of granularity.

Conclusions
In this paper, we evaluated the impact of the disease transmission risk due to visitor and local mobility on the number of cases at a state level for all 50 states in the United States. We observed that visitor mobility is an important factor in explaining case growth. The prediction accuracy improved by 33.78% for the whole duration of the pandemic in 2020 (March-December) when visitor mobility was used in the forecasting model. The impact of transmission risk due to external mobility is observed across all three phases of the pandemic in the United States. We observe the influence of mobility is much stronger in the first phase of the pandemic compared to the second or third phase. These observations are consistent with some of the earlier studies 4,11 where mobility was observed to be an important predictor of case growth in the first phase of the pandemic.