A retrospective analysis of the dynamic transmission routes of the COVID-19 in mainland China

The fourth outbreak of the Coronaviruses, known as the COVID-19, has occurred in Wuhan city of Hubei province in China in December 2019. We propose a time-varying sparse vector autoregressive (VAR) model to retrospectively analyze and visualize the dynamic transmission routes of this outbreak in mainland China over January 31–February 19, 2020. Our results demonstrate that the influential inter-location routes from Hubei have become unidentifiable since February 4, 2020, whereas the self-transmission in each provincial-level administrative region (location, hereafter) was accelerating over February 4–15, 2020. From February 16, 2020, all routes became less detectable, and no influential transmissions could be identified on February 18 and 19, 2020. Such evidence supports the effectiveness of government interventions, including the travel restrictions in Hubei. Implications of our results suggest that in addition to the origin of the outbreak, virus preventions are of crucial importance in locations with the largest migrant workers percentages (e.g., Jiangxi, Henan and Anhui) to controlling the spread of COVID-19.

www.nature.com/scientificreports/ been analyzed and compared over a rage of affected countries (e.g., Australia 34,34 , Germany 35 , Italy 36,37 and South Korea 38 ). Among those emerging large volume of studies, mathematical and statistical modeling plays a non-negligible role. Also, the classical susceptible exposed infectious recovered (SEIR) model with its various extensions is the most popular method [39][40][41][42][43][44][45][46][47][48][49][50][51][52] . SEIR family models are effective in exploring the epidemic characteristics of the outbreak, forecasting the inflection point and ending time, and deciding the measures to curb the spreading. Despite this, they are less appropriate in identifying transmission routes of the COVID-19 outbreak, which is also not thoroughly investigated in existing literature.
In this paper, we fill in this gap and perform a retrospective analysis using the publicly available data 53 . Rather than employing the SEIR, we develop a time-varying coefficient sparse vector autoregressive (VAR) model. Using the least absolute shrinkage and selection operator (lasso) 54,55 and the local constant kernel smoothing estimator 56 , our model is capable of estimating the dynamic high-dimensional Granger causality coefficient matrices. This enables the detection and visualization of time-varying inter-location and self-transmission routes of the COVID-19 on the daily basis. The resulting "road-map" can help policy-markers and public-health officers retrospectively evaluate both the effectiveness and unexpected outcomes of their interventions. Such an evaluation is critical to winning the current battle against COVID-19 in China, providing useful experience for other countries facing the emerging threat of this new coronavirus, and saving lives when a new epidemic occurs in the future.

Methods
Model. Throughout this study, we are interested in the growth rate y i,t such that: where x i,t is the accumulated confirmed cases in the provincial-level administrative region (location, hereafter) i on day t ( i = 1, . . . , N and t = 1, . . . , T ). T and N define the number of days and number of locations under consideration, respectively. We then define y t = (y 1,t , . . . , y N,t ) ′ , an N × 1 vector of the growth rate on day t. To investigate a dynamic direct transmission of the growth rate among locations, we propose a time-varying coefficient sparse VAR model, namely the tvSVAR model, which assumes that Granger causality coefficients are functions of time, such that: where α t is an N-dimensional intercept vector at time t. B t is an N × N Granger causality matrix at time t with a dynamic sparse structure, for which entries can be exactly zero and the locations of zeros can vary with time. ǫ t is an N × 1 vector of error terms. The sparsity of B t is assumed because N could be even larger than T in our case, which leads to very unstable estimations and problematic interpretations of B t .
One important benefit of using the proposed tvSVAR to model the transmissions is that the Granger causality matrix, B t , can provide both the direction and strength of the route on day t. For example, the ijth entry in B t measures the strength of the transmission from location i to location j on day t. The ith diagonal of B t represents the self-transmission in location i that captures the relationship between the growth rate in the current and previous days. More critically, the sparse structure eases the interpretation of B t because many weak transmissions may be of a random nature. The corresponding coefficients, therefore, can be treated as noises and are shrunk to zeros exactly. Moreover, a time-varying design of B t allows us to investigate changes in the identified transmissions over time. For instance, let 11, 14 and 17 indicate Hubei, Jiangxi and Shanghai, respectively. On day t = 1 , the estimated β 11,14,1 and β 14,17,1 are 0.52 and 0.35, respectively. This suggests on that day, moderately strong transmission routes of confirmed COVID-19 cases are detected from Huber to Jiangxi and from Jiangxi to Shanghai, respectively. Further, estimated β 17,i,1 for all i = 1, . . . , 20 are zeros, suggesting that the confirmed cases in Shanghai cannot spread to other locations on day 1. On day t = 2 , we observe estimated β 11,14,1 = 0.61 , β 14,17,1 = 0.41 and all β 17,i,1 = 0 . Thus, the two detected routes from Huber to Jiangxi and Jiangxi to Shanghai have become more influential, whereas the cases in Shanghai are still yet to spread out on day 2. The above results cannot be derived using the classic epidemiological SEIR model.
To capture both dynamic and sparse structure of the Granger causality coefficients, we solve the following optimization problem: We use the Epanechnikov kernel K(x) = 0.75(1 − x 2 ) + and a unified bandwidth for each i ( b i ≡ b ) to avoid a large number of tuning parameters. The coefficients β i,j,t denotes the ijth entry of the Granger causality matrix B t , and is the tuning parameter that aims to shrink insignificant β i,j,t to zero and thus controls the sparsity of B t . Another essential feature of our proposed model is that the adaptive weights w i,j,t are employed to penalize β i,j,t differently in the lasso (L1) penalty 54,55 . The choice of weights w i,j,t takes account of the prior knowledge about the transmissions and can be specified by the users. In this study, we consider w i,j,t as the reciprocal of the accumulated confirmed in location i on day t − 1 . That is, the growth rate of a location with a smaller accumulated confirmed cases is less likely to influence the growth rates of others, and thus, more likely to be shrunk to zero. The final sparsity structure of B t is still data-driven.
(1) www.nature.com/scientificreports/ The estimators as in (3) can also be viewed as a penalized version of local constant kernel smoothing estimator 56 . We utilize a modified version of the fast iterative soft thresholding algorithm (FISTA) 57 to solve the optimization problem (3).
Given a bandwidth b and a penalty parameter , we can find the estimator ( α t , B t ) for each day t and observe the dynamic patterns of the transmission over time t for each pair of locations. The selection of b is critical to detecting the influential routes, which depends on the chosen criterion. Among the existing literature, a popular approach is to adopt the cross-validation strategy, such that based on the estimated (α t , B t ) , the model will not 'overfit' y t . As for the time-series analysis, we use an expanding-window sample to implement the crossvalidation 58 . This requires that the chosen b will minimize the cross-validated forecast error, which is measured by the one-step-ahead root mean squared forecast error (RMSFE), such that where [T 0 , T 1 ] is the evaluation period, which is given by the last third of the data in our study, y i,t+1 denotes the one-step-ahead forecast for location i based on the data up to day t, and y i,t+1 defines the observed growth rate at day t + 1 for location i. Note that RMSFE is analogous to the square root of the popular least squared errors. An interpretation is that the chosen b will lead to the minimized total out-of-sample forecast errors of the growth rates of confirmed cases over the last third of the sample period.

Data and results
Data. The data studied in this paper include confirmed COVID-19 cases which occurred in mainland China.
The data are publicly available and sourced from the website of the National Health Commission of the People's Republic of China 53 . The data-coverage ranges from January 29, 2020 to February 19, 2020, during which no missing data were recorded at location-level. The accumulated cases and the associated growth rates, grouped by the total national number, cases in Hubei and cases in all other locations, are plotted in Fig. 1a,b, respectively. The total national (Hubei) accumulated confirmed cases increased rapidly from 7,736 (4,586) on January 31, 2020 to 75,101 (62,457) on February 19, 2020. Note that on February 12, 2020, confirmed cases in Hubei included those confirmed by both laboratory and clinical diagnosis, leading to a one-time hump of the accumulated number. Compared to those of Hubei, confirmed cases of other locations took up a smaller proportion of the total national number, ranging from 40.7% on January 29, 2020 to 16.8% on February 19, 2020. This suggests that the growth rate of other locations should be lower than that of Hubei, which is consistent with Fig. 1b. Throughout our investigation period, except for the one-time hump on February 12, 2020, growth rates of Hubei and the rest steadily declined, from 33% and 25% to 5% and 1%, respectively.
Estimation results: transmission routes. By taking the difference of the logged accumulated cases and applying one lag, our estimated transmission routes are available from January 31 to February 19, 2020 (two observations are lost). To avoid potential noises caused by small numbers, we only include data of locations, which had at least 150 accumulated confirmed cases as of February 19, 2020. Altogether, our modeled sample contains 20 location-level confirmed cases. We firstly test the stationarity of the 20 growth rates separately. Based on the Augmented Dickey-Fuller test, only the rates of three locations (Beijing, Hainan and Heilongjiang) are insignificant, which is in-line with the employed 10% significance level. The detailed results are available upon request. The model explained in "Methods" is then fitted incorporating all the 20 growth rates. A non-zero estimate of β i,j,t , the ijth entry of B t in (2), indicates that on the tth day, the growth rate of location j is Granger caused by that of location i. In other words, there is a transmission route from location i to location j. Among the 20-day results, we noticed that the estimated transmission routes on days 1-5 changed considerably on daily basis. From the sixth day onwards, however, those estimated routes were more steady. Hence, we plot the estimates on days 1-5 and those on the every fifth day thereafter, on Fig. 2. Be noted that estimates smaller than 0.2 (none-influential) are not presented a better visual illustration purpose. Also, this research focuses on the analysis of mainland China only, which excludes Taiwan, Macau, Hong Kong and all important islands of China's territory, such as those located in the South China Sea. The plots presented in Fig. 2 therefore do not present a complete map of the territory of China, nor should they be used for purposes other than displaying identified transmission routes of COVID-19 in mainland China. The readers are directed to The National Administration of Surveying, Mapping and Geographic Information of the People's Republic of China, should they need to precisely explore the scope of maps, national boundaries and the drawing of important islands of the Chinese territory.
In Fig. 2, we use color of light orange (small) to dark red (large) indicating the accumulated confirmed cases in each location, up to time t. Estimated transmission routes are colored in blue. Self-transmissions (indicated by β i,i,t ) are denoted by dots, and a larger size of dot suggests a larger estimated β i,i,t . Inter-location transmission (indicated by β i,j,t , where i = j ) is represented by arrows, with the transparency indicating the magnitude of estimated β i,j,t . On the first day (January 31, 2020), there were influential inter-location transmissions from Hubei to Jiangxi, Heilongjiang, Zhejiang, Henan, Shandong, Jiangsu and Shaanxi, sorted by the magnitudes of strength (big to small). There were a few additional detected such transmissions on the second day, including those from Hubei to Guangxi, from Jiangxi to Fujian, and from Guangdong to Anhui, Yunnan and Hunan. The number of such identified inter-location routes, however, reduced rapidly over the next three days. On the fifth day (February 4, 2020), no influential transmission routes were found from Hubei to directly affect other locations, and there were only three influential routes identified nationally, including Zhejiang-Shaanxi, www.nature.com/scientificreports/ Zhejiang-Jiangxi and Jiangxi-Shanghai. The number of those detected inter-location routes declined again in the next few days, and on day 13, only Henan-Heilongjiang was found influential. On days 19 and 20 (February 18 and 19, 2020), there were no influential inter-location transmissions identified. The above findings suggest that the number of influential inter-location transmissions overall dropped quickly in the first five days and then reduced steadily for the rest 15 days. This is consistent with the observations of Fig. 3a, where the time-varying estimates of the Granger causality of Hubei on other locations are plotted. On each day, we report the mean, standard deviation (Std. Dev.), the 25% quantile ( Q 1 ) and 75% quantile ( Q 3 ) of those estimates in Table 1, which also leads to consistent findings.  www.nature.com/scientificreports/ As for the self-transmission, we firstly examine (Fig. 2b). It can bee seen that there were quite a few detected influential self-transmissions on the first two days. However, this number dropped quickly over days 3-5, and only self-transmissions of Heilongjiang, Guangdong and Zhejiang were found influential on day 5. Since then, the number of influential self-transmissions increased quickly with growing magnitudes (influence). On the sixteenth day (February 15, 2020), 16 out of the 20 examined locations had an estimated β i,i,20 of at least 0.2. Those large self-transmissions, however, disappeared rapidly again in the next 3 days. On February 18 and 19, 2020, there were no influential self-transmissions identified. This is consistent with our findings on Fig. 3b, where time-varying estimated β i,i,t are plotted for each location. We report daily descriptive statistics of those estimates in Table 1, which also results in consistent conclusions.

Discussions
Since 23 January, 2020, many cities on mainland China started to introduce travel restrictions, including five cities (Wuhan, Huanggang, Ezhou, Chibi and Zhijiang) of Hubei 15 . According to WHO's situation report 59 , the average incubation period of COVID-19 is up to 10 days. Thus, our estimated dynamic transmission routes supports the significant effectiveness of the interventions taken by the Chinese authorities [15][16][17][18][19][20][21][22][23][24][25] . This is evidenced by Fig. 2a-e, where the number of influential inter-location transmissions from Hubei to other locations reduced very quickly. Compared to multiple influential routes originating in Hubei detected on the first two days (January 31 and February 1, 2020), by February 4, 2020 (around 10 days after the travel restrictions), there were already no such transmissions identified. On the other hand, from February 5 to 16, 2020, Table 1 suggests that the averaged magnitudes of self-transmission on each day were strengthening steadily. This may also be explained by the interventions, which have effectively blocked inter-location transmissions, such that the growth rate of each location could only be caused by its internal transmissions.
We now focus on the inter-location transmission routes. Since influential routes from Hubei were no longer detected since day 5, we calculate the average β i,j,t of the 19 locations affected by Hubei over the first four days   Table 2. This is consistent with the fact that travel restrictions in Hubei should not affect the connections among other locations. In all cases, Jiangxi, Henan, Guangdong, Zhejiang and Anhui are the most influential origins other than Hubei. It is worth noting that Jiangxi, Henan and Anhui belong to both the top origins and destinations of the interlocation transmissions, excluding Hubei. Since the impact of Hubei is not considered, this cannot be explained by the two influential transmission routes of Hubei-Jiangxi and Hubei-Henan listed in Panel A of Table 2. To see this, over days 5-20, the transmissions out of Hubei are no longer significant and thus should not affect routes from Jiangxi and Henan to another location. In contrast, one explanation is the large migrant workers from Jiangxi, Henan and Anhui to other locations (excluding Hubei). According to the Report on China's migrant population development of 2017 60 , Jiangxi (7.25%), Henan (6.30%) and Anhui (6.27%) are among the top five locations in mainland China, ranked by the percentages of migrant workers in 2017. www.nature.com/scientificreports/ conclusions Coronaviruses have lead to three major outbreaks ever since the SARS occurred in 2003. Although the exact origin is still debatable, the current shock, namely COVID-19, has taken place in Wuhan, the capital city of Hubei province in mainland China. As the fourth large-scale outbreak of coronaviruses, COVID-19 is spreading quickly to all provincial-level administrative regions (locations, hereafter) in China and has recently become a world-wide epidemic. As a significant complement to existing research, this study employs a tvSVAR model and retrospectively investigates and visualizes the transmission routes in mainland China. Demonstrated in Fig. 2, our baseline results review both the dynamic inter-location and self-transmission routes. Since February 4, 2020, the spread out of Hubei was largely reduced, leading to no identifiable routes to other locations. Simultaneously, the self-transmissions started to accelerate and peaked on around February 15, 2020 for most locations. Given an average incubation period of 10 days, those results support the argued effectiveness of the travel restrictions to control the spread of COVID-19, which took place in multiple cities of Hubei on January 23, 2020. On February 18-19, 2020, there existed no influential inter-location or self-transmission routes. Thus, the growth rates of confirmed cases are of a more random nature in all locations thereafter, implying that the spread of COVID-19 has been under control. For the detected inter-location transmissions, our findings demonstrate that Jiangxi, Heilongjiang, Zhejiang, Henan and Shandong are the top 5 locations affected mostly via routes directly from Hubei. When the influence of Hubei is excluded, Jiangxi, Henan and Anhui are among both the top origins and destinations of transmission routes.
Our results have major practical implications for public health decision-and policy-makers. For one thing, the implemented timely ad-hoc public health interventions are proven effective, including contact tracing, quarantine and travel restrictions. For another, apart from the origin of the virus, as locations with largest migrant workers percentages, virus preventions are also of crucial importance in Jiangxi, Henan and Anhui to controlling the epidemics like the outbreak of COVID-19 in the future. With limited resources, taking ad-hoc interventions in such locations may most effectively help stop the spread of a new virus, from an economic perspective.

code availability
The R code that supports the findings of this study is available from the author on request. www.nature.com/scientificreports/ Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.