Estimation of Regional Economic Development Indicator from Transportation Network Analytics

With the booming economy in China, many researches have pointed out that the improvement of regional transportation infrastructure among other factors had an important effect on economic growth. Utilizing a large-scale dataset which includes 3.5 billion entry and exit records of vehicles along highways generated from toll collection systems, we attempt to establish the relevance of mid-distance land transport patterns to regional economic status through transportation network analyses. We apply standard measurements of complex networks to analyze the highway transportation networks. A set of traffic flow features are computed and correlated to the regional economic development indicator. The multi-linear regression models explain about 89% to 96% of the variation of cities’ GDP across three provinces in China. We then fit gravity models using annual traffic volumes of cars, buses, and freight trucks between pairs of cities for each province separately as well as for the whole dataset. We find the temporal changes of distance-decay effects on spatial interactions between cities in transportation networks, which link to the economic development patterns of each province. We conclude that transportation big data reveal the status of regional economic development and contain valuable information of human mobility, production linkages, and logistics for regional management and planning. Our research offers insights into the investigation of regional economic development status using highway transportation big data.

With the booming economy in China, many researches have pointed out that the improvement of regional transportation infrastructure, the mobility of labor and capital, and industry reform along with other socioeconomic factors play an important role on economic growth. Timely estimation of social and economic status of cities and regions has important implications for enterprise investment and government policy making. Traditional inference approaches to economic status mainly rely on official reports and census surveys, which usually take a long period and are labor intensive. With the rapid development of information, communication and technology (ICT), new data sources of human activities 1 and vehicle movement flow [2][3][4] , air transport flow 5,6 , financial flow 7 , information flow 8 , communication flow [9][10][11][12] , and others 13 have become available for better understanding and monitoring the status of our socioeconomic environments 14,15 . Liu et al. 1 found that online social activity could reflect the macro economic status of 282 prefecture-level cities in China. Recently, Gao et al. 16 conducted a comprehensive review on data resources, computational tools, analytical methods, theoretical models, and applications in computational socioeconomics.
In the past decades, a wealth of works have been dedicated to studying the pattern of human mobility involving passenger transportation [17][18][19][20][21][22][23] . A comparatively smaller literature has been dedicated to the pattern of transportation activity embedded in goods movement 24,25 . The scarcity of reliable data sources on freight transportation appears to be one of the challenges 26 . Early studies rely on traditional freight traffic surveys, which are typically enterprise questionnaire surveys to obtain information such as the traffic volume and speed in specific road sections. For example, Ogunsanya 27 examined the field survey data collected from several freight terminals using questionnaires and freight delivery records of major warehouses in Lagos, Nigeria. With floating car technology being increasingly applied in transportation, large-scale global positioning system (GPS) tracking data of trucks have been used by researchers in exploring the spatial interaction patterns. In addition, several applications have been created in transportation planning through the prediction of road travel time, logistic demand, and analysis of vehicle operation characteristics. Comendador et al. 28 presented urban freight distributions in two Spanish cities using GPS data, and compared the general mobility patterns of five groups of freight according to the type of goods including frequency and average distance. Fu and Shi 26 analyzed the spatiotemporal features of freight truck trips using GPS data that record the truck operation in real time. Zanjani et al. 29 estimated origin-destination truck flows based on truck GPS data, and the results indicate the value of GPS data in freight travel demand modeling. Mrazovic et al. 30 explored the spatial and temporal patterns of trucks based on data on urban freight deliveries. Boarnet et al. 31 explored the freight flows in Los Angeles and provided evidence on employment being an important driver of freight activity. With the burgeon of new data sources on transportation of goods, De Montis et al. 32 obtained a weighted network representation in which the vertices correspond to the Sardinian municipalities in Italy and the weighted edges correspond to the amount of commuting traffic among them. They characterized the structure of human traffic (at the intercity level) and investigated its relation to the topological structure of cities (defined by the connectivity pattern among them). Zhao et al. 4 used logistics data on origin-destination flows of goods in Hong Kong and analyzed the intra-urban freight movement based on the gravity model with a specific interest in uncovering potential trends in the distance decay effect, reflected by the parameter β, through multiple years of observation. They took the population of sub-districts in the area of study as the values for nodal attractions and estimated the distance decay parameter β by selecting among the candidates the parameter value that yielded the best goodness of fit. With the obtained estimates for β, they demonstrated that the estimated interaction flows coincide reasonably well with observed interaction flows, which argues for the effectiveness of the gravity model in determining spatial interactions of freight movements. Recently, Ding et al. 33 reviewed the applications of complex network theory in urban traffic studies. Our study differs from theirs in terms of the activity space (i.e. intra-urban vs. inter-urban). Accordingly, our study investigates regional connections and includes both intracity and intercity transportation of people and goods through highways.
Apart from the studies using data on coach (vehicle) services, the analysis of airline services 5,34,35 and railway services 36 have been catching the attention of researchers on the studies of spatial interactions. Barrat et al. 37 provided an early study of the worldwide airports network including the traffic flow and its correlation with the topological structure. A closely related field of study on transportation is dedicated to investigating the impact of transportation infrastructure on socioeconomic status such as economic growth or population growth which Beyzatlar et al. 38 , Iacono and Levinson 39 investigated. The rapid development of high-speed rails plays an important role in regional economic development. Zheng and Kahnconnect 40 demonstrated the effects of China's bullet trains in facilitating labor market integration and mitigating the housing price and living costs in megacities. Jia et al. 41 showed that the high-speed rail construction has a positive effect on regional economic growth in China. Cheng et al. 42 and Chen & Vickerman 43 investigated how the new development of high-speed rail infrastructure impacts on the economy structure of cities and regions in Europe. However, the endogenous problem of inverse causality of development potential on infrastructure investment has been a hurdle in producing a convincing conclusion on causality. Although the new development of high-speed rails reduces the transportation costs of people between large cities, there is a significant reduction in GDP after the high-speed rail upgrade in the counties located along the affected railway lines. The reduction was largely driven by the concurrent drop in fixed asset investments of those bypassed counties 44 . Gao et al. 45 introduced the concept of inter-industry and inter-regional learnings of regional economic development. Using 25 years of economic data in China between 1990 and 2015, they addressed the endogeneity concerns by using the difference-in-differences analysis. Results showed that the high-speed rail development increased the industrial similarity of connected pairs of neighboring provinces 45 .
In this research, we examined the regional economic development indicator measured by the gross domestic product (GDP) values of cities and the relationship between GDP and transportation activities of human and goods in three provinces of China. Over 3.5 billion records of vehicle entry and exit data were collected in highway toll stations (287 in Liaoning, 421 in Jiangsu, and 335 in Shaanxi, respectively) using toll collection systems and surveys in these three provinces between years 2014 and 2017. The raw station-level data were then aggregated and summarized to the city level as our main focus is to investigate the relationship between regional economic development and the transportation networks. More detailed data, feature construction, and method descriptions can be found in the "Methods" section. We conducted the multiple linear regression analysis using a set of traffic flow features with/without regularization techniques, and the modeling results explained about 89% to 96% of the variation of cities' GDP across three provinces. To further investigate the regional spatial interaction characteristics, we then constructed the highway transportation networks to fit gravity models using annual traffic volumes of cars, buses, and trucks between pairs of cities for each province separately as well as together for the whole dataset. We found the temporal changes of distance-decay effects on spatial interactions in transportation networks, which link to the regional economic development patterns in each province. In summary, the major contributions of this research are three fold: (1) we present an analytical workflow to investigate the regional economic development status using highway transportation big data; (2) the analyses of highway transportation networks using the gravity model and the principle component analysis provide a good interpretation of the spatial structure of regional highway transportation development and the temporal economic changes; (3) the weighted network measures using the traffic flows correlate better with regional economy than that using the physical distance-based ones.

Results
Estimation of GDP from traffic flows. The relationship between the fitted city GDP values using the multiple linear regression (MLR) models of transportation features and the actual city GDP in three provinces (i.e., Liaoning, Jiangsu, and Shaanxi) are summarized in Fig. 1. It shows that simple transportation flow features (i.e., intra-city and inter-city flows of cars, buses, and trucks) extracted from the transportation networks of cities (in Eq. 2) can explain the variation of the economic development indicator (i.e., GDP) among cities very well, with the goodness of fit: R-squared of 0.934 (in Liaoning province), 0.892 (in Jiangsu province), and 0.967 (in Shaanxi province). In addition, the R-squared further increased a margin (+0.025 in Liaoning, +0.021 in Jiangsu, and +0.014 in Shaanxi) by including the volume of passengers in cars & buses and the freight truck weights in the MLR model. After transforming the dependent variable (GDP) with the natural log form (Ln), the generalized linear model had goodness of fit R-squared of 0.849 (in Liaoning), 0.727 (in Jiangsu), and 0.914 (in Shaanxi), respectively. The results are not as good as the original MLR approach. The prediction root-mean-square error (RMSE) of city GDP using original MLR model in three provinces are 53.5 (Liaoning), 119.8 (Jiangsu), and 30.11 (Shaanxi) billion CNY, respectively. As shown in the residual plots Figs. S2-S5, the residuals center on zero and are not correlated with any predictors, which indicate that these models' predictions have a relatively constant variance and show homoscedasticity and normality. In addition, by applying two regularized regression methods: the Ridge regression 46 and the least absolute shrinkage and selection operator (LASSO) regression 47,48 , the smallest RMSE values were 54.2 (Liaoning), 121.0 (Jiangsu), and 30.66 (Shaanxi) billion CNY, respectively. Moreover, these two methods only select a few predictors while reducing the coefficients of other highly correlated predictors to zero. It helps mitigate the problem of multicollinearity in MLR based on the ordinary least squares estimation, and the model still keeps a high goodness of fit for each province' data (See the'Discussion and Conclusion' section in detail). However, the selected three provinces had different economic development patterns in the study period. Accordingly, the goodness of fit (R-squared = 0.587) of a universal MLR model which combines all the GDP values and traffic flow data across three provinces is not as good as that in each province. The corresponding prediction RMSE increased to about 212 billion CNY using MLR. This demonstrates the complexity and heterogeneity of the economic development in different cities and provinces. The findings correspond to existing research on the regional economic complexity in China using non-monetary metrics 49,50 . As for the impact of predictors, we found both similarities and variations of their standardized regression coefficients in different provinces (as shown in Table 1). Specifically, the intracity flow of cars and buses has the largest positive impact to the city GDP with a standardized coefficient (0.362) in Jiangsu province, followed by the intercity truck in-flow (0.263) and the intercity car/bus in-flow (0.262), while in the case of Liaoning province, the intercity car/ bus in-flow has the largest positive standardized coefficient (7.371) followed by the intracity flow of cars and buses (1.134). In Shaanxi province, the intercity out-flow of cars and buses has the largest positive impact to the city GDP with a standardized coefficient (2.527), followed by the intracity flow of trucks (0.476) and the intracity flow of cars and buses (0.401). With the Ridge and LASSO regressions, we found very similar results (in Table 2) to the non-regularized MLR regarding the predictor selection. Accordingly, car/bus intracity flow and intercity in-flow are both selected to explain the GDP variation among cities in Jiangsu and Liaoning provinces. One different feature selection result is that the intercity out-flow of trucks selected in the Ridge and LASSO regression models for determining cities' GDP in Jiangsu province had a smaller standardized coefficient than that of the intercity in-flow of trucks in the non-regularized MLR. As for the Shaanxi province, six out of eight transport predictors are selected and the intracity flows of cars & buses as well as trucks play an important role in predicting the city GDP as reported in Table 2.
In addition, Fig. 2 and Supplementary Figs. S6 and S7 show that the temporal changes of traffic volumes of cars & buses and trucks over the consecutive four-year period match the city GDP overall changes well. Moreover, our experiments based on the total traffic volume (including all flows of cars & buses and trucks) of each city over four years and city GDP data using a Bayesian structural time-series model 51 verify that transportation infrastructure can boost economic growth (with p-value 0.002), since it has significant impact on transport activities including both human movements and freight flows.
City attractiveness and distance-decay effects. Next, we utilize the gravity model to estimate the city nodal attractiveness and the distance-decay effect on spatial interactions of cities based on the traffic-flow-weighted networks derived from cars & buses and freight trucks' entry and exit records along highways. Figure 3 shows the spatial distributions of the traffic volumes of cars and buses between cities in Shaanxi province. The magnitude of the traffic volumes of cars and buses within cities are larger than their pairwise inter-city traffic volumes for each category of vehicles (i.e., cars & buses, and trucks). The maps visualizing the traffic volumes of freight trucks in Shaanxi over four years can be found in Fig. S8. Similar maps of the other two provinces can be found in Supplement Figs. S9-S12. The spatiotemporal distributions of traffic volumes in each province could reflect the highway construction and transportation trends in each province 52,53 . Figure 4 displays   www.nature.com/scientificreports www.nature.com/scientificreports/ the relationships between the estimated and observed flow volumes of freight trucks, cars & buses in three provinces (LN: Liaoning; JS: Jiangsu; SX: Shaanxi) over four consecutive years using the gravity models. The goodness of fit R 2 of those models are all over 0.99 using the linear regression approach. The results indicate that the gravity model fits well in the regional land transportation patterns for cars & buses and freight trucks in all three provinces over the study period.
In addition, we estimated the distance-decay coefficient β in the gravity models for both categories of transportation networks using three different approaches: linear regression, linear programming (MINIMAX), and the Null model. Figure 5 shows the temporal changes of the distance-decay coefficient β in three provinces using different parameter estimation approaches. Although we got different estimated absolute values of β using varying parameter estimation strategies, the relative overall trend remains. Specifically, as the regional economy grows, we would expect more spatial interactions of people and goods (and other types of flows 7 ) among cities across different ranges of distances. Thus, it results in a smaller β and vice versa. Furthermore, the rapid development of transportation infrastructures boosts the long-distance travels in a province. We found that the distance-decay effects decreased significantly in Shaanxi province and aligned well with its fast economic growth trend in the study period (see Fig. 2). In contrast, the regional economy in Northeast China grew slowly in recent years and even decreased in several cities. Liaoning province was selected as a representative province to illustrate this trend. From year 2015 to 2016, most cities' GDP decreased which may limit some long-distance travels of passengers and goods. Thus, the distance-decay coefficient β increased from 1.03 to 1.11 for traffic flows of cars and buses using the null model estimation approach, and increased from 0.85 to 0.94 for flows of freight trucks respectively. Besides, the distance decay effects of traffic flows in Jiangsu province didn't change much while it kept a stable economic growth across all cities over years. The spatial concentration of economic activities and interactions within the southern part of Jiangsu but fewer transportation interactions between the southern and www.nature.com/scientificreports www.nature.com/scientificreports/ the northern parts may contribute to the high distance-decay effect (as shown in Fig. S9). Moreover, we found that the distance-decay coefficients β for the truck flow networks are smaller than those of the cars & buses flow networks in Shaanxi and Liaoning provinces, where long-distance travels of freight trucks are important components of regional economic activities.
Network measures and sub-network structure. Network science methods are useful in uncovering inherent characteristics of transportation networks and the spatial structure of cities and regions 7,10 . A variety of network indicators such as Centrality, PageRank, LeaderRank and many others have been proposed for Then, the betweenness and closeness centrality measures were computed in the network G D , which help understand the spatial structure of cities in the physical space. The betweenness quantifies node importance based on how often shortest paths pass through a specific node, but it needs to incorporate human population distribution and the distance decay function to better explain traffic flow distribution 2 . As for the flow-weighted networks G C and G K , the closeness centrality and the weighted PageRank 57-59 were computed to www.nature.com/scientificreports www.nature.com/scientificreports/ measure the node importance. Then, we computed the Pearson's correlation coefficient between city GDP and each of these node importance measures. As shown in Fig. 6, the city PageRank measures in the networks of cars & buses PageRank C ( ) and trucks PageRank K ( ) strongly positively correlate with city GDP in all three provinces. The flow-weighted closeness centrality Closeness C ( ) and Closeness K ( ) are positively correlated with city GDP whereas neither Closeness D ( ) nor Betw D ( ) centrality in physical space had a significant correlation with city GDP. In sum, the weighted network measures using the traffic flows correlate better with regional economy than that using the physical distance-based ones.
In addition, by applying principle component analysis (PCA) on the spatial interaction network matrices, we extract the subsystem of flows with a large portion of the total variance of interactions among cities. The first few components represent a substantial part of the total variance, which could reveal prominent regional transportation connection patterns. Taking the inter-city spatial interaction network of Shaanxi as an example, using the observed flow volumes of cars and buses in year 2014 as the weights, the PCA results in Fig. 7(a) show that first two principle components already account for over 80% (PC1: 63.1% and PC2: 17.0%) of the total variance of the inter-city flows. Xi'an, as the provincial capital city of Shaanxi province, has the largest standardized component score for the first principle component. By linking each group of city destinations (filtered by the factor loadings) to a common set of strongly connected origins (filtered by component scores) using the Goddard's approach 60 , a dominant nodal sub-system is extracted. As shown in Figs. 7(b) and 3, this sub-system links several important cities in the central and southern parts of Shaanxi province through the highway transportation system, including Xi'an, Xianyang, Weinan, Baoji, Shangluo, Ankang, and Hanzhong. Interestingly, the same dominant nodal sub-system is also extracted using the freight truck traffic volumes as the weights for the inter-city spatial interaction networks as shown in Fig. S8. This finding confirms the strong regional spatial connections of people and goods to support economic development in this area. Similarly, the PCA analysis results for extracting the dominant nodal sub-systems of Jiangsu province and Liaoning province are shown in Figs. S13 and S14. The sub-network structure reflects the complexity and the spatial connection characteristics of the regional economy.

Discussion and Conclusion
When estimating the city GDP from traffic flow variables, the existence of multicollinearity among independent variables may change the coefficient estimates of the MLR model erratically in response to small changes in the data. We used the variance inflation factors (VIF), which is a measure of how much the variance of the estimated regression coefficient b k (where k is the index of a predictor) is "inflated" by the existence of correlation among the predictor variables in the model. where R k 2 is the R-squared obtained by regressing the k th predictor on the remaining predictors. The VIF scores for the predictor variables incoming flow (I C ), outgoing flow (O C ) of cars and buses are very high (>2000) across all three provinces. As the original multi-linear regression model includes the ratio of incoming/outgoing flow (R C ), such high multicollinearity is expected. After removing the two variables with largest VIF scores in each regression model for predicting the city GDP, the R-squared values slightly reduced: from 0.934 to 0. 909 (Liaoning), from 0.892 to to 0.889 (Jiangsu), and from 0.967 to 0.955 (Shaanxi), respectively.
Moreover, by calibrating the model penalty term of L1-norm and L2-norm using empirical data in each province, we derived the regularized regression results using the Ridge and LASSO approaches. As shown in Table 2, both Ridge and LASSO methods got a high goodness of fit in the predictive modeling of city GDP: R-squared of 0.887 (Ridge) and 0.932 (LASSO) in Liaoning province, 0.888 (Ridge) and 0.882 (LASSO) in Jiangsu province, and 0.961 (Ridge) and 0.965 (LASSO) in Shaanxi province. The R-squared results are almost as good as the non-regularized MLR results using all features, but the regularized regression models are more stable in the cross-validation experiments although with some additional cost of bias. Regarding the feature/predictor selection, each province has its own unique combination of transport flow predictors to its city economy. The models also identify the prominent predictor (i.e., the intracity flow of cars and buses N C ) that has a large positive impact  www.nature.com/scientificreports www.nature.com/scientificreports/ on their GDP values but with different magnitudes of coefficients across three provinces. The detailed model coefficient results can be found in Table 2. Interestingly, if we combine all the cities' data from three provinces into one linear model, the top two selected predictors are the intercity incoming flow of trucks (I K ) and the intracity flow of trucks (N K ) for both Ridge and LASSO models, which have distinctive patterns in each province and thus help explain the city GDP variance across three provinces.
In addition, it is worth noting that if we took the province information as a dummy variable in the non-regularized MLR model of predicting GDP values using aforementioned traffic variables for all cities, the goodness of fit (R-squared) in the MLR model increased from 0.587 to 0.818, and the coefficients of this dummy variable had p-values close to 0. These values indicated that each province has its own structure of variability and such effect is statistically significant when predicting the GDP value from the transportation flow volumes of cars, buses, and trucks.
In sum, this study demonstrates that highway transportation big data could reveal the status of regional economic development and contain valuable information of human mobility, production linkages, and logistics for regional management and planning.

Methods
Data. Annual transportation data between years 2014 and 2017 for three provinces, namely Shaanxi (10 cities), Jiangsu (13 cities) and Liaoning (14 cities), in China were collected and summarized in the city level. All vehicles are required to pay for their trips within the highway in China (Except in Hainan province). Using manual and electronic toll collection systems including radio-frequency identification (RFID) and automatic plate recognition technology as well as surveys in each station, detailed data including the entrance and exit stations, and the type of each vehicle (two categories: cars & buses, and freight trucks) were recorded. In addition, for cars and buses, the average number of passengers of each vehicle was estimated based on the type of cars and buses and the survey per vehicle type. Whereas for trucks, the total weight of each vehicle was also recorded and the loading weight was estimated based on the type of truck. More details about the data collection in highway toll stations in China can be found in previous works 52,61 . City level data were aggregated and summarized from the raw station-level data as our main focus was to investigate the relationship between regional economy and transportation networks. We eliminated the traffic flows between different provinces and only kept the traffic flows within one province. In sum, the total number of vehicles, the sum of passengers (for cars and buses) and the weights (for trucks) as well as the distance (km) between paired cities were calculated respectively (see Table 3). In addition, the gross domestic product (GDP) data for each city in the same study period (2014-2017) were collected from the City Statistics Yearbook in each province, which measured the market value of all the final goods and services produced by all resident and institutional units engaged in a city.
Estimating GDP based on the traffic flow data. The multiple linear regression model (MLR) with ordinary least squares (OLS) parameter estimation is used to discover the statistical relationship between the transportation flow data and the economic development indicator (i.e. GDP). The GDP of a given city i is considered as a dependent variable GDP i predicted by eight independent variables (also known as features or predictors) related to its transport flows of people and goods: intercity incoming flow of cars & buses (I C ), intercity outgoing flow of cars & buses (O C ), intracity flow of cars & buses (N C ), and the ratio of incoming/outgoing intercity flow for cars & buses (R C ), intercity incoming flow of trucks (I K ), intercity outgoing flow of trucks (O K ), intracity flow of trucks (N K ), and the ratio of incoming/outgoing intercity flow for trucks (R K ). The formula is as follows: The intercity incoming flow (i) is the traffic flow with all origins outside the city and all destinations inside the city. The intercity outgoing flow (O) is the flow that originates inside the city and goes outside of the city. The intracity flow (n) is the flow with all origins and destinations inside the same city. The index (R) is the ratio of incoming and outgoing flows for a city, which may be able to indicate the role of a city compared with other cities among a transportation network. To demonstrate, when the ratio is larger than 1, it means there are more transportation flows coming to the city than leaving it, and the city may work as a 'sink' in the network. It may be a commercial center attracting people coming to find jobs or an industry center consuming a large amount of goods. When the ratio is less than 1, it means that the city works more like a 'source' which sends people or provides materials to other places. In addition, this study takes two types of traffic flows into consideration: the cars & buses flow and the truck flow. These two kinds of flows can reflect different aspects of the city economy, from the perspectives of human movement and goods movement. There are in total eight independent variables together with the constant (b 0 ) and the error term (e) included in the multi-linear regression model. By solving this linear model using the OLS technique, the relationship between GDP and the traffic flow of each city in each year can be discovered. The model is first used for each province separately, in attempt to distinguish the pattern of economic development in each province. Then the data of three provinces are merged together to identify the general relationship of GDP and traffic flow across provinces.
In addition, the standardized coefficient ′ b j of an independent variable from the multi-linear regression model is calculated as follows. It can be interpreted as the change of dependent variable in standard deviations, per standard deviation change in the predictors. www.nature.com/scientificreports www.nature.com/scientificreports/ Where b j is a regression coefficient. S y is the standard deviation of the dependent variable and S X j is the standard deviation of independent variable X j .
Besides, we also investigated a generalized linear model (GLM) with the natural log transformation of each city GDP given its log-normal or gamma distribution characteristics (in Supplement Fig. S1).
Linear models with regularization. When some independent variables are highly correlated in the MLR, the OLS coefficient estimations will have a large variance. The introduced regularization techniques in statistics and machine learning help reduce variance at the cost of introducing some bias in a model and avoid the overfitting issue 62,63 . There are two types of regularization techniques with two different cost functions regarding the model complexity: the L1 norm (least absolute deviations) and the L2 norm (least squares). The Ridge regression applies the L2 penalty term to control the coefficient of each independent variable in a linear regression model 46 . The objective of Ridge approach is to minimize the following cost function: where y i is the observed value of the dependent variable for each sample i (i.e., the city GDP); N is the total number of sample observations; b 0 is the constant in MLR; x ij is the value of the independent variable j for each sample i; b j is the estimated coefficient for the independent variable j (abovementioned transportation flow variables); p is the total number of independent variables (predictors), and λ >= 0 is a regularization penalty parameter. If λ is 0, the cost function backs to the OLS estimation. However, if λ is very large then it will add too much penalty to the model complexity and may lead to the model underfitting. The parameter calibration is performed to find the best λ that fits the data well and achieves the bias-variance balance. The LASSO regression has a similar cost function to minimize but it applies the L1 penalty term rather than the L2 norm to regularize the coefficient of each independent variable 47 :  Table 3. Summarized highway transportation data between 2014 and 2017 in three provinces: Jiangsu, Liaoning, and Shaanxi.
By shrinking some coefficients, the Ridge regression and the LASSO regression are able to control the multicollinearity in the model. It tends to pick one predictor from a few very correlated predictors and set the coefficients of the others to zero 47,48 . Therefore, the regularization techniques are very helpful when there are many intercorrelated features and feature selection is necessary 64 . The gravity model fitting with linear regression and linear programming. To further estimate the relative attractions of cities and identify the important cities in each province, the gravity model is applied to the study area. This widely used model assumes that the flows between nodes are generated by some attraction force from the nodes 65 . With known flow values between cities, the attraction value of each city in this model can be estimated through a reverse calibration process 65 .
The traffic flow G ij is proposed to be computed by the following equation: where k is a constant and d ij is the distance between two cities. The attraction of each city(P i and P j ) and the exponent on distance β (i.e., the coefficient of distance decay effect) are unknown and need to be estimated. The equation can then be transferred by taking a natural logarithm.
By representing the formula with simpler symbols, we got the following equation: ij , D ij is the deviation between the estimation and the observation (i.e., the error term).
Here the b ij can be considered as a known dependent variable and X i and X j are the coefficients of a series of dummy variables showing the relationship between cities 65 . For each given flow, the dummy variable will be 1 for the origin city and the destination city, and 0 for other cities. By forming such a linear regression relationship, it is possible to find the best X i , X j , and β that are closest to the real situation.
Besides the linear regression approach, a linear programming approach is also conducted as a comparison. It solves the problem starting from the deviation.
and D ij can be represented as ij ij ij 1 2 D ij 1 and D ij 2 is always greater than or equal to 0. When D ij > 0, D ij 1 = D ij and D ij 2 = 0; When D ij < 0, D ij 1 = 0 and D ij 2 = −D ij . When D ij = 0, both D ij 1 and D ij 2 is 0 66 . The goal of the linear programming is to minimize the maximum absolute error (MINIMAX). Here the M represents the maximal absolute deviation and the optimization model is listed as: Minimize: M Subject to: ij ij i j 1 2 By solving this model, the values of X i for all cities and the distance-decay coefficient β can be estimated. The attraction of each city and the β values of four years calculated from the MINIMAX approach and the linear regression approached are then compared.
The null model. A null model is also used to examine the effect of distance decay (i.e. the value of β). The null model first constructs an interactive network without considering the distance among nodes 67,68 . Then by comparing the estimated flow in the null model with the observed flow in reality, the difference can reflect the effect of distance. The total flow of a specific node can be denoted as: i j ij Letting = ∑ ∑ F G i j ij and = ∑ ∑ N W W i j i j , and the estimated flow between node i and j is: ij null i j (2020) 10:2647 | https://doi.org/10.1038/s41598-020-59505-2 www.nature.com/scientificreports www.nature.com/scientificreports/ The ratio = R G G / ij ij ij null is used to measure the effect of distance decay and since it should be related to the distance itself. A linear regression model between R ij and d ij is constructed. Then the best fitted slope is taken as the value of the distance-decay coefficient β.
Network structure analyses. The centrality and PageRank measures are often used in transportation network analysis to quantify the importance of a node in road networks 2,55,69 . Given a network < > G V E W , , of nodes (V) and edges (E) with weights (W), The betweenness and closeness centrality measures are defined as follows. The shortest paths can be calculated based on different weights such as physical distance and traffic volume, as discussed in the section "Network measures and sub-network structure".
The betweenness centrality of a node Betw i ( ) quantifies how often the shortest paths passes through a node i in a network. where n jk is the number of shortest paths between nodes j and k, and n i ( ) jk is the number of shortest paths between nodes j and k that contain node i.
The closeness centrality of a node Clos i ( ) is the reciprocal of the sum length of the shortest paths between the node and all other nodes in the network.
Where n is the total number of nodes; d ij is the length of the shortest path between nodes i and j. The PageRank PR i ( ) quantifying the node importance in networks is computed as follows 57,59 .
Where d is a damping factor between 0 and 1 and is usually set as 0.85 57 ; n is the total number of nodes; j is the index of a set of nodes M i ( ) that link to i, L j ( ) is the number of outbound links on node j, and PR j ( ) is the PageRank of node j. As for the weighted networks, the PageRank algorithm interprets the normalized edge weight W i j as connection strengths and an edge with a larger weight is more likely to be selected for connection.
In addition, the Principle Component Analysis (PCA) is utilized to identify the structure of the network and discover potential sub-networks inside a province 70 . By applying a standard PCA to the transport flow matrix F [ ] ij , every principal component (PC) can be related to a sub-system of flows and the first few components represent stronger sub-systems 60,70 . The factor loadings and component scores are computed for each PC and they are used to identify the dominant sub-networks. Given a dimension of * n m flow data set F ij , the first principal component of a set of destination features X 1 , X 2 , ..., X j , X m , is the normalized linear combination of the features that has the largest variance. As shown below, the elements φ φ … , , m 11 1 as the factor loadings of the first principal component.
j j j i j n j 1 2