A DFA-based bivariate regression model for estimating the dependence of PM2.5 among neighbouring cities

On the basis of detrended fluctuation analysis (DFA), we propose a new bivariate linear regression model. This new model provides estimators of multi-scale regression coefficients to measure the dependence between variables and corresponding variables of interest with multi-scales. Numerical tests are performed to illustrate that the proposed DFA-bsaed regression estimators are capable of accurately depicting the dependence between the variables of interest and can be used to identify different dependence at different time scales. We apply this model to analyze the PM2.5 series of three adjacent cities (Beijing, Tianjin, and Baoding) in Northern China. The estimated regression coefficients confirmed the dependence of PM2.5 among the three cities and illustrated that each city has different influence on the others at different seasons and at different time scales. Two statistics based on the scale-dependent t-statistic and the partial detrended cross-correlation coefficient are used to demonstrate the significance of the dependence. Three new scale-dependent evaluation indices show that the new DFA-based bivariate regression model can provide rich information on studied variables.

2 , where r 12,3 denotes the partial correlation coefficient between the first and second variables eliminating the effects of the third one, N − 3 is the degree of freedom) of the partial correlation coefficients to assess the statistical significance at the given significance level. Unsurprisingly, Table 1 shows that the correlations of PM2.5 between per two cities are of statistical significance. It explains that the air quality in one city of Northern China cannot be irrelevant to that of its neighbouring cities, which implies potential dependence among the three cities. However, we also note in Table 1 that the degrees of relevance are different among different cities and in different seasons though all of them are significant.
To fully detect and quantify the dependence among the PM2.5 series of the above-mentioned three cities, in this work, we construct a new bivariate regression framework which prevails the DFA method and allows us to investigate the dependence of three nonstationary series with multiple time scales. With the DFA-based variance instead of the standard variance, this new DFA bivariate regression model provides more information on the dependence among variables at different time scales. We organize the rest of this paper as follows. The performance of the proposed DFA regression model and the results on the application to PM2.5 series analysis are reported and discussed in the following section, which is followed by our conclusions. The methodologies including the standard regression method, the DFA method, and the DFA-based regression method are introduced at the end of this paper.

Results and Discussions
Performance of DFA estimators. The bivariate DFA-based regression model produces two time scalebased regression coefficients. This allows us to detect the dependence of a response variable and two independent variables at different time scales. In order to examine the validity of the model and show its advantages, in this section, we perform two numerical tests on the non-stationary bivariate regression frameworks Y = β 0 + β 1 X 1 + β 2 X 2 + ε.
In the first test, we investigate the performance of the DFA estimators under different levels of long-term dependence in X 1 , X 2 , and Y. According to 31 , the setting I is given as below: two artificial series X 1 and X 2 with length 10000 are generated by ARFIMA(0, d, 0) process with identical fractional integration parameter (d) and independent Gaussian noises (ξ i (t), i = 1 and 2) as The quantity a n (d) is defined by a n (d) = Γ(n − d)/[Γ(−d)Γ(n + 1)], where Γ(⋅) is the Gamma function. The error-term ε is set as a standard Gaussian noise so that the response variable Y has the same parameter d as the two independent variables. The regression coefficients are set as β 0 = β 1 = 1 and β 2 = 2. Figure 1 shows mean values and standard deviation of the two DFA estimators βˆi DFA (i = 1 and 2) for the generated series with d ranging from −0.5 to 0.5 (at the step size of 0.1). The estimators are averaged over scales between 10 and 1000 with a logarithmic isometric step. Each case is run 1000 times to eliminate the noise interference. It is clear that the two estimators locate the two given regression coefficients of 1 (Fig. 1a) and 2 ( Fig. 1b) unbiasedly, and are independent of the value of d. In addition, the standard deviations of both estimators decrease with the increasing memory. The good performance shows that the method is feasible. On the other hand, to investigate the performance of the DFA estimators faced with a long-range dependent error-term ε, we use setting II given as: the memory parameter d is fixed at 0.4 for both X 1 and X 2 , and the ε is produced by an ARFIMA process with d ε varying from −0.5 to 0.5. Other settings are as those in setting I. Figure 2 records similar information as that in Fig. 1. Although the fluctuation of DFA estimators increases with d ε , which is expected due to an increasing weight of the error-term in the dynamics of Y with the increasing memory of the error-term, we are satisfied to find that the two estimators are still unbiased pointing to the given values with a narrow range for each level of memory of the error-terms.   Our second numerical test aims to show that the DFA estimators are able to identify the dependence of studied variables at different time scales whereas the classical method cannot. To this end, a binomial multifractal series (BMFs) is employed to be regarded as the independent variable X 1 , which is constructed as −1] , k = 1, 2, …, 2 n , where the parameter p ∈ (0, 0.5) (We take p = 0.3 in our test), n[k] denotes the number of digit 1 in the binary representation of the index k. The variable X 2 is a Gauss variable with 0 mean and 0.0001 standard deviation. Both X 1 and X 2 are of length 2 15 . The bivariate regression framework Y = β 0 + β 1 X 1 + β 2 X 2 + ε is set with the same coefficients as the first test (β 0 = β 1 = 1 and β 2 = 2). The error-term ε is the Gauss noise of the same strength as X 2 . For the BMFs X 1 , we remove all values smaller than 0.00001 so that only a few of the largest elements are left. In their places, we substitute Gaussian distributed random numbers with 0 mean and 0.0001 standard deviation. Then we obtain a binomial cascade series embedded in random noise. We analyze the dependence between the response variable Y and two independent variables and find that the estimated βˆD FA 2 is unbiased at 2 with a desirable error bar for every time scale, as shown in Fig. 3. However, the performance of βˆD FA 1 has changed a lot. The dependence between Y and X 1 is obviously less than the given value at the smaller scales contrary to the larger ones. This is because in the smaller scales, the dependency has been destroyed by the random noise. Our DFA estimators have the capability to recognize this effect while the classical estimators fail to do so (see the errorbar with circle symbol in Fig. 3).  Performance of the three models' regression coefficients. As mentioned above, air pollution in Northern China is very serious in recent years. Fine particulate matter from industrial exhaust and smoke dust forms smog to fill in the air. We now apply our DFA regression model to investigate the dependence of PM2.5 series in these three cities. We build three bivariate models for Beijing, Tianjin, and Baoding, respectively. In Model I, the dependent variable (Y) is the PM2.5 series of Beijing while the two independent variables are the PM2.5 series of Tianjin (X 1 ) and Baoding (X 2 ); in Model II, Y is the PM2.5 series of Tianjin, X 1 is the PM2.5 series of Beijing and X 2 is the PM2.5 series of Baoding; in Model III, Y is the PM2.5 series of Baoding, X 1 and X 2 stand for the PM2.5 series in Beijing and Tianjin, respectively. In this section, we first show the performance of the regression coefficients at different scales in the three models and then make two statistical tests for the two regression coefficients in each model. Some evaluations for the DFA-based regression and the standard regression are conducted at the end of this section.
The two regression coefficient estimators together with their standard deviations of the three models are sketched in Figs 4-6, respectively. As expected, the effect is obviously positive. However, a strong variation across scales is found in different seasons. More specifically,  is close to 0 from the smaller scale to the larger scale at about 50 days (1200 hours), which implies that the positive correlation between Tianjin and Baoding can last less than 50 days. In addition, the two coefficients are less than 0.5 in most days, which indicates that Beijing and Baoding have little impact on the PM2.5 in Tianjin. (c) For the model of Baoding, the effect of Tianjin (X 2 ) to Baoding is similar to that of Baoding to Tianjin in model II. However, the fact that after approximately 17 days (408 hours) the effect reaches the value greater than 1 indicates that an increase of 1 unit PM2.5 concentration of Tianjin will lead to the increase of more than 1 unit PM2.5 concentration in Baoding. In this regard, Tianjin has more impact on Baoding. In addition, the narrow confidence intervals and low standard deviations (less than 0.02) shown in all sub-plots suggest satisfied reliability of the estimates.  Statistic significance tests of regression coefficients. As mentioned above, the estimated βˆn ( ) DFA is able to theoretically describe the dependence between the impulse variables and the response variables at different time scales. In theory, as long as βˆn ( ) j DFA is not equal to zero, the independent variable X j will affect Y. However, for finite time series, βˆn ( ) j DFA is not always equal to 0 even in the absence of relationship between X j and Y due to the size limitation. Therefore, we perform a hypothesis test for the estimated βˆn ( ) DFA to ensure the significance. The standard regression analysis provides a so-called t statistic defined as j j j . In general, if |t j | > t 1−α/2 (N − 3) with a given α, we should reject the null hypothesis of β j = 0 and the dependence between X j and Y is considered to be statistically significant. However, since lots of time scales are taken accounted in the DFA regression model, using a single critical value of t 1−α/2 (N − 3) is inappropriate. A correct way is to generate a critical value t c (n) for each time scale. To this end, inspired by the idea proposed by Podobnik et al. 32 , we shuffle the considered PM2.5 series and repeat the DFA regression calculations for 10,000 times. Then let the integral of probability distribution function (PDF) from −t c (n) to t c (n) be equal to 1 − α (here, we take α = 0.01). As an example, we show the PDF of t c (n) with five given n's produced by the shuffled PM2.5 series of fall in Fig. 7.
As expected, the symmetrical PDF of t c (n) converges to a Gaussian distribution according to the central limit theorem. In addition, the critical value increases as n increases. This implies that large time scale may strengthen dependence between two variables. By using t c (n), we can determine whether the dependence between the impulse variable and the response variable is significant or not. In practice, the dependence between X j and Y is present when The partial DCCA coefficient ρ PDCCA (n) is recently developed to uncover the intrinsic relation for two nonstationary series at different time scales. We also calculate the partial DCCA coefficients ρ PDCCA (n) of Beijing and Tianjin, Beijing and Baoding, and Tianjin and Baoding, respectively, and present the results in Fig. 9. For the same purpose of testing the statistical significance, we also produce a critical value for the four seasons. Similarly, the PM2.5 data are shuffled 10,000 times in the PDCCA calculations repeatedly, and thus ρ n ( ) PDCCA c for 99% confidence level is obtained, which is also shown in Fig. 9.
Comparing results in Figs 8 and 9 gives amazing similarities, which are also in agreement with the results shown in Figs 4-6. Based on the results, we can draw the following three main points. However, the dependence between the two cities is lower than other cities. This finding uncovers that the reason for the serious air pollution in these two cities are mainly due to their own heavy smog or are impacted by other cities. (b) The dependence between Beijing and Baoding (the green triangle line) is significant in spring, summer, and fall. In winter, however, the dependence disappears at long time scale, which implies that the two cities can only affect each other at relatively short term. Moreover, compared to winter and fall, the dependence is much stronger in spring and summer, especially at long time scales, which indicates that they affect much longer in warm weather. (c) In spring and summer, the t(n)-statistics and ρ PDCCA (n) of Tianjin vs. Baoding (the red circle line) go down through the critical lines of t c (n) and ρ n ( ) PDCCA c at about 800 hours, respectively. This suggests that the dependence between Tianjin and Baoding will disappear when it's more than one month. However, the exact opposite occurs in winter and fall. In these two seasons, both t(n)-statistic and ρ PDCCA (n) increase with the increasing time scales, which demonstrates that the interaction of bad air quality between the two cities will last longer in cold days.  , and the beta coefficient β* DFA (n) and the average elasticity coefficient η DFA (n) in Fig. 10, Figs 11-13, respectively.
To show the new model provides more information than the standard regression model does, we also include the three corresponding coefficients of standard bivariate regression model in these figures. As seen from Fig. 10 that R n ( ) DFA 2 is superior to the standard R 2 at most time scales. The good performance illustrates that one will gain richer information in explaining the response variable when using our DFA-based regression model. On the other hand, we can conclude from Figs 11-13 that (1) Baoding has more influence than Tianjin on Beijing in all seasons except for winter. (2) Tianjin is more sensitive to Baoding's changes in air quality than Beijing's in winter and fall. , respectively, at smaller scales but much smaller at larger scales, which shows that the sensitivity of Y to X 2 (Baoding) is greater than that of Y to X 1 (Tianjin) for short term (≤300 hours) but Tianjin is   Fig. 11. Here the subscripts 1 and 2 denote Beijing and Tianjin, respectively.
Scientific RepoRts | (2018) 8:7475 | DOI:10.1038/s41598-018-25822-w more sensitive to Beijing at the long term. This can help air quality inspectors make the correct analysis for Beijing's PM2.5 at different periods.

Conclusions
The study of dependence between variables helps expose the causal relationship and correlation of the variables of interest in the real world. The linear regression model is undoubtedly one of the simplest methods among many approaches. However, single variety of regression coefficient and evaluation index cannot show all aspects of the dependence between independent variables and dependent variable. As a meaningful extension, we design a new framework for bivariate regression model using the prevailing DFA method. The proposed bivariate DFA regression model allows us to estimate multi-scale regression coefficients and other corresponding scale-dependent evaluation indicators. It has been shown via two artificial tests that these DFA-based regression coefficients are able to describe the dependence between the response variable and two independent variables exactly; and can capture different dependence at different time scales.
An application of the new model to the study of dependence of PM2.5 series among three heavily air polluted cities in Northern China unveils that huge difference of the dependence exists in per two cities in different seasons and at different periods. Three new indicators of the scale-dependent determination coefficient, the scale-dependent beta coefficient, and the scale-dependent elasticity coefficient are proposed, which turned out to be more practical than those in standard regression models. Three main points can be concluded as (1) Beijing and Baoding have little impact on the PM2.5 in Tianjin while Tianjin takes more impact on Baoding and the air quality of Beijing is more sensitive to the changes in Baoding. (2) In contrast, the air quality in Beijing and Tianjin is not significantly relevant, while the air quality in Tianjin and Baoding has a very significant impact on each other especially in the cold weather. (3) In comparison, the fluctuation of PM2.5 in Baoding has the greatest impact on the other two cities in most days. While Baoding's air quality is more sensitive to Beijing's changes in spring and summer, and is more sensitive to Tianjin's changes in winter and fall. These findings may provide some useful insights on understanding air pollution sources among cities in Northern China.

Methods
The standard bivariate regression model. To study the dependence of air quality among three neighboring cities, we consider a bivariate linear regression model as where Y is a dependent variable, X 1 and X 2 are two independent variables, ε is a Gaussian error term with zero mean value, and β j (j = 1, 2) is the partial regression coefficient characterizing the dependence on X j . The most critical work in empirical studies is to estimate β 1 and β 2 . The OLS method gives x y x xy x x where 〈·〉 denotes the mean value of the whole time period, x 1t = X 1t − 〈X 1 〉, x 2t = X 2t − 〈X 2 〉, and y t = Y t − 〈Y〉.
Then the estimator of residuals can be determined by With it one can obtain the estimators of variance of the two regression coefficients as The variance illustrates the accuracy of the estimated parameters. The estimated regression coefficients together with their variances can be further employed for hypothesis test and model evaluation. As an important indicator to evaluate the regression model, the determination coefficient R 2 is defined by j j j which can explain the relative importance of variables X 1 and X 2 to Y. According to 31 , the advantage of translating the standard notation into variance and covariance shown on the right-hand side of Eqs (2)-(5) is available to use the DFA/DCCA methods based on the same idea.
The DFA-based variance and DCCA-based covariance functions. DFA  into N n = [N/n] nonoverlapping segments of equal length n, denoted as Z j,k , k = 1, 2, …, n. The same procedure is repeated starting from the opposite end to avoid disregarding a short part of the series in the end and thus 2N n segments are obtained altogether. In the j th segment, we have Z j,k = Z (j−1)n+k for j = 1, 2, …, N n and = − − + Z Z j k N j N n k , ( ) n for j = N n + 1, N n + 2, …, 2N n , where k = 1, 2, …, n. In each segment, the local linear (or other) trend 33,34 can be fitted as  X j k , (in our work, we use 2 nd order polynomial to fit the trend in each segment). Fluctuation function f n j ( , ) The scale-characteristic fluctuation F n ( ) Z Z 2 1 2 is the so-called DCCA-based covariance, which expresses the cross-correlation fluctuations between the series of {z 1t } and {z 2t }. Thus we have obtained all objects to create the DFA-based regression model. But for purpose of testing, we need some accessories of the DFA process. The DCCA cross-correlation coefficient ρ(n), proposed by Zebende 35 , can measure the cross-correlation between two nonstationary series at multiple time scales, which is defined as To access intrinsic relations between the two time series on time scales of n, Yuan et al. 36 and Qian et al. 37 developed a so-called partial DCCA coefficient independently, which applies partial correlation technique to delete the impact of other variables on the two currently studied variables. This coefficient is defined as where C is the inverse matrix of the cross-correlation matrix produced by ρ DCCA (n) of Z 1 , Z 2 , …, and subscripts j 1 and j 2 stand respectively for the row and column of the location of ρ DCCA (Z 1 , Z 2 , n).
The DFA-based bivariate regression model. We now translate the standard bivariate regression process described above into the DFA-based bivariate regression model. The two estimators in Eq. (2) can be extended to the scale-dependent estimators in the following way using the scale-dependent variance and covariance defined in Eqs (7) and (9)