Statistical analysis and estimation of the regional trend of aerosol size over the Arabian Gulf Region during 2002–2016

In this article, we present the results of the regional estimation of the evolution of monthly mean aerosol size over the Arabian Gulf Region, based on the data collected during the period July 2002 – September 2016. The dataset used is complete, without missing values. Two methods are introduced for this purpose. The first one is based on the partition of the regional series in sub-series and the selection of the most representative one for fitting the regional trend. The second one is a version of the first method, combined with the k-means clustering algorithm. Comparison of their performances is also provided. The study proves that both methods give a very good estimation of the evolution of the aerosol size in the Arabian Gulf Region in the study period.

Namikas 19 investigated the causes of wind erosion in sand dunes. His research revealed that wind speed varies regionally, frequently occurring during certain periods and its interaction with the land-use disturbances may produce big quantities of atmospheric dust. It was shown that much of the desert dust mass transported in the atmosphere occurs during a few events. Part of them, occurring in South and Central Asia have been extensively studied by Chen et al. 12 and Huang et al. 20 .
Recent research indicates a significant variability of the airborne desert dust concentrations that during the past decades in the Middle East, Africa, Central Asia and South America 12,[21][22][23] , and the augmentation of aerosol optical depth worldwide, especially in spring and summer, suggesting a relationship with the dust abundance 24 . Analyses of airborne desert dust and atmospheric concentrations of mineral dust particles extending from the Sahara, across the Arabian Peninsula and the Middle East, to South and Central Asia have been provided, as well 2,13,14,20 .
Most article focused on classifying the aerosol types using the satellite data. The classification are based on the aerosol optical depth 25,26 , Angstrom exponent and index of refraction [27][28][29] , or fine mode fraction and the aerosol index 30 .
The reviews on the impact of the dust size on climate and biogeochemistry, are focusing on the characterization of the size distributions of the aerosol particles. It was shown that the particles with sizes between 0.2-2 μm produce the largest shortwave radiative effect per unit mass 20,31 . As condensation nuclei, the number of particle above a given size is important 32 . Point of view of geochemistry, the amount of dust deposition is essential.
Given the importance of the study of aerosols dimensions, we notice only few articles treating this topic 20,33 . Therefore, we aim at analyzing the series of aerosols radius over the Arabian Gulf Region and to model the trend of the regional series. Two approaches are proposed, based on the partition of the regional series into subseries and the selection of the most representative one, used for building the regional trend.
In the following, we shall understand by particular series, a series recorded at a certain point and by the regional series, the multidimensional series containing all the particular series. Figure 1 represents the maxima, minima and average regional series during the study period and Fig. 2 presents the monthly average of the series in the study period.  We remark a seasonal variation of the series. At the beginning of the period the highest means have been recorded in October, November and December, but after 2010, we notice the highest monthly averages in November, December and January. The months when the smallest aerosols radii have been registered are May, June and July.

Results and Discussions
The analysis of the aerosols' radii at all the observation sites provide the following limits: the minimum is between 12.99 and 39.76 µm, the maximum is between 71.65 and 100 µm, the average is in the interval 51.21-56.79 µm, with the standard deviation in the interval 8.67-15.21. The series' skewness is between −0.41 and 0.48, 39% being in the interval [−0.05, 0.05], 31% greater than 0.05 and 33% less than −0.05, indicating that the majority of the series distributions are left or right skewed. The excess kurtosis is between −1.31 and 1.05: only 5 values are greater than zero, and approximately 90% around −1. Therefore the majority of the distributions of the dust aerosol size are platykurtic, only five of them being leptokurtic.
The normality hypothesis for the individual series has been rejected for all the series by the Shapiro-Wilk and Anderson-Darling tests. Jarque-Bera's test rejected the normality hypothesis for 95% of series. This result is in concordance with the finding related to the skewness and excess kurtosis.
After performing the Henze -Zirkler test, the null hypothesis was rejected as well. Therefore, the regional series does not follow a multivariate Gaussian distribution.
After applying the dcor t-test, the independence hypothesis has been rejected since the dcor coefficients are higher than 0.586. Since the p-value associated with the Kruskall-Wallis test is less than 0.0001, we can't accept the hypothesis that all the samples come from the same population. Therefore, the series are not independent, but they don't have the same distribution.
For emphasising the differences between the pairs of series, after the rejection of the null hypotheses by the Kruskall-Wallis test, the post-hoc Dunn's test was performed. Its results show that 82.2% of pairs of series don't come from the same population.
Mandel's test has been performed for detecting if there are outlying means among the series averages. It was found that there are only 18 outlying means of the study series. Figure  The Brown-Forsythe test rejected the homoscedasticity hypothesis. Therefore, we can conclude that the variances of the individual series are not homogenous.
For modeling the regional distribution of the aerosols' size, we used the following two algorithms.

Method I.
(1) Given the data series recorded at m different observation points, on n consecutive periods, build the matrix of the regional series, , where y ji is the series recorded at the moment j at the point i. So, a column of the matrix Y contains the data collected at a specific point, and a raw of the matrix contains all the values collected at a certain moment at all the sites. For each j n 1, = , perform the steps (2)-(6). (2) Compute the extreme values (maximum and minimum) on each row of the matrix (that is the regional extrema at each moment) and the amplitude, as the difference between the maximum and the minimum values. For example, denoting respectively by y j max and y j min the maximum and minimum values at the moment j, the amplitude at the moment j will be defined by:  (4) Attach to each sub-interval interval I jl its frequency, f jl , defined as the number of values from I jl .
(5) Choose that interval I jl whose frequency is maximum. Denote it by I j max and by f j max the corresponding frequency. If the highest frequency appears more than once, I j max will be that interval whose average is the closest to the average of the entire period j. For example, suppose that the maximum frequency is f j3 = f j5 , the average of the values in I j3 is 24.5, the average of the values in I j5 is 29.5 and the average of the period j is 30. Then, we select I j max = I j5 because 29.5 is closer to 30 than 24.5. (6) Choose the representative value for the period j to be equal to the average of the values from the interval I j max , and denote it by y j max . (7) Build the trend series that fit the regional one using y ( ) (8) Estimate the fitting quality, computing the mean absolute error and the mean standard error of each series The mean absolute error and the mean standard error of the series i (denoted respectively by MAE i and MSE i ) are defined by: where n i is the number of values of the series i, x iq is the q th value of the series i, (x iq ) e is the value estimated by the model for the q th value of the series i.
Lower MAE i and MSE i are, better the modeling quality is.

Method II.
This algorithm is based on the previous one, but the selection of the interval with the maximum frequency is replaced by the selection of the cluster with the highest number of elements, after pruning the k-means clustering algorithm, for classifying the series in clusters (homogeneous and disjoint groups within which the patterns are similar). The number of clusters, k, is a priori specified and the algorithm k-means stores k centroids used to define clusters. The clustering idea is finding groups (clusters) that minimise an error criterion, as, for example, Sum of Squared Error (SSE), which measures the total squared Euclidian distance of instances to their representative values 35 .
being the given points, the k-means clustering algorithm has the following steps 32 : (a) Choose the number of clusters, k. (b) Initialize the cluster centroids v 1 , v 2 , …, vk ∈ R n randomly. (c) Compute the distance between each data point and the cluster centers. (d) Assign the data point to the cluster whose distance from the cluster center is the minimum of all distances to the cluster centers. For more information about other clustering algorithms, the reader could refer to Rokach and Maimoon 36 , Everitt et al. 37 , Xu and Wunsch 38 .
In our case, Z = X and the element z i is the column i of the matrix X. The stages or Methods II are the following: (I) Similar to step (1) of the Method I.
(II) Choose the number of clusters, k, and perform the k-means clustering.
(III) Determine the cluster with the highest number of elements and build the matrix X c with the columns of X that contain the values recorded at the observation points included in this cluster. (IV) Choose the representative value for the period j to be equal to the average of the values from the j th row of X c . Denote it by y j C . (V) Build the series that fit the regional one, using The comparison of the methods' performances is done by computing the mean absolute error and mean standard error of each series and the overall the mean absolute error and mean standard error.
The overall mean absolute error is defined by: and the overall mean standard error is given as: where MAE is the overall mean absolute error, MSE is the overall mean standard error, MAE i is the mean absolute error of the series i, MSE i is the mean standard error of the series i, n i is the number of values in the series i. Lower MAE (or MSE) is, better the modeling result is. The modeling techniques have been applied for monthly average series collected at the study sites for 171 month. The data series and the computation done by applying Methods I and II can be found in the Supplementary Dataset 1 and Supplementary Tables S1 and S2.
For Method I, n = 387, m = 171. We chose m j = 6, for all = j 1, 171 at stage (3), meaning that after detecting the range of the values in a given period, this interval was divided into 6 subintervals. Method II was applied using a number of clusters k = 6, for comparison reasons.
After performing the k -means algorithm we got the results presented in Tables 1-3. We remark that the within-class variances, minimum, maximum and average distances to centroids are close to each other for the classes 2 and 3. The classes have different numbers of elements, the highest one being in the third one that will be selected for modeling with Method II.
The regional series obtained by applying the methods previously described are presented in Fig. 4, where Series I is obtained by Method I and Series II is obtained by Method II.
The corresponding errors (MAD and MSE) are presented in Figs 5 and 6. They were obtained based on the data provided in the Supplementary Tables S1 and S2. The values from the Supplementary Tables S1 and S2 can be synthetized in Table 4, where the minimum and maximum modeling error are presented.
Mean absolute deviations and mean standard error of the individual series (MAD i , MSE i ) vary in larger limits when using Method II, than when using Method I. Comparing the MADs (respectively MSEs) from Supplementary Tables S1, one can see that the first method better performed for 223 series (respectively 218 series). This means that MADs (respectively MSEs) were smaller in 57.62% (respectively 56.33%) cases when  Method I was used. Overall MADs for the models were respectively 2.536 and 2.607. Overall MSEs for the models were respectively 3.547 and 3.639. Therefore, the overall mean absolute deviations and mean standard errors are comparable for both methods. The study series were not highly inhomogeneous. Only some outlying means were detected and the standard deviations were between 8.69 and 15.25. This could be a reason for which the goodness of fit of the methods are similar.   The Method I gave better results due to the procedure of selection of the values of the individual series that participate in fitting the regional one. While in the second method, all the values that participate to this process are taken from the same series, belonging to the cluster with the highest number of member, in Method I, the values taken into account at each moment can belong to different series. For example, the representative values for the time t = 2 could be in the subinterval 3 and those for the time t = 4 could be in the subinterval 6.

Conclusions
In this article, we presented two methods for estimating the evolution of the regional time series of the aerosols dimensions. Both methods are easy to use, the advantage of the second one being that the k-mean algorithm is implemented in many software. In both methods, the number of subintervals, respectively clusters must be specified from the beginning for the selection of the series that will finally participate in fitting the regional series. The fitting results are not significantly different (in terms of overall MADs and MSEs), but the first method gave a better estimation of the individual series values (223 and 218 series, respectively). For highly inhomogeneous series, we recommend the use of the first method, knowing that the k-mean algorithm is sensitive to the outliers' existence. Therefore statistical tests for the mean homogeneity and homoskedasticity, as well as that for the outliers' existence must be performed before deciding the modeling method.
Both methods could be successfully used to estimate the regional evolution of the pollutants' series when some particular series in the study area present missing values, but the neighboring series are complete. In the future works we shall study the sensitivity of both methods to the selection of the clusters' number. We shall compare the methods' goodness of fit for regional series of data (as precipitation, temperature, anthropic aerosol) before and after the seasonality removal, as well. We also aim at implementing both methods in a friendly -user software.

Methodology
Data series. Data used are monthly series containing the aerosol particle radius record from July 2002 till September 2016, over the Gulf Region (Fig. 7), retrieved by MODIS 39 . There were 1850 observation points. For modeling the regional evolution of the dimension of aerosol particles we selected 387 complete series (without missing data), each of them containing 171 values. The coordinates of these points can be found in Supplementary Dataset.   To determine the type of aerosols, we compared the aerosols dimensions and the optical depth (AOT) downloaded from MODIS site 39 for our series, with the results from the literature [25][26][27][28][29] . It resulted that the type of aerosol is mostly dust because AOT was larger than 0.3 (majority around 0.43). The aerosols' dimensions are in the range of the coarse aerosols (as per the data from the Supplementary Dataset). Due to lack of space and since our main purpose is the modeling, we shall not insist on other characteristics of the study aerosols. We intend to dedicate another article to an extensive study of their properties. Statistical analysis. Before modeling, statistical analyses of the series have been performed, at the significance level α = 0.05. For all the tests, when the p-value associated is lower than the significance level, one should reject the null hypothesis H 0 , and accept the alternative hypothesis H a .

MAD
In what follows we shall call individual series the series recorded a certain observation point and the regional series that one recorded at all the sites. Therefore, an individual series will be represented by a column vector and the regional series as a matrix formed by all the columns containing the individual series.
The statistical tests have been conducted using the R software. For testing the normality hypothesis against the non-normality for the individual series, the Anderson-Darling 40 , Jarque-Bera 41 and Shapiro-Wilk 42 tests were performed. They are implemented in 'fBasics' package in R 43 .
The test statistic for the Anderson-Darling test is defined as: i n i n i Φ is the cumulative distribution function of the standard normal distribution, x is the mean and s is the standard deviation of elements in the sample, If the p-value associated to the test is less than the significance level, the normality hypothesis can be rejected. The Jarque-Bera test statistic is defined as: where n is the sample volum, k 3 is the sample skewness and k 4 is the sample excess kurtosis, defined as: For large samples, the JB statistics is compared to a chi-squared distribution with 2 degrees of freedom (χ 2 (2)). The normality hypothesis is rejected if the test statistic is greater than χ 2 (2).
The statistic of the Shapiro-Wilk test is defined as: where n is the sample volume, x 1 , x 2 , …, x n are the original data, x x x , , , n 1 2 ′ ′ ... ′ are the ordered data, x is the sample mean of the data, and the constants w i are given by: (the transposed vector (m 1 , m 2 , …, m n )) formed by the expected values of the order statistics of independent and identically distributed random variables sampled from the standard normal distribution and V is the covariance matrix of those order statistics.
Small values of W indicate non -normality. All software provide the p-values corresponding to this test statistics. If the p-value is less than the significance level, the normality hypothesis can be rejected.
For testing the multivariate normality (the normality of the regional series) against the non-normality hypothesis, we used the multivariate Henze -Zirkler test 44 , implemented in the R package 'MVN' 45 . The test is presented in the following.
Let X 1 , X 2 , …, X n ∈ R d be a random sample, where d is the dimension of X i and n is the number of observations. The Henze -Zirkler test is based on a nonnegative functional D that measures the distance between two distribution functions and has the property that D (N d (0, I d  The test statistic T β (d) is defined by: and S is the sample covariance matrix of the matrix X formed by … X X X , , , as columns. In the null hypothesis, the test statistic is approximately lognormal distributed. The null hypothesis is rejected if the p-value associated to the test statistics is less than the significance level. Since the regional series is not multivariate Gaussian, for testing the independence hypothesis of the regional series, the nonparametric dcor t-test 46 of independence has been performed.
The test statistics is: where n is the sample size, ⁎ R n is the distance correlation statistics and ν = − n n ( 3)/2. The test rejects the null hypothesis at level α if T c n > α , where α c is the (1−α) quantile of a Student t distribution with ν − 1 degrees of freedom. For details, the reader may refer to the article of Székely and Rizzo 46 .
For testing the hypothesis H 0 : The individual series come from the same population against the alternative H a : The individual series do not come from the same population, the Kruskall-Wallis test 47 was performed.
The first stage all data is ranked, ignoring the group membership. The rank of the tied values is the average of the ranks they would have received had they not been tied.
The test statistic is defined as: found in the tables of the chi-square distribution at the significance level α (generally set to be 0.05).
If the null hypothesis is rejected, the multiple pairwise comparisons are done using the Dunn's test 48 .
Let W i the i th group's summed ranks and n i its sample size and = W W n / i i i . Assign any tied values the average of the ranks they would have received had they not been tied.
For testing the hypothesis that the group A and B come from the same population against the hypothesis that they don't come from the same population, we compute where N is the total number of observations, k is the number of groups, n j is the number of observations in group j, . z j is the mean of the elements z ij in group j, and .. z is the overall mean of the z ij . The test rejects the hypothesis that the variances are equal at the significance level α if > α − − F F k n k , 1, , where F k n k , 1, α − − is the upper critical value of the F distribution with (k -1) and (n -k) degrees of freedom at the significance level of α. Alternatively, the null hypothesis is rejected if the p-value corresponding to the test is less than α.
Availability statement. Data are available in the Supplementary Database1.