Evolution and forecasting of PM10 concentration at the Port of Gijón (Spain)

The name PM10 refers to small particles with a diameter of less than 10 microns. The present research analyses different models capable of predicting PM10 concentration using the previous values of PM10, SO2, NO, NO2, CO and O3 as input variables. The models were trained on data from January 2010 to December 2017. The models trained were autoregressive integrated moving average (ARIMA), vector autoregressive moving average (VARMA), multilayer perceptron neural networks (MLP), support vector machines as regressors (SVMR) and multivariate adaptive regression splines (MARS). Predictions were performed from 1 to 6 months in advance. The performance of the different models was measured in terms of root mean squared error (RMSE). For forecasting 1 month ahead, the best results were obtained with an SVMR model of six variables, which gave an RMSE of 4.2649, although MLP results were very close, with an RMSE of 4.3402. In the case of forecasts 6 months in advance, the best results correspond to an MLP model of six variables with an RMSE of 6.0873, followed by an SVMR, also with six variables, which gave an RMSE of 6.1010. For forecasts both 1 and 6 months ahead, ARIMA outperformed VARMA models.

Pollution and particulate matter studies. The World Health Organisation has reported that air pollution has an adverse effect on people's health and development 2 . It is well known that long-term exposure to high levels of air pollution is linked to decrements in lung function in children 3 . A Swiss study found increased levels of allergic sensitisation in adults living in proximity to busy roads for periods longer than 10 years 4 . Also, the PM10 pollutant is amongst those regulated under the Air Quality Framework Directive on ambient air quality assessment and management 5 . Continuous exposure to pollutants such as carbon monoxide (CO), carbon dioxide (CO2), oxides of nitrogen (NOx) and particulate matter is reported to cause health problems in the population living in the affected areas 6,7 . Particulate matter is formed by different chemical products, mostly produced by anthropogenic processes 6 , and with significantly variable diameters. Their anthropogenic origin is the reason why they are more present in urban areas 8 than in unpopulated areas.
Air quality issues are relevant in ports and nearby areas. In general, the duty cycle of marine vessels is longer than that of road vehicles. This means that ship engines generally use older technology than cars and, due to their engine power, they are also much more polluting 9 . Previous studies have analysed PM10 concentrations in ports and coastal areas such as the Bay of Algeciras in Southern Spain 10 . Another study analysed the impact of PM2.5 particles from ship emissions in southern California 11 . In Turkey, shipping emissions in the regions of Candelari Gulf 12 and Ambarli Port 13 , both with heavy shipping traffic, were investigated. Research carried out in the port of Tarragona, Spain 14 , made use of multi-linear regression models to study the contribution of different harbour activities to the levels of particulate matter in its area. Along the same lines, another study 15 , performed in Barcelona's harbour, also located in Spain and about 80 km from Tarragona as the crow flies, estimated that around 50-55% of the PM10 and PM2.5 concentrations measured at the port could be attributed to harbour activities, and that such activities contribute about 9-12% of the total PM10 concentration and about 11-15% of the PM2.5 in the air of the city's metropolitan area. Another interesting and innovative study 16 that deals with the problem of particulate matter in ports was performed in the port of Zhejiang. In this research, with the help of an unmanned aerial vehicle that integrated different sensors, the authors were able to create a profile of the vertical distribution of PM2.5, PM10 and total suspended particles from ground level to a height of 120 m. A study made at the port of Volos 17 , in Greece, found that the highest PM10 concentration values were associated with days of calm winds, meaning a wind speed under 0.5 m/s.
The only research into ports that made use of a machine learning methodology was the one concerning the port of Koper 18 . Koper is the only port in Slovenia and is located at the northern tip of the Adriatic Sea. Researchers made use of hourly PM10 concentrations and employed k-means clustering with Euclidean and city-block distances to cluster days. The results obtained showed the influence of rain intensity and wind speed on the clusters obtained, but the influence of any other pollutant was not studied. Finally, another study of interest was performed at the Port of Cork, which, like Gijón, is located on the Atlantic coast 19 .
Use of machine learning techniques to forecast pollutant concentrations. In general, machine learning can be understood as a subset of methodologies of the artificial intelligence field that are able to learn in an automatic way. In other words, they can learn from data and predict future events. Nowadays, the use of machine learning methodologies has extended to almost all branches of science, including environmental studies. One of the main reasons for the use of machine learning approaches for air quality forecasting is the ability of these methodologies to capture non-linear relationships among variables.
Interest in the forecasting of air pollution in urban areas dates back more than a century, to when large cities began to have problems with pollution 20 . In the 1970s, several statistical models for pollution forecasting were proposed 21,22 . The first applications of machine learning methodologies in this field came in the 1990s; at that time, most research made use of artificial neural networks 23,24 .
Since then, the different studies performed have made use of other techniques such as genetic algorithms 25 , Hierarchical Agglomerative Clustering 26 , k-means 27 or support vector machines as regressors 28 .
Genetic algorithms have been employed as a supporting methodology for selecting the input variables and designing the high-level architecture of neural networks models. In certain research works 25 , they were applied to the selection of the architecture and input variables of a multilayer-perceptron model for forecasting of hourly concentrations of NO. One of the limitations found by this technique is that training each neural network model is a time-consuming task and therefore, the number of parameters to be tuned must be limited.
Hierarchical agglomerative clustering is employed to group objects that are similar in subsets called clusters. The agglomerative clustering methodology starts with many small clusters and merges them together to create large ones. It has been successfully applied in order to study ozone exposure and cardiovascular-related mortality in Canada 26 . The results obtained showed that this methodology is useful for studying the long-term effects of air pollution on cardiovascular diseases.
A recent study has shown how k-means clustering can be employed to categorize different locations in a large, populated city, representing the variability of pollution according to the variables employed for the study 27 . Finally, the use of support vector machines as regressors has also been reported in some studies 28,29 . In one of these 28 , the support vector machine is employed as a regressor model to forecast the daily Beijing air quality index from 1 January 2014 to mid-2016, while in the others 29,30 it is employed to forecast the daily average PM10.
The aim of the present research is to forecast the air quality in a port area, specifically in the port area of the city of Gijón. For this purpose, the article applied different machine learning models (multilayer perceptron neural networks, support vector machine as regressor and MARS) and compared the performance of the predictions obtained for different time intervals with those given by two time series methodologies, one of them univariate (ARIMA) and the other multivariate (VARMA). This means that an exhaustive comparison is made of the prediction from 1 to 6 months in advance of the performance of five methods. This provides an interesting framework for the comparison of methodologies. All these methods were employed in the past for pollution forecasting, but never all in the same research, as far as the authors know. Therefore, the relevance of the present research is that it deals with the topic of monitoring air quality in a city, comparing different machine learning methodologies applied to the same data set.
The database. The information employed for this research was obtained from one of the meteorological stations belonging to the Air Quality Monitoring network of the Government of the Principality of Asturias, specifically the one closest to the Port of Gijón, which is located on Argentina Avenue. This station records environmental measurements hourly. As is normal in this kind of database, a small fraction (about 0.23%) of the raw observations, taken every 15 min for all variables, were missing. They were imputed with the help of the Multivariate Imputation by Chained Equations (MICE) algorithm 31 . Table 1 shows the minimum, mean, maximum and standard deviation of the pollutants measured at Gijón Port for the period of study. The values considered for the present research were average monthly measurements.
Table 1. Port of Gijón. Minimum, mean, maximum and standard deviation of the variables of the study: sulfur dioxide (SO2), nitrogen monoxide (NO), nitrogen dioxide (NO2), carbon monoxide (CO), ozone (O3) and particulate matter with a diameter of less than 10 µm (PM10).
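For readers who wish to reproduce the preprocessing, the aggregation of sub-hourly readings into the monthly averages used in this study can be sketched as follows. This is a minimal Python illustration with made-up readings; note that the paper itself imputes missing values with the MICE algorithm rather than simply skipping gaps as done here.

```python
from datetime import datetime
from collections import defaultdict

def monthly_means(records):
    """Average sub-hourly pollutant readings into monthly values.

    records: iterable of (datetime, value) pairs; value may be None
    (missing). Missing readings are skipped here for simplicity --
    the paper imputes them with MICE instead.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for ts, value in records:
        if value is None:
            continue
        key = (ts.year, ts.month)
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Hypothetical 15-min PM10 readings spanning two months
records = [
    (datetime(2017, 1, 1, 0, 0), 20.0),
    (datetime(2017, 1, 1, 0, 15), 24.0),
    (datetime(2017, 1, 1, 0, 30), None),   # missing observation
    (datetime(2017, 2, 1, 0, 0), 30.0),
]
print(monthly_means(records))  # {(2017, 1): 22.0, (2017, 2): 30.0}
```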

Materials and methods
The present research calculates predictive models of PM10 concentration by means of autoregressive integrated moving average (ARIMA), vector autoregressive moving-average (VARMA), multilayer perceptron neural network (MLP), support vector machine as regressor (SVMR) and multivariate adaptive regression splines (MARS) models. In all cases the PM10 values were calculated in two ways: firstly, using the concentrations of the six pollutants available as input variables and, afterwards, employing only four: SO2, NO, NO2 and PM10. The main reason why models using only four of the six available variables are also trained and validated is that many meteorological stations, including some belonging to the Air Quality Monitoring network of the Government of the Principality of Asturias, are only able to measure these four variables. In other words, the use of only the aforementioned four variables allows us to compare model performance according to the input variables employed and will serve as a reference for future studies. Please note that the above applies to all the models of the present research except ARIMA, where only the concentration of PM10 is employed for the forecasting. In all cases, the minimum, mean, maximum and standard deviation of the continuous variables were calculated. Forecasts are performed from 1 to 6 months in advance. The reason why it might be of interest to perform forecasts 6 months in advance is twofold: on the one hand, high PM10 concentrations have adverse effects on human health; on the other, such a forecast would be helpful in order to take measures that would make it possible to comply with European air quality standards. According to the results obtained, the best forecast of PM10 concentration 1 month ahead is obtained by the SVMR model calculated with six variables. In the case of the forecast 6 months ahead, the results of the MLP with six variables are slightly better.
In other words, in the short term the best forecasts are given by SVMR, but in the long term it is outperformed by MLP.
Autoregressive integrated moving average (ARIMA). ARIMA models can be considered an extension of ARMA (autoregressive moving average) models, known for their ability to provide a parsimonious description of a stationary stochastic process 32 . ARMA models are composed of two polynomial terms, one for the autoregression (AR) and another for the moving average (MA). Given a time series of data X_t, the ARMA model can be expressed as:

X_t = c + ε_t + Σ_{i=1}^{p} ϕ_i X_{t−i} + Σ_{j=1}^{q} θ_j ε_{t−j}

where c is a constant, ε_t are white-noise error terms, Σ_{i=1}^{p} ϕ_i X_{t−i} is the autoregressive term, in which ϕ_i are parameters and X_{t−i} is the value of variable X at time t − i, and θ_j are the moving-average parameters. ARIMA models are appropriate for those observation sets that are not necessarily generated by a stationary process, as is the case in the present problem. They considerably improve the empirical description of non-stationary time series 29 . A stochastic process can be characterized as an ARIMA model if the d-th difference of X_t constitutes a stationary and invertible ARMA process of orders p, q.
In this case, p represents the order of the autoregressive part of the model, q is the order of the weighted moving average and a third parameter, d, represents the number of differencings required to reach stationarity 33 . If the differencing operator is denoted by ∇, the general ARIMA equation can be written as follows 30 :

ϕ_p(B) ∇^d X_t = θ_q(B) ε_t

where ϕ_p(B) and θ_q(B) are the autoregressive and weighted moving-average polynomials in the backshift operator B, and ε_t is the model perturbation.
A more in-depth explanation of ARIMA models goes beyond the scope of this research and can be found elsewhere 34 . All the models employed in the present research were calculated with the help of the statistical software R 35 . ARIMA models were calculated with the help of the series library 36 .
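To make the ARIMA mechanics concrete, the following minimal sketch fits an ARIMA(1,1,0) model by hand: difference the series once (d = 1), estimate the AR(1) coefficient on the differences by least squares, then forecast and integrate back. The study itself fits full ARIMA models in R; the Python code and the data below are purely illustrative.

```python
def arima_110_forecast(series, horizon):
    """Minimal ARIMA(1,1,0) sketch: difference once, fit an AR(1)
    coefficient on the differences by least squares, then forecast
    `horizon` steps and undo the differencing."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    num = sum(diffs[i] * diffs[i + 1] for i in range(len(diffs) - 1))
    den = sum(d * d for d in diffs[:-1])
    phi = num / den  # AR(1) coefficient on the differenced series
    forecasts, last, d = [], series[-1], diffs[-1]
    for _ in range(horizon):
        d = phi * d      # next predicted difference
        last = last + d  # integrate back to the original scale
        forecasts.append(last)
    return forecasts

# Illustrative (non-stationary) series: differences double each step
print(arima_110_forecast([1.0, 2.0, 4.0, 8.0, 16.0], 2))  # [32.0, 64.0]
```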
Vector autoregressive moving-average (VARMA). The vector autoregressive moving-average (VARMA) method models the next step in each time series using an ARMA model. In other words, it can be considered the generalization of ARMA to multivariate time series. This kind of model makes it possible to model a set of time series at the same time, capturing both their within-series correlations and their cross-correlations 32 . These models were computed with the help of the MTS library 37 .
If a k-dimensional time series is represented by z_t, a general VARMA(p, q) process can be expressed as follows 37 :

z_t = ϕ_0 + Σ_{i=1}^{p} ϕ_i z_{t−i} + a_t − Σ_{j=1}^{q} θ_j a_{t−j}

where p and q are nonnegative integers, ϕ_0 is a constant vector, ϕ_i and θ_j are constant k × k matrices and {a_t} is a sequence of independent and identically distributed random vectors with mean zero and positive-definite covariance matrix Σ_a.
According to Tsay and Wood 37 , the VARMA model expressed in the previous equation can be rewritten in a more convenient structural form in which each moving-average matrix is replaced by θ*_j = θ_j L, where L is a lower triangular matrix with ones on the diagonal. The determination of the p and q values followed a methodology suggested in previous research 38 . The Akaike information criterion 39 (AIC) and the Schwarz information criterion 40 (SIC) were employed to balance the improvement in the value of the log-likelihood function against the loss of degrees of freedom that results from increasing the lag order of a time series model. With the help of both criteria, maximum p and q values were determined. All models with p and q values less than or equal to these maxima were calculated and, finally, those with the best RMSE are presented in this paper.
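The order-selection step can be illustrated with the Gaussian form of the AIC, n·ln(RSS/n) + 2k, applied to a handful of hypothetical (p, q) candidates. The residual sums of squares below are invented for the example; the study applies both AIC and SIC to the actual fitted models.

```python
import math

def aic(n, rss, k):
    """Gaussian AIC up to an additive constant: n*ln(RSS/n) + 2k,
    where k is the number of estimated parameters."""
    return n * math.log(rss / n) + 2 * k

# Hypothetical (p, q) candidates with their residual sums of squares
candidates = {(1, 0): 120.0, (1, 1): 95.0, (2, 1): 94.0}
n = 96  # months of training data (Jan 2010 - Dec 2017)
scores = {pq: aic(n, rss, k=sum(pq) + 1) for pq, rss in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # the order with the lowest AIC
```

The (2, 1) candidate barely reduces the RSS relative to (1, 1), so the extra parameter is not worth its 2k penalty and (1, 1) wins.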

Multilayer perceptron neural networks (MLP).
One of the first bio-inspired machine learning models was the one-layer perceptron. This kind of network was proposed by Rosenblatt 41 as a possible model of the neuron of the human brain. The perceptron adaptation rule is a supervised iterative method that modifies the neuron weights. The multilayer perceptron is useful as a way to model a function. In a neural network, the outcome is modelled by an intermediary set of unobservable variables called hidden variables, which are linear combinations of the original predictors; this linear combination is typically transformed by a nonlinear function. Kolmogorov 42 demonstrated that a two-layer network (one hidden layer and one output layer) with a nonlinear differentiable activation function is able to approximate any "soft" mapping if the number of neurons in the hidden layer is high enough. If a two-layer network like the one employed in the present research is considered, the operation of a system with p input variables, one output variable and q neurons in the hidden layer can be expressed as:

y(n) = σ( W_y ϕ( W_h x(n) ) )

where y(n) and x(n) are the output and input of the net, σ is the activation function of the output layer, ϕ is the activation function of the hidden layer, and W_y and W_h are the weight matrices of the output and hidden layers respectively.
One main requirement for making MLP training possible 43 is that σ and ϕ be continuously differentiable functions. Training is performed with the backpropagation method, a recursive application of gradient descent. For the purposes of this research, the neural network models were trained and validated with the help of the neuralnet library 44 . The activation function employed is the logistic function. A more in-depth explanation of the foundations of neural networks may be found elsewhere 45 .
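The two-layer operation described above amounts to the following forward pass, sketched here in Python with logistic hidden units (as in neuralnet) and a linear output for regression. The weights are placeholders for illustration, not the fitted model.

```python
import math

def logistic(z):
    """Logistic activation, the function used for the hidden layer."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, w_hidden, w_out):
    """Forward pass of a two-layer perceptron: p inputs, q logistic
    hidden neurons (one weight row per neuron), linear output."""
    hidden = [logistic(sum(w * xi for w, xi in zip(row, x)))
              for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))

# p = 2 inputs, q = 2 hidden neurons; zero hidden weights give
# hidden activations of 0.5, so the output is 0.5 + 0.5 = 1.0
print(mlp_forward([3.0, -1.0], [[0.0, 0.0], [0.0, 0.0]], [1.0, 1.0]))  # 1.0
```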

Support vector machines as regressors (SVMR). Support vector machines were introduced by the work of Vapnik 46 . Although they were created for binary classification, nowadays they are used for different kinds of problems. Those employed for regression problems are called SVMR 29 .
Let a training data set be S = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ R^d and y_i ∈ R. The regression task involves finding those parameters w = (w_1, ..., w_d) and b that define the following linear function 27 :

f(x) = ⟨w, x⟩ + b

As in practice it is not possible to find these parameters with a prediction error equal to zero, a concept called the soft margin is employed. For this, slack variables ξ_i^+ and ξ_i^- are introduced and the problem is written as follows:

min (1/2)‖w‖² + C Σ_{i=1}^{n} (ξ_i^+ + ξ_i^-)
subject to f(x_i) − y_i ≤ ε + ξ_i^+, y_i − f(x_i) ≤ ε + ξ_i^-, ξ_i^+, ξ_i^- ≥ 0

Please note that ξ_i^+ > 0 when the forecast of the model f(x_i) exceeds its real value y_i by more than ε, and ξ_i^- > 0 in the opposite case.
With the help of the Lagrangian function and the Karush-Kuhn-Tucker conditions, the problem can be expressed in its dual form, whose solution is a linear combination of the training points weighted by the dual coefficients (α_i − α_i*). In those cases where the data cannot be adjusted with the help of a linear function, kernels are employed 47 . Kernels transform the data into a new space called the feature space.
The regressor associated with the linear function in the new space is as follows:

f(x) = Σ_{i=1}^{n} (α_i − α_i*) K(x_i, x)

Please note that b* is not included in the function, as it can be absorbed as a constant inside the kernel. The kind of kernel function to be employed depends on the problem to be solved. For example, the radial basis function has been shown to be very effective, but in those cases where the data set comes from a linear regression, the linear kernel function obtains better results 48 . The SVM-as-regressor models were implemented with the functionalities of the e1071 library 49 . A good explanation of the use of SVM as a regressor can be found in the work of Drucker et al. 50 .
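The kernel regressor above can be sketched directly. In this Python illustration the support vectors and dual coefficients are placeholders; in practice they come from the fitted model (the paper uses e1071 in R).

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svr_predict(x, support_vectors, dual_coefs, gamma=1.0):
    """Kernel SVR prediction f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x).
    support_vectors and dual_coefs are illustrative placeholders."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs))

# With a single support vector equal to the query point, K = 1 and
# the prediction is simply its dual coefficient
print(svr_predict([0.0], [[0.0]], [2.0]))  # 2.0
```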

Multivariate adaptive regression splines (MARS). MARS is a non-parametric modelling method driven by the following equation 51 :

y_t = β_0 + Σ_{i=1}^{B} β_i B_i(x_{it})

where y_t is the output variable at each time t, β_i are the model parameters for the basis functions evaluated on the different predictors x_{it}, β_0 is the intercept and B represents the number of basis functions in the model.
One of the main characteristics of MARS models is that they do not make use of any a priori hypothesis concerning the relationships among the variables 52 . The basis functions are hinge functions of the form:

[±(x − t)]_+^q = (±(x − t))^q if ±(x − t) > 0, and 0 otherwise

where t is the knot and q is the power of the basis function, which is always a value equal to or larger than zero. In order to adjust a MARS model and decide which basis functions are to be included, MARS makes use of generalized cross-validation (GCV). This represents the mean squared error divided by a penalty that depends on the model complexity 53 . Its equation is as follows:

GCV = [ (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² ] / (1 − C(M)/N)²

where M represents the number of basis functions in the equation and C(M) is a complexity measure that grows with M, with a penalty parameter d charged for each basis function included in the model. For this research, a value of 2 was assigned to this parameter, while the maximum interaction order of the basis functions was restricted to 3. The MARS models employed in this research are based on those programmed in the earth library 54 . A complete explanation of MARS models can be found in the original work of Friedman 51 . Also, an easy-to-read introduction to this methodology can be found in the work of Put et al. 55 .
Table 2 shows the Pearson's correlation coefficients of all the variables in the study. The largest correlation coefficient in absolute value corresponds to variables NO and NO2, with 0.8626, followed by NO and O3 with −0.7593 (inverse relationship), and SO2 and NO2 and SO2 and NO with 0.7160 and 0.7090 respectively. The correlation coefficients of the pollutant variables with PM10 can be considered moderate in absolute value, as they range from 0.4320 (CO and PM10) to 0.5251 (NO2 and PM10). Table 3 shows the results of the ARIMA model using the previous values of PM10 as the input variable.
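Returning to the MARS methodology, its two key ingredients, the hinge basis functions and the GCV criterion, can be sketched in Python. The penalty form below is one common variant and the numbers are illustrative; they are not the fitted Gijón model.

```python
def hinge(x, knot, direction=1):
    """First-order MARS basis function: max(0, +/-(x - knot))."""
    return max(0.0, direction * (x - knot))

def gcv(y_true, y_pred, n_basis, d=2):
    """Generalized cross-validation: mean squared error inflated by a
    complexity penalty. d = 2 is the per-basis penalty used in the
    study; the exact penalty form varies across implementations."""
    n = len(y_true)
    mse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n
    c = n_basis * (1 + d)  # effective number of parameters (one variant)
    return mse / (1.0 - c / n) ** 2

print(hinge(25.0, 20.0))      # 5.0 (active above the knot)
print(hinge(15.0, 20.0))      # 0.0 (inactive below the knot)
print(hinge(15.0, 20.0, -1))  # 5.0 (mirrored hinge)
```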
Tables 4, 5, 6, 7 and 8 show the results obtained using the different models with four (SO2, NO, NO2 and PM10) and six variables (SO2, NO, NO2, CO, O3 and PM10). In all cases, the results are presented in the same way. The first line represents the forecast performed using information from January 2010 to December 2017 as training values; this forecast covers the following 6 months. The second line shows the forecast performed using information from January 2010 to January 2018 as training values, with forecasts from February 2018 (1 month ahead) to June 2018 (5 months ahead). In all cases, and in order to make an easy comparison of real values with forecasts, the root mean squared errors (RMSE) for forecasts from 1 to 6 months ahead and 1 month ahead for all models are presented in Table 9. In the case of the ARIMA model (Table 3), the only one that makes exclusive use of past PM10 concentrations in order to predict their future values, the RMSE obtained for forecasts performed 1 month ahead was 6.3163, while for forecasts performed from 1 to 6 months ahead the RMSE value was 7.6312. Tables 3, 4, 5, 6, 7 and 8 make comparisons more direct.

Results and discussion
The RMSE values achieved 1 and up to 6 months ahead for all the models trained in the present research are shown in Table 9. For forecasting 1 month ahead, the best results are obtained by the six-variable SVMR and MLP models, followed by the same models including only four variables. These results suggest that all the variables included in the study have a certain relevance for performing an accurate PM10 prediction. After the MLP and SVMR models, according to RMSE values the next best for forecasting 1 month ahead is the ARIMA model, the only one that makes exclusive use of past PM10 values to forecast future concentrations. The ARIMA model is followed by MARS with six and four variables, while the VARMA models give the worst performance.
In the case of a forecast of up to 6 months ahead, the best performance according to RMSE value is also achieved by the six-variable MLP and SVMR models, followed by the same models using only four variables. A remarkable change compared with the forecast 1 month ahead is that the MARS model that includes six variables performs better than the ARIMA model. Finally, as also happened with the forecasts 1 month ahead, the worst performance was shown by the VARMA models. From our point of view, a remarkable fact is that model performance in terms of RMSE in both the 1- and 6-month-ahead models is linked not only to the number of variables considered but also to the kind of model selected. In other words, it is possible to find a model of only one variable (ARIMA) that performs better than others that include six variables (VARMA) in both 1- and 6-month-ahead predictions. Additionally, the importance of a variable is very easy to assess with the help of a MARS model. The importance order found for the prediction of PM10 was as follows: the PM10 value in the previous moments, followed by the previous measurements of CO, NO, O3, SO2 and NO2.
The main limitation of this study is that although the original data are taken every 15 min, forecasts are performed for average monthly values. The reason why average monthly values were forecast is that the results obtained by the authors when daily or hourly forecasts were attempted were not as stable as those for average monthly values. This is due to the influence of port traffic on pollution in the area, which does not follow a fixed cycle like urban traffic. Another limitation to be overcome in future studies is that, in order to improve the results obtained, it would be of interest to introduce meteorological variables such as temperature, humidity, pressure, solar radiation, rainfall, and wind speed and direction.

Conclusions
The results obtained in this research allow us to say that it is possible to predict PM10 concentration from the previous values of this variable and the concentrations of other pollutants by means of statistical and machine learning models. Also, as had already been found in previous studies 56 , the use of the concentrations of other pollutants helps to obtain a more accurate prediction. In fact, the most accurate results were obtained for two kinds of machine learning models, SVMR and MLP, when they made use of the values of the six available variables. The results obtained show how regression-based models like SVMR, MARS and MLP outperform univariate and multivariate time-series-based models (ARIMA and VARMA). According to the findings of this paper and previous ones 29 , this is because the short-term relationships among pollutants are stronger than the temporal relationships of PM10 concentration values with themselves and with other variables. In other words, although it is possible to find certain seasonal patterns in monthly average pollutant values, the relationship of PM10 with the concentration of other pollutants is more important than the seasonal pattern.
Finally, this research affords the reader the opportunity to compare different machine learning and time-series methodologies applied to the same data set, to establish whether they are useful for PM10 concentration forecasting. If the average monthly values of PM10 from January to June 2018 are compared with those corresponding to the same months of the previous year, the resulting RMSE is 6.8557. This means that for forecasts 1 month ahead, the MLP and SVM models of four and six variables and the MARS model of six variables outperform it. When forecasts are performed 6 months ahead, the MLP models of four and six variables and the SVM of six variables outperform it. Although the proposed methodologies do not always outperform the mere use of the average PM10 concentrations of the same months of the previous year, they are a useful complementary tool for planning and taking decisions in advance.
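The naive seasonal baseline used for this comparison is easy to reproduce: predict each month with the value of the same month of the previous year and compute the RMSE. The monthly values below are invented for illustration; on the actual Gijón data this baseline scores 6.8557.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))

# Seasonal-naive baseline: last year's January predicts this January, etc.
# (hypothetical monthly PM10 averages, Jan-Jun)
previous_year = [30.0, 28.0, 25.0, 22.0, 20.0, 18.0]
this_year     = [27.0, 30.0, 24.0, 25.0, 19.0, 21.0]
print(round(rmse(this_year, previous_year), 4))  # 2.3452
```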