Uncertainty assessment of PM2.5 contamination mapping using spatiotemporal sequential indicator simulations and multi-temporal monitoring data

Because of the rapid economic growth in China, many regions are subjected to severe particulate matter pollution. Thus, improving the methods of determining the spatiotemporal distribution and uncertainty of air pollution can provide considerable benefits when developing risk assessments and environmental policies. The uncertainty assessment methods currently in use include the sequential indicator simulation (SIS) and indicator kriging techniques. However, these methods cannot be employed to assess multi-temporal data. In this work, a spatiotemporal sequential indicator simulation (STSIS) based on a non-separable spatiotemporal semivariogram model was used to assimilate multi-temporal data in the mapping and uncertainty assessment of PM2.5 distributions in a contaminated atmosphere. PM2.5 concentrations recorded throughout 2014 in Shandong Province, China were used as the experimental dataset. Based on the number of STSIS procedures, we assessed various types of mapping uncertainties, including single-location uncertainties over one day and multiple days and multi-location uncertainties over one day and multiple days. A comparison of the STSIS technique with the SIS technique indicate that a better performance was obtained with the STSIS method.

includes a kriging variance that measures the estimation uncertainty. A contaminated area cannot be reliably classified without considering this uncertainty 14 ; thus, estimation uncertainty is an important factor when assessing the level of risk resulting from a pollutant 15 .
Generally, risk assessments are based on the quantification of specific uncertainties involved in classifying contaminated sites, and the results are expressed in terms of exceedance probabilities. In certain cases, quantitative uncertainty assessments can be performed using two main groups of techniques: the first group includes non-linear geostatistics techniques, such as disjunctive kriging (DK) and indicator kriging (IK) [16][17][18] ; and the second group includes stochastic simulation algorithms, such as sequential indicator simulations (SISs) and sequential Gaussian simulations (SGSs), which generate a set of equiprobable representations (realizations) of the spatial distribution of target attribute values and uses the differences among the simulated maps as a measure of uncertainty 19,20 . In general, SIS is more commonly used (or perhaps is more "fashionable") than IK for uncertainty modeling 21 . Moreover, SIS can overcome the limitations inherent in IK, such as the smoothing effect 22 and an inability to consider variation in estimations at unsampled locations or simultaneously reproduce multi-points of uncertainty 21 .
However, long-term uncertainty information related to PM in the atmosphere for a region may be more meaningful because a number of studies have linked long-term exposure to PM with certain diseases 23,24 . Nevertheless, the uncertainty assessment methods listed above are generally used for processing data in a single period because they are incapable of integrating multi-temporal data. Therefore, we cannot determine the spatial distribution of exceedance probabilities over a long period of time. Furthermore, it is important to determine whether multi-temporal data for an environmental variable can improve the accuracy of uncertainty models when the environmental variable is monitored continuously over many sites.
Based on these considerations, the aim of the present work was to use the spatiotemporal sequential indicator simulation (STSIS) technique 25 to assimilate multi-period data and generate many realizations. The many realizations generated by STSIS were subsequently used to estimate the various uncertainties associated with the delineation of PM 2.5 , and the results were compared with those obtained using the SIS. For illustration purposes, we used a data set of PM 2.5 concentrations in the air recorded in 2014.

Materials and Methods
Study area and data sources. The study area is located in Shandong Province, China, and it covers a national territorial area of 157.9 thousand Km 2 . The data presented in this study were obtained from 96 national air quality monitoring sites during the period from January 1, 2014 to December 31, 2014 (data were obtained from the following website: http://113.108.142.147:20035/emcpublish). The spatial distribution of monitorting was shonw in Fig. 1. The ambient concentration of PM 2.5 was measured according to the China Environmental Protection Standard HJ655-2013 26 . At each site, the daily PM 2.5 concentration was calculated by averaging the hourly data.

Spatiotemporal sequential indicator simulation (STSIS) algorithm.
To distinguish between space (S) and time (T), let Z(x) = {Z(s, t)|s ∈ S, t ∈ T} represent a variable defined on a geographical domain S ∈ R 2 and a time interval T∈ R. The STSIS algorithm used in this study involves the following steps. The first step is to code each PM 2.5 concentration observation value z(s, t) into vector K indicator values using the indicator transformation function I(s, t; z c ): where z c is a desired cutoff value of PM 2.5 concentrations. In this study, the cutoff values were set to 34 μg · m −3 (20% percentile), 53 μg · m −3 (40% percentile), 74 μg · m −3 (60% percentile), and 106 μg · m −3 (80% percentile). For each of the four PM 2.5 concentration cutoff values (z c ), the experimental spatio-temporal (ST) semivariogram of the indicator code was calculated using the following equation: where h S and h T are the spatial and temporal lags, respectively, and N(h s , h T ) is the number of pairs in the ST lag for the indicator codes of PM 2.5 concentrations.
There are two main approaches to fitting a theoretical model to the spatiotemporal experimental variogram. The first approach relies on separable variogram modeling, which assumes separate spatial and temporal variation structures and represents the total ST variogram as the sum of these structures. This approach facilitates structural analyses; however, it presents a number of important drawbacks caused by assumption of a strict separation of spatial and temporal structures. For example, this approach implies that the spatial behavior must be the same for all time points and the temporal behavior must be the same at all spatial locations. However, such consistency is not observed in practice, where different spatial patterns emerge at different times and time series at different locations show different behaviors 27,28 . The second approach relies on a non-separable model that can overcome some of the above drawbacks. There are various non-separable covariance and variogram models [29][30][31][32] . In this study, the experimental ST semivariogram was modeled using a non-separable spatiotemporal semivariogram model.
The parameters c 0 , c, v, w, ξ and α should be calculated from the data. The parameters of the model (3) were calculated simultaneously using a genetic algorithm to simultaneously estimate the parameters 33 using a fitness minimization function, such as the root mean square error (RMSE): where n s and n t denote the number of spatial and temporal data pairs, respectively. A random path visiting each ST node of a grid defined over the study area was established. Base on the procedures of SIS 34,35 , at each unsampled location, the following procedures were employed.
(i) The probability that the PM 2.5 concentration will not exceed 34, 53, 74 and 106 μg · m −3 was estimated at each point (s i , t i ) of the random path as a linear combination of the neighboring indicator values using ST ordinary kriging. These probabilities were formally expressed by the corresponding conditional cumulative distribution function (CCDF), whereas the spatiotemporal distances between (s i , t i ) and the neighboring indicator data points were determined by h s + αh T . (ii) The order relation deviations of the obtained probabilities were corrected, and a continuous model of the prior CCDF of the PM 2.5 concentration at location (s i , t i ) was built by interpolating or extrapolating the CCDF values. (iii) A simulated PM 2.5 concentration value was randomly drawn from the prior CCDF at each spatiotemporal point (s i , t i ). (iv) The indicator code of the simulated value at location (s i , t i ) was added to the prior CCDF modelling at the next point (s i+1 , t i+1 ). (v) Following the random path, the procedure (i)-(iv) above was repeated until all of the nodes were visited and each node was assigned a simulated value, thus obtaining a STSIS realization.
By selecting various random paths, a number of STSIS realizations were generated. Each realization used a different path to visit all of the nodes of the grid covering the study area, thus representing a possible spatiotemporal distribution of PM 2.5 concentrations. In this way, the mapping uncertainty was determined using a number of STSIS realizations. In the present study, 1000 realizations were generated using STSIS.

Uncertainty Assessment
Single location uncertainty for one day. The uncertainty of PM 2.5 estimation at a single spatiotemporal location p′ = (s′ , t′ ), which indicates that the probability of a PM 2.5 concentration z(p′ ) is higher than the threshold level of contamination (z c ; e.g., 75 μg · m −3 ), can be represented by the following exceedance probability: where the threshold value of 75 μg · m −3 represents the lower limit of light pollution for PM 2.5 concentrations according to the China National Ambient Air Quality Standards 5 and n(p′ ) is the number of PM 2.5 realizations generated by STSIS in which the concentration values were greater than the threshold z c (out of a total of 1000 Scientific RepoRts | 6:24335 | DOI: 10.1038/srep24335 realizations). The exceedance probability of Eq. (5) expresses the likelihood that a designated value (z c ) will be exceeded. In addition, the variance ′ s p ( ) p 2 of P STSIS [z(p′ ) > z c ] is obtained as follows: Single-location uncertainty for multi-days. A single-location uncertainty refers to the joint PM 2.5 estimation uncertainty at a spatial location over multiple days, and it indicates that the probability of PM 2.5 concentrations z(p′ ) over multi-days may be higher than the contamination threshold z c : where n t (p′ ) is the number of realizations in which the PM 2.5 concentrations generated by STSIS over the multi-day period were greater than the threshold in each one of the 1000 realizations. A location with multi-day uncertainties can reveal the pollution risk over the long term. The variance of Eq. (7) can also be obtained from Eq. (6).
Multi-location uncertainty for one day. A one-day multi-location uncertainty represents the joint uncertainty at several specified locations over a single day, and it can be used to measure the reliability of contamination assessments based on the probability map of P STSIS [z(p′ ) > z c ] for a given critical probability p c . For example, for a given p c and PM 2.5 concentrations z c , the number of points p′ where the following condition applies should be determined:

STSIS c c
Accordingly, the probability that the PM 2.5 concentrations at n locations in an area will all be greater than the threshold z c can be calculated based on the following equation: is the number of realizations in which all of the simulated PM 2.5 values at the m locations are greater than z c (in this case, out of a total of 1000 realizations). The variance can also be calculated as follows: where p j is the value of the probability in Eq. (9).

Multi-location uncertainty over multi-days.
Multi-day multi-location uncertainties represent the joint uncertainty at a set of specified locations over the multi-day period of interest, and they can be used to assess the reliability of contamination assessments based on the following probability map for a given p c : Based on Eq. (9), the multi-day multi-location uncertainty can be calculated as follows: Goodness of uncertainty assessment. For comparative analysis purposes, the SIS technique 21 was used to assess single-location PM 2.5 uncertainties using data recorded for the same day only. Then, we compared the results obtained by SIS with those obtained by STSIS.
Based on the CCDF F (u; z|(n)) at any test location u (where the notation |(n) expresses conditioning to the local information, such as n neighboring data), the series of symmetric p probability intervals (PI) considered were bounded by the corresponding p-percentile. For example, the 0.5 PI is expressed as F −1 (u; 0.5|(n), + ∞ ] or as F −1 (u; 0.5|(n), z max ] in practice. Adequate local uncertainty modeling requires that 50% of the true values over the study area locally exceed the CCDF median. Given a set of sampling points and independently generated CCDFs by the STSIS and SIS techniques at the corresponding N sampling locations u j , , where |(n) denotes the conditioning to the local information (e.g., n neighboring data), the fraction of true values falling into the symmetric p PI was calculated as follows: At this point, the root mean squared error (RMSE) for the T technique (in this case, T = SIS and STSIS) was defined as follows: Smaller RMSE values suggest more accurate assessments of PM 2.5 contamination uncertainty. The true value should fall into the PI according to the expected probability, and this interval should be as narrow as possible to reduce the value's uncertainty. Therefore, a better probabilistic model would generate a smaller spread (less uncertain). In this study, the average width of the PIs for a series of probabilities p, W p ( ), was calculated as follows:

Results and Discussion
Preliminary data description. A summary of the descriptive statistics of the PM 2.5 concentrations recorded in 2014 is presented in Fig. 2. The temporal trend of the average values for all of the monitoring sites is presented in Fig. 3. Figure 2 shows that the PM 2.5 concentrations for all of the collected data ranged from 1 to 1000, and the  mean concentration was 74.84. The coefficient of variation (CV) was 0.69, which indicates that the PM 2.5 for all of the monitoring data presented a medium variability (i.e., 0.1 < CV < 1). The skewness and kurtosis values were 2.31 and 14.95, respectively, indicating that the null hypothesis of normality was rejected for the monitoring data. Figure 3 shows that a characteristic seasonal variation in the PM 2.5 occurs in the study area, with elevated concentrations occurring in spring and winter. These variations are related to seasonal fluctuations in the emissions as well as to meteorological effects 3,32 .  Table 1.

Mapping PM2.5 concentrations: STSIS vs. STOK.
To compare the results generated by STSIS and a general ST prediction method, spatiotemporal ordinary kriging (STOK) 36 was employed to predict the ST distribution of PM 2.5 . As shown in Fig. 2, the null hypothesis of normality was rejected for the original monitoring data. Thus, before performing the STOK, the data were logarithmically transformed. After the logarithmical transformation, the K-S test value, the skewness value and the kurtosis value were 4.725, −0.252, and 0.145, respectively. Thus, the PM 2.5 concentrations after the logarithmical transformation followed a normal distribution. The experimental ST variograms were calculated, and the theoretical model was fit. The results are shown in the last row of Table 1 and Fig. 4(e). A STOK prediction was then performed on the LgPM 2.5 concentration data based on the ST theoretical variogram model. Finally, the predicted LgPM 2.5 values were translated into the original values by antilogarithms (Fig. 5(a)). Three randomly selected STSIS realizations out of 1000 realizations are shown in Fig. 5(b-d).
A comparison of the summary statistics is shown in Table 2. The maximum value, the standard deviation (SD) and coefficient of variation (CV) of the STOK were obviously smaller than those of the STSIS and original data,  indicating an obvious smoothing effect of the STOK. However, the maximum value, SD and CV of the STSIS realizations were close to those of original data, indicating a similar variability in the STSIS results with that of the original data. As shown in Fig. 5. The STSIS polygons were more fragmented relative to those of the STOK because of the smoothing effect of the STOK. Thus, the STOK results only present a simplistic spatial pattern and do not capture important information that is revealed in the more detailed STSIS maps, such as hot PM 2.5 spots. Moreover, the STSIS realizations covered all possible spatial patterns, indicating that mapping uncertainties can be fully assessed by using a sufficient number of STSIS realizations.

Contaminated sited classifications based on single-location uncertainties. In this study, 1000
STSIS simulated realizations were used to determine the single-location uncertainties, which are measured on four time scales: one day, one month, one season, and one year. In addition, the uncertainties are expressed by the probabilities of the PM 2.5 concentrations being higher than a certain threshold value. Figure 6 shows that the probability that the PM 2.5 concentration will exceed 75 μg · m −3 for most of the study locations on the 1 st day and 100 th day is close to 1, whereas the probability that the PM 2.5 concentration will exceed 75 μg · m −3 for most of the    study locations on the 200 th day and 300 th day is close to 0. Moreover, the highest probabilities on the 200 th day and 300 th day are 0.98 and 0.99, respectively, indicating that none of the study sites will present a PM 2.5 concentration that will definitely exceed 75 μg · m −3 for these two days. Figure 7 shows the spatial distribution of the probabilities in which the PM 2.5 concentrations will exceed 25 μg · m −3 (guideline provided by the WHO) 37  during each month. The smallest probability is 0, and 12.7%, 46.6%, 8.7%, 37.6%, 41.5%, 33.6%, 54.9%, 47.5%, 58.5%, 50.3%, 28.1%, and 28.8% of the study locations presented a probability of 0 from January to December, indicating that the PM 2.5 concentrations in these areas are not always >25 μg · m −3 during the corresponding month. As shown in Fig. 8, the low mean values with high coefficients of variation (CVs) are found in May, July and September, and these values indicate lower PM 2.5 pollution risks and high variation. Furthermore, the highest mean value and lowest CV are found in March, indicating that the highest PM 2.5 pollution risk occurs for almost the entire study area in this month. In terms of spatial distribution, a high PM 2.5 pollution risk (Fig. 7) is observed in the southwestern region of the study area during January, February, March, October, and November; and an absence of PM 2.5 pollution risk is observed in the eastern region of the study area at the month scale for the entire year.
Contaminated site classification with multi-location uncertainty. The CCDF generated by the STSIS can be used to measure the local uncertainty at a single location; however, a series of single-point CCDFs cannot be used to measure multi-point spatial uncertainties 21 . Therefore, an adequate reliability assessment of PM 2.5 contamination distributions requires a multi-location uncertainty assessment for 1 day (multi-location/ single-day uncertainty) and multiple days (multi-location/multi-day uncertainty) at a set of locations in the contaminated area based on the corresponding single-location uncertainty for 1 and multi-days (single-location/ single-day uncertainty and single-location/multi-day uncertainty, respectively). Figure 9 shows two types of maps for different days classified as contaminated based on the probability maps determined by

STSIS c
where the critical probabilities p c = 0.9 and 0.8. Figure 10 shows the maps for every month classified as contaminated based on the exceedance probability maps determined by of Eq. (10) are also listed. The p j value can be used to represent the reliability of the contaminated site classification. For example, the joint probability p j is 0.76 based on 9234 simulated locations of the contaminated sites based on a given critical probability p c = 0.8 for day 1, which means that for day 1, the probability of the PM 2.5 concentration at all 9234 simulated locations exceeding the threshold (75 μg · m −3 ) is 76%. If the critical probability p c = 0.9 is adopted, the joint probability is p j = 0.9 in day 1, and the likelihood that the PM 2.5 concentrations in the contaminated area will exceed 75 μg · m −3 is greater. However, the joint probabilities of all months are p j = 0, indicating a zero likelihood that the PM 2.5 concentrations at all of the simulated locations will exceed 25 μg · m −3 for each month.    Goodness of uncertainty assessment: STSIS vs. SIS. To assess the improvements provided by using multi-temporal data, we first applied the purely spatial SIS technique to the original data recorded for each day of interest. In addition, to determine the effect of performing a composite spatiotemporal data analysis, we used the STSIS technique. The results of the SIS and STSIS techniques were assessed using the methods introduced in section 2.4. The CCDFs were obtained using the SIS and STSIS techniques and a cross validation of the monitoring points recorded during 2014.
As shown in Fig. 11(a), the distance between the estimated points on the plots and the 45° line was smaller for the STSIS technique than for the SIS techniques, and the RMSE values of Eq. (14) for the STSIS and SIS were 0.031 and 0.14, respectively. Hence, the STSIS probability analysis is more accurate than the SIS analysis. Figure 11(b) shows the PI widths, which in this case correspond to the differences between the maximum value and the (1-p)-quintile of the CCDF. All of the points of the STSIS analysis fall below the points of the SIS analysis,  Table 3   indicating that the PIs obtained by the STSIS are narrower than those of the SIS. Thus, the STSIS performs better than the SIS.

Conclusions
In this work, the STSIS technique was used to perform uncertainty assessments for the PM 2.5 concentrations in Shandong Province, China. The results suggest that the STSIS can represent composite spatiotemporal variations of PM 2.5 concentrations using a non-separable semivariogram model as well as assimilate multi-temporal monitoring data. A comparison of the results of the STSIS with that of the STOK showed that the map of PM 2.5 concentrations generated by the STSIS exhibits more realistic variations and is closer to the experimental data than the map generated by the STOK. In addition, the PM 2.5 maps for 2014 revealed marked spatial and temporal trends. In terms of the spatial trends, the western part of the study area was heavily polluted with PM 2.5 , whereas the eastern part of the study area presented relatively good air quality. In terms of the temporal trends, a significant seasonal trend was observed, with high concentrations observed in spring and winter and relatively low concentrations observed in summer and autumn.
The STSIS realizations can be used to determine various types of site classification uncertainties in terms of exceedance probabilities, including single-location/single-day uncertainties, single-location/multi-day uncertainties, multi-location/single-day uncertainties, and multi-location/multi-day uncertainties. A comparative analysis showed that by using multi-temporal data, the STSIS provided a better performance than the SIS because the corresponding probability intervals of the STSIS were consistently narrower than those of the SIS.