Introduction

Asian dust (AD) events, global dust transport events, have increased over the last 20 years due to global climate change and desertification1,2,3. East Asia is a major source region of global wind-blown dust aerosols. In spring and winter, dust uplifted from arid Asian areas is transported to northern China, Korea, Japan, and even as far as the western United States1,2. AD events are becoming less predictable due to an increase in the fraction of unanticipated dust particles derived from the newly formed deserts in western China and Mongolia1,2. Most previous studies have suggested that AD events result in increased occurrences of human diseases and environmental problems1,2,4,5. Therefore, AD events are recognized as a major social/environmental/clinical issue, with growing concern in East Asia1.

Although biological agents in AD have received scant attention compared with physicochemical attributes, there is increasing evidence that exposure to bioaerosols during AD events may cause adverse health effects and severe diseases when pathogenic bacteria are involved2,6,7. To investigate their effects on public health during AD events, an appropriate methodology must define potential pathogens and employ an effective monitoring system8,9; however, there is sparse information on urban airborne bacterial communities2,9. Next-generation sequencing (NGS) can offer insights into the diversity and composition of airborne culturable and non-culturable bacteria7,10. Research suggests that 16S rRNA gene-based NGS can successfully determine the abundance and diversity of potentially pathogenic bacteria for screening purposes in activated sludge, biosolids, drinking water, and soil11,12,13,14.

Identification of pathogens in bioaerosols requires long-term monitoring, and assessing bioaerosol risks to human health is time-consuming and costly. In contrast, real-time atmospheric environmental parameters are not only closely related to the occurrence of AD events but are also faster and easier to analyze than the detection and assessment of potential pathogens during AD events15. Therefore, statistical modeling could be an alternative approach for exploring the relationship between airborne bacterial communities and atmospheric environmental conditions16. If consistent relationships can be found between them, it will be possible to predict potential hazards one or two days in advance and to protect public health more effectively17,18. Most importantly, reliable short-term prediction of potential airborne bacterial hazards may assist the authorities in managing atmospheric environmental policy for AD events. Despite extensive research on physicochemical modeling of AD events19,20, no specific research has so far been carried out to predict biological hazards during AD events.

Multiple linear regression (MLR) is a widely used statistical tool for finding an appropriate mathematical model and for determining the best-fitting coefficients of that model from the given data16,18. MLR generally provides good predictive capability in environmental studies, such as air quality prediction models16,18, and allows the relationship between dependent and predictor variables to be interpreted through statistical tests21. Machine learning with rule induction is a powerful approach for collecting, summarizing, and analyzing data from different perspectives to extract valuable and practical information and identify useful relationships22,23. As a representative machine learning method, the classification and regression tree (CART) has considerable advantages: it is nonparametric, it is suitable for nonlinear structures, and it may be appropriate for solving complex, dynamic environmental problems from a small dataset22,24. Rule induction employed in CART can be used to find key rules on the basis of interactions between independent and dependent variables22,23. CART approaches have been used in environmental forecasting research to estimate urban air quality18, determine groundwater pollution vulnerability24, predict in situ dechlorination potential25, predict water quality from wastewater treatment plants26, assess microbial source tracking27, and predict heavy metal sorption to soil28. Therefore, CART and MLR models could support decision-making and effective management of potential urban airborne bacterial hazards during AD events. However, no detailed comparison of the performance of these models has yet been carried out.

The aims of this study are to (1) compare the predictive abilities of the MLR and CART approaches for assessing potential airborne bacterial hazards during AD events, and (2) identify key atmospheric environmental parameters that significantly influence potential airborne bacterial hazards during AD events.

Results

Characterization of Atmospheric Parameters between AD and Non-AD Events

The average PM10 concentration during AD events was 178 µg/m3, which was significantly (t-test, p < 0.001) higher, by 112 µg/m3, than that during non-AD events (Table 1). Seasonal monitoring revealed that airborne bacterial abundance rose with PM10 concentration and was 10- to 50-fold higher during AD events, whereas no comparable change in airborne bacterial abundance was observed during non-AD events. Although studies5,6 have indicated that atmospheric indicators such as temperature and relative humidity exhibit relatively high correlations during AD events, our monitoring results revealed no significant difference in these indicators between AD and non-AD events. The other atmospheric parameters (e.g., wind speed, sunshine, evaporation, and surface temperature) also displayed no differences between AD and non-AD events (Table 1).

Table 1 Statistical summary of the data for the atmospheric environmental parameters and airborne bacterial parameters between AD events (n = 10) and non-AD events (n = 45).

Characteristics of Bacterial Communities between AD and Non-AD Events

The abundance of airborne bacteria was determined by qPCR targeting the 16S rRNA gene in samples collected during the three study years. The 16S rRNA gene copy numbers ranged from 4.85 × 10³ to 2.58 × 10⁸ gene copies/m3. During AD events, the gene copy numbers (mean: 6.05 × 10⁷ gene copies/m3, Stdev: 1.00 × 10⁶) increased markedly compared with the non-AD levels (mean: 3.22 × 10⁵ gene copies/m3, Stdev: 1.37 × 10⁴) (p < 0.001) (Table 1). Additionally, the bacterial 16S rRNA gene copy numbers tended to correlate positively with PM10 concentration (Supplementary Fig. S1a). As indicated by the Shannon index (H′) values, airborne bacterial diversity significantly increased during AD events (Supplementary Fig. S1b). The increased airborne bacterial diversity during AD events and its correlation with dust parameters suggest that dust events increase local airborne bacterial diversity.
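
As an illustration of how the Shannon index (H′) reported above can be computed from taxon counts, the following minimal Python sketch may help. It assumes the natural logarithm (the text does not state the base), and the read counts are hypothetical values, not data from this study.

```python
import math

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total          # relative abundance of this taxon
            h -= p * math.log(p)   # natural log is an assumption of this sketch
    return h

# Hypothetical OTU read counts (illustrative only)
even_sample = [25, 25, 25, 25]    # evenly distributed community
skewed_sample = [97, 1, 1, 1]     # community dominated by one taxon

print(shannon_index(even_sample))
print(shannon_index(skewed_sample))
```

An evenly distributed community yields a higher H′ than one dominated by a single taxon, which is the sense in which diversity is compared between AD and non-AD samples.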

AD and non-AD events were characterized by different bacterial taxa (Fig. 1). Firmicutes increased significantly compared with non-AD events (p < 0.05) and constituted the most dominant bacterial group during AD events (Fig. 1a). According to the NMDS plot, the airborne bacterial communities of the AD samples clustered together and were separated from those of the non-AD samples (Fig. 1b), indicating that AD events caused a significant shift in microbial community structure.

Figure 1

Relative abundance of airborne bacterial community structures between AD events and non-AD events (a) and non-metric multidimensional scaling (NMDS) ordination at the phylum level (b). Others indicate minor genus members with relative abundances <1.00%. *p < 0.05 (t-test in SAS v. 9.2).

These results imply that although the nature of aerosol bacterial populations is variable, most airborne bacteria during AD events may be associated with particle size and atmospheric environmental conditions. The significant correlation between bacterial diversity and PM10 concentration during AD events suggests that desert dust might be the source of airborne bacteria29. According to the backward trajectory analysis (Supplementary Fig. S2), air masses during AD events originated from the Gobi Desert and passed over China and the Yellow Sea to Seoul, carrying microorganisms with them. In contrast, air masses during non-AD events carried microorganisms from various directions near Korea. These results may support the view that the shift in airborne bacterial communities between AD and non-AD events is affected by the source of airborne bacteria and their transport pathways (Supplementary Fig. S2).

Screening of Potential Pathogenic Bacteria Candidates

The sequences obtained using pyrosequencing were extracted by alignment with reference sequences, and all sequences were assigned at the species level (Supplementary Table S1). Potential pathogenic bacteria belonging to Bacillus, Neisseria, Pseudomonas, Clostridium, Shigella, Acinetobacter, Ralstonia, and Staphylococcus were detected in non-AD samples (Fig. 2), suggesting the potential presence of bacterial hazards in urban bioaerosol environments, even though the 16S rRNA gene sequence is limited in its ability to accurately determine pathogenicity13,30. The relative abundance of potential pathogenic bacteria candidates increased significantly during AD events and was positively correlated with PM10 concentration (Supplementary Fig. S1c). Compared with non-AD samples, a significantly higher relative abundance of Bacillus (a potential pathogenic candidate) was detected in AD samples. In particular, B. cereus and B. licheniformis increased significantly (p < 0.05), suggesting their potential as AD-specific bacterial pathogen candidates (Fig. 2). Although B. licheniformis was identified as an AD-specific candidate pathogen, the primer information on its pathogenic gene is insufficient for quantitative examination. In contrast, sufficient primer information on the pathogenic gene of B. cereus has been established previously. Therefore, we selected B. cereus as the AD-specific candidate pathogen.

Figure 2

Relative abundance of potential pathogenic bacteria candidates among the total 16S rRNA gene sequence reads from pyrosequencing. * indicates p < 0.05 from t-test in SAS v.9.2.

The bceT gene copy numbers ranged from 3.27 × 10⁴ to 1.15 × 10⁵ gene copies/m3 during AD events (Table 1). The bceT gene copy numbers exhibited a trend similar to that of the relative abundance of potential pathogenic bacteria (Supplementary Fig. S1c) and were significantly higher during AD events (p < 0.05).

Assessment of Prediction Performance for AD Events

After demonstrating that airborne bacterial parameters, in particular bacterial hazards, increased significantly (p < 0.05) during AD events, we used AD-specific airborne bacterial parameters to evaluate whether the MLR and CART models could achieve good performance in reflecting AD event characteristics. According to the performance indexes, the CART approaches outperformed the MLR approaches (Table 2). Most airborne bacterial parameters yielded good correlations between predicted and real-time measured values in the CART model (Table 2). The estimates of the relative abundance of potential pathogenic bacteria, B. cereus populations, and bceT gene abundance for AD events displayed relatively good fits (R2 = 0.71–0.77) with the least bias and smallest RMSE (11.3–14.4) and MAE (7.25, 10.4) in the test set results (Table 2). CART and rule induction effectively reproduced variations in airborne bacterial parameters using on-site measurement data, in particular the relative abundance of B. cereus populations and bceT gene abundance during AD events (Table 2).
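
The performance indexes named above (R2, RMSE, and MAE) can be sketched in Python as follows. This is a minimal illustration that assumes R2 denotes the coefficient of determination (1 − SSres/SStot); the text does not specify whether that or the squared correlation coefficient was used. The observed and predicted values are hypothetical.

```python
import math

def regression_metrics(observed, predicted):
    """Return R2 (coefficient of determination), RMSE, and MAE for paired values."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")
    residuals = [o - p for o, p in zip(observed, predicted)]
    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all observed values are identical")
    return {"R2": 1.0 - ss_res / ss_tot, "RMSE": rmse, "MAE": mae}

# Hypothetical observed vs. predicted values (illustrative only)
metrics = regression_metrics([2, 4, 6, 8], [3, 3, 7, 7])
print(metrics)
```

Lower RMSE and MAE and higher R2 on the test set indicate a better fit, which is how the models are compared in Table 2.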

Table 2 Performance indicators for the developed predictive MLR and CART models.

Identification of Important Variables Associated with Airborne Bacterial Parameters

The CART and rule induction method has notable advantages in identifying independent variables that may significantly influence the dependent variables and in providing induced rules that link the independent and dependent variables23.

To induce rules relating the atmospheric environmental input variables to the target variables (airborne bacterial parameters), we performed a CART-based tree analysis. The final regression trees generated by rule induction, with the airborne bacterial parameters for each child node in the training dataset, are shown in Fig. 3 and Supplementary Fig. S3. With respect to the independent variables, the first split of the tree was made on PM10 (Fig. 3a). Fourteen datasets were clustered with PM10 concentrations ≥78.4 µg/m3, and the remaining twenty-four datasets were clustered with PM10 concentrations <78.4 µg/m3. The higher-PM10 datasets were further segregated by temperature (Fig. 3a). Figure 3b shows the tree constructed for the relative abundance of B. cereus. The first split of this tree was also made on PM10, and the resulting nodes were further segregated by relative humidity and temperature (Fig. 3b). All trees can be interpreted in the same way (Fig. 3, Supplementary Fig. S3). The individual parameters could also be ranked by their relative importance for airborne bacterial hazards (Supplementary Table S2). PM10, relative humidity, and temperature took precedence over the other parameters and were deemed essential for predicting the airborne bacterial hazard potential.

Figure 3

Determination of the relative importance of the predictor variables in the CART model for prediction of relative abundance of potential pathogens (a) and B. cereus (b), and bceT gene abundance (c) by binary regression tree analysis.

Discussion

Recently, climatic conditions in the East Asian region, such as scarce rain and droughts, have boosted the persistence of atmospheric bioaerosols1. Therefore, it is important to integrate this process into air quality modeling systems intended for air quality planning and assessment in order to assess impacts on human health31 and ecosystems32. Although it is recognized that dust particles contain pathogens, in most cases the potential hazards or risks associated with them are still largely unclear2. The pathogenic effect of dust inhalation can be attributed to the direct physical action of dust particles, and may be exacerbated by the toxic effects of biologically active compounds33. Although prediction accuracy was good overall, as shown in our study (Table 2), regression models such as MLR have certain limitations. For example, it is relatively difficult for them to reflect non-linear conditions, and multicollinearity among independent variables usually causes MLR to be inefficient32. Motivated by these limitations, we applied the CART and rule induction method to predict potential hazards of urban airborne bacteria during AD events. This approach successfully predicted airborne bacterial parameters, obtained from NGS-based screening and from qPCR of targeted toxin genes, from observed, real-time measurable atmospheric environmental parameters. This may be because the training datasets fit relatively well and reflected the relationships between airborne bacterial parameters and atmospheric environmental parameters. From these results, we suggest that the relationships between airborne bacterial parameters and atmospheric environmental parameters during AD events are captured reasonably well by the CART and rule induction method for predicting the potential bacterial hazard in urban areas.

Although the 16S rRNA gene sequence offers limited taxonomic resolution for identifying bacterial pathogens13,30, combining high-throughput sequencing and qPCR results can provide relatively high resolution34. Metagenomic approaches can be used to screen potential pathogens in AD samples, and the identified potential pathogens can subsequently be quantified by qPCR targeting their biomarkers34.

During AD events, biological concentrations significantly increase with PM10 concentrations, with accompanying differences in bacterial community structure. The high correlation of bacterial abundance with PM10 during the AD events (Table 1, Supplementary Fig. S1) and the backward trajectory results (Supplementary Fig. S2) in this study indicate that desert dust might be the source of airborne bacteria. However, there were no significant changes during non-AD events. These results indicate that the high concentration of bacteria during AD events was due to the large increase in the concentration of soil-derived particles, which contained higher bacterial concentrations1,2,3. The airborne bacteria from AD events may have mixed with indigenous airborne bacterial communities before reaching our sampling point, having traveled through industrial, agricultural, and urban areas5,6. As such, the suspended particle composition (e.g., PM10) may have been affected by the addition of local pollutants and physicochemical changes in the atmospheric environment during transport; therefore, the frequency of potential pathogenic bacteria may have increased during AD events, which could affect ecosystem and human health. PM10 always defined the first split of the tree, while temperature, relative humidity, and evaporation were important in predicting the airborne bacterial parameters in the rule induction (Fig. 3, Supplementary Fig. S3). PM10 is well established as an indicator of heavy air pollution, based on physical and chemical results and clinical evidence35. There is mounting evidence of the negative effects of bioaerosols associated with PM10 on ecosystems and human health36,37. However, the correlation between airborne bacterial parameters, including potential pathogens, and PM10 in urban areas during AD events is not well understood.

In our results, high PM10 concentrations were significantly correlated with potential pathogen indicators during AD events (Table 1, Supplementary Fig. S1). When the training datasets were used to predict bacterial abundance and diversity in the CART model, most PM10 split nodes fell between 65.3 and 70.8 µg/m3 (Supplementary Fig. S3). Meanwhile, the relative abundances of potential pathogens, B. cereus, and the bceT gene were split at higher PM10 concentrations (78.4 to 92.2 µg/m3) than bacterial abundance and diversity (Fig. 3), suggesting that the relative abundances of potential pathogens, B. cereus, and the bceT gene were more strongly affected by PM10 concentrations and AD events than by seasonal changes and local environmental effects. Our results thus point to PM10 concentrations between 78.4 and 92.2 µg/m3 during AD events as indicative of a relatively high risk. PM10 prediction has attracted special legislative and scientific attention due to its negative effects on human health38. Since these results could offer AD-specific bacteria or related environmental parameters for the implementation of a robust biosurveillance network, current air pollution policy may be further improved by taking into consideration the potential biological hazards during AD events.

Airborne bacteria growth is affected by relative humidity and temperature39. Temperatures above 24 °C decrease airborne bacterial survival39, while relative humidity of 70–80% has a protective effect on aerosolized bacteria40,41. The temperature during most AD events (13–17 °C) may have supported airborne bacteria survival; however, the relative humidity (40–50%) may have adversely affected survival. The CART approach reflected the characteristics of these heterogeneous atmospheric conditions during AD events better than descriptive statistics, and successfully identified key atmospheric parameters associated with AD events and airborne bacteria. Thus, although aerosol bacterial populations are variable, the airborne bacteria community during AD events might be associated with specific atmospheric conditions.

Endospore-forming bacteria (e.g., Bacillus) have been isolated from inter-continentally transported dust2,42,43. These high-tolerance bacteria could survive long-range dispersal and be efficiently transported by atmospheric dust1,2, shielded from inactivation by ultraviolet light and low relative humidity by attaching to crevices within coarse particles. The transport pathway (Supplementary Fig. S2) is also considered to represent a protective mode that allows for the survival of B. cereus in hostile environments. Numerous fungal, bacterial, and viral species have been found in desert dust samples2,42. Endotoxins and other biological compounds in PM10-2.5 from dust storms can activate inflammatory responses44,45. For example, in North Carolina ambient PM10-2.5 exacerbated the allergic response to airborne bacteria44, and in six European cities the PM10-2.5 fraction triggered the highest inflammatory effect45.

The correlation between bacterial abundance and particulate matter in the air is likely a result of the dependence of bacteria on coarse particles (e.g., PM10) rather than on fine particles (e.g., PM2.5)46. Thus, it is reasonable to combine molecular airborne bacterial community data with PM10 characteristics to investigate the distribution of, and changes in, airborne bacterial communities during AD events by resolving genetic diversity and populations. There are two reasons for excluding the possibility of a correlation between airborne bacterial communities and PM2.5. First, a large proportion of PM2.5 is produced via homogeneous processes in the atmosphere, with no direct association with pre-existing particles47. Second, any such correlation is potentially spurious, since coarse and fine particles are not significantly correlated, according to the study by Murata and Zhang46. PM2.5 usually does include some primary fine particles, and in East Asia an increase in coarse particles such as PM10 is commonly accompanied by an increase in fine particles. This is supported by the dependence of airborne bacteria on dust particles5,43.

This study quantified the independent effects of different PM10 fractions and included a wide range of PM10 concentrations on case and control days, which provided acceptable statistical power to detect relatively strong or weak effects while minimizing misclassification. Although machine learning and rule induction from small data sets make the modeling procedure difficult and prone to overfitting, there are many situations in which organizations must work with small data sets in environmental analysis48. Thus, it is worthwhile to start developing forecasting models with small forecasting-error variance and good accuracy based on small data sets. To avoid overfitting due to the use of a small data set, k-fold cross-validation and random sampling can be used in the CART model23,49. Previous studies reported that k-fold cross-validation and random sampling are useful when no test sample is available and the learning sample is too small to have a test sample removed from it49,50. Although we tried to decrease error and biased predictors, the relatively small training and test datasets in this study can still result in overfitting or misclassification. Therefore, further validation of our results is needed. Because recent studies have suggested that resampling and virtual data generation significantly improve predictive accuracy48,51, these techniques can be considered as alternative methods to mitigate the problems inherent in small data sets. Additionally, if a sufficiently large dataset were obtained to further test the feasibility of this approach, the concepts outlined in this study could have potentially broad applications in real-time forecasts. Our concept could be useful for designing the spatial distribution of monitoring networks to protect public health during AD events. In addition, it could provide a scientific reference for policy makers in developing future policies.

Material and Methods

Bioaerosol Sample Collection

We collected 55 air samples from 2011 to 2013 in Seodaemun-gu of Seoul, Korea, of which 16 were from the rooftop of the Seoul Air Monitoring Station in Bulgwang (37°61′31″N, 126°93′01″E) in 2011, and 39 were from the rooftop of the 3rd Engineering building of Yonsei University in Shinchon (37°33′42″N, 126°56′07″E) in 2012 and 2013. These sites are located about 10 km from each other in an urban area characterized by human activities without industrial complexes. All air samples were collected 20–30 m above the ground. Ten AD events occurred in Seoul, Korea in 2011 and 2013. All data were separated into AD (ten samples) and non-AD (45 samples) events based on the “Asian Dust Occurrence Reports” from the National Institute of Environmental Research (NIER), Korea.

Bioaerosol samples were collected with a high-volume air sampler (Thermo Scientific, MA, USA). Samples were collected for 24 h at air flow rates of 300–500 L/min on 8 × 10-in. track-etched polycarbonate membrane filters (0.2 µm pore size; Whatman, GE, USA). The filters were autoclaved before sampling, and the filter holder in the sampling apparatus was cleaned with 70% ethanol before each sampling event to avoid microbial contamination. After sampling, each filter was stored at −20 °C until DNA extraction.

DNA Extraction from Bioaerosol Samples

Genomic DNA was extracted using a FastDNA SPIN Kit for Soil (MP Biomedicals, OH, USA) following a previous method52, with slight modifications15. A negative control was included with every set of DNA extractions. These negative controls were treated in exactly the same way as the samples throughout the entire experimental process, including amplification and sequencing. The extracted DNA samples were stored at −20 °C until use.

Total Bacterial and bceT Gene Quantification in Bioaerosol Samples

The total number of bacterial 16S rRNA gene copies in each bioaerosol sample was measured using qPCR with an iQ5 Real-Time PCR Detection System (Bio-Rad, CA, USA). The total reaction volume was 20 µL, containing 1× SYBR Master Mix (Bio-Rad), primer sets (300 nM each), and 10-fold-diluted template DNA. The primers targeting the bacterial 16S rRNA gene and the bceT gene have been described previously53,54. Because bceT is the pathogenic gene in B. cereus and usually causes illness through the production of enterotoxin55, we used it to quantitatively examine the presence of potential pathogenic bacteria. PCR products of Escherichia coli W3110 and Bacillus cereus strain KACC 11240, at 1 × 10¹ to 1 × 10⁷ copies/reaction, were used as standard DNA templates to generate standard curves for quantifying the 16S rRNA and bceT genes, respectively. For the 16S rRNA gene, the thermal cycling conditions were as follows: 94 °C for 10 min, followed by 40 cycles of 94 °C for 15 s and 60 °C for 60 s. For the bceT gene, the thermal cycling conditions were as follows: 95 °C for 5 min, followed by 37 cycles of 95 °C for 10 s and 60 °C for 45 s. Gene copy numbers (per m3) were calculated as described previously56. For the qPCR run of each sample, triplicate reactions were performed with positive and negative controls. Melting curve analysis (Tm) was performed with 1 cycle of 95 °C for 15 s, 1 cycle of 60 °C for 20 s, and 1 cycle from 60 °C to 95 °C for 20 min.
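
The exact calculation of gene copies per m3 follows the cited reference56 and is not reproduced here. As a rough orientation only, the sketch below shows a generic standard-curve calculation of the kind commonly used in qPCR: a straight line is fitted to Ct versus log10(copies) for the standards, sample Ct values are converted to copies per reaction, and the result is scaled to the sampled air volume. All function names and parameters (elution volume, template volume, fraction of the filter extracted) are hypothetical assumptions, not values from this study.

```python
def fit_standard_curve(log10_copies, ct_values):
    """Least-squares line Ct = slope * log10(copies) + intercept for the standards."""
    n = len(log10_copies)
    mean_x = sum(log10_copies) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_copies)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log10_copies, ct_values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def copies_per_reaction(ct, slope, intercept):
    """Invert the standard curve to estimate copies in one reaction."""
    return 10 ** ((ct - intercept) / slope)

def copies_per_m3(copies_rxn, dilution, template_ul, eluate_ul,
                  filter_fraction, flow_l_min, hours):
    """Scale copies per reaction to copies per cubic metre of sampled air.

    All scaling parameters are hypothetical assumptions of this sketch.
    """
    air_m3 = flow_l_min * 60 * hours / 1000.0            # L/min over the run -> m3
    copies_in_eluate = copies_rxn * dilution * (eluate_ul / template_ul)
    return copies_in_eluate / filter_fraction / air_m3

# Hypothetical standards spanning 10^1 to 10^7 copies/reaction
log_copies = [1, 2, 3, 4, 5, 6, 7]
cts = [38.0 - 3.32 * x for x in log_copies]              # idealized Ct values
slope, intercept = fit_standard_curve(log_copies, cts)
sample_copies = copies_per_reaction(24.72, slope, intercept)
print(copies_per_m3(sample_copies, 10, 2.0, 100.0, 1.0, 400.0, 24.0))
```

A slope close to −3.32 corresponds to roughly 100% amplification efficiency, which is why that value is used for the idealized standards here.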

NGS Targeting Bacterial 16S rRNA Gene in Bioaerosol Microbial Communities

In this study, 454 FLX pyrosequencing was used to characterize microbial communities during AD and non-AD events. To provide PCR amplicons for pyrosequencing, the V4-V5 regions of the 16S rRNA gene were amplified with primers 563F/16 (5′-AYTGGGYDTAAAGNG-3′) and BSR926/20 (5′-CCGTCAATTYYTTTRAGTTT-3′) as described previously57. Forward primers included pyrosequencing adapter sequences and an 8-bp barcode to distinguish each sample in the pool of amplicons15. PCR was conducted with a C1000™ Thermal Cycler (Bio-Rad) as follows: 94 °C for 3 min, followed by 35 cycles of 94 °C for 1 min, 55 °C for 30 s, and 72 °C for 1 min, and a final extension at 72 °C for 5 min15. Negative controls subjected to the same process were included in each PCR run. Amplicons were pooled at equal concentrations, as measured using a NanoDrop 1000 spectrophotometer (Thermo Fisher Scientific, Wilmington, DE, USA), and PCR purification was performed using the MinElute PCR Purification Kit (Qiagen, CA, USA). Pyrosequencing was performed on a 454 GS-FLX Titanium Instrument (Roche, NJ, USA) at Macrogen (Seoul, Korea).

Quality control and taxonomic analysis of the 16S rRNA gene sequence reads were performed with the Mothur package v.1.30 according to Schloss’ SOP58. All sequence analyses were performed following our previous work15. The obtained sequences were separated according to their barcodes, and quality filtering was performed using the flowgram filtering method. Low-quality sequences with more than one mismatch to the barcode, two mismatches to the primer, or ambiguous nucleotides, as well as negative-control sequences, were discarded. Sequences were also removed if they contained homopolymers longer than 8 bp or were shorter than 300 bp59. Putative PCR-derived chimeras were removed with UCHIME via the chimera.uchime command in Mothur60. To remove or reduce PCR amplification and sequencing errors, sequences were denoised using the shhh.seqs command, the AmpliconNoise-based routine in Mothur61. After quality filtering, sequences were aligned with the SILVA reference database using the NAST algorithm58,62, and similar sequences (≥97% similarity) were clustered into operational taxonomic units (OTUs). Sequences were assigned to phylotypes using the RDP classifier63. Non-metric multidimensional scaling (NMDS) was performed using the vegan package in R to visualize the differences in taxonomic structure between AD and non-AD samples. The ordination was based on the Bray–Curtis dissimilarity measure computed from the binary matrix of the 55 air samples.
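
To make the dissimilarity measure underlying the NMDS ordination concrete, the following Python sketch computes Bray–Curtis dissimilarity between samples. It is an illustration only: the study used the vegan package in R, the NMDS step itself is not implemented here, and the presence/absence vectors are hypothetical. For binary (presence/absence) data, Bray–Curtis reduces to the Sørensen dissimilarity.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    if len(u) != len(v):
        raise ValueError("samples must have the same number of taxa")
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        raise ValueError("dissimilarity is undefined for two empty samples")
    return numerator / denominator

def pairwise_bray_curtis(samples):
    """Square dissimilarity matrix that an ordination method would take as input."""
    n = len(samples)
    return [[bray_curtis(samples[i], samples[j]) for j in range(n)] for i in range(n)]

# Hypothetical presence/absence profiles for three samples and four taxa
samples = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
]
for row in pairwise_bray_curtis(samples):
    print(row)
```

Values range from 0 (identical composition) to 1 (no shared taxa), so samples that cluster together in the NMDS plot are those with small pairwise dissimilarities.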

To screen for human pathogenic bacteria sequence candidates, representative 16S rRNA gene sequences of the bacterial genera OTUs were matched with the reference list of 16S rRNA gene sequences for known human pathogenic bacteria (Supplementary Table S1) from existing databases and studies11,13,64 using BLAST (blastn, cut-off identity ≥97%)65, and the first-cut screened sequences were matched again (identity >97%) using EzTaxon66 to identify bacterial 16S rRNA gene sequences similar to those of known pathogenic isolates.

Characteristics of Atmospheric Environmental Parameters

Daily atmospheric environmental parameter measurements were obtained from the NIER, Korea (http://www.airkorea.or.kr/) using fully automated and daily measurements of atmospheric environmental parameters (e.g., PM10, temperature, relative humidity, wind speed, duration of sunshine, evaporation, and surface temperature). Available atmospheric environmental parameter data were extracted from the NIER daily, and averaged over the sampling time. Where data were missing for particular atmospheric environmental parameters on a given day, the values from the remaining data were used to compute the average. Daily information was provided by the Korea Meteorological Administration (KMA) (http://web.kma.go.kr/eng/index.jsp). Descriptive statistics were calculated for each parameter using SAS v.9.2 (SAS Institute Inc., USA).
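
The handling of missing measurements described above (averaging over the values that remain) can be sketched as follows. The function name and the use of None to mark a missing reading are assumptions made for illustration; the readings are hypothetical.

```python
def mean_ignoring_missing(values):
    """Average the available readings; None marks a missing measurement."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # no usable data for this parameter and period
    return sum(present) / len(present)

# Hypothetical readings for one parameter over a sampling period
pm10_readings = [62.0, None, 70.0, 66.0]
print(mean_ignoring_missing(pm10_readings))
```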

Data Processing of Multiple Linear Regression and CART

Multiple linear regression (MLR) is one of the most widely used methodologies for modeling the dependence of a dependent variable on several independent variables17,18. In general, a linear regression model assumes that (a) the error term ε is normally distributed with a mean of 0, (b) the variance of the error term, σ2, is constant across cases and independent of the variables in the model, and (c) the value of the error term for a given case is independent of the values of the variables in the model and of the values of the error term for other cases, that is, the errors are uncorrelated. The regression model can thus be written as17:

$$y={b}_{0}+\sum _{i=1}^{n}{b}_{i}{x}_{i}+\varepsilon $$
(1)

where \({b}_{0}\) is the intercept, \({b}_{i}\) are the regression coefficients, \({x}_{i}\) are the independent variables, n is the number of independent variables, and ε is the stochastic error associated with the regression. The parameters were estimated using the least squares method.
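As an illustration of least squares estimation for Eq. (1), the following Python sketch solves the normal equations by Gauss-Jordan elimination. It is not the SAS procedure used in this study; the function name `fit_mlr` and the toy data are assumptions made for the example.

```python
from typing import List, Sequence


def fit_mlr(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Return [b0, b1, ..., bn] minimizing the sum of squared residuals."""
    rows = [[1.0, *x] for x in X]  # prepend a column of ones for b0
    p = len(rows[0])
    # Augmented normal equations: (A^T A | A^T y).
    aug = [
        [sum(r[j] * r[k] for r in rows) for k in range(p)]
        + [sum(r[j] * yi for r, yi in zip(rows, y))]
        for j in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[j][p] / aug[j][j] for j in range(p)]


# Toy data generated without noise from y = 1 + 2*x1 - 3*x2.
X_toy = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y_toy = [1, 3, -2, 0, 2]
coefficients = fit_mlr(X_toy, y_toy)  # approximately [1, 2, -3]
```

With real measurements the residuals are nonzero, and the returned coefficients are the values that minimize their sum of squares.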

CART is a nonparametric statistical technique developed by Breiman et al.23 that can solve classification and regression problems for categorical and continuous dependent variables. One notable advantage is that the models scale to large problems while remaining applicable to small datasets23. Beginning with the entire dataset, CART recursively partitions the data into two child nodes using all predictor variables23, and uses a stepwise method to establish the splitting rules23. Although seven single-variable splitting criteria are available, the Gini index is the default and usually performs best23.
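The idea of a Gini-based binary split can be sketched in a few lines of Python. This is a simplified illustration for a single continuous predictor and a categorical response, not the SAS Enterprise Miner implementation; the function names, PM10 values, and class labels are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(x: Sequence[float], labels: Sequence[str]) -> Tuple[float, float]:
    """Return (threshold, weighted Gini) of the best rule 'x <= threshold'."""
    n = len(x)
    best = (float("nan"), float("inf"))
    values = sorted(set(x))
    # Candidate thresholds lie midway between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [lab for xi, lab in zip(x, labels) if xi <= t]
        right = [lab for xi, lab in zip(x, labels) if xi > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical PM10 values with a two-class response.
pm10 = [40, 55, 60, 120, 150, 200]
risk = ["low", "low", "low", "high", "high", "high"]
threshold, impurity = best_split(pm10, risk)
```

A full CART model would repeat this search over every predictor and then recurse into each child node until a stopping rule is met.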

We included seven properties (PM10, temperature, relative humidity, wind speed, duration of sunshine, evaporation, and surface temperature) as independent variables and five properties (bacterial abundance, bacterial diversity, relative abundance of potential pathogenic bacteria, B. cereus, and bceT gene) as dependent variables in the MLR and CART models. In CART, the Gini index was used to determine the splits of the dataset. To evaluate model performance, we partitioned the data into training (70% of the dataset for each class) and testing (the remaining 30% of the entire dataset) datasets. The training dataset was used to find an optimal value from one or more predictors during CART model construction. The testing dataset was used to evaluate the optimal value by verifying the prediction accuracy for the dependent variables. We used SAS for MLR model learning and SAS Enterprise Miner v.9.2 (SAS Institute Inc.) for CART model learning. Ten-fold cross-validation was used to avoid model over-fitting23,67. In this study, the data were randomly divided into ten parts. We used nine of these parts to train the model and the remaining part to test model performance. We repeated this procedure nine more times, so that each of the ten parts served once as testing data. We then averaged the accuracy of the model on the testing samples over the ten runs to obtain a measure of the accuracy of MLR and CART.
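The ten-fold procedure can be outlined with the following Python sketch. The helper names (`k_fold_indices`, `cross_validate`), the fixed random seed, and the mean-predictor baseline are assumptions introduced for illustration; they stand in for the MLR and CART models fitted in SAS.

```python
import random
from typing import Callable, List, Tuple

Split = Tuple[List[int], List[int]]


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Split]:
    """Randomly divide indices 0..n-1 into k parts; return (train, test) pairs."""
    if k > n:
        raise ValueError("k must not exceed the number of samples")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [
        ([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
        for i in range(k)
    ]


def cross_validate(n: int, score_fold: Callable[[List[int], List[int]], float],
                   k: int = 10, seed: int = 0) -> float:
    """Average a per-fold score over the k train/test splits."""
    scores = [score_fold(train, test) for train, test in k_fold_indices(n, k, seed)]
    return sum(scores) / len(scores)


# Toy response values and a baseline that predicts the training mean.
y = [float(v) for v in range(20)]


def mean_baseline_mae(train: List[int], test: List[int]) -> float:
    mean_train = sum(y[j] for j in train) / len(train)
    return sum(abs(y[j] - mean_train) for j in test) / len(test)


cv_mae = cross_validate(len(y), mean_baseline_mae)
```

Each sample appears in exactly one testing part, so every observation contributes once to the averaged score.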

Model Performance Criteria

We evaluated the performance of the constructed MLR and CART models statistically by comparing the observed values of the dependent variables with the predicted values of the response, using the root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R2)18. Each performance criterion provides specific information regarding predictive performance18. RMSE is a quadratic scoring rule that measures the average magnitude of the error. It gives a relatively high weight to large errors; hence, it is most useful when large errors are particularly undesirable18. MAE measures the average magnitude of the error in a set of predictions without considering their direction. It is a linear score, implying that all individual differences between predictions and corresponding observed values are weighted equally in the average18. R2 is the best single measure of how well the predicted values match the observed values18. RMSE, MAE, and R2 are defined by the equations:

$${\rm{RMSE}}=\sqrt{\frac{{\sum }_{i=1}^{n}{({Q}_{pre}-{Q}_{obs})}^{2}}{n}}$$
(2)
$${\rm{MAE}}=[\frac{{\sum }_{i=1}^{n}|{Q}_{pre}-{Q}_{obs}|}{n}]$$
(3)
$${R}^{2}=1-\frac{{\sum }_{i=1}^{n}{({Q}_{obs}-{Q}_{pre})}^{2}}{{\sum }_{i=1}^{n}{({Q}_{obs}-{\bar{Q}}_{obs})}^{2}}$$
(4)

where Qobs = observed value; \({\bar{{\rm{Q}}}}_{obs}\) = the mean of the observed data; Qpre = predicted value; i = index of the observation; and n = number of points in the dataset. The best possible score is 1 for R2 and 0 for RMSE and MAE.
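For reference, the three criteria can be computed as in the Python sketch below. R2 is taken here in its conventional form, one minus the ratio of the residual sum of squares to the total sum of squares; the function names and the observed and predicted values are hypothetical.

```python
import math
from typing import Sequence


def rmse(obs: Sequence[float], pre: Sequence[float]) -> float:
    """Root mean square error between predicted and observed values."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pre)) / len(obs))


def mae(obs: Sequence[float], pre: Sequence[float]) -> float:
    """Mean absolute error between predicted and observed values."""
    return sum(abs(p - o) for o, p in zip(obs, pre)) / len(obs)


def r_squared(obs: Sequence[float], pre: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pre))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted values.
q_obs = [2.0, 4.0, 6.0, 8.0]
q_pre = [2.5, 3.5, 6.5, 7.5]
scores = (rmse(q_obs, q_pre), mae(q_obs, q_pre), r_squared(q_obs, q_pre))
```

Because every error in this toy example has the same magnitude, RMSE and MAE coincide; when a few errors are much larger than the rest, RMSE exceeds MAE.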