A novel approach combining self-organizing map and parallel factor analysis for monitoring water quality of watersheds under non-point source pollution

High content of organic matter in the downstream of watersheds underscored the severity of non-point source (NPS) pollution. The major objectives of this study were to characterize and quantify dissolved organic matter (DOM) in watersheds affected by NPS pollution, and to apply self-organizing map (SOM) and parallel factor analysis (PARAFAC) to assess fluorescence properties as proxy indicators for NPS pollution and labor-intensive routine water quality indicators. Water from upstreams and downstreams was sampled to measure dissolved organic carbon (DOC) concentrations and excitation-emission matrix (EEM). Five fluorescence components were modeled with PARAFAC. The regression analysis between PARAFAC intensities (Fmax) and raw EEM measurements indicated that several raw fluorescence measurements at target excitation-emission wavelength region could provide similar DOM information to massive EEM measurements combined with PARAFAC. Regression analysis between DOC concentration and raw EEM measurements suggested that some regions in raw EEM could be used as surrogates for labor-intensive routine indicators. SOM can be used to visualize the occurrence of pollution. Relationship between DOC concentration and PARAFAC components analyzed with SOM suggested that PARAFAC component 2 might be the major part of bulk DOC and could be recognized as a proxy indicator to predict the DOC concentration.


Results and Discussion
Fluorescence characterization of DOM. PARAFAC is considered as a robust analytical tool to discriminate DOM compositions from massive data of EEMs 20,21 . A five-component model was developed to explain the majority of fluorescence information from EEMs. Figure 1 shows the modeled component spectra of the five components. Component 1 had a peak at λ ex /λ em = 250/440 nm and a shoulder at λ ex /λ em = 330/440 nm. Fluorescence in this region is referred to as peak A (humic-like) based on Coble 9,10 or as Region III (fulvic acid-like) based on FRI technique by Chen, et al. 11 . Fulvic-like DOM is ubiquitous in natural water. Component 2 had a peak at λ ex /λ em = 230/300 nm, whose shape was different from component 1. It overlaps with the region of peak B (tyrosine-like) based on Coble 9,10 and Region I (aromatic protein) based on FRI technique by Chen, et al. 11 (2003). This type of DOM composition has been observed in biological processes during bloom periods 10 . Component 3 had a similar fluorescence shape to component 1 with a peak at λ ex /λ em = 290/490 nm. Fluorescence of component 3 had a similar location to peak C (humic-like) based on Coble 9,10 and fell into Region V (humic acid-like) based on FRI by Chen, et al. 11 . Component 4 had a similar spectral characteristics to that of peak T 1 (tryptophan-like) with the peak at λ ex /λ em = 280/330 nm and a shoulder at λ ex /λ em = 235/330 nm. The majority of component 4 located in Region IV is considered as soluble microbial product (SMP)-like by Chen, et al. 11 , which is frequently observed in waterways impacted by wastewater treatment plant (WWTP) effluents 28 . Component 5 had a peak at λ ex /λ em = 265/480 nm. Fluorescence in this region is referred to as peak A (humic-like) based on Coble 9,10 and as Region V (humic acid-like) based on FRI technique by Chen, et al. 11 . A summary table (Table S1) lists the characteristic peaks, type classified by methods by Coble 9,10 and Chen, et al. 11 , and the possible sources.
According to the methods for DOM fractionation developed by Coble 9,10 and Chen, et al. 11 , DOM pool could be divided into two categories: humic-like substances and protein-like substances. Humic-like substances comprise peak A, C 9,10 , or Region III, V 11 . Humic-like substances are ubiquitous in almost all natural waters 9,10,29,30 and are thought to originate from terrestrial organic matter from soils 31 . Humic-like fluorescence might be intensified by substantial surface runoff/lateral seepage input into ambient waterways caused by rainfall 25 . Protein-like substances comprise peak B, T 1 and T 2 9,10 , or Region I, II and IV 11 . Protein-like fluorescence is associated with microbially-derived organic matter 32 ; hence, the presence of protein-like fluorescence could be attributed to microbially-derived organic matter originating from agricultural and rural activities involving biological processes. Protein-like substances are also found in freshwaters affected by wastewater and in productive oceanic environments 10,30,33 . Moreover, Henderson, et al. 34 reported that additional peaks in protein-like region might originate from optical brightening agents used in paper brightening and household detergents which could be found in sewage-polluted waters 35 . Fluorescence as an indicator for NPS pollution. An approach introduced in 1980s for data mining, called SOM 36 which is a powerful computational tool classified as artificial neural networks, was employed to explore the considerable dataset for the fluorescence properties of DOM. SOM analysis was used to assist the PARAFAC results which is an alternative to peak-picking method to discriminate between fluorescence compositions from a massive dataset.
Sample distribution on SOM map is illustrated in Fig. 2. The SOM map is divided into two clusters according to fluorescence properties of DOM, with distinct fluorescence feature in each cluster. It is clear that the SOM map can be divided into two parts respectively in the vertical and horizontal direction. Horizontally, the SOM map can be divided into two types of water quality: the samples polluted by NPS in the bottom of the map, and the samples unpolluted in the top of the map. Compared with the samples located in the upper side of the map, the samples located in the bottom of the map consist higher content of DOM and fluorescence intensity. Vertically, the SOM map can be divided into two time periods: the samples collected in fall in the left side of the map, and the samples collected in spring and summer in the right side of the map. In spring and summer, fertilization contributed high amount of organic matter release from agricultural lands via runoff 6,25 , and the rainfall intensified the organic matter input into the surrounding waterways 37,38 . In fall, leaching of deposited straw and litter material contributed considerable organic matter to ambient waterways [39][40][41][42] . From the U-matrix of Fig. 2, we can see the color is a little darker on right hand side than left hand side. Thus, we concluded the right side of the SOM map exhibits a higher DOM content and fluorescence intensity compared with the left side of the SOM map because organic matter released more in spring and summer.
To combine the sample distribution ( Fig. 2), the hit histograms were applied to illustrate how many times each neuron was the winning neuron for the dataset of water samples. Each neuron (map unit) of the hit histogram ( Fig. 3) is corresponding to the neuron of the SOM map for sample distribution (Fig. 2). The difference between SOM map for sample distribution and hit histogram is that, each neuron in SOM map for sample distribution give the sample name of the most frequent best matching sample, standing for the several samples falling into this winning neuron with similar fluorescence properties, while each neuron in hit histogram gives the number of samples falling into the winning neuron. The neurons with higher number of hits represent more water samples with similar fluorescence properties. Accordingly, neurons with higher number in hit histogram reveal more typical fluorescence feature of DOM observed during the research. It can be demonstrated from Fig. 3a that the most typical map neurons (most typical fluorescence features) are located at the edges of the map. Furthermore, different colors in hit histogram reveal the difference between polluted and unpolluted water samples' organic matter fluorescence properties. Figure 3b shows a great distinction between polluted and unpolluted water sample properties that may be indicative of a NPS pollution.
Previous studies on monitoring pollution in surface waters and drinking water supply concluded that protein-like fluorescence peaks (e.g. peak B and T) are the best indicators for pollution 34 and peak C could be used as a supplementary pollution indicator 18,19 . Herein, a comparison between SOM analysis and peak-picking method is carried out to explore a better indicator for NPS pollution. We applied cluster analysis based on the values of peak B, T 1 , T 2 and C to examine whether peak-picking could be considered as a better indication for NPS pollution than SOM analysis. Supplementary Fig. S1 showed that each type of water (polluted or unpolluted) could not be consistently clustered into one category, for instance, A-Pol-1 and A-Pol-3 are clustered into a class with 9 unpolluted samples in the first stage. It can be inferred that there is no consistent picked peak fluorescence character within the 15 polluted DOM or within the 21 unpolluted DOM, in terms of peak B, T and C fluorescence. Accordingly, peak-picking method could not provide a better indication for NPS pollution than SOM analysis could.

. U-matrix (on left) and sample distribution map (on right) of SOM analysis.
In sample distribution, "A", "B", "C", "D" represent different sampling events in chronological order; "a" and "b" represent "unpolluted" and "polluted" respectively; the arabic numerals represent different sampling sites. The figures were created using MATLAB 7.0. pollution, the relationship between PARAFAC scores and EEM measurements was explored. Correlation between fluorescence intensities of PARAFAC component peaks and raw EEM measurements was analyzed to examine the effectiveness of fluorescence results as indicators for NPS pollution. Figure 4 shows the contour graphs of determination coefficients and regression coefficients from the regression analysis between PARAFAC intensities (F max ) for component [1][2][3][4][5]  In Fig. 4, the region where the determination coefficient (R 2 ) and the regression coefficient (m) are both closer to 1.0 (the intercept is zero) means more accurate and reliable prediction of fluorescence phenomenon in original EEM measurements using PARAFAC scores as proxy indicators. Additionally, the phenomenon that the reddest region is closer to the white cross in the left panels of Fig. 4 means more accurate and reliable prediction of fluorescence phenomenon in EEM measurements using PARAFAC components as proxy indicators. Accordingly, the phenomenon that R 2 and m equivalent to 1.0 are both located at the same point, viz, the white cross, is the best and ideal scenario for the prediction using PARAFAC model. For component 1 in Fig. 4, the R 2 and m at the peak point (λ ex /λ em = 250/440 nm) and shoulder point (λ ex /λ em = 330/440 nm) are both close to 1.0, indicating the position of component 1 peak is a good indicator for fluorescence DOM composition. For component 1 in the right panel, the region around the point that m is equivalent to 1.0 is a gentle slope, with a larger distance between two contour lines, meaning that little deviation in the fluorescence position during measurements would not significantly diminish the accuracy and reliability of prediction using PARAFAC scores as proxy indicators. However, for component 2 in the right panel, the region around the point that m is equivalent to 1.0 is a steep slope, with a small distance between two contour lines, meaning that the prediction using PARAFAC scores as proxy indicators is sensitive to the wavelength positions of EEM measurements. For component 3, the R 2 and m near the peak point (λ ex /λ em = 290/490 nm) are both close to 1.0, and the region encompassing the peak has a gentle slope. Accordingly, it is a good scenario to predict PARAFAC

Identification of raw EEM as proxy indicator for dissolved organic carbon (DOC) concentration.
To verify the hypothesis that several raw EEMs could be used as surrogates for labor-intensive water quality indicators, relationship between bulk DOC concentration and raw EEM was explored. Linear correlation between DOC concentration and raw EEM measurements was analyzed to inspect the effectiveness of effortless EEM measurement as surrogate indicators to predict routine and labor-intensive water quality indicators such as DOC concentration (Fig. 5). In Fig. 5 there exists a strong linear correlation (R 2 > 0.8) between DOC concentration and a region of fluorescence ex-em pairs (Fig. 5). The most reliable prediction namely highest R 2 value (R 2 > 0.8) was located within excitation 230 to 285 nm and emission 305 to 455 nm of EEM region. This region includes peak B, which was associated with tyrosine-like, and PARAFAC component 2. In the last section discussing reliability evaluation of Raw EEM measurements surrogate for PARAFAC EEMs, there is a strong correlation between raw EEMs and scores of PARAFAC component 2 in the region encompassing the peak of component 2. Accordingly, we can infer that there might be a significant correlation between DOC concentrations and scores of PARAFAC component 2.
Using optical properties as surrogates for labor-intensive routine water quality indicators has been studied for many years 10,25 . Absorption coefficients from absorption spectrum (e.g. ultraviolet absorbance at 254 nm (UVA 254 )) and fluorescence values from EEM spectrum (e.g. FDOM 370/460 ) have been shown to be reliable predictors of DOC concentration [43][44][45] . However, the determination coefficient between UVA 254 and DOC in this study (Fig. S2) did not show a better fit than the correlation between EEM and DOC (Fig. 5). Here, R 2 value is 0.70 for correlation between UVA 254 and DOC, lower than that between DOC and a region within raw EEM (excitation 230 to 285 nm and emission 305 to 455 nm) (Fig. 5,  Fig. S2), and that between DOC and PARAFAC component 2 or 5 (Table 1). Moreover, the correlation between UVA 254 and DOC concentrations was carbon source dependent so that different carbon source will show a different slope of linear regression. Distinct linear regressions between UVA 254 and DOC concentrations imply that different carbon sources have different chemical characteristics. The slope   of the linear regressions between UVA 254 and DOC concentrations is considered as specific ultraviolet absorbance at 254 nm (SUVA 254 ). SUVA 254 , in general, is proportional to the aromaticity of DOC (the amount of chromophore or aromatic carbon per unit of DOC) and has also been widely considered as a surrogate for indicating DBP precursors 8,46 . From the view of mechanism, a low SUVA 254 value for DOC indicates that few conjugated double bonds and aromatic carbon existed per unit DOC. In addition, using one fixed fluorescence peak value (e.g. FDOM 370/460 ) will bring bias to the prediction of DOC concentration, because the best wavelength location for fluorescence peak value to predict DOC will vary with different conditions (e.g. DOM source). In this dataset, the best DOC prediction location falls on λ ex /λ em = 265/310 nm, both of which emission and excitation wavelengths were shifted towards shorter wavelengths away from FDOM location. Therefore, fluorescence peaks used to predict DOC concentration or other water quality indicators are DOM source dependent and should not be fixed to several single EEM locations.
Relationship between DOC concentration and PARAFAC components. As mentioned above, 5 fluorescence components were obtained from PARAFAC. We further inspected the relationship between DOC concentration and PARAFAC components with SOM approach. The component planes for each variable of the SOM output are illustrated in Fig. 6. The component planes of the same clusters have a certain similarity, that is, if corresponding neurons' color trends are similar, there is a certain correlation between them. Results suggested that high DOC concentrations (> 6.01 mg L −1 ) are a response of high PARAFAC component 1 scores (> 0.474 Raman unit (RU)), high PARAFAC component 2 scores (> 0.523 RU), high PARAFAC component 3 scores (> 0.282 RU), high PARAFAC component 4 scores (> 0.380 RU), and high PARAFAC component 5 scores (> 0.380 RU), collectively (Fig. 6). Regression analysis indicated there were significant linear correlations between DOC concentration and the five PARAFAC components, and component 2 gives the best prediction (R 2 = 0.87). Incorporation of all the five components into the model resulted in a better fit (R 2 = 0.91) ( Table 1), suggesting that each of the five components contributed a part of the DOM to the bulk DOC, despite a weak correlation (R 2 = 0.19) between component 1 and DOC concentration.
The strongest relationship between DOC concentration and PARAFAC component 2 indicated that aromatic protein associated with peak B (tyrosine-like) contributed the greatest part to the bulk DOC. Since aromatic protein is autochthonous (microbially derived) DOM, it can be inferred that anthropogenic practice such as agricultural and rural NPS pollution contributed high content of autochthonous DOM. NPS pollution from agricultural lands via runoff or seepage contained soluble microbial products formed in the biochemical processes in agricultural fields (e.g. paddy fields), which could be a source of aromatic protein in DOM in samples. The aromatic protein is also known as a kind of DBP precursors 47 .

Sampling and Analyses.
To assess the effects of NPS pollution on water quality, samples were collected from six sites in the upstream of river and from four sites in the downstream of river over the whole year of 2014 (Fig. 7). The sampling dates were Apr 22, Jun 17, Sep 5 and Nov 2 respectively. Samples were collected over a 1-d period according to a synoptic sampling approach. A combination of depth integrating sampling and grab sampling was employed to collect river samples. As to unsafe sites, grab sampling was chosen. The river was well mixed due to high gradient and lack of point sources, so grab sampling was acceptable. Whole water samples were collected in polyethylene terephthalate (PET) bottles. Samples were 50 mL triplicates extracted in the laboratory from a 3 L sample. Samples were kept on ice and in the dark. Dissolved analytes were analyzed from samples filtered through precombusted 60-mm, 0.45-μ m nominal pore size GF/F filters. Laboratory experiments indicated no fluorescent leachates from the PET bottles during this period.
DOC concentration was determined with a MultiN/C2100TOC/TN analyzer of analytikjenaAG with a detection limit of 0.05 mg L −1 . Fluorescence EEMs were measured on filtered samples with an F-4500 fluorescence spectrophotometer (Hitachi, Shanghai) with a 5-nm band pass and 0.050-s integration time. Fluorescence intensity was measured at excitation wavelengths of 230 to 450 nm at 5-nm intervals and emission wavelengths of 300 to 600 at 5-nm intervals on room temperature samples (25 °C) in a 1-cm quartz cell. Inner filter corrections were applied to EEMs with ultraviolet absorbance at 254 nm (UVA 254 ) greater than 0.03 (1-cm cuvette) as described by Gu and Kenny 48 . Data Analysis. SOM approach. To visualize the cluster of sample distribution and the relationships between DOM bulk indicators and PARAFAC components, the SOM approach was performed with MATLAB (Version 7.00) software. The SOM is a competitive artificial neural networks based on unsupervised learning 49 , which requires merely SOM toolbox and some basic functions to achieve its function in MATLAB. The principle of SOM analysis can be found in many studies 50,51 . In this study, we developed two datasets to serve two objectives. Firstly, a dataset with a 36 × 1748 matrix was established, comprising 36 data samples and 1748 ex-em pairs as variables, in order to visualize the distribution and cluster of samples based on fluorescence properties. Secondly, a dataset with a 36 × 6 matrix was established, comprising 36 data samples and 6 variables including DOC concentration and five PARAFAC components' scores. For the first purpose, three-dimensional EEM of 36 samples were unfolded to two-dimensional vectors, where each row represents data sample and each column represents unfolded ex-em pairs. The sample distribution of SOM map and hit histograms were obtained for clustering of samples. For the second purpose, a series of component planes was obtained for visualization of correlation analysis. In the training section of SOM running, each neuron of input layer of SOM is associated with all input samples and has reference vector with SOM weights. The neuron weights were processed with linear initialization along the two greatest eigenvectors of the input matrix 36 . The ultimate size (10 × 3) of output SOM map was determined by the ratio of the two greatest eigenvalues of the input matrix. The output U-matrix visualized the distances between two map neurons, where the reddest U-matrix map units represent the border of clusters. The output component planes visualized the property distribution of samples, where similar component patterns indicate positive correlations.
PARAFAC analysis. To decompose the fluorescence signal into underlying individual fluorescence composition information, the PARAFAC analysis was performed with MATLAB (Version 7.00) software. PARAFAC analysis is a competitive technique for modeling and visualizing complicated multi-variate data 52 , which requires merely certain toolboxes and some basic functions to achieve its function in MATLAB. The basic principle of PARAFAC analysis is an alternating least-squares algorithm which decomposes the data into a set of trilinear terms and a residual array, and it can be found in many studies 20,52 .
PARAFAC model was derived for all samples using DOMFluor Toolbox for MATLAB with non-negativity constraints applied on all modes. The majority of Raman scatter was removed by subtracting the pure water spectrum from the sample spectrum. The first and second order scatter peaks were cut from EEM spectra and replaced with zeros. Two different split half analyses were run to inspect whether the model was validated. Tucker congruence coefficients 53 were used for comparing components between different PARAFAC models. Finally, a validated and fitted model was obtained, and a dataset comprising the fluorescence intensities of each component in each sample and the emission and excitation loadings of each component was exported.
To evaluate the potential for estimating DOC concentrations and PARAFAC scores from raw EEMs, the original measured EEM data were regressed against the DOC concentrations the maximum fluorescence (F max ) of each component obtained via PARAFAC. To each ex-em wavelength pair, we can get a 36 × 1 vector of raw EEM fluorescence intensities. This 36 × 1 vector was regressed against the 36 × 1 vector of DOC concentrations and 36 × 1 vector of PARAFAC scores of each component. Thus, regression coefficients (m) and determination coefficients (R 2 ) were obtained as a function of wavelength, which can be plotted as contour graphs.