Application of convolutional neural networks for prediction of disinfection by-products

Fluorescence spectroscopy can provide high-level chemical characterization and quantification that is suitable for use in online process monitoring and control. However, the high-dimensionality of excitation–emission matrices and superposition of underlying signals is a major challenge to implementation. Herein the use of Convolutional Neural Networks (CNNs) is investigated to interpret fluorescence spectra and predict the formation of disinfection by-products during drinking water treatment. Using deep CNNs, mean absolute prediction error on a test set of data for total trihalomethanes, total haloacetic acids, and the major individual species were all < 6 µg/L and represent a significant difference improved by 39–62% compared to multi-layer perceptron type networks. Heat maps that identify spectral areas of importance for prediction showed unique humic-like and protein-like regions for individual disinfection by-product species that can be used to validate models and provide insight into precursor characteristics. The use of fluorescence spectroscopy coupled with deep CNNs shows promise to be used for rapid estimation of DBP formation potentials without the need for extensive data pre-processing or dimensionality reduction. Knowledge of DBP formation potentials in near real-time can enable tighter treatment controls and management efforts to minimize the exposure of the public to DBPs.


Results
A dataset of DBP formation potentials and associated fluorescence EEM measurements were used to assess the capabilities of deep NNs and CNNs for water quality analysis. Water samples analyzed were from a pilot-scale treatment plant receiving river water. Samples were taken throughout a treatment train consisting of several unit processes including coagulation, flocculation, sedimentation, ozonation, advanced oxidation (peroxide and ozone), and filtration through anthracite or activated carbon. As such, the samples analyzed had a wide range of NOM concentrations and characteristics. Dissolved organic carbon varied from 2.6 to 6.3 mg L −1 , and specific ultraviolet absorbance varied from 0.75 to 2.53 L mg −1 m −1 over all samples. DBP formation potentials were determined by maintaining a free chlorine residual of 1.5 mg/L for 24 h. Although all four chlorinated or brominated THM and nine HAA species could be detected, only trichloromethane (TCM), bromodichloromethane (BDCM), trichloroacetic acid (TCAA), and dichloroacetic acid (DCAA) were consistently identified at concentrations above detection limits.
Multi-layer perceptron. An iterative optimization approach was used to understand the impact of NN structure on overall performance. While many aspects of network structure can be optimized, the focus in this work was on the number of hidden layers (i.e. depth). A multi-layer perceptron (MLP) network was trained with an increasing number of layers to identify the degree to which network depth can improve prediction accuracy. Figure 1 shows the total THM and HAA predictions results given the number of layers in a MLP. The error bars in Fig. 1 represent the standard deviation of 8 repeated random initializations of the network. A network with 0 hidden layers is simply the input values (dimensions = 5632) connected to 1 output node. When the number of layers was increased, each layer's nodes were set to half of the previous layer. For example, with two hidden layers, hidden layer 1 would have 2816 nodes, and layer 2 would have 1408 nodes. www.nature.com/scientificreports/ As observed in Fig. 1, MLPs with greater than 1 hidden layer did not improve network performance in predicting both total THMs and HAAs. Total THM mean absolute error was at a minimum with one hidden layer (MAE: 12.26 ± 0.34 µg/L; p < 0.02 compared to all other number of hidden layers) and total HAA error was also at a minimum with one hidden layer (MAE: 10.04 ± 0.33 µg/L, p < 0.03), although no significant difference between HAA performance with 1, 7, or 8 layers was found (p > 0.05). Similar results were observed for individual species (Table 1). Adding additional layers also came at the cost of increased variability between random network initializations. For example, the coefficient of variation (CV) increased from 3.3% for a one-layer MLP to predict total HAAs to 8.4% for 6 layers. Increased variability in performance could be due to the increased number of learnable parameters and the relatively small sample size used in this study. The structure chosen resulted in 15,870,977 trainable parameters with one hidden layer. Decreasing the number of nodes in 1 hidden layer, to effectively reduce the number of trainable parameters, demonstrated that reducing the number of nodes below 500 resulted in a decrease in performance and optimal performance for THMs was with 500 or 5000 nodes (p < 0.01 for all other comparisons) or 2500 for total HAAs (p < 0.01) (Fig. 1b). The lack of improvement of MLPs beyond one hidden layer is expected given the small data size and demonstrates that deep MLPs are unlikely to provide advantages in modelling small water quality datasets.
Convolutional networks. In contrast to MLPs, prediction accuracy was minimized with increasing CNN network depth (Fig. 2). Network depth was investigated by increasing the number of convolutional layers and the number of layer sets (convolution followed by max pooling). Convolutional layers provide learned filters or kernels that identify small features in the spectrum, while max pooling layers decrease dimensionality and pool redundant features 28,29 . Including 4-6 hidden convolutional layers without pooling layers was found to optimize prediction accuracy compared to 0-3 layers (p < 0.03 for all comparisons) for THMs and 5-6 layers was optimal for HAAs (p < 0.01) (Fig. 2a). The performance of models with 4-6 layers were not found to be significantly different for THMs (p > 0.13) or 5-6 layers for HAAs (p > 0.80). Increasing the number of layer sets also improved performance compared for THMs (Fig. 2b). A further decrease in error of 15.1% for total THMs was significant when both the number of convolutional and max pooling layers were increased (p < 0.01), but the more marginal increase in HAA performance (5.3%) was not significant (p = 0.07) ( Fig. 2c; Table 1).  www.nature.com/scientificreports/ It was also of interest to investigate the role of the size of receptive fields for each filter and the number of filters included in each convolutional layer. The receptive field size identifies the number of adjacent data points to be considered by each filter. Previous work in chemometrics has shown relatively large receptive fields to work well 32 and could identify features that span over large areas of the spectra, however, expanding the filter size increases the number of trainable parameters. Alternatively, by including convolutional layers in sequence, the receptive field's effective size is expanded, minimizing the number of trainable parameters and including additional layers of non-linearity 34 . As such, the results suggest that larger receptive fields could improve CNN performance. However, increasing the receptive field size beyond (3,3) for individual layers did not improve performance (p > 0.06 for sizes (3,3), (5,5), and (7, 7) for both THMs and HAAs) (Fig. 2e), and expanding receptive fields may be best accomplished by stacking convolutional layers in sequence. It is also of note that www.nature.com/scientificreports/ increasing the number of trained filters improved performance up to 24-36, after which no further changes were observed (p > 0.34) (Fig. 2d). An example of the learned filters are shown in Fig. 3. CNNs create hierarchical representations of data showing how specific irrelevant spectral features are discarded and specific areas of the spectra needed to predict DBP concentrations are magnified 30 . The initial filter layer identifies large and smooth and broad features in the spectra. After pooling, feature maps become more coarse and more distinct patterns between filters can be discerned, highlighting specific areas of the spectra. In the last layer of feature maps, many filters highlight one constant emission level over several excitation bands (left to right).
The comparison of MLP and CNN structures shows a marked performance improvement using a convolutional approach ( Fig. 2 and Table 1). Compared to the optimized MLP, a network with one single convolutional layer improved prediction accuracy by 54.6% for total THMs (p < 0.01; MAE reduced by 6.69 µg/L) and 55.9% for total HAAs (p < 0.01; MAE reduced by 5.61 µg/L) ( Table 1). Improvements were also significant for individual species (39.1-61.5% reduction in MAE). In general, the most significant decrease in error was observed using CNNs, followed by adding several pooling layers. While adding multiple convolutional layers between pooling layers did further reduce error, the gains were more minor (MAE difference 0.25-1.48 µg/L) but statistically significant (p < 0.02). However, adding stacked convolutional layers resulted in a slight increase in error for total HAA prediction (− 0.23 µg/L). As such, the pooling of features and reduction of dimensions is likely key to best interpreting EEMs. It is hypothesized that the success of using pooling layers is due to building in dimensionality reduction and reducing the number of nodes in the final fully-connected layer.

Model explanations.
The objective of identifying model explanations was to confirm that that model predicts high or low concentrations of DBPs based on fluorescence features that are known or possibly associated with DBP precursors. There are scattering signals (i.e. not from organic material) or other potential artifacts from the sample analysis process that would bias the model to "know" concentrations of DBPs for incorrect reasons. The second objective was to identify fluorescence regions most highly associated with specific DBP formation potentials. This information could be used to further understanding of the characteristics of DBP precursors and potentially optimize treatment processes that preferentially remove compounds with those characteristics. www.nature.com/scientificreports/ An occlusion method was applied to identify spectral areas that most significantly influence prediction accuracy. The occlusion method identifies spectral regions most relevant to a prediction by randomly occluding or setting a segment of all inputs in a specified region to 0. The error incurred due to this occlusion indicates how relatively important that specific area is to accurate predictions. The error was calculated as the difference between non-occluded and occluded predictions, and the direction or sign of the error was preserved. As such, positive values indicate that the model underestimated DBP formation with a specific patch occluded, and negative values indicate overestimated DBP formation. A total of 20,000 iterations of random patches per model were chosen to build the heat maps. A random approach to selecting the patch was taken to reduce any bias from neighbouring values since the variables included in each patch would change between iterations. Figure 4 shows the average heat maps identified from training deep CNNs on total DBPs and individual species. Likewise, Fig. 5 shows heat maps based on MLPs.
From occlusion heat maps of variable or spectra area importance (Fig. 4), it was observed that fluorescence in the area of approximately ex: 225-260 nm and em: 370-500 nm was most impactful of prediction accuracy for all DBPs (both total and individual species). A second common area of importance at ex > 300 nm and em > 400 nm was also observed. Fluorescence in these two regions is generally considered to be humic-like and fulvic-like material 35 . Several heat maps also show areas of importance in protein-like fluorescence regions (excitation:  www.nature.com/scientificreports/ 230-250 nm, emission: 300-360 nm) associated with tryptophan-like or tyrosine-like fluorescence 35 . Spectral importance in these regions conforms well to expectations of DBP precursor type material that can fluoresce, generally thought to be aromatic humic-like or fulvic-like material 35 . The heatmaps provide evidence that the NNs are utilizing signals from regions that are reasonable for DBP prediction. Previous DBP prediction methods based on fluorescence data have utilized the same spectral regions 16,17,36 . Compared to MLPs, CNN heat maps show broader areas of importance with more gradual changes (Figs. 4 and 5). Gradual changes conform with the expectation of fluorescence signals from fluorophores, and sharp changes are not typically associated with fluorescence from natural organic matter 15 . Furthermore, CNN heat maps emphasize higher excitation bands. Particularly for prediction of trihalomethanes, peaks at excitation > 300 nm were important for positive predictions, while MLP heat maps placed less importance on these areas. Several CNN heat maps show signals that have several excitation peaks, with limited changes in emission. For example, the BDCM heat map shows peaks at approximately 250 nm and 330 nm, with emissions constant at 450 nm. Multiple excitation peaks at one singular emission conform well with expectations of fluorescence from individual fluorophores, where multiple wavelengths can cause excitation, however, the emission is always from the lowest singlet state and therefore only at one wavelength 37 . The identification of areas of importance at several emission bands suggests several distinct fluorophores contributing to DBP formation potential rather than individual components.
It can also be observed that there is greater continuity of areas of importance with individual DBP species and the total levels. Since the individual species should sum to total concentrations, total THM and total HAA heat maps would be expected to show similar characteristics to the individual species. Ex/em 350 nm/380 nm is observed in CNN heat maps for total THMs (Fig. 3a) and TCM (Fig. 3b). The secondary peak for total THMs at approximately ex/em 325 nm/450 nm is mirrored in the BDCM heat map (Fig. 3c). Similar conformance was not observed with MLP heat maps, for example, the areas of highest importance for BDCM (Fig. 4c) was not present in the total THM heat map (Fig. 4a). However, while some overlap is present between CNN heat maps of species and the total DBP levels, it should be noted that not all peaks are mirrored (e.g. BDCM peak at ex/em 250 nm/450 nm not seen in total THM map).
Individual species of DBPs showed differences between CNN heat maps. BDCM areas of importance were shifted to higher excitation and emission areas compared to TCM. A similar pattern can be seen between DCAA and TCAA. Identified differences in spectral areas between individual species were expected since preferential yields of specific by-products from pure model compounds have pointed to certain molecular structures resulting in the preferential formation of individual DBP species 38,39 . A shift to the greater importance of fluorophores at emissions > 450 nm could indicate BDCM and DCAA formation resulting from humic-like material with greater oxygen/carbon ratios and lower hydrogen/carbon ratios, implying an oxidation state ≥ 0 40 .
A second notable difference is the increased importance of protein-like material (ex/em 230-250 nm/300-350 nm) for HAA predictions. This peak location is typically associated with aromatic amino acids such as tryptophan and tyrosine 35 . Previous studies show that aromatic amino acids 41 and protein-like fluorescence signals strongly correlate with HAA formation potentials 20,42,43 . In particular, the protein-like peak was observed to be most prominent for the prediction of TCAA. This observation conforms well to previous results that show higher TCAA formation than DCAA from aromatic amino acids that would contribute to the observed fluorescence signal 41 . From the MLP heat map of TCAA, regions surrounding the expected aromatic amino acid peak are positive. However, there is a strong negative relationship in the specific location of tryptophan fluorescence (ex/em 230 nm/340 nm).

Discussion
This study investigated the use of deep CNNs to interpret fluorescence spectra and predict the formation of regulated chlorination DBPs from a drinking water treatment plant. The observed results indicate that deep CNNs are well suited to the task of interpreting fluorescence excitation-emission matrices and prediction of DBPs for several reasons: (1) overall prediction accuracy for all DBP groups and species were significantly reduced compared to MLP and previous modelling approaches using dimensionality reduction, (2) results from random initializations were less variable using deep CNNs compared to MLP and shallow CNNs, (3) deep CNN heat maps show trained networks utilize data from spectral regions that are well known to be associated with DBP formation potentials, and (4) compared to MLPs, CNNs show heat maps with characteristics more conformant with expectations of fluorescence from organic precursor material.
Compared to previous work that utilized dimensionality reduction prior to regression, the use of CNNs significantly improved the accuracy of prediction. Two commonly used dimensionality reduction methods, PARAFAC, PCA were applied for comparison as well as a factor regression using 3-way partial least squares (PLS) ( Table 1). The components identified by these methods are discussed in more detail in a previous article using the same dataset 18 . In all cases, significant reduction of prediction accuracy was achieved using CNN architectures, particularly for HAA prediction. It is not straightforward to compare results to other studies given the variation in number of samples, methods for formation potential determination, range of concentrations in the training/test sets, and performance metrics. However, results found in this study also represent improvements over HAA and THM formation predictions previously reported using similar performance metrics (e.g. total THM MAE 13.5 µg/L and total HAA MAE 7.7 µg/L) 16 . Furthermore, the previous approaches to utilize fluorescence data for DBP predictions that have relied heavily on dimensionality reduction to identify relevant fluorescence features add complexity to the analysis process. In contrast, deep CNNs present an opportunity to utilize full fluorescence spectra without the need for manual or highly supervised feature selection through peakpicking, regional integrations, or PARAFAC analysis. It is thought that deep CNNs provide an opportunity for complex behaviours to be represented by several simpler representations. Observation of feature maps produced www.nature.com/scientificreports/ by convolutional layers show a hierarchy of feature representations, with general and smooth representations in high layers and progressively coarser and more specific highlighted spectral areas as increasing numbers of convolution and pooling layers are applied. Neural networks are often discussed as black-box type algorithms, where the internal reasoning is unknown or difficult to illustrate. However, it is imperative that the logic of prediction algorithms used in applied tasks, such as prediction of potentially toxic disinfection by-products, is discernible. There is also an opportunity to use these powerful data-driven approaches to help identify important variables or characteristics of the system. The use of heat maps generated from an occlusion approach to identifying spectral areas that highly influence predictions gives insight into the decision-making process and helps confirm that trained networks are relying on data from spectral regions associated with DBP precursors. Furthermore, heat maps can help direct future more detailed studies investigating the characteristics of precursor material.
As such, the use of fluorescence spectroscopy coupled with machine learning techniques, such as deep CNNs, show promise to be used for rapid estimation of DBP formation potentials. In the context of typical regulatory thresholds for water treatment (total THMs < 80 µg/L, total HAAs < 60 µg/L) the presented methodology produced error levels (MAE 3.39-5.53 µg/L) that would be appropriate for rapidly informing operations and management regarding conformance with regulatory thresholds. Knowledge of DBP formation potentials in near realtime can enable tighter treatment controls and management efforts to minimize the public's exposure to DBPs.

Methods
Water samples. Water samples were obtained from parallel pilot treatment trains that were fed Otonabee River water (Peterborough, Ontario, Canada). Samples were obtained throughout the treatment train for fluorescence analysis and for determining DBP formation potentials. Processes applied included coagulation, flocculation, sedimentation, ozonation, advanced oxidation (peroxide and ozone), and filtration through anthracite or activated carbon. Further information on the pilot-scale set-up and water samples can be found in Peleato et al. 44 .

Fluorescence.
A total of 140 fluorescence spectra were collected using an Agilent Cary Eclipse fluorescence spectrophotometer (Mississauga, Canada). Excitation and emission wavelength ranges were 225-380 nm (5 nm increments), and 250-600 nm (2 nm increments), respectively. This resulted in fluorescence spectra with dimensions of 32 by 176, or 5632 total excitation/emission pairs. The fluorescence spectra were blank subtracted using Milli-Q ® water. The spectrum for Milli-Q ® water was also used to apply Raman corrections at an excitation wavelength of 350 nm and bandwidth of 5 nm to allow fluorescence intensities to be reported in Raman Units (RU) 45 . Absorbance values collected over the excitation-emission range were used to correct for inner-filter effects. Rayleigh scattering lines were removed by setting all values above 2nd order Rayleigh or below 1st order Rayleigh to 0. The absorbance corrected spectra were then scaled between 0 and 1 for each excitation/emission pair for input into MLPs or CNNs. Dimensionality reduction. PCA and PARAFAC were applied to identify a lower dimensional representation of fluorescence spectra. The two methods and resulting components are described further in Peleato et al. 18 . Briefly, PCA was carried out using R (V 3.2.5) on fluorescence spectra that were vectorized, mean centered and scaled to unit variance with respect to excitation/emission pairs. PARAFAC was applied using the drEEM toolbox for MATLAB and following the methodology described by Murphy et al. 24 . The PARAFAC model was validated using split-half validation where model consistency was checked between randomized dataset halves. The number of components for a valid PARAFAC model was found to be 5 and this number of components was also used to decide on the number of principal components used in regression. The loading plots of both PCA and PARAFAC are described and visualized in Peleato et al. 18 . The scores of each of the 5 components from PCA and PARAFAC were used as inputs into a MLP with 1 hidden layer and 100 nodes for regression.
3-way PLS was carried out using the PLS toolbox for MATLAB (Eigenvector Research). The normalized spectra used for both PARAFAC and PCA were utilized and the number of components was set to 5 for consistency with the dimensionality reduction models.
Neural networks. All NNs were trained in Python 3.6 using the Keras library (v2.3.1; TensorFlow v1.15.0 backend). Hardware used was a Intel ® Xeon ® E2286G CPU and a NVIDIA GeForce RTX 2080. Training of each iteration of all models took less than 20 min.
Two general types of NNs were investigated: MLPs where there is a weighted connection between every node in subsequent layers, and CNNs. The number of nodes in each hidden layer of a MLP was defined as half of the previous layer. For example, with two hidden layers, hidden layer 1 would have 2816 nodes and layer 2 would have 1408.
A general schematic of the CNN structure is shown in Fig. 6. Convolutional layers involved training a set of 2D filters or kernels, which are weighting functions multiplied with input values in a specific spatial window. 2D kernels were chosen to capture the 2D structure of the fluorescence excitation-emission spectra. Filters are smaller than the input dimensions and are slid across the entire input to produce feature mapping of the input. Since one filter or weighting function is used for the whole input space, fewer trainable parameters are needed than MLPs. It also gives rise to feature invariance since the trained filter can identify a feature in any position www.nature.com/scientificreports/ of the input space. Max pooling layers look for the maximum value within a spatial window and then uses that maximum value to represent the output over that spatial window, effectively reducing dimensionality. For more details on the mathematics of CNNs and the training process, see LeCun et al. 46 and Goodfellow et al. 29 . CNN layers were considered as a set of convolutional layers followed by max pooling. Max pooling layers reduce the dimensionality of the features and provide an effective way to create hierarchies of detailed to general features. Several convolutional layers were set in series for some models, as this has been shown to provide an effective receptive field, which may provide advantages in chemometric applications 32 , while minimizing the number of trainable parameters 34 . Following convolutional layers, the structure was flattened, where pooled feature maps are vectorized into a dense hidden layer followed by a single output node. A varied number of structures were investigated to identify changes in performance based on depth (number of layers), size of convolutional filters (spatial window), number of max pooling layers, and number of convolutional filters. A summary of these structures is presented in Table 2. Common between all structures and types of NNs was the use of batch normalization to speed up training 47 , followed by activation using an exponential linear unit (ELU) activation function (Eq. 1). The ELU function was chosen based on reports of greater learning rate and generalization for deep networks 48 . ELUs also avoiding issues with 'dead' nodes common with activations such as rectified linear units (ReLU) that have zero gradients below inputs of 0.
All networks were trained with a mean squared error loss function coupled with L2 regularization to prevent overfitting (Eq. 2). The Adam optimization algorithm was used for all training.
where, y i is the network output for sample i (prediction), y i is the true value for sample i, N is the total number of samples, is a hyperparameter to control L2 regularization (set to 0.01), w are all the network weights.
Prediction accuracy was determined on a test set (20% of all data, n = 28) that was not used for training the network. The metric used to assess predictive performance was mean absolute error (MAE), primarily since it provides a metric in the same units used in analysis and is more easily interpreted. Statistical significance was determined using paired t tests comparing mean and standard deviation of MAEs with a confidence level of 95%.
Occlusion method. An occlusion approach was used to generate heat maps of spectral areas that most influence prediction accuracy. After training a network, test data was modified by iteratively setting a spectral area or patch equal to 0. These occluded or corrupted training samples are then fed through the network to produce a prediction of DBP concentration ( Table 3).
The difference between initial predictions and occluded predictions provided an estimate of the importance of the occluded patch. If initial predictions and occluded predictions are identical or close, the trained network is not relying on that spectral area to estimate DBPs. On the other hand, if the error is high, the occluded region is influential on the accurate prediction of DBP levels.
(1) ELU : f (x) = x, x > 0 e x − 1, x ≤ 0 y i − y i 2 + w 2 Figure 6. General schematic of convolutional neural network structure. The number of convolutional layers as well as the number of layers (convolution + max pooling) can be varied. Table 2. Descriptions of the general model types used. n refers to the number of layers and is varied based on model depth.