Clustering micropollutants and estimating rate constants of sorption and biodegradation using machine learning approaches

Lim, Seung Ji; Seo, Jangwon; Seid, Mingizem Gashaw; Lee, Jiho; Ejerssa, Wondesen Workneh; Lee, Doo-Hee; Jeong, Eunhoo; Chae, Sung Ho; Lee, Yunho; Son, Moon; Hong, Seok Won

doi:10.1038/s41545-023-00282-6


Article
Open access
Published: 28 October 2023


npj Clean Water volume 6, Article number: 69 (2023)


Abstract

Effluent from wastewater treatment plants is considered an important source of micropollutants (MPs) in aquatic environments. However, monitoring MPs in effluents is often inefficient owing to the wide variety of MP types. Thus, this study derived marker constituents to estimate the behavior of MPs in each cluster using the self-organizing map (SOM), a machine learning-based clustering analysis method. In the SOM analysis, the physicochemical properties, functional groups, and initial biotransformation rules of 29 out of 42 MPs were used, with the ultimate aim of estimating the degradation rate constants of the remaining 13 MPs. When the physicochemical properties and functional groups were considered, the SOM analysis showed outstanding performance in labeling MPs, with an accuracy of 0.75 under both aerobic and anoxic conditions. Based on the clustering results, 11 MPs were determined to be marker constituents under each of the aerobic and anoxic conditions. Moreover, an estimation method for the rate constants of unlabeled MPs was successfully developed using the identified markers with the random forest classifier. The proposed algorithm could estimate both the sorption and biotransformation of MPs regardless of whether sorption or biotransformation was the dominant removal mechanism. An accuracy of 0.77 was obtained for estimating rate constants under both aerobic and anoxic conditions, which is remarkably higher than the accuracies reported previously. The proposed procedure could be extended further to efficiently monitor MPs in effluents.


Introduction

Domestic and industrial chemicals, such as pharmaceuticals, personal care products, steroids, estrogens, pesticides, and surfactants, have become essential for modern living. More than 350,000 chemicals were registered and used worldwide from 2010 to 2019¹. Global medicine consumption in 2020 was estimated at 4.5 trillion doses². Many of these chemicals flow into wastewater treatment plants (WWTPs) but are not completely degraded there^3,4. These undecomposed chemicals can adversely impact aquatic ecosystems when discharged from the WWTPs⁵. To assess the impact and potential risks of these unwanted chemicals, called micropollutants (MPs), frequent and accurate monitoring of effluent from WWTPs is crucial. However, periodic monitoring of MPs is expensive and labor-intensive.

Rather than monitoring individual MPs, monitoring groups of MPs is more efficient because a representative value for each group reduces the number of samples to be analyzed^6,7. For example, one study selected caffeine, which is widely present in food, drinks, and pharmaceuticals, as a marker for evaluating the degree of aquatic ecosystem contamination by untreated wastewater⁸. Because caffeine is generally removed with high efficiency during wastewater treatment (>99%), its concentration in untreated wastewater is orders of magnitude higher than that in treated water from WWTPs. Therefore, rather than detecting individual MPs, monitoring the caffeine concentration alone was sufficient for assessing anthropogenic contamination⁹. Similarly, the monitoring of MPs in WWTPs would be significantly simplified if a representative marker for each group of MPs were selected before analyzing the concentrations of all MPs.

Several clustering analyses have been attempted to identify similarities among MPs and increase the prediction accuracy of their behavior. One example is the dendrogram, which is often employed to generate a graphical representation of the trends of biodegradation rate constants associated with solid retention time^10,11. Recently, the clustering of MPs based on initial biotransformation rules predicted by the Eawag pathway prediction system (Eawag-PPS) was also introduced^12,13. Although these clustering methods have merits (e.g., a wide range of applications or explainable clustering), they have limitations that need to be addressed. For instance, the dendrogram provides insufficient information for interpreting the clustering results because it represents them on a one-dimensional graph. Clustering with the biotransformation rules is more explainable, but it often shows insufficient prediction accuracy because it lacks detailed chemical characteristics such as functional groups¹⁴.

To address the aforementioned limitations of conventional clustering analyses, this study proposes a novel approach that determines markers from the clustering results and then uses classification and the identified markers to estimate the rate constants of MPs, namely the biological degradation rate constant (k_bio) and the sorption coefficient (K_d). Once clusters are established, a marker MP in each cluster can potentially provide information on the behavior, i.e., the rate constants, of the other MPs in the same cluster. For this purpose, the self-organizing map (SOM) and random forest classifier (RFC) methods were applied sequentially to a dataset of 42 MPs consisting of physicochemical properties, functional groups, initial biotransformation rules, and rate constants obtained under aerobic and anoxic conditions. The SOM, an unsupervised neural network algorithm, was chosen because it combines clustering and dimensionality reduction with superior visualization of the input features¹⁵. The RFC, a supervised algorithm, was adopted because of its remarkable inference performance for tabular datasets^16,17. The RFC was utilized to classify the dataset according to the input features, i.e., the physicochemical properties, functional groups, and initial biotransformation rules. Hence, the objectives of this study were to: (1) propose an appropriate clustering method for MPs using clustering analysis, (2) determine marker constituents for each cluster aggregated by the SOM, (3) classify MPs using the RFC based on physicochemical properties, functional groups, and biotransformation rules, and (4) estimate a range of rate constants for unlabeled MPs. The results suggest that this approach provides a good framework for monitoring the fate of MPs and can be used as an efficient and effective tool to reduce the monitoring overheads in WWTPs.

Results and discussion

Removal of micropollutants under aerobic and anoxic conditions

Figure 1 shows the removal efficiency of 42 MPs after 24 h of incubation. Under both aerobic and anoxic conditions, the removal efficiency of most MPs was less than 5% in the control experiment. These findings suggest that abiotic processes such as hydrolysis are unlikely to be involved in the transformation of MPs^18,19. Likewise, volatilization is not considered a main removal route, because the Henry’s constants of the MPs were very low²⁰. Conversely, adsorption was primarily relevant within 1 h for selected MPs, which is in line with the results of previous studies^21,22. More than 30% (32–57%) of parabens (methyl paraben, ethyl paraben, propyl paraben, and butyl paraben), estrogens (estrone and estriol), diclofenac, and atorvastatin were removed by sorption onto sludge. However, the removal efficiency of most MPs through sorption was less than 14% (Fig. 1).

**Fig. 1: Removal efficiency of MPs under aerobic and anoxic conditions.**

Distinct differences were observed in the biodegradation of the individual MPs. For example, ibuprofen, naproxen, caffeine, metformin, gemfibrozil, and acetaminophen were almost completely removed under aerobic conditions (Fig. 1a). The removal of these MPs primarily resulted from biodegradation, which is consistent with previous findings^21,23,24. Atorvastatin, parabens, and estrogens were also completely removed (Fig. 1a) through sorption and biodegradation, which accounted for 36–70% and 30–64% of the removal, respectively. On the other hand, antibiotics (sulfathiazole, sulfamethazine, sulfamethoxazole, trimethoprim, and lincomycin), carbamazepine, atrazine, clofibric acid, and N,N-diethyl-meta-toluamide were poorly removed (sorption: less than 5% and biodegradation: up to 40%). The poor removal of these MPs is consistent with the findings of Ternes et al.²⁰ and Joss et al.²⁵ and is mainly ascribed to the low reactivity of their functional groups²⁶.

Propranolol exhibited negative removal under aerobic conditions (Fig. 1), most likely due to back-transformation of propranolol and deconjugation²⁷. Within 24 h, approximately 70–85% of atenolol, ranitidine, iopromide, cimetidine, and gemfibrozil were removed, mainly by biodegradation. The total removal efficiency of diclofenac was approximately 74% (Fig. 1a). However, sorption (47%) was more effective than biodegradation (27%) for the removal of diclofenac. This is in line with previous studies demonstrating that sorption onto sludge is a predominant route of diclofenac removal^22,28,29. Perfluorinated compounds were removed less efficiently (7.5–37.3%), except for perfluoropentanoic acid, which showed negative removal. The biological removal efficiencies of N-nitrosamines varied significantly, as shown in Fig. 1a, with N-nitrosomorpholine (5.7%) and N-nitrosodibutylamine (75%) exhibiting the lowest and highest removal efficiencies, respectively. The wide range of total removal efficiencies of 12.6–79.4% (sorption: less than 6% and biodegradation: 5.7–75%) in this study is analogous to the results of previous studies (>10–94%)^30,31,32. Nitrosamines with acyclic groups, such as N-nitrosodibutylamine, were more biodegradable than those with alicyclic and morpholine groups³¹.

Metformin was almost completely biodegraded under anoxic conditions, whereas MPs such as ranitidine, iopromide, and acetaminophen were removed with varying degrees of efficiency (62–85%). Parabens, estriol, and estrone were also significantly removed under anoxic conditions, most likely due to sorption (Fig. 1b). The biodegradation efficiency of β-blockers (atenolol, metoprolol, and propranolol) and trimethoprim was 45% higher under anoxic conditions than under aerobic conditions. In contrast, corrosion inhibitors (1-H-benzotriazole and 4-methyl-1H-benzotriazole), gemfibrozil, diclofenac, ibuprofen, naproxen, and caffeine were only marginally removed (7.9–28.6%). Alvarino et al.³³ and Mazioti et al.³⁴ reported that the degradation of these MPs was only observed under aerobic conditions, depending on the activity of nitrifying bacteria. Perfluoropentanoic acid had a negative removal efficiency in this study, while the other MPs had less than 35% removal efficiency via biodegradation under anoxic conditions (Fig. 1b). The negative removal of perfluoropentanoic acid could have resulted from the transformation of other perfluorinated compounds present in the sludge samples³⁵.

Overall, most results of the batch experiments under both conditions were consistent with previously reported results. According to the Nash–Sutcliffe model efficiency coefficient (NSE) results, the pseudo first-order degradation model can predict the majority of biodegradation rate constants of MPs under aerobic and anoxic conditions, except for those of perfluoropentanoic acid, perfluorobutanesulfonate, atrazine, and nitrosamines. The k_bio of MPs ranged from 0 to 2.3 L g⁻¹ h⁻¹ and from 0 to 1.8 L g⁻¹ h⁻¹ under aerobic and anoxic conditions, respectively (Table 1). As shown in Supplementary Fig. 1, however, most of the MPs showed higher k_bio values in the aerobic process than in the anoxic process. The sorption coefficients were very similar under aerobic (0–0.44 L g_MLSS⁻¹) and anoxic conditions (0–0.5 L g_MLSS⁻¹), which agreed well with previously reported ranges^25,28,36,37,38,39,40.

Table 1 Sorption coefficient and biological reaction rate constant of MPs in aerobic and anoxic processes estimated through the dynamic modeling approach.


Machine learning models combining clustering and classification

Machine learning models that combine clustering and classification are proposed in this study (Fig. 2). The clustering analysis was used to assign labels to unlabeled data, and these labels were then used during classification⁴¹. The dataset consisted of the physicochemical properties, functional groups, initial biotransformation rules, and rate constants of 42 MPs (Supplementary Table 1). The dataset was randomly divided into two parts: 29 MPs (70%) for training and validation through cross-validation and 13 MPs (30%) for testing. Hereafter, the clustering scenario based on the physicochemical properties and functional groups is abbreviated as PF, and the scenario based on the initial biotransformation rules is abbreviated as BT.
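The 29/13 split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the compound names are placeholders, and the seed and the function name `split_compounds` are assumptions introduced here.

```python
import random


def split_compounds(names, train_fraction=0.7, seed=0):
    """Shuffle compound names and split them into train/validation and test sets."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)  # fixed seed only for reproducibility
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers standing in for the 42 MPs of the dataset.
compounds = [f"MP{i:02d}" for i in range(1, 43)]
train_val, test = split_compounds(compounds)
print(len(train_val), len(test))  # 29 13
```

With 42 compounds and a 70% training fraction, truncation yields 29 compounds for training and validation and 13 for testing, which matches the counts reported in the text.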

**Fig. 2: Schematic diagram of the combination of clustering and classification algorithms.**

Clustering analysis and determination of marker constituents

The distance maps derived from the SOM are illustrated with different colors according to the relative distance between each neuron of the map (Figs. 3a and 4a). The MPs assigned closely in the distance map filled with similar colors were interpreted as MPs having analogous characteristics (Figs. 3b and 4b). The radius of sectors represents the relative importance of each input feature to cluster MPs. On the other hand, the MPs with remarkably different features were in the separate neurons with dissimilar colors. The solid lines determined by Ward’s method indicate the boundaries dividing each cluster. The marker constituents among MPs are indicated with superscripts (A) for aerobic and (AN) for anoxic conditions, respectively (Figs. 3 and 4).

**Fig. 3: Clustering results for physicochemical properties and functional groups of MPs.**

**Fig. 4: Clustering results for initial biotransformation rules of MPs.**

Clustering analysis based on physicochemical properties and functional groups

Recent research has found that the fate of MPs is influenced by physicochemical properties such as the octanol-water partition coefficient and accessible functional moieties^10,20,26. Hence, we first assessed the suitability of physicochemical properties and functional groups for clustering MPs (Fig. 3). Using Ward’s method, MPs having similar input features were clustered into 11 clusters with the lowest Davies–Bouldin index (DBI) in the PF scenario (0.49).
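For readers unfamiliar with the DBI, the sketch below computes it for a given labeling using its standard definition (average over clusters of the worst ratio of summed within-cluster scatter to centroid separation); lower values indicate more compact, better separated clusters. The four two-dimensional points are invented for illustration and do not come from the study, and the sketch assumes at least two clusters with distinct centroids.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def davies_bouldin(points, labels):
    """Davies-Bouldin index; requires >= 2 clusters with distinct centroids."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    cents = {lab: centroid(pts) for lab, pts in clusters.items()}
    # Scatter: mean distance of the members of a cluster to its centroid.
    scatter = {
        lab: sum(math.dist(p, cents[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Hypothetical feature vectors: two tight pairs far apart from each other.
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = davies_bouldin(pts, [0, 0, 1, 1])  # groups the nearby points together
bad = davies_bouldin(pts, [0, 1, 0, 1])   # mixes the two groups
print(good, bad)
```

Selecting the number of clusters then amounts to repeating the clustering for several candidate counts and keeping the one with the lowest index, which is how the value of 0.49 for 11 clusters is described above.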

Because nitrosamines commonly contain amine and amide functional groups, they were clustered together, as shown in the upper-left side of Fig. 3a, b. However, N-nitrodiphenylamine and N-nitrosomorpholine were assigned to different clusters because of their diphenylamine and morpholine functional groups, respectively. Carbamazepine and N,N-diethyl-meta-toluamide were also grouped together with N-nitrodiphenylamine and N-nitrosomorpholine because they contain amine, amide, and aromatic ring functional groups. The MPs having nitrogen- and sulfur-containing functional groups, such as sulfathiazole, sulfamethazine, ranitidine, and cimetidine, were assigned to one cluster in the lower-left corner of Fig. 3a. This clustering result is in line with previous studies in which MPs with a sulfonamide functional group were aggregated in the same cluster¹⁴ and sulfamethazine and sulfathiazole were closely located in the dendrogram on the basis of biodegradation rate¹¹. The parabens were clustered in the same unit because of their high log K_ow values and functional groups, i.e., ester and aromatic ring. The long alkyl ester chain and high log K_ow value are the distinctive properties of parabens, which make them readily susceptible to sorption and biodegradation⁴². Although estrogens do not have an ester functional group in their structure, parabens and estrogens were assigned to the same cluster because of their similar log K_ow values and their shared alcohol and aromatic ring functional groups (upper-right corner of Fig. 3a, b). The MPs located in the lower-right corner of Fig. 3a, b have a halogen-containing functional group in common. Perfluorohexanoic acid and perfluoropentanoic acid were clustered separately from clofibric acid and iopromide because their structure contains a fluorinated carbon chain rather than an aromatic ring. Similar clustering results can be found in a previous study in which perfluorinated compounds were grouped in the same cluster because of their fluorinated carbon chain structure⁴³.

In summary, the clustering result represented in the SOM map (Fig. 3a) was interpretable using the physicochemical properties of each MP (Fig. 3b). In the figure, the MPs on the left side have relatively low molecular weights or log K_ow and consist of nitrogen-containing functional groups (i.e., amine and amide) compared with the MPs on the right side. The MPs having the aromatic ring functional group were in a diagonal direction (lower left to upper right), and the MPs with the chain structure were positioned at each corner, in the upper left and lower right. Other MPs containing sulfur and halogen atoms in their functional groups aggregated in the clusters at the bottom of the distance map. One limitation of the clustering result in this study was the uneven distribution of MPs in each cluster due to the lack of available MP data. This limitation should be overcome in future studies by increasing the number of MPs included in the analysis.

Clustering analysis based on biotransformation rules

Perfluorinated compounds, N-nitrosodimethylamine, and N-nitrosopyrrolidine were excluded from this analysis because their initial biotransformation rules could not be predicted using Eawag-PPS. When the SOM clustered the MPs under the BT scenario, the algorithm generated 15 clusters (DBI = 0.87). MPs that most commonly followed the same one to three initial biotransformation rules were aggregated in the same cluster (Fig. 4). 1-H-benzotriazole was grouped together with clofibric acid because their biotransformation was mainly initiated by aromatic ring dihydroxylation (bt0005) (right-middle area of Fig. 4a, b). Atenolol and iopromide were aggregated in the same cluster because their biodegradation likely occurred through H-abstraction from side chains (bt0002) and demethylation or dealkylation of the ether group (bt0023) (lower-left area of Fig. 4a, b). These results were not consistent with a previous report demonstrating that atenolol and 1-H-benzotriazole were tied to the same cluster when the elimination rates, instead of the biotransformation rules, were used as input features⁴³. Because sulfathiazole and sulfamethazine contain a sulfonamide functional group, which can be biodegraded through hydrolysis and bond cleavage in the sulfonamide group (bt0144), they were aggregated in the same cluster. This is in line with previous studies showing that MPs having the sulfonamide functional group were aggregated in the same cluster^10,14. Among the nitrosamine compounds, the biodegradation of N-nitrosodiethylamine and N-nitrosomethylethylamine mainly resulted from the monohydroxylation of a methyl group (bt0334); hence, they were aggregated in one cluster.

Estimation of rate constants using the proposed algorithms and markers

The feasibility of the proposed algorithms and derived marker constituents was evaluated by classifying unlabeled MPs and then estimating the range of rate constants for each MP. In this study, the role of a marker is to provide representative information regarding the rate constants of the MPs in each cluster. Therefore, the marker was designated as the MP having the minimum Euclidean distance from the mean of the rate constants in each cluster. When an unlabeled MP is classified into a specific cluster, the ranges of its rate constants can be calculated using the rate constants of the marker of that cluster, ${K}_{d,m}$ and ${k}_{{bio},m}$. The rate constants for unlabeled MPs, ${K}_{d,u}$ and ${k}_{{bio},u}$, are estimated to lie within the following ranges:

$$K_{d,m} - N\sigma_{K_d} \le K_{d,u} \le K_{d,m} + N\sigma_{K_d} \tag{1}$$

$$k_{bio,m} - N\sigma_{k_{bio}} \le k_{bio,u} \le k_{bio,m} + N\sigma_{k_{bio}} \tag{2}$$

where ${\sigma }_{{K}_{d}}$ and ${\sigma }_{{k}_{{bio}}}$ indicate the standard deviation of sorption coefficient and biodegradation rate constant obtained from the MPs in each cluster, respectively. N was set to one, two, and three in this study. The estimation accuracy was calculated by counting the numbers of MPs that lie within the range calculated using Eqs. (1) and (2) (Table 2).
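The marker selection and the range check of Eqs. (1) and (2) can be sketched as below. Several details are assumptions made for illustration and are not confirmed by the text: the population standard deviation is used, the Euclidean distance is computed on unscaled (K_d, k_bio) pairs, an MP counts as correctly estimated only when both constants fall inside their ranges, and all names and numerical values are hypothetical.

```python
import math
import statistics


def pick_marker(cluster):
    """cluster maps MP name -> (K_d, k_bio); return the MP nearest the cluster mean."""
    kd_mean = statistics.fmean([kd for kd, _ in cluster.values()])
    kbio_mean = statistics.fmean([kb for _, kb in cluster.values()])
    return min(cluster, key=lambda name: math.dist(cluster[name], (kd_mean, kbio_mean)))


def rate_constant_ranges(cluster, n_sigma):
    """Marker value +/- N standard deviations for K_d and k_bio (Eqs. 1 and 2)."""
    kd_m, kbio_m = cluster[pick_marker(cluster)]
    sd_kd = statistics.pstdev([kd for kd, _ in cluster.values()])    # assumption: population SD
    sd_kbio = statistics.pstdev([kb for _, kb in cluster.values()])
    return (
        (kd_m - n_sigma * sd_kd, kd_m + n_sigma * sd_kd),
        (kbio_m - n_sigma * sd_kbio, kbio_m + n_sigma * sd_kbio),
    )


def in_range(ranges, kd, kbio):
    """True only if both constants lie inside their ranges (an assumption)."""
    (kd_lo, kd_hi), (kb_lo, kb_hi) = ranges
    return kd_lo <= kd <= kd_hi and kb_lo <= kbio <= kb_hi


def estimation_accuracy(cluster, unlabeled, n_sigma):
    """Fraction of unlabeled (K_d, k_bio) pairs that fall inside the ranges."""
    ranges = rate_constant_ranges(cluster, n_sigma)
    return sum(in_range(ranges, kd, kb) for kd, kb in unlabeled) / len(unlabeled)


# Hypothetical cluster and two hypothetical unlabeled MPs assigned to it.
cluster = {"A": (0.10, 0.2), "B": (0.20, 0.4), "C": (0.30, 0.9)}
unlabeled = [(0.25, 0.6), (0.25, 0.8)]
print(pick_marker(cluster))                       # B
print(estimation_accuracy(cluster, unlabeled, 1))  # 0.5
print(estimation_accuracy(cluster, unlabeled, 2))  # 1.0
```

Widening the range from one to two standard deviations captures the second hypothetical MP, which mirrors why the reported estimation accuracy rises as N increases from one to three.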

Table 2 Performance of clustering (DBI) and classification (accuracy, f1-score, precision, and recall) approaches and estimation performance of rate constants range using markers for each clustering scenario.


In a preliminary simulation carried out to design this study, a random forest regressor (RFR) alone was employed to directly predict the degradation rate constants (Supplementary Fig. 2). The coefficient of determination (R²) for the degradation rate constants in the test step was lower than 0.5 regardless of input features and operating conditions (Supplementary Figs. 3 and 4). The RFR model overfitted: its prediction accuracy in the training step (R²: 0.78–0.90) was significantly higher than that in the test step (R²: −0.08–0.45). Because the machine learning approach combining the SOM and RFC performed better than the RFR model alone, the SOM and RFC were utilized in this study. In the training and validation steps, the classification accuracy (0.75) and f1-score (0.61) of the PF scenario were significantly higher than those of the BT scenario (accuracy: 0.43 and f1-score: 0.32). In the test step under the aerobic condition, the algorithm using the PF scenario estimated the range of rate constants with an accuracy of 0.38 using one standard deviation and the rate constants of the marker of each cluster. In contrast, one standard deviation was insufficient to estimate the range of rate constants in the BT scenario (0.10). Even when the estimations were made within three standard deviations, the best estimation accuracy of the BT scenario (0.40) was lower than that of the PF scenario (0.77). Similarly, under the anoxic condition, the estimation accuracy of the PF scenario (0.46–0.77) was higher than that of the BT scenario (0.30–0.40). Collectively, the PF scenario showed higher performance in the classification of MPs and the estimation of rate constants than the BT scenario. The better clustering results can explain the higher classification and estimation accuracies of the PF scenario. For example, the DBI value for the PF scenario (0.49) was roughly half of that for the BT scenario (0.87), implying that the clustering in the PF scenario was better organized than that in the BT scenario.

Applicability of the proposed algorithm to different microbial community data

We further conducted simulations using previously reported aerobic experimental data to evaluate the applicability of the proposed machine learning algorithm to data from a different microbial community¹⁰. In this case, the dataset consisted of 42 MPs that were mainly degraded through biotransformation rather than sorption. The proposed algorithm was retrained using the physicochemical properties, functional groups, and biotransformation rules to estimate the rate constants of the MPs in the reported dataset. As expected, the proposed algorithm was able to classify the MPs and estimate their rate constants for the different microbial community. Interestingly, in this case, the BT scenario (0.72) showed a higher classification accuracy than the PF scenario (0.62). Consequently, the estimation performance using the BT scenario (0.69) was also higher than that of the PF scenario (0.62) (Supplementary Table 2). This superior estimation accuracy under the BT scenario is likely because the selected MPs in the literature dataset follow the biotransformation rules well, as stated¹⁰. For our experimental datasets, however, using only the biotransformation rules as input features led to a lower estimation performance of rate constants compared with the PF scenario. This can be ascribed to the fact that sorption is indirectly accounted for under the PF scenario, which considers the physicochemical properties and functional groups of MPs, but not under the BT scenario. As a result, the rate constants could not be estimated precisely under the BT scenario.

Comparison of model performances with previous studies

Compared with the models proposed in previous studies, the proposed algorithm exhibited comparable classification performance and superior accuracy when estimating the range of rate constants of MPs. For example, a previous model based on meta-analysis accounted for only 17% of the variability in the removal efficiencies of the targeted MPs⁴⁴, which is lower than the performance of the PF scenario under the aerobic condition within one standard deviation (0.38). In another study employing hierarchical clustering and multivariable analysis, the estimation accuracy for the complete dataset was only 0.19 owing to the unpredictable characteristics of biodegradation¹⁴. A recent study proposed an RFC to classify MPs into two classes (fast or slow biotransformation) with classification accuracies of 0.95 for the predicted biotransformation rules and 0.78 for the observed biotransformation rules⁴⁵. This classification accuracy is similar to that of the present study. Importantly, the present study enabled a direct estimation of the range of rate constants of unlabeled MPs, whereas the previous study could only classify whether the MPs were biodegraded slowly or rapidly.

Overall, the superior estimation accuracy of this proposed machine learning algorithm suggested two noteworthy findings. First, the markers represented each cluster successfully, particularly when the physicochemical properties and functional groups of each MP were employed during the model training. Second, the markers derived from the proposed algorithm were used to estimate the range of rate constants for unlabeled MPs in the test dataset with relatively high accuracy, using only their physicochemical properties and functional groups as input features. In summary, the proposed machine learning approach could be employed to estimate the sorption and degradation rate of unlabeled and emerging MPs based only on the physicochemical properties and functional groups rather than measuring time-course change of their concentration to estimate the fate of MPs. The proposed machine learning approach trained with sufficient process operational and experimental data could reduce the labor and expenses required for monitoring MPs. Thus, monitoring only the marker MP could reduce the cost of measuring each MP concentration. As with other machine learning techniques, one important prerequisite for successfully applying this machine learning model is to secure sufficient data to train the model. With sufficient data, the grouping and positioning of MPs with SOM could become more refined while improving the accuracy of predictions with RFCs.

Methods

The details of the activated sludge, reagents, and chemicals used in this study are provided in the Supplementary Information (See Supplementary Note 1). Unless otherwise noted, all experiments were conducted using synthetic wastewater (SyWW). The detailed composition of SyWW is presented in Supplementary Table 3.

Batch experiments

The biodegradation of 42 MPs was evaluated under aerobic and anoxic redox conditions. These MPs were chosen because of their frequency of occurrence, persistence, and negative impact on aquatic life. The agitated batch reactor setups are presented in Supplementary Note 2 and Supplementary Fig. 5. Each 3 L batch reactor was filled with approximately 2.2 L of SyWW and 0.8 L of activated sludge. A cocktail of the 42 MPs was spiked into the reactors at a final concentration of 0.1 mg L⁻¹. The concentrations of mixed liquor suspended solids (MLSS) and mixed liquor volatile suspended solids (MLVSS) were maintained at 3 g L⁻¹ and 1.8 g L⁻¹, respectively, in all experiments. The pH and the water temperature were kept at 7 and 22 °C, respectively, throughout the experiment. An 11 mL aliquot of the sample solution was collected from the reactors at 0, 10, 20, and 30 min and at 1, 2, 4, 8, 12, and 24 h. Control experiments without sludge were also performed to verify the abiotic transformation of MPs, with samples collected at 0 and 24 h. To investigate the effect of adsorption on MP removal (sterile control), samples were collected at 0, 10, 20, and 30 min and at 1 h from a reactor spiked with sodium azide (3 g L⁻¹) to suppress microbial activity. Prior to analysis, all samples were filtered using a 0.2 μm syringe filter (Whatman), fortified with internal standards (50 ng mL⁻¹), and immediately stored in a freezer at −20 °C.

Analysis of micropollutants

Nitrosamines were analyzed using gas chromatography coupled to low-resolution mass spectrometry (GC-LRMS; 6890N GC system, Agilent Technologies, USA). The details of the procedure and validation of the GC-LRMS method have been reported elsewhere⁴⁶. The other 35 MPs were monitored using an ultra-high performance liquid chromatography (UHPLC) Vanquish system (Thermo Scientific, San Jose, USA). The system consisted of a cooling auto-sampler, a column oven enabling temperature control, an ultra-high pressure solvent delivery pump, and an automatic degasser. Chromatographic separations of the samples were performed using a Cortecs C18 column (100 × 2.1 mm, 1.6 μm, Waters Co., Milford, MA, USA). The column temperature was set at 45 °C, and the injection volume was 3 μL with a flow rate of 0.3 mL min⁻¹. The mobile phases were 0.1% hydrofluoric acid in high-performance liquid chromatography (HPLC) grade water (Solvent A) and methanol (Solvent B). The gradient elution consisted of 40–70% B from 0 to 0.5 min and 70–100% B from 0.5 to 6.5 min, followed by a 1 min hold and a 4 min re-equilibration to the starting conditions. Internal standards were used for the quantification of analytes, and a ten-point calibration curve was constructed over a concentration range of 0.1 to 100 ng mL⁻¹. Details of the optimization and validation of the UHPLC-MS/MS methods are described in Supplementary Note 3 and Supplementary Tables 4 and 5.

Pseudo first-order degradation models

A pseudo first-order degradation kinetic model (Supplementary Note 4), which has frequently been used to describe the fate of MPs based on the results of lab-scale batch experiments^25,38,47, was adopted in this study. The model assumed fast sorption that reached equilibrium immediately, because an instant reduction in the soluble MP concentration was observed. Other degradation or removal mechanisms, such as volatilization, were not considered. The performance of the model was evaluated using the NSE (Supplementary Note 5). Within the scope of this study, the pseudo first-order degradation model considering k_bio and K_d can effectively describe the kinetics of MPs.
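The exact model equations are given in Supplementary Note 4 and are not reproduced in this section. The sketch below therefore assumes one common formulation, in which sorption equilibrates instantly (total concentration equals the soluble concentration times 1 + K_d·X_SS) and biodegradation is first order in the soluble concentration with rate k_bio·X_SS. The NSE is computed from its standard definition. The values of k_bio and K_d are hypothetical; the sampling times, initial concentration, and MLSS follow the batch experiments described above.

```python
import math


def soluble_concentration(t_h, c0, k_bio, k_d, x_ss):
    """Soluble MP concentration at time t_h (h) under the assumed model.

    c0: spiked total concentration (mg/L); k_bio: L g^-1 h^-1;
    k_d: L g^-1; x_ss: suspended solids (g/L).
    """
    s0 = c0 / (1.0 + k_d * x_ss)                # instant sorption equilibrium
    rate = k_bio * x_ss / (1.0 + k_d * x_ss)    # effective first-order rate (1/h)
    return s0 * math.exp(-rate * t_h)


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance


# Sampling times of the batch experiments, in hours.
times = [0, 10 / 60, 20 / 60, 30 / 60, 1, 2, 4, 8, 12, 24]
# Hypothetical rate constants; c0 = 0.1 mg/L and MLSS = 3 g/L as in the Methods.
simulated = [soluble_concentration(t, 0.1, 0.1, 0.1, 3.0) for t in times]
print(nse(simulated, simulated))  # 1.0 for a perfect match
```

In a fitting workflow, k_bio and K_d would be adjusted until the NSE between measured and simulated concentrations is as close to one as possible; the fitting procedure itself is not shown here.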

Machine learning approaches using clustering and classification for micropollutants

In Step 1, the SOM, followed by Ward’s method, was applied to the training and validation datasets to cluster MPs in a reduced dimension by mapping the high-dimensional data onto a two-dimensional grid. Ward’s method draws the decision boundaries that separate the clusters generated by the SOM (Supplementary Note 6). The optimum number of clusters was determined by evaluating the Davies–Bouldin index (DBI) (Supplementary Note 7). Step 1 aims to assign labels to MPs for which no appropriate grouping rules yet exist. The MPs in the same cluster were considered to have similar functional groups or biodegradation rules. The labels derived in this step were used to train the classification algorithm in Step 2. The marker for each cluster was determined after identifying the number of clusters with the minimum DBI. The marker MPs are the representative MPs of each cluster and were used in Step 3 to estimate the degradation rate constants of the unlabeled MPs in the test dataset. Two clustering scenarios were designed to find suitable input features for clustering MPs: (i) clustering based on the physicochemical properties and functional groups (e.g., octanol-water partitioning coefficient and ether, ester, and amine functional groups; Supplementary Table 1) and (ii) clustering based on the initial biotransformation rules predicted by Eawag-PPS (Supplementary Table 1)^12,13.
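
To make the DBI criterion concrete, the Python sketch below computes the index from its standard definition (the mean, over clusters, of the worst ratio of summed within-cluster scatter to between-centroid distance) and picks the candidate labeling with the smallest value. It is a standard-library illustration, not the study's implementation: the function names and the toy points are hypothetical, and in the actual workflow the candidate labelings would come from cutting the Ward hierarchy on the SOM output at different cluster counts. scikit-learn offers an equivalent metric as `sklearn.metrics.davies_bouldin_score`.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def davies_bouldin_index(points, labels):
    """Davies-Bouldin index of a labeling; lower values indicate compact,
    well-separated clusters. Assumes cluster centroids are distinct."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("DBI is undefined for fewer than two clusters")

    centroids = {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in clusters.items()
    }
    # Scatter: mean distance of cluster members to their centroid.
    scatter = {
        label: sum(euclidean(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }
    # For each cluster, keep the worst (largest) similarity ratio.
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst_ratios) / len(worst_ratios)


def select_labeling(points, candidate_labelings):
    """Return (key, dbi) for the candidate labeling with the smallest DBI."""
    scores = {
        key: davies_bouldin_index(points, labels)
        for key, labels in candidate_labelings.items()
    }
    best = min(scores, key=scores.get)
    return best, scores[best]


# Toy data: two tight groups far apart along the first axis.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
candidates = {"split_by_x": [0, 0, 1, 1], "split_by_y": [0, 1, 0, 1]}
print(select_labeling(points, candidates))
```

Note that the DBI can favor labelings containing single-member clusters, because such clusters have zero scatter, so the selected cluster count is worth checking against domain knowledge.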

In Step 2, the RFC was used to establish a classification algorithm that predicts the labels assigned to the training and validation datasets during clustering (Step 1). The input features used in clustering, i.e., the physicochemical properties, functional groups, and the initial biotransformation rules, were also employed to classify MPs into each label. Five-fold cross-validation was conducted to evaluate the classification performance. The clustering scenario with the higher classification accuracy and F1-score (Supplementary Note 9) was chosen as the best clustering scenario for the machine learning model.
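
The evaluation in Step 2 rests on two ingredients: a five-fold split of the labeled MPs and the accuracy and F1-score computed on each held-out fold. The standard-library Python sketch below illustrates both. It is not the study's code, which used the RFC from scikit-learn; the function names are hypothetical, the folds are contiguous and unshuffled for simplicity, and macro averaging of the F1-score is an assumption, because the averaging scheme is specified only in Supplementary Note 9. The sample count of 29 mirrors the number of labeled MPs stated in the abstract.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, validation_indices) for contiguous folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1-scores (macro averaging)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Fold sizes for 29 labeled samples.
print([len(validation) for _, validation in k_fold_indices(29)])

# Toy labels, for illustration only.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

With a dataset this small, each validation fold holds only five or six MPs, so per-fold scores vary considerably and the averaged score is the more informative figure.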

In Step 3, the trained model (trained SOM-WARD-RFC model in Fig. 2) was used to classify the unlabeled MPs. When the trained model assigned an unlabeled MP in the test dataset to one of the clusters established in Step 1, that MP was considered to have degradation properties similar to those of the other MPs in the same cluster. The markers in each cluster were used to estimate the range of rate constants for unlabeled MPs using Eqs. (1) and (2). Because the unlabeled MPs in the test dataset were completely separated from the MPs in the training and validation datasets, the model could not have seen the test data before the test step. The specific operating conditions for the SOM and RFC mentioned in this section are given in Supplementary Notes 6–9. All simulations were performed using Python 3.7, and the clustering was conducted using the SOM implementation in MiniSom version 2.3.0⁴⁸. Ward’s method and the RFC from scikit-learn version 1.0 were used to draw decision boundaries and to classify MPs according to their input features, respectively⁴⁹.

Data availability

All data are available in the manuscript or the supplementary information.

Code availability

The underlying code for this study is not publicly available for proprietary reasons.

References

Wang, Z., Walker, G. W., Muir, D. C. G. & Nagatani-Yoshida, K. Toward a global understanding of chemical pollution: a first comprehensive analysis of national and regional chemical inventories. Environ. Sci. Technol. 54, 2575–2584 (2020).
Patel, M. et al. Pharmaceuticals of emerging concern in aquatic systems: chemistry, occurrence, effects, and removal methods. Chem. Rev. 119, 3510–3673 (2019).
Eggen, R. I. L., Hollender, J., Joss, A., Schärer, M. & Stamm, C. Reducing the discharge of micropollutants in the aquatic environment: the benefits of upgrading wastewater treatment plants. Environ. Sci. Technol. 48, 7683–7689 (2014).
Luo, Y. et al. A review on the occurrence of micropollutants in the aquatic environment and their fate and removal during wastewater treatment. Sci. Total Environ. 473-474, 619–641 (2014).
Rout, P. R., Zhang, T. C., Bhunia, P. & Surampalli, R. Y. Treatment technologies for emerging contaminants in wastewater treatment plants: a review. Sci. Total Environ. 753, 141990 (2021).
Buerge, I. J., Kahle, M., Buser, H. R., Müller, M. D. & Poiger, T. Nicotine derivatives in wastewater and surface waters: application as chemical markers for domestic wastewater. Environ. Sci. Technol. 42, 6354–6360 (2008).
Tran, N. H., Li, J., Hu, J. & Ong, S. L. Occurrence and suitability of pharmaceuticals and personal care products as molecular markers for raw wastewater contamination in surface water and groundwater. Environ. Sci. Pollut. Res. 21, 4727–4740 (2014).
Buerge, I. J., Poiger, T., Müller, M. D. & Buser, H. R. Caffeine, an anthropogenic marker for wastewater contamination of surface waters. Environ. Sci. Technol. 37, 691–700 (2003).
Buerge, I. J., Poiger, T., Müller, M. D. & Buser, H. R. Combined sewer overflows to surface waters detected by the anthropogenic marker caffeine. Environ. Sci. Technol. 40, 4096–4102 (2006).
Achermann, S. et al. Trends in micropollutant biotransformation along a solids retention time gradient. Environ. Sci. Technol. 52, 11601–11611 (2018).
Desiante, W. L., Minas, N. S. & Fenner, K. Micropollutant biotransformation and bioaccumulation in natural stream biofilms. Water Res. 193, 116846 (2021).
Ellis, L. B. & Wackett, L. P. Use of the University of Minnesota Biocatalysis/Biodegradation Database for study of microbial degradation. Micro. Inf. Exp. 2, 1 (2012).
Ellis, L. B., Gao, J., Fenner, K. & Wackett, L. P. The University of Minnesota pathway prediction system: predicting metabolic logic. Nucleic Acids Res. 36, W427–W432 (2008).
Wang, Y., Fenner, K. & Helbling, D. E. Clustering micropollutants based on initial biotransformations for improved prediction of micropollutant removal during conventional activated sludge treatment. Environ. Sci. Water Res. Technol. 6, 554–565 (2020).
Kohonen, T. Essentials of the self-organizing map. Neural Netw. 37, 52–65 (2013).
Shwartz-Ziv, R. & Armon, A. Tabular data: deep learning is not all you need. Inf. Fusion 81, 84–90 (2022).
Ullah, Z., Yoon, N., Tarus, B. K., Park, S. & Son, M. Comparison of tree-based model with deep learning model in predicting effluent pH and concentration by capacitive deionization. Desalination 558, 116614 (2023).
Williams, M., Du, J., Kookana, R. & Azzi, M. In Biodegradation, hydrolysis and photolysis testing of nitrosamines in aquatic systems, 1–30 (Commonwealth Scientific and Industrial Research Organisation, 2011).
Bergheim, M., Gieré, R. & Kümmerer, K. Biodegradability and ecotoxicitiy of tramadol, ranitidine, and their photoderivatives in the aquatic environment. Environ. Sci. Pollut. Res. 19, 72–85 (2012).
Ternes, T. A. et al. A rapid method to measure the solid-water distribution coefficient (K_d) for pharmaceuticals and musk fragrances in sewage sludge. Water Res. 38, 4075–4084 (2004).
Park, J., Yamashita, N., Wu, G. & Tanaka, H. Removal of pharmaceuticals and personal care products by ammonia oxidizing bacteria acclimated in a membrane bioreactor: contributions of cometabolism and endogenous respiration. Sci. Total Environ. 605-606, 18–25 (2017).
Da Silva, T. H. G., Furtado, R. X. S., Zaiat, M. & Azevedo, E. B. Tandem anaerobic-aerobic degradation of ranitidine, diclofenac, and simvastatin in domestic sewage. Sci. Total Environ. 721, 137589 (2020).
Joss, A., Andersen, H., Ternes, T., Richle, P. R. & Siegrist, H. Removal of estrogens in municipal wastewater treatment under aerobic and anaerobic conditions: consequences for plant optimization. Environ. Sci. Technol. 38, 3047–3055 (2004).
Tisler, S. & Zwiener, C. Aerobic and anaerobic formation and biodegradation of guanyl urea and other transformation products of metformin. Water Res. 149, 130–135 (2019).
Joss, A. et al. Biological degradation of pharmaceuticals in municipal wastewater treatment: proposing a classification scheme. Water Res. 40, 1686–1696 (2006).
Cooper, M. M., Elzerman, A. W. & Lee, C. M. Teaching chemistry in the new century: environmental chemistry. J. Chem. Educ. 78, 1169–1169 (2001).
Brown, A. K., Ackerman, J., Cicek, N. & Wong, C. S. Insitu kinetics of human pharmaceutical conjugates and the impact of transformation, deconjugation, and sorption on persistence in wastewater batch bioreactors. Environ. Pollut. 265, 114852 (2020).
Radjenović, J., Petrović, M. & Barceló, D. Fate and distribution of pharmaceuticals in wastewater and sewage sludge of the conventional activated sludge (CAS) and advanced membrane bioreactor (MBR) treatment. Water Res. 43, 831–841 (2009).
Fan, H., Li, J., Zhang, L. & Feng, L. Contribution of sludge adsorption and biodegradation to the removal of five pharmaceuticals in a submerged membrane bioreactor. Biochem. Eng. J. 88, 101–107 (2014).
Krauss, M., Longrée, P., Dorusch, F., Ort, C. & Hollender, J. Occurrence and removal of N-nitrosamines in wastewater treatment plants. Water Res. 43, 4381–4391 (2009).
Wijekoon, K. C. et al. Removal of N-nitrosamines by an aerobic membrane bioreactor. Bioresource Technol. 141, 41–45 (2013).
Brakstad, O. G. et al. Biotransformation in water and soil of nitrosamines and nitramines potentially generated from amine-based CO₂ capture technology. Int. J. Greenh. Gas. Control 70, 157–163 (2018).
Alvarino, T., Suarez, S., Lema, J. M. & Omil, F. Understanding the removal mechanisms of PPCPs and the influence of main technological parameters in anaerobic UASB and aerobic CAS reactors. J. Hazard. Mater. 278, 506–513 (2014).
Mazioti, A. A., Stasinakis, A. S., Gatidou, G., Thomaidis, N. S. & Andersen, H. R. Sorption and biodegradation of selected benzotriazoles and hydroxybenzothiazole in activated sludge and estimation of their fate during wastewater treatment. Chemosphere 131, 117–123 (2015).
Loganathan, B. G., Sajwan, K. S., Sinclair, E., Senthil Kumar, K. & Kannan, K. Perfluoroalkyl sulfonates and perfluorocarboxylates in two wastewater treatment facilities in Kentucky and Georgia. Water Res. 41, 4611–4620 (2007).
Urase, T. & Kikuta, T. Separate estimation of adsorption and degradation of pharmaceutical substances and estrogens in the activated sludge process. Water Res. 39, 1289–1300 (2005).
Abegglen, C. et al. The fate of selected micropollutants in a single-house MBR. Water Res. 43, 2036–2046 (2009).
Xue, W. et al. Elimination and fate of selected micro-organic pollutants in a full-scale anaerobic/anoxic/aerobic process combined with membrane bioreactor for municipal wastewater reclamation. Water Res. 44, 5999–6010 (2010).
Stevens-Garmon, J., Drewes, J. E., Khan, S. J., McDonald, J. A. & Dickenson, E. R. Sorption of emerging trace organic compounds onto wastewater sludge solids. Water Res. 45, 3417–3426 (2011).
Fernandez-Fontaina, E., Pinho, I., Carballa, M., Omil, F. & Lema, J. M. Biodegradation kinetic constants and sorption coefficients of micropollutants in membrane bioreactors. Biodegradation 24, 165–177 (2013).
Chakraborty, T. EC3: Combining clustering and classification for ensemble learning. Proc. IEEE Int. Conf. Data Min. ICDM 2017, 781–786 (2017).
Lu, J., Li, H., Tu, Y. & Yang, Z. Biodegradation of four selected parabens with aerobic activated sludge and their transesterification product. Ecotoxicol. Environ. Saf. 156, 48–55 (2018).
Gallé, T. et al. Large-scale determination of micropollutant elimination from municipal wastewater by passive sampling gives new insights in governing parameters and degradation patterns. Water Res. 160, 380–393 (2019).
Douziech, M. et al. Quantifying variability in removal efficiencies of chemicals in activated sludge wastewater treatment plants – a meta-analytical approach. Environ. Sci. Process. Impacts 20, 171–182 (2018).
Rich, S. L., Zumstein, M. T. & Helbling, D. E. Identifying functional groups that determine rates of micropollutant biotransformations performed by wastewater microbial communities. Environ. Sci. Technol. 56, 984–994 (2022).
Kim, G. A., Son, H. J., Kim, C. W. & Kim, S. H. Nitrosamine occurrence at Korean surface water using an analytical method based on GC/LRMS. Environ. Monit. Assess. 185, 1657–1669 (2013).
Pomiès, M., Choubert, J. M., Wisniewski, C. & Coquery, M. Modelling of micropollutant removal in biological wastewater treatments: a review. Sci. Total Environ. 443, 733–748 (2013).
Vettigli, G. MiniSom: minimalistic and NumPy-based implementation of the Self Organizing Map. https://github.com/JustGlowing/minisom/ (2018). Accessed on 21 March 2022.
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

Acknowledgements

This study was supported by the Korea Environment Industry & Technology Institute through the “Project for developing innovative drinking water and wastewater technologies,” funded by the Korea Ministry of Environment [Grant No. 2019002710010], and the National Research Foundation of Korea (NRF) grant, funded by the Korean government (MSIT) [No. 2021R1C1C2005643].

Author information

These authors contributed equally: Seung Ji Lim, Jangwon Seo.
These authors jointly supervised this work: Moon Son, Seok Won Hong.

Authors and Affiliations

Center for Water Cycle Research, Korea Institute of Science and Technology (KIST), Seoul, 02792, Republic of Korea
Seung Ji Lim, Jangwon Seo, Mingizem Gashaw Seid, Jiho Lee, Wondesen Workneh Ejerssa, Eunhoo Jeong, Sung Ho Chae, Moon Son & Seok Won Hong
Division of Energy and Environment Technology, KIST-School, University of Science and Technology, Seoul, 02792, Republic of Korea
Wondesen Workneh Ejerssa, Moon Son & Seok Won Hong
Mass Spectrometer Laboratory, National Instrumentation Center for Environmental Management, Seoul, 08826, Republic of Korea
Doo-Hee Lee
School of Earth Sciences and Environmental Engineering, Gwangju Institute of Science and Technology (GIST), Gwangju, 61005, Republic of Korea
Yunho Lee

Contributions

The manuscript was written with the contributions of all authors. All authors have read and agreed to the published version of the manuscript. Each author’s contributions are as follows. S.J.L.: Conceptualization, Methodology, Data analysis, Writing - Original draft, Co-first author. J.S.: Conceptualization, Methodology, Writing - Original draft, Co-first author. M.G.S.: Conceptualization, Data curation, Validation, Writing - Original draft. J.L.: Data curation, Validation, Reviewing. W.W.E.: Data curation, Validation. D.-H.L.: Data curation, Validation. E.J.: Data curation, Validation. S.H.C.: Writing - Reviewing and Editing. Y.L.: Writing - Reviewing and Editing. M.S.: Supervision, Funding acquisition, Writing - Reviewing and Editing, Co-corresponding author. S.W.H.: Supervision, Resources, Funding acquisition, Writing - Reviewing and Editing, Co-corresponding author.

Corresponding authors

Correspondence to Moon Son or Seok Won Hong.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Lim, S.J., Seo, J., Seid, M.G. et al. Clustering micropollutants and estimating rate constants of sorption and biodegradation using machine learning approaches. npj Clean Water 6, 69 (2023). https://doi.org/10.1038/s41545-023-00282-6

Received: 02 January 2023
Accepted: 16 October 2023
Published: 28 October 2023
DOI: https://doi.org/10.1038/s41545-023-00282-6