Introduction

Domestic and industrial chemicals, such as pharmaceuticals, personal care products, steroids, estrogens, pesticides, and surfactants, have become essential for modern living. From 2010 to 2019, more than 350,000 chemicals were registered and used worldwide1. Global medicine consumption in 2020 was estimated at 4.5 trillion doses2. Many of these chemicals flow into wastewater treatment plants (WWTPs) but are not completely degraded3,4. When discharged from WWTPs, the undecomposed chemicals can adversely impact aquatic ecosystems5. To assess the impact and potential risks of these unwanted chemicals, called micropollutants (MPs), frequent and accurate monitoring of WWTP effluent is crucial. However, periodic monitoring of MPs is expensive and labor-intensive.

Rather than monitoring individual MPs, monitoring groups of MPs is more efficient because a representative value for each group reduces the number of samples to be analyzed6,7. For example, one study selected caffeine, which is widely present in food, drinks, and pharmaceuticals, as a marker for evaluating the degree of aquatic ecosystem contamination by untreated wastewater8. The concentration of caffeine in untreated wastewater is orders of magnitude higher than that in treated water from WWTPs because caffeine is generally removed with high efficiency during wastewater treatment (>99%). Therefore, rather than detecting individual MPs, monitoring the caffeine concentration alone was sufficient for assessing anthropogenic contamination9. As with the monitoring of MPs in the aquatic ecosystem, their monitoring in WWTPs would be significantly simplified if a representative marker for each group of MPs were selected before analyzing the concentrations of all MPs.

Several clustering analyses have been attempted to identify similarities among MPs and to improve the prediction of their behavior. One example is the dendrogram, which is often used to depict trends in biodegradation rate constants associated with solid retention time10,11. Recently, clustering of MPs based on initial biotransformation rules predicted by the Eawag pathway prediction system (Eawag-PPS) was also introduced12,13. Although these clustering methods have merits (a wide range of applications or explainable clustering), they have limitations that need to be addressed. The dendrogram provides insufficient information for interpreting the clustering results because it represents them on a one-dimensional graph. Clustering with biotransformation rules is more explainable, but it often shows insufficient prediction accuracy because it lacks detailed chemical characteristics such as functional groups14.

To address these limitations of conventional clustering analyses, this study proposes a novel approach that determines marker MPs from the clustering results and then estimates the rate constants of MPs, namely the biological degradation rate constant (kbio) and the sorption coefficient (Kd), based on classification and the identified markers. Once clusters are established, a marker MP in each cluster can potentially provide information on the behavior, i.e., the rate constants, of the other MPs in the same cluster. For this purpose, a self-organizing map (SOM) and a random forest classifier (RFC) were applied sequentially to a dataset of 42 MPs consisting of physicochemical properties, functional groups, initial biotransformation rules, and rate constants obtained under aerobic and anoxic conditions. The SOM, an unsupervised neural network algorithm, was chosen for its ability to cluster and reduce dimensionality while providing superior visualization of the input features15. The RFC, a supervised algorithm, was adopted because of its remarkable inference performance on tabular datasets16,17. The RFC was used to classify the dataset based on the input features, i.e., the physicochemical properties, functional groups, and initial biotransformation rules. Hence, the objectives of this study were to: (1) propose an appropriate clustering method for MPs, (2) determine marker constituents for each cluster aggregated by the SOM, (3) classify MPs using the RFC based on physicochemical properties, functional groups, and biotransformation rules, and (4) estimate a range of rate constants for unlabeled MPs. The results suggest that this approach provides a good framework for monitoring the fate of MPs and can serve as an efficient and effective tool to reduce the monitoring overhead in WWTPs.

Results and discussion

Removal of micropollutants under aerobic and anoxic conditions

Figure 1 shows the removal efficiency of 42 MPs after 24 h of incubation. Regardless of the aerobic and anoxic conditions, the removal efficiency of most MPs was less than 5% in the control experiment. These findings suggest that abiotic processes such as hydrolysis are unlikely to be involved in the transformation of MPs18,19. Likewise, volatilization is not considered as the main removal route, because the Henry’s constant of the MPs was significantly low20. Conversely, adsorption was primarily relevant within 1 h for selected MPs, which is in line with the results of previous studies21,22. More than 30% (32–57%) of parabens (methyl paraben, ethyl paraben, propyl paraben, and butyl paraben), estrogens (estrone and estriol), diclofenac, and atorvastatin was removed by sorption onto sludge. However, the removal efficiency of most MPs through sorption was less than 14% (Fig. 1).

Fig. 1: Removal efficiency of MPs under aerobic and anoxic conditions.

Sorption and biodegradation are shown separately for the (a) aerobic and (b) anoxic processes. Negative removal is expressed as 0%.
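The arithmetic behind Fig. 1 can be sketched as follows. This is a minimal illustration, not the authors' processing script: the function names are hypothetical, and it assumes that the sorption share is taken from the concentration in the sterile control and that biodegradation is the remainder of the total removal. The example concentrations are invented so that they reproduce the diclofenac percentages reported in the text (47% sorption, 27% biodegradation).

```python
def removal_efficiency(c_initial, c_final):
    """Percent removal between two concentrations; negative removal is reported as 0%."""
    if c_initial <= 0:
        raise ValueError("initial concentration must be positive")
    removal = (c_initial - c_final) / c_initial * 100.0
    return max(0.0, removal)


def split_removal(c_initial, c_after_sorption, c_final):
    """Return (sorption %, biodegradation %) of the initial concentration.

    Assumption: biodegradation is the total removal minus the sorption share.
    """
    sorption = removal_efficiency(c_initial, c_after_sorption)
    total = removal_efficiency(c_initial, c_final)
    biodegradation = max(0.0, total - sorption)
    return sorption, biodegradation


# Hypothetical concentrations in mg/L for one compound
sorption_pct, biodegradation_pct = split_removal(0.1, 0.053, 0.026)
```

A compound whose concentration increases during incubation, such as propranolol under aerobic conditions, would return 0 from `removal_efficiency`, matching the convention stated in the caption.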

Distinct differences were observed in the biodegradation of individual MPs. For example, ibuprofen, naproxen, caffeine, metformin, gemfibrozil, and acetaminophen were almost completely removed under aerobic conditions (Fig. 1a). The removal of these MPs primarily resulted from biodegradation, which is consistent with previous findings21,23,24. Atorvastatin, parabens, and estrogens were also completely removed (Fig. 1a) through sorption and biodegradation, accounting for 36–70% and 30–64% of the removal, respectively. On the other hand, antibiotics (sulfathiazole, sulfamethazine, sulfamethoxazole, trimethoprim, and lincomycin), carbamazepine, atrazine, clofibric acid, and N,N-diethyl-meta-toluamide were poorly removed (sorption: less than 5% and biodegradation: up to 40%). The poor removal of these MPs is consistent with the findings of Ternes et al.20 and Joss et al.25 and is mainly ascribed to the low reactivity of their functional groups26.

Propranolol exhibited negative removal under aerobic conditions (Fig. 1), most likely due to back-transformation of propranolol and deconjugation27. Within 24 h, approximately 70–85% of atenolol, ranitidine, iopromide, cimetidine, and gemfibrozil were removed mainly by biodegradation. The total removal efficiency of diclofenac was approximately 74% (Fig. 1a). However, sorption (47%) was more effective than biodegradation (27%) for the removal of diclofenac. This is in line with the previous studies demonstrating the sorption onto sludge is a predominant route of diclofenac removal22,28,29. Perfluorinated compounds were removed less efficiently (7.5–37.3%), except for perfluoropentanoic acid (negatively removed). The biological removal efficiencies of N-nitrosamines varied significantly, as shown in Fig. 1a, with N-nitrosomorpholine (5.7%) and N-nitrosodibutylamine (75%) exhibiting the lowest and highest removal rates, respectively. The wide range of total removal efficiencies of 12.6–79.4% (sorption: less than 6% and biodegradation: 5.7–75%) in this study is analogous to the results of previous studies (>10–94%)30,31,32. Nitrosamines with acyclic groups, such as N-nitrosodibutylamine, were more biodegradable than those with alicyclic and morpholine groups31.

Metformin was almost completely biodegraded under anoxic conditions, whereas MPs such as ranitidine, iopromide, and acetaminophen were removed with varying degrees of efficiency (62–85%). Parabens, estriol, and estrone were also significantly removed under anoxic conditions, most likely due to sorption (Fig. 1b). The biodegradation efficiency of β-blockers (atenolol, metoprolol, and propranolol) and trimethoprim was 45% higher under anoxic conditions than under aerobic conditions. In contrast, corrosion inhibitors (1-H-benzotriazole and 4-methyl-1H-benzotriazole), gemfibrozil, diclofenac, ibuprofen, naproxen, and caffeine were only marginally removed (7.9–28.6%). Alvarino et al.33 and Mazioti et al.34 reported that the degradation of these MPs was only observed under aerobic conditions, depending on the activity of nitrifying bacteria. Perfluoropentanoic acid had a negative removal efficiency in this study, while other MPs had less than 35% removal efficiency via biodegradation under anoxic conditions (Fig. 1b). The negative removal of perfluoropentanoic acid could have resulted from the transformation of other perfluorinated compounds present in the sludge samples35.

Overall, most results of the batch experiments under both conditions were consistent with previously reported results. According to the Nash–Sutcliffe model efficiency coefficient (NSE), the pseudo first-order degradation model can predict the biodegradation rate constants of the majority of MPs under aerobic and anoxic conditions, except for perfluoropentanoic acid, perfluorobutanesulfonate, atrazine, and nitrosamines. The kbio of MPs ranged from 0 to 2.3 L g−1 h−1 and from 0 to 1.8 L g−1 h−1 under aerobic and anoxic conditions, respectively (Table 1). As shown in Supplementary Fig. 1, however, most of the MPs showed higher kbio values in the aerobic process than in the anoxic process. The sorption coefficients were very similar under aerobic (0–0.44 L gMLSS−1) and anoxic conditions (0–0.5 L gMLSS−1), which agreed well with previously reported ranges25,28,36,37,38,39,40.

Table 1 Sorption coefficient and biological reaction rate constant of MPs in aerobic and anoxic processes estimated through the dynamic modeling approach.

Machine learning models combining clustering and classification

Machine learning models that combine clustering and classification are proposed in this study (Fig. 2). The clustering analysis was used to assign labels to unlabeled data, and these labels were then used for classification41. The dataset consisted of physicochemical properties, functional groups, initial biotransformation rules, and rate constants of 42 MPs (Supplementary Table 1). The dataset was randomly divided into two parts: 29 MPs (70%) for training and cross-validation and 13 MPs (30%) for testing. Hereafter, the clustering scenario based on the physicochemical properties and functional groups is abbreviated as PF, and the scenario based on the initial biotransformation rules is abbreviated as BT.
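The 29/13 split described above can be reproduced with a short sketch. The compound names are placeholders, the seed is arbitrary, and rounding the test fraction up (so that 30% of 42 gives 13) is an assumption chosen to match the reported counts; the text does not state how the split was implemented.

```python
import math
import random


def split_compounds(names, test_fraction=0.3, seed=0):
    """Shuffle compound names and return (train_val, test) lists."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    # Assumption: the test size is rounded up, so 0.3 * 42 = 12.6 becomes 13
    n_test = math.ceil(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder names standing in for the 42 MPs of Supplementary Table 1
compounds = [f"MP{i:02d}" for i in range(1, 43)]
train_val, test = split_compounds(compounds)
```

Fixing the seed keeps the test compounds identical across the PF and BT scenarios, which is what makes their accuracies comparable.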

Fig. 2: Schematic diagram of the combination of clustering and classification algorithms.

The number of micropollutants (MPs) used in each step is noted in the diagram.

Clustering analysis and determination of marker constituents

The distance maps derived from the SOM are illustrated with different colors according to the relative distance between each neuron of the map (Figs. 3a and 4a). The MPs assigned closely in the distance map filled with similar colors were interpreted as MPs having analogous characteristics (Figs. 3b and 4b). The radius of sectors represents the relative importance of each input feature to cluster MPs. On the other hand, the MPs with remarkably different features were in the separate neurons with dissimilar colors. The solid lines determined by Ward’s method indicate the boundaries dividing each cluster. The marker constituents among MPs are indicated with superscripts (A) for aerobic and (AN) for anoxic conditions, respectively (Figs. 3 and 4).
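To make the distance map concrete, the sketch below computes, for each neuron, the mean Euclidean distance between its weight vector and those of its directly adjacent neurons. It is an illustrative pure-Python version under the assumption of a rectangular grid with four-connected neighbors; the toolbox used in the study may define neighborhoods and normalization differently. The weight values are invented.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_map(weights):
    """Mean distance from each neuron to its up/down/left/right neighbors.

    weights[r][c] is the weight vector of the neuron at row r, column c.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    dists.append(euclidean(weights[r][c], weights[rr][cc]))
            result[r][c] = sum(dists) / len(dists) if dists else 0.0
    return result


# Hypothetical 2 x 2 map with two-dimensional weight vectors
toy_weights = [[[0.0, 0.0], [0.0, 1.0]],
               [[0.0, 5.0], [3.0, 5.0]]]
toy_map = distance_map(toy_weights)
```

Neurons with small values sit among similar neighbors and would share a color in Figs. 3a and 4a, whereas large values mark the transitions where cluster boundaries tend to fall.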

Fig. 3: Clustering results for physicochemical properties and functional groups of MPs.

(a) Distance between neighboring map units and clustering boundary and (b) weight vectors that represent the importance of each feature to organize the map. The color bar indicates the distance between neighboring map units.

Fig. 4: Clustering results for initial biotransformation rules of MPs.

(a) Distance between neighboring map units and clustering boundary and (b) weight vectors. The initial biotransformation rules for each MP are presented in Supplementary Table 1. The color bar indicates the distance between neighboring map units.

Clustering analysis based on physicochemical properties and functional groups

Recent research has found that the fate of MPs is influenced by physicochemical properties such as the octanol-water partition coefficient and accessible functional moieties10,20,26. Hence, we first assessed the suitability of physicochemical properties and functional groups for clustering MPs (Fig. 3). Using Ward’s method, MPs having similar input features were clustered into 11 clusters with the lowest Davies–Bouldin index (DBI) in the PF scenario (0.49).

Because nitrosamines commonly contain amine and amide functional groups, they are clustered together, as shown in the upper-left area of Fig. 3a, b. However, N-nitrosodiphenylamine and N-nitrosomorpholine were assigned to different clusters because of their diphenylamine and morpholine functional groups, respectively. Carbamazepine and N,N-diethyl-meta-toluamide were also grouped together with N-nitrosodiphenylamine and N-nitrosomorpholine because they contain amine, amide, and aromatic ring functional groups. The MPs having nitrogen- and sulfur-containing functional groups, such as sulfathiazole, sulfamethazine, ranitidine, and cimetidine, were assigned to one cluster in the lower-left corner of Fig. 3a. This clustering result is in line with previous studies in which MPs with a sulfonamide functional group were aggregated in the same cluster14 and sulfamethazine and sulfathiazole were closely located in the dendrogram on the basis of biodegradation rate11. The parabens were clustered in the same unit because of their high log Kow values and functional groups, i.e., ester and aromatic ring. The long alkyl ester chain and high log Kow value are the unique properties of parabens, which make them readily susceptible to sorption and biodegradation42. Although estrogens do not have an ester functional group in their structure, parabens and estrogens were assigned to the same cluster because of their similar log Kow values and their shared alcohol and aromatic ring functional groups (upper-right corner of Fig. 3a, b). The MPs located in the lower-right corner of Fig. 3a, b have a halogen-containing functional group in common. Perfluorohexanoic acid and perfluoropentanoic acid were clustered separately from clofibric acid and iopromide because their structure contains a fluorinated carbon chain rather than an aromatic ring. Similar clustering results can be found in a previous study in which perfluorinated compounds were grouped in the same cluster because of their fluorinated carbon chain structure43.

In summary, the clustering result represented in the SOM map (Fig. 3a) was interpretable using the physicochemical properties of each MP (Fig. 3b). In the figure, the MPs on the left side have relatively low molecular weights or log Kow and consist of nitrogen-containing functional groups (i.e., amine and amide) compared with the MPs on the right side. The MPs having the aromatic ring functional group were in a diagonal direction (lower left to upper right), and the MPs with the chain structure were positioned at each corner, in the upper left and lower right. Other MPs containing sulfur and halogen atoms in their functional groups aggregated in the clusters at the bottom of the distance map. One limitation of the clustering result in this study was the uneven distribution of MPs in each cluster due to the lack of available MP data. This limitation should be overcome in future studies by increasing the number of MPs included in the analysis.

Clustering analysis based on biotransformation rules

Perfluorinated compounds, N-nitrosodimethylamine, and N-nitrosopyrrolidine were excluded from this section because their initial biotransformation rules could not be predicted using Eawag-PPS. When the SOM clustered the MPs in the BT scenario, the algorithm generated 15 clusters (DBI = 0.87). The MPs that most commonly followed the same 1–3 initial biotransformation rules were aggregated in the same cluster (Fig. 4). 1-H-benzotriazole was grouped together with clofibric acid because its biotransformation was mainly initiated by aromatic ring dihydroxylation (bt0005) (right-middle area of Fig. 4a, b). Atenolol and iopromide were aggregated in the same cluster, since their biodegradation likely occurred through H-abstraction from side chains (bt0002) and demethylation or dealkylation of the ether group (bt0023) (lower-left area of Fig. 4a, b). These results were not consistent with a previous report demonstrating that atenolol and 1-H-benzotriazole were tied to the same cluster when elimination rates, instead of biotransformation rules, were used as input features43. Since sulfathiazole and sulfamethazine contain a sulfonamide functional group, which can be biodegraded through hydrolysis and bond cleavage in the sulfonamide group (bt0144), they were aggregated in the same cluster. This is in line with previous studies showing that MPs having the sulfonamide functional group were aggregated in the same cluster10,14. Among the nitrosamines, the biodegradation of N-nitrosodiethylamine and N-nitrosomethylethylamine mainly resulted from the monohydroxylation of a methyl group (bt0334); hence, they were aggregated in one cluster.

Estimation of rate constants using the proposed algorithms and markers

The feasibility of the proposed algorithms and derived marker constituents was evaluated by classifying unlabeled MPs, followed by estimating the range of rate constants for each MP. In this study, the role of a marker is to provide representative information regarding the rate constants of MPs in each cluster. Therefore, the marker was designated as an MP having a minimum Euclidean distance from the mean of the rate constants in each cluster. For example, when an unlabeled MP is classified in a specific cluster, the ranges of its rate constants can be calculated using the rate constants of the markers, \({K}_{d,m}\) and \({k}_{{bio},m}\). The rate constants for unlabeled MPs, \({K}_{d,u}\) and \({k}_{{bio},u}\), can have the values in the range as follows:

$${K}_{d,m}-N{\sigma }_{{K}_{d}}\le {K}_{d,u}\le {K}_{d,m}+N{\sigma }_{{K}_{d}}$$
(1)
$${k}_{{bio},m}-N{\sigma }_{{k}_{{bio}}}\le {k}_{{bio},u}\le {k}_{{bio},m}+N{\sigma }_{{k}_{{bio}}}$$
(2)

where \({\sigma }_{{K}_{d}}\) and \({\sigma }_{{k}_{{bio}}}\) indicate the standard deviation of sorption coefficient and biodegradation rate constant obtained from the MPs in each cluster, respectively. N was set to one, two, and three in this study. The estimation accuracy was calculated by counting the numbers of MPs that lie within the range calculated using Eqs. (1) and (2) (Table 2).
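The marker definition and Eqs. (1) and (2) can be illustrated with a small sketch. Several choices here are assumptions rather than details from the study: the distance to the cluster mean is computed on unscaled (Kd, kbio) pairs, the population standard deviation is used, and the three compounds and their rate constants are invented. All function names are hypothetical.

```python
import math
import statistics


def select_marker(cluster):
    """Return the MP whose (Kd, kbio) pair is closest to the cluster mean.

    cluster maps an MP name to a (Kd, kbio) tuple.
    """
    mean_kd = statistics.mean([kd for kd, _ in cluster.values()])
    mean_kbio = statistics.mean([kbio for _, kbio in cluster.values()])
    return min(
        cluster,
        key=lambda name: math.hypot(cluster[name][0] - mean_kd,
                                    cluster[name][1] - mean_kbio),
    )


def estimate_ranges(cluster, n_sigma):
    """Eqs. (1) and (2): marker value plus or minus N cluster standard deviations."""
    kd_m, kbio_m = cluster[select_marker(cluster)]
    # Assumption: population standard deviation over the MPs in the cluster
    sd_kd = statistics.pstdev([kd for kd, _ in cluster.values()])
    sd_kbio = statistics.pstdev([kbio for _, kbio in cluster.values()])
    return {
        "Kd": (kd_m - n_sigma * sd_kd, kd_m + n_sigma * sd_kd),
        "kbio": (kbio_m - n_sigma * sd_kbio, kbio_m + n_sigma * sd_kbio),
    }


def in_range(bounds, value):
    """True when a measured rate constant lies inside the estimated range."""
    low, high = bounds
    return low <= value <= high


# Invented cluster: name -> (Kd, kbio)
cluster = {"MP_a": (0.10, 0.2), "MP_b": (0.20, 0.4), "MP_c": (0.30, 0.9)}
marker = select_marker(cluster)
ranges_1sd = estimate_ranges(cluster, n_sigma=1)
ranges_2sd = estimate_ranges(cluster, n_sigma=2)
```

The estimation accuracy is then the fraction of test MPs for which `in_range` is true. Widening the range from one to three standard deviations can only keep or increase that fraction, which is the pattern reported in Table 2.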

Table 2 Performance of clustering (DBI) and classification (accuracy, f1-score, precision, and recall) approaches and estimation performance of rate constants range using markers for each clustering scenario.

In the preliminary simulation used to design this study, a random forest regressor (RFR) alone was employed to directly predict the degradation rate constants (Supplementary Fig. 2). The coefficient of determination (R2) for the degradation rate constants in the test step was lower than 0.5 regardless of the input features and operating conditions (Supplementary Figs. 3 and 4). The RFR model was overfitted: its prediction accuracy in the training step (R2: 0.78–0.90) was significantly higher than that in the test step (R2: −0.08–0.45). Because the machine learning approach combining the SOM and RFC performed better than the RFR model alone, the SOM and RFC were used in this study. In the training and validation steps, the classification accuracy (0.75) and f1-score (0.61) of the PF scenario were significantly higher than those of the BT scenario (accuracy: 0.43 and f1-score: 0.32). In the test step under the aerobic condition, the algorithm using the PF scenario estimated the range of rate constants with an accuracy of 0.38 using one standard deviation and the marker’s rate constants for each cluster. In contrast, one standard deviation was insufficient to estimate the range of rate constants in the BT scenario (0.10). Even within three standard deviations, the best estimation accuracy of the BT scenario (0.40) was lower than that of the PF scenario (0.77). Similarly, under the anoxic condition, the estimation accuracy of the PF scenario (0.46–0.77) was higher than that of the BT scenario (0.30–0.40). Collectively, the PF scenario outperformed the BT scenario in both the classification of MPs and the estimation of rate constants. The better clustering result can explain the higher classification and estimation accuracies of the PF scenario. For example, the DBI of the PF scenario (0.49) was only about half that of the BT scenario (0.87), implying that the clusters in the PF scenario were better separated than those in the BT scenario.

Applicability of the proposed algorithm to different microbial community data

We further conducted simulations using previously reported aerobic experimental data to evaluate the applicability of the proposed machine learning algorithm to data from a different microbial community10. In this case, the dataset consisted of 42 MPs that were mainly removed through biotransformation rather than sorption. The proposed algorithm was retrained using physicochemical properties, functional groups, and biotransformation rules to estimate the rate constants of the MPs in the reported dataset. As expected, the proposed algorithm was able to classify the MPs and estimate their rate constants for the different microbial community. Interestingly, in this case, the BT scenario (0.72) showed a slightly higher classification accuracy than the PF scenario (0.62). Consequently, the estimation performance of the BT scenario (0.69) was also slightly higher than that of the PF scenario (0.62) (Supplementary Table 2). This superior estimation accuracy under the BT scenario is likely because the selected MPs in the literature dataset follow the biotransformation rules well, as stated10. However, for our experimental datasets, using only the biotransformation rules as input features led to a lower estimation performance than the PF scenario. This can be ascribed to the fact that sorption is indirectly accounted for under the PF scenario, which considers the physicochemical properties and functional groups of MPs, but not under the BT scenario. As a result, the rate constants could not be estimated precisely under the BT scenario.

Comparison of model performances with previous studies

The proposed algorithm exhibited a comparable classification performance and superior estimation accuracy of MPs when estimating the range of rate constants compared to the ones proposed by previous studies. For example, a previous model based on meta-analysis accounted for only 17% of the variability in the removal efficiencies of the targeted MPs44, which is lower than the performance of the PF scenario under the aerobic condition within one standard deviation (0.38). In another study employing hierarchical clustering and multivariable analysis, the estimation accuracy for the complete dataset was only 0.19 owing to the unpredictable characteristics of biodegradation14. A recent study proposed an RFC to classify MPs into two classes (fast or slow biotransformation) with classification accuracies of 0.95 for the predicted biotransformation rules and 0.78 for the observed biotransformation rules45. This classification accuracy is similar to the present study. Importantly, in this study, a direct estimation of the range of rate constants of unlabeled MPs was possible. However, the previous study could only classify whether the MPs were biodegraded slowly or rapidly.

Overall, the superior estimation accuracy of this proposed machine learning algorithm suggested two noteworthy findings. First, the markers represented each cluster successfully, particularly when the physicochemical properties and functional groups of each MP were employed during the model training. Second, the markers derived from the proposed algorithm were used to estimate the range of rate constants for unlabeled MPs in the test dataset with relatively high accuracy, using only their physicochemical properties and functional groups as input features. In summary, the proposed machine learning approach could be employed to estimate the sorption and degradation rate of unlabeled and emerging MPs based only on the physicochemical properties and functional groups rather than measuring time-course change of their concentration to estimate the fate of MPs. The proposed machine learning approach trained with sufficient process operational and experimental data could reduce the labor and expenses required for monitoring MPs. Thus, monitoring only the marker MP could reduce the cost of measuring each MP concentration. As with other machine learning techniques, one important prerequisite for successfully applying this machine learning model is to secure sufficient data to train the model. With sufficient data, the grouping and positioning of MPs with SOM could become more refined while improving the accuracy of predictions with RFCs.

Methods

The details of the activated sludge, reagents, and chemicals used in this study are provided in the Supplementary Information (See Supplementary Note 1). Unless otherwise noted, all experiments were conducted using synthetic wastewater (SyWW). The detailed composition of SyWW is presented in Supplementary Table 3.

Batch experiments

The biodegradation of 42 MPs was evaluated under aerobic and anoxic redox conditions. These MPs were chosen because of their frequency of occurrence, persistence, and negative impact on aquatic life. The agitated batch reactor setups are presented in Supplementary Note 2 and Supplementary Fig. 5. Each 3 L batch reactor was filled with approximately 2.2 L of SyWW and 0.8 L of activated sludge. A cocktail of the 42 MPs was spiked into the reactors to a final concentration of 0.1 mg L−1. The concentrations of mixed liquor suspended solids (MLSS) and mixed liquor volatile suspended solids (MLVSS) were maintained at 3 g L−1 and 1.8 g L−1, respectively, in all experiments. The pH and water temperature were kept at 7 and 22 °C, respectively, throughout the experiment. An 11 mL aliquot of the sample solution was collected from the reactors at 0, 10, 20, and 30 min and at 1, 2, 4, 8, 12, and 24 h. Control experiments without sludge were also performed to verify abiotic transformation of the MPs, with samples taken at 0 and 24 h. To investigate the effect of adsorption on removal (sterile control), samples were collected at 0, 10, 20, and 30 min and at 1 h from a reactor spiked with sodium azide (3 g L−1) to suppress microbial activity. Prior to analysis, all samples were filtered using a 0.2 μm syringe filter (Whatman), fortified with internal standards (50 ng mL−1), and immediately stored in a freezer at −20 °C.

Analysis of micropollutants

Nitrosamines were analyzed using gas chromatography coupled to low-resolution mass spectrometry (GC-LRMS, 6890N GC system, Agilent Technologies, USA). The details of the procedure and validation of the GC-LRMS method have been reported elsewhere46. Thirty-five additional MPs were monitored using an ultra-high performance liquid chromatography (UHPLC) Vanquish system (Thermo Scientific, San Jose, USA). The system consisted of a cooling auto-sampler, a temperature-controlled column oven, an ultra-high pressure solvent delivery pump, and an automatic degasser. Chromatographic separations of the samples were performed using a Cortecs C18 column (100 × 2.1 mm, 1.6 μm, Waters Co., Milford, MA, USA). The column temperature was set at 45 °C, and the injection volume was 3 μL with a flow rate of 0.3 mL min−1. The mobile phases were 0.1% hydrofluoric acid in high-performance liquid chromatography (HPLC) grade water (Solvent A) and methanol (Solvent B). The gradient elution consisted of 40–70% B from 0 to 0.5 min and 70–100% B from 0.5 to 6.5 min, followed by a 1 min hold and a 4 min re-equilibration to the starting conditions. Internal standards were used for quantification of the analytes, and a ten-point calibration curve was constructed over a concentration range of 0.1 to 100 ng mL−1. Details of the optimization and validation of the UHPLC-MS/MS methods are described in Supplementary Note 3 and Supplementary Tables 4 and 5.

Pseudo first-order degradation models

Based on the results obtained from lab-scale batch experiments, a pseudo first-order degradation kinetic model (Supplementary Note 4) has been frequently used for describing the fate of MPs25,38,47. The pseudo first-order degradation model in this study assumed fast sorption that reached the equilibrium condition immediately due to observation of instant reduction of soluble MP concentration. Other degradation/removal mechanisms such as volatilization were not considered. The performance of the model was evaluated using the NSE (Supplementary Note 5). Within the scope of this study, the pseudo first-order degradation model considering kbio and Kd can effectively describe the kinetics of MPs.
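A minimal sketch of such a model and of the NSE is given below. The exact equations are in Supplementary Note 4 and 5, which are not reproduced here, so the functional form is an assumption: sorption reaches equilibrium instantly, so the initial soluble concentration is C0 / (1 + Kd X), and the soluble concentration then decays at a rate kbio X, where X is the solids concentration. The function names and parameter values are hypothetical.

```python
import math


def soluble_concentration(t_h, c0, kd, kbio, x_ss):
    """Soluble MP concentration after t_h hours under the assumed model.

    c0: spiked concentration, kd: sorption coefficient (L/g),
    kbio: biodegradation rate constant (L/g/h), x_ss: solids (g/L).
    """
    s0 = c0 / (1.0 + kd * x_ss)               # instantaneous sorption equilibrium
    return s0 * math.exp(-kbio * x_ss * t_h)  # pseudo first-order decay


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Sampling times from the batch experiments, in hours
times_h = [0, 10 / 60, 20 / 60, 30 / 60, 1, 2, 4, 8, 12, 24]
simulated = [soluble_concentration(t, 0.1, 0.5, 0.05, 3.0) for t in times_h]
```

An NSE close to 1 indicates that the fitted kbio and Kd describe the measured time course well, while negative values mean the model performs worse than simply using the mean of the observations.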

Machine learning approaches using clustering and classification for micropollutants

In Step 1, the SOM, followed by Ward’s method, was applied to the training and validation datasets to cluster MPs in a reduced dimension by mapping the high-dimensional data onto a two-dimensional grid. Ward’s method draws the decision boundaries that separate the clusters generated by the SOM (Supplementary Note 6). The optimum number of clusters was determined by evaluating the Davies–Bouldin index (DBI) (Supplementary Note 7). Step 1 aims to assign labels to MPs for which appropriate grouping rules do not yet exist. The MPs in the same cluster were considered to have similar functional groups or biodegradation rules. The labels derived in this step were used to train the classification algorithm in Step 2. The marker for each cluster was determined after identifying the number of clusters with the minimum DBI. The marker MP is the representative MP of each cluster and was used in Step 3 to estimate the degradation rate constants of the unlabeled MPs in the test dataset. Two clustering scenarios were designed to find the proper input features for clustering MPs: clustering based on the physicochemical properties and functional groups (e.g., octanol-water partitioning coefficient and ether, ester, and amine functional groups; see Supplementary Table 1) and clustering based on the initial biotransformation rules predicted from Eawag-PPS12,13 (Supplementary Table 1).
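For readers unfamiliar with the DBI, the sketch below implements its standard definition in plain Python: for each cluster, take the worst ratio of summed within-cluster scatter to centroid separation, then average over clusters. The study's own settings are in Supplementary Note 7, so this is only an illustration, and the two toy clusters are invented.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a list of vectors."""
    return [sum(p[i] for p in points) / len(points) for i in range(len(points[0]))]


def davies_bouldin_index(clusters):
    """DBI for a list of clusters, each a list of feature vectors. Lower is better.

    Requires at least two clusters with distinct centroids.
    """
    centroids = [centroid(c) for c in clusters]
    # Scatter: mean distance of the members of a cluster to its centroid
    scatter = [
        sum(euclidean(p, centroids[i]) for p in c) / len(c)
        for i, c in enumerate(clusters)
    ]
    total = 0.0
    for i in range(len(clusters)):
        total += max(
            (scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
            for j in range(len(clusters))
            if j != i
        )
    return total / len(clusters)


# Two invented, well-separated clusters in a two-dimensional feature space
toy_clusters = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
toy_dbi = davies_bouldin_index(toy_clusters)
```

Evaluating this index for each candidate number of clusters and keeping the minimum is how values such as 0.49 (PF) and 0.87 (BT) are compared: compact clusters that lie far apart give a small index.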

In Step 2, the RFC was used to establish a classification algorithm that predicts the labels assigned to the training and validation datasets by the clustering in Step 1. The input features used in clustering, i.e., the physicochemical properties, functional groups, and initial biotransformation rules, were also employed to classify the MPs into each label. Five-fold cross-validation was conducted to evaluate the classification performance. The clustering scenario with the better classification accuracy and f1-score (Supplementary Note 9) was chosen as the best clustering scenario for the machine learning model.
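The evaluation in Step 2 can be sketched without any machine learning library. The fold construction below is unshuffled and the f1-score is macro-averaged over classes; both are assumptions, since the averaging and fold settings are given only in the Supplementary Information. The labels in the example are invented.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


# 29 training and validation MPs split into five folds
folds = list(k_fold_indices(29, k=5))

# Invented cluster labels for four validation MPs
true_labels = [0, 0, 1, 1]
predicted_labels = [0, 1, 1, 1]
```

With many small clusters, accuracy and macro f1-score can diverge noticeably, because a single misclassified MP in a two-member cluster halves that cluster's recall. This is one reason to report both metrics, as in Table 2.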

In Step 3, the trained model (trained SOM-WARD-RFC model in Fig. 2) was used to classify the unlabeled MPs. When the trained model assigned an unlabeled MP in the test dataset to a cluster established in Step 1, the classified MP was considered to have degradation properties similar to those of the other MPs in the same cluster. The markers in each cluster were used to estimate the range of rate constants for unlabeled MPs using Eqs. (1) and (2). Since the unlabeled MPs in the test dataset were completely separated from the MPs in the training and validation datasets, there was no possibility that the model had previewed the data used in the test step. The specific operating conditions of the SOM and RFC mentioned in this section are given in Supplementary Notes 6–9. In this study, all simulations were performed using Python 3.7, and the clustering was conducted using the SOM from the MiniSOM toolbox48 (version 2.3.0). Ward’s method and the RFC from Scikit-learn49 (version 1.0) were used to draw decision boundaries and to classify MPs based on the input features, respectively.