Introduction

The current global energy structure is still heavily dependent on oil and natural gas resources1. Metallic pipelines (e.g. X80, X70, X65) are widely used around the world as the fastest, safest, and cheapest way to transport oil and gas2,3,4,5,6. Nevertheless, pipelines may leak, burst, or rupture in service, causing environmental pollution, economic losses, and even casualties7. Most investigations evaluating different failure modes of oil and gas pipelines show that corrosion is one of the most common causes and has the greatest negative impact on the degradation of oil and gas pipelines2. Lam’s8 analysis indicated that external corrosion is the main form of corrosion failure of pipelines. External corrosion of oil and gas pipelines is a time-varying damage mechanism, the degree of which depends strongly on the service environment of the pipeline (soil properties, water, gas, etc.), its age, and whether and how external protection is applied1. All of these features contribute to the evolution and growth of various types of corrosion on pipelines. Among all corrosion forms, localized corrosion (pitting) tends to pose the highest risk. Therefore, accurately estimating the maximum depth of pitting corrosion allows operators to better analyze and manage risk in the transmission pipeline system and to plan maintenance accordingly.

In recent years, many scholars around the world have been actively pursuing corrosion prediction models, covering atmospheric corrosion, marine corrosion, microbial corrosion, etc.9,10,11,12,13. To predict the corrosion development of pipelines accurately, scientists are committed to constructing corrosion models from multidisciplinary knowledge. Initially, these models relied on empirical or mathematical statistics to derive correlations, and they gradually incorporated more factors and deterioration mechanisms. Increases in computing power have led to growing interest among domain experts in high-throughput computational simulations and intelligent methods. Exploring the relationship between corrosion (corrosion rate or maximum pitting depth) and its various influencing factors using intelligent algorithms has become a trend in corrosion prediction.

ML has been successfully applied to the corrosion prediction of oil and gas pipelines. Ren et al.14 took the mileage, elevation difference, inclination angle, pressure, and Reynolds number of natural gas pipelines as input parameters and the maximum average corrosion rate of the pipelines as the output parameter to establish a back propagation neural network (BPNN) prediction model. The predicted values and the real pipeline corrosion rates are highly consistent, with an error of less than 0.1%. Xie et al.15 and Liao et al.16 employed the BPNN to predict the growth of corrosion in pipelines with different inputs. Meanwhile, other neural network models (DNN, SSCN, etc.) have been widely used to predict the corrosion of pipelines as well17,18,19,20,21,22. Support vector regression (SVR) is also widely used for the corrosion prediction of pipelines. Luo et al.23 established corrosion prediction models for wet natural gas gathering and transportation pipelines based on SVR, BPNN, and multiple regression, respectively. Compared with the actual data, the average relative error of the corrosion rate obtained by SVR is 11.16%, whereas the ANN and multivariate analysis methods yield 19.54% and 25.32%, respectively. Zhang et al.24 combined a modified SVM with an unequal-interval model to predict the corrosion depth of gathering gas pipelines, with a prediction relative error of only 0.82%. In addition, El Amine et al.2 proposed an efficient hybrid intelligent model based on SVR to predict the dmax of offshore oil and gas pipelines. Although single ML models have proven effective, higher-performance models are constantly being developed.

EL is a composite model whose prediction accuracy is generally higher than that of single models25. Ben et al.25 developed corrosion prediction models based on four EL approaches. In addition, they performed a rigorous statistical and graphical analysis of the predicted internal corrosion rate to evaluate the models’ performance and compare their capabilities. The AdaBoost and gradient boosting (XGBoost) models showed the best performance, with RMSE values of 0.052. However, these studies fail to emphasize the interpretability of their models. Despite the high accuracy of the predictions, many ML models are uninterpretable, and users are not aware of the underlying inference behind the predictions26. That is, the prediction process of an ML model is like a black box that is difficult to understand, especially for people who are not proficient in programming. Interpretable ML addresses this issue. It converts black-box models into transparent ones, exposing the underlying reasoning, clarifying how ML models arrive at their predictions, and revealing feature importance and dependencies27. In recent studies, SHAP and ALE have been used for post hoc interpretation of ML predictions in several fields of materials science28,29. In corrosion prediction studies, however, how the predictions are obtained has not been clearly explained.

In this work, we applied different regression models (ANN, RF, AdaBoost, GBRT, and LightGBM) to predict the dmax of oil and gas pipelines. The best model was then identified and further optimized. More importantly, this research aims to open the black box of ML in corrosion prediction in response to the research gaps above. The study visualizes the final tree model, explains how specific predictions are obtained using SHAP, and analyzes the global and local behavior of the model in detail. Moreover, ALE plots are utilized to describe the main and interaction effects of features on the predicted results. This study emphasizes that interpretable ML does not inherently sacrifice accuracy or complexity, but rather enhances model predictions by providing human-understandable interpretations, and it can even help discover new corrosion mechanisms.

Results and discussion

Data analysis and pre-processing

Variance, skewness, kurtosis, and the coefficient of variation are used to describe the distribution of a set of data, and these metrics for the quantitative variables in the dataset are shown in Table 1. Specifically, the kurtosis and skewness indicate the deviation from a normal distribution. The coefficient of variation (CV) indicates the likelihood of outliers in the data. It is generally considered that outliers are likely to exist if the CV is higher than 0.15. As shown in Table 1, the CV of every variable except pp (pipe/soil potential) and bd (bulk density) exceeds 0.15, which means that outliers may exist in the applied dataset. In addition, the variance, kurtosis, and skewness of most of the variables are large, which further increases this possibility.

Table 1 Description of the overall distribution of the variables.

The violin plot reflects the overall distribution of the original data. Box plots are used to observe the distribution of the data quantitatively, described by statistics such as the median, 25% quantile, 75% quantile, upper bound, and lower bound. The box contains most of the normal data, while points outside the upper and lower bounds of the box are potential outliers. Figure 1 shows the combination of violin plots and box plots applied to the quantitative variables in the database. It can be seen that there are potential outliers in all features (variables) except rp (redox potential). Considering the physical meaning of the features and their theoretically valid ranges, we identified and removed 19 outliers, more than were marked in the original database.

Fig. 1: Data distribution and outliers.
figure 1

A combination of box plot and violin plot to visualize the distribution and outliers of the data.

Although the coating type in the original database is treated as a discrete ordinal variable whose value is assigned according to a scoring model30, the process is very complicated. To make the categorical variables suitable for ML regression models, one-hot encoding was employed. Table 2 shows the one-hot encoding of the coating type and soil type. For example, each soil type is represented by a 6-bit status register, where clay and clay loam are coded as 100000 and 010000, respectively. The status register bits are named Class_C, Class_CL, Class_SC, Class_SCL, Class_SL, and Class_SYCL accordingly. One-hot encoding also increases the feature dimension; the added features are filtered further in the later discussion.

Table 2 One-hot encoding of coating and soil type.
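
As an illustration, the encoding can be reproduced with pandas in a few lines. This is a minimal sketch: the column names and rows below are invented for illustration, but the resulting register bits match the naming in Table 2.

```python
import pandas as pd

# Invented rows; the raw column names are ours, not the original database's.
df = pd.DataFrame({"soil": ["C", "CL", "SC", "SCL", "SL", "SYCL"],
                   "coating": ["NC", "AEC", "WTC", "CTC", "FBE", "NC"]})

# One binary status-register column per category, named as in Table 2.
encoded = pd.get_dummies(df, prefix={"soil": "Class", "coating": "ct"})
print(encoded.columns.tolist())  # ['Class_C', 'Class_CL', ..., 'ct_NC', 'ct_WTC']
```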

Performance evaluation of the models

This section covers the evaluation of models based on four different EL methods (RF, AdaBoost, GBRT, and LightGBM) as well as the ANN framework. It is noted that the ANN structure involved in this study is the BPNN with only one hidden layer. Five statistical indicators, mean absolute error (MAE), coefficient of determination (R2), mean square error (MSE), root mean square error (RMSE), and mean absolute percentage error (MAPE) were used to evaluate and compare the validity and accuracy of the prediction results for 40 test samples.

Table 3 reports the average performance indicators over ten replicated experiments, which indicates that the EL models provide more accurate predictions of the dmax in oil and gas pipelines than the ANN model. In general, the strength of an ANN is learning information from complex and high-volume data, whereas tree models tend to perform better on smaller datasets. The pre-processed dataset in this study contains 240 samples with 21 features, a data volume that tree models are better suited to handling. The results show that the tree models (RF, AdaBoost, GBRT, and LightGBM) all outperform the ANN on the studied dataset.

Table 3 Average performance of models.

As shown in Fig. 2a, the predictions of the AdaBoost model fit the true values best under the condition that all models use their default parameters. The scatter points of predicted versus true values lie close to the perfect-fit line in Fig. 2b. Further analysis of the results in Table 3 shows that the AdaBoost model is superior to the other EL models in all metrics, with R2 and RMSE values of 0.895 and 0.624, respectively. The reason is that AdaBoost, which runs sequentially, is able to give more attention to mispredicted samples and continually improve the model, making the sequential model more accurate than the simple parallel models.

Fig. 2: Prediction results of models.
figure 2

a Predictions of different models on the test set, b scatter plot of predictions using the AdaBoost model.

AdaBoost model optimization

The AdaBoost model was identified as the best model in the previous section. Screening the features is necessary to improve its performance further. A preliminary screening of the features was performed by using the AdaBoost model to calculate the importance of each feature on the training set via the “feature_importances_” attribute built into the Scikit-learn Python module. As shown in Fig. 3, pp has the strongest contribution, with an importance above 30%, which indicates that this feature is extremely important for the dmax of the pipeline. The pp (protection potential, natural potential, Eon or Eoff potential) is a parameter related to the size of the electrochemical half-cell and is an indirect indicator of the surface state of the pipe at a single location, covering the macroscopic conditions assessed in the field31. The industry generally considers steel pipes to be well protected at pp below −850 mV32.

Fig. 3: The relative importance of 21 features according to AdaBoost.
figure 3

The important features will be used for the subsequent optimization step.
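
A minimal sketch of this screening step is given below, with randomly generated placeholder data standing in for the real training set; the generic feature names are assumptions, and the `estimator=` keyword was `base_estimator=` in Scikit-learn versions before 1.2.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.random((200, 21))               # placeholder training features
y_train = rng.random(200) * 5                 # placeholder dmax values (mm)
feature_names = [f"f{i}" for i in range(21)]  # stand-ins for "pp", "pH", "cc", ...

model = AdaBoostRegressor(estimator=DecisionTreeRegressor(), random_state=0)
model.fit(X_train, y_train)

# Impurity-based importances, normalized to sum to 1 (cf. Fig. 3).
for name, imp in sorted(zip(feature_names, model.feature_importances_),
                        key=lambda p: p[1], reverse=True):
    print(f"{name:10s} {imp:.3f}")
```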

pH and cc (chloride content) are two further important environmental factors, with importances of 15.9% and 12.3%, respectively. The acidity and erosiveness of the soil environment are enhanced at lower pH, especially below 5 (ref. 1). Chloride ions are a key factor in the depassivation of the naturally occurring passive film. Above a concentration threshold, chloride ions break down this passive film under microscopic conditions, accelerating corrosion at specific locations33. The service time of the pipeline is also an important factor affecting the dmax, which is in line with fundamental experience and intuition. Each of these four features contributes more than 10%, and their cumulative contribution exceeds 70%, so they can largely be regarded as key features. Meanwhile, the calculated importances of Class_SC, Class_SL, Class_SYCL, ct_AEC, and ct_FBE are equal to 0, and they are therefore removed from the selection of key features. It is worth noting that this does not necessarily imply that these features are completely independent of the dmax; more probably, owing to the small sample size, the model did not learn enough information about them from this dataset.

In order to identify key features, the correlation between different features must also be considered, because strongly related features may contain redundant information. They may obscure the relationship between the dmax and the features and reduce the accuracy of the model34. Figure 4 reports the matrix of Spearman correlation coefficients between the different features, which is used as a metric of the strength of the relationships between them. bd (soil bulk density) and Class_SCL are closely correlated, with a coefficient above 0.75, and t shows a correlation of 0.78 with ct_CTC (coal-tar-coated coating). Correlation coefficients of 0.6–0.8 can be considered strong. Based on the data characteristics and calculation results of this study, we used the median value of 0.7 as the threshold. Strongly correlated (>0.7) features imply similarity in nature, and thus the feature dimension can be reduced by removing the less important factors among strongly correlated features. As an example, the correlation coefficients of bd with Class_C (clay) and Class_SCL (sandy clay loam) are −0.80 and 0.75, respectively, which indicates a close monotonic relationship between bd and these two features. Specifically, Class_SCL implies a higher bd, while Class_C implies the contrary. As determined by the AdaBoost model, bd is more important than the other two factors, so Class_C and Class_SCL are considered redundant features and removed from the selection of key features. Only bd is retained in the final model, essentially because it already carries the information in Class_C and Class_SCL. Similarly, ct_WTC and ct_CTC are considered redundant.

Fig. 4: Matrix of the Spearman correlation coefficients between the different features.
figure 4

Strongly correlated (>0.7) features indicate the presence of redundant variables, which guide the screening of features.

Table 4 summarizes the 12 key features of the final screening. Among the soil and coating types, only Class_CL and ct_NC are retained. In addition, the association of these features with the dmax is calculated and ranked in Table 4 using GRA; all values exceed 0.9, verifying that these features are crucial. The accuracy of the AdaBoost model with these 12 key features as input is maintained (R2 = 0.96), and the model is more robust.

Table 4 The result of gray relational analysis.

Numerous hyperparameters affect the performance of the AdaBoost model, including the type and number of base estimators, the loss function, the learning rate, etc. In this study, the base estimator is set as a decision tree, and thus the hyperparameters of the decision tree are also critical, such as its maximum depth (max_depth) and the minimum sample size of the leaf nodes. Figure 5 shows how changes in the number of estimators and the max_depth affect the performance of the AdaBoost model on the experimental dataset. The line indicates the average result of 10 tests, and the shaded band is the error range.

Fig. 5: Performance of the model with different hyperparameter values.
figure 5

a Variation of AdaBoost model performance with maximum depth of decision tree, b Variation of AdaBoost model performance with the number of base estimators.

It can be found that as the number of estimators increases (with the other parameters at their defaults: learning rate of 1 and linear loss), the MSE and MAPE of the model decrease, while R2 increases. The model performance reaches a better level and is maintained once the number of estimators exceeds 50. In addition, the error bars of the model decrease gradually as the number of estimators increases, which means that the model becomes more robust. The max_depth also significantly affects the performance of the model: the overall performance improves as the max_depth increases. However, once the max_depth exceeds 5, the model tends to be stable, with R2, MSE, and MAPE equal to 0.950, 0.225, and 0.302, respectively.

To further determine the optimal combination of hyperparameters, grid search with a cross-validation strategy is used to search over the critical parameters. In this study, only the max_depth is considered among the hyperparameters of the decision tree owing to the small sample size. The candidates for the number of estimators are set as [10, 20, 50, 100, 150, 200, 250, 300]. The candidates for the loss function, the max_depth, and the learning rate are set as [‘linear’, ‘square’, ‘exponential’], [3, 5, 7, 9, 12, 15, 18, 21, 25], and [0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1], respectively. The total search space size is 8×3×9×7. Finally, the best candidates for the max_depth, loss function, learning rate, and number of estimators are 12, ‘linear’, 0.1, and 50, respectively.
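
The search described above can be sketched with Scikit-learn’s GridSearchCV. X_train and y_train below are random placeholders for the 200 training samples with the 12 key features, and `estimator__max_depth` routes the depth candidates to the base decision tree (the prefix is `base_estimator__` in Scikit-learn versions before 1.2).

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 12)), rng.random(200) * 5  # placeholders

param_grid = {
    "n_estimators": [10, 20, 50, 100, 150, 200, 250, 300],
    "loss": ["linear", "square", "exponential"],
    "estimator__max_depth": [3, 5, 7, 9, 12, 15, 18, 21, 25],
    "learning_rate": [0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1],
}  # 8 x 3 x 9 x 7 = 1512 candidate combinations, as stated above

search = GridSearchCV(
    AdaBoostRegressor(estimator=DecisionTreeRegressor(), random_state=0),
    param_grid, scoring="r2", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
```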

Model visualization and interpretation

Previous ML prediction models have usually failed to explain clearly how their predictions were obtained, and the same is true in corrosion prediction, which makes the models difficult to understand. Exploring how the different features affect the prediction overall is the first task in understanding a model. In this work, SHAP is used to interpret the predictions of the AdaBoost model on the entire dataset, and its values are used to quantify the impact of features on the model output. A negative SHAP value means that the feature has a negative impact on the prediction, resulting in a lower model output. Conversely, a positive SHAP value indicates a positive impact that is more likely to lead to a higher dmax.

Figure 6a depicts the global distribution of SHAP values for all samples of the key features, with the colors indicating the feature values, scaled to the same range. Blue and red indicate lower and higher feature values, respectively. It can be seen that pH has a significant effect on the dmax: lower pH usually shows a positive SHAP value, indicating that lower pH is more likely to increase dmax, while higher pH reduces it. In addition, previous studies have shown that the corrosion rate on the outside surface of the pipe is higher, and the pitting produced deeper, when the concentration of chloride ions in the soil is higher35. This is consistent with the depiction of cc in Fig. 6a, where higher values of cc (chloride content) have a clearly positive effect on the dmax of the pipe, while lower values have a negative effect. Similarly, higher pp (pipe/soil potential) significantly increases the probability of a larger pitting depth, while lower pp reduces the dmax. This is because a sufficiently low pp is required to provide effective protection to the pipeline. t (pipeline age) and wc (water content) have a similar effect on the dmax, with higher values showing a positive effect, which is the opposite of the effect of re (resistivity). The remaining features, such as ct_NC and bc (bicarbonate content), have less effect on pitting globally.

Fig. 6: SHAP global interpretation.
figure 6

a SHAP values for all samples of the key features, b mean absolute SHAP values of the features.

The mean absolute SHAP values are also used to describe the importance of the features. As shown in Fig. 6b, cc has the highest importance, with a mean absolute SHAP value of 0.54, meaning that across all samples cc shifts the AdaBoost prediction of dmax by 0.54 mm on average. The next is pH, with an average SHAP value of 0.48. pp and t are the other two main features, with SHAP values of 0.27 and 0.3, respectively. Globally, cc, pH, pp, and t are the four most important features affecting the dmax, which is generally consistent with the results discussed in the previous section.
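
A minimal sketch of how such a global analysis can be produced is given below, assuming `model` is the optimized AdaBoost regressor, `X` the matrix of the 12 key features, and `feature_names` their names (all carried over from the earlier sketches); the model-agnostic KernelExplainer is used because the faster TreeExplainer does not cover AdaBoost.

```python
import shap

background = shap.sample(X, 50)            # background data for estimating E[f(x)]
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X, feature_names=feature_names)  # cf. Fig. 6a
shap.summary_plot(shap_values, X, feature_names=feature_names,
                  plot_type="bar")                              # cf. Fig. 6b
```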

Although the overall analysis of the AdaBoost model above has revealed the macroscopic impact of the features on the model, the model is still a black box. Visualization and local interpretation of the model can open up this black box, helping us understand the mechanism of the model and explain the interactions between features. As previously mentioned, the AdaBoost model is computed sequentially from multiple decision trees, and we visualize the final decision tree. According to the optimal parameters, the max_depth (maximum depth) of the decision tree is 12 layers. Figure 7 shows the first 6 layers of this decision tree and traces the prediction path of one record. Taking the first layer as an example, if a sample has a pp value higher than −0.60 V, it grows along the right subtree; otherwise it turns to the left subtree. The sample tracked in Fig. 7 branches five times, and the prediction settles at 0.97 after the values of pp, cc, pH, and t have been evaluated. It should be noted that this is the result after 5 layers of the decision tree; the result after the full decision tree is 0.71, which is very close to the actual value. This decision tree is the basis on which the model makes predictions: once the values of these features are measured in the applicable environment, we can follow the graph and obtain the dmax.

Fig. 7: Visualization of decision trees.
figure 7

The splitting logic for the first six layers of the final decision tree generated by AdaBoost; the variables that have already been evaluated are marked in yellow in the bottom box.
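
Such a plot can be sketched with Scikit-learn’s own tree plotting utility, assuming `model` and `feature_names` from the sketches above and taking the last weak learner of the boosting sequence as the final tree.

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

final_tree = model.estimators_[-1]  # the last weak learner of the AdaBoost sequence
plt.figure(figsize=(24, 10))
plot_tree(final_tree, max_depth=6, feature_names=feature_names,
          filled=True, fontsize=6)  # draw only the first six layers
plt.show()
```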

In addition to the global interpretation, Fig. 8 shows instances of local interpretation (particular predictions) obtained from SHAP values. The gray vertical line in the middle of the SHAP decision plot (Fig. 8a) marks the base value of the model, and the colored lines are the prediction lines, which show how the model accumulates from the base value to the final output, starting from the bottom of the plot. Figure 8a shows the prediction lines for ten samples numbered 140–150, in which the features nearer the top have a higher influence on the predicted results, consistent with the feature importance. Further, pH and cc demonstrate largely opposite effects on the predicted values of the model. Figure 8b shows the SHAP waterfall plot for sample 142 (black dotted line in Fig. 8a), which interprets the unique contribution of the variables to the result at any given point. In this plot, E[f(x)] = 1.9 is the baseline (average expected value), and the final value f(x) = 1.57 is the predicted value for this instance. The SHAP value in each row represents the contribution of that feature to the final predicted value for this instance. For example, the pH of 5.56 has a positive effect on the dmax, adding 0.32 to the prediction from the baseline. However, cc (14.4 ppm) has a negative effect on the dmax, decreasing the predicted result by 0.28; this indicates that a chloride content of 14.4 ppm has not yet reached the threshold that promotes pitting. Interestingly, the rp of 328 mV in this instance shows a large effect on the result, whereas t (19 years) does not, and neither pattern showed up in the global interpretation, so further quantification of the impact of these features on the predicted results is required. Figure 8c shows the SHAP force plot, which can be considered a horizontal projection of the waterfall plot and clusters the features that push the prediction higher (red) and lower (blue).

Fig. 8: SHAP local interpretation.
figure 8

a SHAP prediction lines for ten instances, b SHAP waterfall plot on selected instance, c SHAP force plot on selected instance.
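
The three local plots can be sketched with the SHAP plotting API, continuing with the `explainer`, `shap_values`, `X`, and `feature_names` assumed in the global sketch above.

```python
import shap

# Decision plot for the ten samples numbered 140-150 (cf. Fig. 8a).
shap.decision_plot(explainer.expected_value, shap_values[140:150],
                   feature_names=feature_names)

# Waterfall and force plots for the single instance 142 (cf. Fig. 8b, c).
instance = shap.Explanation(values=shap_values[142],
                            base_values=explainer.expected_value,
                            data=X[142], feature_names=feature_names)
shap.plots.waterfall(instance)
shap.force_plot(explainer.expected_value, shap_values[142], X[142],
                feature_names=feature_names, matplotlib=True)
```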

To further depict how individual features affect the model’s predictions continuously, ALE main effect plots are employed. Figure 9 shows the ALE main effect plots for the nine features with significant trends. The ranking over the span of ALE values for these features is generally consistent with the ranking of feature importance discussed in the global interpretation, which indirectly validates the reliability of the ALE results.

Fig. 9: ALE main effect plot for dmax.
figure 9

ALE versus a cc, b pH, c pp, d t, e wc, f bd, g bc, h re, i rp.
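
A sketch of how such plots can be generated is given below, using the alibi package as one of several available Python ALE implementations; `model`, `X` (as a NumPy array), and `feature_names` are assumed as in the sketches above.

```python
import numpy as np
from alibi.explainers import ALE, plot_ale

ale = ALE(model.predict, feature_names=feature_names, target_names=["dmax"])
exp = ale.explain(np.asarray(X))

# The nine panels of Fig. 9, selected by column index.
panels = ["cc", "pH", "pp", "t", "wc", "bd", "bc", "re", "rp"]
plot_ale(exp, features=[feature_names.index(f) for f in panels], n_cols=3)
```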

As shown in Fig. 9a, the ALE value of the dmax increases monotonically with cc overall. That is, the higher the chloride content in the environment, the larger the dmax. It is interesting to note that dmax exhibits a very strong sensitivity to cc (chloride content): the ALE value increases sharply as cc exceeds 20 ppm. This means that the pipeline develops a larger dmax owing to the promotion of pitting by chloride above this critical level. However, the stimulating effect of chloride stabilizes once the cc exceeds 150 ppm, and chloride is then no longer a critical factor affecting the dmax. The reason is that high concentrations of chloride ions cause more intense pitting on the steel surface, and the developing pits become covered by massive corrosion products, which inhibits their further development36. In addition, there is no strict corrosion boundary in the complex soil environment; under higher chloride content, localized corrosion extends more easily into continuous areas, resulting in a corrosion surface similar to general corrosion in which the pits are erased35. pH is a local parameter that modifies the surface activity mechanism of the environment surrounding the pipe. A low-pH environment leads to active corrosion and may create local conditions that favor the corrosion mechanism of sulfate-reducing bacteria31. The corrosion rate increases as the pH of the soil decreases in the range of 4–8.5, and the dmax is correspondingly larger, as shown in Fig. 9b. The ALE values of dmax increase monotonically with both pp (pipe/soil potential) and t, as shown in Fig. 9c, d. That is, the more positive the pipe/soil potential and the longer the exposure time of the pipeline, the larger the attainable pitting depth. From Fig. 9c, it is further found that the dmax increases rapidly for pp values above −0.8 V, while the pipeline is well protected below −0.8 V.

wc (water content) is also key to inducing external corrosion in oil and gas pipelines, and this parameter depends on physical factors such as the soil skeleton, pore structure, and density31. As the wc increases, the corrosion rate of metals in the soil increases until reaching a critical level; with a further increase in wc, the oxygen supply to the metal surface decreases and the corrosion rate begins to fall37. The critical wc is related to the soil type and its characteristics, the type of pipe steel, the exposure conditions of the metal, and the duration of soil exposure. The curve in Fig. 9e depicts a positive correlation between dmax and wc up to 35%, but the critical wc cannot be determined, which may be because the dataset is not extensive enough. High soil wc also promotes the growth of corrosion-inducing bacteria in contact with buried pipes, which may increase pitting38. With increasing bd (bulk density), bc (bicarbonate content), and re (resistivity), dmax presents a decreasing trend, and all three are strongly sensitive within a certain range. Once bc is over 20 ppm or re exceeds 150 Ω·m, dmax remains stable, as shown in Fig. 9f–h. rp (redox potential) has no significant effect on dmax in the range of 0–300 mV, but at higher rp the oxidation capacity of the soil is enhanced and pipe corrosion is accelerated39.

In addition to the main effects of single factors, the corrosion of the pipeline is also subject to the interaction of multiple factors. The interaction effect of two features (factors) is known as a second-order interaction. The ALE second-order interaction effect plot shows the additional interaction effect of two features, excluding their main effects. Red and blue represent above- and below-average predictions, respectively. Figure 10a shows the ALE second-order interaction effect plot for pH and pp, which reflects the second-order effect of these features on the dmax. For low-pH and high-pp environments (zone A), an additional positive effect on the predicted dmax is seen. High pH and high pp (zone B) have an additional negative effect on the prediction of dmax; however, low pH and low pp (zone C) also have an additional negative effect. As shown in Fig. 10b, the Pourbaix diagram of the Fe-H2O system illustrates the main regions of immunity, corrosion, and passivation over a wide range of pH and potential. pp is the potential of the buried pipeline relative to the Cu/CuSO4 electrode, i.e., the free corrosion potential (Ecorr) of the pipeline40, whereas the potential in the Pourbaix diagram is the potential of Fe in water relative to the standard hydrogen electrode. Despite this difference in reference, the Pourbaix diagram still provides a valid guide for the protection of the pipeline. It is generally considered that cathodic protection of pipelines is effective if the pp is below −0.85 V40,41. In Fig. 10, zone A lies outside the protection potential and corresponds to the corrosion zone of the Pourbaix diagram, where the pipeline has a severe tendency to corrode, resulting in an additional positive effect on dmax. Zones B and C correspond to the passivation and immunity zones, respectively, where the pipeline is well protected, resulting in additional negative effects. In general, the calculated ALE interaction effects are consistent with corrosion experience.

Fig. 10: Interaction of pH and pp.
figure 10

a ALE second-order interaction effect plot for pH and pp, b Pourbaix diagram of the Fe-H2O.

Similarly, more interaction effects between features are evaluated and shown in Fig. 11. Figure 11a reveals the interaction effect between pH and cc, showing an additional positive effect on the dmax in environments with low pH and high cc. Although the increase of dmax with increasing cc was demonstrated in the previous analysis, high pH and high cc show an additional negative effect on the prediction of the dmax, which implies that high pH weakens the promotion of corrosion by chloride. The interaction of low pH and high wc has an additional positive effect on dmax, as shown in Fig. 11b; that is, lower pH amplifies the effect of wc. Both basic and acidic soils may be associated with corrosion, depending on the resistivity1,42. Based on the standard BS EN 12501-2:2003, Amaya-Gomez et al.42 reported a corrosion classification diagram combining soil resistivity and pH, which indicates that oil and gas pipelines in soils of low resistivity are more susceptible to external corrosion at low pH. This is verified by the interaction of pH and re depicted in Fig. 11c, where low pH combined with low re contributes an additional positive effect to the dmax. In addition, low pH and low rp give an additional promotion to the dmax, while high pH and high rp give an additional negative effect, as shown in Fig. 11d. In the previous discussion, it was pointed out that the corrosion tendency of the pipelines increases with increasing pp and wc. As shown in Fig. 11e, this law is still reflected in the second-order effects of pp and wc. Within the protection potential, increasing wc leads to an additional positive effect, i.e., pipeline corrosion is further promoted. In the lower-wc environment, high pp causes an additional negative effect, as the increased corrosion tendency at high potential is already captured by the main effect. The difference is that high pp and high wc together produce an additional negative effect, which may be attributed to the formation of corrosion product films under severe corrosion that depress further corrosion. In addition, Fig. 11f indicates that the effect of bc on dmax is further amplified under high-pp conditions. More second-order interaction effect plots between features are provided in the Supplementary Figures.

Fig. 11: ALE second-order interaction plots.
figure 11

ALE plots for a pH and cc, b pH and wc, c pH and re, d pH and rp, e pp and wc, f pp and bd.

In the above discussion, we analyzed the main effects and second-order interactions of some key features, which explain how these features affect the model’s prediction of dmax. Third- and higher-order effects of the features on dmax were not discussed, since higher-order effects are difficult to interpret and are usually not as dominant as the main and second-order effects43.

In summary, five valid ML models were used to predict the maximum pitting depth (dmax) of external corrosion of oil and gas pipelines using realistic and reliable monitoring datasets. Spearman correlation coefficients, GRA, and the AdaBoost method were used to evaluate the importance of features; the key features were screened, and an optimized AdaBoost model was constructed. In addition, this paper innovatively introduces interpretability into corrosion prediction. The interpretation and transparency frameworks help to understand and discover how environmental features affect corrosion, and provide engineers with a convenient tool for predicting dmax. The main conclusions are summarized below.

  1. (1)

    Compared with ANN, RF, GBRT, and LightGBM, AdaBoost predicts the dmax of the pipeline more accurately, and its R2 value exceeds 0.95 after optimization.

  2. (2)

    cc (chloride content), pH, pp (pipe/soil potential), and t (pipeline age) are the four most important factors affecting dmax across the several evaluation methods, while coating and soil type show very little effect on the prediction for the studied dataset.

  3. (3)

    The ALE values of dmax increase monotonically with increasing cc, t, wc (water content), pp, and rp (redox potential), which indicates that increases in these environmental features all contribute to the dmax of the pipeline. Conversely, increases in pH, bd (bulk density), bc (bicarbonate content), and re (resistivity) reduce the dmax.

  4. (4)

    The interactions of features show a significant effect on dmax. pH exhibits second-order interaction effects on dmax with pp, cc, wc, re, and rp, respectively. At extreme feature values, the interactions tend to show additional positive or negative effects.

Methods

Collection and description of experimental data

Sufficient and valid data are the basis for the construction of artificial intelligence models. The establishment and sharing of reliable and accurate databases is an important part of the development of materials science under its new paradigm. The experimental data for this study were obtained from the database of Velázquez et al.30, which covers various important parameters in the initiation and growth of corrosion defects. This database contains 259 samples of soil and pipe variables for an onshore buried pipeline that has been in operation for 50 years in southern Mexico. The environmental variables include soil resistivity, pH, water content, redox potential, bulk density, concentrations of dissolved chloride, bicarbonate, and sulfate ions, and pipe/soil potential. The service time of the pipe and the types of coating and soil are also covered. Soil samples were classified into six categories based on the relative proportions of sand, silt, and clay: clay (C), clay loam (CL), sandy clay loam (SCL), silty clay (SC), silty loam (SL), and silty clay loam (SYCL). Coating types include noncoated (NC), asphalt-enamel-coated (AEC), wrap-tape-coated (WTC), coal-tar-coated (CTC), and fusion-bonded-epoxy-coated (FBE). The maximum pitting depth (dmax), defined as the maximum depth of corrosive metal loss for defects of diameter less than twice the pipe wall thickness, was measured at each exposed pipeline segment. Nine outliers had already been flagged by simple observation. The complete dataset is available in the literature30, and a brief description of the variables is given in Table 5. Figure 12 shows the distribution of the data under different soil types.

Table 5 Basic descriptive statistics of variables.
Fig. 12: Data distribution under different soil types.
figure 12

Samples were classified by soil type, and the overall distribution of the samples in each pair of feature dimensions is shown. Plots on the diagonal depict the distribution of individual features for the different soil types.

Data pre-processing

Data pre-processing is a necessary part of ML. In this study, we mainly consider outlier exclusion and data encoding in this section. Variance, skewness, kurtosis, and CV are used to profile the global distribution of the data. Specifically, skewness describes the symmetry of the distribution of a variable’s values, kurtosis describes its steepness, variance describes the dispersion of the data, and CV combines the mean and standard deviation to reflect the degree of variation in the data. These statistics can help to determine whether there are outliers in the dataset: if the CV is greater than 15%, outliers may be present, and combining the kurtosis and skewness values allows this possibility to be analyzed further. To identify outliers concretely, the interquartile range (IQR) is commonly used to determine the outlier boundaries. With the first quartile (25% quantile) Q1 and the third quartile (75% quantile) Q3, IQR = Q3 − Q1. Samples smaller than Q1 − 1.5·IQR (lower bound) or larger than Q3 + 1.5·IQR (upper bound) are considered outliers and should be excluded.
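
A minimal sketch of this IQR rule, applied column-wise to a pandas DataFrame of the quantitative variables, is given below.

```python
import pandas as pd

def iqr_outlier_mask(df: pd.DataFrame) -> pd.DataFrame:
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], column-wise."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    return (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)

# Rows with any flagged value are then reviewed against the physical meaning
# of the features before removal, e.g.:
# cleaned = df[~iqr_outlier_mask(df).any(axis=1)]
```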

One-hot encoding represents categorical data well and is extremely easy to implement without complex computations. The approach encodes the different classes of a categorical feature using a status register, where each class has its own independent bit and only one bit is valid at any given time; that is, only one bit is 1 and the rest are zero.

Ensemble learning

Ensemble learning (EL) is an algorithm that combines many base machine learners (estimators) into an optimal one to reduce error, enhance generalization, and improve model prediction44. EL with decision-tree base estimators is widely used. Generally, EL can be classified into parallel and sequential EL according to how the base estimators are combined. Parallel EL models, such as the classical Random Forest (RF), use bagging to train decision trees independently in parallel, and the final output is an average result. Sequential EL reduces variance and bias by creating a weak predictive model and iterating continuously using boosting techniques; this is true of the AdaBoost, gradient boosting regression tree (GBRT), and light gradient boosting machine (LightGBM) models. We selected four promising algorithms from the many EL algorithms by considering the volume of data, the properties of the algorithms, and the results of preliminary experiments. The following briefly describes the mathematical framework of the four EL models.

RF is a supervised EL method that consists of a large number of individual decision trees operating as a whole. Each individual tree makes a prediction or classification, and the prediction or classification with the most votes becomes the result of the RF45. The method consists of two phases. In the first stage, RF uses a bootstrap aggregating approach to randomly select input features and training data to build multiple decision trees. This randomness reduces the correlation between individual trees, and thus the risk of over-fitting. In the second stage, the average of the predictions obtained from the individual decision trees is calculated as follows25:

$$y(x) = \frac{1}{n}\mathop {\sum}\limits_{i = 1}^n {y_i(x)}$$
(1)

where yi(x) represents the prediction of the i-th decision tree, n is the total number of trees, y(x) is the target output, and x denotes the input feature vector.

AdaBoost is a powerful iterative EL technique that creates a strong predictive model by merging multiple weak learners46. The general form of AdaBoost is as follows:

$$F_T(X) = \mathop {\sum}\limits_{t = 1}^T {f_t(X)}$$
(2)

where ft denotes the t-th weak learner and X denotes the input feature vector. Each iteration generates a new learner using the training dataset to evaluate all samples. During the process, the weights of incorrectly predicted samples are increased, while those of correctly predicted ones are decreased. Meanwhile, a new weak learner is added in each iteration to minimize the total training error, as follows:

$$E_t = \mathop {\sum}\nolimits_i {E\left[ {F_{t - 1}(X_i) + \alpha _th(X_i)} \right]}$$
(3)

where Ft-1 denotes the boosted learner obtained from the previous iteration, and ft(X) = αth(X) is the newly added weak learner. Eventually, AdaBoost forms a single strong learner by combining several weak learners.

Different from AdaBoost, GBRT uses the generated weak learners to fit the negative gradient of the loss function (L) of the cumulative model from the previous iteration. The loss is then reduced along the negative gradient direction by adding the fitted weak learner to the cumulative model. The process can be expressed as follows45:

$$F_m(x) = F_{m - 1}(x) + \mathop {{\mathrm{argmin}}}\limits_h \mathop {\sum}\limits_{i = 1}^n {L\left[ {y_i,F_{m - 1}(x_i) + h(x_i)} \right]}$$
(4)
$${{{\mathrm{g}}}}_m(x) = - \frac{{\partial L[y,F_{m - 1}(x)]}}{{\partial F_{m - 1}(x)}}$$
(5)

where h(x) is a base learning function, x is the vector of input features, and gm is the negative gradient of the loss function. The loss is minimized when the m-th weak learner fits gm of the loss function of the cumulative model25. The final gradient boosting regression tree is generated as an ensemble of weak prediction models.

LightGBM is a framework for efficient implementation of the gradient boosting decision tree (GBDT) algorithm, which supports efficient parallel training with fast training speed and superior accuracy. Instead of segmenting the internal nodes of each tree using information gain as in traditional GBDT, LightGBM uses a gradient-based one-side sampling (GOSS) method. In addition, LightGBM employs exclusive feature bundling (EFB) to accelerate training without sacrificing accuracy47.

Feature engineering

Feature engineering (FE) is the process of transforming raw data into features that better express the nature of the problem, improving the accuracy of model predictions on unseen data. Data pre-processing, feature transformation, and feature selection are the main aspects of FE. Feature selection is the most important part of FE: it selects the useful features from a large set, removing those that are irrelevant to the problem or redundant with others and retaining only the important ones. Feature selection encompasses various methods such as correlation coefficients, principal component analysis, and mutual information. In this study, this process is done by gray relational analysis (GRA) and Spearman correlation coefficient analysis, and the importance of features is calculated by the tree model. The implementations of data pre-processing and feature transformation are described in detail in the “Data pre-processing” section above.

The basic idea of GRA is to determine the closeness of a connection from the similarity of the geometric shapes of sequence curves: the closer the shapes of the curves, the higher the correlation of the corresponding sequences23,48. The method is used to analyze the degree of influence of each factor on the results. The core is to establish a reference sequence according to certain rules, take each assessment object as a factor sequence, and finally obtain the correlation of each factor sequence with the reference sequence. Because the variables have different attributes and units, they must first be normalized according to Eq. (6) to establish uniform evaluation criteria.

$$X_i(k) = \frac{{x_i(k)}}{{\overline {x_i} }}$$
(6)

where \(X_i(k)\) represents the normalized k-th value of the i-th factor series and \(\overline {x_i}\) is the mean of that series. The gray correlation between the reference series \(X_0 = x_0(k)\) and the factor series \(X_i = x_i\left( k \right)\) is defined as:

$$r(X_0,X_i) = \frac{1}{n}\mathop {\sum}\limits_{k = 1}^n {r(X_0(k),X_i(k))}$$
(7)
$$r(X_0(k),X_i(k)) = \frac{{\mathop {{\min }}\limits_i \mathop {{\min }}\limits_k \left| {x_0(k) - x_i(k)} \right| + \rho \mathop {{\max }}\limits_i \mathop {{\max }}\limits_k \left| {x_0(k) - x_i(k)} \right|}}{{\left| {x_0(k) - x_i(k)} \right| + \rho \mathop {{\max }}\limits_i \mathop {{\max }}\limits_k \left| {x_0(k) - x_i(k)} \right|}}$$
(8)

where ρ is the discriminant coefficient, \(\rho \in \left[ {0,1} \right]\), which serves to sharpen the differences between the correlation coefficients. Usually ρ is taken as 0.5.
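
A compact NumPy sketch of Eqs. (6)–(8) is given below, assuming each column of X is a factor series and y is the reference series (here, dmax).

```python
import numpy as np

def grey_relational_grades(y: np.ndarray, X: np.ndarray, rho: float = 0.5):
    """Grade of each factor series (column of X) against the reference y."""
    y_norm = y / y.mean()                    # Eq. (6): mean normalization
    X_norm = X / X.mean(axis=0)
    diff = np.abs(X_norm - y_norm[:, None])  # |x0(k) - xi(k)|
    d_min, d_max = diff.min(), diff.max()    # global min / max differences
    coef = (d_min + rho * d_max) / (diff + rho * d_max)  # Eq. (8)
    return coef.mean(axis=0)                 # Eq. (7): average over k
```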

The Spearman correlation coefficient is a non-parametric (distribution-free) measure of the strength of the association between variables. It is computed from the ranks of the original data34: regardless of how the data of the two variables change or what distribution they follow, only the order of the values matters. Two variables are strongly correlated if their values are ranked in the same or similar order within the group. In particular, if one variable is a strictly monotonic function of the other, the Spearman correlation coefficient equals +1 or −1. The Spearman correlation coefficient of variables R and S follows the equation:

$$P = \frac{{\mathop {\sum }\nolimits_{i = 1}^N (R_i - \tilde R)(S_i - \tilde S)}}{{\sqrt {\mathop {\sum }\nolimits_{i = 1}^N (R_i - \tilde R)^2\mathop {\sum }\nolimits_{i = 1}^N (S_i - \tilde S)^2} }} = 1 - \frac{{6{\sum} {d_i^2} }}{{N(N^2 - 1)}}$$
(9)

where Ri and Si are the ranks of variables R and S for observation i, \(\tilde R\) and \(\tilde S\) are the mean ranks of R and S, respectively, N is the total number of observations, and di = Ri − Si denotes the rank difference of observation i.
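
A sketch of the redundancy screening based on this coefficient is given below, assuming `df` is a pandas DataFrame of the encoded features and using the 0.7 threshold chosen in the Results.

```python
import pandas as pd

corr = df.corr(method="spearman")            # matrix as in Fig. 4
pairs = [(a, b, round(corr.loc[a, b], 2))
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if abs(corr.loc[a, b]) > 0.7]
print(pairs)  # e.g. ('bd', 'Class_SCL', 0.75), ('t', 'ct_CTC', 0.78)
```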

Explaining machine learning

The SHAP interpretation method extends the concept of the Shapley value in game theory, which aims to fairly distribute the players’ contributions when they jointly achieve an outcome26. SHAP values can be used in ML to quantify the contribution of each feature to the predictions the model jointly provides. The Shapley value of feature i in the model is:

$$Shap(i) = \mathop {\sum}\limits_{k \subseteq M\backslash \left\{ i \right\}} {\frac{{\left| k \right|!\left( {N - \left| k \right| - 1} \right)!}}{{N!}}\left[ {f_x\left( {k \cup \left\{ i \right\}} \right) - f_x(k)} \right]}$$
(10)
$$f_x(k) = E[f(x)|x_k]$$
(11)

where M is the set of all N features, k is a subset of M that does not contain feature i, and |k| is the number of features in k. E[f(x)|xk] represents the expected value of the function conditioned on the subset k. The prediction result y of the model is given in the following equation.

$$y = Shap_0 + \mathop {\sum}\limits_{i = 1}^N {Shap(i)}$$
(12)

Here, Shap0 is the average prediction over all observations, and the sum of all SHAP values equals the actual prediction. Further, the absolute SHAP value reflects the strength of the feature’s impact on the model prediction, and thus SHAP values can be used as feature importance scores49,50.

The ALE plot describes the average effect of the feature variables on the predicted target. The most important property of ALE is that it is free from the assumption of variable independence, which gives it wide applicability in practical settings. The key idea of ALE is to reduce a complex prediction function to a simple one that depends on only a few factors29. The ALE plot displays the predicted changes and accumulates them over a grid. To quantify the local effects, the feature range is divided into many intervals, and the uncentered effect is estimated by the following equation.

$$\hat {\tilde f}_{j,ALE}(x) = \mathop {\sum}\limits_{k = 1}^{k_j(x)} {\frac{1}{{n_j(k)}}} \mathop {\sum}\limits_{i:x_j^{(i)} \in N_j(k)} {\left[ {f(z_{k,j},x_{\backslash j}^{(i)}) - f(z_{k - 1,j},x_{\backslash j}^{(i)})} \right]}$$
(13)

where zk,j denotes the boundary value of feature j in the k-th interval, nj(k) is the number of samples in the k-th interval, the inner sum runs over the samples i whose value of feature j falls in the k-th interval, and x\j(i) denotes the features of sample i other than feature j. To make the average effect zero, the effect is centered as:

$$\hat f_{j,ALE}(x) = \hat {\tilde f}_{j,ALE}(x) - \frac{1}{n}\mathop {\sum}\limits_{i = 1}^n {\hat {\tilde f}_{j,ALE}(x_j^{(i)})}$$
(14)

That is, the average effect is subtracted from each effect. Relative to the average predicted value of the data, the centered value can be interpreted as the main effect of the j-th feature at a given point.

Performance metrics

In order to quantify the performance of the model well, five commonly used metrics are used in this study, including MAE, R2, MSE, RMSE, and MAPE. Their equations are as follows.

$$MAE = \frac{{\mathop {\sum }\nolimits_{i = 1}^n |P_i - T_i|}}{n}$$
(15)
$$R^2 = 1 - \mathop {\sum}\limits_{i = 1}^n {(P_i - T_i)^2} /\mathop {\sum}\limits_{i = 1}^n {(P_i - \bar T)^2}$$
(16)
$$MSE = \frac{{\mathop {\sum }\nolimits_{i = 1}^n (P_i - T_i)^2}}{n}$$
(17)
$$RMSE = \sqrt {MSE}$$
(18)
$$MAPE = \frac{{100\% }}{n}\mathop {\sum}\limits_{i = 1}^n {\left| {\frac{{P_i - T_i}}{{T_i}}} \right|}$$
(19)

where Ti represents the actual maximum pitting depth, Pi the predicted value, and n the number of samples. R2 reflects the linear relationship between the predicted and actual values and is better when close to 1. MSE, RMSE, and MAE measure the absolute error between the predicted and actual values, while MAPE measures the relative error; lower values of these four metrics are desired.
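
These metrics can be computed directly with Scikit-learn; the sketch below assumes T and P are 1-D arrays of actual and predicted dmax for the test samples.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)

def report(T: np.ndarray, P: np.ndarray) -> dict:
    """Eqs. (15)-(19) for actual depths T and predictions P."""
    mse = mean_squared_error(T, P)
    return {"MAE": mean_absolute_error(T, P),
            "R2": r2_score(T, P),
            "MSE": mse,
            "RMSE": float(np.sqrt(mse)),
            "MAPE": 100 * mean_absolute_percentage_error(T, P)}  # in percent
```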

Implementation methodology

Step 1: Pre-processing

Pre-processing of the data is an important step in the construction of ML models. Although some outliers were flagged in the original dataset, more precise screening was required to ensure the accuracy and robustness of the model. The CV and box plots of the data distribution were used to identify outliers in the original database. In addition, the soil and coating types in the original database are categorical variables in textual form, which need to be transformed into quantitative variables by one-hot encoding in order to perform regression tasks.

Step 2: Model construction and comparison

After pre-processing, 200 samples were chosen randomly as the training set and the remaining 40 samples as the test set. The RF, AdaBoost, GBRT, and LightGBM methods introduced in the previous section, as well as an ANN model, were applied to the training set with default hyperparameters to establish models for predicting the dmax of oil and gas pipelines. A promising model was then selected by comparing the prediction results and performance metrics of the different models on the test set, as sketched below.
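
This step can be sketched as follows, assuming `X` and `y` hold the pre-processed features and dmax; the single hidden layer size of the ANN is an illustrative choice, and `report()` is the metrics helper sketched in the “Performance metrics” section.

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestRegressor, AdaBoostRegressor,
                              GradientBoostingRegressor)
from sklearn.neural_network import MLPRegressor
from lightgbm import LGBMRegressor

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=200, random_state=0)  # 200 training / 40 test samples

models = {
    "ANN": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000),  # one hidden layer
    "RF": RandomForestRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "GBRT": GradientBoostingRegressor(random_state=0),
    "LightGBM": LGBMRegressor(random_state=0),
}
for name, m in models.items():
    m.fit(X_train, y_train)
    print(name, report(y_test, m.predict(X_test)))
```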

Step 3: Optimization of the best model

The best model was determined in step 2. However, the performance of an ML model is influenced by a number of factors; apart from data quality, the hyperparameters of the model are the most important. In this step, the impact of variations in each hyperparameter on the model was evaluated individually, and multiple parameter combinations were systematically traversed using grid search with cross-validation to determine the optimal parameters. The optimized best model was then applied to the test set, and the resulting predictions were analyzed in detail in the next step.

Step 4: Model visualization and interpretation

Unlike traditional mathematical models, an optimized and trained ML model does not have a simple closed-form expression; it is stored in the computer in an extremely complex form with poor readability. In this study, this complex tree model is presented clearly using visualization tools for review and application. The SHAP and ALE values of the features were then calculated to provide global and local interpretations of the model, including the degree of contribution of each feature to the prediction, the influence patterns, and the interaction effects between features.