Estimating pitting descriptors of 316 L stainless steel by machine learning and statistical analysis

Coelho, Leonardo Bertolucci; Torres, Daniel; Vangrunderbeek, Vincent; Bernal, Miguel; Paldino, Gian Marco; Bontempi, Gianluca; Ustarroz, Jon

doi:10.1038/s41529-023-00403-z


Article
Open access
Published: 21 October 2023


npj Materials Degradation volume 7, Article number: 82 (2023)


Abstract

A hybrid rule-based/ML approach using linear regression and artificial neural networks (ANNs) determined pitting corrosion descriptors from high-throughput data obtained with Scanning Electrochemical Cell Microscopy (SECCM) on 316 L stainless steel. Non-parametric density estimation determined the central tendencies of the Epit/log(jpit) and Epass/log(jpass) distributions. Descriptors estimated using conditional mean or median curves were compared to their central tendency values, with the conditional medians providing more accurate results. Due to their lower sensitivity to high outliers, the conditional medians were more robust representations of the log(j) vs. E distributions. An observed trend of passive range shortening with increasing testing aggressiveness was attributed to delayed stabilisation of the passive film, rather than early passivity breakdown.

Introduction

Despite considerable achievements in the predictive modelling of pitting corrosion^1,2,3,4,5,6, more research is still needed. The challenge of estimating relevant pitting descriptors from experimental data is still seldom addressed in the literature⁷.

Potentiodynamic polarisation (PP) curves are one of the main electrochemical techniques used for corrosion research in academia, also with a particularly high acceptance in the industry⁸, as a benchmark test for examining the resistance to localised corrosion. As summarised by Hughes et al.¹: “the cyclic polarisation (CP) method, such as the standard ASTM G61⁹, is probably the only standardised, traditional electrochemical method used to determine the relatively localised corrosion susceptibility. It involves the anodic polarisation of a specimen until localised corrosion initiates, as indicated by a large increase in the applied current. An indication of the susceptibility to initiation of localised corrosion in this test method is given by the potential (E) at which the anodic current increases rapidly, i.e., the breakdown potential. The nobler (more positive) this potential, the less susceptible the alloy is to initiate localised corrosion. The conventional understanding is that the breakdown potential is the potential above which pits are initiated”¹.

Corrosion experts^7,10 often rely on a rather qualitative description of the pitting potential (“Epit is defined as a potential above which there is a rapid increase in the current on a polarisation curve”¹⁰), and the referred standard⁹ is likewise vague on how to extract the descriptor from PP (or CP) curves (“the potential in which a sharp rise in current is observed”). According to another standard, ISO 15158^11,12, “Epit is defined as the potential corresponding to the anodic current density of 10 μA cm⁻² in the region of stable pit growth”. Nonetheless, such a definition, although quantitative, is potentially problematic since it is based on a fixed, static value and does not consider the likely high variability of responses.
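To make the fixed-threshold reading of Epit concrete, the sketch below locates the potential at which a single curve crosses 10 μA cm⁻². It is only an illustration: the function name, the interpolation in log10(j) between neighbouring samples, and the use of the last upward crossing as a stand-in for “the region of stable pit growth” are assumptions made here, not prescriptions of ISO 15158 or of this work.

```python
import math

def epit_fixed_threshold(potentials, currents, threshold=10.0):
    """Return E at the last upward crossing of `threshold`, or None.

    `currents` must be positive anodic current densities in the same
    units as `threshold` (assumed here to be uA cm^-2).
    """
    crossing = None
    for k in range(1, len(currents)):
        j0, j1 = currents[k - 1], currents[k]
        if j0 < threshold <= j1:
            # Interpolate linearly in log10(j) between the two samples.
            frac = (math.log10(threshold) - math.log10(j0)) / (
                math.log10(j1) - math.log10(j0)
            )
            crossing = potentials[k - 1] + frac * (potentials[k] - potentials[k - 1])
    return crossing
```

Because the threshold is fixed, a curve with a metastable transient above 10 μA cm⁻² and a curve with a clean breakdown are treated by the same rule, which is one way to see why a static value may not reflect the variability of the responses.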

Beyond the sensitivity to the concentration and combination of aggressive species¹³ and the scan rate¹⁴, the determination of Epit was found to be dependent on the experimental method used¹⁵. Simple potentiodynamic polarisation experiments have shown extremely variable results in pitting potential¹⁶, exhibiting wide experimental scatters of hundreds of millivolts⁶.

Previously, it was believed that Epit had a sharp threshold value below which all specimens would exhibit infinite immunity to pitting, and any observed data scatter was attributed to poor experimental control¹⁷. However, Nathan and Dulaney¹⁸ were the first to challenge this notion, emphasising the importance of statistical approaches to localised corrosion. Subsequently, Shibata and Takeyama¹⁹ demonstrated that the random variation in data is an intrinsic property of pitting corrosion and should be analysed statistically. More recently, Nyby et al.⁷ precisely observed that the rapid increase in current density (j) occurs “when the applied potential is more noble than a specific range of values”. One research paper even argues that it is questionable that exact values of pitting potential can be experimentally measured¹⁰.

Difficulties in determining a generalised value for Epit are associated with events (stable pitting growth) that are very dynamic in nature (e.g., high pit growth rates and extreme pit chemistry changes) and take place on a nanometre scale¹⁵. Aggressive species (Cl^-), in combination with surface heterogeneities, trigger a dynamic degradation process in which transient passivity breakdown/repassivation events occur over a large population of initiated pits⁴. The study of localised corrosion triggered by chloride remains a relevant topic within the scope of the targets set by the blue economy, as pitting corrosion is particularly harmful in marine environments and coastal areas²⁰. The development of advanced scanning electrochemical techniques, such as the scanning vibrating electrode technique (SVET) and scanning electrochemical microscopy (SECM)²¹, has facilitated substantial progress in research on localised corrosion¹. Scanning Electrochemical Cell Microscopy (SECCM) is the next generation of the well-known electrochemical droplet cell technique²² and differs from the more commonly used SECM^23,24, as only small portions of a surface are exposed to electrolyte through brief meniscus contact from a nanopipette probe, and electrochemical signals are measured directly^25,26. In this work, SECCM was selected as an experimental tool mainly due to its proven high-throughput capabilities^22,27,28. The collection of statistically representative amounts of data is key when high variance is expected in the target feature²⁹.

Only a limited number of studies in localised corrosion have used data-driven approaches^{2,7,30,31,32,33,34,35,36,37,38} so far. The major reasons for their limited application are that the community traditionally relies on low-throughput means of data generation, focusing on specific input-output relationships, and that feature engineering is complicated by the vast number of influencing variables. The problem of pitting corrosion should ideally be addressed with data-centric approaches. As shown in our previous work²⁹, the distributions of the local current density at potential regions associated with pitting are potentially uniform (high randomness). This means that, as the observation error tends to decrease with increased sample size³⁹, if only a few samples are considered, the actual underlying distributions are not captured (under-representation). Similarly to what has been done for ML modelling of corrosion inhibitors^{40,41,42,43,44,45}, the creation of structured databases for pitting corrosion is urged^31,37,46.

As explained by Weaver et al. in a 2022 communication on the unsupervised learning of voltammetric data⁴⁷, deviations from the model behaviour can significantly enhance the complexity of the data extraction. Therefore, instead of performing the task by hand, as traditionally done by electrochemists⁴⁷, there is an emerging call for recording data in (semi-) automated ways, including high-throughput screening⁴⁸.

This work elaborates on 5 datasets of log(j) vs. E (PP) curves obtained in a high-throughput fashion with the SECCM on 316 L stainless steel. We provide a methodology for estimating Epass (passive potential) and Epit from: 1. typical log(j) vs. E curves with a straightforward passivity breakdown (using an algorithm based on linear regression (LR)); 2. PP curves with more unique profiles, mainly due to metastable events (using artificial neural networks (ANNs) trained on the LR estimates). The estimated Epit and Epass descriptors of 316 L are included in this article (Dataset 1, .ipynb files) and available to download in a public repository⁴⁹.

Furthermore, as the estimate of the conditional distribution of y given x (log(j) given E) is not always a conditional mean (although this is most common⁵⁰), we also considered the analysis of quantile curves (the conditional median, in particular). The main advantage of conditional quantiles is that they give a more comprehensive analysis of the relationship between E and log(j) at different points in the conditional distribution of log(j)⁵⁰. Therefore, we also propose a simplified methodology for determining the central tendency of the Epit/log(jpit) and Epass/log(jpass) distributions using the conditional median (or mean) of the log(j) vs. E curves. These proxy estimations were compared against the outputs of non-parametric density estimations, considered as the ground truth of the central tendencies of the descriptors. The related code is available (https://github.com/bcoelho-leonardo/Estimating-pitting-descriptors-of-316L-stainless-steel-by-machine-learning-and-statistical-analysis/tree/5c7c8eac41907667f94c22881650f23a6aee0d64), and is expected to serve as a toolkit for future localised corrosion works dealing with big data. The same code can be a basis for extracting meaningful descriptors for other potentiostatic or potentiodynamic experiments important in electrodeposition, electrocatalysis, and other electrochemical processes^51,52.

Many decades ago, Evans⁵³ noted that studying the probability of corrosion is more practically important than determining the exact corrosion rate values. We expect to provide a foundation for the future development of monitoring tools (based on current or potential measurements) capable of predicting stable pitting with secure margins.

This work provides three main contributions: 1. a robust ML-based method for estimating Epit/log(jpit) and Epass/log(jpass) descriptors from individual polarisation curves; 2. an accurate proxy model (the conditional median of log(j)) for estimating the central tendencies of the descriptor distributions for a given dataset; 3. insights into localised corrosion mechanisms gained by interpreting the proxy models and also by selecting a subset of log(j) vs. E examples presenting the highest activities (high outliers).

Results and discussion

Density estimation of passivity and pitting descriptors

In this work, we were initially concerned with the problem of estimating conditional quantiles, as such analysis often results in further insights out of the distributions of our random variable (log(j)|E)⁵⁰.

Figure 1 shows the kernel density estimations of the Epass/log(jpass) and Epit/log(jpit) for the 5 experimental datasets. The quantiles of the log(j) vs. E curves are superimposed in the plots to illustrate the high dispersion of both passivity and pitting descriptors. As a general trend, the high dispersion of the descriptors observed in the log(j) direction seems relatively constant for all sets, while the dispersion in the E direction seems to increase with the testing aggressiveness. The distributions of Epass/log(jpass) and Epit/log(jpit) generally extend as far as the corresponding Qmin and Qmax curves, except where individual outliers are present (such as in Fig. 1c, d). In any case, the distributions of the descriptors clearly spread beyond the so-called interquartile ranges (the IQR is the middle half of a dataset, comprising the range between the first and third quartiles).

**Fig. 1: Kernel density estimation of the bivariate distributions of *Epass*/log(*jpass*) and *Epit*/log(*jpit*).**
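As an illustration of how the maximum of a bivariate kernel density estimate can serve as a central tendency for an (E, log(j)) descriptor pair, the following sketch uses `scipy.stats.gaussian_kde` with its default bandwidth and searches a regular grid for the density peak. The function name, the grid search, and the grid size are assumptions made for this example; the kernel and bandwidth actually used in this work are not specified in this section.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_mode(e_values, logj_values, grid_size=100):
    """Return the (E, log j) grid point with the highest estimated density."""
    data = np.vstack([e_values, logj_values])      # shape (2, n_samples)
    kde = gaussian_kde(data)                       # default (Scott) bandwidth
    e_grid = np.linspace(data[0].min(), data[0].max(), grid_size)
    j_grid = np.linspace(data[1].min(), data[1].max(), grid_size)
    ee, jj = np.meshgrid(e_grid, j_grid)
    density = kde(np.vstack([ee.ravel(), jj.ravel()]))
    best = np.argmax(density)                      # index of the density peak
    return ee.ravel()[best], jj.ravel()[best]
```

A grid search is used for transparency; a finer grid or a local optimiser would refine the location of the peak.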

Central tendency estimations of descriptors based on the mean and median models

While the estimation of descriptors based on the conditional mean of distribution might serve, there exists a large area of problems (outliers detection^54,55, risk assessment⁵⁶) where estimating a quantile, such as the median, would be a better choice⁵⁰.

The same hybrid LR-based ANN approach employed on each individual log(j) vs. E curve was applied to the conditional mean/median curves of the populations, to estimate their central tendency values. Figures 2 and 3 show the Epass/log(jpass) and Epit/log(jpit) values obtained from the mean and median curves for the extreme case datasets (the least and the most aggressive conditions, respectively). The results corresponding to the intermediate conditions are displayed in Supplementary Figs. 1, 2, 3. In plots a of Figs. 2 and 3, the conditional means (with their conditional standard deviations (SD) and errors (SE)) are plotted with the Epass/Epit estimates provided by the mean model, while plots b of Figs. 2 and 3 display the conditional medians (with their conditional median absolute deviations (MAD)) with the outputs of the median model (all estimates are represented by cross markers, in reference to “dart attempts”). In all of these plots, the ground truth for the central tendencies of the descriptors (maximum kernel density estimation (KDE) values) is plotted as a benchmark (represented by circle markers in reference to “target locations”).

**Fig. 2: Central tendency estimation.**

Fig. 3: The central tendency values of *Epass*/log(*jpass*) (in green) and *Epit*/log(*jpit*) (in red) estimated by the: maximum KDE of the descriptors distributions (ground truth), represented as “target” markers; mean and median models, displayed as “dart attempt” markers.
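The conditional curves discussed here can be sketched as column-wise statistics over a set of polarisation curves. The example assumes that all curves have been sampled on a common potential grid and stored as a 2-D array (one row per curve); the function name and the dictionary layout are illustrative, and the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np

def conditional_curves(logj):
    """Column-wise summaries of log(j) given E.

    `logj` has shape (n_curves, n_potentials); column k holds the
    log(j) values of every curve at the k-th potential.
    """
    logj = np.asarray(logj, dtype=float)
    n_curves = logj.shape[0]
    mean = logj.mean(axis=0)
    sd = logj.std(axis=0, ddof=1)                    # conditional standard deviation
    se = sd / np.sqrt(n_curves)                      # conditional standard error
    median = np.median(logj, axis=0)
    mad = np.median(np.abs(logj - median), axis=0)   # median absolute deviation
    return {"mean": mean, "sd": sd, "se": se, "median": median, "mad": mad}
```

With three curves `[[0, 0], [0, 1], [9, 2]]`, the first column gives a mean of 3 but a median of 0, which shows how a single high curve pulls the conditional mean away from the bulk of the data.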

It could be observed that the conditional mean of log(j), as a function of E, was generally a poor representative of a set of polarisation curves (in line with the log(j) distributions scrutinised in ref. ²⁹). As illustrated in Fig. 3a (0.05 M NaCl, 50 mV s^-1), the averaged values do not outline a typical polarisation curve. In this most aggressive condition, the Epass feature could not even be estimated from the mean curve. On the contrary, the conditional median captured the expected overall behaviour of this set of curves more closely (Fig. 3b).

By analysing the Epass/log(jpass) and Epit/log(jpit) ground truth values with respect to the conditional mean, one could see they were relatively far from these curves, mostly lying outside the standard error limits (plot a, Fig. 2; Supplementary Figs. 1, 2, 3). This mismatch between the “target locations” and the mean curves explains the mean model’s failure to provide accurate values for the descriptors (as those are estimated by fitting the conditional mean curves).

On the contrary, estimating the Epass/log(jpass) and Epit/log(jpit) values from the conditional median curve of a set provided an accurate alternative for obtaining their central tendencies. One could observe the median curves generally crossing (or at least touching) the “target” markers (plot b, Fig. 2, Supplementary Fig. 1, Supplementary Fig. 2, Supplementary Fig. 3); the only exception was the Epit in 0.05 M NaCl (50 mV s^-1), which was the most difficult value to appraise out of the 10 descriptors considered (high uncertainty of j values in E regions associated with pitting²⁹).

Moreover, one could see that the data distribution (KDE) extended beyond the standard deviation (by comparing plots a of Fig. 2, Fig. 3, Supplementary Fig. 1, Supplementary Fig. 2, Supplementary Fig. 3 with Fig. 1). The analysis of the quantile curves was illustrative of the high data dispersion, with the descriptors distributions extending as far as the Qmin and Qmax curves in some cases.

Evaluation of the central tendency estimates based on residuals

At 0.05 M (50 mV s^-1), the estimated location of Epit from the conditional median curve lagged behind this descriptor ground truth for this set (Fig. 3b). When evaluating the accuracy of estimation, not only the absolute distance between the estimate and the actual value is relevant, but also the sign of that difference; in other words, the sign of the model bias.

Residual analysis provides a basis for diagnosis checking while assessing model biases. The following bar charts present the residuals of estimation of log(jpass), Epass, log(jpit) and Epit, as a function of testing corrosiveness (Fig. 4). The shorter the bar, the more accurate the model estimation. Again, the ground truth for the central tendencies of the descriptors was the maximum KDE values of their distributions. The horizontal line (residual equal to zero) represents the ideal benchmark, where a regression would be 100% accurate. Results from both the conditional mean and conditional median models are displayed.

Fig. 4: Residuals of the central tendency estimation of the passivity/pitting descriptors (with respect to their ground truth) obtained by the mean model (in blue tone) and the median model (in green) as a function of testing aggressiveness.

When estimating the central tendency of passivity descriptors (Fig. 4a), the residuals of log(jpass) were consistently and significantly smaller for the median-based model than for the mean-based one. With respect to Epass, the median-based approach was also generally more accurate, the only significant exception being at 0.005 M NaCl, although the residual was still relatively small (–0.0067 V). Again, the residuals of the two strategies cannot be compared for the most aggressive scenario (0.05 M NaCl, 50 mV s^-1), as the mean model could not even provide passivity descriptors. In the next most aggressive condition (0.05 M NaCl, 100 mV s^-1), the estimation errors related to log(jpass) and Epass were generally the largest for both models. Nonetheless, even in this case, the median model could reduce the estimation residuals by 54.2% and 73.2%, respectively, compared to the mean model.

Regarding the central tendencies of pitting descriptors (Fig. 4b), the same overall observations made for the passivity features apply. First, the log(jpit) residuals of the median model were systematically lower than those of the mean model (reduction in residuals of 89.5, 77.4, 53.8, 25.8 and 97.1% for increasing testing aggressiveness). Secondly, the median model generally yielded smaller residuals than the mean model for Epit; when not lower, the magnitude of the error was still acceptable (0.0047 V for 0.01 M NaCl (50 mV s^-1)). The only case where the median estimator underperformed was for Epit in 0.05 M NaCl (50 mV s^-1).

As already mentioned, the picture was somewhat less clear in the most aggressive set, likely the most challenging condition for the regression of features. Nevertheless, in the cases where the median model produced larger residuals than the mean model, it is important to note that the estimations were at least negatively biased (positive residuals). When attempting to predict pitting corrosion, as no model is perfectly accurate, a negative bias is preferable to a positive bias: underestimation of Epit (or log(jpit)) is preferable to its overestimation. In other words, if estimation errors are unavoidable to a certain degree, it is more desirable to have stable pitting growth occurring at higher potentials (or current densities) than expected, as the opposite situation would imply catastrophic failure based on overly optimistic predictions.
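The residual bookkeeping used in this comparison can be written out explicitly. The sketch below assumes the sign convention implied by the text (residual = ground truth minus estimate, so a positive residual means the descriptor was underestimated) and computes the percentage by which one model reduces the absolute residual of another. The function names and the numbers in the usage lines are hypothetical and are not taken from Fig. 4.

```python
def residual(ground_truth, estimate):
    """Signed residual; positive means the model underestimated the descriptor."""
    return ground_truth - estimate

def residual_reduction_percent(res_mean_model, res_median_model):
    """Percentage reduction of |residual| achieved by the median model."""
    return 100.0 * (1.0 - abs(res_median_model) / abs(res_mean_model))

# Hypothetical values for illustration only (volts vs. reference).
r_mean = residual(1.10, 0.90)      # mean model estimate
r_median = residual(1.10, 1.05)    # median model estimate
reduction = residual_reduction_percent(r_mean, r_median)
```

A negative value returned by `residual_reduction_percent` would indicate that the median model performed worse than the mean model for that descriptor, and the sign of `r_median` then shows whether the error is on the conservative side.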

The magnitude and sign of the estimations of log(jpass) and log(jpit) can be appraised in Fig. 5a, in which the “passive current density ranges” are defined by the yellow bars. One can judge that the median model (green markers) outperformed the mean model (blue markers) in accurately estimating the central tendency ground truth values for both current density descriptors.

**Fig. 5: Definition of passive ranges.**

Similarly, Fig. 5b presents the estimates of Epass and Epit compared against the corresponding “passivity ranges” (yellow bars). Both models produced relatively tight errors, although the median model resulted in overall better estimations. In general, the residuals of estimation were proportionally lower for the E descriptors than for the log(j) ones (further discussed in the section “Proxy models for estimating the central tendency of pitting descriptors”, including the analysis of coefficients of variation). As mentioned above, when the median model underperformed, at least an underestimation of Epit was verified (“preferable sign of bias” indicated in the plot). If one task is utterly crucial for the models, this would be the estimation of the Epit feature.

In general, larger estimation errors were obtained in the two most aggressive conditions (0.05 M NaCl media) when considering all descriptors in Fig. 5, while the errors related to log(jpass) and log(jpit) in particular increased with the testing corrosiveness (Fig. 5a).

This investigation provides a solid and simplified framework for estimating the central tendency of passivity/pitting descriptors. Instead of individually assessing an entire set of log(j) vs. E curves, estimating Epass/log(jpass) and Epit/log(jpit) from the conditional median curve can provide satisfactory outcomes, assuming that the data size is large enough. In the present case, all estimations of Epass and Epit (either from the conditional median of log(j) of a set or from the individual log(j) vs. E curves) were done using the same hybrid LR/ANN approach for a fair basis of comparison. By doing so, the authors avoided introducing additional sources of bias to the estimations. Nonetheless, other simplified median-based methods could be thought of (an expert could even proceed with “by hand” selection⁴⁷ of Epass/Epit).

Interpreting the higher robustness of the median model

In the case of polarisation curves displaying pitting corrosion, the conditional median of log(j) was qualitatively shown to be representative of the population of curves. As exemplified in Fig. 6, the location of the conditional median curve (plot b, green curve) largely coincided with the regions with the highest data density in the corresponding log(j) vs. E plot (plot a). Figure 6 illustrates the effect of high outliers (log(j) vs. E curves lying more than 1.5 times the IQR above Q3) on the conditional mean of log(j). The result is the shift of the conditional mean to log(j) values consistently higher than the conditional median curve.

Fig. 6: Schematic of the data distribution from the 0.05 M NaCl (100 mV s^-1) set illustrating how high outliers shift the conditional mean of log(j) from the regions with the highest data density.

The difference between the conditional mean and the conditional median tends to increase with E, because the high outliers present particularly high log(j) values at high potential regions (E > 1.15 V (vs. Ag/AgCl)). As demonstrated in ref. ²⁹, the log(j) distributions become more positively skewed with increased corrosiveness (more positive potential and higher [Cl^-]). The occurrence of high outliers with particularly high j values at high E regions results from pitting corrosion processes. Indeed, applying more positive potentials increases the likelihood of metastable pitting (accompanied by repassivation events), which may gradually change into stable pitting growth.
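The 1.5 × IQR rule for high outliers and its effect on the gap between conditional mean and conditional median can be sketched as follows. The per-curve score used to rank curves (for example the mean of log(j) over the scan) and both function names are assumptions made for illustration; the exact criterion used to classify whole curves as outliers is not detailed in this section.

```python
import numpy as np

def high_outlier_mask(curve_scores):
    """Flag scores lying more than 1.5 * IQR above the third quartile."""
    scores = np.asarray(curve_scores, dtype=float)
    q1, q3 = np.percentile(scores, [25, 75])
    return scores > q3 + 1.5 * (q3 - q1)

def mean_median_gap(logj):
    """Conditional mean minus conditional median at each potential.

    `logj` has shape (n_curves, n_potentials).
    """
    logj = np.asarray(logj, dtype=float)
    return logj.mean(axis=0) - np.median(logj, axis=0)
```

For a curve array `logj`, calling `mean_median_gap(logj)` and then `mean_median_gap(logj[~high_outlier_mask(logj.mean(axis=1))])` would show how much of the gap is carried by the flagged curves.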

As stated by Koenker⁵⁰, assessing a set of conditional quantile curves provides a more informative description of the relationship among variables, especially in cases of: 1. non-constant variance; 2. non-normality of the noise distribution. The described picture illustrates well the datasets in question, in which: 1. the log(j) distributions are heteroscedastic (increased conditional variance as a function of E and testing aggressiveness)²⁹; 2. the anomalous high j values at high potentials, observed for the high outliers (Fig. 6), could be seen as “noise”, as they positively skewed conditional means that would otherwise (in the absence of pitting) be normally distributed; as expected for electrochemical descriptors derived from PP curves of passive systems^57,58. If the pitting activity could be considered as “noise”, it would undoubtedly implicate the referred “non-normality of the noise distribution”, as most of the population (grey curves in Fig. 6b) would have relatively low and similar noise levels with only a few examples (the high outliers) displaying significantly higher levels of noise.

To illustrate the positive skewness of log(j) beyond their conditional distributions, Fig. 7 displays the histograms of the log(jpass) and log(jpit) descriptors for the 5 datasets. From Fig. 7, it is confirmed that the medians of the log(jpass) and log(jpit) descriptors are more representative of the underlying distributions than the respective means, as the former were closer to the ground truth central tendencies of the distributions (maximum KDE values). Similar to what was observed for the conditional log(j) distributions (Fig. 6), the means of the descriptors were generally higher than the corresponding medians due to the presence of high outliers (Fig. 7). The only exception (median larger than mean) was at 0.05 M NaCl (100 mV s^-1) (Fig. 7d), where low outliers were more prominent than the high outliers (the lowest conditional Qmin curve was computed for this dataset, Fig. 1d).

**Fig. 7: Histograms of the estimated log(*jpass*) and log(*jpit*) descriptors for different testing aggressiveness.**

These statistical analyses further explain why quantile analysis (the conditional median of log(j), in particular) provided a robust model for a simplified estimation of passivity/pitting descriptors. In future investigations on predictive ML, instead of traditional least squares regression, quantile (or robust) regression^59,60 might be a promising route for approaching pitting corrosion^50,61. As a perspective, the analysis of the quantile curves (Fig. 1) might also help locate data clusters (preliminarily defined before the application of the rule-based algorithm).

Effect of corrosiveness on the pitting susceptibility

Comparison of the central tendency values of the log(j) descriptors did not indicate a clear trend with increased testing aggressiveness (Fig. 7). On the contrary, by comparing the distributions of Epit (and Epass) (Fig. 8), a few tendencies as a function of corrosiveness could be appraised. First, as previously determined for the conditional log(j)²⁹, the higher the aggressiveness, the more spread the distributions of the E descriptors tend to be (clear trend in Fig. 8, from b to e). Secondly, all these distributions were continuous and roughly unimodal, with multimodality increasing with corrosiveness and ultimately leading to a uniform function-like behaviour at 0.05 M NaCl, 50 mV s^-1 (Fig. 8e). Despite Shibata and Takeyama’s conclusion that the random variation of the Epit of stainless steels (in macro-scale polarisation) obeys a normal distribution^6,19, none of the unimodal distributions obtained could be formally described as normal, even with the removal of outliers (for illustration purposes, the normal curves plotted in Fig. 8 were fitted to data without outliers). Even when considering the highest p-values obtained (from D’Agostino and Pearson’s test), these values were consistently equal to or lower than the significance level: 0.00 and 0.03 (0.05 M NaCl, 50 mV s^-1), 0.05 and 0.00 (0.05 M NaCl, 100 mV s^-1), 0.00 and 0.00 (0.01 M NaCl, 50 mV s^-1), 0.00 and 0.00 (0.01 M NaCl, 100 mV s^-1), 0.01 and 0.00 (0.005 M NaCl, 100 mV s^-1). In summary, the null hypothesis “the data is normally distributed” was consistently rejected for all sets based on the various normality tests employed.

**Fig. 8: Histograms of the estimated *Epass* and *Epit* descriptors for different testing aggressiveness.**
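A normality check of the kind described above can be sketched with `scipy.stats.normaltest`, which implements the D’Agostino and Pearson test. The wrapper name, the default significance level of 0.05, and the rule of rejecting when the p-value is equal to or lower than that level are assumptions chosen to mirror the wording of the text.

```python
import numpy as np
from scipy.stats import normaltest

def normality_rejected(values, alpha=0.05):
    """Return (p_value, rejected) for the D'Agostino-Pearson normality test.

    `rejected` is True when the p-value is at or below `alpha`,
    i.e. the null hypothesis of normally distributed data is rejected.
    """
    statistic, p_value = normaltest(np.asarray(values, dtype=float))
    return float(p_value), bool(p_value <= alpha)
```

The test combines skewness and kurtosis, so a strongly skewed sample of descriptor values would be expected to return `rejected=True`.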

Most importantly, the Epass distribution generally presented a linear increase as a function of the considered testing corrosiveness, while the Epit distribution displayed a relatively constant behaviour in the least aggressive conditions (Fig. 8a–c), with a pronounced decrease in the most aggressive scenario (Fig. 8e). These trends could be appraised by following the evolution of the descriptors’ maximum KDE, as indicated by the two arrows in Fig. 8. It should be noted that the 0.05 M NaCl set (100 mV s^-1) (Fig. 8d) again behaved as a group outlier (for the same reason as elaborated above for the log(j) descriptors, Fig. 7d).

As expected, the combined effect of the Epass and Epit progression with increased corrosiveness resulted in an overall decrease in the passivity range. Analysis of Fig. 8 also reveals that this passivity range shortening was generally more affected by the increase in Epass than by the decrease in Epit. Although corrosionists often pay more attention to the upper end of the passivity range, framing Epit as the main predictor of passivity breakdown, our data-driven analysis suggests that the robustness against pitting (related to the passivity range^1,62) would be highly sensitive to the lower end of passivity (Epass). That is, the trend of passive range shortening seemed to be primarily influenced by the delayed stabilisation of the passive film rather than its early disruption. For instance, based on XPS measurements, Cl^- was reported to cause thinning of the Fe passive film, even under conditions where pitting did not occur (passivity)⁶³. Based on the film-breaking mechanism of pit initiation, the thin passive film is in a continual state of breakdown and repair^64,65; and in chloride media, there would be a lower likelihood for such a breakdown to heal (inhibition of repassivation by chloride)⁶.

In Shibata’s stochastic theory of pitting corrosion¹⁹, the coefficient of variation (CV) of Epit was calculated for the polished 316 stainless steel dataset, and a value of 9.6% was obtained. In our case, the following CV values were obtained for the Epit distributions, from the least to the most aggressive conditions: 4.3, 1.8, 11.1, 11.4 and 32.7%. When the uncertain Epit values assigned as 0.5 V by default were removed, the CVs were even lower: 2.6, 1.8, 3.0, 7.0, and 7.9% (highest variation obtained at 0.05 M NaCl, 50 mV s^-1, as expected). Interestingly, these CV values were lower than the 9.6% reported for 316 under classic PP¹⁹, which is somewhat surprising, as our Epit values, derived from micro-scale PP measurements, are more sensitive to local surface heterogeneities. The reasons for the overall low relative variability of our sets as compared to the referred benchmark¹⁹ might be related to: the 316 L grade being more resistant to corrosion than the 316; the expected higher surface quality of our electropolishing in comparison to (2/0) emery polishing; our Cl^- media being at least 1 order of magnitude less aggressive than their 3.5% NaCl solution; and, most importantly, our large number of samples, generally over one hundred per set (in ref. 19, estimated to be only ~20). As a higher CV might indicate a greater degree of uncertainty in the shape of the underlying distribution, our sets arguably provide more representative distributions of Epit as compared to Shibata’s set¹⁹ (most likely resulting from the referred discrepancy in the sample sizes).

In any case, the CVs calculated for the log(jpit) without considering the 0.5 V data points (10.0, 8.1, 10.3, 21.8 and 17.1%) were much larger than the corresponding values determined for Epit (3.8, 4.5, 3.4, 3.1 and 2.2 times larger, respectively), as a function of aggressiveness. Since the CV is a statistical measure that represents the relative variability in a set, this quantitative outcome illustrates why the comparison of the distributions across the sets was less straightforward for log(jpit) (Fig. 7) than for Epit (Fig. 8), and also why the E descriptors were generally estimated more accurately than the log(j) descriptors (Figs. 4 and 5).
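The coefficient of variation and the ratios quoted above can be reproduced with a few lines. The sketch assumes the CV is the sample standard deviation divided by the absolute mean, expressed in percent (the absolute value is used so that negative means, as for log(j), still yield a positive CV); the function name is illustrative. The two lists are the CV values reported in the text for the sets without the 0.5 V default points.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV in percent: sample standard deviation over the absolute mean."""
    return 100.0 * stdev(values) / abs(mean(values))

# CVs (%) reported in the text, from the least to the most aggressive set.
cv_log_jpit = [10.0, 8.1, 10.3, 21.8, 17.1]
cv_epit = [2.6, 1.8, 3.0, 7.0, 7.9]

# How many times larger the log(jpit) CV is than the Epit CV, per set.
ratios = [round(a / b, 1) for a, b in zip(cv_log_jpit, cv_epit)]
# ratios == [3.8, 4.5, 3.4, 3.1, 2.2]
```

The computed ratios agree with the factors stated in the paragraph.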

To further assess the different susceptibility to pitting among the sets, as suggested in Fig. 8, we extended the comparative analysis to a selection of the largest log(jpit) values achieved (Fig. 9), in line with the “weakest link” theory applied to pitting corrosion¹⁹.

**Fig. 9: The top 4 highest log(j) vs. E curves for each dataset superimposed with their *Epit*/log(*jpit*) estimates.**

The “weakest link” concept has allowed advances in the statistical strength theory^66,67, explaining the high randomness observed in fracture stress resulting from flaws with varying dimensions in a solid material⁶⁸. The stochastic approach developed for the effect of the body volume on fracture stress⁶⁹, and often applied to describe sensitive structure properties such as fatigue life⁶⁸, was generalised to pitting corrosion^69,70. Concerning failure by pitting, the presence of a precursor or active state in the film is responsible for pit generation¹⁹. Extreme value analysis developed by Gumbel⁷¹ was applied to pitting corrosion of Al by Aziz⁷² and Eldredge⁷³.

Likewise, the “weakest link” concept applies to other electrochemical fields beyond pitting corrosion. For instance, in electrodeposition, nucleation of a new phase on a foreign substrate is primarily driven by the most active sites, similar to pit nucleation. In both cases, the growth of a film (or pit) is driven by the most active sites where nucleation (or initiation) preferably takes place. Recently, we demonstrated that the macroscopic electrodeposition response, described by the onset potential for nucleation, corresponds to that of the most active sites (more positive onset potentials) of a distribution of hundreds of voltammetric curves obtained by SECCM^51,52.

Hence, in our “weakest link” problem, one could expect the most active sites to ultimately determine the overall macroscopic electrochemical response in the pitting corrosion of stainless steel. Attempting to trace the most active pits, the top 4 highest log(j) vs. E examples of each set were plotted with their corresponding Epit/log(jpit) pair of estimates (Fig. 9). It could be observed that the higher the testing aggressiveness, (1) the higher the “top 4 highest log(j) vs. E curves” (ranking based on the mean of log(j)), and (2) the lower the Epit and the higher the log(jpit). For a few particularly active log(j) vs. E curves, the Epit could not be determined due to a large uncertainty or the absence of a passivity breakdown (Epit modelled as 0.5 V by default). The same tendency was observed by examining the “top 4 conditional jmax curves as a function of E”, as presented in Supplementary Fig. 4. To conclude, by selecting only high j examples, we corroborate the notion (seen as tacit knowledge in macro-scale polarisation) that Epit substantially drops with corrosiveness.

As the range of applicability of the models is somewhat restricted to the limits of the training data, modelling based on local techniques is hardly generalisable. Additional log(j) vs. E datasets obtained under varied conditions (macro experiments, different substrates, alternative PP parameters, etc.) would be recommended for evaluating the robustness of the developed modelling methodology. Particular efforts should be focused on validating the approach extended to classic potentiodynamic polarisation curves. Our estimation modelling strategy is expected to perform well on the macroscale, where less variability is anticipated. In fact, as only the most active pitting sites (Fig. 9) would drive the resulting overall corrosion behaviour of a macro surface, they would likely dominate the intensity of the electrochemical signal measured, thus possibly resulting in less data dispersion. Whether the conditional median of log(j), as a proxy model for estimating the central tendencies of pitting descriptors from a population of micro-scale polarisation curves, would also be accurate in macro-scale experiments warrants further investigation.

In conclusion, our hybrid rule-based/ML approach, combining an LR-based algorithm with supervised ANN, was able to determine relevant pitting corrosion descriptors of electropolished 316 L from populations of localised polarisation curves with different testing aggressiveness. The rule-based LR provided initial estimates of Epit (or Epass) descriptors by fitting two independent linear regression lines to smoothed polarisation curves. However, unsatisfactory results were observed for some sets, indicating the need for an improved estimation strategy. To address this limitation, we leveraged supervised deep learning: the ANN was trained on sets with satisfactory estimates and then deployed on the sets with unsatisfactory estimates, significantly improving the estimation task. The ANN model was designed with feature engineering methods for selecting input features representative of the researched behaviour (passivity or pitting). To ensure the network’s ability to handle the complexity of the data, a data reduction step was performed, reducing the curves to a selection of log(j) values linearly spaced in terms of their “potential distances”. The training process involved hyperparameter tuning and pruning to achieve satisfactory validation performance. The resulting ANN demonstrated accurate mapping of the relationships in the data, with final MSE values of the order of 10^-4 and R² values of 0.90–0.97, supporting its generalisation ability on the unseen (unlabelled) data (our method was somewhat similar to active learning). Throughout this study, ML played a pivotal role in addressing the challenges of estimating pitting-related predictors and interpreting high-throughput data for improved mechanistic understanding. Looking ahead, the potential of ML as a framework for pitting prediction is promising. The automated extraction developed for pitting descriptors could be extended to larger datasets and diverse experimental conditions, seeking improved robustness and generalisability of the models.

Methods

The employed substrate and the data acquisition methodology were identical to the ones described in ref. ²⁹. Briefly, an industrial electropolished 316 L stainless steel sample was subjected to potentiodynamic polarisation (PP) tests using an SECCM platform in hopping-mode protocol^27,28. Five different combinations of [NaCl] and voltammetric scan rates were employed: 0.005 M NaCl at 100 mV s^-1, 0.01 M NaCl at 100 mV s^-1, 0.01 M NaCl at 50 mV s^-1, 0.05 M NaCl at 100 mV s^-1 and 0.05 M NaCl at 50 mV s^-1. Single-barrel pipets (borosilicate) with a final internal circular diameter of ~2 µm were used as SECCM probes. The starting potential was –0.5 V, and the end anodic potential was 1.355 V (vs. Ag/AgCl). The measurements used a LabVIEW (2019, National Instruments) interface running Warwick’s software (WEC-SPM, www.Warwick.ac.UK/electrochemistry).

Data analysis

The code for data processing and visualisation was written in Python 3.8 and was made available on GitHub (as Jupyter Notebook files): https://github.com/bcoelho-leonardo/Estimating-pitting-descriptors-of-316L-stainless-steel-by-machine-learning-and-statistical-analysis/tree/5c7c8eac41907667f94c22881650f23a6aee0d64.

The log(j) vs. E datasets: this work employed the same datasets reported in ref. ²⁹. The only difference is that any missing values were filled with an iterative imputer. The IterativeImputer class (from sklearn.impute) models each feature with missing values as a function of other features and uses that estimate for imputation. All 955 data samples (polarisation curves) with the referred update (filled missing values) are accessible at Mendeley Data⁷⁴.

As in ref. ²⁹, the datasets considered were sliced upward from 0.5 V (considerably more positive than the open circuit potential (OCP)). For only a few curves, passivity was presumably reached at potentials less positive than 0.5 V, while in a few other cases, passivity was not observed within this potential range. In these cases, as an approximation, the 0.5 V value was assigned to Epass and, where applicable, to Epit (early passivity and pitting occurrences, respectively). The self-passivation behaviour of 316 L (relative passive state already at OCP) was not considered.

Supervised hybrid rule-based/machine-learning algorithm

A deterministic rule-based algorithm, based on linear regression (LR), was developed to estimate Epit/jpit (or Epass/jpass) descriptor pairs (continuous values) from polarisation curves. Figure 10 schematically shows the overall hybrid rule-based/ML approach, including the LR steps. As illustrated in the plots with borders (in green and grey), two independent linear regression lines are fitted to the smoothed data, one starting at “low E” and the other ending at “high E” values (~0.7 and ~1.25 V, respectively). The obtained location for Epit (or Epass) is the one that maximises the sum of the R² (goodness of fit) for the two LRs.
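The breakpoint search described above can be sketched as follows. This is a minimal illustration on a synthetic, noise-free curve, not the published notebook code: the function names, the `min_points` guard and the synthetic slopes are assumptions, and the class-specific potential thresholds of the actual algorithm are not reproduced.

```python
def r_squared(x, y):
    """R² of an ordinary least-squares line fitted to (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # degenerate segment: R² undefined, treated as uninformative
    return sxy ** 2 / (sxx * syy)  # equals the squared Pearson correlation

def two_segment_breakpoint(e, log_j, min_points=5):
    """Scan all split positions and keep the one maximising R²(low E) + R²(high E)."""
    best_k, best_score = None, float("-inf")
    for k in range(min_points, len(e) - min_points + 1):
        score = r_squared(e[:k], log_j[:k]) + r_squared(e[k:], log_j[k:])
        if score > best_score:
            best_k, best_score = k, score
    # The estimate is the first point of the "high E" segment.
    return e[best_k], log_j[best_k], best_score

# Synthetic curve (hypothetical values): shallow passive branch from 0.70 V,
# then a steep rise after a knee placed at 1.00 V, ending at 1.25 V.
e = [round(0.70 + 0.01 * i, 2) for i in range(56)]
knee = 1.00
log_j = [
    -1.0 + 0.2 * (v - 0.70) if v <= knee else -0.94 + 6.0 * (v - knee)
    for v in e
]

e_pit, log_j_pit, score = two_segment_breakpoint(e, log_j)
print(e_pit, round(score, 3))
```

On real, noisy curves the two R² values are below 1 and the maximum is less sharp, which is where a visual validation step of the kind described in the text becomes useful.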

**Fig. 10: Schematic of the supervised hybrid rule-based/machine-learning approach employed for estimating *Epit* (or *Epass*).**

This hand-crafted method (rules manually created by domain experts to define the system’s behaviour) is adaptive, requiring user input to define threshold values for the potential (thereby separating the “low E” and “high E” regions to which the LRs are separately applied). The different E thresholds define the existing groups (classes) of log(j) vs. E curves. The definition and validation of classes were qualitatively based on the degree of similarity among the PP curves⁷. Therefore, the developed code is dataset-specific, with specific condition statements for the different classes.

In summary, the described method was our initial labelling strategy, providing labels (target attributes) to the unlabelled data. The label validation was done by visual examination of the Epit/log(jpit) (or Epass/log(jpass)) in the individual curves. Although the LR-based model generally led to satisfactory estimates, unsatisfactory results were occasionally observed for the 0.05 M NaCl (100 mV s^-1) and 0.005 M NaCl (100 mV s^-1) sets (exemplified in the plot with grey border (Fig. 10)).

In the case of unsatisfactory estimates, instead of further hand-crafting the algorithm to improve the performance of the linear regressors, the strategy was to employ supervised ANN for this task. The ANN was trained on the set of satisfactory estimates and then deployed on the set of unsatisfactory examples (schematic plots with green and grey borders, respectively, in Fig. 10). Contrary to standard practice where a fixed proportion of the data (e.g., 20%) is randomly selected for testing, our test sets comprised specifically challenging samples. This “stress-testing” approach offered a more stringent test of the models’ robustness and generalisation ability, as the predictions were made on samples where the simpler model presented failed estimations. As a result, the proportion of samples in our test sets relative to the entire datasets varied (10%, 17%, and 3%, depending on the set).

The selection of input features was based on feature engineering, aiming to identify relevant features representative of the descriptor of interest, thus allowing the ANN to estimate the data targets accurately. To focus specifically on the relevant regions of the PP curves that encompass the passivity/pitting descriptors, the log(j) vs. E (smoothed) curves were partitioned through a data slicing procedure (Supplementary Fig. 5). The sliced curves presented 700 or 500 data points (0.539 or 0.385 V, in potential range), capturing the regions related to Epass or Epit with ample safety margins. Given the complexity that an ANN would face in processing thousands of data points as features, a second data reduction step was undertaken to further reduce the dimension of the log(j) input array. Sparse sampling was conducted at every 40th (or 60th) point from the sliced log(j) array, leading to a final selection of 13 (or 12) log(j) values (for Epit or Epass, respectively). These numbers of input features were found to represent the target regions of the PP curves adequately.
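The feature counts follow from the slice lengths and sampling steps. The sketch below checks this arithmetic on placeholder arrays; the pairing of the 500-point slice with every 40th point (Epit) and of the 700-point slice with every 60th point (Epass) is inferred from the reported counts of 13 and 12, and `sparse_sample` is a hypothetical helper name.

```python
def sparse_sample(values, step):
    """Keep every `step`-th value, starting from the first one."""
    return values[::step]

# Placeholder arrays standing in for sliced, smoothed log(j) curves.
sliced_epit = list(range(500))   # 500-point slice (0.385 V wide)
sliced_epass = list(range(700))  # 700-point slice (0.539 V wide)

features_epit = sparse_sample(sliced_epit, 40)
features_epass = sparse_sample(sliced_epass, 60)
print(len(features_epit), len(features_epass))  # 13 12

# Potential increment per data point implied by each slice (V).
step_epit = round(0.385 / 500, 5)
step_epass = round(0.539 / 700, 5)
print(step_epit, step_epass)  # 0.00077 0.00077
```

Under these assumptions, neighbouring input features are about 0.031 V apart for Epit and about 0.046 V apart for Epass.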

Reducing the curves to a selection of log(j) values that are linearly spaced in terms of their “E (V) stamps” was sufficient to describe the relevant regions in the curves. This is demonstrated by the 13 “descriptor blue dots” in the “1. neural network training” plot (Fig. 10), which represent the selected log(j) input features (equidistant on the potential (V) scale). The potential was treated as a constant feature, and related values were not included as input. Finally, to improve the model convergence, we applied the StandardScaler method (sklearn.preprocessing package) to the sliced log(j) data for standardisation of both the input and output data. This method standardises features by removing the mean and scaling to unit variance.
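For illustration, the standardisation step can be written out with the standard library alone. The sketch below applies the same formula as StandardScaler (mean removal and division by the population standard deviation, with a constant feature left unscaled); the function names and the example values are hypothetical.

```python
from statistics import fmean, pstdev

def standardise(column):
    """Return (z-scores, mean, std) for one feature or target column."""
    mu = fmean(column)
    sigma = pstdev(column)  # population standard deviation (ddof = 0)
    if sigma == 0:
        sigma = 1.0  # constant column: only centre it
    return [(v - mu) / sigma for v in column], mu, sigma

def inverse_standardise(z, mu, sigma):
    """Map standardised values back to the original scale."""
    return [v * sigma + mu for v in z]

# Hypothetical Epit labels (V) used as an output column.
e_pit_labels = [0.90, 1.00, 1.10, 1.20]
z, mu, sigma = standardise(e_pit_labels)
recovered = inverse_standardise(z, mu, sigma)
print([round(v, 2) for v in recovered])  # [0.9, 1.0, 1.1, 1.2]
```

Because the output is standardised as well, network predictions have to be passed through the inverse transform before they can be read as potentials.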

A sequential model (keras.models.Sequential) was defined, generating a classic multi-layer perceptron network (also known as a feedforward neural network). As shown in Fig. 10, the number of nodes in the input layer was equal to the number of input descriptors (12 or 13 log(j) values). The output layer consisted of a single node, providing only one output (either Epit or Epass). Given that log(j) is a function of E, the Epit and Epass estimates sufficed for finding the corresponding log(jpit) and log(jpass) values. The network’s topology, including the optimal number of hidden layers and nodes within each layer, was determined through exploratory testing and visual validation. Specifically, it was found that two hidden layers were sufficient to accurately map the relationships in the data, with the optimal number of nodes in the first and second hidden layers being 12 and 11, respectively. Attempts to reduce the number of nodes in these hidden layers led to an unsatisfactory generalisation of the learning process (increasing the number of nodes beyond 12 would deviate from the logical progression of reducing the node count when approaching the output layer).
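To make the topology concrete, the sketch below builds an untrained network of the stated shape in plain Python and counts its trainable parameters. It assumes that the inputs feed directly into the first hidden layer, that ReLU is applied to the hidden layers only, and that the output node is linear; the random weights and all function names are placeholders, not the trained Keras model.

```python
import random

def dense_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def init_layers(layer_sizes, seed=0):
    """Create (weights, biases) per layer with small random placeholder weights."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def forward(layers, x):
    """Forward pass: ReLU on hidden layers, linear single-node output."""
    for idx, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x[0]

sizes_epit = [13, 12, 11, 1]   # 13 log(j) inputs, two hidden layers, one output
sizes_epass = [12, 12, 11, 1]  # 12 log(j) inputs

print(dense_param_count(sizes_epit), dense_param_count(sizes_epass))  # 323 311

prediction = forward(init_layers(sizes_epit), [0.0] * 13)
```

If the input layer were instead implemented as an additional dense layer, the parameter counts would be larger than the ones printed here.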

The training started with 20-fold cross-validation (KFold function from sklearn.model_selection), allowing hyperparameter tuning by monitoring the loss function of the validation set. The loss function was the mean squared error (MSE), minimised with the Adam optimiser. The ReLU activation was used in the input/hidden layers. The number of batches was equal to the number of training samples (110 and 278 for the 0.05 M and 0.005 M NaCl (100 mV s^-1) sets, respectively). Although the validation losses after 200 epochs of training were generally fairly low (below 0.0001 (µA cm^-2)²), the losses reached relatively higher values (~0.025 (µA cm^-2)²) for a few validation sets. After this initial training, the network was pruned (using the tensorflow_model_optimization module) and further validated by random sampling (validation_split=0.1) of the labelled dataset. The number of epochs was increased to 4500–6000 for the pruned network, and the learning rates were decreased from 10^-3 (default) to 10^-12–10^-5. After achieving satisfactory validation performance, the final stage consisted of retraining the model with the entire labelled dataset. The final MSE (Eq. 1) values were 9.87 × 10^-5, 1.35 × 10^-4 and 1.72 × 10^-4 (µA cm^-2)², and the final R² (Eq. 2) values were 0.9025, 0.9707 and 0.9653, respectively, for 0.005 M NaCl 100 mV s^-1 (Epit) and 0.05 M NaCl 100 mV s^-1 (Epass and Epit).
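As an indication of how small the validation folds are in a 20-fold scheme, the sketch below splits sample indices using the same sizing rule as an unshuffled KFold (the first `n_samples % n_splits` folds receive one extra sample). It assumes that the 110 and 278 labelled curves quoted above are the sets being split and that no shuffling is applied; neither detail is stated in the text, and `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, validation) index lists for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

for n_samples in (110, 278):
    sizes = [len(val) for _, val in kfold_indices(n_samples, 20)]
    print(n_samples, min(sizes), max(sizes))
# 110 5 6
# 278 13 14
```

With only 5 to 14 curves per validation fold, a single atypical curve can dominate a fold's loss, which is consistent with a few folds showing higher validation losses than the rest.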

$$\mathrm{MSE}=\frac{1}{N}\sum_{i=1}^{N}{\left({y}_{i}-{\hat{y}}_{i}\right)}^{2}$$

(1)

$${R}^{2}=1-\frac{\sum_{i=1}^{N}{\left({y}_{i}-{\hat{y}}_{i}\right)}^{2}}{\sum_{i=1}^{N}{\left({y}_{i}-\bar{y}\right)}^{2}}$$

(2)
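Equations 1 and 2 translate directly into code. The sketch below is a plain-Python rendering with hypothetical target and prediction values, where the mean of the actual values plays the role of the reference in the denominator of R²; it is not taken from the published notebooks.

```python
def mse(y_true, y_pred):
    """Mean squared error (Eq. 1)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination (Eq. 2)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical targets and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(round(mse(y_true, y_pred), 3), round(r2(y_true, y_pred), 2))  # 0.025 0.98
```

Note that the MSE depends on the scale of the targets, so values computed on standardised outputs are not directly comparable with values computed on the original scale.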

As with the LR-based estimates, the validation of the ANN estimates was conducted through visual examination of the obtained Epit/log(jpit) (or Epass/log(jpass)) in the individual log(j) vs. E curves. Again, the cost of pre-labelling the deployed set would be prohibitively high, implying further hard coding of the rule-based model. It is important to note that if the LR-based approach alone could solve our estimation problem, resorting to ML would not be necessary. It was precisely because the labels obtained were unsatisfactory that the rule-based algorithm was leveraged with ANN. In summary, we employed a supervised learning strategy on unlabelled targets⁷⁵. Our ANN strategy, in particular, is somewhat similar to a transductive transfer learning framework, where labelled data is only available for the source domain but not for the target domain^75,76. In our case, the unlabelled sets from the target domain served as our test sets, allowing us to evaluate the models’ generalisation ability on the unseen data in the target domain.

An alternative strategy based on active learning could be envisaged: when the LR estimation is uncertain, a human expert could proceed with labelling. This process could be repeated iteratively, selecting the most informative instances for labelling based on the model’s uncertainty, thus improving the accuracy of subsequent ML modelling^77,78. As a perspective, convolutional neural networks (CNNs) may be a promising alternative for feature extraction from similar univariate electrochemical signals, given their ability to capture local patterns and spatial hierarchies in data^79,80.

Ground truth for the central tendencies of passivity and pitting descriptors

As explained above, a hybrid rule-based/ML supervised approach was used to estimate passivity/pitting descriptors from the populations of log(j) vs. E curves from the 5 different sets (Fig. 11a and b). Next, the bivariate distributions of the Epass/log(jpass) (or Epit/log(jpit)) estimates obtained for each set were modelled with Gaussian KDE (kde.gaussian_kde, scipy.stats module) and qualitatively validated (Fig. 11c). The maximum KDE values of the distributions were considered the ground truth of the central tendency values of Epass/log(jpass) and Epit/log(jpit) (Fig. 11d).
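The idea of taking the density maximum as the central tendency can be illustrated with a simplified estimator. The sketch below is a stand-in for scipy's gaussian_kde, not a reimplementation of it: it uses a product of two one-dimensional Gaussian kernels with Scott's factor applied per axis (ignoring the covariance between E and log(j)) and returns the sample point of highest density rather than searching a grid. The function name and the data points are hypothetical.

```python
import math
from statistics import stdev

def kde_mode(points):
    """Return the sample point with the highest product-kernel Gaussian density."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    factor = n ** (-1.0 / 6.0)  # Scott's rule, n**(-1/(d+4)) with d = 2
    hx, hy = factor * stdev(xs), factor * stdev(ys)

    def density(px, py):
        total = 0.0
        for qx, qy in points:
            total += math.exp(-0.5 * (((px - qx) / hx) ** 2
                                      + ((py - qy) / hy) ** 2))
        return total / (n * 2.0 * math.pi * hx * hy)

    return max(points, key=lambda p: density(p[0], p[1]))

# Hypothetical (E, log(j)) estimates: a tight cluster plus two outliers.
estimates = [
    (1.00, 0.50), (1.01, 0.52), (0.99, 0.49),
    (1.02, 0.51), (0.98, 0.50), (1.00, 0.48),
    (0.50, 2.00), (0.55, 1.90),
]
mode_e, mode_log_j = kde_mode(estimates)
print(mode_e, mode_log_j)
```

Unlike the mean of the same points, the density maximum stays inside the main cluster when a few estimates lie far from it, which is the property that makes it a convenient reference value.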

**Fig. 11: Schematic of the two modelling strategies for estimating the central tendencies of the pitting descriptors.**

Epass and Epit distributions: normality tests

We applied three normality tests to assess whether the E descriptors followed a normal distribution: Shapiro-Wilk, D’Agostino and Pearson’s test, and Anderson-Darling. We used the shapiro, normaltest and anderson functions from the scipy.stats package to perform these tests. The null hypothesis for all tests was that “the E descriptors followed a normal distribution”. The decision whether or not to reject the null hypothesis was based on comparing the obtained p-values with the significance level (alpha = 0.05 by default) or, for Anderson-Darling, the test statistic with the critical value at that significance level.

Proxy models for estimating the central tendency of pitting descriptors

Two statistical estimation strategies were tested to verify whether the central tendency values of the Epass/log(jpass) and Epit/log(jpit) distributions could be estimated in a reduced manner. These simplified approaches were either mean-based or median-based (illustrated with the median in Fig. 11e). Our research question was whether the conditional mean (or conditional median) of log(j), as a function of E, could be used as a proxy model for accurate estimation of the central tendencies of the pitting descriptors. To test such hypotheses, the conditional mean and conditional median of log(j) were used as input features in the ANN (Fig. 10) to obtain Epass (or Epit) outputs (Fig. 11f). The biases of the trained ANN were transmitted to the median/mean-based estimations, thus establishing a fair basis of comparison with the ground truth data (also derived from the ANN estimates from individual curves).

Quantiles are data values that divide a dataset into adjacent intervals containing the same number of data samples⁸¹. Quantiles display variation in population samples without making assumptions about the underlying distribution. They are useful for gaining insight into the distribution of a random variable relative to its mean value. The conditional quantiles 0.35, 25, 50, 75 and 99.65% (referred to as Qmin, Q1, median, Q3 and Qmax) were used for representing the log(j) distributions as well as for estimation purposes (median).
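A conditional quantile curve is obtained by evaluating the quantile across all curves at each potential. The sketch below does this for a small set of hypothetical log(j) curves sampled on a common potential grid; the linear-interpolation percentile and the function names are assumptions, as the text does not specify how the quantiles were computed.

```python
from statistics import fmean, median

def percentile(values, q):
    """q-th percentile (0 to 100) by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def conditional_curve(curves, statistic):
    """Apply `statistic` across curves at each potential (column-wise)."""
    return [statistic(column) for column in zip(*curves)]

# Hypothetical log(j) curves on a shared 3-point potential grid;
# the last curve is a high outlier.
curves = [
    [-1.0, -0.9, -0.8],
    [-1.1, -1.0, -0.9],
    [-0.9, -0.8, -0.7],
    [-1.0, -0.9, -0.8],
    [1.0, 1.5, 2.0],
]

mean_curve = conditional_curve(curves, fmean)
median_curve = conditional_curve(curves, median)
q3_curve = conditional_curve(curves, lambda col: percentile(col, 75))
print(median_curve)  # [-1.0, -0.9, -0.8]
```

In this toy example the single outlying curve shifts the conditional mean well above the bulk of the data, while the conditional median coincides with a typical curve.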

The model assessment of the mean and median-based approaches was based on separate residual analysis for the central tendencies of Epass, log(jpass), Epit and log(jpit). The actual central tendencies of the descriptors corresponded to their maximum KDE values. As presented in Eq. 3, residuals are calculated by subtracting the estimated value ŷ_i from the actual value y_i for the different descriptors y (Epass, log(jpass), Epit, log(jpit)) and sets i.

$$\mathrm{residuals}=\mathrm{actual}\;y\left({y}_{i}\right)-\mathrm{estimated}\;y\left({\hat{y}}_{i}\right)$$

(3)

Data availability

All data generated or analysed during this study are included in this published article (and its Supplementary Information files) and are available in the Mendeley Data repositories https://data.mendeley.com/datasets/5x4dmc38bg/1 and https://data.mendeley.com/datasets/7j6b6y48jw/1.

Code availability

The code required to reproduce these findings is included in this published article as Dataset 1 (.ipynb extension, Jupyter Notebook) and is available to download from GitHub: https://github.com/bcoelho-leonardo/Estimating-pitting-descriptors-of-316L-stainless-steel-by-machine-learning-and-statistical-analysis/tree/5c7c8eac41907667f94c22881650f23a6aee0d64.

References

Hughes, A. et al. Corrosion inhibition, inhibitor environments, and the role of machine learning. Corros. Mater. Degrad. 3, 672–693 (2022).
Qu, Z. et al. Pitting judgment model based on machine learning and feature optimization methods. Front. Mater. 8, 1–8 (2021).
Wei, R. P. & Harlow, D. G. Mechanistically based probability modelling, life prediction and reliability assessment. Model. Simul. Mater. Sci. Eng. 13, R33–R51 (2005).
Macdonald, D. D. Passivity–the key to our metals-based civilization. Pure Appl. Chem. 71, 951–978 (1999).
Macdonald, D. D. & Engelhardt, G. R. Predictive Modeling of Corrosion. In Shreir’s Corrosion, Vol. 2 (eds. Richardson, J. A. et al.) 1630-1679 (Elsevier, Amsterdam, 2010).
Frankel, G. S. Pitting corrosion of metals: a review of the critical factors. J. Electrochem. Soc. 145, 2186–2198 (1998).
Nyby, C. et al. Electrochemical metrics for corrosion resistant alloys. Sci. Data 8, 58 (2021).
Coelho, L. B. et al. Corrosion inhibition of AA6060 by silicate and phosphate in automotive organic additive technology coolants. Corros. Sci. 199, 110188 (2022).
ASTM. ASTM G61-86(2018). Standard Test Method for Conducting Cyclic Potentiodynamic Polarization Measurements for Localized Corrosion Susceptibility of Iron-, Nickel-, or Cobalt-Based Alloys. (ASTM, 2018).
Yi, Y., Cho, P., Al Zaabi, A., Addad, Y. & Jang, C. Potentiodynamic polarization behaviour of AISI type 316 stainless steel in NaCl solution. Corros. Sci. 74, 92–97 (2013).
Jegdic, B. V., Bobić, B., Bošnjakov, M. & Alić, B. Testing of intergranular and pitting corrosion in sensitized welded joints of austenitic stainless steel. Metall. Mater. Eng. 23, 109–117 (2017).
ISO. ISO 15158:2014 Corrosion of Metals and Alloys—Method of Measuring the Pitting Potential for Stainless Steels by Potentiodynamic Control in Sodium Chloride Solution. (ISO, 2014).
Anderko, A., Sridhar, N. & Dunn, D. S. A general model for the repassivation potential as a function of multiple aqueous solution species. Corros. Sci. 46, 1583–1612 (2004).
Wilde, B. E. & Williams, E. The relevance of accelerated electrochemical pitting tests to the long-term pitting and crevice corrosion behavior of stainless steels in marine environments. J. Electrochem. Soc. 118, 1057 (1971).
Soltis, J. Passivity breakdown, pit initiation and propagation of pits in metallic materials-review. Corros. Sci. 90, 5–22 (2015).
Williams, D. E., Westcott, C. & Fleischmann, M. Stochastic models of pitting corrosion of stainless steels: I. Modeling of the initiation and growth of pits at constant potential. J. Electrochem. Soc. 132, 1796–1804 (1985).
Freiman, L. I. & Metallov, Z. Potentiodynamic determination of stainless steel repassivation and pitting formation potentials. Corros. Sci. 8, 693–695 (1972).
Dulaney, C. C. N. & C. L. Localized Corrosion. NACE 184 (NACE, 1974).
Shibata, T. & Takeyama, T. Stochastic theory of pitting corrosion. Corrosion 33, 243–251 (1977).
Pereira, V. J. et al. In Blue Economy. 191–220 (Springer Nature Singapore, 2022).
Izquierdo, J. et al. Resolution of the apparent experimental discrepancies observed between SVET and SECM for the characterization of galvanic corrosion reactions. Electrochem. commun. 27, 50–53 (2013).
Bentley, C. L., Kang, M. & Unwin, P. R. Scanning electrochemical cell microscopy: new perspectives on electrode processes in action. Curr. Opin. Electrochem. 6, 23–30 (2017).
Bard, A. J., Fan, F. R. F., Kwak, J. & Lev, O. Scanning electrochemical microscopy. Introduction and principles. Anal. Chem. 61, 132–138 (1989).
Payne, N. A., Stephens, L. I. & Mauzeroll, J. The application of scanning electrochemical microscopy to corrosion research. Corrosion 73, 759–780 (2017).
Yule, L. C. et al. Nanoscale active sites for the hydrogen evolution reaction on low carbon steel. J. Phys. Chem. C. 123, 24146–24155 (2019).
Gateman, S. M., Georgescu, N. S., Kim, M.-K., Jung, I.-H. & Mauzeroll, J. Efficient measurement of the influence of chemical composition on corrosion: analysis of an Mg-Al diffusion couple using scanning micropipette contact method. J. Electrochem. Soc. 166, C624–C630 (2019).
Shkirskiy, V. et al. Nanoscale scanning electrochemical cell microscopy and correlative surface structural analysis to map anodic and cathodic reactions on polycrystalline Zn in acid media. J. Electrochem. Soc. 167, 041507 (2020).
Yule, L. C., Bentley, C. L., West, G., Shollock, B. A. & Unwin, P. R. Scanning electrochemical cell microscopy: a versatile method for highly localised corrosion related measurements on metal surfaces. Electrochim. Acta 298, 80–88 (2019).
Coelho, L. B. et al. Probing the randomness of the local current distributions of 316 L stainless steel corrosion in NaCl solution. Corros. Sci. 217, 111104 (2023).
Salami, B. A., Rahman, S. M., Oyehan, T. A., Maslehuddin, M. & Al Dulaijan, S. U. Ensemble machine learning model for corrosion initiation time estimation of embedded steel reinforced self-compacting concrete. Measurement 165, 108141 (2020).
Coelho, L. B. et al. Reviewing machine learning of corrosion prediction in a data-oriented perspective. npj Mater. Degrad. 6, 8 (2022).
Enikeev, M., Enikeeva, L., Maleeva, M. & Gubaydullin, I. Machine learning in the problem of recognition of pitting corrosion on aluminum surfaces. CEUR Workshop Proc. 2212, 186–192 (2018).
Sasidhar, K. N., Siboni, N. H., Mianroodi, J. R. & Rohwerder, M. Deep learning framework for uncovering compositional and environmental contributions to pitting resistance in passivating alloys. npj Mater. Degrad. 6, 71 (2022).
Yidong, X. Use of time series models to forecast the evolution of corrosion pit in steel rebars. Funct. Mater. 23, 457–462 (2016).
Yang, X. et al. A new understanding of the effect of Cr on the corrosion resistance evolution of weathering steel based on big data technology. J. Mater. Sci. Technol. 104, 67–80 (2022).
Kamrunnahar, M. & Urquidi-Macdonald, M. Prediction of corrosion behavior using neural network as a data mining tool. Corros. Sci. 52, 669–677 (2010).
Jiang, X., Yan, Y. & Su, Y. Data-driven pitting evolution prediction for corrosion-resistant alloys by time-series analysis. npj Mater. Degrad. 6, 2–9 (2022).
Zhu, Y., Macdonald, D. D., Qiu, J. & Urquidi-Macdonald, M. Corrosion of rebar in concrete. Part III: artificial neural network analysis of chloride threshold data. Corros. Sci. 185, 109438 (2021).
Borboudakis, G. et al. Chemically intuited, large-scale screening of MOFs by machine learning techniques. npj Comput. Mater. 3, 1–6 (2017).
Würger, T. et al. Data science based Mg corrosion engineering. Front. Mater. 6, 1–9 (2019).
Feiler, C. et al. In silico screening of modulators of magnesium dissolution. Corros. Sci. 163, 108245 (2020).
Würger, T. et al. Exploring structure-property relationships in magnesium dissolution modulators. npj Mater. Degrad. 5, 2 (2021).
Schiessler, E. J. et al. Predicting the inhibition efficiencies of magnesium dissolution modulators using sparse machine learning models. npj Comput. Mater. 7, 193 (2021).
Galvão, T. L. P. et al. CORDATA: an open data management web application to select corrosion inhibitors. 4–7 https://doi.org/10.1038/s41529-022-00259-9 (2022).
Galvão, T. L. P., Novell-Leruth, G., Kuznetsova, A., Tedim, J. & Gomes, J. R. B. Elucidating structure–property relationships in aluminum alloy corrosion inhibitors by machine learning. J. Phys. Chem. C. 124, 5624–5635 (2020).
Sridhar, N., Brossia, C. S., Dunn, D. S. & Anderko, A. Predicting localized corrosion in seawater. Corrosion 60, 915–936 (2004).
Weaver, C., Fortuin, A. C., Vladyka, A. & Albrecht, T. Unsupervised classification of voltammetric data beyond principal component analysis. Chem. Commun. 58, 10170–10173 (2022).
Godfrey, D., Bannock, J. H., Kuzmina, O., Welton, T. & Albrecht, T. A robotic platform for high-throughput electrochemical analysis of chalcopyrite leaching. Green. Chem. 18, 1930–1937 (2016).
Coelho, L. B. & Ustarroz, J. Epit and Epass descriptors of 316L stainless steel estimated by machine learning-datasets, Mendeley Data. https://doi.org/10.17632/5x4dmc38bg.1 (2023).
Takeuchi, I., Le, Q. V., Sears, T. D. & Smola, A. Nonparametric Quantile Regression. https://www.semanticscholar.org/paper/Nonparametric-Quantile-Regression-Takeuchi-Le/5f7e54b38d096f202236f44e2561a6d635bdb79c (2005).
Torres, D. et al. Distribution of copper electrochemical nucleation activities on glassy carbon: a new perspective based on local electrochemistry. J. Electrochem. Soc. 169, 102513 (2022).
Bernal, M. et al. A microscopic view on the electrochemical deposition and dissolution of Au with scanning electrochemical cell microscopy–Part I. Electrochim. Acta 445, 142023 (2023).
Evans, U. R. Localized corrosion. NACE 144 (NACE, 1974).
Blázquez-García, A., Conde, A., Mori, U. & Lozano, J. A. A review on outlier/anomaly detection in time series data. ACM Comput. Surv. 54, 1–33 (2021).
Hu, H., Nguyen, N., He, C. & Li, P. Advanced outlier detection using unsupervised learning for screening potential customer returns. in 2020 IEEE International Test Conference (ITC) 1–10 (IEEE, 2020).
Aprillia, H., Yang, H.-T. & Huang, C.-M. Statistical load forecasting using optimal Quantile Regression Random Forest and Risk Assessment index. IEEE Trans. Smart Grid 12, 1467–1480 (2021).
Torbati-Sarraf, H., Ding, L., Khakpour, I. & Poursaee, A. Electrochemical Impedance Spectroscopic Analyses of the Influence of the Surface Nanocrystallization on the Passivation of Carbon Steel in the Pore Solution. J. Mater. Civ. Eng. 33, 4020419-1–04020419-10 (2021).
Horta, D. G., Beviláqua, D., Acciari, H. A., Júnior, O. G. & Benedetti, A. V. Optimization of the use of carbon paste electrodes (cpe) for electrochemical study of the chalcopyrite. Quim. Nova 32, 1734–1738 (2009).
Dartois, J. E., Knefati, A., Boukhobza, J. & Barais, O. Using quantile regression for reclaiming unused cloud resources while achieving SLA. CloudCom 2018—10th IEEE International Conference on Cloud Computing Technology and Science, Dec 2018, Nicosia, Cyprus. pp. 89–98 (2018).
Koenker, R. & Bassett, G. Regression quantiles. Econometrica 46, 33 (1978).
Mohammad Zubir, W. M. A., Abdul Aziz, I. & Jaafar, J. In Computational and Statistical Methods in Intelligent (eds. Silhavy, R., Silhavy, P. & Prokopova, Z.) vol. 859, 236–254 (Springer International Publishing, 2019).
Esmailzadeh, S., Aliofkhazraei, M. & Sarlak, H. Interpretation of cyclic potentiodynamic polarization test results for study of corrosion behavior of metals: a review. Prot. Met. Phys. Chem. Surf. 54, 976–989 (2018).
Strehblow, H.-H. In Corrosion Mechanisms in Theory and Practice (eds. Marcus, P. & Oudar, J.) 201 (Marcel Dekker, Inc., New York, 1995).
Sato, N. A theory for breakdown of anodic oxide films on metals. Electrochim. Acta 16, 1683–1692 (1971).
Richardson, J. A. & Wood, G. C. A study of the pitting corrosion of Al by scanning electron microscopy. Corros. Sci. 10, 313–323 (1970).
Weibull, W. A Statistical Theory of Strength of Materials. Generalstabens Litografiska Anstalts Förlag, Stockholm (1939).
Volkov, S. D. Statistical strength theory. Foreign Technol. Div. Wright-Patterson AFB Ohio (Society for Industrial and Applied Mathematics, 1962).
Davidenkov, N., Shevandin, E. & Wittmann, F. The influence of size on the brittle strength of steel. J. Appl. Mech. 14, A63–A67 (1947).
Hirata, M. Statistical phenomena in science and engineering. Kikai-no-Kenkyu 1, 231 (1949).
Hori, M. Statistical aspects of fracture in concrete, I. An analysis of flexural failure of portland cement mortar from the standpoint of stochastic theory. J. Phys. Soc. Jpn. 14, 1444–1452 (1959).
Gumbel, E. J. Statistics of Extremes. (Columbia University Press, 1958).
Aziz, P. M. Application of the statistical theory of extreme values to the analysis of maximum pit depth data for aluminum. Corrosion 12, 35–46 (1956).
Eldredge, G. G. Analysis of corrosion pitting by extreme-value statistics and its application to oil well tubing caliper surveys. Corrosion 13, 67–76 (1957).
Coelho, L. B. & Ustarroz, J. Micro-Scale Potentiodynamic Polarisation (log(j)) Curves of 316L Stainless Steel—Datasets, Mendeley Data. https://doi.org/10.17632/7j6b6y48jw.1 (2023).
Weber, M., Auch, M., Doblander, C., Mandl, P. & Jacobsen, H. A. Transfer learning with time series data: a systematic mapping study. IEEE Access 9, 165409–165432 (2021).
Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2010).
Yu, J., Li, X. & Zheng, M. Current status of active learning for drug discovery. Artif. Intell. Life Sci. 1, 100023 (2021).
Warmuth, M. K. in Advances in Neural Information Processing Systems. 14 (The MIT Press, 2001).
Nash, W., Drummond, T. & Birbilis, N. A review of deep learning in the study of materials degradation. npj Mater. Degrad. 2, 37 (2018).
Ricolfe-Viala, C. & Blanes, C. Improving robot perception skills using a fast image-labelling method with minimal human intervention. Appl. Sci. 12, 1557 (2022).
Benson, F. A note on the estimation of mean and standard deviation from quantiles. J. R. Stat. Soc. Ser. B 11, 91–100 (1949).

Acknowledgements

L.B. Coelho is a Postdoctoral Researcher of the Fonds de la Recherche Scientifique - FNRS (Belgium), which is gratefully acknowledged. D.T. acknowledges financial support from the Fonds de Recherche dans l’Industrie et dans l’Agriculture (FRIA). J.U. and M.B. acknowledge financial support from the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS) under Grant No. F.4531.19 and from the Fonds Wetenschappelijk Onderzoek (FWO) under contract G0C3121N. G.B. and G.P. are supported by the Service Public de Wallonie Recherche under grant no. 2010235–ARIAC by DigitalWallonia4.ai. The authors thank Prof. Marjorie Olivier (University of Mons) for providing the treated stainless steel plates. L.B. Coelho also thanks Dr. Denis Steckelmacher for fruitful discussions on data manipulation and analysis.

Author information

Authors and Affiliations

ChemSIN—Chemistry of Surfaces, Interfaces and Nanomaterials, Université libre de Bruxelles (ULB), Brussels, Belgium
Leonardo Bertolucci Coelho, Daniel Torres, Miguel Bernal & Jon Ustarroz
Research Group Electrochemical and Surface Engineering (SURF), Vrije Universiteit Brussel, Brussels, Belgium
Leonardo Bertolucci Coelho, Vincent Vangrunderbeek & Jon Ustarroz
Machine Learning Group (MLG), Université libre de Bruxelles (ULB), Brussels, Belgium
Gian Marco Paldino & Gianluca Bontempi

Contributions

L.B.C.: conceptualisation, methodology, software, formal analysis, data curation, writing—original draft, visualisation, project administration, funding acquisition. D.T.: validation, investigation, writing—review & editing, visualisation. V.V.: methodology, software. M.B.: investigation. G.P.: methodology, software. G.B.: validation, formal analysis, writing—review & editing, visualisation. J.U.: validation, formal analysis, resources, writing—review & editing, visualisation, supervision.

Corresponding authors

Correspondence to Leonardo Bertolucci Coelho or Jon Ustarroz.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Dataset 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Coelho, L.B., Torres, D., Vangrunderbeek, V. et al. Estimating pitting descriptors of 316 L stainless steel by machine learning and statistical analysis. npj Mater Degrad 7, 82 (2023). https://doi.org/10.1038/s41529-023-00403-z

Received: 04 May 2023
Accepted: 06 October 2023
Published: 21 October 2023
DOI: https://doi.org/10.1038/s41529-023-00403-z

This article is cited by

Electrochemical nucleation and the role of the surface state: unraveling activity distributions with a cross-system examination and a local electrochemistry approach
- Daniel Torres
- Jérome Bailly
- Jon Ustarroz
Journal of Solid State Electrochemistry (2024)
