Introduction

Corrosion is among the most severe challenges that materials face, and it degrades engineering service life and safety1. Because service environments are diverse and service conditions complex, many forms of corrosion, such as pitting2, galvanic corrosion3, and stress corrosion cracking4 (SCC), occur in engineering structural materials. Aluminum (Al) alloys, the most widely used lightweight metals in aviation and transportation, also suffer from corrosion5. Although their use reduces fossil energy consumption, their degradation cannot be neglected6. Currently, there are two solutions for monitoring metal corrosion. The first is to deploy online corrosion sensors that transmit the corrosion current measured in the real environment7. Its obvious drawback is that the signal is unreliable: once a protective film or corrosion products form on the surface, the corrosion current may become extremely noisy or the sensor may stop working altogether, especially for Al alloys. The second, more trustworthy solution is offline field exposure, in which samples are exposed in the actual environment for years and then collected to calculate the corrosion rate8. This solution is highly dependable but costly, and longer exposure times increase the risk of losing samples.

Several research projects have studied the mechanisms of material corrosion under long-term field exposure in different regions. For instance, Sun reported that the corrosion rates of AA2024 and AA7075 were higher in industrial environments than in coastal and urban environments9. However, corrosion is a complex process influenced by many factors, such as composition and environmental parameters10. Because traditional modeling has limited capability, it considers only a few of these factors11. With the development of big data, machine learning (ML) has shown significant advantages in data mining and forecasting12. Numerous studies have fitted corrosion data with a variety of ML algorithms and compared their strengths and weaknesses13,14,15,16,17. For the noisy data generated by corrosion sensors in the first solution, Pei confirmed that random forest (RF) is better suited to corrosion data than artificial neural networks and support vector regression18. Although their data are noisy, corrosion sensors, which are mainly used for steel, generate massive amounts of data for ML training. By contrast, reliable field exposure experiments yield little data because acquisition costs are prohibitive. Predicting the corrosion rate of Al alloys from a small-sample dataset combined with environmental parameters therefore remains a major obstacle. Establishing ML models with strong generalization ability would be of great significance for prolonging the service life of materials and ensuring engineering safety19.

We therefore propose a strategy that uses ab-initio calculations to describe the intermetallic compounds (IMCs) and precipitated phases of Al alloys, so as to improve the prediction accuracy of the Al alloy corrosion rate at sites with latitudes ranging from 6° S to 52° N (see Fig. 1). Compared with a traditional dataset containing only composition and heat treatment, the calculated work functions (WFs), surface energies, and Cl adsorption energies (Cl AEs) describe the effect of secondary phases on the corrosion rate more comprehensively. Introducing the calculated phase descriptors greatly improved the generalization ability of the corrosion RF model. The model was also verified by independent field exposure experiments in Southeast Asia (Singapore, Jakarta, and Bangkok), where the annual rainfall and particle composition were beyond the range of the training set. Based on this combined computational and experimental model, we deduced which statistical strategy should be employed when ab-initio calculations are used to express the corrosion resistance of the various phases that coexist in Al alloys.

Fig. 1: Calculations modified RF model flow and the field exposure sites.

a ML flow diagram and dataset features. b Placement map. Corrosion data obtained in Singapore, Jakarta, and Bangkok are only used for generalization verification. c–e The climate characteristics of the regions. The field exposure time of the Al alloys ranges from 0.5 to 4 years.

Results

Parsing small-sample Al alloys field exposure corrosion data

To evaluate the characteristics of the Al alloy corrosion data, the effect of the basic RF parameters was investigated first. Fig. 2a shows that the average goodness-of-fit (R2) increased with the number of estimators. Once the number of estimators exceeded 40, the further gain in average accuracy was limited to 0.81%. Below 40 estimators, however, the model R2 behaved anomalously: the accuracy of the best traditional RF model rose markedly as the number of estimators decreased, with the maximum R2 reaching 0.86. In contrast to the gain in average accuracy, the maximum R2 of the optimal model dropped slightly, by 1.22%. A small number of estimators is generally considered to overfit the data. The mean squared error (MSE) confirmed that models with fewer than 30 estimators had errors significantly higher than the average error. Once the number of estimators exceeded 40, the MSE dropped to 0.086. The minimum MSE, only 0.0443, occurred for the model with 100 estimators. Taken as a whole, RF models for the small-sample corrosion dataset should use more than 40 estimators, with 100 as the optimal value.
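The estimator sweep described above can be sketched with scikit-learn as follows. This is a minimal illustration rather than the authors' script: the data are synthetic stand-ins (only the 162 × 15 shape follows Methods), the number of repeats is reduced from 28 to 5 for brevity, and the names `sweep` and `scores` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the corrosion dataset (hypothetical values):
# 162 rows and 15 normalized features, mirroring the sizes given in Methods.
rng = np.random.default_rng(0)
X = rng.random((162, 15))
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] ** 2 + 0.05 * rng.standard_normal(162)

# Fixed 0.8 / 0.2 split, as stated in Methods.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def sweep(estimator_counts, repeats=5):
    """Return {n_estimators: (mean R2, mean MSE)} over repeated fits."""
    results = {}
    for n in estimator_counts:
        r2s, mses = [], []
        for seed in range(repeats):
            model = RandomForestRegressor(
                n_estimators=n,
                max_depth=None,       # tree depth is not limited
                min_samples_split=2,  # a node may split once it holds 2 samples
                min_samples_leaf=1,   # at least 1 sample per leaf
                random_state=seed,    # one seed per repeat
            )
            model.fit(X_train, y_train)
            pred = model.predict(X_test)
            r2s.append(r2_score(y_test, pred))
            mses.append(mean_squared_error(y_test, pred))
        results[n] = (float(np.mean(r2s)), float(np.mean(mses)))
    return results


scores = sweep([1, 10, 40, 100])
for n, (r2, mse) in scores.items():
    print(f"n_estimators={n:3d}  mean R2={r2:.3f}  mean MSE={mse:.4f}")
```

Averaging over several seeds per setting separates the effect of the estimator count from the randomness of a single fit, which is the purpose of the repeated runs reported in Fig. 2a.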

Fig. 2: Analysis of long-term Al alloys field exposure corrosion data by RF model.

a The effect of the estimator number on R2 and MSE. The R2 improvement with b features and c training set ratio variation. The red symbol size clearly represents the precision of the model.

Realistically, the accuracy of our model still lags far behind that of models predicting the mechanical properties of materials20,21. To improve it, we analyzed the features and data distribution of the corrosion dataset in depth. When the maximum feature number exceeded 10, the average R2 of the models remained around 0.51 (Fig. 2b). Interestingly, superfluous features did not help to establish the optimal corrosion model (red symbol in Fig. 2b). The traditional RF model reached its optimum (R2 ≈ 0.62) only when the maximum feature number was 11. This showed that some features are unfavorable for predicting the corrosion rate of Al alloys. However, the slow increase in average precision beyond max features = 10 does not by itself imply that the number of unfavorable features is 5. Another way to improve accuracy is to enlarge the proportion of the training set or to increase the data size. Figure 2c shows that model accuracy improved greatly as the training set proportion increased (0.75 maximum, 24% higher than the original model).

Both Pearson's and Spearman's correlation coefficients were calculated to analyze the linear and monotonic relationships within the corrosion data, so as to eliminate useless features (Fig. 3a). Two isolated regions in the composition features show strong correlation. The first is the correlation between Al and Zn (exceeding 0.8). Zn is a strengthening element in AA7xxx that forms the η/η′ phase22. However, owing to the high SCC susceptibility23 of AA7xxx, other elements such as Cu, Fe, and Si are added to enhance its corrosion resistance. Zn addition is therefore accompanied by a simultaneous increase in other alloying elements, which lowers the Al content. The second strong correlation is between Si and Fe, which are often co-added to enhance alloy strength and manufacturing fluidity24. Moreover, the correlations of annual rainfall with precipitation pH and with Cl deposition were both above 0.9, indicating that annual rainfall could replace the other environmental parameters during training. Fig. 3a also shows that the environmental parameters were more directly related to the corrosion rate than alloy composition, as their coefficients with the corrosion rate were 0.3–0.5. We then investigated the importance of the features in the corrosion dataset with the traditional RF model (training set ratio 0.7, R2 = 0.62). Six features had an importance index exceeding 0.05 in Fig. 3b: exposure time, precipitation pH, Al content, Zn content, annual average temperature, and Cl deposition. The trace elements did not have a significant impact (<0.03) on the corrosion prediction.
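The difference between the two coefficients can be illustrated with a short standard-library sketch. The feature values below are hypothetical and are not taken from the dataset; the functions `pearson`, `ranks`, and `spearman` are illustrative helpers, with Spearman's coefficient computed as Pearson's coefficient on average ranks.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical feature columns: monotonic but clearly non-linear.
rainfall = [400.0, 650.0, 900.0, 1500.0, 2100.0]
cl_deposition = [0.02, 0.03, 0.05, 0.20, 0.90]

r_p = pearson(rainfall, cl_deposition)
r_s = spearman(rainfall, cl_deposition)
print(f"Pearson = {r_p:.3f}, Spearman = {r_s:.3f}")
```

For a strictly monotonic but non-linear pair such as this one, Spearman's coefficient reaches 1 while Pearson's stays below it, which is why reporting both helps to distinguish linear from merely monotonic dependence.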

Fig. 3: Analysis of long-term exposure experiment data.

a Pearson and Spearman correlation coefficient. b Importance ranking of corrosion data features based on the RF model.

Using the RF model, the effect of alloying elements on the corrosion rate was investigated for AA7B05, an alloy commonly used for underframe corbels. The effect of each element was obtained by varying its content monotonically, as summarized in Fig. 4. The corrosion rate fluctuated only slightly when the Zn content was below 5.7%. Once the Zn addition exceeded 6%, however, the corrosion rate increased by 100.67%. Excess Mg and Si were also observed to degrade the corrosion resistance. Likewise, Cu caused a surge in corrosion rate, and the content required was small (>1.5%). By contrast, Fe and Mn could improve the corrosion resistance of Al alloys (Fig. 4b). Fe is usually regarded as an impurity element that harms the mechanical properties of Al alloys. The ML results nevertheless show that a small amount of Fe can significantly improve the corrosion resistance, although excessive addition increases the corrosion rate. This was mainly attributed to the low formation energy of the Fe phases: excess Fe readily forms coarse phases with high WFs, which lead to galvanic corrosion. Mn also improved the corrosion resistance but, unlike Fe, a further increase in Mn did not substantially reduce it. In general, the very large effect of Zn on the corrosion rate of Al alloys is also confirmed when the analysis is applied to a specific alloy. Therefore, given the correlation of Zn with Si, Fe, Mn, and Ti (Fig. 3a), it is feasible to predict the corrosion rate with optimized features (using Zn). However, considering only Zn is unreasonable, especially for Al alloys without Zn, so the Al content should also be considered. Although optimizing the features improved the RF model accuracy, the result (up to 0.75) is still not optimal.

Fig. 4: Elaboration of element effect on corrosion rate.

a Common elements. b Trace elements.

Ab-initio calculations for the secondary phases

Complex secondary phases (including IMCs and precipitated phases) are another important characteristic of Al alloys and were observed experimentally, as shown in Fig. 5. Obtaining phase information efficiently and quickly, however, is a conundrum. Experimental approaches are hampered by the consequences of tiny compositional changes in the phases. Calculations, by contrast, can account as fully as possible for how the corrosion behavior of the secondary phases varies with the exposed elements (type, position, and content)25.

Fig. 5: Micron-scale IMCs or precipitated phases in Al alloys.

2xxx (θ-Al2Cu, S-Al2CuMg, N-Al7Cu2Fe, Mg2Si), 5xxx (Mg2Si, Al6(Fe, Mn)), 6xxx (Mg2Si, S-Al2CuMg, Al6(Fe, Mn)), and 7xxx (Al3Fe, Al23Fe4Cu, Mg2Si) Al alloys. Excluding the micro-scale secondary phases, the 5xxx Al alloys also contain β-Al3Mg2 and 7xxx Al alloys have η-MgZn2 phase. The length of bars is 2 μm.

The WF, surface energy, and Cl AE of the secondary phases were calculated for different orientations to enrich the phase information in the corrosion dataset. The most easily formed Al surface was (111), with a WF of 4.05 eV26. The theoretical WF difference (Eq. (13)) between each phase and the Al (111) surface qualitatively determines whether the phase acts as an anode or a cathode in the corrosion process27,28. In summary, η-MgZn2 and β-Al3Mg2 were anodic phases in Al alloys, whereas N-Al7Cu2Fe, θ-Al2Cu, Al3Fe, Al6(Fe, Mn), and Al23Fe4Cu were cathodic phases (see Fig. 6a). The corrosion tendency of Mg-containing phases, such as S-Al2CuMg and Mg2Si, depended strongly on the crystallographic orientation and on the atomic type of the outermost termination. Moreover, the difference between the minimum/maximum WF of Mg2Si and the WF of Al was large enough to cause corrosion. When Mg2Si acts as an anode, Al alloys tend to undergo pitting corrosion29. Conversely, Al alloys undergo galvanic corrosion when Mg2Si acts as a cathode. Fortunately, the surface energy of the most easily formed Mg2Si surface, (100), was 0.81 J m−2, which indicated that Mg2Si particles were not large (experimentally measured at close to 1 μm). Meanwhile, the WF of the (100) surface was 4.55 eV, so Al alloys containing Mg2Si were relatively less prone to galvanic corrosion.
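The qualitative anode/cathode assignment can be sketched as a comparison against the Al (111) work function. Eq. (13) is not reproduced in this section, so the sign convention below (phase WF minus matrix WF, with a lower WF meaning an anodic phase) is an assumption, and `galvanic_role` is a hypothetical helper. Only the two WF values quoted in the text are used.

```python
AL_111_WF_EV = 4.05  # WF of the Al (111) surface quoted in the text


def galvanic_role(phase_wf_ev, matrix_wf_ev=AL_111_WF_EV):
    """Return (delta, role) for a phase surface relative to the Al matrix.

    Assumed convention: delta = WF(phase) - WF(matrix).
    A negative delta means the phase gives up electrons more easily
    than the matrix (anode); a positive delta means the opposite (cathode).
    """
    delta = phase_wf_ev - matrix_wf_ev
    if delta > 0:
        role = "cathode"
    elif delta < 0:
        role = "anode"
    else:
        role = "neutral"
    return delta, role


# Mg2Si (100) surface, WF = 4.55 eV as quoted in the text.
delta, role = galvanic_role(4.55)
print(f"Mg2Si (100): delta = {delta:+.2f} eV -> {role}")
```

The magnitude of the difference matters as well as its sign: a small difference suggests a weak driving force for galvanic coupling, in line with the observation that the (100) surface of Mg2Si makes the alloy relatively less prone to galvanic corrosion.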

Fig. 6: Ab-initio calculations for the Al alloys secondary phases.

a WF. Three orientations, (100), (110), and (111), were considered. The lowest WF is depicted as a line, and the remaining two surfaces are expressed in the form of summation on this line. b The phase surface energies and their adsorption energies for the Cl atom. The surface with the lowest WF of each phase is surrounded by a black circle. c Adsorption distance and charge distribution between the Cl atom and the phase surface with the lowest WF.

Of particular concern, the surface energies of Al23Fe4Cu were negative for all orientations (the lowest being −2.87 J m−2), which meant that large Al23Fe4Cu particles can form in Al alloys. Combined with its high WF, this phase readily causes galvanic corrosion. Positive surface energies, on the other hand, did not mean that a phase cannot precipitate or form in the Al matrix; they only indicated that surface formation was a non-spontaneous process. Except for Al23Fe4Cu, the surface energies of the lowest-WF orientations of the other eight phases were all positive. Most of the phase orientations (>81.48%) showed a clear affinity for Cl. The lowest-WF surfaces of β-Al3Mg2, θ-Al2Cu, η-MgZn2, AlFe3 and Al23Fe4Cu2 had poor affinity for the Cl atom, so there was no obvious Cl additive effect30. Furthermore, the adsorption distance and charge distribution shown in Fig. 6c clearly divided the secondary phases into three categories. The first category comprised precipitated phases that interact weakly with the Cl atom, for instance β-Al3Mg2, θ-Al2Cu, and η-MgZn2. The distance between the outermost surface of the phase and the Cl atom was larger than 7 Å, and there was no obvious charge transfer at the intermediate points. The second category comprised IMCs that form strongly interacting ionic bonds with the Cl atom, namely the Fe-containing IMCs and S-Al2CuMg. The distance between the Cl atom and the outermost atoms was 2 Å, and there was apparent charge sharing. Finally, the worst phase was Mg2Si, where the Cl atom had visibly penetrated into the phase interior (distance <1 Å). Based on the above calculations, we numerically describe the role of the secondary phases in the corrosion process of Al alloys.

Facilitation and generalization for syncretic RF model

To establish a more accurate corrosion prediction model, various statistical methods were applied to the secondary-phase information (WFs and Cl AEs) to refine the corrosion dataset, an approach we call Integrating Ab-initio Calculations with Random Forest models (IACRF). The six most important features (Fig. 3) were used to compare the statistical methods. Figure 7 shows the effect of the WF and Cl AE descriptors on model accuracy. Using WFs to amend the corrosion dataset improved the mean prediction accuracy by 43.56 ± 1.03% (Fig. 7a). The strategies based on the arithmetic mean method (AMM) gave very similar improvements, as their mathematical expectations differed by less than 0.06%. The optimal AMM model was obtained with the average WF of the lowest-surface-energy orientations; its accuracy reached 0.861. When the phase type was ignored and only an extreme WF was used to supplement the corrosion dataset, the model accuracy fluctuated significantly. In particular, the mean accuracy of the model that considered only the lowest WF improved to 0.737. By contrast, the accuracy of the model that considered only the highest WF dropped drastically, and its Gaussian distribution shifted to the left (see Supplementary Fig. 1). The model using the lowest WF achieved an optimal accuracy of 0.869.

Fig. 7: Contrast of different statistical strategies for calculation modified corrosion dataset.

a Mean accuracy and optimal strategies using WF. b Statistics of model accuracy after supplementation with the Cl AE. c Average and maximum accuracy comparison of models with a single factor (WF or Cl AE) or both the factors.

Fig. 7b shows that supplementation with Cl AEs was significantly less precise than supplementation with the WF; the average accuracy of the Cl AE models was only 0.630. One reason is that the Cl AE, unlike the WF, does not directly indicate the ability of the secondary phases to gain or lose electrons. Another reason is that oxide films form easily on the Al surface, so the probability of Cl directly contacting the Al matrix is low. Although the oxide film on the secondary phases may be weak, some phases, such as β-Al3Mg2, θ-Al2Cu, and η-MgZn2, were resistant to Cl adsorption. As with the WF, the optimal Cl AE-modified model appeared with the extreme value method. The maximum model accuracy (0.837) was obtained with the Cl AE of the surface with the lowest surface energy. It is also inadvisable to average the Cl AEs of all orientations of all phases, because doing so significantly masks the pitting-sensitive phases (the optimal accuracy of avgall is 0.747, Fig. 7b). Supplementary Fig. 2 shows that the prediction accuracy of IACRF modified with WF or Cl AE varies only slightly as the number of estimators increases.

Fig. 7c summarizes a comprehensive comparison of the two ab-initio calculated descriptors under three amendment methods (WF, Cl AE, and both). When the corrosion dataset was supplemented with both WF and Cl AE, the average model accuracy was similar to that of the Cl AE modification alone. Only the model amended with the average-all-phases strategy improved, to 0.64. In terms of mean accuracy, the model using only WFs was significantly better than the other two strategies. In terms of the optimal (highest-accuracy) model, the models combining WF and Cl AE matched the WF-only models for some AMM strategies. Specifically, the avg and avglowest strategies differed from their WF-only counterparts by only 0.71% and 0.35%, respectively. The accuracy of the other four statistical strategies indicated that the Cl adsorption descriptors cancelled the beneficial effect of the WF to some extent. Therefore, the single WF descriptor is the most beneficial calculated feature for the field exposure corrosion dataset of Al alloys.

To enhance the generalization ability of the corrosion prediction model, we further trained the model with different feature combinations. In the end, four features (exposure time, Al content, annual rainfall, and calculated WF) were retained for training. Figure 8a shows two corrosion RF models after repeated tuning: the experimental model and the calculation-modified model (IACRF). Because some experimental features were removed, the R2 of the optimal experimental model dropped to 0.821, with an MSE of 0.0142. Despite this lower R2, the slope of the linear fit between the experimental data and the predicted values was 0.875 ± 0.049. Encouragingly, the IACRF model had an R2 as high as 0.963. Its MSE dropped to 0.0136, and the slope of the linear fit reached 0.942 ± 0.044. This illustrated that the secondary phases are a non-negligible factor in corrosion rate prediction. Adding the calculated phase descriptors to the corrosion dataset greatly reduced the prediction difficulty and increased the prediction accuracy by 17.30%.
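The 17.30% figure is consistent with reading it as the relative change in R2 between the two models in Fig. 8a. This reading is an assumption, since the text does not state how the percentage was computed:

$$\frac{R^2_{\mathrm{IACRF}} - R^2_{\mathrm{exp}}}{R^2_{\mathrm{exp}}} = \frac{0.963 - 0.821}{0.821} \approx 0.173$$

Under the same reading, the absolute gain in R2 is 0.142.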

Fig. 8: Independent verification of two feasible models for Al alloys field exposure data.

a Comparison of RF models trained by the pure experimental corrosion dataset or ab-initio calculation modified dataset. b Three different Al alloys, AA5083, AA6N01, and AA7N01, were placed in Southeast Asia (Singapore, Jakarta, and Bangkok) for 2 years to examine the generalization ability of the RF model with phase calculation. c Corrosion morphology and statistics on the number and depth of pits in Al alloys after field exposure.

Incidentally, Fig. 8a shows that both models had larger prediction errors at low corrosion rates (<0.01 μm a−1) than at high corrosion rates (>0.1 μm a−1). The field exposure corrosion rate decreased with time because the corrosion rate here refers to the average annual corrosion rate. The Al alloys corroded faster in the first year of exposure; afterwards, the corrosion weight loss grew more slowly than the exposure time, which led to a decrease in the corrosion rate. When the corrosion rate was converted to weight loss, the weight loss increased with exposure time. It is also worth noting that the Al content of AA6N01 was 98.24%, a 1.41% increase over the maximum of the training set (96.85%). Additionally, the annual rainfall in Singapore (2345 mm, 1.146 after normalization) was out of range31. Although both compositional and environmental parameters were outside the range of the input values, the RF model trained with the WF retained its accuracy to some extent. Combining the corrosion rates of the three Al alloys in Southeast Asia, the overall predicted R2 reached 0.822 and the MSE was 0.0031. Moreover, the overall average error between the predicted and experimental rates was 8.89%. Comparing the regions, the IACRF model predicted best for the Al alloys exposed in Jakarta, with an error of only 6.33 ± 3.04%, and worst for Bangkok, with an error as high as 11.53%. In Singapore, where the rainfall exceeded the dataset range, the accuracy reached 91.20 ± 7.43%. Experimentally, corrosion was most serious in Singapore regardless of the Al alloy series. The morphology, size, and number of pits are summarized in Fig. 8c. The more numerous and deeper the pits, the greater the risk of an engineering disaster. Although the three Al alloys had fewer pits in Singapore, these pits were far deeper than those in the other two regions. Once the material is perforated, stress concentration causes rapid failure of the Al alloy structure. The other two regions had pit counts of the same magnitude, but the average pit depth was greater in Bangkok.
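The overall error of 8.89% can be cross-checked against the regional figures quoted above. The sketch below assumes that the Singapore accuracy converts to an error as 100% minus accuracy and that the three regions are weighted equally; neither assumption is stated in the text.

```python
# Regional prediction errors in percent, taken from the text.
regional_error_pct = {
    "Jakarta": 6.33,
    "Bangkok": 11.53,
    "Singapore": 100.0 - 91.20,  # reported as accuracy; converted to error
}

# Equal-weight mean over the three regions (assumed weighting).
overall_error_pct = sum(regional_error_pct.values()) / len(regional_error_pct)
print(f"overall error = {overall_error_pct:.2f}%")
```

Under these assumptions the regional values average to the reported overall error, which suggests that the three sites contribute equally to the summary statistic.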

Discussion

By limiting the maximum number of features per subtree, we examined the effect of the feature number on the corrosion model. Unfortunately, we only know that increasing the feature number up to a point (≈10) helped to improve model accuracy; exactly which features each subtree used is not known. Based on the analysis of the features and the learning curve, we determined that the corrosion prediction models could be optimized further. In ML models with sufficient data, the learning accuracy curve shows only tiny fluctuations as the data size increases32,33; the behavior here was markedly different, which indicates that the field exposure corrosion dataset is a small-sample dataset. The field exposure experiments ran from 0.5 to 4 years, and at least five samples of the same alloy were required to ensure experimental accuracy; with such high experimental costs, the data size clearly cannot be expanded easily. Learning from small-sample data, however, remains a difficult issue34,35. Besides enlarging the dataset with data from academic journals36, another feasible method is to use the generalization error on part of the samples to fine-tune the learner parameters. For instance, Xue fed data with large prediction errors back as input and used only 22 samples to construct a support vector machine model that led to the successful fabrication of an Fe-Ni-based shape memory alloy37. Therefore, to further improve the accuracy of the corrosion model, the small-sample data need to be deconstructed to remove harmful features and add new ones.

We introduced the Al alloy phase descriptors after observing the relationship between corrosion rates and the secondary phases. According to the element distribution and atomic ratio of each secondary phase, we approximated the corresponding crystallographic models (Supplementary Fig. 3). It is experimentally difficult to incorporate information on the various phases into a corrosion dataset. Currently, scanning Kelvin probe force microscopy is a potential means of determining the corrosion properties of the secondary phases38. However, this method is very costly, and the ordering of phase composition detection (carbon injection changes the surface state) and phase surface potential monitoring (performing it first may result in repeated testing of the same phase) lowers the experimental efficiency. Ab-initio calculation is therefore the optimal way to refine the corrosion dataset. Comparing model accuracy across the different calculation statistics suggests that, when ab-initio calculations are used to assess the corrosion resistance of Al alloys, more weight should be given to the secondary phases with the lowest WF. The improvement from the Cl AE descriptor is weaker than that from the WF, which may be attributed to the dense native oxide film39,40 on Al surfaces (thickness ~10 nm) and the low Cl concentration in the atmospheric environment. The effect of Cl on the oxide/passivation film is controversial; the main theories are Cl permeation and Cl-induced film thinning41. The Cl AE feature may therefore be more accurate when training models for metals and alloys without oxide/passivation films, e.g., carbon steel and Mg alloys. Furthermore, the prediction accuracy for stainless steel, whose passivation film42 (~1–2 nm) is extremely sensitive to Cl, may improve significantly once the Cl AE is added. For thin iron passivation films with element doping (Cr, Ni, Mo) and structural transformation (Fe3O4, Fe2O3, Fe(OH)2, and Fe(OH)3), the best calculated Cl descriptor could be Cl diffusion in the passivation film43 (especially for Cl permeation systems). A feature describing Cl diffusion in the oxide layer may have a significant impact on accuracy when ab-initio calculations are incorporated to predict the corrosion rate of stainless steel. In addition, the same secondary phase can adopt many structures with different stoichiometries and crystallographic space groups during heat treatment44. Considering more orientations, and different terminal atoms on the same orientation, helps to describe the physical properties of the secondary phases more accurately. Such calculations may improve model accuracy, but this is not certain. For the calculated descriptors, especially the WF, the range (min, max) should nevertheless include that of the metastable phases as far as possible, e.g., Mg2Si in Supplementary Table 1.

Some environmental parameters, such as Cl deposition and rainfall pH, must be collected with specialized instruments. If any of the environmental parameters used in the aforementioned models were missing, accuracy would suffer. We therefore went on to train the model with readily accessible environmental parameters. A model with R2 ≈ 0.96 was successfully trained on the corrosion dataset modified by combined feature optimization and ab-initio calculations. Although the verification precision in the Southeast Asian field exposure experiments dropped to 0.822, the overall error was less than 9%. We attribute this drop to the fact that parameters in Southeast Asia exceeded the training set thresholds: some compositions of AA5083 and AA6N01 (never used in training) are below the minimum or above the maximum of the training set, and the rainfall in Singapore exceeds the maximum value of the dataset.
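The way an out-of-range input shows up after linear normalization can be sketched as follows. The training rainfall values are hypothetical, because the actual bounds of the training set are not given in this section; only the Singapore rainfall of 2345 mm comes from the text, and the helper names are illustrative. The sketch assumes that "linearly normalized" means min-max scaling with bounds fitted on the training data.

```python
def fit_min_max(values):
    """Return the (min, max) bounds learned from the training data."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("feature is constant; cannot normalize")
    return lo, hi


def normalize(x, lo, hi):
    """Linear (min-max) normalization using training-set bounds."""
    return (x - lo) / (hi - lo)


def in_training_range(x_norm):
    """True when a normalized value lies inside [0, 1]."""
    return 0.0 <= x_norm <= 1.0


# Hypothetical training-set annual rainfall in mm (illustrative only).
train_rainfall_mm = [300.0, 800.0, 1400.0, 2000.0]
lo, hi = fit_min_max(train_rainfall_mm)

# Singapore annual rainfall quoted in the text.
singapore_norm = normalize(2345.0, lo, hi)
print(f"normalized rainfall = {singapore_norm:.3f}, "
      f"inside training range: {in_training_range(singapore_norm)}")
```

Because the bounds come from the training set only, any verification sample whose normalized value falls outside [0, 1] requires extrapolation, which a tree-based model such as RF cannot perform beyond the values seen in its leaves.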

In conclusion, we have proposed an optimal strategy for the Al alloys corrosion rate prediction based on small-sample data from long-term field exposure experiments. The IACRF model is conducive to the selection and utilization of Al alloys in various environments. The use of the ab-initio calculations for the WF and the Cl AE of the secondary phases in Al alloys further characterizes the corrosion dataset, improving the corrosion rate prediction accuracy by 17.30%. In particular, we explicitly interpreted the effect of the phase information on the Al alloys corrosion prediction. Moreover, through comparing various statistical methods of WF and Cl AE, it was evident that the minimum WF of all phases should be specified as the evaluation criterion when utilizing the ab-initio calculation to measure the corrosion resistance of Al alloys. In addition, the WF was more suitable for corrosion prediction of pitting (galvanic) corrosion-prone alloys, while Cl AE tends to play a more significant role in Cl-induced passivation film destruction or general corrosion systems. To generalize the model broadly, we utilized feasible experimental features, Al content, rainfall, and exposure time combined with phases WF, to train the model. Finally, independent field exposure experiments in Southeast Asia demonstrated and verified the generalization ability of IACRF in corrosion prediction.

Methods

Data pre-processing and random forest model

The corrosion dataset involves 15 experimental features and 2 ab-initio calculated features, with a total of 162 experimental data points. The experimental features comprise ten element contents (Al, Zn, Mg, Si, Cu, Fe, Cr, Mn, Ti, and the sum of other elements), four environmental parameters (annual average temperature, annual rainfall, precipitation pH, and Cl deposition), and exposure time. Each feature was linearly normalized. The experimental data are openly available on GitHub, and the calculation results for the secondary phases are summarized in Supplementary Table 2. In addition, six statistical strategies were used to find the optimal description of the secondary-phase ab-initio calculations in Al alloys (Supplementary Tables 3 and 4). The detailed statistical strategies (the preprocessing of Cl AE is similar to that of WF) were as follows:

The descriptor \(\psi _{all}^{avg}\) refers to averaging the WFs of all crystallographic orientations of all phases in the Al alloy, given by Eq. (1).

$$\psi _{all}^{avg} = \left( {\mathop {\sum}\nolimits_i^n {{\sum} {\psi _{surf}^i} } } \right)/n$$
(1)

where i is the index of the secondary phase, n is the total number of secondary phases in the Al alloy, and \(\psi _{surf}^i\) is the WF of the i-th phase on its different orientations (surf).

The descriptor \(\psi _{lowest}^{avg}\) refers to taking the lowest WF among all orientations of each phase, and then averaging over all phases in the Al alloy.

$$\psi _{lowest}^{avg} = \left( {\mathop {\sum}\nolimits_i^n {\psi _{lowest}^i} } \right)/n$$
(2)

where \(\psi _{lowest}^i\) is the lowest WF among all orientations of the i-th phase.

The descriptor \(\psi _{\gamma _{lowest}}^{avg}\) uses the WF of the orientation with the lowest surface energy to represent the i-th phase, and then averages all phases in the Al alloys.

$$\psi _{\gamma _{lowest}}^{avg} = \left( {\mathop {\sum}\nolimits_i^n {\psi _{\gamma _{lowest}}^i} } \right)/n$$
(3)

where \(\psi _{\gamma _{lowest}}^i\) denotes the WF of the orientation with the lowest surface energy (γ) in the i-th phase.

The descriptor \(\psi _{lowest}\) (\(\psi _{highest}\)) is the minimal (maximal) WF, regardless of the number and type of phases.

$$\psi _{lowest} = \min \left( {\psi _{lowest}^n} \right)$$
(4)
$$\psi _{highest} = \max \left( {\psi _{highest}^n} \right)$$
(5)

where the \(\psi _{lowest}^n\) (\(\psi _{highest}^n\)) indicates the minimal (maximal) WF of all phases (n) in Al alloys.

The descriptor \(\psi _{\gamma _{lowest}}\) takes, for each phase, the WF of the orientation with the lowest surface energy and then selects the minimum over all phases, regardless of the phase type.

$$\psi _{\gamma _{lowest}} = \min \left( {\psi _{\gamma _{lowest}}^n} \right)$$
(6)
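The six descriptors of Eqs. (1)–(6) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the `Surface` container, the function name, and the numerical values are hypothetical, and Eq. (1) is read here as a plain mean over every orientation of every phase, which is an assumption about the intended normalization.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Surface:
    """One crystallographic orientation of a secondary phase (hypothetical container)."""
    wf: float              # work function of this orientation (eV)
    surface_energy: float  # surface energy of this orientation (any consistent unit)


def wf_descriptors(phases: dict[str, list[Surface]]) -> dict[str, float]:
    """Return the six WF descriptors of Eqs. (1)-(6) for one alloy."""
    # Per-phase representatives used by Eqs. (2)-(6).
    lowest = [min(s.wf for s in surfs) for surfs in phases.values()]
    highest = [max(s.wf for s in surfs) for surfs in phases.values()]
    # WF of the orientation with the lowest surface energy in each phase.
    gamma_lowest = [
        min(surfs, key=lambda s: s.surface_energy).wf for surfs in phases.values()
    ]
    # Assumption: Eq. (1) is interpreted as the mean over all surfaces of all phases.
    all_wfs = [s.wf for surfs in phases.values() for s in surfs]
    return {
        "avg_all": mean(all_wfs),                # Eq. (1), under the assumption above
        "avg_lowest": mean(lowest),              # Eq. (2)
        "avg_gamma_lowest": mean(gamma_lowest),  # Eq. (3)
        "lowest": min(lowest),                   # Eq. (4)
        "highest": max(highest),                 # Eq. (5)
        "gamma_lowest": min(gamma_lowest),       # Eq. (6)
    }


# Hypothetical values for illustration only (not taken from the paper).
example = {
    "phase_A": [Surface(4.0, 1.2), Surface(4.4, 0.9)],
    "phase_B": [Surface(3.6, 0.8), Surface(4.2, 1.1), Surface(3.9, 0.7)],
}
descriptors = wf_descriptors(example)
```

The same function applies to the Cl AE if the `wf` field is filled with adsorption energies instead of work functions.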

The RF algorithm was built from classification and regression trees as implemented in scikit-learn45. The regression criterion was the squared error loss function, and out-of-bag data were also used to evaluate the generalization ability of the model. Since the dataset was small (~160 samples), the maximum number of estimators was limited to 100. The number of estimators was therefore increased from 1 to 100, and 28 repeats were performed for each value. All features were used for model training, and the dataset is shown in Supplementary Table 5. To eliminate the effects of data variance, the split ratio of the dataset was fixed (0.8 training data and 0.2 testing data). A node could be split once it contained at least 2 samples, and the minimum number of samples on each leaf was 1. The IACRF model used 100 estimators, and the maximum depth of the trees was not limited. The specific parameters of the IACRF models are available on GitHub. The evaluation criteria were the MSE and R2, given by Eqs. (7) and (8).

$$MSE = \frac{1}{N}\mathop {\sum}\nolimits_i^N {\left( {y_i - y_{pred_i}} \right)^2}$$
(7)
$$R^2 = 1 - \frac{{\mathop {\sum}\nolimits_i^N {\left( {y_i - y_{pred_i}} \right)^2} }}{{\mathop {\sum}\nolimits_i^N {\left( {y_i - y_{mean_i}} \right)^2} }}$$
(8)

where N is the total number of data points in the corrosion dataset and i is the sample index. \(y_i\) denotes the true value of the i-th sample, \(y_{pred_i}\) is the corresponding predicted value, and \(y_{mean_i}\) represents the average of the true values of all samples.

Long-term field exposure experiments

Al alloys of the 2xxx, 5xxx, 6xxx, and 7xxx series were subjected to field exposure experiments. The chemical compositions of the 18 kinds of Al alloys are listed in Supplementary Table 6, and the size of each exposure sample was 150 mm × 75 mm × 4 mm. Prior to the field exposure experiments, all samples were degreased in acetone, cleaned with distilled water, and dehydrated with ethyl alcohol. Next, the dry samples were weighed on a balance with an accuracy of 0.01 mg and kept in desiccators. During field exposure, the samples were mounted on racks at an angle of 45° to the horizontal, and the exposure time ranged from 0.5 to 4 years. Three Southeast Asian regions, namely Singapore, Jakarta, and Bangkok, were selected for validation, and their available environmental parameters are shown in Supplementary Table 7. Subsequently, the samples were recovered for corrosion assessment according to the Chinese standard (GB/T 16545-2015). Specifically, the corrosion products on the specimens were removed by immersion in a solution (50 mL H3PO4 + 20 g CrO3 + 1 L H2O) for 5–10 min at 80–100 °C. Then, the samples were rinsed with distilled water and air dried. The microstructures of the secondary phases in Al alloys were characterized by scanning electron microscopy and energy dispersive spectroscopy mapping on a GeminiSEM 500 (ZEISS, UK). The corrosion rate (ω, µm a−1) was calculated as follows:

$$\omega = \frac{{1000\left( {G_0 - G_1} \right)}}{{2t\rho \left( {ab + ac + bc} \right)}}$$
(9)

where G0 and G1 are the original and the final weights (g), a, b, and c express the length, width, and thickness (mm) of the samples respectively, t denotes the exposure time (year, a), and ρ is the density (g cm−3) of the Al alloys.
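As a worked illustration, Eq. (9) can be evaluated with a small helper. The function name is hypothetical; the prefactor of 1000 and the input units are taken exactly as given in Eq. (9), and the masses and density in the example are assumed values, not measurements from this study (only the sample dimensions come from the text).

```python
def corrosion_rate(g0, g1, a, b, c, t, rho):
    """Evaluate Eq. (9) as written.

    g0, g1 : original and final weights (g)
    a, b, c: length, width, and thickness of the sample (mm)
    t      : exposure time (years)
    rho    : alloy density (g cm^-3)
    """
    if t <= 0 or rho <= 0:
        raise ValueError("exposure time and density must be positive")
    # Total exposed area of the rectangular sample: 2(ab + ac + bc).
    total_area = 2.0 * (a * b + a * c + b * c)
    return 1000.0 * (g0 - g1) / (t * rho * total_area)


# Sample size from the text; weights and density are assumed for illustration.
rate = corrosion_rate(g0=120.05, g1=120.00, a=150.0, b=75.0, c=4.0, t=1.0, rho=2.7)
```

Using the full surface area 2(ab + ac + bc) treats all six faces of the plate as exposed, which is what the denominator of Eq. (9) implies.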

Ab-initio calculations

All ab-initio calculations were carried out using the Vienna Ab-initio Simulation Package (VASP) within MedeA46. A total of nine secondary phases in Al alloys were considered for calculating the WF, surface energy, and Cl AE; their initial structures are presented in Supplementary Fig. 3. In view of the large number of possible secondary phases in Al alloys, two criteria were used to select the structures to be calculated: (I) the most stable structure, with the lowest formation energy, should be taken into account; (II) the structure should be one observed in actual exposure samples. To guarantee the calculation accuracy, the surface models consisted of slabs with at least 6 layers. In addition, taking the outermost atoms (including the adsorbed Cl atom) as the reference, a 15-Å-thick vacuum gap was added to the model to create the surface state (see Supplementary Fig. 4). The generalized gradient approximation47 with the Perdew–Burke–Ernzerhof48 functional was used in the calculations. The convergence criterion for structure optimization was set to 10−5 eV Å−1, and the k-point spacing and cutoff energy for the adsorption calculations were set to 0.25 Å−1 and 450 eV, respectively. The surface energy and Cl AE were calculated according to Eqs. (10) and (11), respectively.

$$\gamma _{surf} = \left( {E_{surf} - {\sum} {nE_{atom}} } \right)/2A$$
(10)

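where γsurf and Esurf represent the surface energy and the total energy of the slab, respectively, Eatom is the energy per atom of each element in the bulk, n is the number of atoms of that element in the slab, and A is the surface area. The factor of 2 accounts for the two surfaces of the slab.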

$$E_{ads}^{Cl} = E_{surf}^{Cl} - E_{surf} - E_{Cl}$$
(11)

where \(E_{ads}^{Cl}\) is the adsorption energy of a single Cl atom on a phase surface, \(E_{surf}^{Cl}\) indicates the total energy of the adsorption system with the Cl adsorbed on the surface, Esurf represents the pure system without the Cl adsorbed, and ECl is the energy of a single Cl in bulk.
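Equations (10) and (11) are simple energy differences, and the following sketch shows how they could be evaluated from total energies. The function names, the per-element dictionaries, and all numerical values are hypothetical and chosen only to make the arithmetic easy to follow; they are not VASP results.

```python
def surface_energy(e_surf, atom_counts, e_atom_bulk, area):
    """Eq. (10): (E_surf - sum(n * E_atom)) / (2A).

    e_surf      : total energy of the slab
    atom_counts : {element: number of atoms of that element in the slab}
    e_atom_bulk : {element: energy per atom of that element in the bulk}
    area        : area A of one face of the slab
    """
    bulk_reference = sum(n * e_atom_bulk[el] for el, n in atom_counts.items())
    # The slab exposes two surfaces, hence the factor of 2.
    return (e_surf - bulk_reference) / (2.0 * area)


def cl_adsorption_energy(e_surf_cl, e_surf, e_cl):
    """Eq. (11): E_ads = E_surf^Cl - E_surf - E_Cl."""
    return e_surf_cl - e_surf - e_cl


# Hypothetical energies for illustration only.
gamma = surface_energy(
    e_surf=-94.0,
    atom_counts={"Al": 12, "Cu": 6},
    e_atom_bulk={"Al": -4.0, "Cu": -8.0},
    area=10.0,
)
e_ads = cl_adsorption_energy(e_surf_cl=-97.5, e_surf=-94.0, e_cl=-1.5)
```

With this sign convention, a more negative adsorption energy corresponds to stronger binding of Cl to the surface.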

The WF (ψ) is the minimum energy required for an electron to escape from the interior of a metal to the vacuum. To express the corrosion tendency of the secondary phases in Al alloys more efficiently and to accelerate the training process, the WF was calculated by Eq. (12) and then preprocessed to obtain the theoretical WF difference (\(\psi _{matrix}^{phase}\)) according to Eq. (13)49.

$$\psi = \phi - E_f$$
(12)
$$\psi _{matrix}^{phase} = \frac{{\psi _{phase} - \psi _{matrix}}}{e}$$
(13)

where ϕ is the vacuum level determined from the average potential in the vacuum gap, Ef is the Fermi level of the system, \(\psi _{phase}\) and \(\psi _{matrix}\) are the WFs of the phase and the Al matrix, respectively, and e is the charge of an electron.
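A minimal numerical sketch of Eqs. (12) and (13) is given below. The function names and the input values are hypothetical. Energies are assumed to be in eV, so dividing the WF difference by the elementary charge yields a potential difference in volts with the same numerical value.

```python
def work_function(vacuum_level_ev, fermi_level_ev):
    """Eq. (12): WF = vacuum level - Fermi level (both in eV)."""
    return vacuum_level_ev - fermi_level_ev


def wf_difference_volts(wf_phase_ev, wf_matrix_ev):
    """Eq. (13): (WF_phase - WF_matrix) / e.

    An energy difference of x eV divided by the elementary charge e
    is a potential difference of x V, so no numerical factor is needed.
    """
    return wf_phase_ev - wf_matrix_ev


# Hypothetical levels for illustration only (eV).
wf_phase = work_function(vacuum_level_ev=2.5, fermi_level_ev=-1.75)
wf_matrix = work_function(vacuum_level_ev=2.25, fermi_level_ev=-1.75)
delta = wf_difference_volts(wf_phase, wf_matrix)
```

A positive difference means the phase has a higher WF than the Al matrix, and a negative one means it has a lower WF.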