## Introduction

The corrosion of metals and alloys remains a significant technological and financial issue globally. Studies of the cost of corrosion, for example those recently conducted in the USA1, China2, and Australia3, reveal that corrosion costs amount to ~3% of GDP annually (which equates to global costs of >US\$1 trillion per annum). As a result, methods for corrosion prevention remain critical. For over half a century, the benchmark for exceptional performance from corrosion-inhibiting compounds has been hexavalent chromium (known as chromate)4. Chromate is a powerful inhibitor, as it can passivate many metals, including Zn, Al and Mg. The mechanism of chromate protection involves the formation of a protective Cr (III) oxide layer on reactive metals from mobile and soluble Cr (VI) oxyanions that can migrate to ‘active’ (anode) sites5. Chromate serves as a corrosion inhibitor in aqueous solutions, and also as an additive to primers used to coat metals (such as steels and galvanised steels).

The International Agency for Research on Cancer (IARC) confirmed in 1990 that hexavalent chromium (Cr (VI)) is a human carcinogen, based on independent studies around the world6. However, the corrosion inhibition performance of chromate-containing primers is such that they remain the current industry benchmark in terms of performance (and, additionally, consumer expectations of product performance). Given the documented concerns regarding the use of chromate and its disposal7,8,9, the evolution toward chromate-free corrosion inhibitors is underway. In a tangible sense, there are already numerous chromate-free corrosion inhibition strategies utilised in consumer products today. The adoption of alternative (chromate-free) approaches is progressing as suitable alternatives to chromate are identified, albeit few are as (i) cost-effective, (ii) passivating, and (iii) applicable across a wide range of metals and alloys as chromate.

A review by Gharbi and co-workers into chromate alternatives concluded that a single alternative to chromate serving as a ‘drop-in’ replacement is unlikely10. The past three decades have seen much research focus on alternatives to chromates. Some of the most widely explored alternatives that have demonstrated promising performance approaching that of chromate-containing inhibitors include rare-earth-based inhibitors11 and rare-earth coatings, vanadate-based coatings12 that are currently utilised in aerospace systems13,14, lithium-containing coatings15, organic coatings, nanocomposites, phosphate coatings16 and metal-rich primers17. Undoubtedly, the search for chromate alternatives remains a very timely topic and a puzzle that is yet to be solved in the field.

The rapidly growing and large-scale material science knowledge base is typically published as archival ‘papers’. In this context, text mining has been one of the most exciting tools in recent years18,19,20,21. Most literature text remains unstructured or semi-structured data (natural language), whose context cannot be readily interpreted by a computer. However, to extract comprehensible and meaningful information from text, supervised natural language processing (NLP) and machine learning methods have been shown to be promising, and have resulted in the exploration of text mining in the field of material science22,23,24,25. Supervised NLP requires part of the corpus (i.e., a body of writing) to be in the form of human-annotated data for training, with the trained model then tested on unlabelled text. Some supervised NLP algorithms include support vector machines (SVM), Bayesian networks (BN), maximum entropy (ME), conditional random fields (CRF), as well as several other algorithms26,27,28,29. However, the vast majority, if not essentially all, of the open literature reports and published data are unlabelled. Therefore, such text and data may be mined by unsupervised NLP algorithms. Clustering, a well-known unsupervised machine learning approach for classifying similar data into groups, has been demonstrated and used to generate machine learning datasets and to identify noisy data in material science30,31.

In the present work, the aim was to apply unsupervised NLP in order to explore the suitability of such methods in aiding the interpretation of the corrosion-related text; specifically seeking to assess if NLP without the need for a human-in-the-loop could be applied to seeking alternatives for chromates.

Tshitoyan et al.32 utilised Word2Vec, an unsupervised word embedding method, to extract underlying structure–property relationships in materials and predict new thermoelectric materials. Word2Vec produces a vector representation of words, which allows similar words to have a similar representation. Although word embedding is one of the most widely used representations of vocabulary, such an approach can only generate one vector for each word. Therefore, Word2Vec models are context-independent, and the different contexts of one word cannot be taken into account. In addition, the Word2Vec model is not capable of handling out-of-vocabulary (OOV) words, since it generates tokens at the ‘word’ level. However, given the promise that Word2Vec has shown to date in the field of materials, in this study we apply the method to the open literature in order to seek possible chromate alternatives. In addition to using Word2Vec, we have also explored the utilisation of a state-of-the-art model, known as bidirectional encoder representations from transformers (BERT).

BERT is a language representation model developed by Google in 2018, enabling the pre-training of deep bidirectional representations from unlabelled text by jointly conditioning on both the left and right context (outlined further below) in all Transformer encoder layers33. In the BERT model, sub-word tokenization is utilised, on the principle that rare words are decomposed into sub-words whilst frequent words should not be split33. This allows the model to process words it has never seen before, meaning BERT is capable of handling out-of-vocabulary words. The BERT tokenizer is based on WordPiece embedding with 30,000 tokens33,34, and implements the following methods: (i) tokenizing (splitting texts into sub-word tokens), converting tokens to integers, and encoding/decoding; (ii) adding new tokens to the vocabulary; and (iii) adding and assigning special tokens: the mask to fill (MASK), separator token (SEP) and classification token (CLS).
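The sub-word principle described above can be illustrated with a minimal sketch of greedy longest-match-first splitting, which is the general idea behind WordPiece. The vocabulary below is a hypothetical toy set, not the 30,000-token BERT vocabulary, and the function name is an assumption made for illustration.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; continuation pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate piece from the right until it is in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no valid split exists for this word
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary: one frequent whole word plus a few sub-words.
toy_vocab = {"chromate", "van", "##ad", "##ate", "##s"}

print(wordpiece_tokenize("chromate", toy_vocab))   # frequent word stays whole
print(wordpiece_tokenize("vanadates", toy_vocab))  # rare word is decomposed
```

In this sketch a word that is in the vocabulary is kept as a single token, while an unseen word is assembled from known pieces, which is how a sub-word model can still represent words absent from its training corpus.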

A commonly used procedure for training models in modern NLP systems is to first pre-train a general model on a large amount of unlabelled data, then fine-tune it on downstream NLP tasks including classification, summarisation, etc. Masked language modelling (MLM) is the pre-training method used for BERT. Taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model to predict the masked words35. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like the generative pre-trained transformer (GPT), in which every token attends only to the context to its left, internally masking future tokens36. The MLM approach allows the model to learn a bidirectional representation of a sentence. Figure 1 shows how MLM works: the input sentence is first tokenized with special tokens and fed into BERT as a sequence. The pre-training data comprise 800M words from BooksCorpus and 2500M words from Wikipedia33.
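As a rough illustration of the masking step described above, the following sketch hides about 15% of the non-special tokens in a sentence and records the original tokens as prediction targets. It is a simplification: the function name, seed and example sentence are assumptions, and the original BERT recipe additionally replaces some selected tokens with random or unchanged tokens rather than always using [MASK].

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace roughly mask_rate of the non-special tokens with [MASK]."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    special = {"[CLS]", "[SEP]"}
    candidates = [i for i, tok in enumerate(tokens) if tok not in special]
    n_mask = max(1, round(mask_rate * len(candidates)))  # mask at least one token
    chosen = set(rng.sample(candidates, n_mask))
    masked = ["[MASK]" if i in chosen else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in chosen}  # targets the model must recover
    return masked, labels


tokens = ["[CLS]", "chromate", "is", "a", "powerful",
          "corrosion", "inhibitor", "[SEP]"]
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

Because the model sees the whole masked sentence at once, the tokens on both sides of each [MASK] are available when predicting it.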

When learning the representation of the [MASK] token, the representation of every other input word can be weighted by α (the attention weight). For example, α = 1 means that every other word has equal weight in the representation. The tokens are passed to the Transformer encoder layers, each of which applies bidirectional self-attention. The outputs then pass through a feed-forward network and on to the next encoder layer. Each output logit vector has the size of the vocabulary and is converted to a probability distribution by applying the softmax function33. The softmax function, Eq. (1), is a normalisation that transforms K input values into K values between 0 and 1 that sum to 1.

$$\sigma(\vec{z})_i = \frac{e^{z_i}}{\sum_{j = 1}^{K} e^{z_j}}$$
(1)

The output values can then be interpreted as probabilities. Predicted tokens are then calculated by applying argmax to the probability distribution. The argmax function, Eq. (2), returns the argument at which a function attains its maximum value. Given a function, f:X → Y, the argmax over subset S of X is defined as

$$\mathrm{argmax}_{S}\, f(x) := \left\{ x \in S : f(s) \le f(x)\ \mathrm{for\ all}\ s \in S \right\}$$
(2)

The argmax is used to locate the token/class with the largest predicted probability. In the case shown in Fig. 1, for example, the predicted results could be ‘wide, large, broad, etc.’37. The fine-tuning part has a similar architecture to pre-training: for each downstream task, the model is fed task-specific inputs, a task-specific output layer is added, and the parameters are fine-tuned.
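The softmax and argmax steps of Eqs. (1) and (2) can be sketched in a few lines. The four-word vocabulary and the logit values below are made up for illustration; a real BERT output has one logit per vocabulary entry.

```python
import math


def softmax(logits):
    """Eq. (1): map K real values to K probabilities that sum to 1."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Eq. (2): index at which the sequence attains its maximum."""
    return max(range(len(values)), key=values.__getitem__)


vocab = ["wide", "large", "broad", "steel"]  # hypothetical toy vocabulary
logits = [2.0, 1.5, 1.0, -1.0]               # hypothetical logits for one [MASK]

probs = softmax(logits)
predicted = vocab[argmax(probs)]
print(predicted, [round(p, 3) for p in probs])
```

Sorting the probabilities instead of taking only the single maximum yields a ranked list of candidate tokens, which is the form in which top predictions are inspected in this kind of analysis.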

## Results

Based on the results from previous work32, an advantageous outcome of the Word2Vec representation was identified as the ability to represent both ‘application words’ and ‘material formulae’ similarly. In the present work, we therefore expect materials that have the same application to be close in cosine distance; i.e. when the cosine similarity between a material representation and the ‘chromate’ representation is high, such materials are very likely to have similar applications to chromate and are therefore candidate alternatives to chromate. Accordingly, closeness in cosine distance is utilised to determine candidate alternatives to chromate when using the Word2Vec model.
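The ranking idea described above can be sketched as follows. The three-dimensional vectors are invented purely for illustration (the trained embeddings have far more dimensions), and the helper names are assumptions rather than part of the study's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(target, embeddings):
    """Rank every other word by cosine similarity to the target word."""
    ref = embeddings[target]
    scores = {w: cosine_similarity(ref, v)
              for w, v in embeddings.items() if w != target}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy embeddings, not values from a trained model.
embeddings = {
    "chromate": [0.9, 0.1, 0.3],
    "vanadate": [0.8, 0.2, 0.4],
    "cerium": [0.7, 0.3, 0.2],
    "steel": [0.1, 0.9, 0.1],
}

ranking = rank_by_similarity("chromate", embeddings)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

In practice the ranked list still needs filtering, since a high similarity only indicates shared context and not necessarily that the term is a material.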

In the utilisation of the BERT MLM model, the different mode of operation of the model was exploited in order to seek chromate alternatives by posing open questions, namely six sentences that each required a [MASK] to be filled. The top predictions arising from the six [MASK] sentences were deemed the candidate alternatives for chromate replacement.

For the Word2Vec model and each of the six masked BERT models, the top 1000 results generated were extracted, and this list was sorted to identify the candidate alternatives that are actually materials (i.e. materials, chemicals, or compounds) and that are relevant to the corrosion domain. Whilst most of the top 1000 results were related to the topic of materials, corrosion, and alloys (to a large extent), we excluded common terms that had no relevance as alternatives (e.g. ‘steel’) and any terms that were not materials (e.g. general conversational words). The numbers of relevant suggestions for chromate alternatives from the Word2Vec and BERT models, which are highly related to corrosion protection, are shown in Fig. 2.

Of the top 1000 entries generated from each model, the Word2Vec approach detected 54 materials (which is inclusive of materials, compounds, or chemicals) that are relevant to serving as suitable alternatives to chromate. Conversely, the BERT approach identified a number of suitable alternatives that varied from 30 to 85—depending on the question asked. From Fig. 2, it is evident that the first three [mask] containing questions from the BERT model, yielded a relatively higher number of relevant results, which was correlated (by the authors) to how the mask-containing sentence was structured. The BERT model sentences with the lowest yield include ‘The best corrosion inhibitor is [mask]’ and ‘The best conversion coating is [mask]’. These two sentences have the prospect of generating a large number of verbs as outputs, such as ‘obtained’, or ‘needed’, or, adjectives as outputs such as ‘available’, or ‘possible’—all of which are reasonable to fill the sentence (but are not relevant materials). If we combine the results from six masked BERT models, the BERT model was capable of predicting 161 individual relevant suggestions for chromate alternatives. A list of each of the ranked chromate alternatives tallied in Fig. 2, is presented in Supplementary Tables 1 and 2.

Of the chromate alternative results predicted by the Word2Vec model, some were the same as the results from the BERT model, with an overlap of 19%. When interpreting the results obtained, it was observed that the results from Word2Vec have all appeared at least once in the corpus. However, the BERT model was not only able to identify some low-frequency results but also identified results that had never appeared in the corpus. For example, ‘cvd’ (chemical vapour deposition) appears only once in the corpus, yet its prediction rank is almost the same as that of ‘nanoparticles’, which appears 61 times, and that of ‘formaldehyde’, ‘acrylate’, etc., which do not exist in the dataset. This is an important insight: because the BERT model uses sub-word tokenization during its training, the model can ‘mix and match’ (between the pre-training and fine-tuning), allowing it to predict words not seen before in the corpus of corrosion protection relevant training data. The pre-training step, which was carried out using technical documents (from the SciBERT database) and Wikipedia (from the chemBERT database), is where the BERT model would have seen such words, and it is then able to use such words following fine-tuning. This indicates that the ability to pre-train BERT models using a vast array of less-specialist text is meaningful, as the BERT model is able to predict in a human-like manner (including out-of-field).

To illustrate how many alternatives have the potential for the replacement of chromate, we compared the results of the predictions from the Word2Vec and BERT MLM model, with a list of benchmark chromate replacements. The list of potential alternatives was derived from three sources, each of which is a culmination of ‘expert’ level human analysis—and years of research and literature analysis10,38,39. The three studies/reports from which the benchmark list of chromate replacements was derived were not utilised in the model training herein and were reserved as independent validation. The benchmark alternative list was curated into 20 categories, ranging from trivalent chromium, rare-earth-based coatings, vanadate-based coatings, Li-containing coatings, organic systems, and phosphate-based systems to Mg-rich primers—as shown in Table 1.

When reviewing the outputs of the Word2Vec and BERT model, the authors manually identified relevant materials (suitable for consideration as chromate replacements) and allocated them to the benchmark category to which they relate—as also seen in Table 1.

To analyse the efficiency of the NLP models to predict chromate alternatives in an automated manner, we summarised the number of benchmark-related results in each category, for each NLP model and present the results in Fig. 3.

Specifically, a total of 45 results (out of 54 relevant results overall) were considered as relevant benchmark alternatives from the Word2Vec model. The Word2Vec model, therefore, exhibited an 83.3% benchmark-related rate, which was the highest rate—when compared to the six masked BERT models. It is also noted from Fig. 3, that the first three masked sentences (from the BERT model) outperform the latter three masked sentences by a factor of nearly two. The results that do not match benchmark alternatives are either materials/alloys (‘magnesium’), substrate materials, or terms that are not materials such as process-related techniques (e.g. ‘PVD’, ‘hard chromium plating’). For example, from the Word2Vec model, such words are epoxy, PVD, hard chromium plating, diamond-like, sol, neodymium, lanthanum, clays, magnesium and Nd.
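The benchmark-related rate quoted above is simply the share of relevant results that map onto a benchmark category. A minimal sketch of the arithmetic, using the Word2Vec counts reported in the text (the function name is an assumption):

```python
def benchmark_related_rate(n_benchmark, n_relevant):
    """Fraction of relevant model outputs that match a benchmark category."""
    return n_benchmark / n_relevant


# Word2Vec counts reported in the text: 45 benchmark-related out of 54 relevant.
rate = benchmark_related_rate(45, 54)
print(f"{rate:.1%}")
```

Because the denominator is the number of relevant results, a model that returns few but well-targeted suggestions can score highly on this rate while covering fewer categories overall.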

To investigate these benchmark chromate alternative results in more detail, we focus on benchmark-related results in each category predicted by Word2Vec and BERT model. Figure 4 reveals the count of benchmark-related results in each of the twenty benchmark categories, isolating the performance of the Word2Vec model (in black) and the BERT model (in red). For this analysis, the BERT model is presented as a summation of the six masked models trialled, in order to examine overall performance.

Inspection of Fig. 4 reveals that the Word2Vec model did not identify four categories: trivalent chromium, titanium conversion coatings, zinc-based coatings and calcium-based coatings; while the BERT model covered all 20 categories, with at least one prediction. One of the categories, ‘silicon-based systems’, showed the highest number of predictions by the Word2Vec model. Another category, ‘organic systems’, revealed the same number of predictions by the BERT model. To further probe the four categories which were only identified by the BERT model, we report the predicted materials in each category and their frequency of occurrence in the original dataset, listed in Table 2. The frequency with which most of these materials were mentioned was relatively high, ranging from 25 to nearly 450 instances. We found that these words never appeared explicitly in the same sentence as ‘chromate’, but they are connected to ‘chromate’ in other ways; for example, ‘hydroxide’ occurs in the same paragraph as ‘Cr’, ‘TCP’ and ‘chromium’.

Overall, whilst the Word2Vec model revealed the highest benchmark-related rate, that rate is only one metric of performance, and is directly linked to the number of predicted results. A more holistic assessment of model performance, from inspection of Fig. 4, suggests that the BERT model outperformed the Word2Vec model in detecting all 20 benchmark chromate alternative categories, including the variety of approaches therein. Whilst not necessarily probed further in the present work, the ability of the BERT model to predict chromate replacement results was also described. Conforming to the prediction of benchmark results, whilst meaningful in the initial exploration and validation of NLP approaches, is a strong confirmation that expert human-level interpretation is achievable in an automated process. From an aspirational perspective, NLP approaches should extend beyond human-level benchmarking and identify results and correlations that have not readily been interpreted by humans.

## Discussion

In this study, natural language processing (NLP) was utilised to automate the search of scientific literature for chromate replacements, specifically in the context of corrosion protection. It was revealed that NLP was capable of searching for chromate replacements without the need for a human to read any of the associated scientific literature. Herein, two NLP approaches were utilised, namely, the Word2Vec approach (previously explored in the field of materials by others) and the BERT approach, recently developed by Google. The latter approach was explored on the basis of its potential in handling out-of-vocabulary words, and its ability to operate by finding alternative words for a [mask] (i.e. the ability to ‘answer questions asked’ of the BERT model). The findings from the study herein can be summarised as follows:

• When comparing the NLP predictions from the work herein (which did not have a human in the loop) with three (3) benchmark studies/reviews from corrosion experts that have proposed a list of chromate replacements, it was determined that:

• The Word2Vec model predicted the most accurate chromate alternative results, by simply calculating the cosine distance.

• The BERT model predicted the most extensive related results in the field, inclusive of even low-frequency terms.

• Both the BERT and Word2Vec models could capture essentially all of the expert human-determined chromate replacement technologies, albeit with no domain experience.

• NLP was able to readily capture scientific knowledge for a niche application, revealing the approaches employed herein—not developed for the application of chromate replacement—can serve as general approaches for broad applications.

• This study presented a descriptive model for summarising chromate replacements from the literature without human annotation, by using NLP. This is a first-attempt report focused on insight into the past and aims to identify materials that experts can identify by extracting existing corrosion knowledge.

• Future work may explore the use of broader inputs, beyond those of the Scopus application programming interfaces, including webpages, and other collections. Future work may explore the use of materials properties, corrosion and protection mechanisms. Specifically, broadening the inputs and including mechanistic facets will possibly permit more chromate replacement predictions that are not in the benchmark alternative list.

## Methods

### Data collection and pre-processing

A total of 5990 entries were collected by accessing and extracting 84 million records from Scopus application programming interfaces (APIs) (https://dev.elsevier.com/). A set of wildcard query terms was introduced to limit acquisition to the relevant topic. Only articles with ‘chrom*’ and ‘replace*’ or ‘substitute’ in their titles, abstracts or keywords were collected. Furthermore, abstracts were filtered by applying the query terms ‘alumin*’, ‘zinc’, ‘magnesium’, ‘alloy’, ‘steel’ or ‘iron’ (to ensure that they were relevant to substrates of interest). Abstracts in languages other than English were removed from the corpus to allow the use of a single language setting (English). A number of articles with copyright limitations or missing passages were also removed, as were articles with content types not corresponding to peer-reviewed publications, leaving 1812 works forming the training dataset for the Word2Vec and BERT architectures.
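The effect of the wildcard filters can be approximated locally with regular expressions, as in the sketch below. This is not Scopus query syntax. It assumes the terms combine as ‘chrom*’ AND (‘replace*’ OR ‘substitute’), which is one reading of the description above, and it matches the non-wildcard terms literally.

```python
import re

# 'chrom*' style wildcards become a stem followed by any word characters.
TOPIC = re.compile(r"\bchrom\w*", re.IGNORECASE)
REPLACEMENT = re.compile(r"\b(?:replace\w*|substitute)\b", re.IGNORECASE)
SUBSTRATE = re.compile(
    r"\b(?:alumin\w*|zinc|magnesium|alloy|steel|iron)\b", re.IGNORECASE
)


def is_relevant(text):
    """Assumed logic: topic AND replacement term AND at least one substrate."""
    return bool(
        TOPIC.search(text) and REPLACEMENT.search(text) and SUBSTRATE.search(text)
    )


print(is_relevant("Chromate replacement coatings on aluminium alloy"))
print(is_relevant("Chromatography of proteins"))
```

A filter of this kind is deliberately broad: ‘chrom*’ also matches unrelated terms such as ‘chromatography’, which is why the additional replacement and substrate terms are needed to narrow the corpus.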

Preprocessing of body text involved removing XML formatting and XML quote tags; leading words such as ‘Abstract’ were also eliminated. In the Word2Vec model, we followed the general preprocessing steps of the unsupervised word embedding study from 2019 (ref. 32). Elements and element names, numbers and units were converted to tokens, such as #element, #nUm, #unit, respectively. Material formulas were normalised alphabetically, such that any chemical formula was simplified regardless of the order of elements. The processed dataset includes one study on each line, and it was tokenized into individual words using ChemDataExtractor40. Chemical formulas were recognised by jointly applying pymatgen41, regular expressions and rule-based techniques. The body text was transformed to lowercase if the token was not a chemical formula or an abbreviation. Abbreviations were identified as tokens in which not only the first letter was uppercase. In the pre-trained BERT-based model, subword tokenization was used, allowing the model to process words it has never seen before. The tokenizer includes 250,000+ tokens from the chemical domain. Rare words were tokenized into meaningful subwords while frequent words were kept as whole-word tokens. BERT takes the whole input as a single sequence. Special tokens [CLS] and [SEP] were used to mark the structure of the input sequence. Besides token embeddings, BERT includes more information for each token through positional embeddings and segment embeddings.
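A simplified sketch of the token normalisation described above is given below. The element and unit lists are small hypothetical subsets, chemical formula recognition (done with pymatgen in the study) is omitted, and the function names are assumptions; the placeholder tokens are those quoted in the text.

```python
import re

ELEMENT_NAMES = {"zinc", "aluminium", "magnesium", "chromium"}  # toy subset
UNITS = {"nm", "mm", "mpa"}                                     # toy subset
NUMBER = re.compile(r"[+-]?\d+(?:\.\d+)?")


def is_abbreviation(token):
    """Rule from the text: some letter other than the first is uppercase."""
    return any(ch.isupper() for ch in token[1:])


def normalise(tokens):
    """Map elements, numbers and units to placeholders; lowercase the rest."""
    out = []
    for tok in tokens:
        low = tok.lower()
        if low in ELEMENT_NAMES:
            out.append("#element")
        elif NUMBER.fullmatch(tok):
            out.append("#nUm")
        elif low in UNITS:
            out.append("#unit")
        elif is_abbreviation(tok):
            out.append(tok)  # abbreviations keep their original case
        else:
            out.append(low)
    return out


sample = ["The", "PVD", "coating", "on", "Zinc", "was", "5", "nm", "thick"]
print(normalise(sample))
```

Collapsing numbers and units into shared placeholders keeps the vocabulary small, so that the embedding model learns from the surrounding words rather than from individual numeric values.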

### Training

In the Word2Vec training, the same gensim model was utilised, following the hyperparameter tuning process performed on 14,042 material science analogy pairs, as shown in the work of Tshitoyan et al.32. Hyperparameters are parameters that control the learning process and are set prior to training. Common hyperparameters include learning rate, optimisation method, loss function, number of hidden layers, batch size, and number of epochs. Batch size is the number of data samples processed before each gradient descent update. A development dataset for the hyperparameter tuning process was created herein, in which 10% of the data were extracted randomly from the original dataset. The optimisation process is a grid search, which searches through a specified set of values in the hyperparameter space. Models were trained with each hyperparameter combination and evaluated by the analogy score, as discussed in the subsequent Evaluation section. The set of hyperparameters with the highest analogy score was: a learning rate of 0.001, an embedding size of 300, a batch size of 128 and 30 epochs. The training was then performed using this set of optimal hyperparameters.

To focus the study on the chemistry domain, we used the pretrained chemical-bert-uncased model from Hugging Face40, and fine-tuned it based on the masked language modelling implementation in Hugging Face, with some modifications42. This pretraining and fine-tuning process is shown in Fig. 5. The chemical-bert-uncased model is pretrained from SciBERT (https://huggingface.co/allenai/scibert_scivocab_uncased) with over 40,000 technical documents from the chemical industry and over 13,000 Wikipedia chemistry articles. The software ‘Wandb’ (for tracking weights and biases) was introduced to track and visualise the hyperparameter tuning process43. Similarly, the best hyperparameters for fine-tuning were selected by training a model on a small parcel of data (the development set) for each hyperparameter combination and calculating the corresponding perplexity, as discussed in Evaluation. The hyperparameter values explored were epoch = (10, 20, 30), batch size = (16, 32), learning rate = (1e−5, 1e−4, 1e−3). As shown in Fig. 6, the best perplexity corresponds to the optimal hyperparameter combination (epoch = 10, batch size = 32, learning rate = 1e−4). We then fine-tuned the model with this optimal combination on the processed abstract dataset.
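The grid search over the values listed above amounts to evaluating every combination and keeping the one with the lowest perplexity. In the sketch below, `toy_perplexity` is a hypothetical stand-in for "fine-tune on the development set and measure perplexity"; it is constructed so that its minimum coincides with the optimum reported in the text and does not reproduce any measured values.

```python
import math
from itertools import product

# Hyperparameter values explored, as listed in the text.
grid = {
    "epochs": (10, 20, 30),
    "batch_size": (16, 32),
    "learning_rate": (1e-5, 1e-4, 1e-3),
}


def grid_search(grid, evaluate):
    """Evaluate every combination and return the one with the lowest score."""
    names = list(grid)
    best_cfg, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        cfg = dict(zip(names, values))
        score = evaluate(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_perplexity(cfg):
    """Placeholder objective, NOT real training: lowest at (10, 32, 1e-4)."""
    return (abs(math.log10(cfg["learning_rate"]) + 4)
            + cfg["epochs"] / 100
            + 1 / cfg["batch_size"])


best_cfg, best_score = grid_search(grid, toy_perplexity)
print(best_cfg)
```

With three epoch values, two batch sizes and three learning rates, the search covers 18 combinations, which is small enough for an exhaustive grid to be practical.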

The fine-tuning process of BERT is notionally considered ‘straightforward’ for various downstream tasks (e.g. classification, sequence labelling and question answering). One or more additional layers are added after the final pre-training layer, and it is typical to freeze the early BERT layers and only train the later layers. Such downstream tasks are usually performed on task-specific labelled texts, and therefore most of the fine-tuning processes in BERT are supervised. However, the fine-tuning in the present study is still unsupervised, instead using masked language modelling with non-labelled corrosion-related text. Pretraining on domain-specific data in NLP tends to yield higher performance44,45,46. Therefore, we apply pre-training on chemical domain data (chemical-bert-uncased model40) and fine-tuning with corrosion domain information (via the Scopus API) to strengthen the understanding of the language model on chromate replacement. Masked language modelling is applied here to predict which word fills a blank in a sentence, a task known as ‘Fill Mask’.

### Evaluation

NLP tasks can generally be validated with measures of accuracy including F-score, root mean squared error (RMSE), etc. However, the evaluation of unsupervised learning can be challenging due to the unlabelled output. This is primarily because common evaluation methods require comparing an output value against a known value. For the Word2Vec model, the evaluation metric is the analogy score, defined as the rate of correctly matched analogies from two chemical and element name pairs. The usual evaluation metrics for MLM are cross-entropy and perplexity. Perplexity (given by Eq. (3)) is a commonly used value to evaluate language models in NLP:

$$PP(W) = 2^{H(W)} = 2^{-\frac{1}{N}\log_2 P(w_1, w_2, \ldots, w_N)}$$
(3)

where H is the cross-entropy, P is the probability assigned by the language model, W = (w1, w2, …, wN) is a sequence of words, and N is the number of words in the sequence. A lower perplexity commonly indicates a better language model with more predictable results.
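Eq. (3) can be evaluated directly once the probability the model assigns to each word is known. The sketch below assumes the joint probability factorises into per-word probabilities, so that its logarithm is the sum of the per-word logarithms; the probability values are illustrative only.

```python
import math


def perplexity(word_probs):
    """Eq. (3): 2 raised to the average negative log2 probability per word."""
    n = len(word_probs)
    cross_entropy = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** cross_entropy


# A model that assigns probability 0.25 to every word is, on average,
# as uncertain as a uniform choice among four words.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

A perplexity of 1 corresponds to a model that predicts every word with certainty, and larger values indicate that the model is effectively choosing among more alternatives at each position.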

Herein, the means by which the Word2Vec and BERT models identify chromate replacement materials were designed. In the Word2Vec study, the cosine distance to the vector for ‘chromate’ was used to represent the probability of a material being a chromate replacement. That is, the chromate replacements are among the materials most similar to chromate, which were determined by the projection of normalised word embeddings. In the BERT experiment, we attempted to identify chromate replacements by ‘filling blanks’ in a sentence, an approach known as Fill Mask. For example, top predictions of potential chromate replacements were sought by filling a [mask], whereby an example is: ‘Chromate can be replaced by [mask]’. Our model randomly masks 15% of the input, runs the whole masked sentence, and outputs the prediction of the masked words. Both sets of predicted results were categorised and compared with a benchmark alternative list, which is summarised from known alternative corrosion preventative technologies, as discussed in the Results. The six masked sentences fed into the BERT model to seek potential alternatives to chromate are listed in Table 3. The ‘can’, ‘may’, ‘chromate’ and ‘perform’ sentences were designed to explore the model providing direct answers for possible chromate replacement materials, while the ‘perform’ sentence had a distinctly different structure and embedded comparison semantics. The ‘inhibitor’ and ‘coating’ sentences, on the other hand, were designed according to the main applications of chromate, whereby the top corrosion inhibitor and conversion coating materials were also seen as potential alternatives.

Herein, the alternatives to chromate predicted were refined to a list of 20 categories of benchmark alternatives, as described in the “Results” section.