Introduction

Materials scientists constantly strive to achieve better understanding, and therefore better predictions, of materials properties. This began with the collection of empirical evidence through repeated experimentation, resulting in mathematical generalizations, theories, and laws. More recently, computational methods have arisen to solve a large variety of problems that were intractable to analytical approaches alone1,2.

As experimental and computational methods have become more efficient, high-quality data has opened up a new avenue to materials understanding. Materials informatics (MI) is the resulting field of research that utilizes statistical and machine learning (ML) approaches in combination with high-throughput computation to analyze the wealth of existing materials information and gain unique insights2,3,4. As this wealth has increased, practitioners of MI have increasingly turned to deep learning techniques to model and represent inorganic chemistry, resulting in approaches such as ElemNet, IRNet, CGCNN, SchNet, and Roost5,6,7,8,9. In some of these approaches, including CGCNN and SchNet, compounds are represented using both their chemical and structural information7,8,10,11,12,13,14,15.

Modeling approaches based on crystal structure are an excellent tool for MI. Unfortunately, many material property datasets lack suitable structural information. An example of this is the experimental band gap data gathered by Zhou et al.16. Conversely, many databases such as the Inorganic Crystal Structure Database (ICSD) and Pearson’s Crystal Data (PCD) contain an abundance of structural information, but lack the associated material properties of the recorded structures. In both cases, the applicability of structure-based learning approaches is limited. This limitation is particularly evident in the discovery of novel materials, since it is not possible to know the structural information of (currently undiscovered) chemical compounds a priori. Structure-agnostic techniques are therefore well-suited to the discovery of novel materials.

A typical approach to structure-agnostic learning is to represent chemistry as a composition-based feature vector (CBFV)17. This allows for data-driven learning in the absence of structural information. The CBFV is a common way to transform chemical compositions into usable features for ML and is generated from the descriptive statistics of a compound’s constituent element properties. Researchers have effectively used CBFV-based ML techniques to generate material property predictions17,18,19,20,21,22,23,24,25.

One potential issue with the CBFV approach lies in the way the element vectors are combined to form the vector describing the chemical compound. Typically, the individual element vectors of the compound are scaled by the element’s prevalence (fractional abundance) in the composition, before being used to form the CBFV. This step assumes that the stoichiometric prevalence of constituent elements in a compound dictates their chemical signal, or contribution, to the material’s property. However, this is not true in all cases; an extreme example of this is element doping. Dopants can be present in very small amounts in a compound, but can have a significant impact on its electrical23,26,27, mechanical20,28,29,30, and thermal properties31,32,33,34. In the case of a typical CBFV approach that uses the weighted average of element properties as a feature, the signal from dopant elements would not significantly change the vector representation of a compound. As a result, the trained ML model would fail to capture a portion of the relevant chemical information available in the full composition.
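To make this dilution effect concrete, the following minimal sketch computes a mean-pooled CBFV from a toy element-property table (the property values, element set, and function names are illustrative only and are not taken from any real featurizer); a 0.1% dopant barely shifts the resulting feature vector.

```python
import numpy as np

# Toy element-property table (hypothetical values for illustration only):
# each element is described by [electronegativity, atomic_radius_pm].
element_props = {
    "Si": np.array([1.90, 111.0]),
    "O":  np.array([3.44,  66.0]),
    "P":  np.array([2.19,  98.0]),  # dopant
}

def weighted_average_cbfv(composition: dict) -> np.ndarray:
    """Mean-pooled CBFV: element vectors scaled by fractional abundance."""
    fracs = np.array(list(composition.values()), dtype=float)
    fracs /= fracs.sum()  # normalize stoichiometry to fractions
    vecs = np.stack([element_props[el] for el in composition])
    return fracs @ vecs  # weighted average over constituent elements

# An undoped and a lightly P-doped composition give nearly identical
# feature vectors, so the dopant's chemical signal is largely lost.
print(weighted_average_cbfv({"Si": 1.0, "O": 2.0}))
print(weighted_average_cbfv({"Si": 0.999, "P": 0.001, "O": 2.0}))
```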

It is apparent that there is no generally accepted best way to model materials property behaviors. Different ML approaches lend themselves to different modeling tasks: CGCNN requires access to structural information, ElemNet operates within the realm of large data, and classical models excel when domain knowledge can be exploited to overcome data scarcity35. To address this diversity of learning challenges, the Automatminer framework of Dunn et al. uses computationally expensive searches to optimize classical modeling techniques. It demonstrates effective learning on some data, but shows shortcomings when deep learning is appropriate36.

In a similar spirit, we seek to overcome general challenges in the area of structure-agnostic learning using an approach we refer to as the Compositionally Restricted Attention-Based network (CrabNet). CrabNet introduces the self-attention mechanism to the task of materials property predictions, and dynamically learns and updates individual element representations based on their chemical environment. To enable this, we introduce a featurization scheme that represents and preserves individual element identities while sharing information between elements. Self-attention is a procedure by which a neural network learns representations for each item in a system based on the other items that are present. In this context, we treat the chemical composition as the system and the elements as the items within that system. This representation enables CrabNet to learn inter-element interactions within a compound and use these interactions to generate property predictions.

To perform self-attention, we use the Transformer architecture, which emerged from natural language processing (NLP) and is based on stacked self-attention layers37,38,39,40,41,42. A typical use of the Transformer architecture in NLP is to encode the meaning of a word given the surrounding words, sentences, and paragraphs. Beyond NLP, other example uses of the Transformer architecture are found in music generation43, image generation44, image and video restoration45,46,47,48,49, game playing agents50,51, and drug discovery52,53. In this work, we explore how our attention-based architecture, CrabNet, performs in predicting materials properties relative to the common modeling techniques Roost, ElemNet, and random forest (RF) for regression-type problems.

Results

The results of this study are described in three subsections. First, we describe the collection of materials property data used for benchmarking CrabNet. Second, we highlight the performance of CrabNet when compared to other current learning approaches which rely solely on composition. Third, we briefly outline how the self-attention mechanism in CrabNet enables visualizations and inspectability unique to attention-based modeling.

Data and materials properties procurement

For this work, we obtained both computational and experimental materials data for benchmarking. Our benchmark data includes materials properties from the Matbench dataset as provided by Dunn et al.36. In addition, we collected materials property data from a number of other works6,54,55,56,57, which we refer to as the Extended dataset. We included 28 benchmark datasets in total: 10 from Matbench and 18 from the Extended dataset, ranging in size from 312 to 341,788 data instances.

The Matbench datasets were split using fivefold cross-validation following the instructions provided in the original publication36. Materials properties in the Extended dataset were split into train, validation, and test datasets using a fixed random seed. For both datasets, several steps were taken to process the original data to be compatible with structure-agnostic learning using CrabNet. Care was taken to ensure that (1) no duplicate compositions were present within any of the train, validation, and test datasets, and that (2) if a composition existed in the train or validation dataset, all compounds with the same composition were removed from the validation and test datasets. To remain comparable with the Automatminer publication36, we applied these processing steps after splitting the data. Please note that since some datasets contain more duplicate compositions than others, these processing steps may affect the train/val/test ratios. For duplicate compositions in the OQMD and MP datasets, the target value associated with the lowest formation enthalpy was selected; for other datasets, the mean of the target values was used. Please see the Supplementary Methods for more details.
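A minimal pandas sketch of this deduplication logic is shown below; the column names and the aggregation helper are assumptions for illustration and do not reproduce the authors' actual processing code (which is available in the mse_datasets repository).

```python
import pandas as pd

def deduplicate_splits(train: pd.DataFrame,
                       val: pd.DataFrame,
                       test: pd.DataFrame,
                       formula_col: str = "formula",
                       target_col: str = "target"):
    """Aggregate duplicate compositions within each split, then drop
    compositions from val/test that already appear in earlier splits."""
    # Collapse duplicates within a split to the mean target value
    # (for OQMD/MP, the paper instead keeps the lowest-formation-enthalpy entry).
    agg = lambda df: df.groupby(formula_col, as_index=False)[target_col].mean()
    train, val, test = agg(train), agg(val), agg(test)

    # Remove leakage: compositions seen in train are dropped from val and test,
    # and compositions seen in val are dropped from test.
    val = val[~val[formula_col].isin(train[formula_col])]
    test = test[~test[formula_col].isin(train[formula_col])]
    test = test[~test[formula_col].isin(val[formula_col])]
    return train, val, test
```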

The full processed benchmark dataset, comprising the Matbench and Extended datasets, was then used with the Roost, CrabNet, ElemNet, and RF models. The training and validation data were used for training and hyperparameter tuning. The test data were held out to provide a fair evaluation of performance metrics across all models. Model performance was evaluated only after all training and hyperparameter tuning was completed. A summary of the datasets is shown in Table 1. All datasets are provided as pre-split csv files to facilitate future comparisons to the metrics reported in this paper. Additional data processing and cleaning details can also be found in the code of the dataset repository mse_datasets58. To maintain consistent and simple benchmark comparisons, we selected data suitable for regression tasks and ignored structural information when present.

Table 1 Benchmark datasets. List of all 28 material properties used to benchmark the ML models in this work, together with the dataset size and the original training, validation, and test set proportions.

Benchmark comparisons

With the benchmark data described above, we generated material predictions using the publicly available code repositories for Roost9, CrabNet59, and ElemNet5. The performance of these benchmarked models is compared using the mean absolute error (MAE) between \({n}\) true values (y) and predicted values (\(\hat{y}\)) as defined by Eq. (1):

$${\rm{MAE}}=\frac{1}{n}\mathop{\sum }\limits_{i = 1}^{n}\left|{y}_{i}-\hat{{y}_{i}}\right|.$$
(1)

This allows for consistent comparison to past works5,6,7,9.
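For completeness, Eq. (1) corresponds to the following one-line NumPy computation:

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error, Eq. (1)."""
    return float(np.mean(np.abs(y_true - y_pred)))
```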

Figure 1 shows the performance metrics from training and testing the models on all the benchmark materials properties outlined above. Here we note that the models for Roost, CrabNet, and ElemNet were all trained using the default model parameters provided with their respective repositories. In contrast to Roost and ElemNet, the default parameters for CrabNet were optimized using validation data from some of the same datasets on which we benchmarked. Although it is possible this offers a small advantage to CrabNet’s performance, we do not expect this to be significant due to CrabNet’s consistently strong performance on all benchmark tasks.

Fig. 1: Benchmark results.
figure 1

MAE scores of Roost, CrabNet, one-hot encoded CrabNet (HotCrab), and ElemNet on the held-out test datasets, compared with the random forest (RF) baseline for (a) the Matbench dataset and (b) the Extended dataset. Cells are colored according to relative MAE performance within each row (blue is better, and red is worse). A NaN (not a number) value is reported for instances where the models failed to converge on a given material property. Here we present model results trained using chemical information (Roost, CrabNet), no chemical information (HotCrab, ElemNet), and a standard CBFV (RF).

We tested two versions of CrabNet. The default CrabNet uses a mat2vec embedding when representing elements, similar to Roost. The second version of CrabNet (HotCrab) uses one-hot encodings (in the form of atomic numbers) and fractional amounts to represent each element in a composition. This is similar to ElemNet, as both models start without any chemical information. The random forest (RF) model utilizes a Magpie-featurized CBFV to represent chemistry. This is included as a performance baseline to match similar works5,9,36.

Overall, we see similar performance between Roost and the two versions of CrabNet tested. Given the different architectures and modeling philosophies of Roost and CrabNet, it is promising that both approaches converge towards the same performance metrics. We also see that Roost and both CrabNet versions achieve consistent and significant improvements to MAE compared to ElemNet and RF approaches. Interestingly, Fig. 1 shows that the use of mat2vec instead of one-hot with CrabNet improves prediction performance on all materials properties except for AFLOW thermal conductivity, MP elastic anisotropy, and those present in the largest datasets (OQMD).

The Matbench data provided by Dunn et al.36 was benchmarked using the Automatminer tool. These metrics are not included in Fig. 1, since all but two (expt_gap and steels_yield) of Automatminer’s models use structural information. Consequently, we focus on these two materials properties when comparing CrabNet’s results to those from Automatminer. For both properties, CrabNet’s structure-agnostic approach outperforms the reported MAE values from Automatminer on the same tasks (expt_gap: 0.416 eV vs. 0.338 eV for CrabNet; steels_yield: 95.2 MPa vs. 91.7 MPa for CrabNet).

The performance of CrabNet on the steels_yield task is particularly interesting. The steels_yield dataset contains compounds with small dopant amounts in large chemical systems (up to 13 elements per composition) and only 312 data points in total. CrabNet’s ability to learn on this data-poor property and outperform all other tested models, including the baseline RF model (which is traditionally better in the data-poor regime), is encouraging. We expected the steels_yield task to be difficult for all deep learning approaches. Nevertheless, repeated training and validation of CrabNet consistently produced error metrics better than the best result obtained by Automatminer (95.2 MPa).

Visualizing self-attention

CrabNet’s modeling and visualization capabilities are enabled by its attention-based learning framework. In statistical ML and many deep learning approaches akin to ElemNet, the chemical composition of a compound is represented as a single CBFV. In contrast, Roost and CrabNet represent a composition as a set of element vectors. Distinct to CrabNet, however, is the Transformer-based self-attention mechanism that learns to update these element vectors using learned attention matrices. In Fig. 2, we show example attention matrices for each attention head of a CrabNet model trained on the property mp_bulk_modulus, using Al2O3 as the example composition. These matrices contain information regarding how each element (rows) is influenced by all other elements in the system as well as itself (columns). The values in these attention matrices are used in the Transformer encoder to update the element vectors. A value of zero means that the element in the column is completely ignored when updating the element in that row. A value of one means that the entire vector update is based solely on that column’s element. Our implementation of CrabNet has three layers, each with four attention heads, with each head using the same data to generate its own independent attention matrix (see “Methods” for more details).
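Attention maps of this kind can be rendered with a few lines of plotting code. The sketch below assumes the per-head attention scores for one composition have already been extracted as an array of shape (n_heads, n_elements, n_elements); how those scores are pulled out of the network depends on the specific implementation, so this helper is a generic illustration rather than CrabNet's own visualization code.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_attention_heads(attn: np.ndarray, element_labels: list[str]):
    """Plot one heatmap per attention head.

    attn: array of shape (n_heads, n_elements, n_elements) holding the
    softmax-normalized attention scores for a single composition.
    """
    n_heads = attn.shape[0]
    fig, axes = plt.subplots(1, n_heads, figsize=(4 * n_heads, 4))
    for h, ax in enumerate(np.atleast_1d(axes)):
        im = ax.imshow(attn[h], vmin=0.0, vmax=1.0, cmap="viridis")
        ax.set_xticks(range(len(element_labels)), element_labels)
        ax.set_yticks(range(len(element_labels)), element_labels)
        ax.set_title(f"head {h}")
    fig.colorbar(im, ax=axes, shrink=0.8)
    plt.show()

# Example: random row-normalized scores for a 2-element system (Al, O), 4 heads.
dummy = np.random.dirichlet(np.ones(2), size=(4, 2))
plot_attention_heads(dummy, ["Al", "O"])
```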

Fig. 2: Visualization of self-attention in one compound.
figure 2

Displayed are the four attention heads (ad) from the first layer of a CrabNet model trained on mp_bulk_modulus and evaluated on the composition Al2O3. Each row represents an element in the system. Each column represents an element being attended to. Each element’s fractional amount is shown on the x-axis. The values in the attention matrix are scores representing element-element interactions for the compound. As an example, in head a, Al0.4 and O0.6 are attending strongly to each other, with attention scores of 1.00 between these two elements.

Shifting our focus to another CrabNet model trained on aflow__Egap data, we show that in addition to visualization of the individual attention heads, we can also generate a global view of attention from the perspective of individual elements. In Fig. 3, we use four periodic tables to visualize, for each attention head, the average attention that silicon dedicates to other elements when they are in the same composition. The darker colored elements can be understood as highly influential when updating silicon’s vector representation.

Fig. 3: Visualization of average attention for one dataset.
figure 3

The average attention from each of the four attention heads (ad) from the first layer of a CrabNet model trained on the aflow__Egap data is shown for systems containing Si. The heatmap shows the average amount of attention that Si dedicates to the other elements in Si-containing compounds. The darker the coloring, the more strongly Si attends to that element. We can see that each attention head exhibits its own behavior, and attends to different groups of elements. Interestingly, head a attends to common n-type dopants and head c attends to many transition metals, whereas heads b and d have unfamiliar element groupings.

Interestingly, each attention head exhibits its own behavior, with some focusing on familiar groups and columns in the periodic table. This behavior lends credibility to CrabNet since there is no inherent reason that data-driven learning should converge to chemical rules that are familiar to materials scientists. Furthermore, the identification of unfamiliar element groupings enabled by the attention-based visualizations may allow us to formulate further research questions to study these inter-elemental interactions.

The preservation of elemental identity within a compound—as a result of the self-attention mechanism—also enables CrabNet to generate property predictions in a way that differs from other approaches shown in the literature. Typically, the element information of a given compound is first collapsed into a single vector and then used to generate the property prediction. In contrast, CrabNet uses each element’s vector representation to directly predict that element’s contribution to the property prediction. Figure 4a shows the average contributions from each element for a CrabNet model trained on AFLOW bulk modulus data. The darker colored elements contribute more towards a compound’s bulk modulus value. Alternatively, elements can be visualized individually using their specific per-element contributions. In Fig. 4b we show distribution plots for lithium’s and tungsten’s contributions to bulk modulus. From these plots, we can see that CrabNet expects lithium to contribute little to the overall bulk modulus, whereas it expects tungsten to contribute substantially. See Supplementary Fig. 3 for additional examples of these element contribution plots. The visualizations from Fig. 4 match closely—and reinforce—expectations regarding which elements most influence bulk modulus behavior in a compound. Exploration of data in this manner hints at the first steps towards model interpretability of CrabNet. We expect these types of property visualizations to be useful for exploring and verifying model and element behavior in detail.

Fig. 4: Overall element contribution to property predictions.
figure 4

Average contribution of all elements to bulk modulus predictions, computed from the AFLOW bulk modulus dataset, (a) plotted on a periodic table and (b) plotted as a distribution showing the per-element contribution amounts of Li and W, respectively, in all the compounds. The darker colored elements in the periodic table contribute more towards a compound’s bulk modulus value.

Finally, with per-element contributions in mind, we can demonstrate changes to CrabNet’s expected material property behavior as a function of chemical composition. To do this, we consider a normalized chemical system consisting of atoms A and B, in the form of AxB1−x. We then generate property predictions for all x ∈ {0.0, 0.02, …, 1.0}. In Fig. 5, we visualize CrabNet’s behavior when predicting band gap of the SixO1−x system using a model trained on the aflow__Egap data.
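Generating the input compositions for such a sweep is straightforward; the sketch below builds formula strings of the form "Si0.3O0.7" (the string format is a hypothetical choice), which would then be featurized and passed to a trained model to trace the predicted property across the binary system.

```python
import numpy as np

def binary_sweep(el_a: str, el_b: str, n_points: int = 51) -> list[str]:
    """Generate formula strings A_x B_(1-x) for x in {0.0, 0.02, ..., 1.0}."""
    formulas = []
    for x in np.linspace(0.0, 1.0, n_points):
        x = round(float(x), 2)
        formulas.append(f"{el_a}{x}{el_b}{round(1.0 - x, 2)}")
    return formulas

print(binary_sweep("Si", "O")[:3])  # ['Si0.0O1.0', 'Si0.02O0.98', 'Si0.04O0.96']
```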

Fig. 5: Element contribution to property prediction as a function of composition.
figure 5

Model predictions over the SixO1−x system using a model trained on the aflow__Egap data. The x-axis is the fractional amount of Si. The y-axis shows the predicted band gap value at a given composition. The blue and red lines are the individual element contributions to the prediction, as predicted by CrabNet. The gray shading represents the aleatoric uncertainty for each prediction.

We first observe that the expected elemental contributions of both oxygen and silicon to the band gap are similar throughout the varied stoichiometry range, with the exception of the peak in the oxygen contribution at around x = 0.7. We also observe that the model predicts a transition of the SixO1−x system between conducting and semiconducting behavior in the range x = 0.5 to x = 0.7. We note that the only training data sample from the SixO1−x system available in the dataset was the composition SiO2. Therefore, the band gap trend predicted here by CrabNet is based on the chemical representations and inter-elemental interactions learned from other elements and systems. The visualization of CrabNet model predictions within a given chemical space is an alternative way to explore model learning and prediction behavior, and may lead to an improved understanding of inter-elemental interactions within a chemical system.

Furthermore, we note that CrabNet’s ability to predict material property trends for a specific chemical system without requiring a large amount of training data from that system is of great benefit. Future studies may investigate this ability for predicting the behavior of new chemical systems from only a sparse sampling of their chemical information. We also believe that transfer learning of trained CrabNet models to other material properties is possible, owing to the ability of the self-attention mechanism to accurately capture inter-elemental interactions. We are confident that these ways of probing and visualizing CrabNet’s modeling process and model predictions will open up further interesting research directions and ultimately lead to more insights in the pursuit of inspectable models.

Discussion

Unique challenges exist when applying ML to materials science. In this paper, we address the limitations of composition-based ML by introducing CrabNet. The CrabNet architecture uses the self-attention mechanism and the EDM representation scheme to perform context-aware learning of materials properties. Using 28 benchmark datasets, we demonstrate CrabNet’s performance compared to Roost, ElemNet, and RF baselines. CrabNet exhibits consistent predictive accuracy across the full range of materials properties tested. Furthermore, we show that the self-attention-based learning technique also provides alternative methods for visualizing model behavior. We demonstrate the use of attention and per-element contribution predictions for visualizing common trends in our trained models that match chemical expectations.

Given this application of self-attention in the context of materials science, we expect that there can be many informative and impactful follow-up works. Specifically, we believe these will largely fall into three thematic categories:

  1.

    CrabNet directly contributing to the community’s focus towards improved property predictions.

    CrabNet consistently generates good MAE scores. The performance achieved with the use of self-attention, combined with the innovative use of element and composition featurization techniques, will allow researchers to delve deeper into analyzing and predicting materials properties. As a result, we believe that CrabNet will be relevant in areas where other ML methods fall short (e.g., dopants, small data, and materials extrapolation tasks). We also note that with minimal changes to CrabNet, it can also perform classification tasks; we expect CrabNet to similarly excel at this.

  2.

    Attention-based models allow for new ways of thinking about materials-specific problems.

    In this work, we briefly examined the attention mechanism. Attention highlights important interactions and may be used to understand which element interactions mediate materials properties. Model explainability has thus far been elusive to the traditional MI paradigms; the inclusion of self-attention in this work has introduced additional methods of model inspectability that may be a step towards this goal.

  3.

    Augmentation of CrabNet using structural and domain-specific knowledge.

    This work intentionally used a compositionally restricted EDM representation with no structural information. Structure-agnostic learning is an important task in MI and CrabNet demonstrates that accurate learning is achievable using the self-attention mechanism. However, the prediction of materials properties using structural information is also an important task. Integration of structural information could be achieved by describing elements in their structural and chemical environments. We expect that the self-attention mechanism of CrabNet will be able to utilize this additional information to make more accurate predictions. This application of attention-based learning to crystal systems is an exciting and promising direction. We also expect that materials prediction tasks involving processing steps or other non-compositional features could be used in this approach. Both of these changes could easily be implemented as extensions to the EDM.

While further research is necessary to fully discern the utility of self-attention in materials problems, we believe that this paper highlights a major new direction in its application in MI and suggests exciting directions for future research.

Methods

Self-attention and the CrabNet architecture

Chemical compositions are input using the atomic numbers and fractional amounts of their constituent elements. The atomic numbers are used to retrieve element representations (either mat2vec or one-hot). The fractional amounts are used to obtain fractional embeddings (described below). An element embedding matrix is generated by applying a fully connected network to the element representations. A fractional embedding matrix is created from the fractional embeddings. These matrices are then added together (element-wise) to generate the element-derived matrix (EDM, see Fig. 6). Each row of the EDM (j-index) represents an element and the columns (k-index) contain the element embeddings. We batch each unique chemical composition onto a third dimension (the i-index). The resulting three-dimensional tensor contains the input data for the CrabNet architecture.
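A minimal sketch of this featurization step is given below, assuming a mat2vec-style element feature length of 200 and d_model = 512 (Table 2); the helper names are hypothetical, and the fractional embedding is treated as a given input here (one possible construction is sketched further below).

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration; d_model = 512 follows Table 2 and
# n_elements is inferred from the largest composition in the dataset.
d_model, n_elements, n_element_feats = 512, 8, 200

elem_proj = nn.Linear(n_element_feats, d_model)  # learned element embedding network

def build_edm(elem_feats: torch.Tensor, frac_embeds: torch.Tensor) -> torch.Tensor:
    """Combine element and fractional embeddings into one zero-padded EDM slice.

    elem_feats:  (n_in_formula, n_element_feats) raw element vectors (e.g. mat2vec)
    frac_embeds: (n_in_formula, d_model) fractional embeddings
    returns:     (n_elements, d_model) EDM slice for one composition
    """
    edm = torch.zeros(n_elements, d_model)
    rows = elem_feats.shape[0]
    edm[:rows] = elem_proj(elem_feats) + frac_embeds  # element-wise sum
    return edm

# A batch of such slices is stacked along a leading dimension to give a tensor
# of shape (n_compounds, n_elements, d_model).
print(build_edm(torch.randn(2, n_element_feats), torch.randn(2, d_model)).shape)
```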

Fig. 6: EDM featurization scheme.
figure 6

Schematic illustration of the element-derived matrix (EDM) representation for Al2O3, where B represents the batch, dmodel is the element feature dimension, and nelements represents the number of elements. Composition slices, when concatenated across the batch dimension i, form an EDM tensor, which is then used as the model input to CrabNet. When a chemical formula has fewer elements than rows in the EDM, the extra rows are filled with zeros.

We use the mat2vec element embeddings60 as the default source of chemical information for each element, although other element property sets are available, such as Jarvis22, Magpie61, Oliynyk18, or a simple one-hot encoding. The mat2vec embedding has the advantage of being pre-scaled and normalized, and of having no missing elements or element features. Regardless of the choice of element representation, the representation must be reshaped to fit the attention input dimension (dmodel). This is done using a learned embedding network; the result is a matrix of size (nelements, dmodel). In addition to the default training of CrabNet using the mat2vec embedding, a one-hot embedding of the elements was used to train an additional CrabNet model (HotCrab) to better facilitate comparison with ElemNet.

The stoichiometric information for each element in the EDM is encoded by a two-part fractional embedding. The fractional embedding is inspired by the positional encoder described in the seminal work by Vaswani et al.37. We use sine and cosine functions of various periods to project the fractional amounts into a high-dimensional space (dimension d = dmodel/2) in which smooth interpolation between fractional values is preserved. The first part of the fractional embedding represents the stoichiometry, using the normalized fractional amounts, on a linear scale with a fractional resolution of 0.01. The second part maps the stoichiometry on a log scale spanning from 1 × 10⁻⁶ to 1 × 10⁻¹. This logarithmic transformation preserves small fractional amounts such as those present in doping. The two parts of the fractional embedding for all elements are concatenated across the embedding dimension to obtain a matrix of size (nelements, dmodel). See Supplementary Figs. 1 and 2 for example visualizations of the EDM embedding.
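One plausible construction of such a two-part fractional embedding is sketched below; the table sizes, index mapping, and reuse of the standard sinusoidal encoding table are assumptions for illustration rather than CrabNet's exact implementation.

```python
import torch

def sinusoidal_table(n_positions: int, d: int) -> torch.Tensor:
    """Standard Vaswani-style sinusoidal encoding table of shape (n_positions, d)."""
    pos = torch.arange(n_positions, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float)
                    * (-torch.log(torch.tensor(10000.0)) / d))
    table = torch.zeros(n_positions, d)
    table[:, 0::2] = torch.sin(pos * div)
    table[:, 1::2] = torch.cos(pos * div)
    return table

def fractional_embedding(fracs: torch.Tensor, d_model: int = 512) -> torch.Tensor:
    """Concatenate a linear-scale and a log-scale sinusoidal embedding of the
    fractional amounts, each of dimension d_model // 2."""
    half = d_model // 2
    lin_table = sinusoidal_table(100, half)           # linear scale, resolution 0.01
    log_table = sinusoidal_table(100, half)           # log scale, 1e-6 to 1e-1
    lin_idx = (fracs.clamp(0.0, 0.999) * 100).long()  # 0.00 ... 0.99 -> rows 0 ... 99
    log_frac = torch.log10(fracs.clamp(1e-6, 1e-1))   # map log10(frac) in [-6, -1]
    log_idx = ((log_frac + 6.0) / 5.0 * 99).long()    # onto rows 0 ... 99
    return torch.cat([lin_table[lin_idx], log_table[log_idx]], dim=-1)

fracs = torch.tensor([0.4, 0.6])              # e.g. Al2O3 -> Al0.4 O0.6
print(fractional_embedding(fracs).shape)      # torch.Size([2, 512])
```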

Once the element and fractional embeddings are calculated and added together, we then batch the finished EDMs across the first dimension. This gives the final input data of shape (ncompounds, nelements, dmodel), where ncompounds is the total number of compounds in a given batch, nelements is the number of rows in the EDM (inferred from the number of elements in the largest composition in a given dataset), and dmodel is the size of the embeddings. Here, we also note that the exact ordering of the element rows (j) in a compound in the EDM does not influence CrabNet due to the permutation-invariant nature of the self-attention mechanism.

CrabNet contains two primary modules with the default hyperparameters as shown in Table 2. The first module is a Transformer encoder with 3 layers and 4 attention heads in each layer. The second module is a residual network that converts element vectors into element contributions.

Table 2 List of default model parameters of CrabNet.

To understand the Transformer encoder, we first describe the self-attention mechanism. During self-attention (Fig. 7a), the EDM is operated on by three fully connected linear networks (FCQ, FCK, and FCV). These networks generate the query Q, key K, and value V tensors. These tensors can be conceptualized as a learned high-dimensional space where the model stores chemical behavior from the training data.

Fig. 7: Schematic of an attention block in the CrabNet architecture.
figure 7

a The initial projection of the input EDM into the Q, K and V tensors. b The scaled dot-product attention operation obtaining the self-attention matrix and the updated Z element representation. The batch dimension is not shown in b to improve legibility.

The K and Q tensors contain information regarding the magnitude to which elements interact. The V tensor stores the information that is used to map from element to property contribution. The dot product of each Q and KT tensor pair (where KT denotes the transpose of K) generates the relative element importances in the system (Fig. 7b). The importances are scaled by the constant \(1/\sqrt{{d}_{{\rm{k}}}}\) and then normalized using a softmax function. This results in the self-attention tensor, commonly referred to as the attention map, which we denote as A. The matrix multiplication of A with V updates the element representations in the compound based on the importance of each element.

Each of the four attention heads independently performs self-attention with its own Qh, Kh, Vh, and Zh tensors, where h denotes the head index for h = 1, …, H. As a result, the network generates four different element representations at each layer. The individual Zh tensors are concatenated across the last dimension to form the Z tensor (as seen in Fig. 8a). The Z tensor is then passed into a linear FC network, which combines the element representations from each head. The output of this FC network is an updated EDM\(^{\prime}\) (for each composition in the batch). This process of converting an EDM into an updated EDM\(^{\prime}\) is referred to as a self-attention block. CrabNet repeats the process of updating the EDM via the self-attention block three times (hence, three layers), resulting in the final updated representations, denoted EDM. This concludes the Transformer encoder module.
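The following compact sketch implements one such self-attention block in the spirit of Fig. 7, with learned Q, K, and V projections, per-head scaled dot-product attention, head concatenation, and an output projection. It is an illustration rather than the authors' implementation; residual connections, layer normalization, and the feed-forward sublayer of a full Transformer encoder layer are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBlock(nn.Module):
    """Per-head scaled dot-product attention over the element rows of an EDM,
    followed by head concatenation and a linear projection back to d_model."""
    def __init__(self, d_model: int = 512, n_heads: int = 4):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.fc_q = nn.Linear(d_model, d_model)
        self.fc_k = nn.Linear(d_model, d_model)
        self.fc_v = nn.Linear(d_model, d_model)
        self.fc_out = nn.Linear(d_model, d_model)

    def forward(self, edm: torch.Tensor) -> torch.Tensor:
        # edm: (n_compounds, n_elements, d_model)
        b, n, _ = edm.shape
        split = lambda t: t.view(b, n, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.fc_q(edm)), split(self.fc_k(edm)), split(self.fc_v(edm))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)  # A
        z = (attn @ v).transpose(1, 2).reshape(b, n, -1)  # concatenate heads
        return self.fc_out(z)  # updated EDM'

# Stacking three such blocks mirrors the three encoder layers of CrabNet.
edm = torch.randn(8, 6, 512)  # 8 compounds, up to 6 elements each
print(SelfAttentionBlock()(edm).shape)  # torch.Size([8, 6, 512])
```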

Fig. 8: Overall CrabNet architecture and prediction of material property and uncertainty.
figure 8

a Schematic of the CrabNet architecture including the input EDM, the self-attention layers (repeated N times), the updated and final element representations (EDM\(^{\prime}\) and EDM), the residual network, and the final model output. b The calculation steps for element contributions and prediction of the targets and uncertainties. The \(p^{\prime}\) and \(u^{\prime}\) vectors represent the element-proto-contributions and the element uncertainties, respectively; \(y^{\prime}\) represents the element contributions. The material property is obtained by taking the mean of the element contributions (\(y^{\prime}\)) for each compound. Similarly, the mean of the element uncertainties (\(u^{\prime}\)) gives the estimated aleatoric uncertainty.

Once the Transformer encoder has updated the element representations, each EDM passes through a fully connected residual network with hidden layer dimensions of \({{\rm{node}}}_{{\rm{res}}}\). The residual network transforms the EDMs into the shape (ncompounds, nelements, 3). We define the resulting three vectors per compound as the element-proto-contributions \(p^{\prime}\), element-uncertainties \(u^{\prime}\), and element-logits (see Fig. 8a). The element scaling factor s is obtained by taking the sigmoid (σ) of the element-logits. The element contributions \(y^{\prime}\) are then obtained by multiplying the element-proto-contributions \(p^{\prime}\) by their respective scaling factors s. Finally, the mean of the element contributions is taken and output as the predicted property value for each compound (see Fig. 8b). Similarly, the mean of the element-uncertainties is used for the aleatoric uncertainty prediction as described by Roost9.
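A sketch of this output stage is shown below. The tensor layout follows the shapes described above; the masking of zero-padded element rows is our assumption about how the mean over elements would be taken, not a detail stated in the text.

```python
import torch

def predict_from_residual_output(res_out: torch.Tensor, mask: torch.Tensor):
    """Turn the residual-network output of shape (n_compounds, n_elements, 3)
    into per-compound property and aleatoric-uncertainty predictions.

    res_out[..., 0]: element-proto-contributions p'
    res_out[..., 1]: element-uncertainties      u'
    res_out[..., 2]: element-logits (scaling factor s = sigmoid(logits))
    mask: (n_compounds, n_elements) boolean, True for real (non-padded) elements.
    """
    p, u, logits = res_out.unbind(dim=-1)
    y_elem = p * torch.sigmoid(logits)             # element contributions y'
    mask = mask.float()
    n_real = mask.sum(dim=-1).clamp(min=1.0)
    y_pred = (y_elem * mask).sum(dim=-1) / n_real  # mean over real elements
    u_pred = (u * mask).sum(dim=-1) / n_real       # aleatoric uncertainty
    return y_pred, u_pred

res_out = torch.randn(8, 6, 3)
mask = torch.zeros(8, 6, dtype=torch.bool); mask[:, :2] = True  # 2-element formulas
y_pred, u_pred = predict_from_residual_output(res_out, mask)
print(y_pred.shape, u_pred.shape)  # torch.Size([8]) torch.Size([8])
```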

Training CrabNet

After the featurization of compositions into EDMs, dataset loading and batching are performed with the built-in Dataset and DataLoader classes from PyTorch. All target values are scaled to zero mean and unit variance for training and inference; the target scaling is then undone for performance evaluation. The batch size during training is dynamically calculated from the training set size for faster training and limited to the range 2⁷–2¹². For inference, the batch size was fixed at 2⁷.
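The sketch below illustrates the target scaling and a batch-size rule of this kind; only the clipping range 2⁷–2¹² is taken from the text, while the specific heuristic mapping training-set size to batch size is an assumption for illustration.

```python
import numpy as np

def dynamic_batch_size(n_train: int, low: int = 2**7, high: int = 2**12) -> int:
    """One possible heuristic: grow the batch size with the training-set size,
    clipped to the range 2^7 to 2^12 (the exact rule used is an assumption)."""
    proposed = 2 ** int(np.log2(max(n_train, 1) / 10 + 1))
    return int(np.clip(proposed, low, high))

def scale_targets(y_train: np.ndarray):
    """Zero-mean, unit-variance target scaling; returns the scaled targets
    and a function that undoes the scaling for evaluation."""
    mean, std = y_train.mean(), y_train.std()
    unscale = lambda y: y * std + mean
    return (y_train - mean) / std, unscale
```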

Model weights are updated using the Lookahead62 and Lamb63 optimizers with a learning rate that is cycled between 1 × 10⁻⁴ and 6 × 10⁻³ every four epochs to achieve consistent model convergence. A robust MAE is used as the loss criterion for model performance9. The default parameters generalize well when predicting most of the benchmark materials properties. Although we expect that hyperparameter optimization may improve CrabNet’s results for individual materials properties, we believe it is more important that materials scientists be able to use CrabNet with little or no adjustment to the underlying code.
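The robust MAE follows the heteroscedastic L1 loss used by Roost; the sketch below reflects our reading of that loss (an L1 term weighted by the predicted aleatoric uncertainty plus a log-uncertainty penalty), and the exact constants should be checked against the Roost reference implementation.

```python
import torch

def robust_l1_loss(pred: torch.Tensor, log_std: torch.Tensor,
                   target: torch.Tensor) -> torch.Tensor:
    """Robust MAE in the style of Roost's robust L1 loss: a Laplace negative
    log-likelihood (up to constants) with a learned per-sample uncertainty."""
    loss = 2.0 ** 0.5 * torch.abs(pred - target) * torch.exp(-log_std) + log_std
    return loss.mean()
```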

It is a known phenomenon that random weight initialization can impact the performance of the Transformer encoder architecture. Thus, to mitigate variance in the performance metrics between different model runs, we trained CrabNet using a fixed random seed of 42 for all training runs across all materials properties. We do note that in the case of random model initialization, the run-to-run variation between different trained models could be exploited to estimate the epistemic uncertainty. Unfortunately, due to the sheer volume of materials properties investigated in this work and the limited compute resources available, we have not explored this here.

Finally, we note that all model training, evaluation, and benchmarking (for CrabNet, Roost, ElemNet, and RF) was conducted on a single workstation PC equipped with an Intel i9-9900K CPU, 32 GB of DDR4 RAM, and two NVIDIA RTX 2080 Ti GPUs with 11 GB VRAM per GPU. The deep learning models were trained on the GPU, while the RF models were trained on the CPU.

Reference models

Predictions for all materials properties were generated using code from the Roost repository9. Minor adaptations were made to the code to allow for automated training and benchmarking. Overall, Roost generates consistently impressive results. Roost relies on a soft-attention mechanism applied to a graph representation of the compound. This is in the same spirit as CrabNet: both approaches seek to generate vector representations for the elements in the system without using structural information. The residual network and robust loss function from Roost were adopted into our architecture9.

Predictions from ElemNet were generated with default parameters using code from the ElemNet repository5. Custom scripts were written to train and evaluate ElemNet over all materials property data. ElemNet consistently under-performed compared to Roost and CrabNet, and failed to converge for multiple properties, resulting in NaN (not a number) values in the model outputs; examples of this can be seen in the phonons and steels_yield datasets. Here, we note that IRNet6 could also have been benchmarked and compared in this study. However, due to the prohibitively large computational requirements, we chose not to train and evaluate IRNet. We do note that the OQMD performance reported in the IRNet publication6 is consistently lower than that of both Roost and CrabNet for the same properties. The following values compare the reported performance of IRNet vs. HotCrab, respectively, for formation enthalpy (0.048 eV vs. 0.031 eV), band gap (0.047 eV vs. 0.048 eV), energy per atom (0.070 eV vs. 0.033 eV), and volume per atom (0.394 Å³ vs. 0.277 Å³).

We generate baseline RF metrics using a random forest regression with the Magpie CBFV as defined by Matminer36. This is done using the scikit-learn Python package. The RF models were trained with n_estimators = 500 and default parameters.
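A minimal sketch of this baseline is shown below; the random features stand in for a real Magpie CBFV matrix, and the random seed and feature count are arbitrary choices for illustration rather than settings taken from this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def rf_baseline(X_train, y_train, X_test, y_test, seed: int = 42) -> float:
    """Random forest baseline with n_estimators = 500 and otherwise default
    scikit-learn parameters; X is assumed to be a Magpie CBFV matrix."""
    rf = RandomForestRegressor(n_estimators=500, random_state=seed, n_jobs=-1)
    rf.fit(X_train, y_train)
    return mean_absolute_error(y_test, rf.predict(X_test))

# Dummy illustration with random features in place of a real Magpie CBFV.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 132))
y = rng.normal(size=200)
print(rf_baseline(X[:150], y[:150], X[150:], y[150:]))
```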