Reliable and Explainable Machine Learning Methods for Accelerated Material Discovery

Material scientists are increasingly adopting the use of machine learning (ML) for making potentially important decisions, such as the discovery, development, optimization, synthesis, and characterization of materials. However, despite ML's impressive performance in commercial applications, several unique challenges exist when applying ML in materials science applications. In such a context, the contributions of this work are twofold. First, we identify common pitfalls of existing ML techniques when learning from underrepresented/imbalanced material data. Specifically, we show that with imbalanced data, standard methods for assessing the quality of ML models break down and lead to misleading conclusions. Furthermore, we show that the model's own confidence score cannot be trusted and that model introspection methods (using simpler models) do not help, as they result in a loss of predictive performance (a reliability-explainability trade-off). Second, to overcome these challenges, we propose a general-purpose explainable and reliable machine-learning framework. Specifically, we propose a novel pipeline that employs an ensemble of simpler models to reliably predict material properties. We also propose a transfer learning technique and show that the performance loss due to the models' simplicity can be overcome by exploiting correlations among different material properties. A new evaluation metric and a trust score to better quantify the confidence in the predictions are also proposed. To improve the interpretability, we add a rationale generator component to our framework which provides both model-level and decision-level explanations. Finally, we demonstrate the versatility of our technique on two applications: 1) predicting properties of crystalline compounds, and 2) identifying novel potentially stable solar cell materials.


A. Motivation
Driven by the success of machine learning (ML) in commercial applications (e.g., product recommendations and advertising), there are significant efforts to exploit these tools to analyze scientific data. One such effort is the emerging discipline of Materials Informatics, which applies ML methods to accelerate the selection, development, and discovery of materials by learning structure-property relationships. Materials Informatics researchers are increasingly adopting ML methods in their workflow to build complex models for a priori prediction of materials' physical, mechanical, optoelectronic, and thermal properties (e.g., crystal structure, melting temperature, formation enthalpy, band gap). While commercial use cases and material science applications may appear similar in their overall goals, we argue that fundamental differences exist in the corresponding data, tasks, and requirements. Applying ML techniques without careful consideration of their assumptions and limitations may lead to missed opportunities at best and a waste of substantial resources and incorrect scientific inferences at worst. In the following, we mention unique challenges that the Materials Informatics community must overcome for universal acceptance of ML solutions in material science.
Learning from Underrepresented/Imbalanced Data: Standard ML algorithms implicitly assume well-represented (i.e., balanced) training data. It is well known that when there is an under-representation of certain classes in the data, standard ML algorithms fail to properly represent the distributive characteristics of the data and provide incorrect inferences across the classes of the data. Unfortunately, in most material science applications, balanced data is exceedingly rare, and virtually all problems of interest involve various forms of extrapolation due to underrepresented data and severe class distribution skews.
As an example, materials scientists are often interested in designing (or discovering) compounds with uncommon targeted properties, e.g., high-Tc superconductivity, a large ZT for improved thermoelectric power, shape memory alloys (SMAs) with the targeted property of very low thermal hysteresis, or a band gap energy in the desired range (0.9-1.7 eV) for solar cells. In such applications, we encounter highly imbalanced data (with targeted materials being in the minority class) due to these design choices or constraints. Consider the task of predicting material properties (e.g., bandgap energy, formation energy, stability, etc.) from a set of feature vectors (or descriptors) corresponding to crystalline compounds. One representative database for such a data set is the Open Quantum Materials Database (OQMD) 1, which contains several properties of crystalline compounds as calculated using density functional theory (DFT).
Note that the OQMD contains data sets with strongly imbalanced distributions of target variables, i.e., material properties. In Figure 1, we plot histograms of several commonly targeted properties. It can be seen that the data set exhibits severe distribution skews; for example, 95% of the compounds in the OQMD are possible conductors with a band gap equal to zero.
Note that if the sole aim of the ML model applied to a classification problem is to maximize overall accuracy, the ML algorithm will perform quite well by simply ignoring or discarding the minority class. In practice, however, correctly classifying and learning from the minority class of interest is more important than possibly misclassifying the majority classes.
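To illustrate, a classifier that always predicts the majority class already looks accurate on data as skewed as the OQMD band-gap distribution. The 95/5 split, labels, and features below are synthetic stand-ins, not OQMD data:

```python
# A trivial majority-class baseline on a ~95/5 imbalanced classification task:
# overall accuracy looks excellent while the minority class is never recovered.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                # stand-in feature vectors
y = (rng.random(1000) < 0.05).astype(int)     # ~5% minority ("non-zero band gap")

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print(f"overall accuracy: {accuracy_score(y, pred):.2f}")  # ~0.95, looks great
print(f"minority recall:  {recall_score(y, pred):.2f}")    # 0.00, useless
```

This is precisely why overall accuracy alone is a misleading objective on imbalanced material data.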
Explainable ML Methods without Compromising the Model Accuracy: A common misconception is that increasing model complexity can address the challenges of underrepresented and distributionally skewed data. However, this can only superficially palliate some of these issues: increasing the complexity of ML models may increase the overall accuracy of the system, but at the cost of making the model very hard to interpret. Scientists, meanwhile, push from the opposite direction, towards understanding rather than merely crunching numbers from big data.
Understanding why an ML model made a certain prediction or recommendation is crucial, since it is this understanding that provides the confidence to make a decision and that will lead to new hypotheses and ultimately new scientific insights. Most of the existing approaches define explainability as the inverse of complexity and achieve explainability at the cost of accuracy.
This introduces a risk of producing explainable but misleading predictions. With the advent of highly predictive but opaque ML models, it has become more important than ever to understand and explain the predictions of such models and to devise explainable scientific machine learning techniques without sacrificing predictive power.
Better Evaluation and Uncertainty Quantification Techniques for Building Trust in ML: Most material science problems are often under-constrained in nature as they suffer from a limitation of representative train/test samples yet involve a large number of physical variables.
For this reason, relying solely on labeled instances available for testing or evaluating trained ML models can often fail to represent the true nature of relationships in material science problems.
Hence, standard methods for assessing and ensuring generalizability of ML models break down and lead to misleading conclusions. In particular, it is easy to learn spurious relationships that look deceptively good on training and test sets (even after using methods such as cross-validation), but do not generalize well outside the available labeled data. A natural solution is to use a model's own reported confidence (or uncertainty) score. However, a model's confidence score alone may not be very reliable. For example, in computer vision, well-crafted perturbations to images can cause classifiers to make mistakes (such as identifying a panda as a gibbon or confusing a cat with a computer) with very high confidence 2. As we will show later, this problem also persists in the Materials Informatics pipeline (especially with distributional skewness).
Nevertheless, knowing when a classifier's (or regressor's) prediction can be trusted is useful in several other applications for building trust in ML. Therefore, we need to augment current error-based testing techniques with additional components to quantify generalization performance of scientific ML algorithms and devise reliable uncertainty quantification methods to establish trust in these predictive models.

B. Literature Survey
In the recent past, the materials science community has used ML methods to build predictive models for several applications 3-16. Seko et al. 11 considered the problem of building ML models to predict the melting temperatures of binary inorganic compounds. The problem of predicting the formation enthalpy of crystalline compounds using ML models was considered recently 4,5,17. Predictive models for crystal structure formation at a given composition are also being developed 6,18-20. The problems of band gap energy prediction for certain classes of crystals 21,22 and mechanical property prediction for metal alloys 14,15 have also been considered in the literature. Ward et al. 23 proposed a general-purpose ML framework to predict diverse properties of crystalline and amorphous materials, such as band gap energy and glass-forming ability.
Thus far, the research on applying ML methods to material science applications has predominantly focused on improving the overall accuracy of predictive modeling. However, imbalanced learning, explainability, and reliability of ML methods in material science have not received significant attention. As mentioned earlier, these aspects pose a real problem in deriving correct and reliable scientific inferences and in the universal acceptance of machine learning solutions in material science, and they deserve to be tackled head on.

C. Our Contributions
In this paper, we take some first steps in addressing the challenge of building reliable and explainable ML solutions for Materials Informatics applications. The main contributions of the paper are twofold. First, we identify some shortcomings in the training, testing, and uncertainty quantification steps of existing ML techniques when learning from underrepresented and distributionally skewed data. Our findings raise serious concerns regarding the reliability of existing Materials Informatics pipelines. Second, to overcome these challenges, we propose general-purpose explainable and reliable ML methods for learning from underrepresented and distributionally skewed data. We propose the following solutions: 1) a novel learning architecture that biases the training process towards the goals of imbalanced domains; 2) sampling approaches that manipulate the training data distribution so as to allow the use of standard ML models; and 3) reliable evaluation metrics and uncertainty quantification methods that better capture the application bias. More specifically, we employ a novel partitioning scheme to enhance the accuracy of our predictions by first partitioning the data into similar groups of materials based on their property values and then training separate, simpler regression models for each group. As opposed to existing approaches, which train an independent regression model per property, we utilize transfer learning by exploiting the correlation among different material properties to improve the regression performance. The proposed transfer learning technique can overcome the performance loss due to the simplicity of the models. Next, to improve the explainability of the ML system, we add a rationale generator component to our framework. The goal of the rationale generator is twofold: 1) provide explanations corresponding to an individual prediction, and 2) provide explanations corresponding to the regression model.
For an individual prediction, the rationale generator provides explanations in terms of prototypes (similar but known compounds). This helps material scientists use their domain knowledge to verify whether similar known compounds or prototypes satisfy the imposed requirements or constraints. For regression models, on the other hand, the rationale generator provides global explanations at the level of whole material sub-classes; this is achieved by providing feature importance for every material sub-class. Finally, we propose a new evaluation metric and a trust score to better quantify confidence and establish trust in the ML predictions.
We demonstrate the applicability of our technique by using it for two applications: 1) predicting five physically distinct properties of crystalline compounds, and 2) identifying potentially stable solar cells. Our vision is that this framework could be used as a basis for creating explainable and reliable ML models based on the balanced/imbalanced data available in the materials databases and, thereby, initiate a major step forward in application of machine learning in materials science.

II. RESULTS AND DISCUSSIONS
The results of this work are described in two major subsections. First, we will discuss the development of our ML method with a focus on reliability and explainability using the data from the Open Quantum Materials Database (OQMD). Next, we will demonstrate the application of this method to two distinct material problems.

A. General-Purpose Reliable and Explainable ML Framework
To solve the problem of reliable learning and inference from distributionally skewed data, we propose a general-purpose ML framework. Instead of developing yet another ML algorithm to improve accuracy for a specific application, our objective is to develop generic methods that improve reliability, explainability, and accuracy in the presence of imbalanced data. The proposed framework is agnostic to the type of training data, can utilize a variety of already-developed ML algorithms, and can be reused for a broad variety of material science problems. The framework is composed of three main components: 1) a novel training procedure for learning from imbalanced data, 2) a rationale generator for model-level and decision-level explainability, and 3) reliable testing and uncertainty quantification techniques to evaluate the prediction performance of ML pipelines.
1) Novel Training Procedure for Learning from Imbalanced Data: Given training data {(X_i, Y_i^1, · · · , Y_i^M)}, where X_i is the feature/attribute vector and Y_i^j is the j-th property value corresponding to compound i, the steps in the proposed training procedure are as follows:
1) Partition the property space into K regions/classes, assigning each compound a class label k ∈ {1, · · · , K}, and obtain the transformed training data. (This partition can also be introduced artificially by imposing constraints on the gradient of the property values so that compounds with similar property values fall in the same class.)
2) For each property j ∈ {1, · · · , M}, perform sub-sampling on the compounds in the K distinct classes to obtain an evenly distributed training set.
3) For each property, train a multi-class classifier to predict which class a compound belongs to.
4) For each (property, class) pair (j, k), train a separate regressor to predict the property values Ŷ_i^j.
5) Finally, utilize the correlation among properties to improve the model accuracy by employing transfer learning (explained next).
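The training steps above can be sketched as follows. The data, partition thresholds, and model choices here are invented stand-ins for a single property (the paper's pipeline uses XGB classifiers and gradient-boosting regressors on 145 Magpie-style attributes):

```python
# Sketch of steps 1-4 for one property: partition the property space into K
# classes, balance the classes by sub-sampling, train a class router, and
# train one simpler regressor per class. Step 5 (transfer learning) omitted.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))                     # stand-in attribute vectors
y = np.abs(X[:, 0] + 0.1 * rng.normal(size=2000))   # synthetic property values

# Step 1: partition the property space into K regions (illustrative thresholds).
edges = [0.5, 1.5]
cls = np.digitize(y, edges)          # class label 0..K-1 for each compound
K = len(edges) + 1

# Step 2: sub-sample each class down to the size of the rarest class.
n_min = np.bincount(cls, minlength=K).min()
idx = np.concatenate([rng.choice(np.flatnonzero(cls == k), n_min, replace=False)
                      for k in range(K)])

# Step 3: multi-class classifier predicts which region a compound falls in.
clf = GradientBoostingClassifier().fit(X[idx], cls[idx])

# Step 4: one simpler regressor per (property, class) pair.
regs = {k: GradientBoostingRegressor().fit(X[cls == k], y[cls == k])
        for k in range(K)}

# Test time: route each compound through its predicted class's regressor.
x_test = X[:5]
y_hat = np.array([regs[k].predict(x[None])[0]
                  for x, k in zip(x_test, clf.predict(x_test))])
print(y_hat.round(2))
```

The routing at the end mirrors the test-time procedure described next: classifier first, then the class-specific regressor.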
At test time, to predict the j-th property of a test compound, the ML algorithm first identifies the class the compound belongs to using the trained j-th multi-class classifier. Then, depending on the predicted class k for property j, the (j, k)-th regressor is used, along with the transfer learning step, to predict the property value of the test compound. Next, we provide details and justifications for each of these steps in our ML pipeline.
Steps 1 to 3 transform a regression problem into a multi-class classification problem on subsampled training data. The change that is carried out has the goal of balancing the distribution of the least represented (but more important) material classes with the more frequent observations. Furthermore, instead of having a single model trained on the entire training set, having smaller and simpler models for different classes of materials helps to gain better understanding of subdomains using the rationale generator (explained later).
Next, we explain the proposed transfer learning technique, which exploits the correlations present among different material properties to improve the regression performance. We devise a simple knowledge-transfer scheme that utilizes the marginal estimates/predictions from step 4, where regressors were trained independently for different properties. Note that, for each compound i, we get independent estimates Ŷ_i ≈ {Ŷ_i^1, · · · , Ŷ_i^M} from step 4. In step 5, we augment the original attribute vector X_i with the independent estimates Ŷ_i, use it as a modified attribute vector, and train regressors for each (j, k) pair. We found that this simple knowledge-transfer scheme significantly improves the regression performance.
2) Rationale Generator: The rationale generator provides both decision-level and model-level explanations.
Decision-Level Explanations: Prototype-based explanations, which explain a prediction through similar but known examples, are a well-established route to user explainability of complex ML models. In our context, for every unseen test example, in addition to the predicted property values, we provide similar experimentally known compounds along with their similarity to the test compound in the feature space. Our feature space is heterogeneous (both continuous and categorical features), so Euclidean distance is not reliable; we therefore quantify similarity using Gower's metric 26. Gower's metric can measure similarity between data containing a combination of logical, numerical, categorical or text entries. The distance is always a number between 0 (similar) and 1 (maximally dissimilar). Furthermore, as a consequence of breaking a large regression problem into a multi-class classification followed by a simpler regression problem, we can also provide the logical sequence of decisions taken to reach a prediction.
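A minimal, from-scratch version of the Gower computation might look as follows. The feature names, types, and ranges below are purely illustrative, not the paper's 145 attributes:

```python
# Minimal Gower distance for mixed feature vectors: range-normalized absolute
# difference for numeric features, 0/1 mismatch for categorical ones, averaged.
# Returns 0 for identical vectors and 1 for maximally dissimilar ones.
def gower_distance(a, b, is_numeric, ranges):
    d = []
    for x, y, num, r in zip(a, b, is_numeric, ranges):
        if num:
            d.append(abs(x - y) / r if r > 0 else 0.0)
        else:
            d.append(0.0 if x == y else 1.0)
    return sum(d) / len(d)

# Two toy "compounds": (mean atomic number, mean electronegativity, crystal system)
u = (26.0, 1.83, "cubic")
v = (29.0, 1.90, "cubic")
is_numeric = (True, True, False)
ranges = (100.0, 3.0, None)   # observed range of each numeric feature

print(round(gower_distance(u, v, is_numeric, ranges), 4))  # → 0.0178
```

Prototypes for a test compound can then be retrieved by ranking known compounds by this distance.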
Model-Level Explanations: Knowing which chemical attributes are important in a model's prediction (feature importance) and how they are combined can be very powerful in helping material scientists understand and trust automatic ML systems. Due to the structure of our pipeline (classification followed by regression), we can provide more fine-grained feature importance explanations than a single regression model. Specifically, we break the feature importance of attributes for predicting a material property into: 1) feature importance for discriminating among different material classes (inter-class), and 2) feature importance for regression on a material sub-domain (intra-class). This provides a more in-depth explanation of the property prediction process. Furthermore, we can also provide simple classification rules for different material classes using a decision tree classifier. A decision tree combines simple questions about the materials data in an interpretable way and helps users gain an understanding of the decision-making and prediction process.
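As a sketch of these two explanation modes with scikit-learn, using hypothetical feature names and synthetic labels:

```python
# Model-level explanations: (1) inter-class feature importances from a fitted
# gradient-boosting classifier, and (2) human-readable classification rules
# from a shallow decision tree. Feature names are invented for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
names = ["mean_electronegativity", "mean_atomic_volume", "melting_T", "n_elements"]
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)   # synthetic material classes

# Which attributes separate the material classes (inter-class importance).
clf = GradientBoostingClassifier().fit(X, y)
for n, w in sorted(zip(names, clf.feature_importances_), key=lambda t: -t[1]):
    print(f"{n:>24s}: {w:.3f}")

# Simple, interpretable classification rules from a shallow surrogate tree.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=names))
```

The same importance extraction applies to the per-class regressors, which is what yields the intra-class view.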

3) Robust Model Performance Evaluation and Uncertainty Quantification:
The distributionally skewed training data biases the learning system towards solutions that may not be in accordance with the user's end goal. Most existing learning systems work by searching the space of possible models with the goal of optimizing some criteria (or numerical score). These metrics are usually related to some form of average performance over the whole train/test data and can be misleading in cases where sampled train/test data is not representative of the true distribution. More specifically, commonly used evaluation metrics (such as mean squared error, R-squared error, etc.) assume an unbiased (or uniform) sampling of the test data and break down in the presence of distributionally skewed test data (shown later). Therefore, we propose to perform class specific evaluations (by partitioning the property space into multiple classes of interest) which better characterizes the predictive performance of ML models in the presence of distributionally skewed data. We also recommend visualizing predicted and actual property values in combination with the numeric scores to build a better intuition about the predictive performance.
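A small synthetic illustration of why per-class evaluation matters. The class ratio and error pattern below are invented, but mimic the band-gap skew discussed earlier:

```python
# An overall error metric can look excellent while the minority class is
# predicted poorly; evaluating per class exposes the failure.
import numpy as np
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(950),                # majority: zero band gap
                         rng.uniform(0.5, 6.0, 50)])   # minority: finite gap
y_pred = np.where(y_true == 0, 0.0, y_true * 0.3)      # model collapses minority

print(f"overall MAE: {mean_absolute_error(y_true, y_pred):.3f}")   # looks small
for name, mask in [("majority", y_true == 0), ("minority", y_true > 0)]:
    print(f"{name} MAE: {mean_absolute_error(y_true[mask], y_pred[mask]):.3f}")
```

The overall MAE is dominated by the easy majority class, while the minority-class MAE, the one that matters for discovery, is an order of magnitude worse.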
Note that having a robust evaluation metric only partially solves the problem, as ML models are susceptible to over-confident extrapolations. As we will show later, in imbalanced learning scenarios, ML models make over-confident extrapolations that have a higher probability of being wrong (e.g., predicting a conductor to be an insulator with 99% confidence). In other words, a model's own confidence score cannot be trusted. To overcome this problem, we use a set of labeled, experimentally known compounds as side information to help determine the model's trustworthiness for a particular unseen test example. The trust score T takes into account the average Gower distance d from the test sample X_i to other samples in the same class c_i versus the average Gower distance to nearby samples in other classes. T ranges from 0 to 1, where a higher value indicates a more trustworthy prediction.
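Since equation (1) is not reproduced in this excerpt, the sketch below assumes a common distance-ratio form, T = d_other / (d_same + d_other), which lies in [0, 1] and grows when the test point is far from other classes; it also uses a Euclidean stand-in for the Gower metric used in the paper:

```python
# A distance-ratio trust score (assumed form, not the paper's exact equation):
# compare the average distance from a test point to reference samples of its
# predicted class against the average distance to samples of other classes.
import numpy as np

def trust_score(x, X_ref, labels, predicted_class):
    d = np.linalg.norm(X_ref - x, axis=1)
    d_same = d[labels == predicted_class].mean()
    d_other = d[labels != predicted_class].mean()
    return d_other / (d_same + d_other)   # in [0, 1]; higher = more trustworthy

rng = np.random.default_rng(0)
X_ref = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
labels = np.array([0] * 50 + [1] * 50)

x_near = np.zeros(3)        # deep inside class 0: should score high
x_far = np.full(3, 2.5)     # between the classes: should score lower
print(round(trust_score(x_near, X_ref, labels, 0), 2))
print(round(trust_score(x_far, X_ref, labels, 0), 2))
```

In the paper's setting, X_ref would be the experimentally known ICSD compounds and the distances would be Gower distances.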

B. Example Applications
In this section, we discuss two distinct applications of our reliable and explainable ML pipeline to demonstrate its versatility: predicting five physically distinct properties of crystalline compounds, and identifying potentially stable solar cells. In both cases, we use the same general framework, i.e., the same attributes and ML pipeline. DFT calculations are accurate but computationally expensive; ML methods, on the other hand, offer the promise of property predictions at several orders of magnitude faster rates than DFT. Thus, we explore the use of data from the OQMD DFT-calculation database as training data for ML models that can rapidly assess many more materials than would be feasible to evaluate using DFT.
Data Set: The OQMD contains several properties of approximately 300,000 crystalline compounds as calculated using DFT. The diversity and scale of the data in the OQMD make it ideal for studying the performance of general-purpose ML models using a single, uniform data set. We select a subset of 228,573 compounds from the OQMD that represents the lowest-energy compound at each unique composition and use them as our training set. Building on existing strategies 23, we use a set of 145 attributes/features to represent each compound. Using these features, we consider the problem of developing reliable and explainable ML models to predict five physically distinct properties currently available through the OQMD: bandgap energy (eV), volume/atom (Å3/atom), energy/atom (eV/atom), thermodynamic stability (eV/atom), and formation energy (eV/atom) 27. Units for these properties are omitted in the rest of the paper for ease of notation.
A detailed description of the 145 attributes (used as inputs) and 5 properties (used as outputs) are provided in the Supplementary Materials.

Method:
We quantify the predictive performance of our approach using 5-fold cross-validation. Following the training procedure described above, we train multi-class classifiers and class-specific marginal regressors for each property, and then train a second set of regressors on the augmented data (we refer to them as joint regressors, as they exploit the correlation present among properties to improve the prediction performance).
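The marginal-then-joint training can be sketched as follows. The data and correlation structure are synthetic, and for brevity the marginal predictions are made in-sample; a faithful implementation would generate them out-of-fold to avoid leakage:

```python
# Knowledge-transfer step: append marginal per-property predictions to the
# feature vector, then train joint regressors on the augmented features so a
# correlated property's estimate can inform the prediction.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 8))
y1 = X[:, 0] + 0.1 * rng.normal(size=1500)               # property 1
y2 = 2.0 * y1 + X[:, 1] + 0.1 * rng.normal(size=1500)    # property 2, correlated

# Step 4: independent (marginal) regressors, one per property.
m1 = GradientBoostingRegressor().fit(X, y1)
m2 = GradientBoostingRegressor().fit(X, y2)
X_aug = np.column_stack([X, m1.predict(X), m2.predict(X)])

# Step 5: the joint regressor for property 2 sees property 1's estimate.
r2_marginal = cross_val_score(GradientBoostingRegressor(), X, y2, cv=5).mean()
r2_joint = cross_val_score(GradientBoostingRegressor(), X_aug, y2, cv=5).mean()
print(f"marginal R^2: {r2_marginal:.3f}, joint R^2: {r2_joint:.3f}")
```

With genuinely correlated properties, the augmented features give the joint regressors a head start that the independent models lack.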
Results: For the conventional scheme, we train M independent GBR regressors to directly predict properties from the features corresponding to the compounds. In Table I, we report different error metrics to quantify the regression performance using cross-validation. Note that these metrics report an accumulated/average error score on the test set (which comprises compounds from all partitions of the properties). These results are comparable to the state of the art 23 and suggest that conventional regressors have excellent regression performance (low MAE/MSE and a high R2 score). Relying on the inference made by this evaluation method, we may be tempted to use these regression models in practice for different applications (such as screening or discovery of novel solar cells). However, we next show that these metrics provide misleading inferences in the presence of distributionally skewed data. In Table II(a), we report class-specific evaluations of these conventional regressors for bandgap energy and stability prediction, where the data distribution is highly skewed (see Fig. 1). Unfortunately, the test data is also distributionally skewed and is not representative of the true data distribution. Thus, standard methods for assessing and ensuring the generalizability of ML models break down and lead to misleading conclusions (as shown in Table I). Class-specific evaluations, on the other hand, better characterize the predictive performance of ML models in the presence of distributionally skewed data.
In Table II(b), we report the class-specific performance of the simpler marginal regressors in our pipeline. Some performance is lost compared to having a single complex model (as given in Table II(a)), which suggests that there is a trade-off between simplicity/explainability and accuracy.
Finally, Table II(c) shows how this performance loss due to the simplicity of the models can be overcome using the transfer learning (or correlation-based fusion) step in our pipeline. We observe that the proposed transfer learning technique can exploit correlations in the property space very well, resulting in a significant performance gain compared to the conventional regression approach. Note that this gain is achieved in spite of having simpler and smaller models in our ML pipeline, which suggests that a user can achieve high accuracy without sacrificing explainability. We also observed that the sub-sampling step in our pipeline had a positive impact on the regression performance for the minority classes.
Furthermore, our pipeline also quantifies uncertainties in its predictions providing a confidence score to the user. We show an illustration of the uncertainty quantification of bandgap energy and stability predictions on 50 test samples in Figure 3. It can be seen that regressors perform poorly in regions with high uncertainty.
We would also like to point out that in cases where the data from a specific class is heavily under-represented, none of the model design strategies will improve the performance and generating new data may be the only possible solution (e.g., bandgap energy prediction for minority classes). In such cases, relying solely on cross-validation score or confidence score may not provide reliable inference (shown later). To overcome this challenge, explainable machine learning can be a potentially viable solution.
Next, we show the output of the rationale generator in our pipeline. Specifically, we provide 1) model-level explanations, as well as 2) decision-level explanations for each sub-class of materials. For model-level explanations, our pipeline provides feature importance for both the classification and regression steps. Feature importance provides a score that indicates how useful (or valuable) each feature was in the construction of the model: the more an attribute is used to make key decisions within a (classification/regression) model, the higher its relative importance.
This importance is calculated explicitly for each attribute in the data set, allowing attributes to be ranked and compared to each other. In Fig. 4, we show the feature importance for bandgap energy prediction. The melting temperatures of the constituent elements rank among the most important attributes: melting temperature reflects the strength of inter-atomic bonding, and the band structure changes as a function of inter-atomic forces, which are correlated with melting temperature. Similarly, in a multi-element material system, as the electronegativity difference between different atoms increases, so does the energy difference between bonding and anti-bonding orbitals. Therefore, the bandgap energy increases as the electronegativities of the constituent elements increase; thus, the bandgap energy has a strong correlation with the electronegativity of the constituent elements. Finally, the mean volume per atom of the constituent elements is correlated with the inter-atomic distance in a material system. As explained above, inter-atomic distance is negatively correlated with the bandgap energy, and so is the mean volume per atom of the constituent elements. Similar feature importance results for class-specific predictors can also be obtained (see Supplementary Material).
Our rationale generator also provides decision-level explanations. Specifically, for every unseen test example, in addition to predicted property value, we provide similar experimentally known compounds (or prototypes) with corresponding distances to the test compound. These prototypes are extremely useful in identifying if the ML model is making over-confident extrapolation which has higher probability of being wrong.
In Table III, we show 4 test compounds with ground truths (class, bandgap energy value), predictions (class, bandgap energy value), and corresponding confidence scores. It can be seen that both classifier and regressor make wrong and over-confident predictions on minority classes (i.e., classes 1 and 2). In other words, a higher confidence score from the model for minority class does not necessarily imply higher probability that the classifier (or regressor) is correct.
For compounds in minority classes, the ML model may simply not be the best judge of its own trustworthiness. On the other hand, the proposed trust score (as given in (1)) consistently outperforms the classifier's/regressor's own confidence score: a higher/lower trust score implies a higher/lower probability that the classifier (or regressor) is correct. Furthermore, as our trust score is computed using distances from experimentally known compounds in the Inorganic Crystal Structure Database (ICSD) 28, it also provides some confidence in a compound's amenability to synthesis.

2) Identifying Novel Potentially Stable Solar Cells:

Data Set: As before, for the training data we selected the subset of 228,573 compounds from the OQMD that represents the lowest-energy compound at each unique composition, and we use the same 145 attributes. Using these attributes/features, we consider the problem of developing reliable and explainable ML models to predict two physically distinct properties of stable solar cells: bandgap energy and stability. Note that this experiment is more challenging and practical than that of Ward et al. 23, where the training data set was restricted to compounds reported in the ICSD as experimentally realizable (a total of 25,085 entries), so that only bandgap energy, and not stability, needed to be considered. As the test data set, we take the compounds suggested by Meredig et al. 5 to be as-yet-undiscovered ternary compounds (4,500 entries), which are not yet in the OQMD.

Method: Following the procedure mentioned in Sec. II-A1, we partition the property space for each property into K = 3 classes. The decision-boundary thresholds for class separation are as follows: bandgap energy (0.9, 1.7) and stability (0.0, 1.5). Similar to Sec. II-B1, we use Extreme Gradient Boosting (XGB) classifiers for the multi-class (K = 3) classification and Gradient Boosting Regressors (GBRs) for the marginal and joint regression. We use the models' own confidence and the trust score to rank the potentially stable solar cells.
Results: We used the proposed ML pipeline to search for new stable compounds (i.e., compounds not yet in the OQMD). Specifically, we use the trained models to predict the bandgap energy and stability of the compositions suggested by Meredig et al. 5 to be as-yet-undiscovered ternary compounds. The top-ranked candidates (shown in Table IV) can also serve as an initial guess for follow-up DFT calculations and experimental investigation.

III. CONCLUSIONS
In this paper, we considered the problem of learning reliable and explainable machine learning models from underrepresented and distributionally skewed materials science data. We identified common pitfalls of existing ML techniques when learning from imbalanced data, and we showed how applying ML techniques without careful consideration of their assumptions and limitations can lead to both quantitatively and qualitatively incorrect predictive models. To overcome the limitations of existing ML techniques, we proposed a general-purpose explainable and reliable ML framework for learning from imbalanced material data. We also proposed a new evaluation metric and a trust score to better quantify confidence in the predictions. The rationale generator component in our pipeline provides useful model-level and decision-level explanations to establish trust in the ML model and its predictions. Finally, we demonstrated the applicability of our technique on predicting five physically distinct properties of crystalline compounds and identifying potentially stable solar cells.

IV. MATERIALS AND METHODS
All machine learning models were created using the Scikit-learn 29 Python library.

VII. COMPETING INTERESTS
The authors declare no conflict of interest.

A. Attributes and Properties
The first step of our pipeline is to compute attributes (or chemical descriptors) based on the composition of materials. These attributes should be descriptive enough to enable an ML algorithm to construct general rules that can possibly "learn" chemistry. Building on existing strategies 23, we use a set of 145 attributes/features to represent each compound. These attributes comprise stoichiometric attributes, elemental statistics, electronic structure attributes, and ionic compound attributes. A detailed procedure to compute these attributes can be found in The Materials Agnostic Platform for Informatics and Exploration (Magpie) 23.
Using these features, we consider the problem of developing reliable and explainable ML models to predict five physically distinct properties currently available through the OQMD: bandgap energy (eV), volume/atom (Å3/atom), energy/atom (eV/atom), thermodynamic stability (eV/atom), and formation energy (eV/atom). Formation energy is simply the total energy/atom minus some correction factors (i.e., the material with the lowest formation energy at each composition also has the lowest energy per atom). Stability indicates whether a particular material is thermodynamically stable: compounds with a negative stability are stable and those with a positive stability are unstable. More information on the output properties is provided by Emery et al. 27,31.

B. Feature Importance for Class-specific Regression
Feature importance results for class-specific predictors can also be obtained.
In Fig. 5, we show the feature importance for the formation energy prediction regressors for all 3 classes. For all three classes, the thermodynamic stability is found to be the most important attribute in predicting the formation energy. From a thermodynamic point of view, this makes sense, as stability is negatively correlated with formation energy. More results are provided in the Supplementary Information associated with this manuscript.

D. Other
The software, training data sets and input files used in this work are provided in the Supplementary Information associated with this manuscript.