Introduction

Since Frank Rosenblatt created the perceptron1, machine learning (ML) applications have been used to emulate human intelligence. The field has grown immensely with the advent of ever more powerful, ever smaller computers combined with the development of robust statistical analyses. These advances allowed Deep Blue to beat Grandmaster Garry Kasparov in chess and Watson to win the game show Jeopardy! The technology has since progressed to more practical applications such as advanced manufacturing and the common tasks we now expect from our phones, like image and speech recognition. The future of ML promises to obviate much of the tedium of everyday life by assuming responsibility for more and more complex processes, e.g., autonomous driving.

When it comes to scientific application, our perspective is that current ML methods are just another component of the scientific modeling toolbox, with a somewhat different profile of representational basis, parametrization, computational complexity, and data/sample efficiency. Fully embracing this view will help the materials and chemistry communities to overcome perceived limitations and at the same time evaluate and deploy these techniques with the same level of rigor and introspection as any physics-based modeling methodology. Toward this end, in this essay we identify four areas in which materials researchers can clarify our thinking to enable a vibrant and productive community of scientific ML practitioners:

  1. Maintain perspective on resources required

  2. Openly assess dataset bias

  3. Keep sight of the goal

  4. Dream big enough for radical innovation

Maintain perspective on resources required

The recent high-profile successes in mainstream ML applications enabled by internet-scale data and massive computation2,3 have spurred two lines of discussion in the materials community that are worth examining more closely. The first is an uncritical and limiting preference for large-scale data and computation, under the assumption that successful ML is unrealistic for materials scientists whose datasets are orders of magnitude smaller than those at the forefront of the publicity surrounding deep learning. The second is a tendency to dismiss brute-force ML systems as unscientific. While there is some validity to both viewpoints, there are opportunities in materials research for productive and creative ML work with small datasets and for the "go big or go home" brute-force approach.

Molehills of data (or compute) are sometimes better than mountains

A common sentiment in the contemporary deep-learning community is that the most reliable means of improving the performance of a deep-learning system is to amass ever larger datasets and apply raw computational power. This can encourage the fallacy that large-scale data and computation are fundamental requirements for success with ML methods, which can lead to needlessly deploying massively overparameterized models when simpler ones may be more appropriate4 and limits the scope of applied ML research in materials by biasing the set of problems people are willing to consider. There are many examples of productive, creative ML work with small datasets in materials research that counter this notion5,6.

In the small-data regime, high-quality data with informative features often trump excessive computational power applied to massive data with weakly correlated features. A promising approach is to exploit the bias-variance trade-off by performing more rigorous feature selection or crafting a more physically motivated model form7. Alternatively, it may be wise to reduce the scope of the ML task by restricting the material design space or to use ML to solve a smaller chunk of the problem at hand. ML tools for exploratory analysis with appropriate features can help us comprehend much higher dimensional spaces even at an early stage of research, providing a bird's-eye view of the target problem. For example, cluster analysis can help researchers identify representative groups in large high-throughput datasets, making the process of formulating hypotheses more tractable.
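As a minimal, purely illustrative sketch of such exploratory cluster analysis, the snippet below groups a synthetic feature matrix with k-means and summarizes the resulting clusters in a reduced space; the dataset, feature count, and number of clusters are placeholders rather than recommendations.

```python
# Minimal sketch of exploratory cluster analysis on a (synthetic) high-throughput
# feature matrix; in practice X would hold composition/processing descriptors.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))             # placeholder: 500 samples, 12 features

X_std = StandardScaler().fit_transform(X)  # put features on a common scale
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_std)

# Project to 2D for a bird's-eye view of the candidate groups.
scores = PCA(n_components=2).fit_transform(X_std)
for k in np.unique(labels):
    print(f"cluster {k}: {np.sum(labels == k)} samples, "
          f"centroid in PC space = {scores[labels == k].mean(axis=0).round(2)}")
```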

There are also specific ML disciplines aimed at addressing the well-known issues of small datasets, dataset bias, noise, incomplete featurization, and over-generalization, and there has been some effort to develop tools to address them. Data augmentation and other regularization strategies can allow even small datasets to be treated with large deep-learning models. Another common approach is transfer learning, where a proxy model is trained on a large dataset and adapted to a related task with fewer data points8,9,10. Chen et al.11 showed that multi-fidelity graph networks can use comparatively inexpensive low-fidelity calculations to bolster the accuracy of ML predictions for expensive high-fidelity calculations. Finally, active learning methods are now being explored in many areas of materials research: surrogate models are initialized on small datasets and updated as their predictions are used to guide the acquisition of new data, often in a manner that balances exploration with optimization12. Generally, a solid understanding of the uncertainty in the data is critical for success with these strategies, but ML systems can lead us to insights or serve as a guide for optimization problems that might otherwise be intractable.
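A schematic version of such an active-learning loop is sketched below, using a Gaussian-process surrogate and an expected-improvement acquisition function on a toy one-dimensional objective; the objective function, candidate grid, and loop length are illustrative stand-ins for a real experiment or simulation campaign.

```python
# Schematic active-learning loop: a Gaussian-process surrogate is refit as new
# "measurements" arrive, and the next point is chosen by expected improvement.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def objective(x):                     # stand-in for an experiment or simulation
    return np.sin(3 * x) + 0.1 * np.random.default_rng(1).normal(size=np.shape(x))

X_pool = np.linspace(0, 3, 200).reshape(-1, 1)   # candidate conditions
X_obs = X_pool[[10, 100, 190]]                   # small initial dataset
y_obs = objective(X_obs).ravel()

gp = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(1e-2), normalize_y=True)
for _ in range(10):
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(X_pool, return_std=True)
    best = y_obs.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
    x_next = X_pool[[np.argmax(ei)]]                        # exploit/explore balance
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

print("best observed value:", y_obs.max().round(3))
```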

We assert that the materials community would generally benefit from taking a more model-oriented approach to applied ML, in contrast to the popular prediction-oriented approach that many method-development papers take. In the current prediction-oriented application of ML to the physical sciences, the primary intent of the model is to obtain property predictions, often for screening or optimization workflows. We propose that the community would be better served by instead using ML as a means to generate scientific understanding, for instance by using inference techniques to quantify physical constants from experiments. To achieve the goals of scientific discovery and knowledge generation, predictive ML must often play a supporting role within a larger ecosystem of computational models and experimental measurements. It can be productive to reassess13 the predictive tasks we are striving to address with ML methods; more carefully thought-out applications may provide more benefit than simply collecting larger datasets and training higher-capacity models.

Massive computation can be useful but is not everything

On the other hand, characterizing brute computation as "unscientific" can lead to missed opportunities to meaningfully accelerate and enable new kinds or scales of scientific inquiry14. Even without investment in massive datasets or specialized ML models, there is evidence that simply increasing the scale of computation can help compensate for small datasets. For example, ref. 15 shows that, simply by increasing the number of training iterations, large object-detection and segmentation models trained from random initialization can match the performance of the conventional transfer-learning approach. In many cases, advances enabled in this way do not directly contribute to scientific discovery or development, but they absolutely change the landscape of feasible scientific research by lowering the barrier to exploration and increasing the scale and automation of data analysis.

A perennial challenge in structural biology is predicting the structure of proteins, but recent advances in learned potential methods16 have provided paradigm-shifting improvements in performance made possible by sheer computational power. In addition, massive computation can enable new scientific applications through scalable automated data-analysis systems. Recent examples include phase identification in electron backscatter diffraction17 and X-ray diffraction18, and local structural analysis via extended X-ray absorption fine structure19,20. These ML systems leverage extensive precomputation through the generation of synthetic training data and the training of models; this makes online data analysis possible, removing barriers to more adaptive experiments enabled by real-time decision making.

In light of the potential value of large-scale computation in advancing fundamental science, the materials field should make computational efficiency21 an evaluation criterion alongside accuracy and reproducibility22. Comparison of competing methods with equal computational budgets can provide insight into which methodological innovations actually contribute to improved performance (as opposed to simply boosting model capacity) and can provide context for the feasibility of various methods to be deployed as online data analysis tools. Careful design and interpretation of benchmark tasks and performance measures are needed for the community to avoid chasing arbitrary targets that do not meaningfully facilitate scientific discovery and development of novel and functional materials.

Openly assess dataset bias

Acknowledging dataset bias

It is widely accepted that materials datasets are distinct from the datasets used to train and validate ML systems for more "mainstream" applications in a number of ways. While some of this is hyperbole, there are some genuine differences that have a large impact on the overall outlook for ML in materials research. For instance, there is a community-wide perception that all ML problems involve data on the scale of the classic image-recognition and spam/ham problems. While there are over 140,000 labeled structures in the Materials Project database23 and the MNIST24 dataset contains about half that number, other popular ML benchmark datasets are much more modest in size. For instance, the Iris dataset contains only 50 samples each of three species of iris and is treated as a standard dataset for evaluating a host of clustering and classification algorithms. As noted above, dataset size is not necessarily the major hurdle for the materials science community in terms of developing and deploying ML systems; however, the data, input representation, and task must each be carefully considered.

Viewed as a monolithic dataset, the materials literature is an extremely heterogeneous multiview corpus with a significant fraction of missing entries. Even if this dataset were accessible in a coherent digital form, its diversity and deficiencies would pose substantial hurdles to its suitability for ML-driven science. Most research papers focus narrowly on a single material or a small handful of material instances, address only a small subset of potentially relevant properties and characterization modalities, and often fail to adequately quantify measurement uncertainties. Perhaps most importantly, there is a strong systemic bias toward positive results25. All of these factors negatively impact the generalization potential of ML systems.

Two aspects of publication bias play a particularly large role: domain bias and selection bias (Fig. 1a). Domain bias results when training datasets do not adequately cover the input space. For example, ref. 26 recently demonstrated that the "tried and true" method of selecting reagents following previous successes artificially constrained the range of chemical space searched, providing the ML with a distorted view of the viable parameter space. Severe domain bias can lead to overly optimistic estimates of the performance of ML systems27,28 or, in the worst case, even render them unusable for real-world scientific application29,30.

Fig. 1: Impact of datasets and feature sets in implementing ML for materials research.

a Materials literature as a heterogeneous dataset affected by domain bias and selection bias. Domain bias results when training datasets do not adequately cover the research space. Selection bias arises when external factors, such as the perceived questionability or inexplicability of results, restrict the likelihood of a data point's inclusion in the dataset; such data can be experimental, theoretical, or computational. b Holistic description of the synthesis, composition, microstructure, and macrostructure of materials, which are related to material properties and performance. Identifying a sufficient feature space with essential variables such as synthesis parameters requires careful observation and lateral thinking.

Selection bias arises when some external factor influences the likelihood of a data point's inclusion in the dataset. In scientific research, a major source of such selection bias is the large number of unreported failures (Fig. 1a). For instance, the Landolt-Börnstein collection of ternary amorphous alloys lists 71% of the alloys as glass formers, while the actual occurrence of glass-forming compounds is estimated to be about 5%31. By skewing the apparent prior probability of glass formation, this further complicates the already challenging task of learning from imbalanced datasets. Schrier et al.32 reported how incorporating failed experiments into ML models can actually improve the overall predictive power of a model.
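One simple way to account for this kind of reporting bias, sketched below under the assumption that the training and real-world class frequencies are both known, is to rescale a classifier's predicted probabilities from the biased training prior to a more realistic deployment prior; the numbers are taken from the glass-forming example above, and the correction is the standard prior-shift adjustment rather than a cure for the underlying bias.

```python
# Sketch of a prior-shift correction: rescale predicted class probabilities from
# the biased training prior (71% glass formers reported in the literature) to a
# more realistic deployment prior (~5% glass formers), then renormalize.
import numpy as np

train_prior = np.array([0.29, 0.71])   # [not glass, glass] as seen in the literature
true_prior = np.array([0.95, 0.05])    # estimated real-world occurrence

def correct_prior(p_model):
    """p_model: predicted probabilities under the training prior, shape (..., 2)."""
    w = true_prior / train_prior
    p = p_model * w
    return p / p.sum(axis=-1, keepdims=True)

p_model = np.array([0.40, 0.60])       # a classifier's raw prediction for one alloy
print(correct_prior(p_model))          # the glass-forming probability drops sharply
```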

Furthermore, the annotations or targets used to train ML systems do not necessarily represent true physical ground truth. As an example, in the field of metallic glasses the full width at half-maximum (FWHM) of the strongest diffraction peak at low wavevector is often used to categorize thin-film material as metallic glass, nanocrystalline, or crystalline. Across the literature the FWHM value used as the threshold to distinguish between the first two classes varies from 0.4 to 0.7 Å⁻¹ (with associated uncertainties) depending upon the research group. Although compendia invariably capture the label ascribed to the samples, they almost universally omit the threshold used for the classification, the uncertainty in the measurement of the FWHM, and the associated synthesis and characterization metadata. Comprehensive studies often report only reduced summaries for the datasets presented and include full details only for a subset of "representative data". These shortcomings are common across the primary materials science literature. Given that even experts can reasonably disagree on the interpretation of experimental results, the lack of access to primary datasets prevents detailed model critique, posing a substantial impediment to model validation29,33. The push for creating F.A.I.R. (Findable, Accessible, Interoperable, and Reusable34) datasets with human- and computer-readable data structures notwithstanding, most of the data and metadata for materials that have ever been made and studied have been lost to time.

Systematic errors in datasets are not restricted to experimental results alone. Theoretical predictions from high-throughput density functional theory (DFT) databases, for example, are a valuable resource for predicted material (meta-)stability, crystal structures, and physical properties, but DFT computations contain several underlying assumptions that are responsible for known systematic errors, e.g., in calculated band gaps. DFT experts are well aware of these limitations and their implications for model building; however, scientists unfamiliar with the field may not be able to reasonably judge the potential viability of a model's predictions given these limitations. The discrepancy between DFT and experimental data will grow as systems become increasingly complex, a longstanding trend in applied materials science. A heterogeneous model, in particular, may carry large uncertainty depending on the complexity of the input structure, and often little to no information is given about that structure or the rationale for choosing it.

Finally, even balanced datasets with quantified uncertainties are not guaranteed to generate predictive models if the features used to describe the materials and/or how they are made are not sufficiently descriptive. Holistically describing the synthesis, composition, microstructure, and macrostructure of existing materials in relation to their properties and performance (Fig. 1b) is a challenging problem, and the feature set used (e.g., microstructure two-point correlations, compositional descriptors and radial distribution functions for functional materials, and calculated physical properties) is largely community driven. This presupposes that we know and can measure the relevant features during our experiments. Often, identifying the parameters that strongly influence materials synthesis and the structural aspects highly correlated with function is a matter of scientific inquiry in and of itself. For example, identifying the importance of temperature in cross-linking rubber or the effect of moisture on the reproducible growth of super-dense, vertically aligned single-walled carbon nanotubes requires careful observation and lateral thinking to connect seemingly independent or unimportant variables. If these parameters (or covariate features, e.g., chemical vapor deposition system pump curves) are not captured from the outset, then there is no hope of algorithmically discovering a causal model, and weakly predictive models are likely to be the best-case output.
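To make the notion of compositional descriptors concrete, the toy sketch below builds features as composition-weighted averages (and ranges) of elemental properties; the two elemental values are rounded reference numbers and the feature set is purely illustrative, whereas a real workflow would draw on a curated elemental-property table or a featurization library.

```python
# Toy compositional descriptors: composition-weighted means of elemental properties.
# Elemental values below are approximate reference numbers for illustration only.
import numpy as np

elemental = {                     # {element: (atomic mass, Pauling electronegativity)}
    "Fe": (55.85, 1.83),
    "O": (16.00, 3.44),
}

def composition_features(formula):
    """formula: dict of element -> stoichiometric amount, e.g. {"Fe": 2, "O": 3}."""
    amounts = np.array(list(formula.values()), dtype=float)
    fracs = amounts / amounts.sum()
    props = np.array([elemental[el] for el in formula])   # (n_elements, n_properties)
    mean = fracs @ props                                  # composition-weighted average
    spread = props.max(axis=0) - props.min(axis=0)        # crude range feature
    return np.concatenate([mean, spread])

print(composition_features({"Fe": 2, "O": 3}).round(3))
```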

There is no silver bullet that will solve the issue of dataset bias, but there are several concrete steps that can be taken to begin addressing it. For instance, as a community we can commit to re-balancing the data pool against selection bias by including in our supplementary material one failed (or subpar) result for every successful result in the main text. Domain bias is best addressed by first acknowledging its existence and then encouraging researchers (possibly through funding) to spend time exploring outside of the well-known regions within their respective fields (perhaps resulting in additional data points to address selection bias). In terms of the need to capture all relevant material features, we accept that (happily) new insights will constantly crop up, and when they do, public datasets should be updated to contain the newly important features. Even if the new field is left empty for historical records, its existence will draw attention to its relevance for model builders. Finally, individuals applying ML in their research should analyze and discuss sources of bias in the data used to train and evaluate models and their potential impact on reported results.

Productivity in spite of dataset bias

Bias in historical and as-collected datasets should be acknowledged, but it does not entirely preclude their use to train ML models targeted toward scientific inquiry. Instead, one can continue to gain productive insights from ML by taking the appropriate approach and thinking analytically about the results of the model.

Especially with small datasets, it is important to characterize the extent of dataset bias and perform careful model-performance analysis to obtain realistic estimates of the generalization of ML models. Rauer and Bereau28 provide compelling examples of the effects of dataset bias by comparing the empirical distributions in chemical space of three similar molecular property datasets. Dataset bias can cause common measures of a model's generalization ability to become overconfident; typically, generalization ability is measured through cross-validation, where a portion of the data is withheld from training. Recent research in the chemical and materials informatics literature has focused on developing dataset-unbiasing techniques that aim to find cross-validation splits that more faithfully serve as a check against overfitting. For example, the Asymmetric Validation Embedding method27 quantifies the bias of a dataset split by using a nearest-neighbor model to memorize the training data: if the nearest-neighbor lookup can achieve a good validation accuracy, then the training and validation sets are deemed to be too similar. Searching for cross-validation splits that minimize this bias metric can improve the robustness of the benchmark, but the Asymmetric Validation Embedding metric is specific to classification tasks. In contrast, leave-one-cluster-out cross-validation35 is more general, using only distances in the input space to define cross-validation groups and reduce information leakage between folds. Extending these kinds of debiasing methods to additional material classification and prediction tasks will have an outsized impact on applied artificial intelligence for practical scientific advances and discoveries, because these goals by nature depend on excellent generalization and extrapolation performance.
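A minimal sketch of leave-one-cluster-out cross-validation in this spirit is given below (not the exact protocol of ref. 35): the inputs are clustered and whole clusters are held out in turn so that validation points are not near-duplicates of the training data; the feature matrix, target, model, and cluster count are synthetic placeholders.

```python
# Leave-one-cluster-out cross-validation: cluster the inputs, then hold out one
# whole cluster at a time to reduce information leakage between folds.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                     # placeholder feature matrix
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=300)     # placeholder target

groups = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
model = RandomForestRegressor(n_estimators=200, random_state=0)

scores = cross_val_score(model, X, y, groups=groups, cv=LeaveOneGroupOut(),
                         scoring="r2")
print("per-cluster R^2:", np.round(scores, 2))    # often far lower than random splits
```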

One method for maintaining "good" features and models is to adopt active human intervention in the ML loop. For example, we have recently demonstrated that Random Forest models tuned to aggressively maximize only cross-validation accuracy may produce low-quality, unreliable feature-ranking explanations36. Carefully tracking which features (and data points) the model depends on most for its predictions allows a researcher to ensure that the model is capturing physically relevant trends, identify new potential insight into material behavior, and spot possible outliers. Similarly, when physics-based models are used to generate features and training data for ML models, subsequent comparison of new predictions to theory-based results offers the opportunity to improve both models37. The preceding examples are all human-initiated post-hoc investigations of model outputs. Kusne et al.38 recently demonstrated the inverse example, where the ML model can request expert input, such as performing a measurement or calculation, that is expected to lower predictive uncertainties.
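One simple way to support this kind of human-in-the-loop checking, sketched below with synthetic placeholders for the descriptors and target, is to compute permutation importances on held-out data and ask whether the features the model leans on make physical sense; this is a generic diagnostic, not the specific procedure of ref. 36.

```python
# Sketch of human-in-the-loop feature checking: compute permutation importance on
# held-out data and inspect which features the model depends on most.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feature_names = [f"x{i}" for i in range(6)]        # placeholder descriptor names
X = rng.normal(size=(400, 6))
y = 2 * X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

result = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for name, mean, std in zip(feature_names, result.importances_mean,
                           result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")     # does the ranking make physical sense?
```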

Dimensionality-reduction tools and latent-space models are useful for assessing the general distribution of a dataset. Visualizations from such models can expose potential bias and unequal distributions in a dataset by revealing its internal structure and true dimensionality. For instance, ref. 39 used principal component analysis to investigate the role of dataset bias by examining the density of data points in scores plots. Gómez-Bombarelli et al.40 used variational autoencoders to identify sparsely sampled regions in the parameter space by pushing them toward the outside of the latent-space distribution. They demonstrated that variational autoencoders can highlight when the model is incapable of recognizing certain classes, indicating that the data lie outside the distribution the model was trained on. Such holistic analysis builds knowledge about both the ML models and the datasets and thus may lead to more effective research steps.
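As a minimal sketch of this kind of scores-plot inspection, the snippet below projects a synthetic dataset with PCA and uses a coarse histogram to quantify how unevenly the points cover the reduced space; real analyses would plot the scores and overlay class or provenance labels.

```python
# Inspecting dataset coverage with a PCA scores plot: uneven density in the
# reduced space is a quick visual hint of domain bias.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Placeholder: two "well-studied" dense regions plus one sparsely sampled region.
X = np.vstack([rng.normal(0, 1, size=(400, 10)),
               rng.normal(5, 1, size=(400, 10)),
               rng.normal(10, 3, size=(20, 10))])

scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Coarse density estimate via a 2D histogram over the first two components.
hist, xedges, yedges = np.histogram2d(scores[:, 0], scores[:, 1], bins=10)
print("occupied bins:", int((hist > 0).sum()), "of", hist.size)
print("max / median occupancy:", hist.max(), "/", np.median(hist[hist > 0]))
```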

A culture of careful model criticism is also important for robust applied ML research41. A narrow focus on benchmark tasks can lead to false incremental progress, where, over time, models begin overfitting to a particular test dataset and then lack generalizability beyond it. Ref. 42 demonstrated that a broad range of computer-vision models suffer from this effect by developing extended test sets for the CIFAR-10 and ImageNet datasets extensively used in the community for model development. This can make it difficult to reason about exactly which methodological innovations truly contribute to generalization performance. Because many aspects of ML research are empirical, carefully designed experiments are needed to separate genuine improvements from statistical effects, and care is needed to avoid post-hoc rationalization (Hypothesizing After the Results are Known, HARKing43).

Historical dataset bias is both unavoidable and unresolvable, but once identified it need not constrain the search for new materials; indeed, the search can be steered in directions that directly counteract the bias44. For instance, ref. 26 identified anthropogenic biases in the design of amine-templated metal oxides, in that a small number of amine complexes had been used in the vast majority of the literature. Their solution was to perform 548 randomly generated experiments, both to demonstrate that a global maximum had not been reached and to erode the systemic data bias their models observed. This is not to say that such an approach is a panacea for dataset or feature-set bias, as such experiments are still designed by scientists carrying their own biases (e.g., using only amines) and may suffer from uncaptured (but important!) features. Of course, the question remains how best to remove human bias from the experimental pipeline.

One potential path forward is the deployment of automated systems that perform the ultimate selection of the experiment to be performed and manage data acquisition, functionally attacking the small-dataset problem by using automation to fill in the cracks. Using these tools and adopting objective functions that permit random or maximum-expected-improvement exploration may help researchers avoid biasing their research toward particular solutions, allowing them to focus more on higher-level problem formulation and hypothesis specification. Currently, model prototyping is often done in notebook computing environments, which are convenient for exploring new ideas but make it easy to create unsustainable software. More accessible tools for exploring new ideas while maintaining traceability, reproducibility, flexibility, interactivity, and integration with laboratory equipment will help researchers focus on goal setting, intuition and insights for featurization, and data curation. This is analogous to ML life-cycle management45, which is used in industrial settings to ensure traceability of predictions to specific model formulations.

Keep sight of the goal

While the implementation of ML in materials science is often focused on a push for better accuracy and faster calculations, these are not always the only objectives or even the most important ones. For the ML novice, it is helpful to keep the scientific aim at the forefront when selecting a model and designing training and validation procedures. Consider the trade-off between accuracy and discovery. If one is optimizing the pseudopotentials to use for DFT46,47, then the design may be centered on the accuracy of predicted material characteristics compared to an existing benchmark set, and this may lead to better predictions for other known compounds. On the other hand, one may want to sacrifice accuracy for exploratory studies. The aforementioned high-accuracy model may fail to predict the novel combination of physical properties of an undiscovered compound. In fact, even if the phase had been recently identified and included in the training set, the model may not be trustworthy, owing to the inherent lack of benchmark datasets whenever new science appears.

There are clearly cases where ML is the obvious choice to accelerate research, but there can be concerns about the suitability of ML to answer the relevant question. Many applied studies focus only on physical or chemical properties of materials and often fail to include parameters relating to their fundamental utility such as reproducibility, scalability, stability, productivity, safety, or cost48. While humans may not be able to find correlations or patterns in high-dimensional spaces, we have rich and diverse background knowledge and heuristics; we have only just begun the difficult work of inventing ways of building this knowledge into ML systems. In addition, for domains with small datasets, limited features, and a strong need for higher-level inference rather than a surrogate model, ML should not necessarily be the default approach. A more traditional approach may be faster due to the error in the ML models associated with sample size, and heuristics can play a role even with larger datasets49.

One alternative is to employ a hybrid method, which may apply a Bayesian methodology to the analysis50 or may use ML to guide the work through selective intervention51. ML is only a means to model data, and a good fit to the dataset is no guarantee that the model will be useful, since it may have little to no relationship to the actual science as it attempts to emulate apparent correlations between the features and the targets (Fig. 2). To provide some insight into this issue, Lundberg and Lee52 developed Shapley additive explanations, based on game theory, to assess the impact of each feature on ML predictions.
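A brief sketch of this kind of Shapley-value attribution is shown below; it assumes the third-party `shap` package is available and uses a tree model on placeholder data, so the specific attributions carry no physical meaning.

```python
# Sketch of Shapley-value feature attribution using the third-party `shap`
# package (assumed installed) with a tree model on placeholder data.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 1] + X[:, 2] ** 2 + 0.1 * rng.normal(size=300)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)            # shape (n_samples, n_features)

# Global view: mean |SHAP| per feature approximates its overall impact.
print(np.abs(shap_values).mean(axis=0).round(3))
```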

Fig. 2: Comparison of theoretical and ML Models of the Hall-Petch effect.

The success of a given ML model may have little or no relationship to the actual physical processes, as the model is merely interpolating between observations. For example, a Gaussian process model can "capture" the changeover in the behavior of the flow stress in metals from being dependent on grain boundary density in large-grain metals78 to being dominated by grain boundary sliding in nanocrystalline alloys79 even though the model is unaware of either mechanism. However, outside the range of acquired data, the lack of encoded scientific understanding results in rapidly increasing uncertainties, even in well-calibrated systems. Code for reproducing this figure is available at https://github.com/usnistgov/ml-materials-reflections80.

A corollary is that any ML prediction, especially when working with small datasets, may be unphysical. Again, we stress that this does not imply we should never use ML for small datasets. As demonstrated by ref. 53, non-negative matrix factorization can be constrained to provide predictions only within physical spaces. In any case, we need to employ ML tools judiciously and understand their limitations in the context of our scientific goals. For instance, while most ML models are reasonably good at interpolation54, ML is not nearly as robust when used for extrapolation, although this can be mitigated to some extent by including rigorous statistical analyses of the predictions55.
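The interpolation-versus-extrapolation point can be illustrated with a small Gaussian-process example in the spirit of Fig. 2 (though this is not the authors' script from the linked repository): a model fit to data in a limited range interpolates smoothly inside it but reports rapidly growing uncertainty outside it; the toy function and ranges below are placeholders.

```python
# Gaussian-process fit on a limited data range: predictions interpolate well
# inside the observed range, but the predictive uncertainty grows quickly outside it.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X_train = rng.uniform(1.0, 4.0, size=(30, 1))                 # observed range only
y_train = 1.0 / np.sqrt(X_train).ravel() + 0.02 * rng.normal(size=30)

gp = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(1e-3), normalize_y=True)
gp.fit(X_train, y_train)

X_query = np.array([[2.0], [3.5], [8.0], [15.0]])             # last two extrapolate
mu, sigma = gp.predict(X_query, return_std=True)
for x, m, s in zip(X_query.ravel(), mu, sigma):
    print(f"x = {x:5.1f}  prediction = {m:.3f}  std = {s:.3f}")
```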

A discussion of errors and failure modes can help one understand the bounds of validity of any ML analysis, although such discussion is often lacking or limited. An honest discourse includes not only principled estimates of model performance and detailed studies of predictive failure modes but also notes how reproducible the results are within and across research groups. Explanation of model failure modes is required to validate the use of ML for any application.

Finally, one of the biggest potential pitfalls, even for large, well-curated datasets, is losing sight of the goal by focusing on the accuracy of the model rather than using it to learn new science. There is a particular risk of the community spending disproportionate effort incrementally optimizing models to overfit against benchmark tasks42, which may or may not truly represent meaningful scientific endeavors in themselves. We note that in the case of the MatBench benchmark dataset and ML challenge56, many of the top-performing models are neural networks. While these models have impressive predictive capability, their interpretability (and thus their ability to inform scientific progress) is limited. This is also the case for the Open Catalyst Challenge57.

The objective should not be to identify the one algorithm that is good at everything but rather to develop a more focused effort that addresses a specific research question. For ML to reach its true potential to transform research and not just serve as a tool to expedite materials discovery and optimization, it needs to help provide a means to connect experimental and theoretical results instead of simply serving as a convenient vehicle to describe them.

Dream big enough for radical innovation

To date, ML has increased its presence in materials science mainly in three applications: (1) automating data analysis that used to be done manually; (2) serving as lead generation in a materials-screening funnel, as illustrated by the Open Quantum Materials Database and the Materials Project; and (3) optimizing existing materials, processes, and devices in a broadly incremental manner. While these applications are critically important to the field, radical innovation historically has often been accomplished outside of these three general research frameworks, driven by human interests or serendipity along with stubborn trial and error. For instance, graphene was first isolated during the Friday night experiments in which Geim and Novoselov would try out experimental science that was not necessarily linked to their day jobs. Escobar et al.58 discovered that peeling adhesive tape can emit enough X-rays to produce images. Shirakawa59 discovered a conductive polyacetylene film by accidentally mixing doping materials at a concentration a thousand times too high.

Design research has argued that every radical innovation investigated arose without careful analysis of a person's or even a society's needs60. If this is the case, an ultimate question about ML deployment in materials science is: can ML help humans make startling discoveries of "novel" materials and eventually new science? New science often relies on a discrete discovery, possibly outside the context of an existing theory, which is noticeably different from current ML applications that tackle problems like chess and Jeopardy!.

According to a proposed categorization in design research60, one can position one's research based on scientific and application familiarity (Fig. 3a). Here, incremental areas (blue region) offer easier data acquisition and interpretation of results but may hinder new discovery. In contrast, an unexplored area is more likely to provide such unexpected results but presents a substantial risk of wasting research resources due to its inherent uncertainty. Self-aware resource allocation and inter-area feedback will be needed to balance novelty with the probability of successful research outcomes. Although there is currently a lack of ML methods that can directly navigate one into the radical change/radical application region to discover new science, we expect that there are methodologies that can harness ML to increase the chance of radical discovery.

Fig. 3: Use of outside-the-box thinking in advancing scientific research with ML.

a Conceptual research domain defined by a scientific concept and an application goal, where the arrows represent a radical shift in research driven by outside-the-box thinking and/or creative artificial intelligence (AI). b Machine-learning-involved research loop in conjunction with possible generalization and outside-the-box thinking pathways. Blue arrows illustrate research flows in an incremental domain, green arrows show knowledge-based new research steps, and orange arrows illustrate radical shifts based on new hypotheses and generalizations in the loop.

Active outside-the-box exploration driven by ML-assisted knowledge acquisition

Human interests motivate outside-the-box research that may lead to a radical discovery, and these interests are fostered by theoretical or experimental knowledge acquisition. Therefore, applied ML and automated research systems may contribute to discrete discovery by accelerating the knowledge feedback loop (Fig. 3b). Such an ML-involved research loop can include the proposal of hypotheses, theoretical and experimental examination, knowledge extraction, and generalization, which may lead to an opportunity for radical thinking. Analysis and online visualization tools can help better interpret the results and mechanisms of ML-involved research, which facilitates new hypotheses and generalization through knowledge extraction. Such interactive analysis/visualization can be implemented in various steps of the research loop, such as feature selection, ML model investigation, and ML interpretation.

For ML to play a meaningful role in expediting this loop, one should also maintain exploratory curiosity at each step and be inspired or guided by any outputs while remaining attentively involved in the loop. In addition, at the very beginning of proof-of-concept research, whether in a current research loop or an outside-the-box search, concerns about reproducibility should not prevent the attempt of new ideas, because the scientific community needs to integrate conflicting observations and ideas into a coherent theory61.

One can harken back to Delbrück's principle of limited sloppiness62, which reminds us that our experimental designs sometimes test unintended questions, and that hidden selectivity requires attention to abnormality. In this context, ML may help us notice anomalies or even hidden variables through rigorous statistical procedures, leading to new pieces of knowledge and outside-the-box exploration. For instance, ref. 63 used automated experiments and statistical analysis to clarify the effect of trace water (a hidden variable) on crystal/domain growth of halide perovskites (an important property), which had often been communicated only in intra-lab conversation. Since such correlation analysis can only shed light on variables that are supplied as features, researchers still need to feed in comprehensive experimental records containing both data and metadata, possibly regardless of their initial interests. Also, an unbiased and flexible scientific attitude grounded in observation may be crucial to reframing a question after finding an abnormality.
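One generic way ML can draw attention to abnormality, sketched below with synthetic placeholders for the logged process conditions and with an arbitrary contamination setting, is to run a standard anomaly detector over experimental records and flag unusual runs for human follow-up.

```python
# Flagging abnormal experimental records with an isolation forest so a human can
# follow up on potential hidden variables; data are synthetic placeholders.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
routine = rng.normal(0, 1, size=(200, 4))        # e.g., logged process conditions
oddballs = rng.normal(4, 1, size=(5, 4))         # runs affected by an unlogged factor
X = np.vstack([routine, oddballs])

detector = IsolationForest(contamination=0.03, random_state=0).fit(X)
flags = detector.predict(X)                      # -1 marks suspected anomalies
print("flagged rows:", np.where(flags == -1)[0])
```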

Deep generative inverse design to assist in creating material concepts

Functionality-oriented inverse design64 is an emerging approach for searching chemical spaces65 for small molecules and possibly solid-state compounds66. Here, generative models simultaneously learn how to map existing materials to a small set of key "latent" variables and how to generate "new" materials from those variables. One can then optimize a material by finding latent variables expected to maximize the target property and generating a new material from those coordinates. Novel compounds likely to have desired properties can then be sampled from the generative model67. While design spaces, such as the 166 billion molecules mapped by chemical space projects68, are far beyond human capability to comprehend, ML may distill patterns connecting functionalities and compound structures spanning the space. This approach can be a critical step in conceptualizing materials design based upon desired functionalities and further accelerating the ML-driven research loop. One application of such inverse design is to create a property-first optimization loop: define a desired property, propose a material and structure for that property, validate the results with (automated) experiments, and refine the model.
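A deliberately simplified, purely illustrative version of this latent-space loop is sketched below: a stand-in "decoder" maps latent variables to a candidate descriptor, a stand-in surrogate scores it, and a simple search looks for high-scoring latent points. In practice both the decoder and the property model would be learned (e.g., a variational autoencoder and a trained surrogate), and the search would typically use Bayesian optimization or gradients through a differentiable decoder.

```python
# Toy latent-space optimization loop for inverse design: search a latent space,
# decode candidates, and score them with a property predictor. The decoder and
# predictor here are fixed stand-ins for learned models.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 10))                     # stand-in "decoder" weights

def decode(z):                                   # latent (2,) -> candidate descriptor (10,)
    return np.tanh(z @ W)

def predicted_property(x):                       # stand-in surrogate property model
    return -np.sum((x - 0.3) ** 2)

best_z, best_score = None, -np.inf
for _ in range(5000):                            # simple random search over latent space
    z = rng.normal(size=2)
    score = predicted_property(decode(z))
    if score > best_score:
        best_z, best_score = z, score

print("best latent point:", best_z.round(2), "predicted property:", round(best_score, 3))
```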

While these generative methods may start to approach creativity, they still explicitly aim to learn an empirical distribution based on the available data. Therefore, extrapolation outside the current distribution of known materials is not guaranteed to be productive. For instance, these methods would probably not generate a carbon nanotube given only pre-nanotube-era structures for training, or generate ordered superlattices if there are none in the training data. In addition, these huge datasets are mainly constructed from simulation, and we need to be careful about the gap between simulated and actual experimental data, as discussed previously. Still, a new concept extracted from inverse design may inspire researchers to jump into a new, discrete subfield of materials design by actively interpreting the abstracted property-structure relationship.

Creative artificial intelligence for materials science

The essence of scientific creativity is the production of new ideas, questions, and connections69. The era of artificial intelligence as an innovative investigator in this sense has yet to arrive. However, since human creativity arises from actively learning and connecting dots highlighted by our curiosity, it may be possible for machine "learning" to become creative enough to reach radical innovation.

While conventional supervised natural language processing70 has required large hand-labeled datasets for training, a recent unsupervised learning study71 indicates the possibility of extracting knowledge from the literature without human intervention to identify relevant content, capturing preliminary materials science concepts such as the underlying structure of the periodic table and structure-property relationships. This was demonstrated by encoding the latent knowledge of the literature into information-dense word embeddings, which recommended materials for a specific application ahead of human discovery. Since the amount of currently existing literature is too massive for human cognition, such artificial intelligence systems may be useful for suggesting a specific design or concept given appropriately defined functionalities.
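The word-embedding mechanism itself can be sketched with the gensim library (assumed available) on a tiny hand-written corpus, as below; studies of the kind cited above train on millions of abstracts, so this only illustrates how such embeddings are built and queried, not their literature-mining capability.

```python
# Toy word-embedding sketch with gensim: train skip-gram vectors on a tiny corpus
# and query for terms that end up near a target word. Real literature-mining work
# trains on millions of abstracts; this only shows the mechanism.
from gensim.models import Word2Vec

corpus = [
    "the thermoelectric material showed a high seebeck coefficient".split(),
    "bi2te3 is a well known thermoelectric compound".split(),
    "the band gap of the semiconductor was measured".split(),
    "doping increased the carrier concentration of the semiconductor".split(),
]

model = Word2Vec(sentences=corpus, vector_size=32, window=3, min_count=1, sg=1,
                 epochs=200, seed=0)
print(model.wv.most_similar("thermoelectric", topn=3))
```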

Beyond latent-variable optimization, one may consider computational creativity, which is used to model imagination in fields such as the arts72, music73, and gaming. This endeavor may start with finding a vector space in which novelty can be measured as a distance74. A novelty-oriented algorithm then searches the space for a set of distant new objects that is as diverse as possible, maximizing novelty instead of an objective function75. Since the distance measure introduces its own bias over the exploratory space, the deep learning novelty explorer (DeLeNox) was recently proposed76 as a means to dynamically change the distance functions for improved diversity. These approaches could be applied to materials science to diversify research directions and help us pose and consider novel materials and ideas, though measuring novelty may be subjective and most challenging for the community, and one always needs to be mindful of ethical and physical materials constraints.
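A minimal sketch of the distance-based novelty score that underlies such novelty search is given below: candidates are ranked by their mean distance to their nearest neighbors among already-explored points; the feature space and data are placeholders, and, as noted above, the choice of distance itself encodes a bias.

```python
# Novelty as distance: score candidates by mean distance to their k nearest
# neighbors among already-explored points, and prefer the most novel ones.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
explored = rng.normal(0, 1, size=(200, 6))       # placeholder: known designs
candidates = rng.normal(0, 2, size=(50, 6))      # placeholder: proposed designs

nn = NearestNeighbors(n_neighbors=5).fit(explored)
dists, _ = nn.kneighbors(candidates)
novelty = dists.mean(axis=1)                     # larger = farther from anything known

most_novel = np.argsort(novelty)[::-1][:5]
print("most novel candidate indices:", most_novel, novelty[most_novel].round(2))
```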

Outlook

Machine learning has been effective at expediting a variety of tasks, and the initial stage of its implementation for materials research has already confirmed its great promise to accelerate science and discovery77. To realize that full potential, we need to tailor its usage to answer well-defined questions while keeping perspective on the limits of the resources required and the bounds of meaningful interpretation of the resulting analyses. Eventually, we may be able to develop ML algorithms that consistently lead us to new breakthroughs. In the meantime, a complementary team of humans, ML, and robots has already begun to advance materials science.