Recent advances and applications of machine learning in solid-state materials science

One of the most exciting tools that have entered the material science toolbox in recent years is machine learning. This collection of statistical methods has already proved to be capable of considerably speeding up both fundamental and applied research. At present, we are witnessing an explosion of works that develop and apply machine learning to solid-state systems. We provide a comprehensive overview and analysis of the most recent research in this topic. As a starting point, we introduce machine learning principles, algorithms, descriptors, and databases in materials science. We continue with the description of different machine learning approaches for the discovery of stable materials and the prediction of their crystal structure. Then we discuss research in numerous quantitative structure–property relationships and various approaches for the replacement of first-principle methods by machine learning. We review how active learning and surrogate-based optimization can be applied to improve the rational design process and related examples of applications. Two major questions are always the interpretability of and the physical understanding gained from machine learning models. We consider therefore the different facets of interpretability and their importance in materials science. Finally, we propose solutions and future research paths for various challenges in computational materials science.


INTRODUCTION
In recent years, the availability of large datasets combined with the improvement in algorithms and the exponential growth in computing power led to an unparalleled surge of interest in the topic of machine learning. Nowadays, machine learning algorithms are successfully employed for classification, regression, clustering, or dimensionality reduction tasks of large sets of especially high-dimensional input data. 1 In fact, machine learning has proved to have superhuman abilities in numerous fields (such as playing go, 2 self driving cars, 3 image classification, 4 etc). As a result, huge parts of our daily life, for example, image and speech recognition, 5,6 web-searches, 7 fraud detection, 8 email/spam filtering, 9 credit scores, 10 and many more are powered by machine learning algorithms.
While data-driven research, and more specifically machine learning, have already a long history in biology 11 or chemistry, 12 they only rose to prominence recently in the field of solid-state materials science.
Traditionally, experiments used to play the key role in finding and characterizing new materials. Experimental research must be conducted over a long time period for an extremely limited number of materials, as it imposes high requirements in terms of resources and equipment. Owing to these limitations, important discoveries happened mostly through human intuition or even serendipity. 13 A first computational revolution in materials science was fueled by the advent of computational methods, 14 especially density functional theory (DFT), 15,16 Monte Carlo simulations, and molecular dynamics, that allowed researchers to explore the phase and composition space far more efficiently. In fact, the combination of both experiments and computer simulations has allowed to cut substantially the time and cost of materials design. [17][18][19][20] The constant increase in computing power and the development of more efficient codes also allowed for computational high-throughput studies 21 of large material groups in order to screen for the ideal experimental candidates. These large-scale simulations and calculations together with experimental highthroughput studies [22][23][24][25] are producing an enormous amount of data making possible the use of machine learning methods to materials science.
As these algorithms start to find their place, they are heralding a second computational revolution. Because the number of possible materials is estimated to be as high as a googol (10 100 ), 26 this revolution is doubtlessly required. This paradigm change is further promoted by projects like the materials genome initiative (Materials genome initiative) that aim to bridge the gap between experiment and theory and promote a more data-intensive and systematic research approach. A multitude of already successful machine learning applications in materials science can be found, e.g., the prediction of new stable materials, [27][28][29][30][31][32][33][34][35] the calculation of numerous material properties, [36][37][38][39][40][41][42][43][44][45][46][47][48][49][50][51] and the speeding up of firstprinciple calculations. 52 Machine learning algorithms have already revolutionized other fields, such as image recognition. However, the development from the first perceptron 53,54 up to modern deep convolutional neural networks was a long and tortuous process. In order to produce significant results in materials science, one necessarily has not only to play to the strength of machine learning techniques but also apply the lessons already learned in other fields.
As the introduction of machine learning methods to materials science is still recent, a lot of published applications are quite basic in nature and complexity. Often they involve fitting models to extremely small training sets or even applying machine learning methods to composition spaces that could possibly be mapped out in hundreds of CPU hours. It is of course possible to use machine learning methods as a simple fitting procedure for small low-dimensional datasets. However, this does not play to their strength and will not allow us to replicate the success machine learning methods had in other fields.
Furthermore, and as always when entering a different field of science, nomenclature has to be applied correctly. One example is the expression "deep learning", which is responsible for a majority of the recent success of machine learning methods (e.g., in image recognition and natural language processing 55 ). It is of course tempting to describe one's work as deep learning. However, denoting neural networks with one or two fully connected hidden layer as deep learning 56 is confusing for researchers new to the topic, and it misrepresents the purpose of deep-learning algorithms. The success of deep learning is rooted in the ability of deep neural networks to learn descriptors of data with different levels of abstraction without human intervention. 55,57 This is, of course, not the case in two-layer neural networks.
One of the major criticisms of machine learning algorithms in science is the lack of novel laws, understanding, and knowledge arising from their use. This comes from the fact that machine learning algorithms are often treated as black boxes, as machinebuilt models are too complex and alien for humans to understand. We will discuss the validity of the criticism and different approaches to this challenge.
Finally, there have already been a number of excellent reviews of materials informatics and machine learning in materials science in general, 13,[58][59][60][61][62] as well as some other covering specifically machine learning in the chemical sciences, 63 in materials design of thermoelectrics and photovoltaics, 64 in the development of lithium-ion batteries, 65 and in atomistic simulations. 66 However, owing to the explosion in the number of works using machine learning, an enormous amount of research has already been published since the past reviews and the research landscape has quickly transformed.
Here we concentrate on the various applications of machine learning in solid-state materials science (especially the most recent ones) and discuss and analyze them in detail. As a starting point, we provide an introduction to machine learning, and in particular to machine learning principles, algorithms, descriptors, and databases in materials science. We then review numerous applications of machine learning in solid-state materials science: the discovery of new stable materials and the prediction of their structure, the machine learning calculation of material properties, the development of machine learning force fields for simulations in material science, the construction of DFT functionals by machine learning methods, the optimization of the adaptive design process by active learning, and the interpretability of, and the physical understanding gained from, machine learning models. Finally, we discuss the challenges and limitations machine learning faces in materials science and suggest a few research strategies to overcome or circumvent them.

BASIC PRINCIPLES OF MACHINE LEARNING
Machine learning algorithms aim to optimize the performance of a certain task by using examples and/or past experience. 67 Generally speaking, machine learning can be divided into three main categories, namely, supervised learning, unsupervised learning, and reinforcement learning.
Supervised machine learning is based on the same principles as a standard fitting procedure: it tries to find the unknown function that connects known inputs to unknown outputs. This desired result for unknown domains is estimated based on the extrapolation of patterns found in the labeled training data. Unsupervised learning is concerned with finding patterns in unlabeled data, as, e.g., in the clustering of samples. Finally, reinforcement learning treats the problem of finding optimal or sufficiently good actions for a situation in order to maximize a reward. 68 In other words, it learns from interactions.
Finally, halfway between supervised and unsupervised learning lies semi-supervised learning. In this case, the algorithm is provided with both unlabeled as well as labeled data. Techniques of this category are particularly useful when available data are incomplete and to learn representations. 69 As supervised learning is by far the most widespread form of machine learning in materials science, we will concentrate on it in the following discussion. Figure 1 depicts the workflow applied in supervised learning. One generally chooses a subset of the relevant population for which values of the target property are known or creates the data if necessary. This process is accompanied by the selection of a machine learning algorithm that will be used to fit the desired target quantity. Most of the work consists in generating, finding, and cleaning the data to ensure that it is consistent, accurate, etc. Second, it is necessary to decide how to map the properties of the system, i.e., the input for the model, in a way that is suitable for the chosen algorithm. This implies to translate the raw information into certain features that will be used as inputs for the algorithm. Once this process is finished, the model is trained by optimizing its performance, usually measured through some kind of cost function. Usually this entails the adjustment of hyperparameters that control the training process, structure, and properties of the model. The data are split into various sets. Ideally, a validation dataset separate from the test and training sets is used for the optimization of the hyperparameters. Every machine learning application has to consider the aspects of overfitting and underfitting. The reason for underfitting usually lies either in the model, which lacks the ability to express the complexity of the data, or in the features, which do not adequately describe the data. This inevitably leads to a high training error. On the other hand, an overfitted model interprets part of the noise in the training data as relevant information, therefore failing to reliably predict new data. Usually, an overfitted model contains more free parameters than the number required to capture the complexity of the training data. In order to avoid overfitting, it is essential to monitor during training not only the training error but also the error of the validation set. Once the validation error stops decreasing, a machine learning model can start to overfit. This problem is also discussed as the bias-variance trade off in machine learning. 70,71 In this context, the bias is an error based on wrong assumptions in the trained model, while high variance is the error resulting from too much sensitivity to noise in the training data. As such, underfitted models possess high bias while overfitted models have high variance.
Before the model is ready for applications, it has to be evaluated on previously unseen data, denoted as test set, to estimate its generalization and extrapolation ability.
Different methods ranging from a simple holdout, over k-fold cross-validation, leave-one-out cross-validation, Monte Carlo crossvalidation, 72 up to leave-one-cluster-out cross-validation 73 can be used for the evaluation. All these methods rely on keeping some data hidden from the model during the training process. For a simple holdout, this is just performed once, while for k-fold crossvalidation the dataset is separated into k equally sized sets. The algorithm is trained with all but one of these k subsets, which is used for testing. Finally, the process is repeated for every subset. For leave-one-out cross-validation, each sample is left out of the training set once and the model is evaluated for that sample. It has to be noted that research in chemistry has shown that this form of cross-validation is insufficient to evaluate adequately the predictive performance of quantitative structure-property relationship and should therefore be avoided. 74,75 Monte Carlo crossvalidation is similar to k-fold cross-validation in the sense that the training and test set are randomly chosen. However, here the size of the training/test set is chosen independently from the number of folds. While this can be advantageous, it also means that a sample is not guaranteed to be in the test/training set. Leave-onecluster-out cross-validation 73 was specifically developed for materials science and estimates the ability of the machine learning model to extrapolate to novel groups of materials that were not present in the training data. Depending on the target quantity, this allows for a more realistic evaluation and a better understanding of the limitations of the machine learning model. Leave-one-cluster-out cross-validation removes a cluster of materials and then considers the error for predictions of the materials belonging to the removed cluster. This is, for example, consistent with the finding in ref. 76 that models trained on superconductors with a specific superconducting mechanism do not have any predictive ability for superconductors with other mechanisms.
Before discussing various applications of machine learning in materials science, we will give an overview of the different descriptors, algorithms, and databases used in materials informatics.

Databases
Machine learning in materials science is mostly concerned with supervised learning. The success of such methods depends mainly on the amount and quality of data that is available, and this turns out to be one of the major challenges in material informatics. 77 This is especially problematic for target properties that can only be determined experimentally in a costly fashion (such as the critical temperature of superconductors-see section "Prediction of material properties-superconductivity"). For this reason, databases such as the materials project, 78 the inorganic crystal structure database, 79 and others (Materials genome initiative, The NOMAD archive, Supercon, National Institute of Materials Science 2011) [80][81][82][83][84][85][86][87][88][89][90][91][92] that contain information on numerous properties of known materials are essential for the success of materials informatics.
In order for these databases and for materials informatics to thrive, a FAIR treatment of data 93 is absolutely required. A FAIR treatment encompasses the four principles: findability, accessibility, interoperability, and repurposability. 94 In other words, researchers from different disciplines should be able to find and access data, as well as the corresponding metadata, in a commonly accepted format. This allows the application of the data for new purposes.
Traditionally, negative results are often discarded and left unpublished. However, as negative data are often just as important for machine learning algorithms as positive results, 28,95 a cultural adjustment toward the publication of unsuccessful research is necessary. In some disciplines with a longer tradition of data-based research (like chemistry), such databases already exist. 95 In a similar vein, data that emerges as a side product but are not essential for a publication are often left unpublished. This eventually results in a waste of resources as other researchers are then required to repeat the work. In the end, every single discarded calculation will be sorely missed in future machine learning applications.

Features
A pivotal ingredient of a machine learning algorithm is the representation of the data in a suitable form. Features in material science have to be able to capture all the relevant information, necessary to distinguish between different atomic or crystal environments. 96 The process itself, denoted as feature extraction or engineering, might be as simple as determining atomic numbers, might involve complex transformations such as an expansion of radial distribution functions (RDFs) in a certain basis, or might require aggregations based on statistics (e.g., average over features or the calculation of their maximum value). How much processing is required depends strongly on the algorithm. For some methods, such as deep learning, the feature extraction can be considered as part of the model. 97 Naturally, the best choice for the representation depends on the target quantity and the variety of the space of occurrences. For completeness, we have to mention that the cost of feature extraction and of target quantity evaluation must never be comparable.
Ideally, descriptors should be uncorrelated, as an abundant number of correlated features can hinder the efficiency and accuracy of the model. When this happens, further feature selection is necessary to circumvent the curse of dimensionality, 98 simplify models, and improve their interpretability as well as training efficiency. For example, several elemental properties such as the period and group in the periodic table, ionization potential, and covalent radius, can be used as features to model formation energies or distances to the convex hull of stability. However, it was shown that, to obtain acceptable accuracies, often only the period and the group are required. 99 Having described the general properties of descriptors, we will proceed with a listing of the most used features in materials science. Without a doubt, the most studied type of features in this field are the ones related to the fitting of potential energy surfaces. In principle, the nuclear charges and the atomic positions are sufficient features, as the Hamiltonian of a system is usually fully determined by these quantities. In practice, however, while Cartesian coordinates might provide an unambiguous description of the atomic positions, they do not make a suitable descriptor, as the list of coordinates of a structure are ordered arbitrarily and the number of such coordinates varies with the number of atoms. The latter is a problem, as most machine learning models require a fixed number of features as an input. Therefore, to describe solids and large clusters, the number of interacting neighbors has to be allowed to vary without changing the dimensionality of the descriptor. In addition, a lot of applications require that the features are continuous and differentiable with respect to atomic positions.
A comprehensive study on features for atomic potential energy surfaces can be found in the review of Bartó et al. 100 . Important points mentioned in their work are: (i) the performance of the model and its ability to differentiate between different structures do not depend directly on the descriptors but on the similarity measurement between them; (ii) the quality of the descriptors is related to the differentiability with respect to the movement of the atoms, completeness of the representation, and invariance to the basis symmetries of physics (rotation, reflection, translation, and permutation of atoms of the same species). For clarification, a set of invariant descriptors q i , which uniquely determines an atomic environment up to symmetries, is defined as complete. An overcomplete set is then a set that includes more features than necessary.
Simple representations that show shortcomings as features are transformations of pairwise distances, [101][102][103] Weyl matrices, 104 and Z-matrices. 105 Pairwise distances (and also reciprocal or exponential transformations of these) only work for a fixed number of atoms and are not unique under permutation of atoms. The constrain on the number of atoms is also present for polynomials of pairwise distances. Histograms of pairwise atomic distances are non-unique: if no information on the angles between the atoms is given, of if the ordering of the atoms is unknown, it might be possible to construct at least two different structures with the same features. Weyl matrices are defined by the inner product between neighboring atoms positions, forming an overcomplete set, while permutations of the atoms change the order of the rows and columns. Finally, Z-matrices or internal coordinate representations are not invariant under permutations of atoms.
In 2012, Rupp et al. 106 introduced a representation for molecules based on the Coulomb repulsion between atoms I and J and a polynomial fit of atomic energies to the nuclear charge The ordered eigenvalues (ε) of these "Coulomb matrices" are then used to measure the similarity between two molecules.
Here, if the number of atoms is not the same in both systems, ε is extended by zeros. In this representation, symmetrically equivalent atoms contribute equally to the feature function, the diagonalized matrices are invariant with respect to permutations and rotations, and the distance d is continuous under small variations of charge or interatomic distances. Unfortunately, this representation is not complete and does not uniquely describe every system. The incompleteness derives from the fact that not all degrees of freedom are taken into account when comparing two systems. The non-uniqueness can be demonstrated using as an example acetylene (C 2 H 2 ). 107 In brief, distortions of this molecule can lead to several geometries that are described by the same Coulomb matrix.
Faber et al. 108 presented three distinct ways to extend the Coulomb matrix representation to periodic systems. The first of these features consists of a matrix where each element represents the full Coulomb interaction between two atoms and all their infinite repetitions in the lattice. For example: where the sum over k (l) is taken over the atom i (j) in the unit cell and its N closest equivalent atoms. However, as this double sum has convergence issues, one has to resort to the Ewald trick: X ij is divided into a constant and two rapidly converging sums, one for the long-range interaction and another for the short-range interaction. Another extension by Faber et al. considers electrostatic interactions between the atoms in the unit cell and the atoms in the N closest unit cells. In addition, the long-range interaction is replaced by rapidly decaying interaction. In their final extension, the Coulomb interaction in the usual matrix is replaced by a potential that is symmetric with respect to the lattice vectors.
In the same line of work, Schütt et al. 109 extended the Coulomb matrix representation by combining it with the Bravais matrix. Unfortunately, this representation is plagued by a degeneracy problem that comes from the arbitrary choice of the coordinate system in which the Bravais matrix is written. Another representation proposed by Schütt et al. is the so called partial radial distribution function, which considers the density of atoms β in a shell of width dr and radius r centered around atom α (see Fig. 2): Here N α and N β are the number of atom of types α and β, V r is the volume of the shell, and d αβ are the pairwise distances between two atom types.
Another form for representing the local structural environment was proposed by Behler and Parrinelo. 110 Their descriptors 111 involve an invariant set of atom-centered radial and angular symmetry functions where θ ijk is the angle between R j − R i and R k − R i . While the radial functions G r i contain information on the interaction between pairs of atoms within a certain radius, the angular functions G a i contains additional information on the distribution of the bond angles θ ijk . Examples for atom-centered symmetry functions are Here f c is a cutoff function, leading to the neglect of interactions between atoms beyond a certain radius R c . Furthermore, η controls the width of the Gaussians, R s is just a parameter that shifts the Gaussians, λ determines the positions of the extrema of the cosine, and ζ controls the angular resolution. The sum over neighbors enforces the permutation invariance of these symmetry functions. Usually, 20-100 symmetry functions are used per atom, constructed by varying the parameters above. Beside atom centered, these functions can also be pair centered. 112 A generalization of the atom-centered pairwise descriptor of Behler was proposed by Seko et al. 113 It consists of simple basis functions constructed from the multinomial expansion of the product between a cutoff function (f c ) and an analytical pairwise function (f n ) (for example, Gaussian, cosine, Bessel, Neumann, polynomial, or Gaussian-type orbital functions) where p is a positive integer, and R i jk indicates the distance between atoms j and k of structure i. The descriptor then uses the sum of these basis functions over all the atoms in the structure P j b i;j n;p ! . A similar type of descriptor is the angular Fourier series (AFS), 100 which consists of a collection of orthogonal polynomials, like the Chebyshev polynomials T l (cos θ) = cos (lθ), and radial functions AFS nl ¼ X i;j>i g n ðr i Þg n ðr j Þcosðlθ ij Þ: These radial functions are expansions of cubic or higher-order polynomials where ϕ α ðrÞ ¼ ðr c À rÞ αþ2 =N α : A different approach for atomic environment features was proposed by Bartok et al. 100,114 and leads to the power spectrum and the bispectrum. The approach starts with the generation of an atomic neighbor density function which is projected onto the surface of a four-dimensional sphere with radius r 0 . As an example, Fig. 3 depicts the projection for 1 and 2 dimensions. Then the hyperspherical harmonic functions U j m0m can be used to represent any function ρ defined on the surface of a four-dimensional sphere 115,116 ρ ¼ Combining these with the rotation operator and the transformation of the expansion coefficients under rotation leads to the formula (15) for the SO(4) power spectrum. On the other hand, the bispectrum is given by where C j j1 j2 m m1 m2 are the Clebsch-Gordon coefficients of SO(4). We note that the representations above are truncated, based on the band limit j max in the expansion.
Finally, one of the most successful atomic environment features is the following similarity measurement also known as the smooth overlap of atomic positions (SOAP) kernel. 100 Here ζ is a positive integer that enhances the sensitivity of the kernel to changes on the atomic positions and ρ is the atomic neighbor density function, which is constructed from a sum of Gaussians, centered on each neighbor: In practice, the function ρ is then expanded in terms of the spherical harmonics. In addition, k(ρ, ρ′) is a rotationally invariant kernel, defined as the overlap between an atomic environment and all rotated environments: The normalization factor ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi kðρ; ρÞkðρ0; ρ0Þ p ensures that the overlap of an environment with itself is one.
The SOAP kernel can be perceived as a three-dimensional generalization of the radial atom-centered symmetry functions and is capable of characterizing the entire atomic environment at once. It was shown to be equivalent to using the power or bispectrum descriptor with a dot-product covariance kernel and Gaussian neighbor densities. 100 A problem with the above descriptors is that their number increases quadratically with the number of chemical species. Inspired by the Behler symmetry functions and the SOAP method, Artrith et al. 117 devised a conceptually simple descriptor whose dimension is constant with respect to the number species. This is achieved by defining the descriptor as the union between two sets of invariant coordinates, one that maps the atomic positions (or structure) and another for the composition. Both of these mappings consist of the expansion coefficients of the RDFs and angular distribution functions (ADF) in a complete basis set ϕ α (like the Chebyshev 94% average cross- validation error). The advantage of stability and Here f c is a cut-off function that limits the range of the interactions. The weights w tj and w tk take the value of one for the structure maps, while the weights for the compositional maps depend on the chemical species, according to the pseudo-spin convention of the Ising model. By limiting the descriptor to two and three body interactions, i.e., radial and angular contributions, this method maintains the simple analytic nature of the Behler-Parrinelo approach. Furthermore, it allows for an efficient implementation and differentiation, while systematic refinement is assured by the expansion in a complete basis set.
Sanville et al. 118 used a set of vectors, each of which describes a five-atom chain found in the system. This information includes distances between the five atoms, angles, torsion angles, and functions of the bond screening factors. 119 The simplex representation of a molecular structure of Kuz'min et al. 120,121 consists in representing a molecule as a system of different simplex descriptors, i.e., a system of different tetratromic fragments. These descriptors become consecutively more detailed with the increase of the dimension of the molecule representation. The simplex descriptor at the one-dimensional (1D) level consists on the number of combinations of four atoms for a given composition. At the two-dimensional (2D) level, the topology is also taken into account, while at the 3D level, the descriptor is the number of simplexes of fixed composition, topology, chirality, and symmetry. The extension of this methodology to bulk materials was proposed by Isayev et al. 122 and counts bounded and unbounded simplexes (see Fig. 4). While a bonded simplex characterizes only a single component of the mixture, unbounded simplexes can describe up to four components of the unit cell.
Isayef et al. 41 also adapted property-labeled material fragments 123 to solids. The structure of the material is encoded in a graph that defines the connectivity within the material based on its Voronoi tessellation 124,125 (see Fig. 5). Only strong bonding interactions are considered. Two atoms are seen as connected only when they share a Voronoi face and the interatomic distance does not exceed the sum of the Cordero covalent bond lengths. 126 In the graph, the nodes correspond to the chemical elements, which are identified through a plethora of elemental properties, like Mendeleev group, period, thermal conductivity, covalent radius, etc. The full graph is divided into subgraphs that correspond to the different fragments. In addition, information about the crystal structure (e.g., lattice constants) is added to the descriptor of the material, resulting in a feature vector of 2500 values in total. A characteristic of these graphs is their adjacency matrix, which consists of a square matrix of order n (number of atoms) filled with zeros except for the entries a ij = 1 that occur when atom i and j are connected. Finally, for every property scheme q, the descriptors are calculated as where the set of indices go over all pairs of atoms or over all pairs of bonded atoms, and M ij are the elements of the product between the adjacency matrix of the graph and the reciprocal square distance matrix. A different descriptor, named orbital-field matrix, was introduced by Pham et al. 127 Orbital-field matrices consist in the weighted product between one-hot vectors o p i À Á , resembling those from the field of natural language processing. These vectors are filled with zeros with the exception of the elements that represent the electronic configuration of the valence of the atom. As an example, for the sodium atom with electronic configuration [Ne]3s 1 , the one-hot vector is filled with zeros except for the first element, which is 1. The elements of the matrices are calculated from: where the weight w k ðθ p k ; r pk Þ represents the contribution of atom k to the coordination number of the center atom p and depends on the distance between the atoms and the solid angle θ p k determined by the face of the Voronoi polyhedron between the atoms. To represent crystal structures, the orbital-field matrices are averaged over the number of atoms N p in the unit cell: Another way to construct features based on graphs is the crystal graph convolutional neural network (CGCNN) framework, proposed by Xie et al. 40 and shown schematically in Fig. 6. The atomic properties are represented by the nodes and encoded in the feature vectors v i . Instead of using continuous values, each continuous property is divided into ten categories resulting in one-hot features. This is obviously not necessary for the discrete properties, which can be encoded as standard one-hot vectors without further transformations. The edges represent the bonding interactions and are constructed analogously to the propertylabeled material fragments descriptor. Unlike most graphs, these crystal graphs allow for several edges between two nodes, due to periodicity. Therefore, the edges are encoded as one-hot feature vectors u ði;jÞ k , which translates into the kth bond between atom i and j. Crystal graphs do not form an optimal representation for predicting target properties by themselves; however, they can be improved by using convolution layers. After each convolution layer, the feature vectors gradually contain more information on the surrounding environment due to the concatenation between atom and bond feature vectors. The best convolution function of where W Here ⊕ denotes concatenation of vectors. After R convolutions, a pooling layer reduces the spatial dimensions of the convolution neural network. Using skip layer connections, 128 the pooling function operates not only on the last feature vector but also on all feature vectors (obtained after each convolution).
The idea of applying graph neural networks [129][130][131] to describe crystal structures stems from graph-based models for molecules, such as those proposed in refs. [131][132][133][134][135][136][137][138][139][140] . Moreover, all these models can be reorganized into a single common framework, known as message passing neural network 141 (MPNNs). The latter can be defined as a model operating on undirected graphs G, with edge features x v and vertex features e vw . In this context, the forward pass is divided into two phases: the message passing phase and the readout phase.
During the message passing phase, which lasts for T interaction steps, the hidden states h v at each node in the graph are updated based on the messages m tþ1 v : where S t (⋅) is the vertex update function. The messages are modified at each interaction by an update function M t (⋅), which depends on all pairs of nodes (and their edges) in the neighborhood of v in the graph G: where N(v) denotes the neighbors of v.
The readout phase occurs after T interaction steps. In this phase, a readout function R(⋅) computes a feature vector for the entire graph: Jørgensen et al. 142 extended MPNNs with an edge update network, which enforces the dependence of the information exchanged between atoms on the previous edge state and on the hidden states of the sending and receiving atom: where E t (.) is the edge update function. One example of MPNNs are causal generative neural networks. The message corresponds to z ðtÞ ði;jÞ k , defined in Eq. (27). Likewise, the hidden node update function corresponds to the convolution function of Eq. (27). In this case, we can clearly see that the hidden node update function depends on the message and on the hidden node state. The readout phase comes after R convolutions (or T iterations steps) and the readout function corresponds to the pooling layer function of the CGCNNs.
Up to now, we discussed very general features to describe both the crystal structure and chemical composition. However, should constrains be applied to the material space, the features necessary to study such systems can be vastly simplified. As mentioned above, elemental properties alone can be used as features, e.g., when a training set is restricted to only one kind of crystal structure and stoichiometry. 33,35,56,99,143 Consequently, the target property only depends on the chemical elements present in the composition. Another example can be found in ref. 144 , where a polymer is represented by the number of building blocks (e.g., number of C 6 H 4 , CH 2 , etc) or of pairs of blocks.
The crude estimations of properties can be an interesting supplement to standard features as discussed in ref. 77 . As its name implies, crude estimations of properties consist of the calculation of a target property (for example, the experimental band gap) utilizing crude estimators [for example, the DFT band gap calculated with the Perdew-Burke-Ernzerhof (PBE) approximation 145 to the exchange-correlation functional]. In principle, this approach can achieve successful results, as the machine learning algorithm no longer needs to predict a target property but rather an error or a difference between properties calculated with two well-defined methodologies. Fischer et al. 146 took another route and used as features a vector that completely denotes the possible ground states of an alloy: X ¼ ðx c1 ; x c2 ; :::; x cn ; x E1 ; x E2 ; :::; x Ec Þ; (33) where x ci denotes the all possible crystal structures present in the alloy at a given composition and x E1 the elemental constituents of the system. In this way, the vector X = (fcc, fcc, Au, Ag) would represent the gold-silver system. Furthermore, the probability density p(X) denotes the probability that X is the set of ground states in a binary alloy. With these tools, one can find the most likely crystal structure for a given composition by sorting the probabilities and predict crystal structures by evaluating the conditional probability p(X|e), where e denotes unknown variables.
Having presented so many types of descriptors, the question that now remains concerns the selection of the best features for the problem at hand. In section "Basic principles of machine learning-Algorithms", we discuss some automatic feature selection algorithms, e.g., least absolute shrinkage and selection operator (LASSO), sure independence screening and sparsifying operator (SISSO), principal component analysis (PCA), or even decision trees. Yet these methods mainly work for linear models, and selecting a feature for, e.g., a neural network force field from the various features we described is not possible with any of these methods. A possible solution to this problem is to perform through benchmarks. Unfortunately, while there are many studies presenting their own distinct way to build features and applying them to some problem in materials science, fewer studies 96,100,147 actually present quantitative comparisons between descriptors. Moreover, some of the above features require a considerable amount of time and effort to be implemented efficiently and are not readily and easily available.
In view of the present situation, we believe that the materials science community would benefit greatly from a library containing efficient implementations of the above-mentioned descriptors and an assembly of benchmark datasets to compare the features in a standardized manner. Recent work by Himanen et al. 148 addresses part of the first problem by providing efficient implementations of common features. The library is, however, lacking the implementation of the derivatives. SchNetPack by Schütt et al. 149 also provides an environment for training deep neural network for energy surfaces and various material properties. Further useful tools and libraries can be found in refs. [150][151][152] Algorithms In this section, we briefly introduce and discuss the most prevalent algorithms used in materials science. We start with linear-and kernel-based regression and classification methods. We then introduce variable selection and extraction algorithms that are also largely based on linear methods. Concerning completely nonlinear models, we discuss decision tree-based methods like random forests (RFs) and extremely randomized trees and neural networks. We start with simple fully connected feed-forward networks and convolutional networks and continue with more complex applications in the form of variational autoencoders (VAEs) and generative adversarial networks (GANs). In ridge regression, a multi-dimensional least-squares linear-fit problem, including a L 2 -regularization term, is solved: The extra regularization term is included to favor specific solutions with smaller coefficients.
As complex regression problems can usually not be solved by a simple linear model, the so-called kernel trick is often applied to ridge regression. 153 Instead of using the original descriptor x, the data are first transformed into a higher-dimensional feature space ϕ(x). In this space, the kernel k(x, y) is equal to the inner product 〈ϕ (x), ϕ(y)〉. In practice, only the kernel needs to be evaluated, avoiding an inefficient or even impossible explicit calculation of the features in the new space. Common kernels are, e.g., 154 , the radial basis function kernel or the polynomial kernel of degree d Solving the minimization problem given by Eq. (34) in the new feature space results in a non-linear regression in the original feature space. This is usually referred to as kernel ridge regression (KRR). KRR is generally simple to use, as for a successful application of KRR only very few hyperparameters have to be adjusted. Consequently, KRR is often used in materials science. Support vector machines 155 (SVMs) search for the hyperplanes that divide a dataset into classes such that the margin around the hyperplane is maximized (see Fig. 7). The hyperplane is completely defined by the data points that lie the closest to the plane, i.e., the support vectors from which the algorithm derives its name.
Analogously to ridge regression, the kernel trick can be used to arrive at non-linear SVMs. 153 SVM regressors also create a linear model (non-linear in the kernel case) but use the so-called εinsensitive loss function: where f(x) is the linear model and ε a hyperparameter. In this way, errors smaller than the threshold defined by ε are neglected. When comparing SVMs and KRR, no big performance differences are to be expected. Usually SVMs arrive at a sparser representation, which can be of advantage; however, their performance relies on a good setting of the hyperparameters. In most cases, SVMs will provide faster predictions and consume less memory, while KRR will take less time to fit for medium datasets. Nevertheless, owing to the generally low computational cost of both algorithms, these differences are seldom important for relatively small datasets. Unfortunately, neither method is feasible for large datasets as the size of the kernel matrix scales quadratically with the number of data points.
Gaussian process regression (GPR) relies on the assumption that the training data were generated by a Gaussian process and therefore consists of samples from a multivariate Gaussian distribution. The only other assumption that enter the regression are the forms of the covariance function k(x, x′) and the mean (which is often assumed to be zero). Based on the covariance matrix, whose elements represent the covariance between two features, the mean and the variance for every possible feature value can be predicted. The ability to estimate the variance is the main advantage of GPR, as the uncertainty of the prediction can be an essential ingredient of a materials design process (see section "Adaptive design process and active learning"). GPR also uses a kernel to define the covariance function. In contrast to KRR or SVMs where the hyperparameters of the kernel have be optimized with an external validation set, the hyperparameters in GPR can be optimized with gradient descent if the calculation of the covariance matrix and its inverse are computationally feasible. Although modern and fast implementations of Gaussian processes in materials science exist (e.g., COMBO 156 ), their inherent scaling is quite limiting with respect to the data size and the descriptor dimension as a naive training requires an inversion of the covariance matrix of order OðN 3 Þ and even the prediction scales with OðN 2 Þ with respect to the size of the dataset. 157 Based on the principles of GPR, one can also produce a classifier. First, GPR is used to qualitatively evaluate the classification probability. Then a sigmoid function is applied to the latent function resulting in values in the interval [0, 1].
In the previous description of SVMs, KRR, and GPR, we assumed that a good feature choice is already known. However, as this choice can be quite challenging, methods for feature selection can be essential.
The LASSO 158,159 attempts to improve regression performance through the creation of sparse models through variable selection. It is mostly used in combination with least-squares linear regression, in which case it results in the following minimization problem 159 : where y i are the outcomes, x i the features, and β the coefficients of the linear model that have to be determined. In contrast to ridge regression, where the L 2 -norm of the regularization term is used, LASSO aims at translating most coefficients to zero. In order to actually find the model with the minimal number of non-zero components, one would have to use the so called L 0 -norm of the coefficient vector, instead of the L 1 -norm used in LASSO. (The L 0norm of a vector is equal to its number of non-zero elements). However, this problem is non-convex and NP-hard and therefore infeasible from a computational perspective. Furthermore, it is proven 160 that the L 1 -norm is a good approximation in many cases. The ability of LASSO to produce very sparse solutions makes it attractive for cases where a simple, maybe even simulatable model (see section "Discussion and conclusions-Interpretability"), is needed. The minimization problem from Eq. (38), under the constraint of the L 0 -norm and the theory around it, is also known as compressed sensing. 161 Ghiringhelli et al. described an extended methodology for feature selection in materials science based on LASSO and compressed sensing. 162 Starting with a number of primary features, the number of descriptors is exponentially increased by applying various algebraic/functional operators (such as the absolute value of differences, exponentiation, etc.) and constructing different combinations of the primary features. Necessarily, physical notions like the units of the primary features constrain the number of combinations. LASSO is then used to reduce the number of features to a point where a brute force combination approach to find the lowest error is possible. This approach is chosen in order to circumvent the problems pure LASSO faces when treating strongly correlated variables and to allow for nonlinear models.
As LASSO is unfortunately still computationally infeasible for very high-dimensional feature spaces (>10 9 ), Ouyang et al. developed the SISSO 163 that combines sure independence screening, 164 other sparsifying operators, and the feature space generation from ref. 162 . Sure independence screening selects a subspace of features based on their correlation with the target variable and allows for extremely high-dimensional starting spaces. The selected subspace is than further reduced by applying the sparsifying operator (e.g., LASSO). Predicting the relative stability of octet binary materials as either rock-salt or zincblende was used as a benchmark. In this case, SISSO compared favorably with LASSO, orthogonal matching pursuit, 165,166 genetic programming, 167 and the previous algorithm from ref. 162 Bootstrapped-projected gradient descent 168 is another variable selection method developed for materials science. The first step of bootstrapped-projected gradient descent consists in clustering the features in order to combat the problems other algorithms like LASSO face when encountering strongly correlated features. The features in every cluster are combined in a representative feature for every cluster. In the following, the sparse linear fit problem is approximated with projected gradient descent 169 for different levels of sparsity. This process is also repeated for various bootstrap samples in order to further reduce the noise. Finally, the intersection of the selected feature sets across the bootstrap samples is chosen as the final solution.
PCA 170,171 extracts the orthogonal directions with the greatest variance from a dataset, which can be used for feature selection and extraction. This is achieved by diagonalizing the covariance matrix. Sorting the eigenvectors by their eigenvalues (i.e., by their variance) results in the first principal component, second principal component, and so on. The broad idea behind this scheme is that, in contrast to the original features, the principal components will be uncorrelated. Furthermore, one expects that a small number of principal components will explain most of the variance and therefore provide an accurate representation of the dataset. Naturally, the direct application of PCA should be considered feature extraction, instead of feature selection, as new descriptors in the form of the principal components are constructed. On the other hand, feature selection based on PCA can follow various strategies. For example, one can select the variables with the highest projection coefficient from, respectively, the first n principal components when selecting n features. A more indepth discussion of such strategies can be found in ref. 171 .
The previous algorithms can be considered as linear models or linear models in a kernel space. An important family of non-linear machine learning algorithms is composed by decision trees. In general terms, decision trees are graphs in tree form, 172 where each node represents a logic condition aiming at dividing the input data into classes (see Fig. 8) or at assigning a value in the case of regressors. The optimal splitting conditions are determined by some metric, e.g., by minimizing the entropy after the split or by maximizing an information gain. 173 In order to avoid the tendency of simple decision trees to overfit, ensembles such as RFs 174 or extremely randomized trees 175 are used in practice. Instead of training a single decision tree, multiple decision trees with a slightly randomized training process are built independently from each other. This randomization can include, for example, using only a random subset of the whole training set to construct the tree, using a random subset of the features, or a random splitting point when considering an optimal split. The final regression or classification result is usually obtained as an average over the ensemble. In this way, additional noise is introduced into the fitting process and overfitting is avoided.
In general, decision tree ensemble methods are fast and simple to train as they are less reliant on good hyperparameter settings than most other methods. Furthermore, they are also feasible for large datasets. A further advantage is their ability to evaluate the relevance of features through a variable importance measure, allowing a selection of the most relevant features and some basic understanding of the model. Broadly speaking, these are based on the difference in performance of the decision tree ensemble by including and excluding the feature. This can be measured, e.g., through the impurity reduction of splits using the specific feature. 176 Extremely randomized trees are usually superior to RFs in higher variance cases as the randomization decreases the variance of the total model 175 and demonstrate at least equal performances in other cases. This proved true for several applications in materials science where both methods were compared. 99,177,178 However, as RFs are more widely known, they are still prevalent in materials science.
Boosting methods 179 generally combine a number of weak predictors to create a strong model. In contrast to, e.g., RFs where multiple strong learners are trained independently and combined through simple averaging to reduce the variance of the ensemble model, the weak learners in boosting are not trained independently and are combined to decrease the bias in comparison to a single weak learner. Commonly used methods, especially in combination with decision tree methods, are gradient boosting 180,181 and adaptive boosting. 182,183 In materials science, they were applied to the prediction of bulk moduli 184,185 and the prediction of distances to the convex hull, respectively. 99,186 Ranging from feed-forward neural networks over selforganizing maps 187 up to Boltzmann machines 188 and recurrent neural networks, 189 there is a wide variety of neural network structures. However, until now only feed-forward networks have found applications in materials science (even if some Boltzmann machines are used in other areas of theoretical physics 190 ). As such, in the following we will leave out "feed-forward" when referring to feed-forward neural networks. In brief, a neural network starts with an input layer, continues with a certain number of hidden layers, and ends with an output layer. The neurons of the nth layer, denoted as the vector x n , are connected to the previous layer through the activation function ϕ(x) and the weight matrix A nÀ1 ij : The weight matrices are the parameters that have to be fitted during the learning process. Usually, they are trained with gradient descent style methods with respect to some loss function (usually L 2 loss with L 1 regularization), through a method known as back-propagation.
Inspired by biological neurons, sigmoidal functions were classically used as activation functions. However, as the gradient of the weight-matrix elements is calculated with the chain rule, deeper neural networks with sigmoidal activation functions quickly lead to a vanishing gradient, 191 hampering the training process. Modern activation functions such as rectified linear units 192,193 ϕðxÞ ¼ x alleviate this problem and allow for the development of deeper neural networks.
The real success story of neural networks only started once convolutional neural networks were introduced to image recognition. 55,195 Instead of solely relying on fully connected layers, two additional layer variants known as convolutional and pooling layers were introduced (see Fig. 9).
Convolutional layers consist of a set of trainable filters, which usually have a receptive field that considers a small segment of the total input. The filters are applied as discrete convolutions across the whole input, allowing the extraction of local features, where each filter will learn to activate when recognizing a different feature. Multiple filters in one layer add an additional dimension to the data. As the same weights/filters are used across the whole input data, the number of hidden neurons is drastically reduced in comparison to fully connected layers, thus allowing for far deeper networks. Pooling layers further reduce the dimensionality of the representation by combining subregions into a single output. Most common is the max pooling layer that selects the maximum from each region. Furthermore, pooling also allows the network to ignore small translations or distortions. The concept of convolutional networks can also be extended to graph representations in material science, 139 in what can be considered MPNNs 141 . In general, neural networks with five or more layers are considered deep neural networks, 55 although no precise definition of this term in relation to the network topology exists. The advantage of deep neural networks is not only their ability to learn representations with different abstraction levels but also to reuse them. 97 Ideally, the invariance and differentiation ability of the representation should increase with increasing depth of the model.
Obviously, this saves resources that would otherwise be spent on feature engineering. However, some of these resources have now to be allocated to the development of the topology of the neural network. If we consider hard-coded layers (like pooling layers), one can once again understand them as feature extraction through human intervention. While some methods for the automatic development of neural network structures exist (e.g., the neuroevolution of augmenting topologies 196 ), in practice the topologies of neural networks are still developed through trial and error. The extreme speedup in training time through graphics processing unit implementations and new methods that improve the training of deep neural networks, like dropout 197 and batch normalization, 198 also played a big role in the success story of neural networks. As these methods are included in open source libraries, like tensorflow 199 or pytorch, 200 they can easily be applied in materials science.
Neural networks can also be used in a purely generative manner, for example, in the form of autoencoders 201,202 or GANs. 203 Generative models learn to reproduce realistic samples from the distribution of data they are trained on. Naturally, one of the end goals of machine learning in materials science is the development of generative models, which can take into account material properties and therefore encompass most of the material design process.
Autoencoders are built with the purpose of learning a new efficient representation of the input data without supervision. Typically, the autoencoder is divided into two parts (see Fig. 10). The first half of the neural network is the encoder, which ends with a layer that is typically far smaller than the input layer in order to force the autoencoder to reduce the dimensionality of the data. The second half of the network, the decoder, attempts to regain the original input from the new encoded representation.
VAEs are based on a specific training algorithm, namely, stochastic gradient variational Bayes, 204 that assumes that the VAE learns an approximation of the distribution of the input. Naturally, VAEs can also be used as generative models by generating data in the form of the output of the encoder and subsequently decoding it.
GANs consist of two competing neural networks that are trained together (see Fig. 11): a generative model that attempts to produce samples from a distribution and a discriminative model that predicts the probability that an input belongs to the original distribution or was produced by the generative model. GANs have found great success in image processing 205,206 and have recently been introduced to other fields, such as astronomy, 207 particle physics, 208 genetics, 209 and also very recently to materials science. 29,210,211 More information about these algorithms can be found in the references provided or in refs. 1,[212][213][214][215][216] . We would like to note that the choice of the machine learning algorithm is completely problem dependent. It can be useful to establish a baseline for the quality of the model by using simple approaches (such as extremely randomized trees) before spending time optimizing hyperparameters of more complex models.

MATERIAL DISCOVERY
Nearly 30 years ago, the at the time editor of Nature, John Maddox wrote "One of the continuing scandals of physical science is that it remains in general impossible to predict the structure of even the simplest crystalline solids from a knowledge of their chemical composition." 217 While this is far from true nowadays, predicting the crystal structure based solely on the composition remains one of the most important (if not even the key) challenge in materials science, as any rational materials design has to be grounded in the knowledge of the crystal structure.
Unfortunately, the first-principle prediction of crystal or molecular structures is exceptionally difficult, because the combinatorial space is composed of all possible arrangements of the atoms in three-dimensional space and with an extremely complicated energy surface. 18 In recent years, advanced structure selection and generation algorithms such as random sampling, 218-221 simulated annealing, 222-224 metadynamics, 225 minima hopping, 226 and evolutionary algorithms, 19,[227][228][229][230][231][232][233] as well as the progress in energy evaluation methods, expanded the scope of application of "classical" crystal structure prediction methods to a wider range of molecules and solid forms. 234 Nevertheless, these methods are still highly computationally expensive, as they require a substantial amount of energy and force evaluations. However, the search for new or better high-performance materials is not possible without searching through an enormous composition and structure space. As there are tremendous amounts of data involved, machine learning algorithms are some of the most promising candidates to take on this challenge.
Machine learning methods can tackle this problem from different directions. A first approach is to speed up the energy evaluation by replacing a first-principle method with machine learning models that are orders of magnitude faster (see section "Machine learning force fields"). However, the most prominent approach in inorganic solid-state physics is the so-called component prediction. 61 Instead of scanning the structure space for one composition, one chooses a prototype structure and scans the composition space for the stable materials. In this context, thermodynamic stability is the essential concept. By this we mean compounds that do not decompose (even in infinite time) into different phases or compounds. Clearly, metastable compounds like diamond are also synthesizable and advances in chemistry have made them more accessible. 235,236 Nevertheless, thermodynamically stable compounds are in general easier to produce and work with. The usual criterion for thermodynamic stability is based on the energetic distance to the convex hull, but in some cases the machine learning model will directly calculate the probability of a compound existing in a specific phase.

Component prediction
Clearly the formation energy of a new compound is not sufficient to predict its stability. Ideally, one would always want to use the distance to the convex hull of thermodynamic stability. In contrast to the formation energy, the distance to the convex hull considers the difference in free energy of all possible decomposition channels. De facto, this is not the case because our knowledge of the convex hull is of course incomplete. Fortunately, as our knowledge of the convex hull continuously improves with the discovery of new stable materials, this problem becomes less important over time. Lastly, most first-principle energy calculations are done at zero temperature and zero pressure, neglecting kinetic effects on the stability.
Faber et al. 35 applied KRR to calculate formation energies of two million elpasolites (with stoichiometry ABC 2 D 6 ) crystals consisting of main group elements up to bismuth. Errors of around 0.1 eV/ atom were reported for a training set of 10 4 compositions. Using energies and data from the materials project, 78 phase diagrams were constructed and 90 new stoichiometries were predicted to lie on the convex hull.
Schmidt et al. 99 first constructed a dataset of DFT calculations for approximately 250,000 cubic perovskites (with stoichiometry ABC 3 ) using all elements up to bismuth and neglecting rare gases and lanthanides. After testing different machine learning methods, extremely randomized trees 175 in combination with adaptive boosting 183 proved the most successful with an mean average error of 0.12 eV/atom. Curiously, the error in the prediction depends strongly on the chemical composition (see Fig. 12). Furthermore, an active learning approach based on pure exploitation was suggested (see section "Adaptive design process and active learning").
In ref. 186 , the composition space for two ternary prototypes with stoichiometry AB 2 C 2 (tI10-CeAl 2 Ga 2 and the tP10-FeMo 2 B 2 prototype structures) were explored for stable compounds using the approach developed in ref. 99 . In total, 1893 new compounds were found on the convex hull while saving around 75% of computation time and reporting false negative rates of only 0% for the tP10 and 9% for the tI10 compound.
Ward et al. 34 used standard RFs to predict formation energies based on features derived from Voronoi tessellations and atomic properties. Starting with a training set of around 30,000 materials, the descriptors showed better performance than Coulomb matrices 108 and partial RDFs 109 (see section "Basic principles of machine learning-Features" for the different descriptors). Surprisingly, the structural information from the Voronoi tessellation did not improve the results for the training set of 30,000 materials. This is based on the fact that very few materials with the same composition, but different structure, are present in the dataset. Changing the training set to an impressive 400,000 materials from the open quantum materials database 80 proved this point, as the error for the composition-only model was then 37% higher than for the model including the structural information.
A recent study by Kim et al. 237 used the same method for the discovery of quaternary Heusler compounds and identified 53 new stable structures. The model was trained for different datasets (complete open quantum materials database, 80 only the quaternary Heusler compounds, etc.). For the prediction of Heusler compounds, it was found that the accuracy of the model also benefited from the inclusion of other prototypes in the training set. It has to be noted that studies with such large datasets are not feasible with kernel-based methods (e.g. KRR, SVMs) due to their unfavorable computational scaling.
Li et al. 33 applied different regression and classification methods to a dataset of approximately 2150 A 1−x A′ x B 1−y B′ y O 3 perovskites, materials that can be used as cathodes in high-temperature solid oxide fuel cell. 238 Elemental properties were used as features for all methods. Extremely randomized trees proved to be the best classifiers (accuracy 0.93, F 1 -score 0.88) while KRR and extremely randomized trees had the best performance for regression, with mean average errors of <17 meV/atom. The errors in this work are difficult to compare to others as the elemental composition space was very limited. Mean average errors of 9 and 26 meV/atom were then obtained, respectively, for unmixed and mixed garnets with the composition C 3 A 2 D 3 O 12 . By reducing the mixing to the C-site and including additional structural descriptors, Ye et al. were able to once again decrease the latter error to merely 12 meV/atom. If one compares this study to, e.g., refs. 1,35 , the errors seem extremely small. This is easily explained once we notice that ref. 56 only considers a total compound space of around 600 compounds in comparison to around 250,000 compounds in ref. 1 . In other words, the complexity of the problem differs by more than two orders of magnitude.
The CGCNNs (see section "Basic principles of machine learning -Features") developed by Xie et al., 40 the MatErials Graph Networks 132 by Chen et al., and the MPNNs by Jørgensen et al. 142 also allow for the prediction of formation energies and therefore can be used to speed up component prediction.
Up to this point, all component prediction methods presented here relied on first-principle calculations for training data. Owing to the prohibitive computational cost of finite temperature calculations, nearly all of this data correspond to zero temperature and pressure and therefore neglects kinetic effects on the stability. Furthermore, metastable compounds, such as diamond, which are stable for all practical purposes and essential for applications, risk to be overlooked. The following methods bypass this problem through the use of experimental training data.
The first structure prediction model that relies on experimental information can be traced back to the 1920s. One example that was still relevant until the past decade is the tolerance factor of Goldschmidt, 239 a criterion for the stability of perovskites. Only recently, modern methods like SISSO, 163 gradient tree boosting, 180 and RFs 174 improved upon these old models and allowed a rise in precision from 74% to >90% 143,240,241 for the stability prediction of perovskites. Balachandran et al. 241 also predicted whether the material would exist as a cubic or non-cubic perovskite, reaching a 94% average cross-validation error. The advantage of stability prediction based on experimental data is a higher precision and reliability, as the theoretical distance to the convex hull is a good but far from perfect indicator for stability. Taking the example of perovskites, one has to increase the distance to the convex hull up to 150 meV/atom just to find even 95% of the perovskites present in the inorganic crystal structure database 79 (see Fig. 13).
Another system with a relatively high number of experimentally known structures are the AB 2 C Heusler compounds. Oliynyk et al. 242 used RFs and experimental data for all compounds with AB 2 C stoichiometry from Pearson's crystal data 243 and the alloy phase diagram database 88 to build a model to predict the probability to form a full-Heusler compound with a certain composition. Using basic elemental properties as features, Olynyk et al. were able to successfully predict and experimentally confirm the stability of several novel full-Heusler phases.
Legrain et al. extended the principle of this work to half-Heusler ABC compounds. While comparing the results of three ab initio high-throughput studies 37,244,245 to the machine learning model, they found that the predictions of the high-throughput studies were neither consistent with each other nor with the machine learning model. The inconsistency between the first-principle studies is due to different publication dates that led to different knowledge about the convex hulls and to slightly differing methodologies. The machine learning model performs well with 9% false negatives and 1% false positives (in this case, positive means stable as half-Heusler structure). In addition, the machine learning model was able to correctly label several structures for which the ab initio prediction failed. This demonstrates the possible advantages of experimental training data, when it is available.
Zheng et al. 36 applied convolutional neural networks and transfer learning 246 to the prediction of stable full-Heusler compounds AB 2 C. Transfer learning considers training a model for one problem and then using parts of the model, or the knowledge gained during the first training process, for a second Fig. 13 Histogram of the distance to the convex hull for perovskites included in the inorganic crystal structure database. 79 Calculations were performed within density functional theory with the Perdew-Burke-Ernzerhof approximation. The bin size is 25 meV/ atom. (Reprinted with permission from ref. 99 . Copyright 2017 American Chemical Society.) training thereby reducing the data required. An image of the periodic table representation was used in order to take advantage of the great success of convolutional neural networks for image recognition. The network was first trained to predict the formation energy of around 65,000 full-Heusler compounds from the open quantum materials database, 80 resulting in a mean absolute error of 7 meV/atom (for a training set of 60,000 data points) and 14 meV/atom (for a training set of 5000 compositions). The weights of the neural network were then used as starting point for the training of a second convolutional neural network that classified the compositions as stable or unstable according to training data from the inorganic crystal structure database. 79 Unfortunately, no data concerning the accuracy of the second network was published.
Hautier et al. 247 combined the use of experimental and theoretical data by building a probabilistic model for the prediction of novel compositions and their most likely crystal structures. These predictions were then validated with ab initio computations. The machine learning part had the task to provide the probability density of different structures coexisting in a system based on the theory developed in ref. 146 . Using this approach, Hautier et al. searched through 2211 ABO compositions where no ternary oxide existed in the inorganic crystal structure database 79 and where the probability of forming a compound was larger than a threshold. This resulted in 1261 compositions and 5546 crystal structures, whose energy was calculated using DFT. To assess the stability, the energies of all decomposition channels that existed in the database were also calculated, resulting in 355 new compounds on the convex hull.
It is clear that component prediction via machine learning can greatly reduce the cost of high-throughput studies through a preselection of materials by at least a factor of ten. 1 Naturally, the limitations of stability prediction according to the distance to the convex hull have to be taken into consideration when working on the basis of DFT data. While studies based on experimental data can have some advantage in accuracy, this advantage is limited to crystal structures that are already thoroughly studied, e.g., perovskites, and consequently a high number of experimentally stable structures is already known. For a majority of crystal structures, the number of known experimentally stable systems is extremely small and consequently ab initio data-based studies will definitely prevail over experimental data-based studies. Once again, a major problem is the lack of any benchmark datasets, preventing a quantitative comparison between most approaches. This is even true for work on the same structural prototype. Considering, for example, perovskites, we notice that three groups predicted distances to the convex hull. 33,56,99 However, as the underlying composition spaces and datasets are completely different it is hardly possible to compare them.

Structure prediction
In contrast to the previous section, where the desired output of the models was a value quantifying the probability of compositions to condense in one specific structure, models in this chapter are concerned with differentiating multiple crystal structures. Usually this is a far more complex problem, as the theoretical complexity of the structural space dwarfs the complexity of the composition space. Nevertheless, it is possible to tackle this problem with machine learning methods.
Early attempts, which predate machine learning, include, e.g., Pettifor structural maps that use elementary properties to separate different binary or ternary structures from each other in a 2D plot, allowing the prediction of new stable structures. [248][249][250][251] In some sense, Pettifor maps are already closely related to recent work, such as ref. 163 , where a structural map for binary structures based on chemical properties was developed with SISSO. Some of the first applications of modern machine learning crystal structure prediction can be found in ref. 146 . There, Fischer et al. developed an approach based on the cumulant expansion method, described in ref. 252 , to predict the probability of an elemental composition forming a specific binary crystal structure. Their method estimates the correlation of the stability of two structures with respect to their composition. The model was trained with data from ref. 90 and evaluated with leave-one-out cross-validation. It was able to predict the correct structure in 90% of the cases during the first five guesses, in comparison to 62% when picking the structures according to their frequency in the dataset. It has to be mentioned here once again that leave-one-out cross validation is not a good method to evaluate the extrapolation ability of such models. 74,75 Olynyk et al. 32 applied cluster resolution feature selection 253 to the classification of binary crystal structures. These features were then used as input for partial least-squares discriminant analysis (PLS-DA) and SVMs. In order to reduce the complexity of the problem, only the seven most common binary prototype structures were considered. A dataset of 706 compounds was divided into three sets, 235 for feature selection, 235 for optimization of the PLS-DA and SVMs, and 236 for validation. The SVMs performed better with an average false positive rate of 5.8%, a false negative rate of 7.3%, and an accuracy of 93.2%, compared to the PLS-DA with 3.5%, 34.0%, and an accuracy of 77.1%. It has to be noted that these values differ significantly depending on the crystal system that one tries to predict (see Fig.  14). This approach was adapted by Olynyk et al. 254 to equiatomic ternary compounds. SVMs were used on a dataset of~1500 As crystal structure prediction is only the first step in the rational design process, combining stability determination with property design is necessary. Balachandran et al. 31 studied a set of 60,000 potential xBiMe′ y Me″ 1 − y O 3 -(1 − x)PbTiO 3 perovskites with several machine learning methods. First, SVMs were used to classify them into perovskites and non-perovskites, followed by a prediction of the Curie temperature of those classified as perovskites. Once a candidate was experimentally synthesized, it was added to the training set and the process was repeated. Of the ten synthesized compounds, six perovskites were found, whose highest Curie temperature was reported to be 898 K.
Graser et al. 30 applied RFs for crystal structure classification of 24,215 compounds from Pearson's crystal data 243 database. Naturally, a lot of prototypes only have very few representatives (<10) in the database. In order to circumvent this problem, prototypes with fewer instances than a certain cutoff number were put into a group denoted as "other." As the "other" class comprised between 92.51% and 64.1% of the dataset, depending on the choice of cutoff number, this greatly reduced the complexity of the dataset. Graser et al. then researched the change in predictive ability of the model with respect to the cutoff number. As expected, the recall improved with increasing cutoff number. The confusion table (see Fig. 15) demonstrates that, through the use of large datasets, even a simple method can achieve impressive results for an extremely challenging task like crystal structure prediction.
Park et al. 255 tackled the problem of crystal structure prediction from a slightly different perspective. All the others methods discussed up to now used the chemical composition, or data derived from the chemical composition, and structure as descriptors. In contrast, Park et al. used powder X-ray diffraction patterns to determine the crystal system, extinction group, and space group of inorganic compounds. Because the three-dimensional electron density is contracted into a 1D diffraction pattern, the symmetry of the crystal is often not fully determined from the diffraction pattern alone, especially for low-symmetry structures. While software for indexing and determining the space group exists, it requires substantial expertise and human input to obtain the correct results. Previous machine learning attempts in this field mostly considered the task of feature engineering (e.g., PCA [256][257][258][259] or manual featurization 260,261 ) or considered smaller datasets and shallower neural networks. In contrast, in ref. 255 deep convolutional neural networks were developed, using X-ray patterns as input and giving as output the space group, extinction group, or crystal system. For the training data, structural data from the inorganic crystal structure database 79 was used to calculate randomly perturbed spectra, which simulated real spectra. During testing on a dataset that amounted to around 20% of the training set, the network reached accuracies of 81.14, 83.83, and 94.99% for space group, extinction group, and crystal system classifications, respectively. Furthermore, the model was able to correctly identify the structural system of two novel compounds 262,263 whose prototype structure did not appear in the database (and therefore neither in the training set). Albeit the model was not performing better than human experts using a software like TREOR, 264 it has the potential to be a useful tool to non-experts and in order to speed up the identification process of X-ray diffraction spectra in general. The success of the model is not surprising, as the use of convolutional neural networks in image classification [265][266][267] is well established in computer science.
A similar approach for crystal structure classification was followed in ref. 268 . Zilleti et al. used convolutional neural networks to classify crystal structures by a simulated 2D diffraction fingerprint. The approach was limited to the classification of crystal structures, because the 2D diffraction pattern is not unambiguous for all space groups, and consequently the neural network is not able to distinguish between rhombohedral and hexagonal structures, for example. They also used attentive response maps, [269][270][271][272][273] trying to achieve some interpretability and visualization of the model. This will be further discussed in section "Discussion and conclusions-Interpretability".
Further interesting work concerning micro-structure and material characterization, using machine learning-based image processing, can be found in refs. 274-279. If we consider structure prediction through machine learning, we also have to consider global structure prediction methods, where the whole energy surface has to be explored efficiently. This can also be considered as a surrogate-based optimization problem (see section "Adaptive design process and active learning"), where the expensive experiment is the local geometry optimization through DFT. Yamashita et al. 280 started with a large set of initial structures from which a random subset was locally optimized and used to train a Bayesian regressor to predict the energy. Using Thompson sampling, 281 structures from the initial set were sampled randomly according to their probability of minimizing the energy. The approach was tested on NaCl and Y 2 Co 17 and reduced the average number of trials until finding the optimal structure by, respectively, 31% and 39% when compared to random structure selection.
Lastly, we discuss two works that introduced modern neural network architectures to crystal structure prediction and generation. Both methods have also been used recently for microstructures by Li et al. 210,282 Ryan et al. 28 applied VAEs (see section "Basic principles of machine learning-Algorithms") to crystal structure prediction. The 42-layer VAEs develop a more efficient representation for the input (see section "Basic principles of machine learning-Features"). For training and testing, a dataset of around 50,000 crystal structures from the inorganic crystal structure database 79 and the crystallography open database 89 were used. The encoding of the original descriptors was used as input for a five-layer sigmoid classifier that predicts the most likely elements to form the topology represented by the atomic fingerprints. A third auxiliary neural network, in this case a five-layer softmax classifier, combined the non-normalized atomic fingerprints and the output of the sigmoid classifier that improves the prediction. To predict directly the crystal structure from this approach, one requires training data of negatives or, in other words, knowledge of crystal structures that do not exist. Unfortunately, no such database is available in physics. In order to circumvent this problem, Ryan et al. calculated the likelihood of the existence of a structure as the product of the probabilities of elements existing at the single atomic sites. Application to test data demonstrated a clear superiority of this approach in comparison with random choices.
Nouira et al. 29 introduced a GAN-based strategy (see section "Basic principles of machine learning-Algorithms") to crystal structure generation in the form of CrystalGAN. Specifically, they created a novel GAN structure to generate stable ternary structures on the basis of binary hydrides. Remarkably, the method generates structures of higher complexity and is able to include constraints based on domain knowledge. However, as no data about the stability of the generated structures were published, the evaluation of the usefulness of this approach is still pending. A second application of GANs in materials science, and in particular in chemistry, can be found in ref. 211 .

PREDICTION OF MATERIAL PROPERTIES
Machine learning methods have proven to be successful in the prediction of a large number of material properties. An overview of different properties that were predicted can be found in Table  1. In the following, we discuss in depth a few properties, studied in various works, which provide good examples for current challenges in computational materials science, and possible strategies to overcome them.
Band gaps Design of functional materials for applications like light-emitting diodes (LEDs), photovoltaics, scintillators, or transistors, always requires detailed knowledge of the band gap. Consequently, a lot of effort was invested in theoretical methods for high-throughput calculations of this electronic property. It is well known that standard exchange correlation functionals, like the PBE, 145 systematically underestimate band gaps in comparison to experimental results. More modern functionals like the modified Becke-Johnson by Tran and Blaha 332 or the strongly constrained and appropriately normed meta-GGA 333 by Jianwei et al. improve upon these results. However, the state-of-the-art higher-fidelity methods still remain the many-body GW approximation or hybrid functionals. Unfortunately, these usually come with a prohibitively high computational cost. Machine learning is one possibility to overcome this obstacle by either directly predicting band gaps based on experimental or theoretical training data or by using the results of low-fidelity methods to predict experimental or highfidelity theoretical results.
Zhuo et al. 289 tried to circumvent the problems of the different theoretical methods by directly predicting experimental band gaps. Their approach started with a classification of the materials as either metal or non-metal using SVM classifiers and then progressed by predicting the band gap with SVM regressors. The performance of the resulting models in predicting experimental band gaps lies somewhere between basic functionals (like the PBE) and hybrid functionals. The error turns out to be comparable to, e.g., ref. 40 Lattice parameter 300 Debye temperature and heat capacity [41][42][43] Glass transition temperature 301,302 Thermal expansion coefficient 41 Thermal boundary resistance 303 Thermal conductivity 37,[46][47][48][49][50][51]304,305 Local magnetic moments 127,306 Melting temperature 39,48,307 Magnetocaloric effects 283 Grain boundaries 308 Grain boundary energy [309][310][311][312] Grain boundary mobility 312 Interface energy 300 Seebeck coefficient 46,313,314 Thermoelectric figure of merit 315 Bulk and shear moduli [40][41][42]132,184,185,316 Electrical resistivity 46 Density of states 109,317,318 Fermi energy and Poisson ratio 40 Dopant solution energy 319 Metal-insulator classification 65 Topological invariants [320][321][322][323][324][325][326] Superconducting critical temperature 73 291 approached the problem from a different perspective by using low-fidelity DFT gaps (modified Becke-Johnson and PBE), as well as basic crystalline and elemental properties as features for optimized link state routing, LASSO, and nonlinear SVR. They predicted gaps calculated with G 0 W 0 starting from the ground-state obtained with the Heyd-Scuseria-Ernzerhof hybrid functional. 334 SVR using both the modified Becke-Johnson and the PBE gaps, as well as the other features, yielded the best results, with a root mean square error of 0.24 eV.
Pilania et al. 292 applied a co-kriging statistical learning framework to learn high-fidelity band gaps. Their approach differed from previous ones, as low-fidelity band gaps were not explicitly used as features and were therefore not necessarily needed as input for all compounds.
Another interesting attempt at the prediction of high-fidelity band gaps can be found in ref. 293 . Rajan et al. used KRR, SVR, GPR, and decision tree boosting methods to predict the G 0 W 0 band gaps of MXenes. They started by generating and selecting features with LASSO 159 and then optimizing the feature space for each method. Counterintuitively, the PBE band gap was not included in the optimized feature space of any method. However, other researchers suggest to include this information 283,294 and stress the importance of so called crude estimations of property 77 (see section "Basic principles of machine learning-Features").
Weston et al. 295 investigated the band gaps of kesterite compounds and developed a logistic regression classifier for the prediction of the direct-indirect property of these band gaps. A total of 184 semiconducting materials were used for training, and the best model demonstrated an accuracy, recall, precision, and f 1 score of around 90%.
Bulk and shear moduli Two other popular properties in solid-state machine learning are the bulk and shear moduli, which determine the stress-strain relations in the linear range. They are also correlated with other properties like the bonding strength, thermal conductivity, 184,335,336 charge carrier mobility, 337 and of course the hardness of the material. 338,339 As such, they are often used as a proxy in the search for superhard 340 (hardness >40 GPa) materials. In general, these properties are available as a result of DFT calculations; however, they are too computationally expensive for really large high-throughput studies. Two less computationally expensive alternatives exist, specifically force-field methods 341 and theoretical models for the direct calculation of bulk and shear moduli. However, force fields lack accuracy, and most theoretical models only span a highly restricted chemical and structural space. [342][343][344][345] This opens up the question whether machine learning algorithms can show better generalizability.
de Jong et al. 184 developed a new machine learning technique, called gradient boosting machine local polynomial regression, that extends the principles of gradient boosting frameworks 180 to the case of multivariate local polynomial regression. 346 They used this technique to predict the Voigt-Reuss-Hill averages 347 of the bulk and shear moduli on the basis of elemental properties. In this case, they used the volume per atom, row number, cohesive energy, and the electronegativity as features. The use of the cohesive energy as a feature is slightly problematic as it also requires DFT calculations. The training set consisted of around 2000 materials and a root mean square error of 0.075 log(GPa) and 0.138 log(GPa) were reached for the logarithm of the bulk and shear moduli. The logarithm was used to decrease the emphasis on large values. It has to be noted that the training set was biased toward metallic compounds and rather simple materials.
The previously discussed CGCNNs by Xie et al. 40 also allows for the prediction of bulk and shear moduli. Test set errors of 0.105 log(GPa) and 0.127 log(GPa) for these properties were reported for the dataset of ref. 184 . The network was also tested on 1585 materials that were recently added to the materials project database. 78 Once again the network demonstrated good generalizability for the new dataset with different crystal groups. Doubling the size of the training data, the MEGNet model of Chen et al. 132 obtained around 10% lower errors for bulk and shear moduli.
A similar study for siliceous zeolites was performed in ref. 185 , where gradient boosting regressors were used to predict once again the logarithm of the bulk and shear moduli. They obtained an error of 0.102 ± 0.034 log(GPa) for the bulk and 0.0847 ± 0.022 log(GPa) for the shear moduli. Even if the training set only contained 121 zeolites, this method seems to compare favorably to the 5 conventional force field methods 348-352 reported in ref. 353 . In contrast to ref. 184 , Evans et al. used structural and local descriptors as the challenge was to differentiate between the different siliceous zeolites and not between materials of different elemental composition.
Furmanchuk et al. 42 used RFs to predict the bulk modulus. A wide variety of 1428 compounds from the thermoelectric design lab database, 91 containing from unitary up to quinary combinations of 62 elements, was used for training. The notable fact about this study is that thermal effects, which are usually neglected in DFT calculations, were included through Birch and Murnaghan fits. 354,355 As features, properties of the element itself and experimentally measured properties of elemental substances were used. Another set of 356 theoretically calculated materials and 69 experimentally measured ones was kept for testing. A root mean square error of 18.75 GPa was reported for the first set, while no error was reported for the experimental set.
Isayef et al. 41 developed an extension of property-labeled material fragments to be used for solids. As this leads to a very general feature vector, one can apply it to the prediction of a variety of properties. 41 Using gradient boosting decision trees and a training set of around 3000 materials, they achieved errors of 14.25 and 18.43 GPa for the bulk and shear moduli, respectively. It has to be noted that this training set only considered unary to ternary compounds and neglected quaternary compounds. In contrast, these were also considered in, e.g., refs. 184,185,316 .
Another interesting machine learning study of the bulk and shear moduli of solids is ref. 316 . Mansouri et al. combined elemental and structural properties as descriptors and used SVRs to screen a chemical space of around 120,000 materials for superhard incompressible materials. This was actually followed by the synthesis and characterization of two novel superhard materials. Once again, the cohesive energy was identified as one of the crucial features for both moduli.

Topological states
The discovery of topological insulators has sparked an extreme interest into the field of topological states in condensed matter. [356][357][358] It is therefore not surprising that in the past 2 years machine learning ansätze were introduced to the topic. In general, learning topological phases is a highly non-trivial task as topological invariants are inherently non-local. In the field of topological states, neural networks are by far the most relevant machine learning method used. [320][321][322][323][324][325][326] In refs. 321,323,324 , this technique was used to predict the topological invariants of, respectively, 1D topological insulators of the A 3 class, the 2D XYmodel, and 1D topological insulators of the A 3 class, as well as 2D insulators of the A class. In the later two works, analysis of the neural network confirmed that it learned both the winding number formula 321 and a formula for the Berry curvature in the case of the A class insulators 323 (see section "Discussion and conclusions-Interpretability" for a more extensive discussion). Another interesting application that takes advantage of the extreme success of neural networks for image classification is quantum loop topography. 320 In this method, an image representing the Hamiltonian or wave function is constructed and entered into a neural network that decides on the topological phase of the system with great accuracy. Although the input is the result of Monte Carlo simulations, it is rather efficient, as it only requires single steps and not Monte Carlo averages.
While the previous examples are still mostly concerned with theoretical models, more recent work is already concerned with designing topological photonic devices directly through machine learning methods. 325 Pilozzi et al. designed photonic devices described by the Aubry-André-Harper model 359,360 with neural networks (see Fig. 16). The desired property are edge states with specific frequencies ω t , which are determined by a set of structure parameters. In order to solve the problem, direct and inverse models are combined. The process starts with the inverse neural network, determining the structure parameters required for an edge state with frequency ω ind t . In order to simplify the problem, some categorical features are not included in the neural network, but actually one neural network is trained for each categorical value. The obtained structure parameters from the inverse neural network are used as input for the direct neural network that produces a new frequency ω dir t . If the discrepancy between the two frequencies is smaller than a certain threshold ω ind t À ω dir t < δ, the structure parameters from the indirect neural network are accepted. This self-consistent approach is used to filter out the unphysical structures from the results of the inverse neural network.

Superconductivity
Even 30 years after its discovery, 361 unconventional superconductivity remains one of the unsolved challenges of theoretical condensed matter physics. As machine learning methods do not require a complete theoretical understanding of the problem, determining the critical temperature T c is an obvious challenge for these methods. In the case of critical temperatures, data accumulation is problematic, as there are few computational methods to calculate critical temperatures, [362][363][364] and these are limited to conventional superconductors. Moreover, they are far less widely available than, e.g., methods to calculate the band gap or bulk moduli. On the one hand, this is a drawback as it severely limits the acquisition of data, but on the other hand machine learning methods could prove even more important as no general working theoretical model exists.
There was some early work, akin to machine learning, on clustering superconductors based on quantum structure diagrams 365,366 and some more recent work concerning the filtering of materials for cuprate superconductors based on their electronic structure. 367 A discussion of similar design approaches can be found in ref. 368 . In refs. 327,328 , the superconducting critical temperature is fitted to the lattice parameters with an SVM. Unfortunately, both studies clearly suffer from the difficulty of accumulating data. The former is concerned with iron-based superconductors and has a training set of 30 materials while the latter only treats doped MgB 2 with a training set of 40 materials. Even though these examples do not take advantage of the fortes of machine learning methods, they still reach an error 1.17 K and 1.10 K, admittedly for a very limited domain. The actual search for superconductors in a larger domain is far more challenging, because the Kohn-Luttinger theorem 369 suggests that fermionic systems with a Coulomb interaction are in general superconducting for T → 0. This presents a difficulty, as leaving compounds with no reported critical temperature out of the dataset, or assuming that critical temperature is zero, would either lead to a misrepresentation or underrepresentation of data. 76 However, as we are often interested in high-temperature superconductors from a technological perspective, we can circumvent the problem by classifying potential superconductors as low or high T C , instead of using a regressor to predict the critical temperature.
Isayev et al. 122 used RFs to divide superconductors into groups, one group with T C below and one group with T C >20 K and RFs and partial least squared regression to build a continuous model of the transition temperature. The training set size was nevertheless still very limited (464 classification, 295 regression).
A study by Stanev et al. 76 considered a larger training set of around 14,000 materials from the SuperCon database. 82 Superconductors were first classified into groups with T c below and above 10 K, resulting in an accuracy and F 1 score of about 92%. The features were created using Magpie 370 and consisted of elemental properties and combinations of them. Interestingly enough, when reducing the number of descriptors to only the three used in refs. 365,366 , specifically the average number of valence electrons, the metallic electronegativity differences, and orbital radii differences, the accuracy of the classifier only decreased by around 3%. This suggests that little progress was made in terms of such features in the meantime.
The regression model for log(T C ) was built for materials with transition temperatures >10 K to avoid the previously discussed problems and reached an R 2 score of around 0.88. By dividing the training set into different groups of superconductors, Stanev et al. could demonstrate that the model recovered physical knowledge, such as the isotope effect or other empirical relations. 371 Furthermore, it was clear that the model was not able to extrapolate from one group of superconductors to another, e.g., from conventional to cuprate superconductors. This is, of course, expected owing to the different superconducting mechanisms involved in the two families. This extrapolation problem of materials science machine learning models and methods to estimate it are also discussed in ref. 73 . Finally, Stanev et al. applied the classifier and regressor to the materials in the inorganic crystal structure database 79 and scanned it for new high-T C superconductors. As a byproduct, a feature of the band structure that is known to increase the T C was recovered even though no electronic structure data were used in the model.
Ling et al. 329 also used RFs in research concerning the design process of materials including superconductors, but as their research is mainly concerned with the optimization of the design process, we will discuss it later in the section "Adaptive design process and active learning". Another interesting study 372 of superconductors applied k-means clustering, PCA, and Bayesian linear unmixing to scanning tunneling microscopy data in order to extract meaningful data regarding electronic interactions in the spin-density wave regime. Note that these are expected to play a key role in the existence of unconventional superconductivity. Fig. 16 Strategy to solve the inverse design problem for photonic devices, 325 where χ 0 is the structure parameter suggested by the inverse network and ξ, m ± ij , and s ± are extra inputs for both networks. (Reprinted with permission from ref. 325 licensed under the CC BY 4.0 cc license.)

ADAPTIVE DESIGN PROCESS AND ACTIVE LEARNING
The previous chapters were concerned with the prediction of the stability, atomic structure, and physical properties. Necessarily, all of these methods have the end goal of minimizing the time until a new optimal material with tailored properties is found. This can either imply the minimization or maximization of a single property or the search for a material on the Pareto front in the case of multiple objectives. In order to reach this goal, we aim at reducing the number of "experiments" that have to be carried out, as these are the most time consuming and expensive segment of the discovery process. In our case, experiment may denote computationally expensive calculations, like the ones necessary to obtain the phonon and electron transport properties required for the design of thermoelectrics. A more general discussion of such optimization problems can be found in the literature under the name of surrogate-based optimization 373,374 and active learning.
The adaptive design process consists of two interwoven tasks: (i) A surrogate model has to be developed; (ii) Based on the prediction of the surrogate model, optimal infill points have to be chosen in order to retrain the surrogate model and finally find the optimum.
The challenge in this process is to balance the end goal of finding the best material (exploitation) with the need to explore the space of materials in order to improve the model. 375 The most naive strategy is naturally pure exploitation, in which case the design algorithm always chooses the material with the highest prediction of the target value (lowest in the case of minimization). From other fields, e.g., drug design 376,377 and quantitative structure-activity relationship in chemistry, 378,379 it is already known that such an unsophisticated approach is far from optimal. More sophisticated policies, such as maximum likelihood of improvement or maximum expected improvement, try to strike a balance between these strategies. However, choosing the optimal experiment according to one's strategy requires machine learning models that not only return predictions but also the uncertainty of a prediction.
Starting from this requirement, the most obvious algorithm choice are Bayesian prediction models like Gaussian processes 380 as they also provide the variance of the predicted function. Gaussian processes have been applied to a wide variety of structure optimization and design problems in materials science. A few examples are optimizing thermal conductance in nanostructures, 47 predicting interface 309 and crystal 280 structures, optimizing materials for thermoelectric 381,382 and optoelectric 382 devices, or optimizing GaN LEDs. 383 Furthermore, these studies already resulted in successfully synthesized materials. 384,385 We already discussed in the section "Basic principles of machine learning-Algorithms" that the inherent scaling of Gaussian processes both with respect to training set size as well as feature dimension is quite bad. 157 At the moment, a lot of adaptive design studies still treat extremely small datasets (see, e.g., ref. 384 with a training set size of 22), in which case this is irrelevant. However, a large number of the previously discussed models for stability or property prediction use high-dimensional descriptors and are therefore also unsuitable for Bayesian methods. 329,386 One alternative to Bayesian predictors are standard machine learning algorithms, like SVRs or decision tree methods, in combination with bootstrapping methods to estimate the uncertainty. In ref. 375 , Balachandran et al. compared different surrogate models and strategies on a set of M 2 AX compounds for the optimization of elastic properties. From a pure prediction perspective, SVRs with radial basis function slightly outperformed Gaussian processes for training set sizes >120 materials. Different design strategies were then used in combination with the SVR. It turned out that efficient global optimization, 387 as well as knowledge gradient, 388 showed the best results. Xue et al. 384 obtained similar results concerning the choice of algorithms for the composition optimization of NiTi-based shape memory alloys. Starting with a set of 22 materials, Xue et al. successfully synthesized 14 materials (from a total of 36 synthesized in total during 9 feedback loops), which were superior to the original dataset.
Balachandran et al. 389 also applied SVRs in combination with efficient global optimization to the maximization of the band gap of A 10 (BO 4 ) 6 X 2 apatites. In this case, the performance for two feature sets, one containing the Shannon ionic radii and the other one the Pauling electronegativity differences was compared. Interestingly, the design based on the ionic radii performed better, finding the optimal material after 22 materials (13 materials in the initial training set, 9 chosen by the design algorithm) in comparison to 30 for the electronegativities, while having a far larger error in the machine learning model (0.54 eV compared to 0.19 eV for the electronegativities). The result is most likely due to the fact that, of the three atomic species considered for the B-site (P, V, As), P provides clearly higher band gaps than the other elements and has a different ionic radius while the electronegativity of P and As are nearly the same. 389 Using this information, the algorithm eliminated all compositions without P on the B-site. This example demonstrates that sometimes the algorithm with the highest predictive power will not necessarily lead to the best optimal design results. A combination of the two predictors leads to even better results with the optimal composition after one iteration; however, the mean absolute error of the model was still slightly worse (0.21 eV) than the one of the purely electronegativity-based model.
Ling et al. 329 treated a high-dimensional (with respect to the descriptor space) materials design problem with the RF framework FUELS. 390 By adding a bias term to the uncertainty, which accounts for noise and missing degrees of freedom, they expanded upon previous uncertainty estimates from refs. 391,392 . Tested on 4 datasets (magnetocaloric, thermoelectric, superconductors, and thermoelectric) with higher descriptor number (respectively, 54,54,56,22), FUELS compared favorably with the Bayesian framework COMBO and random sampling, while being roughly an order of magnitude faster. In order to evaluate various selection strategies or model algorithms, different metrics were used. In materials science, a commonly used metric is the number of experiments until the optimal material is found. While this metric has some merit, in most cases opportunity cost (the distance of the current best from the overall best) or the number of experiments until the current best is within a specific distance (e.g., 1%) is superior and is also used more often in the literature. 393,394 Monte Carlo tree searches 395 are a second algorithm with superior scaling that has recently been introduced to materials science. The application is inspired by its success in go, 2 where a combination of neural networks, reinforcement learning, and Monte Carlo tree search allowed for the first superhuman performance in this ancient strategy game. Dieb et al. 396 implemented a materials design version in the form of the open source library MDTS. Using the test case of the optimal design of thermoelectric Si-Ge alloys, they demonstrated that, although Bayesian optimization has advantages for small problems due to its advanced prediction abilities, Monte Carlo tree search design time stays close to constant (see Fig. 17) with increasing problem size. Furthermore, and in contrast to genetic algorithms, it does not require the determination of hyperparameters. Owing to the unfavorable scaling of Bayesian optimization, at some point the computational effort of the design becomes larger than the computational effort of the experiments, at which point Monte Carlo methods become superior. For the interface structure optimization in ref. 396 , this is already the case for interfaces with >22 atoms. Further applications to the determination of grain boundary structures 397 and the structure of boron-doped graphene 398 also demonstrate the viability of the method for structure design problems. A more in-depth review of Bayesian optimization and Monte Carlo tree search in materials design can be found in ref. 399 .
Sawada et al. 400 also developed an algorithm for multicomponent design based on game tree search. Optimizing the composition in a seven-component Heusler compound, the algorithm proved to be around nine times faster than expected improvement or upper confidence bound 401 strategies based on Gaussian processes.
Dehghannasiri et al. 402 proposed an experimental design framework based on the mean objective cost of uncertainty. This is defined as the expected difference in cost between the material, which minimizes the expected cost for a surrogate model and the optimal material. 403 Applying the framework to the minimization of the dissipation energy of shape memory alloys demonstrated the superiority of the algorithm to pure exploitation and random selection.
So far, none of the discussed algorithms considered nested decision problems or cases where it is more efficient to carry out experiments in batches of similar experiments instead of one at a time. The latter is, for example, true for the case of the photoactive device design considered by Wang et al. 404 The size of thiol-gold nanoparticles and their density on the surface determine the efficiency of the device. While one can easily explore different densities of nanoparticles in a batch of experiments, it is difficult to change the size of the nanoparticle due to the cost of their synthesis. Therefore, it is more efficient to consider a nested problem where the algorithm first chooses a size and then a batch of densities. Wang et al. extended the concept of knowledge gradient 388 to the case of nested decisions and batches of experiments. Applying it to the previously described design problem, the new algorithm proved to be superior to all naive strategies (pure exploitation/exploration, or ε-greedy which chooses either pure exploration or exploitation with probability ε) and also to sequential knowledge gradient (batch size 1) if one considers the number of batches. If one instead considers the total number of experiments, the performance of knowledge gradient was only slightly better.
If we consider typical design problems, one often has to consider multiple objectives. For example, for the design of a shape memory alloy, one desires a specific finish temperature, thermal hysteresis, and possibly a high maximum transformation strain. Naturally, this requires more sophisticated measures of improvement (see ref. 405 for a review) than single objective optimization methods. A typical measure is the expected hypervolume improvement 406 that measures the change in hypervolume of the space dominated by the best known materials. Solomou et al. 407 applied this metric to the optimization of shape memory alloys in combination with a Gaussian process model, once for two objectives (specific finish temperature and thermal hysteresis) and once for three objectives (adding the maximum transformation strain), and demonstrated that it is clearly superior to a random or purely exploitative strategy.
Talapatra et al. 408 also combined expected hyper-volume improvement with Gaussian processes in order to simultaneously maximize the bulk modulus while minimizing the shear modulus. Instead of using a single Gaussian regressor, they developed a method called Bayesian model averaging, which combines different models. This approach can prove useful in cases where the available data is too limited to choose good features or hyperparameters.
Gopakumar et al. 409 compared both SVRs and Gaussian processes on multiple datasets: optimal thermal hysteresis and transition temperature for shape memory alloys, optimal bulk and Young's modulus for M 2 AX phases, and optimal piezoelectric modulus and band gap for piezoelectric materials. SVRs performed better as regressors and were consequently chosen as surrogate model. Several optimal design strategies were used, specifically random, exploitation, exploration, centroid, and maximin. For the smallest dataset, maximin surprisingly performed only slightly better for large experimental budgets and worse than pure exploitation for small budgets. However, for the larger dataset of elastic moduli both centroid and maximin proved to be clearly superior.
An additional popular choice of global optimization algorithms that can also be applied to adaptive design, especially to structure development, are genetic algorithms. Reviews of their application to materials design can be found in refs. 230,410 .
It is difficult to compare the ability of the different optimal design algorithms and frameworks discussed in this section because no systematic study has ever been carried out. Nevertheless, it is quite clear that, given sufficient data, adaptive design algorithms produce superior results in comparison to naive strategies like pure exploration or exploitation, which are unfortunately still extremely common in materials science. Furthermore, several works demonstrated that experimental resources are used more efficiently if they are allocated to the suggestions of the design algorithm instead of a larger initial random training set. Machine learning models can be quite limited in their accuracy; however, the inclusion of knowledge of this uncertainty in the design process can alleviate these limitations. This allows for a feedback cycle between experimentalists and theoreticians, which increases trust and cooperation and reduces the number of expensive experiments.

MACHINE LEARNING FORCE FIELDS
As previously discussed, first-principle calculations can accurately describe most systems but at a high computational price. Usually this price is too high for use in molecular dynamics, Monte Carlo, global structural prediction, or other simulation techniques that require frequent evaluations of the energy and forces. Even DFT is limited to molecular dynamics runs of a few picoseconds and simulations with hardly more than thousands of atoms. For this reason, the research concerning empirical potentials and the development of models for the potential energy surfaces never faded away.
In fact, most molecular dynamics simulations are normally computed with classical force fields. [411][412][413][414][415][416][417] As these potentials often scale linearly with the number of atoms, they are computationally inexpensive and the loss in accuracy is overlooked in favor of the possibility to perform longer simulations or simulations with hundreds of thousands or even millions of atoms. Another approach is DFT-based tight binding. [418][419][420] This quantum-mechanical technique scales with the cube of the number of electrons but has a much smaller prefactor than DFT. Certainly, calculations performed with this method are not as accurate as in DFT, but they are more reliable than classical force field calculations. In addition to the reduced precision, the construction of force fields and tight-binding parameters is unfortunately not straightforward.
Neural networks were the first machine learning method used in the construction of potential energy surfaces. As early as 1992, Sumper et al. 421 used a neural network to relate the vibration spectra of a polyethylene molecule with its potential energy surface. Unfortunately, the large amount of input data and architecture optimization required deemed this approach as too cumbersome and difficult to apply to other molecular systems. It was the work of Blank et al. 422 in 1995 that really showed the potential, and marked the birth, of machine learning force fields. Their work on the surface diffusion of CO/Ni(111) relied on neural network potentials, which mapped the energy of a system with its structure, mainly the lateral position of the center of mass, the angle of the molecular axis relative to the surface normal, and the position of the center of mass. The training set was obtained from electronic structure calculations and no further approximations were used. Their seminal study proved that neural networks could be used to make accurate and efficient predictions of the potential energy surface for systems with several degrees of freedom.
Since then, many machine learning potentials were reported. As several reviews on these potentials can be easily found in the literature, 112,[423][424][425] here we discuss only the most prominent and recent approaches related to materials science.
One of the most successful applications of machine learning to the creation of a reliable representation of the potential energy surface is the Behler and Parrinelo approach. 110 Here the total energy of a system is represented as a sum of atomic contributions E i . This became the standard for all later machine learning force fields, as it allows their application to very large systems. In the Behler-Parrinelo approach, a multilayer perceptron feedforward neural network is used to map each atom to its contribution to the energy. Every atom of a system is described by a set of symmetry functions, which serve as input to a neural network of that element. Every element in the periodic table is characterized by a different network. As the neural network function provides an energy, analytical differentiation with respect to the atomic positions or the strain delivers, respectively, forces and stresses. This approach was originally applied to bulk silicon, reproducing DFT energies up to an error of 5 meV/atom. Furthermore, molecular dynamic simulations using this potential were able to reproduce the RDF of a silicon melt at 3000 K. Many applications of this methodology to the field of materials science have appeared since then, for example, to carbon, 426 sodium, 427 zinc oxide, 428 titanium dioxide, 111 germanium telluride, 429 copper, 430 gold, 431 and Al-Mg-Si alloys. 432 Since its publication in 2007, several improvements were made to the Behler and Parrinelo approach. In 2015, Ghasemi et al. proposed a charge equilibration technique via neural networks, 433 where an environment-dependent atomic electronegativity is obtained from the neural networks and the total energy is computed from a charge equilibration method. This technique successfully reproduced several bulk properties of CaF 2 . 434 In 2011, the cost function was expanded to include force terms. 424,428 This extension was first proposed by Witkoskie et al. 435 and later extended and generalized by Pukrittayakamee et al. 436,437 These works show that the inclusion of the gradients in the training substantially improves the accuracy of the force fields, not only due to the increase of the size of the training set but also due to the additional restrictions in the training. Hajinazar et al. 438 devised a strategy to train hierarchical multicomponent systems, starting with elemental substances and going up to binaries, ternaries, etc. They then applied this technique to the calculations of defects and formation energies of Cu, Pd, and Ag systems and were able to obtain an excellent reproduction of phonon dispersions. Another improvement concerns the replacement of the original Behler-Parrinelo symmetry functions by descriptors that can be systematically improved. One such descriptor is given by Chebyshev polynomials, 117 which also allow for the creation of potentials for materials with several chemical elements, due to its constant complexity with respect to the number of species. Potentials constructed with this descriptor are the reported machine learning potentials that can describe more chemical species, with 11 so far.
Artrith et al. 439 proved the applicability of specialized neural network potentials in their study of amorphous Li-Si phases. They compared the results obtained with two different sampling methods. The first involved a delithiation algorithm, which coupled a genetic algorithm with a specialized potential trained with only 725 structures close to the crystalline Li x Si 1−x phase. The second method consisted of an extensive molecular dynamics heat-quench sampling and a more general potential. Figure 18 shows the accuracy of the latter neural network potential.
We note that not only machine learning methods are changing the field of materials science but also machine learning methodologies. The spectral neighbor analysis potential 440 from Thompson et al. consists of a linear fit that associates an atomic environment, represented by the four-dimensional bispectrum components, with the energies of solids and liquids. The first application of these potential to tantalum showed promising results, as it was able to correctly reproduce the relative energy of different phases. Furthermore, in the application of this potential to molybdenum by Chen et al., 441 PCA was used to examine the distribution of the features in the space. This technique increases the efficiency of the fitting, as it ensures a good coverage of the feature space and reduces the number of structures in the training set. Their potential achieved good accuracies for energies and stresses (9 meV and 0.9 GPa, respectively). Although the accuracy in the forces was considerably worse (0.30 eVÅ), they also managed to reproduce correctly several mechanical properties such as the bulk modulus, lattice constants, or phonon dispersions (see Fig. 19). Wood et al. 442 proposed an improvement of the The resulting potential yielded a good accuracy for energies, forces, and stresses, enabling the prediction of several physical properties, such as lattice constants and phonon spectra.
In a different approach, Li et al. 447 devised a molecular dynamics scheme that relies on forces obtained by either Bayesian inference using GPR or by on-the-fly quantum mechanical calculations (tight binding, DFT, or other). Certain simulations in materials science involve steps where complex, recurring, chemical bonding geometries are encountered. The principal idea behind this scheme is that an adaptive approach can handle the occurrence of unseen geometries while the recurring ones are trained for. This is achieved by the following predictor-corrector algorithm 448,449 : After n steps of the simulation with a force field, the latest configuration is selected for quantum mechanical treatment and the accuracy of the force field is tested. Should the accuracy fall below a certain threshold, the force field is refitted. This scheme might not be the most efficient for a singular molecular dynamics cycle but excels when the simulations involve monotonic cycles between two temperatures, for example. Applications to silicon, 447 aluminum, and uranium 450 (with linear regression) reveal accuracies for forces <100 meV/Å. The phonon density of states and melting temperature of aluminum obtained with this scheme are also in good agreement with ab initio calculations. In the same spirit, Glielmo et al. 451 employed vectorial Gaussian process 452,453 regression to predict forces using vector two-body kernels of covariant nature. Their results for nickel, silicon, and iron indicate that the inclusion of symmetries results in a more efficient learning and that it is not necessary to impose energy conservation to achieve force covariance. Additional improvements of this methodology include the replacement of the features by higher-order n-body-based kernels. 454 Another family of highly successful machine learning potentials is the Gaussian approximation potentials (GAPs). First introduced by Bartók et al., 114 these potentials interpolate the atomic energy in the bispectrum space using GPR. Tests for semiconductors and iron revealed a remarkable reproduction of the ab initio potential energy surface. Advances in this methodology include the replacement of the bispectrum descriptor by the SOAP descriptor and the training of not only energies but also forces and stresses, 455 the generalization of the approach for solids 456 by adding two-and three-body descriptors, and the possibility to compare structures with multiple chemical species. 457 The materials studied in these works were tungsten, carbon, and silicon, respectively. The application of the GAPs to bcc ferromagnetic iron by Dragoni et al. 458 proves the accuracy of these potentials for both DFT energetics and thermodynamical properties. In particular, bulk point defects, phonons, the Bain path, and Γ surfaces 459 are correctly reproduced. By combining single-point DFT calculations, GAPs, and random structure search, 220,221 Deringer et al. showed a procedure that simultaneously explores and fits a complex potential energy surface. 460 They used 500 random structures to train a GAP model, which was then used to perform the conjugate gradient steps of the random search. The minimum structures were added to the training set after being recalculated with single-point DFT calculations. The potential for boron resulting from this procedure was able to describe the energetics of multiple polymorphs, which included αB 12 and βB 106 .
The GAP methodology was also applied to graphene. 461 The potential constructed by Rowe et al. was able to reproduce DFT phonon dispersion curves at 0 K. In addition, the potential predicted quantitatively the lattice parameter, phonon spectra at finite temperature, and the in-plane thermal expansion. Other works concerning GPR include its application to formaldehyde and comparison of the results with neural networks 462 and the acceleration of geometry optimization for some molecules. 463 Jacobsen et al. presented another structure optimization technique based on evolutionary algorithms and atomic potentials constructed using KRR. 464 To represent the atomic environment, they used the fingerprint function proposed by Oganov and Valle. 465 By using the atomic potentials to estimate the energy, they were able to reach a considerable speed-up of the search for the global minimum structure of SnO 2 (110)-(4 × 1).
In an unconventional way to construct atomic potentials, Han et al. 466 presented a deep neural network that, for each atom in a structure, takes as input N c functions of the distance between the atom and its neighbors, where N c is the maximum number of neighbors considered. As a consequence, some of the inputs of the neural network have to be zero. Furthermore, the potential might have transferability problems if ever used on a structure with smaller inter-atomic distances than the ones considered in the training set. Nevertheless, their potential showed good accuracy in energy predictions for copper and zirconium. Zhang et al. 467 improved this methodology with the generalization of the loss function to include forces and stresses.

DFT FUNCTIONALS
The application of machine learning techniques also spread to the creation of exchange and correlation potential and energy functionals. The first application emerged from the work of Tozer et al. 468 in 1996, where they devised a one-layer feed-forward multiperceptron neural network to map the electronic density ρ(r) to the exchange and correlation potential v xc (r) at the same points. Technically, this exchange and correlation functional belongs to the family of local-density approximations. Tozer et al. trained the neural network on two different datasets, first on the data of a single water molecule and afterwards on several molecules (namely, Ne, HF, N 2 , H 2 O, and H 2 ). Using 3768 data points calculated with a regular molecular numerical integration scheme, 469 the method achieved an accuracy of 2-3% in the exchange and correlation energy of the water molecule. When applied in a self-consistent Kohn-Sham calculation, the potential lead to eigenvalues and optimized geometries congruent with the local density approximation. On the other hand, for the set of Fig. 19 Comparison between the phonon dispersion curves obtained with density functional theory and the spectral neighbor analysis potential model for a 5 × 5 × 5 supercell of Mo. (Reprinted with permission from ref. 441 . Copyright 2017 American Physical Society.) several molecules, Tozer et al. obtained an error of 7.6% using 1279 points. The points were obtained in the same manner as before but were constrained to avoid successive points with similar densities. This potential generated geometries close to the local density approximation and good eigenvalues for molecules sufficiently represented in the training set. Meanwhile, and as expected, the neural network potential failed for molecules not sufficiently represented in the training like LiH and Li 2 .
In 2012, Snyder et al. tackled the problem of noninteracting spinless fermions confined to a 1D box. 470 They employed KRR to construct a machine learning approximation for the kinetic energy functional of the density. This is the idea behind orbital-free DFT and an attempt to bypass the need to solve a Shrödinger-like equation. The kinetic energy and density pairs of up to four electrons were obtained using Numerov's method 471 for several external potentials. These potentials were created using a linear combination of three Gaussian dips with random depths, widths, and centers. Furthermore, 1000 densities were taken for the test set while M were taken for the training set. For M = 200 chemical accuracy was achieved, as no error surpassed 1 kcal/mol. To obtain the correct behavior of the functional derivative of this energy, which is necessary for the self-consistent DFT procedure, PCA was used. Self-consistent calculations with this functional led to a range of similar densities instead of a unique density and to higher errors in the energy than when using the exact density. Nevertheless, the functional reached chemical accuracy.
This methodology was later improved during the study of the bond breaking for a 1D model of a diatomic molecule, subjected to a soft Coulomb interaction. 472 The training data consisted of Kohn-Sham energies and densities calculated with the localdensity approximation for 1D H 2 , H 2 , Li 2 , Be 2 , and LiH with different nuclear separations. Choosing up to 20 densities for each molecule for the training set produced smaller errors in the kinetic energy functional than those due to the approximation to the exchange-correlation functional. This new functional was able to produce binding energy curves indistinguishable from the localdensity approximation.
A different path was taken by Brockherde et al. 473 that, instead of solving the Kohn-Sham equations self-consistently as usually, used KRR to learn the Hohenberg-Kohn map between the potential v(r) and the density n(r). Among the machine learning community, this approach is normally designated as transductive inference. The energy is obtained from the density, also using KRR. When applied to the problem of noninteracting spinless fermions confined to a 1D box (same problem as in ref. 470 ), this machine learning map reproduced the correct energy up to 0.042 kcal/mol (if calculated in a grid) or 0.017 kcal/mol (using other basis sets), for a training set of 200 samples. Comparison of this map with other machine learning maps that learn only the kinetic energy reveals that the Hohenberg-Kohn map approach is much more accurate. Furthermore, this map achieved similar results when applied to molecules, reaching accuracies of 0.0091 kcal/mol for water and 0.5 kcal/mol for benzene, ethane, and malinaldehyde. These values measure the difference to the PBE energy. The training sets consisted of 20 points for the water and 2000 points for the other molecules. To generate the training sets for the larger molecules, molecular dynamics simulations using the general amber force-field 474 were used to yield a large set of geometries. These were subsequently sampled using the k-means approach to obtain 2000 representative structures that were only then evaluated using the PBE functional. In addition, the precision of the density prediction for benzene was compared with the results for the local-density approximation and PBE. Not only did the Hohenberg-Kohn map produce densities with errors smaller than the difference between different functionals (when evaluated on a grid) but these errors were also smaller than the ones introduced by evaluating the PBE functional using a Fourier basis representation instead of the evaluation on the grid.
A distinct approach comes from Liu et al., 475 who applied a neural network to determine the value of the range-separation parameter μ of the long-range corrected Becke-Lee-Yang-Parr functional. 476,477 They trained a neural network, characterized by one hidden layer, with 368 thermochemical and kinetic energies. These values came from experimental data and from highly accurate quantum chemistry calculations. When compared with the original functional (μ = 0.47), the new functional improved the accuracy of heats of formation and atomization energies while performing slightly worse in the calculation for ionization potentials, reaction barriers, and electronic affinities.
Nagai et al. 478 trained a neural network with 2 hidden layers (300 nodes) to produce the projection from the charge density onto the Hartree-exchange-correlation potential (v Hxc ). For that, they solved a simple model of two interacting spinless fermions under the effect of a 1D Gaussian potential, using exact diagonalization. The ground state density was then used to calculate v Hxc using an inverse Kohn-Sham method based on the Haydock-Foulkes variational principle. 479,480 When applied in the Kohn-Sham self-consistent cycle, this potential reproduced the exact densities and total energies, provided that a suitable training set was chosen (see Fig. 20). The system studied by the authors admits as solution either a bound and an unbound state or two bound states, depending on the Gaussian potential. Choosing points surrounding the boundary for the training set of the neural network leads to the most accurate results, with errors around 10 −3 a.u. everywhere except at the boundary (where they can

DISCUSSION AND CONCLUSIONS Interpretability
We already noted in the introduction that a major criticism of machine learning techniques is that their black-box algorithms do not provide us with new "physical laws" and that their inner workings remain outside our understanding. 481 For example, Ghiringhelli et al. argue that "a trustful prediction of new promising materials, identification of anomalies, and scientific advancement are doubtful," if the scientific connection between features and prediction is unknown. 96 Johnson writes in the context of quantitative structure-activity relationships: "By not following through with careful, designed, hypothesis testing we have allowed scientific thinking to be co-opted by statistics and arbitrarily defined fitness functions." 378 The main concern is that models not based on physical principles might fail in completely unexpected cases (that are trivial for humans) while providing a very good result on average. Such cases can only be predicted and prevented if one understands the causality between the inputs and outputs of the model. Furthermore, especially in applications where a single failure is extremely expensive or potentially deadly (as in medicine), the lack of trust in black-box machine learning models stops their widespread use even when they provide a superior performance. 273 As there are different concepts of interpretability, we will define its various facets according to Lipton et al. 482 To start with, we can divide interpretability into transparency and post hoc explanations, which consist of additional information provided by or extracted from a model.
Transparency can once again be split into the concepts of simulatability, decomposability, and algorithmic transparency. Simulatability is a partially subjective notion and concerns the ability of humans to follow and retrace the calculations of the model. This is, e.g., the case for sparse linear models such as the ones resulting from LASSO, 159 SISSO,163 or flat decision tree models. Decomposability is closely related to the intelligibility of a model and describes whether its various parts (input, parameters, calculations) allow for an intuitive interpretation. Algorithmic transparency considers our understanding of the error surface (e.g., whether the training will converge to a unique solution). This is clearly not the case for modern neural networks, for example.
Post hoc interpretability considers the possibility to extract additional information from the model. Examples for this are variable importance from a decision tree model or active response maps, which highlight regions of a picture that were particularly important for its classification by a convolutional neural network.
Starting from these concepts of interpretability, it is obvious that the notion of a complex model runs counter to the claim that it is simulatable by a human. Furthermore, models that are simulatable (e.g., low-dimensional linear models) and accurate often require unintuitive highly processed features that reduce the decomposability 483 (e.g., spectral neighbor analysis potential potentials) in order to reach a comparable performance to a more complex model. In contrast, a complex model like a deep convolutional neural network only requires relatively simple unengineered features and relies on its own ability to extract descriptors of different abstraction levels. In this sense, there is a definite conflict between the complexity and accuracy of a model, on one hand, and a simulatable decomposable model on the other hand.
The simplest examples of models that are simulatable are techniques based in dimensionality reduction or feature selection algorithms, like SISSO. 163 These are usually used in combination with linear fits and result in simple equations describing the problem. An example is the estimation of the probability of a material to exist as a perovskite (ABX 3 ), as given in ref. 143 : where n A is the oxidation state of A and r i is the ionic radius of ion i. Another example is given by Kim et al., 38 who used LASSO, as well as RF and KRR, to predict the dielectric breakdown field of elemental and binary insulators, on the basis of eight features obtained from first-principle calculations (e.g. band gap, phonon cutoff frequency, etc.). In the end, all three methods determined the same two features as optimal and demonstrated nearly the same error. However, Kim et al. favored LASSO, 159 because it provided a simple analytical formula, even if no further knowledge was gained from the formula. In any case, the knowledge of the analytical formula and therefore the simulatability seems to be far less relevant than the knowledge of the most relevant physical variables. In general, we can even argue that simulatability is not relevant for materials science as computational methods based on physical reasoning, like DFT or tight binding, are even further removed from simulatability than most machine learning models. A second method that provides a variable importance measure (see section "Basic principles of machine learning-Features") are RFs or other decision tree-based methods. Stanev et al. demonstrate the usefulness of this method for post hoc interpretability in ref. 76 , by recovering numerous known (e.g., isotope effect) and some unknown rules and limits for the superconducting critical temperature. This was done by first reducing the number of features via variable importance measure (Gini importance) and subsequently visualizing the correlation between the features and the critical temperature (see Fig. 21).
Pankajakshan et al. 168 developed bootstrapped-projected gradient descent as a feature selection method specifically for materials science. The motivation came from some consistency issues for correlated or linearly dependent variables (present, for example, in LASSO), which bootstrapped-projected gradient descent can alleviate through extra clustering and bootstrapping. In their work, Pankajakshan et al. used machine learning mostly to find and understand descriptors, in order to improve the d-band model of catalysts for CO 2 reduction, 484 instead of actually using the machine learning model for predictions. This can definitely be a reasonable approach in cases where datasets are too small and incomplete for any successful extrapolation. Notwithstanding, in most cases it is questionable if a classical (in the sense of "nonmachine learning") model should be used directly when a machine learning model is superior, as in the case of the d-band model. 485 Of course, it is a bonus when a classical model exists, as it can be used to check for consistency issues or as a crude estimation of property. However, in our opinion, pragmatic applications of advanced materials design should always use the best model. While RFs and linear fits are considered more accessible from a interpretability point of view, deep neural networks are one of the prime examples for algorithms that are traditionally considered a black box. While their complex nature often results in superior performance in comparison to simpler algorithms, an unwanted consequence is the lack of simulatability and algorithmic transparency. As the lack of interpretability is one of the main challenges for a wider adoption of neural networks in industry and experimental sciences, post hoc methods to visualize the response and understand the inner workings of neural networks were developed during the past years. One example are attentive response maps for image recognition networks that highlight regions of the picture according to their importance in the decision making process. Kumar et al. 271 demonstrated that, by combining the understanding gained from attentive response maps with domain knowledge and applying it to the design process of the neural network, one can not only achieve a better informed decision making process but also higher performance. An improvement of the performance through integration of domain knowledge is not completely surprising, but the result is nevertheless remarkable, as usually higher interpretability comes at the cost of a lower performance.
Zilleti et al. 268 introduced attentive response maps, as implemented in ref. 271 , to materials science in order to visualize the ability of their convolutional neural networks to recognize crystal structures from diffraction patterns. The response maps of the different convolutional layers demonstrate that the neural networks recover the position of the diffraction peaks and their orientation as features (see Fig. 22).
A second example, which demonstrates the ability of neural networks to convey additional post hoc information, is described in ref. 40 . Xie et al. used a crystal graph convolutional neural network to learn the distance to the convex hull of perovskites ABX 3 . By using the output of the pooling layers instead of the fully connected layers as a predictor, the energy can be split into contributions from the different crystal sites (see Fig. 23). This allowed Xie et al. to not only confirm the importance of the radii of the A-and B-atoms but also to gain new insights that were then used for an efficient combinatorial search of perovskites. In ref. 486 From the theoretical equation for the winding number, 487 one can derive that the second convolutional layer should produce an output linearly depending on ΔΦ with the exception of a jump at ΔΦ = π. We can see in Fig. 24 that this is exactly the case, and consequently, the convolutional neural network actually learned the discrete formula for the winding number. Sun et al. 323 studied similar models of higher complexity with deep convolutional neural networks and were also able to demonstrate that their networks learned the known mathematical formulas for the winding and the Chern numbers. 488 Naturally, neural networks will never reach the algorithmic transparency of linear models. However, representative datasets, a good knowledge of the training process, and a comprehensive validation of the model can usually overcome this obstacle. Furthermore, if we consider the possibilities for post hoc explanations or the decomposability of neural networks, they are actually far more interpretable than their reputation might suggest. To conclude this chapter, we would like to summarize a few points: (i) Interpretability is not a single algorithmic property but a multifaceted concept (simulatability, decomposability, algorithmic transparency, post hoc knowledge extraction) (ii) The various facets have different priorities depending on the dataset and the research goal. (iii) Simulatability is usually non-existent in materials science (e.g., in DFT or Monte Carlo simulations) regardless of whether one uses a machine learning or a classical algorithm. Therefore, it should probably not be a point of concern in materials informatics. (iv) Part of the progress of materials informatics has to include the increasing use of post hoc knowledge techniques, like attentive response maps, to improve the viability of, and the trust in, high-performing black-box models. Often this knowledge alleviates the fear that the model is operating on unphysical principles. 268,321,323 Conclusions Just like the industrial revolution, which consisted of the creation of machines that could perform mechanical tasks more efficiently than humans, in the field of machine learning machines are progressively trained to identify patterns and to find relations between properties and features more efficiently than us. In materials science, machine learning is mostly applied to classification and regression problems. In this context, we discussed a wide variety of quantitative structure-property relationships, which encompass a high number of properties essential for modern technology. It seems likely that further properties, should they be needed, can also be predicted with a similar level of accuracy.
If we consider the direction of future research, there will be a clear division between methodologies depending on the availability of data. For continuous properties, which can be calculated realistically for ≥10 5 materials, we assume that universal models and especially deep neural networks, like Xie et al.'s crystal graph convolutional networks 40 or Chen et al.'s MatErials Graph Networks, 132 will be the future. They are able to predict a diverse set of properties, such as formation energies, band gaps, Fermi energies, bulk moduli, shear moduli, and Poisson ratios for a wide material space (87 elements, 7 lattice systems, and 216 space groups in the case of ref. 40 ). At the same time, they reach an accuracy with respect to DFT calculations that is comparable with (or even smaller than) the DFT errors with respect to experiment. Such models have the potential to end the need for applications trained for only a single structural prototype and/or property, which can in turn drastically reduce the amount of resources spent by single researchers. Comparing to the state of the art of neural network architectures and training methods in fields like image recognition and natural language procession, we can also expect that the success of neural network models will only increase once modern topologies, training methods, and fast implementations reach a wider audience in materials science. To reach this goal, a closer interdisciplinary collaboration with computer scientists will be essential.
In other cases that are characterized by a lack of data, several strategies are very promising. First of all, one can take into consideration surrogate-based optimization (active learning), which allows researchers to optimize the results achieved with a limited experimental or computational budget. Surrogate-based optimization allows us to somewhat overlook the limited accuracy of the machine learning models while nevertheless arriving at sufficient design results. As the use of such optimal design algorithms is still confined to relatively few studies with small datasets, much future work can be foreseen in this direction. A second strategy to overcome the limited data available in materials science is transfer learning. While it has already been applied with success in chemistry, 489 wider applications in solidstate materials informatics are still missing. A last strategy to handle the small datasets that are so common in materials science was discussed by Zhang et al. in ref. 77 . Crude estimation of properties basically allows us to shift the problem of predicting a property to the problem of predicting the error of the crude model with respect to the higher-fidelity training data. Up to now, this strategy was mostly used for the prediction of band gap, as datasets of different fidelity are openly available (DFT, GW, or experimental). Moreover the use of crude estimators allows researchers to benefit from decades of work and expertise that went into classical (non-machine learning) models. If the lowerfidelity data are not available for all materials, it is also possible to use a co-kriging approach that still profits from the crude estimators but does not require it for every prediction. 292 Component prediction is a highly effective way to speed up the material discovery process and we expect high-throughput searches of all common crystal structure prototypes that were not yet researched in the coming years. While the prediction of the energy can also be considered, a quantitative structure-property relationships, metastable materials, and an incomplete knowledge of the theoretical convex hull have to be taken into account. Several studies demonstrated that better accuracy can be achieved with experimental training data. However, as experimental data are seldom available and expensive to generate, the number of prototypes for which studies analog to ref. 143 are an option will quickly be exhausted. A second challenge is the lack of published data of failed experiments. In this case, a cultural shift toward the publication of all valid data, may it be positive or negative, is required. The direct prediction or generation of a crystal structure is still an extremely challenging problem. While several studies demonstrate how to differentiate between a small number of prototypes for a certain composition, the difficulty quickly rises with an increasing number of possible crystal structures. This is amplified by the fact that the majority of available data belongs to only a small number of extensively researched prototypes. Recently, more complex modern neural network structures (e.g., VAEs, GANs, etc.) were introduced to the problem, with some interesting results. Moreover, the use of machine learning-based optimization algorithms, like Bayesian optimization for global structure prediction, is also a direction that should be further explored.
Machine learning was successfully integrated with other numerical techniques, such as molecular dynamics and global structural prediction. Force fields built with neural networks enjoy an efficiency that parallels that of classical force fields and an accuracy comparable to the reference method (usually DFT in solid state, although in chemistry some force fields already achieved coupled cluster accuracy 489 ). Consequently, we expect them to completely replace classical force fields in the long term. Owing to their vastly superior numerical scaling, machine learning methods allow us to tackle challenging problems, which go far beyond the limitations of current electronic structure methods, and to investigate novel, emerging phenomena that stem from the complexity of the systems.
The majority of early machine learning applications to solidstate materials science employed straightforward and simple-touse algorithms, like linear kernel models and decision trees. Now, that these proofs-of-concept exist for a variety of application, we expect that research will follow two different directions. The first will be the continuation of the present research, the development of more sophisticated machine learning methods, and their applications in materials science. Here one of the major problems is the lack of benchmarking datasets and standards. In chemistry, a number of such datasets already exists, such as the QM7 dataset, 490,491 QM8 dataset, 491,492 QM7b dataset, 493,494 etc. These are absolutely essential to measure the progress in features and algorithms. While we discussed countless machine learning studies in this review, definitive quantitative comparisons between the different works were mostly impossible, impeding the evaluation of progress and thereby progress itself. It has to be noted that there has been one recent competition for the prediction of formation energies and band gaps. 495 In our opinion, this is an very important step in the right direction. Unfortunately, the dataset used in this competition was extremely small and specific, putting the generalizability of the results to larger and more diverse datasets into doubt.
The second direction regards the usability of machine learning models. In the electronic structure community, both the models (e.g., new approximations to the exchange-correlation functional of DFT) and the computer codes are developed by a relatively  small group of experts and put at the disposal of the much larger community of materials scientists. Even though this is slowly starting to change, models from most publications are not publicly available. This results in most researchers spending resources on building their own models to solve very specific problems. We note that frameworks to disseminate models are now starting to emerge. 496 In conclusion, we reviewed the latest applications of machine learning in the field of materials science. These applications have been mushrooming in the past couple of years, fueled by the unparalleled success that machine learning algorithms have found in several different fields of science and technology. It is our firm conviction that this collection of efficient statistical tools are indeed capable of speeding up considerably both fundamental and applied research. As such, they are clearly more than a temporary fashion and will certainly shape materials science for the years to come.