Abstract
The ever-increasing capability of computational methods has resulted in their general acceptance as a key part of the materials design process. Traditionally this has been achieved using a so-called computational funnel, where increasingly accurate – and expensive – methodologies are used to winnow down a large initial library to a size which can be tackled by experiment. In this paper we present an alternative approach, using a multi-output Gaussian process to fuse the information gained from both experimental and computational methods into a single, dynamically evolving design. Common challenges with computational funnels, such as mis-ordering methods and the inclusion of non-informative steps, are avoided by learning the relationships between methods on the fly. We show this approach reduces overall optimisation cost on average by around a factor of three compared to other commonly used approaches, through evaluation on three challenging materials design problems.
Introduction
Engineers and materials scientists frequently seek to discover new materials that exhibit specific sets of properties. Most properties of active interest, whether optoelectronic, structural, catalytic or physicochemical, have a complex relationship with the variables that are under experimental control. This fact, in combination with the vast number of synthesisable materials and the relatively high cost of experimental synthesis and characterisation, is the central challenge of materials discovery^{1,2,3}.
One tool in the material designer's toolbox is the use of simulation as a proxy for experiment^{4}. Because simulation is cheaper than the synthesis and characterisation of a material, it offers the potential for orders-of-magnitude increases in the number of materials that can be evaluated during the materials discovery process. The complicating factor in most cases of molecular and materials design is that, given practical limits on computational resources, simulation is not sufficiently accurate to match experiment^{5}. This means that materials discovery cannot proceed using simulation alone, and consequently some means of combining simulation with experimental synthesis and characterisation must be used^{6}. This remains the case despite the significant improvements in underlying electronic structure methods and the increases in computational resources that have occurred over the past decades.
Inspired by drug discovery, the traditional way workflows make use of cheaper approximate measures is through a screening approach known as a computational funnel^{7} (Fig. 1). Here, starting with cheap, less accurate methods (e.g. QSAR models), increasingly expensive and accurate methods – including more complex simulation methodologies (e.g. molecular dynamics and/or ab initio quantum calculations) and readily accessible experimental measures (e.g. single property measurements or spectroscopic characterisation in the materials discovery setting, or in vitro experiments in the drug discovery setting) – are applied to winnow the potential material candidate pool down to a smaller and smaller fraction, eventually yielding a small set of highly promising candidates which can be evaluated using the most accurate experimental measurements (e.g. full experimental characterisation in the materials discovery setting, or animal and human trials in the drug discovery setting).
Recently, emerging technologies such as machine learning have driven ever-more efficient materials screening campaigns^{8,9}. One particularly impactful approach has been to replace the expensive-to-evaluate simulations with data-driven models, either through replacement of the potential energy calls^{10,11,12} or through direct modelling of the property of interest^{13,14,15}.
Whilst computational funnels have proven successful, several disadvantages can be identified:

Constructing the hierarchy of methods requires detailed upfront knowledge of the relative accuracy of each method along with its cost.

The total quantity of resources to be used in the entire design process needs to be known and specified a priori.

The relative spread of resources amongst the different levels must also be known and specified a priori.
The first of these challenges is particularly relevant when integrating machine learning models as layers within the funnel, since it is often impossible to know the true accuracy of a data-driven model for an arbitrary data point ahead of time: generalised performance is intrinsically linked to the data and methods used to train the model, rather than to the model itself.
In this paper, we present an alternative to the computational funnel for materials discovery which instead relies on an extension of Bayesian optimization that can make use of cheaper approximate measurements (Fig. 2). In our approach, a Bayesian model is constructed which dynamically learns to relate the different approximate methods and the ground-truth experimental value (referred to here as the different methodological fidelities) to each other. This model is used to traverse the full set of candidate materials in a budget-aware, accuracy-aware manner. The approach is progressive rather than hierarchical, allows termination to be decided by the user rather than fixed ahead of time, is implicitly dynamic in its allocation of resources to the different methods, and does not require knowledge of the accuracies of the different fidelities ahead of time.
Results and discussion
We demonstrate the effectiveness of our proposed approach through application to three hybrid simulation-experiment discovery challenges, comparing its performance to the commonly used computational funnel and to Bayesian optimization – an emerging approach to sample-efficient experimental design – applied only to the target fidelity. We also investigate how fidelity cost and cross-correlation influence the behaviour of our approach relative to these reference methods through the use of a set of artificial functions where these factors can be directly controlled.
Multi-fidelity machine learning
Whilst machine learning has shown strong potential as an emerging paradigm for rapidly generating predictions of materials' properties of interest, as a data-driven technology its utility can be limited by the availability of high-quality data. An emerging approach to this challenge is to build machine learning models from multiple different fidelities of data, which can then act as predictors in cases where insufficient data are available to build traditional QSAR or machine learning models^{16,17}. These approaches typically rely on a model which is able to relate the different fidelities of information to each other, usually a single model with multiple output values – one per fidelity. It should be noted that this is distinct from the Δ-machine learning approach^{18}, in which a single-output model learns a correction to apply to a low-fidelity output to better approximate a high-fidelity one. Applying multi-fidelity machine learning approaches to the materials domain has seen some notable early successes. For example, Chen et al. apply a multi-fidelity setting of a graph network to the prediction of bandgaps^{19}. They found that the inclusion of information from a lower-fidelity calculation, in their case using the Perdew-Burke-Ernzerhof methodology^{20}, led to an improvement in the mean absolute error of between 22 and 45%. Similar information fusion approaches have been used in the polymer space; for example, Patra et al. use a co-kriging scheme to fuse information from a variety of sources to build a predictive model for polymer bandgaps^{21}. In their study, they observed both increased performance over a single-fidelity Gaussian process approach and greater generalisation for their model.
Bayesian optimization
Bayesian optimisation is a family of sample-efficient optimisers which balance the twin pressures of exploration (knowledge generation) and exploitation (knowledge utilisation) through the iterative construction of a scoring function based on Bayesian machine learning models^{22}. It has shown promise in fields as diverse as hyperparameter optimization^{23}, drug discovery^{24}, engineering^{25} and materials discovery^{26,27}.
This scoring function, sometimes known as the acquisition function, relates the parameters being optimised to their expected utility given the current state of the model. Perhaps the most commonly used acquisition function is Expected Improvement (EI)^{28}, which balances exploration and exploitation by considering both the likelihood of an improvement and its potential magnitude. Iteratively training the model, calculating posterior predictions and maximising the acquisition function then drives candidate selection within the Bayesian optimization paradigm. Gaussian processes are the most commonly utilised Bayesian machine learning model for this task, though others such as Bayesian neural networks (BNNs) have also been effectively deployed^{29,30,31}. We note that the proliferation of new Bayesian models such as BNNs, which can scale to large data sizes, together with advances that mitigate the cubic scaling^{32} previously associated with GPs, means that the practical limitations that once prevented Bayesian optimisation from being applied to large-scale design problems are no longer the barrier they once were.
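As a concrete illustration, EI has a closed form in terms of the model's posterior mean and standard deviation. The following is a minimal numpy/scipy sketch for a maximisation problem; the exploration margin `xi` is a common heuristic and not something prescribed by this paper:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Expected Improvement for maximisation.

    mu, sigma : posterior mean and standard deviation at candidate points
    f_best    : best objective value observed so far
    xi        : optional exploration margin (a common, tunable heuristic)
    """
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    improve = mu - f_best - xi
    with np.errstate(divide="ignore", invalid="ignore"):
        z = improve / sigma
        ei = improve * norm.cdf(z) + sigma * norm.pdf(z)
    ei[sigma == 0.0] = 0.0  # no posterior uncertainty -> no expected improvement
    return ei
```

Candidates with a high mean score through the first term (exploitation), while candidates with high uncertainty score through the second (exploration).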
Multi-fidelity Bayesian optimization
Inspired by the successes of including lower-fidelity information through multi-fidelity model building, and by the emerging area of Bayesian optimization for materials screening, this paper extends standard Bayesian optimization of materials from a sample-efficient method for optimising a target property to a multi-fidelity technique capable of taking advantage of all available fidelities. Since this approach is agnostic to the source of the data, it naturally allows for the combination of experimental and theoretical data in a way not achieved by current machine-learning-driven screening approaches. To achieve this goal, it is necessary to build a model which is able to link data from each fidelity and draw inferences from the composite information. For this purpose, we utilise a multi-output Gaussian process^{33}.
Multi-fidelity Bayesian optimization makes use of the same approach of iteratively training a probabilistic model and using it to rank possible materials for measurement. It extends the search space from a set of materials or molecules to a combination of these candidates and a choice of a particular measurement modality, or fidelity. Therefore, given an effective acquisition function, it is possible to efficiently trade off information collection at cheap but noisy fidelities against targeted acquisition of data at the highest fidelity when required. Typically, the reason for employing such an approach is to reduce the total budget spent on the optimization, since the highest fidelity may be very technically or financially challenging to acquire. Throughout this paper we use the terms low fidelity and high fidelity to mean low expense and high expense, for continuity of language with previous works in the field. We note, however, that our approach does not require the fidelities to be ranked with accuracy and cost increasing together; the only requirement is that a target fidelity be specified at the start of the optimization.
Different approaches to multi-fidelity optimization can be decomposed into choices about modelling the relationships between the different fidelities and choices about how to construct the acquisition function. For example, Song et al. utilise a phased approach, with initial exploration performed using a low fidelity until some stopping criterion is hit, at which point high-fidelity data acquisition is considered^{34}. Palizhati et al. consider both epsilon-greedy and lower-confidence-bound (LCB) settings to build multi-fidelity screening approaches. They found that the best results were obtained when the entire low-fidelity data set was given as a priori knowledge to a multi-fidelity model, which resulted in an acceleration of at least 20% on materials discovery tasks^{35}. Our approach, which we name Targeted Variance Reduction (TVR), naturally extends arbitrary single-fidelity Bayesian optimization acquisition functions to a multi-fidelity domain. The TVR algorithm is described in detail in the Methods section, with pseudocode presented in the ESI, but is summarised as follows: after computing a standard acquisition function on the target-fidelity samples (in this paper, we use the aforementioned EI acquisition function), the combination of input sample and fidelity is chosen by picking the pair that will most reduce, per unit cost, the variance of the model prediction at the point with the greatest acquisition function score. This process is repeated iteratively until the budget is exhausted.
Synthetic data set
Approaches to screening challenges, such as those encountered in materials discovery, are affected by two main factors – the relative cost of making evaluations at the different fidelities, and the correlation between the fidelities. In an ideal system, cheaper fidelities are highly correlated both with each other and with the target fidelity, enabling an efficient winnowing of the candidate pool without significant computational expense. In the worst-case scenario, the fidelities are completely uncorrelated, essentially reducing each stage of a computational funnel to a lottery.
To demonstrate and systematically probe the effects of the cost and accuracy of the lower-fidelity proxies on optimisation based on computational funnels and the TVR-EI algorithm, we make use of a synthetic function as the target of our optimisation, and generate lower-fidelity proxies in a manner that allows us to control the degree to which the lower fidelity is correlated with the ground-truth target. We utilise Liu's 1D function (Eq. (1)) as our target function, as it is complex enough to differentiate optimization strategies, but not so complex as to obfuscate the effects of algorithmic component choices.
We generate a set of functions with differing degrees of correlation to our target function using our previously described method^{36} to act as the lower-fidelity proxy to (1). Plotted examples of generated functions can be found in the ESI. To examine the effect of the relative cost, we consider a range of discount factors that express the degree to which the lower-fidelity proxy is cheaper to evaluate than the target function.
The results of the experiments making use of the synthetic functions are shown in Fig. 3, a heatmap indicating the relative performance of TVR-EI and a computational funnel. Performance is scored as the difference between the total computational cost the optimal computational funnel required to discover a solution in the 99th percentile of best values and the total computational cost required by TVR-EI to discover such a value, averaged over computational replicates, where a unit of cost is defined by the price of a single evaluation of the ground-truth target. The two axes show the effect of the discount (in cost) of taking measurements using the lower-fidelity proxy relative to the ground-truth function and the Pearson correlation of the lower-fidelity proxy with the ground-truth function. The lower left corner of the grid is associated with expensive, accurate proxies whilst the upper right corner is associated with cheap, inaccurate proxies.
The figure shows that for the synthetic case the TVR-EI algorithm outperforms the funnel when either the expense or the accuracy of the proxy is relatively high, whilst the computational funnel shows higher performance when the proxy is both lower cost and lower accuracy. The greater performance for both the expensive yet inaccurate proxy functions and the cheap, accurate proxies can be rationalised by TVR-EI's capacity to dynamically adjust how it allocates budget based upon information gained during the optimization. Unlike a computational funnel, where budget is fixed and pre-allocated, TVR-EI can dynamically allocate more budget to proxies if a proxy is determined to be informative, or less if the correlation is deemed to be low. We postulate that the utility of the funnel for cheaper, lower-accuracy proxies may arise because TVR-EI is more sensitive to mismatches between the proxy and the target, and can thus exhibit overly conservative behaviour, avoiding proxies that are still somewhat useful.
It can also be observed that the magnitude of the difference between the two methods is marked, with no score lower than −20 (i.e. the computational funnel effectively required 20 fewer target samples), but with the highest score over 100 (i.e. TVR-EI effectively required 100 fewer target samples).
Materials discovery challenges
Building on our understanding of the algorithm gained from its use on synthetic functions, we now test our approach on three materials discovery examples. A detailed description of these data sets can be found in the Data Sets section; in summary, each contains a mixture of computationally calculated and experimentally measured fidelities for impactful materials properties: polarizability (Alexandria), power conversion efficiency (HOPV15) and bandgap (Chen).
As we have previously stated, computational funnels require the user to provision the computational budget in advance. It is worth noting that throughout this study we effectively assume that the funnel can be provisioned perfectly, which does not reflect reality. Our primary point of comparison for each task is a composite of funnels: for each budget we run a separate funnel provisioned with the specified budget and report its final performance. This is contrasted with the other methods, which are run once and have their performance tracked as they expend increasingly greater resources. We note that this represents an upper limit on the performance of a computational funnel – perfectly budgeted, ideally provisioned. A comparison of performance between TVR-EI, single-fidelity EI, an ideally provisioned composite funnel and random (Monte Carlo) sampling is shown in Fig. 4. In this study, we use 'regret' as a measure of performance, where a score of zero regret indicates that the best possible solution has been discovered. Here, regret was calculated with respect to an exhaustive search at the highest fidelity.
We can observe that for each of these challenges, the multi-fidelity Bayesian optimisation approach using TVR-EI equals or betters the performance of both the computational funnel and the single-fidelity EI method. However, as we would expect given the insight gained from varying correlation and cost with the synthetic function, the behaviour of the different optimisation algorithms varies considerably among the different datasets.
Table 1 shows a numerical summary of the performance of TVR-EI in comparison to an ideally provisioned composite funnel and to expected improvement Bayesian optimization run on the target fidelity. We measure performance in two ways:

Relative efficiency of the methodologies compared to TVR-EI (Expense multiplier in the table): here a score of 1 means that the same budget is consumed, and a score greater than one means that TVR-EI was more efficient.

Relative regret compared to TVR-EI: here we calculate how much worse the solution discovered by the comparison methods is at the point when TVR-EI has found the optimal solution. A score of zero means that the method has also discovered the optimal solution.
We observe that TVR-EI delivers an average efficiency gain of 2.8x compared to the competitive methods analysed in this study, and an average normalised relative regret gain of 20%, indicating the potential for this approach to deliver significant improvements in materials screening challenges.
The Chen dataset results highlight the advantages of using well-correlated lower-fidelity proxies, with both the computational funnel and TVR-EI able to reach 99th-percentile insulators with an order-of-magnitude reduction in cost relative to the random search baseline and a factor-of-5 decrease in cost relative to the single-fidelity EI optimisation. The good performance of the funnel can be attributed to the large difference in cost between the experimental target and the computational surrogates. The relatively poor performance of the single-fidelity Bayesian optimization approach suggests that the relationship between the molecular representations and the bandgap poses challenges to building a powerful internal model in the low-data regime demanded by the high data acquisition cost. We posit that this could be due to a rough functional relationship, where small changes in structure can lead to large differences in bandgap, requiring a greater volume of data to resolve satisfactorily. Thus, this optimization challenge is characterised by informative proxies in combination with a relatively challenging optimisation target function. We see that TVR-EI offers comparable performance to the funnel. A breakdown of how the algorithm allocates its budget (Table 2) indicates that it achieves this by making significant use of the lower fidelities in order to sample only the most valuable of the more expensive experimental candidates.
In contrast to the Chen dataset, for the HOPV dataset both the single-fidelity EI and the TVR algorithms were able to rapidly identify optimal candidates, significantly outperforming the random search baseline, which itself significantly outperforms the computational funnel. We note that this problem represents the worst-case scenario for the computational funnel approach, being characterised by reasonably constructed yet poorly correlated fidelities, as demonstrated in Fig. 5. We also note that the success of the single-fidelity Bayesian optimisation algorithm (EI) indicates that the functional relationship between the reduced-dimensional representation and the experimental power conversion efficiency does not suffer pathologies.
Since this dataset is characterised by relatively expensive yet mostly low-quality surrogate fidelities in combination with a relatively easy optimisation target function, it is easy to see why TVR-EI outperforms a computational funnel approach. This is borne out through inspection of the budget allocation, shown in Table 3, with a significant number of the samples being drawn from the target fidelity despite the large cost of doing so. This indicates that the method has learned that, for this target, the surrogate fidelities do not carry much information; once this is determined, it allocates almost no budget to these fidelities. Further inspection of the budget allocation yields additional insights. For example, in this task the M06-2X fidelity is both relatively expensive and uncorrelated with the target (see Fig. 5), and indeed TVR-EI consistently allocates almost no budget to investigating this fidelity. Additionally, Fig. 5 shows that the PBE0 and B3LYP fidelities are strongly correlated, leading the TVR-EI algorithm to consistently invest more budget in the cheaper PBE0 fidelity, given that the information content is similar. Correlation plots for the other two data sets can be found in the supplementary information.
In contrast to Chen and HOPV15, which were chosen to demonstrate situations favouring either a computational funnel or a Bayesian optimization approach, the Alexandria dataset shows an intermediate case. The computational funnel and the single-fidelity EI method have comparable performance, despite achieving it through fundamentally different mechanisms, with both outperforming random search. This indicates that the target function is effectively optimisable and that the balance of cost vs. correlation for the computational proxies makes them highly informative. For this challenge, we observe that our approach significantly outperforms both computational funnels and single-fidelity Bayesian optimization, enhancing the signal exploited by the single-fidelity approach with additional information taken from the lower fidelities.
It is also informative to consider the effects of non-ideal provisioning, which is more akin to a real-world situation. For the purposes of this study, we define an ideally provisioned funnel as having the minimum budget required to reach zero regret in the median case, an over-provisioned funnel as having twice this budget, and an under-provisioned funnel as having half this budget. The results of such a comparison are shown in Fig. 6. When comparing to a non-ideal (either over- or under-provisioned) funnel, we observe that the effects we describe throughout are enhanced. An equivalent to Fig. 4 in which the worst case is tracked in place of the median – thus providing a lower bound on the composite funnel performance – can be found in the ESI.
These three challenges span a range of thermochemical and optoelectronic properties, including both experimental and computational values, and represent a wide spectrum of characteristics designed to test our approach in a variety of situations in which it may reasonably be applied. In the datasets we have examined, we observe clear benefits from the application of TVR-EI, which matches or betters the best-case performance of the commonly used computational funnel and the emerging Bayesian optimization methodology, while requiring neither front-loading of resources nor definitive knowledge of the relative accuracy of possible proxies. Our method is demonstrated to be robust to uninformative proxies and able to leverage internal correlations, removing the need to expend budget on proxies which are highly correlated with lower-cost alternatives. We believe this demonstrates that our TVR-EI algorithm has promise as a tool for molecular and materials design wherever cheaper proxy measures are available, as an alternative to the well-established computational funnel or single-fidelity Bayesian optimization methods, and establishes its utility for mixed simulation-experiment designs.
Methods
Targeted variance reduction
Multi-fidelity Targeted Variance Reduction (MF-TVR) is a conceptually simple algorithm. After computing a standard acquisition function on the target fidelity, in this case EI, the combination of input sample and fidelity is chosen by picking the pair that minimises the variance of the model prediction at the point with the greatest Expected Improvement, scaled by the cost of making the evaluation. We do not separate the low-fidelity search and the high-fidelity exploitation into distinct stages, but instead use the lower fidelities to improve the quality of the acquisition function itself, thus directly impacting the sampling efficiency. These steps are illustrated graphically in Fig. 7 and pseudocode for the TVR-EI algorithm can be found in the ESI.
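The selection step can be sketched using exact Gaussian conditioning: observing one new (candidate, fidelity) pair reduces the posterior variance at the EI-maximising point by cov²/var. The helper below is our own simplified single-observation illustration of this idea, not the production implementation; the pseudocode in the ESI remains the authoritative description:

```python
import numpy as np

def variance_after_observation(K, i, j, noise=1e-6):
    """Posterior variance at index i after observing index j, given the
    current covariance matrix K over all (candidate, fidelity) pairs."""
    return K[i, i] - K[i, j] ** 2 / (K[j, j] + noise)

def tvr_select(K, ei_scores, costs):
    """One TVR-style selection step (a simplified sketch).

    K         : covariance over every (candidate, fidelity) pair
    ei_scores : EI at target-fidelity entries; -inf for other fidelities
    costs     : evaluation cost of each (candidate, fidelity) pair
    Returns the index of the pair whose observation most reduces the
    variance at the EI-maximising target point, per unit cost.
    """
    i_star = int(np.argmax(ei_scores))  # most promising target-fidelity point
    reductions = np.array([
        K[i_star, i_star] - variance_after_observation(K, i_star, j)
        for j in range(len(costs))
    ])
    return int(np.argmax(reductions / np.asarray(costs, float)))
```

With a cheap low-fidelity entry that is strongly correlated with the EI-maximising target point, the sketch selects the proxy; when that correlation is removed, it falls back to sampling the target fidelity directly.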
Multi-output Gaussian process
In the case of a single-fidelity GP, training data takes the form of a matrix of material representations X and corresponding property values \(\vec y\), and we have another matrix of representations X_{*} for which we would like to make predictions. We suppose we have a kernel function defined by a set of hyperparameters, typically a universal smoothing kernel such as the radial basis function (RBF) or a Matern kernel. This kernel function can be used to compute prior covariances between vector representations of materials, and by extension a prior covariance matrix over a set of materials. The posterior predicted means for the materials to be evaluated are then given by:

\(\overrightarrow {\mu _ \ast } = K_\ast K^{-1} \vec y\)

where \(\overrightarrow {\mu _ \ast }\) is the vector of predicted mean values, K_{*} is the prior covariance matrix between X_{*} and X as determined by the kernel function, and K^{−1} is the inverse of the prior covariance matrix between X and X, again as determined using the kernel function. Similarly, the posterior covariances for the materials to be evaluated are given by:

\(\sigma_\ast = K_{\ast\ast} - K_\ast K^{-1} K_\ast^{\top}\)

where σ_{*} is the posterior covariance matrix between X_{*} and X_{*} and K_{**} is the equivalent prior covariance matrix between X_{*} and X_{*}.
Hyperparameters of the kernel function can either be sampled or learned by maximising the log marginal likelihood of the training data^{37}.
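For concreteness, the posterior mean and covariance expressions above can be implemented in a few lines of numpy. This is a minimal sketch with fixed RBF-kernel hyperparameters (function names are ours; in practice the explicit inverse would be replaced by a Cholesky solve for numerical stability, and the hyperparameters would be learned as described above):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """RBF (squared exponential) kernel between the rows of A and of B."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, y, X_star, lengthscale=1.0, variance=1.0, noise=1e-8):
    """Posterior mean and covariance of a GP at X_star given data (X, y)."""
    K = rbf_kernel(X, X, lengthscale, variance) + noise * np.eye(len(X))
    K_star = rbf_kernel(X_star, X, lengthscale, variance)       # cov(X_*, X)
    K_star_star = rbf_kernel(X_star, X_star, lengthscale, variance)
    K_inv = np.linalg.inv(K)
    mu = K_star @ K_inv @ np.asarray(y, float)                  # posterior mean
    cov = K_star_star - K_star @ K_inv @ K_star.T               # posterior covariance
    return mu, cov
```

At a training point with negligible noise the posterior mean interpolates the observed value and the posterior variance collapses towards zero, as expected.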
This setup can be extended to the multi-fidelity case by creating a representation for the fidelities and concatenating it with the representation of the materials. This allows us to make use of the same kernel functions to generate a prior sample-fidelity covariance matrix.
We choose to represent the fidelities via a one-hot encoding in which we drop the ground-truth high-fidelity dimension – this results in representing a fidelity by a vector with a number of dimensions equal to the number of approximate fidelities, with the high-fidelity reference mapping to the zero vector and the other fidelities mapping to the unit vectors along each axis. The choice to drop the dimension that would normally represent the high fidelity biases the model to care more about the relationships between the various fidelities and the high fidelity than about those between the lower fidelities themselves. This is of direct benefit to our use cases, since here we care explicitly about the former rather than the latter.
Thus, the only addition to the single-fidelity case defined above is that training/evaluation data for materials and property values are replaced by training/evaluation data for material/fidelity combinations and property values. That is, where previously the i-th row of the matrix of material representations X corresponded simply to the vector \(\overrightarrow {x_i}\) representing material i, we now have rows defined by

\(\left( \overrightarrow {x_i^k},\ \overrightarrow {f_k} \right)\)
where \(\overrightarrow {x_i^k}\) is the representation of the i-th material at the k-th fidelity, and \(\overrightarrow {f_k}\) is the one-hot representation of the k-th fidelity.
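Assembling such rows is a simple concatenation. The sketch below is our own illustration (the function name and the use of `None` to denote the target fidelity are conventions we introduce here, not part of the paper's implementation):

```python
import numpy as np

def fidelity_features(X, fidelity_index, n_approx_fidelities):
    """Concatenate material representations with a one-hot fidelity encoding.

    The target (ground-truth) fidelity maps to the zero vector, while
    approximate fidelity k (0-based) maps to the k-th unit vector.
    `fidelity_index=None` denotes the target fidelity.
    """
    f = np.zeros(n_approx_fidelities)
    if fidelity_index is not None:
        f[fidelity_index] = 1.0
    F = np.tile(f, (len(X), 1))          # one fidelity row per material
    return np.hstack([np.asarray(X, float), F])
```

A material `[1.0, 2.0]` measured at the target fidelity becomes `[1.0, 2.0, 0.0, 0.0]` when two approximate fidelities exist, while the same material at approximate fidelity 1 becomes `[1.0, 2.0, 0.0, 1.0]`.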
The prior covariance matrix for multi-fidelity training data can be thought of as a block matrix by partitioning it according to the fidelities. The on-diagonal blocks of this matrix characterise correlation between materials measured within the same fidelity (and are therefore equivalent to the covariance matrices for the corresponding single-fidelity GPs), while the off-diagonal blocks characterise correlation between materials measured at different fidelities. By optimising or sampling the kernel hyperparameters, the degree to which lower fidelities are correlated with the ground-truth measurements is learned: if within the training data a given approximate fidelity is not correlated with the ground-truth high-fidelity measurements, then the lengthscale associated with that fidelity's one-hot encoded dimension will shrink and correspondingly the prior covariance between measurements at that fidelity and the ground-truth measurements will tend towards zero, while the opposite will occur for fidelities that are highly correlated with the ground-truth high-fidelity function.
For our model, the full prior covariance matrix is constructed using a Matern 5/2 kernel in combination with automatic relevance determination. Hyperparameters are optimised via the log marginal likelihood.
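A minimal numpy sketch of a Matern 5/2 kernel with per-dimension (ARD) lengthscales illustrates the mechanism: shrinking the lengthscale of a fidelity's one-hot dimension drives its cross-fidelity covariance towards zero. This is an illustrative implementation of the kernel family named above, not the code used in the study:

```python
import numpy as np

def matern52_ard(A, B, lengthscales, variance=1.0):
    """Matern 5/2 kernel with automatic relevance determination (ARD):
    each input dimension has its own lengthscale."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    ls = np.asarray(lengthscales, float)
    d2 = (((A[:, None, :] - B[None, :, :]) / ls) ** 2).sum(-1)  # scaled sq. distance
    r = np.sqrt(d2)
    s = np.sqrt(5.0) * r
    return variance * (1.0 + s + 5.0 * d2 / 3.0) * np.exp(-s)
```

Treating the second input dimension as a one-hot fidelity indicator, a long lengthscale on that dimension keeps the same material highly covariant across fidelities, whereas a short one decouples them.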
Data sets
As previously stated, the three datasets used in this study were chosen to span a range of thermochemical and optoelectronic properties including both experimental and computational values. The three datasets selected were the Harvard organic photovoltaic dataset (HOPV)^{38}, the Alexandria quantum chemical library (Alexandria)^{39} and the Chen Alchemical library^{40}.
Harvard Organic Photovoltaic Dataset (HOPV)
350 molecular structures were extracted alongside their experimental power conversion efficiencies and computational analogs computed using the Scharber model^{41}, built from energy levels calculated using four different density functionals – BP86^{42,43}, PBE0^{20,44}, B3LYP^{42,45} and M06-2X^{46,47} – in combination with the double-ζ def2-SVP basis set^{48}. For this dataset the optimization target was to discover the material with the highest power conversion efficiency, with the computational analogs available as lower-fidelity proxies. Costs were assigned as 1.0, 1.25, 1.75, 2.0 and 20.0 for evaluation at the BP86, PBE0, B3LYP and M06-2X levels of theory and by experiment, respectively. Molecular structures were described using SOAP descriptors^{49}, reduced to a 20D representation using principal component analysis.
Alexandria dataset
946 structures were extracted from the Alexandria dataset, comprising structures which had both experimental polarizabilities and computational analogues calculated both at the Hartree-Fock (HF) level of theory in combination with the 6-31G** basis set^{50,51,52} and using the B3LYP functional^{42,45} in combination with the aug-cc-pVTZ basis set^{53,54}. For this dataset the optimisation target was to locate the material with the highest experimental polarizability, with the HF and B3LYP calculations available as lower-fidelity proxies for the experimental target. Costs of 1.0, 2.0 and 6.0 were assigned to evaluating at the HF and B3LYP levels of theory and via experiment, respectively. Molecular structures were described using MACCS keys^{55}, which were reduced to 20D using principal component analysis.
Chen dataset
1766 structures were extracted, each with a measured experimental bandgap and a computational analogue calculated using the PBE functional with the projector augmented-wave method and a 520 eV cutoff. For this dataset the optimisation target was to discover the most insulating (i.e. highest-bandgap) material as determined by the experimental measurement, with the PBE calculations available as lower-fidelity proxies for the experimental target. Costs of 0.5 and 10 were assigned to evaluating the PBE-calculated bandgap and the experimental value, respectively. Structures were described using SOAP descriptors^{49}, which were reduced to a 20D representation using principal component analysis.
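All three datasets reduce their raw descriptors (SOAP vectors or MACCS keys) to 20 dimensions with principal component analysis. A minimal sketch of that reduction via an SVD of the centred data follows; the array shapes are illustrative, not the actual descriptor dimensionalities used in the paper:

```python
import numpy as np

def pca_reduce(X, n_components=20):
    """Centre the descriptor matrix and project it onto the top
    principal components, obtained from an SVD of the centred data
    (components are ordered by decreasing explained variance)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 512))  # e.g. 200 materials with 512-D raw descriptors
Z = pca_reduce(X)
print(Z.shape)  # (200, 20)
```

Working in this compressed 20D space keeps the GP input dimensionality manageable while retaining the directions of greatest variance in the original descriptors.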
Data availability
All materials datasets used are publicly available at the references cited within this manuscript.
^{39} – Alexandria Data Set
^{38} – Harvard Organic Photovoltaic Data Set (HOPV)
^{40} – Chen Data Set
Code availability
Experiments were performed using IBM's Bayesian Optimization Accelerator, a commercial program. To aid reproducibility, the authors have included implementation details, including pseudocode, for the functions key to this paper in a dedicated section of the supplementary information.
References
Rajan, K. Combinatorial materials sciences: Experimental strategies for accelerated knowledge discovery. Annu. Rev. Mater. Res. 38, 299–322 (2008).
Potyrailo, R. et al. Combinatorial and high-throughput screening of materials libraries: Review of state of the art. ACS Comb. Sci. 13, 579–633 (2011).
Mennen, S. M. et al. The evolution of high-throughput experimentation in pharmaceutical development and perspectives on the future. Org. Process Res. Dev. 23, 1213–1242 (2019).
Pyzer-Knapp, E. O., Suh, C., Gómez-Bombarelli, R., Aguilera-Iparraguirre, J. & Aspuru-Guzik, A. What is high-throughput virtual screening? A perspective from organic materials discovery. Annu. Rev. Mater. Res. 45, 195–216 (2015).
Pyzer-Knapp, E. O., Simm, G. N. & Aspuru-Guzik, A. A Bayesian approach to calibrating high-throughput virtual screening results and application to organic photovoltaic materials. Mater. Horiz. 3, 226–233 (2016).
Bajorath, J. Integration of virtual and high-throughput screening. Nat. Rev. Drug Discov. 1, 882–894 (2002).
Hautier, G. Finding the needle in the haystack: Materials discovery and design through computational ab initio high-throughput screening. Comput. Mater. Sci. 163, 108–116 (2019).
Suh, C., Fare, C., Warren, J. A. & Pyzer-Knapp, E. O. Evolving the materials genome: How machine learning is fueling the next generation of materials discovery. Annu. Rev. Mater. Res. 50, 1–25 (2020).
Pyzer-Knapp, E. O. et al. Accelerating materials discovery using artificial intelligence, high performance computing and robotics. npj Comput. Mater. 8, 1–9 (2022).
Smith, J. S., Isayev, O. & Roitberg, A. E. ANI-1: An extensible neural network potential with DFT accuracy at force field computational cost. Chem. Sci. 8, 3192–3203 (2017).
Behler, J. Representing potential energy surfaces by high-dimensional neural network potentials. J. Phys.: Condens. Matter 26, 183001 (2014).
Behler, J., Martoňák, R., Donadio, D. & Parrinello, M. Metadynamics simulations of the high-pressure phases of silicon employing a high-dimensional neural network potential. Phys. Rev. Lett. 100, 185501 (2008).
Pyzer-Knapp, E. O., Li, K. & Aspuru-Guzik, A. Learning from the Harvard Clean Energy Project: The use of neural networks to accelerate materials discovery. Adv. Funct. Mater. 25, 6495–6502 (2015).
Balachandran, P. V. Machine learning guided design of functional materials with targeted properties. Comput. Mater. Sci. 164, 82–90 (2019).
Chibani, S. & Coudert, F.-X. Machine learning approaches for the prediction of materials properties. APL Mater. 8, 080701 (2020).
Meng, X. & Karniadakis, G. E. A composite neural network that learns from multi-fidelity data: Application to function approximation and inverse PDE problems. J. Comput. Phys. 401, 109020 (2020).
Yang, C.-H. et al. Multi-fidelity machine learning models for structure–property mapping of organic electronics. Comput. Mater. Sci. 213, 111599 (2022).
Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Big data meets quantum chemistry approximations: The Δ-machine learning approach. J. Chem. Theory Comput. 11, 2087–2096 (2015).
Chen, C., Zuo, Y., Ye, W., Li, X. & Ong, S. P. Learning properties of ordered and disordered materials from multi-fidelity data. Nat. Comput. Sci. 1, 46–53 (2021).
Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865–3868 (1996).
Patra, A. et al. A multi-fidelity information-fusion approach to machine learn and predict polymer bandgap. Comput. Mater. Sci. 172, 109286 (2020).
Brochu, E., Cora, V. M. & de Freitas, N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Preprint at https://arxiv.org/abs/1012.2599 (2010).
Kandasamy, K. et al. Tuning hyperparameters without grad students: Scalable and robust Bayesian optimisation with Dragonfly. J. Mach. Learn. Res. 21, 1–27 (2020).
Pyzer-Knapp, E. O. Bayesian optimization for accelerated drug discovery. IBM J. Res. Dev. 62, 2–1 (2018).
Lam, R., Poloczek, M., Frazier, P. & Willcox, K. E. Advances in Bayesian optimization with applications in aerospace engineering. In 2018 AIAA Non-Deterministic Approaches Conference 1656 (2018).
Pyzer-Knapp, E. O., Chen, L., Day, G. M. & Cooper, A. I. Accelerating computational discovery of porous solids through improved navigation of energy-structure-function maps. Sci. Adv. 7, eabi4763 (2021).
Zhang, Y., Apley, D. W. & Chen, W. Bayesian optimization for materials design with mixed quantitative and qualitative variables. Sci. Rep. 10, 1–13 (2020).
Mockus, J. The Bayesian approach to global optimization. in System Modeling and Optimization 473–481 (Springer, Berlin, Heidelberg, 1982).
Springenberg, J. T., Klein, A., Falkner, S. & Hutter, F. Bayesian optimization with robust Bayesian neural networks. Adv. Neural Inf. Process. Syst. 29, 2171–2180 (2016).
Snoek, J. et al. Scalable Bayesian optimization using deep neural networks. Preprint at https://arxiv.org/abs/1502.05700 (2015).
Hernández-Lobato, J. M., Requeima, J., Pyzer-Knapp, E. O. & Aspuru-Guzik, A. Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. In International Conference on Machine Learning 1470–1479 (PMLR, 2017).
Wang, K. A. et al. Exact Gaussian processes on a million data points. Preprint at https://arxiv.org/abs/1903.08114 (2019).
Liu, H., Cai, J. & Ong, Y.-S. Remarks on multi-output Gaussian process regression. Knowl. Based Syst. 144, 102–121 (2018).
Song, J., Chen, Y. & Yue, Y. A general framework for multi-fidelity Bayesian optimization with Gaussian processes. In The 22nd International Conference on Artificial Intelligence and Statistics (PMLR, 2019).
Palizhati, A., Aykol, M., Suram, S., Hummelshøj, J. S. & Montoya, J. H. Multi-fidelity sequential learning for accelerated materials discovery. Preprint at https://doi.org/10.26434/chemrxiv.14312612.v1 (2021).
Fare, C., Fenner, P. & Pyzer-Knapp, E. O. A principled method for the creation of synthetic multi-fidelity data sets. Preprint at https://arxiv.org/abs/2208.05667 (2022).
Rasmussen, C. & Williams, C. Gaussian Processes for Machine Learning. (MIT Press, 2006).
Lopez, S. A. et al. The Harvard organic photovoltaic dataset. Sci. Data 3, 1–7 (2016).
Ghahremanpour, M. M., Van Maaren, P. J. & Van Der Spoel, D. The Alexandria library, a quantum-chemical database of molecular properties for force field development. Sci. Data 5, 1–10 (2018).
Chen, G. et al. Alchemy: A quantum chemistry dataset for benchmarking AI models. Preprint at https://arxiv.org/abs/1906.09427 (2019).
Scharber, M. C. et al. Design rules for donors in bulk-heterojunction solar cells – towards 10% energy-conversion efficiency. Adv. Mater. 18, 789–794 (2006).
Becke, A. D. Density-functional exchange-energy approximation with correct asymptotic behavior. Phys. Rev. A 38, 3098–3100 (1988).
Perdew, J. P. Density-functional approximation for the correlation energy of the inhomogeneous electron gas. Phys. Rev. B 33, 8822–8824 (1986).
Perdew, J. P., Ernzerhof, M. & Burke, K. Rationale for mixing exact exchange with density functional approximations. J. Chem. Phys. 105, 9982–9985 (1996).
Becke, A. D. Density-functional thermochemistry. III. The role of exact exchange. J. Chem. Phys. 98, 5648–5652 (1993).
Zhao, Y. & Truhlar, D. G. The M06 suite of density functionals for main group thermochemistry, thermochemical kinetics, noncovalent interactions, excited states, and transition elements: two new functionals and systematic testing of four M06-class functionals and 12 other functionals. Theor. Chem. Acc. 120, 215–241 (2008).
Zhao, Y. & Truhlar, D. G. Density functionals with broad applicability in chemistry. Acc. Chem. Res. 41, 157–167 (2008).
Weigend, F. & Ahlrichs, R. Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy. Phys. Chem. Chem. Phys. 7, 3297–3305 (2005).
Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. Phys. Rev. B 87, 184115 (2013).
Hehre, W. J., Stewart, R. F. & Pople, J. A. Self-consistent molecular-orbital methods. I. Use of Gaussian expansions of Slater-type atomic orbitals. J. Chem. Phys. 51, 2657–2664 (1969).
Hehre, W. J., Ditchfield, R., Stewart, R. F. & Pople, J. A. Self-consistent molecular orbital methods. IV. Use of Gaussian expansions of Slater-type orbitals. Extension to second-row molecules. J. Chem. Phys. 52, 2769–2773 (1970).
Hehre, W. J., Ditchfield, R. & Pople, J. A. Self-consistent molecular orbital methods. XII. Further extensions of Gaussian-type basis sets for use in molecular orbital studies of organic molecules. J. Chem. Phys. 56, 2257–2261 (1972).
Kendall, R. A., Dunning, T. H. Jr. & Harrison, R. J. Electron affinities of the first-row atoms revisited. Systematic basis sets and wave functions. J. Chem. Phys. 96, 6796–6806 (1992).
Woon, D. E. & Dunning, T. H. Jr. Benchmark calculations with correlated molecular wave functions. I. Multireference configuration interaction calculations for the second row diatomic hydrides. J. Chem. Phys. 99, 1914–1929 (1993).
Durant, J. L., Leland, B. A., Henry, D. R. & Nourse, J. G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 42, 1273–1280 (2002).
Acknowledgements
This work was supported by the Hartree National Centre for Digital Innovation, a collaboration between Science and Technology Facilities Council and IBM.
Author information
Contributions
E.P.K. and C.F. conceived the project, E.P.K. supervised the project, C.F., P.F., and A.V. performed the computational experiments and all authors analysed the output of the experiments and contributed to writing the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fare, C., Fenner, P., Benatan, M. et al. A multi-fidelity machine learning approach to high throughput materials screening. npj Comput. Mater. 8, 257 (2022). https://doi.org/10.1038/s41524-022-00947-9
DOI: https://doi.org/10.1038/s41524-022-00947-9