Benchmarking the Performance of Bayesian Optimization across Multiple Experimental Materials Science Domains

In the field of machine learning (ML) for materials optimization, active learning algorithms, such as Bayesian Optimization (BO), have been leveraged for guiding autonomous and high-throughput experimentation systems. However, very few studies have evaluated the efficiency of BO as a general optimization algorithm across a broad range of experimental materials science domains. In this work, we evaluate the performance of BO algorithms with a collection of surrogate model and acquisition function pairs across five diverse experimental materials systems, namely carbon nanotube polymer blends, silver nanoparticles, lead-halide perovskites, as well as additively manufactured polymer structures and shapes. By defining acceleration and enhancement metrics for general materials optimization objectives, we find that for surrogate model selection, Gaussian Process (GP) with anisotropic kernels (automatic relevance detection, ARD) and Random Forests (RF) have comparable performance and both outperform the commonly used GP without ARD. We discuss the implicit distributional assumptions of RF and GP, and the benefits of using GP with anisotropic kernels in detail. We provide practical insights for experimentalists on surrogate model selection of BO during materials optimization campaigns.


Introduction
Autonomous experimental systems have recently emerged as the new frontier for accelerated materials research.
These systems excel at optimizing materials objectives, e.g. environmental stability of solar cells or toughness of 3D printed mechanical structures, that are typically costly, slow, or difficult to simulate and experimentally evaluate. While autonomous experimental systems are often associated with high sample synthesis rates via high-throughput experiments (HTE), they may also utilize closed-loop feedback from machine learning (ML) during materials property optimization.
The latter has motivated integration of advanced lab automation components with ML algorithms. Specifically, active learning [1,2] algorithms have traditionally been applied to minimizing total experiment costs while maximizing machine learning model accuracy through hyperparameter tuning. Their primary utility for materials science research, where experiments remain relatively costly, lies in an iterative formulation that proposes targeted experiments with regard to a specific design objective based on prior experimental observations. Bayesian optimization (BO) [3,4,5], one class of active-learning learning methods, utilizes surrogate model to approximate a mapping from experiment parameters to an objective criterion, and provides optimal experiment selection when combined with an acquisition function. BO has been shown to be a data-efficient closed-loop active learning method for navigating complex design spaces [3,6,7,8,9,10]. Consequently, it has become an appealing methodology for accelerated materials research and optimizing material properties [11,12,13,14,15,16,17,18,19,20,21,22] beyond state-of-the-art.
The materials science community has seen successful demonstrations in performing materials optimization via autonomous experiments guided by BO and its variants [17,23,24,25,26,27]. Naturally, previous work emphasized the ability to achieve materials optimization with fewer experimental iterations. There have been very few quantitative analyses of the acceleration or enhancement resulting from applying BO algorithms and discussions on sensitivity of BO performance to surrogate model and acquisition function selection. Rohr et al. [28], Graff et al. [29] and Gongora et al. [24] have evaluated the performance of BO using multiple surrogate models and acquisition functions within specific electrocatalyst, ligand and mechanical structure design spaces respectively. However, comprehensive benchmarking of the performance of BO algorithms across a broad array of experimental materials systems, as we present here, has not been done. Although one could test BO across various analytical functions or emulated materials design spaces [30,25], empirical performance evaluation on a broader collection of experimental materials science data is still necessary to provide practical guidelines. Optimization algorithms need systematic and comprehensive benchmarks to evaluate their performance, and lack of these could significantly slow down advanced algorithm development, eventually posing obstacles for building full autonomous platforms. Presented below, the benchmarking framework, practical performance metrics, datasets collected from realistic noisy experiments, and insights derived from side-by-side comparison of BO algorithms will allow researchers to evaluate and select their optimization algorithm before deploying it on autonomous research platforms. Our work provides comprehensive benchmarks for optimization algorithms specifically developed for autonomous and high-throughput experimental materials research. Ideally, it provides insight for designing and deploying Bayesian optimization algorithms that suit the sample generation rate of future autonomous platforms and tackle materials optimization in more complex design spaces.
In this work, we benchmark the performance of BO across five different experimental materials science datasets, optimizing properties of carbon nanotube polymer blends, silver nanoparticles, lead-halide perovskites, and additively manufactured polymer structures and shapes. We utilize a pool-based active learning framework to approximate experimental materials optimization processes. We also adapt metrics such as enhancement factor and acceleration factor to quantitatively compare performances of BO algorithms against that of a random sampling baseline. We observe that when paired with the same acquisition functions, Random Forest (RF) [31,32,33] as a surrogate model can compete with Gaussian Process (GP) [4] with automatic relevance detection (ARD) [34] that has anisotropic kernels. They also both outperform commonly used GP without ARD. Our discussion on the differences in the implicit distributional assumptions of surrogate models and the benefits of using GP with anisotropic kernels yield deeper insights regarding surrogate model selection for materials optimization campaigns. These discussions suggest guidelines on using BO for general materials optimization. We also offer open source implementation of benchmarking code and datasets to support future development of such algorithms in the field.

Experimental materials datasets
As seen in Table 1, we have assembled a list of five materials datasets with varying sizes, dimensions n dim , and materials systems. These diverse datasets are generated from autonomous experimental studies of collaborators, and facilitate BO performance analysis across a broad range of materials. They contain three to five independent input features, one property as materials optimization objective, and contain from a few tens to hundreds of data points. Based on their optimization objectives, the design space input features in the datasets range from materials compositions to synthesis processing parameters, as seen in Table T1 -T5 in supplementary information (SI).  [36] Silver nanoparticles Flow synthesis 164 5 Absorbance spectrum score Perovskite [23] Thin film perovskite Spin coating 94 3 Stability score Crossed barrel [24] 3D printed structure 3D printing 600 4 Mechanical toughness AutoAM [37] Materials manufacturing 3D printing 100 4 Shape score The datasets were processed for comparison purposes: 1. For each each dataset, its optimization objective values are independently centered to its mean and scaled to unit variance.
3 A PREPRINT -JUNE 3, 2021 2. For each each dataset, its optimization problems is formulated as global minimization for consistency.
It should be noted that while all datasets were gathered from relatively high-throughput experimental systems, P3HT/CNT, AgNP, Perovskite, and AutoAM had BO guiding the selection of subsequent experiments partially through the materials optimization campaigns. Across the datasets, the differences in distribution of normalized objective values can be observed in Figure 1(a); the differences in distribution of sampled data points in its respective materials design space can be seen in Figure 1(b). The five materials datasets in the current study are available in the following GitHub repository [38].

Bayesian Optimization: surrogate models and acquisition functions
Bayesian optimization (BO) [3,4,5] aims to solve the problem of finding a global optimum (min or max) of an unknown objective function g: holds the assumption that this black-box function g can be evaluated at any #» x ∈ X and the responses are noisy The surrogate model f is probabilistic and consists of a prior distribution that approximates the unknown objective function g, and is sequentially updated with collected data to yield a Bayesian posterior belief of g. Decision policies aimed to find the optimum in fewer experiments are implemented in acquisition functions, which can use the mean and variance predicted at any #» x ∈ X in the posterior to select the next observation to be performed.
The BO algorithm is comprised of both a surrogate model and an acquisition function. The surrogate models considered in this study are random forest (RF) [31], Gaussian process (GP) regression [39], and GP with automatic relevance detection (ARD) [39,5,34].
1. To approximate the experience of a researcher with little prior knowledge of a materials design space, for RF, we have hyperparameters applicable across all five datasets without loss of generality: n tree = 50 and bootstrap = True.
2. For hyperparameters of GP, we choose kernels from Matérn52, Matérn32, Matérn12, radial basis function (RBF), and multilayer perceptron (MLP). The initial lengthscale for each kernel was set to unit length.
3. For hyperparameters of GP ARD, we use ARD, which allows GP to keep anisotropic kernels. The kernel function of GP then has individual characteristic lengthscales l j [5,34] for each of the input feature dimensions j.
As an example, in dimension j, Matérn52 kernel function between two points #» p , #» q in design space would be where r = (p j − q j ) 2 , σ is standard deviation and l j is the characteristic lengthscale. These characteristic lengthscales can be used to estimate the distance moved along j th dimension from the input values in design space before the change of objective values become uncorrelated with this feature. 1 l j is thus useful in understanding the sensitivity of objective value to input feature j.
We then pair the selected surrogate model with one of three acquisition functions, including expected improvement (EI), probability of improvement (PI), and lower confidence bound (LCB) LCB λ ( #» x ) = −λμ( #» x ) +σ( #» x ), whereμ andσ are the mean and standard deviation estimated by surrogate model while λ is an adjustable ratio between exploitation and exploration.
In addition, these surrogate models, their hyperparameters, and acquisition functions were chosen because they represent the majority of off-the-shelf options accessible, and are ones that have been widely applied to materials optimization campaigns in the field. Our study provides a comprehensive test across the five datasets in order to reflect how each BO algorithm, resulting from the pairing above, performs across many different materials science design spaces. GP and RF were also selected as examples to specifically illustrate how the differences in implicit distributional assumptions of surrogate model could affect the prediction of the mean and standard deviation when selecting subsequent experiments.

Pool-based active learning benchmarking framework
Within each respective experimental dataset, the set of data points form a discrete representation of ground truth in the materials design space. Figure 2 shows the pool-based active learning benchmarking framework we use to simulate materials optimization campaigns guided by BO algorithms in each materials system.
The framework has the following properties: 1. It has the traits of an active learning study as it contains a machine learning model that is iteratively refined through subsequent experimental observation selection based on information from previously explored data points. The framework is also adapted for BO, and emphasizes optimization of materials objectives over building an accurate regression model in design space.
2. It is derived from pool-based active learning. Besides the randomly selected initial experiments, the subsequent experimental observations are selected from the total pool of undiscovered data points ( #» x , y) ∈ D, whose input features #» x are all made available for evaluation by the acquisition functions. The ground truth in the materials design space was represented with discrete data points over a continuous emulation for the following reasons: Observation of performance through case study on crossed barrel dataset While the five datasets covered a breadth of materials domains, the relative performances of tested BO algorithms were observed to be consistent. Figure 3(a) demonsrates the performances of RF, GP ARD (Matérn52), and GP (Matérn52). For the full combinatorial study including all types of GP kernels and acquisition functions, please refer to supplementary information (SI). We showcase the benchmarking results using the crossed barrel dataset [24], which was collected by grid sampling the design space through a robotic experimental system while optimizing the toughness

Planning Inference
Benchmarking with performance metrics contains ordered list of ⃗, in search for Acquisition functions α: Expected improvement, Lower confidence bound, Probability of improvement Surrogate models : Gaussian Process without ARD, Gaussian Process with ARD, Random Forest Figure 2: Benchmarking framework including a simulation of BO performing closed-loop optimization with alternating inference and planning stages. X is the iteratively collected sequence of experimental data ( #» x , y) during optimization campaign. D is the original pool or total undiscovered set of data from which next experiments are selected. f is the surrogate model used to estimate meanμ and standard deviationσ, which parameterize the acquisition function α to select next experiment #» x * to be evaluated. of additive manufactured crossed barrel structure. As for the performance metric, we use to show the fraction of the crossed barrel structures with top 5% toughness that have been discovered by cycle i = 1, 2, 3, . . . , N . Top% describes how quickly can a BO guided autonomous experimental system could identify multiple top candidates in a materials design space. Keeping multiple well-performing candidates allows one to not only observe regions in design space that frequently yield high-performing samples but also have backup options for further evaluation should the most optimal candidate fail in subsequent evaluations. There are research objectives related to finding any good materials candidate, yet in those cases, random selection could outperform optimization algorithms due to luck in a simple design space. Our objective of finding multiple or all top tier candidates is more applicable to experimental materials optimization scenarios and suitable for demonstrating the true efficacy and impact of BO.   2. Both RF and GP with ARD outclass GP without ARD.
3. LCB 1 , which has equal weights for exploration and exploitation, outperforms other acquisition functions LCB λ that overly emphasize exploration. LCB 1 also outperformed EI, which is a very popular acquisition function in previous materials optimization studies but has also been known to make excessive greedy decisions [40,41,42].
Because of this observation, when trying to compare BO algorithms with different surrogate models in this work, the same acquisition function LCB 1 was used. We have also evaluated the performance of BO algorithms using probability of improvement (PI) as acquisition function, but its performance was mostly worse than EI and therefore not the focus of discussion; this observation can be partially attributed to PI only focusing on how likely is an improvement occurs, but not considering how much improvement could be made.
To further quantify the relative performance, we set Top% = 0.8 as a realistic goal to indicate we have identified 80% of the structures with top 5% toughness (Figure 3a). For surrogate models paired with LCB 1 , we see that GP with ARD and RF reach that goal by evaluating approximately 30 candidates out of the total of 600, whereas GP without ARD needs about 90 samples. Top% rises initially as slowly as the random baseline because the surrogate models suffer from high variance in prediction, having only been trained with a small datasets; Top% ramps up very quickly as the model learns to become more accurate in identifying general regions of interest to explore; the rate of learning eventually slows down at high learning cycles because the local exploitation for the global optimum has exhausted most if not all top 5% toughness candidates, and the algorithms therefore switch to exploring sub-optimal regions. Therefore, it can be assumed that the most valuable regions to examine performance is before each curve reaches Top% = 0.8 and Top% = 0.8 can be used as a realistic optimization goal.
To quantify the acceleration of discovery from BO, we adapt two other metrics similar to the ones from Rohr et al. [28].
Both compared to a statistical random baseline, Enhancement Factor(EF) shows how much improvement in a metric one would receive at cycle i, and Acceleration Factor(AF) is the ratio of cycle numbers showing how much faster one could reach a specific value Top%(i BO ) = Top%(i random ) = a ∈ [0, 1]. The aggregated performance of BO algorithms is further quantified via EF and AF curves in Figure 3(b), 3(c): starting off with small EFs or AFs before the surrogate model gains more accuracy; reaching absolute EF max and AF max of up to 16×. Eventually, the learning algorithms show diminishing returns from an information gain perspective as we progress deeper into our optimization campaigns during pool-based active learning. We observe that for the two BO algorithms both with same acquisition function LCB 1 but different surrogate model GP ARD and RF, they reach EF max at different learning cycles and AF max at different Top%, both corresponding to the switch of best performing algorithm around Top% = 0.67. RF: LCB 1 clearly excels at lower learning cycles, yet GP ARD: LCB 1 takes the lead and would reach Top% = 0.8 with fewer experiments. Therefore, these results objectively show that optimal BO algorithm selection varies with assigned experiment budget and specific optimization task [28].
Since we identified two BO algorithms, RF: LCB 1 and GP ARD: LCB 1 , to have similar performance, we wanted to further investigate how similar their optimization paths were in the design space when starting from the same initial experiments. In Figure 3(d), we use Jaccard similarity index to quantify the similarity in optimizations paths. Jaccard similarity, J = |A ∩ B| |A ∪ B| , is the size of the intersection divided by the size of the union of two finite sample sets; specifically in our benchmarking study, using the same 50-ensemble runs that generated Figure 3 In addition, the Jaccard similarity value does not increase monotonically, and a significant drop can be seen in J such as one around i = 30, which coincides with the learning cycles where GP ARD : LCB 1 overtook RF : LCB 1 as best performing algorithm in Figure 3(a). Since the two algorithms used the same acquisition function, this observation shows that while in general the optimization paths of the two algorithms have more overlap overtime, occasional divergence paths still take place because the two algorithms have considerable difference in gathered data used to learn their surrogate models and how their surrogate models predict mean and standard deviation. GP ARD : LCB 1 and RF : LCB 1 started at the same two initial experiments and use the same acquisition function, and the only difference is the surrogate model used. Thus, the divergence and convergence in optimization paths can be again primarily attributed to GP ARD and RF exploiting underlying physics of crossed barrel structure differently. Figure 3(d) highlights the impact of different surrogate model selection beyond final performance, and to provide better guidelines to future research, inspires us to further investigate the role of surrogate models.

Comparison of performance across datasets
To further assess the performance of BO, optimization campaigns were conducted for the P3HT/CNT, AgNP, AM ARES, and Perovskite datasets. Across most, if not all the investigated datasets, it was quite observed consistently that the performance of BO algorithms using GP with ARD and RF as surrogate models were comparable, and both outperform those using GP without ARD in most datasets. To illustrate, in Figure 4, we show such relative performance using normalized EF max of BO algorithms same acquisition function LCB 1 but with different surrogate models across all five datasets. In each dataset, the BO algorithm with the largest EF max had its EF scaled to 1, and the other two BO algorithms showing lower EF max were correspondingly scaled, resulting in five sets of column plots. In addition to the observation on relative performance, we also observe that BO algorithms with RF and GP ARD as surrogate model also have plenty overlap between their 25th to 75th percentile across five datasets, further indicating their similarity in performance.
Notably, EF max of the other four datasets datasets were in the 2× to 5× range as seen in Figure S1 from SI, which is much lower than the EF max of the crossed barrel dataset in Figure 3(b). The difference in the absolute EF max can be attributed to the data collection methodology of the individual datasets. While the crossed barrel dataset was collected using a grid sampling approach, the other four studies were collected along the path of a BO guided materials optimization campaign. Therefore, these four datasets were smaller in size and possessed an intrinsic enhancement and acceleration within their datasets. As a result, it is reasonable that these datasets demonstrate lower EFs, AFs during benchmarking. Nevertheless, the fact that both EFs and AFs are still larger than 1 indicate that further acceleration and enhancement is still possible when given design spaces as parts of optimization paths. Noticeably, the Perovskite dataset had the most intrinsic acceleration because its next experimental choice was guided by BO infused with probabilistic constraints generated from DFT proxy calculations of the environmental stability [23] of perovskite . As a result, the optimization sequence to be chosen in that study is already narrowed down to a more efficient path from initial experiments to final optimum, making the random baseline to appear arbitrarily much worse. Another interesting Figure 4: Normalized EF max demonstrated by BO algorithms having GP without ARD, GP with ARD, and RF as surrogate models and all using LCB 1 as acquisition function. For each algorithm applied across datasets, the 50 th percentile of EF max is shown by the barplots, and its 25 th and 75 th percentile are shown by respective floating bars.
observation is how the performance of GP without ARD (isotropic kernels) as surrogate model catches up with GP ARD and RF when the design space has an already "easier" path towards the optimum, further supporting our observations on relative performance. The hypothesis that the lower EF max are caused by intrinsic acceleration and enhancement resulting from dataset collection process can be verified by collecting a subset from the uniformed grid sampled crossed barrel dataset. This subset is collected by running BO algorithm GP: EI until all candidates with top 5% toughness are found, representing an "easier" path towards optimums, and therefore carries intrinsic enhancement and acceleration.
We run the same benchmarking framework on this subset, and observe that while EF max is significantly reduced, the relative ranking of the three surrogate models is consistent, as seen again in Figure S1 from SI. Despite the differences described above, all the investigated BO algorithms outperformed the random baseline demonstrating the efficacy of BO in materials optimization campaigns.

Discussion
In this section, we discuss performance of BO algorithms in the context of autonomous and high-throughput materials optimization.

Comparison of RF and GP as surrogate models
While BO algorithms with GP type surrogate models have been extensively used in the field, we observe that the performance of RF as surrogate model is comparable to that of GP. The results heavily suggest that RF is a capable surrogate model to consider besides GP in BO for future HT materials optimization campaigns.
Before more detailed comparison, an important distinction to make between RF and GP as surrogate models is how they predict mean and standard deviation, which can be attributed to the implicit assumptions when using them as surrogates to represent an unknown ground truth within materials domains. A GP in this work, whether with ARD or not, is essentially a distribution over a materials domain such that any finite selection of data points in This important distinction explains the differences of results in Figure 3 and Figure 4. Other factors affecting surrogate model selection are further discussed via the following side-by-side comparisons of GPs and RFs.
We would first like to discuss the time complexity of these surrogate models. Most commonly in autonomous experimental materials optimization studies, time spent on generating samples is much more significant than that of surrogate model training. However, in our study, there was a noticeable difference in time when benchmarking different BO algorithms, which reveals foreseeable challenges to future research. Across five datasets in this study, starting from the same initial experiments and using the same acquisition function LCB 1 , the ratio of average running time to finish benchmarking framework between the three surrogate models is t RF : t GP : t GP ARD = 1 : 1.32 : 1.54. With n as the number of training data, n dim as design space dimension, n tree as number of decisions trees kept in RF model, in terms of general training time complexity, we have t RF = O(nlog(n) · n dim · n tree ) < t GP = O(n 3 + n 2 · n dim ) [44,45,46].
The relatively expensive computational complexity of GP model is mostly due to the process of calculating the inverse of an n by n matrix during its training process, and keeping anisotropic kernels certainly adds extra computational time.
For reasonable choice of n, n dim , and n tree in materials science research, RF not only can be trained even faster via parallel computing of its decision trees, but also suffer less in performance with increasing n than GP. While the O(n 3 ) time complexity of GPs is typically less of a concern when working with smaller datasets, it could quickly become intractable when applied to larger datasets after experimental samples can be generated at unprecedented rate through advancement of automation in materials laboratories. At each learning cycle, time used in synthesis will eventually match with the time used in model retraining and prediction, typically within seconds. Therefore, in the future, if we do not consider the number of experiments as the budget, but instead consider the total experimental time invested, then RF has a potential advantage over GP when we aim to have fast and seamless feedback loop between model and materials experiments.
We next discuss some properties in the RF, GP, and GP ARD models that would explain the observed differences in performance. The benchmarking effort presented here is unique in its use of abundant experimental materials data with built in noise, often unavoidable in physical science research. The results thus provide a realistic performance evaluation for optimization algorithms in the context of materials research. RF having good performance across the five experimental datasets can be partially attributed to predictions by RF being empirical estimates from its ensemble of decision trees and free of distributional assumptions. The multiple decision trees of RF naturally have low bias and high variance; the aggregation process in RF of different decision trees mitigates this issue, resulting in a model that has relatively low bias and medium variance, and thus are likely more robust to noise and applicable for generalized prediction with unknown assumption [43,32]. These properties of RF can be partially observed in Figure 4, where we not only see the performance of RF match that of GP ARD as surrogate model, but also observe the variance of EF max for RF is on average lower than those of GP with ARD and GP without ARD. Generally speaking, on one hand, if the ground truth manifold of a design space indeed satisfies the gaussianity assumption of GP, then arguably GP type surrogate models have an advantage in learning a model with low bias and variance. On the other hand, if there were sharp discontinuities, piece-wise constants or changes in orders of magnitude through local regions of materials design space, The decision trees of RF would be able to capture these points accurately and reflect their influences on future predictions via aggregated result. These points are specifically regions of interest to be further investigated by researchers, whether they are new findings or outliers from experiments, but they typically fall out of distributional assumptions of GP and could be smoothed out in the learned model without considerable effort in kernel hyperparameter tuning.
We last discuss the effort required hyperparameter tuning of surrogate model during optimization. Despite HTE drastically increasing the rate of materials data collection, active learning for optimization in new materials domains still requires data collection in a sequential or batched manner. While RF has potentially more hyperparameters such as n tree , max depth, and max split to select, it is less penalized for sub-optimal choice of hyperparameters compared to GP.
In this study, across five datasets, as long as sufficient n tree were used in RF, its performance as surrogate model in BO algorithm has been consistently comparable to that of GP. Other hyperparameters of RF have had less of an impact.
Meanwhile, besides the implicit distribution assumption of using a GP type surrogate model, a kernel (covariance function) of GP specifies a specific prior on the domain. Choosing a kernel that is incompatible with the domain manifold could significantly slow down optimization conversion due to loss of generalization, as seen in Figure S2 -S6 from SI, where BO algorithms with GP surrogate models are highly sensitive to kernel selection. For example, Matérn52 kernel analytically requires the fitted GP to be 2 times differentiable in the mean-square sense [4], which can be difficult to verify for unknown materials design spaces. Selecting such a kernel could introduce extra domain manifold assumptions to an unfamiliar design space, as we often have limited data to make confident distribution assumptions at optimization onset. Instead of devoting nontrivial experimental budget to optimize the kernels of GP using adaptive kernels [47], automating kernel selection [48] or keeping a library of kernels available via online learning, RF is an easier off-the-shelf option that allows one to make fewer structural assumptions about unfamiliar materials domains. If a GP surrogate model is still preferred, a Multilayer Perceptron (MLP) kernel [49] mimicking neural networks would be suggested as it has comparable performance to other kernels (see Figure S2 -S6 from SI).
Admittedly, our benchmarking framework might have given RF a slight advantage by discretizing the materials domain through actively acquiring a new datapoint at each cycle and limiting the choice of next experiments within the pool of undiscovered datapoints. However, the crossed barrel dataset has a sampling density, size, and range within its design space sufficient to cover its manifold complexity. A drawback of RF is that it performs poorly in extrapolation beyond the search space covered by training data, yet in the context of materials optimization campaigns, this disadvantage can be mitigated by clever design of initial experiments, namely using sampling strategies like latin hypercube sampling (LHS). In this way, we can not only preserve the pseudo random nature of selecting initial experiments but also cover a wider range of data in each dimension so that RF surrogate model would not have to often extrapolate to completely unknown regions. Considering our benchmarking results and the comparisons above, RF's relative ease of use paired with the intuitive tuning of LCB's weights to adjust exploration and exploitation forms a BO algorithm suitable for general materials optimization campaigns at early stages.

Benefits of using GP ARD with anisotropic kernels
As seen from Figure 4, using anisotropic kernels with GP essentially removes the performance gap between RF and GP with isotropic kernels. with the benefit that the one can exploit the underlying modeling assumptions of the GP ARD for more advanced analysis.
As mentioned earlier, ARD allows us to utilize individual lengthscales for each input dimension j in the kernel function of GP, which are subsequently optimized along learning cycles. These lengthscales in an anisotropic kernel provide a "weight" for measuring relative relevancy of each feature to predicting the objective, i.e. understanding the sensitivity of objective value to each input feature dimension. The reason GP without ARD shows worse performance is as follows: it will have a single lengthscale in an isotropic kernel as scaling parameter controlling GP's kernel function, which is at odds with the fact that each input feature has its distinct contribution to the objective. Depending on how different each feature is in nature, range and units, e.g. solvent composition vs. printing speed, using the same lengthscale in the kernel function for each feature dimension could provide unreliable predictive results. The materials optimization objective naturally has different sensitivities to each input variable, and thus it is rationale then, that the "lengthscale" parameter inside the GP kernel should be independent. In Figure 4, the noticeable improvements of using an anisotropic kernel can be seen in the relative lower performance of GP without ARD compared to that of GP with ARD. While data normalization can partially alleviate the problem, how it is conducted is highly subject to a researcher's choice, and therefore we would like to raise awareness of the benefits of using GP with anisotropic kernels.
In addition, the lengthscales from the kernels of GP with ARD provides us with more useful information about the input features. These lengthscale values have been used for removing irrelevant inputs [4], where high l j values imply low relevancy input feature j. In the context of materials optimization, we find the following use of ARD especially useful: ARD could identify a few directions in the input space with specially high "relevance." This means that if we train GP with ARD on input data with their original units and without normalization, once we extract the lengthscale of each feature l j , our GP model in theory should not be able to accurately extrapolate more than l j units away from collected observations in j th dimension. Thus, l j suggests the range of next experiments to be performed in the j th dimension of the materials design space. It also infers a suitable sampling density in each dimension in the experimental setting. When a particular input feature dimension has a relative small l j or large 1 l j , it means that for small change in objective value, we would have a relatively large change in the location within this input feature dimension; thus, the the sampling density or resolution in this dimension should be high enough to capture such sensitivity. In addition, people have considered using information extracted from these lengthscales for even more advanced analysis and variable selection [50]. At the expense of computation time tolerable in the context of materials optimization campaigns, an anisotropic kernel provides not only a better generalizable GP model but also useful information in analyzing input feature relevancy at each learning cycle. For the above mentioned reasons, it would be great practice for researchers to emphasize their use of GP with anisotropic kernels as surrogate models during materials optimization campaigns.
In conclusion, we benchmarked the performance of BO algorithms across five different experimental materials science domains. We utilize a pool-based active learning framework to approximate experimental materials optimization processes, and adapted active learning metrics to quantitatively evaluate the enhancement and acceleration of BO for common research objectives. We demonstrate that when paired with the same acquisition functions, RF as surrogate model can compete with GP with ARD, and both outperform GP without ARD. In the context of autonomous and high-throughput experimental materials research, we discuss the differences in implicit distributional assumptions of surrogate models and the benefits of using GP with anisotropic kernels. We provide practical insights on surrogate model selection for materials optimization campaigns, and also offer open source implementation of benchmarking code and datasets to support future algorithmic development.
Establishing benchmarks for active learning algorithms like BO across a broad scope of materials systems is only a starting point. Our observations demonstrate how the choice of active learning algorithms has to adapt to their applications in materials science, motivating more efficient ML guided closed-loop experimentation, and will likely directly result in a larger number of successful optimization of materials with record breaking properties. The impact of this work can be extended to not only other materials systems, but also a broader scope of scientific studies utilizing closed-loop and high-throughput research platforms. Through our benchmarking effort, we hope to share our insights with the field of accelerated materials discovery and motivate a closer collaboration between ML and physical science communities.

Methods Prediction by surrogate models and acquisition functions
In order to estimate the meanμ( #» x * ) and standard deviationσ( #» x * ) of predicted objective value at a previously undiscovered observation #» x * in design space: For a Gaussian process (GP), it assumes a prior over the design space that is constructed from already collected observations ( #» x i , y i ), i = 1, 2, ..., n. This prior is the source of implicit distributional assumptions, and when an undiscovered new observation ( #» x * , y * ) is being considered during noisy setting (σ = 0.01), the joint distribution between the objective values of collected data #» y ∈ R n and y * is K is the covariance matrix of the input features X = { #» x i |i = 1, 2, ..., n}; K * is the covariance between the collected data and new input feature #» x * ; K ** is the covariance between the new data. For each of the covariance matrices, The standard deviation valueσ( #» x ) can be obtained from the diagonal elements of this covariance matrix.
For a random forest (RF), letĥ k ( #» x * ) denote the prediction of objective value from the k th decision tree in the forest, k = 1, 2, ..., n tree , thenμ andσ The median or other variations could also be used in future studies to aggregate the predictions for potential improvement in robustness [43].
We tested three acquisition functions in our study, including expected improvement (EI), probability of improvement (PI), and lower confidence bound (LCB).
EI( #» x ) = (y best −μ( #» x ) − ξ) · Φ(Z) +σ( #» x )ϕ(Z) (10) where µ andσ are estimated mean and standard deviation by surrogate model; y best is best discovered objective value within all collected values so far; ξ = 0.01 is jitter value that can slightly control exploration and exploitation; Φ and ϕ are the cumulative density function and probability density function of a normal distribution.
where λ is a adjustable ratio between exploitation and exploration.

Pool-based active learning framework
As seen in Figure  and standard deviationσ( #» x ). We then evaluate the acquisition function values α(μ( #» x ),σ( #» x )) for each remaining experimental action #» x ∈ D in parallel. At each cycle, action #» x * = arg max x α( #» x ) will be selected as next experiment.
During inference stage, after selecting action #» x * , the corresponding sample observation y * is obtained, and ( #» x * , y * ) is added to X and removed from set D. The new observation ( #» x * , y * ) is incorporated into the surrogate model. The sequential alternation between planning and inference is repeated until undiscovered data points run out.

Statistical baselines
In Figure 3 amd 4, we have introduced some statistical baselines when benchmarking the performance of BO algorithms with a pool-based active learning framework.
For the random baseline in Figure 3 Then at cycle i = 2, 3, ..., N , there is and In Figure 3(d), between two optimization paths starting with the same two initial data points: 1. The statistically most overlap happens when two paths are identical, resulting in J(i) = 1, i = 1, 2, ..., N ; 2. The statistically least overlap happens when the two follow drastically different paths until they run out of data points undiscovered by both algorithms, resulting in

Data availability
The five experimental datasets in the current study is available are the following GitHub repository [38]: https://github.com/PV-Lab/Benchmarking.

Code availability
The code for pool-based active learning framework and visualization in the current study are available in the following GitHub repository [38]: https://github.com/PV-Lab/Benchmarking.