Introduction

With advances in computational engineering, materials design and discovery have been increasingly viewed as optimization problems with the goal of achieving desired material properties or device performance1,2,3. One challenge of designing new materials systems is the co-existence of qualitative and quantitative design variables associated with material compositions, microstructure morphology, and processing conditions. While microstructure morphology can be described using quantitative variables such as those associated with correlation functions4, descriptors3,5,6, and spectral density functions1,2, many composition and processing conditions are discrete and qualitative by nature. For example, in polymer nanocomposite design, there are numerous choices of material constituents (e.g., the types of filler and matrix) and processing conditions (e.g., the type of surface treatment); each combination follows drastically different physical mechanisms with significant impact on the overall properties3,7. As illustrated in Fig. 1, the existence of both quantitative and qualitative material design variables results in multiple disjoint regions in the property/performance space. This combinatorial nature poses additional challenges in materials modeling and in the search for optimal solutions.

Figure 1

Target space exploration in materials design.

In contrast to the traditional trial-and-error experimental approach to materials design, computational materials design methods have emerged as an efficient and effective alternative in the past decade, building upon the advancement of simulation techniques such as finite element analysis (FEA) and density functional theory (DFT) that can accurately model and predict material properties at different length scales8,9. Recent years have seen new developments in computational methods for designing materials with quantitative and qualitative variables simultaneously, e.g., structure optimization and materials selection of a multilayer beam structure supported by a spring using a genetic algorithm (GA)10, and optimization of a thermal insulation system using mixed variable programming (MVP) and a pattern search algorithm11,12. However, directly employing expensive simulation models in mixed-variable optimization is still restrictive: optimization algorithms commonly require hundreds of evaluations of the objective (material properties) even for problems with only quantitative variables, and the computational demand is significantly higher with mixed variables.

To address the issue of expensive simulation, a common strategy in simulation-based optimization is to build a metamodel, a.k.a. response surface model, based on data generated by simulations, and then directly use the metamodel for optimization. Design of experiments (DOE) methods, such as Latin-hypercube sampling (LHS)13, are often used to improve the overall metamodel accuracy by generating samples that cover the design space as evenly as possible. However, this is not the most efficient approach when the design objective is well defined, because more sample points should then be selected close to the "optimal" locations rather than uniformly over the whole design space.
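As a concrete illustration, the following minimal Python sketch generates such a space-filling design with scipy's quasi-Monte Carlo module; the dimensionality and variable bounds are hypothetical placeholders, not values from this paper.

```python
# Illustrative only: a 30-point Latin-hypercube design over four hypothetical
# quantitative design variables, scaled from [0, 1]^4 to the variable bounds.
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=4, seed=0)       # d = number of quantitative variables
unit_samples = sampler.random(n=30)             # 30 points spread evenly over [0, 1]^4
lower = [0.0, 0.0, 0.0, 50.0]                   # hypothetical lower bounds
upper = [0.01, 0.01, 1.0, 200.0]                # hypothetical upper bounds
design = qmc.scale(unit_samples, lower, upper)  # 30 x 4 design matrix covering the space
```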

In contrast, Bayesian Optimization (BO) provides an adaptive paradigm to sample the design space more efficiently for identifying the global optimum. In particular, a prior response surface model of the objective is prescribed and then sequentially refined as data are observed, with new samples selected via an acquisition function14. One essential advantage of using BO for materials design is the emergence of various materials databases, such as the polymer nanocomposites data resource NanoMine15,16 and the Open Quantum Materials Database (OQMD) that stores high-throughput DFT data17. These databases provide valuable low-cost existing knowledge as a "starting point" for Bayesian inference to guide the rapid exploration of novel material designs. Recent years have seen a number of applications of BO in materials design, such as the optimization of the synthesis process of short polymer fibers18, adaptive optimization of the elastic modulus of the M2AX phase19, and the prediction of crystal structures20. Nevertheless, these materials design applications of BO are all limited to quantitative design variables, such as the constriction angle, channel width and solvent speed18, the s-, p-, and d-orbital radii of atoms19, and structure descriptors20.

In real applications, most materials design scenarios involve qualitative or categorical factors, such as compositions (selection of material types) and particle surface treatment conditions (e.g., octyldimethylmethoxysilane and aminopropyldimethylethoxysilane) in nanodielectrics7. The challenge of including qualitative design factors within BO lies in response surface modeling: qualitative factor levels lack the direct ordering and distance metric that are inherent to quantitative design variables. Gaussian process (GP) models, a.k.a. kriging models, have become the most popular method for modeling simulation response surfaces21,22,23 and thus are widely employed in BO frameworks, because of their flexibility to capture complex nonlinear response surfaces as well as to quantify prediction uncertainty. However, these standard GP models are only applicable to quantitative inputs (i.e., quantitative design variables). Most state-of-the-art BO implementations use 0/1 dummy variables to represent qualitative input levels and then fit standard GP models with quantitative inputs, which is essentially equivalent to fitting a multi-response GP model in which a different response surface (over the quantitative inputs) is assumed for each combination of qualitative factor levels. This has been shown to be theoretically restrictive and incapable of capturing complex correlations between qualitative levels when the number of levels is large24,25,26.
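To make the dummy-variable representation concrete, the sketch below shows how a qualitative factor with three levels would typically be encoded before fitting a standard quantitative-input GP; the encoding is generic and illustrative, not tied to any particular BO package.

```python
# Sketch: representing a qualitative factor with 3 levels as 0/1 dummy
# variables so that a standard quantitative-input GP can be fitted.
import numpy as np

def one_hot(level, n_levels):
    """Map a level index in {0, ..., n_levels-1} to a 0/1 dummy vector."""
    v = np.zeros(n_levels)
    v[level] = 1.0
    return v

x_quant = np.array([0.3, 0.7])   # two quantitative inputs
t_level = 2                      # qualitative factor at its 3rd level (0-based)
gp_input = np.concatenate([x_quant, one_hot(t_level, 3)])
# gp_input = [0.3, 0.7, 0.0, 0.0, 1.0]; each level combination effectively
# receives its own response surface over the quantitative inputs.
```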

To overcome the aforementioned limitations, we propose a BO framework for data-driven materials design with mixed qualitative and quantitative variables (Fig. 2). The proposed framework is built upon a novel latent variable GP (LVGP) modeling approach that we recently developed for creating response surfaces with both qualitative and quantitative inputs27. The key idea of the LVGP approach is to map the qualitative factors into low-dimensional quantitative latent variable (LV) representations. Our study27 showed that LVGP modeling dramatically outperforms existing GP models with qualitative input factors in terms of predictive root mean squared error (RMSE). In addition to empirical evidence of far better predictive performance across a variety of examples, the LVGP approach has strong physical justification in that the effects of any qualitative factor on a quantitative response must always be due to some underlying quantitative physical input variables (otherwise, one could not code the physics of the simulation model)27. Noting that the underlying physical variables may be extremely high-dimensional (which is why they are treated as a qualitative factor in the first place), the LVGP mapping serves as a low-dimensional LV surrogate for the high-dimensional physical variables that captures their collective effect on the response. Sufficient dimension reduction arguments28,29 provide justification for this, since they imply that the collective effects of these high-dimensional underlying physical variables can usually be represented approximately as a function of manifold coordinates over some low-dimensional manifold in the high-dimensional space. We elaborate on this later. In addition to outstanding predictive performance, this LV mapping provides an inherent ordering and structure for the levels of the qualitative factor(s), such as the types of material constituents and processing types, which can provide substantial insight into their influence on the material properties/performance. In this manner, the LVGP approach can model a large number of qualitative levels with a relatively small number of parameters, which improves prediction while maintaining low computational cost. Moreover, in contrast to existing methods for handling qualitative factors24,25,26,30, LVGP is compatible with any standard GP correlation function for quantitative inputs, including nonseparable correlation functions such as power exponential, Matérn and lifted Brownian.

Figure 2

Bayesian optimization framework for data-driven materials design.

In this study, we integrate the LVGP approach with a BO framework (we call it LVGP-BO) for data-driven materials design that involves both qualitative and quantitative materials design variables. We examine the LVGP-BO approach over a variety of mathematical and real materials design examples and demonstrate its superior optimization performance over other state-of-the-art methods.

Results

We present two mathematical examples and two materials design examples to demonstrate the efficacy of the proposed LVGP-BO approach for mixed-variable problems and the physical insights it brings. The proposed method is compared to the state-of-the-art BO implementation bayesopt in MATLAB31, which represents qualitative factors as dummy variables and then fits a standard GP model with quantitative inputs26. bayesopt is chosen as the only benchmark for two reasons: (1) most existing GP-based BO packages that support qualitative inputs (e.g., skopt, a popular Python library used to optimize hyperparameters of machine learning models) treat qualitative inputs as dummy variables and are fundamentally similar to the approach exploited by bayesopt; (2) the implementation of our approach relies on MATLAB's optimization solver engine, so MATLAB's built-in bayesopt implementation was chosen as the benchmark to reduce platform and solver engine differences. When comparing the two methods, we start with the same initial dataset and stop after the maximum allowed number of iterations. The same procedure is repeated multiple times to assess the statistical stability of both methods. In both cases, expected improvement (EI)32 is used as the acquisition function. We denote the LVGP-BO result as "LV-EI" and the bayesopt result as "MC-EI" because the latter's GP model is equivalent to the multiplicative covariance (MC) model in the literature24,26.

Materials design examples

High-performance light absorbing quasi-random solar cell

We first present a solar cell design problem in which the light-scattering structure pattern and the material selection are optimized simultaneously, in contrast to the original design problem that only considered tuning the scattering pattern represented by quantitative variables2,33. Fig. 3 shows the setup: the middle layer with the quasi-random structure is the light-trapping layer of thickness t1, patterned on the amorphous silicon (a-Si) based absorbing layer with a total thickness of t. The bottom silver layer prevents light from escaping on the back side. The light-trapping structure is optimized to maximize light absorption at 650 nm by balancing the competing processes of light reflection and scattering. Besides designing the light-trapping layer's pattern, we also consider the choices of materials. There are five types of a-Si with different refractive indices, and on top of the light-trapping layer is a 70-nm-thick anti-reflection coating (ARC), which can be chosen from five different materials. Table 1 lists the refractive indices (n) of the five a-Si's and the five ARCs, and the extinction coefficients (k) of the a-Si's. Many existing materials systems involve such concurrent materials-selection and structure-optimization decisions, as illustrated in this example.

Figure 3

Setup of the solar cell design problem: the middle layer is the quasi-random scattering pattern with thickness t1, directly etched on top of the a-Si substrate of total thickness t = 600 nm. The top layer in light blue is an ARC to reduce reflection of incident light. We use the spectral density function (SDF) based approach to generate the quasi-random scattering pattern.

Table 1 Refractive index for the candidate materials.

Rigorous coupled wave analysis (RCWA)34,35 is employed to evaluate the light absorption coefficients of the reconstructed structures. RCWA is a Fourier-domain-based algorithm that can solve scattering problems for both periodic and aperiodic structures. The length of the unit cell for the RCWA calculation is set at 2000 nm. The light-trapping pattern is represented using the spectral density function (SDF) method we proposed in33, which significantly reduces the dimensionality of quasi-random microstructures. In this case, we use a uniform SDF defined by its left end a and width b. These two parameters, along with the space filling ratio ρ, are the three design variables that describe the light-trapping layer's pattern. The overall thickness t of the a-Si layer is fixed at 600 nm, but the thickness of the light-trapping layer t1 is adjustable in this example. The feasible ranges of all design variables, both quantitative and qualitative, are listed in Table 2.

Table 2 Range of design variables in the solar cell design problem.
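As a rough illustration of this representation, the following Python sketch evaluates a uniform SDF with left end a and width b, normalized to unit area for illustration; the actual SDF-to-microstructure reconstruction pipeline follows ref. 33 and is not reproduced here.

```python
# Sketch of the uniform spectral density function (SDF) used to encode the
# light-trapping pattern: nonzero only on the spatial-frequency band [a, a + b].
import numpy as np

def uniform_sdf(k, a, b):
    """Uniform SDF with left end a and width b (spatial frequency k in 1/nm)."""
    return np.where((k >= a) & (k <= a + b), 1.0 / b, 0.0)

k = np.linspace(0.0, 0.02, 500)            # spatial-frequency axis, 1/nm
sdf = uniform_sdf(k, a=0.0049, b=0.0035)   # the optimal values reported below
```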

In this design study, both the LV-EI and MC-EI methods start with 30 random initial samples, where the quantitative variables \(\{{t}_{1},\rho ,a,b\}\) are generated by a maximin Latin hypercube design (LHD) and the two material-type variables, a-Si type and ARC type, are sampled uniformly. Both methods are terminated after the maximum allowed 100 iterations, and the same procedure is repeated for 20 replicates. Because there are uncertainties associated with the microstructure reconstruction process when generating the quasi-random scattering pattern from the SDF, for each fixed set of {ρ, a, b} we generate three statistically equivalent microstructures and take the average of their light absorptions simulated in RCWA as the corresponding response. The acquisition function used in this case is EI with the "plug-in" \({\mu }_{min}({\boldsymbol{x}})\), which is introduced in the Methods section for the noisy response scenario; we use "EI" in our notation for simplicity. The left panel of Fig. 4 displays the optimization convergence history of both methods over 100 iterations. The LV-EI method converged significantly faster and also consistently achieved better solutions than MC-EI. The best solution found possesses a light absorption coefficient of 0.94, with optimized quantitative design variables \(\{{a}^{\ast }=0.0049\,n{m}^{-1},\,{b}^{\ast }=0.0035\,n{m}^{-1},\,{\rho }^{\ast }=0.875,\,{t}_{1}^{\ast }=100.95\,nm\}\) and optimal material choices of ARC type 5 and a-Si type 3. The right panel shows three random scattering structures generated using the optimal design variables.
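The initial mixed-variable design could be generated along the lines of the sketch below, where scipy's discrepancy-optimized LHS stands in for the maximin LHD used in the study and the variable bounds are placeholders for the actual ranges in Table 2.

```python
# Sketch: 30 initial mixed-variable samples; quantitative variables from an
# optimized LHS, qualitative material types drawn uniformly at random.
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(0)
lhs = qmc.LatinHypercube(d=4, optimization="random-cd", seed=0)
x = qmc.scale(lhs.random(30),
              l_bounds=[0.0, 0.0, 0.2, 50.0],     # placeholder bounds for a, b, rho, t1
              u_bounds=[0.01, 0.01, 0.9, 200.0])  # see Table 2 for the actual ranges
asi_type = rng.integers(1, 6, size=30)            # a-Si type: 5 levels, uniform
arc_type = rng.integers(1, 6, size=30)            # ARC type: 5 levels, uniform
```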

Figure 4

Optimization results of the light-absorbing solar cell: the left panel displays the convergence plots of the LV-EI and MC-EI methods, with medians and median absolute deviations. LV-EI outperforms MC-EI by consistently achieving better solutions with lower uncertainty. The right panel shows three quasi-random patterns generated from the optimal design found by the LV-EI method. The top 3D structure is based on the first 2D pattern.

The LVGP model represents the correlation between qualitative levels by distances in the 2D latent variable space: larger distances mean greater difference and lower correlation. In this test case, the estimated latent variables of the ARC type and a-Si type are illustrated in Fig. 5, which shows that the five ARC materials are positioned approximately on a straight line in the sequence 1-2-3-4-5, consistent with the differences in their refractive indices listed in Table 1. This indicates that the refractive index is the dominant characteristic of the ARC design factor, compared to other simulation inputs, in terms of its effect on the properties. The relationships between the five a-Si's are not straightforward at first glance, because n and k have coupled effects on the overall performance: the real part n quantifies refraction and the imaginary part k represents the loss of flux intensity in the medium, so higher n and lower k are preferred for better absorption. As shown in the LV plot (Fig. 5), the type 3 a-Si has very different effects on the response compared to the other four types. This can potentially be explained by the physical parameters presented in Table 1, where a-Si type 3 has both moderate n and moderate k, leading to the best combination of refraction and loss for achieving higher absorption than the other four types of a-Si.

Figure 5

Estimated latent variables of the two qualitative factors representing different materials: the LVs of the ARC types line up approximately on a straight line in the order 1-2-3-4-5, consistent with their refractive indices; in the LV plot of a-Si, type 3 is far from the other four, indicating a distinct influence on the response compared to the others.

Combinatorial search of hybrid organic-inorganic perovskite (HOIP)

We present here another materials design example adopted from the literature, a combinatorial search of hybrid organic-inorganic perovskites (HOIPs)36, in which all design variables are qualitative. Such problems are quite common in materials design with an emphasis on the design of material constituents. HOIPs are an exciting class of new materials that exhibit extremely promising photovoltaic (PV) properties. The goal is to search the perovskite compositional space for an ABX3 combination with optimal intermolecular binding energy to a solvent molecule S0. The A-site cation has three candidates {MA, FA, Cs}; the B-site in an HOIP is often occupied by metal cations and in this case is fixed to be Pb. X denotes the halide, which has three options {Cl, Br, I}, and in this design scenario mixed halides are allowed, meaning the three X's in ABX3 can differ. There are eight different solvents S0 to explore, and the binding energies between ABX3 and S0 are obtained from DFT calculations.

We chose this problem, with only qualitative variables, for two reasons: (1) GP models generally struggle to handle qualitative variables, so a pure combinatorial search is a challenging task for GP-based Bayesian optimization, and we wanted to test our approach's performance in this setting. (2) The use of only qualitative variables is not necessarily the best formulation for this specific design problem. Other formulations (e.g., using domain knowledge to create quantitative features as surrogates for the qualitative variables suspected to have a large influence on the response) might have been better for this specific application, but we want to emphasize the generality of our approach and that it can be used without such domain knowledge.

There are five qualitative design variables: the first denotes the choice of the A-site cation (three levels), three other variables indicate the selection of each of the three halides in the configuration (three levels each), and the fifth variable represents the type of solvent (eight levels). Out of all the possible combinations, 240 are stable, and their binding energies are precomputed by DFT. The code and dataset are made available by the authors of36 at https://github.com/clancyLab/NCM2018, where the DFT calculations were done through the Physical Analytics Pipeline (PAL) developed by the Clancy Research Group at Cornell University36. Fig. 6(a) shows the distribution of the binding energies of the 240 samples: the majority of the samples have binding energies above −30, and the best solution has a binding energy of around −41.3. This ground truth was determined via exhaustive simulation, which the mixed-variable BO algorithm seeks to avoid; efficiency is assessed by whether a method can identify the true optimal solution among the 240 combinations using as few evaluations as possible.
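For concreteness, the full nominal design space can be enumerated as follows (the solvent labels S1-S8 are placeholders for the eight solvents in the dataset):

```python
# Sketch of the HOIP combinatorial design space: 3 x 3 x 3 x 3 x 8 = 648
# nominal combinations, of which 240 are stable with precomputed DFT energies.
from itertools import product

A_sites  = ["MA", "FA", "Cs"]
halides  = ["Cl", "Br", "I"]               # mixed halides allowed: X1, X2, X3 may differ
solvents = [f"S{i}" for i in range(1, 9)]  # 8 candidate solvents

design_space = list(product(A_sites, halides, halides, halides, solvents))
print(len(design_space))  # 648 combinations before filtering for stability
```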

Figure 6

(a) Histogram of the simulated binding energies from DFT; (b) optimization convergence plot of the LV-EI and MC-EI methods with 10 initial random samples, with the median and median absolute deviation plotted at each iteration; (c) negative binding energy distribution by solvent type; (d) estimated latent variables for the solvent types.

When comparing the LV-EI and MC-EI methods, we repeat our tests for 30 replicates; in each replicate, both methods start with the same set of 10 initial random samples chosen from the samples with binding energies above −30 (to make the optimization more challenging), and the optimization is terminated after another 50 iterations. From Fig. 6b, we note that both methods work well in this combinatorial search problem: the objective function drops quickly within the first ten iterations, and in most of the replicates both methods find the global optimum. Nevertheless, the proposed LV-EI method outperforms MC-EI with smaller variance and quicker convergence. Moreover, the LV-EI method is more robust: the boxplot in the inset shows that LV-EI found the exact best solution in 28 of the 30 replicates, whereas MC-EI did so in only 22 of 30.

To better understand the fitted LVGP model, we visualize the estimated LVs for solvent type in Fig. 6d: solvent types 1 and 7 are located far from the other six solvents, indicating that their effects on the binding energy may be distinct from the others. To validate this finding, we analyze the distribution of the binding energies of the 240 samples in the dataset by singling out solvent types 1 and 7 in Fig. 6c. All the samples with strong binding (negative binding energy greater than 30, i.e., binding energy below −30) are formed with solvent types 1 and 7, while the other six solvent types result in much weaker binding, consistent with our interpretation of the estimated LVs in Fig. 6d. Our LVGP model successfully found that THTO is a superior solvent for dissolving ABX3. Using conventional approaches without latent space estimation, it would be difficult to draw such design insights from the raw dataset. Even without any knowledge or description of the eight different types of solvents, we can tell how different solvents 1 and 7 are from the other six simply from their locations in the latent space (Fig. 6d). In conclusion, the proposed mixed-variable LVGP-based Bayesian optimization approach effectively searches a combinatorial design space, efficiently identifies the global optimal solution, and provides additional design insights through the LV representation of qualitative factors.

Mathematical examples

Two mathematical examples are used to illustrate the capability of LVGP-BO to find the global optimum for mixed-variable problems.

Branin function

The first mathematical function considered is the Branin-Hoo function37, which originally has two quantitative variables \({x}_{1}\) and \({x}_{2}\); in this example, we convert \({x}_{2}\) into a qualitative variable for testing purposes:

$$f({x}_{1},{x}_{2})={\left({x}_{2}-\frac{5.1}{4{\pi }^{2}}{{x}_{1}}^{2}+\frac{5}{\pi }{x}_{1}-6\right)}^{2}+10\left(1-\frac{1}{8\pi }\right)\cos ({x}_{1})+10,$$

where \({x}_{1}\in [\,-\,5,10]\) and \({x}_{2}\) is qualitative, with four levels corresponding to the values \(\{0,\,5,10,\,15\}\). As plotted in Fig. 7a, this function has six local minima and a global minimum \(f(-2.6,10)\approx 2.79118\). We can also observe that levels 1 and 2 of x2 are closely correlated and levels 3 and 4 are closely correlated, while levels 1 and 3 are quite different from each other. Both methods start with 10 random initial points and continue sampling for another 30 iterations. Fig. 7c shows the convergence history of the LV-EI and MC-EI methods, from which we see that LV-EI converges much faster than MC-EI, and after 30 iterations LV-EI achieves a more accurate solution, as illustrated in the inset of Fig. 7c.
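For reference, a direct Python implementation of this mixed-variable Branin function, with the four-level mapping given above:

```python
# Mixed-variable Branin test function: x1 is continuous on [-5, 10] and the
# qualitative x2 takes one of four levels mapped to the values {0, 5, 10, 15}.
import numpy as np

X2_LEVELS = {1: 0.0, 2: 5.0, 3: 10.0, 4: 15.0}

def branin_mixed(x1, level):
    x2 = X2_LEVELS[level]
    return ((x2 - 5.1 / (4 * np.pi**2) * x1**2 + 5 / np.pi * x1 - 6) ** 2
            + 10 * (1 - 1 / (8 * np.pi)) * np.cos(x1) + 10)

print(branin_mixed(-2.6, 3))  # ~2.794, close to the reported minimum 2.79118
```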

Figure 7

(a) Plot of the Branin function and (b) plot of the Goldstein-Price function; both problems have multiple local minima, and the y-axes are in log scale to better visualize the local minima. (c) Optimization history of 30 replicates for the Branin example, starting with 10 initial sample points; (d) optimization history of 30 replicates for the Goldstein-Price example, with 20 initial sample points, y-axis in log scale. The middle lines represent the median values of the 30 replicates at each iteration, and the shaded bounds represent the median +/− the median absolute deviation (MAD). Insets in (c) and (d) display boxplots of the minimum obtained after 30 iterations.

Goldstein-Price function

The second mathematical example is the Goldstein-Price function37, which also has two input variables \({x}_{1}\) and \({x}_{2}\), where \({x}_{2}\) is made qualitative:

$$f({x}_{1},{x}_{2})=[1+{({x}_{1}+{x}_{2}+1)}^{2}(19-14{x}_{1}+3{{x}_{1}}^{2}-14{x}_{2}+6{x}_{1}{x}_{2}+3{{x}_{2}}^{2})]\times [30+{(2{x}_{1}-3{x}_{2})}^{2}(18-32{x}_{1}+12{{x}_{1}}^{2}+48{x}_{2}-36{x}_{1}{x}_{2}+27{{x}_{2}}^{2})],$$

where \({x}_{1}\in [-2,2]\) and \({x}_{2}\) is qualitative, with five levels corresponding to the values \(\{-2,\,-1,\,0,\,1,\,2\}\). The global minimum is \({f}^{\ast }(0,\,-\,1)=3\). In this example, both methods start with 20 random initial points and continue sampling for another 30 iterations. Fig. 7d shows the convergence history of the LV-EI and MC-EI methods, from which we see that LV-EI converges much faster than MC-EI, consistently across the 30 replicates. The LV-EI method is also more robust, as it has a smaller variance than MC-EI. From the inset boxplot, we also note that after 30 iterations, most of the 30 replicates of LV-EI are very close to the true global minimum of 3, while MC-EI leaves a much larger gap to the true solution. The two numerical tests shown here illustrate that our proposed LVGP-based BO approach is an effective method for global optimization of mixed-variable problems.
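A direct Python implementation of the mixed-variable Goldstein-Price function, with the five-level mapping given above:

```python
# Mixed-variable Goldstein-Price test function: x1 is continuous on [-2, 2]
# and the qualitative x2 takes one of five levels in {-2, -1, 0, 1, 2}.
X2_LEVELS = {1: -2.0, 2: -1.0, 3: 0.0, 4: 1.0, 5: 2.0}

def goldstein_price_mixed(x1, level):
    x2 = X2_LEVELS[level]
    term1 = 1 + (x1 + x2 + 1) ** 2 * (
        19 - 14 * x1 + 3 * x1**2 - 14 * x2 + 6 * x1 * x2 + 3 * x2**2)
    term2 = 30 + (2 * x1 - 3 * x2) ** 2 * (
        18 - 32 * x1 + 12 * x1**2 + 48 * x2 - 36 * x1 * x2 + 27 * x2**2)
    return term1 * term2

print(goldstein_price_mixed(0.0, 2))  # 3.0, the global minimum f(0, -1) = 3
```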

Discussion

In this paper, we integrate a novel latent variable approach to GP modeling into Bayesian Optimization to support a variety of materials design applications with mixed qualitative and quantitative design variables. The LVGP approach not only provides superior model predictive accuracy compared to existing GP models, it also offers meaningful visualization of the correlation between qualitative levels (Figs. 5 and 6d). The proposed Bayesian Optimization approach is especially useful for materials design as it can fully utilize existing materials databases and sequentially explore the unknown design space via Bayesian inference and "on-demand" simulations/experiments. Our proposed method overcomes the challenge of using BO for problems with qualitative input factors and has achieved superior performance compared to the state-of-the-art MATLAB BO implementation using the dummy variable representation for qualitative factors, as demonstrated through both mathematical and real materials design examples (Figs. 4, 6b and 7c,d). For materials design, we successfully utilized the proposed BO framework to improve the light absorption of the quasi-random solar cell by designing the microstructure and selecting material constituents simultaneously. Moreover, we showed that our method is efficient and effective for optimizing the material constituents of a hybrid organic-inorganic perovskite, a more challenging combinatorial search problem.

While this paper is focused on design of new materials and materials systems, the method presented is generic and can be used for other challenging engineering optimization problems where qualitative and quantitative design variables co-exist. The current BO framework will be further extended for multi-objective problems and the inclusion of physical constraints by refining the sampling strategy in association with the acquisition function.

Methods

Latent variable Gaussian process (LVGP) for both qualitative and quantitative factors

We first briefly review the technical details of the standard GP model for quantitative variables, and then describe our novel LVGP model that handles qualitative factors. To make the discussion more concrete, let \(y(\,\cdot \,)\) denote the true physical response surface model with inputs \({\boldsymbol{w}}=({\boldsymbol{x}},{\boldsymbol{t}})\), where \({\boldsymbol{x}}=({x}_{1},\ldots ,{x}_{p})\in {{\mathbb{R}}}^{p}\) represents the \(p\) quantitative variables and \({\boldsymbol{t}}=({t}_{1},\ldots ,{t}_{q})\in \{1,2,\ldots ,{m}_{1}\}\times \{1,2,\ldots ,{m}_{2}\}\times \ldots \times \{1,2,\ldots ,{m}_{q}\}\) represents the \(q\) qualitative factors, with the jth qualitative factor \({t}_{j}\) having \({m}_{j}\) levels coded (without loss of generality) as \(\{1,2,\ldots ,{m}_{j}\}\). A GP model with only quantitative input variables is commonly assumed to be of the form:

$$y({\boldsymbol{x}})=\mu +Z({\boldsymbol{x}}),$$
(1)

where \(\mu \) is a constant prior mean term, Z(·) is a zero-mean Gaussian process with stationary covariance function K(·,·) = σ2R(·,·), σ2 is the prior variance, and R(·,·) = R(·,·|ϕ) denotes the correlation function with parameters ϕ. A Gaussian correlation function is commonly used:

$$R({\boldsymbol{x}},\,{\boldsymbol{x}}{\prime} )=\exp \{-\mathop{\sum }\limits_{i=1}^{p}{\phi }_{i}{({x}_{i}-{x}_{i}^{{\prime} })}^{2}\},$$
(2)

which represents the correlation between \(Z({\boldsymbol{x}})\) and \(Z({\boldsymbol{x}}{\prime} )\) for any two input locations \({\boldsymbol{x}}=({x}_{1},\ldots ,\,{x}_{p})\) and \({\boldsymbol{x}}{\prime} =({x}_{1}{\prime} ,\,\ldots ,\,{x}_{p}{\prime} )\), where \({\boldsymbol{\phi }}={({\phi }_{1},\ldots ,{\phi }_{p})}^{T}\) is the vector of correlation parameters to be estimated via maximum likelihood estimation (MLE), along with \(\mu \) and \({\sigma }^{2}\). The correlation between \(y({\boldsymbol{x}})\) and \(y({\boldsymbol{x}}{\prime} )\) depends on the spatial distance between \({\boldsymbol{x}}\) and \({\boldsymbol{x}}{\prime} \) and on the correlation parameters. Other choices of correlation functions include power exponential, Matérn38 and lifted Brownian39. These types of correlation functions cannot be directly applied to problems with qualitative factors, because distances between the levels of qualitative factors are not defined and the levels have no natural ordering, since \({\boldsymbol{t}}\) are assumed to be nominal categorical factors.
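For concreteness, a minimal numpy sketch of the Gaussian correlation function in Eq. (2):

```python
# Sketch of the Gaussian correlation function of Eq. (2) for quantitative inputs.
import numpy as np

def gauss_corr(x, x_prime, phi):
    """R(x, x') = exp(-sum_i phi_i * (x_i - x'_i)^2)."""
    x, x_prime, phi = map(np.asarray, (x, x_prime, phi))
    return np.exp(-np.sum(phi * (x - x_prime) ** 2))

print(gauss_corr([0.1, 0.5], [0.2, 0.4], phi=[1.0, 2.0]))  # nearby points -> corr near 1
```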

The LVGP approach27 provides a natural and convenient way to handle qualitative input variables by mapping the levels of each qualitative factor \(t\) to a 2-dimensional (2D) continuous latent space. This has strong physical justification, which may explain the outstanding predictive performance of the LVGP approach27: for any real physical system with a qualitative input factor \(t\), there are always underlying quantitative physical variables \(\{{v}_{1},{v}_{2},\ldots \}=\{{v}_{1}(t),{v}_{2}(t),\ldots \}\) (perhaps very high-dimensional, poorly understood, and difficult to treat individually as quantitative input variables) that account for the differences between the responses at different levels of the qualitative factor. As shown in Fig. 8, the three qualitative levels are associated with points in a high-dimensional space of \(\{{v}_{1},{v}_{2},\ldots \}\), and the distances between the three points indicate the differences between the three levels. Our LVGP model uses a low-dimensional (2D) representation \({\boldsymbol{z}}(t)=g({v}_{1}(t),{v}_{2}(t),\ldots )\) to approximate the actual distances between the three levels. This implicitly makes the rather mild assumption that the collective effect of \(\{{v}_{1}(t),{v}_{2}(t),\ldots \}\) on \(y\), as \(t\) varies across its levels, depends predominantly on some low-dimensional combination \({\boldsymbol{z}}(t)=g({v}_{1}(t),{v}_{2}(t),\ldots )\) of the underlying high-dimensional variables (reminiscent of sufficient dimension reduction arguments28,29). In many applications, a two-dimensional representation \({\boldsymbol{z}}(t)\) suffices to approximate the high-dimensional data.

Figure 8

For a single factor t with three levels, depiction of the mapping from the true high-dimensional underlying quantitative variables to the 2D latent variables.

To make these arguments more concrete, first consider the following beam bending example27, in which the qualitative factor \(t\) corresponds to six different beam cross-sectional shapes. The underlying numerical variables {\({v}_{1}(t),\,{v}_{2}(t),\,{v}_{3}(t),\,\ldots \)} for cross-section type \(t\) would be the complete cross-sectional geometric positions of all the elements in the finite element mesh of the beam. The physics of this beam bending problem is so well studied that we know the beam deflection depends on {\({v}_{1}(t),\,{v}_{2}(t),\,{v}_{3}(t),\,\ldots \)} via the moment of inertia \(I=I(t)=I({v}_{1}(t),\,{v}_{2}(t),\,{v}_{3}(t),\,\ldots )\). Therefore, the impact of the underlying high-dimensional numerical variables on the response can be reduced to a single numerical variable \(I(t)\). Reference27 demonstrated that the \({\boldsymbol{z}}(t)\) estimated by the LVGP approach was a direct representation of \(I(t)\). Second, consider the case where neither definitive prior knowledge nor a simulator with transparent mechanisms is available and the actual underlying quantitative space is high-dimensional; an example is modeling the shear modulus (the response \(y\), simulated by DFT) of material compounds belonging to the family of M2AX phases27. The M, A and X atoms have ten, two and twelve candidate choices, respectively. Our study showed that the LVGP model using three qualitative factors that represent the atom types with 2D latent representations significantly outperformed the GP model with seven quantitative atom descriptors that were believed to have a large impact on the compound shear modulus. In this case, even though the actual underlying numerical space (formed by various properties of each atom) is very high-dimensional, the LVGP model still achieved superior modeling results with a low-dimensional latent representation.

In the relatively rare situations in which the impact of a qualitative variable cannot be represented in a two-dimensional space, the latent variable component \({{\boldsymbol{z}}}^{j}({t}_{j})=({z}_{1}^{j}({t}_{j}),\,{z}_{2}^{j}({t}_{j}))\) in (3) can easily be extended to three or more dimensions \({{\boldsymbol{z}}}^{j}({t}_{j})=({z}_{1}^{j}({t}_{j}),\,\ldots ,\,{z}_{di{m}_{z}}^{j}({t}_{j}))\) to increase the model's flexibility. In our published R package "LVGP"40, the dimension of the latent variable space, dim_z, is treated as a tunable model hyperparameter, so users have the flexibility to choose a higher-dimensional latent space if needed.

When there are multiple qualitative factors, let \({{\boldsymbol{z}}}^{j}({t}_{j})=({z}_{1}^{j}({t}_{j}),\,{z}_{2}^{j}({t}_{j}))\) denote the 2D mapped LV for the qualitative factor \({t}_{j}\). For a Gaussian correlation function, the LVGP approach then assumes the correlation function

$$cor(y({\boldsymbol{x}},{\boldsymbol{t}}=({t}_{1},\ldots ,{t}_{q})),y({\boldsymbol{x}}{\prime} ,{\boldsymbol{t}}{\prime} =({t}_{1}^{{\prime} },\ldots ,{t}_{q}^{{\prime} })))=\exp \{-\mathop{\sum }\limits_{i=1}^{p}{\phi }_{i}{({x}_{i}-{x}_{i}^{{\prime} })}^{2}-\mathop{\sum }\limits_{j=1}^{q}{\Vert {{\boldsymbol{z}}}^{j}({t}_{j})-{{\boldsymbol{z}}}^{j}({t}_{j}^{{\prime} })\Vert }_{2}^{2}\},$$
(3)

where \(||\,\cdot \,||\) denotes the \({L}_{2}\) norm. Note that the 2D mapped LVs \({{\boldsymbol{z}}}^{j}({t}_{j})=({z}_{1}^{j}({t}_{j}),\,{z}_{2}^{j}({t}_{j}))\) are unknown and will be estimated along with \(\mu \), \({\sigma }^{2}\) and ϕ through MLE. Under model (3), the log-likelihood function is:

$$l(\mu ,{\sigma }^{2},{\boldsymbol{\phi }},{\bf{Z}})=-\frac{n}{2}\,\mathrm{ln}(2\pi {\sigma }^{2})-\frac{1}{2}\,\mathrm{ln}|{\bf{R}}({\boldsymbol{\phi }},{\bf{Z}})|-\frac{1}{2{\sigma }^{2}}{({\bf{y}}-\mu {\bf{1}})}^{T}{\bf{R}}{({\boldsymbol{\phi }},{\bf{Z}})}^{-1}({\boldsymbol{y}}-\mu {\bf{1}}),$$
(4)

where \(n\) is the sample size, 1 is an n-by-1 vector of ones, y is the n-by-1 vector of observed response values, \({\bf{Z}}=({{\boldsymbol{Z}}}^{1},\ldots ,{{\boldsymbol{Z}}}^{q})\), \({{\boldsymbol{Z}}}^{j}=({{\boldsymbol{z}}}^{j}(1),\ldots ,{{\boldsymbol{z}}}^{j}({m}_{j}))\) represents the values of the LV corresponding to the \({m}_{j}\) levels of the qualitative variable \({t}_{j}\), and \({\bf{R}}={\bf{R}}({\boldsymbol{\phi }},{\bf{Z}})\) is the \(n\)-by-\(n\) correlation matrix whose elements are obtained by plugging pairs of the \(n\) sample values of \(({\boldsymbol{x}},\,t)\) into (3). The MLE of \(\mu \) and \({\sigma }^{2}\) in (4) can be represented in terms of the correlation matrix:

$$\hat{\mu }={({1}^{T}{{\bf{R}}}^{-1}{\bf{1}})}^{-1}{{\bf{1}}}^{T}{{\bf{R}}}^{-1}{\bf{y}}$$
$${\hat{\sigma }}^{2}=\frac{1}{n}{({\bf{y}}-\hat{\mu }{\bf{1}})}^{T}{{\bf{R}}}^{-1}({\bf{y}}-\hat{\mu }{\bf{1}}).$$
(5)

Substituting the above into (4) and neglecting constants, the log-likelihood function becomes:

$$l({\boldsymbol{\phi }},{\bf{Z}}) \sim -\,n\,\mathrm{ln}({\hat{\sigma }}^{2})-\,\mathrm{ln}|{\bf{R}}({\boldsymbol{\phi }},{\bf{Z}})|,$$
(6)

which is maximized over the correlation matrix \({\bf{R}}\) that depends on the correlation parameters ϕ and the values of the mapped latent variables in \({\bf{Z}}\).
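To summarize the estimation procedure, the following numpy sketch assembles the LVGP correlation of Eq. (3) and evaluates the negative profile log-likelihood of Eqs. (5) and (6); it is a bare-bones illustration (direct matrix inversion instead of a Cholesky solve, no hyperparameter optimization loop, 0-based level indexing), not the implementation in our R package "LVGP"40.

```python
# Minimal sketch of Eqs. (3), (5) and (6): LVGP correlation over mixed inputs
# and the profile log-likelihood after mu and sigma^2 are concentrated out.
import numpy as np

def lvgp_corr(x, xp, t, tp, phi, Z):
    """Eq. (3): x, xp quantitative inputs; t, tp 0-based level indices;
    phi holds correlation parameters; Z[j] is the (m_j x 2) latent matrix."""
    quant = np.sum(np.asarray(phi) * (np.asarray(x) - np.asarray(xp)) ** 2)
    qual = sum(np.sum((Z[j][t[j]] - Z[j][tp[j]]) ** 2) for j in range(len(t)))
    return np.exp(-(quant + qual))

def neg_profile_loglik(X, T, y, phi, Z, nugget=0.0):
    """Negative of Eq. (6); minimize over (phi, Z) to fit the LVGP model."""
    n = len(y)
    R = np.array([[lvgp_corr(X[i], X[j], T[i], T[j], phi, Z) for j in range(n)]
                  for i in range(n)]) + nugget * np.eye(n)
    Ri = np.linalg.inv(R)                      # a Cholesky solve is preferable in practice
    ones = np.ones(n)
    mu = (ones @ Ri @ y) / (ones @ Ri @ ones)  # Eq. (5), closed-form MLE of mu
    resid = y - mu
    sigma2 = (resid @ Ri @ resid) / n          # Eq. (5), closed-form MLE of sigma^2
    _, logdet = np.linalg.slogdet(R)
    return n * np.log(sigma2) + logdet         # negative of Eq. (6)
```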

After the MLEs of ϕ and Z are obtained, the LVGP response prediction at any new point x* is:

$$\hat{y}({{\boldsymbol{x}}}^{\ast })=\hat{\mu }+{\boldsymbol{r}}({{\boldsymbol{x}}}^{\ast }){{\bf{R}}}^{-1}({\bf{y}}-\hat{\mu }{\bf{1}}),$$
(7)

where \({\boldsymbol{r}}({{\boldsymbol{x}}}^{\ast })=(R({{\boldsymbol{x}}}^{\ast },{{\boldsymbol{x}}}^{(1)}),\ldots ,R({{\boldsymbol{x}}}^{\ast },{{\boldsymbol{x}}}^{(n)}))\) is the vector of pairwise correlations between x* and the data points \({{\boldsymbol{x}}}^{(j)},\,j=1,\ldots ,n\). Moreover, to quantify the predictive uncertainty, the variance of the prediction error is:

$${\hat{s}}^{2}({{\boldsymbol{x}}}^{\ast })={\hat{\sigma }}^{2}(1-{\boldsymbol{r}}({{\boldsymbol{x}}}^{\ast }){{\bf{R}}}^{-1}{\boldsymbol{r}}{({{\boldsymbol{x}}}^{\ast })}^{T}).$$
(8)
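Given the fitted quantities, the prediction equations (7) and (8) amount to a few lines of linear algebra, sketched below under the assumption that mu_hat, sigma2_hat, the inverse correlation matrix Ri, and the correlation vector r_star have already been computed.

```python
# Sketch of Eqs. (7)-(8): LVGP mean prediction and predictive variance at a
# new point, given the fitted model quantities.
import numpy as np

def lvgp_predict(r_star, Ri, y, mu_hat, sigma2_hat):
    y_hat = mu_hat + r_star @ Ri @ (y - mu_hat)     # Eq. (7), mean prediction
    s2 = sigma2_hat * (1.0 - r_star @ Ri @ r_star)  # Eq. (8), predictive variance
    return y_hat, max(s2, 0.0)                      # clip tiny negative round-off
```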

When the actual model is non-deterministic, we add an extra "nugget" parameter λ to each diagonal element of the correlation matrix R to account for noise in the response; λ is estimated along with ϕ and Z.

LVGP-BO framework for materials design

Our proposed Bayesian Optimization framework for data-driven materials design consists of four major steps (as shown in Fig. 2): (1) Step 1 creates a materials dataset (physical and/or computational data) based on information gathered from the literature, lab experiments and simulations; (2) Step 2 fits the LVGP model using the available dataset and provides uncertainty quantification of the model prediction based on the nature of the data; (3) Step 3 makes an inference about where to sample the next point (either physical or computer experiments) based on an acquisition function that balances sampling where the response appears to be optimized (exploitation) against sampling where the predictive uncertainty is high (exploration); and (4) Step 4 evaluates the chosen design point(s) to augment the materials database and update the metamodel. As this procedure iterates, sample points are sequentially added to refine the metamodel prediction and identify the global optimum.
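The four steps can be summarized in the following Python-style sketch, in which fit_lvgp, maximize_acquisition, and evaluate are user-supplied placeholders for the model fit, the EI maximization over the mixed design space, and the simulation/experiment, respectively.

```python
# Minimal sketch of the LVGP-BO loop (Steps 1-4 in Fig. 2); minimization.
def lvgp_bo(initial_data, fit_lvgp, maximize_acquisition, evaluate, max_iter=100):
    data = list(initial_data)                       # Step 1: existing materials dataset
    for _ in range(max_iter):
        model = fit_lvgp(data)                      # Step 2: fit LVGP, quantify uncertainty
        w_next = maximize_acquisition(model, data)  # Step 3: pick next (x, t) via EI
        y_next = evaluate(w_next)                   # Step 4: on-demand simulation/experiment
        data.append((w_next, y_next))               # augment the database and refit
    return min(data, key=lambda d: d[1])            # best (design, response) found
```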

There are two key components of the BO framework: a metamodel that provides predictions with uncertainty quantification, and a criterion that determines where to sample next. The LVGP model introduced in the previous section provides robust approximations of the actual response surface with mixed variable types, as well as the associated uncertainty quantification. The next critical step is to decide where to sample next based on inferences from the fitted model (Step 3, "on-demand" design exploration in Fig. 2), through a measure of the value of the information gained from sampling at a given point, known as an acquisition function. Three commonly used acquisition functions in the literature are expected improvement (EI)32, probability of improvement (PI), and lower confidence bound (LCB). For deterministic responses, EI is the most widely used and works well over a variety of problems, while for noisy responses, EI with a plug-in value and the knowledge gradient (KG) are appropriate choices37,41. Detailed benchmark studies of different acquisition functions are available in37.

EI balances exploitation and exploration, as elaborated in Fig. 9: a GP model is first fitted to the four sample points (Fig. 9a); the fitted mean prediction \(\hat{y}({\boldsymbol{x}})\) is the blue curve, while the actual response \(y({\boldsymbol{x}})\) is the red dashed curve. Instead of picking the minimizer of \(\hat{y}({\boldsymbol{x}})\) as the next sample point, EI also considers the model uncertainty. Mathematically, EI quantifies the possible improvement at a particular point by incorporating both \(\hat{y}({\boldsymbol{x}})\) and the associated uncertainty \({\hat{s}}^{2}({\boldsymbol{x}})\):

$$EI({\boldsymbol{x}})=E[\,{\rm{\max }}(0,\Delta ({\boldsymbol{x}}))]=\hat{s}({\boldsymbol{x}})\phi \left(\frac{\Delta ({\boldsymbol{x}})}{\hat{s}({\boldsymbol{x}})}\right)+\Delta ({\boldsymbol{x}})\Phi \left(\frac{\Delta ({\boldsymbol{x}})}{\hat{s}({\boldsymbol{x}})}\right),$$
(9)
Figure 9

GP model evolution using EI as the acquisition function: (a) initial GP model based on four data points, (b) EI of the initial GP model, (c) updated GP model after sampling one additional data point based on EI, (d) EI of the updated GP model.

where \(\hat{y}({\boldsymbol{x}})\) and \({y}_{{\rm{\min }}}\) are the mean prediction of the fitted GP model and the minimal value observed so far, \(\Delta ({\boldsymbol{x}})={y}_{min}-\hat{y}({\boldsymbol{x}})\), and \(\phi \) and \(\Phi \) are the probability density function (PDF) and cumulative distribution function (CDF) of the standard normal distribution. In practice, a trade-off parameter \(\xi \) can be included in \(\Delta ({\boldsymbol{x}})={y}_{min}-\hat{y}({\boldsymbol{x}})-\xi \) to balance exploration and exploitation, with higher \(\xi \) values leading to more exploration. Equation (9) is the special case with trade-off parameter \(\xi =0\) and is used for the EI implementation across all examples in this paper. The maximizer of EI in Fig. 9b is chosen as the next sampling point, and a new GP model is fitted with this additional data point (Fig. 9c). The updated model provides a more accurate approximation of the true model with much less uncertainty. The updated EI profile (Fig. 9d) also indicates that the region around the true response optimizer has a large expected improvement. This BO algorithm is summarized in Table 3.

Table 3 Bayesian Optimization Algorithm.
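For completeness, a minimal Python sketch of the EI computation in Eq. (9), including the optional trade-off parameter ξ:

```python
# Sketch of the EI acquisition of Eq. (9) for minimization.
from scipy.stats import norm

def expected_improvement(y_hat, s, y_min, xi=0.0):
    """EI at a point with GP mean y_hat and standard deviation s."""
    if s <= 0.0:
        return 0.0                 # no uncertainty -> no expected improvement
    delta = y_min - y_hat - xi     # improvement over the best observed value
    u = delta / s
    return s * norm.pdf(u) + delta * norm.cdf(u)
```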