Agents for sequential learning using multiple-fidelity data

Sequential learning for materials discovery is a paradigm in which a computational agent solicits new data to update a model in service of exploration (finding the largest number of materials that meet some criteria) or exploitation (finding materials with an ideal figure of merit). In real-world discovery campaigns, new data acquisition may be costly, and an optimal strategy may involve using and acquiring data with different levels of fidelity, such as first-principles calculations to supplement experiments. In this work, we introduce agents that can operate on multiple data fidelities and benchmark their performance on an emulated discovery campaign to find materials with desired band gap values. The low-fidelity data come from DFT calculations and the high-fidelity data from experimental results. We demonstrate performance gains of agents that incorporate multi-fidelity data in two contexts: using a large body of low-fidelity data as a prior knowledge base, or acquiring low-fidelity data in tandem with experimental data. This advance provides a tool that enables materials scientists to test various acquisition and model hyperparameters to maximize the discovery rate of their own multi-fidelity sequential learning campaigns for materials discovery. It may also serve as a reference point for those interested in practical strategies when multiple data sources are available for active or sequential learning campaigns.

The tuned hyperparameters and prediction performance of evaluated regressors from scikit-learn 1 and GPy 2 . If a hyperparameter is not explicitly listed, the default was used.

Regressor | Hyperparameter (table body not recovered from extraction)

Algorithm 1 fragment: if high-fidelity hypotheses generated < n then: if composition m has seed data support a then H ← H + D_lf,j

a. For each composition, we used the l2 norm to compute the similarity of its features (generated with matminer 5 ) to those of all other compositions in the seed data. If any l2 norm fell below the threshold we set, we considered the candidate composition similar to some composition(s) in the seed data, and thus to have seed data support. An l2 norm of 0 means the lower-fidelity measurement already exists in the seed data.

b. For a candidate composition predicted to be ideal but without seed data support, we acquired the lower-fidelity data with the closest similarity to it, also computed with the l2 norm. The number of low-fidelity queries can be specified as a campaign hyperparameter; we chose 1 for this work, meaning that whenever the lower-fidelity data for a composition has not yet been acquired, it is acquired first.
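The seed-data-support check described in footnote (a) can be sketched as follows. This is a minimal illustration: the function name, the toy feature vectors, and the threshold value are assumptions for demonstration, not values from this work.

```python
import numpy as np

def has_seed_support(candidate_features, seed_features, threshold=1.0):
    """Check whether a candidate composition is similar to any seed
    composition, using the l2 norm of the feature difference.

    candidate_features: (d,) feature vector of the candidate composition
    seed_features: (n, d) feature matrix of the seed compositions
    threshold: similarity cutoff (illustrative value, not from the paper)

    Returns (supported, index of the closest seed composition).
    """
    # l2 distance from the candidate to every seed composition
    dists = np.linalg.norm(seed_features - candidate_features, axis=1)
    # a distance of 0 means the lower-fidelity measurement already exists
    return bool(np.any(dists < threshold)), int(np.argmin(dists))

# Toy usage: three seed compositions with two features each
seed = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
supported, closest = has_seed_support(np.array([0.9, 1.1]), seed, threshold=0.5)
```

The closest-seed index returned here also supports the policy in footnote (b): when a predicted-ideal candidate lacks seed support, the nearest composition by l2 distance is the one whose lower-fidelity data is acquired first.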
Algorithm 2: Gaussian process lower confidence bound multi-fidelity agent
Input: total hypotheses acquisition budget m; for the Gaussian process: uncertainty mixing parameter α, uncertainty threshold β, and rank threshold γ.
Result: m total hypotheses generated.
Algorithm 2 fragment: if σ_hf,i < β then hallucinate ŷ_lf,i into the seed data and get LCB*

Optimizing α: Given that the objective of this work is to evaluate the performance of multi-fidelity agents compared to their corresponding single-fidelity agents, optimization was performed before the boundary-condition acquisition with high-fidelity data, and the optimized α was used consistently in all agents (i.e., single- and multi-fidelity agents in both the boundary-condition and in-tandem acquisitions). For the optimization, we simulated campaigns with various α values. For each campaign, we provided the first 500 ICSD 6 reported compositions as seed data and the rest of the compositions as candidate data. Each campaign ran for 20 iterations with a budget of 10 acquisitions per iteration. Based on the results in Table S2, smaller α values yielded a larger total number of discoveries among the subset we tested; α = 0.08 was selected, as it resulted in the highest number of total discoveries.
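The candidate ranking that α controls can be sketched as a lower confidence bound score. This is an illustrative sketch only: the function name and the minimization convention (smaller LCB is better) are assumptions, not the agents' exact implementation.

```python
import numpy as np

def lcb_rank(mu, sigma, alpha=0.08, budget=10):
    """Rank candidates by a lower confidence bound mu - alpha * sigma and
    return the indices of the `budget` most promising ones.

    mu, sigma: GP posterior mean and standard deviation per candidate.
    alpha: uncertainty mixing parameter (0.08 was selected in this work).
    """
    lcb = mu - alpha * sigma           # lower confidence bound per candidate
    return np.argsort(lcb)[:budget]    # smallest LCB first (assumed convention)

# Toy usage: three candidates, pick the two best
mu = np.array([1.0, 0.4, 2.0])
sigma = np.array([0.2, 0.1, 0.3])
picks = lcb_rank(mu, sigma, alpha=0.08, budget=2)
```

A larger α weights the GP uncertainty more heavily, trading exploitation of the predicted mean for exploration of uncertain candidates; the grid of α values in Table S2 probes that trade-off directly.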
Optimizing β and γ: β and γ were used in the multi-fidelity in-tandem campaigns, so their optimization was performed before the in-tandem acquisition. We provided the in-tandem multi-fidelity acquisition seed and candidate data to the agent. With α = 0.08, we simulated campaigns for combinations of β and γ drawn from β = [5, 10, 20, 30, 40] and γ = [0, 5, 10]. Each campaign ran for 20 iterations with a budget of 10 acquisitions per iteration; each acquisition could be a DFT calculation or an experiment. Given the results (Figure S1), a few β values with γ = 10 performed best. We selected γ = 10 and β = 5, since β is the uncertainty threshold and we want the agent to acquire experiments only when it is more confident.
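The exhaustive β, γ sweep described above can be sketched as a simple grid loop. Here `run_campaign` is a hypothetical placeholder standing in for the full simulated campaign (20 iterations of 10 acquisitions each); only the search structure is from the text.

```python
import itertools

def grid_search(run_campaign, alpha=0.08,
                betas=(5, 10, 20, 30, 40), gammas=(0, 5, 10)):
    """Simulate one campaign per (beta, gamma) pair and return the pair
    with the most total discoveries, plus all results.

    run_campaign: callable (alpha, beta, gamma) -> total discoveries
    (hypothetical placeholder for the full campaign simulation).
    """
    results = {}
    for beta, gamma in itertools.product(betas, gammas):
        # each call emulates a full 20-iteration campaign at fixed alpha
        results[(beta, gamma)] = run_campaign(alpha, beta, gamma)
    best = max(results, key=results.get)
    return best, results

# Toy usage with a stand-in scoring function (not real campaign results)
best, results = grid_search(lambda a, b, g: g - b)
```

With the 5 x 3 grid above, this runs 15 campaigns, matching the combinations summarized in Figure S1.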