Automated crystal structure analysis based on blackbox optimisation

In the present study, we show that time-consuming manual tuning of parameters in the Rietveld method, one of the most frequently used crystal structure analysis methods in materials science, can be automated by considering the entire trial-and-error process as a blackbox optimisation problem. The automation is successfully achieved using Bayesian optimisation, which outperforms both a human expert and an expert-system type automation despite the absence of expertise. This approach stabilises the analysis quality by eliminating human-origin variance and bias, and can be applied to various analysis methods in other areas which also suffer from similar tiresome and unsystematic manual tuning of extrinsic parameters and human-origin variance and bias.


INTRODUCTION
The physical properties of materials are governed by their crystal structure; thus, crystal structure analysis is an indispensable element in materials science research 1,2 . Compared to the drastic improvement in material fabrication and measurement throughput, the throughput of crystal structure analysis has not been improved because the analysis heavily relies on manual timeconsuming trial-and-error methods [3][4][5][6][7][8] . Rietveld refinement or the Rietveld method 9,10 , one of the most widely used crystal structure analysis methods for powder diffraction data, such as X-ray diffraction (XRD) and neutron diffraction, faces this problem as well.
Rietveld refinement involves complex curve fitting using an experimental result and a simulation based on a physical model. The physical model comprises intrinsic parameters for governing crystal structure such as lattice constants and atomic positions and extrinsic parameters such as background signals and peak shape defined by instrumental resolution. The method is designed to minimise the difference between the observed and simulated results by updating the parameters in a given physical model until the difference reaches a threshold. All parameters need initial values, including categorical ones such as background and peak shape functions. For each numerical parameter, a binary parameter (fix or refine) to control the scope of the refinement needs to be set. The number of intrinsic parameters depends on the complexity of the crystal structure and ranges from 10 to 100, while that of extrinsic parameters is typically less than 20.
It is commonly known that refining all parameters at once often leads to physically unreasonable results. The basic strategy to avoid such an issue is to gradually increase the number of parameters to be refined; that is, beginning with a few to finally incorporate all in order to keep them physically acceptable. Consequently, one needs to update the scope of the refinement at each step of the refinement process and, eventually, to determine the sequence of scopes. To guide researchers among a massive number of possible combinations and sequences of different refinement scopes with several tens of refinable parameters, empirical guidelines which significantly limit the size of "scope space" are often employed. However, even with the limited scope space, one has to search the right path to a reasonable solution in the search space by trial and error; furthermore, it is not guaranteed whether the guidelines will lead researchers to the optimal crystal structure in the whole scope space. Considering the wide use of Rietveld refinement in various materials science communities, this situation, that only proficient experts can exploit Rietveld refinement properly, should be improved.
Such complexity in parameter tuning would have consequences on analysis quality in terms of both variance and bias. For example, the consistency in experts' results shown by round robin tests in refs. [11][12][13] would have declined due to a wider distribution of individual skills to deal with the complexity if more researchers had participated in the tests. On the contrary, the consistency itself indicates the success of the refinement guidelines shared by experts. While such guidelines contribute to the consistency as expected, they can also cause a systematic bias in refinement results. In this study, based on blackbox optimisation (BBO) 14 , we offer a general automation framework for Rietveld refinement to alleviate these problems and allow practitioners to spend more time examining candidate structures proposed by the framework, and not searching for them in a vast parameter space.
BBO refers to a problem setup, in which the objective function and/or constraint function are given by blackboxes, i.e., the analytic forms of the function are unknown. BBO has recently emerged as a popular concept in the machine learning community because it performs an important task of tuning hyperparameters in machine learning models automatically, which is termed as hyperparameter optimisation (HPO) 15 . HPO plays a vital role in machine learning; (1) it reduces the human effort necessary for applying machine learning, (2) it improves the performance of machine learning algorithms, and (3) it facilitates fair comparisons among the results of machine learning studies by providing the same level of tuning 15 . HPO is formalised as the maximisation of machine learning model performance in the hyperparameter configuration space. For this case, the relationship between a hyperparameter configuration and the machine learning model performance is considered as a blackbox. At present, Bayesian optimisation 16 , a global search method for BBO, is the most successful in this HPO problem [17][18][19][20][21] . This topic is closely related to our focus, i.e., to stabilise the analysis quality and bypassing unsystematic (and often unexciting) tasks in scientific studies. Herein, we realised that the concept of HPO can be applied to Rietveld refinement.
Whereas numerous studies have been conducted on Rietveld refinement 9,10,22-33 and software 34-39 , up to now, the number of studies that aim to address aforementioned problems is very limited. Powder 3D Parametric 40 is a semi-automated software toolkit for Rietveld refinement based on user-provided configuration. This software is particularly designed for sequential analysis of a large number of data such as time-resolved measurement. SrRietveld 41,42 , an automated software toolkit for Rietveld refinement, provides several pre-written scripts for automating the common tasks in the method. AutoFP 43 uses an expert system algorithm to simulate the manual process performed by human experts. The expert system consists of simple if-then rules, based on a collection of empirical strategies and simple ideas. Recently, PowderBot 44 , which applies reinforcement learning 45 to automate Rietveld refinement, has been proposed. Reinforcement learning is suitable for solving problems with stochastic state transitions which are formulated as a Marcov decision process. This method is famous for its great success in playing games like Go. While the reinforcement learning-based approach is conceptually interesting, it seems that another simpler approach such as BBO is sufficient for our purpose because Rietveld refinement is deterministic. In single-crystal diffraction, SCAR 46 , machinelearning-based automated crystal structure analysis software has been proposed very recently. It supports precise analysis and is applicable to various single-crystal diffraction data, and is of interest in the field of crystal structure analysis. However, in powder X-ray diffraction, there is still room to apply ideas from machine learning. To the best of our knowledge, no attempt has been made to apply the BBO approach to Rietveld refinement for automating the time-consuming trial-and-error process and search beyond the boundaries defined by the common sense of human experts.

RESULTS AND DISCUSSION
Formalisation Here, we introduce our "BBO-Rietveld" approach that automates the tuning of the refinement scope in the Rietveld method in the same manner as previously done in HPO. Fig. 1 illustrates the overview of the BBO-Rietveld approach described below, and Table 1 shows the list of optimised parameters. Most of them are for defining the scope and an extrinsic categorical parameter for the background function. Note that we did not perform numerical initial parameter search contrary to the possible expectation that some of the readers would have.
In Rietveld refinement, the weighted profile residual factor (R wp ), which is the difference between calculated and observed diffraction intensity, is widely used as a metric to evaluate the Rietveld refinement result 29 . Let x be a fixed parameter configuration of Rietveld refinement including the scope of the refinement, y Obs i be the i-the observed diffraction intensity, y Calc x;i be the diffraction intensity calculated using the physical model with the configuration x, and w i be the weight corresponding to the inverse of the uncertainty of i-th intensity. Then, R wp is defined as follows: The crystal structure with the minimum R wp is the most probable structure based on the given experimental data.
Our task can be formalised as the following BBO problem: minimise R wp ðxÞ; subject to cðxÞ ! 0; x 2 X; where X is the configuration space and c(x) is the set of constraint functions. In the problem, a configuration x is considered as a variable and it affects R wp through y Calc x;i . The constraints are set to ensure that the refined parameters are physically valid, or sometimes to limit the search space and reduce manual trialand-error attempts with a plausible explanation. In this study, we limit ourselves to introduce the latter type of constraints and we only introduce an essential constraint to keep human-origin biases to a minimum (see Methods section for more details).
By solving the problem, we obtain the best configuration x ? ¼ argmin x R wp ðxÞ, which achieves the best fit without manual process.
Additionally, goodness-of-fit (GOF) is used as a metric to evaluate the optimised Rietveld refinement result. The definition of GOF is given as follows: where N obs is the number of data points of observed XRD pattern, Fig. 1 Overview of the BBO-Rietveld approach. An optimiser iteratively samples a new promising configuration and then runs a Rietveld refinement software with the sampled configuration. A refined crystal structure that achieves the best fit is obtained as a final output. It should be noted that small R wp or GOF does not guarantee that the corresponding refined crystal structure would be appropriate for the obtained experimental data. The examination of the analysis result is still an important task for human experts 29 .

Experiments
To evaluate the proposed BBO-Rietveld approach, we optimised parameter configurations of Rietveld refinement for XRD patterns of  Table 1. The initial crystal structure models were taken from crystallographic information files (CIFs) on the GitHub repository of AutoFP 47 , presumably those used in ref. 43 . To evaluate statistical property and reproducibility, optimisations have been performed 100 times with different random seeds each for Y 2 O 3 and DSMO. Histograms of 100 R wp values for both materials are shown in Fig. 2. Reference values for a human expert and AutoFP are also shown for comparison. The latter ones are taken from the previous study 43 .
Surprisingly, 90% and 99% of our optimisation runs for both test datasets, Y 2 O 3 and DSMO, achieve better R wp than the human expert and AutoFP, although no additional expertise in Rietveld refinement is used other than the conventional refinement procedure (see Methods section for details). Fig. 3 shows the refinement results of the best configuration derived by BBO for each benchmark material. We cannot find apparent flaws in the results refined with the best configuration. The changes of R wp and goodness-of-fit (GOF) are shown in Fig. 4 (Y 2 O 3 ) and 5 (DSMO). Both R wp and GOF were improved as optimisation progresses, and in both metrics, our proposed method exceeded human expertise at around 100 evaluations. R wp and GOF will be further improved if we continue the optimisation.
To visualise the refined structure features (x, y, z positions and U iso of each atom) in two-dimensional space, we utilise the multidimensional scaling (MDS) 48,49 , one of the dimensionality reduction algorithms. MDS can represent high-dimensional data in a low-dimensional space by approximating the distance in the original space. The number of original structure feature dimension is 16 for Y 2 O 3 and 20 for DSMO. The MDS visualisation for DSMO is shown in Fig. 6. Each point in this figure represents a refined crystal structure, and the distance between points indicates their similarity. The majority of the refined structures form a loose cluster at the lower-right side of the figure, which means they are similar to each other. In the cluster, there are three special points; the best one from BBO, the one from the human expert, and the one from BBO and very close to the human expert result as shown   Fig. 6. While the best structure obtained by BBO (Table 2) scores smaller R wp than that of the human expert result (Table 2), U iso of O1 atom in the BBO result is noticeably smaller than that of O2 atom, which violates a conventional criterion that U iso of atoms with similar mass should be comparable. On the other hand, the third special point, obtained by BBO and close to the human expert result, passes the criterion (Table 2). This demonstrates that BBO efficiently obtains candidates with small R wp values but additional examinations are still required to choose the most preferable one. Next, we discuss the interesting outlier at the upper left of Fig. 6. The outlier ( Table 2) has good converged R wp , but the x position of O1 atom in the refined structure is considerably different from others. The difference exceeds the uncertainty calculated by the software, and, from the structural point of view, the positional shift corresponding to 5% of the lattice parameter would be enough to affect physical properties. This implies that the outlier corresponds to a local minimum not belonging to the cluster discussed above. In such a situation, multiple crystal structure candidates can explain the experimental data sufficiently with a similar level, which may affect the conclusion of a study or evoke a new discussion. Despite that it meets the constraint in the optimisation (positive U iso ), the outlier can be rejected due to the violation of the conventional criterion for U iso as well as the best result by BBO. We believe that imposing constraints other than universal physical requirements in BBO can result in biased optimisation of results toward human expectations. Experts can examine the list of candidates using conventional criteria and other knowledge, and can eliminate inappropriate solutions at any time. This ability to propose multiple candidates without human effort and bias is a great advantage of automation and may lead to the discovery of hitherto unnoticed new knowledge beyond the standard practices of conventional manual tuning and the expert system simulating the practices. The use of empirical constraints in a manual analysis by experts also has the purpose of bounding the search space to reach a valid solution as soon as possible. However, Rietveld refinement using BBO can evaluate a large number of parameter combinations and suggest candidate structures much more efficiently than manual approach, that is, the constraints imposed to save time are no longer essential. The results demonstrate that the proposed approach is reliable, allowing more practitioners without much expertise to utilise Rietveld refinement. In addition, the automation of the trial-anderror processes not only gives better results, but also stabilises the quality of results by eliminating human-origin variance and bias accumulated in manual tuning processes as HPO does in machine learning.   The efficiency against execution (wall-clock) time is also important in practice. In this study, each optimisation run (200 trials) takes less than an hour using a general workstation (2.3 GHz 18 core CPU, 256 GB RAM), and it can be performed within a practical time limit using a general laptop. The execution time of the optimiser is improved by parallelisation. For the same number of trials, the performance of parallel optimisation is typically slightly worse than in sequential optimisation, but the execution time will be about 1/n when the number of workers is n. Increasing the number of experts is difficult, but increasing computational resources is relatively easy. It is also an advantage of automation over manual work.
The problems of human-origin variance and bias and low stability of the quality of model fitting results are very common in other research areas. We consider that the framework of automation as a blackbox optimisation in this study is powerful and our approach can be easily applied to similar problems in other fields.

Details of implementation
To solve the problem (Eq. (2)), we connect GSAS-II 39 , a crystallography data analysis software, which provides Rietveld refinement to Optuna 50 , an automatic HPO software framework. Extrinsic parameters listed in Table 1 are optimised using the TPE. Other extrinsic parameters related to Rietveld refinement are initialised with GSAS-II default values, and the intrinsic parameters for a crystal structure are obtained from ref. 43 . The TPE is easy to use and supports real, integral, categorical, and conditional types of parameters, making it suitable for application in our case. It iteratively samples a new promising configuration based on the past observations (i.e. the pairs of the configurations and the corresponding R wp values) and then runs GSAS-II to perform Rietveld refinement with the sampled configuration to find the best one. The refinement procedure described herein consists of the following conventional steps: (1) set initial 2θ range, (2) refine the background, (3) refine the 2θ zero correction and the unit cell parameters, (4) refine sample displacement parameters and the histogram scale factor, (5) refine atom parameters, and (6) refine the above parameters without 2θ bounds in step 1. During the optimisation, the constraint that all refined U iso must be non-negative is imposed to avoid unreasonable structure models by setting R wp to a large constant (=10 9 ) for penalty when the constraint is violated. The TPE does not mimic human experts but simply optimises the objective function. This is completely different from a conventional expert system that aims to reproduce the empirical strategies of human experts.

Visualisation of results
To visualise the crystal structure features with MDS in Fig. 6, feature standardisation, which makes the values of each feature in the data have zero-mean and unit-variance, was performed before applying MDS. We used the MDS implementation in scikit-learn 51 , a machine learning toolkit for Python.

DATA AVAILABILITY
The dataset of diffraction patterns and initial crystal structures are available in the GitHub repository of AutoFP 47 (Y 2 O 3 , DSMO) and PowBase 52 (LiCoO 2 ).

CODE AVAILABILITY
The codes for automated crystal structure analysis, documents, and supplementary resources that support the findings of this study are available at https://doi.org/ 10.5281/zenodo.3747147. These are also available at the BBO-Rietveld repository on GitHub (https://github.com/quantumbeam/BBO-Rietveld). As the supplementary resources, we provide not only the experimental results of Y 2 O 3 , DSMO that we discussed in this paper but also the result of LiCoO 2 that are also comparable to the result of human expert.