Deep learning framework for material design space exploration using active transfer learning and data augmentation

Neural network-based generative models have been actively investigated as an inverse design method for finding novel materials in a vast design space. However, the applicability of conventional generative models is limited because they cannot access data outside the range of training sets. Advanced generative models that were devised to overcome the limitation also suffer from the weak predictive power on the unseen domain. In this study, we propose a deep neural network-based forward design approach that enables an efficient search for superior materials far beyond the domain of the initial training set. This approach compensates for the weak predictive power of neural networks on an unseen domain through gradual updates of the neural network with active transfer learning and data augmentation methods. We demonstrate the potential of our framework with a grid composite optimization problem that has an astronomical number of possible design configurations. Results show that our proposed framework can provide excellent designs close to the global optima, even with the addition of a very small dataset corresponding to less than 0.5% of the initial training dataset size.


INTRODUCTION
To discover or design novel materials with outstanding properties, significant effort has been devoted to devising various material design approaches such as biomimicry, design of experiment methods, and other conventional optimization methods [1][2][3][4][5][6][7][8][9][10][11][12][13] . However, these approaches often require in-depth physics-based analysis of the relationship between material descriptors and properties. Hence, a fundamental understanding of the underlying physical mechanisms determining the material properties is a prerequisite for material design. Machine learning models are promising alternative tools for materials design, because they enable design space exploration using only a database representing the relationship between the descriptors of a material (inputs) and its properties (outputs). Trained machine learning models can infer this relationship with a speedup of several orders of magnitude compared to actual data generation from experiments or physics-based simulation tools [14][15][16][17][18][19][20][21][22] . In many applications, machine learning models such as Gaussian process regression, radial basis function networks, support vector machines, and deep neural networks (DNNs) are adopted as surrogate forward models, which predict the outputs from the corresponding inputs [23][24][25] . These models are combined with high-throughput screening and various optimization methods to obtain new materials with targeted properties [26][27][28][29] . However, finding the desired materials in a vast design space with a forward design approach requires considerable effort, because a large number of candidates must be tested to search for the optimal material due to the absence of the gradient of the predicted output with respect to the input features 19,[30][31][32] .
In this regard, inverse design methods, which adopt machine learning models as designers that directly suggest promising candidate materials based on target properties, are being intensively studied to avoid the aforementioned arduous design space exploration process 18,[33][34][35][36][37][38] . The autoencoder (AE), variational autoencoder (VAE), and generative adversarial network (GAN) are three of the most commonly used DNN-based generative models 18,36,38 . These methods are very efficient if the target design is located within or close to the seen domain, i.e., the domain of the training dataset 34,[39][40][41][42] . However, they cannot generate data in the unseen domain, i.e., the domain outside the ranges of the training dataset defined by the input feature space and output values. Hence, their applicability is limited, because it is computationally infeasible to generate training data large enough to cover the entire high-dimensional design space of most material design problems 35 . Moreover, a DNN is likely to have lower predictive performance on unseen domains unless it learns a governing equation representing the relationship between the inputs and the outputs, which requires domain expertise and some heuristics to create an appropriate mathematical formulation [43][44][45][46][47][48][49] .
Thus, in most cases, it is inevitable to rely on the lower predictive power of the DNN and to repeat the validation of many suggested candidates through laborious simulations or experiments until a material with the desired property is obtained 35 . In addition, even if active learning is adopted to increase the prediction reliability of a DNN in the domain near the target design, a large amount of data must be validated and used to update the DNN 35,50 . This is because the proposed data are likely to be positioned far from the target design due to the poor reliability of the DNN originally trained with the initial training dataset. Hence, the design process incurs a very large computational cost.
In this study, we propose a systematic neural network-based forward design approach that efficiently searches for the desired materials outside the training data domain and overcomes the aforementioned limitations of the existing methods. It is known that a DNN trained by gradient descent algorithms with unbounded activation functions is capable of making reliable predictions on the unseen domain close to the training dataset through linear approximation 44,51,52 . Hence, our framework gradually expands the reliable prediction domain of the DNN toward the region of desired properties by updating the DNN via active transfer learning. Relatively sparse and small additional datasets including materials with incrementally superior properties are iteratively added to the training set using a data augmentation technique to increase the generalization of the DNN, i.e., its ability to make accurate and stable predictions on unseen data 53 . The limitation of a forward design approach is mitigated by using a hyper-heuristic genetic algorithm on top of the updated DNN. Our study demonstrates that materials with desired properties can be designed from inferior original training datasets with small dataset augmentation and validation.

Schematic of the forward design framework
The schematic of our framework is depicted in Fig. 1. A DNN trained with the initial training dataset is capable of making reliable predictions on a design space slightly larger than the training data domain, as represented by the bluish region. To find materials with desired properties, which are positioned outside the domain of the initial training data, the DNN should be able to make reliable predictions on the domain containing the desired design. To this end, the trained DNN is used to predict the properties of new material designs proposed by the genetic algorithm, and a relatively small set of materials superior to those in the existing dataset is suggested. Since the newly proposed materials are outside the current training set and the DNN predictions on them are not accurate, their properties are evaluated again with accurate physics-based simulations (if a high-throughput experimental facility is available, one can use experiments). Those data are integrated into the training dataset with a data augmentation technique, and the DNN is updated based on the new training data with active transfer learning, as represented by the black arrow. This process is repeated until the DNN is able to make reliable predictions in the domain close to the optimal point, represented as the large red point. The DNN after the last update is used to find the optimal design. A detailed explanation of our framework is provided in the following sections.
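The control flow of this loop can be sketched as follows. This is only a minimal illustration: the function `forward_design_loop` and its arguments `search`, `evaluate`, and `update` are hypothetical names introduced here, and the toy stand-ins at the bottom replace the DNN, the genetic algorithm, and the physics-based simulation with trivial functions so that the sketch runs on its own.

```python
from typing import Callable, List, Tuple

Design = Tuple[int, ...]
Dataset = List[Tuple[Design, float]]


def forward_design_loop(model, dataset: Dataset, search: Callable,
                        evaluate: Callable, update: Callable, n_updates: int):
    """Repeat: propose candidates, validate them, and update the surrogate."""
    for _ in range(n_updates):
        # 1. Search for better candidates using surrogate predictions
        #    (a genetic algorithm on top of the DNN in the framework).
        candidates = search(model, dataset)
        # 2. Validate the candidates with the accurate but slow evaluator
        #    (physics-based simulation or experiment).
        validated = [(x, evaluate(x)) for x in candidates]
        # 3. Add the validated designs to the training data.
        dataset = dataset + validated
        # 4. Update the surrogate on the enlarged dataset
        #    (transfer learning in the framework).
        model = update(model, validated, dataset)
    return model, dataset


# --- Toy stand-ins (assumptions for illustration only) -------------------
def toy_search(model, dataset):
    """Flip one 0 to 1 in the best known design; return up to two candidates."""
    best, _ = max(dataset, key=lambda pair: pair[1])
    flipped = [best[:i] + (1,) + best[i + 1:]
               for i, bit in enumerate(best) if bit == 0]
    return flipped[:2]


def toy_evaluate(design):
    """Stand-in for the simulation: the property is the number of ones."""
    return float(sum(design))


def toy_update(model, validated, dataset):
    """Stand-in for the surrogate update: the toy search needs no model."""
    return model


start: Dataset = [((0, 0, 0, 0), 0.0)]
_, final_data = forward_design_loop(None, start, toy_search,
                                    toy_evaluate, toy_update, n_updates=3)
best_value = max(value for _, value in final_data)
```

In this toy run, the best property improves by one at each update because every new search starts from the best validated design so far, which mirrors the gradual expansion described above.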

Architectures of the deep neural network (DNN)
To leverage the predictive ability of the DNN on the unseen domain, the DNN architecture uses an unbounded activation function, i.e., the leaky rectified linear unit (leaky ReLU) with a coefficient of 0.1 46,52 (see the comparison with other activation functions in Supplementary Figure 1). The architectures are constructed based on a residual network (ResNet) with full pre-activation, which is known for good generalization performance with a sufficient number of learnable parameters, with batch normalization layers as a regularization method 54,55 (see the details of the DNN architecture in Supplementary Figure 2). We check the predictive performance of the DNN in the seen and unseen domains by using a randomly chosen 10% of the data and the data with the highest 10% of output values as validation sets, respectively. In the optimization procedure, one has to explore designs having output values higher than those of the initial training set. Hence, validation on the data with the highest 10% of output values for a DNN trained on the data with the lowest 90% of output values represents the predictive performance of the DNN in the unseen domain during the optimization procedure. The flowchart representing the process for constructing the DNN architecture is depicted in Supplementary Figure 3.
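The two validation schemes can be illustrated with a short NumPy sketch. The function name `seen_unseen_splits`, the random seed, and the synthetic outputs are assumptions made for this example and are not taken from the study.

```python
import numpy as np


def seen_unseen_splits(y, frac=0.1, seed=0):
    """Return (train, val) index arrays for the seen and unseen checks."""
    y = np.asarray(y)
    n = len(y)
    n_val = int(round(frac * n))
    rng = np.random.default_rng(seed)

    # Seen-domain check: hold out a randomly chosen 10% of the data.
    perm = rng.permutation(n)
    seen_val, seen_train = perm[:n_val], perm[n_val:]

    # Unseen-domain check: hold out the 10% with the highest output values
    # and train on the remaining 90% with the lowest output values.
    order = np.argsort(y)  # ascending by output value
    unseen_train, unseen_val = order[:n - n_val], order[n - n_val:]
    return (seen_train, seen_val), (unseen_train, unseen_val)


# Toy usage with synthetic outputs (not the composite dataset).
y = np.random.default_rng(1).normal(size=1000)
(seen_tr, seen_va), (unseen_tr, unseen_va) = seen_unseen_splits(y)
```

With the second split, every validation sample has an output value at least as high as any training sample, so the validation error measures extrapolation rather than interpolation.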

Prediction results upon seen domain and unseen domain
In this study, we demonstrate the applicability of our framework by solving a representative problem with a large design space: the design of composite microstructures with superior mechanical properties, i.e., stiffness, strength, and toughness, which are close to the global optimum located far beyond the domain of the initial training data. The details of data generation are presented in the Methods section and Fig. 2. The training results of the DNN regarding stiffness and strength on the seen and unseen domains are presented in Fig. 3. The prediction accuracy gradually decreases as the data are located farther away from the training data, as reported in the literature 44,52 . As expected, the results on the seen domain are generally better than those on the unseen domain. Still, despite the mismatch in the absolute values, the DNN is able to distinguish relative magnitudes to some extent, as shown in Fig. 3c-d. However, the training results for toughness show that it is infeasible to make predictions on the unseen output value range (Supplementary Figure 4). We suspect that the poor predictive performance on toughness originates from the complexity of determining toughness from the entire stress-strain curve, which encompasses the full failure process involving complex crack propagation and branching. We expect that this challenge might be overcome with sequential learning methods that can learn complex and nonlinear material behavior beyond the onset of yield or failure 56,57 . However, because this is beyond the scope of this study, we leave the optimization of toughness as future work. In the stiffness and strength training results on the unseen domain, the DNN predictions deviate progressively more as the data are positioned farther away from the training dataset.
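The accuracy measures reported with Fig. 3, the coefficient of determination (R^2) and the root mean squared error (RMSE), can be computed as in the sketch below. It assumes the standard definitions of both quantities; the function names and the toy numbers are illustrative only.

```python
import numpy as np


def r2_score(y, f):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    ss_res = np.sum((y - f) ** 2)         # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


def rmse(y, f):
    """Root mean squared error between actual and fitted values."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    return np.sqrt(np.mean((y - f) ** 2))


# Toy values (not results from the study).
actual = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]
r2 = r2_score(actual, fitted)
err = rmse(actual, fitted)
```

Note that R^2 can become negative on an unseen-domain validation set, because there the fitted values may be worse than simply predicting the mean of the actual values.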

Material design process
A flowchart of the proposed material design framework is depicted in Fig. 4. DNNs tested with randomly selected validation sets are adopted to allow reliable prediction in a broader domain of output values. For the genetic algorithm, the 30 microstructures having the highest properties are selected for the mating pool as a greedy sampling method to enable sufficient genomic variation at each generation. We note that selecting the proper amount of data for the mating pool is important when considering the tradeoff between the risk of being stuck in local minima and computational time. Additionally, we utilize the intuition from solid mechanics that a symmetrical microstructure is beneficial for the load-bearing capacity and that soft material at the crack tip is able to relieve the stress concentration there 1,4 . A hyper-heuristic genetic algorithm incorporating this domain knowledge is implemented by constraining the explored microstructure designs to satisfy the prescribed conditions. The constraints accelerated the optimization process compared to conventional particle swarm optimization and a conventional genetic algorithm (see the details of the comparison for each optimization method in Supplementary Figure 6). In the genetic algorithm, crossovers are implemented by selecting two microstructures from the mating pool as parents and randomly assigning stiff material to the area occupied by stiff material in the parent configurations. Mutations are applied by randomly switching the positions of a stiff material block and a soft material block while keeping the ratio between the stiff and soft blocks. Approximately 4 × 10^4 unique candidate microstructures are generated from the mating pool at each generation.
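The crossover and mutation operators described above can be sketched as follows. This is a simplified reading of the description, not the implementation used in the study: the symmetry and crack-tip constraints of the hyper-heuristic algorithm and the pre-existing crack are omitted, and the helper names `random_design`, `crossover`, and `mutate` are hypothetical.

```python
import numpy as np


def random_design(rng, shape=(11, 11), n_stiff=70):
    """Random binary grid with a fixed number of stiff blocks (1 = stiff)."""
    flat = np.zeros(shape[0] * shape[1], dtype=np.int8)
    flat[rng.choice(flat.size, size=n_stiff, replace=False)] = 1
    return flat.reshape(shape)


def crossover(p1, p2, rng):
    """Place the child's stiff blocks at random among the cells that are
    stiff in at least one parent, keeping the number of stiff blocks."""
    n_stiff = int(p1.sum())
    union = np.flatnonzero((p1 | p2).ravel())
    chosen = rng.choice(union, size=n_stiff, replace=False)
    child = np.zeros(p1.size, dtype=p1.dtype)
    child[chosen] = 1
    return child.reshape(p1.shape)


def mutate(x, rng, n_swaps=1):
    """Swap randomly chosen stiff and soft blocks; the ratio is unchanged."""
    child = x.copy().ravel()
    for _ in range(n_swaps):
        i = rng.choice(np.flatnonzero(child == 1))  # a stiff cell
        j = rng.choice(np.flatnonzero(child == 0))  # a soft cell
        child[i], child[j] = 0, 1
    return child.reshape(x.shape)


rng = np.random.default_rng(0)
p1, p2 = random_design(rng), random_design(rng)
child = crossover(p1, p2, rng)
mutant = mutate(child, rng)
```

Both operators preserve the number of stiff blocks, so every candidate stays inside the same fixed-ratio design space as the training data.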
Fig. 3 DNN training results. a-b DNN training results in terms of stiffness and strength with validation sets consisting of a randomly selected 10% of the data. c-d DNN training results to check generalization on an unseen domain, with the training sets and the validation sets defined as the 90% of the data with the lowest output values and the 10% of the data with the highest output values, respectively. The coefficient of determination (R^2) is calculated by the following equation: $R^2 = 1 - \sum_i (y_i - f_i)^2 / \sum_i (y_i - \bar{y})^2$, where $y_i$, $f_i$, and $\bar{y}$ represent the actual value, the fitted value, and the mean of the actual values, respectively. The root mean squared error (RMSE) is calculated by the following equation: $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - f_i)^2}$, where $n$ is the number of data points.
The output values of the candidate microstructures are predicted with the DNN. Because the microstructures proposed by the genetic algorithm are located close to the microstructures from the previous generation in terms of feature space and output values, the DNN can make reliable predictions on the microstructures suggested by the genetic algorithm for a certain number of generations. Based on those predictions, we select candidate microstructures that are expected to have output values closer to the target design than the existing microstructures for the mating pool, and continue this process until the DNN-predicted output values no longer improve, i.e., until convergence to the predictable limit of the DNN. After the convergence, we calculate the true output values of the candidate microstructures in the mating pool with FEA simulations and integrate them into the training data with a data augmentation technique. The data augmentation is implemented by oversampling the new data, i.e., replicating the new data 50 times and splitting the replicated data into training and validation sets with a 9:1 ratio. By employing data augmentation, emphasis is placed on generalization on the unseen domain, as suggested by previous studies 53,58,59 .
Without the data augmentation, the added data are diluted in the training data because they are sparser and far fewer than the existing training data. We note that other data augmentation techniques, such as the synthetic minority oversampling technique, are difficult to adopt due to the characteristics of the input features, i.e., a binary matrix, for the composite optimization problem considered in this study, and that different data augmentation schemes can be adopted depending on the characteristics of the design problem. The DNN is updated by re-training it on the integrated training data with a reduced learning rate (10^-6) and a reduced number of training epochs (10) in the transfer learning scheme. Because the DNN requires only moderate updates of its learnable parameters to expand the reliable prediction domain, the updates take dramatically less time than the initial training. The details of the DNN training history are included in Supplementary Figures 8-9.
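The oversampling step can be sketched with NumPy as shown below. The sketch follows the description literally (replicate each new sample 50 times, then split the replicated set 9:1 at random); the function name, the shuffling, and the toy arrays are assumptions for illustration.

```python
import numpy as np


def oversample_and_split(X_new, y_new, n_copies=50, val_frac=0.1, seed=0):
    """Replicate newly validated samples and split them into train/val."""
    # Each new sample is repeated n_copies times so that it is not diluted
    # by the much larger existing training set.
    X_rep = np.repeat(X_new, n_copies, axis=0)
    y_rep = np.repeat(y_new, n_copies, axis=0)

    # Shuffle, then split 9:1 into training and validation parts.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y_rep))
    n_val = int(round(val_frac * len(y_rep)))
    val, train = perm[:n_val], perm[n_val:]
    return (X_rep[train], y_rep[train]), (X_rep[val], y_rep[val])


# Toy stand-in for one mating pool: 30 binary 11 x 11 designs with labels.
rng = np.random.default_rng(1)
X_new = rng.integers(0, 2, size=(30, 11, 11))
y_new = rng.normal(size=30)
(X_tr, y_tr), (X_va, y_va) = oversample_and_split(X_new, y_new)
```

The replicated training and validation parts would then be merged with the existing training and validation sets before the DNN is re-trained at the reduced learning rate.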

Composite microstructures with maximum mechanical properties
The design process for maximizing stiffness and strength is represented in Fig. 5. The data points at the 0th update represent the initial training dataset. In the design process for maximizing stiffness shown in Fig. 5a, approximately 1.6 × 10^5 unique microstructures are investigated within 4 h through the DNN. This corresponds to a speedup of several orders of magnitude compared to FEA simulation tools, which would take more than 8 weeks to generate 10^5 data points. In the design process for maximum stiffness, a significant improvement is already observed with the initially trained DNN, because, as the stiffness predictions on the unseen output value range indicate, the DNN is able to distinguish the relative magnitudes of stiffness despite some mismatch in absolute values. We hypothesize that the moderate level of extrapolation beyond the seen domain is made possible by the piecewise linear, unbounded leaky ReLU function used in the DNN. Indeed, our test on various activation functions shows that a DNN based on unbounded ReLU-type functions has reasonable predictive power on the unseen domain, while bounded activation functions are inefficient for the task (Supplementary Figure 1). The iterative update of the DNN is also beneficial in making robust improvements on the local minima explorable with the DNN and in escaping from local minima, as demonstrated by the gradual improvement of the converged values shown in Fig. 5a. Given that almost identical microstructure designs come out of the 3rd and 4th updates of the DNN, we consider the design process to have converged after the 4th update. An identical process is also applied to find the microstructure with the maximum strength, as shown in Fig. 5b. The maximum strength value keeps increasing over five iterative updates of the DNN. The material designs having the six highest stiffness and strength values at the last DNN update are depicted in Fig. 6.
The best microstructures for stiffness and strength are positioned at the top of Fig. 6a and Fig. 6b, respectively. Microstructures ranked 2nd to 6th are visualized at the bottom. The theoretical upper bound of the elastic modulus (E_max) of a composite having a volume fraction of stiff material (f) can be achieved with the following equation: For our design problem with f = 71/121, E_max would be 1223.7 MPa if it were not for the pre-existing crack. The stiffness of the optimized microstructure (1144.7 MPa) is close to the theoretical upper bound even though a pre-crack exists that inherently reduces the stiffness of the structure. Therefore, we can infer that the maximum stiffness design found by our framework is close to the global optimum. For the composite with maximum strength, the strength is significantly improved compared to the best design in the training set. In those designs, a majority of the stiff blocks are located in the region far from the pre-crack (right side), because such an unbalanced material distribution leads to increased stiffness while reducing the negative effect of the pre-crack. At the same time, since crack initiation may occur at any point with a high stress concentration other than the pre-existing crack tip, it is important to reduce the stress concentration at the vertices formed by different material blocks. Figure 6b shows that the stress concentration at the crack tip is relieved in the optimized structure compared to some randomly chosen composite structures in the initial training data (Supplementary Figure 10). As a result, we could obtain a design that is more than twice as strong as the best design in the original dataset (i.e., the maximum strength increases from 0.076165 MPa to 0.16276 MPa throughout the optimization process). The proposed designs also outperform composite structures that are manually designed based on physical intuition (see the details in Supplementary Figure 11).
The results for maximum stiffness and strength are obtained only by augmenting 366 and 424 microstructures, respectively. The augmented dataset size is about 0.4% of the initial training dataset size. The histograms for the additional data are depicted in Fig. 7.

DISCUSSION
In this study, we propose a systematic forward material design framework to obtain superior designs far beyond the domain of the initial training dataset. The framework is applied to a composite microstructure design problem for obtaining the maximum mechanical properties of an 11 × 11 grid composite, which has an astronomically high number of possible configurations. Because this type of design process inherently requires an efficient search in unseen domains, a forward design approach is adopted by gradually expanding the reliable prediction domain of the DNN with active transfer learning and data augmentation. Better design candidates are first proposed by the genetic algorithm based on DNN predictions. The properties of the candidates are obtained with FEA calculations before they are added to the training dataset, in order to secure the predictive performance of the DNN on the newly added data. The limitations of a forward design approach, such as being stuck in local minima, are mitigated by updating the DNN and controlling the mutation methods in a hyper-heuristic genetic algorithm.
We emphasize that the iterative and gradual expansion of the reliable prediction domain assisted by the genetic algorithm is key to the superior efficiency of our framework. In contrast to candidates obtained directly from advanced generative models in previous studies, candidates suggested by the genetic algorithm are located relatively close to the dataset from the previous generation in terms of feature space and output values. Hence, the prediction accuracy of the DNN is maintained to some degree during the gradual expansion of the reliable prediction domain. Also, the expansion proceeds along relatively narrow but correct routes toward the target design, as depicted in Fig. 1. The superior efficiency of our design framework can be more clearly demonstrated by comparing the size of the additional datasets required to reach the final design. In our study, only ~4 × 10^2 additional data points are validated to update the DNN originally trained with the 10^5 initial training data points in a design space having ~1.8 × 10^34 possible configurations.
We expect that our framework is applicable to a wide range of optimization problems with astronomically large design spaces in other science and engineering disciplines, because it provides an efficient way of gradually expanding the reliable prediction domain toward the target design while avoiding the risk of being stuck in local minima. In particular, because the method is less data-hungry, design problems in which data generation is time-consuming and expensive will benefit most from our framework.

Data generation based on finite element analysis (FEA)
As a training dataset, the mechanical properties, i.e., stiffness, strength, and toughness, of two-dimensional 11 × 11 grid composites are obtained from finite element analysis (FEA) under plane stress conditions. The composites consist of 70 stiff material blocks and 51 soft material blocks that are perfectly bonded and contain a pre-existing crack. The input features are formatted as a one-hot binary encoding representing the positions of the stiff and soft materials, and the outputs are defined as the corresponding mechanical properties. A total of 100,000 unique microstructures are constructed, and the corresponding stress-strain curves are obtained by applying uniaxial tension with a quasi-static infinitesimal strain increment of 0.0000227 until complete fracture occurs. The 100,000 data points correspond to a fraction of ~10^-29 of the total number of available combinations, C(120, 70) ~ 10^34. The stiffness, strength, and toughness are measured by the initial gradient of the stress-strain curve, the maximum stress point, and the total area under the stress-strain curve, respectively. The materials are assumed to be linear elastic materials with the following elastic modulus (E), Poisson's ratio (ν), and critical strain energy release rate (G_c), respectively, for stiff and soft materials: For the simulation of crack growth, the hybrid crack phase-field (CPF) model is adopted in the commercial FEA software ABAQUS with a user-element subroutine 60 . The hybrid CPF model enables the modeling of complex crack nucleation and propagation by describing the failure of the material with a continuous scalar field called the crack phase field d(x). The material completely loses its load-bearing capacity when d(x) reaches 1, whereas d(x) = 0 indicates the absence of any damage. A detailed explanation and implementation of the model can be found in our previous study 60 .
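The size of the design space and the fraction covered by the training data can be checked with a few lines of Python. The sketch simply evaluates the binomial coefficient C(120, 70) quoted above with the standard library; the variable names are illustrative.

```python
import math

# Number of available combinations as quoted in the text: C(120, 70).
n_configs = math.comb(120, 70)

# Fraction of the design space covered by the 100,000 training samples.
n_train = 100_000
fraction = n_train / n_configs

print(f"configurations: {n_configs:.2e}")
print(f"covered fraction: {fraction:.1e}")
```

The result is on the order of 10^34 configurations, so even a dataset of 10^5 samples covers a vanishingly small part of the design space, which is why exhaustive screening is not an option here.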

DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Fig. 7 Histogram of optimization results. Histogram of (a) stiffness and (b) strength after the optimization. The initial dataset is colored black, and the data from the 1st, 2nd, 3rd, 4th, and 5th updates of the DNN are colored red, green, blue, purple, and cyan, respectively.