Data-driven prediction of diamond-like infrared nonlinear optical crystals with targeting performances

Combining high-throughput screening and machine learning models is a rapidly developed direction for the exploration of novel optoelectronic functional materials. Here, we employ random forests regression (RFR) model to investigate the second harmonic generation (SHG) coefficients of nonlinear optical crystals with distinct diamond-like (DL) structures. 61 DL structures in Inorganic Crystallographic Structure Database (ICSD) are selected, and four distinctive descriptors, including band gap, electronegativity, group volume and bond flexibility, are used to model and predict second-order nonlinearity. It is demonstrated that the RFR model has reached the first-principles calculation accuracy, and gives validated predictions for a variety of representative DL crystals. Additionally, this model shows promising applications to explore new crystal materials of quaternary DL system with superior mid-IR NLO performances. Two new potential NLO crystals, Li2CuPS4 with ultrawide bandgap and Cu2CdSnTe4 with giant SHG response, are identified by this model.

. However, they suffer from some intrinsic shortcomings, e.g., low laser damage threshold (LDT) 6 , strong two-photon absorption 7 , and difficult/dangerous crystal growth 8 , which hinder their wide applications in high-power laser industry. Therefore, it is still a challenging task to promote and develop new mid-IR crystals with superior NLO performances.
Generally, a practical mid-IR NLO crystal should meet the following requirements 9 : (i) good IR transparency in the important mid-IR atmospheric window (3-5 μm and 8-12 μm); (ii) large SHG coefficient d ij (at least larger than 10 × KDP (d 36 ≈ 0.39 pm/V) and at best larger than AgGaS 2 (d 36 ≈ 13.4 pm/V)); (iii) high LDT. For a good mid-IR NLO crystal, the energy band gap E g should be more than 3.0 eV; (iv) moderate birefringence Δn (∼0.03 − 0.10); (v) easy crystal growth and chemical stability. It should be noted that a critical problem for the development of mid-IR NLO crystals is to fulfill the suitable balance between the large SHG response (d ij ) and enough band gap (E g ) in the light of their inverse dependence. Though many researchers aimed at mid-IR crystals and involved structure-property relationship in recent years, the traditional trial-and-error experiments and first-principle simulations are still time-consuming and laborious. Therefore, first-principles prediction combining high-throughput screening and machine learning is a burning issue for rapid development of mid-IR NLO crystals.
With the introduction of "Material Genome Project" 10 , it has become a new research hotspot in the field of material science for combining High-Throughput Computing (HTC) with Machine Learning (ML) models to clarify the inherent structure-property relationship of materials and to accelerate the research and development of new materials, especially of functional materials, such as thermoelectric materials, low dimensional materials, solar cell, superconductors and superhard materials [11][12][13][14][15][16][17][18]

Dataset
In past few years, metal chalcogenides with DL structures have received increasing attention as they provide an attractive performance tuning dataset for a range of important applications, including lithium-ion conductors 21 , band gap adjustable semiconductors 22 , solar cells 23 , thermoelectric materials 24 and nonlinear optics 25 . Several commercially NLO crystals belong to the DL structure. AgGaS 2 exhibits a large SHG effect (~13 pm/V) and possesses a wide transparent window (from 0.74 to 13 μm); it also acts as a standard crystal to evaluate other crystals' performance 26,27 . ZnGeP 2 (ZGP) is the most important and widely used NLO crystal in 3-5 μm due to its multiple advantages 28 . CdGeAs 2 owns the currently largest SHG coefficient (d 36 = 236 pm/V) and has been successfully applied to the difference-frequency generation (DFG) and optical parametric generation (OPG) 6,29 . In 2017, Liang et al. 30 systematically summarized and analyzed the structure and property relationships in mid-IR NLO metal chalcogenides, and proposed that polar aligned DL structure would be the most favorable system due to its large band gap, sufficient SHG effect, moderate birefringence, and good crystal growth habit and chemical stability. After then, quite a few experiments demonstrated this prediction, such as for Hg 2 GeSe 4 31 , Li 4 HgGe 2 S 7 32 , and Ga 2 Se 3 33 . Therefore, the similar stacking type and abundant composition in the DL structure definitely enables the exploration for superior mid-IR crystals in the context of ML and HTC.
In this paper, we mainly considered three classes of DL crystals, AX, ABX 2 and A 2 BCX 4 (A, B, C = IA, IIA, IIIA, IVA, IB, IIB cations; X = VA, VIA, VIIA anions) which belong to the space groups F-43m,I-42d and I-42m, respectively. As shown in Fig. 1a,b, the splitting of the cation site in AX leads to the ternary compound ABX 2 and quaternary compound ABCX 4 , in which all of their basic structural units are tetrahedral. In these three kinds of crystal structures, all anion-centered tetrahedrons are arranged along the [111] direction, which is very beneficial for the superposition of microscopic second-order susceptibility of units. After a screening (Fig. 1c) in the Inorganic Crystallographic Structure Database (ICSD) 34 , totally 61 DL structures (26 binary compounds, 29 ternary compounds and 6 quaternary compounds) were collected. It should note that all these compounds, except the quaternary compound Li 2 CuPS 4 , was experimentally synthesized and their crystal structure data has been determined.

first-principles Methods
To provide the learning data for machine learning, the first-principles simulations were fulfilled by the plane-wave pseudopotential method 35 based on density functional theory (DFT) 36 implemented in the CASTEP module 37 . The cell parameters and atomic positions in the unit cells of all crystals were firstly optimized using the BFGS method 38 with the convergence criterion of 5×10 -6 eV/atom, 0.01 eV/Ǻ, 0.02 GPa, and 5.0 × 10 −4 Ǻ for energy change, maximum force, maximum stress, and maximum displacement, respectively, between two consecutive processes. The exchange-correlation functionals were described by the local density approximation (LDA) 39 . A kinetic energy cutoff 880 eV and Monkhorst-Pack k-point meshes 40 spanning less than 0.07 Ǻ −3 in the Brillouin zone were chosen. We used the screened-exchange local density approximation (sx-LDA) 41 in the calculation of electronic structure in order to obtain the accurate band gap E g . The static SHG coefficients d ij are calculated using an expression originally proposed by Rashkeev et al. 42 . It has been revealed from previous researches that our first-calculations provided a good estimate of NLO properties in IR sulfide crystals, as demonstrated in LiGaS 2 43 , AgGaS 2 44 , BaGa 4 S 7 9 , and LiZnPS 4 9 . The calculated values of band gaps E g and SHG coefficients d ij for the common DL NLO materials are listed in the supplementary information file.

Machine Learning Methods
Random forest regression. We propose to use Random Forests Regression (RFR) to predict the target variable, SHG coefficients in the selected NLO crystals (Fig. 2). For the RFR, the Scikit-learn package in Python was used 45 .
RF is an integrated algorithm that combines multiple decision trees, with the advantages of good generalization performance, insensitivity to data outliers and fewer hyper parameters. RFR's training procedure was first proposed by Breiman 46 . (1) Acquire a random bootstrap sample from the data set. (2) For each bootstrap sample, nurture a tree with the following rule: at each node, find the best split point by a specific feature. The split criterion is to maximize the Information Gain (IG), which can be defined (for a binary split) as x i is the feature, N p , N left and N right are the number of sample at the parent node and two child nodes, respective, I p , I left and N right represent the impurity function for the parent and child nodes, respectively. The impurity index I(t) at the node t can be calculated as∑ t N t is the number of sample at t node, y (i) is the true target value and ŷ t is the average target value of the sample. The RFR model uses the mean squared error (MSE) criteria to nurture every decision tree and the average value of the decision trees predicts the target variable. feature selection. The representations of a crystal, called "descriptors" or "features", play an indispensable role in applying ML model to predict its physical properties. In materials science, descriptors can be divided into elemental or structural representations 47 . The selection of descriptor candidates is an essential part in www.nature.com/scientificreports www.nature.com/scientificreports/ constructing a ML model, which need not only satisfy the necessary specific requirements (e.g. dimensional invariance for chemical compositions), but also reflect physical meanings related to the target variable. Based on the existing knowledge, we chose four features to represent the DL crystals, i.e., band gap, flexibility index, ionic group's volume and the electronegativity of anions.
(i) Energy band gap Eg. The band gap is a physical quantity closely related to the SHG coefficients. Generally, for the similar structure and composition, they have a negative correlation48. The band gap of a crystal can be acquired by first-principles simulations or experiments (e.g. Photoluminescence). Here, we choose the experimental value of the band gap as one of features. (ii) Flexibility index F. In 2014, Jiang et al. 49 proposed a flexible dipole model based on the concept of bond-valence. The results showed that the magnitude of NLO effects was determined by the compliance of the dipole moment in response to external disturbances. "Flexibility index" between two connected atoms is defined as where R l is the distance of two atoms, R 0 is the tabulated ideal bond length in the bond-valence theory, f(x i ) is the predicted value of the model, y i is the target variable and Var(y) is the variance of the sample data. A model with smaller RMSE and R 2 closer to 1 will have a higher level of prediction ability.

Results and Discussion
We selected 6 typical NLO crystals (CdSe, GaAs, AgGaS 2 , AgGaSe 2 , ZnGeP 2 , CdGeAs 2 ) and 3 new quaternary compounds ( Table 1). The rest 48 crystals in the dataset were randomly split using "bootstrapping" to produce the training and validation set. After repeating 100 random "bootstrapping" and training, we found that the RFR model with the number of estimators 50 yielded a small RMSE as well as a high R 2 (indicated in the black circle in Fig. 3a). Then, the test set was input into the trained RFR model with 50 estimators. Figure 3b shows the RFR predicted results and their comparison with first-principles or experimental values. The results in Fig. 3 and Table 1 demonstrate that the RFR model performs well on the studied DL NLO crystals. The predicted SHG coefficients are in good agreement with the first-principles values, proving that the RFR model has reached within the accuracy of first-principles simulations.
Notably, the RFR model does not need the scissors operator which is vital in the first-principles simulations. When calculating the optical properties of crystals with DFT, the scissors operator is usually used to shift upward all the conduction bands to agree with the experimental band gap 9 . But for those crystals with a small band gap E g (<1 eV), this type of scissors sometimes make the first-principles simulation fail to give the accurate optical properties 51 . Another useful method to determine the scissors operator is by aligning the first peak of the calculated conduction bands with the corresponding experimental peak. In Table 1 the first-principles SHG coefficient values using these two kinds of scissors operators and the RFR results for three NLO crystals with small E g (CdGeAs 2, Cu 2 CdSnSe 4 and Cu 2 CdSnTe 4 ) are listed. CdGeAs 2 is an important ternary NLO crystal because the experiments demonstrated that it has extremely high SHG coefficient (d 36 = 236 pm/V) 29  www.nature.com/scientificreports www.nature.com/scientificreports/ of scissors operator gives d 36 = 904.08 pm/V, while the second gives d 36 = 194.04 pm/V. The remaining difference between the calculation and the experiment may come from the fact that the first-principle parameters are selected to balance the cost of time and accuracy. But clearly, the latter first-principles result agrees better with the experimental value, which is also correctly predicted by the RFR model (254.32 pm/V). The similar situations occur in the cases of Cu 2 CdSnSe 4 and Cu 2 CdSnTe 4 ; the comparison of the first-principles results by adopting the first and second types of scissors operators for these two compounds are 223 pm/V vs. 71.80 pm/V (Cu 2 CdSnSe 4 ) and 528 pm/V vs. 209.05 pm/V (Cu 2 CdSnTe 4 ), respectively. Considering the fact that the experimental value of Cu 2 CdSnS 4 is d 36 = 31 pm/V, the choice of the second type of scissors operator is more reasonable. In comparison, the RFR predicted SHG coefficients for Cu 2 CdSnSe 4 and Cu 2 CdSnTe 4 are d 36 = 61.68 pm/V and 239.05 pm/V, respectively, which are also very reasonable and independent of the choice of scissors operator. Therefore, the RFR model bypasses the scissors operator problem presented in the first-principles calculations and can obtain the accurate predictions, since it performs on the basic chemical and physical information in crystals.
Moreover, it should be emphasized that our RFR model is successful for the prediction of SHG coefficients in considerable variation of chemical constituents from binary, ternary to quaternary crystals. Thus, this method has the good capability to explore new NLO crystals in the DL system. In particular, two new NLO crystals, Li 2 CuPS 4 and Cu 2 CdSnTe 4 with superior mid-IR NLO performances were identified by this model. Cu 2 CdSnTe 4 crystallizes in the stannite structure type with the space group I-42m. Dong et al. 52 synthesized this compound by direct reaction of the corresponding elements and discussed its temperature dependent transport properties. In addition, the structural, optoelectronic and thermoelectric properties of Cu 2 CdSnTe 4 were theoretically studied, which showed that this compound is a potential candidate in the fields of solar cell and thermoelectric 53 . However, the NLO properties of Cu 2 CdSnTe 4 have not been paid attention. We predict that Cu 2 CdSnTe 4 is a promising NLO crystal with a band gap 0.8 eV and the SHG coefficient value d 36 = 239.05 pm/V. As a comparison, CdGeAs 2 owns the currently largest SHG coefficient value (d 36 = 236 pm/V) in known inorganic materials, but its band gap is only 0.57 eV. Therefore, Cu 2 CdSnTe 4 is expected to have a higher LDT than CdGeAs 2 in the case of comparable SHG effects. We encourage further experiments to verify our predictions.
Li 2 CuPS 4 was predicted by Zhu et al. 54 as a sulfide-based super ionic conductor with the kesterite structurel. The higher ionic conductivity arising from the smaller Li ion binding and the reduced electronegativity difference between the anion element and non-lithium cation elements enables Li 2 CuPS 4 as a promising solid-state electrolyte material. Though this crystal hasn't been synthesized experimentally, the first-principle calculations have provided the reliable structure information and electronic properties. Its calculated band gap is 3.3 eV using the HSE06 hybrid functional. After inputting the related features of Li 2 CuPS 4 , our model predicts that it has the SHG coefficient 7.66 pm/V (~0.6 × AGS), which can be a potential NLO crystal once experimentally synthesized. Compared to AGS (E g = 2.64 eV), Li 2 CuPS 4 is likely to have the smaller SHG response but the higher LDT, which accords with the inverse relationship between the band gap and the SHG response.
conclusion An HTC and ML workflow has been designed and performed to screen for DL crystals with good balance on the band gap and the SHG coefficient. By selecting four distinctive descriptors, i.e., band gap, electronegativity, group volume and bond flexibility, the predicted results using RFR model are in good agreement with the first-principle calculations, especially on some representative crystals like AgGaS 2 , ZnGeP 2 and CdGeAs 2 . More interestingly, this fast workflow is independent of the selection of scissors operators, making it more practical. Additionally, this model can be used to facilitate the research of new DL systems. Two unexplored quaternary crystals with good NLO properties, Li 2 CuPS 4 and Cu 2 CdSnTe 4 , are predicted by this model and wait for the experimental investigations. In summary, this new method opens opportunities for the fast design of NLO crystals with targeting properties.