High-throughput Density Functional Perturbation Theory and Machine Learning Predictions of Infrared, Piezoelectric and Dielectric Responses

Many technological applications depend on the response of materials to electric fields, but available databases of such responses are limited. Here, we explore the infrared, piezoelectric and dielectric properties of inorganic materials by combining high-throughput density functional perturbation theory and machine learning approaches. We compute {\Gamma}-point phonons, infrared intensities, Born-effective charges, piezoelectric, and dielectric tensors for 5015 materials in the JARVIS-DFT database. We find 3230 and 1943 materials with at least one far and mid-infrared mode, respectively. We identify 577 high-piezoelectric materials, using a threshold of 0.5 C/m2. Using a threshold of 20, we find 593 potential high-dielectric materials. Importantly, we analyze the chemistry, symmetry, dimensionality, and geometry of the materials to find features that help explain variations in our datasets. Finally, we develop high-accuracy regression models for the highest infrared frequency and maximum Born-effective charges, and classification models for maximum piezoelectric and average dielectric tensors to accelerate discovery.

As a result, the community is gradually migrating towards systematic computation-driven materials selection paradigms, 4,5,[8][9][10][11][12][13][14][21][22][23][24][25][26][27][28][29] where functional materials are screened by establishing a direct link between the macroscopic functionality and the atomic-scale nature of the material. We are in a data-rich, modeling-driven era where trial and error approaches are gradually being replaced by rational strategies, 24,30-32 which couple predictions not only from specific electronicstructure calculations of a given property, but also by learning from the existing data using machine learning. Subsequent targeted experimental synthesis and validation provides a means of rapid iteration to verify and improve computational models.
Applications of these computational techniques in a high-throughput manner have led to several databases of computed geometries and many physicochemical properties such as AFLOW 33 , Materials-project 34 , Khazana 17 , Open Quantum Materials Database (OQMD) 35 . Despite a few systematic experimental databases of IR data (such as https://webbook.nist.gov/chemistry/vibser/), a systematic investigation of IR for inorganic materials is still lacking. Similarly, there have been only a few systematic databases developed for Piezoelectric (PZ) and dielectric (DL) materials such as by De Jong et al 36 , Petousis et al 37 Roy et al. 38 , and Choudhary et al 39 . In this work, we significantly expand the scope of these previous efforts by developing systematic databases for infra-red (IR) absorption spectra, piezoelectric tensors, and dielectric tensors.
Vibrational spectroscopy based on infrared (IR), Raman and neutron scattering are ubiquitous methods to probe the chemical bonding, and thus the electronic structure of a material. Infrared frequencies are classified in three categories: far (30-440 cm -1 ), mid (400-4000 cm -1 ) and near (4000-14000 cm -1 ) IR frequencies. Traditionally, IR spectroscopy was only used to probe organic materials, but with the availability of instruments capable to detect frequencies 600 cm -1 , IR spectroscopy has also been proven successful in distinguishing phases of inorganic compounds [40][41][42] , thermal imaging 43 , infrared astronomy 44 and food quality control 45 .
The piezoelectric tensor (PZ) 46 describes the change in the polarization of a material due to mechanical stress or strain, or conversely, the change in stress or strain due to an external electric field. Similarly, the dielectric tensor 47,48 describes the change in polarization due to the electric field. A related quantity, the Born-effective charge (BEC) tensor, describes first order response of atomic positions to an electric field. All of the PZ tensor-components are zero for materials that have inversion symmetry, leaving 138 space groups with non-zero PZ response while DL tensors and BEC tensors can have non-zero components regardless of symmetry. However, even when non-zero values are allowed, symmetry still strongly restricts the structure of these tensors.
While IR data can be used in infrared-detector design, the associated piezoelectric (PZ) 49 , dielectrics (DL) 50 and Born-effective charge data can be utilized for designing sensors, actuators and capacitors [50][51][52] . In addition, there is significant interest in finding lead-free piezoelectric materials to replace lead zirconate titanate (PZT) in various applications. All the above quantities can be obtained from density functional perturbation theory calculations (DFPT) [53][54][55][56][57] . The DFPT is a well-known technique to effectively calculate the second derivative of the total energy with respect to the atomic displacement. The computational phonon-spectroscopy method has been recently shown to be an effective means for characterizing materials, as discussed by Skelton et. al. 62 and Kendrik et. al. 58,59 . The information obtained using the DFPT method can also be utilized for predicting piezoelectric coefficients, static dielectric matrix and Born effective charges, as described by Gonze et. al. 61 and Wu et. al. 62 To validate our calculations, we compare them to the handful of available experimental data and show the uncertainty in predictions. We also identify materials that we predict have exceptional IR, PZ or DL properties, which may be good candidates for experimental synthesis and characterization. Finally, we develop high accuracy supervised machine learning models based on the classification and regression methods to pre-screen high-performance materials without performing additional first principles calculations. This work is a continuation of our previous datasets for exfoliability 60

Infrared intensity
As mentioned above, infrared intensity is important for thermal-imaging, infrared-astronomy, food-quality quality control. Using Eq. 7 (see Methods section) we calculate the IR intensities of 3411 materials with bandgap > 0.1 eV and positive phonon frequencies at Γ-point. We compare nine experimental (Exp.) IR modes with the DFPT results to estimate the data-uncertainty. The comparison is listed in supporting information (see Table S1). Based on the data in table S1, the difference in IR mode location between our calculations and experiments has a mean absolute deviation of 8.36 cm -1 . This difference is small compared to the total range of IR frequencies, from 0-1000 cm -1 . Several previous studies also report the difference between DFPT computed frequencies and measured peaks in this range [65][66][67] . In addition to intrinsic limitations in the DFTexchange-correlation function, this small discrepancy may be caused by the fact that our calculations are carried out at 0 K while experiments are typically done at 300 K, or that experimental systems can contain defects and impurities, which are absent in our calculations.
Given the strong agreement between theory and experiment, we expect that our carefully curated database will be useful for materials characterization.
To analyze the overall trend in the IR data, we plot the results for all materials together in Fig. 1.
We observe that most of the modes are less than 1500 cm -1 which is consistent with the fact that inorganic materials generally have much lower IR frequencies than organic and soft materials, which typically contain covalently bonded light elements, leading to higher vibrational frequencies. Moreover, a close look at the dataset suggests that 3230 materials have at least one far-IR mode, and 1943 materials have at least one mid-IR mode. As expected, we couldn't find any material with near-IR modes, as the largest frequency we find is 3764 cm -1 for Mg5(HO3)2 (JVASP-13093). We find 41 elemental systems such as crystalline nitrogen (JVASP-25051), krypton (JVASP-907), xeon (JVASP-25276) etc. have no IR peaks. This can be attributed to the acoustic sum rule (ASR) for Born effective charges, which forces the total Born effective charge contain hydrogen and the next highest set of frequencies are obtained for compounds with C-N bonds, which is reasonable, as ꞷ~ m -1/2 (where ꞷ is frequency and m is the atomic mass). These trends may be useful as a starting point in identifying new materials for infrared-related devices, like detectors, sensors and lenses.
Next, we compare the DFPT and finite-displacement method (FDM) phonon frequencies (obtained from our elastic-constant database 61 ) for 2926 materials and 72624 phonon modes (as shown in Fig. 1b). We find that the DFPT and FDM compare very well (Pearson correlation coefficient of 0.99), and significant differences are only found for a few molecular systems such as crystalline H2, N2 and for a few lanthanide oxides (such as ErBiO, JVASP-49979). This consistency suggests an overall high quality of our computational dataset.
In Fig. 1c, we analyze the space-group variation of materials with at least one far-IR (blue) and mid-IR (green) peaks. Some of the high-symmetry space-groups with a high number of materials with far-IR peaks are 36, 80,129, 166 and 216. As shown in Venn diagram in Fig. 1d, we find that a large number of materials with far-IR peaks are chalcogenides (O, S, Se, Te-based compounds); however, our database of inorganic materials is generally biased towards oxides (39% are chalcogenides, 14% are halogens,19% pnictides and 28% others), and we find many examples of halides and pnictides with far-IR peaks as well. Halides such as MgF2, LiF etc. have been used in several astronomical telescopes such as Hubble telescope 68 for infrared astronomy, and non-oxide chalcogenides such as CdTe have been used for infrared night vision cameras. We find that denser materials generally have lower phonon frequencies, which can be observed in the Fig. 1e. This can be explained by the fact that the denser materials have heavier atoms leading to lower phonon frequencies for fixed spring constants. Conversely, high IR modes seem to require low-density materials. Finally, we analyzed how the dimensionality 61 of materials affect the far-IR peaks ( Fig.1f). While we find that most of the far IR-peak materials have 3D bonding, we also determine that a significant fraction (20.9%) are low-dimensional, which is slightly higher than their overall representation in the database of 17.2%. The low/vdW-bonding is determined based on the bondtopology or lattice constant criteria. 61 This result suggests that vdW-materials can be used when designing, for instance, ultrathin IR absorbers.

Fig. 1 Analysis of the IR-data. a) IR peaks for all the materials in the database, b) Comparison of finite-difference (FDM) and DFPT phonon frequencies for conventional cells, c) space-group distributions of materials with at least one far (blue) and mid (green) IR peaks, d) Venn-diagram of the chemistry of materials containing chalcogenides, halides. or pnictides. materials, e)
Minimum frequency vs density of the system, f) dimensionality analysis of the far-IR materials.

Piezoelectric properties
The piezoelectric effect is a reversible process where materials exhibit electrical polarization resulting from an applied mechanical stress, or conversely, a strain due to an applied electric field.
Common applications for piezoelectricity include medical applications, energy harvesting devices, actuators, sensors, motors, atomic force microscopes and high voltage power generation 46  classes of materials, such as oxides, nitrides, and sulfides, and in several crystal structures. We find that the mean absolute deviation in max (eij) is 0.79 C/m 2 , which is reasonable at least for initial screening purposes. The relaxed-ion contribution to the piezoelectric tensor is proportional to the inverse dynamical matrix (see Eq. 5 in Methods section), which makes this contribution very sensitive to low-frequency IR modes. These modes are often very sensitive to temperature changes and defects, especially in ferroelectric materials, which makes a direct comparison between theory and experiment challenging 69  As depicted in Fig. 2c we compare the stress and strain-based PZ. We find that there is no obvious relation between the stress and strain PZ tensors, implying that an accurate compliance tensor is essential to predict the dij values. It is important to note that PZ strain tensors are the components which are generally measured during the experiments. Next, in Fig. 2d we analyze the chemistry of high PZ materials (using 0.5 C/m 2 as a threshold). Like the IR data, the high PZ materials are dominated by chalcogenides, with very few halides. We find most of the materials with high PZ are 3D, with low-dimensional materials under-represented. Analyzing the crystal systems of the high-PZ materials, we observe that orthorhombic systems are the most represented crystal system ( Fig. 2e), while space group 216 is the most common space group. As the PZ is a tensor quantity, we analyze the tensor component distribution for the whole dataset in Fig. 3a  we compare the dielectric constants from this work to the low-frequency limit of the linear optics method we used to calculate the frequency dependent dielectric function in our previous work 39 .
The electronic part of the DFPT dielectric constant is in close agreement with that obtained from the linear optics method. Importantly, the ionic contribution to the static dielectric constant is frequently larger than the electronic contribution. Our dataset of electronic and ionic components can guide experimentalists to choose a materials with either high and/or low ionic-contribution compounds. As expected, the electronic part of the dielectric constant and the bandgaps are inversely related 70 leading to the behavior in Fig. 4c. Electronic bandgap properties are already discussed in detail in our previous work 39 . The dataset suggest that most of the high avg materials are 3D, with again a high number of chalcogenides (Fig. 4d). However, the low-dimensional dielectric materials could be of great technological importance, because the capacitance of a layer is generally inversely proportional its thickness, potentially allowing for ultrathin vdW-bonded devices. We find similar crystal system trends for high DL and PZ materials. In Fig. 4e and 4f we show the crystal system and dimensionality trends of the screened materials. We observe that trigonal and tetragonal crystal systems are highly favored for the high DL materials. Similar to the PZ data, most of the high PZ value materials in our database are 3D. The electronic part of the dielectric constant can be directly used to estimate other physical properties such as the refractive index and the birefringence, which will be analyzed in detail in a follow-up work in future. is evident that almost all the high-response materials have high BEC and low-frequency modes at Γ-point. We note that while anomalous BECs are often a signature of ferroelectric materials, we remove materials with unstable phonon modes at Γ in this work.

Fig. 5 Born effective charge distribution and its relation to PZ, DL and bandgaps. a) Histogram of Born-effective charge data, b) histogram of maximum BEC and formal charge c) BEC wrt bandgap with color-coded max PZ, d) BEC with respect to the min-IR-frequency with color-coded ionic part of dielectric constant.
Next, we depict the trends across the periodic table of the properties investigated in this work in Fig. 6. In order to understand the contribution of various elements to a given property, we weigh an element in a material one or zero depending on whether the material has a maximum-IR for far-IR peak (Fig. 6a), has PZ value greater than 0.5 C/m 2 (Fig. 6b), has dielectric constant more than 10 ( Fig. 6c) and has BEC more than 5 or not (Fig. 6d). After such weighing for all the materials in our dataset, we calculate the probability that an element is part of a high-value material using the  Then the percentage probability of finding the element in a high-value material is calculated.

Machine learning
Recent advances in data infrastructure, statistics, machine learning, and computational methods has led to an explosion of computed data in the field of materials science. 19,20,72,73 Here, we apply machine learning classification model using classical force-field inspired descriptors (CFID) 74 , gradient-boosting decision trees (GBDT) 75,76 to our dataset. We partition the complete dataset of highest IR frequency, highest Born-effective charge in a system, highest PZ tensor value and average dielectric tensor in 90%-10% train-test divisions. To obtain optimized parameters for each model, we carry out hyper-parameter optimization with five-fold cross-validation on the training data to obtain optimized parameters for each model. First, we train a regression model for predicting the highest IR frequency mode of a material. For the regression model on the 10% held set, we get a mean absolute error (MAE) and r 2 as 67.8 cm -1 and 0.96, respectively. A scatter plot of the performance on the held set is shown in Fig. 7a. While the MAE of the ML model of 67.8 cm -1 is about 8 times higher than the MAE of DFPT with respect to experiments, the high r 2 value and obvious trend in Fig. 7 shows that the maximum IR frequency can be predicted using ML.
GBDT inherently allows accessing the importance of all the features of the models. We analyze some of the important features for each model. Feature importance analysis shows that some of the most important features for this model are radial distribution function components especially at 1.5 and 5.9 Å, electron affinity, atomic radii of elements and so on.
Similarly, we train a regression model for maximum Born effective charge in materials. Predicting for the test dataset, we find the MAE and r 2 in BEC as 0.6 and 0.76 respectively as shown in Fig.   7c, showing that a significant portion of the BEC can be predicted using ML. Given the non-trivial relationship between the BEC and the formal charge, shown in Fig. 7b, it may be difficult to improve this model. Some of the important features are the radial distribution function at 3.4 Å, 6.2 Å, angle distribution at 143 o and dihedral angle at 95 o . We provide the learning curves for the above two regression models in supporting information (see Fig. S1). The learning curve shows how the model performs as we add more training data. It is evident from Fig. S1 that the model error decreases or the model improves as we add more data suggesting that as our database further increases, we could further improve the ML models. We try similar regression models for maximum value of the PZ tensor (MAE: 0.47 C/m 2 ) and average DL tensor (MAE: 4.91), but the regression models perform poorly with r 2 <0.2. Therefore, we train classification models for predicting the high PZ and high dielectric models using a threshold of 1 C/m 2 for PZ and 10 for the dielectric constant instead of predicting the exact values like the regression models for IR and BEC. In classification models, the accuracies of the classification models are evaluated based on the area under curve (AUC) of the receiver operating characteristics (ROC) curves (Fig. 7b, d).
The ROC curve illustrates the model's ability to differentiate between high and low-performance materials. We find ROC AUCs of 0. 86

Conclusions
We have established one of the largest datasets of infrared, dielectric and piezoelectric properties.
We verify our workflow by comparing our computational results with several experimental measurements and alternative computational techniques, finding strong agreement. Using this database, we analyze the trends in these properties in terms of the dimensionality of materials (0D/1D/2D/3D), space-group, and chemical constituents, and we find various correlations that both match expected trends based on chemical and physical reasoning, and that may help design improved materials. In addition, we note many specific interesting and potentially technologically relevant materials, including infrared absorbing materials, high dielectric constant materials, and high PZ materials, including lead-free materials. We also develop machine learning models that can both pre-screening new materials as well as identify important material descriptors. We believe that our results, workflow and tools can act as a guide for the experimental synthesis and characterization of various next generation materials.

Fig. 8 Flow-chart portraying different steps for the DFT and ML methods.
The DFT calculations were carried out using the Vienna Ab-initio simulation package (VASP) 78,79 . The entire study was managed, monitored, and analyzed using the modular workflow, which we have made available on our github page (https://github.com/usnistgov/jarvis). Please note that commercial software is identified to specify procedures. Such identification does not imply recommendation by the National Institute of Standards and Technology. We use the projected augmented wave method 80,81 and OptB88vdW functional 82 , which gives accurate lattice parameters for both and non-vdW (3D-bulk) solids 83 . In this work, a material is defined as lowdimensional if it contains vdW-bonding in one (2D-bulk), two (1D-bulk), or three (0D-bulk) crystallographic directions. 83,84 . Both the internal atomic positions and the lattice constants are allowed to relax until the maximal residual Hellmann-Feynman forces on atoms are smaller than 0.001 eV Å −1 . The k-point mesh and plane-wave cut-off were converged for each materials using the automated procedure in the JARVIS-DFT 85 . We assume that achieving absolute convergence in energy is sufficient for obtaining reasonable DFPT results, and this assumption is supported by the agreement between frozen-phonon/finite-difference/finite-displacement method (FDM) and DFPT as well as linear-optics and DFPT results (discussed later). We carry out the DFPT calculation on the standard conventional cell for each material 86 . DFPT calculations, as implemented in the VASP code, were used to determine the Born effective charge tensors and the phonon eigenvectors were determined using phonopy code 67 .
To down-select the number of materials to investigate, we start with screening materials with The dielectric constant can be derived from the dielectric susceptibility using: In Eq. 4 and 5, the first term represents the electronic contribution and the second term the ionic contribution for DL and PZ constants respectively.
The PZ is a 3x6 tensor, the DL 3x3 and the BEC Nx3x3 tensor. The IR intensity of phonon modes is calculated using: where ( , ) is the normalized vibrational eigenvector of the nth phonon mode of the s th atom in the unit cell, and α, β are the cartesian coordinates.
( ) is the Born effective charge tensor of s th atom (here we explicitly write both the cartesian indices of Z). These approaches are universal and have been already applied to various material classes 41,58,87,88 .More details about the DFPT formalism can be found in elsewhere 53,54 .
Our machine-learning models were trained using gradient boosting decision trees (GBDT) 75,76 and classical force-field inspired descriptors (CFID) descriptors 74 using a five-fold cross-validation grid search on the 90% training set. Using the best model found during the grid search, we test the model on the 10% held set and report the performance. The accuracy of the regression models and classification models were evaluated using mean absolute error and receiver operating characteristics (ROC) curves, respectively. The principal idea behind the GBDT algorithm is to build new base learners to be maximally correlated with the negative gradient of the loss function associated with the whole ensemble. The CFID approach gives a unique representation of a material using structural (such as radial, angle and dihedral distributions), chemical, and charge properties for a total of 1557 descriptors. We trained machine learning regression models to predict the highest IR frequency and maximum BEC of a material and classification models to predict whether a material has high PZ coefficient (>0.5 C/m 2 ) and dielectric constant (>20).