High-throughput computation and evaluation of raman spectra

Raman spectroscopy is used ubiquitously in the characterization of condensed materials, spanning from biomaterials, minerals to polymers, as it provides a unique fingerprint of local bonding and environment. In this work, we design and demonstrate a robust, automatic computational workflow for Raman spectra that employs first-principle calculations based on density functional perturbation theory. A set of computational results are compared to Raman spectra obtained from established experimental databases to estimate the accuracy of the calculated properties across chemical systems and structures. Details regarding the computational methodology and technical validation are presented along with the format of our publicly available data record.


Background & Summary
Raman spectroscopy is an effective method for obtaining vibrational frequencies and local chemical bonding information in condensed materials. Due to its relative quick and non-destructive analytical nature, it is widely used in the analysis of crystalline materials and polymers -extending even into bio-medical materials and pharmaceuticals 1-3 , with advanced technologies and portable devices developed over the years. Indispensable for a complete vibrational spectrum observation, Raman spectroscopy has established an unique role in obtaining "fingerprints" for modern material science characterization. The increase in open access of experimentally measured Raman spectra databases, such as the RRUFF 4 Project and Bio-Rad's KnowItAll Raman Spectral Library, has incurred increased popularity. These resources now contain thousands of Raman spectral data of materials ranging from minerals, semiconductors to prescription drugs, which greatly facilitates the referencing process and identification of novel materials.
Compared to an experimental Raman database, a computational Raman Spectrum database from ab initio electronic structure calculations has many advantages [5][6][7] . Cheaper and more abundant computing resources enable thousands of calculations for a larger number of materials in an automated fashion, significantly reducing human effort. Systematic calculations result in a uniform data type, enabling accelerated classification of modes, discovery of correlations, and screening of useful materials. However, despite these advantages, there has been limited effort in constructing a computational Raman Spectrum database or a systematic workflow for its development. The WURM Project 8 was an early prototype, but has shown little to no activity since 2012.
In this paper, we develop a computational workflow to calculate Raman spectra using ab initio density functional perturbation theory (DFPT): an accessible and fairly accurate tool for describing lattice dynamics 9 . Following the development of the dielectric tensor components in Petousis et al. 10 , computed directly using DFPT as implemented in VASP, the Raman tensors of 55 inorganic compounds are derived. They are then used to generate the Raman spectra data reported, which is benchmarked against experimental Raman spectra. These Raman spectra are integrated into the Materials Project (MP) 11 , a free, online database with computed properties to enable accelerated materials design. While Raman spectra are still computationally expensive, this work establishes a systematic workflow and associated benchmarking of numerical parameters for automated Raman spectra calculation that can be utilized by the community at large, and help increase the availability of reference Raman spectra for a wide variety of structures and chemical systems.

Methods
Theory and definitions. In Raman spectroscopy measurements of crystalline systems, incident laser photons with frequency ω L interact with lattice vibrations, which can be described in the form of phonons. The inelastically scattered photons either exhibit a decrease or increase in frequency, resulting in symmetric Stokes or Anti-Stokes shifts, respectively. The recorded spectrum shows the intensity of the scattered light as a function of its frequency www.nature.com/scientificdata www.nature.com/scientificdata/ change, often converted to wavenumbers v in units of cm −1 , known as Raman shifts. The Raman spectrum therefore provides a "fingerprint" description of the normal modes in the crystal through the comparison of incident and scattered photon frequencies. In our work, a normal mode's wavenumber v mode is directly obtained through a DFPT phonon calculations. The intensity of the mode i with polarization along β and electric field along γ is represented [12][13][14][15] as where ω i is the vibrational mode frequency and ω L , the laser frequency, is usually arbitrarily set in computational methods since the shift in photon frequency is independent of incident photon frequency due to the nature of the is the Bose occupation number that is dependent on both temperature and frequency. In order to accommodate the most common experimental conditions, we set T = 300 K and ω L to the wavenumber corresponding to a 532 nm wavelength laser. For each mode, the Raman tensor 12,16 is obtained from the finite difference derivative of the dielectric tensor, ε βγ , with respect to the displacement u inv . m n is the mass of atom n, ν the direction of displacement, Ω the unit cell volume, and e inγ the eigenvector of the dynamical matrix. More specifically, the central difference scheme is employed as atoms are moved independently with displacement of 0.005 Å in both the positive (+) and negative (−) directions to calculate the differential of dielectric tensor with respect to displacement. However, the intensity expression in Eq. (1) is ideal for single crystals. Quite often experimental data use polycrystaline mineral specimens or powdered samples, in which case the intensity must be averaged over all possible orientations of the crystals which is accomplished by separating the total intensity I = I ⊥ + I || 17-19 into the depolarized light intensity and the polarized light intensity where the rotation invariants 17,19 are defined as For the Raman spectra of each compound, the intensities of each mode are normalized by setting the max intensity in the spectra to 100 and then scaling other mode intensities accordingly.
Computational workflow. The workflow designed for automatic Raman Spectrum calculations is outlined in Fig. 1. The relaxed material structures are obtained from the Materials Project Database 11,20,21 and serve as input structures to the following DFPT simulation. For all calculations of vibrational normal modes, DFPT simulations were run using the Vienna Ab-Initio Simulation Package, also known as VASP [22][23][24][25] , using the GGA/PBE 26 + U 27,28 exchange-correlation functional and projector augmented wave (PAW) pseudopotentials 29,30 . The k-point density was set at 3,000 per reciprocal atom and the plane wave energy cut-off at 600 eV, same as those in the calculation of dielectric tensors 10 . The structure is then displaced along the calculated normal mode eigenvectors for dielectric tensor calculations. Raman susceptibility tensors are calculated from the finite difference derivative of the dielectric tensor as shown in Eq. (2) and the collection of computational Raman analysis data is stored with all associated metadata from the Materials Project.

Data Records
computational data record. The computational data is stored in a lightweight data-interchange format, and the JSON document can be downloaded directly from the Figshare repository 31 . The nested key/value pairs can be navigated using Tables 1 and 2. The document includes the original structure data from the Materials Project, normal modes information, tensors, and calculated intensities.
www.nature.com/scientificdata www.nature.com/scientificdata/ experimental data record. The experimental data extracted from the RRUFF database is provided in a separate JSON document with nested structure as shown in Table 3. The file includes mineral type, peak locations, intensity values, RRUFF_id, and matched Materials Project IDs. To ensure identical structures, each mineral is matched to a compound from the Materials Project by fitting Crystallographic Information Files (CIF) using pymatgen tools or manually referencing to mineralogy data 32 . The RRUFF spectra data and all CIF format files used are included in the Figshare repository 31 for increased transparency and ease in reproduction of the benchmarking results.

technical Validation
Unlike evaluating the accuracy of scalar material properties, comparison of the Raman spectra requires a matching process between the calculated vibrational modes and experimental Raman spectra peaks from the RRUFF database 4 . Hence, a linearly modelled cost metric, u = w 1 (v mode − v peak ) + w 2 (I mode − I peak ) = w 1 Δv + w 2 ΔI, considering the proximity in both wavenumber and normalized intensity is used in the matching process: within the same compound, each of the computed vibrational modes is paired with a corresponding experimental peak that minimizes the overall cost. To ensure the matching process produces more accurate results, we process the calculated Raman spectra data in the following two steps.
First, for each compound's computed Raman spectra, a signal to noise threshold of 0.356% is used to remove vibrational modes with experimentally undetectable intensities. This threshold is the average noise intensity to www.nature.com/scientificdata www.nature.com/scientificdata/ maximum intensity obtained from reference experimental data. Second, for each compound, the low wavenumber region is removed from the computational Raman spectrum for both experimental and computational reasons. From an experimental point of view, the RRUFF database records different initial probing locations for different compounds' Raman spectra, which presents a noticeable challenge for the peak matching process. The average distance between the starting location of RRUFF Raman spectra and its first peak location is 113 cm −1 , which provides a location reference for the computational Raman spectra of each compound. From a computational perspective, DFPT tend to exhibit larger errors at lower wavenumbers; these modes correspond to long range oscillations that are difficult to compute within limited size of periodic unit cells. For example, these low wavenumber modes, corresponding to phonon wavelengths on the orders of microns, can be influenced by long range features such as defects, dislocations and grain boundaries that are not captured in the simulations of perfectly periodic, single crystal structures.
The data consists of 55 compounds and 205 pairs of matched peaks/modes, where the total calculation cost amounted to 9.5 Million CPU hours. The accuracy of the calculated phonon modes' wavenumber values v mode is represented in Fig. 2, where the compounds range from simple oxides, sulfides to carbonates. The average wavenumber deviation Δv for all the compounds is −9.66 cm −1 with a standard deviation of 18.58 cm −1 , which indicates that our calculations tend to slightly underestimate the Raman peak locations. With a mean absolute relative error (MARE) of −2%, we conclude that our test set is in reasonably good agreement with the experimental data available. In Fig. 3, violin plots are used to depict the wavenumber deviations grouped by mineral type. The oxide family consists 40% of the data, and shows weak underestimation of peak locations with average absolute wavenumber deviation of −7.5 cm −1 . The sulfate and carbonate families exhibit the largest wavenumber deviations. While the sulfate family's larger deviation is more likely due to the limited number of sulfate compounds used in this work, the carbonate family exhibits more abundant data for distribution analysis, showcasing in an average wavenumber deviation of −19.3 cm −1 . Most notably, we observe large wavenumber deviations of carbonates' modes at Raman shifts around 1000 cm −1 . These modes are all generated from the vibrations of oxygen anions, which could be affected by small (unknown) amounts of water in mineralogical samples, which are not accounted for in the calculations.
The intensity deviations do not show a clear bias. Most often, as seen in Fig. 4, carefully tuned weights w 1 and w 2 can yield well paired peaks and modes, yet discrepancies in intensity still remain. It should be noted that our structures from MP assume a perfectly ordered bulk crystal at 0 K, and therefore extrinsic factors associated with temperature, pressure, defects, and phonon anharmonicity are not considered. These effects on the dielectric tensors are described in detail in Petousis et al. 33 , which carries over to the Raman tensors and eventually affects the calculated intensities. Interfaces and defects in particular are expected to be a strong scattering center for phonons which could account for much of the intensity variation as the phonon population deviates from a simple Bose-Einstein distribution.
Besides the extrinsic effects mentioned above, there are other factors that could contribute to discrepancies between computational and experimental values. One such effect is that of the use of the GGA/PBE exchange-correlation function, which has been shown to exhibit lattice under binding 34 , yielding larger bond lengths, lattice parameters and unit cell volumes of the relaxed structures. We investigated the correlation between the unit cell volume expansions % and Raman peaks' wavenumber discrepancies (cm −1 ) to determine if the GGA/   PBE known under-binding causes a systematic and significant deviation of the computed Raman data. Despite an average volume expansion of 4.94%, the weak correlation between the two (+0.09), as seen in Fig. 5, suggested that the effects of GGA/PBE is not significant in this work. Therefore, we do not attribute the observed wavenumber discrepancies to the use of GGA/PBE exchange correlation function. Other salient differences between computational assumptions and experimental conditions may also contribute to the discrepancies. For example, the applied rotation invariant intensity equation is an average result considering every possible scattering geometry, and therefore assumes a powdered or isotropic poly-crystalline sample. However, the quality of mineral samples do not always satisfy these ideal conditions, which may allow specific sample orientations or laser polarization directions in experimental measurements to create discrepancies in Raman spectra.

Usage Notes
An automatic workflow for ab initio electronic structure Raman spectra is expected to be of interest to the broad community of materials science and chemistry. We expect this growing dataset to be used in the understanding of Raman characterization and provide a reference for experimentally measured Raman data. Furthermore, our work opens up opportunities in the development of a data intensive approaches. For example, the application of machine learning techniques on existing spectra may be utilized to enhance the data set as well as to improve the peak matching process. Such work provide the basis for quick identification by screening through compounds on the Materials Project website, comparing their computational Raman spectra data to the reference experimental counterpart. Still, we emphasize that the computational cost for developing this database is rather significant. While this work illustrates the promise of a computational Raman spectra database, further work is necessary to reduce the computational cost to access truly large data sets needed for machine learning approaches.