High-throughput computation of Raman spectra from first principles

Raman spectroscopy is a widely-used non-destructive material characterization method, which provides information about the vibrational modes of the material and therefore of its atomic structure and chemical composition. Interpretation of the spectra requires comparison to known references and to this end, experimental databases of spectra have been collected. Reference Raman spectra could also be simulated using atomistic first-principles methods but these are computationally demanding and thus the existing databases of computational Raman spectra are fairly small. In this work, we developed an optimized workflow to calculate the Raman spectra efficiently and taking full advantage of the phonon properties found in existing material databases. The workflow was benchmarked and validated by comparison to experiments and previous computational methods for select technologically relevant material systems. Using the workflow, we performed high-throughput calculations for a large set of materials (5099) belonging to many different material classes, and collected the results to a database. Finally, the contents of database are analyzed and the calculated spectra are shown to agree well with the experimental ones.

spectra are compared to those obtained using previous computational methods as well as to the experimental ones reported in the literature. The database of Raman spectra and vibrational properties reported along with this paper consists of 5099 compounds from many different material classes, far surpassing in size the previous computational databases and comparable to the experimental ones.

Simulation of Raman spectra
In Raman spectroscopy measurements, incident laser photons with a specific frequency ω L interact with lattice vibrations, described in the form of phonons in crystalline materials, and the spectrum of inelastically scattered photons are recorded. Scattered photons exhibit either a decrease in frequency ω S upon creation of phonon or increase in frequency upon annihilation of a phonon, denoted as Stokes or Anti-Stokes shifts, respectively. The intensity of the peaks is related to the Raman scattering cross section, which can be challenging to calculate since the ion (and electron) dynamics in the material need to be described concurrently with the light-matter interaction 17,18 .
There are several approaches for calculating the Raman spectra: (i) scattering probability from third-order perturbation theory (absorption, electron-phonon coupling, and emission) 9,19,20 , (ii) from the gradient of the electronic susceptibility (usually via finite-differences) in Placzek approximation [20][21][22] , and (iii) from the autocorrelation function of time-dependent susceptiblity 23,24 . Methods (i) and (ii) only yield the Raman tensor, but the phonon eigenvectors and frequencies need to be determined first in a separate calculation step. In method (iii), the peak positions and intensities are obtained at once, but it is computationally highly demanding. Method (ii) is computationally most affordable and easy to implement in high-throughput setting 16 and thus adopted in this work. The method is briefly described below.
In the first step, the phonons are calculated as described in depth in many previous publications 25, 26 . Within harmonic approximation, the potential energy surface is written as a Taylor expansion U = U 0 + Φ αβ (ki, lj)u α (ki)u β (lj), where U 0 is the ground state energy and force constant matrix Φ describes the second-order change in potential energy, In Eq. (1), u α (ki) is the displacement of the kth atom in the ith unit cell in the cartesian direction α. F α (ki) is the force in atom ki, and in the equation above its change is induced by the displacement of atom lj. After harmonic ansatz for the temporal evolution of the vibrational modes v, the classical equations of motion for atoms in unit cell "0" become M k ω 2 v α (k0) = ∑ l,j,β Φ α,β (k0, lj)v β (lj) (2) where M k is the mass of atom k. The infinite sums over unit cells l in periodic crystals can be avoided by moving to reciprocal space and, after rescaling v and Φ by √ M, Eq. 2 is cast into an eigenvalue equation ∑ lβ D αβ (kl, q)e β (l, qν) = [ω(qν)] 2 e α (k, qν) where D is the mass-scaled Fourier-transformed Φ (denoted dynamical matrix), q is the wave vector, e is the eigenvector of the band index ν, and ω 2 are the eigenvalues. To obtain D, force constants Φ need to be evaluated from the forces induced at atoms lj by displacing each atom k0 in the unit cell. To guarantee sufficiently large distance between atoms k0 and lj, supercell calculations are usually required. If the crystal symmetry is not considered, the construction of the force constant matrix requires performing 3N DFT calculations when each of the N atoms in the unit cell is displaced in each of the three cartesian directions. Differential cross section for the Stokes component of Raman scattering from the νth eigenmode far from resonance is given as 17,22 whereÊ S andÊ L are the unit vectors of the polarization for the scattered and the incident light, V is scattering volume, and ξ is a normal-mode coordinate along the mass-scaled eigenvector e α (k) = e α (k)/ √ M k ∼ v α (k) and χ is the electronic susceptibility tensor. The directional derivative can be written out as The first two forms involve calculation of derivatives of χ with respect to displacement of each atom u(k), whereas in the last two forms all atoms are displaced simultaneously along e and explicitly written in the finite-difference approximation as implemented in the code (displacing the atoms in both positive and negative directions). Normalizedê = e /|e | (and h = h |e |) is used in order to have consistent step size h in systems and modes with different masses (and in units of Å). Specifically, the Raman tensor is defined as 22 incorporating V 2 /(4π) 2 from Eq. (4). To evaluate the change in χ, we used the macroscopic dielectric constant ε βγ containing only the electronic contribution with clamped ions (sometimes denoted as the high-frequency dielectric constant ε ∞ ), which is readily provided by most DFT codes. While the expression in Eq. (4) yields complete information, quite often experimental results are obtained for polycrystalline mineral specimens or powdered samples, in which case the intensity must be averaged over all possible orientations of the crystals. When the direction of incident light, its polarization, and the direction of outgoing light are all perpendicular, the Raman intensity becomes 20,22 where I Raman is Raman activity that is independent of experimental factors such as temperature and incoming photon energy and thus used when comparing our results to other calculations, whereas Eq. 7 is used (and must be used) when comparing to experimental spectra.

Workflow
We now describe how the theory described above is turned to an efficient computational workflow. As mentioned, the computational procedure involves two sets of calculations: (i) force constants to get the vibrational modes and (ii) the Raman tensors for each mode. While the phonons at Γ-point can be calculated efficiently, we would like to have access to the full force constant matrix. This allows calculation of phonon dispersion and also, e.g., estimation of isotope effects and line broadening due to defects or grains via phonon confinement model 17,[27][28][29] . Both steps can be computationally demanding for systems with large number of atoms in the unit cell, which has hindered previous efforts to building such databases in the past. The most important design decisions that distinguish our work from the previous ones are the following. First, we have decided to build our database on top of the Atsushi Togo's Phonon database 30 , that contains the calculated full force constant matrix, and our work only focuses on calculating the Raman tensors. We are using the same computational parameters, and thus our database is fully consistent with the Phonon database, which is further linked to the Materials project database 31 via the material-IDs.
Second, to reduce calculation time and make the workflow more efficient compared to existing methods, Ramanactive modes are found based on group theory and the Raman tensors are calculated only for modes that are known to be active or whose activity could not be determined. Known inactive modes and the three zero-frequency acoustic modes are ignored. For this purpose, the symmetry information about Raman activity was implemented.
The workflow developed for automatic Raman tensors calculations is illustrated in Fig. 1. At the conceptual level, the workflow steps are following: 1. Select material from Phonon database, read in optimized structure, computational parameters, and force constant matrix.
2. Calculate the eigenvectors and eigenvalues at Γ-point.
3. Determine the irreducible representation (irrep) of the modes and whether they are Raman and/or infrared active.
4. Perform prescreening to check that the material is dynamically and thermodynamically stable and the material is not metallic or near-metallic.
5. Calculate the Raman tensors for Raman-active modes and the dielectric tensors for the optimized structure.
6. All the results (structure, eigenvalues, irreducible representation, Raman tensors, etc.) are collected in a database.
The softwares used in each step are also indicated in Fig. 1. Atsushi Togo's Phonon database contains the optimized structures, calculated force constants, and all the computational parameters used to obtain them. These are calculated using VASP software 32, 33 . The eigenvalues and eigenvectors at Γ-point, as well as the irreducible representations of the modes are calculated using Phonopy 30 . All of this information together with selected material properties obtained from the Materials Project database are collected in a database for prescreening. For this, we adopted to use the database tools in atomic simulation environment (ASE) 34 . In the last step, the calculated Raman tensors are added to this database, which is then also served through a web app implemented in ASE.
For automating the computationally intensive part, i.e., the calculation of the Raman tensors, we used the Atomate 35 that is a Python-based package for constructing complex materials science computational workflows. The workflow objects generated by Atomate are given to Fireworks workflow software 36 for managing, storing, and executing them with the help of Custodian package for error management 37 . As the DFT calculator we used here VASP, with the parameters taken from the Phonon database. During these calculations, all the input parameters and results are stored in a Mongo database, which are afterwards transferred to the database (Computational Raman Database, CRD).

Prescreening
Before Raman tensor calculations we performed the following prescreening, also illustrated in Fig. 2: (i) We check that the material has Raman active mode(s) based on the symmetry analysis. (ii) We check that the material is dynamically stable, i.e., there are no modes with imaginary frequencies at the Γ-point. (iii) We check that the material is thermodynamically stable by requiring that the energy above the convex hull is less than 0.1 eV/atom, as materials with the energy > 0.1 eV are unlikely to be experimentally synthesized 38 . (iv) We check that the bandgap is larger than 0.5 eV, since our computational approach is strictly valid only for non-resonant conditions (i.e., photon energy smaller than the band gap), and metallic systems require very large k-point meshes which will increase the computational cost. For (iii) and (iv) we use information from the Materials Project database at the same material ID 31 . Finally, we have 8382 (83.55%) materials satisfying these conditions and flagged for calculation. It is also worth noting that Phonon database contains only materials that are non-metallic, non-magnetic, and non-triclinic.
The workflow first performs calculation of dielectric tensors of the optimized structure, which can be compared to that provided in Phonon database. Additionally, the maximum forces are checked in this step and the calculation terminated if the forces are > 0.001 eV/Å, but no such case was encountered.

Computational parameters
All density-functional theory (DFT) calculations are carried out using VASP (Vienna Ab initio Simulation Package) 32, 39 with projector-augmented wave method 40 . PBEsol exchange-correlation functional 41 and other computational parameters were taken to be the same as used in Phonon database. In particular, plane wave cutoff is set to 1.3 times the maximum cutoff listed in PAW setups. In Phonon database, the structures of standardized unit cells are given, whereas we adopt to use the primitive cell in Raman tensor calculations to save computational time. The primitive cell can be readily obtained using Phonopy 30 . In the calculation of eigenvectors, non-analytic corrections are not included, as the eigenvectors would then depend also on the direction from which q → 0 is approached and thereby complicate the calculations significantly. Fortunately, this mostly happens for the IR-active modes and less for the Raman-active modes. Moreover, the induced change in eigenvectors and in Raman tensors is expected to be small and the splitting of the modes can be determined a posteriori.
There are then only two parameters left to decide: the k-point mesh and the magnitude of the atomic displacements in evaluation of the Raman tensor by finite differences.
In Phonon database, the Brillouin zone of the unit cell is sampled by a mesh whose density is defined by the R k parameter in VASP. We adopt the same approach, but it is worth noting that since we use primitive cell, the exact density and positions of mesh points can be slightly different. Moreover, metals and small-gap semiconductors usually require higher density k-point mesh than large-gap insulators. All calculations in the Phonon database used R k = 20, which should be sufficient for the structural optimization of materials included in the database (band gap > 0.5 eV). Determination of Raman tensor may, however, require a higher value. In order to benchmark this, we selected two materials from the Phonon database with different band gaps: the largest band gap material among the common III-V semiconductors is AlN (4.05 eV) and Si is a small band gap material (0.85 eV).
As illustrated in Fig. S1, R k = 40 is needed to achieve converged results for dielectric constant and Raman intensity of a small band gap material Si, whereas for a large band gap material AlN R k = 20 is sufficient. See Benchmark section in SI for more details. In our workflow, we have chosen to use the following values: R k = 20 for the structures with a band gap more than the 2 eV, R k = 30 for band gaps in the range of 1-2 eV, and R k = 40 for band gaps smaller than 1 eV.
In order to benchmark the displacement, we chose materials with heavy and light elements, PbO and Cd(HO) 2 . As shown in Fig. S2, varying the displacement from 0.001 Å to 0.04 Å (default value being 0.005 Å), we found little change in the Raman tensors or the dielectric constants. Therefore, we chose to use the default value. Finally, we verified the computational workflow in Atomate by comparing the Raman spectra of few structures to those obtained using VASP_Raman code 42 . As shown in Fig. S3, a good agreement is found. We note that Atomate had wrong normalization of eigenvectors which in some cases resulted in overestimation of the Raman intensities, but was fixed in the version used here.

Data Records Computational Raman Database
The final database contains vibrational information and Raman tensors stored in JSON document that can be downloaded directly from the Materials Cloud Archive 43 and queried with a simple python script. The Table 1 shows all the database keys with their related descriptions. The data can also be browsed online in Computational Raman Database website (ramandb.oulu.fi).

Database statistics
As shown in Fig. 2, there were 10032 materials in the Phonon database and 8382 of them were flagged for calculation. Since each structure contains several vibrational modes, the total number of modes in our database was 725163, and 428081 modes of them are Raman active or the activity is unknown.
Figs. 3(a,b) shows the number of materials in the database (before prescreening) grouped by the calculated band gaps and the number of atoms in their structures, respectively. The histogram with respect to the number of atoms, peaks at around 20-30. There are some materials with very large primitive cells containing more than 100 atoms, but many of these appear to be disordered/alloyed/defective variants of the small primitive cell systems and thus of limited interest. Since the Phonon database only includes non-metallic materials, the number of materials with a band gap smaller than 0.5 eV is small, and therefore neglecting those materials in our prescreening step has small impact.
We proceeded to carry out the Raman tensor calculations in the order of increasing number of atoms in the primitive cell. The database included here contains 5099 calculated structures. We calculated all materials with less than 10 atoms in the primitive cell and all experimentally observed materials (as indicated by MP) less than 40 atoms in the primitive cell. For this, we used about 9.5 million CPU hours. We estimate that for calculating the remaining 3283 structures would require more than 20 million CPU hours, owing to the much larger cell sizes.
In Fig. 3(c) we compare the number of materials considered in this work and in Materials Project database as grouped by the type of compound (oxides, halides, etc.). "MP" denotes the full Materials Project database, whereas "MP*" includes the same conditions (band gap larger than 0.5 eV and energy above hull less than 0.1 eV) as used in our material set (PhDB*). "CRD" refers to the calculated set of materials. In total, almost 20% of the MP* structures are contained in the PhDB* dataset and about 12% are calculated. Also, the different types of compounds are included in our database with similar statistics as in Materials Project. As an example, the percentage of oxides and halogenides are 52 % and 27 % in our database, compared to 67 % and 26 % in MP*. Finally, we used the algorithm

5/12
proposed by Larsen et al. 44 for identifying the dimensionality of the structures in our database: 4137 structures (more than 80 %) are three-dimensional, 385 structures are two-dimensional, 72 structures are one-dimensional, 277 structures are 0D and others are a mixture of different dimensionality, such as 0D+1D, 0D+2D, 0D+3D, etc. This shows that our database covers most different material classes.

Technical Validation
Comparison to experiments Selected computational benchmarks were already presented in the Computational parameters section. In this section, we compare the calculated spectra from our approach with experimental results extracted from the RRUFF database to validate our method and calculations. RRUFF contains only (estimated) chemical formula and lattice parameters but not atomic positions, and thus we cannot guarantee exact structural match. Based on mineral names, there are 703 entries in RRUFF database that matched with 288 structures of our database. The Table S1 contains mineral names, formula, and their RRUFF IDs for structures with the same formula as found in Phonon database, 92 in total. 27 of these were found to have the similar lattice parameters compared to the matched structure in our database and thus very likely to be the same structure. Moreover, in most cases, the energy above hull is zero or very small, the maximum being 40 meV/atom. Fig. 4 shows a comparison between calculated spectra and experimental Raman spectra of few selected minerals: HgO, MgCO 3 , CaMg(CO 3 ) 2 , and SiO 2 . Overall good agreement between computational and experimental results is found. In most cases, the frequencies (Raman shifts) differ from the experimental values by less than few percent. The variation in peak intensities is somewhat larger but qualitatively correct. We note that the comparison to the experiment is complicated by the varying linewidths in the experimental spectra, which in turn modifies the peak maxima. The linewidth is related to the phonon lifetime, which is not evaluated in our calculations. Instead, in the simulated spectra we have only included a reasonable phonon lifetime-induced broadening of 8 cm −1 . The experimental spectra appear to contain also a Gaussian-type (instrumental) broadening, which we do not attempt to reproduce here. Also, while perfectly ordered bulk crystals are used in calculations, in experiments the material purity or even exact composition may be unknown and the spectrum is affected by parameters such as temperature, pressure, and measurement geometry. While we are relying in harmonic approximation, phonon renormalization due to anharmonic effects can affect the frequencies as well as linewidths. Also, we are simulating non-resonant Raman spectra, while in resonant Raman the intensities may change depending on the electronic resonance conditions. Nevertheless, in the cases where the Raman tensors are affected by any of these effects, the Raman-active modes found based on the group theory can still be used to assist in the analysis of the experimental spectra.

Usage Notes
We introduced an optimized workflow for performing high-throughput first-principles calculations of Raman tensors. The workflow takes full advantage of the crystal symmetry, adopts carefully benchmarked computational parameters, and avoids calculation of vibrational modes by importing them from existing Phonon database. We carried out such calculations for 5099 materials and the results are included in the dataset accompanying this paper. The database encompasses a wide variety of materials from different compound classes (oxides, halides, etc.) and of different dimensionality. The calculated spectra were also shown to compare favorably with the experimental ones.
The final database contains Raman tensors and other vibrational information, such as phonon eigenmodes, Born charges, and symmetry information, stored in JSON document that can be downloaded directly from the Materials Cloud Archive 43 and queried with a simple python script. The whole dataset can also be browsed online in Computational Raman Database website (http://ramandb.oulu.fi), wherein one can also find other relevant information, such as atomic structure, phonon dispersion, and infrared spectrum. We hope that the vibrational properties and Raman spectra of materials in the database will prove useful for computational and experimental researchers alike.     In the third step, we investigated the effect of step size by calculating Raman tensors of PbO and Cd(HO) 2 with step sizes of 0.001, 0.02, and 0.04 Å and compared them with those using the standard step size of 0.005 Å. Fig. S2 shows the changes in dielectric constants of PbO and Cd(HO) 2 in different directions and plotted for three-step sizes: 0.005, 0.02, and 0.04 Å. Since dielectric tensor is symmetric (xy=yx, xz=zx, and yz=zy), we only plot the inequivalent components. As shown in Fig. S2, whenever there are pronounced changes in the dielectric constant (corresponding to non-zero components in Raman tensor), the dependence on step size is close to linear. In some cases there is a small parabolic dependence, seen particularly well in the xy=yx component which contains no linear dependence, but these will not affect the Raman tensor since we are using two-point finite-difference stencil. Moreover, in this range of step sizes there is no discernible noise, although some noise could be observed in 0.001 Å results (not shown). Thus, we consider the default value of 0.005 Å a good choice.

Code availability
As mentioned in the text, there was an error in the normalization of eigenvectors in Atomate. We fixed the normalization error and changed the formulations to match with the vasp_raman code ?, ? . To verify our approach, we used vasp_raman code to calculate Raman tensors and compared them to Atomate with the fixed and old versions of eigenvector normalization. Fig. S3 shows the Raman activity spectra of MoS 2 , WS 2 , SrGaSnH, and BaAlSiH. The revised normalization yields activities closely matching with vasp_raman code. The incorrect normalization, on the other hand, tends to lead to overestimation of Raman activities and is particularly severe with modes that have very small Raman activity.
In CRD website (ramandb.oulu.fi), the total Raman intensity is separated into depolarized (I ⊥ ) and polarized (I || ) components, I = I ⊥ + I || , with where we have taken out the (ω L − ω ν ) 4 term that depends on the laser wavelength, since (i) this removes the dependence on one external parameter from our spectra, (ii) our calculations are for non-resonant conditions and one needs to be careful to only compare to wavelengths that are far from resonance, and (iii) the dependence on ω ν and thus the changes in the spectra after normalization are usually small. The rotation invariants are ?, ?  Default approach vasp_raman approach Fixed approach (a) Figure S3. Comparison of the Raman activity from our workflow and that from vasp_raman code. The spectra from old version of Atomate with incorrect eigenvector normalization is also shown.