An automatically curated first-principles database of ferroelectrics

Ferroelectric materials have technological applications in information storage and electronic devices. The ferroelectric polar phase can be controlled with external fields, chemical substitution and size-effects in bulk and ultrathin film form, providing a platform for future technologies and for exploratory research. In this work, we integrate spin-polarized density functional theory (DFT) calculations, crystal structure databases, symmetry tools, workflow software, and a custom analysis toolkit to build a library of known, previously-proposed, and newly-proposed ferroelectric materials. With our automated workflow, we screen over 67,000 candidate materials from the Materials Project database to generate a dataset of 255 ferroelectric candidates, and propose 126 new ferroelectric materials. We benchmark our results against experimental data and previous first-principles results. The data provided includes atomic structures, output files, and DFT values of band gaps, energies, and the spontaneous polarization for each ferroelectric candidate. We contribute our workflow and analysis code to the open-source python packages atomate and pymatgen so others can conduct analogous symmetry driven searches for ferroelectrics and related phenomena.

Ferroelectrics have important technological applications, such as in tunable capacitors, non-volatile random access memory devices, and electro-optical data storage. In addition, ferroelectrics are capable of displaying couplings between their electronic degrees of freedom with magnetic or lattice degrees of freedom in multiferroic materials. Ferroelectricity often arises from a structural phase transition between a high-symmetry nonpolar structural phase to a low-symmetry polar structural phase with decreasing temperature, resulting in the emergence of a spontaneous polarization [31][32][33] . In this scenario, the atomic geometry of the nonpolar structure can continuously distort such that the new polar structure has a subset of the symmetries of the original structure, satisfying the requirements of a second-order phase transition; in these cases, the polar space group must be an isotropy subgroup of the nonpolar space group, which is a stronger requirement than they only share a simple group-subgroup relation [34][35][36] .
Thus, certain ferroelectrics can be systematically screened by searching for pairs of nonpolar and polar structures related by a small symmetry-breaking distortion. In the late 1980s, Abrahams performed some of the earliest searches for ferroelectrics in crystallographic databases using symmetry criteria 5,6 . More recently, automated searches for new ferroelectric candidates have used symmetry arguments to identify nonpolar reference structures for existing polar materials [7][8][9] . Other studies have used a combination of group theoretic and first-principles calculations to propose ferroelectric candidates [10][11][12] . Bennett and co-workers proposed using high-throughput calculations to perform chemical substitution into structures of known classes of ferroelectrics [13][14][15] . Recent work used high-throughput phonon calculations to identify ferroelectrics through polar soft phonon modes of nonpolar phases 16 .
Previous studies have focused on a limited number of compounds or families of compounds using a relatively narrow set of symmetry conditions. Current curated lists of ferroelectrics only include known ferroelectrics that have been experimentally verified. With shrinking computing costs, high-throughput material searches using first-principles methods provide an efficient strategy to discover and catalog materials. Ferroelectric databases and systematic screening of properties such as band gaps, polarizations, volume expansion, critical temperatures, and coupling to magnetic and/or topological degrees of freedom may lead to new functional materials and potentially new physical phenomena.
In this work, we integrate density functional theory (DFT), crystal structure databases, symmetry tools, workflow software, and a custom analysis toolkit to build a workflow capable of generating libraries of known, previously-proposed and newly-proposed ferroelectrics. This workflow is general and can be used with any crystal structure dataset. We present the results from performing this workflow on the Materials Project database of inorganic crystal structures 24 . We screen over 67,000 material structures using symmetry relations between nonpolar and polar structure pairs and calculating the polarization from first-principles calculations. We identify 255 ferroelectric candidates, 200 being classified as high-quality candidates by a stringent verification process. Within these high-quality candidates, 74 are known or previously proposed, and 126 are new ferroelectrics. With the workflow developed here, we construct the first automatically-curated first-principles dataset of diverse, multi-class known and new ferroelectrics calculated with a standardized method that permits straightforward comparison. This dataset can be used to develop new tools and criteria for studying ferroelectricity across diverse materials systems. In addition our code for conducting this search has been contributed to the open-source python packages atomate and pymatgen so others can conduct searches of their own and build directly on this work 37,38 .
Our automated workflow has three stages: symmetry analysis, first-principles calculations, and post-processing. Accordingly, the rest of the work is organized as follows: the description of the workflow, based in the concept of ferroelectric nonpolar-polar symmetry pair, is described in the Methods section. Technical aspects of our workflow and database are included in the section Data Records. Finally, we validate our workflow method against experimental databases of ferroelectrics and verify our workflow is comparable to previous first-principles results in the section Technical Validation.

Methods
Identifying ferroelectricity from first principles. Ferroelectrics are characterized by a polarization versus electric-field hysteresis loop. Experimentally, the spontaneous polarization can be determined as half of the change in polarization at zero external field 31 . The spontaneous polarization is not a direct observable; one measures the change in spontaneous polarization between two stable configurations of a material 39 .
In three dimensions, the only space groups compatible with a polarization, meaning they leave a vector invariant under its symmetry operations, are those with polar point groups. Out of the 32 crystallographic point groups, 10 are polar; these polar point groups can keep points along specific lines (point groups 2, 3, 4, 6, mm2, 4mm, 3m, 6mm), planes (point group m), or all points in three-dimensional space (point group 1) invariant 40,41 . All other point groups are nonpolar. We define polar structures as crystal structures with a polar space group, which is composed of a polar point group plus translations, likewise for nonpolar structures.
Following the modern theory of polarization 39,[42][43][44][45][46][47] , the polarization, P, of a crystal is defined as, where e is the charge of the electron, n i is an integer, R i is a primitive lattice vector, and Ω is the unit cell volume (bold letters denote vectors). P 0 includes electronic and ionic contributions. The second term on the right hand side of Eq. 1 is the quantum of polarization, eR / i Ω, a consequence of translation symmetry. In the general case, P 0 is defined up to any integer multiple of Ω eR / i ; for nonmagnetic crystals containing elements from even columns of the Periodic Table, P 0 is defined up to even integer multiples. Since we are screening many systems here, we use the more general definition in this work.
Only differences of the computed polarization on the "same branch" are physically meaningful, where different branches are related by integer multiples of the quantum of polarization. Equivalently, the evolution of the polarization along an adiabatic path between two states must be smooth. Nonpolar space groups can only host formal polarizations that are zero or one-half modulo the polarization quantum. We use a nonpolar reference structure to calculate the change in polarization due to a polar distortion. In general, paths between two opposite polar configurations are sufficient to compute the change in polarization; however, we choose to use a nonpolar reference structure in this work as is standard for calculating the polarization from first principles.
To recover a smooth polarization path, we ensure the nonpolar structure must be continuously deformable into the polar structure along a path that preserves the symmetry of the polar structure and for which the system remains insulating. We then perform calculations of multiple structures along the distortion path to compute the spontaneous polarization, which can be compared to experiment. Using this approach, the spontaneous polarization can be directly predicted using first-principles methods with good accuracy 48 .
Hence, for the purposes of this work, we only consider ferroelectrics for which a high-symmetry nonpolar reference structure can be readily identified in the database for a lower-symmetry polar structure that can support a polarization 42 . We automate a search for compounds supporting two such phases and then compute the polarization difference along the structural path connecting the two structures.
If a polar ferroelectric structure corresponds to a metastable state, and is higher in energy than a nearby non-polar ground state by a small amount, the system can be considered an antiferroelectric 49,50 . Antiferroelectrics exhibit double hysteresis loops in polarization versus electric-field measurements; the field-induced first-order phase transition originates with an energy barrier between the nonpolar ground state and the polar metastable phase. In an antiferroelectric, the nonpolar ground state phase is related by a nonpolar distortion to a distinct nonpolar reference structure; the ground state structure is generally characterized as "anti-polar" to distinguish it from the nonpolar reference structure. Symmetry conditions for antiferroelectrics are described in ref. 51 . For completeness, we note that to identify antiferroelectrics using the workflow presented here, in addition to finding a reference nonpolar phase, the polar metastable ferroelectric phase and an antipolar ground state, the material would need to display a small energy difference between the polar and antipolar structures on the order of 1-10 meV 52 .
Workflow overview. We first describe the general workflow diagram comprising symmetry analysis, first-principles calculations, and post-processing. As shown in Fig. 1, the complete workflow involves the passing of data between many separate calculations. In developing our workflow, we automate the following tasks: 1. Identifying candidate materials possessing nonpolar-polar structure pairs related by a continuous distortion (an isotropy subgroup symmetry relation). 2. Performing spin-polarized DFT calculations of changes in total energy, band gap, and polarization for multiple structures along the nonpolar-polar distortion. 3. Post-processing calculation data to compute the spontaneous polarization of the polar ferroelectric phase. 4. Validating the calculation quality for each ferroelectric candidate. 5. Creating an interface for viewing the results for all candidates (see the section Graphical Interface).
We start by choosing a crystal structure database on which to perform the search (see Structure Selection Symmetry Analysis). We emphasize that any crystallographic database (e.g. any of the databases described in [24][25][26][27][28][29][30] can be used to perform our workflow, as long as the atomic coordinates and lattice parameters of the structures are provided. Within the chosen database, we perform a symmetry analysis to find candidate materials possessing nonpolar-polar structure pairs related by a continuous symmetry deformation. Any such pairs found to satisfy the symmetry deformation criteria are stored in the Distortion Database as being deformable by symmetry. This criteria includes the following conditions: (1) The polar structure belongs to a space group that is a subgroup of the space group of the nonpolar structure; and (2) There exists a transformation matrix between the high-symmetry setting of the nonpolar structure to the low-symmetry setting of the polar structure. The latter imposes that the www.nature.com/scientificdata www.nature.com/scientificdata/ distortion of the lattice parameters and atomic coordinates between the nonpolar and polar structures is continuous, meaning the polar structure belongs to an isotropy subgroup of the nonpolar structure.
We then carry out DFT calculations on the candidate pairs to extract the changes in the band gaps, total energies, and polarization along the nonpolar-polar distortion (see Computational Methods). These results are stored in a Workflow Database and then accessed by our Computing Resources to perform the calculations. Next, the information stored in the Distortion, Workflow, and Calculation Databases is used together to post-process quantities such as the computed spontaneous polarization and to validate ferroelectric candidates using experimental and previous first-principles results (see Post-processing and Spontaneous Polarization Values and Verification of Computational Methodology). The information needed to assess the quality and properties of the candidates is then added to the Candidate Database where it can be accessed by our web Interface for viewing the candidate materials in aggregate (see Graphical Interface). Finally, candidates are screened to ensure the polarization and energy profile across the nonpolar-polar distortion are smooth and continuous, i.e. all calculations ended correctly and provide reliable results.
In the sections below, we describe in detail the methods used for creating an automatically curated dataset of ferroelectrics from the Materials Project database 24 . The Materials Project database is largely based on structures from the Inorganic Crystal Structure Database (ICSD) 27,28 and includes hypothetical structures created through stoichiometric substitution. We use the Materials Project database to test our workflow. Our results for the Materials Project are not intended as the most general curated list of ferroelectrics; however, as the first automatically obtained list of ferroelectrics, they uncover new candidates and provide a blueprint for further studies. More elaborately curated lists may be constructed by applying our workflow to additional databases in future studies. The results from applying our workflow are described below and summarized in Table 1. We note that our workflow is modular and open-source, so it can be readily adapted and applied by others to expand the search for ferroelectrics and related materials such as antiferroelectrics and multiferroics.
Structure selection. As motivated earlier, the input to our workflow is a collection of candidate materials possessing nonpolar-polar structure pairs. There are several methods that can be used to create candidate nonpolar-polar structure pairs. For example, one can apply a polar distortion to an existing nonpolar structure or create a hypothetical nonpolar reference structure for an existing polar structure. In this work, to classify a material as a ferroelectric candidate, we require both nonpolar and polar structures to be present in the database. As shown below, even this direct approach provides new candidate materials, previously overlooked as ferroelectrics. Future studies may choose to relax this requirement to identify a greater number of promising materials.
To search for compatible nonpolar-polar structure pairs in the Materials Project dataset, we first select compounds possessing nonpolar and polar structures with space groups which are related by a group-subgroup relation. Note that in principle the same compound may display more than one ferroelectric structural transition, and therefore have more than one nonpolar-polar symmetry pair. For each of these pairs, we require that the number of sites in the nonpolar structure is less than or equal to the number of sites in the polar structure. We perform this initial query using pymatgen, spglib, and the Materials Project API 37,53-55 . We provide the number of nonpolar-polar structures pairs resulting from this query in the top box of Table 1.
At the time of this search, the Materials Project database had approximately 67,000 structures, approximately 15,000 of which are polar. We find approximately 17,000 nonpolar-polar structure pairs related via group-subgroup space group relations. This number is large, in part, because the same polar structure may be paired with multiple possible nonpolar reference structures, and vice versa. This number is also large because the requirement that the polar structure is in a subgroup of the nonpolar space group is a much weaker requirement than the polar structure belongs to an isotropy subgroup of the nonpolar structure-which we check later in the workflow. These roughly 17,000 pairs contain approximately 1,600 of the approximately 10,000 distinct polar compositions in the Materials Project database. The remaining polar compositions in the Materials Project do not have symmetry compatible nonpolar structures within the same composition present in the database. We note that it is possible to propose hypothetical nonpolar reference structures for polar candidates using group theoretic methods or by relaxing the symmetry tolerance between nonpolar-polar distortions [7][8][9]16 ; this is left for future work.
Naming conventions. We adopt the pymatgen alphabetical_formula method for the Composition class (with spaces and 1 s removed) to output consistent formulas for our candidates. We note that this method orders elements in such a way that does not match conventions in the literature. For example, we use O 3 PbTi where the standard in the literature is PbTiO 3 . Compositions printed by pymatgen also differ from those used in mineralogy, such as for boracite, lawsonite and many other minerals in our dataset. In our datafiles, we also provide formula name output using the pymatgen reduced_formula method for the Composition class, which sorts elements by electronegativity.
Symmetry analysis. The automated nature of our ferroelectric search relies on strict symmetry criteria. As described in the Structure Selection section, we pre-screen our candidate nonpolar-polar structure pairs using the symmetry tools in pymatgen and spglib to ensure that these pairs satisfy preliminary group-subgroup relationships. We then use the Structure Relations symmetry tool provided by the Bilbao Crystallographic Server (BCS) [56][57][58] to impose the symmetry criteria described in the Workflow Overview, namely, to obtain a transformation matrix connecting the lattice parameters and atomic coordinates of the structure pair 59,60 . The BCS has a freely available web interface for accessing a wide variety of symmetry tools. We create python scripts to automate interaction with and scrape returned data from the BCS to perform our symmetry checks using the python package mechanize 61 . The Structure Relations tool checks the following: 1.1 The group-subgroup index relations are compatible. The index of a group-subgroup relation indicates the number of ferroelectric domains (distinct polar variants) that arise from the symmetry breaking of the high-symmetry structure. 1.2 There exists a path of maximal subgroups between the high-symmetry structure and low-symmetry structure.
1.3 The Wyckoff position splitting of the high-symmetry structure is compatible with the Wyckoff positions of the low-symmetry structure. 1.4 The lattice of the high-symmetry structure in the low-symmetry setting must be within a defined tolerance of the lattice of the low-symmetry structure (see Symmetry Precision section). 1.5 Each atom in the high-symmetry structure in the low-symmetry setting can be paired to an atom in the low-symmetry structure such that atom pairs are separated by a distance no greater than a given tolerance (see Symmetry Precision section).
Structure Relations takes Crystallographic Information Files (CIFs) of high-symmetry and low-symmetry structures and tolerance threshholds as arguments. We use a lattice tolerance of 3 Å and 10° for lattice parameters and angles, respectively. These tolerances are generous for materials with average sized unit cells (i.e.with lattice parameters less than 20 Å) and permit a wide variety of distortions. For the present work, high-quality candidates are reported for a maximum pairing distance of 1.5 Å. As shown in Table 1, out of the 17,000 structure pairs that we test with Structure Relations, 413 are found to be deformable by symmetry with a maximum distortion less than or equal to 1.5 Å.
Symmetry precision. Symmetry precision is a tolerance factor used to assess whether an atom is equivalent to another after a symmetry operation up to a maximum distance. A symmetry precision between 10 −1 and 10 −5 Å is typically used. In the Materials Project database, a symmetry tolerance of 10 −1 Å is used for the reported space group stored in the database. We use the same tolerance to generate CIFs sent to the BCS Structure Relations.
We evaluated how varying the symmetry tolerance changes the resolved space group for all the structures in the Materials Project. We were able to determine this efficiently by using a binary search on a log 10 scale for a maximum and minimum symmetry tolerance of 10 −1 Å and 10 −5 Å, respectively. Out of the 67,000 structures we checked, 50,000 (75%) structures were resolved into one distinct space group for the entire symmetry precision range. For additional discussion about the sensitivity of symmetry precision on resulting space groups in the search for ferroelectrics, see refs. 16,62 . DFT calculation details. In our workflow, we perform spin-polarized DFT calculations using the Vienna Ab initio Software Package (VASP) version 5.3.5 [63][64][65] . We use the generalized gradient approximation (GGA) functional of Perdew, Burke, and Ernzerhof (PBE) 66 . Our calculations use PAW pseudopotentials and an energy cutoff of 520 eV for the plane-wave basis 67,68 ; this is 1.3 times the highest cutoff recommended for the pseudopotentials used 69 . Structures are initialized with ferromagnetic ordering in all cases in this work. Since the default is for parallel alignment of the spins, we expect the workflow to be reliable for nonmagnetic or ferromagnetic materials. Materials with more complicated magnetic ground states, such as antiferromagnets, would require consideration of different antiferromagnetic orderings. Therefore, some common multiferroics possessing antiferromagnetic or near antiferromagnetic ordering may not be identified by this workflow. Extensions to include antiferromagnetic spin arrangements are relegated to future work. These settings correspond to default values used to create the Materials Project database and therefore allow a direct comparison of our results with the Materials Project database.
We use the Berry phase approach from refs. 39,43,44,70,71 , as implemented in VASP to calculate the electronic part of the macroscopic spontaneous polarization. We calculate the ionic part of the polarization using the point charge and position for each atom in the unit cell, see the calc_ionic function in pymatgen.analysis. ferroelectricity.polarization for details.
www.nature.com/scientificdata www.nature.com/scientificdata/ We use the default parameters for VASP inputs as defined in pymatgen (and used by Materials Project) and atomate 69,72,73 . For details on these parameters, see the documentation for pymatgen.io.vasp.sets. MPStaticSet. We use a Hubbard U correction to correct the DFT-PBE description of d states of select oxides and fluorides following the approach in ref. 74 . To see the guidelines for which compounds we apply a U, see refs. 24,69,72,73 . We use a reciprocal k-point density of 50 k-points per (1/Å) 3 for structural relaxations and 100 k-points per (1/Å) 3 for static and polarization calculations. We use total energy convergence criteria of 5 × 10 −5 eV per atom for the electronic self-consistent loop and 5 × 10 −4 eV/Å per atom for the ionic relaxation loop for structural optimizations. These convergence parameters were tested against higher-accuracy convergence parameters on a set of 182 chemically diverse compounds in ref. 73 , yielding total energies within 15 meV/atom and lattice volumes within 7.5% of the higher-accuracy calculations for nearly 96% of the compounds.
We note that while the local density approximation (LDA) is commonly used to describe certain ferroelectric oxides, and therefore to compute their polarizations, use of a generalized gradient approximation (GGA), such as the PBE functional, tends to be standard for wider classes of materials nowadays. PBE is also the default functional used by the Materials Project for structural relaxations and calculating material properties. Thus, we use PBE for this effort. Our results are in line with the typical overestimation of PBE for the lattice parameters, and therefore, we expect a similar overestimation of the polarization. In addition, while DFT-PBE tends to underestimate electronic band gaps, the latter plays a minimal role in the determination of standard ferroelectric materials, and as shown below, will only limit the computation of the spontaneous polarization for a small number of them.
Scientific workflow packages. We construct the scientific workflows to perform the structural relaxations and spin-polarized DFT calculations of energy, band gap, and polarization using the FireWorks and atomate python packages 38,75 . FireWorks is built for managing computational scientific workflows. atomate is built for constructing workflows for multiple computational material science codes, such as VASP. atomate uses FireWorks classes to develop modules for performing common DFT calculations with VASP, such as structural relaxations and self-consistent calculations of total energy. atomate also provides a framework for building custom modules, which we use to construct our structural interpolations and polarization calculations modules.
DFT workflow. We use the DFT workflow, shown in Fig. 2, to compute the physical properties needed to identify ferroelectric candidates. We perform spin-polarized calculations and for systems with spin-polarized ground states, consider only ferromagnetic ordering. We execute the DFT workflow shown in Fig. 2 for the 413 pairs with continuous nonpolar-polar transformations, with maximum distortions that do not exceed 1.5 Å.
For each structure pair, we begin with the nonpolar structure in the low-symmetry setting (obtained from BCS Structure Relations in the Symmetry Analysis step) and the polar structure. We use the nonpolar structure transformed into the low-symmetry setting so we can perform structure interpolations and polarization calculations across similar lattices. We perform relaxations of the unit cell and atomic positions of both of these structures twice, using a level of convergence similar to what Materials Project uses for its database entries. As mentioned, all calculations are spin-polarized and use PBE, with a Hubbard U correction for systems with non-d 0 open-shell cations; our workflow assumes ferromagnetic ordering for all systems. Extensions of our workflows to consider antiferromagnetic and other orderings will be the subject of future work. We then fix the relaxed nonpolar and polar structures, and perform a self-consistent DFT calculation to compute the total energy and band gap. If either the nonpolar structure or polar structure is found to be metallic at the DFT-PBE level in our spin-polarized calculations initialized with ferromagnetic orderings and a standard U-here we define metallic as having a DFT-PBE band gap of less than 10 meV-we stop the workflow for that structure pair.
If the polar and nonpolar structures are both insulating, we compute the polarization along the distortion path. As shown in Table 1, 80 of the 413 structure pairs were computed to have metallic endpoints in our spin-polarized calculations: 30 were found to have a metallic nonpolar structure but insulating polar structure, 2 were found to have a metallic polar structure but insulating nonpolar structure, and 24 were found to have both metallic polar and nonpolar structures. 24 additional structures have at least a metallic nonpolar structure, but these workflows were halted before the polar structures had their band gaps computed.
We compute the DFT total energy, band gap, and polarization of eight evenly-distributed linearly-interpolated structure, or interpolations, of the nonpolar to polar structures. We found eight interpolations to be sufficient for reconstructing a smooth polarization trend for at least 75% of our candidates. Similiar to the previous step where a metallic calculation causes the workflow to stop, metallic interpolations similarly halt the workflow since we would be unable to calculate the polarization of that structure. 59 candidates were found to have metallic interpolations. The DFT workflow is labeled as complete when all polarization calculations along the path have completed. As shown in Table 1, 255 structure pairs successfully completed the workflow, and satisfy our requirements of a ferroelectric phase transition.
Post-processing spontaneous polarization values. As discussed earlier, only polar space groups are compatible with a polarization vector that is not integer or half integer multiples of the polarization quantum 46 . If a reference nonpolar structure that is continuously deformable into the polar structure can be identified, we can then calculate the polarization of several interpolated structures between the reference nonpolar and the target ferroelectric polar structure. The nonpolar structure is used as a means to calculate the spontaneous polarization; however, we note in general it is not necessary for the nonpolar structure to be experimentally observable for the polar material to be ferroelectric.
We start by calculating the formal polarization of the nonpolar structure, which is either zero or a half quantum of polarization (modulo the quantum of polarization) along the three lattice directions. Then we perform the same calculation for the first interpolated structure along the distortion, and then the next, until we arrive at the final polar structure. For a sufficient number of interpolations between the nonpolar and polar structures, we can trace (2020) 7:72 | https://doi.org/10.1038/s41597-020-0407-9 www.nature.com/scientificdata www.nature.com/scientificdata/ out smooth, continuous polarization "paths" along the distortion; there will be infinitely many paths due to the periodicity of the polarization lattice. Subtracting the polarization values at the nonpolar and the polar endpoints of the same path or "branch" will give us the spontaneous polarization vector of the polar ferroelectric phase.
We perform the following steps to recover the same branch "proper" polarization, which is independent of choice of branch. The first step, which is crucial, is to readjust the polarization for each structure along the www.nature.com/scientificdata www.nature.com/scientificdata/ distortion to be in the polar polarization lattice. To do this, we modify the polarization of the intermediate structures by the ratio of the quantum of polarizations of the two lattices (the lattice parameters divided by the volume multiplied by the electron charge), i.e., and Ω η are the lattice parameters and volume of the η th structure along the distortion, R polar and polar Ω are the lattice parameters and volume of the polar structure, and i is a lattice direction â,  b or ĉ. If we do not perform this adjustment, we are calculating what is called the "improper polarization" which will depend on the choice of branch and is therefore unphysical 76 . See Fig. 3 for an example of the differences between proper and improper polarization for BaTiO 3 (here we use conventional notation for the name).
After adjusting our initial polarizations, we then construct a periodic lattice with lengths and angles corresponding to the quantum of polarization along each polar lattice direction; this corresponds to the second term of Eq. (1). For the polarization lattice, the lengths of the lattice vectors are the cell lattice vectors divided by the volume of the unit cell and multiplied by conversion factors for electron charge and length scale.
Our algorithm for adjusting the polarizations to be on the same branch is depicted in Fig. 4. First, we take the nonpolar polarization (adjusted to be in the polar polarization lattice), and choose the "image" (or periodic value) of the polarization value in the nonpolar polarization lattice that is closest to the Cartesian origin (0, 0, 0). The value of the nonpolar polarization along â, b  , and ĉ can either be zero or a half-quantum. Then, we find the image of the first interpolation polarization value (again adjusted to be in the polar polarization lattice) that is closest to the Cartesian coordinates of the adjusted nonpolar polarization value. We continue this process until we get to the polarization of the polar structure.
This algorithm will find the polarization path with the smallest difference between polarizations of subsequent interpolations. An issue is that this algorithm can incorrectly find the same branch polarization in cases where the change in polarization between interpolations is larger than the quantum of polarization between branches. One example of this type of failure is the polarization resolved for CrO 3 with search ID 187, see Fig. 5 (the search ID being a simplified unique identifier, defined for the purposes of our work, for pairs of structures used in our workflow search, see Online-only Table 3). In this example, the algorithm chooses a discontinuous path that has a smaller spontaneous polarization of 78.2 μC/cm 2 in red. However, the correct path uses the last three interpolations in the branch shown with a dashed red line and gives a polarization of 122.9 μC/cm 2 . To correctly reconstruct this polarization with our existing algorithm, more structures would be needed to better interpolate between the nonpolar and polar structures. Alternatively, the curvature of the spline connecting nonpolar and polar structures could be used to identify points on the same branch; we leave this refinement for future work.
Graphical interface. To view the DFT ferroelectric candidate data in aggregate, we create an interactive web site for viewing polarization and total energy plots, animations of the distortion, and other data. The interface Fig. 3 Examples of improper (left) and proper (right) polarization curves for BaTiO 3 along the ĉ direction versus distortion from the nonpolar to polar structure, showing the importance of calculating the proper polarization. Due to the change in lattice parameters and volume across the distortion, the quantum of polarization defined for each structure along the ĉ direction is different. Using these different quanta causes the improper spontaneous polarization predicted by different branches to differ, as can be seen in the polarization values given in the right of the image. In contrast for the proper polarization (right), we re-scale the polarization of each intermediate structure to be in the polar structure's polarization lattice and use the quantum of polarization as defined by the polar structure. This results in predictions that are branch independent, which is what we use to assess candidates. Note that while, in this specific case, the calculated polarization values for all interpolations were on the same branch, this need not generally be the case.
www.nature.com/scientificdata www.nature.com/scientificdata/ consists of two main pages: (1) a page containing a sortable table of ferroelectric candidates organized by category (whether the candidate had a value of polarization successfully calculated and if so with what level of confidence) and (2) individual candidate pages that show energy and polarization plots, distortion animations, and other data specific to that candidate. This interface is available at https://blondegeek.github.io/ferroelectric_search_site/.

Data records
This dataset is available as two JSON files deposited in figshare 77 and our GitHub repository (http://github.com/ blondegeek/ferroelectric_search_site) 78 . The JSON files provide details of the symmetry analysis performed for each candidate and data generated by DFT calculations and post-processing from the workflow. Zipped folders of the input and output VASP files for each candidate deposited in figshare 77 . The title of the zipped folder includes the workflow ID to correlate the VASP files to information in the JSON files provided. We also provide an interface for viewing the dataset at http://blondegeek.github.io/ferroelectric_search_site with the code for the interface located at http://github.com/blondegeek/ferroelectric_search_site. File format. We contribute the following data: 1. JSON file with information on workflow status of each calculated candidate and calculation details extracted from VASP inputs and outputs. This includes total energy, band gap, polarization, post-processed information, and validation criteria for candidates with completed calculation. See Tables 2 and 3 for details.

Fig. 4
A visual demonstration of the same branch polarization algorithm demonstrated in one-dimension (rather than three-dimensions) using BaTiO 3 . The values for the polarization for each interpolation are those circled in red. In the first panel, we adjust the nonpolar polarization to be on the branch closest to zero. In the second panel, we move the first interpolated polarization to be on the branch closest to the adjusted nonpolar polarization. In the third panel, we adjust the second interpolated polarization to be on the branch closest to the first interpolation polarization. If the algorithm finishes successfully, all the adjusted polarizations will be on the same branch. reported properties. For each candidate we provide an initial nonpolar-polar pair of structures, including the nonpolar structure in both the nonpolar, high-symmetry setting and polar, low-symmetry setting. We also provide the displacements of each atom and other metrics provided by BCS Structure Relations.
For each successful calculation, we provide the structure used for calculation, the ionic and electronic polarization computed by VASP, the ionic polarization computed via the method of point charges, the total energy and energy per atom of the structure, and other commonly computed quantities such as total magnetization, magnetization per atom, forces, and stresses. We also give details as to which calculations (out of the 22 computed) for a given candidate pair were completed.
For each set of completed calculations we also provide, the recovered spontaneous polarization using the workflow described in the Methods section, as well as spline data characterizing the smoothness of the polarization and energy trends across the nonpolar-polar distortion.
Graphical representation of results. In the top row of Fig. 6, we partition the high-quality candidates found in the Materials Project into known and newly-proposed ferroelectrics and further partition those   www.nature.com/scientificdata www.nature.com/scientificdata/ ferroelectrics into subclasses. In Fig. 6 we see that known and new ferroelectric candidates are well mixed along the metrics of nonpolar-polar structure energy difference, distortion maximum between nonpolar and polar structures, PBE band gap of polar structure, and energy above hull (a measure of the thermodynamic driving force for decomposition 79 ).
In the middle row of Fig. 6, the candidates with large polarizations denoted with red triangles belong to the perovskite family. We recover many known perovskite ferroelectrics, such as: LiNbO 3 , AlBiO 3 , BiInO 3 , KNbO 3 , BaO 3 Ti, CdO 3 Ti and O 3 PbTi. In addition, we recover well-known double perovskite ferroelectrics, such as Bi 2 Nb 2 O 9 Pb and Bi 2 O 9 SrTa 2 . We also recover well-known antiferroelectrics possessing ferroelectric metastable phases, such as NaNbO 3 , HfO 3 Pb and O 3 PbZr. We again note that we do not use conventional notation for these systems but rather use alphabetical order of elements to provide a consistent ordering with our workflow output.
Other classes in the middle row of Fig. 6 are the organic (NH 4 ) 2 SO 4 family in blue and structures already proposed by theory to be ferroelectric in purple. These data show that there are many known and proposed ferroelectrics in the literature with polarizations of 10 μC/cm 2 or less.
In the bottom row of Fig. 6, we categorize new ferroelectric candidates into different trending compositions or structure types. There are several candidates containing fluorides, carbon-oxygen compounds, and hydroxyl groups. We highlight these candidates because they are very different in composition from oxide ferroelectrics most common in the literature. We also point out some hypothetical non-magnetic hexagonal manganite-like structures found in the Materials Project database that have polarizations of approximately 10 μC/cm 2 and half-quantum nonpolar polarizations.
In Fig. 7, we show trends in the number of pairs with continuous deformation with given nonpolar-polar point group transitions. There are 32 crystallographic point groups; nonpolar and polar point groups are shown in blue and red, respectively. The thickness and color of the line connecting nonpolar and polar point groups indicate the number of structures in the dataset with a continuous deformation between those point groups. We find that point group transitions that correspond to orthorhombic structures such as mmm → mm2, monoclinic structures such as 2/m → 2 or 2/m → m, and hexagonal 6/mmm → 6mm are the most prevalent.
In Fig. 8, we show the computed polarization of the ferroelectric candidates plotted with respect to their polar point group, similar to the plot of piezoelectric tensor magnitudes in ref. 4 . The majority of candidates have polarization less than 5 μC/cm 2 , shown on the right. The candidates in point group 4mm with large polarizations are perovskites with a reference structure in point group m m 3 . The degree of shading of a radial cell is proportional to the number of candidates in that region of the plot. For example, there are many candidates with polar point groups 2 and mm2 that have polarizations within 25 μC/cm 2 .

Technical Validation
Verification of computational methodology. Several checks are needed to ensure our automated calculations have completed satisfactorily and the information automatically extracted from them is reliable. We describe these tests below.
Testing smoothness of energy and polarization trends with distortion. We flag any ferroelectric candidates whose calculations cannot be used to reliably assess the quality of the candidate. For example, if the trend in total energy is not continuous, we cannot be confident that we can extract a meaningful polarization trend. Similarly, if the same branch polarization is not continuous, we cannot be confident that an accurate spontaneous polarization has been determined.  www.nature.com/scientificdata www.nature.com/scientificdata/ To assess the smoothness of trends in polarization and energy across a distortion, we use UnivariateSplines from scipy.interpolate 80 . We use cubic splines for fitting polarizations and quartic splines for fitting total energies. We use the default smoothness parameter of 1.0. These splines are generated using the Polarization class in pymatgen.analysis.ferroelectricity.polarization.
We find that 31 out of the 55 materials that do not have smooth interpolations contain atoms with nonzero magnetic moments, mostly containing 3d elements (V, Cr, Mn, Fe, Co, Ni) and one containing the 5d element W. We found 26 materials to have several discontinuities in total energy (even when these calculations resulted in smooth polarizations). These materials were transition metal oxides, fluorides, carbonates, orthosilicates, and phosphates with alkali or alkaline earth metals (Li, Na, and Ba for these specific examples), many being Li-ion battery cathode candidates. The transition metals in these materials (V, Cr, Mn, Fe, Co, Ni) can take multiple oxidation states. Because these discontinuities in energy were coincident with discontinuities in the total magnetization, we believe these jumps were caused by the transition metal species changing oxidation state through the distortion, suggesting the need to asses their ground state magnetic ordering more carefully.
Metallic endpoints and metallic interpolations. Polarization calculations require a nonzero band gap along the distortion path. Therefore, workflows that have either polar or nonpolar structures that are calculated to be metallic are halted. In the workflow_data.json, these workflows are designated by a workflow_status of "DEFUSED". Of the 413 pairs considered, 80 possess metallic endpoints and therefore interpolations were not performed. Occasionally, interpolated structures between two nonmetallic structure endpoints are metallic and there were 59 that had metallic interpolation structures. If any structure along the path from nonpolar to polar structure is metallic, the quality of automated analysis cannot be guaranteed. We include these candidates in our dataset, but they are noted as having a polarization_len (see Tables 2 and 3) of less than 10 or do not have a polarization_change_norm (in cases where some of the interpolated polarizations are None). If a   www.nature.com/scientificdata www.nature.com/scientificdata/ candidate has a polarization_len equal to 10 and have a polarization_change_norm, all interpolated polarization calculations completed successfully.
We also note that the method used to determine whether a material is insulating differs for pymatgen and VASP. In our workflow, pymatgen determines the band gap by comparing the energies of the band edges. When VASP determines whether to proceed with calculating the polarization, it checks whether the material is insulating by checking the occupations of the band edges. Since the occupations can be sensitive to the choice of smearing and k-grid density, there are instances where using our default settings in the workflow leads to partial occupancies, while the band edge energies suggest the material has a finite gap; therefore VASP will not proceed with calculating the polarization whereas our workflow deems the material to be insulating. This tends to occur for materials containing 3d elements which, as discussed earlier in the context of magnetism, would require more careful calculations to reliably capture their properties, and therefore not necessarily expected to successfully complete the workflow.

Fig. 6
Validated ferroelectric candidates from our automated search in the Materials Project. Computed spontaneous polarization plotted against nonpolar-polar energy difference, maximum atomic distortion, PBE + U band gap, and energy above hull of the polar structure. All results are generated for spin-polarized DFT-PBE + U calculations. Note, that for spin-polarized systems, we only initialized the calculations with a ferromagnetic ordering. The energy above hull is extracted from the Materials Project. The legend labels different subcategories considered in this work and described in the text. (2020) 7:72 | https://doi.org/10.1038/s41597-020-0407-9 www.nature.com/scientificdata www.nature.com/scientificdata/ Comparison of materials project to relaxed structures. Two structural relaxation calculations are already performed on all structures in the Materials Project. We perform additional relaxations of the unit cell and atomic positions to ensure total energy convergence. We found only less than 5% (10%) of our relaxed structures have lattice parameter differences of more than 3% (1%) from the original Materials Project structures. Because we perform relaxations of nonpolar structures transformed to the low-symmetry polar setting, we compare the relaxed nonpolar structure to the low-symmetry transformed structure output by BCS of the nonpolar structure from Materials Project.
Identifying high-quality candidates. In any high-throughput search, there are calculations that complete without errors and some that require further scrutiny to interpret its results. We deem as high-quality candidates those calculations where the polarization and total energy trends are smooth and continuous; we define this criteria in  www.nature.com/scientificdata www.nature.com/scientificdata/ the following way: the maximum difference between the data and spline fitted to the same branch polarization must be less than 10 −1 μC/cm 2 , and the maximum difference between the data and the spline fitted to the energy trend must be less than 10 −2 eV.
As shown in Table 1, out of the 255 candidates, 200 pass through our stringent verification criteria and ensures the polarization and energy trends across the ferroelectric distortion are smooth and continuous. The remaining candidates are still valid candidates; we recommend checking the polarization and energy trends by hand as the algorithms used for analysis may not have reliably recovered the spontaneous polarization in these cases, as in Fig. 5. These high-quality candidates are further described in the section Determining Known and New Ferroelectrics. There are candidates included in this list with polarization values of zero, as computed within the accuracy of our calculations; we include these since they pass the above criteria and could be tuned to host nonzero polarizations such as by chemical substitution. As well, the polarization values may be very small but this is also true of some of the known ferroelectric candidates computed with our workflow, and so for the sake of completeness we also list these candidates. We also call the reader's attention to the materials with magnetic elements. Since all materials are initialized with ferromagnetic ordering, we do not expect to reliably capture materials with other magnetic orderings (e.g. antiferromagnets); as such, materials with nonzero magnetic moments may not necessarily have been calculated using their true magnetic ground state and those results should be interpreted with caution. comparison to DFT calculated polarizations from literature. We verify that our workflow reproduces the first-principles calculated polarizations for a variety of ferroelectrics in the literature. DFT calculated values for the ferroelectric polarization depends heavily on the structures and functional used in the calculation, and for magnetic systems, it will also be sensitive to magnetic ordering. For example, while in our work the endpoint structures are fully relaxed, other works constrain the relaxed polar unit cell to have the same volume as the experimental structure 81 ; these calculated polarizations will be systematically smaller than ours due to PBE optimized structures having larger lattice parameters than experimental values.
We compare to literature where a fully optimized (unit cell, volume, and atomic positions) relaxation procedure is used. For clarity of comparison, we use the chemical formula used by these references rather than the alphabetical_formula that is convention for the rest of this work.
The ferroelectric first-principles literature is largely dominated by studies of perovskites. We compare to calculations for the perovskites BaTiO 3 , PbTiO 3 , LiNbO 3 , BiAlO 3 , CdTiO 3 , BiFeO 3 , and we include in our comparison the double perovskite SrBi 2 Ta 2 O 9 . These comparisons are summarized in Table 6.
We note again here that because our workflow assumes ferromagnetic ordering, it will fail for many multiferroics since they tend to have other magnetic orderings, such as antiferromagnetic or noncollinear orderings. For example, two important multiferroics are BiFeO 3 and YMnO 3 ; the former is only captured to some extent by our workflow and the latter is not reported at all by our database. The standard polar phase of BiFeO 3 features G-type antiferromagnetic ordering, and that for YMnO 3 features A-type antiferromagnetic ordering. As the present version of our workflow initializes all calculations with ferromagnetic ground states for all magnetic systems, we do not expect to capture most multiferroics using the workflow as implemented for this search with the Materials Project database. Additionally, we note that the standard nonpolar reference structure for BiFeO 3 is in space group 167 (R3c); however, the R3c structure of BiFeO 3 is not in the Materials Project database. That the R3m and Pm3m structures of BiFeO 3 calculated in this workflow, with ferromagnetic ordering and the default U value, are found to be insulating is fortuitous and allows it to be reported as a high quality candidate. However, the BiFeO 3 experimentally verified ground state structure, 161 (R3c), did not complete our workflow due to metallic interpolations. Conversely, YMnO 3 is DEFUSED and does not proceed to the interpolations because its nonpolar structure is found to be metallic in our calculations. Extending the workflows to account for antiferromagnetic and other spin-ordered ground states, as well as to relax the constraint that a nonpolar reference structure exist in the database, will enable casting a wider net for known and new multiferroics, and will be the subject of future work.
Determining known and new ferroelectrics. We distinguish known from new ferroelectrics in our workflow depending on whether or not a material has been reported in the literature as ferroelectric. Thus, we perform a literature review by hand for every considered candidate, and leave automating such literature searches to future work. We find that out of 200 high quality candidates, 74 are known or previously proposed ferroelectrics and 126 are, to our knowledge, new ferroelectric candidates.
In Fig. 6, we plot calculated spontaneous polarization versus nonpolar-polar total energy difference, maximum distortion distance, PBE band gap, and energy difference between the polar structure and convex hull reported in the Materials Project for known, proposed and new ferroelectric candidates. We also provide tables of the known and new candidates grouped by chemical formula and polar space group in Online-only Tables 1 and 2, respectively. Details connecting these candidate to specific workflow calculations are in Online-only Table 3.
As seen in Fig. 6, known and newly-proposed ferroelectrics display similar dispersion and overlap in the range considered. The middle row of Fig. 6 demonstrates the variety of known ferroelectric candidates that we are able to recover, from perovskites to candidates in the organic (NH 4 ) 2 SO 4 family to candidates proposed by theory. The bottom row of Fig. 6 shows categories of new ferroelectric candidates we find in the Materials Project, some from previously known ferroelectric classes such as hexagonal manganite-like structures and less-studied categories such as fluorides, carbon-oxygen compounds, and crystals containing hydroxyl groups.

comparison to hand-curated list of ferroelectrics in the Pauling Files database. The Pauling
File is a materials database accessible through SpringerMaterials 18 . In this database, there are materials tagged as ferroelectric and antiferroelectric. We use these tagged entries to validate whether the workflow is able to successfully identify diverse ferroelectrics by examining which tagged (anti)ferroelectrics complete the workflow. In the www.nature.com/scientificdata www.nature.com/scientificdata/ Pauling Files, there are 955 distinct compositions tagged as (anti)ferroelectric, 306 of which are pure (not doped) compositions. Out of 306 pure compositions, 95 of those compositions are included in the Materials Project as polar materials. This does not necessarily mean that the Materials Project database contains the same ferroelectric polar structure as referenced in the Pauling Files; rather, it simply means that there exists a polar structure in the Materials Project with the same composition as a tagged ferroelectric or antiferroelectric in the Pauling Files. 57 of these compositions have nonpolar-polar structure pairs in the Materials Project, 40 of which are found to have a continuously distortion by BCS Structure Relations. 32 of the 40 are successfully identified by the workflow as high-quality candidates, meaning the energy and polarization trends are smooth. Out of the 8 candidates that did not successfully make it through the workflow as high quality candidates, 4 of them (CrO 3 Y, Eu 2 GeSe 4 , Cl 3 CoTl, and MnO 3 Y) had metallic endpoints, 2 candidates (Cl 3 CrRb and Br 3 MnRb) had metallic interpolation structures, and 2 candidates (Cl 4 CoRb 2 and B 7 ClCr 3 O 13 ) did not have smooth energy trends due to fluctuating band gaps and magnetic moments in the interpolations. We note that most of these 8 candidates are non-d 0 systems, and therefore expected to exhibit magnetic order. This suggests that the primary impact to the robustness of our workflow is the level of DFT used and the magnetic orderings considered; since PBE tends to underestimate band gaps, and since ferromagnetically-ordered systems tend to be overwhelmingly metallic, several of these candidates are experimentally insulating but are metallic in our workflow. comparison to experimental measurements in Landolt-Börnstein series. To validate that our workflow calculates polarizations that can be used to guide experimental efforts, we compare to tabulated experimentally measured polarizations of known ferroelectrics in Landolt-Börnstein -Group III Condensed Matter -Ferroelectrics and Related Substances [19][20][21][22][23] . This series classifies hundreds of ferroelectrics into a 72 class numbering scheme. We note that polarization values for ferroelectrics in the Landolt-Börnstein series volumes 36A, 36B and 36C may be superseded by more recent experimental measurements. Experimentally measured polarizations depend greatly on the quality of the sample and the method used. For many ferroelectrics, polarization measurements made across decades vary greatly depending on these factors. Any comparison between theory must be taken in this context.
Nonetheless, plots of polarizations values calculated by our workflow vs. experimental spontaneous polarizations reported in the Landolt-Börnstein series volumes 36A, 36B and 36C are shown in Fig. 9 and tabulated in Table 7. We only compare search candidates to Landolt-Börnstein entries that match in composition and polar space group. For example, NaNbO 3 is LB Number 1A-1 and has a polarization of 12 μC/cm 2 for the 161 space group structure. However, all the NaNbO 3 found in our search are orthorhombic (space groups 26 and 29), so we do not make the comparison.
For experimental polarizations greater than 10 μC/cm 2 , the majority of experimental polarizations are between 25% and −50% of those that we calculate, well within an order of magnitude. The exceptions are both polar structures of Bi 2 O 9 SrTa 2 (space groups 36 and 41), which are calculated to have polarizations much greater   www.nature.com/scientificdata www.nature.com/scientificdata/ than their experimental values. Multiple entries of a given formula indicate multiple calculations for different structure pairs in our dataset for the same compound. Compounds with polarizations of less than or equal to 1 μC/cm 2 are not shown on the plot given the log-log scale. For polarizations less than 5 μC/cm 2 , we see our calculations capture the general trends of the experimental polarizations.
We find that the PBE functional we use for our DFT calculations overestimates polarizations. This is partially due to unit cells relaxed with PBE having larger than experimental volumes and thus larger distortions.

Usage Notes
In this work, we present 413 nonpolar-polar structure pairs in the Materials Project database that are compatible with a second-order phase transition as ferroelectric candidates and perform DFT calculations of total energy, band gap, and polarization for these structures pairs. This dataset offers the first opportunity to compare a large number of known, previously proposed, and new ferroelectrics side by side with the same methodology. We believe by setting strict criteria for ferroelectricity and casting a wide-net using high-throughput searches, we will find candidates that challenge and advance our understanding of ferroelectric phenomena. As seen in our candidates, there may be ferroelectrics waiting to be discovered that defy our expectations. This dataset will be useful for creating new tools and criteria for analyzing diverse ferroelectrics.
The infrastructure provided by the Bilbao Crystallographic Server, FireWorks, pymatgen, and atomate is crucial to being able to perform these types of searches efficiently. Thus, we also provide our code and data for these searches with the hope they will provide access for others to perform and develop similar searches.
Our code for performing structural interpolations and polarization calculations has been incorporated into the pymatgen and atomate packages. We also provide the code that we use to create the interface that we used to view our candidates in aggregate.
The workflow we have presented can be extended to any crystal structure database, experimental and hypothetical. Several modifications can be made to this workflow to extend the scope of these searches. Notably, extending our workflow to treat different magnetic orderings will enable it to treat multiferroics. Additionally, the same DFT workflow can be used to screen any experimentally measured polar structure -even one without an existing nonpolar reference -by generating nonpolar reference structures with BCS Pseudosymmetry 7 . Our workflow can also be adapted to perform species substitutions and find symmetry relations between structure types, classes of structures that share space groups, Wyckoff positions, and other lattice similarities. Our workflow code is included since atomate version 0.6.7 and our analysis code is available in pymatgen since Fig. 9 Log-log plot of experimental vs. calculated polarization for ferroelectric materials in the Landolt-Börnstein series with polarization value larger than 1 μC/cm 2 . 6 materials with polarization value equal to or smaller than 1 μC/cm 2 are not shown in plot but are included in the www.nature.com/scientificdata www.nature.com/scientificdata/