High performance Wannier interpolation of Berry curvature and related quantities with WannierBerri code

Wannier interpolation is a powerful tool for performing Brillouin zone integrals over dense grids of k points, which are essential to evaluate such quantities as the intrinsic anomalous Hall conductivity or Boltzmann transport coefficients. However, more complex physical problems and materials create harder numerical challenges, and computations with the existing codes become very expensive, which often prevents reaching the desired accuracy. In this article, I present a series of methods that boost the speed of Wannier interpolation by several orders of magnitude. They include a combination of fast and slow Fourier transforms, explicit use of symmetries, and recursive adaptive grid refinement among others. The proposed methodology has been implemented in the python code WannierBerri, which also aims to serve as a convenient platform for the future development of interpolation schemes for other phenomena.


INTRODUCTION
Wannier functions 1,2 (WFs) are a powerful tool for evaluating various electronic properties of solids, ranging from electric polarization [3][4][5] and orbital magnetization [6][7][8] to topological properties [9][10][11][12][13] . Moreover, WFs provide a way to describe a group of energy bands in a crystal by a compact Hamiltonian, thus allowing a fast evaluation of the band structure at any point of the Brillouin zone (BZ) without an extra call to the ab initio code 14 . This procedure called Wannier interpolation is similar in spirit to the tight-binding method 15 , but with the significant advantage that it offers a systematic way of precise description of any number of bands without truncation of the hopping integrals. Moreover, not only the band energies but also the matrix elements of various observables and operators, which are expressed in terms of wavefunctions and their derivatives over momentum space are precisely interpolated, while the tight-binding approach contains an unavoidable error due to the limited basis set. Wannier interpolation is particularly useful in the search for Weyl points in the band structure 11,16 , and in the evaluation of momentum-space integrals of rapidly varying functions. Such integrals appear, for example, in calculations of the anomalous Hall conductivity 17 , orbital magnetization 18 , Boltzmann transport coefficients 19 , and optical properties 20,21 . By now, Wannier interpolation schemes have been developed for many other properties that demand dense BZ samplings, such as electron-phonon coupling 22,23 , gyrotropic effects 24 , and spin Hall conductivity 25,26 .
Due to their gauge dependence, WFs are strongly non-unique and may be constructed in multiple ways. The most popular technique is the maximal localization procedure 2,27 , implemented in the well-established code Wannier90 28,29 , whose development is now driven by a broad community 30 . The construction of a good set of WFs requires some careful input from the user. This includes specifying a set of trial orbitals, which serve as an initial guess for the target WFs, and, if the bands of interest do not form an isolated group, choosing the "disentanglement" energy windows 14 . Recently, there has been significant progress in the automated construction of WFs with minimal user intervention. Different techniques have been proposed, such as the optimized projection functions method 31 , the selected columns of the density matrix method [32][33][34] , and an automated method to choose trial orbitals and energy windows 35,36 . In addition, a database of Wannier Hamiltonians is currently being constructed 37 . These advances constitute significant steps towards employing Wannier interpolation for high-throughput automated calculations of electronic properties of solids.
Most of the Wannier interpolation schemes mentioned above have been implemented within the popular codes-Wannier90 (namely its post-processing module postw90.x) and Wannier-Tools 11 . These codes, being well-established and widely adopted by the community, have quite broad functionalities. However, more complex materials and physical effects pose harder numerical challenges, and calculation with those codes can become quite heavy. In particular, for a system with a large number of WFs and a complicated Fermi surface, it may be hard to achieve convergence with respect to momentum-space grid in a reasonable time. When it comes to high-throughput calculations, performance becomes even more important.
In this article, I present a series of methodological improvements that allow to improve dramatically the performance of the Wannier interpolation method, without compromising its accuracy. The proposed methodology is implemented in the python code WannierBerri (WB), which is freely available to install and open for contributors (see "Code availability"). The present article covers only the core methodology of the code which procure its efficiency. The more broad scope of the code can be found on its web page and will be detailed in future publications. Interesting to note that WannierBerri may be equally used for Wannier interpolation and tight-binding calculations, and also offers a convenient platform for the development of more functionality. The name of the code is derived from Wannier functions and the Basque word "berri" which means "new" and enters local toponyms (e.g., Lekunberri, Ekainberri).
The high efficiency of the WB code is achieved by the combination of several methodological improvements. First, note that in a typical calculation for a 3D bulk material using postw90.x, the bottleneck is the Fourier transform, which is implemented as a standard discrete Fourier transform. Therefore, it looks appealing to use fast Fourier transforms (FFTs) 38,39 which are widely used in numerical calculations 40,41 . However, it is problematic to do so over a very dense grid of points, which may include up to 10 8 points. This issue is overcome in this work by a mixed scheme employing both fast and "slow" Fourier transforms. Next, the symmetries of the system may be used to reduce the evaluation only to the irreducible points and obtain the contributions from other points by applying symmetry operations. This also helps to render the result more symmetric (tensor components that should be equal or vanish will be exactly equal, or exactly vanish), even if the symmetries are slightly broken due to numerical inaccuracies in wannierization. Then I introduce an adaptive refinement algorithm that identifies points that give the largest contribution to the integral and makes the grid denser in the vicinity of such points. This helps to have a more accurate description near special points, where the integrand is rapidly changing-for example, near Weyl nodes or nodal lines. This is similar in spirit to the adaptive refinement scheme used in 17,42 , but is more automatic and requires less input from the user. Finally, I introduce methods that drastically reduce the computational cost of the minimal-distance replica selection method (MDRS) 30 , as well as the cost of scanning over multiple Fermi levels.
The article is organized as follows. First, I describe the set of proposed improvements to the existing Wannier interpolation methodology. Next, the usage of the code is demonstrated for the "textbook" example of the anomalous Hall conductivity of bcc iron, and the performance of the WB code is benchmarked against postw90 on the basis of that example. Finally, a broader functionality implemented in the WB code is described. "Methods" describes more computational details, and the routines to obtain the matrix elements that are not implemented in the interfaces of most ab initio codes to Wannier90, but essential to calculate the properties related to spin and orbital moment of Bloch electrons.

General equations for Wannier interpolation
We start with a brief overview of the Wannier interpolation method, mainly with a goal to introduce notation necessary for further discussion. For more details, please refer to review ref. 2 and original articles cited therein. The problem of Wannier interpolation is stated in the following way. First, we evaluate the energies E nq and wavefunctions ψ nq (r) ≡ e iq⋅r u nq (r) from first principles on a rather coarse grid of N q ¼ N 1 q N 2 q N 3 q wavevectors q within the reciprocal unit cell. Next, we want to find the energies and wavefunctions at points on a denser grid of wavevectors k. Further, we will consistently use q and k to denote the ab initio and interpolation grids, respectively.
For a group of entangled bands, one can define a set of J WFs defined as where J q ! J and R are real-space lattice vectors. The matrices V mn 0 ðqÞ contain all the information on the construction of WFs and may be generated by the Wannier90 code. They are constrained by Next, to obtain energies at an arbitrary point k one needs to construct the Wannier Hamiltonian which further may be diagonalized as X where U nl 0 ðkÞ are unitary matrices with columns corresponding to the eigenvectors of the Hamiltonian (4). In a similar way, for any operatorX, for which the matrix elements are evaluated on the ab initio grid, one may obtain the real-space matrix elements where in a simple case (e.g.,X ¼ σ) or ifX involves momentum-space derivatives, (e.g., the position operatorr α i∂=∂k α ) may also involve matrix elements between neighboring q points (see refs. 17,18 for details). Then the matrix elements may be interpolated to any k point in the Wannier gauge by and further rotated to the Hamiltonian gauge Note that Eqs. (3), (4), and (5) are particular cases of (6), (8), and (9). Equation (6) can be performed by means of FFT, and its result is periodic in R with a supercell formed by vectors A i ¼ a i N i q , where a i (i = 1, 2, 3) are the primitive unit cell vectors. Among the equivalent R vectors, we choose those belonging to the corresponding Wigner-Seitz (WS) supercell. If an R vector belongs to the WS supercell boundary, we include all equivalent vectors on the boundary with the corresponding elements X(R) divided by the degeneracy of the R vector. Further, the MDRS method (see "Minimal-distance replica selection method (MDRS)") may also slightly modify the set of R vectors.
As an example, the total Berry curvature of the occupied manifold is interpolated 17 via where the ingredients of the equation are obtained using Eqs. (8), The anomalous Hall conductivity is evaluated as an integral Note, that while the direct Fourier transform (6) is performed only once for the calculation, and is not repeated for the multiple k points upon interpolation, the inverse Fourier transform (8) is repeated for every interpolation k point. In fact, it presents the most time-consuming part of the calculation of AHE in relatively small 3D bulk systems as implemented in the Wannier90 code.

Mixed Fourier transform
In this section, we will see how the evaluation of Eq. (8) may be accelerated. It is easy to see that the computation time of a straightforward discrete Fourier transform scales with the number of R vectors and k points as t ∝ N R N k , and we are typically interested in a case N k ≫ N R (N R ≈ N q ).
When the Fourier transform is done on a regular grid of k points, it is usually appealing to use the FFT. For that one needs to place the R vectors on a regular grid of size N k , fill the missing spots with zeros and perform the standard FFT, for which the time will scale as t / N k log N k . However, there are some difficulties with such FFT. Mainly, because to perform FFT on a large grid implies storing the data for all k points in memory at the same time, which becomes a severe computational limitation. Also, FFT does not allow to reduce computation to only the symmetryirreducible k points and is more difficult to do in parallel. However, there is a way to combine the advantages of both the FFT and the usual discrete Fourier transform, leading to the concept of mixed Fourier transform.
We want to evaluate Eq. (8) for a set of k points.
where 0 This is always possible unless N i k is a prime number. But for really dense grids, we can adjust N i k a bit, to be factorizable in any way we want. Then the set of points Eq. (12) is equivalent to a set of points k = K + κ, where This separation is illustrated in Fig. 1a, which shows a 2 × 2 grid of K points, each corresponding to 4 × 4 FFT grid (dots of a certain color). Now for each K point, we can define X mn ðK; RÞ X mn ðRÞe iKÁR (14) and then Eq. (8) reads as The principle idea of mixed Fourier transform consists in performing the Fourier transform Eq. (15) as FFT, while Eq. (14) is performed directly. To perform the FFT, we put all the R vectors on a grid N 1 FFT N 2 FFT N 3 FFT , and a vector R ¼ P 3 i¼1 n i a i is placed on a slot with coordinates e n i ¼ n i mod N i FFT (n i are both positive and negative integers, while 0 e n i <N i FFT ). If the FFT grid is not big enough, contributions from different R vectors may appear on the same slot. In this case, they should be added up, and it does not affect the final results. However, for optimal performance, such a situation should be avoided at least for the majority of R vectors. Typically, it is a good choice to take the size of the FFT grid approximately equal to the ab initio grid.
The advantages of this approach are the following. First, the computational time scales as t 1 ∝ N K N R for Eq. (14) and t 2 / N K N FFT log N FFT for Eq. (15). Because it is required that N FFT ≥ N R (to fit all R vectors in the FFT box), we have t 1 t 2 / N k log N FFT (in practice it occurs that t 1 ≪ t 2 ), which scales better than both the Fast and "slow" Fourier transforms. Next, we can perform Eqs. (14) and (15) independently for different K points. This saves us memory and also offers a simple parallelization scheme. Also, we can further restrict evaluation only to symmetry-irreducible K points (see "Symmetries") and also perform adaptive refinement over K points (see "Recursive adaptive refinement").
Moreover, the evaluation time of a mixed Fourier transform only logarithmically depends on the size of the ab initio grid (recall that N FFT~NR~Nq ), while for the slow Fourier transform, the dependence is linear. However, in practice, we will see (see "Computation time") that the Fourier transform in the present implementation consumes only a small portion of computational time, and therefore the overall computational time is practically independent of the size of the ab initio grid.

Symmetries
When we integrate some quantity over the BZ, at every K point (after summing over κ points), we obtain the result as a rank-m Fig. 1 Illustration of the procedure of mixed Fourier transform, adaptive refinement, and use of symmetries. a-f show the procedure step by step using a 2D picture for visualization purposes, while the code actually works in 3D. The area of colored circles corresponds to the weight of the K point, gray crosses denote the points with zero weight. See the text for a detailed description. g AHC of bcc Fe, evaluated from a grid of 50 × 50 × 50 k points and 50 recursive adaptive refinement iterations. tensor X i1; ;im ðKÞ, for example, the berry curvature vector Ω γ or the conductivity tensor σ xy . Then the BZ integral is expressed as a sum and we initially set {K} as a regular grid (13a) and w K = 1/N K . Suppose G is the magnetic point group of the system. (Because X (K) is invariant under translations, here we are interested in the point group, rather than space group.) We define the set of symmetry-irreducible K points irr as a set of points that 8K; K 0 2 irr, ∀ g ∈ G holds gK ≠ K 0 , unless g = E (identity). Then we can rewrite the sum Eq. (16) as where we choose g K such that g À1 K K 2 irr (this choice maybe not unique), and obviously g K = E for K ∈ irr. Thus, only the irreducible K points need to be evaluated. Next, to make sure that the result respects the symmetries, despite possible numerical inaccuracies, we symmetrize the result as: Note, that e X ¼ X if the model respects the symmetry precisely (e.g., when symmetry-adapted WFs 43 are used). Combining Eqs. (17) and (18) and using where G ⋅ K denotes the orbit of K under the action of group G. The latter equation reflects the implementation in the WB code. Starting from a regular grid of K points, we search for pairs of symmetry-equivalent points. Whenever such a pair is found, one of the points is excluded and it's weight is transferred to the other point. Compare Fig. 1a, b: the red points are removed and their weight is moved to green points. Thus we end with a set of irreducible K point with weights e w K ¼ P GÁK K 0 w K 0 . Next, we evaluate X(K) (employing the corresponding interpolation scheme) only at symmetry-irreducible K points. Note, that although some k points corresponding to the same K point (same color in Fig. 1) are equivalent, we have to evaluate them all to be able to use the FFT. Finally, after summation, we symmetrize the result. The described procedure achieves two goals: (i) reduce the computational costs, and (ii) make the result precisely symmetric, even if the WFs are not perfectly symmetric. In the present example, we managed to obtain highly symmetric WFs (although without the employment of symmetry-adapted WFs method), and therefore the symmetrization procedure does not change the result (within relative accuracy~10 −5 ). However, for complex materials, such quality of WFs is not always easy to achieve.

Recursive adaptive refinement
It is well-known that in calculations of quantities involving Berry curvature (33) or orbital moment (34) of Bloch states (e.g., AHC, Berry curvature dipole (29) 44 or gyrotropic magnetoelectric tensor (30) 24,45 ), one performs integration over k-space of a function that rapidly changes with k. As a result, small areas of k-space give a major contribution to the integral. Such areas often appear in the vicinity of Weyl points, nodal lines, as well as avoided crossings. To accelerate convergence with respect to the number of k points, we utilize adaptive mesh refinement similar to refs. 17,42 . The authors of refs. 17,42 assumed a pre-defined threshold, and the k points yielding Berry curvature above the threshold was refined. This is inconvenient because one needs a good intuition to guess an optimal value for this threshold because it depends both on the quantity one wants to calculate, and the material considered.
In WB, it is implemented in a way that does not require an initial guess from the user. This procedure, in combination with the symmetrization described above, is illustrated in Fig. 1 in two dimensions (2D), while the actual work in 3D is described below. After excluding symmetry-equivalent K points (Fig. 1b), the results are evaluated for every K point and stored. We assume that initially each K point has weight e w K and corresponds to a volume defined by vectors c i K ¼ b i =N i k centered at K. Then we pick a few "most important K points". The criteria of importance may be different-either the Maximal value for any E F , or maximal value summed over all E F , or yielding most variation over the E F (if the evaluated quantity is a function of Fermi level E F ). Suppose we selected the magenta point, then those points are refined-replaced with eight points around it with coordinates where all combinations of ± signs are used. In Fig. 1c, four new blue K points in the 2D case. The weight and volume of the initial point is distributed over the new points, thus Then the symmetrization is applied again (the four blue points are connected by fourfold rotation) to exclude the equivalent points, and the weight of the equivalent points is collected on the remaining point, while the vectors c i K 0 are not changed. After the new K points are evaluated, we go to the next iteration of refinement. On each iteration, any point may be refined, including both those from the initial regular grid, and those created during previous refinement iterations. The procedure stops after the pre-defined number of iterations was performed. Figure 1g shows how undesired artificial peaks of the AHC curve are removed iteration by iteration, yielding a smooth curve (see "Example: AHC of bcc iron" for details).

Minimal-distance replica selection method (MDRS)
The MDRS method 30 allows to obtain a more accurate Wannier interpolation, in particular when moderate q grids are used in the ab initio calculations. With the MDRS method, the Fourier transform (8) is modified in the following way: where T ðjÞ mnR are N mnR lattice vectors that minimize the distance |r m − (r n + R + T)| for a given set m, n, R. However, the evaluation of Eq. (21) is quite slower than Eq. (8), because every k, m, n, R an extra loop over j is needed. Therefore, calculations employing MDRS in postw90.x (which is enabled by default) takes more time. Instead, it is convenient to re-define the real-space matrix elements as e X mn ðRÞ ¼ only once for the calculation, and then the transformation to kspace is performed via Note, that the set of R vectors in Eq. (22) is increased compared to the initial set of vectors in Eq. (6) in order to fit all nonzero elements e X mn ðRÞ. Equation (23) having essentially the same form as Eq. (8) can be evaluated via mixed Fourier transform, as described in section "Mixed Fourier transform".
Thus the MDRS method is implemented in WB via Eqs. (22)- (23), and has practically no extra computational cost, while giving notable accuracy improvement.

S.S. Tsirkin
Scanning multiple Fermi levels It is often needed to study anomalous Hall conductivity (AHC) not only for the pristine Fermi-level E F but considering it as a free parameter ϵ. On the one hand, it gives an estimate of the accuracy of the calculation, e.g., sharp spikes may indicate that the result is not converged. On the other hand, ϵ-dependence gives access to the question of the influence of doping and temperature, and also allows calculation of anomalous Nernst effect (27). As implemented in postw90.x, evaluation of multiple Fermi levels has a large computational cost. However, there is a way to perform the computation of AHC for multiple Fermi levels almost without extra costs. To show this, let's rewrite Eqs. (10), (11) as σ αβ ðϵÞ ¼ Àϵ αβγ where k = K + κ, the definitions of P n and Q ln straightly follow from Eq. (10), and we omit the cartesian index γ further in this subsection. Now suppose we want to evaluate Ω(ϵ i ) for a series of Fermi levels ϵ i . For different k-points and Fermi levels ϵ, the sets of occupied O(k, ϵ) and unoccupied states U(k, ϵ) change, and repeating these summations many times may be computationally heavy. Instead, we note that when going from one Fermi-level ϵ i to another ϵ i+1 only a few states at a few κ points change from unoccupied to occupied. Let's denote the set of such κ points as δκ i then, the change of the total Berry curvature is where where C mn;γ ðRÞ ϵ αβγ h0mjr α ÁĤ Á ðr β À R β ÞjRni, B mn;β ðRÞ h0mjĤ Á ðr β À R β ÞjRni and the other ingredients were explained under Eq. (10). Equation (26) is written following the approach of ref. 18 , but the result has a different form, which can be straightforwardly processed by analogy with Eqs. (24) and (25), where the first line of Eq. (26) expresses P n (k), while the second and third lines correspond to Q ln (k). Note that evaluation of C mn, γ (R) requires additional matrix elements evaluated on the ab initio grid, see "Evaluation of additional matrix elements" for details.

Example: AHC of bcc iron
In this section, the usage of the WannierBerri code is demonstrated on a simple example-anomalous Hall conductivity of bcc iron. After the Wannier functions are constructed with Wannier90 (see "Computational details"), the calculation is performed by the following short python script. First, we import the needed packages: import wannierberri as WB import numpy as np Then, we read the information about the system and WFs: system=WB.System_w90(Fe,berry=True) from files Fe.chk, Fe.eig, Fe.mmn (the first is written by Wannier90, the other two by the interface of the ab initio code, e.g., pw2wannier90.x), or we can read all information from a file Fe_tb.dat, which is also written by Wannier90, or maybe composed by the user from any tight-binding model: Next, we define the symmetries that we wish to take into account. In the ab initio calculation, we have specified the magnetization along the z axis, hence the symmetries that are preserved are inversion I , fourfold rotation around the z axis C 4z , and a combination of time-reversal T and twofold rotation around the x axis C 2x . Here, we need only the generators of the symmetry group. The other symmetries will be automatically obtained by taking products of these generators, for example, the mirror is Next, we need to set the grids of k, K, and κ points. Most conveniently, it can be done by setting the "length" parameter (in Å): grid=WB.Grid(system,length=100) This will guarantee the grid to be consistent with the symmetries, and the spacing of k points will be Δk % 2π length . In this particular case, the reciprocal lattice vectors have length |b i | = 3.1 Å −1 , hence the suggested grid size is lengthÁjbi j 2π % 49. However, the grid size is adjusted to 50 × 50 × 50 points in order to factorize it as 10 × 10 × 10κ grid and 5 × 5 × 5K grid.
Next, we want to integrate the Berry curvature to get the AHC. This is done by the WB.integrate method.
WB.integrate(system, grid, Efermi=np.linspace(17.,18.,1001), smearEf=10, # 10K quantities=["ahc","cumdos"], numproc=16, adpt_num_iter=50, fout_name="Fe") and in addition to AHC, we evaluate the cumulative density of states (cDOS) Eq. (32). We consider Fermi level as a free parameter, scanning over a set of Fermi levels from 17 to 18 eV with a step of 1 meV, and small smearing over the Fermi level corresponding to temperature 10 K (~1 meV) is used. It is known, that in the BZ integration, some k points may give a large contribution to the integral. This is especially strong for Berry curvature, which blows up near band degeneracies and avoided crossings, that fall close to the Fermi level. This is reflected as huge spikes in the E F -resolved curves-see blue curve in Fig. 1g. To make the calculation more precise around such points, an adaptive recursive refinement algorithm is used, and we set the number of iterations to 50.
From the cDOS, we can find the precise position of the Fermi level E F = 17.618 eV-the energy at which the cumulative DOS reaches eight electrons per unit cell. This is more accurate than the result evaluated from a coarse ab initio grid. Next, it is instructive to plot the AHC after each iteration. In Fig. 1g, one can see that after 50 iterations the chaotic peaks are removed, and we S.S. Tsirkin can get a reasonably smooth curve, although we have started from a rather coarse grid of only 50 × 50 × 50 k points. In practice, it is still recommended to start with denser grids, when possible.
To demonstrate why the adaptive refinement is so useful for calculations of AHC, we can also visualize the Berry curvature. With the following lines WB.tabulate(system, grid, quantities=["berry"], frmsf_name="Fe", numproc=16, ibands=np.arange (4,9), Ef0=12.610) we produce files Fe_berry-?.frmsf, containing the energies and Berry curvature of bands 4-8 (band counting starts from zero) tabulated over the 3D Brillouin zone. The format of the files allows being directly passed to the FermiSurfer visualization tool 46 which can produce a plot like Fig. 2, which clearly shows that Berry curvature is large in the regions where two bands come close to each other.
This short example demonstrates that the calculations with WB may be run with a few lines of Python script and the broader functionality will be detailed further.

Computation time
Now let us compare the time for the calculations of anomalous Hall conductivity using postw90.x and WannierBerri. We will take the example of bcc Fe and vary different parameters while using the same computational resource (see "Methods").
The computation consists of two phases. First, some preliminary operations are done. Those include reading the input files and performing Fourier transform from ab initio grid q to real-space vectors R : Eqs. (3), (6), and (7). This operation takes in WB (postw90.x) between 2 (3) seconds for the small q grid 4 × 4 × 4 and 2 (3) minutes for a large grid of 16 × 16 × 16. This time is mostly taken by reading the large formatted text file Fe.mmn, and it is done only once and does not scale with the density of the interpolation grid. In WB this is done in the constructor of the System_w90 class, and the object can be saved on disk using a pickle module so that this operation does not repeat for further calculations.
Next comes the interpolation part itself, for which the evaluation time scales linearly with the number of k points used. Further, the time for an interpolation grid 200 × 200 × 200 is given, which is a rather good grid to make an accurate calculation for this material.
We start with comparing time with the MDRS switched off and without the use of symmetries in WB. As can be seen in Fig. 3, for a small q grid 4 × 4 × 4 WB is just slightly faster then postw90.x. However, for dense q grids the computational time of postw90.
x grows linearly with the number of q points, while in WB it stays almost the same. This happens because in postw90.x the Fourier transform is the major time-consuming routine. On the other hand, in WB, although the cost of the mixed Fourier transform is expected to grow logarithmically with the ab initio grid, we do not see it because Fourier transform amounts only tõ 10% of the computational time. Next, we switch on the MDRS method and the computational time of postw90.x grows by a factor of 5. On the other hand, the computational time of WannierBerri does not change (not shown).
Finally, let's switch on the use of symmetries in WB. Thus the computational time decreases by a factor of 8. In the ultra-dense grid limit, one would expect the speedup to be approximately equal to the number of elements in the group-16 in the present example, due to exclusion of symmetry-equivalent K points. But this does not happen, because we use an FFT grid of 25 × 25 × 25 κ points for all ab initio grids, hence the K grid is only 8 × 8 × 8, and a considerable part of K points are at high-symmetry positions. Therefore they have less symmetric partners to be excluded from the calculation.
Thus we can see that the difference in computational time with postw90.x and WB reaches three orders of magnitude for this example. Note that the examples above were performed only for the pristine Fermi level. Now let's see what happens upon scanning the Fermi levels (Fig. 4). In WB, the computational time remains practically unchanged when we use up to N ϵ ≈ 1000 Fermi levels, and only start to grow considerably at N ϵ~1 0 4 . On the other hand in postw90.x, the computational time significantly grows with N ϵ , which is especially remarkable for small q grids, where the growth becomes linear already from N ϵ~1 0. For denser q grids, the fixed amount of time (independent of N ϵ ) is larger, so the linear growth starts at higher N ϵ .
In this section, we did not use the adaptive refinement procedure. However, when one starts from a rather large grid of K points, the new K points coming from the refinement procedure constitute only a small portion of the initial grid, and hence do not contribute much to computation time.  We have shown that the methods suggested in this article help to significantly reduce the computation time from days to minutes. However, bcc iron is a simple example with only one atom per unit cell, and only 18 WFs are needed. More complicated systems will require more time. For example, to obtain a converged value of anomalous Nernst conductivity in PrAlGe 47 using wannier19 (an early version of WB), the calculation took approximately 30 h on the same computation node. Estimates predict that the same calculation with postw90.x could take several months. Thus it is the case where the numerical advance not only saves time but also brings the calculation from the unreasonably time-consuming area, where most people would avoid working, to a reasonably feasible computation time.

The functionality implemented in WannierBerri
This appendix outlines the functionality that is currently implemented in WannierBerri. For more detailed and updated information, please refer to the online documentation.

DISCUSSION
In this article, I have presented a series of methods that boost the performance of Wannier interpolation to a higher level. The methods are implemented in the Python code WannierBerri. It is important to note that the mixed Fourier transform and the optimization of MDRS method and Fermi-level iteration while giving a large computational advantage, do not affect the result within machine precision. Hence, WannierBerri can be easily benchmarked with the established postw90.x code. The code not only allows to perform high-speed and high-precision calculations of AHC and a palette of other properties but also serves as a platform for implementing more functionalities involving Wannier interpolation. Thus it has the potential to become a community code. Interestingly, the code uses the same routines to perform calculations both based on WFs and tight-binding models. Finally, in combination with recent advances in automated construction of WFs, it paves a way to highthroughput calculations of properties of solids that require Wannier interpolation.

Computational details
The input files for WannierBerri are prepared by a combination of an ab initio code and Wannier90. In this article, we used the QuantumEspresso (QE) code 50 , and closely followed the Tutorial#18 of Wannier90. First, we perform self-consistent calculations on a grid 16 × 16 × 16 q points, fixing the magnetization along [001] axis. We use Generalized gradient approximation (GGA) 51 for exchange-correlation functional and a norm-conserving pseudopotential from the PseudoDojo library 52,53 , and spin-orbit coupling is taken into account. Next, we perform a non-selfconsistent calculations using a Γ-centered grid of 8 × 8 × 8 q points. Further, we define the set of trial orbitals sp 3 d 2 2, d xy , d xz , and d yz , which set the initial guess for 18 WFs describing the conduction band of Fe. And prepare the input files for Wannier90 by means of the pw2wannier90.x interface. Finally, the maximally localized Wannier functions are constructed by the Wannier90 code. The resulting files may be used to start calculations both by postw90.x and WannierBerri. Further, we repeat this procedure for a series of non-self-consistent q grids. The reader may refer to the documentation of Wannier90 and QE for more details.
Calculations were performed on identical 32-core virtual nodes of the ScienceCloud cluster at the University of Zürich. The nodes are based on

External libraries used in the code
The WannierBerri code is developed in the object-oriented Python3 language and makes use of the popular numerical libraries, which include NumPy 54 , SciPy 55 , and lazy-propery. Fast Fourier transform is implemented via libraries pyFFTW (wrapper of FFTW3 library 56 ) and NumPy, and the user has an option to choose between them, although no notable difference in performance has been observed. The calculations are performed in parallel over K points by means of the multiprocessing module and the parameter "numproc" specifies that a Pool of 16 worker processes is used.

Evaluation of additional matrix elements
Wannier interpolation starts from certain matrix elements defined on the ab initio (q) grid. Those matrix elements should be evaluated within the ab initio code, namely within its interface to Wannier90. However, only QuantumEspresso 50 has the most complete interface pw2wannier90.x. The other codes provide only the basic interface, which includes the eigenenergies E nq (.eig file) and overlaps (file .mmn), where b vector connects neighboring q points. This information allows to interpolate the band energies (and their derivatives of any order) as well as Berry connections 20 and Berry curvature 17 . However, to evaluate the orbital magnetization via Eq. (26) or the orbital moment of Bloch states (34), one needs matrix elements of the Hamiltonian 18 (.uHu file) C b1;b2 mn ðqÞ ¼ hu mqþb1 jĤ q ju nqþb2 i: The evaluation of Eq. (37) is very specific to the details of the ab initio code, and implemented only in pw2wannier90.x and only for norm-conserving pseudopotentials. To enable the study of properties related to the orbital moment with other ab initio codes, the following workaround may be employed. By inserting a complete set of Bloch states at a particular q point This equation is implemented within the wannierberri.mmn2uHu submodule, which allows to generate the .uHu file out of .mmn and .eig files. The equality in Eq. (38) is exact only in the limit l max ! 1 and infinitely large basis set for the wavefunctions representation. So in practice, one has to check convergence for a particular system. As an example, we calculate the orbital magnetization of bcc Fe using the .uHu file computed with pw2wan-nier90.x and using the wannierberri.mmn2uHu interface with different summation limit l max in (38). As can be seen in Fig. 5 already l max ¼ 200 (corresponding energy~230 eV) yields a result very close to that of pw2wannier90.
x. However, one should bear in mind that convergence depends on many factors, such as the choice of WFs and pseudopotentials. In particular, for tellurium, we observed 24 that including only a few bands above the highest p-band is enough to obtain accurate results. However for iron, using a pseudopotential shipped with the examples of Wannier90, we failed to reach convergence even with l max ¼ 600.
To interpolate the spin operator expectation value, the matrix s mn ðqÞ ¼ hu mq jσju nq i is needed. To facilitate study of spin-dependent properties within VASP 57 code, a submodule wannierberri.vaspspn is included, which computes S mnq based on the normalized pseudo-wavefunction read from the WAVECAR file. Note that the use of pseudo-wavefunction instead of the full PAW 58 wavefunction is an approximation, which however in practice gives a rather accurate interpolation of spin.
The mmn2uHu and vaspspn modules were initially developed and used in ref. 24 as separate scripts, but were not published so far. Now they are included in the WannierBerri package with the hope of being useful to the community.

DATA AVAILABILITY
All input files needed to generate the Wannier functions and perform the example of the code, as well as the resulting output data plotted in the figures may be freely downloaded from the Materials Cloud Archive (https://doi.org/10.24435/ materialscloud:1r-8w).
Received: 5 October 2020; Accepted: 11 January 2021; Fig. 5 Orbital magntization of bcc Fe. Given as a function of the Fermi level E F relative to the pristine Fermi level E 0 F . Curves evaluated using the .uHu file computed with pw2wannier90.x (dashed line) and using the wannierberri.mmn2uHu interface (solid lines) with different summation limit l max in (38).