Tensor-structured algorithm for reduced-order scaling large-scale Kohn-Sham density functional theory calculations

We present a tensor-structured algorithm for efficient large-scale DFT calculations by constructing a Tucker tensor basis that is adapted to the Kohn-Sham Hamiltonian and localized in real-space. The proposed approach uses an additive separable approximation to the Kohn-Sham Hamiltonian and an L1 localization technique to generate the 1-D localized functions that constitute the Tucker tensor basis. Numerical results show that the resulting Tucker tensor basis exhibits exponential convergence in the ground-state energy with increasing Tucker rank. Further, the proposed tensor-structured algorithm demonstrated sub-quadratic scaling with system size for both systems with and without a gap, and involving many thousands of atoms. This reduced-order scaling has also resulted in the proposed approach vastly outperforming plane-wave DFT implementation for systems beyond 5,000 electrons.


Introduction
Density functional theory (DFT) has been the workhorse of ab-initio materials simulations for over three decades, providing many key insights into materials properties and materials behavior. In order to study ground-state properties, based on the Hohenberg-Kohn theorem 1 and the Kohn-Sham formulation 2 , DFT reduces the Schrödinger equation in 3N e spatial coordinates (N e denoting the number of electrons) to an equivalent problem in the electron-density that only depends on three spatial coordinates. This reduces the exponential computational complexity (with system-size) of solving the Schrödinger equation to the cubic computational complexity of DFT. While DFT has enabled wide-ranging ab-initio calculations, with ∼ 1/4th of the computational resources on some public supercomputers utilized for DFT calculations, the cubic computational complexity has limited routine DFT calculations to typical system-sizes involving a few hundred atoms. In an attempt to enable DFT calculations on large-scale systems that are critical to understanding many aspects of complex materials phenomena, many efforts over the past three decades have focused on developing reduced-order scaling algorithms for electronic structure calculations [3][4][5][6][7][8][9][10][11][12][13] . While these approaches have been demonstrated to provide close to linear-scaling complexity for materials with a gap, the computational complexity of metallic systems (without a gap) is close to cubic-scaling in practice. This work is a key step towards addressing the computational complexity of DFT calculations, where we demonstrate achieving systematically convergent DFT calculations that exhibit sub-quadratic scaling for systems with and without a gap over system-sizes spanning many thousands of atoms.
This line of work is motivated from a study that revealed a low-rank representation for the electronic structure using Tucker and canonical tensor decomposition [14][15][16] . Another study based on a posteriori analysis showed that the rank required to approximate the electron density is only weakly dependent on the system-size 17 . These studies have thereby prompted the development of a tensor-structured approach for DFT calculations 18 , where a Tucker tensor basis adapted to the Kohn-sham Hamiltonian was employed to solve the Kohn-Sham equations. Importantly, the rank of the Tucker tensor basis was only weakly dependent with system-size for materials systems with and without a gap, which revealed the potential for realizing a reduced-order scaling approach for DFT calculations. However, since the constructed Tucker-tensor basis is global, the representation of the Kohn-Sham Hamiltonian in this basis was dense. With increasing system-size, albeit the slow growth in the size of the Hamiltonian matrix due to the weak rank dependence, the approach became computationally prohibitive for large systems and limited the system-sizes to a few hundred atoms.
In this work, L 1 localization is used to overcome the aforementioned drawbacks, and we demonstrate systematically convergent, efficient and reduced-order scaling large-scale DFT calculations using tensor-structured techniques (cf. Fig. 1 for an overview of the approach). The L 1 localization utilizes the idea of auto-encoder commonly recognized in machine learning to construct a series of 1-D functions that are localized yet closely approximate the function space of interest. The 1-D localized functions that are a close approximation to the eigensubspace of a suitably constructed additive separable approximation of the Kohn-Sham Hamiltonian are used to generate the localized Tucker tensor basis for the DFT problem. The locality of the Tucker tensor basis results in a sparse discrete Kohn-Sham Hamiltonian matrix, which is exploited in the solution of the Kohn-Sham equations using the Chebyshev filtering subspace iteration scheme. The sparsity of the Kohn-Sham Hamiltonian matrix represented in the localized Tucker tensor basis improves both the computational efficiency and the memory footprint. Further, as will be demonstrated, the proposed approach has enabled sub-quadratic scaling DFT calculations on large-scale systems involving many thousands of atoms. The approach is generic and treats both systems with and without a gap on an equal footing. Importantly, this translates to substantial speed-ups over Quantum Espresso, a widely used state-of-the-art plane-wave DFT code 19,20 , with speed-ups of ∼ 8-fold for metallic nano-particles containing ∼ 2, 000 atoms.

Results
The ground state energy in Kohn-Sham DFT (spin independent formulation) of a materials system with N a atoms and N e electrons is computed by solving a non-interacting single-particle Schrödinger equation in a mean field determined by the effective potential V eff (x): Equation 1 represents a non-linear eigenvalue problem with H := − 1 2 ∇ 2 + V eff being the Kohn-Sham Hamiltonian, and i denoting the i-th Kohn-Sham eigenvalue and Ψ i denoting the corresponding Kohn-Sham orbital (eigenvector). The effective potential V eff can be computed efficiently using a low-rank Tucker tensor approximation. We refer to the Supplementary Information (SI) for details on using tensor-structured techniques for an efficient computation of the effective potential, and subsequently the Hamiltonian matrix elements. The electron density ρ(x) is computed in terms of the Kohn-Sham eigenstates as ρ( denotes the orbital occupancy factor given by the Fermi-Dirac distribution f ( ; µ) = 1/ 1 + exp( −µ k B T ) with the Boltzmann constant k B , the Fermi energy µ, and the smearing temperature T . We note that Eq. 1 represents a non-linear eigenvalue problem, as the Kohn-Sham Hamiltonian depends on ρ, which in turn depends on the eigenstates. Thus, Eq. 1 is solved self-consistently via a self-consistent field (SCF) iteration 21 .
Tensor-structured algorithm for Kohn-Sham DFT using L 1 localized 1-D functions In our previous work 18 , it was suggested that an additive separable approximation to the Kohn-Sham Hamiltonian can be used to construct a Tucker tensor basis that is systematically convergent. In particular, using a tensor-structured cuboidal domain Ω spanned by tensor product of 1-D domains ω k=1,2,3 , an additive separable approximation to the Kohn-Sham Hamiltonian (H 1 (x 1 ) + H 2 (x 2 ) + H 3 (x 3 ) ≈ H(x) ) retains some features of the Hamiltonian, and thus presents a useful operator to generate reduced-order basis functions. To this end, the eigenfunctions of the additive separable approximation to the Hamiltonian, which constitute a Tucker tensor basis formed from the 1-D eigenfunctions of the separable parts of the Hamiltonian (H k , k = 1, 2, 3), are used to solve the Kohn-Sham equations. While an efficient basis, the global nature of the ensuing Tucker tensor basis limits the computational efficiency of the algorithms to solve the Kohn-Sham equations. In the proposed work, in place of the 1-D eigenfunctions of H k , we instead construct compressed modes preserving the subspace spanned by the 1-D eigenfunctions using L 1 localization technique 22 . The obtained 1-D localized functions are then used to generate the 3-D Tucker tensor basis, which is localized in real-space and allows us to exploit the sparsity of the Kohn-Sham Hamiltonian represented in this basis for both computational efficiency and realizing reduced-order scaling in solving the Kohn-Sham equations. The various aspects of our tensor-structured algorithm are now presented, which includes the generation of the additive separable approximation of the Kohn-Sham Hamiltonian, the evaluation of the L 1 localized 1-D functions, the construction of the localized Tucker tensor basis, the projection of the Kohn-Sham problem onto the localized Tucker tensor basis, and the solution of the Kohn-Sham equations.

Construction of separable Hamiltonian
We seek to construct a separable approximation to the Kohn-Sham Hamiltonian H 1 ( based on a rank-1 approximation of the eigenfunction corresponding to the lowest eigenvalue. To this end, we consider the rank-1 representation for the eigenfunction as Ψ (x) = ψ 1 (x 1 )ψ 2 (x 2 )ψ 3 (x 3 ). Thus, the problem of computing the smallest eigenvalue of the Kohn-Sham Hamiltonian using the rank-1 approximation is given by the variational problem min ψ k L(Ψ ) subject to: Ψ |Ψ = 1 (2) with the Lagrangian L(Ψ ) = Ψ − 1 2 ∇ 2 + V eff (x) Ψ . Upon taking the variations of the functional with respect to ψ 1 , ψ 2 and ψ 3 , we obtain three simultaneous 1-D eigenvalue problems As H k and α k are parametrized by ψ l =k (see SI for details), the three simultaneous 1-D eigenvalue problems represent a non-linear problem that can be solved self-consistently via SCF iteration. Upon achieving self-consistency, the 1-D Hamiltonians (H k ) we obtain represent the additive separable approximation of the Kohn-Sham Hamiltonian that we seek. The eigenfunctions of this additive separable approximation to the Hamiltonian, which can be obtained as the tensor product of the 1-D eigenfunctions of H k (k = 1, 2, 3), constitute a complete basis, thus providing systematic convergence as will be demonstrated. We remark that the proposed approach represents one possibility of constructing an additive separable approximation to the Kohn-Sham Hamiltonian, and other possibilities may exist. The proposed separable approximation retains some aspects of the Kohn-Sham Hamiltonian, and thus the eigenbasis of this separable approximation represents a systematically convergent basis that is more efficient than the plane-wave basis, as will be demonstrated from our numerical studies. We note that the Kohn-Sham Hamiltonian changes during the course of the SCF iteration. While the separable approximation to the Kohn-Sham Hamiltonian can be computed for each SCF iteration adapting the tensor-structured basis to the SCF iteration, our numerical studies suggest that it suffices to compute the tensor structured basis in the first SCF iteration (using the superposition of atomic densities to compute the Kohn-Sham potential) and keep the basis fixed for the SCF iteration.

Computation of the L 1 localized 1-D functions
The tensor-structured basis computed using the 1-D eigenfunctions of H k represents an efficient basis. However, the global nature of the basis limits the computational efficiency and scaling (with system size) of solution to the Kohn-Sham equations. To this end, we use an L 1 localization approach 22 to construct a spatially localized tensor-structured basis that is a close approximation to the original tensor-structured basis. The localized basis is obtained by solving the following variational problem (for k = 1, 2, 3) where H k is matrix representation of H k in a suitable orthogonal basis with dimension n, Ψ k denotes the representation of N k trial localized functions in the chosen basis, and µ is a parameter controlling the trade-off between the representability of the original eigensubspace and the locality of the 1-D functions, with |·| denoting the L 1 norm of the matrix. The minimizer of this variational problem, henceforth denoted as Ψ L k , provides localized functions whose span closely approximates the eigensubspace of the lowest N k eigenfunctions of H k , as will be demonstrated subsequently. We refer to the methods section for the solution procedure employed to solve the aforementioned variational problem.

Construction of the localized 3-D Tucker tensor basis T L
The 1-D localized functions whose span is a close approximation to the subspace spanned by the 1-D eigenfunctions of H k are subsequently used to construct the 3-D Tucker tensor basis. Denoting the 1-D localized functions as , the 3-D localized tensor-structured basis functions T L I are given by The rank of the Tucker tensor basis is given by (R 1 , R 2 , R 3 ) which denotes the number of localized 1-D functions in each direction. The space spanned by the 3-D localized tensor-structured basis functions is denoted as T L .

Projection of the Kohn-Sham Hamiltonian onto T L
The Kohn-Sham Hamiltonian is projected onto T L spanned by the 3-D tensor-structured localized basis functions, given bỹ In order to facilitate an efficient evaluation of the Hamiltonian matrix elements, a low-rank Tucker approximation of the effective potential V eff , denoted byṼ eff , is used. We refer to SI for details on using tensor-structured techniques for an efficient computation ofṼ eff , and subsequently the Hamiltonian matrix elements. Upon computing the projected Kohn-Sham Hamiltonian, a truncation tolerance is introduced to zero the Hamiltonian elements below the tolerance to improve the sparsity of the Hamiltonian. The truncation tolerance is chosen such that the error resulting from the truncation is less than the discretization error and the desired chemical accuracy.

Computation of the occupied eigenstates
The discretized Kohn-Sham problem, corresponding to Eq. 1, in the localized orthonormal tensor-structured basis is given by the standard eigenvalue problem where H L denotes the truncated sparse Kohn-Sham Hamiltonian matrix. We use the Chebyshev filtering based subspace iteration (ChFSI) 23 to efficiently solve the Kohn-Sham equations. In the ChFSI method, in each SCF iteration, a suitably constructed Chebyshev filter is employed to construct a close approximation to the eigensubspace of the occupied states, and the Kohn-Sham eigenstates are computed by projecting the problem onto the Chebyshev filtered subspace. The ChFSI method has been demonstrated to be efficient with good parallel scalability for real-space implementations of DFT 24,25 . We refer to the methods section for details on the ground-state DFT calculations.

Eigensubspace representability of the localized 1-D functions
We now demonstrate the the ability of the L 1 localized functions to closely approximate the eigensubspace of H k using Al 147 nano-particle with icosahedral symmetry. We compute the additive separable approximation of the Kohn-Sham Hamiltonian for this nano-particle, and, using this, compute the lowest 70 eigenstates of H k . We subsequently use the L 1 localization approach to compute the localized functions that are a close approximation to the eigensubspace. Figure 2 shows the lowest 5 eigenfunctions of H 1 (one of the 1-D separable Hamiltonian) (top) and the corresponding 1-D localized functions (bottom). We refer to SI (Fig. SI1) for an illustration of all 70 eigenstates and the corresponding 1-D localized functions. It is evident that, while the eigenfunctions are global in nature, the functions obtained from the L 1 localization approach are localized in real-space. This locality is key to the sparsity of the Kohn-Sham Hamiltonian matrix in the Tucker tensor basis, and the resulting computational efficiency.
In order to demonstrate the accuracy of the L 1 localization approach in closely approximating the eigensubspace of the separable Hamiltonian, we consider the first 70 eigenstates of H 1 and the eigenvalues of the matrix Figure 3 shows the eigenvalues of H 1 and the eigenvalues of K ij . It is interesting to note that the eigenvalues of the first 65 states are almost identical, with only slight deviations for the higher states. This demonstrates that the space spanned by the localization functions obtained using the L 1 localization approach is a close approximation to the eigensubspace of H k . We also note here that better accuracy can be achieved, when necessary, by simply increasing the size of N k to be solved for in Eq. 4.

Convergence of the tensor-structured basis
We next investigate the convergence properties of the 3-D Tucker tensor basis constructed from the 1-D localized functions. For the convergence study we consider two benchmark problems: (i) C 60 (fullerene) molecule; (ii) tris (bipyridine) ruthenium, a transition metal complex. We note that these systems have no tensor structure symmetry and serve as stringent benchmarks to assess the convergence and accuracy afforded by the proposed Tucker tensor basis. The ground-state energy for these molecules is computed for various Tucker tensor ranks R (R 1 = R 2 = R 3 = R), and the error is measured with respect to a well-converged Quantum Espresso result. The converged Quantum Espresso ground-state energies for the the fullerene molecule is taken to be -155.1248 eV/atom (E cut = 60 Ha) and that of tris (bipyridine) ruthenium is taken to be -118.2128 eV/atom (E cut = 65 Ha). Fig. 4 show the relative error in the ground-state energy for the various ranks of the Tucker tensor basis. It is evident from these results that the Tucker tensor basis constructed using our approach provides an exponential convergence in the ground-state energy with increasing Tucker rank. The convergence study of these molecules suggests that the proposed tensor-structured technique provides systematic convergence with high accuracy and is capable of handling generic materials systems, including those involving transition metals.

Performance and scaling analysis
To study the performance and scaling with system-size of the proposed tensor-structured approach for DFT calculations, we consider two classes of benchmark systems: (i) Aluminum nano-particles of various sizes ranging from 13 atoms to 6,525 atoms; (ii) Silicon quantum dots with system sizes ranging from 26 atoms to 7,355 atoms. These benchmark systems constitute materials systems with and without a gap, thus allowing us to assess the system-size scaling for both classes of materials. In order to compare the efficiency of the proposed tensor-structured approach with the widely used plane-wave DFT calculations, we also conducted the DFT calculations using Quantum Espresso wherever possible. For the sake of estimating the computational efficiency, the energy cutoff for Quantum Espresso and the Tucker rank is chosen such that the ground-state energy is converged to within 10 meV/atom measured with respect to a highly converged reference calculation. The reference ground-state energies are obtained from Quantum Espresso (using a high energy cut-off) for smaller systems and the DFT-FE code 25 -a massively parallel real-space code for large-scale DFT calculations-for larger system sizes. The cell size for plane-wave calculations is chosen such that each atom is at least 10 Bohr away from the boundary, which was needed to obtain the desired accuracy.

Aluminum nano-particles
The speed-up ratio of the proposed tensor-structured approach and Quantum Espresso for the various aluminum nano-particles with icosahedral symmetry considered in this work are provided in Fig. 6(a). The speed-up is computed with the computational times per SCF iteration in node-hrs using the proposed tensor-structured approach divided by that of Quantum Espresso. While Quantum Espresso is more efficient for the smaller system-sizes, the tensor-structured approach starts to substantially outperform for larger system sizes. Notably, for Al 2057 , the tensor-structured approach is 8-fold more efficient. Using the computational times, we estimate the scaling of the proposed tensor-structured approach to be O(N 1.78 e ) with N e denoting the number of electrons. Notably, the scaling with system-size is sub-quadratic for this metallic system over system-sizes spanning many thousands of atoms, as opposed to the cubic-scaling complexity for plane-wave DFT calculations. Figure 6(b) compares the computational performance of the proposed tensor-structured approach with Quantum Espresso for a wide range of silicon quantum dots passivated with hydrogen. As in the case of aluminum nanoparticles, the proposed approach starts competing with Quantum Espresso beyond a few hundred atoms, and substantially outperforms for larger systems. The scaling with system size for the tensor-structured algorithm, for a range of system-sizes with the largest containing 7,355 atoms, is estimated to be O(N 1.8 e ). Notably, this scaling is similar to that obtained for aluminum nano-particles as the algorithm treats systems with and without a gap on a similar footing.

Discussion
We have presented a tensor-structured algorithm, where the Tucker tensor basis is constructed as a tensor product of localized 1-D functions whose span closely approximates the eigensubspace of a suitably constructed additive separable approximation to the Kohn-Sham Hamiltonian. The resulting localized Tucker tensor basis, that is adapted to the Kohn-Sham Hamiltonian, provides a systematically convergent basis as evidenced by the exponential convergence of the ground-state energy with increasing Tucker rank. Our numerical studies on the computational performance suggest that the proposed approach exhibits sub-quadratic scaling (with system-size) over a wide range of system-sizes with the largest involving many thousands of atoms. Importantly, the sub-quadratic scaling is realized for both systems with and without a gap, as the algorithm treats both metallic and insulating systems on an equal footing. Further, comparing the computational efficiency of the proposed approach with Quantum Espresso, we observe substantial outperformance for system-sizes beyond 5,000 electrons.
We note that the sub-quadratic scaling is a consequence of the slow growth of the Tucker rank with system-size, with the resulting number of basis functions growing sub-linearly with system-size even for systems containing many thousands of atoms. By combining the proposed approach with reduced-order scaling techniques that exploit the locality of the wavefunctions in real-space, there is further room to reduce the scaling with system-size and is a 5/20 useful future direction to pursue. Further, the proposed tensor-structured approach is amenable to GPU acceleration that can further substantially enhance the computational efficiency of the approach, and is currently being pursued.

Tucker tensor representation
Tucker tensor representation is a higher-order generalization of principal component analysis for a tensor [26][27][28] . An N-way tensor is approximated by a Tucker tensor through Tucker decomposition with a smaller N-way core tensor and N factor matrices whose columns are the rank-1 components from the decomposition [26][27][28][29][30] . In the scope of this work, the discussion is restricted to three-way tensor. Let A ∈ R I 1 ×I 2 ×I 3 be a real-valued three-way tensor of size I 1 × I 2 × I 3 indexed by a set of integers (i 1 , i 2 , i 3 ) where σ ∈ R R 1 ×R 2 ×R 3 is the core tensor, u r d d ∈ R I d forms the factor matrix U d ∈ R I d ×R d , and "•" denotes the vector outer product (u r 1 The core tensor stores the coefficients σ r 1 r 2 r 3 for each rank-1 tensor u r 1 1 • u r 2 2 • u r 3 3 . The core tensor and the factor matrices can be viewed as the higher-order correspondence of the singular values and unitary matrices. The tensor representation of the Tucker form can be obtained with the higher-order singular value decomposition (HOSVD). The HOSVD flattens the given tensor in three directions and employs singular-value decomposition to obtain the factor matrices. The factor matrices are then used to contract with the given tensor to obtain the core tensor. We refer to 30 and 31 for details of HOSVD and further review on tensor decomposition and tensor analysis. In this work, an MPI implementation in C++ for Tucker decomposition is used 32, 33 .

Computation of L 1 localized 1-D functions
The variational problem in Eq. 4 is solved by the splitting orthogonality constraint algorithm (SOC). The SOC algorithm introduces two auxiliary variables controlling the orthonormality and the locality constraints, and translates the variational problem into a constrained minimization problem. The constrained minimization problem is split into one minimization problem and two constraints. The SOC algorithm is capable of providing a set of compressed modes with good locality, yet preserving the orthogonality. We refer to Ozolinš et al. 22 and Lai et al. 34 for a detailed discussion, and present the algorithm in the context of this work in the SI for the sake of completeness.

Ground-state DFT calculations
All calculations are performed using the norm-conserving Troullier-Martin pseudopotentials in Kleinmann-Bylander form 35, 36 , and a local density approximation (LDA) for the exchange-correlation functional 37, 38 . The numerical parameters comprising the Tucker decomposition rank used inṼ eff and the truncation tolerance used in the Hamiltonian matrix elements are chosen such that the numerical errors are lesser than the Tucker tensor basis discretization error and the desired chemical accuracy. In the ChFSI method employed to solve the Kohn-Sham equations, we use a Chebyshev filter constructed using polynomial degree of 10-20 for the various materials systems reported in this study. The n-stage Anderson mixing scheme 39 is employed in the SCF iteration. The performance benchmarks are obtained on compute nodes comprising of 68-core Intel Xeon Phi Processor 7250 and 96GB memory per node.

S.1 Effective potential computation with tensor technique
Here we elaborate the various aspects of utilizing the low-rank tensor decomposition to evaluate and represent the various components of the Kohn-Sham effective potential. We express the effective potential V eff (x) as V eff (ρ; R nu ) to emphasize that the effective potential is a functional of the electron density ρ and is parametrized by the coordinates of nuclei R nu = {R 1 , R 2 , . . . , R N a }. The effective potential is then re-written as

6/20
where V loc eff denotes the local part of the Kohn-Sham Hamiltonian and V nl ext denotes the non-local part.
The local part includes the Hartree potential V H , the exchange-correlational functional potential V XC , and the local part of the external pseudopotential V loc ext .

S.1.1 Hartree Potential
The Hartree potential V H is computed following the tensor-structure approach presented in 14,40 . We recall that the Hartree potential is given by the convolution integral We first compute a low-rank Tucker decomposition of the electron density as In the above, R is a composite index denoting the decomposition rank (R 1 , R 2 , R 3 ) chosen to perform the Tucker decomposition. The kernel 1 |x−x | in Eq. SI3 is approximated by a series of Gaussian functions to take advantage of the tensor-structured nature as where w k and α k are coefficients and K is the number of terms used to expand the kernel. We refer to the literature 40, 41 for the derivation and the algorithm. The pre-computed coefficients can be found on the webpage 42 . Substituting Eq. SI4 and Eq. SI5 into Eq. SI3 , the Hartree potential can be evaluated using a separable form The local part of the effective potential is then computed by summing the Hartree potential, exchange-correlational functional and the local part of the external potential. We next compute the Tucker decomposition of the local part of the effective potential to exploit the tensor structure in the computation of the Hamiltonian matrix elements in the Tucker tensor basis (cf. S.4). The Tucker decomposed effective potential is thus represented as

S.1.2 Non-local projector of pseudopotential
In this work, we use the norm-conserving Troullier Martin pseudopotential in Kleinmann-Bylander form. The pseudopotential operator V ext comprises of a local part V loc ext and a non-local part V nl ext . The action of the pseudopotential operator on a Kohn-Sham orbital in real-space is given by where V loc,J ext (x − R J ) is the corresponding local potential for the J-th atom, and R J is the coordinates of the J-th atom. The action of the non-local operator in real-space is given by Therein, C J lm is the non-local projector, V J l (x) is the pseudopotential component of the J-th atom corresponding to the l azimuthal quantum number, and ϕ J lm (x) is the single atom pseudo-wavefunction of the J-th atom corresponding to the l azimuthal quantum number and m magnetic quantum number. In order to efficiently compute the action of non-local projector on the Kohn-Sham wavefunction, we compute the Tucker decomposition of V J l (x) and ϕ J lm (x) and denote them as The non-local part of the external potential computed with the Tucker decomposed quantities is denoted asṼ nl

S.2 Minimization problem for separable Hamiltonian quantities computation
In the Lagrangian of the minimization problem, we use the Tucker decomposed effective potential for efficient evaluation of the ensuing integrals. The Lagrangian, accounting for the normality constraint with the Lagrange multiplier λ, is given by Taking the variation of the Lagrangian Eq. SI14 with respect to ψ , we arrive at the simultaneous 1-D eigenvalue problems

8/20
We note that and α k = −(λ + a k ) in Eq. 3 of the main text. Here, we write out the 1-D quantities in Eq. SI15 and refer to Motamarri et al. 18 for details: The minimization problem can thus be written into a set of simultaneous 1-D eigenvalue problem where each eigenvalue problem is parametrized by the solution of the other two directions. The simultaneous eigenvalue problem can be solved using a self-consistent iteration procedure.

S.3 SOC Algorithm
The splitting orthogonality constraint algorithm (SOC) 22, 34 can be used for finding a set of localized functions which closely approximate the eigenspace, yet preserving the orthogonality of the localized functions. In this work, the SOC algorithm is used to construct localized functions that closely approximate the eigensubspace of the separable approximation of the Kohn-Sham Hamiltonian. Since the 1-D separable Hamiltonian is computed using a finite-element discretization, which in turn results in a generalized eigenvalue problem, the eigenfunctions are M-orthogonal (M denoting the overlap matrix). We hereby summarize the SOC algorithm for the M-orthogonality constraint variance. Consider the generalized 1-D eigenvalue problem along direction k given by H k Ψ k = M k Ψ k Λ k , where Ψ k ∈ R n×N k with n denoting the dimension of the basis and N k denoting the number of lowest eigenstates of interest, and M k is the overlap matrix of the k-th dimension. Using Cholesky factorization applied on M k so that M k = L T L, the orthogonality constraint is given by (LΨ k ) T (LΨ k ) = I. Thus, the original orthogonality constraint for Ψ k is thus replaced by LΨ k . The SOC algorithm for a generalized eigenvalue problem is given by In the Algorithm 1, η and κ are the penalty factors for each constraint, tol is the stopping criteria for the error measure e. Ψ L k is the computed localized 1-D functions. The solution for the three sub-problems 1-3 are where U, V, S are the left singular vectors, the right singular vectors, and the singular values for

S.4 Computation of projected Hamiltonian
Using the effective potential decomposed into the Tucker tensor format, the computation of the projected Hamiltonian is elaborated here for clarity. Substituting the decomposed quantities Eq. SI7, Eq. SI13 in Eq. 6 in the main text, the entries of the Hamiltonian matrix in the localized Tucker tensor basis are given bỹ (SI20) has a tensor structure, and thus each term in Eq. SI20 can be computed using 1-D integrals as follows: where For the non-local part of the effective potential, we first consider the expression where ν J lm can again be efficiently computed with Tucker decomposition and exploiting the tensor structure.

S.5 The localized 1-D functions for Al 147
We computed the additive separable approximation of the Kohn-Sham Hamiltonian H k for the Al 147 nano-particle with icosahedral symmetry, and, using this, computed the lowest 70 eigenstates of H k . We subsequently use the L 1 localization approach to compute the localized functions that are a close approximation to the eigensubspace. Figure SI1 shows the lowest 70 eigenfunctions of H 1 (top) and the corresponding 1-D localized functions (bottom). The locality pattern shows that while the eigenfunctions are global, the L 1 localized functions are fairly localized in space. As noted in the main text, the locality is the key to constructing a sparse Kohn-Sham Hamiltonian and the resulting computational efficiency.

Figure 1.
Overview of the tensor-structured algorithm for Kohn-Sham DFT using L 1 localized functions. The tensor-structured algorithm seeks to construct a systematically convergent reduced-order tensor-structured basis for efficiently solving the Kohn-Sham equations. To this end, an additive separable approximation to the Kohn-Sham Hamiltonian is constructed, whose eigenbasis presents a suitable reduced order basis, given by T I = ψ 1,i 1 ψ 2,i 2 ψ 3,i 3 , where I is a composite index I = (i 1 , i 2 , i 3 ). However, the discrete Kohn-Sham Hamiltonian in this basis is dense due to the global nature of the 1-D functions ψ k,i k , k = 1, 2, 3 (lower-left). L 1 localization is applied to alleviate this bottleneck, where localized tensor-structured basis functions are constructed such that the subspace spanned by this localized basis is a close approximation to the eigensubspace of the separable Hamiltonian. Denoting the localized 1-D functions by ψ L k,i k , the 3-D localized basis functions are given by T L I = ψ L 1,i 1 ψ L 2,i 2 ψ L 3,i 3 . The 3-D localized basis functions, being compactly supported, yields a sparse Hamiltonian (lower-right). The resulting sparse Hamiltonian, in conjunction with the slow growth of the Tucker rank with system-size to accurately represent the electronic structure, has provided sub-quadratic scaling with system-size for both insulating and metallic systems spanning over many thousands of atoms. 14/20